Efficient Inference of Transformers in Natural Language Processing: Early Exiting and Beyond
Large-scale pre-trained transformer models such as BERT have become ubiquitous in Natural Language Processing (NLP) research and applications. They bring significant improvements to both academic benchmarks and industry applications: the average score on the General Language Understanding Evaluation benchmark (GLUE) has increased from 74 to 90+, and commercial search engines such as Google and Microsoft Bing now apply BERT-like models to search. Despite their power, these increasingly large transformer-based models are notorious for having billions of parameters and being slow in both training and inference, which makes deployment difficult when inference time and resources are limited. Model efficiency has therefore become an increasingly important and urgent problem in the transformer era. In this thesis, we propose new methods for efficient NLP models. We focus specifically on inference efficiency: pre-trained models are almost always publicly available, and fine-tuning is performed on relatively small datasets without strict time constraints; inference, by contrast, must be performed repeatedly and typically in a real-time setting. First, we propose the early exiting idea for transformers. Because the transformer model has multiple layers with identical structures, we reduce the number of layers used for inference through dynamic early exiting. During inference, if an intermediate transformer layer produces a high-confidence prediction, we exit directly from this layer and use its output as the final one. We apply early exiting to sequence classification tasks and show that it greatly improves inference efficiency.
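The early exiting procedure described above can be illustrated with a minimal sketch. It is not the thesis implementation: the function names, the per-layer classifier heads, the fixed threshold, and the use of maximum softmax probability as the confidence measure are all assumptions made for illustration, and the layers are stand-in callables rather than real transformer layers.

```python
import math
from typing import Callable, List, Sequence, Tuple

Vector = List[float]


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def early_exit_predict(
    hidden: Vector,
    layers: Sequence[Callable[[Vector], Vector]],
    classifiers: Sequence[Callable[[Vector], Vector]],
    threshold: float = 0.9,
) -> Tuple[int, int]:
    """Run layers one by one and stop once a layer's prediction is confident.

    Hypothetical sketch: confidence is the maximum softmax probability
    (an assumption; other measures are possible). Returns the predicted
    class and the number of layers actually executed.
    """
    if not layers or len(layers) != len(classifiers):
        raise ValueError("need one classifier per layer and at least one layer")

    probs: List[float] = []
    depth = 0
    for depth, (layer, classifier) in enumerate(zip(layers, classifiers), start=1):
        hidden = layer(hidden)                # one stand-in "transformer layer"
        probs = softmax(classifier(hidden))   # this layer's output distribution
        if max(probs) >= threshold:           # confident enough: exit early
            break
    return probs.index(max(probs)), depth


# Toy usage: each stand-in layer doubles the logits, so confidence grows with depth.
toy_layers = [lambda h: [2.0 * x for x in h]] * 4
toy_heads = [lambda h: h] * 4
prediction, layers_used = early_exit_predict([1.0, 0.0], toy_layers, toy_heads)
# prediction == 0 and layers_used == 2: the second layer is the first to reach 0.9
```

In this sketch the threshold controls the trade-off: a higher value runs more layers before exiting, while a lower value exits sooner at the risk of accepting a less reliable intermediate prediction.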
We then explore a few extensions to the early exiting idea: (1) early exiting for low-resource datasets, where straightforward fine-tuning methods fail to train the model to its full potential and we propose a method to better balance all layers of the model; (2) early exiting for regression datasets, where the output is no longer a distribution from which we can directly estimate confidence and we design a learning-to-exit module to explicitly learn confidence estimation; (3) early exiting for document reranking, where the two classes that the model tries to distinguish are highly asymmetric and we design an asymmetric early exiting method to better handle this task. We also extend early exiting in another direction: selective prediction. In this setting, if we have low confidence in the final prediction, we abstain from making a prediction at all. We propose better ways to estimate confidence and also discuss a few applications of selective prediction. Finally, we discuss the combination of multiple efficiency methods, including early exiting itself and other popular methods such as distillation, pruning, and quantization. We propose a conceptual framework that treats each efficiency method as an operator. We conduct experiments that show interesting properties of these operators when they are combined, which provide useful guidelines for designing and evaluating combinations of efficiency methods. The thesis presents a series of modeling and experimental contributions for efficient transformer models. We not only substantially reduce the inference time for many NLP and IR applications, but also provide insights for understanding the efficiency problem from a novel perspective.
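The selective prediction setting mentioned above can be sketched as follows. This is only a baseline illustration, not the confidence estimation methods proposed in the thesis: the function names, the threshold value, and the use of maximum class probability as confidence are assumptions.

```python
from typing import List, Optional, Sequence


def selective_predict(probs: Sequence[float], threshold: float = 0.8) -> Optional[int]:
    """Return the predicted class, or None to abstain when confidence is low.

    Hypothetical baseline: confidence is the maximum class probability.
    """
    confidence = max(probs)
    if confidence < threshold:
        return None  # abstain instead of risking a wrong prediction
    return list(probs).index(confidence)


def coverage(predictions: Sequence[Optional[int]]) -> float:
    """Fraction of inputs on which the model chose to predict."""
    if not predictions:
        return 0.0
    answered = sum(1 for p in predictions if p is not None)
    return answered / len(predictions)


# Toy usage with made-up output distributions.
outputs: List[List[float]] = [[0.95, 0.05], [0.55, 0.45], [0.10, 0.90], [0.60, 0.40]]
decisions = [selective_predict(p) for p in outputs]
# decisions == [0, None, 1, None], so coverage(decisions) == 0.5
```

Raising the threshold lowers coverage but should leave only the predictions the model is most confident about, which is the trade-off a selective prediction system has to tune.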
Cite this version of the work
Ji Xin (2023). Efficient Inference of Transformers in Natural Language Processing: Early Exiting and Beyond. UWSpace. http://hdl.handle.net/10012/19111