Efficient Inference of Transformers in Natural Language Processing: Early Exiting and Beyond

dc.contributor.author: Xin, Ji
dc.date.accessioned: 2023-01-24T16:30:25Z
dc.date.available: 2023-01-24T16:30:25Z
dc.date.issued: 2023-01-24
dc.date.submitted: 2023-01-19
dc.description.abstract: Large-scale pre-trained transformer models such as BERT have become ubiquitous in Natural Language Processing (NLP) research and applications. They bring significant improvements to both academic benchmark tasks and industry applications: the average score on the General Language Understanding Evaluation (GLUE) benchmark has increased from 74 to over 90, and commercial search engines such as Google and Microsoft Bing apply BERT-like models to search. Despite their impressive capabilities, these increasingly large transformer-based models are notorious for having billions of parameters and being slow in both training and inference, which makes deployment difficult when inference time and resources are limited. Model efficiency has therefore become an important and urgent problem in the transformer era.

In this thesis, we propose methods for efficient NLP models, focusing specifically on inference efficiency: pre-trained models are almost always publicly available, and fine-tuning is performed on relatively small datasets without strict time constraints; inference, by contrast, must be performed repeatedly and typically in a real-time setting.

First, we propose the early exiting idea for transformers. Since a transformer model consists of multiple layers with identical structure, we reduce the number of layers used for inference through dynamic early exiting: during inference, if an intermediate transformer layer produces an output with high confidence, we exit directly from that layer and use its output as the final prediction. We apply early exiting to sequence classification tasks and show that it greatly improves inference efficiency.

We then explore several extensions of the early exiting idea: (1) early exiting for low-resource datasets, where straightforward fine-tuning fails to train the model to its full potential, and we propose a method to better balance all layers of the model; (2) early exiting for regression datasets, where the output is no longer a distribution from which confidence can be estimated directly, and we design a learning-to-exit module that explicitly learns confidence estimation; (3) early exiting for document reranking, where the two classes the model tries to distinguish are highly asymmetric, and we design an asymmetric early exiting method to better handle this task.

We also extend early exiting in another direction: selective prediction. In this setting, if we have low confidence in the final prediction, we abstain from making a prediction at all. We propose better methods for confidence estimation and discuss several applications of selective prediction.

Finally, we study the combination of multiple efficiency methods, including early exiting and other popular techniques such as distillation, pruning, and quantization. We propose a conceptual framework that treats each efficiency method as an operator, and we conduct experiments that reveal interesting properties of these operators when they are combined, providing useful guidelines for designing and evaluating combinations of multiple efficiency methods.

The thesis presents a series of modeling and experimental contributions toward efficient transformer models. We not only substantially reduce inference time for many NLP and IR applications, but also provide insights that frame the efficiency problem from a novel perspective.
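As a reading aid, the confidence-based early exiting described in the abstract can be sketched in a few lines of PyTorch. This is a minimal illustrative sketch, not code from the thesis: the names (early_exit_classify, layers, classifiers), the use of the [CLS] position, the batch size of 1, and the 0.9 threshold are all assumptions made for illustration.

import torch

@torch.no_grad()
def early_exit_classify(layers, classifiers, hidden_states, threshold=0.9):
    # Dynamic early exiting (sketch): run the encoder one layer at a time.
    # After each layer, a hypothetical per-layer classifier produces a label
    # distribution; if its confidence (maximum softmax probability) reaches
    # the threshold, inference stops at that layer.
    prediction, confidence = None, None
    for layer, classifier in zip(layers, classifiers):
        hidden_states = layer(hidden_states)        # one transformer layer
        logits = classifier(hidden_states[:, 0])    # classify from the first ([CLS]) position
        probs = torch.softmax(logits, dim=-1)
        confidence, prediction = probs.max(dim=-1)
        if confidence.item() >= threshold:          # assumes batch size 1
            return prediction, confidence           # confident enough: exit from this layer
    return prediction, confidence                   # otherwise fall back to the final layer

# Hypothetical usage, assuming a fine-tuned encoder whose layers and per-layer
# classifiers are exposed as lists:
# prediction, confidence = early_exit_classify(encoder_layers, layer_classifiers, embeddings)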
dc.identifier.uri: http://hdl.handle.net/10012/19111
dc.language.iso: en
dc.pending: false
dc.publisher: University of Waterloo
dc.subject: NLP
dc.subject: Transformers
dc.subject: efficiency
dc.title: Efficient Inference of Transformers in Natural Language Processing: Early Exiting and Beyond
dc.type: Doctoral Thesis
uws-etd.degree: Doctor of Philosophy
uws-etd.degree.department: David R. Cheriton School of Computer Science
uws-etd.degree.discipline: Computer Science
uws-etd.degree.grantor: University of Waterloo
uws-etd.embargo.terms: 0
uws.contributor.advisor: Lin, Jimmy
uws.contributor.advisor: Yu, Yaoliang
uws.contributor.affiliation1: Faculty of Mathematics
uws.peerReviewStatus: Unreviewed
uws.published.city: Waterloo
uws.published.country: Canada
uws.published.province: Ontario
uws.scholarLevel: Graduate
uws.typeOfResource: Text

Files

Original bundle

Name: Xin_Ji.pdf
Size: 2.23 MB
Format: Adobe Portable Document Format

License bundle

Name: license.txt
Size: 6.4 KB
Format: Item-specific license agreed to upon submission