
dc.contributor.author: Bai, He
dc.date.accessioned: 2023-02-21 16:47:16 (GMT)
dc.date.available: 2023-02-21 16:47:16 (GMT)
dc.date.issued: 2023-02-21
dc.date.submitted: 2023-02-17
dc.identifier.uri: http://hdl.handle.net/10012/19175
dc.description.abstract: This thesis is about modeling language sequences to achieve lower perplexity and better generation, and to benefit downstream language tasks; specifically, it addresses the importance of natural language features, including segmentation, lexical, and alignment features. We present three new techniques that improve language sequence modeling with different language features.

1. Segment-Aware Language Modeling is a novel model architecture that leverages the text segmentation feature for text sequence modeling. It encodes richer positional information for language modeling by replacing the original token position encoding with a combined position encoding of paragraph, sentence, and token (see the sketch after this record). By applying our approach to Transformer-XL, we train a new language model, Segatron-XL, that achieves a 6.6-7.8% relative reduction in perplexity. Additionally, BERT pretrained with our method, SegaBERT, outperforms BERT on general language understanding, sentence representation learning, and machine reading comprehension tasks. Furthermore, our SegaBERT-large model outperforms RoBERTa-large on zero-shot STS tasks. These experimental results demonstrate that our proposed Segatron works both on language models with relative position embeddings and on pretrained language models with absolute position embeddings.

2. Hypernym-Instructed Language Modeling is a novel training method that leverages the lexical feature for rare word modeling. It maps words that share a common WordNet hypernym to the same class and trains large neural LMs by gradually annealing from class prediction to token prediction during training. Class-based prediction leads to an implicit context aggregation for similar words and thus can improve generalization for rare words. Empirically, this curriculum learning strategy consistently reduces perplexity over various large, highly performant, state-of-the-art Transformer-based models on two datasets, WikiText-103 and ArXiv. Our analysis shows that the performance improvement is achieved without sacrificing performance on rare words.

3. Alignment-Aware Acoustic and Text Modeling is a novel pretraining method that leverages both the segmentation and alignment features for text-speech sequence modeling. It reconstructs masked acoustic signals conditioned on the text input and the acoustic-text alignment during training. In this way, the pretrained model can generate high-quality reconstructed spectrograms, which can be applied directly to speech editing and new-speaker TTS. Experiments show that our A3T model outperforms SOTA models on speech editing and improves multi-speaker speech synthesis without an external speaker verification model.
dc.language.iso: en
dc.publisher: University of Waterloo
dc.subject: natural language processing
dc.subject: machine learning
dc.subject: language model
dc.title: Novel Methods for Natural Language Modeling and Pretraining
dc.type: Doctoral Thesis
dc.pending: false
uws-etd.degree.department: David R. Cheriton School of Computer Science
uws-etd.degree.discipline: Computer Science
uws-etd.degree.grantor: University of Waterloo
uws-etd.degree: Doctor of Philosophy
uws-etd.embargo.terms: 0
uws.contributor.advisor: Ming, Li
uws.contributor.affiliation1: Faculty of Mathematics
uws.published.city: Waterloo
uws.published.country: Canada
uws.published.province: Ontario
uws.typeOfResource: Text
uws.peerReviewStatus: Unreviewed
uws.scholarLevel: Graduate
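
The abstract describes the segment-aware position encoding only at a high level. The short Python/PyTorch sketch below illustrates the general idea of combining paragraph-, sentence-, and token-level position embeddings; the module name, embedding sizes, and the simple summation are illustrative assumptions and do not reproduce the thesis implementation.

# Minimal sketch of a segment-aware position encoding as described in the
# abstract. All names, dimensions, and the plain summation of the three
# embeddings are assumptions for illustration, not the Segatron code itself.
import torch
import torch.nn as nn


class SegmentAwarePositionEncoding(nn.Module):
    """Replaces a single token-position embedding with the sum of
    paragraph-, sentence-, and token-level position embeddings."""

    def __init__(self, hidden_size: int,
                 max_paragraphs: int = 64,
                 max_sentences: int = 128,
                 max_tokens: int = 512):
        super().__init__()
        self.paragraph_emb = nn.Embedding(max_paragraphs, hidden_size)
        self.sentence_emb = nn.Embedding(max_sentences, hidden_size)
        self.token_emb = nn.Embedding(max_tokens, hidden_size)

    def forward(self, paragraph_ids, sentence_ids, token_ids):
        # Each *_ids tensor has shape (batch, seq_len) and gives the position
        # of every token at the paragraph, sentence, and token level.
        return (self.paragraph_emb(paragraph_ids)
                + self.sentence_emb(sentence_ids)
                + self.token_emb(token_ids))


# Usage example with hypothetical indices for one 6-token, 2-sentence paragraph:
if __name__ == "__main__":
    enc = SegmentAwarePositionEncoding(hidden_size=16)
    paragraph_ids = torch.tensor([[0, 0, 0, 0, 0, 0]])
    sentence_ids = torch.tensor([[0, 0, 0, 1, 1, 1]])
    token_ids = torch.tensor([[0, 1, 2, 3, 4, 5]])
    print(enc(paragraph_ids, sentence_ids, token_ids).shape)  # torch.Size([1, 6, 16])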



