dc.description.abstract | Thanks to increases in computing power and the growing availability of large datasets, neural networks have achieved state-of-the-art results in many natural language processing (NLP) and computer vision (CV) tasks. These models require a large number of training examples, balanced between classes, yet in many application areas the available training sets are small, imbalanced, or both. To address this, data augmentation has become standard practice in CV. This research is motivated by the observation that, relative to CV, data augmentation is underused and understudied in NLP. Three data augmentation methods are implemented and tested: synonym replacement, backtranslation, and contextual augmentation. Tests are conducted with two models: a Recurrent Neural Network (RNN) and Bidirectional Encoder Representations from Transformers (BERT). To develop learning curves and to study the ability of the augmentation methods to rebalance datasets, each of three binary classification datasets is made artificially small and artificially imbalanced. The results show that these augmentation methods can offer accuracy improvements of over 1%, even to models with baseline accuracies as high as 92%. On the two largest datasets, the accuracy of BERT is usually improved by either synonym replacement or backtranslation, while the accuracy of the RNN is usually improved by all three augmentation techniques. The augmentation techniques tend to yield the largest accuracy gains when the datasets are smallest or most imbalanced, and the performance benefits appear to diminish toward zero as the datasets grow larger. The optimal augmentation distance, the extent to which augmented training examples tend to deviate from their original form, decreases as datasets become more balanced. The results show that data augmentation is a powerful method of improving performance when training on datasets with fewer than 10,000 examples. The accuracy gains that augmentation offers are reduced by recent advances in transfer learning, but they are certainly not eliminated. | en |