Data Augmentation For Text Classification Tasks

Tamming, Daniel

dc.contributor.advisor	van Beek, Peter
dc.contributor.author	Tamming, Daniel
dc.date.accessioned	2020-08-12T17:14:37Z
dc.date.available	2020-08-12T17:14:37Z
dc.date.issued	2020-08-12
dc.date.submitted	2020-08-04
dc.description.abstract	Thanks to increases in computing power and the growing availability of large datasets, neural networks have achieved state-of-the-art results in many natural language processing (NLP) and computer vision (CV) tasks. These models require a large number of training examples that are balanced between classes, but in many application areas they rely on training sets that are small, imbalanced, or both. To address this, data augmentation has become standard practice in CV. This research is motivated by the observation that, relative to CV, data augmentation is underused and understudied in NLP. Three methods of data augmentation are implemented and tested: synonym replacement, backtranslation, and contextual augmentation. Tests are conducted with two models: a Recurrent Neural Network (RNN) and Bidirectional Encoder Representations from Transformers (BERT). To develop learning curves and study the ability of augmentation methods to rebalance datasets, each of three binary classification datasets is made artificially small and artificially imbalanced. The results show that these augmentation methods can offer accuracy improvements of over 1% to models with a baseline accuracy as high as 92%. On the two largest datasets, the accuracy of BERT is usually improved by either synonym replacement or backtranslation, while the accuracy of the RNN is usually improved by all three augmentation techniques. The augmentation techniques tend to yield the largest accuracy boost when the datasets are smallest or most imbalanced; the performance benefits appear to converge to 0% as the dataset becomes larger. The optimal augmentation distance, the extent to which augmented training examples tend to deviate from their original form, decreases as datasets become more balanced. The results show that data augmentation is a powerful method of improving performance when training on datasets with fewer than 10,000 training examples.
The accuracy gains that these augmentation methods offer are reduced by recent advances in transfer learning schemes, but they are certainly not eliminated.	en
dc.identifier.uri	http://hdl.handle.net/10012/16113
dc.language.iso	en	en
dc.pending	false
dc.publisher	University of Waterloo	en
dc.title	Data Augmentation For Text Classification Tasks	en
dc.type	Master Thesis	en
uws-etd.degree	Master of Mathematics	en
uws-etd.degree.department	David R. Cheriton School of Computer Science	en
uws-etd.degree.discipline	Computer Science	en
uws-etd.degree.grantor	University of Waterloo	en
uws.contributor.advisor	van Beek, Peter
uws.contributor.affiliation1	Faculty of Mathematics	en
uws.peerReviewStatus	Unreviewed	en
uws.published.city	Waterloo	en
uws.published.country	Canada	en
uws.published.province	Ontario	en
uws.scholarLevel	Graduate	en
uws.typeOfResource	Text	en

Files

Original bundle

Name: Tamming_Daniel.pdf
Size: 6.14 MB
Format: Adobe Portable Document Format

License bundle

Name: license.txt
Size: 6.4 KB
Format: Item-specific license agreed upon to submission

Collections

Theses
Computer Science