AfriBERTa: Towards Viable Multilingual Language Models for Low-resource Languages
Abstract
There are over 7,000 languages spoken on earth, but many of them suffer from a dearth of natural language processing (NLP) tools. Multilingual pretrained language models have been introduced to help alleviate this problem. However, even the largest pretrained multilingual models cover only on the order of a hundred languages, a small fraction of the world's spoken languages. While these models have displayed impressive performance on many languages, including some they were not pretrained on, much ground remains to be covered.
Many languages are left out because pretrained language models are assumed to require large amounts of training data, which these languages lack. Furthermore, a major motivation behind these models is that such lower-resource languages benefit from joint training with higher-resource languages. In this thesis, we challenge both of these assumptions and present the first attempt at training a multilingual language model on only low-resource languages. We show that it is possible to train competitive multilingual language models on less than one gigabyte of text data drawn from a selection of African languages.
Our model, named AfriBERTa, covers 11 African languages, including the first language model for 4 of them. We evaluate this model on named entity recognition and text classification tasks spanning 10 languages. Our evaluation results show that our model is highly competitive with larger multilingual models (multilingual BERT and XLM-RoBERTa) on several languages. The results suggest that our "small data" approach based on similar languages may sometimes work better than joint training on large datasets with high-resource languages. Furthermore, we present a comprehensive discussion of the implications of our findings.
Cite this version of the work
Kelechi Ogueji (2022). AfriBERTa: Towards Viable Multilingual Language Models for Low-resource Languages. UWSpace. http://hdl.handle.net/10012/18662
Related items
- Syntactic Complexities of Six Classes of Star-Free Languages
  Brzozowski, Janusz; Li, Baiyu; Liu, David (Otto-von-Guericke-Universität Magdeburg, 2012). The syntactic complexity of a regular language is the cardinality of its syntactic semigroup. The syntactic complexity of a subclass of regular languages is the maximal syntactic complexity of languages in that subclass, ...
- Online Digital Game-Based Language Learning Environments: Opportunities for Second Language Development
  Scholz, Kyle (University of Waterloo, 2016-01-07). This dissertation project is an analysis of the language learning processes of 14 learners playing in and interacting with the massive multiplayer online role-playing game (MMORPG) World of Warcraft (WoW) in German in order ...
- The High German of Russian Mennonites in Ontario
  Penner, Nikolai (University of Waterloo, 2010-01-20). The main focus of this study is the High German language spoken by Russian Mennonites, one of the many groups of German-speaking immigrants in Canada. Although the primary language of most Russian Mennonites is a Low German ...