SCALING PRE-TRAINING DATA & LANGUAGE MODELS FOR AFRICAN LANGUAGES

dc.contributor.advisor: LIN, JIMMY
dc.contributor.author: Oladipo, Akintunde
dc.date.accessioned: 2024-08-23T15:35:52Z
dc.date.available: 2024-08-23T15:35:52Z
dc.date.issued: 2024-08-23
dc.date.submitted: 2024-08-21
dc.description.abstract: Recent advancements in language models, particularly for high-resource languages, have not been paralleled in low-resource languages spoken across Africa. This thesis addresses this gap by scaling pre-training data and developing improved language models for African languages. We introduce Wura, a high-quality, document-level pre-training dataset encompassing 16 African languages along with four high-resource languages commonly spoken on the continent: Arabic, English, French, and Portuguese. Leveraging Wura, we pre-train new versions of the AfriBERTa (encoder-only) and AfriTeVa (encoder-decoder) model families. These new models demonstrate superior performance across a variety of natural language understanding and generation tasks compared to existing baselines. Notably, AfriTeVa V2 Large (1B) stands as the largest sequence-to-sequence model pre-trained for African languages to date. Our methodology includes a meticulous three-stage curation process for Wura: auditing and filtering existing web crawls, initiating new web crawls, and integrating existing language resources. The experimental setup and evaluation encompass tasks like text classification, information retrieval, translation, summarization, and cross-lingual question answering. Our new models outperform their predecessors and other established models, even those with significantly more parameters, highlighting the efficacy of high-quality pre-training data. Furthermore, we study the generalization of our models to languages not deliberately included in their pre-training data.
dc.identifier.uri: https://hdl.handle.net/10012/20872
dc.language.iso: en
dc.pending: false
dc.publisher: University of Waterloo
dc.relation.uri: https://huggingface.co/datasets/castorini/wura
dc.relation.uri: https://huggingface.co/castorini/afriteva_v2_base
dc.relation.uri: https://huggingface.co/castorini/afriteva_v2_large
dc.subject: multilingual
dc.subject: natural language processing
dc.subject: pretrained transformers
dc.title: SCALING PRE-TRAINING DATA & LANGUAGE MODELS FOR AFRICAN LANGUAGES
dc.type: Master Thesis
uws-etd.degree: Master of Mathematics
uws-etd.degree.department: David R. Cheriton School of Computer Science
uws-etd.degree.discipline: Computer Science
uws-etd.degree.grantor: University of Waterloo
uws-etd.embargo.terms: 0
uws.contributor.advisor: LIN, JIMMY
uws.contributor.affiliation1: Faculty of Mathematics
uws.peerReviewStatus: Unreviewed
uws.published.city: Waterloo
uws.published.country: Canada
uws.published.province: Ontario
uws.scholarLevel: Graduate
uws.typeOfResource: Text
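
The dataset and model checkpoints listed under dc.relation.uri above are hosted on the Hugging Face Hub. The sketch below is not part of this record or the thesis; it only illustrates how those artifacts might be loaded with the standard datasets and transformers libraries. The language configuration name "yor" and the sample sentence are illustrative assumptions; consult the dataset and model cards for the exact subset names and recommended usage.

```python
# Minimal sketch (assumptions noted): load one Wura language subset and an
# AfriTeVa V2 checkpoint from the Hugging Face Hub.
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# One language subset of the Wura document-level corpus. The config name "yor"
# is an assumption; newer datasets versions may also require
# trust_remote_code=True if the dataset ships a loading script.
wura = load_dataset("castorini/wura", "yor", split="train")
print(wura[0])

# AfriTeVa V2 Base, a T5-style encoder-decoder pre-trained on Wura.
tokenizer = AutoTokenizer.from_pretrained("castorini/afriteva_v2_base")
model = AutoModelForSeq2SeqLM.from_pretrained("castorini/afriteva_v2_base")

# Smoke test only: tokenize a sentence and run generation. This is a pre-trained
# (not instruction-tuned) checkpoint, so the output is just a sanity check that
# the model loads and runs, not a demonstration of task performance.
inputs = tokenizer(
    "Ede Yoruba je okan ninu awon ede ti won n so ni Naijiria.",
    return_tensors="pt",
)
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```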

Files

Original bundle

Name: Oladipo_Akintunde.pdf
Size: 405.31 KB
Format: Adobe Portable Document Format

License bundle

Name: license.txt
Size: 6.4 KB
Format: Item-specific license agreed upon to submission