Enabling Cross-lingual Information Retrieval for African Languages

Ogundepo, Odunayo

Files

Ogundepo_Odunayo.pdf (661.36 KB)

Date

2023-04-28

Authors

Ogundepo, Odunayo

Publisher

University of Waterloo

Abstract

Language diversity in NLP is critical to building tools that serve a wide range of users. However, resources for building such tools are limited for many languages, particularly those spoken in Africa. For search, most existing datasets feature few to no African languages, which directly limits researchers’ ability to build and improve information access capabilities in those languages. Motivated by this, we created AfriCLIRMatrix, a test collection for cross-lingual information retrieval research in 15 diverse African languages, built automatically from Wikipedia. The dataset comprises 6 million queries in English and 23 million relevance judgments automatically extracted from Wikipedia inter-language links. We extract 13,050 test queries with relevance judgments across 15 languages, covering a significantly broader range of African languages than other existing information retrieval test collections. In addition to providing a much-needed resource for researchers, we release BM25, dense retrieval, and sparse-dense hybrid baselines as a starting point for the development of future systems. We hope that these efforts will stimulate further research in information retrieval for African languages and lead to more effective tools for users.
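
To make the baseline families named in the abstract concrete, the following is a minimal illustrative sketch of BM25 scoring and of one common way to build a sparse-dense hybrid (linear interpolation of min-max normalized scores). It is not the thesis code. The function names, the toy tokens, the parameter values `k1`, `b`, and `alpha`, and the hand-written dense scores are all assumptions made for illustration; in a real system the dense scores would come from a neural encoder.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=0.9, b=0.4):
    """Score each tokenized document against a tokenized query with BM25.

    k1 and b are illustrative values, not settings taken from the thesis.
    """
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents that contain each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def min_max(values):
    """Rescale scores to [0, 1] so sparse and dense scores are comparable."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0] * len(values)
    return [(v - lo) / (hi - lo) for v in values]


def hybrid_scores(sparse, dense, alpha=0.5):
    """Linearly interpolate normalized sparse and dense scores."""
    s, d = min_max(sparse), min_max(dense)
    return [alpha * x + (1 - alpha) * y for x, y in zip(s, d)]


# Toy data (hypothetical): an English query and three tiny "documents".
query = ["lagos", "nigeria"]
docs = [["lagos", "nigeria", "city"], ["nairobi", "kenya"], ["lagos", "port"]]

sparse = bm25_scores(query, docs)
dense = [0.8, 0.1, 0.6]  # stand-in similarities; a real encoder would supply these
fused = hybrid_scores(sparse, dense)
ranking = sorted(range(len(docs)), key=lambda i: fused[i], reverse=True)
```

In this toy setup the first document matches both query terms and also has the highest stand-in dense score, so it ranks first under the sparse, dense, and fused scores alike. In a cross-lingual setting, lexical matching of this kind can only succeed on tokens shared across languages, such as named entities, which is one motivation for combining it with dense retrieval.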

Keywords

Information Retrieval, African Languages, NLP, Natural Language Processing

URI

http://hdl.handle.net/10012/19361

Collections

Theses
Computer Science
