Enabling Cross-lingual Information Retrieval for African Languages

Date

2023-04-28

Authors

Ogundepo, Odunayo

Advisor

Lin, Jimmy

Publisher

University of Waterloo

Abstract

Language diversity in NLP is critical to developing tools for a wide range of users. However, resources for building such tools are limited for many languages, particularly those spoken in Africa. For search, most existing datasets feature few to no African languages, which directly limits researchers’ ability to build and improve information access capabilities in those languages. Motivated by this, we created AfriCLIRMatrix, a test collection built automatically from Wikipedia for cross-lingual information retrieval research in 15 diverse African languages. The dataset comprises 6 million queries in English and 23 million relevance judgments automatically extracted from Wikipedia inter-language links. We extract 13,050 test queries with relevance judgments across the 15 languages, covering a significantly broader range of African languages than other existing information retrieval test collections. In addition to providing a much-needed resource for researchers, we release BM25, dense retrieval, and sparse-dense hybrid baselines to establish a starting point for the development of future systems. We hope that our efforts will stimulate further research in information retrieval for African languages and lead to more effective tools for users.

Keywords

Information Retrieval, African Languages, NLP, Natural Language Processing
