INJONGO: A Multicultural Intent Detection and Slot-filling Dataset for 16 African Languages

Andiswa Bukula; Blessing Kudzaishe Sibanda; David Ifeoluwa Adelani; Dietrich Klakow; En-Shiun Annie Lee; Godson K. Kalipe; Hao Yu; Happy Buzaaba; Israel Abebe Azime; Jesujoba O. Alabi

arxiv: 2502.09814 · v1 · pith:DZXW6Y7Znew · submitted 2025-02-13 · 💻 cs.CL

Hao Yu, Jesujoba O. Alabi, Andiswa Bukula, Jian Yun Zhuang, En-Shiun Annie Lee, Tadesse Kebede Guge, Israel Abebe Azime, Happy Buzaaba, Blessing Kudzaishe Sibanda, Godson K. Kalipe, Jonathan Mukiibi, Salomon Kabongo Kabenamualu, Mmasibidi Setaka, Lolwethu Ndolela, Nkiruka Odu, Rooweither Mabuya, Shamsuddeen Hassan Muhammad, Salomey Osei, Sokhar Samb, Juliet W. Murage, Dietrich Klakow, David Ifeoluwa Adelani

classification 💻 cs.CL

keywords detection · intent · languages · performance · african · english · fine-tuning · language

Slot-filling and intent detection are well-established tasks in Conversational AI. However, current large-scale benchmarks for these tasks often exclude low-resource languages and rely on translations of English benchmarks, so they predominantly reflect Western-centric concepts. In this paper, we introduce Injongo, a multicultural, open-source benchmark dataset for 16 African languages, with utterances generated by native speakers across diverse domains, including banking, travel, home, and dining. Through extensive experiments, we benchmark fine-tuned multilingual transformer models and prompted large language models (LLMs), and show that African-cultural utterances give better cross-lingual transfer from English than Western-centric utterances. Experimental results reveal that current LLMs struggle with the slot-filling task, with GPT-4o achieving an average F1-score of 26. In contrast, intent detection performance is notably better, with an average accuracy of 70.6%, though it still falls behind the fine-tuning baselines. On English, by comparison, GPT-4o and the fine-tuning baselines perform similarly on intent detection, with an accuracy of approximately 81%. Our findings suggest that LLM performance still lags for many low-resource African languages, and that more work is needed to improve their downstream performance.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

The Register Gap: A Meaning Intelligence Framework for Nigerian Public Discourse
cs.CL · 2026-06 · unverdicted · novelty 5.0

The Meaning Intelligence Framework raises zero-shot register classification accuracy from 33.3% to 73.3% on a 30-item Nigerian discourse calibration set while showing that smaller models can outperform larger ones on ...