Swivuriso: The South African Next Voices Multilingual Speech Dataset

Andinda Bakainga; Andiswa Bukula; Seani Rananga; Chijioke Okorie; Dale Dunbar; Francois Smit; Graham Morrissey; Idris Abdulmumin; Kayode Olaleye; Kesego Mokgosi

arxiv: 2512.02201 · v3 · pith:I77BEQYTnew · submitted 2025-12-01 · 💻 cs.CL


Vukosi Marivate, Kayode Olaleye, Sitwala Mundia, Andinda Bakainga, Unarine Netshifhefhe, Mahmooda Milanzie, Tsholofelo Hope Mogale, Thapelo Sindane, Zainab Abdulrasaq, Kesego Mokgosi, Chijioke Okorie, Nia Zion Van Wyk, Graham Morrissey, Dale Dunbar, Francois Smit, Tsosheletso Chidi, Rooweither Mabuya, Andiswa Bukula, Respect Mlambo, Tebogo Macucwa, Idris Abdulmumin, and Seani Rananga



keywords: african, dataset, speech, swivuriso, data, datasets, multilingual, next



This paper introduces Swivuriso, a 3000-hour multilingual speech dataset developed as part of the African Next Voices project to support the development and benchmarking of automatic speech recognition (ASR) technologies in seven South African languages. Covering agriculture, healthcare, and general-domain topics, Swivuriso addresses significant gaps in existing ASR datasets. We describe the design principles, ethical considerations, and data collection procedures that guided the dataset's creation. We present baseline results from training and fine-tuning ASR models on this data and compare them with results on other ASR datasets for the languages concerned.
