AfriVox-v2: A Domain-Verticalized Benchmark for In-the-Wild African Speech Recognition

Busayo Awobade; Gabrial Zencha Ashungafac; Tobi Olatunji

arxiv: 2605.03590 · v1 · submitted 2026-05-05 · 💻 cs.CL · cs.SD

AfriVox-v2: A Domain-Verticalized Benchmark for In-the-Wild African Speech Recognition

Busayo Awobade , Gabrial Zencha Ashungafac , Tobi Olatunji This is my paper

Pith reviewed 2026-05-07 16:33 UTC · model grok-4.3

classification 💻 cs.CL cs.SD

keywords African languagesspeech recognitionbenchmark datasetin-the-wild audiodomain-specific evaluationlow-resource ASRmodel generalization

0 comments

The pith

AfriVox-v2 reveals the generalization gap of speech models in noisy, domain-specific African contexts.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper creates AfriVox-v2 as a new benchmark to evaluate speech recognition in realistic African settings using unscripted audio and detailed breakdowns by ten industry sectors. A sympathetic reader would care because current models are trained mostly on high-resource languages and may not work well for everyday use in Africa where noise and specialized terms are common. The evaluation includes tests on numbers and named entities to pinpoint specific weaknesses. Benchmarking recent models shows where they fall short and gives developers a tool to build better systems.

Core claim

AfriVox-v2 introduces in-the-wild unscripted audio for African languages along with strict domain verticalization across ten sectors such as government, finance, health, and agriculture, plus targeted evaluations on numbers and named entities, to expose the true performance gaps of modern speech models including Sahara-v2 and Gemini 3 Flash in specialized noisy conditions.

What carries the argument

The AfriVox-v2 benchmark, which verticalizes evaluation into specific domains and uses real-world unscripted recordings to test model robustness.

If this is right

Speech models require additional adaptation to handle African accents and environmental noise effectively.
Developers gain a practical way to measure progress toward localized voice AI applications.
Targeted improvements in recognizing numbers and entities can address common failure points in sectors like finance and health.
Overall accuracy in African deployments can be better predicted and improved using this structured testing approach.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar benchmarks could be developed for other underrepresented regions to promote global equity in AI.
Integrating this data into training sets might close the observed gaps faster than general scaling.
Real-world testing like this could become standard for validating AI tools before deployment in diverse environments.

Load-bearing premise

The unscripted audio collected and the domain classifications accurately reflect the challenges of actual in-the-wild African speech without missing key variations or introducing bias.

What would settle it

Observing that the tested models achieve comparable accuracy to their performance on English benchmarks, or finding that the audio samples do not match typical field recordings in Africa, would undermine the claim of a significant generalization gap.

read the original abstract

Recent large language models (LLMs) show strong speech recognition and translation capabilities for high-resource languages. However, African languages remain dramatically underrepresented in benchmarks, limiting their practical use in low-resource settings. While early benchmarks tested African languages and accents, they lacked exhaustive real-world noise and granular domain evaluations. We present AfriVox-v2, a comprehensive benchmark designed to test speech models under realistic African deployment conditions. AfriVox-v2 introduces "in the wild" unscripted audio for all supported languages. We also introduce strict domain verticalization, evaluating model accuracy across ten sectors including government, finance, health, and agriculture and conducting targeted tests on numbers and named entities. Finally, we benchmark a new generation of speech models, including Sahara-v2, Gemini 3 Flash, and the Omnilingual CTC models. Our results expose the true generalization gap of modern speech models in specialized, noisy African contexts and provide a reliable blueprint for developers building localized voice AI.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

AfriVox-v2 adds a new benchmark with unscripted in-the-wild audio and ten-domain splits for African languages, but its claim to expose the true generalization gap rests on unvalidated assumptions about data representativeness.

read the letter

The main point is that AfriVox-v2 introduces unscripted in-the-wild audio plus strict domain verticalization across ten sectors for African speech recognition, along with targeted number and entity tests. This setup is new relative to earlier benchmarks that lacked those elements, and the paper applies it to models including Sahara-v2, Gemini 3 Flash, and Omnilingual CTC variants. The results give concrete numbers on performance drops in noisy, specialized settings, which is useful for anyone trying to build voice tools that actually work in places like health or agriculture in Africa. That practical focus is the paper's clearest strength. The soft spot is the lack of detail on how the audio was collected and labeled. There is no clear sampling frame, speaker demographics, geographic spread, or independent check that the domains and noise types match real deployment distributions. Without those, the measured gaps could reflect collection choices rather than a faithful picture of model failures across African contexts. The abstract's language about exposing the true gap therefore feels ahead of the evidence shown. This paper is for speech researchers and developers working on low-resource ASR who need domain-aware evaluation sets. A reader building localized systems would get value from the splits and model comparisons, though they would probably want to inspect the raw data before relying on the headline conclusions. I would send it to peer review. The benchmark idea addresses a genuine gap and the field benefits from more such resources, even if referees will need to press on the data validation methods to tighten the claims.

Referee Report

2 major / 1 minor

Summary. The paper introduces AfriVox-v2, a benchmark for speech recognition in African languages that features unscripted 'in-the-wild' audio, strict domain verticalization across ten sectors (government, finance, health, agriculture and others), and targeted evaluations on numbers and named entities. It reports results from benchmarking models including Sahara-v2, Gemini 3 Flash, and Omnilingual CTC models, claiming these expose the true generalization gap of modern speech models in specialized, noisy African contexts and supply a reliable blueprint for localized voice AI.

Significance. If the data collection, domain labeling, and representativeness claims hold after proper documentation, the benchmark could meaningfully advance evaluation practices for low-resource languages by moving beyond scripted or high-resource-focused tests toward realistic deployment conditions. This would help quantify and address performance gaps in practical African settings, complementing earlier benchmarks with domain-specific granularity.

major comments (2)

[Abstract] Abstract: The central claim that AfriVox-v2 'exposes the true generalization gap' and provides a 'reliable blueprint' rests on the unscripted audio and ten-sector verticalization accurately mirroring realistic African conditions, yet the abstract supplies no sampling frame, speaker demographics, geographic stratification, noise taxonomy, sample sizes, validation procedures, or inter-rater reliability checks on domain labels. This omission is load-bearing because any selection bias (e.g., urban over-sampling) would make the measured gap an artifact of the benchmark rather than a faithful exposure of model failure.
[Abstract] Abstract and implied methods/results sections: Without details on how 'in the wild' utterances were gathered, domains were assigned, or error analysis was performed, the reported model accuracies across sectors (e.g., agriculture or health) and on numbers/entities cannot be interpreted as representative or reproducible, undermining the generalization-gap conclusion.

minor comments (1)

[Abstract] Abstract: 'Sahara-v2' and 'Omnilingual CTC models' are referenced without definitions, prior citations, or links to their architectures, reducing clarity for readers unfamiliar with these specific systems.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their careful review and constructive feedback, which underscores the importance of methodological transparency for a benchmark paper. We address each major comment point by point below. Where the comments identify gaps in the abstract, we agree that revisions are warranted and will expand the abstract and cross-reference the methods section accordingly.

read point-by-point responses

Referee: [Abstract] Abstract: The central claim that AfriVox-v2 'exposes the true generalization gap' and provides a 'reliable blueprint' rests on the unscripted audio and ten-sector verticalization accurately mirroring realistic African conditions, yet the abstract supplies no sampling frame, speaker demographics, geographic stratification, noise taxonomy, sample sizes, validation procedures, or inter-rater reliability checks on domain labels. This omission is load-bearing because any selection bias (e.g., urban over-sampling) would make the measured gap an artifact of the benchmark rather than a faithful exposure of model failure.

Authors: We agree that the abstract is too high-level and should surface key methodological safeguards to support the generalization claims. The full manuscript (Section 3) describes the sampling frame as a stratified collection of unscripted audio drawn from public radio archives, community videos, and field recordings across 12 countries, with explicit efforts to balance urban/rural and dialectal coverage using language distribution data. Speaker demographics (age, gender, self-reported accent) were recorded for approximately 65% of samples; the remaining samples are noted as having partial metadata due to the in-the-wild nature of the sources. Geographic stratification followed national census language maps, and noise taxonomy is catalogued in Appendix B. Sample sizes per domain and language appear in Table 2. Domain labels were assigned by two native-speaker annotators with a third resolver; inter-annotator agreement reached 91% (Cohen’s kappa 0.87) as stated in Section 3.2. We will revise the abstract to include a concise summary of these elements and add a sentence acknowledging residual selection biases inherent to public data sources. This change will make the load-bearing claims more defensible without overstating representativeness. revision: yes
Referee: [Abstract] Abstract and implied methods/results sections: Without details on how 'in the wild' utterances were gathered, domains were assigned, or error analysis was performed, the reported model accuracies across sectors (e.g., agriculture or health) and on numbers/entities cannot be interpreted as representative or reproducible, undermining the generalization-gap conclusion.

Authors: We concur that the abstract and methods summary must supply enough information for readers to assess reproducibility and representativeness. Section 3 explains that utterances were harvested from unscripted public sources (radio talk shows, market recordings, telehealth consultations, and agricultural extension videos) with no scripted prompts. Domain assignment used a fixed 10-sector taxonomy with written guidelines and examples; annotators were instructed to assign the primary domain based on content. Error analysis, including per-sector WER, number and named-entity error rates, and noise-type breakdowns, is reported in Section 5 and Figure 4. To improve interpretability, we will (1) add a one-sentence overview of collection and labeling procedures to the abstract and (2) move the annotation guidelines to the main text or supplementary materials. These revisions will allow the reported accuracies to be evaluated against the documented collection constraints rather than assumed to be fully representative. revision: yes

Circularity Check

0 steps flagged

No circularity in benchmark introduction

full rationale

The paper introduces AfriVox-v2 as an empirical benchmark for speech recognition in African languages, collecting in-the-wild audio and evaluating models across domains. No derivations, equations, fitted parameters, or predictions appear; results are direct measurements on the new dataset rather than reductions to inputs by construction. Claims about generalization gaps rest on external model evaluations, not self-referential logic or self-citations that bear the load.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available, so no free parameters, axioms, or invented entities can be identified from the text. The contribution is the benchmark construction itself rather than any derivation.

pith-pipeline@v0.9.0 · 5485 in / 1014 out tokens · 65127 ms · 2026-05-07T16:33:25.604073+00:00 · methodology

Review history (2 revisions) →

discussion (0)

Reference graph

Works this paper leans on

23 extracted references · 23 canonical work pages · 1 internal anchor

[1]

Introduction and Related Work Automatic Speech Recognition (ASR) has transitioned from a specialized tool to a foundational interface across global enter- prise and consumer domains. In customer support, it facilitates real-time intent detection and agent assistance [1]; in health- care, voice-enabled digital scribes alleviate clinician documen- tation bu...

work page
[2]

AfriVox-v2: A Domain-Verticalized Benchmark for In-the-Wild African Speech Recognition

Benchmark Methodology 2.1. Datasets AfriV ox-v2 integrates multiple datasets spanning conversa- tional and read speech across 20+ African languages and mul- tiple language families. Prior read-speech corpora (e.g., Com- mon V oice, FLEURS, NCHLT) are retained to maintain com- parability with earlier benchmarks, while new conversational datasets substantia...

work page internal anchor Pith review Pith/arXiv arXiv 2026
[3]

Experiments 3.1. Models We benchmark models with broad African language coverage: • Gemini-3-Flash (multimodal speech LLM) • Sahara-v2 (region-optimized ASR model) • Omni-ASR v2 CTC models (300M, 1B, 7B parameters). CTC models were selected due to significantly faster infer- ence compared to LLM-based decoding. All models were evaluated using default prep...

work page
[4]

Scripted vs In-the-Wild Speech Table 5 compares transcription performance across Afrivox- 1 (primarily read speech) and Afrivox-2 (spontaneous in-the- wild speech)

Results 4.1. Scripted vs In-the-Wild Speech Table 5 compares transcription performance across Afrivox- 1 (primarily read speech) and Afrivox-2 (spontaneous in-the- wild speech). As expected, conversational speech generally in- creases recognition difficulty due to disfluencies, background noise, and acoustic variability. However, the results reveal non- u...

work page
[5]

Conclusion We present AfriV ox-v2, the most comprehensive benchmark to date for evaluating speech recognition across African lan- guages. By introducing a novel in-the-wild dataset, consolidat- ing conversational corpora across 20+ languages, and enabling domain-verticalized evaluation, AfriV ox-v2 exposes significant limitations of current ASR systems th...

work page
[6]

Many languages re- main underrepresented, and several datasets contain relatively small amounts of conversational speech, limiting statistical power for fine-grained analysis

Limitations Despite its expanded coverage, AfriV ox-v2 still represents only a fraction of Africa’s linguistic diversity. Many languages re- main underrepresented, and several datasets contain relatively small amounts of conversational speech, limiting statistical power for fine-grained analysis. Additionally, domain annotations rely partly on LLM- assist...

work page
[7]

Discovering customer intent in real-time for streamlining service desk conversations,

U. Nambiar, T. Faruquie, L. V . Subramaniam, S. Negi, and G. Ramakrishnan, “Discovering customer intent in real-time for streamlining service desk conversations,” inProceedings of the 20th ACM International Conference on Information and Knowledge Management, ser. CIKM ’11. New York, NY , USA: Association for Computing Machinery, 2011, p. 1383–1388. [Onlin...

work page doi:10.1145/2063576.2063776 2011
[8]

The digital scribe in clini- cal practice: A scoping review and research agenda,

M. M. van Buchem, H. Boosman, M. P. Bauer, I. M. J. Kant, S. A. Cammel, and E. W. Steyerberg, “The digital scribe in clini- cal practice: A scoping review and research agenda,”npj Digital Medicine, vol. 4, p. 57, 2021

work page 2021
[9]

Afrispeech-dialog: A benchmark dataset for spontaneous english conversations in healthcare and beyond,

M. Sanni, T. Abdullahi, D. Kayande, E. Ayodele, N. Etori, M. Mollelet al., “Afrispeech-dialog: A benchmark dataset for spontaneous english conversations in healthcare and beyond,” inProceedings of the 2025 Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), 2025. [Online]. Available: https://arxiv.org/abs/2502. 03945

work page 2025
[10]

Better transcription of uk supreme court hearings,

H. Saadany, C. Breslin, C. Orasan, and S. Walker, “Better transcription of uk supreme court hearings,” inAI4AJ@ICAIL,

work page
[11]

Available: https://ceur-ws.org/V ol-3435/short4

[Online]. Available: https://ceur-ws.org/V ol-3435/short4. pdf

work page
[12]

Afrispeech-200: Pan-african accented speech dataset for clinical and general domain asr,

T. Olatunji, T. Afonja, A. Yadavalli, C. C. Emezue, S. Singh, B. Dossouet al., “Afrispeech-200: Pan-african accented speech dataset for clinical and general domain asr,”Transactions of the Association for Computational Linguistics, vol. 11, pp. 1599–1617, 2023. [Online]. Available: https://aclanthology.org/ 2023.tacl-1.93

work page 2023
[13]

Afrispeech-multibench: A verticalized multidomain multicountry benchmark suite for african accented english asr,

G. Z. Ashungafac, M. Sanni, B. Awobade, A. Gichamba, and T. Olatunji, “Afrispeech-multibench: A verticalized multidomain multicountry benchmark suite for african accented english asr,” in Proceedings of the 14th International Joint Conference on Nat- ural Language Processing and the 4th Conference of the Asia- Pacific Chapter of the Association for Comput...

work page 2025
[14]

wav2vec 2.0: A framework for self-supervised learning of speech representa- tions,

A. Baevski, H. Zhou, A. Mohamed, and M. Auli, “wav2vec 2.0: A framework for self-supervised learning of speech representa- tions,” inAdvances in Neural Information Processing Systems (NeurIPS), 2020, pp. 12 449–12 460

work page 2020
[15]

Robust speech recognition via large-scale weak supervision,

A. Radford, J. W. Kim, T. Xu, G. Brockman, C. McLeavey, and I. Sutskever, “Robust speech recognition via large-scale weak supervision,” inInternational conference on machine learning. PMLR, 2023, pp. 28 492–28 518

work page 2023
[16]

Irokobench: A new benchmark for african languages in the age of large language models,

D. I. Adelani, J. Ojo, I. A. Azime, J. Y . Zhuang, J. Alabi, X. He, M. Ochieng, S. Hooker, A. Bukula, E.-S. A. Leeet al., “Irokobench: A new benchmark for african languages in the age of large language models,” inProceedings of the 2025 Confer- ence of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Tec...

work page 2025
[17]

Omnilingual asr: Open-source multilingual speech recognition for 1600+ languages

O. A. team, G. Keren, A. Kozhevnikov, Y . Meng, C. Ropers, M. Setzler, S. Wang, I. Adebara, M. Auli, C. Balioglu, K. Chan, C. Cheng, J. Chuang, C. Droof, M. Duppenthaler, P.-A. Duquenne, A. Erben, C. Gao, G. M. Gonzalez, K. Lyu, S. Miglani, V . Pratap, K. R. Sadagopan, S. Saleem, A. Turkatenko, A. Ventayol-Boada, Z.-X. Yong, Y .-A. Chung, J. Maillard, R. ...

work page arXiv 2025
[18]

Gemini 3: Next-generation multimodal models,

Google DeepMind, “Gemini 3: Next-generation multimodal models,” Google, Tech. Rep., 2026. [Online]. Available: https://blog.google/products-and-platforms/products/gemini/

work page 2026
[19]

The future of african voice ai,

Intron, “The future of african voice ai,” Intron, Tech. Rep., 2026. [Online]. Available: https://www.intron.io/

work page 2026
[20]

Gigaspeech 2: An evolving, large-scale and multi-domain asr corpus for low-resource languages with au- tomated crawling, transcription and refinement,

Y . Yang, Z. Song, J. Zhuo, M. Cui, J. Li, B. Yang, Y . Du, Z. Ma, X. Liu, Z. Wanget al., “Gigaspeech 2: An evolving, large-scale and multi-domain asr corpus for low-resource languages with au- tomated crawling, transcription and refinement,” inProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers),...

work page 2025
[21]

R., Sekoyan, M., Zelenfroynd, G., Meister, S., Ding, S., Kostandian, S., Huang, H., Karpov, N., Balam, J., Lavrukhin, V ., et al

N. R. Koluguri, M. Sekoyan, G. Zelenfroynd, S. Meister, S. Ding, S. Kostandian, H. Huang, N. Karpov, J. Balam, V . Lavrukhin et al., “Granary: Speech recognition and translation dataset in 25 european languages,”arXiv preprint arXiv:2505.13404, 2025

work page arXiv 2025
[22]

Waxal: A large-scale multilingual african language speech corpus,

A. Diack, P. Nelson, K. Agbesi, A. Nakalembe, M. Mo- hamedKhair, V . Dube, T. Siyavora, S. Venugopalan, J. Hickey, U. Okonkwoet al., “Waxal: A large-scale multilingual african language speech corpus,”arXiv preprint arXiv:2602.02734, 2026

work page arXiv 2026
[23]

msteb: Mas- sively multilingual evaluation of llms on speech and text tasks,

L. Hagos Beyene, V . Verma, M. Ma, J. O. Alabi, F. D. Schmidt, J. Nakatumba-Nabende, and D. Ifeoluwa Adelani, “msteb: Mas- sively multilingual evaluation of llms on speech and text tasks,” arXiv e-prints, pp. arXiv–2506, 2025

work page 2025

[1] [1]

Introduction and Related Work Automatic Speech Recognition (ASR) has transitioned from a specialized tool to a foundational interface across global enter- prise and consumer domains. In customer support, it facilitates real-time intent detection and agent assistance [1]; in health- care, voice-enabled digital scribes alleviate clinician documen- tation bu...

work page

[2] [2]

AfriVox-v2: A Domain-Verticalized Benchmark for In-the-Wild African Speech Recognition

Benchmark Methodology 2.1. Datasets AfriV ox-v2 integrates multiple datasets spanning conversa- tional and read speech across 20+ African languages and mul- tiple language families. Prior read-speech corpora (e.g., Com- mon V oice, FLEURS, NCHLT) are retained to maintain com- parability with earlier benchmarks, while new conversational datasets substantia...

work page internal anchor Pith review Pith/arXiv arXiv 2026

[3] [3]

Experiments 3.1. Models We benchmark models with broad African language coverage: • Gemini-3-Flash (multimodal speech LLM) • Sahara-v2 (region-optimized ASR model) • Omni-ASR v2 CTC models (300M, 1B, 7B parameters). CTC models were selected due to significantly faster infer- ence compared to LLM-based decoding. All models were evaluated using default prep...

work page

[4] [4]

Scripted vs In-the-Wild Speech Table 5 compares transcription performance across Afrivox- 1 (primarily read speech) and Afrivox-2 (spontaneous in-the- wild speech)

Results 4.1. Scripted vs In-the-Wild Speech Table 5 compares transcription performance across Afrivox- 1 (primarily read speech) and Afrivox-2 (spontaneous in-the- wild speech). As expected, conversational speech generally in- creases recognition difficulty due to disfluencies, background noise, and acoustic variability. However, the results reveal non- u...

work page

[5] [5]

Conclusion We present AfriV ox-v2, the most comprehensive benchmark to date for evaluating speech recognition across African lan- guages. By introducing a novel in-the-wild dataset, consolidat- ing conversational corpora across 20+ languages, and enabling domain-verticalized evaluation, AfriV ox-v2 exposes significant limitations of current ASR systems th...

work page

[6] [6]

Many languages re- main underrepresented, and several datasets contain relatively small amounts of conversational speech, limiting statistical power for fine-grained analysis

Limitations Despite its expanded coverage, AfriV ox-v2 still represents only a fraction of Africa’s linguistic diversity. Many languages re- main underrepresented, and several datasets contain relatively small amounts of conversational speech, limiting statistical power for fine-grained analysis. Additionally, domain annotations rely partly on LLM- assist...

work page

[7] [7]

Discovering customer intent in real-time for streamlining service desk conversations,

U. Nambiar, T. Faruquie, L. V . Subramaniam, S. Negi, and G. Ramakrishnan, “Discovering customer intent in real-time for streamlining service desk conversations,” inProceedings of the 20th ACM International Conference on Information and Knowledge Management, ser. CIKM ’11. New York, NY , USA: Association for Computing Machinery, 2011, p. 1383–1388. [Onlin...

work page doi:10.1145/2063576.2063776 2011

[8] [8]

The digital scribe in clini- cal practice: A scoping review and research agenda,

M. M. van Buchem, H. Boosman, M. P. Bauer, I. M. J. Kant, S. A. Cammel, and E. W. Steyerberg, “The digital scribe in clini- cal practice: A scoping review and research agenda,”npj Digital Medicine, vol. 4, p. 57, 2021

work page 2021

[9] [9]

Afrispeech-dialog: A benchmark dataset for spontaneous english conversations in healthcare and beyond,

M. Sanni, T. Abdullahi, D. Kayande, E. Ayodele, N. Etori, M. Mollelet al., “Afrispeech-dialog: A benchmark dataset for spontaneous english conversations in healthcare and beyond,” inProceedings of the 2025 Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), 2025. [Online]. Available: https://arxiv.org/abs/2502. 03945

work page 2025

[10] [10]

Better transcription of uk supreme court hearings,

H. Saadany, C. Breslin, C. Orasan, and S. Walker, “Better transcription of uk supreme court hearings,” inAI4AJ@ICAIL,

work page

[11] [11]

Available: https://ceur-ws.org/V ol-3435/short4

[Online]. Available: https://ceur-ws.org/V ol-3435/short4. pdf

work page

[12] [12]

Afrispeech-200: Pan-african accented speech dataset for clinical and general domain asr,

T. Olatunji, T. Afonja, A. Yadavalli, C. C. Emezue, S. Singh, B. Dossouet al., “Afrispeech-200: Pan-african accented speech dataset for clinical and general domain asr,”Transactions of the Association for Computational Linguistics, vol. 11, pp. 1599–1617, 2023. [Online]. Available: https://aclanthology.org/ 2023.tacl-1.93

work page 2023

[13] [13]

Afrispeech-multibench: A verticalized multidomain multicountry benchmark suite for african accented english asr,

G. Z. Ashungafac, M. Sanni, B. Awobade, A. Gichamba, and T. Olatunji, “Afrispeech-multibench: A verticalized multidomain multicountry benchmark suite for african accented english asr,” in Proceedings of the 14th International Joint Conference on Nat- ural Language Processing and the 4th Conference of the Asia- Pacific Chapter of the Association for Comput...

work page 2025

[14] [14]

wav2vec 2.0: A framework for self-supervised learning of speech representa- tions,

A. Baevski, H. Zhou, A. Mohamed, and M. Auli, “wav2vec 2.0: A framework for self-supervised learning of speech representa- tions,” inAdvances in Neural Information Processing Systems (NeurIPS), 2020, pp. 12 449–12 460

work page 2020

[15] [15]

Robust speech recognition via large-scale weak supervision,

A. Radford, J. W. Kim, T. Xu, G. Brockman, C. McLeavey, and I. Sutskever, “Robust speech recognition via large-scale weak supervision,” inInternational conference on machine learning. PMLR, 2023, pp. 28 492–28 518

work page 2023

[16] [16]

Irokobench: A new benchmark for african languages in the age of large language models,

D. I. Adelani, J. Ojo, I. A. Azime, J. Y . Zhuang, J. Alabi, X. He, M. Ochieng, S. Hooker, A. Bukula, E.-S. A. Leeet al., “Irokobench: A new benchmark for african languages in the age of large language models,” inProceedings of the 2025 Confer- ence of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Tec...

work page 2025

[17] [17]

Omnilingual asr: Open-source multilingual speech recognition for 1600+ languages

O. A. team, G. Keren, A. Kozhevnikov, Y . Meng, C. Ropers, M. Setzler, S. Wang, I. Adebara, M. Auli, C. Balioglu, K. Chan, C. Cheng, J. Chuang, C. Droof, M. Duppenthaler, P.-A. Duquenne, A. Erben, C. Gao, G. M. Gonzalez, K. Lyu, S. Miglani, V . Pratap, K. R. Sadagopan, S. Saleem, A. Turkatenko, A. Ventayol-Boada, Z.-X. Yong, Y .-A. Chung, J. Maillard, R. ...

work page arXiv 2025

[18] [18]

Gemini 3: Next-generation multimodal models,

Google DeepMind, “Gemini 3: Next-generation multimodal models,” Google, Tech. Rep., 2026. [Online]. Available: https://blog.google/products-and-platforms/products/gemini/

work page 2026

[19] [19]

The future of african voice ai,

Intron, “The future of african voice ai,” Intron, Tech. Rep., 2026. [Online]. Available: https://www.intron.io/

work page 2026

[20] [20]

Gigaspeech 2: An evolving, large-scale and multi-domain asr corpus for low-resource languages with au- tomated crawling, transcription and refinement,

Y . Yang, Z. Song, J. Zhuo, M. Cui, J. Li, B. Yang, Y . Du, Z. Ma, X. Liu, Z. Wanget al., “Gigaspeech 2: An evolving, large-scale and multi-domain asr corpus for low-resource languages with au- tomated crawling, transcription and refinement,” inProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers),...

work page 2025

[21] [21]

R., Sekoyan, M., Zelenfroynd, G., Meister, S., Ding, S., Kostandian, S., Huang, H., Karpov, N., Balam, J., Lavrukhin, V ., et al

N. R. Koluguri, M. Sekoyan, G. Zelenfroynd, S. Meister, S. Ding, S. Kostandian, H. Huang, N. Karpov, J. Balam, V . Lavrukhin et al., “Granary: Speech recognition and translation dataset in 25 european languages,”arXiv preprint arXiv:2505.13404, 2025

work page arXiv 2025

[22] [22]

Waxal: A large-scale multilingual african language speech corpus,

A. Diack, P. Nelson, K. Agbesi, A. Nakalembe, M. Mo- hamedKhair, V . Dube, T. Siyavora, S. Venugopalan, J. Hickey, U. Okonkwoet al., “Waxal: A large-scale multilingual african language speech corpus,”arXiv preprint arXiv:2602.02734, 2026

work page arXiv 2026

[23] [23]

msteb: Mas- sively multilingual evaluation of llms on speech and text tasks,

L. Hagos Beyene, V . Verma, M. Ma, J. O. Alabi, F. D. Schmidt, J. Nakatumba-Nabende, and D. Ifeoluwa Adelani, “msteb: Mas- sively multilingual evaluation of llms on speech and text tasks,” arXiv e-prints, pp. arXiv–2506, 2025

work page 2025