Voice of India: A Large-Scale Benchmark for Real-World Speech Recognition in India

Kaushal Bhogale, Manas Dhir, Amritansh Walecha, Manmeet Kaur, Vanshika Chhabra, Aaditya Pareek, Hanuman Sidh, Sagar Jain, Bhaskar Singh, Utkarsh Singh, Tahir Javed, Shobhit Banga, Mitesh M. Khapra

arXiv: 2604.19151 · v1 · submitted 2026-04-21 · cs.CL · cs.SD · eess.AS

Pith reviewed 2026-05-10 02:32 UTC · model grok-4.3

classification cs.CL · cs.SD · eess.AS

keywords speech recognition · Indian languages · ASR benchmark · unscripted speech · telephonic conversations · multilingual ASR · regional disparities

The pith

A benchmark of unscripted phone conversations reveals gaps in current speech recognition for Indian languages.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Existing benchmarks for speech recognition in India mostly rely on scripted, clean recordings, and leaderboard-driven evaluation on them encourages overfitting to specific test sets rather than gains in real-world performance. Strict single-reference scoring also penalizes natural spelling variation in transcripts. The paper introduces Voice of India, a closed-source benchmark collected from unscripted telephonic conversations across 15 languages and 139 regional clusters, with 536 hours of speech and transcripts that allow for spelling differences. This permits evaluation under realistic conditions, including variation in audio quality and speaker demographics. By analyzing results at the district level and across factors such as gender and device type, the work identifies where models struggle most in everyday use.

Core claim

The central discovery is that a large-scale benchmark built from unscripted telephonic conversations in 15 major Indian languages provides a more representative test for automatic speech recognition systems than existing scripted datasets. With 306,230 utterances from 36,691 speakers totaling 536 hours and transcripts that account for spelling variations, it exposes geographic disparities in performance and highlights challenges related to audio quality, speaking rate, gender, and device type.

What carries the argument

The Voice of India benchmark dataset, derived from real telephonic conversations with manually transcribed utterances that permit spelling variants to reflect natural language use.
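To make the idea of variant-tolerant scoring concrete, here is a minimal sketch of multi-reference WER, where each hypothesis is scored against whichever valid transcript it matches best. This is an assumption for illustration only: the paper reports an orthographically informed WER whose exact definition is not reproduced on this page, and the function names and the best-of-references rule below are hypothetical.

```python
from typing import Iterable, List, Sequence, Tuple


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Word-level Levenshtein distance between a reference and a hypothesis."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion of a reference word
                            curr[j - 1] + 1,      # insertion of a hypothesis word
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def best_reference_errors(references: List[str], hypothesis: str) -> Tuple[int, int]:
    """Return (errors, reference_length) for the reference with the lowest WER.

    Assumption: any listed transcript variant is an equally valid target.
    """
    hyp = hypothesis.split()
    best = None
    for ref_text in references:
        ref = ref_text.split()
        if not ref:
            continue  # skip empty variants to avoid division by zero
        errors = edit_distance(ref, hyp)
        candidate = (errors / len(ref), errors, len(ref))
        if best is None or candidate < best:
            best = candidate
    if best is None:
        raise ValueError("at least one non-empty reference is required")
    return best[1], best[2]


def multi_reference_wer(samples: Iterable[Tuple[List[str], str]]) -> float:
    """Pooled WER over (references, hypothesis) pairs."""
    total_errors = 0
    total_words = 0
    for references, hypothesis in samples:
        errors, length = best_reference_errors(references, hypothesis)
        total_errors += errors
        total_words += length
    return total_errors / total_words
```

With the invented variants `["mera phone kharab hai", "mera fone kharab hai"]`, the hypothesis `"mera fone kharab hai"` scores zero errors, whereas scoring against the first spelling alone would count one substitution in four words.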

If this is right

ASR systems exhibit varying performance across different Indian districts, indicating regional biases in current models.
Performance degrades under conditions of poor audio quality, fast speaking rates, or certain device types.
Accounting for spelling variations in evaluation leads to fairer assessment of code-mixed speech.
Insights from the dataset can guide targeted improvements in real-world Indic ASR applications.
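The district-level point in the list above amounts to a group-by over per-utterance error counts. The sketch below shows one way to compute it, assuming each utterance comes with an edit-error count and a reference word count; the row layout, the `min_ref_words` cutoff, and the district names are hypothetical, and the paper's own aggregation may differ.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

Row = Tuple[str, int, int]  # (district, edit errors, reference word count)


def district_wer(rows: Iterable[Row], min_ref_words: int = 1) -> Dict[str, float]:
    """Pooled WER (%) per district, skipping districts with too little data."""
    totals = defaultdict(lambda: [0, 0])
    for district, errors, ref_words in rows:
        totals[district][0] += errors
        totals[district][1] += ref_words
    return {
        district: 100.0 * errors / words
        for district, (errors, words) in totals.items()
        if words >= min_ref_words
    }


def disparity(wer: Dict[str, float]) -> Tuple[str, str, float]:
    """Return (worst district, best district, WER gap in percentage points)."""
    worst = max(wer, key=wer.get)
    best = min(wer, key=wer.get)
    return worst, best, wer[worst] - wer[best]


rows = [
    ("district_a", 2, 20),
    ("district_a", 3, 30),
    ("district_b", 9, 30),
]
per_district = district_wer(rows)
worst, best, gap = disparity(per_district)
```

Pooling errors and words before dividing keeps short utterances from dominating a district's score; raising `min_ref_words` is a simple guard against reading too much into districts with only a handful of recordings.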

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Developers of speech systems for multilingual countries could use similar unscripted collection methods to create more practical benchmarks.
The geographic analysis suggests that ASR performance may correlate with socioeconomic factors in different regions, warranting further study.
Extending this approach to other languages with high dialectal variation could improve global ASR equity.

Load-bearing premise

That collecting data from unscripted telephonic conversations and creating transcripts with spelling variants produces a benchmark that is less biased and more reflective of real-world speech than scripted alternatives.

What would settle it

Demonstrating that state-of-the-art ASR models achieve word error rates on this unscripted benchmark comparable to those on existing scripted ones, without specific adaptations for spelling variations or regional accents, would undermine the claimed superiority.

Figures

Figures reproduced from arXiv: 2604.19151 by Aaditya Pareek, Amritansh Walecha, Bhaskar Singh, Hanuman Sidh, Kaushal Bhogale, Manas Dhir, Manmeet Kaur, Mitesh M. Khapra, Sagar Jain, Shobhit Banga, Tahir Javed, Utkarsh Singh, Vanshika Chhabra.

**Figure 1.** The WER map of India: average Word Error Rate (WER) of ASR models across the districts of India.

**Figure 4.** Deviations from ideal acoustic conditions consistently increase error rates across DNSMOS [29] quality quartiles, speaking-rate quartiles, and utterance duration bins (<2s, 2–5s, >5s). Audio degradation raises WER monotonically; ElevenLabs Scribe rises from 15.31% to 25.20% and Gemini-3-Pro from 13.42% to 23.44% between the highest and lowest quality quartiles. Speaking rate exhibits a U-shaped…
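A quartile breakdown like the one described for Figure 4 can be sketched as follows. This is an illustrative reconstruction, not the authors' code: the `Utterance` fields, the bin labels, and the choice of pooled WER within each bin are assumptions, and the quality score stands in for a per-utterance DNSMOS value.

```python
from statistics import quantiles
from typing import Dict, List, NamedTuple


class Utterance(NamedTuple):
    quality: float   # per-utterance quality score, e.g. DNSMOS (assumed field)
    errors: int      # word-level edit errors for this utterance
    ref_words: int   # number of words in the reference transcript


BINS = ("Q1 (lowest)", "Q2", "Q3", "Q4 (highest)")


def wer_by_quality_quartile(utterances: List[Utterance]) -> Dict[str, float]:
    """Pooled WER (%) within each quartile of the quality score."""
    # statistics.quantiles needs at least two data points.
    q1, q2, q3 = quantiles([u.quality for u in utterances], n=4)
    totals = {name: [0, 0] for name in BINS}
    for u in utterances:
        if u.quality <= q1:
            name = BINS[0]
        elif u.quality <= q2:
            name = BINS[1]
        elif u.quality <= q3:
            name = BINS[2]
        else:
            name = BINS[3]
        totals[name][0] += u.errors
        totals[name][1] += u.ref_words
    return {
        name: 100.0 * errors / words
        for name, (errors, words) in totals.items()
        if words > 0
    }


# Invented data: lower-quality audio carries more errors per reference word.
sample = [
    Utterance(1.0, 3, 10), Utterance(2.0, 3, 10),
    Utterance(3.0, 2, 10), Utterance(4.0, 2, 10),
    Utterance(5.0, 2, 10), Utterance(6.0, 1, 10),
    Utterance(7.0, 1, 10), Utterance(8.0, 1, 10),
]
by_quartile = wer_by_quality_quartile(sample)
```

The same binning applies to speaking rate by swapping the score field; the fixed duration bins (<2s, 2–5s, >5s) would replace the computed cut points with constants.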

read the original abstract

Existing Indic ASR benchmarks often use scripted, clean speech and leaderboard driven evaluation that encourages dataset specific overfitting. In addition, strict single reference WER penalizes natural spelling variation in Indian languages, including non standardized spellings of code-mixed English origin words. To address these limitations, we introduce Voice of India, a closed source benchmark built from unscripted telephonic conversations covering 15 major Indian languages across 139 regional clusters. The dataset contains 306230 utterances, totaling 536 hours of speech from 36691 speakers with transcripts accounting for spelling variations. We also analyze performance geographically at the district level, revealing disparities. Finally, we provide detailed analysis across factors such as audio quality, speaking rate, gender, and device type, highlighting where current ASR systems struggle and offering insights for improving real world Indic ASR systems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces Voice of India, a closed-source benchmark of 306230 utterances (536 hours) drawn from unscripted telephonic conversations in 15 major Indian languages across 139 regional clusters, involving 36691 speakers. Transcripts accommodate spelling variations. The authors report geographic performance disparities at the district level and factor analyses on audio quality, speaking rate, gender, and device type to identify challenges for existing ASR systems.

Significance. If the dataset construction, transcript quality, and analyses prove sound upon verification, the work could usefully document real-world Indic ASR difficulties beyond scripted benchmarks. The scale and multi-factor breakdown offer potential guidance for system improvements in under-resourced settings. The closed-source status, however, sharply curtails community adoption and independent testing of the claimed advantages.

major comments (3)

[Abstract] Abstract: the central claims that the dataset 'reveals disparities' and 'highlight[s] where current ASR systems struggle' are unsupported by any quantitative WER numbers, baseline comparisons, or error rates. Without these, the asserted superiority over scripted benchmarks cannot be evaluated.
[Abstract and §3] Dataset release statement (Abstract and §3): declaring the benchmark closed-source prevents reproduction of the district-level disparity results and the audio-quality/speaking-rate/gender/device breakdowns. This directly undermines the paper's stated purpose of supplying a usable real-world benchmark for the Indic ASR community.
[§4–5] Analysis sections (§4–5): the claim that unscripted telephonic data plus spelling-variant transcripts yield a meaningfully less biased representation requires explicit validation, such as side-by-side WER evaluation of the same models on Voice of India versus existing scripted Indic corpora.

minor comments (2)

[Abstract] Abstract: format the utterance count as 306,230 for standard readability.
[Introduction] Introduction: specify the exact criteria used to define the 139 regional clusters and how speaker demographics were sampled to ensure geographic coverage.

Simulated Author's Rebuttal

3 responses · 1 unresolved

We thank the referee for the detailed review and constructive comments on our manuscript. We address each major comment below, clarifying our approach and indicating planned revisions where appropriate.

read point-by-point responses

Referee: [Abstract] Abstract: the central claims that the dataset 'reveals disparities' and 'highlight[s] where current ASR systems struggle' are unsupported by any quantitative WER numbers, baseline comparisons, or error rates. Without these, the asserted superiority over scripted benchmarks cannot be evaluated.

Authors: The abstract condenses findings from the district-level geographic analysis and the multi-factor breakdowns in §§4–5. These sections quantify performance variations across audio quality, speaking rate, gender, device type, and regional clusters, which directly support the claims of disparities and system struggles. While the abstract itself avoids lengthy numerical tables for brevity, the underlying analyses contain the supporting quantitative breakdowns. In revision we will add a concise sentence to the abstract referencing key aggregate statistics (e.g., WER ranges and factor-specific deltas) drawn from §§4–5.
revision: partial

Referee: [Abstract and §3] Dataset release statement (Abstract and §3): declaring the benchmark closed-source prevents reproduction of the district-level disparity results and the audio-quality/speaking-rate/gender/device breakdowns. This directly undermines the paper's stated purpose of supplying a usable real-world benchmark for the Indic ASR community.

Authors: We recognize that closed-source status precludes independent reproduction. The decision stems from privacy and consent constraints inherent to real, unscripted telephonic conversations involving 36,691 speakers. The manuscript supplies the full collection protocol, transcription guidelines (including spelling-variant handling), sampling strategy across 139 clusters, and all statistical results from the factor analyses. These elements allow the community to understand the observed challenges and to design mitigation strategies even without direct data access. We therefore retain the closed-source designation.
revision: no

Referee: [§4–5] Analysis sections (§4–5): the claim that unscripted telephonic data plus spelling-variant transcripts yield a meaningfully less biased representation requires explicit validation, such as side-by-side WER evaluation of the same models on Voice of India versus existing scripted Indic corpora.

Authors: Sections 4 and 5 demonstrate that the combination of unscripted speech and variant-aware transcripts surfaces realistic error patterns (e.g., higher WER on code-mixed terms and dialectal variants) that scripted benchmarks typically mask. While we do not include head-to-head WER tables against every existing corpus, the factor analyses isolate the contribution of each variable and show elevated difficulty relative to the clean, scripted conditions described in prior work. In revision we will expand the discussion to cite quantitative comparisons reported in the literature for the same model families and to articulate why direct re-evaluation on Voice of India is not feasible under the current release policy.
revision: partial

standing simulated objections not resolved

Independent reproduction of the district-level disparity results and factor analyses is not possible because the benchmark remains closed-source for privacy reasons.

Circularity Check

0 steps flagged

No circularity: empirical benchmark paper with no derivations or self-referential reductions

full rationale

The paper introduces Voice of India as a closed-source dataset of unscripted telephonic speech across 15 Indic languages, with accompanying geographic and factor analyses. No mathematical derivations, model predictions, fitted parameters, or uniqueness theorems are claimed. The central contribution is data collection and empirical reporting; all performance observations are presented as direct measurements on the collected utterances rather than outputs derived from prior self-citations or internal definitions. No load-bearing steps reduce by construction to the paper's own inputs, satisfying the default expectation of no significant circularity for a purely empirical benchmark effort.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is a data collection and benchmarking paper with no mathematical derivations, free parameters, axioms, or invented entities. All claims rest on the described collection process and empirical observations.

pith-pipeline@v0.9.0 · 5505 in / 1228 out tokens · 36563 ms · 2026-05-10T02:32:46.543796+00:00 · methodology


Reference graph

Works this paper leans on

40 extracted references · 40 canonical work pages · 1 internal anchor

[1]

Voice of India: A Large-Scale Benchmark for Real-World Speech Recognition in India

Introduction Recent progress in Indic Automatic Speech Recognition (ASR) has been driven by shared tasks and large scale benchmarks such as MUCS [1], IndicSUPERB [2], Vistaar [3], and datasets like IndicVoices [4], which have expanded coverage across languages, accents, orthographies, and code switching. However, improvements on benchmark leaderboards ...

work page internal anchor Pith review Pith/arXiv arXiv 2026
[2]

Related Work Benchmarks for Indian Language ASR. Early efforts to evaluate ASR for Indian languages include the Interspeech 2018 Low Resource ASR Challenge [6], multilingual speech corpora released through OpenSLR [7, 8], and the MUCS 2021 shared task [1]. More recent benchmarks include IndicSUPERB [2], Vistaar [3], accent focused datasets such as Svar...

work page 2018
[3]

The Voice of India Benchmark 3.1. Speech Data Collection Platform and Contributor Onboarding. Speech data was collected through an online platform enabling large scale remote participation, where contributors across India recorded audio through a peer to peer interface. Recruitment was conducted through a large nationwide digital community platform 1 wit...

work page 2011
[4]

and moderately frequent words (301-1000) were assigned weights of 20 and 5 respectively, and common words (>1000) receiving a weight of 0.5. Segment scores were computed as the mean weight of their constituent words, allowing the selection process to favor segments with richer and more diverse vocabulary while maintaining demographic balance. 3.2. Trans...

work page
[5]

Models Evaluated We evaluate 14 ASR systems, including 11 proprietary APIs and 3 open source models

Experiment Setup 4.1. Models Evaluated We evaluate 14 ASR systems, including 11 proprietary APIs and 3 open source models. A model is evaluated for a language only if it provides explicit support through a native language tag or can be reliably conditioned through prompts indicating the target language. For dialects such as Bhojpuri and Chhattisgarhi, d...

work page 2011
[6]

Results and Discussion 5.1. Evaluation of models on the Voice of India Benchmark Table 2a shows that most models exceed a WER of 20 (highlighted in red), a threshold often associated with practical usability, and no system meets this criterion consistently across all languages. Even the best performing model, SARVAMAUDIO, exceeds this threshold on B...

work page
[7]

Conclusion We introduce Voice of India, a benchmark for evaluating ASR systems on real world Indian speech collected from unscripted telephonic conversations across multiple languages and regions. The benchmark incorporates multiple transcription variants and evaluates systems using orthographically informed WER to better reflect natural spelling variati...

work page
[8]

MUCS 2021: Multilingual and code-switching ASR challenges for low resource indian languages,

A. Diwan, R. V., Sanket Shah, A. S., Srinivasa Raghavan, S. K., Vinit Unni, S. Vyas, A. Rajpuria, C. Yarra, A. Mittal, P. K. Ghosh, P. Jyothi, K. Bali, V. Seshadri, S. Sitaram, S. Bharadwaj, J. Nanavati, R. Nanavati, and K. Sankaranarayanan, “MUCS 2021: Multilingual and code-switching ASR challenges for low re...

work page 2021
[9]

Indicsuperb: A speech processing universal performance benchmark for indian languages,

T. Javed, K. S. Bhogale, A. Raman, P. Kumar, A. Kunchukuttan, and M. M. Khapra, “Indicsuperb: A speech processing universal performance benchmark for indian languages,” in Proceedings of the Thirty-Seventh AAAI Conference on Artificial Intelligence (AAAI 2023). AAAI Press, 2023, pp. 12942–12950

work page 2023
[10]

Vistaar: Diverse benchmarks and training sets for indian language ASR,

K. S. Bhogale, S. Sundaresan, A. Raman, T. Javed, M. M. Khapra, and P. Kumar, “Vistaar: Diverse benchmarks and training sets for indian language ASR,” in Proc. Interspeech 2023, 2023, pp. 4384–4388

work page 2023
[12]

Available: https://arxiv.org/abs/2403.01926

[Online]. Available: https://arxiv.org/abs/2403.01926

work page arXiv
[13]

Rethinking evaluation in asr: Are our models robust enough?

T. Likhomanenko, Q. Xu, V. Pratap, P. Tomasello, J. Kahn, G. Avidov, R. Collobert, and G. Synnaeve, “Rethinking evaluation in asr: Are our models robust enough?” arXiv preprint arXiv:2010.11745, 2020

work page arXiv 2010
[14]

Interspeech 2018 low resource automatic speech recognition challenge for indian languages,

B. M. L. Srivastava, S. Sitaram, R. K. Mehta, K. D. Mohan, P. Matani, S. Satpal, K. Bali, R. Srikanth, and N. Nayak, “Interspeech 2018 low resource automatic speech recognition challenge for indian languages,” in Proc. 6th Workshop on Spoken Language Technologies for Under-Resourced Languages (SLTU 2018), 2018, pp. 11–14

work page 2018
[15]

Open-source multi-speaker speech corpora for building Gujarati, Kannada, Malayalam, Marathi, Tamil and Telugu speech synthesis systems,

F. He, S. C. Chu, O. Kjartansson, C. Rivera, A. Katanova, A. Gutkin, I. Demirsahin, C. Johny, M. Jansche, S. Sarin, and K. Pipatsrisawat, “Open-source multi-speaker speech corpora for building Gujarati, Kannada, Malayalam, Marathi, Tamil and Telugu speech synthesis systems,” in Proceedings of the 12th Language Resources and Evaluation Conference (LREC 2020...

work page 2020
[16]

Google crowdsourced speech corpora and related open-source resources for low-resource languages and dialects: An overview,

A. Butryna, S. C. Chu, I. Demirsahin, A. Gutkin, L. Ha, F. He, M. Jansche, C. Johny, A. Katanova, O. Kjartansson et al., “Google crowdsourced speech corpora and related open-source resources for low-resource languages and dialects: An overview,” in Proceedings of the 12th Language Resources and Evaluation Conference (LREC 2020). European Language Resources ...

work page 2020
[17]

Svarah: Evaluating english ASR systems on indian accents,

T. Javed, S. Joshi, V. Nagarajan, S. Sundaresan, J. Nawale, A. Raman, K. S. Bhogale, P. Kumar, and M. M. Khapra, “Svarah: Evaluating english ASR systems on indian accents,” in Proc. Interspeech 2023, 2023, pp. 5087–5091

work page 2023
[18]

LAHAJA: a robust multi-accent benchmark for evaluating hindi ASR systems,

T. Javed, J. Nawale, S. Joshi, E. I. George, K. S. Bhogale, D. Mehendale, and M. M. Khapra, “LAHAJA: a robust multi-accent benchmark for evaluating hindi ASR systems,” in Proc. Interspeech 2024, 2024

work page 2024
[19]

Common voice: A massively-multilingual speech corpus,

R. Ardila, M. Branson, K. Davis, M. Henretty, M. Kohler, J. Meyer, R. Morais, L. Saunders, F. M. Tyers, and G. Weber, “Common voice: A massively-multilingual speech corpus,” in Proceedings of the 12th Language Resources and Evaluation Conference (LREC 2020). European Language Resources Association, 2020, pp. 4218–4222. [Online]. Available: https: //aclant...

work page 2020
[20]

Crowd-sourced speech corpora for Javanese, Sundanese, Sinhala, Nepali, and five african languages,

O. Kjartansson, S. Sarin, K. Pipatsrisawat, M. Jansche, and L. Ha, “Crowd-sourced speech corpora for Javanese, Sundanese, Sinhala, Nepali, and five african languages,” in Proc. 6th Workshop on Spoken Language Technologies for Under-Resourced Languages (SLTU 2018), 2018, pp. 52–55

work page 2018
[21]

FLEURS: FEW-shot learning evaluation of universal representations of speech,

A. Conneau, M. Ma, S. Khanuja, Y. Zhang, V. Axelrod, S. Dalmia, J. Riesa, C. Rivera, and A. Bapna, “FLEURS: FEW-shot learning evaluation of universal representations of speech,” in Proceedings of the IEEE Spoken Language Technology Workshop (SLT 2022). IEEE, 2022, pp. 798–805

work page 2022
[22]

WAXAL: A large-scale speech dataset for sub-saharan african languages,

WAXAL Consortium, “WAXAL: A large-scale speech dataset for sub-saharan african languages,” 2025, preprint / Technical Report. Entry requires verification: full author list and venue not confirmed from available sources

work page 2025
[23]

Multi-reference WER for evaluating ASR for languages with no orthographic rules,

A. M. Ali, W. Magdy, P. Bell, and S. Renals, “Multi-reference WER for evaluating ASR for languages with no orthographic rules,” in 2015 IEEE Workshop on Automatic Speech Recognition and Understanding, ASRU 2015, Scottsdale, AZ, USA, December 13-17, 2015. IEEE, 2015, pp. 576–580. [Online]. Available: https://doi.org/10.1109/ASRU.2015.7404847

work page doi:10.1109/asru.2015.7404847 2015
[24]

Towards variability resistant dialectal speech evaluation,

A. Ali, S. Khalifa, and N. Habash, “Towards variability resistant dialectal speech evaluation,” in 20th Annual Conference of the International Speech Communication Association, Interspeech 2019, Graz, Austria, September 15-19, 2019, G. Kubin and Z. Kacic, Eds. ISCA, 2019, pp. 336–340. [Online]. Available: https://doi.org/10.21437/Interspeech.2019-2692

work page doi:10.21437/interspeech.2019-2692 2019
[25]

Lenient evaluation of japanese speech recognition: Modeling naturally occurring spelling inconsistency,

S. Karita, R. Sproat, and H. Ishikawa, “Lenient evaluation of japanese speech recognition: Modeling naturally occurring spelling inconsistency,” CoRR, vol. abs/2306.04530, 2023. [Online]. Available: https://doi.org/10.48550/arXiv.2306.04530

work page doi:10.48550/arxiv.2306.04530 2023
[26]

Style-agnostic evaluation of ASR using multiple reference transcripts,

Q. McNamara, M. Á. del Río Fernández, N. Bhandari, M. Ratajczak, D. Chen, C. Miller, and M. Jetté, “Style-agnostic evaluation of ASR using multiple reference transcripts,” CoRR, vol. abs/2412.07937, 2024. [Online]. Available: https://doi.org/10.48550/arXiv.2412.07937

work page doi:10.48550/arxiv.2412.07937 2024
[27]

Cmu-cambridge statistical language modeling toolkit v2,

R. Rosenfeld and P. Clarkson, “Cmu-cambridge statistical language modeling toolkit v2,” 1997

work page 1997
[28]

Werd: Using social text spelling variants for evaluating dialectal speech recognition,

A. Ali, P. Nakov, P. Bell, and S. Renals, “Werd: Using social text spelling variants for evaluating dialectal speech recognition,” in 2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU). IEEE, 2017, pp. 141–148

work page 2017
[29]

Robust speech recognition via large-scale weak supervision,

A. Radford, J. W. Kim, T. Xu, G. Brockman, C. McLeavey, and I. Sutskever, “Robust speech recognition via large-scale weak supervision,” in International Conference on Machine Learning, ICML 2023, 23-29 July 2023, Honolulu, Hawaii, USA, ser. Proceedings of Machine Learning Research, A. Krause, E. Brunskill, K. Cho, B. Engelhardt, S. Sabato, and J. Scarlett,...

work page 2023
[30]

What is lost in normalization? exploring pitfalls in multilingual ASR model evaluations,

K. Manohar and L. G. Pillai, “What is lost in normalization? exploring pitfalls in multilingual ASR model evaluations,” CoRR, vol. abs/2409.02449, 2024. [Online]. Available: https://doi.org/10.48550/arXiv.2409.02449

work page doi:10.48550/arxiv.2409.02449 2024
[31]

Homophone identification and merging for code-switched speech recognition,

B. M. L. Srivastava and S. Sitaram, “Homophone identification and merging for code-switched speech recognition,” in Interspeech 2018, 2018, pp. 1943–1947

work page 2018
[32]

Unsupervised language agnostic wer standardization,

S. Guha, R. Ambavat, A. Gupta, M. Gupta, and R. Mehta, “Unsupervised language agnostic wer standardization,” arXiv preprint arXiv:2303.05046, 2023

work page arXiv 2023
[33]

Improving speech recognition systems for the morphologically complex malayalam language using subword tokens for language modeling,

K. Manohar, A. R. Jayan, and R. Rajan, “Improving speech recognition systems for the morphologically complex malayalam language using subword tokens for language modeling,” EURASIP J. Audio Speech Music. Process., vol. 2023, no. 1, p. 47, 2023. [Online]. Available: https://doi.org/10.1186/s13636-023-00313-7

work page doi:10.1186/s13636-023-00313-7 2023
[34]

Advocating character error rate for multilingual ASR evaluation,

T. D. K, J. James, D. P. Gopinath, and M. A. K, “Advocating character error rate for multilingual ASR evaluation,” CoRR, vol. abs/2410.07400, 2024. [Online]. Available: https://doi.org/10.48550/arXiv.2410.07400

work page arXiv 2024
[35]

Scaling speech technology to 1,000+ languages,

V. Pratap, A. Tjandra, B. Shi, P. Tomasello, A. Babu, S. Kundu, A. Elkahky, Z. Ni, A. Vyas, M. Fazel-Zarandi et al., “Scaling speech technology to 1,000+ languages,” Journal of Machine Learning Research, vol. 25, no. 97, pp. 1–52, 2024

work page 2024
[36]

SpeechBrain: A general-purpose speech toolkit

M. Ravanelli, T. Parcollet, P. Plantinga, A. Rouhe, S. Cornell, L. Lugosch, C. Subakan, N. Dawalatabad, A. Heba, J. Zhong, J.-C. Chou, S.-L. Yeh, S.-W. Fu, C.-F. Liao, E. Rastorgueva, F. Grondin, W. Aris, H. Na, Y. Gao, R. D. Mori, and Y. Bengio, “SpeechBrain: A general-purpose speech toolkit,” 2021, arXiv:2106.04624

work page arXiv 2021
[37]

Dnsmos: A non-intrusive perceptual objective speech quality metric to evaluate noise suppressors,

C. K. Reddy, V. Gopal, and R. Cutler, “Dnsmos: A non-intrusive perceptual objective speech quality metric to evaluate noise suppressors,” in ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2021, pp. 6493–6497

work page 2021
[38]

Census of india 2011,

C. Chandramouli and R. General, “Census of india 2011,” Provisional Population Totals. New Delhi: Government of India, pp. 409–413, 2011

work page 2011
[39]

Indicconformer,

AI4Bharat, “Indicconformer,” 2024, accessed: 2025-02-19

work page 2024
[40]

Towards orthographically-informed evaluation of speech recognition systems for indian languages,

K. S. Bhogale, T. Javed, G. S. John, D. Rathi, A. Padmanaban, N. Parasa, and M. M. Khapra, “Towards orthographically-informed evaluation of speech recognition systems for indian languages,” arXiv preprint arXiv:2603.00941, 2026

work page arXiv 2026
[41]

L3cube-indicsbert: A simple approach for learning cross-lingual sentence representations using multilingual bert,

S. Deode, J. Gadre, A. Kajale, A. Joshi, and R. Joshi, “L3cube-indicsbert: A simple approach for learning cross-lingual sentence representations using multilingual bert,” in Proceedings of the 37th Pacific Asia Conference on Language, Information and Computation, 2023, pp. 154–163

work page 2023

[1] [1]

Voice of India: A Large-Scale Benchmark for Real-World Speech Recognition in India

Introduction Recent progress in Indic Automatic Speech Recognition (ASR) has been driven by shared tasks and large scale benchmarks such as MUCS [1], IndicSUPERB [2], Vistaar [3], and datasets like IndicV oices [4], which have expanded coverage across lan- guages, accents, orthographies, and code switching. However, improvements on benchmark leaderboards ...

work page internal anchor Pith review Pith/arXiv arXiv 2026

[2] [2]

Related Work Benchmarks for Indian Language ASR. Early efforts to evaluate ASR for Indian languages include the Interspeech 2018 Low Resource ASR Challenge [6], mul- tilingual speech corpora released through OpenSLR [7, 8], and the MUCS 2021 shared task [1].More recent benchmarks in- clude IndicSUPERB [2], Vistaar [3], accent focused datasets such as Svar...

work page 2018

[3] [3]

The Voice of India Benchmark 3.1. Speech Data Collection Platform and Contributor Onboarding.Speech data was col- lected through an online platform enabling large scale remote participation, where contributors across India recorded audio through a peer to peer interface. Recruitment was conducted through a large nationwide digital community platform 1 wit...

work page 2011

[4] [4]

and moderately frequent words (301-1000) were assigned weights of 20 and 5 respectively, and common words (>1000) receiving a weight of 0.5. Segment scores were computed as the mean weight of their constituent words, allowing the selection process to favor segments with richer and more diverse vocabu- lary while maintaining demographic balance. 3.2. Trans...

work page

[5] [5]

Models Evaluated We evaluate 14 ASR systems, including 11 proprietary APIs and 3 open source models

Experiment Setup 4.1. Models Evaluated We evaluate 14 ASR systems, including 11 proprietary APIs and 3 open source models. A model is evaluated for a language only if it provides explicit support through a native language tag or can be reliably conditioned through prompts indicating the target language. For dialects such as Bhojpuri and Chhattis- garhi, d...

work page 2011

[6] [6]

Results and Discussion 5.1. Evaluation of models on the Voice of India Benchmark Table 2a shows that most models exceed a WER of 20 (high- lighted in red), a threshold often associated with practical us- ability, and no system meets this criterion consistently across all languages. Even the best performing model, SARVAMAU- DIO, exceeds this threshold on B...

work page

[7] [7]

Conclusion We introduce V oice of India, a benchmark for evaluating ASR systems on real world Indian speech collected from unscripted telephonic conversations across multiple languages and regions. The benchmark incorporates multiple transcription variants and evaluates systems using orthographically informed WER to better reflect natural spelling variati...

work page

[8] [8]

MUCS 2021: Multilingual and code- switching ASR challenges for low resource indian languages,

A. Diwan, R. V . andsrivastava2018 Sanket Shah, A. S. andhe2020 Srinivasa Raghavan, S. K. andsrivastava2018 Vinit Unni, S. Vyas, A. Rajpuria, C. Yarra, A. Mittal, P. K. Ghosh, P. Jyothi, K. Bali, V . Seshadri, S. Sitaram, S. Bharadwaj, J. Nanavati, R. Nanavati, and K. Sankaranarayanan, “MUCS 2021: Multilingual and code- switching ASR challenges for low re...

work page 2021

[9] [9]

Indicsuperb: A speech processing univer- sal performance benchmark for indian languages,

T. Javed, K. S. Bhogale, A. Raman, P. Kumar, A. Kunchukuttan, and M. M. Khapra, “Indicsuperb: A speech processing univer- sal performance benchmark for indian languages,” inProceedings of the Thirty-Seventh AAAI Conference on Artificial Intelligence (AAAI 2023). AAAI Press, 2023, pp. 12 942–12 950

work page 2023

[10] [10]

K. S. Bhogale, S. Sundaresan, A. Raman, T. Javed, M. M. Khapra, and P. Kumar, “Vistaar: Diverse benchmarks and training sets for Indian language ASR,” in Proc. Interspeech 2023, 2023, pp. 4384–4388.

[11] [12]

[Online]. Available: https://arxiv.org/abs/2403.01926

[12] [13]

T. Likhomanenko, Q. Xu, V. Pratap, P. Tomasello, J. Kahn, G. Avidov, R. Collobert, and G. Synnaeve, “Rethinking evaluation in ASR: Are our models robust enough?” arXiv preprint arXiv:2010.11745, 2020.

[13] [14]

B. M. L. Srivastava, S. Sitaram, R. K. Mehta, K. D. Mohan, P. Matani, S. Satpal, K. Bali, R. Srikanth, and N. Nayak, “Interspeech 2018 low resource automatic speech recognition challenge for Indian languages,” in Proc. 6th Workshop on Spoken Language Technologies for Under-Resourced Languages (SLTU 2018), 2018, pp. 11–14.

[14] [15]

F. He, S. C. Chu, O. Kjartansson, C. Rivera, A. Katanova, A. Gutkin, I. Demirsahin, C. Johny, M. Jansche, S. Sarin, and K. Pipatsrisawat, “Open-source multi-speaker speech corpora for building Gujarati, Kannada, Malayalam, Marathi, Tamil and Telugu speech synthesis systems,” in Proceedings of the 12th Language Resources and Evaluation Conference (LREC 2020...

[15] [16]

A. Butryna, S. C. Chu, I. Demirsahin, A. Gutkin, L. Ha, F. He, M. Jansche, C. Johny, A. Katanova, O. Kjartansson et al., “Google crowdsourced speech corpora and related open-source resources for low-resource languages and dialects: An overview,” in Proceedings of the 12th Language Resources and Evaluation Conference (LREC 2020). European Language Resources ...

[16] [17]

T. Javed, S. Joshi, V. Nagarajan, S. Sundaresan, J. Nawale, A. Raman, K. S. Bhogale, P. Kumar, and M. M. Khapra, “Svarah: Evaluating English ASR systems on Indian accents,” in Proc. Interspeech 2023, 2023, pp. 5087–5091.

[17] [18]

T. Javed, J. Nawale, S. Joshi, E. I. George, K. S. Bhogale, D. Mehendale, and M. M. Khapra, “LAHAJA: A robust multi-accent benchmark for evaluating Hindi ASR systems,” in Proc. Interspeech 2024, 2024.

[18] [19]

R. Ardila, M. Branson, K. Davis, M. Henretty, M. Kohler, J. Meyer, R. Morais, L. Saunders, F. M. Tyers, and G. Weber, “Common Voice: A massively-multilingual speech corpus,” in Proceedings of the 12th Language Resources and Evaluation Conference (LREC 2020). European Language Resources Association, 2020, pp. 4218–4222. [Online]. Available: https://aclant...

[19] [20]

O. Kjartansson, S. Sarin, K. Pipatsrisawat, M. Jansche, and L. Ha, “Crowd-sourced speech corpora for Javanese, Sundanese, Sinhala, Nepali, and five African languages,” in Proc. 6th Workshop on Spoken Language Technologies for Under-Resourced Languages (SLTU 2018), 2018, pp. 52–55.

[20] [21]

A. Conneau, M. Ma, S. Khanuja, Y. Zhang, V. Axelrod, S. Dalmia, J. Riesa, C. Rivera, and A. Bapna, “FLEURS: Few-shot learning evaluation of universal representations of speech,” in Proceedings of the IEEE Spoken Language Technology Workshop (SLT 2022). IEEE, 2022, pp. 798–805.

[21] [22]

WAXAL Consortium, “WAXAL: A large-scale speech dataset for Sub-Saharan African languages,” 2025, preprint / Technical Report. Entry requires verification: full author list and venue not confirmed from available sources.

[22] [23]

A. M. Ali, W. Magdy, P. Bell, and S. Renals, “Multi-reference WER for evaluating ASR for languages with no orthographic rules,” in 2015 IEEE Workshop on Automatic Speech Recognition and Understanding, ASRU 2015, Scottsdale, AZ, USA, December 13-17, 2015. IEEE, 2015, pp. 576–580. [Online]. Available: https://doi.org/10.1109/ASRU.2015.7404847

[23] [24]

A. Ali, S. Khalifa, and N. Habash, “Towards variability resistant dialectal speech evaluation,” in 20th Annual Conference of the International Speech Communication Association, Interspeech 2019, Graz, Austria, September 15-19, 2019, G. Kubin and Z. Kacic, Eds. ISCA, 2019, pp. 336–340. [Online]. Available: https://doi.org/10.21437/Interspeech.2019-2692

[24] [25]

S. Karita, R. Sproat, and H. Ishikawa, “Lenient evaluation of Japanese speech recognition: Modeling naturally occurring spelling inconsistency,” CoRR, vol. abs/2306.04530, 2023. [Online]. Available: https://doi.org/10.48550/arXiv.2306.04530

[25] [26]

Q. McNamara, M. Á. del Río Fernández, N. Bhandari, M. Ratajczak, D. Chen, C. Miller, and M. Jetté, “Style-agnostic evaluation of ASR using multiple reference transcripts,” CoRR, vol. abs/2412.07937, 2024. [Online]. Available: https://doi.org/10.48550/arXiv.2412.07937

[26] [27]

R. Rosenfeld and P. Clarkson, “CMU-Cambridge statistical language modeling toolkit v2,” 1997.

[27] [28]

A. Ali, P. Nakov, P. Bell, and S. Renals, “WERd: Using social text spelling variants for evaluating dialectal speech recognition,” in 2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU). IEEE, 2017, pp. 141–148.

[28] [29]

A. Radford, J. W. Kim, T. Xu, G. Brockman, C. McLeavey, and I. Sutskever, “Robust speech recognition via large-scale weak supervision,” in International Conference on Machine Learning, ICML 2023, 23-29 July 2023, Honolulu, Hawaii, USA, ser. Proceedings of Machine Learning Research, A. Krause, E. Brunskill, K. Cho, B. Engelhardt, S. Sabato, and J. Scarlett,...

[29] [30]

K. Manohar and L. G. Pillai, “What is lost in normalization? Exploring pitfalls in multilingual ASR model evaluations,” CoRR, vol. abs/2409.02449, 2024. [Online]. Available: https://doi.org/10.48550/arXiv.2409.02449

[30] [31]

B. M. L. Srivastava and S. Sitaram, “Homophone identification and merging for code-switched speech recognition,” in Interspeech 2018, 2018, pp. 1943–1947.

[31] [32]

S. Guha, R. Ambavat, A. Gupta, M. Gupta, and R. Mehta, “Unsupervised language agnostic WER standardization,” arXiv preprint arXiv:2303.05046, 2023.

[32] [33]

K. Manohar, A. R. Jayan, and R. Rajan, “Improving speech recognition systems for the morphologically complex Malayalam language using subword tokens for language modeling,” EURASIP J. Audio Speech Music. Process., vol. 2023, no. 1, p. 47, 2023. [Online]. Available: https://doi.org/10.1186/s13636-023-00313-7

[33] [34]

T. D. K, J. James, D. P. Gopinath, and M. A. K, “Advocating character error rate for multilingual ASR evaluation,” CoRR, vol. abs/2410.07400, 2024. [Online]. Available: https://doi.org/10.48550/arXiv.2410.07400

[34] [35]

V. Pratap, A. Tjandra, B. Shi, P. Tomasello, A. Babu, S. Kundu, A. Elkahky, Z. Ni, A. Vyas, M. Fazel-Zarandi et al., “Scaling speech technology to 1,000+ languages,” Journal of Machine Learning Research, vol. 25, no. 97, pp. 1–52, 2024.

[35] [36]

M. Ravanelli, T. Parcollet, P. Plantinga, A. Rouhe, S. Cornell, L. Lugosch, C. Subakan, N. Dawalatabad, A. Heba, J. Zhong, J.-C. Chou, S.-L. Yeh, S.-W. Fu, C.-F. Liao, E. Rastorgueva, F. Grondin, W. Aris, H. Na, Y. Gao, R. D. Mori, and Y. Bengio, “SpeechBrain: A general-purpose speech toolkit,” 2021, arXiv:2106.04624.

[36] [37]

C. K. Reddy, V. Gopal, and R. Cutler, “DNSMOS: A non-intrusive perceptual objective speech quality metric to evaluate noise suppressors,” in ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2021, pp. 6493–6497.

[37] [38]

C. Chandramouli and R. General, “Census of India 2011,” Provisional Population Totals. New Delhi: Government of India, pp. 409–413, 2011.

[38] [39]

AI4Bharat, “IndicConformer,” 2024, accessed: 2025-02-19.

[39] [40]

K. S. Bhogale, T. Javed, G. S. John, D. Rathi, A. Padmanaban, N. Parasa, and M. M. Khapra, “Towards orthographically-informed evaluation of speech recognition systems for Indian languages,” arXiv preprint arXiv:2603.00941, 2026.

[40] [41]

S. Deode, J. Gadre, A. Kajale, A. Joshi, and R. Joshi, “L3Cube-IndicSBERT: A simple approach for learning cross-lingual sentence representations using multilingual BERT,” in Proceedings of the 37th Pacific Asia Conference on Language, Information and Computation, 2023, pp. 154–163.