Voice of India: A Large-Scale Benchmark for Real-World Speech Recognition in India
Pith reviewed 2026-05-10 02:32 UTC · model grok-4.3
The pith
A benchmark of unscripted phone conversations reveals gaps in current speech recognition for Indian languages.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that a large-scale benchmark built from unscripted telephonic conversations in 15 major Indian languages provides a more representative test for automatic speech recognition systems than existing scripted datasets. With 306,230 utterances from 36,691 speakers totaling 536 hours, and transcripts that account for spelling variations, it exposes geographic disparities in performance and highlights challenges related to audio quality, speaking rate, gender, and device type.
What carries the argument
The Voice of India benchmark dataset, derived from real telephonic conversations with manually transcribed utterances that permit spelling variants to reflect natural language use.
If this is right
- ASR systems exhibit varying performance across different Indian districts, indicating regional biases in current models.
- Performance degrades under conditions of poor audio quality, fast speaking rates, or certain device types.
- Accounting for spelling variations in evaluation leads to fairer assessment of code-mixed speech.
- Insights from the dataset can guide targeted improvements in real-world Indic ASR applications.
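The point about spelling variants can be made concrete with a small sketch. It is an illustration only, not the paper's metric: the benchmark's actual scoring procedure is not described here, so the function name `variant_aware_wer`, the slot-of-variants reference format, and the romanized example words are all assumptions.

```python
from typing import List, Set


def variant_aware_wer(reference: List[Set[str]], hypothesis: List[str]) -> float:
    """Word error rate where each reference position accepts several spellings.

    `reference` is a list of slots; each slot is the set of spellings that
    count as correct for that word (an assumed format, for illustration).
    """
    n, m = len(reference), len(hypothesis)
    if n == 0:
        raise ValueError("reference must contain at least one word")
    # prev[j] = edit distance between the reference prefix seen so far
    # and the first j hypothesis words.
    prev = list(range(m + 1))
    for i in range(1, n + 1):
        curr = [i] + [0] * m
        for j in range(1, m + 1):
            match = hypothesis[j - 1] in reference[i - 1]
            curr[j] = min(
                prev[j] + 1,                        # deletion
                curr[j - 1] + 1,                    # insertion
                prev[j - 1] + (0 if match else 1),  # match or substitution
            )
        prev = curr
    return prev[m] / n


# Hypothetical code-mixed utterance: "phone" may also be spelled "fone".
hyp = ["mera", "fone", "kharab", "hai"]
lenient = [{"mera"}, {"phone", "fone"}, {"kharab"}, {"hai"}]
strict = [{"mera"}, {"phone"}, {"kharab"}, {"hai"}]
print(variant_aware_wer(lenient, hyp))  # 0.0
print(variant_aware_wer(strict, hyp))   # 0.25
```

With a single strict reference the alternate spelling is counted as a substitution, while the variant-aware reference treats it as correct. That difference is the effect the third bullet describes.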
Where Pith is reading between the lines
- Developers of speech systems for multilingual countries could use similar unscripted collection methods to create more practical benchmarks.
- The geographic analysis suggests that ASR performance may correlate with socioeconomic factors in different regions, warranting further study.
- Extending this approach to other languages with high dialectal variation could improve global ASR equity.
Load-bearing premise
That collecting data from unscripted telephonic conversations and creating transcripts with spelling variants produces a benchmark that is less biased and more reflective of real-world speech than scripted alternatives.
What would settle it
Demonstrating that state-of-the-art ASR models achieve word error rates on this unscripted benchmark comparable to those on existing scripted ones, without specific adaptations for spelling variations or regional accents, would undermine the claim that it is a more representative test.
Original abstract
Existing Indic ASR benchmarks often use scripted, clean speech and leaderboard driven evaluation that encourages dataset specific overfitting. In addition, strict single reference WER penalizes natural spelling variation in Indian languages, including non standardized spellings of code-mixed English origin words. To address these limitations, we introduce Voice of India, a closed source benchmark built from unscripted telephonic conversations covering 15 major Indian languages across 139 regional clusters. The dataset contains 306230 utterances, totaling 536 hours of speech from 36691 speakers with transcripts accounting for spelling variations. We also analyze performance geographically at the district level, revealing disparities. Finally, we provide detailed analysis across factors such as audio quality, speaking rate, gender, and device type, highlighting where current ASR systems struggle and offering insights for improving real world Indic ASR systems.
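The district-level and per-factor breakdowns mentioned in the abstract amount to grouping utterances and computing an error rate per group. The sketch below shows one common way to do that, pooling word errors over each group. It is an assumption that pooled WER is used (the paper may instead average per-utterance WER), and the function name, row layout, and district labels are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def pooled_wer_by_group(rows: Iterable[Tuple[str, int, int]]) -> Dict[str, float]:
    """Pooled WER per group.

    Each row is (group label, word errors, reference word count) for one
    utterance. Errors and reference words are summed within a group before
    dividing, so long utterances weigh more than short ones.
    """
    errors: Dict[str, int] = defaultdict(int)
    words: Dict[str, int] = defaultdict(int)
    for group, n_err, n_ref in rows:
        errors[group] += n_err
        words[group] += n_ref
    # Skip groups with no reference words to avoid division by zero.
    return {g: errors[g] / words[g] for g in words if words[g] > 0}


# Hypothetical per-utterance counts for two districts.
rows = [
    ("district_A", 2, 10),
    ("district_A", 1, 10),
    ("district_B", 6, 12),
]
by_district = pooled_wer_by_group(rows)
print(by_district)
```

The same function works for any grouping key (device type, gender, a speaking-rate bucket) by changing the label in the first field of each row.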
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Voice of India, a closed-source benchmark of 306,230 utterances (536 hours) drawn from unscripted telephonic conversations in 15 major Indian languages across 139 regional clusters, involving 36,691 speakers. Transcripts accommodate spelling variations. The authors report geographic performance disparities at the district level and factor analyses on audio quality, speaking rate, gender, and device type to identify challenges for existing ASR systems.
Significance. If the dataset construction, transcript quality, and analyses prove sound upon verification, the work could usefully document real-world Indic ASR difficulties beyond scripted benchmarks. The scale and multi-factor breakdown offer potential guidance for system improvements in under-resourced settings. The closed-source status, however, sharply curtails community adoption and independent testing of the claimed advantages.
major comments (3)
- [Abstract] Abstract: the central claims that the dataset 'reveals disparities' and 'highlight[s] where current ASR systems struggle' are unsupported by any quantitative WER numbers, baseline comparisons, or error rates. Without these, the asserted superiority over scripted benchmarks cannot be evaluated.
- [Abstract and §3] Dataset release statement (Abstract and §3): declaring the benchmark closed-source prevents reproduction of the district-level disparity results and the audio-quality/speaking-rate/gender/device breakdowns. This directly undermines the paper's stated purpose of supplying a usable real-world benchmark for the Indic ASR community.
- [§4–5] Analysis sections (§4–5): the claim that unscripted telephonic data plus spelling-variant transcripts yield a meaningfully less biased representation requires explicit validation, such as side-by-side WER evaluation of the same models on Voice of India versus existing scripted Indic corpora.
minor comments (2)
- [Abstract] Abstract: format the utterance count as 306,230 for standard readability.
- [Introduction] Introduction: specify the exact criteria used to define the 139 regional clusters and how speaker demographics were sampled to ensure geographic coverage.
Simulated Author's Rebuttal
We thank the referee for the detailed review and constructive comments on our manuscript. We address each major comment below, clarifying our approach and indicating planned revisions where appropriate.
Point-by-point responses
- Referee: [Abstract] Abstract: the central claims that the dataset 'reveals disparities' and 'highlight[s] where current ASR systems struggle' are unsupported by any quantitative WER numbers, baseline comparisons, or error rates. Without these, the asserted superiority over scripted benchmarks cannot be evaluated.
Authors: The abstract condenses findings from the district-level geographic analysis and the multi-factor breakdowns in §§4–5. These sections quantify performance variations across audio quality, speaking rate, gender, device type, and regional clusters, which directly support the claims of disparities and system struggles. While the abstract itself avoids lengthy numerical tables for brevity, the underlying analyses contain the supporting quantitative breakdowns. In revision we will add a concise sentence to the abstract referencing key aggregate statistics (e.g., WER ranges and factor-specific deltas) drawn from §§4–5.
Revision: partial
- Referee: [Abstract and §3] Dataset release statement (Abstract and §3): declaring the benchmark closed-source prevents reproduction of the district-level disparity results and the audio-quality/speaking-rate/gender/device breakdowns. This directly undermines the paper's stated purpose of supplying a usable real-world benchmark for the Indic ASR community.
Authors: We recognize that closed-source status precludes independent reproduction. The decision stems from privacy and consent constraints inherent to real, unscripted telephonic conversations involving 36,691 speakers. The manuscript supplies the full collection protocol, transcription guidelines (including spelling-variant handling), sampling strategy across 139 clusters, and all statistical results from the factor analyses. These elements allow the community to understand the observed challenges and to design mitigation strategies even without direct data access. We therefore retain the closed-source designation.
Revision: no
- Referee: [§4–5] Analysis sections (§4–5): the claim that unscripted telephonic data plus spelling-variant transcripts yield a meaningfully less biased representation requires explicit validation, such as side-by-side WER evaluation of the same models on Voice of India versus existing scripted Indic corpora.
Authors: Sections 4 and 5 demonstrate that the combination of unscripted speech and variant-aware transcripts surfaces realistic error patterns (e.g., higher WER on code-mixed terms and dialectal variants) that scripted benchmarks typically mask. While we do not include head-to-head WER tables against every existing corpus, the factor analyses isolate the contribution of each variable and show elevated difficulty relative to the clean, scripted conditions described in prior work. In revision we will expand the discussion to cite quantitative comparisons reported in the literature for the same model families and to articulate why direct re-evaluation on Voice of India is not feasible under the current release policy.
Revision: partial
- Independent reproduction of the district-level disparity results and factor analyses is not possible because the benchmark remains closed-source for privacy reasons.
Circularity Check
No circularity: empirical benchmark paper with no derivations or self-referential reductions
The paper introduces Voice of India as a closed-source dataset of unscripted telephonic speech across 15 Indic languages, with accompanying geographic and factor analyses. No mathematical derivations, model predictions, fitted parameters, or uniqueness theorems are claimed. The central contribution is data collection and empirical reporting; all performance observations are presented as direct measurements on the collected utterances rather than outputs derived from prior self-citations or internal definitions. No load-bearing steps reduce by construction to the paper's own inputs, satisfying the default expectation of no significant circularity for a purely empirical benchmark effort.
Reference graph
Works this paper leans on
- [1] Voice of India: A Large-Scale Benchmark for Real-World Speech Recognition in India
Introduction. Recent progress in Indic Automatic Speech Recognition (ASR) has been driven by shared tasks and large scale benchmarks such as MUCS [1], IndicSUPERB [2], Vistaar [3], and datasets like IndicVoices [4], which have expanded coverage across languages, accents, orthographies, and code switching. However, improvements on benchmark leaderboards ...
- [2] Related Work. Benchmarks for Indian Language ASR. Early efforts to evaluate ASR for Indian languages include the Interspeech 2018 Low Resource ASR Challenge [6], multilingual speech corpora released through OpenSLR [7, 8], and the MUCS 2021 shared task [1]. More recent benchmarks include IndicSUPERB [2], Vistaar [3], accent focused datasets such as Svar...
- [3] The Voice of India Benchmark. 3.1. Speech Data Collection. Platform and Contributor Onboarding. Speech data was collected through an online platform enabling large scale remote participation, where contributors across India recorded audio through a peer to peer interface. Recruitment was conducted through a large nationwide digital community platform 1 wit...
- [4] ... and moderately frequent words (301-1000) were assigned weights of 20 and 5 respectively, and common words (>1000) receiving a weight of 0.5. Segment scores were computed as the mean weight of their constituent words, allowing the selection process to favor segments with richer and more diverse vocabulary while maintaining demographic balance. 3.2. Trans...
- [5] Experiment Setup. 4.1. Models Evaluated. We evaluate 14 ASR systems, including 11 proprietary APIs and 3 open source models. A model is evaluated for a language only if it provides explicit support through a native language tag or can be reliably conditioned through prompts indicating the target language. For dialects such as Bhojpuri and Chhattisgarhi, d...
- [6] Results and Discussion. 5.1. Evaluation of models on the Voice of India Benchmark. Table 2a shows that most models exceed a WER of 20 (highlighted in red), a threshold often associated with practical usability, and no system meets this criterion consistently across all languages. Even the best performing model, SARVAMAUDIO, exceeds this threshold on B...
- [7] Conclusion. We introduce Voice of India, a benchmark for evaluating ASR systems on real world Indian speech collected from unscripted telephonic conversations across multiple languages and regions. The benchmark incorporates multiple transcription variants and evaluates systems using orthographically informed WER to better reflect natural spelling variati...
- [8] A. Diwan, R. Vaideeswaran, S. Shah, A. Singh, S. Raghavan, S. Khare, V. Unni, S. Vyas, A. Rajpuria, C. Yarra, A. Mittal, P. K. Ghosh, P. Jyothi, K. Bali, V. Seshadri, S. Sitaram, S. Bharadwaj, J. Nanavati, R. Nanavati, and K. Sankaranarayanan, “MUCS 2021: Multilingual and code-switching ASR challenges for low resource Indian languages,” ...
- [9] T. Javed, K. S. Bhogale, A. Raman, P. Kumar, A. Kunchukuttan, and M. M. Khapra, “IndicSUPERB: A speech processing universal performance benchmark for Indian languages,” in Proceedings of the Thirty-Seventh AAAI Conference on Artificial Intelligence (AAAI 2023). AAAI Press, 2023, pp. 12942–12950.
- [10] K. S. Bhogale, S. Sundaresan, A. Raman, T. Javed, M. M. Khapra, and P. Kumar, “Vistaar: Diverse benchmarks and training sets for Indian language ASR,” in Proc. Interspeech 2023, 2023, pp. 4384–4388.
- [12] [Online]. Available: https://arxiv.org/abs/2403.01926
- [13] T. Likhomanenko, Q. Xu, V. Pratap, P. Tomasello, J. Kahn, G. Avidov, R. Collobert, and G. Synnaeve, “Rethinking evaluation in ASR: Are our models robust enough?” arXiv preprint arXiv:2010.11745, 2020.
- [14] B. M. L. Srivastava, S. Sitaram, R. K. Mehta, K. D. Mohan, P. Matani, S. Satpal, K. Bali, R. Srikanth, and N. Nayak, “Interspeech 2018 low resource automatic speech recognition challenge for Indian languages,” in Proc. 6th Workshop on Spoken Language Technologies for Under-Resourced Languages (SLTU 2018), 2018, pp. 11–14.
- [15] F. He, S. C. Chu, O. Kjartansson, C. Rivera, A. Katanova, A. Gutkin, I. Demirsahin, C. Johny, M. Jansche, S. Sarin, and K. Pipatsrisawat, “Open-source multi-speaker speech corpora for building Gujarati, Kannada, Malayalam, Marathi, Tamil and Telugu speech synthesis systems,” in Proceedings of the 12th Language Resources and Evaluation Conference (LREC 2020...
- [16] A. Butryna, S. C. Chu, I. Demirsahin, A. Gutkin, L. Ha, F. He, M. Jansche, C. Johny, A. Katanova, O. Kjartansson et al., “Google crowdsourced speech corpora and related open-source resources for low-resource languages and dialects: An overview,” in Proceedings of the 12th Language Resources and Evaluation Conference (LREC 2020). European Language Resources ...
- [17] T. Javed, S. Joshi, V. Nagarajan, S. Sundaresan, J. Nawale, A. Raman, K. S. Bhogale, P. Kumar, and M. M. Khapra, “Svarah: Evaluating English ASR systems on Indian accents,” in Proc. Interspeech 2023, 2023, pp. 5087–5091.
- [18] T. Javed, J. Nawale, S. Joshi, E. I. George, K. S. Bhogale, D. Mehendale, and M. M. Khapra, “LAHAJA: A robust multi-accent benchmark for evaluating Hindi ASR systems,” in Proc. Interspeech 2024, 2024.
- [19] R. Ardila, M. Branson, K. Davis, M. Henretty, M. Kohler, J. Meyer, R. Morais, L. Saunders, F. M. Tyers, and G. Weber, “Common Voice: A massively-multilingual speech corpus,” in Proceedings of the 12th Language Resources and Evaluation Conference (LREC 2020). European Language Resources Association, 2020, pp. 4218–4222. [Online]. Available: https://aclant...
- [20] O. Kjartansson, S. Sarin, K. Pipatsrisawat, M. Jansche, and L. Ha, “Crowd-sourced speech corpora for Javanese, Sundanese, Sinhala, Nepali, and five African languages,” in Proc. 6th Workshop on Spoken Language Technologies for Under-Resourced Languages (SLTU 2018), 2018, pp. 52–55.
- [21] A. Conneau, M. Ma, S. Khanuja, Y. Zhang, V. Axelrod, S. Dalmia, J. Riesa, C. Rivera, and A. Bapna, “FLEURS: Few-shot learning evaluation of universal representations of speech,” in Proceedings of the IEEE Spoken Language Technology Workshop (SLT 2022). IEEE, 2022, pp. 798–805.
- [22] WAXAL Consortium, “WAXAL: A large-scale speech dataset for sub-Saharan African languages,” 2025, preprint / Technical Report. Entry requires verification: full author list and venue not confirmed from available sources.
- [23] A. M. Ali, W. Magdy, P. Bell, and S. Renals, “Multi-reference WER for evaluating ASR for languages with no orthographic rules,” in 2015 IEEE Workshop on Automatic Speech Recognition and Understanding, ASRU 2015, Scottsdale, AZ, USA, December 13-17, 2015. IEEE, 2015, pp. 576–580. [Online]. Available: https://doi.org/10.1109/ASRU.2015.7404847
- [24] A. Ali, S. Khalifa, and N. Habash, “Towards variability resistant dialectal speech evaluation,” in 20th Annual Conference of the International Speech Communication Association, Interspeech 2019, Graz, Austria, September 15-19, 2019, G. Kubin and Z. Kacic, Eds. ISCA, 2019, pp. 336–340. [Online]. Available: https://doi.org/10.21437/Interspeech.2019-2692
- [25] S. Karita, R. Sproat, and H. Ishikawa, “Lenient evaluation of Japanese speech recognition: Modeling naturally occurring spelling inconsistency,” CoRR, vol. abs/2306.04530, 2023. [Online]. Available: https://doi.org/10.48550/arXiv.2306.04530
- [26] Q. McNamara, M. Á. del Río Fernández, N. Bhandari, M. Ratajczak, D. Chen, C. Miller, and M. Jetté, “Style-agnostic evaluation of ASR using multiple reference transcripts,” CoRR, vol. abs/2412.07937, 2024. [Online]. Available: https://doi.org/10.48550/arXiv.2412.07937
- [27] R. Rosenfeld and P. Clarkson, “CMU-Cambridge statistical language modeling toolkit v2,” 1997.
- [28] A. Ali, P. Nakov, P. Bell, and S. Renals, “WERd: Using social text spelling variants for evaluating dialectal speech recognition,” in 2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU). IEEE, 2017, pp. 141–148.
- [29] A. Radford, J. W. Kim, T. Xu, G. Brockman, C. McLeavey, and I. Sutskever, “Robust speech recognition via large-scale weak supervision,” in International Conference on Machine Learning, ICML 2023, 23-29 July 2023, Honolulu, Hawaii, USA, ser. Proceedings of Machine Learning Research, A. Krause, E. Brunskill, K. Cho, B. Engelhardt, S. Sabato, and J. Scarlett,...
- [30] K. Manohar and L. G. Pillai, “What is lost in normalization? Exploring pitfalls in multilingual ASR model evaluations,” CoRR, vol. abs/2409.02449, 2024. [Online]. Available: https://doi.org/10.48550/arXiv.2409.02449
- [31] B. M. L. Srivastava and S. Sitaram, “Homophone identification and merging for code-switched speech recognition,” in Interspeech 2018, 2018, pp. 1943–1947.
- [32] S. Guha, R. Ambavat, A. Gupta, M. Gupta, and R. Mehta, “Unsupervised language agnostic WER standardization,” arXiv preprint arXiv:2303.05046, 2023.
- [33] K. Manohar, A. R. Jayan, and R. Rajan, “Improving speech recognition systems for the morphologically complex Malayalam language using subword tokens for language modeling,” EURASIP J. Audio Speech Music. Process., vol. 2023, no. 1, p. 47, 2023. [Online]. Available: https://doi.org/10.1186/s13636-023-00313-7
- [34] T. D. K, J. James, D. P. Gopinath, and M. A. K, “Advocating character error rate for multilingual ASR evaluation,” CoRR, vol. abs/2410.07400, 2024. [Online]. Available: https://doi.org/10.48550/arXiv.2410.07400
- [35] V. Pratap, A. Tjandra, B. Shi, P. Tomasello, A. Babu, S. Kundu, A. Elkahky, Z. Ni, A. Vyas, M. Fazel-Zarandi et al., “Scaling speech technology to 1,000+ languages,” Journal of Machine Learning Research, vol. 25, no. 97, pp. 1–52, 2024.
- [36] M. Ravanelli, T. Parcollet, P. Plantinga, A. Rouhe, S. Cornell, L. Lugosch, C. Subakan, N. Dawalatabad, A. Heba, J. Zhong, J.-C. Chou, S.-L. Yeh, S.-W. Fu, C.-F. Liao, E. Rastorgueva, F. Grondin, W. Aris, H. Na, Y. Gao, R. D. Mori, and Y. Bengio, “SpeechBrain: A general-purpose speech toolkit,” 2021, arXiv:2106.04624.
- [37] C. K. Reddy, V. Gopal, and R. Cutler, “DNSMOS: A non-intrusive perceptual objective speech quality metric to evaluate noise suppressors,” in ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2021, pp. 6493–6497.
- [38] C. Chandramouli and R. General, “Census of India 2011,” Provisional Population Totals. New Delhi: Government of India, pp. 409–413, 2011.
- [39]
- [40] K. S. Bhogale, T. Javed, G. S. John, D. Rathi, A. Padmanaban, N. Parasa, and M. M. Khapra, “Towards orthographically-informed evaluation of speech recognition systems for Indian languages,” arXiv preprint arXiv:2603.00941, 2026.
- [41] S. Deode, J. Gadre, A. Kajale, A. Joshi, and R. Joshi, “L3Cube-IndicSBERT: A simple approach for learning cross-lingual sentence representations using multilingual BERT,” in Proceedings of the 37th Pacific Asia Conference on Language, Information and Computation, 2023, pp. 154–163.