{"paper":{"title":"Voice of India: A Large-Scale Benchmark for Real-World Speech Recognition in India","license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","headline":"A benchmark of unscripted phone conversations reveals gaps in current speech recognition for Indian languages.","cross_cats":["cs.SD","eess.AS"],"primary_cat":"cs.CL","authors_text":"Aaditya Pareek, Amritansh Walecha, Bhaskar Singh, Hanuman Sidh, Kaushal Bhogale, Mahima Manik, Manas Dhir, Manmeet Kaur, Mitesh M. Khapra, Sagar Jain, Shobhit Banga, Tahir Javed, Utkarsh Singh, Vanshika Chhabra","submitted_at":"2026-04-21T07:02:01Z","abstract_excerpt":"Existing Indic ASR benchmarks often use scripted, clean speech and leaderboard driven evaluation that encourages dataset specific overfitting. In addition, strict single reference WER penalizes natural spelling variation in Indian languages, including non standardized spellings of code-mixed English origin words. To address these limitations, we introduce Voice of India, a closed source benchmark built from unscripted telephonic conversations covering 15 major Indian languages across 139 regional clusters. The dataset contains 306230 utterances, totaling 536 hours of speech from 36691 speakers"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"We introduce Voice of India, a closed source benchmark built from unscripted telephonic conversations covering 15 major Indian languages across 139 regional clusters. The dataset contains 306230 utterances, totaling 536 hours of speech from 36691 speakers with transcripts accounting for spelling variations.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"That the unscripted telephonic conversations and manually created transcripts with spelling variants provide a meaningfully superior and unbiased representation of real-world Indic speech compared to existing scripted benchmarks.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"Voice of India is a new 536-hour benchmark of real telephonic conversations in 15 Indian languages with variant-aware transcripts for more realistic ASR evaluation.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"A benchmark of unscripted phone conversations reveals gaps in current speech recognition for Indian languages.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"b4a77be5fa9bed88924f0e71ccf66ba2544d5acdc914654345aeff4b15ba42d0"},"source":{"id":"2604.19151","kind":"arxiv","version":2},"verdict":{"id":"7568e54c-4235-4ee9-b54f-8e0607eced57","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-10T02:32:46.543796Z","strongest_claim":"We introduce Voice of India, a closed source benchmark built from unscripted telephonic conversations covering 15 major Indian languages across 139 regional clusters. The dataset contains 306230 utterances, totaling 536 hours of speech from 36691 speakers with transcripts accounting for spelling variations.","one_line_summary":"Voice of India is a new 536-hour benchmark of real telephonic conversations in 15 Indian languages with variant-aware transcripts for more realistic ASR evaluation.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"That the unscripted telephonic conversations and manually created transcripts with spelling variants provide a meaningfully superior and unbiased representation of real-world Indic speech compared to existing scripted benchmarks.","pith_extraction_headline":"A benchmark of unscripted phone conversations reveals gaps in current speech recognition for Indian languages."},"integrity":{"clean":true,"summary":{"advisory":0,"critical":0,"by_detector":{},"informational":0},"endpoint":"/pith/2604.19151/integrity.json","findings":[],"available":true,"detectors_run":[{"name":"doi_compliance","ran_at":"2026-05-20T03:16:29.301803Z","status":"completed","version":"1.0.0","findings_count":0}],"snapshot_sha256":"855d986a9d6ad55c7ced48a61b8f056300a8822cfecae408266ded0a54e17ebb"},"references":{"count":0,"sample":[],"resolved_work":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57","internal_anchors":0},"formal_canon":{"evidence_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"}