VAANI: Capturing the language landscape for an inclusive digital India

Abhayjeet Singh; Agneedh Basu; Amrita Kamat; Dinesh Tewari; Harsh Dhand; Nihar Desai; Partha Talukdar; Pavan Kumar J; Pranav D Bhat; Prasanta Kumar Ghosh

arxiv: 2603.28714 · v3 · pith:3ZNL2RXCnew · submitted 2026-03-30 · 📡 eess.AS

Sujith Pulikodan, Abhayjeet Singh, Agneedh Basu, Nihar Desai, Pavan Kumar J, Pranav D Bhat, Raghu Dharmaraju, Ritika Gupta, Sathvik Udupa, Saurabh Kumar, Sumit Sharma, Visruth Sanka, Dinesh Tewari, Harsh Dhand, Amrita Kamat, Sukhwinder Singh, Shikhar Vashishth, Partha Talukdar, Raj Acharya, Prasanta Kumar Ghosh

classification 📡 eess.AS

keywords: languages, speech, dataset, vaani, across, audio, digital, hours

Voice-based technologies have the potential to bridge digital accessibility gaps; however, existing datasets fail to capture the linguistic and regional diversity of Indic languages. We present Project VAANI, a large-scale multimodal dataset designed to represent India's linguistic landscape across 165 districts. Speech data is collected using image-based prompts to elicit spontaneous responses, while images are curated through a separate pipeline covering diverse themes across regions. The dataset undergoes a rigorous multi-stage quality control process that combines automated and manual evaluation to ensure high audio quality and transcription accuracy. We release approximately 289K images, 31,255 hours of speech, and 2,043 hours of transcribed audio spanning 105 languages from 28 states and 3 union territories. Many of these languages are represented at this scale for the first time, making VAANI a foundational resource for inclusive speech technology. The dataset enables the development of robust, multilingual, and multimodal models, and supports research in speech recognition, language understanding, and cross-modal learning for underrepresented languages.

Forward citations

Cited by 7 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Vaani Benchmark V1.0: An Inclusive Multimodal Benchmark Dataset for Hindi
eess.AS 2026-06 unverdicted novelty 6.0

Vaani Benchmark V1.0 is a multimodal Hindi ASR dataset from 104 districts featuring spontaneous speech recordings in real-world conditions and three independent transcriptions per segment for robust multi-reference ev...

Audio-Image Alignment as a Continued-Pretraining Stage Improves Low-Resource ASR
eess.AS 2026-06 unverdicted novelty 5.0

Audio-image representation alignment as a continued-pretraining stage improves low-resource ASR performance without requiring transcription data.

A Comparative Study of Pre-trained Speech Encoders and Training Objectives for Large-Scale Indic Spoken Language Identification
eess.AS 2026-06 unverdicted novelty 5.0

Frozen FastConformer with hierarchical softmax achieves over 90% macro accuracy on out-of-domain Indic LID benchmarks for 42 languages and outperforms Whisper and other objectives in cross-corpus settings.

Factors affecting ASR performance: A study using state of the art ASR models in Indic Languages
eess.AS 2026-06 unverdicted novelty 4.0

Empirical analysis of speaker and acoustic factors correlated with ASR word error rates across five Indic languages using zero-shot evaluation on multiple open-source models.

Analyzing Language and Geographical Variation in Speech Representations Across 60 Indic Languages
eess.AS 2026-06 unverdicted novelty 3.0

Joint language-district supervision on speech encoders for 60 Indic languages produces embeddings with global language clusters containing district-aligned subclusters, improving geographical separability while preser...

An Analysis of the Effectiveness of Synthetic Speech Data for ASR Fine-tuning in Selected Indic Languages
eess.AS 2026-06 unverdicted novelty 3.0

Empirical study measuring ASR performance gains from synthetic speech augmentation in three Indic languages, varying script sources, synthesis models, and cloned voice counts.

A study on the impact of region specific data on the performance of Indic ASR
eess.AS 2026-06 unverdicted novelty 3.0

Empirical study finds consistent positive correlation between inter-district geographic distance and ASR word error rate when models are finetuned on single-district Indic speech data.