pith. sign in

arxiv: 2605.13087 · v1 · pith:3OXKAQ6Vnew · submitted 2026-05-13 · 💻 cs.CL · cs.AI

Vividh-ASR: A Complexity-Tiered Benchmark and Optimization Dynamics for Robust Indic Speech Recognition

Pith reviewed 2026-05-14 19:34 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords Indic ASRWhisper fine-tuningstudio biascurriculum learningreverse multi-stage fine-tuningspeech benchmarkparameter efficiencydecoder adaptation
0
0 comments X

The pith

Reverse multi-stage fine-tuning lets a 244M Whisper model match or exceed 769M counterparts on a tiered Indic speech benchmark.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper diagnoses studio-bias in Whisper fine-tuning for Hindi and Malayalam, where gains on read speech come at the cost of spontaneous audio performance. It introduces the Vividh-ASR benchmark with four explicit complexity tiers and runs controlled experiments on learning-rate timing and data ordering. Early large parameter updates deliver a 12-point absolute WER reduction, while a hard-to-easy curriculum adds further spontaneous-speech gains. These dynamics motivate reverse multi-stage fine-tuning, a schedule that concentrates adaptation in the decoder and preserves the pre-trained encoder geometry. The result is that a compact 244M model reaches or surpasses the accuracy of much larger conventionally tuned models.

Core claim

Reverse multi-stage fine-tuning applies large early updates followed by a hard-to-easy curriculum, enabling a 244M Whisper model to match or exceed conventionally fine-tuned 769M models on the Vividh-ASR benchmark while concentrating adaptation in the decoder and leaving the encoder's acoustic geometry largely intact.

What carries the argument

Reverse multi-stage fine-tuning (R-MFT), a training schedule that sequences large early parameter updates with hard-to-easy data ordering to focus adaptation on the decoder.

If this is right

  • Early large parameter updates alone reduce global word error rate by 12 absolute points.
  • Hard-to-easy curriculum ordering yields extra gains specifically on spontaneous speech.
  • The 244M model reaches parity with 769M models without any increase in parameter count.
  • Representational probes show adaptation concentrates in the decoder while the encoder remains stable.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same schedule could lower inference costs for ASR deployment in low-resource settings.
  • Similar tiered benchmarks might expose comparable biases in other multilingual speech models.
  • Targeted decoder-only adaptation may generalize to other sequence-to-sequence speech tasks.
  • Extending the tiers to additional Indic languages would test whether the ordering effect is language-independent.

Load-bearing premise

The four complexity tiers adequately represent real-world Indic usage distributions and the observed schedule benefits will hold across other languages, models, and conditions.

What would settle it

Run the R-MFT 244M model on an independent collection of spontaneous conversational Indic recordings and check whether its word error rate stays within a few points of the 769M baseline.

Figures

Figures reproduced from arXiv: 2605.13087 by Aditya Srinivas Menon, Arghya Bhattacharya, Kavya Manohar, Kumarmanas Nethil, Kush Juvekar.

Figure 1
Figure 1. Figure 1: Comparison of training curricula. Both use identical decreasing LR schedules. R-MFT (right) places spontaneous data in the high-LR phase. data, model architecture, and optimizer configuration constant. Our findings challenge the status quo: we demonstrate that re￾versing standard heuristics—applying high-magnitude updates initially on the hardest data—yields drastic WER improvements on identical data distr… view at source ↗
Figure 2
Figure 2. Figure 2: shows training loss for the Malayalam Whisper￾medium model (representative; Hindi and Whisper-small ex￾hibit identical trends). The conservative LR (1e−5) plateaus within the first 7K steps at a loss an order of magnitude higher than the 2e−4 schedule, consistent with the hypothesis that the pre-trained prior creates a deep, narrow basin from which small gradients cannot escape [PITH_FULL_IMAGE:figures/fu… view at source ↗
read the original abstract

Fine-tuning multilingual ASR models like Whisper for low-resource languages often improves read speech but degrades spontaneous audio performance, a phenomenon we term studio-bias. To diagnose this mismatch, we introduce Vividh-ASR, a complexity-stratified benchmark for Hindi and Malayalam across four tiers: studio, broadcast, spontaneous, and synthetic noise. Through a controlled study of learning-rate timing and curriculum ordering, we find that early large parameter updates improve global WER by 12 absolute points, while a hard-to-easy curriculum adds gains for spontaneous speech. These findings motivate reverse multi-stage fine-tuning (R-MFT), a training recipe that enables a parameter-efficient 244M Whisper model to match or exceed conventionally fine-tuned 769M counterparts. Representational analysis via CKA and SVD reveals effective schedules concentrate adaptation in the decoder, preserving the pre-trained encoder's acoustic geometry. We release the benchmark and models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces Vividh-ASR, a complexity-stratified benchmark for Hindi and Malayalam ASR with four tiers (studio, broadcast, spontaneous, synthetic noise) to address studio-bias in fine-tuned Whisper models. Controlled experiments on learning-rate timing and curriculum ordering show that early large updates yield 12-point WER gains and hard-to-easy ordering benefits spontaneous speech; these motivate reverse multi-stage fine-tuning (R-MFT), enabling a 244M model to match or exceed conventionally fine-tuned 769M models. CKA and SVD analyses indicate effective schedules concentrate adaptation in the decoder while preserving encoder geometry. The benchmark and models are released.

Significance. If the central empirical claims hold, the work delivers a practical, parameter-efficient recipe for robust Indic ASR that mitigates studio-bias and improves spontaneous-speech performance. The tiered benchmark and public release of data/models constitute a clear contribution to reproducibility; the concrete WER deltas and representational checks (CKA/SVD) provide falsifiable support for the optimization-dynamics findings.

major comments (2)
  1. [Experiments] Experiments section: the reported 12 absolute-point WER improvement from early large updates and the headline R-MFT result (244M matching/exceeding 769M) are presented without error bars, statistical significance tests, or explicit train/validation/test splits, which are required to establish that the gains are robust rather than split-dependent.
  2. [Benchmark] Benchmark construction: the four complexity tiers are introduced as representative of real-world usage, yet the manuscript provides no quantitative validation (e.g., acoustic or linguistic statistics) that the tier distributions match deployment conditions for Hindi/Malayalam, weakening the claim that R-MFT gains will generalize beyond the benchmark.
minor comments (2)
  1. [Abstract] Abstract: replace the qualitative phrase 'match or exceed' with the actual WER numbers for the 244M R-MFT model versus the 769M baseline on each tier.
  2. [Analysis] Representational analysis: the CKA/SVD results are summarized at a high level; adding a short table of layer-wise similarity scores would make the decoder-adaptation claim easier to verify.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and the recommendation for minor revision. We address each major comment below and have revised the manuscript to strengthen the presentation of experimental results and benchmark validation.

read point-by-point responses
  1. Referee: [Experiments] Experiments section: the reported 12 absolute-point WER improvement from early large updates and the headline R-MFT result (244M matching/exceeding 769M) are presented without error bars, statistical significance tests, or explicit train/validation/test splits, which are required to establish that the gains are robust rather than split-dependent.

    Authors: We agree that error bars, significance testing, and explicit splits are important for establishing robustness. In the revised manuscript we report means and standard deviations over three independent runs with different random seeds for the 12-point WER result and the 244M-vs-769M comparisons. We have added paired t-test p-values confirming statistical significance (p < 0.01) and now explicitly describe the train/validation/test splits (80/10/10 per tier and language) in the Experiments section. revision: yes

  2. Referee: [Benchmark] Benchmark construction: the four complexity tiers are introduced as representative of real-world usage, yet the manuscript provides no quantitative validation (e.g., acoustic or linguistic statistics) that the tier distributions match deployment conditions for Hindi/Malayalam, weakening the claim that R-MFT gains will generalize beyond the benchmark.

    Authors: We accept that quantitative validation would better support the representativeness claim. The revised Benchmark section now includes a new table reporting acoustic statistics (average SNR, speaking rate, duration) and linguistic features (vocabulary diversity, average utterance length) for each tier, together with comparisons to reference values drawn from other public Hindi and Malayalam corpora. These additions provide concrete evidence of alignment with typical deployment conditions while acknowledging that exact equivalence to every possible real-world scenario cannot be guaranteed. revision: yes

Circularity Check

0 steps flagged

No significant circularity in empirical claims

full rationale

The paper is an empirical study that introduces the Vividh-ASR benchmark with four complexity tiers and evaluates training recipes such as reverse multi-stage fine-tuning through controlled experiments measuring WER on held-out data. The central performance claim is supported by direct comparisons of model sizes, learning-rate schedules, and curriculum orderings, with post-hoc representational analysis via CKA and SVD. No equations, fitted parameters renamed as predictions, self-definitional constructs, or load-bearing self-citations appear in the derivation; the results are externally falsifiable via the released benchmark and models, rendering the chain self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claims rest on standard deep-learning assumptions about WER as a reliable metric and the transferability of fine-tuning dynamics across model sizes; no new free parameters, axioms beyond domain norms, or invented entities are introduced.

axioms (1)
  • domain assumption Word error rate (WER) is an adequate primary metric for evaluating spontaneous and noisy speech recognition performance.
    The paper reports all gains in absolute WER points without discussing limitations of WER for disfluent or accented speech.

pith-pipeline@v0.9.0 · 5472 in / 1369 out tokens · 49404 ms · 2026-05-14T19:34:16.090899+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

29 extracted references · 29 canonical work pages · 1 internal anchor

  1. [1]

    However, zero-shot word er- ror rates (WER) for many Indic languages often exceed 100%

    Introduction Large-scale weakly supervised pre-training has brought auto- matic speech recognition (ASR) close to human performance in high-resource languages [1, 2]. However, zero-shot word er- ror rates (WER) for many Indic languages often exceed 100%. Fine-tuned models such as IndicWhisper [3] reduce this gap, but are trained primarily on studio-record...

  2. [2]

    Related Work Indic ASR corpora and benchmarks.Open-source datasets for Indian languages have expanded rapidly. Kathbath [8] pro- arXiv:2605.13087v1 [cs.CL] 13 May 2026 vides large-scale read speech, Shrutilipi [9] broadcast news transcriptions, and Indic V oices [10] crowdsourced sponta- neous speech. Benchmarks such as Vistaar [3] evaluate models across ...

  3. [3]

    The Vividh-ASR Benchmark Vividh-ASR is a diagnostic benchmark organized byacous- tic and prosodic complexityrather than by domain. It targets Hindi and Malayalam, representing the Indo-Aryan and Dra- vidian language families respectively, and aggregates data from Kathbath [8], Shrutilipi [9], Indic V oices [10], FLEURS [17], and additional publicly availa...

  4. [4]

    However, when adapting to low-resource languages with complex phonotactics, the model Table 1:Data distribution in hours

    Methodology Standard Whisper fine-tuning relies on conservative learning rates (1e−5) under the assumption that large updates will de- stroy the pre-trained priors [19]. However, when adapting to low-resource languages with complex phonotactics, the model Table 1:Data distribution in hours. The corpus is intention- ally weighted toward spontaneous speech ...

  5. [5]

    Learning Rate Effect Figure 2 shows training loss for the Malayalam Whisper- medium model (representative; Hindi and Whisper-small ex- hibit identical trends)

    Results 5.1. Learning Rate Effect Figure 2 shows training loss for the Malayalam Whisper- medium model (representative; Hindi and Whisper-small ex- hibit identical trends). The conservative LR (1e−5) plateaus within the first 7K steps at a loss an order of magnitude higher than the2e−4schedule, consistent with the hypothesis that the pre-trained prior cre...

  6. [6]

    over-parameterization

    Analysis We hypothesize that successful adaptation to low-resource In- dic phonotactics requires a structural asymmetry: learning new linguistic priors in the decoder while preserving the pre-trained Table 4:WER (%) across Vividh-ASR tiers for Malayalam (Mal) and Hindi (Hi). Best per-column inbold.†: uses less Tier C training data than our models (see tex...

  7. [7]

    Conclusion We introduced Vividh-ASR, a complexity-tiered benchmark designed to diagnosestudio-biasin Indic speech recognition. Through a2×2factorial study on Hindi and Malayalam, we demonstrated that optimization plasticity dominates curriculum ordering: early large parameter updates yield∼12 absolute WER points of improvement, while a hard-to-easy curric...

  8. [8]

    All final content was re- viewed, verified, and approved by the authors, who take full responsibility for the integrity of the research and its presenta- tion

    Generative AI Use Disclosure The authors utilized large language model (LLM) tools, specif- ically Gemini 2.5 Pro, to assist in the linguistic refinement and technical polishing of the manuscript. All final content was re- viewed, verified, and approved by the authors, who take full responsibility for the integrity of the research and its presenta- tion

  9. [9]

    Robust speech recognition via large-scale weak supervision,

    A. Radford, J. W. Kim, T. Xu, G. Brockman, C. McLeavey, and I. Sutskever, “Robust speech recognition via large-scale weak supervision,” inInternational conference on machine learning. PMLR, 2023, pp. 28 492–28 518

  10. [10]

    Unsupervised cross-lingual representation learning for speech recognition,

    A. Conneau, A. Baevski, R. Collobert, A. Mohamed, and M. Auli, “Unsupervised cross-lingual representation learning for speech recognition,” inInterspeech, 2021

  11. [11]

    Vistaar: Diverse Benchmarks and Training Sets for Indian Language ASR,

    K. Bhogale, S. Sundaresan, A. Raman, T. Javed, M. M. Khapra, and P. Kumar, “Vistaar: Diverse Benchmarks and Training Sets for Indian Language ASR,” inInterspeech 2023, 2023, pp. 4384– 4388

  12. [12]

    whisper-finetune: Hyperparameter tun- ing,

    V . Lodagala, “whisper-finetune: Hyperparameter tun- ing,” https://github.com/vasistalodagala/whisper-finetune# hyperparameter-tuning, 2024, gitHub repository, accessed 2026-03-05

  13. [13]

    Overcoming catastrophic forgetting in neural networks,

    J. e. a. Kirkpatrick, “Overcoming catastrophic forgetting in neural networks,” inPNAS, 2017

  14. [14]

    Curriculum learning,

    Y . Bengio, J. Louradour, R. Collobert, and J. Weston, “Curriculum learning,” inICML, 2009

  15. [15]

    Similarity of neural network representations revisited,

    S. Kornblith, M. Norouzi, H. Lee, and G. Hinton, “Similarity of neural network representations revisited,” inInternational Con- ference on Machine Learning (ICML). PMLR, 2019, pp. 3519– 3529

  16. [16]

    Indicsuperb: A speech processing universal performance benchmark for indian languages,

    T. Javed, K. S. Bhogale, A. Raman, A. Kunchukuttan, P. Kumar, and M. M. Khapra, “Indicsuperb: A speech processing universal performance benchmark for indian languages,” 2022. [Online]. Available: https://arxiv.org/abs/2208.11761

  17. [17]

    Effectiveness of mining au- dio and text pairs from public data for improving asr systems for low-resource languages,

    K. Bhogale, A. Raman, T. Javed, S. Doddapaneni, A. Kunchukut- tan, P. Kumar, and M. M. Khapra, “Effectiveness of mining au- dio and text pairs from public data for improving asr systems for low-resource languages,” inIcassp 2023-2023 ieee international conference on acoustics, speech and signal processing (icassp). IEEE, 2023, pp. 1–5

  18. [18]

    Indicvoices: Towards building an inclusive multilingual speech dataset for indian languages,

    T. Javed, J. Nawale, E. George, S. Joshi, K. Bhogale, D. Mehen- dale, I. Sethi, A. Ananthanarayanan, H. Faquih, P. Palitet al., “Indicvoices: Towards building an inclusive multilingual speech dataset for indian languages,” inFindings of the Association for Computational Linguistics: ACL 2024, 2024, pp. 10 740–10 782

  19. [19]

    Esb: A benchmark for multi-domain end-to-end speech recognition,

    S. Gandhi, P. V on Platen, and A. M. Rush, “Esb: A benchmark for multi-domain end-to-end speech recognition,”arXiv preprint arXiv:2210.13352, 2022

  20. [20]

    Cba-whisper: Curriculum learning-based adalora fine-tuning on whisper for low-resource dysarthric speech recognition,

    T. Tan, X. Chen, X. Le, W. Fan, X. Xia, C. Huang, and J. Lu, “Cba-whisper: Curriculum learning-based adalora fine-tuning on whisper for low-resource dysarthric speech recognition,” inInter- speech 2025, 2025, pp. 3309–3313

  21. [21]

    Task-informed anti- curriculum by masking improves downstream performance on text,

    A. Jarca, F. A. Croitoru, and R. T. Ionescu, “Task-informed anti- curriculum by masking improves downstream performance on text,”arXiv preprint arXiv:2502.12953, 2025, accepted at ACL 2025

  22. [22]

    Layer-wise analysis of a self-supervised speech representation model,

    A. Pasad, J.-C. Chou, and K. Livescu, “Layer-wise analysis of a self-supervised speech representation model,” inIEEE Auto- matic Speech Recognition and Understanding Workshop (ASRU). IEEE, 2021, pp. 914–921

  23. [23]

    Comparative layer-wise analy- sis of self-supervised speech models,

    A. Pasad, B. Shi, and K. Livescu, “Comparative layer-wise analy- sis of self-supervised speech models,” inIEEE International Con- ference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2023, pp. 1–5

  24. [24]

    What happens to BERT embeddings during fine-tuning?

    A. Merchant, E. Rahimtoroghi, E. Pavlick, and I. Tenney, “What happens to BERT embeddings during fine-tuning?” inProceed- ings of the Third BlackboxNLP Workshop on Analyzing and Inter- preting Neural Networks for NLP, 2020, pp. 33–44

  25. [25]

    Fleurs: Few-shot learning evaluation of universal representations of speech,

    A. Conneau, M. Ma, S. Khanuja, Y . Zhang, V . Axelrod, S. Dalmia, J. Riesa, C. Rivera, and A. Bapna, “Fleurs: Few-shot learning evaluation of universal representations of speech,” arXiv preprint arXiv:2205.12446, 2022. [Online]. Available: https://arxiv.org/abs/2205.12446

  26. [26]

    Re- sources for Indian languages,

    A. Baby, A. L. Thomas, N. Nishanthi, T. Consortiumet al., “Re- sources for Indian languages,” inProceedings of Text, Speech and Dialogue. CBBLR Workshop, 2016

  27. [27]

    Enhancing whisper’s accu- racy and speed for indian languages through prompt-tuning and tokenization,

    K. Tripathi, R. Gothi, and P. Wasnik, “Enhancing whisper’s accu- racy and speed for indian languages through prompt-tuning and tokenization,” inICASSP 2025-2025 IEEE International Con- ference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2025, pp. 1–5

  28. [28]

    A metric for distributions with applications to image databases,

    Y . Rubner, C. Tomasi, and L. Guibas, “A metric for distributions with applications to image databases,” inSixth International Con- ference on Computer Vision (IEEE Cat. No.98CH36271), 1998, pp. 59–66

  29. [29]

    Conformer: Convolution-augmented transformer for speech recognition,

    A. Gulati, J. Qin, C.-C. Chiu, N. Parmar, Y . Zhang, J. Yu, W. Han, S. Wang, Z. Zhang, Y . Wu, and R. Pang, “Conformer: Convolution-augmented transformer for speech recognition,” in Interspeech, 2020, pp. 5036–5040