arxiv: 2606.09535 · v1 · pith:CZ7WP7TPnew · submitted 2026-06-08 · 💻 cs.CL · cs.SD

Overcoming Decoder Inconsistencies in Whisper for Dravidian and Low-Resource Languages

Pith reviewed 2026-06-27 16:19 UTC · model grok-4.3

classification 💻 cs.CL cs.SD
keywords automatic speech recognition · Whisper · Dravidian languages · low-resource languages · decoder enhancements · attention mechanisms · word error rate

The pith

Whisper decoder enhancements reduce word error rates on Dravidian and other low-resource languages.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that Whisper's higher error rates on Dravidian languages stem from their longer words, diverse vocabulary, and low repetition, which create sparse token distributions and expose an imbalance in the decoder between self-attention for linguistic context and cross-attention for acoustic cues. Baseline fine-tuning confirms this imbalance produces frequent character substitutions. The authors introduce Weighted-Attention to adaptively balance the two attention sources and Self-Conditioning to feed intermediate predictions back into the decoder for better token consistency. Experiments on low-resource and agglutinative languages report consistent error reductions, offering a targeted fix without relying on impractical synthetic data repetition.

Core claim

Decoder imbalance between self-attention and cross-attention produces token inconsistencies for languages with sparse distributions; Weighted-Attention adaptively balances attention sources while Self-Conditioning reinjects intermediate predictions, yielding consistent WER reductions on Dravidian and low-resource languages.
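
The balancing idea behind Weighted-Attention can be illustrated with a small gate. This is only a minimal sketch under stated assumptions: the page does not reproduce the paper's formulation, so the scalar sigmoid gate, the function name `weighted_attention`, and the parameters `gate_weights` and `gate_bias` are hypothetical.

```python
import math


def sigmoid(x):
    """Logistic function mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def weighted_attention(self_out, cross_out, gate_weights, gate_bias=0.0):
    """Blend self-attention and cross-attention outputs with a scalar gate.

    Hypothetical formulation (not taken from the paper):
        g   = sigmoid(w . [self_out; cross_out] + b)
        out = g * self_out + (1 - g) * cross_out
    """
    if len(self_out) != len(cross_out):
        raise ValueError("attention outputs must share a dimension")
    features = list(self_out) + list(cross_out)
    if len(gate_weights) != len(features):
        raise ValueError("gate_weights must have length 2 * d")
    score = sum(w * f for w, f in zip(gate_weights, features)) + gate_bias
    g = sigmoid(score)
    # g near 1 favours linguistic context, g near 0 favours acoustic cues.
    mixed = [g * s + (1.0 - g) * c for s, c in zip(self_out, cross_out)]
    return mixed, g
```

With all-zero gate weights the gate sits at 0.5 and the output is the plain average of the two sources; a learned gate would instead shift per token toward whichever source is more reliable.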

What carries the argument

Weighted-Attention, which adaptively balances self-attention and cross-attention sources, and Self-Conditioning, which reinjects intermediate decoder predictions to enforce token consistency.
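
Reinjecting intermediate predictions can be sketched as follows. This is an illustration of the general self-conditioning idea, not the authors' module: the projection matrices `out_proj` and `back_proj`, the additive update, and the function names are assumptions made for the example.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vector):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def self_condition(hidden, out_proj, back_proj):
    """Add a projection of the intermediate token posterior to the hidden state.

    hidden:    d floats, an intermediate decoder state
    out_proj:  V x d matrix mapping the state to vocabulary logits
    back_proj: d x V matrix mapping the posterior back to state space
               (hypothetical; the paper's parameterisation is not shown)
    """
    posterior = softmax(matvec(out_proj, hidden))
    feedback = matvec(back_proj, posterior)
    conditioned = [h + f for h, f in zip(hidden, feedback)]
    return conditioned, posterior
```

Later layers then see both the original state and a summary of what the model currently believes the token to be, which is the mechanism the page credits with improving token consistency.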

If this is right

  • Lower word error rates on agglutinative and low-resource languages without new training data.
  • Fewer character-level substitution errors in languages with long words and high vocabulary diversity.
  • A practical decoder-level alternative to synthetic token-repetition augmentation.
  • Improved consistency in multilingual ASR outputs for languages with sparse token distributions.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same decoder fixes could be tested on other multilingual models that share Whisper's attention architecture.
  • If the enhancements prove stable across domains, they might reduce reliance on language-specific fine-tuning pipelines.
  • The linguistic analysis of word length and repetition patterns could guide similar decoder adjustments for non-Dravidian agglutinative languages.

Load-bearing premise

The decoder imbalance between self-attention and cross-attention is the main driver of high error rates rather than data volume or acoustic modeling gaps.

What would settle it

No measurable WER drop on Dravidian test sets after inserting Weighted-Attention and Self-Conditioning into the Whisper decoder would falsify the claim.
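
Checking for a WER drop requires a WER computation and a way to compare two systems. The sketch below uses the standard word-level edit distance; the function names are illustrative, and whitespace tokenisation is an assumption that may not match the paper's evaluation setup.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference prefix seen so
    # far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(baseline_wer, improved_wer):
    """Relative WER reduction of the improved system over the baseline."""
    if baseline_wer <= 0:
        raise ValueError("baseline WER must be positive")
    return (baseline_wer - improved_wer) / baseline_wer
```

A relative reduction at or below zero on the Dravidian test sets, for the enhanced decoder against the fine-tuned baseline, is the outcome that would count against the claim.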

Figures

Figures reproduced from arXiv: 2606.09535 by Chowdam Venkata Kumar, Kumud Tripathi, Pankaj Wasnik.

Figure 1: Block diagrams of proposed approaches.
Original abstract

Multilingual ASR models such as Whisper perform well on high-resource languages but exhibit substantially higher Word Error Rates (WER) for Dravidian languages compared to Indo-Aryan ones. Through linguistic and dataset analysis, we show that Dravidian languages have longer words, higher vocabulary diversity, and lower repetition, resulting in sparse token distributions and frequent character-level substitution errors. Baseline fine-tuning further reveals decoder imbalance between self-attention (linguistic context) and cross-attention (acoustic cues). Although synthetic token-repetition experiments indicate potential gains, they are impractical. Motivated by these observations, we introduce two decoder-level enhancements: Weighted-Attention, which adaptively balances attention sources, and Self-Conditioning, which reinjects intermediate predictions to improve token consistency. Experiments demonstrate consistent WER reductions for low-resource and agglutinative languages.
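
The corpus properties named in the abstract (word length, vocabulary diversity, repetition) can be measured with a few lines of code. This is a rough sketch: the paper's preliminary study mentions type-to-token ratio (TTR) and average word repetition, but its exact definitions are not shown on this page, so taking repetition as tokens per distinct type and splitting on whitespace are assumptions.

```python
def corpus_stats(sentences):
    """Compute simple lexical statistics over whitespace-tokenised sentences."""
    tokens = [word for sentence in sentences for word in sentence.split()]
    if not tokens:
        raise ValueError("corpus must contain at least one token")
    types = set(tokens)
    return {
        # Higher TTR means a more diverse vocabulary.
        "ttr": len(types) / len(tokens),
        # Assumed definition: mean number of occurrences per distinct word.
        "avg_repetition": len(tokens) / len(types),
        # len() counts Unicode code points, not grapheme clusters, so this
        # overstates perceived length for scripts with combining marks.
        "avg_word_length": sum(len(w) for w in tokens) / len(tokens),
    }
```

On this reading, a corpus with high `ttr`, low `avg_repetition`, and high `avg_word_length` matches the profile the abstract attributes to Dravidian languages.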

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript analyzes Whisper's elevated WER on Dravidian languages relative to Indo-Aryan ones, linking this to longer words, higher vocabulary diversity, lower repetition, and resulting sparse token distributions that cause character-level substitution errors. Baseline fine-tuning is said to reveal decoder imbalance between self-attention (linguistic context) and cross-attention (acoustic cues). Two decoder enhancements are introduced—Weighted-Attention to adaptively balance attention sources and Self-Conditioning to reinject intermediate predictions for token consistency—with experiments claimed to yield consistent WER reductions for low-resource and agglutinative languages.

Significance. If the claimed WER reductions are substantiated with proper controls and the methods are shown to specifically correct the identified imbalance, the work could offer targeted decoder improvements for multilingual ASR on agglutinative low-resource languages. The linguistic and dataset analysis provides a useful framing for why certain languages underperform.

major comments (2)
  1. [Abstract] Abstract: the claim that 'experiments demonstrate consistent WER reductions' is unsupported by any reported numbers, baselines, dataset sizes, statistical tests, or error breakdowns, preventing verification of the central experimental result.
  2. [Abstract / implied experimental analysis] The manuscript states that baseline fine-tuning 'reveals decoder imbalance' and that the two enhancements mitigate it, yet supplies no quantitative metric (attention weight ratios, layer-wise contribution scores, or error correlation with attention sources), no ablation isolating each component's effect on the imbalance, and no before/after comparison demonstrating rebalancing rather than generic regularization.
minor comments (1)
  1. [Title / Abstract] The title refers to 'Dravidian and Low-Resource Languages' while the abstract focuses on Dravidian as the primary case; a brief clarification of scope would help.
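
One way the quantitative imbalance metric requested in major comment 2 could look is a per-layer share of output magnitude contributed by cross-attention. This is a hypothetical metric offered for illustration; neither the referee report nor the paper specifies it, and the function names are invented here.

```python
import math


def l2_norm(vector):
    """Euclidean norm of a list of floats."""
    return math.sqrt(sum(x * x for x in vector))


def cross_attention_share(self_out, cross_out):
    """Fraction of combined output norm that comes from cross-attention.

    Returns a value in [0, 1]; 0.5 means the two sources contribute
    equally by magnitude (hypothetical definition of balance).
    """
    s = l2_norm(self_out)
    c = l2_norm(cross_out)
    if s + c == 0:
        raise ValueError("both attention outputs are zero")
    return c / (s + c)


def layerwise_shares(layers):
    """Apply the share metric to (self_out, cross_out) pairs, one per layer."""
    return [cross_attention_share(s, c) for s, c in layers]
```

Reporting these shares before and after adding the enhancements would give the kind of before/after comparison the referee asks for, though a norm-based share is only one of several plausible choices.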

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback highlighting the need for stronger empirical support in the abstract and experimental sections. We will revise the manuscript to address these points directly.

  1. Referee: [Abstract] Abstract: the claim that 'experiments demonstrate consistent WER reductions' is unsupported by any reported numbers, baselines, dataset sizes, statistical tests, or error breakdowns, preventing verification of the central experimental result.

    Authors: We agree that the abstract claim requires concrete support. The full manuscript contains detailed experimental tables with WER results across languages, dataset sizes, and comparisons to baselines, but these were not summarized in the abstract. In revision we will add specific reduction percentages, dataset details, and note that improvements hold across multiple runs. revision: yes

  2. Referee: [Abstract / implied experimental analysis] The manuscript states that baseline fine-tuning 'reveals decoder imbalance' and that the two enhancements mitigate it, yet supplies no quantitative metric (attention weight ratios, layer-wise contribution scores, or error correlation with attention sources), no ablation isolating each component's effect on the imbalance, and no before/after comparison demonstrating rebalancing rather than generic regularization.

    Authors: We accept this critique. The current text describes the imbalance qualitatively from fine-tuning observations but lacks the requested quantitative metrics and ablations. We will expand the experiments section with attention weight ratio measurements, layer-wise analyses, component ablations, and before/after comparisons to show targeted rebalancing effects. revision: yes

Circularity Check

0 steps flagged

No significant circularity; claims rest on experiments without self-referential derivations

The paper contains no equations, derivations, fitted parameters presented as predictions, or self-citations that bear the central claim. Linguistic observations and baseline fine-tuning results motivate the introduction of Weighted-Attention and Self-Conditioning, but these are presented as empirical enhancements rather than outputs forced by construction from the inputs. The central claim of WER reductions is supported by described experiments, which are independent of any circular reduction. This matches the default expectation for non-circular papers.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract contains no mathematical derivations, fitted constants, or new postulated entities; all content is descriptive.

Reference graph

Works this paper leans on

32 extracted references · 2 linked inside Pith

  1. [1]

    While these models perform well on high-resource languages, they continue to show significantly lower performance on Indo-Aryan and Dravidian languages

    Introduction Automatic speech recognition (ASR) has advanced rapidly in recent years, primarily driven by transformer-based architectures such as Whisper. While these models perform well on high-resource languages, they continue to show significantly lower performance on Indo-Aryan and Dravidian languages. Notably, Dravidian languages such as Tamil, Tel...

  2. [2]

    Related Work Modern multilingual ASR systems are largely built on Transformer-based encoder-decoder architectures that jointly model acoustic, pronunciation, and language information within a unified framework. Prior work has demonstrated that incorporating subword units along with explicit language-symbol conditioning significantly improves recognition ...

  3. [3]

    Preliminary Study 3.1. Linguistic Variation Analysis We perform a corpus-level linguistic analysis using the Kathbath dataset [20], which includes languages from both Indo-Aryan (Hindi, Gujarati, Marathi, Bengali) and Dravidian (Tamil, Telugu, Kannada, Malayalam) families. The analysis focuses on type-to-token ratio (TTR), average word repetition (WR),...

  4. [4]

    Proposed Methodology In this section, we present details of the proposed Weighted-Attention mechanism to balance linguistic and acoustic cues adaptively, and a Self-Conditioning module to reinforce token consistency during decoding. 4.1. Weighted-Attention The main challenge lies in the decoder’s difficulty in balancing acoustic cues from cross-attention...

  5. [5]

    Implementation Details Experiments were conducted on four NVIDIA A100 (40GB) GPUs using Whisper-medium

    Experimental setup 5.1. Implementation Details Experiments were conducted on four NVIDIA A100 (40GB) GPUs using Whisper-medium. Fine-tuning was performed for 3 epochs with the AdamW optimizer, a batch size of 16, and learning rates of 1e-5 for standard fine-tuning and 5e-5 for newly introduced parameters. All experiments were implemented using the Huggi...

  6. [6]

    Results and discussion 6.1. Analysis on Indian Languages The results in Table 4 show the relative improvements of different decoder enhancement methods over the baseline Whisper-medium fine-tuning (W-M FT) with Morphological Splitting (MS). On average, Weighted-Attention and Self-Conditioning yield comparable overall gains (1.54% and 1.55%), improvin...

  7. [7]

    Conclusion This work investigates the persistent performance gap between Dravidian and Indo-Aryan languages in multilingual ASR systems, focusing on Whisper. Through corpus analysis and decoding behavior, we identify high character-level substitution errors in Dravidian languages, driven by complex morphology, longer words, and low repetition. To addr...

  8. [8]

    All scientific contributions, technical implementations, and interpretations were developed and validated by the authors

    Generative AI Use Disclosure Generative AI tools have been used mainly for grammar correction, paraphrasing, and overall language editing purposes in the manuscript. All scientific contributions, technical implementations, and interpretations were developed and validated by the authors. In accordance with policy guidelines, no generative AI system is ...

  9. [9]

    Multilingual end-to-end speech recognition with a single transformer on low-resource languages,

    S. Zhou, S. Xu, and B. Xu, “Multilingual end-to-end speech recognition with a single transformer on low-resource languages,” arXiv preprint arXiv:1806.05059, 2018

  10. [10]

    Improving massively multilingual asr with auxiliary ctc objectives,

    W. Chen, B. Yan, J. Shi, Y. Peng, S. Maiti, and S. Watanabe, “Improving massively multilingual asr with auxiliary ctc objectives,” in ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2023, pp. 1–5

  11. [11]

    Self-and-mixed attention decoder with deep acoustic structure for transformer-based lvcsr,

    X. Zhou, G. Lee, E. Yılmaz, Y. Long, J. Liang, and H. Li, “Self-and-mixed attention decoder with deep acoustic structure for transformer-based lvcsr,” arXiv preprint arXiv:2006.10407, 2020

  12. [12]

    Dual-decoder transformer for joint automatic speech recognition and multilingual speech translation,

    H. Le, J. Pino, C. Wang, J. Gu, D. Schwab, and L. Besacier, “Dual-decoder transformer for joint automatic speech recognition and multilingual speech translation,” arXiv preprint arXiv:2011.00747, 2020

  13. [13]

    W2v-bert: Combining contrastive learning and masked language modeling for self-supervised speech pre-training,

    Y.-A. Chung, Y. Zhang, W. Han, C.-C. Chiu, J. Qin, R. Pang, and Y. Wu, “W2v-bert: Combining contrastive learning and masked language modeling for self-supervised speech pre-training,” in 2021 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU). IEEE, 2021, pp. 244–250

  14. [14]

    Unsupervised cross-lingual representation learning for speech recognition,

    A. Conneau, A. Baevski, R. Collobert, A. Mohamed, and M. Auli, “Unsupervised cross-lingual representation learning for speech recognition,” arXiv preprint arXiv:2006.13979, 2020

  15. [15]

    Robust speech recognition via large-scale weak supervision,

    A. Radford, J. W. Kim, T. Xu, G. Brockman, C. McLeavey, and I. Sutskever, “Robust speech recognition via large-scale weak supervision,” in International conference on machine learning. PMLR, 2023, pp. 28492–28518

  16. [16]

    Comparative performance analysis of end-to-end asr models on indo-aryan and dravidian languages within india’s linguistic landscape,

    P. Jain and A. Bhowmick, “Comparative performance analysis of end-to-end asr models on indo-aryan and dravidian languages within india’s linguistic landscape,” EURASIP Journal on Audio, Speech, and Music Processing, vol. 2025, no. 1, p. 10, 2025

  17. [17]

    Enhancing whisper’s accuracy and speed for indian languages through prompt-tuning and tokenization,

    K. Tripathi, R. Gothi, and P. Wasnik, “Enhancing whisper’s accuracy and speed for indian languages through prompt-tuning and tokenization,” in ICASSP 2025-2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2025, pp. 1–5

  18. [18]

    Improving speech recognition systems for the morphologically complex malayalam language using subword tokens for language modeling,

    K. Manohar and R. Rajan, “Improving speech recognition systems for the morphologically complex malayalam language using subword tokens for language modeling,” EURASIP Journal on Audio, Speech, and Music Processing, vol. 2023, no. 1, p. 47, 2023

  19. [19]

    From english to more languages: Parameter-efficient model reprogramming for cross-lingual speech recognition,

    C.-H. H. Yang, B. Li, Y. Zhang, N. Chen, R. Prabhavalkar, T. N. Sainath, and T. Strohman, “From english to more languages: Parameter-efficient model reprogramming for cross-lingual speech recognition,” in ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2023, pp. 1–5

  20. [20]

    Automatic speech recognition with bert and ctc transformers: A review,

    N. Djeffal, H. Kheddar, D. Addou, A. C. Mazari, and Y. Himeur, “Automatic speech recognition with bert and ctc transformers: A review,” in 2023 2nd International Conference on Electronics, Energy and Measurement (IC2EM), vol. 1. IEEE, 2023, pp. 1–8

  21. [21]

    Remember the context! asr slot error correction through memorization,

    D. Bekal, A. Shenoy, M. Sunkara, S. Bodapati, and K. Kirchhoff, “Remember the context! asr slot error correction through memorization,” in 2021 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU). IEEE, 2021, pp. 236–243

  22. [22]

    Introduction to machine learning: k-nearest neighbors,

    Z. Zhang, “Introduction to machine learning: k-nearest neighbors,” Annals of translational medicine, vol. 4, no. 11, p. 218, 2016

  23. [23]

    Outlier reduction with gated attention for improved post-training quantization in large sequence-to-sequence speech foundation models,

    D. Wagner, I. Baumann, K. Riedhammer, and T. Bocklet, “Outlier reduction with gated attention for improved post-training quantization in large sequence-to-sequence speech foundation models,” arXiv preprint arXiv:2406.11022, 2024

  24. [24]

    Quantizable transformers: Removing outliers by helping attention heads do nothing,

    Y. Bondarenko, M. Nagel, and T. Blankevoort, “Quantizable transformers: Removing outliers by helping attention heads do nothing,” Advances in Neural Information Processing Systems, vol. 36, pp. 75067–75096, 2023

  25. [25]

    Non-autoregressive asr with self-conditioned folded encoders,

    T. Komatsu, “Non-autoregressive asr with self-conditioned folded encoders,” in ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2022, pp. 7427–7431

  26. [26]

    A survey on non-autoregressive generation for neural machine translation and beyond,

    Y. Xiao, L. Wu, J. Guo, J. Li, M. Zhang, T. Qin, and T.-y. Liu, “A survey on non-autoregressive generation for neural machine translation and beyond,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 45, no. 10, pp. 11407–11427, 2023

  27. [27]

    Nar-srec: Non-autoregressive end-to-end speech recognition with error correction decoder,

    B. Lyu, C. Fan, Y. Ming, J. Zhou, and K. Hong, “Nar-srec: Non-autoregressive end-to-end speech recognition with error correction decoder,” IEEE Transactions on Instrumentation and Measurement, 2025

  28. [28]

    Indicsuperb: A speech processing universal performance benchmark for indian languages,

    T. Javed, K. Bhogale, A. Raman, P. Kumar, A. Kunchukuttan, and M. M. Khapra, “Indicsuperb: A speech processing universal performance benchmark for indian languages,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 37, no. 11, 2023, pp. 12942–12950

  29. [29]

    Combining over-sampling and under-sampling techniques for imbalance dataset,

    N. Junsomboon and T. Phienthrakul, “Combining over-sampling and under-sampling techniques for imbalance dataset,” in Proceedings of the 9th international conference on machine learning and computing, 2017, pp. 243–247

  30. [30]

    Indic nlp library,

    A. Kunchukuttan, “Indic nlp library,” https://github.com/anoopkunchukuttan/indic_nlp_library, 2015, accessed: 2025-08-29

  31. [31]

    Openslr 40: Zeroth-korean corpus,

    OpenSLR, “Openslr 40: Zeroth-korean corpus,” https://openslr.org/40/, accessed: 2025-09-15

  32. [32]

    Mozilla common voice,

    Mozilla Foundation, “Mozilla common voice,” https://commonvoice.mozilla.org/en/datasets, 2024, accessed: 2025-09-15