arxiv: 2604.06487 · v1 · submitted 2026-04-07 · 💻 cs.CL

Closing the Speech-Text Gap with Limited Audio for Effective Domain Adaptation in LLM-Based ASR

Pith reviewed 2026-05-10 18:37 UTC · model grok-4.3

classification 💻 cs.CL
keywords LLM-based ASR · domain adaptation · modality gap · mixed batching · limited speech data · speech recognition · fine-tuning · text-only adaptation

The pith

Mixed batching with only 10 percent target speech matches or beats full-dataset fine-tuning in LLM-based ASR.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper investigates ways to adapt LLM-based automatic speech recognition systems to new domains when only limited paired speech-text data is available. It compares text-only adaptation, full paired adaptation, and a mixed batching approach that interleaves the two types of data. The results indicate that even a small fraction of speech data provides a useful signal for closing the gap between the speech encoder outputs and the language model's text expectations. This matters because gathering large amounts of domain-specific audio is costly, so methods that work with far less audio could make adaptation feasible in more settings.

Core claim

The central claim is that mixed batching, by combining text-only examples with a limited number of paired speech-text examples in the same training batches, supplies an effective modality-alignment signal that reduces the mismatch introduced by the speech projector. Experiments in both in-domain and out-of-domain conditions demonstrate that using only 10 percent of the target-domain speech (less than 4 hours) under this regime produces word error rates comparable to or lower than those from conventional fine-tuning on the entire paired dataset.

What carries the argument

Mixed batching (MB), the training strategy that interleaves text-only adaptation data with paired speech-text examples within individual batches to expose the LLM to noisy speech representations.
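
As a rough illustration of the batching idea, the sketch below builds batches that each contain a fixed share of paired speech-text examples and fill the remainder with text-only examples. It is not the authors' implementation: the function name `make_mixed_batches`, the `speech_fraction` parameter, the record layout, and sampling with replacement are all assumptions made for this example.

```python
import random
from typing import Dict, List, Optional

# Hypothetical record layout: "audio" is None for a text-only example.
Example = Dict[str, Optional[str]]


def make_mixed_batches(
    paired: List[Example],
    text_only: List[Example],
    batch_size: int,
    speech_fraction: float,
    num_batches: int,
    seed: int = 0,
) -> List[List[Example]]:
    """Build batches that each mix paired and text-only examples."""
    if not 0.0 <= speech_fraction <= 1.0:
        raise ValueError("speech_fraction must lie in [0, 1]")
    rng = random.Random(seed)
    n_speech = round(batch_size * speech_fraction)
    n_text = batch_size - n_speech
    batches = []
    for _ in range(num_batches):
        # Sampling with replacement (an assumption) lets a small paired
        # pool be reused across many batches.
        batch = rng.choices(paired, k=n_speech) + rng.choices(text_only, k=n_text)
        rng.shuffle(batch)  # avoid a fixed speech-then-text order
        batches.append(batch)
    return batches
```

With `batch_size=8` and `speech_fraction=0.25`, every batch holds two paired examples and six text-only ones, so each optimizer step sees both modalities.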

If this is right

  • MB with 10 percent speech consistently outperforms text-only adaptation across the tested domains.
  • The same limited-speech regime can reach or surpass the accuracy of full paired-data fine-tuning.
  • Both in-domain and out-of-domain adaptation benefit from the addition of even small speech quantities.
  • The modality gap left by text-only adaptation can be narrowed without requiring the complete target speech corpus.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The technique may lower the data-collection cost for adapting ASR to specialized vocabularies or low-resource languages.
  • Similar mixing strategies could be tested on other multimodal LLM tasks where one modality is cheaper to obtain than the other.
  • Optimal mixing ratios and batch sizes remain open parameters that could be tuned further to reduce audio needs even more.

Load-bearing premise

That the observed performance gains arise specifically from the modality-alignment effect of mixing the two data types rather than from differences in batch composition, learning-rate behavior, or other training details not isolated in the experiments.

What would settle it

A follow-up run that applies the same total compute and data volume but replaces mixed batches with either pure text batches plus separate speech batches or randomly reordered data without explicit mixing, and shows no WER improvement, would indicate that mixed batching itself is not the decisive factor.
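
One way to set up such a comparison is to derive all three batch schedules from the same pool of examples, so that data volume is matched by construction. The sketch below is only an illustration under that assumption; `build_schedules`, `chunk`, and the tuple-based example format are hypothetical names, and matching total compute (steps, learning-rate schedule) would still have to be handled in the training loop.

```python
import random
from typing import Dict, List, Tuple

Item = Tuple[str, int]  # hypothetical example id: ("speech" | "text", index)


def chunk(items: List[Item], size: int) -> List[List[Item]]:
    """Split a list into consecutive batches of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def build_schedules(
    paired: List[Item],
    text_only: List[Item],
    batch_size: int,
    n_speech: int,
    seed: int = 0,
) -> Dict[str, List[List[Item]]]:
    """Return three schedules over the same examples."""
    if not 0 < n_speech < batch_size:
        raise ValueError("n_speech must be between 1 and batch_size - 1")
    n_text = batch_size - n_speech
    n_batches = min(len(paired) // n_speech, len(text_only) // n_text)
    p = paired[: n_batches * n_speech]
    t = text_only[: n_batches * n_text]

    # Mixed: every batch contains both modalities.
    mixed = [
        p[i * n_speech:(i + 1) * n_speech] + t[i * n_text:(i + 1) * n_text]
        for i in range(n_batches)
    ]
    # Blocked: pure text batches followed by pure speech batches
    # (the last batch of each kind may be smaller).
    blocked = chunk(t, batch_size) + chunk(p, batch_size)
    # Shuffled: random order, no explicit per-batch mixing.
    pool = p + t
    random.Random(seed).shuffle(pool)
    shuffled = chunk(pool, batch_size)
    return {"mixed": mixed, "blocked": blocked, "shuffled": shuffled}
```

If word error rate is the same under all three schedules, the per-batch mixing is unlikely to be the decisive factor; a gap in favor of the mixed schedule would support the paper's reading.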

Figures

Figures reproduced from arXiv: 2604.06487 by Andreas Stolcke, Dairazalia Sánchez-Cortés, Esaú Villatoro-Tello, Hasindri Watawana, Kadri Hacioglu, Manjunath K E, Petr Motlicek, Sergio Burdisso, Severin Baroudi, Shashi Kumar, Shiran Liu, Thibault Bañeras-Roux.

Figure 1: Effect of different target text proportions (τt) on Mixed Batch adaptation on DefinedAI Banking. We investigate the effect of varying the proportion of target text data per batch (τt) in the text-only adaptation setting. No audio is included (τa = 0), and the remaining batch proportion is uniformly distributed across source auxiliary data.
Figure 2: Performance of the base ASR and adapted models. The figure compares standard ASR adaptation using only paired speech–text data with mixed-batch adaptation combining paired speech–text and text-only samples. The x-axis indicates the proportion of speech in the adaptation data, ranging from 0% (text-only) to 100% (paired speech–text). This allows us to study how the model adapts under low-resource condition…
Figure 3: Performances of base system and adapted model with text-only, paired speech-text and mixed batch adaptation. For both audio-adaptation settings, 20% of speech is used.
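
The batch proportions described in the Figure 1 caption (a share τt of target text, a share τa of target audio, and the remainder split uniformly across auxiliary source data) can be turned into per-batch example counts roughly as follows. This is a sketch under stated assumptions: `batch_composition`, the keys `target_text` and `target_audio`, the source names, and the largest-remainder rounding rule are illustrative choices, not details taken from the paper.

```python
from math import floor
from typing import Dict, List


def batch_composition(
    batch_size: int, tau_t: float, tau_a: float, sources: List[str]
) -> Dict[str, int]:
    """Integer example counts per data stream for one batch."""
    if tau_t < 0 or tau_a < 0 or tau_t + tau_a > 1:
        raise ValueError("tau_t and tau_a must be non-negative and sum to at most 1")
    if not sources:
        raise ValueError("need at least one auxiliary source")

    shares = {"target_text": tau_t, "target_audio": tau_a}
    rest = 1.0 - tau_t - tau_a
    for name in sources:
        shares[name] = rest / len(sources)  # uniform split of the remainder

    # Largest-remainder rounding (an assumption) keeps the total at batch_size.
    exact = {k: s * batch_size for k, s in shares.items()}
    counts = {k: floor(v) for k, v in exact.items()}
    leftover = max(batch_size - sum(counts.values()), 0)
    by_remainder = sorted(exact, key=lambda k: exact[k] - counts[k], reverse=True)
    for k in by_remainder[:leftover]:
        counts[k] += 1
    return counts
```

For example, `batch_composition(32, 0.5, 0.0, ["source_A", "source_B"])` allocates 16 target-text examples, no target audio, and 8 examples to each of the two hypothetical sources.
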
Original abstract

Conventional end-to-end automatic speech recognition (ASR) systems rely on paired speech-text data for domain adaptation. Recent LLM-based ASR architectures connect a speech encoder to a large language model via a projection module, enabling adaptation with text-only data. However, this introduces a modality gap, as the LLM is not exposed to the noisy representations produced by the speech projector. We investigate whether small amounts of speech can mitigate this mismatch. We compare three strategies: text-only adaptation, paired speech-text adaptation, and mixed batching (MB), which combines both. Experiments in in-domain and out-of-domain settings show that even limited speech consistently improves performance. Notably, MB using only 10% of the target-domain (less than 4 hours) speech achieves word error rates comparable to, or better than, conventional ASR fine-tuning with the full dataset, indicating that small amounts of speech provide a strong modality-alignment signal.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript investigates domain adaptation for LLM-based ASR systems that connect a speech encoder to an LLM via a projection module. It compares text-only adaptation, paired speech-text adaptation, and mixed batching (MB) strategies, claiming that MB incorporating only 10% of target-domain speech data (less than 4 hours) produces word error rates comparable to or better than conventional fine-tuning on the full paired dataset in both in-domain and out-of-domain settings, by supplying a modality-alignment signal that mitigates the speech-text gap.

Significance. If the central experimental claim holds after proper controls and ablations, the result would be significant for efficient domain adaptation in LLM-based ASR. It would demonstrate that minimal audio data can close the modality gap without full paired datasets, lowering data requirements for adapting such systems to new domains while maintaining or improving performance.

major comments (2)
  1. [Abstract] Abstract: The claim that MB with <4h (10%) target speech matches or exceeds full-dataset conventional ASR fine-tuning rests on the assumption that performance gains arise specifically from modality alignment. However, the abstract provides no information on whether text-only, paired, MB, and conventional runs share the same base LLM, identical optimizer (LR schedule, batch size, total steps), or equivalent volumes of text data. If MB simply exposes the model to more total tokens or uses different effective batch composition, the WER improvement cannot be attributed to alignment.
  2. [Abstract] Abstract / Experimental results: No error bars, statistical tests, exact dataset sizes, ablation details on batch composition, or controls for optimization hyperparameters are reported. This leaves the 'consistent improvements across in-domain and out-of-domain settings' unverifiable and makes it impossible to confirm that the gains are load-bearing for the modality-alignment hypothesis rather than uncontrolled factors.
minor comments (1)
  1. [Abstract] The abstract would benefit from reporting the precise WER numbers, dataset names and sizes, and the base LLM used to allow readers to assess the magnitude of the claimed improvements.
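
The kind of reporting requested in the major comments could look roughly like the sketch below, which summarizes word error rate over repeated runs with different random seeds and compares two conditions seed by seed. The function names `summarize_runs` and `mean_paired_difference` are hypothetical, and any numbers passed in would have to come from actual experiments; none are taken from the paper.

```python
import statistics
from typing import Dict, List


def summarize_runs(wer_by_seed: List[float]) -> Dict[str, float]:
    """Mean and sample standard deviation of WER over independent seeds."""
    if len(wer_by_seed) < 2:
        raise ValueError("need at least two seeds to estimate spread")
    return {
        "mean": statistics.mean(wer_by_seed),
        "std": statistics.stdev(wer_by_seed),  # sample std, divides by n - 1
    }


def mean_paired_difference(wer_a: List[float], wer_b: List[float]) -> float:
    """Mean of per-seed differences wer_a[i] - wer_b[i].

    A negative value means condition A had lower WER on average.
    Assumes both lists are ordered by the same seeds.
    """
    if len(wer_a) != len(wer_b):
        raise ValueError("both conditions need the same seeds")
    return statistics.mean([a - b for a, b in zip(wer_a, wer_b)])
```

Reporting the mean with its standard deviation per condition, plus the paired difference between mixed batching and full fine-tuning, would let readers judge whether "comparable or better" exceeds seed-to-seed noise.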

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on experimental controls and reporting standards. We address each major comment below and have revised the manuscript to improve clarity and verifiability of the results.

  1. Referee: [Abstract] Abstract: The claim that MB with <4h (10%) target speech matches or exceeds full-dataset conventional ASR fine-tuning rests on the assumption that performance gains arise specifically from modality alignment. However, the abstract provides no information on whether text-only, paired, MB, and conventional runs share the same base LLM, identical optimizer (LR schedule, batch size, total steps), or equivalent volumes of text data. If MB simply exposes the model to more total tokens or uses different effective batch composition, the WER improvement cannot be attributed to alignment.

    Authors: We agree that the abstract must explicitly confirm matched experimental conditions to support attribution to modality alignment. All strategies (text-only, paired speech-text, mixed batching, and conventional fine-tuning) in our experiments use the identical base LLM, the same optimizer configuration (learning rate schedule, batch size, and total steps), and comparable text data volumes. These controls are detailed in the experimental setup of the full manuscript. We have revised the abstract to state that all runs share these configurations, ensuring the WER gains from limited speech in MB can be attributed to the modality-alignment signal rather than differences in optimization or token exposure. revision: yes

  2. Referee: [Abstract] Abstract / Experimental results: No error bars, statistical tests, exact dataset sizes, ablation details on batch composition, or controls for optimization hyperparameters are reported. This leaves the 'consistent improvements across in-domain and out-of-domain settings' unverifiable and makes it impossible to confirm that the gains are load-bearing for the modality-alignment hypothesis rather than uncontrolled factors.

    Authors: We acknowledge the need for greater statistical rigor and detail. The full manuscript reports exact dataset sizes (target-domain speech <4 hours for the 10% setting) and describes batch composition for mixed batching along with hyperparameter controls in the methods and experimental sections. To address the concern, we have updated the abstract and results to include error bars from multiple seeds, notes on batch composition ablations, and confirmation of consistent hyperparameters across conditions. These revisions make the in-domain and out-of-domain improvements verifiable and strengthen the case that they arise from modality alignment. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical comparisons are self-contained experimental results

The paper reports direct experimental outcomes from comparing text-only adaptation, paired speech-text adaptation, and mixed batching (MB) on LLM-based ASR, using limited target-domain audio (10% or <4 hours). These are falsifiable WER measurements on in-domain and out-of-domain settings, not a derivation chain, mathematical prediction, or self-referential definition. No equations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the abstract or described method; the modality-alignment claim rests on observed performance differences rather than reducing to inputs by construction. The work is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The work is purely empirical and relies on standard machine-learning assumptions about data splits and optimization; no new entities or ad-hoc parameters are introduced beyond typical training hyperparameters.

axioms (1)
  • domain assumption: Standard i.i.d. assumptions on train/validation/test splits, and that word error rate is a sufficient proxy for ASR quality.
    Implicit in all ASR evaluation; invoked when claiming performance improvements.
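
For reference, word error rate is conventionally the word-level edit distance between hypothesis and reference, divided by the number of reference words. The sketch below is a minimal version of that standard definition; it assumes whitespace tokenization and no text normalization, which may differ from the scoring setup used in the paper.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + deletions + insertions) / reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # Levenshtein distance over words, keeping only one row at a time.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,             # deletion of a reference word
                cur[j - 1] + 1,          # insertion of a hypothesis word
                prev[j - 1] + (r != h),  # substitution, or match at cost 0
            ))
        prev = cur
    return prev[-1] / len(ref)
```

Because the metric weights every word equally, a single misrecognized domain term counts the same as a misrecognized function word, which is one reason the proxy assumption above is worth stating.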

Reference graph

Works this paper leans on

35 extracted references · 35 canonical work pages · 2 internal anchors

  1. [1]

    Introduction. The integration of Large Language Models (LLMs) into speech processing has revolutionized natural voice interactions. An increasingly common approach is LLM-based automatic speech recognition (ASR), where a pretrained speech encoder is connected to a pretrained language model via a small projection layer [1, 2, 3, 4]. In particular, the S...

  2. [2]

    Related work. 2.1. LLM-Based ASR. Recent work in ASR has shifted toward modular architectures that couple pretrained self-supervised speech encoders such as wav2vec 2.0 [8], HuBERT [9], or WavLM [17], with LLMs such as LLaMA [18] or Vicuna [19], through trainable projection layers that map continuous acoustic features into the LLM embedding space [1, 20...

  3. [3]

    First, a base model is trained on source-domain paired speech-text data

    Mixed Batching Strategy. We consider a domain adaptation setting for LLM-based ASR. First, a base model is trained on source-domain paired speech-text data. This base system learns the initial speech–text alignment and serves as the starting point for all experiments. We then adapt this pretrained model to a target domain using different fine-tuning str...

  4. [4]

    LLM-Based ASR Architecture In our experiments, the ASR system follows the SLAM-ASR framework [1]

    Experimental Protocol. 4.1. LLM-Based ASR Architecture. In our experiments, the ASR system follows the SLAM-ASR framework [1]. The architecture consists of three main components: (i) a pretrained speech encoder, (ii) a trainable speech projector, and (iii) an LLM. Given a speech segment X_S, the speech encoder produces a sequence of acoustic representations...

  5. [5]

    This base model serves as the starting point for all subsequent experiments

    Experimental Results. Before conducting the adaptation experiments, we first train a base ASR model exclusively on the source-domain data from the DefinedAI corpus, including the Banking, Insurance, and Healthcare partitions. This base model serves as the starting point for all subsequent experiments. We then perform domain adaptation on different target d...

  6. [6]

    Conclusion. In this work, we investigated domain adaptation for LLM-based ASR in settings where target-domain speech is limited but textual data is abundant. We identified the modality mismatch introduced by text-only fine-tuning as a key factor behind performance degradation and proposed a hybrid batching strategy that mixes speech-text pairs with tex...

  7. [7]

    Generative AI Use Disclosure. Generative AI was used to proofread the paper, fix orthographic and grammatical errors, and reduce the length of the long sections of the paper

  8. [8]

    An embarrassingly simple approach for LLM with strong ASR capacity,

    Z. Ma, G. Yang, Y. Yang, Z. Gao, J. Wang, Z. Du, F. Yu, Q. Chen, S. Zheng, S. Zhang et al., “An embarrassingly simple approach for llm with strong asr capacity,” arXiv preprint arXiv:2402.08846, 2024

  9. [9]

    Slm: Bridge the thin gap between speech and text foundation models,

    M. Wang, W. Han, I. Shafran, Z. Wu, C.-C. Chiu, Y. Cao, N. Chen, Y. Zhang, H. Soltau, P. K. Rubenstein et al., “Slm: Bridge the thin gap between speech and text foundation models,” in 2023 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU). IEEE, 2023, pp. 1–8

  10. [10]

    Connecting speech encoder and large language model for asr,

    W. Yu, C. Tang, G. Sun, X. Chen, T. Tan, W. Li, L. Lu, Z. Ma, and C. Zhang, “Connecting speech encoder and large language model for asr,” in ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2024, pp. 12637–12641

  11. [11]

    Bridging the modality gap: Softly discretizing audio representation for llm-based automatic speech recognition,

    M. Yang, S.-J. Chen, J. Xie, and J. Hansen, “Bridging the modality gap: Softly discretizing audio representation for llm-based automatic speech recognition,” arXiv preprint arXiv:2506.05706, 2025

  12. [12]

    Parameter-efficient fine-tuning methods for pretrained language models: A critical review and assessment,

    L. Xu, H. Xie, S. J. Qin, X. Tao, and F. L. Wang, “Parameter-efficient fine-tuning methods for pretrained language models: A critical review and assessment,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2026

  13. [13]

    Learning from models beyond fine-tuning,

    H. Zheng, L. Shen, A. Tang, Y. Luo, H. Hu, B. Du, Y. Wen, and D. Tao, “Learning from models beyond fine-tuning,” Nature Machine Intelligence, vol. 7, no. 1, pp. 6–17, 2025

  14. [14]

    Prompting large language models for zero-shot domain adaptation in speech recognition,

    Y. Li, Y. Wu, J. Li, and S. Liu, “Prompting large language models for zero-shot domain adaptation in speech recognition,” in 2023 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU). IEEE, 2023, pp. 1–8

  15. [15]

    wav2vec 2.0: A framework for self-supervised learning of speech representations,

    A. Baevski, Y. Zhou, A. Mohamed, and M. Auli, “wav2vec 2.0: A framework for self-supervised learning of speech representations,” Advances in neural information processing systems, vol. 33, pp. 12449–12460, 2020

  16. [16]

    Hubert: Self-supervised speech representation learning by masked prediction of hidden units,

    W.-N. Hsu, B. Bolte, Y.-H. H. Tsai, K. Lakhotia, R. Salakhutdinov, and A. Mohamed, “Hubert: Self-supervised speech representation learning by masked prediction of hidden units,” IEEE/ACM transactions on audio, speech, and language processing, vol. 29, pp. 3451–3460, 2021

  17. [17]

    A simple baseline for domain adaptation in end to end asr systems using synthetic data,

    R. Joshi and A. Singh, “A simple baseline for domain adaptation in end to end asr systems using synthetic data,” in Proceedings of the Fifth Workshop on e-Commerce and NLP (ECNLP 5), 2022, pp. 244–249

  18. [18]

    Magic dust for cross-lingual adaptation of monolingual wav2vec-2.0,

    S. Khurana, A. Laurent, and J. Glass, “Magic dust for cross-lingual adaptation of monolingual wav2vec-2.0,” in ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2022, pp. 6647–6651

  19. [19]

    Cross-language transfer learning, continuous learning, and domain adaptation for end-to-end automatic speech recognition,

    J. Huang, O. Kuchaiev, P. O’Neill, V. Lavrukhin, J. Li, A. Flores, G. Kucsko, and B. Ginsburg, “Cross-language transfer learning, continuous learning, and domain adaptation for end-to-end automatic speech recognition,” arXiv preprint arXiv:2005.04290, 2020

  20. [20]

    Two-stage domain adaptation for llm-based asr by decoupling linguistic and acoustic factors,

    L. Zheng, X. Wang, Q. Zhao, and T. Li, “Two-stage domain adaptation for llm-based asr by decoupling linguistic and acoustic factors,” Applied Sciences, vol. 16, no. 1, p. 60, 2025

  21. [21]

    Low-resource domain adaptation for speech LLMs via text-only fine-tuning,

    Y. Fang, J. Peng, X. Li, Y. Xi, C. Zhang, G. Zhong, and K. Yu, “Low-resource domain adaptation for speech llms via text-only fine-tuning,” arXiv preprint arXiv:2506.05671, 2025

  22. [22]

    Text-only adaptation in llm-based asr through text denoising,

    A. Carofilis, S. Burdisso, E. Villatoro-Tello, S. Kumar, K. Hacioglu, S. Madikeri, P. Rangappa, M. K. E, P. Motlicek, S. Venkatesan, and A. Stolcke, “Text-only adaptation in llm-based asr through text denoising,” 2026. [Online]. Available: https://arxiv.org/abs/2601.20900

  23. [23]

    Effective text adaptation for llm-based asr through soft prompt fine-tuning,

    Y. Ma, Z. Liu, and O. Kalinli, “Effective text adaptation for llm-based asr through soft prompt fine-tuning,” in 2024 IEEE Spoken Language Technology Workshop (SLT). IEEE, 2024, pp. 64–69

  24. [24]

    Wavlm: Large-scale self-supervised pre-training for full stack speech processing,

    S. Chen, C. Wang, Z. Chen, Y. Wu, S. Liu, Z. Chen, J. Li, N. Kanda, T. Yoshioka, X. Xiao et al., “Wavlm: Large-scale self-supervised pre-training for full stack speech processing,” IEEE Journal of Selected Topics in Signal Processing, vol. 16, no. 6, pp. 1505–1518, 2022

  25. [25]

    The Llama 3 Herd of Models

    A. Grattafiori, A. Dubey, A. Jauhri, A. Pandey, A. Kadian, A. Al-Dahle, A. Letman, A. Mathur, A. Schelten, A. Vaughan et al., “The llama 3 herd of models,” arXiv preprint arXiv:2407.21783, 2024

  26. [26]

    Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality,

    W.-L. Chiang, Z. Li, Z. Lin, Y. Sheng, Z. Wu, H. Zhang, L. Zheng, S. Zhuang, Y. Zhuang, J. E. Gonzalez et al., “Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality,” See https://vicuna.lmsys.org (accessed 14 April 2023), vol. 2, no. 3, p. 6, 2023

  27. [27]

    SALMONN: Towards Generic Hearing Abilities for Large Language Models,

    C. Tang, W. Yu, G. Sun, X. Chen, T. Tan, W. Li, L. Lu, M. Zejun, and C. Zhang, “SALMONN: Towards Generic Hearing Abilities for Large Language Models,” in The Twelfth International Conference on Learning Representations, 2024

  28. [28]

    On decoder-only architecture for speech-to-text and large language model integration,

    J. Wu, Y. Gaur, Z. Chen, L. Zhou, Y. Zhu, T. Wang, J. Li, S. Liu, B. Ren, L. Liu et al., “On decoder-only architecture for speech-to-text and large language model integration,” in 2023 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU). IEEE, 2023, pp. 1–8

  29. [29]

    Audio flamingo 3: Advancing audio intelligence with fully open large audio language models,

    S. Ghosh, A. Goel, J. Kim, S. Kumar, Z. Kong, S.-g. Lee, C.-H. H. Yang, R. Duraiswami, D. Manocha, R. Valle et al., “Audio flamingo 3: Advancing audio intelligence with fully open large audio language models,” in The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025

  30. [30]

    Efficient data selection for domain adaptation of asr using pseudo-labels and multi-stage filtering,

    P. Rangappa, A. Carofilis, J. Prakash, S. Kumar, S. Burdisso, S. Madikeri, E. Villatoro-Tello, B. Sharma, P. Motlicek, K. Hacioglu et al., “Efficient data selection for domain adaptation of asr using pseudo-labels and multi-stage filtering,” in Proc. Interspeech 2025, 2025, pp. 4928–4932

  31. [31]

    Minimum effort adaptation of automatic speech recognition system in air traffic management,

    M. Bhattacharjee, P. Motlicek, S. Madikeri, H. Helmke, O. Ohneiser, M. Kleinert, and H. Ehr, “Minimum effort adaptation of automatic speech recognition system in air traffic management,” European Journal of Transport and Infrastructure Research, vol. 42, no. 4, pp. 133–153, 2024

  32. [32]

    An end-to-end overview of clinical speech ai,

    S.-I. Ng, L. Xu, I. Siegert, N. Cummins, N. R. Benway, J. Liss, and V. Berisha, “An end-to-end overview of clinical speech ai,” IEEE Transactions on Audio, Speech and Language Processing, 2026

  33. [33]

    Large-scale asr domain adaptation using self-and semi-supervised learning,

    D. Hwang, A. Misra, Z. Huo, N. Siddhartha, S. Garg, D. Qiu, K. C. Sim, T. Strohman, F. Beaufays, and Y. He, “Large-scale asr domain adaptation using self-and semi-supervised learning,” in ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2022, pp. 6627–6631

  34. [34]

    LoRA: Low-rank adaptation of large language models,

    E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, and W. Chen, “LoRA: Low-rank adaptation of large language models,” in International Conference on Learning Representations, 2022. [Online]. Available: https://openreview.net/forum?id=nZeVKeeFYf9

  35. [35]

    Slidespeech: A large scale slide-enriched audio-visual corpus,

    H. Wang, F. Yu, X. Shi, Y. Wang, S. Zhang, and M. Li, “Slidespeech: A large scale slide-enriched audio-visual corpus,” in ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2024, pp. 11076–11080