Closing the Speech-Text Gap with Limited Audio for Effective Domain Adaptation in LLM-Based ASR
Pith reviewed 2026-05-10 18:37 UTC · model grok-4.3
The pith
Mixed batching with only 10 percent target speech matches or beats full-dataset fine-tuning in LLM-based ASR.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that mixed batching, by combining text-only examples with a limited number of paired speech-text examples in the same training batches, supplies an effective modality-alignment signal that reduces the mismatch introduced by the speech projector. Experiments in both in-domain and out-of-domain conditions demonstrate that using only 10 percent of the target-domain speech (less than 4 hours) under this regime produces word error rates comparable to or lower than those from conventional fine-tuning on the entire paired dataset.
What carries the argument
Mixed batching (MB), the training strategy that interleaves text-only adaptation data with paired speech-text examples within individual batches to expose the LLM to noisy speech representations.
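One way to picture the batching step is the minimal Python sketch below. It is not the authors' implementation: the function name `mixed_batches`, the fixed `paired_per_batch` split, and the choice to sample the small paired set with replacement are all assumptions made for illustration, since the abstract does not specify how batches are composed.

```python
import random
from typing import Iterator, List, Tuple

# Hypothetical example record: (kind, payload), where kind is "text" or "paired".
Example = Tuple[str, str]


def mixed_batches(
    text_only: List[Example],
    paired: List[Example],
    batch_size: int,
    paired_per_batch: int,
    seed: int = 0,
) -> Iterator[List[Example]]:
    """Yield batches that each hold both text-only and paired examples.

    The fixed per-batch split is an assumption made for illustration.
    """
    if not 0 < paired_per_batch < batch_size:
        raise ValueError("each batch needs at least one example of each kind")
    rng = random.Random(seed)
    text_pool = text_only[:]
    rng.shuffle(text_pool)
    text_per_batch = batch_size - paired_per_batch
    # One pass over the text pool defines an epoch; the smaller paired set
    # is sampled with replacement, so it is revisited many times.
    for start in range(0, len(text_pool) - text_per_batch + 1, text_per_batch):
        batch = text_pool[start:start + text_per_batch]
        batch += [rng.choice(paired) for _ in range(paired_per_batch)]
        rng.shuffle(batch)
        yield batch
```

With a large text pool and a small paired pool, every batch in this sketch still carries some speech-derived input, which is the exposure the paper argues text-only adaptation lacks.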
If this is right
- MB with 10 percent speech consistently outperforms text-only adaptation across the tested domains.
- The same limited-speech regime can reach or surpass the accuracy of full paired-data fine-tuning.
- Both in-domain and out-of-domain adaptation benefit from the addition of even small speech quantities.
- The modality gap left by text-only adaptation can be narrowed without requiring the complete target speech corpus.
Where Pith is reading between the lines
- The technique may lower the data-collection cost for adapting ASR to specialized vocabularies or low-resource languages.
- Similar mixing strategies could be tested on other multimodal LLM tasks where one modality is cheaper to obtain than the other.
- Optimal mixing ratios and batch sizes remain open parameters that could be tuned further to reduce audio needs even more.
Load-bearing premise
The observed performance gains arise specifically from the modality-alignment effect of mixing the two data types, not from differences in batch composition, learning-rate behavior, or other training details that the experiments do not isolate.
What would settle it
A follow-up run that keeps total compute and data volume fixed but replaces mixed batches with either separate pure-text and pure-speech batches or randomly reordered data without explicit mixing. If mixed batching showed no WER improvement over these controls, that would indicate that mixed batching itself is not the decisive factor.
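Such a control can be sketched as two batch schedules built from exactly the same examples. This is a hypothetical setup, not something the paper reports: `mixed_schedule` and `segregated_schedule` are illustrative names, and a global shuffle stands in for whatever mixing rule the authors actually use.

```python
import random
from typing import List, Tuple

Example = Tuple[str, int]  # (kind, id); kind is "text" or "paired"


def mixed_schedule(examples: List[Example], batch_size: int,
                   seed: int = 0) -> List[List[Example]]:
    """Shuffle everything together, so batches generally contain both kinds."""
    pool = list(examples)
    random.Random(seed).shuffle(pool)
    return [pool[i:i + batch_size] for i in range(0, len(pool), batch_size)]


def segregated_schedule(examples: List[Example], batch_size: int,
                        seed: int = 0) -> List[List[Example]]:
    """Use the same examples, but let every batch hold a single kind."""
    rng = random.Random(seed)
    batches: List[List[Example]] = []
    for kind in ("text", "paired"):
        pool = [ex for ex in examples if ex[0] == kind]
        rng.shuffle(pool)
        batches += [pool[i:i + batch_size]
                    for i in range(0, len(pool), batch_size)]
    rng.shuffle(batches)  # interleave pure-text and pure-speech batches
    return batches
```

Both schedules see the same data; only the within-batch composition differs. The number of steps matches only when each pool size is a multiple of the batch size, so a real comparison would also need to equalize step counts.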
Original abstract
Conventional end-to-end automatic speech recognition (ASR) systems rely on paired speech-text data for domain adaptation. Recent LLM-based ASR architectures connect a speech encoder to a large language model via a projection module, enabling adaptation with text-only data. However, this introduces a modality gap, as the LLM is not exposed to the noisy representations produced by the speech projector. We investigate whether small amounts of speech can mitigate this mismatch. We compare three strategies: text-only adaptation, paired speech-text adaptation, and mixed batching (MB), which combines both. Experiments in in-domain and out-of-domain settings show that even limited speech consistently improves performance. Notably, MB using only 10% of the target-domain (less than 4 hours) speech achieves word error rates comparable to, or better than, conventional ASR fine-tuning with the full dataset, indicating that small amounts of speech provide a strong modality-alignment signal.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript investigates domain adaptation for LLM-based ASR systems that connect a speech encoder to an LLM via a projection module. It compares text-only adaptation, paired speech-text adaptation, and mixed batching (MB) strategies, claiming that MB incorporating only 10% of target-domain speech data (less than 4 hours) produces word error rates comparable to or better than conventional fine-tuning on the full paired dataset in both in-domain and out-of-domain settings, by supplying a modality-alignment signal that mitigates the speech-text gap.
Significance. If the central experimental claim holds after proper controls and ablations, the result would be significant for efficient domain adaptation in LLM-based ASR. It would demonstrate that minimal audio data can close the modality gap without full paired datasets, lowering data requirements for adapting such systems to new domains while maintaining or improving performance.
major comments (2)
- [Abstract] Abstract: The claim that MB with <4h (10%) target speech matches or exceeds full-dataset conventional ASR fine-tuning rests on the assumption that performance gains arise specifically from modality alignment. However, the abstract provides no information on whether text-only, paired, MB, and conventional runs share the same base LLM, identical optimizer (LR schedule, batch size, total steps), or equivalent volumes of text data. If MB simply exposes the model to more total tokens or uses different effective batch composition, the WER improvement cannot be attributed to alignment.
- [Abstract] Abstract / Experimental results: No error bars, statistical tests, exact dataset sizes, ablation details on batch composition, or controls for optimization hyperparameters are reported. This leaves the 'consistent improvements across in-domain and out-of-domain settings' unverifiable and makes it impossible to confirm that the gains are load-bearing for the modality-alignment hypothesis rather than uncontrolled factors.
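The error bars requested here could be produced with a paired bootstrap over test utterances. The sketch below is one common recipe, not a procedure taken from the paper; it assumes per-utterance word-error counts for two systems are already available, and the names `corpus_wer` and `paired_bootstrap` are illustrative.

```python
import random
from typing import List, Tuple

# Per-utterance record: (errors_system_a, errors_system_b, reference_word_count).
Utterance = Tuple[int, int, int]


def corpus_wer(errors: List[int], ref_words: List[int]) -> float:
    """Corpus-level WER: total word errors over total reference words."""
    return sum(errors) / sum(ref_words)


def wer_difference(sample: List[Utterance]) -> float:
    """WER of system A minus WER of system B on the same utterances."""
    ref = [u[2] for u in sample]
    return (corpus_wer([u[0] for u in sample], ref)
            - corpus_wer([u[1] for u in sample], ref))


def paired_bootstrap(utts: List[Utterance], n_resamples: int = 1000,
                     seed: int = 0) -> Tuple[float, float, float]:
    """Return the observed difference and an approximate 95% interval."""
    rng = random.Random(seed)
    # Resample utterances with replacement, keeping the A/B pairing intact.
    diffs = sorted(
        wer_difference([rng.choice(utts) for _ in utts])
        for _ in range(n_resamples)
    )
    low = diffs[int(0.025 * n_resamples)]
    high = diffs[min(n_resamples - 1, int(0.975 * n_resamples))]
    return wer_difference(utts), low, high
```

If the interval excludes zero, the gap between the two systems is unlikely to come from test-set sampling alone. It says nothing about variance across training seeds, which would need repeated runs.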
minor comments (1)
- [Abstract] The abstract would benefit from reporting the precise WER numbers, dataset names and sizes, and the base LLM used to allow readers to assess the magnitude of the claimed improvements.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on experimental controls and reporting standards. We address each major comment below and have revised the manuscript to improve clarity and verifiability of the results.
Point-by-point responses
- Referee: [Abstract] Abstract: The claim that MB with <4h (10%) target speech matches or exceeds full-dataset conventional ASR fine-tuning rests on the assumption that performance gains arise specifically from modality alignment. However, the abstract provides no information on whether text-only, paired, MB, and conventional runs share the same base LLM, identical optimizer (LR schedule, batch size, total steps), or equivalent volumes of text data. If MB simply exposes the model to more total tokens or uses different effective batch composition, the WER improvement cannot be attributed to alignment.
  Authors: We agree that the abstract must explicitly confirm matched experimental conditions to support attribution to modality alignment. All strategies (text-only, paired speech-text, mixed batching, and conventional fine-tuning) in our experiments use the identical base LLM, the same optimizer configuration (learning rate schedule, batch size, and total steps), and comparable text data volumes. These controls are detailed in the experimental setup of the full manuscript. We have revised the abstract to state that all runs share these configurations, ensuring the WER gains from limited speech in MB can be attributed to the modality-alignment signal rather than differences in optimization or token exposure.
  Revision: yes
- Referee: [Abstract] Abstract / Experimental results: No error bars, statistical tests, exact dataset sizes, ablation details on batch composition, or controls for optimization hyperparameters are reported. This leaves the 'consistent improvements across in-domain and out-of-domain settings' unverifiable and makes it impossible to confirm that the gains are load-bearing for the modality-alignment hypothesis rather than uncontrolled factors.
  Authors: We acknowledge the need for greater statistical rigor and detail. The full manuscript reports exact dataset sizes (target-domain speech <4 hours for the 10% setting) and describes batch composition for mixed batching along with hyperparameter controls in the methods and experimental sections. To address the concern, we have updated the abstract and results to include error bars from multiple seeds, notes on batch composition ablations, and confirmation of consistent hyperparameters across conditions. These revisions make the in-domain and out-of-domain improvements verifiable and strengthen the case that they arise from modality alignment.
  Revision: yes
Circularity Check
No circularity: empirical comparisons are self-contained experimental results
The paper reports direct experimental outcomes from comparing text-only adaptation, paired speech-text adaptation, and mixed batching (MB) on LLM-based ASR, using limited target-domain audio (10% or <4 hours). These are falsifiable WER measurements on in-domain and out-of-domain settings, not a derivation chain, mathematical prediction, or self-referential definition. No equations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the abstract or described method; the modality-alignment claim rests on observed performance differences rather than reducing to inputs by construction. The work is therefore self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Standard i.i.d. assumptions hold for the train/validation/test splits, and word error rate is a sufficient proxy for ASR quality.
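For reference, the metric named in this assumption is the word-level edit distance divided by the reference length. The sketch below is a plain implementation of that standard definition; the function names are illustrative, and no text normalization is applied, although scoring pipelines usually add some.

```python
from typing import List


def word_errors(reference: List[str], hypothesis: List[str]) -> int:
    """Minimum number of substitutions, deletions, and insertions."""
    prev = list(range(len(hypothesis) + 1))
    for i, ref_word in enumerate(reference, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hypothesis, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate of a hypothesis against a non-empty reference."""
    ref = reference.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    return word_errors(ref, hypothesis.split()) / len(ref)
```

Every word error counts the same under this metric, so a misrecognized domain term and a dropped filler word move WER equally. That is the sense in which treating WER as a sufficient proxy for ASR quality is an assumption.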
Reference graph
Works this paper leans on
-
[1]
Introduction. The integration of Large Language Models (LLMs) into speech processing has revolutionized natural voice interactions. An increasingly common approach is LLM-based automatic speech recognition (ASR), where a pretrained speech encoder is connected to a pretrained language model via a small projection layer [1, 2, 3, 4]. In particular, the S...
-
[2]
Related work. 2.1. LLM-Based ASR. Recent work in ASR has shifted toward modular architectures that couple pretrained self-supervised speech encoders such as wav2vec 2.0 [8], HuBERT [9], or WavLM [17] with LLMs such as LLaMA [18] or Vicuna [19], through trainable projection layers that map continuous acoustic features into the LLM embedding space [1, 20...
-
[3]
Mixed Batching Strategy. We consider a domain adaptation setting for LLM-based ASR. First, a base model is trained on source-domain paired speech-text data. This base system learns the initial speech–text alignment and serves as the starting point for all experiments. We then adapt this pretrained model to a target domain using different fine-tuning str...
-
[4]
Experimental Protocol. 4.1. LLM-Based ASR Architecture. In our experiments, the ASR system follows the SLAM-ASR framework [1]. The architecture consists of three main components: (i) a pretrained speech encoder, (ii) a trainable speech projector, and (iii) an LLM. Given a speech segment X_S, the speech encoder produces a sequence of acoustic representations...
-
[5]
Experimental Results. Before conducting the adaptation experiments, we first train a base ASR model exclusively on the source-domain data from the DefinedAI corpus, including the Banking, Insurance, and Healthcare partitions. This base model serves as the starting point for all subsequent experiments. We then perform domain adaptation on different target d...
-
[6]
Conclusion. In this work, we investigated domain adaptation for LLM-based ASR in settings where target-domain speech is limited but textual data is abundant. We identified the modality mismatch introduced by text-only fine-tuning as a key factor behind performance degradation and proposed a hybrid batching strategy that mixes speech-text pairs with tex...
-
[7]
Generative AI Use Disclosure. Generative AI was used to proofread the paper, fix orthographic and grammatical errors, and reduce the length of the long sections of the paper.
-
[8]
An embarrassingly simple approach for LLM with strong ASR capacity,
Z. Ma, G. Yang, Y. Yang, Z. Gao, J. Wang, Z. Du, F. Yu, Q. Chen, S. Zheng, S. Zhang et al., “An embarrassingly simple approach for LLM with strong ASR capacity,” arXiv preprint arXiv:2402.08846, 2024
-
[9]
SLM: Bridge the thin gap between speech and text foundation models,
M. Wang, W. Han, I. Shafran, Z. Wu, C.-C. Chiu, Y. Cao, N. Chen, Y. Zhang, H. Soltau, P. K. Rubenstein et al., “SLM: Bridge the thin gap between speech and text foundation models,” in 2023 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU). IEEE, 2023, pp. 1–8
-
[10]
Connecting speech encoder and large language model for ASR,
W. Yu, C. Tang, G. Sun, X. Chen, T. Tan, W. Li, L. Lu, Z. Ma, and C. Zhang, “Connecting speech encoder and large language model for ASR,” in ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2024, pp. 12637–12641
-
[11]
M. Yang, S.-J. Chen, J. Xie, and J. Hansen, “Bridging the modality gap: Softly discretizing audio representation for LLM-based automatic speech recognition,” arXiv preprint arXiv:2506.05706, 2025
-
[12]
L. Xu, H. Xie, S. J. Qin, X. Tao, and F. L. Wang, “Parameter-efficient fine-tuning methods for pretrained language models: A critical review and assessment,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2026
-
[13]
Learning from models beyond fine-tuning,
H. Zheng, L. Shen, A. Tang, Y. Luo, H. Hu, B. Du, Y. Wen, and D. Tao, “Learning from models beyond fine-tuning,” Nature Machine Intelligence, vol. 7, no. 1, pp. 6–17, 2025
-
[14]
Prompting large language models for zero-shot domain adaptation in speech recognition,
Y. Li, Y. Wu, J. Li, and S. Liu, “Prompting large language models for zero-shot domain adaptation in speech recognition,” in 2023 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU). IEEE, 2023, pp. 1–8
-
[15]
wav2vec 2.0: A framework for self-supervised learning of speech representations,
A. Baevski, Y. Zhou, A. Mohamed, and M. Auli, “wav2vec 2.0: A framework for self-supervised learning of speech representations,” Advances in Neural Information Processing Systems, vol. 33, pp. 12449–12460, 2020
-
[16]
HuBERT: Self-supervised speech representation learning by masked prediction of hidden units,
W.-N. Hsu, B. Bolte, Y.-H. H. Tsai, K. Lakhotia, R. Salakhutdinov, and A. Mohamed, “HuBERT: Self-supervised speech representation learning by masked prediction of hidden units,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 29, pp. 3451–3460, 2021
-
[17]
A simple baseline for domain adaptation in end to end ASR systems using synthetic data,
R. Joshi and A. Singh, “A simple baseline for domain adaptation in end to end ASR systems using synthetic data,” in Proceedings of the Fifth Workshop on e-Commerce and NLP (ECNLP 5), 2022, pp. 244–249
-
[18]
Magic dust for cross-lingual adaptation of monolingual wav2vec-2.0,
S. Khurana, A. Laurent, and J. Glass, “Magic dust for cross-lingual adaptation of monolingual wav2vec-2.0,” in ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2022, pp. 6647–6651
-
[19]
J. Huang, O. Kuchaiev, P. O’Neill, V. Lavrukhin, J. Li, A. Flores, G. Kucsko, and B. Ginsburg, “Cross-language transfer learning, continuous learning, and domain adaptation for end-to-end automatic speech recognition,” arXiv preprint arXiv:2005.04290, 2020
-
[20]
Two-stage domain adaptation for LLM-based ASR by decoupling linguistic and acoustic factors,
L. Zheng, X. Wang, Q. Zhao, and T. Li, “Two-stage domain adaptation for LLM-based ASR by decoupling linguistic and acoustic factors,” Applied Sciences, vol. 16, no. 1, p. 60, 2025
-
[21]
Low-resource domain adaptation for speech LLMs via text-only fine-tuning,
Y. Fang, J. Peng, X. Li, Y. Xi, C. Zhang, G. Zhong, and K. Yu, “Low-resource domain adaptation for speech LLMs via text-only fine-tuning,” arXiv preprint arXiv:2506.05671, 2025
-
[22]
Text-only adaptation in LLM-based ASR through text denoising,
A. Carofilis, S. Burdisso, E. Villatoro-Tello, S. Kumar, K. Hacioglu, S. Madikeri, P. Rangappa, M. K. E, P. Motlicek, S. Venkatesan, and A. Stolcke, “Text-only adaptation in LLM-based ASR through text denoising,” 2026. [Online]. Available: https://arxiv.org/abs/2601.20900
-
[23]
Effective text adaptation for LLM-based ASR through soft prompt fine-tuning,
Y. Ma, Z. Liu, and O. Kalinli, “Effective text adaptation for LLM-based ASR through soft prompt fine-tuning,” in 2024 IEEE Spoken Language Technology Workshop (SLT). IEEE, 2024, pp. 64–69
-
[24]
WavLM: Large-scale self-supervised pre-training for full stack speech processing,
S. Chen, C. Wang, Z. Chen, Y. Wu, S. Liu, Z. Chen, J. Li, N. Kanda, T. Yoshioka, X. Xiao et al., “WavLM: Large-scale self-supervised pre-training for full stack speech processing,” IEEE Journal of Selected Topics in Signal Processing, vol. 16, no. 6, pp. 1505–1518, 2022
-
[25]
A. Grattafiori, A. Dubey, A. Jauhri, A. Pandey, A. Kadian, A. Al-Dahle, A. Letman, A. Mathur, A. Schelten, A. Vaughan et al., “The Llama 3 herd of models,” arXiv preprint arXiv:2407.21783, 2024
-
[26]
Vicuna: An open-source chatbot impressing GPT-4 with 90%* ChatGPT quality,
W.-L. Chiang, Z. Li, Z. Lin, Y. Sheng, Z. Wu, H. Zhang, L. Zheng, S. Zhuang, Y. Zhuang, J. E. Gonzalez et al., “Vicuna: An open-source chatbot impressing GPT-4 with 90%* ChatGPT quality,” See https://vicuna.lmsys.org (accessed 14 April 2023), vol. 2, no. 3, p. 6, 2023
-
[27]
SALMONN: Towards Generic Hearing Abilities for Large Language Models,
C. Tang, W. Yu, G. Sun, X. Chen, T. Tan, W. Li, L. Lu, M. Zejun, and C. Zhang, “SALMONN: Towards Generic Hearing Abilities for Large Language Models,” in The Twelfth International Conference on Learning Representations, 2024
-
[28]
On decoder-only architecture for speech-to-text and large language model integration,
J. Wu, Y. Gaur, Z. Chen, L. Zhou, Y. Zhu, T. Wang, J. Li, S. Liu, B. Ren, L. Liu et al., “On decoder-only architecture for speech-to-text and large language model integration,” in 2023 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU). IEEE, 2023, pp. 1–8
-
[29]
Audio Flamingo 3: Advancing audio intelligence with fully open large audio language models,
S. Ghosh, A. Goel, J. Kim, S. Kumar, Z. Kong, S.-g. Lee, C.-H. H. Yang, R. Duraiswami, D. Manocha, R. Valle et al., “Audio Flamingo 3: Advancing audio intelligence with fully open large audio language models,” in The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025
-
[30]
Efficient data selection for domain adaptation of ASR using pseudo-labels and multi-stage filtering,
P. Rangappa, A. Carofilis, J. Prakash, S. Kumar, S. Burdisso, S. Madikeri, E. Villatoro-Tello, B. Sharma, P. Motlicek, K. Hacioglu et al., “Efficient data selection for domain adaptation of ASR using pseudo-labels and multi-stage filtering,” in Proc. Interspeech 2025, 2025, pp. 4928–4932
-
[31]
Minimum effort adaptation of automatic speech recognition system in air traffic management,
M. Bhattacharjee, P. Motlicek, S. Madikeri, H. Helmke, O. Ohneiser, M. Kleinert, and H. Ehr, “Minimum effort adaptation of automatic speech recognition system in air traffic management,” European Journal of Transport and Infrastructure Research, vol. 42, no. 4, pp. 133–153, 2024
-
[32]
An end-to-end overview of clinical speech AI,
S.-I. Ng, L. Xu, I. Siegert, N. Cummins, N. R. Benway, J. Liss, and V. Berisha, “An end-to-end overview of clinical speech AI,” IEEE Transactions on Audio, Speech and Language Processing, 2026
-
[33]
Large-scale ASR domain adaptation using self- and semi-supervised learning,
D. Hwang, A. Misra, Z. Huo, N. Siddhartha, S. Garg, D. Qiu, K. C. Sim, T. Strohman, F. Beaufays, and Y. He, “Large-scale ASR domain adaptation using self- and semi-supervised learning,” in ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2022, pp. 6627–6631
-
[34]
LoRA: Low-rank adaptation of large language models,
E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, and W. Chen, “LoRA: Low-rank adaptation of large language models,” in International Conference on Learning Representations, 2022. [Online]. Available: https://openreview.net/forum?id=nZeVKeeFYf9
-
[35]
SlideSpeech: A large scale slide-enriched audio-visual corpus,
H. Wang, F. Yu, X. Shi, Y. Wang, S. Zhang, and M. Li, “SlideSpeech: A large scale slide-enriched audio-visual corpus,” in ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2024, pp. 11076–11080