Closing the Speech-Text Gap with Limited Audio for Effective Domain Adaptation in LLM-Based ASR

Andreas Stolcke; Dairazalia Sánchez-Cortés; Esaú Villatoro-Tello; Hasindri Watawana; Kadri Hacioglu; Manjunath K E; Petr Motlicek; Sergio Burdisso; Severin Baroudi; Shashi Kumar

arxiv: 2604.06487 · v1 · submitted 2026-04-07 · 💻 cs.CL

Thibault Bañeras-Roux, Sergio Burdisso, Esaú Villatoro-Tello, Dairazalia Sánchez-Cortés, Shiran Liu, Severin Baroudi, Shashi Kumar, Hasindri Watawana, Manjunath K E, Kadri Hacioglu, Petr Motlicek, Andreas Stolcke

Pith reviewed 2026-05-10 18:37 UTC · model grok-4.3

classification 💻 cs.CL

keywords LLM-based ASR · domain adaptation · modality gap · mixed batching · limited speech data · speech recognition · fine-tuning · text-only adaptation

The pith

Mixed batching with only 10 percent target speech matches or beats full-dataset fine-tuning in LLM-based ASR.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper investigates ways to adapt LLM-based automatic speech recognition systems to new domains when only limited paired speech-text data is available. It compares text-only adaptation, full paired adaptation, and a mixed batching approach that interleaves the two types of data. The results indicate that even a small fraction of speech data provides a useful signal for closing the gap between the speech encoder outputs and the language model's text expectations. This matters because gathering large amounts of domain-specific audio is costly, so methods that work with far less audio could make adaptation feasible in more settings.

Core claim

The central claim is that mixed batching, by combining text-only examples with a limited number of paired speech-text examples in the same training batches, supplies an effective modality-alignment signal that reduces the mismatch introduced by the speech projector. Experiments in both in-domain and out-of-domain conditions demonstrate that using only 10 percent of the target-domain speech (less than 4 hours) under this regime produces word error rates comparable to or lower than those from conventional fine-tuning on the entire paired dataset.

What carries the argument

Mixed batching (MB), the training strategy that interleaves text-only adaptation data with paired speech-text examples within individual batches to expose the LLM to noisy speech representations.
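As a rough illustration of the idea, the sketch below builds batches that always contain both paired speech-text items and text-only items. It is not the authors' implementation: the function name `mixed_batches`, the fixed per-batch `speech_fraction`, and sampling with replacement are assumptions made only to keep the example small.

```python
import random
from typing import Iterator, List, Sequence, Tuple

Sample = Tuple[str, str]  # (modality, item id); modality is "speech" or "text"


def mixed_batches(
    paired_ids: Sequence[str],
    text_ids: Sequence[str],
    batch_size: int,
    speech_fraction: float,
    num_batches: int,
    seed: int = 0,
) -> Iterator[List[Sample]]:
    """Yield batches that each mix paired speech-text items with text-only items."""
    if not 0.0 < speech_fraction < 1.0:
        raise ValueError("speech_fraction must be strictly between 0 and 1")
    rng = random.Random(seed)
    # Assumed rule: a fixed share of every batch is speech, with at least one item.
    n_speech = max(1, round(batch_size * speech_fraction))
    n_text = batch_size - n_speech
    if n_text < 1:
        raise ValueError("batch_size is too small to hold both modalities")
    for _ in range(num_batches):
        # Sampling with replacement lets a small speech pool be reused across batches.
        batch = [("speech", i) for i in rng.choices(paired_ids, k=n_speech)]
        batch += [("text", i) for i in rng.choices(text_ids, k=n_text)]
        rng.shuffle(batch)  # interleave the two modalities inside the batch
        yield batch
```

In a real training loop, each `("speech", id)` item would be routed through the speech encoder and projector, while each `("text", id)` item would go to the LLM as text; how the paper handles that routing is not described in the material above.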

If this is right

MB with 10 percent speech consistently outperforms text-only adaptation across the tested domains.
The same limited-speech regime can reach or surpass the accuracy of full paired-data fine-tuning.
Both in-domain and out-of-domain adaptation benefit from the addition of even small speech quantities.
The modality gap left by text-only adaptation can be narrowed without requiring the complete target speech corpus.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The technique may lower the data-collection cost for adapting ASR to specialized vocabularies or low-resource languages.
Similar mixing strategies could be tested on other multimodal LLM tasks where one modality is cheaper to obtain than the other.
Optimal mixing ratios and batch sizes remain open parameters that could be tuned further to reduce audio needs even more.

Load-bearing premise

That the observed performance gains arise specifically from the modality-alignment effect of mixing the two data types rather than from differences in batch composition, learning-rate behavior, or other training details not isolated in the experiments.

What would settle it

A follow-up run that applies the same total compute and data volume but replaces mixed batches with either pure text batches plus separate speech batches or randomly reordered data without explicit mixing, and shows no WER improvement, would indicate that mixed batching itself is not the decisive factor.
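The two control schedules named above can be sketched as follows. This is a hypothetical illustration of the data ordering only: the function names are invented here, and equal sample coverage is used as a simple stand-in for "same total data volume".

```python
import random
from collections import Counter
from typing import List, Sequence, Tuple

Sample = Tuple[str, str]  # (modality, item id)


def _chunks(items: Sequence[Sample], size: int) -> List[List[Sample]]:
    """Split a sequence into consecutive batches of at most `size` items."""
    return [list(items[i:i + size]) for i in range(0, len(items), size)]


def segregated_schedule(speech_ids, text_ids, batch_size, seed=0):
    """Control A: every batch holds a single modality (pure speech or pure text)."""
    rng = random.Random(seed)
    batches: List[List[Sample]] = []
    for kind, ids in (("speech", speech_ids), ("text", text_ids)):
        tagged = [(kind, i) for i in ids]
        rng.shuffle(tagged)
        batches.extend(_chunks(tagged, batch_size))
    rng.shuffle(batches)  # randomize the order of the pure batches
    return batches


def shuffled_schedule(speech_ids, text_ids, batch_size, seed=0):
    """Control B: pool everything and shuffle, with no explicit per-batch mixing."""
    pool = [("speech", i) for i in speech_ids] + [("text", i) for i in text_ids]
    random.Random(seed).shuffle(pool)
    return _chunks(pool, batch_size)


def sample_counts(schedule) -> Counter:
    """Count how often each sample appears, to confirm schedules see the same data."""
    return Counter(sample for batch in schedule for sample in batch)
```

Comparing `sample_counts` across schedules checks that any WER difference cannot be traced to one run seeing more data than another. The number of batches can still differ when the pool sizes are not multiples of the batch size, so step counts would need separate matching.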

Figures

Figures reproduced from arXiv: 2604.06487 by Andreas Stolcke, Dairazalia Sánchez-Cortés, Esaú Villatoro-Tello, Hasindri Watawana, Kadri Hacioglu, Manjunath K E, Petr Motlicek, Sergio Burdisso, Severin Baroudi, Shashi Kumar, Shiran Liu, Thibault Bañeras-Roux.

**Figure 1.** Effect of different target text proportions (τt) on Mixed Batch adaptation on DefinedAI Banking. We investigate the effect of varying the proportion of target text data per batch (τt) in the text-only adaptation setting. No audio is included (τa = 0), and the remaining batch proportion is uniformly distributed across source auxiliary data. As shown in…

**Figure 2.** Performance of the base ASR and adapted models. The figure compares standard ASR adaptation using only paired speech–text data with mixed-batch adaptation combining paired speech–text and text-only samples. The x-axis indicates the proportion of speech in the adaptation data, ranging from 0% (text-only) to 100% (paired speech–text). This allows us to study how the model adapts under low-resource condition…

**Figure 3.** Performance of the base system and adapted models with text-only, paired speech-text, and mixed-batch adaptation. For both audio-adaptation settings, 20% of speech is used. As shown in…
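The Figure 1 caption describes a batch as a target-text share τt, a target-audio share τa, and a remainder spread uniformly over source auxiliary data. A minimal sketch of that bookkeeping follows. The rounding rule, the function name `batch_composition`, and the source labels `src_a`, `src_b`, `src_c` are assumptions for illustration, not details taken from the paper.

```python
from typing import Dict, Sequence


def batch_composition(
    batch_size: int, tau_t: float, tau_a: float, sources: Sequence[str]
) -> Dict[str, int]:
    """Return per-batch item counts for target text, target audio, and each source."""
    if tau_t < 0 or tau_a < 0 or tau_t + tau_a > 1:
        raise ValueError("tau_t and tau_a must be non-negative and sum to at most 1")
    n_text = round(batch_size * tau_t)
    # Cap the audio count so rounding can never overflow the batch.
    n_audio = min(round(batch_size * tau_a), batch_size - n_text)
    remaining = batch_size - n_text - n_audio
    if remaining and not sources:
        raise ValueError("no source data available to fill the remaining slots")
    counts = {"target_text": n_text, "target_audio": n_audio}
    if sources:
        # Spread the remainder as evenly as possible; early sources absorb leftovers.
        base, extra = divmod(remaining, len(sources))
        for index, name in enumerate(sources):
            counts[f"source:{name}"] = base + (1 if index < extra else 0)
    return counts
```

For example, `batch_composition(32, 0.5, 0.0, ["src_a", "src_b", "src_c"])` describes a text-only setting like the one in Figure 1: half the batch is target text, there is no target audio, and the other 16 slots are split across three source sets.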

Original abstract

Conventional end-to-end automatic speech recognition (ASR) systems rely on paired speech-text data for domain adaptation. Recent LLM-based ASR architectures connect a speech encoder to a large language model via a projection module, enabling adaptation with text-only data. However, this introduces a modality gap, as the LLM is not exposed to the noisy representations produced by the speech projector. We investigate whether small amounts of speech can mitigate this mismatch. We compare three strategies: text-only adaptation, paired speech-text adaptation, and mixed batching (MB), which combines both. Experiments in in-domain and out-of-domain settings show that even limited speech consistently improves performance. Notably, MB using only 10% of the target-domain (less than 4 hours) speech achieves word error rates comparable to, or better than, conventional ASR fine-tuning with the full dataset, indicating that small amounts of speech provide a strong modality-alignment signal.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

Mixed batching with 10% target speech matches full fine-tuning in LLM ASR, but the abstract leaves open whether gains trace to modality alignment or to differences in training dynamics.

The main thing to know is that this paper reports a practical result: mixing just 10% of the target-domain speech (under 4 hours) into adaptation batches lets an LLM-based ASR system reach word error rates that match or beat conventional fine-tuning on the entire paired dataset. The mixed-batching approach interleaves text-only examples with paired speech-text ones, so the LLM sees the actual speech encoder outputs during adaptation instead of only clean text embeddings. Experiments cover both in-domain and out-of-domain cases and show that even small amounts of speech help close the modality gap that text-only adaptation leaves behind. That observation is the concrete contribution, and it has clear value for anyone who cannot collect large matched audio sets for every new domain. The setup is straightforward, and the numbers suggest that the bottleneck is the mismatch between speech representations and the LLM rather than the sheer volume of speech.

The soft spot is the missing controls. The abstract does not state whether the text-only, paired, and mixed conditions share the same base LLM, the same total training steps, the same batch size, or the same learning-rate schedule. If the mixed batches simply change the token mix or the optimization path, the improvement could come from those factors instead of the intended alignment signal. No error bars, variance numbers, or statistical tests are mentioned, so the “comparable or better” claim is hard to judge precisely from the given information. Judged on the abstract alone, the concern about uncontrolled optimization differences therefore stands; if the full paper contains those ablations, they are not foregrounded.

This is aimed at researchers and engineers working on efficient domain adaptation for LLM-based ASR, especially in cost-sensitive or low-resource settings. A reader who needs to reduce audio collection costs will find the empirical pattern useful even if the exact mechanism needs tighter pinning down. The work deserves peer review because the core finding is a usable empirical result that can be checked and extended with fuller experimental detail.

Referee Report

2 major / 1 minor

Summary. The manuscript investigates domain adaptation for LLM-based ASR systems that connect a speech encoder to an LLM via a projection module. It compares text-only adaptation, paired speech-text adaptation, and mixed batching (MB) strategies, claiming that MB incorporating only 10% of target-domain speech data (less than 4 hours) produces word error rates comparable to or better than conventional fine-tuning on the full paired dataset in both in-domain and out-of-domain settings, by supplying a modality-alignment signal that mitigates the speech-text gap.

Significance. If the central experimental claim holds after proper controls and ablations, the result would be significant for efficient domain adaptation in LLM-based ASR. It would demonstrate that minimal audio data can close the modality gap without full paired datasets, lowering data requirements for adapting such systems to new domains while maintaining or improving performance.

major comments (2)

[Abstract] Abstract: The claim that MB with <4h (10%) target speech matches or exceeds full-dataset conventional ASR fine-tuning rests on the assumption that performance gains arise specifically from modality alignment. However, the abstract provides no information on whether text-only, paired, MB, and conventional runs share the same base LLM, identical optimizer (LR schedule, batch size, total steps), or equivalent volumes of text data. If MB simply exposes the model to more total tokens or uses different effective batch composition, the WER improvement cannot be attributed to alignment.
[Abstract] Abstract / Experimental results: No error bars, statistical tests, exact dataset sizes, ablation details on batch composition, or controls for optimization hyperparameters are reported. This leaves the 'consistent improvements across in-domain and out-of-domain settings' unverifiable and makes it impossible to confirm that the gains are load-bearing for the modality-alignment hypothesis rather than uncontrolled factors.
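One common way to attach uncertainty to a WER comparison of the kind requested here is a paired bootstrap over test utterances. The sketch below is an illustration only, not the authors' evaluation procedure; the function name, the per-utterance error counts it expects, and the resampling settings are all assumptions.

```python
import random
from typing import Sequence, Tuple


def bootstrap_wer_diff(
    errors_a: Sequence[int],
    errors_b: Sequence[int],
    ref_lengths: Sequence[int],
    n_resamples: int = 1000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float]:
    """Percentile interval for WER(A) - WER(B), resampling utterances in pairs.

    errors_x[i] is the word-error count of system x on utterance i, and
    ref_lengths[i] is the number of reference words (assumed positive).
    """
    n = len(ref_lengths)
    if not (len(errors_a) == len(errors_b) == n) or n == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_resamples):
        # Draw the same utterance indices for both systems (paired resampling).
        idx = [rng.randrange(n) for _ in range(n)]
        words = sum(ref_lengths[i] for i in idx)
        wer_a = sum(errors_a[i] for i in idx) / words
        wer_b = sum(errors_b[i] for i in idx) / words
        diffs.append(wer_a - wer_b)
    diffs.sort()
    lower = diffs[int((alpha / 2) * n_resamples)]
    upper = diffs[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper
```

If the resulting interval excludes zero, the difference between the two systems is unlikely to be a sampling artifact of the test set. Variation across training seeds is a separate source of noise and would need repeated runs.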

minor comments (1)

[Abstract] The abstract would benefit from reporting the precise WER numbers, dataset names and sizes, and the base LLM used to allow readers to assess the magnitude of the claimed improvements.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on experimental controls and reporting standards. We address each major comment below and have revised the manuscript to improve clarity and verifiability of the results.

Referee: [Abstract] Abstract: The claim that MB with <4h (10%) target speech matches or exceeds full-dataset conventional ASR fine-tuning rests on the assumption that performance gains arise specifically from modality alignment. However, the abstract provides no information on whether text-only, paired, MB, and conventional runs share the same base LLM, identical optimizer (LR schedule, batch size, total steps), or equivalent volumes of text data. If MB simply exposes the model to more total tokens or uses different effective batch composition, the WER improvement cannot be attributed to alignment.

Authors: We agree that the abstract must explicitly confirm matched experimental conditions to support attribution to modality alignment. All strategies (text-only, paired speech-text, mixed batching, and conventional fine-tuning) in our experiments use the identical base LLM, the same optimizer configuration (learning rate schedule, batch size, and total steps), and comparable text data volumes. These controls are detailed in the experimental setup of the full manuscript. We have revised the abstract to state that all runs share these configurations, ensuring the WER gains from limited speech in MB can be attributed to the modality-alignment signal rather than differences in optimization or token exposure.

revision: yes

Referee: [Abstract] Abstract / Experimental results: No error bars, statistical tests, exact dataset sizes, ablation details on batch composition, or controls for optimization hyperparameters are reported. This leaves the 'consistent improvements across in-domain and out-of-domain settings' unverifiable and makes it impossible to confirm that the gains are load-bearing for the modality-alignment hypothesis rather than uncontrolled factors.

Authors: We acknowledge the need for greater statistical rigor and detail. The full manuscript reports exact dataset sizes (target-domain speech <4 hours for the 10% setting) and describes batch composition for mixed batching along with hyperparameter controls in the methods and experimental sections. To address the concern, we have updated the abstract and results to include error bars from multiple seeds, notes on batch composition ablations, and confirmation of consistent hyperparameters across conditions. These revisions make the in-domain and out-of-domain improvements verifiable and strengthen the case that they arise from modality alignment.

revision: yes

Circularity Check

0 steps flagged

No circularity: empirical comparisons are self-contained experimental results

The paper reports direct experimental outcomes from comparing text-only adaptation, paired speech-text adaptation, and mixed batching (MB) on LLM-based ASR, using limited target-domain audio (10% or <4 hours). These are falsifiable WER measurements on in-domain and out-of-domain settings, not a derivation chain, mathematical prediction, or self-referential definition. No equations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the abstract or described method; the modality-alignment claim rests on observed performance differences rather than reducing to inputs by construction. The work is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The work is purely empirical and relies on standard machine-learning assumptions about data splits and optimization; no new entities or ad-hoc parameters are introduced beyond typical training hyperparameters.

axioms (1)

domain assumption: Standard i.i.d. assumptions on train/validation/test splits, and that word error rate is a sufficient proxy for ASR quality
Implicit in all ASR evaluation; invoked when claiming performance improvements.
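For readers less familiar with the metric, word error rate is the word-level edit distance (substitutions, deletions, and insertions) between hypothesis and reference, divided by the number of reference words. The sketch below is a minimal whitespace-tokenized version; the function name and the absence of text normalization are simplifications, and the paper's own scoring setup is not described in the material above.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed reference prefix
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words so far
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)
```

Because insertions count as errors, the value can exceed 1.0 when the hypothesis is much longer than the reference, which is one reason WER alone may not capture every aspect of transcript quality.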

pith-pipeline@v0.9.0 · 5522 in / 1190 out tokens · 39640 ms · 2026-05-10T18:37:41.791608+00:00 · methodology

Reference graph

Works this paper leans on

35 extracted references · 35 canonical work pages · 2 internal anchors

[1] Introduction. The integration of Large Language Models (LLMs) into speech processing has revolutionized natural voice interactions. An increasingly common approach is LLM-based automatic speech recognition (ASR), where a pretrained speech encoder is connected to a pretrained language model via a small projection layer [1, 2, 3, 4]. In particular, the S...

[2] Related work. 2.1. LLM-Based ASR. Recent work in ASR has shifted toward modular architectures that couple pretrained self-supervised speech encoders such as wav2vec 2.0 [8], HuBERT [9], or WavLM [17], with LLMs such as LLaMA [18], or Vicuna [19]), through trainable projection layers that map continuous acoustic features into the LLM embedding space [1, 20...

[3] Mixed Batching Strategy. We consider a domain adaptation setting for LLM-based ASR. First, a base model is trained on source-domain paired speech-text data. This base system learns the initial speech–text alignment and serves as the starting point for all experiments. We then adapt this pretrained model to a target domain using different fine-tuning str...

[4] Experimental Protocol. 4.1. LLM-Based ASR Architecture. In our experiments, the ASR system follows the SLAM-ASR framework [1]. The architecture consists of three main components: (i) a pretrained speech encoder, (ii) a trainable speech projector, and (iii) a LLM. Given a speech segment X_S, the speech encoder produces a sequence of acoustic representations...

[5] Experimental Results. Before conducting the adaptation experiments, we first train a base ASR model exclusively on the source-domain data from the DefinedAI corpus, including the Banking, Insurance, and Healthcare partitions. This base model serves as the starting point for all subsequent experiments. We then perform domain adaptation on different target d...

[6] Conclusion. In this work, we investigated domain adaptation for LLM-based ASR in settings where target-domain speech is limited but textual data is abundant. We identified the modality mismatch introduced by text-only fine-tuning as a key factor behind performance degradation and proposed a hybrid batching strategy that mix speech-text pairs with tex...

[7] Generative AI Use Disclosure. Generative AI was used to proofread the paper, fix orthographic and grammatical errors, and reduce the length of the long sections of the paper
[8] Z. Ma, G. Yang, Y. Yang, Z. Gao, J. Wang, Z. Du, F. Yu, Q. Chen, S. Zheng, S. Zhang et al., “An embarrassingly simple approach for LLM with strong ASR capacity,” arXiv preprint arXiv:2402.08846, 2024.

[9] M. Wang, W. Han, I. Shafran, Z. Wu, C.-C. Chiu, Y. Cao, N. Chen, Y. Zhang, H. Soltau, P. K. Rubenstein et al., “SLM: Bridge the thin gap between speech and text foundation models,” in 2023 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU). IEEE, 2023, pp. 1–8.

[10] W. Yu, C. Tang, G. Sun, X. Chen, T. Tan, W. Li, L. Lu, Z. Ma, and C. Zhang, “Connecting speech encoder and large language model for ASR,” in ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2024, pp. 12637–12641.

[11] M. Yang, S.-J. Chen, J. Xie, and J. Hansen, “Bridging the modality gap: Softly discretizing audio representation for LLM-based automatic speech recognition,” arXiv preprint arXiv:2506.05706, 2025.

[12] L. Xu, H. Xie, S. J. Qin, X. Tao, and F. L. Wang, “Parameter-efficient fine-tuning methods for pretrained language models: A critical review and assessment,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2026.

[13] H. Zheng, L. Shen, A. Tang, Y. Luo, H. Hu, B. Du, Y. Wen, and D. Tao, “Learning from models beyond fine-tuning,” Nature Machine Intelligence, vol. 7, no. 1, pp. 6–17, 2025.

[14] Y. Li, Y. Wu, J. Li, and S. Liu, “Prompting large language models for zero-shot domain adaptation in speech recognition,” in 2023 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU). IEEE, 2023, pp. 1–8.

[15] A. Baevski, Y. Zhou, A. Mohamed, and M. Auli, “wav2vec 2.0: A framework for self-supervised learning of speech representations,” Advances in Neural Information Processing Systems, vol. 33, pp. 12449–12460, 2020.

[16] W.-N. Hsu, B. Bolte, Y.-H. H. Tsai, K. Lakhotia, R. Salakhutdinov, and A. Mohamed, “HuBERT: Self-supervised speech representation learning by masked prediction of hidden units,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 29, pp. 3451–3460, 2021.

[17] R. Joshi and A. Singh, “A simple baseline for domain adaptation in end to end ASR systems using synthetic data,” in Proceedings of the Fifth Workshop on e-Commerce and NLP (ECNLP 5), 2022, pp. 244–249.

[18] S. Khurana, A. Laurent, and J. Glass, “Magic dust for cross-lingual adaptation of monolingual wav2vec-2.0,” in ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2022, pp. 6647–6651.

[19] J. Huang, O. Kuchaiev, P. O’Neill, V. Lavrukhin, J. Li, A. Flores, G. Kucsko, and B. Ginsburg, “Cross-language transfer learning, continuous learning, and domain adaptation for end-to-end automatic speech recognition,” arXiv preprint arXiv:2005.04290, 2020.

[20] L. Zheng, X. Wang, Q. Zhao, and T. Li, “Two-stage domain adaptation for LLM-based ASR by decoupling linguistic and acoustic factors,” Applied Sciences, vol. 16, no. 1, p. 60, 2025.

[21] Y. Fang, J. Peng, X. Li, Y. Xi, C. Zhang, G. Zhong, and K. Yu, “Low-resource domain adaptation for speech LLMs via text-only fine-tuning,” arXiv preprint arXiv:2506.05671, 2025.

[22] A. Carofilis, S. Burdisso, E. Villatoro-Tello, S. Kumar, K. Hacioglu, S. Madikeri, P. Rangappa, M. K. E, P. Motlicek, S. Venkatesan, and A. Stolcke, “Text-only adaptation in LLM-based ASR through text denoising,” 2026. [Online]. Available: https://arxiv.org/abs/2601.20900

[23] Y. Ma, Z. Liu, and O. Kalinli, “Effective text adaptation for LLM-based ASR through soft prompt fine-tuning,” in 2024 IEEE Spoken Language Technology Workshop (SLT). IEEE, 2024, pp. 64–69.

[24] S. Chen, C. Wang, Z. Chen, Y. Wu, S. Liu, Z. Chen, J. Li, N. Kanda, T. Yoshioka, X. Xiao et al., “WavLM: Large-scale self-supervised pre-training for full stack speech processing,” IEEE Journal of Selected Topics in Signal Processing, vol. 16, no. 6, pp. 1505–1518, 2022.

[25] A. Grattafiori, A. Dubey, A. Jauhri, A. Pandey, A. Kadian, A. Al-Dahle, A. Letman, A. Mathur, A. Schelten, A. Vaughan et al., “The Llama 3 herd of models,” arXiv preprint arXiv:2407.21783, 2024.

[26] W.-L. Chiang, Z. Li, Z. Lin, Y. Sheng, Z. Wu, H. Zhang, L. Zheng, S. Zhuang, Y. Zhuang, J. E. Gonzalez et al., “Vicuna: An open-source chatbot impressing GPT-4 with 90%* ChatGPT quality,” See https://vicuna.lmsys.org (accessed 14 April 2023), vol. 2, no. 3, p. 6, 2023.

[27] C. Tang, W. Yu, G. Sun, X. Chen, T. Tan, W. Li, L. Lu, M. Zejun, and C. Zhang, “SALMONN: Towards Generic Hearing Abilities for Large Language Models,” in The Twelfth International Conference on Learning Representations, 2024.

[28] J. Wu, Y. Gaur, Z. Chen, L. Zhou, Y. Zhu, T. Wang, J. Li, S. Liu, B. Ren, L. Liu et al., “On decoder-only architecture for speech-to-text and large language model integration,” in 2023 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU). IEEE, 2023, pp. 1–8.

[29] S. Ghosh, A. Goel, J. Kim, S. Kumar, Z. Kong, S.-g. Lee, C.-H. H. Yang, R. Duraiswami, D. Manocha, R. Valle et al., “Audio Flamingo 3: Advancing audio intelligence with fully open large audio language models,” in The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025.

[30] P. Rangappa, A. Carofilis, J. Prakash, S. Kumar, S. Burdisso, S. Madikeri, E. Villatoro-Tello, B. Sharma, P. Motlicek, K. Hacioglu et al., “Efficient data selection for domain adaptation of ASR using pseudo-labels and multi-stage filtering,” in Proc. Interspeech 2025, 2025, pp. 4928–4932.

[31] M. Bhattacharjee, P. Motlicek, S. Madikeri, H. Helmke, O. Ohneiser, M. Kleinert, and H. Ehr, “Minimum effort adaptation of automatic speech recognition system in air traffic management,” European Journal of Transport and Infrastructure Research, vol. 42, no. 4, pp. 133–153, 2024.

[32] S.-I. Ng, L. Xu, I. Siegert, N. Cummins, N. R. Benway, J. Liss, and V. Berisha, “An end-to-end overview of clinical speech AI,” IEEE Transactions on Audio, Speech and Language Processing, 2026.

[33] D. Hwang, A. Misra, Z. Huo, N. Siddhartha, S. Garg, D. Qiu, K. C. Sim, T. Strohman, F. Beaufays, and Y. He, “Large-scale ASR domain adaptation using self- and semi-supervised learning,” in ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2022, pp. 6627–6631.

[34] E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, and W. Chen, “LoRA: Low-rank adaptation of large language models,” in International Conference on Learning Representations, 2022. [Online]. Available: https://openreview.net/forum?id=nZeVKeeFYf9

[35] H. Wang, F. Yu, X. Shi, Y. Wang, S. Zhang, and M. Li, “SlideSpeech: A large scale slide-enriched audio-visual corpus,” in ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2024, pp. 11076–11080.

[1] [1]

Introduction The integration of Large Language Models (LLMs) into speech processing has revolutionized natural voice interactions. An in- creasingly common approach is LLM-based automatic speech recognition (ASR), where a pretrained speech encoder is con- nected to a pretrained language model via a small projection layer [1, 2, 3, 4]. In particular, the S...

work page

[2] [2]

Related work 2.1. LLM-Based ASR Recent work in ASR has shifted toward modular architectures that couple pretrained self-supervised speech encoders such as wav2vec 2.0 [8], HuBERT [9], or WavLM [17], with LLMs such as LLaMA [18], or Vicuna [19]), through trainable pro- jection layers that map continuous acoustic features into the LLM embedding space [1, 20...

work page internal anchor Pith review Pith/arXiv arXiv 2026

[3] [3]

First, a base model is trained on source-domain paired speech- text data

Mixed Batching Strategy We consider a domain adaptation setting for LLM-based ASR. First, a base model is trained on source-domain paired speech- text data. This base system learns the initial speech–text align- ment and serves as the starting point for all experiments. We then adapt this pretrained model to a target domain using different fine-tuning str...

work page

[4] [4]

LLM-Based ASR Architecture In our experiments, the ASR system follows the SLAM-ASR framework [1]

Experimental Protocol 4.1. LLM-Based ASR Architecture In our experiments, the ASR system follows the SLAM-ASR framework [1]. The architecture consists of three main com- ponents: (i) a pretrained speech encoder, (ii) a trainable speech projector, and (iii) a LLM. Given a speech segmentX S, the speech encoder produces a sequence of acoustic representations...

work page

[5] [5]

This base model serves as the starting point for all subsequent experiments

Experimental Results Before conducting the adaptation experiments, we first train a base ASR model exclusively on the source-domain data from the DefinedAI corpus, including the Banking, Insurance, and Healthcare partitions. This base model serves as the starting point for all subsequent experiments. We then perform domain adaptation on different target d...

work page

[6] [6]

Conclusion In this work, we investigated domain adaptation for LLM-based ASR in settings where target-domain speech is limited but tex- tual data is abundant. We identified the modality mismatch in- troduced by text-only fine-tuning as a key factor behind perfor- mance degradation and proposed a hybrid batching strategy that mix speech-text pairs with tex...

work page

[7] [7]

Generative AI Use Disclosure Generative AI was used to proofread the paper, fix orthographic and grammatical errors, and reduce the length of the long sec- tions of the paper

work page

[8] [8]

An embarrassingly simple approach for LLM with strong ASR capacity,

Z. Ma, G. Yang, Y . Yang, Z. Gao, J. Wang, Z. Du, F. Yu, Q. Chen, S. Zheng, S. Zhanget al., “An embarrassingly simple approach for llm with strong asr capacity,”arXiv preprint arXiv:2402.08846, 2024

work page arXiv 2024

[9] [9]

Slm: Bridge the thin gap between speech and text foundation models,

M. Wang, W. Han, I. Shafran, Z. Wu, C.-C. Chiu, Y . Cao, N. Chen, Y . Zhang, H. Soltau, P. K. Rubensteinet al., “Slm: Bridge the thin gap between speech and text foundation models,” in2023 IEEE Automatic Speech Recognition and Understanding Work- shop (ASRU). IEEE, 2023, pp. 1–8

work page 2023

[10] [10]

Connecting speech encoder and large language model for asr,

W. Yu, C. Tang, G. Sun, X. Chen, T. Tan, W. Li, L. Lu, Z. Ma, and C. Zhang, “Connecting speech encoder and large language model for asr,” inICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2024, pp. 12 637–12 641

work page 2024

[11] M. Yang, S.-J. Chen, J. Xie, and J. Hansen, “Bridging the modality gap: Softly discretizing audio representation for llm-based automatic speech recognition,” arXiv preprint arXiv:2506.05706, 2025.

[12] L. Xu, H. Xie, S. J. Qin, X. Tao, and F. L. Wang, “Parameter-efficient fine-tuning methods for pretrained language models: A critical review and assessment,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2026.

[13] H. Zheng, L. Shen, A. Tang, Y. Luo, H. Hu, B. Du, Y. Wen, and D. Tao, “Learning from models beyond fine-tuning,” Nature Machine Intelligence, vol. 7, no. 1, pp. 6–17, 2025.

[14] Y. Li, Y. Wu, J. Li, and S. Liu, “Prompting large language models for zero-shot domain adaptation in speech recognition,” in 2023 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU). IEEE, 2023, pp. 1–8.

[15] A. Baevski, Y. Zhou, A. Mohamed, and M. Auli, “wav2vec 2.0: A framework for self-supervised learning of speech representations,” Advances in neural information processing systems, vol. 33, pp. 12 449–12 460, 2020.

[16] W.-N. Hsu, B. Bolte, Y.-H. H. Tsai, K. Lakhotia, R. Salakhutdinov, and A. Mohamed, “Hubert: Self-supervised speech representation learning by masked prediction of hidden units,” IEEE/ACM transactions on audio, speech, and language processing, vol. 29, pp. 3451–3460, 2021.

[17] R. Joshi and A. Singh, “A simple baseline for domain adaptation in end to end asr systems using synthetic data,” in Proceedings of the Fifth Workshop on e-Commerce and NLP (ECNLP 5), 2022, pp. 244–249.

[18] S. Khurana, A. Laurent, and J. Glass, “Magic dust for cross-lingual adaptation of monolingual wav2vec-2.0,” in ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2022, pp. 6647–6651.

[19] J. Huang, O. Kuchaiev, P. O’Neill, V. Lavrukhin, J. Li, A. Flores, G. Kucsko, and B. Ginsburg, “Cross-language transfer learning, continuous learning, and domain adaptation for end-to-end automatic speech recognition,” arXiv preprint arXiv:2005.04290, 2020.

[20] L. Zheng, X. Wang, Q. Zhao, and T. Li, “Two-stage domain adaptation for llm-based asr by decoupling linguistic and acoustic factors,” Applied Sciences, vol. 16, no. 1, p. 60, 2025.

[21] Y. Fang, J. Peng, X. Li, Y. Xi, C. Zhang, G. Zhong, and K. Yu, “Low-resource domain adaptation for speech llms via text-only fine-tuning,” arXiv preprint arXiv:2506.05671, 2025.

[22] A. Carofilis, S. Burdisso, E. Villatoro-Tello, S. Kumar, K. Hacioglu, S. Madikeri, P. Rangappa, M. K. E, P. Motlicek, S. Venkatesan, and A. Stolcke, “Text-only adaptation in llm-based asr through text denoising,” 2026. [Online]. Available: https://arxiv.org/abs/2601.20900

[23] Y. Ma, Z. Liu, and O. Kalinli, “Effective text adaptation for llm-based asr through soft prompt fine-tuning,” in 2024 IEEE Spoken Language Technology Workshop (SLT). IEEE, 2024, pp. 64–69.

[24] S. Chen, C. Wang, Z. Chen, Y. Wu, S. Liu, Z. Chen, J. Li, N. Kanda, T. Yoshioka, X. Xiao et al., “Wavlm: Large-scale self-supervised pre-training for full stack speech processing,” IEEE Journal of Selected Topics in Signal Processing, vol. 16, no. 6, pp. 1505–1518, 2022.

[25] A. Grattafiori, A. Dubey, A. Jauhri, A. Pandey, A. Kadian, A. Al-Dahle, A. Letman, A. Mathur, A. Schelten, A. Vaughan et al., “The llama 3 herd of models,” arXiv preprint arXiv:2407.21783, 2024.

[26] W.-L. Chiang, Z. Li, Z. Lin, Y. Sheng, Z. Wu, H. Zhang, L. Zheng, S. Zhuang, Y. Zhuang, J. E. Gonzalez et al., “Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality,” See https://vicuna.lmsys.org (accessed 14 April 2023), vol. 2, no. 3, p. 6, 2023.

[27] C. Tang, W. Yu, G. Sun, X. Chen, T. Tan, W. Li, L. Lu, M. Zejun, and C. Zhang, “SALMONN: Towards Generic Hearing Abilities for Large Language Models,” in The Twelfth International Conference on Learning Representations, 2024.

[28] J. Wu, Y. Gaur, Z. Chen, L. Zhou, Y. Zhu, T. Wang, J. Li, S. Liu, B. Ren, L. Liu et al., “On decoder-only architecture for speech-to-text and large language model integration,” in 2023 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU). IEEE, 2023, pp. 1–8.

[29] S. Ghosh, A. Goel, J. Kim, S. Kumar, Z. Kong, S.-g. Lee, C.-H. H. Yang, R. Duraiswami, D. Manocha, R. Valle et al., “Audio flamingo 3: Advancing audio intelligence with fully open large audio language models,” in The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025.

[30] P. Rangappa, A. Carofilis, J. Prakash, S. Kumar, S. Burdisso, S. Madikeri, E. Villatoro-Tello, B. Sharma, P. Motlicek, K. Hacioglu et al., “Efficient data selection for domain adaptation of asr using pseudo-labels and multi-stage filtering,” in Proc. Interspeech 2025, 2025, pp. 4928–4932.

[31] M. Bhattacharjee, P. Motlicek, S. Madikeri, H. Helmke, O. Ohneiser, M. Kleinert, and H. Ehr, “Minimum effort adaptation of automatic speech recognition system in air traffic management,” European Journal of Transport and Infrastructure Research, vol. 42, no. 4, pp. 133–153, 2024.

[32] S.-I. Ng, L. Xu, I. Siegert, N. Cummins, N. R. Benway, J. Liss, and V. Berisha, “An end-to-end overview of clinical speech ai,” IEEE Transactions on Audio, Speech and Language Processing, 2026.

[33] D. Hwang, A. Misra, Z. Huo, N. Siddhartha, S. Garg, D. Qiu, K. C. Sim, T. Strohman, F. Beaufays, and Y. He, “Large-scale asr domain adaptation using self-and semi-supervised learning,” in ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2022, pp. 6627–6631.

[34] E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, and W. Chen, “LoRA: Low-rank adaptation of large language models,” in International Conference on Learning Representations, 2022. [Online]. Available: https://openreview.net/forum?id=nZeVKeeFYf9

[35] H. Wang, F. Yu, X. Shi, Y. Wang, S. Zhang, and M. Li, “Slidespeech: A large scale slide-enriched audio-visual corpus,” in ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2024, pp. 11 076–11 080.