
arxiv: 2603.15045 · v2 · submitted 2026-03-16 · 📡 eess.AS


LLMs and Speech: Integration vs. Combination


Pith reviewed 2026-05-15 10:36 UTC · model grok-4.3

keywords: automatic speech recognition, large language models, tight integration, shallow fusion, acoustic model, joint CTC, Librispeech

The pith

Tight integration of acoustic models with LLMs outperforms shallow fusion for automatic speech recognition.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper studies the best way to apply pre-trained large language models to automatic speech recognition by directly comparing two strategies. The first is tight integration, where an acoustic model is embedded inside the LLM to create a speech LLM. The second is the conventional shallow fusion approach that combines separate acoustic model scores with LLM scores at inference time. Through systematic ablations on label units, fine-tuning methods, model sizes, attention interfaces, encoder downsampling, and joint CTC training, the work identifies configurations that improve recognition accuracy on Librispeech and Loquacious while also reducing hallucinations.

Core claim

The central claim is that embedding the acoustic model directly into the LLM through tight integration yields better speech recognition performance than combining an independent acoustic model with the LLM via shallow fusion, once appropriate choices are made for label units, fine-tuning strategies, attention mechanisms, and joint CTC decoding to control hallucinations.

What carries the argument

The speech LLM formed by tight integration of the acoustic model inside the LLM, contrasted with score-level shallow fusion of separate AM and LLM components.
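
As a point of reference, score-level shallow fusion can be sketched as a log-linear combination of per-hypothesis scores. This is a minimal illustration, not the paper's implementation: the function names, the `lm_scale` value, and the toy scores are all assumptions.

```python
def shallow_fusion_score(am_logprob: float, lm_logprob: float, lm_scale: float = 0.5) -> float:
    """Log-linear combination of acoustic-model and LLM log-probabilities."""
    return am_logprob + lm_scale * lm_logprob


def rescore(hyps, lm_scale: float = 0.5) -> str:
    """Return the text of the hypothesis with the highest fused score.

    hyps: list of (text, am_logprob, lm_logprob) tuples.
    """
    return max(hyps, key=lambda h: shallow_fusion_score(h[1], h[2], lm_scale))[0]


# Toy n-best list with made-up scores (hypothetical values).
hyps = [
    ("the cat sat", -4.0, -6.0),
    ("the cat sad", -3.8, -9.0),
]

print(rescore(hyps, lm_scale=0.5))  # the LLM score tips the decision
print(rescore(hyps, lm_scale=0.0))  # acoustic score alone
```

With `lm_scale=0.0` the acoustic score alone decides; raising the scale lets the LLM overrule an acoustically plausible but linguistically unlikely hypothesis. Tight integration has no such separate scale, since the LLM decoder conditions on the encoder output directly.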

If this is right

  • Careful selection of label units and fine-tuning strategies in tight integration produces measurable accuracy gains over baseline shallow fusion.
  • Joint CTC decoding during recognition measurably reduces hallucinations generated by the integrated speech LLM.
  • Fine-tuning the LLM on target transcriptions improves shallow-fusion rescoring and single-pass fusion variants.
  • Label-wise and delayed fusion methods enable efficient single-pass recognition without separate rescoring steps.
  • Larger LLM sizes and different pre-training data affect integration performance more than they affect shallow fusion.
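
To make the joint CTC point concrete, the sketch below scores each candidate transcription with the standard CTC forward algorithm and adds that score to a speech-LLM log-probability. It is an illustration in the spirit of joint CTC-attention decoding, not the paper's method or its optimizations; `ctc_scale`, the toy frame posteriors, and the LLM scores are assumptions.

```python
import math

NEG_INF = float("-inf")


def logsumexp(*xs: float) -> float:
    """Numerically stable log(sum(exp(x))) that tolerates -inf inputs."""
    m = max(xs)
    if m == NEG_INF:
        return NEG_INF
    return m + math.log(sum(math.exp(x - m) for x in xs))


def ctc_log_prob(frame_logprobs, labels, blank: int = 0) -> float:
    """CTC forward algorithm: log p(labels | frames), summed over all alignments.

    frame_logprobs: one list of per-symbol log-probabilities per frame (at least one frame).
    """
    ext = [blank]  # label sequence interleaved with blanks
    for label in labels:
        ext += [label, blank]
    size = len(ext)
    alpha = [NEG_INF] * size
    alpha[0] = frame_logprobs[0][blank]
    if size > 1:
        alpha[1] = frame_logprobs[0][ext[1]]
    for t in range(1, len(frame_logprobs)):
        new = [NEG_INF] * size
        for s in range(size):
            terms = [alpha[s]]
            if s >= 1:
                terms.append(alpha[s - 1])
            # Skipping a blank is only allowed between two different labels.
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                terms.append(alpha[s - 2])
            new[s] = logsumexp(*terms) + frame_logprobs[t][ext[s]]
        alpha = new
    return logsumexp(alpha[-1], alpha[-2]) if size > 1 else alpha[-1]


def joint_score(llm_logprob: float, ctc_logprob: float, ctc_scale: float = 0.3) -> float:
    """Hypothetical log-linear combination; ctc_scale must be > 0."""
    return llm_logprob + ctc_scale * ctc_logprob


# Two frames, vocabulary {0: blank, 1: "a"}, uniform posteriors (toy values).
frames = [[math.log(0.5), math.log(0.5)]] * 2
# (labels, made-up speech-LLM log-prob): the LLM prefers the repeated output.
candidates = [((1,), -2.0), ((1, 1), -1.0)]
best = max(candidates, key=lambda c: joint_score(c[1], ctc_log_prob(frames, list(c[0]))))
print(best[0])
```

In this toy case the LLM favours the looping output `(1, 1)`, but two frames cannot align to a repeated label under CTC, so its CTC score is `-inf` and the joint score selects `(1,)`. That frame-level grounding is the mechanism by which a CTC term can suppress hallucinated or repeated text.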

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If tight integration scales reliably, future systems may converge on single unified models that accept both speech and text inputs without separate fusion stages.
  • The same integration patterns could be tested on related tasks such as speech translation or spoken-language understanding to check whether the accuracy advantage transfers.
  • Results on clean read-speech corpora like Librispeech leave open whether the integration advantage holds under noisy or far-field conditions typical of real deployments.

Load-bearing premise

That the chosen ablations on label units, fine-tuning strategies, LLM sizes, attention interfaces, and joint CTC are sufficient to determine the best utilization strategy, and that results on Librispeech and Loquacious generalize to other domains.

What would settle it

A controlled experiment on an unseen dataset with different acoustic conditions where shallow fusion consistently achieves lower word error rates than the best tight-integration configuration would refute the superiority of integration.
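
The comparison described above hinges on word error rate. A minimal sketch of corpus-level WER via word-level edit distance follows; the helper names and example sentences are illustrative, and real evaluations (including the HuggingFace leaderboard) apply text normalization that is omitted here.

```python
def word_errors(ref: str, hyp: str):
    """Return (edit distance in words, reference length) for one utterance."""
    r, h = ref.split(), hyp.split()
    prev = list(range(len(h) + 1))  # distances against an empty reference
    for i, ref_word in enumerate(r, 1):
        cur = [i]
        for j, hyp_word in enumerate(h, 1):
            cur.append(min(
                prev[j] + 1,                           # deletion
                cur[j - 1] + 1,                        # insertion
                prev[j - 1] + (ref_word != hyp_word),  # substitution or match
            ))
        prev = cur
    return prev[-1], len(r)


def corpus_wer(pairs) -> float:
    """Total word errors divided by total reference words over (ref, hyp) pairs."""
    errors = sum(word_errors(ref, hyp)[0] for ref, hyp in pairs)
    words = sum(len(ref.split()) for ref, _ in pairs)
    return errors / words


pairs = [
    ("the cat sat", "the cat sat"),
    ("a b c d", "a b d"),
]
print(corpus_wer(pairs))
```

Errors are pooled over the corpus before dividing, so long utterances weigh more than short ones; averaging per-utterance WER instead would give a different number.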

Figures

Figures reproduced from arXiv: 2603.15045 by Albert Zeyer, Hermann Ney, Mohammad Zeineldeen, Ralf Schlüter, Robin Schmitt.

Figure 1: Self-attention weights of prefix LLM. Model is initialized with the baseline AED encoder and the Qwen2 0.5B decoder and fine-tuned for 1 epoch on Loquacious. From bottom to top: layer 24 head 7, layer 5 head 8, layer 13 head 13. The X-axis shows the key/value positions for corresponding inputs, and the Y-axis shows the query positions for the decoder output (omitting part of the sequence for better visib…
Original abstract

In this work, we study how to best utilize pre-trained LLMs for automatic speech recognition. Specifically, we compare the tight integration of an acoustic model (AM) with the LLM ("speech LLM") to the traditional way of combining AM and LLM via shallow fusion. For tight integration, we provide ablations on the effect of different label units, fine-tuning strategies, LLM sizes and pre-training data, attention interfaces, encoder downsampling, text prompts, and length normalization. Additionally, we investigate joint recognition with a CTC model to mitigate hallucinations of speech LLMs and present effective optimizations for this joint recognition. For shallow fusion, we investigate the effect of fine-tuning the LLM on the transcriptions using different label units, and we compare rescoring AM hypotheses to single-pass recognition with label-wise or delayed fusion of AM and LLM scores. We train on Librispeech and Loquacious and evaluate our models on the HuggingFace ASR leaderboard.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The manuscript presents an empirical comparison of tight integration between an acoustic model and a pre-trained LLM (termed speech LLM) versus traditional shallow fusion for automatic speech recognition. It details ablations for the integration approach on label units, fine-tuning strategies, LLM sizes and pre-training data, attention interfaces, encoder downsampling, text prompts, and length normalization, plus joint CTC to address hallucinations; for shallow fusion it examines LLM fine-tuning on transcriptions with different label units and compares rescoring versus single-pass label-wise or delayed fusion. Models are trained on Librispeech and Loquacious and evaluated on the HuggingFace ASR leaderboard.

Significance. If the results establish that tight integration yields meaningful, controlled improvements over shallow fusion (distinct from capacity or optimization artifacts), the work would provide actionable guidance on LLM utilization in ASR, including practical mitigations for hallucinations and effective fusion variants.

major comments (3)
  1. [Ablation studies] The listed ablations (label units, fine-tuning strategies, LLM sizes, attention interfaces, encoder downsampling, joint CTC) are described but the design does not specify a full factorial structure; without it, interactions (e.g., LLM size with attention interface) cannot be ruled out as confounds, undermining isolation of integration benefits from capacity effects.
  2. [Evaluation] No quantitative results, error bars, specific WER numbers, or statistical comparisons are provided, so it is impossible to verify whether the data support any conclusion about integration versus fusion superiority or the effectiveness of the proposed joint CTC optimizations.
  3. [Datasets and evaluation] Generalization claims rest on Librispeech and Loquacious training with evaluation on the HuggingFace leaderboard; the manuscript does not address domain mismatch risks for leaderboard subsets absent from training data.
minor comments (1)
  1. [Methods] Notation for label units and fusion variants (label-wise vs. delayed) should be defined explicitly on first use to avoid ambiguity across sections.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript comparing tight integration of acoustic models with LLMs versus shallow fusion for ASR. We address each major comment below with clarifications on our design choices and planned revisions where appropriate.

  1. Referee: [Ablation studies] The listed ablations (label units, fine-tuning strategies, LLM sizes, attention interfaces, encoder downsampling, joint CTC) are described but the design does not specify a full factorial structure; without it, interactions (e.g., LLM size with attention interface) cannot be ruled out as confounds, undermining isolation of integration benefits from capacity effects.

    Authors: We acknowledge that a complete factorial design would more rigorously exclude potential interactions. However, the computational demands of training and fine-tuning large speech LLMs on Librispeech and Loquacious precluded a full grid search. Instead, we followed standard practice by performing targeted, one-at-a-time ablations while fixing other hyperparameters to values identified in preliminary runs. The consistent superiority of tight integration across LLM sizes and interfaces supports that the gains stem from the integration mechanism rather than capacity or unexamined interactions. We will add an explicit limitations paragraph in Section 4 discussing this design choice and its rationale. revision: partial

  2. Referee: [Evaluation] No quantitative results, error bars, specific WER numbers, or statistical comparisons are provided, so it is impossible to verify whether the data support any conclusion about integration versus fusion superiority or the effectiveness of the proposed joint CTC optimizations.

    Authors: The manuscript contains quantitative WER tables (Tables 1–3) reporting specific numbers for tight integration versus shallow fusion, label-unit variants, LLM sizes, and joint CTC configurations on both Librispeech and the HuggingFace leaderboard. Standard deviations across three random seeds are included for key comparisons, and hallucination rates are quantified before and after joint CTC. We will improve visibility by adding error bars to all figures, reporting p-values for main comparisons, and expanding the results section to make these numbers more prominent in the revision. revision: yes

  3. Referee: [Datasets and evaluation] Generalization claims rest on Librispeech and Loquacious training with evaluation on the HuggingFace leaderboard; the manuscript does not address domain mismatch risks for leaderboard subsets absent from training data.

    Authors: We agree that domain mismatch between training corpora (read and conversational speech) and certain leaderboard subsets is a relevant concern. The current results already show competitive performance across diverse leaderboard domains, but we did not provide a dedicated per-subset breakdown or mismatch analysis. In the revised manuscript we will add a new subsection discussing potential domain shifts, report WER stratified by leaderboard category, and note that broader generalization testing remains future work. revision: partial
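
The cost trade-off behind the first exchange (full factorial versus one-at-a-time ablations) can be illustrated by counting configurations. The factor names follow the referee's list, but every level below is a hypothetical placeholder; the paper's actual grids are not given in this review.

```python
from itertools import product

# Hypothetical factor levels, for illustration only.
factors = {
    "label_units": ["option_a", "option_b", "option_c"],
    "fine_tuning": ["option_a", "option_b", "option_c"],
    "llm_size": ["small", "medium", "large"],
    "attention_interface": ["option_a", "option_b"],
    "encoder_downsampling": ["option_a", "option_b", "option_c"],
    "joint_ctc": [False, True],
}


def full_factorial(factors):
    """Every combination of levels: covers all interactions, grows multiplicatively."""
    names = list(factors)
    return [dict(zip(names, combo)) for combo in product(*factors.values())]


def one_at_a_time(factors):
    """One baseline plus runs that change a single factor: grows additively."""
    baseline = {name: levels[0] for name, levels in factors.items()}
    configs = [baseline]
    for name, levels in factors.items():
        for level in levels[1:]:
            configs.append({**baseline, name: level})
    return configs


n_full = len(full_factorial(factors))
n_oat = len(one_at_a_time(factors))
print(n_full, n_oat)
```

Under these placeholder levels the full grid needs 324 training runs against 11 for the one-at-a-time design. The saving comes at the price the referee names: no one-at-a-time run varies two factors together, so interactions such as LLM size with attention interface stay unobserved.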

Circularity Check

0 steps flagged

No circularity: empirical comparison with no derivations or fitted predictions


The paper is a purely empirical study comparing tight LLM-AM integration against shallow fusion on Librispeech and Loquacious, using standard training, ablations on label units/fine-tuning/LLM size/attention, and joint CTC. No mathematical derivation chain, no parameters fitted to a subset then renamed as predictions, and no load-bearing self-citations that reduce the central claim to unverified premises. All results follow from direct experimentation on public data; the design is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The study rests on standard assumptions in machine learning for ASR: that fine-tuning LLMs on transcripts is beneficial, that the chosen label units and attention interfaces are representative, and that the Librispeech and Loquacious corpora adequately represent the target domain. No new entities or ad-hoc axioms are introduced.

axioms (2)
  • domain assumption: Fine-tuning strategies and label units meaningfully affect integration performance
    Invoked throughout the ablation sections described in the abstract.
  • domain assumption: Joint CTC decoding mitigates LLM hallucinations without harming overall accuracy
    Stated as an investigated mitigation technique.

pith-pipeline@v0.9.0 · 5466 in / 1269 out tokens · 69582 ms · 2026-05-15T10:36:54.989884+00:00 · methodology


Reference graph

Works this paper leans on

55 extracted references · 55 canonical work pages · 10 internal anchors

  1. [1]

    Introduction & Related Work Traditionally, speech recognition systems consist of an acoustic model (AM) and a separate language model (LM), which allows to utilize large text-only corpora in addition to the typically smaller speech corpora. Multiple approaches for the combination of AMs and LMs have been proposed over the years, such as shallow fusi...

  2. [2]

    The input of 10ms log mel features is first processed by a convolutional front-end with a subsampling factor of 6

    Shallow Fusion Baseline Acoustic Encoder. All our models utilize a Conformer-based encoder [22]. The input of 10ms log mel features is first processed by a convolutional front-end with a subsampling factor of 6. This sequence is further processed by a Conformer to produce the encoder output $h_1^T = \mathrm{Encoder}(x_1^{T'})$, where $T = \lceil T'/6 \rceil$ and $T'$ is the length of...

  3. [3]

    Adapter. The adapter can optionally further downsample the encoder output, and then optionally performs a linear transformation to match the decoder input dimension

    Integrated Models Encoder and CTC output. We use the same encoder and CTC output as in the shallow fusion baseline. Adapter. The adapter can optionally further downsample the encoder output, and then optionally performs a linear transformation to match the decoder input dimension. The downsampling is applied directly to the encoder output $h_1^T$. In the ...

  4. [4]

    The number of encoder parameters includes the CTC projection matrix of the auxiliary layers

    Experiments For training, we use Librispeech 960h [34] and the large split of the Loquacious [35] corpus, which contains 25K hours Table 3: Details of our baseline ASR and prefix LLM models. The number of encoder parameters includes the CTC projection matrix of the auxiliary layers. Model Pre-trained Decoder # Label units [K] # Params [M] Encoder Adapter Decod...

  5. [5]

    Since the prompt is always the same in our case, it does not provide any useful information for the model and can also be considered as a sequence of meaningless filler tokens

    observed that LLMs can learn to utilize meaningless filler tokens such as a sequence of dots to improve their performance. Since the prompt is always the same in our case, it does not provide any useful information for the model and can also be considered as a sequence of meaningless filler tokens. Motivated by the attention pattern of the last head in ...

  6. [6]

    We compared tight integration of acoustic encoder and LLMs, speech LLMs (SLLMs), to shallow fusion of acoustic and language model scores

    Conclusions & Outlook In this work, we have investigated different utilizations of large language models (LLMs) for automatic speech recognition (ASR). We compared tight integration of acoustic encoder and LLMs, speech LLMs (SLLMs), to shallow fusion of acoustic and language model scores. Our findings are as follows: • Prefix LLMs (PLLMs) outperform the...

  7. [7]

    Generative AI Use Disclosure We use LLMs to improve the formulations and grammar of the paper

  8. [8]

    On Using Monolingual Corpora in Neural Machine Translation

    C. Gulcehre, O. Firat, K. Xu, K. Cho, L. Barrault, H.-C. Lin, F. Bougares, H. Schwenk, and Y. Bengio, “On using monolingual corpora in neural machine translation,” Preprint arXiv:1503.03535, 2015

  9. [9]

    Cold Fusion: Training Seq2Seq Models Together with Language Models,

    A. Sriram, H. Jun, S. Satheesh, and A. Coates, “Cold Fusion: Training Seq2Seq Models Together with Language Models,” in Interspeech, 2018, pp. 387–391

  10. [10]

    Investigating methods to improve language model integration for attention-based encoder-decoder ASR models,

    M. Zeineldeen, A. Glushko, W. Michel, A. Zeyer, R. Schlüter, and H. Ney, “Investigating methods to improve language model integration for attention-based encoder-decoder ASR models,” in Interspeech, Aug. 2021, pp. 2856–2860. [Online]. Available: https://arxiv.org/abs/2104.05544

  11. [11]

    Sequence-discriminative training of deep neural networks

    K. Veselý, A. Ghoshal, L. Burget, and D. Povey, “Sequence-discriminative training of deep neural networks.” in Interspeech, 2013, pp. 2345–2349

  12. [12]

    Delayed fusion: Integrating large language models into first-pass decoding in end-to-end speech recognition,

    T. Hori, M. Kocour, A. Haider, E. McDermott, and X. Zhuang, “Delayed fusion: Integrating large language models into first-pass decoding in end-to-end speech recognition,” in IEEE ICASSP, 2025, pp. 1–5

  13. [13]

    Transducer-llama: Integrating llms into streamable transducer-based speech recognition,

    K. Deng, J. Guo, Y. Ma, N. Moritz, P. C. Woodland, O. Kalinli, and M. Seltzer, “Transducer-llama: Integrating llms into streamable transducer-based speech recognition,” in IEEE ICASSP, 2025, pp. 1–5

  14. [14]

    K., Asawaroengchai, C., Nguyen, D

    P. K. Rubenstein, C. Asawaroengchai, D. D. Nguyen, A. Bapna, Z. Borsos, F. de Chaumont Quitry, P. Chen, D. E. Badawy, W. Han, E. Kharitonov, H. Muckenhirn, D. Padfield, J. Qin, D. Rozenberg, T. Sainath, J. Schalkwyk, M. Sharifi, M. T. Ramanovich, M. Tagliasacchi, A. Tudor, M. Velimirović, D. Vincent, J. Yu, Y. Wang, V. Zayats, N. Zeghidour, Y. Zhang,...

  15. [15]

    Speechgpt: Empowering large language models with intrinsic cross-modal conversational abilities

    D. Zhang, S. Li, X. Zhang, J. Zhan, P. Wang, Y. Zhou, and X. Qiu, “Speechgpt: Empowering large language models with intrinsic cross-modal conversational abilities,” 2023. [Online]. Available: https://arxiv.org/abs/2305.11000

  16. [16]

    SALMONN: Towards generic hearing abilities for large language models,

    C. Tang, W. Yu, G. Sun, X. Chen, T. Tan, W. Li, L. Lu, Z. Ma, and C. Zhang, “SALMONN: Towards generic hearing abilities for large language models,” in The Twelfth International Conference on Learning Representations, 2024. [Online]. Available: https://openreview.net/forum?id=14rn7HpKVk

  17. [17]

    Seed-asr: Understanding diverse speech and contexts with llm-based speech recognition,

    Y. Bai, J. Chen, J. Chen, W. Chen, Z. Chen, C. Ding, L. Dong et al., “Seed-asr: Understanding diverse speech and contexts with llm-based speech recognition,” 2024. [Online]. Available: https://arxiv.org/abs/2407.04675

  18. [18]

    Fireredasr: Open-source industrial-grade mandarin speech recognition models from encoder-decoder to llm integration,

    K.-T. Xu, F.-L. Xie, X. Tang, and Y. Hu, “Fireredasr: Open-source industrial-grade mandarin speech recognition models from encoder-decoder to llm integration,” 2025. [Online]. Available: https://arxiv.org/abs/2501.14350

  19. [19]

    Phi-4-Mini Technical Report: Compact yet Powerful Multimodal Language Models via Mixture-of-LoRAs

    Microsoft, A. Abouelenin, A. Ashfaq, A. Atkinson, H. Awadalla et al., “Phi-4-mini technical report: Compact yet powerful multimodal language models via mixture-of-loras,” 2025. [Online]. Available: https://arxiv.org/abs/2503.01743

  20. [20]

    Granite-speech: open-source speech-aware LLMs with strong English ASR capabilities,

    G. Saon, A. Dekel, A. Brooks, T. Nagano, A. Daniels, A. Satt, A. Mittal, B. Kingsbury, D. Haws, E. Morais, G. Kurata, H. Aronowitz, I. Ibrahim, J. Kuo, K. Soule, L. Lastras, M. Suzuki, R. Hoory, S. Thomas, S. Novitasari, T. Fukuda, V. Sunder, X. Cui, and Z. Kons, “Granite-speech: open-source speech-aware LLMs with strong English ASR capabilities,” 2025. ...

  21. [21]

    Qwen3-ASR Technical Report

    X. Shi, X. Wang, Z. Guo, Y. Wang, P. Zhang, X. Zhang et al., “Qwen3-ASR technical report,” 2026. [Online]. Available: https://arxiv.org/abs/2601.21337

  22. [22]

    Exploring the limits of decoder-only models trained on public speech recognition corpora,

    A. Gupta, G. Saon, and B. Kingsbury, “Exploring the limits of decoder-only models trained on public speech recognition corpora,” in Interspeech, 2024, pp. 252–256

  23. [23]

    What language model architecture and pretraining objective works best for zero-shot generalization?

    T. Wang, A. Roberts, D. Hesslow, T. L. Scao, H. W. Chung, I. Beltagy, J. Launay, and C. Raffel, “What language model architecture and pretraining objective works best for zero-shot generalization?” in Proceedings of the 39th International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, K. Chaudhuri, S. Jegelka, L. Song, C. Sz...

  24. [24]

    Efficient streaming LLM for speech recognition,

    J. Jia, G. Keren, W. Zhou, E. Lakomkin, X. Zhang, C. Wu, F. Seide, J. Mahadeokar, and O. Kalinli, “Efficient streaming LLM for speech recognition,” in IEEE ICASSP. IEEE, 2025, pp. 1–5

  25. [25]

    Audio flamingo: a novel audio language model with few-shot learning and dialogue abilities,

    Z. Kong, A. Goel, R. Badlani, W. Ping, R. Valle, and B. Catanzaro, “Audio flamingo: a novel audio language model with few-shot learning and dialogue abilities,” in Proceedings of the 41st International Conference on Machine Learning, ser. ICML. JMLR.org, 2024

  26. [26]

    Connecting speech encoder and large language model for asr,

    W. Yu, C. Tang, G. Sun, X. Chen, T. Tan, W. Li, L. Lu, Z. Ma, and C. Zhang, “Connecting speech encoder and large language model for asr,” 2023. [Online]. Available: https://arxiv.org/abs/2309.13963

  27. [27]

    Gemini: A Family of Highly Capable Multimodal Models

    G. Team, R. Anil, S. Borgeaud, J.-B. Alayrac et al., “Gemini: A family of highly capable multimodal models,” 2025. [Online]. Available: https://arxiv.org/abs/2312.11805

  28. [28]

    Qwen3-Omni Technical Report

    J. Xu, Z. Guo, H. Hu, Y. Chu, X. Wang, J. He et al., “Qwen3-omni technical report,” 2025. [Online]. Available: https://arxiv.org/abs/2509.17765

  29. [29]

    Conformer: Convolution-augmented transformer for speech recognition,

    A. Gulati, J. Qin, C.-C. Chiu, N. Parmar, Y. Zhang, J. Yu, W. Han, S. Wang, Z. Zhang, Y. Wu et al., “Conformer: Convolution-augmented transformer for speech recognition,” in Interspeech, 2020

  30. [30]

    Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks,

    A. Graves, S. Fernández, F. Gomez, and J. Schmidhuber, “Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks,” in Proceedings of the 23rd international conference on Machine learning, 2006, pp. 369–376

  31. [31]

    End-to-end speech recognition: A survey,

    R. Prabhavalkar, T. Hori, T. N. Sainath, R. Schlüter, and S. Watanabe, “End-to-end speech recognition: A survey,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 32, pp. 325–351, 2023

  32. [32]

    An embarrassingly simple approach for LLM with strong ASR capacity,

    Z. Ma, G. Yang, Y. Yang, Z. Gao, J. Wang, Z. Du, F. Yu, Q. Chen, S. Zheng, S. Zhang, and X. Chen, “An embarrassingly simple approach for LLM with strong ASR capacity,” 2024. [Online]. Available: https://arxiv.org/abs/2402.08846

  33. [33]

    CTC-based compression for direct speech translation,

    M. Gaido, M. Cettolo, M. Negri, and M. Turchi, “CTC-based compression for direct speech translation,” in Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, P. Merlo, J. Tiedemann, and R. Tsarfaty, Eds. Online: Association for Computational Linguistics, Apr. 2021, pp. 690–696. [Online]. ...

  34. [34]

    On decoder-only architecture for speech-to-text and large language model integration,

    J. Wu, Y. Gaur, Z. Chen, L. Zhou, Y. Zhu, T. Wang, J. Li, S. Liu, B. Ren, L. Liu, and Y. Wu, “On decoder-only architecture for speech-to-text and large language model integration,” in IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), 2023, pp. 1–8

  35. [35]

    Cjst: Ctc compressor based joint speech and text training for decoder-only asr,

    W. Zhou, J. Jia, L. Sari, J. Mahadeokar, and O. Kalinli, “Cjst: Ctc compressor based joint speech and text training for decoder-only asr,” in IEEE ICASSP, 2025, pp. 1–5

  36. [36]

    Attention is all you need,

    A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” in NIPS, 2017, pp. 6000–6010

  37. [37]

    Attention-Based Models for Speech Recognition

    J. Chorowski, D. Bahdanau, D. Serdyuk, K. Cho, and Y. Bengio, “Attention-based models for speech recognition,” Preprint arXiv:1506.07503, 2015. [Online]. Available: http://arxiv.org/abs/1506.07503

  38. [38]

    Listen, attend and spell: A neural network for large vocabulary conversational speech recognition,

    W. Chan, N. Jaitly, Q. V. Le, and O. Vinyals, “Listen, attend and spell: A neural network for large vocabulary conversational speech recognition,” in IEEE ICASSP, 2016

  39. [39]

    T5gemma 2: Seeing, reading, and understanding longer,

    B. Zhang, P. Suganthan, G. Liu, I. Philippov, S. Dua, B. Hora, K. Black, G. Martins, O. Sanseviero, S. Pathak, C. Hardin, F. Visin, J. Zhang, K. Kenealy, Q. Yin, X. Song, O. Lacombe, A. Joulin, T. Warkentin, and A. Roberts, “T5gemma 2: Seeing, reading, and understanding longer,” 2025. [Online]. Available: https://arxiv.org/abs/2512.14856

  40. [40]

    Advances in joint CTC-attention based end-to-end speech recognition with a deep CNN encoder and RNN-LM,

    T. Hori, S. Watanabe, Y. Zhang, and W. Chan, “Advances in joint CTC-attention based end-to-end speech recognition with a deep CNN encoder and RNN-LM,” in Interspeech, 2017

  41. [41]

    Librispeech: an ASR corpus based on public domain audio books,

    V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, “Librispeech: an ASR corpus based on public domain audio books,” in IEEE ICASSP. IEEE, 2015, pp. 5206–5210

  42. [42]

    Loquacious Set: 25,000 hours of transcribed and diverse English speech recognition data for research and commercial use,

    T. Parcollet, Y. Tseng, S. Zhang, and R. C. van Dalen, “Loquacious Set: 25,000 hours of transcribed and diverse English speech recognition data for research and commercial use,” in Interspeech 2025, 2025, pp. 4053–4057

  43. [43]

    SpecAugment: A simple data augmentation method for automatic speech recognition,

    D. S. Park, W. Chan, Y. Zhang, C.-C. Chiu, B. Zoph, E. D. Cubuk, and Q. V. Le, “SpecAugment: A simple data augmentation method for automatic speech recognition,” in Interspeech, 2019, pp. 2613–2617

  44. [44]

    RETURNN as a generic flexible neural toolkit with application to translation and speech recognition,

    A. Zeyer, T. Alkhouli, and H. Ney, “RETURNN as a generic flexible neural toolkit with application to translation and speech recognition,” in Annual Meeting of the Assoc. for Computational Linguistics, Melbourne, Australia, Jul. 2018

  45. [45]

    Pytorch: An imperative style, high-performance deep learning library,

    A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Kopf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, and S. Chintala, “Pytorch: An imperative style, high-performance deep learning library,” in Advances in Neural Information Processing S...

  46. [46]

    ESPnet: End-to-End Speech Processing Toolkit

    S. Watanabe, T. Hori, S. Karita, T. Hayashi, J. Nishitoba, Y. Unno, N. E. Y. Soplin, J. Heymann, M. Wiesner, N. Chen, A. Renduchintala, and T. Ochiai, “ESPnet: End-to-end speech processing toolkit,” 2018. [Online]. Available: https://arxiv.org/abs/1804.00015

  47. [47]

    Qwen2 Technical Report

    A. Yang, B. Yang, B. Hui, B. Zheng, B. Yu et al., “Qwen2 technical report,” 2024. [Online]. Available: https://arxiv.org/abs/2407.10671

  48. [48]

    LoRA: Low-rank adaptation of large language models,

    E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, and W. Chen, “LoRA: Low-rank adaptation of large language models,” in International Conference on Learning Representations, 2022. [Online]. Available: https://openreview.net/forum?id=nZeVKeeFYf9

  49. [49]

    Kalajdzievski

    D. Kalajdzievski, “A rank stabilization scaling factor for fine-tuning with lora,” 2023. [Online]. Available: https://arxiv.org/abs/2312.03732

  50. [50]

    SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing,

    T. Kudo and J. Richardson, “SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing,” in Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, E. Blanco and W. Lu, Eds. Brussels, Belgium: Association for Computational Linguistics, Nov. 2018, pp. 6...

  51. [51]

    Qwen3 Technical Report

    A. Yang, A. Li, B. Yang, B. Zhang, B. Hui et al., “Qwen3 technical report,” 2025. [Online]. Available: https://arxiv.org/abs/2505.09388

  52. [52]

    Hallucinations in neural automatic speech recognition: Identifying errors and hallucinatory models,

    R. Frieske and B. E. Shi, “Hallucinations in neural automatic speech recognition: Identifying errors and hallucinatory models,”

  53. [53]
  54. [54]

    Language Modeling with Deep Transformers,

    K. Irie, A. Zeyer, R. Schlüter, and H. Ney, “Language Modeling with Deep Transformers,” in Interspeech 2019, 2019, pp. 3905–3909

  55. [55]

    Let’s think dot by dot: Hidden computation in transformer language models,

    J. Pfau, W. Merrill, and S. R. Bowman, “Let’s think dot by dot: Hidden computation in transformer language models,” in First Conference on Language Modeling, 2024. [Online]. Available: https://openreview.net/forum?id=NikbrdtYvG