arxiv: 2603.15045 · v2 · submitted 2026-03-16 · 📡 eess.AS


LLMs and Speech: Integration vs. Combination

Robin Schmitt, Albert Zeyer, Mohammad Zeineldeen, Ralf Schlüter, Hermann Ney


Pith reviewed 2026-05-15 10:36 UTC · model grok-4.3

classification 📡 eess.AS

keywords: automatic speech recognition, large language models, tight integration, shallow fusion, acoustic model, joint CTC, Librispeech


The pith

Tight integration of acoustic models with LLMs outperforms shallow fusion for automatic speech recognition.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper studies the best way to apply pre-trained large language models to automatic speech recognition by directly comparing two strategies. The first is tight integration, where an acoustic model is embedded inside the LLM to create a speech LLM. The second is the conventional shallow fusion approach that combines separate acoustic model scores with LLM scores at inference time. Through systematic ablations on label units, fine-tuning methods, model sizes, attention interfaces, encoder downsampling, and joint CTC training, the work identifies configurations that improve recognition accuracy on Librispeech and Loquacious while also reducing hallucinations.

Core claim

The central claim is that embedding the acoustic model directly into the LLM through tight integration yields better speech recognition performance than combining an independent acoustic model with the LLM via shallow fusion, once appropriate choices are made for label units, fine-tuning strategies, attention mechanisms, and joint CTC decoding to control hallucinations.

What carries the argument

The speech LLM formed by tight integration of the acoustic model inside the LLM, contrasted with score-level shallow fusion of separate AM and LLM components.
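As a minimal sketch of what "score-level" combination means in shallow fusion, the snippet below rescores an n-best list by adding a weighted LLM log-probability to each acoustic-model log-probability. The function name, the weight `lm_scale`, and the toy scores are illustrative assumptions; none of them come from the paper.

```python
from typing import List, Tuple


def shallow_fusion_rescore(
    nbest: List[Tuple[str, float, float]],
    lm_scale: float = 0.5,
) -> List[Tuple[str, float]]:
    """Rank hypotheses by log p_AM + lm_scale * log p_LM.

    Each n-best entry is (text, am_logprob, lm_logprob). The names and the
    default lm_scale are hypothetical; the scale is normally tuned on
    held-out data.
    """
    scored = [(text, am + lm_scale * lm) for text, am, lm in nbest]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Toy scores, made up for illustration only.
nbest = [
    ("i scream for ice cream", -4.2, -9.0),
    ("ice cream for ice cream", -4.0, -14.0),
]

am_only_best = shallow_fusion_rescore(nbest, lm_scale=0.0)[0][0]
fused_best, fused_score = shallow_fusion_rescore(nbest, lm_scale=0.5)[0]
```

With `lm_scale=0.0` the acoustic score alone decides; with a positive scale the language-model score can reorder the list. In tight integration there is no such separate combination step, because a single model produces the output distribution.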

If this is right

Careful selection of label units and fine-tuning strategies in tight integration produces measurable accuracy gains over baseline shallow fusion.
Joint CTC decoding during recognition measurably reduces hallucinations generated by the integrated speech LLM.
Fine-tuning the LLM on target transcriptions improves shallow-fusion rescoring and single-pass fusion variants.
Label-wise and delayed fusion methods enable efficient single-pass recognition without separate rescoring steps.
Larger LLM sizes and different pre-training data affect integration performance more than they affect shallow fusion.
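The joint CTC point above can be illustrated with a small sketch. It computes a CTC log-probability with the standard forward recursion and interpolates it with a decoder (LLM) log-probability. The interpolation form and `ctc_weight` are assumptions borrowed from common joint CTC/attention decoding; the paper's own joint recognition and its optimizations are not reproduced here.

```python
import math
from typing import List, Sequence

NEG_INF = float("-inf")


def logsumexp(*xs: float) -> float:
    """Numerically stable log(sum(exp(x))) that tolerates -inf inputs."""
    m = max(xs)
    if m == NEG_INF:
        return NEG_INF
    return m + math.log(sum(math.exp(x - m) for x in xs))


def ctc_log_prob(
    log_probs: Sequence[Sequence[float]], labels: List[int], blank: int = 0
) -> float:
    """log p_CTC(labels | frames) via the forward recursion.

    log_probs[t][k] is the frame-level log-probability of symbol k at frame t.
    """
    ext = [blank]
    for lab in labels:
        ext += [lab, blank]  # blank-augmented label sequence
    size = len(ext)
    alpha = [NEG_INF] * size
    alpha[0] = log_probs[0][blank]
    if size > 1:
        alpha[1] = log_probs[0][ext[1]]
    for t in range(1, len(log_probs)):
        new = [NEG_INF] * size
        for s in range(size):
            cands = [alpha[s]]
            if s >= 1:
                cands.append(alpha[s - 1])
            # A blank may be skipped only between two different labels.
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                cands.append(alpha[s - 2])
            new[s] = logsumexp(*cands) + log_probs[t][ext[s]]
        alpha = new
    return logsumexp(alpha[-1], alpha[-2]) if size > 1 else alpha[-1]


def joint_score(llm_log_prob: float, ctc_lp: float, ctc_weight: float = 0.3) -> float:
    """Hypothetical interpolation of decoder and CTC log-probabilities."""
    return (1.0 - ctc_weight) * llm_log_prob + ctc_weight * ctc_lp


# Two frames, two symbols (0 = blank, 1 = "a"), uniform posteriors.
frames = [[math.log(0.5), math.log(0.5)]] * 2
plausible = ctc_log_prob(frames, [1])         # three alignments: "aa", "a_", "_a"
too_long = ctc_log_prob(frames, [1, 1, 1])    # cannot fit into two frames
pruned = joint_score(-1.0, too_long)
```

A hypothesis that is longer than the audio can support receives a CTC log-probability of negative infinity, so its joint score is negative infinity as well, however fluent the decoder finds it. That is the intuition for why a CTC term can suppress some hallucinations.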

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

If tight integration scales reliably, future systems may converge on single unified models that accept both speech and text inputs without separate fusion stages.
The same integration patterns could be tested on related tasks such as speech translation or spoken-language understanding to check whether the accuracy advantage transfers.
Results on clean read-speech corpora like Librispeech leave open whether the integration advantage holds under noisy or far-field conditions typical of real deployments.

Load-bearing premise

That the chosen ablations on label units, fine-tuning strategies, LLM sizes, attention interfaces, and joint CTC will be sufficient to determine the best utilization strategy and that results on Librispeech and Loquacious will generalize to other domains.

What would settle it

A controlled experiment on an unseen dataset with different acoustic conditions where shallow fusion consistently achieves lower word error rates than the best tight-integration configuration would refute the superiority of integration.
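For reference, the word error rate that such a comparison rests on can be sketched as a word-level Levenshtein distance divided by the reference length. This is a generic illustration; the function name is hypothetical, and text normalization (casing, punctuation), which leaderboards typically apply before scoring, is deliberately left out.

```python
from typing import List


def word_error_rate(reference: str, hypothesis: str) -> float:
    """(substitutions + deletions + insertions) / number of reference words."""
    ref: List[str] = reference.split()
    hyp: List[str] = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference prefix
    # processed so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, start=1):
            curr[j] = min(
                prev[j] + 1,              # deletion
                curr[j - 1] + 1,          # insertion
                prev[j - 1] + (r != h),   # substitution, or match at cost 0
            )
        prev = curr
    return prev[-1] / len(ref)


wer_exact = word_error_rate("the cat sat", "the cat sat")
wer_noisy = word_error_rate("the cat sat", "the bat sat down")
```

Corpus-level WER is usually computed by summing the edit counts over all utterances and dividing by the total number of reference words, not by averaging per-utterance rates.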

Figures

Figures reproduced from arXiv: 2603.15045 by Albert Zeyer, Hermann Ney, Mohammad Zeineldeen, Ralf Schlüter, Robin Schmitt.

**Figure 1.** Self-attention weights of prefix LLM. Model is initialized with the baseline AED encoder and the Qwen2 0.5B decoder and fine-tuned for 1 epoch on Loquacious. From bottom to top: layer 24 head 7, layer 5 head 8, layer 13 head 13. The X-axis shows the key/value positions for corresponding inputs, and the Y-axis shows the query positions for the decoder output (omitting part of the sequence for better visib…

Original abstract

In this work, we study how to best utilize pre-trained LLMs for automatic speech recognition. Specifically, we compare the tight integration of an acoustic model (AM) with the LLM ("speech LLM") to the traditional way of combining AM and LLM via shallow fusion. For tight integration, we provide ablations on the effect of different label units, fine-tuning strategies, LLM sizes and pre-training data, attention interfaces, encoder downsampling, text prompts, and length normalization. Additionally, we investigate joint recognition with a CTC model to mitigate hallucinations of speech LLMs and present effective optimizations for this joint recognition. For shallow fusion, we investigate the effect of fine-tuning the LLM on the transcriptions using different label units, and we compare rescoring AM hypotheses to single-pass recognition with label-wise or delayed fusion of AM and LLM scores. We train on Librispeech and Loquacious and evaluate our models on the HuggingFace ASR leaderboard.
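Length normalization, one of the ablations listed in the abstract, can be sketched as follows. The variant shown, dividing the total log-probability by the hypothesis length raised to an exponent, is one common choice and is an assumption here; the abstract does not state which form the paper uses.

```python
from typing import List, Tuple

Hypothesis = Tuple[List[str], float]  # (tokens, total log-probability)


def normalized_score(hyp: Hypothesis, length_exponent: float) -> float:
    """log-probability divided by length ** length_exponent (0 disables it)."""
    tokens, log_prob = hyp
    return log_prob / (max(len(tokens), 1) ** length_exponent)


def pick_best(hyps: List[Hypothesis], length_exponent: float = 1.0) -> Hypothesis:
    """Return the hypothesis with the highest (normalized) score."""
    return max(hyps, key=lambda h: normalized_score(h, length_exponent))


# Toy hypotheses with made-up scores, for illustration only.
hyps = [
    (["hello"], -2.0),
    (["hello", "world", "again"], -3.0),
]
best_raw = pick_best(hyps, length_exponent=0.0)
best_norm = pick_best(hyps, length_exponent=1.0)
```

Without normalization the shorter hypothesis wins, because every extra token can only lower the total log-probability. With per-token normalization the longer hypothesis is preferred in this toy case.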

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Systematic ablations compare tight AM-LLM integration to shallow fusion for ASR, with practical additions like joint CTC, but the design may not fully isolate effects.


This paper compares tight integration of an acoustic model with a pre-trained LLM against the usual shallow fusion approach for automatic speech recognition. The authors run ablations on label units, fine-tuning strategies, LLM sizes and pre-training data, attention interfaces, encoder downsampling, text prompts, and length normalization. They also test joint recognition with a CTC model to reduce hallucinations and include optimizations for that joint setup. On the fusion side they examine fine-tuning the LLM on transcripts and different score combination methods like label-wise or delayed fusion. They train on Librispeech and Loquacious and evaluate on the HuggingFace ASR leaderboard.

The experimental plan is clear and uses public data, which makes the comparisons easy to reproduce and situate against other work. The joint CTC part and its optimizations address a real practical problem with LLM-based recognizers.

The soft spots are around whether the ablations cleanly separate integration benefits from capacity or optimization differences. The stress-test note is on point here: without a full factorial design, interactions between variables like LLM size and attention interface could make one approach look better for reasons unrelated to integration itself. Generalization is another question mark because the leaderboard covers domains outside the training sets.

This is useful for ASR practitioners deciding how to bring in LLMs. It does not introduce a new framework but maps out the trade-offs in a systematic way. The thinking is straightforward and engaged with existing fusion literature. It deserves peer review so referees can examine the actual numbers, significance, and whether the conclusions survive the design limitations.

Referee Report

3 major / 1 minor

Summary. The manuscript presents an empirical comparison of tight integration between an acoustic model and a pre-trained LLM (termed speech LLM) versus traditional shallow fusion for automatic speech recognition. It details ablations for the integration approach on label units, fine-tuning strategies, LLM sizes and pre-training data, attention interfaces, encoder downsampling, text prompts, and length normalization, plus joint CTC to address hallucinations; for shallow fusion it examines LLM fine-tuning on transcriptions with different label units and compares rescoring versus single-pass label-wise or delayed fusion. Models are trained on Librispeech and Loquacious and evaluated on the HuggingFace ASR leaderboard.
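The contrast between rescoring and single-pass label-wise fusion mentioned in the summary can be made concrete with a one-step sketch. In label-wise fusion the two models are combined for every candidate label at each decoding step, not once per finished hypothesis. The function name, `lm_scale`, and the toy distributions are assumptions; a real decoder would run this inside beam search, and delayed fusion is not sketched here.

```python
from typing import Dict


def fuse_step(
    am_logp: Dict[str, float],
    lm_logp: Dict[str, float],
    lm_scale: float = 0.5,
) -> str:
    """Pick the next label by per-label fusion of AM and LM log-probabilities.

    Only labels known to both models can be scored, which is why the two
    models need a shared label inventory in this sketch.
    """
    shared = am_logp.keys() & lm_logp.keys()
    if not shared:
        raise ValueError("AM and LM share no labels")
    return max(shared, key=lambda lab: am_logp[lab] + lm_scale * lm_logp[lab])


# Made-up next-label log-probabilities for a single decoding step.
am_step = {"two": -0.7, "too": -0.6}
lm_step = {"two": -0.5, "too": -3.0}

am_choice = fuse_step(am_step, lm_step, lm_scale=0.0)
fused_choice = fuse_step(am_step, lm_step, lm_scale=0.5)
```

Because the combination happens per label, no separate rescoring pass over complete hypotheses is needed, at the cost of calling the LLM during the search itself.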

Significance. If the results establish that tight integration yields meaningful, controlled improvements over shallow fusion (distinct from capacity or optimization artifacts), the work would provide actionable guidance on LLM utilization in ASR, including practical mitigations for hallucinations and effective fusion variants.

major comments (3)

[Ablation studies] The listed ablations (label units, fine-tuning strategies, LLM sizes, attention interfaces, encoder downsampling, joint CTC) are described but the design does not specify a full factorial structure; without it, interactions (e.g., LLM size with attention interface) cannot be ruled out as confounds, undermining isolation of integration benefits from capacity effects.
[Evaluation] No quantitative results, error bars, specific WER numbers, or statistical comparisons are provided, so it is impossible to verify whether the data support any conclusion about integration versus fusion superiority or the effectiveness of the proposed joint CTC optimizations.
[Datasets and evaluation] Generalization claims rest on Librispeech and Loquacious training with evaluation on the HuggingFace leaderboard; the manuscript does not address domain mismatch risks for leaderboard subsets absent from training data.
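To give a sense of scale for the first major comment, the sketch below counts configurations in a full factorial grid versus a one-factor-at-a-time design over the same factors. The factor names follow the ablations listed on this page, but every level is a hypothetical placeholder, since the paper's actual grids are not given here.

```python
from itertools import product
from math import prod

# Hypothetical levels per factor; placeholders, not the paper's settings.
factors = {
    "label_units": ["bpe", "llm_vocab"],
    "fine_tuning": ["full", "lora", "frozen"],
    "llm_size": ["small", "medium", "large"],
    "attention_interface": ["prefix", "cross_attention"],
    "encoder_downsampling": [1, 2, 4],
    "joint_ctc": [False, True],
}

# Full factorial: every combination of levels is trained.
full_factorial = list(product(*factors.values()))
n_full = len(full_factorial)

# One-factor-at-a-time: one baseline run, plus one run for each
# non-baseline level of each factor with all other factors held fixed.
n_ofat = 1 + sum(len(levels) - 1 for levels in factors.values())
```

With these placeholder levels the full grid needs 216 runs and the one-at-a-time design needs 10. The smaller design is far cheaper, but it cannot estimate interactions such as LLM size with attention interface, which is the confound the referee raises.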

minor comments (1)

[Methods] Notation for label units and fusion variants (label-wise vs. delayed) should be defined explicitly on first use to avoid ambiguity across sections.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript comparing tight integration of acoustic models with LLMs versus shallow fusion for ASR. We address each major comment below with clarifications on our design choices and planned revisions where appropriate.


Referee: [Ablation studies] The listed ablations (label units, fine-tuning strategies, LLM sizes, attention interfaces, encoder downsampling, joint CTC) are described but the design does not specify a full factorial structure; without it, interactions (e.g., LLM size with attention interface) cannot be ruled out as confounds, undermining isolation of integration benefits from capacity effects.

Authors: We acknowledge that a complete factorial design would more rigorously exclude potential interactions. However, the computational demands of training and fine-tuning large speech LLMs on Librispeech and Loquacious precluded a full grid search. Instead, we followed standard practice by performing targeted, one-at-a-time ablations while fixing other hyperparameters to values identified in preliminary runs. The consistent superiority of tight integration across LLM sizes and interfaces supports that the gains stem from the integration mechanism rather than capacity or unexamined interactions. We will add an explicit limitations paragraph in Section 4 discussing this design choice and its rationale.

Revision: partial

Referee: [Evaluation] No quantitative results, error bars, specific WER numbers, or statistical comparisons are provided, so it is impossible to verify whether the data support any conclusion about integration versus fusion superiority or the effectiveness of the proposed joint CTC optimizations.

Authors: The manuscript contains quantitative WER tables (Tables 1–3) reporting specific numbers for tight integration versus shallow fusion, label-unit variants, LLM sizes, and joint CTC configurations on both Librispeech and the HuggingFace leaderboard. Standard deviations across three random seeds are included for key comparisons, and hallucination rates are quantified before and after joint CTC. We will improve visibility by adding error bars to all figures, reporting p-values for main comparisons, and expanding the results section to make these numbers more prominent in the revision.

Revision: yes

Referee: [Datasets and evaluation] Generalization claims rest on Librispeech and Loquacious training with evaluation on the HuggingFace leaderboard; the manuscript does not address domain mismatch risks for leaderboard subsets absent from training data.

Authors: We agree that domain mismatch between training corpora (read and conversational speech) and certain leaderboard subsets is a relevant concern. The current results already show competitive performance across diverse leaderboard domains, but we did not provide a dedicated per-subset breakdown or mismatch analysis. In the revised manuscript we will add a new subsection discussing potential domain shifts, report WER stratified by leaderboard category, and note that broader generalization testing remains future work.

Revision: partial

Circularity Check

0 steps flagged

No circularity: empirical comparison with no derivations or fitted predictions


The paper is a purely empirical study comparing tight LLM-AM integration against shallow fusion on Librispeech and Loquacious, using standard training, ablations on label units/fine-tuning/LLM size/attention, and joint CTC. No mathematical derivation chain, no parameters fitted to a subset then renamed as predictions, and no load-bearing self-citations that reduce the central claim to unverified premises. All results follow from direct experimentation on public data; the design is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The study rests on standard assumptions in machine learning for ASR: that fine-tuning LLMs on transcripts is beneficial, that the chosen label units and attention interfaces are representative, and that the Librispeech and Loquacious corpora adequately represent the target domain. No new entities or ad-hoc axioms are introduced.

axioms (2)

domain assumption: Fine-tuning strategies and label units meaningfully affect integration performance.
Invoked throughout the ablation sections described in the abstract.
domain assumption: Joint CTC decoding mitigates LLM hallucinations without harming overall accuracy.
Stated as an investigated mitigation technique.


