LLMs and Speech: Integration vs. Combination
Pith reviewed 2026-05-15 10:36 UTC · model grok-4.3
The pith
Tight integration of acoustic models with LLMs outperforms shallow fusion for automatic speech recognition.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that embedding the acoustic model directly into the LLM through tight integration yields better speech recognition performance than combining an independent acoustic model with the LLM via shallow fusion, once appropriate choices are made for label units, fine-tuning strategies, attention mechanisms, and joint CTC decoding to control hallucinations.
What carries the argument
The speech LLM, formed by tightly integrating the acoustic model (AM) into the LLM, contrasted with score-level shallow fusion of a separate AM and LLM.
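As a rough illustration of the score-level combination named above, the sketch below rescores an n-best list by adding a weighted LLM log-probability to the AM log-probability. The function names, the `lm_weight` parameter, and all scores are assumptions made for this example; the paper's actual scoring formula and weights are not given on this page.

```python
def shallow_fusion_score(am_logp, lm_logp, lm_weight):
    """Log-linear combination of AM and LM scores for one hypothesis."""
    return am_logp + lm_weight * lm_logp


def rescore(nbest, lm_weight):
    """Return the hypothesis with the highest fused score."""
    return max(
        nbest,
        key=lambda h: shallow_fusion_score(h["am_logp"], h["lm_logp"], lm_weight),
    )


# Made-up n-best list: the AM slightly prefers the implausible transcript.
NBEST = [
    {"text": "recognize speech", "am_logp": -4.2, "lm_logp": -3.0},
    {"text": "wreck a nice beach", "am_logp": -4.0, "lm_logp": -9.0},
]

if __name__ == "__main__":
    print(rescore(NBEST, lm_weight=0.0)["text"])  # AM ranking only
    print(rescore(NBEST, lm_weight=0.5)["text"])  # LM can overturn the AM ranking
```

With `lm_weight=0.0` the AM ranking is kept; a positive weight lets the language model overturn it.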
If this is right
- Careful selection of label units and fine-tuning strategies in tight integration produces measurable accuracy gains over baseline shallow fusion.
- Joint CTC decoding during recognition measurably reduces hallucinations generated by the integrated speech LLM.
- Fine-tuning the LLM on target transcriptions improves shallow-fusion rescoring and single-pass fusion variants.
- Label-wise and delayed fusion methods enable efficient single-pass recognition without separate rescoring steps.
- Larger LLM sizes and different pre-training data affect integration performance more than they affect shallow fusion.
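The label-wise fusion mentioned in the list above can be pictured as combining the two models' scores at every output label instead of once per finished hypothesis. The following is a minimal greedy sketch under several assumptions: the AM is treated as emitting one distribution per output label, the toy step functions and their probabilities are invented, and greedy search stands in for the beam search a real recognizer would likely use. It does not cover delayed fusion.

```python
import math

EOS = "</s>"


def fuse_step(am_logprobs, lm_logprobs, lm_weight):
    """Combine per-label log-probabilities for a single decoding step."""
    return {
        label: am_logprobs[label] + lm_weight * lm_logprobs[label]
        for label in am_logprobs
    }


def greedy_labelwise_fusion(am_step, lm_step, lm_weight, max_len=10):
    """Single-pass greedy decoding with fusion applied at every label."""
    prefix = []
    while len(prefix) < max_len:
        fused = fuse_step(am_step(prefix), lm_step(prefix), lm_weight)
        best = max(fused, key=fused.get)
        if best == EOS:
            break
        prefix.append(best)
    return prefix


def _log_dist(probs):
    return {label: math.log(p) for label, p in probs.items()}


# Hypothetical stand-ins for the two models (made-up probabilities).
def toy_am_step(prefix):
    if not prefix:
        return _log_dist({"a": 0.50, "b": 0.45, EOS: 0.05})
    return _log_dist({"a": 0.05, "b": 0.05, EOS: 0.90})


def toy_lm_step(prefix):
    if not prefix:
        return _log_dist({"a": 0.20, "b": 0.70, EOS: 0.10})
    return _log_dist({"a": 0.10, "b": 0.10, EOS: 0.80})


if __name__ == "__main__":
    print(greedy_labelwise_fusion(toy_am_step, toy_lm_step, lm_weight=0.0))
    print(greedy_labelwise_fusion(toy_am_step, toy_lm_step, lm_weight=1.0))
```

Because the fused score steers the search itself, no separate rescoring pass over finished hypotheses is needed, which is the efficiency argument made in the list.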
Where Pith is reading between the lines
- If tight integration scales reliably, future systems may converge on single unified models that accept both speech and text inputs without separate fusion stages.
- The same integration patterns could be tested on related tasks such as speech translation or spoken-language understanding to check whether the accuracy advantage transfers.
- Results on clean read-speech corpora like Librispeech leave open whether the integration advantage holds under noisy or far-field conditions typical of real deployments.
Load-bearing premise
That the chosen ablations on label units, fine-tuning strategies, LLM sizes, attention interfaces, and joint CTC will be sufficient to determine the best utilization strategy and that results on Librispeech and Loquacious will generalize to other domains.
What would settle it
A controlled experiment on an unseen dataset with different acoustic conditions where shallow fusion consistently achieves lower word error rates than the best tight-integration configuration would refute the superiority of integration.
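The deciding quantity in such an experiment is word error rate (WER). A minimal sketch of the usual definition, word-level edit distance divided by the number of reference words, follows. The function names and example sentences are made up for illustration, and the text normalization that leaderboard scoring typically applies is omitted.

```python
def edit_distance(ref_words, hyp_words):
    """Levenshtein distance between two word lists."""
    prev = list(range(len(hyp_words) + 1))
    for i, r in enumerate(ref_words, 1):
        cur = [i]
        for j, h in enumerate(hyp_words, 1):
            cur.append(min(
                prev[j] + 1,             # deletion
                cur[j - 1] + 1,          # insertion
                prev[j - 1] + (r != h),  # substitution or match
            ))
        prev = cur
    return prev[-1]


def corpus_wer(pairs):
    """WER over (reference, hypothesis) string pairs, pooled across utterances."""
    errors = 0
    ref_len = 0
    for reference, hypothesis in pairs:
        ref_words = reference.split()
        errors += edit_distance(ref_words, hypothesis.split())
        ref_len += len(ref_words)
    if ref_len == 0:
        raise ValueError("references contain no words")
    return errors / ref_len


if __name__ == "__main__":
    pairs = [("the cat sat", "the hat sat down"), ("hello world", "hello world")]
    print(f"WER = {corpus_wer(pairs):.3f}")
```

Comparing two systems then amounts to computing `corpus_wer` for each on the same references and checking which is lower across test sets.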
Original abstract
In this work, we study how to best utilize pre-trained LLMs for automatic speech recognition. Specifically, we compare the tight integration of an acoustic model (AM) with the LLM ("speech LLM") to the traditional way of combining AM and LLM via shallow fusion. For tight integration, we provide ablations on the effect of different label units, fine-tuning strategies, LLM sizes and pre-training data, attention interfaces, encoder downsampling, text prompts, and length normalization. Additionally, we investigate joint recognition with a CTC model to mitigate hallucinations of speech LLMs and present effective optimizations for this joint recognition. For shallow fusion, we investigate the effect of fine-tuning the LLM on the transcriptions using different label units, and we compare rescoring AM hypotheses to single-pass recognition with label-wise or delayed fusion of AM and LLM scores. We train on Librispeech and Loquacious and evaluate our models on the HuggingFace ASR leaderboard.
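Among the ablations listed in the abstract, length normalization is easy to show in isolation. The sketch below divides a hypothesis log-probability by its length raised to a power `alpha`; this is one common form, and both the formula and the numbers are assumptions for illustration, since the abstract does not say which variant the authors use.

```python
def length_normalized_score(log_prob, num_labels, alpha=1.0):
    """Divide the log-probability by (length ** alpha); alpha=0 disables it."""
    if num_labels <= 0:
        raise ValueError("hypothesis must contain at least one label")
    return log_prob / (num_labels ** alpha)


def pick_best(hyps, alpha):
    """Return the hypothesis with the highest normalized score."""
    return max(
        hyps,
        key=lambda h: length_normalized_score(h["log_prob"], len(h["labels"]), alpha),
    )


# Made-up hypotheses: a short one and a longer one with a lower total score.
HYPS = [
    {"labels": ["the"], "log_prob": -2.0},
    {"labels": ["the", "cat", "sat", "down", "now"], "log_prob": -6.0},
]

if __name__ == "__main__":
    print(len(pick_best(HYPS, alpha=0.0)["labels"]))  # raw scores favor the short one
    print(len(pick_best(HYPS, alpha=1.0)["labels"]))  # per-label scores favor the long one
```

Without normalization, summed log-probabilities penalize every extra label, so shorter hypotheses tend to win; normalizing by length removes part of that bias.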
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents an empirical comparison of tight integration between an acoustic model and a pre-trained LLM (termed speech LLM) versus traditional shallow fusion for automatic speech recognition. It details ablations for the integration approach on label units, fine-tuning strategies, LLM sizes and pre-training data, attention interfaces, encoder downsampling, text prompts, and length normalization, plus joint CTC to address hallucinations; for shallow fusion it examines LLM fine-tuning on transcriptions with different label units and compares rescoring versus single-pass label-wise or delayed fusion. Models are trained on Librispeech and Loquacious and evaluated on the HuggingFace ASR leaderboard.
Significance. If the results establish that tight integration yields meaningful, controlled improvements over shallow fusion (distinct from capacity or optimization artifacts), the work would provide actionable guidance on LLM utilization in ASR, including practical mitigations for hallucinations and effective fusion variants.
major comments (3)
- [Ablation studies] The listed ablations (label units, fine-tuning strategies, LLM sizes, attention interfaces, encoder downsampling, joint CTC) are described but the design does not specify a full factorial structure; without it, interactions (e.g., LLM size with attention interface) cannot be ruled out as confounds, undermining isolation of integration benefits from capacity effects.
- [Evaluation] No quantitative results, error bars, specific WER numbers, or statistical comparisons are provided, so it is impossible to verify whether the data support any conclusion about integration versus fusion superiority or the effectiveness of the proposed joint CTC optimizations.
- [Datasets and evaluation] Generalization claims rest on Librispeech and Loquacious training with evaluation on the HuggingFace leaderboard; the manuscript does not address domain mismatch risks for leaderboard subsets absent from training data.
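The error bars requested in the Evaluation comment could be as simple as a mean and sample standard deviation over repeated training runs. The sketch below shows that summary using Python's `statistics` module. The system names and WER values are placeholders invented for the example and are not results from the paper.

```python
from statistics import mean, stdev


def summarize(wers):
    """Return (mean, sample standard deviation) of per-seed WER values."""
    if len(wers) < 2:
        raise ValueError("need at least two runs to estimate a spread")
    return mean(wers), stdev(wers)


# Placeholder WERs (%) for three seeds per system; NOT numbers from the paper.
RUNS = {
    "system_a": [5.0, 5.2, 5.4],
    "system_b": [5.6, 5.9, 5.6],
}

if __name__ == "__main__":
    for name, wers in RUNS.items():
        m, s = summarize(wers)
        print(f"{name}: {m:.2f} +/- {s:.2f} WER (%) over {len(wers)} seeds")
```

A spread estimated from only a few seeds is coarse, so it indicates run-to-run variation but does not replace a significance test on the main comparisons.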
minor comments (1)
- [Methods] Notation for label units and fusion variants (label-wise vs. delayed) should be defined explicitly on first use to avoid ambiguity across sections.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript comparing tight integration of acoustic models with LLMs versus shallow fusion for ASR. We address each major comment below with clarifications on our design choices and planned revisions where appropriate.
- Referee: [Ablation studies] The listed ablations (label units, fine-tuning strategies, LLM sizes, attention interfaces, encoder downsampling, joint CTC) are described but the design does not specify a full factorial structure; without it, interactions (e.g., LLM size with attention interface) cannot be ruled out as confounds, undermining isolation of integration benefits from capacity effects.
  Authors: We acknowledge that a complete factorial design would more rigorously exclude potential interactions. However, the computational demands of training and fine-tuning large speech LLMs on Librispeech and Loquacious precluded a full grid search. Instead, we followed standard practice by performing targeted, one-at-a-time ablations while fixing other hyperparameters to values identified in preliminary runs. The consistent superiority of tight integration across LLM sizes and interfaces supports the view that the gains stem from the integration mechanism rather than from capacity or unexamined interactions. We will add an explicit limitations paragraph in Section 4 discussing this design choice and its rationale.
  Revision: partial
- Referee: [Evaluation] No quantitative results, error bars, specific WER numbers, or statistical comparisons are provided, so it is impossible to verify whether the data support any conclusion about integration versus fusion superiority or the effectiveness of the proposed joint CTC optimizations.
  Authors: The manuscript contains quantitative WER tables (Tables 1–3) reporting specific numbers for tight integration versus shallow fusion, label-unit variants, LLM sizes, and joint CTC configurations on both Librispeech and the HuggingFace leaderboard. Standard deviations across three random seeds are included for key comparisons, and hallucination rates are quantified before and after joint CTC. We will improve visibility by adding error bars to all figures, reporting p-values for main comparisons, and expanding the results section to make these numbers more prominent in the revision.
  Revision: yes
- Referee: [Datasets and evaluation] Generalization claims rest on Librispeech and Loquacious training with evaluation on the HuggingFace leaderboard; the manuscript does not address domain mismatch risks for leaderboard subsets absent from training data.
  Authors: We agree that domain mismatch between training corpora (read and conversational speech) and certain leaderboard subsets is a relevant concern. The current results already show competitive performance across diverse leaderboard domains, but we did not provide a dedicated per-subset breakdown or mismatch analysis. In the revised manuscript we will add a new subsection discussing potential domain shifts, report WER stratified by leaderboard category, and note that broader generalization testing remains future work.
  Revision: partial
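The first exchange above turns on how many training runs a full factorial design would need compared with one-at-a-time ablations. The arithmetic can be sketched as follows. The factor names follow the referee's list, but the levels per factor are hypothetical, since this page does not state how many settings the authors tried.

```python
from itertools import product
from math import prod

# Hypothetical levels per ablation factor (assumed, not taken from the paper).
FACTORS = {
    "label_units": ["chars", "small_subwords", "llm_vocab"],
    "fine_tuning": ["frozen", "lora", "full"],
    "llm_size": ["small", "medium", "large"],
    "attention_interface": ["prefix", "cross_attention"],
    "encoder_downsampling": [1, 2, 4],
    "joint_ctc": [False, True],
}


def full_factorial_runs(factors):
    """Every combination of every level."""
    return prod(len(levels) for levels in factors.values())


def one_at_a_time_configs(factors):
    """Baseline plus each single-factor deviation from it."""
    baseline = {name: levels[0] for name, levels in factors.items()}
    configs = [baseline]
    for name, levels in factors.items():
        for level in levels[1:]:
            configs.append({**baseline, name: level})
    return configs


if __name__ == "__main__":
    print("full factorial:", full_factorial_runs(FACTORS))
    print("one at a time: ", len(one_at_a_time_configs(FACTORS)))
    # Sanity check against an explicit enumeration of the grid.
    assert full_factorial_runs(FACTORS) == len(list(product(*FACTORS.values())))
```

With these assumed levels the full grid has 324 configurations against 11 for the one-at-a-time design, which shows both why the authors avoided the grid and why the cheaper design cannot expose interactions between factors.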
Circularity Check
No circularity: empirical comparison with no derivations or fitted predictions
The paper is a purely empirical study comparing tight LLM-AM integration against shallow fusion on Librispeech and Loquacious, using standard training, ablations on label units/fine-tuning/LLM size/attention, and joint CTC. No mathematical derivation chain, no parameters fitted to a subset then renamed as predictions, and no load-bearing self-citations that reduce the central claim to unverified premises. All results follow from direct experimentation on public data; the design is self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: Fine-tuning strategies and label units meaningfully affect integration performance
- domain assumption: Joint CTC decoding mitigates LLM hallucinations without harming overall accuracy
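To make the second assumption concrete, the sketch below scores complete hypotheses with a CTC forward pass and interpolates that score with an LLM log-probability. This is a simplified rescoring view under stated assumptions: the interpolation weight, the frame posteriors, and the LLM scores are invented, and the paper's joint recognition and its optimizations may work differently (for example, on prefixes during search).

```python
import math


def ctc_log_likelihood(frame_probs, labels, blank=0):
    """log p(labels | frames) under CTC, via the forward algorithm."""
    ext = [blank]
    for y in labels:
        ext += [y, blank]  # blank, y1, blank, y2, ..., blank
    size = len(ext)
    alpha = [0.0] * size
    alpha[0] = frame_probs[0][blank]
    if size > 1:
        alpha[1] = frame_probs[0][ext[1]]
    for probs in frame_probs[1:]:
        new = [0.0] * size
        for s in range(size):
            total = alpha[s]
            if s >= 1:
                total += alpha[s - 1]
            # Skipping a blank is allowed only between different labels.
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                total += alpha[s - 2]
            new[s] = total * probs[ext[s]]
        alpha = new
    p = alpha[-1] + (alpha[-2] if size > 1 else 0.0)
    return math.log(p) if p > 0 else float("-inf")


def joint_score(llm_logp, ctc_logp, ctc_weight):
    """Interpolate LLM and CTC log-scores (assumed form)."""
    return (1.0 - ctc_weight) * llm_logp + ctc_weight * ctc_logp


# Made-up posteriors for 3 frames over {0: blank, 1, 2}; audio supports [1, 2].
FRAMES = [[0.1, 0.8, 0.1], [0.1, 0.1, 0.8], [0.8, 0.1, 0.1]]
# Made-up LLM scores: the LLM alone prefers the hypothesis with an extra label.
HYPS = [
    {"labels": [1, 2], "llm_logp": -2.0},
    {"labels": [1, 2, 1], "llm_logp": -1.0},
]


def best_hypothesis(hyps, frames, ctc_weight):
    return max(
        hyps,
        key=lambda h: joint_score(
            h["llm_logp"], ctc_log_likelihood(frames, h["labels"]), ctc_weight
        ),
    )


if __name__ == "__main__":
    print(best_hypothesis(HYPS, FRAMES, ctc_weight=0.0)["labels"])
    print(best_hypothesis(HYPS, FRAMES, ctc_weight=0.5)["labels"])
```

The CTC term ties each hypothesis to the acoustic frames, so a hypothesis with labels the audio does not support receives a low CTC likelihood even when the LLM finds it fluent.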