arxiv: 2601.20898 · v1 · submitted 2026-01-28 · 📡 eess.AS · cs.CL· cs.LG

Recognition: 2 theorem links

· Lean Theorem

Reducing Prompt Sensitivity in LLM-based Speech Recognition Through Learnable Projection

Sergio Burdisso , Esa\'u Villatoro-Tello , Shashi Kumar , Srikanth Madikeri , Andr\'es Carofilis , Pradeep Rangappa , Manjunath K E , Kadri Hacioglu

show 2 more authors

Petr Motlicek Andreas Stolcke

Authors on Pith no claims yet

Pith reviewed 2026-05-16 10:35 UTC · model grok-4.3

classification 📡 eess.AS cs.CLcs.LG

keywords LLM-based ASRprompt sensitivitylearnable projectionspeech recognitionprompt projectorautomatic speech recognitionmodel-agnostic extension

0 comments

The pith

A learnable prompt projector makes LLM-based speech recognition less sensitive to prompt choice while improving accuracy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

LLM-based automatic speech recognition connects speech models to large language models using a fixed manual prompt that turns out to vary widely in effectiveness across datasets. No single prompt works best in all cases, which creates performance instability. The paper introduces a prompt projector that learns to map those prompt embeddings into more useful regions of the LLM input space. This addition is model-agnostic and leaves the underlying ASR system unchanged. Tests on four datasets show higher accuracy, lower variability, and results that beat even the strongest hand-selected prompts.

Core claim

The central claim is that a simple learnable prompt projector, trained to transform prompt embeddings without altering the speech-to-LLM projector or the LLM itself, consistently raises ASR performance, lowers sensitivity to prompt wording, and surpasses the best manually chosen prompts across multiple datasets.

What carries the argument

The prompt projector, a model-agnostic module that projects prompt embeddings to more effective regions of the LLM input space.

If this is right

ASR word error rates drop consistently when the projector is added.
Performance variability caused by prompt selection shrinks.
The projector beats the single best manually chosen prompt on each dataset tested.
The base speech-to-LLM architecture requires no modification to gain these benefits.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Prompt engineering time can shrink because the projector learns suitable projections from data.
The same idea could apply to other LLM pipelines where fixed prompts create unwanted sensitivity.
Once trained, the projector might allow the system to handle new acoustic conditions without full model retraining.

Load-bearing premise

The prompt projector can be trained on the same data used for the base ASR model without introducing overfitting or requiring dataset-specific tuning that limits generalization.

What would settle it

If the prompt projector fails to lower word error rates or increases performance variance relative to the best fixed prompt on a new held-out dataset, the central claim would be falsified.

read the original abstract

LLM-based automatic speech recognition (ASR), a well-established approach, connects speech foundation models to large language models (LLMs) through a speech-to-LLM projector, yielding promising results. A common design choice in these architectures is the use of a fixed, manually defined prompt during both training and inference. This setup not only enables applicability across a range of practical scenarios, but also helps maximize model performance. However, the impact of prompt design remains underexplored. This paper presents a comprehensive analysis of commonly used prompts across diverse datasets, showing that prompt choice significantly affects ASR performance and introduces instability, with no single prompt performing best across all cases. Inspired by the speech-to-LLM projector, we propose a prompt projector module, a simple, model-agnostic extension that learns to project prompt embeddings to more effective regions of the LLM input space, without modifying the underlying LLM-based ASR model. Experiments on four datasets show that the addition of a prompt projector consistently improves performance, reduces variability, and outperforms the best manually selected prompts.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The prompt projector is a straightforward addition that addresses real instability from fixed prompts in LLM-ASR, but the abstract leaves the size of gains and generalization unverified.

read the letter

The paper's core point is that prompt choice creates noticeable instability in LLM-based ASR, with no single fixed prompt working best across datasets. They add a small learnable prompt projector that maps the prompt embeddings into a better region of the LLM input space, keeping the rest of the model unchanged. This is presented as a model-agnostic extension that reduces both error rates and variability while beating the best hand-picked prompt on four datasets. That analysis of prompt sensitivity is useful on its own, and the projector idea follows directly from the existing speech-to-LLM projector without adding much complexity. It targets a practical pain point for people deploying these systems. The main weakness is that the abstract supplies no numbers, no error bars, no statistical tests, and no description of how the projector is trained or regularized. Because the module is optimized on the same data as the base ASR model, it is not obvious whether the reported improvements reflect a general mapping or dataset-specific compensation. The stress-test concern about overfitting is reasonable given what is shown here; without held-out splits or cross-dataset checks for the projector itself, the generalization claim stays provisional. The work is aimed at ASR researchers and engineers who already use LLM back-ends and need to cut down on prompt engineering. It is coherent on its own terms and engages the existing literature, so it is worth sending to a serious referee even if the current evidence is thin. The full paper would need to supply the missing quantitative details and controls before it could be considered solid.

Referee Report

3 major / 2 minor

Summary. The paper analyzes prompt sensitivity in LLM-based ASR systems, demonstrating that fixed manual prompts lead to performance variability across datasets with no universally optimal choice. It proposes a learnable prompt projector module—inspired by the speech-to-LLM projector—that maps prompt embeddings into more effective regions of the LLM input space without altering the base model. Experiments across four datasets show that this addition yields consistent performance gains, reduced variability, and outperforms the best manually selected prompts.

Significance. If the empirical results hold under proper controls, the work offers a lightweight, model-agnostic extension that addresses a practical instability in LLM-ASR pipelines. The approach preserves the underlying architecture while providing a data-driven mitigation for prompt engineering, which could improve robustness in deployment scenarios where prompt choice is otherwise brittle.

major comments (3)

[§3.2] §3.2 (Method): The prompt projector is described as trained jointly with the base ASR model on the same task data, yet no details are provided on regularization, separate validation splits for the projector parameters, or cross-validation procedures. This directly bears on the central claim of robust, generalizable improvements rather than dataset-specific compensation.
[§4] §4 (Experiments): The abstract and results claim consistent gains and reduced variability across four datasets, but the manuscript provides no quantitative metrics (e.g., WER deltas), error bars, statistical significance tests, or ablation on projector capacity. Without these, the magnitude and reliability of the reported outperformance over manual prompts cannot be assessed.
[§4.3] §4.3 (Cross-dataset analysis): While four datasets are used, there is no explicit cross-dataset evaluation (e.g., training the projector on one dataset and testing on others) to substantiate the model-agnostic generalization argument against the overfitting risk raised by joint training on task data.

minor comments (2)

[§3.1] Notation for the prompt projector (e.g., its input/output dimensions and integration point with the LLM) should be formalized with an equation or diagram for clarity.
[§2] The prompt analysis in §2 would benefit from a table summarizing the exact prompt templates tested and their per-dataset performance to make the sensitivity claim more transparent.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our work. We have carefully reviewed each major comment and provide point-by-point responses below. Where the comments identify gaps in experimental detail or analysis, we agree that revisions are warranted and will update the manuscript accordingly to strengthen the presentation of our results and claims.

read point-by-point responses

Referee: [§3.2] §3.2 (Method): The prompt projector is described as trained jointly with the base ASR model on the same task data, yet no details are provided on regularization, separate validation splits for the projector parameters, or cross-validation procedures. This directly bears on the central claim of robust, generalizable improvements rather than dataset-specific compensation.

Authors: We agree that additional details on the training procedure are necessary to substantiate the robustness claims. In the revised manuscript, we will expand §3.2 to specify the regularization strategy applied to the projector (including any weight decay or dropout), the allocation of separate validation splits for tuning projector hyperparameters, and the cross-validation protocol used to confirm that gains are consistent rather than artifacts of particular data partitions. revision: yes
Referee: [§4] §4 (Experiments): The abstract and results claim consistent gains and reduced variability across four datasets, but the manuscript provides no quantitative metrics (e.g., WER deltas), error bars, statistical significance tests, or ablation on projector capacity. Without these, the magnitude and reliability of the reported outperformance over manual prompts cannot be assessed.

Authors: We acknowledge that the current presentation lacks the quantitative rigor needed to fully evaluate the improvements. In the revised §4, we will add explicit WER deltas relative to the best fixed prompts, error bars computed over multiple independent runs, results of statistical significance tests, and an ablation study examining the effect of projector capacity (e.g., varying hidden dimension and number of layers). These additions will allow direct assessment of the magnitude and reliability of the gains. revision: yes
Referee: [§4.3] §4.3 (Cross-dataset analysis): While four datasets are used, there is no explicit cross-dataset evaluation (e.g., training the projector on one dataset and testing on others) to substantiate the model-agnostic generalization argument against the overfitting risk raised by joint training on task data.

Authors: This is a fair observation regarding potential overfitting. To directly address the generalization concern, we will add cross-dataset experiments to the revised §4.3: the projector will be trained on a single dataset and evaluated on the remaining held-out datasets. The results of these experiments will be reported to demonstrate that the learned projection transfers across domains and supports the model-agnostic claim. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical proposal with independent learnable module

full rationale

The paper conducts an empirical study of prompt effects in LLM-based ASR, documents variability across datasets, and introduces a separate prompt projector module whose parameters are learned from task data. Performance gains are reported as experimental outcomes on four datasets rather than as any closed-form derivation or prediction that reduces to the input data or baseline selection by construction. No equations, self-definitional relations, or load-bearing self-citations appear in the provided text; the central claim remains an observable improvement from an added trainable component and does not collapse into a tautology.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 1 invented entities

The central claim rests on the assumption that a small learned projection of prompt embeddings can meaningfully alter LLM behavior in ASR without side effects, plus standard supervised training assumptions.

free parameters (1)

prompt projector parameters
Learned weights of the new module that are fitted during training on ASR data.

axioms (1)

domain assumption LLM-based ASR uses a fixed prompt during both training and inference
Stated as the common design choice whose sensitivity is being addressed.

invented entities (1)

prompt projector module no independent evidence
purpose: Learns to map prompt embeddings to more effective regions of the LLM input space
New component introduced to mitigate prompt sensitivity

pith-pipeline@v0.9.0 · 5530 in / 1262 out tokens · 36959 ms · 2026-05-16T10:35:28.609797+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

we propose a prompt projector module, a simple, model-agnostic extension that learns to project prompt embeddings to more effective regions of the LLM input space
IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

The prompt projector module is a simple drop-in extension... shares the same architecture as the speech projector

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

35 extracted references · 35 canonical work pages · 3 internal anchors

[1]

A conventional ap- proach uses a cascaded architecture, first with an automatic speech recognition (ASR) system followed by an LLM to process the rec- ognized text [5, 6, 7, 8]

INTRODUCTION Integrating speech capabilities into large language models (LLMs) is a key research area, enabling seamless voice interaction and en- hancing multimodal understanding [1, 2, 3, 4]. A conventional ap- proach uses a cascaded architecture, first with an automatic speech recognition (ASR) system followed by an LLM to process the rec- ognized text...

work page
[2]

METHODOLOGY 2.1. Base Model As our base model, we adopt the recently proposed SLAM-ASR ar- chitecture for LLM-based ASR [3].This model showed that a simple speech projection module is sufficient to achieve competitive perfor- mance, outperforming more complex LLM-based ASR systems. Its strong results and minimal design make it an ideal foundation for our ...

work page internal anchor Pith review Pith/arXiv arXiv 2026
[3]

All computations were performed inbfloat16format

Training uses AdamW [25] with learning rateγ= 10 −4, batch size 4, and early stopping based on cross-entropy loss on the dev set. All computations were performed inbfloat16format. Overall, our experiments required over 150 train-to-evaluation trials across various settings (10 prompts, 4 datasets, +/-pp(·), +/- LoRA, freez- ing/unfreezing) on a single NVI...

work page
[4]

best” row showing the top-performing manual prompt per dataset. “+LoRA

EXPERIMENTA TION 3.1. Assessing the Impact of Manual Prompts We first examine the effect of fixed prompts on the word error rate (WER%) of ourvanillaLLM-based ASR model (Section 2.1). For each prompt in Table 1, we train and evaluate an independentvanilla model per dataset, isolating the impact of prompt choice. In total, nine prompts beyond thebaseprompt...

work page
[5]

CONCLUSIONS In this work, we took a first step toward understanding and improv- ing prompt robustness in LLM-based ASR. Through a systematic evaluation of manual prompts across multiple datasets, we showed that prompt choice has a large impact on transcription quality, with even minor wording or placement changes leading to notable differ- ences in WER. T...

work page 2020
[6]

Audio Flamingo 3: Advancing Audio Intelligence with Fully Open Large Audio Language Models

Arushi Goel, Sreyan Ghosh, Jaehyeon Kim, Sonal Kumar, Zhifeng Kong, Sang-gil Lee, Chao-Han Huck Yang, Ramani Duraiswami, Di- nesh Manocha, Rafael Valle, et al., “Audio Flamingo 3: Advancing au- dio intelligence with fully open large audio language models,” preprint arXiv:2507.08128, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[7]

Qwen2-Audio Technical Report

Yunfei Chu, Jin Xu, Qian Yang, Haojie Wei, Xipin Wei, Zhifang Guo, Yichong Leng, Yuanjun Lv, Jinzheng He, Junyang Lin, et al., “Qwen2- audio technical report,” preprint arXiv:2407.10759, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[8]

An embarrassingly simple approach for LLM with strong ASR capacity,

Ziyang Ma, Guanrou Yang, Yifan Yang, Zhifu Gao, Jiaming Wang, Zhihao Du, Fan Yu, Qian Chen, Siqi Zheng, Shiliang Zhang, et al., “An embarrassingly simple approach for LLM with strong ASR capacity,” preprint arXiv:2402.08846, 2024

work page arXiv 2024
[9]

SLM: Bridge the thin gap between speech and text foundation models,

Mingqiu Wang, Wei Han, Izhak Shafran, Zelin Wu, Chung-Cheng Chiu, Yuan Cao, Nanxin Chen, Yu Zhang, Hagen Soltau, Paul K Rubenstein, et al., “SLM: Bridge the thin gap between speech and text foundation models,” inProc. IEEE ASRU, 2023

work page 2023
[10]

Multilingual and Fully Non-Autoregressive ASR with Large Language Model Fusion: A Comprehensive Study,

W Ronny Huang, Cyril Allauzen, Tongzhou Chen, Kilol Gupta, Ke Hu, James Qin, Yu Zhang, Yongqiang Wang, Shuo-Yiin Chang, and Tara N Sainath, “Multilingual and Fully Non-Autoregressive ASR with Large Language Model Fusion: A Comprehensive Study,” inProc. IEEE ICASSP, 2024

work page 2024
[11]

Prompting Large Lan- guage Models for Zero-Shot Domain Adaptation in Speech Recogni- tion,

Yuang Li, Yu Wu, Jinyu Li, and Shujie Liu, “Prompting Large Lan- guage Models for Zero-Shot Domain Adaptation in Speech Recogni- tion,” inProc. IEEE ASRU, 2023

work page 2023
[12]

Can Generative Large Language Models Perform ASR Error Correction?,

Rao Ma, Mengjie Qian, Potsawee Manakul, Mark Gales, and Kate Knill, “Can Generative Large Language Models Perform ASR Error Correction?,” preprint arXiv:2307.04172, 2023

work page arXiv 2023
[13]

Generative Speech Recognition Error Cor- rection With Large Language Models and Task-Activating Prompting,

Chao-Han Huck Yang, Yile Gu, Yi-Chieh Liu, Shalini Ghosh, Ivan Bu- lyko, and Andreas Stolcke, “Generative Speech Recognition Error Cor- rection With Large Language Models and Task-Activating Prompting,” inProc. IEEE ASRU, 2023

work page 2023
[14]

SALMONN: Towards Generic Hearing Abilities for Large Language Models,

Changli Tang, Wenyi Yu, Guangzhi Sun, Xianzhao Chen, Tian Tan, Wei Li, Lu Lu, MA Zejun, and Chao Zhang, “SALMONN: Towards Generic Hearing Abilities for Large Language Models,” inProc. 12th Intl. Conf. on Learning Representations, 2024

work page 2024
[15]

On decoder- only architecture for speech-to-text and large language model integra- tion,

Jian Wu, Yashesh Gaur, Zhuo Chen, Long Zhou, Yimeng Zhu, Tianrui Wang, Jinyu Li, Shujie Liu, Bo Ren, Linquan Liu, et al., “On decoder- only architecture for speech-to-text and large language model integra- tion,” inProc. IEEE ASRU, 2023

work page 2023
[16]

Connecting speech encoder and large language model for ASR,

Wenyi Yu, Changli Tang, Guangzhi Sun, Xianzhao Chen, Tian Tan, Wei Li, Lu Lu, Zejun Ma, and Chao Zhang, “Connecting speech encoder and large language model for ASR,” inProc. IEEE ICASSP, 2024

work page 2024
[17]

Performance evaluation of SLAM-ASR: The good, the bad, the ugly, and the way forward,

Shashi Kumar, Iuliia Thorbecke, Sergio Burdisso, Esa ´u Villatoro-Tello, Manjunath K E, Kadri Hacio ˘glu, Pradeep Rangappa, Petr Motlicek, Aravind Ganapathiraju, and Andreas Stolcke, “Performance evaluation of SLAM-ASR: The good, the bad, the ugly, and the way forward,” in Proc. IEEE ICASSP Workshop, 2025

work page 2025
[18]

Bridging the modality gap: Softly discretizing audio representation for llm-based automatic speech recognition,

Mu Yang, Szu-Jui Chen, Jiamin Xie, and John Hansen, “Bridging the modality gap: Softly discretizing audio representation for llm-based automatic speech recognition,” preprint arXiv:2506.05706, 2025

work page arXiv 2025
[19]

Low-resource domain adaptation for speech LLMs via text-only fine-tuning,

Yangui Fang, Jing Peng, Xu Li, Yu Xi, Chengwei Zhang, Guohui Zhong, and Kai Yu, “Low-resource domain adaptation for speech LLMs via text-only fine-tuning,” preprint arXiv:2506.05671, 2025

work page arXiv 2025
[20]

Approaching dia- logue state tracking via aligning speech encoders and LLMs,

ˇSimon Sedl ´aˇcek, Bolaji Yusuf, J ´an ˇSvec, Pradyoth Hegde, Santosh Kesiraju, Old ˇrich Plchot, and Jan ˇCernock`y, “Approaching dia- logue state tracking via aligning speech encoders and LLMs,” preprint arXiv:2506.08633, 2025

work page arXiv 2025
[21]

MaLa-ASR: Multimedia-assisted LLM-based ASR,

Guanrou Yang, Ziyang Ma, Fan Yu, Zhifu Gao, Shiliang Zhang, and Xie Chen, “MaLa-ASR: Multimedia-assisted LLM-based ASR,” in Proc. Interspeech, 2024

work page 2024
[22]

Text-only adaptation in LLM-based asr through text denoising,

Sergio Burdisso, Esa ´u Villatoro-Tello, Andr ´es Carofilis, Shashi Ku- mar, Kadri Hacioglu, Srikanth Madikeri, Pradeep Rangappa, Manju- nath K E, Petr Motlicek, Shankar Venkatesan, and Andreas Stolcke, “Text-only adaptation in LLM-based asr through text denoising,” in Proc. IEEE ICASSP, 2026

work page 2026
[23]

SpeechVerse: A large-scale gen- eralizable audio language model,

Nilaksh Das, Saket Dingliwal, Srikanth Ronanki, Rohit Paturi, Zhaocheng Huang, Prashant Mathur, Jie Yuan, Dhanush Bekal, Xing Niu, Sai Muralidhar Jayanthi, et al., “SpeechVerse: A large-scale gen- eralizable audio language model,” preprint arXiv:2405.08295, 2024

work page arXiv 2024
[24]

Unveiling the potential of LLM-based ASR on chinese open-source datasets,

Xuelong Geng, Tianyi Xu, Kun Wei, Bingshen Mu, Hongfei Xue, He Wang, Yangze Li, Pengcheng Guo, Yuhang Dai, Longhao Li, Mingchen Shao, and Lei Xie, “Unveiling the potential of LLM-based ASR on chinese open-source datasets,” inProc. IEEE 14th Intl. Symp. on Chinese Spoken Language Processing (ISCSLP), 2024

work page 2024
[25]

WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing,

Sanyuan Chen, Chengyi Wang, Zhengyang Chen, Yu Wu, Shujie Liu, Zhuo Chen, Jinyu Li, Naoyuki Kanda, Takuya Yoshioka, Xiong Xiao, et al., “WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing,”IEEE Journal of Selected Topics in Signal Processing, vol. 16, no. 6, pp. 1505–1518, 2022

work page 2022
[26]

Libri-light: A benchmark for ASR with limited or no supervision,

Jacob Kahn, Morgane Riviere, Weiyi Zheng, Evgeny Kharitonov, Qiantong Xu, Pierre-Emmanuel Mazar ´e, Julien Karadayi, Vitaliy Liptchinsky, Ronan Collobert, Christian Fuegen, et al., “Libri-light: A benchmark for ASR with limited or no supervision,” inProc. IEEE ICASSP, 2020

work page 2020
[27]

V oxPopuli: A large-scale multilingual speech corpus for representation learning, semi-supervised learning and interpretation,

Changhan Wang, Morgane Riviere, Ann Lee, Anne Wu, Chaitanya Tal- nikar, Daniel Haziza, Mary Williamson, Juan Pino, and Emmanuel Dupoux, “V oxPopuli: A large-scale multilingual speech corpus for representation learning, semi-supervised learning and interpretation,” inProceedings of the 59th Annual Meeting of the Association for Com- putational Linguistics ...

work page 2021
[28]

GigaSpeech: An evolving, multi-domain ASR corpus with 10,000 hours of transcribed audio,

Guoguo Chen, Shuzhou Chai, Guan-Bo Wang, Jiayu Du, Wei-Qiang Zhang, Chao Weng, Dan Su, Daniel Povey, Jan Trmal, Junbo Zhang, et al., “GigaSpeech: An evolving, multi-domain ASR corpus with 10,000 hours of transcribed audio,”Proc. Interspeech, 2021

work page 2021
[29]

Vicuna: An open-source chatbot impressing GPT-4 with 90%* ChatGPT quality,

Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E Gonzalez, et al., “Vicuna: An open-source chatbot impressing GPT-4 with 90%* ChatGPT quality,”See https://vicuna. lmsys. org (accessed 14 April 2023), vol. 2, no. 3, pp. 6, 2023

work page 2023
[30]

Decoupled Weight Decay Regular- ization,

Ilya Loshchilov and Frank Hutter, “Decoupled Weight Decay Regular- ization,” inProc. Intl. Conf. on Learning Representations, 2019

work page 2019
[31]

Librispeech: An ASR corpus based on public domain audio books,

Vassil Panayotov, Guoguo Chen, Daniel Povey, and Sanjeev Khudan- pur, “Librispeech: An ASR corpus based on public domain audio books,” inProc. IEEE ICASSP, 2015

work page 2015
[32]

Callhome: A telephone speech corpus for language identification and speech recognition,

Linguistic Data Consortium, “Callhome: A telephone speech corpus for language identification and speech recognition,” Linguistic Data Consortium, University of Pennsylvania, 2000, LDC2000S98

work page 2000
[33]

The AMI meeting corpus: A pre- announcement,

Jean Carletta, Simone Ashby, et al., “The AMI meeting corpus: A pre- announcement,” inInternational Workshop on Machine Learning for Multimodal Interaction. Springer, 2005, pp. 28–39

work page 2005
[34]

SpeechLLM: Multi-Modal LLM for Speech Understanding,

Shangeth Rajaa and Abhinav Tushar, “SpeechLLM: Multi-Modal LLM for Speech Understanding,” https://github.com/skit-ai/SpeechLLM, 2024

work page 2024
[35]

Pre-train, prompt, and predict: A sys- tematic survey of prompting methods in natural language processing,

Pengfei Liu, Weizhe Yuan, Jinlan Fu, Zhengbao Jiang, Hiroaki Hayashi, and Graham Neubig, “Pre-train, prompt, and predict: A sys- tematic survey of prompting methods in natural language processing,” ACM Comput. Surv., vol. 55, no. 9, Jan. 2023

work page 2023