Adapting speech-aware LLMs with speaker cluster identification tags and concatenated multi-speaker data yields superior speaker-attributed ASR performance versus sequential diarization-plus-ASR pipelines.
An embarrassingly simple approach for llm with strong asr capacity
9 Pith papers cite this work. Polarity classification is still indexing.
years
2026 9representative citing papers
Phoneme-based interfaces match or surpass projector-based ones for LLM ASR, especially in low-resource languages, and a BPE-phoneme hybrid offers additional improvements.
A learnable prompt projector added to LLM-based ASR reduces prompt sensitivity, lowers performance variability, and beats the best fixed prompts on four datasets.
Common-word acoustic cues and bias-word position prediction in speech LLMs cut rare-word transcription errors by 16.3% versus baselines, including out-of-domain cases.
A speech-text alignment method generates expressive pseudo-audio prompts for effective text-only domain adaptation in LLM-based ASR, outperforming prior text-only approaches on error rates and OOV coverage.
The authors introduce LLM-based semantic judgment and an agentic interaction loop that improves semantic fidelity and enables iterative corrections in automatic speech recognition beyond traditional WER.
Mixed batching with only 10% target-domain speech achieves word error rates matching or exceeding conventional full-dataset ASR fine-tuning in LLM-based models.
Lightweight training strategies allow speech-aware LLMs to output accurate word timestamps alongside ASR transcripts while also improving recognition quality across datasets.
Tight integration of acoustic models with LLMs for ASR is ablated against shallow fusion across label units, fine-tuning strategies, LLM sizes, and joint CTC decoding to mitigate hallucinations.
citing papers explorer
-
Speaker Attributed Automatic Speech Recognition Using Speech Aware LLMS
Adapting speech-aware LLMs with speaker cluster identification tags and concatenated multi-speaker data yields superior speaker-attributed ASR performance versus sequential diarization-plus-ASR pipelines.
-
Phonemes vs. Projectors: An Investigation of Speech-Language Interfaces for LLM-based ASR
Phoneme-based interfaces match or surpass projector-based ones for LLM ASR, especially in low-resource languages, and a BPE-phoneme hybrid offers additional improvements.
-
Reducing Prompt Sensitivity in LLM-based Speech Recognition Through Learnable Projection
A learnable prompt projector added to LLM-based ASR reduces prompt sensitivity, lowers performance variability, and beats the best fixed prompts on four datasets.
-
Contextual Biasing for ASR in Speech LLM with Common Word Cues and Bias Word Position Prediction
Common-word acoustic cues and bias-word position prediction in speech LLMs cut rare-word transcription errors by 16.3% versus baselines, including out-of-domain cases.
-
Refining Pseudo-Audio Prompts with Speech-Text Alignment for Text-Only Domain Adaptation in LLM-Based ASR
A speech-text alignment method generates expressive pseudo-audio prompts for effective text-only domain adaptation in LLM-based ASR, outperforming prior text-only approaches on error rates and OOV coverage.
-
Interactive ASR: Towards Human-Like Interaction and Semantic Coherence Evaluation for Agentic Speech Recognition
The authors introduce LLM-based semantic judgment and an agentic interaction loop that improves semantic fidelity and enables iterative corrections in automatic speech recognition beyond traditional WER.
-
Closing the Speech-Text Gap with Limited Audio for Effective Domain Adaptation in LLM-Based ASR
Mixed batching with only 10% target-domain speech achieves word error rates matching or exceeding conventional full-dataset ASR fine-tuning in LLM-based models.
-
In-Sync: Adaptation of Speech Aware Large Language Models for ASR with Word Level Timestamp Predictions
Lightweight training strategies allow speech-aware LLMs to output accurate word timestamps alongside ASR transcripts while also improving recognition quality across datasets.
-
LLMs and Speech: Integration vs. Combination
Tight integration of acoustic models with LLMs for ASR is ablated against shallow fusion across label units, fine-tuning strategies, LLM sizes, and joint CTC decoding to mitigate hallucinations.