An Embarrassingly Simple Approach for LLM with Strong ASR Capacity

Ziyang Ma, Guanrou Yang, Yifan Yang, Zhifu Gao, Jiaming Wang, Zhihao Du, Fan Yu, Qian Chen, Siqi Zheng, Shiliang Zhang, et al. · 2024 · arXiv 2402.08846

9 Pith papers cite this work. Polarity classification is still indexing.

representative citing papers

Speaker Attributed Automatic Speech Recognition Using Speech Aware LLMs

eess.AS · 2026-04-13 · unverdicted · novelty 7.0

Adapting speech-aware LLMs with speaker cluster identification tags and concatenated multi-speaker data yields superior speaker-attributed ASR performance versus sequential diarization-plus-ASR pipelines.

Phonemes vs. Projectors: An Investigation of Speech-Language Interfaces for LLM-based ASR

eess.AS · 2026-04-10 · unverdicted · novelty 7.0

Phoneme-based interfaces match or surpass projector-based ones for LLM ASR, especially in low-resource languages, and a BPE-phoneme hybrid offers additional improvements.

Reducing Prompt Sensitivity in LLM-based Speech Recognition Through Learnable Projection

eess.AS · 2026-01-28 · conditional · novelty 7.0

A learnable prompt projector added to LLM-based ASR reduces prompt sensitivity, lowers performance variability, and beats the best fixed prompts on four datasets.

Contextual Biasing for ASR in Speech LLM with Common Word Cues and Bias Word Position Prediction

eess.AS · 2026-04-14 · unverdicted · novelty 6.0

Common-word acoustic cues and bias-word position prediction in speech LLMs cut rare-word transcription errors by 16.3% versus baselines, including out-of-domain cases.

Refining Pseudo-Audio Prompts with Speech-Text Alignment for Text-Only Domain Adaptation in LLM-Based ASR

cs.SD · 2026-05-14 · unverdicted · novelty 5.0

A speech-text alignment method generates expressive pseudo-audio prompts for effective text-only domain adaptation in LLM-based ASR, outperforming prior text-only approaches on error rates and OOV coverage.

Interactive ASR: Towards Human-Like Interaction and Semantic Coherence Evaluation for Agentic Speech Recognition

cs.CL · 2026-04-10 · unverdicted · novelty 5.0

The authors introduce LLM-based semantic judgment and an agentic interaction loop that improves semantic fidelity and enables iterative corrections in automatic speech recognition beyond traditional WER.

Closing the Speech-Text Gap with Limited Audio for Effective Domain Adaptation in LLM-Based ASR

cs.CL · 2026-04-07 · unverdicted · novelty 5.0

Mixed batching with only 10% target-domain speech achieves word error rates matching or exceeding conventional full-dataset ASR fine-tuning in LLM-based models.

In-Sync: Adaptation of Speech Aware Large Language Models for ASR with Word Level Timestamp Predictions

eess.AS · 2026-04-14 · unverdicted · novelty 4.0

Lightweight training strategies allow speech-aware LLMs to output accurate word timestamps alongside ASR transcripts while also improving recognition quality across datasets.

LLMs and Speech: Integration vs. Combination

eess.AS · 2026-03-16 · unverdicted · novelty 4.0

Tight integration of acoustic models with LLMs for ASR is ablated against shallow fusion across label units, fine-tuning strategies, LLM sizes, and joint CTC decoding to mitigate hallucinations.

citing papers explorer

Speaker Attributed Automatic Speech Recognition Using Speech Aware LLMs · eess.AS · 2026-04-13 · unverdicted · none · ref 13
Phonemes vs. Projectors: An Investigation of Speech-Language Interfaces for LLM-based ASR · eess.AS · 2026-04-10 · unverdicted · none · ref 20
Reducing Prompt Sensitivity in LLM-based Speech Recognition Through Learnable Projection · eess.AS · 2026-01-28 · conditional · none · ref 8
Contextual Biasing for ASR in Speech LLM with Common Word Cues and Bias Word Position Prediction · eess.AS · 2026-04-14 · unverdicted · none · ref 32
Refining Pseudo-Audio Prompts with Speech-Text Alignment for Text-Only Domain Adaptation in LLM-Based ASR · cs.SD · 2026-05-14 · unverdicted · none · ref 11
Interactive ASR: Towards Human-Like Interaction and Semantic Coherence Evaluation for Agentic Speech Recognition · cs.CL · 2026-04-10 · unverdicted · none · ref 18
Closing the Speech-Text Gap with Limited Audio for Effective Domain Adaptation in LLM-Based ASR · cs.CL · 2026-04-07 · unverdicted · none · ref 8
In-Sync: Adaptation of Speech Aware Large Language Models for ASR with Word Level Timestamp Predictions · eess.AS · 2026-04-14 · unverdicted · none · ref 14
LLMs and Speech: Integration vs. Combination · eess.AS · 2026-03-16 · unverdicted · none · ref 32