Connecting speech encoder and large language model for ASR

Connecting Speech Encoder, Large Language Model for ASR , author= · 2023 · arXiv 2309.13963

3 Pith papers cite this work. Polarity classification is still indexing.

3 Pith papers citing it

representative citing papers

Phonemes vs. Projectors: An Investigation of Speech-Language Interfaces for LLM-based ASR

eess.AS · 2026-04-10 · unverdicted · novelty 7.0

Phoneme-based interfaces match or surpass projector-based ones for LLM ASR, especially in low-resource languages, and a BPE-phoneme hybrid offers additional improvements.

SALMONN: Towards Generic Hearing Abilities for Large Language Models

cs.SD · 2023-10-20 · unverdicted · novelty 6.0

SALMONN integrates speech and audio encoders with a text-based LLM to process general audio inputs, achieve competitive results on trained tasks, and exhibit emergent cross-modal abilities.

LLMs and Speech: Integration vs. Combination

eess.AS · 2026-03-16 · unverdicted · novelty 4.0

Tight integration of acoustic models with LLMs for ASR is ablated against shallow fusion across label units, fine-tuning strategies, LLM sizes, and joint CTC decoding to mitigate hallucinations.

citing papers explorer

Showing 3 of 3 citing papers.

Phonemes vs. Projectors: An Investigation of Speech-Language Interfaces for LLM-based ASR eess.AS · 2026-04-10 · unverdicted · none · ref 18
Phoneme-based interfaces match or surpass projector-based ones for LLM ASR, especially in low-resource languages, and a BPE-phoneme hybrid offers additional improvements.
SALMONN: Towards Generic Hearing Abilities for Large Language Models cs.SD · 2023-10-20 · unverdicted · none · ref 108
SALMONN integrates speech and audio encoders with a text-based LLM to process general audio inputs, achieve competitive results on trained tasks, and exhibit emergent cross-modal abilities.
LLMs and Speech: Integration vs. Combination eess.AS · 2026-03-16 · unverdicted · none · ref 26
Tight integration of acoustic models with LLMs for ASR is ablated against shallow fusion across label units, fine-tuning strategies, LLM sizes, and joint CTC decoding to mitigate hallucinations.

Connecting speech encoder and large language model for ASR

fields

years

verdicts

representative citing papers

citing papers explorer