Phoneme-based interfaces match or surpass projector-based ones for LLM ASR, especially in low-resource languages, and a BPE-phoneme hybrid offers additional improvements.
Connecting speech encoder and large language model for ASR
3 Pith papers cite this work. Polarity classification is still indexing.
verdicts
UNVERDICTED 3representative citing papers
SALMONN integrates speech and audio encoders with a text-based LLM to process general audio inputs, achieve competitive results on trained tasks, and exhibit emergent cross-modal abilities.
Tight integration of acoustic models with LLMs for ASR is ablated against shallow fusion across label units, fine-tuning strategies, LLM sizes, and joint CTC decoding to mitigate hallucinations.
citing papers explorer
-
Phonemes vs. Projectors: An Investigation of Speech-Language Interfaces for LLM-based ASR
Phoneme-based interfaces match or surpass projector-based ones for LLM ASR, especially in low-resource languages, and a BPE-phoneme hybrid offers additional improvements.
-
SALMONN: Towards Generic Hearing Abilities for Large Language Models
SALMONN integrates speech and audio encoders with a text-based LLM to process general audio inputs, achieve competitive results on trained tasks, and exhibit emergent cross-modal abilities.
-
LLMs and Speech: Integration vs. Combination
Tight integration of acoustic models with LLMs for ASR is ablated against shallow fusion across label units, fine-tuning strategies, LLM sizes, and joint CTC decoding to mitigate hallucinations.