Beyond Acoustic Prefixes: Persistent Grounding in Serialized Acoustic Memory for LLM-Based Multi-Talker Speech Recognition

Hao Shi; Tatsuya Kawahara; Xugang Lu; Yuan Gao


Large Language Models (LLMs) are effective decoders for Serialized Output Training (SOT) in two-talker automatic speech recognition (ASR), but their performance degrades substantially in three-talker mixtures. A key limitation is that conventional systems provide acoustic evidence only through an initial projected prefix, requiring the decoder to preserve fine-grained talker information throughout autoregressive generation. We first revisit CTC-derived static prefix conditioning using discrete token, hybrid token-acoustic, and continuous acoustic prompts. Although continuous acoustic cues are more reliable than discrete CTC hypotheses, the improvements on Libri3Mix remain limited, showing that richer prefix content alone does not resolve the conditioning bottleneck. We therefore propose persistent grounding in serialized acoustic memory, which enables the decoder to retrieve talker-structured acoustic evidence throughout the utterance. Specifically, talker-disentangled and onset-ordered acoustic representations are retained as external memory and accessed during decoding through gated residual cross-attention. We further introduce joint low-rank refinement of the acoustic retrieval pathway and selected LLM self-attention projections using LoRA. Experiments on Libri2Mix and Libri3Mix under clean and noisy conditions show consistent improvements over conventional LLM-SOT and naive stacked cross-attention, with particularly large gains in three-talker mixtures. These results demonstrate the importance of persistent access to structured, talker-aware acoustic evidence for LLM-based multi-talker ASR.
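The retrieval mechanism described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. The single-head attention, the tanh form of the gate, the zero gate initialization, and all names (`gated_cross_attention`, `talker_streams`, `onsets`, the weight matrices) are assumptions made for illustration; the abstract states only that talker-disentangled, onset-ordered representations are kept as external memory and accessed through gated residual cross-attention.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def gated_cross_attention(h, memory, Wq, Wk, Wv, Wo, gate):
    """Single-head gated residual cross-attention (illustrative only).

    h:      (T, d_model) decoder hidden states
    memory: (S, d_mem)   serialized acoustic memory
    gate:   scalar; tanh(gate) scales the retrieved evidence (assumed form)
    """
    q = h @ Wq                                  # (T, d_model)
    k = memory @ Wk                             # (S, d_model)
    v = memory @ Wv                             # (S, d_model)
    scores = q @ k.T / np.sqrt(q.shape[-1])     # (T, S)
    attn = softmax(scores, axis=-1)             # each row sums to 1
    retrieved = (attn @ v) @ Wo                 # (T, d_model)
    # Residual connection: with gate = 0 the layer is an identity map.
    return h + np.tanh(gate) * retrieved, attn

rng = np.random.default_rng(0)
d_model, d_mem = 8, 6

# Hypothetical per-talker acoustic streams and their onset times (seconds).
talker_streams = [rng.normal(size=(5, d_mem)),
                  rng.normal(size=(4, d_mem)),
                  rng.normal(size=(3, d_mem))]
onsets = [1.2, 0.0, 2.5]

# Serialize the memory: concatenate the streams in order of onset.
order = np.argsort(onsets)
memory = np.concatenate([talker_streams[i] for i in order], axis=0)

h = rng.normal(size=(7, d_model))
Wq = rng.normal(size=(d_model, d_model))
Wk = rng.normal(size=(d_mem, d_model))
Wv = rng.normal(size=(d_mem, d_model))
Wo = rng.normal(size=(d_model, d_model))

out_closed, attn = gated_cross_attention(h, memory, Wq, Wk, Wv, Wo, gate=0.0)
out_open, _ = gated_cross_attention(h, memory, Wq, Wk, Wv, Wo, gate=1.0)
```

With the gate at zero the decoder states pass through unchanged, so a layer of this kind could be added to a pretrained LLM without disturbing it at initialization; as the gate opens, every decoding step can draw on the full talker-ordered memory rather than only on an initial prefix.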
