Why Can't They Remember? Uncovering Representation and Retrieval Bottlenecks in Multi-Turn Acoustic Memory

Eun-Jung Holden; Han Yin; Hong Jia; Siyi Wang; Ting Dang; Vidhyasaharan Sethu; Yang Xiao

arxiv: 2605.27039 · v1 · pith:WLTMAQLVnew · submitted 2026-05-26 · 📡 eess.AS · cs.SD

Yang Xiao, Siyi Wang, Han Yin, Hong Jia, Vidhyasaharan Sethu, Eun-Jung Holden, Ting Dang

classification 📡 eess.AS cs.SD

keywords acoustic, attention, memory, multi-turn, representation, retrieval, allocation, LALMs


Large audio language models (LALMs) process both speech and environmental acoustic cues, yet struggle to retain non-speech information across multi-turn interactions. The performance gap between semantic (speech) and acoustic (non-speech) understanding remains poorly understood, and the underlying mechanisms of representation and retrieval are still unclear. This work introduces EnvMem, a controlled multi-turn benchmark designed to study this gap and identify the root causes of failures at the representation level (i.e., latent embeddings) and the retrieval level (i.e., attention allocation). We further conduct post-hoc interventions to probe representational structure and attention dynamics. Our results reveal representational trajectory drift as the key failure mode, while showing that attention allocation plays a limited role in explaining the observed degradation. Overall, we provide a systematic framework for analyzing and improving non-linguistic memory in long-context LALMs, shedding light on future data and training design for robust acoustic memory modeling.
