Thinking in Directivity: Speech Large Language Model for Multi-Talker Directional Speech Recognition

Florian Metze; Jiamin Xie; Ju Lin; Ming Sun; Peng Su; Prashant Rawat; Sangeeta Srivastava; Tyler Vuong; Yiteng Huang; Zhaojiang Lin

arxiv: 2506.14973 · v1 · pith:EUADPMPTnew · submitted 2025-06-17 · 📡 eess.AS · cs.AI

Jiamin Xie, Ju Lin, Yiteng Huang, Tyler Vuong, Zhaojiang Lin, Zhaojun Yang, Peng Su, Prashant Rawat, Sangeeta Srivastava, Ming Sun, Florian Metze

keywords: speech, recognition, audio, directional, ability, cues, directional-speechllama, directivity

Recent studies have demonstrated that prompting large language models (LLMs) with audio encodings enables effective speech recognition. However, the ability of Speech LLMs to comprehend and process multi-channel audio with spatial cues remains relatively unexplored. In this work, we present directional-SpeechLlama, a novel approach that leverages the microphone array of smart glasses to achieve directional speech recognition, source localization, and bystander cross-talk suppression. To enhance the model's ability to understand directivity, we propose two key techniques: serialized directional output training (S-DOT) and contrastive direction data augmentation (CDDA). Experimental results show that directional-SpeechLlama effectively captures the relationship between textual cues and spatial audio, yielding strong performance in both speech recognition and source localization tasks.
