On The Landscape of Spoken Language Models: A Comprehensive Survey

Siddhant Arora , Kai-Wei Chang , Chung-Ming Chien , Yifan Peng , Haibin Wu , Yossi Adi , Emmanuel Dupoux , Hung-yi Lee

show 2 more authors

Karen Livescu Shinji Watanabe

Authors on Pith no claims yet

classification 💻 cs.CL cs.SDeess.AS

keywords modelslanguagespeechspokenworkfieldprocessingslms

0 comments

read the original abstract

The field of spoken language processing is undergoing a shift from training custom-built, task-specific models toward using and optimizing spoken language models (SLMs) which act as universal speech processing systems. This trend is similar to the progression toward universal language models that has taken place in the field of (text) natural language processing. SLMs include both "pure" language models of speech -- models of the distribution of tokenized speech sequences -- and models that combine speech encoders with text language models, often including both spoken and written input or output. Work in this area is very diverse, with a range of terminology and evaluation settings. This paper aims to contribute an improved understanding of SLMs via a unifying literature survey of recent work in the context of the evolution of the field. Our survey categorizes the work in this area by model architecture, training, and evaluation choices, and describes some key challenges and directions for future work.

This paper has not been read by Pith yet.

discussion (0)

Forward citations

Cited by 5 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

TraceAV-Bench: Benchmarking Multi-Hop Trajectory Reasoning over Long Audio-Visual Videos
cs.CV 2026-05 unverdicted novelty 8.0

TraceAV-Bench is the first benchmark for multi-hop trajectory reasoning over long audio-visual videos, showing top models reach only 51-68% accuracy with substantial room for improvement.
A framework for analyzing concept representations in neural models
cs.CL 2026-05 unverdicted novelty 7.0

A new framework shows concept subspaces are not unique, estimator choice affects containment and disentanglement, LEACE works well but generalizes poorly, and HuBERT encodes phone info as contained and disentangled fr...
Aligning Backchannel and Dialogue Context Representations via Contrastive LLM Fine-Tuning
cs.CL 2026-04 unverdicted novelty 6.0

A contrastive LLM fine-tuning method creates joint embeddings for dialogue contexts and backchannel realizations, improving retrieval performance and alignment with human judgments over raw WavLM features.
ASPIRin: Action Space Projection for Interactivity-Optimized Reinforcement Learning in Full-Duplex Speech Language Models
cs.CL 2026-04 unverdicted novelty 6.0

ASPIRin decouples speaking timing from token content via binary action space projection and applies GRPO with rule-based rewards to optimize interactivity in SLMs without semantic collapse or repetition.
TASU2: Controllable CTC Simulation for Alignment and Low-Resource Adaptation of Speech LLMs
eess.AS 2026-04 unverdicted novelty 6.0

TASU2 adds controllability over uncertainty and error rate to text-derived CTC simulation, enabling better cross-modal alignment and low-resource adaptation for speech LLMs than prior text-only or TTS methods.