FlexiSLM: A Dynamic and Controllable Frame Rate Spoken Language Model

Jiaqi Li, Chaoren Wang, Xiaohai Tian, Mingjie Chen, Xinyu Liang, Xu Li, Yufan Lin, Junwen Qiu, Jun Zhang, Lu Lu, Haizhou Li, Zhizheng Wu

arxiv: 2606.31247 · v1 · pith:HLBQ7NJAnew · submitted 2026-06-30 · 💻 cs.SD · eess.AS

keywords: frame · flexislm · speech · dynamic · rate · language · rates · slms

Spoken language models (SLMs) extend LLMs to speech input and output. Existing SLMs represent speech at fixed frame rates (e.g., 25 or 12.5 Hz), ignoring the time-varying information density of speech and offering no flexibility to trade off quality for speed at inference time. Recent audio tokenizer research has proposed dynamic frame rate speech coding, which exploits this non-uniformity and enables two new capabilities: very low average frame rates and frame rate controllability. However, this technique has not yet been applied to SLMs. We introduce Flexible Spoken Language Model (FlexiSLM), the first SLM that supports dynamic and controllable frame rates on both speech input and output. Using dynamic frame rate representations, FlexiSLM outperforms fixed-frame-rate 7B models including Qwen2.5-Omni and Kimi-Audio at its high-quality operating points. We further verify that FlexiSLM can be accurately steered down to 4.0 Hz; at 6.25 Hz, it roughly halves inference time relative to 12.5 Hz while retaining strong speech-to-speech quality. Audio samples are available at https://flexislm.github.io .
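
The link between frame rate and inference cost can be illustrated with simple arithmetic. The sketch below is not from the paper: it assumes a hypothetical helper `frames_for`, an arbitrary 8-second utterance, and that each rate is treated as an average number of frames per second. It only counts how many speech frames a model would have to process or generate at the rates named in the abstract.

```python
import math


def frames_for(duration_s: float, frame_rate_hz: float) -> int:
    """Frames needed to cover duration_s seconds at an average rate of frame_rate_hz.

    Hypothetical helper for illustration; not part of FlexiSLM.
    """
    return math.ceil(duration_s * frame_rate_hz)


# Frame rates mentioned in the abstract; the utterance length is an assumption.
RATES_HZ = [25.0, 12.5, 6.25, 4.0]
DURATION_S = 8.0

frame_counts = {rate: frames_for(DURATION_S, rate) for rate in RATES_HZ}

for rate, count in frame_counts.items():
    # Sequence length relative to the 12.5 Hz setting.
    ratio = count / frame_counts[12.5]
    print(f"{rate:>5} Hz: {count:>3} frames ({ratio:.2f}x the 12.5 Hz length)")
```

At 6.25 Hz the sequence is half as long as at 12.5 Hz, which is consistent with the reported rough halving of inference time if cost grows approximately in proportion to the number of frames. That proportionality is an assumption of this sketch; the actual speedup depends on the model and the hardware.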
