Closing the Modality Reasoning Gap for Speech Large Language Models

2026 · cs.CL · arXiv 2601.05543

1 Pith paper cites this work. Polarity classification is still indexing.



abstract

Although Speech Large Language Models have achieved notable progress, a substantial modality reasoning gap remains: their reasoning performance on speech inputs is markedly weaker than on text. This gap could be associated with representational drift across Transformer layers and behavior deviations in long-chain reasoning. To address this issue, we introduce TARS, a reinforcement-learning framework that aligns text-conditioned and speech-conditioned trajectories through an asymmetric reward design. The framework employs two dense and complementary signals: representation alignment, which measures layer-wise hidden-state similarity between speech- and text-conditioned trajectories, and behavior alignment, which evaluates semantic consistency between generated outputs and reference text completions. Experiments on challenging reasoning benchmarks, including MMSU and OBQA, show that our approach significantly narrows the modality reasoning gap and achieves state-of-the-art performance among 7B-scale Speech LLMs.
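The representation-alignment signal described in the abstract, layer-wise hidden-state similarity between speech- and text-conditioned trajectories, can be illustrated with a small sketch. This is not the paper's implementation: the abstract does not say how similarity is computed, so the mean pooling over tokens, the cosine similarity, the uniform average across layers, and the names `mean_pool`, `cosine`, and `representation_alignment` are all assumptions made for illustration.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def mean_pool(states: Sequence[Vector]) -> List[float]:
    """Average per-token hidden states into one vector (assumed pooling choice)."""
    if not states:
        raise ValueError("a layer must contain at least one token state")
    n = len(states)
    dim = len(states[0])
    return [sum(tok[d] for tok in states) / n for d in range(dim)]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def representation_alignment(
    speech_layers: Sequence[Sequence[Vector]],
    text_layers: Sequence[Sequence[Vector]],
) -> float:
    """Mean over layers of the cosine similarity between pooled hidden states.

    Each argument is indexed as [layer][token][hidden_dim]. Pooling first lets
    the speech and text sequences have different lengths.
    """
    if len(speech_layers) != len(text_layers):
        raise ValueError("both trajectories must expose the same number of layers")
    if not speech_layers:
        raise ValueError("at least one layer is required")
    sims = [
        cosine(mean_pool(s), mean_pool(t))
        for s, t in zip(speech_layers, text_layers)
    ]
    return sum(sims) / len(sims)
```

In this sketch a score near 1 means the pooled speech and text representations point in the same direction at every layer, while a lower score indicates drift at one or more layers.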

representative citing papers

Rethinking Continual Learning for Speech and Audio: A Representation-Centric Taxonomy and Open Problems

eess.AS · 2026-05-24 · unverdicted · novelty 6.0

Introduces a representation-geometry-based taxonomy for continual learning in speech and audio, identifies mismatches with current CL assumptions in foundation models, and lists open challenges.

citing papers explorer

Showing 1 of 1 citing paper.

Rethinking Continual Learning for Speech and Audio: A Representation-Centric Taxonomy and Open Problems

eess.AS · 2026-05-24 · unverdicted · none · ref 13 · internal anchor

Introduces a representation-geometry-based taxonomy for continual learning in speech and audio, identifies mismatches with current CL assumptions in foundation models, and lists open challenges.
