Closing the Modality Reasoning Gap for Speech Large Language Models

Chaoren Wang; Heng Lu; Jinyu Li; Shujie Liu; Xueyao Zhang; Yan Lu; Zhizheng Wu

arxiv: 2601.05543 · v2 · submitted 2026-01-09 · 💻 cs.CL · cs.SD · eess.AS

Pith reviewed 2026-05-16 16:32 UTC · model grok-4.3

keywords: speech large language models, modality reasoning gap, reinforcement learning, trajectory alignment, representation alignment, behavior alignment, TARS

The pith

Reinforcement learning with asymmetric rewards aligns speech and text trajectories to close the modality reasoning gap in Speech LLMs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Speech Large Language Models show markedly weaker reasoning on speech inputs than on equivalent text, leaving a persistent modality gap. The authors trace this to representational drift through Transformer layers and inconsistent behavior during extended reasoning chains. They propose TARS, a reinforcement-learning framework that pulls speech-conditioned trajectories toward text-conditioned ones using two complementary dense rewards. Representation alignment matches layer-wise hidden states while behavior alignment enforces semantic consistency between generated outputs and reference text completions. Experiments on MMSU and OBQA demonstrate that the method narrows the gap and reaches state-of-the-art results among 7B-scale Speech LLMs.

Core claim

TARS is a reinforcement-learning framework that aligns text-conditioned and speech-conditioned trajectories through an asymmetric reward design. The framework employs two dense and complementary signals: representation alignment, which measures layer-wise hidden-state similarity between speech- and text-conditioned trajectories, and behavior alignment, which evaluates semantic consistency between generated outputs and reference text completions.
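
The representation alignment signal described above can be illustrated with a minimal sketch. It assumes that each layer's hidden states are mean-pooled over tokens (so speech and text sequences of different lengths can be compared) and that all layers are weighted equally; neither choice is stated in the source, and the function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def mean_pool(token_states):
    """Average a list of per-token vectors into one vector (assumed pooling choice)."""
    n = len(token_states)
    dim = len(token_states[0])
    return [sum(tok[d] for tok in token_states) / n for d in range(dim)]


def representation_alignment(speech_layers, text_layers):
    """Mean over layers of the cosine similarity between pooled hidden states.

    Each argument is a list with one entry per layer; each entry is a list of
    per-token hidden-state vectors. Equal layer weighting is an assumption.
    """
    if len(speech_layers) != len(text_layers):
        raise ValueError("both trajectories must expose the same number of layers")
    sims = [
        cosine(mean_pool(s), mean_pool(t))
        for s, t in zip(speech_layers, text_layers)
    ]
    return sum(sims) / len(sims)
```

A score near 1 would indicate that the speech-conditioned and text-conditioned hidden states point in nearly the same direction at every layer; how the paper turns such a similarity into a dense reward is not specified here.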

What carries the argument

TARS reinforcement-learning framework using asymmetric rewards that combine representation alignment (layer-wise hidden-state similarity) and behavior alignment (semantic consistency to reference text completions).

If this is right

Speech reasoning performance approaches text performance on complex benchmarks.
The 7B-scale models reach state-of-the-art results among Speech LLMs.
Text-only capabilities remain intact after alignment training.
The approach applies directly to existing reasoning benchmarks such as MMSU and OBQA.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Similar trajectory-alignment techniques could address modality gaps in other multimodal models such as vision-language systems.
The method may reduce reliance on expensive paired speech-text supervision if alignment generalizes from limited data.
Extending the framework to larger models or noisy real-world speech could test whether the observed gains scale.
Combining TARS with existing speech-specific pretraining might produce further improvements in end-to-end spoken reasoning.

Load-bearing premise

The modality reasoning gap stems mainly from representational drift across layers and long-chain behavior deviations that asymmetric reward alignment can correct without degrading text performance or creating new instabilities.

What would settle it

After training with TARS, evaluate the model on MMSU or OBQA and observe whether speech-input reasoning scores remain substantially below matched text-input scores.
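
As a rough illustration of this check, the sketch below computes the modality gap as text-input accuracy minus speech-input accuracy on the same questions. The function names and the exact-match scoring are assumptions made for illustration, not the paper's evaluation protocol.

```python
def accuracy(predictions, answers):
    """Fraction of predictions that exactly match the reference answers."""
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        raise ValueError("at least one question is required")
    return sum(p == a for p, a in zip(predictions, answers)) / len(answers)


def modality_gap(text_predictions, speech_predictions, answers):
    """Text-input accuracy minus speech-input accuracy on matched questions.

    A positive value means the model reasons worse from speech than from text.
    """
    return accuracy(text_predictions, answers) - accuracy(speech_predictions, answers)
```

Comparing this quantity before and after training, on the same matched question set, is what would show whether the gap has narrowed.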

Original abstract

Although Speech Large Language Models have achieved notable progress, a substantial modality reasoning gap remains: their reasoning performance on speech inputs is markedly weaker than on text. This gap could be associated with representational drift across Transformer layers and behavior deviations in long-chain reasoning. To address this issue, we introduce TARS, a reinforcement-learning framework that aligns text-conditioned and speech-conditioned trajectories through an asymmetric reward design. The framework employs two dense and complementary signals: representation alignment, which measures layer-wise hidden-state similarity between speech- and text-conditioned trajectories, and behavior alignment, which evaluates semantic consistency between generated outputs and reference text completions. Experiments on challenging reasoning benchmarks, including MMSU and OBQA, show that our approach significantly narrows the modality reasoning gap and achieves state-of-the-art performance among 7B-scale Speech LLMs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

Private letter to a colleague.

TARS gives measurable gains on speech reasoning tasks via asymmetric RL alignment of trajectories, but the experiments do not confirm that the gains come from reduced representational drift rather than ordinary fine-tuning effects.

The core takeaway is that this paper offers a concrete training recipe for speech LLMs that narrows the gap to text-based reasoning on MMSU and OBQA. TARS runs reinforcement learning on paired speech-text trajectories and uses two rewards: layer-wise hidden-state similarity plus semantic consistency with reference text. The asymmetric design is the main technical step that is not already standard in the cited work, and the reported numbers show it lifts 7B-scale models to new levels among speech LLMs while keeping text performance intact. That is useful empirical evidence for anyone trying to make audio inputs first-class citizens in reasoning models. The authors also correctly flag that the gap likely comes from drift across layers and from deviations in long reasoning chains, which is a reasonable diagnosis.

What is missing is direct support for the mechanism. There are no before-and-after measurements of hidden-state similarity or chain deviation, and no ablation that compares TARS against plain RL fine-tuning on the same data. Without those, it is hard to know whether the representation-alignment term is doing the work or whether the gains are just from extra compute and data exposure. The abstract also leaves out reward coefficients, data splits, and significance tests, so the results are harder to reproduce or stress-test than they should be.

This work is aimed at groups building or tuning speech LLMs who need practical alignment tricks. A reader who already works on RL for multimodal models will find the reward construction worth looking at, even if the causal story needs tightening. The paper is coherent on its own terms and shows honest engagement with the problem, so it should go to peer review rather than a desk reject. A referee can ask for the missing ablations and metrics; the current version is solid enough to justify that step.

Referee Report

2 major / 1 minor

Summary. The paper claims that Speech LLMs exhibit a modality reasoning gap due to representational drift across Transformer layers and deviations in long-chain reasoning. It introduces TARS, a reinforcement-learning framework that aligns text- and speech-conditioned trajectories via an asymmetric reward design consisting of layer-wise hidden-state similarity (representation alignment) and semantic consistency with reference text (behavior alignment). Experiments on MMSU and OBQA benchmarks report that TARS narrows the gap and achieves SOTA performance among 7B-scale Speech LLMs.

Significance. If the reported gains are shown to arise specifically from the proposed trajectory alignment rather than generic RL effects, the work would offer a practical method for improving multimodal reasoning while preserving text performance. The empirical results on challenging benchmarks indicate potential utility for 7B-scale models, but the absence of mechanistic validation reduces the strength of the central claim.

major comments (2)

[Experiments] Experiments section: no pre- versus post-training measurements of representational drift (layer-wise hidden-state similarity) or long-chain behavior deviations are reported, so it remains unclear whether the MMSU/OBQA gains result from the asymmetric rewards or from increased compute and data exposure alone.
[Ablation studies] Ablation studies: the manuscript provides no comparison of TARS against standard RL fine-tuning on identical data and base model, which is required to substantiate that the asymmetric reward design (rather than generic RL) drives the SOTA claim among 7B Speech LLMs.

minor comments (1)

[Training details] Training details: reward weighting coefficients, data splits, training dynamics, and statistical significance tests are not fully specified, limiting reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. The comments correctly identify gaps in experimental validation that we will address in revision to better substantiate the contribution of the asymmetric reward design in TARS.

Referee: [Experiments] Experiments section: no pre- versus post-training measurements of representational drift (layer-wise hidden-state similarity) or long-chain behavior deviations are reported, so it remains unclear whether the MMSU/OBQA gains result from the asymmetric rewards or from increased compute and data exposure alone.

Authors: We agree that the current manuscript lacks explicit pre- versus post-training measurements of layer-wise hidden-state similarity and long-chain behavior deviations. In the revised version we will add these analyses: we will report cosine similarities between text- and speech-conditioned hidden states at each Transformer layer, as well as semantic consistency scores on long reasoning chains, both before and after TARS training. These measurements will show the reduction in representational drift and behavior deviation, supporting that the MMSU/OBQA gains are driven by the proposed alignment rather than generic training effects.
revision: yes

Referee: [Ablation studies] Ablation studies: the manuscript provides no comparison of TARS against standard RL fine-tuning on identical data and base model, which is required to substantiate that the asymmetric reward design (rather than generic RL) drives the SOTA claim among 7B Speech LLMs.

Authors: We concur that a direct comparison against standard RL fine-tuning on the identical base model and data is necessary. The revised manuscript will include an ablation in which the same 7B model is fine-tuned with standard RL (task-performance reward only, same data volume and compute budget) and evaluated on MMSU and OBQA. Results will be reported alongside TARS to isolate the contribution of the layer-wise representation alignment and semantic behavior alignment signals.
revision: yes

Circularity Check

0 steps flagged

No circularity: empirical RL rewards defined externally, evaluated on independent benchmarks

The paper introduces TARS as an empirical reinforcement-learning procedure whose rewards are constructed from external signals (layer-wise hidden-state similarity between speech/text trajectories and semantic consistency against reference text completions). These are not derived from the target reasoning metrics (MMSU/OBQA accuracy) and the evaluation uses held-out benchmarks. No equations or derivation chain reduce any claimed prediction to fitted inputs by construction. No self-citation is load-bearing for the core mechanism, and the approach remains falsifiable via ablations or pre/post drift measurements. This matches the default expectation of a non-circular empirical contribution.

Axiom & Free-Parameter Ledger

1 free parameter · 2 axioms · 0 invented entities

The method assumes a standard Transformer architecture and stable RL training, and introduces no new free parameters beyond typical reward coefficients; no new physical or mathematical entities are postulated.

free parameters (1)

reward weighting coefficients
The relative strength between representation alignment and behavior alignment rewards is chosen to balance the two signals.
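
One hypothetical way to write this free parameter is as a weighted sum of the two reward signals. The symbols below ($r_{\text{rep}}$, $r_{\text{beh}}$, $\lambda_{\text{rep}}$, $\lambda_{\text{beh}}$) are illustrative names, not the paper's notation, and the paper's asymmetric design may combine the signals differently.

```latex
r_{\text{total}} \;=\; \lambda_{\text{rep}}\, r_{\text{rep}} \;+\; \lambda_{\text{beh}}\, r_{\text{beh}},
\qquad \lambda_{\text{rep}},\ \lambda_{\text{beh}} \ge 0
```

Under this reading, the ratio $\lambda_{\text{rep}} / \lambda_{\text{beh}}$ is the quantity that must be chosen, which is why the ledger counts it as a single free parameter.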

axioms (2)

domain assumption: Hidden-state similarity at corresponding layers is a valid proxy for representational alignment between modalities.
Invoked when defining the representation alignment reward.
domain assumption: Semantic consistency with reference text completions indicates correct reasoning behavior.
Used to define the behavior alignment reward.

pith-pipeline@v0.9.0 · 5454 in / 1348 out tokens · 25007 ms · 2026-05-16T16:32:36.381004+00:00 · methodology
