pith. machine review for the scientific record.

arxiv: 2604.06695 · v1 · submitted 2026-04-08 · 💻 cs.AI

Recognition: no theorem link

Reasoning Fails Where Step Flow Breaks

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 18:40 UTC · model grok-4.3

classification 💻 cs.AI
keywords Step-Saliency · StepFlow · large reasoning models · attention analysis · information flow · chain of thought · test-time intervention · reasoning performance

The pith

Repairing attention flow between reasoning steps improves accuracy in large models without retraining.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Large reasoning models generate long chains of thought but remain unstable on multi-step math, science, and coding problems. The paper introduces Step-Saliency to build step-to-step attention maps along the full question-thinking-summary path. These maps expose two recurring failures: shallow layers lock onto the current step and ignore earlier context, while deep layers gradually lose focus on the thinking segment. To address these patterns, the authors propose StepFlow, a test-time method that adjusts shallow saliency via an Odds-Equal Bridge and adds step-level momentum in deep layers. The intervention raises accuracy across several models without any retraining, suggesting that targeted information-flow repairs can recover part of the reasoning performance that is otherwise lost.

Core claim

Step-Saliency reveals Shallow Lock-in, where shallow layers over-focus on the current step with little use of prior context, and Deep Decay, where deep layers lose saliency on the thinking segment as the summary attends more to itself and recent steps. StepFlow, using Odds-Equal Bridge and Step Momentum Injection, adjusts these patterns to improve accuracy on math, science, and coding tasks across multiple LRMs without retraining.
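The abstract pins down Step Momentum Injection only as "a small step-level residual in deep layers." A minimal sketch of one plausible reading, in Python: the update rule, the hyperparameter values, and every name below are illustrative assumptions, not the authors' implementation.

```python
import torch

def step_momentum_injection(hidden, step_means, beta=0.9, alpha=0.1):
    """Illustrative step-level residual for a deep layer.

    hidden:     (seq_len, d) hidden states of the current step's tokens
    step_means: list of (d,) mean hidden states of earlier reasoning steps
    beta:       momentum decay across steps (assumed, not from the paper)
    alpha:      residual scale (assumed, not from the paper)
    """
    momentum = torch.zeros_like(hidden[0])
    for m in step_means:  # oldest -> newest, so recent steps weigh more
        momentum = beta * momentum + (1 - beta) * m
    # Inject a small summary of earlier steps so deep layers keep
    # attending to the thinking segment rather than decaying to self-focus.
    return hidden + alpha * momentum  # broadcast (d,) over (seq_len, d)
```

On this reading, the design intuition is that an exponentially weighted summary of earlier steps counteracts Deep Decay by keeping the thinking segment present in the residual stream.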

What carries the argument

Step-Saliency, which pools attention-gradient scores into step-to-step maps along the question-thinking-summary trajectory, is used to diagnose the flow failures that StepFlow then corrects at test time.
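A minimal sketch of the pooling idea, assuming token-level attention weights and their gradients are available and step boundaries are known. The head-averaged attention-times-gradient score is a common saliency recipe; the paper's exact aggregation may differ, and all names here are illustrative.

```python
import torch

def step_saliency(attn, grad_attn, step_spans):
    """Pool token-level attention-gradient scores into a step-to-step map.

    attn, grad_attn: (heads, seq, seq) attention weights and their gradients
    step_spans: list of (start, end) token ranges, one per segment along
                the question -> thinking -> summary trajectory.
    """
    # Elementwise attention x gradient, magnitude, averaged over heads.
    token_saliency = (attn * grad_attn).abs().mean(dim=0)  # (seq, seq)

    n = len(step_spans)
    step_map = torch.zeros(n, n)
    for i, (qs, qe) in enumerate(step_spans):      # destination step
        for j, (ks, ke) in enumerate(step_spans):  # source step
            step_map[i, j] = token_saliency[qs:qe, ks:ke].mean()
    return step_map  # step_map[i, j]: influence of step j on step i
```

On such a map, Shallow Lock-in would show up as mass concentrated on the diagonal in shallow layers, and Deep Decay as the summary row losing mass over the thinking columns in deep layers.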

If this is right

  • StepFlow raises accuracy on math, science, and coding tasks across multiple LRMs without retraining.
  • Repairing information flow recovers part of the missing reasoning performance in long-chain models.
  • The patterns of Shallow Lock-in and Deep Decay recur consistently across several models.
  • Test-time saliency adjustments can mitigate reasoning errors while leaving model weights unchanged.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Attention-flow problems may appear in other sequential tasks that require maintaining context over many steps.
  • Targeting information flow during inference offers a route to improve models without full retraining.
  • Similar step-level analysis could diagnose failures in standard language models on complex planning problems.

Load-bearing premise

The observed attention patterns are causal drivers of reasoning errors rather than mere correlates, and the saliency adjustments directly fix the root cause of performance drops.

What would settle it

Finding that StepFlow, run on a held-out set of multi-step problems, changes the measured attention patterns but produces no accuracy gain; or finding models that lack Shallow Lock-in and Deep Decay yet still fail on the same tasks.

Figures

Figures reproduced from arXiv: 2604.06695 by Minghao Su, Xiaofeng Zhang, Xiaosong Yuan, Xiaoyu Xu, Yuanhao Su, Yulan Pan, Zhihong Shen.

Figure 1
Figure 1. [caption not recovered from source] view at source ↗
Figure 2
Figure 2. Step-Saliency patterns for shallow vs. deep layers and correct vs. error traces. Top: depth-collapsed step→step saliency maps; darker red indicates stronger influence between steps. Bottom: schematic diagrams summarizing the observed patterns: red arrows denote narrow, local flow (Shallow Lock-in and Deep Decay in error traces), while blue arrows denote broad, long-range flow in correct traces. Xu et al., 2… view at source ↗
Figure 3
Figure 3. Layer-wise saliency intensities across three models. (a) GPT-OSS-20B: thinking and summary self-intensities for correct vs. error traces. (b) R1-Distill-32B and QwQ-32B show the same pattern: stronger shallow lock-in and summary self-reinforcement in error traces. The intensities are defined as $I_T^{(\ell)} = \frac{1}{K}\sum_{i=1}^{K} M^{(\ell)}_{i\leftarrow i}$ and $I_S^{(\ell)} = M^{(\ell)}_{(K+1)\leftarrow(K+1)}$ (Eq. 4), where $I_T^{(\ell)}$ measures how much each step reuses its own content and $I_S^{(\ell)}$ how much the summary attends to itself (a code sketch of these quantities follows the figure list). view at source ↗
Figure 4
Figure 4. Effect of StepFlow on Step-Saliency (rep…). view at source ↗
Figure 6
Figure 6. Accuracy vs. τ_max on AIME24 (competition-style math) and GPQA-Diamond (graduate-level science). Dashed lines show baseline accuracy; stars mark the chosen τ⋆. Spilled body text notes that for each model/benchmark pair the authors record: (i) the smallest scale α_low that improves over the no-intervention baseline, (ii) the chosen scale α⋆, and (iii) the largest scale α_high that does not reduce accuracy by more than 1 point com… view at source ↗
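A minimal sketch of the intensities in Eq. 4 from the Figure 3 caption, assuming a (K+1) × (K+1) step→step saliency map per layer (thinking steps first, summary last), for instance one produced by the step_saliency sketch above; the function name is illustrative.

```python
def layer_intensities(step_map):
    """Self-intensities from a (K+1) x (K+1) step->step saliency map.

    Rows/cols 0..K-1 are thinking steps; row/col K is the summary
    (0-indexed version of Eq. 4 in the Figure 3 caption).
    """
    K = step_map.shape[0] - 1
    # I_T: mean self-saliency over thinking steps (how much each step
    # reuses its own content); I_S: summary self-saliency.
    I_T = step_map[:K, :K].diagonal().mean()
    I_S = step_map[K, K]
    return I_T, I_S
```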
read the original abstract

Large reasoning models (LRMs) that generate long chains of thought now perform well on multi-step math, science, and coding tasks. However, their behavior is still unstable and hard to interpret, and existing analysis tools struggle with such long, structured reasoning traces. We introduce Step-Saliency, which pools attention-gradient scores into step-to-step maps along the question-thinking-summary trajectory. Across several models, Step-Saliency reveals two recurring information-flow failures: Shallow Lock-in, where shallow layers over-focus on the current step and barely use earlier context, and Deep Decay, where deep layers gradually lose saliency on the thinking segment and the summary increasingly attends to itself and the last few steps. Motivated by these patterns, we propose StepFlow, a saliency-inspired test-time intervention that adjusts shallow saliency patterns measured by Step-Saliency via Odds-Equal Bridge and adds a small step-level residual in deep layers via Step Momentum Injection. StepFlow improves accuracy on math, science, and coding tasks across multiple LRMs without retraining, indicating that repairing information flow can recover part of their missing reasoning performance.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces Step-Saliency, which aggregates attention-gradient scores into step-to-step maps along question-thinking-summary trajectories in large reasoning models. It identifies two recurring information-flow issues: Shallow Lock-in (shallow layers over-focus on the current step) and Deep Decay (deep layers lose saliency on thinking segments). Motivated by these, it proposes StepFlow, a test-time intervention using Odds-Equal Bridge in shallow layers and Step Momentum Injection in deep layers, and reports accuracy gains on math, science, and coding tasks across LRMs without retraining.

Significance. If the performance gains are robust and the intervention can be directly linked to repair of the diagnosed patterns, the work offers a practical test-time method to improve LRM reasoning and a diagnostic tool for long structured traces. The empirical focus on information flow without retraining is a strength, though the significance hinges on establishing causality beyond correlation.

major comments (3)
  1. [Experimental results] No pre- and post-intervention Step-Saliency maps or quantitative saliency comparisons are shown to confirm that Odds-Equal Bridge and Step Momentum Injection actually alter the Shallow Lock-in and Deep Decay patterns.
  2. [Ablation studies] Ablation studies: the manuscript lacks controls that isolate the specific saliency adjustments from generic attention perturbations or added residuals, leaving open whether the gains arise from the motivated mechanisms or unrelated effects.
  3. [Introduction and Discussion] Causality argument: the claim that the observed patterns are causal drivers of errors (and that StepFlow recovers missing performance by repairing them) rests on performance improvements alone; without direct evidence tying the intervention to pattern repair, the causal inference cannot carry the weight placed on it.
minor comments (2)
  1. [Abstract] Ensure all reported accuracy improvements include full baselines, multiple runs with error bars, and statistical tests; the abstract currently omits these details.
  2. [Method] Clarify the precise pooling formula for Step-Saliency (attention-gradient aggregation along the trajectory) with an equation or pseudocode for reproducibility.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback, which identifies key gaps in linking our diagnostic observations to the proposed intervention. We address each major comment below and will revise the manuscript to incorporate the requested evidence.

read point-by-point responses
  1. Referee: No pre- and post-intervention Step-Saliency maps or quantitative saliency comparisons are shown to confirm that Odds-Equal Bridge and Step Momentum Injection actually alter the Shallow Lock-in and Deep Decay patterns.

    Authors: We agree that the manuscript currently lacks pre- and post-intervention Step-Saliency maps and quantitative comparisons. In the revised version we will include these visualizations and metrics across the evaluated tasks to directly show how the interventions modify the identified patterns. revision: yes

  2. Referee: Ablation studies: the manuscript lacks controls that isolate the specific saliency adjustments from generic attention perturbations or added residuals, leaving open whether the gains arise from the motivated mechanisms or unrelated effects.

    Authors: The referee correctly notes the absence of isolating controls. We will add new ablation experiments that compare StepFlow against generic attention perturbations and random residual injections to demonstrate that the accuracy gains are attributable to the specific saliency adjustments rather than nonspecific effects. revision: yes

  3. Referee: Causality argument: the claim that the observed patterns are causal drivers of errors (and that StepFlow recovers missing performance by repairing them) rests on performance improvements alone; without direct evidence tying the intervention to pattern repair, the causal inference cannot carry the weight placed on it.

    Authors: We acknowledge that performance gains alone do not establish causality. By adding the requested saliency maps and ablations, we will revise the introduction and discussion to present a more cautious framing: the patterns motivate StepFlow, and the empirical results are consistent with repair of those patterns, while explicitly noting the correlational nature of the current evidence. revision: yes

Circularity Check

0 steps flagged

No circularity: the empirical diagnosis and the test-time intervention are evaluated against external benchmarks, not against quantities they themselves define

full rationale

The paper introduces Step-Saliency as a diagnostic tool to identify attention patterns (Shallow Lock-in and Deep Decay) across models, then proposes StepFlow as a motivated test-time patch (Odds-Equal Bridge plus Step Momentum Injection) whose accuracy gains are measured on standard math/science/coding benchmarks. No equations, fitted parameters, or derivations are shown that reduce the reported improvements to quantities defined by the same intervention or by self-referential definitions. The central claim rests on external task performance rather than any self-citation chain or ansatz smuggled from prior work by the same authors. This is a standard empirical pipeline with no load-bearing reduction to its own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review yields no explicit free parameters, axioms, or invented entities. The interventions are described at a conceptual level, without stated tunable constants or background assumptions beyond standard attention mechanics.

pith-pipeline@v0.9.0 · 5510 in / 1127 out tokens · 46682 ms · 2026-05-10T18:40:21.672609+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

5 extracted references · 5 canonical work pages · 2 internal anchors

  1. [1]

    LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code

    Naman Jain, King Han, Alex Gu, Wen-Ding Li, Fanjia Yan, Tianjun Zhang, Sida Wang, Armando Solar-Lezama, Koushik Sen, and Ion Stoica. 2024.

  2. [2]

    From understanding to utilization: A survey on explainability for large language models

    Haoyan Luo and Lucia Specia. 2024. arXiv preprint arXiv:2401.12874.

  3. [3]

    Large language models encode semantics in low-dimensional linear subspaces

    arXiv preprint arXiv:2507.09709.

  4. [4]

    Towards Large Reasoning Models: A Survey of Reinforced Reasoning with Large Language Models

    arXiv preprint arXiv:2501.09686, 2025.

  5. [5]

    gcd(324, 432) = 108 (internal anchor: worked example from the appendix case analysis)

    The trace evaluates an expression of the form 324 cos θ − 432 sin θ; factoring gcd(324, 432) = 108 gives 108(3 cos θ − 4 sin θ), with maximum 108 × √(9 + 16) = 108 × 5 = 540. The model correctly derives +36 sin θ but drops it when aggregating, yielding −468 sin θ instead of −432 sin θ; OEB keeps the earlier derivation visible, preventing lock-in on the dominant term.