Less Is More: Fast and Accurate Reasoning with Cross-Head Unified Sparse Attention

Arti Jain; Baihong Yuan; Lijie Yang; Ravi Netravali; Shijie Cao; Yiwei Chen; Zhihao Jia; Zhihao Zhang

arxiv: 2508.07101 · v2 · submitted 2025-08-09 · 💻 cs.CL · cs.AI

Lijie Yang, Zhihao Zhang, Arti Jain, Shijie Cao, Baihong Yuan, Yiwei Chen, Zhihao Jia, Ravi Netravali

Pith reviewed 2026-05-18 23:36 UTC · model grok-4.3

keywords: sparse attention · reasoning models · long-horizon decoding · training-free optimization · attention mechanisms · test-time compute

The pith

Token importance in reasoning is global across heads and stable over steps, so unified sparse selection keeps accuracy while cutting attended tokens sharply.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents LessIsMore, a training-free sparse attention technique for large reasoning models that generate long outputs from short prompts. It starts from the observation that the tokens that matter most for correct reasoning tend to be the same ones across different attention heads and change little from one decoding step to the next. By selecting those tokens once in a single unified pass and always retaining a small window of the most recent tokens, the method produces one consistent set of tokens that can be reused across all layers. This reduces the total tokens processed at each step without harming accuracy on hard reasoning benchmarks and produces clear end-to-end speed gains once the attention kernel is optimized.

Core claim

LessIsMore enforces cross-head unified token selection and preserves recent context via a stable recency window, yielding a globally consistent token set that can be reused across layers. Across multiple model families and challenging reasoning benchmarks, LessIsMore matches or improves accuracy while attending to substantially fewer tokens. With kernel-level optimizations, LessIsMore achieves up to 1.6× end-to-end decoding speedup and up to 1.72× faster sparse attention computation.

What carries the argument

Cross-head unified token selection paired with a fixed recency window that creates one globally consistent, reusable token set for attention across layers and decoding steps.
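
The selection step described above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `unified_token_selection`, the `budget` and `recency_window` parameters, and the choice to aggregate importance by summing per-head scores are all hypothetical. The paper's kernels and its exact aggregation rule may differ.

```python
from typing import List


def unified_token_selection(
    head_scores: List[List[float]],  # [num_heads][num_tokens] importance scores
    budget: int,                     # total tokens to keep (hypothetical parameter)
    recency_window: int,             # most recent tokens that are always kept
) -> List[int]:
    """Return one sorted list of token indices shared by every head."""
    num_tokens = len(head_scores[0])
    budget = min(budget, num_tokens)
    window = min(recency_window, budget)

    # Always retain the most recent `window` tokens.
    recent = set(range(num_tokens - window, num_tokens))

    # Assumption: aggregate importance by summing scores over heads.
    totals = [sum(head[t] for head in head_scores) for t in range(num_tokens)]

    # Fill the remaining budget with the highest-scoring older tokens.
    older = [t for t in range(num_tokens) if t not in recent]
    older.sort(key=lambda t: totals[t], reverse=True)
    selected = recent | set(older[: budget - window])
    return sorted(selected)


# Toy scores for two heads over six cached tokens (illustrative values only).
scores = [
    [0.9, 0.1, 0.8, 0.1, 0.2, 0.3],  # head 0
    [0.7, 0.2, 0.9, 0.1, 0.1, 0.4],  # head 1
]
selected = unified_token_selection(scores, budget=4, recency_window=2)
print(selected)
```

Because every head receives the same index list, the set can be computed once and reused, which is the property the review credits for the method's efficiency.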

If this is right

Matches or improves accuracy on challenging reasoning benchmarks across several model families.
Delivers up to 1.6× end-to-end decoding speedup after kernel optimizations.
Provides up to 1.72× faster sparse attention computation.
Extends to long-context tasks beyond the primary reasoning evaluations.
Lowers memory footprint during extended generation without any retraining.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Reasoning traces appear to depend on a small persistent core of facts rather than re-examining every prior token at each new step.
The same stability pattern may appear in other long-form generation settings such as multi-step code synthesis or extended dialogue.
The technique could be stacked with quantization or KV-cache compression for larger combined speedups.
Selected tokens could be inspected to test whether they align with explicit reasoning premises or intermediate results.

Load-bearing premise

The tokens that matter for reasoning accuracy are largely the same across attention heads and stay stable as decoding proceeds, so one shared selection works everywhere.

What would settle it

Measure whether accuracy on a long-horizon benchmark such as GSM8K or MATH drops when token selection is performed independently per head or when the recency window is removed, compared with the unified cross-head version.

Original abstract

Large reasoning models achieve strong performance through test-time scaling, but this incurs substantial computational overhead due to long decoding from short prompts. While sparse attention can reduce latency and memory usage, existing methods often degrade reasoning accuracy because selection errors accumulate over long generation horizons, or require costly retraining. We introduce LessIsMore, a training-free sparse attention mechanism for long-horizon reasoning. Our key insight is that token importance in reasoning is global and stable: critical tokens are largely shared across attention heads and remain stable over decoding steps. Guided by this structure, LessIsMore enforces cross-head unified token selection and preserves recent context via a stable recency window, yielding a globally consistent token set that can be reused across layers. Across multiple model families and challenging reasoning benchmarks, LessIsMore matches or improves accuracy while attending to substantially fewer tokens. With kernel-level optimizations, LessIsMore achieves up to 1.6× end-to-end decoding speedup and up to 1.72× faster sparse attention computation, with additional long-context results demonstrating the generality of our approach. Code is available at https://github.com/DerrickYLJ/LessIsMore.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

LessIsMore delivers a simple training-free sparse attention method with reported accuracy parity and speedups on reasoning tasks, but the core claim of cross-head token stability lacks direct measurements.

The paper's main contribution is a practical, training-free way to sparsify attention during long reasoning chains. It unifies token selection across heads, on the premise that important tokens are shared and stable, then adds a recency window to keep recent context. The result is a single mask reused across layers, with claims of matching or better accuracy on benchmarks while attending to far fewer tokens and reaching up to 1.6× end-to-end speedup after kernel-level optimizations. The code is public, which helps verification.

Referee Report

2 major / 2 minor

Summary. The paper introduces LessIsMore, a training-free sparse attention mechanism for long-horizon reasoning in LLMs. It posits that token importance is global across attention heads and stable across decoding steps, enabling a unified sparse mask with a recency window that can be reused across layers. The method is evaluated on multiple model families and reasoning benchmarks, claiming to match or exceed baseline accuracy while attending to fewer tokens and delivering up to 1.6× end-to-end decoding speedup and 1.72× faster sparse attention, with code released publicly.

Significance. If the central empirical claims hold under fuller controls, this work would provide a simple, training-free route to accelerate test-time reasoning without retraining or accuracy degradation, addressing a practical bottleneck in long-chain inference. The public code release and cross-model results strengthen the contribution for the efficient-inference community.

major comments (2)

[§3 (Method) and abstract] The manuscript states the guiding insight that 'token importance in reasoning is global and stable' (abstract and §3) but provides no direct quantification—such as average Jaccard index or top-k overlap statistics between per-head importance scores at fixed steps, or token-consistency metrics over successive decoding steps. Without these measurements, it remains possible that accuracy parity is driven primarily by the recency-window heuristic rather than cross-head unification, weakening the justification for the unified-selection design.
[§4 (Experiments)] Accuracy results are reported without error bars, standard deviations, or details on the number of runs; exact per-benchmark token budgets and sparsity ratios are also not tabulated. These omissions make it difficult to evaluate the robustness of the 'matches or improves accuracy' claim and the precise efficiency-accuracy trade-off.
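
The overlap statistics requested in the first major comment could be computed along the following lines. This is a hedged sketch, not a measurement from the paper: the helper names (`top_k`, `jaccard`, `mean_pairwise_jaccard`, `mean_consecutive_jaccard`) and all toy scores are assumptions introduced for illustration. Cross-head agreement is taken as the mean Jaccard index over all pairs of per-head top-k sets, and step-to-step stability as the mean Jaccard index between consecutive selections.

```python
from itertools import combinations
from typing import List, Set


def top_k(scores: List[float], k: int) -> Set[int]:
    """Indices of the k highest-scoring tokens for one head."""
    order = sorted(range(len(scores)), key=lambda t: scores[t], reverse=True)
    return set(order[:k])


def jaccard(a: Set[int], b: Set[int]) -> float:
    """Jaccard index |a ∩ b| / |a ∪ b|; two empty sets count as identical."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def mean_pairwise_jaccard(sets: List[Set[int]]) -> float:
    """Average Jaccard index over all unordered pairs (needs >= 2 sets)."""
    pairs = list(combinations(sets, 2))
    if not pairs:
        raise ValueError("need at least two sets")
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


def mean_consecutive_jaccard(sets: List[Set[int]]) -> float:
    """Average Jaccard index between each selection and the next one."""
    if len(sets) < 2:
        raise ValueError("need at least two steps")
    return sum(jaccard(a, b) for a, b in zip(sets, sets[1:])) / (len(sets) - 1)


# Toy per-head scores at one decoding step (illustrative values only).
head_scores = [
    [0.9, 0.1, 0.8, 0.1],
    [0.7, 0.2, 0.9, 0.1],
    [0.1, 0.8, 0.9, 0.2],
]
per_head = [top_k(s, k=2) for s in head_scores]
cross_head_overlap = mean_pairwise_jaccard(per_head)

# Toy selections at three successive decoding steps.
step_sets = [{0, 2}, {0, 2}, {0, 3}]
step_stability = mean_consecutive_jaccard(step_sets)
print(cross_head_overlap, step_stability)
```

Values near 1 on both statistics would support the premise of global, stable importance; low values would suggest the recency window is doing most of the work.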

minor comments (2)

[§3.3] The description of the recency-window size (one of the few free parameters) would benefit from an explicit sensitivity analysis or default-value justification in §3.3.
[Figures] Figure 2 (or equivalent attention-mask visualization) could include a side-by-side comparison of per-head vs. unified masks to illustrate the claimed overlap.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and detailed comments. We address each major point below with clarifications from the manuscript and indicate revisions where they strengthen the presentation without altering the core claims.

Referee: [§3 (Method) and abstract] The manuscript states the guiding insight that 'token importance in reasoning is global and stable' (abstract and §3) but provides no direct quantification—such as average Jaccard index or top-k overlap statistics between per-head importance scores at fixed steps, or token-consistency metrics over successive decoding steps. Without these measurements, it remains possible that accuracy parity is driven primarily by the recency-window heuristic rather than cross-head unification, weakening the justification for the unified-selection design.

Authors: We appreciate the referee's emphasis on strengthening the empirical grounding of the core insight. The manuscript already includes ablation studies (§4.3) comparing unified cross-head selection against independent per-head selection, which show that unification contributes measurably to accuracy retention beyond the recency window alone. Nevertheless, we agree that explicit overlap metrics would provide clearer support. In the revision we will add, in §3 and the experiments section, average Jaccard indices across heads at fixed decoding steps and token-consistency statistics over successive steps. These additions will directly address whether the observed accuracy parity relies primarily on the recency heuristic or on the unified selection mechanism.
revision: yes

Referee: [§4 (Experiments)] Accuracy results are reported without error bars, standard deviations, or details on the number of runs; exact per-benchmark token budgets and sparsity ratios are also not tabulated. These omissions make it difficult to evaluate the robustness of the 'matches or improves accuracy' claim and the precise efficiency-accuracy trade-off.

Authors: We acknowledge that the current reporting omits variability measures and precise budget details. The experiments were performed with a small number of runs per setting, but these statistics and the exact per-benchmark token counts were not tabulated. In the revised manuscript we will (i) report standard deviations and error bars for all accuracy figures, (ii) state the number of runs used for each benchmark, and (iii) add a supplementary table listing exact token budgets, sparsity ratios, and attended-token counts for every model-benchmark pair. These changes will make the efficiency-accuracy trade-offs fully transparent.
revision: yes
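
The reporting promised in this response amounts to two small computations per model-benchmark pair. The sketch below assumes a definition of sparsity ratio as the fraction of cached tokens not attended to at a decoding step; the function names and every number are placeholders, not results from the paper.

```python
from statistics import mean, stdev
from typing import List, Tuple


def summarize_accuracy(run_accuracies: List[float]) -> Tuple[float, float]:
    """Mean and sample standard deviation over repeated runs.

    `stdev` needs at least two runs; a single run cannot yield an error bar.
    """
    return mean(run_accuracies), stdev(run_accuracies)


def sparsity_ratio(attended_tokens: int, total_tokens: int) -> float:
    """Assumed definition: fraction of cached tokens skipped at one step."""
    return 1.0 - attended_tokens / total_tokens


# Placeholder values for illustration only, not results from the paper.
acc_mean, acc_std = summarize_accuracy([0.70, 0.72, 0.74])
ratio = sparsity_ratio(attended_tokens=2048, total_tokens=16384)
print(f"accuracy {acc_mean:.3f} ± {acc_std:.3f}, sparsity {ratio:.3f}")
```

Reporting the run count next to each mean and deviation lets a reader judge whether a "matches or improves" gap is larger than run-to-run noise.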

Circularity Check

0 steps flagged

No circularity: empirical design choice justified by benchmarks, not by construction

The manuscript introduces a training-free sparse attention method whose central design decisions (cross-head unified selection and recency window) are motivated by an empirical observation about token stability rather than any mathematical derivation, fitted parameter, or self-citation chain. No equations are presented that define a quantity in terms of itself, no predictions are made from subsets of the same data, and no uniqueness theorems or ansatzes are imported from prior author work. The accuracy and speedup claims rest on direct experimental comparisons across model families and benchmarks, making the derivation self-contained against external results.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the domain assumption of stable cross-head token importance plus a small number of implementation choices such as window size; no new physical entities or heavily fitted constants are introduced.

free parameters (1)

recency window size
The length of the stable recency window is a design choice that must be set for each model or task.

axioms (1)

domain assumption: Token importance in reasoning is global and stable across attention heads and decoding steps.
This premise directly motivates the cross-head unified selection and recency window.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · reality_from_one_distinction · tag: unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

Paper passage: "Our token-level analysis across the reasoning process reveals two key observations on attention localities... substantial overlap in token-importance rankings across heads... recency locality pattern across decoding steps"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.