Less Is More: Fast and Accurate Reasoning with Cross-Head Unified Sparse Attention
Pith reviewed 2026-05-18 23:36 UTC · model grok-4.3
The pith
Token importance in reasoning is global across heads and stable over steps, so unified sparse selection keeps accuracy while cutting attended tokens sharply.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
LessIsMore enforces cross-head unified token selection and preserves recent context via a stable recency window, yielding a globally consistent token set that can be reused across layers. Across multiple model families and challenging reasoning benchmarks, LessIsMore matches or improves accuracy while attending to substantially fewer tokens. With kernel-level optimizations, LessIsMore achieves up to 1.6× end-to-end decoding speedup and up to 1.72× faster sparse attention computation.
What carries the argument
Cross-head unified token selection paired with a fixed recency window that creates one globally consistent, reusable token set for attention across layers and decoding steps.
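A minimal sketch of what a cross-head unified selection with a recency window could look like, assuming per-head importance scores are available as plain lists. The function name `unified_select`, the parameters `budget` and `recency`, and the aggregation rule (summing scores over heads) are illustrative assumptions, not the paper's kernel or its actual selection rule.

```python
from typing import List


def unified_select(scores: List[List[float]], budget: int, recency: int) -> List[int]:
    """Return one sorted token index set shared by all heads.

    scores[h][t] is head h's importance score for token t (hypothetical input).
    The last `recency` tokens are always kept; the rest of the budget is filled
    with the older tokens that have the highest aggregate score.
    """
    if not scores:
        return []
    n_tokens = len(scores[0])
    budget = min(budget, n_tokens)
    recency = min(recency, budget)

    # Always keep the most recent tokens (the recency window).
    recent = list(range(n_tokens - recency, n_tokens))

    # Aggregate importance across heads (assumption: a plain sum).
    older = range(n_tokens - recency)
    total = {t: sum(head[t] for head in scores) for t in older}

    # Fill the remaining budget with the top aggregate-score older tokens.
    k = budget - recency
    top = sorted(total, key=lambda t: (-total[t], t))[:k]

    return sorted(top + recent)


# Toy scores for two heads over six tokens (made-up values).
scores = [
    [0.9, 0.1, 0.0, 0.3, 0.2, 0.1],  # head 0
    [0.8, 0.0, 0.1, 0.4, 0.1, 0.2],  # head 1
]
selected = unified_select(scores, budget=4, recency=2)
print(selected)  # [0, 3, 4, 5]
```

Because every head receives the same index list, the result could in principle be computed once and reused, which is the property the summary above attributes to the method.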
If this is right
- Matches or improves accuracy on challenging reasoning benchmarks across several model families.
- Delivers up to 1.6× end-to-end decoding speedup after kernel optimizations.
- Provides up to 1.72× faster sparse attention computation.
- Extends to long-context tasks beyond the primary reasoning evaluations.
- Lowers memory footprint during extended generation without any retraining.
Where Pith is reading between the lines
- Reasoning traces appear to depend on a small persistent core of facts rather than re-examining every prior token at each new step.
- The same stability pattern may appear in other long-form generation settings such as multi-step code synthesis or extended dialogue.
- The technique could be stacked with quantization or KV-cache compression for larger combined speedups.
- Selected tokens could be inspected to test whether they align with explicit reasoning premises or intermediate results.
Load-bearing premise
The tokens that matter for reasoning accuracy are largely the same across attention heads and stay stable as decoding proceeds, so one shared selection works everywhere.
What would settle it
Measure whether accuracy on a long-horizon benchmark such as GSM8K or MATH drops when token selection is performed independently per head or when the recency window is removed, compared with the unified cross-head version.
Original abstract
Large reasoning models achieve strong performance through test-time scaling, but this incurs substantial computational overhead due to long decoding from short prompts. While sparse attention can reduce latency and memory usage, existing methods often degrade reasoning accuracy because selection errors accumulate over long generation horizons, or require costly retraining. We introduce LessIsMore, a training-free sparse attention mechanism for long-horizon reasoning. Our key insight is that token importance in reasoning is global and stable: critical tokens are largely shared across attention heads and remain stable over decoding steps. Guided by this structure, LessIsMore enforces cross-head unified token selection and preserves recent context via a stable recency window, yielding a globally consistent token set that can be reused across layers. Across multiple model families and challenging reasoning benchmarks, LessIsMore matches or improves accuracy while attending to substantially fewer tokens. With kernel-level optimizations, LessIsMore achieves up to 1.6× end-to-end decoding speedup and up to 1.72× faster sparse attention computation, with additional long-context results demonstrating the generality of our approach. Code is available at https://github.com/DerrickYLJ/LessIsMore.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces LessIsMore, a training-free sparse attention mechanism for long-horizon reasoning in LLMs. It posits that token importance is global across attention heads and stable across decoding steps, enabling a unified sparse mask with a recency window that can be reused across layers. The method is evaluated on multiple model families and reasoning benchmarks, claiming to match or exceed baseline accuracy while attending to fewer tokens and delivering up to 1.6× end-to-end decoding speedup and 1.72× faster sparse attention, with code released publicly.
Significance. If the central empirical claims hold under fuller controls, this work would provide a simple, training-free route to accelerate test-time reasoning without retraining or accuracy degradation, addressing a practical bottleneck in long-chain inference. The public code release and cross-model results strengthen the contribution for the efficient-inference community.
major comments (2)
- [§3 (Method) and abstract] The manuscript states the guiding insight that 'token importance in reasoning is global and stable' (abstract and §3) but provides no direct quantification—such as average Jaccard index or top-k overlap statistics between per-head importance scores at fixed steps, or token-consistency metrics over successive decoding steps. Without these measurements, it remains possible that accuracy parity is driven primarily by the recency-window heuristic rather than cross-head unification, weakening the justification for the unified-selection design.
- [§4 (Experiments)] Accuracy results are reported without error bars, standard deviations, or details on the number of runs; exact per-benchmark token budgets and sparsity ratios are also not tabulated. These omissions make it difficult to evaluate the robustness of the 'matches or improves accuracy' claim and the precise efficiency-accuracy trade-off.
minor comments (2)
- [§3.3] The description of the recency-window size (one of the few free parameters) would benefit from an explicit sensitivity analysis or default-value justification in §3.3.
- [Figures] Figure 2 (or equivalent attention-mask visualization) could include a side-by-side comparison of per-head vs. unified masks to illustrate the claimed overlap.
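The overlap statistics requested in the first major comment can be sketched as follows. This is an illustration under stated assumptions: the helper names (`top_k`, `jaccard`, `mean_pairwise_jaccard`, `step_consistency`) and the toy scores are hypothetical, and the paper may measure overlap differently.

```python
from itertools import combinations
from typing import List, Set


def top_k(scores: List[float], k: int) -> Set[int]:
    """Indices of the k highest-scoring tokens (ties broken by lower index)."""
    order = sorted(range(len(scores)), key=lambda t: (-scores[t], t))
    return set(order[:k])


def jaccard(a: Set[int], b: Set[int]) -> float:
    """Jaccard index |a & b| / |a | b|; defined as 1.0 for two empty sets."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def mean_pairwise_jaccard(sets: List[Set[int]]) -> float:
    """Average Jaccard index over all unordered pairs of heads at one step."""
    pairs = list(combinations(sets, 2))
    if not pairs:
        return 1.0
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


def step_consistency(selections: List[Set[int]]) -> float:
    """Average Jaccard index between selections at consecutive decoding steps."""
    pairs = list(zip(selections, selections[1:]))
    if not pairs:
        return 1.0
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


# Toy per-head scores at a single decoding step (made-up values).
head_scores = [
    [0.9, 0.1, 0.0, 0.3],
    [0.8, 0.0, 0.1, 0.4],
    [0.7, 0.5, 0.1, 0.0],
]
head_sets = [top_k(s, k=2) for s in head_scores]
overlap = mean_pairwise_jaccard(head_sets)

# Reusing the same sets as if they were one head's selections over three steps.
consistency = step_consistency(head_sets)
print(round(overlap, 3), round(consistency, 3))
```

A high cross-head value alongside a high step-to-step value would support the "global and stable" premise; low values would suggest the recency window is doing most of the work.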
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and detailed comments. We address each major point below with clarifications from the manuscript and indicate revisions where they strengthen the presentation without altering the core claims.
Point-by-point responses
- Referee: [§3 (Method) and abstract] The manuscript states the guiding insight that 'token importance in reasoning is global and stable' (abstract and §3) but provides no direct quantification, such as average Jaccard index or top-k overlap statistics between per-head importance scores at fixed steps, or token-consistency metrics over successive decoding steps. Without these measurements, it remains possible that accuracy parity is driven primarily by the recency-window heuristic rather than cross-head unification, weakening the justification for the unified-selection design.
  Authors: We appreciate the referee's emphasis on strengthening the empirical grounding of the core insight. The manuscript already includes ablation studies (§4.3) comparing unified cross-head selection against independent per-head selection, which show that unification contributes measurably to accuracy retention beyond the recency window alone. Nevertheless, we agree that explicit overlap metrics would provide clearer support. In the revision we will add, in §3 and the experiments section, average Jaccard indices across heads at fixed decoding steps and token-consistency statistics over successive steps. These additions will directly address whether the observed accuracy parity relies primarily on the recency heuristic or on the unified selection mechanism.
  Revision: yes
- Referee: [§4 (Experiments)] Accuracy results are reported without error bars, standard deviations, or details on the number of runs; exact per-benchmark token budgets and sparsity ratios are also not tabulated. These omissions make it difficult to evaluate the robustness of the 'matches or improves accuracy' claim and the precise efficiency-accuracy trade-off.
  Authors: We acknowledge that the current reporting omits variability measures and precise budget details. The experiments were performed with a small number of runs per setting, but these statistics and the exact per-benchmark token counts were not tabulated. In the revised manuscript we will (i) report standard deviations and error bars for all accuracy figures, (ii) state the number of runs used for each benchmark, and (iii) add a supplementary table listing exact token budgets, sparsity ratios, and attended-token counts for every model-benchmark pair. These changes will make the efficiency-accuracy trade-offs fully transparent.
  Revision: yes
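The reporting items promised in the second response (mean accuracy with standard deviation over runs, plus a sparsity ratio per setting) amount to simple arithmetic. The sketch below assumes sparsity is defined as the fraction of tokens not attended to; the function names and all numbers are placeholders, not results from the paper.

```python
from statistics import mean, stdev
from typing import List, Tuple


def summarize_runs(accuracies: List[float]) -> Tuple[float, float]:
    """Mean and sample standard deviation of accuracy over repeated runs."""
    m = mean(accuracies)
    # The sample standard deviation needs at least two runs.
    s = stdev(accuracies) if len(accuracies) > 1 else 0.0
    return m, s


def sparsity_ratio(attended_tokens: int, total_tokens: int) -> float:
    """Fraction of tokens skipped (assumed definition: 1 - attended / total)."""
    return 1.0 - attended_tokens / total_tokens


# Placeholder values for illustration only.
run_accuracies = [0.70, 0.72, 0.74]
acc_mean, acc_std = summarize_runs(run_accuracies)
ratio = sparsity_ratio(attended_tokens=2048, total_tokens=32768)
print(f"accuracy = {acc_mean:.3f} +/- {acc_std:.3f}, sparsity = {ratio:.4f}")
```

Stating the number of runs next to each mean matters here, since a standard deviation from two or three runs is only a rough indicator of variability.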
Circularity Check
No circularity: empirical design choice justified by benchmarks, not by construction
full rationale
The manuscript introduces a training-free sparse attention method whose central design decisions (cross-head unified selection and recency window) are motivated by an empirical observation about token stability rather than any mathematical derivation, fitted parameter, or self-citation chain. No equations are presented that define a quantity in terms of itself, no predictions are made from subsets of the same data, and no uniqueness theorems or ansatzes are imported from prior author work. The accuracy and speedup claims rest on direct experimental comparisons across model families and benchmarks, making the derivation self-contained against external results.
Axiom & Free-Parameter Ledger
free parameters (1)
- recency window size
axioms (1)
- domain assumption: Token importance in reasoning is global and stable across attention heads and decoding steps.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · reality_from_one_distinction · unclear
  Relation between the paper passage and the cited Recognition theorem: unclear.
  "Our token-level analysis across the reasoning process reveals two key observations on attention localities... substantial overlap in token-importance rankings across heads... recency locality pattern across decoding steps"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.