RAT+: Train Dense, Infer Sparse -- Recurrence Augmented Attention for Dilated Inference
Pith reviewed 2026-05-21 12:41 UTC · model grok-4.3
The pith
A recurrence-augmented attention model pretrained densely can switch to dilated sparse attention at inference after short adaptation while retaining most accuracy.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
RAT+ augments standard attention with full-sequence recurrence and active recurrence learning during dense pretraining on 100B tokens. The resulting 1.5B-parameter model can then be adapted with 1B tokens to dilated attention at dilation D=16 or D=64, optionally combined with local windows or hybrid layer/head compositions. It closely matches dense accuracy at D=16 and loses about 2-3 points at D=64 on commonsense reasoning and LongBench tasks. At 2.6B and 7.6B parameters the reported loss is smaller still, for example about 1 point of average accuracy under a 64x reduction in attention FLOPs and KV cache size.
What carries the argument
Recurrence-augmented attention (RAT+) that inserts full-sequence recurrence and active recurrence learning into the dense pretraining phase to enable later adaptation to arbitrary dilated patterns.
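As a rough illustration of the recurrence component, the gated update quoted further down this page, v~_l = g_l ⊙ v~_{l-1} + (1 - g_l) ⊙ v_l, can be sketched in plain Python. This is a minimal sketch under assumptions that are not stated on this page: l is taken to index token positions, the initial state is zero, and the gates g_l are passed in as fixed numbers in [0, 1] rather than computed from the input. The name `gated_recurrence` is hypothetical, and the paper's actual implementation, including how active recurrence learning is applied, may differ.

```python
from typing import List

Vector = List[float]


def gated_recurrence(values: List[Vector], gates: List[Vector]) -> List[Vector]:
    """Apply v~_l = g_l * v~_{l-1} + (1 - g_l) * v_l elementwise along a sequence.

    Assumptions (not from the paper): the initial state is zero and the
    gates are given as inputs instead of being produced by a learned layer.
    """
    if len(values) != len(gates):
        raise ValueError("values and gates must have the same length")
    if not values:
        return []
    state = [0.0] * len(values[0])  # assumed initial state
    outputs = []
    for v, g in zip(values, gates):
        # Elementwise convex combination of the previous state and the new value.
        state = [gi * si + (1.0 - gi) * vi for gi, si, vi in zip(g, state, v)]
        outputs.append(state)
    return outputs


out = gated_recurrence(
    values=[[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]],
    gates=[[0.0, 0.0], [0.5, 1.0], [0.5, 0.0]],
)
print(out)  # [[1.0, 2.0], [2.0, 2.0], [3.5, 6.0]]
```

A gate of 0 copies the current value, a gate of 1 carries the previous state forward unchanged, and intermediate gates blend the two.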
If this is right
- A single pretrained checkpoint supports multiple inference configurations without retraining from scratch.
- Attention FLOPs and KV cache size shrink by a factor of the dilation size D while long-range connectivity is preserved.
- Hybrid compositions of dilated and local-window layers or heads become selectable at inference after only the short adaptation, without training a separate sparse model.
- The same adaptation procedure works across model scales from 1.5B to 7.6B parameters.
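The efficiency bullet above can be checked with simple counting. The sketch below is illustrative only: it assumes a dilated pattern in which each query attends causally to keys at positions that are multiples of D, which this page does not specify, and it counts (query, key) pairs as a stand-in for attention FLOPs. All function names are hypothetical.

```python
def dense_causal_pairs(seq_len: int) -> int:
    """Number of (query, key) pairs under dense causal attention."""
    return seq_len * (seq_len + 1) // 2


def dilated_causal_pairs(seq_len: int, dilation: int) -> int:
    """Pairs when query i sees only keys j <= i with j % dilation == 0.

    The choice of kept positions is an assumption made for this sketch.
    """
    return sum(i // dilation + 1 for i in range(seq_len))


def kv_cache_entries(seq_len: int, dilation: int) -> int:
    """Cached keys/values when only multiples of `dilation` are stored."""
    return (seq_len + dilation - 1) // dilation  # ceil(seq_len / dilation)


T, D = 1024, 16
pair_ratio = dense_causal_pairs(T) / dilated_causal_pairs(T, D)
cache_ratio = T / kv_cache_entries(T, D)
print(dense_causal_pairs(T), dilated_causal_pairs(T, D))  # 524800 33280
print(round(pair_ratio, 2), cache_ratio)                  # 15.77 16.0
```

Under this assumed pattern the pair count drops by slightly less than D at finite length and approaches a factor of D as the sequence grows, while the cache shrinks by exactly D whenever D divides the sequence length.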
Where Pith is reading between the lines
- The method may generalize to other structured sparsity patterns beyond fixed dilation if the recurrence provides a sufficiently rich inductive bias.
- Deployment pipelines could maintain one dense checkpoint and generate on-demand sparse variants for different hardware budgets.
- Recurrence augmentation might reduce the data needed for fine-tuning other sparse attention variants such as those based on hashing or clustering.
Load-bearing premise
The recurrence signals learned during dense pretraining contain the information needed for short adaptation to succeed across different dilation factors.
What would settle it
A control model trained without the recurrence augmentation but with identical dense pretraining and the same 1B-token adaptation step shows substantially larger accuracy drops than RAT+ when switched to D=16 or D=64.
Original abstract
Structured dilated attention has an appealing inference-time efficiency knob: it reduces the FLOPs of attention and the KV cache size by a factor of the dilation size D, while preserving long-range connectivity. While prior work studies it by training each configuration from scratch, directly sparsifying a pretrained attention model into a dilated pattern leads to severe accuracy degradation, preventing flexible reuse across inference scenarios. We introduce RAT+, a dense-pretraining architecture that augments attention with full-sequence recurrence and active recurrence learning. A single RAT+ model is pretrained densely once and can then be flexibly switched at inference time to dilated attention (optionally with local windows) or hybrid layer/head compositions, requiring only a short 1B-token resolution adaptation rather than retraining separate sparse models. At 1.5B parameters trained on 100B tokens, RAT+ closely matches dense accuracy at D = 16, and drops by about 2-3 points at D = 64 on commonsense reasoning and LongBench tasks. We further scale to 2.6B and 7.6B parameters and observe even more promising performance (e.g., a 1-point average accuracy loss with a 64x reduction in attention FLOPs and KV cache size). Code is available at https://github.com/wimh966/rat-plus.
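To make the "dilated attention (optionally with local windows)" pattern from the abstract concrete, here is a minimal mask sketch. The exact pattern used by RAT+ is not given on this page, so the code assumes one plausible variant: query i attends causally to keys at multiples of the dilation D, plus the most recent `window` keys. The names `dilated_causal_mask` and `render` are hypothetical.

```python
from typing import List


def dilated_causal_mask(seq_len: int, dilation: int, window: int = 0) -> List[List[bool]]:
    """mask[i][j] is True when query i may attend to key j.

    Assumed pattern (not from the paper): causal keys at multiples of
    `dilation`, plus the `window` most recent keys, counting i itself.
    """
    if dilation < 1 or window < 0:
        raise ValueError("dilation must be >= 1 and window must be >= 0")
    mask = []
    for i in range(seq_len):
        row = []
        for j in range(seq_len):
            causal = j <= i               # never look at future keys
            strided = j % dilation == 0   # long-range, every D-th key
            local = i - j < window        # short-range local window
            row.append(causal and (strided or local))
        mask.append(row)
    return mask


def render(mask: List[List[bool]]) -> List[str]:
    """Draw each mask row as a string, 'x' for allowed and '.' for blocked."""
    return ["".join("x" if allowed else "." for allowed in row) for row in mask]


rows = render(dilated_causal_mask(seq_len=8, dilation=4, window=2))
for row in rows:
    print(row)
# x.......
# xx......
# xxx.....
# x.xx....
# x..xx...
# x...xx..
# x...xxx.
# x...x.xx
```

Setting `window=0` leaves only the strided long-range keys, which is the pure dilated case.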
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes RAT+, a dense-pretraining architecture that augments attention with full-sequence recurrence and active recurrence learning. A single RAT+ model pretrained densely on 100B tokens can be switched at inference to dilated attention (optionally with local windows) or hybrid compositions after only a 1B-token resolution adaptation. At 1.5B parameters it closely matches dense accuracy at D=16 and drops about 2-3 points at D=64 on commonsense reasoning and LongBench tasks. At 2.6B and 7.6B parameters the paper reports smaller losses, for example about 1 point of average accuracy with a 64x reduction in attention FLOPs and KV cache size.
Significance. If the central claim holds after addressing ablations, this would be a meaningful advance for flexible sparse inference from one dense model, reducing the cost of per-configuration retraining for dilated patterns. Strengths include scaling results to 7.6B parameters, concrete benchmark numbers, and public code release at https://github.com/wimh966/rat-plus, which supports reproducibility.
major comments (2)
- The claim that recurrence augmentation during dense pretraining enables robust adaptation to arbitrary dilations (without per-D retraining) is load-bearing but not isolated. The 1B-token resolution adaptation occurs after switching to the target dilated pattern; an ablation applying identical adaptation to a standard dense baseline (without RAT+) is needed to show that recurrence contributes beyond standard fine-tuning. This directly affects the assertion that one pretrained RAT+ model suffices for flexible reuse across D values.
- Abstract and results sections report accuracy numbers for 1.5B-7.6B models but omit details on exact baselines, adaptation procedure, run-to-run variance, or data-overlap controls. This limits verification of the 'closely matches dense accuracy' claim at D=16 and the 2-3 point drop at D=64.
minor comments (2)
- Clarify the precise definition and implementation of 'active recurrence learning' in the methods section, including any additional loss terms or hyperparameters.
- Add a table or figure summarizing the exact dilation factors, local window sizes, and hybrid compositions tested, with corresponding accuracy and efficiency metrics.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address each major comment below and agree that the suggested additions will improve the clarity and strength of our claims regarding RAT+.
- Referee: The claim that recurrence augmentation during dense pretraining enables robust adaptation to arbitrary dilations (without per-D retraining) is load-bearing but not isolated. The 1B-token resolution adaptation occurs after switching to the target dilated pattern; an ablation applying identical adaptation to a standard dense baseline (without RAT+) is needed to show that recurrence contributes beyond standard fine-tuning. This directly affects the assertion that one pretrained RAT+ model suffices for flexible reuse across D values.
  Authors: We agree that an ablation isolating the effect of recurrence augmentation is valuable for supporting the central claim. We will add an experiment applying the identical 1B-token resolution adaptation procedure to a standard dense Transformer baseline (without RAT+) across the same dilation factors. The comparison results and discussion will be incorporated into the revised manuscript to demonstrate the specific advantage of RAT+ pretraining for flexible reuse.
  Revision: yes
- Referee: Abstract and results sections report accuracy numbers for 1.5B-7.6B models but omit details on exact baselines, adaptation procedure, run-to-run variance, or data-overlap controls. This limits verification of the 'closely matches dense accuracy' claim at D=16 and the 2-3 point drop at D=64.
  Authors: We acknowledge the need for greater detail to support verification. In the revision, we will expand the experimental sections to specify the exact baselines, provide a step-by-step description of the adaptation procedure, report run-to-run variance where multiple seeds were used, and clarify data-overlap controls between pretraining and adaptation. These updates will directly bolster the reported accuracy claims at D=16 and D=64.
  Revision: yes
Circularity Check
No significant circularity; empirical architecture with measured adaptation outcomes
The paper introduces RAT+ as a dense pretraining architecture with recurrence augmentation and reports empirical accuracy after a short resolution adaptation step on standard benchmarks. No equations, derivations, or fitted parameters are presented that reduce the reported performance to inputs by construction. The central claim is an experimental outcome (matching dense accuracy at D=16 after adaptation) rather than an analytical reduction or self-referential definition, and the results are positioned as falsifiable measurements against external tasks. No load-bearing self-citation chains or uniqueness theorems are invoked in the provided text to force the result.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean: washburn_uniqueness_aczel
  Relation between the paper passage and the cited Recognition theorem: unclear
  Paper passage: "We introduce RAT+, a dense-pretraining architecture that augments attention with full-sequence recurrence and active recurrence learning... $\tilde{v}_l = g_l \odot \tilde{v}_{l-1} + (1 - g_l) \odot v_l$"
- IndisputableMonolith/Foundation/ArithmeticFromLogic.lean: LogicNat
  Relation between the paper passage and the cited Recognition theorem: unclear
  Paper passage: "full-sequence recurrence ($L = T$) and active recurrence learning... $L^{*} = 64$"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.