arxiv: 2602.18196 · v5 · pith:MWMWF35Wnew · submitted 2026-02-20 · 💻 cs.LG

RAT+: Train Dense, Infer Sparse -- Recurrence Augmented Attention for Dilated Inference

Pith reviewed 2026-05-21 12:41 UTC · model grok-4.3

classification 💻 cs.LG
keywords dilated attention · recurrence augmented transformer · sparse inference · efficient transformers · long context modeling · attention mechanisms · kv cache reduction

The pith

A recurrence-augmented attention model pretrained densely can switch to dilated sparse attention at inference after short adaptation while retaining most accuracy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that adding full-sequence recurrence and active recurrence learning during dense pretraining creates a single model flexible enough for different inference-time sparsity patterns. Instead of training separate models from scratch for each dilation factor, RAT+ requires only a brief 1B-token adaptation step to reach near-dense performance at moderate dilation and acceptable drops at high dilation. This approach addresses the accuracy collapse that occurs when naively sparsifying a standard pretrained transformer. If correct, it removes the need to maintain multiple sparse models for varying efficiency targets in long-context applications.

Core claim

RAT+ augments standard attention with full-sequence recurrence and active recurrence learning during dense pretraining on 100B tokens. The resulting 1.5B-parameter model can then be adapted in 1B tokens to dilated attention at dilation D=16 or D=64, optionally combined with local windows or hybrid layer/head compositions. It closely matches dense accuracy at D=16 and loses about 2-3 points at D=64 on commonsense and LongBench tasks. Scaling to 2.6B and 7.6B parameters yields even smaller losses (e.g., about 1 point on average) under a 64x reduction in attention FLOPs and KV cache size.

What carries the argument

Recurrence-augmented attention (RAT+) that inserts full-sequence recurrence and active recurrence learning into the dense pretraining phase to enable later adaptation to arbitrary dilated patterns.

If this is right

  • A single pretrained checkpoint supports multiple inference configurations without retraining from scratch.
  • Attention FLOPs and KV cache size shrink by a factor of the dilation size D while long-range connectivity is preserved.
  • Hybrid compositions of dilated and local-window layers become selectable at inference after the short adaptation step, without retraining from scratch.
  • The same adaptation procedure works across model scales from 1.5B to 7.6B parameters.
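
The scaling in the second bullet can be checked with back-of-envelope arithmetic. The sketch below is illustrative only: the function names and the model shape used in the note after it (layer count, KV heads, head dimension, 2-byte storage) are hypothetical and not taken from the paper. It assumes, following the abstract's claim, that a dilated configuration caches keys and values for roughly one in every D positions.

```python
def kv_cache_bytes(context_len, n_layers, n_kv_heads, head_dim,
                   dilation=1, bytes_per_elem=2):
    """Approximate KV cache size in bytes for one sequence.

    Assumption: with dilation D, only ceil(context_len / D) positions
    are cached per layer (the factor-of-D reduction the abstract claims).
    """
    if dilation < 1:
        raise ValueError("dilation must be >= 1")
    cached_positions = -(-context_len // dilation)  # ceiling division
    # The factor 2 accounts for storing both keys and values.
    return 2 * n_layers * n_kv_heads * head_dim * cached_positions * bytes_per_elem


def reduction_factor(context_len, dilation, **shape):
    """Ratio of dense to dilated cache size for the same model shape."""
    dense = kv_cache_bytes(context_len, dilation=1, **shape)
    sparse = kv_cache_bytes(context_len, dilation=dilation, **shape)
    return dense / sparse
```

For a hypothetical shape such as `n_layers=24, n_kv_heads=16, head_dim=128` at a 16384-token context, `reduction_factor` evaluates to 64 for `dilation=64` and to 16 for `dilation=16`. When the context length is not a multiple of D, the ratio falls below D because of the ceiling.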

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The method may generalize to other structured sparsity patterns beyond fixed dilation if the recurrence provides a sufficiently rich inductive bias.
  • Deployment pipelines could maintain one dense checkpoint and generate on-demand sparse variants for different hardware budgets.
  • Recurrence augmentation might reduce the data needed for fine-tuning other sparse attention variants such as those based on hashing or clustering.

Load-bearing premise

The recurrence signals learned during dense pretraining contain the information needed for short adaptation to succeed across different dilation factors.

What would settle it

A control model trained without the recurrence augmentation but with identical dense pretraining and the same 1B-token adaptation step shows substantially larger accuracy drops than RAT+ when switched to D=16 or D=64.

Figures

Figures reproduced from arXiv: 2602.18196 by Caglar Gulcehre, Xiuying Wei.

Figure 1
Figure 1: (a) For architectural simplicity, we adopt an extreme overlapped setting, i.e., full-sequence recurrence with L = T. (b) Joint training to preserve dense attention capability while enforcing active recurrence learning with desired effective length L∗ = 64. (c) After pretraining, the resulting model can be efficiently adapted to various sparse inference patterns including effective results on dilated atten…
Figure 2
Figure 2: Efficiency results of the temporal-mixing operator on a single GH200 GPU, covering both prefilling and decoding scenarios with hidden dimension H. Prefilling latency is measured on sequences of 262K tokens. Decoding latency is measured for 256 or 128 batches of tokens for the two hidden dimensions, respectively; the baseline runs out of memory beyond 32K tokens. We use FlexAttention (Dong et al., 2024) f…
Figure 4
Figure 4: Maximum decoding throughput (tokens/sec) of the full 1.5B and 7B models for decoding 1024 tokens, measured at context lengths of 4096 and 16384, corresponding to prefilling lengths of 3072 and 15360 tokens, respectively.
Figure 5
Figure 5: Comparison with GQA/MQA using different numbers of KV heads. Joint training is also applied to GQA/MQA (D† = 1, W = 64) to match training FLOPs. RAT+ achieves lower PPL and offers greater flexibility, including single pretraining and the ability to preserve local KV cache size. Comparison with GQA and MQA: We first compare with widely-used pretraining architectures, grouped-query attention (GQA) and multi-…
Figure 6
Figure 6: Scaling-up experiments: we report validation loss on a held-out 0.5B-token subset to illustrate the even smaller loss gap between dense and sparse variants as model size increases. The starred points refer to attention models trained with D† = 1 and W = 64, matched in training FLOPs at the same model scale, and are included as a reference for comparison with dense RAT+.
Figure 8
Figure 8: 1B-token adaptation on two pretrained models. It is evident that various dilated patterns quickly achieve stable loss values within a few hundred million tokens. We employed a simple optimization scheme with no warmup, which may explain the slight loss increase of D = 1 at the beginning, after which it recovers. We also ablate other active recurrence lengths, as shown in …
Figure 9
Figure 9: L2 norm values of recurrence outputs at different time steps. We observe that the outputs at early time steps differ significantly. The first row shows an initialized network using our simple recurrence at layers 0, 6, 18, and 23. The second row corresponds to the same initialized network but with a non-zero initial cell state provided to the recurrence. The third row shows the results of the pretrained ne…
Original abstract

Structured dilated attention has an appealing inference-time efficiency knob: it reduces the FLOPs of attention and the KV cache size by a factor of the dilation size D, while preserving long-range connectivity. While prior work studies it by training each configuration from scratch, directly sparsifying a pretrained attention model into a dilated pattern leads to severe accuracy degradation, preventing flexible reuse across inference scenarios. We introduce RAT+, a dense-pretraining architecture that augments attention with full-sequence recurrence and active recurrence learning. A single RAT+ model is pretrained densely once and can then be flexibly switched at inference time to dilated attention (optionally with local windows) or hybrid layer/head compositions, requiring only a short 1B-token resolution adaptation rather than retraining separate sparse models. At 1.5B parameters trained on 100B tokens, RAT+ closely matches dense accuracy at D = 16, and drops by about 2-3 points at D = 64 on commonsense reasoning and LongBench tasks. We further scale to 2.6B and 7.6B parameters and observe even more promising performance (e.g., a 1-point average accuracy loss with a 64x reduction in attention FLOPs and KV cache size). Code is available at https://github.com/wimh966/rat-plus.
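
To make the dilated pattern concrete, here is a minimal pure-Python sketch of a causal dilated mask with an optional local window. It assumes the common strided definition, in which query position i attends to key positions j ≤ i with (i − j) divisible by D, plus the most recent W positions. The paper's exact pattern may differ, and the function names are hypothetical.

```python
def dilated_causal_mask(seq_len, dilation, window=0):
    """Boolean mask: mask[i][j] is True if query i may attend to key j.

    Assumed pattern (not confirmed by the source): causal, strided by
    `dilation`, optionally unioned with a local window of size `window`.
    """
    mask = []
    for i in range(seq_len):
        row = []
        for j in range(seq_len):
            causal = j <= i
            on_stride = (i - j) % dilation == 0
            in_window = (i - j) < window
            row.append(causal and (on_stride or in_window))
        mask.append(row)
    return mask


def attended_pairs(mask):
    """Number of (query, key) pairs kept, a rough proxy for attention FLOPs."""
    return sum(sum(row) for row in mask)
```

With `seq_len=8`, dense causal attention (`dilation=1`) keeps 36 pairs, while `dilation=4` keeps 12. The ratio approaches D as the sequence grows, which is the FLOPs saving the abstract describes.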

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes RAT+, a dense-pretraining architecture that augments attention with full-sequence recurrence and active recurrence learning. A single RAT+ model is pretrained densely (100B tokens at 1.5B parameters) and can be switched at inference to dilated attention (optionally with local windows) or hybrid layer/head compositions after only a 1B-token resolution adaptation. At 1.5B parameters it closely matches dense accuracy at D=16 and drops about 2-3 points at D=64 on commonsense and LongBench tasks. At 2.6B and 7.6B parameters the reported loss is smaller (e.g., about 1 point on average) under a 64x reduction in attention FLOPs and KV cache size.

Significance. If the central claim holds after addressing ablations, this would be a meaningful advance for flexible sparse inference from one dense model, reducing the cost of per-configuration retraining for dilated patterns. Strengths include scaling results to 7.6B parameters, concrete benchmark numbers, and public code release at https://github.com/wimh966/rat-plus, which supports reproducibility.

major comments (2)
  1. The claim that recurrence augmentation during dense pretraining enables robust adaptation to arbitrary dilations (without per-D retraining) is load-bearing but not isolated. The 1B-token resolution adaptation occurs after switching to the target dilated pattern; an ablation applying identical adaptation to a standard dense baseline (without RAT+) is needed to show that recurrence contributes beyond standard fine-tuning. This directly affects the assertion that one pretrained RAT+ model suffices for flexible reuse across D values.
  2. Abstract and results sections report accuracy numbers for 1.5B-7.6B models but omit details on exact baselines, adaptation procedure, run-to-run variance, or data-overlap controls. This limits verification of the 'closely matches dense accuracy' claim at D=16 and the 2-3 point drop at D=64.
minor comments (2)
  1. Clarify the precise definition and implementation of 'active recurrence learning' in the methods section, including any additional loss terms or hyperparameters.
  2. Add a table or figure summarizing the exact dilation factors, local window sizes, and hybrid compositions tested, with corresponding accuracy and efficiency metrics.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below and agree that the suggested additions will improve the clarity and strength of our claims regarding RAT+.

Point-by-point responses
  1. Referee: The claim that recurrence augmentation during dense pretraining enables robust adaptation to arbitrary dilations (without per-D retraining) is load-bearing but not isolated. The 1B-token resolution adaptation occurs after switching to the target dilated pattern; an ablation applying identical adaptation to a standard dense baseline (without RAT+) is needed to show that recurrence contributes beyond standard fine-tuning. This directly affects the assertion that one pretrained RAT+ model suffices for flexible reuse across D values.

    Authors: We agree that an ablation isolating the effect of recurrence augmentation is valuable for supporting the central claim. We will add an experiment applying the identical 1B-token resolution adaptation procedure to a standard dense Transformer baseline (without RAT+) across the same dilation factors. The comparison results and discussion will be incorporated into the revised manuscript to demonstrate the specific advantage of RAT+ pretraining for flexible reuse. revision: yes

  2. Referee: Abstract and results sections report accuracy numbers for 1.5B-7.6B models but omit details on exact baselines, adaptation procedure, run-to-run variance, or data-overlap controls. This limits verification of the 'closely matches dense accuracy' claim at D=16 and the 2-3 point drop at D=64.

    Authors: We acknowledge the need for greater detail to support verification. In the revision, we will expand the experimental sections to specify the exact baselines, provide a step-by-step description of the adaptation procedure, report run-to-run variance where multiple seeds were used, and clarify data-overlap controls between pretraining and adaptation. These updates will directly bolster the reported accuracy claims at D=16 and D=64. revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical architecture with measured adaptation outcomes

Full rationale

The paper introduces RAT+ as a dense pretraining architecture with recurrence augmentation and reports empirical accuracy after a short resolution adaptation step on standard benchmarks. No equations, derivations, or fitted parameters are presented that reduce the reported performance to inputs by construction. The central claim is an experimental outcome (matching dense accuracy at D=16 after adaptation) rather than an analytical reduction or self-referential definition, and the results are positioned as falsifiable measurements against external tasks. No load-bearing self-citation chains or uniqueness theorems are invoked in the provided text to force the result.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Based on abstract only; no explicit free parameters, axioms, or invented entities are detailed beyond the high-level description of recurrence augmentation and adaptation. Standard transformer training assumptions apply but are not enumerated.
