arxiv: 2605.16579 · v2 · pith:SY6GQXC2new · submitted 2026-05-15 · 💻 cs.CV · cs.LG

Attend Locally, Remember Linearly: Linear Attention as Cross-Frame Memory for Autoregressive Video Diffusion

Pith reviewed 2026-05-22 09:06 UTC · model grok-4.3

classification 💻 cs.CV cs.LG
keywords autoregressive video diffusion · linear attention · hybrid attention · recurrent memory · temporal consistency · video generation · efficient attention

The pith

Hybrid attention with recurrent memory enables linear scaling for autoregressive video diffusion.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces ARL2, a hybrid attention module for autoregressive video diffusion that addresses the quadratic compute and growing key-value cache of softmax self-attention. It splits attention into an intra-frame softmax branch that handles spatial detail and local dependencies, and an inter-frame gated recurrent linear branch that uses a fixed-size state to carry long-range context across frames. This design lets the model scale linearly in time with constant memory instead of relying on a growing key-value cache. Two update rules protect the state: it is updated only after the denoised pass, so noisy intermediate states do not corrupt memory, and all tokens in a frame read the same pre-update state, which avoids within-frame information asymmetry. With 75 percent of layers replaced by the hybrid module, the paper reports up to 2.26 times wall-clock speedup and 54 percent memory reduction, with comparable quality and improved temporal consistency.

Core claim

The paper claims that self-attention can be decomposed into an intra-frame softmax branch for spatial detail and local dependencies and an inter-frame gated recurrent linear branch that maintains a fixed-size state for streaming context, thereby replacing quadratic cross-frame attention with linear-time constant-memory operation while improving temporal consistency.

What carries the argument

The ARL2 hybrid attention module consisting of an intra-frame softmax branch and an inter-frame gated recurrent linear branch that maintains a fixed-size recurrent state for cross-frame memory.
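A minimal NumPy sketch of the two-branch idea may help fix the shapes involved. It is an illustration under stated assumptions, not the paper's implementation: the ELU+1 feature map, the scalar gate, the plain sum of the two branch outputs, and the function names `hybrid_attention` and `update_state` are all hypothetical choices, and normalization is left out.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def feature_map(x):
    # ELU(x) + 1: a common positive feature map for linear attention (assumption)
    return np.where(x > 0, x + 1.0, np.exp(np.minimum(x, 0.0)))

def hybrid_attention(q, k, v, state):
    """q, k, v: (tokens, d) for the current frame; state: (d, d) cross-frame memory."""
    d = q.shape[-1]
    # Intra-frame branch: bidirectional softmax over the current frame's tokens
    o_intra = softmax(q @ k.T / np.sqrt(d)) @ v
    # Inter-frame branch: every token reads the same pre-update state
    o_inter = feature_map(q) @ state
    # Summing the branches is an assumption of this sketch
    return o_intra + o_inter

def update_state(state, k, v, gate):
    # Gated recurrent update: decay old memory, then write this frame's keys and values
    return gate * state + feature_map(k).T @ v
```

The state stays `(d, d)` no matter how many frames have been written, which is what keeps memory constant in this sketch.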

If this is right

  • Models achieve linear-time scaling and constant memory usage for longer video sequences.
  • Up to 2.26 times wall-clock speedup and 54 percent memory reduction when 75 percent of layers use the hybrid module.
  • Comparable generation quality with improved temporal consistency over full softmax models.
  • Enables conversion of pretrained AR video diffusion models to hybrid linear attention via two-stage training.
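The scaling claims in the list above can be made concrete with a back-of-the-envelope element count. This is a rough sketch: the model dimensions are hypothetical inputs, the functions count only cached attention tensors (not weights or activations), and the assumption that the remaining softmax layers keep a full KV cache is ours.

```python
def kv_cache_elems(frames, tokens_per_frame, layers, heads, head_dim):
    # Keys and values for every cached token: grows linearly with frame count
    return 2 * frames * tokens_per_frame * layers * heads * head_dim

def recurrent_state_elems(layers, heads, head_dim):
    # One (head_dim x head_dim) state per head per layer: independent of frame count
    return layers * heads * head_dim * head_dim

def hybrid_elems(frames, tokens_per_frame, layers, heads, head_dim,
                 replaced_fraction=0.75):
    # Replaced layers hold a fixed-size state; the rest keep a KV cache (assumption)
    n_hybrid = int(layers * replaced_fraction)
    n_softmax = layers - n_hybrid
    return (kv_cache_elems(frames, tokens_per_frame, n_softmax, heads, head_dim)
            + recurrent_state_elems(n_hybrid, heads, head_dim))
```

Doubling `frames` doubles `kv_cache_elems` but leaves `recurrent_state_elems` unchanged, so the gap between the two designs widens as videos get longer.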

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The design could support real-time streaming video generation on devices with limited memory.
  • Similar hybrid approaches might apply to other autoregressive generation tasks involving long sequences.
  • Further analysis could test how the fixed-size state performs as video lengths increase beyond current experiments.

Load-bearing premise

A gated recurrent linear branch with a fixed-size state can preserve enough long-range temporal context across frames, without losing information that full cross-frame softmax attention would retain.

What would settle it

Run the hybrid model and the full-softmax baseline on videos with progressively more frames, and check whether the hybrid model's quality and temporal consistency metrics stay comparable to or better than the baseline's as length increases.
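One way to summarize such a length sweep is to fit a line to the hybrid-minus-baseline gap as a function of video length. The helper below is a hypothetical sketch: it assumes a higher-is-better metric measured at the same lengths for both models, and the name `gap_trend` is ours.

```python
def gap_trend(lengths, baseline_scores, hybrid_scores):
    """Return per-length gaps (hybrid - baseline) and their least-squares slope."""
    gaps = [h - b for h, b in zip(hybrid_scores, baseline_scores)]
    n = len(lengths)
    mean_x = sum(lengths) / n
    mean_y = sum(gaps) / n
    sxx = sum((x - mean_x) ** 2 for x in lengths)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(lengths, gaps))
    # Negative slope: the hybrid model falls further behind as length grows
    return gaps, sxy / sxx
```

A slope near zero (or positive) across increasing lengths would support the premise; a clearly negative slope would suggest the fixed-size state is saturating.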

Figures

Figures reproduced from arXiv: 2605.16579 by Kunyang Li, Mubarak Shah, Yuzhang Shang.

Figure 1: (a) AR video diffusion relies on softmax self-attention with a growing KV cache, incurring …
Figure 2: ARL2 attention module (normalization omitted for clarity). Given tokens X_N of frame N, the projections Q_N, K_N, V_N are routed to two branches. The intra-frame branch (top) applies bidirectional softmax attention over tokens within the current frame, producing O_intra. The inter-frame branch (bottom) applies recurrent linear attention, where a fixed-size state S maintains and updates long-range memory across frames…
Figure 3: (a) Softmax attention scales quadratically with video length, while the hybrid layer scales …
Figure 4: (a) Per-dimension ARR averaged across layers. Four Hybrid-Recoverable (HR) dimensions …
Figure 5: (a) Attention decomposition on a representative …
Figure 6: Qualitative comparison. Highlighting strong visual quality and temporal consistency under …
Figure 7: Qualitative comparison across all evaluated models. Six uniformly sampled frames from …
Original abstract

Autoregressive (AR) video diffusion is a powerful paradigm for streaming and interactive video generation. However, its reliance on softmax self-attention leads to quadratic compute complexity in sequence length and memory usage due to key-value caching, which limits its scalability to long video horizons. Existing remedies (e.g., sparse attention and KV-cache compression) reduce per-step cost but still rely on a linearly growing cache or irreversibly discard past context, and thus fail to address linear memory growth and streaming context management. To address this scalability bottleneck, we propose ARL2 (Attend Locally, Remember Linearly), a hybrid attention module that replaces quadratic cross-frame attention with a fixed-size recurrent state. We decompose self-attention into two branches: an intra-frame softmax branch for spatial detail and local dependencies, and an inter-frame gated recurrent linear branch that maintains a fixed-size state for streaming context. Our key insight is that softmax attention captures fine-grained local interactions, while a recurrent state provides controllable long-range memory. This design achieves linear-time scaling with constant memory while improving temporal consistency over the full-softmax model. To prevent noisy intermediate states from corrupting memory, we update the recurrent state only after the denoised pass. To avoid within-frame information asymmetry, all tokens share the same pre-update state rather than sequential updates. To the best of our knowledge, this is the first work to convert a pretrained AR video diffusion model into a hybrid linear attention architecture, through an efficient two-stage training scheme for AR video. With 75% of layers replaced by hybrid linear attention, the model achieves up to 2.26× wall-clock speedup and 54% memory reduction, while maintaining comparable quality with improved temporal consistency.
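The two update rules in the abstract (commit the state only after the denoised pass, and let all tokens read the same pre-update state) can be sketched as a streaming loop. This is one reading of those rules, not the authors' code: `toy_denoise_step` is a placeholder denoiser, the frame is used as both key and value, and the scalar `gate` and all names are hypothetical.

```python
import numpy as np

def toy_denoise_step(x, state):
    # Placeholder denoiser: reads the shared pre-update state, never writes it
    return 0.5 * x + 0.5 * np.tanh(x @ state)

def generate_stream(num_frames, denoise_steps, tokens, dim, gate=0.9, seed=0):
    rng = np.random.default_rng(seed)
    state = np.zeros((dim, dim))  # fixed-size cross-frame memory
    commits = 0
    frames = []
    for _ in range(num_frames):
        x = rng.normal(size=(tokens, dim))  # each frame starts from noise
        for _ in range(denoise_steps):
            # Noisy intermediate passes see the same, unchanged state
            x = toy_denoise_step(x, state)
        # Commit once per frame, from the denoised result (key = value = x here)
        state = gate * state + (x.T @ x) / tokens
        commits += 1
        frames.append(x)
    return frames, state, commits
```

The state is written `num_frames` times rather than `num_frames * denoise_steps` times, so intermediate noisy passes never reach the memory.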

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 3 minor

Summary. The manuscript presents ARL2, a hybrid attention architecture for autoregressive video diffusion models. It decomposes attention into an intra-frame softmax branch for local spatial dependencies and an inter-frame gated recurrent linear branch that uses a fixed-size state to maintain cross-frame memory. This allows replacing 75% of attention layers in a pretrained model using a two-stage training process, resulting in linear scaling, constant memory, up to 2.26x speedup, 54% memory savings, and improved temporal consistency.

Significance. If the empirical results are robust, this approach could substantially advance the field by enabling efficient long-video generation in AR diffusion models without the memory overhead of KV caches. The insight that recurrent states can handle long-range temporal context while softmax handles local details is valuable, and the efficient fine-tuning strategy adds practical utility. The work provides reproducible design choices that could be adopted in future models.

major comments (2)
  1. [§3.2] §3.2, inter-frame branch description: the gated recurrent linear update maintains a fixed-size state with post-denoising update and shared pre-update state, but the manuscript provides no capacity analysis, eigenvalue bounds, or saturation experiments as sequence length grows, which is load-bearing for the claim that this branch preserves long-range context without the information loss of full cross-frame softmax attention.
  2. [§4.3] §4.3, temporal consistency results: reported gains over the full-softmax baseline are promising, yet the absence of ablations on recurrent state dimension versus video horizon (e.g., 50+ frames) leaves open whether the fixed-size memory truly scales or merely matches quality on the evaluated lengths.
minor comments (3)
  1. Figure 2: the diagram of the hybrid module would be clearer with explicit arrows distinguishing the intra-frame softmax path from the recurrent state update path.
  2. The abstract and §1 both use 'to the best of our knowledge' for the hybrid conversion claim; a brief comparison to prior linear-attention video works would strengthen this.
  3. Notation in §3.1 for the gated recurrent update (e.g., the exact form of the linear projection and gate) is introduced without a consolidated symbol table.
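The saturation experiment requested in major comment 1 could start from a probe like the one below. It is a hypothetical sketch on random data rather than a real model: it applies a scalar-gated update with random rank-one writes, then tracks the state norm and how much of the first frame's write is still aligned with the state.

```python
import numpy as np

def track_state(num_frames, dim, gate, seed=0):
    rng = np.random.default_rng(seed)
    state = np.zeros((dim, dim))
    first = None
    norms, sims = [], []
    for _ in range(num_frames):
        # Random rank-one write standing in for a frame's key-value product
        write = np.outer(rng.normal(size=dim), rng.normal(size=dim))
        state = gate * state + write
        if first is None:
            first = write
        norms.append(float(np.linalg.norm(state)))
        # Cosine similarity between the current state and the first frame's write
        cos = np.sum(state * first) / (np.linalg.norm(state) * np.linalg.norm(first))
        sims.append(float(cos))
    return norms, sims
```

Plotting `norms` and `sims` against frame index for several gate values and state sizes would show where the memory of early frames fades.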

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and for recognizing the potential impact of ARL2 on efficient autoregressive video diffusion. We address the two major comments below and commit to revisions that strengthen the empirical and analytical support for the recurrent state's behavior.

  1. Referee: [§3.2] §3.2, inter-frame branch description: the gated recurrent linear update maintains a fixed-size state with post-denoising update and shared pre-update state, but the manuscript provides no capacity analysis, eigenvalue bounds, or saturation experiments as sequence length grows, which is load-bearing for the claim that this branch preserves long-range context without the information loss of full cross-frame softmax attention.

    Authors: We agree that a dedicated capacity analysis would strengthen the claims. The current manuscript prioritizes end-to-end empirical validation (quality, temporal consistency, and efficiency) over theoretical bounds. In the revision we will add a short subsection to §3.2 that derives a simple contraction bound on the gated linear update and discusses how the post-denoising update and shared pre-update state mitigate information loss. We will also append saturation plots that track state-norm growth and cosine similarity between early and late frames for sequences up to 128 frames. revision: yes

  2. Referee: [§4.3] §4.3, temporal consistency results: reported gains over the full-softmax baseline are promising, yet the absence of ablations on recurrent state dimension versus video horizon (e.g., 50+ frames) leaves open whether the fixed-size memory truly scales or merely matches quality on the evaluated lengths.

    Authors: We acknowledge that the reported temporal-consistency gains are currently demonstrated on the standard evaluation lengths used by prior AR video diffusion work. To directly address scalability, the revised manuscript will expand §4.3 with a new ablation table that varies recurrent state dimension (128/256/512) and evaluates temporal consistency and FID on video horizons of 32, 48, and 64 frames. These additional results will be generated with the same two-stage training protocol. revision: yes
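For orientation, the kind of contraction bound mentioned in the first response can be written out for a generic gated linear recurrence. This is a sketch under our own assumptions (a scalar gate bounded by γ < 1 and per-frame writes bounded in Frobenius norm by c); the paper's exact update rule is not given in the text above, so the symbols here are illustrative.

```latex
\begin{align}
S_t &= g_t\, S_{t-1} + K_t^{\top} V_t,
  \qquad 0 \le g_t \le \gamma < 1,
  \qquad \lVert K_t^{\top} V_t \rVert_F \le c, \\
S_t &= \Bigl(\prod_{j=1}^{t} g_j\Bigr) S_0
  + \sum_{s=1}^{t} \Bigl(\prod_{j=s+1}^{t} g_j\Bigr) K_s^{\top} V_s, \\
\lVert S_t \rVert_F &\le \gamma^{t} \lVert S_0 \rVert_F + c \sum_{i=0}^{t-1} \gamma^{i}
  \le \gamma^{t} \lVert S_0 \rVert_F + \frac{c}{1-\gamma}.
\end{align}
```

Under these assumptions the state norm stays bounded for any video length, while the weight on frame s at time t is at most γ^(t−s). That geometric decay is the trade-off the referee's capacity question is probing.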

Circularity Check

0 steps flagged

No circularity: novel hybrid architecture validated empirically

The paper introduces an original ARL2 hybrid attention design that decomposes self-attention into an intra-frame softmax branch and an inter-frame gated recurrent linear branch with fixed-size state, updated post-denoising and shared pre-update. Performance claims of 2.26x speedup, 54% memory reduction, and improved temporal consistency are presented as direct experimental outcomes of this new module and two-stage training on a pretrained AR video diffusion model. No equations, predictions, or central results reduce by construction to fitted parameters, self-citations, or renamed prior patterns; the work is self-contained as an architectural proposal with independent empirical support.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

Based solely on the abstract; the central design rests on the assumption that a recurrent linear state suffices for cross-frame memory. No numerical free parameters are specified in the provided text.

axioms (1)
  • [domain assumption] A gated recurrent linear branch can maintain effective long-range temporal memory for video frames.
    This premise underpins the inter-frame branch and the claim of preserved temporal consistency.
invented entities (1)
  • ARL2 hybrid attention module (no independent evidence)
    purpose: Replace quadratic cross-frame attention while keeping linear scaling and constant memory.
    Newly proposed architecture combining softmax intra-frame and gated recurrent inter-frame branches.

pith-pipeline@v0.9.0 · 5848 in / 1293 out tokens · 58100 ms · 2026-05-22T09:06:35.648457+00:00 · methodology
