Higher-order Linear Attention
Pith reviewed 2026-05-18 03:03 UTC · model grok-4.3
The pith
Higher-order Linear Attention captures second-order interactions in linear time with a constant-size state.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Higher-order Linear Attention (HLA) realizes higher-order interactions via compact prefix sufficient statistics. In the second-order case, HLA maintains a constant-size state and computes per-token outputs in linear time without materializing any n × n matrices. The paper supplies closed-form streaming identities, a strictly causal masked variant using two additional summaries, and a chunk-parallel training scheme based on associative scans that reproduces the activations of a serial recurrence exactly.
What carries the argument
Compact prefix sufficient statistics that accumulate the necessary moments of past tokens to compute higher-order interactions on the fly.
If this is right
- Second-order HLA runs in linear time with fixed memory independent of sequence length.
- Training can use associative scans over chunks to produce exactly the same activations as the serial version.
- A masked causal variant exists that uses only two extra summary statistics.
- The same construction extends in closed form to third and higher orders.
Where Pith is reading between the lines
- HLA could serve as a drop-in replacement for standard attention in long-context models where higher-order term mixing improves dependency capture.
- Because it stays exactly equivalent to a recurrence, it inherits the same parallel training tricks already used by state-space models.
- Testing whether second-order statistics measurably improve performance on tasks that require modeling pairwise token relations would be a direct next experiment.
Load-bearing premise
Compact prefix sufficient statistics exist and suffice to realize the desired higher-order interactions while preserving the causal streaming property and exact equivalence to the serial recurrence.
What would settle it
Compare the per-token outputs of the HLA recurrence against an explicit higher-order attention computation on a short sequence; any mismatch on second-order terms would show the identities do not hold.
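The check described above can be sketched in a few lines. This is a minimal NumPy sketch, not the paper's code. It assumes the unmasked second-order output is o_t = q_t^T S^K_t C^{QV}_t with S^K_t = Σ_{i≤t} k_i k_i^T and C^{QV}_t = Σ_{j≤t} q_j v_j^T; the definition of C^{QV}_t is an assumption, since this page names the symbol without defining it. The function names and the small integer test data are hypothetical. Integers are used so that the comparison is exact rather than up to rounding.

```python
import numpy as np

def hla2_streaming(Q, K, V):
    """Unmasked second-order output via running sums (assumed definitions)."""
    n, d = Q.shape
    d_v = V.shape[1]
    S = np.zeros((d, d), dtype=Q.dtype)     # S_t = sum_{i<=t} k_i k_i^T
    C = np.zeros((d, d_v), dtype=Q.dtype)   # C_t = sum_{j<=t} q_j v_j^T (assumed)
    out = np.zeros((n, d_v), dtype=Q.dtype)
    for t in range(n):
        S += np.outer(K[t], K[t])
        C += np.outer(Q[t], V[t])
        out[t] = (Q[t] @ S) @ C             # o_t = q_t^T S_t C_t
    return out

def hla2_explicit(Q, K, V):
    """Same quantity as an explicit double sum over past tokens i, j <= t."""
    n = Q.shape[0]
    out = np.zeros((n, V.shape[1]), dtype=Q.dtype)
    for t in range(n):
        for i in range(t + 1):
            for j in range(t + 1):
                out[t] += (Q[t] @ K[i]) * (K[i] @ Q[j]) * V[j]
    return out

# Hypothetical toy data: small integers keep every operation exact.
rng = np.random.default_rng(0)
n, d, d_v = 6, 3, 2
Q = rng.integers(-3, 4, size=(n, d))
K = rng.integers(-3, 4, size=(n, d))
V = rng.integers(-3, 4, size=(n, d_v))

match = np.array_equal(hla2_streaming(Q, K, V), hla2_explicit(Q, K, V))
print("streaming == explicit:", match)
```

The streaming loop holds only a d × d matrix and a d × d_v matrix, regardless of n. With floating-point inputs, the exact comparison would need to be relaxed to a tolerance check such as `np.allclose`.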
original abstract
The quadratic cost of scaled dot-product attention is a central obstacle to scaling autoregressive language models to long contexts. Linear-time attention and State Space Models (SSMs) provide scalable alternatives but are typically restricted to first-order or kernel-based approximations, which can limit expressivity. We introduce Higher-order Linear Attention (HLA), a causal, streaming mechanism that realizes higher interactions via compact prefix sufficient statistics. In the second-order case, HLA maintains a constant-size state and computes per-token outputs in linear time without materializing any $n \times n$ matrices. We give closed-form streaming identities, a strictly causal masked variant using two additional summaries, and a chunk-parallel training scheme based on associative scans that reproduces the activations of a serial recurrence exactly. We further outline extensions to third and higher orders. Collectively, these results position HLA as a principled, scalable building block that combines attention-like, data-dependent mixing with the efficiency of modern recurrent architectures.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces Higher-order Linear Attention (HLA) as a causal streaming mechanism realizing higher-order interactions via compact prefix sufficient statistics. For the second-order case it claims constant-size state, linear-time per-token outputs without materializing n×n matrices, closed-form streaming identities, a strictly causal masked variant using two additional summaries, and a chunk-parallel associative-scan training scheme that exactly reproduces serial-recurrence activations. Extensions to third and higher orders are outlined.
Significance. If the closed-form identities and exact equivalence are correct, HLA would offer a principled way to obtain attention-like higher-order mixing at linear cost and constant state, improving expressivity over first-order linear attention and standard SSMs while retaining exact scan-based training. The associative-scan exact-reproduction property is a concrete strength for reproducibility and training stability.
major comments (2)
- [Abstract] Abstract and the section stating the closed-form streaming identities: the central claim that second-order HLA maintains constant-size state while exactly reproducing the serial recurrence is presented without derivations, error bounds, or verification that all query-key-value cross terms remain incrementally maintainable under the causal mask. This is load-bearing for the linear-time constant-state guarantee.
- [Masked variant description] The paragraph describing the strictly causal masked variant: the assertion that two additional summaries suffice to preserve exact equivalence and the streaming property is stated without explicit construction or proof that no further cross-terms are required, leaving open the possibility that the constant-state property fails for some interactions.
minor comments (2)
- Notation for the prefix sufficient statistics (e.g., moment tensors or outer-product accumulators) should be defined explicitly with dimension counts before the streaming identities are introduced.
- The manuscript would benefit from a small empirical verification (even on synthetic data) confirming that the scan-based implementation matches the serial recurrence to machine precision.
Simulated Author's Rebuttal
We thank the referee for the thoughtful and constructive review of our manuscript on Higher-order Linear Attention. The comments highlight important points regarding the presentation of our central claims. We will revise the manuscript to include explicit derivations, constructions, and verifications as requested, which will strengthen the exposition without altering the core technical contributions.
point-by-point responses
- Referee: [Abstract] Abstract and the section stating the closed-form streaming identities: the central claim that second-order HLA maintains constant-size state while exactly reproducing the serial recurrence is presented without derivations, error bounds, or verification that all query-key-value cross terms remain incrementally maintainable under the causal mask. This is load-bearing for the linear-time constant-state guarantee.
  Authors: We agree that the abstract and the section on closed-form streaming identities would benefit from expanded derivations to make the constant-state and exact-reproduction properties fully transparent. In the revised manuscript we will add a dedicated subsection deriving the incremental update rules for the second-order sufficient statistics. These derivations will explicitly track all query-key-value cross terms (including the data-dependent mixing) and show that they remain incrementally maintainable with a fixed number of summary tensors under the causal mask. We will also include a short error-bound analysis confirming that the identities are exact (no approximation) when the associative scan is used. This addition directly addresses the load-bearing guarantee for linear-time, constant-state operation.
  Revision: yes
- Referee: [Masked variant description] The paragraph describing the strictly causal masked variant: the assertion that two additional summaries suffice to preserve exact equivalence and the streaming property is stated without explicit construction or proof that no further cross-terms are required, leaving open the possibility that the constant-state property fails for some interactions.
  Authors: We acknowledge that the current description of the strictly causal masked variant is concise and would be improved by an explicit construction. In the revision we will insert a short proof sketch and the concrete tensor forms of the two additional summaries. The construction demonstrates that these two summaries capture all remaining cross-terms required by the causal mask while preserving both the streaming property and exact equivalence to the serial recurrence; no further state is needed. This will eliminate any ambiguity about whether the constant-state property holds for the masked case.
  Revision: yes
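This page does not give the tensor forms of the two additional summaries. The following is a hedged sketch of what such a construction could look like, under stated assumptions. It reads "strictly causal" as meaning that a pair (i, j) contributes to the output at time t only when i ≤ j ≤ t. It also assumes an optional scalar normalizer built from the same weights. Under those assumptions, two correction summaries suffice: G_t = Σ_{i≤t} k_i k_i^T C_{i-1} for the numerator and h_t = Σ_{i≤t} k_i k_i^T m_{i-1} for the normalizer, where C_t = Σ_{j≤t} q_j v_j^T and m_t = Σ_{j≤t} q_j. The names G, h, C, m and both functions are illustrative and not confirmed by the source.

```python
import numpy as np

def hla2_masked_streaming(Q, K, V):
    """Masked second-order numerator and normalizer with constant-size state."""
    n, d = Q.shape
    d_v = V.shape[1]
    S = np.zeros((d, d), dtype=Q.dtype)     # sum_i k_i k_i^T
    C = np.zeros((d, d_v), dtype=Q.dtype)   # sum_j q_j v_j^T (assumed)
    m = np.zeros(d, dtype=Q.dtype)          # sum_j q_j (assumed)
    G = np.zeros((d, d_v), dtype=Q.dtype)   # assumed correction for the numerator
    h = np.zeros(d, dtype=Q.dtype)          # assumed correction for the normalizer
    num = np.zeros((n, d_v), dtype=Q.dtype)
    den = np.zeros(n, dtype=Q.dtype)
    for t in range(n):
        kk = np.outer(K[t], K[t])
        G += kk @ C                         # uses C_{t-1}: pairs with j < i
        h += kk @ m                         # uses m_{t-1}
        S += kk
        C += np.outer(Q[t], V[t])
        m += Q[t]
        num[t] = Q[t] @ (S @ C - G)         # keeps only pairs with i <= j <= t
        den[t] = Q[t] @ (S @ m - h)
    return num, den

def hla2_masked_explicit(Q, K, V):
    """Brute-force reference: sum over all i <= j <= t."""
    n = Q.shape[0]
    num = np.zeros((n, V.shape[1]), dtype=Q.dtype)
    den = np.zeros(n, dtype=Q.dtype)
    for t in range(n):
        for j in range(t + 1):
            for i in range(j + 1):
                w = (Q[t] @ K[i]) * (K[i] @ Q[j])
                num[t] += w * V[j]
                den[t] += w
    return num, den

# Hypothetical toy data: small integers keep the comparison exact.
rng = np.random.default_rng(1)
n, d, d_v = 6, 3, 2
Q = rng.integers(-3, 4, size=(n, d))
K = rng.integers(-3, 4, size=(n, d))
V = rng.integers(-3, 4, size=(n, d_v))

num_s, den_s = hla2_masked_streaming(Q, K, V)
num_e, den_e = hla2_masked_explicit(Q, K, V)
print("numerators match:", np.array_equal(num_s, num_e))
print("normalizers match:", np.array_equal(den_s, den_e))
```

The corrections work because S_t C_t counts every pair (i, j) with i, j ≤ t, while G_t accumulates exactly the pairs with j < i, so the difference leaves the pairs with i ≤ j ≤ t. If the paper's mask convention differs from the one assumed here, the summaries would need to change accordingly.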
Circularity Check
Derivation of HLA streaming identities is self-contained from sufficient statistics
full rationale
The paper presents closed-form streaming identities derived directly from the definition of prefix sufficient statistics (sums, outer products, and cross terms) to maintain constant-size state for second-order interactions. These identities are shown to reproduce the serial recurrence exactly via associative scans, with an explicit masked variant using two additional summaries. No step reduces a claimed prediction or uniqueness result to a fitted parameter, self-citation, or ansatz imported from prior work by the same authors; the construction is parameter-free and mathematically derived from first principles of incremental statistics. The central claim of linear-time constant-state higher-order attention therefore stands on independent algebraic identities rather than circular redefinition of inputs.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Higher-order interactions admit compact prefix sufficient statistics that enable exact causal streaming updates.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
  Relation between the paper passage and the cited Recognition theorem.
  We maintain prefix summaries at time t: $S^K_t := \sum_{i \le t} k_i k_i^\top$ … The output of second-order HLA at time t is … $o_t := q_t^\top S^K_t C^{QV}_t$
- IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
  Relation between the paper passage and the cited Recognition theorem.
  Theorem 4.1 (Scan equivalence: serial vs. (decayed) associative scans)… the per-token masked outputs are identical to those of the serial algorithm
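The scan-equivalence claim quoted above can be illustrated on a toy example. This is a sketch under assumptions, not the paper's algorithm: it uses an undecayed state (S, C, G) with S = Σ k_i k_i^T, C = Σ q_j v_j^T, and an assumed masked correction G, and it composes a segment a followed by a segment b as (S_a + S_b, C_a + C_b, G_a + G_b + S_b C_a). The cross term S_b C_a is what makes the operator order-sensitive while keeping it associative. All names are hypothetical, and the decayed case mentioned in the theorem is not covered.

```python
import numpy as np

def token_state(q, k, v):
    """State of a one-token segment; its correction term G is zero."""
    G0 = np.zeros((k.size, v.size), dtype=k.dtype)
    return np.outer(k, k), np.outer(q, v), G0

def combine(a, b):
    """Compose segment a followed by segment b (assumed operator)."""
    Sa, Ca, Ga = a
    Sb, Cb, Gb = b
    return Sa + Sb, Ca + Cb, Ga + Gb + Sb @ Ca

def readout(q, state):
    S, C, G = state
    return q @ (S @ C - G)

def serial_outputs(Q, K, V):
    state, outs = None, []
    for q, k, v in zip(Q, K, V):
        tok = token_state(q, k, v)
        state = tok if state is None else combine(state, tok)
        outs.append(readout(q, state))
    return np.array(outs)

def chunked_outputs(Q, K, V, chunk):
    n = Q.shape[0]
    # Pass 1: inclusive prefix states inside each chunk (independent per chunk).
    local = []
    for start in range(0, n, chunk):
        prefix, acc = [], None
        for t in range(start, min(start + chunk, n)):
            tok = token_state(Q[t], K[t], V[t])
            acc = tok if acc is None else combine(acc, tok)
            prefix.append(acc)
        local.append(prefix)
    # Pass 2: carry the summary of earlier chunks into each local prefix.
    outs, carry = [], None
    for c, prefix in enumerate(local):
        for off, st in enumerate(prefix):
            full = st if carry is None else combine(carry, st)
            outs.append(readout(Q[c * chunk + off], full))
        carry = prefix[-1] if carry is None else combine(carry, prefix[-1])
    return np.array(outs)

# Hypothetical toy data: integer inputs make the equality exact.
rng = np.random.default_rng(2)
n, d, d_v = 7, 3, 2
Q = rng.integers(-3, 4, size=(n, d))
K = rng.integers(-3, 4, size=(n, d))
V = rng.integers(-3, 4, size=(n, d_v))

ref = serial_outputs(Q, K, V)
same = all(np.array_equal(chunked_outputs(Q, K, V, c), ref) for c in (1, 2, 3, 7))
print("chunked == serial for all chunk sizes:", same)
```

With integer inputs the two paths agree exactly. With floating-point inputs the same algebra holds, but different summation orders can introduce rounding differences, so "exact" reproduction in practice means agreement to machine precision unless the reduction order is fixed.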
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.