arXiv: 2510.27258 · v3 · pith:6EAOJO5Knew · submitted 2025-10-31 · 💻 cs.LG · cs.AI · cs.CL

Higher-order Linear Attention

Yifan Zhang, Zhen Qin, Quanquan Gu

Pith reviewed 2026-05-18 03:03 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.CL

keywords higher-order linear attention · linear attention · causal attention · state space models · efficient transformers · autoregressive models · prefix statistics

The pith

Higher-order Linear Attention captures second-order interactions in linear time with a constant-size state.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Higher-order Linear Attention to overcome the limits of first-order linear attention and kernel approximations. It uses compact prefix sufficient statistics to enable higher-order data-dependent mixing while remaining strictly causal and streaming. This matters for scaling autoregressive language models because it avoids quadratic matrix costs yet keeps the ability to model richer token interactions than basic linear recurrences. The method supplies closed-form updates, a masked causal variant, and an associative-scan training procedure that matches serial execution exactly. Extensions to third and higher orders are outlined as part of the same framework.

Core claim

Higher-order Linear Attention realizes higher-order interactions via compact prefix sufficient statistics. In the second-order case, HLA maintains a constant-size state and computes per-token outputs in linear time without materializing any n × n matrices. It supplies closed-form streaming identities, a strictly causal masked variant using two additional summaries, and a chunk-parallel training scheme based on associative scans that reproduces the activations of a serial recurrence exactly.

What carries the argument

Compact prefix sufficient statistics that accumulate the necessary moments of past tokens to compute higher-order interactions on the fly.
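
As a concrete reading of "prefix sufficient statistics", the unmasked second-order recurrence can be sketched as below. The definition of $S^K_t$ and the output formula follow the passage quoted from the paper in the Lean section of this page. The definition of $C^{QV}_t$ as a running sum of $q_i v_i^\top$ is an assumption inferred from its name, since the quoted passage elides it; the dimensions $d$ and $d_v$ are likewise illustrative.

```latex
\begin{aligned}
S^K_t &= S^K_{t-1} + k_t k_t^\top, & S^K_0 &= 0 \in \mathbb{R}^{d \times d},\\
C^{QV}_t &= C^{QV}_{t-1} + q_t v_t^\top, & C^{QV}_0 &= 0 \in \mathbb{R}^{d \times d_v}
  \quad \text{(assumed definition)},\\
o_t &:= q_t^\top S^K_t C^{QV}_t
  = \sum_{i \le t} \sum_{j \le t} (q_t^\top k_i)\,(k_i^\top q_j)\, v_j^\top .
\end{aligned}
```

Under these assumptions the state holds $d^2 + d\,d_v$ numbers regardless of $t$, and each step costs $O(d^2 + d\,d_v)$ operations, which is what "constant-size state" and "linear time" would mean in practice.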

If this is right

Second-order HLA runs in linear time with fixed memory independent of sequence length.
Training can use associative scans over chunks to produce exactly the same activations as the serial version.
A masked causal variant exists that uses only two extra summary statistics.
The same construction extends in closed form to third and higher orders.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

HLA could serve as a drop-in replacement for standard attention in long-context models where higher-order term mixing improves dependency capture.
Because it stays exactly equivalent to a recurrence, it inherits the same parallel training tricks already used by state-space models.
Testing whether second-order statistics measurably improve performance on tasks that require modeling pairwise token relations would be a direct next experiment.

Load-bearing premise

Compact prefix sufficient statistics exist and suffice to realize the desired higher-order interactions while preserving the causal streaming property and exact equivalence to the serial recurrence.

What would settle it

Compare the per-token outputs of the HLA recurrence against an explicit higher-order attention computation on a short sequence; any mismatch on second-order terms would show the identities do not hold.
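
A minimal version of this check, for the unmasked second-order case only, might look like the sketch below. It assumes $C^{QV}_t = \sum_{i \le t} q_i v_i^\top$ (the page does not reproduce that definition), uses plain Python lists instead of a tensor library, and all function names are hypothetical. It says nothing about the masked variant or the paper's own implementation.

```python
import random

def outer(a, b):
    """Outer product a b^T as a list of rows."""
    return [[x * y for y in b] for x in a]

def mat_add(A, B):
    """Elementwise sum of two equally shaped matrices."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(A, B)]

def vec_mat(v, M):
    """Row vector v times matrix M."""
    return [sum(v[i] * M[i][j] for i in range(len(v))) for j in range(len(M[0]))]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def hla_streaming(Q, K, V):
    """Constant-state recurrence: o_t = q_t^T S_t C_t."""
    d, dv = len(Q[0]), len(V[0])
    S = [[0.0] * d for _ in range(d)]    # S_t = sum_{i<=t} k_i k_i^T
    C = [[0.0] * dv for _ in range(d)]   # C_t = sum_{i<=t} q_i v_i^T (assumed definition)
    outs = []
    for q, k, v in zip(Q, K, V):
        S = mat_add(S, outer(k, k))
        C = mat_add(C, outer(q, v))
        outs.append(vec_mat(vec_mat(q, S), C))
    return outs

def hla_explicit(Q, K, V):
    """Direct double sum over the prefix; cubic in sequence length overall."""
    dv = len(V[0])
    outs = []
    for t, q in enumerate(Q):
        o = [0.0] * dv
        for i in range(t + 1):
            for j in range(t + 1):
                w = dot(q, K[i]) * dot(K[i], Q[j])
                o = [a + w * b for a, b in zip(o, V[j])]
        outs.append(o)
    return outs

def rand_rows(n, width, rng):
    return [[rng.gauss(0.0, 1.0) for _ in range(width)] for _ in range(n)]

rng = random.Random(0)
n, d, dv = 6, 3, 2
Q, K, V = rand_rows(n, d, rng), rand_rows(n, d, rng), rand_rows(n, dv, rng)

fast, slow = hla_streaming(Q, K, V), hla_explicit(Q, K, V)
max_err = max(abs(a - b) for rf, rs in zip(fast, slow) for a, b in zip(rf, rs))
print(f"max abs difference: {max_err:.3e}")
```

A difference at the level of floating-point rounding supports the identity for this case; a difference that grows with the inputs would indicate that the assumed statistics do not reproduce the double sum.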

Original abstract

The quadratic cost of scaled dot-product attention is a central obstacle to scaling autoregressive language models to long contexts. Linear-time attention and State Space Models (SSMs) provide scalable alternatives but are typically restricted to first-order or kernel-based approximations, which can limit expressivity. We introduce Higher-order Linear Attention (HLA), a causal, streaming mechanism that realizes higher interactions via compact prefix sufficient statistics. In the second-order case, HLA maintains a constant-size state and computes per-token outputs in linear time without materializing any $n \times n$ matrices. We give closed-form streaming identities, a strictly causal masked variant using two additional summaries, and a chunk-parallel training scheme based on associative scans that reproduces the activations of a serial recurrence exactly. We further outline extensions to third and higher orders. Collectively, these results position HLA as a principled, scalable building block that combines attention-like, data-dependent mixing with the efficiency of modern recurrent architectures.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper sketches a higher-order linear attention that keeps constant state for second-order terms via streaming identities and associative scans, but the lack of derivations leaves the exactness claim unverified.

The core idea is a causal higher-order linear attention that maintains a fixed-size state of prefix sufficient statistics to compute outputs equivalent to a full serial recurrence, all in linear time without quadratic matrices. For the second-order case they add a masked variant with two extra summaries and train via chunk-parallel associative scans that match the serial version exactly. That construction, plus the outline for third-order and beyond, is the new piece relative to standard linear attention and SSM work.

It does a clean job framing the problem as limited expressivity in current scalable alternatives and positions this as a direct fix that still feels attention-like and data-dependent. The training scheme sounds practical for people who already use scans in recurrent models.

The soft spot is that the abstract only asserts the closed-form identities and exact reproduction without showing the algebra or any error bounds. The stress-test concern about whether every cross-term between query, key, and value folds into a constant-size accumulator without breaking associativity or causality is still open; if even one interaction requires extra state or approximation, the linear-time constant-state guarantee weakens. Nothing in the provided text rules that out or confirms it.

This is aimed at people building long-context autoregressive models who need more mixing power than first-order linear attention but still want recurrent-style efficiency. A reader already working on efficient sequence models or attention variants would get the most out of it and could test the identities themselves. It is coherent enough on its own terms to deserve a serious referee rather than a desk reject; the idea is worth checking even if the proofs need tightening.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces Higher-order Linear Attention (HLA) as a causal streaming mechanism realizing higher-order interactions via compact prefix sufficient statistics. For the second-order case it claims constant-size state, linear-time per-token outputs without materializing n×n matrices, closed-form streaming identities, a strictly causal masked variant using two additional summaries, and a chunk-parallel associative-scan training scheme that exactly reproduces serial-recurrence activations. Extensions to third and higher orders are outlined.

Significance. If the closed-form identities and exact equivalence are correct, HLA would offer a principled way to obtain attention-like higher-order mixing at linear cost and constant state, improving expressivity over first-order linear attention and standard SSMs while retaining exact scan-based training. The associative-scan exact-reproduction property is a concrete strength for reproducibility and training stability.

major comments (2)

[Abstract] Abstract and the section stating the closed-form streaming identities: the central claim that second-order HLA maintains constant-size state while exactly reproducing the serial recurrence is presented without derivations, error bounds, or verification that all query-key-value cross terms remain incrementally maintainable under the causal mask. This is load-bearing for the linear-time constant-state guarantee.
[Masked variant description] The paragraph describing the strictly causal masked variant: the assertion that two additional summaries suffice to preserve exact equivalence and the streaming property is stated without explicit construction or proof that no further cross-terms are required, leaving open the possibility that the constant-state property fails for some interactions.
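
The question raised in the second major comment, whether a cross-term can be carried in constant state without breaking associativity, can be made concrete with a small sketch. The reviewed text does not reproduce the paper's two summaries, so the accumulator $G_t = \sum_{i \le t} k_i k_i^\top C_{i-1}$ used here is a hypothetical cross-term chosen for illustration, with $C_t = \sum_{i \le t} q_i v_i^\top$ also assumed. The sketch checks that segment summaries $(S, C, G)$ combine associatively and that a balanced tree reduction agrees with the token-by-token recurrence.

```python
import random

def zeros(r, c):
    return [[0.0] * c for _ in range(r)]

def outer(a, b):
    return [[x * y for y in b] for x in a]

def mat_add(A, B):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(A, B)]

def mat_mul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def lift(q, k, v):
    """Summary of a one-token segment: S = k k^T, C = q v^T, G = 0."""
    return outer(k, k), outer(q, v), zeros(len(k), len(v))

def combine(A, B):
    """Summary of segment A followed by segment B (hypothetical operator)."""
    SA, CA, GA = A
    SB, CB, GB = B
    # Every token in B sees all of A's C as "earlier", contributing S_B C_A.
    G = mat_add(mat_add(GA, GB), mat_mul(SB, CA))
    return mat_add(SA, SB), mat_add(CA, CB), G

def serial_G(Q, K, V):
    """G_t = G_{t-1} + k_t k_t^T C_{t-1}, run token by token."""
    d, dv = len(Q[0]), len(V[0])
    C, G = zeros(d, dv), zeros(d, dv)
    for q, k, v in zip(Q, K, V):
        G = mat_add(G, mat_mul(outer(k, k), C))  # uses C before this token
        C = mat_add(C, outer(q, v))
    return G

def tree_reduce(items):
    """Order-preserving pairwise reduction, grouping terms as a parallel scan might."""
    while len(items) > 1:
        paired = [combine(items[i], items[i + 1]) for i in range(0, len(items) - 1, 2)]
        if len(items) % 2:
            paired.append(items[-1])
        items = paired
    return items[0]

def max_diff(A, B):
    return max(abs(x - y) for ra, rb in zip(A, B) for x, y in zip(ra, rb))

rng = random.Random(1)
n, d, dv = 7, 3, 2
Q = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]
K = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]
V = [[rng.gauss(0.0, 1.0) for _ in range(dv)] for _ in range(n)]

leaves = [lift(q, k, v) for q, k, v in zip(Q, K, V)]
tree_err = max_diff(tree_reduce(leaves)[2], serial_G(Q, K, V))

a, b, c = leaves[:3]
left, right = combine(combine(a, b), c), combine(a, combine(b, c))
assoc_err = max(max_diff(x, y) for x, y in zip(left, right))
print(f"tree vs serial: {tree_err:.3e}, associativity gap: {assoc_err:.3e}")
```

The operator is associative but not commutative, because the term $S_B C_A$ depends on which segment comes first. Whether the paper's actual summaries compose this way is exactly what the requested construction would need to show.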

minor comments (2)

Notation for the prefix sufficient statistics (e.g., moment tensors or outer-product accumulators) should be defined explicitly with dimension counts before the streaming identities are introduced.
The manuscript would benefit from a small empirical verification (even on synthetic data) confirming that the scan-based implementation matches the serial recurrence to machine precision.
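
The synthetic check suggested in the second minor comment could look like the following sketch. It is not the paper's training scheme: it covers only the unmasked statistics, where the scan operator is plain addition, runs the "parallel" passes sequentially in pure Python, and assumes $C_t = \sum_{i \le t} q_i v_i^\top$. All names are hypothetical.

```python
import random

def zeros(r, c):
    return [[0.0] * c for _ in range(r)]

def outer(a, b):
    return [[x * y for y in b] for x in a]

def mat_add(A, B):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(A, B)]

def vec_mat(v, M):
    return [sum(v[i] * M[i][j] for i in range(len(v))) for j in range(len(M[0]))]

def init_state(Q, V):
    d, dv = len(Q[0]), len(V[0])
    return zeros(d, d), zeros(d, dv)

def step(state, q, k, v):
    """One serial update of the (S, C) state."""
    S, C = state
    return mat_add(S, outer(k, k)), mat_add(C, outer(q, v))

def readout(state, q):
    S, C = state
    return vec_mat(vec_mat(q, S), C)

def serial(Q, K, V):
    state, outs = init_state(Q, V), []
    for q, k, v in zip(Q, K, V):
        state = step(state, q, k, v)
        outs.append(readout(state, q))
    return outs

def chunked(Q, K, V, chunk):
    n = len(Q)
    starts = range(0, n, chunk)
    # Pass 1: per-chunk totals; chunks are independent, so this could run in parallel.
    totals = []
    for s in starts:
        state = init_state(Q, V)
        for t in range(s, min(s + chunk, n)):
            state = step(state, Q[t], K[t], V[t])
        totals.append(state)
    # Pass 2: exclusive prefix over chunk totals gives each chunk its carry-in state.
    carries = [init_state(Q, V)]
    for S, C in totals[:-1]:
        pS, pC = carries[-1]
        carries.append((mat_add(pS, S), mat_add(pC, C)))
    # Pass 3: replay each chunk from its carry-in state and emit outputs.
    outs = []
    for carry, s in zip(carries, starts):
        state = carry
        for t in range(s, min(s + chunk, n)):
            state = step(state, Q[t], K[t], V[t])
            outs.append(readout(state, Q[t]))
    return outs

rng = random.Random(2)
n, d, dv = 10, 3, 2
Q = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]
K = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]
V = [[rng.gauss(0.0, 1.0) for _ in range(dv)] for _ in range(n)]

ref, par = serial(Q, K, V), chunked(Q, K, V, chunk=4)
max_err = max(abs(a - b) for rr, rp in zip(ref, par) for a, b in zip(rr, rp))
print(f"max abs difference: {max_err:.3e}")
```

Because the chunked version adds terms in a different order, the two results can differ in the last few bits; agreement "to machine precision" rather than bitwise equality is the appropriate criterion for a floating-point implementation.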

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive review of our manuscript on Higher-order Linear Attention. The comments highlight important points regarding the presentation of our central claims. We will revise the manuscript to include explicit derivations, constructions, and verifications as requested, which will strengthen the exposition without altering the core technical contributions.

Referee: [Abstract] Abstract and the section stating the closed-form streaming identities: the central claim that second-order HLA maintains constant-size state while exactly reproducing the serial recurrence is presented without derivations, error bounds, or verification that all query-key-value cross terms remain incrementally maintainable under the causal mask. This is load-bearing for the linear-time constant-state guarantee.

Authors: We agree that the abstract and the section on closed-form streaming identities would benefit from expanded derivations to make the constant-state and exact-reproduction properties fully transparent. In the revised manuscript we will add a dedicated subsection deriving the incremental update rules for the second-order sufficient statistics. These derivations will explicitly track all query-key-value cross terms (including the data-dependent mixing) and show that they remain incrementally maintainable with a fixed number of summary tensors under the causal mask. We will also include a short error-bound analysis confirming that the identities are exact (no approximation) when the associative scan is used. This addition directly addresses the load-bearing guarantee for linear-time, constant-state operation.
revision: yes

Referee: [Masked variant description] The paragraph describing the strictly causal masked variant: the assertion that two additional summaries suffice to preserve exact equivalence and the streaming property is stated without explicit construction or proof that no further cross-terms are required, leaving open the possibility that the constant-state property fails for some interactions.

Authors: We acknowledge that the current description of the strictly causal masked variant is concise and would be improved by an explicit construction. In the revision we will insert a short proof sketch and the concrete tensor forms of the two additional summaries. The construction demonstrates that these two summaries capture all remaining cross-terms required by the causal mask while preserving both the streaming property and exact equivalence to the serial recurrence; no further state is needed. This will eliminate any ambiguity about whether the constant-state property holds for the masked case.
revision: yes

Circularity Check

0 steps flagged

Derivation of HLA streaming identities is self-contained from sufficient statistics

The paper presents closed-form streaming identities derived directly from the definition of prefix sufficient statistics (sums, outer products, and cross terms) to maintain constant-size state for second-order interactions. These identities are shown to reproduce the serial recurrence exactly via associative scans, with an explicit masked variant using two additional summaries. No step reduces a claimed prediction or uniqueness result to a fitted parameter, self-citation, or ansatz imported from prior work by the same authors; the construction is parameter-free and mathematically derived from first principles of incremental statistics. The central claim of linear-time constant-state higher-order attention therefore stands on independent algebraic identities rather than circular redefinition of inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the existence of compact sufficient statistics that capture higher-order interactions exactly in a constant-size state; this is treated as a domain assumption rather than derived from prior results.

axioms (1)

Domain assumption: Higher-order interactions admit compact prefix sufficient statistics that enable exact causal streaming updates.
Invoked to justify the constant-size state and linear-time per-token computation in the second-order case.

pith-pipeline@v0.9.0 · 5685 in / 1087 out tokens · 28825 ms · 2026-05-18T03:03:06.024190+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Relation between the paper passage and the cited Recognition theorem.

We maintain prefix summaries at time $t$: $S^K_t := \sum_{i \le t} k_i k_i^\top$ … The output of second-order HLA at time $t$ is … $o_t := q_t^\top S^K_t C^{QV}_t$

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear

Relation between the paper passage and the cited Recognition theorem.

Theorem 4.1 (Scan equivalence: serial vs. (decayed) associative scans)… the per-token masked outputs are identical to those of the serial algorithm

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
