pith. machine review for the scientific record.

arXiv: 2510.27258 · v3 · pith:6EAOJO5Knew · submitted 2025-10-31 · 💻 cs.LG · cs.AI · cs.CL

Higher-order Linear Attention

Pith reviewed 2026-05-18 03:03 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.CL
keywords higher-order linear attention, linear attention, causal attention, state space models, efficient transformers, autoregressive models, prefix statistics

The pith

Higher-order Linear Attention captures second-order interactions in linear time with a constant-size state.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Higher-order Linear Attention to overcome the limits of first-order linear attention and kernel approximations. It uses compact prefix sufficient statistics to enable higher-order data-dependent mixing while remaining strictly causal and streaming. This matters for scaling autoregressive language models because it avoids quadratic matrix costs yet keeps the ability to model richer token interactions than basic linear recurrences. The method supplies closed-form updates, a masked causal variant, and an associative-scan training procedure that matches serial execution exactly. Extensions to third and higher orders are outlined as part of the same framework.

Core claim

Higher-order Linear Attention realizes higher-order interactions via compact prefix sufficient statistics. In the second-order case, HLA maintains a constant-size state and computes per-token outputs in linear time without materializing any n × n matrices. It supplies closed-form streaming identities, a strictly causal masked variant using two additional summaries, and a chunk-parallel training scheme based on associative scans that reproduces the activations of a serial recurrence exactly.

What carries the argument

Compact prefix sufficient statistics that accumulate the necessary moments of past tokens to compute higher-order interactions on the fly.
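As a hedged illustration of what such statistics can look like, the block below writes out one plausible second-order form. The symbols $S_t$, $C_t$, $d$, and $d_v$, and the specific interaction pattern, are assumptions made for this sketch; this page does not reproduce the paper's own definitions, which may differ (for example in normalization or masking).

```latex
% Assumed prefix statistics for queries q_i, keys k_i in R^d and values v_i in R^{d_v}
S_t = S_{t-1} + k_t k_t^\top \in \mathbb{R}^{d \times d},
\qquad
C_t = C_{t-1} + q_t v_t^\top \in \mathbb{R}^{d \times d_v},
\qquad
S_0 = 0,\; C_0 = 0 .

% Assumed second-order output, expanded into the double sum it summarizes
o_t^\top = q_t^\top S_t C_t
         = \sum_{i \le t} \sum_{j \le t} (q_t^\top k_i)\,(k_i^\top q_j)\, v_j^\top .
```

Under these assumptions the state holds $d^2 + d\,d_v$ numbers regardless of $t$, and each token costs a fixed number of matrix-vector products, which is the sense in which the state is constant-size and the total cost linear in sequence length.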

If this is right

  • Second-order HLA runs in linear time with fixed memory independent of sequence length.
  • Training can use associative scans over chunks to produce exactly the same activations as the serial version.
  • A masked causal variant exists that uses only two extra summary statistics.
  • The same construction extends in closed form to third and higher orders.
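The scan claim in the second bullet can be sketched as follows. This is a minimal NumPy illustration, not the paper's algorithm: it assumes a masked second-order output of the form o_t = sum over j <= i <= t of (q_t.k_i)(k_i.q_j) v_j, and the names `lift`, `combine`, `run_segment`, and the triple `(S, C, E)` are hypothetical. The paper's masked variant is described as using two additional summaries, which are not reproduced here. The point of the sketch is that once segment summaries compose associatively, chunked execution agrees with the serial recurrence algebraically; in floating point the agreement is up to rounding.

```python
import numpy as np

def lift(q, k, v):
    """Summary (S, C, E) of a single token (assumed form, see lead-in)."""
    S = np.outer(k, k)          # k k^T
    C = np.outer(q, v)          # q v^T
    return S, C, S @ C          # E = k k^T (q v^T)

def combine(left, right):
    """Merge summaries of two adjacent segments; `left` precedes `right`."""
    S1, C1, E1 = left
    S2, C2, E2 = right
    # Keys in the right segment also interact with query-value pairs on the left.
    return S1 + S2, C1 + C2, E1 + E2 + S2 @ C1

def run_segment(Q, K, V, carry=None):
    """Serially process one segment; return per-token outputs and final summary."""
    state, outs = carry, []
    for q, k, v in zip(Q, K, V):
        elem = lift(q, k, v)
        state = elem if state is None else combine(state, elem)
        outs.append(q @ state[2])           # o_t = q_t^T E_t
    return np.array(outs), state

def serial_outputs(Q, K, V):
    return run_segment(Q, K, V)[0]

def chunked_outputs(Q, K, V, chunk):
    bounds = [(s, min(s + chunk, len(Q))) for s in range(0, len(Q), chunk)]
    # Pass 1: summarize each chunk with no carry-in (independent across chunks).
    summaries = [run_segment(Q[a:b], K[a:b], V[a:b])[1] for a, b in bounds]
    # Pass 2: exclusive scan over chunk summaries gives each chunk its carry-in.
    carries, acc = [], None
    for summ in summaries:
        carries.append(acc)
        acc = summ if acc is None else combine(acc, summ)
    # Pass 3: replay each chunk from its carry-in (independent across chunks).
    parts = [run_segment(Q[a:b], K[a:b], V[a:b], carry=c)[0]
             for (a, b), c in zip(bounds, carries)]
    return np.concatenate(parts, axis=0)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    Q, K, V = (rng.standard_normal((10, 4)),
               rng.standard_normal((10, 4)),
               rng.standard_normal((10, 3)))
    diff = np.abs(serial_outputs(Q, K, V) - chunked_outputs(Q, K, V, chunk=4))
    print("max abs difference:", diff.max())
```

Passes 1 and 3 are written as Python loops for clarity; in a real implementation they are the parts that would run in parallel across chunks, with only the short pass 2 remaining sequential (or itself replaced by a parallel scan).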

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • HLA could serve as a drop-in replacement for standard attention in long-context models where higher-order term mixing improves dependency capture.
  • Because it stays exactly equivalent to a recurrence, it inherits the same parallel training tricks already used by state-space models.
  • Testing whether second-order statistics measurably improve performance on tasks that require modeling pairwise token relations would be a direct next experiment.

Load-bearing premise

Compact prefix sufficient statistics exist and suffice to realize the desired higher-order interactions while preserving the causal streaming property and exact equivalence to the serial recurrence.

What would settle it

Compare the per-token outputs of the HLA recurrence against an explicit higher-order attention computation on a short sequence; any mismatch on second-order terms would show the identities do not hold.
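A check of this kind might look like the sketch below. It is an illustration under stated assumptions rather than the paper's test: the two second-order forms (an unmasked one and a masked one), the accumulators `S`, `C`, `E`, and the function names are all hypothetical choices made here, since this page does not give the paper's exact identities. The reference path materializes the n × n matrices on purpose; the streaming path never does.

```python
import numpy as np

def explicit_second_order(Q, K, V, masked):
    """Reference computation that materializes n x n matrices."""
    n = Q.shape[0]
    A = Q @ K.T                        # A[t, i] = q_t . k_i
    B = K @ Q.T                        # B[i, j] = k_i . q_j
    if masked:
        L = np.tril(np.ones((n, n)))   # 1 where column index <= row index
        # o_t = sum over j <= i <= t of (q_t.k_i)(k_i.q_j) v_j
        return (L * A) @ (L * B) @ V
    # o_t = sum over i <= t and j <= t of (q_t.k_i)(k_i.q_j) v_j
    out = np.empty((n, V.shape[1]))
    for t in range(n):
        out[t] = A[t, :t + 1] @ B[:t + 1, :t + 1] @ V[:t + 1]
    return out

def streaming_second_order(Q, K, V, masked):
    """Same outputs from a fixed-size state, one token at a time."""
    d, dv = Q.shape[1], V.shape[1]
    S = np.zeros((d, d))               # running sum of k k^T
    C = np.zeros((d, dv))              # running sum of q v^T
    E = np.zeros((d, dv))              # running sum of k_i k_i^T C_i
    out = np.empty((Q.shape[0], dv))
    for t, (q, k, v) in enumerate(zip(Q, K, V)):
        S += np.outer(k, k)
        C += np.outer(q, v)
        E += np.outer(k, k @ C)        # uses C_t, so pair (i, j) needs j <= i
        out[t] = q @ E if masked else q @ S @ C
    return out

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    Q, K, V = (rng.standard_normal((6, 4)),
               rng.standard_normal((6, 4)),
               rng.standard_normal((6, 3)))
    for masked in (False, True):
        gap = np.abs(explicit_second_order(Q, K, V, masked)
                     - streaming_second_order(Q, K, V, masked)).max()
        print("masked" if masked else "unmasked", "max abs difference:", gap)
```

A mismatch larger than floating-point rounding on a short sequence would indicate that the assumed streaming identity does not reproduce the explicit computation; running the same comparison against the paper's actual definitions is what would test its claim.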

Original abstract

The quadratic cost of scaled dot-product attention is a central obstacle to scaling autoregressive language models to long contexts. Linear-time attention and State Space Models (SSMs) provide scalable alternatives but are typically restricted to first-order or kernel-based approximations, which can limit expressivity. We introduce Higher-order Linear Attention (HLA), a causal, streaming mechanism that realizes higher interactions via compact prefix sufficient statistics. In the second-order case, HLA maintains a constant-size state and computes per-token outputs in linear time without materializing any $n \times n$ matrices. We give closed-form streaming identities, a strictly causal masked variant using two additional summaries, and a chunk-parallel training scheme based on associative scans that reproduces the activations of a serial recurrence exactly. We further outline extensions to third and higher orders. Collectively, these results position HLA as a principled, scalable building block that combines attention-like, data-dependent mixing with the efficiency of modern recurrent architectures.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces Higher-order Linear Attention (HLA) as a causal streaming mechanism realizing higher-order interactions via compact prefix sufficient statistics. For the second-order case it claims constant-size state, linear-time per-token outputs without materializing n×n matrices, closed-form streaming identities, a strictly causal masked variant using two additional summaries, and a chunk-parallel associative-scan training scheme that exactly reproduces serial-recurrence activations. Extensions to third and higher orders are outlined.

Significance. If the closed-form identities and exact equivalence are correct, HLA would offer a principled way to obtain attention-like higher-order mixing at linear cost and constant state, improving expressivity over first-order linear attention and standard SSMs while retaining exact scan-based training. The associative-scan exact-reproduction property is a concrete strength for reproducibility and training stability.

major comments (2)
  1. [Abstract] Abstract and the section stating the closed-form streaming identities: the central claim that second-order HLA maintains constant-size state while exactly reproducing the serial recurrence is presented without derivations, error bounds, or verification that all query-key-value cross terms remain incrementally maintainable under the causal mask. This is load-bearing for the linear-time constant-state guarantee.
  2. [Masked variant description] The paragraph describing the strictly causal masked variant: the assertion that two additional summaries suffice to preserve exact equivalence and the streaming property is stated without explicit construction or proof that no further cross-terms are required, leaving open the possibility that the constant-state property fails for some interactions.
minor comments (2)
  1. Notation for the prefix sufficient statistics (e.g., moment tensors or outer-product accumulators) should be defined explicitly with dimension counts before the streaming identities are introduced.
  2. The manuscript would benefit from a small empirical verification (even on synthetic data) confirming that the scan-based implementation matches the serial recurrence to machine precision.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive review of our manuscript on Higher-order Linear Attention. The comments highlight important points regarding the presentation of our central claims. We will revise the manuscript to include explicit derivations, constructions, and verifications as requested, which will strengthen the exposition without altering the core technical contributions.

Point-by-point responses
  1. Referee: [Abstract] Abstract and the section stating the closed-form streaming identities: the central claim that second-order HLA maintains constant-size state while exactly reproducing the serial recurrence is presented without derivations, error bounds, or verification that all query-key-value cross terms remain incrementally maintainable under the causal mask. This is load-bearing for the linear-time constant-state guarantee.

    Authors: We agree that the abstract and the section on closed-form streaming identities would benefit from expanded derivations to make the constant-state and exact-reproduction properties fully transparent. In the revised manuscript we will add a dedicated subsection deriving the incremental update rules for the second-order sufficient statistics. These derivations will explicitly track all query-key-value cross terms (including the data-dependent mixing) and show that they remain incrementally maintainable with a fixed number of summary tensors under the causal mask. We will also include a short error-bound analysis confirming that the identities are exact (no approximation) when the associative scan is used. This addition directly addresses the load-bearing guarantee for linear-time, constant-state operation. revision: yes

  2. Referee: [Masked variant description] The paragraph describing the strictly causal masked variant: the assertion that two additional summaries suffice to preserve exact equivalence and the streaming property is stated without explicit construction or proof that no further cross-terms are required, leaving open the possibility that the constant-state property fails for some interactions.

    Authors: We acknowledge that the current description of the strictly causal masked variant is concise and would be improved by an explicit construction. In the revision we will insert a short proof sketch and the concrete tensor forms of the two additional summaries. The construction demonstrates that these two summaries capture all remaining cross-terms required by the causal mask while preserving both the streaming property and exact equivalence to the serial recurrence; no further state is needed. This will eliminate any ambiguity about whether the constant-state property holds for the masked case. revision: yes

Circularity Check

0 steps flagged

Derivation of HLA streaming identities is self-contained from sufficient statistics

Full rationale

The paper presents closed-form streaming identities derived directly from the definition of prefix sufficient statistics (sums, outer products, and cross terms) to maintain constant-size state for second-order interactions. These identities are shown to reproduce the serial recurrence exactly via associative scans, with an explicit masked variant using two additional summaries. No step reduces a claimed prediction or uniqueness result to a fitted parameter, self-citation, or ansatz imported from prior work by the same authors; the construction is parameter-free and mathematically derived from first principles of incremental statistics. The central claim of linear-time constant-state higher-order attention therefore stands on independent algebraic identities rather than circular redefinition of inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the existence of compact sufficient statistics that capture higher-order interactions exactly in a constant-size state; this is treated as a domain assumption rather than derived from prior results.

axioms (1)
  • domain assumption Higher-order interactions admit compact prefix sufficient statistics that enable exact causal streaming updates.
    Invoked to justify constant-size state and linear-time per-token computation for second-order case.

pith-pipeline@v0.9.0 · 5685 in / 1087 out tokens · 28825 ms · 2026-05-18T03:03:06.024190+00:00 · methodology
