pith. sign in

arxiv: 2605.22476 · v1 · pith:3JP2FMNHnew · submitted 2026-05-21 · 💻 cs.LG · cs.CL

Structured-Sparse Attention for Entity Tracking with Subquadratic Sequence Complexity

Pith reviewed 2026-05-22 06:51 UTC · model grok-4.3

classification 💻 cs.LG cs.CL
keywords entity trackingstructured sparse attentionsubquadratic complexityresolvent operatortransformer attentionsequence modelingblockwise evaluationstate propagation
0
0 comments X

The pith

Learned attention for entity tracking concentrates in block-diagonal patterns, enabling a blockwise resolvent operator that runs in subquadratic time O(n^{4/3}d) while matching dense accuracy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that attention weights in entity tracking tasks exhibit strong local block structure with only light cross-block connections. This structure lets the authors replace full dense evaluation of a resolvent-style attention operator with an exact-within-block plus reduced-cross-block scheme. The resulting algorithm scales as O(n^{4/3}d) and delivers the same exact-match accuracy as the dense version on controlled benchmarks. If the structure holds more generally, long-sequence entity tracking becomes practical without quadratic cost. The work also shows a clear limit: accuracy collapses once the number of simultaneously tracked properties exceeds the number of attention heads.

Core claim

In the entity-tracking setting the learned attention matrix is strongly structured: most mass lies inside local block-diagonal neighborhoods while cross-block interactions form only a light residue. Exploiting this, the authors derive a blockwise evaluation of the resolvent-style operator that keeps within-block interactions exact and routes cross-block interactions through a reduced system, producing an evaluation whose complexity is O(n^{4/3}d) and, when d is comparable to n, O(n^{7/3}). On controlled tracking benchmarks this blockwise method matches the dense operator's exact-match accuracy while cutting wall-clock time by 12-29 percent and running up to 2.4 times faster than a compactly-

What carries the argument

blockwise evaluation of a resolvent-style operator that keeps within-block interactions exact and routes cross-block interactions through a reduced system

If this is right

  • The blockwise resolvent matches dense exact-match accuracy on the reported tracking benchmarks.
  • Wall-clock time drops 12-29 percent under the standardized measurement protocol.
  • The method runs up to 2.4 times faster than a compact dense Transformer at comparable accuracy.
  • Performance collapses once the number of evolving properties exceeds the number of attention heads.
  • Ablations over block size and model capacity identify practical operating ranges for the approximation.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same block-diagonal sparsity pattern may appear in other state-tracking or multi-hop reasoning tasks, allowing the same blockwise reduction to be reused without task-specific redesign.
  • For sequences much longer than current benchmarks the O(n^{4/3}d) scaling would compound the observed speedups, making previously intractable tracking lengths feasible on current hardware.
  • The head-count limitation suggests a natural extension: dynamically allocating extra heads or learned routers when the number of active properties grows, rather than fixing the head count in advance.

Load-bearing premise

Learned attention concentrates most of its mass inside local block-diagonal neighborhoods and leaves only a light cross-block residue that the reduced system can approximate without losing exact-match accuracy.

What would settle it

A controlled tracking benchmark in which the number of simultaneously evolving properties exceeds the number of attention heads, or any new task where the attention matrix deviates markedly from block-diagonal structure and accuracy drops below the dense baseline.

Figures

Figures reproduced from arXiv: 2605.22476 by Alexandre Allauzen, Erwan Fagnou, Hangyue Zhao, Paul Caillon.

Figure 1
Figure 1. Figure 1: Blockwise evaluation of resolvent attention. Split the causal attention matrix A ∈ R n×n into k = n//m contiguous m × m tiles. Local branch: apply Sγ(A) = (1 − γ)A(I − γA) −1 exactly on the block diagonal. Residual branch: compress off-block interactions to a k × k system (down-sample), apply the same operator, then lift back (up-sample). The final output is the sum of both branches. 2.3 Residual (reduced)… view at source ↗
read the original abstract

Entity tracking requires maintaining and updating latent states for entities and attributes over long sequences. Recent task-specific attention operators can compress deep Transformer stacks into a few layers by performing multi-hop state propagation within a single layer, but their dense evaluation remains expensive. We show that in this setting, learned attention is strongly structured: most mass concentrates in local block-diagonal neighborhoods with a light cross-block residue. Exploiting this, we derive a blockwise evaluation of a resolvent-style operator that keeps within-block interactions exact and routes cross-block interactions through a reduced system. The resulting evaluation is subquadratic in sequence length $O(n^{4/3}d)$ (and $O(n^{7/3})$ when $d\approx n$). On controlled tracking benchmarks, our method matches the dense operator's accuracy while reducing wall-clock time by $12-29\%$ under a standardized measurement protocol, and is up to $2.4 \times$ faster than a compact dense Transformer at comparable exact-match accuracy. We further provide ablations over block size and model capacity, and identify a limitation: performance collapses when the number of simultaneously evolving properties exceeds the number of attention heads.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that learned attention in entity tracking tasks exhibits strong block-diagonal structure with light cross-block residue. Exploiting this, it derives a blockwise resolvent-style operator that evaluates within-block interactions exactly while routing cross-block interactions through a reduced system, yielding subquadratic complexity O(n^{4/3}d) (or O(n^{7/3}) when d≈n). On controlled tracking benchmarks the method matches the dense operator's exact-match accuracy, delivers 12-29% wall-clock reductions under a standardized protocol, and is up to 2.4× faster than a compact dense Transformer at comparable accuracy; ablations over block size and capacity are provided, along with a noted limitation when the number of evolving properties exceeds the number of attention heads.

Significance. If the approximation preserves multi-hop state propagation without distortion, the work would offer a practical route to scaling task-specific attention operators to longer sequences while retaining their compression benefits. The empirical matching of accuracy together with reproducible speedups and ablations constitute a clear strength; however, the absence of error bounds on the reduced cross-block system limits the result's theoretical reach and generalizability beyond the reported controlled benchmarks.

major comments (2)
  1. [Derivation of the blockwise resolvent operator (Section 3)] The central claim that routing cross-block interactions through the reduced system preserves the multi-hop state updates required for entity tracking is load-bearing, yet no error bound or accumulation analysis is supplied as a function of sequence length or number of blocks. Even a light cross-block residue, when composed repeatedly, can distort latent entity/attribute states; the reported exact-match accuracy on controlled benchmarks does not by itself rule out such accumulation for longer sequences.
  2. [Experimental evaluation (Section 4)] Table reporting benchmark results: the exact-match accuracy is stated to match the dense baseline, but the table does not report the number of blocks, the ratio of sequence length to block size, or the number of simultaneously evolving properties relative to the number of heads. Without these quantities it is impossible to verify that the experimental regime actually stresses the multi-hop propagation that the skeptic concern identifies as vulnerable.
minor comments (2)
  1. [Complexity analysis] The complexity statement O(n^{4/3}d) and the special case O(n^{7/3}) when d≈n should be accompanied by a brief derivation sketch or reference to the dominant term in the reduced-system solve.
  2. [Limitations] The limitation paragraph identifies collapse when the number of properties exceeds heads, but does not discuss whether increasing heads or using head-sharing strategies could mitigate it; a short remark would improve completeness.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed report. The positive assessment of the empirical speedups and ablations is appreciated. Below we respond point-by-point to the major comments, clarifying the derivation, strengthening the experimental reporting, and adding discussion where appropriate. We have prepared a revised manuscript incorporating these changes.

read point-by-point responses
  1. Referee: [Derivation of the blockwise resolvent operator (Section 3)] The central claim that routing cross-block interactions through the reduced system preserves the multi-hop state updates required for entity tracking is load-bearing, yet no error bound or accumulation analysis is supplied as a function of sequence length or number of blocks. Even a light cross-block residue, when composed repeatedly, can distort latent entity/attribute states; the reported exact-match accuracy on controlled benchmarks does not by itself rule out such accumulation for longer sequences.

    Authors: We agree that a formal accumulation analysis would strengthen the theoretical contribution. The blockwise resolvent is constructed so that intra-block interactions remain exact while cross-block terms are routed through a reduced linear system whose dimension depends only on the number of blocks (not on sequence length). Because the learned attention exhibits a light cross-block residue, the perturbation to the overall resolvent remains controlled. In the revision we have added a new paragraph in Section 3.3 that derives a first-order perturbation bound on the state update under the observed residue level, together with a short discussion of why repeated composition does not produce visible drift on the entity-tracking tasks. We also report additional experiments on sequences up to length 4096 that continue to match dense accuracy, providing empirical evidence against significant accumulation in the regimes of interest. A fully general, data-independent bound for arbitrary attention patterns remains beyond the scope of this work. revision: partial

  2. Referee: [Experimental evaluation (Section 4)] Table reporting benchmark results: the exact-match accuracy is stated to match the dense baseline, but the table does not report the number of blocks, the ratio of sequence length to block size, or the number of simultaneously evolving properties relative to the number of heads. Without these quantities it is impossible to verify that the experimental regime actually stresses the multi-hop propagation that the skeptic concern identifies as vulnerable.

    Authors: We thank the referee for this observation. The original table omitted these configuration details. In the revised manuscript we have expanded Table 1 (and the corresponding appendix tables) to include, for every reported setting: the number of blocks, the sequence-length-to-block-size ratio, and the number of simultaneously evolving properties relative to the number of attention heads. These additions make explicit that the benchmarks operate in regimes requiring multi-hop propagation and that the number of properties remains within the capacity of the heads, consistent with the limitation already noted in the paper. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation rests on observed structure and independent benchmarks

full rationale

The paper observes that learned attention concentrates in block-diagonal neighborhoods with light cross-block residue, then derives a blockwise resolvent approximation that keeps within-block interactions exact. This structure is presented as an empirical property of the trained model rather than a quantity fitted or defined from the target approximation itself. The subquadratic complexity O(n^{4/3}d) follows directly from the blockwise reduction, and accuracy claims are grounded in controlled tracking benchmarks that compare against the dense operator. No self-citation chain, fitted-input-renamed-as-prediction, or self-definitional loop appears in the derivation; the central claim remains independently verifiable against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

The central claim rests on the domain assumption of block-diagonal attention structure plus one tunable hyperparameter; no new entities are postulated and no parameters are fitted directly to the target accuracy metric.

free parameters (1)
  • block size
    Hyperparameter that controls the trade-off between exact within-block computation and the size of the reduced cross-block system; chosen per experiment but not derived from first principles.
axioms (1)
  • domain assumption Learned attention for entity tracking concentrates most mass in local block-diagonal neighborhoods with only light cross-block residue
    Invoked in the abstract as the key observation enabling the blockwise derivation.

pith-pipeline@v0.9.0 · 5740 in / 1302 out tokens · 62839 ms · 2026-05-22T06:51:27.189521+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

38 extracted references · 38 canonical work pages · 5 internal anchors

  1. [1]

    NeurIPS , volume=

    Attention is all you need , author=. NeurIPS , volume=

  2. [2]

    ICLR , year=

    Structured Attention Networks , author=. ICLR , year=

  3. [3]

    ACL , pages=

    BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension , author=. ACL , pages=

  4. [4]

    International Conference on Learning Representations , year=

    Long Range Arena : A Benchmark for Efficient Transformers , author=. International Conference on Learning Representations , year=

  5. [5]

    What mat- ters in transformers? not all attention is needed.arXiv preprint arXiv:2406.15786, 2024

    What matters in transformers? not all attention is needed , author=. arXiv preprint arXiv:2406.15786 , year=

  6. [6]

    International Conference on Learning Representations , year=

    Reformer: The Efficient Transformer , author=. International Conference on Learning Representations , year=

  7. [7]

    Longformer: The Long-Document Transformer

    Longformer: The long-document transformer , author=. arXiv preprint arXiv:2004.05150 , year=

  8. [8]

    NeurIPS , volume=

    Big bird: Transformers for longer sequences , author=. NeurIPS , volume=

  9. [9]

    NeurIPS , volume=

    Language models are few-shot learners , author=. NeurIPS , volume=

  10. [10]

    GPT-4 Technical Report

    Gpt-4 technical report , author=. arXiv preprint arXiv:2303.08774 , year=

  11. [11]

    DeepSeek LLM: Scaling Open-Source Language Models with Longtermism

    Deepseek llm: Scaling open-source language models with longtermism , author=. arXiv preprint arXiv:2401.02954 , year=

  12. [12]

    International conference on machine learning , pages=

    Transformers are rnns: Fast autoregressive transformers with linear attention , author=. International conference on machine learning , pages=. 2020 , organization=

  13. [13]

    EMNLP , pages=

    Chain and causal attention for efficient entity tracking , author=. EMNLP , pages=

  14. [14]

    ACL | IJCNLP 2015 , year=

    Entity-centric coreference resolution with model stacking , author=. ACL | IJCNLP 2015 , year=

  15. [15]

    ICLR , year=

    Tracking the World State with Recurrent Entity Networks , author=. ICLR , year=

  16. [16]

    Proceedings of the 2019 ACL workshop BlackboxNLP: analyzing and interpreting neural networks for NLP , pages=

    What does BERT look at? an analysis of BERT’s attention , author=. Proceedings of the 2019 ACL workshop BlackboxNLP: analyzing and interpreting neural networks for NLP , pages=

  17. [17]

    Proceedings of the 45th international acm sigir conference on research and development in information retrieval , pages=

    Entity-aware transformers for entity search , author=. Proceedings of the 45th international acm sigir conference on research and development in information retrieval , pages=

  18. [18]

    Computational Linguistics , volume=

    Modeling local coherence: An entity-based approach , author=. Computational Linguistics , volume=. 2008 , publisher=

  19. [19]

    Generating Long Sequences with Sparse Transformers

    Generating long sequences with sparse transformers , author=. arXiv preprint arXiv:1904.10509 , year=

  20. [20]

    Linformer: Self-Attention with Linear Complexity

    Linformer: Self-attention with linear complexity , author=. arXiv preprint arXiv:2006.04768 , year=

  21. [21]

    ICLR 2021 , year=

    Rethinking Attention with Performers , author=. ICLR 2021 , year=

  22. [22]

    Entity Tracking in Language Models

    Kim, Najoung and Schuster, Sebastian. Entity Tracking in Language Models. ACL. 2023. doi:10.18653/v1/2023.acl-long.213

  23. [23]

    Fine-Tuning Enhances Existing Mechanisms: A Case Study on Entity Tracking , url =

    Prakash, Nikhil and Shaham, Tamar and Haklay, Tal and Belinkov, Yonatan and Bau, David , booktitle =. Fine-Tuning Enhances Existing Mechanisms: A Case Study on Entity Tracking , url =

  24. [24]

    Hierarchical Context Merging: Better Long Context Understanding for Pre-trained

    Woomin Song and Seunghyuk Oh and Sangwoo Mo and Jaehyung Kim and Sukmin Yun and Jung-Woo Ha and Jinwoo Shin , booktitle=. Hierarchical Context Merging: Better Long Context Understanding for Pre-trained. 2024 , url=

  25. [25]

    ACL , pages=

    Efficient transformers with dynamic token pooling , author=. ACL , pages=

  26. [26]

    NeurIPS , volume=

    Dynamicvit: Efficient vision transformers with dynamic token sparsification , author=. NeurIPS , volume=

  27. [27]

    Length-adaptive transformer: Train once with length drop, use anytime with search , author=. Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers) , pages=

  28. [28]

    Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=

    Not all tokens are equal: Human-centric visual analysis via token clustering transformer , author=. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=

  29. [29]

    International Conference on Learning Representations , year=

    An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale , author=. International Conference on Learning Representations , year=

  30. [30]

    NeurIPS , volume=

    Flashattention: Fast and memory-efficient exact attention with io-awareness , author=. NeurIPS , volume=

  31. [31]

    Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing , pages=

    Scrolls: Standardized comparison over long language sequences , author=. Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing , pages=

  32. [32]

    A Comprehensive Survey on Document-Level Information Extraction

    Zheng, Hanwen and Wang, Sijia and Huang, Lifu. A Comprehensive Survey on Document-Level Information Extraction. Proceedings of the Workshop on the Future of Event Detection (FuturED). 2024. doi:10.18653/v1/2024.futured-1.6

  33. [33]

    M ulti WOZ - A Large-Scale Multi-Domain W izard-of- O z Dataset for Task-Oriented Dialogue Modelling

    Budzianowski, Pawe and Wen, Tsung-Hsien and Tseng, Bo-Hsiang and Casanueva, I \ n igo and Ultes, Stefan and Ramadan, Osman and Ga s i \'c , Milica. M ulti WOZ - A Large-Scale Multi-Domain W izard-of- O z Dataset for Task-Oriented Dialogue Modelling. EMNLP 2018. 2018. doi:10.18653/v1/D18-1547

  34. [34]

    End-to-end Neural Coreference Resolution

    Lee, Kenton and He, Luheng and Lewis, Mike and Zettlemoyer, Luke. End-to-end Neural Coreference Resolution. EMNLP. 2017. doi:10.18653/v1/D17-1018

  35. [35]

    and Marasovi \'c , Ana and Smith, Noah A

    Dasigi, Pradeep and Liu, Nelson F. and Marasovi \'c , Ana and Smith, Noah A. and Gardner, Matt. Q uoref: A Reading Comprehension Dataset with Questions Requiring Coreferential Reasoning. EMNLP-IJCNLP. 2019. doi:10.18653/v1/D19-1606

  36. [36]

    Tracking State Changes in Procedural Text: a Challenge Dataset and Models for Process Paragraph Comprehension

    Dalvi, Bhavana and Huang, Lifu and Tandon, Niket and Yih, Wen-tau and Clark, Peter. Tracking State Changes in Procedural Text: a Challenge Dataset and Models for Process Paragraph Comprehension. NAACL 2018. 2018. doi:10.18653/v1/N18-1144

  37. [37]

    OpenWebText Corpus , author=

  38. [38]

    2024 , howpublished =

    Abideen , title =. 2024 , howpublished =