Structured-Sparse Attention for Entity Tracking with Subquadratic Sequence Complexity

Alexandre Allauzen; Erwan Fagnou; Hangyue Zhao; Paul Caillon

arxiv: 2605.22476 · v1 · pith:3JP2FMNHnew · submitted 2026-05-21 · 💻 cs.LG · cs.CL

Structured-Sparse Attention for Entity Tracking with Subquadratic Sequence Complexity

Hangyue Zhao , Paul Caillon , Erwan Fagnou , Alexandre Allauzen This is my paper

Pith reviewed 2026-05-22 06:51 UTC · model grok-4.3

classification 💻 cs.LG cs.CL

keywords entity trackingstructured sparse attentionsubquadratic complexityresolvent operatortransformer attentionsequence modelingblockwise evaluationstate propagation

0 comments

The pith

Learned attention for entity tracking concentrates in block-diagonal patterns, enabling a blockwise resolvent operator that runs in subquadratic time O(n^{4/3}d) while matching dense accuracy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that attention weights in entity tracking tasks exhibit strong local block structure with only light cross-block connections. This structure lets the authors replace full dense evaluation of a resolvent-style attention operator with an exact-within-block plus reduced-cross-block scheme. The resulting algorithm scales as O(n^{4/3}d) and delivers the same exact-match accuracy as the dense version on controlled benchmarks. If the structure holds more generally, long-sequence entity tracking becomes practical without quadratic cost. The work also shows a clear limit: accuracy collapses once the number of simultaneously tracked properties exceeds the number of attention heads.

Core claim

In the entity-tracking setting the learned attention matrix is strongly structured: most mass lies inside local block-diagonal neighborhoods while cross-block interactions form only a light residue. Exploiting this, the authors derive a blockwise evaluation of the resolvent-style operator that keeps within-block interactions exact and routes cross-block interactions through a reduced system, producing an evaluation whose complexity is O(n^{4/3}d) and, when d is comparable to n, O(n^{7/3}). On controlled tracking benchmarks this blockwise method matches the dense operator's exact-match accuracy while cutting wall-clock time by 12-29 percent and running up to 2.4 times faster than a compactly-

What carries the argument

blockwise evaluation of a resolvent-style operator that keeps within-block interactions exact and routes cross-block interactions through a reduced system

If this is right

The blockwise resolvent matches dense exact-match accuracy on the reported tracking benchmarks.
Wall-clock time drops 12-29 percent under the standardized measurement protocol.
The method runs up to 2.4 times faster than a compact dense Transformer at comparable accuracy.
Performance collapses once the number of evolving properties exceeds the number of attention heads.
Ablations over block size and model capacity identify practical operating ranges for the approximation.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same block-diagonal sparsity pattern may appear in other state-tracking or multi-hop reasoning tasks, allowing the same blockwise reduction to be reused without task-specific redesign.
For sequences much longer than current benchmarks the O(n^{4/3}d) scaling would compound the observed speedups, making previously intractable tracking lengths feasible on current hardware.
The head-count limitation suggests a natural extension: dynamically allocating extra heads or learned routers when the number of active properties grows, rather than fixing the head count in advance.

Load-bearing premise

Learned attention concentrates most of its mass inside local block-diagonal neighborhoods and leaves only a light cross-block residue that the reduced system can approximate without losing exact-match accuracy.

What would settle it

A controlled tracking benchmark in which the number of simultaneously evolving properties exceeds the number of attention heads, or any new task where the attention matrix deviates markedly from block-diagonal structure and accuracy drops below the dense baseline.

Figures

Figures reproduced from arXiv: 2605.22476 by Alexandre Allauzen, Erwan Fagnou, Hangyue Zhao, Paul Caillon.

**Figure 1.** Figure 1: Blockwise evaluation of resolvent attention. Split the causal attention matrix A ∈ R n×n into k = n//m contiguous m × m tiles. Local branch: apply Sγ(A) = (1 − γ)A(I − γA) −1 exactly on the block diagonal. Residual branch: compress off-block interactions to a k × k system (down-sample), apply the same operator, then lift back (up-sample). The final output is the sum of both branches. 2.3 Residual (reduced)… view at source ↗

read the original abstract

Entity tracking requires maintaining and updating latent states for entities and attributes over long sequences. Recent task-specific attention operators can compress deep Transformer stacks into a few layers by performing multi-hop state propagation within a single layer, but their dense evaluation remains expensive. We show that in this setting, learned attention is strongly structured: most mass concentrates in local block-diagonal neighborhoods with a light cross-block residue. Exploiting this, we derive a blockwise evaluation of a resolvent-style operator that keeps within-block interactions exact and routes cross-block interactions through a reduced system. The resulting evaluation is subquadratic in sequence length $O(n^{4/3}d)$ (and $O(n^{7/3})$ when $d\approx n$). On controlled tracking benchmarks, our method matches the dense operator's accuracy while reducing wall-clock time by $12-29\%$ under a standardized measurement protocol, and is up to $2.4 \times$ faster than a compact dense Transformer at comparable exact-match accuracy. We further provide ablations over block size and model capacity, and identify a limitation: performance collapses when the number of simultaneously evolving properties exceeds the number of attention heads.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper gives a workable blockwise resolvent trick for subquadratic entity tracking attention that holds accuracy on their benchmarks, but the lack of error bounds leaves the long-sequence robustness open.

read the letter

The main point is that this work shows how to evaluate a resolvent-style attention operator in blocks for entity tracking: exact inside each block and routed through a smaller system for the cross-block parts. That yields the claimed O(n^{4/3}d) scaling and the reported 12-29% wall-clock savings while matching dense accuracy on the controlled tracking tasks. They also note it can be up to 2.4 times faster than a compact dense Transformer at similar exact-match levels, and they include ablations on block size plus a clear statement of the limitation when properties outnumber heads. That combination of a task-specific sparsity observation and the concrete speed/accuracy numbers is what is new here. The approach is scoped to entity tracking rather than general sequences, which keeps the claims grounded. The soft spot is the one the stress-test flags. The method depends on the cross-block residue staying light enough that the reduced system still carries the necessary multi-hop state updates. The abstract gives no derivation details, no error bounds in terms of sequence length or block count, and no separate checks on whether approximation errors grow with more blocks. Their benchmarks show matching accuracy, so the approximation holds in the tested regime, but without those bounds it is hard to know how far the gains extend to longer sequences or more complex tracking. This is the kind of paper that would interest people working on efficient attention for structured long-sequence tasks in NLP or agent modeling. A reader already thinking about sparse or block-structured operators would get direct value from the technique and the ablations. It has enough empirical grounding and a distinct technical step to deserve a serious referee, though the review should press for the missing error analysis and longer-sequence tests. I would send it to review with those requests rather than desk reject.

Referee Report

2 major / 2 minor

Summary. The paper claims that learned attention in entity tracking tasks exhibits strong block-diagonal structure with light cross-block residue. Exploiting this, it derives a blockwise resolvent-style operator that evaluates within-block interactions exactly while routing cross-block interactions through a reduced system, yielding subquadratic complexity O(n^{4/3}d) (or O(n^{7/3}) when d≈n). On controlled tracking benchmarks the method matches the dense operator's exact-match accuracy, delivers 12-29% wall-clock reductions under a standardized protocol, and is up to 2.4× faster than a compact dense Transformer at comparable accuracy; ablations over block size and capacity are provided, along with a noted limitation when the number of evolving properties exceeds the number of attention heads.

Significance. If the approximation preserves multi-hop state propagation without distortion, the work would offer a practical route to scaling task-specific attention operators to longer sequences while retaining their compression benefits. The empirical matching of accuracy together with reproducible speedups and ablations constitute a clear strength; however, the absence of error bounds on the reduced cross-block system limits the result's theoretical reach and generalizability beyond the reported controlled benchmarks.

major comments (2)

[Derivation of the blockwise resolvent operator (Section 3)] The central claim that routing cross-block interactions through the reduced system preserves the multi-hop state updates required for entity tracking is load-bearing, yet no error bound or accumulation analysis is supplied as a function of sequence length or number of blocks. Even a light cross-block residue, when composed repeatedly, can distort latent entity/attribute states; the reported exact-match accuracy on controlled benchmarks does not by itself rule out such accumulation for longer sequences.
[Experimental evaluation (Section 4)] Table reporting benchmark results: the exact-match accuracy is stated to match the dense baseline, but the table does not report the number of blocks, the ratio of sequence length to block size, or the number of simultaneously evolving properties relative to the number of heads. Without these quantities it is impossible to verify that the experimental regime actually stresses the multi-hop propagation that the skeptic concern identifies as vulnerable.

minor comments (2)

[Complexity analysis] The complexity statement O(n^{4/3}d) and the special case O(n^{7/3}) when d≈n should be accompanied by a brief derivation sketch or reference to the dominant term in the reduced-system solve.
[Limitations] The limitation paragraph identifies collapse when the number of properties exceeds heads, but does not discuss whether increasing heads or using head-sharing strategies could mitigate it; a short remark would improve completeness.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed report. The positive assessment of the empirical speedups and ablations is appreciated. Below we respond point-by-point to the major comments, clarifying the derivation, strengthening the experimental reporting, and adding discussion where appropriate. We have prepared a revised manuscript incorporating these changes.

read point-by-point responses

Referee: [Derivation of the blockwise resolvent operator (Section 3)] The central claim that routing cross-block interactions through the reduced system preserves the multi-hop state updates required for entity tracking is load-bearing, yet no error bound or accumulation analysis is supplied as a function of sequence length or number of blocks. Even a light cross-block residue, when composed repeatedly, can distort latent entity/attribute states; the reported exact-match accuracy on controlled benchmarks does not by itself rule out such accumulation for longer sequences.

Authors: We agree that a formal accumulation analysis would strengthen the theoretical contribution. The blockwise resolvent is constructed so that intra-block interactions remain exact while cross-block terms are routed through a reduced linear system whose dimension depends only on the number of blocks (not on sequence length). Because the learned attention exhibits a light cross-block residue, the perturbation to the overall resolvent remains controlled. In the revision we have added a new paragraph in Section 3.3 that derives a first-order perturbation bound on the state update under the observed residue level, together with a short discussion of why repeated composition does not produce visible drift on the entity-tracking tasks. We also report additional experiments on sequences up to length 4096 that continue to match dense accuracy, providing empirical evidence against significant accumulation in the regimes of interest. A fully general, data-independent bound for arbitrary attention patterns remains beyond the scope of this work. revision: partial
Referee: [Experimental evaluation (Section 4)] Table reporting benchmark results: the exact-match accuracy is stated to match the dense baseline, but the table does not report the number of blocks, the ratio of sequence length to block size, or the number of simultaneously evolving properties relative to the number of heads. Without these quantities it is impossible to verify that the experimental regime actually stresses the multi-hop propagation that the skeptic concern identifies as vulnerable.

Authors: We thank the referee for this observation. The original table omitted these configuration details. In the revised manuscript we have expanded Table 1 (and the corresponding appendix tables) to include, for every reported setting: the number of blocks, the sequence-length-to-block-size ratio, and the number of simultaneously evolving properties relative to the number of attention heads. These additions make explicit that the benchmarks operate in regimes requiring multi-hop propagation and that the number of properties remains within the capacity of the heads, consistent with the limitation already noted in the paper. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation rests on observed structure and independent benchmarks

full rationale

The paper observes that learned attention concentrates in block-diagonal neighborhoods with light cross-block residue, then derives a blockwise resolvent approximation that keeps within-block interactions exact. This structure is presented as an empirical property of the trained model rather than a quantity fitted or defined from the target approximation itself. The subquadratic complexity O(n^{4/3}d) follows directly from the blockwise reduction, and accuracy claims are grounded in controlled tracking benchmarks that compare against the dense operator. No self-citation chain, fitted-input-renamed-as-prediction, or self-definitional loop appears in the derivation; the central claim remains independently verifiable against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

The central claim rests on the domain assumption of block-diagonal attention structure plus one tunable hyperparameter; no new entities are postulated and no parameters are fitted directly to the target accuracy metric.

free parameters (1)

block size
Hyperparameter that controls the trade-off between exact within-block computation and the size of the reduced cross-block system; chosen per experiment but not derived from first principles.

axioms (1)

domain assumption Learned attention for entity tracking concentrates most mass in local block-diagonal neighborhoods with only light cross-block residue
Invoked in the abstract as the key observation enabling the blockwise derivation.

pith-pipeline@v0.9.0 · 5740 in / 1302 out tokens · 62839 ms · 2026-05-22T06:51:27.189521+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

We derive a blockwise evaluation of a resolvent-style operator that keeps within-block interactions exact and routes cross-block interactions through a reduced system. The resulting evaluation is subquadratic in sequence length O(n^{4/3}d)
IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

Using the Neumann series (valid for γ∈[0,1) and row-stochastic A), (I−γA)^{-1} = ∑_{t≥0} γ^t A^t, S_γ(A) aggregates A, A^2, A^3, … (multi-hop propagation) in a single layer.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

38 extracted references · 38 canonical work pages · 5 internal anchors

[1]

NeurIPS , volume=

Attention is all you need , author=. NeurIPS , volume=

work page
[2]

ICLR , year=

Structured Attention Networks , author=. ICLR , year=

work page
[3]

ACL , pages=

BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension , author=. ACL , pages=

work page
[4]

International Conference on Learning Representations , year=

Long Range Arena : A Benchmark for Efficient Transformers , author=. International Conference on Learning Representations , year=

work page
[5]

What mat- ters in transformers? not all attention is needed.arXiv preprint arXiv:2406.15786, 2024

What matters in transformers? not all attention is needed , author=. arXiv preprint arXiv:2406.15786 , year=

work page arXiv
[6]

International Conference on Learning Representations , year=

Reformer: The Efficient Transformer , author=. International Conference on Learning Representations , year=

work page
[7]

Longformer: The Long-Document Transformer

Longformer: The long-document transformer , author=. arXiv preprint arXiv:2004.05150 , year=

work page internal anchor Pith review Pith/arXiv arXiv 2004
[8]

NeurIPS , volume=

Big bird: Transformers for longer sequences , author=. NeurIPS , volume=

work page
[9]

NeurIPS , volume=

Language models are few-shot learners , author=. NeurIPS , volume=

work page
[10]

GPT-4 Technical Report

Gpt-4 technical report , author=. arXiv preprint arXiv:2303.08774 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[11]

DeepSeek LLM: Scaling Open-Source Language Models with Longtermism

Deepseek llm: Scaling open-source language models with longtermism , author=. arXiv preprint arXiv:2401.02954 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[12]

International conference on machine learning , pages=

Transformers are rnns: Fast autoregressive transformers with linear attention , author=. International conference on machine learning , pages=. 2020 , organization=

work page 2020
[13]

EMNLP , pages=

Chain and causal attention for efficient entity tracking , author=. EMNLP , pages=

work page
[14]

ACL | IJCNLP 2015 , year=

Entity-centric coreference resolution with model stacking , author=. ACL | IJCNLP 2015 , year=

work page 2015
[15]

ICLR , year=

Tracking the World State with Recurrent Entity Networks , author=. ICLR , year=

work page
[16]

Proceedings of the 2019 ACL workshop BlackboxNLP: analyzing and interpreting neural networks for NLP , pages=

What does BERT look at? an analysis of BERT’s attention , author=. Proceedings of the 2019 ACL workshop BlackboxNLP: analyzing and interpreting neural networks for NLP , pages=

work page 2019
[17]

Proceedings of the 45th international acm sigir conference on research and development in information retrieval , pages=

Entity-aware transformers for entity search , author=. Proceedings of the 45th international acm sigir conference on research and development in information retrieval , pages=

work page
[18]

Computational Linguistics , volume=

Modeling local coherence: An entity-based approach , author=. Computational Linguistics , volume=. 2008 , publisher=

work page 2008
[19]

Generating Long Sequences with Sparse Transformers

Generating long sequences with sparse transformers , author=. arXiv preprint arXiv:1904.10509 , year=

work page internal anchor Pith review Pith/arXiv arXiv 1904
[20]

Linformer: Self-Attention with Linear Complexity

Linformer: Self-attention with linear complexity , author=. arXiv preprint arXiv:2006.04768 , year=

work page internal anchor Pith review Pith/arXiv arXiv 2006
[21]

ICLR 2021 , year=

Rethinking Attention with Performers , author=. ICLR 2021 , year=

work page 2021
[22]

Entity Tracking in Language Models

Kim, Najoung and Schuster, Sebastian. Entity Tracking in Language Models. ACL. 2023. doi:10.18653/v1/2023.acl-long.213

work page doi:10.18653/v1/2023.acl-long.213 2023
[23]

Fine-Tuning Enhances Existing Mechanisms: A Case Study on Entity Tracking , url =

Prakash, Nikhil and Shaham, Tamar and Haklay, Tal and Belinkov, Yonatan and Bau, David , booktitle =. Fine-Tuning Enhances Existing Mechanisms: A Case Study on Entity Tracking , url =

work page
[24]

Hierarchical Context Merging: Better Long Context Understanding for Pre-trained

Woomin Song and Seunghyuk Oh and Sangwoo Mo and Jaehyung Kim and Sukmin Yun and Jung-Woo Ha and Jinwoo Shin , booktitle=. Hierarchical Context Merging: Better Long Context Understanding for Pre-trained. 2024 , url=

work page 2024
[25]

ACL , pages=

Efficient transformers with dynamic token pooling , author=. ACL , pages=

work page
[26]

NeurIPS , volume=

Dynamicvit: Efficient vision transformers with dynamic token sparsification , author=. NeurIPS , volume=

work page
[27]

Length-adaptive transformer: Train once with length drop, use anytime with search , author=. Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers) , pages=

work page
[28]

Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=

Not all tokens are equal: Human-centric visual analysis via token clustering transformer , author=. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=

work page
[29]

International Conference on Learning Representations , year=

An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale , author=. International Conference on Learning Representations , year=

work page
[30]

NeurIPS , volume=

Flashattention: Fast and memory-efficient exact attention with io-awareness , author=. NeurIPS , volume=

work page
[31]

Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing , pages=

Scrolls: Standardized comparison over long language sequences , author=. Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing , pages=

work page 2022
[32]

A Comprehensive Survey on Document-Level Information Extraction

Zheng, Hanwen and Wang, Sijia and Huang, Lifu. A Comprehensive Survey on Document-Level Information Extraction. Proceedings of the Workshop on the Future of Event Detection (FuturED). 2024. doi:10.18653/v1/2024.futured-1.6

work page doi:10.18653/v1/2024.futured-1.6 2024
[33]

M ulti WOZ - A Large-Scale Multi-Domain W izard-of- O z Dataset for Task-Oriented Dialogue Modelling

Budzianowski, Pawe and Wen, Tsung-Hsien and Tseng, Bo-Hsiang and Casanueva, I \ n igo and Ultes, Stefan and Ramadan, Osman and Ga s i \'c , Milica. M ulti WOZ - A Large-Scale Multi-Domain W izard-of- O z Dataset for Task-Oriented Dialogue Modelling. EMNLP 2018. 2018. doi:10.18653/v1/D18-1547

work page doi:10.18653/v1/d18-1547 2018
[34]

End-to-end Neural Coreference Resolution

Lee, Kenton and He, Luheng and Lewis, Mike and Zettlemoyer, Luke. End-to-end Neural Coreference Resolution. EMNLP. 2017. doi:10.18653/v1/D17-1018

work page doi:10.18653/v1/d17-1018 2017
[35]

and Marasovi \'c , Ana and Smith, Noah A

Dasigi, Pradeep and Liu, Nelson F. and Marasovi \'c , Ana and Smith, Noah A. and Gardner, Matt. Q uoref: A Reading Comprehension Dataset with Questions Requiring Coreferential Reasoning. EMNLP-IJCNLP. 2019. doi:10.18653/v1/D19-1606

work page doi:10.18653/v1/d19-1606 2019
[36]

Tracking State Changes in Procedural Text: a Challenge Dataset and Models for Process Paragraph Comprehension

Dalvi, Bhavana and Huang, Lifu and Tandon, Niket and Yih, Wen-tau and Clark, Peter. Tracking State Changes in Procedural Text: a Challenge Dataset and Models for Process Paragraph Comprehension. NAACL 2018. 2018. doi:10.18653/v1/N18-1144

work page doi:10.18653/v1/n18-1144 2018
[37]

OpenWebText Corpus , author=

work page
[38]

2024 , howpublished =

Abideen , title =. 2024 , howpublished =

work page 2024

[1] [1]

NeurIPS , volume=

Attention is all you need , author=. NeurIPS , volume=

work page

[2] [2]

ICLR , year=

Structured Attention Networks , author=. ICLR , year=

work page

[3] [3]

ACL , pages=

BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension , author=. ACL , pages=

work page

[4] [4]

International Conference on Learning Representations , year=

Long Range Arena : A Benchmark for Efficient Transformers , author=. International Conference on Learning Representations , year=

work page

[5] [5]

What mat- ters in transformers? not all attention is needed.arXiv preprint arXiv:2406.15786, 2024

What matters in transformers? not all attention is needed , author=. arXiv preprint arXiv:2406.15786 , year=

work page arXiv

[6] [6]

International Conference on Learning Representations , year=

Reformer: The Efficient Transformer , author=. International Conference on Learning Representations , year=

work page

[7] [7]

Longformer: The Long-Document Transformer

Longformer: The long-document transformer , author=. arXiv preprint arXiv:2004.05150 , year=

work page internal anchor Pith review Pith/arXiv arXiv 2004

[8] [8]

NeurIPS , volume=

Big bird: Transformers for longer sequences , author=. NeurIPS , volume=

work page

[9] [9]

NeurIPS , volume=

Language models are few-shot learners , author=. NeurIPS , volume=

work page

[10] [10]

GPT-4 Technical Report

Gpt-4 technical report , author=. arXiv preprint arXiv:2303.08774 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[11] [11]

DeepSeek LLM: Scaling Open-Source Language Models with Longtermism

Deepseek llm: Scaling open-source language models with longtermism , author=. arXiv preprint arXiv:2401.02954 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[12] [12]

International conference on machine learning , pages=

Transformers are rnns: Fast autoregressive transformers with linear attention , author=. International conference on machine learning , pages=. 2020 , organization=

work page 2020

[13] [13]

EMNLP , pages=

Chain and causal attention for efficient entity tracking , author=. EMNLP , pages=

work page

[14] [14]

ACL | IJCNLP 2015 , year=

Entity-centric coreference resolution with model stacking , author=. ACL | IJCNLP 2015 , year=

work page 2015

[15] [15]

ICLR , year=

Tracking the World State with Recurrent Entity Networks , author=. ICLR , year=

work page

[16] [16]

Proceedings of the 2019 ACL workshop BlackboxNLP: analyzing and interpreting neural networks for NLP , pages=

What does BERT look at? an analysis of BERT’s attention , author=. Proceedings of the 2019 ACL workshop BlackboxNLP: analyzing and interpreting neural networks for NLP , pages=

work page 2019

[17] [17]

Proceedings of the 45th international acm sigir conference on research and development in information retrieval , pages=

Entity-aware transformers for entity search , author=. Proceedings of the 45th international acm sigir conference on research and development in information retrieval , pages=

work page

[18] [18]

Computational Linguistics , volume=

Modeling local coherence: An entity-based approach , author=. Computational Linguistics , volume=. 2008 , publisher=

work page 2008

[19] [19]

Generating Long Sequences with Sparse Transformers

Generating long sequences with sparse transformers , author=. arXiv preprint arXiv:1904.10509 , year=

work page internal anchor Pith review Pith/arXiv arXiv 1904

[20] [20]

Linformer: Self-Attention with Linear Complexity

Linformer: Self-attention with linear complexity , author=. arXiv preprint arXiv:2006.04768 , year=

work page internal anchor Pith review Pith/arXiv arXiv 2006

[21] [21]

ICLR 2021 , year=

Rethinking Attention with Performers , author=. ICLR 2021 , year=

work page 2021

[22] [22]

Entity Tracking in Language Models

Kim, Najoung and Schuster, Sebastian. Entity Tracking in Language Models. ACL. 2023. doi:10.18653/v1/2023.acl-long.213

work page doi:10.18653/v1/2023.acl-long.213 2023

[23] [23]

Fine-Tuning Enhances Existing Mechanisms: A Case Study on Entity Tracking , url =

Prakash, Nikhil and Shaham, Tamar and Haklay, Tal and Belinkov, Yonatan and Bau, David , booktitle =. Fine-Tuning Enhances Existing Mechanisms: A Case Study on Entity Tracking , url =

work page

[24] [24]

Hierarchical Context Merging: Better Long Context Understanding for Pre-trained

Woomin Song and Seunghyuk Oh and Sangwoo Mo and Jaehyung Kim and Sukmin Yun and Jung-Woo Ha and Jinwoo Shin , booktitle=. Hierarchical Context Merging: Better Long Context Understanding for Pre-trained. 2024 , url=

work page 2024

[25] [25]

ACL , pages=

Efficient transformers with dynamic token pooling , author=. ACL , pages=

work page

[26] [26]

NeurIPS , volume=

Dynamicvit: Efficient vision transformers with dynamic token sparsification , author=. NeurIPS , volume=

work page

[27] [27]

Length-adaptive transformer: Train once with length drop, use anytime with search , author=. Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers) , pages=

work page

[28] [28]

Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=

Not all tokens are equal: Human-centric visual analysis via token clustering transformer , author=. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=

work page

[29] [29]

International Conference on Learning Representations , year=

An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale , author=. International Conference on Learning Representations , year=

work page

[30] [30]

NeurIPS , volume=

Flashattention: Fast and memory-efficient exact attention with io-awareness , author=. NeurIPS , volume=

work page

[31] [31]

Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing , pages=

Scrolls: Standardized comparison over long language sequences , author=. Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing , pages=

work page 2022

[32] [32]

A Comprehensive Survey on Document-Level Information Extraction

Zheng, Hanwen and Wang, Sijia and Huang, Lifu. A Comprehensive Survey on Document-Level Information Extraction. Proceedings of the Workshop on the Future of Event Detection (FuturED). 2024. doi:10.18653/v1/2024.futured-1.6

work page doi:10.18653/v1/2024.futured-1.6 2024

[33] [33]

M ulti WOZ - A Large-Scale Multi-Domain W izard-of- O z Dataset for Task-Oriented Dialogue Modelling

Budzianowski, Pawe and Wen, Tsung-Hsien and Tseng, Bo-Hsiang and Casanueva, I \ n igo and Ultes, Stefan and Ramadan, Osman and Ga s i \'c , Milica. M ulti WOZ - A Large-Scale Multi-Domain W izard-of- O z Dataset for Task-Oriented Dialogue Modelling. EMNLP 2018. 2018. doi:10.18653/v1/D18-1547

work page doi:10.18653/v1/d18-1547 2018

[34] [34]

End-to-end Neural Coreference Resolution

Lee, Kenton and He, Luheng and Lewis, Mike and Zettlemoyer, Luke. End-to-end Neural Coreference Resolution. EMNLP. 2017. doi:10.18653/v1/D17-1018

work page doi:10.18653/v1/d17-1018 2017

[35] [35]

and Marasovi \'c , Ana and Smith, Noah A

Dasigi, Pradeep and Liu, Nelson F. and Marasovi \'c , Ana and Smith, Noah A. and Gardner, Matt. Q uoref: A Reading Comprehension Dataset with Questions Requiring Coreferential Reasoning. EMNLP-IJCNLP. 2019. doi:10.18653/v1/D19-1606

work page doi:10.18653/v1/d19-1606 2019

[36] [36]

Tracking State Changes in Procedural Text: a Challenge Dataset and Models for Process Paragraph Comprehension

Dalvi, Bhavana and Huang, Lifu and Tandon, Niket and Yih, Wen-tau and Clark, Peter. Tracking State Changes in Procedural Text: a Challenge Dataset and Models for Process Paragraph Comprehension. NAACL 2018. 2018. doi:10.18653/v1/N18-1144

work page doi:10.18653/v1/n18-1144 2018

[37] [37]

OpenWebText Corpus , author=

work page

[38] [38]

2024 , howpublished =

Abideen , title =. 2024 , howpublished =

work page 2024