Characterizing the Expressivity of Local Attention in Transformers

Jiaoda Li; Ryan Cotterell

arxiv: 2605.00768 · v2 · pith:UL5I747Mnew · submitted 2026-05-01 · 💻 cs.CL

Characterizing the Expressivity of Local Attention in Transformers

Jiaoda Li , Ryan Cotterell This is my paper

Pith reviewed 2026-05-20 23:57 UTC · model grok-4.3

classification 💻 cs.CL

keywords local attentiontransformer expressivitylinear temporal logicregular languagesattention mechanismslanguage recognitionglobal attention

0 comments

The pith

Local attention introduces a second temporal operator to transformers, strictly enlarging the class of recognizable regular languages.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Fixed-precision transformers with global attention match a fragment of linear temporal logic that contains only a single past operator. Adding local attention supplies a second distinct temporal operator, which lets the model recognize a strictly larger set of regular languages. Global attention and local attention turn out to be complementary: each can express properties the other cannot, and models that combine both reach the richest fragment. This logical difference supplies a formal reason why local attention sometimes improves quality even when quadratic cost is not the bottleneck. Experiments on formal-language recognition and natural-language modeling confirm that hybrid global-local models outperform global-only baselines.

Core claim

Fixed-precision transformers with global attention correspond to a fragment of linear temporal logic containing a single past operator. Adding local attention introduces a second temporal operator, strictly enlarging the class of recognizable regular languages. Global and local attention are expressively complementary: neither subsumes the other, and combining them yields the richest fragment.

What carries the argument

The correspondence between fixed-precision transformer attention and fragments of linear temporal logic, where global attention supplies one past operator and local attention supplies a distinct second operator.

If this is right

Hybrid global-local transformers recognize a strictly larger set of regular languages than either mechanism alone.
Global attention and local attention each capture temporal properties the other misses.
Models using both attention types achieve the most expressive fragment among the architectures compared.
Observed quality gains from local attention may arise from this added expressivity rather than from efficiency alone.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Attention-type choices in new architectures could be guided by the specific temporal-logic fragment a downstream task requires.
Analogous logical characterizations might be derived for other restricted attention patterns such as sliding windows of fixed radius.
Targeted tests on formal languages known to need two distinct past operators could map the exact contribution of local attention.

Load-bearing premise

That fixed-precision global-attention transformers can be exactly characterized as a fragment of linear temporal logic with precisely one past operator.

What would settle it

A regular language that requires two temporal operators yet is recognized by a global-only transformer (or vice versa), which would violate the claimed complementarity.

Figures

Figures reproduced from arXiv: 2605.00768 by Jiaoda Li, Ryan Cotterell.

**Figure 1.** Figure 1: Forbidden configuration in the minimal DFAs of view at source ↗

**Figure 2.** Figure 2: Minimal DFAs. Nodes represent states and arrows view at source ↗

**Figure 3.** Figure 3: Heatmap of longest perfect lengths (maximum over runs) across formal languages, attention patterns, and positional view at source ↗

**Figure 4.** Figure 4: Perplexity on WikiText-2 for local, hybrid, and global attention patterns under different positional encodings. Curves view at source ↗

read the original abstract

The transformer is the most popular neural architecture for language modeling. The cornerstone of the transformer is its global attention mechanism, which lets the model aggregate information from all preceding tokens before generating the next token. One common variant of attention is called local attention, which restricts each token to aggregating information from a bounded window of predecessors, reducing the quadratic cost of global attention to linear. Although this restriction is usually motivated by efficiency, it has also been found to improve model quality, a phenomenon that has so far lacked a satisfactory explanation. We provide a formal account of this phenomenon in terms of recognizer expressivity. It has been shown that fixed-precision transformers with global attention correspond to a fragment of linear temporal logic containing a single past operator. We additionally prove that adding local attention introduces a second temporal operator, strictly enlarging the class of recognizable regular languages. Moreover, global and local attention are expressively complementary: neither subsumes the other, and combining them yields the richest fragment. Experiments on formal language recognition and natural language modeling corroborate the theory, showing that hybrid global--local transformers outperform their global-only counterparts.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Local attention adds a second LTL past operator and is complementary to global, with the proofs and experiments providing a formal account of its practical benefits.

read the letter

The main thing to know is that this paper proves local attention corresponds to an LTL fragment with two past operators, strictly larger than the single-operator fragment for global attention, and that the two are expressively complementary. Their combination gives the richest class of recognizable regular languages. This supplies a theoretical reason for why local windows sometimes improve quality even though they restrict the model. The work extends the existing global-attention characterization cleanly and includes experiments on formal language recognition plus natural language modeling that show hybrid models outperforming global-only ones. Those experiments line up with the theory and give it some empirical grounding. The central modeling step is the claim that a fixed-size window exactly implements one additional temporal operator without leftover power from softmax or value aggregation. If that reduction is tight in both directions, the strict enlargement and complementarity results follow; any looseness would require qualification. The abstract states the claims directly, but the full derivations need checking for precision on that point. This paper is for people who work on logical characterizations of transformers or who want a formal handle on attention variants. A reader already familiar with the global-attention LTL result will get the most out of it. The extension is new and the evidence is formal plus experimental, so it deserves a serious referee even if the proofs need tightening in places. I would send it to peer review.

Referee Report

2 major / 2 minor

Summary. The manuscript claims that fixed-precision transformers with global attention correspond to a fragment of linear temporal logic containing a single past operator (extending prior results). It proves that adding local attention introduces a second temporal operator, strictly enlarging the class of recognizable regular languages. Global and local attention are expressively complementary—neither subsumes the other—and their combination yields the richest fragment. These results are supported by formal proofs and experiments on formal language recognition and natural language modeling.

Significance. If the central correspondences hold, the work supplies a formal explanation for why local attention can improve model quality beyond efficiency gains, by linking it to increased expressivity over regular languages. It extends existing logical characterizations of global attention and provides a basis for understanding hybrid architectures. The explicit proofs of enlargement and complementarity, together with experimental corroboration on both synthetic and natural language tasks, constitute the main strengths.

major comments (2)

[§3.3] §3.3, the reduction establishing the local-attention to second-past-operator correspondence: the argument that a fixed-size window plus attention aggregation implements precisely one additional LTL operator (without extra power from softmax or value projection) is load-bearing for the strict-enlargement claim. The derivation must explicitly rule out simulation of additional temporal patterns via repeated local steps or residual global-like behavior within the window.
[§4.2] §4.2, Theorem 3 on complementarity: the separating languages (one recognizable by global but not local, and vice versa) must be constructed and verified in detail to confirm they remain outside the other fragment even after accounting for multiple layers or varying window sizes; without this, the 'neither subsumes the other' claim is not fully secured.

minor comments (2)

[§2] Notation for the LTL fragments (single vs. two past operators) should be aligned more explicitly with the cited prior work on global attention to facilitate direct comparison.
[Table 2] Experimental tables (e.g., Table 2) would benefit from reporting variance or multiple random seeds to make the corroboration of the theory more robust.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their careful reading and constructive comments on our manuscript. We address each major comment below, clarifying the relevant proofs and indicating the revisions we will make to strengthen the presentation.

read point-by-point responses

Referee: [§3.3] §3.3, the reduction establishing the local-attention to second-past-operator correspondence: the argument that a fixed-size window plus attention aggregation implements precisely one additional LTL operator (without extra power from softmax or value projection) is load-bearing for the strict-enlargement claim. The derivation must explicitly rule out simulation of additional temporal patterns via repeated local steps or residual global-like behavior within the window.

Authors: We agree that this correspondence is central to the strict-enlargement result. Section 3.3 proves both directions of the equivalence: local attention with a fixed window recognizes exactly the languages expressible by adding one additional past operator to the single-operator fragment, and conversely. The multi-layer case is handled by showing that compositions of local steps remain within the two-operator LTL fragment, because each local aggregation is strictly bounded by the window and cannot encode unbounded or independent temporal relations. The proof also establishes that value projections and softmax within the local mask cannot simulate global behavior, as the attention scores are confined to the window. We will add an explicit remark and supporting lemma in the revised Section 3.3 to rule out these possibilities more directly. revision: yes
Referee: [§4.2] §4.2, Theorem 3 on complementarity: the separating languages (one recognizable by global but not local, and vice versa) must be constructed and verified in detail to confirm they remain outside the other fragment even after accounting for multiple layers or varying window sizes; without this, the 'neither subsumes the other' claim is not fully secured.

Authors: The separating languages and their verification appear in the proof of Theorem 3. The global-only language requires arbitrary-distance dependencies that no fixed window (regardless of size or number of layers) can capture. The local-only language requires isolation of strictly local patterns that global attention cannot enforce. The argument already accounts for multiple layers by showing that the global fragment cannot simulate the local one even with arbitrary depth, and vice versa. To address the request for expanded detail, we will include additional verification steps and explicit checks against multi-layer and variable-window simulations in the revised main text or appendix. revision: partial

Circularity Check

0 steps flagged

No circularity: formal proof extends cited prior characterization independently

full rationale

The derivation relies on a cited prior result establishing that fixed-precision global attention corresponds to a single-past-operator fragment of LTL, then provides new proofs that local attention introduces a second operator and that the fragments are complementary. These are independent formal arguments within the logical framework rather than reductions by construction, self-definitions, or fitted inputs. No equations or claims in the paper equate a derived expressivity result directly to its modeling assumptions or to self-citations in a load-bearing way. The work is self-contained as a theoretical extension against external logical benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the prior result linking global attention to a single-past-operator fragment of LTL and on the assumption that local attention can be analyzed inside the same logical model of computation.

axioms (1)

domain assumption Fixed-precision transformers with global attention correspond to a fragment of linear temporal logic containing a single past operator.
This is invoked as an established prior result that the paper extends to local attention.

pith-pipeline@v0.9.0 · 5714 in / 1237 out tokens · 89150 ms · 2026-05-20T23:57:28.882370+00:00 · methodology

Review history (2 revisions) →

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/ArithmeticFromLogic.lean LogicNat recovery and embed_strictMono_of_one_lt unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

fixed-precision transformers with global attention correspond to a fragment of linear temporal logic containing a single past operator... adding local attention introduces a second temporal operator
IndisputableMonolith/Foundation/AlexanderDuality.lean alexander_duality_circle_linking (D=3 forcing) unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

locally testable languages... LTL[P,Y]

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

41 extracted references · 41 canonical work pages · 8 internal anchors

[1]

Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E. Hinton. 2016. https://arxiv.org/abs/1607.06450 Layer normalization . In NIPS Deep Learning Symposium

work page internal anchor Pith review Pith/arXiv arXiv 2016
[2]

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. http://arxiv.org/abs/1409.0473 Neural machine translation by jointly learning to align and translate . In The Third International Conference on Learning Representations

work page internal anchor Pith review Pith/arXiv arXiv 2015
[3]

Longformer: The Long-Document Transformer

Iz Beltagy, Matthew E. Peters, and Arman Cohan. 2020. https://arxiv.org/abs/2004.05150 Longformer: The long-document transformer . Computing Research Repository, arXiv:2004.05150. Version 2

work page internal anchor Pith review Pith/arXiv arXiv 2020
[4]

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, and 12 others. 2020. https://proceedings.neurips.cc/paper_fil...

work page 2020
[5]

Janusz Brzozowski, Baiyu Li, and David Liu. 2012. https://dl.acm.org/doi/abs/10.5555/3173440.3173443 Syntactic complexities of six classes of star-free languages . Journal of Automata, Languages and Combinatorics, 17(2):83–105

work page doi:10.5555/3173440.3173443 2012
[6]

Alexandra Butoi, Ghazal Khalighinejad, Anej Svete, Josef Valvoda, Ryan Cotterell, and Brian DuSell. 2025. https://openreview.net/forum?id=aWLQTbfFgV Training neural networks as recognizers of formal languages . In The Thirteenth International Conference on Learning Representations

work page 2025
[7]

David Chiang, Peter Cholak, and Anand Pillay. 2023. https://proceedings.mlr.press/v202/chiang23a.html Tighter bounds on the expressivity of transformer encoders . In Proceedings of the 40th International Conference on Machine Learning, volume 202 of Proceedings of Machine Learning Research, pages 5544--5562. PMLR

work page 2023
[8]

Rewon Child, Scott Gray, Alec Radford, and Ilya Sutskever. 2019. http://arxiv.org/abs/1904.10509 Generating long sequences with sparse transformers . Computing Research Repository, arXiv:1904.10509

work page internal anchor Pith review Pith/arXiv arXiv 2019
[9]

Joëlle Cohen, Dominique Perrin, and Jean-Eric Pin. 1993. https://doi.org/10.1016/0022-0000(93)90005-H On the expressive power of temporal logic . Journal of Computer and System Sciences, 46(3):271--294

work page doi:10.1016/0022-0000(93)90005-h 1993
[10]

DeepSeek-AI . 2025. https://arxiv.org/abs/2501.12948 DeepSeek-R1 : Incentivizing reasoning capability in LLMs via reinforcement learning . Computing Research Repository, arXiv:2501.12948

work page internal anchor Pith review Pith/arXiv arXiv 2025
[11]

Gr\'egoire Del\'etang, Anian Ruoss, Jordi Grau-Moya, Tim Genewein, Li Kevin Wenliang, Elliot Catt, Chris Cundy, Marcus Hutter, Shane Legg, Joel Veness, and Pedro A Ortega. 2023. https://openreview.net/forum?id=WbxHAzkeQcn Neural networks and the chomsky hierarchy . In The Eleventh International Conference on Learning Representations

work page 2023
[12]

Samuel Eilenberg. 1974. https://books.google.ch/books?id=CZtduwEACAAJ Automata, Languages, and Machines . Number pt. 2 in 59/B. Academic Press

work page 1974
[13]

Dov Gabbay, Amir Pnueli, Saharon Shelah, and Jonathan Stavi. 1980. https://doi.org/10.1145/567446.567462 On the temporal analysis of fairness . POPL '80, page 163–173, New York, NY, USA. Association for Computing Machinery

work page doi:10.1145/567446.567462 1980
[14]

James A. Green. 1951. http://www.jstor.org/stable/1969317 On the structure of semigroups . Annals of Mathematics, 54(1):163--172

work page arXiv 1951
[15]

John Hewitt, Michael Hahn, Surya Ganguli, Percy Liang, and Christopher D. Manning. 2020. https://doi.org/10.18653/v1/2020.emnlp-main.156 RNN s can generate bounded hierarchical languages with optimal memory . In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1978--2010, Online. Association for Computa...

work page doi:10.18653/v1/2020.emnlp-main.156 2020
[16]

IEEE . 2019. https://doi.org/10.1109/IEEESTD.2019.8766229 IEEE standard for floating-point arithmetic . IEEE Std 754-2019

work page doi:10.1109/ieeestd.2019.8766229 2019
[17]

Johan Anthony Wilem Kamp. 1968. https://www.proquest.com/docview/302320357 Tense Logic and the Theory of Linear Order . Ph.D. thesis, University of California, Los Angeles

work page arXiv 1968
[18]

Adam: A Method for Stochastic Optimization

Diederik P. Kingma and Jimmy Ba. 2014. https://arxiv.org/abs/1412.6980 Adam: A method for stochastic optimization . Computing Research Repository, arXiv:1412.6980. Version 9

work page internal anchor Pith review Pith/arXiv arXiv 2014
[19]

Stephen C. Kleene. 1956. https://doi.org/doi:10.1515/9781400882618-002 Representation of Events in Nerve Nets and Finite Automata , pages 3--42. Princeton University Press, Princeton

work page doi:10.1515/9781400882618-002 1956
[20]

Jiaoda Li and Ryan Cotterell. 2025. https://openreview.net/forum?id=29LwAgLFpj Characterizing the expressivity of fixed-precision transformer language models . In The Thirty-ninth Annual Conference on Neural Information Processing Systems

work page 2025
[21]

Zhiyuan Li, Hong Liu, Denny Zhou, and Tengyu Ma. 2024. https://openreview.net/forum?id=3EWTEy9MTM Chain of thought empowers transformers to solve inherently serial problems . In The Twelfth International Conference on Learning Representations

work page 2024
[22]

Ilya Loshchilov and Frank Hutter. 2019. https://openreview.net/forum?id=Bkg6RiCqY7 Decoupled weight decay regularization . In The Seventh International Conference on Learning Representations

work page 2019
[23]

Thang Luong, Hieu Pham, and Christopher D. Manning. 2015. https://doi.org/10.18653/v1/D15-1166 Effective approaches to attention-based neural machine translation . In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1412--1421, Lisbon, Portugal. Association for Computational Linguistics

work page doi:10.18653/v1/d15-1166 2015
[24]

Robert McNaughton and Seymour Papert. 1971. https://mitpress.mit.edu/9780262130769/counter-free-automata/ Counter-free Automata . M.I.T. Press research monographs. M.I.T. Press

work page arXiv 1971
[25]

William Merrill and Ashish Sabharwal. 2024. https://openreview.net/forum?id=NjNGlPh8Wh The expressive power of transformers with chain of thought . In The Twelfth International Conference on Learning Representations

work page 2024
[26]

OpenAI. 2023. https://doi.org/10.48550/arXiv.2303.08774 GPT-4 technical report . Computing Research Repository, arXiv:2303.08774

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2303.08774 2023
[27]

Perles, Michael O

Micha A. Perles, Michael O. Rabin, and Eli Shamir. 1963. https://api.semanticscholar.org/CorpusID:45448007 The theory of definite automata . IEEE Transactions on Electronic Computers, 12:233--243

work page 1963
[28]

Amir Pnueli. 1977. https://doi.org/10.1109/SFCS.1977.32 The temporal logic of programs . In 18th Annual Symposium on Foundations of Computer Science (sfcs 1977), pages 46--57

work page doi:10.1109/sfcs.1977.32 1977
[29]

Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf Improving language understanding by generative pre-training . OpenAI technical report

work page 2018
[30]

Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf Language models are unsupervised multitask learners . OpenAI technical report

work page 2019
[31]

J \"u rgen Schmidhuber. 1992. https://ieeexplore.ieee.org/document/6796337 Learning to control fast-weight memories: An alternative to dynamic recurrent networks . In Neural Computation, volume 4, pages 131--139. MIT Press

work page arXiv 1992
[32]

Howard Straubing. 1994. https://doi.org/10.1007/978-1-4612-0289-9 Finite Automata, Formal Logic, and Circuit Complexity . Progress in Theoretical Computer Science. Birkh \"a user Boston, Boston, MA

work page doi:10.1007/978-1-4612-0289-9 1994
[33]

Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. 2024. https://doi.org/10.1016/j.neucom.2023.127063 Roformer: Enhanced transformer with rotary position embedding . Neurocomputing, 568:127063

work page doi:10.1016/j.neucom.2023.127063 2024
[34]

OLMo Team, Pete Walsh, Luca Soldaini, Dirk Groeneveld, Kyle Lo, Shane Chang, Khyathi Chandu, Akshita Bhagia, Oyvind Tafjord, and 1 others. 2025. https://arxiv.org/abs/2501.00656 OLMo 2 : The best fully open language model to date . Computing Research Repository, arXiv:2501.00656

work page internal anchor Pith review Pith/arXiv arXiv 2025
[35]

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, ukasz Kaiser, and Illia Polosukhin. 2017. https://proceedings.neurips.cc/paper_files/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf Attention is all you need . In Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc

work page 2017
[36]

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed H Chi, Quoc V Le, and Denny Zhou. 2022. https://openreview.net/forum?id=_VjQlMeSB_J Chain-of-thought prompting elicits reasoning in large language models . In Advances in Neural Information Processing Systems, volume 35

work page 2022
[37]

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, and 3 others. 2020. https://www.aclweb.org/anthology/2020.emnlp-demos.6 Transformers...

work page 2020
[38]

Andy Yang, Micha \"e l Cadilhac, and David Chiang. 2025. https://openreview.net/forum?id=jPduiyxyfw Knee-deep in c- RASP : A transformer depth hierarchy . In The Thirty-ninth Annual Conference on Neural Information Processing Systems

work page 2025
[39]

Andy Yang, David Chiang, and Dana Angluin. 2024. https://openreview.net/forum?id=FBMsBdH0yz Masked hard-attention transformers recognize exactly the star-free languages . In The Thirty-eighth Annual Conference on Neural Information Processing Systems

work page 2024
[40]

Andy Yang, Lena Strobl, David Chiang, and Dana Angluin. 2026. https://aclanthology.org/2026.tacl-1.8/ Simulating hard attention using soft attention . Transactions of the Association for Computational Linguistics, 14

work page 2026
[41]

Manzil Zaheer, Guru Guruganesh, Kumar Avinava Dubey, Joshua Ainslie, Chris Alberti, Santiago Ontanon, Philip Pham, Anirudh Ravula, Qifan Wang, Li Yang, and Amr Ahmed. 2020. https://proceedings.neurips.cc/paper_files/paper/2020/file/c8512d142a2d849725f31a9a7a361ab9-Paper.pdf Big bird: Transformers for longer sequences . In Advances in Neural Information Pr...

work page 2020

[1] [1]

Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E. Hinton. 2016. https://arxiv.org/abs/1607.06450 Layer normalization . In NIPS Deep Learning Symposium

work page internal anchor Pith review Pith/arXiv arXiv 2016

[2] [2]

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. http://arxiv.org/abs/1409.0473 Neural machine translation by jointly learning to align and translate . In The Third International Conference on Learning Representations

work page internal anchor Pith review Pith/arXiv arXiv 2015

[3] [3]

Longformer: The Long-Document Transformer

Iz Beltagy, Matthew E. Peters, and Arman Cohan. 2020. https://arxiv.org/abs/2004.05150 Longformer: The long-document transformer . Computing Research Repository, arXiv:2004.05150. Version 2

work page internal anchor Pith review Pith/arXiv arXiv 2020

[4] [4]

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, and 12 others. 2020. https://proceedings.neurips.cc/paper_fil...

work page 2020

[5] [5]

Janusz Brzozowski, Baiyu Li, and David Liu. 2012. https://dl.acm.org/doi/abs/10.5555/3173440.3173443 Syntactic complexities of six classes of star-free languages . Journal of Automata, Languages and Combinatorics, 17(2):83–105

work page doi:10.5555/3173440.3173443 2012

[6] [6]

Alexandra Butoi, Ghazal Khalighinejad, Anej Svete, Josef Valvoda, Ryan Cotterell, and Brian DuSell. 2025. https://openreview.net/forum?id=aWLQTbfFgV Training neural networks as recognizers of formal languages . In The Thirteenth International Conference on Learning Representations

work page 2025

[7] [7]

David Chiang, Peter Cholak, and Anand Pillay. 2023. https://proceedings.mlr.press/v202/chiang23a.html Tighter bounds on the expressivity of transformer encoders . In Proceedings of the 40th International Conference on Machine Learning, volume 202 of Proceedings of Machine Learning Research, pages 5544--5562. PMLR

work page 2023

[8] [8]

Rewon Child, Scott Gray, Alec Radford, and Ilya Sutskever. 2019. http://arxiv.org/abs/1904.10509 Generating long sequences with sparse transformers . Computing Research Repository, arXiv:1904.10509

work page internal anchor Pith review Pith/arXiv arXiv 2019

[9] [9]

Joëlle Cohen, Dominique Perrin, and Jean-Eric Pin. 1993. https://doi.org/10.1016/0022-0000(93)90005-H On the expressive power of temporal logic . Journal of Computer and System Sciences, 46(3):271--294

work page doi:10.1016/0022-0000(93)90005-h 1993

[10] [10]

DeepSeek-AI . 2025. https://arxiv.org/abs/2501.12948 DeepSeek-R1 : Incentivizing reasoning capability in LLMs via reinforcement learning . Computing Research Repository, arXiv:2501.12948

work page internal anchor Pith review Pith/arXiv arXiv 2025

[11] [11]

Gr\'egoire Del\'etang, Anian Ruoss, Jordi Grau-Moya, Tim Genewein, Li Kevin Wenliang, Elliot Catt, Chris Cundy, Marcus Hutter, Shane Legg, Joel Veness, and Pedro A Ortega. 2023. https://openreview.net/forum?id=WbxHAzkeQcn Neural networks and the chomsky hierarchy . In The Eleventh International Conference on Learning Representations

work page 2023

[12] [12]

Samuel Eilenberg. 1974. https://books.google.ch/books?id=CZtduwEACAAJ Automata, Languages, and Machines . Number pt. 2 in 59/B. Academic Press

work page 1974

[13] [13]

Dov Gabbay, Amir Pnueli, Saharon Shelah, and Jonathan Stavi. 1980. https://doi.org/10.1145/567446.567462 On the temporal analysis of fairness . POPL '80, page 163–173, New York, NY, USA. Association for Computing Machinery

work page doi:10.1145/567446.567462 1980

[14] [14]

James A. Green. 1951. http://www.jstor.org/stable/1969317 On the structure of semigroups . Annals of Mathematics, 54(1):163--172

work page arXiv 1951

[15] [15]

John Hewitt, Michael Hahn, Surya Ganguli, Percy Liang, and Christopher D. Manning. 2020. https://doi.org/10.18653/v1/2020.emnlp-main.156 RNN s can generate bounded hierarchical languages with optimal memory . In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1978--2010, Online. Association for Computa...

work page doi:10.18653/v1/2020.emnlp-main.156 2020

[16] [16]

IEEE . 2019. https://doi.org/10.1109/IEEESTD.2019.8766229 IEEE standard for floating-point arithmetic . IEEE Std 754-2019

work page doi:10.1109/ieeestd.2019.8766229 2019

[17] [17]

Johan Anthony Wilem Kamp. 1968. https://www.proquest.com/docview/302320357 Tense Logic and the Theory of Linear Order . Ph.D. thesis, University of California, Los Angeles

work page arXiv 1968

[18] [18]

Adam: A Method for Stochastic Optimization

Diederik P. Kingma and Jimmy Ba. 2014. https://arxiv.org/abs/1412.6980 Adam: A method for stochastic optimization . Computing Research Repository, arXiv:1412.6980. Version 9

work page internal anchor Pith review Pith/arXiv arXiv 2014

[19] [19]

Stephen C. Kleene. 1956. https://doi.org/doi:10.1515/9781400882618-002 Representation of Events in Nerve Nets and Finite Automata , pages 3--42. Princeton University Press, Princeton

work page doi:10.1515/9781400882618-002 1956

[20] [20]

Jiaoda Li and Ryan Cotterell. 2025. https://openreview.net/forum?id=29LwAgLFpj Characterizing the expressivity of fixed-precision transformer language models . In The Thirty-ninth Annual Conference on Neural Information Processing Systems

work page 2025

[21] [21]

Zhiyuan Li, Hong Liu, Denny Zhou, and Tengyu Ma. 2024. https://openreview.net/forum?id=3EWTEy9MTM Chain of thought empowers transformers to solve inherently serial problems . In The Twelfth International Conference on Learning Representations

work page 2024

[22] [22]

Ilya Loshchilov and Frank Hutter. 2019. https://openreview.net/forum?id=Bkg6RiCqY7 Decoupled weight decay regularization . In The Seventh International Conference on Learning Representations

work page 2019

[23] [23]

Thang Luong, Hieu Pham, and Christopher D. Manning. 2015. https://doi.org/10.18653/v1/D15-1166 Effective approaches to attention-based neural machine translation . In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1412--1421, Lisbon, Portugal. Association for Computational Linguistics

work page doi:10.18653/v1/d15-1166 2015

[24] [24]

Robert McNaughton and Seymour Papert. 1971. https://mitpress.mit.edu/9780262130769/counter-free-automata/ Counter-free Automata . M.I.T. Press research monographs. M.I.T. Press

work page arXiv 1971

[25] [25]

William Merrill and Ashish Sabharwal. 2024. https://openreview.net/forum?id=NjNGlPh8Wh The expressive power of transformers with chain of thought . In The Twelfth International Conference on Learning Representations

work page 2024

[26] [26]

OpenAI. 2023. https://doi.org/10.48550/arXiv.2303.08774 GPT-4 technical report . Computing Research Repository, arXiv:2303.08774

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2303.08774 2023

[27] [27]

Perles, Michael O

Micha A. Perles, Michael O. Rabin, and Eli Shamir. 1963. https://api.semanticscholar.org/CorpusID:45448007 The theory of definite automata . IEEE Transactions on Electronic Computers, 12:233--243

work page 1963

[28] [28]

Amir Pnueli. 1977. https://doi.org/10.1109/SFCS.1977.32 The temporal logic of programs . In 18th Annual Symposium on Foundations of Computer Science (sfcs 1977), pages 46--57

work page doi:10.1109/sfcs.1977.32 1977

[29] [29]

Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf Improving language understanding by generative pre-training . OpenAI technical report

work page 2018

[30] [30]

Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf Language models are unsupervised multitask learners . OpenAI technical report

work page 2019

[31] [31]

J \"u rgen Schmidhuber. 1992. https://ieeexplore.ieee.org/document/6796337 Learning to control fast-weight memories: An alternative to dynamic recurrent networks . In Neural Computation, volume 4, pages 131--139. MIT Press

work page arXiv 1992

[32] [32]

Howard Straubing. 1994. https://doi.org/10.1007/978-1-4612-0289-9 Finite Automata, Formal Logic, and Circuit Complexity . Progress in Theoretical Computer Science. Birkh \"a user Boston, Boston, MA

work page doi:10.1007/978-1-4612-0289-9 1994

[33] [33]

Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. 2024. https://doi.org/10.1016/j.neucom.2023.127063 Roformer: Enhanced transformer with rotary position embedding . Neurocomputing, 568:127063

work page doi:10.1016/j.neucom.2023.127063 2024

[34] [34]

OLMo Team, Pete Walsh, Luca Soldaini, Dirk Groeneveld, Kyle Lo, Shane Chang, Khyathi Chandu, Akshita Bhagia, Oyvind Tafjord, and 1 others. 2025. https://arxiv.org/abs/2501.00656 OLMo 2 : The best fully open language model to date . Computing Research Repository, arXiv:2501.00656

work page internal anchor Pith review Pith/arXiv arXiv 2025

[35] [35]

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, ukasz Kaiser, and Illia Polosukhin. 2017. https://proceedings.neurips.cc/paper_files/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf Attention is all you need . In Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc

work page 2017

[36] [36]

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed H Chi, Quoc V Le, and Denny Zhou. 2022. https://openreview.net/forum?id=_VjQlMeSB_J Chain-of-thought prompting elicits reasoning in large language models . In Advances in Neural Information Processing Systems, volume 35

work page 2022

[37] [37]

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, and 3 others. 2020. https://www.aclweb.org/anthology/2020.emnlp-demos.6 Transformers...

work page 2020

[38] [38]

Andy Yang, Micha \"e l Cadilhac, and David Chiang. 2025. https://openreview.net/forum?id=jPduiyxyfw Knee-deep in c- RASP : A transformer depth hierarchy . In The Thirty-ninth Annual Conference on Neural Information Processing Systems

work page 2025

[39] [39]

Andy Yang, David Chiang, and Dana Angluin. 2024. https://openreview.net/forum?id=FBMsBdH0yz Masked hard-attention transformers recognize exactly the star-free languages . In The Thirty-eighth Annual Conference on Neural Information Processing Systems

work page 2024

[40] [40]

Andy Yang, Lena Strobl, David Chiang, and Dana Angluin. 2026. https://aclanthology.org/2026.tacl-1.8/ Simulating hard attention using soft attention . Transactions of the Association for Computational Linguistics, 14

work page 2026

[41] [41]

Manzil Zaheer, Guru Guruganesh, Kumar Avinava Dubey, Joshua Ainslie, Chris Alberti, Santiago Ontanon, Philip Pham, Anirudh Ravula, Qifan Wang, Li Yang, and Amr Ahmed. 2020. https://proceedings.neurips.cc/paper_files/paper/2020/file/c8512d142a2d849725f31a9a7a361ab9-Paper.pdf Big bird: Transformers for longer sequences . In Advances in Neural Information Pr...

work page 2020