Recognition: 2 theorem links · Lean theorem
Cross-Attention and Encoder-Decoder Transformers: A Logical Characterization
Pith reviewed 2026-05-11 02:05 UTC · model grok-4.3
The pith
Encoder-decoder transformers match a temporal logic with a counting global modality over the encoder and a past modality over the decoder.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Encoder-decoder transformers over text, in the setting of floating-point numbers and soft attention, are exactly characterized by a temporal logic that extends propositional logic with a counting global modality over the encoder input and a past modality over the decoder input. The same class of transformers is also characterized by a type of distributed automata. These equivalences hold even when architectural details such as masking are varied, and they extend to the autoregressive case.
What carries the argument
A temporal logic with a counting global modality over the encoder input sequence and a past modality over the decoder input sequence, which together capture the cross-attention mechanism between the two parts of the model.
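The paper's precise definitions are not reproduced on this page, so the following is a minimal sketch of what the logic's grammar plausibly looks like, reconstructed from the abstract (the threshold subscript ≥ k on the counting modality is our notational guess, not the paper's literal syntax):

```latex
% Hypothetical grammar, reconstructed from the abstract; the paper's
% actual syntax and modal decorations may differ.
\[
  \varphi ::= p
    \mid \neg\varphi
    \mid \varphi \wedge \varphi
    \mid \langle G \rangle_{\ge k}\,\varphi  % counting global modality: at least
                                             % k encoder positions satisfy \varphi
    \mid \langle P \rangle\,\varphi          % past modality: an earlier decoder
                                             % position satisfies \varphi
\]
```

On this reading, the global modality gives every decoder position thresholded access to the whole encoder input, which is exactly the shape of cross-attention, while the past modality walks back along the already-generated decoder prefix.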
If this is right
- Any property expressible in the new logic can be decided for a given transformer by checking the corresponding formula.
- Transformers with altered masking rules remain within the same logical class.
- The automata characterization supplies an alternative computational model that can simulate the same cross-attention behavior (a toy sketch of such a machine follows this list).
- The results apply directly to autoregressive generation settings.
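The page later quotes the paper's name for the automaton model, counting past-global distributed automata (CPG-automata). As a minimal sketch of what one step of such a machine might look like, under assumptions of our own (the saturation width K, the state names, and the toy transition rule are all hypothetical, not the paper's definition):

```python
from collections import Counter

K = 3  # hypothetical saturation width: counts above K are indistinguishable

def cpg_step(prev_state, encoder_states):
    """One decoder-position update of a toy CPG-style automaton."""
    # "Counting global" part: saturated counts of encoder states.
    counts = {q: min(c, K) for q, c in Counter(encoder_states).items()}
    # "Past" part: the rule may also consult the previous decoder state.
    # Toy rule: accept once at least two encoder positions are in state "qa",
    # and stay accepting thereafter.
    if prev_state == "accept" or counts.get("qa", 0) >= 2:
        return "accept"
    return "reject"

encoder = ["qa", "qb", "qa"]
state = "reject"
for t in range(3):  # sweep along three decoder positions
    state = cpg_step(state, encoder)
    print(t, state)  # accepts at every step, since the encoder has two "qa"
```

The point of the sketch is only the information flow: each decoder position sees saturated global counts over the encoder plus its own past, nothing else.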
Where Pith is reading between the lines
- The logic may serve as a specification language for verifying safety properties of transformer-based systems before deployment.
- Connections between the counting modality and automata might allow complexity bounds on what encoder-decoder models can compute in polynomial time.
- Designers could use the logic to generate candidate architectures that satisfy desired input-output behaviors.
Load-bearing premise
The continuous and probabilistic behavior of soft attention can be faithfully reproduced by the discrete counting and past modalities without any loss of expressive power.
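One reason this premise is less wild than it sounds, sketched with toy assumptions of our own (float32 and a two-position softmax; nothing here is from the paper): fixed-precision arithmetic makes the space of attention scores, and hence of softmax weight vectors, finite.

```python
import numpy as np

# Scores closer together than float32 precision are literally the same number,
# so soft attention over floats can realize only finitely many weight vectors.
a = np.float32(0.5)
b = np.float32(0.5 + 1e-8)  # rounds to exactly the same float32 as 0.5
print(a == b)               # True

w = np.exp(np.array([a, 0.0], dtype=np.float32))
print(w / w.sum())          # one of finitely many realizable weight vectors
```

Finiteness alone does not prove the counting modalities capture the right finite information, which is why the equivalence proof, not this observation, carries the result.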
What would settle it
An input-output pair on a small encoder-decoder transformer where the output token probabilities cannot be matched by any formula in the proposed logic or by any run of the corresponding distributed automata.
Original abstract
We give a novel logical characterization of encoder-decoder transformers, the foundational architecture for LLMs that also sees use in various settings that benefit from cross-attention. We study such transformers over text in the practical setting of floating-point numbers and soft-attention, characterizing them with a new temporal logic. This logic extends propositional logic with a counting global modality over the encoder input and a past modality over the decoder input. We also give an additional characterization of such transformers via a type of distributed automata, and show that our results are not limited to the specific choices in the architecture and can account for changes in, e.g., masking. Finally, we discuss encoder-decoder transformers in the autoregressive setting.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript claims to provide a novel logical characterization of encoder-decoder transformers (with cross-attention, in the floating-point soft-attention setting) via a temporal logic extending propositional logic with a counting global modality over the encoder input and a past modality over the decoder input. It additionally characterizes such transformers via distributed automata, shows the results are robust to variations such as masking, and discusses the autoregressive setting.
Significance. If the claimed exact equivalences hold, the work would offer a valuable formal bridge between transformer architectures and logical/automata-theoretic methods, supporting verification and analysis of cross-attention mechanisms in LLMs. The automata characterization and explicit handling of masking changes are positive features that broaden applicability.
Major comments (2)
- [§3 (logic definition) and §4 (equivalence proof)] The central equivalence (likely Theorem 1 or the main result in the section on logical characterization) asserts that every transformer computation, including dot-product attention scores, softmax normalization, and real-valued weighted sums, is expressible in the logic. However, the counting global modality only tracks cardinalities of positions satisfying formulas, and the past modality tracks temporal order; it is unclear from the definitions how these discrete modalities exactly encode the continuous, position-pair-dependent floating-point operations of soft cross-attention without an intermediate discretization that would break exact equivalence. (A toy numerical illustration follows this list.)
- [§5 (automata characterization)] The automata characterization (in the section following the logic) claims a precise match via distributed automata, but the transition rules must be shown to simulate the numerical attention matrix computation exactly; if the automata are defined only over discrete states derived from the logic, this risks the same gap with respect to floating-point values and masking.
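To make the first concern concrete, here is a toy contrast with numbers of our own choosing (not an example from the paper): two encoder score vectors with identical counts of "high" and "low" positions produce different soft-attention outputs, so a counting-only abstraction has to explain where the score values themselves are stored.

```python
import numpy as np

# Same counts of high/low scores, different score values, different outputs:
# cardinalities of positions alone do not determine the attention result.
values = np.array([1.0, 0.0])
for scores in ([2.0, 0.0], [5.0, 0.0]):
    w = np.exp(np.array(scores))
    w /= w.sum()
    print(scores, float(w @ values))  # ~0.881 vs ~0.993
```

The paper's floating-point setting is presumably what closes this gap, since float scores take only finitely many values; the referee's question is whether the manuscript makes that step explicit.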
Minor comments (2)
- [Abstract and §1] The abstract and introduction could more explicitly state the precise class of encoder-decoder transformers considered (e.g., number of layers, head count, or restrictions on value embeddings) to clarify the scope of the characterization.
- [§3] Notation for the modalities (e.g., how the counting operator is written and how it interacts with propositional variables) should be introduced with a small concrete example of a simple cross-attention step to aid readability.
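For concreteness, the kind of worked example the referee requests might look like this (our notation, not necessarily the paper's): a decoder position whose cross-attention head fires exactly when the encoder contains at least two occurrences of a token a would be specified by

```latex
\[
  \varphi \;=\; \langle G \rangle_{\ge 2}\, p_a ,
\]
```

read at any decoder position as "at least two encoder positions satisfy the propositional variable p_a".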
Simulated Author's Rebuttal
We thank the referee for their thorough review and valuable feedback on our manuscript. The comments highlight important points regarding clarity in the logical and automata characterizations, which we will address through revisions to improve the exposition of the proofs and examples.
Point-by-point responses
- Referee: [§3 (logic definition) and §4 (equivalence proof)] The central equivalence (likely Theorem 1 or the main result in the section on logical characterization) asserts that every transformer computation, including dot-product attention scores, softmax normalization, and real-valued weighted sums, is expressible in the logic. However, the counting global modality only tracks cardinalities of positions satisfying formulas, and the past modality tracks temporal order; it is unclear from the definitions how these discrete modalities exactly encode the continuous, position-pair-dependent floating-point operations of soft cross-attention without an intermediate discretization that would break exact equivalence.
Authors: We appreciate this observation on the need for greater explicitness. The logic encodes the floating-point operations by using the counting modality to express exact cardinalities and comparisons that correspond to the results of dot-product computations and softmax (via formulas that define thresholds and weighted sums over positions satisfying prior subformulas). The equivalence in §4 is established by induction on transformer layers, where each real-valued operation is captured by a corresponding logical formula without requiring discretization of the values themselves. The past modality handles decoder dependencies in the same manner. To make this connection clearer, we will revise §3 and §4 to include a detailed worked example of encoding a cross-attention head, along with expanded proof steps showing how the modalities simulate the numerical matrix operations exactly. revision: yes
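A toy rendering of the saturation idea the authors appeal to, under float32 assumptions of our own (the paper's width parameter and exact bookkeeping may differ): once a count exceeds what the arithmetic can resolve, incrementing it no longer changes the computed statistic, so counts saturated at a finite threshold carry everything the float computation retains.

```python
import numpy as np

# Adjacent counts near the float32 precision limit collapse: n and n+1
# matching positions yield the same float32 ratio, so only saturated
# counts up to a finite width can matter to the computation.
N = np.float32(2 ** 25)
for n in (2 ** 25 - 1, 2 ** 25):
    print(n, np.float32(n) / N)  # both lines print 1.0
```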
- Referee: [§5 (automata characterization)] The automata characterization (in the section following the logic) claims a precise match via distributed automata, but the transition rules must be shown to simulate the numerical attention matrix computation exactly; if the automata are defined only over discrete states derived from the logic, this risks the same gap with respect to floating-point values and masking.
Authors: The distributed automata are constructed directly from the logical formulas, with states and transitions that preserve the exact numerical semantics of the attention computations (including floating-point values) through the same inductive encoding used in the logic. Masking is handled by restricting the transition rules to the relevant positions, as shown in the robustness results. We acknowledge that the current presentation of the automata transitions could be more detailed. In the revision, we will expand §5 with additional formal definitions and a proof sketch explicitly demonstrating the simulation of the attention matrix and masking variations. revision: yes
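A minimal sketch of masking-as-restriction as we read this reply (the helper below is our own illustration, not the paper's construction): a causal mask does not change the counting operation, it only shrinks the set of positions the count ranges over.

```python
# Causal masking = the same count, evaluated over a prefix instead of the
# whole sequence; lifting the mask widens the window back to everything.
def visible_count(tokens, pos, target, causal=True):
    window = tokens[: pos + 1] if causal else tokens
    return sum(1 for t in window if t == target)

tokens = ["a", "b", "a", "a"]
print([visible_count(tokens, i, "a") for i in range(4)])                # [1, 1, 2, 3]
print([visible_count(tokens, i, "a", causal=False) for i in range(4)])  # [3, 3, 3, 3]
```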
Circularity Check
No significant circularity; independent formalisms shown equivalent to architecture
Full rationale
The paper defines a new temporal logic (propositional logic plus counting global modality over encoder and past modality over decoder) and a class of distributed automata as standalone formalisms. It then proves these exactly characterize encoder-decoder transformers under floating-point soft-attention. No equation or definition reduces the claimed equivalence to a fitted parameter, self-referential construction, or self-citation chain. The derivation proceeds by explicit translation of transformer operations into the logic/automata and vice versa, remaining self-contained without importing load-bearing results from prior author work.
Axiom & Free-Parameter Ledger
Axioms (2)
- standard math: Standard axioms and semantics of propositional logic
- domain assumption: Transformer computations use floating-point arithmetic and soft attention
Invented entities (2)
- Counting global modality — no independent evidence
- Past modality over decoder input — no independent evidence
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean — reality_from_one_distinction (unclear)
  Unclear: relation between the paper passage and the cited Recognition theorem.
  Passage: "We study such transformers over text in the practical setting of floating-point numbers and soft-attention, characterizing them with a new temporal logic. This logic extends propositional logic with a counting global modality over the encoder input and a past modality over the decoder input."
- IndisputableMonolith/Cost/FunctionalEquation.lean — washburn_uniqueness_aczel (unclear)
  Unclear: relation between the paper passage and the cited Recognition theorem.
  Passage: "We also give an additional characterization of such transformers via a type of distributed automata... counting past-global distributed automata (CPG-automata)"
What do these tags mean?
- matches — The paper's claim is directly supported by a theorem in the formal canon.
- supports — The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends — The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses — The paper appears to rely on the theorem as machinery.
- contradicts — The paper's claim conflicts with a theorem or certificate in the canon.
- unclear — Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1] Veeti Ahvonen, Maurice Funk, Damian Heiman, Antti Kuusisto, and Carsten Lutz. Expressive Power of Graph Transformers via Logic. Proceedings of the AAAI Conference on Artificial Intelligence, 40(24):19569–19579, Mar. 2026. URL: https://ojs.aaai.org/index.php/AAAI/article/view/39036, doi:10.1609/aaai.v40i24.39036
- [2] Veeti Ahvonen, Maurice Funk, Damian Heiman, Antti Kuusisto, and Carsten Lutz. Expressive Power of Graph Transformers via Logic, 2026. URL: https://arxiv.org/abs/2508.01067, arXiv:2508.01067
- [3] Veeti Ahvonen, Damian Heiman, Antti Kuusisto, and Carsten Lutz. Logical Characterizations of Recurrent Graph Neural Networks with Reals and Floats. In Advances in Neural Information Processing Systems, volume 37, pages 104205–104249. Curran Associates, Inc., 2024
- [4] Pablo Barceló, Alexander Kozachinskiy, Anthony W. Lin, and Vladimir Podolskii. Logical Languages Accepted by Transformer Encoders with Hard Attention. In International Conference on Learning Representations, volume 2024, pages 22077–22087, 2024. URL: https://proceedings.iclr.cc/...
- [5] David Chiang and Peter Cholak. Overcoming a Theoretical Limitation of Self-Attention. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 7654–7664, 2022
- [6] David Chiang, Peter Cholak, and Anand Pillay. Tighter Bounds on the Expressivity of Transformer Encoders. In Proceedings of the 40th International Conference on Machine Learning, ICML'23. JMLR.org, 2023
- [7] Marta R. Costa-Jussà, James Cross, Onur Çelebi, Maha Elbayad, Kenneth Heafield, Kevin Heffernan, Elahe Kalbassi, Janice Lam, Daniel Licht, Jean Maillard, et al. No Language Left Behind: Scaling Human-Centered Machine Translation. arXiv preprint arXiv:2207.04672, 2022. URL: https://arxiv.org/abs/2207.04672
- [8] Nicholas J. Higham. The Accuracy of Floating Point Summation. SIAM J. Sci. Comput., 14(4):783–799, 1993. doi:10.1137/0914050
- [9] Hongjian Jiang, Michael Hahn, Georg Zetzsche, and Anthony Widjaja Lin. Softmax Transformers are Turing-Complete, 2025. URL: https://arxiv.org/abs/2511.20038, arXiv:2511.20038
- [10] Antti Kuusisto. Modal Logic and Distributed Message Passing Automata. In Computer Science Logic 2013 (CSL 2013), volume 23 of Leibniz International Proceedings in Informatics (LIPIcs), pages 452–468, Dagstuhl, Germany, 2013. Schloss Dagstuhl – Leibniz-Zentrum für Informatik. URL: https://drops.dagstuhl.de/entities/document/...
- [11] Jiaoda Li and Ryan Cotterell. Characterizing the Expressivity of Fixed-precision Transformer Language Models. In Advances in Neural Information Processing Systems, volume 38, pages 159510–159554. Curran Associates, Inc., 2025. URL: https://proceedings.neurips.cc/paper...
- [12] Zhiyuan Li, Hong Liu, Denny Zhou, and Tengyu Ma. Chain of Thought Empowers Transformers to Solve Inherently Serial Problems. In The Twelfth International Conference on Learning Representations, 2024
- [13] William Merrill and Ashish Sabharwal. A Logic for Expressing Log-Precision Transformers. In Advances in Neural Information Processing Systems, volume 36, pages 52453–52463. Curran Associates, Inc., 2023. URL: https://proceedings.neurips.cc/paper_files/paper/2023/file/a48e5877...
- [14] William Merrill and Ashish Sabharwal. The Expressive Power of Transformers with Chain of Thought. In International Conference on Learning Representations, volume 2024, pages 7690–7706, 2024. URL: https://proceedings.iclr.cc/paper_files/paper/2024/file/1f59721c106ea80f61329903911...
- [15] Jorge Pérez, Pablo Barceló, and Javier Marinkovic. Attention is Turing Complete. Journal of Machine Learning Research, 22(75):1–35, 2021. URL: http://jmlr.org/papers/v22/20-302.html
- [16] Thomas G. Robertazzi and Stuart C. Schwartz. Best "Ordering" for Floating-Point Addition. ACM Trans. Math. Softw., 14(1):101–110, 1988. doi:10.1145/42288.42343
- [17] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-Resolution Image Synthesis with Latent Diffusion Models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10684–10695, 2022
- [18] Lena Strobl, William Merrill, Gail Weiss, David Chiang, and Dana Angluin. What Formal Languages Can Transformers Express? A Survey. Transactions of the Association for Computational Linguistics, 12:543–561, 2024. URL: http://dx.doi.org/10.1162/tacl_a_00663, doi:10.1162/tacl_a_00663
- [19] Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. RoFormer: Enhanced Transformer with Rotary Position Embedding. Neurocomputing, 568:127063, 2024
- [20] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention Is All You Need. In Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017
- [21] James Hardy Wilkinson. Rounding Errors in Algebraic Processes. In Information Processing, Proceedings of the 1st International Conference on Information Processing, UNESCO, Paris 15–20 June 1959, pages 44–53. UNESCO (Paris), 1959
- [22] Andy Yang and David Chiang. Counting Like Transformers: Compiling Temporal Counting Logic Into Softmax Transformers. In First Conference on Language Modeling, 2024. URL: https://openreview.net/forum?id=FmhPg4UJ9K
- [23] Andy Yang, David Chiang, and Dana Angluin. Masked Hard-Attention Transformers Recognize Exactly the Star-Free Languages. In Advances in Neural Information Processing Systems, volume 37, pages 10202–10235. Curran Associates, Inc., 2024. URL: https://proceedings.neurips.cc/paper_fi...
- [24] Andy J. Yang, Michaël Cadilhac, and David Chiang. Knee-Deep in C-RASP: A Transformer Depth Hierarchy. In Advances in Neural Information Processing Systems, volume 38, pages 37184–37225. Curran Associates, Inc., 2025. URL: https://proceedings.neurips.cc/paper_files/pap...
- [25] Zangwei Zheng, Xiangyu Peng, Tianji Yang, Chenhui Shen, Shenggui Li, Hongxin Liu, Yukun Zhou, Tianyi Li, and Yang You. Open-Sora: Democratizing Efficient Video Production for All. URL: https://arxiv.org/abs/2412.20404, arXiv:2412.20404