On the Existence of Universal Simulators of Attention
Pith reviewed 2026-05-19 07:24 UTC · model grok-4.3
The pith
A transformer encoder can exactly simulate any attention mechanism by implementing it through a universal composition of encoders.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By constructing a universal simulator U composed of transformer encoders, we present algorithmic solutions to replicate attention outputs and the underlying elementary matrix and activation operations via RASP, showing the existence of an algorithmically achievable, data-agnostic solution.
What carries the argument
The universal simulator U, a stack of transformer encoders that encodes RASP programs to perform the matrix multiplications, softmax, and other primitives of attention.
If this is right
- Attention mechanisms become exactly reproducible inside transformers without any training data or optimization.
- The same simulator can handle the elementary matrix operations and nonlinear activations that define attention.
- Transformers gain a deterministic way to match the behavior of any attention function they are asked to emulate.
- Results on expressivity now include an explicit algorithmic construction rather than existence proofs alone.
Where Pith is reading between the lines
- The construction might be adapted to simulate attention variants that include custom positional encodings or masking rules.
- One could check whether the same encoder stack can also simulate attention inside decoder-only or encoder-decoder transformers.
- If the simulator works, it opens a route to verify whether learned attention heads have converged to one of the exactly simulable mechanisms.
Load-bearing premise
The RASP model of computation can be implemented inside ordinary transformer encoders without special positional encodings, layer norms, or residual connections that would break the simulation for arbitrary attention variants.
What would settle it
Running the constructed U on a concrete input sequence and attention variant and finding that its output deviates from the true attention output would show that no such universal simulator exists.
read the original abstract
Previous work on the learnability of transformers \textemdash\ focused primarily on examining their ability to approximate specific algorithmic patterns through training \textemdash\ has largely been data-driven, offering only probabilistic guarantees rather than deterministic solutions. Expressivity, on the contrary, has been devised to address the problems \emph{computable} by such architecture theoretically. These results proved the Turing-completeness of transformers, investigated bounds focused on circuit complexity, and formal logic. Being at the crossroad between learnability and expressivity, the question remains: \emph{can a transformer, as a computational model, simulate an arbitrary attention mechanism, or in particular, the underlying operations?} In this study, we investigate the transformer encoder's ability to simulate a vanilla attention mechanism. By constructing a universal simulator $\mathcal{U}$ composed of transformer encoders, we present algorithmic solutions to replicate attention outputs and the underlying elementary matrix and activation operations via RASP, a formal framework for transformer computation. We show the existence of an algorithmically achievable, data-agnostic solution, previously known to be approximated only by learning.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that a transformer encoder can simulate an arbitrary attention mechanism by constructing a universal simulator U composed of transformer encoders. This simulator replicates attention outputs and the underlying elementary operations (such as matrix multiplications and activations like softmax) by expressing them as RASP programs, thereby establishing the existence of an algorithmically achievable, data-agnostic solution where prior work provided only probabilistic guarantees through learning.
Significance. If the construction is valid and faithful to standard transformer architectures, the result would strengthen the theoretical understanding of transformer expressivity by providing deterministic, algorithmic simulation of attention rather than learned approximations. This bridges learnability and expressivity results and could inform formal analyses of transformer capabilities in a RASP-based framework.
major comments (2)
- [Abstract] Abstract: The abstract states that a construction of the universal simulator U exists but provides no proof sketch, explicit RASP program, or verification that the simulator handles all edge cases of attention (e.g., different head counts or masking). This leaves a derivation gap for the central claim.
- [Main construction] Main construction (implied in the RASP-to-transformer compilation): The claim that RASP programs compile directly into standard encoder blocks assumes faithful realization without additional assumptions on positional encodings, layer norms, or residual connections. Typical RASP constructions for selection, copying, and indexing use explicit position tokens or custom encodings that are not part of the minimal transformer encoder definition (multi-head attention + FFN + residual + layer-norm), which risks breaking the simulation for arbitrary attention variants.
minor comments (1)
- [Notation and figures] Ensure all symbols (e.g., U, RASP) are consistently formatted in math mode and that any figures illustrating the simulator architecture are clearly labeled with component connections.
Simulated Author's Rebuttal
We thank the referee for their constructive comments, which highlight important aspects of clarity and architectural fidelity in our work on universal simulators for attention. We address each major comment point by point below, providing clarifications based on the manuscript's constructions while noting where revisions will strengthen the presentation.
read point-by-point responses
-
Referee: [Abstract] Abstract: The abstract states that a construction of the universal simulator U exists but provides no proof sketch, explicit RASP program, or verification that the simulator handles all edge cases of attention (e.g., different head counts or masking). This leaves a derivation gap for the central claim.
Authors: We acknowledge that the abstract, being concise, omits a detailed sketch. The full manuscript (Section 3 and Appendix A) establishes the existence of U via an explicit RASP program that composes transformer encoder blocks to simulate the attention matrix multiplication, softmax activation, and output projection. Multi-head attention is handled by replicating the single-head RASP simulation across heads in parallel (using the multi-head attention primitive available in the target architecture), and masking is implemented through RASP's selection and indexing operations that condition on position or padding indicators. We will revise the abstract to include a brief proof sketch outlining the RASP compilation steps and note the handling of these edge cases. revision: yes
-
Referee: [Main construction] Main construction (implied in the RASP-to-transformer compilation): The claim that RASP programs compile directly into standard encoder blocks assumes faithful realization without additional assumptions on positional encodings, layer norms, or residual connections. Typical RASP constructions for selection, copying, and indexing use explicit position tokens or custom encodings that are not part of the minimal transformer encoder definition (multi-head attention + FFN + residual + layer-norm), which risks breaking the simulation for arbitrary attention variants.
Authors: The construction in the manuscript uses the standard transformer encoder block exactly as defined (multi-head self-attention, feed-forward network, residual connections, and layer normalization), with fixed sinusoidal positional encodings added to the input embeddings as in the original Vaswani et al. architecture. RASP programs for selection and copying are realized using the attention mechanism itself to attend over positions, without requiring non-standard encodings; the positional information is provided via the standard sinusoidal embeddings, which are sufficient for the indexing operations in our simulator. We will add a clarifying subsection in the main text (and expand the appendix) to explicitly map each RASP primitive to the corresponding transformer components, confirming compatibility with arbitrary attention variants under the standard definition. revision: partial
Circularity Check
No circularity: explicit algorithmic construction via RASP
full rationale
The paper establishes its central claim through an explicit algorithmic construction of a universal simulator U built from transformer encoders, which replicates arbitrary attention outputs and elementary operations (matrix multiplications, activations) by expressing them as RASP programs. This is a direct existence proof by construction that does not reduce any result to fitted parameters, self-referential definitions, or load-bearing self-citations. No equations or steps in the derivation chain are equivalent to their inputs by construction, and the approach relies on the formal properties of the RASP framework and standard transformer components without circular dependencies or renaming of known results. The derivation remains self-contained as an independent algorithmic demonstration rather than a statistical or definitional tautology.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption RASP programs can be realized by finite compositions of transformer encoder layers
- standard math Attention outputs are exactly expressible via elementary matrix multiplications and activations
invented entities (1)
-
Universal simulator U
no independent evidence
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.