On the Existence of Universal Simulators of Attention

Anish Chakrabarty; Debanjan Dutta; Faizanuddin Ansari; Swagatam Das

arxiv: 2506.18739 · v2 · submitted 2025-06-23 · 💻 cs.LG · cs.AI

On the Existence of Universal Simulators of Attention

Debanjan Dutta , Anish Chakrabarty , Faizanuddin Ansari , Swagatam Das This is my paper

Pith reviewed 2026-05-19 07:24 UTC · model grok-4.3

classification 💻 cs.LG cs.AI

keywords transformerattention mechanismRASPuniversal simulatorexpressivitysimulationmachine learningdeterministic computation

0 comments

The pith

A transformer encoder can exactly simulate any attention mechanism by implementing it through a universal composition of encoders.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that transformers are not limited to learning attention patterns from data but can deterministically replicate any given attention mechanism and its core matrix and activation steps. It constructs a universal simulator U made entirely of standard transformer encoders that follows the RASP model to carry out these operations exactly. A reader would care because earlier work offered only probabilistic guarantees from training, while this supplies an algorithmic, data-independent route to the same result. The approach therefore connects theoretical expressivity proofs with concrete computational capability inside the transformer architecture itself.

Core claim

By constructing a universal simulator U composed of transformer encoders, we present algorithmic solutions to replicate attention outputs and the underlying elementary matrix and activation operations via RASP, showing the existence of an algorithmically achievable, data-agnostic solution.

What carries the argument

The universal simulator U, a stack of transformer encoders that encodes RASP programs to perform the matrix multiplications, softmax, and other primitives of attention.

If this is right

Attention mechanisms become exactly reproducible inside transformers without any training data or optimization.
The same simulator can handle the elementary matrix operations and nonlinear activations that define attention.
Transformers gain a deterministic way to match the behavior of any attention function they are asked to emulate.
Results on expressivity now include an explicit algorithmic construction rather than existence proofs alone.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The construction might be adapted to simulate attention variants that include custom positional encodings or masking rules.
One could check whether the same encoder stack can also simulate attention inside decoder-only or encoder-decoder transformers.
If the simulator works, it opens a route to verify whether learned attention heads have converged to one of the exactly simulable mechanisms.

Load-bearing premise

The RASP model of computation can be implemented inside ordinary transformer encoders without special positional encodings, layer norms, or residual connections that would break the simulation for arbitrary attention variants.

What would settle it

Running the constructed U on a concrete input sequence and attention variant and finding that its output deviates from the true attention output would show that no such universal simulator exists.

read the original abstract

Previous work on the learnability of transformers \textemdash\ focused primarily on examining their ability to approximate specific algorithmic patterns through training \textemdash\ has largely been data-driven, offering only probabilistic guarantees rather than deterministic solutions. Expressivity, on the contrary, has been devised to address the problems \emph{computable} by such architecture theoretically. These results proved the Turing-completeness of transformers, investigated bounds focused on circuit complexity, and formal logic. Being at the crossroad between learnability and expressivity, the question remains: \emph{can a transformer, as a computational model, simulate an arbitrary attention mechanism, or in particular, the underlying operations?} In this study, we investigate the transformer encoder's ability to simulate a vanilla attention mechanism. By constructing a universal simulator $\mathcal{U}$ composed of transformer encoders, we present algorithmic solutions to replicate attention outputs and the underlying elementary matrix and activation operations via RASP, a formal framework for transformer computation. We show the existence of an algorithmically achievable, data-agnostic solution, previously known to be approximated only by learning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper claims a deterministic universal simulator for arbitrary attention via RASP programs inside transformer encoders, but the abstract gives no construction details and the embedding into vanilla encoders looks questionable.

read the letter

The main takeaway is that the authors want to replace probabilistic training guarantees with an explicit algorithmic construction: a stack of transformer encoders that can simulate any attention mechanism and its elementary operations by expressing them as RASP programs. This is positioned as a bridge between learnability results and expressivity work on Turing completeness and circuit complexity. If the construction holds, it would give a data-agnostic way to replicate attention outputs without fitting parameters. That direction is worth noting because most prior work stays in the probabilistic regime or proves broad computability without targeting the attention operation itself. The paper does at least name the target clearly and tries to make the simulation algorithmic rather than learned. The soft spot is the missing substance. The abstract states the existence of the simulator U but supplies no proof sketch, no sample RASP program, and no check that it covers edge cases like varying head counts or masking. The stress-test concern about positional encodings and residual paths is reasonable on the evidence given; RASP constructions for selection and indexing often rely on explicit position tokens or non-standard routing that standard encoder blocks do not guarantee. If the paper needs those extras to make the simulation work, the claim that it simulates arbitrary vanilla attention does not go through. The work is aimed at people who already follow RASP and transformer expressivity papers. A reader who wants concrete constructions or reproducible programs will find little to use here yet. It is coherent enough on its own terms to deserve referee time, mainly to see whether the full manuscript actually delivers the RASP program and addresses the compilation assumptions. I would send it out rather than desk reject, but the bar for acceptance would be high once the details are visible.

Referee Report

2 major / 1 minor

Summary. The paper claims that a transformer encoder can simulate an arbitrary attention mechanism by constructing a universal simulator U composed of transformer encoders. This simulator replicates attention outputs and the underlying elementary operations (such as matrix multiplications and activations like softmax) by expressing them as RASP programs, thereby establishing the existence of an algorithmically achievable, data-agnostic solution where prior work provided only probabilistic guarantees through learning.

Significance. If the construction is valid and faithful to standard transformer architectures, the result would strengthen the theoretical understanding of transformer expressivity by providing deterministic, algorithmic simulation of attention rather than learned approximations. This bridges learnability and expressivity results and could inform formal analyses of transformer capabilities in a RASP-based framework.

major comments (2)

[Abstract] Abstract: The abstract states that a construction of the universal simulator U exists but provides no proof sketch, explicit RASP program, or verification that the simulator handles all edge cases of attention (e.g., different head counts or masking). This leaves a derivation gap for the central claim.
[Main construction] Main construction (implied in the RASP-to-transformer compilation): The claim that RASP programs compile directly into standard encoder blocks assumes faithful realization without additional assumptions on positional encodings, layer norms, or residual connections. Typical RASP constructions for selection, copying, and indexing use explicit position tokens or custom encodings that are not part of the minimal transformer encoder definition (multi-head attention + FFN + residual + layer-norm), which risks breaking the simulation for arbitrary attention variants.

minor comments (1)

[Notation and figures] Ensure all symbols (e.g., U, RASP) are consistently formatted in math mode and that any figures illustrating the simulator architecture are clearly labeled with component connections.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments, which highlight important aspects of clarity and architectural fidelity in our work on universal simulators for attention. We address each major comment point by point below, providing clarifications based on the manuscript's constructions while noting where revisions will strengthen the presentation.

read point-by-point responses

Referee: [Abstract] Abstract: The abstract states that a construction of the universal simulator U exists but provides no proof sketch, explicit RASP program, or verification that the simulator handles all edge cases of attention (e.g., different head counts or masking). This leaves a derivation gap for the central claim.

Authors: We acknowledge that the abstract, being concise, omits a detailed sketch. The full manuscript (Section 3 and Appendix A) establishes the existence of U via an explicit RASP program that composes transformer encoder blocks to simulate the attention matrix multiplication, softmax activation, and output projection. Multi-head attention is handled by replicating the single-head RASP simulation across heads in parallel (using the multi-head attention primitive available in the target architecture), and masking is implemented through RASP's selection and indexing operations that condition on position or padding indicators. We will revise the abstract to include a brief proof sketch outlining the RASP compilation steps and note the handling of these edge cases. revision: yes
Referee: [Main construction] Main construction (implied in the RASP-to-transformer compilation): The claim that RASP programs compile directly into standard encoder blocks assumes faithful realization without additional assumptions on positional encodings, layer norms, or residual connections. Typical RASP constructions for selection, copying, and indexing use explicit position tokens or custom encodings that are not part of the minimal transformer encoder definition (multi-head attention + FFN + residual + layer-norm), which risks breaking the simulation for arbitrary attention variants.

Authors: The construction in the manuscript uses the standard transformer encoder block exactly as defined (multi-head self-attention, feed-forward network, residual connections, and layer normalization), with fixed sinusoidal positional encodings added to the input embeddings as in the original Vaswani et al. architecture. RASP programs for selection and copying are realized using the attention mechanism itself to attend over positions, without requiring non-standard encodings; the positional information is provided via the standard sinusoidal embeddings, which are sufficient for the indexing operations in our simulator. We will add a clarifying subsection in the main text (and expand the appendix) to explicitly map each RASP primitive to the corresponding transformer components, confirming compatibility with arbitrary attention variants under the standard definition. revision: partial

Circularity Check

0 steps flagged

No circularity: explicit algorithmic construction via RASP

full rationale

The paper establishes its central claim through an explicit algorithmic construction of a universal simulator U built from transformer encoders, which replicates arbitrary attention outputs and elementary operations (matrix multiplications, activations) by expressing them as RASP programs. This is a direct existence proof by construction that does not reduce any result to fitted parameters, self-referential definitions, or load-bearing self-citations. No equations or steps in the derivation chain are equivalent to their inputs by construction, and the approach relies on the formal properties of the RASP framework and standard transformer components without circular dependencies or renaming of known results. The derivation remains self-contained as an independent algorithmic demonstration rather than a statistical or definitional tautology.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entities

The construction relies on the prior RASP framework and standard assumptions about matrix operations and transformer components; no new free parameters or invented entities with independent evidence are introduced in the abstract.

axioms (2)

domain assumption RASP programs can be realized by finite compositions of transformer encoder layers
Invoked when claiming the universal simulator U is composed of transformer encoders
standard math Attention outputs are exactly expressible via elementary matrix multiplications and activations
Used to reduce attention simulation to replicable operations

invented entities (1)

Universal simulator U no independent evidence
purpose: To replicate arbitrary attention mechanisms deterministically
New construction introduced to achieve data-agnostic simulation

pith-pipeline@v0.9.0 · 5731 in / 1301 out tokens · 34130 ms · 2026-05-19T07:24:51.185231+00:00 · methodology

On the Existence of Universal Simulators of Attention

Core claim

What carries the argument

If this is right

Where Pith is reading between the lines

Load-bearing premise

What would settle it

discussion (0)