On the Expressive Power of Contextual Relations in Transformers
Pith reviewed 2026-05-21 10:03 UTC · model grok-4.3
The pith
Transformers can approximate arbitrary contextual relations using softmax attention or Sinkhorn normalization.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Within a measure-theoretic framework in which contextual relations are modeled as conditional distributions or as joint distributions (couplings), Transformer architectures can approximate arbitrary contextual relation rules using either standard softmax attention or Sinkhorn normalization, and the choice of normalization determines how the relations are represented.
What carries the argument
Measure-theoretic modeling of contextual relations as probabilistic objects, with attention serving as normalization of an affinity function that unifies softmax attention with entropy-regularized optimal transport.
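The two normalizations can be compared with a minimal NumPy sketch. It is not taken from the paper: the function names `softmax_attention` and `sinkhorn_attention`, the random `Q` and `K`, the iteration count, and the choice of uniform marginals are all illustrative assumptions. Softmax normalizes each row of the exponentiated affinity, giving one conditional distribution per query, while Sinkhorn alternates row and column rescaling so that both marginals are constrained.

```python
import numpy as np

def softmax_attention(scores):
    """Row-normalize exp(scores): each row becomes a conditional distribution."""
    shifted = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(shifted)
    return weights / weights.sum(axis=1, keepdims=True)

def sinkhorn_attention(scores, n_iters=500):
    """Alternately rescale rows and columns of exp(scores) (Sinkhorn iterations).

    For a square positive matrix this approaches a doubly stochastic matrix,
    which is a coupling of two uniform marginals up to a factor of 1/n.
    """
    kernel = np.exp(scores - scores.max())  # strictly positive kernel
    for _ in range(n_iters):
        kernel = kernel / kernel.sum(axis=1, keepdims=True)  # rows sum to 1
        kernel = kernel / kernel.sum(axis=0, keepdims=True)  # columns sum to 1
    return kernel

# Hypothetical toy inputs: n tokens embedded in dimension d.
rng = np.random.default_rng(0)
n, d = 5, 3
Q = rng.normal(size=(n, d))
K = rng.normal(size=(n, d))
scores = Q @ K.T / np.sqrt(d)  # affinity function evaluated on all token pairs

P_softmax = softmax_attention(scores)    # rows sum to 1, columns unconstrained
P_sinkhorn = sinkhorn_attention(scores)  # rows and columns both sum to 1
```

Both outputs come from the same affinity matrix. Only the normalization differs, which is the sense in which the summary says the choice of normalization determines how a relation is represented.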
Load-bearing premise
Contextual relations can be faithfully represented as probabilistic conditional or joint distributions inside a measure-theoretic framework.
What would settle it
A concrete contextual relation rule, defined on a fixed finite set of tokens, that no finite-depth Transformer using either softmax attention or Sinkhorn normalization can approximate to arbitrary accuracy.
Original abstract
Transformer architectures have achieved remarkable empirical success in modeling contextual relations, yet a clear understanding of their expressive power is still lacking. In this work, we introduce a measure-theoretic framework in which contextual relations are modeled as probabilistic objects, either as conditional distributions or as joint distributions (couplings). This perspective reveals a natural connection between standard softmax attention and entropy-regularized optimal transport, providing a unified view of attention as a normalization of an underlying affinity function. Within this framework, we establish a universal approximation theorem for contextual systems using standard Softmax Attention and alternately Sinkhorn normalization. These results show that Transformer architectures can approximate arbitrary contextual relations rules, and that the choice of normalization determines how these relations are represented. Moreover, they provide a principled explanation for why Transformers are effective at modeling contextual relations.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents a measure-theoretic framework for understanding contextual relations in Transformer models, representing them as conditional distributions or couplings. It establishes a connection between softmax attention and entropy-regularized optimal transport, and proves a universal approximation theorem showing that standard Transformer architectures with softmax attention or Sinkhorn normalization can approximate arbitrary contextual relation rules.
Significance. Should the universal approximation theorem be rigorously proven, this work would provide a significant theoretical foundation for the expressive power of Transformers in modeling contextual dependencies. It offers a unified view linking attention mechanisms to optimal transport, which could explain their empirical success and guide future architectural designs.
major comments (1)
- [Section 4] The universal approximation theorem claims that standard Softmax Attention can approximate arbitrary contextual relations. However, the affinity function is realized via query-key products QK^T in fixed dimension d, leading to rank-at-most-d affinity matrices. For discrete supports of size n > d, this may not suffice to approximate general couplings, as noted in the stress-test concern. Please address whether the theorem requires d to scale with the support size or how the result holds for fixed d.
minor comments (1)
- The presentation of the measure-theoretic framework could benefit from an early explicit equation defining the affinity function and its normalization via softmax or Sinkhorn.
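One form such an equation could take is sketched below. The notation is an assumption for illustration and may not match the paper's: $a$ is the affinity, $x_i$ and $y_j$ are tokens, $W_Q$ and $W_K$ are query and key maps into dimension $d$, $u$ and $v$ are Sinkhorn scaling vectors, and $\mu$, $\nu$ are the prescribed marginals.

```latex
% Affinity between tokens x_i and y_j (assumed bilinear query-key form)
a(x_i, y_j) = \frac{\langle W_Q x_i,\, W_K y_j \rangle}{\sqrt{d}}

% Softmax: row normalization, giving a conditional distribution over j for each i
P^{\mathrm{soft}}_{ij} = \frac{\exp a(x_i, y_j)}{\sum_{k} \exp a(x_i, y_k)}

% Sinkhorn: scalings u, v chosen so that both marginals are matched (a coupling)
P^{\mathrm{sink}}_{ij} = u_i \, \exp a(x_i, y_j) \, v_j,
\qquad \sum_{j} P^{\mathrm{sink}}_{ij} = \mu_i,
\qquad \sum_{i} P^{\mathrm{sink}}_{ij} = \nu_j
```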
Simulated Author's Rebuttal
We thank the referee for their careful reading of the manuscript and for highlighting an important technical point regarding the role of the embedding dimension in the universal approximation result. We address this comment below.
Point-by-point responses
- Referee: [Section 4] The universal approximation theorem claims that standard Softmax Attention can approximate arbitrary contextual relations. However, the affinity function is realized via query-key products QK^T in fixed dimension d, leading to rank-at-most-d affinity matrices. For discrete supports of size n > d, this may not suffice to approximate general couplings, as noted in the stress-test concern. Please address whether the theorem requires d to scale with the support size or how the result holds for fixed d.
  Authors: We agree that the query-key product produces an affinity matrix of rank at most d, which for a fixed d and discrete support size n > d cannot represent arbitrary couplings. Our universal approximation theorem is formulated in the standard sense of approximation theory: for any target contextual relation (conditional distribution or coupling) and any desired accuracy, there exists a Transformer architecture with sufficiently large embedding dimension d (along with appropriate depth or number of heads if needed) that approximates the target arbitrarily closely. This is analogous to classical universal approximation theorems for neural networks, where the hidden dimension is permitted to grow with the target complexity. We will revise Section 4 to state this dependence explicitly, add a remark on the rank limitation for fixed d, and discuss the practical implications when d is held constant while n grows.
  Revision: yes
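The rank limitation discussed in this exchange can be checked numerically. The sketch below is an illustration, not code from the paper. The sizes `n = 8` and `d = 2` and the random matrices are arbitrary assumptions chosen so that the support size exceeds the embedding dimension.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 8, 2  # illustrative sizes with support size n > embedding dimension d

Q = rng.normal(size=(n, d))  # hypothetical query embeddings
K = rng.normal(size=(n, d))  # hypothetical key embeddings
affinity = Q @ K.T           # n x n matrix of query-key scores

# The product of an (n x d) and a (d x n) matrix has rank at most d.
rank_affinity = np.linalg.matrix_rank(affinity)

# A generic n x n matrix of target log-affinities has full rank n.
target = rng.normal(size=(n, n))
rank_target = np.linalg.matrix_rank(target)

print(rank_affinity, rank_target)
```

The bound applies to the scores before normalization. The entrywise exponential inside softmax or Sinkhorn can raise the rank of the resulting attention matrix, so this check illustrates only the constraint on the affinity itself, not a limit on what the normalized output can represent.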
Circularity Check
No significant circularity in the universal approximation theorem derivation
The paper introduces a measure-theoretic framework in which contextual relations are modeled as conditional distributions or couplings, then connects standard softmax attention to entropy-regularized optimal transport as a normalization of an affinity function. Within this framework a universal approximation theorem is stated for Transformers using softmax attention or Sinkhorn normalization. No load-bearing step reduces by construction to the inputs: the theorem is not obtained by fitting parameters to data and relabeling the fit as a prediction, nor by self-defining the target relations in terms of the attention mechanism itself. No self-citation chain or uniqueness theorem imported from prior author work is invoked to force the result. The derivation therefore remains self-contained with independent mathematical content.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: Contextual relations admit a measure-theoretic representation as conditional or joint distributions.
- domain assumption: Attention mechanisms can be viewed as normalizations of an underlying affinity function.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
  Relation between the paper passage and the cited Recognition theorem: unclear
  Paper passage: universal approximation theorem for contextual systems using standard Softmax Attention and alternately Sinkhorn normalization... affinity function... entropy-regularized optimal transport
- IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction
  Relation between the paper passage and the cited Recognition theorem: unclear
  Paper passage: measure-theoretic framework... texts as probability measures... Wasserstein distance... coupling systems
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 2 Pith papers
- One Operator for Many Densities: Amortized Approximation of Conditioning by Neural Operators
  A single neural operator can approximate the map from arbitrary joint densities to their conditionals, backed by new continuity results and illustrated on Gaussian mixtures.
- One Operator for Many Densities: Amortized Approximation of Conditioning by Neural Operators
  A single neural operator can approximate the map from joint densities to conditional densities to arbitrary accuracy, with a proof based on continuity of the conditioning operator and a demonstration on Gaussian mixtures.