arxiv: 2603.25860 · v3 · pith:DIYGYPBPnew · submitted 2026-03-26 · 📊 stat.ML · cs.LG

On the Expressive Power of Contextual Relations in Transformers

Pith reviewed 2026-05-21 10:03 UTC · model grok-4.3

keywords transformers · expressive power · universal approximation · attention mechanisms · optimal transport · contextual relations · softmax · Sinkhorn normalization

The pith

Transformers can approximate arbitrary contextual relations using softmax attention or Sinkhorn normalization.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a measure-theoretic framework modeling contextual relations as probabilistic objects, either conditional distributions or joint distributions called couplings. It links standard softmax attention to entropy-regularized optimal transport by treating attention as a normalization step applied to an underlying affinity function. The central result is a universal approximation theorem showing that Transformers equipped with either standard softmax attention or Sinkhorn normalization can approximate any contextual relation rule to arbitrary accuracy. A sympathetic reader would care because this supplies a rigorous mathematical account of why Transformers succeed on context-heavy tasks and clarifies how different normalizations shape the relations they represent.
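One standard way to make the link between softmax and entropy-regularized optimal transport concrete is sketched below. The notation is an assumption introduced here for illustration (affinity matrix $A$, marginals $a$ and $b$, regularization strength $\varepsilon$) and is not taken from the paper, whose own formulation is measure-theoretic and may differ.

```latex
% Assumed notation: A \in \mathbb{R}^{n \times m} is an affinity matrix,
% a and b are probability vectors (row and column marginals), \varepsilon > 0.
\begin{align*}
  H(P) &= -\sum_{i,j} P_{ij} \log P_{ij}, \\
  % One-sided problem: only the row marginal is constrained.
  P^{\mathrm{soft}} &= \arg\max_{P \ge 0,\ P\mathbf{1} = a}
      \ \langle P, A \rangle + \varepsilon H(P),
  \qquad
  P^{\mathrm{soft}}_{ij} = a_i \,
      \frac{\exp(A_{ij}/\varepsilon)}{\sum_{k} \exp(A_{ik}/\varepsilon)}, \\
  % Two-sided problem: both marginals are constrained.
  P^{\mathrm{sink}} &= \arg\max_{P \ge 0,\ P\mathbf{1} = a,\ P^{\top}\mathbf{1} = b}
      \ \langle P, A \rangle + \varepsilon H(P),
  \qquad
  P^{\mathrm{sink}} = \operatorname{diag}(u)\, \exp(A/\varepsilon)\, \operatorname{diag}(v).
\end{align*}
```

In this sketch, dividing row $i$ of $P^{\mathrm{soft}}$ by $a_i$ gives a row-wise softmax of $A/\varepsilon$, which matches scaled dot-product attention when $A_{ij} = \langle q_i, k_j \rangle / \sqrt{d}$ and $\varepsilon = 1$. The scaling vectors $u$ and $v$ in the two-sided case are the ones computed by Sinkhorn iterations.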

Core claim

Within a measure-theoretic framework in which contextual relations are modeled as conditional distributions or as joint distributions (couplings), Transformer architectures using standard softmax attention or, alternatively, Sinkhorn normalization can approximate arbitrary contextual relation rules, with the choice of normalization determining how the relations are represented.

What carries the argument

Measure-theoretic modeling of contextual relations as probabilistic objects, with attention serving as normalization of an affinity function that unifies softmax attention with entropy-regularized optimal transport.
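The two normalizations can be compared on a small discrete example. The following is a minimal NumPy sketch, not the paper's construction: the function names `softmax_rows` and `sinkhorn`, the random affinity matrix, and the uniform marginals are all assumptions made for illustration.

```python
import numpy as np

def softmax_rows(affinity):
    """Row-wise softmax: each row becomes a conditional distribution."""
    shifted = affinity - affinity.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(shifted)
    return weights / weights.sum(axis=1, keepdims=True)

def sinkhorn(affinity, row_marginal, col_marginal, n_iters=500):
    """Alternately rescale rows and columns of exp(affinity) toward both marginals."""
    # A global shift only rescales the kernel, which the scaling vectors absorb.
    kernel = np.exp(affinity - affinity.max())
    u = np.ones_like(row_marginal)
    v = np.ones_like(col_marginal)
    for _ in range(n_iters):
        u = row_marginal / (kernel @ v)    # match row sums
        v = col_marginal / (kernel.T @ u)  # match column sums
    return u[:, None] * kernel * v[None, :]

rng = np.random.default_rng(0)
n, m = 4, 5
affinity = rng.normal(size=(n, m))   # hypothetical affinity scores
row_marginal = np.full(n, 1.0 / n)   # assumed uniform marginals
col_marginal = np.full(m, 1.0 / m)

conditional = softmax_rows(affinity)                        # rows sum to 1
coupling = sinkhorn(affinity, row_marginal, col_marginal)   # joint with fixed marginals
```

Under these assumptions, `conditional` is row-stochastic with unconstrained column sums, while `coupling` is a joint distribution whose row and column sums match the prescribed marginals up to the convergence tolerance of the iteration.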

Load-bearing premise

Contextual relations can be faithfully represented as probabilistic conditional or joint distributions inside a measure-theoretic framework.
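In the finite discrete case, the two representations named in this premise are interchangeable once a marginal is fixed. The sketch below illustrates this with a hypothetical 2-by-3 joint table; the numbers and variable names are assumptions chosen for illustration only.

```python
import numpy as np

# Hypothetical joint distribution (coupling) over 2 contexts and 3 targets.
coupling = np.array([[0.10, 0.20, 0.10],
                     [0.30, 0.05, 0.25]])

# The row marginal is the distribution over contexts.
row_marginal = coupling.sum(axis=1)

# Conditioning: divide each row by its marginal mass.
conditional = coupling / row_marginal[:, None]

# Recombining the marginal with the conditional recovers the coupling.
rebuilt = row_marginal[:, None] * conditional
```

The conditional table discards the row marginal, which is why a coupling carries strictly more information than the conditional distribution derived from it.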

What would settle it

A concrete contextual relation rule that cannot be approximated to any desired accuracy by a finite-depth Transformer using either softmax attention or Sinkhorn normalization on a fixed finite set of tokens.

Original abstract

Transformer architectures have achieved remarkable empirical success in modeling contextual relations, yet a clear understanding of their expressive power is still lacking. In this work, we introduce a measure-theoretic framework in which contextual relations are modeled as probabilistic objects, either as conditional distributions or as joint distributions (couplings). This perspective reveals a natural connection between standard softmax attention and entropy-regularized optimal transport, providing a unified view of attention as a normalization of an underlying affinity function. Within this framework, we establish a universal approximation theorem for contextual systems using standard Softmax Attention and alternately Sinkhorn normalization. These results show that Transformer architectures can approximate arbitrary contextual relations rules, and that the choice of normalization determines how these relations are represented. Moreover, they provide a principled explanation for why Transformers are effective at modeling contextual relations.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The manuscript presents a measure-theoretic framework for understanding contextual relations in Transformer models, representing them as conditional distributions or couplings. It establishes a connection between softmax attention and entropy-regularized optimal transport, and proves a universal approximation theorem showing that standard Transformer architectures with softmax attention or Sinkhorn normalization can approximate arbitrary contextual relation rules.

Significance. Should the universal approximation theorem be rigorously proven, this work would provide a significant theoretical foundation for the expressive power of Transformers in modeling contextual dependencies. It offers a unified view linking attention mechanisms to optimal transport, which could explain their empirical success and guide future architectural designs.

major comments (1)
  1. [Section 4] The universal approximation theorem claims that standard Softmax Attention can approximate arbitrary contextual relations. However, the affinity function is realized via query-key products QK^T in fixed dimension d, leading to rank-at-most-d affinity matrices. For discrete supports of size n > d, this may not suffice to approximate general couplings, as noted in the stress-test concern. Please address whether the theorem requires d to scale with the support size or how the result holds for fixed d.
minor comments (1)
  1. The presentation of the measure-theoretic framework could benefit from an early explicit equation defining the affinity function and its normalization via softmax or Sinkhorn.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their careful reading of the manuscript and for highlighting an important technical point regarding the role of the embedding dimension in the universal approximation result. We address this comment below.

Point-by-point responses
  1. Referee: [Section 4] The universal approximation theorem claims that standard Softmax Attention can approximate arbitrary contextual relations. However, the affinity function is realized via query-key products QK^T in fixed dimension d, leading to rank-at-most-d affinity matrices. For discrete supports of size n > d, this may not suffice to approximate general couplings, as noted in the stress-test concern. Please address whether the theorem requires d to scale with the support size or how the result holds for fixed d.

    Authors: We agree that the query-key product produces an affinity matrix of rank at most d, which for a fixed d and discrete support size n > d cannot represent arbitrary couplings. Our universal approximation theorem is formulated in the standard sense of approximation theory: for any target contextual relation (conditional distribution or coupling) and any desired accuracy, there exists a Transformer architecture with sufficiently large embedding dimension d (along with appropriate depth or number of heads if needed) that approximates the target arbitrarily closely. This is analogous to classical universal approximation theorems for neural networks, where the hidden dimension is permitted to grow with the target complexity. We will revise Section 4 to state this dependence explicitly, add a remark on the rank limitation for fixed d, and discuss the practical implications when d is held constant while n grows.

    Revision: yes
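The rank bound at issue in this exchange can be checked numerically. The following is a minimal NumPy sketch with hypothetical sizes (n = 8 tokens, d = 3 embedding dimensions) and random queries and keys; it illustrates only the bound on the pre-normalization affinity matrix and is not a reproduction of anything in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 8, 3  # hypothetical support size and embedding dimension, with n > d

queries = rng.normal(size=(n, d))
keys = rng.normal(size=(n, d))

# Query-key affinity matrix: an n-by-n product of n-by-d and d-by-n factors.
affinity = queries @ keys.T
affinity_rank = np.linalg.matrix_rank(affinity)  # bounded above by d

# A generic n-by-n affinity matrix, for comparison, is typically full rank.
target = rng.normal(size=(n, n))
target_rank = np.linalg.matrix_rank(target)
```

The bound applies to the affinity matrix before normalization. Elementwise exponentiation followed by row or Sinkhorn scaling does not preserve rank in general, so this sketch says nothing on its own about the rank of the normalized attention matrix.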

Circularity Check

0 steps flagged

No significant circularity in the universal approximation theorem derivation

Full rationale

The paper introduces a measure-theoretic framework in which contextual relations are modeled as conditional distributions or couplings, then connects standard softmax attention to entropy-regularized optimal transport as a normalization of an affinity function. Within this framework a universal approximation theorem is stated for Transformers using softmax attention or Sinkhorn normalization. No load-bearing step reduces by construction to the inputs: the theorem is not obtained by fitting parameters to data and relabeling the fit as a prediction, nor by self-defining the target relations in terms of the attention mechanism itself. No self-citation chain or uniqueness theorem imported from prior author work is invoked to force the result. The derivation therefore remains self-contained with independent mathematical content.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

Based solely on the abstract, the framework relies on standard measure-theoretic assumptions and the modeling choice of contextual relations as distributions; no explicit free parameters or invented entities are mentioned.

axioms (2)
  • domain assumption: Contextual relations admit a measure-theoretic representation as conditional or joint distributions.
    Invoked to set up the probabilistic modeling of relations before linking to attention.
  • domain assumption: Attention mechanisms can be viewed as normalizations of an underlying affinity function.
    Central modeling step that enables the connection to entropy-regularized optimal transport.

pith-pipeline@v0.9.0 · 5651 in / 1238 out tokens · 50219 ms · 2026-05-21T10:03:57.629758+00:00 · methodology

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. One Operator for Many Densities: Amortized Approximation of Conditioning by Neural Operators

    stat.ML · 2026-05 · unverdicted · novelty 7.0

    A single neural operator can approximate the map from arbitrary joint densities to their conditionals, backed by new continuity results and illustrated on Gaussian mixtures.

  2. One Operator for Many Densities: Amortized Approximation of Conditioning by Neural Operators

    stat.ML · 2026-05 · unverdicted · novelty 6.0

    A single neural operator can approximate the map from joint densities to conditional densities to arbitrary accuracy, with a proof based on continuity of the conditioning operator and a demonstration on Gaussian mixtures.