Function graph transformers universally approximate operators between function spaces
Pith reviewed 2026-05-20 13:03 UTC · model grok-4.3
The pith
Transformers can approximate any nonlinear operator between function spaces when functions are lifted to graph measures.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Function graph transformers are graph-preserving maps from graph measures to graph measures that can be approximated arbitrarily well by finite sequences of softmax self-attention layers and pointwise multilayer perceptrons. This yields universal approximation theorems for wide families of nonlinear operators between function spaces. The same construction accommodates regularized negative-order Sobolev inputs and output query points defined on separate domains.
What carries the argument
Function graph transformers: a subclass of measure-theoretic transformers that preserve graph structure by mapping graph-supported measures to graph-supported measures, thereby guaranteeing single-valued function outputs while allowing approximation by standard transformer components.
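The two "standard transformer components" named above can be sketched on a small token set. This is a minimal illustration, not the paper's construction: the residual form of the attention layer, the absence of a scaling factor, and the matrices `Q`, `K`, `V`, `W1`, `W2` are all assumptions chosen for the example. It shows that both components treat the tokens as an unordered set, which is what allows them to be read as maps on empirical measures.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def matvec(M, v):
    return [dot(row, v) for row in M]

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(tokens, Q, K, V):
    """One residual softmax self-attention layer (illustrative form)."""
    keys = [matvec(K, t) for t in tokens]
    values = [matvec(V, t) for t in tokens]
    out = []
    for t in tokens:
        weights = softmax([dot(matvec(Q, t), k) for k in keys])
        mean_value = [sum(w * v[i] for w, v in zip(weights, values))
                      for i in range(len(t))]
        out.append([a + b for a, b in zip(t, mean_value)])
    return out

def pointwise_mlp(tokens, W1, b1, W2, b2):
    """The same two-layer ReLU network applied to every token."""
    out = []
    for t in tokens:
        hidden = [max(0.0, z + b) for z, b in zip(matvec(W1, t), b1)]
        out.append([z + b for z, b in zip(matvec(W2, hidden), b2)])
    return out

# Tokens (x_j, h(x_j)) sampled from h(x) = sin(2*pi*x) on [0, 1).
tokens = [[j / 8, math.sin(2 * math.pi * j / 8)] for j in range(8)]

# Hypothetical weights. The zero first row of V leaves x-coordinates unchanged.
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[2.0, 0.0], [0.0, 2.0]]
V = [[0.0, 0.0], [0.3, 0.5]]
W1 = [[1.0, 0.0], [0.0, 1.0], [1.0, -1.0]]
b1 = [0.0, 0.0, 0.1]
W2 = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.5]]
b2 = [0.0, 0.0]

layer_out = self_attention(tokens, Q, K, V)
net_out = pointwise_mlp(layer_out, W1, b1, W2, b2)

# Feeding the tokens in a different order only reorders the outputs.
shuffled = list(reversed(tokens))
net_out_shuffled = pointwise_mlp(self_attention(shuffled, Q, K, V), W1, b1, W2, b2)
```

In this toy setting the attention layer keeps every x-coordinate fixed, so its output is again a set of points with one y-value per x. That is only a hand-picked case of graph preservation; the paper's class is defined at the level of measures.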
If this is right
- Universal approximation holds for operators acting on regularized negative-order Sobolev function spaces.
- Output query locations may be chosen independently of the input discretization points.
- Refinement of discretizations corresponds to convergence in the space of measures.
- The roles of positional encodings and graph connectivity become explicit in the operator-learning setting.
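The third point above, that refining a discretization corresponds to convergence of measures, can be checked numerically in a toy case. The sketch below assumes a uniform reference measure on [0, 1], midpoint sample locations, the function h(x) = sin(2πx), and the test function φ(x, y) = x·y; none of these choices come from the paper. Integrating φ against the empirical graph measure should approach the exact value ∫ x sin(2πx) dx = -1/(2π) as the number of tokens grows.

```python
import math

def graph_tokens(h, n):
    """Tokens (x_j, h(x_j)) at the midpoints of n equal cells of [0, 1]."""
    return [((j + 0.5) / n, h((j + 0.5) / n)) for j in range(n)]

def integrate_against(phi, tokens):
    """Integral of phi against the empirical measure with equal weights 1/N."""
    return sum(phi(x, y) for x, y in tokens) / len(tokens)

def h(x):
    return math.sin(2 * math.pi * x)

def phi(x, y):
    return x * y

# Exact integral of phi against the graph measure of h (uniform reference measure).
exact = -1 / (2 * math.pi)

errors = [abs(integrate_against(phi, graph_tokens(h, n)) - exact)
          for n in (8, 32, 128)]
print(errors)
```

A single test function does not establish weak convergence, which requires the same behavior for every bounded continuous φ. The example only illustrates the mechanism.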
Where Pith is reading between the lines
- Similar graph-measure ideas could be applied to other architectures such as graph neural networks for operator learning.
- Practical training procedures might enforce the graph-preserving property through additional loss terms or architectural constraints.
- This viewpoint suggests new ways to prove discretization invariance for existing transformer-based PDE solvers.
Load-bearing premise
Representing functions by measures on their graphs and adopting a measure-theoretic view of transformers is general enough to include all operators one wishes to approximate.
What would settle it
The result would be falsified by a concrete nonlinear operator between function spaces, within the classes the theorems cover, that no finite composition of standard softmax attention layers and pointwise MLPs can approximate to within some fixed tolerance when functions are represented by their graph measures.
Original abstract
We study the approximation of nonlinear operators between function spaces by transformers. Our approach is to lift functions to measures supported on their graphs and leverage a recently introduced measure-theoretic view of transformers. A function $h$ is represented by its graph measure $\gamma_h$, with finite tokens $\{(x_j,h(x_j))\}_{j=1}^N$ being its empirical approximations. We show that this framework elegantly models discretization refinement via convergence of measures and provides a natural setting for operator learning. Within this framework, we introduce function graph transformers, a graph-preserving subclass of measure-theoretic transformers that maps graph measures to graph measures, which is to say that outputs remain single-valued functions. Crucially, this additional structure does not reduce generality: we prove that the resulting graph-preserving maps can be approximated by finite compositions of standard softmax self-attention layers and pointwise MLPs, yielding universal approximation results for broad classes of nonlinear operators. Unlike existing theoretical approaches to operator learning with transformers, the measure-theoretic framework also accommodates regularized negative-order Sobolev inputs for which discretization invariance is particularly challenging, as well as query points on different output domains. Overall, function graph transformers provide a continuum viewpoint and mathematical toolkit for transformer-based operator learning, clarifying the roles of positional encodings, graph structure, regularization, and ensuring consistency across discretizations.
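One way to write out the objects named in the abstract is sketched below. The reference probability measure $\mu$ on the input domain $\Omega$, the equal token weights $1/N$, and the pushforward notation are assumptions made here for concreteness; the paper's own normalization may differ.

```latex
% Assumption: \mu is a reference probability measure on the input domain \Omega.
\begin{align*}
  \gamma_h &:= (\mathrm{id}, h)_{\#}\mu,
  &
  \int \varphi \, d\gamma_h &= \int_{\Omega} \varphi\bigl(x, h(x)\bigr)\, d\mu(x), \\
  \gamma_h^{N} &:= \frac{1}{N} \sum_{j=1}^{N} \delta_{(x_j,\, h(x_j))},
  &
  \int \varphi \, d\gamma_h^{N} &= \frac{1}{N} \sum_{j=1}^{N} \varphi\bigl(x_j, h(x_j)\bigr).
\end{align*}
```

Under these assumptions, if $h$ is continuous and the empirical measures $\frac{1}{N}\sum_j \delta_{x_j}$ converge weakly to $\mu$, then $\gamma_h^N$ converges weakly to $\gamma_h$, because $x \mapsto (x, h(x))$ is continuous.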
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper lifts functions to measures supported on their graphs and uses a measure-theoretic view of transformers to introduce function graph transformers, a graph-preserving subclass that maps graph measures to graph measures (ensuring single-valued outputs). It claims to prove that these graph-preserving maps can be approximated arbitrarily closely by finite compositions of standard softmax self-attention layers and pointwise MLPs, yielding universal approximation for broad classes of nonlinear operators between function spaces. The framework is asserted to handle discretization refinement via measure convergence, regularized negative-order Sobolev inputs, and query points on mismatched domains without loss of generality.
Significance. If the central approximation result holds with the claimed preservation of graph support, the work supplies a continuum viewpoint and mathematical toolkit for transformer-based operator learning. It addresses discretization invariance and regularity challenges that are difficult for existing approaches, while clarifying roles of positional encodings and graph structure. The explicit accommodation of negative-order Sobolev inputs and cross-domain queries would be a notable advance if rigorously established.
major comments (2)
- [abstract and main approximation theorem] The central claim that graph-preserving maps can be approximated by unrestricted softmax attention layers without restricting the class of operators (stated in the abstract and developed in the main results) requires explicit verification that the limit preserves single-valued functional outputs. In weak measure metrics such as Wasserstein or weak-*, small perturbations can split mass across multiple y-values for the same x; the argument appears to rely on density of graph-preserving maps plus an implicit projection step whose details are not secured for negative-order Sobolev inputs or mismatched query domains.
- [framework section and universal approximation result] The framework assumes that lifting to graph measures combined with the prior measure-theoretic transformer view provides a sufficiently general setting without restricting approximable operators. However, the dependence on that prior work for the operator approximation result introduces grounding that is not fully external; the manuscript should clarify independence and verify that the graph-preservation constraint does not implicitly narrow the operator class for the Sobolev cases highlighted as a strength.
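The mass-splitting concern in the first major comment can be made concrete with a small example. The sketch below is an illustration under assumed choices (h(x) = x², ten sample points, equal token weights), not a statement about the paper's proof. It builds a token set that lies within transport cost ε of a graph measure yet carries two y-values at every x, so it is not the graph of a function.

```python
import math

def is_graph(tokens, tol=1e-12):
    """True if every x-location carries a single y-value."""
    seen = {}
    for x, y in tokens:
        if x in seen and abs(seen[x] - y) > tol:
            return False
        seen.setdefault(x, y)
    return True

def split_mass(tokens, eps):
    """Replace each token by two half-mass copies at heights y + eps and y - eps."""
    out = []
    for x, y in tokens:
        out.append((x, y + eps))
        out.append((x, y - eps))
    return out

def coupling_cost(tokens, split):
    """Cost of the coupling sending half of each token's mass to each of its copies.

    Relies on split_mass placing the two copies of token j at positions 2j, 2j+1.
    This cost is an upper bound on the Wasserstein-1 distance.
    """
    n = len(tokens)
    cost = 0.0
    for j, (x, y) in enumerate(tokens):
        for xs, ys in split[2 * j:2 * j + 2]:
            cost += (0.5 / n) * math.hypot(xs - x, ys - y)
    return cost

eps = 1e-3
tokens = [(j / 10, (j / 10) ** 2) for j in range(10)]  # graph of h(x) = x^2
perturbed = split_mass(tokens, eps)
cost = coupling_cost(tokens, perturbed)

print(is_graph(tokens), is_graph(perturbed), cost)
```

Since ε can be taken arbitrarily small, closeness in a Wasserstein metric does not by itself force single-valued outputs, which is why the referee asks for an explicit argument.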
minor comments (2)
- [introduction and framework] Notation for empirical graph measures (finite tokens {(x_j, h(x_j))}) and their convergence under discretization refinement should be made fully explicit with a dedicated definition or equation to aid readability.
- [section 2] The manuscript would benefit from a short table or diagram contrasting the function graph transformer construction with standard measure-theoretic transformers to highlight the graph-preservation mechanism.
Simulated Author's Rebuttal
We thank the referee for the careful reading and insightful comments. We address each major point below, indicating the revisions we will incorporate to strengthen the manuscript while preserving the core contributions.
- Referee: [abstract and main approximation theorem] The central claim that graph-preserving maps can be approximated by unrestricted softmax attention layers without restricting the class of operators (stated in the abstract and developed in the main results) requires explicit verification that the limit preserves single-valued functional outputs. In weak measure metrics such as Wasserstein or weak-*, small perturbations can split mass across multiple y-values for the same x; the argument appears to rely on density of graph-preserving maps plus an implicit projection step whose details are not secured for negative-order Sobolev inputs or mismatched query domains.
  Authors: We agree that explicit verification of preservation under limits is necessary for full rigor. In the revised manuscript we will insert a new lemma establishing that the weak-* limit of a sequence of graph-preserving maps remains graph-preserving when the underlying measures arise from functions in the regularized negative-order Sobolev spaces considered in the paper. The lemma will also treat the projection onto graph measures explicitly, showing that the projection is continuous in the Wasserstein metric for the relevant function classes and that it introduces no additional error that would affect the universal-approximation guarantee. The same argument extends directly to query points on mismatched domains by viewing the query as a marginal of the lifted measure. These additions will be placed immediately after the statement of the main approximation theorem.
  Revision: yes
- Referee: [framework section and universal approximation result] The framework assumes that lifting to graph measures combined with the prior measure-theoretic transformer view provides a sufficiently general setting without restricting approximable operators. However, the dependence on that prior work for the operator approximation result introduces grounding that is not fully external; the manuscript should clarify independence and verify that the graph-preservation constraint does not implicitly narrow the operator class for the Sobolev cases highlighted as a strength.
  Authors: We will add a dedicated paragraph in the framework section that separates the contributions. The measure-theoretic transformer construction is taken as background, but two results are proved in a self-contained way in our Theorems 3.4 and 4.2: the density of graph-preserving maps within the space of all continuous maps on graph measures, and their approximation by standard softmax attention. Because every operator between the function spaces lifts uniquely to a graph-preserving map on the corresponding graph measures, the restriction to graph-preserving maps does not reduce the class of approximable operators. A short appendix subsection will verify that the same density and approximation statements hold uniformly for the regularized negative-order Sobolev inputs, confirming that the highlighted strength is retained.
  Revision: yes
Circularity Check
No significant circularity; derivation relies on independent proofs within the measure-theoretic framework.
The paper defines function graph transformers as a graph-preserving subclass of measure-theoretic transformers and claims to prove that such maps can be approximated by standard softmax attention plus MLPs, yielding universal operator approximation. This is presented as a mathematical result rather than a reduction by construction, self-definition, or fitted input. The reliance on a 'recently introduced measure-theoretic view' is a citation to prior work; per the review guidelines, a cited result counts as independent support unless it is shown to reduce the central claim to an unverified self-citation chain or ansatz. No equation or step in the provided text exhibits such a reduction (for example, an equation that is equivalent to its input by definition, or a prediction forced by a fit). For the stated universal approximation claims the framework is self-contained with respect to external benchmarks, so finding no circularity is the expected outcome.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: Functions can be represented by measures supported on their graphs, with empirical approximations given by finite tokens.
- domain assumption: The recently introduced measure-theoretic view of transformers extends to graph measures for operator learning.
invented entities (1)
- function graph transformer: no independent evidence
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, theorem washburn_uniqueness_aczel
  Tag: unclear (relation between the paper passage and the cited Recognition theorem).
  Paper passage: functions are represented by graph measures and transformers by graph-preserving measure maps, yielding universality results that extend to negative-order Sobolev spaces
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.