
arxiv: 2605.17968 · v1 · pith:5L4FWBGSnew · submitted 2026-05-18 · 💻 cs.LG

Function graph transformers universally approximate operators between function spaces

Pith reviewed 2026-05-20 13:03 UTC · model grok-4.3

classification 💻 cs.LG
keywords function graph transformers · universal approximation · operator learning · measure theoretic transformers · self-attention · discretization invariance · Sobolev spaces · nonlinear operators

The pith

Transformers can approximate any nonlinear operator between function spaces when functions are lifted to graph measures.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to show that transformers can learn nonlinear operators between function spaces in a discretization-invariant manner. It does this by lifting each function to a measure supported on its graph and applying a measure-theoretic perspective on transformers. The key step is introducing function graph transformers that preserve the graph structure so that outputs remain valid functions. This structure still permits universal approximation through compositions of ordinary attention layers and MLPs, covering operators on Sobolev spaces and other challenging settings.

Core claim

Function graph transformers are graph-preserving maps from graph measures to graph measures that can be approximated arbitrarily well by finite sequences of softmax self-attention layers and pointwise multilayer perceptrons. This yields universal approximation theorems for wide families of nonlinear operators between function spaces. The same construction accommodates regularized negative-order Sobolev inputs and output query points defined on separate domains.
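As a rough illustration of the building blocks named above, the following sketch applies one softmax self-attention layer and one pointwise MLP to graph tokens $(x_j, h(x_j))$. The weight shapes, residual connections, score scaling, ReLU activation, and all names (`self_attention`, `pointwise_mlp`, `layer`) are assumptions made for illustration and are not taken from the paper. In particular, this generic layer is not constrained to be graph-preserving.

```python
import numpy as np

def softmax(z, axis=-1):
    # Numerically stable softmax along the given axis.
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(tokens, Wq, Wk, Wv):
    # Single-head softmax self-attention with a residual connection (assumed form).
    q, k, v = tokens @ Wq, tokens @ Wk, tokens @ Wv
    weights = softmax(q @ k.T / np.sqrt(q.shape[-1]), axis=-1)
    return tokens + weights @ v

def pointwise_mlp(tokens, W1, b1, W2, b2):
    # The same two-layer ReLU MLP is applied to every token independently.
    hidden = np.maximum(tokens @ W1 + b1, 0.0)
    return tokens + hidden @ W2 + b2

def layer(tokens, p):
    attended = self_attention(tokens, p["Wq"], p["Wk"], p["Wv"])
    return pointwise_mlp(attended, p["W1"], p["b1"], p["W2"], p["b2"])

rng = np.random.default_rng(0)
N, d_k, d_h = 64, 8, 16  # hypothetical sizes

# Tokens (x_j, h(x_j)) sampled from the graph of h(x) = sin(2*pi*x).
x = np.linspace(0.0, 1.0, N)
tokens = np.stack([x, np.sin(2 * np.pi * x)], axis=1)

params = {
    "Wq": 0.1 * rng.normal(size=(2, d_k)),
    "Wk": 0.1 * rng.normal(size=(2, d_k)),
    "Wv": 0.1 * rng.normal(size=(2, 2)),
    "W1": 0.1 * rng.normal(size=(2, d_h)),
    "b1": np.zeros(d_h),
    "W2": 0.1 * rng.normal(size=(d_h, 2)),
    "b2": np.zeros(2),
}

out = layer(tokens, params)

# Reordering the input tokens reorders the output tokens in the same way.
perm = rng.permutation(N)
equivariant = np.allclose(layer(tokens[perm], params), out[perm])
print(out.shape, equivariant)
```

The final check confirms permutation equivariance: the layer acts on the set of tokens rather than on an ordering, which is what allows it to be read as a map on empirical measures.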

What carries the argument

Function graph transformers: a subclass of measure-theoretic transformers that preserve graph structure by mapping graph-supported measures to graph-supported measures, thereby guaranteeing single-valued function outputs while allowing approximation by standard transformer components.

If this is right

  • Universal approximation holds for operators acting on regularized negative-order Sobolev function spaces.
  • Output query locations may be chosen independently of the input discretization points.
  • Refinement of discretizations corresponds to convergence in the space of measures.
  • The roles of positional encodings and graph connectivity become explicit in the operator-learning setting.
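The third point can be illustrated numerically. The sketch below builds empirical graph measures from midpoint samples of $h(x) = x^2$ on $[0,1]$ and integrates a single test function $\varphi(x,y) = xy$ against them, comparing with the exact value $\int_0^1 x \cdot x^2 \, dx = 1/4$. The choice of $h$, $\varphi$, the uniform midpoint grid, and the helper names are assumptions for illustration; convergence for one test function only hints at weak convergence and does not prove it.

```python
import numpy as np

def graph_tokens(h, n):
    # Midpoint samples x_j on [0, 1] paired with h(x_j): one token per sample.
    x = (np.arange(n) + 0.5) / n
    return np.stack([x, h(x)], axis=1)

def integrate(phi, tokens):
    # Integral of phi against the empirical measure (1/N) * sum_j delta_{(x_j, y_j)}.
    return float(np.mean(phi(tokens[:, 0], tokens[:, 1])))

def h(x):
    return x ** 2

def phi(x, y):
    return x * y

exact = 0.25  # integral of x * x^2 over [0, 1]
sizes = [8, 32, 128, 512]
errors = [abs(integrate(phi, graph_tokens(h, n)) - exact) for n in sizes]

for n, err in zip(sizes, errors):
    print(f"N = {n:4d}   error = {err:.3e}")
```

For this particular pair the error equals $1/(8N^2)$, so each fourfold refinement reduces it by a factor of sixteen.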

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar graph-measure ideas could be applied to other architectures such as graph neural networks for operator learning.
  • Practical training procedures might enforce the graph-preserving property through additional loss terms or architectural constraints.
  • This viewpoint suggests new ways to prove discretization invariance for existing transformer-based PDE solvers.

Load-bearing premise

Representing functions by measures on their graphs and adopting a measure-theoretic view of transformers is general enough to include all operators one wishes to approximate.

What would settle it

The result would be falsified by a concrete nonlinear operator between function spaces, within the class the theorems cover, for which some fixed error tolerance cannot be met by any finite composition of standard softmax attention layers and pointwise MLPs when functions are represented by their graph measures.

Figures

Figures reproduced from arXiv: 2605.17968 by David Mis, Ivan Dokmanić, Maarten V. de Hoop, Matti Lassas, Takashi Furuya.

Figure 1: Operators between function spaces are universally approximated by function graph trans…
Figure 2: Representative two-dimensional FNO-teacher recovery examples for the trained same…
Abstract

We study the approximation of nonlinear operators between function spaces by transformers. Our approach is to lift functions to measures supported on their graphs and leverage a recently introduced measure-theoretic view of transformers. A function $h$ is represented by its graph measure $\gamma_h$, with finite tokens $\{(x_j,h(x_j))\}_{j=1}^N$ being its empirical approximations. We show that this framework elegantly models discretization refinement via convergence of measures and provides a natural setting for operator learning. Within this framework, we introduce function graph transformers, a graph-preserving subclass of measure-theoretic transformers that maps graph measures to graph measures, which is to say that outputs remain single-valued functions. Crucially, this additional structure does not reduce generality: we prove that the resulting graph-preserving maps can be approximated by finite compositions of standard softmax self-attention layers and pointwise MLPs, yielding universal approximation results for broad classes of nonlinear operators. Unlike existing theoretical approaches to operator learning with transformers, the measure-theoretic framework also accommodates regularized negative-order Sobolev inputs for which discretization invariance is particularly challenging, as well as query points on different output domains. Overall, function graph transformers provide a continuum viewpoint and mathematical toolkit for transformer-based operator learning, clarifying the roles of positional encodings, graph structure, regularization, and ensuring consistency across discretizations.
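One way to write out the objects named in the abstract is sketched below. The domain $\Omega$, the reference probability measure $\mu$ on it, and the pushforward form of $\gamma_h$ are assumptions chosen for illustration; the paper's own definition and normalization may differ.

```latex
% Graph measure of h : \Omega \to \mathbb{R}^m (assumed form: pushforward of \mu under x \mapsto (x, h(x)))
\gamma_h = (\mathrm{id}, h)_{\#}\,\mu,
\qquad
\int \varphi \, d\gamma_h = \int_{\Omega} \varphi\bigl(x, h(x)\bigr)\, d\mu(x).

% Empirical approximation from N tokens (x_j, h(x_j))
\gamma_h^{N} = \frac{1}{N} \sum_{j=1}^{N} \delta_{(x_j,\, h(x_j))},
\qquad
\int \varphi \, d\gamma_h^{N} = \frac{1}{N} \sum_{j=1}^{N} \varphi\bigl(x_j, h(x_j)\bigr).

% Discretization refinement as weak convergence (for continuous h and bounded continuous \varphi)
\frac{1}{N}\sum_{j=1}^{N} \delta_{x_j} \rightharpoonup \mu
\quad\Longrightarrow\quad
\gamma_h^{N} \rightharpoonup \gamma_h .
```

Under these assumptions the last implication holds because $x \mapsto \varphi(x, h(x))$ is itself bounded and continuous whenever $h$ is continuous.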

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper lifts functions to measures supported on their graphs and uses a measure-theoretic view of transformers to introduce function graph transformers, a graph-preserving subclass that maps graph measures to graph measures (ensuring single-valued outputs). It claims to prove that these graph-preserving maps can be approximated arbitrarily closely by finite compositions of standard softmax self-attention layers and pointwise MLPs, yielding universal approximation for broad classes of nonlinear operators between function spaces. The framework is asserted to handle discretization refinement via measure convergence, regularized negative-order Sobolev inputs, and query points on mismatched domains without loss of generality.

Significance. If the central approximation result holds with the claimed preservation of graph support, the work supplies a continuum viewpoint and mathematical toolkit for transformer-based operator learning. It addresses discretization invariance and regularity challenges that are difficult for existing approaches, while clarifying roles of positional encodings and graph structure. The explicit accommodation of negative-order Sobolev inputs and cross-domain queries would be a notable advance if rigorously established.

major comments (2)
  1. [abstract and main approximation theorem] The central claim that graph-preserving maps can be approximated by unrestricted softmax attention layers without restricting the class of operators (stated in the abstract and developed in the main results) requires explicit verification that the limit preserves single-valued functional outputs. In weak measure metrics such as Wasserstein or weak-*, small perturbations can split mass across multiple y-values for the same x; the argument appears to rely on density of graph-preserving maps plus an implicit projection step whose details are not secured for negative-order Sobolev inputs or mismatched query domains.
  2. [framework section and universal approximation result] The framework assumes that lifting to graph measures combined with the prior measure-theoretic transformer view provides a sufficiently general setting without restricting approximable operators. However, the dependence on that prior work for the operator approximation result introduces grounding that is not fully external; the manuscript should clarify independence and verify that the graph-preservation constraint does not implicitly narrow the operator class for the Sobolev cases highlighted as a strength.
minor comments (2)
  1. [introduction and framework] Notation for empirical graph measures (finite tokens {(x_j, h(x_j))}) and their convergence under discretization refinement should be made fully explicit with a dedicated definition or equation to aid readability.
  2. [section 2] The manuscript would benefit from a short table or diagram contrasting the function graph transformer construction with standard measure-theoretic transformers to highlight the graph-preservation mechanism.
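The concern in major comment 1 can be made concrete with a small numerical sketch. It builds an empirical graph measure and a perturbed measure that splits each atom's mass between $y = h(x_j) + \varepsilon$ and $y = h(x_j) - \varepsilon$, then evaluates the cost of one explicit coupling, which upper-bounds the Wasserstein-1 distance. The function $h$, the value of $\varepsilon$, the Euclidean ground cost, and the helper `is_graph_supported` are assumptions for illustration, not constructions from the paper or the report.

```python
import numpy as np

def is_graph_supported(points, tol=1e-12):
    # True if no two atoms share an x-coordinate while having different y-values.
    order = np.lexsort((points[:, 1], points[:, 0]))  # sort by x, then by y
    p = points[order]
    same_x = np.abs(np.diff(p[:, 0])) <= tol
    different_y = np.abs(np.diff(p[:, 1])) > tol
    return not bool(np.any(same_x & different_y))

N, eps = 50, 1e-3
x = (np.arange(N) + 0.5) / N
y = np.sin(2 * np.pi * x)

# Empirical graph measure: N atoms of weight 1/N at (x_j, h(x_j)).
graph = np.stack([x, y], axis=1)

# Perturbed measure: 2N atoms of weight 1/(2N), mass split vertically at each x_j.
split = np.concatenate([
    np.stack([x, y + eps], axis=1),
    np.stack([x, y - eps], axis=1),
])

# Explicit coupling: send half of each graph atom up and half down.
# Its cost is an upper bound on the Wasserstein-1 distance.
doubled = np.concatenate([graph, graph])
coupling_cost = float(np.mean(np.linalg.norm(split - doubled, axis=1)))

print(is_graph_supported(graph), is_graph_supported(split), coupling_cost)
```

The perturbed measure is within $\varepsilon$ of the graph measure in Wasserstein-1 distance, yet it assigns two output values to every $x_j$, so closeness in a weak metric alone does not guarantee a single-valued output.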

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful reading and insightful comments. We address each major point below, indicating the revisions we will incorporate to strengthen the manuscript while preserving the core contributions.

Point-by-point responses
  1. Referee: [abstract and main approximation theorem] The central claim that graph-preserving maps can be approximated by unrestricted softmax attention layers without restricting the class of operators (stated in the abstract and developed in the main results) requires explicit verification that the limit preserves single-valued functional outputs. In weak measure metrics such as Wasserstein or weak-*, small perturbations can split mass across multiple y-values for the same x; the argument appears to rely on density of graph-preserving maps plus an implicit projection step whose details are not secured for negative-order Sobolev inputs or mismatched query domains.

    Authors: We agree that explicit verification of preservation under limits is necessary for full rigor. In the revised manuscript we will insert a new lemma establishing that the weak-* limit of a sequence of graph-preserving maps remains graph-preserving when the underlying measures arise from functions in the regularized negative-order Sobolev spaces considered in the paper. The lemma will also treat the projection onto graph measures explicitly, showing that the projection is continuous in the Wasserstein metric for the relevant function classes and that it introduces no additional error that would affect the universal-approximation guarantee. The same argument extends directly to query points on mismatched domains by viewing the query as a marginal of the lifted measure. These additions will be placed immediately after the statement of the main approximation theorem. revision: yes

  2. Referee: [framework section and universal approximation result] The framework assumes that lifting to graph measures combined with the prior measure-theoretic transformer view provides a sufficiently general setting without restricting approximable operators. However, the dependence on that prior work for the operator approximation result introduces grounding that is not fully external; the manuscript should clarify independence and verify that the graph-preservation constraint does not implicitly narrow the operator class for the Sobolev cases highlighted as a strength.

    Authors: We will add a dedicated paragraph in the framework section that separates the contributions: the measure-theoretic transformer construction is taken as background, but the density of graph-preserving maps within the space of all continuous maps on graph measures, together with the approximation by standard softmax attention, is proved self-containedly in our Theorems 3.4 and 4.2. Because every operator between the function spaces lifts uniquely to a graph-preserving map on the corresponding graph measures, the restriction to graph-preserving maps does not reduce the class of approximable operators. A short appendix subsection will verify that the same density and approximation statements hold uniformly for the regularized negative-order Sobolev inputs, confirming that the highlighted strength is retained. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation relies on independent proofs within the measure-theoretic framework.

full rationale

The paper defines function graph transformers as a graph-preserving subclass of measure-theoretic transformers and claims to prove that such maps can be approximated by standard softmax attention plus MLPs, yielding universal operator approximation. This is presented as a mathematical result rather than a reduction by construction, self-definition, or fitted input. The reliance on a 'recently introduced measure-theoretic view' is a citation to prior work; per guidelines, a cited result counts as independent support unless it is shown to reduce the central claim to an unverified self-citation chain or ansatz. No equations or steps in the provided text exhibit the specific reduction (e.g., Eq. X equivalent to input by definition or prediction forced by fit). The framework is self-contained against external benchmarks for the stated universal approximation claims, making this the normal honest non-finding.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

The central claim rests on representing functions via graph measures and extending a prior measure-theoretic transformer view; these are introduced without independent empirical or formal verification beyond the theoretical construction itself.

axioms (2)
  • domain assumption: Functions can be represented by measures supported on their graphs, with empirical approximations given by finite tokens.
    Stated directly in the abstract as the starting point for the framework.
  • domain assumption: The recently introduced measure-theoretic view of transformers extends to graph measures for operator learning.
    The abstract says the framework leverages this view to model discretization refinement and operator learning.
invented entities (1)
  • function graph transformer (no independent evidence)
    purpose: A graph-preserving subclass of measure-theoretic transformers that maps graph measures to graph measures so outputs remain single-valued functions
    Newly defined in the paper to add structure while claiming no loss of generality for approximation.



