A Mathematical Explanation of Transformers

Hao Liu; Lingfeng Li; Raymond H. Chan; Xue-Cheng Tai

arXiv: 2510.03989 · v2 · submitted 2025-10-05 · cs.LG · cs.AI · cs.NA · math.NA


Pith reviewed 2026-05-18 10:07 UTC · model grok-4.3

classification cs.LG · cs.AI · cs.NA · math.NA

keywords transformer · integro-differential equation · self-attention · layer normalization · continuous framework · operator theory · variational methods · neural architecture


The pith

Transformers arise as discretizations of structured integro-differential equations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper sets out a continuous framework in which the Transformer is treated as the discretization of a structured integro-differential equation defined over continuous token indices and feature dimensions. In this setting the self-attention layer appears directly as a non-local integral operator, while layer normalization is recovered as a projection onto a time-dependent constraint. The resulting operator-theoretic and variational picture supplies a single language for attention, feedforward blocks, and normalization together. A reader would care because the same language immediately suggests ways to analyze stability, derive variants, or import techniques from continuous mathematics into architecture design.

Core claim

The paper claims that the full Transformer operation admits a faithful embedding into continuous domains for both token indices and feature dimensions, so that the discrete architecture is recovered as the discretization of a structured integro-differential equation. Within this formulation the self-attention mechanism emerges naturally as a non-local integral operator, and layer normalization is characterized as a projection to a time-dependent constraint. This operator-theoretic and variational perspective unifies the treatment of attention, feedforward layers, and normalization and thereby supplies a principled foundation for the architecture.
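The reading of self-attention as a non-local integral operator can be illustrated with a small numerical sketch. It assumes scalar features, a token domain of [0, 1], and toy query, key, and value functions `q`, `k`, `v`; none of these come from the paper. On a uniform grid, a midpoint-rule approximation of a softmax-normalized kernel integral coincides with ordinary softmax attention on the sampled values, because the quadrature weight cancels in the ratio.

```python
import math

def integral_attention(q, k, v, grid):
    """Midpoint-rule approximation of a softmax-normalized kernel integral.

    Approximates (A v)(x) = int exp(q(x) k(y)) v(y) dy / int exp(q(x) k(z)) dz
    for each x in `grid`, integrating over the same uniform grid on [0, 1].
    """
    h = 1.0 / len(grid)  # quadrature weight; it cancels in the ratio
    out = []
    for x in grid:
        w = [math.exp(q(x) * k(y)) * h for y in grid]
        total = sum(w)
        out.append(sum(wi * v(y) for wi, y in zip(w, grid)) / total)
    return out

def softmax_attention(Q, K, V):
    """Single-head softmax attention with scalar queries, keys, and values."""
    out = []
    for qi in Q:
        scores = [qi * kj for kj in K]
        m = max(scores)  # subtract the max for numerical stability
        e = [math.exp(s - m) for s in scores]
        total = sum(e)
        out.append(sum(ei * vj for ei, vj in zip(e, V)) / total)
    return out

# Toy continuous query, key, and value functions on [0, 1] (hypothetical).
q = lambda x: math.sin(2 * math.pi * x)
k = lambda y: math.cos(2 * math.pi * y)
v = lambda y: y * y

n = 8
grid = [(i + 0.5) / n for i in range(n)]  # midpoints of n equal cells
continuous = integral_attention(q, k, v, grid)
discrete = softmax_attention([q(x) for x in grid],
                             [k(y) for y in grid],
                             [v(y) for y in grid])
max_gap = max(abs(a - b) for a, b in zip(continuous, discrete))
```

Here `max_gap` measures the largest difference between the two outputs. The sketch only covers the token variable; the paper also treats the feature dimension as continuous, which is not modeled here.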

What carries the argument

The structured integro-differential equation whose discretization produces the Transformer, with self-attention realized as its non-local integral operator and layer normalization realized as a projection onto a time-dependent constraint.
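To fix ideas, one schematic shape such an equation could take is written out below. It is an illustrative assumption, not the equation from the paper: the symbols u, Ω, Q, K, V, F, and C(t) are hypothetical, and the continuous feature variable is suppressed for brevity.

```latex
\begin{aligned}
\partial_t u(x,t) &= \int_{\Omega}
  \frac{\exp\big(\langle Q u(x,t),\, K u(y,t)\rangle\big)}
       {\int_{\Omega} \exp\big(\langle Q u(x,t),\, K u(z,t)\rangle\big)\,dz}\;
  V u(y,t)\,dy \;+\; F\big(u(x,t)\big), \\
u(\cdot,t) &\in \mathcal{C}(t) \quad \text{for all } t .
\end{aligned}
```

In this sketch x is a continuous token index, the integral term plays the role of self-attention, F stands in for the feedforward block, and membership in C(t) stands in for layer normalization.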

If this is right

The components of the Transformer become amenable to analysis by integral operators and variational methods.
Architecture variants can be obtained by altering the underlying continuous equation rather than by ad-hoc layer changes.
Control-theoretic tools become applicable to the dynamics of attention and normalization.
Theoretical statements about convergence or stability can be transferred from the continuous equation to the discrete network.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same continuous embedding might be used to derive new normalization schemes or attention kernels directly from the integro-differential structure.
Connections to neural ordinary differential equations or other continuous dynamical models could be made precise through this lens.
Numerical schemes for solving the continuous equation might suggest more efficient or more stable discrete implementations.

Load-bearing premise

The discrete Transformer steps can be embedded into continuous domains for tokens and features without losing the essential behavior of the original architecture.

What would settle it

Discretize the proposed integro-differential equation using the same step sizes as a standard Transformer and test whether the resulting outputs match those of the original discrete model on sequence-modeling benchmarks.
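The simplest piece of that check concerns the time variable alone. The sketch below, which uses toy vector fields rather than the paper's equation, shows that a residual update `u + f(u)` is exactly a forward Euler step with unit step size, and that refining the step brings the Euler iterates closer to the continuous solution of a test problem. The names `toy_layer` and `decay` are hypothetical.

```python
import math

def euler_step(u, f, dt):
    """One forward Euler step for du/dt = f(u)."""
    return [ui + dt * fi for ui, fi in zip(u, f(u))]

def residual_block(u, f):
    """Residual update u + f(u), as in a skip connection around a sublayer."""
    return [ui + fi for ui, fi in zip(u, f(u))]

def integrate(u0, f, t_end, n_steps):
    """Apply n_steps forward Euler steps of equal size up to time t_end."""
    dt = t_end / n_steps
    u = list(u0)
    for _ in range(n_steps):
        u = euler_step(u, f, dt)
    return u

def toy_layer(u):
    """Stand-in for a sublayer; not a real attention or feedforward block."""
    return [math.tanh(0.5 * ui) for ui in u]

u0 = [0.3, -1.2, 2.0]
# A residual block coincides with a forward Euler step of size 1.
same = euler_step(u0, toy_layer, 1.0) == residual_block(u0, toy_layer)

# Test problem du/dt = -u with exact solution exp(-t) at u(0) = 1.
decay = lambda u: [-ui for ui in u]
exact = math.exp(-1.0)
coarse = integrate([1.0], decay, 1.0, 10)[0]
fine = integrate([1.0], decay, 1.0, 1000)[0]
```

Comparing `coarse` and `fine` against `exact` shows the usual first-order behavior of forward Euler. A full test of the paper's claim would also need the attention and normalization terms, which this sketch leaves out.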


Original abstract

The Transformer architecture has revolutionized the field of sequence modeling and underpins the recent breakthroughs in large language models (LLMs). However, a comprehensive mathematical theory that explains its structure and operations remains elusive. In this work, we propose a novel continuous framework that rigorously interprets the Transformer as a discretization of a structured integro-differential equation. Within this formulation, the self-attention mechanism emerges naturally as a non-local integral operator, and layer normalization is characterized as a projection to a time-dependent constraint. This operator-theoretic and variational perspective offers a unified and interpretable foundation for understanding the architecture's core components, including attention, feedforward layers, and normalization. Our approach extends beyond previous theoretical analyses by embedding the entire Transformer operation in continuous domains for both token indices and feature dimensions. This leads to a principled and flexible framework that not only deepens theoretical insight but also offers new directions for architecture design, analysis, and control-based interpretations. This new interpretation provides a step toward bridging the gap between deep learning architectures and continuous mathematical modeling, and contributes a foundational perspective to the ongoing development of interpretable and theoretically grounded neural network models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper offers a continuous integro-differential view of Transformers but needs stronger evidence that discretization recovers the exact discrete operations.


The main thing you should know is that the paper proposes viewing the Transformer as a discretization of a structured integro-differential equation, where attention acts as a non-local integral operator and layer normalization as a projection onto a time-dependent constraint. This continuous framework treats both token positions and feature dimensions as continuous variables.

What the paper does well is to extend earlier theoretical treatments by embedding the full architecture in these continuous domains. It provides an operator-theoretic and variational lens that ties together attention, feedforward layers, and normalization in one setup. This perspective could help apply methods from integral equations and variational calculus to analyze or modify large models.

The soft spots are around the rigor of the connection back to the discrete case. The claim is that this gives a rigorous interpretation, but the provided abstract does not include the explicit discretization scheme or error analysis needed to show that the standard Transformer operations, such as the residual connections and softmax, emerge precisely from the continuous model. If the full manuscript demonstrates this recovery with clear limits and bounds, then the central argument holds up better. Otherwise, there is a risk that the integro-differential equation is an interesting analogy rather than an exact embedding that captures the essential behavior.

The paper is for readers focused on the mathematical foundations of deep learning and continuous limits of neural networks. Someone working on theoretical aspects or looking for new ways to design architectures might get value from the unified view, though they may have to do some work to verify the discretization.

I would send this to peer review. The idea is timely and distinct enough that referees can assess whether the math delivers on the promise of a faithful continuous limit.

Referee Report

2 major / 2 minor

Summary. The paper proposes a novel continuous framework that interprets the Transformer as a discretization of a structured integro-differential equation. In this formulation, self-attention emerges as a non-local integral operator, layer normalization is characterized as a projection onto a time-dependent constraint, and the entire architecture is embedded in continuous domains for both token indices and feature dimensions. The approach aims to provide an operator-theoretic and variational foundation for attention, feedforward layers, and normalization, extending prior analyses and suggesting new directions for design and control-based interpretations.

Significance. If the continuous model can be shown to recover the exact discrete Transformer updates (including residual connections, softmax, and LayerNorm) via a rigorous discretization or limit process, the work would supply a unified mathematical bridge between discrete neural architectures and integro-differential equations. This could enable new theoretical analyses, architecture variants, and variational interpretations, though the current significance is limited by the absence of such recovery.

major comments (2)

[Abstract / Introduction] The central claim requires that the discrete self-attention, FFN, and LayerNorm operations arise exactly as a discretization of the proposed integro-differential equation once token indices and feature dimensions are treated as continuous. The abstract asserts a 'rigorous interpretation' and 'faithful embedding' of the entire Transformer operation, yet no explicit discretization scheme, limit argument, or recovery of the standard residual + softmax + LayerNorm equations is supplied. Without this step the continuous model risks remaining a loose analogy rather than a faithful embedding.
[Main theoretical development] The weakest assumption—that discrete token indices and feature dimensions admit a faithful continuous embedding that accurately captures essential Transformer behavior—is asserted but not demonstrated through concrete recovery of the original architecture's equations or through validation on standard Transformer components.

minor comments (2)

Clarify the precise definition of the 'structured integro-differential equation' with an explicit equation early in the manuscript, including how the time-dependent constraint for LayerNorm is formulated.
Add a short comparison paragraph distinguishing the proposed framework from prior continuous or dynamical-system interpretations of attention (e.g., those based on ODEs or mean-field limits).
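As a point of reference for the first minor comment, LayerNorm without its learned scale and shift and without the epsilon term can be read as a Euclidean projection onto the set of vectors with zero mean and unit mean square. The sketch below checks this numerically on a random vector. The fixed constraint used here is an assumption made for illustration; the paper's constraint is time-dependent, and its exact form is not given in the abstract.

```python
import math
import random

def layer_norm(x):
    """LayerNorm without affine parameters or epsilon (simplifying assumption)."""
    d = len(x)
    mean = sum(x) / d
    centered = [xi - mean for xi in x]
    std = math.sqrt(sum(c * c for c in centered) / d)
    return [c / std for c in centered]

def distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

random.seed(0)
d = 6
x = [random.gauss(1.0, 2.0) for _ in range(d)]
z = layer_norm(x)

# z satisfies the constraint: zero mean and unit mean square.
mean_z = sum(z) / d
mean_sq_z = sum(zi * zi for zi in z) / d

# Other feasible points, built by normalizing unrelated random vectors.
others = [layer_norm([random.gauss(0.0, 1.0) for _ in range(d)])
          for _ in range(200)]

# None of the sampled feasible points is closer to x than z is.
is_nearest = all(distance(x, z) <= distance(x, o) + 1e-12 for o in others)
```

The sampling check is evidence rather than a proof. The underlying argument is short: subtracting the mean is the orthogonal projection onto the zero-mean subspace, and dividing by the standard deviation rescales radially onto the sphere of radius sqrt(d) inside that subspace.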

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their insightful comments and for recognizing the potential of our continuous framework for interpreting Transformers. We address each major comment below and outline the revisions we plan to make to strengthen the connection between the continuous model and the discrete architecture.


Referee: [Abstract / Introduction] The central claim requires that the discrete self-attention, FFN, and LayerNorm operations arise exactly as a discretization of the proposed integro-differential equation once token indices and feature dimensions are treated as continuous. The abstract asserts a 'rigorous interpretation' and 'faithful embedding' of the entire Transformer operation, yet no explicit discretization scheme, limit argument, or recovery of the standard residual + softmax + LayerNorm equations is supplied. Without this step the continuous model risks remaining a loose analogy rather than a faithful embedding.

Authors: We agree with the referee that an explicit discretization scheme and limit argument would solidify the claim of a faithful embedding. The current manuscript focuses on deriving the continuous integro-differential equation and showing the emergence of attention as a non-local operator and normalization as a projection, but does not detail the reverse process of discretizing back to the standard Transformer equations. In the revised version, we will add a dedicated section that provides a rigorous discretization procedure, including how residual connections, the softmax operation, and LayerNorm arise in the discrete limit from the continuous formulation.

revision: yes

Referee: [Main theoretical development] The weakest assumption—that discrete token indices and feature dimensions admit a faithful continuous embedding that accurately captures essential Transformer behavior—is asserted but not demonstrated through concrete recovery of the original architecture's equations or through validation on standard Transformer components.

Authors: The referee correctly identifies that while we embed the token indices and feature dimensions into continuous domains and develop the corresponding operator-theoretic view, the manuscript stops short of explicitly recovering the discrete equations or providing numerical validation on standard components. We will revise the main theoretical development to include concrete recovery examples and perhaps a small validation experiment demonstrating that the continuous model approximates the discrete Transformer behavior under fine discretization.

revision: yes

Circularity Check

0 steps flagged

No circularity: continuous framework is an independent modeling proposal


The paper proposes a novel continuous framework that interprets the Transformer as a discretization of a structured integro-differential equation, with self-attention as a non-local integral operator and layer normalization as a projection to a time-dependent constraint. This is framed as a new modeling choice and extension beyond prior analyses by embedding token indices and feature dimensions in continuous domains. No load-bearing step reduces by construction to fitted parameters, self-definition, or a self-citation chain; the central claim is a proposed embedding rather than a tautological renaming or forced prediction. The derivation is self-contained as a mathematical modeling exercise.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

Ledger entries are inferred strictly from claims in the abstract because the full manuscript was unavailable. The central modeling step is the continuous embedding itself.

axioms (1)

domain assumption: Discrete Transformer operations can be embedded into continuous domains for both token indices and feature dimensions without loss of essential properties.
This premise is required for the discretization interpretation stated in the abstract.

invented entities (1)

structured integro-differential equation for the Transformer (no independent evidence)
purpose: To serve as the continuous limit whose discretization recovers the original architecture
Introduced in the abstract as the unifying object; no independent evidence outside the modeling choice is provided.




[1] [1]

J. L. Ba, J. R. Kiros, and G. E. Hinton. Layer normalization.arXiv preprint arXiv:1607.06450, 2016

work page internal anchor Pith review Pith/arXiv arXiv 2016

[2] [2]

Benning, E

M. Benning, E. Celledoni, M. J. Ehrhardt, B. Owren, and C.-B. Schönlieb. Deep learning as optimal control problems: Models and numerical methods.Journal of Computational Dynamics, 6(2):171–198, 2019

work page 2019

[3] [3]

Bryutkin, J

A. Bryutkin, J. Huang, Z. Deng, G. Yang, C.-B. Schönlieb, and A. Aviles-Rivero. Hamlet: graph transformer neural operator for partial differential equations. InProceedings of the 41st International Conference on Machine Learning, pages 4624–4641, 2024

work page 2024

[4] [4]

M. Chen, H. Jiang, W. Liao, and T. Zhao. Nonparametric regression on low-dimensional manifolds using deep relu networks: Function approximation and statistical recovery.In- formation and Inference: A Journal of the IMA, 11(4):1203–1253, 2022. 19

work page 2022

[5] [5]

R. T. Chen, Y. Rubanova, J. Bettencourt, and D. K. Duvenaud. Neural ordinary differential equations.Advances in neural information processing systems, 31, 2018

work page 2018

[6] [6]

Cheng, J

C.-W. Cheng, J. Huang, Y. Zhang, G. Yang, C.-B. Schönlieb, and A. I. Aviles-Rivero. Mamba neural operator: Who wins? transformers vs. state-space models for pdes.arXiv preprint arXiv:2410.02113, 2024

work page arXiv 2024

[7] [7]

Y. Cui, Y. Xu, R. Peng, and D. Wu. Layer normalization for tsk fuzzy system optimization in regression problems.IEEE Transactions on Fuzzy Systems, 31(1):254–264, 2022

work page 2022

[8] [8]

Dosovitskiy, L

A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. De- hghani, M.Minderer, G.Heigold, S.Gelly, etal. Animageisworth16x16words: Transform- ers for image recognition at scale. InInternational Conference on Learning Representations, 2021

work page 2021

[9] [9]

Dutta, T

S. Dutta, T. Gautam, S. Chakrabarti, and T. Chakraborty. Redesigning the transformer architecture with insights from multi-particle dynamical systems.Advances in Neural In- formation Processing Systems, 34:5531–5544, 2021

work page 2021

[10] [10]

Furuya, M

T. Furuya, M. V. de Hoop, and G. Peyré. Transformers are universal in-context learners. arXiv preprint arXiv:2408.01367, 2024

work page arXiv 2024

[11] [11]

A mathematical perspective on transformers.arXiv preprint arXiv:2312.10794, 2023

B. Geshkovski, C. Letrouit, Y. Polyanskiy, and P. Rigollet. A mathematical perspective on transformers.arXiv preprint arXiv:2312.10794, 2023

work page arXiv 2023

[12] [12]

Glowinski and P

R. Glowinski and P. Le Tallec.Augmented Lagrangian and operator-splitting methods in nonlinear mechanics, volume 9. Society for Industrial Mathematics, 1989

work page 1989

[13] [13]

Glowinski, S

R. Glowinski, S. J. Osher, and W. Yin.Splitting methods in communication, imaging, science, and engineering. Springer, 2017

work page 2017

[14]

R. Glowinski, T.-W. Pan, and X.-C. Tai. Some facts about operator-splitting and alternating direction methods. In Splitting Methods in Communication, Imaging, Science, and Engineering, pages 19–94. Springer, 2016.

[15]

A. Graves, A.-r. Mohamed, and G. Hinton. Speech recognition with deep recurrent neural networks. In 2013 IEEE international conference on acoustics, speech and signal processing, pages 6645–6649. Ieee, 2013.

[16]

K. Gregor and Y. LeCun. Learning fast approximations of sparse coding. In Proceedings of the 27th international conference on international conference on machine learning, pages 399–406, 2010.

[17]

E. Haber, L. Ruthotto, E. Holtham, and S.-H. Jun. Learning across scales - multiscale methods for convolution neural networks. In Proceedings of the AAAI conference on artificial intelligence, volume 32, 2018.

[18]

P. Hagemann, J. Hertrich, and G. Steidl. Stochastic normalizing flows for inverse problems: A markov chains viewpoint. SIAM/ASA Journal on Uncertainty Quantification, 10(3):1162–1190, 2022.

[19]

A. Havrilla and W. Liao. Understanding scaling laws with statistical and approximation theory for transformer neural networks on intrinsically low-dimensional data. Advances in Neural Information Processing Systems, 37:42162–42210, 2024.

[20]

J. He and J. Xu. Mgnet: A unified framework of multigrid and convolutional neural network. Science china mathematics, 62:1331–1354, 2019.

[21]

S. Jelassi, M. Sander, and Y. Li. Vision transformers provably learn spatial structure. Advances in Neural Information Processing Systems, 35:37822–37836, 2022.

[22]

F. Jiang, Y. Jiang, H. Zhi, Y. Dong, H. Li, S. Ma, Y. Wang, Q. Dong, H. Shen, and Y. Wang. Artificial intelligence in healthcare: past, present and future. Stroke and vascular neurology, 2(4), 2017.

[23]

A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. Advances in neural information processing systems, 25, 2012.

[24]

Z. Lai, L.-H. Lim, and Y. Liu. Attention is a smoothed cubic spline. arXiv preprint arXiv:2408.09624, 2024.

[25]

Y. Lan, Z. Li, J. Sun, and Y. Xiang. Dosnet as a non-black-box pde solver: When deep learning meets operator splitting. Journal of Computational Physics, 491:112343, 2023.

[26]

Z. Li, N. Kovachki, K. Azizzadenesheli, B. Liu, K. Bhattacharya, A. Stuart, and A. Anandkumar. Fourier neural operator for parametric partial differential equations. arXiv preprint arXiv:2010.08895, 2020.

[27]

C. Liu, Z. Qiao, C. Li, and C.-B. Schönlieb. Inverse evolution layers: Physics-informed regularizers for image segmentation. SIAM Journal on Mathematics of Data Science, 7(1):55–85, 2025.

[28]

H. Liu, J. Liu, R. H. Chan, and X.-C. Tai. Double-well net for image segmentation. Multiscale Modeling & Simulation, 22(4):1449–1477, 2024.

[29]

J. Liu, X.-C. Tai, and S. Luo. Convex shape prior for deep neural convolution network based eye fundus images segmentation. arXiv preprint arXiv:2005.07476, 2020.

[30]

J. Liu, X. Wang, and X.-C. Tai. Deep convolutional neural networks with spatial regularization, volume and star-shape priors for image segmentation. Journal of Mathematical Imaging and Vision, 64(6):625–645, 2022.

[31]

Y. Liu, Z. Zhang, and H. Schaeffer. Prose: Predicting multiple operators and symbolic expressions using multimodal transformers. Neural Networks, 180:106707, 2024.

[32]

Z. Long, Y. Lu, X. Ma, and B. Dong. Pde-net: Learning pdes from data. In International conference on machine learning, pages 3208–3216. PMLR, 2018.

[33]

J. Lu, Z. Shen, H. Yang, and S. Zhang. Deep network approximation for smooth functions. SIAM Journal on Mathematical Analysis, 53(5):5465–5506, 2021.

[34]

L. Lu, P. Jin, G. Pang, Z. Zhang, and G. E. Karniadakis. Learning nonlinear operators via deeponet based on the universal approximation theorem of operators. Nature machine intelligence, 3(3):218–229, 2021.

[35]

T. Lu, P. Neittaanmaki, and X.-C. Tai. A parallel splitting-up method for partial differential equations and its applications to navier-stokes equations. ESAIM: Mathematical Modelling and Numerical Analysis, 26(6):673–708, 1992.

[36]

Y. Lu, Z. Li, D. He, Z. Sun, B. Dong, T. Qin, L. Wang, and T.-y. Liu. Understanding and improving transformer from a multi-particle dynamic system point of view. In ICLR 2020 Workshop on Integration of Deep Neural Models and Differential Equations.

[37]

S. Martin, A. Gagneux, P. Hagemann, and G. Steidl. Pnp-flow: Plug-and-play image restoration with flow matching. arXiv preprint arXiv:2410.02423, 2024.

[38]

J. Meng, F. Wang, and J. Liu. Learnable nonlocal self-similarity of deep features for image denoising. SIAM Journal on Imaging Sciences, 17(1):441–475, 2024.

[39]

R. Miotto, F. Wang, S. Wang, X. Jiang, and J. T. Dudley. Deep learning for healthcare: review, opportunities and challenges. Briefings in bioinformatics, 19(6):1236–1246, 2018.

[40]

M. Raissi, P. Perdikaris, and G. E. Karniadakis. Physics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations. Journal of Computational physics, 378:686–707, 2019.

[41]

O. Ronneberger, P. Fischer, and T. Brox. U-net: Convolutional networks for biomedical image segmentation. In Medical image computing and computer-assisted intervention–MICCAI 2015: 18th international conference, Munich, Germany, October 5-9, 2015, proceedings, part III 18, pages 234–241. Springer, 2015.

[42]

D. Ruiz-Balet and E. Zuazua. Neural ode control for classification, approximation, and transport. SIAM Review, 65(3):735–773, 2023.

[43]

L. Ruthotto and E. Haber. Deep neural networks motivated by partial differential equations. Journal of Mathematical Imaging and Vision, 62(3):352–364, 2020.

[44]

Z. Shen, A. Havrilla, R. Lai, A. Cloninger, and W. Liao. Transformers for learning on noisy and task-level manifolds: Approximation and generalization insights. arXiv preprint arXiv:2505.03205, 2025.

[45]

R. Strudel, R. Garcia, I. Laptev, and C. Schmid. Segmenter: Transformer for semantic segmentation. In Proceedings of the IEEE/CVF international conference on computer vision, pages 7262–7272, 2021.

[46]

X.-C. Tai, H. Liu, and R. Chan. Pottsmgnet: A mathematical explanation of encoder-decoder based neural networks. SIAM Journal on Imaging Sciences, 17(1):540–594, 2024.

[47]

X.-C. Tai, H. Liu, R. H. Chan, and L. Li. A mathematical explanation of unet. Mathematical Foundations of Computing, 2024.

[48]

S. Takakura and T. Suzuki. Approximation and estimation ability of transformers for sequence-to-sequence functions with infinite dimensional input. In International Conference on Machine Learning, pages 33416–33447. PMLR, 2023.

[49]

R. E. Turner. An introduction to transformers. arXiv preprint arXiv:2304.10557, 2023.

[50]

A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017.

[51]

T. Wang, Z. Dou, C. Bao, and Z. Shi. Diffusion mechanism in residual neural network: theory and applications. IEEE Transactions on Pattern Analysis and Machine Intelligence, 46(2):667–680, 2023.

[52]

E. Weinan. A proposal on machine learning via dynamical systems. Communications in Mathematics and Statistics, 5(1):1–11, 2017.

[53]

E. Weinan, C. Ma, and L. Wu. Machine learning from a continuous viewpoint, i. Science China Mathematics, 63(11):2233–2266, 2020.

[54]

H. Wu, B. Xiao, N. Codella, M. Liu, X. Dai, L. Yuan, and L. Zhang. Cvt: Introducing convolutions to vision transformers. In Proceedings of the IEEE/CVF international conference on computer vision, pages 22–31, 2021.

[55]

J. Xu, X. Sun, Z. Zhang, G. Zhao, and J. Lin. Understanding and improving layer normalization. Advances in neural information processing systems, 32, 2019.

[56]

Y. Yang, J. Sun, H. Li, and Z. Xu. Admm-csnet: A deep learning approach for image compressive sensing. IEEE transactions on pattern analysis and machine intelligence, 42(3):521–538, 2018.

[57]

D. Yarotsky. Error bounds for approximations with deep relu networks. Neural networks, 94:103–114, 2017.

[58]

Z. Ye, X. Huang, L. Chen, H. Liu, Z. Wang, and B. Dong. Pdeformer: Towards a foundation model for one-dimensional partial differential equations. arXiv preprint arXiv:2402.12652, 2024.

[59]

T. Young, D. Hazarika, S. Poria, and E. Cambria. Recent trends in deep learning based natural language processing. IEEE Computational Intelligence Magazine, 13(3):55–75, 2018.

[60]

S. Yun, M. Jeong, R. Kim, J. Kang, and H. J. Kim. Graph transformer networks. Advances in neural information processing systems, 32, 2019.

[61]

B. Zhang and R. Sennrich. A lightweight recurrent network for sequence modeling. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 1538–1548, 2019.

[62]

K. Zhang, L. Li, H. Liu, J. Yuan, and X.-C. Tai. Deep convolutional neural networks meet variational shape compactness priors for image segmentation. Neurocomputing, 623:129395, 2025.

[63]

D.-X. Zhou. Universality of deep convolutional neural networks. Applied and computational harmonic analysis, 48(2):787–794, 2020.

[64]

J. Zhu, X. Chen, K. He, Y. LeCun, and Z. Liu. Transformers without normalization. arXiv preprint arXiv:2503.10622, 2025.

[65]

A. Ziaee and E. Çano. Batch layer normalization a new normalization layer for cnns and rnns. In Proceedings of the 6th International Conference on Advances in Artificial Intelligence, pages 40–49, 2022.