ITNet: A Learnable Integral Transform That Subsumes Convolution, Attention, and Recurrence

Ashim Dhor; Pin-Yu Chen; Rasel Mondal

arxiv: 2606.19538 · v3 · pith:PELPJXJInew · submitted 2026-06-17 · 💻 cs.AI · cs.LG

Pith reviewed 2026-06-30 11:02 UTC · model grok-4.3

keywords: integral transform · unified architecture · convolution · self-attention · recurrence · universal approximator · kernel parameterization · modality-agnostic model

The pith

ITNet shows convolution, attention, and recurrence arise as special cases of one learnable integral transform.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes that the distinct inductive biases in convolutional, recurrent, and transformer networks stem from different views of one underlying object, a learnable integral transform. ITNet implements this with an MLP that parameterizes a kernel depending on both positions and features to model interactions. It proves that the three architecture families emerge as special cases of this kernel under specific choices, and that the model universally approximates continuous operators. Efficient computation techniques are introduced to make it scalable. Experiments show that one ITNet setup performs on par with or better than task-specific models on image classification, language understanding, 3D recognition, and visual question answering.
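
For orientation, one plausible way to write the operator described above is sketched here. The abstract gives no equations, so every symbol is an assumption of this sketch rather than the paper's notation: $u$ is the input signal, $k_\theta$ the MLP kernel, $\mu$ the integration measure, $\Omega$ the domain, and $W_Q$, $W_K$, $A$, $B$, $C$ the usual attention and state-space matrices. The value projection of attention is omitted for brevity.

```latex
% Assumed general form (notation not from the paper)
(\mathcal{T}_\theta u)(x) = \int_{\Omega} k_\theta\bigl(x, y, u(x), u(y)\bigr)\, u(y)\, \mathrm{d}\mu(y)

% Convolution: the kernel depends only on the offset
k_\theta(x, y, \cdot, \cdot) = w(x - y)

% Self-attention: exponentiated dot product with a row normalizer
k_\theta\bigl(x, y, u(x), u(y)\bigr)
  = \frac{\exp\bigl(\langle W_Q u(x), W_K u(y)\rangle / \sqrt{d}\bigr)}
         {\int_{\Omega} \exp\bigl(\langle W_Q u(x), W_K u(y')\rangle / \sqrt{d}\bigr)\, \mathrm{d}\mu(y')}

% Linear recurrence: causal kernel built from a state-transition matrix
k_\theta(x, y, \cdot, \cdot) = C A^{x - y} B \,\mathbf{1}[y \le x]
```

Note that the attention normalizer integrates over all of $y'$, so it is not a function of the pair $(x, y)$ alone. Whether the paper handles this inside the kernel or as a separate step is not stated in the abstract.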

Core claim

What carries the argument

A learnable integral transform whose kernel is an MLP depending jointly on positions and features, which subsumes convolution, attention, and recurrence as special cases.
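
A minimal sketch of what such an operator could look like once discretized, assuming scalar positions and features, a one-hidden-layer tanh MLP, and uniform quadrature weights. The function names (`make_mlp`, `mlp`, `integral_transform`) and all sizes are hypothetical; the paper's actual architecture is not given in the text above.

```python
import math
import random

def make_mlp(in_dim, hidden, rng):
    """Random weights for a one-hidden-layer tanh MLP with scalar output."""
    w1 = [[rng.uniform(-1, 1) for _ in range(in_dim)] for _ in range(hidden)]
    b1 = [0.0] * hidden
    w2 = [rng.uniform(-1, 1) for _ in range(hidden)]
    return w1, b1, w2

def mlp(params, z):
    """Evaluate the MLP on the input vector z."""
    w1, b1, w2 = params
    h = [math.tanh(sum(w * v for w, v in zip(row, z)) + b)
         for row, b in zip(w1, b1)]
    return sum(w * v for w, v in zip(w2, h))

def integral_transform(params, positions, features):
    """y_i = (1/N) * sum_j k(x_i, x_j, u_i, u_j) * u_j (uniform quadrature)."""
    n = len(positions)
    out = []
    for xi, ui in zip(positions, features):
        acc = 0.0
        for xj, uj in zip(positions, features):
            # The kernel sees both positions and both features.
            acc += mlp(params, [xi, xj, ui, uj]) * uj
        out.append(acc / n)
    return out

rng = random.Random(0)
params = make_mlp(in_dim=4, hidden=8, rng=rng)
positions = [i / 4 for i in range(5)]
features = [0.5, -1.0, 2.0, 0.0, 1.5]
y = integral_transform(params, positions, features)
```

The double loop costs O(N²) kernel evaluations, which is the cost the paper's efficiency techniques are meant to reduce.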

Load-bearing premise

An MLP-parameterized kernel depending jointly on positions and features can recover the exact functional behavior and inductive biases of convolution, attention, and recurrence without additional constraints or loss of expressivity.

What would settle it

A concrete input-output pair where no MLP parameterization makes the integral transform output match a standard convolution or attention layer exactly.
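
The kind of check described here can be sketched numerically. The example assumes integer positions, zero padding, and a hand-written kernel (`conv_kernel`, a hypothetical name) that depends only on the offset. It shows the target form an MLP kernel would have to reproduce; it does not show that an MLP can reach that form exactly.

```python
def conv_kernel(xi, xj, ui, uj, taps):
    """Translation-invariant, feature-independent kernel: uses only xi - xj."""
    return taps.get(xi - xj, 0.0)

def transform(kernel, positions, features):
    """Discrete integral transform: y_i = sum_j k(x_i, x_j, u_i, u_j) * u_j."""
    return [sum(kernel(xi, xj, ui, uj) * uj
                for xj, uj in zip(positions, features))
            for xi, ui in zip(positions, features)]

def direct_conv(features, taps):
    """Reference 1-D convolution with zero padding at the boundaries."""
    n = len(features)
    out = []
    for i in range(n):
        acc = 0.0
        for offset, w in taps.items():
            j = i - offset
            if 0 <= j < n:
                acc += w * features[j]
        out.append(acc)
    return out

taps = {-1: 0.25, 0: 0.5, 1: 0.25}   # filter weights indexed by offset
u = [1.0, 2.0, 4.0, 8.0]
pos = list(range(len(u)))
y_kernel = transform(
    lambda xi, xj, ui, uj: conv_kernel(xi, xj, ui, uj, taps), pos, u)
y_conv = direct_conv(u, taps)
# Both give [1.0, 2.25, 4.5, 5.0]
```

A falsifying case would be an input for which no MLP weights make `y_kernel` equal `y_conv` exactly rather than approximately.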

Figures

Figures reproduced from arXiv: 2606.19538 by Ashim Dhor, Pin-Yu Chen, Rasel Mondal.

Original abstract

Convolutional networks, recurrent networks, and transformers each encode different inductive biases -- locality, sequential memory, and content-dependent pairwise interaction -- and have remained mathematically distinct since their inception. We show that this fragmentation reflects not a fundamental diversity in how signals should be processed, but rather incomplete views of a single underlying mathematical object: a learnable integral transform. We introduce the Integral Transform Network (ITNet), a unified architecture built around a learnable kernel that depends jointly on positions and features. This kernel is implemented as a small neural network, specifically an MLP, that models pairwise interactions, enabling the model to adapt its behavior from data. We show that convolution, self-attention (including multi-head), and autoregressive recurrence (including LSTM, GRU, S4, and Mamba) arise as special cases under appropriate parameterizations, and that ITNet is a universal approximator of continuous operators. To make this practical, we develop tiled kernel fusion, importance-weighted Monte Carlo integration, and learned low-rank factorization, enabling efficient and scalable computation. A single ITNet architecture with a shared operator and lightweight modality-specific encoders matches or exceeds specialized baselines on ImageNet-1K, GLUE, ModelNet40, VQA v2, and NLVR2. The results demonstrate that a single learned interaction mechanism can recover the behavior of all three architectural families from data.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

The unification claim hinges on exact reductions from an MLP kernel to convolution, attention, and recurrence, and the abstract does not show them.

The paper frames convolution, self-attention, and recurrence as special cases of one learnable integral transform whose kernel is an MLP that takes both position and feature arguments. That is the central new move: treating the three families as different parameterizations of the same operator rather than separate primitives.

The work does show a single ITNet backbone plus light encoders reaching or beating specialized models on ImageNet-1K, GLUE, ModelNet40, VQA v2, and NLVR2. If the shared operator really carries the load across modalities without heavy per-task tuning, that result would be worth checking in detail.

The soft spot is exactly where the stress test points: the abstract asserts that the three families arise as special cases under appropriate parameterizations, yet supplies no explicit weight settings, no derivation that forces translation invariance for convolution or dot-product-plus-softmax for attention, and no proof that the tiled fusion, importance-weighted Monte Carlo, or low-rank tricks preserve those exact reductions at scale. An arbitrary MLP kernel is rich enough to approximate almost anything, so the unification is partly by construction; the non-tautological part is the exact recovery plus the efficiency methods. That step is missing from the provided text.
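
To make the attention case concrete, the sketch below writes standard scaled dot-product attention in two ways: the usual softmax form, and a pairwise kernel `exp(q·k/√d)` followed by a row normalizer. The queries, keys, and values are taken as given (the projections are omitted), and all names are illustrative. This is not the paper's construction, which the abstract does not supply.

```python
import math

def softmax_attention(queries, keys, values):
    """Reference scaled dot-product attention with a stabilized softmax."""
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(len(q))
                  for k in keys]
        m = max(scores)
        weights = [math.exp(s - m) for s in scores]
        z = sum(weights)
        out.append([sum(w * v[d] for w, v in zip(weights, values)) / z
                    for d in range(len(values[0]))])
    return out

def kernel_attention(queries, keys, values):
    """Same map written as a pairwise kernel plus a per-row normalizer."""
    def k(q, key):
        return math.exp(sum(a * b for a, b in zip(q, key)) / math.sqrt(len(q)))
    out = []
    for q in queries:
        row = [k(q, key) for key in keys]
        z = sum(row)  # depends on the whole row, not on a single pair
        out.append([sum(w * v[d] for w, v in zip(row, values)) / z
                    for d in range(len(values[0]))])
    return out

queries = [[1.0, 0.0], [0.5, -0.5]]
keys = [[1.0, 2.0], [0.0, 1.0], [-1.0, 0.5]]
values = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
out_softmax = softmax_attention(queries, keys, values)
out_kernel = kernel_attention(queries, keys, values)
```

The two outputs agree up to floating-point rounding. The normalizer `z` is the part a purely pairwise MLP kernel cannot supply on its own, which is why an explicit derivation of the softmax step matters.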

The paper is aimed at people who want fewer separate code paths for different inductive biases. It is not yet ready for a serious referee because the load-bearing mathematical claim cannot be evaluated. Once the derivations and the explicit constructions appear, it would be worth sending out.

Referee Report

3 major / 2 minor

Summary. The paper claims to introduce ITNet, a unified neural architecture based on a learnable integral transform with an MLP kernel that depends jointly on positions and features. It asserts that convolution, self-attention (multi-head), and autoregressive recurrence (LSTM, GRU, S4, Mamba) arise as special cases under appropriate parameterizations of this kernel. ITNet is also claimed to be a universal approximator of continuous operators. Efficiency methods including tiled kernel fusion, importance-weighted Monte Carlo integration, and learned low-rank factorization are proposed to enable scalability. A single ITNet model with shared operator and modality-specific encoders is reported to match or exceed specialized baselines on ImageNet-1K, GLUE, ModelNet40, VQA v2, and NLVR2.
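
As a rough illustration of why a low-rank factorization helps, the sketch below applies a kernel matrix K = L Rᵀ to a vector without ever forming K. The factors `left` and `right` are small hand-picked matrices, not learned ones, and the helper names are hypothetical; the paper's learned factorization is not described in the abstract.

```python
def dense_kernel(left, right):
    """Form K = left @ right^T explicitly (N x N entries)."""
    return [[sum(a * b for a, b in zip(lrow, rrow)) for rrow in right]
            for lrow in left]

def matvec(matrix, vec):
    """Dense matrix-vector product, O(N^2)."""
    return [sum(m * v for m, v in zip(row, vec)) for row in matrix]

def low_rank_apply(left, right, vec):
    """Apply K = left @ right^T to vec in O(N r) without forming K."""
    r = len(left[0])
    # First contract the input against the right factor: r coefficients.
    coeffs = [sum(right[j][m] * vec[j] for j in range(len(vec)))
              for m in range(r)]
    # Then expand through the left factor.
    return [sum(row[m] * coeffs[m] for m in range(r)) for row in left]

left = [[1.0, 0.0], [0.5, 2.0], [0.0, 1.0]]    # N = 3, rank r = 2
right = [[2.0, 1.0], [0.0, 4.0], [1.0, 0.5]]
u = [1.0, 2.0, 4.0]
y_dense = matvec(dense_kernel(left, right), u)
y_low_rank = low_rank_apply(left, right, u)
# Both give [6.0, 25.0, 11.0]
```

The two paths agree exactly only when the kernel really has rank r; for a general kernel the factorization is an approximation.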

Significance. If the subsumption of the three architectural families is shown to be exact rather than approximate, and the experimental results are reproducible with proper controls, this work could significantly advance the understanding of neural network inductive biases by providing a common mathematical framework. The multi-modal empirical validation across five datasets is a notable strength, suggesting practical utility. The universal approximation result would add theoretical weight if formally established. However, the significance is currently limited by the absence of explicit derivations for the special cases.

major comments (3)

[§4] The central claim that convolution, self-attention, and recurrence arise as special cases under appropriate parameterizations is load-bearing. However, no explicit MLP weight configurations or constraints are provided to recover the exact functional forms, such as feature-independent translation-invariant kernels for convolution or content-dependent dot-product attention. This leaves the subsumption claim as a statement of expressivity rather than a demonstrated reduction.
[Efficiency Techniques] The proposed tiled kernel fusion, importance-weighted Monte Carlo integration, and low-rank factorization must be shown to preserve the exact special-case behaviors without introducing approximations that break the equivalence. No such preservation analysis is included, which is necessary for the claim to hold at scale.
[§5] The assertion that ITNet is a universal approximator of continuous operators requires a detailed proof or reference to supporting theorems; the current presentation invokes this property without sufficient mathematical justification, which is critical for the theoretical contribution.
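
The preservation concern in the second comment can be illustrated with a generic importance-weighted estimator of one output row. This is a textbook construction under assumed inputs (the proposal `probs` and the kernel row are hand-picked), not the paper's sampler. The estimator matches the exact sum in expectation, while any single draw generally does not.

```python
import random

def exact_row(kernel_row, values):
    """Exact discrete integral: sum_j k_j * u_j."""
    return sum(k * v for k, v in zip(kernel_row, values))

def importance_estimate(kernel_row, values, probs, num_samples, rng):
    """Average of k_j * u_j / p_j over indices j drawn with probability p_j."""
    idx = rng.choices(range(len(values)), weights=probs, k=num_samples)
    return sum(kernel_row[j] * values[j] / probs[j] for j in idx) / num_samples

def expected_single_sample(kernel_row, values, probs):
    """Exact expectation of the one-sample estimator, by enumeration."""
    return sum(p * (k * v / p) for k, v, p in zip(kernel_row, values, probs))

kernel_row = [0.5, 0.25, 0.125, 0.125]
values = [2.0, -4.0, 8.0, 16.0]
probs = [0.4, 0.3, 0.2, 0.1]          # hypothetical proposal distribution
rng = random.Random(0)
target = exact_row(kernel_row, values)                      # exact sum
mean_of_estimator = expected_single_sample(kernel_row, values, probs)
estimate = importance_estimate(kernel_row, values, probs, 64, rng)
# Possible one-sample values differ from each other, so draws have variance.
single_draw_values = {k * v / p for k, v, p in zip(kernel_row, values, probs)}
```

Unbiasedness is weaker than exact equivalence, so an exact-reduction claim would need either a zero-variance proposal or a statement that the reduction holds only in expectation.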

minor comments (2)

The notation for the integral transform and kernel could be clarified with more explicit equations to aid readability.
[Experiments] Details on the specific baselines, hyperparameter matching, and statistical significance of the performance claims would strengthen the empirical section.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive review and the recommendation for major revision. We address each major comment below, agreeing that additional explicit material is required to strengthen the claims. We will incorporate the suggested changes in the revised manuscript.

Referee: [§4] The central claim that convolution, self-attention, and recurrence arise as special cases under appropriate parameterizations is load-bearing. However, no explicit MLP weight configurations or constraints are provided to recover the exact functional forms, such as feature-independent translation-invariant kernels for convolution or content-dependent dot-product attention. This leaves the subsumption claim as a statement of expressivity rather than a demonstrated reduction.

Authors: We agree that Section 4 presents the special cases at a high level without listing explicit MLP weights or constraints. In the revision we will add a dedicated subsection with concrete parameter settings (e.g., zeroing feature-dependent terms and enforcing translation invariance for convolution; recovering the softmax dot-product via a specific MLP form for attention; and recovering the state-transition matrices for recurrence). This will convert the claim from expressivity to explicit reduction. revision: yes
Referee: [Efficiency Techniques] The proposed tiled kernel fusion, importance-weighted Monte Carlo integration, and low-rank factorization must be shown to preserve the exact special-case behaviors without introducing approximations that break the equivalence. No such preservation analysis is included, which is necessary for the claim to hold at scale.

Authors: The referee is correct that no preservation analysis appears in the current draft. We will add an appendix that proves each technique leaves the special-case kernels unchanged: tiled fusion respects translation invariance, importance sampling is exact when the kernel is deterministic, and the low-rank factorization is identity on the rank-1 forms used by the special cases. We will also report numerical verification that the recovered operators match the baselines to machine precision. revision: yes
Referee: [§5] The assertion that ITNet is a universal approximator of continuous operators requires a detailed proof or reference to supporting theorems; the current presentation invokes this property without sufficient mathematical justification, which is critical for the theoretical contribution.

Authors: We acknowledge the presentation is insufficient. The revision will include either a short proof sketch (leveraging the universal approximation property of MLPs for the kernel together with standard results on integral operators) or a precise citation to the relevant operator-learning theorems, together with the precise function-space assumptions under which the claim holds. revision: yes
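
For the recurrence case mentioned in the first response, the simplest instance can be checked directly. A scalar linear recurrence h_t = a·h_{t-1} + b·x_t, y_t = c·h_t unrolls into a causal kernel K(t, s) = c·a^(t-s)·b for s ≤ t. The sketch assumes a zero initial state and scalar parameters, and covers only the linear state-space case. Gated models such as LSTM and GRU are nonlinear in the state and are not addressed by it.

```python
def run_recurrence(a, b, c, xs):
    """Step the linear recurrence h_t = a*h_{t-1} + b*x_t, y_t = c*h_t."""
    h, ys = 0.0, []
    for x in xs:
        h = a * h + b * x
        ys.append(c * h)
    return ys

def causal_kernel_transform(a, b, c, xs):
    """Same outputs via y_t = sum_{s<=t} c * a**(t-s) * b * x_s."""
    return [sum(c * a ** (t - s) * b * xs[s] for s in range(t + 1))
            for t in range(len(xs))]

a, b, c = 0.5, 2.0, 1.0
xs = [1.0, 0.0, 4.0, -2.0]
y_recurrent = run_recurrence(a, b, c, xs)
y_kernel_form = causal_kernel_transform(a, b, c, xs)
# Both give [2.0, 1.0, 8.5, 0.25]
```

The kernel here depends only on the offset t - s and a causal mask, so it needs no feature arguments. Input-dependent transitions, as in Mamba, would make `a` a function of the input and require the feature-dependent part of the kernel.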

Circularity Check

0 steps flagged

No significant circularity; unification claim rests on explicit parameterization reductions rather than self-definition

The paper defines a general integral transform whose kernel is an MLP and states that convolution, attention, and recurrence arise as special cases under appropriate parameterizations. This is a standard unification move: the general object is strictly more expressive, and the special cases are obtained by restricting the MLP weights (e.g., making the kernel translation-invariant and feature-independent for convolution). No equation in the supplied text equates the output to the input by construction, nor is any load-bearing premise justified solely by self-citation. The universality statement follows from the known universal-approximation property of MLPs applied to the integral operator, which is external to the paper. Empirical results on multiple benchmarks are independent of the subsumption claim. Therefore the derivation chain does not reduce to its own inputs.

Axiom & Free-Parameter Ledger

1 free parameter · 2 axioms · 1 invented entity

The central claim rests on the domain assumption that existing architectures are exactly recoverable as special cases of the MLP-kernel integral transform and on the practical assumption that the listed efficiency techniques preserve those recoveries at scale.

free parameters (1)

MLP kernel weights
The parameters of the small MLP that defines the position-and-feature-dependent kernel are learned from data and determine which special case is realized.

axioms (2)

domain assumption: Convolution, self-attention, and autoregressive recurrence arise as special cases of the learnable integral transform under appropriate MLP parameterizations.
Invoked when the abstract states that the three families arise as special cases.
domain assumption: The integral transform with MLP kernel is a universal approximator of continuous operators.
Stated directly in the abstract as part of the theoretical claim.

invented entities (1)

ITNet with jointly position-and-feature-dependent MLP kernel · no independent evidence
purpose: Single architecture that subsumes prior families
New model introduced in the paper; no independent evidence outside the claimed special cases and benchmarks.

pith-pipeline@v0.9.1-grok · 5783 in / 1704 out tokens · 62188 ms · 2026-06-30T11:02:57.497713+00:00 · methodology

Reference graph

Works this paper leans on

110 extracted references · 15 canonical work pages · 13 internal anchors

[1]

GPT-4 Technical Report

Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. GPT-4 Technical Report.arXiv preprint arXiv:2303.08774, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[2]

Flamingo: a Visual Language Model for Few-Shot Learning.Advances in Neural Information Processing Systems, 35:23716–23736, 2022

Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. Flamingo: a Visual Language Model for Few-Shot Learning.Advances in Neural Information Processing Systems, 35:23716–23736, 2022

2022
[3]

An Estimate for the Perturbations of the Solution of Ordinary Differential Equations.Vestnik Moskov

V M Alekseev. An Estimate for the Perturbations of the Solution of Ordinary Differential Equations.Vestnik Moskov. Univ. Ser. I Mat. Meh., 2:28–36, 1961

1961
[4]

Neural Operator: Graph Kernel Network for Partial Differential Equations

Anima Anandkumar, Kamyar Azizzadenesheli, Kaushik Bhattacharya, Nikola Kovachki, Zongyi Li, Burigede Liu, and Andrew Stuart. Neural Operator: Graph Kernel Network for Partial Differential Equations. InICLR 2020 Workshop on Integration of Deep Neural Models and Differential Equations, 2020

2020
[5]

Layer Normalization

Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. Layer Normalization.arXiv preprint arXiv:1607.06450, 2016

work page internal anchor Pith review Pith/arXiv arXiv 2016
[6]

Universal Approximation Bounds for Superpositions of a Sigmoidal Func- tion.IEEE Transactions on Information theory, 39(3):930–945, 1993

Andrew R Barron. Universal Approximation Bounds for Superpositions of a Sigmoidal Func- tion.IEEE Transactions on Information theory, 39(3):930–945, 1993

1993
[7]

Longformer: The Long-Document Transformer

Iz Beltagy, Matthew E Peters, and Arman Cohan. Longformer: The Long-Document Trans- former.arXiv preprint arXiv:2004.05150, 2020

work page internal anchor Pith review Pith/arXiv arXiv 2004
[8]

Continuum Attention for Neural Operators.Journal of Machine Learning Research, 26(300):1–52, 2025

Edoardo Calvello, Nikola B Kovachki, Matthew E Levine, and Andrew M Stuart. Continuum Attention for Neural Operators.Journal of Machine Learning Research, 26(300):1–52, 2025

2025
[9]

Universal Approximation to nonlinear operators by Neural Networks with Arbitrary Activation Functions and Its Application to Dynamical Systems

Tianping Chen and Hong Chen. Universal Approximation to nonlinear operators by Neural Networks with Arbitrary Activation Functions and Its Application to Dynamical Systems. IEEE Transactions On Neural Networks, 6(4):911–917, 1995

1995
[10]

Training Deep Nets with Sublinear Memory Cost

Tianqi Chen, Bing Xu, Chiyuan Zhang, and Carlos Guestrin. Training Deep Nets with Sub- linear Memory Cost.arXiv preprint arXiv:1604.06174, 2016

work page internal anchor Pith review Pith/arXiv arXiv 2016
[11]

UNITER: UNiversal Image-TExt Representation Learning

Yen-Chun Chen, Linjie Li, Licheng Yu, Ahmed El Kholy, Faisal Ahmed, Zhe Gan, Yu Cheng, and Jingjing Liu. UNITER: UNiversal Image-TExt Representation Learning. InEuropean Conference on Computer Vision, pages 104–120. Springer, 2020

2020
[12]

Dynamic Convolution: Attention over Convolution Kernels

Yinpeng Chen, Xiyang Dai, Mengchen Liu, Dongdong Chen, Lu Yuan, and Zicheng Liu. Dynamic Convolution: Attention over Convolution Kernels. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11030–11039, 2020

2020
[13]

Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation

Kyunghyun Cho, Bart Van Merri ¨enboer, C ¸ a˘glar Gulc ¸ehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation. InProceedings of the 2014 Con- ference on Empirical Methods in Natural Language Processing (EMNLP), pages 1724–1734, 2014

2014
[14]

Xception: Deep Learning with Depthwise Separable Convolutions

Franc ¸ois Chollet. Xception: Deep Learning with Depthwise Separable Convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1251–1258, 2017

2017
[15]

Rethinking Attention with Performers

Krzysztof Marcin Choromanski, Valerii Likhosherstov, David Dohan, Xingyou Song, An- dreea Gane, Tamas Sarlos, Peter Hawkins, Jared Quincy Davis, Afroz Mohiuddin, Lukasz Kaiser, David Benjamin Belanger, Lucy J Colwell, and Adrian Weller. Rethinking Attention with Performers. InInternational Conference on Learning Representations, 2021

2021
[16]

Le, and Christopher D

Kevin Clark, Minh-Thang Luong, Quoc V . Le, and Christopher D. Manning. ELECTRA: Pre- training Text Encoders as Discriminators Rather Than Generators.International Conference on Learning Representations, 2020. 10

2020
[17]

McGraw-Hill, 1955

Earl A Coddington and Norman Levinson.Theory of Ordinary Differential Equations. McGraw-Hill, 1955

1955
[18]

On the Relationship between Self-Attention and Convolutional Layers.arXiv preprint arXiv:1911.03584, 2019

Jean-Baptiste Cordonnier, Andreas Loukas, and Martin Jaggi. On the Relationship between Self-Attention and Convolutional Layers.arXiv preprint arXiv:1911.03584, 2019

work page arXiv 1911
[19]

RandAugment: Practical Au- tomated Data Augmentation with a Reduced Search Space

Ekin D Cubuk, Barret Zoph, Jonathon Shlens, and Quoc V Le. RandAugment: Practical Au- tomated Data Augmentation with a Reduced Search Space. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pages 702–703, 2020

2020
[20]

Approximation by Superpositions of a Sigmoidal Function.Mathematics of control, signals and systems, 2(4):303–314, 1989

George Cybenko. Approximation by Superpositions of a Sigmoidal Function.Mathematics of control, signals and systems, 2(4):303–314, 1989

1989
[21]

Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality

Tri Dao and Albert Gu. Transformers are SSMs: Generalized Models and Efficient Algo- rithms Through Structured State Space Duality.arXiv preprint arXiv:2405.21060, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[22]

FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness

Tri Dao, Dan Fu, Stefano Ermon, Atri Rudra, and Christopher R ´e. FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness. InAdvances in Neural Information Processing Systems, 2022

2022
[23]

Courier Corpora- tion, 2007

Philip J Davis and Philip Rabinowitz.Methods of Numerical Integration. Courier Corpora- tion, 2007

2007
[24]

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. InProceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, 2019

2019
[25]

Scaling Up Your Ker- nels to 31x31: Revisiting Large Kernel Design in CNNs

Xiaohan Ding, Xiangyu Zhang, Jungong Han, and Guiguang Ding. Scaling Up Your Ker- nels to 31x31: Revisiting Large Kernel Design in CNNs. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11963–11975, 2022

2022
[26]

An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. InInternational Conference on Learning Representations, 2021

2021
[27]

An Empirical Study of Training End- to-End Vision-and-Language Transformers

Zi-Yi Dou, Yichong Xu, Zhe Gan, Jianfeng Wang, Shuohang Wang, Lijuan Wang, Chenguang Zhu, Pengchuan Zhang, Lu Yuan, Nanyun Peng, et al. An Empirical Study of Training End- to-End Vision-and-Language Transformers. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18166–18176, 2022

2022
[28]

John Wiley & Sons, 1988

Nelson Dunford and Jacob T Schwartz.Linear Operators, part 1: General Theory. John Wiley & Sons, 1988

1988
[29]

Graph Neural Networks with Learnable Structural and Positional Representations

Vijay Prakash Dwivedi, Anh Tuan Luu, Thomas Laurent, Yoshua Bengio, and Xavier Bres- son. Graph Neural Networks with Learnable Structural and Positional Representations. In International Conference on Learning Representations, 2022

2022
[30]

Provably Strict Generalisation Benefit for Equivariant Models

Bryn Elesedy and Sheheryar Zaidi. Provably Strict Generalisation Benefit for Equivariant Models. InInternational Conference on Machine Learning, pages 2959–2969. PMLR, 2021

2021
[31]

John Wiley & Sons, 1999

Gerald B Folland.Real Analysis: Modern Techniques and Their Applications. John Wiley & Sons, 1999

1999
[32]

Understanding the difficulty of training deep feedforward neural networks

Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. InProceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pages 249–256. JMLR Workshop and Conference Proceedings, 2010

2010
[33]

Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering

Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6904–6913, 2017. 11

2017
[34]

VEB Deutscher Verlag der Wis- senschaften, 1960

Wolfgang Gr ¨obner.Die Lie-Reihen und ihre Anwendungen. VEB Deutscher Verlag der Wis- senschaften, 1960

1960
[35]

Note on the Derivatives with Respect to a Parameter of the Solu- tions of a System of Differential Equations.Annals of Mathematics, 20(4):292–296, 1919

Thomas Hakon Gronwall. Note on the Derivatives with Respect to a Parameter of the Solu- tions of a System of Differential Equations.Annals of Mathematics, 20(4):292–296, 1919

1919
[36]

Mamba: Linear-Time Sequence Modeling with Selective State Spaces

Albert Gu and Tri Dao. Mamba: Linear-Time Sequence Modeling with Selective State Spaces.arXiv preprint arXiv:2312.00752, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[37]

Efficiently Modeling Long Sequences with Structured State Spaces

Albert Gu, Karan Goel, and Christopher R ´e. Efficiently Modeling Long Sequences with Structured State Spaces.arXiv preprint arXiv:2111.00396, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021
[38]

PCT: Point cloud transformer.Computational visual media, 7(2):187–199, 2021

Meng-Hao Guo, Jun-Xiong Cai, Zheng-Ning Liu, Tai-Jiang Mu, Ralph R Martin, and Shi- Min Hu. PCT: Point cloud transformer.Computational visual media, 7(2):187–199, 2021

2021
[39]

Der Massbegriff in der Theorie der kontinuierlichen Gruppen.Annals of math- ematics, 34(1):147–169, 1933

Alfred Haar. Der Massbegriff in der Theorie der kontinuierlichen Gruppen.Annals of math- ematics, 34(1):147–169, 1933

1933
[40]

SIAM, 2002

Philip Hartman.Ordinary Differential Equations. SIAM, 2002

2002
[41]

Deep Residual Learning for Image Recognition

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep Residual Learning for Image Recognition. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016

2016
[42]

DeBERTa: Decoding-enhanced BERT with Disentangled Attention

Pengcheng He, Xiaodong Liu, Jianfeng Gao, and Weizhu Chen. DeBERTa: Decoding- enhanced BERT with Disentangled Attention.arXiv preprint arXiv:2006.03654, 2020

work page internal anchor Pith review Pith/arXiv arXiv 2006
[43]

Gaussian Error Linear Units (GELUs)

Dan Hendrycks and Kevin Gimpel. Gaussian Error Linear Units (GELUs).arXiv preprint arXiv:1606.08415, 2016

work page internal anchor Pith review Pith/arXiv arXiv 2016
[44]

Long Short-Term Memory.Neural computation, 9(8):1735–1780, 1997

Sepp Hochreiter and J ¨urgen Schmidhuber. Long Short-Term Memory.Neural computation, 9(8):1735–1780, 1997

1997
[45]

Horn and Charles R

Roger A. Horn and Charles R. Johnson.Matrix Analysis. Cambridge University Press, 2012

2012
[46]

Multilayer Feedforward Networks are Universal Approximators.Neural networks, 2(5):359–366, 1989

Kurt Hornik, Maxwell Stinchcombe, and Halbert White. Multilayer Feedforward Networks are Universal Approximators.Neural networks, 2(5):359–366, 1989

1989
[47]

Deep Net- works with Stochastic Depth

Gao Huang, Yu Sun, Zhuang Liu, Daniel Sedra, and Kilian Q Weinberger. Deep Net- works with Stochastic Depth. InEuropean Conference on Computer Vision, pages 646–661. Springer, 2016

2016
[48]

Perceiver IO: A General Architecture for Structured Inputs & Outputs

Andrew Jaegle, Sebastian Borgeaud, Jean-Baptiste Alayrac, Carl Doersch, Catalin Ionescu, David Ding, Skanda Koppula, Daniel Zoran, Andrew Brock, Evan Shelhamer, et al. Perceiver IO: A General Architecture for Structured Inputs & Outputs.arXiv preprint arXiv:2107.14795, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021
[49]

Perceiver: General Perception with Iterative Attention

Andrew Jaegle, Felix Gimeno, Andy Brock, Oriol Vinyals, Andrew Zisserman, and Joao Carreira. Perceiver: General Perception with Iterative Attention. InInternational Conference on Machine Learning, pages 4651–4664. PMLR, 2021

2021
[50]

Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention

Angelos Katharopoulos, Apoorv Vyas, Nikolaos Pappas, and Franc ¸ois Fleuret. Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention. InInternational Confer- ence on Machine Learning, pages 5156–5165. PMLR, 2020

2020
[51]

ViLT: Vision-and-Language Transformer With- out Convolution or Region Supervision

Wonjae Kim, Bokyung Son, and Ildoo Kim. ViLT: Vision-and-Language Transformer With- out Convolution or Region Supervision. InInternational Conference on Machine Learning, pages 5583–5594. PMLR, 2021

2021
[52]

ImageNet Classification with Deep Convolutional Neural Networks

Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. ImageNet Classification with Deep Convolutional Neural Networks. InAdvances in Neural Information Processing Systems, volume 25. Curran Associates, Inc., 2012. 12

2012
[53]

Backpropagation Applied to Handwritten Zip Code Recog- nition.Neural computation, 1(4):541–551, 1989

Yann LeCun, Bernhard Boser, John S Denker, Donnie Henderson, Richard E Howard, Wayne Hubbard, and Lawrence D Jackel. Backpropagation Applied to Handwritten Zip Code Recog- nition.Neural computation, 1(4):541–551, 1989

1989
[54]

Multilayer Feedfor- ward Networks With a Nonpolynomial Activation Function can Approximate Any Function

Moshe Leshno, Vladimir Ya Lin, Allan Pinkus, and Shimon Schocken. Multilayer Feedfor- ward Networks With a Nonpolynomial Activation Function can Approximate Any Function. Neural networks, 6(6):861–867, 1993

1993
[55]

Align before Fuse: Vision and Language Representation Learning with Momentum Distillation.Advances in Neural Information Processing Systems, 34:9694– 9705, 2021

Junnan Li, Ramprasaath Selvaraju, Akhilesh Gotmare, Shafiq Joty, Caiming Xiong, and Steven Chu Hong Hoi. Align before Fuse: Vision and Language Representation Learning with Momentum Distillation.Advances in Neural Information Processing Systems, 34:9694– 9705, 2021

2021
[56]

BLIP: Bootstrapping Language- Image Pre-training for Unified Vision-Language Understanding and Generation

Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. BLIP: Bootstrapping Language- Image Pre-training for Unified Vision-Language Understanding and Generation. InInterna- tional Conference on Machine Learning, pages 12888–12900. PMLR, 2022

2022
[57]

BLIP-2: Bootstrapping Language- Image Pre-training with Frozen Image Encoders and Large Language Models

Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. BLIP-2: Bootstrapping Language- Image Pre-training with Frozen Image Encoders and Large Language Models. InInterna- tional Conference on Machine Learning, pages 19730–19742. PMLR, 2023

2023
[58]

Fourier Neural Operator for Parametric Partial Differential Equations

Zongyi Li, Nikola Kovachki, Kamyar Azizzadenesheli, Burigede Liu, Kaushik Bhattacharya, Andrew Stuart, and Anima Anandkumar. Fourier Neural Operator for Parametric Partial Differential Equations.arXiv preprint arXiv:2010.08895, 2020

work page internal anchor Pith review Pith/arXiv arXiv 2010
[59]

RoBERTa: A Robustly Optimized BERT Pretraining Approach

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. RoBERTa: A Robustly Optimized BERT Pretraining Approach.arXiv preprint arXiv:1907.11692, 2019

work page internal anchor Pith review Pith/arXiv arXiv 1907
[60]

Swin Transformer: Hierarchical Vision Transformer using Shifted Windows

Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Bain- ing Guo. Swin Transformer: Hierarchical Vision Transformer using Shifted Windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 10012– 10022, 2021

[61] Ze Liu, Han Hu, Yutong Lin, Zhuliang Yao, Zhenda Xie, Yixuan Wei, Jia Ning, Yue Cao, Zheng Zhang, Li Dong, Furu Wei, and Baining Guo. Swin Transformer V2: Scaling Up Capacity and Resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12009–12019, 2022.

[62] Zhuang Liu, Hanzi Mao, Chao-Yuan Wu, Christoph Feichtenhofer, Trevor Darrell, and Saining Xie. A ConvNet for the 2020s. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11976–11986, 2022.

[63] Ilya Loshchilov and Frank Hutter. Decoupled Weight Decay Regularization. In International Conference on Learning Representations, 2019.

[64] Lu Lu, Pengzhan Jin, Guofei Pang, Zhongqiang Zhang, and George Em Karniadakis. Learning nonlinear operators via DeepONet based on the universal approximation theorem of operators. Nature Machine Intelligence, 3(3):218–229, 2021.

[65] Xu Ma, Can Qin, Haoxuan You, Haoxi Ran, and Yun Fu. Rethinking Network Design and Local Geometry in Point Cloud: A Simple Residual MLP Framework. In International Conference on Learning Representations, 2022.

[66] Paulius Micikevicius, Sharan Narang, Jonah Alben, Gregory Diamos, Erich Elsen, David Garcia, Boris Ginsburg, Michael Houston, Oleksii Kuchaiev, Ganesh Venkatesh, and Hao Wu. Mixed Precision Training. In International Conference on Learning Representations, 2018.

[67] Béla Sz Nagy, Ciprian Foias, Hari Bercovici, and László Kérchy. Harmonic Analysis of Operators on Hilbert Space. Springer Science & Business Media, 2010.

[68] Xiaohuan Pei, Tao Huang, and Chang Xu. EfficientVMamba: Atrous Selective Scan for Light Weight Visual Mamba. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 6443–6451, 2025.

[69] Michael Poli, Stefano Massaroli, Eric Nguyen, Daniel Y Fu, Tri Dao, Stephen Baccus, Yoshua Bengio, Stefano Ermon, and Christopher Ré. Hyena Hierarchy: Towards Larger Convolutional Language Models. In International Conference on Machine Learning, pages 28043–28078. PMLR, 2023.

[70] Boris T Polyak and Anatoli B Juditsky. Acceleration of Stochastic Approximation by Averaging. SIAM Journal on Control and Optimization, 30(4):838–855, 1992.

[71] Ofir Press, Noah Smith, and Mike Lewis. Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation. In International Conference on Learning Representations, 2022.

[72] Charles R Qi, Hao Su, Kaichun Mo, and Leonidas J Guibas. PointNet: Deep Learning on Point Sets for 3D Classification and Segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 652–660, 2017.

[73] Charles Ruizhongtai Qi, Li Yi, Hao Su, and Leonidas J Guibas. PointNet++: Deep Hierarchical Feature Learning on Point Sets in a Metric Space. Advances in Neural Information Processing Systems, 30, 2017.

[74] Guocheng Qian, Yuchen Li, Houwen Peng, Jinjie Mai, Hasan Hammoud, Mohamed Elhoseiny, and Bernard Ghanem. PointNeXt: Revisiting PointNet++ with Improved Training and Scaling Strategies. Advances in Neural Information Processing Systems, 35:23192–23204, 2022.

[75] Alec Radford, Karthik Narasimhan, Tim Salimans, Ilya Sutskever, et al. Improving Language Understanding by Generative Pre-Training. 2018.

[76] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language Models are Unsupervised Multitask Learners. OpenAI blog, 1(8):9, 2019.

[77] Halsey Lawrence Royden. Real Analysis. Number 6. Krishna Prakashan Media, 1988.

[78] Walter Rudin. Principles of Mathematical Analysis. International Series in Pure and Applied Mathematics, 1976.

[79] Walter Rudin. Real and Complex Analysis. McGraw-Hill, New York, 3rd edition, 1987.

[80] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision, 115(3):211–252, 2015.


Showing first 80 references.
