pith. sign in

arxiv: 2606.19538 · v3 · pith:PELPJXJInew · submitted 2026-06-17 · 💻 cs.AI · cs.LG

ITNet: A Learnable Integral Transform That Subsumes Convolution, Attention, and Recurrence

Pith reviewed 2026-06-30 11:02 UTC · model grok-4.3

classification 💻 cs.AI cs.LG
keywords integral transformunified architectureconvolutionself-attentionrecurrenceuniversal approximatorkernel parameterizationmodality-agnostic model
0
0 comments X

The pith

ITNet shows convolution, attention, and recurrence arise as special cases of one learnable integral transform.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes that the distinct inductive biases in convolutional, recurrent, and transformer networks stem from different views of one underlying object, a learnable integral transform. ITNet implements this with an MLP that parameterizes a kernel depending on both positions and features to model interactions. It proves that the three architecture families emerge as special cases of this kernel under specific choices, and that the model universally approximates continuous operators. Efficient computation techniques are introduced to make it scalable. Experiments show that one ITNet setup performs on par with or better than task-specific models on image classification, language understanding, 3D recognition, and visual question answering.

Core claim

Convolutional networks, recurrent networks, and transformers each encode different inductive biases -- locality, sequential memory, and content-dependent pairwise interaction -- and have remained mathematically distinct since their inception. We show that this fragmentation reflects not a fundamental diversity in how signals should be processed, but rather incomplete views of a single underlying mathematical object: a learnable integral transform. We introduce the Integral Transform Network (ITNet), a unified architecture built around a learnable kernel that depends jointly on positions and features. This kernel is implemented as a small neural network, specifically an MLP, that models pairw

What carries the argument

A learnable integral transform whose kernel is an MLP depending jointly on positions and features, which subsumes convolution, attention, and recurrence as special cases.

Load-bearing premise

An MLP-parameterized kernel depending jointly on positions and features can recover the exact functional behavior and inductive biases of convolution, attention, and recurrence without additional constraints or loss of expressivity.

What would settle it

A concrete input-output pair where no MLP parameterization makes the integral transform output match a standard convolution or attention layer exactly.

Figures

Figures reproduced from arXiv: 2606.19538 by Ashim Dhor, Pin-Yu Chen, Rasel Mondal.

Figure 1
Figure 1. Figure 1: Overview of the ITNet architecture. The model consists of [PITH_FULL_IMAGE:figures/full_fig_p057_1.png] view at source ↗
read the original abstract

Convolutional networks, recurrent networks, and transformers each encode different inductive biases -- locality, sequential memory, and content-dependent pairwise interaction -- and have remained mathematically distinct since their inception. We show that this fragmentation reflects not a fundamental diversity in how signals should be processed, but rather incomplete views of a single underlying mathematical object: a learnable integral transform. We introduce the Integral Transform Network (ITNet), a unified architecture built around a learnable kernel that depends jointly on positions and features. This kernel is implemented as a small neural network, specifically an MLP, that models pairwise interactions, enabling the model to adapt its behavior from data. We show that convolution, self-attention (including multi-head), and autoregressive recurrence (including LSTM, GRU, S4, and Mamba) arise as special cases under appropriate parameterizations, and that ITNet is a universal approximator of continuous operators. To make this practical, we develop tiled kernel fusion, importance-weighted Monte Carlo integration, and learned low-rank factorization, enabling efficient and scalable computation. A single ITNet architecture with a shared operator and lightweight modality-specific encoders matches or exceeds specialized baselines on ImageNet-1K , GLUE, ModelNet40, VQA\,v2 and NLVR2. The results demonstrate that a single learned interaction mechanism can recover the behavior of all three architectural families from data.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper claims to introduce ITNet, a unified neural architecture based on a learnable integral transform with an MLP kernel that depends jointly on positions and features. It asserts that convolution, self-attention (multi-head), and autoregressive recurrence (LSTM, GRU, S4, Mamba) arise as special cases under appropriate parameterizations of this kernel. ITNet is also claimed to be a universal approximator of continuous operators. Efficiency methods including tiled kernel fusion, importance-weighted Monte Carlo integration, and learned low-rank factorization are proposed to enable scalability. A single ITNet model with shared operator and modality-specific encoders is reported to match or exceed specialized baselines on ImageNet-1K, GLUE, ModelNet40, VQA v2, and NLVR2.

Significance. If the subsumption of the three architectural families is shown to be exact rather than approximate, and the experimental results are reproducible with proper controls, this work could significantly advance the understanding of neural network inductive biases by providing a common mathematical framework. The multi-modal empirical validation across five datasets is a notable strength, suggesting practical utility. The universal approximation result would add theoretical weight if formally established. However, the significance is currently limited by the absence of explicit derivations for the special cases.

major comments (3)
  1. [§4] The central claim that convolution, self-attention, and recurrence arise as special cases under appropriate parameterizations is load-bearing. However, no explicit MLP weight configurations or constraints are provided to recover the exact functional forms, such as feature-independent translation-invariant kernels for convolution or content-dependent dot-product attention. This leaves the subsumption claim as a statement of expressivity rather than a demonstrated reduction.
  2. [Efficiency Techniques] The proposed tiled kernel fusion, importance-weighted Monte Carlo integration, and low-rank factorization must be shown to preserve the exact special-case behaviors without introducing approximations that break the equivalence. No such preservation analysis is included, which is necessary for the claim to hold at scale.
  3. [§5] The assertion that ITNet is a universal approximator of continuous operators requires a detailed proof or reference to supporting theorems; the current presentation invokes this property without sufficient mathematical justification, which is critical for the theoretical contribution.
minor comments (2)
  1. The notation for the integral transform and kernel could be clarified with more explicit equations to aid readability.
  2. [Experiments] Details on the specific baselines, hyperparameter matching, and statistical significance of the performance claims would strengthen the empirical section.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive review and the recommendation for major revision. We address each major comment below, agreeing that additional explicit material is required to strengthen the claims. We will incorporate the suggested changes in the revised manuscript.

read point-by-point responses
  1. Referee: [§4] The central claim that convolution, self-attention, and recurrence arise as special cases under appropriate parameterizations is load-bearing. However, no explicit MLP weight configurations or constraints are provided to recover the exact functional forms, such as feature-independent translation-invariant kernels for convolution or content-dependent dot-product attention. This leaves the subsumption claim as a statement of expressivity rather than a demonstrated reduction.

    Authors: We agree that Section 4 presents the special cases at a high level without listing explicit MLP weights or constraints. In the revision we will add a dedicated subsection with concrete parameter settings (e.g., zeroing feature-dependent terms and enforcing translation invariance for convolution; recovering the softmax dot-product via a specific MLP form for attention; and recovering the state-transition matrices for recurrence). This will convert the claim from expressivity to explicit reduction. revision: yes

  2. Referee: [Efficiency Techniques] The proposed tiled kernel fusion, importance-weighted Monte Carlo integration, and low-rank factorization must be shown to preserve the exact special-case behaviors without introducing approximations that break the equivalence. No such preservation analysis is included, which is necessary for the claim to hold at scale.

    Authors: The referee is correct that no preservation analysis appears in the current draft. We will add an appendix that proves each technique leaves the special-case kernels unchanged: tiled fusion respects translation invariance, importance sampling is exact when the kernel is deterministic, and the low-rank factorization is identity on the rank-1 forms used by the special cases. We will also report numerical verification that the recovered operators match the baselines to machine precision. revision: yes

  3. Referee: [§5] The assertion that ITNet is a universal approximator of continuous operators requires a detailed proof or reference to supporting theorems; the current presentation invokes this property without sufficient mathematical justification, which is critical for the theoretical contribution.

    Authors: We acknowledge the presentation is insufficient. The revision will include either a short proof sketch (leveraging the universal approximation property of MLPs for the kernel together with standard results on integral operators) or a precise citation to the relevant operator-learning theorems, together with the precise function-space assumptions under which the claim holds. revision: yes

Circularity Check

0 steps flagged

No significant circularity; unification claim rests on explicit parameterization reductions rather than self-definition

full rationale

The paper defines a general integral transform whose kernel is an MLP and states that conv/attn/recurrence arise as special cases under appropriate parameterizations. This is a standard unification move: the general object is strictly more expressive, and the special cases are obtained by restricting the MLP weights (e.g., making the kernel translation-invariant and feature-independent for convolution). No equation in the supplied text equates the output to the input by construction, nor is any load-bearing premise justified solely by self-citation. The universality statement follows from the known universal-approximation property of MLPs applied to the integral operator, which is external to the paper. Empirical results on multiple benchmarks are independent of the subsumption claim. Therefore the derivation chain does not reduce to its own inputs.

Axiom & Free-Parameter Ledger

1 free parameters · 2 axioms · 1 invented entities

The central claim rests on the domain assumption that existing architectures are exactly recoverable as special cases of the MLP-kernel integral transform and on the practical assumption that the listed efficiency techniques preserve those recoveries at scale.

free parameters (1)
  • MLP kernel weights
    The parameters of the small MLP that defines the position-and-feature-dependent kernel are learned from data and determine which special case is realized.
axioms (2)
  • domain assumption Convolution, self-attention, and autoregressive recurrence arise as special cases of the learnable integral transform under appropriate MLP parameterizations
    Invoked when the abstract states that the three families arise as special cases.
  • domain assumption The integral transform with MLP kernel is a universal approximator of continuous operators
    Stated directly in the abstract as part of the theoretical claim.
invented entities (1)
  • ITNet with jointly position-and-feature-dependent MLP kernel no independent evidence
    purpose: Single architecture that subsumes prior families
    New model introduced in the paper; no independent evidence outside the claimed special cases and benchmarks.

pith-pipeline@v0.9.1-grok · 5783 in / 1704 out tokens · 62188 ms · 2026-06-30T11:02:57.497713+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

110 extracted references · 15 canonical work pages · 13 internal anchors

  1. [1]

    GPT-4 Technical Report

    Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. GPT-4 Technical Report.arXiv preprint arXiv:2303.08774, 2023

  2. [2]

    Flamingo: a Visual Language Model for Few-Shot Learning.Advances in Neural Information Processing Systems, 35:23716–23736, 2022

    Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. Flamingo: a Visual Language Model for Few-Shot Learning.Advances in Neural Information Processing Systems, 35:23716–23736, 2022

  3. [3]

    An Estimate for the Perturbations of the Solution of Ordinary Differential Equations.Vestnik Moskov

    V M Alekseev. An Estimate for the Perturbations of the Solution of Ordinary Differential Equations.Vestnik Moskov. Univ. Ser. I Mat. Meh., 2:28–36, 1961

  4. [4]

    Neural Operator: Graph Kernel Network for Partial Differential Equations

    Anima Anandkumar, Kamyar Azizzadenesheli, Kaushik Bhattacharya, Nikola Kovachki, Zongyi Li, Burigede Liu, and Andrew Stuart. Neural Operator: Graph Kernel Network for Partial Differential Equations. InICLR 2020 Workshop on Integration of Deep Neural Models and Differential Equations, 2020

  5. [5]

    Layer Normalization

    Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. Layer Normalization.arXiv preprint arXiv:1607.06450, 2016

  6. [6]

    Universal Approximation Bounds for Superpositions of a Sigmoidal Func- tion.IEEE Transactions on Information theory, 39(3):930–945, 1993

    Andrew R Barron. Universal Approximation Bounds for Superpositions of a Sigmoidal Func- tion.IEEE Transactions on Information theory, 39(3):930–945, 1993

  7. [7]

    Longformer: The Long-Document Transformer

    Iz Beltagy, Matthew E Peters, and Arman Cohan. Longformer: The Long-Document Trans- former.arXiv preprint arXiv:2004.05150, 2020

  8. [8]

    Continuum Attention for Neural Operators.Journal of Machine Learning Research, 26(300):1–52, 2025

    Edoardo Calvello, Nikola B Kovachki, Matthew E Levine, and Andrew M Stuart. Continuum Attention for Neural Operators.Journal of Machine Learning Research, 26(300):1–52, 2025

  9. [9]

    Universal Approximation to nonlinear operators by Neural Networks with Arbitrary Activation Functions and Its Application to Dynamical Systems

    Tianping Chen and Hong Chen. Universal Approximation to nonlinear operators by Neural Networks with Arbitrary Activation Functions and Its Application to Dynamical Systems. IEEE Transactions On Neural Networks, 6(4):911–917, 1995

  10. [10]

    Training Deep Nets with Sublinear Memory Cost

    Tianqi Chen, Bing Xu, Chiyuan Zhang, and Carlos Guestrin. Training Deep Nets with Sub- linear Memory Cost.arXiv preprint arXiv:1604.06174, 2016

  11. [11]

    UNITER: UNiversal Image-TExt Representation Learning

    Yen-Chun Chen, Linjie Li, Licheng Yu, Ahmed El Kholy, Faisal Ahmed, Zhe Gan, Yu Cheng, and Jingjing Liu. UNITER: UNiversal Image-TExt Representation Learning. InEuropean Conference on Computer Vision, pages 104–120. Springer, 2020

  12. [12]

    Dynamic Convolution: Attention over Convolution Kernels

    Yinpeng Chen, Xiyang Dai, Mengchen Liu, Dongdong Chen, Lu Yuan, and Zicheng Liu. Dynamic Convolution: Attention over Convolution Kernels. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11030–11039, 2020

  13. [13]

    Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation

    Kyunghyun Cho, Bart Van Merri ¨enboer, C ¸ a˘glar Gulc ¸ehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation. InProceedings of the 2014 Con- ference on Empirical Methods in Natural Language Processing (EMNLP), pages 1724–1734, 2014

  14. [14]

    Xception: Deep Learning with Depthwise Separable Convolutions

    Franc ¸ois Chollet. Xception: Deep Learning with Depthwise Separable Convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1251–1258, 2017

  15. [15]

    Rethinking Attention with Performers

    Krzysztof Marcin Choromanski, Valerii Likhosherstov, David Dohan, Xingyou Song, An- dreea Gane, Tamas Sarlos, Peter Hawkins, Jared Quincy Davis, Afroz Mohiuddin, Lukasz Kaiser, David Benjamin Belanger, Lucy J Colwell, and Adrian Weller. Rethinking Attention with Performers. InInternational Conference on Learning Representations, 2021

  16. [16]

    Le, and Christopher D

    Kevin Clark, Minh-Thang Luong, Quoc V . Le, and Christopher D. Manning. ELECTRA: Pre- training Text Encoders as Discriminators Rather Than Generators.International Conference on Learning Representations, 2020. 10

  17. [17]

    McGraw-Hill, 1955

    Earl A Coddington and Norman Levinson.Theory of Ordinary Differential Equations. McGraw-Hill, 1955

  18. [18]

    On the Relationship between Self-Attention and Convolutional Layers.arXiv preprint arXiv:1911.03584, 2019

    Jean-Baptiste Cordonnier, Andreas Loukas, and Martin Jaggi. On the Relationship between Self-Attention and Convolutional Layers.arXiv preprint arXiv:1911.03584, 2019

  19. [19]

    RandAugment: Practical Au- tomated Data Augmentation with a Reduced Search Space

    Ekin D Cubuk, Barret Zoph, Jonathon Shlens, and Quoc V Le. RandAugment: Practical Au- tomated Data Augmentation with a Reduced Search Space. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pages 702–703, 2020

  20. [20]

    Approximation by Superpositions of a Sigmoidal Function.Mathematics of control, signals and systems, 2(4):303–314, 1989

    George Cybenko. Approximation by Superpositions of a Sigmoidal Function.Mathematics of control, signals and systems, 2(4):303–314, 1989

  21. [21]

    Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality

    Tri Dao and Albert Gu. Transformers are SSMs: Generalized Models and Efficient Algo- rithms Through Structured State Space Duality.arXiv preprint arXiv:2405.21060, 2024

  22. [22]

    FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness

    Tri Dao, Dan Fu, Stefano Ermon, Atri Rudra, and Christopher R ´e. FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness. InAdvances in Neural Information Processing Systems, 2022

  23. [23]

    Courier Corpora- tion, 2007

    Philip J Davis and Philip Rabinowitz.Methods of Numerical Integration. Courier Corpora- tion, 2007

  24. [24]

    BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

    Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. InProceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, 2019

  25. [25]

    Scaling Up Your Ker- nels to 31x31: Revisiting Large Kernel Design in CNNs

    Xiaohan Ding, Xiangyu Zhang, Jungong Han, and Guiguang Ding. Scaling Up Your Ker- nels to 31x31: Revisiting Large Kernel Design in CNNs. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11963–11975, 2022

  26. [26]

    An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

    Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. InInternational Conference on Learning Representations, 2021

  27. [27]

    An Empirical Study of Training End- to-End Vision-and-Language Transformers

    Zi-Yi Dou, Yichong Xu, Zhe Gan, Jianfeng Wang, Shuohang Wang, Lijuan Wang, Chenguang Zhu, Pengchuan Zhang, Lu Yuan, Nanyun Peng, et al. An Empirical Study of Training End- to-End Vision-and-Language Transformers. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18166–18176, 2022

  28. [28]

    John Wiley & Sons, 1988

    Nelson Dunford and Jacob T Schwartz.Linear Operators, part 1: General Theory. John Wiley & Sons, 1988

  29. [29]

    Graph Neural Networks with Learnable Structural and Positional Representations

    Vijay Prakash Dwivedi, Anh Tuan Luu, Thomas Laurent, Yoshua Bengio, and Xavier Bres- son. Graph Neural Networks with Learnable Structural and Positional Representations. In International Conference on Learning Representations, 2022

  30. [30]

    Provably Strict Generalisation Benefit for Equivariant Models

    Bryn Elesedy and Sheheryar Zaidi. Provably Strict Generalisation Benefit for Equivariant Models. InInternational Conference on Machine Learning, pages 2959–2969. PMLR, 2021

  31. [31]

    John Wiley & Sons, 1999

    Gerald B Folland.Real Analysis: Modern Techniques and Their Applications. John Wiley & Sons, 1999

  32. [32]

    Understanding the difficulty of training deep feedforward neural networks

    Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. InProceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pages 249–256. JMLR Workshop and Conference Proceedings, 2010

  33. [33]

    Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering

    Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6904–6913, 2017. 11

  34. [34]

    VEB Deutscher Verlag der Wis- senschaften, 1960

    Wolfgang Gr ¨obner.Die Lie-Reihen und ihre Anwendungen. VEB Deutscher Verlag der Wis- senschaften, 1960

  35. [35]

    Note on the Derivatives with Respect to a Parameter of the Solu- tions of a System of Differential Equations.Annals of Mathematics, 20(4):292–296, 1919

    Thomas Hakon Gronwall. Note on the Derivatives with Respect to a Parameter of the Solu- tions of a System of Differential Equations.Annals of Mathematics, 20(4):292–296, 1919

  36. [36]

    Mamba: Linear-Time Sequence Modeling with Selective State Spaces

    Albert Gu and Tri Dao. Mamba: Linear-Time Sequence Modeling with Selective State Spaces.arXiv preprint arXiv:2312.00752, 2023

  37. [37]

    Efficiently Modeling Long Sequences with Structured State Spaces

    Albert Gu, Karan Goel, and Christopher R ´e. Efficiently Modeling Long Sequences with Structured State Spaces.arXiv preprint arXiv:2111.00396, 2021

  38. [38]

    PCT: Point cloud transformer.Computational visual media, 7(2):187–199, 2021

    Meng-Hao Guo, Jun-Xiong Cai, Zheng-Ning Liu, Tai-Jiang Mu, Ralph R Martin, and Shi- Min Hu. PCT: Point cloud transformer.Computational visual media, 7(2):187–199, 2021

  39. [39]

    Der Massbegriff in der Theorie der kontinuierlichen Gruppen.Annals of math- ematics, 34(1):147–169, 1933

    Alfred Haar. Der Massbegriff in der Theorie der kontinuierlichen Gruppen.Annals of math- ematics, 34(1):147–169, 1933

  40. [40]

    SIAM, 2002

    Philip Hartman.Ordinary Differential Equations. SIAM, 2002

  41. [41]

    Deep Residual Learning for Image Recognition

    Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep Residual Learning for Image Recognition. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016

  42. [42]

    DeBERTa: Decoding-enhanced BERT with Disentangled Attention

    Pengcheng He, Xiaodong Liu, Jianfeng Gao, and Weizhu Chen. DeBERTa: Decoding- enhanced BERT with Disentangled Attention.arXiv preprint arXiv:2006.03654, 2020

  43. [43]

    Gaussian Error Linear Units (GELUs)

    Dan Hendrycks and Kevin Gimpel. Gaussian Error Linear Units (GELUs).arXiv preprint arXiv:1606.08415, 2016

  44. [44]

    Long Short-Term Memory.Neural computation, 9(8):1735–1780, 1997

    Sepp Hochreiter and J ¨urgen Schmidhuber. Long Short-Term Memory.Neural computation, 9(8):1735–1780, 1997

  45. [45]

    Horn and Charles R

    Roger A. Horn and Charles R. Johnson.Matrix Analysis. Cambridge University Press, 2012

  46. [46]

    Multilayer Feedforward Networks are Universal Approximators.Neural networks, 2(5):359–366, 1989

    Kurt Hornik, Maxwell Stinchcombe, and Halbert White. Multilayer Feedforward Networks are Universal Approximators.Neural networks, 2(5):359–366, 1989

  47. [47]

    Deep Net- works with Stochastic Depth

    Gao Huang, Yu Sun, Zhuang Liu, Daniel Sedra, and Kilian Q Weinberger. Deep Net- works with Stochastic Depth. InEuropean Conference on Computer Vision, pages 646–661. Springer, 2016

  48. [48]

    Perceiver IO: A General Architecture for Structured Inputs & Outputs

    Andrew Jaegle, Sebastian Borgeaud, Jean-Baptiste Alayrac, Carl Doersch, Catalin Ionescu, David Ding, Skanda Koppula, Daniel Zoran, Andrew Brock, Evan Shelhamer, et al. Perceiver IO: A General Architecture for Structured Inputs & Outputs.arXiv preprint arXiv:2107.14795, 2021

  49. [49]

    Perceiver: General Perception with Iterative Attention

    Andrew Jaegle, Felix Gimeno, Andy Brock, Oriol Vinyals, Andrew Zisserman, and Joao Carreira. Perceiver: General Perception with Iterative Attention. InInternational Conference on Machine Learning, pages 4651–4664. PMLR, 2021

  50. [50]

    Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention

    Angelos Katharopoulos, Apoorv Vyas, Nikolaos Pappas, and Franc ¸ois Fleuret. Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention. InInternational Confer- ence on Machine Learning, pages 5156–5165. PMLR, 2020

  51. [51]

    ViLT: Vision-and-Language Transformer With- out Convolution or Region Supervision

    Wonjae Kim, Bokyung Son, and Ildoo Kim. ViLT: Vision-and-Language Transformer With- out Convolution or Region Supervision. InInternational Conference on Machine Learning, pages 5583–5594. PMLR, 2021

  52. [52]

    ImageNet Classification with Deep Convolutional Neural Networks

    Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. ImageNet Classification with Deep Convolutional Neural Networks. InAdvances in Neural Information Processing Systems, volume 25. Curran Associates, Inc., 2012. 12

  53. [53]

    Backpropagation Applied to Handwritten Zip Code Recog- nition.Neural computation, 1(4):541–551, 1989

    Yann LeCun, Bernhard Boser, John S Denker, Donnie Henderson, Richard E Howard, Wayne Hubbard, and Lawrence D Jackel. Backpropagation Applied to Handwritten Zip Code Recog- nition.Neural computation, 1(4):541–551, 1989

  54. [54]

    Multilayer Feedfor- ward Networks With a Nonpolynomial Activation Function can Approximate Any Function

    Moshe Leshno, Vladimir Ya Lin, Allan Pinkus, and Shimon Schocken. Multilayer Feedfor- ward Networks With a Nonpolynomial Activation Function can Approximate Any Function. Neural networks, 6(6):861–867, 1993

  55. [55]

    Align before Fuse: Vision and Language Representation Learning with Momentum Distillation.Advances in Neural Information Processing Systems, 34:9694– 9705, 2021

    Junnan Li, Ramprasaath Selvaraju, Akhilesh Gotmare, Shafiq Joty, Caiming Xiong, and Steven Chu Hong Hoi. Align before Fuse: Vision and Language Representation Learning with Momentum Distillation.Advances in Neural Information Processing Systems, 34:9694– 9705, 2021

  56. [56]

    BLIP: Bootstrapping Language- Image Pre-training for Unified Vision-Language Understanding and Generation

    Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. BLIP: Bootstrapping Language- Image Pre-training for Unified Vision-Language Understanding and Generation. InInterna- tional Conference on Machine Learning, pages 12888–12900. PMLR, 2022

  57. [57]

    BLIP-2: Bootstrapping Language- Image Pre-training with Frozen Image Encoders and Large Language Models

    Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. BLIP-2: Bootstrapping Language- Image Pre-training with Frozen Image Encoders and Large Language Models. InInterna- tional Conference on Machine Learning, pages 19730–19742. PMLR, 2023

  58. [58]

    Fourier Neural Operator for Parametric Partial Differential Equations

    Zongyi Li, Nikola Kovachki, Kamyar Azizzadenesheli, Burigede Liu, Kaushik Bhattacharya, Andrew Stuart, and Anima Anandkumar. Fourier Neural Operator for Parametric Partial Differential Equations.arXiv preprint arXiv:2010.08895, 2020

  59. [59]

    RoBERTa: A Robustly Optimized BERT Pretraining Approach

    Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. RoBERTa: A Robustly Optimized BERT Pretraining Approach.arXiv preprint arXiv:1907.11692, 2019

  60. [60]

    Swin Transformer: Hierarchical Vision Transformer using Shifted Windows

    Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Bain- ing Guo. Swin Transformer: Hierarchical Vision Transformer using Shifted Windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 10012– 10022, 2021

  61. [61]

    Swin Transformer V2: Scaling Up Capacity and Resolution

    Ze Liu, Han Hu, Yutong Lin, Zhuliang Yao, Zhenda Xie, Yixuan Wei, Jia Ning, Yue Cao, Zheng Zhang, Li Dong, Furu Wei, and Baining Guo. Swin Transformer V2: Scaling Up Capacity and Resolution. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12009–12019, 2022

  62. [62]

    A ConvNet for the 2020s

    Zhuang Liu, Hanzi Mao, Chao-Yuan Wu, Christoph Feichtenhofer, Trevor Darrell, and Sain- ing Xie. A ConvNet for the 2020s. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11976–11986, 2022

  63. [63]

    Decoupled Weight Decay Regularization

    Ilya Loshchilov and Frank Hutter. Decoupled Weight Decay Regularization. InInternational Conference on Learning Representations, 2019

  64. [64]

    Learn- ing nonlinear operators via DeepONet based on the universal approximation theorem of op- erators.Nature machine intelligence, 3(3):218–229, 2021

    Lu Lu, Pengzhan Jin, Guofei Pang, Zhongqiang Zhang, and George Em Karniadakis. Learn- ing nonlinear operators via DeepONet based on the universal approximation theorem of op- erators.Nature machine intelligence, 3(3):218–229, 2021

  65. [65]

    Rethinking Network Design and Local Geometry in Point Cloud: A Simple Residual MLP Framework

    Xu Ma, Can Qin, Haoxuan You, Haoxi Ran, and Yun Fu. Rethinking Network Design and Local Geometry in Point Cloud: A Simple Residual MLP Framework. InInternational Con- ference on Learning Representations, 2022

  66. [66]

    Mixed Precision Training

    Paulius Micikevicius, Sharan Narang, Jonah Alben, Gregory Diamos, Erich Elsen, David Garcia, Boris Ginsburg, Michael Houston, Oleksii Kuchaiev, Ganesh Venkatesh, and Hao Wu. Mixed Precision Training. InInternational Conference on Learning Representations, 2018

  67. [67]

    Springer Science & Business Media, 2010

    B ´ela Sz Nagy, Ciprian Foias, Hari Bercovici, and L ´aszl´o K ´erchy.Harmonic Analysis of Operators on Hilbert Space. Springer Science & Business Media, 2010. 13

  68. [68]

    EfficientVMamba: Atrous Selective Scan for Light Weight Visual Mamba

    Xiaohuan Pei, Tao Huang, and Chang Xu. EfficientVMamba: Atrous Selective Scan for Light Weight Visual Mamba. InProceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 6443–6451, 2025

  69. [69]

    Hyena Hierarchy: Towards Larger Convolutional Language Models

    Michael Poli, Stefano Massaroli, Eric Nguyen, Daniel Y Fu, Tri Dao, Stephen Baccus, Yoshua Bengio, Stefano Ermon, and Christopher R ´e. Hyena Hierarchy: Towards Larger Convolutional Language Models. InInternational Conference on Machine Learning, pages 28043–28078. PMLR, 2023

  70. [70]

    Acceleration of Stochastic Approximation by Aver- aging.SIAM Journal on Control and Optimization, 30(4):838–855, 1992

    Boris T Polyak and Anatoli B Juditsky. Acceleration of Stochastic Approximation by Aver- aging.SIAM Journal on Control and Optimization, 30(4):838–855, 1992

  71. [71]

    Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation

    Ofir Press, Noah Smith, and Mike Lewis. Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation. InInternational Conference on Learning Repre- sentations, 2022

  72. [72]

    PointNet: Deep Learning on Point Sets for 3D Classification and Segmentation

    Charles R Qi, Hao Su, Kaichun Mo, and Leonidas J Guibas. PointNet: Deep Learning on Point Sets for 3D Classification and Segmentation. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 652–660, 2017

  73. [73]

    PointNet++: Deep Hierar- chical Feature Learning on Point Sets in a Metric Space.Advances in Neural Information Processing Systems, 30, 2017

    Charles Ruizhongtai Qi, Li Yi, Hao Su, and Leonidas J Guibas. PointNet++: Deep Hierar- chical Feature Learning on Point Sets in a Metric Space.Advances in Neural Information Processing Systems, 30, 2017

  74. [74]

    PointNeXt: Revisiting PointNet++ with Improved Training and Scaling Strategies.Advances in Neural Information Processing Systems, 35:23192–23204, 2022

    Guocheng Qian, Yuchen Li, Houwen Peng, Jinjie Mai, Hasan Hammoud, Mohamed Elho- seiny, and Bernard Ghanem. PointNeXt: Revisiting PointNet++ with Improved Training and Scaling Strategies.Advances in Neural Information Processing Systems, 35:23192–23204, 2022

  75. [75]

    Improving Language Understanding by Generative Pre-Training

    Alec Radford, Karthik Narasimhan, Tim Salimans, Ilya Sutskever, et al. Improving Language Understanding by Generative Pre-Training. 2018

  76. [76]

    Language Models are Unsupervised Multitask Learners.OpenAI blog, 1(8):9, 2019

    Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language Models are Unsupervised Multitask Learners.OpenAI blog, 1(8):9, 2019

  77. [77]

    Number 6

    Halsey Lawrence Royden.Real Analysis. Number 6. Krishna Prakashan Media, 1988

  78. [78]

    Principles of Mathematical Analysis.International Series in Pure and Applied Mathematics, 1976

    Walter Rudin. Principles of Mathematical Analysis.International Series in Pure and Applied Mathematics, 1976

  79. [79]

    McGraw-Hill, New York, 3rd edition, 1987

    Walter Rudin.Real and Complex Analysis. McGraw-Hill, New York, 3rd edition, 1987

  80. [80]

    ImageNet Large Scale Visual Recognition Challenge.International Journal of Computer Vision, 115(3):211–252, 2015

    Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. ImageNet Large Scale Visual Recognition Challenge.International Journal of Computer Vision, 115(3):211–252, 2015

Showing first 80 references.