Generalization Analysis of Transformers in Distribution Regression

Ding-Xuan Zhou; Peilin Liu

arxiv: 2606.29256 · v1 · pith:FRSFSPEVnew · submitted 2026-06-28 · 📊 stat.ML · cs.LG

Generalization Analysis of Transformers in Distribution Regression

Peilin Liu , Ding-Xuan Zhou This is my paper

Pith reviewed 2026-06-30 02:44 UTC · model grok-4.3

classification 📊 stat.ML cs.LG

keywords attention operatordistribution regressiongeneralization boundtransformerfunctional learningprompt tuningparameter-efficient fine-tuning

0 comments

The pith

Transformers can compress input distributions into function representations without loss of information via a novel attention operator.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out a distribution-regression framework in which distributions serve as inputs and a two-stage sampling process links the setup to natural-language tasks. Within this framework it introduces an attention operator that mathematically encodes the attention mechanism. The central results are that this operator compresses any input distribution into a function representation with no information loss, that the resulting model class can represent functionals of greater structural complexity than those handled by convolutional or fully-connected networks, and that generalization bounds follow directly for the distribution-regression problem. These claims supply a concrete mechanism that accounts for observed Transformer behavior on distribution-valued data and for the practical success of techniques such as prompt tuning.

Core claim

By the attention operator, Transformers can compress distributions into function representations without loss of information. Moreover, with the advantages of our novel attention operator, Transformers exhibit a stronger capability to learn functionals with more complex structures than convolutional neural networks and fully connected networks. Finally, we obtain a generalization bound within the distribution regression framework.

What carries the argument

The attention operator, a mathematical formulation of the attention mechanism that maps an input distribution to a function representation while preserving all information.

If this is right

Generalization bounds hold for Transformer models trained under the distribution-regression objective.
Transformers can represent a strictly larger class of functionals than convolutional or fully-connected networks of comparable capacity.
The lossless compression property supplies a theoretical basis for prompt tuning and parameter-efficient fine-tuning inside the same framework.
Efficient scaling arguments follow from the operator’s ability to maintain information while increasing model depth or width.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same operator may explain why Transformers succeed on other modalities that can be cast as distribution regression.
If the operator is invertible, one could in principle recover the original distribution from the learned function representation.
The two-stage sampling construction suggests a natural way to transfer the analysis to sequence-to-sequence tasks beyond language.

Load-bearing premise

The proposed mathematical attention operator exactly reproduces the behavior of real Transformer layers when the inputs are probability distributions.

What would settle it

A concrete counter-example distribution on which the attention operator produces a function representation whose integral against some test functional differs from the true value by more than numerical tolerance.

read the original abstract

In recent years, models based on the Transformer architecture have seen widespread applications and have become one of the core tools in the field of deep learning. Numerous successful techniques, such as parameter-efficient fine-tuning and efficient scaling, have been proposed surrounding their applications to further enhance performance. However, the success of these strategies has always lacked the support of rigorous mathematical theory. To study the underlying mechanisms behind Transformers and related techniques, we first propose a Transformer learning framework motivated by distribution regression, with distributions being inputs, connect a two-stage sampling process with natural language processing, and present a mathematical formulation of the attention mechanism called attention operator. We demonstrate that by the attention operator, Transformers can compress distributions into function representations without loss of information. Moreover, with the advantages of our novel attention operator, Transformers exhibit a stronger capability to learn functionals with more complex structures than convolutional neural networks and fully connected networks. Finally, we obtain a generalization bound within the distribution regression framework. Through the aforementioned theoretical results, we further discuss some successful techniques emerging with large language models (LLMs), such as prompt tuning, parameter-efficient fine-tuning, and efficient scaling. We also provide theoretical insights behind these techniques within our novel analysis framework.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper sets up a distribution-regression model with a custom attention operator, proves lossless compression for it, and derives a generalization bound while claiming an edge over CNNs and FC nets, but the operator is not standard attention.

read the letter

This paper supplies a generalization bound for a Transformer model in distribution regression using a custom attention operator that they show compresses distributions losslessly. It also claims this gives Transformers an edge over CNNs and FC networks for complex functionals.

The framework is new in how it sets up the attention operator for this setting and ties it to LLM methods like fine-tuning and prompt tuning through the distribution lens. The two-stage sampling argument is an attempt to link the math back to NLP tasks.

The work is solid in laying out the model and reaching the bound. They define the operator, establish its properties, compare learning capacity, and close with the bound. That is concrete output rather than hand-waving.

The soft spot is the gap between their operator and standard Transformer attention. The operator is defined mathematically for distributions, and the no-loss claim holds for it. But if it does not emerge as a limit of the usual scaled dot-product attention on empirical measures drawn from the input distributions, then the results apply to this abstract version, not necessarily to the models used in practice. The superiority claim and the bound rest on that modeling choice. The two-stage sampling connection is also a specific construction that may not capture how language data actually arrives at the model.

This paper is for machine learning theorists interested in generalization for attention-based models on distributional inputs. It is worth sending to peer review because it delivers a specific bound and framework even if referees will press on the equivalence to real attention and the strength of the NLP link.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes a distribution-regression framework for analyzing Transformers, connecting it to NLP via a two-stage sampling process. It defines a custom attention operator, claims this operator compresses arbitrary input distributions into function representations injectively (without loss of information), asserts that the resulting Transformers learn functionals with more complex structures than CNNs or fully connected networks, derives a generalization bound, and uses the framework to discuss LLM techniques such as prompt tuning and parameter-efficient fine-tuning.

Significance. If the central claims hold with the stated operator, the work would supply a theoretical lens on Transformers for distributional inputs and on scaling/fine-tuning methods. The lossless-compression and generalization results would be the primary contributions; their value hinges on whether the operator is shown to recover or limit to standard attention on empirical measures.

major comments (2)

[attention operator definition] The lossless compression claim (abstract and the section defining the attention operator) rests on injectivity of the custom attention operator. The manuscript motivates the operator via measures but does not establish that it coincides with, or is recovered in the appropriate limit from, the standard scaled dot-product attention applied to finite token sequences drawn from the input distributions. This equivalence is load-bearing for transferring the compression property and the subsequent superiority claim to actual Transformer architectures.
[generalization bound] The generalization bound (final theoretical section) is derived inside the distribution-regression framework that employs the custom operator. It is unclear whether the bound's assumptions on the function class, the two-stage sampling, or the operator's properties remain valid when the operator is replaced by standard attention on empirical measures; this gap directly affects whether the bound applies to practical models.

minor comments (2)

The abstract packs multiple distinct claims; a short enumerated list of contributions would improve readability.
All assumptions required for the compression and generalization theorems should be collected in a single, clearly labeled subsection rather than scattered through the derivations.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful and constructive review. The comments highlight important points regarding the relationship between our attention operator and standard Transformer attention, which we address below. We will revise the manuscript to strengthen these connections.

read point-by-point responses

Referee: [attention operator definition] The lossless compression claim (abstract and the section defining the attention operator) rests on injectivity of the custom attention operator. The manuscript motivates the operator via measures but does not establish that it coincides with, or is recovered in the appropriate limit from, the standard scaled dot-product attention applied to finite token sequences drawn from the input distributions. This equivalence is load-bearing for transferring the compression property and the subsequent superiority claim to actual Transformer architectures.

Authors: We agree that a formal demonstration of how the attention operator recovers standard scaled dot-product attention on empirical measures would strengthen the link to practical architectures. The operator is defined as the natural extension of attention to the space of probability measures in the distribution-regression setting, with the two-stage sampling process providing the bridge to token sequences in NLP. In the revision we will add a dedicated remark (or short proposition) in the section defining the operator, showing that when the input measures are finite-support empirical distributions the operator reduces to the standard attention computation (up to the usual scaling and normalization). This will make the lossless-compression claim directly transferable and clarify the scope of the superiority result relative to CNNs and FCNs. revision: yes
Referee: [generalization bound] The generalization bound (final theoretical section) is derived inside the distribution-regression framework that employs the custom operator. It is unclear whether the bound's assumptions on the function class, the two-stage sampling, or the operator's properties remain valid when the operator is replaced by standard attention on empirical measures; this gap directly affects whether the bound applies to practical models.

Authors: The generalization bound is proved inside the distribution-regression framework that uses the attention operator and the two-stage sampling model. We acknowledge that the manuscript does not explicitly verify that all hypotheses of the bound continue to hold under the standard attention operator applied to finite samples. In the revision we will insert a short discussion subsection immediately after the bound statement. It will (i) restate the key assumptions in terms of the operator, (ii) note that the bound carries over verbatim once the recovery result from the first comment is established, and (iii) indicate the additional regularity conditions (if any) needed for the empirical-measure case. This will make the implications for prompt tuning and parameter-efficient fine-tuning more precise. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation is self-contained from definitions and analysis

full rationale

The paper defines a custom attention operator motivated by distribution regression and a two-stage sampling process, then derives the lossless compression property, comparative capability over CNN/FC networks, and a generalization bound directly from that operator and framework. These steps follow standard theoretical derivation from stated assumptions and definitions rather than reducing to self-citation chains, fitted parameters renamed as predictions, or self-definitional loops. No load-bearing self-citations or ansatzes smuggled via prior work are indicated in the provided text, and the central results are presented as consequences of the novel formulation itself.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities are identifiable beyond the new attention operator formulation itself.

pith-pipeline@v0.9.1-grok · 5734 in / 1094 out tokens · 39055 ms · 2026-06-30T02:44:32.716798+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

47 extracted references · 4 canonical work pages · 2 internal anchors

[1]

Universal approximation bounds for superpositions of a sigmoidal function

Barron, A. Universal approximation bounds for superpositions of a sigmoidal function. IEEE Transactions on Information Theory . 39, 930-945 (1993)

1993
[2]

& Scott, C

Blanchard, G., Deshmukh, A., Dogan, U., Lee, G. & Scott, C. Domain generalization by marginal transfer learning. Journal of Machine Learning Research . 22, 1-55 (2021)

2021
[3]

& Others

Choromanski, K., Likhosherstov, V., Dohan, D., Song, X., Gane, A., Sarlos, T., Hawkins, P., Davis, J., Mohiuddin, A., Kaiser, L. & Others. Rethinking attention with performers. International Conference on Learning Representations . (2020)

2020
[4]

& Toutanova, K

Devlin, J., Chang, M., Lee, K. & Toutanova, K. Pre-training of deep bidirectional Transformers for language understanding. Proceedings of The 2019 Conference of The North American Chapter of The Association For Computational Linguistics: Human Language Technologies, Volume 1 (Long And Short Papers) . pp. 4171-4186 (2019)

2019
[5]

Learning Factored Representations in a Deep Mixture of Experts

Eigen, D., Ranzato, M. & Sutskever, I. Learning factored representations in a deep mixture of experts. ArXiv Preprint ArXiv:1312.4314 . (2013)

work page internal anchor Pith review Pith/arXiv arXiv 2013
[6]

& Zhou, D

Fang, Z., Guo, Z. & Zhou, D. Optimal learning rates for distribution regression. Journal of Complexity . 56 pp. 101426 (2020)

2020
[7]

& Smola, A

Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B. & Smola, A. A kernel method for the two-sample-problem. Advances in Neural Information Processing Systems . 19 (2006)

2006
[8]

& Neubig, G

He, J., Zhou, C., Ma, X., Berg-Kirkpatrick, T. & Neubig, G. Towards a unified view of parameter-efficient transfer learning. International Conference on Learning Representations . (2022)

2022
[9]

& Zellinger, W

Holzleitner, M., Pereverzyev, S. & Zellinger, W. Domain generalization by functional regression. Numerical Functional Analysis and Optimization . 45, 259-281 (2024)

2024
[10]

& Gelly, S

Houlsby, N., Giurgiu, A., Jastrzebski, S., Morrone, B., De Laroussilhe, Q., Gesmundo, A., Attariyan, M. & Gelly, S. Parameter-efficient transfer learning for NLP. International Conference on Machine Learning . pp. 2790-2799 (2019)

2019
[11]

& Others

Hu, E., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., Chen, W. & Others. LoRA: Low-rank adaptation of large language models. International Conference on Learning Representations . (2022)

2022
[12]

& Lim, S

Jia, M., Tang, L., Chen, B., Cardie, C., Belongie, S., Hariharan, B. & Lim, S. Visual prompt tuning. European Conference on Computer Vision . pp. 709-727 (2022)

2022
[13]

Mixtral of Experts

Jiang, A., Sablayrolles, A., Roux, A., Mensch, A., Savary, B., Bamford, C., Chaplot, D., Casas, D., Hanna, E., Bressand, F. & Others. Mixtral of experts. ArXiv Preprint ArXiv:2401.04088 . (2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024
[14]

& Others

Jumper, J., Evans, R., Pritzel, A., Green, T., Figurnov, M., Ronneberger, O., Tunyasuvunakool, K., Bates, R., Žıdek, A., Potapenko, A. & Others. Highly accurate protein structure prediction with AlphaFold. Nature . 596, 583-589 (2021)

2021
[15]

& Constant, N

Lester, B., Al-Rfou, R. & Constant, N. The power of scale for parameter-efficient prompt tuning. Proceedings of The 2021 Conference on Empirical Methods in Natural Language Processing . pp. 3045-3059 (2021)

2021
[16]

& Hoi, S

Li, J., Li, D., Savarese, S. & Hoi, S. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. International Conference on Machine Learning . pp. 19730-19742 (2023)

2023
[17]

& Guo, B

Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S. & Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. Proceedings of The IEEE/CVF International Conference on Computer Vision . pp. 10012-10022 (2021)

2021
[18]

& Chen, W

Liu, J., Shen, D., Zhang, Y., Dolan, W., Carin, L. & Chen, W. What makes good in-context examples for GPT-3?. Proceedings of Deep Learning Inside Out (DeeLIO 2022): the 3rd Workshop on Knowledge Extraction and Integration for Deep Learning Architectures . pp. 100-114 (2022)

2022
[19]

& Neubig, G

Liu, P., Yuan, W., Fu, J., Jiang, Z., Hayashi, H. & Neubig, G. Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing. ACM Computing Surveys . 55, 1-35 (2023)

2023
[20]

& Zhang, L

Lu, J., Yao, J., Zhang, J., Zhu, X., Xu, H., Gao, W., Xu, C., Xiang, T. & Zhang, L. Soft: Softmax-free transformer with linear complexity. Advances in Neural Information Processing Systems . 34 pp. 21297-21309 (2021)

2021
[21]

& Zhou, D

Mao, T., Shi, Z. & Zhou, D. Approximating functions with multi-features by deep convolutional neural networks. Analysis and Applications . 21, 93-125 (2023)

2023
[22]

& Zettlemoyer, L

Min, S., Lyu, X., Holtzman, A., Artetxe, M., Lewis, M., Hajishirzi, H. & Zettlemoyer, L. Rethinking the role of demonstrations: What makes in-context learning work?. Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing . pp. 11048-11064 (2022)

2022
[23]

& Others

Muandet, K., Fukumizu, K., Sriperumbudur, B., Schölkopf, B. & Others. Kernel mean embedding of distributions: A review and beyond. Foundations and Trends® in Machine Learning . 10, 1-141 (2017)

2017
[24]

(2023), https://openai.com/research/gpt-4m

OpenAI GPT 4. (2023), https://openai.com/research/gpt-4m

2023
[25]

& Kong, L

Peng, H., Pappas, N., Yogatama, D., Schwartz, R., Smith, N. & Kong, L. Random feature attention. International Conference on Learning Representations . (2020)

2020
[26]

& Zhong, Y

Qin, Z., Sun, W., Deng, H., Li, D., Wei, Y., Lv, B., Yan, J., Kong, L. & Zhong, Y. cosFormer: Rethinking softmax in attention. International Conference on Learning Representations . (2022)

2022
[27]

& Others

Radford, A., Narasimhan, K., Salimans, T., Sutskever, I. & Others. Improving language understanding by generative pre-training. (San Francisco, CA, USA,2018)

2018
[28]

& Liu, P

Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., Zhou, Y., Li, W. & Liu, P. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research . 21, 1-67 (2020)

2020
[29]

Nonparametric regression using deep neural networks with Relu activation function

Schmidt-Hieber, J. Nonparametric regression using deep neural networks with Relu activation function. The Annals of Statistics . 48, 1875-1897 (2020)

2020
[30]

& Dean, J

Shazeer, N., Mirhoseini, A., Maziarz, K., Davis, A., Le, Q., Hinton, G. & Dean, J. Outrageously large neural networks: The sparsely-gated mixture-of-experts Layer. International Conference on Learning Representations . (2016)

2016
[31]

& Zhou, D

Shi, Z., Yu, Z. & Zhou, D. Learning theory of distribution regression with neural networks. ArXiv Preprint ArXiv:2307.03487 . (2023)

work page arXiv 2023
[32]

& Schölkopf, B

Simon-Gabriel, C. & Schölkopf, B. Kernel distribution embeddings: Universal kernels, characteristic kernels and kernel metrics on distributions. Journal of Machine Learning Research . 19, 1-29 (2018)

2018
[33]

& Zhou, D

Song, L., Fan, J., Chen, D. & Zhou, D. Approximation of nonlinear functionals using deep ReLU networks. Journal of Fourier Analysis and Applications . 29, 50 (2023)

2023
[34]

& Zhou, D

Song, L., Liu, Y., Fan, J. & Zhou, D. Approximation of smooth functionals using deep ReLU networks. Neural Networks . 166 pp. 424-436 (2023)

2023
[35]

& Lanckriet, G

Sriperumbudur, B., Fukumizu, K. & Lanckriet, G. Universality, characteristic kernels and RKHS embedding of measures.. Journal of Machine Learning Research . 12 (2011)

2011
[36]

& Lanckriet, G

Sriperumbudur, B., Gretton, A., Fukumizu, K., Schölkopf, B. & Lanckriet, G. Hilbert space embeddings and metrics on probability measures. Journal of Machine Learning Research . 11 pp. 1517-1561 (2010)

2010
[37]

& Gretton, A

Szabó, Z., Sriperumbudur, B., Póczos, B. & Gretton, A. Learning theory for distribution regression. Journal of Machine Learning Research . 17, 1-40 (2016)

2016
[38]

& Polosukhin, I

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A., Kaiser, Ł. & Polosukhin, I. Attention is all you need. Advances in Neural Information Processing Systems . 30 (2017)

2017
[39]

Optimal Transport: Old and New

Villani, C. Optimal Transport: Old and New. (Springer,2009)

2009
[40]

Error bounds for approximations with deep ReLU networks

Yarotsky, D. Error bounds for approximations with deep ReLU networks. Neural Networks . 94 pp. 103-114 (2017)

2017
[41]

& Zhou, D

Yu, Z., Ho, D., Shi, Z. & Zhou, D. Robust kernel-based distribution regression. Inverse Problems . 37, 105014 (2021)

2021
[42]

& Zhou, D

Yu, Z. & Zhou, D. Deep learning theory of distribution regression with CNNs. Advances In Computational Mathematics . 49, 51 (2023)

2023
[43]

Capacity of reproducing kernel spaces in learning theory

Zhou, D. Capacity of reproducing kernel spaces in learning theory. IEEE Transactions on Information Theory . 49, 1743-1752 (2003)

2003
[44]

Deep distributed convolutional neural networks: Universality

Zhou, D. Deep distributed convolutional neural networks: Universality. Analysis and Applications . 16, 895-919 (2018)

2018
[45]

Universality of deep convolutional neural networks

Zhou, D. Universality of deep convolutional neural networks. Applied and Computational Harmonic Analysis . 48, 787-794 (2020)

2020
[46]

& Zhang, W

Zhou, H., Zhang, S., Peng, J., Zhang, S., Li, J., Xiong, H. & Zhang, W. Informer: Beyond efficient transformer for long sequence time-series forecasting. Proceedings of The AAAI Conference on Artificial Intelligence . 35, 11106-11115 (2021)

2021
[47]

& Huo, X

Zhou, T., Suh, N., Cheng, G. & Huo, X. Approximation of RKHS functionals by neural networks. ArXiv Preprint ArXiv:2403.12187 . (2024)

work page arXiv 2024

[1] [1]

Universal approximation bounds for superpositions of a sigmoidal function

Barron, A. Universal approximation bounds for superpositions of a sigmoidal function. IEEE Transactions on Information Theory . 39, 930-945 (1993)

1993

[2] [2]

& Scott, C

Blanchard, G., Deshmukh, A., Dogan, U., Lee, G. & Scott, C. Domain generalization by marginal transfer learning. Journal of Machine Learning Research . 22, 1-55 (2021)

2021

[3] [3]

& Others

Choromanski, K., Likhosherstov, V., Dohan, D., Song, X., Gane, A., Sarlos, T., Hawkins, P., Davis, J., Mohiuddin, A., Kaiser, L. & Others. Rethinking attention with performers. International Conference on Learning Representations . (2020)

2020

[4] [4]

& Toutanova, K

Devlin, J., Chang, M., Lee, K. & Toutanova, K. Pre-training of deep bidirectional Transformers for language understanding. Proceedings of The 2019 Conference of The North American Chapter of The Association For Computational Linguistics: Human Language Technologies, Volume 1 (Long And Short Papers) . pp. 4171-4186 (2019)

2019

[5] [5]

Learning Factored Representations in a Deep Mixture of Experts

Eigen, D., Ranzato, M. & Sutskever, I. Learning factored representations in a deep mixture of experts. ArXiv Preprint ArXiv:1312.4314 . (2013)

work page internal anchor Pith review Pith/arXiv arXiv 2013

[6] [6]

& Zhou, D

Fang, Z., Guo, Z. & Zhou, D. Optimal learning rates for distribution regression. Journal of Complexity . 56 pp. 101426 (2020)

2020

[7] [7]

& Smola, A

Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B. & Smola, A. A kernel method for the two-sample-problem. Advances in Neural Information Processing Systems . 19 (2006)

2006

[8] [8]

& Neubig, G

He, J., Zhou, C., Ma, X., Berg-Kirkpatrick, T. & Neubig, G. Towards a unified view of parameter-efficient transfer learning. International Conference on Learning Representations . (2022)

2022

[9] [9]

& Zellinger, W

Holzleitner, M., Pereverzyev, S. & Zellinger, W. Domain generalization by functional regression. Numerical Functional Analysis and Optimization . 45, 259-281 (2024)

2024

[10] [10]

& Gelly, S

Houlsby, N., Giurgiu, A., Jastrzebski, S., Morrone, B., De Laroussilhe, Q., Gesmundo, A., Attariyan, M. & Gelly, S. Parameter-efficient transfer learning for NLP. International Conference on Machine Learning . pp. 2790-2799 (2019)

2019

[11] [11]

& Others

Hu, E., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., Chen, W. & Others. LoRA: Low-rank adaptation of large language models. International Conference on Learning Representations . (2022)

2022

[12] [12]

& Lim, S

Jia, M., Tang, L., Chen, B., Cardie, C., Belongie, S., Hariharan, B. & Lim, S. Visual prompt tuning. European Conference on Computer Vision . pp. 709-727 (2022)

2022

[13] [13]

Mixtral of Experts

Jiang, A., Sablayrolles, A., Roux, A., Mensch, A., Savary, B., Bamford, C., Chaplot, D., Casas, D., Hanna, E., Bressand, F. & Others. Mixtral of experts. ArXiv Preprint ArXiv:2401.04088 . (2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024

[14] [14]

& Others

Jumper, J., Evans, R., Pritzel, A., Green, T., Figurnov, M., Ronneberger, O., Tunyasuvunakool, K., Bates, R., Žıdek, A., Potapenko, A. & Others. Highly accurate protein structure prediction with AlphaFold. Nature . 596, 583-589 (2021)

2021

[15] [15]

& Constant, N

Lester, B., Al-Rfou, R. & Constant, N. The power of scale for parameter-efficient prompt tuning. Proceedings of The 2021 Conference on Empirical Methods in Natural Language Processing . pp. 3045-3059 (2021)

2021

[16] [16]

& Hoi, S

Li, J., Li, D., Savarese, S. & Hoi, S. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. International Conference on Machine Learning . pp. 19730-19742 (2023)

2023

[17] [17]

& Guo, B

Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S. & Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. Proceedings of The IEEE/CVF International Conference on Computer Vision . pp. 10012-10022 (2021)

2021

[18] [18]

& Chen, W

Liu, J., Shen, D., Zhang, Y., Dolan, W., Carin, L. & Chen, W. What makes good in-context examples for GPT-3?. Proceedings of Deep Learning Inside Out (DeeLIO 2022): the 3rd Workshop on Knowledge Extraction and Integration for Deep Learning Architectures . pp. 100-114 (2022)

2022

[19] [19]

& Neubig, G

Liu, P., Yuan, W., Fu, J., Jiang, Z., Hayashi, H. & Neubig, G. Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing. ACM Computing Surveys . 55, 1-35 (2023)

2023

[20] [20]

& Zhang, L

Lu, J., Yao, J., Zhang, J., Zhu, X., Xu, H., Gao, W., Xu, C., Xiang, T. & Zhang, L. Soft: Softmax-free transformer with linear complexity. Advances in Neural Information Processing Systems . 34 pp. 21297-21309 (2021)

2021

[21] [21]

& Zhou, D

Mao, T., Shi, Z. & Zhou, D. Approximating functions with multi-features by deep convolutional neural networks. Analysis and Applications . 21, 93-125 (2023)

2023

[22] [22]

& Zettlemoyer, L

Min, S., Lyu, X., Holtzman, A., Artetxe, M., Lewis, M., Hajishirzi, H. & Zettlemoyer, L. Rethinking the role of demonstrations: What makes in-context learning work?. Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing . pp. 11048-11064 (2022)

2022

[23] [23]

& Others

Muandet, K., Fukumizu, K., Sriperumbudur, B., Schölkopf, B. & Others. Kernel mean embedding of distributions: A review and beyond. Foundations and Trends® in Machine Learning . 10, 1-141 (2017)

2017

[24] [24]

(2023), https://openai.com/research/gpt-4m

OpenAI GPT 4. (2023), https://openai.com/research/gpt-4m

2023

[25] [25]

& Kong, L

Peng, H., Pappas, N., Yogatama, D., Schwartz, R., Smith, N. & Kong, L. Random feature attention. International Conference on Learning Representations . (2020)

2020

[26] [26]

& Zhong, Y

Qin, Z., Sun, W., Deng, H., Li, D., Wei, Y., Lv, B., Yan, J., Kong, L. & Zhong, Y. cosFormer: Rethinking softmax in attention. International Conference on Learning Representations . (2022)

2022

[27] [27]

& Others

Radford, A., Narasimhan, K., Salimans, T., Sutskever, I. & Others. Improving language understanding by generative pre-training. (San Francisco, CA, USA,2018)

2018

[28] [28]

& Liu, P

Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., Zhou, Y., Li, W. & Liu, P. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research . 21, 1-67 (2020)

2020

[29] [29]

Nonparametric regression using deep neural networks with Relu activation function

Schmidt-Hieber, J. Nonparametric regression using deep neural networks with Relu activation function. The Annals of Statistics . 48, 1875-1897 (2020)

2020

[30] [30]

& Dean, J

Shazeer, N., Mirhoseini, A., Maziarz, K., Davis, A., Le, Q., Hinton, G. & Dean, J. Outrageously large neural networks: The sparsely-gated mixture-of-experts Layer. International Conference on Learning Representations . (2016)

2016

[31] [31]

& Zhou, D

Shi, Z., Yu, Z. & Zhou, D. Learning theory of distribution regression with neural networks. ArXiv Preprint ArXiv:2307.03487 . (2023)

work page arXiv 2023

[32] [32]

& Schölkopf, B

Simon-Gabriel, C. & Schölkopf, B. Kernel distribution embeddings: Universal kernels, characteristic kernels and kernel metrics on distributions. Journal of Machine Learning Research . 19, 1-29 (2018)

2018

[33] [33]

& Zhou, D

Song, L., Fan, J., Chen, D. & Zhou, D. Approximation of nonlinear functionals using deep ReLU networks. Journal of Fourier Analysis and Applications . 29, 50 (2023)

2023

[34] [34]

& Zhou, D

Song, L., Liu, Y., Fan, J. & Zhou, D. Approximation of smooth functionals using deep ReLU networks. Neural Networks . 166 pp. 424-436 (2023)

2023

[35] [35]

& Lanckriet, G

Sriperumbudur, B., Fukumizu, K. & Lanckriet, G. Universality, characteristic kernels and RKHS embedding of measures.. Journal of Machine Learning Research . 12 (2011)

2011

[36] [36]

& Lanckriet, G

Sriperumbudur, B., Gretton, A., Fukumizu, K., Schölkopf, B. & Lanckriet, G. Hilbert space embeddings and metrics on probability measures. Journal of Machine Learning Research . 11 pp. 1517-1561 (2010)

2010

[37] [37]

& Gretton, A

Szabó, Z., Sriperumbudur, B., Póczos, B. & Gretton, A. Learning theory for distribution regression. Journal of Machine Learning Research . 17, 1-40 (2016)

2016

[38] [38]

& Polosukhin, I

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A., Kaiser, Ł. & Polosukhin, I. Attention is all you need. Advances in Neural Information Processing Systems . 30 (2017)

2017

[39] [39]

Optimal Transport: Old and New

Villani, C. Optimal Transport: Old and New. (Springer,2009)

2009

[40] [40]

Error bounds for approximations with deep ReLU networks

Yarotsky, D. Error bounds for approximations with deep ReLU networks. Neural Networks . 94 pp. 103-114 (2017)

2017

[41] [41]

& Zhou, D

Yu, Z., Ho, D., Shi, Z. & Zhou, D. Robust kernel-based distribution regression. Inverse Problems . 37, 105014 (2021)

2021

[42] [42]

& Zhou, D

Yu, Z. & Zhou, D. Deep learning theory of distribution regression with CNNs. Advances In Computational Mathematics . 49, 51 (2023)

2023

[43] [43]

Capacity of reproducing kernel spaces in learning theory

Zhou, D. Capacity of reproducing kernel spaces in learning theory. IEEE Transactions on Information Theory . 49, 1743-1752 (2003)

2003

[44] [44]

Deep distributed convolutional neural networks: Universality

Zhou, D. Deep distributed convolutional neural networks: Universality. Analysis and Applications . 16, 895-919 (2018)

2018

[45] [45]

Universality of deep convolutional neural networks

Zhou, D. Universality of deep convolutional neural networks. Applied and Computational Harmonic Analysis . 48, 787-794 (2020)

2020

[46] [46]

& Zhang, W

Zhou, H., Zhang, S., Peng, J., Zhang, S., Li, J., Xiong, H. & Zhang, W. Informer: Beyond efficient transformer for long sequence time-series forecasting. Proceedings of The AAAI Conference on Artificial Intelligence . 35, 11106-11115 (2021)

2021

[47] [47]

& Huo, X

Zhou, T., Suh, N., Cheng, G. & Huo, X. Approximation of RKHS functionals by neural networks. ArXiv Preprint ArXiv:2403.12187 . (2024)

work page arXiv 2024