How Many Different Outputs Can a Transformer Generate?
Pith reviewed 2026-05-22 07:03 UTC · model grok-4.3
The pith
A transformer's longest accessible output sequence grows only linearly with prompt length.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We prove that the maximal length of accessible sequences grows linearly with the prompt length, that beyond a critical threshold the proportion of accessible sequences decays exponentially with sequence length, and that the linear coefficient relating prompt length to accessible sequence length admits a theoretical upper bound. These results hold even with unbounded context and computation time and are obtained from a handful of architectural characteristics.
What carries the argument
The central object is the set of accessible sequences, defined as those a transformer can output for some prompt. Their maximal length and their proportion among all sequences are bounded using a small set of architectural traits, which yields the linear-growth and exponential-decay results.
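To make the definition concrete, the sketch below enumerates accessible sequences for a toy deterministic generator. Everything in it is an assumption made for illustration: the two-token vocabulary, the XOR rule `toy_next_token` (a stand-in, not a transformer), and the helper names. It only demonstrates the elementary counting fact that, with discrete prompts and deterministic decoding, the number of accessible sequences cannot exceed the number of prompts. It is not the paper's proof technique.

```python
from itertools import product

VOCAB = (0, 1)  # hypothetical two-token vocabulary


def toy_next_token(context):
    """Hypothetical deterministic next-token rule (a stand-in, not a transformer)."""
    if len(context) < 2:
        return context[-1]
    return context[-1] ^ context[-2]  # XOR of the last two tokens


def generate(prompt, out_len):
    """Greedy, deterministic decoding of out_len tokens after the prompt."""
    context = list(prompt)
    for _ in range(out_len):
        context.append(toy_next_token(context))
    return tuple(context[len(prompt):])


def accessible_sequences(prompt_len, out_len):
    """All length-out_len outputs reachable from some prompt of length prompt_len >= 1."""
    return {generate(p, out_len) for p in product(VOCAB, repeat=prompt_len)}


def accessible_fraction(prompt_len, out_len):
    """Share of all |VOCAB|**out_len sequences that are accessible."""
    return len(accessible_sequences(prompt_len, out_len)) / len(VOCAB) ** out_len


if __name__ == "__main__":
    for out_len in range(1, 7):
        print(out_len, accessible_fraction(2, out_len))
```

In this toy setting the fraction is at most `len(VOCAB) ** (prompt_len - out_len)`, so it shrinks geometrically once the output is longer than the prompt. The paper's setting is different (the quoted fragments suggest continuous prompt embeddings under a finite-precision assumption), so the constants here carry no meaning beyond the illustration.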
If this is right
- The total number of distinct outputs is bounded above by a quantity that depends on the prompt length, and the bound is observed empirically to be tight up to a factor of less than 10.
- Transformers will systematically fail at tasks such as exact copying or cramming of sequences that exceed the linear length limit.
- Increasing context size or computation time alone cannot remove the linear growth restriction on accessible sequence length.
- The linear coefficient itself is capped by a theoretical expression derived solely from architecture features.
Where Pith is reading between the lines
- Architectural modifications that alter the handful of traits used in the bound may be necessary to expand output diversity beyond current limits.
- The same style of analysis could be applied to compare generative capacity across different autoregressive architectures.
- Direct measurement of the gap between the theoretical bound and observed output variety on concrete tasks would quantify how close real models come to the limit.
Load-bearing premise
The upper bound and linear growth can be derived from only a handful of characteristics of the transformer's architecture without needing the full model specification or training details.
What would settle it
A concrete counter-example would be a transformer that, for a given prompt length, produces an output sequence longer than the derived linear upper bound or generates more distinct sequences than the proved bound allows.
Original abstract
We study how we can leverage only a handful of characteristics of a transformer's architecture to closely predict the number of different sequences it can output, both qualitatively and quantitatively. We provide an upper bound depending on the length of the prompt, which we show empirically to be tight up to a factor less than 10, across architectures and model sizes. Our analysis also provides a theoretical explanation for previously observed empirical failures of transformers on simple sequence tasks, such as copying and cramming. Formally, we prove that (i) the maximal length of accessible sequences (those that the transformer can output for some prompt) grows linearly with the prompt length, (ii) beyond a critical threshold, the proportion of accessible sequences decays exponentially with sequence length, and (iii) the linear coefficient relating prompt length to accessible sequence length admits a theoretical upper bound. Notably, these results hold even with unbounded context and computation time.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript studies the number of distinct output sequences generatable by a transformer, deriving results from a small set of architectural characteristics (independent of specific weights). It proves that the maximal length of accessible sequences grows linearly with prompt length, that the proportion of accessible sequences decays exponentially beyond a critical threshold, and that the linear coefficient admits a theoretical upper bound. These hold for arbitrary weights and even with unbounded context and computation time. The upper bound is claimed to be empirically tight within a factor of less than 10 across architectures and model sizes, with implications for explaining failures on tasks like copying and cramming.
Significance. If the derivations are rigorous, the work provides a useful abstraction for bounding transformer generative capacity without full model specification, offering a theoretical account for empirical limitations on sequence tasks. The parameter-free nature of the bounds (from architecture characteristics only) and the empirical tightness are notable strengths that could influence analysis of model scaling and task feasibility. The results appear internally consistent with the stated assumptions, though verification of the proofs would be needed to confirm broader impact.
major comments (2)
- [Abstract] Abstract: The claim of empirical tightness 'up to a factor less than 10' is load-bearing for the quantitative contribution, yet the manuscript provides no details on the enumeration procedure for accessible sequences, the specific models and sizes tested, data exclusion criteria, or error bars; this gap prevents assessment of whether the factor holds under the abstraction to a handful of characteristics.
- [Theoretical analysis section] The proof of linear growth (claim i) and the upper bound on the coefficient (claim iii) are stated to depend only on a handful of architecture characteristics without needing full model specification. However, the manuscript does not explicitly list these characteristics or demonstrate their sufficiency in the derivation steps, which is central to the independence from weights and training details.
minor comments (2)
- [Introduction] The definition of 'accessible sequences' is used throughout but would benefit from an early formal definition or notation (e.g., in the introduction) to aid readability for readers unfamiliar with the concept.
- [Empirical results] Figure captions or legends for any empirical plots should explicitly state the architectures, prompt lengths, and sequence lengths used to demonstrate the factor-of-10 tightness.
Simulated Author's Rebuttal
We thank the referee for their constructive comments, which help clarify the presentation of our results on transformer output capacity. We address each major comment below and will revise the manuscript accordingly to improve rigor and transparency.
- Referee: [Abstract] Abstract: The claim of empirical tightness 'up to a factor less than 10' is load-bearing for the quantitative contribution, yet the manuscript provides no details on the enumeration procedure for accessible sequences, the specific models and sizes tested, data exclusion criteria, or error bars; this gap prevents assessment of whether the factor holds under the abstraction to a handful of characteristics.
  Authors: We agree that the empirical evaluation requires more detail to support the claimed tightness. In the revised version, we will expand the experimental section with a new subsection that fully specifies the enumeration algorithm for accessible sequences (including how we enumerate over possible prompts and outputs under the architectural abstraction), lists all tested models and sizes (e.g., GPT-2 small/medium, LLaMA-7B/13B, and others), describes any data exclusion or filtering criteria, and includes error bars or standard deviations computed over multiple random seeds and prompt distributions. This addition will allow independent verification of the factor-of-less-than-10 tightness.
  Revision: yes
- Referee: [Theoretical analysis section] The proof of linear growth (claim i) and the upper bound on the coefficient (claim iii) are stated to depend only on a handful of architecture characteristics without needing full model specification. However, the manuscript does not explicitly list these characteristics or demonstrate their sufficiency in the derivation steps, which is central to the independence from weights and training details.
  Authors: We concur that explicit listing and justification of the architectural characteristics would strengthen the independence claim. We will revise the theoretical analysis section to begin with a clearly enumerated list of the relevant characteristics (autoregressive token-by-token generation, fixed embedding dimension, multi-head attention with a fixed number of heads, position-independent feed-forward layers, and the absence of any external memory beyond the prompt). We will then insert a dedicated lemma and proof sketch showing, step by step, how each of these characteristics is used (and why no others are needed) to establish both the linear growth of maximal accessible length and the upper bound on the coefficient, without invoking specific weight values or training dynamics.
  Revision: yes
Circularity Check
No significant circularity in theoretical derivation
The paper's central results are mathematical proofs deriving linear growth of maximal accessible sequence length, exponential decay of their proportion, and an upper bound on the linear coefficient strictly from a small set of transformer architectural characteristics (independent of weights or training). These hold for arbitrary weights and unbounded context, with no reduction to fitted parameters, self-definitions, or self-citation chains. The empirical tightness (factor <10) is presented as validation rather than a load-bearing input, and the derivation remains self-contained against external architectural properties without importing uniqueness or ansatzes via prior self-work.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: A small number of fixed architectural characteristics (e.g., attention and layer structure) are sufficient to derive tight bounds on output diversity
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
  unclear: Relation between the paper passage and the cited Recognition theorem.
  Theorem 4.5 … upper bounded by P(B^{d×m}(0,r), ∥·∥, ε) … (1+2r/ε)^{dm}
- IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · absolute_floor_iff_bare_distinguishability · unclear
  unclear: Relation between the paper passage and the cited Recognition theorem.
  Assumption 4.3 [Finite Precision] … R^d partitioned into axis-aligned cubes of side ε
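The Theorem 4.5 fragment quoted above bounds a count by (1+2r/ε)^{dm}. A rough sketch of how such a bound could translate into a length limit follows. It rests on assumptions that the fragment does not confirm: that m is the prompt length, d the embedding dimension, r a norm radius, and ε the precision, and that the count in question is the number of distinct outputs. The function names and the vocabulary-size argument are hypothetical. If every one of the V^L sequences of length L were accessible, then V^L ≤ (1+2r/ε)^{dm} would force L ≤ d·m·log(1+2r/ε)/log V, which is linear in m.

```python
import math


def log_covering_bound(d, m, r, eps):
    """Natural log of (1 + 2r/eps)**(d*m), computed without overflow."""
    return d * m * math.log1p(2 * r / eps)


def max_fully_accessible_length(d, m, r, eps, vocab_size):
    """Largest L with vocab_size**L <= (1 + 2r/eps)**(d*m), under the stated assumptions."""
    return math.floor(log_covering_bound(d, m, r, eps) / math.log(vocab_size))


def accessible_fraction_bound(d, m, r, eps, vocab_size, seq_len):
    """Upper bound on the share of length-seq_len sequences that can be accessible."""
    log_ratio = log_covering_bound(d, m, r, eps) - seq_len * math.log(vocab_size)
    return min(1.0, math.exp(log_ratio))


if __name__ == "__main__":
    # Illustrative numbers only; they are not taken from the paper.
    for m in (3, 6, 12):
        print(m, max_fully_accessible_length(d=4, m=m, r=1.0, eps=1.0, vocab_size=2))
```

Under these assumptions the fraction bound stays at 1 up to the threshold length and then shrinks by a factor of the vocabulary size per extra token, which matches the qualitative shape of claims (i) and (ii). The paper's actual coefficient may differ from this back-of-the-envelope version.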
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.