
arxiv: 2605.22223 · v1 · pith:BB4AVKHEnew · submitted 2026-05-21 · 💻 cs.LG

How Many Different Outputs Can a Transformer Generate?

Pith reviewed 2026-05-22 07:03 UTC · model grok-4.3

classification 💻 cs.LG
keywords: transformer, sequence generation, output diversity, accessible sequences, prompt length, linear growth, exponential decay, copying tasks

The pith

A transformer's longest accessible output sequence grows only linearly with prompt length.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that basic architectural traits of transformers sharply restrict the distinct sequences they can generate. The longest reachable sequence for any prompt scales linearly with prompt length. Beyond a threshold the share of sequences that remain reachable falls exponentially. These restrictions persist even with unlimited context length and computation time and account for observed failures on copying and related tasks.

Core claim

We prove that the maximal length of accessible sequences grows linearly with the prompt length, that beyond a critical threshold the proportion of accessible sequences decays exponentially with sequence length, and that the linear coefficient relating prompt length to accessible sequence length admits a theoretical upper bound. These results hold even with unbounded context and computation time and are obtained from a handful of architectural characteristics.
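To make the shape of these claims concrete, here is a toy counting sketch. It is not the paper's proof, and every quantity in it is an assumption made for illustration: it supposes that a prompt of length m can put the model into at most K^m distinguishable states for some hypothetical constant K, while there are V^n candidate sequences of length n over a vocabulary of size V. Under that assumption the accessible fraction is at most min(1, K^m / V^n), so the critical length n* = m·ln K / ln V is linear in m, and past n* the bound shrinks by a factor of V per extra token.

```python
import math

# Hypothetical constants, chosen only for illustration (not from the paper).
VOCAB_SIZE = 50_000
# ln K: assume each prompt token "pays for" about 3 output tokens.
LOG_STATES_PER_TOKEN = 3 * math.log(VOCAB_SIZE)


def critical_length(m: int, log_states_per_token: float, vocab_size: int) -> float:
    """Length n* at which V**n overtakes the assumed K**m prompt states."""
    return m * log_states_per_token / math.log(vocab_size)


def accessible_fraction_bound(
    n: int, m: int, log_states_per_token: float, vocab_size: int
) -> float:
    """Toy upper bound min(1, K**m / V**n), evaluated in log space."""
    log_ratio = m * log_states_per_token - n * math.log(vocab_size)
    return 1.0 if log_ratio >= 0 else math.exp(log_ratio)


if __name__ == "__main__":
    for m in (10, 20, 40):
        n_star = critical_length(m, LOG_STATES_PER_TOKEN, VOCAB_SIZE)
        just_past = int(round(n_star)) + 1
        bound = accessible_fraction_bound(
            just_past, m, LOG_STATES_PER_TOKEN, VOCAB_SIZE
        )
        print(f"m={m:3d}  n*={n_star:6.1f}  bound at n={just_past}: {bound:.2e}")
```

In this toy model, doubling m doubles n*, and each token beyond n* multiplies the bound by 1/V. The paper's actual coefficient and threshold come from its own architectural analysis, which this sketch does not reproduce.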

What carries the argument

The central object is the set of accessible sequences: those a transformer can output for some prompt. A small set of architectural traits bounds both their maximal length and their proportion, which yields the linear-growth and exponential-decay results.

If this is right

  • The number of distinct outputs is bounded above by a quantity that depends on prompt length, and the bound is observed to be tight within a factor of less than 10.
  • Transformers will systematically fail at tasks such as exact copying or cramming of sequences that exceed the linear length limit.
  • Increasing context size or computation time alone cannot remove the linear growth restriction on accessible sequence length.
  • The linear coefficient itself is capped by a theoretical expression derived solely from architecture features.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Architectural modifications that alter the handful of traits used in the bound may be necessary to expand output diversity beyond current limits.
  • The same style of analysis could be applied to compare generative capacity across different autoregressive architectures.
  • Direct measurement of the gap between the theoretical bound and observed output variety on concrete tasks would quantify how close real models come to the limit.

Load-bearing premise

The upper bound and linear growth can be derived from only a handful of characteristics of the transformer's architecture without needing the full model specification or training details.

What would settle it

A concrete counter-example would be a transformer that, for a given prompt length, produces an output sequence longer than the derived linear upper bound or generates more distinct sequences than the proved bound allows.

Figures

Figures reproduced from arXiv: 2605.22223 by Caroline Chaux, Mario Michelessa, Maxime Meyer, Vincent Y. F. Tan.

Figure 1: Plane cut of the embedding space $E$ of Qwen-2 (0.5B) (Yang et al., 2024), passing through the final token embeddings of “The quick brown fox”, “https://”, and “In a distant future” (red markers). Colors encode, for each pixel, the most probable next token via the model’s output projection (decoder readout). This induces regions $E_t$ annotated by <t>. Only tokens whose regions have the largest areas are anno…
Figure 2: Mean accessibility for a) PG19 and b) random target sequences of length $n$ as a function of the number of trainable memory vectors $m$. For each $m$, we fit a sigmoid (solid) and mark $n_{50}$ where the fit crosses 0.5 (vertical dashed line). (c) $n_{50}(m)$ for PG19 (blue) and random (red), with linear fits (dashed).
Concretely, we minimize the cross-entropy
$$\mathcal{L}(Y; x_{1:n}) = -\sum_{i=1}^{n} \log p_\tau\left(x_i \mid [Y, x_{1:i-1}]\right), \tag{1}$$
After op…
Figure 3: Conceptual example of using the cell-volume distribution to tighten the upper bound. Rather than assuming equal-volume cells (Dirac mass), we take the $n$-fold convolution of the empirical one-step volume distribution $D^{1}$ (light violet) and track when the median of $D^{n}$ (violet dashed) drops below the threshold (black dashed).
empirical distribution $D$ of the cell volumes of every token $t$ in terms of proportio…
Figure 4: Models are trained to copy strings up to a maximum length (grey dashed) and evaluated on generating the exact copy of longer strings. We report exact-match copying accuracy versus string length. For each model, we fit a sigmoid to the accuracy curve (continuous lines) and report the corresponding $R^2$.
Remark 5.2. The assumption of uniform precision ε may be too restrictive. We therefore provide a more ref…
Figure 5: Maximal radius $R$ of the internal representations of the transformer for various input lengths.
Definition B.2 (Mean-Field Self-Attention (Castin et al., 2024)). The mean-field generalization $F$ of any self-attention layer, parameterized by projection matrices $W_q$, $W_k$, $W_v$, and $W_o$ as defined in Definition A.1, is defined by $F : \mu \in \mathcal{P}_c(\mathbb{R}^d) \mapsto (\Gamma_\mu)_\sharp \mu \in \mathcal{P}_c(\mathbb{R}^d)$, where $\Gamma_\mu(x) := \sum_{i=1}^{h} \int W_o^i W_v^i y \cdot \exp((W_k^i y)^\top$ …
Figure 6: Decomposition of the cone in $E_\delta$ and $F_\delta$.
since $C_\delta$ is a cone (i.e. $y \in C_\delta \iff ry \in C_\delta$ for $r > 0$). Thus it suffices to compute $|C_\delta \cap B_d(0, 1)|$. Step 2: Disjoint Decomposition. Write, as shown in …
Figure 7: Mean accessibility for a) PG19 and b) random target sequences of length $n$ as a function of the number of trainable memory vectors $m$. For each $m$, we fit a sigmoid (solid) and mark $n_{50}$ where the fit crosses 0.5 (vertical dashed line). (c) $n_{50}(m)$ for PG19 (blue) and random (red), with linear fits (dashed). Results shown for the Pythia model suite at different scales A) 160M, B) 410M, C) 1B.
Figure 8: Mean accessibility for a) PG19 and b) random target sequences of length $n$ as a function of the number of trainable memory vectors $m$. For each $m$, we fit a sigmoid (solid) and mark $n_{50}$ where the fit crosses 0.5 (vertical dashed line). (c) $n_{50}(m)$ for PG19 (blue) and random (red), with linear fits (dashed). Results shown for three architectures A) Qwen-2.5, B) Gemma-3, C) Llama-3.2.
Figure 9: Upper bound on $C$ for different support geometries (Ball in blue, Cone in green, Ellipsoid in orange), estimated using 10K randomly sampled input strings of maximum length ℓ. Sampling prompts longer than ℓ ≈ 500 suffices to estimate the upper bound. Pythia models of different sizes: a) 160M, b) 410M, c) 1B; other model architectures: d–e) Qwen (0.5B, 1.5B), f) Llama 1B, g) Gemma 270M.
Figure 10: Relative volumes of decoder cells across tokens (log-log). For each model, tokens are sorted by estimated cell volume (largest on the left). A small set of tokens (typically $< 10^{2}$) accounts for a large fraction of the support volume, while most tokens ($10^{4}$ to $10^{5}$) have tiny individual volumes (often $10^{-6}$ to $10^{-8}$ of the support).
Figure 11: Distribution of the volume proportions of $n$-sequences, estimated via the $n$-fold convolution of the empirical one-step volume distribution $D$, for different models: a) Pythia-160M, b) Pythia-410M, c) Pythia-1B, d) Qwen2.5 0.5B, e) Qwen2.5 1.5B, f) Llama3.2 1B, g) Gemma 270M.
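The captions of Figures 2, 7, and 8 describe fitting a sigmoid to mean accessibility, reading off the length $n_{50}$ where the fit crosses 0.5, and then fitting a line to $n_{50}(m)$. The sketch below illustrates that procedure on synthetic data only. The sigmoid parameterization, the grid-search fit, the function names, and the generated values are all assumptions made for illustration; the source does not say which optimizer or functional form the authors used.

```python
import math


def sigmoid_decay(n, n50, scale):
    """Decreasing sigmoid that equals 0.5 at n = n50 (assumed functional form)."""
    z = (n - n50) / scale
    z = max(-60.0, min(60.0, z))  # clamp so math.exp cannot overflow
    return 1.0 / (1.0 + math.exp(z))


def fit_n50(lengths, accessibility, scales=(0.5, 1, 2, 4, 8, 16), steps=400):
    """Grid search for (n50, scale) minimizing squared error to the data."""
    lo, hi = min(lengths), max(lengths)
    best = None
    for k in range(steps + 1):
        n50 = lo + (hi - lo) * k / steps
        for scale in scales:
            err = sum(
                (sigmoid_decay(n, n50, scale) - a) ** 2
                for n, a in zip(lengths, accessibility)
            )
            if best is None or err < best[0]:
                best = (err, n50, scale)
    return best[1], best[2]


def linear_fit(xs, ys):
    """Ordinary least-squares slope and intercept of ys against xs."""
    count = len(xs)
    mean_x, mean_y = sum(xs) / count, sum(ys) / count
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


if __name__ == "__main__":
    # Synthetic accessibility curves: pretend n50 = 25 * m (made-up numbers).
    memory_sizes = [1, 2, 4, 8]
    lengths = list(range(0, 400, 5))
    fitted = []
    for m in memory_sizes:
        curve = [sigmoid_decay(n, 25.0 * m, 4.0) for n in lengths]
        n50, _ = fit_n50(lengths, curve)
        fitted.append(n50)
    slope, intercept = linear_fit(memory_sizes, fitted)
    print(f"n50 per m: {fitted}")
    print(f"linear fit: n50 ~ {slope:.2f} * m + {intercept:.2f}")
```

With measured curves in place of the synthetic ones, the slope of the linear fit plays the role of the empirical coefficient relating prompt (memory) length to accessible sequence length. A gradient-based curve fitter would be a reasonable substitute for the grid search.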
Original abstract

We study how we can leverage only a handful of characteristics of a transformer's architecture to closely predict the number of different sequences it can output, both qualitatively and quantitatively. We provide an upper bound depending on the length of the prompt, which we show empirically to be tight up to a factor less than 10, across architectures and model sizes. Our analysis also provides a theoretical explanation for previously observed empirical failures of transformers on simple sequence tasks, such as copying and cramming. Formally, we prove that (i) the maximal length of accessible sequences (those that the transformer can output for some prompt) grows linearly with the prompt length, (ii) beyond a critical threshold, the proportion of accessible sequences decays exponentially with sequence length, and (iii) the linear coefficient relating prompt length to accessible sequence length admits a theoretical upper bound. Notably, these results hold even with unbounded context and computation time.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript studies the number of distinct output sequences generatable by a transformer, deriving results from a small set of architectural characteristics (independent of specific weights). It proves that the maximal length of accessible sequences grows linearly with prompt length, that the proportion of accessible sequences decays exponentially beyond a critical threshold, and that the linear coefficient admits a theoretical upper bound. These hold for arbitrary weights and even with unbounded context and computation time. The upper bound is claimed to be empirically tight within a factor of less than 10 across architectures and model sizes, with implications for explaining failures on tasks like copying and cramming.

Significance. If the derivations are rigorous, the work provides a useful abstraction for bounding transformer generative capacity without full model specification, offering a theoretical account for empirical limitations on sequence tasks. The parameter-free nature of the bounds (from architecture characteristics only) and the empirical tightness are notable strengths that could influence analysis of model scaling and task feasibility. The results appear internally consistent with the stated assumptions, though verification of the proofs would be needed to confirm broader impact.

major comments (2)
  1. [Abstract] Abstract: The claim of empirical tightness 'up to a factor less than 10' is load-bearing for the quantitative contribution, yet the manuscript provides no details on the enumeration procedure for accessible sequences, the specific models and sizes tested, data exclusion criteria, or error bars; this gap prevents assessment of whether the factor holds under the abstraction to a handful of characteristics.
  2. [Theoretical analysis section] The proof of linear growth (claim i) and the upper bound on the coefficient (claim iii) are stated to depend only on a handful of architecture characteristics without needing full model specification. However, the manuscript does not explicitly list these characteristics or demonstrate their sufficiency in the derivation steps, which is central to the independence from weights and training details.
minor comments (2)
  1. [Introduction] The definition of 'accessible sequences' is used throughout but would benefit from an early formal definition or notation (e.g., in the introduction) to aid readability for readers unfamiliar with the concept.
  2. [Empirical results] Figure captions or legends for any empirical plots should explicitly state the architectures, prompt lengths, and sequence lengths used to demonstrate the factor-of-10 tightness.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments, which help clarify the presentation of our results on transformer output capacity. We address each major comment below and will revise the manuscript accordingly to improve rigor and transparency.

Point-by-point responses
  1. Referee: [Abstract] Abstract: The claim of empirical tightness 'up to a factor less than 10' is load-bearing for the quantitative contribution, yet the manuscript provides no details on the enumeration procedure for accessible sequences, the specific models and sizes tested, data exclusion criteria, or error bars; this gap prevents assessment of whether the factor holds under the abstraction to a handful of characteristics.

    Authors: We agree that the empirical evaluation requires more detail to support the claimed tightness. In the revised version, we will expand the experimental section with a new subsection that fully specifies the enumeration algorithm for accessible sequences (including how we enumerate over possible prompts and outputs under the architectural abstraction), lists all tested models and sizes (e.g., Pythia 160M/410M/1B, Qwen2.5 0.5B/1.5B, Llama3.2 1B, and Gemma 270M, as in the figures), describes any data exclusion or filtering criteria, and includes error bars or standard deviations computed over multiple random seeds and prompt distributions. This addition will allow independent verification of the factor-of-less-than-10 tightness. revision: yes

  2. Referee: [Theoretical analysis section] The proof of linear growth (claim i) and the upper bound on the coefficient (claim iii) are stated to depend only on a handful of architecture characteristics without needing full model specification. However, the manuscript does not explicitly list these characteristics or demonstrate their sufficiency in the derivation steps, which is central to the independence from weights and training details.

    Authors: We concur that explicit listing and justification of the architectural characteristics would strengthen the independence claim. We will revise the theoretical analysis section to begin with a clearly enumerated list of the relevant characteristics (autoregressive token-by-token generation, fixed embedding dimension, multi-head attention with a fixed number of heads, position-independent feed-forward layers, and the absence of any external memory beyond the prompt). We will then insert a dedicated lemma and proof sketch showing, step by step, how each of these characteristics is used (and why no others are needed) to establish both the linear growth of maximal accessible length and the upper bound on the coefficient, without invoking specific weight values or training dynamics. revision: yes

Circularity Check

0 steps flagged

No significant circularity in theoretical derivation

full rationale

The paper's central results are mathematical proofs deriving linear growth of maximal accessible sequence length, exponential decay of their proportion, and an upper bound on the linear coefficient strictly from a small set of transformer architectural characteristics (independent of weights or training). These hold for arbitrary weights and unbounded context, with no reduction to fitted parameters, self-definitions, or self-citation chains. The empirical tightness (factor <10) is presented as validation rather than a load-bearing input, and the derivation remains self-contained against external architectural properties without importing uniqueness or ansatzes via prior self-work.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Based on abstract only; the analysis assumes standard transformer properties such as fixed attention-based architecture suffice to bound outputs, with no explicit free parameters or new entities introduced.

axioms (1)
  • domain assumption: a small number of fixed architectural characteristics (e.g., attention and layer structure) is sufficient to derive tight bounds on output diversity.
    The paper states that it leverages only a handful of such characteristics to predict the number of outputs.




Reference graph

Works this paper leans on

100 extracted references · 100 canonical work pages · 3 internal anchors

  1. [1]

    Volumes of Generalized Unit Balls , urldate =

    Xianfu Wang , journal =. Volumes of Generalized Unit Balls , urldate =

  2. [2]

    and Martin, Jeremy L

    Ellis, Robert B. and Martin, Jeremy L. and Yan, Catherine , title =. Algorithmica , month = apr, pages =. 2007 , issue_date =. doi:10.1007/s00453-006-0172-y , abstract =

  3. [3]

    ArXiv , year=

    Repeat After Me: Transformers are Better than State Space Models at Copying , author=. ArXiv , year=

  4. [4]

    Nonlinear approximation via compositions , volume=

    Shen, Zuowei and Yang, Haizhao and Zhang, Shijun , year=. Nonlinear approximation via compositions , volume=. doi:10.1016/j.neunet.2019.07.011 , journal=

  5. [5]

    Proceedings of the 36th International Conference on Neural Information Processing Systems , articleno =

    Kojima, Takeshi and Gu, Shixiang Shane and Reid, Machel and Matsuo, Yutaka and Iwasawa, Yusuke , title =. Proceedings of the 36th International Conference on Neural Information Processing Systems , articleno =. 2022 , isbn =

  6. [6]

    The Thirteenth International Conference on Learning Representations , year=

    Transformers are Universal In-context Learners , author=. The Thirteenth International Conference on Learning Representations , year=

  7. [7]

    and Le, Quoc V

    Wei, Jason and Wang, Xuezhi and Schuurmans, Dale and Bosma, Maarten and Ichter, Brian and Xia, Fei and Chi, Ed H. and Le, Quoc V. and Zhou, Denny , title =. Proceedings of the 36th International Conference on Neural Information Processing Systems , articleno =. 2022 , isbn =

  8. [8]

    Journal of Computational Mathematics , year =

    Montanelli, Hadrien and Yang, Haizhao and Qiang, Du , title =. Journal of Computational Mathematics , year =. doi:https://doi.org/10.4208/jcm.2007-m2019-0239 , url =

  9. [9]

    2025 , eprint=

    Memory Limitations of Prompt Tuning in Transformers , author=. 2025 , eprint=

  10. [10]

    2025 , eprint=

    Gemma 3 Technical Report , author=. 2025 , eprint=

  11. [11]

    2024 , eprint=

    The Llama 3 Herd of Models , author=. 2024 , eprint=

  12. [12]

    P -Tuning: Prompt Tuning Can Be Comparable to Fine-tuning Across Scales and Tasks

    Liu, Xiao and Ji, Kaixuan and Fu, Yicheng and Tam, Weng and Du, Zhengxiao and Yang, Zhilin and Tang, Jie. P -Tuning: Prompt Tuning Can Be Comparable to Fine-tuning Across Scales and Tasks. Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). 2022. doi:10.18653/v1/2022.acl-short.8

  13. [13]

    Statistically Meaningful Approximation: a Case Study on Approximating Turing Machines with Transformers , url =

    Wei, Colin and Chen, Yining and Ma, Tengyu , booktitle =. Statistically Meaningful Approximation: a Case Study on Approximating Turing Machines with Transformers , url =

  14. [14]

    L lama F actory: Unified Efficient Fine-Tuning of 100+ Language Models

    Zheng, Yaowei and Zhang, Richong and Zhang, Junhao and Ye, Yanhan and Luo, Zheyan. L lama F actory: Unified Efficient Fine-Tuning of 100+ Language Models. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations). 2024. doi:10.18653/v1/2024.acl-demos.38

  15. [15]

    2024 , url=

    Cheng-Ping Hsieh and Simeng Sun and Samuel Kriman and Shantanu Acharya and Dima Rekesh and Fei Jia and Boris Ginsburg , booktitle=. 2024 , url=

  16. [16]

    Long-Context

    Bowen Jin and Jinsung Yoon and Jiawei Han and Sercan O Arik , booktitle=. Long-Context. 2025 , url=

  17. [17]

    Transformers: State-of-the-Art Natural Language Processing

    Wolf, Thomas and Debut, Lysandre and Sanh, Victor and others. Transformers: State-of-the-Art Natural Language Processing. Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations. 2020. doi:10.18653/v1/2020.emnlp-demos.6

  18. [18]

    PyTorch: An Imperative Style, High-Performance Deep Learning Library , url =

    Paszke, Adam and Gross, Sam and Massa, Francisco and Lerer, Adam and Bradbury, James and Chanan, Gregory and Killeen, Trevor and Lin, Zeming and Gimelshein, Natalia and Antiga, Luca and Desmaison, Alban and Kopf, Andreas and Yang, Edward and DeVito, Zachary and Raison, Martin and Tejani, Alykhan and Chilamkurthy, Sasank and Steiner, Benoit and Fang, Lu an...

  19. [19]

    Proceedings of the 40th International Conference on Machine Learning , pages =

    Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling , author =. Proceedings of the 40th International Conference on Machine Learning , pages =. 2023 , editor =

  20. [20]

    I so S core: Measuring the Uniformity of Embedding Space Utilization

    Rudman, William and Gillman, Nate and Rayne, Taylor and Eickhoff, Carsten. I so S core: Measuring the Uniformity of Embedding Space Utilization. Findings of the Association for Computational Linguistics: ACL 2022. 2022. doi:10.18653/v1/2022.findings-acl.262

  21. [21]

    Attention is All you Need , url =

    Vaswani, Ashish and Shazeer, Noam and Parmar, Niki and Uszkoreit, Jakob and Jones, Llion and Gomez, Aidan N and Kaiser, ukasz and Polosukhin, Illia , booktitle =. Attention is All you Need , url =

  22. [22]

    Summary of a Haystack: A Challenge to Long-Context LLM s and RAG Systems

    Laban, Philippe and Fabbri, Alexander and Xiong, Caiming and Wu, Chien-Sheng. Summary of a Haystack: A Challenge to Long-Context LLM s and RAG Systems. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing. 2024. doi:10.18653/v1/2024.emnlp-main.552

  23. [23]

    Thirty-seventh Conference on Neural Information Processing Systems , year=

    Hard Prompts Made Easy: Gradient-Based Discrete Optimization for Prompt Tuning and Discovery , author=. Thirty-seventh Conference on Neural Information Processing Systems , year=

  24. [24]

    The Twelfth International Conference on Learning Representations , year=

    Memorization Capacity of Multi-Head Attention in Transformers , author=. The Twelfth International Conference on Learning Representations , year=

  25. [25]

    The Eleventh International Conference on Learning Representations , year=

    Provable Memorization Capacity of Transformers , author=. The Eleventh International Conference on Learning Representations , year=

  26. [26]

    In: Gurevych, I., Miyao, Y

    Howard, Jeremy and Ruder, Sebastian. Universal Language Model Fine-tuning for Text Classification. Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2018. doi:10.18653/v1/P18-1031

  27. [27]

    Same Task, More Tokens: the Impact of Input Length on the Reasoning Performance of Large Language Models

    Levy, Mosh and Jacoby, Alon and Goldberg, Yoav. Same Task, More Tokens: the Impact of Input Length on the Reasoning Performance of Large Language Models. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2024. doi:10.18653/v1/2024.acl-long.818

  28. [28]

    The Thirty-eighth Annual Conference on Neural Information Processing Systems , year=

    Transformers need glasses! Information over-squashing in language tasks , author=. The Thirty-eighth Annual Conference on Neural Information Processing Systems , year=

  29. [29]

    BERT : Pre-training of Deep Bidirectional Transformers for Language Understanding

    Devlin, Jacob and Chang, Ming-Wei and Lee, Kenton and Toutanova, Kristina. BERT : Pre-training of Deep Bidirectional Transformers for Language Understanding. Proceedings of the 2019 Conference of the North A merican Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). 2019. doi:10.18653/v...

  30. [30]

    Attention is Not Only a Weight: Analyzing Transformers with Vector Norms

    Kobayashi, Goro and Kuribayashi, Tatsuki and Yokoi, Sho and Inui, Kentaro. Attention is Not Only a Weight: Analyzing Transformers with Vector Norms. Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). 2020. doi:10.18653/v1/2020.emnlp-main.574

  31. [31]

    The Thirteenth International Conference on Learning Representations , year=

    On the Optimal Memorization Capacity of Transformers , author=. The Thirteenth International Conference on Learning Representations , year=

  32. [32]

    The Twelfth International Conference on Learning Representations , year=

    Are Transformers with One Layer Self-Attention Using Low-Rank Weight Matrices Universal Approximators? , author=. The Twelfth International Conference on Learning Representations , year=

  33. [33]

    L oo GLE : Can Long-Context Language Models Understand Long Contexts?

    Li, Jiaqi and Wang, Mengmeng and Zheng, Zilong and Zhang, Muhan. L oo GLE : Can Long-Context Language Models Understand Long Contexts?. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2024. doi:10.18653/v1/2024.acl-long.859

  34. [34]

    Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang

    Liu, Nelson F. and Lin, Kevin and Hewitt, John and Paranjape, Ashwin and Bevilacqua, Michele and Petroni, Fabio and Liang, Percy. Lost in the Middle: How Language Models Use Long Contexts. Transactions of the Association for Computational Linguistics. 2024. doi:10.1162/tacl_a_00638

  35. [35]

    Knee-Deep in C-

    Andy Yang and Micha. Knee-Deep in C-. The Thirty-ninth Annual Conference on Neural Information Processing Systems , year=

  36. [36]

    MaPLe: Multi-modal Prompt Learning , year=

    Khattak, Muhammad Uzair and Rasheed, Hanoona and Maaz, Muhammad and Khan, Salman and Khan, Fahad Shahbaz , booktitle=. MaPLe: Multi-modal Prompt Learning , year=

  37. [37]

    The Eleventh International Conference on Learning Representations , year=

    Multitask Prompt Tuning Enables Parameter-Efficient Transfer Learning , author=. The Eleventh International Conference on Learning Representations , year=

  38. [38]

    2023 , url=

    Guangyi Chen and Weiran Yao and Xiangchen Song and Xinyue Li and Yongming Rao and Kun Zhang , booktitle=. 2023 , url=

  39. [39]

    Proceedings of The 34th International Conference on Algorithmic Learning Theory , pages =

    On The Computational Complexity of Self-Attention , author =. Proceedings of The 34th International Conference on Algorithmic Learning Theory , pages =. 2023 , editor =

  40. [40]

    The Twelfth International Conference on Learning Representations , year=

    Nemesis: Normalizing the Soft-prompt Vectors of Vision-Language Models , author=. The Twelfth International Conference on Learning Representations , year=

  41. [41]

    A Survey on In-context Learning

    Dong, Qingxiu and Li, Lei and Dai, Damai and Zheng, Ce and Ma, Jingyuan and Li, Rui and Xia, Heming and Xu, Jingjing and Wu, Zhiyong and Chang, Baobao and Sun, Xu and Li, Lei and Sui, Zhifang. A Survey on In-context Learning. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing. 2024. doi:10.18653/v1/2024.emnlp-main.64

  42. [42]

    Zhengxiang Shi and Aldo Lipani , booktitle=. De. 2024 , url=

  43. [43]

    The Twelfth International Conference on Learning Representations , year=

    Protein Multimer Structure Prediction via Prompt Learning , author=. The Twelfth International Conference on Learning Representations , year=

  44. [44]

    Proceedings of the 40th International Conference on Machine Learning , pages =

    Looped Transformers as Programmable Computers , author =. Proceedings of the 40th International Conference on Machine Learning , pages =. 2023 , editor =

  45. [45]

    Advances in Neural Information Processing Systems , volume=

    Your transformer may not be as powerful as you expect , author=. Advances in Neural Information Processing Systems , volume=

  46. [46]

    ICML 2022 , year =

    Edelman, Benjamin and Goel, Surbhi and Kakade, Sham and Zhang, Cyril , title =. ICML 2022 , year =

  47. [47]

    Chowdhery, Aakanksha and Narang, Sharan and Devlin, Jacob and Bosma, Maarten and Mishra, Gaurav and Roberts, Adam and Barham, Paul and Chung, Hyung Won and Sutton, Charles and Gehrmann, Sebastian and Schuh, Parker and Shi, Kensen and Tsvyashchenko, Sashank and Maynez, Joshua and Rao, Abhishek and Barnes, Parker and Tay, Yi and Shazeer, Noam and Prabhakara...

  48. [48]

    The Twelfth International Conference on Learning Representations , year=

    The Expressive Power of Transformers with Chain of Thought , author=. The Twelfth International Conference on Learning Representations , year=

  49. [49]

    The Thirty-eighth Annual Conference on Neural Information Processing Systems , year=

    Approximation Rate of the Transformer Architecture for Sequence Modeling , author=. The Thirty-eighth Annual Conference on Neural Information Processing Systems , year=

  50. [50]

    O(n) connections are expressive enough: Universal approximability of sparse transformers

    Chulhee Yun and Chang, \ Yin Wen\ and Srinadh Bhojanapalli and Rawat, \ Ankit Singh\ and Reddi, \ Sashank J.\ and Sanjiv Kumar. O(n) connections are expressive enough: Universal approximability of sparse transformers. Advances in Neural Information Processing Systems. 2020

  51. [51]

    On the Expressivity Role of L ayer N orm in Transformers' Attention

    Brody, Shaked and Alon, Uri and Yahav, Eran. On the Expressivity Role of L ayer N orm in Transformers' Attention. Findings of the Association for Computational Linguistics: ACL 2023. 2023. doi:10.18653/v1/2023.findings-acl.895

  52. [52]

    Language Models are Few-Shot Learners , url =

    Brown, Tom and Mann, Benjamin and Ryder, Nick and Subbiah, Melanie and Kaplan, Jared D and Dhariwal, Prafulla and Neelakantan, Arvind and Shyam, Pranav and Sastry, Girish and Askell, Amanda and Agarwal, Sandhini and Herbert-Voss, Ariel and Krueger, Gretchen and Henighan, Tom and Child, Rewon and Ramesh, Aditya and Ziegler, Daniel and Wu, Jeffrey and Winte...

  53. [53]

    2024 , eprint=

    An Empirical Study of Mamba-based Language Models , author=. 2024 , eprint=

  54. [54]

    Cramming 1568 Tokens into a Single Vector and Back Again: Exploring the Limits of Embedding Space Capacity

    Kuratov, Yuri and Arkhipov, Mikhail and Bulatov, Aydar and Burtsev, Mikhail. Cramming 1568 Tokens into a Single Vector and Back Again: Exploring the Limits of Embedding Space Capacity. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2025. doi:10.18653/v1/2025.acl-long.948

  55. [55]

    From 128K to 4M: Efficient Training of Ultra-Long Context Large Language Models

    From 128K to 4M: Efficient Training of Ultra-Long Context Large Language Models. arXiv preprint arXiv:2504.06214.

  56. [56]

    Extending context window in large language models with segmented base adjustment for rotary position embeddings

    Extending context window in large language models with segmented base adjustment for rotary position embeddings. Applied Sciences. 2024.

  57. [57]

    Oymak, Samet and Rawat, Ankit Singh and Soltanolkotabi, Mahdi and Thrampoulidis, Christos. Proceedings of the 40th International Conference on Machine Learning. 2023.

  58. [58]

    Attention is not all you need: Pure attention loses rank doubly exponentially with depth

    Attention is not all you need: Pure attention loses rank doubly exponentially with depth. International Conference on Machine Learning. 2021.

  59. [59]

    When Do Prompting and Prefix-Tuning Work? A Theory of Capabilities and Limitations

    When Do Prompting and Prefix-Tuning Work? A Theory of Capabilities and Limitations. The Twelfth International Conference on Learning Representations.

  60. [60]

    Ding, Yiran and Zhang, Li Lyna and Zhang, Chengruidong and Xu, Yuanyuan and Shang, Ning and Xu, Jiahang and Yang, Fan and Yang, Mao. Proceedings of the 41st International Conference on Machine Learning. 2024.

  61. [61]

    Scaling vision transformers

    Scaling vision transformers. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.

  62. [62]

    Zhao, Hengshuang and Jiang, Li and Jia, Jiaya and Torr, Philip H.S. and Koltun, Vladlen. Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). 2021.

  63. [63]

    An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

    An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. International Conference on Learning Representations.

  64. [64]

    The Power of Scale for Parameter-Efficient Prompt Tuning

    Lester, Brian and Al-Rfou, Rami and Constant, Noah. The Power of Scale for Parameter-Efficient Prompt Tuning. Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. 2021. doi:10.18653/v1/2021.emnlp-main.243

  65. [65]

    Xiang Lisa Li and Percy Liang. Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). 2021. doi:10.18653/v1/2021.acl-long.353

  66. [66]

    High-Dimensional Probability: An Introduction with Applications in Data Science

    Vershynin, Roman. High-Dimensional Probability: An Introduction with Applications in Data Science.

  67. [67]

    How Smooth Is Attention?

    Valérie Castin and Pierre Ablin and Gabriel Peyré. How Smooth Is Attention?

  68. [68]

    Invertible Residual Networks

    Invertible Residual Networks. Proceedings of the 36th International Conference on Machine Learning. 2019.

  69. [69]

    A Theoretical Framework for Prompt Engineering: Approximating Smooth Functions with Transformer Prompts

    A Theoretical Framework for Prompt Engineering: Approximating Smooth Functions with Transformer Prompts. 2025.

  70. [70]

    Peter L. Bartlett and Nick Harvey and Christopher Liaw and Abbas Mehrabian. J. Mach. Learn. Res.

  71. [71]

    Universality and Limitations of Prompt Tuning

    Wang, Yihan and Chauhan, Jatin and Wang, Wei and Hsieh, Cho-Jui. Universality and Limitations of Prompt Tuning.

  72. [72]

    Fundamental Limits of Prompt Tuning Transformers: Universality, Capacity and Efficiency

    Fundamental Limits of Prompt Tuning Transformers: Universality, Capacity and Efficiency. The Thirteenth International Conference on Learning Representations.

  73. [73]

    Prompting a Pretrained Transformer Can Be a Universal Approximator

    Prompting a Pretrained Transformer Can Be a Universal Approximator. ICLR 2024 Workshop on Mathematical and Empirical Understanding of Foundation Models.

  74. [74]

    Are Transformers universal approximators of sequence-to-sequence functions?

    Are Transformers universal approximators of sequence-to-sequence functions? International Conference on Learning Representations.

  75. [75]

    The emergence of clusters in self-attention dynamics

    Geshkovski, Borjan and Letrouit, Cyril and Polyanskiy, Yury and Rigollet, Philippe. The emergence of clusters in self-attention dynamics.

  76. [76]

    Optimal transport for applied mathematicians

    Optimal transport for applied mathematicians. 2015.

  77. [77]

    Universal Approximation Under Constraints is Possible with Transformers

    Universal Approximation Under Constraints is Possible with Transformers. International Conference on Learning Representations.

  78. [78]

    GPT-4 Technical Report

    GPT-4 Technical Report. 2024.

  79. [79]

    Generating Wikipedia by Summarizing Long Sequences

    Generating Wikipedia by Summarizing Long Sequences. International Conference on Learning Representations.

  80. [80]

    Sinkformers: Transformers with Doubly Stochastic Attention

    Sinkformers: Transformers with Doubly Stochastic Attention. Proceedings of The 25th International Conference on Artificial Intelligence and Statistics. 2022.

Showing first 80 references.