How Many Different Outputs Can a Transformer Generate?

Caroline Chaux; Mario Michelessa; Maxime Meyer; Vincent Y. F. Tan

arxiv: 2605.22223 · v1 · pith:BB4AVKHEnew · submitted 2026-05-21 · 💻 cs.LG

Pith reviewed 2026-05-22 07:03 UTC · model grok-4.3

keywords: transformer, sequence generation, output diversity, accessible sequences, prompt length, linear growth, exponential decay, copying tasks

The pith

A transformer's longest accessible output sequence grows only linearly with prompt length.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that basic architectural traits of transformers sharply restrict the distinct sequences they can generate. The longest reachable sequence for any prompt scales linearly with prompt length. Beyond a threshold the share of sequences that remain reachable falls exponentially. These restrictions persist even with unlimited context length and computation time and account for observed failures on copying and related tasks.

Core claim

We prove that the maximal length of accessible sequences grows linearly with the prompt length, that beyond a critical threshold the proportion of accessible sequences decays exponentially with sequence length, and that the linear coefficient relating prompt length to accessible sequence length admits a theoretical upper bound. These results hold even with unbounded context and computation time and are obtained from a handful of architectural characteristics.

What carries the argument

The argument rests on accessible sequences, defined as those a transformer can output for some prompt. Their length and proportion are bounded using a small set of architectural traits, which yields the linear growth and exponential decay results.

If this is right

The total number of distinct outputs is bounded above by a quantity that depends on the prompt length, and this bound is observed empirically to be tight within a factor of less than ten.
Transformers will systematically fail at tasks such as exact copying or cramming of sequences that exceed the linear length limit.
Increasing context size or computation time alone cannot remove the linear growth restriction on accessible sequence length.
The linear coefficient itself is capped by a theoretical expression derived solely from architecture features.
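A rough counting sketch can make the linear growth plausible. It rests on assumptions that go beyond this summary: the Theorem 4.5 passage quoted in the Lean section of this page mentions a bound of the form (1 + 2r/ε)^(d·m), and the code below reads m as the number of prompt vectors, d as the embedding dimension, r as a radius bound and ε as the precision. It further assumes a vocabulary of size V, so that there are V^n candidate sequences of length n. All of them can be accessible only if V^n ≤ (1 + 2r/ε)^(d·m), that is n ≤ m · d · log(1 + 2r/ε) / log V. The function names and the numeric values are illustrative and are not taken from the paper.

```python
import math


def linear_coefficient(d, r, eps, vocab_size):
    """Slope c in n <= c * m, from vocab_size**n <= (1 + 2r/eps)**(d*m)."""
    return d * math.log1p(2.0 * r / eps) / math.log(vocab_size)


def max_accessible_length(m, d, r, eps, vocab_size):
    """Largest integer n satisfying the counting inequality (computed in log space).

    Floating-point rounding can matter when the inequality holds with equality.
    """
    return math.floor(m * linear_coefficient(d, r, eps, vocab_size))


# Toy values (hypothetical): d = 2, r = 1, eps = 1, binary vocabulary.
# Then (1 + 2r/eps)**(d*m) = 3**(2*m), and for m = 3 we need 2**n <= 729.
toy_length = max_accessible_length(m=3, d=2, r=1.0, eps=1.0, vocab_size=2)
```

Doubling m doubles the bound on n (up to rounding), which is the linear relationship the core claim describes. The real coefficient in the paper may be defined differently.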

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Architectural modifications that alter the handful of traits used in the bound may be necessary to expand output diversity beyond current limits.
The same style of analysis could be applied to compare generative capacity across different autoregressive architectures.
Direct measurement of the gap between the theoretical bound and observed output variety on concrete tasks would quantify how close real models come to the limit.

Load-bearing premise

The upper bound and linear growth can be derived from only a handful of characteristics of the transformer's architecture without needing the full model specification or training details.

What would settle it

A concrete counter-example would be a transformer that, for a given prompt length, produces an output sequence longer than the derived linear upper bound or generates more distinct sequences than the proved bound allows.

Figures

Figures reproduced from arXiv: 2605.22223 by Caroline Chaux, Mario Michelessa, Maxime Meyer, Vincent Y. F. Tan.

**Figure 1.** Plane cut of the embedding space $E$ of Qwen-2 (0.5B) (Yang et al., 2024), passing through the final token embeddings of “The quick brown fox”, “https://”, and “In a distant future” (red markers). Colors encode, for each pixel, the most probable next token via the model’s output projection (decoder readout). This induces regions $E_t$ annotated by <t>. Only tokens whose regions have the largest areas are anno…
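The construction in the Figure 1 caption can be sketched on a toy example: take the plane through three embedding points, walk over a grid on that plane, and label each grid point with the token whose readout score is largest. Everything below is hypothetical (a 3-dimensional space, a 3-token readout matrix, the function names). A real model may also apply a final normalization before the readout, which this sketch ignores.

```python
def plane_point(p0, p1, p2, a, b):
    """Point p0 + a*(p1 - p0) + b*(p2 - p0) on the plane through p0, p1, p2."""
    return [x0 + a * (x1 - x0) + b * (x2 - x0) for x0, x1, x2 in zip(p0, p1, p2)]


def argmax_token(readout, e):
    """Index of the readout row with the largest dot product with embedding e."""
    scores = [sum(w * x for w, x in zip(row, e)) for row in readout]
    return max(range(len(scores)), key=scores.__getitem__)


def plane_cut(readout, p0, p1, p2, steps=5, lo=0.0, hi=1.0):
    """Grid of most-probable-token labels; rows vary b, columns vary a."""
    grid = [lo + (hi - lo) * i / (steps - 1) for i in range(steps)]
    return [[argmax_token(readout, plane_point(p0, p1, p2, a, b)) for a in grid]
            for b in grid]


# Toy readout (one row per token) and three anchor embeddings.
READOUT = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
P0, P1, P2 = [2.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 2.0]
IMAGE = plane_cut(READOUT, P0, P1, P2)
```

Each distinct label in `IMAGE` corresponds to one region $E_t$ in the caption's terminology, restricted to the plane.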

**Figure 2.** Mean accessibility for (a) PG19 and (b) random target sequences of length $n$ as a function of the number of trainable memory vectors $m$. For each $m$, we fit a sigmoid (solid) and mark $n_{50}$ where the fit crosses 0.5 (vertical dashed line). (c) $n_{50}(m)$ for PG19 (blue) and random (red), with linear fits (dashed). Concretely, we minimize the cross-entropy $L(Y; x_{1:n}) = -\sum_{i=1}^{n} \log p_\tau(x_i \mid [Y, x_{1:i-1}])$ (1). After op…
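The quantity $n_{50}$ in the Figure 2 caption can be estimated in several ways. The sketch below swaps in a simple technique that is not claimed to be the authors' procedure: it assumes a decreasing logistic curve a(n) = 1 / (1 + exp(k·(n − n50))), applies the logit transform so that the relationship becomes linear in n, and solves an ordinary least-squares line. The same helper then fits the straight line of $n_{50}$ against $m$. Function names and the synthetic data are hypothetical.

```python
import math


def least_squares_line(xs, ys):
    """Return (slope, intercept) of the ordinary least-squares line through the points."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx


def fit_n50(lengths, accessibility, clip=1e-6):
    """Estimate (n50, k) for a(n) = 1 / (1 + exp(k * (n - n50))).

    logit(a) = -k * n + k * n50, so the line's slope is -k and its intercept is k * n50.
    Values are clipped away from 0 and 1 so that the logit stays finite.
    """
    logits = []
    for a in accessibility:
        a = min(max(a, clip), 1.0 - clip)
        logits.append(math.log(a / (1.0 - a)))
    slope, intercept = least_squares_line([float(n) for n in lengths], logits)
    return -intercept / slope, -slope


# Synthetic accessibility curve with n50 = 40 and k = 0.2 (illustrative only).
LENGTHS = list(range(10, 71, 5))
ACCESS = [1.0 / (1.0 + math.exp(0.2 * (n - 40))) for n in LENGTHS]
N50, K = fit_n50(LENGTHS, ACCESS)
```

On noisy or saturated data the logit transform weights points near 0 and 1 heavily, so a nonlinear least-squares fit of the sigmoid itself may behave better.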

**Figure 3.** Conceptual example of using the cell-volume distribution to tighten the upper bound. Rather than assuming equal-volume cells (Dirac mass), we take the n-fold convolution of the empirical one-step volume distribution $D^{1}$ (light violet) and track when the median of $D^{n}$ (violet dashed) drops below the threshold (black dashed). … empirical distribution $D$ of the cell volumes of every token $t$ in terms of proportio…
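One possible reading of the Figure 3 caption, offered as an assumption rather than as the paper's method: if the volume proportion of an n-token cell is modeled as a product of n independent one-step proportions, then in log space the n-step distribution is the n-fold convolution of the one-step distribution, i.e. the distribution of a sum of n independent draws. The sketch below estimates its median by Monte Carlo and returns the first n at which the median falls below a threshold. The threshold, the uniform sampling over tokens, and all names are hypothetical.

```python
import math
import random
import statistics


def median_log_volume(log_volumes, n, samples=1000, seed=0):
    """Monte Carlo median of the sum of n independent draws from log_volumes."""
    rng = random.Random(seed)
    sums = [sum(rng.choice(log_volumes) for _ in range(n)) for _ in range(samples)]
    return statistics.median(sums)


def first_length_below(log_volumes, log_threshold, max_n=100, samples=1000, seed=0):
    """Smallest n whose median n-step log-volume is below log_threshold, else None."""
    for n in range(1, max_n + 1):
        if median_log_volume(log_volumes, n, samples=samples, seed=seed) < log_threshold:
            return n
    return None


# Equal-volume cells (the Dirac-mass case from the caption): every one-step
# proportion is 0.1, so the n-step proportion is exactly 0.1**n.
DIRAC = [math.log(0.1)]
CROSSING = first_length_below(DIRAC, math.log(5e-4))
```

With an empirical distribution in place of `DIRAC`, heavy-tailed volumes shift the median and therefore the crossing point, which is the tightening effect the caption describes.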

**Figure 4.** Models are trained to copy strings up to a maximum length (grey dashed) and evaluated on generating the exact copy of longer strings. We report exact-match copying accuracy versus string length. For each model, we fit a sigmoid to the accuracy curve (continuous lines) and report the corresponding $R^2$. Remark 5.2. The assumption of uniform precision ε may be too restrictive. We therefore provide a more ref…
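The metric named in the Figure 4 caption, exact-match copying accuracy versus string length, can be computed as in the minimal sketch below. It assumes each evaluation example is a (source, generated) pair of strings and measures length in characters. The paper may measure length in tokens, and the function name is hypothetical.

```python
from collections import defaultdict


def exact_match_by_length(examples):
    """Map each source length to the fraction of examples copied exactly.

    examples: iterable of (source, generated) string pairs.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for source, generated in examples:
        totals[len(source)] += 1
        # A copy counts only if every character matches.
        hits[len(source)] += int(generated == source)
    return {n: hits[n] / totals[n] for n in sorted(totals)}


# Toy evaluation set (illustrative only).
TOY = [("abcd", "abcd"), ("wxyz", "wxyz"), ("abcdefgh", "abcdefgx"), ("ijklmnop", "ijklmnop")]
ACCURACY = exact_match_by_length(TOY)
```

The resulting length-to-accuracy map is the kind of curve to which the caption's sigmoid would be fitted.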

**Figure 5.** Maximal radius $R$ of the internal representations of the transformer for various input lengths. Definition B.2 (Mean-Field Self-Attention (Castin et al., 2024)). The mean-field generalization $F$ of any self-attention layer, parameterized by projection matrices $W_q$, $W_k$, $W_v$, and $W_o$ as defined in Definition A.1, is defined by $F : \mu \in \mathcal{P}_c(\mathbb{R}^d) \mapsto (\Gamma_\mu)_\sharp \mu \in \mathcal{P}_c(\mathbb{R}^d)$, where $\Gamma_\mu(x) := \sum_{i=1}^{h} \int W_o^i W_v^i y \cdot \exp (W_k^i y)^\top$…

**Figure 6.** Decomposition of the cone in $E_\delta$ and $F_\delta$. … since $C_\delta$ is a cone (i.e. $y \in C_\delta \iff ry \in C_\delta$ for $r > 0$). Thus it suffices to compute $|C_\delta \cap B^d(0, 1)|$. Step 2: Disjoint Decomposition. Write, as shown in …

**Figure 7.** Mean accessibility for (a) PG19 and (b) random target sequences of length $n$ as a function of the number of trainable memory vectors $m$. For each $m$, we fit a sigmoid (solid) and mark $n_{50}$ where the fit crosses 0.5 (vertical dashed line). (c) $n_{50}(m)$ for PG19 (blue) and random (red), with linear fits (dashed). Results shown for the Pythia model suite at different scales: A) 160M, B) 410M, C) 1B.

**Figure 8.** Mean accessibility for (a) PG19 and (b) random target sequences of length $n$ as a function of the number of trainable memory vectors $m$. For each $m$, we fit a sigmoid (solid) and mark $n_{50}$ where the fit crosses 0.5 (vertical dashed line). (c) $n_{50}(m)$ for PG19 (blue) and random (red), with linear fits (dashed). Results shown for three architectures: A) Qwen-2.5, B) Gemma-3, C) Llama-3.2.

**Figure 9.** Upper bound on $C$ for different support geometries (Ball in blue, Cone in green, Ellipsoid in orange), estimated using 10K randomly sampled input strings of maximum length ℓ. Sampling prompts longer than ℓ ≈ 500 suffices to estimate the upper bound. Pythia models of different sizes: a) 160M, b) 410M, c) 1B; other model architectures: d–e) Qwen (0.5B, 1.5B), f) Llama 1B, g) Gemma 270M.

**Figure 10.** Relative volumes of decoder cells across tokens (log-log). For each model, tokens are sorted by estimated cell volume (largest on the left). A small set of tokens (typically $< 10^2$) accounts for a large fraction of the support volume, while most tokens ($10^4$–$10^5$) have tiny individual volumes (often $10^{-6}$–$10^{-8}$ of the support).
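Relative cell volumes like those in the Figure 10 caption could be estimated by Monte Carlo: sample points uniformly from the support, assign each to the token with the highest readout score, and report the fraction of samples per token. The sketch below assumes a ball-shaped support (one of the geometries named in the Figure 9 caption) and a toy readout matrix. It is not claimed to match the paper's estimator, and all names are hypothetical.

```python
import math
import random


def sample_in_ball(d, radius, rng):
    """Uniform sample from the d-dimensional ball of the given radius."""
    g = [rng.gauss(0.0, 1.0) for _ in range(d)]           # random direction
    norm = math.sqrt(sum(x * x for x in g))
    scale = radius * rng.random() ** (1.0 / d) / norm    # radius with density ~ r**(d-1)
    return [x * scale for x in g]


def estimate_cell_volumes(readout, radius=1.0, samples=20000, seed=0):
    """Estimated volume fraction of each token's cell, sorted from largest to smallest."""
    rng = random.Random(seed)
    d = len(readout[0])
    counts = [0] * len(readout)
    for _ in range(samples):
        e = sample_in_ball(d, radius, rng)
        scores = [sum(w * x for w, x in zip(row, e)) for row in readout]
        counts[max(range(len(scores)), key=scores.__getitem__)] += 1
    return sorted((c / samples for c in counts), reverse=True)


# Two tokens with opposite readout directions split the disc into two half-discs.
VOLUMES = estimate_cell_volumes([[1.0, 0.0], [-1.0, 0.0]])
```

Plotting the sorted fractions against rank on log-log axes gives the kind of curve the caption describes. Resolving volumes as small as $10^{-8}$ this way would need far more samples than the default used here.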

**Figure 11.** Distribution of the volume proportions of $n$-sequences, estimated via the n-fold convolution of the empirical one-step volume distribution $D$, for different models: a) Pythia-160M, b) Pythia-410M, c) Pythia-1B, d) Qwen2.5 0.5B, e) Qwen2.5 1.5B, f) Llama3.2 1B, g) Gemma 270M.

Original abstract

We study how we can leverage only a handful of characteristics of a transformer's architecture to closely predict the number of different sequences it can output, both qualitatively and quantitatively. We provide an upper bound depending on the length of the prompt, which we show empirically to be tight up to a factor less than 10, across architectures and model sizes. Our analysis also provides a theoretical explanation for previously observed empirical failures of transformers on simple sequence tasks, such as copying and cramming. Formally, we prove that (i) the maximal length of accessible sequences (those that the transformer can output for some prompt) grows linearly with the prompt length, (ii) beyond a critical threshold, the proportion of accessible sequences decays exponentially with sequence length, and (iii) the linear coefficient relating prompt length to accessible sequence length admits a theoretical upper bound. Notably, these results hold even with unbounded context and computation time.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

Transformers generate far fewer distinct outputs than expected, with linear growth in accessible length and exponential decay in proportion.

Look, the big thing here is that transformers can't output that many different sequences even if you give them all the time and context in the world. The max length of what they can actually produce grows linearly with the prompt length, and the proportion of possible sequences they can hit falls exponentially after a point. Plus there's an upper bound on that linear factor. They pull this off by boiling it down to just a few architecture details, and the proofs for the linear growth and the decay look like the fresh part.

The fact that their bound matches the real count within a factor of 10 in experiments is what makes it practical. It lines up with why copying and cramming tasks go wrong. The soft spot is that those few characteristics need to be laid out plainly so the bound can be checked without guessing. The factor of 10 is not bad but it's loose enough that it might not be the last word on exact counts. I'd like to see more on how they actually counted the accessible sequences in the tests to be sure it's accurate.

This is for folks who care about the built-in limits of these models on long outputs. If you're thinking about why transformers hit walls on certain problems, this gives a theoretical handle. It has enough to warrant a real referee look. Recommendation is to send it for review.

Referee Report

2 major / 2 minor

Summary. The manuscript studies the number of distinct output sequences generatable by a transformer, deriving results from a small set of architectural characteristics (independent of specific weights). It proves that the maximal length of accessible sequences grows linearly with prompt length, that the proportion of accessible sequences decays exponentially beyond a critical threshold, and that the linear coefficient admits a theoretical upper bound. These hold for arbitrary weights and even with unbounded context and computation time. The upper bound is claimed to be empirically tight within a factor of less than 10 across architectures and model sizes, with implications for explaining failures on tasks like copying and cramming.

Significance. If the derivations are rigorous, the work provides a useful abstraction for bounding transformer generative capacity without full model specification, offering a theoretical account for empirical limitations on sequence tasks. The parameter-free nature of the bounds (from architecture characteristics only) and the empirical tightness are notable strengths that could influence analysis of model scaling and task feasibility. The results appear internally consistent with the stated assumptions, though verification of the proofs would be needed to confirm broader impact.

major comments (2)

[Abstract] The claim of empirical tightness 'up to a factor less than 10' is load-bearing for the quantitative contribution, yet the manuscript provides no details on the enumeration procedure for accessible sequences, the specific models and sizes tested, data exclusion criteria, or error bars; this gap prevents assessment of whether the factor holds under the abstraction to a handful of characteristics.
[Theoretical analysis section] The proof of linear growth (claim i) and the upper bound on the coefficient (claim iii) are stated to depend only on a handful of architecture characteristics without needing full model specification. However, the manuscript does not explicitly list these characteristics or demonstrate their sufficiency in the derivation steps, which is central to the independence from weights and training details.

minor comments (2)

[Introduction] The definition of 'accessible sequences' is used throughout but would benefit from an early formal definition or notation (e.g., in the introduction) to aid readability for readers unfamiliar with the concept.
[Empirical results] Figure captions or legends for any empirical plots should explicitly state the architectures, prompt lengths, and sequence lengths used to demonstrate the factor-of-10 tightness.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments, which help clarify the presentation of our results on transformer output capacity. We address each major comment below and will revise the manuscript accordingly to improve rigor and transparency.

Referee: [Abstract] The claim of empirical tightness 'up to a factor less than 10' is load-bearing for the quantitative contribution, yet the manuscript provides no details on the enumeration procedure for accessible sequences, the specific models and sizes tested, data exclusion criteria, or error bars; this gap prevents assessment of whether the factor holds under the abstraction to a handful of characteristics.

Authors: We agree that the empirical evaluation requires more detail to support the claimed tightness. In the revised version, we will expand the experimental section with a new subsection that fully specifies the enumeration algorithm for accessible sequences (including how we enumerate over possible prompts and outputs under the architectural abstraction), lists all tested models and sizes (e.g., the Pythia suite at 160M, 410M, and 1B, Qwen-2.5 at 0.5B and 1.5B, Llama-3.2 1B, and Gemma-3 270M, as named in the figure captions), describes any data exclusion or filtering criteria, and includes error bars or standard deviations computed over multiple random seeds and prompt distributions. This addition will allow independent verification of the factor-of-less-than-10 tightness.
revision: yes
Referee: [Theoretical analysis section] The proof of linear growth (claim i) and the upper bound on the coefficient (claim iii) are stated to depend only on a handful of architecture characteristics without needing full model specification. However, the manuscript does not explicitly list these characteristics or demonstrate their sufficiency in the derivation steps, which is central to the independence from weights and training details.

Authors: We concur that explicit listing and justification of the architectural characteristics would strengthen the independence claim. We will revise the theoretical analysis section to begin with a clearly enumerated list of the relevant characteristics (autoregressive token-by-token generation, fixed embedding dimension, multi-head attention with a fixed number of heads, position-independent feed-forward layers, and the absence of any external memory beyond the prompt). We will then insert a dedicated lemma and proof sketch showing, step by step, how each of these characteristics is used (and why no others are needed) to establish both the linear growth of maximal accessible length and the upper bound on the coefficient, without invoking specific weight values or training dynamics.
revision: yes

Circularity Check

0 steps flagged

No significant circularity in theoretical derivation

The paper's central results are mathematical proofs deriving linear growth of maximal accessible sequence length, exponential decay of their proportion, and an upper bound on the linear coefficient strictly from a small set of transformer architectural characteristics (independent of weights or training). These hold for arbitrary weights and unbounded context, with no reduction to fitted parameters, self-definitions, or self-citation chains. The empirical tightness (factor <10) is presented as validation rather than a load-bearing input, and the derivation remains self-contained against external architectural properties without importing uniqueness or ansatzes via prior self-work.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Based on abstract only; the analysis assumes standard transformer properties such as fixed attention-based architecture suffice to bound outputs, with no explicit free parameters or new entities introduced.

axioms (1)

Domain assumption: A small number of fixed architectural characteristics (e.g., attention and layer structure) are sufficient to derive tight bounds on output diversity.
Basis: The paper states it leverages only a handful of such characteristics to predict the number of outputs.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · tag: unclear
Paper passage: Theorem 4.5 … upper bounded by $P(B^{d\times m}(0,r), \|\cdot\|, \varepsilon)$ … $(1+2r/\varepsilon)^{dm}$

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · absolute_floor_iff_bare_distinguishability · tag: unclear
Paper passage: Assumption 4.3 [Finite Precision] … $\mathbb{R}^d$ partitioned into axis-aligned cubes of side $\varepsilon$
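The partition mentioned in the Assumption 4.3 passage can be illustrated with a small sketch. It assumes, as a reading of the truncated passage rather than a statement from the paper, that two vectors falling in the same axis-aligned cube of side ε are treated as indistinguishable, so that counting occupied cubes counts distinguishable representations. The function names are hypothetical.

```python
import math


def cube_index(x, eps):
    """Integer coordinates of the axis-aligned cube of side eps that contains x."""
    return tuple(math.floor(xi / eps) for xi in x)


def count_distinct_cells(points, eps):
    """Number of different cubes occupied by the given points."""
    return len({cube_index(p, eps) for p in points})


# Three toy vectors: the first two share a cube at eps = 0.25, the third does not.
POINTS = [[0.01, 0.01], [0.02, 0.2], [0.3, 0.0]]
N_CELLS = count_distinct_cells(POINTS, eps=0.25)
```

A coarser ε merges more vectors into one cube, which is why bounds of the form $(1+2r/\varepsilon)^{dm}$ shrink as ε grows.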

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

100 extracted references · 100 canonical work pages · 3 internal anchors

[1]

Volumes of Generalized Unit Balls , urldate =

Xianfu Wang , journal =. Volumes of Generalized Unit Balls , urldate =

work page
[2]

and Martin, Jeremy L

Ellis, Robert B. and Martin, Jeremy L. and Yan, Catherine , title =. Algorithmica , month = apr, pages =. 2007 , issue_date =. doi:10.1007/s00453-006-0172-y , abstract =

work page doi:10.1007/s00453-006-0172-y 2007
[3]

ArXiv , year=

Repeat After Me: Transformers are Better than State Space Models at Copying , author=. ArXiv , year=

work page
[4]

Nonlinear approximation via compositions , volume=

Shen, Zuowei and Yang, Haizhao and Zhang, Shijun , year=. Nonlinear approximation via compositions , volume=. doi:10.1016/j.neunet.2019.07.011 , journal=

work page doi:10.1016/j.neunet.2019.07.011 2019
[5]

Proceedings of the 36th International Conference on Neural Information Processing Systems , articleno =

Kojima, Takeshi and Gu, Shixiang Shane and Reid, Machel and Matsuo, Yutaka and Iwasawa, Yusuke , title =. Proceedings of the 36th International Conference on Neural Information Processing Systems , articleno =. 2022 , isbn =

work page 2022
[6]

The Thirteenth International Conference on Learning Representations , year=

Transformers are Universal In-context Learners , author=. The Thirteenth International Conference on Learning Representations , year=

work page
[7]

and Le, Quoc V

Wei, Jason and Wang, Xuezhi and Schuurmans, Dale and Bosma, Maarten and Ichter, Brian and Xia, Fei and Chi, Ed H. and Le, Quoc V. and Zhou, Denny , title =. Proceedings of the 36th International Conference on Neural Information Processing Systems , articleno =. 2022 , isbn =

work page 2022
[8]

Journal of Computational Mathematics , year =

Montanelli, Hadrien and Yang, Haizhao and Qiang, Du , title =. Journal of Computational Mathematics , year =. doi:https://doi.org/10.4208/jcm.2007-m2019-0239 , url =

work page doi:10.4208/jcm.2007-m2019-0239 2007
[9]

2025 , eprint=

Memory Limitations of Prompt Tuning in Transformers , author=. 2025 , eprint=

work page 2025
[10]

2025 , eprint=

Gemma 3 Technical Report , author=. 2025 , eprint=

work page 2025
[11]

2024 , eprint=

The Llama 3 Herd of Models , author=. 2024 , eprint=

work page 2024
[12]

P -Tuning: Prompt Tuning Can Be Comparable to Fine-tuning Across Scales and Tasks

Liu, Xiao and Ji, Kaixuan and Fu, Yicheng and Tam, Weng and Du, Zhengxiao and Yang, Zhilin and Tang, Jie. P -Tuning: Prompt Tuning Can Be Comparable to Fine-tuning Across Scales and Tasks. Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). 2022. doi:10.18653/v1/2022.acl-short.8

work page doi:10.18653/v1/2022.acl-short.8 2022
[13]

Statistically Meaningful Approximation: a Case Study on Approximating Turing Machines with Transformers , url =

Wei, Colin and Chen, Yining and Ma, Tengyu , booktitle =. Statistically Meaningful Approximation: a Case Study on Approximating Turing Machines with Transformers , url =

work page
[14]

L lama F actory: Unified Efficient Fine-Tuning of 100+ Language Models

Zheng, Yaowei and Zhang, Richong and Zhang, Junhao and Ye, Yanhan and Luo, Zheyan. L lama F actory: Unified Efficient Fine-Tuning of 100+ Language Models. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations). 2024. doi:10.18653/v1/2024.acl-demos.38

work page doi:10.18653/v1/2024.acl-demos.38 2024
[15]

2024 , url=

Cheng-Ping Hsieh and Simeng Sun and Samuel Kriman and Shantanu Acharya and Dima Rekesh and Fei Jia and Boris Ginsburg , booktitle=. 2024 , url=

work page 2024
[16]

Long-Context

Bowen Jin and Jinsung Yoon and Jiawei Han and Sercan O Arik , booktitle=. Long-Context. 2025 , url=

work page 2025
[17]

Transformers: State-of-the-Art Natural Language Processing

Wolf, Thomas and Debut, Lysandre and Sanh, Victor and others. Transformers: State-of-the-Art Natural Language Processing. Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations. 2020. doi:10.18653/v1/2020.emnlp-demos.6

work page doi:10.18653/v1/2020.emnlp-demos.6 2020
[18]

PyTorch: An Imperative Style, High-Performance Deep Learning Library , url =

Paszke, Adam and Gross, Sam and Massa, Francisco and Lerer, Adam and Bradbury, James and Chanan, Gregory and Killeen, Trevor and Lin, Zeming and Gimelshein, Natalia and Antiga, Luca and Desmaison, Alban and Kopf, Andreas and Yang, Edward and DeVito, Zachary and Raison, Martin and Tejani, Alykhan and Chilamkurthy, Sasank and Steiner, Benoit and Fang, Lu an...

work page
[19]

Proceedings of the 40th International Conference on Machine Learning , pages =

Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling , author =. Proceedings of the 40th International Conference on Machine Learning , pages =. 2023 , editor =

work page 2023
[20]

I so S core: Measuring the Uniformity of Embedding Space Utilization

Rudman, William and Gillman, Nate and Rayne, Taylor and Eickhoff, Carsten. I so S core: Measuring the Uniformity of Embedding Space Utilization. Findings of the Association for Computational Linguistics: ACL 2022. 2022. doi:10.18653/v1/2022.findings-acl.262

work page doi:10.18653/v1/2022.findings-acl.262 2022
[21]

Attention is All you Need , url =

Vaswani, Ashish and Shazeer, Noam and Parmar, Niki and Uszkoreit, Jakob and Jones, Llion and Gomez, Aidan N and Kaiser, ukasz and Polosukhin, Illia , booktitle =. Attention is All you Need , url =

work page
[22]

Summary of a Haystack: A Challenge to Long-Context LLM s and RAG Systems

Laban, Philippe and Fabbri, Alexander and Xiong, Caiming and Wu, Chien-Sheng. Summary of a Haystack: A Challenge to Long-Context LLM s and RAG Systems. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing. 2024. doi:10.18653/v1/2024.emnlp-main.552

work page doi:10.18653/v1/2024.emnlp-main.552 2024
[23]

Thirty-seventh Conference on Neural Information Processing Systems , year=

Hard Prompts Made Easy: Gradient-Based Discrete Optimization for Prompt Tuning and Discovery , author=. Thirty-seventh Conference on Neural Information Processing Systems , year=

work page
[24]

The Twelfth International Conference on Learning Representations , year=

Memorization Capacity of Multi-Head Attention in Transformers , author=. The Twelfth International Conference on Learning Representations , year=

work page
[25]

The Eleventh International Conference on Learning Representations , year=

Provable Memorization Capacity of Transformers , author=. The Eleventh International Conference on Learning Representations , year=

work page
[26]

In: Gurevych, I., Miyao, Y

Howard, Jeremy and Ruder, Sebastian. Universal Language Model Fine-tuning for Text Classification. Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2018. doi:10.18653/v1/P18-1031

work page doi:10.18653/v1/p18-1031 2018
[27]

Same Task, More Tokens: the Impact of Input Length on the Reasoning Performance of Large Language Models

Levy, Mosh and Jacoby, Alon and Goldberg, Yoav. Same Task, More Tokens: the Impact of Input Length on the Reasoning Performance of Large Language Models. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2024. doi:10.18653/v1/2024.acl-long.818

work page doi:10.18653/v1/2024.acl-long.818 2024
[28]

The Thirty-eighth Annual Conference on Neural Information Processing Systems , year=

Transformers need glasses! Information over-squashing in language tasks , author=. The Thirty-eighth Annual Conference on Neural Information Processing Systems , year=

work page
[29]

BERT : Pre-training of Deep Bidirectional Transformers for Language Understanding

Devlin, Jacob and Chang, Ming-Wei and Lee, Kenton and Toutanova, Kristina. BERT : Pre-training of Deep Bidirectional Transformers for Language Understanding. Proceedings of the 2019 Conference of the North A merican Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). 2019. doi:10.18653/v...

work page doi:10.18653/v1/n19-1423 2019
[30]

Attention is Not Only a Weight: Analyzing Transformers with Vector Norms

Kobayashi, Goro and Kuribayashi, Tatsuki and Yokoi, Sho and Inui, Kentaro. Attention is Not Only a Weight: Analyzing Transformers with Vector Norms. Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). 2020. doi:10.18653/v1/2020.emnlp-main.574

work page doi:10.18653/v1/2020.emnlp-main.574 2020
[31]

The Thirteenth International Conference on Learning Representations , year=

On the Optimal Memorization Capacity of Transformers , author=. The Thirteenth International Conference on Learning Representations , year=

work page
[32]

The Twelfth International Conference on Learning Representations , year=

Are Transformers with One Layer Self-Attention Using Low-Rank Weight Matrices Universal Approximators? , author=. The Twelfth International Conference on Learning Representations , year=

work page
[33]

L oo GLE : Can Long-Context Language Models Understand Long Contexts?

Li, Jiaqi and Wang, Mengmeng and Zheng, Zilong and Zhang, Muhan. L oo GLE : Can Long-Context Language Models Understand Long Contexts?. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2024. doi:10.18653/v1/2024.acl-long.859

work page doi:10.18653/v1/2024.acl-long.859 2024
[34]

Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang

Liu, Nelson F. and Lin, Kevin and Hewitt, John and Paranjape, Ashwin and Bevilacqua, Michele and Petroni, Fabio and Liang, Percy. Lost in the Middle: How Language Models Use Long Contexts. Transactions of the Association for Computational Linguistics. 2024. doi:10.1162/tacl_a_00638

work page doi:10.1162/tacl_a_00638 2024
[35]

Knee-Deep in C-

Andy Yang and Micha. Knee-Deep in C-. The Thirty-ninth Annual Conference on Neural Information Processing Systems , year=

work page
[36]

MaPLe: Multi-modal Prompt Learning , year=

Khattak, Muhammad Uzair and Rasheed, Hanoona and Maaz, Muhammad and Khan, Salman and Khan, Fahad Shahbaz , booktitle=. MaPLe: Multi-modal Prompt Learning , year=

work page
[37]

The Eleventh International Conference on Learning Representations , year=

Multitask Prompt Tuning Enables Parameter-Efficient Transfer Learning , author=. The Eleventh International Conference on Learning Representations , year=

work page
[38]

2023 , url=

Guangyi Chen and Weiran Yao and Xiangchen Song and Xinyue Li and Yongming Rao and Kun Zhang , booktitle=. 2023 , url=

work page 2023
[39]

Proceedings of The 34th International Conference on Algorithmic Learning Theory , pages =

On The Computational Complexity of Self-Attention , author =. Proceedings of The 34th International Conference on Algorithmic Learning Theory , pages =. 2023 , editor =

work page 2023
[40]

The Twelfth International Conference on Learning Representations , year=

Nemesis: Normalizing the Soft-prompt Vectors of Vision-Language Models , author=. The Twelfth International Conference on Learning Representations , year=

work page
[41]

A Survey on In-context Learning

Dong, Qingxiu and Li, Lei and Dai, Damai and Zheng, Ce and Ma, Jingyuan and Li, Rui and Xia, Heming and Xu, Jingjing and Wu, Zhiyong and Chang, Baobao and Sun, Xu and Li, Lei and Sui, Zhifang. A Survey on In-context Learning. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing. 2024. doi:10.18653/v1/2024.emnlp-main.64

work page doi:10.18653/v1/2024.emnlp-main.64 2024
[42]

Zhengxiang Shi and Aldo Lipani , booktitle=. De. 2024 , url=

work page 2024
[43]

The Twelfth International Conference on Learning Representations , year=

Protein Multimer Structure Prediction via Prompt Learning , author=. The Twelfth International Conference on Learning Representations , year=

work page
[44]

Proceedings of the 40th International Conference on Machine Learning , pages =

Looped Transformers as Programmable Computers , author =. Proceedings of the 40th International Conference on Machine Learning , pages =. 2023 , editor =

work page 2023
[45]

Your transformer may not be as powerful as you expect. Advances in Neural Information Processing Systems.

[46]

Benjamin Edelman, Surbhi Goel, Sham Kakade, and Cyril Zhang. ICML 2022.

[47]

Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, Parker Schuh, Kensen Shi, Sashank Tsvyashchenko, Joshua Maynez, Abhishek Rao, Parker Barnes, Yi Tay, Noam Shazeer, Prabhakara... 2023.

[48]

The Expressive Power of Transformers with Chain of Thought. The Twelfth International Conference on Learning Representations.

[49]

Approximation Rate of the Transformer Architecture for Sequence Modeling. The Thirty-eighth Annual Conference on Neural Information Processing Systems.

[50]

Chulhee Yun, Yin Wen Chang, Srinadh Bhojanapalli, Ankit Singh Rawat, Sashank J. Reddi, and Sanjiv Kumar. O(n) connections are expressive enough: Universal approximability of sparse transformers. Advances in Neural Information Processing Systems. 2020.

[51]

Shaked Brody, Uri Alon, and Eran Yahav. On the Expressivity Role of LayerNorm in Transformers' Attention. Findings of the Association for Computational Linguistics: ACL 2023. 2023. doi:10.18653/v1/2023.findings-acl.895

[52]

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Winte... Language Models are Few-Shot Learners.

[53]

An Empirical Study of Mamba-based Language Models. 2024.

[54]

Yuri Kuratov, Mikhail Arkhipov, Aydar Bulatov, and Mikhail Burtsev. Cramming 1568 Tokens into a Single Vector and Back Again: Exploring the Limits of Embedding Space Capacity. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2025. doi:10.18653/v1/2025.acl-long.948

[55]

From 128K to 4M: Efficient Training of Ultra-Long Context Large Language Models. arXiv preprint arXiv:2504.06214.

[56]

Extending context window in large language models with segmented base adjustment for rotary position embeddings. Applied Sciences. 2024.

[57]

Samet Oymak, Ankit Singh Rawat, Mahdi Soltanolkotabi, and Christos Thrampoulidis. Proceedings of the 40th International Conference on Machine Learning. 2023.

[58]

Attention is not all you need: Pure attention loses rank doubly exponentially with depth. International Conference on Machine Learning. 2021.

[59]

When Do Prompting and Prefix-Tuning Work? A Theory of Capabilities and Limitations. The Twelfth International Conference on Learning Representations.

[60]

Yiran Ding, Li Lyna Zhang, Chengruidong Zhang, Yuanyuan Xu, Ning Shang, Jiahang Xu, Fan Yang, and Mao Yang. Proceedings of the 41st International Conference on Machine Learning. 2024.

[61]

Scaling vision transformers. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[62]

Hengshuang Zhao, Li Jiang, Jiaya Jia, Philip H.S. Torr, and Vladlen Koltun. Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). 2021.

[63]

An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. International Conference on Learning Representations.

[64]

Brian Lester, Rami Al-Rfou, and Noah Constant. The Power of Scale for Parameter-Efficient Prompt Tuning. Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. 2021. doi:10.18653/v1/2021.emnlp-main.243

[65]

Xiang Lisa Li and Percy Liang. Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). 2021. doi:10.18653/v1/2021.acl-long.353

[66]

Roman Vershynin. High-Dimensional Probability: An Introduction with Applications in Data Science.

[67]

Valérie Castin, Pierre Ablin, and Gabriel Peyré. How Smooth Is Attention?

[68]

Invertible Residual Networks. Proceedings of the 36th International Conference on Machine Learning. 2019.

[69]

A Theoretical Framework for Prompt Engineering: Approximating Smooth Functions with Transformer Prompts. 2025.

[70]

Peter L. Bartlett, Nick Harvey, Christopher Liaw, and Abbas Mehrabian. J. Mach. Learn. Res.

[71]

Yihan Wang, Jatin Chauhan, Wei Wang, and Cho-Jui Hsieh. Universality and Limitations of Prompt Tuning.

[72]

Fundamental Limits of Prompt Tuning Transformers: Universality, Capacity and Efficiency. The Thirteenth International Conference on Learning Representations.

[73]

Prompting a Pretrained Transformer Can Be a Universal Approximator. ICLR 2024 Workshop on Mathematical and Empirical Understanding of Foundation Models. 2024.

[74]

Are Transformers universal approximators of sequence-to-sequence functions? International Conference on Learning Representations.

[75]

Borjan Geshkovski, Cyril Letrouit, Yury Polyanskiy, and Philippe Rigollet. The emergence of clusters in self-attention dynamics.

[76]

Optimal transport for applied mathematicians. 2015.

[77]

Universal Approximation Under Constraints is Possible with Transformers. International Conference on Learning Representations.

[78]

GPT-4 Technical Report. 2024.

[79]

Generating Wikipedia by Summarizing Long Sequences. International Conference on Learning Representations.

[80]

Sinkformers: Transformers with Doubly Stochastic Attention. Proceedings of The 25th International Conference on Artificial Intelligence and Statistics. 2022.

Showing first 80 references.
