arxiv: 2508.00901 · v4 · pith:FWJRLGLGnew · submitted 2025-07-28 · 💻 cs.LG · cs.CL

Provable Knowledge Acquisition and Extraction in One-Layer Transformers

Pith reviewed 2026-05-21 23:01 UTC · model grok-4.3

classification 💻 cs.LG cs.CL
keywords one-layer transformer · knowledge acquisition · pre-training · fine-tuning · attention patterns · factual extraction · relation covering · hallucination mechanism

The pith

One-layer transformers store facts in structured attention patterns during pre-training that fine-tuning activates for extraction.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper establishes that in a one-layer transformer trained by next-token prediction, the model develops structured attention patterns and relation-specific feature directions to store factual knowledge while achieving near-optimal loss under suitable conditions. Fine-tuning on question-answering data then uses the prompt format to activate these pre-trained features, allowing extraction of facts not encountered during fine-tuning as long as the underlying relations are covered. The resulting relation-covering characterization shows that extraction gets better with more pre-training on each relation and broader coverage in fine-tuning, but worse when there are more possible relation templates overall. This framework also explains a failure mode where facts are stored yet remain inaccessible, mirroring certain hallucination behaviors in language models.

Core claim

Under suitable regularity conditions, next-token pre-training in the one-layer transformer leads to near-optimal loss along with the development of structured attention patterns and relation-specific feature directions that encode factual knowledge. Fine-tuning on Q&A tasks converts the prompt format into a trigger for these features, enabling extraction of facts not revisited in the fine-tuning phase. Knowledge extraction follows a relation-covering characterization that depends on pre-training multiplicity and fine-tuning coverage of latent relation-template directions, improving with greater multiplicity and coverage but degrading as the relation-template universe expands. Insufficient coverage leaves facts stored but inaccessible.

What carries the argument

The relation-covering characterization of knowledge extraction, which determines when fine-tuning activates enough pre-trained relation directions to retrieve stored facts.
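The qualitative trends of the relation-covering picture can be illustrated with a toy counting model. This is a sketch under strong assumptions, not the paper's theorem: it assumes a fact is stored under `stored` relation templates drawn uniformly without replacement from a universe of `universe` templates, that fine-tuning covers `covered` templates, and that a fact is extractable exactly when at least one of its templates is covered. The function names and the extractability rule are hypothetical.

```python
from math import comb


def miss_probability(universe: int, stored: int, covered: int) -> float:
    """Chance that none of a fact's stored templates is covered by fine-tuning.

    Toy assumption: the stored templates are a uniform random subset of the
    universe, independent of which templates fine-tuning covers.
    """
    if not (0 < stored <= universe and 0 <= covered <= universe):
        raise ValueError("need 0 < stored <= universe and 0 <= covered <= universe")
    # Ways to pick all stored templates from the uncovered ones, over all ways.
    # math.comb returns 0 when fewer uncovered templates remain than are needed.
    return comb(universe - covered, stored) / comb(universe, stored)


def hit_probability(universe: int, stored: int, covered: int) -> float:
    """Chance that at least one stored template is covered (toy extractability)."""
    return 1.0 - miss_probability(universe, stored, covered)


if __name__ == "__main__":
    # More multiplicity helps, more coverage helps, a larger universe hurts.
    print(hit_probability(universe=10, stored=1, covered=5))
    print(hit_probability(universe=10, stored=2, covered=5))
    print(hit_probability(universe=20, stored=2, covered=5))
```

In this toy model the hit probability rises with `stored` and `covered` and falls as `universe` grows, which mirrors the direction of the three dependencies stated above without reproducing the paper's actual bounds.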

If this is right

  • Fine-tuning can extract many pre-trained facts without including every specific subject-answer pair, provided the relation templates are adequately represented.
  • Greater repetition of relations during pre-training strengthens the feature directions and improves post-fine-tuning extraction.
  • Larger spaces of possible relations make it harder to achieve reliable extraction through fine-tuning.
  • Low-rank fine-tuning recovers pre-trained factual knowledge when the relation coverage is sufficient.
  • Insufficient relation coverage during fine-tuning leaves stored facts inaccessible, leading to potential hallucinations.
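For readers unfamiliar with low-rank fine-tuning, the generic form is a frozen weight matrix W plus a trainable product B·A of two thin matrices. The sketch below shows only that generic form in plain Python; the paper's exact parameterization is not shown on this page, and the helper names are hypothetical.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def low_rank_update(w, b, a):
    """Return w + b @ a, where b is (d_out x r) and a is (r x d_in)."""
    delta = matmul(b, a)
    return [[wij + dij for wij, dij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]


def trainable_params(d_out, d_in, rank=None):
    """Count trainable entries: the full matrix, or the two low-rank factors."""
    if rank is None:
        return d_out * d_in
    return rank * (d_out + d_in)


if __name__ == "__main__":
    w = [[1, 0], [0, 1]]          # frozen pre-trained weights
    b = [[1], [2]]                # d_out x r factor, r = 1
    a = [[3, 4]]                  # r x d_in factor
    print(low_rank_update(w, b, a))
    print(trainable_params(50, 50), trainable_params(50, 50, rank=2))
```

The parameter count makes the trade-off concrete: the rank caps how many independent directions the update can change, which is why coverage of relation directions is the relevant quantity in the bullet above.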

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This points to designing fine-tuning datasets that emphasize variety in relation types to maximize knowledge extraction from pre-training.
  • The analysis may connect to understanding prompt engineering, where different formats act as triggers for different pre-trained capabilities.
  • Extending the model to multiple layers could show how knowledge storage and triggering distribute across depth in larger architectures.
  • Empirical checks on real models could test if attention patterns in early layers align with the predicted structures for common relations.

Load-bearing premise

The central claim depends on the assumption that the training process under next-token prediction will produce exactly the structured attention patterns and relation-specific feature directions when regularity conditions are met.

What would settle it

Train a one-layer transformer on a synthetic dataset of facts grouped by relations, measure the learned attention weights to see if they match the predicted structured patterns at near-optimal loss, and test extraction success rates against the predicted dependence on relation coverage during fine-tuning.
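A minimal sketch of the data side of such a test follows, loosely modeled on a setup with (subject, relation phrase, answer) tuples for pre-training and a Q&A prompt token for fine-tuning. The function name, the default sizes, and the string token names are assumptions for illustration; this is not the authors' generator.

```python
import random


def build_datasets(num_subjects=8, universe=6, per_subject=2, seed=0):
    """Build toy pre-training, fine-tuning, and test sets of (context, target) pairs."""
    rng = random.Random(seed)
    relations = [f"r{i}" for i in range(universe)]  # relation-template universe
    prompt = "p"                                    # hypothetical Q&A prompt token
    pretrain, facts = [], []
    for j in range(num_subjects):
        subject, answer = f"s{j}", f"a{j}"
        facts.append((subject, answer))
        # State each fact through several relation templates (its multiplicity).
        for rel in rng.sample(relations, per_subject):
            pretrain.append(((subject,), rel))         # next token after the subject
            pretrain.append(((subject, rel), answer))  # next token after subject + relation
    # Half of the facts are shown in Q&A form; the rest are held out for testing.
    rng.shuffle(facts)
    half = len(facts) // 2
    finetune = [((s, prompt), a) for s, a in facts[:half]]
    test = [((s, prompt), a) for s, a in facts[half:]]
    return pretrain, finetune, test


if __name__ == "__main__":
    pretrain, finetune, test = build_datasets()
    print(len(pretrain), len(finetune), len(test))
    print(pretrain[:2], finetune[:1], test[:1])
```

Sweeping `per_subject` and `universe` while measuring held-out accuracy on `test` would probe the predicted dependence on multiplicity and universe size; the model and training loop are left out here.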

Figures

Figures reproduced from arXiv: 2508.00901 by Kexin Chen, Ruichen Xu.

Figure 1: Illustration of the simplified one-layer transformer.
Figure 2: OOD generalization accuracy of our simplified one-layer transformers with 5-token data. (a) Impact of the number of subject-answer pairs N. (b) Impact of the size of relation phrases |R|.
Figure 3: OOD generalization accuracy of GPT-2 with 5-token data.
Figure 4: Exact match accuracy of fine-tuned Llama-3.2-1B on PopQA.
Figure 5: Attention heatmap after pre-training.
Figure 6: OOD generalization accuracy versus FT steps.
Original abstract

Large language models may encounter factual knowledge during pre-training yet fail to reliably use that knowledge after fine-tuning. Despite growing empirical evidence that MLP layers store factual associations and fine-tuning affects factual recall, the training-dynamics mechanisms linking next-token pre-training, knowledge storage, and post-fine-tuning extraction remain poorly understood. We study this problem in a stylized one-layer transformer with self-attention and MLP modules, trained by next-token prediction and subsequently fine-tuned on question-answering data. Under suitable regularity conditions, we first prove that the model reaches near-optimal pre-training loss while learning structured attention patterns and relation-specific feature directions, giving a mechanism for factual knowledge acquisition. We then show that fine-tuning can turn the Q&A prompt format into a trigger for pre-trained relation features, enabling the model to extract facts that are not revisited during fine-tuning. Our analysis yields a relation-covering characterization of knowledge extraction: fine-tuning need not revisit every stored subject-answer pair, but it must cover enough latent relation-template directions through which facts were encoded during pre-training. Consequently, extraction improves with pre-training multiplicity and fine-tuning coverage, but becomes harder as the relation-template universe grows. Conversely, insufficient coverage leads to a failure regime in which facts may be stored but remain inaccessible, providing a stylized mechanism for hallucination. The theory applies to both full and low-rank fine-tuning, offering insight into why low-rank adaptation can recover pre-trained factual knowledge when relation coverage is sufficient. Experiments on synthetic data and PopQA-based GPT-2/Llama models support the predicted trends.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper analyzes knowledge acquisition and extraction in a stylized one-layer transformer trained via next-token prediction and then fine-tuned on Q&A tasks. Under suitable regularity conditions on the data distribution, initialization, and optimization, it claims to prove that pre-training achieves near-optimal loss while acquiring structured attention patterns and relation-specific feature directions for factual knowledge storage. Fine-tuning turns the Q&A format into a trigger for these pre-trained features, with extraction governed by a relation-covering characterization: extraction improves with pre-training multiplicity and fine-tuning coverage but degrades with larger relation-template universes. Insufficient coverage leads to a failure mode akin to hallucination. The theory extends to low-rank fine-tuning, and experiments on synthetic data plus PopQA with GPT-2/Llama models are said to support the trends.

Significance. If the regularity conditions can be made explicit and verified to hold for the data generators and fine-tuning regimes considered, the work would provide a rare mechanistic, provable link between pre-training dynamics, knowledge storage in attention/MLP modules, and post-fine-tuning extraction. The relation-covering characterization and its implications for LoRA and hallucination would be a substantive contribution to understanding factual recall in transformers.

major comments (2)
  1. [Abstract and main theoretical sections] Abstract and theoretical development: the central claims (near-optimal pre-training loss, acquisition of structured attention patterns, and relation-specific feature directions) are stated to hold only under unspecified 'suitable regularity conditions' on data distribution, initialization, and optimization trajectory. These conditions must be stated explicitly (e.g., as assumptions on the data generator or loss landscape) and shown to be satisfied by the synthetic setup and PopQA regime; otherwise the mechanism for knowledge acquisition and the subsequent relation-covering extraction result are not rigorously derived from the pre-training dynamics.
  2. [Theoretical results on extraction] Proofs of the relation-covering characterization and low-rank adaptation result: these appear to rest on the model having learned exactly the relation-template directions during pre-training. Without explicit derivations or verification that the regularity conditions are not circular with respect to the target behavior (i.e., they do not presuppose the separation into distinct directions), it is impossible to confirm that the extraction characterization follows as a consequence rather than an extrapolation.
minor comments (1)
  1. [Experiments] The synthetic data generator and the precise construction of relation-template directions should be described with sufficient detail (including any hyper-parameters) to allow independent verification of the predicted trends.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments that highlight opportunities to strengthen the rigor and clarity of our theoretical claims. We address each major comment below and will revise the manuscript to make the regularity conditions explicit while preserving the core contributions on knowledge acquisition and the relation-covering extraction mechanism.

  1. Referee: [Abstract and main theoretical sections] Abstract and theoretical development: the central claims (near-optimal pre-training loss, acquisition of structured attention patterns, and relation-specific feature directions) are stated to hold only under unspecified 'suitable regularity conditions' on data distribution, initialization, and optimization trajectory. These conditions must be stated explicitly (e.g., as assumptions on the data generator or loss landscape) and shown to be satisfied by the synthetic setup and PopQA regime; otherwise the mechanism for knowledge acquisition and the subsequent relation-covering extraction result are not rigorously derived from the pre-training dynamics.

    Authors: We agree that the regularity conditions require explicit statement to ensure the claims are rigorously derived. In the revised manuscript we will add a new 'Assumptions' subsection that enumerates all conditions: (i) data distribution requires sufficient multiplicity of each relation template and bounded noise; (ii) initialization is random with norms controlled by a small constant; (iii) optimization reaches a neighborhood of a critical point where the loss is near-optimal under gradient flow. We will verify these hold by construction for the synthetic data generator. For the PopQA experiments with GPT-2 and Llama we will clarify that the setup approximates the conditions and that the observed trends are consistent with the theory, while noting that real-world data may deviate; this distinction will be added to the experimental discussion. revision: yes

  2. Referee: [Theoretical results on extraction] Proofs of the relation-covering characterization and low-rank adaptation result: these appear to rest on the model having learned exactly the relation-template directions during pre-training. Without explicit derivations or verification that the regularity conditions are not circular with respect to the target behavior (i.e., they do not presuppose the separation into distinct directions), it is impossible to confirm that the extraction characterization follows as a consequence rather than an extrapolation.

    Authors: The logical structure of the proofs first derives the emergence of relation-specific feature directions and attention patterns solely from minimizing the next-token prediction loss under the stated regularity conditions, before analyzing fine-tuning. To eliminate any appearance of circularity we will expand the appendix with additional lemmas that explicitly construct the feature directions from the pre-training objective alone, without reference to Q&A extraction or relation covering. These lemmas will show that distinct directions arise from the loss landscape properties and data multiplicity, after which the relation-covering characterization follows directly as a consequence for both full and low-rank fine-tuning. This will make the derivation chain transparent. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation proceeds from explicit assumptions to consequences without reduction by construction.

The paper's central proofs are conditioned on explicitly invoked 'suitable regularity conditions' regarding data distribution, initialization, and optimization trajectory. These are standard external assumptions that enable showing near-optimal pre-training loss together with structured attention and relation-specific directions; they are not defined in terms of the target patterns themselves. The subsequent relation-covering characterization of extraction is derived as a consequence of how fine-tuning interacts with those pre-trained features, without any fitted parameter being relabeled as a prediction or any self-citation chain serving as the sole justification. The overall argument remains self-contained as a conditional theoretical analysis rather than tautological.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claims rest on unspecified regularity conditions that enable the model to acquire structured attention and relation-specific features during pre-training; these conditions function as domain assumptions whose validity determines whether the extraction mechanism applies.

axioms (1)
  • domain assumption: Suitable regularity conditions guarantee that next-token pre-training produces structured attention patterns and relation-specific feature directions.
    Invoked to establish the knowledge-acquisition mechanism before analyzing fine-tuning extraction.
invented entities (1)
  • relation-template directions (no independent evidence)
    purpose: Latent directions in feature space that encode facts during pre-training and are triggered by fine-tuning prompts.
    New postulated construct used to explain both storage and extraction; no independent falsifiable handle is provided in the abstract.


