Provable Knowledge Acquisition and Extraction in One-Layer Transformers

Kexin Chen; Ruichen Xu

arxiv: 2508.00901 · v4 · pith:FWJRLGLGnew · submitted 2025-07-28 · 💻 cs.LG · cs.CL

Pith reviewed 2026-05-21 23:01 UTC · model grok-4.3

keywords one-layer transformer · knowledge acquisition · pre-training · fine-tuning · attention patterns · factual extraction · relation covering · hallucination mechanism

The pith

One-layer transformers store facts in structured attention patterns during pre-training that fine-tuning activates for extraction.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper establishes that in a one-layer transformer trained by next-token prediction, the model develops structured attention patterns and relation-specific feature directions to store factual knowledge while achieving near-optimal loss under suitable conditions. Fine-tuning on question-answering data then uses the prompt format to activate these pre-trained features, allowing extraction of facts not encountered during fine-tuning as long as the underlying relations are covered. The resulting relation-covering characterization shows that extraction gets better with more pre-training on each relation and broader coverage in fine-tuning, but worse when there are more possible relation templates overall. This framework also explains a failure mode where facts are stored yet remain inaccessible, mirroring certain hallucination behaviors in language models.

Core claim

Under suitable regularity conditions, next-token pre-training in the one-layer transformer leads to near-optimal loss along with the development of structured attention patterns and relation-specific feature directions that encode factual knowledge. Fine-tuning on Q&A tasks converts the prompt format into a trigger for these features, enabling extraction of facts not revisited in the fine-tuning phase. Knowledge extraction follows a relation-covering characterization that depends on pre-training multiplicity and fine-tuning coverage of latent relation-template directions, improving with greater multiplicity and coverage but degrading as the relation-template universe expands. Insufficient coverage leaves facts stored but inaccessible.

What carries the argument

The relation-covering characterization of knowledge extraction, which determines when fine-tuning activates enough pre-trained relation directions to retrieve stored facts.
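
As a rough illustration of what a covering condition can look like, consider a deliberately simplified toy model. It is an assumption made here for intuition, not the paper's theorem: each fact is encoded through K relation templates drawn uniformly without replacement from a universe of |R| templates, fine-tuning covers a fixed set of c templates, and a fact counts as extractable when at least one of its K templates is covered. The symbols K and c, and the "at least one" threshold, are hypothetical choices. Under these assumptions the extraction probability has a closed form:

```latex
\[
\Pr[\text{fact is extractable}]
  \;=\; 1 - \frac{\binom{|R| - c}{K}}{\binom{|R|}{K}}
  \;=\; 1 - \prod_{i=0}^{K-1} \frac{|R| - c - i}{|R| - i},
\qquad 0 \le c \le |R|,\quad 1 \le K \le |R|,
\]
% Convention: \binom{n}{k} = 0 when k > n, so the probability is 1 once c > |R| - K.
```

Each factor in the product is below 1 and grows toward 1 as |R| increases, so in this toy model the extraction probability rises with K and c and falls as |R| grows, which is the same direction as the trends stated for the paper's characterization.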

If this is right

Fine-tuning can extract many pre-trained facts without including every specific subject-answer pair, provided the relation templates are adequately represented.
Greater repetition of relations during pre-training strengthens the feature directions and improves post-fine-tuning extraction.
Larger spaces of possible relations make it harder to achieve reliable extraction through fine-tuning.
Low-rank fine-tuning recovers pre-trained factual knowledge when the relation coverage is sufficient.
Insufficient relation coverage during fine-tuning leaves stored facts inaccessible, leading to potential hallucinations.
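
The direction of these trends can be checked numerically in a toy setting. The sketch below is not the paper's model: it assumes each fact is stored through a random subset of relation templates and treats a fact as extractable when at least one of its templates was covered during fine-tuning. The function name `extraction_rate` and all of its parameters are hypothetical.

```python
import random


def extraction_rate(num_templates, multiplicity, covered, num_facts=20000, seed=0):
    """Estimate the fraction of facts that are extractable in a toy covering model.

    num_templates: size of the relation-template universe (assumed).
    multiplicity:  templates through which each fact is encoded in pre-training (assumed).
    covered:       number of templates that fine-tuning covers (assumed).
    """
    rng = random.Random(seed)
    # By symmetry it does not matter which templates are covered, so take the first ones.
    covered_set = set(range(covered))
    hits = 0
    for _ in range(num_facts):
        # Templates used to encode this fact, sampled without replacement.
        templates = rng.sample(range(num_templates), multiplicity)
        # Toy criterion: one covered template is enough to trigger extraction.
        if any(t in covered_set for t in templates):
            hits += 1
    return hits / num_facts


if __name__ == "__main__":
    base = extraction_rate(num_templates=20, multiplicity=3, covered=5)
    more_multiplicity = extraction_rate(num_templates=20, multiplicity=6, covered=5)
    more_coverage = extraction_rate(num_templates=20, multiplicity=3, covered=10)
    larger_universe = extraction_rate(num_templates=40, multiplicity=3, covered=5)
    print(base, more_multiplicity, more_coverage, larger_universe)
```

In this toy setting the estimate goes up with multiplicity and coverage and down when the template universe doubles. It says nothing about the constants or thresholds in the paper's actual result.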

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

This points to designing fine-tuning datasets that emphasize variety in relation types to maximize knowledge extraction from pre-training.
The analysis may connect to understanding prompt engineering, where different formats act as triggers for different pre-trained capabilities.
Extending the model to multiple layers could show how knowledge storage and triggering distribute across depth in larger architectures.
Empirical checks on real models could test if attention patterns in early layers align with the predicted structures for common relations.

Load-bearing premise

The central claim depends on the assumption that, whenever the regularity conditions are met, next-token-prediction training produces exactly the structured attention patterns and relation-specific feature directions that the analysis predicts.

What would settle it

Train a one-layer transformer on a synthetic dataset of facts grouped by relations, measure the learned attention weights to see if they match the predicted structured patterns at near-optimal loss, and test extraction success rates against the predicted dependence on relation coverage during fine-tuning.
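
One possible way to set up the data for such a test is sketched below. It is a minimal sketch under stated assumptions, not the authors' generator: facts are (subject, relation, answer) triples, each subject-answer pair is paired with a random subset of relation templates for next-token pre-training, and the pairs are split in half so that test facts never appear in fine-tuning. The token names (`s0`, `r0`, `a0`), the prompt token `p`, the function `build_datasets`, and the default sizes are all hypothetical.

```python
import random


def build_datasets(num_facts=100, num_relations=5, templates_per_fact=3, seed=0):
    """Build toy pre-training, fine-tuning, and test sets of (context, target) pairs."""
    rng = random.Random(seed)
    pretrain, facts = [], []
    for j in range(num_facts):
        subject, answer = f"s{j}", f"a{j}"
        facts.append((subject, answer))
        # Each fact is expressed through several relation templates.
        for i in rng.sample(range(num_relations), templates_per_fact):
            relation = f"r{i}"
            # Next-token samples: subject -> relation, then (subject, relation) -> answer.
            pretrain.append(((subject,), relation))
            pretrain.append(((subject, relation), answer))

    # Split the facts so that test facts are never seen during fine-tuning.
    shuffled = facts[:]
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    prompt = "p"  # assumed Q&A prompt token
    finetune = [((s, prompt), a) for s, a in shuffled[:half]]
    test = [((s, prompt), a) for s, a in shuffled[half:]]
    return pretrain, finetune, test


if __name__ == "__main__":
    pretrain, finetune, test = build_datasets()
    print(len(pretrain), len(finetune), len(test))
```

A model trained on `pretrain` and then on `finetune` can be scored on `test`. Varying `templates_per_fact` and `num_relations` gives the kind of sweep the paragraph above describes.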

Figures

Figures reproduced from arXiv: 2508.00901 by Kexin Chen, Ruichen Xu.

**Figure 2.** OOD generalization accuracy of our simplified one-layer transformers with 5-token data. (a) Impact of the number of subject-answer pairs N. (b) Impact of the size of relation phrases |R|.

**Figure 3.** OOD generalization accuracy of GPT-2 with 5-token data.

**Figure 4.** Exact match accuracy of fine-tuned Llama-3.2-1B on PopQA.

**Figure 5.** Attention heatmap after pre-training.

**Figure 6.** OOD generalization accuracy versus FT steps.

Abstract

Large language models may encounter factual knowledge during pre-training yet fail to reliably use that knowledge after fine-tuning. Despite growing empirical evidence that MLP layers store factual associations and fine-tuning affects factual recall, the training-dynamics mechanisms linking next-token pre-training, knowledge storage, and post-fine-tuning extraction remain poorly understood. We study this problem in a stylized one-layer transformer with self-attention and MLP modules, trained by next-token prediction and subsequently fine-tuned on question-answering data. Under suitable regularity conditions, we first prove that the model reaches near-optimal pre-training loss while learning structured attention patterns and relation-specific feature directions, giving a mechanism for factual knowledge acquisition. We then show that fine-tuning can turn the Q&A prompt format into a trigger for pre-trained relation features, enabling the model to extract facts that are not revisited during fine-tuning. Our analysis yields a relation-covering characterization of knowledge extraction: fine-tuning need not revisit every stored subject-answer pair, but it must cover enough latent relation-template directions through which facts were encoded during pre-training. Consequently, extraction improves with pre-training multiplicity and fine-tuning coverage, but becomes harder as the relation-template universe grows. Conversely, insufficient coverage leads to a failure regime in which facts may be stored but remain inaccessible, providing a stylized mechanism for hallucination. The theory applies to both full and low-rank fine-tuning, offering insight into why low-rank adaptation can recover pre-trained factual knowledge when relation coverage is sufficient. Experiments on synthetic data and PopQA-based GPT-2/Llama models support the predicted trends.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper gives a relation-covering story for how one-layer transformers store facts in pre-training and extract them after fine-tuning, but the proofs rest on unspecified regularity conditions that limit how far the claims travel.

The main takeaway is that this work supplies a concrete mechanism connecting next-token pre-training to post-fine-tuning fact extraction in a one-layer transformer. Under regularity conditions the model learns structured attention and distinct relation directions; fine-tuning then uses the Q&A format to activate those directions. Extraction succeeds when fine-tuning covers enough of the latent relation-template directions, improves with higher pre-training multiplicity, and fails when coverage is thin, producing a stored-but-inaccessible regime they tie to hallucination. The same logic is shown to apply to low-rank adaptation when coverage holds.

Referee Report

2 major / 1 minor

Summary. The paper analyzes knowledge acquisition and extraction in a stylized one-layer transformer trained via next-token prediction and then fine-tuned on Q&A tasks. Under suitable regularity conditions on the data distribution, initialization, and optimization, it claims to prove that pre-training achieves near-optimal loss while acquiring structured attention patterns and relation-specific feature directions for factual knowledge storage. Fine-tuning turns the Q&A format into a trigger for these pre-trained features, with extraction governed by a relation-covering characterization: extraction improves with pre-training multiplicity and fine-tuning coverage but degrades with larger relation-template universes. Insufficient coverage leads to a failure mode akin to hallucination. The theory extends to low-rank fine-tuning, and experiments on synthetic data plus PopQA with GPT-2/Llama models are said to support the trends.

Significance. If the regularity conditions can be made explicit and verified to hold for the data generators and fine-tuning regimes considered, the work would provide a rare mechanistic, provable link between pre-training dynamics, knowledge storage in attention/MLP modules, and post-fine-tuning extraction. The relation-covering characterization and its implications for LoRA and hallucination would be a substantive contribution to understanding factual recall in transformers.

major comments (2)

[Abstract and main theoretical sections] Abstract and theoretical development: the central claims (near-optimal pre-training loss, acquisition of structured attention patterns, and relation-specific feature directions) are stated to hold only under unspecified 'suitable regularity conditions' on data distribution, initialization, and optimization trajectory. These conditions must be stated explicitly (e.g., as assumptions on the data generator or loss landscape) and shown to be satisfied by the synthetic setup and PopQA regime; otherwise the mechanism for knowledge acquisition and the subsequent relation-covering extraction result are not rigorously derived from the pre-training dynamics.
[Theoretical results on extraction] Proofs of the relation-covering characterization and low-rank adaptation result: these appear to rest on the model having learned exactly the relation-template directions during pre-training. Without explicit derivations or verification that the regularity conditions are not circular with respect to the target behavior (i.e., they do not presuppose the separation into distinct directions), it is impossible to confirm that the extraction characterization follows as a consequence rather than an extrapolation.

minor comments (1)

[Experiments] The synthetic data generator and the precise construction of relation-template directions should be described with sufficient detail (including any hyper-parameters) to allow independent verification of the predicted trends.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments that highlight opportunities to strengthen the rigor and clarity of our theoretical claims. We address each major comment below and will revise the manuscript to make the regularity conditions explicit while preserving the core contributions on knowledge acquisition and the relation-covering extraction mechanism.

Referee: [Abstract and main theoretical sections] Abstract and theoretical development: the central claims (near-optimal pre-training loss, acquisition of structured attention patterns, and relation-specific feature directions) are stated to hold only under unspecified 'suitable regularity conditions' on data distribution, initialization, and optimization trajectory. These conditions must be stated explicitly (e.g., as assumptions on the data generator or loss landscape) and shown to be satisfied by the synthetic setup and PopQA regime; otherwise the mechanism for knowledge acquisition and the subsequent relation-covering extraction result are not rigorously derived from the pre-training dynamics.

Authors: We agree that the regularity conditions require explicit statement to ensure the claims are rigorously derived. In the revised manuscript we will add a new 'Assumptions' subsection that enumerates all conditions: (i) data distribution requires sufficient multiplicity of each relation template and bounded noise; (ii) initialization is random with norms controlled by a small constant; (iii) optimization reaches a neighborhood of a critical point where the loss is near-optimal under gradient flow. We will verify these hold by construction for the synthetic data generator. For the PopQA experiments with GPT-2 and Llama we will clarify that the setup approximates the conditions and that the observed trends are consistent with the theory, while noting that real-world data may deviate; this distinction will be added to the experimental discussion. revision: yes
Referee: [Theoretical results on extraction] Proofs of the relation-covering characterization and low-rank adaptation result: these appear to rest on the model having learned exactly the relation-template directions during pre-training. Without explicit derivations or verification that the regularity conditions are not circular with respect to the target behavior (i.e., they do not presuppose the separation into distinct directions), it is impossible to confirm that the extraction characterization follows as a consequence rather than an extrapolation.

Authors: The logical structure of the proofs first derives the emergence of relation-specific feature directions and attention patterns solely from minimizing the next-token prediction loss under the stated regularity conditions, before analyzing fine-tuning. To eliminate any appearance of circularity we will expand the appendix with additional lemmas that explicitly construct the feature directions from the pre-training objective alone, without reference to Q&A extraction or relation covering. These lemmas will show that distinct directions arise from the loss landscape properties and data multiplicity, after which the relation-covering characterization follows directly as a consequence for both full and low-rank fine-tuning. This will make the derivation chain transparent. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation proceeds from explicit assumptions to consequences without reduction by construction.

full rationale

The paper's central proofs are conditioned on explicitly invoked 'suitable regularity conditions' regarding data distribution, initialization, and optimization trajectory. These are standard external assumptions that enable showing near-optimal pre-training loss together with structured attention and relation-specific directions; they are not defined in terms of the target patterns themselves. The subsequent relation-covering characterization of extraction is derived as a consequence of how fine-tuning interacts with those pre-trained features, without any fitted parameter being relabeled as a prediction or any self-citation chain serving as the sole justification. The overall argument remains self-contained as a conditional theoretical analysis rather than tautological.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claims rest on unspecified regularity conditions that enable the model to acquire structured attention and relation-specific features during pre-training; these conditions function as domain assumptions whose validity determines whether the extraction mechanism applies.

axioms (1)

domain assumption: Suitable regularity conditions guarantee that next-token pre-training produces structured attention patterns and relation-specific feature directions.
Invoked to establish the knowledge-acquisition mechanism before analyzing fine-tuning extraction.

invented entities (1)

relation-template directions (no independent evidence)
purpose: Latent directions in feature space that encode facts during pre-training and are triggered by fine-tuning prompts.
New postulated construct used to explain both storage and extraction; no independent falsifiable handle is provided in the abstract.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: "Under suitable regularity conditions, the model reaches near-optimal pre-training loss while learning structured attention patterns and relation-specific feature directions"

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
Paper passage: "self-attention module learns to filter out the irrelevant contextual information, while the MLP module memorizes the filtered context"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

57 extracted references · 57 canonical work pages · 4 internal anchors

[1]

Transformers learn to implement preconditioned gradient descent for in-context learning

Kwangjun Ahn, Xiang Cheng, Hadi Daneshmand, and Suvrit Sra. Transformers learn to implement preconditioned gradient descent for in-context learning. Advances in Neural Information Processing Systems, 36:45614–45650, 2023

work page 2023
[2]

Towards understanding ensemble, knowledge distillation and self-distillation in deep learning

Zeyuan Allen-Zhu and Yuanzhi Li. Towards understanding ensemble, knowledge distillation and self-distillation in deep learning. arXiv preprint arXiv:2012.09816, 2020

work page arXiv 2012
[3]

Physics of language models: Part 3.1, knowledge storage and extraction

Zeyuan Allen-Zhu and Yuanzhi Li. Physics of language models: Part 3.1, knowledge storage and extraction. arXiv preprint arXiv:2309.14316, 2023

work page arXiv 2023
[4]

Birth of a transformer: A memory viewpoint

Alberto Bietti, Vivien Cabannes, Diane Bouchacourt, Herve Jegou, and Leon Bottou. Birth of a transformer: A memory viewpoint. Advances in Neural Information Processing Systems, 36: 1560–1588, 2023

work page 2023
[5]

Language models are few-shot learners

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877– 1901, 2020

work page 1901
[6]

Scaling laws for associative memories

Vivien Cabannes, Elvis Dohmatob, and Alberto Bietti. Scaling laws for associative memories. arXiv preprint arXiv:2310.02984, 2023

work page arXiv 2023
[7]

Learning associative memories with gradient descent

Vivien Cabannes, Berfin Simsek, and Alberto Bietti. Learning associative memories with gradient descent. arXiv preprint arXiv:2402.18724, 2024

work page arXiv 2024
[8]

Benign overfitting in two-layer convolutional neural networks

Yuan Cao, Zixiang Chen, Misha Belkin, and Quanquan Gu. Benign overfitting in two-layer convolutional neural networks. Advances in neural information processing systems , 35: 25237–25250, 2022

work page 2022
[9]

Towards understanding the mixture-of-experts layer in deep learning

Zixiang Chen, Yihe Deng, Yue Wu, Quanquan Gu, and Yuanzhi Li. Towards understanding the mixture-of-experts layer in deep learning. Advances in neural information processing systems, 2022

work page 2022
[10]

Attention retrieves, mlp memorizes: Disentangling trainable components in the transformer.arXiv preprint arXiv:2506.01115, 2025

Yihe Dong, Lorenzo Noci, Mikhail Khodak, and Mufan Li. Attention retrieves, mlp memorizes: Disentangling trainable components in the transformer. arXiv preprint arXiv:2506.01115, 2025

work page arXiv 2025
[11]

Transformer Feed-Forward Layers Are Key-Value Memories

Mor Geva, Roei Schuster, Jonathan Berant, and Omer Levy. Transformer feed-forward layers are key-value memories. arXiv preprint arXiv:2012.14913, 2020

work page internal anchor Pith review Pith/arXiv arXiv 2012
[12]

Understanding finetuning for factual knowledge extraction

Gaurav Ghosal, Tatsunori Hashimoto, and Aditi Raghunathan. Understanding finetuning for factual knowledge extraction. arXiv preprint arXiv:2406.14785, 2024

work page arXiv 2024
[13]

The Llama 3 Herd of Models

Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024. 14

work page internal anchor Pith review Pith/arXiv arXiv 2024
[14]

In-context convergence of transformers.arXiv preprint arXiv:2310.05249, 2023

Yu Huang, Yuan Cheng, and Yingbin Liang. In-context convergence of transformers.arXiv preprint arXiv:2310.05249, 2023

work page arXiv 2023
[15]

Vision transformers provably learn spatial structure

Samy Jelassi, Michael Sander, and Yuanzhi Li. Vision transformers provably learn spatial structure. Advances in Neural Information Processing Systems, 2022

work page 2022
[16]

Unveil benign overfitting for transformer in vision: Training dynamics, convergence, and generalization

Jiarui Jiang, Wei Huang, Miao Zhang, Taiji Suzuki, and Liqiang Nie. Unveil benign overfitting for transformer in vision: Training dynamics, convergence, and generalization. Advances in Neural Information Processing Systems, 2024

work page 2024
[17]

Optimal memorization capacity of transformers

Tokio Kajitsuka and Issei Sato. Optimal memorization capacity of transformers. In Interna- tional Conference on Learning Representations, 2025

work page 2025
[18]

Large language models struggle to learn long-tail knowledge

Nikhil Kandpal, Haikang Deng, Adam Roberts, Eric Wallace, and Colin Raffel. Large language models struggle to learn long-tail knowledge. In International Conference on Machine Learning, 2023

work page 2023
[19]

Provable memorization capacity of transformers

Junghwan Kim, Michelle Kim, and Barzan Mozafari. Provable memorization capacity of transformers. In International Conference on Learning Representations, 2023

work page 2023
[20]

Benign overfitting in two-layer relu convolutional neural networks

Yiwen Kou, Zixiang Chen, Yuanzhou Chen, and Quanquan Gu. Benign overfitting in two-layer relu convolutional neural networks. In International conference on machine learning, 2023

work page 2023
[21]

Deduplicating Training Data Makes Language Models Better

Katherine Lee, Daphne Ippolito, Andrew Nystrom, Chiyuan Zhang, Douglas Eck, Chris Callison-Burch, and Nicholas Carlini. Deduplicating training data makes language models better. arXiv preprint arXiv:2107.06499, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021
[22]

Next-token prediction capacity: general upper bounds and a lower bound for transformers

Liam Madden, Curtis Fox, and Christos Thrampoulidis. Next-token prediction capacity: general upper bounds and a lower bound for transformers. arXiv preprint arXiv:2405.13718, 2024

work page arXiv 2024
[23]

One step of gradient descent is provably the optimal in-context learner with one layer of linear self-attention

Arvind Mahankali, Tatsunori B Hashimoto, and Tengyu Ma. One step of gradient descent is provably the optimal in-context learner with one layer of linear self-attention. arXiv preprint arXiv:2307.03576, 2023

work page arXiv 2023
[24]

Memorization Capacity of Multi-Head Attention in Transformers, 2024

Sadegh Mahdavi, Renjie Liao, and Christos Thrampoulidis. Memorization capacity of multi- head attention in transformers. arXiv preprint arXiv:2306.02010, 2023

work page arXiv 2023
[25]

When Not to Trust Language Models: Investigating Effectiveness of Parametric and Non-Parametric Memories

Alex Mallen, Akari Asai, Victor Zhong, Rajarshi Das, Daniel Khashabi, and Hannaneh Hajishirzi. When not to trust language models: Investigating effectiveness of parametric and non-parametric memories. arXiv preprint arXiv:2212.10511, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[26]

Locating and editing factual associations in gpt

Kevin Meng, David Bau, Alex Andonian, and Yonatan Belinkov. Locating and editing factual associations in gpt. Advances in neural information processing systems, 2022

work page 2022
[27]

Nichani, A

Eshaan Nichani, Alex Damian, and Jason D Lee. How transformers learn causal structure with gradient descent. arXiv preprint arXiv:2402.14735, 2024

work page arXiv 2024
[28]

Understanding factual recall in transformers via associative memories

Eshaan Nichani, Jason D Lee, and Alberto Bietti. Understanding factual recall in transformers via associative memories. arXiv preprint arXiv:2412.06538, 2024. 15

work page arXiv 2024
[29]

Language models are unsupervised multitask learners

Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019

work page 2019
[30]

Data augmentation as feature manipu- lation

Ruoqi Shen, S´ebastien Bubeck, and Suriya Gunasekar. Data augmentation as feature manipu- lation. In International conference on machine learning, 2022

work page 2022
[31]

Scan and snap: Understand- ing training dynamics and token composition in 1-layer transformer

Yuandong Tian, Yiping Wang, Beidi Chen, and Simon S Du. Scan and snap: Understand- ing training dynamics and token composition in 1-layer transformer. Advances in Neural Information Processing Systems, 36:71911–71947, 2023

work page 2023
[32]

Joma: Demys- tifying multilayer transformers via joint dynamics of mlp and attention

Yuandong Tian, Yiping Wang, Zhenyu Zhang, Beidi Chen, and Simon Du. Joma: Demys- tifying multilayer transformers via joint dynamics of mlp and attention. arXiv preprint arXiv:2310.00535, 2023

work page arXiv 2023
[33]

Rethinking benign overfitting in two-layer neural networks

Ruichen Xu and Kexin Chen. Rethinking benign overfitting in two-layer neural networks. In International Conference on Machine Learning, 2025

work page 2025
[34]

Knowledge circuits in pretrained transformers

Yunzhi Yao, Ningyu Zhang, Zekun Xi, Mengru Wang, Ziwen Xu, Shumin Deng, and Huajun Chen. Knowledge circuits in pretrained transformers. In Advances in neural information processing systems, 2024

work page 2024
[35]

Are transformers universal approximators of sequence-to-sequence functions? International Conference on Learning Representations, 2020

Chulhee Yun, Srinadh Bhojanapalli, Ankit Singh Rawat, Sashank J Reddi, and Sanjiv Kumar. Are transformers universal approximators of sequence-to-sequence functions? International Conference on Learning Representations, 2020

work page 2020
[36]

Trained transformers learn linear models in-context

Ruiqi Zhang, Spencer Frei, and Peter L Bartlett. Trained transformers learn linear models in-context. Journal of Machine Learning Research, 25(49):1–55, 2024

work page 2024
[37]

Towards a theoretical understanding of the’reversal curse’via training dynamics

Hanlin Zhu, Baihe Huang, Shaolun Zhang, Michael Jordan, Jiantao Jiao, Yuandong Tian, and Stuart J Russell. Towards a theoretical understanding of the’reversal curse’via training dynamics. Advances in Neural Information Processing Systems, 2024

work page 2024
[38]

The benefits of mixup for feature learning

Difan Zou, Yuan Cao, Yuanzhi Li, and Quanquan Gu. The benefits of mixup for feature learning. In International Conference on Machine Learning, 2023. 16 Contents 1 Introduction 1 1.1 Related work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 1.2 Notation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ....

work page 2023
[39]

Sample a subset Rs ⊆ R of K relation phrase embeddings uniformly at random without replacement

work page
[40]

For each i ∈ R s, construct K sentence tuples (sj, ri, aj)

work page
[41]

Convert each tuple(sj, ri, aj) to NTP pre-training samples, (sj, I(ri)) and ([sj ri], I(aj))

work page
[42]

Based on the sentence set T , we construct the pre-training dataset as follows: For each (sj, ri, aj) ∈ T :

Randomly select half tuples from T and convert each of them to fine-tuning samples, ([sj p], I(aj)). Based on the sentence set T , we construct the pre-training dataset as follows: For each (sj, ri, aj) ∈ T :

work page
[43]

We construct the fine-tuning dataset and the test dataset as For each (sj, ri, aj) ∈ T :

Convert each tuple(sj, ri, aj) to NTP pre-training samples, (sj, I(ri)) and ([sj ri], I(aj)). We construct the fine-tuning dataset and the test dataset as For each (sj, ri, aj) ∈ T :

work page
[44]

Randomly select half tuples from T and convert each of them to fine-tuning samples, 19 ([sj p], I(aj))

work page
[45]

We assume all token embeddings— sj, ri, aj, for all j ∈ [N] and i ∈ R are generated from random Gaussian distributions N (0, σ2I)

Convert each of the remaining tuples to test samples, ([sj p], I(aj)). We assume all token embeddings— sj, ri, aj, for all j ∈ [N] and i ∈ R are generated from random Gaussian distributions N (0, σ2I). We set the parameters as N = 1000 , K = 5 , |R| = 5 , σ = 1, d = 50. We pre-train the transformers (‘standard’ and ’Uniform attention + MLP’) with AdamW op...

work page
[46]

nX i=1 I(j ∈ A i) ≥ K 2q n # ≤ exp − nK2 2q2 . (44) 25 Proof. In each time i ∈ [n], a number j ∈ [q] has probability K/q to selected. By Hoeffding’s inequality, we have: P

demonstrated that transformers are universal approximators. Kim et al. [ 19] proved that transformers can memorize sequence mappings of length- n d -dimensional inputs with ˜O(d + n + √ nN) parameters. Later, Kajitsuka and Sato [ 17] proved lower and upper bounds of the memorization capacity of transformers in next token prediction and sequence-to-sequenc...

work page
[47]

We have 1 m Pm k=1⟨w(T1) I(d),k, 2o⟩ = Ω(log(d)/λ),

work page
[48]

For allj ∈ [N] and i ∈ B j, we have1−logit(T1) I(aj)([osjri]) = Θ(1), and 1−Ks(j)logit(T1) I(ri)([osj]) = Θ(1),

work page
[49]

For all j ∈ [N] and i ∈ B j, we have 0.8 ≤ α(T1) 2 ([o s j ri])/α(T1) 3 ([o s j ri]) ≤ 1.2,

work page
[50]

(70) Proof

At the end of Stage 1, for all j ∈ [N], i ∈ [K], we have α(T1) 1 ([o s j]) ≤ 1 2d and α(T1) 1 ([o s j ri]) ≤ 1 2d . (70) Proof. In this proof, we prove the four statements of Lemma 10 one by one. 29 First, during the first iteration, we have 1 m mX k=1 ⟨w(1) I(d),k, 2o⟩− 1 m mX k=1 ⟨w(0) I(d),k, 2o⟩ (a) =Θ(1) · η1 n n 3m nX i=1 (1 − logit(0) I(d)(2λo))σ′(...

[51]

for t ≤ T1; (b) is because $1 - \mathrm{logit}^{(t)}_{I(a_j)}([o\ s_j\ r_i]) \le 1$ and $\|o + s_j + 2r_i\|_2^2 = O(d)$. After 2 iterations, we have

$\frac{1}{m}\sum_{k=1}^{m}\langle w^{(T_1)}_{I(a_j),k}, \Xi^{(T_1)}([o\ s_j\ r_i])\rangle = O\!\left(\frac{\lambda\eta_1 d\,Ks(j)}{nm}\right)$. (80)

Then, as $\frac{1}{m}\sum_{k=1}^{m}\langle w^{(T_1)}_{I(d),k}, 2o\rangle = \Theta(\lambda\eta_1 d/m)$ and $\frac{1}{m}\sum_{k=1}^{m}\langle w^{(T_1)}_{I(r_i),k}, \Xi^{(T_1)}([o\ s_j\ r_i])\rangle = \Theta\!\left(\frac{1}{m}\sum_{k=1}^{m}\langle w^{(T_1)}_{I(d),k}, \Xi^{(T_1)}([o\ s_j\ r_i])\rangle / n\right)$, we have 1 − logit(T...

[52]

$1 - \mathrm{logit}^{(t)}_{I(a_j)}([o\ s_j\ r_i]) = \Theta(1)$

[53]

$1 - Ks(j)\,\mathrm{logit}^{(t)}_{I(r_i)}([o\ s_j]) = \Theta(1)$

[54]

$0.7 \le \alpha^{(t)}_2([o\ s_j\ r_i]) / \alpha^{(t)}_3([o\ s_j\ r_i]) \le 1.6$

[55]

For all j ∈ [N], i ∈ [K] and T1 ≤ t ≤ T2, we have $\alpha^{(t)}_1([o\ s_j]) \le \frac{2}{3d}$ and $\alpha^{(t)}_1([o\ s_j\ r_i]) \le \frac{2}{3d}$. (96)

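[56]

For T1 ≤ t ≤ T2, we have $\frac{1}{m}\sum_{k=1}^{m}\langle w^{(T_1)}_{I(d),k}, 2o\rangle = \Omega(\log(d)/\lambda)$.

Proof. First, we prove the first and the second statements. For any j ∈ [N] and i ∈ B_j, the model updates satisfy

$\frac{1}{m}\sum_{k=1}^{m}\langle w^{(t+1)}_{I(r_i),k}, \Xi^{(t+1)}([o\ s_j])\rangle - \frac{1}{m}\sum_{k=1}^{m}\langle w^{(t)}_{I(r_i),k}, \Xi^{(t)}([o\ s_j])\rangle \overset{(a)}{=} \Theta(1)\cdot\frac{\lambda\eta_2}{nm}\left(1 - K\,\mathrm{logit}^{(t)}_{I(r_i)}([o\ s_j])\right) \overset{(b)}{=} O\!\left(\frac{\lambda\eta_2}{nm}\right)$, (97)

and 1 m mX k=...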

[57]

$\left[\sum_{j=1}^{d}\sigma_j u_j v_j^\top\right]_i$, $[\mathrm{RA}_1(G^{(t_f)})]_i = \left[\sigma_1 u_1 v_1^\top\right]_i$. (216) Taking the L2 norms, we have $\|[G^{(t_f)}]_i\|_2^2 =$
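The rank-one approximation in (216) can be illustrated numerically. This is a sketch under assumptions: σ_j, u_j, v_j are taken to be the singular values and vectors of G, [·]_i is read as the i-th row, and the matrix below is random rather than the paper's G. Because the v_j are orthonormal, the squared norm of row i of G equals Σ_j σ_j² u_{j,i}², and the rank-one approximation keeps only the j = 1 term.

```python
import numpy as np

def rank_one_approx(G):
    """Return RA_1(G) = sigma_1 u_1 v_1^T together with the SVD factors."""
    U, S, Vt = np.linalg.svd(G, full_matrices=False)
    return S[0] * np.outer(U[:, 0], Vt[0]), (U, S, Vt)

rng = np.random.default_rng(0)
G = rng.normal(size=(6, 4))            # hypothetical stand-in for G
RA1, (U, S, Vt) = rank_one_approx(G)

# Squared row norms of G, computed directly and via the SVD expansion.
row_norms_sq = np.sum(G**2, axis=1)
via_svd = (U**2) @ (S**2)              # sum_j sigma_j^2 * u_{j,i}^2 per row i

# Squared row norms of the rank-one approximation: only the leading term.
ra1_row_norms_sq = S[0] ** 2 * U[:, 0] ** 2
```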

$+\ \tilde{O}(m) + \tilde{O}((N + |R|)^2/\lambda^2)$. (168) Setting $T_p - T_2 = \tilde{\Theta}\big((m d^2 \sigma_0^2 + m + N^2/\lambda^2 + |R|^2/\lambda^2)/\eta_3\big)$ completes the proof.

G.5 Key properties

Lemma 15. With probability 1 − δ, for all j ∈ [N], the following holds: $\frac{1}{m}\sum_{k=1}^{m}\sigma(\langle w^{(T_p)}_{I(a_j),k}, s_j\rangle) = \Omega(\log(d)/\lambda)$. (169)

Proof. With probability 1 − δ, at initialization, there are at least 0.4m neurons activated ...


[1] Kwangjun Ahn, Xiang Cheng, Hadi Daneshmand, and Suvrit Sra. Transformers learn to implement preconditioned gradient descent for in-context learning. Advances in Neural Information Processing Systems, 36:45614–45650, 2023.

[2] Zeyuan Allen-Zhu and Yuanzhi Li. Towards understanding ensemble, knowledge distillation and self-distillation in deep learning. arXiv preprint arXiv:2012.09816, 2020.

[3] Zeyuan Allen-Zhu and Yuanzhi Li. Physics of language models: Part 3.1, knowledge storage and extraction. arXiv preprint arXiv:2309.14316, 2023.

[4] Alberto Bietti, Vivien Cabannes, Diane Bouchacourt, Herve Jegou, and Leon Bottou. Birth of a transformer: A memory viewpoint. Advances in Neural Information Processing Systems, 36:1560–1588, 2023.

[5] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877–1901, 2020.

[6] Vivien Cabannes, Elvis Dohmatob, and Alberto Bietti. Scaling laws for associative memories. arXiv preprint arXiv:2310.02984, 2023.

[7] Vivien Cabannes, Berfin Simsek, and Alberto Bietti. Learning associative memories with gradient descent. arXiv preprint arXiv:2402.18724, 2024.

[8] Yuan Cao, Zixiang Chen, Misha Belkin, and Quanquan Gu. Benign overfitting in two-layer convolutional neural networks. Advances in Neural Information Processing Systems, 35:25237–25250, 2022.

[9] Zixiang Chen, Yihe Deng, Yue Wu, Quanquan Gu, and Yuanzhi Li. Towards understanding the mixture-of-experts layer in deep learning. Advances in Neural Information Processing Systems, 2022.

[10] Yihe Dong, Lorenzo Noci, Mikhail Khodak, and Mufan Li. Attention retrieves, mlp memorizes: Disentangling trainable components in the transformer. arXiv preprint arXiv:2506.01115, 2025.

[11] Mor Geva, Roei Schuster, Jonathan Berant, and Omer Levy. Transformer feed-forward layers are key-value memories. arXiv preprint arXiv:2012.14913, 2020.

[12] Gaurav Ghosal, Tatsunori Hashimoto, and Aditi Raghunathan. Understanding finetuning for factual knowledge extraction. arXiv preprint arXiv:2406.14785, 2024.

[13] Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024.

[14] Yu Huang, Yuan Cheng, and Yingbin Liang. In-context convergence of transformers. arXiv preprint arXiv:2310.05249, 2023.

[15] Samy Jelassi, Michael Sander, and Yuanzhi Li. Vision transformers provably learn spatial structure. Advances in Neural Information Processing Systems, 2022.

[16] Jiarui Jiang, Wei Huang, Miao Zhang, Taiji Suzuki, and Liqiang Nie. Unveil benign overfitting for transformer in vision: Training dynamics, convergence, and generalization. Advances in Neural Information Processing Systems, 2024.

[17] Tokio Kajitsuka and Issei Sato. Optimal memorization capacity of transformers. In International Conference on Learning Representations, 2025.

[18] Nikhil Kandpal, Haikang Deng, Adam Roberts, Eric Wallace, and Colin Raffel. Large language models struggle to learn long-tail knowledge. In International Conference on Machine Learning, 2023.

[19] Junghwan Kim, Michelle Kim, and Barzan Mozafari. Provable memorization capacity of transformers. In International Conference on Learning Representations, 2023.

[20] Yiwen Kou, Zixiang Chen, Yuanzhou Chen, and Quanquan Gu. Benign overfitting in two-layer relu convolutional neural networks. In International Conference on Machine Learning, 2023.

[21] Katherine Lee, Daphne Ippolito, Andrew Nystrom, Chiyuan Zhang, Douglas Eck, Chris Callison-Burch, and Nicholas Carlini. Deduplicating training data makes language models better. arXiv preprint arXiv:2107.06499, 2021.

[22] Liam Madden, Curtis Fox, and Christos Thrampoulidis. Next-token prediction capacity: general upper bounds and a lower bound for transformers. arXiv preprint arXiv:2405.13718, 2024.

[23] Arvind Mahankali, Tatsunori B Hashimoto, and Tengyu Ma. One step of gradient descent is provably the optimal in-context learner with one layer of linear self-attention. arXiv preprint arXiv:2307.03576, 2023.

[24] Sadegh Mahdavi, Renjie Liao, and Christos Thrampoulidis. Memorization capacity of multi-head attention in transformers. arXiv preprint arXiv:2306.02010, 2023.

[25] Alex Mallen, Akari Asai, Victor Zhong, Rajarshi Das, Daniel Khashabi, and Hannaneh Hajishirzi. When not to trust language models: Investigating effectiveness of parametric and non-parametric memories. arXiv preprint arXiv:2212.10511, 2022.

[26] Kevin Meng, David Bau, Alex Andonian, and Yonatan Belinkov. Locating and editing factual associations in gpt. Advances in Neural Information Processing Systems, 2022.

[27] Eshaan Nichani, Alex Damian, and Jason D Lee. How transformers learn causal structure with gradient descent. arXiv preprint arXiv:2402.14735, 2024.

[28] Eshaan Nichani, Jason D Lee, and Alberto Bietti. Understanding factual recall in transformers via associative memories. arXiv preprint arXiv:2412.06538, 2024.

[29] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019.

[30] Ruoqi Shen, Sébastien Bubeck, and Suriya Gunasekar. Data augmentation as feature manipulation. In International Conference on Machine Learning, 2022.

[31] Yuandong Tian, Yiping Wang, Beidi Chen, and Simon S Du. Scan and snap: Understanding training dynamics and token composition in 1-layer transformer. Advances in Neural Information Processing Systems, 36:71911–71947, 2023.

[32] Yuandong Tian, Yiping Wang, Zhenyu Zhang, Beidi Chen, and Simon Du. Joma: Demystifying multilayer transformers via joint dynamics of mlp and attention. arXiv preprint arXiv:2310.00535, 2023.

[33] Ruichen Xu and Kexin Chen. Rethinking benign overfitting in two-layer neural networks. In International Conference on Machine Learning, 2025.

[34] Yunzhi Yao, Ningyu Zhang, Zekun Xi, Mengru Wang, Ziwen Xu, Shumin Deng, and Huajun Chen. Knowledge circuits in pretrained transformers. In Advances in Neural Information Processing Systems, 2024.

[35] Chulhee Yun, Srinadh Bhojanapalli, Ankit Singh Rawat, Sashank J Reddi, and Sanjiv Kumar. Are transformers universal approximators of sequence-to-sequence functions? In International Conference on Learning Representations, 2020.

[36] Ruiqi Zhang, Spencer Frei, and Peter L Bartlett. Trained transformers learn linear models in-context. Journal of Machine Learning Research, 25(49):1–55, 2024.

[37] Hanlin Zhu, Baihe Huang, Shaolun Zhang, Michael Jordan, Jiantao Jiao, Yuandong Tian, and Stuart J Russell. Towards a theoretical understanding of the 'reversal curse' via training dynamics. Advances in Neural Information Processing Systems, 2024.

[38] Difan Zou, Yuan Cao, Yuanzhi Li, and Quanquan Gu. The benefits of mixup for feature learning. In International Conference on Machine Learning, 2023.

[39] Sample a subset R_s ⊆ R of K relation phrase embeddings uniformly at random without replacement.

[40] For each i ∈ R_s, construct K sentence tuples (s_j, r_i, a_j).

[41] Convert each tuple (s_j, r_i, a_j) to NTP pre-training samples, (s_j, I(r_i)) and ([s_j r_i], I(a_j)).

[42] Based on the sentence set T, we construct the pre-training dataset as follows: For each (s_j, r_i, a_j) ∈ T:
