Provable Knowledge Acquisition and Extraction in One-Layer Transformers
Pith reviewed 2026-05-21 23:01 UTC · model grok-4.3
The pith
One-layer transformers store facts in structured attention patterns during pre-training that fine-tuning activates for extraction.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Under suitable regularity conditions, next-token pre-training in the one-layer transformer leads to near-optimal loss along with the development of structured attention patterns and relation-specific feature directions that encode factual knowledge. Fine-tuning on Q&A tasks converts the prompt format into a trigger for these features, enabling extraction of facts not revisited in the fine-tuning phase. Knowledge extraction follows a relation-covering characterization that depends on pre-training multiplicity and fine-tuning coverage of latent relation-template directions, improving with greater multiplicity and coverage but degrading as the relation-template universe expands. Insufficient coverage leaves facts stored but inaccessible.
What carries the argument
The relation-covering characterization of knowledge extraction, which determines when fine-tuning activates enough pre-trained relation directions to retrieve stored facts.
If this is right
- Fine-tuning can extract many pre-trained facts without including every specific subject-answer pair, provided the relation templates are adequately represented.
- Greater repetition of relations during pre-training strengthens the feature directions and improves post-fine-tuning extraction.
- Larger spaces of possible relations make it harder to achieve reliable extraction through fine-tuning.
- Low-rank fine-tuning recovers pre-trained factual knowledge when the relation coverage is sufficient.
- Insufficient relation coverage during fine-tuning leaves stored facts inaccessible, leading to potential hallucinations.
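The coverage trends in the list above can be illustrated with a toy counting model. This is a sketch under an assumption that is not taken from the paper: each fine-tuning sample is treated as hitting one relation template drawn uniformly at random from the universe, and a template counts as covered once it is hit at least once. The function name `expected_template_coverage` is hypothetical. The paper's own characterization is stated in terms of latent relation-template directions and may differ in detail.

```python
def expected_template_coverage(num_finetune_samples, num_templates):
    """Expected fraction of relation templates hit at least once.

    Toy assumption (not from the paper): every fine-tuning sample uses one
    template drawn uniformly at random from a universe of `num_templates`.
    """
    if num_templates <= 0:
        raise ValueError("num_templates must be positive")
    if num_finetune_samples < 0:
        raise ValueError("num_finetune_samples must be non-negative")
    # Probability that one fixed template is missed by every sample.
    miss_prob = (1.0 - 1.0 / num_templates) ** num_finetune_samples
    return 1.0 - miss_prob


if __name__ == "__main__":
    # Coverage rises with more fine-tuning samples and falls as the universe grows.
    for universe in (5, 50, 500):
        row = [expected_template_coverage(n, universe) for n in (10, 100, 1000)]
        print(universe, [round(value, 3) for value in row])
```

In this toy model, coverage increases monotonically with the number of fine-tuning samples and decreases as the template universe grows, which matches the qualitative direction of the claims above without reproducing the paper's bounds.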
Where Pith is reading between the lines
- This points to designing fine-tuning datasets that emphasize variety in relation types to maximize knowledge extraction from pre-training.
- The analysis may connect to understanding prompt engineering, where different formats act as triggers for different pre-trained capabilities.
- Extending the model to multiple layers could show how knowledge storage and triggering distribute across depth in larger architectures.
- Empirical checks on real models could test if attention patterns in early layers align with the predicted structures for common relations.
Load-bearing premise
The central claim depends on the assumption that the training process under next-token prediction will produce exactly the structured attention patterns and relation-specific feature directions when regularity conditions are met.
What would settle it
Train a one-layer transformer on a synthetic dataset of facts grouped by relations, measure the learned attention weights to see if they match the predicted structured patterns at near-optimal loss, and test extraction success rates against the predicted dependence on relation coverage during fine-tuning.
Original abstract
Large language models may encounter factual knowledge during pre-training yet fail to reliably use that knowledge after fine-tuning. Despite growing empirical evidence that MLP layers store factual associations and fine-tuning affects factual recall, the training-dynamics mechanisms linking next-token pre-training, knowledge storage, and post-fine-tuning extraction remain poorly understood. We study this problem in a stylized one-layer transformer with self-attention and MLP modules, trained by next-token prediction and subsequently fine-tuned on question-answering data. Under suitable regularity conditions, we first prove that the model reaches near-optimal pre-training loss while learning structured attention patterns and relation-specific feature directions, giving a mechanism for factual knowledge acquisition. We then show that fine-tuning can turn the Q&A prompt format into a trigger for pre-trained relation features, enabling the model to extract facts that are not revisited during fine-tuning. Our analysis yields a relation-covering characterization of knowledge extraction: fine-tuning need not revisit every stored subject-answer pair, but it must cover enough latent relation-template directions through which facts were encoded during pre-training. Consequently, extraction improves with pre-training multiplicity and fine-tuning coverage, but becomes harder as the relation-template universe grows. Conversely, insufficient coverage leads to a failure regime in which facts may be stored but remain inaccessible, providing a stylized mechanism for hallucination. The theory applies to both full and low-rank fine-tuning, offering insight into why low-rank adaptation can recover pre-trained factual knowledge when relation coverage is sufficient. Experiments on synthetic data and PopQA-based GPT-2/Llama models support the predicted trends.
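The architecture named in the abstract, a single self-attention module followed by an MLP and read out as next-token probabilities, can be sketched as below. This is a generic minimal forward pass, not the paper's exact parameterization, which this page does not give. The names `W_qk`, `W_mlp`, and `W_out`, the single attention head, the ReLU activation, and the choice to attend from the last position only are all assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def matvec(matrix, vec):
    return [dot(row, vec) for row in matrix]


def one_layer_forward(tokens, W_qk, W_mlp, W_out):
    """Return (attention weights, next-token probabilities).

    tokens: list of d-dimensional embeddings for the context.
    W_qk:   d x d combined query-key matrix (hypothetical name).
    W_mlp:  m x d first-layer MLP weights (hypothetical name).
    W_out:  V x m readout to vocabulary logits (hypothetical name).
    """
    query = tokens[-1]
    # Attention score of position t is x_t^T W_qk x_last.
    keyed_query = matvec(W_qk, query)
    alpha = softmax([dot(x, keyed_query) for x in tokens])
    dim = len(query)
    # Attention output: convex combination of the context embeddings.
    context = [sum(a * x[i] for a, x in zip(alpha, tokens)) for i in range(dim)]
    # MLP with ReLU activation (the activation choice is an assumption).
    hidden = [max(0.0, h) for h in matvec(W_mlp, context)]
    return alpha, softmax(matvec(W_out, hidden))
```

Setting `W_qk` to zero makes the attention weights uniform, which gives a simple attention-free baseline to compare a trained attention pattern against.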
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper analyzes knowledge acquisition and extraction in a stylized one-layer transformer trained via next-token prediction and then fine-tuned on Q&A tasks. Under suitable regularity conditions on the data distribution, initialization, and optimization, it claims to prove that pre-training achieves near-optimal loss while acquiring structured attention patterns and relation-specific feature directions for factual knowledge storage. Fine-tuning turns the Q&A format into a trigger for these pre-trained features, with extraction governed by a relation-covering characterization: extraction improves with pre-training multiplicity and fine-tuning coverage but degrades with larger relation-template universes. Insufficient coverage leads to a failure mode akin to hallucination. The theory extends to low-rank fine-tuning, and experiments on synthetic data plus PopQA with GPT-2/Llama models are said to support the trends.
Significance. If the regularity conditions can be made explicit and verified to hold for the data generators and fine-tuning regimes considered, the work would provide a rare mechanistic, provable link between pre-training dynamics, knowledge storage in attention/MLP modules, and post-fine-tuning extraction. The relation-covering characterization and its implications for LoRA and hallucination would be a substantive contribution to understanding factual recall in transformers.
major comments (2)
- [Abstract and main theoretical sections] Abstract and theoretical development: the central claims (near-optimal pre-training loss, acquisition of structured attention patterns, and relation-specific feature directions) are stated to hold only under unspecified 'suitable regularity conditions' on data distribution, initialization, and optimization trajectory. These conditions must be stated explicitly (e.g., as assumptions on the data generator or loss landscape) and shown to be satisfied by the synthetic setup and PopQA regime; otherwise the mechanism for knowledge acquisition and the subsequent relation-covering extraction result are not rigorously derived from the pre-training dynamics.
- [Theoretical results on extraction] Proofs of the relation-covering characterization and low-rank adaptation result: these appear to rest on the model having learned exactly the relation-template directions during pre-training. Without explicit derivations or verification that the regularity conditions are not circular with respect to the target behavior (i.e., they do not presuppose the separation into distinct directions), it is impossible to confirm that the extraction characterization follows as a consequence rather than an extrapolation.
minor comments (1)
- [Experiments] The synthetic data generator and the precise construction of relation-template directions should be described with sufficient detail (including any hyper-parameters) to allow independent verification of the predicted trends.
Simulated Author's Rebuttal
We thank the referee for the constructive comments that highlight opportunities to strengthen the rigor and clarity of our theoretical claims. We address each major comment below and will revise the manuscript to make the regularity conditions explicit while preserving the core contributions on knowledge acquisition and the relation-covering extraction mechanism.
Point-by-point responses
-
Referee: [Abstract and main theoretical sections] Abstract and theoretical development: the central claims (near-optimal pre-training loss, acquisition of structured attention patterns, and relation-specific feature directions) are stated to hold only under unspecified 'suitable regularity conditions' on data distribution, initialization, and optimization trajectory. These conditions must be stated explicitly (e.g., as assumptions on the data generator or loss landscape) and shown to be satisfied by the synthetic setup and PopQA regime; otherwise the mechanism for knowledge acquisition and the subsequent relation-covering extraction result are not rigorously derived from the pre-training dynamics.
Authors: We agree that the regularity conditions require explicit statement to ensure the claims are rigorously derived. In the revised manuscript we will add a new 'Assumptions' subsection that enumerates all conditions: (i) the data distribution has sufficient multiplicity of each relation template and bounded noise; (ii) initialization is random with norms controlled by a small constant; (iii) optimization reaches a neighborhood of a critical point where the loss is near-optimal under gradient flow. We will verify that these hold by construction for the synthetic data generator. For the PopQA experiments with GPT-2 and Llama we will clarify that the setup approximates the conditions and that the observed trends are consistent with the theory, while noting that real-world data may deviate; this distinction will be added to the experimental discussion.
Revision: yes
-
Referee: [Theoretical results on extraction] Proofs of the relation-covering characterization and low-rank adaptation result: these appear to rest on the model having learned exactly the relation-template directions during pre-training. Without explicit derivations or verification that the regularity conditions are not circular with respect to the target behavior (i.e., they do not presuppose the separation into distinct directions), it is impossible to confirm that the extraction characterization follows as a consequence rather than an extrapolation.
Authors: The proofs first derive the emergence of relation-specific feature directions and attention patterns solely from minimizing the next-token prediction loss under the stated regularity conditions, and only then analyze fine-tuning. To eliminate any appearance of circularity we will expand the appendix with additional lemmas that explicitly construct the feature directions from the pre-training objective alone, without reference to Q&A extraction or relation covering. These lemmas will show that distinct directions arise from the loss landscape properties and data multiplicity, after which the relation-covering characterization follows directly as a consequence for both full and low-rank fine-tuning. This will make the derivation chain transparent.
Revision: yes
Circularity Check
No significant circularity; derivation proceeds from explicit assumptions to consequences without reduction by construction.
The paper's central proofs are conditioned on explicitly invoked 'suitable regularity conditions' regarding data distribution, initialization, and optimization trajectory. These are standard external assumptions that enable showing near-optimal pre-training loss together with structured attention and relation-specific directions; they are not defined in terms of the target patterns themselves. The subsequent relation-covering characterization of extraction is derived as a consequence of how fine-tuning interacts with those pre-trained features, without any fitted parameter being relabeled as a prediction or any self-citation chain serving as the sole justification. The overall argument remains self-contained as a conditional theoretical analysis rather than tautological.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Suitable regularity conditions guarantee that next-token pre-training produces structured attention patterns and relation-specific feature directions.
invented entities (1)
- relation-template directions: no independent evidence
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, theorem washburn_uniqueness_aczel (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "Under suitable regularity conditions, the model reaches near-optimal pre-training loss while learning structured attention patterns and relation-specific feature directions"
- IndisputableMonolith/Foundation/RealityFromDistinction.lean, theorem reality_from_one_distinction (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "self-attention module learns to filter out the irrelevant contextual information, while the MLP module memorizes the filtered context"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1] Kwangjun Ahn, Xiang Cheng, Hadi Daneshmand, and Suvrit Sra. Transformers learn to implement preconditioned gradient descent for in-context learning. Advances in Neural Information Processing Systems, 36:45614–45650, 2023.
- [2] Zeyuan Allen-Zhu and Yuanzhi Li. Towards understanding ensemble, knowledge distillation and self-distillation in deep learning. arXiv preprint arXiv:2012.09816, 2020.
- [3] Zeyuan Allen-Zhu and Yuanzhi Li. Physics of language models: Part 3.1, knowledge storage and extraction. arXiv preprint arXiv:2309.14316, 2023.
- [4] Alberto Bietti, Vivien Cabannes, Diane Bouchacourt, Herve Jegou, and Leon Bottou. Birth of a transformer: A memory viewpoint. Advances in Neural Information Processing Systems, 36:1560–1588, 2023.
- [5] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877–1901, 2020.
- [6] Vivien Cabannes, Elvis Dohmatob, and Alberto Bietti. Scaling laws for associative memories. arXiv preprint arXiv:2310.02984, 2023.
- [7] Vivien Cabannes, Berfin Simsek, and Alberto Bietti. Learning associative memories with gradient descent. arXiv preprint arXiv:2402.18724, 2024.
- [8] Yuan Cao, Zixiang Chen, Misha Belkin, and Quanquan Gu. Benign overfitting in two-layer convolutional neural networks. Advances in Neural Information Processing Systems, 35:25237–25250, 2022.
- [9] Zixiang Chen, Yihe Deng, Yue Wu, Quanquan Gu, and Yuanzhi Li. Towards understanding the mixture-of-experts layer in deep learning. Advances in Neural Information Processing Systems, 2022.
- [10] Yihe Dong, Lorenzo Noci, Mikhail Khodak, and Mufan Li. Attention retrieves, mlp memorizes: Disentangling trainable components in the transformer. arXiv preprint arXiv:2506.01115, 2025.
- [11] Mor Geva, Roei Schuster, Jonathan Berant, and Omer Levy. Transformer feed-forward layers are key-value memories. arXiv preprint arXiv:2012.14913, 2020.
- [12] Gaurav Ghosal, Tatsunori Hashimoto, and Aditi Raghunathan. Understanding finetuning for factual knowledge extraction. arXiv preprint arXiv:2406.14785, 2024.
- [13] Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024.
- [14] Yu Huang, Yuan Cheng, and Yingbin Liang. In-context convergence of transformers. arXiv preprint arXiv:2310.05249, 2023.
- [15] Samy Jelassi, Michael Sander, and Yuanzhi Li. Vision transformers provably learn spatial structure. Advances in Neural Information Processing Systems, 2022.
- [16] Jiarui Jiang, Wei Huang, Miao Zhang, Taiji Suzuki, and Liqiang Nie. Unveil benign overfitting for transformer in vision: Training dynamics, convergence, and generalization. Advances in Neural Information Processing Systems, 2024.
- [17] Tokio Kajitsuka and Issei Sato. Optimal memorization capacity of transformers. In International Conference on Learning Representations, 2025.
- [18] Nikhil Kandpal, Haikang Deng, Adam Roberts, Eric Wallace, and Colin Raffel. Large language models struggle to learn long-tail knowledge. In International Conference on Machine Learning, 2023.
- [19] Junghwan Kim, Michelle Kim, and Barzan Mozafari. Provable memorization capacity of transformers. In International Conference on Learning Representations, 2023.
- [20] Yiwen Kou, Zixiang Chen, Yuanzhou Chen, and Quanquan Gu. Benign overfitting in two-layer relu convolutional neural networks. In International Conference on Machine Learning, 2023.
- [21] Katherine Lee, Daphne Ippolito, Andrew Nystrom, Chiyuan Zhang, Douglas Eck, Chris Callison-Burch, and Nicholas Carlini. Deduplicating training data makes language models better. arXiv preprint arXiv:2107.06499, 2021.
- [22] Liam Madden, Curtis Fox, and Christos Thrampoulidis. Next-token prediction capacity: general upper bounds and a lower bound for transformers. arXiv preprint arXiv:2405.13718, 2024.
- [23] Arvind Mahankali, Tatsunori B Hashimoto, and Tengyu Ma. One step of gradient descent is provably the optimal in-context learner with one layer of linear self-attention. arXiv preprint arXiv:2307.03576, 2023.
- [24] Sadegh Mahdavi, Renjie Liao, and Christos Thrampoulidis. Memorization capacity of multi-head attention in transformers. arXiv preprint arXiv:2306.02010, 2023.
- [25] Alex Mallen, Akari Asai, Victor Zhong, Rajarshi Das, Daniel Khashabi, and Hannaneh Hajishirzi. When not to trust language models: Investigating effectiveness of parametric and non-parametric memories. arXiv preprint arXiv:2212.10511, 2022.
- [26] Kevin Meng, David Bau, Alex Andonian, and Yonatan Belinkov. Locating and editing factual associations in gpt. Advances in Neural Information Processing Systems, 2022.
- [27] Eshaan Nichani, Alex Damian, and Jason D Lee. How transformers learn causal structure with gradient descent. arXiv preprint arXiv:2402.14735, 2024.
- [28] Eshaan Nichani, Jason D Lee, and Alberto Bietti. Understanding factual recall in transformers via associative memories. arXiv preprint arXiv:2412.06538, 2024.
- [29] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019.
- [30] Ruoqi Shen, Sébastien Bubeck, and Suriya Gunasekar. Data augmentation as feature manipulation. In International Conference on Machine Learning, 2022.
- [31] Yuandong Tian, Yiping Wang, Beidi Chen, and Simon S Du. Scan and snap: Understanding training dynamics and token composition in 1-layer transformer. Advances in Neural Information Processing Systems, 36:71911–71947, 2023.
- [32] Yuandong Tian, Yiping Wang, Zhenyu Zhang, Beidi Chen, and Simon Du. Joma: Demystifying multilayer transformers via joint dynamics of mlp and attention. arXiv preprint arXiv:2310.00535, 2023.
- [33] Ruichen Xu and Kexin Chen. Rethinking benign overfitting in two-layer neural networks. In International Conference on Machine Learning, 2025.
- [34] Yunzhi Yao, Ningyu Zhang, Zekun Xi, Mengru Wang, Ziwen Xu, Shumin Deng, and Huajun Chen. Knowledge circuits in pretrained transformers. In Advances in Neural Information Processing Systems, 2024.
- [35] Chulhee Yun, Srinadh Bhojanapalli, Ankit Singh Rawat, Sashank J Reddi, and Sanjiv Kumar. Are transformers universal approximators of sequence-to-sequence functions? In International Conference on Learning Representations, 2020.
- [36] Ruiqi Zhang, Spencer Frei, and Peter L Bartlett. Trained transformers learn linear models in-context. Journal of Machine Learning Research, 25(49):1–55, 2024.
- [37] Hanlin Zhu, Baihe Huang, Shaolun Zhang, Michael Jordan, Jiantao Jiao, Yuandong Tian, and Stuart J Russell. Towards a theoretical understanding of the 'reversal curse' via training dynamics. Advances in Neural Information Processing Systems, 2024.
- [38] Difan Zou, Yuan Cao, Yuanzhi Li, and Quanquan Gu. The benefits of mixup for feature learning. In International Conference on Machine Learning, 2023.
Excerpt: synthetic data construction
- Sample a subset $R_s \subseteq R$ of $K$ relation phrase embeddings uniformly at random without replacement.
- For each $i \in R_s$, construct $K$ sentence tuples $(s_j, r_i, a_j)$.
Based on the sentence set $T$, we construct the pre-training dataset as follows. For each $(s_j, r_i, a_j) \in T$:
- Convert each tuple $(s_j, r_i, a_j)$ to NTP pre-training samples, $(s_j, I(r_i))$ and $([s_j\ r_i], I(a_j))$.
We construct the fine-tuning dataset and the test dataset as follows. For each $(s_j, r_i, a_j) \in T$:
- Randomly select half of the tuples from $T$ and convert each of them to a fine-tuning sample, $([s_j\ p], I(a_j))$.
- Convert each of the remaining tuples to a test sample, $([s_j\ p], I(a_j))$.
We assume all token embeddings $s_j$, $r_i$, $a_j$, for all $j \in [N]$ and $i \in R$, are generated from random Gaussian distributions $\mathcal{N}(0, \sigma^2 I)$. We set the parameters as $N = 1000$, $K = 5$, $|R| = 5$, $\sigma = 1$, $d = 50$. We pre-train the transformers ('standard' and 'Uniform attention + MLP') with AdamW op...
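The construction steps above can be sketched as a small generator. This is an illustrative reading of a truncated excerpt, not the authors' code. Several choices are assumptions: the function name `build_datasets`, the use of `(kind, index)` keys as a stand-in for the index map $I(\cdot)$, a Gaussian embedding for the prompt token $p$, one relation subset per subject, and splitting fine-tuning and test data by subject rather than by tuple so that no test subject appears in fine-tuning.

```python
import random


def build_datasets(N=1000, K=5, num_relations=5, d=50, sigma=1.0, seed=0):
    """Return (embeddings, pretrain, finetune, test) for the synthetic setup."""
    if K > num_relations:
        raise ValueError("K cannot exceed the number of relations")
    rng = random.Random(seed)

    def gaussian_vector():
        # One embedding drawn from N(0, sigma^2 I).
        return [rng.gauss(0.0, sigma) for _ in range(d)]

    embeddings = {}
    for j in range(N):
        embeddings[("s", j)] = gaussian_vector()  # subject s_j
        embeddings[("a", j)] = gaussian_vector()  # answer a_j
    for i in range(num_relations):
        embeddings[("r", i)] = gaussian_vector()  # relation phrase r_i
    embeddings[("p",)] = gaussian_vector()  # prompt token p (distribution assumed)

    # Sentence set T: K relations per subject, sampled without replacement.
    sentence_set = []
    for j in range(N):
        for i in rng.sample(range(num_relations), K):
            sentence_set.append((j, i))

    # Next-token-prediction samples: (context tokens, target token).
    pretrain = []
    for j, i in sentence_set:
        pretrain.append(([("s", j)], ("r", i)))
        pretrain.append(([("s", j), ("r", i)], ("a", j)))

    # Assumption: split by subject so test subjects are unseen in fine-tuning.
    subjects = list(range(N))
    rng.shuffle(subjects)
    half = N // 2
    finetune = [([("s", j), ("p",)], ("a", j)) for j in subjects[:half]]
    test = [([("s", j), ("p",)], ("a", j)) for j in subjects[half:]]
    return embeddings, pretrain, finetune, test
```

The defaults mirror the parameter values quoted in the excerpt ($N = 1000$, $K = 5$, $|R| = 5$, $\sigma = 1$, $d = 50$). Calling `build_datasets(N=10, K=2, num_relations=3, d=4)` gives a small instance that is easy to inspect by hand.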
Excerpt: related work
... demonstrated that transformers are universal approximators. Kim et al. [19] proved that transformers can memorize sequence mappings of length-$n$, $d$-dimensional inputs with $\tilde{O}(d + n + \sqrt{nN})$ parameters. Later, Kajitsuka and Sato [17] proved lower and upper bounds on the memorization capacity of transformers in next-token prediction and sequence-to-sequenc...
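Excerpt: Lemma 10 (end of Stage 1)
- We have $\frac{1}{m}\sum_{k=1}^{m}\langle w^{(T_1)}_{I(d),k},\, 2o\rangle = \Omega(\log(d)/\lambda)$.
- For all $j \in [N]$ and $i \in B_j$, we have $1 - \mathrm{logit}^{(T_1)}_{I(a_j)}([o\ s_j\ r_i]) = \Theta(1)$ and $1 - K_{s(j)}\,\mathrm{logit}^{(T_1)}_{I(r_i)}([o\ s_j]) = \Theta(1)$.
- For all $j \in [N]$ and $i \in B_j$, we have $0.8 \le \alpha^{(T_1)}_2([o\ s_j\ r_i]) / \alpha^{(T_1)}_3([o\ s_j\ r_i]) \le 1.2$.
- At the end of Stage 1, for all $j \in [N]$, $i \in [K]$, we have
  \[ \alpha^{(T_1)}_1([o\ s_j]) \le \frac{1}{2d} \quad\text{and}\quad \alpha^{(T_1)}_1([o\ s_j\ r_i]) \le \frac{1}{2d}. \tag{70} \]
Proof. We prove the four statements of Lemma 10 one by one. First, during the first iteration, we have
\[ \frac{1}{m}\sum_{k=1}^{m}\langle w^{(1)}_{I(d),k},\, 2o\rangle - \frac{1}{m}\sum_{k=1}^{m}\langle w^{(0)}_{I(d),k},\, 2o\rangle \overset{(a)}{=} \Theta(1)\cdot(\ldots)\sum_{i=1}^{n}\bigl(1 - \mathrm{logit}^{(0)}_{I(d)}(2\lambda o)\bigr)\,\sigma'(\ldots \]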
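Excerpt: proof fragment for $t \le T_1$
... for $t \le T_1$, where (b) holds because $1 - \mathrm{logit}^{(t)}_{I(a_j)}([o\ s_j\ r_i]) \le 1$ and $\|o + s_j + 2r_i\|_2^2 = O(d)$. After 2 iterations, we have
\[ \frac{1}{m}\sum_{k=1}^{m}\langle w^{(T_1)}_{I(a_j),k},\, \Xi^{(T_1)}([o\ s_j\ r_i])\rangle = O\!\left(\frac{\lambda\eta_1 d K_{s(j)}}{nm}\right). \tag{80} \]
Then, as $\frac{1}{m}\sum_{k=1}^{m}\langle w^{(T_1)}_{I(d),k},\, 2o\rangle = \Theta(\lambda\eta_1 d/m)$ and $\frac{1}{m}\sum_{k=1}^{m}\langle w^{(T_1)}_{I(r_i),k},\, \Xi^{(T_1)}([o\ s_j\ r_i])\rangle = \Theta\bigl(\frac{1}{m}\sum_{k=1}^{m}\langle w^{(T_1)}_{I(d),k},\, \Xi^{(T_1)}([o\ s_j\ r_i])\rangle / n\bigr)$, we have 1 − logit(T...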
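Excerpt: statements for $T_1 \le t \le T_2$
- $1 - \mathrm{logit}^{(t)}_{I(a_j)}([o\ s_j\ r_i]) = \Theta(1)$.
- $1 - K_{s(j)}\,\mathrm{logit}^{(t)}_{I(r_i)}([o\ s_j]) = \Theta(1)$.
- $0.7 \le \alpha^{(t)}_2([o\ s_j\ r_i]) / \alpha^{(t)}_3([o\ s_j\ r_i]) \le 1.6$.
- For all $j \in [N]$, $i \in [K]$ and $T_1 \le t \le T_2$, we have
  \[ \alpha^{(t)}_1([o\ s_j]) \le \frac{2}{3d} \quad\text{and}\quad \alpha^{(t)}_1([o\ s_j\ r_i]) \le \frac{2}{3d}. \tag{96} \]
- For $T_1 \le t \le T_2$, we have $\frac{1}{m}\sum_{k=1}^{m}\langle w^{(T_1)}_{I(d),k},\, 2o\rangle = \Omega(\log(d)/\lambda)$.
Proof. First, we prove the first and the second statements. For any $j \in [N]$ and $i \in B_j$, the model updates satisfy
\[ \frac{1}{m}\sum_{k=1}^{m}\langle w^{(t+1)}_{I(r_i),k},\, \Xi^{(t+1)}([o\ s_j])\rangle - \frac{1}{m}\sum_{k=1}^{m}\langle w^{(t)}_{I(r_i),k},\, \Xi^{(t)}([o\ s_j])\rangle \overset{(a)}{=} \Theta(1)\cdot\frac{\lambda\eta_2}{nm}\bigl(1 - K\,\mathrm{logit}^{(t)}_{I(r_i)}([o\ s_j])\bigr) \overset{(b)}{=} O\!\left(\frac{\lambda\eta_2}{nm}\right), \tag{97} \]
and $\frac{1}{m}\sum_{k=\ldots}$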
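Excerpt: Lemma 15 and the preceding proof fragment
... $+\ \tilde{O}(m) + \tilde{O}((N + |R|)^2/\lambda^2)$. (168)
Setting $T_p - T_2 = \tilde{\Theta}\bigl((m d^2 \sigma_0^2 + m + N^2/\lambda^2 + |R|^2/\lambda^2)/\eta_3\bigr)$ completes the proof.
G.5 Key properties
Lemma 15. With probability $1 - \delta$, for all $j \in [N]$, the following holds:
\[ \frac{1}{m}\sum_{k=1}^{m}\sigma\bigl(\langle w^{(T_p)}_{I(a_j),k},\, s_j\rangle\bigr) = \Omega(\log(d)/\lambda). \tag{169} \]
Proof. With probability $1 - \delta$, at initialization, there are at least $0.4m$ neurons activated ...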
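The opening step of the Lemma 15 proof excerpt, that at least $0.4m$ neurons are active at initialization with high probability, can be checked numerically under a simple assumption. The sketch below assumes each neuron weight is drawn i.i.d. from a zero-mean Gaussian with scale `sigma0` and that "activated" means a positive pre-activation $\langle w_k, s_j\rangle > 0$; neither detail is confirmed by the truncated excerpt. Under that assumption each neuron is active with probability 1/2, so Hoeffding's inequality bounds the probability that the active fraction falls below 0.4 by $\exp(-2m \cdot 0.1^2)$ for one fixed subject. The function names are hypothetical.

```python
import math
import random


def active_fraction_at_init(m, d, sigma0=0.1, seed=0):
    """Fraction of m neurons with positive pre-activation on one subject.

    Assumes w_k ~ N(0, sigma0^2 I) at initialization (not confirmed by the
    excerpt) and a fixed Gaussian subject embedding s.
    """
    rng = random.Random(seed)
    subject = [rng.gauss(0.0, 1.0) for _ in range(d)]
    active = 0
    for _ in range(m):
        weights = [rng.gauss(0.0, sigma0) for _ in range(d)]
        if sum(w * s for w, s in zip(weights, subject)) > 0.0:
            active += 1
    return active / m


def hoeffding_failure_bound(m, margin=0.1):
    """Upper bound on P(active fraction < 0.5 - margin) for one subject."""
    return math.exp(-2.0 * m * margin ** 2)
```

A statement that holds for all $j \in [N]$ simultaneously would additionally need a union bound over the $N$ subjects, which multiplies the single-subject failure bound by $N$.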