pith. machine review for the scientific record.

arxiv: 2605.12908 · v1 · submitted 2026-05-13 · 📊 stat.ML · cs.LG

Recognition: no theorem link

The Mechanism of Weak-to-Strong Generalization: Feature Elicitation from Latent Knowledge

Authors on Pith: no claims yet

Pith reviewed 2026-05-14 19:02 UTC · model grok-4.3

classification 📊 stat.ML cs.LG
keywords weak-to-strong generalization · feature elicitation · two-layer neural networks · reward model learning · catastrophic forgetting · latent knowledge · feature learning

The pith

A strong neural network learns a target task from weak-model outputs by eliciting its own pre-trained feature direction rather than overwriting it.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proves that a strong two-layer network, when fine-tuned on the outputs of a weaker model specialized for task κ, recovers the required feature direction from its own pre-trained low-dimensional subspaces. This lets the strong model perform the target task while keeping its other capabilities intact, and avoids the catastrophic forgetting that ordinary supervised fine-tuning causes whenever off-target feature directions overlap with the target. The result is shown for reward-model learning and establishes weak-to-strong generalization in the feature-learning regime, where the target direction is not assumed to be present from the start but is instead recovered through training.

Core claim

In the setting of reward-model learning with two-layer neural networks, a strong model whose pre-trained representations lie in low-dimensional subspaces V_k acquires the target feature direction for task κ through multi-step SGD under weak-model supervision. It thereby learns the task while retaining general capabilities and preserving off-target features, even when those features are correlated with the target.
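
As a toy illustration of this mechanism (not the paper's construction — the dimensions, weak supervisor, and readout below are all invented for the sketch), a two-layer network whose first-layer weights start inside a low-dimensional subspace containing the target direction can be trained on noisy weak-model labels, tracking the per-neuron alignment |θ^⊤ w_n| that Figure 1 plots:

```python
import numpy as np

rng = np.random.default_rng(0)
d, s, n_neurons, batch, steps, lr = 128, 32, 32, 32, 4000, 0.2

# Target feature direction theta lies in a low-dimensional subspace
# (here: the first s coordinates), standing in for the paper's V_k.
theta = np.zeros(d)
theta[:s] = rng.standard_normal(s)
theta /= np.linalg.norm(theta)

# "Pre-trained" first-layer weights initialized inside the subspace:
# the target direction is latent but not yet aligned with any neuron.
W = np.zeros((n_neurons, d))
W[:, :s] = rng.standard_normal((n_neurons, s)) / np.sqrt(s)

def alignment(W):
    # Per-neuron |cos| between weight vector and the target direction.
    return np.abs(W @ theta) / np.linalg.norm(W, axis=1)

align_init = alignment(W).mean()

for _ in range(steps):
    x = rng.standard_normal((batch, d))
    # Weak supervisor: a noisy probe of the target feature (a crude
    # stand-in for the specialized weak model's outputs).
    y = np.tanh(x @ theta) + 0.3 * rng.standard_normal(batch)
    pre = x @ W.T                      # (batch, n_neurons)
    out = np.tanh(pre).mean(axis=1)    # fixed mean readout
    err = out - y
    # SGD on the first-layer weights through the tanh nonlinearity.
    grad = ((err[:, None] * (1.0 - np.tanh(pre) ** 2)).T @ x) / (batch * n_neurons)
    W -= lr * grad

align_final = alignment(W).mean()
print(align_init, align_final)  # mean alignment with theta should grow
```

Under this setup the neurons' mean alignment with θ rises well above its initialization scale, the qualitative behavior the paper's Figure 1 (top panel) reports for W2S training.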

What carries the argument

Low-dimensional subspaces V_k organizing the strong model's pre-trained representations, which weak-to-strong training uses to elicit the target feature direction for task κ.

If this is right

  • The strong model acquires the target feature direction through W2S training rather than receiving it a priori.
  • W2S training preserves pre-trained off-target features even when they correlate with the target direction.
  • Standard supervised fine-tuning produces catastrophic forgetting of correlated off-target features.
  • W2S generalization holds in the feature-learning regime for two-layer networks under reward-model learning.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • If deeper networks maintain comparable subspace organization, the same elicitation mechanism could scale beyond two-layer models.
  • Alignment procedures that rely on weak supervisors may reduce capability loss by eliciting rather than overwriting latent features.
  • Experiments that deliberately entangle feature directions would test whether the low-dimensional subspace assumption is necessary for the observed preservation effect.

Load-bearing premise

The strong model's pre-trained representations are organized into distinct low-dimensional subspaces separating target and off-target features.

What would settle it

A simulation in which the strong model's representations lack low-dimensional subspace structure would show either failure to acquire the target feature or loss of off-target capabilities under the same weak-to-strong training.
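
A minimal numeric version of the contrast (all dimensions invented for illustration): initializing neurons inside a low-dimensional subspace containing the target direction gives each an initial overlap of order s^{-1/2}, versus d^{-1/2} for isotropic initialization with no subspace structure — the "head start" the premise supplies before any training:

```python
import numpy as np

rng = np.random.default_rng(1)
d, s, n = 4096, 64, 2000

# Unit target direction inside the subspace spanned by the first s coords.
theta = np.zeros(d)
theta[:s] = rng.standard_normal(s)
theta /= np.linalg.norm(theta)

# Neurons initialized inside the subspace vs. isotropically in all of R^d.
W_sub = np.zeros((n, d))
W_sub[:, :s] = rng.standard_normal((n, s))
W_iso = rng.standard_normal((n, d))

def mean_cos(W):
    # Mean |cos| between each neuron and the target direction.
    return (np.abs(W @ theta) / np.linalg.norm(W, axis=1)).mean()

sub, iso = mean_cos(W_sub), mean_cos(W_iso)
print(sub, iso)  # roughly s**-0.5 vs d**-0.5 scales
```

Without the subspace structure the initial signal shrinks by a factor of about (d/s)^{1/2}, which is one concrete way the proposed simulation could exhibit failure to acquire the target feature.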

Figures

Figures reproduced from arXiv: 2605.12908 by Ryoya Awano, Taiji Suzuki.

Figure 1
Figure 1. Per-neuron alignment magnitude |θ_k^⊤ w_{k,n}^t| during training (d = 1024, s = 128, K = 2, σ*_k = He_4, θ_1^⊤ θ_2 = 0.3). Line colors distinguish neuron types by the signs of α̃_2 β_{k,2} and α̃_4 β_{k,4} (W2S) or α̃_4 β_{k,4} alone (SFT); neurons with initial alignment magnitude |θ_k^⊤ w_{k,n}^0| < s^{-1/2} are shown semi-transparent. Top (W2S, η = 0.2, T = 10000): absolute alignment |θ_1^⊤ w_{1,n}^t| with the target feature (le…
Original abstract

Weak-to-strong (W2S) generalization, in which a strong model is fine-tuned on outputs of a weaker, task-specialized model, has been proposed as an approach to aligning superhuman AI systems. Existing theoretical analyses either fix the student's representations or operate in restricted settings. Whether multi-step SGD can succeed in feature learning while preserving diverse pre-trained capabilities remains open. We study W2S in the setting of reward-model learning with two-layer neural networks. The strong model has pre-trained representations organized into low-dimensional subspaces $V_k$, and is fine-tuned under the supervision of a weak model specialized on task $\kappa$. We prove that the strong model efficiently learns task $\kappa$, eliciting its pre-trained knowledge while retaining general capabilities. This establishes W2S generalization in the feature-learning regime, in the sense that the strong model acquires the target feature direction through W2S training, rather than having it given a priori. Moreover, W2S preserves pre-trained off-target features, whereas standard supervised fine-tuning causes catastrophic forgetting when off-target feature directions are correlated with the target's. Numerical experiments on synthetic data confirm our theoretical results.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 3 minor

Summary. The manuscript claims to prove that, in reward-model learning with two-layer neural networks, a strong model whose pre-trained representations are organized into low-dimensional subspaces V_k can be fine-tuned via multi-step SGD under supervision from a weak model specialized on task κ. The strong model thereby acquires the target feature direction from its latent knowledge (establishing W2S generalization in the feature-learning regime) while preserving off-target directions; standard supervised fine-tuning, by contrast, produces catastrophic forgetting when off-target directions are correlated with the target. Synthetic experiments are said to confirm the theoretical predictions.

Significance. If the central derivation holds, the work supplies a concrete, parameter-free mechanism explaining how W2S training elicits task-relevant features from pre-organized subspaces without a priori provision of the target direction, while simultaneously protecting general capabilities—an issue left open by prior analyses that either fix representations or restrict the setting. The explicit contrast with catastrophic forgetting under standard SFT is a clear strength, and the restriction to two-layer networks and reward-model learning is stated up front, making the scope transparent. The result therefore offers a useful foundation for understanding alignment of superhuman models, provided the two-layer analysis can be lifted.

minor comments (3)
  1. [Abstract] The abstract and introduction should state the two-layer and reward-model assumptions more explicitly at the outset, as these delimit the entire analysis.
  2. [Experiments] The synthetic experiments are referenced as confirmation but lack sufficient detail on data generation, exact subspace construction, and quantitative metrics (e.g., cosine similarity to target direction or off-target retention); adding these would strengthen reproducibility.
  3. [Setup] Notation for the subspaces V_k and the specialization of the weak model on κ could be accompanied by a small illustrative diagram in the setup section to aid readers unfamiliar with the geometric picture.
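
The quantitative metrics suggested in comment 2 could be computed along the following lines (the function name and toy directions here are hypothetical, not from the paper):

```python
import numpy as np

def alignment_metrics(W, theta_target, theta_off):
    """Per-neuron |cosine similarity| to the target direction, and
    retention of an off-target direction measured the same way."""
    norms = np.linalg.norm(W, axis=1)
    return {
        "target_cos": np.abs(W @ theta_target) / norms,
        "off_target_cos": np.abs(W @ theta_off) / norms,
    }

# Toy check: neurons planted along the off-target direction should
# report near-perfect off-target retention and near-zero target cosine.
rng = np.random.default_rng(2)
d = 16
t_target, t_off = np.eye(d)[0], np.eye(d)[1]
W = np.outer(np.ones(4), t_off) + 0.01 * rng.standard_normal((4, d))
m = alignment_metrics(W, t_target, t_off)
print(m["off_target_cos"].min(), m["target_cos"].max())
```

Reporting these two curves over training for both W2S and SFT would make the preservation claim directly checkable against the figures.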

Simulated Authors' Rebuttal

0 responses · 1 unresolved

We thank the referee for their positive assessment of the manuscript and for recommending minor revision. The referee's summary correctly captures our central claims regarding weak-to-strong generalization in the feature-learning regime for two-layer networks. We address the report below.

standing simulated objections not resolved
  • Extending the two-layer analysis to deeper networks or general architectures, as the proofs rely on the specific low-dimensional subspace structure and update dynamics available only in the two-layer setting.

Circularity Check

0 steps flagged

Derivation is self-contained with no circular reductions

full rationale

The paper's central result is a proof that multi-step SGD on a two-layer network under weak supervision elicits the target feature direction from explicitly assumed pre-organized low-dimensional subspaces V_k while preserving off-target directions. The derivation begins from the stated model architecture, weak-model specialization on task κ, and reward-model learning dynamics, then produces the feature-acquisition guarantee directly from those inputs. No step reduces by the paper's own equations to a fitted quantity renamed as prediction, no self-citation chain is load-bearing for the uniqueness or existence claim, and the argument remains scoped to the given assumptions without self-definition or ansatz smuggling. This is the normal case of an internally consistent theoretical derivation.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the modeling choice that pre-trained knowledge lives in low-dimensional subspaces V_k and that the weak supervisor is task-specialized; these are introduced to enable the feature-elicitation analysis and are not derived from more basic principles within the paper.

axioms (2)
  • domain assumption Pre-trained representations of the strong model are organized into low-dimensional subspaces V_k
    This structures the initial knowledge so that target and off-target features can be analyzed separately during fine-tuning.
  • domain assumption The weak model is specialized on task κ
    Defines the form of the supervision signal used to drive the strong model's updates.

pith-pipeline@v0.9.0 · 5501 in / 1456 out tokens · 86419 ms · 2026-05-14T19:02:34.366782+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

62 extracted references · 7 canonical work pages · 3 internal anchors

  1. [1]

    SGD learning on neural networks: leap complexity and saddle-to-saddle dynamics

    Emmanuel Abbe, Enric Boix Adsera, and Theodor Misiakiewicz. SGD learning on neural networks: leap complexity and saddle-to-saddle dynamics. In Conference on Learning Theory (COLT), volume 195, pages 2552--2623. PMLR, 2023

  2. [2]

    GPT-4 Technical Report

    Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. GPT -4 technical report, 2023. arXiv:2303.08774

  3. [3]

    Repetita iuvant: Data repetition allows SGD to learn high-dimensional multi-index functions

    Luca Arnaboldi, Yatin Dandi, Florent Krzakala, Luca Pesce, and Ludovic Stephan. Repetita iuvant: Data repetition allows SGD to learn high-dimensional multi-index functions. In High-dimensional Learning Dynamics 2024: The Emergence of Structure and Reasoning, 2024

  4. [4]

    A latent variable model approach to PMI-based word embeddings

    Sanjeev Arora, Yuanzhi Li, Yingyu Liang, Tengyu Ma, and Andrej Risteski. A latent variable model approach to PMI-based word embeddings. Transactions of the Association for Computational Linguistics, 4:385--399, 2016

  5. [5]

    High-dimensional asymptotics of feature learning: How one gradient step improves the representation

    Jimmy Ba, Murat A Erdogdu, Taiji Suzuki, Zhichao Wang, Denny Wu, and Greg Yang. High-dimensional asymptotics of feature learning: How one gradient step improves the representation. In Advances in Neural Information Processing Systems (NeurIPS), volume 35, pages 37932--37946, 2022

  6. [6]

    Learning in the presence of low-dimensional structure: A spiked random matrix perspective

    Jimmy Ba, Murat A Erdogdu, Taiji Suzuki, Zhichao Wang, and Denny Wu. Learning in the presence of low-dimensional structure: A spiked random matrix perspective. In Advances in Neural Information Processing Systems (NeurIPS), 2023

  7. [7]

    Toward universal steering and monitoring of AI models

    Daniel Beaglehole, Adityanarayanan Radhakrishnan, Enric Boix-Adserà, and Mikhail Belkin. Toward universal steering and monitoring of AI models. Science, 391(6787):787--792, 2026

  8. [8]

    Online stochastic gradient descent on non-convex losses from high-dimensional inference

    Gérard Ben Arous, Reza Gheissari, and Aukosh Jagannath. Online stochastic gradient descent on non-convex losses from high-dimensional inference. Journal of Machine Learning Research, 22(106):1--51, 2021

  9. [9]

    Learning quadratic neural networks in high dimensions: SGD dynamics and scaling laws

    Gérard Ben Arous, Murat A Erdogdu, Nuri Mert Vural, and Denny Wu. Learning quadratic neural networks in high dimensions: SGD dynamics and scaling laws. In Advances in Neural Information Processing Systems (NeurIPS), 2025

  10. [10]

    Learning time-scales in two-layers neural networks

    Rapha \"e l Berthier, Andrea Montanari, and Kangjie Zhou. Learning time-scales in two-layers neural networks. Foundations of Computational Mathematics, pages 1--84, 2024

  11. [11]

    Learning single-index models with shallow neural networks

    Alberto Bietti, Joan Bruna, Clayton Sanford, and Min Jae Song. Learning single-index models with shallow neural networks. In Advances in Neural Information Processing Systems (NeurIPS), 2022

  12. [12]

    On using extended statistical queries to avoid membership queries

    Nader H Bshouty and Vitaly Feldman. On using extended statistical queries to avoid membership queries. Journal of Machine Learning Research, 2(Feb):359--395, 2002

  13. [13]

    Weak-to-strong generalization: Eliciting strong capabilities with weak supervision

    Collin Burns, Pavel Izmailov, Jan Hendrik Kirchner, Bowen Baker, Leo Gao, Leopold Aschenbrenner, Yining Chen, Adrien Ecoffet, Manas Joglekar, Jan Leike, Ilya Sutskever, and Jeffrey Wu. Weak-to-Strong Generalization: Eliciting strong capabilities with weak supervision, 2023. arXiv:2312.09390

  14. [14]

    Chernoff-type bounds for the Gaussian error function

    Seok-Ho Chang, Pamela C Cosman, and Laurence B Milstein. Chernoff-type bounds for the Gaussian error function. IEEE Transactions on Communications, 59(11):2939--2944, 2011

  15. [15]

    Quantifying the gain in Weak-to-Strong Generalization

    Moses Charikar, Chirag Pabbaraju, and Kirankumar Shiragur. Quantifying the gain in Weak-to-Strong Generalization. In Advances in Neural Information Processing Systems (NeurIPS), volume 37, pages 126474--126499, 2024

  16. [16]

    Learning polynomials in few relevant dimensions

    Sitan Chen and Raghu Meka. Learning polynomials in few relevant dimensions. In Conference on Learning Theory (COLT), volume 125, pages 1161--1227. PMLR, 2020

  17. [17]

    Knowledge neurons in pretrained transformers

    Damai Dai, Li Dong, Yaru Hao, Zhifang Sui, Baobao Chang, and Furu Wei. Knowledge neurons in pretrained transformers. In Annual Meeting of the Association for Computational Linguistics (ACL), 2022

  18. [18]

    Smoothing the landscape boosts the signal for SGD: Optimal sample complexity for learning single index models

    Alex Damian, Eshaan Nichani, Rong Ge, and Jason D Lee. Smoothing the landscape boosts the signal for SGD: Optimal sample complexity for learning single index models. In Advances in Neural Information Processing Systems (NeurIPS), volume 36, 2024a

  19. [19]

    Computational-statistical gaps in Gaussian single-index models (extended abstract)

    Alex Damian, Loucas Pillaud-Vivien, Jason Lee, and Joan Bruna. Computational-statistical gaps in Gaussian single-index models (extended abstract). In Conference on Learning Theory (COLT), volume 247 of Proceedings of Machine Learning Research, pages 1262--1262, 30 Jun--03 Jul 2024b. Full version available at arXiv:2403.05529

  20. [20]

    Neural networks can learn representations with gradient descent

    Alexandru Damian, Jason D. Lee, and Mahdi Soltanolkotabi. Neural networks can learn representations with gradient descent. In Conference on Learning Theory (COLT), volume 178, pages 5413--5452. PMLR, 2022

  21. [21]

    The benefits of reusing batches for gradient descent in two-layer networks: Breaking the curse of information and leap exponents

    Yatin Dandi, Emanuele Troiani, Luca Arnaboldi, Luca Pesce, Lenka Zdeborová, and Florent Krzakala. The benefits of reusing batches for gradient descent in two-layer networks: Breaking the curse of information and leap exponents. In International Conference on Machine Learning (ICML), 2024

  22. [22]

    Discrepancies are virtue: Weak-to-Strong Generalization through lens of intrinsic dimension

    Yijun Dong, Yicheng Li, Yunai Li, Jason D. Lee, and Qi Lei. Discrepancies are virtue: Weak-to-Strong Generalization through lens of intrinsic dimension. In International Conference on Machine Learning (ICML), 2025

  23. [23]

    Learning single-index models in Gaussian space

    Rishabh Dudeja and Daniel Hsu. Learning single-index models in Gaussian space. In Conference on Learning Theory (COLT), volume 75, pages 1887--1930, 2018

  24. [24]

    Statistical-computational trade-offs in tensor PCA and related problems via communication complexity

    Rishabh Dudeja and Daniel Hsu. Statistical-computational trade-offs in tensor PCA and related problems via communication complexity. The Annals of Statistics, 52(1):131--156, 2024

  25. [25]

    Toy Models of Superposition

    Nelson Elhage, Tristan Hume, Catherine Olsson, Nicholas Schiefer, Tom Henighan, Shauna Kravec, Zac Hatfield-Dodds, Robert Lasenby, Dawn Drain, Carol Chen, Roger Grosse, Sam McCandlish, Jared Kaplan, Dario Amodei, Martin Wattenberg, and Christopher Olah. Toy models of superposition, 2022. arXiv:2209.10652

  26. [26]

    Propagation of chaos in one-hidden-layer neural networks beyond logarithmic time, 2025

    Margalit Glasgow, Denny Wu, and Joan Bruna. Propagation of chaos in one-hidden-layer neural networks beyond logarithmic time, 2025. arXiv:2504.13110

  27. [27]

    Language models represent space and time

    Wes Gurnee and Max Tegmark. Language models represent space and time. In International Conference on Learning Representations (ICLR), 2024

  28. [28]

    Linearity of relation decoding in transformer language models

    Evan Hernandez, Arnab Sen Sharma, Tal Haklay, Kevin Meng, Martin Wattenberg, Jacob Andreas, Yonatan Belinkov, and David Bau. Linearity of relation decoding in transformer language models. In International Conference on Learning Representations (ICLR), 2024

  29. [29]

    Disentangling and mitigating the impact of task similarity for continual learning

    Naoki Hiratani. Disentangling and mitigating the impact of task similarity for continual learning. In Advances in Neural Information Processing Systems (NeurIPS), 2024

  30. [30]

    High-dimensional analysis of knowledge distillation: Weak-to-Strong Generalization and scaling laws

    Muhammed Emrullah Ildiz, Halil Alperen Gozeten, Ege Onur Taga, Marco Mondelli, and Samet Oymak. High-dimensional analysis of knowledge distillation: Weak-to-Strong Generalization and scaling laws. In International Conference on Learning Representations (ICLR), 2025

  31. [31]

    Aligner: Efficient alignment by learning to correct

    Jiaming Ji, Boyuan Chen, Hantao Lou, Donghai Hong, Borong Zhang, Xuehai Pan, Tianyi Qiu, Juntao Dai, and Yaodong Yang. Aligner: Efficient alignment by learning to correct. In Advances in Neural Information Processing Systems (NeurIPS), 2024

  32. [32]

    On the complexity of learning sparse functions with statistical and gradient queries

    Nirmit Joshi, Theodor Misiakiewicz, and Nathan Srebro. On the complexity of learning sparse functions with statistical and gradient queries. In Advances in Neural Information Processing Systems (NeurIPS), 2024

  33. [33]

    Understanding catastrophic forgetting in language models via implicit inference

    Suhas Kotha, Jacob Mitchell Springer, and Aditi Raghunathan. Understanding catastrophic forgetting in language models via implicit inference. In International Conference on Learning Representations (ICLR), 2024

  34. [34]

    Theoretical analysis of Weak-to-Strong Generalization

    Hunter Lang, David Sontag, and Aravindan Vijayaraghavan. Theoretical analysis of Weak-to-Strong Generalization. In Advances in Neural Information Processing Systems (NeurIPS), volume 37, pages 46837--46880, 2024

  35. [35]

    Neural network learns low-dimensional polynomials with SGD near the information-theoretic limit

    Jason D. Lee, Kazusato Oko, Taiji Suzuki, and Denny Wu. Neural network learns low-dimensional polynomials with SGD near the information-theoretic limit. In Advances in Neural Information Processing Systems (NeurIPS), volume 37, pages 58716--58756, 2024. doi:10.52202/079017-1872

  36. [36]

    An empirical study of catastrophic forgetting in large language models during continual fine-tuning

    Yun Luo, Zhen Yang, Fandong Meng, Yafu Li, Jie Zhou, and Yue Zhang. An empirical study of catastrophic forgetting in large language models during continual fine-tuning. IEEE Transactions on Audio, Speech and Language Processing, 33:3776--3786, 2025

  37. [37]

    Beyond NTK with vanilla gradient descent: A mean-field analysis of neural networks with polynomial width, samples, and time

    Arvind Mahankali, Haochen Zhang, Kefan Dong, Margalit Glasgow, and Tengyu Ma. Beyond NTK with vanilla gradient descent: A mean-field analysis of neural networks with polynomial width, samples, and time. In Advances in Neural Information Processing Systems (NeurIPS), volume 36, 2023

  38. [38]

    Weak-to-Strong Generalization even in random feature networks, provably

    Marko Medvedev, Kaifeng Lyu, Dingli Yu, Sanjeev Arora, Zhiyuan Li, and Nathan Srebro. Weak-to-Strong Generalization even in random feature networks, provably. In International Conference on Machine Learning (ICML), 2025

  39. [39]

    Language models implement simple word2vec-style vector arithmetic

    Jack Merullo, Carsten Eickhoff, and Ellie Pavlick. Language models implement simple word2vec-style vector arithmetic. In Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), 2024

  40. [40]

    Linguistic regularities in continuous space word representations

    Tomáš Mikolov, Wen-tau Yih, and Geoffrey Zweig. Linguistic regularities in continuous space word representations. In Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), 2013

  41. [41]

    On the mechanisms of Weak-to-Strong Generalization: A theoretical perspective

    Behrad Moniri and Hamed Hassani. On the mechanisms of Weak-to-Strong Generalization: A theoretical perspective. In Advances in Neural Information Processing Systems (NeurIPS), 2025

  42. [42]

    A theory of non-linear feature learning with one gradient step in two-layer neural networks

    Behrad Moniri, Donghwan Lee, Hamed Hassani, and Edgar Dobriban. A theory of non-linear feature learning with one gradient step in two-layer neural networks. In International Conference on Machine Learning (ICML), volume 235, pages 36106--36159. PMLR, 2024

  43. [43]

    Neural networks efficiently learn low-dimensional representations with SGD

    Alireza Mousavi-Hosseini, Sejun Park, Manuela Girotti, Ioannis Mitliagkas, and Murat A Erdogdu. Neural networks efficiently learn low-dimensional representations with SGD. In International Conference on Learning Representations (ICLR), 2022

  44. [44]

    Gradient-based feature learning under structured data

    Alireza Mousavi-Hosseini, Denny Wu, Taiji Suzuki, and Murat A. Erdogdu. Gradient-based feature learning under structured data. In Advances in Neural Information Processing Systems (NeurIPS), 2023

  45. [45]

    Emergent linear representations in world models of self-supervised sequence models

    Neel Nanda, Andrew Lee, and Martin Wattenberg. Emergent linear representations in world models of self-supervised sequence models. In BlackboxNLP Workshop at Empirical Methods in Natural Language Processing (BlackboxNLP@EMNLP), 2023

  46. [46]

    Nonlinear transformers can perform inference-time feature learning

    Naoki Nishikawa, Yujin Song, Kazusato Oko, Denny Wu, and Taiji Suzuki. Nonlinear transformers can perform inference-time feature learning. In International Conference on Machine Learning (ICML), 2025

  47. [47]

    From linear to nonlinear: Provable Weak-to-Strong Generalization through feature learning

    Junsoo Oh, Jerry Song, and Chulhee Yun. From linear to nonlinear: Provable Weak-to-Strong Generalization through feature learning. In Advances in Neural Information Processing Systems (NeurIPS), 2025

  48. [48]

    Learning sum of diverse features: computational hardness and efficient gradient-based training for ridge combinations

    Kazusato Oko, Yujin Song, Taiji Suzuki, and Denny Wu. Learning sum of diverse features: computational hardness and efficient gradient-based training for ridge combinations. In Conference on Learning Theory (COLT), volume 247, pages 4009--4081, 2024a

  49. [49]

    Pretrained transformer efficiently learns low-dimensional target functions in-context

    Kazusato Oko, Yujin Song, Taiji Suzuki, and Denny Wu. Pretrained transformer efficiently learns low-dimensional target functions in-context. In Advances in Neural Information Processing Systems (NeurIPS), 2024b

  50. [50]

    Task-specific skill localization in fine-tuned language models

    Abhishek Panigrahi, Nikunj Saunshi, Haoyu Zhao, and Sanjeev Arora. Task-specific skill localization in fine-tuned language models. In International Conference on Learning Representations (ICLR), 2023

  51. [51]

    The linear representation hypothesis and the geometry of large language models

    Kiho Park, Yo Joong Choe, and Victor Veitch. The linear representation hypothesis and the geometry of large language models. In International Conference on Machine Learning (ICML), 2024

  52. [52]

    Language models are unsupervised multitask learners

    Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019. URL https://storage.prod.researchhub.com/uploads/papers/2020/06/01/language-models.pdf

  53. [53]

    Emergence and scaling laws in SGD learning of shallow neural networks

    Yunwei Ren, Eshaan Nichani, Denny Wu, and Jason D. Lee. Emergence and scaling laws in SGD learning of shallow neural networks. In Advances in Neural Information Processing Systems (NeurIPS), 2025

  54. [54]

    Weak-to-Strong Generalization through the data-centric lens

    Changho Shin, John Cooper, and Frederic Sala. Weak-to-Strong Generalization through the data-centric lens. In International Conference on Learning Representations (ICLR), 2025

  55. [55]

    Learning Gaussian multi-index models with gradient flow: Time complexity and directional convergence

    Berfin Simsek, Amire Bendjeddou, and Daniel Hsu. Learning Gaussian multi-index models with gradient flow: Time complexity and directional convergence. In International Conference on Artificial Intelligence and Statistics (AISTATS), 2025

  56. [56]

    Your weak LLM is secretly a strong teacher for alignment

    Leitian Tao and Yixuan Li. Your weak LLM is secretly a strong teacher for alignment. In International Conference on Learning Representations (ICLR), 2025

  57. [57]

    Steering Language Models With Activation Engineering

    Alexander Matt Turner, Lisa Thiergart, Gavin Leech, David Udell, Juan J Vazquez, Ulisse Mini, and Monte MacDiarmid. Steering language models with activation engineering, 2023. arXiv:2308.10248

  58. [58]

    High-Dimensional Probability: An Introduction with Applications in Data Science

    Roman Vershynin. High-Dimensional Probability: An Introduction with Applications in Data Science. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, 2018

  59. [59]

    Two-stage LLM fine-tuning with less specialization and more generalization

    Yihan Wang, Si Si, Daliang Li, Michal Lukasik, Felix X. Yu, Cho-Jui Hsieh, Inderjit S. Dhillon, and Sanjiv Kumar. Two-stage LLM fine-tuning with less specialization and more generalization. In International Conference on Learning Representations (ICLR), 2024

  60. [60]

    Provable Weak-to-Strong Generalization via benign overfitting

    David Xing Wu and Anant Sahai. Provable Weak-to-Strong Generalization via benign overfitting. In International Conference on Learning Representations (ICLR), 2025

  61. [61]

    Representations shape Weak-to-Strong Generalization: Theoretical insights and empirical predictions

    Yihao Xue, Jiping Li, and Baharan Mirzasoleiman. Representations shape Weak-to-Strong Generalization: Theoretical insights and empirical predictions. In International Conference on Machine Learning (ICML), 2025

  62. [62]

    A useful variant of the Davis-Kahan theorem for statisticians

    Y. Yu, T. Wang, and R. J. Samworth. A useful variant of the Davis-Kahan theorem for statisticians. Biometrika, 102(2):315--323, 2015