pith. machine review for the scientific record.

arxiv: 2605.12908 · v1 · submitted 2026-05-13 · 📊 stat.ML · cs.LG

Recognition: no theorem link

The Mechanism of Weak-to-Strong Generalization: Feature Elicitation from Latent Knowledge

Authors on Pith: no claims yet

Pith reviewed 2026-05-14 19:02 UTC · model grok-4.3

classification 📊 stat.ML cs.LG
keywords weak-to-strong generalization · feature elicitation · two-layer neural networks · reward model learning · catastrophic forgetting · latent knowledge · feature learning

The pith

A strong neural network learns a target task from weak-model outputs by eliciting its own pre-trained feature direction rather than overwriting it.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proves that a strong two-layer network, when fine-tuned on the outputs of a weaker model specialized for task κ, recovers the required feature direction from its own pre-trained low-dimensional subspaces. This lets the strong model perform the target task while keeping its other capabilities intact, and avoids the catastrophic forgetting that ordinary supervised fine-tuning causes whenever off-target feature directions overlap with the target. The result is shown for reward-model learning and establishes weak-to-strong generalization in the feature-learning regime, where the target direction is not assumed to be present from the start but is instead recovered through training.

Core claim

In the setting of reward-model learning with two-layer neural networks, a strong model whose pre-trained representations lie in low-dimensional subspaces V_k acquires the target feature direction for task κ through multi-step SGD under weak-model supervision. It thereby learns the task while retaining general capabilities and preserving off-target features, even when those features are correlated with the target.
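
As a toy illustration of this mechanism (not the paper's construction — the dimensions, weak supervisor, and readout below are all invented for the sketch), a two-layer network whose first-layer weights start inside a low-dimensional subspace containing the target direction can be trained on noisy weak-model labels, tracking the per-neuron alignment |θ^⊤ w_n| that Figure 1 plots:

```python
import numpy as np

rng = np.random.default_rng(0)
d, s, n_neurons, batch, steps, lr = 128, 32, 32, 32, 4000, 0.2

# Target feature direction theta lies in a low-dimensional subspace
# (here: the first s coordinates), standing in for the paper's V_k.
theta = np.zeros(d)
theta[:s] = rng.standard_normal(s)
theta /= np.linalg.norm(theta)

# "Pre-trained" first-layer weights initialized inside the subspace:
# the target direction is latent but not yet aligned with any neuron.
W = np.zeros((n_neurons, d))
W[:, :s] = rng.standard_normal((n_neurons, s)) / np.sqrt(s)

def alignment(W):
    # Per-neuron |cos| between weight vector and the target direction.
    return np.abs(W @ theta) / np.linalg.norm(W, axis=1)

align_init = alignment(W).mean()

for _ in range(steps):
    x = rng.standard_normal((batch, d))
    # Weak supervisor: a noisy probe of the target feature (a crude
    # stand-in for the specialized weak model's outputs).
    y = np.tanh(x @ theta) + 0.3 * rng.standard_normal(batch)
    pre = x @ W.T                      # (batch, n_neurons)
    out = np.tanh(pre).mean(axis=1)    # fixed mean readout
    err = out - y
    # SGD on the first-layer weights through the tanh nonlinearity.
    grad = ((err[:, None] * (1.0 - np.tanh(pre) ** 2)).T @ x) / (batch * n_neurons)
    W -= lr * grad

align_final = alignment(W).mean()
print(align_init, align_final)  # mean alignment with theta should grow
```

Under this setup the neurons' mean alignment with θ rises well above its initialization scale, the qualitative behavior the paper's Figure 1 (top panel) reports for W2S training.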

What carries the argument

Low-dimensional subspaces V_k organizing the strong model's pre-trained representations, which weak-to-strong training uses to elicit the target feature direction for task κ.

If this is right

  • The strong model acquires the target feature direction through W2S training rather than receiving it a priori.
  • W2S training preserves pre-trained off-target features even when they correlate with the target direction.
  • Standard supervised fine-tuning produces catastrophic forgetting of correlated off-target features.
  • W2S generalization holds in the feature-learning regime for two-layer networks under reward-model learning.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • If deeper networks maintain comparable subspace organization, the same elicitation mechanism could scale beyond two-layer models.
  • Alignment procedures that rely on weak supervisors may reduce capability loss by eliciting rather than overwriting latent features.
  • Experiments that deliberately entangle feature directions would test whether the low-dimensional subspace assumption is necessary for the observed preservation effect.

Load-bearing premise

The strong model's pre-trained representations are organized into distinct low-dimensional subspaces separating target and off-target features.

What would settle it

A simulation in which the strong model's representations lack low-dimensional subspace structure would show either failure to acquire the target feature or loss of off-target capabilities under the same weak-to-strong training.
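
A minimal numeric version of the contrast (all dimensions invented for illustration): initializing neurons inside a low-dimensional subspace containing the target direction gives each an initial overlap of order s^{-1/2}, versus d^{-1/2} for isotropic initialization with no subspace structure — the "head start" the premise supplies before any training:

```python
import numpy as np

rng = np.random.default_rng(1)
d, s, n = 4096, 64, 2000

# Unit target direction inside the subspace spanned by the first s coords.
theta = np.zeros(d)
theta[:s] = rng.standard_normal(s)
theta /= np.linalg.norm(theta)

# Neurons initialized inside the subspace vs. isotropically in all of R^d.
W_sub = np.zeros((n, d))
W_sub[:, :s] = rng.standard_normal((n, s))
W_iso = rng.standard_normal((n, d))

def mean_cos(W):
    # Mean |cos| between each neuron and the target direction.
    return (np.abs(W @ theta) / np.linalg.norm(W, axis=1)).mean()

sub, iso = mean_cos(W_sub), mean_cos(W_iso)
print(sub, iso)  # roughly s**-0.5 vs d**-0.5 scales
```

Without the subspace structure the initial signal shrinks by a factor of about (d/s)^{1/2}, which is one concrete way the proposed simulation could exhibit failure to acquire the target feature.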

Figures

Figures reproduced from arXiv: 2605.12908 by Ryoya Awano, Taiji Suzuki.

Figure 1
Figure 1. Per-neuron alignment magnitude |θ_k^⊤ w_{k,n}^t| during training (d = 1024, s = 128, K = 2, σ*_k = He_4, θ_1^⊤ θ_2 = 0.3). Line colors distinguish neuron types by the signs of α̃_2 β_{k,2} and α̃_4 β_{k,4} (W2S) or α̃_4 β_{k,4} alone (SFT); neurons with initial alignment magnitude |θ_k^⊤ w_{k,n}^0| < s^{-1/2} are shown semi-transparent. Top (W2S, η = 0.2, T = 10000): absolute alignment |θ_1^⊤ w_{1,n}^t| with the target feature (le…
Original abstract

Weak-to-strong (W2S) generalization, in which a strong model is fine-tuned on outputs of a weaker, task-specialized model, has been proposed as an approach to aligning superhuman AI systems. Existing theoretical analyses either fix the student's representations or operate in restricted settings. Whether multi-step SGD can succeed in feature learning while preserving diverse pre-trained capabilities remains open. We study W2S in the setting of reward-model learning with two-layer neural networks. The strong model has pre-trained representations organized into low-dimensional subspaces $V_k$, and is fine-tuned under the supervision of a weak model specialized on task $\kappa$. We prove that the strong model efficiently learns task $\kappa$, eliciting its pre-trained knowledge while retaining general capabilities. This establishes W2S generalization in the feature-learning regime, in the sense that the strong model acquires the target feature direction through W2S training, rather than having it given a priori. Moreover, W2S preserves pre-trained off-target features, whereas standard supervised fine-tuning causes catastrophic forgetting when off-target feature directions are correlated with the target's. Numerical experiments on synthetic data confirm our theoretical results.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 3 minor

Summary. The manuscript claims to prove that, in reward-model learning with two-layer neural networks, a strong model whose pre-trained representations are organized into low-dimensional subspaces V_k can be fine-tuned via multi-step SGD under supervision from a weak model specialized on task κ. The strong model thereby acquires the target feature direction from its latent knowledge (establishing W2S generalization in the feature-learning regime) while preserving off-target directions; standard supervised fine-tuning, by contrast, produces catastrophic forgetting when off-target directions are correlated with the target. Synthetic experiments are said to confirm the theoretical predictions.

Significance. If the central derivation holds, the work supplies a concrete, parameter-free mechanism explaining how W2S training elicits task-relevant features from pre-organized subspaces without a priori provision of the target direction, while simultaneously protecting general capabilities—an issue left open by prior analyses that either fix representations or restrict the setting. The explicit contrast with catastrophic forgetting under standard SFT is a clear strength, and the restriction to two-layer networks and reward-model learning is stated up front, making the scope transparent. The result therefore offers a useful foundation for understanding alignment of superhuman models, provided the two-layer analysis can be lifted.

minor comments (3)
  1. [Abstract] The abstract and introduction should state the two-layer and reward-model assumptions more explicitly at the outset, as these delimit the entire analysis.
  2. [Experiments] The synthetic experiments are referenced as confirmation but lack sufficient detail on data generation, exact subspace construction, and quantitative metrics (e.g., cosine similarity to target direction or off-target retention); adding these would strengthen reproducibility.
  3. [Setup] Notation for the subspaces V_k and the specialization of the weak model on κ could be accompanied by a small illustrative diagram in the setup section to aid readers unfamiliar with the geometric picture.
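
The quantitative metrics suggested in comment 2 could be computed along the following lines (the function name and toy directions here are hypothetical, not from the paper):

```python
import numpy as np

def alignment_metrics(W, theta_target, theta_off):
    """Per-neuron |cosine similarity| to the target direction, and
    retention of an off-target direction measured the same way."""
    norms = np.linalg.norm(W, axis=1)
    return {
        "target_cos": np.abs(W @ theta_target) / norms,
        "off_target_cos": np.abs(W @ theta_off) / norms,
    }

# Toy check: neurons planted along the off-target direction should
# report near-perfect off-target retention and near-zero target cosine.
rng = np.random.default_rng(2)
d = 16
t_target, t_off = np.eye(d)[0], np.eye(d)[1]
W = np.outer(np.ones(4), t_off) + 0.01 * rng.standard_normal((4, d))
m = alignment_metrics(W, t_target, t_off)
print(m["off_target_cos"].min(), m["target_cos"].max())
```

Reporting these two curves over training for both W2S and SFT would make the preservation claim directly checkable against the figures.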

Simulated Authors' Rebuttal

0 responses · 1 unresolved

We thank the referee for their positive assessment of the manuscript and for recommending minor revision. The referee's summary correctly captures our central claims regarding weak-to-strong generalization in the feature-learning regime for two-layer networks. We address the report below.

standing simulated objections not resolved
  • Extending the two-layer analysis to deeper networks or general architectures, as the proofs rely on the specific low-dimensional subspace structure and update dynamics available only in the two-layer setting.

Circularity Check

0 steps flagged

Derivation is self-contained with no circular reductions

full rationale

The paper's central result is a proof that multi-step SGD on a two-layer network under weak supervision elicits the target feature direction from explicitly assumed pre-organized low-dimensional subspaces V_k while preserving off-target directions. The derivation begins from the stated model architecture, weak-model specialization on task κ, and reward-model learning dynamics, then produces the feature-acquisition guarantee directly from those inputs. No step reduces by the paper's own equations to a fitted quantity renamed as prediction, no self-citation chain is load-bearing for the uniqueness or existence claim, and the argument remains scoped to the given assumptions without self-definition or ansatz smuggling. This is the normal case of an internally consistent theoretical derivation.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the modeling choice that pre-trained knowledge lives in low-dimensional subspaces V_k and that the weak supervisor is task-specialized; these are introduced to enable the feature-elicitation analysis and are not derived from more basic principles within the paper.

axioms (2)
  • domain assumption Pre-trained representations of the strong model are organized into low-dimensional subspaces V_k
    This structures the initial knowledge so that target and off-target features can be analyzed separately during fine-tuning.
  • domain assumption The weak model is specialized on task κ
    Defines the form of the supervision signal used to drive the strong model's updates.

pith-pipeline@v0.9.0 · 5501 in / 1456 out tokens · 86419 ms · 2026-05-14T19:02:34.366782+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

62 extracted references · 7 canonical work pages · 3 internal anchors

  1. [1]

    SGD learning on neural networks: leap complexity and saddle-to-saddle dynamics

    Emmanuel Abbe, Enric Boix Adsera, and Theodor Misiakiewicz. SGD learning on neural networks: leap complexity and saddle-to-saddle dynamics. In Conference on Learning Theory (COLT), volume 195, pages 2552--2623. PMLR, 2023

  2. [2]

    GPT-4 Technical Report

    Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. GPT -4 technical report, 2023. arXiv:2303.08774

  3. [3]

    Repetita iuvant: Data repetition allows SGD to learn high-dimensional multi-index functions

    Luca Arnaboldi, Yatin Dandi, Florent Krzakala, Luca Pesce, and Ludovic Stephan. Repetita iuvant: Data repetition allows SGD to learn high-dimensional multi-index functions. In High-dimensional Learning Dynamics 2024: The Emergence of Structure and Reasoning, 2024

  4. [4]

    A latent variable model approach to PMI-based word embeddings

    Sanjeev Arora, Yuanzhi Li, Yingyu Liang, Tengyu Ma, and Andrej Risteski. A latent variable model approach to PMI-based word embeddings. Transactions of the Association for Computational Linguistics, 4:385--399, 2016

  5. [5]

    High-dimensional asymptotics of feature learning: How one gradient step improves the representation

    Jimmy Ba, Murat A Erdogdu, Taiji Suzuki, Zhichao Wang, Denny Wu, and Greg Yang. High-dimensional asymptotics of feature learning: How one gradient step improves the representation. In Advances in Neural Information Processing Systems (NeurIPS), volume 35, pages 37932--37946, 2022

  6. [6]

    Learning in the presence of low-dimensional structure: A spiked random matrix perspective

    Jimmy Ba, Murat A Erdogdu, Taiji Suzuki, Zhichao Wang, and Denny Wu. Learning in the presence of low-dimensional structure: A spiked random matrix perspective. In Advances in Neural Information Processing Systems (NeurIPS), 2023

  7. [7]

    Toward universal steering and monitoring of AI models

    Daniel Beaglehole, Adityanarayanan Radhakrishnan, Enric Boix-Adserà, and Mikhail Belkin. Toward universal steering and monitoring of AI models. Science, 391(6787):787--792, 2026

  8. [8]

    Online stochastic gradient descent on non-convex losses from high-dimensional inference

    Gérard Ben Arous, Reza Gheissari, and Aukosh Jagannath. Online stochastic gradient descent on non-convex losses from high-dimensional inference. Journal of Machine Learning Research, 22(106):1--51, 2021

  9. [9]

    Learning quadratic neural networks in high dimensions: SGD dynamics and scaling laws

    Gérard Ben Arous, Murat A Erdogdu, Nuri Mert Vural, and Denny Wu. Learning quadratic neural networks in high dimensions: SGD dynamics and scaling laws. In Advances in Neural Information Processing Systems (NeurIPS), 2025

  10. [10]

    Learning time-scales in two-layers neural networks

    Rapha \"e l Berthier, Andrea Montanari, and Kangjie Zhou. Learning time-scales in two-layers neural networks. Foundations of Computational Mathematics, pages 1--84, 2024

  11. [11]

    Learning single-index models with shallow neural networks

    Alberto Bietti, Joan Bruna, Clayton Sanford, and Min Jae Song. Learning single-index models with shallow neural networks. In Advances in Neural Information Processing Systems (NeurIPS), 2022

  12. [12]

    On using extended statistical queries to avoid membership queries

    Nader H Bshouty and Vitaly Feldman. On using extended statistical queries to avoid membership queries. Journal of Machine Learning Research, 2(Feb):359--395, 2002

  13. [13]

    Weak-to-strong generalization: Eliciting strong capabilities with weak supervision

    Collin Burns, Pavel Izmailov, Jan Hendrik Kirchner, Bowen Baker, Leo Gao, Leopold Aschenbrenner, Yining Chen, Adrien Ecoffet, Manas Joglekar, Jan Leike, Ilya Sutskever, and Jeffrey Wu. Weak-to-Strong Generalization: Eliciting strong capabilities with weak supervision, 2023. arXiv:2312.09390

  14. [14]

    Chernoff-type bounds for the Gaussian error function

    Seok-Ho Chang, Pamela C Cosman, and Laurence B Milstein. Chernoff-type bounds for the Gaussian error function. IEEE Transactions on Communications, 59(11):2939--2944, 2011

  15. [15]

    Quantifying the gain in Weak-to-Strong Generalization

    Moses Charikar, Chirag Pabbaraju, and Kirankumar Shiragur. Quantifying the gain in Weak-to-Strong Generalization. In Advances in Neural Information Processing Systems (NeurIPS), volume 37, pages 126474--126499, 2024

  16. [16]

    Learning polynomials in few relevant dimensions

    Sitan Chen and Raghu Meka. Learning polynomials in few relevant dimensions. In Conference on Learning Theory (COLT), volume 125, pages 1161--1227. PMLR, 2020

  17. [17]

    Knowledge neurons in pretrained transformers

    Damai Dai, Li Dong, Yaru Hao, Zhifang Sui, Baobao Chang, and Furu Wei. Knowledge neurons in pretrained transformers. In Annual Meeting of the Association for Computational Linguistics (ACL), 2022

  18. [18]

    Smoothing the landscape boosts the signal for SGD: Optimal sample complexity for learning single index models

    Alex Damian, Eshaan Nichani, Rong Ge, and Jason D Lee. Smoothing the landscape boosts the signal for SGD: Optimal sample complexity for learning single index models. In Advances in Neural Information Processing Systems (NeurIPS), volume 36, 2024a

  19. [19]

    Computational-statistical gaps in Gaussian single-index models (extended abstract)

    Alex Damian, Loucas Pillaud-Vivien, Jason Lee, and Joan Bruna. Computational-statistical gaps in Gaussian single-index models (extended abstract). In Conference on Learning Theory (COLT), volume 247 of Proceedings of Machine Learning Research, pages 1262--1262, 30 Jun--03 Jul 2024b. Full version available at arXiv:2403.05529

  20. [20]

    Neural networks can learn representations with gradient descent

    Alexandru Damian, Jason D. Lee, and Mahdi Soltanolkotabi. Neural networks can learn representations with gradient descent. In Conference on Learning Theory (COLT), volume 178, pages 5413--5452. PMLR, 2022

  21. [21]

    The benefits of reusing batches for gradient descent in two-layer networks: Breaking the curse of information and leap exponents

    Yatin Dandi, Emanuele Troiani, Luca Arnaboldi, Luca Pesce, Lenka Zdeborová, and Florent Krzakala. The benefits of reusing batches for gradient descent in two-layer networks: Breaking the curse of information and leap exponents. In International Conference on Machine Learning (ICML), 2024

  22. [22]

    Discrepancies are virtue: Weak-to-Strong Generalization through lens of intrinsic dimension

    Yijun Dong, Yicheng Li, Yunai Li, Jason D. Lee, and Qi Lei. Discrepancies are virtue: Weak-to-Strong Generalization through lens of intrinsic dimension. In International Conference on Machine Learning (ICML), 2025

  23. [23]

    Learning single-index models in Gaussian space

    Rishabh Dudeja and Daniel Hsu. Learning single-index models in Gaussian space. In Conference on Learning Theory (COLT), volume 75, pages 1887--1930, 2018

  24. [24]

    Statistical-computational trade-offs in tensor PCA and related problems via communication complexity

    Rishabh Dudeja and Daniel Hsu. Statistical-computational trade-offs in tensor PCA and related problems via communication complexity. The Annals of Statistics, 52(1):131--156, 2024

  25. [25]

    Toy Models of Superposition

    Nelson Elhage, Tristan Hume, Catherine Olsson, Nicholas Schiefer, Tom Henighan, Shauna Kravec, Zac Hatfield-Dodds, Robert Lasenby, Dawn Drain, Carol Chen, Roger Grosse, Sam McCandlish, Jared Kaplan, Dario Amodei, Martin Wattenberg, and Christopher Olah. Toy models of superposition, 2022. arXiv:2209.10652

  26. [26]

    Propagation of chaos in one-hidden-layer neural networks beyond logarithmic time, 2025

    Margalit Glasgow, Denny Wu, and Joan Bruna. Propagation of chaos in one-hidden-layer neural networks beyond logarithmic time, 2025. arXiv:2504.13110

  27. [27]

    Language models represent space and time

    Wes Gurnee and Max Tegmark. Language models represent space and time. In International Conference on Learning Representations (ICLR), 2024

  28. [28]

    Linearity of relation decoding in transformer language models

    Evan Hernandez, Arnab Sen Sharma, Tal Haklay, Kevin Meng, Martin Wattenberg, Jacob Andreas, Yonatan Belinkov, and David Bau. Linearity of relation decoding in transformer language models. In International Conference on Learning Representations (ICLR), 2024

  29. [29]

    Disentangling and mitigating the impact of task similarity for continual learning

    Naoki Hiratani. Disentangling and mitigating the impact of task similarity for continual learning. In Advances in Neural Information Processing Systems (NeurIPS), 2024

  30. [30]

    High-dimensional analysis of knowledge distillation: Weak-to-Strong Generalization and scaling laws

    Muhammed Emrullah Ildiz, Halil Alperen Gozeten, Ege Onur Taga, Marco Mondelli, and Samet Oymak. High-dimensional analysis of knowledge distillation: Weak-to-Strong Generalization and scaling laws. In International Conference on Learning Representations (ICLR), 2025

  31. [31]

    Aligner: Efficient alignment by learning to correct

    Jiaming Ji, Boyuan Chen, Hantao Lou, Donghai Hong, Borong Zhang, Xuehai Pan, Tianyi Qiu, Juntao Dai, and Yaodong Yang. Aligner: Efficient alignment by learning to correct. In Advances in Neural Information Processing Systems (NeurIPS), 2024

  32. [32]

    On the complexity of learning sparse functions with statistical and gradient queries

    Nirmit Joshi, Theodor Misiakiewicz, and Nathan Srebro. On the complexity of learning sparse functions with statistical and gradient queries. In Advances in Neural Information Processing Systems (NeurIPS), 2024

  33. [33]

    Understanding catastrophic forgetting in language models via implicit inference

    Suhas Kotha, Jacob Mitchell Springer, and Aditi Raghunathan. Understanding catastrophic forgetting in language models via implicit inference. In International Conference on Learning Representations (ICLR), 2024

  34. [34]

    Theoretical analysis of Weak-to-Strong Generalization

    Hunter Lang, David Sontag, and Aravindan Vijayaraghavan. Theoretical analysis of Weak-to-Strong Generalization. In Advances in Neural Information Processing Systems (NeurIPS), volume 37, pages 46837--46880, 2024

  35. [35]

    Neural network learns low-dimensional polynomials with SGD near the information-theoretic limit

    Jason D. Lee, Kazusato Oko, Taiji Suzuki, and Denny Wu. Neural network learns low-dimensional polynomials with SGD near the information-theoretic limit. In Advances in Neural Information Processing Systems (NeurIPS), volume 37, pages 58716--58756, 2024. doi:10.52202/079017-1872

  36. [36]

    An empirical study of catastrophic forgetting in large language models during continual fine-tuning

    Yun Luo, Zhen Yang, Fandong Meng, Yafu Li, Jie Zhou, and Yue Zhang. An empirical study of catastrophic forgetting in large language models during continual fine-tuning. IEEE Transactions on Audio, Speech and Language Processing, 33:3776--3786, 2025

  37. [37]

    Beyond NTK with vanilla gradient descent: A mean-field analysis of neural networks with polynomial width, samples, and time

    Arvind Mahankali, Haochen Zhang, Kefan Dong, Margalit Glasgow, and Tengyu Ma. Beyond NTK with vanilla gradient descent: A mean-field analysis of neural networks with polynomial width, samples, and time. In Advances in Neural Information Processing Systems (NeurIPS), volume 36, 2023

  38. [38]

    Weak-to-Strong Generalization even in random feature networks, provably

    Marko Medvedev, Kaifeng Lyu, Dingli Yu, Sanjeev Arora, Zhiyuan Li, and Nathan Srebro. Weak-to-Strong Generalization even in random feature networks, provably. In International Conference on Machine Learning (ICML), 2025

  39. [39]

    Language models implement simple word2vec-style vector arithmetic

    Jack Merullo, Carsten Eickhoff, and Ellie Pavlick. Language models implement simple word2vec-style vector arithmetic. In Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), 2024

  40. [40]

    Linguistic regularities in continuous space word representations

    Tomáš Mikolov, Wen-tau Yih, and Geoffrey Zweig. Linguistic regularities in continuous space word representations. In Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), 2013

  41. [41]

    On the mechanisms of Weak-to-Strong Generalization: A theoretical perspective

    Behrad Moniri and Hamed Hassani. On the mechanisms of Weak-to-Strong Generalization: A theoretical perspective. In Advances in Neural Information Processing Systems (NeurIPS), 2025

  42. [42]

    A theory of non-linear feature learning with one gradient step in two-layer neural networks

    Behrad Moniri, Donghwan Lee, Hamed Hassani, and Edgar Dobriban. A theory of non-linear feature learning with one gradient step in two-layer neural networks. In International Conference on Machine Learning (ICML), volume 235, pages 36106--36159. PMLR, 2024

  43. [43]

    Neural networks efficiently learn low-dimensional representations with SGD

    Alireza Mousavi-Hosseini, Sejun Park, Manuela Girotti, Ioannis Mitliagkas, and Murat A Erdogdu. Neural networks efficiently learn low-dimensional representations with SGD. In International Conference on Learning Representations (ICLR), 2022

  44. [44]

    Gradient-based feature learning under structured data

    Alireza Mousavi-Hosseini, Denny Wu, Taiji Suzuki, and Murat A. Erdogdu. Gradient-based feature learning under structured data. In Advances in Neural Information Processing Systems (NeurIPS), 2023

  45. [45]

    Emergent linear representations in world models of self-supervised sequence models

    Neel Nanda, Andrew Lee, and Martin Wattenberg. Emergent linear representations in world models of self-supervised sequence models. In BlackboxNLP Workshop at Empirical Methods in Natural Language Processing (BlackboxNLP@EMNLP), 2023

  46. [46]

    Nonlinear transformers can perform inference-time feature learning

    Naoki Nishikawa, Yujin Song, Kazusato Oko, Denny Wu, and Taiji Suzuki. Nonlinear transformers can perform inference-time feature learning. In International Conference on Machine Learning (ICML), 2025

  47. [47]

    From linear to nonlinear: Provable Weak-to-Strong Generalization through feature learning

    Junsoo Oh, Jerry Song, and Chulhee Yun. From linear to nonlinear: Provable Weak-to-Strong Generalization through feature learning. In Advances in Neural Information Processing Systems (NeurIPS), 2025

  48. [48]

    Learning sum of diverse features: computational hardness and efficient gradient-based training for ridge combinations

    Kazusato Oko, Yujin Song, Taiji Suzuki, and Denny Wu. Learning sum of diverse features: computational hardness and efficient gradient-based training for ridge combinations. In Conference on Learning Theory (COLT), volume 247, pages 4009--4081, 2024a

  49. [49]

    Pretrained transformer efficiently learns low-dimensional target functions in-context

    Kazusato Oko, Yujin Song, Taiji Suzuki, and Denny Wu. Pretrained transformer efficiently learns low-dimensional target functions in-context. In Advances in Neural Information Processing Systems (NeurIPS), 2024b

  50. [50]

    Task-specific skill localization in fine-tuned language models

    Abhishek Panigrahi, Nikunj Saunshi, Haoyu Zhao, and Sanjeev Arora. Task-specific skill localization in fine-tuned language models. In International Conference on Learning Representations (ICLR), 2023

  51. [51]

    The linear representation hypothesis and the geometry of large language models

    Kiho Park, Yo Joong Choe, and Victor Veitch. The linear representation hypothesis and the geometry of large language models. In International Conference on Machine Learning (ICML), 2024

  52. [52]

    Language models are unsupervised multitask learners

    Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019. URL https://storage.prod.researchhub.com/uploads/papers/2020/06/01/language-models.pdf

  53. [53]

    Emergence and scaling laws in SGD learning of shallow neural networks

    Yunwei Ren, Eshaan Nichani, Denny Wu, and Jason D. Lee. Emergence and scaling laws in SGD learning of shallow neural networks. In Advances in Neural Information Processing Systems (NeurIPS), 2025

  54. [54]

    Weak-to-Strong Generalization through the data-centric lens

    Changho Shin, John Cooper, and Frederic Sala. Weak-to-Strong Generalization through the data-centric lens. In International Conference on Learning Representations (ICLR), 2025

  55. [55]

    Learning Gaussian multi-index models with gradient flow: Time complexity and directional convergence

    Berfin Simsek, Amire Bendjeddou, and Daniel Hsu. Learning Gaussian multi-index models with gradient flow: Time complexity and directional convergence. In International Conference on Artificial Intelligence and Statistics (AISTATS), 2025

  56. [56]

    Your weak LLM is secretly a strong teacher for alignment

    Leitian Tao and Yixuan Li. Your weak LLM is secretly a strong teacher for alignment. In International Conference on Learning Representations (ICLR), 2025

  57. [57]

    Steering Language Models With Activation Engineering

    Alexander Matt Turner, Lisa Thiergart, Gavin Leech, David Udell, Juan J Vazquez, Ulisse Mini, and Monte MacDiarmid. Steering language models with activation engineering, 2023. arXiv:2308.10248

  58. [58]

    High-Dimensional Probability: An Introduction with Applications in Data Science

    Roman Vershynin. High-Dimensional Probability: An Introduction with Applications in Data Science. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, 2018

  59. [59]

    Two-stage LLM fine-tuning with less specialization and more generalization

    Yihan Wang, Si Si, Daliang Li, Michal Lukasik, Felix X. Yu, Cho-Jui Hsieh, Inderjit S. Dhillon, and Sanjiv Kumar. Two-stage LLM fine-tuning with less specialization and more generalization. In International Conference on Learning Representations (ICLR), 2024

  60. [60]

    Provable Weak-to-Strong Generalization via benign overfitting

    David Xing Wu and Anant Sahai. Provable Weak-to-Strong Generalization via benign overfitting. In International Conference on Learning Representations (ICLR), 2025

  61. [61]

    Representations shape Weak-to-Strong Generalization: Theoretical insights and empirical predictions

    Yihao Xue, Jiping Li, and Baharan Mirzasoleiman. Representations shape Weak-to-Strong Generalization: Theoretical insights and empirical predictions. In International Conference on Machine Learning (ICML), 2025

  62. [62]

    A useful variant of the Davis-Kahan theorem for statisticians

    Y. Yu, T. Wang, and R. J. Samworth. A useful variant of the Davis-Kahan theorem for statisticians. Biometrika, 102(2):315--323, 2015