Discrepancies are Virtue: Weak-to-Strong Generalization through Lens of Intrinsic Dimension

Jason D. Lee; Qi Lei; Yicheng Li; Yijun Dong; Yunai Li

arxiv: 2502.05075 · v6 · submitted 2025-02-07 · 💻 cs.LG · cs.NA · math.NA · stat.ML

Yijun Dong, Yicheng Li, Yunai Li, Jason D. Lee, Qi Lei

Pith reviewed 2026-05-23 03:21 UTC · model grok-4.3

classification 💻 cs.LG · cs.NA · math.NA · stat.ML

keywords weak-to-strong generalization · intrinsic dimension · variance reduction · ridgeless regression · feature subspaces · finetuning · generalization error

The pith

Discrepancies between the weak teacher's and strong student's feature subspaces scale the inherited variance by dim(V_s)/N in weak-to-strong finetuning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper models weak-to-strong finetuning as ridgeless regression in low-dimensional feature spaces and derives an exact variance expression for the generalization error. It shows that the weak teacher's variance carries over unchanged in the shared subspace but is scaled down by the factor dim(V_s)/N, the ratio of the strong subspace dimension to the number of pseudo-labels, in the discrepant directions. A reader would care because this accounts for why the strong student can beat the weak teacher despite noisy labels. The result also supplies concrete scaling for how many pseudo-labels are needed to close the performance gap.

Core claim

For a strong student-weak teacher pair whose finetuning occurs in sufficiently expressive low-dimensional subspaces V_s and V_w, the variance term that dominates generalization error is inherited from the weak teacher throughout V_s ∩ V_w and is multiplied by the factor dim(V_s)/N throughout the discrepancy subspace V_w ∖ V_s.
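The claim can be checked numerically in a toy setting. The sketch below is not the paper's experiment: it assumes isotropic Gaussian inputs and axis-aligned weak and strong feature blocks that share `k` coordinates (all names here are hypothetical). In that toy model the leading-order variance of the student works out to `sigma^2 / n * (k + (d_w - k) * d_s / N)`, against `sigma^2 * d_w / n` for the teacher, which mirrors the inherited-plus-reduced structure stated above.

```python
import numpy as np

def simulate_w2s_variance(n, N, d_w, d_s, k, sigma=1.0, trials=300, seed=0):
    """Monte Carlo estimate of (weak, W2S) excess risk in a toy Gaussian model.

    Assumed setup (illustrative only): isotropic Gaussian inputs; the weak
    and strong feature maps select coordinate blocks sharing k coordinates.
    The true signal is zero, so all excess risk is variance.
    """
    rng = np.random.default_rng(seed)
    D = d_w + d_s - k                       # layout: [shared | weak-only | strong-only]
    weak_idx = np.arange(d_w)               # shared + weak-only coordinates
    strong_idx = np.concatenate([np.arange(k), np.arange(d_w, D)])
    weak_risk = w2s_risk = 0.0
    for _ in range(trials):
        X = rng.standard_normal((n, D))     # n labeled inputs
        y = sigma * rng.standard_normal(n)  # noisy labels of a zero signal
        theta_w = np.linalg.lstsq(X[:, weak_idx], y, rcond=None)[0]
        X_u = rng.standard_normal((N, D))   # N unlabeled inputs
        pseudo = X_u[:, weak_idx] @ theta_w # pseudo-labels from the weak teacher
        theta_s = np.linalg.lstsq(X_u[:, strong_idx], pseudo, rcond=None)[0]
        # With isotropic inputs, excess risk equals the squared coefficient norm.
        weak_risk += theta_w @ theta_w
        w2s_risk += theta_s @ theta_s
    return weak_risk / trials, w2s_risk / trials

def leading_order_variance(n, N, d_w, d_s, k, sigma=1.0):
    """Leading-order (large n, N) variances for the toy model above."""
    weak = sigma**2 * d_w / n
    w2s = sigma**2 / n * (k + (d_w - k) * d_s / N)
    return weak, w2s

if __name__ == "__main__":
    args = dict(n=400, N=2000, d_w=20, d_s=40, k=5)
    print("simulated:", simulate_w2s_variance(**args))
    print("predicted:", leading_order_variance(**args))
```

The simulated values should sit slightly above the leading-order prediction at finite `n` and `N`, since the exact Gaussian expressions carry `n - d_w - 1` and `N - d_s - 1` in place of `n` and `N`.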

What carries the argument

The decomposition of variance across the intersection V_s ∩ V_w and the difference V_w ∖ V_s in the ridgeless regression on intrinsic feature subspaces.

If this is right

In the discrepant directions, the variance passed on to the strong student becomes small once N exceeds a sufficiently large multiple of dim(V_s), which is when the student starts to recover the performance gap.
Sample complexity of weak-to-strong generalization is governed by the dimension of V_s rather than the ambient input dimension.
The performance gap shrinks proportionally to the relative size of the discrepancy subspace.
Variance reduction is absent when the two subspaces coincide, recovering ordinary weak-teacher error.
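As a rough worked example of these consequences, assume the variance ratio takes the leading-order form suggested by the core claim, with the correlation dimension d_{s∧w} standing in for the size of the shared subspace (this closed form is an assumption of the sketch, not a formula quoted from the paper). Plugging in the legend values that appear in the figure list further down this page (d_w = 522, d_s = 443, d_{s∧w} = 301.06, N = 10000) gives:

```latex
\frac{\mathrm{Var}(f_{\mathrm{w2s}})}{\mathrm{Var}(f_{w})}
  \approx \frac{d_{s\wedge w} + \frac{d_s}{N}\,\bigl(d_w - d_{s\wedge w}\bigr)}{d_w}
  = \frac{301.06 + 0.0443 \times 220.94}{522}
  \approx \frac{310.85}{522}
  \approx 0.60,
\qquad
\lim_{N\to\infty}\frac{\mathrm{Var}(f_{\mathrm{w2s}})}{\mathrm{Var}(f_{w})}
  = \frac{d_{s\wedge w}}{d_w}
  = \frac{301.06}{522}
  \approx 0.58 .
```

Under these assumptions almost all of the achievable reduction is already realized at N = 10000, and the floor is set by the overlap ratio d_{s∧w}/d_w; when the two subspaces coincide the ratio is 1 and no reduction occurs.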

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same subspace decomposition could be used to select or design weak teachers that maximize the useful discrepancy volume.
The analysis supplies a concrete test for whether a given pair of models will exhibit the observed weak-to-strong improvement before running finetuning.
If the low-dimensional assumption holds for other label-noise settings, similar variance-reduction gains may appear outside weak-to-strong finetuning.

Load-bearing premise

Finetuning of both models occurs inside intrinsically low-dimensional feature subspaces that justify the ridgeless regression analysis.

What would settle it

Measure the student's prediction variance separately inside and outside the estimated discrepancy subspace; if the reduction factor is not close to dim(V_s)/N after training on N pseudo-labels, the claimed variance characterization does not hold.
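A minimal sketch of such a measurement, under the same kind of toy assumptions as before (isotropic Gaussian inputs, axis-aligned feature blocks sharing `k` coordinates, zero true signal; the function name and layout are hypothetical). It compares the teacher's error in the weak-only coordinates with the part of the student's error that is not inherited through the shared coordinates, and returns their ratio, which the claim predicts to be close to `d_s / N`.

```python
import numpy as np

def discrepancy_reduction_factor(n, N, d_w, d_s, k, trials=200, seed=0):
    """Ratio of student to teacher variance attributable to the discrepancy.

    Toy model (an assumption of this sketch): coordinates are laid out as
    [shared | weak-only | strong-only], inputs are isotropic Gaussian.
    """
    rng = np.random.default_rng(seed)
    D = d_w + d_s - k
    weak_idx = np.arange(d_w)
    strong_idx = np.concatenate([np.arange(k), np.arange(d_w, D)])
    teacher_disc = student_disc = 0.0
    for _ in range(trials):
        X = rng.standard_normal((n, D))
        noise = rng.standard_normal(n)            # labels of a zero signal
        theta_w = np.linalg.lstsq(X[:, weak_idx], noise, rcond=None)[0]
        X_u = rng.standard_normal((N, D))
        pseudo = X_u[:, weak_idx] @ theta_w
        theta_s = np.linalg.lstsq(X_u[:, strong_idx], pseudo, rcond=None)[0]
        # Teacher error carried by the weak-only (discrepant) coordinates.
        teacher_disc += np.sum(theta_w[k:] ** 2)
        # Student error beyond what is inherited through shared coordinates.
        inherited = np.zeros(d_s)
        inherited[:k] = theta_w[:k]
        student_disc += np.sum((theta_s - inherited) ** 2)
    return student_disc / teacher_disc

if __name__ == "__main__":
    n, N, d_w, d_s, k = 400, 2000, 20, 40, 5
    print("measured factor:", discrepancy_reduction_factor(n, N, d_w, d_s, k))
    print("d_s / N        :", d_s / N)
```

On real models the subspaces are not axis-aligned, so the same comparison would first require estimating bases for V_s and V_w and projecting the errors onto them.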

Figures

Figures reproduced from arXiv: 2502.05075 by Jason D. Lee, Qi Lei, Yicheng Li, Yijun Dong, Yunai Li.

**Figure 2.** Scaling for excess risks on the synthetic regression task in a …

**Figure 3.** Scaling for excess risks on the synthetic regression task when …

**Figure 4.** Scaling for PGR and OPR under different ds∧w on the synthetic regression task in a variance-dominated regime.

**Figure 5.** Scaling for PGR and OPR of different weak teachers with a fixed strong student on UTKFace. The legends show the comparison between ds∧w and dw. The relative W2S performance tends to be better when ds∧w/dw is lower, i.e. a larger discrepancy between weak and strong features leads to better W2S generalization.

**Figure 6.** Scaling for PGR and OPR on UTKFace with injected label noise: y_i ← y_i + ζ_i where ζ_i ∼ N(0, ς²) i.i.d.

**Figure 7.** Scaling for MSE on UTKFace with CLIP-B32 as the strong student and ResNet18 as the weak teacher. Panels: MSE versus N at n = 1000, and MSE versus n at N = 10000. Curves: W2S, Weak, S-Baseline, S-Ceiling. Legend text extracted with this entry: dw = 522 (ResNet50), ds = 443 (CLIP-B32), ds∧w = 301.06.

**Figure 8.** Scaling for MSE on UTKFace with CLIP-B32 as the strong student and ResNet50 as the weak teacher. The intrinsic dimensions of the pretrained features are significantly smaller than the ambient feature dimensions, which is consistent with the theoretical analysis and the empirical observations in Aghajanyan et al. (2021).

**Figure 9.** Scaling for MSE on UTKFace with CLIP-B32 as the strong student and ResNet152 as the weak teacher.

**Figure 10.** Scaling for PGR and OPR of different weak teachers with a fixed strong student on ColoredMNIST.

**Figure 11.** Scaling for PGR and OPR of W2S on ColoredMNIST with injected label noise.

**Figure 12.** Scaling for PGR and OPR of different weak teachers with a fixed strong student on SST-2.

**Figure 13.** Scaling for PGR and OPR of W2S on SST-2 with injected label noise.
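The figure legends report a correlation dimension ds∧w alongside the intrinsic dimensions ds and dw. A minimal sketch of how such a quantity could be computed is given below, assuming it is measured as the squared Frobenius norm of V_s^T V_w for orthonormal bases V_s and V_w; that definition is an assumption that fits the non-integer values in the legends, and the helper names are hypothetical. An exact SVD is used here for clarity; for large feature matrices a randomized sketch would be a cheaper substitute.

```python
import numpy as np

def top_right_singular_vectors(Phi, d):
    """Return a (D x d) orthonormal basis of the top-d right singular subspace."""
    _, _, Vt = np.linalg.svd(Phi, full_matrices=False)
    return Vt[:d].T

def correlation_dimension(V_s, V_w):
    """Squared Frobenius norm of V_s^T V_w (assumed definition of d_{s∧w}).

    For orthonormal bases this lies between 0 and min(d_s, d_w) and equals
    the intersection dimension when the subspaces are axis-aligned.
    """
    return float(np.linalg.norm(V_s.T @ V_w, "fro") ** 2)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Hypothetical feature matrices: N samples by D feature coordinates.
    Phi_s = rng.standard_normal((200, 30))
    Phi_w = rng.standard_normal((200, 30))
    V_s = top_right_singular_vectors(Phi_s, d=8)
    V_w = top_right_singular_vectors(Phi_w, d=5)
    print("correlation dimension:", correlation_dimension(V_s, V_w))
```

A smaller ratio of this quantity to dw corresponds to a larger discrepancy between the weak and strong feature subspaces.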

Original abstract

Weak-to-strong (W2S) generalization is a type of finetuning (FT) where a strong (large) student model is trained on pseudo-labels generated by a weak teacher. Surprisingly, W2S FT often outperforms the weak teacher. We seek to understand this phenomenon through the observation that FT often occurs in intrinsically low-dimensional spaces. Leveraging the low intrinsic dimensionality of FT, we analyze W2S in the ridgeless regression setting from a variance reduction perspective. For a strong student-weak teacher pair with sufficiently expressive low-dimensional feature subspaces $\mathcal{V}_s, \mathcal{V}_w$, we provide an exact characterization of the variance that dominates the generalization error of W2S. This unveils a virtue of discrepancy between the strong and weak models in W2S: the variance of the weak teacher is inherited by the strong student in $\mathcal{V}_s \cap \mathcal{V}_w$, while reduced by a factor of $\mathrm{dim}(\mathcal{V}_s)/N$ in the subspace of discrepancy $\mathcal{V}_w \setminus \mathcal{V}_s$ with $N$ pseudo-labels for W2S. Our analysis further casts light on the sample complexities and the scaling of performance gap recovery in W2S. The analysis is supported by experiments on synthetic regression problems, as well as real vision and NLP tasks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper derives an exact variance split in ridgeless regression that credits the discrepant subspace with a specific reduction factor, but the low-dimensional subspace modeling for actual finetuning is the part that needs checking.

The main point is that they give an exact characterization of the variance dominating W2S generalization error under ridgeless regression on fixed low-dimensional subspaces. In the overlap the student picks up the teacher's variance; in the discrepant part that variance is scaled by dim(V_s)/N with N pseudo-labels. This is new relative to earlier W2S work, which did not isolate the discrepancy subspace this way via intrinsic dimension.

Referee Report

1 major / 1 minor

Summary. The paper analyzes weak-to-strong (W2S) generalization, in which a strong student model is finetuned on pseudo-labels from a weak teacher. Observing that finetuning occurs in intrinsically low-dimensional spaces, it models the setting as ridgeless regression on fixed low-dimensional feature subspaces V_s (strong) and V_w (weak). Under the assumption that these subspaces are sufficiently expressive, it derives an exact characterization of the variance term that dominates generalization error: the weak teacher's variance is inherited by the student in the intersection V_s ∩ V_w, while reduced by a factor of dim(V_s)/N in the discrepancy subspace V_w ∖ V_s (with N pseudo-labels). The analysis also addresses sample complexities and scaling of performance-gap recovery, and is supported by synthetic regression experiments plus real vision and NLP tasks.

Significance. If the low-dimensional subspace modeling of finetuning holds for W2S, the work supplies a first-principles variance-reduction account of why discrepancy between weak teacher and strong student can be beneficial, rather than merely an empirical curiosity. The exact, parameter-free characterization under the ridgeless model is a clear strength; it yields concrete predictions about when and how much the performance gap recovers with more pseudo-labels.

major comments (1)

Abstract and theoretical derivation (presumably §3–4): the central claim of an 'exact characterization' of the dominating variance, and the resulting 'virtue of discrepancy,' is derived under the modeling premise that W2S finetuning can be represented as ridgeless regression on fixed, sufficiently expressive low-dimensional subspaces V_s and V_w. The only justification offered is the general statement that 'FT often occurs in intrinsically low-dimensional spaces'; no formal bound, effective-dimension estimate, or verification is supplied showing that this premise remains valid specifically during pseudo-label training, that the subspaces capture the dominant variance contributions in neural-network finetuning, or that the ridgeless approximation continues to dominate the generalization error once the student is large. Because the variance-inheritance and dim(V_s)/N reduction statements …

minor comments (1)

Notation: the precise construction or approximation of the subspaces V_s and V_w is not stated explicitly when the real-task experiments are described; a short paragraph clarifying how these subspaces are obtained (or approximated) from the trained models would improve reproducibility.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive review and for recognizing the potential value of our variance-reduction perspective on W2S generalization. We address the sole major comment below.

Referee: Abstract and theoretical derivation (presumably §3–4): the central claim of an 'exact characterization' of the dominating variance, and the resulting 'virtue of discrepancy,' is derived under the modeling premise that W2S finetuning can be represented as ridgeless regression on fixed, sufficiently expressive low-dimensional subspaces V_s and V_w. The only justification offered is the general statement that 'FT often occurs in intrinsically low-dimensional spaces'; no formal bound, effective-dimension estimate, or verification is supplied showing that this premise remains valid specifically during pseudo-label training, that the subspaces capture the dominant variance contributions in neural-network finetuning, or that the ridgeless approximation continues to dominate the generalization error once the student is large. Because the variance-inheritance and dim(V_s)/N reduction statements …

Authors: Our analysis derives an exact variance characterization strictly inside the stated ridgeless regression model on fixed subspaces V_s and V_w (see §3–4). This modeling choice is motivated by the body of work on intrinsic low-dimensionality of fine-tuning (cited in §1 and §2), not presented as a formally proven property of pseudo-label training. The paper makes no claim of a general bound on effective dimension or dominance of the ridgeless regime for arbitrarily large students; the exact formulas are offered as an interpretable lens that isolates the effect of subspace discrepancy. Sections 5.1–5.3 supply empirical checks on both synthetic data generated from the model and real vision/NLP tasks, where the predicted dim(V_s)/N scaling and inheritance in the overlap are observed. We will add an explicit “Modeling Assumptions and Scope” paragraph in the revision to clarify these boundaries.

Revision: partial

Circularity Check

0 steps flagged

No circularity: variance characterization follows directly from ridgeless regression algebra on assumed subspaces

The paper's central derivation computes the variance of the W2S estimator explicitly in the ridgeless linear regression setting on fixed subspaces V_s and V_w, yielding the stated inheritance and reduction factors as algebraic consequences of the projection and pseudo-label averaging. This calculation is self-contained within the model and does not reduce any claimed prediction to a fitted quantity from the target data, nor does it rely on self-citations for the uniqueness or validity of the variance expressions. The low-dimensional subspace premise is introduced as an external modeling choice motivated by prior empirical observations on finetuning, not derived from or tautological with the variance result itself.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the low intrinsic dimensionality of finetuning and the choice of ridgeless regression; no free parameters are explicitly fitted in the abstract, and no new entities are introduced.

axioms (2)

domain assumption: Finetuning occurs in intrinsically low-dimensional spaces.
Invoked at the start of the analysis to justify the subspace model.
domain assumption: The analysis is performed in the ridgeless regression setting.
Stated as the mathematical framework for the variance derivation.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Feature Learning in Linear-Width Two-Layer Networks: Two vs. One Step of Gradient Descent
stat.ML · 2026-05 · unverdicted · novelty 7.0

In the linear-width regime, the second GD step yields a spiked random matrix whose number of outliers is floor(alpha2 / (1/2 - alpha1)), and batch reuse enables learning directions with information exponent greater th...


[1] [1]

When does preconditioning help or hurt generalization?arXiv preprint arXiv:2006.10732,

Amari, S.-i., Ba, J., Grosse, R., Li, X., Nitanda, A., Suzuki, T., Wu, D., and Xu, J. When does preconditioning help or hurt generalization?arXiv preprint arXiv:2006.10732,

work page arXiv 2006

[2] [2]

Invariant Risk Minimization

Arjovsky, M., Bottou, L., Gulrajani, I., and Lopez-Paz, D. Invariant risk minimization.arXiv preprint arXiv:1907.02893,

work page internal anchor Pith review Pith/arXiv arXiv 1907

[3] [3]

and Andersen, L

Borup, K. and Andersen, L. N. Self-distillation for gaussian process regression and classification. arXiv preprint arXiv:2304.02641,

work page arXiv

[4] [4]

ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators

Clark, K., Luong, M.-T., Le, Q. V ., and Manning, C. D. Electra: Pre-training text encoders as discriminators rather than generators.arXiv preprint arXiv:2003.10555,

work page internal anchor Pith review Pith/arXiv arXiv 2003

[5] [5]

Efficient bounds and estimates for canonical angles in randomized subspace approximations.SIAM Journal on Matrix Analysis and Applica- tions, 45(4):1978–2006, 2024a

Dong, Y ., Martinsson, P.-G., and Nakatsukasa, Y . Efficient bounds and estimates for canonical angles in randomized subspace approximations.SIAM Journal on Matrix Analysis and Applica- tions, 45(4):1978–2006, 2024a. Dong, Y ., Miller, K., Lei, Q., and Ward, R. Cluster-aware semi-supervised learning: relational knowledge distillation provably learns clust...

work page arXiv 1978

[6] [6]

A., Chandra, K

Goel, S., Struber, J., Auzina, I. A., Chandra, K. K., Kumaraguru, P., Kiela, D., Prabhu, A., Bethge, M., and Geiping, J. Great models think alike and this undermines ai oversight.arXiv preprint arXiv:2502.04313,

work page arXiv

[7] Guo, J., Chen, H., Wang, C., Han, K., Xu, C., and Wang, Y. Vision superalignment: Weak-to-strong generalization for vision foundation models. arXiv preprint arXiv:2402.03749.

[8] Guo, Y. and Yang, Y. Improving weak-to-strong generalization with reliability-aware alignment. arXiv preprint arXiv:2406.19032.

[9] Hinton, G. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531.

[10] Huang, W., Yi, M., Zhao, X., and Jiang, Z. Towards the generalization of contrastive self-supervised learning. arXiv preprint arXiv:2111.00743.

[11] Ildiz, M. E., Gozeten, H. A., Taga, E. O., Mondelli, M., and Oymak, S. High-dimensional analysis of knowledge distillation: Weak-to-strong generalization and scaling laws. arXiv preprint arXiv:2410.18837.

[12] Johnson, W. B. Extensions of Lipschitz mapping into Hilbert space. In Conference modern analysis and probability, 1984, pp. 189–206.

[13] Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

[14] Lang, H., Sontag, D., and Vijayaraghavan, A. Theoretical analysis of weak-to-strong generalization. arXiv preprint arXiv:2405.16043.

[15] Liu, Y. and Alahi, A. Co-supervised learning: Improving weak-to-strong generalization with hierarchical mixture of experts. arXiv preprint arXiv:2402.15505.

[16] Medvedev, M., Lyu, K., Yu, D., Arora, S., Li, Z., and Srebro, N. Weak-to-strong generalization even in random feature networks, provably. arXiv preprint arXiv:2503.02877.

[17] Pareek, D., Du, S. S., and Oh, S. Understanding the gains from repeated self-distillation. arXiv preprint arXiv:2407.04600.

[18] Shin, C., Cooper, J., and Sala, F. Weak-to-strong generalization through the data-centric lens. arXiv preprint arXiv:2412.03881.

[19] Socher, R., Perelygin, A., Wu, J., Chuang, J., Manning, C. D., Ng, A. Y., and Potts, C. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 conference on empirical methods in natural language processing, pp. 1631–1642.

[20] Somerstep, S., Polo, F. M., Banerjee, M., Ritov, Y., Yurochkin, M., and Sun, Y. A statistical framework for weak-to-strong generalization. arXiv preprint arXiv:2405.16236.

[21] Spigler, S., Geiger, M., and Wyart, M. Asymptotic learning curves of kernel methods: empirical data versus teacher–student paradigm. Journal of Statistical Mechanics: Theory and Experiment, 2020(12):124001.

[22] Turc, I., Chang, M.-W., Lee, K., and Toutanova, K. Well-read students learn better: On the importance of pre-training compact models. arXiv preprint arXiv:1908.08962.

[23] Vershynin, R. Introduction to the non-asymptotic analysis of random matrices. arXiv preprint arXiv:1011.3027.

[24] Wei, C., Shen, K., Chen, Y., and Ma, T. Theoretical analysis of self-training with deep networks on unlabeled data. arXiv preprint arXiv:2010.03622.

[25] Wu, D. X. and Sahai, A. Provable weak-to-strong generalization via benign overfitting. arXiv preprint arXiv:2410.04638.

[26] Yang, W., Shen, S., Shen, G., Yao, W., Liu, Y., Gong, Z., Lin, Y., and Wen, J.-R. Super(ficial)-alignment: Strong models may deceive weak models in weak-to-strong generalization. arXiv preprint arXiv:2406.11431, 2024a.
Yang, Y., Ma, Y., and Liu, P. Weak-to-strong reasoning. In Findings of the Association for Computational Linguistics: EMNLP 2024, pp...

[27] Yao, W., Yang, W., Wang, Z., Lin, Y., and Liu, Y. Understanding the capabilities and limitations of weak-to-strong generalization. arXiv preprint arXiv:2502.01458.

[28] Appendices
A Additional related works
B Proofs in Section 3
B.1 Proof of Theorem 1
B.2 Proof of Proposition 1 and Corollary 1
B.3 Proof of Proposition 2
B.4 ...

[29] is closely connected to W2S generalization through the teacher-student setup, while W2S reverses the capacities of teacher and student in KD. In KD, a strong teacher model guides a weak student model to learn the teacher’s knowledge. In contrast, W2S generalization occurs when a strong student model surpasses a weak teacher model under weak supervision...

[30] and student-teacher correlation in W2S. Self-distillation and self-training. In contrast to W2S, which considers distinct student and teacher models, self-distillation (Zhang et al., 2019,

[31] use the same or progressively refined architectures to iteratively distill knowledge from a “previous version” of the model. There have been extensive theoretical analyses toward understanding the mechanism behind self-distillation (Mobahi et al., 2020; Das & Sanghavi, 2023; Borup & Andersen, 2023; Pareek et al., 2024). Self-training (Scudder, 1965; Le...

[32] is a closely related method to self-distillation that takes a single model’s confident predictions to create pseudo-labels for unlabeled data and refines that model iteratively. Wei et al. (2020); Oymak & Gulcu (2021); Frei et al. (2022) provide theoretical insights into the generalization of self-training. In particular, Wei et al. (2020) introduced a ...

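[33] Variance. For the variance term, we observe that
\[
\mathrm{Var}(f_{w2s})
= \frac{1}{N} \mathbb{E}_{S_x, \widetilde{S}} \Big[ \big\| P_s \Phi_w \widetilde{\Phi}_w^{\dagger} \widetilde{z} \big\|_2^2 \Big]
= \frac{1}{N} \mathbb{E}_{S_x, \widetilde{S}} \Big[ \operatorname{tr}\Big( \Phi_w^\top P_s \Phi_w \widetilde{\Phi}_w^{\dagger} \widetilde{z} \widetilde{z}^\top (\widetilde{\Phi}_w^{\dagger})^\top \Big) \Big]
= \frac{\sigma^2}{N} \mathbb{E}_{S_x, \widetilde{S}} \Big[ \operatorname{tr}\Big( \Phi_w^\top P_s \Phi_w (\widetilde{\Phi}_w^\top \widetilde{\Phi}_w)^{\dagger} \Big) \Big],
\]
which implies
\[
\mathrm{Var}(f_{w2s}) = \frac{\sigma^2}{N} \operatorname{tr}\Big( \mathbb{E}_{S_x}\Big[ \Sigma_w^{-1/2} \Phi_w^\top P_s \Phi_w \Sigma_w^{-1/2} \Big] \, \mathbb{E}_{\widetilde{S}}\Big[ \big( \Sigma_w^{-1/2} \widetilde{\Phi}_w^\top \widetilde{\Phi}_w \Sigma_w^{-1/2} \big)^{\dagger} \Big] \Big). \tag{8}
\]
We observe that
\[
\mathbb{E}_{\widetilde{S}}\Big[ \big( \Sigma_w^{-1/2} \widetilde{\Phi}_w^\top \widetilde{\Phi}_w \Sigma_w^{-1/2} \big)^{\dagger} \Big]
= \mathbb{E}_{\widetilde{S}}\Big[ \big( V_w \widetilde{\Gamma}_w^\top \widetilde{\Gamma}_w V_w^\top \big)^{\dagger} \Big]
= V_w \, \mathbb{E}_{\widetilde{S}}\Big[ \big( \widetilde{\Gamma}_w^\top \widetilde{\Gamma}_w \big)^{\dagger} \Big] V_w^\top
\]
...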
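The two trace manipulations in the variance computation can be unpacked as follows. This is a sketch under two assumptions that the excerpt suggests but does not state: that P_s is an orthogonal projection (so P_s^⊤ P_s = P_s), and that the noise vector z̃ is independent of the features with E[z̃ z̃^⊤] = σ² I. The pseudoinverse identity in the last step is standard.

```latex
\begin{align*}
\big\| P_s \Phi_w \widetilde{\Phi}_w^{\dagger} \widetilde{z} \big\|_2^2
  &= \widetilde{z}^\top (\widetilde{\Phi}_w^{\dagger})^\top \Phi_w^\top P_s^\top P_s \Phi_w \widetilde{\Phi}_w^{\dagger} \widetilde{z} \\
  &= \operatorname{tr}\Big( \Phi_w^\top P_s \Phi_w \, \widetilde{\Phi}_w^{\dagger} \widetilde{z} \widetilde{z}^\top (\widetilde{\Phi}_w^{\dagger})^\top \Big)
  && \text{(cyclic trace, } P_s^\top P_s = P_s \text{)} \\
\mathbb{E}_{\widetilde{z}} \Big[ \big\| P_s \Phi_w \widetilde{\Phi}_w^{\dagger} \widetilde{z} \big\|_2^2 \Big]
  &= \sigma^2 \operatorname{tr}\Big( \Phi_w^\top P_s \Phi_w \, \widetilde{\Phi}_w^{\dagger} (\widetilde{\Phi}_w^{\dagger})^\top \Big)
  && \text{(assumed } \mathbb{E}[\widetilde{z} \widetilde{z}^\top] = \sigma^2 I \text{)} \\
  &= \sigma^2 \operatorname{tr}\Big( \Phi_w^\top P_s \Phi_w \, (\widetilde{\Phi}_w^\top \widetilde{\Phi}_w)^{\dagger} \Big)
  && \text{(since } A^{\dagger} (A^{\dagger})^\top = (A^\top A)^{\dagger} \text{)}
\end{align*}
```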

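[34] Overall, by (19) and (24), we have
\[
\mathrm{Var}(f_{w2s}) \le \frac{\sigma^2}{4 (\alpha_w n)(\alpha_{w2s} N)} \Big( \Big(1 + \frac{1}{N}\Big) \operatorname{tr}(\Sigma_s \Sigma_w) + \frac{1}{N} \operatorname{tr}(\Sigma_s) \operatorname{tr}(\Sigma_w) \Big),
\]
\[
\mathrm{Bias}(f_{w2s}) \le \alpha_w \big\| \Sigma_w^{-1/2} \Sigma_*^{1/2} \theta_* \big\|_2^2 + \alpha_{w2s} \big\| \Sigma_s^{-1/2} \Sigma_*^{1/2} \theta_* \big\|_2^2 \le \alpha_w \varrho_w + \alpha_{w2s} \varrho_s.
\]
The upper bound on the excess risk ER(f_{w2s}) = Var(f_{w2s}) + Bias(f_{w2s}) is minimized by taking
\[
\alpha_w = \Big( \frac{\sigma^2}{4 n N} \frac{\varrho_s}{\varrho_w^2} \Big( \Big(1 + \frac{1}{N}\Big) \operatorname{tr}(\Sigma_s \Sigma_w) + \frac{1}{N} \operatorname{tr}(\Sigma_s) \operatorname{tr}(\Sigma_w) \Big) \Big)^{1/3}, \qquad
\alpha_{w2s} = \Big( \frac{\sigma^2}{4 n N} \frac{\varrho_w}{\varrho^2 \ldots}
\]
...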
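As a check on the stated minimizers, the first-order conditions can be worked out directly. The sketch below writes the bound as a function g of the two regularization parameters, where A is shorthand introduced here (not a symbol from the source) for σ²/(4nN) times the trace-dependent factor in the variance bound. It ignores any side constraints on α_w and α_{w2s} that the full proof may impose.

```latex
\begin{align*}
g(\alpha_w, \alpha_{w2s}) &= \frac{A}{\alpha_w \alpha_{w2s}} + \alpha_w \varrho_w + \alpha_{w2s} \varrho_s, \\
\partial_{\alpha_w} g = 0 &\iff \alpha_w^2 \, \alpha_{w2s} = \frac{A}{\varrho_w}, \qquad
\partial_{\alpha_{w2s}} g = 0 \iff \alpha_w \, \alpha_{w2s}^2 = \frac{A}{\varrho_s}, \\
\text{dividing:}\quad \frac{\alpha_w}{\alpha_{w2s}} &= \frac{\varrho_s}{\varrho_w}
\;\Longrightarrow\;
\alpha_w = \Big( \frac{A \varrho_s}{\varrho_w^2} \Big)^{1/3}, \quad
\alpha_{w2s} = \Big( \frac{A \varrho_w}{\varrho_s^2} \Big)^{1/3}, \\
g_{\min} &= 3 \, (A \, \varrho_w \varrho_s)^{1/3}.
\end{align*}
```

At the optimum the three terms of g are equal, each contributing (A ϱ_w ϱ_s)^{1/3}, which is the usual balance between variance and the two bias terms.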

[35] We observe the following:
[Figure 7: Scaling for MSE on UTKFace with CLIP-B32 as the strong student and ResNet18 as the weak teacher. Panels: MSE versus N at n = 1000, and MSE versus n at N = 10000. Curves: W2S, Weak, S-Baseline, S-Ceiling. d_w = 194 (ResNet18), d_s = 443 (CLIP-B32), d_{s∧w} = 167.64.]
...

[36] • It is worth highlighting that while the MSE loss of f_{w2s} monotonically decreases with respect to both sample sizes n, N, the different rates of convergence compared to f_w, f_s, and f_c lead to the
[Figure: MSE versus N at n = 1000, and MSE versus n at N = 10000. Curves: W2S, Weak, S-Baseline, S-Ceiling. d_w = 589 (ResNet152), d_s = 443 (CLIP-B32), ...]

[37] and vary the weak teacher among the ResNet-d series and ResNet series (ResNet18D, ResNet34D, ResNet101, ResNet152) (He et al., 2019, 2016). We replace ResNet18 and ResNet34 used in Section 4.2 to experiment on weak models with similar intrinsic dimensions but different correlation dimensions. We treat the backbone of the models (excluding the classificati...

[38] is a corpus with fully labeled parse trees that allows for a complete analysis of the compositional effects of sentiment in language. The corpus is based on the dataset introduced by Pang & Lee (2005) and consists of 11,855 single sentences extracte...
[Figure 10: Scaling for PGR and OPR of different weak teachers with a fixed strong student on ColoredMNIST.]

[39] (Bert-Tiny, Bert-Mini, Bert-Small, Bert-Medium). With manageable model sizes, we conduct full finetuning experiments following the setup in Burns et al. (2024). We use the standard cross-entropy loss for supervised finetuning. When training strong students on weak labels (W2S), we use the confidence-weighted loss proposed by Burns et al. (2024), which i...

[40] with a learning rate of 5e-5, a cosine learning rate schedule, and 40 warmup steps. We train for 3 epochs, which is sufficient for the train and validation losses to stabilize. Intrinsic dimension. The intrinsic dimensions d_w, d_s are measured with the Structure-Aware Intrinsic Dimension (SAID) method proposed by Aghajanyan et al. (2021). We first train...
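The optimizer settings quoted above (learning rate 5e-5, cosine schedule, 40 warmup steps) can be illustrated with a minimal schedule function. This is a sketch, not the authors' training code: the linear shape of the warmup, the decay to zero, the function name `lr_at_step`, and the total step count used in the example are all assumptions made here.

```python
import math

def lr_at_step(step, total_steps, base_lr=5e-5, warmup_steps=40):
    """Learning rate at a given optimizer step.

    Assumed shape: linear warmup from 0 to base_lr over warmup_steps,
    then cosine decay from base_lr to 0 at total_steps.
    """
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    # Fraction of the post-warmup phase completed, clipped to [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, progress)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))

# Hypothetical run length; the source states 3 epochs but not the step count.
TOTAL_STEPS = 1000
schedule = [lr_at_step(s, TOTAL_STEPS) for s in range(TOTAL_STEPS + 1)]
```

The peak rate is reached exactly at step 40 and the schedule is non-increasing afterwards, so most of training runs below the nominal 5e-5.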

[41] to accelerate estimation of d_{s∧w} via sketching (Halko et al., 2011; Woodruff et al., 2014). (i) We first reduce both D_s, D_w to the same lower dimension D = 0.01 min{D_s, D_w} (with D ≫ max{d_s, d_w}) by subsampling columns of Φ_s, Φ_w (uniformly for efficiency, or adaptively via sketching-based interpolative decomposition (Dong & Martinsson,
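Step (i) above, in its uniform variant, amounts to keeping a random subset of feature columns so that both feature matrices share the same width D. The sketch below covers only that step; the later steps of the estimator are elided in this excerpt and are not reproduced. The function names, the random matrices, and their sizes are hypothetical, and the toy sizes do not satisfy the D ≫ max{d_s, d_w} condition that the text requires in practice.

```python
import numpy as np

def common_reduced_dim(D_s, D_w, fraction=0.01):
    """Shared target width D = fraction * min(D_s, D_w), at least 1."""
    return max(1, int(fraction * min(D_s, D_w)))

def subsample_columns(phi, target_dim, rng):
    """Keep target_dim columns of phi, chosen uniformly without replacement."""
    idx = rng.choice(phi.shape[1], size=target_dim, replace=False)
    return phi[:, np.sort(idx)]  # sort to preserve the original column order

rng = np.random.default_rng(0)
phi_s = rng.standard_normal((50, 3000))  # hypothetical strong features, D_s = 3000
phi_w = rng.standard_normal((50, 2000))  # hypothetical weak features, D_w = 2000

D = common_reduced_dim(phi_s.shape[1], phi_w.shape[1])
phi_s_red = subsample_columns(phi_s, D, rng)
phi_w_red = subsample_columns(phi_w, D, rng)
```

Uniform sampling needs no pass over the data beyond indexing, which is presumably why the text prefers it for efficiency over the adaptive alternative.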

[42] when ⁸ Notice that f_s, f_w are scalar-valued functions for binary classification tasks like SST-2, and thus the gradients ∇_θ f_s and ∇_θ f_w are row vectors. For multi-class classification tasks where f_s, f_w output vectors of logits, a common heuristic to keep Φ_s, Φ_w as matrices of manageable sizes (in contrast to tensors) is to replace gradients of the models,...