Discrepancies are Virtue: Weak-to-Strong Generalization through Lens of Intrinsic Dimension
Pith reviewed 2026-05-23 03:21 UTC · model grok-4.3
The pith
In weak-to-strong finetuning, the weak teacher's variance is scaled by a factor of dim(V_s)/N in the directions where the teacher and student feature subspaces differ.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
For a strong student-weak teacher pair whose finetuning occurs in sufficiently expressive low-dimensional subspaces V_s and V_w, the variance term that dominates generalization error is inherited from the weak teacher throughout V_s ∩ V_w and is multiplied by the factor dim(V_s)/N throughout the discrepancy subspace V_w ∖ V_s.
What carries the argument
The decomposition of variance across the intersection V_s ∩ V_w and the difference V_w ∖ V_s in the ridgeless regression on intrinsic feature subspaces.
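Written out schematically, the decomposition reads as follows. This is a leading-order sketch of the claim as summarized here, not the paper's exact statement: $\sigma^2$ (label-noise variance), $n$ (labeled samples used to fit the weak teacher), $d_w = \dim(\mathcal{V}_w)$, $d_s = \dim(\mathcal{V}_s)$, and $d_{s \wedge w}$ (dimension of the overlap) are notation assumed for illustration, and the weak-teacher rate $\sigma^2 d_w / n$ is the standard least-squares rate rather than a quoted result.

```latex
\[
\mathrm{Var}(f_w) \approx \frac{\sigma^2 d_w}{n},
\qquad
\mathrm{Var}(f_{w2s}) \approx
\underbrace{\frac{\sigma^2 d_{s \wedge w}}{n}}_{\text{inherited in } \mathcal{V}_s \cap \mathcal{V}_w}
+ \underbrace{\frac{d_s}{N} \cdot \frac{\sigma^2 \,(d_w - d_{s \wedge w})}{n}}_{\text{reduced in } \mathcal{V}_w \setminus \mathcal{V}_s}
\]
```

When the two subspaces coincide, $d_{s \wedge w} = d_w$ and the second term vanishes, so the student simply inherits the teacher's variance.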
If this is right
- The strong student recovers performance from the weak teacher once N exceeds a multiple of dim(V_s) in the discrepant directions.
- Sample complexity of weak-to-strong generalization is governed by the dimension of V_s rather than the ambient input dimension.
- The performance gap shrinks proportionally to the relative size of the discrepancy subspace.
- Variance reduction is absent when the two subspaces coincide, recovering ordinary weak-teacher error.
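The consequences listed above can be checked with a few lines of arithmetic. The sketch below assumes a leading-order form of the variance, sigma^2/n * [d_overlap + (d_s/N) * (d_w - d_overlap)], which is one reading of the core claim rather than a formula quoted from the paper. The function names and parameters are hypothetical.

```python
def teacher_variance(d_w, n, sigma2=1.0):
    """Standard least-squares variance rate for a weak teacher fit on n labels."""
    return sigma2 * d_w / n


def w2s_variance(d_s, d_w, d_overlap, n, N, sigma2=1.0):
    """Assumed leading-order W2S variance (illustrative, not the paper's exact result).

    The overlap term is inherited unchanged from the teacher; the discrepancy
    term is the teacher's variance there, scaled by d_s / N.
    """
    if not 0 <= d_overlap <= min(d_s, d_w):
        raise ValueError("d_overlap must lie between 0 and min(d_s, d_w)")
    inherited = sigma2 * d_overlap / n
    reduced = (d_s / N) * sigma2 * (d_w - d_overlap) / n
    return inherited + reduced


if __name__ == "__main__":
    # Identical subspaces: no reduction, the student matches the teacher.
    print(w2s_variance(d_s=10, d_w=10, d_overlap=10, n=200, N=2000))
    print(teacher_variance(d_w=10, n=200))
    # Fully discrepant subspaces: the whole teacher variance is scaled by d_s / N.
    print(w2s_variance(d_s=20, d_w=10, d_overlap=0, n=200, N=2000))
```

Under this assumed form, the gain over the teacher depends only on how much of the teacher's subspace lies outside the student's and on the ratio d_s/N, not on the ambient input dimension.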
Where Pith is reading between the lines
- The same subspace decomposition could be used to select or design weak teachers that maximize the useful discrepancy volume.
- The analysis supplies a concrete test for whether a given pair of models will exhibit the observed weak-to-strong improvement before running finetuning.
- If the low-dimensional assumption holds for other label-noise settings, similar variance-reduction gains may appear outside weak-to-strong finetuning.
Load-bearing premise
Finetuning of both models occurs inside intrinsically low-dimensional feature subspaces that justify the ridgeless regression analysis.
What would settle it
Measure the student's prediction variance separately inside and outside the estimated discrepancy subspace; if the reduction factor is not close to dim(V_s)/N after training on N pseudo-labels, the claimed variance characterization does not hold.
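A toy version of this test can be run on synthetic data. The sketch below is an illustration under strong assumptions, not the paper's experimental setup: features are isotropic Gaussian, the two subspaces are coordinate-aligned, the true signal lies in their overlap (so both models are unbiased), and both fits are plain least squares. All names and default sizes are hypothetical.

```python
import numpy as np


def simulate_w2s(d_s=20, d_w=10, d_overlap=4, n=200, N=2000,
                 sigma=1.0, trials=200, seed=0):
    """Monte Carlo estimate of teacher and student error in a toy W2S setup."""
    rng = np.random.default_rng(seed)
    D = d_s + d_w - d_overlap                  # ambient dimension of the toy problem
    s_idx = np.arange(d_s)                     # student coordinates
    w_idx = np.arange(d_s - d_overlap, D)      # teacher coordinates
    overlap = np.arange(d_s - d_overlap, d_s)  # shared coordinates
    disc = np.arange(d_s, D)                   # teacher-only (discrepancy) coordinates

    theta = np.zeros(D)
    theta[overlap] = 1.0                       # ground truth lives in the overlap

    sums = {"teacher_risk": 0.0, "student_risk": 0.0,
            "teacher_disc": 0.0, "student_disc": 0.0}
    for _ in range(trials):
        # Weak teacher: least squares on n noisy labels, restricted to its subspace.
        X = rng.standard_normal((n, D))
        y = X @ theta + sigma * rng.standard_normal(n)
        theta_w = np.zeros(D)
        theta_w[w_idx] = np.linalg.lstsq(X[:, w_idx], y, rcond=None)[0]

        # Strong student: least squares on N pseudo-labels, restricted to its subspace.
        Xu = rng.standard_normal((N, D))
        theta_s = np.zeros(D)
        theta_s[s_idx] = np.linalg.lstsq(Xu[:, s_idx], Xu @ theta_w, rcond=None)[0]

        # By linearity of least squares, this is the part of the student's fit
        # that comes only from the teacher's error in the discrepancy coordinates.
        from_disc = np.linalg.lstsq(
            Xu[:, s_idx], Xu[:, disc] @ theta_w[disc], rcond=None)[0]

        # With isotropic features, excess risk equals squared parameter error.
        sums["teacher_risk"] += np.sum((theta_w - theta) ** 2)
        sums["student_risk"] += np.sum((theta_s - theta) ** 2)
        sums["teacher_disc"] += np.sum(theta_w[disc] ** 2)
        sums["student_disc"] += np.sum(from_disc ** 2)
    return {k: v / trials for k, v in sums.items()}


if __name__ == "__main__":
    r = simulate_w2s()
    print("teacher risk:", r["teacher_risk"])
    print("student risk:", r["student_risk"])
    print("measured reduction factor:", r["student_disc"] / r["teacher_disc"])
    print("claimed factor d_s / N:", 20 / 2000)
```

In this toy model the measured reduction factor should sit near d_s/N; a large, persistent gap would count against the claimed characterization, at least under these simplified assumptions.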
Original abstract
Weak-to-strong (W2S) generalization is a type of finetuning (FT) where a strong (large) student model is trained on pseudo-labels generated by a weak teacher. Surprisingly, W2S FT often outperforms the weak teacher. We seek to understand this phenomenon through the observation that FT often occurs in intrinsically low-dimensional spaces. Leveraging the low intrinsic dimensionality of FT, we analyze W2S in the ridgeless regression setting from a variance reduction perspective. For a strong student-weak teacher pair with sufficiently expressive low-dimensional feature subspaces $\mathcal{V}_s, \mathcal{V}_w$, we provide an exact characterization of the variance that dominates the generalization error of W2S. This unveils a virtue of discrepancy between the strong and weak models in W2S: the variance of the weak teacher is inherited by the strong student in $\mathcal{V}_s \cap \mathcal{V}_w$, while reduced by a factor of $\mathrm{dim}(\mathcal{V}_s)/N$ in the subspace of discrepancy $\mathcal{V}_w \setminus \mathcal{V}_s$ with $N$ pseudo-labels for W2S. Our analysis further casts light on the sample complexities and the scaling of performance gap recovery in W2S. The analysis is supported by experiments on synthetic regression problems, as well as real vision and NLP tasks.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper analyzes weak-to-strong (W2S) generalization, in which a strong student model is finetuned on pseudo-labels from a weak teacher. Observing that finetuning occurs in intrinsically low-dimensional spaces, it models the setting as ridgeless regression on fixed low-dimensional feature subspaces V_s (strong) and V_w (weak). Under the assumption that these subspaces are sufficiently expressive, it derives an exact characterization of the variance term that dominates generalization error: the weak teacher's variance is inherited by the student in the intersection V_s ∩ V_w, while reduced by a factor of dim(V_s)/N in the discrepancy subspace V_w ∖ V_s (with N pseudo-labels). The analysis also addresses sample complexities and scaling of performance-gap recovery, and is supported by synthetic regression experiments plus real vision and NLP tasks.
Significance. If the low-dimensional subspace modeling of finetuning holds for W2S, the work supplies a first-principles variance-reduction account of why discrepancy between weak teacher and strong student can be beneficial, rather than merely an empirical curiosity. The exact, parameter-free characterization under the ridgeless model is a clear strength; it yields concrete predictions about when and how much the performance gap recovers with more pseudo-labels.
major comments (1)
- Abstract and theoretical derivation (presumably §3–4): the central claim of an 'exact characterization' of the dominating variance, and the resulting 'virtue of discrepancy,' is derived under the modeling premise that W2S finetuning can be represented as ridgeless regression on fixed, sufficiently expressive low-dimensional subspaces V_s and V_w. The only justification offered is the general statement that 'FT often occurs in intrinsically low-dimensional spaces'; no formal bound, effective-dimension estimate, or verification is supplied showing that this premise remains valid specifically during pseudo-label training, that the subspaces capture the dominant variance contributions in neural-network finetuning, or that the ridgeless approximation continues to dominate the generalization error once the student is large. Because the variance-inheritance and dim(V_s)/N reduction statements (
minor comments (1)
- Notation: the precise construction or approximation of the subspaces V_s and V_w is not stated explicitly when the real-task experiments are described; a short paragraph clarifying how these subspaces are obtained (or approximated) from the trained models would improve reproducibility.
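One generic way to quantify the overlap between two estimated feature subspaces is through principal angles. The sketch below is an illustration only: the review notes that the paper's construction is not stated explicitly, so this is not claimed to be the authors' procedure. The function name is hypothetical, and the inputs are assumed to be full-column-rank basis matrices living in a common ambient space.

```python
import numpy as np


def subspace_overlap(basis_s, basis_w):
    """Return (soft overlap dimension, cosines of principal angles).

    basis_s: (D, d_s) array whose columns span the student subspace.
    basis_w: (D, d_w) array whose columns span the teacher subspace.
    """
    # Orthonormalize each basis (reduced QR).
    q_s, _ = np.linalg.qr(basis_s)
    q_w, _ = np.linalg.qr(basis_w)
    # Singular values of Q_s^T Q_w are the cosines of the principal angles.
    cosines = np.linalg.svd(q_s.T @ q_w, compute_uv=False)
    cosines = np.clip(cosines, 0.0, 1.0)
    # Sum of squared cosines: equals the intersection dimension when the
    # subspaces share exact directions and are otherwise orthogonal.
    return float(np.sum(cosines ** 2)), cosines


if __name__ == "__main__":
    eye = np.eye(26)
    student = eye[:, :20]      # coordinates 0..19
    teacher = eye[:, 16:26]    # coordinates 16..25, sharing 4 with the student
    overlap, cosines = subspace_overlap(student, teacher)
    print(overlap)
    print(cosines)
```

The sum of squared cosines gives a non-integer "soft" overlap for subspaces that are only partially aligned, which is one reason to prefer it over counting exact shared directions.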
Simulated Author's Rebuttal
We thank the referee for the constructive review and for recognizing the potential value of our variance-reduction perspective on W2S generalization. We address the sole major comment below.
- Referee: Abstract and theoretical derivation (presumably §3–4): the central claim of an 'exact characterization' of the dominating variance, and the resulting 'virtue of discrepancy,' is derived under the modeling premise that W2S finetuning can be represented as ridgeless regression on fixed, sufficiently expressive low-dimensional subspaces V_s and V_w. The only justification offered is the general statement that 'FT often occurs in intrinsically low-dimensional spaces'; no formal bound, effective-dimension estimate, or verification is supplied showing that this premise remains valid specifically during pseudo-label training, that the subspaces capture the dominant variance contributions in neural-network finetuning, or that the ridgeless approximation continues to dominate the generalization error once the student is large. Because the variance-inheritance and dim(V_s)/N reduction statements
  Authors: Our analysis derives an exact variance characterization strictly inside the stated ridgeless regression model on fixed subspaces V_s and V_w (see §3–4). This modeling choice is motivated by the body of work on intrinsic low-dimensionality of fine-tuning (cited in §1 and §2), not presented as a formally proven property of pseudo-label training. The paper makes no claim of a general bound on effective dimension or dominance of the ridgeless regime for arbitrarily large students; the exact formulas are offered as an interpretable lens that isolates the effect of subspace discrepancy. Sections 5.1–5.3 supply empirical checks on both synthetic data generated from the model and real vision/NLP tasks, where the predicted dim(V_s)/N scaling and inheritance in the overlap are observed. We will add an explicit “Modeling Assumptions and Scope” paragraph in the revision to clarify these boundaries.
  Revision: partial
Circularity Check
No circularity: variance characterization follows directly from ridgeless regression algebra on assumed subspaces
The paper's central derivation computes the variance of the W2S estimator explicitly in the ridgeless linear regression setting on fixed subspaces V_s and V_w, yielding the stated inheritance and reduction factors as algebraic consequences of the projection and pseudo-label averaging. This calculation is self-contained within the model and does not reduce any claimed prediction to a fitted quantity from the target data, nor does it rely on self-citations for the uniqueness or validity of the variance expressions. The low-dimensional subspace premise is introduced as an external modeling choice motivated by prior empirical observations on finetuning, not derived from or tautological with the variance result itself.
Axiom & Free-Parameter Ledger
axioms (2)
- Domain assumption: finetuning occurs in intrinsically low-dimensional spaces.
- Domain assumption: the analysis is performed in the ridgeless regression setting.