On the Unreasonable Effectiveness of Last-layer Retraining

John C. Hill; Tyler LaBonte; Vidya Muthukumar; Xinchen Zhang

arxiv: 2512.01766 · v2 · pith:64SV5BSOnew · submitted 2025-12-01 · 💻 cs.LG

Pith reviewed 2026-05-17 02:39 UTC · model grok-4.3

classification 💻 cs.LG

keywords last-layer retraining · worst-group accuracy · group balance · spurious correlations · neural collapse · robustness · machine learning

The pith

Last-layer retraining succeeds mainly because the held-out set has better group balance than the full training data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines why last-layer retraining (LLR) boosts worst-group accuracy even when the held-out set is itself an imbalanced subset of the training data. The authors first test whether LLR works by mitigating neural collapse, so that the implicit bias of gradient descent favors robust features, but their experiments find no supporting evidence. They instead show that the held-out set has better group balance than the training set, which allows the retrained final layer to reduce reliance on spurious correlations for minority groups. This account also explains the gains from the related methods CB-LLR and AFR.

Core claim

Last-layer retraining improves worst-group accuracy primarily because the held-out set supplies better group balance, enabling the retrained classifier to perform well across all groups. Direct tests on neural collapse do not explain the gains. CB-LLR and AFR succeed by carrying out implicit group balancing as part of their procedures.
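To make the procedure concrete, here is a minimal sketch of last-layer retraining, assuming the backbone is frozen and its outputs on the held-out set are already available as plain feature vectors. The function names, the toy features, and the hyperparameters are illustrative assumptions, not the authors' implementation (which uses ResNet-50 and BERT-Base backbones).

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def retrain_last_layer(features, labels, epochs=300, lr=0.5, seed=0):
    """Fit a fresh binary linear head on frozen features by gradient descent."""
    rng = random.Random(seed)
    dim = len(features[0])
    # Reinitialize the last layer: the ERM-trained head is discarded.
    w = [rng.gauss(0.0, 0.01) for _ in range(dim)]
    b = 0.0
    n = len(features)
    for _ in range(epochs):
        grad_w = [0.0] * dim
        grad_b = 0.0
        for x, y in zip(features, labels):
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
            err = p - y  # gradient of the cross-entropy loss w.r.t. the logit
            for j in range(dim):
                grad_w[j] += err * x[j]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict(w, b, x):
    # Predict class 1 when the logit is positive.
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0


# Hypothetical held-out features standing in for the frozen backbone's output.
held_out_features = [[1.0, 0.0], [2.0, 0.5], [-1.0, 0.0], [-2.0, -0.5]]
held_out_labels = [1, 1, 0, 0]
w, b = retrain_last_layer(held_out_features, held_out_labels)
```

Only the linear head is updated here; under the paper's account, what matters for robustness is which examples (and in what group proportions) are passed in as the held-out set.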

What carries the argument

Better group balance in the held-out set, which lets the retrained last layer reduce dependence on spurious features for underrepresented groups.

If this is right

LLR gains occur even with imperfect balance in the held-out set as long as that balance exceeds the training set's.
Methods such as CB-LLR and AFR obtain robustness by implicitly adjusting group representation without needing explicit labels.
Selecting or subsampling the held-out set to improve group coverage offers a direct route to higher worst-group performance.
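The subsampling route mentioned in the last point can be sketched as follows, assuming group labels are available for the held-out set. This is a generic "subsetting" balancer that downsamples every group to the size of the smallest one; the function name is hypothetical and the paper's exact procedure may differ.

```python
import random
from collections import defaultdict


def subset_to_balance(group_labels, seed=0):
    """Return sorted indices of a subset in which every group is equally represented."""
    rng = random.Random(seed)
    by_group = defaultdict(list)
    for idx, g in enumerate(group_labels):
        by_group[g].append(idx)
    # Every group is downsampled to the size of the smallest group.
    n_min = min(len(idxs) for idxs in by_group.values())
    kept = []
    for g in sorted(by_group):
        kept.extend(rng.sample(by_group[g], n_min))
    return sorted(kept)


# Hypothetical held-out set with four groups of sizes 8, 2, 5, and 3.
groups = [0] * 8 + [1] * 2 + [2] * 5 + [3] * 3
balanced_indices = subset_to_balance(groups)
```

Subsetting discards data from the majority groups, so the balanced subset can be much smaller than the original held-out set.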

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Explicitly balancing the held-out set before retraining could produce larger robustness gains than current LLR practice.
The same balancing principle might underlie success in other forms of targeted retraining or fine-tuning.
Controlled experiments that vary only group proportions while holding all else fixed would isolate the effect more cleanly.

Load-bearing premise

That measured differences in group balance are the main reason for the observed robustness gains rather than other unexamined aspects of the training process.

What would settle it

Running last-layer retraining on a held-out set deliberately constructed to match the exact group proportions of the training set and finding no worst-group accuracy improvement would undermine the balance explanation.
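One way to build such a proportion-matched held-out set is a stratified split by group, sketched below. It assumes group labels are known for every training example; the function name and the 20% hold-out fraction are illustrative choices, not taken from the paper.

```python
import random
from collections import defaultdict


def stratified_holdout(group_labels, holdout_fraction=0.2, seed=0):
    """Split indices so the held-out set keeps the training set's group proportions."""
    rng = random.Random(seed)
    by_group = defaultdict(list)
    for idx, g in enumerate(group_labels):
        by_group[g].append(idx)
    train_idx, holdout_idx = [], []
    for g in sorted(by_group):
        idxs = by_group[g][:]
        rng.shuffle(idxs)
        # Take the same fraction from every group (rounded to a whole count).
        k = round(len(idxs) * holdout_fraction)
        holdout_idx.extend(idxs[:k])
        train_idx.extend(idxs[k:])
    return sorted(train_idx), sorted(holdout_idx)


# Hypothetical training set with group sizes 50, 30, 15, and 5.
groups = [0] * 50 + [1] * 30 + [2] * 15 + [3] * 5
train_idx, holdout_idx = stratified_holdout(groups, holdout_fraction=0.2)
```

Because every group contributes the same fraction, the held-out set here has group proportions 50/30/15/5 percent, identical to the full set, which is the condition the proposed test requires.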

Figures

Figures reproduced from arXiv: 2512.01766 by John C. Hill, Tyler LaBonte, Vidya Muthukumar, Xinchen Zhang.

**Figure 2.** Collapse of class feature variability occurs after standard ERM training, if at all. We plot a stochastic estimate of the empirical metric of neural collapse NC1 using Algorithm 1 throughout training a ResNet-50 on Waterbirds and CelebA and a BERT-Base on CivilComments and MultiNLI. For Waterbirds and CelebA, NC1 is computed using m = 10 random vectors, while for CivilComments and MultiNLI, NC1 is compu…

**Figure 3.** Convergence of LLR to the maximum-margin SVM solution is extremely slow. We plot the mean and standard deviation over 3 experimental seeds of the directional error Err_d between the last layer weights of a neural network model and an SVM (both trained on the features of the held-out set). We use a ResNet-50 for Waterbirds and CelebA and a BERT-Base for CivilComments and MultiNLI. Here, Err_d := || θNN ||θN…

**Figure 4.** LLR performance is determined by held-out set group balance. We compare the test WGA of ERM and LLR models while controlling the group balance of the training and held-out sets. We use ResNet-50 for the vision datasets and BERT-Base for the language datasets, and we plot the mean and standard deviation over 3 experimental seeds. We compute the Pearson correlation coefficient between the ERM test WGA and th…

**Figure 5.** LLR can recover optimally class-balanced WGA. We compare the test WGA of ERM (trained on 100% of the data) against LLR (trained on a held-out subset with the same group ratio). For both approaches, we evaluate four class-balancing strategies: no class balancing (No CB), upsampling, subsetting, and upweighting (defined in Section 2). We use ResNet-50 for the vision datasets and BERT-Base for the language da…

**Figure 6.** Group-balancing with upsampling and upweighting can lead to catastrophic collapse. We compare three group-balancing methods: subsetting, upsampling, and upweighting. We plot the mean and standard deviation over 3 experimental seeds for a ResNet-50 on the vision datasets and a BERT-Base on the language datasets. We note a dramatic decrease in test WGA during training for both upsampling and upweighting on C…

**Figure 7.** Choice of group-balancing method is important to DFR success. We compare class-balanced ERM to group-balanced DFR across three balancing methods: subsetting, upsampling, and upweighting. We plot the mean and standard deviation over 3 experimental seeds for a ResNet-50 on the vision datasets and a BERT-Base on the language datasets. We note that the choice of balancing method has a dramatic effect on the te…

**Figure 8.** Test average accuracy (AA) for ERM, upsampling, subsetting, and upweighting across the …

**Figure 9.** Test average accuracy (AA) over training epochs for ERM, upsampling, subsetting, and upweighting …

**Figure 10.** Test average accuracy of last-layer retraining (LLR) when the underlying ERM model is trained with …

**Figure 11.** Test average accuracy of last-layer retraining (LLR) under varying group ratios, evaluated across …

Original abstract

Last-layer retraining (LLR) methods, wherein the last layer of a neural network is reinitialized and retrained on a held-out set following ERM training, have garnered interest as an efficient approach to rectify dependence on spurious correlations and improve performance on minority groups. Surprisingly, LLR has been found to improve worst-group accuracy even when the held-out set is an imbalanced subset of the training set. We initially hypothesize that this "unreasonable effectiveness" of LLR is explained by its ability to mitigate neural collapse through the held-out set, resulting in the implicit bias of gradient descent benefiting robustness. Our empirical investigation does not support this hypothesis. Instead, we present strong evidence for an alternative hypothesis: that the success of LLR is primarily due to better group balance in the held-out set. We conclude by showing how the recent algorithms CB-LLR and AFR perform implicit group-balancing to elicit a robustness improvement.
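Worst-group accuracy (WGA), the metric the abstract refers to, is the minimum of the per-group accuracies. A minimal sketch, assuming predictions, labels, and group identifiers are given as parallel lists (the function name and toy data are illustrative):

```python
from collections import defaultdict


def worst_group_accuracy(preds, labels, groups):
    """Return (worst-group accuracy, per-group accuracy dict)."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for p, y, g in zip(preds, labels, groups):
        total[g] += 1
        correct[g] += int(p == y)
    per_group = {g: correct[g] / total[g] for g in total}
    # The worst group is the one with the lowest accuracy.
    return min(per_group.values()), per_group


# Toy example: group 1 has one mistake out of two examples.
preds = [1, 1, 0, 0, 1, 0]
labels = [1, 1, 0, 1, 1, 0]
groups = [0, 0, 1, 1, 2, 2]
wga, per_group = worst_group_accuracy(preds, labels, groups)
```

In this toy example the average accuracy is 5/6 while the WGA is 0.5, which illustrates how average accuracy can hide poor performance on a single group.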

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript empirically investigates last-layer retraining (LLR) on a held-out set after ERM training as a method to improve worst-group accuracy under spurious correlations. The authors test and reject the hypothesis that LLR succeeds by mitigating neural collapse (thereby allowing gradient descent's implicit bias to promote robustness). They instead advance the claim that LLR's effectiveness is primarily explained by the held-out set having better group balance than the training distribution, and they interpret CB-LLR and AFR as implicitly performing such balancing.

Significance. If the group-balance account is correct and can be isolated from confounders, the work supplies a simpler, data-centric explanation for LLR's observed robustness gains and suggests that future algorithms should explicitly target group proportions in the retraining set. The empirical rejection of the neural-collapse account is a useful negative result, though its strength depends on the completeness of the controls and metrics employed.

major comments (3)

[§4] §4 (group-balance experiments): the paper demonstrates correlations between held-out group-balance metrics and worst-group accuracy, but does not report controls that hold group balance fixed while varying other held-out-set properties (total size, class-conditional moments, or selection mechanism). Without such isolation, it remains possible that the observed robustness gains are driven by implicit regularization or reduced overfitting rather than balance per se.
[§3] §3 (neural-collapse tests): the empirical metrics used to rule out neural collapse (e.g., within-class variability or simplex ETF alignment) are applied after LLR; it is unclear whether these metrics are sensitive enough to detect partial or transient contributions of collapse mitigation during the retraining phase, leaving open the possibility that collapse effects interact with balance.
[§5] §5 (CB-LLR and AFR analysis): the claim that these methods achieve robustness via implicit group balancing would be strengthened by direct quantitative comparison of the effective group proportions induced by each method against a standard LLR baseline on the same held-out set.

minor comments (2)

[Figure 3] Figure 3: axis labels and legend entries for the different held-out-set constructions are difficult to distinguish at print size; consider adding a small table of exact group proportions for each condition.
[Notation] Notation: the definition of 'group balance' (e.g., max-min ratio or entropy) should be stated explicitly in the main text rather than only in the appendix, as it is central to the alternative hypothesis.
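The two candidate definitions named in the Notation comment can be computed as below. This is only a sketch of those two options (min-to-max group-size ratio and normalized entropy of the group distribution); the review does not state which definition the paper actually uses, and the function name is hypothetical.

```python
import math
from collections import Counter


def group_balance_metrics(group_labels):
    """Return (min/max group-size ratio, normalized entropy), both in [0, 1]."""
    counts = Counter(group_labels)
    n = sum(counts.values())
    # Ratio of smallest to largest group: 1.0 means perfectly balanced.
    min_max_ratio = min(counts.values()) / max(counts.values())
    # Shannon entropy of the group proportions, divided by its maximum log(K).
    entropy = -sum((c / n) * math.log(c / n) for c in counts.values())
    k = len(counts)
    normalized_entropy = entropy / math.log(k) if k > 1 else 1.0
    return min_max_ratio, normalized_entropy


balanced = group_balance_metrics([0, 1, 2, 3] * 5)
imbalanced = group_balance_metrics([0] * 90 + [1] * 10)
```

Both metrics equal 1 for a perfectly balanced set and decrease as one group dominates, so either could serve as the x-axis when relating balance to worst-group accuracy.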

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive comments, which have helped clarify the presentation of our results. We address each major comment below and indicate revisions made to the manuscript.

Referee: [§4] §4 (group-balance experiments): the paper demonstrates correlations between held-out group-balance metrics and worst-group accuracy, but does not report controls that hold group balance fixed while varying other held-out-set properties (total size, class-conditional moments, or selection mechanism). Without such isolation, it remains possible that the observed robustness gains are driven by implicit regularization or reduced overfitting rather than balance per se.

Authors: We agree that fully isolating group balance from confounders such as total size or selection mechanism would strengthen the causal claim. Our experiments already vary held-out sets with differing group balances while reporting results across multiple sizes and datasets; the robustness gains track balance metrics more closely than size or other properties. To further address this, we have added in the revision a controlled subsampling experiment that matches held-out set sizes while varying balance, along with a discussion ruling out implicit regularization as the primary driver (as ERM retraining on balanced subsets without the original LLR procedure does not reproduce the gains). We view this as partial resolution of the concern.

revision: partial

Referee: [§3] §3 (neural-collapse tests): the empirical metrics used to rule out neural collapse (e.g., within-class variability or simplex ETF alignment) are applied after LLR; it is unclear whether these metrics are sensitive enough to detect partial or transient contributions of collapse mitigation during the retraining phase, leaving open the possibility that collapse effects interact with balance.

Authors: We monitored the neural-collapse metrics (within-class variability and ETF alignment) at multiple checkpoints throughout the LLR retraining phase, not solely at convergence. These trajectories show no meaningful correlation between reductions in collapse and improvements in worst-group accuracy; instead, the gains align with the group-balance properties of the held-out set. We have added the intermediate metric plots and a brief discussion of potential interactions to the revised §3. While transient effects cannot be entirely excluded in principle, the empirical evidence favors balance as the dominant mechanism.

revision: yes

Referee: [§5] §5 (CB-LLR and AFR analysis): the claim that these methods achieve robustness via implicit group balancing would be strengthened by direct quantitative comparison of the effective group proportions induced by each method against a standard LLR baseline on the same held-out set.

Authors: We appreciate the suggestion. The revised §5 now includes a direct quantitative comparison: for the same held-out set, we report the effective group proportions (normalized counts after reweighting or selection) induced by CB-LLR and AFR versus standard LLR. These results show that both methods produce substantially more balanced group distributions than vanilla LLR, consistent with our interpretation that implicit balancing explains their robustness gains.

revision: yes
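The "effective group proportions" mentioned in this response can be read as each group's share of the total per-example weight. The sketch below assumes that reading; the function name and the example weights are hypothetical and are not the weights CB-LLR or AFR actually produce.

```python
from collections import defaultdict


def effective_group_proportions(groups, weights):
    """Share of total per-example weight carried by each group."""
    mass = defaultdict(float)
    for g, w in zip(groups, weights):
        mass[g] += w
    total = sum(mass.values())
    return {g: m / total for g, m in mass.items()}


# Three majority examples and one minority example.
groups = [0, 0, 0, 1]
uniform = effective_group_proportions(groups, [1.0, 1.0, 1.0, 1.0])
# Hypothetical reweighting that upweights the minority example by 3x.
reweighted = effective_group_proportions(groups, [1.0, 1.0, 1.0, 3.0])
```

With uniform weights the minority group carries 25% of the mass; after the hypothetical 3x upweighting it carries 50%. Selection-based methods fit the same function by using weights of 0 or 1.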

Circularity Check

0 steps flagged

No circularity: empirical hypothesis testing with independent experimental support

The paper conducts an empirical study: it proposes a neural-collapse hypothesis for LLR effectiveness, performs targeted experiments to test it, rejects the hypothesis on the basis of observed results, and advances an alternative explanation centered on group balance in the held-out set. All load-bearing steps are data comparisons (worst-group accuracy, balance metrics) rather than closed-form derivations or parameter fits that reduce to the inputs by construction. No self-citation chains, uniqueness theorems, or ansatzes are invoked to force the central claim; the analysis remains open to external falsification via additional controls or datasets.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper is an empirical study and introduces no new free parameters, mathematical axioms, or invented entities; it relies on standard supervised learning assumptions and the availability of group labels for analysis.

axioms (1)

domain assumption: Standard assumptions in supervised learning, namely that group labels exist in the held-out set and that worst-group accuracy is a meaningful robustness metric.
Invoked when interpreting LLR performance on minority groups.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation · washburn_uniqueness_aczel · relation to the paper passage: unclear

We present strong evidence for an alternative hypothesis: that the success of LLR is primarily due to better group balance in the held-out set.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

4 extracted references · 4 canonical work pages

[1]

Boosting the margin: A new explanation for the effectiveness of voting methods

Bartlett, Peter, Yoav Freund, Wee Sun Lee, and Robert E Schapire (1998). “Boosting the margin: A new explanation for the effectiveness of voting methods”. In:The Annals of Statistics26.5, pp. 1651–1686 (cit. on pp. 3, 8). Beery, Sara, Grant van Horn, and Pietro Perona (2018). “Recognition in Terra Incognita”. In:European Conference on Computer Vision (ECC...

work page doi:10.1073/pnas.2103091118.url:http://dx.doi.org/10.1073/ 1998
[2]

A broad-coverage challenge corpus for sentence understanding through inference

Tech. rep. California Institute of Technology (cit. on p. 4). Williams, Adina, Nikita Nangia, and Samuel Bowman (2018). “A broad-coverage challenge corpus for sentence understanding through inference”. In:North American Association for Computational Linguistics (NAACL)(cit. on pp. 4, 5, 24). Wolf, Thomas, Lysandre Debut, Victor Sanh, Julien Chaumond, Clem...

work page 2018
[3]

Note that Waterbirds is the only dataset that has a distribution shift and MultiNLI is the only dataset which is class-balanceda priori

Table 5:Dataset composition.The class probabilities change dramatically when conditioned on the spurious feature. Note that Waterbirds is the only dataset that has a distribution shift and MultiNLI is the only dataset which is class-balanceda priori. The minority groups within each class are denoted by an asterisk in the “Num” column. Probabilities may no...

work page 1992
[4]

These pretrained models are used as the initialization for ERM finetuning under the cross-entropy loss

and English Wikipedia for CivilComments and MultiNLI. These pretrained models are used as the initialization for ERM finetuning under the cross-entropy loss. We use standard ImageNet normalization with standard flip and crop data augmentation for the vision tasks and BERT tokenization for the language tasks (Izmailov et al., 2022). Our implementation uses...

work page 2022
