Not All Forgetting Is Equal: Architecture-Dependent Retention Dynamics in Fine-Tuned Image Classifiers
Pith reviewed 2026-05-10 15:33 UTC · model grok-4.3
The pith
Architectures forget fundamentally different samples during fine-tuning, with per-sample patterns being stochastic across random seeds.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
During fine-tuning, ResNet-18 and DeiT-Small forget fundamentally different samples: the Jaccard overlap of the top 10 percent most-forgotten examples is only 0.34 on OCTDL and 0.15 on CUB-200. ViT forgetting is more structured, with a higher mean R² for exponential-decay fits (0.74 versus 0.52 for the CNN). Per-sample forgetting is stochastic across seeds (Spearman ρ ≈ 0.01), while class-level forgetting is consistent and semantically interpretable. A sample's loss after head warmup predicts its long-term decay constant (ρ = 0.30 to 0.50), yet a spaced-repetition sampler built on those constants does not beat random sampling.
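The overlap statistic in the core claim can be illustrated with a minimal sketch. It assumes each sample has a scalar "forgetting score" where higher means more forgotten; the function names and the toy scores are hypothetical, and the paper's exact ranking rule may differ.

```python
def top_fraction(scores, fraction=0.10):
    """Return the ids of the `fraction` of samples with the highest scores.

    `scores` maps sample id -> forgetting score (assumed: higher = more forgotten).
    """
    k = max(1, int(len(scores) * fraction))
    ranked = sorted(scores, key=scores.get, reverse=True)
    return set(ranked[:k])


def jaccard(a, b):
    """Jaccard overlap |A ∩ B| / |A ∪ B| of two sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Toy scores for 100 samples (illustrative only, not data from the paper).
cnn_scores = {i: float(i) for i in range(100)}              # top 10%: ids 90..99
vit_scores = {i: float((i + 5) % 100) for i in range(100)}  # top 10%: ids 85..94

overlap = jaccard(top_fraction(cnn_scores), top_fraction(vit_scores))
print(round(overlap, 3))  # 5 shared ids out of 15 in the union
```

In this toy case the two top-10% sets share 5 of 15 distinct ids, so the overlap is 1/3, close to the 0.34 reported for OCTDL.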
What carries the argument
Per-sample retention traces fitted to Ebbinghaus-style exponential decay curves, used to quantify and compare forgetting across architectures and random seeds.
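A minimal sketch of such a fit follows. The source does not state the functional form, so the single-parameter curve R(t) = exp(-t/τ) with no offset or scale is an assumption, as is the use of a simple grid search over τ in place of a nonlinear least-squares routine. The function name `fit_decay` and the synthetic trace are hypothetical.

```python
import math


def fit_decay(trace, taus=None):
    """Fit R(t) = exp(-t / tau) to a retention trace by grid search.

    `trace[t]` is the retention value at epoch t (assumed to lie in [0, 1]).
    Returns (best_tau, r_squared).
    """
    if taus is None:
        taus = [k / 10 for k in range(1, 1001)]  # candidate taus: 0.1 .. 100.0

    def sse(tau):
        # Sum of squared residuals between the trace and the model curve.
        return sum((y - math.exp(-t / tau)) ** 2 for t, y in enumerate(trace))

    best_tau = min(taus, key=sse)
    mean = sum(trace) / len(trace)
    ss_tot = sum((y - mean) ** 2 for y in trace)
    # R^2 is undefined for a constant trace, so report NaN in that case.
    r2 = 1.0 - sse(best_tau) / ss_tot if ss_tot > 0 else float("nan")
    return best_tau, r2


# Synthetic noiseless trace generated with tau = 5 epochs (illustrative only).
trace = [math.exp(-t / 5.0) for t in range(20)]
tau, r2 = fit_decay(trace)
print(tau, r2)
```

On real binary correctness traces the fit is much noisier than on this synthetic curve, which is why the per-architecture mean R² is reported alongside the decay constants.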
If this is right
- Architectural diversity in ensembles supplies complementary retention coverage.
- Curriculum or pruning methods based on per-sample difficulty fail to generalize across runs.
- Early loss after head warmup predicts long-term forgetting rates.
- Class-level forgetting follows visual similarity and remains stable.
- Static scheduling using decay constants brings no gain over random sampling.
Where Pith is reading between the lines
- Forgetting appears driven more by model inductive bias than by fixed sample properties, so mixing architectures could improve overall retention.
- Single-run difficulty rankings are too noisy for reliable data selection in pruning or curricula.
- Class-level semantic patterns could guide targeted augmentation on visually similar categories.
- The low per-sample stability suggests forgetting may need to be modeled as an ensemble property rather than an individual-sample trait.
Load-bearing premise
Fitting exponential decay curves to per-sample correctness traces over epochs yields a stable and meaningful characterization of forgetting that generalizes beyond the two datasets and training setups used.
What would settle it
Observing a Jaccard overlap above 0.6 between the most-forgotten samples of the CNN and ViT on the same data, or a Spearman correlation above 0.4 for per-sample forgetting ranks across random seeds.
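The seed-stability half of this test can be sketched with a plain Spearman rank correlation. The implementation below is a standard-library sketch (average ranks for ties, then Pearson correlation of the ranks); the per-sample values and the variable names are hypothetical, and the 0.4 threshold is the one stated above.

```python
def average_ranks(values):
    """Return 1-based ranks of `values`, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of entries tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rho: Pearson correlation of the ranks (undefined for constant input)."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / (vx * vy) ** 0.5


# Hypothetical per-sample forgetting values from two seeds (same sample order).
seed_a = [0.1, 0.5, 0.3, 0.9, 0.7]
seed_b = [0.2, 0.4, 0.35, 0.8, 0.75]
rho = spearman(seed_a, seed_b)
contradicts_claim = rho > 0.4  # threshold taken from the text above
print(rho, contradicts_claim)
```

The toy seeds rank the samples identically, so they would count as evidence against the stochasticity claim; the paper reports ρ ≈ 0.01 instead.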
Original abstract
Fine-tuning pretrained image classifiers is standard practice, yet which individual samples are forgotten during this process, and whether forgetting patterns are stable or architecture dependent, remains unclear. Understanding these dynamics has direct implications for curriculum design, data pruning, and ensemble construction. We track per-sample correctness at every epoch during fine-tuning of ResNet-18 and DeiT-Small on a retinal OCT dataset (7 classes, 56:1 imbalance) and CUB-200-2011 (200 bird species), fitting Ebbinghaus-style exponential decay curves to each sample's retention trace. Five findings emerge. First, the two architectures forget fundamentally different samples: Jaccard overlap of the top 10 percent most-forgotten is 0.34 on OCTDL and 0.15 on CUB-200. Second, ViT forgetting is more structured (mean $R^2 = 0.74$) than CNN forgetting ($R^2 = 0.52$). Third, per-sample forgetting is stochastic across random seeds (Spearman $\rho \approx 0.01$), challenging the assumption that sample difficulty is an intrinsic property. Fourth, class-level forgetting is consistent and semantically interpretable: visually similar species are forgotten most, distinctive ones least. Fifth, a sample's loss after head warmup predicts its long-term decay constant ($\rho = 0.30$ to $0.50$, $p < 10^{-45}$). These findings suggest that architectural diversity in ensembles provides complementary retention coverage, and that curriculum or pruning methods based on per-sample difficulty may not generalize across runs. A spaced repetition sampler built on these decay constants does not outperform random sampling, indicating that static scheduling cannot exploit unstable per-sample signals.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that fine-tuning ResNet-18 and DeiT-Small on OCTDL and CUB-200-2011 leads to architecture-dependent forgetting of different samples, quantified by low Jaccard overlap (0.34 and 0.15) in the top 10% most-forgotten samples when ranked by decay constants from Ebbinghaus-style exponential fits to per-epoch per-sample correctness traces. It further reports higher mean R² for ViT traces (0.74) than CNN (0.52), near-zero Spearman correlation of per-sample forgetting across random seeds (ρ≈0.01), semantically consistent class-level forgetting patterns, and moderate predictive correlation (ρ=0.30–0.50) between post-warmup loss and long-term decay constants. The work concludes that architectural diversity in ensembles provides complementary retention and that static per-sample difficulty signals are unreliable for curriculum or pruning.
Significance. If the reported architecture-dependent and stochastic forgetting patterns hold under more rigorous validation of the per-sample rankings, the results would be significant for fine-tuning practices in computer vision. They provide empirical evidence that different architectures retain complementary information, which could guide ensemble construction, and they challenge the use of fixed difficulty metrics in data pruning or curriculum design. The direct per-epoch trace analysis offers a concrete, falsifiable approach to studying retention dynamics with potential downstream impact on training efficiency and generalization.
major comments (2)
- [Abstract] Abstract: The mean R²=0.52 for ResNet-18 traces indicates that the exponential decay model explains only half the variance on average. Since the headline Jaccard overlaps (0.34 on OCTDL, 0.15 on CUB-200) are computed from rankings of samples by the fitted decay constant, this moderate fit quality means epoch-to-epoch label noise or non-monotonic traces can perturb the top-10% sets. The architecture-dependence conclusion therefore depends on the stability of these noisy orderings; a sensitivity analysis (e.g., bootstrapped rankings or restriction to high-R² samples) is required to establish that the low overlap is not an artifact of fit quality.
- [Results] Results: No error bars, confidence intervals, or details on the number of random seeds, exact train/validation splits, or multiple-testing corrections are reported for the Jaccard indices, Spearman correlations, or R² values. Given the 56:1 class imbalance in OCTDL and the per-sample fitting procedure, these omissions prevent verification of whether the quantitative claims (including the stochasticity result ρ≈0.01) are statistically robust.
minor comments (2)
- [Methods] The precise functional form of the 'Ebbinghaus-style' exponential decay (including any offset or scaling parameters) should be stated explicitly, along with the fitting procedure (e.g., least-squares on correctness probability or log-odds).
- [Abstract] The spaced-repetition sampler experiment would benefit from reporting the exact performance metric, number of trials, and statistical test against random sampling to support the claim that it 'does not outperform'.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive comments, which highlight important aspects of statistical robustness and the stability of our per-sample rankings. We agree that additional analyses are warranted to strengthen the claims regarding architecture-dependent forgetting. Below we provide point-by-point responses to the major comments and describe the revisions we will incorporate.
- Referee: [Abstract] Abstract: The mean R²=0.52 for ResNet-18 traces indicates that the exponential decay model explains only half the variance on average. Since the headline Jaccard overlaps (0.34 on OCTDL, 0.15 on CUB-200) are computed from rankings of samples by the fitted decay constant, this moderate fit quality means epoch-to-epoch label noise or non-monotonic traces can perturb the top-10% sets. The architecture-dependence conclusion therefore depends on the stability of these noisy orderings; a sensitivity analysis (e.g., bootstrapped rankings or restriction to high-R² samples) is required to establish that the low overlap is not an artifact of fit quality.
Authors: We acknowledge that the mean R² of 0.52 for ResNet-18 indicates moderate fit quality for the exponential model on CNN traces, which could introduce noise into individual decay constant estimates and potentially affect the stability of top-10% rankings. However, the low Jaccard overlaps are observed consistently across two datasets with differing characteristics (imbalanced OCTDL and balanced CUB-200), and the higher R² for ViTs (0.74) is itself part of the architecture-specific finding. In the revision we will add a sensitivity analysis by (i) recomputing Jaccard overlaps after restricting to samples with R² > 0.6 and (ii) reporting bootstrapped 95% confidence intervals on the Jaccard indices via resampling of the decay constants. We expect the overlaps to remain low, supporting that the architecture dependence is not an artifact of fit noise. We will also expand the discussion to note that lower R² for CNNs reflects more stochastic retention dynamics. revision: yes
- Referee: [Results] Results: No error bars, confidence intervals, or details on the number of random seeds, exact train/validation splits, or multiple-testing corrections are reported for the Jaccard indices, Spearman correlations, or R² values. Given the 56:1 class imbalance in OCTDL and the per-sample fitting procedure, these omissions prevent verification of whether the quantitative claims (including the stochasticity result ρ≈0.01) are statistically robust.
Authors: We agree that the original manuscript omitted error bars, exact seed counts, split details, and multiple-testing clarifications, which limits verifiability especially given OCTDL's class imbalance. In the revised manuscript we will report: the use of 5 random seeds for all main experiments, the precise train/validation splits (80/20 stratified by class), and 95% confidence intervals or standard errors for all quantitative metrics (Jaccard indices, Spearman ρ values, and mean R²). For the stochasticity result (ρ≈0.01 across seeds), we will add pairwise seed comparisons and confirm the near-zero correlation holds with statistical testing. Regarding multiple testing, the per-sample loss-to-decay correlations have p < 10^{-45}; we will note that these survive Bonferroni correction across the sample set. The class imbalance concern is mitigated by the replication of core patterns on the balanced CUB-200 dataset, which we will emphasize. revision: yes
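The sensitivity analysis proposed in the first response (restriction to well-fitted samples plus a bootstrapped interval on the Jaccard index) could look roughly like the sketch below. It is one possible design, not the authors' procedure: resampling sample ids with replacement and deduplicating them, the percentile interval, and all names and toy values are assumptions.

```python
import random


def top_set(scores, ids, fraction=0.10):
    """Top `fraction` of `ids` by score (assumed: higher score = more forgotten)."""
    k = max(1, int(len(ids) * fraction))
    return set(sorted(ids, key=lambda i: scores[i], reverse=True)[:k])


def jaccard(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def bootstrap_jaccard_ci(scores_a, scores_b, ids, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the top-10% Jaccard overlap."""
    rng = random.Random(seed)
    stats = []
    for _ in range(n_boot):
        # Resample ids with replacement, then keep each drawn id once.
        draw = sorted(set(rng.choices(ids, k=len(ids))))
        stats.append(jaccard(top_set(scores_a, draw), top_set(scores_b, draw)))
    stats.sort()
    lo = stats[int((alpha / 2) * n_boot)]
    hi = stats[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return lo, hi


# Toy inputs (illustrative only): two score tables and per-sample R^2 values.
ids = list(range(200))
scores_cnn = {i: float(i) for i in ids}
scores_vit = {i: float((i * 7) % 200) for i in ids}
r2_cnn = {i: 0.9 if i % 2 == 0 else 0.3 for i in ids}
r2_vit = {i: 0.8 for i in ids}

# Restrict to samples whose fits exceed R^2 = 0.6 in both architectures.
well_fit = [i for i in ids if r2_cnn[i] > 0.6 and r2_vit[i] > 0.6]
ci_low, ci_high = bootstrap_jaccard_ci(scores_cnn, scores_vit, well_fit)
print(len(well_fit), ci_low, ci_high)
```

If the interval on the restricted set stays well below the 0.6 level that would contradict the claim, the low overlap is less likely to be an artifact of poorly fitted traces.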
Circularity Check
No significant circularity; all central results are independent empirical computations from direct measurements.
The paper records per-epoch correctness traces for each sample during fine-tuning, fits Ebbinghaus-style exponentials post-hoc as a descriptive tool, ranks samples by the resulting decay constants, and then computes Jaccard overlaps, Spearman correlations, and class-level patterns directly from those rankings and traces. None of the reported quantities (Jaccard 0.15-0.34, stochasticity ρ≈0.01, loss-to-decay correlation ρ=0.30-0.50) is defined in terms of itself or reduces by the paper's own equations to a quantity already fixed by the inputs. The exponential fit is an analysis step whose quality is separately reported (mean R² values), not a self-referential premise. No self-citations, uniqueness theorems, or ansatzes imported from prior author work appear as load-bearing steps.
Axiom & Free-Parameter Ledger
free parameters (1)
- per-sample exponential decay parameters
axioms (1)
- domain assumption: Per-sample correctness over training epochs can be meaningfully summarized by an exponential decay curve