Not All Forgetting Is Equal: Architecture-Dependent Retention Dynamics in Fine-Tuned Image Classifiers

Miit Daga; Swarna Priya Ramu

arXiv: 2604.11508 · v2 · submitted 2026-04-13 · cs.LG · cs.AI

Pith reviewed 2026-05-10 15:33 UTC · model grok-4.3

Keywords: forgetting dynamics, fine-tuning, image classifiers, architecture dependence, retention curves, sample difficulty, ensemble learning, curriculum design

The pith

Architectures forget fundamentally different samples during fine-tuning, with per-sample patterns being stochastic across random seeds.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tracks per-sample correctness at every epoch while fine-tuning ResNet-18 and DeiT-Small on an imbalanced retinal OCT dataset and on CUB-200 bird images. It fits Ebbinghaus-style exponential decay curves to each sample's retention trace and compares the resulting forgetting dynamics between the CNN and ViT architectures. The two models forget largely non-overlapping sets of samples, ViT traces fit the decay model more closely, and individual sample forgetting shows near-zero correlation across random seeds. Class-level forgetting remains consistent and aligns with visual similarity, while early loss after head warmup predicts later decay rates.

Core claim

During fine-tuning, ResNet-18 and DeiT-Small forget fundamentally different samples, shown by Jaccard overlaps of only 0.34 and 0.15 for the top 10 percent most-forgotten examples on the two datasets. ViT forgetting is more structured, with higher mean R-squared fits to exponential decay. Per-sample forgetting is stochastic across seeds with Spearman correlations near zero, while class-level forgetting is consistent and semantically interpretable. A sample's loss after head warmup predicts its long-term decay constant, yet a spaced-repetition sampler built on those constants does not beat random sampling.
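The overlap statistic in this claim can be sketched in a few lines. The snippet below is a minimal illustration, not the authors' code: the sample ids and forgetting scores are made up, and it assumes a higher score means a sample is forgotten more.

```python
def top_k_percent(scores, k):
    """Return the ids whose scores fall in the top k percent (at least one id)."""
    n_keep = max(1, round(len(scores) * k / 100))
    ranked = sorted(scores, key=scores.get, reverse=True)
    return set(ranked[:n_keep])


def jaccard(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| of two sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Hypothetical per-sample forgetting scores for two models (higher = forgotten more).
resnet_scores = dict(enumerate([0.90, 0.80, 0.10, 0.20, 0.70, 0.30, 0.05, 0.40, 0.60, 0.50]))
deit_scores = dict(enumerate([0.10, 0.85, 0.20, 0.95, 0.30, 0.40, 0.50, 0.60, 0.70, 0.05]))

top_resnet = top_k_percent(resnet_scores, 20)  # ids {0, 1}
top_deit = top_k_percent(deit_scores, 20)      # ids {1, 3}
overlap = jaccard(top_resnet, top_deit)        # one shared id out of three
print(top_resnet, top_deit, overlap)
```

A value of 1.0 would mean both models forget exactly the same samples; the reported 0.34 and 0.15 sit much closer to the disjoint end of that scale.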

What carries the argument

Per-sample retention traces fitted to Ebbinghaus-style exponential decay curves, used to quantify and compare forgetting across architectures and random seeds.
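To make the fitting step concrete, here is a minimal sketch of fitting a decay curve to one retention trace. The functional form y(t) = a * exp(-lam * t) is an assumption, since the exact parameterization is not given on this page, and a plain grid search stands in for whatever optimizer the authors used. The function name `fit_exponential_decay` and the example trace are hypothetical.

```python
import math


def fit_exponential_decay(trace, lambdas=None):
    """Fit y(t) ~ a * exp(-lam * t) by grid search over lam.

    For each candidate lam the amplitude a has a closed-form least-squares
    solution. Returns (a, lam, r_squared).
    """
    if lambdas is None:
        lambdas = [i / 1000 for i in range(2001)]  # assumed search range 0.000 to 2.000
    mean_y = sum(trace) / len(trace)
    ss_tot = sum((y - mean_y) ** 2 for y in trace)
    best = None
    for lam in lambdas:
        basis = [math.exp(-lam * t) for t in range(len(trace))]
        a = sum(y * b for y, b in zip(trace, basis)) / sum(b * b for b in basis)
        ss_res = sum((y - a * b) ** 2 for y, b in zip(trace, basis))
        if best is None or ss_res < best[2]:
            best = (a, lam, ss_res)
    a, lam, ss_res = best
    # R^2 is undefined for a constant trace (never or always correct).
    r2 = 1.0 - ss_res / ss_tot if ss_tot > 0 else float("nan")
    return a, lam, r2


# Hypothetical binary correctness trace: 1 = correct at that epoch, 0 = wrong.
binary_trace = [1, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0]
a, lam, r2 = fit_exponential_decay(binary_trace)
print(f"a={a:.3f} lam={lam:.3f} R^2={r2:.3f}")
```

Binary traces are noisy inputs for a smooth curve, which is one reason the per-sample R² is worth reporting alongside the fitted constant.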

If this is right

Architectural diversity in ensembles supplies complementary retention coverage.
Curriculum or pruning methods based on per-sample difficulty fail to generalize across runs.
Early loss after head warmup predicts long-term forgetting rates.
Class-level forgetting follows visual similarity and remains stable.
Static scheduling using decay constants brings no gain over random sampling.
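The last point concerns a static sampler driven by decay constants. The paper's sampler is not described on this page, so the following is only one hypothetical scheme: sample each example with probability proportional to its decay rate, assuming a larger value means faster forgetting. The names `sampling_weights` and `draw_batch` and the sample ids are invented for illustration.

```python
import random


def sampling_weights(decay_rates):
    """Normalize per-sample decay rates into sampling probabilities."""
    total = sum(decay_rates.values())
    return {sid: rate / total for sid, rate in decay_rates.items()}


def draw_batch(weights, batch_size, rng):
    """Draw ids with replacement according to the given probabilities."""
    ids = list(weights)
    return rng.choices(ids, weights=[weights[i] for i in ids], k=batch_size)


# Hypothetical decay rates (assumption: larger = forgotten faster).
decay = {"img_a": 0.05, "img_b": 0.20, "img_c": 0.75}
weights = sampling_weights(decay)

rng = random.Random(0)
weighted_batch = draw_batch(weights, 1000, rng)
uniform_batch = rng.choices(list(decay), k=1000)  # the random-sampling baseline

counts = {sid: weighted_batch.count(sid) for sid in decay}
print(counts)
```

Because the weights are fixed before training, a scheme like this cannot adapt if the set of hard samples changes from run to run, which is consistent with the negative result reported.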

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Forgetting appears driven more by model inductive bias than by fixed sample properties, so mixing architectures could improve overall retention.
Single-run difficulty rankings are too noisy for reliable data selection in pruning or curricula.
Class-level semantic patterns could guide targeted augmentation on visually similar categories.
The low per-sample stability suggests forgetting may need to be modeled as an ensemble property rather than an individual-sample trait.

Load-bearing premise

Fitting exponential decay curves to per-sample correctness traces over epochs yields a stable and meaningful characterization of forgetting that generalizes beyond the two datasets and training setups used.

What would settle it

Observing a Jaccard overlap above 0.6 between the most-forgotten samples of the CNN and ViT on the same data, or a Spearman correlation above 0.4 for per-sample forgetting ranks across random seeds.
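The cross-seed half of this test relies on Spearman rank correlation. Below is a minimal stdlib sketch, with ties given average ranks; the per-seed values are hypothetical, and `would_contradict` simply encodes the two thresholds quoted above.

```python
import math


def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for idx in order[i:j + 1]:
            ranks[idx] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rho: Pearson correlation of the ranks (inputs must not be constant)."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / math.sqrt(vx * vy)


def would_contradict(jaccard_overlap, seed_rho):
    """True if either threshold named in the text is exceeded."""
    return jaccard_overlap > 0.6 or seed_rho > 0.4


# Hypothetical decay constants for the same five samples under two seeds.
seed_a = [0.10, 0.20, 0.30, 0.40, 0.50]
seed_b = [0.25, 0.15, 0.45, 0.35, 0.55]
rho = spearman(seed_a, seed_b)
print(rho, would_contradict(0.15, rho))
```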

Figures

Figures reproduced from arXiv: 2604.11508 by Miit Daga, Swarna Priya Ramu.

**Figure 1.** Pipeline overview. Phase 1 trains only the classification head. During Phase 2 vanilla training, per-sample correctness …

**Figure 2.** Distribution of per-sample decay constants …

**Figure 3.** Example per-sample retention traces (dots) with fitted exponential decay curves (lines). Top row: samples with low …

**Figure 4.** Distribution of per-sample 𝑅² for exponential decay fits (seed 42). DeiT-Small achieves higher 𝑅² across both datasets, indicating more structured and predictable forgetting dynamics than ResNet-18.

**Figure 5.** Jaccard similarity between the top-𝑘% most-forgotten samples of ResNet-18 and DeiT-Small, for 𝑘 ∈ {10, 20, 30, 40, 50}. Overlap is low across all thresholds, particularly on CUB-200 (𝐽 < 0.36 even at 𝑘 = 50%). [Plot: Jaccard similarity (y-axis, 0.0 to 1.0) against top-k% most forgotten (x-axis, 10 to 50); panel "octdl: ResNet-18 vs DeiT-Small overlap"; seeds 42, 99, 2026.]

… classes are hard even as they disagree on which individual samples within those classes are forgotten. On OCTDL (only 7 classes), the class-level correlation is unstable, ranging from 0.00 to 0.89. …

**Figure 6.** Per-sample selection frequency across sampling …

Original abstract

Fine-tuning pretrained image classifiers is standard practice, yet which individual samples are forgotten during this process, and whether forgetting patterns are stable or architecture dependent, remains unclear. Understanding these dynamics has direct implications for curriculum design, data pruning, and ensemble construction. We track per-sample correctness at every epoch during fine-tuning of ResNet-18 and DeiT-Small on a retinal OCT dataset (7 classes, 56:1 imbalance) and CUB-200-2011 (200 bird species), fitting Ebbinghaus-style exponential decay curves to each sample's retention trace. Five findings emerge. First, the two architectures forget fundamentally different samples: Jaccard overlap of the top 10 percent most-forgotten is 0.34 on OCTDL and 0.15 on CUB-200. Second, ViT forgetting is more structured (mean $R^2 = 0.74$) than CNN forgetting ($R^2 = 0.52$). Third, per-sample forgetting is stochastic across random seeds (Spearman $\rho \approx 0.01$), challenging the assumption that sample difficulty is an intrinsic property. Fourth, class-level forgetting is consistent and semantically interpretable: visually similar species are forgotten most, distinctive ones least. Fifth, a sample's loss after head warmup predicts its long-term decay constant ($\rho = 0.30$ to $0.50$, $p < 10^{-45}$). These findings suggest that architectural diversity in ensembles provides complementary retention coverage, and that curriculum or pruning methods based on per-sample difficulty may not generalize across runs. A spaced repetition sampler built on these decay constants does not outperform random sampling, indicating that static scheduling cannot exploit unstable per-sample signals.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper's core observations are low Jaccard overlap in forgotten samples across ResNet and DeiT plus near-zero cross-seed correlation in per-sample forgetting, but moderate exponential fit quality weakens how much weight those rankings can carry.

The two things to know are that the two architectures forget largely different samples and that which samples get forgotten varies a lot with random seed. The Jaccard overlaps on the top 10% most forgotten are low at 0.34 on OCTDL and 0.15 on CUB-200, and Spearman correlation across seeds is near zero at about 0.01. These are concrete empirical numbers not already standard in the forgetting literature.

The work does a solid job of tracking correctness per sample over every epoch during fine-tuning on the imbalanced OCT dataset and on CUB-200. Fitting the Ebbinghaus curves and reporting the R² differences between architectures, plus the class-level semantic patterns, gives some structure to the results. The correlation between early loss and later decay constant is a useful finding, and showing that a sampler based on these constants doesn't beat random is a straightforward negative result worth having.

The main concern is that the exponential fits explain only about half the variance for the CNN (R² 0.52), so the decay constants used to identify the most-forgotten samples may not be very stable. That makes the architecture-dependence claim rest on rankings that could shift with small changes in the traces. The stochasticity result fits with this but doesn't strengthen the overlap numbers. With just two models and two datasets, it's hard to know how far these patterns go.

This paper is for people doing fine-tuning in computer vision who care about data selection or ensembles. It raises good questions about whether sample difficulty is fixed or depends on the model and run. I would send it to peer review. The questions are practical and the measurements are direct, so referees can check the robustness of the fits and see if the claims hold with better controls or different models.

Referee Report

2 major / 2 minor

Summary. The paper claims that fine-tuning ResNet-18 and DeiT-Small on OCTDL and CUB-200-2011 leads to architecture-dependent forgetting of different samples, quantified by low Jaccard overlap (0.34 and 0.15) in the top 10% most-forgotten samples when ranked by decay constants from Ebbinghaus-style exponential fits to per-epoch per-sample correctness traces. It further reports higher mean R² for ViT traces (0.74) than CNN (0.52), near-zero Spearman correlation of per-sample forgetting across random seeds (ρ≈0.01), semantically consistent class-level forgetting patterns, and moderate predictive correlation (ρ=0.30–0.50) between post-warmup loss and long-term decay constants. The work concludes that architectural diversity in ensembles provides complementary retention and that static per-sample difficulty signals are unreliable for curriculum or pruning.

Significance. If the reported architecture-dependent and stochastic forgetting patterns hold under more rigorous validation of the per-sample rankings, the results would be significant for fine-tuning practices in computer vision. They provide empirical evidence that different architectures retain complementary information, which could guide ensemble construction, and they challenge the use of fixed difficulty metrics in data pruning or curriculum design. The direct per-epoch trace analysis offers a concrete, falsifiable approach to studying retention dynamics with potential downstream impact on training efficiency and generalization.

major comments (2)

[Abstract] The mean R²=0.52 for ResNet-18 traces indicates that the exponential decay model explains only half the variance on average. Since the headline Jaccard overlaps (0.34 on OCTDL, 0.15 on CUB-200) are computed from rankings of samples by the fitted decay constant, this moderate fit quality means epoch-to-epoch label noise or non-monotonic traces can perturb the top-10% sets. The architecture-dependence conclusion therefore depends on the stability of these noisy orderings; a sensitivity analysis (e.g., bootstrapped rankings or restriction to high-R² samples) is required to establish that the low overlap is not an artifact of fit quality.
[Results] No error bars, confidence intervals, or details on the number of random seeds, exact train/validation splits, or multiple-testing corrections are reported for the Jaccard indices, Spearman correlations, or R² values. Given the 56:1 class imbalance in OCTDL and the per-sample fitting procedure, these omissions prevent verification of whether the quantitative claims (including the stochasticity result ρ≈0.01) are statistically robust.
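The sensitivity analysis requested in the first major comment could look roughly like the sketch below. It is one possible reading, not a prescribed procedure: samples are filtered by an R² threshold for both models, then sample ids are resampled with replacement and the top-k overlap is recomputed on the distinct ids drawn. All data here are randomly generated placeholders, and the 0.6 threshold is illustrative.

```python
import random


def top_k_ids(scores, ids, k):
    """Ids (taken from `ids`) with the k percent highest scores."""
    n_keep = max(1, round(len(ids) * k / 100))
    return set(sorted(ids, key=lambda i: scores[i], reverse=True)[:n_keep])


def jaccard(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def high_r2_ids(r2_a, r2_b, threshold):
    """Keep samples whose fit quality exceeds the threshold for BOTH models."""
    return [i for i in r2_a if r2_a[i] > threshold and r2_b[i] > threshold]


def bootstrap_jaccard_ci(scores_a, scores_b, ids, k=10, n_boot=1000, alpha=0.05, seed=0):
    """Percentile interval for the top-k Jaccard overlap under id resampling."""
    rng = random.Random(seed)
    ids = list(ids)
    stats = []
    for _ in range(n_boot):
        drawn = sorted(set(rng.choices(ids, k=len(ids))))  # distinct ids in this draw
        stats.append(jaccard(top_k_ids(scores_a, drawn, k), top_k_ids(scores_b, drawn, k)))
    stats.sort()
    lo = stats[int(alpha / 2 * n_boot)]
    hi = stats[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return lo, hi


# Placeholder data: random decay constants and R^2 values for 200 samples.
gen = random.Random(1)
n = 200
scores_cnn = {i: gen.random() for i in range(n)}
scores_vit = {i: gen.random() for i in range(n)}
r2_cnn = {i: gen.random() for i in range(n)}
r2_vit = {i: gen.random() for i in range(n)}

kept = high_r2_ids(r2_cnn, r2_vit, threshold=0.6)
lo, hi = bootstrap_jaccard_ci(scores_cnn, scores_vit, kept, k=10)
print(len(kept), lo, hi)
```

If the interval stays well below the overlap expected for two runs of the same architecture, the low cross-architecture overlap is less likely to be a side effect of poor fits.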

minor comments (2)

[Methods] The precise functional form of the 'Ebbinghaus-style' exponential decay (including any offset or scaling parameters) should be stated explicitly, along with the fitting procedure (e.g., least-squares on correctness probability or log-odds).
[Abstract] The spaced-repetition sampler experiment would benefit from reporting the exact performance metric, number of trials, and statistical test against random sampling to support the claim that it 'does not outperform'.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive comments, which highlight important aspects of statistical robustness and the stability of our per-sample rankings. We agree that additional analyses are warranted to strengthen the claims regarding architecture-dependent forgetting. Below we provide point-by-point responses to the major comments and describe the revisions we will incorporate.

Referee: [Abstract] The mean R²=0.52 for ResNet-18 traces indicates that the exponential decay model explains only half the variance on average. Since the headline Jaccard overlaps (0.34 on OCTDL, 0.15 on CUB-200) are computed from rankings of samples by the fitted decay constant, this moderate fit quality means epoch-to-epoch label noise or non-monotonic traces can perturb the top-10% sets. The architecture-dependence conclusion therefore depends on the stability of these noisy orderings; a sensitivity analysis (e.g., bootstrapped rankings or restriction to high-R² samples) is required to establish that the low overlap is not an artifact of fit quality.

Authors: We acknowledge that the mean R² of 0.52 for ResNet-18 indicates moderate fit quality for the exponential model on CNN traces, which could introduce noise into individual decay constant estimates and potentially affect the stability of top-10% rankings. However, the low Jaccard overlaps are observed consistently across two datasets with differing characteristics (imbalanced OCTDL and balanced CUB-200), and the higher R² for ViTs (0.74) is itself part of the architecture-specific finding. In the revision we will add a sensitivity analysis by (i) recomputing Jaccard overlaps after restricting to samples with R² > 0.6 and (ii) reporting bootstrapped 95% confidence intervals on the Jaccard indices via resampling of the decay constants. We expect the overlaps to remain low, supporting that the architecture dependence is not an artifact of fit noise. We will also expand the discussion to note that lower R² for CNNs reflects more stochastic retention dynamics.
Revision: yes

Referee: [Results] No error bars, confidence intervals, or details on the number of random seeds, exact train/validation splits, or multiple-testing corrections are reported for the Jaccard indices, Spearman correlations, or R² values. Given the 56:1 class imbalance in OCTDL and the per-sample fitting procedure, these omissions prevent verification of whether the quantitative claims (including the stochasticity result ρ≈0.01) are statistically robust.

Authors: We agree that the original manuscript omitted error bars, exact seed counts, split details, and multiple-testing clarifications, which limits verifiability especially given OCTDL's class imbalance. In the revised manuscript we will report: the use of 5 random seeds for all main experiments, the precise train/validation splits (80/20 stratified by class), and 95% confidence intervals or standard errors for all quantitative metrics (Jaccard indices, Spearman ρ values, and mean R²). For the stochasticity result (ρ≈0.01 across seeds), we will add pairwise seed comparisons and confirm the near-zero correlation holds with statistical testing. Regarding multiple testing, the per-sample loss-to-decay correlations have $p < 10^{-45}$; we will note that these survive Bonferroni correction across the sample set. The class imbalance concern is mitigated by the replication of core patterns on the balanced CUB-200 dataset, which we will emphasize.
Revision: yes
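The Bonferroni statement can be checked with one line of arithmetic. Assuming a family-wise level of $\alpha = 0.05$ (an assumption; no level is stated here) and $m$ tests, a reported bound of $p < 10^{-45}$ survives correction for any realistic $m$:

```latex
p < 10^{-45}
\;\Longrightarrow\;
m\,p < m \cdot 10^{-45} \le \alpha
\quad\text{whenever}\quad
m \le \frac{\alpha}{10^{-45}} = \frac{0.05}{10^{-45}} = 5 \times 10^{43}.
```

The number of tests would have to exceed $5 \times 10^{43}$ before the corrected threshold became binding, so the correction does not affect significance here. It says nothing about effect size, which is carried by the reported $\rho$ of 0.30 to 0.50.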

Circularity Check

0 steps flagged

No significant circularity; all central results are independent empirical computations from direct measurements.

The paper records per-epoch correctness traces for each sample during fine-tuning, fits Ebbinghaus-style exponentials post-hoc as a descriptive tool, ranks samples by the resulting decay constants, and then computes Jaccard overlaps, Spearman correlations, and class-level patterns directly from those rankings and traces. None of the reported quantities (Jaccard 0.15-0.34, stochasticity ρ≈0.01, loss-to-decay correlation ρ=0.30-0.50) is defined in terms of itself or reduces by the paper's own equations to a quantity already fixed by the inputs. The exponential fit is an analysis step whose quality is separately reported (mean R² values), not a self-referential premise. No self-citations, uniqueness theorems, or ansatzes imported from prior author work appear as load-bearing steps.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claims rest on the assumption that exponential decay is an appropriate model for per-sample retention traces and that the reported correlations reflect genuine architectural differences rather than dataset-specific artifacts.

free parameters (1)

per-sample exponential decay parameters
Each sample's correctness trace is fit to an Ebbinghaus-style exponential decay, introducing sample-specific decay constants and initial retention values that are then used for downstream comparisons.

axioms (1)

Domain assumption: per-sample correctness over training epochs can be meaningfully summarized by an exponential decay curve.
The paper reports mean R² values for these fits and uses the resulting decay constants for all subsequent analyses.
