Pretraining Objective Matters in Extreme Low-Data FGVC: A Backbone-Controlled Study

Aisha Sartaj; Alexander Hackett; Ginny Fisher; Jason Fisher; Mahule Roy; Srikanth Thudumu

arxiv: 2605.15599 · v1 · pith:B4XLADPSnew · submitted 2026-05-15 · 💻 cs.CV · cs.AI

Pith reviewed 2026-05-20 19:17 UTC · model grok-4.3

classification 💻 cs.CV cs.AI

keywords extreme low-data learning · fine-grained visual classification · pretraining objectives · vision transformers · contrastive learning · masked autoencoders · linear separability · emerald inclusion grading

The pith

Supervised and contrastive pretraining produce stronger linear representations than masked reconstruction or self-distillation for extreme low-data fine-grained classification.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests how pretraining objectives shape representation quality when labeled images are extremely scarce, using a custom three-class emerald inclusion grading dataset as the test case. It holds backbone size fixed at ViT-B/16 and compares supervised classification, contrastive learning, masked reconstruction, and self-distillation encoders under frozen conditions. Linear probes show supervised and contrastive methods achieving the highest AUC scores, while masked reconstruction improves when nonlinear classifiers are applied; self-distillation lags across both. Rigorous leave-one-out cross-validation plus permutation testing controls for variance in the small-N regime. The results give practitioners concrete guidance on matching pretraining choice to the type of downstream classifier feasible with limited labels.

Core claim

In a backbone-controlled comparison on a custom emerald inclusion dataset, supervised and SigLIP2 contrastive ViT-B/16 encoders deliver the highest macro one-vs-rest AUC under linear probes (logistic regression 0.768 and 0.735; SVM 0.739 and 0.697), masked autoencoder pretraining reaches 0.713 with XGBoost nonlinear probes, and DINOv3 underperforms across probe families.
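For readers less familiar with the metric, macro one-vs-rest AUC is conventionally the unweighted mean of the per-class binary AUCs. The notation below ($K$, $\mathrm{AUC}_k$, $s_k$) is introduced here for illustration and assumes the standard definition, not a formula quoted from the paper:

```latex
\mathrm{AUC}_{\text{macro}} = \frac{1}{K} \sum_{k=1}^{K} \mathrm{AUC}_k,
\qquad
\mathrm{AUC}_k = \Pr\bigl( s_k(x^{+}) > s_k(x^{-}) \bigr),
\qquad K = 3
```

Here $s_k$ is the probe's score for class $k$, $x^{+}$ is an image drawn from class $k$, and $x^{-}$ is an image drawn from the other two classes (ties ignored).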

What carries the argument

Matched frozen ViT-B/16 encoders pretrained with supervised classification, contrastive learning, masked reconstruction, or self-distillation objectives, evaluated by leave-one-out cross-validation and permutation testing on macro AUC with linear and nonlinear probes.
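A minimal sketch of this evaluation loop, assuming scikit-learn and pre-extracted frozen features. The function name `loocv_macro_auc`, the synthetic 16-dimensional features, and the sample counts are hypothetical stand-ins, not the authors' code or data. The sketch pools leave-one-out probabilities across folds before scoring, since a single held-out sample cannot yield an AUC on its own.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import LeaveOneOut, cross_val_predict
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def loocv_macro_auc(features, labels):
    """Pool leave-one-out class probabilities, then score macro one-vs-rest AUC."""
    # Linear probe on frozen features; the scaler is refit inside every fold.
    probe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
    proba = cross_val_predict(
        probe, features, labels, cv=LeaveOneOut(), method="predict_proba"
    )
    return roc_auc_score(labels, proba, multi_class="ovr", average="macro")


# Hypothetical stand-in for frozen encoder embeddings: 36 samples, 3 classes.
# Real ViT-B/16 embeddings would be 768-dimensional; 16 keeps the demo fast.
rng = np.random.default_rng(0)
labels = np.repeat([0, 1, 2], 12)
class_means = 1.5 * rng.normal(size=(3, 16))
features = class_means[labels] + rng.normal(size=(36, 16))

auc = loocv_macro_auc(features, labels)
print(f"LOOCV macro OvR AUC: {auc:.3f}")
```

Swapping the logistic regression for another probe that exposes `predict_proba` would leave the rest of the loop unchanged.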

If this is right

When dataset size restricts probing to linear classifiers, margin-enforcing objectives such as supervised or contrastive pretraining should be selected first.
Reconstruction-style pretraining becomes preferable once nonlinear classifiers can be trained without overfitting.
Self-distillation objectives are not favored in this domain under either linear or nonlinear evaluation.
Practitioners facing scarce labels should match the pretraining objective to the complexity of the downstream probe permitted by their data budget.
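One way to act on these points is a quick probe-family comparison per candidate encoder, sketched below under stated assumptions. `GradientBoostingClassifier` from scikit-learn is used here as a stand-in for XGBoost to keep the example dependency-light. The helper names and the two synthetic feature sets (`encoder_a`, `encoder_b`) are hypothetical and not taken from the paper.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import LeaveOneOut, cross_val_predict


def probe_auc(probe, features, labels):
    """Macro one-vs-rest AUC from pooled leave-one-out probabilities."""
    proba = cross_val_predict(
        probe, features, labels, cv=LeaveOneOut(), method="predict_proba"
    )
    return roc_auc_score(labels, proba, multi_class="ovr", average="macro")


def compare_probes(features, labels):
    """Score one linear and one nonlinear probe on the same frozen features."""
    probes = {
        "linear": LogisticRegression(max_iter=1000),
        # Stand-in for XGBoost (assumption): shallow boosted trees.
        "nonlinear": GradientBoostingClassifier(
            n_estimators=50, max_depth=2, random_state=0
        ),
    }
    return {name: probe_auc(p, features, labels) for name, p in probes.items()}


rng = np.random.default_rng(0)
labels = np.repeat([0, 1, 2], 10)

# Hypothetical feature sets standing in for two frozen encoders.
blobs = 2.0 * rng.normal(size=(3, 8))[labels] + rng.normal(size=(30, 8))
angles = rng.uniform(0, 2 * np.pi, size=30)
radii = (labels + 1) + 0.1 * rng.normal(size=30)
rings = np.column_stack([radii * np.cos(angles), radii * np.sin(angles)])

results = {
    name: compare_probes(x, labels)
    for name, x in {"encoder_a": blobs, "encoder_b": rings}.items()
}
for name, scores in results.items():
    best = max(scores, key=scores.get)
    print(name, {k: round(float(v), 3) for k, v in scores.items()}, "best:", best)
```

With so few samples, a gap between probe families on a single dataset is only suggestive; it would need a significance check before driving encoder choice.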

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The observed linear-versus-nonlinear trade-off may extend to other expert domains that rely on small annotated image sets, such as rare medical findings or specialized scientific photography.
A natural next test is whether hybrid pretraining that combines contrastive and reconstruction losses closes the gap across probe types.
The results imply that model-selection pipelines for low-data FGVC could incorporate a quick probe-type check before final encoder choice.

Load-bearing premise

The custom three-class emerald dataset together with leave-one-out cross-validation and permutation testing isolates the effect of pretraining objective from other sources of variance in the extreme low-data regime.

What would settle it

Re-running the identical backbone-controlled protocol on a different extreme low-data FGVC dataset and finding that DINOv3 or MAE produces higher linear-probe AUC than supervised pretraining would falsify the reported ordering.

Figures

Figures reproduced from arXiv: 2605.15599 by Aisha Sartaj, Alexander Hackett, Ginny Fisher, Jason Fisher, Mahule Roy, Srikanth Thudumu.

**Figure 1.** Representative specimens from each clarity grade. (a) Eye-clean: saturated green, continuous color, no eye-visible inclusions.

**Figure 2.** Representative clean/perturbed pair. Left: original eye

Original abstract

Extreme low-data fine-grained classification is common in expert domains where labeling is expensive, yet practitioners still need principled guidance for selecting pretrained encoders. We study emerald inclusion grading with a custom dataset of labeled images across three classes and ask: under matched backbone capacity, how does pretraining objective affect downstream representation quality? We compare four frozen ViT-B/16 encoders trained with supervised classification, contrastive learning (SigLIP2), masked reconstruction (MAE), and self-distillation (DINOv3), and evaluate them with leave-one-out cross-validation using linear and nonlinear probes. To control statistical noise in the low-N regime, we use permutation testing (N=1000) on macro one-vs-rest AUC. Supervised and contrastive encoders provide the strongest linear separability (logistic AUC: 0.768 and 0.735; SVM AUC: 0.739 and 0.697), while MAE improves under nonlinear probes (XGBoost AUC: 0.713). We find that DINOv3 underperforms across probe families in this domain. These results support a practical recommendation for extreme low-data FGVC: prioritize margin-enforcing pretraining objectives when data scarcity restricts probing to linear decision rules, and consider reconstruction-style encoders when nonlinear classifiers are feasible given dataset constraints.
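The permutation test named in the abstract can be sketched as follows. This is an illustration under assumptions, not the authors' implementation: the function names and the synthetic data are hypothetical, and the `+1` correction in the p-value follows a common convention for empirical permutation p-values that the abstract does not confirm. The demo uses 20 permutations to stay fast, where the paper reports 1000.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import LeaveOneOut, cross_val_predict


def loocv_macro_auc(features, labels):
    """Macro one-vs-rest AUC from pooled leave-one-out probabilities."""
    proba = cross_val_predict(
        LogisticRegression(max_iter=1000),
        features,
        labels,
        cv=LeaveOneOut(),
        method="predict_proba",
    )
    return roc_auc_score(labels, proba, multi_class="ovr", average="macro")


def permutation_p_value(features, labels, n_permutations=1000, seed=0):
    """Compare the observed AUC against AUCs obtained with shuffled labels."""
    rng = np.random.default_rng(seed)
    observed = loocv_macro_auc(features, labels)
    # Null distribution: rerun the full LOOCV pipeline on permuted labels.
    null = np.array(
        [
            loocv_macro_auc(features, rng.permutation(labels))
            for _ in range(n_permutations)
        ]
    )
    # Assumed convention: add one to numerator and denominator so p > 0.
    p_value = (1 + np.sum(null >= observed)) / (1 + n_permutations)
    return observed, null, p_value


# Synthetic stand-in for frozen embeddings: 30 samples, 3 balanced classes.
rng = np.random.default_rng(1)
labels = np.repeat([0, 1, 2], 10)
features = 2.0 * rng.normal(size=(3, 8))[labels] + rng.normal(size=(30, 8))

observed, null, p_value = permutation_p_value(features, labels, n_permutations=20)
print(f"observed={observed:.3f}  null mean={null.mean():.3f}  p={p_value:.3f}")
```

Under this convention the smallest attainable p-value is 1 / (1 + n_permutations), so 1000 permutations resolve p-values down to roughly 0.001.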

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

Supervised and contrastive pretraining give better linear probe results than MAE or DINOv3 on this small emerald dataset, but differences in upstream training data and scale undercut clean isolation of the objective effect.

The main takeaway is that on this three-class emerald inclusion task with very few examples, the supervised and SigLIP2 encoders produce the highest AUCs with linear probes (0.768 and 0.735 for logistic regression), MAE improves when a nonlinear probe such as XGBoost is allowed (0.713), and DINOv3 lags across the board. The authors run the comparison with matched ViT-B/16 backbones, leave-one-out cross-validation, and 1000 permutations to manage noise in the low-N setting.

Referee Report

1 major / 2 minor

Summary. The paper claims that pretraining objective affects representation quality in extreme low-data FGVC on a custom three-class emerald inclusion dataset. Using four frozen ViT-B/16 encoders (supervised classification, SigLIP2 contrastive, MAE reconstruction, DINOv3 self-distillation) evaluated via LOOCV with linear (logistic, SVM) and nonlinear (XGBoost) probes plus 1000-permutation testing on macro one-vs-rest AUC, it reports strongest linear separability for supervised (logistic AUC 0.768) and contrastive (0.735) encoders, better MAE performance under nonlinear probes (XGBoost AUC 0.713), and consistent underperformance by DINOv3. This leads to a practical recommendation prioritizing margin-enforcing objectives for linear probes in low-data settings.

Significance. If the isolation of objective effects holds, the work offers concrete guidance for encoder selection in expert domains with scarce labels. Strengths include the backbone-matched design, use of multiple probe families, and permutation testing to mitigate low-N statistical noise; these elements make the empirical comparison more robust than typical single-probe evaluations in the low-data regime.

major comments (1)

[Abstract / Encoder selection and experimental design] The central claim that pretraining objective can be isolated under 'matched backbone capacity' (abstract) is not supported by the experimental setup. The four encoders are distinct public checkpoints whose pretraining corpora, image counts, epoch budgets, and augmentations are not shown to be identical; any such upstream mismatches can produce the observed AUC gaps (e.g., logistic 0.768 supervised vs. 0.735 SigLIP2) independently of the loss function. Downstream LOOCV and permutation tests address label noise but cannot equalize upstream training distributions.

minor comments (2)

[Abstract] Exact dataset size, number of images per class, and total N are not reported in the abstract or methods summary; these details are essential to evaluate whether the regime is truly 'extreme low-data' and whether LOOCV is appropriate.
[Results] Full tables of per-fold AUC values, exact permutation p-values, and confidence intervals would strengthen verification of the reported differences (e.g., 0.768 vs. 0.735).

Simulated Authors' Rebuttal

1 response · 0 unresolved

We thank the referee for the detailed and constructive feedback. We address the major comment below and describe the planned revisions to strengthen the manuscript.

Referee: [Abstract / Encoder selection and experimental design] The central claim that pretraining objective can be isolated under 'matched backbone capacity' (abstract) is not supported by the experimental setup. The four encoders are distinct public checkpoints whose pretraining corpora, image counts, epoch budgets, and augmentations are not shown to be identical; any such upstream mismatches can produce the observed AUC gaps (e.g., logistic 0.768 supervised vs. 0.735 SigLIP2) independently of the loss function. Downstream LOOCV and permutation tests address label noise but cannot equalize upstream training distributions.

Authors: We agree that the four public ViT-B/16 checkpoints differ in pretraining corpora, scale, epoch counts, and augmentations, and that these upstream factors could contribute to the reported AUC differences independently of the objective. Our study matches only the backbone architecture and capacity (all ViT-B/16) to enable a controlled comparison of commonly available encoders, which is the practical setting most relevant to low-data FGVC practitioners. Full isolation of the objective would require retraining every model from scratch on identical data and schedules, which lies outside the scope of this work. We will revise the abstract and introduction to replace 'matched backbone capacity' with 'matched backbone architecture' and add an explicit limitations paragraph acknowledging upstream heterogeneity. We will also add a supplementary table summarizing the known pretraining details for each checkpoint drawn from their original publications. These changes will prevent overstatement while preserving the empirical comparison of standard models.

revision: partial

Circularity Check

0 steps flagged

Empirical comparison study with no derivations or self-referential reductions

The paper conducts a direct empirical evaluation of four public ViT-B/16 checkpoints (supervised, SigLIP2, MAE, DINOv3) on a custom three-class emerald inclusion dataset. Performance is measured via leave-one-out cross-validation and 1000-permutation macro AUC tests for linear and nonlinear probes. No equations, fitted parameters, predictions derived from inputs, or derivation chains appear in the reported results; AUC values are computed from held-out data splits. The central claim—that supervised and contrastive objectives yield stronger linear separability—is grounded in these experimental outcomes rather than any self-definition, ansatz smuggling, or load-bearing self-citation. Upstream differences in pretraining corpora are a methodological concern but do not create circularity because the paper contains no derivation that reduces to its own inputs by construction. This is a standard self-contained empirical study.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the assumption that the small custom dataset is representative of extreme low-data FGVC challenges and that performance gaps can be attributed primarily to pretraining objective after controlling for backbone capacity.

axioms (1)

domain assumption: The four ViT-B/16 encoders differ only in pretraining objective and share matched capacity. Invoked to isolate the effect of pretraining objective in the study design.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: "We compare four frozen ViT-B/16 encoders trained with supervised classification, contrastive learning (SigLIP2), masked reconstruction (MAE), and self-distillation (DINOv3), and evaluate them with leave-one-out cross-validation using linear and nonlinear probes."

IndisputableMonolith/Foundation/BranchSelection.lean · branch_selection · unclear
Paper passage: "margin-enforcing objectives (supervised, contrastive) yield stronger linear separability"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

12 extracted references · 12 canonical work pages

[1] Guillaume Alain and Yoshua Bengio. Understanding intermediate layers using linear classifier probes. In Proc. Int. Conf. Learn. Represent., 2017.

[2] Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerging properties in self-supervised vision transformers. In Proc. IEEE/CVF Int. Conf. Comput. Vis., 2021.

[3] Gemological Institute of America. Emerald quality factors. https://www.gia.edu/emerald-quality-factor, 2026. Accessed: 2026-03-30.

[4] Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Girshick. Masked autoencoders are scalable vision learners. In Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2022.

[5] Markus Ojala and Gemma C. Garriga. Permutation tests for studying classifier performance. Journal of Machine Learning Research, 11:1833–1863, 2010.

[6] F. B. Pena, D. Crabi, S. C. Izidoro, É. O. Rodrigues, and G. Bernardes. Machine learning applied to emerald gemstone grading: framework proposal and creation of a public dataset. Pattern Analysis and Applications, 25(1):241–251,

[7] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. Imagenet large scale visual recognition challenge. International Journal of Computer Vision (IJCV), 115(3):211–252, 2015.

[8] Oriane Siméoni, Huy V. Vo, Maximilian Seitzer, Federico Baldassarre, Maxime Oquab, Cijo Jose, Vasil Khalidov, Marc Szafraniec, Seungeun Yi, Michaël Ramamonjisoa, Francisco Massa, Daniel Haziza, Luca Wehrstedt, Jianyuan Wang, Timothée Darcet, Théo Moutakanni, Leonel Sentana, Claire Roberts, Andrea Vedaldi, Jamie Tolan, John Brandt, Camille Couprie,...

[9] Michael Tschannen, Alexey Gritsenko, Xiao Wang, Muhammad Ferjad Naeem, Ibrahim Alabdulmohsin, Nikhil Parthasarathy, Talfan Evans, Lucas Beyer, Ye Xia, Basil Mustafa, Olivier Hénaff, Jeremiah Harmsen, Andreas Steiner, and Xiaohua Zhai. Siglip 2: Multilingual vision-language encoders with improved semantic understanding, localization, and dense features, 2025.

[10] Kirill Vishniakov, Zhiqiang Shen, and Zhuang Liu. Convnet vs transformer, supervised vs clip: Beyond imagenet accuracy. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024.

[11] Tongzhou Wang and Phillip Isola. Understanding contrastive representation learning through alignment and uniformity on the hypersphere. In Proceedings of the 37th International Conference on Machine Learning, pages 9929–9939. PMLR,

[12] Chao Zhang, Xinyu Chen, Wensheng Li, Lixue Liu, Wei Wu, and Dacheng Tao. Understanding deep neural networks via linear separability of hidden layers, 2023. arXiv:2307.13962.
