pith. machine review for the scientific record.

arxiv: 2605.08618 · v1 · submitted 2026-05-09 · 💻 cs.CV · cs.LG

Recognition: no theorem link

Beyond Toy Benchmarks: A Systematic Evaluation of OOD Detection Methods For Plant Pathology Classification

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 01:07 UTC · model grok-4.3

classification 💻 cs.CV cs.LG
keywords OOD detection · plant pathology · energy-based models · out-of-distribution · distribution shifts · fine-grained classification · deep learning evaluation

The pith

Energy-based fine-tuning outperforms softmax and other OOD methods on plant pathology images without losing accuracy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper evaluates six out-of-distribution detection methods on the Plant Pathology 2021 dataset, a fine-grained classification task with natural distribution shifts that go beyond typical toy benchmarks. Energy-based fine-tuning emerges as the strongest performer, delivering better OOD detection than the softmax baseline while keeping in-distribution accuracy intact. The gains trace to both a restructured embedding space and improved calibration of the scoring function. The evaluation also reveals training instabilities in constrained optimization methods when applied to moderate-sized real datasets, issues rarely examined in existing work.
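The review does not reproduce the scoring formulas, but the two quantities being compared — the maximum-softmax-probability (MSP) baseline and the energy score — are standard in the OOD literature. A minimal sketch (illustrative function names, not the paper's code):

```python
import numpy as np

def msp_score(logits):
    """MSP baseline: max softmax probability. Higher = more in-distribution."""
    z = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    p = np.exp(z) / np.exp(z).sum(axis=-1, keepdims=True)
    return p.max(axis=-1)

def energy_score(logits, T=1.0):
    """Negative free energy, T * logsumexp(logits / T).
    Higher = more in-distribution."""
    z = logits / T
    m = z.max(axis=-1, keepdims=True)
    return T * (m.squeeze(-1) + np.log(np.exp(z - m).sum(axis=-1)))

# A confident, peaked logit vector (ID-like) vs. a flat one (OOD-like):
id_logits = np.array([[8.0, 0.5, 0.2, 0.1]])
ood_logits = np.array([[1.1, 1.0, 0.9, 1.0]])
assert energy_score(id_logits)[0] > energy_score(ood_logits)[0]
assert msp_score(id_logits)[0] > msp_score(ood_logits)[0]
```

Both scores rank the peaked vector as more in-distribution; the paper's finding is that the energy score, combined with fine-tuning, separates the two populations more cleanly on this dataset.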

Core claim

Energy-based fine-tuning performs best across OOD settings on the Plant Pathology 2021 dataset, improving detection over the softmax baseline while preserving in-distribution accuracy. These gains stem from both a restructuring of the embedding space and better calibration of the scoring function. Scaling constrained optimization methods to this dataset size produces practical training instabilities that are rarely documented in prior literature.

What carries the argument

Energy-based fine-tuning, which applies energy-based objectives during model adaptation to restructure embeddings and calibrate OOD scores.
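The review names the objective only at a high level. One common formulation from the energy-based OOD literature — not necessarily the paper's exact loss, and with illustrative margin values — adds a squared-hinge regularizer to the classification loss that pushes in-distribution energies below a margin and auxiliary OOD energies above another:

```python
import numpy as np

def energy(logits, T=1.0):
    # E(x) = -T * logsumexp(logits / T); lower energy <=> more in-distribution
    z = logits / T
    m = z.max(axis=-1, keepdims=True)
    return -T * (m.squeeze(-1) + np.log(np.exp(z - m).sum(axis=-1)))

def energy_margin_loss(id_logits, ood_logits, m_in=-25.0, m_out=-7.0):
    """Squared-hinge regularizer added to cross-entropy during fine-tuning:
    penalize ID samples whose energy exceeds m_in and auxiliary OOD samples
    whose energy falls below m_out. Margins here are illustrative."""
    e_in = energy(id_logits)
    e_out = energy(ood_logits)
    return (np.maximum(0.0, e_in - m_in) ** 2).mean() + \
           (np.maximum(0.0, m_out - e_out) ** 2).mean()

# Loss vanishes once both margins are satisfied:
assert energy_margin_loss(np.array([[30.0, 0.0, 0.0, 0.0]]),
                          np.zeros((1, 4))) == 0.0
```

Shaping the energy landscape this way is what the review means by "restructuring the embedding space": the fine-tuned features, not just the post-hoc score, carry the separation.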

Load-bearing premise

The Plant Pathology 2021 dataset and its natural distribution shifts are representative of real-world challenges, and the six methods were compared fairly without implementation-specific biases affecting the ranking.

What would settle it

A replication study on another real-world fine-grained dataset with natural shifts where energy-based fine-tuning fails to improve OOD detection over the softmax baseline, or where constrained optimization methods train stably.

Figures

Figures reproduced from arXiv: 2605.08618 by Devesh Shah.

Figure 1: Representative examples of each disease class in the Plant Pathology 2021 dataset.
Figure 2: Softmax probabilities (E1) and logits (E5b) for a representative ID sample (diseased plant […]).
Figure 3: Left and center: distributions of cosine distance to the 5 nearest neighbors in the training […].
Figure 4: Five randomly sampled images per disease class from the training split of the Plant Pathology […].
Figure 5: Randomly sampled images from each of the four OOD datasets used in this study.
Original abstract

Out-of-distribution (OOD) detection is essential for reliable deployment of deep learning systems, yet the majority of existing methods are evaluated on small, visually homogeneous benchmarks. In this work, we study six OOD detection methods spanning post-hoc scoring, auxiliary objectives, energy-based models, and constrained optimization on the Plant Pathology 2021 dataset, a fine-grained task with natural distribution shifts. Energy-based fine-tuning performs best across OOD settings, improving detection over the softmax baseline while preserving in-distribution accuracy. Analysis shows these gains stem from both a restructuring of the embedding space alongside calibration of the scoring function. We further document practical training instabilities that arise when scaling constrained optimization methods to moderate-sized datasets, findings that are largely absent from existing literature. Our results demonstrate that principled OOD detection is achievable on real-world domain-specific data and that benchmark evaluations alone may not capture the challenges that emerge in practice.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript evaluates six OOD detection methods (post-hoc scoring, auxiliary objectives, energy-based models, and constrained optimization) on the Plant Pathology 2021 dataset, a fine-grained classification task with natural distribution shifts. It reports that energy-based fine-tuning achieves the strongest OOD detection performance across settings while preserving in-distribution accuracy, with the gains arising from both embedding-space restructuring and scoring-function calibration. The work also documents practical training instabilities that appear when scaling constrained optimization methods to this dataset size.

Significance. If the method comparisons prove to be equitable, the results would be significant for moving OOD detection research beyond small, homogeneous toy benchmarks toward realistic domain-specific applications. The explicit documentation of scaling instabilities for constrained optimization is a useful practical contribution that is rarely reported in the literature, and the attribution of gains to both embedding geometry and scoring calibration offers a concrete mechanistic insight that could guide future method design in agricultural computer vision.

major comments (1)
  1. Abstract: the claim that energy-based fine-tuning is superior rests on the premise that all six methods were optimized under equivalent conditions. The abstract explicitly notes practical training instabilities for constrained optimization at this dataset scale, yet supplies no parallel stability analysis, tuning budget, or hyperparameter protocol for energy-based fine-tuning or the remaining methods. Without such documentation, the reported performance gap and the attribution to embedding restructuring plus scoring calibration could reflect differences in optimization effort rather than intrinsic method properties.
minor comments (2)
  1. Abstract: the summary of comparative results omits the concrete OOD detection metrics, any statistical tests, the precise definitions of the OOD settings, and the experimental controls employed, which prevents a reader from immediately gauging the strength of the reported rankings.
  2. The manuscript would be strengthened by the inclusion of a reproducibility statement (e.g., code repository or exhaustive hyperparameter tables) so that the documented instabilities and the energy-based performance advantage can be independently verified.
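For context on the metrics the first minor comment asks for, the two standard OOD detection measures — AUROC and FPR at 95% TPR — can be sketched in a few lines. The scores below are synthetic, and the convention assumed is higher score = more in-distribution:

```python
import numpy as np

def auroc(id_scores, ood_scores):
    """Probability that a random ID sample scores higher than a random OOD
    sample (rank-based AUROC; ties count as 0.5)."""
    id_scores, ood_scores = np.asarray(id_scores), np.asarray(ood_scores)
    wins = (id_scores[:, None] > ood_scores[None, :]).sum()
    ties = (id_scores[:, None] == ood_scores[None, :]).sum()
    return (wins + 0.5 * ties) / (len(id_scores) * len(ood_scores))

def fpr_at_tpr(id_scores, ood_scores, tpr=0.95):
    """Fraction of OOD samples accepted at the threshold that keeps `tpr`
    of the ID samples."""
    thresh = np.quantile(id_scores, 1.0 - tpr)  # ID accepted if score >= thresh
    return (np.asarray(ood_scores) >= thresh).mean()

# Synthetic scores for illustration only:
id_s = np.array([0.9, 0.8, 0.85, 0.95, 0.7])
ood_s = np.array([0.2, 0.3, 0.75, 0.1])
```

Reporting both matters: AUROC summarizes ranking quality over all thresholds, while FPR@95TPR reflects the single operating point a deployed detector would actually use.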

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on ensuring equitable comparisons. We address the major comment below and have revised the manuscript to provide the requested documentation.

Point-by-point responses
  1. Referee: Abstract: the claim that energy-based fine-tuning is superior rests on the premise that all six methods were optimized under equivalent conditions. The abstract explicitly notes practical training instabilities for constrained optimization at this dataset scale, yet supplies no parallel stability analysis, tuning budget, or hyperparameter protocol for energy-based fine-tuning or the remaining methods. Without such documentation, the reported performance gap and the attribution to embedding restructuring plus scoring calibration could reflect differences in optimization effort rather than intrinsic method properties.

    Authors: We agree that the original manuscript did not supply parallel stability analysis or explicit tuning budgets for all methods, which limits the strength of the superiority claim as presented. In practice, we applied standard hyperparameter search procedures with comparable computational budgets to each method, but these details were not reported. To resolve this, we have added a new subsection titled 'Optimization Protocols and Stability Analysis' that documents the hyperparameter search ranges, number of trials, early-stopping criteria, and observed instabilities or convergence behavior for every method, including energy-based fine-tuning and the post-hoc baselines. This revision makes the optimization effort transparent and supports the attribution of gains to embedding restructuring and scoring calibration rather than unequal tuning. We have also updated the abstract to reference the new analysis. revision: yes

Circularity Check

0 steps flagged

No circularity: purely empirical evaluation with no derivations or self-referential reductions

Full rationale

The paper conducts a systematic experimental comparison of six OOD detection methods on the Plant Pathology 2021 dataset, reporting performance rankings, embedding space analysis, and training instabilities as observed outcomes. No equations, first-principles derivations, or predictions are claimed; results rest on direct empirical measurements rather than any reduction to fitted inputs, self-definitions, or self-citation chains. The central claim of energy-based fine-tuning superiority is presented as an experimental finding, not a constructed equivalence, satisfying the criteria for a self-contained empirical study with no load-bearing circular steps.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

Central claim rests on empirical comparison; limited details are available from the abstract alone. No invented entities. Standard ML assumptions apply but are not explicitly audited.

free parameters (1)
  • Training hyperparameters and fine-tuning settings
    Standard ML training choices (learning rates, epochs, etc.) that affect method performance but are not specified in the abstract.
axioms (1)
  • domain assumption Plant Pathology 2021 dataset exhibits natural distribution shifts representative of real-world fine-grained classification challenges.
    Invoked to justify the evaluation's relevance to practical deployment.

pith-pipeline@v0.9.0 · 5447 in / 1292 out tokens · 76309 ms · 2026-05-12T01:07:33.326850+00:00 · methodology


Reference graph

Works this paper leans on

27 extracted references · 27 canonical work pages · 2 internal anchors
