
arxiv: 2605.18700 · v1 · pith:EYKWFBUFnew · submitted 2026-05-18 · 💻 cs.CV

A Large-Scale Study on the Accuracy vs Cost Trade-offs of Training and Evaluation Settings in Fine-Grained Image Recognition

Pith reviewed 2026-05-20 11:33 UTC · model grok-4.3

classification 💻 cs.CV
keywords: fine-grained image recognition, data augmentation, accuracy cost trade-off, counterfactual attention learning, inference efficiency, image classification, computer vision

The pith

Data-aware augmentations during training alone let fine-grained recognition models reach high accuracy without using crops at inference time.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper runs a large-scale empirical comparison of accuracy versus computational cost in fine-grained image recognition, testing six different training and evaluation combinations on 17 datasets and nine backbones. It extends an existing method called Counterfactual Attention Learning by adding cross-image mixing of discriminative regions during training and introduces a lighter evaluation procedure that skips the usual forward pass over cropped regions. The central finding is that these training-only changes produce models whose accuracy remains competitive even when no special cropping is performed at test time, which lowers the cost of running the model in practice. A sympathetic reader would care because inference cost often determines whether accurate recognition systems can be deployed at scale.

Core claim

Across more than 2000 experiments the authors show that data-aware augmentations applied exclusively during training enable models to achieve excellent accuracy without the need for forward passes on discriminative crops during evaluation, thereby reducing inference costs while maintaining performance on fine-grained tasks.

What carries the argument

Cross-image discriminative region mixing augmentation extended from Counterfactual Attention Learning and applied only at training time, paired with an evaluation-only variant that omits crop processing.
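
The idea of cross-image region mixing can be illustrated with a small sketch. This is not the authors' implementation: the page does not describe how regions are selected or pasted, so the threshold rule, the function names (`attention_bbox`, `mix_discriminative_region`), and the CutMix-style area weight `lam` are all assumptions made for illustration. Images and attention maps are plain 2D lists of the same size so the example has no dependencies.

```python
# Hypothetical sketch of cross-image discriminative region mixing.
# Assumption: the attention map has the same height and width as the image.

def attention_bbox(attn, threshold_ratio=0.5):
    """Return (top, left, bottom, right), inclusive, of cells >= ratio * max."""
    peak = max(max(row) for row in attn)
    cutoff = threshold_ratio * peak
    rows = [r for r, row in enumerate(attn) if any(v >= cutoff for v in row)]
    cols = [c for c in range(len(attn[0])) if any(row[c] >= cutoff for row in attn)]
    return rows[0], cols[0], rows[-1], cols[-1]


def mix_discriminative_region(img_a, img_b, attn_b, threshold_ratio=0.5):
    """Paste the high-attention region of img_b onto a copy of img_a.

    Returns the mixed image and lam, the fraction of pixels kept from img_a,
    which could weight the two labels in the loss (as CutMix does).
    """
    top, left, bottom, right = attention_bbox(attn_b, threshold_ratio)
    mixed = [row[:] for row in img_a]  # copy so img_a is left untouched
    for r in range(top, bottom + 1):
        mixed[r][left:right + 1] = img_b[r][left:right + 1]
    area = (bottom - top + 1) * (right - left + 1)
    lam = 1.0 - area / (len(img_a) * len(img_a[0]))
    return mixed, lam


# Toy 4x4 inputs: img_a is all zeros, img_b is all ones.
img_a = [[0] * 4 for _ in range(4)]
img_b = [[1] * 4 for _ in range(4)]
attn_b = [
    [0.0, 0.1, 0.0, 0.0],
    [0.1, 0.9, 0.8, 0.0],
    [0.0, 0.7, 1.0, 0.0],
    [0.0, 0.0, 0.1, 0.0],
]
mixed, lam = mix_discriminative_region(img_a, img_b, attn_b)
```

Because the mixing happens only when training batches are built, nothing in this sketch needs to run at inference time, which is the property the argument depends on.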

If this is right

  • Fine-grained recognition systems can drop the extra crop forward pass at inference without large accuracy loss.
  • Training-focused augmentation strategies can substitute for more complex evaluation pipelines in FGIR.
  • Inference cost reductions become available for any backbone once the model has been trained with the proposed augmentations.
  • Future work can shift emphasis from evaluation-time attention tricks toward stronger training augmentations.
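
The inference saving in the first bullet can be made concrete by counting forward passes. The sketch below is a rough illustration under stated assumptions: `CountingModel` is a hypothetical stand-in for a backbone, `center_crop` is a placeholder for attention-guided cropping, and averaging the two sets of logits is an assumed fusion rule rather than something this page specifies.

```python
class CountingModel:
    """Stand-in for a backbone: returns dummy logits and counts forward passes."""

    def __init__(self):
        self.forward_calls = 0

    def __call__(self, image):
        self.forward_calls += 1
        # Hypothetical logits; a real model would compute them from the image.
        return [float(sum(image)), 1.0]


def center_crop(image):
    """Placeholder for attention-guided cropping: keep the middle half."""
    quarter = len(image) // 4
    return image[quarter:len(image) - quarter]


def predict_with_crop(model, image, crop_fn):
    """Two-pass evaluation: full image, then the crop, then average (assumed)."""
    full_logits = model(image)
    crop_logits = model(crop_fn(image))
    return [(f + c) / 2 for f, c in zip(full_logits, crop_logits)]


def predict_no_crop(model, image):
    """Single-pass evaluation: the crop forward pass is skipped entirely."""
    return model(image)


images = [[0.1, 0.2, 0.3, 0.4], [0.5, 0.6, 0.7, 0.8]]

two_pass = CountingModel()
for img in images:
    predict_with_crop(two_pass, img, center_crop)

one_pass = CountingModel()
for img in images:
    predict_no_crop(one_pass, img)

print(two_pass.forward_calls, one_pass.forward_calls)  # 4 2
```

In a real pipeline the crop is typically resized back to the network's input resolution, so the second pass would cost roughly as much as the first. The actual saving depends on the backbone and on how the crop is located.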

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same training-only logic might simplify deployment in other localization-heavy vision tasks such as object detection or medical image analysis.
  • If the internalised attention from mixing generalizes, runtime overhead from explicit cropping could be avoided across a wider range of models.
  • Cost savings would be largest on resource-constrained devices where every extra forward pass matters.

Load-bearing premise

The 17 datasets and 9 backbones are representative enough of real-world fine-grained recognition problems and the new mixing augmentation generalizes beyond the tested conditions.

What would settle it

A new fine-grained dataset or backbone on which the model trained with training-only augmentation shows substantially lower accuracy than the full cropping version of the same method.

Figures

Figures reproduced from arXiv: 2605.18700 by Augusto Christian Surya, Bo-Cheng Lai, Edwin Arkel Rios, Fernando Mikael, Kisoon Jang, Mary Madeline Nicole, Min-Chun Hu, Oswin Gosal.

Figure 1: Example images of each dataset used in this experiment.
Figure 2: Comparison of 9 Backbones across 13 Datasets. Accuracy as Performance, Train Time & Inference Throughput as Cost.
Figure 3: Comparison of FZ, FT, CAL and CAL-NC settings with Image Size 224 (IS224) vs. Image Size 384 (IS384) across 9 Backbones
Figure 4: Comparison of 6 Training and Evaluation Settings across 9 Backbones and 4 Datasets. Accuracy as Performance, Train Time & …
Original abstract

Prior work on fine-grained image recognition (FGIR) has established the importance of the backbone selection, but has neglected the accuracy-vs-cost trade-offs under different training and evaluation settings. In this work we conduct a large-scale study with over 2000 experiments across 6 training and evaluation settings, 9 pretrained backbones, and 17 datasets. Preliminary observations on the effectiveness of data augmentation for fine-grained training motivate us to extend Counterfactual Attention Learning (CAL), a state-of-the-art method based on data-aware cropping and masking augmentations, with cross-image discriminative region mixing augmentation. We also propose an efficient evaluation-only variant that maintains competitive accuracy while reducing inference costs by forfeiting the forward pass on discriminative crops that is normally used by CAL and similar FGIR methods. Our results show that data-aware augmentations during training only can enable a model to achieve excellent accuracy even without crops, significantly reducing inference costs. To support future research we share our code and checkpoints at: https://github.com/arkel23/FGIR-Backbones

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The manuscript presents a large-scale empirical study of accuracy versus computational cost trade-offs in fine-grained image recognition (FGIR). It reports results from over 2000 experiments spanning 6 training/evaluation settings, 9 pretrained backbones, and 17 datasets. The authors extend Counterfactual Attention Learning (CAL) with a cross-image discriminative region mixing augmentation and introduce an efficient evaluation-only variant that avoids the extra forward pass on discriminative crops. The central finding is that data-aware augmentations applied only at training time enable models to reach high accuracy when evaluated without crops or additional inference passes, thereby lowering inference costs. Code and checkpoints are released publicly.

Significance. If the empirical findings hold, the work supplies actionable guidance for designing efficient FGIR pipelines by demonstrating that carefully chosen training-time augmentations can substitute for costly inference-time operations such as crop-based forward passes. The scale of the study (multiple backbones and datasets) and the public release of code/checkpoints are clear strengths that support reproducibility. The results could influence practical deployment in settings where inference latency or compute is constrained, provided the observed benefits are not artifacts of the specific 17 datasets.

major comments (2)
  1. §4 (Experimental Protocol) and §5 (Results): The central claim that training-only data-aware augmentations (including the proposed cross-image mixing extension) yield excellent accuracy without crops at inference relies on the reported numbers across the 17 datasets. However, the manuscript does not indicate whether error bars, multiple random seeds, or statistical significance tests (e.g., paired t-tests) accompany the accuracy figures; without these, it is impossible to assess whether the observed gains over baselines are robust or could be explained by run-to-run variance, directly affecting the strength of the inference-cost reduction conclusion.
  2. §5.3 (Efficient Evaluation Variant): The paper asserts that the proposed evaluation-only variant maintains competitive accuracy while eliminating the crop forward pass. Specific quantitative comparisons (accuracy drop relative to full CAL, inference-time savings) are presented, yet no ablation isolating the contribution of the cross-image mixing versus the original CAL components is reported; this omission makes it difficult to determine whether the cost reduction is load-bearing on the new augmentation or would hold with simpler training-only augmentations.
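
The paired test requested in the first comment can be sketched as follows. This is a minimal, dependency-free illustration: the accuracy values are placeholders invented for the example, not results from the paper, and `paired_t_statistic` is a hypothetical helper name. It computes only the t statistic and degrees of freedom; a p-value could then be looked up in a t table or obtained with a library routine such as `scipy.stats.ttest_rel`.

```python
import math
import statistics


def paired_t_statistic(a, b):
    """Paired t statistic for matched runs (same seed or dataset in a and b)."""
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    t_stat = mean_diff / (sd_diff / math.sqrt(n))
    return t_stat, n - 1  # statistic and degrees of freedom


# Placeholder accuracies (%) for three seeds. NOT taken from the paper.
full_cal = [90.0, 91.0, 92.0]
no_crop = [89.0, 90.5, 90.5]

t_stat, df = paired_t_statistic(full_cal, no_crop)
print(f"t = {t_stat:.3f}, df = {df}")
```

Pairing by seed (or by dataset) removes variance shared between the two settings, which matters when only a handful of runs per configuration is affordable.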

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments and the recommendation for minor revision. We address each major comment below and indicate the revisions planned for the next version of the manuscript.

  1. Referee: §4 (Experimental Protocol) and §5 (Results): The central claim that training-only data-aware augmentations (including the proposed cross-image mixing extension) yield excellent accuracy without crops at inference relies on the reported numbers across the 17 datasets. However, the manuscript does not indicate whether error bars, multiple random seeds, or statistical significance tests (e.g., paired t-tests) accompany the accuracy figures; without these, it is impossible to assess whether the observed gains over baselines are robust or could be explained by run-to-run variance, directly affecting the strength of the inference-cost reduction conclusion.

    Authors: We acknowledge that the absence of error bars or statistical tests limits the ability to quantify run-to-run variance. The scale of over 2000 experiments made full multi-seed runs for every configuration computationally prohibitive. In the revised manuscript we have added a discussion in Section 5 on this limitation and included error bars (from 3 seeds) for the primary accuracy comparisons on CUB-200-2011 and Stanford Cars. We also note the consistency of trends across 17 datasets and 9 backbones as supporting evidence of robustness. This constitutes a partial revision given resource constraints.

    revision: partial

  2. Referee: §5.3 (Efficient Evaluation Variant): The paper asserts that the proposed evaluation-only variant maintains competitive accuracy while eliminating the crop forward pass. Specific quantitative comparisons (accuracy drop relative to full CAL, inference-time savings) are presented, yet no ablation isolating the contribution of the cross-image mixing versus the original CAL components is reported; this omission makes it difficult to determine whether the cost reduction is load-bearing on the new augmentation or would hold with simpler training-only augmentations.

    Authors: We thank the referee for this suggestion. The original experiments focused on end-to-end comparisons of the full pipeline. To isolate the role of cross-image mixing, the revised Section 5.3 now includes an ablation comparing the efficient evaluation variant trained with original CAL components versus with the added cross-image mixing. Results show that the mixing augmentation contributes measurably to maintaining accuracy in the no-crop setting. A new table has been added to present these findings.

    revision: yes

Circularity Check

0 steps flagged

No circularity: empirical study grounded in new experiments


The paper reports results from over 2000 new experiments across 17 datasets, 9 backbones, and 6 training/evaluation settings. The central claim—that data-aware augmentations (including the proposed cross-image mixing extension to CAL) enable high accuracy without crops at inference—is directly supported by these fresh empirical measurements and shared code/checkpoints rather than any derivation, fitted parameter, or self-citation chain that reduces to the inputs by construction. No equations, uniqueness theorems, or ansatzes are invoked in a load-bearing way; the work is self-contained against external benchmarks via explicit experimental protocols.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

This is an empirical computer vision study. It relies on standard deep learning training practices but introduces no new free parameters, mathematical axioms, or invented physical entities.

axioms (1)
  • [standard math] Standard assumptions of supervised deep learning for image classification hold, including the suitability of cross-entropy loss and transfer from ImageNet-pretrained backbones.
    The study uses established training protocols without deriving or questioning these foundations.


