A Large-Scale Study on the Accuracy vs Cost Trade-offs of Training and Evaluation Settings in Fine-Grained Image Recognition
Pith reviewed 2026-05-20 11:33 UTC · model grok-4.3
The pith
Data-aware augmentations during training alone let fine-grained recognition models reach high accuracy without using crops at inference time.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Across more than 2000 experiments, the authors show that data-aware augmentations applied only during training let models reach high accuracy without forward passes on discriminative crops at evaluation, which lowers inference cost while maintaining accuracy on fine-grained tasks.
What carries the argument
A cross-image discriminative region mixing augmentation that extends Counterfactual Attention Learning (CAL) and is applied only at training time, paired with an efficient evaluation variant that skips the forward pass on discriminative crops.
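To make the idea of cross-image region mixing concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: it assumes a CutMix-style paste in which the high-attention bounding box of a source image is copied onto a target image, and that the attention map has the same resolution as the image. The names `discriminative_box`, `mix_discriminative_region`, and the `threshold` parameter are hypothetical.

```python
from typing import List, Tuple

Grid = List[List[float]]  # a single-channel image or attention map


def discriminative_box(attention: Grid, threshold: float = 0.5) -> Tuple[int, int, int, int]:
    """Bounding box (top, left, bottom, right; ends exclusive) of the cells
    whose attention is at least `threshold` times the map's maximum."""
    peak = max(max(row) for row in attention)
    cutoff = threshold * peak
    rows = [r for r, row in enumerate(attention) if any(v >= cutoff for v in row)]
    cols = [c for c in range(len(attention[0])) if any(row[c] >= cutoff for row in attention)]
    return rows[0], cols[0], rows[-1] + 1, cols[-1] + 1


def mix_discriminative_region(
    target: Grid, source: Grid, source_attention: Grid, threshold: float = 0.5
) -> Tuple[Grid, float]:
    """Paste the source's high-attention box onto a copy of `target`.

    Returns the mixed image and the fraction of pixels that still come
    from `target` (assumption: labels are weighted by area, as in CutMix).
    """
    top, left, bottom, right = discriminative_box(source_attention, threshold)
    mixed = [row[:] for row in target]  # copy so the input is not modified
    for r in range(top, bottom):
        mixed[r][left:right] = source[r][left:right]
    area = (bottom - top) * (right - left)
    total = len(target) * len(target[0])
    return mixed, 1.0 - area / total
```

Under this assumption, the training loss for a mixed sample would weight the target label by the returned fraction and the source label by the remainder. The paper's actual region selection and label weighting may differ.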
If this is right
- Fine-grained recognition systems can drop the extra crop forward pass at inference without large accuracy loss.
- Training-focused augmentation strategies can substitute for more complex evaluation pipelines in FGIR.
- Inference cost reductions become available for any backbone once the model has been trained with the proposed augmentations.
- Future work can shift emphasis from evaluation-time attention tricks toward stronger training augmentations.
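The cost argument in the list above comes down to counting forward passes. The sketch below contrasts a two-pass evaluation (full image plus a crop, with averaged logits) against a single-pass evaluation. It is an illustration under stated assumptions: `CountingModel`, `predict_with_crop`, `predict_single_pass`, and the logit-averaging rule are hypothetical stand-ins, not the paper's code.

```python
from typing import Callable, List, Sequence

Logits = List[float]


class CountingModel:
    """Wraps a scoring function and counts how many forward passes it runs."""

    def __init__(self, forward_fn: Callable[[Sequence[float]], Logits]) -> None:
        self._forward_fn = forward_fn
        self.forward_passes = 0

    def __call__(self, image: Sequence[float]) -> Logits:
        self.forward_passes += 1
        return self._forward_fn(image)


def predict_with_crop(
    model: CountingModel,
    image: Sequence[float],
    crop_fn: Callable[[Sequence[float]], Sequence[float]],
) -> Logits:
    """Two-pass evaluation: score the full image, then a crop, and average."""
    full = model(image)
    cropped = model(crop_fn(image))
    return [(a + b) / 2 for a, b in zip(full, cropped)]


def predict_single_pass(model: CountingModel, image: Sequence[float]) -> Logits:
    """Crop-free evaluation: one forward pass on the full image."""
    return model(image)
```

With any backbone plugged in as `forward_fn`, the crop-based path costs two forward passes per image and the crop-free path costs one, before counting the cost of computing the crop itself.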
Where Pith is reading between the lines
- The same training-only logic might simplify deployment in other localization-heavy vision tasks such as object detection or medical image analysis.
- If the internalized attention from mixing generalizes, the runtime overhead of explicit cropping could be avoided across a wider range of models.
- Cost savings would be largest on resource-constrained devices where every extra forward pass matters.
Load-bearing premise
The 17 datasets and 9 backbones are representative enough of real-world fine-grained recognition problems and the new mixing augmentation generalizes beyond the tested conditions.
What would settle it
A new fine-grained dataset or backbone on which the model trained with training-only augmentation is substantially less accurate than the full cropping version of the same method.
Original abstract
Prior work on fine-grained image recognition (FGIR) has established the importance of the backbone selection, but has neglected the accuracy-vs-cost trade-offs under different training and evaluation settings. In this work we conduct a large-scale study with over 2000 experiments across 6 training and evaluation settings, 9 pretrained backbones, and 17 datasets. Preliminary observations on the effectiveness of data augmentation for fine-grained training motivate us to extend Counterfactual Attention Learning (CAL), a state-of-the-art method based on data-aware cropping and masking augmentations, with cross-image discriminative region mixing augmentation. We also propose an efficient evaluation-only variant that maintains competitive accuracy while reducing inference costs by forfeiting the forward pass on discriminative crops that is normally used by CAL and similar FGIR methods. Our results show that data-aware augmentations during training only can enable a model to achieve excellent accuracy even without crops, significantly reducing inference costs. To support future research we share our code and checkpoints at: https://github.com/arkel23/FGIR-Backbones
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents a large-scale empirical study of accuracy versus computational cost trade-offs in fine-grained image recognition (FGIR). It reports results from over 2000 experiments spanning 6 training/evaluation settings, 9 pretrained backbones, and 17 datasets. The authors extend Counterfactual Attention Learning (CAL) with a cross-image discriminative region mixing augmentation and introduce an efficient evaluation-only variant that avoids the extra forward pass on discriminative crops. The central finding is that data-aware augmentations applied only at training time enable models to reach high accuracy when evaluated without crops or additional inference passes, thereby lowering inference costs. Code and checkpoints are released publicly.
Significance. If the empirical findings hold, the work supplies actionable guidance for designing efficient FGIR pipelines by demonstrating that carefully chosen training-time augmentations can substitute for costly inference-time operations such as crop-based forward passes. The scale of the study (multiple backbones and datasets) and the public release of code/checkpoints are clear strengths that support reproducibility. The results could influence practical deployment in settings where inference latency or compute is constrained, provided the observed benefits are not artifacts of the specific 17 datasets.
Major comments (2)
- §4 (Experimental Protocol) and §5 (Results): The central claim that training-only data-aware augmentations (including the proposed cross-image mixing extension) yield excellent accuracy without crops at inference relies on the reported numbers across the 17 datasets. However, the manuscript does not indicate whether error bars, multiple random seeds, or statistical significance tests (e.g., paired t-tests) accompany the accuracy figures; without these, it is impossible to assess whether the observed gains over baselines are robust or could be explained by run-to-run variance, directly affecting the strength of the inference-cost reduction conclusion.
- §5.3 (Efficient Evaluation Variant): The paper asserts that the proposed evaluation-only variant maintains competitive accuracy while eliminating the crop forward pass. Specific quantitative comparisons (accuracy drop relative to full CAL, inference-time savings) are presented, yet no ablation isolating the contribution of the cross-image mixing versus the original CAL components is reported; this omission makes it difficult to determine whether the cost reduction is load-bearing on the new augmentation or would hold with simpler training-only augmentations.
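The first major comment asks for statistical tests such as a paired t-test. As a rough sketch of what that check involves, the function below computes the paired t statistic and its degrees of freedom from per-dataset (or per-seed) accuracies of two settings. The function name is hypothetical and any numbers passed to it here would be illustrative, not results from the paper.

```python
import math
import statistics
from typing import Sequence, Tuple


def paired_t_statistic(baseline: Sequence[float], treatment: Sequence[float]) -> Tuple[float, int]:
    """Return (t, degrees_of_freedom) for paired accuracy scores.

    Each position must refer to the same dataset or seed in both sequences.
    """
    if len(baseline) != len(treatment) or len(baseline) < 2:
        raise ValueError("need two equally long sequences with at least 2 pairs")
    diffs = [t - b for b, t in zip(baseline, treatment)]
    mean_diff = statistics.mean(diffs)
    std_diff = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    if std_diff == 0:
        raise ValueError("all paired differences are identical; t is undefined")
    t_value = mean_diff / (std_diff / math.sqrt(len(diffs)))
    return t_value, len(diffs) - 1
```

Turning the statistic into a p-value requires the Student t distribution, which the standard library does not provide; a library routine such as `scipy.stats.ttest_rel` would typically be used for that step.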
Simulated Author's Rebuttal
We thank the referee for the constructive comments and the recommendation for minor revision. We address each major comment below and indicate the revisions planned for the next version of the manuscript.
- Referee: §4 (Experimental Protocol) and §5 (Results): The central claim that training-only data-aware augmentations (including the proposed cross-image mixing extension) yield excellent accuracy without crops at inference relies on the reported numbers across the 17 datasets. However, the manuscript does not indicate whether error bars, multiple random seeds, or statistical significance tests (e.g., paired t-tests) accompany the accuracy figures; without these, it is impossible to assess whether the observed gains over baselines are robust or could be explained by run-to-run variance, directly affecting the strength of the inference-cost reduction conclusion.
  Authors: We acknowledge that the absence of error bars or statistical tests limits the ability to quantify run-to-run variance. The scale of over 2000 experiments made full multi-seed runs for every configuration computationally prohibitive. In the revised manuscript we have added a discussion in Section 5 on this limitation and included error bars (from 3 seeds) for the primary accuracy comparisons on CUB-200-2011 and Stanford Cars. We also note the consistency of trends across 17 datasets and 9 backbones as supporting evidence of robustness. This constitutes a partial revision given resource constraints.
  Revision: partial
- Referee: §5.3 (Efficient Evaluation Variant): The paper asserts that the proposed evaluation-only variant maintains competitive accuracy while eliminating the crop forward pass. Specific quantitative comparisons (accuracy drop relative to full CAL, inference-time savings) are presented, yet no ablation isolating the contribution of the cross-image mixing versus the original CAL components is reported; this omission makes it difficult to determine whether the cost reduction is load-bearing on the new augmentation or would hold with simpler training-only augmentations.
  Authors: We thank the referee for this suggestion. The original experiments focused on end-to-end comparisons of the full pipeline. To isolate the role of cross-image mixing, the revised Section 5.3 now includes an ablation comparing the efficient evaluation variant trained with original CAL components versus with the added cross-image mixing. Results show that the mixing augmentation contributes measurably to maintaining accuracy in the no-crop setting. A new table has been added to present these findings.
  Revision: yes
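The rebuttal mentions error bars computed from 3 seeds. One common way to report them, shown here as a sketch under that assumption, is the mean and sample standard deviation per setting. The helper names are hypothetical, and the rebuttal does not say which spread measure the authors used.

```python
import statistics
from typing import Dict, Sequence, Tuple


def summarize_seeds(accuracies: Sequence[float]) -> Tuple[float, float]:
    """Mean and sample standard deviation of accuracies from repeated runs."""
    if len(accuracies) < 2:
        raise ValueError("need at least two seeds to estimate spread")
    return statistics.mean(accuracies), statistics.stdev(accuracies)


def format_error_bars(results: Dict[str, Sequence[float]]) -> Dict[str, str]:
    """Map each setting name to a 'mean +/- std' string with two decimals."""
    return {
        name: "{:.2f} +/- {:.2f}".format(*summarize_seeds(runs))
        for name, runs in results.items()
    }
```

With only 3 seeds the standard deviation is itself a noisy estimate, so such error bars indicate the scale of run-to-run variance rather than a tight confidence interval.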
Circularity Check
No circularity: empirical study grounded in new experiments
The paper reports results from over 2000 new experiments across 17 datasets, 9 backbones, and 6 training/evaluation settings. The central claim—that data-aware augmentations (including the proposed cross-image mixing extension to CAL) enable high accuracy without crops at inference—is directly supported by these fresh empirical measurements and shared code/checkpoints rather than any derivation, fitted parameter, or self-citation chain that reduces to the inputs by construction. No equations, uniqueness theorems, or ansatzes are invoked in a load-bearing way; the work is self-contained against external benchmarks via explicit experimental protocols.
Axiom & Free-Parameter Ledger
Axioms (1)
- standard math: Standard assumptions of supervised deep learning for image classification hold, including the suitability of cross-entropy loss and transfer from ImageNet-pretrained backbones.
Lean theorems connected to this paper
- IndisputableMonolith/CostJcost_pos_of_ne_one
  Tag: unclear. Relation between the paper passage and the cited Recognition theorem.
  Passage: data-aware augmentations during training only can enable a model to achieve excellent accuracy even without crops, significantly reducing inference costs
- IndisputableMonolith/Foundation/BranchSelectionbranch_selection
  Tag: unclear. Relation between the paper passage and the cited Recognition theorem.
  Passage: extend Counterfactual Attention Learning (CAL) ... with cross-image discriminative region mixing augmentation
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.