How to Choose Your Teacher for Fine Grained Image Recognition
Pith reviewed 2026-05-20 19:51 UTC · model grok-4.3
The pith
A ratio of a teacher's top two prediction probabilities reliably identifies the best teacher for knowledge distillation in fine-grained image recognition.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central discovery is that the Ratio 1-2, computed as the ratio of the highest to the second-highest softmax probability output by the teacher model on a dataset, correlates strongly with the final accuracy of the distilled student model. This holds across three student architectures, eight teacher models, eight fine-grained datasets, and four different training strategies for distillation.
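As a minimal sketch of the computation described above, the following Python turns teacher logits into softmax probabilities and takes the ratio of the largest to the second-largest value. The function names, the mean aggregation over the dataset, and the small epsilon guard are assumptions made for illustration; the summary does not say how per-sample ratios are aggregated.

```python
import math

def softmax(logits):
    """Numerically stable softmax over one list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def ratio_1_2(probs, eps=1e-12):
    """Ratio of the highest to the second-highest probability for one sample.

    eps is an illustrative guard against a second probability that
    underflows to zero; it is not part of the metric as described.
    """
    top1, top2 = sorted(probs, reverse=True)[:2]
    return top1 / max(top2, eps)

def dataset_ratio_1_2(all_logits):
    """Mean per-sample ratio over a dataset (the mean is an assumption)."""
    ratios = [ratio_1_2(softmax(row)) for row in all_logits]
    return sum(ratios) / len(ratios)
```

Because only a forward pass of each candidate teacher is needed, a computation of this kind requires no distillation run.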
What carries the argument
The Ratio 1-2 metric, which quantifies the teacher's prediction sharpness between its top two classes and ranks candidate teachers accordingly.
If this is right
- Selecting teachers with higher Ratio 1-2 values leads to greater accuracy improvements in the student model during distillation.
- Ratio 1-2 improves teacher selection by 18 percent over previous selection methods.
- Small student models can achieve accuracy gains of up to 17 percent compared to using suboptimal teachers.
- The metric works consistently across the four distillation training strategies evaluated, including standard KD.
- Extensive validation on eight diverse fine-grained datasets supports broad applicability within this domain.
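The selection step implied by the points above can be sketched as a simple sort over per-teacher scores. The teacher names and ratio values below are hypothetical, and the `higher_is_better` flag is an assumption added so the ranking direction can be checked against the paper rather than taken from this summary.

```python
def rank_teachers(ratios, higher_is_better=True):
    """Return teacher names sorted from most to least preferred."""
    return sorted(ratios, key=ratios.get, reverse=higher_is_better)

# Hypothetical dataset-level Ratio 1-2 values, for illustration only.
example_ratios = {"teacher_a": 3.2, "teacher_b": 7.5, "teacher_c": 1.4}

ranking = rank_teachers(example_ratios)
best_teacher = ranking[0]
```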
Where Pith is reading between the lines
- This approach might reduce the computational cost of teacher selection by avoiding full distillation runs for every candidate.
- Similar ratio-based metrics could be explored for teacher selection in other computer vision tasks beyond fine-grained recognition.
- Testing the metric on emerging model architectures not covered in the experiments would clarify its robustness.
- Integrating this selection into automated model compression pipelines could streamline deployment of efficient classifiers.
Load-bearing premise
The Ratio 1-2 metric derived from teacher predictions on the training data will continue to predict good distillation outcomes on new datasets or with different model families not tested here.
What would settle it
Running the full set of distillation experiments on a ninth fine-grained dataset and finding that the teacher with the highest Ratio 1-2 does not produce the best student accuracy would challenge the metric's reliability.
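A check of this kind could be scripted roughly as follows. The sketch computes a Spearman rank correlation between per-teacher ratios and student accuracies and tests whether the top-ranked teacher is also the best one. The choice of Spearman correlation and all function names are assumptions for illustration, not the paper's stated procedure.

```python
def average_ranks(values):
    """Assign 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def pearson(xs, ys):
    """Pearson correlation; undefined (division by zero) if either input is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

def spearman(xs, ys):
    """Spearman rank correlation, computed as Pearson correlation of the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))

def top_ranked_is_best(ratios, accuracies):
    """True if the teacher with the highest ratio also yields the best student."""
    return max(ratios, key=ratios.get) == max(accuracies, key=accuracies.get)
```

A low correlation, or a `False` result from `top_ranked_is_best` on a new dataset, would be the kind of evidence described above.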
Original abstract
Fine-grained image recognition classifies subcategories such as bird species or car models. While state-of-the-art (SOTA) models are accurate, they are often too resource-intensive for deployment on constrained devices. Knowledge distillation addresses this by transferring knowledge from a large teacher model to a smaller student model. A key challenge is selecting the right teacher, as it heavily impacts student performance. This paper introduces a teacher selection metric, Ratio 1-2, based on teacher prediction ratios. Extensive analysis of over one thousand experiments across 3 students, 8 teachers, and 8 datasets under 4 training strategies demonstrates that our metric improves teacher selection by 18% over previous methods, enabling small student models to achieve up to 17% accuracy gains. Experiment codebase is available at: https://github.com/arkel23/FGIR-KD-Teacher.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces a teacher selection metric called Ratio 1-2 for knowledge distillation in fine-grained image recognition, based on the ratio of the teacher's top-1 to top-2 prediction probabilities. It reports results from over one thousand experiments across 3 students, 8 teachers, 8 datasets, and 4 training strategies, claiming an 18% improvement in teacher selection over previous methods and up to 17% accuracy gains for small student models.
Significance. If the results hold, this work provides a practical and computationally lightweight method for choosing teachers in distillation setups for FGIR, potentially leading to more efficient models without sacrificing much accuracy. The large-scale experimental analysis across multiple dimensions offers robust empirical support for the claims within the evaluated settings, and the public codebase enhances reproducibility.
major comments (1)
- [Results and Discussion] All reported performance gains and the superiority of Ratio 1-2 are based on experiments conducted within the same 8 datasets and 4 strategies. The manuscript does not include held-out dataset or architecture experiments where Ratio 1-2 is calculated on a new dataset to select the teacher before evaluating the student's distilled performance. This is a load-bearing issue for the claim that the metric reliably improves teacher selection in general FGIR scenarios, as the correlation might be specific to the chosen datasets' characteristics such as class granularity or noise patterns.
minor comments (1)
- [Abstract] Consider clarifying how the 18% improvement is calculated, e.g., relative to which baseline and using what aggregation across experiments.
Simulated Author's Rebuttal
We thank the referee for the thoughtful and constructive review. The feedback highlights an important aspect of generalizability that we address directly below.
-
Referee: [Results and Discussion] All reported performance gains and the superiority of Ratio 1-2 are based on experiments conducted within the same 8 datasets and 4 strategies. The manuscript does not include held-out dataset or architecture experiments where Ratio 1-2 is calculated on a new dataset to select the teacher before evaluating the student's distilled performance. This is a load-bearing issue for the claim that the metric reliably improves teacher selection in general FGIR scenarios, as the correlation might be specific to the chosen datasets' characteristics such as class granularity or noise patterns.
Authors: We agree that experiments on held-out datasets would provide stronger support for the metric's reliability across arbitrary FGIR scenarios. Our evaluation already covers eight diverse datasets spanning multiple domains (e.g., birds, cars, aircraft) with varying class counts, image resolutions, and noise characteristics, and Ratio 1-2 shows consistent gains in every case. This breadth reduces the likelihood that results are tied to idiosyncratic dataset properties. To directly address the concern, we will add a new set of held-out experiments in the revised manuscript: we will introduce additional FGIR datasets not used in the original study, compute Ratio 1-2 on each new dataset to select the teacher, perform distillation, and report the resulting student performance. These results will be presented alongside the existing analysis to demonstrate cross-dataset applicability.
Revision: yes
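The held-out protocol proposed in the response can be summarized with two numbers per study: how often the metric picks the best teacher, and how much student accuracy is lost when it does not. The sketch below is one possible way to report this. The function name, the dictionary layout, and the two summary statistics are assumptions for illustration and are not taken from the paper or its codebase.

```python
def selection_report(ratios_by_dataset, accuracy_by_dataset):
    """Compare metric-selected teachers with the best teacher per dataset.

    ratios_by_dataset:   {dataset: {teacher: Ratio 1-2 value}}
    accuracy_by_dataset: {dataset: {teacher: distilled student accuracy}}
    """
    hits, regrets = 0, []
    for name, ratios in ratios_by_dataset.items():
        accs = accuracy_by_dataset[name]
        chosen = max(ratios, key=ratios.get)   # teacher picked by the metric
        oracle = max(accs, key=accs.get)       # teacher that was actually best
        hits += chosen == oracle
        regrets.append(accs[oracle] - accs[chosen])
    n = len(regrets)
    return {"hit_rate": hits / n, "mean_regret": sum(regrets) / n}
```

A hit rate near 1 and a mean regret near 0 on datasets not used in the original study would support the generality claim; the reverse would support the referee's concern.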
Circularity Check
No circularity: Ratio 1-2 defined directly from observable predictions and validated empirically
The paper introduces Ratio 1-2 as a metric computed from a teacher's top-1 and top-2 class prediction probabilities on input images. This definition uses only raw model outputs and contains no fitted parameters, self-referential equations, or reduction to prior results by the same authors. The reported 18% improvement and accuracy gains are obtained from over 1000 experimental runs across 3 students, 8 teachers, 8 datasets, and 4 strategies; the metric is applied first and performance is measured afterward. No load-bearing self-citation, uniqueness theorem, or ansatz smuggling appears in the derivation. The chain is therefore self-contained and externally falsifiable via the released codebase.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: A larger teacher model can transfer useful knowledge to a smaller student via distillation on fine-grained tasks.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel (echoes)
  Echoes: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.
  Passage: The Ratio 1-2 metric is defined as: R12 = P1 / P2, where P1 and P2 denote the highest and second-highest logits... A large ratio indicates that the teacher strongly favors a single class... a smaller ratio suggests that the teacher considers multiple classes plausible, providing richer signals
- IndisputableMonolith/Foundation/ArithmeticFromLogic.lean, embed_injective (echoes)
  Echoes: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.
  Passage: smaller ratio suggests that the teacher considers multiple classes plausible, providing richer signals that better capture fine-grained class relationships
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.