arxiv: 2605.15689 · v1 · pith:CKNPYNYRnew · submitted 2026-05-15 · 💻 cs.CV

How to Choose Your Teacher for Fine Grained Image Recognition

Pith reviewed 2026-05-20 19:51 UTC · model grok-4.3

classification 💻 cs.CV
keywords fine-grained image recognition · knowledge distillation · teacher selection · model compression · prediction ratio · student-teacher models · image classification

The pith

A ratio of a teacher's top two prediction probabilities reliably identifies the best teacher for knowledge distillation in fine-grained image recognition.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks to improve knowledge distillation for fine-grained image recognition by finding a better way to select the teacher model. It introduces the Ratio 1-2 metric, which measures how distinctly the teacher separates its top two class predictions. Analysis of over one thousand experiments shows this metric selects teachers 18 percent better than prior approaches. This selection allows compact student models to reach up to 17 percent higher accuracy on tasks like distinguishing bird species or car models. The result matters because it reduces the need for trial-and-error in deploying efficient models on limited hardware.

Core claim

The central discovery is that the Ratio 1-2, computed as the ratio of the highest to the second-highest softmax probability output by the teacher model on a dataset, correlates strongly with the final accuracy of the distilled student model. This holds across three student architectures, eight teacher models, eight fine-grained datasets, and four different training strategies for distillation.
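
As a minimal sketch of the computation described above, the Python below derives a per-sample Ratio 1-2 from raw teacher logits by applying softmax and dividing the largest probability by the second largest. The function names are hypothetical, and averaging the per-sample ratios over the dataset is an assumption of this sketch; this summary does not state how the paper aggregates the metric.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def ratio_1_2(logits):
    """Top-1 softmax probability divided by the top-2 probability (one sample).

    Requires at least two classes. Extremely confident predictions can make
    the second probability underflow to zero; this sketch does not guard
    against that case.
    """
    probs = sorted(softmax(logits), reverse=True)
    return probs[0] / probs[1]


def dataset_ratio_1_2(all_logits):
    """Mean per-sample ratio over a dataset.

    ASSUMPTION: mean aggregation is chosen for illustration only.
    """
    ratios = [ratio_1_2(row) for row in all_logits]
    return sum(ratios) / len(ratios)
```

Because both probabilities share the same softmax normalizer, the per-sample ratio equals exp(z1 - z2) for the top two logits z1 and z2, so it depends only on the gap between those two logits.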

What carries the argument

The Ratio 1-2 metric, which quantifies the teacher's prediction sharpness between its top two classes and ranks candidate teachers accordingly.

If this is right

  • Selecting teachers with higher Ratio 1-2 values leads to greater accuracy improvements in the student model during distillation.
  • Previous teacher selection methods are outperformed by 18 percent in terms of choosing the optimal teacher.
  • Small student models can achieve accuracy gains of up to 17 percent compared to using suboptimal teachers.
  • The metric works consistently under multiple distillation training strategies including standard KD and others.
  • Extensive validation on eight diverse fine-grained datasets supports broad applicability within this domain.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This approach might reduce the computational cost of teacher selection by avoiding full distillation runs for every candidate.
  • Similar ratio-based metrics could be explored for teacher selection in other computer vision tasks beyond fine-grained recognition.
  • Testing the metric on emerging model architectures not covered in the experiments would clarify its robustness.
  • Integrating this selection into automated model compression pipelines could streamline deployment of efficient classifiers.

Load-bearing premise

The Ratio 1-2 metric derived from teacher predictions on the training data will continue to predict good distillation outcomes on new datasets or with different model families not tested here.

What would settle it

Running the full set of distillation experiments on a ninth fine-grained dataset and finding that the teacher with the highest Ratio 1-2 does not produce the best student accuracy would challenge the metric's reliability.
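
The test described above can be sketched as a comparison between each teacher's Ratio 1-2 and the accuracy of the student it produced. The helper names, the use of Spearman rank correlation, and the `prefer_high` switch for the selection direction are all assumptions of this sketch rather than the paper's protocol.

```python
import math


def ranks(values):
    """Rank values from 1 upward, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the two rank lists."""
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / math.sqrt(vx * vy)


def selection_matches(r12_by_teacher, acc_by_teacher, prefer_high=True):
    """True if the teacher picked by Ratio 1-2 also gave the best student.

    ASSUMPTION: prefer_high controls whether the highest or the lowest ratio
    is selected; set it to match the selection rule being tested.
    """
    choose = max if prefer_high else min
    picked = choose(r12_by_teacher, key=r12_by_teacher.get)
    best = max(acc_by_teacher, key=acc_by_teacher.get)
    return picked == best
```

On a new dataset, a `selection_matches` result of `False` together with a weak rank correlation across the candidate teachers would be the kind of evidence that challenges the metric.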

Figures

Figures reproduced from arXiv: 2605.15689 by Augusto Christian Surya, Bo-Cheng Lai, Edwin Arkel Rios, Fernando Mikael, Min-Chun Hu, Oswin Gosal.

Figure 1: Visualization of the top-1 and top-2 teacher logits used in …
Figure 2: Comparison of ResNet-18 accuracies trained from …
Figure 3: Distribution of the teachers' R12 on the CUB under the TGDA setting, using a ResNet-18 trained from scratch.

Original abstract

Fine-grained image recognition classifies subcategories such as bird species or car models. While state-of-the-art (SOTA) models are accurate, they are often too resource-intensive for deployment on constrained devices. Knowledge distillation addresses this by transferring knowledge from a large teacher model to a smaller student model. A key challenge is selecting the right teacher, as it heavily impacts student performance. This paper introduces a teacher selection metric, Ratio 1-2, based on teacher prediction ratios. Extensive analysis of over one thousand experiments across 3 students, 8 teachers, and 8 datasets under 4 training strategies demonstrates that our metric improves teacher selection by 18% over previous methods, enabling small student models to achieve up to 17% accuracy gains. Experiment codebase is available at: https://github.com/arkel23/FGIR-KD-Teacher.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper introduces a teacher selection metric called Ratio 1-2 for knowledge distillation in fine-grained image recognition, based on the ratio of the teacher's top-1 to top-2 prediction probabilities. It reports results from over one thousand experiments across 3 students, 8 teachers, 8 datasets, and 4 training strategies, claiming an 18% improvement in teacher selection over previous methods and up to 17% accuracy gains for small student models.

Significance. If the results hold, this work provides a practical and computationally lightweight method for choosing teachers in distillation setups for FGIR, potentially leading to more efficient models without sacrificing much accuracy. The large-scale experimental analysis across multiple dimensions offers robust empirical support for the claims within the evaluated settings, and the public codebase enhances reproducibility.

major comments (1)
  1. [Results and Discussion] All reported performance gains and the superiority of Ratio 1-2 are based on experiments conducted within the same 8 datasets and 4 strategies. The manuscript does not include held-out dataset or architecture experiments where Ratio 1-2 is calculated on a new dataset to select the teacher before evaluating the student's distilled performance. This is a load-bearing issue for the claim that the metric reliably improves teacher selection in general FGIR scenarios, as the correlation might be specific to the chosen datasets' characteristics such as class granularity or noise patterns.
minor comments (1)
  1. [Abstract] Consider clarifying how the 18% improvement is calculated, e.g., relative to which baseline and using what aggregation across experiments.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the thoughtful and constructive review. The feedback highlights an important aspect of generalizability that we address directly below.

  1. Referee: [Results and Discussion] All reported performance gains and the superiority of Ratio 1-2 are based on experiments conducted within the same 8 datasets and 4 strategies. The manuscript does not include held-out dataset or architecture experiments where Ratio 1-2 is calculated on a new dataset to select the teacher before evaluating the student's distilled performance. This is a load-bearing issue for the claim that the metric reliably improves teacher selection in general FGIR scenarios, as the correlation might be specific to the chosen datasets' characteristics such as class granularity or noise patterns.

    Authors: We agree that experiments on held-out datasets would provide stronger support for the metric's reliability across arbitrary FGIR scenarios. Our evaluation already covers eight diverse datasets spanning multiple domains (e.g., birds, cars, aircraft) with varying class counts, image resolutions, and noise characteristics, and Ratio 1-2 shows consistent gains in every case. This breadth reduces the likelihood that results are tied to idiosyncratic dataset properties. To directly address the concern, we will add a new set of held-out experiments in the revised manuscript: we will introduce additional FGIR datasets not used in the original study, compute Ratio 1-2 on each new dataset to select the teacher, perform distillation, and report the resulting student performance. These results will be presented alongside the existing analysis to demonstrate cross-dataset applicability.

    revision: yes

Circularity Check

0 steps flagged

No circularity: Ratio 1-2 defined directly from observable predictions and validated empirically

The paper introduces Ratio 1-2 as a metric computed from a teacher's top-1 and top-2 class prediction probabilities on input images. This definition uses only raw model outputs and contains no fitted parameters, self-referential equations, or reduction to prior results by the same authors. The reported 18% improvement and accuracy gains are obtained from over one thousand experimental runs across 3 students, 8 teachers, 8 datasets, and 4 strategies; the metric is applied first and performance is measured afterward. No load-bearing self-citation, uniqueness theorem, or ansatz smuggling appears in the derivation. The chain is therefore self-contained and externally falsifiable via the released codebase.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on standard assumptions of knowledge distillation (that a larger teacher can transfer useful knowledge to a smaller student) and on the empirical observation that prediction-ratio statistics correlate with downstream student accuracy. No explicit free parameters, new physical entities, or ad-hoc axioms beyond these domain conventions are introduced in the abstract.

axioms (1)
  • domain assumption: A larger teacher model can transfer useful knowledge to a smaller student via distillation on fine-grained tasks.
    Standard premise of the knowledge distillation literature, invoked to motivate teacher selection.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel echoes

    ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

    The Ratio 1-2 metric is defined as: R12 = P1 / P2, where P1 and P2 denote the highest and second-highest logits... A large ratio indicates that the teacher strongly favors a single class... a smaller ratio suggests that the teacher considers multiple classes plausible, providing richer signals

  • IndisputableMonolith/Foundation/ArithmeticFromLogic.lean embed_injective echoes

    ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

    smaller ratio suggests that the teacher considers multiple classes plausible, providing richer signals that better capture fine-grained class relationships

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
