How to Choose Your Teacher for Fine Grained Image Recognition

Augusto Christian Surya; Bo-Cheng Lai; Edwin Arkel Rios; Fernando Mikael; Min-Chun Hu; Oswin Gosal

arxiv: 2605.15689 · v1 · pith:CKNPYNYR · submitted 2026-05-15 · 💻 cs.CV

Oswin Gosal, Edwin Arkel Rios, Augusto Christian Surya, Fernando Mikael, Bo-Cheng Lai, Min-Chun Hu

Pith reviewed 2026-05-20 19:51 UTC · model grok-4.3

classification: 💻 cs.CV

keywords: fine-grained image recognition · knowledge distillation · teacher selection · model compression · prediction ratio · student-teacher models · image classification

The pith

A ratio of a teacher's top two prediction probabilities reliably identifies the best teacher for knowledge distillation in fine-grained image recognition.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks to improve knowledge distillation for fine-grained image recognition by finding a better way to select the teacher model. It introduces the Ratio 1-2 metric, which measures how distinctly the teacher separates its top two class predictions. Analysis of over one thousand experiments shows that this metric improves teacher selection by 18 percent over prior approaches. Choosing teachers this way lets compact student models gain up to 17 percent accuracy on tasks such as distinguishing bird species or car models. The result matters because it reduces trial and error when deploying efficient models on limited hardware.

Core claim

The central discovery is that the Ratio 1-2, computed as the ratio of the highest to the second-highest softmax probability output by the teacher model on a dataset, correlates strongly with the final accuracy of the distilled student model. This holds across three student architectures, eight teacher models, eight fine-grained datasets, and four different training strategies for distillation.
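A minimal Python sketch of the metric as described here, assuming the teacher's raw outputs for each image are available as a list of logits. It applies a softmax, divides the largest probability by the second largest, and averages over the dataset. Two details are assumptions rather than facts from the paper: the mean as the dataset-level aggregate, and the hypothetical `use_softmax` switch, included because the paper excerpt quoted further down this page calls P1 and P2 logits while this summary speaks of softmax probabilities.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax for one sample's logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def ratio_1_2(logits: Sequence[float], use_softmax: bool = True) -> float:
    """Ratio of the top-1 to the top-2 teacher score for one sample.

    use_softmax is a hypothetical switch: True ratios softmax probabilities,
    False ratios the raw logits (only meaningful when both are positive).
    """
    scores = softmax(logits) if use_softmax else list(logits)
    # Sort descending and keep the two largest scores.
    top1, top2 = sorted(scores, reverse=True)[:2]
    return top1 / top2


def dataset_r12(all_logits: Sequence[Sequence[float]], use_softmax: bool = True) -> float:
    """Dataset-level R12, assumed here to be the mean of per-sample ratios."""
    ratios = [ratio_1_2(row, use_softmax) for row in all_logits]
    return sum(ratios) / len(ratios)
```

Because the softmax normalizer cancels, the probability-based ratio equals exp(z1 - z2), so it depends only on the gap between the two largest logits: `ratio_1_2([4.0, 1.0, 0.0])` is e^3 (about 20.1, a teacher that strongly favors one class), while `ratio_1_2([1.0, 0.9, 0.0])` is e^0.1 (about 1.11, two classes nearly tied).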

What carries the argument

The Ratio 1-2 metric, which quantifies the teacher's prediction sharpness between its top two classes and ranks candidate teachers accordingly.
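Ranking candidates then needs only one inference pass per teacher to obtain its R12, with no distillation runs. The sketch below sorts hypothetical teachers by their dataset-level R12. The direction is a parameter rather than a fixed rule: this summary says higher values help, whereas the paper excerpt quoted further down associates smaller ratios with richer signals, so `higher_is_better` is an assumption to set according to the paper's actual selection rule.

```python
from typing import Dict, List


def rank_teachers(r12_by_teacher: Dict[str, float],
                  higher_is_better: bool = True) -> List[str]:
    """Return teacher names ordered from most to least preferred by R12."""
    # sorted() on a dict iterates over its keys; the key function looks up each R12.
    return sorted(r12_by_teacher, key=r12_by_teacher.get,
                  reverse=higher_is_better)


# Hypothetical teacher names and R12 values, for illustration only.
candidates = {"teacher_a": 3.2, "teacher_b": 1.4, "teacher_c": 7.9}
print(rank_teachers(candidates))         # ['teacher_c', 'teacher_a', 'teacher_b']
print(rank_teachers(candidates, False))  # ['teacher_b', 'teacher_a', 'teacher_c']
```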

If this is right

Selecting teachers with higher Ratio 1-2 values leads to greater accuracy improvements in the student model during distillation.
Ratio 1-2 improves teacher selection by 18 percent over previous selection methods.
Small student models can achieve accuracy gains of up to 17 percent compared to using suboptimal teachers.
The metric works consistently across all four distillation training strategies tested, including standard KD.
Extensive validation on eight diverse fine-grained datasets supports broad applicability within this domain.
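Since the metric is evaluated under several distillation strategies, a reference point helps: the standard KD objective of Hinton et al. (2015), sketched for a single sample in plain Python. This is the textbook formulation, not the loss or hyperparameters of any specific strategy in the paper; `temperature=4.0` and `alpha=0.5` are illustrative defaults.

```python
import math
from typing import List, Sequence


def softmax_t(logits: Sequence[float], temperature: float) -> List[float]:
    """Softmax of logits / temperature, shifted by the max for stability."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kd_loss(student_logits: Sequence[float], teacher_logits: Sequence[float],
            label: int, temperature: float = 4.0, alpha: float = 0.5) -> float:
    """alpha * CE(student, label) + (1 - alpha) * T^2 * KL(teacher_T || student_T)."""
    # Hard-label cross-entropy at temperature 1.
    ce = -math.log(softmax_t(student_logits, 1.0)[label])
    # Soft targets: both distributions are softened with the same temperature.
    p_t = softmax_t(teacher_logits, temperature)
    p_s = softmax_t(student_logits, temperature)
    kl = sum(pt * math.log(pt / ps) for pt, ps in zip(p_t, p_s) if pt > 0)
    # The T^2 factor keeps soft-target gradients on a comparable scale as T changes.
    return alpha * ce + (1 - alpha) * temperature ** 2 * kl
```

When the student already matches the teacher, the KL term vanishes and only the weighted cross-entropy remains.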

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

This approach might reduce the computational cost of teacher selection by avoiding full distillation runs for every candidate.
Similar ratio-based metrics could be explored for teacher selection in other computer vision tasks beyond fine-grained recognition.
Testing the metric on emerging model architectures not covered in the experiments would clarify its robustness.
Integrating this selection into automated model compression pipelines could streamline deployment of efficient classifiers.

Load-bearing premise

The Ratio 1-2 metric derived from teacher predictions on the training data will continue to predict good distillation outcomes on new datasets or with different model families not tested here.

What would settle it

Running the full set of distillation experiments on a ninth fine-grained dataset and finding that the teacher with the highest Ratio 1-2 does not produce the best student accuracy would challenge the metric's reliability.
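The falsification test above amounts to a simple check once a new dataset's results are in. A minimal sketch with hypothetical inputs; `higher_is_better` defaults to this section's wording, in which the highest-R12 teacher is expected to produce the best student.

```python
from typing import Dict


def top_pick_matches_best(r12_by_teacher: Dict[str, float],
                          student_acc_by_teacher: Dict[str, float],
                          higher_is_better: bool = True) -> bool:
    """Check whether the teacher chosen by R12 also produced the best student.

    R12 values should be computed before distillation; student accuracies come
    from the completed distillation runs on the same held-out dataset.
    """
    if set(r12_by_teacher) != set(student_acc_by_teacher):
        raise ValueError("both dicts must cover the same candidate teachers")
    choose = max if higher_is_better else min
    pick = choose(r12_by_teacher, key=r12_by_teacher.get)
    # On accuracy ties, max() keeps the first teacher in dict order.
    best = max(student_acc_by_teacher, key=student_acc_by_teacher.get)
    return pick == best
```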

Figures

Figures reproduced from arXiv: 2605.15689 by Augusto Christian Surya, Bo-Cheng Lai, Edwin Arkel Rios, Fernando Mikael, Min-Chun Hu, Oswin Gosal.

**Figure 2.** Comparison of ResNet-18 accuracies trained from …

**Figure 3.** Distribution of the teachers' R12 on CUB under the TGDA setting, using a ResNet-18 trained from scratch.

Original abstract

Fine-grained image recognition classifies subcategories such as bird species or car models. While state-of-the-art (SOTA) models are accurate, they are often too resource-intensive for deployment on constrained devices. Knowledge distillation addresses this by transferring knowledge from a large teacher model to a smaller student model. A key challenge is selecting the right teacher, as it heavily impacts student performance. This paper introduces a teacher selection metric, Ratio 1-2, based on teacher prediction ratios. Extensive analysis of over one thousand experiments across 3 students, 8 teachers, and 8 datasets under 4 training strategies demonstrates that our metric improves teacher selection by 18% over previous methods, enabling small student models to achieve up to 17% accuracy gains. Experiment codebase is available at: https://github.com/arkel23/FGIR-KD-Teacher.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Desk Editor's Note (private letter to a colleague)

Ratio 1-2 gives a simple, observable rule for picking distillation teachers that beats baselines in the authors' 1000+ runs, but the gains are measured only on the same eight datasets, with no held-out dataset check.

The paper's core idea is a teacher selection metric called Ratio 1-2, built from the ratio of a teacher's top-1 to top-2 prediction on the data. They show it improves selection over earlier methods by 18 percent and lifts student accuracy by as much as 17 percent in fine-grained tasks like bird or car classification. That is the main new piece: a low-cost, post-training signal that does not require extra fitting or labels.

Referee Report

1 major / 1 minor

Summary. The paper introduces a teacher selection metric called Ratio 1-2 for knowledge distillation in fine-grained image recognition, based on the ratio of the teacher's top-1 to top-2 prediction probabilities. It reports results from over one thousand experiments across 3 students, 8 teachers, 8 datasets, and 4 training strategies, claiming an 18% improvement in teacher selection over previous methods and up to 17% accuracy gains for small student models.

Significance. If the results hold, this work provides a practical and computationally lightweight method for choosing teachers in distillation setups for FGIR, potentially leading to more efficient models without sacrificing much accuracy. The large-scale experimental analysis across multiple dimensions offers robust empirical support for the claims within the evaluated settings, and the public codebase enhances reproducibility.

major comments (1)

[Results and Discussion] All reported performance gains and the superiority of Ratio 1-2 are based on experiments conducted within the same 8 datasets and 4 strategies. The manuscript does not include held-out dataset or architecture experiments where Ratio 1-2 is calculated on a new dataset to select the teacher before evaluating the student's distilled performance. This is a load-bearing issue for the claim that the metric reliably improves teacher selection in general FGIR scenarios, as the correlation might be specific to the chosen datasets' characteristics such as class granularity or noise patterns.

minor comments (1)

[Abstract] Consider clarifying how the 18% improvement is calculated, e.g., relative to which baseline and using what aggregation across experiments.
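The aggregation question matters because a single percentage can mean different things. As one hypothetical illustration (not the paper's procedure), the sketch below scores a selection metric by its hit rate, the fraction of (student, dataset, strategy) settings in which its top-scored teacher yields the best student, and then compares two metrics both in absolute percentage points and as a relative gain. The data layout and function names are assumptions, and the metric is assumed to prefer higher scores.

```python
from typing import Dict, List, Tuple

# One setting: (metric score per teacher, student accuracy per teacher).
Setting = Tuple[Dict[str, float], Dict[str, float]]


def hit_rate(settings: List[Setting]) -> float:
    """Fraction of settings where the metric's top-scored teacher gives the best student."""
    hits = 0
    for scores, accs in settings:
        pick = max(scores, key=scores.get)  # assumes higher metric score = preferred
        best = max(accs, key=accs.get)
        if pick == best:
            hits += 1
    return hits / len(settings)


def compare(new_rate: float, old_rate: float) -> Dict[str, float]:
    """Express the same improvement as absolute points and as a relative percentage."""
    return {
        "absolute_points": 100 * (new_rate - old_rate),
        "relative_percent": 100 * (new_rate - old_rate) / old_rate,
    }
```

For example, hit rates of 0.6 and 0.5 give a 10-point absolute improvement but a 20 percent relative one, which is why the baseline and aggregation need to be stated.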

Simulated Authors' Rebuttal

1 response · 0 unresolved

We thank the referee for the thoughtful and constructive review. The feedback highlights an important aspect of generalizability that we address directly below.

Referee: [Results and Discussion] All reported performance gains and the superiority of Ratio 1-2 are based on experiments conducted within the same 8 datasets and 4 strategies. The manuscript does not include held-out dataset or architecture experiments where Ratio 1-2 is calculated on a new dataset to select the teacher before evaluating the student's distilled performance. This is a load-bearing issue for the claim that the metric reliably improves teacher selection in general FGIR scenarios, as the correlation might be specific to the chosen datasets' characteristics such as class granularity or noise patterns.

Authors: We agree that experiments on held-out datasets would provide stronger support for the metric's reliability across arbitrary FGIR scenarios. Our evaluation already covers eight diverse datasets spanning multiple domains (e.g., birds, cars, aircraft) with varying class counts, image resolutions, and noise characteristics, and Ratio 1-2 shows consistent gains in every case. This breadth reduces the likelihood that results are tied to idiosyncratic dataset properties. To directly address the concern, we will add a new set of held-out experiments in the revised manuscript: we will introduce additional FGIR datasets not used in the original study, compute Ratio 1-2 on each new dataset to select the teacher, perform distillation, and report the resulting student performance. These results will be presented alongside the existing analysis to demonstrate cross-dataset applicability. revision: yes

Circularity Check

0 steps flagged

No circularity: Ratio 1-2 defined directly from observable predictions and validated empirically

full rationale

The paper introduces Ratio 1-2 as a metric computed from a teacher's top-1 and top-2 class prediction probabilities on input images. This definition uses only raw model outputs and contains no fitted parameters, self-referential equations, or reduction to prior results by the same authors. The reported 18% improvement and accuracy gains are obtained from over 1000 experimental runs across 3 students, 8 teachers, 8 datasets, and 4 strategies; the metric is computed first and student performance is measured afterward. No load-bearing self-citation, uniqueness theorem, or ansatz smuggling appears in the derivation. The chain is therefore self-contained and externally falsifiable via the released codebase.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on standard assumptions of knowledge distillation (that a larger teacher can transfer useful knowledge to a smaller student) and on the empirical observation that prediction-ratio statistics correlate with downstream student accuracy. No explicit free parameters, new physical entities, or ad-hoc axioms beyond these domain conventions are introduced in the abstract.
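The correlation observation can be quantified directly. A minimal sketch, assuming one (R12, student accuracy) pair per teacher within a fixed student, dataset, and strategy: Pearson's r measures linear association, and Spearman's rho (Pearson's r on average ranks) measures whether the R12 ordering matches the accuracy ordering, which is what teacher selection relies on. Which coefficient the paper reports is not stated on this page.

```python
import math
from typing import List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson's r between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)  # raises ZeroDivisionError if either input is constant


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the mean of the positions they occupy."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman's rho: Pearson's r computed on average ranks."""
    return pearson(average_ranks(x), average_ranks(y))
```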

axioms (1)

Domain assumption: A larger teacher model can transfer useful knowledge to a smaller student via distillation on fine-grained tasks.
Standard premise of knowledge distillation literature invoked to motivate teacher selection.

pith-pipeline@v0.9.0 · 5709 in / 1323 out tokens · 33484 ms · 2026-05-20T19:51:20.445776+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel echoes

ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

The Ratio 1-2 metric is defined as: R12 = P1 / P2, where P1 and P2 denote the highest and second-highest logits... A large ratio indicates that the teacher strongly favors a single class... a smaller ratio suggests that the teacher considers multiple classes plausible, providing richer signals
IndisputableMonolith/Foundation/ArithmeticFromLogic.lean embed_injective echoes

ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

smaller ratio suggests that the teacher considers multiple classes plausible, providing richer signals that better capture fine-grained class relationships

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

