AgriKD: Cross-Architecture Knowledge Distillation for Efficient Leaf Disease Classification
Pith reviewed 2026-05-09 14:03 UTC · model grok-4.3
The pith
AgriKD distills knowledge from a Vision Transformer into a compact CNN, achieving comparable leaf disease classification accuracy with 172 times fewer parameters and 18-22 times lower latency.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
AgriKD enables effective cross-architecture knowledge transfer by applying distillation at the output level for class predictions, the feature level for intermediate activations, and the relational level for inter-class relationships. This multi-objective setup allows the compact CNN student to retain the Vision Transformer's capacity for modeling long-range dependencies in leaf images, resulting in performance comparable to the teacher alongside large reductions in size and compute.
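As a rough illustration of the output-level term, the sketch below computes a temperature-softened KL divergence between teacher and student class predictions. This is the classic logit-distillation form, used here as an assumption: the page does not give the paper's exact loss, and the temperature value, function names, and example logits are all hypothetical.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax, stabilised by subtracting the max logit."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def output_kd_loss(teacher_logits, student_logits, temperature=4.0):
    """KL(teacher || student) on softened distributions, scaled by T^2.

    The T^2 factor is the usual convention for keeping gradient magnitudes
    comparable across temperatures; it is assumed here, not taken from the paper.
    """
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return temperature ** 2 * kl

# Hypothetical logits for a 3-class leaf disease problem.
teacher = [4.0, 1.0, 0.5]
student = [3.0, 1.5, 0.2]
loss = output_kd_loss(teacher, student)
```

The loss is zero when the student reproduces the teacher's logits and grows as the softened distributions diverge.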
What carries the argument
The multi-level distillation framework of AgriKD, which integrates output, feature, and relational objectives to bridge representational gaps between ViT teachers and CNN students.
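One way to picture how the three objectives might be combined is a weighted sum on top of the student's task loss, with a linear projection aligning student features to the teacher's dimensionality before an MSE comparison. The sketch below assumes that form; the weights `alpha`, `beta`, `gamma`, the projection matrix, and all numbers are hypothetical and not taken from the paper.

```python
def project(features, weight):
    """Linear projection; weight has shape (teacher_dim, student_dim)."""
    return [sum(w * x for w, x in zip(row, features)) for row in weight]

def feature_mse(teacher_feat, student_feat, weight):
    """Mean squared error between teacher features and projected student features."""
    projected = project(student_feat, weight)
    return sum((t - s) ** 2 for t, s in zip(teacher_feat, projected)) / len(teacher_feat)

def combined_loss(task, l_out, l_feat, l_rel, alpha=1.0, beta=1.0, gamma=1.0):
    """Weighted sum of the task loss and the three distillation terms (assumed form)."""
    return task + alpha * l_out + beta * l_feat + gamma * l_rel

# Hypothetical 2-d student feature mapped into a 3-d teacher space.
weight = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
student_feat = [1.0, 2.0]
teacher_feat = [1.0, 2.0, 5.0]

l_feat = feature_mse(teacher_feat, student_feat, weight)
total = combined_loss(task=0.5, l_out=0.2, l_feat=0.1, l_rel=0.3,
                      alpha=1.0, beta=2.0, gamma=0.5)
```

In a trained system the projection would be learned jointly with the student; it is fixed here only to keep the example self-contained.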
If this is right
- The student model matches the ViT teacher's accuracy on multiple leaf disease datasets while using far fewer resources.
- Parameter count drops by a factor of roughly 172, computational cost by a factor of 47.57, and inference latency by a factor of 18 to 22.
- The optimized model preserves performance when exported to ONNX, TFLite Float16, and TensorRT FP16.
- Real-time inference works reliably on NVIDIA Jetson edge hardware and mobile applications for field use.
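The reduction factors quoted above are plain teacher-to-student ratios. A minimal sketch of that arithmetic follows; the `ModelProfile` type and every number in it are placeholders chosen for illustration, not measurements from the paper.

```python
from dataclasses import dataclass

@dataclass
class ModelProfile:
    params: int        # number of trainable parameters
    flops: float       # floating-point operations per forward pass
    latency_ms: float  # mean inference latency in milliseconds

def reduction_factors(teacher, student):
    """Return teacher/student ratios for each efficiency metric."""
    return {
        "params": teacher.params / student.params,
        "flops": teacher.flops / student.flops,
        "latency": teacher.latency_ms / student.latency_ms,
    }

# Placeholder profiles, NOT the paper's models.
teacher = ModelProfile(params=80_000_000, flops=16e9, latency_ms=40.0)
student = ModelProfile(params=1_000_000, flops=4e8, latency_ms=2.0)
factors = reduction_factors(teacher, student)
```

Reporting the underlying counts alongside the ratios, as the referee requests, lets a reader recompute each factor directly.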
Where Pith is reading between the lines
- The same multi-level distillation could be tested on other vision tasks where transformers capture context that CNNs miss, such as weed identification in drone imagery.
- Tuning the relative weights of the three distillation losses per dataset might yield even better accuracy-efficiency balances on specific edge hardware.
- Applying the method to time-series or multi-spectral plant images could extend its value to monitoring disease spread over time.
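The weight-tuning idea above could be explored with a coarse grid search over the three loss weights. The sketch below assumes the weights are constrained to sum to 1 and that some `evaluate` callable returns a validation score for a weight triple; both are assumptions for illustration, and the toy scoring function stands in for a real training run.

```python
from itertools import product

def weight_grid(step=0.25):
    """Yield (alpha, beta, gamma) triples on a grid whose entries sum to 1."""
    n = round(1 / step)
    for i, j in product(range(n + 1), repeat=2):
        k = n - i - j
        if k >= 0:
            yield (i * step, j * step, k * step)

def best_weights(evaluate, step=0.25):
    """Return the grid point with the highest score under `evaluate`."""
    return max(weight_grid(step), key=evaluate)

def toy_score(weights):
    """Stand-in for validation accuracy; peaks at (0.5, 0.25, 0.25)."""
    alpha, beta, gamma = weights
    return -((alpha - 0.5) ** 2 + (beta - 0.25) ** 2 + (gamma - 0.25) ** 2)

grid = list(weight_grid(0.25))
chosen = best_weights(toy_score, step=0.25)
```

In practice each evaluation means training a student, so the grid would need to stay small or be replaced by a cheaper search strategy.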
Load-bearing premise
That combining distillation at output, feature, and relational levels can transfer the transformer's global modeling ability to a CNN without major accuracy loss.
What would settle it
Evaluating the distilled CNN on a new leaf disease dataset and finding its accuracy more than a few percentage points below the ViT teacher's would challenge the comparable-performance claim. The efficiency metrics, by contrast, can be measured directly.
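A check of this kind could be run as a paired comparison of teacher and student accuracy over the same folds or seeds. The sketch below computes the mean gap and a paired t-statistic using only the standard library; the per-fold accuracies and the 2-point tolerance are invented for illustration and do not come from the paper.

```python
import math
from statistics import mean, stdev

def paired_t_statistic(teacher_acc, student_acc):
    """Return (t, degrees_of_freedom) for paired per-fold accuracy differences.

    Uses the sample standard deviation; undefined if all differences are equal.
    """
    diffs = [t - s for t, s in zip(teacher_acc, student_acc)]
    n = len(diffs)
    t_stat = mean(diffs) / (stdev(diffs) / math.sqrt(n))
    return t_stat, n - 1

def within_tolerance(teacher_acc, student_acc, max_gap=0.02):
    """True if the mean accuracy gap stays under a chosen tolerance."""
    return mean(teacher_acc) - mean(student_acc) <= max_gap

# Invented per-fold accuracies for illustration only.
teacher_acc = [0.96, 0.95, 0.97, 0.96, 0.95]
student_acc = [0.95, 0.95, 0.96, 0.94, 0.95]

t_stat, dof = paired_t_statistic(teacher_acc, student_acc)
comparable = within_tolerance(teacher_acc, student_acc)
```

The t-statistic would then be compared against a Student's t distribution with `dof` degrees of freedom to obtain a p-value.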
Original abstract
Automated leaf disease classification is critical for early disease detection in resource-constrained field environments. Vision Transformers (ViTs) provide strong representation capability by modeling long-range dependencies and inter-class relationships; however, their high computational cost makes them impractical for deployment on edge devices. As a result, existing approaches struggle to effectively transfer these rich representations to lightweight models. This paper introduces AgriKD, a cross-architecture knowledge distillation framework for efficient edge deployment, which transfers knowledge from a Vision Transformer (ViT) teacher to a compact convolutional student model. To bridge the representational gap between Transformer and CNN architectures, the proposed approach integrates multiple distillation objectives at the output, feature, and relational levels, where each objective captures a different aspect of the teacher knowledge. This enables the student model to better preserve and utilize transformer-derived global representations. Experiments on multiple leaf disease datasets show that the distilled student achieves performance comparable to the teacher while significantly improving efficiency, reducing model parameters by approximately 172 times, computational cost by 47.57 times, and inference latency by 18-22 times. Furthermore, the optimized model is deployed across multiple runtime formats, including ONNX, TFLite Float16, and TensorRT FP16, achieving consistent predictive performance with negligible accuracy degradation. Real-world deployment on NVIDIA Jetson edge devices and a mobile application demonstrates reliable real-time inference, highlighting the practicality of AgriKD for AI-powered agricultural applications in resource-constrained environments.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces AgriKD, a cross-architecture knowledge distillation framework that transfers knowledge from a Vision Transformer teacher to a compact CNN student for leaf disease classification. It combines distillation objectives at output, feature, and relational levels to bridge the representational gap between architectures, enabling the student to preserve global representations. Experiments on multiple leaf disease datasets report that the distilled student achieves performance comparable to the teacher while reducing parameters by approximately 172 times, computational cost by 47.57 times, and inference latency by 18-22 times. The model is further optimized and deployed in formats including ONNX, TFLite, and TensorRT, with real-time inference demonstrated on NVIDIA Jetson edge devices and a mobile application.
Significance. If the results hold under rigorous validation, AgriKD would offer a practical contribution to efficient CV deployment in agriculture by showing how multi-level distillation can compress high-capacity ViT models into lightweight CNNs without substantial accuracy loss. The reported efficiency multipliers and successful edge-device deployment add applied value for resource-constrained settings. However, the absence of detailed experimental protocols, baselines, and ablations currently limits the strength of this assessment.
major comments (3)
- [Abstract and Experiments] The central claim that the student achieves 'performance comparable to the teacher' is presented without any accuracy metrics, dataset names/sizes, class counts, train/test splits, or statistical tests, making it impossible to evaluate whether the efficiency gains come at an acceptable accuracy cost or to reproduce the results.
- [Section 3 (Proposed Method)] The integration of output, feature, and relational distillation is described at a high level, but the precise loss formulations, hyperparameter weights for each term, and the mechanism for computing relational knowledge across ViT-CNN architectures (e.g., how attention or correlation matrices are aligned) are not specified, leaving the claim that this bridges the representational gap untestable.
- [Experiments] No ablation studies isolating the contribution of each distillation level, no comparisons against standard KD baselines (e.g., vanilla KD, attention transfer) or other cross-architecture methods, and no error analysis or failure-case examination are provided. These are load-bearing for substantiating that the multi-level approach is responsible for the reported gains.
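The ablation requested in the last comment amounts to training the student under every on/off combination of the three distillation levels. A minimal sketch of enumerating those configurations is shown below; the level names mirror the report, while the dictionary layout is an assumption.

```python
from itertools import combinations

LEVELS = ("output", "feature", "relational")

def ablation_configs(levels=LEVELS):
    """Return one {level: enabled} dict per subset of distillation levels."""
    configs = []
    for r in range(len(levels) + 1):
        for subset in combinations(levels, r):
            configs.append({level: level in subset for level in levels})
    return configs

configs = ablation_configs()
for cfg in configs:
    enabled = [name for name, on in cfg.items() if on] or ["none (no distillation)"]
    print(" + ".join(enabled))
```

The first configuration, with every level disabled, serves as the undistilled student baseline, and the last is the full method.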
minor comments (2)
- [Abstract] The efficiency numbers are reported as approximate (e.g., 'approximately 172 times'); providing the exact teacher/student architectures, parameter counts, and FLOPs tables would improve precision and allow direct verification.
- [Experiments] The deployment results mention 'negligible accuracy degradation' across runtimes but supply no quantitative before/after numbers or latency measurements on the Jetson platform.
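One quantitative way to back a 'negligible accuracy degradation' claim is to report the fraction of inputs on which an exported runtime predicts the same class as the reference model. The sketch below assumes that metric and, as a stand-in for a real FP16 runtime, simply rounds reference logits to half precision with the standard library; an actual check would compare outputs produced by the exported ONNX, TFLite, or TensorRT model. All logits are invented.

```python
import struct

def to_fp16(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def argmax(values):
    """Index of the largest element."""
    return max(range(len(values)), key=values.__getitem__)

def agreement_rate(reference_logits, exported_logits):
    """Fraction of samples where both sets of logits select the same class."""
    matches = sum(argmax(r) == argmax(e)
                  for r, e in zip(reference_logits, exported_logits))
    return matches / len(reference_logits)

# Invented FP32 logits; the "exported" logits only simulate output rounding.
reference = [[2.0, 0.5, -1.0], [0.1, 3.2, 0.3]]
exported = [[to_fp16(z) for z in row] for row in reference]
rate = agreement_rate(reference, exported)
```

Rounding only the outputs understates real FP16 effects, since weights and intermediate activations are also reduced in precision on an actual runtime.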
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We have revised the manuscript to address each of the major comments by adding the requested details, formulations, and analyses. These changes improve the clarity, rigor, and reproducibility of the work without altering the core contributions.
Point-by-point responses
- Referee: [Abstract and Experiments] The central claim that the student achieves 'performance comparable to the teacher' is presented without any accuracy metrics, dataset names/sizes, class counts, train/test splits, or statistical tests, making it impossible to evaluate whether the efficiency gains come at an acceptable accuracy cost or to reproduce the results.
  Authors: We agree that the original abstract and experiments section lacked explicit numerical results and dataset specifications, which hinders evaluation and reproducibility. In the revised manuscript, we have updated the abstract to include key accuracy metrics for both teacher and student models and expanded the experiments section with full details on all dataset names, sizes, class counts, train/test splits, and statistical tests (including paired t-tests) confirming performance comparability.
  Revision: yes
- Referee: [Section 3 (Proposed Method)] The integration of output, feature, and relational distillation is described at a high level, but the precise loss formulations, hyperparameter weights for each term, and the mechanism for computing relational knowledge across ViT-CNN architectures (e.g., how attention or correlation matrices are aligned) are not specified, leaving the claim that this bridges the representational gap untestable.
  Authors: We acknowledge that the method section required more precise specifications. We have revised Section 3 to provide the exact mathematical formulations for the output (KL divergence), feature (MSE with alignment), and relational distillation losses, the specific hyperparameter weights for each term in the combined objective, and a detailed description of the cross-architecture relational alignment mechanism, including projection layers to match dimensions and correlation matrix computation adapted from ViT attention maps to CNN features.
  Revision: yes
- Referee: [Experiments] No ablation studies isolating the contribution of each distillation level, no comparisons against standard KD baselines (e.g., vanilla KD, attention transfer) or other cross-architecture methods, and no error analysis or failure-case examination are provided. These are load-bearing for substantiating that the multi-level approach is responsible for the reported gains.
  Authors: We recognize the value of ablations and baseline comparisons for validating the multi-level distillation. We have added a dedicated ablation study isolating the contribution of each distillation level (output, feature, relational). We have also included direct comparisons against vanilla KD, attention transfer, and other cross-architecture methods, along with an error analysis section examining failure cases and common misclassifications to substantiate the gains from the proposed approach.
  Revision: yes
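The responses mention a relational loss built on correlation matrices but give no formula on this page. As a hedged illustration of the general idea, the sketch below uses a different, commonly used relational form: it compares batch-wise cosine-similarity matrices of teacher and student embeddings. Because each matrix is batch-by-batch, teacher and student feature dimensions need not match. The function names and example vectors are hypothetical, and this is not claimed to be the authors' formulation.

```python
import math

def normalize(v):
    """Scale a vector to unit length (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]

def similarity_matrix(batch):
    """Pairwise cosine similarities between all samples in a batch."""
    unit = [normalize(v) for v in batch]
    return [[sum(a * b for a, b in zip(u, w)) for w in unit] for u in unit]

def relational_loss(teacher_batch, student_batch):
    """Mean squared difference between teacher and student similarity matrices."""
    g_t = similarity_matrix(teacher_batch)
    g_s = similarity_matrix(student_batch)
    n = len(g_t)
    return sum((g_t[i][j] - g_s[i][j]) ** 2
               for i in range(n) for j in range(n)) / (n * n)

# Hypothetical embeddings: 4-d teacher features, 2-d student features.
teacher_batch = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [1.0, 1.0, 0.0, 0.0]]
aligned_student = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
collapsed_student = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]

loss_aligned = relational_loss(teacher_batch, aligned_student)
loss_collapsed = relational_loss(teacher_batch, collapsed_student)
```

A student that preserves the teacher's pairwise structure incurs almost no loss, while one that maps every sample to the same direction is penalised.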
Circularity Check
No significant circularity
The paper describes an empirical cross-architecture knowledge distillation framework (AgriKD) that combines output, feature, and relational distillation objectives to transfer ViT knowledge to a compact CNN student. All load-bearing claims rest on direct experimental comparisons of accuracy, parameter count, FLOPs, and latency across leaf disease datasets, with deployment results on ONNX/TFLite/TensorRT and Jetson hardware. No equations, derivations, or fitted parameters are presented that reduce to self-defined quantities or self-citation chains; the method is validated by independent empirical metrics rather than by construction.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: Knowledge from a Vision Transformer can be transferred to a CNN via output, feature, and relational distillation objectives
- ad hoc to paper: The student model can preserve global representations from the teacher despite architectural differences
invented entities (1)
- AgriKD framework: no independent evidence