
arXiv: 2605.01355 · v2 · submitted 2026-05-02 · 💻 cs.CV · cs.AI

AgriKD: Cross-Architecture Knowledge Distillation for Efficient Leaf Disease Classification

Pith reviewed 2026-05-09 14:03 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords leaf disease classification · knowledge distillation · vision transformer · convolutional neural network · edge deployment · model compression · agricultural AI

The pith

AgriKD distills knowledge from a Vision Transformer into a compact CNN, achieving comparable leaf disease classification accuracy with 172 times fewer parameters and 18-22 times lower latency.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces AgriKD as a framework to transfer rich global representations from heavy Vision Transformers to lightweight convolutional networks for automated leaf disease detection. It combines distillation losses at output, feature, and relational levels to overcome the architectural differences between transformers and CNNs. A reader would care because this directly addresses the need for accurate AI tools that run on cheap hardware in remote agricultural settings. Experiments on multiple leaf disease datasets report that the student performs comparably to the teacher while reducing parameters by roughly 172 times, computational cost by 47.57 times, and inference latency by 18-22 times. The paper also reports deployment on NVIDIA Jetson edge devices and in a mobile application with negligible accuracy loss.

Core claim

AgriKD enables effective cross-architecture knowledge transfer by applying distillation at the output level for class predictions, the feature level for intermediate activations, and the relational level for inter-class relationships. This multi-objective setup allows the compact CNN student to retain the Vision Transformer's capacity for modeling long-range dependencies in leaf images, resulting in performance comparable to the teacher alongside large reductions in size and compute.
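The three-level objective described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's implementation: it assumes the output-level term is a temperature-softened KL divergence, as in standard knowledge distillation, and that the terms are combined by a weighted sum. The temperature of 4.0, the weights `w_out`, `w_feat`, `w_rel`, and all function names are hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def output_kd_loss(student_logits, teacher_logits, temperature=4.0):
    """KL(teacher || student) on softened distributions, scaled by T^2.

    The temperature value is an assumption chosen for illustration.
    """
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)
    return kl * temperature ** 2


def total_loss(ce, l_out, l_feat, l_rel, w_out=1.0, w_feat=1.0, w_rel=1.0):
    """Weighted sum of the supervised loss and three distillation terms.

    The weights are hypothetical; the summary does not state the paper's values.
    """
    return ce + w_out * l_out + w_feat * l_feat + w_rel * l_rel
```

When student and teacher logits agree, `output_kd_loss` is zero, so the output-level term only contributes a gradient where the student's class distribution departs from the teacher's.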

What carries the argument

The multi-level distillation framework of AgriKD, which integrates output, feature, and relational objectives to bridge representational gaps between ViT teachers and CNN students.
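One common way to express a relational objective is to compare the pairwise-distance structure of a batch in teacher space and student space. The sketch below follows that general idea and is an assumption about the form of the term, since this summary does not give the paper's exact relational loss. The function names and the squared-error comparison are hypothetical.

```python
import math


def pairwise_distances(embeddings):
    """Euclidean distance between every pair of samples in a batch."""
    n = len(embeddings)
    return [[math.dist(embeddings[i], embeddings[j]) for j in range(n)]
            for i in range(n)]


def normalize_by_mean(dist):
    """Divide by the mean off-diagonal distance so overall scale cancels.

    Requires at least two samples that are not all identical.
    """
    n = len(dist)
    off_diag = [dist[i][j] for i in range(n) for j in range(n) if i != j]
    mean = sum(off_diag) / len(off_diag)
    return [[d / mean for d in row] for row in dist]


def relational_loss(student_emb, teacher_emb):
    """Mean squared difference between normalized distance matrices.

    Only distances are compared, so the student and teacher embeddings
    may have different dimensionality.
    """
    s = normalize_by_mean(pairwise_distances(student_emb))
    t = normalize_by_mean(pairwise_distances(teacher_emb))
    n = len(s)
    return sum((s[i][j] - t[i][j]) ** 2
               for i in range(n) for j in range(n)) / (n * n)
```

Because only normalized distances enter the loss, no projection layer is needed for this term, which is one reason relational objectives are attractive when teacher and student architectures differ.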

If this is right

  • The student model matches the ViT teacher's accuracy on multiple leaf disease datasets while using far fewer resources.
  • Parameters drop by a factor of roughly 172, computational cost by a factor of 47.57, and inference latency by a factor of 18-22.
  • The optimized model preserves performance when exported to ONNX, TFLite Float16, and TensorRT FP16.
  • Real-time inference works reliably on NVIDIA Jetson edge hardware and mobile applications for field use.
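To make the reported multipliers concrete, the arithmetic can be sketched as below. The teacher size of 86.0 million parameters is a hypothetical value chosen only to illustrate the calculation; this summary does not state the paper's actual teacher or student sizes.

```python
def reduction_factor(teacher_value, student_value):
    """How many times smaller the student quantity is than the teacher's."""
    if student_value <= 0:
        raise ValueError("student_value must be positive")
    return teacher_value / student_value


def implied_student_value(teacher_value, factor):
    """Student quantity implied by a teacher quantity and a reduction factor."""
    if factor <= 0:
        raise ValueError("factor must be positive")
    return teacher_value / factor


# Hypothetical teacher size in millions of parameters; not a figure from the paper.
hypothetical_teacher_params_m = 86.0
student_params_m = implied_student_value(hypothetical_teacher_params_m, 172.0)
```

Under that assumed teacher size, a 172-fold reduction would imply a student of about half a million parameters.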

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same multi-level distillation could be tested on other vision tasks where transformers capture context that CNNs miss, such as weed identification in drone imagery.
  • Tuning the relative weights of the three distillation losses per dataset might yield even better accuracy-efficiency balances on specific edge hardware.
  • Applying the method to time-series or multi-spectral plant images could extend its value to monitoring disease spread over time.

Load-bearing premise

That combining distillation at output, feature, and relational levels can transfer the transformer's global modeling ability to a CNN without major accuracy loss.
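For the feature-level part of this premise, student and teacher features must first be brought into a common space. The sketch below assumes a plain linear projection followed by mean squared error. It is not the paper's PCA or GWL projector, whose details are not given in this summary. The weight matrix and function names are hypothetical.

```python
def project(features, weight):
    """Map a student feature vector into the teacher's dimension: y = W x."""
    return [sum(w * x for w, x in zip(row, features)) for row in weight]


def feature_mse(student_feat, teacher_feat, weight):
    """Mean squared error between projected student and teacher features."""
    projected = project(student_feat, weight)
    if len(projected) != len(teacher_feat):
        raise ValueError("projection output must match teacher dimension")
    return sum((p - t) ** 2
               for p, t in zip(projected, teacher_feat)) / len(teacher_feat)
```

In practice the projection weights would be learned jointly with the student and discarded at inference time, so they add no deployment cost.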

What would settle it

Evaluating the distilled CNN on a new leaf disease dataset and finding accuracy more than a few percentage points below the ViT teacher would challenge the comparable-performance claim. The efficiency metrics, by contrast, can be measured directly.
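Such a check reduces to comparing two accuracy figures on the same held-out set. A minimal sketch follows; the 3.0 percentage-point tolerance is a hypothetical reading of "a few percentage points", not a threshold from the paper.

```python
def accuracy_gap(teacher_correct, student_correct, total):
    """Teacher accuracy minus student accuracy, in percentage points."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * (teacher_correct - student_correct) / total


def claim_is_challenged(teacher_correct, student_correct, total,
                        tolerance_pp=3.0):
    """True if the student trails the teacher by more than the tolerance.

    tolerance_pp is an illustrative assumption.
    """
    return accuracy_gap(teacher_correct, student_correct, total) > tolerance_pp
```

A single gap on one dataset is only suggestive; repeating the comparison over several splits or seeds would show whether the difference is stable.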

Figures

Figures reproduced from arXiv: 2605.01355 by Hoang-Vu Truong, Minh-Duc Hoang, Minh-Dung Le, Thi-Thu-Hong Phan.

Figure 2: Illustration of the relation-based distillation component. The loss preserves …
Figure 3: Projection 1: Partially Cross-Attention (PCA) projector for aligning student …
Figure 4: Projection 2: Group-wise Linear (GWL) projector for matching grouped student …
Figure 5: Representative samples from each class in the Tomato Leaf Disease dataset.
Figure 7: Burmese Grape Leaf Disease dataset overview.
Figure 8: Representative samples from each class in the Potato Leaf Disease dataset.
Figure 9: Accuracy comparison between teacher and distilled student across the three …
Figure 10: Macro F1-score comparison between teacher and distilled student across the …
Figure 11: Grad-CAM on Burmese Grape dataset.
Figure 12: Grad-CAM on Tomato dataset.
Figure 13: Grad-CAM on Potato dataset. The Grad-CAM results consistently show that the distilled student produces more focused and coherent attention than both the baseline CNN and the ViT teacher. This indicates that multi-level distillation improves not only predictive performance but also the quality and reliability of learned representations.
Figure 14: Performance comparison of existing approaches on the Potato Leaf Disease …
Figure 15: Inference on NVIDIA Jetson Orin Nano using TensorRT FP16: Potato datasets.
Figure 16: Mobile application using TFLite Float16: application menu (left), Tomato …
Original abstract

Automated leaf disease classification is critical for early disease detection in resource-constrained field environments. Vision Transformers (ViTs) provide strong representation capability by modeling long-range dependencies and inter-class relationships; however, their high computational cost makes them impractical for deployment on edge devices. As a result, existing approaches struggle to effectively transfer these rich representations to lightweight models. This paper introduces AgriKD, a cross-architecture knowledge distillation framework for efficient edge deployment, which transfers knowledge from a Vision Transformer (ViT) teacher to a compact convolutional student model. To bridge the representational gap between Transformer and CNN architectures, the proposed approach integrates multiple distillation objectives at the output, feature, and relational levels, where each objective captures a different aspect of the teacher knowledge. This enables the student model to better preserve and utilize transformer-derived global representations. Experiments on multiple leaf disease datasets show that the distilled student achieves performance comparable to the teacher while significantly improving efficiency, reducing model parameters by approximately 172 times, computational cost by 47.57 times, and inference latency by 18-22 times. Furthermore, the optimized model is deployed across multiple runtime formats, including ONNX, TFLite Float16, and TensorRT FP16, achieving consistent predictive performance with negligible accuracy degradation. Real-world deployment on NVIDIA Jetson edge devices and a mobile application demonstrates reliable real-time inference, highlighting the practicality of AgriKD for AI-powered agricultural applications in resource-constrained environments.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces AgriKD, a cross-architecture knowledge distillation framework that transfers knowledge from a Vision Transformer teacher to a compact CNN student for leaf disease classification. It combines distillation objectives at output, feature, and relational levels to bridge the representational gap between architectures, enabling the student to preserve global representations. Experiments on multiple leaf disease datasets report that the distilled student achieves performance comparable to the teacher while reducing parameters by approximately 172 times, computational cost by 47.57 times, and inference latency by 18-22 times. The model is further optimized and deployed in formats including ONNX, TFLite, and TensorRT, with real-time inference demonstrated on NVIDIA Jetson edge devices and a mobile application.

Significance. If the results hold under rigorous validation, AgriKD would offer a practical contribution to efficient CV deployment in agriculture by showing how multi-level distillation can compress high-capacity ViT models into lightweight CNNs without substantial accuracy loss. The reported efficiency multipliers and successful edge-device deployment add applied value for resource-constrained settings. However, the absence of detailed experimental protocols, baselines, and ablations currently limits the strength of this assessment.

major comments (3)
  1. [Abstract and Experiments] Abstract and Experiments section: the central claim that the student achieves 'performance comparable to the teacher' is presented without any accuracy metrics, dataset names/sizes, class counts, train/test splits, or statistical tests, making it impossible to evaluate whether the efficiency gains come at an acceptable accuracy cost or to reproduce the results.
  2. [Section 3 (Proposed Method)] Section 3 (Proposed Method): the integration of output, feature, and relational distillation is described at a high level, but the precise loss formulations, hyperparameter weights for each term, and the mechanism for computing relational knowledge across ViT-CNN architectures (e.g., how attention or correlation matrices are aligned) are not specified, leaving the claim that this bridges the representational gap untestable.
  3. [Experiments] Experiments section: no ablation studies isolating the contribution of each distillation level, no comparisons against standard KD baselines (e.g., vanilla KD, attention transfer) or other cross-architecture methods, and no error analysis or failure-case examination are provided, which are load-bearing for substantiating that the multi-level approach is responsible for the reported gains.
minor comments (2)
  1. [Abstract] The efficiency numbers are reported as approximate (e.g., 'approximately 172 times'); providing the exact teacher/student architectures, parameter counts, and FLOPs tables would improve precision and allow direct verification.
  2. [Experiments] The deployment results mention 'negligible accuracy degradation' across runtimes but supply no quantitative before/after numbers or latency measurements on the Jetson platform.
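The before/after numbers requested in the last minor comment could be produced with a simple agreement check between a reference runtime and an exported one. The sketch below assumes each runtime returns one list of class scores per image; the function names are hypothetical and no particular runtime API is implied.

```python
def argmax(scores):
    """Index of the largest score (first index on ties)."""
    return max(range(len(scores)), key=scores.__getitem__)


def top1_agreement(reference_outputs, exported_outputs):
    """Fraction of samples where both runtimes predict the same class."""
    if len(reference_outputs) != len(exported_outputs):
        raise ValueError("both runtimes must score the same samples")
    same = sum(argmax(r) == argmax(e)
               for r, e in zip(reference_outputs, exported_outputs))
    return same / len(reference_outputs)


def max_abs_diff(reference_outputs, exported_outputs):
    """Largest absolute per-score deviation between the two runtimes."""
    return max(abs(r - e)
               for ref, exp in zip(reference_outputs, exported_outputs)
               for r, e in zip(ref, exp))
```

Reporting both numbers is useful because half-precision exports can shift scores slightly without changing any predicted class.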

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We have revised the manuscript to address each of the major comments by adding the requested details, formulations, and analyses. These changes improve the clarity, rigor, and reproducibility of the work without altering the core contributions.

Point-by-point responses
  1. Referee: [Abstract and Experiments] Abstract and Experiments section: the central claim that the student achieves 'performance comparable to the teacher' is presented without any accuracy metrics, dataset names/sizes, class counts, train/test splits, or statistical tests, making it impossible to evaluate whether the efficiency gains come at an acceptable accuracy cost or to reproduce the results.

    Authors: We agree that the original abstract and experiments section lacked explicit numerical results and dataset specifications, which hinders evaluation and reproducibility. In the revised manuscript, we have updated the abstract to include key accuracy metrics for both teacher and student models and expanded the experiments section with full details on all dataset names, sizes, class counts, train/test splits, and statistical tests (including paired t-tests) confirming performance comparability.
    Revision: yes

  2. Referee: [Section 3 (Proposed Method)] Section 3 (Proposed Method): the integration of output, feature, and relational distillation is described at a high level, but the precise loss formulations, hyperparameter weights for each term, and the mechanism for computing relational knowledge across ViT-CNN architectures (e.g., how attention or correlation matrices are aligned) are not specified, leaving the claim that this bridges the representational gap untestable.

    Authors: We acknowledge that the method section required more precise specifications. We have revised Section 3 to provide the exact mathematical formulations for the output (KL divergence), feature (MSE with alignment), and relational distillation losses, the specific hyperparameter weights for each term in the combined objective, and a detailed description of the cross-architecture relational alignment mechanism, including projection layers to match dimensions and correlation matrix computation adapted from ViT attention maps to CNN features.
    Revision: yes

  3. Referee: [Experiments] Experiments section: no ablation studies isolating the contribution of each distillation level, no comparisons against standard KD baselines (e.g., vanilla KD, attention transfer) or other cross-architecture methods, and no error analysis or failure-case examination are provided, which are load-bearing for substantiating that the multi-level approach is responsible for the reported gains.

    Authors: We recognize the value of ablations and baseline comparisons for validating the multi-level distillation. We have added a dedicated ablation study isolating the contribution of each distillation level (output, feature, relational). We have also included direct comparisons against vanilla KD, attention transfer, and other cross-architecture methods, along with an error analysis section examining failure cases and common misclassifications to substantiate the gains from the proposed approach.
    Revision: yes
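The paired t-tests mentioned in the first response compare teacher and student scores measured on the same evaluation splits. A minimal sketch of the test statistic is given below, assuming one accuracy value per split for each model. It returns the statistic and degrees of freedom only; converting them to a p-value requires a t-distribution table or a statistics library.

```python
import math


def paired_t_statistic(scores_a, scores_b):
    """Paired t statistic and degrees of freedom for matched score lists."""
    if len(scores_a) != len(scores_b):
        raise ValueError("score lists must have the same length")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    if n < 2:
        raise ValueError("need at least two paired scores")
    mean = sum(diffs) / n
    # Sample variance of the differences (n - 1 in the denominator).
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)
    if var == 0.0:
        raise ValueError("differences have zero variance; t is undefined")
    return mean / math.sqrt(var / n), n - 1
```

A statistic close to zero is consistent with comparable performance, although failing to find a difference with only a few splits is weak evidence of equivalence.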

Circularity Check

0 steps flagged

No significant circularity

Full rationale

The paper describes an empirical cross-architecture knowledge distillation framework (AgriKD) that combines output, feature, and relational distillation objectives to transfer ViT knowledge to a compact CNN student. All load-bearing claims rest on direct experimental comparisons of accuracy, parameter count, FLOPs, and latency across leaf disease datasets, with deployment results on ONNX/TFLite/TensorRT and Jetson hardware. No equations, derivations, or fitted parameters are presented that reduce to self-defined quantities or self-citation chains; the method is validated by independent empirical metrics rather than by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

The work rests on standard assumptions of knowledge distillation and introduces a new named framework without new mathematical entities or free parameters beyond typical training hyperparameters.

axioms (2)
  • [domain assumption] Knowledge from a Vision Transformer can be transferred to a CNN via output, feature, and relational distillation objectives.
    Central premise invoked to justify the multi-objective approach in the abstract.
  • [ad hoc to paper] The student model can preserve global representations from the teacher despite architectural differences.
    Assumption required for the claim that the distilled student matches teacher performance.
invented entities (1)
  • AgriKD framework (no independent evidence)
    Purpose: cross-architecture knowledge distillation for efficient leaf disease classification.
    Newly proposed method combining existing KD techniques.

pith-pipeline@v0.9.0 · 5571 in / 1310 out tokens · 49203 ms · 2026-05-09T14:03:52.489157+00:00 · methodology

