FruitEnsemble: MLLM-Guided Arbitration for Heterogeneous ensemble in Fine-Grained Fruit Recognition

Enhui Yu; Jialu Li; Junhui Li; Ruitong Lu; Youshan Zhang

arxiv: 2605.20892 · v1 · pith:WTHIJB6Xnew · submitted 2026-05-20 · 💻 cs.CV

Pith reviewed 2026-05-21 05:57 UTC · model grok-4.3

classification 💻 cs.CV

keywords: fine-grained fruit classification · heterogeneous ensemble · MLLM arbitration · chain-of-thought reasoning · agricultural computer vision · hard sample loss · fruit dataset

The pith

A two-stage ensemble with MLLM arbitration reaches 70.49 percent accuracy on 306 fruit categories.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper first builds a dataset of 306 fruit categories containing 116,233 images to overcome shortages of high-quality data and the problem of high visual similarity between classes. It then proposes FruitEnsemble, which runs a weighted ensemble of different neural network backbones to produce a reliable set of top-three candidates. When the ensemble confidence drops below 0.6, the system activates a multimodal large language model that checks the image against external botanical descriptions using step-by-step reasoning. The training also uses a loss that pays extra attention to hard samples. A sympathetic reader would care because accurate fine-grained recognition supports practical tasks such as automated sorting and quality inspection in agriculture.

Core claim

FruitEnsemble is a practical two-stage dynamic inference framework. In the first stage, it employs a validation-calibrated weighted ensemble of heterogeneous backbones to generate a robust Top-3 candidate pool. To tackle difficult samples, an expert arbitration mechanism triggers a multimodal large language model (MLLM) when ensemble confidence falls below 0.6; the MLLM performs visual verification by integrating external botanical descriptions using Chain-of-Thought (CoT) reasoning. Training is optimized with a hard sample-aware joint loss. The paper reports 70.49 percent classification accuracy on the 306-category fruit dataset, which it states outperforms existing state-of-the-art models.
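The two-stage control flow described here can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`fuse`, `top_k`, `predict`) are hypothetical, "ensemble confidence" is assumed to mean the top-1 fused probability, and the MLLM is represented by a plain callback.

```python
def fuse(probs_per_model, weights):
    """Weighted average of per-model class probabilities (weights are normalised here)."""
    total = sum(weights)
    n_classes = len(probs_per_model[0])
    return [
        sum(w * p[c] for w, p in zip(weights, probs_per_model)) / total
        for c in range(n_classes)
    ]


def top_k(fused, k=3):
    """Indices of the k most probable classes, best first."""
    return sorted(range(len(fused)), key=lambda c: fused[c], reverse=True)[:k]


def predict(probs_per_model, weights, arbitrate, threshold=0.6):
    """Return (class_index, arbitration_was_triggered)."""
    fused = fuse(probs_per_model, weights)
    candidates = top_k(fused, 3)
    # Assumption: "ensemble confidence" is the top-1 fused probability.
    confidence = fused[candidates[0]]
    if confidence < threshold:
        # Stage 2: hand the Top-3 pool to the arbitrator (an MLLM in the paper).
        return arbitrate(candidates), True
    return candidates[0], False


# Toy inputs: three models, four classes, made-up probabilities.
weights = [0.2, 0.3, 0.5]
confident = [[0.90, 0.05, 0.03, 0.02]] * 3
uncertain = [[0.40, 0.35, 0.15, 0.10]] * 3
pick_second = lambda candidates: candidates[1]  # stand-in for the MLLM

print(predict(confident, weights, pick_second))  # (0, False)
print(predict(uncertain, weights, pick_second))  # (1, True)
```

Passing the arbitrator as a callback keeps the expensive model out of the fast path, so the ensemble alone handles every sample whose confidence clears the threshold.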

What carries the argument

The expert arbitration mechanism that triggers an MLLM for visual verification using botanical descriptions and CoT reasoning only when ensemble confidence falls below 0.6.
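One way to picture the arbitration input is a prompt that lists the Top-3 candidates next to their botanical descriptions and asks for step-by-step comparison. The sketch below is only an assumption about how such a prompt might be assembled: the wording, the function names `build_arbitration_prompt` and `parse_choice`, and the placeholder descriptions are all hypothetical and do not come from the paper.

```python
def build_arbitration_prompt(candidates, descriptions):
    """Assemble a chain-of-thought style prompt.

    candidates: list of (class_name, ensemble_probability), best first.
    descriptions: dict mapping class_name to a botanical description.
    """
    lines = [
        "You are verifying a fine-grained fruit classification.",
        "Candidate varieties from the ensemble (most likely first):",
    ]
    for rank, (name, prob) in enumerate(candidates, start=1):
        desc = descriptions.get(name, "no description available")
        lines.append(f"{rank}. {name} (ensemble probability {prob:.2f}): {desc}")
    lines += [
        "Reason step by step:",
        "Step 1. Describe the visible traits of the fruit in the image.",
        "Step 2. Compare those traits with each candidate description.",
        "Step 3. Answer with exactly one candidate name on the final line.",
    ]
    return "\n".join(lines)


def parse_choice(reply, candidates):
    """Return the candidate named on the last non-empty line, else the ensemble top-1."""
    names = [name for name, _ in candidates]
    non_empty = [ln.strip() for ln in reply.splitlines() if ln.strip()]
    if non_empty:
        for name in names:
            if name.lower() in non_empty[-1].lower():
                return name
    return names[0]  # fall back to the ensemble when the reply is unusable


# Placeholder candidates and descriptions, for illustration only.
cands = [("Fuji", 0.41), ("Gala", 0.38), ("Candidate C", 0.11)]
descs = {"Fuji": "placeholder description A", "Gala": "placeholder description B"}
print(build_arbitration_prompt(cands, descs))
```

Falling back to the ensemble's top-1 when the reply cannot be parsed means a malformed MLLM answer never leaves a sample unclassified.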

If this is right

The framework supplies an efficient deployment-oriented solution for real-world agricultural visual sorting and quality inspection tasks.
It overcomes generalization limits of static single-model architectures on fine-grained problems.
The hard sample-aware joint loss improves performance on challenging examples during training.
Overall accuracy reaches 70.49 percent and exceeds previous methods on the constructed 306-category dataset.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The selective use of the MLLM only on uncertain cases could serve as a pattern for balancing accuracy gains against extra computation in other vision tasks.
Adding external textual knowledge may help classification systems in domains where labeled images are scarce but descriptive information exists.
The two-stage design offers a template for hybrid systems that combine traditional ensembles with language-model verification.

Load-bearing premise

The MLLM arbitration step correctly resolves the difficult samples that the ensemble cannot handle by using external botanical descriptions.

What would settle it

A test showing that accuracy on low-confidence samples stays the same or drops after MLLM arbitration is applied, or that overall accuracy fails to exceed prior state-of-the-art results.

Figures

Figures reproduced from arXiv: 2605.20892 by Enhui Yu, Jialu Li, Junhui Li, Ruitong Lu, Youshan Zhang.

**Figure 1.** Accuracy vs. inference efficiency on the Fruit-306 test set. Single CNNs (△) are efficient but limited in accuracy. The LLM-only baseline (⋄) is slow and performs poorly without domain-specific training. Static ensembles (•) improve accuracy but remain sub-optimal. FruitEnsemble (⋆) achieves the best trade-off by invoking the LLM only for uncertain samples.

**Figure 2.** Comprehensive Statistical Analysis of Fruit-306. (a) Histogram of class sizes showing the frequency distribution of samples per category. (b) Cumulative distribution curve indicating that the top 20% of classes account for a significant majority of the total images. (c) Top 20 most frequent classes (Head), dominated by common commercial varieties. (d) Bottom 20 least frequent classes (Tail), highlighting t…

**Figure 3.** Visual Challenges in Fruit-306. Top Row: High inter-class similarity. Examples of visually similar pairs (e.g., Fuji vs. Gala) where discrimination relies on subtle textual attributes (e.g., lenticel density). Bottom Row: High intra-class variance. The same variety shown under different lighting, occlusion, and ripeness conditions. These complexities necessitate dynamic reasoning beyond static classificat…

**Figure 4.** Overview of the FruitEnsemble Framework. Input fruit images are first processed by a curated set of four heterogeneous backbone models (M1-M4: ResNet50, DenseNet201, EfficientNetB7, ViT-B/16) with architecture-adaptive training strategies. Their output probabilities are fused via an Uncertainty-Aware Weighted Aggregation module (A) to generate top-k candidate predictions and a confidence gap metric (∆ = p1…

**Figure 5.** Progressive performance improvement of FruitEnsemble. Starting from individual backbones, heterogeneous ensemble aggregation, test-time augmentation, and LLM arbitration progressively improve accuracy from 65.5% to 70.5%.

**Figure 6.** Effect of LLM trigger thresholds on overall accuracy. The relatively flat curve suggests that the base ensemble model has high confidence in most samples, and LLM intervention primarily benefits a small subset of difficult cases. This analysis helps determine the optimal threshold for deploying LLM arbitration in practice, trading off between accuracy gains and computational overhead.

Original abstract

Fine-grained fruit classification is a critical yet challenging task in agricultural computer vision, primarily hindered by a severe shortage of high-quality datasets and the high visual similarity between classes. To address these challenges, we first constructed a comprehensive dataset comprising 306 fruit categories with 116,233 samples. Moreover, we propose FruitEnsemble, a practical two-stage dynamic inference framework designed to overcome the generalization limitations of static single-model architectures. In the first stage, FruitEnsemble employs a validation-calibrated weighted ensemble of heterogeneous backbones to generate a robust Top-3 candidate pool. To tackle difficult samples, we introduce an expert arbitration mechanism: when ensemble confidence falls below 0.6, a multimodal large language model (MLLM) is triggered to perform rigorous visual verification by integrating external botanical descriptions using Chain-of-Thought (CoT) reasoning. Furthermore, we optimized the training pipeline with a hard sample-aware joint loss. Extensive experiments demonstrate that FruitEnsemble achieves a classification accuracy of 70.49% and outperforms existing state-of-the-art models. Our framework provides an efficient, deployment-oriented solution for real-world agricultural visual sorting and quality inspection tasks.
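The abstract does not say how the ensemble weights are "validation-calibrated". One common option, shown here purely as an assumption, is a temperature-scaled softmax over each backbone's validation accuracy, so that stronger backbones receive larger weights. The function name `calibrate_weights`, the temperature value, and the accuracies in the example are all illustrative and are not taken from the paper.

```python
import math


def calibrate_weights(val_accuracies, temperature=0.05):
    """Turn validation accuracies into ensemble weights that sum to 1.

    Assumption: a softmax over accuracy / temperature. A smaller temperature
    concentrates more weight on the best backbone.
    """
    best = max(val_accuracies)
    # Subtracting the maximum keeps the exponentials numerically stable.
    scores = [math.exp((a - best) / temperature) for a in val_accuracies]
    total = sum(scores)
    return [s / total for s in scores]


# Made-up validation accuracies for four backbones.
weights = calibrate_weights([0.60, 0.62, 0.65, 0.64])
print([round(w, 3) for w in weights])
```

Because only accuracy differences enter the softmax, backbones with equal validation accuracy always receive equal weight.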

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

The paper offers a new 306-class fruit dataset and a practical two-stage ensemble-plus-MLLM pipeline, but the experiments do not isolate whether the arbitration step actually improves results on hard samples.

This paper introduces a dataset of 306 fruit categories with 116,233 images and a two-stage inference setup. A weighted ensemble of heterogeneous backbones produces a top-3 pool for most cases. When confidence falls below 0.6, an MLLM is called to verify using external botanical descriptions and chain-of-thought reasoning. A hard-sample loss is added during training. The abstract reports 70.49% accuracy and outperformance over existing models.

Referee Report

2 major / 2 minor

Summary. The manuscript constructs a 306-category fruit dataset (116,233 samples) and proposes FruitEnsemble, a two-stage dynamic inference framework. Stage 1 uses a validation-calibrated weighted ensemble of heterogeneous backbones to produce a top-3 candidate pool. When ensemble confidence falls below 0.6, an MLLM performs arbitration via external botanical descriptions and Chain-of-Thought reasoning. Training incorporates a hard-sample-aware joint loss. The paper reports 70.49% classification accuracy and claims outperformance over existing SOTA models on this dataset.

Significance. If the reported gains are shown to stem from the MLLM arbitration rather than the new dataset or ensemble weighting alone, the work could provide a practical, deployment-oriented solution for fine-grained agricultural vision tasks where visual similarity is high. The large-scale fruit dataset addresses a documented scarcity in the domain. Credit is due for the reproducible two-stage design and focus on real-world sorting applications, though the absence of isolating experiments limits the assessed novelty of the arbitration step.

major comments (2)

[Abstract and §4, Experimental Results] The central claim of 70.49% accuracy and SOTA outperformance is presented without any ablation that isolates the MLLM arbitration component. No comparison of ensemble-only accuracy versus the full FruitEnsemble, no count of samples triggering the <0.6 threshold, and no error analysis showing which classes or ambiguities the botanical CoT resolves are supplied. This directly affects attribution of the performance gain to the novel arbitration mechanism.
[§3.2, Arbitration Mechanism] The assumption that the MLLM step correctly resolves difficult samples rests on external botanical descriptions, yet no quantitative validation (e.g., accuracy lift on the low-confidence subset or failure cases of the MLLM) is reported. Without this, the two-stage design's load-bearing contribution cannot be verified.
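The quantities requested in these two comments are cheap to compute once per-sample predictions are logged. The sketch below shows one possible bookkeeping routine; the record field names (`confidence`, `ensemble_pred`, `final_pred`, `label`) and the toy data are hypothetical, and nothing here reproduces the paper's evaluation code.

```python
from collections import Counter


def ablation_report(records, threshold=0.6):
    """Summarise what arbitration changed on a non-empty list of logged predictions."""
    if not records:
        raise ValueError("records must not be empty")
    n = len(records)
    triggered = [r for r in records if r["confidence"] < threshold]
    rescued, harmed = Counter(), Counter()
    for r in triggered:
        was_right = r["ensemble_pred"] == r["label"]
        is_right = r["final_pred"] == r["label"]
        if is_right and not was_right:
            rescued[r["label"]] += 1  # arbitration fixed an ensemble error
        elif was_right and not is_right:
            harmed[r["label"]] += 1   # arbitration broke a correct answer
    k = len(triggered)

    def acc(rows, key):
        return sum(r[key] == r["label"] for r in rows) / len(rows) if rows else None

    return {
        "ensemble_only_acc": acc(records, "ensemble_pred"),
        "full_acc": acc(records, "final_pred"),
        "n_triggered": k,
        "trigger_rate": k / n,
        "low_conf_acc_before": acc(triggered, "ensemble_pred"),
        "low_conf_acc_after": acc(triggered, "final_pred"),
        "rescued_per_class": dict(rescued),
        "harmed_per_class": dict(harmed),
    }


# Toy log with made-up values: one confident sample, three below the threshold.
toy = [
    {"confidence": 0.90, "ensemble_pred": "A", "final_pred": "A", "label": "A"},
    {"confidence": 0.55, "ensemble_pred": "B", "final_pred": "A", "label": "A"},
    {"confidence": 0.50, "ensemble_pred": "A", "final_pred": "B", "label": "A"},
    {"confidence": 0.30, "ensemble_pred": "C", "final_pred": "B", "label": "B"},
]
report = ablation_report(toy)
print(report)
```

Reporting rescued and harmed counts separately matters because a net accuracy gain can hide cases where arbitration overturns a correct ensemble prediction.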

minor comments (2)

[§3.1] The hard-sample-aware joint loss is mentioned but its exact formulation (weighting scheme, interaction with the ensemble loss) is not given in sufficient detail for reproduction.
[§2] Dataset construction details (collection protocol, annotation process, train/val/test splits) should be expanded to allow independent verification of the 306-class benchmark.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and will revise the manuscript to strengthen the attribution of results to the MLLM arbitration component.

Referee: [Abstract and §4, Experimental Results] The central claim of 70.49% accuracy and SOTA outperformance is presented without any ablation that isolates the MLLM arbitration component. No comparison of ensemble-only accuracy versus the full FruitEnsemble, no count of samples triggering the <0.6 threshold, and no error analysis showing which classes or ambiguities the botanical CoT resolves are supplied. This directly affects attribution of the performance gain to the novel arbitration mechanism.

Authors: We acknowledge that the current manuscript lacks explicit ablations isolating the MLLM arbitration. In the revised version we will add a dedicated subsection in §4 that reports (i) top-1 accuracy of the validation-calibrated weighted ensemble alone, (ii) the number and percentage of test samples whose ensemble confidence fell below 0.6 and therefore triggered MLLM arbitration, and (iii) a per-class error analysis highlighting the visual ambiguities resolved by the botanical CoT step. These additions will allow readers to directly attribute performance gains to the arbitration mechanism while preserving the overall 70.49% result and SOTA comparison. (Revision: yes)

Referee: [§3.2, Arbitration Mechanism] The assumption that the MLLM step correctly resolves difficult samples rests on external botanical descriptions, yet no quantitative validation (e.g., accuracy lift on the low-confidence subset or failure cases of the MLLM) is reported. Without this, the two-stage design's load-bearing contribution cannot be verified.

Authors: We agree that quantitative evidence for the MLLM step is necessary. We will augment §3.2 and the experimental results with (a) accuracy on the low-confidence subset before versus after MLLM arbitration and (b) representative failure cases where the MLLM did not resolve the ambiguity. These metrics will be computed on the same held-out test split used for the main results, thereby verifying the contribution of the two-stage design without altering the reported overall accuracy. (Revision: yes)

Circularity Check

0 steps flagged

No circularity: empirical framework reports experimental accuracy without derivations or self-referential reductions

The paper describes a two-stage empirical method (weighted heterogeneous ensemble plus conditional MLLM arbitration on a new 306-class dataset) and reports a measured accuracy of 70.49%. No equations, first-principles derivations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the provided text. The central performance claim rests on direct experimental evaluation rather than any chain that reduces by construction to its own inputs. This is the expected non-finding for an applied ML systems paper whose results are externally falsifiable via replication on the stated dataset.

Axiom & Free-Parameter Ledger

1 free parameter · 0 axioms · 0 invented entities

The framework rests on standard supervised learning assumptions plus one explicit threshold; no new physical entities or unproven mathematical axioms are introduced.

free parameters (1)

ensemble confidence threshold
The value 0.6 is used to decide when to invoke the MLLM arbitrator; its selection is not derived from first principles in the abstract.
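Since the threshold is a free parameter, its sensitivity can be checked offline by replaying cached predictions at several candidate values. The sketch below assumes the MLLM's answer has been cached for every sample (field `arbitrated_pred`), which is a hypothetical setup and not something the paper describes; the toy records are made up.

```python
def sweep_thresholds(records, thresholds):
    """For each threshold return (threshold, mllm_call_rate, accuracy).

    A sample is routed to the cached MLLM answer when its ensemble
    confidence is strictly below the threshold, otherwise the ensemble
    prediction is kept.
    """
    rows = []
    n = len(records)
    for t in thresholds:
        calls = 0
        correct = 0
        for r in records:
            if r["confidence"] < t:
                calls += 1
                pred = r["arbitrated_pred"]
            else:
                pred = r["ensemble_pred"]
            correct += pred == r["label"]
        rows.append((t, calls / n, correct / n))
    return rows


# Toy cache with made-up values.
cache = [
    {"confidence": 0.90, "ensemble_pred": "A", "arbitrated_pred": "A", "label": "A"},
    {"confidence": 0.55, "ensemble_pred": "B", "arbitrated_pred": "A", "label": "A"},
    {"confidence": 0.50, "ensemble_pred": "A", "arbitrated_pred": "B", "label": "A"},
    {"confidence": 0.30, "ensemble_pred": "C", "arbitrated_pred": "B", "label": "B"},
]
for row in sweep_thresholds(cache, [0.0, 0.4, 0.6]):
    print(row)
# (0.0, 0.0, 0.5)
# (0.4, 0.25, 0.75)
# (0.6, 0.75, 0.75)
```

In this toy cache, raising the threshold from 0.4 to 0.6 triples the call rate without improving accuracy, which is the kind of trade-off a sweep is meant to expose.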

pith-pipeline@v0.9.0 · 5745 in / 1192 out tokens · 30997 ms · 2026-05-21T05:57:06.650256+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

FruitEnsemble employs a validation-calibrated weighted ensemble of heterogeneous backbones to generate a robust Top-3 candidate pool... when ensemble confidence falls below 0.6, a multimodal large language model (MLLM) is triggered...
IndisputableMonolith/Foundation/BranchSelection.lean branch_selection unclear

Hard Sample-Aware Joint Optimization... $\mathcal{L}_{\text{hard\_div}} = \frac{1}{|B_{\text{hard}}|} \sum -\sum \mathrm{JS}\big(P_i(x) \,\|\, P_j(x)\big)$
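The quoted term appears to average, over a batch of hard samples, the negative sum of Jensen-Shannon (JS) divergences between the backbones' predictive distributions, so minimising it would reward disagreement on hard samples. The sketch below computes such a term under explicit assumptions: the summation ranges are not visible in the quoted passage, so the outer sum is taken over samples in the hard batch and the inner sum over unordered model pairs, and natural logarithms are used. The function names are hypothetical.

```python
import math
from itertools import combinations


def kl(p, q):
    """Kullback-Leibler divergence KL(p || q) with natural log; terms with p_i = 0 contribute 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js(p, q):
    """Jensen-Shannon divergence between two discrete distributions."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def hard_diversity_loss(hard_batch):
    """Mean over hard samples of the negative pairwise JS divergence.

    hard_batch: list of samples, each a list of per-model class distributions.
    Assumption: the inner sum runs over unordered model pairs (i < j).
    """
    if not hard_batch:
        return 0.0
    total = 0.0
    for dists in hard_batch:
        total += -sum(js(p, q) for p, q in combinations(dists, 2))
    return total / len(hard_batch)


# Two models that fully disagree on a single hard sample.
print(hard_diversity_loss([[[1.0, 0.0], [0.0, 1.0]]]))  # -ln(2), about -0.693
# Two models that agree exactly: no diversity, loss is 0.
print(hard_diversity_loss([[[0.7, 0.3], [0.7, 0.3]]]))
```

With natural logarithms the JS divergence is bounded by ln 2, so for M models the per-sample term lies between -ln(2) · M(M-1)/2 and 0.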

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.


work page 2025

[11] [11]

Look closer to see better: Recurrent attention convolutional neural network for fine-grained image recognition

Jianlong Fu, Heliang Zheng, and Tao Mei. Look closer to see better: Recurrent attention convolutional neural network for fine-grained image recognition. InProceedings of the IEEE conference on computer vision and pattern recogni- tion, pages 4438–4446, 2017

work page 2017

[12] [12]

Dropout as a bayesian approximation: Representing model uncertainty in deep learning

Yarin Gal and Zoubin Ghahramani. Dropout as a bayesian approximation: Representing model uncertainty in deep learning. Ininternational conference on machine learning, pages 1050–1059. PMLR, 2016

work page 2016

[13] [13]

Smart agriculture: A litera- ture review.Journal of Management Analytics, 10(2):359– 415, 2023

Disha Garg and Mansaf Alam. Smart agriculture: A litera- ture review.Journal of Management Analytics, 10(2):359– 415, 2023

work page 2023

[14] [14]

Fruits 360: A dataset of images contain- ing fruits and vegetables.https://www.kaggle.com/ datasets/moltean/fruits, 2018

Mihai Gheorghe. Fruits 360: A dataset of images contain- ing fruits and vegetables.https://www.kaggle.com/ datasets/moltean/fruits, 2018. Accessed: 2024- 05-20

work page 2018

[15] [15]

Generating Sequences With Recurrent Neural Networks

Alex Graves. Generating sequences with recurrent neural networks.arXiv preprint arXiv:1308.0850, 2013

work page internal anchor Pith review Pith/arXiv arXiv 2013

[16] [16]

Transfg: A trans- former architecture for fine-grained recognition

Ju He, Jie-Neng Chen, Shuai Liu, Adam Kortylewski, Cheng Yang, Yutong Bai, and Changhu Wang. Transfg: A trans- former architecture for fine-grained recognition. InProceed- ings of the AAAI conference on artificial intelligence, pages 852–860, 2022

work page 2022

[17] [17]

Deep residual learning for image recognition

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. InProceed- ings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016

work page 2016

[18] [18]

Distilling the Knowledge in a Neural Network

Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distill- ing the knowledge in a neural network.arXiv preprint arXiv:1503.02531, 2015

work page internal anchor Pith review Pith/arXiv arXiv 2015

[19] [19]

Vegfru: A domain-specific dataset for fine-grained visual categoriza- tion

Saihui Hou, Yushan Feng, and Zilei Wang. Vegfru: A domain-specific dataset for fine-grained visual categoriza- tion. InProceedings of the IEEE international conference on computer vision, pages 541–549, 2017

work page 2017

[20] [20]

Dualnet: Learn com- plementary features for image recognition

Saihui Hou, Xu Liu, and Zilei Wang. Dualnet: Learn com- plementary features for image recognition. InProceedings of the IEEE international conference on computer vision, pages 502–510, 2017

work page 2017

[21] [21]

Densely connected convolutional net- works

Gao Huang, Zhuang Liu, Laurens Van Der Maaten, and Kil- ian Q Weinberger. Densely connected convolutional net- works. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 4700–4708, 2017

work page 2017

[22] [22]

Imagenet classification with deep convolutional neural net- works.Advances in neural information processing systems, 25, 2012

Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural net- works.Advances in neural information processing systems, 25, 2012

work page 2012

[23] [23]

Diversity regularized ensemble pruning

Nan Li, Yang Yu, and Zhi-Hua Zhou. Diversity regularized ensemble pruning. InJoint European conference on machine learning and knowledge discovery in databases, pages 330–

work page

[24] [24]

Focal loss for dense object detection

Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Doll´ar. Focal loss for dense object detection. InPro- ceedings of the IEEE international conference on computer vision, pages 2980–2988, 2017

work page 2017

[25] [25]

Deep learning for fine-grained classification of jujube fruit in the natural environment.Journal of Food Measurement and Characterization, 15(5):4150–4165, 2021

Xi Meng, Yingchun Yuan, Guifa Teng, and Tianzhen Liu. Deep learning for fine-grained classification of jujube fruit in the natural environment.Journal of Food Measurement and Characterization, 15(5):4150–4165, 2021

work page 2021

[26] [26]

Recurrent vision transformer for solving visual reasoning problems

Nicola Messina, Giuseppe Amato, Fabio Carrara, Claudio Gennaro, and Fabrizio Falchi. Recurrent vision transformer for solving visual reasoning problems. InInternational Con- ference on Image Analysis and Processing, pages 50–61. Springer, 2022

work page 2022

[27] [27]

Uncovering bias in the plantvillage dataset.arXiv preprint arXiv:2206.04374, 2022

Mehmet Alican Noyan. Uncovering bias in the plantvillage dataset.arXiv preprint arXiv:2206.04374, 2022

work page arXiv 2022

[28] [28]

Fruits 360 dataset on github

Mihai Oltean and Horea Muresan. Fruits 360 dataset on github. 2017

work page 2017

[29] [29]

Learning transferable visual models from natural language supervi- sion

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervi- sion. InInternational conference on machine learning, pages 8748–8763. PmLR, 2021

work page 2021

[30] [30]

Very Deep Convolutional Networks for Large-Scale Image Recognition

Karen Simonyan and Andrew Zisserman. Very deep convo- lutional networks for large-scale image recognition.arXiv preprint arXiv:1409.1556, 2014

work page internal anchor Pith review Pith/arXiv arXiv 2014

[31] [31]

Going deeper with convolutions

Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 1–9, 2015

work page 2015

[32] [32]

Rethinking the inception archi- tecture for computer vision

Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. Rethinking the inception archi- tecture for computer vision. InProceedings of the IEEE con- ference on computer vision and pattern recognition, pages 2818–2826, 2016

work page 2016

[33] [33]

Inception-v4, inception-resnet and the im- pact of residual connections on learning

Christian Szegedy, Sergey Ioffe, Vincent Vanhoucke, and Alexander Alemi. Inception-v4, inception-resnet and the im- pact of residual connections on learning. InProceedings of the AAAI conference on artificial intelligence, 2017

work page 2017

[34] [34]

Efficientnet: Rethinking model scaling for convolutional neural networks

Mingxing Tan and Quoc Le. Efficientnet: Rethinking model scaling for convolutional neural networks. InInternational conference on machine learning, pages 6105–6114. PMLR, 2019

work page 2019

[35] [35]

Branchynet: Fast inference via early exiting from deep neural networks

Surat Teerapittayanon, Bradley McDanel, and Hsiang-Tsung Kung. Branchynet: Fast inference via early exiting from deep neural networks. In2016 23rd international con- ference on pattern recognition (ICPR), pages 2464–2469. IEEE, 2016

work page 2016

[36] [36]

Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, et al. Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution.arXiv preprint arXiv:2409.12191, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[37] [37]

A large language model for multimodal iden- tification of crop diseases and pests.Scientific Reports, 15 (1):21959, 2025

Yiqun Wang, Fahai Wang, Wenbai Chen, Bowen Lv, Mengchen Liu, Xiangyuan Kong, Chunjiang Zhao, and Zhaocen Pan. A large language model for multimodal iden- tification of crop diseases and pests.Scientific Reports, 15 (1):21959, 2025

work page 2025

[38] [38]

Chain-of-thought prompting elicits reasoning in large lan- guage models.Advances in neural information processing systems, 35:24824–24837, 2022

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large lan- guage models.Advances in neural information processing systems, 35:24824–24837, 2022

work page 2022

[39] [39]

Cvt: Introduc- ing convolutions to vision transformers

Haiping Wu, Bin Xiao, Noel Codella, Mengchen Liu, Xiyang Dai, Lu Yuan, and Lei Zhang. Cvt: Introduc- ing convolutions to vision transformers. InProceedings of the IEEE/CVF international conference on computer vision, pages 22–31, 2021

work page 2021

[40] [40]

Research on citrus grading system based on ma- chine vision.Systems Science & Control Engineering, 13(1): 2460443, 2025

Miao Xu, Xuan Zhang, ChangJun Zhan, JianYu Ge, and Hua Yang. Research on citrus grading system based on ma- chine vision.Systems Science & Control Engineering, 13(1): 2460443, 2025

work page 2025

[41] [41]

The Dawn of LMMs: Preliminary Explorations with GPT-4V(ision)

Zhengyuan Yang, Linjie Li, Kevin Lin, Jianfeng Wang, Chung-Ching Lin, Zicheng Liu, and Lijuan Wang. The dawn of lmms: Preliminary explorations with gpt-4v (ision).arXiv preprint arXiv:2309.17421, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[42] [42]

A survey on large language model (llm) security and privacy: The good, the bad, and the ugly

Yifan Yao, Jinhao Duan, Kaidi Xu, Yuanfang Cai, Zhibo Sun, and Yue Zhang. A survey on large language model (llm) security and privacy: The good, the bad, and the ugly. High-Confidence Computing, 4(2):100211, 2024

work page 2024

[43] [43]

How transferable are features in deep neural networks?Ad- vances in neural information processing systems, 27, 2014

Jason Yosinski, Jeff Clune, Yoshua Bengio, and Hod Lipson. How transferable are features in deep neural networks?Ad- vances in neural information processing systems, 27, 2014

work page 2014

[44] [44]

Lookahead optimizer: k steps forward, 1 step back.Advances in neural information processing systems, 32, 2019

Michael Zhang, James Lucas, Jimmy Ba, and Geoffrey E Hinton. Lookahead optimizer: k steps forward, 1 step back.Advances in neural information processing systems, 32, 2019

work page 2019

[45] [45]

Part-based r-cnns for fine-grained category detection

Ning Zhang, Jeff Donahue, Ross Girshick, and Trevor Dar- rell. Part-based r-cnns for fine-grained category detection. InEuropean conference on computer vision, pages 834–849. Springer, 2014

work page 2014

[46] [46]

Illusionbench: A large- scale and comprehensive benchmark for visual illusion un- derstanding in vision-language models

Yiming Zhang, Zicheng Zhang, Xinyi Wei, Xiaohong Liu, Guangtao Zhai, and Xiongkuo Min. Illusionbench: A large- scale and comprehensive benchmark for visual illusion un- derstanding in vision-language models. In2025 IEEE Inter- national Conference on Multimedia and Expo (ICME), pages 1–6. IEEE, 2025

work page 2025

[47] [47]

Learning transferable architectures for scalable image recognition

Barret Zoph, Vijay Vasudevan, Jonathon Shlens, and Quoc V Le. Learning transferable architectures for scalable image recognition. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 8697–8710, 2018

work page 2018