FruitEnsemble: MLLM-Guided Arbitration for Heterogeneous ensemble in Fine-Grained Fruit Recognition
Pith reviewed 2026-05-21 05:57 UTC · model grok-4.3
The pith
A two-stage ensemble with MLLM arbitration reaches 70.49 percent accuracy on 306 fruit categories.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
FruitEnsemble is a practical two-stage dynamic inference framework. In the first stage, a validation-calibrated weighted ensemble of heterogeneous backbones generates a robust Top-3 candidate pool. To handle difficult samples, an expert arbitration mechanism triggers a multimodal large language model (MLLM) when ensemble confidence falls below 0.6; the MLLM performs rigorous visual verification by integrating external botanical descriptions through Chain-of-Thought (CoT) reasoning. Training is optimized with a hard sample-aware joint loss. The paper reports 70.49 percent classification accuracy on its 306-category fruit dataset, which it states outperforms existing state-of-the-art models.
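The two-stage control flow described above can be sketched in a few lines. This is only an illustration, not the authors' implementation: the function names, the use of the top-1 fused probability as "ensemble confidence", and the `arbitrate` callback standing in for the MLLM call are all assumptions. Only the Top-3 pool and the 0.6 threshold come from the paper's description.

```python
from typing import Callable, List, Sequence, Tuple


def weighted_ensemble(probs_per_model: Sequence[Sequence[float]],
                      weights: Sequence[float]) -> List[float]:
    """Fuse per-model class probabilities with normalized weights."""
    total = sum(weights)
    n_classes = len(probs_per_model[0])
    return [
        sum(w * p[c] for w, p in zip(weights, probs_per_model)) / total
        for c in range(n_classes)
    ]


def top_k(probs: Sequence[float], k: int = 3) -> List[Tuple[int, float]]:
    """Return (class_index, probability) pairs for the k most probable classes."""
    ranked = sorted(enumerate(probs), key=lambda cp: cp[1], reverse=True)
    return ranked[:k]


def predict(probs_per_model: Sequence[Sequence[float]],
            weights: Sequence[float],
            arbitrate: Callable[[List[int]], int],
            threshold: float = 0.6) -> Tuple[int, bool]:
    """Stage 1: weighted ensemble. Stage 2: arbitration only below the threshold.

    Returns (predicted_class, arbitration_was_triggered).
    Assumption: confidence is the top-1 fused probability.
    """
    fused = weighted_ensemble(probs_per_model, weights)
    candidates = top_k(fused, k=3)
    best_class, confidence = candidates[0]
    if confidence < threshold:
        # Hand only the Top-3 candidate classes to the arbiter.
        return arbitrate([c for c, _ in candidates]), True
    return best_class, False


def pick_second_candidate(candidates: List[int]) -> int:
    """Hypothetical stand-in for the MLLM arbiter, used only for illustration."""
    return candidates[1]
```

In this sketch the arbiter is consulted only for low-confidence inputs, so its cost scales with the fraction of samples that fall below the threshold rather than with the whole test set.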
What carries the argument
The expert arbitration mechanism that triggers an MLLM for visual verification using botanical descriptions and CoT reasoning only when ensemble confidence falls below 0.6.
If this is right
- The framework supplies an efficient deployment-oriented solution for real-world agricultural visual sorting and quality inspection tasks.
- It overcomes generalization limits of static single-model architectures on fine-grained problems.
- The hard sample-aware joint loss improves performance on challenging examples during training.
- Overall accuracy reaches 70.49 percent and exceeds previous methods on the constructed 306-category dataset.
Where Pith is reading between the lines
- The selective use of the MLLM only on uncertain cases could serve as a pattern for balancing accuracy gains against extra computation in other vision tasks.
- Adding external textual knowledge may help classification systems in domains where labeled images are scarce but descriptive information exists.
- The two-stage design offers a template for hybrid systems that combine traditional ensembles with language-model verification.
Load-bearing premise
The MLLM arbitration step correctly resolves the difficult samples that the ensemble cannot handle by using external botanical descriptions.
What would settle it
A test showing that accuracy on low-confidence samples stays the same or drops after MLLM arbitration is applied, or that overall accuracy fails to exceed prior state-of-the-art results.
Original abstract
Fine-grained fruit classification is a critical yet challenging task in agricultural computer vision, primarily hindered by a severe shortage of high-quality datasets and the high visual similarity between classes. To address these challenges, we first constructed a comprehensive dataset comprising 306 fruit categories with 116,233 samples. Moreover, we propose FruitEnsemble, a practical two-stage dynamic inference framework designed to overcome the generalization limitations of static single-model architectures. In the first stage, FruitEnsemble employs a validation-calibrated weighted ensemble of heterogeneous backbones to generate a robust Top-3 candidate pool. To tackle difficult samples, we introduce an expert arbitration mechanism: when ensemble confidence falls below 0.6, a multimodal large language model (MLLM) is triggered to perform rigorous visual verification by integrating external botanical descriptions using Chain-of-Thought (CoT) reasoning. Furthermore, we optimized the training pipeline with a hard sample-aware joint loss. Extensive experiments demonstrate that FruitEnsemble achieves a classification accuracy of 70.49% and outperforms existing state-of-the-art models. Our framework provides an efficient, deployment-oriented solution for real-world agricultural visual sorting and quality inspection tasks.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript constructs a 306-category fruit dataset (116,233 samples) and proposes FruitEnsemble, a two-stage dynamic inference framework. Stage 1 uses a validation-calibrated weighted ensemble of heterogeneous backbones to produce a top-3 candidate pool. When ensemble confidence falls below 0.6, an MLLM performs arbitration via external botanical descriptions and Chain-of-Thought reasoning. Training incorporates a hard-sample-aware joint loss. The paper reports 70.49% classification accuracy and claims outperformance over existing SOTA models on this dataset.
Significance. If the reported gains are shown to stem from the MLLM arbitration rather than the new dataset or ensemble weighting alone, the work could provide a practical, deployment-oriented solution for fine-grained agricultural vision tasks where visual similarity is high. The large-scale fruit dataset addresses a documented scarcity in the domain. Credit is due for the reproducible two-stage design and focus on real-world sorting applications, though the absence of isolating experiments limits the assessed novelty of the arbitration step.
major comments (2)
- [Abstract and §4] Abstract and §4 (Experimental Results): The central claim of 70.49% accuracy and SOTA outperformance is presented without any ablation that isolates the MLLM arbitration component. No comparison of ensemble-only accuracy versus the full FruitEnsemble, no count of samples triggering the <0.6 threshold, and no error analysis showing which classes or ambiguities the botanical CoT resolves are supplied. This directly affects attribution of the performance gain to the novel arbitration mechanism.
- [§3.2] §3.2 (Arbitration Mechanism): The assumption that the MLLM step correctly resolves difficult samples rests on external botanical descriptions, yet no quantitative validation (e.g., accuracy lift on the low-confidence subset or failure cases of the MLLM) is reported. Without this, the two-stage design's load-bearing contribution cannot be verified.
minor comments (2)
- [§3.1] The hard-sample-aware joint loss is mentioned but its exact formulation (weighting scheme, interaction with the ensemble loss) is not given in sufficient detail for reproduction.
- [§2] Dataset construction details (collection protocol, annotation process, train/val/test splits) should be expanded to allow independent verification of the 306-class benchmark.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment below and will revise the manuscript to strengthen the attribution of results to the MLLM arbitration component.
-
Referee: [Abstract and §4] Abstract and §4 (Experimental Results): The central claim of 70.49% accuracy and SOTA outperformance is presented without any ablation that isolates the MLLM arbitration component. No comparison of ensemble-only accuracy versus the full FruitEnsemble, no count of samples triggering the <0.6 threshold, and no error analysis showing which classes or ambiguities the botanical CoT resolves are supplied. This directly affects attribution of the performance gain to the novel arbitration mechanism.
Authors: We acknowledge that the current manuscript lacks explicit ablations isolating the MLLM arbitration. In the revised version we will add a dedicated subsection in §4 that reports (i) top-1 accuracy of the validation-calibrated weighted ensemble alone, (ii) the number and percentage of test samples whose ensemble confidence fell below 0.6 and therefore triggered MLLM arbitration, and (iii) a per-class error analysis highlighting the visual ambiguities resolved by the botanical CoT step. These additions will allow readers to directly attribute performance gains to the arbitration mechanism while preserving the overall 70.49% result and SOTA comparison. revision: yes
-
Referee: [§3.2] §3.2 (Arbitration Mechanism): The assumption that the MLLM step correctly resolves difficult samples rests on external botanical descriptions, yet no quantitative validation (e.g., accuracy lift on the low-confidence subset or failure cases of the MLLM) is reported. Without this, the two-stage design's load-bearing contribution cannot be verified.
Authors: We agree that quantitative evidence for the MLLM step is necessary. We will augment §3.2 and the experimental results with (a) accuracy on the low-confidence subset before versus after MLLM arbitration and (b) representative failure cases where the MLLM did not resolve the ambiguity. These metrics will be computed on the same held-out test split used for the main results, thereby verifying the contribution of the two-stage design without altering the reported overall accuracy. revision: yes
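The quantities requested by the referee and promised in the responses (ensemble-only accuracy, trigger count, and accuracy on the low-confidence subset before versus after arbitration) are simple to compute once per-sample predictions are logged. The sketch below assumes a hypothetical record layout with the keys `label`, `ensemble_pred`, `final_pred`, and `confidence`; none of these names come from the paper.

```python
from typing import Dict, List


def arbitration_report(records: List[Dict], threshold: float = 0.6) -> Dict[str, float]:
    """Summarize how much the arbitration stage changes accuracy.

    Each record is assumed (hypothetically) to hold:
      label          ground-truth class
      ensemble_pred  top-1 class from the ensemble alone
      final_pred     class after optional arbitration
      confidence     ensemble confidence for the sample
    """
    def accuracy(rows: List[Dict], key: str) -> float:
        if not rows:
            return float("nan")
        return sum(r[key] == r["label"] for r in rows) / len(rows)

    # Samples that would have triggered arbitration.
    low = [r for r in records if r["confidence"] < threshold]
    n = len(records)
    return {
        "ensemble_only_acc": accuracy(records, "ensemble_pred"),
        "full_system_acc": accuracy(records, "final_pred"),
        "n_triggered": len(low),
        "trigger_rate": len(low) / n if n else float("nan"),
        "low_conf_acc_before": accuracy(low, "ensemble_pred"),
        "low_conf_acc_after": accuracy(low, "final_pred"),
    }
```

Comparing `low_conf_acc_before` with `low_conf_acc_after` on the same held-out split is the direct check of the load-bearing premise: if the second number is not higher, arbitration is not helping on the samples it is meant to resolve.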
Circularity Check
No circularity: empirical framework reports experimental accuracy without derivations or self-referential reductions
full rationale
The paper describes a two-stage empirical method (weighted heterogeneous ensemble plus conditional MLLM arbitration on a new 306-class dataset) and reports a measured accuracy of 70.49%. No equations, first-principles derivations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the provided text. The central performance claim rests on direct experimental evaluation rather than any chain that reduces by construction to its own inputs. This is the expected non-finding for an applied ML systems paper whose results are externally falsifiable via replication on the stated dataset.
Axiom & Free-Parameter Ledger
free parameters (1)
- ensemble confidence threshold
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/Cost/FunctionalEquation.lean: washburn_uniqueness_aczel
  unclear: Relation between the paper passage and the cited Recognition theorem.
  Paper passage: "FruitEnsemble employs a validation-calibrated weighted ensemble of heterogeneous backbones to generate a robust Top-3 candidate pool... when ensemble confidence falls below 0.6, a multimodal large language model (MLLM) is triggered..."
- IndisputableMonolith/Foundation/BranchSelection.lean: branch_selection
  unclear: Relation between the paper passage and the cited Recognition theorem.
  Paper passage: "Hard Sample-Aware Joint Optimization... $L_{\text{hard\_div}} = \frac{1}{|B_{\text{hard}}|} \sum -\sum \mathrm{JS}(P_i(x) \,\|\, P_j(x))$"
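The diversity term quoted in the passage above averages a negated sum of Jensen-Shannon divergences between model output distributions over a set of hard samples. A minimal sketch follows, under two assumptions that the quoted fragment does not confirm: the inner sum runs over unordered model pairs (i < j), and each P_i(x) is a full class-probability vector. The function names are illustrative only.

```python
import math
from typing import List, Sequence


def kl(p: Sequence[float], q: Sequence[float]) -> float:
    """KL(p || q) in nats; terms with p_i == 0 contribute zero."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js(p: Sequence[float], q: Sequence[float]) -> float:
    """Jensen-Shannon divergence between two discrete distributions."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def hard_diversity_loss(hard_batch: List[List[Sequence[float]]]) -> float:
    """Mean over hard samples of the negated pairwise JS divergence.

    hard_batch[s][i] is model i's class distribution for hard sample s.
    Assumption: pairs are unordered (i < j).
    """
    if not hard_batch:
        return 0.0
    total = 0.0
    for model_probs in hard_batch:
        n_models = len(model_probs)
        pair_sum = sum(
            js(model_probs[i], model_probs[j])
            for i in range(n_models)
            for j in range(i + 1, n_models)
        )
        total += -pair_sum  # minimizing this term rewards disagreement
    return total / len(hard_batch)
```

Because the term is negated, minimizing it pushes the backbones toward different predictions on hard samples; how it is weighted against the other loss terms is not stated in the text available here.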
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.