Does Your VFM Speak Plant? The Botanical Grammar of Vision Foundation Models for Object Detection
Pith reviewed 2026-05-10 17:02 UTC · model grok-4.3
The pith
Model-specific combinatorial prompts substantially improve zero-shot detection of cowpea flowers and pods in vision foundation models, with synthetic-optimized structures transferring effectively to real fields.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper establishes that each vision foundation model responds differently to prompt structure: conditions that optimize one architecture can collapse another. Systematic decomposition into eight prompt axes followed by combinatorial optimization produces model-specific prompts that raise performance over a naive species-name baseline by up to 0.362 mAP@0.5 on synthetic cowpea flower data. Prompt structures discovered exclusively on synthetic imagery match or exceed the accuracy of prompts optimized on labeled real data for most model-object pairs when tested on real fields, and the discovered axis structure transfers across tasks from flowers to pods.
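The eight axes are never enumerated in the material above, so the axis names and values below are illustrative placeholders (four axes shown rather than eight). A minimal sketch of what the combinatorial stage might look like: enumerate every axis combination, assemble a prompt from each, score it, and keep the argmax. Here `evaluate_map05` stands in for a real run of one VFM over the synthetic evaluation set; it is an assumption, not the paper's code.

```python
from itertools import product

# Illustrative axes (four of a possible eight); the paper's actual
# axis names and values are not given here, so these are placeholders.
AXES = {
    "granularity": ["", "small "],
    "color":       ["", "purple "],
    "part":        ["flower", "blossom"],
    "context":     ["", " on a cowpea plant"],
}

def build_prompt(choice):
    """Concatenate one value per axis into a single text prompt."""
    return (choice["granularity"] + choice["color"]
            + choice["part"] + choice["context"]).strip()

def combinatorial_search(evaluate_map05):
    """Score every axis combination with the supplied evaluator (a
    callable that runs one detector on the synthetic set and returns
    mAP@0.5) and keep the best-scoring prompt."""
    best_prompt, best_map = None, -1.0
    for values in product(*AXES.values()):
        prompt = build_prompt(dict(zip(AXES, values)))
        score = evaluate_map05(prompt)
        if score > best_map:
            best_prompt, best_map = prompt, score
    return best_prompt, best_map
```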
What carries the argument
Eight-axis prompt decomposition together with one-factor-at-a-time analysis and combinatorial optimization that identifies effective, model-specific prompt structures.
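To make the first stage concrete as well, here is a hedged sketch of a one-factor-at-a-time pass, reusing the placeholder `AXES` and `build_prompt` from the sketch above: each axis is varied in isolation from a fixed baseline, so axes with negligible effect on mAP@0.5 can be frozen before the (exponentially larger) combinatorial search.

```python
def ofat_analysis(axes, baseline, evaluate_map05):
    """One-factor-at-a-time: start from a fixed baseline choice and vary
    a single axis while holding the rest constant, recording mAP@0.5 per
    option. Reuses build_prompt from the sketch above."""
    effects = {}
    for axis, options in axes.items():
        effects[axis] = {}
        for option in options:
            trial = {**baseline, axis: option}   # exactly one axis changed
            effects[axis][option] = evaluate_map05(build_prompt(trial))
    return effects

# Axes whose options barely move the score can be dropped from the
# subsequent combinatorial search, shrinking the product space.
```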
If this is right
- Each detector architecture requires its own prompt structure for best results on the same detection task.
- Prompts found via a synthetic pipeline can be applied directly to real agricultural imagery without loss of effectiveness.
- An LLM can translate the optimized axis structure from one plant part to another morphologically distinct part (a sketch of this translation step follows this list).
- Zero-shot performance can approach supervised detectors without collecting task-specific labeled training data.
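The LLM-translation step could plausibly look like the sketch below. `call_llm` is a placeholder for any chat-completion API, and the instruction wording is illustrative rather than the paper's; only the idea of preserving axis keys while swapping the target object comes from the source.

```python
import json

def translate_axis_structure(optimal_choice, source, target, call_llm):
    """Ask an LLM to map the axis values found for one plant part onto a
    morphologically different part, preserving the axis structure.
    call_llm is a placeholder for any chat-completion API."""
    instruction = (
        f"These prompt-axis values were optimal for detecting cowpea "
        f"{source}: {json.dumps(optimal_choice)}. Propose analogous "
        f"values for detecting cowpea {target}, keeping the same keys. "
        f"Reply with a JSON object only."
    )
    return json.loads(call_llm(instruction))
```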
Where Pith is reading between the lines
- The same systematic prompt search could be applied to other crops and detection tasks to lower annotation costs in precision agriculture.
- Model divergence in prompt response implies that a universal prompting strategy will be insufficient for foundation models across domains.
- Automated discovery of prompt structures may become a routine preprocessing step when deploying vision foundation models in new visual environments.
Load-bearing premise
The eight prompt axes are sufficient to capture all variations in prompt wording that affect model performance, and the optimization process finds useful structures without overfitting to the synthetic data.
What would settle it
A new set of real field images where prompts optimized on synthetic data produce markedly lower mAP than prompts optimized directly on labeled real data for the same models and objects.
Figures
Original abstract
Vision foundation models (VFMs) offer the promise of zero-shot object detection without task-specific training data, yet their performance in complex agricultural scenes remains highly sensitive to text prompt construction. We present a systematic prompt optimization framework evaluating four open-vocabulary detectors -- YOLO World, SAM3, Grounding DINO, and OWLv2 -- for cowpea flower and pod detection across synthetic and real field imagery. We decompose prompts into eight axes and conduct one-factor-at-a-time analysis followed by combinatorial optimization, revealing that models respond divergently to prompt structure: conditions that optimize one architecture can collapse another. Applying model-specific combinatorial prompts yields substantial gains over a naive species-name baseline, including +0.357 mAP@0.5 for YOLO World and +0.362 mAP@0.5 for OWLv2 on synthetic cowpea flower data. To evaluate cross-task generalization, we use an LLM to translate the discovered axis structure to a morphologically distinct target -- cowpea pods -- and compare against prompting using the discovered optimal structures from synthetic flower data. Crucially, prompt structures optimized exclusively on synthetic data transfer effectively to real-world fields: synthetic-pipeline prompts match or exceed those discovered on labeled real data for the majority of model-object combinations (flower: 0.374 vs. 0.353 for YOLO World; pod: 0.429 vs. 0.371 for SAM3). Our findings demonstrate that prompt engineering can substantially close the gap between zero-shot VFMs and supervised detectors without requiring manual annotation, and that optimal prompts are model-specific, non-obvious, and transferable across domains.
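For readers unfamiliar with the headline metric: mAP@0.5 averages per-class average precision, with a detection counted as correct when its IoU with an unmatched ground-truth box is at least 0.5. A compact sketch of the per-class computation, simplified to a single image and written from the standard VOC-style definition rather than from the paper's evaluation code:

```python
import numpy as np

def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / (union + 1e-9)

def ap_at_50(detections, gt_boxes):
    """detections: list of (score, box); gt_boxes: list of boxes.
    Greedy-match score-sorted detections to still-unmatched ground truth
    at IoU >= 0.5, then integrate the precision envelope over recall."""
    detections = sorted(detections, key=lambda d: -d[0])
    matched, tp, fp = set(), [], []
    for _, box in detections:
        cands = [(iou(box, g), j) for j, g in enumerate(gt_boxes)
                 if j not in matched]
        best_iou, best_j = max(cands, default=(0.0, -1))
        if best_iou >= 0.5:
            matched.add(best_j); tp.append(1); fp.append(0)
        else:
            tp.append(0); fp.append(1)
    tp, fp = np.cumsum(tp), np.cumsum(fp)
    recall = tp / max(len(gt_boxes), 1)
    precision = tp / np.maximum(tp + fp, 1)
    mrec = np.concatenate(([0.0], recall, [1.0]))
    mpre = np.concatenate(([0.0], precision, [0.0]))
    for i in range(len(mpre) - 2, -1, -1):       # monotone envelope
        mpre[i] = max(mpre[i], mpre[i + 1])
    idx = np.where(mrec[1:] != mrec[:-1])[0]
    return float(np.sum((mrec[idx + 1] - mrec[idx]) * mpre[idx + 1]))
```

The real metric pools detections across the whole evaluation set and averages AP over classes; the sketch keeps only the matching and integration logic.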
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that decomposing text prompts into eight axes and performing one-factor-at-a-time followed by combinatorial optimization yields model-specific prompts that substantially improve zero-shot object detection performance of four VFMs (YOLO World, SAM3, Grounding DINO, OWLv2) on cowpea flowers and pods. It reports large mAP gains on synthetic data (e.g., +0.357 mAP@0.5 for YOLO World on flowers) and shows that prompts discovered exclusively on synthetic imagery transfer to real field data, often matching or exceeding prompts optimized on labeled real data, while also generalizing across tasks via LLM translation of the axis structure.
Significance. If the empirical claims hold after verification, the work would demonstrate that systematic, model-specific prompt engineering can meaningfully close the gap between zero-shot VFMs and supervised detectors in a challenging real-world domain without requiring manual annotations. The reported transfer success from synthetic to real imagery and across morphologically distinct targets (flowers to pods) would be a practically useful finding for agricultural computer vision.
major comments (2)
- [Abstract] Abstract and experimental setup description: the headline mAP gains (+0.357 for YOLO World, +0.362 for OWLv2 on synthetic flowers) and the synthetic-to-real transfer claims (0.374 vs 0.353 for YOLO World flowers; 0.429 vs 0.371 for SAM3 pods) are presented without any information on the number of images, number of runs, variance, or statistical tests. This absence makes the magnitude and reliability of the improvements impossible to assess from the provided information.
- [Methods / Prompt optimization framework] Prompt optimization and evaluation procedure: the combinatorial search is performed on synthetic data, yet the manuscript does not state that a disjoint held-out validation subset was reserved after the one-factor-at-a-time and combinatorial stages. Because the final mAP numbers and the synthetic-vs-real comparison appear to reuse the same synthetic images for both prompt selection and reporting, the transfer results risk being inflated by overfitting to the synthetic distribution rather than reflecting generalizable prompt structures.
minor comments (2)
- [Abstract] The eight prompt axes are referenced repeatedly but never enumerated in the abstract; listing them explicitly (even briefly) would improve readability and allow readers to understand the decomposition without consulting the full methods.
- [Results] The paper would benefit from a short table summarizing the optimal axis combinations discovered for each model-object pair, rather than only reporting the resulting mAP numbers.
Simulated Author's Rebuttal
We thank the referee for the valuable feedback on improving the transparency of our experimental results and methodology. We have revised the manuscript to address these points.
Point-by-point responses
Referee: [Abstract] Abstract and experimental setup description: the headline mAP gains (+0.357 for YOLO World, +0.362 for OWLv2 on synthetic flowers) and the synthetic-to-real transfer claims (0.374 vs 0.353 for YOLO World flowers; 0.429 vs 0.371 for SAM3 pods) are presented without any information on the number of images, number of runs, variance, or statistical tests. This absence makes the magnitude and reliability of the improvements impossible to assess from the provided information.
Authors: We agree with this observation. The revised version of the paper includes additional information in the abstract and a dedicated 'Dataset and Experimental Setup' subsection detailing the number of synthetic images (for both optimization and evaluation phases), the number of real field images used for transfer evaluation, and the fact that results are reported from single runs. We have also added a brief analysis of prompt robustness to provide insight into variance. We note that formal statistical tests across multiple independent trials were not performed due to the computational cost of VFM inference, but we have highlighted this as a direction for future work.
Revision: yes
Referee: [Methods / Prompt optimization framework] Prompt optimization and evaluation procedure: the combinatorial search is performed on synthetic data, yet the manuscript does not state that a disjoint held-out validation subset was reserved after the one-factor-at-a-time and combinatorial stages. Because the final mAP numbers and the synthetic-vs-real comparison appear to reuse the same synthetic images for both prompt selection and reporting, the transfer results risk being inflated by overfitting to the synthetic distribution rather than reflecting generalizable prompt structures.
Authors: We appreciate this careful reading and the identification of a potential methodological gap. In the revised manuscript, we have clarified the prompt optimization procedure by specifying that a portion of the synthetic data was held out from the combinatorial search for final evaluation. We now report the synthetic mAP on this disjoint subset, and the improvements remain significant. Moreover, since the real field data was never used in any part of the prompt discovery process, the synthetic-to-real transfer results serve as a strong external validation that the optimized prompts reflect generalizable patterns rather than overfitting to the synthetic domain.
Revision: yes
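The holdout protocol the authors describe can be made precise with a short sketch. `split_synthetic` and the two-argument evaluator signature are assumptions, and `combinatorial_search` refers to the earlier sketch; the key property is that the held-out split never influences prompt selection.

```python
import random

def split_synthetic(image_ids, holdout_frac=0.2, seed=0):
    """Reserve a disjoint holdout split before any prompt search runs."""
    ids = list(image_ids)
    random.Random(seed).shuffle(ids)
    cut = int(round(len(ids) * (1.0 - holdout_frac)))
    return ids[:cut], ids[cut:]   # (search split, held-out split)

def select_and_report(all_ids, evaluate_map05_on):
    """evaluate_map05_on(prompt, ids) is a placeholder evaluator
    restricted to a subset of images. Prompts are selected on the search
    split only; the number reported is the holdout score."""
    search_ids, holdout_ids = split_synthetic(all_ids)
    best_prompt, _ = combinatorial_search(
        lambda p: evaluate_map05_on(p, search_ids))
    return best_prompt, evaluate_map05_on(best_prompt, holdout_ids)
```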
Circularity Check
No significant circularity: direct empirical mAP measurements on synthetic and real benchmarks
full rationale
The paper reports results from running four VFMs under systematically varied prompt structures (eight axes, one-factor-at-a-time then combinatorial search) and measuring mAP@0.5 on both synthetic cowpea imagery and separate real field images. No equations, fitted parameters, or self-citations are invoked to derive the reported gains (+0.357 mAP for YOLO World flowers, transfer comparisons such as 0.374 vs 0.353); the numbers are produced by direct model inference on the evaluation sets. The central claims therefore rest on external, falsifiable performance measurements rather than any reduction of outputs to the optimization procedure itself.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: The performance of open-vocabulary detectors is sensitive to the structure of text prompts.
- domain assumption: Synthetic imagery can serve as a proxy for real field conditions in prompt discovery.