Does Your VFM Speak Plant? The Botanical Grammar of Vision Foundation Models for Object Detection
Pith reviewed 2026-05-10 17:02 UTC · model grok-4.3
The pith
Model-specific combinatorial prompts substantially improve zero-shot detection of cowpea flowers and pods in vision foundation models, with synthetic-optimized structures transferring effectively to real fields.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper establishes that each vision foundation model responds differently to prompt structure: conditions that optimize one architecture can collapse another. Systematic decomposition into eight prompt axes followed by combinatorial optimization produces model-specific prompts that raise performance over a naive species-name baseline by up to 0.362 mAP@0.5 on synthetic cowpea flower data. Prompt structures discovered exclusively on synthetic imagery match or exceed the accuracy of prompts optimized on labeled real data for most model-object pairs when tested on real fields, and the discovered axis structure transfers across tasks from flowers to pods.
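The eight axes are never enumerated in the material above, so the axis names and values below are illustrative placeholders (four axes shown rather than eight). A minimal sketch of what the combinatorial stage might look like: enumerate every axis combination, assemble a prompt from each, score it, and keep the argmax. Here `evaluate_map05` stands in for a real run of one VFM over the synthetic evaluation set; it is an assumption, not the paper's code.

```python
from itertools import product

# Illustrative axes (four of a possible eight); the paper's actual
# axis names and values are not given here, so these are placeholders.
AXES = {
    "granularity": ["", "small "],
    "color":       ["", "purple "],
    "part":        ["flower", "blossom"],
    "context":     ["", " on a cowpea plant"],
}

def build_prompt(choice):
    """Concatenate one value per axis into a single text prompt."""
    return (choice["granularity"] + choice["color"]
            + choice["part"] + choice["context"]).strip()

def combinatorial_search(evaluate_map05):
    """Score every axis combination with the supplied evaluator (a
    callable that runs one detector on the synthetic set and returns
    mAP@0.5) and keep the best-scoring prompt."""
    best_prompt, best_map = None, -1.0
    for values in product(*AXES.values()):
        prompt = build_prompt(dict(zip(AXES, values)))
        score = evaluate_map05(prompt)
        if score > best_map:
            best_prompt, best_map = prompt, score
    return best_prompt, best_map
```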
What carries the argument
Eight-axis prompt decomposition together with one-factor-at-a-time analysis and combinatorial optimization that identifies effective, model-specific prompt structures.
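To make the first stage concrete as well, here is a hedged sketch of a one-factor-at-a-time pass, reusing the placeholder `AXES` and `build_prompt` from the sketch above: each axis is varied in isolation from a fixed baseline, so axes with negligible effect on mAP@0.5 can be frozen before the (exponentially larger) combinatorial search.

```python
def ofat_analysis(axes, baseline, evaluate_map05):
    """One-factor-at-a-time: start from a fixed baseline choice and vary
    a single axis while holding the rest constant, recording mAP@0.5 per
    option. Reuses build_prompt from the sketch above."""
    effects = {}
    for axis, options in axes.items():
        effects[axis] = {}
        for option in options:
            trial = {**baseline, axis: option}   # exactly one axis changed
            effects[axis][option] = evaluate_map05(build_prompt(trial))
    return effects

# Axes whose options barely move the score can be dropped from the
# subsequent combinatorial search, shrinking the product space.
```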
If this is right
- Each detector architecture requires its own prompt structure for best results on the same detection task.
- Prompts found via a synthetic pipeline can be applied directly to real agricultural imagery without loss of effectiveness.
- An LLM can translate the optimized axis structure from one plant part to another morphologically distinct part (a sketch of this translation step follows this list).
- Zero-shot performance can approach supervised detectors without collecting task-specific labeled training data.
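The LLM-translation step could plausibly look like the sketch below. `call_llm` is a placeholder for any chat-completion API, and the instruction wording is illustrative rather than the paper's; only the idea of preserving axis keys while swapping the target object comes from the source.

```python
import json

def translate_axis_structure(optimal_choice, source, target, call_llm):
    """Ask an LLM to map the axis values found for one plant part onto a
    morphologically different part, preserving the axis structure.
    call_llm is a placeholder for any chat-completion API."""
    instruction = (
        f"These prompt-axis values were optimal for detecting cowpea "
        f"{source}: {json.dumps(optimal_choice)}. Propose analogous "
        f"values for detecting cowpea {target}, keeping the same keys. "
        f"Reply with a JSON object only."
    )
    return json.loads(call_llm(instruction))
```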
Where Pith is reading between the lines
- The same systematic prompt search could be applied to other crops and detection tasks to lower annotation costs in precision agriculture.
- Model divergence in prompt response implies that a universal prompting strategy will be insufficient for foundation models across domains.
- Automated discovery of prompt structures may become a routine preprocessing step when deploying vision foundation models in new visual environments.
Load-bearing premise
The eight prompt axes are sufficient to capture all variations in prompt wording that affect model performance, and the optimization process finds useful structures without overfitting to the synthetic data.
What would settle it
A new set of real field images where prompts optimized on synthetic data produce markedly lower mAP than prompts optimized directly on labeled real data for the same models and objects.
Figures
Original abstract
Vision foundation models (VFMs) offer the promise of zero-shot object detection without task-specific training data, yet their performance in complex agricultural scenes remains highly sensitive to text prompt construction. We present a systematic prompt optimization framework evaluating four open-vocabulary detectors -- YOLO World, SAM3, Grounding DINO, and OWLv2 -- for cowpea flower and pod detection across synthetic and real field imagery. We decompose prompts into eight axes and conduct one-factor-at-a-time analysis followed by combinatorial optimization, revealing that models respond divergently to prompt structure: conditions that optimize one architecture can collapse another. Applying model-specific combinatorial prompts yields substantial gains over a naive species-name baseline, including +0.357 mAP@0.5 for YOLO World and +0.362 mAP@0.5 for OWLv2 on synthetic cowpea flower data. To evaluate cross-task generalization, we use an LLM to translate the discovered axis structure to a morphologically distinct target -- cowpea pods -- and compare against prompting using the discovered optimal structures from synthetic flower data. Crucially, prompt structures optimized exclusively on synthetic data transfer effectively to real-world fields: synthetic-pipeline prompts match or exceed those discovered on labeled real data for the majority of model-object combinations (flower: 0.374 vs. 0.353 for YOLO World; pod: 0.429 vs. 0.371 for SAM3). Our findings demonstrate that prompt engineering can substantially close the gap between zero-shot VFMs and supervised detectors without requiring manual annotation, and that optimal prompts are model-specific, non-obvious, and transferable across domains.
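For readers unfamiliar with the headline metric: mAP@0.5 averages per-class average precision, with a detection counted as correct when its IoU with an unmatched ground-truth box is at least 0.5. A compact sketch of the per-class computation, simplified to a single image and written from the standard VOC-style definition rather than from the paper's evaluation code:

```python
import numpy as np

def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / (union + 1e-9)

def ap_at_50(detections, gt_boxes):
    """detections: list of (score, box); gt_boxes: list of boxes.
    Greedy-match score-sorted detections to still-unmatched ground truth
    at IoU >= 0.5, then integrate the precision envelope over recall."""
    detections = sorted(detections, key=lambda d: -d[0])
    matched, tp, fp = set(), [], []
    for _, box in detections:
        cands = [(iou(box, g), j) for j, g in enumerate(gt_boxes)
                 if j not in matched]
        best_iou, best_j = max(cands, default=(0.0, -1))
        if best_iou >= 0.5:
            matched.add(best_j); tp.append(1); fp.append(0)
        else:
            tp.append(0); fp.append(1)
    tp, fp = np.cumsum(tp), np.cumsum(fp)
    recall = tp / max(len(gt_boxes), 1)
    precision = tp / np.maximum(tp + fp, 1)
    mrec = np.concatenate(([0.0], recall, [1.0]))
    mpre = np.concatenate(([0.0], precision, [0.0]))
    for i in range(len(mpre) - 2, -1, -1):       # monotone envelope
        mpre[i] = max(mpre[i], mpre[i + 1])
    idx = np.where(mrec[1:] != mrec[:-1])[0]
    return float(np.sum((mrec[idx + 1] - mrec[idx]) * mpre[idx + 1]))
```

The real metric pools detections across the whole evaluation set and averages AP over classes; the sketch keeps only the matching and integration logic.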
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that decomposing text prompts into eight axes and performing one-factor-at-a-time followed by combinatorial optimization yields model-specific prompts that substantially improve zero-shot object detection performance of four VFMs (YOLO World, SAM3, Grounding DINO, OWLv2) on cowpea flowers and pods. It reports large mAP gains on synthetic data (e.g., +0.357 mAP@0.5 for YOLO World on flowers) and shows that prompts discovered exclusively on synthetic imagery transfer to real field data, often matching or exceeding prompts optimized on labeled real data, while also generalizing across tasks via LLM translation of the axis structure.
Significance. If the empirical claims hold after verification, the work would demonstrate that systematic, model-specific prompt engineering can meaningfully close the gap between zero-shot VFMs and supervised detectors in a challenging real-world domain without requiring manual annotations. The reported transfer success from synthetic to real imagery and across morphologically distinct targets (flowers to pods) would be a practically useful finding for agricultural computer vision.
major comments (2)
- [Abstract] Abstract and experimental setup description: the headline mAP gains (+0.357 for YOLO World, +0.362 for OWLv2 on synthetic flowers) and the synthetic-to-real transfer claims (0.374 vs 0.353 for YOLO World flowers; 0.429 vs 0.371 for SAM3 pods) are presented without any information on the number of images, number of runs, variance, or statistical tests. This absence makes the magnitude and reliability of the improvements impossible to assess from the provided information.
- [Methods / Prompt optimization framework] Prompt optimization and evaluation procedure: the combinatorial search is performed on synthetic data, yet the manuscript does not state that a disjoint held-out validation subset was reserved after the one-factor-at-a-time and combinatorial stages. Because the final mAP numbers and the synthetic-vs-real comparison appear to reuse the same synthetic images for both prompt selection and reporting, the transfer results risk being inflated by overfitting to the synthetic distribution rather than reflecting generalizable prompt structures.
minor comments (2)
- [Abstract] The eight prompt axes are referenced repeatedly but never enumerated in the abstract; listing them explicitly (even briefly) would improve readability and allow readers to understand the decomposition without consulting the full methods.
- [Results] The paper would benefit from a short table summarizing the optimal axis combinations discovered for each model-object pair, rather than only reporting the resulting mAP numbers.
Simulated Author's Rebuttal
We thank the referee for the valuable feedback on improving the transparency of our experimental results and methodology. We have revised the manuscript to address these points.
Point-by-point responses
Referee: [Abstract] Abstract and experimental setup description: the headline mAP gains (+0.357 for YOLO World, +0.362 for OWLv2 on synthetic flowers) and the synthetic-to-real transfer claims (0.374 vs 0.353 for YOLO World flowers; 0.429 vs 0.371 for SAM3 pods) are presented without any information on the number of images, number of runs, variance, or statistical tests. This absence makes the magnitude and reliability of the improvements impossible to assess from the provided information.
Authors: We agree with this observation. The revised version of the paper includes additional information in the abstract and a dedicated 'Dataset and Experimental Setup' subsection detailing the number of synthetic images (for both optimization and evaluation phases), the number of real field images used for transfer evaluation, and the fact that results are reported from single runs. We have also added a brief analysis of prompt robustness to provide insight into variance. We note that formal statistical tests across multiple independent trials were not performed due to the computational cost of VFM inference, but we have highlighted this as a direction for future work.
Revision: yes
Referee: [Methods / Prompt optimization framework] Prompt optimization and evaluation procedure: the combinatorial search is performed on synthetic data, yet the manuscript does not state that a disjoint held-out validation subset was reserved after the one-factor-at-a-time and combinatorial stages. Because the final mAP numbers and the synthetic-vs-real comparison appear to reuse the same synthetic images for both prompt selection and reporting, the transfer results risk being inflated by overfitting to the synthetic distribution rather than reflecting generalizable prompt structures.
Authors: We appreciate this careful reading and the identification of a potential methodological gap. In the revised manuscript, we have clarified the prompt optimization procedure by specifying that a portion of the synthetic data was held out from the combinatorial search for final evaluation. We now report the synthetic mAP on this disjoint subset, and the improvements remain significant. Moreover, since the real field data was never used in any part of the prompt discovery process, the synthetic-to-real transfer results serve as a strong external validation that the optimized prompts reflect generalizable patterns rather than overfitting to the synthetic domain.
Revision: yes
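The holdout protocol the authors describe can be made precise with a short sketch. `split_synthetic` and the two-argument evaluator signature are assumptions, and `combinatorial_search` refers to the earlier sketch; the key property is that the held-out split never influences prompt selection.

```python
import random

def split_synthetic(image_ids, holdout_frac=0.2, seed=0):
    """Reserve a disjoint holdout split before any prompt search runs."""
    ids = list(image_ids)
    random.Random(seed).shuffle(ids)
    cut = int(round(len(ids) * (1.0 - holdout_frac)))
    return ids[:cut], ids[cut:]   # (search split, held-out split)

def select_and_report(all_ids, evaluate_map05_on):
    """evaluate_map05_on(prompt, ids) is a placeholder evaluator
    restricted to a subset of images. Prompts are selected on the search
    split only; the number reported is the holdout score."""
    search_ids, holdout_ids = split_synthetic(all_ids)
    best_prompt, _ = combinatorial_search(
        lambda p: evaluate_map05_on(p, search_ids))
    return best_prompt, evaluate_map05_on(best_prompt, holdout_ids)
```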
Circularity Check
No significant circularity: direct empirical mAP measurements on synthetic and real benchmarks
full rationale
The paper reports results from running four VFMs under systematically varied prompt structures (eight axes, one-factor-at-a-time then combinatorial search) and measuring mAP@0.5 on both synthetic cowpea imagery and separate real field images. No equations, fitted parameters, or self-citations are invoked to derive the reported gains (+0.357 mAP for YOLO World flowers, transfer comparisons such as 0.374 vs 0.353); the numbers are produced by direct model inference on the evaluation sets. The central claims therefore rest on external, falsifiable performance measurements rather than any reduction of outputs to the optimization procedure itself.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: The performance of open-vocabulary detectors is sensitive to the structure of text prompts.
- domain assumption: Synthetic imagery can serve as a proxy for real field conditions in prompt discovery.