arxiv: 2606.20728 · v1 · pith:YCIZILVCnew · submitted 2026-06-17 · 💻 cs.CV · cs.CL

VTOS: Learning to Orchestrate Vision Tools by Co-Searching Solutions and Observers

Pith reviewed 2026-06-26 21:34 UTC · model grok-4.3

classification 💻 cs.CV cs.CL
keywords vision tool orchestration · co-search solutions and observers · visual programming · dense object counting · zero-shot segmentation · adaptive pipelines · foundation model composition

The pith

Co-searching executable vision tool pipelines together with observer programs that diagnose their failures improves adaptation to dense objects, occlusion, and domain shifts.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents VTOS as a way to orchestrate vision foundation tools by simultaneously searching for solution programs that compose detectors, segmenters, and operators, and observer programs that detect where those solutions go wrong. Feedback from the observers is collected in a shared knowledge base called VisionThoughts and used to steer the next round of search. The approach is evaluated on dense object counting and zero-shot plant-disease segmentation, where it exceeds both fixed pipelines and existing agentic visual-programming methods. A reader would care because current vision tools are powerful individually but become brittle when scenes vary in density, scale, or distribution; this method offers a systematic way to adjust their use without hand-crafted rules for every new condition.

Core claim

VTOS jointly searches over executable solution programs that chain tools such as Grounding DINO, SAM, NMS, and slice-and-detect, together with observer programs that identify failure modes such as incorrect thresholds or missed small objects and generate feedback; this feedback is stored in the VisionThoughts knowledge base to guide subsequent iterations, producing better orchestration than either static pipelines or agent baselines on tasks that require threshold calibration, mask refinement, and handling of domain shift.
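
The loop described in the core claim can be sketched on a toy problem. This is a minimal illustration, not the authors' implementation: the "solution program" is reduced to a single confidence threshold, the observer is a fixed function rather than a searched program, and every name here (`Solution`, `VisionThoughts`, `producer`, `observer`, `co_search`, the `SCORES` list) is a hypothetical stand-in chosen for this example.

```python
import random
from dataclasses import dataclass, field

# Toy stand-in for detector output on one image (assumption, not paper data):
# the six highest scores are real objects, the rest are clutter.
SCORES = [0.92, 0.88, 0.81, 0.74, 0.66, 0.58, 0.41, 0.37, 0.29, 0.22, 0.15]
TRUE_COUNT = 6


@dataclass
class Solution:
    """A 'solution program' reduced to one tunable parameter."""
    threshold: float

    def run(self, scores):
        # Predicted count = detections at or above the threshold.
        return sum(s >= self.threshold for s in scores)


@dataclass
class VisionThoughts:
    """Shared store of observer notes (hypothetical minimal form)."""
    notes: list = field(default_factory=list)

    def add(self, note):
        self.notes.append(note)

    def latest(self):
        return self.notes[-1] if self.notes else None


def evaluate(sol):
    """Scalar score for a candidate: absolute counting error (lower is better)."""
    return abs(sol.run(SCORES) - TRUE_COUNT)


def observer(sol):
    """Diagnose the failure mode of a solution as a short qualitative note."""
    diff = sol.run(SCORES) - TRUE_COUNT
    if diff > 0:
        return "overcount"
    if diff < 0:
        return "undercount"
    return "ok"


def producer(best, kb, k, rng):
    """Propose k candidates near the current best, steered by the latest note."""
    hint = kb.latest()
    candidates = []
    for _ in range(k):
        step = rng.uniform(0.05, 0.15)
        if hint == "overcount":
            t = best.threshold + step      # too many boxes: raise the threshold
        elif hint == "undercount":
            t = best.threshold - step      # too few boxes: lower the threshold
        else:
            t = best.threshold + rng.uniform(-0.05, 0.05)
        candidates.append(Solution(min(0.99, max(0.01, t))))
    return candidates


def co_search(iterations=8, k=3, seed=0):
    rng = random.Random(seed)
    kb = VisionThoughts()
    best = Solution(threshold=0.10)
    kb.add(observer(best))
    for _ in range(iterations):
        candidates = producer(best, kb, k, rng)
        best = min(candidates + [best], key=evaluate)  # keep the top-ranked solution
        kb.add(observer(best))                         # observe it, store the note
    return best, kb


best, kb = co_search()
print(best.threshold, evaluate(best), kb.notes)
```

Passing a different `observer` function is the only change needed to try other diagnostic rules in this toy, which is why the observer is kept separate from the scoring function.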

What carries the argument

Joint search of solution programs and observer programs whose diagnoses are accumulated in a shared VisionThoughts knowledge base.

If this is right

  • Solution programs can be adapted on the fly for dense object counting by adjusting slicing, NMS, and thresholds.
  • Observer feedback enables better mask refinement and generalization on out-of-domain plant-disease segmentation.
  • The co-search strategy reduces brittleness compared with fixed tool pipelines under occlusion and small targets.
  • Accumulated observations in VisionThoughts allow iterative improvement across multiple search rounds.
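
The knobs named in the first bullet (slicing, NMS, thresholds) can be made concrete with a small pure-Python sketch. It is only an illustration of the generic techniques, assuming axis-aligned `(x0, y0, x1, y1)` boxes; the function names and default values are hypothetical and are not taken from the paper or from any particular library.

```python
def make_slices(width, height, tile, overlap):
    """Return (x0, y0, x1, y1) tiles that cover the image with the given overlap."""
    stride = tile - overlap
    if stride <= 0:
        raise ValueError("overlap must be smaller than tile")

    def starts(size):
        pts = list(range(0, max(size - tile, 0) + 1, stride))
        if pts[-1] + tile < size:          # add a final tile flush with the border
            pts.append(size - tile)
        return pts

    return [(x, y, min(x + tile, width), min(y + tile, height))
            for y in starts(height) for x in starts(width)]


def to_global(box, tile):
    """Shift a tile-local box into full-image coordinates."""
    x0, y0 = tile[0], tile[1]
    return (box[0] + x0, box[1] + y0, box[2] + x0, box[3] + y0)


def iou(a, b):
    """Intersection over union of two boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def nms(dets, iou_thr=0.5, score_thr=0.3):
    """Greedy NMS over (box, score) pairs after a confidence filter."""
    kept = []
    confident = [d for d in dets if d[1] >= score_thr]
    for box, score in sorted(confident, key=lambda d: d[1], reverse=True):
        # Keep a box only if it does not overlap a higher-scoring kept box.
        if all(iou(box, k[0]) < iou_thr for k in kept):
            kept.append((box, score))
    return kept


# The same object seen in two overlapping tiles yields near-duplicate boxes.
dets = [((10, 10, 50, 50), 0.9), ((12, 12, 52, 52), 0.8),
        ((70, 70, 90, 90), 0.6), ((0, 0, 5, 5), 0.1)]
print(len(make_slices(100, 100, tile=60, overlap=20)), len(nms(dets)))
```

In this sketch the three parameters interact: more overlap between tiles produces more duplicate boxes, a looser `iou_thr` keeps more of them, and a lower `score_thr` admits more clutter, which is the kind of coupling a searched program would have to calibrate.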

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same co-search pattern could be tested on orchestration of tools outside vision, such as combining language models with retrieval or code execution.
  • If observer quality can itself be improved by search, the method might create self-improving orchestration loops.
  • The knowledge base might support transfer of successful observer patterns from one task to another without retraining from scratch.

Load-bearing premise

Observer programs can reliably diagnose failure modes in candidate solutions and generate feedback that improves the joint search without adding noise or bias.

What would settle it

An experiment in which VTOS is run with non-informative or random observers on LVIS-Count and PlantSeg-OOD and still matches the performance of the full method would show that the observers are not driving the gains.
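
Deciding whether the random-observer control "still matches" the full method needs a paired comparison on the same images. One common way to do this is a paired bootstrap over per-image error differences, sketched below. The function name, its defaults, and the demo numbers are assumptions made for illustration; the demo values are synthetic placeholders, not results from the paper.

```python
import random


def paired_bootstrap_ci(err_full, err_control, n_boot=2000, alpha=0.05, seed=0):
    """Mean of (control - full) per-image error with a percentile bootstrap CI.

    A confidence interval that sits clearly above zero suggests the control is
    worse than the full method; an interval that straddles zero is consistent
    with the control matching it.
    """
    if len(err_full) != len(err_control):
        raise ValueError("both error lists must cover the same images")
    rng = random.Random(seed)
    diffs = [c - f for f, c in zip(err_full, err_control)]
    n = len(diffs)
    means = []
    for _ in range(n_boot):
        # Resample images with replacement, keeping each pair together.
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lo = means[int((alpha / 2) * n_boot)]
    hi = means[int((1 - alpha / 2) * n_boot) - 1]
    return sum(diffs) / n, lo, hi


# Synthetic placeholder errors for six images (NOT results from the paper).
full_method = [2, 5, 1, 7, 3, 4]
random_observer = [6, 5, 4, 12, 3, 9]
mean_diff, lo, hi = paired_bootstrap_ci(full_method, random_observer)
print(mean_diff, lo, hi)
```

Pairing matters here because per-image difficulty varies widely across density tiers; resampling images rather than individual errors keeps that shared difficulty from inflating the interval.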

Figures

Figures reproduced from arXiv: 2606.20728 by Jinchao Ge, Lei Wang, Lingqiao Liu, Shuwen Zhao.

Figure 1: Static tool use fails on hard scenes; VTOS searches an auditable adaptive program. For two representative tasks, dense object counting (Q1) and out-of-distribution leaf-disease segmentation (Q2), directly applying a VLM yields ungrounded prose answers, while a fixed-threshold detector breaks down under clutter, occlusion, and unseen species (aggregate zero-shot test MAE 53 / Bias +27 for counting; small-lesi…
Figure 2: VTOS Framework. The framework runs for N iterations. In each iteration, a Producer proposes K candidate solution programs from a constrained tool library; programs are executed and scored by a sandboxed subprocess evaluator. An Analyzer then proposes observer programs targeting the top-ranked solutions Top(S_t) (amber coupling line), generating qualitative failure observations via a budgeted toolbox of pur…
Figure 3: Two-task evaluation setup. VTOS is evaluated on dense object counting (top, LVIS-Count: 180 images sampled from LVIS, 30 categories, two density tiers: Normal 10 ≤ N ≤ 50 and Extreme 51 ≤ N ≤ 100) and on grounded segmentation (bottom, PlantSeg-OOD: 100-image OOD test sampled from PlantSeg, three-way species-disjoint splits, two lesion-area tiers: Small < 10% and Moderate 10–30%). Both tasks expose distin…
Figure 4: VTOS detection examples on LVIS-Count (zero-shot test set). Top row: Normal tier (10 ≤ N ≤ 50); bottom row: Extreme tier (51 ≤ N ≤ 100). Cyan boxes mark ground-truth instances; each panel reports the VTOS program’s predicted count versus ground truth. These representative cases show the searched counting program recovering the correct count across object categories and density regimes.
Figure 5: mIoU trajectory over search iterations on the LVIS-Count training split. Per-iteration best-of-K (K=3) mIoU stratified by density tier (Normal: 10 ≤ N ≤ 50; Extreme: 51 ≤ N ≤ 100). The trajectory is non-monotonic: both tiers show occasional regressions (e.g., Extreme dips at iter 3 from 0.30 to 0.22) as the Producer explores new code directions in response to Analyzer feedback. Best Normal mIoU (0.364) is reache…
Original abstract

Vision foundation tools such as open-vocabulary detectors, segmentation models, and post-processing operators are powerful building blocks for computer vision, but their effectiveness depends heavily on how they are orchestrated: which tools are used, in what order, with what parameters, and under what visual conditions. Existing visual-programming agents typically generate a fixed solution pipeline, making them brittle under dense objects, occlusion, small targets, and domain shift. We introduce VTOS (Vision Tools Orchestration Search), a framework for adaptive visual tool orchestration through joint solution-observer search. VTOS co-searches executable solution programs that compose vision tools such as Grounding DINO, SAM, NMS, and slice-and-detect, together with observer programs that diagnose candidate solutions, identify failure modes, and generate actionable feedback. These observations are accumulated in a shared VisionThoughts knowledge base to guide subsequent search. We evaluate VTOS through two case studies: dense object counting on LVIS-Count and zero-shot plant-disease segmentation on PlantSeg-OOD, which stress different orchestration challenges including threshold calibration, NMS, slicing, mask refinement, and domain generalization. Across both tasks, VTOS outperforms static tool pipelines and agentic visual-programming baselines, showing that co-searching solutions and observers is an effective strategy for adapting vision tools to challenging computer vision tasks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper introduces VTOS, a framework for adaptive orchestration of vision foundation tools (e.g., Grounding DINO, SAM) via joint search over executable solution programs and diagnostic observer programs. Observers identify failure modes and accumulate feedback in a VisionThoughts knowledge base to guide further search. The approach is evaluated on two case studies—dense object counting on LVIS-Count and zero-shot plant-disease segmentation on PlantSeg-OOD—claiming outperformance over static tool pipelines and agentic visual-programming baselines.

Significance. If the experimental claims are substantiated with controlled quantitative results, the co-search of solutions and observers could represent a meaningful advance in making vision tool pipelines robust to occlusion, domain shift, and dense scenes. The VisionThoughts accumulation mechanism offers a concrete way to close the loop between diagnosis and search, which is a promising direction for agentic CV systems beyond fixed pipelines.

major comments (3)
  1. [Abstract] Abstract: the central claim that 'VTOS outperforms static tool pipelines and agentic visual-programming baselines' is stated without any quantitative results, error bars, ablation tables, or implementation details (e.g., number of LLM calls, search budget, or success metrics), so the claim cannot be evaluated from the provided text.
  2. [Evaluation] Evaluation sections (case studies): no controlled ablation is described that holds total search iterations or LLM evaluations fixed while toggling the presence of observer generation and VisionThoughts feedback. Without this isolation, gains cannot be attributed specifically to co-searching observers rather than simply exploring more candidate solutions.
  3. [Method] Method description: the assumption that observer programs reliably produce actionable, low-noise feedback is load-bearing for the joint-search claim, yet no analysis of observer accuracy, failure-mode coverage, or cases where feedback introduces bias is provided.
minor comments (1)
  1. [Abstract] The abstract and introduction would benefit from explicit definitions of the two tasks' metrics (e.g., counting MAE, segmentation IoU) and the exact baselines used.
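
For reference, the metrics named in this comment have standard definitions that can be written down in a few lines. The sketch below assumes that "Bias" means the mean signed counting error and that IoU is computed on binary masks; the page does not give the paper's exact definitions, so treat these as the conventional forms rather than the authors' code. All function names are hypothetical.

```python
def counting_mae(pred, gt):
    """Mean absolute error between predicted and ground-truth counts."""
    return sum(abs(p - g) for p, g in zip(pred, gt)) / len(gt)


def counting_bias(pred, gt):
    """Mean signed error; positive values indicate systematic overcounting."""
    return sum(p - g for p, g in zip(pred, gt)) / len(gt)


def mask_iou(pred_mask, gt_mask):
    """IoU of two binary masks given as equally sized nested lists of 0/1."""
    inter = union = 0
    for pred_row, gt_row in zip(pred_mask, gt_mask):
        for p, g in zip(pred_row, gt_row):
            inter += 1 if (p and g) else 0
            union += 1 if (p or g) else 0
    # Convention chosen here (an assumption): two empty masks agree perfectly.
    return inter / union if union else 1.0


def mean_iou(mask_pairs):
    """Average IoU over (predicted, ground-truth) mask pairs."""
    return sum(mask_iou(p, g) for p, g in mask_pairs) / len(mask_pairs)


pred_counts, gt_counts = [12, 48, 60], [10, 50, 55]
print(counting_mae(pred_counts, gt_counts), counting_bias(pred_counts, gt_counts))
```

Reporting MAE and bias together is informative because a method can have a small bias while still making large errors in both directions, or a large bias that a simple threshold change would remove.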

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback. We agree that the abstract, evaluation protocol, and observer analysis require strengthening for clarity and rigor. We address each major comment below and will incorporate revisions.

  1. Referee: [Abstract] Abstract: the central claim that 'VTOS outperforms static tool pipelines and agentic visual-programming baselines' is stated without any quantitative results, error bars, ablation tables, or implementation details (e.g., number of LLM calls, search budget, or success metrics), so the claim cannot be evaluated from the provided text.

    Authors: We agree the abstract should include concrete quantitative support. In the revision we will add key performance metrics (with error bars), search budget, and LLM call counts drawn from the experimental results to make the central claim evaluable directly from the abstract. revision: yes

  2. Referee: [Evaluation] Evaluation sections (case studies): no controlled ablation is described that holds total search iterations or LLM evaluations fixed while toggling the presence of observer generation and VisionThoughts feedback. Without this isolation, gains cannot be attributed specifically to co-searching observers rather than simply exploring more candidate solutions.

    Authors: This is a valid concern for attribution. We will add a controlled ablation that fixes the total search iterations and LLM evaluations while toggling observer generation and VisionThoughts feedback, allowing direct isolation of the co-search contribution. revision: yes

  3. Referee: [Method] Method description: the assumption that observer programs reliably produce actionable, low-noise feedback is load-bearing for the joint-search claim, yet no analysis of observer accuracy, failure-mode coverage, or cases where feedback introduces bias is provided.

    Authors: We acknowledge the need for explicit validation of the observer component. The revision will include quantitative and qualitative analysis of observer accuracy (e.g., agreement with human failure-mode labels on sampled cases), coverage of failure modes, and discussion of potential bias introduced by noisy feedback. revision: yes
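
Agreement between observer diagnoses and human failure-mode labels, as promised in the third response, is often summarized with Cohen's kappa, which corrects raw agreement for chance. The sketch below is one possible way to compute it; the label names are hypothetical examples and the choice of kappa is an assumption, since the rebuttal does not name a statistic.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labelling the same cases."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of cases where both raters agree.
    p_obs = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each rater's marginal label frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    p_exp = sum(count_a[l] * count_b[l] for l in set(count_a) | set(count_b)) / (n * n)
    if p_exp == 1.0:
        # Degenerate case: both raters used one identical label throughout.
        # Returning 1.0 is a convention chosen for this sketch.
        return 1.0
    return (p_obs - p_exp) / (1 - p_exp)


# Hypothetical failure-mode labels for four sampled cases.
observer_labels = ["threshold", "threshold", "small_object", "small_object"]
human_labels = ["threshold", "threshold", "small_object", "threshold"]
print(cohens_kappa(observer_labels, human_labels))
```

Raw agreement alone can look high when one failure mode dominates the sample, which is why a chance-corrected statistic is the more cautious choice for this kind of validation.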

Circularity Check

0 steps flagged

No circularity in empirical framework

The paper describes an empirical search framework evaluated on two computer vision tasks with comparisons to baselines. No equations, derivations, fitted parameters presented as predictions, or self-referential definitions appear in the abstract or described content. Claims rest on experimental outcomes rather than any reduction to inputs by construction. No load-bearing self-citations or uniqueness theorems are invoked. This is the expected non-finding for a non-mathematical systems paper.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only abstract available; no explicit free parameters, axioms, or invented entities described.

pith-pipeline@v0.9.1-grok · 5772 in / 1065 out tokens · 18707 ms · 2026-06-26T21:34:40.477570+00:00 · methodology
