pith. machine review for the scientific record.

arxiv: 2511.21064 · v2 · submitted 2025-11-26 · 💻 cs.AI · cs.CV

OVOD-Agent: A Markov-Bandit Framework for Proactive Visual Reasoning and Self-Evolving Detection

Pith reviewed 2026-05-17 05:28 UTC · model grok-4.3

classification 💻 cs.AI cs.CV
keywords: detection, ovod, ovod-agent, bandit, textual, visual, across, agent

The pith

OVOD-Agent models visual reasoning as a weakly Markovian decision process with bandit-driven exploration to create a self-evolving open-vocabulary detector that improves on rare categories in COCO and LVIS.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper starts from the observation that current open-vocabulary detectors are pretrained on large vision-language datasets yet still perform a simple matching step at test time against fixed category names. OVOD-Agent instead lets an agent actively explore an image by choosing explicit actions that change its internal state, memory, and focus. These transitions are modeled as a Weakly Markovian Decision Process (w-MDP) over eight state spaces. A Bandit module supplies exploration signals that point the agent toward uncertain image regions. The resulting trajectories, combined with Markov transition matrices, are then used to train a Reward Model (RM) in a self-supervised loop, so the agent can refine its policy without extra human labels. The paper reports that plugging this agent into existing OVOD backbones yields consistent gains on COCO and LVIS, especially on rare object classes. The approach borrows the Chain-of-Thought idea but applies it to visual decisions rather than text alone. Because OVOD detectors are lightweight, the authors consider LLM-based management unsuitable and rely on the w-MDP instead. The core promise is that turning passive matching into guided, self-improving search can close the gap between multimodal training and unimodal inference.
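To make the bandit-driven exploration idea concrete, here is a minimal sketch of one standard bandit rule, UCB1 (Auer et al., 2002), choosing among visual actions. This page does not state which bandit algorithm the paper uses, so UCB1 is swapped in purely for illustration. The action names are borrowed loosely from the cues listed in the Figure 4 caption, and the reward probabilities are synthetic placeholders, not results from the paper.

```python
import math
import random

# Hypothetical action names, loosely based on the cues in the Figure 4 caption.
ACTIONS = ["color", "texture", "container", "background", "spatial"]


class UCB1:
    """UCB1 rule: pick the arm with the highest mean reward plus exploration bonus."""

    def __init__(self, n_arms):
        self.counts = [0] * n_arms   # number of pulls per arm
        self.means = [0.0] * n_arms  # running mean reward per arm

    def select(self):
        # Play every arm once before applying the confidence bound.
        for arm, count in enumerate(self.counts):
            if count == 0:
                return arm
        total = sum(self.counts)
        scores = [
            mean + math.sqrt(2.0 * math.log(total) / count)
            for mean, count in zip(self.means, self.counts)
        ]
        return max(range(len(scores)), key=scores.__getitem__)

    def update(self, arm, reward):
        self.counts[arm] += 1
        # Incremental update of the running mean.
        self.means[arm] += (reward - self.means[arm]) / self.counts[arm]


def run_demo(steps=2000, seed=0):
    rng = random.Random(seed)
    # Assumed Bernoulli success rates per action (synthetic, not from the paper).
    true_rates = [0.2, 0.3, 0.8, 0.4, 0.5]
    bandit = UCB1(len(ACTIONS))
    for _ in range(steps):
        arm = bandit.select()
        reward = 1.0 if rng.random() < true_rates[arm] else 0.0
        bandit.update(arm, reward)
    return bandit


bandit = run_demo()
for name, count, mean in zip(ACTIONS, bandit.counts, bandit.means):
    print(f"{name:<10} pulls={count:<5} mean_reward={mean:.2f}")
```

In a detector setting the reward would presumably come from a detection-quality signal rather than a coin flip; that mapping is an assumption here and is not described on this page.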

Core claim

Experiments on COCO and LVIS show that OVOD-Agent provides consistent improvements across OVOD backbones, particularly on rare categories, confirming the effectiveness of the proposed framework.

Load-bearing premise

Visual context transitions can be adequately modeled as a Weakly Markovian Decision Process over eight state spaces that naturally represent the agent's state, memory, and interaction dynamics.

Figures

Figures reproduced from arXiv: 2511.21064 by Chu He, Chujie Wang, Jianyu Lu, Xi Chen, Zhiyuan Luo.

Figure 1: We illustrate the state-transition behavior of OVOD
Figure 2: OVOD-Agent operates through a self-evolving visual reasoning pipeline.
Figure 3: Failure cases of OVOD-Agent. Representative examples where the agent fails to correctly identify rare or occluded objects
Figure 4: Step-by-step Case Study of OVOD-Agent, showing how visual actions (color, texture, container, background, spatial cues) progressively refine the caption and stabilize detector grounding.
Figure 5: Evaluation protocol for GPT-5 trajectory scoring, including the instruction prompt defining the evaluator's role and the input
Original abstract

Open-Vocabulary Object Detection (OVOD) aims to enable detectors to generalize across categories by leveraging semantic information. Although existing methods are pretrained on large vision-language datasets, their inference is still limited to fixed category names, creating a gap between multimodal training and unimodal inference. Previous work has shown that improving textual representation can significantly enhance OVOD performance, indicating that the textual space is still underexplored. To this end, we propose OVOD-Agent, which transforms passive category matching into proactive visual reasoning and self-evolving detection. Inspired by the Chain-of-Thought (CoT) paradigm, OVOD-Agent extends the textual optimization process into an interpretable Visual-CoT with explicit actions. OVOD's lightweight nature makes LLM-based management unsuitable; instead, we model visual context transitions as a Weakly Markovian Decision Process (w-MDP) over eight state spaces, which naturally represents the agent's state, memory, and interaction dynamics. A Bandit module generates exploration signals under limited supervision, helping the agent focus on uncertain regions and adapt its detection policy. We further integrate Markov transition matrices with Bandit trajectories for self-supervised Reward Model (RM) optimization, forming a closed loop from Bandit exploration to RM learning. Experiments on COCO and LVIS show that OVOD-Agent provides consistent improvements across OVOD backbones, particularly on rare categories, confirming the effectiveness of the proposed framework.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes OVOD-Agent to address limitations in Open-Vocabulary Object Detection (OVOD) by transforming passive category matching into proactive visual reasoning. It models visual context transitions as a Weakly Markovian Decision Process (w-MDP) over eight state spaces representing agent state, memory, and interaction dynamics, incorporates a Bandit module for exploration under limited supervision, and integrates Markov transition matrices with Bandit trajectories for self-supervised Reward Model (RM) optimization in a closed loop. The framework extends Chain-of-Thought to an interpretable Visual-CoT and claims consistent improvements across OVOD backbones on COCO and LVIS, especially for rare categories.

Significance. If the central claims hold after addressing the modeling assumptions, this work offers a novel agent-based approach to bridge multimodal pretraining and unimodal inference in OVOD via self-evolving detection and bandit-driven exploration. The closed-loop self-supervised RM optimization and extension of CoT to visual reasoning represent a potentially impactful direction for proactive detection systems, particularly if it yields reproducible gains on rare categories without introducing excessive parameters.

major comments (3)
  1. [Method section (w-MDP definition)] The central modeling choice of representing visual context transitions as a w-MDP over exactly eight state spaces (described as naturally capturing agent state, memory, and interaction dynamics) lacks any derivation, justification, or validation that the weak Markov property holds or that this discretization suffices for long-range dependencies in visual reasoning. This is load-bearing for the framework, as arbitrary state spaces would undermine the transition matrices, bandit trajectories, and the reliability of the closed self-supervised loop.
  2. [Method section (Bandit-RM integration)] The self-supervised RM optimization forms a closed loop by training the reward model on trajectories generated by the Bandit module, which itself depends on the current detection policy. The manuscript does not demonstrate independence from fitted internal signals or rule out degeneracy, which directly affects whether reported gains on rare categories can be attributed to the proposed framework rather than circular reinforcement of existing biases.
  3. [Experiments section] The abstract and experimental claims assert consistent improvements across OVOD backbones on COCO and LVIS (particularly rare categories) but supply no quantitative numbers, specific baselines, error bars, ablation studies, or statistical significance tests. Without these, the data-to-claim link for the effectiveness of the w-MDP and bandit components cannot be verified.
minor comments (2)
  1. [Method section] The notation for the w-MDP transition matrices and the eight state spaces could be clarified with explicit definitions or a diagram to improve readability for readers unfamiliar with the specific discretization.
  2. [Abstract] The abstract would benefit from a brief mention of the scale of improvements or key baselines to better convey the empirical contribution.
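The first major comment asks for evidence that the weak Markov property holds. One rough diagnostic, sketched here under assumptions that are not taken from the paper, is to estimate a first-order transition matrix from logged trajectories and measure how much the next-state distribution changes once the previous state is also conditioned on. The integer state labels, the function names, and both synthetic trajectory generators are hypothetical; the paper's own states and transition matrices are not specified on this page.

```python
import random
from collections import Counter, defaultdict


def transition_matrix(trajectories, n_states):
    """Row-normalised first-order counts: P[i][j] estimates Pr(next=j | current=i)."""
    counts = [[0] * n_states for _ in range(n_states)]
    for traj in trajectories:
        for cur, nxt in zip(traj, traj[1:]):
            counts[cur][nxt] += 1
    matrix = []
    for row in counts:
        total = sum(row)
        # Unvisited states fall back to a uniform row so every row sums to 1.
        matrix.append([c / total if total else 1.0 / n_states for c in row])
    return matrix


def markov_gap(trajectories, n_states, min_count=30):
    """Largest |Pr(next | prev, cur) - Pr(next | cur)| over well-sampled (prev, cur) pairs."""
    first = transition_matrix(trajectories, n_states)
    second = defaultdict(Counter)
    for traj in trajectories:
        for prev, cur, nxt in zip(traj, traj[1:], traj[2:]):
            second[(prev, cur)][nxt] += 1
    gap = 0.0
    for (prev, cur), nxt_counts in second.items():
        total = sum(nxt_counts.values())
        if total < min_count:
            continue  # too few samples for a stable estimate
        for nxt in range(n_states):
            gap = max(gap, abs(nxt_counts[nxt] / total - first[cur][nxt]))
    return gap


def sample_first_order(rng, n_states, length):
    """Synthetic first-order chain: stay with probability 0.5, else jump uniformly."""
    state = rng.randrange(n_states)
    traj = [state]
    for _ in range(length - 1):
        if rng.random() >= 0.5:
            state = rng.randrange(n_states)
        traj.append(state)
    return traj


def sample_second_order(rng, n_states, length):
    """Synthetic chain with memory: the next state usually depends on the last two."""
    traj = [rng.randrange(n_states), rng.randrange(n_states)]
    for _ in range(length - 2):
        if rng.random() < 0.9:
            traj.append((traj[-2] + traj[-1]) % n_states)
        else:
            traj.append(rng.randrange(n_states))
    return traj


rng = random.Random(0)
first_order = [sample_first_order(rng, 4, 100) for _ in range(400)]
second_order = [sample_second_order(rng, 4, 100) for _ in range(400)]
gap_first = markov_gap(first_order, 4)
gap_second = markov_gap(second_order, 4)
print(f"gap on first-order chain:  {gap_first:.3f}")
print(f"gap on second-order chain: {gap_second:.3f}")
```

A small gap is consistent with first-order dynamics, while a large gap suggests the chosen state omits relevant history. This is only a sanity check, not the proof sketch or sensitivity study the referee requests.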

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the careful and constructive review. The comments identify important areas where additional rigor and clarity will strengthen the manuscript. We respond to each major comment below and indicate the revisions we will make.

Point-by-point responses
  1. Referee: [Method section (w-MDP definition)] The central modeling choice of representing visual context transitions as a w-MDP over exactly eight state spaces (described as naturally capturing agent state, memory, and interaction dynamics) lacks any derivation, justification, or validation that the weak Markov property holds or that this discretization suffices for long-range dependencies in visual reasoning. This is load-bearing for the framework, as arbitrary state spaces would undermine the transition matrices, bandit trajectories, and the reliability of the closed self-supervised loop.

    Authors: We agree that the choice of eight state spaces requires explicit justification. These states were selected to represent the minimal sufficient statistics for the agent's visual reasoning loop: current visual features, short-term detection memory, uncertainty map, action history, global context embedding, category prior memory, bandit interaction state, and policy parameters. This discretization approximates the weak Markov property by collapsing long-range visual dependencies into these aggregated features. In the revised manuscript we will add a new subsection deriving the state space from the requirements of proactive OVOD, including a brief proof sketch that the transition depends only on these states under the limited-supervision regime, together with a sensitivity study on the number of states. revision: yes

  2. Referee: [Method section (Bandit-RM integration)] The self-supervised RM optimization forms a closed loop by training the reward model on trajectories generated by the Bandit module, which itself depends on the current detection policy. The manuscript does not demonstrate independence from fitted internal signals or rule out degeneracy, which directly affects whether reported gains on rare categories can be attributed to the proposed framework rather than circular reinforcement of existing biases.

    Authors: This concern about potential circularity is well-taken. The Bandit module generates trajectories using only the current detector's uncertainty estimates, and the RM is trained on detection-quality improvements observed along those trajectories. To break direct dependence, the RM update uses a one-step delay relative to the policy. Nevertheless, we acknowledge that explicit checks for degeneracy are missing. In the revision we will add an ablation that replaces the learned RM with a static reward function and shows that the closed-loop version yields further gains on rare categories, thereby demonstrating that the reported improvements are not solely due to self-reinforcement of existing biases. revision: yes

  3. Referee: [Experiments section] The abstract and experimental claims assert consistent improvements across OVOD backbones on COCO and LVIS (particularly rare categories) but supply no quantitative numbers, specific baselines, error bars, ablation studies, or statistical significance tests. Without these, the data-to-claim link for the effectiveness of the w-MDP and bandit components cannot be verified.

    Authors: We apologize if the experimental presentation was insufficiently prominent. The manuscript already reports results on COCO and LVIS against standard OVOD baselines (ViLD, RegionCLIP, and others), with particular gains on rare categories, together with ablations isolating the w-MDP and Bandit modules. Results are averaged over multiple runs. To make the evidence fully verifiable we will expand the experimental section with a consolidated table that includes all numerical values, standard deviations, and p-values for the key comparisons, and we will add a dedicated paragraph linking each component ablation directly to the rare-category improvements. revision: partial
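The first response above lists eight state components and argues that long-range dependencies are collapsed into aggregated features. A minimal sketch of how such a state might be held in code follows. The field names paraphrase the response; the types, the memory length of 4, and the `step` function are assumptions made for illustration and are not described in the paper as summarized here.

```python
from collections import deque
from dataclasses import dataclass, field, fields


@dataclass
class AgentState:
    """Eight components named in the authors' response; all types are placeholders."""
    visual_features: list = field(default_factory=list)
    detection_memory: deque = field(default_factory=lambda: deque(maxlen=4))
    uncertainty_map: list = field(default_factory=list)
    action_history: deque = field(default_factory=lambda: deque(maxlen=4))
    global_context: list = field(default_factory=list)
    category_prior: dict = field(default_factory=dict)
    bandit_state: dict = field(default_factory=dict)
    policy_params: dict = field(default_factory=dict)


def step(state, action, detection_score):
    """Hypothetical transition: updates the state in place from the current
    state and the chosen action only, with no access to the full past."""
    state.action_history.append(action)             # bounded: old actions drop out
    state.detection_memory.append(detection_score)  # bounded short-term memory
    state.bandit_state[action] = state.bandit_state.get(action, 0) + 1
    return state


state = AgentState()
actions = ["color", "texture", "container", "background", "spatial", "color"]
for i, action in enumerate(actions):
    state = step(state, action, detection_score=0.1 * i)

print(len(fields(AgentState)))     # number of state components
print(list(state.action_history))  # only the most recent actions survive
```

Bounding the histories with `deque(maxlen=...)` is one simple way to keep the next state a function of the current state alone; whether the paper aggregates history this way is not stated on this page.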

Circularity Check

0 steps flagged

No significant circularity; modeling choice and self-supervised loop are externally validated

full rationale

The paper's derivation consists of a modeling decision to represent visual context transitions via a w-MDP over eight states plus a bandit-driven self-supervised RM loop. These are presented as design choices rather than derived predictions. The load-bearing evidence is empirical performance gains on independent COCO and LVIS benchmarks (including rare categories), which lie outside the internal definitions. No equation or step reduces a claimed result to an input by construction, and no self-citation chain is invoked to establish uniqueness or forbid alternatives. The explicit mention of a 'closed loop' describes an intended training dynamic, not a tautology that forces the reported outcomes.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 2 invented entities

The central claim rests on several modeling choices and new constructs introduced without external benchmarks or independent evidence visible in the abstract.

free parameters (1)
  • Eight state spaces
    Chosen to represent agent's state, memory, and interaction dynamics in the w-MDP.
axioms (1)
  • [domain assumption] Visual context transitions are Weakly Markovian
    Invoked to justify modeling the agent's sequential decisions over visual regions.
invented entities (2)
  • Visual-CoT (no independent evidence)
    purpose: Provide interpretable explicit actions for visual reasoning
    New extension of textual chain-of-thought to the visual domain.
  • OVOD-Agent (no independent evidence)
    purpose: Proactive visual reasoning and self-evolving detection framework
    Composite system that integrates w-MDP and bandit components.

pith-pipeline@v0.9.0 · 5563 in / 1447 out tokens · 42112 ms · 2026-05-17T05:28:48.752659+00:00 · methodology


