arxiv: 2511.21064 · v2 · submitted 2025-11-26 · 💻 cs.AI · cs.CV

OVOD-Agent: A Markov-Bandit Framework for Proactive Visual Reasoning and Self-Evolving Detection

Chujie Wang , Jianyu Lu , Zhiyuan Luo , Xi Chen , Chu He This is my paper

Pith reviewed 2026-05-17 05:28 UTC · model grok-4.3

classification 💻 cs.AI cs.CV

keywords detectionovodovod-agentbandittextualvisualacrossagent

0 comments

The pith

OVOD-Agent models visual reasoning as a weakly Markovian decision process with bandit-driven exploration to create a self-evolving open-vocabulary detector that improves on rare categories in COCO and LVIS.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper starts from the observation that current open-vocabulary detectors are trained on rich vision-language data yet still perform a simple matching step at test time using fixed text labels. OVOD-Agent instead lets an agent actively explore an image by choosing actions that change its internal state, memory, and focus. These transitions are treated as a weakly Markovian process across eight defined states. A bandit module supplies exploration signals that point the agent toward uncertain image regions. The same trajectories are then used to train a reward model in a self-supervised loop, so the agent can refine its policy without extra human labels. Experiments claim that plugging this agent into existing OVOD backbones yields steady gains, especially on rare object classes. The approach borrows the chain-of-thought idea but applies it to visual decisions rather than text. Because the method stays lightweight, it avoids heavy LLM calls during inference. The core promise is that turning passive matching into guided, self-improving search can close the gap between training data richness and inference flexibility.

Core claim

Experiments on COCO and LVIS show that OVOD-Agent provides consistent improvements across OVOD backbones, particularly on rare categories, confirming the effectiveness of the proposed framework.

Load-bearing premise

Visual context transitions can be adequately modeled as a Weakly Markovian Decision Process over eight state spaces that naturally represent the agent's state, memory, and interaction dynamics.

Figures

Figures reproduced from arXiv: 2511.21064 by Chu He, Chujie Wang, Jianyu Lu, Xi Chen, Zhiyuan Luo.

**Figure 2.** Figure 2: OVOD-Agent operates through a self-evolving visual reasoning pipeline. [PITH_FULL_IMAGE:figures/full_fig_p003_2.png] view at source ↗

**Figure 3.** Figure 3: Failure cases of OVOD-Agent. Representative examples where the agent fails to correctly identify rare or occluded objects [PITH_FULL_IMAGE:figures/full_fig_p008_3.png] view at source ↗

**Figure 4.** Figure 4: Step-by-step Case Study of OVOD-Agent, showing how visual actions (color, texture, container, background, spatial cues) progressively refine the caption and stabilize detector grounding. 3 [PITH_FULL_IMAGE:figures/full_fig_p013_4.png] view at source ↗

**Figure 5.** Figure 5: Evaluation protocol for GPT-5 trajectory scoring, including the instruction prompt defining the evaluator’s role and the input [PITH_FULL_IMAGE:figures/full_fig_p014_5.png] view at source ↗

read the original abstract

Open-Vocabulary Object Detection (OVOD) aims to enable detectors to generalize across categories by leveraging semantic information. Although existing methods are pretrained on large vision-language datasets, their inference is still limited to fixed category names, creating a gap between multimodal training and unimodal inference. Previous work has shown that improving textual representation can significantly enhance OVOD performance, indicating that the textual space is still underexplored. To this end, we propose OVOD-Agent, which transforms passive category matching into proactive visual reasoning and self-evolving detection. Inspired by the Chain-of-Thought (CoT) paradigm, OVOD-Agent extends the textual optimization process into an interpretable Visual-CoT with explicit actions. OVOD's lightweight nature makes LLM-based management unsuitable; instead, we model visual context transitions as a Weakly Markovian Decision Process (w-MDP) over eight state spaces, which naturally represents the agent's state, memory, and interaction dynamics. A Bandit module generates exploration signals under limited supervision, helping the agent focus on uncertain regions and adapt its detection policy. We further integrate Markov transition matrices with Bandit trajectories for self-supervised Reward Model (RM) optimization, forming a closed loop from Bandit exploration to RM learning. Experiments on COCO and LVIS show that OVOD-Agent provides consistent improvements across OVOD backbones, particularly on rare categories, confirming the effectiveness of the proposed framework.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes OVOD-Agent to address limitations in Open-Vocabulary Object Detection (OVOD) by transforming passive category matching into proactive visual reasoning. It models visual context transitions as a Weakly Markovian Decision Process (w-MDP) over eight state spaces representing agent state, memory, and interaction dynamics, incorporates a Bandit module for exploration under limited supervision, and integrates Markov transition matrices with Bandit trajectories for self-supervised Reward Model (RM) optimization in a closed loop. The framework extends Chain-of-Thought to an interpretable Visual-CoT and claims consistent improvements across OVOD backbones on COCO and LVIS, especially for rare categories.

Significance. If the central claims hold after addressing the modeling assumptions, this work offers a novel agent-based approach to bridge multimodal pretraining and unimodal inference in OVOD via self-evolving detection and bandit-driven exploration. The closed-loop self-supervised RM optimization and extension of CoT to visual reasoning represent a potentially impactful direction for proactive detection systems, particularly if it yields reproducible gains on rare categories without introducing excessive parameters.

major comments (3)

[Method section (w-MDP definition)] The central modeling choice of representing visual context transitions as a w-MDP over exactly eight state spaces (described as naturally capturing agent state, memory, and interaction dynamics) lacks any derivation, justification, or validation that the weak Markov property holds or that this discretization suffices for long-range dependencies in visual reasoning. This is load-bearing for the framework, as arbitrary state spaces would undermine the transition matrices, bandit trajectories, and the reliability of the closed self-supervised loop.
[Method section (Bandit-RM integration)] The self-supervised RM optimization forms a closed loop by training the reward model on trajectories generated by the Bandit module, which itself depends on the current detection policy. The manuscript does not demonstrate independence from fitted internal signals or rule out degeneracy, which directly affects whether reported gains on rare categories can be attributed to the proposed framework rather than circular reinforcement of existing biases.
[Experiments section] The abstract and experimental claims assert consistent improvements across OVOD backbones on COCO and LVIS (particularly rare categories) but supply no quantitative numbers, specific baselines, error bars, ablation studies, or statistical significance tests. Without these, the data-to-claim link for the effectiveness of the w-MDP and bandit components cannot be verified.

minor comments (2)

[Method section] The notation for the w-MDP transition matrices and the eight state spaces could be clarified with explicit definitions or a diagram to improve readability for readers unfamiliar with the specific discretization.
[Abstract] The abstract would benefit from a brief mention of the scale of improvements or key baselines to better convey the empirical contribution.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the careful and constructive review. The comments identify important areas where additional rigor and clarity will strengthen the manuscript. We respond to each major comment below and indicate the revisions we will make.

read point-by-point responses

Referee: [Method section (w-MDP definition)] The central modeling choice of representing visual context transitions as a w-MDP over exactly eight state spaces (described as naturally capturing agent state, memory, and interaction dynamics) lacks any derivation, justification, or validation that the weak Markov property holds or that this discretization suffices for long-range dependencies in visual reasoning. This is load-bearing for the framework, as arbitrary state spaces would undermine the transition matrices, bandit trajectories, and the reliability of the closed self-supervised loop.

Authors: We agree that the choice of eight state spaces requires explicit justification. These states were selected to represent the minimal sufficient statistics for the agent's visual reasoning loop: current visual features, short-term detection memory, uncertainty map, action history, global context embedding, category prior memory, bandit interaction state, and policy parameters. This discretization approximates the weak Markov property by collapsing long-range visual dependencies into these aggregated features. In the revised manuscript we will add a new subsection deriving the state space from the requirements of proactive OVOD, including a brief proof sketch that the transition depends only on these states under the limited-supervision regime, together with a sensitivity study on the number of states. revision: yes
Referee: [Method section (Bandit-RM integration)] The self-supervised RM optimization forms a closed loop by training the reward model on trajectories generated by the Bandit module, which itself depends on the current detection policy. The manuscript does not demonstrate independence from fitted internal signals or rule out degeneracy, which directly affects whether reported gains on rare categories can be attributed to the proposed framework rather than circular reinforcement of existing biases.

Authors: This concern about potential circularity is well-taken. The Bandit module generates trajectories using only the current detector's uncertainty estimates, and the RM is trained on detection-quality improvements observed along those trajectories. To break direct dependence, the RM update uses a one-step delay relative to the policy. Nevertheless, we acknowledge that explicit checks for degeneracy are missing. In the revision we will add an ablation that replaces the learned RM with a static reward function and shows that the closed-loop version yields further gains on rare categories, thereby demonstrating that the reported improvements are not solely due to self-reinforcement of existing biases. revision: yes
Referee: [Experiments section] The abstract and experimental claims assert consistent improvements across OVOD backbones on COCO and LVIS (particularly rare categories) but supply no quantitative numbers, specific baselines, error bars, ablation studies, or statistical significance tests. Without these, the data-to-claim link for the effectiveness of the w-MDP and bandit components cannot be verified.

Authors: We apologize if the experimental presentation was insufficiently prominent. The manuscript already reports results on COCO and LVIS against standard OVOD baselines (ViLD, RegionCLIP, and others), with particular gains on rare categories, together with ablations isolating the w-MDP and Bandit modules. Results are averaged over multiple runs. To make the evidence fully verifiable we will expand the experimental section with a consolidated table that includes all numerical values, standard deviations, and p-values for the key comparisons, and we will add a dedicated paragraph linking each component ablation directly to the rare-category improvements. revision: partial

Circularity Check

0 steps flagged

No significant circularity; modeling choice and self-supervised loop are externally validated

full rationale

The paper's derivation consists of a modeling decision to represent visual context transitions via a w-MDP over eight states plus a bandit-driven self-supervised RM loop. These are presented as design choices rather than derived predictions. The load-bearing evidence is empirical performance gains on independent COCO and LVIS benchmarks (including rare categories), which lie outside the internal definitions. No equation or step reduces a claimed result to an input by construction, and no self-citation chain is invoked to establish uniqueness or forbid alternatives. The explicit mention of a 'closed loop' describes an intended training dynamic, not a tautology that forces the reported outcomes.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 2 invented entities

The central claim rests on several modeling choices and new constructs introduced without external benchmarks or independent evidence visible in the abstract.

free parameters (1)

Eight state spaces
Chosen to represent agent's state, memory, and interaction dynamics in the w-MDP.

axioms (1)

domain assumption Visual context transitions are Weakly Markovian
Invoked to justify modeling the agent's sequential decisions over visual regions.

invented entities (2)

Visual-CoT no independent evidence
purpose: Provide interpretable explicit actions for visual reasoning
New extension of textual chain-of-thought to the visual domain.
OVOD-Agent no independent evidence
purpose: Proactive visual reasoning and self-evolving detection framework
Composite system that integrates w-MDP and bandit components.

pith-pipeline@v0.9.0 · 5563 in / 1447 out tokens · 42112 ms · 2026-05-17T05:28:48.752659+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation (reality_from_one_distinction and Tick orbit) reality_from_one_distinction (8-tick period emergence) echoes

?

echoes
ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

we model visual context transitions as a Weakly Markovian Decision Process (w-MDP) over eight state spaces, which naturally represents the agent's state, memory, and interaction dynamics
IndisputableMonolith/Foundation (time-as-orbit certificate) Tick ≃ LogicNat (8-period clock) echoes

?

echoes
ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

Stopping Criteria... Step limit: t ≥ H_max = 7

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

46 extracted references · 46 canonical work pages · 1 internal anchor

[1]

Finite- time analysis of the multiarmed bandit problem.Machine learning, 47(2):235–256, 2002

Peter Auer, Nicolo Cesa-Bianchi, and Paul Fischer. Finite- time analysis of the multiarmed bandit problem.Machine learning, 47(2):235–256, 2002. 3

work page 2002
[2]

The devil is in the fine-grained details: Evaluating open-vocabulary object detectors for fine-grained understanding

Lorenzo Bianchi, Fabio Carrara, Nicola Messina, Claudio Gennaro, and Fabrizio Falchi. The devil is in the fine-grained details: Evaluating open-vocabulary object detectors for fine-grained understanding. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 22520–22529, 2024. 2

work page 2024
[3]

Decision transformer: Rein- forcement learning via sequence modeling, 2021

Lili Chen, Kevin Lu, Aravind Rajeswaran, Kimin Lee, Aditya Grover, Michael Laskin, Pieter Abbeel, Aravind Srinivas, and Igor Mordatch. Decision transformer: Rein- forcement learning via sequence modeling, 2021. 3

work page 2021
[4]

Yolo-world: Real-time open-vocabulary object detection

Tianheng Cheng, Lin Song, Yixiao Ge, Wenyu Liu, Xing- gang Wang, and Ying Shan. Yolo-world: Real-time open-vocabulary object detection. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 16901–16911, 2024. 1, 7

work page 2024
[5]

Cot-pl: Visual chain-of-thought reasoning meets pseudo-labeling for open-vocabulary object detection, 2025

Hojun Choi, Youngsun Lim, Jaeyo Shin, and Hyunjung Shim. Cot-pl: Visual chain-of-thought reasoning meets pseudo-labeling for open-vocabulary object detection, 2025. 2

work page 2025
[6]

Deep reinforcement learn- ing from human preferences.Advances in neural information processing systems, 30, 2017

Paul F Christiano, Jan Leike, Tom Brown, Miljan Martic, Shane Legg, and Dario Amodei. Deep reinforcement learn- ing from human preferences.Advances in neural information processing systems, 30, 2017. 3

work page 2017
[7]

CRC Press, 2017

Emanuele Crisostomi, Robert Shorten, Sonja St ¨udli, and Fabian Wirth.Electric and plug-in hybrid vehicle networks: optimization and control. CRC Press, 2017. 2

work page 2017
[8]

Learning to prompt for open-vocabulary ob- ject detection with vision-language model, 2022

Yu Du, Fangyun Wei, Zihe Zhang, Miaojing Shi, Yue Gao, and Guoqi Li. Learning to prompt for open-vocabulary ob- ject detection with vision-language model, 2022. 1

work page 2022
[9]

Agent ai: Surveying the hori- zons of multimodal interaction, 2024

Zane Durante, Qiuyuan Huang, Naoki Wake, Ran Gong, Jae Sung Park, Bidipta Sarkar, Rohan Taori, Yusuke Noda, Demetri Terzopoulos, Yejin Choi, Katsushi Ikeuchi, Hoi V o, Li Fei-Fei, and Jianfeng Gao. Agent ai: Surveying the hori- zons of multimodal interaction, 2024. 2

work page 2024
[10]

Retrieval-augmented generation for large language models: A survey, 2024

Yunfan Gao, Yun Xiong, Xinyu Gao, Kangxiang Jia, Jinliu Pan, Yuxi Bi, Yi Dai, Jiawei Sun, Meng Wang, and Haofen Wang. Retrieval-augmented generation for large language models: A survey, 2024. 2

work page 2024
[11]

Improving zero-shot gen- eralization and robustness of multi-modal models

Yunhao Ge, Jie Ren, Andrew Gallagher, Yuxiao Wang, Ming-Hsuan Yang, Hartwig Adam, Laurent Itti, Balaji Lak- shminarayanan, and Jiaping Zhao. Improving zero-shot gen- eralization and robustness of multi-modal models. InPro- ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 11093–11101, 2023. 1

work page 2023
[12]

Open-vocabulary Object Detection via Vision and Language Knowledge Distillation

Xiuye Gu, Tsung-Yi Lin, Weicheng Kuo, and Yin Cui. Open-vocabulary object detection via vision and language knowledge distillation.arXiv preprint arXiv:2104.13921,

work page internal anchor Pith review arXiv
[13]

Lvis: A dataset for large vocabulary instance segmentation

Agrim Gupta, Piotr Dollar, and Ross Girshick. Lvis: A dataset for large vocabulary instance segmentation. InPro- ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019. 6

work page 2019
[14]

Mastering atari with discrete world models, 2022

Danijar Hafner, Timothy Lillicrap, Mohammad Norouzi, and Jimmy Ba. Mastering atari with discrete world models, 2022. 3

work page 2022
[15]

Denoising diffu- sion probabilistic models, 2020

Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffu- sion probabilistic models, 2020. 2

work page 2020
[16]

Llms meet vlms: Boost open vocabulary object detection with fine-grained descriptors, 2024

Sheng Jin, Xueying Jiang, Jiaxing Huang, Lewei Lu, and Shijian Lu. Llms meet vlms: Boost open vocabulary object detection with fine-grained descriptors, 2024. 2

work page 2024
[17]

Jooyeon Kim, Eulrang Cho, Sehyung Kim, and Hyunwoo J. Kim. Retrieval-augmented open-vocabulary object detec- tion. InProceedings of the IEEE/CVF Conference on Com- puter Vision and Pattern Recognition (CVPR), pages 17427– 17436, 2024. 2

work page 2024
[18]

An iterative feedback mechanism for improving natural language class descriptions in open-vocabulary ob- ject detection

Louis Y Kim, Michelle Karker, Victoria Valledor, Seiy- oung C Lee, Karl F Brzoska, Margaret Duff, and Anthony Palladino. An iterative feedback mechanism for improving natural language class descriptions in open-vocabulary ob- ject detection. InAutomatic Target Recognition XXXV, pages 57–69. SPIE, 2025. 2

work page 2025
[19]

The epoch-greedy algo- rithm for multi-armed bandits with side information.Ad- vances in neural information processing systems, 20, 2007

John Langford and Tong Zhang. The epoch-greedy algo- rithm for multi-armed bandits with side information.Ad- vances in neural information processing systems, 20, 2007. 3

work page 2007
[20]

Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models, 2023

Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models, 2023. 2

work page 2023
[21]

Grounded language-image pre-training

Liunian Harold Li, Pengchuan Zhang, Haotian Zhang, Jian- wei Yang, Chunyuan Li, Yiwu Zhong, Lijuan Wang, Lu Yuan, Lei Zhang, Jenq-Neng Hwang, Kai-Wei Chang, and Jianfeng Gao. Grounded language-image pre-training. In Proceedings of the IEEE/CVF Conference on Computer Vi- sion and Pattern Recognition (CVPR), pages 10965–10975,

work page
[22]

V oCoT: Unleashing vi- sually grounded multi-step reasoning in large multi-modal models

Zejun Li, Ruipu Luo, Jiwen Zhang, Minghui Qiu, Xuan- jing Huang, and Zhongyu Wei. V oCoT: Unleashing vi- sually grounded multi-step reasoning in large multi-modal models. InProceedings of the 2025 Conference of the Na- tions of the Americas Chapter of the Association for Com- putational Linguistics: Human Language Technologies (Vol- ume 1: Long Papers), p...

work page 2025
[23]

Microsoft coco: Common objects in context

Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Doll´ar, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In European conference on computer vision, pages 740–755. Springer, 2014. 6

work page 2014
[24]

Visual instruction tuning, 2023

Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning, 2023. 2

work page 2023
[25]

Grounding dino: Marrying dino with grounded pre-training for open-set object detection

Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Hao Zhang, Jie Yang, Qing Jiang, Chunyuan Li, Jianwei Yang, Hang Su, et al. Grounding dino: Marrying dino with grounded pre-training for open-set object detection. InEuro- pean conference on computer vision, pages 38–55. Springer,

work page
[26]

Jones, Anoop Cherian, and Yasin Yilmaz

Furkan Mumcu, Michael J. Jones, Anoop Cherian, and Yasin Yilmaz. Llm-guided agentic object detection for open-world understanding, 2025. 2 9

work page 2025
[27]

Openai gpt-5 model release.https://openai

OpenAI. Openai gpt-5 model release.https://openai. com/index/introducing-gpt-5/, 2025. Accessed: 2025-08-07. 7

work page 2025
[28]

Training language models to follow instructions with human feedback.Ad- vances in neural information processing systems, 35:27730– 27744, 2022

Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Car- roll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback.Ad- vances in neural information processing systems, 35:27730– 27744, 2022. 3

work page 2022
[29]

Learn- ing to name classes for vision and language models

Sarah Parisot, Yongxin Yang, and Steven McDonagh. Learn- ing to name classes for vision and language models. InPro- ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 23477–23486, 2023. 1

work page 2023
[30]

Kosmos-2: Grounding multimodal large language models to the world, 2023

Zhiliang Peng, Wenhui Wang, Li Dong, Yaru Hao, Shaohan Huang, Shuming Ma, and Furu Wei. Kosmos-2: Grounding multimodal large language models to the world, 2023. 2

work page 2023
[31]

What does a platypus look like? generating customized prompts for zero-shot image classification

Sarah Pratt, Ian Covert, Rosanne Liu, and Ali Farhadi. What does a platypus look like? generating customized prompts for zero-shot image classification. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 15691–15701, 2023. 1

work page 2023
[32]

Learning transferable visual models from natural language supervision, 2021

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision, 2021. 1

work page 2021
[33]

Denseclip: Language-guided dense prediction with context- aware prompting, 2022

Yongming Rao, Wenliang Zhao, Guangyi Chen, Yansong Tang, Zheng Zhu, Guan Huang, Jie Zhou, and Jiwen Lu. Denseclip: Language-guided dense prediction with context- aware prompting, 2022. 2

work page 2022
[34]

Toolformer: Lan- guage models can teach themselves to use tools

Timo Schick, Jane Dwivedi-Yu, Roberto Dessi, Roberta Raileanu, Maria Lomeli, Eric Hambro, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. Toolformer: Lan- guage models can teach themselves to use tools. InAdvances in Neural Information Processing Systems, pages 68539– 68551. Curran Associates, Inc., 2023. 2

work page 2023
[35]

Visual cot: Advancing multi-modal language models with a comprehen- sive dataset and benchmark for chain-of-thought reasoning,

Hao Shao, Shengju Qian, Han Xiao, Guanglu Song, Zhuofan Zong, Letian Wang, Yu Liu, and Hongsheng Li. Visual cot: Advancing multi-modal language models with a comprehen- sive dataset and benchmark for chain-of-thought reasoning,

work page
[36]

Chain-of-thought prompting elicits reasoning in large language models

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, brian ichter, Fei Xia, Ed Chi, Quoc V Le, and Denny Zhou. Chain-of-thought prompting elicits reasoning in large language models. InAdvances in Neural Information Pro- cessing Systems, pages 24824–24837. Curran Associates, Inc., 2022. 2

work page 2022
[37]

Markov chain of thought for efficient mathematical reasoning, 2025

Wen Yang, Minpeng Liao, and Kai Fan. Markov chain of thought for efficient mathematical reasoning, 2025. 2

work page 2025
[38]

Detclip: Dictionary-enriched visual-concept paralleled pre- training for open-world detection, 2022

Lewei Yao, Jianhua Han, Youpeng Wen, Xiaodan Liang, Dan Xu, Wei Zhang, Zhenguo Li, Chunjing Xu, and Hang Xu. Detclip: Dictionary-enriched visual-concept paralleled pre- training for open-world detection, 2022. 1

work page 2022
[39]

Detclipv2: Scal- able open-vocabulary object detection pre-training via word- region alignment, 2023

Lewei Yao, Jianhua Han, Xiaodan Liang, Dan Xu, Wei Zhang, Zhenguo Li, and Hang Xu. Detclipv2: Scal- able open-vocabulary object detection pre-training via word- region alignment, 2023

work page 2023
[40]

Detclipv3: To- wards versatile generative open-vocabulary object detection

Lewei Yao, Renjie Pi, Jianhua Han, Xiaodan Liang, Hang Xu, Wei Zhang, Zhenguo Li, and Dan Xu. Detclipv3: To- wards versatile generative open-vocabulary object detection. InProceedings of the IEEE/CVF conference on computer vi- sion and pattern recognition, pages 27391–27401, 2024. 1, 7

work page 2024
[41]

Narasimhan, and Yuan Cao

Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik R. Narasimhan, and Yuan Cao. ReAct: Synergizing Reasoning and Acting in Language Models. InProceedings of the Eleventh International Conference on Learning Rep- resentations (ICLR 2023), Kigali, Rwanda, 2023. ICLR. 2

work page 2023
[42]

Regionclip: Region-based language-image pretraining

Yiwu Zhong, Jianwei Yang, Pengchuan Zhang, Chunyuan Li, Noel Codella, Liunian Harold Li, Luowei Zhou, Xiyang Dai, Lu Yuan, Yin Li, and Jianfeng Gao. Regionclip: Region-based language-image pretraining. InProceedings of the IEEE/CVF Conference on Computer Vision and Pat- tern Recognition (CVPR), pages 16793–16803, 2022. 2

work page 2022
[43]

Detecting twenty-thousand classes using image-level supervision

Xingyi Zhou, Rohit Girdhar, Armand Joulin, Philipp Kr¨ahenb¨uhl, and Ishan Misra. Detecting twenty-thousand classes using image-level supervision. InEuropean confer- ence on computer vision, pages 350–368. Springer, 2022. 1, 6

work page 2022
[44]

Detecting twenty-thousand classes using image-level supervision

Xingyi Zhou, Rohit Girdhar, Armand Joulin, Philipp Kr¨ahenb¨uhl, and Ishan Misra. Detecting twenty-thousand classes using image-level supervision. InEuropean confer- ence on computer vision, pages 350–368. Springer, 2022. 1

work page 2022
[45]

Minigpt-4: Enhancing vision-language understanding with advanced large language models, 2023

Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, and Mo- hamed Elhoseiny. Minigpt-4: Enhancing vision-language understanding with advanced large language models, 2023. 2 10 OVOD-Agent: A Markov-Bandit Framework for Proactive Visual Reasoning and Self-Evolving Detection Supplementary Material

work page 2023
[46]

a object

Appendix This appendix provides additional technical details that complement the main paper. We first present the full set of visual-action operators used byOVOD-Agent. Next, we provide an expanded case study that demonstrates how the agent incrementally refines its textual hypotheses using both low-level and high-level visual cues. We further include a c...

work page