Train the Agent, Not the Expert: Learning to Harness Heterogeneous Experts for Multi-Turn Visual Reasoning

Andy J. Ma; Dazhao Du; Jia Wan; Tao Han; Yaowu Fan

arxiv: 2605.29894 · v1 · pith:UKTY3FLOnew · submitted 2026-05-28 · 💻 cs.CV

Yaowu Fan, Tao Han, Dazhao Du, Andy J. Ma, Jia Wan

Pith reviewed 2026-06-29 08:32 UTC · model grok-4.3

classification 💻 cs.CV

keywords: visual agent, expert harnessing, multi-turn reasoning, visual reasoning segmentation, referring expression, object detection, reinforcement learning, dynamic memory

The pith

VisHarness trains a lightweight agent to select and sequence calls to fixed heterogeneous visual experts across multi-turn interactions rather than training any single expert for the full task.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces VisHarness as a trainable visual agent that separates high-level decision making from low-level execution by learning a policy to call and combine existing specialized models. This policy is trained with only lightweight reinforcement learning while the experts remain frozen. The approach is tested on benchmarks involving reasoning segmentation, referring segmentation, small-object detection, and counting, where it exceeds general models and matches or beats task-specific ones. A dynamic memory mechanism is added to keep token costs manageable during multi-turn expert interactions. The central claim is that a general harnessing policy can deliver both broad applicability and expert-level precision without per-task retraining of the underlying models.

Core claim

VisHarness learns a generalizable policy that, through multi-turn interactions, chooses which heterogeneous visual experts to invoke and in what order, solving complex visual tasks while preserving the experts' specialized precision and avoiding the need to fine-tune them for each new condition.

What carries the argument

VisHarness, the trainable agent whose policy decides when and which experts to call, supported by dynamic visual memory archiving to control token growth in live multi-turn loops.
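
As a rough illustration of the decision loop described above, the sketch below shows an agent that, at each turn, reads its memory, picks an expert (or answers), and appends the expert's output to memory. Everything in it is an assumption made for illustration: the `Harness` class, the `(name, argument)` action format, the toy experts, and the scripted policy standing in for the learned one are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

# Hypothetical types: an expert maps (image, argument) to a text observation.
Expert = Callable[[str, str], str]
# Hypothetical action format: (expert name or "answer", argument).
Action = Tuple[str, str]


@dataclass
class Harness:
    experts: Dict[str, Expert]               # frozen experts, never trained here
    policy: Callable[[List[str]], Action]    # the only trainable component
    max_turns: int = 4
    memory: List[str] = field(default_factory=list)

    def run(self, image: str, query: str) -> str:
        self.memory = [f"query: {query}"]
        for _ in range(self.max_turns):
            name, arg = self.policy(self.memory)
            if name == "answer":
                return arg
            if name not in self.experts:
                # Record the failure so the policy can react on the next turn.
                self.memory.append(f"error: unknown expert {name}")
                continue
            observation = self.experts[name](image, arg)
            self.memory.append(f"{name}({arg}) -> {observation}")
        return "no answer within turn budget"


# Toy stand-ins for real experts (assumed, for demonstration only).
def toy_detector(image: str, arg: str) -> str:
    return f"3 boxes for '{arg}'"


def toy_counter(image: str, arg: str) -> str:
    return "count=3"


def scripted_policy(memory: List[str]) -> Action:
    # Placeholder for the learned policy: detect, then count, then answer.
    if len(memory) == 1:
        return ("detect", "red cars")
    if len(memory) == 2:
        return ("count", "red cars")
    return ("answer", memory[-1].split("-> ")[1])


harness = Harness({"detect": toy_detector, "count": toy_counter}, scripted_policy)
result = harness.run("street.png", "How many red cars?")
print(result)
```

In this arrangement, adding or swapping an expert only changes the `experts` dictionary; whether a trained policy would actually transfer to the new set is the empirical question the paper has to answer.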

If this is right

The same agent policy can be applied to new visual tasks by adding or swapping experts without retraining the policy from scratch.
Multi-turn expert interaction becomes feasible at scale once memory archiving keeps token counts bounded.
General-purpose models can be improved by wrapping them with a learned harness rather than scaling the base model further.
Task-specific models retain their accuracy edge while gaining the flexibility of a shared decision layer.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

If the policy generalizes across expert sets, the same training loop could be reused for entirely different modalities such as audio or 3-D data.
The memory archiving trick may also apply to other agent systems that accumulate large context from tool calls.
Performance gains would shrink if the experts themselves become outdated faster than the policy can be retrained.

Load-bearing premise

A single lightweight-trained policy can reliably choose and order calls to a fixed set of experts for many different complex visual conditions without any further expert retraining.

What would settle it

The claim would be undermined if, on a new benchmark mixing the four task types and with the experts held completely fixed, the agent scored lower than both the general models and the best task-specific model in at least two categories.

Figures

Figures reproduced from arXiv: 2605.29894 by Andy J. Ma, Dazhao Du, Jia Wan, Tao Han, Yaowu Fan.

**Figure 1.** From expert training to expert harnessing. (a) Traditional computer vision methods train a separate specialist for each visual sub-task. (b) VisHarness learns one harnessing policy over a set of heterogeneous experts and thus can solve complex visual tasks through multi-turn interaction.

**Figure 2.** Overview of VisHarness. VisHarness solves complex fundamental vision tasks through multi-turn interactions. At each turn, it selects an action based on the current memory. When a visual expert is invoked, the environment parses the expert name and arguments, and a controller dispatches the request to the least-loaded worker among multiple expert instances for parallel execution. After receiving the visuali…

**Figure 3.** The Heterogeneous Visual Expert Suite consists of six visual experts, including three …

**Figure 4.** Distribution of visual expert calling by different model variants across different datasets.

**Figure 5.** Multi-turn interaction visualization on two representative image-text pairs.
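
The Figure 2 caption mentions parsing the expert name and arguments and dispatching each request to the least-loaded worker. A minimal sketch of that idea follows. The `name(argument)` call syntax, the `Worker` and `Controller` classes, and the use of an in-flight request counter as the load measure are all assumptions for illustration; the paper's actual call format and scheduler are not given in the text above.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Worker:
    worker_id: str
    in_flight: int = 0  # assumed load measure: requests currently running


class Controller:
    def __init__(self, pools: Dict[str, List[Worker]]):
        # One pool of identical worker instances per expert name.
        self.pools = pools

    def dispatch(self, expert_name: str) -> Worker:
        pool = self.pools[expert_name]
        # min() returns the first worker among ties, so ordering is stable.
        worker = min(pool, key=lambda w: w.in_flight)
        worker.in_flight += 1
        return worker

    def complete(self, worker: Worker) -> None:
        worker.in_flight = max(0, worker.in_flight - 1)


def parse_call(text: str) -> Tuple[str, str]:
    # Hypothetical call syntax "name(argument)".
    name, _, rest = text.partition("(")
    arg = rest[:-1] if rest.endswith(")") else rest
    return name.strip(), arg.strip()


controller = Controller({"segment": [Worker("seg-0"), Worker("seg-1")]})
name, arg = parse_call("segment(the red mug)")
a = controller.dispatch(name)   # both idle, first worker is chosen
b = controller.dispatch(name)   # seg-0 is busy, so seg-1 is chosen
controller.complete(a)          # seg-0 finishes
c = controller.dispatch(name)   # seg-0 is idle again
print(a.worker_id, b.worker_id, c.worker_id)
```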

Original abstract

Recent progress in computer vision has produced a wide range of powerful specialized models for detection, segmentation, counting, and other visual tasks. However, these models are usually optimized for isolated task formulations, making it difficult to directly support general-purpose visual intelligence, especially when a task requires complex language understanding and dense small-object perception. In this paper, we propose VisHarness, a trainable visual agent that decouples high-level perception, reasoning, and decision-making from low-level task execution. Instead of training a model to solve a specific visual task, VisHarness learns to harness a set of carefully designed heterogeneous visual experts. This paradigm preserves the general intelligence of the agent while fully leveraging the precision advantages of specialized visual models in concrete visual tasks. With only lightweight training, VisHarness learns a generalizable visual expert-harnessing policy and can solve common fundamental vision tasks under various complex conditions through multi-turn interactions with visual expert models. To enable efficient on-policy reinforcement learning training in a live environment, we introduce dynamic visual memory archiving, which mitigates the rapidly accumulating visual-token overhead caused by multi-turn interactions with visual expert models. Experiments on four representative benchmarks covering reasoning segmentation, generalized referring segmentation, dense small-object detection, and referring counting demonstrate that VisHarness substantially outperforms existing general-purpose models and achieves competitive or superior performance compared with task-specific models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

VisHarness trains a lightweight policy to call and sequence fixed visual experts over multiple turns, using dynamic memory archiving to keep RL training feasible.

The main thing here is the shift from training one model to solve a visual task to training an agent that decides which expert to invoke and when, across several rounds. The dynamic visual memory archiving is the practical piece that lets them run on-policy RL without the visual token count exploding.
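
To make the token-growth problem concrete, here is one possible way a memory could bound visual tokens across turns: keep only the most recent images live and replace older ones with a short text stub. This is a guess at the general shape of such a mechanism, not the paper's method. The `ArchivingMemory` class, the `max_live_images` budget, the stub format, and the figure of 256 visual tokens per image are all hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Entry:
    turn: int
    text: str
    visual_tokens: int  # set to 0 once the image has been archived


class ArchivingMemory:
    def __init__(self, max_live_images: int = 2):
        self.max_live_images = max_live_images
        self.entries: List[Entry] = []

    def add(self, text: str, visual_tokens: int) -> None:
        self.entries.append(Entry(len(self.entries), text, visual_tokens))
        live = [e for e in self.entries if e.visual_tokens > 0]
        # Archive the oldest live images until the budget is met.
        overflow = max(0, len(live) - self.max_live_images)
        for entry in live[:overflow]:
            entry.text = f"[archived image from turn {entry.turn}] {entry.text}"
            entry.visual_tokens = 0

    def visual_token_count(self) -> int:
        return sum(e.visual_tokens for e in self.entries)


memory = ArchivingMemory(max_live_images=2)
for turn in range(5):
    # 256 tokens per image is a placeholder, not a number from the paper.
    memory.add(f"mask overlay {turn}", visual_tokens=256)

print(memory.visual_token_count())
print(memory.entries[0].text)
```

With this policy the visual-token count stays at or below `max_live_images` times the per-image cost no matter how many turns are taken, at the price of the agent no longer seeing older images directly.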

That framing is clean and the architecture holds together: the experts stay frozen, the agent learns the harnessing policy, and the memory mechanism directly tackles the multi-turn overhead. The four benchmarks (reasoning segmentation, generalized referring segmentation, dense small-object detection, referring counting) are reasonable choices for testing whether the approach works on both language-heavy and perception-heavy cases.

The abstract claims VisHarness beats general models and matches or beats task-specific ones, but supplies no numbers, error bars, or baseline descriptions, so the performance edge is hard to assess from the given text. If the full paper has proper ablations on expert selection and shows the policy generalizes without per-task expert tuning, that would be the key evidence.

This is worth a referee for groups building modular or agentic vision systems. The idea is coherent and the memory trick addresses a concrete engineering constraint. I would not cite it yet without seeing the detailed results and comparisons.

Referee Report

1 major / 0 minor

Summary. The manuscript proposes VisHarness, a trainable visual agent that decouples high-level perception, reasoning, and decision-making from low-level task execution by learning a generalizable policy to harness a set of fixed heterogeneous visual experts via multi-turn interactions. It introduces dynamic visual memory archiving to support efficient on-policy reinforcement learning by mitigating visual-token overhead. The central empirical claim is that, with only lightweight training, VisHarness substantially outperforms existing general-purpose models and achieves competitive or superior performance to task-specific models on four benchmarks covering reasoning segmentation, generalized referring segmentation, dense small-object detection, and referring counting.

Significance. If the performance claims hold under rigorous validation, the paradigm of training a lightweight general policy for expert selection and sequencing (rather than fine-tuning the experts themselves) offers a promising route toward general-purpose visual intelligence that combines flexible reasoning with the precision of specialized models. The approach is internally consistent with the described architecture and addresses a genuine limitation of task-specific optimization.

major comments (1)

[Abstract and Experiments] The benchmark results are stated without details on the specific baselines compared, error bars or statistical significance, data splits, or exact training procedures (including reward formulation and on-policy RL hyperparameters), which is load-bearing for assessing whether VisHarness truly outperforms general-purpose models or matches task-specific ones.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback highlighting the need for greater experimental transparency. We agree that the current presentation of results in the abstract and experiments section lacks sufficient detail on baselines, statistical measures, data handling, and training specifics, which is essential for validating the performance claims. We will revise the manuscript to address this.

Referee: The benchmark results are stated without details on the specific baselines compared, error bars or statistical significance, data splits, or exact training procedures (including reward formulation and on-policy RL hyperparameters), which is load-bearing for assessing whether VisHarness truly outperforms general-purpose models or matches task-specific ones.

Authors: We fully agree with this assessment. The revised manuscript will expand the Experiments section (and update the abstract if space permits) to explicitly list all compared baselines with citations and categories (general-purpose vs. task-specific), report error bars from multiple random seeds along with statistical significance tests (e.g., paired t-tests or Wilcoxon tests with p-values), detail the exact train/validation/test splits used for each of the four benchmarks, and provide complete training details including the reward function formulation, on-policy RL algorithm hyperparameters (learning rate, discount factor, batch size, rollout length, number of epochs), and any other procedural specifics. These additions will enable rigorous independent verification of the reported gains.

revision: yes
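
For readers unfamiliar with the kind of reporting promised here, the sketch below computes a mean and standard deviation over seeds and a paired t-statistic between two systems evaluated on the same seeds, using only the Python standard library. The score lists are made-up placeholders chosen for easy arithmetic; they are not results from the paper, and the function names are hypothetical.

```python
import math
import statistics
from typing import Sequence, Tuple


def mean_and_std(scores: Sequence[float]) -> Tuple[float, float]:
    # Sample standard deviation (n - 1 denominator), as used for error bars.
    return statistics.mean(scores), statistics.stdev(scores)


def paired_t_statistic(a: Sequence[float], b: Sequence[float]) -> float:
    # Paired t-statistic: mean of per-seed differences over its standard error.
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equally sized samples with at least 2 pairs")
    diffs = [x - y for x, y in zip(a, b)]
    sd = statistics.stdev(diffs)
    if sd == 0:
        raise ValueError("differences have zero variance")
    return statistics.mean(diffs) / (sd / math.sqrt(len(diffs)))


# PLACEHOLDER scores over five seeds, invented for illustration only.
agent_scores = [0.62, 0.64, 0.63, 0.65, 0.61]
baseline_scores = [0.60, 0.61, 0.62, 0.62, 0.60]

agent_mean, agent_std = mean_and_std(agent_scores)
t_stat = paired_t_statistic(agent_scores, baseline_scores)
print(f"agent: {agent_mean:.3f} +/- {agent_std:.3f}, paired t = {t_stat:.2f}")
```

Turning the statistic into a p-value requires the Student's t distribution with n - 1 degrees of freedom; in practice a library routine such as `scipy.stats.ttest_rel` would typically be used instead of hand-rolled code.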

Circularity Check

0 steps flagged

No significant circularity

The paper proposes VisHarness as an agent policy trained via lightweight on-policy RL to select and sequence calls to fixed heterogeneous experts, with dynamic memory to handle multi-turn interactions. No derivation chain reduces a claimed result to its inputs by construction: the central claim is an empirical performance advantage on four external benchmarks (reasoning segmentation, generalized referring segmentation, dense small-object detection, referring counting), which are standard and independent of the training objective or fitted parameters. No self-definitional equations, fitted-input predictions, or load-bearing self-citations are present in the provided text; the architecture is described as decoupled and the evaluation uses external task-specific models for comparison without internal reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

The central claim rests on the domain assumption that fixed expert models can be effectively orchestrated by a learned policy and on the new technique of dynamic visual memory archiving; no explicit free parameters or invented physical entities are stated.

axioms (1)

domain assumption: Heterogeneous visual experts remain effective when called in sequence by an external policy without modification.
Invoked to justify decoupling agent training from expert optimization.

invented entities (2)

VisHarness (no independent evidence)
purpose: Trainable agent for expert harnessing
New system name and architecture introduced to implement the policy.

dynamic visual memory archiving (no independent evidence)
purpose: Reduce visual token overhead during multi-turn RL
New mechanism proposed to enable efficient on-policy training.

[1] [1]

Open-world text-specified object counting.arXiv preprint arXiv:2306.01851, 2023

Niki Amini-Naieni, Kiana Amini-Naieni, Tengda Han, and Andrew Zisserman. Open-world text-specified object counting.arXiv preprint arXiv:2306.01851, 2023

work page arXiv 2023

[2] [2]

Countgd: Multi-modal open-world counting.Advances in Neural Information Processing Systems, 37:48810–48837, 2024

Niki Amini-Naieni, Tengda Han, and Andrew Zisserman. Countgd: Multi-modal open-world counting.Advances in Neural Information Processing Systems, 37:48810–48837, 2024

2024

[3] [3]

Qwen3-VL Technical Report

Shuai Bai et al. Qwen3-vl technical report.arXiv preprint arXiv:2511.21631, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[4] [4]

Sam 3: Segment anything with concepts

Nicolas Carion, Laura Gustafson, Yuan-Ting Hu, Shoubhik Debnath, Ronghang Hu, Didac Suris, Chaitanya Ryali, Kalyan Vasudev Alwala, Haitham Khedr, Andrew Huang, et al. Sam 3: Segment anything with concepts. InThe Fourteenth International Conference on Learning Representations, 2026

2026

[5] [5]

Referring expression counting

Siyang Dai, Jun Liu, and Ngai-Man Cheung. Referring expression counting. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16985–16995. IEEE, 2024

2024

[6] [6]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

DeepSeek-AI. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.arXiv preprint arXiv:2501.12948, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[7] [7]

Vision-language transformer and query generation for referring segmentation

Henghui Ding, Chang Liu, Suchen Wang, and Xudong Jiang. Vision-language transformer and query generation for referring segmentation. InProceedings of the IEEE/CVF international conference on computer vision, pages 16321–16330, 2021

2021

[8] [8]

Chan, and Andy J

Yaowu Fan, Jia Wan, Tao Han, Antoni B. Chan, and Andy J. Ma. Video individual counting for moving drones. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 12284–12293, 2025

2025

[9] [9]

Detect anything via next point prediction

Qing Jiang, Junan Huo, Xingyu Chen, Yuda Xiong, Zhaoyang Zeng, Yihao Chen, Tianhe Ren, Junzhi Yu, and Lei Zhang. Detect anything via next point prediction. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2026

2026

[10] [10]

Locate then segment: A strong pipeline for referring image segmentation

Ya Jing, Tao Kong, Wei Wang, Liang Wang, Lei Li, and Tieniu Tan. Locate then segment: A strong pipeline for referring image segmentation. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 9858–9867, 2021

2021

[11] [11]

Kimi K2.5: Visual Agentic Intelligence

Kimi Team. Kimi K2.5: Visual agentic intelligence.arXiv preprint arXiv:2602.02276, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[12] [12]

Berg, Wan-Yen Lo, Piotr Dollar, and Ross Girshick

Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C. Berg, Wan-Yen Lo, Piotr Dollar, and Ross Girshick. Segment anything. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 4015–4026, October 2023

2023

[13] [13]

Lisa: Reasoning segmentation via large language model

Xin Lai, Zhuotao Tian, Yukang Chen, Yanwei Li, Yuhui Yuan, Shu Liu, and Jiaya Jia. Lisa: Reasoning segmentation via large language model. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 9579–9589, 2024

2024

[14] [14]

Text4seg++: Advancing image segmentation via generative language modeling.IEEE Transactions on Pattern Analysis and Machine Intelligence, pages 1–16, 2026

Mengcheng Lan, Chaofeng Chen, Jiaxing Xu, Zongrui Li, Yiping Ke, Xudong Jiang, Yingchen Yu, Yunqing Zhao, and Song Bai. Text4seg++: Advancing image segmentation via generative language modeling.IEEE Transactions on Pattern Analysis and Machine Intelligence, pages 1–16, 2026

2026

[15] [15]

Gres: Generalized referring expression segmenta- tion

Chang Liu, Henghui Ding, and Xudong Jiang. Gres: Generalized referring expression segmenta- tion. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 23592–23601. IEEE, 2023

2023

[16] [16]

Llava-plus: Learning to use tools for creating multimodal agents

Shilong Liu, Hao Cheng, Haotian Liu, Hao Zhang, Feng Li, Tianhe Ren, Xueyan Zou, Jianwei Yang, Hang Su, Jun Zhu, et al. Llava-plus: Learning to use tools for creating multimodal agents. InEuropean conference on computer vision, pages 126–142. Springer, 2024. 10

2024

[17] [17]

Grounding dino: Marrying dino with grounded pre-training for open-set object detection

Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Hao Zhang, Jie Yang, Qing Jiang, Chunyuan Li, Jianwei Yang, Hang Su, et al. Grounding dino: Marrying dino with grounded pre-training for open-set object detection. InEuropean conference on computer vision, pages 38–55. Springer, 2024

2024

[18] [18]

Seg-Zero: Reasoning-Chain Guided Segmentation via Cognitive Reinforcement

Yuqi Liu, Bohao Peng, Zhisheng Zhong, Zihao Yue, Fanbin Lu, Bei Yu, and Jiaya Jia. Seg-zero: Reasoning-chain guided segmentation via cognitive reinforcement. arXiv preprint arXiv:2503.06520, 2025

2025

[19] [19]

Cohd: A counting-aware hierarchical decoding framework for generalized referring expression segmentation

Zhuoyan Luo, Yinghao Wu, Tianheng Cheng, Yong Liu, Yicheng Xiao, Hongfa Wang, Xiao-Ping Zhang, and Yujiu Yang. Cohd: A counting-aware hierarchical decoding framework for generalized referring expression segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 22685–22694, 2025

2025

[20] [20]

ToolRL: Reward is All Tool Learning Needs

Cheng Qian, Emre Can Acikgoz, Qi He, Hongru Wang, Xiusi Chen, Dilek Hakkani-Tür, Gokhan Tur, and Heng Ji. Toolrl: Reward is all tool learning needs. arXiv preprint arXiv:2504.13958, 2025

2025

[21] [21]

Qwen3.5: Towards native multimodal agents, February 2026

Qwen Team. Qwen3.5: Towards native multimodal agents, February 2026

2026

[22] [22]

SAM 2: Segment Anything in Images and Videos

Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman Rädle, Chloe Rolland, Laura Gustafson, et al. Sam 2: Segment anything in images and videos. arXiv preprint arXiv:2408.00714, 2024

2024

[23] [23]

Yolo26: key architectural enhancements and performance benchmarking for real-time object detection

Ranjan Sapkota, Rahul Harsha Cheppally, Ajay Sharda, and Manoj Karkee. Yolo26: key architectural enhancements and performance benchmarking for real-time object detection. arXiv preprint arXiv:2509.25164, 2025

2025

[24] [24]

Training-free object counting with prompts

Zenglin Shi, Ying Sun, and Mengmi Zhang. Training-free object counting with prompts. In Proceedings of the IEEE/CVF winter conference on applications of computer vision, pages 323–331, 2024

2024

[25] [25]

OpenThinkIMG: Learning to Think with Images via Visual Tool Reinforcement Learning

Zhaochen Su, Linjie Li, Mingyang Song, Yunzhuo Hao, Zhengyuan Yang, Jun Zhang, Guanjie Chen, Jiawei Gu, Juntao Li, Xiaoye Qu, et al. Openthinkimg: Learning to think with images via visual tool reinforcement learning. arXiv preprint arXiv:2505.08617, 2025

2025

[26] [26]

Ufo: A unified approach to fine-grained visual perception via open-ended language interface

Hao Tang, Chenwei Xie, Haiyang Wang, Xiaoyi Bao, Tingyu Weng, Pandeng Li, Yun Zheng, and Liwei Wang. Ufo: A unified approach to fine-grained visual perception via open-ended language interface. arXiv preprint arXiv:2503.01342, 2025

2025

[27] [27]

A generalized loss function for crowd counting and localization

Jia Wan, Ziquan Liu, and Antoni B Chan. A generalized loss function for crowd counting and localization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1974–1983, 2021

2021

[28] [28]

Git: Towards generalist vision transformer through universal language interface

Haiyang Wang, Hao Tang, Li Jiang, Shaoshuai Shi, Muhammad Ferjad Naeem, Hongsheng Li, Bernt Schiele, and Liwei Wang. Git: Towards generalist vision transformer through universal language interface. In Computer Vision – ECCV 2024, pages 55–73. Springer Nature Switzerland, 2025

2024

[29] [29]

Acting less is reasoning more! teaching model to act efficiently

Hongru Wang, Cheng Qian, Wanjun Zhong, Xiusi Chen, Jiahao Qiu, Shijue Huang, Bowen Jin, Mengdi Wang, Kam-Fai Wong, and Heng Ji. Acting less is reasoning more! teaching model to act efficiently. arXiv preprint arXiv:2504.14870, 2025

2025

[30] [30]

X. Wang, S. Zhang, S. Li, K. Li, K. Kallidromitis, Y. Kato, K. Kozuka, and T. Darrell. Segllm: Multi-round reasoning segmentation with large language models. In The Thirteenth International Conference on Learning Representations, 2025

2025

[31] [31]

Refdetector: A simple yet effective matching-based method for referring expression comprehension

Yabing Wang, Zhuotao Tian, Zheng Qin, Sanping Zhou, and Le Wang. Refdetector: A simple yet effective matching-based method for referring expression comprehension. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 8033–8041, 2025

2025

[32] [32]

Cris: Clip-driven referring image segmentation

Zhaoqing Wang, Yu Lu, Qiang Li, Xunqiang Tao, Yandong Guo, Mingming Gong, and Tongliang Liu. Cris: Clip-driven referring image segmentation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 11686–11695, 2022

2022

[33] [33]

Instructseg: Unifying instructed visual segmentation with multi-modal large language models

Cong Wei, Yujie Zhong, Haoxian Tan, Yingsen Zeng, Yong Liu, Hongfa Wang, and Yujiu Yang. Instructseg: Unifying instructed visual segmentation with multi-modal large language models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 20193–20203, 2025

2025

[34] [34]

Dettoolchain: A new prompting paradigm to unleash detection ability of mllm

Yixuan Wu, Yizhou Wang, Shixiang Tang, Wenhao Wu, Tong He, Wanli Ouyang, Philip Torr, and Jian Wu. Dettoolchain: A new prompting paradigm to unleash detection ability of mllm. In European Conference on Computer Vision, pages 164–182. Springer, 2024

2024

[35] [35]

Gsva: Generalized segmentation via multimodal large language models

Zhuofan Xia, Dongchen Han, Yizeng Han, Xuran Pan, Shiji Song, and Gao Huang. Gsva: Generalized segmentation via multimodal large language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3858–3869, 2024

2024

[36] [36]

Zero-shot object counting

Jingyi Xu, Hieu Le, Vu Nguyen, Viresh Ranjan, and Dimitris Samaras. Zero-shot object counting. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 15548–15557, June 2023

2023

[37] [37]

Zero-shot object counting with language-vision models

Jingyi Xu, Hieu Le, and Dimitris Samaras. Zero-shot object counting with language-vision models. arXiv preprint arXiv:2309.13097, 2023

2023

[38] [38]

Empowering segmentation ability to multi-modal large language models

Yuqi Yang, Peng-Tao Jiang, Jing Wang, Hao Zhang, Kai Zhao, Jinwei Chen, and Bo Li. Empowering segmentation ability to multi-modal large language models. arXiv preprint arXiv:2403.14141, 2024

2024

[39] [39]

Language-aware vision transformer for referring segmentation

Zhao Yang, Jiaqi Wang, Xubing Ye, Yansong Tang, Kai Chen, Hengshuang Zhao, and Philip HS Torr. Language-aware vision transformer for referring segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 47(7):5238–5255, 2024

2024

[40] [40]

Deformable DETR: Deformable Transformers for End-to-End Object Detection

Xizhou Zhu, Weijie Su, Lewei Lu, Bin Li, Xiaogang Wang, and Jifeng Dai. Deformable detr: Deformable transformers for end-to-end object detection. arXiv preprint arXiv:2010.04159, 2020

2020