pith. machine review for the scientific record.

arxiv: 2503.06520 · v2 · submitted 2025-03-09 · 💻 cs.CV · cs.MM

Recognition: no theorem link

Seg-Zero: Reasoning-Chain Guided Segmentation via Cognitive Reinforcement

Authors on Pith: no claims yet

Pith reviewed 2026-05-16 12:27 UTC · model grok-4.3

classification 💻 cs.CV cs.MM
keywords reasoning segmentation · reinforcement learning · chain-of-thought reasoning · zero-shot generalization · image segmentation · GRPO

The pith

Reinforcement learning with format and accuracy rewards enables explicit reasoning chains to guide image segmentation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper aims to show that image segmentation models can learn to produce their own step-by-step reasoning about user intentions through reinforcement learning, without needing supervised examples of reasoning. Traditional methods depend on fine-tuning with fixed labels and short descriptions, which limits how well they handle new kinds of requests. Seg-Zero separates the task into a reasoning component that creates chains of thought and positional hints, and a segmentation component that turns those hints into masks. By rewarding correct output formats and accurate masks during reinforcement learning, the system develops reasoning abilities that help it generalize to unseen domains. A sympathetic reader would care because this points to a way of building more adaptable vision systems that can explain their choices.

Core claim

Seg-Zero introduces a decoupled architecture where a reasoning model interprets user intentions, generates explicit reasoning chains, and creates positional prompts for a separate segmentation model to produce pixel-level masks. The system is trained exclusively with reinforcement learning using the GRPO algorithm and a reward mechanism that combines format rewards for proper output structure with accuracy rewards for correct segmentation results, without any explicit reasoning supervision data.
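
To make the decoupled hand-off concrete, here is a minimal Python sketch of how a reasoning model's structured completion might be parsed into positional prompts for a promptable segmenter. The <think>/<answer> tag layout, the JSON payload, and the bbox/points fields are illustrative assumptions, not the paper's documented interface.

```python
import json
import re
from dataclasses import dataclass

@dataclass
class PositionalPrompt:
    """Positional hint handed from the reasoning model to the segmenter."""
    bbox: list[float]          # [x1, y1, x2, y2] in pixel coordinates
    points: list[list[float]]  # interior click points, [[x, y], ...]

def parse_reasoning_output(text: str) -> tuple[str, PositionalPrompt]:
    """Split a reasoning-model completion into its chain of thought and a
    positional prompt. Assumes a <think>...</think><answer>...</answer>
    layout with a JSON payload in the answer; the real format may differ."""
    think = re.search(r"<think>(.*?)</think>", text, re.S)
    answer = re.search(r"<answer>(.*?)</answer>", text, re.S)
    if think is None or answer is None:
        raise ValueError("completion does not follow the expected format")
    payload = json.loads(answer.group(1))
    return think.group(1).strip(), PositionalPrompt(payload["bbox"], payload["points"])

# A SAM-style promptable segmenter (stubbed elsewhere) would consume only the
# positional prompt; the reasoning text never reaches the mask decoder.
completion = (
    "<think>The user wants the mug nearest the laptop; it sits left of center."
    '</think><answer>{"bbox": [120, 200, 260, 340], "points": [[190, 270]]}</answer>'
)
chain, prompt = parse_reasoning_output(completion)
print(chain)
print(prompt.bbox, prompt.points)
```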

What carries the argument

The decoupled reasoning model and segmentation model pair, trained via GRPO reinforcement learning driven by format and accuracy rewards.

If this is right

  • The model produces visible chain-of-thought reasoning during inference.
  • Zero-shot performance on the ReasonSeg benchmark reaches 57.5, exceeding previous approaches by a large margin.
  • Generalization to out-of-domain segmentation tasks improves without additional supervised training.
  • Reasoning capabilities emerge at test time from the reinforcement training process.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Similar reward-based training could be applied to develop reasoning in other computer vision tasks such as object detection or image captioning.
  • The separation of reasoning and execution modules might allow easier debugging of where errors occur in complex queries.
  • Further scaling the reinforcement process could lead to more sophisticated reasoning chains for highly ambiguous segmentation requests.

Load-bearing premise

The combination of format and accuracy rewards through reinforcement learning will consistently generate useful and generalizable reasoning rather than superficial patterns tuned to the training examples.
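
As a rough illustration of the premise, the sketch below combines a binary format reward with an IoU accuracy reward and turns a group of sampled completions into GRPO-style group-normalized advantages. The reward weights, the tag check, and the group size are assumptions; the paper leaves the relative weighting unspecified (see the free-parameter ledger below).

```python
import numpy as np

def format_reward(completion: str) -> float:
    """1.0 if the completion carries the expected reasoning/answer tags,
    else 0.0 (a stand-in for the paper's format reward)."""
    return float("<think>" in completion and "<answer>" in completion)

def accuracy_reward(pred_mask: np.ndarray, gt_mask: np.ndarray) -> float:
    """Mask IoU as a proxy for the accuracy reward; the paper's exact
    accuracy terms (e.g. box or point accuracy) may differ."""
    union = np.logical_or(pred_mask, gt_mask).sum()
    if union == 0:
        return 0.0
    return float(np.logical_and(pred_mask, gt_mask).sum()) / float(union)

def total_reward(completion, pred_mask, gt_mask, w_fmt=0.5, w_acc=1.0):
    """Weighted sum of the two terms; w_fmt and w_acc stand in for the
    unspecified free parameters flagged in the ledger."""
    return w_fmt * format_reward(completion) + w_acc * accuracy_reward(pred_mask, gt_mask)

def grpo_advantages(rewards: np.ndarray) -> np.ndarray:
    """GRPO-style advantages: each sampled completion's reward normalized
    against its group's mean and std, with no learned value model."""
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)

# Score a toy group of 4 sampled completions for one training query.
rng = np.random.default_rng(0)
gt = rng.random((64, 64)) > 0.5
completions = ["<think>...</think><answer>...</answer>"] * 3 + ["no tags"]
rewards = np.array([total_reward(c, rng.random((64, 64)) > 0.5, gt) for c in completions])
print(grpo_advantages(rewards))
```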

What would settle it

Run the model on a new set of segmentation queries involving reasoning steps absent from the training distribution, and measure whether the explicit reasoning chains correlate with higher segmentation accuracy or if performance drops to baseline levels.
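
One hedged way to operationalize that test: evaluate on out-of-distribution queries, record whether each prediction came with a well-formed reasoning chain, and compare mask IoU between the two groups. The run_model and has_chain callables below are placeholder interfaces, not the paper's evaluation harness.

```python
import numpy as np

def mask_iou(pred: np.ndarray, gt: np.ndarray) -> float:
    union = np.logical_or(pred, gt).sum()
    return float(np.logical_and(pred, gt).sum()) / float(union) if union else 0.0

def probe_reasoning_effect(queries, run_model, has_chain):
    """Split out-of-distribution queries by whether the model emitted a
    well-formed reasoning chain and compare mean IoU across the two groups.
    run_model(query) -> (completion_text, pred_mask, gt_mask) and
    has_chain(text) -> bool are assumed interfaces, not the paper's API."""
    with_chain, without_chain = [], []
    for q in queries:
        text, pred, gt = run_model(q)
        (with_chain if has_chain(text) else without_chain).append(mask_iou(pred, gt))
    return {
        "iou_with_chain": float(np.mean(with_chain)) if with_chain else None,
        "iou_without_chain": float(np.mean(without_chain)) if without_chain else None,
        "n_with_chain": len(with_chain),
        "n_without_chain": len(without_chain),
    }

# Toy usage with stubbed model outputs.
rng = np.random.default_rng(1)
stub = lambda q: ("<think>...</think>" if q % 2 else "answer only",
                  rng.random((32, 32)) > 0.5, rng.random((32, 32)) > 0.5)
print(probe_reasoning_effect(range(10), stub, lambda t: "<think>" in t))
```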

read the original abstract

Traditional methods for reasoning segmentation rely on supervised fine-tuning with categorical labels and simple descriptions, limiting its out-of-domain generalization and lacking explicit reasoning processes. To address these limitations, we propose Seg-Zero, a novel framework that demonstrates remarkable generalizability and derives explicit chain-of-thought reasoning through cognitive reinforcement. Seg-Zero introduces a decoupled architecture consisting of a reasoning model and a segmentation model. The reasoning model interprets user intentions, generates explicit reasoning chains, and produces positional prompts, which are subsequently used by the segmentation model to generate precious pixel-level masks. We design a sophisticated reward mechanism that integrates both format and accuracy rewards to effectively guide optimization directions. Trained exclusively via reinforcement learning with GRPO and without explicit reasoning data, Seg-Zero achieves robust zero-shot generalization and exhibits emergent test-time reasoning capabilities. Experiments show that Seg-Zero-7B achieves a zero-shot performance of 57.5 on the ReasonSeg benchmark, surpassing the prior LISA-7B by 18%. This significant improvement highlights Seg-Zero's ability to generalize across domains while presenting an explicit reasoning process. Code is available at https://github.com/dvlab-research/Seg-Zero.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it: the pith above is the substance; this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper proposes Seg-Zero, a decoupled architecture with a reasoning model that generates explicit chain-of-thought and positional prompts, followed by a segmentation model that produces pixel-level masks. The system is trained exclusively via GRPO reinforcement learning using only format and accuracy rewards, with no explicit reasoning supervision or labeled reasoning data. It claims robust zero-shot generalization and reports that Seg-Zero-7B achieves 57.5 on the ReasonSeg benchmark, an 18% improvement over prior LISA-7B.

Significance. If the central performance claim and the emergence of useful reasoning chains are substantiated, the work would demonstrate that simple reward signals in a decoupled RL setup can induce generalizable chain-of-thought for reasoning segmentation without supervised reasoning data. This would be a notable data-efficient advance for zero-shot multimodal tasks. The public code release strengthens reproducibility.

major comments (3)
  1. [Abstract / Experiments] Abstract and Experiments section: The headline result of 57.5 zero-shot on ReasonSeg (18% above LISA-7B) is stated without any description of the training data mixture, exact baseline implementations, number of runs, or statistical significance testing. This absence prevents verification of the central empirical claim.
  2. [Method] Method section: The format-plus-accuracy reward is presented as sufficient to induce useful reasoning chains, yet no ablation removes or randomizes the reasoning tokens while keeping the segmentation model fixed. Without such a control, it remains unclear whether the accuracy signal shapes logically grounded prompts or merely optimizes superficial output patterns that score well on the training distribution.
  3. [Experiments] Experiments section: No qualitative examples of generated reasoning chains are shown, nor is there a comparison of mask quality when the reasoning model is replaced by a non-reasoning baseline. These omissions leave the claim of emergent test-time reasoning unsupported by direct evidence.
minor comments (1)
  1. [Abstract] Abstract: 'precious pixel-level masks' is likely a typo for 'precise pixel-level masks'.

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for the thoughtful review and constructive suggestions. We address each of the major comments below and outline the revisions we will make to the manuscript.

read point-by-point responses
  1. Referee: [Abstract / Experiments] Abstract and Experiments section: The headline result of 57.5 zero-shot on ReasonSeg (18% above LISA-7B) is stated without any description of the training data mixture, exact baseline implementations, number of runs, or statistical significance testing. This absence prevents verification of the central empirical claim.

    Authors: We thank the referee for highlighting this. To address the concern, we will expand the Experiments section to include a comprehensive description of the training data mixture, details on how baselines were implemented (including hyperparameters and code references), results from multiple runs with standard deviations, and appropriate statistical tests. This will allow for better verification of the central claims. revision: yes

  2. Referee: [Method] Method section: The format-plus-accuracy reward is presented as sufficient to induce useful reasoning chains, yet no ablation removes or randomizes the reasoning tokens while keeping the segmentation model fixed. Without such a control, it remains unclear whether the accuracy signal shapes logically grounded prompts or merely optimizes superficial output patterns that score well on the training distribution.

    Authors: We appreciate this point and acknowledge the value of such an ablation. We will include an ablation study in the revised version where we replace the reasoning model's output with randomized or fixed positional prompts (keeping the segmentation model unchanged) to demonstrate that the learned reasoning chains contribute to the performance gains beyond superficial patterns; a minimal sketch of such an ablation appears after these responses. revision: yes

  3. Referee: [Experiments] Experiments section: No qualitative examples of generated reasoning chains are shown, nor is there a comparison of mask quality when the reasoning model is replaced by a non-reasoning baseline. These omissions leave the claim of emergent test-time reasoning unsupported by direct evidence.

    Authors: We will add qualitative visualizations of the reasoning chains generated by Seg-Zero in the Experiments section to provide direct evidence of the emergent reasoning. Furthermore, we will include a comparison experiment replacing the reasoning model with a non-reasoning baseline (e.g., direct instruction without chain-of-thought) and report the resulting mask quality metrics to support the claim. revision: yes
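
A minimal sketch of the randomized-prompt control promised in response 2, under assumed interfaces: the segmentation model is held fixed while the reasoning model's positional prompts are swapped for random boxes, and mean IoU is compared. The segment callable and the (image, gt_mask) example layout are assumptions for illustration, not the paper's code.

```python
import numpy as np

def mask_iou(pred: np.ndarray, gt: np.ndarray) -> float:
    union = np.logical_or(pred, gt).sum()
    return float(np.logical_and(pred, gt).sum()) / float(union) if union else 0.0

def random_bbox(height: int, width: int, rng: np.random.Generator) -> list[int]:
    """Sample a random box to stand in for the reasoning model's prompt."""
    x1, y1 = rng.integers(0, width - 2), rng.integers(0, height - 2)
    return [int(x1), int(y1), int(rng.integers(x1 + 1, width)), int(rng.integers(y1 + 1, height))]

def ablate_reasoning(examples, segment, reasoned_bboxes, rng):
    """Compare mean IoU when a frozen segmenter is driven by the learned
    positional prompts versus random ones. segment(image, bbox) -> mask is
    an assumed interface for a frozen promptable segmenter."""
    learned, randomized = [], []
    for (image, gt_mask), bbox in zip(examples, reasoned_bboxes):
        h, w = gt_mask.shape
        learned.append(mask_iou(segment(image, bbox), gt_mask))
        randomized.append(mask_iou(segment(image, random_bbox(h, w, rng)), gt_mask))
    return float(np.mean(learned)), float(np.mean(randomized))

# Toy usage with a stub segmenter that ignores the box and thresholds the image.
rng = np.random.default_rng(2)
examples = [(rng.random((48, 48)), rng.random((48, 48)) > 0.5) for _ in range(4)]
stub_segment = lambda image, bbox: image > 0.5
print(ablate_reasoning(examples, stub_segment, [[4, 4, 40, 40]] * 4, rng))
```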

Circularity Check

0 steps flagged

No circularity: empirical benchmark result independent of training inputs

full rationale

The paper reports a measured zero-shot score of 57.5 on the external ReasonSeg benchmark after GRPO training with format+accuracy rewards. This performance number is obtained by direct evaluation on held-out data and does not reduce to any fitted parameter, self-defined quantity, or self-citation chain. No equations, uniqueness theorems, or ansatzes are presented that would make the central claim equivalent to its inputs by construction. The emergence of reasoning chains is an empirical observation rather than a deductive step that loops back on itself.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on standard reinforcement-learning assumptions about reward shaping and policy optimization; no new physical or mathematical entities are introduced beyond the named GRPO algorithm and the reward design.

free parameters (1)
  • reward weights for format and accuracy
    The abstract describes a 'sophisticated reward mechanism' that combines format and accuracy terms; the relative weighting is not specified and must be chosen or tuned.
axioms (1)
  • domain assumption: Reinforcement learning with format and accuracy rewards can produce explicit, generalizable chain-of-thought reasoning without any supervised reasoning examples.
    Invoked when the paper states that training occurs 'exclusively via reinforcement learning with GRPO and without explicit reasoning data'.

pith-pipeline@v0.9.0 · 5521 in / 1354 out tokens · 51293 ms · 2026-05-16T12:27:04.945088+00:00 · methodology

discussion (0)


Forward citations

Cited by 21 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Seg-Agent: Test-Time Multimodal Reasoning for Training-Free Language-Guided Segmentation

    cs.CV 2026-05 unverdicted novelty 7.0

    Seg-Agent performs language-guided segmentation without training by using Set-of-Mark visual prompts to enable explicit multimodal chain-of-reasoning in three stages: generation, selection, and refinement.

  2. From Web to Pixels: Bringing Agentic Search into Visual Perception

    cs.CV 2026-05 unverdicted novelty 7.0

    WebEye benchmark and Pixel-Searcher agent enable visual perception tasks by using web search to resolve object identities before precise localization or answering.

  3. PixDLM: A Dual-Path Multimodal Language Model for UAV Reasoning Segmentation

    cs.CV 2026-04 unverdicted novelty 7.0

    The work introduces the UAV Reasoning Segmentation task, the DRSeg benchmark dataset, and PixDLM as a baseline dual-path multimodal language model for reasoning-based segmentation in aerial imagery.

  4. Tarot-SAM3: Training-free SAM3 for Any Referring Expression Segmentation

    cs.CV 2026-04 unverdicted novelty 7.0

    Tarot-SAM3 delivers a training-free pipeline for segmenting images from arbitrary referring expressions via expression reasoning prompts and DINOv3-based mask self-refinement.

  5. Topo-R1: Detecting Topological Anomalies via Vision-Language Models

    cs.CV 2026-03 unverdicted novelty 7.0

    Topo-R1 fine-tunes a vision-language model using a topology-aware reward and GRPO to detect anomalies such as broken or spurious connections in tubular segmentation masks, outperforming standard VLMs.

  6. A Unified and Controllable Framework for Layered Image Generation with Visual Effects

    cs.CV 2026-01 unverdicted novelty 7.0

    LASAGNA produces layered images with integrated visual effects in a single pass, enabling drift-free edits via alpha compositing while releasing a 48K dataset and a 242-sample benchmark.

  7. IBISAgent: Reinforcing Pixel-Level Visual Reasoning in MLLMs for Universal Biomedical Object Referring and Segmentation

    cs.CV 2026-01 conditional novelty 7.0

    IBISAgent enables MLLMs to perform iterative pixel-level visual reasoning for biomedical object referring and segmentation via text-based clicks and agentic RL, outperforming prior SOTA methods without model modifications.

  8. DeepEyes: Incentivizing "Thinking with Images" via Reinforcement Learning

    cs.CV 2025-05 unverdicted novelty 7.0

    DeepEyes uses reinforcement learning to teach vision-language models active perception and image-based thinking, yielding gains on perception, reasoning, grounding, and hallucination benchmarks.

  9. Affordance Agent Harness: Verification-Gated Skill Orchestration

    cs.RO 2026-05 unverdicted novelty 6.0

    Affordance Agent Harness is a verification-gated orchestration system that unifies skills via an evidence store, episodic memory priors, an adaptive router, and a self-consistency verifier to improve accuracy-cost tra...

  10. Video-ToC: Video Tree-of-Cue Reasoning

    cs.CV 2026-04 unverdicted novelty 6.0

    Video-ToC adds tree-guided cue localization, demand-based RL rewards, and automated datasets to video LLMs, reporting better results than prior methods on six understanding benchmarks plus a hallucination test.

  11. Image Generators are Generalist Vision Learners

    cs.CV 2026-04 conditional novelty 6.0

    Image generation pretraining builds generalist vision models that reach SOTA on 2D and 3D perception tasks by reframing them as RGB image outputs.

  12. Image Generators are Generalist Vision Learners

    cs.CV 2026-04 unverdicted novelty 6.0

    Image generation pretraining produces generalist vision models that reframe perception tasks as image synthesis and reach SOTA results on segmentation, depth estimation, and other 2D/3D tasks.

  13. Saliency-R1: Enforcing Interpretable and Faithful Vision-language Reasoning via Saliency-map Alignment Reward

    cs.CV 2026-04 unverdicted novelty 6.0

    Saliency-R1 uses a novel saliency map technique and GRPO with human bounding-box overlap as reward to improve VLM reasoning faithfulness and interpretability.

  14. ARGUS: Policy-Adaptive Ad Governance via Evolving Reinforcement with Adversarial Umpiring

    cs.CL 2026-05 unverdicted novelty 5.0

    ARGUS uses a Prosecutor-Defender-Umpire multi-agent setup plus RAG and chain-of-thought rewards to adapt ad policy enforcement to new regulations using minimal fresh labels.

  15. SignReasoner: Compositional Reasoning for Complex Traffic Sign Understanding via Functional Structure Units

    cs.CV 2026-04 unverdicted novelty 5.0

    SignReasoner decomposes traffic signs into functional structure units and uses a two-stage VLM post-training pipeline to achieve state-of-the-art compositional reasoning on a new benchmark.

  16. Learning to Focus and Precise Cropping: A Reinforcement Learning Framework with Information Gaps and Grounding Loss for MLLMs

    cs.CV 2026-03 unverdicted novelty 5.0

    A two-stage RL method with information gaps and grounding loss trains MLLMs to focus on and precisely crop relevant image regions, yielding SOTA results on high-resolution VQA benchmarks.

  17. VideoChat-R1: Enhancing Spatio-Temporal Perception via Reinforcement Fine-Tuning

    cs.CV 2025-04 unverdicted novelty 5.0

    Reinforcement fine-tuning with temporal rewards produces VideoChat-R1, a video MLLM showing large gains on spatio-temporal perception benchmarks such as +31.8 temporal grounding and +31.2 object tracking.

  18. RCoT-Seg: Reinforced Chain-of-Thought for Video Reasoning and Segmentation

    cs.CV 2026-05 unverdicted novelty 4.0

    RCoT-Seg uses GRPO-reinforced keyframe selection from a CoT-start corpus followed by SAM2 mask propagation to improve video object segmentation under implicit temporal instructions over prior MLLM sampling methods.

  19. Affordance Agent Harness: Verification-Gated Skill Orchestration

    cs.RO 2026-05 unverdicted novelty 4.0

    Affordance Agent Harness is a verification-gated orchestration framework that adaptively combines heterogeneous skills, retrieves episodic memories, and uses self-consistency checks to improve affordance grounding acc...

  20. From System 1 to System 2: A Survey of Reasoning Large Language Models

    cs.AI 2025-02 accept novelty 3.0

    The survey organizes the shift of LLMs toward deliberate System 2 reasoning, covering model construction techniques, performance on math and coding benchmarks, and future research directions.

  21. Multimodal Chain-of-Thought Reasoning: A Comprehensive Survey

    cs.CV 2025-03 unverdicted novelty 2.0

    The paper provides the first comprehensive survey of multimodal chain-of-thought reasoning, including foundational concepts, a taxonomy of methodologies, application analyses, challenges, and future directions.

Reference graph

Works this paper leans on

45 extracted references · 45 canonical work pages · cited by 19 Pith papers · 10 internal anchors

  1. [1]

    Segnet: A deep convolutional encoder-decoder architecture for image segmentation

    Vijay Badrinarayanan, Alex Kendall, and Roberto Cipolla. Segnet: A deep convolutional encoder-decoder architecture for image segmentation. IEEE transactions on pattern anal- ysis and machine intelligence, 39(12):2481–2495, 2017. 3

  2. [2]

    Qwen2.5-VL Technical Report

    Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2.5-VL technical report. arXiv preprint arXiv:2502.13923, 2025. 3, 4, 5

  3. [3]

    One token to seg them all: Language instructed reasoning segmentation in videos

    Zechen Bai, Tong He, Haiyang Mei, Pichao Wang, Ziteng Gao, Joya Chen, Zheng Zhang, and Mike Zheng Shou. One token to seg them all: Language instructed reasoning seg- mentation in videos. Advances in Neural Information Pro- cessing Systems, 37:6833–6859, 2025. 2, 3

  4. [4]

    Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs

    Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L Yuille. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolu- tion, and fully connected crfs. IEEE transactions on pattern analysis and machine intelligence, 40(4):834–848, 2017. 3

  5. [5]

    Rethinking Atrous Convolution for Semantic Image Segmentation

    Liang-Chieh Chen, George Papandreou, Florian Schroff, and Hartwig Adam. Rethinking atrous convolution for seman- tic image segmentation. arXiv preprint arXiv:1706.05587 ,

  6. [6]

    Encoder-decoder with atrous separable convolution for semantic image segmentation

    Liang-Chieh Chen, Yukun Zhu, George Papandreou, Florian Schroff, and Hartwig Adam. Encoder-decoder with atrous separable convolution for semantic image segmentation. In Proceedings of the European conference on computer vision (ECCV), pages 801–818, 2018. 3

  7. [7]

    Sam4mllm: Enhance multimodal large language model for referring expression segmentation

    Yi-Chia Chen, Wei-Hua Li, Cheng Sun, Yu-Chiang Frank Wang, and Chu-Song Chen. Sam4mllm: Enhance multi- modal large language model for referring expression seg- mentation. In European Conference on Computer Vision , pages 323–340. Springer, 2024. 2, 3, 7, 9

  8. [8]

    Per-pixel classification is not all you need for semantic segmentation

    Bowen Cheng, Alex Schwing, and Alexander Kirillov. Per- pixel classification is not all you need for semantic segmen- tation. Advances in neural information processing systems , 34:17864–17875, 2021. 3

  9. [9]

    Masked-attention mask transformer for universal image segmentation

    Bowen Cheng, Ishan Misra, Alexander G Schwing, Alexan- der Kirillov, and Rohit Girdhar. Masked-attention mask transformer for universal image segmentation. In Proceed- ings of the IEEE/CVF conference on computer vision and pattern recognition, pages 1290–1299, 2022. 3

  10. [10]

    Alphazero-like tree-search can guide large language model decoding and training

    Xidong Feng, Ziyu Wan, Muning Wen, Stephen Mar- cus McAleer, Ying Wen, Weinan Zhang, and Jun Wang. Alphazero-like tree-search can guide large language model decoding and training. arXiv preprint arXiv:2309.17179 ,

  11. [11]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025. 2, 3

  12. [12]

    Partimagenet: A large, high-quality dataset of parts

    Ju He, Shuo Yang, Shaokang Yang, Adam Kortylewski, Xi- aoding Yuan, Jie-Neng Chen, Shuai Liu, Cheng Yang, Qi- hang Yu, and Alan Yuille. Partimagenet: A large, high- quality dataset of parts. In European Conference on Com- puter Vision, pages 128–145. Springer, 2022. 2

  13. [13]

    Referitgame: Referring to objects in photographs of natural scenes

    Sahar Kazemzadeh, Vicente Ordonez, Mark Matten, and Tamara Berg. Referitgame: Referring to objects in pho- tographs of natural scenes. In Proceedings of the 2014 con- ference on empirical methods in natural language processing (EMNLP), pages 787–798, 2014. 2, 3, 5

  14. [14]

    Segment anything

    Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer White- head, Alexander C Berg, Wan-Yen Lo, et al. Segment any- thing. In Proceedings of the IEEE/CVF international con- ference on computer vision , pages 4015–4026, 2023. 2, 3, 4

  15. [15]

    Training language models to self-correct via reinforcement learning

    Aviral Kumar, Vincent Zhuang, Rishabh Agarwal, Yi Su, John D Co-Reyes, Avi Singh, Kate Baumli, Shariq Iqbal, Colton Bishop, Rebecca Roelofs, et al. Training language models to self-correct via reinforcement learning. arXiv preprint arXiv:2409.12917, 2024. 2

  16. [16]

    Open R1 Multimodal

    EvolvingLMMs Lab. Open R1 Multimodal. https://github.com/EvolvingLMMs-Lab/open-r1-multimodal, 2025. 3

  17. [17]

    Lisa: Reasoning segmentation via large language model

    Xin Lai, Zhuotao Tian, Yukang Chen, Yanwei Li, Yuhui Yuan, Shu Liu, and Jiaya Jia. Lisa: Reasoning segmentation via large language model. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages 9579–9589, 2024. 2, 3, 5, 7, 8, 9

  18. [18]

    Mini-gemini: Mining the potential of multi-modality vision language models

    Yanwei Li, Yuechen Zhang, Chengyao Wang, Zhisheng Zhong, Yixin Chen, Ruihang Chu, Shaoteng Liu, and Jiaya Jia. Mini-gemini: Mining the potential of multi-modality vision language models. arXiv preprint arXiv:2403.18814,

  19. [19]

    Open-vocabulary semantic segmentation with mask-adapted clip

    Feng Liang, Bichen Wu, Xiaoliang Dai, Kunpeng Li, Yinan Zhao, Hang Zhang, Peizhao Zhang, Peter Vajda, and Diana Marculescu. Open-vocabulary semantic segmentation with mask-adapted clip. In Proceedings of the IEEE/CVF con- ference on computer vision and pattern recognition , pages 7061–7070, 2023. 7

  20. [20]

    Let’s verify step by step

    Hunter Lightman, Vineet Kosaraju, Yuri Burda, Harrison Ed- wards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let’s verify step by step. In The Twelfth International Conference on Learning Repre- sentations, 2023. 2

  21. [21]

    Refinenet: Multi-path refinement networks for high-resolution semantic segmentation

    Guosheng Lin, Anton Milan, Chunhua Shen, and Ian Reid. Refinenet: Multi-path refinement networks for high- resolution semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recogni- tion, pages 1925–1934, 2017. 3

  22. [22]

    GRES: Generalized referring expression segmentation

    Chang Liu, Henghui Ding, and Xudong Jiang. GRES: Gen- eralized referring expression segmentation. In CVPR, 2023. 7

  23. [23]

    Visual instruction tuning

    Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. Advances in neural information processing systems, 36:34892–34916, 2023. 2

  24. [24]

    Improved baselines with visual instruction tuning

    Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. In CVPR,

  25. [25]

    Fully convolutional networks for semantic segmentation

    Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic segmentation. In Pro- ceedings of the IEEE conference on computer vision and pat- tern recognition, pages 3431–3440, 2015. 3

  26. [26]

    ChartQA: A Benchmark for Question Answering about Charts with Visual and Logical Reasoning

    Ahmed Masry, Do Xuan Long, Jia Qing Tan, Shafiq Joty, and Enamul Hoque. Chartqa: A benchmark for question an- swering about charts with visual and logical reasoning.arXiv preprint arXiv:2203.10244, 2022. 5

  27. [27]

    OpenAI o1

    OpenAI. OpenAI o1. https://openai.com/o1/ ,

  28. [28]

    Perceptiongpt: Effectively fusing visual perception into llm

    Renjie Pi, Lewei Yao, Jiahui Gao, Jipeng Zhang, and Tong Zhang. Perceptiongpt: Effectively fusing visual perception into llm. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages 27124– 27133, 2024. 7

  29. [29]

    Deepspeed: System optimizations enable training deep learning models with over 100 billion parameters

    Jeff Rasley, Samyam Rajbhandari, Olatunji Ruwase, and Yuxiong He. Deepspeed: System optimizations enable train- ing deep learning models with over 100 billion parameters. In Proceedings of the 26th ACM SIGKDD international con- ference on knowledge discovery & data mining, pages 3505– 3506, 2020. 5

  30. [30]

    SAM 2: Segment Anything in Images and Videos

    Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman R¨adle, Chloe Rolland, Laura Gustafson, et al. Sam 2: Segment anything in images and videos. arXiv preprint arXiv:2408.00714, 2024. 3, 4, 5

  31. [31]

    Grounded SAM: Assembling Open-World Models for Diverse Visual Tasks

    Tianhe Ren, Shilong Liu, Ailing Zeng, Jing Lin, Kunchang Li, He Cao, Jiayu Chen, Xinyu Huang, Yukang Chen, Feng Yan, et al. Grounded sam: Assembling open-world models for diverse visual tasks. arXiv preprint arXiv:2401.14159 ,

  32. [32]

    Pixellm: Pixel reasoning with large multimodal model

    Zhongwei Ren, Zhicheng Huang, Yunchao Wei, Yao Zhao, Dongmei Fu, Jiashi Feng, and Xiaojie Jin. Pixellm: Pixel reasoning with large multimodal model. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 26374–26383, 2024. 2, 3, 7

  33. [33]

    U-net: Convolutional networks for biomedical image segmentation

    Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U- net: Convolutional networks for biomedical image segmen- tation. In Medical image computing and computer-assisted intervention–MICCAI 2015: 18th international conference, Munich, Germany, October 5-9, 2015, proceedings, part III 18, pages 234–241. Springer, 2015. 3

  34. [34]

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Y Wu, et al. Deepseekmath: Pushing the limits of mathe- matical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024. 2, 3, 5

  35. [35]

    Towards vqa models that can read

    Amanpreet Singh, Vivek Natarajan, Meet Shah, Yu Jiang, Xinlei Chen, Dhruv Batra, Devi Parikh, and Marcus Rohrbach. Towards vqa models that can read. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 8317–8326, 2019. 5

  36. [36]

    R1-V

    R1-V Team. R1-V. https://github.com/Deep-Agent/R1-V?tab=readme-ov-file, 2025. 3

  37. [37]

    Solving olympiad geometry without human demonstrations

    Trieu H Trinh, Yuhuai Wu, Quoc V Le, He He, and Thang Luong. Solving olympiad geometry without human demon- strations. Nature, 625(7995):476–482, 2024. 2

  38. [38]

    Solving math word problems with process- and outcome-based feedback

    Jonathan Uesato, Nate Kushman, Ramana Kumar, Francis Song, Noah Siegel, Lisa Wang, Antonia Creswell, Geof- frey Irving, and Irina Higgins. Solving math word problems with process-and outcome-based feedback. arXiv preprint arXiv:2211.14275, 2022. 2

  39. [39]

    Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations

    Peiyi Wang, Lei Li, Zhihong Shao, RX Xu, Damai Dai, Yifei Li, Deli Chen, Yu Wu, and Zhifang Sui. Math-shepherd: Verify and reinforce llms step-by-step without human anno- tations. arXiv preprint arXiv:2312.08935, 2023. 2

  40. [40]

    Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

    Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, et al. Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution. arXiv preprint arXiv:2409.12191, 2024. 2, 3

  41. [41]

    Lisa++: An improved baseline for reasoning segmentation with large language model

    Senqiao Yang, Tianyuan Qu, Xin Lai, Zhuotao Tian, Bohao Peng, Shu Liu, and Jiaya Jia. Lisa++: An improved baseline for reasoning segmentation with large language model.arXiv preprint arXiv:2312.17240, 2023. 3

  42. [42]

    Lavt: Language-aware vision transformer for referring image segmentation

    Zhao Yang, Jiaqi Wang, Yansong Tang, Kai Chen, Heng- shuang Zhao, and Philip HS Torr. Lavt: Language-aware vision transformer for referring image segmentation. In Pro- ceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 18155–18165, 2022. 7

  43. [43]

    Modeling context in referring expressions

    Licheng Yu, Patrick Poirson, Shan Yang, Alexander C Berg, and Tamara L Berg. Modeling context in referring expres- sions. In Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part II 14, pages 69–85. Springer, 2016. 2, 3, 5

  44. [44]

    Pyramid scene parsing network

    Hengshuang Zhao, Jianping Shi, Xiaojuan Qi, Xiaogang Wang, and Jiaya Jia. Pyramid scene parsing network. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2881–2890, 2017. 3

  45. [45]

    Lyra: An efficient and speech-centric framework for omni-cognition

    Zhisheng Zhong, Chengyao Wang, Yuqi Liu, Senqiao Yang, Longxiang Tang, Yuechen Zhang, Jingyao Li, Tianyuan Qu, Yanwei Li, Yukang Chen, et al. Lyra: An efficient and speech-centric framework for omni-cognition. arXiv preprint arXiv:2412.09501, 2024. 3