
arxiv: 2605.15951 · v1 · pith:YXCNXVAInew · submitted 2026-05-15 · 💻 cs.CV

From Failure to Feedback: Group Revision Unlocks Hard Cases in Object-Level Grounding

Pith reviewed 2026-05-20 18:35 UTC · model grok-4.3

classification 💻 cs.CV
keywords group revision · reinforcement learning · vision-language models · object grounding · reward shaping · GRPO · referring expression comprehension · segmentation

The pith

Group revision turns initial failures into shaped feedback signals that improve reinforcement learning for object grounding in vision-language models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that standard GRPO-based reinforcement learning for vision-language models provides too little signal when every candidate response fails on difficult grounding tasks. It introduces a group-revision approach that starts with one initial response, generates multiple revised candidates, and then measures how much each revision improves on the first attempt. A consolidation step converts those measured improvements into shaping signals that adjust both the reward and the advantage estimates. The result is stronger learning on hard cases, shown by gains on referring segmentation, reasoning segmentation, referring expression comprehension, and counting benchmarks.
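
The sparse-signal problem described above can be made concrete with a minimal sketch of the group-relative advantage used in GRPO-style training. It assumes the common formulation in which each reward is normalised by its group's mean and standard deviation; the function name, the `eps` constant, and the binary rewards are illustrative and are not taken from the paper's code.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalise each reward against its group's mean and standard deviation.

    Assumption: population standard deviation plus a small eps for stability;
    actual GRPO implementations may differ in these details.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hard case: every candidate fails the reward criterion, so all rewards are equal
# and every advantage is zero (no learning signal).
all_fail = group_relative_advantages([0.0] * 8)

# Mixed case: one candidate succeeds and is the only one with a positive advantage.
mixed = group_relative_advantages([0.0] * 7 + [1.0])
```

When every response in the group receives the same reward, the numerator vanishes for all candidates, which is the situation the group-revision approach is meant to address.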

Core claim

The group-revision optimisation paradigm enhances learning on hard cases by generating revised candidates, quantifying each candidate's improvement over the initial attempt through a consolidation process, and using the resulting signals to refine the reward and modulate the advantage, thereby amplifying the influence of high-quality revisions.

What carries the argument

The consolidation process that quantifies improvement over the initial response and converts it into reward-shaping signals for refining rewards and modulating advantages.
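
One way to picture this mechanism is the hypothetical sketch below. The paper's own equations (Eq. (6) and Eq. (8) in the figure captions) are not reproduced on this page, so the difference form of the signal, the additive reward term, the multiplicative advantage scaling, and the `weight` and `scale` parameters are all assumptions made for illustration.

```python
def shaping_signals(initial_cost, revised_costs):
    """Improvement of each revision over the initial attempt.

    Assumption: the signal is the drop in alignment cost, so positive means
    the revision is better than the initial response.
    """
    return [initial_cost - c for c in revised_costs]


def shaped_rewards(base_rewards, deltas, weight=1.0):
    """Add the weighted shaping signal to each base reward (hypothetical form)."""
    return [r + weight * d for r, d in zip(base_rewards, deltas)]


def modulated_advantages(advantages, deltas, scale=1.0):
    """Amplify advantages of revisions that improved; leave the others unchanged.

    Assumption: only positive improvements scale the advantage.
    """
    return [a * (1.0 + scale * max(d, 0.0)) for a, d in zip(advantages, deltas)]


# Three revisions of an initial response whose alignment cost was 0.9.
deltas = shaping_signals(0.9, [0.2, 0.9, 1.0])

# All revisions miss the base reward criterion, yet shaped rewards now differ.
rewards = shaped_rewards([0.0, 0.0, 0.0], deltas)

# Example advantages scaled by the measured improvement.
advantages = modulated_advantages([1.0, 1.0, -1.0], deltas)
```

Under these assumptions, a group whose base rewards are all zero still yields distinct shaped rewards, so the group-relative normalisation no longer collapses to zero.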

If this is right

  • Consistent gains on referring and reasoning segmentation tasks compared with prior GRPO models.
  • Improved results on referring expression comprehension and counting benchmarks.
  • Stronger learning signals specifically for scenarios where all initial responses fail.
  • More effective modulation of advantage estimates in reinforcement learning for grounding problems.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The revision-and-measurement loop could extend to other sparse-reward reinforcement learning settings outside vision-language grounding.
  • Self-generated revisions might reduce reliance on external feedback in broader model alignment tasks.
  • Applying the same consolidation idea to multi-turn visual reasoning could address harder compositional cases.

Load-bearing premise

The process of measuring improvement over the first response produces reliable shaping signals that strengthen good revisions without adding bias or noise to the advantage estimates.

What would settle it

Running the method on the same hard-case benchmarks and finding no improvement or worse performance than standard GRPO would show the shaping signals do not deliver the claimed benefit.

Figures

Figures reproduced from arXiv: 2605.15951 by Anjie Le, Can Peng, Fengbei Liu, Jiajun Deng, Jiayuan Zhu, Jiazhen Pan, Junde Wu, Yiping Ji, Yuyuan Liu.

Figure 1: Group Answer (GRPO): A group of responses (n=8) from Qwen2.5VL-7B [3] yields unsatisfactory grounding, with the best box IoU only 14.72%. Group Revision: A group of responses that revises the mistaken answer improves the best IoU to 74.57%. Quantitative study: The red curve shows the best IoU within each GRPO group on the hard cases through the training set, while the group revision effectively recover th…
Figure 2: Illustration of our approach. We first sample and collect the output from the initial response (Eq. (1)). A revision prompt is appended as a follow-up question to guide the sampling of a group of revised responses (Eq. (2)). We then collect the shaping sets from both results and compute the shaping signal ∆ϕi (Eq. (6)) between them. This signal is derived from the alignment cost (Eq. (4)), based on object-…
Figure 3: Ablation of Post Scaling. We evaluate our method with the effectiveness of advantage post-scaling from Eq. (8) in the segmentation [35, 90] and counting tasks [18] in single-object setting. marks. Our method (multi) shows consistent gains over [3], achieving +0.55% on SeedBenchV2+ [37] and +1.25% (average) on MMU-pro [93]. These results suggest that object-level grounding improves instance-level localisa…
Figure 5: Reward curves over training steps for both single- and multi-object setups. Each line shows the mean reward of the revision group in condition of with and without the ∆ϕ term. be effectively up-weighted, while most cases remain stable. Crucially, the distribution indicates that initial responses are generally of competitive quality and are not intentionally degraded to inflate ∆ϕ, reducing the risk of rew…
Figure 6: Qualitative comparisons on the CountingBench [58] and ReasoningSeg [35] datasets. For each example, we present the outputs from Qwen2.5VL-7B [3] (top row) and our method (bottom), including both CoT reasoning and spatial grounding. the speed and direction of the boat’, our model produces more coherent reasoning (e.g., correctly identifying the oars) and delivers more accurate spatial grounding. 5. Conclus…
Original abstract

Finetuning Large Vision-Language Models with reinforcement learning has emerged as a promising approach to enhance their capability in object-level grounding. However, existing methods, mainly based on GRPO, assign rewards at the response level. Such sparse reward, often criterion-induced, leads to minimal learning signals when all candidate responses fail in challenging scenarios. In this work, we propose a group-revision optimisation paradigm that enhances learning on hard cases. It begins with a sampled initial response and generates a set of revised candidates to explore improved grounding outcomes. Inspired by reward shaping, we introduce a consolidation process that quantifies each candidate's improvement over the initial attempt and converts it into informative shaping signals. These signals are used to both refine the reward and modulate the advantage, amplifying the influence of high-quality revisions. Our method achieves consistent gains across referring and reasoning segmentation, REC, and counting benchmarks compared with prior GRPO-based models. Our code is available at https://github.com/yyliu01/GroupRevision.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes a group-revision optimisation paradigm for RL-based finetuning of large vision-language models on object-level grounding. It samples an initial response, generates revised candidates, applies a consolidation process to quantify improvement over the initial attempt, and converts these into shaping signals that refine the reward and modulate the advantage within a GRPO-style framework. The goal is to provide denser learning signals on hard cases where all candidates fail. Empirical results claim consistent gains over prior GRPO-based models on referring/reasoning segmentation, REC, and counting benchmarks, with code released.

Significance. If the consolidation process produces reliable, unbiased shaping signals that correlate with grounding quality, the approach could meaningfully extend reward-shaping ideas to group-based revision in vision-language RL, particularly for sparse-reward hard cases. The public code release supports reproducibility and is a positive contribution. Significance is currently limited by the absence of detailed validation for the key shaping mechanism and full experimental controls.

major comments (2)
  1. [Method] Method section (consolidation process): The central claim requires that the (unspecified) quantification metric and aggregation rule convert improvement deltas into shaping signals that amplify high-quality revisions rather than introduce bias or noise into advantage estimates. No explicit description, normalization, or debiasing procedure is provided, leaving open the risk that simple deltas overweight recoveries from the initial failure mode while under-weighting smaller but superior gains.
  2. [Experiments] Experiments section: The reported benchmark gains cannot be fully assessed without data splits, ablation studies isolating the consolidation scaling factors, and complete tables showing per-task metrics against strong GRPO baselines. The abstract-level claims of 'consistent gains' therefore rest on incomplete evidence.
minor comments (2)
  1. [Method] Notation for the shaping signal and advantage modulation should be formalized with an equation rather than prose description to improve clarity.
  2. [Experiments] Figure captions and axis labels in the results figures would benefit from explicit mention of the exact metrics (e.g., IoU thresholds) used for the reported scores.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment point by point below and indicate the revisions made to strengthen the presentation of the consolidation process and experimental evidence.

Point-by-point responses
  1. Referee: [Method] Method section (consolidation process): The central claim requires that the (unspecified) quantification metric and aggregation rule convert improvement deltas into shaping signals that amplify high-quality revisions rather than introduce bias or noise into advantage estimates. No explicit description, normalization, or debiasing procedure is provided, leaving open the risk that simple deltas overweight recoveries from the initial failure mode while under-weighting smaller but superior gains.

    Authors: We agree that the original description of the consolidation process was high-level and lacked sufficient detail on the quantification metric, aggregation rule, normalization, and debiasing. In the revised manuscript we have added an expanded subsection in the Method section that explicitly defines the improvement quantification (using task-specific grounding metrics such as IoU for segmentation and accuracy for REC/counting), the aggregation function that converts per-candidate deltas into shaping signals, the normalization procedure applied to the deltas, and a debiasing step that down-weights recoveries from the initial failure mode relative to smaller but higher-quality gains. Mathematical formulations for the shaped reward and modulated advantage are now included to clarify how bias and noise are controlled within the GRPO framework. revision: yes

  2. Referee: [Experiments] Experiments section: The reported benchmark gains cannot be fully assessed without data splits, ablation studies isolating the consolidation scaling factors, and complete tables showing per-task metrics against strong GRPO baselines. The abstract-level claims of 'consistent gains' therefore rest on incomplete evidence.

    Authors: We acknowledge that the original Experiments section omitted explicit data-split details, targeted ablations on consolidation scaling factors, and full per-task comparison tables. In the revision we have added: (i) a clear description of the training and evaluation splits for each benchmark, (ii) new ablation tables that isolate the contribution of the consolidation scaling factors, and (iii) expanded result tables that report per-task metrics against the strongest GRPO baselines. These additions provide the granular evidence needed to substantiate the reported gains. revision: yes

Circularity Check

0 steps flagged

No significant circularity: empirical method with independent consolidation signals

full rationale

The paper introduces a group-revision paradigm for RL finetuning of vision-language models on hard grounding cases. The consolidation process quantifies improvement over an initial response using external metrics (e.g., IoU or accuracy deltas) to generate shaping signals for reward and advantage modulation. This is a designed heuristic, not a self-definitional loop or fitted parameter renamed as prediction. No load-bearing self-citation, uniqueness theorem, or ansatz smuggling is present in the described derivation. The central claim rests on empirical gains across benchmarks rather than reducing to its own inputs by construction. The approach is self-contained against external evaluation.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 0 invented entities

The method relies on standard RL assumptions and the existence of a base GRPO implementation. No new physical entities or ungrounded mathematical axioms are introduced beyond typical hyperparameter choices in RL training.

free parameters (2)
  • revision generation parameters
    Number of revised candidates and how they are sampled or prompted; these control the exploration of improvements and are chosen to make the shaping signals effective.
  • consolidation scaling factors
    Weights or functions that convert measured improvement into reward and advantage adjustments; these are tuned to amplify high-quality revisions.
axioms (1)
  • domain assumption: Improvement over an initial response can be reliably quantified and used as a shaping signal without introducing systematic bias.
    Invoked when the consolidation process is described as converting improvement into informative signals for reward refinement.

pith-pipeline@v0.9.0 · 5728 in / 1205 out tokens · 28554 ms · 2026-05-20T18:35:12.255591+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • IndisputableMonolith/Cost Jcost uniqueness / washburn_uniqueness_aczel unclear

    Relation between the paper passage and the cited Recognition theorem.

    consolidation process that quantifies each candidate's improvement over the initial attempt and converts it into informative shaping signals... alignment cost... Φ(s_shape,i) := 1/|A| Σ e_m,n with e = 1/3[(1-IoU) + f_L1(box) + f_L1(point)]
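
The alignment cost quoted in this passage can be sketched as follows. This is a reading of the quoted formula under stated assumptions, not the authors' implementation: boxes are taken as `(x1, y1, x2, y2)` normalised to `[0, 1]`, points as `(x, y)`, `f_L1` as a mean absolute difference, and the set `A` as a list of prediction/target pairs that have already been matched. All function names are hypothetical.

```python
def box_iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def mean_l1(u, v):
    """Mean absolute difference between two equal-length coordinate tuples."""
    return sum(abs(x - y) for x, y in zip(u, v)) / len(u)


def pair_cost(pred_box, pred_point, gt_box, gt_point):
    """e = 1/3 [(1 - IoU) + f_L1(box) + f_L1(point)] for one matched pair."""
    iou_term = 1.0 - box_iou(pred_box, gt_box)
    return (iou_term + mean_l1(pred_box, gt_box) + mean_l1(pred_point, gt_point)) / 3.0


def alignment_cost(matched_pairs):
    """Average pair cost over the matched set A (assumed non-empty)."""
    costs = [pair_cost(*pair) for pair in matched_pairs]
    return sum(costs) / len(costs)


# A perfect prediction and a fully disjoint one.
perfect = ((0.0, 0.0, 1.0, 1.0), (0.5, 0.5), (0.0, 0.0, 1.0, 1.0), (0.5, 0.5))
disjoint = ((0.0, 0.0, 0.5, 0.5), (0.25, 0.25), (0.5, 0.5, 1.0, 1.0), (0.75, 0.75))
cost_perfect = pair_cost(*perfect)
cost_disjoint = pair_cost(*disjoint)
cost_mean = alignment_cost([perfect, disjoint])
```

Because the two L1 terms stay informative when the boxes do not overlap, this kind of cost still distinguishes between candidates whose IoU is zero, which is where an IoU-only criterion gives no gradient of quality.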

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

105 extracted references · 105 canonical work pages · 27 internal anchors

  1. [1]

    Flamingo: a visual language model for few-shot learning

    Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning. Advances in neural information processing systems, 35:23716–23736, 2022. 2

  2. [2]

    Hindsight experience replay

    Marcin Andrychowicz, Filip Wolski, Alex Ray, Jonas Schneider, Rachel Fong, Peter Welinder, Bob McGrew, Josh Tobin, OpenAI Pieter Abbeel, and Wojciech Zaremba. Hindsight experience replay. Advances in neural information processing systems, 30, 2017. 3

  3. [3]

    Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2.5-VL technical report. arXiv preprint arXiv:2502.13923, 2025. 1, 2, 3, 5, 6, 7, 8

  4. [4]

    Univg-r1: Reasoning guided universal visual grounding with reinforcement learning

    Sule Bai, Mingxing Li, Yong Liu, Jing Tang, Haoji Zhang, Lei Sun, Xiangxiang Chu, and Yansong Tang. Univg-r1: Reasoning guided universal visual grounding with reinforcement learning. arXiv preprint arXiv:2505.14231,

  5. [5]

    One token to seg them all: Language-instructed reasoning segmentation in videos

    Zechen Bai, Tong He, Haiyang Mei, et al. One token to seg them all: Language-instructed reasoning segmentation in videos. In NeurIPS, 2024. 3

  6. [6]

    Hallucination of Multimodal Large Language Models: A Survey

    Zechen Bai, Pichao Wang, Tianjun Xiao, Tong He, Zongbo Han, Zheng Zhang, and Mike Zheng Shou. Hallucination of multimodal large language models: A survey. arXiv preprint arXiv:2404.18930, 2024. 3

  7. [7]

    Perception tokens enhance visual reasoning in multimodal language models

    Mohammad Bigverdi, Amanpreet Singh, et al. Perception tokens enhance visual reasoning in multimodal language models. In CVPR, 2025. 3

  8. [8]

    Let there be a clock on the beach: Reducing object hallucination in image captioning

    Ali Furkan Biten, Lluís Gómez, and Dimosthenis Karatzas. Let there be a clock on the beach: Reducing object hallucination in image captioning. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 1381–1390, 2022. 2

  9. [9]

    Reward machines for vision-based robotic manipulation

    Alberto Camacho, Jacob Varley, Andy Zeng, Deepali Jain, Atil Iscen, and Dmitry Kalashnikov. Reward machines for vision-based robotic manipulation. In 2021 IEEE International Conference on Robotics and Automation (ICRA), pages 14284–14290. IEEE, 2021. 3

  10. [10]

    Ground-r1: Incentivizing grounded visual reasoning via reinforcement learning

    Meng Cao, Haoze Zhao, Can Zhang, Xiaojun Chang, Ian Reid, and Xiaodan Liang. Ground-r1: Incentivizing grounded visual reasoning via reinforcement learning. arXiv preprint arXiv:2505.20272, 2025. 3

  11. [11]

    Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling

    Xiaokang Chen, Zhiyu Wu, Xingchao Liu, Zizheng Pan, Wen Liu, Zhenda Xie, Xingkai Yu, and Chong Ruan. Janus-pro: Unified multimodal understanding and generation with data and model scaling. arXiv preprint arXiv:2501.17811,

  12. [12]

    Boosting reinforcement learning algorithms in continuous robotic reaching tasks using adaptive potential functions

    Yifei Chen, Lambert Schomaker, and Francisco Cruz. Boosting reinforcement learning algorithms in continuous robotic reaching tasks using adaptive potential functions. In Australasian Joint Conference on Artificial Intelligence, pages 52–64. Springer, 2024. 3

  13. [13]

    Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling

    Zhe Chen, Weiyun Wang, Yue Cao, Yangzhou Liu, Zhangwei Gao, Erfei Cui, Jinguo Zhu, Shenglong Ye, Hao Tian, Zhaoyang Liu, et al. Expanding performance boundaries of open-source multimodal models with model, data, and test-time scaling. arXiv preprint arXiv:2412.05271, 2024. 2

  14. [14]

    Deep reinforcement learning from human preferences

    Paul F Christiano, Jan Leike, Tom Brown, Miljan Martic, Shane Legg, and Dario Amodei. Deep reinforcement learning from human preferences. Advances in neural information processing systems, 30, 2017. 2, 3

  15. [15]

    Scaling instruction-finetuned language models

    Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Yunxuan Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, et al. Scaling instruction-finetuned language models. Journal of Machine Learning Research, 25(70):1–53, 2024. 3

  16. [16]

    Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

    Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv preprint arXiv:2507.06261, 2025. 1

  17. [17]

    Process supervision-guided policy optimization for code generation

    Ning Dai, Zheng Wu, Renjie Zheng, Ziyun Wei, Wenlei Shi, Xing Jin, Guanlin Liu, Chen Dun, Liang Huang, and Lin Yan. Process supervision-guided policy optimization for code generation. arXiv preprint arXiv:2410.17621,

  18. [18]

    Molmo and pixmo: Open weights and open data for state-of-the-art vision-language models

    Matt Deitke, Christopher Clark, Sangho Lee, Rohun Tripathi, Yue Yang, Jae Sung Park, Mohammadreza Salehi, Niklas Muennighoff, Kyle Lo, Luca Soldaini, et al. Molmo and pixmo: Open weights and open data for state-of-the-art vision-language models. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 91–104,

  19. [19]

    Potential-based difference rewards for multiagent reinforcement learning

    Sam Devlin, Logan Yliniemi, Daniel Kudenko, and Kagan Tumer. Potential-based difference rewards for multiagent reinforcement learning. In Proceedings of the 2014 international conference on Autonomous agents and multi-agent systems, pages 165–172, 2014. 3

  20. [20]

    Dynamic potential-based reward shaping

    Sam Michael Devlin and Daniel Kudenko. Dynamic potential-based reward shaping. In 11th International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2012), pages 433–440. IFAAMAS, 2012. 3

  21. [21]

    PaLM-E: An Embodied Multimodal Language Model

    Danny Driess, Fei Xia, Mehdi SM Sajjadi, Corey Lynch, Aakanksha Chowdhery, Brian Ichter, Ayzaan Wahid, Jonathan Tompson, Quan Vuong, Tianhe Yu, et al. Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378, 2023. 2

  22. [22]

    Video-R1: Reinforcing Video Reasoning in MLLMs

    Kaituo Feng, Kaixiong Gong, Bohao Li, Zonghao Guo, Yibing Wang, Tianshuo Peng, Junfei Wu, Xiaoying Zhang, Benyou Wang, and Xiangyu Yue. Video-r1: Reinforcing video reasoning in mllms. arXiv preprint arXiv:2503.21776, 2025. 3

  23. [23]

    Mme: A comprehensive evaluation benchmark for multimodal large language models

    Chaoyou Fu, Peixian Chen, Yunhang Shen, Yulei Qin, Mengdan Zhang, Xu Lin, Jinrui Yang, Xiawu Zheng, Ke Li, Xing Sun, et al. Mme: A comprehensive evaluation benchmark for multimodal large language models. In The Thirty-ninth Annual Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2025. 5, 6

  24. [24]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    Dong Guo, Zichen Liu, Weize Zhang, Yuxuan Zhou, Xiaozhi Wang, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025. Uses GRPO for reasoning RL. 3

  25. [25]

    Lvis: A dataset for large vocabulary instance segmentation

    Agrim Gupta, Piotr Dollar, and Ross Girshick. Lvis: A dataset for large vocabulary instance segmentation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 5356–5364, 2019. 5

  26. [26]

    Ma-lmm: Memory-augmented large multimodal model for long-term video understanding

    Bo He, Hengduo Li, Young Kyun Jang, Menglin Jia, Xuefei Cao, Ashish Shah, Abhinav Shrivastava, and Ser-Nam Lim. Ma-lmm: Memory-augmented large multimodal model for long-term video understanding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13504–13514, 2024. 2

  27. [27]

    Sugarcrepe: Fixing hackable benchmarks for vision-language compositionality

    Cheng-Yu Hsieh, Jieyu Zhang, Zixian Ma, Aniruddha Kembhavi, and Ranjay Krishna. Sugarcrepe: Fixing hackable benchmarks for vision-language compositionality. Advances in neural information processing systems, 36:31096–31116, 2023. 1

  28. [28]

    Roboground: Robotic manipulation with grounded vision-language priors

    Haifeng Huang, Xinyi Chen, Yilun Chen, Hao Li, Xiaoshen Han, Zehan Wang, Tai Wang, Jiangmiao Pang, and Zhou Zhao. Roboground: Robotic manipulation with grounded vision-language priors. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 22540–22550, 2025. 1, 3

  29. [29]

    Sam-r1: Leveraging sam for reward feedback in multimodal segmentation via reinforcement learning

    Jiaqi Huang, Zunnan Xu, Jun Zhou, Ting Liu, Yicheng Xiao, Mingwen Ou, Bowen Ji, Xiu Li, and Kehong Yuan. Sam-r1: Leveraging sam for reward feedback in multimodal segmentation via reinforcement learning. arXiv preprint arXiv:2505.22596, 2025. 1, 3

  30. [30]

    Survey of hallucination in natural language generation

    Ziwei Ji, Nayeon Lee, Rita Frieske, Tiezheng Yu, Dan Su, Yan Xu, Etsuko Ishii, Ye Jin Bang, Andrea Madotto, and Pascale Fung. Survey of hallucination in natural language generation. ACM computing surveys, 55(12):1–38, 2023. 1, 2

  31. [31]

    Rex-thinker: Grounded object referring via chain-of-thought reasoning

    Qing Jiang, Xingyu Chen, Zhaoyang Zeng, Junzhi Yu, and Lei Zhang. Rex-thinker: Grounded object referring via chain-of-thought reasoning. arXiv preprint arXiv:2506.04034, 2025. 3

  32. [32]

    Segment anything

    Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C Berg, Wan-Yen Lo, et al. Segment anything. In Proceedings of the IEEE/CVF international conference on computer vision, pages 4015–4026, 2023. 3

  33. [33]

    Benchmarking vision-language models for diagnostics in emergency and critical care settings

    Christoph F Kurz, Tatiana Merzhevich, Bjoern M Eskofier, Jakob Nikolas Kather, and Benjamin Gmeiner. Benchmarking vision-language models for diagnostics in emergency and critical care settings. npj Digital Medicine, 8(1):423,

  34. [34]

    Efficient memory management for large language model serving with pagedattention

    Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with pagedattention. In Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles, 2023. 5

  35. [35]

    Lisa: Reasoning segmentation via large language model

    Xin Lai, Zhuotao Tian, Yukang Chen, Yanwei Li, Yuhui Yuan, Shu Liu, and Jiaya Jia. Lisa: Reasoning segmentation via large language model. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9579–9589, 2024. 1, 2, 3, 5, 6, 7, 8

  36. [36]

    Mitigating object hallucinations in large vision-language models through visual contrastive decoding

    Sicong Leng, Hang Zhang, Guanzheng Chen, Xin Li, Shijian Lu, Chunyan Miao, and Lidong Bing. Mitigating object hallucinations in large vision-language models through visual contrastive decoding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13872–13882, 2024. 1, 2

  37. [37]

    Seed-bench-2-plus: Benchmarking multimodal large language models with text-rich visual comprehension

    Bohao Li, Yuying Ge, Yi Chen, Yixiao Ge, Ruimao Zhang, and Ying Shan. Seed-bench-2-plus: Benchmarking multimodal large language models with text-rich visual comprehension. arXiv preprint arXiv:2404.16790, 2024. 5, 6, 7

  38. [38]

    LLaVA-OneVision: Easy Visual Task Transfer

    Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Peiyuan Zhang, Yanwei Li, Ziwei Liu, et al. Llava-onevision: Easy visual task transfer. arXiv preprint arXiv:2408.03326, 2024. 6

  39. [39]

    Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation

    Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In International conference on machine learning, pages 12888–12900. PMLR, 2022. 1

  40. [40]

    Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models

    Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In International conference on machine learning, pages 19730–19742. PMLR, 2023. 2

  41. [41]

    Codeprm: Execution feedback-enhanced process reward model for code generation

    Qingyao Li, Xinyi Dai, Xiangyang Li, Weinan Zhang, Yasheng Wang, Ruiming Tang, and Yong Yu. Codeprm: Execution feedback-enhanced process reward model for code generation. In Findings of the Association for Computational Linguistics: ACL 2025, pages 8169–8182, 2025. 2, 3

  42. [42]

    Process reward model with q-value rankings

    Wendi Li and Yixuan Li. Process reward model with q-value rankings. arXiv preprint arXiv:2410.11287, 2024. 3

  43. [43]

    Evaluating Object Hallucination in Large Vision-Language Models

    Yifan Li, Yifan Du, Kun Zhou, Jinpeng Wang, Wayne Xin Zhao, and Ji-Rong Wen. Evaluating object hallucination in large vision-language models. arXiv preprint arXiv:2305.10355, 2023. 5, 6

  44. [44]

    Let's Verify Step by Step

    Hunter Lightman, Vineet Kosaraju, Yura Burda, Harri Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let’s verify step by step. arXiv preprint arXiv:2305.20050, 2023. 2

  45. [45]

    Gres: Generalized referring expression segmentation

    Chang Liu, Henghui Ding, and Xudong Jiang. Gres: Generalized referring expression segmentation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 23592–23601, 2023. 5

  46. [46]

    Visual instruction tuning

    Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. Advances in neural information processing systems, 36:34892–34916, 2023. 1, 2

  47. [47]

    A Survey on Hallucination in Large Vision-Language Models

    Hanchao Liu, Wenyuan Xue, Yifei Chen, Dapeng Chen, Xiutian Zhao, Ke Wang, Liping Hou, Rongjun Li, and Wei Peng. A survey on hallucination in large vision-language models. arXiv preprint arXiv:2402.00253, 2024. 1, 2

  48. [48]

    Ocrbench: on the hidden mystery of ocr in large multimodal models

    Yuliang Liu, Zhang Li, Mingxin Huang, Biao Yang, Wenwen Yu, Chunyuan Li, Xu-Cheng Yin, Cheng-Lin Liu, Lianwen Jin, and Xiang Bai. Ocrbench: on the hidden mystery of ocr in large multimodal models. Science China Information Sciences, 67(12):220102, 2024. 5, 6

  49. [49]

    AuralSAM2: Enabling SAM2 Hear Through Pyramid Audio-Visual Feature Prompting

    Yuyuan Liu, Yuanhong Chen, Chong Wang, Junlin Han, Junde Wu, Can Peng, Jingkun Chen, Yu Tian, and Gustavo Carneiro. Auralsam2: Enabling sam2 hear through pyramid audio-visual feature prompting. arXiv preprint arXiv:2506.01015, 2025. 3

  50. [50]

    Seg-Zero: Reasoning-Chain Guided Segmentation via Cognitive Reinforcement

    Yuqi Liu, Bohao Peng, Zhisheng Zhong, Zihao Yue, Fanbin Lu, Bei Yu, and Jiaya Jia. Seg-zero: Reasoning-chain guided segmentation via cognitive reinforcement. arXiv preprint arXiv:2503.06520, 2025. 1, 2, 3, 5, 6, 8

  51. [51]

    Visionreasoner: Unified visual perception and reasoning via reinforcement learning

    Yuqi Liu, Tianyuan Qu, Zhisheng Zhong, Bohao Peng, Shu Liu, Bei Yu, and Jiaya Jia. Visionreasoner: Unified visual perception and reasoning via reinforcement learning. arXiv preprint arXiv:2505.12081, 2025. 1, 3, 5, 6, 8

  52. [52]

    Visual-RFT: Visual Reinforcement Fine-Tuning

    Ziyu Liu, Zeyi Sun, Yuhang Zang, Xiaoyi Dong, Yuhang Cao, Haodong Duan, Dahua Lin, and Jiaqi Wang. Visual-rft: Visual reinforcement fine-tuning. arXiv preprint arXiv:2503.01785, 2025. 1, 2, 3

  53. [53]

    Groma: Localized visual tokenization for grounding multimodal large language models

    Chuofan Ma, Yi Jiang, Jiannan Wu, Zehuan Yuan, and Xiaojuan Qi. Groma: Localized visual tokenization for grounding multimodal large language models. In European Conference on Computer Vision, pages 417–435. Springer,

  54. [54]

    Chartqa: A benchmark for question answering about charts with visual and logical reasoning

    Ahmed Masry, Xuan Long Do, Jia Qing Tan, Shafiq Joty, and Enamul Hoque. Chartqa: A benchmark for question answering about charts with visual and logical reasoning. In Findings of the association for computational linguistics: ACL 2022, pages 2263–2279, 2022. 5, 6

  55. [55]

    Revisiting group relative policy optimization: Insights into on-policy and off-policy training

    Youssef Mroueh, Nicolas Dupuis, Brian Belgodere, Apoorva Nitsure, Mattia Rigotti, Kristjan Greenewald, Jiri Navratil, Jerret Ross, and Jesus Rios. Revisiting group relative policy optimization: Insights into on-policy and off-policy training. arXiv preprint arXiv:2505.22257, 2025. 2

  56. [56]

    Policy invariance under reward transformations: Theory and application to reward shaping

    Andrew Y Ng, Daishi Harada, and Stuart Russell. Policy invariance under reward transformations: Theory and application to reward shaping. In ICML, pages 278–287. Citeseer,

  57. [57]

    Training language models to follow instructions with human feedback

    Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. In Advances in Neural Information Processing Systems (NeurIPS), 2022. 3

  58. [58]

    Teaching clip to count to ten

    Roni Paiss, Ariel Ephrat, Omer Tov, Shiran Zada, Inbar Mosseri, Michal Irani, and Tali Dekel. Teaching clip to count to ten. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3170–3180, 2023. 1, 2, 5, 6, 8

  59. [59]

    Medvlm-r1: Incentivizing medical reasoning capability of vision-language models via reinforcement learning

    Jiazhen Pan, Che Liu, Junde Wu, Fenglin Liu, Jiayuan Zhu, Hongwei Bran Li, Chen Chen, Cheng Ouyang, and Daniel Rueckert. Medvlm-r1: Incentivizing medical reasoning capability of vision-language models via reinforcement learning. arXiv preprint arXiv:2502.19634, 2025. 3

  60. [60]

    Perceptiongpt: Effectively fusing visual perception into llm

    Renjie Pi, Lewei Yao, Jiahui Gao, Jipeng Zhang, and Tong Zhang. Perceptiongpt: Effectively fusing visual perception into llm. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 27124–27133, 2024. 6

  61. [61]

    Jack of all tasks master of many: Designing general-purpose coarse-to-fine vision-language model

    Shraman Pramanick, Guangxing Han, Rui Hou, Sayan Nag, Ser-Nam Lim, Nicolas Ballas, Qifan Wang, Rama Chellappa, and Amjad Almahairi. Jack of all tasks master of many: Designing general-purpose coarse-to-fine vision-language model. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14076–14088, 2024. 6

  62. [62]

    Reasoning to attend: Try to understand how <seg> token works

    Rui Qian, Xin Yin, and Dejing Dou. Reasoning to attend: Try to understand how <seg> token works. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025. 6

  63. [63]

    Navbench: Probing multimodal large language models for embodied navigation

    Yanyuan Qiao, Haodong Hong, Wenqi Lyu, Dong An, Siqi Zhang, Yutong Xie, Xinyu Wang, and Qi Wu. Navbench: Probing multimodal large language models for embodied navigation. arXiv preprint arXiv:2506.01031, 2025. 1

  64. [64]

    Glamm: Pixel grounding large multimodal model

    Hanoona Rasheed, Muhammad Maaz, Sahal Shaji, Abdelrahman Shaker, Salman Khan, Hisham Cholakkal, Rao M Anwer, Eric Xing, Ming-Hsuan Yang, and Fahad S Khan. Glamm: Pixel grounding large multimodal model. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13009–13018, 2024. 3

  65. [65]

    SAM 2: Segment Anything in Images and Videos

    Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman Rädle, Chloe Rolland, Laura Gustafson, Eric Mintun, Junting Pan, Kalyan Vasudev Alwala, Nicolas Carion, Chao-Yuan Wu, Ross Girshick, Piotr Dollár, and Christoph Feichtenhofer. Sam 2: Segment anything in images and videos. arXiv preprint arXiv:240...

  66. [66]

    Pixellm: Pixel reasoning with large multimodal model

    Zhongwei Ren, Zhicheng Huang, Yunchao Wei, Yao Zhao, Dongmei Fu, Jiashi Feng, and Xiaojie Jin. Pixellm: Pixel reasoning with large multimodal model. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 26374–26383, 2024. 6

  67. [67]

    Object Hallucination in Image Captioning

    Anna Rohrbach, Lisa Anne Hendricks, Kaylee Burns, Trevor Darrell, and Kate Saenko. Object hallucination in image captioning. arXiv preprint arXiv:1809.02156, 2018. 1, 2

  68. [68]

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

    Zhihong Shao, Zihan Zhang, Xinyu Huang, Yushi Yang, Minghui Qiu, and Wayne Xin Zhao. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024. Introduces Group-Relative Policy Optimization (GRPO). 1, 3, 5

  69. [69]

    R-prm: Reasoning-driven process reward modeling

    Shuaijie She, Junxiao Liu, Yifeng Liu, Jiajun Chen, Xin Huang, and Shujian Huang. R-prm: Reasoning-driven process reward modeling. arXiv preprint arXiv:2503.21295,

  70. [70]

    VLM-R1: A Stable and Generalizable R1-style Large Vision-Language Model

    Haozhan Shen, Peng Liu, Jingcheng Li, Chunxin Fang, Yibo Ma, Jiajia Liao, Qiaoli Shen, Zilun Zhang, Kangjia Zhao, Qianqian Zhang, Ruochen Xu, and Tiancheng Zhao. Vlm-r1: A stable and generalizable r1-style large vision-language model. arXiv preprint arXiv:2504.07615, 2025. 1

  71. [71]

    HybridFlow: A Flexible and Efficient RLHF Framework

    Guangming Sheng, Chi Zhang, Zilingfeng Ye, Xibin Wu, Wang Zhang, Ru Zhang, Yanghua Peng, Haibin Lin, and Chuan Wu. Hybridflow: A flexible and efficient rlhf framework. arXiv preprint arXiv:2409.19256, 2024. 5

  72. [72]

    Chong Wang, Yuanhong Chen, Fengbei Liu, Yuyuan Liu, Davis James McCarthy, Helen Frazer, and Gustavo Carneiro. Mixture of gaussian-distributed prototypes with generative modelling for interpretable and trustworthy image recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025. 1

  73. [73]

    Elysium: Exploring object-level perception in videos via mllm

    Han Wang, Yongjie Ye, Yanjie Wang, Yuxiang Nie, and Can Huang. Elysium: Exploring object-level perception in videos via mllm. In European Conference on Computer Vision, pages 166–185. Springer, 2024. 6

  74. [74]

    Rethinking the embodied gap in vision-and-language navigation: A holistic study of physical and visual disparities

    Liuyi Wang, Xinyuan Xia, Hui Zhao, Hanqing Wang, Tai Wang, Yilun Chen, Chengju Liu, Qijun Chen, and Jiangmiao Pang. Rethinking the embodied gap in vision-and-language navigation: A holistic study of physical and visual disparities. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9455–9465, 2025. 1, 3

  75. [75]

    Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

    Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, et al. Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution. arXiv preprint arXiv:2409.12191, 2024. 1, 2, 3, 6

  76. [76]

    Visualprm: An effective process reward model for multimodal reasoning

    Weiyun Wang, Zhangwei Gao, Lianjie Chen, Zhe Chen, Jinguo Zhu, Xiangyu Zhao, Yangzhou Liu, Yue Cao, Shenglong Ye, Xizhou Zhu, et al. Visualprm: An effective process reward model for multimodal reasoning. arXiv preprint arXiv:2503.10291, 2025. 2

  77. [77]

    Segllm: Multi-round reasoning segmentation

    XuDong Wang, Shaolun Zhang, Shufan Li, Konstantinos Kallidromitis, Kehan Li, Yusuke Kato, Kazuki Kozuka, and Trevor Darrell. Segllm: Multi-round reasoning segmentation. arXiv preprint arXiv:2410.18923, 2024. 6

  78. [78]

    Potential-based shaping and q-value initialization are equivalent

    Eric Wiewiora. Potential-based shaping and q-value initialization are equivalent. Journal of Artificial Intelligence Research, 19:205–208, 2003. 2, 3

  79. [79]

    Navitrace: Evaluating embodied navigation of vision-language models

    Tim Windecker, Manthan Patel, Moritz Reuss, Richard Schwarzkopf, Cesar Cadena, Rudolf Lioutikov, Marco Hutter, and Jonas Frey. Navitrace: Evaluating embodied navigation of vision-language models. arXiv preprint arXiv:2510.26909, 2025. 1, 3

  80. [80]

    Qwen-image technical report

    Chenfei Wu, Jiahao Li, Jingren Zhou, Junyang Lin, Kaiyuan Gao, Kun Yan, Sheng ming Yin, Shuai Bai, Xiao Xu, Yilei Chen, Yuxiang Chen, Zecheng Tang, Zekai Zhang, Zhengyi Wang, An Yang, Bowen Yu, Chen Cheng, Dayiheng Liu, Deqing Li, Hang Zhang, Hao Meng, Hu Wei, Jingyuan Ni, Kai Chen, Kuan Cao, Liang Peng, Lin Qu, Minggang Wu, Peng Wang, Shuting Yu, Tingkun...

Showing first 80 references.