From Failure to Feedback: Group Revision Unlocks Hard Cases in Object-Level Grounding

Anjie Le; Can Peng; Fengbei Liu; Jiajun Deng; Jiayuan Zhu; Jiazhen Pan; Junde Wu; Yiping Ji; Yuyuan Liu

arxiv: 2605.15951 · v1 · pith:YXCNXVAInew · submitted 2026-05-15 · 💻 cs.CV


Pith reviewed 2026-05-20 18:35 UTC · model grok-4.3

classification 💻 cs.CV

keywords: group revision, reinforcement learning, vision-language models, object grounding, reward shaping, GRPO, referring expression comprehension, segmentation


The pith

Group revision turns initial failures into shaped feedback signals that improve reinforcement learning for object grounding in vision-language models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that standard GRPO-based reinforcement learning for vision-language models provides too little signal when every candidate response fails on difficult grounding tasks. It introduces a group-revision approach that starts with one initial response, generates multiple revised candidates, and then measures how much each revision improves on the first attempt. A consolidation step converts those measured improvements into shaping signals that adjust both the reward and the advantage estimates. The result is stronger learning on hard cases, shown by gains on referring segmentation, reasoning segmentation, referring expression comprehension, and counting benchmarks.
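The sparse-signal problem can be made concrete with a minimal sketch. It assumes the commonly used group-relative advantage (reward minus group mean, divided by group standard deviation) and a binary pass/fail reward; the function name `grpo_advantages`, the `eps` constant, and the reward values are illustrative and not taken from the paper. The group size of 8 follows the Figure 1 caption.

```python
import statistics

def grpo_advantages(rewards, eps=1e-6):
    """Group-relative advantages: (r - group mean) / (group std + eps).

    Assumption: this is the usual GRPO normalisation, not code from the paper.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Hard case: all 8 sampled responses fail the criterion, so every reward is 0.
all_fail = grpo_advantages([0.0] * 8)

# Easier case: one response out of 8 succeeds.
mixed = grpo_advantages([0.0] * 7 + [1.0])

print(all_fail)  # every advantage is zero, so the policy gradient vanishes
print(mixed)     # the successful response gets a positive advantage
```

When every reward in the group is identical, each advantage is zero and the update carries no information, which is the failure mode the group-revision step is meant to address.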

Core claim

The group-revision optimisation paradigm enhances learning on hard cases by generating revised candidates, quantifying each candidate's improvement over the initial attempt through a consolidation process, and using the resulting signals to refine the reward and modulate the advantage, thereby amplifying the influence of high-quality revisions.

What carries the argument

The consolidation process that quantifies improvement over the initial response and converts it into reward-shaping signals for refining rewards and modulating advantages.
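A rough sketch of how such a consolidation step could feed a GRPO-style update is given below. Everything specific in it is an assumption made for illustration: the sign convention (improvement = initial cost minus revised cost), the additive reward refinement with weight `lam`, and the multiplicative advantage post-scaling with weight `beta`. The paper's own Eqs. (6) and (8) may differ.

```python
import statistics

def shaped_advantages(rewards, cost_initial, costs_revised,
                      lam=1.0, beta=1.0, eps=1e-6):
    """Hypothetical consolidation: improvement deltas refine rewards and scale advantages."""
    # Improvement of each revision over the shared initial response
    # (lower alignment cost is better, so a positive delta is an improvement).
    deltas = [cost_initial - c for c in costs_revised]
    # Reward refinement: add the weighted improvement to the base reward.
    shaped = [r + lam * d for r, d in zip(rewards, deltas)]
    # Standard group-relative normalisation on the shaped rewards.
    mean = statistics.fmean(shaped)
    std = statistics.pstdev(shaped)
    adv = [(s - mean) / (std + eps) for s in shaped]
    # Advantage modulation: scale up the advantage of revisions that improved.
    scaled = [a * (1.0 + beta * max(d, 0.0)) for a, d in zip(adv, deltas)]
    return deltas, scaled

# All four revisions still fail the base criterion (reward 0),
# but they differ in how far they moved from the initial attempt.
deltas, scaled = shaped_advantages(
    rewards=[0.0, 0.0, 0.0, 0.0],
    cost_initial=0.9,
    costs_revised=[0.9, 0.8, 0.5, 0.95],
)
print(deltas)
print(scaled)
```

With the base rewards all equal, plain group normalisation would give zero advantages; here the improvement deltas separate the candidates, and the revision with the largest improvement receives the largest advantage.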

If this is right

Consistent gains on referring and reasoning segmentation tasks compared with prior GRPO models.
Improved results on referring expression comprehension and counting benchmarks.
Stronger learning signals specifically for scenarios where all initial responses fail.
More effective modulation of advantage estimates in reinforcement learning for grounding problems.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The revision-and-measurement loop could extend to other sparse-reward reinforcement learning settings outside vision-language grounding.
Self-generated revisions might reduce reliance on external feedback in broader model alignment tasks.
Applying the same consolidation idea to multi-turn visual reasoning could address harder compositional cases.

Load-bearing premise

The process of measuring improvement over the first response produces reliable shaping signals that strengthen good revisions without adding bias or noise to the advantage estimates.

What would settle it

Running the method on the same hard-case benchmarks and finding no improvement or worse performance than standard GRPO would show the shaping signals do not deliver the claimed benefit.

Figures

Figures reproduced from arXiv: 2605.15951 by Anjie Le, Can Peng, Fengbei Liu, Jiajun Deng, Jiayuan Zhu, Jiazhen Pan, Junde Wu, Yiping Ji, Yuyuan Liu.

**Figure 1.** Group Answer (GRPO): A group of responses (n=8) from Qwen2.5VL-7B [3] yields unsatisfactory grounding, with the best box IoU only 14.72%. Group Revision: A group of responses that revises the mistaken answer improves the best IoU to 74.57%. Quantitative study: The red curve shows the best IoU within each GRPO group on the hard cases through the training set, while the group revision effectively recover th…

**Figure 2.** Illustration of our approach. We first sample and collect the output from the initial response (Eq. (1)). A revision prompt is appended as a follow-up question to guide the sampling of a group of revised responses (Eq. (2)). We then collect the shaping sets from both results and compute the shaping signal ∆ϕi (Eq. (6)) between them. This signal is derived from the alignment cost (Eq. (4)), based on object-…

**Figure 3.** Ablation of Post Scaling. We evaluate our method with the effectiveness of advantage post-scaling from Eq. (8) in the segmentation [35, 90] and counting tasks [18] in single-object setting. marks. Our method (multi) shows consistent gains over [3], achieving +0.55% on SeedBenchV2+ [37] and +1.25% (average) on MMU-pro [93]. These results suggest that object-level grounding improves instance-level localisa…

**Figure 5.** Reward curves over training steps for both single- and multi-object setups. Each line shows the mean reward of the revision group with and without the ∆ϕ term. be effectively up-weighted, while most cases remain stable. Crucially, the distribution indicates that initial responses are generally of competitive quality and are not intentionally degraded to inflate ∆ϕ, reducing the risk of rew…

**Figure 6.** Qualitative comparisons on the CountingBench [58] and ReasoningSeg [35] datasets. For each example, we present the outputs from Qwen2.5VL-7B [3] (top row) and our method (bottom), including both CoT reasoning and spatial grounding. the speed and direction of the boat’, our model produces more coherent reasoning (e.g., correctly identifying the oars) and delivers more accurate spatial grounding. 5. Conclus…

read the original abstract

Finetuning Large Vision-Language Models with reinforcement learning has emerged as a promising approach to enhance their capability in object-level grounding. However, existing methods, mainly based on GRPO, assign rewards at the response level. Such sparse reward, often criterion-induced, leads to minimal learning signals when all candidate responses fail in challenging scenarios. In this work, we propose a group-revision optimisation paradigm that enhances learning on hard cases. It begins with a sampled initial response and generates a set of revised candidates to explore improved grounding outcomes. Inspired by reward shaping, we introduce a consolidation process that quantifies each candidate's improvement over the initial attempt and converts it into informative shaping signals. These signals are used to both refine the reward and modulate the advantage, amplifying the influence of high-quality revisions. Our method achieves consistent gains across referring and reasoning segmentation, REC, and counting benchmarks compared with prior GRPO-based models. Our code is available at https://github.com/yyliu01/GroupRevision.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper adds a group-revision step to GRPO that turns improvement over an initial failed response into reward-shaping signals for hard grounding cases in VLMs.


The main takeaway is that this work targets the sparse-reward issue in GRPO when every sampled response fails on tough object-grounding examples. It samples an initial response, generates revised candidates, measures improvement over the first attempt, and folds those deltas into both the reward and the advantage estimate. That is a direct, practical extension of existing reward-shaping ideas rather than an entirely new framework. The approach is applied to referring and reasoning segmentation, REC, and counting, with reported gains over prior GRPO baselines and code released on GitHub. Releasing code is helpful for anyone who wants to test the idea quickly.

The consolidation process that turns raw improvement into shaping signals is the part that needs the most attention. The abstract leaves the exact quantification metric and aggregation rule unspecified, so it is possible that simple deltas overweight recoveries from the initial failure mode while under-weighting smaller but genuinely better revisions. Without seeing the full ablations or checks for bias in the advantage estimates, it is hard to judge how clean the learning signal actually is. The gains look consistent on the listed benchmarks, but the lack of reported statistical tests or sensitivity analysis around the revision parameters makes the robustness claim harder to evaluate from the summary alone.

This paper is mainly for researchers already working on RL fine-tuning of vision-language models for precise visual grounding. A reader in that sub-area would get a concrete method to try and could build on the released code. It is solid enough on the problem statement and empirical direction to deserve a serious referee, even if the experiments will need tighter validation on the consolidation mechanics.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes a group-revision optimisation paradigm for RL-based finetuning of large vision-language models on object-level grounding. It samples an initial response, generates revised candidates, applies a consolidation process to quantify improvement over the initial attempt, and converts these into shaping signals that refine the reward and modulate the advantage within a GRPO-style framework. The goal is to provide denser learning signals on hard cases where all candidates fail. Empirical results claim consistent gains over prior GRPO-based models on referring/reasoning segmentation, REC, and counting benchmarks, with code released.

Significance. If the consolidation process produces reliable, unbiased shaping signals that correlate with grounding quality, the approach could meaningfully extend reward-shaping ideas to group-based revision in vision-language RL, particularly for sparse-reward hard cases. The public code release supports reproducibility and is a positive contribution. Significance is currently limited by the absence of detailed validation for the key shaping mechanism and full experimental controls.

major comments (2)

[Method] Method section (consolidation process): The central claim requires that the (unspecified) quantification metric and aggregation rule convert improvement deltas into shaping signals that amplify high-quality revisions rather than introduce bias or noise into advantage estimates. No explicit description, normalization, or debiasing procedure is provided, leaving open the risk that simple deltas overweight recoveries from the initial failure mode while under-weighting smaller but superior gains.
[Experiments] Experiments section: The reported benchmark gains cannot be fully assessed without data splits, ablation studies isolating the consolidation scaling factors, and complete tables showing per-task metrics against strong GRPO baselines. The abstract-level claims of 'consistent gains' therefore rest on incomplete evidence.
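The first major comment can be illustrated with a small arithmetic sketch. The IoU values are invented for illustration and the raw difference `revised - initial` is only one possible quantification, not necessarily the one the paper uses. The comparison is across two different prompts, each with its own initial response.

```python
def improvement(initial_iou, revised_iou):
    """Raw improvement delta of a revision over its initial response (illustrative)."""
    return revised_iou - initial_iou

# Hypothetical (initial IoU, revised IoU) pairs for two different prompts.
cases = {
    "recovery from a poor start": (0.10, 0.50),
    "refinement of a good start": (0.70, 0.85),
}

for name, (initial, revised) in cases.items():
    delta = improvement(initial, revised)
    print(f"{name}: final IoU {revised:.2f}, delta {delta:.2f}")
```

The recovery case ends with the lower final IoU but the larger delta, so an unnormalised delta would weight it more heavily than the refinement case. This is the kind of imbalance the referee asks the authors to rule out.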

minor comments (2)

[Method] Notation for the shaping signal and advantage modulation should be formalized with an equation rather than prose description to improve clarity.
[Experiments] Figure captions and axis labels in the results figures would benefit from explicit mention of the exact metrics (e.g., IoU thresholds) used for the reported scores.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment point by point below and indicate the revisions made to strengthen the presentation of the consolidation process and experimental evidence.


Referee: [Method] Method section (consolidation process): The central claim requires that the (unspecified) quantification metric and aggregation rule convert improvement deltas into shaping signals that amplify high-quality revisions rather than introduce bias or noise into advantage estimates. No explicit description, normalization, or debiasing procedure is provided, leaving open the risk that simple deltas overweight recoveries from the initial failure mode while under-weighting smaller but superior gains.

Authors: We agree that the original description of the consolidation process was high-level and lacked sufficient detail on the quantification metric, aggregation rule, normalization, and debiasing. In the revised manuscript we have added an expanded subsection in the Method section that explicitly defines the improvement quantification (using task-specific grounding metrics such as IoU for segmentation and accuracy for REC/counting), the aggregation function that converts per-candidate deltas into shaping signals, the normalization procedure applied to the deltas, and a debiasing step that down-weights recoveries from the initial failure mode relative to smaller but higher-quality gains. Mathematical formulations for the shaped reward and modulated advantage are now included to clarify how bias and noise are controlled within the GRPO framework. revision: yes
Referee: [Experiments] Experiments section: The reported benchmark gains cannot be fully assessed without data splits, ablation studies isolating the consolidation scaling factors, and complete tables showing per-task metrics against strong GRPO baselines. The abstract-level claims of 'consistent gains' therefore rest on incomplete evidence.

Authors: We acknowledge that the original Experiments section omitted explicit data-split details, targeted ablations on consolidation scaling factors, and full per-task comparison tables. In the revision we have added: (i) a clear description of the training and evaluation splits for each benchmark, (ii) new ablation tables that isolate the contribution of the consolidation scaling factors, and (iii) expanded result tables that report per-task metrics against the strongest GRPO baselines. These additions provide the granular evidence needed to substantiate the reported gains. revision: yes

Circularity Check

0 steps flagged

No significant circularity: empirical method with independent consolidation signals

full rationale

The paper introduces a group-revision paradigm for RL finetuning of vision-language models on hard grounding cases. The consolidation process quantifies improvement over an initial response using external metrics (e.g., IoU or accuracy deltas) to generate shaping signals for reward and advantage modulation. This is a designed heuristic, not a self-definitional loop or fitted parameter renamed as prediction. No load-bearing self-citation, uniqueness theorem, or ansatz smuggling is present in the described derivation. The central claim rests on empirical gains across benchmarks rather than reducing to its own inputs by construction. The approach is self-contained against external evaluation.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 0 invented entities

The method relies on standard RL assumptions and the existence of a base GRPO implementation. No new physical entities or ungrounded mathematical axioms are introduced beyond typical hyperparameter choices in RL training.

free parameters (2)

revision generation parameters
Number of revised candidates and how they are sampled or prompted; these control the exploration of improvements and are chosen to make the shaping signals effective.
consolidation scaling factors
Weights or functions that convert measured improvement into reward and advantage adjustments; these are tuned to amplify high-quality revisions.

axioms (1)

domain assumption: Improvement over an initial response can be reliably quantified and used as a shaping signal without introducing systematic bias.
Invoked when the consolidation process is described as converting improvement into informative signals for reward refinement.

pith-pipeline@v0.9.0 · 5728 in / 1205 out tokens · 28554 ms · 2026-05-20T18:35:12.255591+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost Jcost uniqueness / washburn_uniqueness_aczel unclear


consolidation process that quantifies each candidate's improvement over the initial attempt and converts it into informative shaping signals... alignment cost... Φ(s_shape,i) := 1/|A| Σ e_m,n with e = 1/3[(1-IoU) + f_L1(box) + f_L1(point)]
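The quoted alignment cost can be sketched in a few lines. The formula (a per-pair cost averaging 1 − IoU, an L1 box term, and an L1 point term, then averaged over the matched set A) comes from the passage above. The rest is assumed for illustration: boxes are `(x1, y1, x2, y2)` in normalised coordinates, IoU is computed on boxes, `f_L1` is a mean absolute difference, and the matched pairs `A` are passed in rather than computed, since the matching procedure is not described here.

```python
def box_iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def l1(u, v):
    """Mean absolute difference between two coordinate tuples (assumed form of f_L1)."""
    return sum(abs(x - y) for x, y in zip(u, v)) / len(u)

def pair_cost(pred, gt):
    """e = 1/3 [(1 - IoU) + f_L1(box) + f_L1(point)] for one matched pair."""
    return ((1.0 - box_iou(pred["box"], gt["box"]))
            + l1(pred["box"], gt["box"])
            + l1(pred["point"], gt["point"])) / 3.0

def alignment_cost(preds, gts, matches):
    """Phi = 1/|A| * sum of e_{m,n} over matched index pairs (m, n) in A."""
    if not matches:
        raise ValueError("matched set A must not be empty")
    return sum(pair_cost(preds[m], gts[n]) for m, n in matches) / len(matches)

# Illustrative single-object example with made-up coordinates.
gt = {"box": (0.2, 0.3, 0.6, 0.8), "point": (0.4, 0.55)}
initial = {"box": (0.6, 0.1, 0.9, 0.3), "point": (0.75, 0.2)}
revised = {"box": (0.25, 0.3, 0.6, 0.75), "point": (0.42, 0.52)}

phi_initial = alignment_cost([initial], [gt], [(0, 0)])
phi_revised = alignment_cost([revised], [gt], [(0, 0)])
# Assumed sign convention: a positive value means the revision improved.
delta_phi = phi_initial - phi_revised
print(phi_initial, phi_revised, delta_phi)
```

A perfect prediction gives a cost of zero, and a revision that moves closer to the ground truth lowers the cost, so the difference between initial and revised costs can serve as the improvement signal.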

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

105 extracted references · 105 canonical work pages · 27 internal anchors

[1] Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning. Advances in neural information processing systems, 35:23716–23736, 2022.
[2] Marcin Andrychowicz, Filip Wolski, Alex Ray, Jonas Schneider, Rachel Fong, Peter Welinder, Bob McGrew, Josh Tobin, Pieter Abbeel, and Wojciech Zaremba. Hindsight experience replay. Advances in neural information processing systems, 30, 2017.
[3] Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2.5-VL technical report. arXiv preprint arXiv:2502.13923, 2025.
[4] Sule Bai, Mingxing Li, Yong Liu, Jing Tang, Haoji Zhang, Lei Sun, Xiangxiang Chu, and Yansong Tang. Univg-r1: Reasoning guided universal visual grounding with reinforcement learning. arXiv preprint arXiv:2505.14231,
[5] Zechen Bai, Tong He, Haiyang Mei, et al. One token to seg them all: Language-instructed reasoning segmentation in videos. In NeurIPS, 2024.
[6] Zechen Bai, Pichao Wang, Tianjun Xiao, Tong He, Zongbo Han, Zheng Zhang, and Mike Zheng Shou. Hallucination of multimodal large language models: A survey. arXiv preprint arXiv:2404.18930, 2024.
[7] Mohammad Bigverdi, Amanpreet Singh, et al. Perception tokens enhance visual reasoning in multimodal language models. In CVPR, 2025.
[8] Ali Furkan Biten, Lluís Gómez, and Dimosthenis Karatzas. Let there be a clock on the beach: Reducing object hallucination in image captioning. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 1381–1390, 2022.
[9] Alberto Camacho, Jacob Varley, Andy Zeng, Deepali Jain, Atil Iscen, and Dmitry Kalashnikov. Reward machines for vision-based robotic manipulation. In 2021 IEEE International Conference on Robotics and Automation (ICRA), pages 14284–14290. IEEE, 2021.
[10] Meng Cao, Haoze Zhao, Can Zhang, Xiaojun Chang, Ian Reid, and Xiaodan Liang. Ground-r1: Incentivizing grounded visual reasoning via reinforcement learning. arXiv preprint arXiv:2505.20272, 2025.
[11] Xiaokang Chen, Zhiyu Wu, Xingchao Liu, Zizheng Pan, Wen Liu, Zhenda Xie, Xingkai Yu, and Chong Ruan. Janus-Pro: Unified multimodal understanding and generation with data and model scaling. arXiv preprint arXiv:2501.17811,
[12] Yifei Chen, Lambert Schomaker, and Francisco Cruz. Boosting reinforcement learning algorithms in continuous robotic reaching tasks using adaptive potential functions. In Australasian Joint Conference on Artificial Intelligence, pages 52–64. Springer, 2024.
[13] Zhe Chen, Weiyun Wang, Yue Cao, Yangzhou Liu, Zhangwei Gao, Erfei Cui, Jinguo Zhu, Shenglong Ye, Hao Tian, Zhaoyang Liu, et al. Expanding performance boundaries of open-source multimodal models with model, data, and test-time scaling. arXiv preprint arXiv:2412.05271, 2024.
[14] Paul F Christiano, Jan Leike, Tom Brown, Miljan Martic, Shane Legg, and Dario Amodei. Deep reinforcement learning from human preferences. Advances in neural information processing systems, 30, 2017.
[15] Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Yunxuan Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, et al. Scaling instruction-finetuned language models. Journal of Machine Learning Research, 25(70):1–53, 2024.
[16] Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv preprint arXiv:2507.06261, 2025.
[17] Ning Dai, Zheng Wu, Renjie Zheng, Ziyun Wei, Wenlei Shi, Xing Jin, Guanlin Liu, Chen Dun, Liang Huang, and Lin Yan. Process supervision-guided policy optimization for code generation. arXiv preprint arXiv:2410.17621,
[18] Matt Deitke, Christopher Clark, Sangho Lee, Rohun Tripathi, Yue Yang, Jae Sung Park, Mohammadreza Salehi, Niklas Muennighoff, Kyle Lo, Luca Soldaini, et al. Molmo and pixmo: Open weights and open data for state-of-the-art vision-language models. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 91–104,
[19] Sam Devlin, Logan Yliniemi, Daniel Kudenko, and Kagan Tumer. Potential-based difference rewards for multiagent reinforcement learning. In Proceedings of the 2014 international conference on Autonomous agents and multi-agent systems, pages 165–172, 2014.
[20] Sam Michael Devlin and Daniel Kudenko. Dynamic potential-based reward shaping. In 11th International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2012), pages 433–440. IFAAMAS, 2012.
[21] Danny Driess, Fei Xia, Mehdi SM Sajjadi, Corey Lynch, Aakanksha Chowdhery, Brian Ichter, Ayzaan Wahid, Jonathan Tompson, Quan Vuong, Tianhe Yu, et al. PaLM-E: An embodied multimodal language model. arXiv preprint arXiv:2303.03378, 2023.
[22] Kaituo Feng, Kaixiong Gong, Bohao Li, Zonghao Guo, Yibing Wang, Tianshuo Peng, Junfei Wu, Xiaoying Zhang, Benyou Wang, and Xiangyu Yue. Video-R1: Reinforcing video reasoning in MLLMs. arXiv preprint arXiv:2503.21776, 2025.
[23] Chaoyou Fu, Peixian Chen, Yunhang Shen, Yulei Qin, Mengdan Zhang, Xu Lin, Jinrui Yang, Xiawu Zheng, Ke Li, Xing Sun, et al. Mme: A comprehensive evaluation benchmark for multimodal large language models. In The Thirty-ninth Annual Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2025.
[24] Dong Guo, Zichen Liu, Weize Zhang, Yuxuan Zhou, Xiaozhi Wang, et al. DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025. Uses GRPO for reasoning RL.
[25] Agrim Gupta, Piotr Dollar, and Ross Girshick. Lvis: A dataset for large vocabulary instance segmentation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 5356–5364, 2019.
[26] Bo He, Hengduo Li, Young Kyun Jang, Menglin Jia, Xuefei Cao, Ashish Shah, Abhinav Shrivastava, and Ser-Nam Lim. Ma-lmm: Memory-augmented large multimodal model for long-term video understanding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13504–13514, 2024.
[27] Cheng-Yu Hsieh, Jieyu Zhang, Zixian Ma, Aniruddha Kembhavi, and Ranjay Krishna. Sugarcrepe: Fixing hackable benchmarks for vision-language compositionality. Advances in neural information processing systems, 36:31096–31116, 2023.
[28] Haifeng Huang, Xinyi Chen, Yilun Chen, Hao Li, Xiaoshen Han, Zehan Wang, Tai Wang, Jiangmiao Pang, and Zhou Zhao. Roboground: Robotic manipulation with grounded vision-language priors. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 22540–22550, 2025.
[29] Jiaqi Huang, Zunnan Xu, Jun Zhou, Ting Liu, Yicheng Xiao, Mingwen Ou, Bowen Ji, Xiu Li, and Kehong Yuan. Sam-r1: Leveraging sam for reward feedback in multimodal segmentation via reinforcement learning. arXiv preprint arXiv:2505.22596, 2025.
[30] Ziwei Ji, Nayeon Lee, Rita Frieske, Tiezheng Yu, Dan Su, Yan Xu, Etsuko Ishii, Ye Jin Bang, Andrea Madotto, and Pascale Fung. Survey of hallucination in natural language generation. ACM computing surveys, 55(12):1–38, 2023.
[31] Qing Jiang, Xingyu Chen, Zhaoyang Zeng, Junzhi Yu, and Lei Zhang. Rex-thinker: Grounded object referring via chain-of-thought reasoning. arXiv preprint arXiv:2506.04034, 2025.
[32] Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C Berg, Wan-Yen Lo, et al. Segment anything. In Proceedings of the IEEE/CVF international conference on computer vision, pages 4015–4026, 2023.
[33] Christoph F Kurz, Tatiana Merzhevich, Bjoern M Eskofier, Jakob Nikolas Kather, and Benjamin Gmeiner. Benchmarking vision-language models for diagnostics in emergency and critical care settings. npj Digital Medicine, 8(1):423,
[34] Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with pagedattention. In Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles, 2023.
[35] Xin Lai, Zhuotao Tian, Yukang Chen, Yanwei Li, Yuhui Yuan, Shu Liu, and Jiaya Jia. Lisa: Reasoning segmentation via large language model. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9579–9589, 2024.
[36] Sicong Leng, Hang Zhang, Guanzheng Chen, Xin Li, Shijian Lu, Chunyan Miao, and Lidong Bing. Mitigating object hallucinations in large vision-language models through visual contrastive decoding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13872–13882, 2024.
[37] Bohao Li, Yuying Ge, Yi Chen, Yixiao Ge, Ruimao Zhang, and Ying Shan. Seed-bench-2-plus: Benchmarking multimodal large language models with text-rich visual comprehension. arXiv preprint arXiv:2404.16790, 2024.
[38] Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Peiyuan Zhang, Yanwei Li, Ziwei Liu, et al. LLaVA-OneVision: Easy visual task transfer. arXiv preprint arXiv:2408.03326, 2024.
[39] Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In International conference on machine learning, pages 12888–12900. PMLR, 2022.
[40] Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In International conference on machine learning, pages 19730–19742. PMLR, 2023.
[41] Qingyao Li, Xinyi Dai, Xiangyang Li, Weinan Zhang, Yasheng Wang, Ruiming Tang, and Yong Yu. Codeprm: Execution feedback-enhanced process reward model for code generation. In Findings of the Association for Computational Linguistics: ACL 2025, pages 8169–8182, 2025.
[42] Wendi Li and Yixuan Li. Process reward model with q-value rankings. arXiv preprint arXiv:2410.11287, 2024.
[43] Yifan Li, Yifan Du, Kun Zhou, Jinpeng Wang, Wayne Xin Zhao, and Ji-Rong Wen. Evaluating object hallucination in large vision-language models. arXiv preprint arXiv:2305.10355, 2023.
[44] Hunter Lightman, Vineet Kosaraju, Yura Burda, Harri Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let’s verify step by step. arXiv preprint arXiv:2305.20050, 2023.
[45] Chang Liu, Henghui Ding, and Xudong Jiang. Gres: Generalized referring expression segmentation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 23592–23601, 2023.
[46] Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. Advances in neural information processing systems, 36:34892–34916, 2023.
[47] Hanchao Liu, Wenyuan Xue, Yifei Chen, Dapeng Chen, Xiutian Zhao, Ke Wang, Liping Hou, Rongjun Li, and Wei Peng. A survey on hallucination in large vision-language models. arXiv preprint arXiv:2402.00253, 2024.
[48] Yuliang Liu, Zhang Li, Mingxin Huang, Biao Yang, Wenwen Yu, Chunyuan Li, Xu-Cheng Yin, Cheng-Lin Liu, Lianwen Jin, and Xiang Bai. Ocrbench: on the hidden mystery of ocr in large multimodal models. Science China Information Sciences, 67(12):220102, 2024.
[49] Yuyuan Liu, Yuanhong Chen, Chong Wang, Junlin Han, Junde Wu, Can Peng, Jingkun Chen, Yu Tian, and Gustavo Carneiro. Auralsam2: Enabling sam2 hear through pyramid audio-visual feature prompting. arXiv preprint arXiv:2506.01015, 2025.
[50] Yuqi Liu, Bohao Peng, Zhisheng Zhong, Zihao Yue, Fanbin Lu, Bei Yu, and Jiaya Jia. Seg-zero: Reasoning-chain guided segmentation via cognitive reinforcement. arXiv preprint arXiv:2503.06520, 2025.
[51] Yuqi Liu, Tianyuan Qu, Zhisheng Zhong, Bohao Peng, Shu Liu, Bei Yu, and Jiaya Jia. Visionreasoner: Unified visual perception and reasoning via reinforcement learning. arXiv preprint arXiv:2505.12081, 2025.
[52] Ziyu Liu, Zeyi Sun, Yuhang Zang, Xiaoyi Dong, Yuhang Cao, Haodong Duan, Dahua Lin, and Jiaqi Wang. Visual-rft: Visual reinforcement fine-tuning. arXiv preprint arXiv:2503.01785, 2025.
[53] Chuofan Ma, Yi Jiang, Jiannan Wu, Zehuan Yuan, and Xiaojuan Qi. Groma: Localized visual tokenization for grounding multimodal large language models. In European Conference on Computer Vision, pages 417–435. Springer,
[54] Ahmed Masry, Xuan Long Do, Jia Qing Tan, Shafiq Joty, and Enamul Hoque. Chartqa: A benchmark for question answering about charts with visual and logical reasoning. In Findings of the association for computational linguistics: ACL 2022, pages 2263–2279, 2022.
[55] Youssef Mroueh, Nicolas Dupuis, Brian Belgodere, Apoorva Nitsure, Mattia Rigotti, Kristjan Greenewald, Jiri Navratil, Jerret Ross, and Jesus Rios. Revisiting group relative policy optimization: Insights into on-policy and off-policy training. arXiv preprint arXiv:2505.22257, 2025.
[56] Andrew Y Ng, Daishi Harada, and Stuart Russell. Policy invariance under reward transformations: Theory and application to reward shaping. In ICML, pages 278–287. Citeseer,
[57] Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. In Advances in Neural Information Processing Systems (NeurIPS), 2022.
[58] Roni Paiss, Ariel Ephrat, Omer Tov, Shiran Zada, Inbar Mosseri, Michal Irani, and Tali Dekel. Teaching clip to count to ten. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3170–3180, 2023.
[59] Jiazhen Pan, Che Liu, Junde Wu, Fenglin Liu, Jiayuan Zhu, Hongwei Bran Li, Chen Chen, Cheng Ouyang, and Daniel Rueckert. Medvlm-r1: Incentivizing medical reasoning capability of vision-language models via reinforcement learning. arXiv preprint arXiv:2502.19634, 2025.
[60] Renjie Pi, Lewei Yao, Jiahui Gao, Jipeng Zhang, and Tong Zhang. Perceptiongpt: Effectively fusing visual perception into llm. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 27124–27133, 2024.
[61] Shraman Pramanick, Guangxing Han, Rui Hou, Sayan Nag, Ser-Nam Lim, Nicolas Ballas, Qifan Wang, Rama Chellappa, and Amjad Almahairi. Jack of all tasks master of many: Designing general-purpose coarse-to-fine vision-language model. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14076–14088, 2024.
[62] Rui Qian, Xin Yin, and Dejing Dou. Reasoning to attend: Try to understand how <seg> token works. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025.
[63] Yanyuan Qiao, Haodong Hong, Wenqi Lyu, Dong An, Siqi Zhang, Yutong Xie, Xinyu Wang, and Qi Wu. Navbench: Probing multimodal large language models for embodied navigation. arXiv preprint arXiv:2506.01031, 2025.
[64] Hanoona Rasheed, Muhammad Maaz, Sahal Shaji, Abdelrahman Shaker, Salman Khan, Hisham Cholakkal, Rao M Anwer, Eric Xing, Ming-Hsuan Yang, and Fahad S Khan. Glamm: Pixel grounding large multimodal model. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13009–13018, 2024.
[65] Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman Rädle, Chloe Rolland, Laura Gustafson, Eric Mintun, Junting Pan, Kalyan Vasudev Alwala, Nicolas Carion, Chao-Yuan Wu, Ross Girshick, Piotr Dollár, and Christoph Feichtenhofer. Sam 2: Segment anything in images and videos. arXiv preprint arXiv:240...
[66] Zhongwei Ren, Zhicheng Huang, Yunchao Wei, Yao Zhao, Dongmei Fu, Jiashi Feng, and Xiaojie Jin. Pixellm: Pixel reasoning with large multimodal model. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 26374–26383, 2024.
[67] Anna Rohrbach, Lisa Anne Hendricks, Kaylee Burns, Trevor Darrell, and Kate Saenko. Object hallucination in image captioning. arXiv preprint arXiv:1809.02156, 2018.
[68] Zhihong Shao, Zihan Zhang, Xinyu Huang, Yushi Yang, Minghui Qiu, and Wayne Xin Zhao. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024. Introduces Group-Relative Policy Optimization (GRPO).
[69] Shuaijie She, Junxiao Liu, Yifeng Liu, Jiajun Chen, Xin Huang, and Shujian Huang. R-prm: Reasoning-driven process reward modeling. arXiv preprint arXiv:2503.21295,
[70] Haozhan Shen, Peng Liu, Jingcheng Li, Chunxin Fang, Yibo Ma, Jiajia Liao, Qiaoli Shen, Zilun Zhang, Kangjia Zhao, Qianqian Zhang, Ruochen Xu, and Tiancheng Zhao. Vlm-r1: A stable and generalizable r1-style large vision-language model. arXiv preprint arXiv:2504.07615, 2025.
[71] Guangming Sheng, Chi Zhang, Zilingfeng Ye, Xibin Wu, Wang Zhang, Ru Zhang, Yanghua Peng, Haibin Lin, and Chuan Wu. Hybridflow: A flexible and efficient rlhf framework. arXiv preprint arXiv:2409.19256, 2024.
[72] Chong Wang, Yuanhong Chen, Fengbei Liu, Yuyuan Liu, Davis James McCarthy, Helen Frazer, and Gustavo Carneiro. Mixture of gaussian-distributed prototypes with generative modelling for interpretable and trustworthy image recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025.
[73] Han Wang, Yongjie Ye, Yanjie Wang, Yuxiang Nie, and Can Huang. Elysium: Exploring object-level perception in videos via mllm. In European Conference on Computer Vision, pages 166–185. Springer, 2024.
[74] Liuyi Wang, Xinyuan Xia, Hui Zhao, Hanqing Wang, Tai Wang, Yilun Chen, Chengju Liu, Qijun Chen, and Jiangmiao Pang. Rethinking the embodied gap in vision-and-language navigation: A holistic study of physical and visual disparities. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9455–9465, 2025.
[75] Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, et al. Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution. arXiv preprint arXiv:2409.12191, 2024.
[76] Weiyun Wang, Zhangwei Gao, Lianjie Chen, Zhe Chen, Jinguo Zhu, Xiangyu Zhao, Yangzhou Liu, Yue Cao, Shenglong Ye, Xizhou Zhu, et al. Visualprm: An effective process reward model for multimodal reasoning. arXiv preprint arXiv:2503.10291, 2025.
[77] XuDong Wang, Shaolun Zhang, Shufan Li, Konstantinos Kallidromitis, Kehan Li, Yusuke Kato, Kazuki Kozuka, and Trevor Darrell. Segllm: Multi-round reasoning segmentation. arXiv preprint arXiv:2410.18923, 2024.
[78] Eric Wiewiora. Potential-based shaping and q-value initialization are equivalent. Journal of Artificial Intelligence Research, 19:205–208, 2003.
[79] Tim Windecker, Manthan Patel, Moritz Reuss, Richard Schwarzkopf, Cesar Cadena, Rudolf Lioutikov, Marco Hutter, and Jonas Frey. Navitrace: Evaluating embodied navigation of vision-language models. arXiv preprint arXiv:2510.26909, 2025.
[80] Chenfei Wu, Jiahao Li, Jingren Zhou, Junyang Lin, Kaiyuan Gao, Kun Yan, Sheng ming Yin, Shuai Bai, Xiao Xu, Yilei Chen, Yuxiang Chen, Zecheng Tang, Zekai Zhang, Zhengyi Wang, An Yang, Bowen Yu, Chen Cheng, Dayiheng Liu, Deqing Li, Hang Zhang, Hao Meng, Hu Wei, Jingyuan Ni, Kai Chen, Kuan Cao, Liang Peng, Lin Qu, Minggang Wu, Peng Wang, Shuting Yu, Tingkun... Qwen-image technical report, 2025.


[1] [1]

Flamingo: a visual language model for few-shot learning

Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, An- toine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning. Advances in neural information processing systems, 35: 23716–23736, 2022. 2

work page 2022

[2] [2]

Hindsight experience replay.Advances in neural informa- tion processing systems, 30, 2017

Marcin Andrychowicz, Filip Wolski, Alex Ray, Jonas Schneider, Rachel Fong, Peter Welinder, Bob McGrew, Josh Tobin, OpenAI Pieter Abbeel, and Wojciech Zaremba. Hindsight experience replay.Advances in neural informa- tion processing systems, 30, 2017. 3

work page 2017

[3] [3]

Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2. 5-vl technical report.arXiv preprint arXiv:2502.13923, 2025. 1, 2, 3, 5, 6, 7, 8

work page internal anchor Pith review Pith/arXiv arXiv 2025

[4] [4]

Univg- r1: Reasoning guided universal visual grounding with re- inforcement learning.arXiv preprint arXiv:2505.14231,

Sule Bai, Mingxing Li, Yong Liu, Jing Tang, Haoji Zhang, Lei Sun, Xiangxiang Chu, and Yansong Tang. Univg- r1: Reasoning guided universal visual grounding with re- inforcement learning.arXiv preprint arXiv:2505.14231,

work page arXiv

[5] [5]

One token to seg them all: Language-instructed reasoning segmentation in videos

Zechen Bai, Tong He, Haiyang Mei, et al. One token to seg them all: Language-instructed reasoning segmentation in videos. InNeurIPS, 2024. 3

work page 2024

[6] [6]

Hallucination of Multimodal Large Language Models: A Survey

Zechen Bai, Pichao Wang, Tianjun Xiao, Tong He, Zongbo Han, Zheng Zhang, and Mike Zheng Shou. Hallucination of multimodal large language models: A survey.arXiv preprint arXiv:2404.18930, 2024. 3

work page internal anchor Pith review Pith/arXiv arXiv 2024

[7] [7]

Perception tokens enhance visual reasoning in multimodal language models

Mohammad Bigverdi, Amanpreet Singh, et al. Perception tokens enhance visual reasoning in multimodal language models. InCVPR, 2025. 3

work page 2025

[8] [8]

Let there be a clock on the beach: Reducing object hal- lucination in image captioning

Ali Furkan Biten, Lluís Gómez, and Dimosthenis Karatzas. Let there be a clock on the beach: Reducing object hal- lucination in image captioning. InProceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 1381–1390, 2022. 2

work page 2022

[9] [9]

Reward machines for vision-based robotic manipulation

Alberto Camacho, Jacob Varley, Andy Zeng, Deepali Jain, Atil Iscen, and Dmitry Kalashnikov. Reward machines for vision-based robotic manipulation. In2021 IEEE Inter- national Conference on Robotics and Automation (ICRA), pages 14284–14290. IEEE, 2021. 3

work page 2021

[10] [10]

Ground-r1: Incentiviz- ing grounded visual reasoning via reinforcement learning

Meng Cao, Haoze Zhao, Can Zhang, Xiaojun Chang, Ian Reid, and Xiaodan Liang. Ground-r1: Incentiviz- ing grounded visual reasoning via reinforcement learning. arXiv preprint arXiv:2505.20272, 2025. 3

work page arXiv 2025

[11] [11]

Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling

Xiaokang Chen, Zhiyu Wu, Xingchao Liu, Zizheng Pan, Wen Liu, Zhenda Xie, Xingkai Yu, and Chong Ruan. Janus- pro: Unified multimodal understanding and generation with data and model scaling.arXiv preprint arXiv:2501.17811,

work page internal anchor Pith review Pith/arXiv arXiv

[12] [12]

Boosting reinforcement learning algorithms in continuous robotic reaching tasks using adaptive potential functions

Yifei Chen, Lambert Schomaker, and Francisco Cruz. Boosting reinforcement learning algorithms in continuous robotic reaching tasks using adaptive potential functions. InAustralasian Joint Conference on Artificial Intelligence, pages 52–64. Springer, 2024. 3

work page 2024

[13] [13]

Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling

Zhe Chen, Weiyun Wang, Yue Cao, Yangzhou Liu, Zhang- wei Gao, Erfei Cui, Jinguo Zhu, Shenglong Ye, Hao Tian, Zhaoyang Liu, et al. Expanding performance boundaries of open-source multimodal models with model, data, and test-time scaling.arXiv preprint arXiv:2412.05271, 2024. 2

work page internal anchor Pith review Pith/arXiv arXiv 2024

[14] [14]

Deep reinforcement learn- ing from human preferences.Advances in neural informa- tion processing systems, 30, 2017

Paul F Christiano, Jan Leike, Tom Brown, Miljan Martic, Shane Legg, and Dario Amodei. Deep reinforcement learn- ing from human preferences.Advances in neural informa- tion processing systems, 30, 2017. 2, 3

work page 2017

[15] [15]

Scaling instruction- finetuned language models.Journal of Machine Learning Research, 25(70):1–53, 2024

Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Yunxuan Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, et al. Scaling instruction- finetuned language models.Journal of Machine Learning Research, 25(70):1–53, 2024. 3

work page 2024

[16] [16]

Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blis- tein, Ori Ram, Dan Zhang, Evan Rosen, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodal- ity, long context, and next generation agentic capabilities. arXiv preprint arXiv:2507.06261, 2025. 1

work page internal anchor Pith review Pith/arXiv arXiv 2025

[17] [17]

Process supervision-guided policy optimiza- tion for code generation.arXiv preprint arXiv:2410.17621,

Ning Dai, Zheng Wu, Renjie Zheng, Ziyun Wei, Wen- lei Shi, Xing Jin, Guanlin Liu, Chen Dun, Liang Huang, and Lin Yan. Process supervision-guided policy optimiza- tion for code generation.arXiv preprint arXiv:2410.17621,

work page arXiv

[18] [18]

Molmo and pixmo: Open weights and open data for state-of-the-art vision-language models

Matt Deitke, Christopher Clark, Sangho Lee, Rohun Tri- pathi, Yue Yang, Jae Sung Park, Mohammadreza Salehi, Niklas Muennighoff, Kyle Lo, Luca Soldaini, et al. Molmo and pixmo: Open weights and open data for state-of-the-art vision-language models. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 91–104,

work page

[19] [19]

Potential-based difference rewards for multiagent reinforcement learning

Sam Devlin, Logan Yliniemi, Daniel Kudenko, and Kagan Tumer. Potential-based difference rewards for multiagent reinforcement learning. InProceedings of the 2014 inter- national conference on Autonomous agents and multi-agent systems, pages 165–172, 2014. 3

work page 2014

[20] [20]

Dynamic potential-based reward shaping

Sam Michael Devlin and Daniel Kudenko. Dynamic potential-based reward shaping. In11th International Con- ference on Autonomous Agents and Multiagent Systems (AAMAS 2012), pages 433–440. IFAAMAS, 2012. 3

work page 2012

[21] [21]

PaLM-E: An Embodied Multimodal Language Model

Danny Driess, Fei Xia, Mehdi SM Sajjadi, Corey Lynch, Aakanksha Chowdhery, Brian Ichter, Ayzaan Wahid, Jonathan Tompson, Quan Vuong, Tianhe Yu, et al. Palm-e: An embodied multimodal language model.arXiv preprint arXiv:2303.03378, 2023. 2

work page internal anchor Pith review Pith/arXiv arXiv 2023

[22] [22]

Video-R1: Reinforcing Video Reasoning in MLLMs

Kaituo Feng, Kaixiong Gong, Bohao Li, Zonghao Guo, Yibing Wang, Tianshuo Peng, Junfei Wu, Xiaoying Zhang, Benyou Wang, and Xiangyu Yue. Video-r1: Reinforcing video reasoning in mllms.arXiv preprint arXiv:2503.21776, 2025. 3

work page internal anchor Pith review Pith/arXiv arXiv 2025

[23] [23]

Mme: A comprehensive evaluation bench- mark for multimodal large language models

Chaoyou Fu, Peixian Chen, Yunhang Shen, Yulei Qin, Mengdan Zhang, Xu Lin, Jinrui Yang, Xiawu Zheng, Ke Li, Xing Sun, et al. Mme: A comprehensive evaluation bench- mark for multimodal large language models. InThe Thirty- ninth Annual Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2025. 5, 6

work page 2025

[24] [24]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

Dong Guo, Zichen Liu, Weize Zhang, Yuxuan Zhou, Xi- aozhi Wang, et al. Deepseek-r1: Incentivizing reasoning ca- pability in llms via reinforcement learning.arXiv preprint arXiv:2501.12948, 2025. Uses GRPO for reasoning RL. 3

work page internal anchor Pith review Pith/arXiv arXiv 2025

[25] [25]

Lvis: A dataset for large vocabulary instance segmentation

Agrim Gupta, Piotr Dollar, and Ross Girshick. Lvis: A dataset for large vocabulary instance segmentation. InPro- ceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 5356–5364, 2019. 5

work page 2019

[26] [26]

Ma-lmm: Memory-augmented large multimodal model for long-term video understanding

Bo He, Hengduo Li, Young Kyun Jang, Menglin Jia, Xuefei Cao, Ashish Shah, Abhinav Shrivastava, and Ser-Nam Lim. Ma-lmm: Memory-augmented large multimodal model for long-term video understanding. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13504–13514, 2024. 2

work page 2024

[27] [27]

Sugarcrepe: Fixing hackable benchmarks for vision-language compositional- ity.Advances in neural information processing systems, 36: 31096–31116, 2023

Cheng-Yu Hsieh, Jieyu Zhang, Zixian Ma, Aniruddha Kembhavi, and Ranjay Krishna. Sugarcrepe: Fixing hackable benchmarks for vision-language compositional- ity.Advances in neural information processing systems, 36: 31096–31116, 2023. 1

work page 2023

[28] [28]

Roboground: Robotic manipulation with grounded vision-language priors

Haifeng Huang, Xinyi Chen, Yilun Chen, Hao Li, Xiaoshen Han, Zehan Wang, Tai Wang, Jiangmiao Pang, and Zhou Zhao. Roboground: Robotic manipulation with grounded vision-language priors. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 22540– 22550, 2025. 1, 3

work page 2025

[29] [29]

Sam-r1: Leveraging sam for reward feedback in multi- modal segmentation via reinforcement learning.arXiv preprint arXiv:2505.22596, 2025

Jiaqi Huang, Zunnan Xu, Jun Zhou, Ting Liu, Yicheng Xiao, Mingwen Ou, Bowen Ji, Xiu Li, and Kehong Yuan. Sam-r1: Leveraging sam for reward feedback in multi- modal segmentation via reinforcement learning.arXiv preprint arXiv:2505.22596, 2025. 1, 3

work page arXiv 2025

[30] [30]

Survey of hallucination in natural language generation.ACM computing surveys, 55(12):1–38, 2023

Ziwei Ji, Nayeon Lee, Rita Frieske, Tiezheng Yu, Dan Su, Yan Xu, Etsuko Ishii, Ye Jin Bang, Andrea Madotto, and Pascale Fung. Survey of hallucination in natural language generation.ACM computing surveys, 55(12):1–38, 2023. 1, 2

work page 2023

[31] [31]

Rex-thinker: Grounded object re- ferring via chain-of-thought reasoning.arXiv preprint arXiv:2506.04034, 2025

Qing Jiang, Xingyu Chen, Zhaoyang Zeng, Junzhi Yu, and Lei Zhang. Rex-thinker: Grounded object re- ferring via chain-of-thought reasoning.arXiv preprint arXiv:2506.04034, 2025. 3

work page arXiv 2025

[32] [32]

Segment anything

Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C Berg, Wan-Yen Lo, et al. Segment anything. InProceedings of the IEEE/CVF international conference on computer vision, pages 4015–4026, 2023. 3

work page 2023

[33] [33]

Benchmark- ing vision-language models for diagnostics in emergency and critical care settings.npj Digital Medicine, 8(1):423,

Christoph F Kurz, Tatiana Merzhevich, Bjoern M Eskofier, Jakob Nikolas Kather, and Benjamin Gmeiner. Benchmark- ing vision-language models for diagnostics in emergency and critical care settings.npj Digital Medicine, 8(1):423,

work page

[34] [34]

Gonzalez, Hao Zhang, and Ion Stoica

Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with pagedattention. InPro- ceedings of the ACM SIGOPS 29th Symposium on Operat- ing Systems Principles, 2023. 5

work page 2023

[35] [35]

Lisa: Reasoning segmentation via large language model

Xin Lai, Zhuotao Tian, Yukang Chen, Yanwei Li, Yuhui Yuan, Shu Liu, and Jiaya Jia. Lisa: Reasoning segmentation via large language model. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9579–9589, 2024. 1, 2, 3, 5, 6, 7, 8

work page 2024

[36] [36]

Mitigating object hallucinations in large vision-language models through vi- sual contrastive decoding

Sicong Leng, Hang Zhang, Guanzheng Chen, Xin Li, Shi- jian Lu, Chunyan Miao, and Lidong Bing. Mitigating object hallucinations in large vision-language models through vi- sual contrastive decoding. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13872–13882, 2024. 1, 2

work page 2024

[37] [37]

Seed-bench-2-plus: Benchmarking multi- modal large language models with text-rich visual compre- hension.arXiv preprint arXiv:2404.16790, 2024

Bohao Li, Yuying Ge, Yi Chen, Yixiao Ge, Ruimao Zhang, and Ying Shan. Seed-bench-2-plus: Benchmarking multi- modal large language models with text-rich visual compre- hension.arXiv preprint arXiv:2404.16790, 2024. 5, 6, 7

work page arXiv 2024

[38] [38]

LLaVA-OneVision: Easy Visual Task Transfer

Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Peiyuan Zhang, Yanwei Li, Ziwei Liu, et al. Llava-onevision: Easy visual task transfer. arXiv preprint arXiv:2408.03326, 2024. 6

work page internal anchor Pith review Pith/arXiv arXiv 2024

[39] [39]

Blip: Bootstrapping language-image pre-training for uni- fied vision-language understanding and generation

Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. Blip: Bootstrapping language-image pre-training for uni- fied vision-language understanding and generation. InIn- ternational conference on machine learning, pages 12888– 12900. PMLR, 2022. 1

work page 2022

[40] [40]

Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models

Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. InIn- ternational conference on machine learning, pages 19730– 19742. PMLR, 2023. 2

work page 2023

[41] [41]

Codeprm: Execution feedback-enhanced process reward model for code generation

Qingyao Li, Xinyi Dai, Xiangyang Li, Weinan Zhang, Yasheng Wang, Ruiming Tang, and Yong Yu. Codeprm: Execution feedback-enhanced process reward model for code generation. InFindings of the Association for Com- putational Linguistics: ACL 2025, pages 8169–8182, 2025. 2, 3

work page 2025

[42] [42]

Process reward model with q- value rankings.arXiv preprint arXiv:2410.11287, 2024

Wendi Li and Yixuan Li. Process reward model with q- value rankings.arXiv preprint arXiv:2410.11287, 2024. 3

work page arXiv 2024

[43] [43]

Evaluating Object Hallucination in Large Vision-Language Models

Yifan Li, Yifan Du, Kun Zhou, Jinpeng Wang, Wayne Xin Zhao, and Ji-Rong Wen. Evaluating object hallucina- tion in large vision-language models.arXiv preprint arXiv:2305.10355, 2023. 5, 6

work page internal anchor Pith review Pith/arXiv arXiv 2023

[44] [44]

Let's Verify Step by Step

Hunter Lightman, Vineet Kosaraju, Yura Burda, Harri Ed- wards, Bowen Baker, Teddy Lee, Jan Leike, John Schul- man, Ilya Sutskever, and Karl Cobbe. Let’s verify step by step.arXiv preprint arXiv:2305.20050, 2023. 2

work page internal anchor Pith review Pith/arXiv arXiv 2023

[45] [45]

Gres: Gener- alized referring expression segmentation

Chang Liu, Henghui Ding, and Xudong Jiang. Gres: Gener- alized referring expression segmentation. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 23592–23601, 2023. 5

work page 2023

[46] [46]

Visual instruction tuning.Advances in neural infor- mation processing systems, 36:34892–34916, 2023

Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning.Advances in neural infor- mation processing systems, 36:34892–34916, 2023. 1, 2

work page 2023

[47] [47]

A Survey on Hallucination in Large Vision-Language Models

Hanchao Liu, Wenyuan Xue, Yifei Chen, Dapeng Chen, Xiutian Zhao, Ke Wang, Liping Hou, Rongjun Li, and Wei Peng. A survey on hallucination in large vision-language models.arXiv preprint arXiv:2402.00253, 2024. 1, 2

work page internal anchor Pith review Pith/arXiv arXiv 2024

[48] [48]

Ocrbench: on the hidden mys- tery of ocr in large multimodal models.Science China In- formation Sciences, 67(12):220102, 2024

Yuliang Liu, Zhang Li, Mingxin Huang, Biao Yang, Wen- wen Yu, Chunyuan Li, Xu-Cheng Yin, Cheng-Lin Liu, Lianwen Jin, and Xiang Bai. Ocrbench: on the hidden mys- tery of ocr in large multimodal models.Science China In- formation Sciences, 67(12):220102, 2024. 5, 6

work page 2024

[49] [49]

AuralSAM2: Enabling SAM2 Hear Through Pyramid Audio-Visual Feature Prompting

Yuyuan Liu, Yuanhong Chen, Chong Wang, Junlin Han, Junde Wu, Can Peng, Jingkun Chen, Yu Tian, and Gus- tavo Carneiro. Auralsam2: Enabling sam2 hear through pyramid audio-visual feature prompting.arXiv preprint arXiv:2506.01015, 2025. 3

work page internal anchor Pith review Pith/arXiv arXiv 2025

[50] Yuqi Liu, Bohao Peng, Zhisheng Zhong, Zihao Yue, Fanbin Lu, Bei Yu, and Jiaya Jia. Seg-zero: Reasoning-chain guided segmentation via cognitive reinforcement. arXiv preprint arXiv:2503.06520, 2025.

[51] Yuqi Liu, Tianyuan Qu, Zhisheng Zhong, Bohao Peng, Shu Liu, Bei Yu, and Jiaya Jia. Visionreasoner: Unified visual perception and reasoning via reinforcement learning. arXiv preprint arXiv:2505.12081, 2025.

[52] Ziyu Liu, Zeyi Sun, Yuhang Zang, Xiaoyi Dong, Yuhang Cao, Haodong Duan, Dahua Lin, and Jiaqi Wang. Visual-rft: Visual reinforcement fine-tuning. arXiv preprint arXiv:2503.01785, 2025.

[53] Chuofan Ma, Yi Jiang, Jiannan Wu, Zehuan Yuan, and Xiaojuan Qi. Groma: Localized visual tokenization for grounding multimodal large language models. In European Conference on Computer Vision, pages 417–435. Springer,

[54] Ahmed Masry, Xuan Long Do, Jia Qing Tan, Shafiq Joty, and Enamul Hoque. Chartqa: A benchmark for question answering about charts with visual and logical reasoning. In Findings of the association for computational linguistics: ACL 2022, pages 2263–2279, 2022.

[55] Youssef Mroueh, Nicolas Dupuis, Brian Belgodere, Apoorva Nitsure, Mattia Rigotti, Kristjan Greenewald, Jiri Navratil, Jerret Ross, and Jesus Rios. Revisiting group relative policy optimization: Insights into on-policy and off-policy training. arXiv preprint arXiv:2505.22257, 2025.

[56] Andrew Y Ng, Daishi Harada, and Stuart Russell. Policy invariance under reward transformations: Theory and application to reward shaping. In ICML, pages 278–287. Citeseer,

[57] Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. In Advances in Neural Information Processing Systems (NeurIPS), 2022.

[58] Roni Paiss, Ariel Ephrat, Omer Tov, Shiran Zada, Inbar Mosseri, Michal Irani, and Tali Dekel. Teaching clip to count to ten. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3170–3180, 2023.

[59] Jiazhen Pan, Che Liu, Junde Wu, Fenglin Liu, Jiayuan Zhu, Hongwei Bran Li, Chen Chen, Cheng Ouyang, and Daniel Rueckert. Medvlm-r1: Incentivizing medical reasoning capability of vision-language models via reinforcement learning. arXiv preprint arXiv:2502.19634, 2025.

[60] Renjie Pi, Lewei Yao, Jiahui Gao, Jipeng Zhang, and Tong Zhang. Perceptiongpt: Effectively fusing visual perception into llm. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 27124–27133, 2024.

[61] Shraman Pramanick, Guangxing Han, Rui Hou, Sayan Nag, Ser-Nam Lim, Nicolas Ballas, Qifan Wang, Rama Chellappa, and Amjad Almahairi. Jack of all tasks master of many: Designing general-purpose coarse-to-fine vision-language model. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14076–14088, 2024.

[62] Rui Qian, Xin Yin, and Dejing Dou. Reasoning to attend: Try to understand how <seg> token works. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025.

[63] Yanyuan Qiao, Haodong Hong, Wenqi Lyu, Dong An, Siqi Zhang, Yutong Xie, Xinyu Wang, and Qi Wu. Navbench: Probing multimodal large language models for embodied navigation. arXiv preprint arXiv:2506.01031, 2025.

[64] Hanoona Rasheed, Muhammad Maaz, Sahal Shaji, Abdelrahman Shaker, Salman Khan, Hisham Cholakkal, Rao M Anwer, Eric Xing, Ming-Hsuan Yang, and Fahad S Khan. Glamm: Pixel grounding large multimodal model. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13009–13018, 2024.

[65] Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman Rädle, Chloe Rolland, Laura Gustafson, Eric Mintun, Junting Pan, Kalyan Vasudev Alwala, Nicolas Carion, Chao-Yuan Wu, Ross Girshick, Piotr Dollár, and Christoph Feichtenhofer. Sam 2: Segment anything in images and videos. arXiv preprint arXiv:240...

[66] Zhongwei Ren, Zhicheng Huang, Yunchao Wei, Yao Zhao, Dongmei Fu, Jiashi Feng, and Xiaojie Jin. Pixellm: Pixel reasoning with large multimodal model. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 26374–26383, 2024.

[67] Anna Rohrbach, Lisa Anne Hendricks, Kaylee Burns, Trevor Darrell, and Kate Saenko. Object hallucination in image captioning. arXiv preprint arXiv:1809.02156, 2018.

[68] Zhihong Shao, Zihan Zhang, Xinyu Huang, Yushi Yang, Minghui Qiu, and Wayne Xin Zhao. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024. Introduces Group-Relative Policy Optimization (GRPO).

[69] Shuaijie She, Junxiao Liu, Yifeng Liu, Jiajun Chen, Xin Huang, and Shujian Huang. R-prm: Reasoning-driven process reward modeling. arXiv preprint arXiv:2503.21295,

[70] Haozhan Shen, Peng Liu, Jingcheng Li, Chunxin Fang, Yibo Ma, Jiajia Liao, Qiaoli Shen, Zilun Zhang, Kangjia Zhao, Qianqian Zhang, Ruochen Xu, and Tiancheng Zhao. Vlm-r1: A stable and generalizable r1-style large vision-language model. arXiv preprint arXiv:2504.07615, 2025.

[71] Guangming Sheng, Chi Zhang, Zilingfeng Ye, Xibin Wu, Wang Zhang, Ru Zhang, Yanghua Peng, Haibin Lin, and Chuan Wu. Hybridflow: A flexible and efficient rlhf framework. arXiv preprint arXiv:2409.19256, 2024.

[72] Chong Wang, Yuanhong Chen, Fengbei Liu, Yuyuan Liu, Davis James McCarthy, Helen Frazer, and Gustavo Carneiro. Mixture of gaussian-distributed prototypes with generative modelling for interpretable and trustworthy image recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025.

[73] Han Wang, Yongjie Ye, Yanjie Wang, Yuxiang Nie, and Can Huang. Elysium: Exploring object-level perception in videos via mllm. In European Conference on Computer Vision, pages 166–185. Springer, 2024.

[74] Liuyi Wang, Xinyuan Xia, Hui Zhao, Hanqing Wang, Tai Wang, Yilun Chen, Chengju Liu, Qijun Chen, and Jiangmiao Pang. Rethinking the embodied gap in vision-and-language navigation: A holistic study of physical and visual disparities. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9455–9465, 2025.

[75] Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, et al. Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution. arXiv preprint arXiv:2409.12191, 2024.

[76] Weiyun Wang, Zhangwei Gao, Lianjie Chen, Zhe Chen, Jinguo Zhu, Xiangyu Zhao, Yangzhou Liu, Yue Cao, Shenglong Ye, Xizhou Zhu, et al. Visualprm: An effective process reward model for multimodal reasoning. arXiv preprint arXiv:2503.10291, 2025.

[77] XuDong Wang, Shaolun Zhang, Shufan Li, Konstantinos Kallidromitis, Kehan Li, Yusuke Kato, Kazuki Kozuka, and Trevor Darrell. Segllm: Multi-round reasoning segmentation. arXiv preprint arXiv:2410.18923, 2024.

[78] Eric Wiewiora. Potential-based shaping and q-value initialization are equivalent. Journal of Artificial Intelligence Research, 19:205–208, 2003.

[79] Tim Windecker, Manthan Patel, Moritz Reuss, Richard Schwarzkopf, Cesar Cadena, Rudolf Lioutikov, Marco Hutter, and Jonas Frey. Navitrace: Evaluating embodied navigation of vision-language models. arXiv preprint arXiv:2510.26909, 2025.

[80] Chenfei Wu, Jiahao Li, Jingren Zhou, Junyang Lin, Kaiyuan Gao, Kun Yan, Sheng ming Yin, Shuai Bai, Xiao Xu, Yilei Chen, Yuxiang Chen, Zecheng Tang, Zekai Zhang, Zhengyi Wang, An Yang, Bowen Yu, Chen Cheng, Dayiheng Liu, Deqing Li, Hang Zhang, Hao Meng, Hu Wei, Jingyuan Ni, Kai Chen, Kuan Cao, Liang Peng, Lin Qu, Minggang Wu, Peng Wang, Shuting Yu, Tingkun... Qwen-image technical report, 2025.