From Failure to Feedback: Group Revision Unlocks Hard Cases in Object-Level Grounding
Pith reviewed 2026-05-20 18:35 UTC · model grok-4.3
The pith
Group revision turns initial failures into shaped feedback signals that improve reinforcement learning for object grounding in vision-language models.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The group-revision optimisation paradigm enhances learning on hard cases by generating revised candidates, quantifying each candidate's improvement over the initial attempt through a consolidation process, and using the resulting signals to refine the reward and modulate the advantage, thereby amplifying the influence of high-quality revisions.
What carries the argument
The consolidation process that quantifies improvement over the initial response and converts it into reward-shaping signals for refining rewards and modulating advantages.
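To make the mechanism concrete, here is a minimal sketch of how improvement-based shaping could feed a group-normalised advantage. It is an illustration under stated assumptions, not the paper's formulation: the function name `shaped_advantages`, the additive shaping term, the `shaping_weight` parameter, and the `1 + max(delta, 0)` modulation rule are all hypothetical choices made for this example.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def shaped_advantages(
    initial_quality: float,
    revised_quality: Sequence[float],
    base_rewards: Sequence[float],
    shaping_weight: float = 1.0,  # hypothetical scaling factor
    eps: float = 1e-6,
) -> List[float]:
    """Hypothetical sketch: shape rewards by improvement, then modulate advantages."""
    if len(revised_quality) != len(base_rewards):
        raise ValueError("one base reward is required per revised candidate")

    # Improvement of each revision over the initial attempt (e.g. an IoU delta).
    deltas = [q - initial_quality for q in revised_quality]

    # Reward refinement (assumed additive): base reward plus weighted improvement.
    shaped = [r + shaping_weight * d for r, d in zip(base_rewards, deltas)]

    # Group-relative normalisation over the revised candidates.
    mu, sigma = mean(shaped), pstdev(shaped)
    advantages = [(s - mu) / (sigma + eps) for s in shaped]

    # Advantage modulation (assumed rule): amplify candidates that improved.
    return [a * (1.0 + max(d, 0.0)) for a, d in zip(advantages, deltas)]


# Every candidate misses a hypothetical success threshold, so all base rewards are 0,
# yet the improvement deltas still separate the candidates.
example = shaped_advantages(0.10, [0.05, 0.30, 0.45, 0.10], [0.0, 0.0, 0.0, 0.0])
```

In this toy setting the third candidate, which improves most over the initial attempt, receives the largest advantage even though no candidate earns a base reward.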
If this is right
- Consistent gains on referring and reasoning segmentation tasks compared with prior GRPO models.
- Improved results on referring expression comprehension and counting benchmarks.
- Stronger learning signals specifically for scenarios where all initial responses fail.
- More effective modulation of advantage estimates in reinforcement learning for grounding problems.
Where Pith is reading between the lines
- The revision-and-measurement loop could extend to other sparse-reward reinforcement learning settings outside vision-language grounding.
- Self-generated revisions might reduce reliance on external feedback in broader model alignment tasks.
- Applying the same consolidation idea to multi-turn visual reasoning could address harder compositional cases.
Load-bearing premise
The process of measuring improvement over the first response produces reliable shaping signals that strengthen good revisions without adding bias or noise to the advantage estimates.
What would settle it
Running the method on the same hard-case benchmarks and finding no improvement or worse performance than standard GRPO would show the shaping signals do not deliver the claimed benefit.
Original abstract
Finetuning Large Vision-Language Models with reinforcement learning has emerged as a promising approach to enhance their capability in object-level grounding. However, existing methods, mainly based on GRPO, assign rewards at the response level. Such sparse reward, often criterion-induced, leads to minimal learning signals when all candidate responses fail in challenging scenarios. In this work, we propose a group-revision optimisation paradigm that enhances learning on hard cases. It begins with a sampled initial response and generates a set of revised candidates to explore improved grounding outcomes. Inspired by reward shaping, we introduce a consolidation process that quantifies each candidate's improvement over the initial attempt and converts it into informative shaping signals. These signals are used to both refine the reward and modulate the advantage, amplifying the influence of high-quality revisions. Our method achieves consistent gains across referring and reasoning segmentation, REC, and counting benchmarks compared with prior GRPO-based models. Our code is available at https://github.com/yyliu01/GroupRevision.
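The sparse-reward problem described in the abstract can be seen directly in the group-relative advantage used by GRPO, which standardises each reward against the mean and standard deviation of its group. The sketch below is illustrative only: the function name and the `eps` stabiliser are assumptions, and implementations differ on details such as population versus sample standard deviation.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def group_relative_advantages(rewards: Sequence[float], eps: float = 1e-6) -> List[float]:
    """Standardise each reward against its group's mean and standard deviation."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption for this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hard case: every candidate fails the criterion, so every reward is identical.
all_fail = group_relative_advantages([0.0, 0.0, 0.0, 0.0])

# Easier case: one candidate succeeds, so the group carries a learning signal.
mixed = group_relative_advantages([0.0, 0.0, 1.0, 0.0])
```

When all rewards in a group are equal, every advantage is zero and the policy gradient for that prompt vanishes, which is the gap the group-revision paradigm targets.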
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes a group-revision optimisation paradigm for RL-based finetuning of large vision-language models on object-level grounding. It samples an initial response, generates revised candidates, applies a consolidation process to quantify improvement over the initial attempt, and converts these into shaping signals that refine the reward and modulate the advantage within a GRPO-style framework. The goal is to provide denser learning signals on hard cases where all candidates fail. The authors report consistent gains over prior GRPO-based models on referring/reasoning segmentation, REC, and counting benchmarks, and release their code.
Significance. If the consolidation process produces reliable, unbiased shaping signals that correlate with grounding quality, the approach could meaningfully extend reward-shaping ideas to group-based revision in vision-language RL, particularly for sparse-reward hard cases. The public code release supports reproducibility and is a positive contribution. Significance is currently limited by the absence of detailed validation for the key shaping mechanism and full experimental controls.
major comments (2)
- [Method] Method section (consolidation process): The central claim requires that the (unspecified) quantification metric and aggregation rule convert improvement deltas into shaping signals that amplify high-quality revisions rather than introduce bias or noise into advantage estimates. No explicit description, normalization, or debiasing procedure is provided, leaving open the risk that simple deltas overweight recoveries from the initial failure mode while under-weighting smaller but superior gains.
- [Experiments] Experiments section: The reported benchmark gains cannot be fully assessed without data splits, ablation studies isolating the consolidation scaling factors, and complete tables showing per-task metrics against strong GRPO baselines. The abstract-level claims of 'consistent gains' therefore rest on incomplete evidence.
minor comments (2)
- [Method] Notation for the shaping signal and advantage modulation should be formalized with an equation rather than prose description to improve clarity.
- [Experiments] Figure captions and axis labels in the results figures would benefit from explicit mention of the exact metrics (e.g., IoU thresholds) used for the reported scores.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address each major comment point by point below and indicate the revisions made to strengthen the presentation of the consolidation process and experimental evidence.
- Referee: [Method] Method section (consolidation process): The central claim requires that the (unspecified) quantification metric and aggregation rule convert improvement deltas into shaping signals that amplify high-quality revisions rather than introduce bias or noise into advantage estimates. No explicit description, normalization, or debiasing procedure is provided, leaving open the risk that simple deltas overweight recoveries from the initial failure mode while under-weighting smaller but superior gains.
  Authors: We agree that the original description of the consolidation process was high-level and lacked sufficient detail on the quantification metric, aggregation rule, normalization, and debiasing. In the revised manuscript we have added an expanded subsection in the Method section that explicitly defines the improvement quantification (using task-specific grounding metrics such as IoU for segmentation and accuracy for REC/counting), the aggregation function that converts per-candidate deltas into shaping signals, the normalization procedure applied to the deltas, and a debiasing step that down-weights recoveries from the initial failure mode relative to smaller but higher-quality gains. Mathematical formulations for the shaped reward and modulated advantage are now included to clarify how bias and noise are controlled within the GRPO framework.
  Revision: yes
- Referee: [Experiments] Experiments section: The reported benchmark gains cannot be fully assessed without data splits, ablation studies isolating the consolidation scaling factors, and complete tables showing per-task metrics against strong GRPO baselines. The abstract-level claims of 'consistent gains' therefore rest on incomplete evidence.
  Authors: We acknowledge that the original Experiments section omitted explicit data-split details, targeted ablations on consolidation scaling factors, and full per-task comparison tables. In the revision we have added: (i) a clear description of the training and evaluation splits for each benchmark, (ii) new ablation tables that isolate the contribution of the consolidation scaling factors, and (iii) expanded result tables that report per-task metrics against the strongest GRPO baselines. These additions provide the granular evidence needed to substantiate the reported gains.
  Revision: yes
Circularity Check
No significant circularity: empirical method with independent consolidation signals
The paper introduces a group-revision paradigm for RL finetuning of vision-language models on hard grounding cases. The consolidation process quantifies improvement over an initial response using external metrics (e.g., IoU or accuracy deltas) to generate shaping signals for reward and advantage modulation. This is a designed heuristic, not a self-definitional loop or fitted parameter renamed as prediction. No load-bearing self-citation, uniqueness theorem, or ansatz smuggling is present in the described derivation. The central claim rests on empirical gains across benchmarks rather than reducing to its own inputs by construction. The approach is self-contained against external evaluation.
Axiom & Free-Parameter Ledger
free parameters (2)
- revision generation parameters
- consolidation scaling factors
axioms (1)
- domain assumption: Improvement over an initial response can be reliably quantified and used as a shaping signal without introducing systematic bias.
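As a point of reference for this premise, the classical potential-based form of reward shaping is the standard setting in which a shaping term is known to leave optimal policies unchanged. The block below states that form only; this page does not say whether the paper's consolidation signal satisfies it, and the symbols are the conventional ones rather than the paper's notation.

```latex
r'(s, a, s') = r(s, a, s') + F(s, a, s'), \qquad F(s, a, s') = \gamma\,\Phi(s') - \Phi(s)
```

Here \Phi is a potential function over states and \gamma is the discount factor. Shaping terms that fall outside this form can change which policy is optimal, which is why the assumption above carries weight.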
Lean theorems connected to this paper
- IndisputableMonolith/CostJcost uniqueness / washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
consolidation process that quantifies each candidate's improvement over the initial attempt and converts it into informative shaping signals... alignment cost... Φ(s_shape,i) := 1/|A| Σ e_m,n with e = 1/3[(1-IoU) + f_L1(box) + f_L1(point)]
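The alignment cost quoted above can be sketched in code. This is a hedged illustration, not the paper's implementation: it assumes IoU is computed on axis-aligned boxes, that coordinates are normalised to [0, 1], that f_L1 is a mean absolute difference, and that A is a list of already-matched prediction/target pairs. The names `Grounding`, `pair_cost`, and `alignment_cost` are hypothetical.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), assumed normalised to [0, 1]
Point = Tuple[float, float]              # (x, y), assumed normalised to [0, 1]


@dataclass(frozen=True)
class Grounding:
    box: Box
    point: Point


def box_iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def mean_l1(u: Sequence[float], v: Sequence[float]) -> float:
    """Mean absolute difference; stands in for f_L1 (an assumption)."""
    return sum(abs(x - y) for x, y in zip(u, v)) / len(u)


def pair_cost(pred: Grounding, target: Grounding) -> float:
    """e = 1/3 [(1 - IoU) + f_L1(box) + f_L1(point)] for one matched pair."""
    return (
        (1.0 - box_iou(pred.box, target.box))
        + mean_l1(pred.box, target.box)
        + mean_l1(pred.point, target.point)
    ) / 3.0


def alignment_cost(matched_pairs: Sequence[Tuple[Grounding, Grounding]]) -> float:
    """Phi = (1 / |A|) * sum of pair costs over the matched set A."""
    if not matched_pairs:
        raise ValueError("at least one matched pair is required")
    return sum(pair_cost(p, t) for p, t in matched_pairs) / len(matched_pairs)
```

A perfect prediction yields a cost of zero, and the cost grows as boxes and points drift from their targets, so a drop in cost between the initial response and a revision is one way an improvement could be quantified.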
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.