Faithful GRPO: Improving Visual Spatial Reasoning in Multimodal Language Models via Constrained Policy Optimization
Pith reviewed 2026-05-10 16:47 UTC · model grok-4.3
The pith
Adding consistency and grounding constraints to GRPO reduces CoT inconsistency from 24.5% to 1.7% and improves accuracy in multimodal spatial reasoning.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
FGRPO modifies standard GRPO by incorporating batch-level constraints on logical consistency, where the CoT must entail the final answer, and visual grounding, where each step accurately reflects image content, using Lagrangian dual ascent to adaptively balance these constraints during optimization. On Qwen2.5-VL models across seven benchmarks, this reduces the inconsistency rate from 24.5% to 1.7%, raises visual grounding scores by 13%, and improves accuracy over unconstrained GRPO.
What carries the argument
Batch-level consistency and grounding constraints enforced via Lagrangian dual ascent within the advantage computation of Group Relative Policy Optimization (GRPO).
Load-bearing premise
The proposed constraints on consistency and grounding can be enforced stably through Lagrangian dual ascent without causing optimization issues or needing extensive unreported tuning.
What would settle it
A replication on the same models and datasets where FGRPO fails to reduce inconsistency below 10% or does not improve accuracy would falsify the central claim.
Figures
read the original abstract
Multimodal reasoning models (MRMs) trained with reinforcement learning with verifiable rewards (RLVR) show improved accuracy on visual reasoning benchmarks. However, we observe that accuracy gains often come at the cost of reasoning quality: generated Chain-of-Thought (CoT) traces are frequently inconsistent with the final answer and poorly grounded in the visual evidence. We systematically study this phenomenon across seven challenging real-world spatial reasoning benchmarks and find that it affects contemporary MRMs such as ViGoRL-Spatial, TreeVGR as well as our own models trained with standard Group Relative Policy Optimization (GRPO). We characterize CoT reasoning quality along two complementary axes: "logical consistency" (does the CoT entail the final answer?) and "visual grounding" (does each reasoning step accurately describe objects, attributes, and spatial relationships in the image?). To address this, we propose Faithful GRPO (FGRPO), a variant of GRPO that enforces consistency and grounding as constraints via Lagrangian dual ascent. FGRPO incorporates batch-level consistency and grounding constraints into the advantage computation within a group, adaptively adjusting the relative importance of constraints during optimization. We evaluate FGRPO on Qwen2.5-VL-7B and 3B backbones across seven spatial datasets. Our results show that FGRPO substantially improves reasoning quality, reducing the inconsistency rate from 24.5% to 1.7% and improving visual grounding scores by +13%. It also improves final answer accuracy over simple GRPO, demonstrating that faithful reasoning enables better answers.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that standard GRPO training of multimodal reasoning models improves benchmark accuracy but degrades CoT quality, as measured by logical inconsistency with the final answer and poor visual grounding in spatial reasoning tasks. It introduces Faithful GRPO (FGRPO), which augments GRPO with batch-level consistency and grounding constraints enforced through Lagrangian dual ascent, adaptively weighting the constraints. On Qwen2.5-VL-7B and 3B models across seven spatial datasets, FGRPO is reported to reduce inconsistency from 24.5% to 1.7%, raise visual grounding scores by 13%, and improve final-answer accuracy relative to vanilla GRPO.
Significance. If the empirical results are robust, the work provides evidence that enforcing faithfulness constraints during RLVR can simultaneously improve both reasoning quality and task performance, addressing a documented failure mode in current multimodal reasoning models. The constrained-optimization framing is a natural extension of GRPO and could influence training practices for reliable visual reasoning systems.
major comments (3)
- [Abstract] Abstract: the inconsistency rate (24.5% to 1.7%) and visual-grounding improvement (+13%) are presented without any description of the underlying metrics, whether they are computed automatically or via human annotation, or any error bars, run-to-run variance, or statistical tests.
- [Methods] Methods section: no ablation isolates the Lagrangian dual-ascent component from other FGRPO changes (e.g., adaptive constraint weights or batch-level grouping), so it is impossible to attribute the reported gains specifically to the constrained optimization.
- [Experiments] Experiments: the manuscript provides no evidence that the consistency and grounding constraints actually bind during training, such as dual-variable trajectories, constraint-violation rates per batch, or sensitivity to the dual learning rate.
minor comments (1)
- [Methods] The description of how the advantage is modified by the dual variables would benefit from an explicit equation showing the constrained advantage formula.
Simulated Author's Rebuttal
We thank the referee for the constructive comments. We address each major point below and will revise the manuscript accordingly to improve clarity and completeness.
read point-by-point responses
-
Referee: [Abstract] Abstract: the inconsistency rate (24.5% to 1.7%) and visual-grounding improvement (+13%) are presented without any description of the underlying metrics, whether they are computed automatically or via human annotation, or any error bars, run-to-run variance, or statistical tests.
Authors: We agree that the abstract lacks sufficient detail on the metrics. The inconsistency rate is measured automatically via an entailment verifier that determines whether the CoT logically supports the final answer, while the visual grounding score is computed automatically through object-attribute and spatial-relation matching against image features. These definitions appear in Sections 3.2 and 4.1. We will revise the abstract to briefly describe the automatic nature of the metrics and will add references to error bars, run-to-run variance, and statistical tests for the reported figures in the experiments section. revision: yes
-
Referee: [Methods] Methods section: no ablation isolates the Lagrangian dual-ascent component from other FGRPO changes (e.g., adaptive constraint weights or batch-level grouping), so it is impossible to attribute the reported gains specifically to the constrained optimization.
Authors: The referee correctly notes the absence of an ablation that isolates Lagrangian dual ascent from the adaptive weighting and batch-level grouping. Although the methods section describes the integrated FGRPO formulation, a targeted ablation would clarify the contribution of the dual-ascent mechanism. We will add this ablation study in the revised manuscript. revision: yes
-
Referee: [Experiments] Experiments: the manuscript provides no evidence that the consistency and grounding constraints actually bind during training, such as dual-variable trajectories, constraint-violation rates per batch, or sensitivity to the dual learning rate.
Authors: We acknowledge that the experiments section does not present direct evidence of constraint binding, such as dual-variable trajectories or per-batch violation rates. We will incorporate these analyses, including plots of dual variables, constraint-violation statistics, and sensitivity to the dual learning rate, to demonstrate that the constraints are active during optimization. revision: yes
Circularity Check
No circularity: empirical method with external benchmark validation
full rationale
The paper proposes FGRPO as a constrained variant of GRPO using Lagrangian dual ascent to enforce batch-level consistency and grounding constraints during training. All central claims are empirical performance deltas (inconsistency rate drop from 24.5% to 1.7%, +13% grounding, accuracy gains) measured on held-out spatial reasoning benchmarks across Qwen2.5-VL backbones. No equations, predictions, or first-principles results are presented that reduce by construction to fitted parameters, self-defined quantities, or self-citations. The optimization procedure is a standard application of constrained policy optimization; success is falsifiable via external test sets rather than tautological. Minor self-citation of the authors' prior GRPO work is present but not load-bearing for the core claims.
Axiom & Free-Parameter Ledger
free parameters (1)
- adaptive constraint weights
axioms (1)
- domain assumption Lagrangian dual ascent can enforce logical consistency and visual grounding constraints while preserving GRPO's group-relative advantage structure
Reference graph
Works this paper leans on
-
[1]
Ilya Loshchilov and Frank Hutter
URLhttps://api.semanticscholar.org/CorpusID:259837088. Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. InInternational Conference on Learning Representations, 2017. URL https://api.semanticscholar.org/ CorpusID:53592270. Pan Lu, Hritik Bansal, Tony Xia, Jiacheng Liu, Chun yue Li, Hannaneh Hajishirzi, Hao Cheng, Kai-Wei Chang, Mich...
work page 2017
-
[2]
R e FT : Reasoning with reinforced fine-tuning
Accessed: 2025-11-14. Arijit Ray, Jiafei Duan, Ellis L Brown II, Reuben Tan, Dina Bashkirova, Rose Hendrix, Kiana Ehsani, Aniruddha Kembhavi, Bryan A. Plummer, Ranjay Krishna, Kuo-Hao Zeng, and Kate Saenko. SAT: Dynamic spatial aptitude training for multimodal language models. InSecond Conference on Language Modeling, 2025. URL https://openreview.net/foru...
-
[3]
InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models
URLhttps://api.semanticscholar.org/CorpusID:208158250. Jinguo Zhu, Weiyun Wang, Zhe Chen, Zhaoyang Liu, Shenglong Ye, Lixin Gu, Yuchen Duan, Hao Tian, Weijie Su, Jie Shao, Zhangwei Gao, et al. Internvl3: Exploring advanced training and test-time recipes for open-source multimodal models.ArXiv, abs/2504.10479,
work page internal anchor Pith review Pith/arXiv arXiv
-
[4]
URLhttps://api.semanticscholar.org/CorpusID:277780955. 14 Preprint. Under review. Appendix: Faithful GRPO: Improving Visual Spatial Reasoning in Multimodal Language Models via Constrained Policy Optimization A Training and Data Curation Details In this appendix we provide comprehensive details on the training pipeline, data curation, evaluation setup, and...
work page 2025
-
[5]
2.Ignoreall visual, spatial, numeric, or coordinate-based information
Evaluateonly the internal textual logicbetween the reasoning and the answer. 2.Ignoreall visual, spatial, numeric, or coordinate-based information. Treat references to image positions or coordinates as ordinary text, not evidence
-
[6]
Donotcheck factual accuracy with respect to the question or the real world
-
[7]
If the reasoning explicitly argues toward a conclusion and the final answer matches that conclusion, mark it asconsistenteven if the reasoning itself might be incorrect or uncertain
-
[8]
If the reasoning ends ambiguously, contradicts itself, or draws a different conclusion than the final answer, mark it asinconsistent
-
[9]
If the reasoning is too vague or incomplete to tell whether the answer follows, mark it asuncertain
-
[10]
If the reasoning shows best-effort deliberation (e.g., comparing options and making a justified choice), count that as consistent as long as the final answer matches the reasoning’s chosen option. Output strictly "YES" or "NO" only: -- "YES" if the final answer is logically consistent with the reasoning trace following the rules above. -- "NO" if the fina...
-
[11]
ENTITY GROUNDING: Named objects/people/entities are present and visible
-
[12]
ATTRIBUTE VERIFICATION: Claimed colors, sizes, counts, text content match the image(s)
-
[13]
match actual positions of referenced objects
SPATIAL RELATIONSHIP CHECK: Claimed left/right, above/below, inside, between, etc. match actual positions of referenced objects
-
[14]
BOUNDING BOX VERIFICATION: If coordinates like [x1,y1,x2,y2] are referenced, the region contains the described object and reasonably bounds it
-
[15]
IMPLICIT VISUAL CLAIMS: Conclusions depending on visual facts (counts, groupings, relative sizes) --- verify the underlying visual facts
-
[16]
MULTI-IMAGE REFERENCES: If the sentence refers to ‘image 1’, ‘image 2’, ‘the first image’, ‘the second image’, etc., verify the claim against the correct image. INCORRECT--- The sentence makes a visual claim that is factually inaccurate when checked against the image(s). Only mark INCORRECT if the core visual claim is wrong --- e.g., wrong object identity...
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.