arxiv: 2604.08476 · v1 · submitted 2026-04-09 · 💻 cs.CV · cs.AI

Faithful GRPO: Improving Visual Spatial Reasoning in Multimodal Language Models via Constrained Policy Optimization

Sai Srinivas Kancheti, Aditya Kanade, Rohit Sinha, Vineeth N Balasubramanian, Tanuja Ganu

Pith reviewed 2026-05-10 16:47 UTC · model grok-4.3

keywords faithful GRPO, visual spatial reasoning, multimodal models, constrained optimization, chain of thought, logical consistency, visual grounding, policy optimization

The pith

Adding consistency and grounding constraints to GRPO reduces CoT inconsistency from 24.5% to 1.7% and improves accuracy in multimodal spatial reasoning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Standard training with verifiable rewards boosts accuracy on visual spatial reasoning but produces Chain-of-Thought outputs that often fail to logically support the answer or match the image details. The authors introduce Faithful GRPO to add constraints for logical consistency and visual grounding, enforced adaptively with Lagrangian dual ascent during optimization. Experiments on seven real-world benchmarks with Qwen2.5-VL models show large gains in reasoning faithfulness and also higher final accuracy. Readers should care because this shows how to get both correct answers and trustworthy explanations from multimodal systems without sacrificing one for the other.

Core claim

FGRPO modifies standard GRPO by incorporating batch-level constraints on logical consistency, where the CoT must entail the final answer, and visual grounding, where each step accurately reflects image content, using Lagrangian dual ascent to adaptively balance these constraints during optimization. On Qwen2.5-VL models across seven benchmarks, this reduces the inconsistency rate from 24.5% to 1.7%, raises visual grounding scores by 13%, and improves accuracy over unconstrained GRPO.
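The multiplier update behind "Lagrangian dual ascent" can be sketched as follows. This is a minimal illustration of generic projected dual ascent, not the paper's exact rule: the class name `DualAscent`, the thresholds, and the dual learning rate are all hypothetical, since this page does not report them.

```python
from dataclasses import dataclass, field


@dataclass
class DualAscent:
    """Projected dual ascent on Lagrange multipliers (illustrative sketch).

    thresholds: assumed target level for each constraint's batch mean.
    dual_lr:    assumed dual learning rate (not taken from the paper).
    """
    thresholds: dict
    dual_lr: float = 0.05
    lambdas: dict = field(default_factory=dict)

    def __post_init__(self):
        # Every multiplier starts at zero: no penalty until a violation is seen.
        for name in self.thresholds:
            self.lambdas.setdefault(name, 0.0)

    def step(self, batch_means: dict) -> dict:
        """Update each multiplier from the batch mean of its constraint reward.

        A shortfall (mean below threshold) raises the multiplier; a surplus
        lowers it, and the max(0, .) projection keeps it non-negative.
        """
        for name, target in self.thresholds.items():
            violation = target - batch_means[name]
            updated = self.lambdas[name] + self.dual_lr * violation
            self.lambdas[name] = max(0.0, updated)
        return dict(self.lambdas)
```

Under this sketch, a constraint that is already satisfied contributes nothing once its multiplier reaches zero, which is one way to read the claim that the relative importance of constraints is adjusted adaptively.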

What carries the argument

Batch-level consistency and grounding constraints enforced via Lagrangian dual ascent within the advantage computation of Group Relative Policy Optimization (GRPO).
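One way to picture constraints entering the group-relative advantage is sketched below. It assumes each reward signal is standardized independently within a prompt's group of rollouts and then combined with the current multipliers; the function names, the additive combination, and the `eps` stabilizer are assumptions made for illustration, because the paper's own advantage equation is not reproduced on this page.

```python
import numpy as np


def group_normalize(rewards: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """Standardize rewards within each group (row = prompt, column = rollout)."""
    mean = rewards.mean(axis=1, keepdims=True)
    std = rewards.std(axis=1, keepdims=True)
    return (rewards - mean) / (std + eps)


def combined_advantage(task, consistency, grounding, lam_c, lam_g):
    """Add multiplier-weighted constraint advantages to the task advantage.

    Each signal is normalized on its own before mixing, so a signal with a
    large raw scale cannot drown out the others (assumed design, see lead-in).
    """
    a_task = group_normalize(np.asarray(task, dtype=float))
    a_cons = group_normalize(np.asarray(consistency, dtype=float))
    a_grnd = group_normalize(np.asarray(grounding, dtype=float))
    return a_task + lam_c * a_cons + lam_g * a_grnd
```

With both multipliers at zero this reduces to the ordinary group-normalized task advantage, so the constraints only reshape the update when a multiplier is positive.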

Load-bearing premise

The proposed constraints on consistency and grounding can be enforced stably through Lagrangian dual ascent without causing optimization issues or needing extensive unreported tuning.

What would settle it

A replication on the same models and datasets where FGRPO fails to reduce inconsistency below 10% or does not improve accuracy would falsify the central claim.

Figures

Figures reproduced from arXiv: 2604.08476 by Aditya Kanade, Rohit Sinha, Sai Srinivas Kancheti, Tanuja Ganu, Vineeth N Balasubramanian.

**Figure 1.** Unfaithful reasoning masked by correct answers. Both models answer correctly, but only FGRPO reasons faithfully. Left: The GRPO-Task model incorrectly claims there are no visible paths, contradicting its own answer of “1.0” (50% faithfulness, inconsistent). Right: The GRPO-Task model reasons toward “lamp” but answers “box” (33% faithfulness, inconsistent). FGRPO produces visually grounded reasoning in both…

**Figure 2.** Overview of the FGRPO training pipeline. We show advantage computation for a training batch with 3 samples, each with 2 rollouts. For each prompt-image pair $s_i$, the policy samples $G = 2$ rollouts and we compute the task reward $R_T$, the consistency reward $R_C$, and grounding rewards $R_S$ and $R_G$ (only $R_G$ is shown for clarity). The consistency and semantic-grounding rewards are provided by an online VLM judge. Eac…

**Figure 3.** Per-dataset reasoning quality breakdown. (a) Semantic grounding (S): FGRPO achieves uniformly higher semantic grounding than GRPO-T across all seven benchmarks (86.0% vs. 72.7%), with the largest gains on MindCube (+22.8 pp) and OmniSpatial (+21.1 pp). TreeVGR also outperforms GRPO-T (81.9%). (b) Inconsistency rate: FGRPO reduces inconsistency to 1.7% on average, compared to 26.1% (GRPO-T), 26.0% (TreeVGR…

**Figure 4.** FGRPO (squares) vs GRPO-T (circles); larger shape size indicates higher inconsistency.

… sacrifices accuracy (−0.2). Multiplicative gating ($R = 0.5 \cdot R_{\text{acc}} \cdot R_C + 0.5 \cdot R_{\text{fmt}} \cdot R_G$) fares worse: accuracy drops by 1.7 points while inconsistency remains at 19.6%. The bottom block uses FGRPO’s decoupled advantage formulation (Eq. 6), where each signal is independently normalized. Even with only the consistency const…

**Figure 5.** An overview of the two-stage training pipeline. We curate CoT training data using…

**Figure 6.** Prompt used for the consistency constraint evaluation. The judge receives only the question, reasoning trace, and final answer; no image is provided. It outputs YES (consistent) or NO (inconsistent).

C.3 Semantic Grounding Judge Prompt. The semantic grounding (faithfulness) constraint evaluates whether each reasoning sentence makes accurate visual claims when checked against the input image(s). The full prom…

**Figure 7.** Prompt used for the semantic grounding (faithfulness) constraint evaluation. The judge receives the image, question, accumulated reasoning context, and the latest sentence to evaluate. It outputs CORRECT, INCORRECT, or SKIP. The per-sample semantic grounding score S is the fraction of visual sentences (non-SKIP) that are CORRECT.

E Qualitative Examples. We present additional contrastive examples comparing G…

**Figure 8.** Training dynamics (7B FGRPO). Top: Lagrange multiplier trajectories. Bottom: constraint satisfaction over training steps.

… are color-coded: blue for grounded, orange for ungrounded, red for inconsistent reasoning and green for consistent reasoning. Question: Is the photo taken looking down from above or looking up from below? Options: A. Looking down from above B. Looking up from below. GRPO-Task FGRPO Faith…

**Figure 9.** Perspective estimation. GRPO-Task claims the photo was taken from above despite the upward-looking perspective of the wine glass, contradicting its own correct answer. FGRPO correctly identifies the low vantage point.

**Figure 10.** Navigation and signage. GRPO-Task misreads the “Entering” sign as an “Exiting” sign and contradicts its own answer. FGRPO correctly interprets the signage and produces consistent reasoning.

Question: Which object is closer? Options: A. The stop sign. B. The school zone sign. GRPO-Task: Faithfulness 44% · Inconsistent. FGRPO: Faithfulness 100% · Consistent. The image contains both a stop sign and a school zone…

**Figure 11.** Relative distance. GRPO-T incorrectly concludes the stop sign is closer, contradicting its answer of “school zone sign.” FGRPO correctly identifies relative depth from the scene layout.

**Figure 12.** Navigation and signage. GRPO-Task claims that the “Entering” sign does not lead to Terminal B, yet selects option A (yes) as the final answer, contradicting its reasoning. FGRPO, by contrast, correctly identifies the entry for Terminal B and answers yes.

**Figure 13.** Object counting. GRPO-Task claims two giraffes are present despite the image showing three, contradicting its correct answer. FGRPO accurately counts three giraffes with consistent reasoning.

**Figure 14.** Depth with bounding boxes. GRPO-Task claims the lamp (red box) is closer, contradicting its answer of “pillow” (blue box). FGRPO correctly reasons about relative depth using the bounding-box annotations.

**Figure 15.** Directional reasoning. GRPO-Task concludes the left lane is correct for reaching Porte de Vaise, contradicting its answer of “Yes” (right lane). FGRPO correctly reads the road sign and produces consistent reasoning.

**Figure 16.** Egocentric spatial reasoning. GRPO-Task reasons that turning right would move parallel to the bus rather than toward the door, contradicting its answer. FGRPO correctly reasons about the egocentric perspective change.

**Figure 17.** FGRPO responses on the eval set. We observe that FGRPO responds faithfully and exhibits both spatial groundedness and consistency.
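The two judge-based metrics described in the captions for Figures 6 and 7 can be sketched in a few lines. The verdict labels (YES/NO and CORRECT/INCORRECT/SKIP) come from those captions; the function names and the choice to return `None` when a trace has no visual sentences are assumptions made for this illustration.

```python
from typing import Iterable, Optional


def semantic_grounding_score(verdicts: Iterable[str]) -> Optional[float]:
    """Fraction of visual (non-SKIP) sentences judged CORRECT for one sample.

    Returns None when every sentence was SKIP; how the paper handles that
    case is not stated on this page, so this is an assumed convention.
    """
    visual = [v for v in verdicts if v != "SKIP"]
    if not visual:
        return None
    return sum(v == "CORRECT" for v in visual) / len(visual)


def inconsistency_rate(judgments: Iterable[str]) -> float:
    """Share of samples whose consistency judge answered NO."""
    judgments = list(judgments)
    if not judgments:
        raise ValueError("need at least one judgment")
    return sum(j == "NO" for j in judgments) / len(judgments)
```

For example, a trace judged `["CORRECT", "INCORRECT", "SKIP", "CORRECT"]` has two correct sentences out of three visual ones, so its score is 2/3.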

Original abstract

Multimodal reasoning models (MRMs) trained with reinforcement learning with verifiable rewards (RLVR) show improved accuracy on visual reasoning benchmarks. However, we observe that accuracy gains often come at the cost of reasoning quality: generated Chain-of-Thought (CoT) traces are frequently inconsistent with the final answer and poorly grounded in the visual evidence. We systematically study this phenomenon across seven challenging real-world spatial reasoning benchmarks and find that it affects contemporary MRMs such as ViGoRL-Spatial, TreeVGR as well as our own models trained with standard Group Relative Policy Optimization (GRPO). We characterize CoT reasoning quality along two complementary axes: "logical consistency" (does the CoT entail the final answer?) and "visual grounding" (does each reasoning step accurately describe objects, attributes, and spatial relationships in the image?). To address this, we propose Faithful GRPO (FGRPO), a variant of GRPO that enforces consistency and grounding as constraints via Lagrangian dual ascent. FGRPO incorporates batch-level consistency and grounding constraints into the advantage computation within a group, adaptively adjusting the relative importance of constraints during optimization. We evaluate FGRPO on Qwen2.5-VL-7B and 3B backbones across seven spatial datasets. Our results show that FGRPO substantially improves reasoning quality, reducing the inconsistency rate from 24.5% to 1.7% and improving visual grounding scores by +13%. It also improves final answer accuracy over simple GRPO, demonstrating that faithful reasoning enables better answers.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

Private letter to a colleague.

FGRPO adds Lagrangian constraints on logical consistency and visual grounding to GRPO for multimodal spatial reasoning, with reported sharp drops in inconsistency and accuracy gains, but the dual ascent mechanics lack reported validation.

The paper introduces Faithful GRPO, which enforces batch-level constraints for CoT logical consistency and visual grounding inside the GRPO advantage computation using Lagrangian dual ascent with adaptive weighting. This is the main new piece relative to standard GRPO and prior RLVR work on models like ViGoRL-Spatial and TreeVGR. The authors first document that accuracy improvements in these models often come with unfaithful reasoning, then show FGRPO reduces inconsistency from 24.5% to 1.7% and lifts grounding scores by 13% on seven spatial benchmarks while also improving final answer accuracy on Qwen2.5-VL-7B and 3B backbones.

The two-axis breakdown of reasoning quality is a clean way to measure the problem, and the empirical pattern that better faithfulness correlates with better answers is useful to see in practice. The adaptive constraint handling avoids the obvious risk of over-penalizing the policy.

The soft spots sit in the optimization details. The abstract and available description give no error bars, no explicit formulas or code for the inconsistency and grounding metrics, no ablations isolating the dual ascent from other GRPO changes, and no traces or statistics on whether the dual variables actually converge or bind during training. The stress-test point about stability and hyperparameter sensitivity is fair; without those checks it is hard to know if the gains are robust or reproducible.

This is for groups working on RL fine-tuning of vision-language models for spatial or visual reasoning tasks. Readers who care about constrained policy optimization or CoT faithfulness will get the most from it. I would send it to peer review because the problem is real, the approach is straightforward, and the numbers are large enough to warrant checking the missing controls.

Referee Report

3 major / 1 minor

Summary. The paper claims that standard GRPO training of multimodal reasoning models improves benchmark accuracy but degrades CoT quality, as measured by logical inconsistency with the final answer and poor visual grounding in spatial reasoning tasks. It introduces Faithful GRPO (FGRPO), which augments GRPO with batch-level consistency and grounding constraints enforced through Lagrangian dual ascent, adaptively weighting the constraints. On Qwen2.5-VL-7B and 3B models across seven spatial datasets, FGRPO is reported to reduce inconsistency from 24.5% to 1.7%, raise visual grounding scores by 13%, and improve final-answer accuracy relative to vanilla GRPO.

Significance. If the empirical results are robust, the work provides evidence that enforcing faithfulness constraints during RLVR can simultaneously improve both reasoning quality and task performance, addressing a documented failure mode in current multimodal reasoning models. The constrained-optimization framing is a natural extension of GRPO and could influence training practices for reliable visual reasoning systems.

major comments (3)

[Abstract] Abstract: the inconsistency rate (24.5% to 1.7%) and visual-grounding improvement (+13%) are presented without any description of the underlying metrics, whether they are computed automatically or via human annotation, or any error bars, run-to-run variance, or statistical tests.
[Methods] Methods section: no ablation isolates the Lagrangian dual-ascent component from other FGRPO changes (e.g., adaptive constraint weights or batch-level grouping), so it is impossible to attribute the reported gains specifically to the constrained optimization.
[Experiments] Experiments: the manuscript provides no evidence that the consistency and grounding constraints actually bind during training, such as dual-variable trajectories, constraint-violation rates per batch, or sensitivity to the dual learning rate.
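The error bars requested in the first major comment could be produced with a percentile bootstrap over per-sample judge outcomes. The sketch below is a generic illustration under stated assumptions: `bootstrap_ci` is a hypothetical helper, outcomes are coded as 1 for inconsistent and 0 for consistent, and no numbers from the paper are used.

```python
import random


def bootstrap_ci(outcomes, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the mean of binary outcomes.

    outcomes: list of 0/1 values (assumed coding: 1 = inconsistent sample).
    Returns (point_estimate, lower, upper).
    """
    if not outcomes:
        raise ValueError("need at least one outcome")
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    n = len(outcomes)
    means = []
    for _ in range(n_resamples):
        # Resample the evaluation set with replacement and record its mean.
        sample = [outcomes[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return sum(outcomes) / n, lower, upper
```

This captures sampling noise in the evaluation set only. Run-to-run variance across training seeds would need separate training runs, which a resampling sketch like this cannot substitute for.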

minor comments (1)

[Methods] The description of how the advantage is modified by the dual variables would benefit from an explicit equation showing the constrained advantage formula.
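For orientation, a generic Lagrangian form of the kind the comment asks for is written out below. It is a textbook constrained-policy-optimization template, not the paper's equation: the reward symbols $R_T$, $R_C$, $R_S$, $R_G$ follow the Figure 2 caption, while the thresholds $\tau_k$, the dual step size $\eta$, the stabilizer $\epsilon$, and the additive combination of per-signal advantages are assumptions.

```latex
\[
\begin{aligned}
&\max_{\theta}\ \mathbb{E}\left[R_T\right]
  \quad \text{s.t.} \quad \mathbb{E}\left[R_k\right] \ge \tau_k,
  \qquad k \in \{C, S, G\} \\
&\mathcal{L}(\theta, \lambda) = \mathbb{E}\left[R_T\right]
  + \sum_{k} \lambda_k \left(\mathbb{E}\left[R_k\right] - \tau_k\right),
  \qquad \lambda_k \ge 0 \\
&\hat{A}_{i,j} = \tilde{A}^{T}_{i,j} + \sum_{k} \lambda_k\, \tilde{A}^{k}_{i,j},
  \qquad
  \tilde{A}^{x}_{i,j} =
  \frac{R_x^{(i,j)} - \operatorname{mean}_{j} R_x^{(i,j)}}
       {\operatorname{std}_{j} R_x^{(i,j)} + \epsilon} \\
&\lambda_k \leftarrow \max\left(0,\ \lambda_k + \eta \left(\tau_k - \bar{R}_k\right)\right)
\end{aligned}
\]
```

Here $i$ indexes prompts, $j$ indexes rollouts within a group, and $\bar{R}_k$ is the batch mean of constraint reward $k$. The multiplier grows while the batch falls short of $\tau_k$ and decays toward zero once the constraint is met.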

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major point below and will revise the manuscript accordingly to improve clarity and completeness.

Referee: [Abstract] Abstract: the inconsistency rate (24.5% to 1.7%) and visual-grounding improvement (+13%) are presented without any description of the underlying metrics, whether they are computed automatically or via human annotation, or any error bars, run-to-run variance, or statistical tests.

Authors: We agree that the abstract lacks sufficient detail on the metrics. The inconsistency rate is measured automatically via an entailment verifier that determines whether the CoT logically supports the final answer, while the visual grounding score is computed automatically through object-attribute and spatial-relation matching against image features. These definitions appear in Sections 3.2 and 4.1. We will revise the abstract to briefly describe the automatic nature of the metrics and will add references to error bars, run-to-run variance, and statistical tests for the reported figures in the experiments section.

revision: yes

Referee: [Methods] Methods section: no ablation isolates the Lagrangian dual-ascent component from other FGRPO changes (e.g., adaptive constraint weights or batch-level grouping), so it is impossible to attribute the reported gains specifically to the constrained optimization.

Authors: The referee correctly notes the absence of an ablation that isolates Lagrangian dual ascent from the adaptive weighting and batch-level grouping. Although the methods section describes the integrated FGRPO formulation, a targeted ablation would clarify the contribution of the dual-ascent mechanism. We will add this ablation study in the revised manuscript.

revision: yes

Referee: [Experiments] Experiments: the manuscript provides no evidence that the consistency and grounding constraints actually bind during training, such as dual-variable trajectories, constraint-violation rates per batch, or sensitivity to the dual learning rate.

Authors: We acknowledge that the experiments section does not present direct evidence of constraint binding, such as dual-variable trajectories or per-batch violation rates. We will incorporate these analyses, including plots of dual variables, constraint-violation statistics, and sensitivity to the dual learning rate, to demonstrate that the constraints are active during optimization.

revision: yes
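The binding evidence promised in this response could be summarized from training logs roughly as follows. The sketch assumes that one multiplier value and one batch-mean constraint reward are logged per training step; `binding_summary`, its field names, and the tolerance are hypothetical and do not describe the authors' tooling.

```python
def binding_summary(lambdas, batch_means, threshold, tol=1e-8):
    """Summarize how often one constraint was violated and its multiplier active.

    lambdas:     multiplier value logged at each training step.
    batch_means: batch mean of the constraint reward at each step.
    threshold:   assumed target level the batch mean should reach.
    """
    if not lambdas or len(lambdas) != len(batch_means):
        raise ValueError("logs must be non-empty and of equal length")
    steps = len(lambdas)
    violated = sum(m < threshold for m in batch_means)
    active = sum(lam > tol for lam in lambdas)
    return {
        "violation_rate": violated / steps,  # share of steps below threshold
        "active_rate": active / steps,       # share of steps with lambda > 0
        "final_lambda": lambdas[-1],
        "max_lambda": max(lambdas),
    }
```

A constraint whose `active_rate` stays near zero for the whole run never influenced the update, which is the situation the referee's third comment asks the authors to rule out.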

Circularity Check

0 steps flagged

No circularity: empirical method with external benchmark validation

The paper proposes FGRPO as a constrained variant of GRPO using Lagrangian dual ascent to enforce batch-level consistency and grounding constraints during training. All central claims are empirical performance deltas (inconsistency rate drop from 24.5% to 1.7%, +13% grounding, accuracy gains) measured on held-out spatial reasoning benchmarks across Qwen2.5-VL backbones. No equations, predictions, or first-principles results are presented that reduce by construction to fitted parameters, self-defined quantities, or self-citations. The optimization procedure is a standard application of constrained policy optimization; success is falsifiable via external test sets rather than tautological. Minor self-citation of the authors' prior GRPO work is present but not load-bearing for the core claims.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The method rests on standard constrained RL assumptions plus the claim that batch-level constraints can be incorporated into GRPO advantage computation without destabilizing training.

free parameters (1)

adaptive constraint weights
The paper states that relative importance of consistency and grounding constraints is adjusted during optimization; no specific values or schedules are given in the abstract.

axioms (1)

domain assumption: Lagrangian dual ascent can enforce logical consistency and visual grounding constraints while preserving GRPO's group-relative advantage structure
Invoked when the authors describe incorporating constraints into advantage computation.
