Reward Design for Physical Reasoning in Vision-Language Models
Pith reviewed 2026-05-10 12:46 UTC · model grok-4.3
The pith
Accuracy-based rewards in GRPO training outperform supervised fine-tuning for vision-language models on most physical reasoning domains.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
GRPO with accuracy-based rewards outperforms SFT on most domains of PhyX. Reward design induces domain-specific reasoning behaviors rather than uniform improvements. The internal attention-weight reward improves spatial relation accuracy from 0.27 to 0.50 without requiring spatial annotations.
What carries the argument
The four reward signals of increasing semantic richness used inside GRPO training, with special focus on the novel internal attention-weight reward derived from the model's attention over input image regions.
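The source says only that the internal reward is derived from the model's attention over input image regions; it does not give a formula. As a minimal sketch under that limitation, one plausible reading is the average share of attention mass that generated tokens place on image-patch positions. The function name `image_attention_reward`, the head-averaged `attn_rows` input, and the `image_mask` layout are all hypothetical and are not taken from the paper.

```python
from typing import Sequence


def image_attention_reward(
    attn_rows: Sequence[Sequence[float]],
    image_mask: Sequence[bool],
) -> float:
    """Mean share of attention mass that generated tokens put on image positions.

    attn_rows[t][j]: attention weight from generated token t to input position j
                     (assumed already averaged over heads and layers).
    image_mask[j]:   True where input position j is an image patch.
    This formula is an illustrative assumption, not the paper's definition.
    """
    if not attn_rows:
        return 0.0
    shares = []
    for row in attn_rows:
        if len(row) != len(image_mask):
            raise ValueError("attention row and mask must have equal length")
        total = sum(row)
        on_image = sum(w for w, is_img in zip(row, image_mask) if is_img)
        # Normalise per token so the reward stays in [0, 1].
        shares.append(on_image / total if total > 0 else 0.0)
    return sum(shares) / len(shares)


# Toy input: 2 generated tokens, 4 input positions, first two are image patches.
mask = [True, True, False, False]
rows = [[0.4, 0.4, 0.1, 0.1], [0.1, 0.1, 0.4, 0.4]]
reward = image_attention_reward(rows, mask)
```

A reward of this shape needs no bounding boxes or region labels, which is consistent with the claim that no spatial annotations are required, but it also rewards looking at the image regardless of whether the problem is visual.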
If this is right
- Accuracy-based rewards provide the strongest overall performance gains across both multiple-choice and open-ended formats.
- Rubric rewards improve structured reasoning quality such as principle identification and unit consistency without consistent accuracy gains.
- Attention-based rewards enhance spatial reasoning while degrading performance in symbolic domains.
- The internal attention-weight reward enables supervision of visual grounding using only model internals, removing the need for spatial annotations.
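To make the first three reward types concrete, the sketch below shows what format, accuracy, and composite rubric rewards could look like as plain scoring functions. The `<answer>` tag convention, the keyword-matching checks, and the weights `(0.5, 0.25, 0.25)` are assumptions chosen for illustration; the source names the three rubric components but does not say how they are scored or combined.

```python
import re
from typing import Optional, Tuple

# Assumed output convention: the final answer is wrapped in <answer>...</answer>.
_ANSWER = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


def extract_answer(completion: str) -> Optional[str]:
    """Return the stripped text inside the first <answer> span, or None."""
    match = _ANSWER.search(completion)
    return match.group(1).strip() if match else None


def format_reward(completion: str) -> float:
    """1.0 if the completion follows the assumed answer-tag format."""
    return 1.0 if extract_answer(completion) is not None else 0.0


def accuracy_reward(completion: str, gold: str) -> float:
    """1.0 if the extracted answer matches the gold answer (case-insensitive)."""
    answer = extract_answer(completion)
    if answer is None:
        return 0.0
    return 1.0 if answer.lower() == gold.strip().lower() else 0.0


def rubric_reward(
    completion: str,
    gold: str,
    principle: str,
    unit: str,
    weights: Tuple[float, float, float] = (0.5, 0.25, 0.25),  # hypothetical
) -> float:
    """Weighted sum of correctness, principle mention, and unit presence."""
    answer = extract_answer(completion) or ""
    names_principle = 1.0 if principle.lower() in completion.lower() else 0.0
    has_unit = 1.0 if unit and unit in answer else 0.0
    w_acc, w_principle, w_unit = weights
    return (
        w_acc * accuracy_reward(completion, gold)
        + w_principle * names_principle
        + w_unit * has_unit
    )


good = "By conservation of energy, v = 4.4 m/s. <answer>4.4 m/s</answer>"
bad = "Using conservation of energy. <answer>5 m</answer>"
good_score = rubric_reward(good, "4.4 m/s", "conservation of energy", "m/s")
bad_score = rubric_reward(bad, "4.4 m/s", "conservation of energy", "m/s")
```

In this sketch the second completion still earns partial credit for naming the right principle, which illustrates how a rubric reward can raise structured-reasoning scores without raising accuracy.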
Where Pith is reading between the lines
- If attention-derived rewards generalize beyond physics, they could reduce dependence on expensive annotated datasets for other visual reasoning applications.
- Adaptive reward selection that switches between accuracy and attention signals based on detected reasoning type might further improve multimodal performance.
- The domain-specific patterns suggest that reward design choices will need to be validated separately for each target application rather than assumed to transfer.
Load-bearing premise
The observed differences in domain performance are caused by the reward signals rather than by interactions with the specific 2B model, training hyperparameters, or the PhyX benchmark construction.
What would settle it
Repeating the full ablation study on a different VLM size or architecture and finding that the relative ordering of the four reward types reverses or disappears would show that reward design is not the primary driver of the reported domain differences.
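One hedged way to operationalize "the relative ordering reverses or disappears" is to count how many pairs of reward types swap rank between two models. The helper `discordant_pairs` and every score below are hypothetical placeholders for illustration; none of the numbers come from the paper.

```python
from itertools import combinations
from typing import Dict


def discordant_pairs(scores_a: Dict[str, float], scores_b: Dict[str, float]) -> int:
    """Count pairs of reward types whose relative order differs between two models."""
    shared = sorted(set(scores_a) & set(scores_b))
    return sum(
        1
        for x, y in combinations(shared, 2)
        if (scores_a[x] - scores_a[y]) * (scores_b[x] - scores_b[y]) < 0
    )


# Placeholder accuracies per reward type on two hypothetical backbones.
model_a = {"format": 0.1, "accuracy": 0.4, "rubric": 0.3, "attention": 0.2}
model_b = {"format": 0.1, "accuracy": 0.3, "rubric": 0.4, "attention": 0.2}
swaps = discordant_pairs(model_a, model_b)
```

Zero discordant pairs would mean the ordering replicates; a count near the maximum (six pairs for four reward types) would indicate the reversal described above.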
Abstract
Physical reasoning over visual inputs demands tight integration of visual perception, domain knowledge, and multi-step symbolic inference. Yet even state-of-the-art Vision Language Models (VLMs) fall far short of human performance on physics benchmarks. While post-training algorithms such as Supervised Fine-Tuning (SFT) and Group Relative Policy Optimization (GRPO) have demonstrated strong reasoning gains in language models, how reward design shapes VLM physical reasoning behavior remains poorly understood. We present a systematic reward ablation study for GRPO-based VLM training on physical reasoning. We compare four reward signals of increasing semantic richness: format compliance, answer accuracy, a composite rubric reward (answer correctness, physics principle identification, and unit consistency), and a novel internal reward derived from model attention weights over input image regions. We evaluate on PhyX, a 3,000-problem benchmark spanning six physics domains and six reasoning types across multiple-choice and open-ended formats, using IBM Granite Vision 3.3 (2B). Across both formats, GRPO with accuracy-based rewards outperforms SFT on most domains, though gains vary substantially by reward type and domain. Reward design does not uniformly improve performance. Instead, it induces domain-specific reasoning behaviors. Accuracy-based rewards provide the strongest overall gains. Rubric rewards improve structured reasoning quality without consistent accuracy improvements. Attention-based rewards enhance spatial reasoning while degrading performance in symbolic domains. Our internal attention-weight reward requires no spatial annotations and improves spatial relation accuracy from 0.27 to 0.50, suggesting that supervising where the model attends during generation is a promising direction for visually grounded physical reasoning.
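For readers unfamiliar with GRPO, the step where reward design enters is the group-relative advantage: several completions are sampled per prompt, each is scored by the reward function, and each score is standardized against the group. The sketch below shows only that normalization. The use of the population standard deviation and the `eps` constant are assumptions of this sketch, since implementations differ on both, and the abstract does not describe the authors' exact variant.

```python
from statistics import fmean, pstdev
from typing import List, Sequence


def group_relative_advantages(rewards: Sequence[float], eps: float = 1e-8) -> List[float]:
    """Standardize each completion's reward against its sampling group."""
    mu = fmean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled completions for one prompt, scored by a binary accuracy reward.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

# If every completion gets the same reward, all advantages are zero.
flat = group_relative_advantages([1.0, 1.0, 1.0, 1.0])
```

The second call shows why a reward that every sample satisfies contributes no learning signal under this normalization: the group provides no contrast to learn from.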
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper conducts an empirical ablation study of four reward signals (format compliance, answer accuracy, composite rubric, and internal attention-weight) for GRPO post-training of the IBM Granite Vision 3.3 (2B) VLM on the PhyX benchmark (3,000 problems across six physics domains and six reasoning types). It claims that accuracy-based GRPO outperforms SFT on most domains, that rubric rewards improve structured reasoning quality without consistent accuracy gains, and that the novel attention-weight reward (requiring no spatial annotations) raises spatial-relation accuracy from 0.27 to 0.50 while degrading symbolic-domain performance, demonstrating that reward design induces domain-specific behaviors rather than uniform improvement.
Significance. If the central empirical patterns prove robust, the work offers a useful contribution to reward engineering for visually grounded reasoning in VLMs by showing that richer semantic rewards do not automatically translate to better accuracy and by introducing an annotation-free internal attention reward that targets spatial grounding. The domain-specific effects and the concrete accuracy lift on spatial relations are the most actionable findings for practitioners.
major comments (2)
- [Abstract] Abstract and results sections: the reported lift in spatial-relation accuracy from 0.27 to 0.50 with the attention-weight reward is presented without error bars, number of runs, or statistical significance tests, leaving open whether the difference is reliable or could be explained by training stochasticity.
- [Experimental setup] Experimental setup (throughout results): all runs use a single fixed 2B backbone, fixed training schedule, and the identical PhyX train/test split with no cross-model controls, hyperparameter sweeps, or benchmark variants. This makes it impossible to isolate the causal effect of the reward signals from interactions with this particular model-benchmark pair, directly undermining the claim that accuracy-based rewards and attention rewards produce generalizable domain-specific behaviors.
minor comments (2)
- The abstract states that gains 'vary substantially by reward type and domain' but does not include a summary table of per-domain accuracies for all four rewards versus SFT; adding such a table would make the domain-specific patterns immediately verifiable.
- No training curves, validation loss trajectories, or convergence diagnostics are referenced, which would help readers assess whether the reported differences reflect stable policy optimization or early stopping artifacts.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our empirical study of reward signals for GRPO training of VLMs. We address each major comment below and specify the revisions we will make to improve the robustness and transparency of our claims.
- Referee: [Abstract] Abstract and results sections: the reported lift in spatial-relation accuracy from 0.27 to 0.50 with the attention-weight reward is presented without error bars, number of runs, or statistical significance tests, leaving open whether the difference is reliable or could be explained by training stochasticity.
  Authors: We agree that the reported accuracy improvement lacks supporting statistical details, which limits confidence in its reliability. In the revised manuscript we will perform at least three independent training runs with different random seeds for the attention-weight reward condition and the relevant baselines. We will report mean accuracies together with standard deviations and conduct a paired statistical significance test (e.g., t-test) between conditions. These results and the associated error bars will be added to the relevant tables and figures in the results section; the abstract will be updated to reflect the more robust presentation of the 0.27-to-0.50 lift.
  Revision: yes
- Referee: [Experimental setup] Experimental setup (throughout results): all runs use a single fixed 2B backbone, fixed training schedule, and the identical PhyX train/test split with no cross-model controls, hyperparameter sweeps, or benchmark variants. This makes it impossible to isolate the causal effect of the reward signals from interactions with this particular model-benchmark pair, directly undermining the claim that accuracy-based rewards and attention rewards produce generalizable domain-specific behaviors.
  Authors: We acknowledge that restricting the study to a single 2B model and fixed schedule prevents strong claims of generalizability across architectures or training regimes. The experimental design deliberately holds the backbone and schedule constant precisely to isolate the effect of each reward signal; varying multiple factors simultaneously would confound attribution of the observed domain-specific behaviors. We will add a dedicated Limitations paragraph that explicitly states the scope of the findings and notes that interactions with other models or benchmarks remain to be explored. We do not plan to add cross-model or hyperparameter-sweep experiments in this revision because of computational cost, but the added discussion will prevent overstatement of generalizability.
  Revision: partial
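The multi-seed reporting promised in the first response (mean, standard deviation, and a paired t-test over at least three seeds) can be sketched with the standard library alone. The accuracy values below are placeholders invented for illustration and are not results from the paper; pairing runs by seed is also an assumption, valid only if baseline and treated runs share seeds.

```python
from math import sqrt
from statistics import mean, stdev
from typing import Sequence


def paired_t_statistic(treated: Sequence[float], baseline: Sequence[float]) -> float:
    """t statistic for paired samples: mean(d) / (stdev(d) / sqrt(n))."""
    if len(treated) != len(baseline) or len(treated) < 2:
        raise ValueError("need two equal-length samples with at least 2 pairs")
    diffs = [t - b for t, b in zip(treated, baseline)]
    sd = stdev(diffs)  # sample standard deviation of the per-seed differences
    if sd == 0:
        raise ValueError("differences have zero variance; t is undefined")
    return mean(diffs) / (sd / sqrt(len(diffs)))


# Placeholder per-seed accuracies (NOT from the paper), one entry per seed.
baseline_acc = [0.20, 0.30, 0.25]
treated_acc = [0.40, 0.60, 0.50]

summary = (mean(treated_acc), stdev(treated_acc))
t_stat = paired_t_statistic(treated_acc, baseline_acc)
```

With three seeds the test has only two degrees of freedom, so the two-sided 5% critical value is about 4.30; small seed counts demand large, consistent effects before a difference counts as significant.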
Circularity Check
No circularity in empirical ablation study
The paper reports an empirical ablation comparing four reward signals (format, accuracy, rubric, attention-weight) for GRPO training of a fixed 2B VLM on the PhyX benchmark. All central claims rest on held-out performance numbers across domains and formats; no equations, derivations, or first-principles results are presented. The attention-weight reward is introduced and evaluated directly via measured accuracy lifts (0.27 to 0.50) without any self-referential fitting or renaming of inputs. No load-bearing step reduces to the paper's own definitions or prior self-citations by construction.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: GRPO training with the listed reward signals will produce measurable differences in VLM behavior on held-out physics problems.