Reward Design for Physical Reasoning in Vision-Language Models

Derek Lilienthal; Manisha Mukherjee; Sameera Horawalavithana

arXiv: 2604.13993 · v1 · submitted 2026-04-15 · 💻 cs.AI · cs.CL · cs.CV

Pith reviewed 2026-05-10 12:46 UTC · model grok-4.3

classification 💻 cs.AI · cs.CL · cs.CV

keywords physical reasoning · vision-language models · reward design · GRPO · attention weights · spatial reasoning · physics benchmarks

The pith

Accuracy-based rewards in GRPO training outperform supervised fine-tuning for vision-language models on most physical reasoning domains.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper investigates how reward signals shape the behavior of vision-language models trained with Group Relative Policy Optimization on physical reasoning tasks that combine image perception with physics knowledge and step-by-step inference. It runs a controlled comparison of four rewards of increasing richness—format compliance, answer accuracy, a composite rubric covering correctness and principles, and a new internal signal based on the model's attention weights over image regions—evaluated on the PhyX benchmark spanning six domains and two answer formats. Accuracy rewards deliver the largest overall gains relative to supervised fine-tuning, while the attention reward specifically raises spatial relation accuracy from 0.27 to 0.50 without any external spatial labels. The results establish that reward design does not produce uniform gains but instead induces distinct reasoning patterns that depend on the physics domain and question type.
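The two simplest rewards in the comparison, format compliance and answer accuracy, can be sketched as follows. This is a minimal illustration, not the paper's implementation: the `<think>`/`<answer>` tag template, the function names, and the case-insensitive exact-match rule are all assumptions, since the source only says the model produces structured outputs.

```python
import re
from typing import Optional

# Assumed output template: <think>...</think><answer>...</answer>.
# The tag names are hypothetical; the source only says outputs are "structured".
_TEMPLATE = re.compile(
    r"\s*<think>(.+?)</think>\s*<answer>(.+?)</answer>\s*", re.DOTALL
)


def extract_answer(completion: str) -> Optional[str]:
    """Return the text inside <answer>, or None if the template is violated."""
    match = _TEMPLATE.fullmatch(completion)
    return match.group(2).strip() if match else None


def format_reward(completion: str) -> float:
    """Format-reward sketch: 1.0 when the completion follows the assumed template."""
    return 1.0 if extract_answer(completion) is not None else 0.0


def accuracy_reward(completion: str, gold: str) -> float:
    """Accuracy-reward sketch: 1.0 on a case-insensitive exact match with gold."""
    answer = extract_answer(completion)
    if answer is None:
        return 0.0
    return 1.0 if answer.lower() == gold.strip().lower() else 0.0
```

Exact matching suits multiple-choice letters; open-ended numeric answers would likely need tolerance-based or unit-aware comparison, which this sketch does not attempt.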

Core claim

GRPO with accuracy-based rewards outperforms SFT on most domains of PhyX. Reward design induces domain-specific reasoning behaviors rather than uniform improvements. The internal attention-weight reward improves spatial relation accuracy from 0.27 to 0.50 without requiring spatial annotations.
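For readers unfamiliar with GRPO, the step that turns raw rewards into a training signal is the group-relative advantage: several completions are sampled for the same prompt, and each reward is standardized against the group's mean and standard deviation. The sketch below shows that standard form under stated assumptions; the function name, the use of the population standard deviation, and the `eps` value are illustrative choices, and the source does not say which variant the authors used.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def group_relative_advantages(rewards: Sequence[float], eps: float = 1e-6) -> List[float]:
    """Standardize rewards within one group of completions for the same prompt.

    eps guards against division by zero when every completion gets the
    same reward; its value here is an arbitrary illustrative choice.
    """
    if not rewards:
        raise ValueError("need at least one reward")
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; some implementations use sample std
    return [(r - mu) / (sigma + eps) for r in rewards]


# Binary accuracy rewards for four sampled completions of one prompt.
mixed = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
uniform = group_relative_advantages([1.0, 1.0, 1.0, 1.0])
```

Note that a group in which every completion earns the same reward yields all-zero advantages, so a binary reward contributes no learning signal on prompts the model always gets right or always gets wrong.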

What carries the argument

The four reward signals of increasing semantic richness used inside GRPO training, with special focus on the novel internal attention-weight reward derived from the model's attention over input image regions.
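One way an attention-derived reward could be computed is as the share of image-token attention that falls on foreground regions. The sketch below is purely hypothetical: the source does not give the reward's formula, and the foreground-mask input, the function name, and the mass-fraction definition are assumptions suggested only by the figure captions mentioning foreground masks.

```python
from typing import Sequence


def attention_grounding_reward(
    patch_attention: Sequence[float],
    foreground_mask: Sequence[bool],
) -> float:
    """Hypothetical reward: fraction of attention mass on foreground patches.

    patch_attention: non-negative attention weight per image patch, assumed
        to be already averaged over heads, layers, and generated tokens.
    foreground_mask: True where a patch is treated as foreground.
    """
    if len(patch_attention) != len(foreground_mask):
        raise ValueError("attention and mask must cover the same patches")
    if any(a < 0 for a in patch_attention):
        raise ValueError("attention weights must be non-negative")
    total = sum(patch_attention)
    if total == 0:
        return 0.0  # no attention on the image at all
    on_foreground = sum(a for a, m in zip(patch_attention, foreground_mask) if m)
    return on_foreground / total
```

A reward of this shape lies in [0, 1] and needs no human spatial labels, which matches the annotation-free property the paper claims, but how the authors aggregate attention and build masks is not stated in this summary.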

If this is right

Accuracy-based rewards provide the strongest overall performance gains across both multiple-choice and open-ended formats.
Rubric rewards improve structured reasoning quality such as principle identification and unit consistency without consistent accuracy gains.
Attention-based rewards enhance spatial reasoning while degrading performance in symbolic domains.
The internal attention-weight reward enables supervision of visual grounding using only model internals, removing the need for spatial annotations.
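The composite rubric reward combines three components named in the paper: answer correctness, physics principle identification, and unit consistency. A minimal sketch of such a combination follows, assuming each component is scored in [0, 1] and combined by a weighted average; the equal default weights, the class name, and the function name are assumptions, not details from the source.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class RubricScores:
    """Per-completion rubric components, each assumed to lie in [0, 1]."""
    correctness: float
    principle: float
    unit: float


def rubric_reward(
    scores: RubricScores,
    weights: Sequence[float] = (1 / 3, 1 / 3, 1 / 3),  # assumed equal weighting
) -> float:
    """Weighted average of the three rubric components."""
    parts = (scores.correctness, scores.principle, scores.unit)
    if any(not 0.0 <= p <= 1.0 for p in parts):
        raise ValueError("rubric components must lie in [0, 1]")
    if len(weights) != 3 or sum(weights) <= 0:
        raise ValueError("need three weights with a positive sum")
    return sum(w * p for w, p in zip(weights, parts)) / sum(weights)
```

Under a weighting like this, a completion with the wrong final answer but the right principle and consistent units still earns two thirds of the maximum reward. That is one plausible reading of why rubric rewards could improve reasoning structure without reliably improving accuracy, though the paper's actual weighting is not given here.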

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

If attention-derived rewards generalize beyond physics, they could reduce dependence on expensive annotated datasets for other visual reasoning applications.
Adaptive reward selection that switches between accuracy and attention signals based on detected reasoning type might further improve multimodal performance.
The domain-specific patterns suggest that reward design choices will need to be validated separately for each target application rather than assumed to transfer.

Load-bearing premise

The observed differences in domain performance are caused by the reward signals rather than by interactions with the specific 2B model, training hyperparameters, or the PhyX benchmark construction.

What would settle it

Repeating the full ablation study on a different VLM size or architecture and finding that the relative ordering of the four reward types reverses or disappears would show that reward design is not the primary driver of the reported domain differences.
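The test described above amounts to checking whether the ranking of reward types is preserved across models. One simple way to quantify that is Kendall's tau over the per-reward scores of two models, sketched below. The function name is illustrative, and the ranks in the usage lines are placeholders chosen to show the two extremes, not results from the paper.

```python
from itertools import combinations
from typing import Mapping


def kendall_tau(scores_a: Mapping[str, float], scores_b: Mapping[str, float]) -> float:
    """Kendall tau-a between two score assignments over the same reward types.

    Returns 1.0 when both models order the rewards identically and -1.0
    when the ordering is fully reversed. Ties count as neither concordant
    nor discordant.
    """
    if set(scores_a) != set(scores_b):
        raise ValueError("both models must be scored on the same reward types")
    keys = sorted(scores_a)
    if len(keys) < 2:
        raise ValueError("need at least two reward types to compare orderings")
    concordant = discordant = 0
    for x, y in combinations(keys, 2):
        sign = (scores_a[x] - scores_a[y]) * (scores_b[x] - scores_b[y])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(keys) * (len(keys) - 1) // 2
    return (concordant - discordant) / n_pairs


# Placeholder ranks (higher is better), NOT numbers reported in the paper.
model_small = {"format": 1, "accuracy": 4, "rubric": 3, "attention": 2}
model_other = {"format": 4, "accuracy": 1, "rubric": 2, "attention": 3}
same_order = kendall_tau(model_small, model_small)
reversed_order = kendall_tau(model_small, model_other)
```

With only four reward types there are six pairs, so tau moves in coarse steps; it is better read as a descriptive summary than as a significance test.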

Figures

Figures reproduced from arXiv: 2604.13993 by Derek Lilienthal, Manisha Mukherjee, Sameera Horawalavithana.

**Figure 1.** Reward design spectrum for GRPO-based VLM training on physical reasoning. Given a PhyX image-question pair, the model generates structured outputs and is trained with GRPO. We compare four rewards: (R1) format, (R2) accuracy, (R3) rubric (correctness, principle, unit), and (R4) attention-based visual grounding.

**Figure 2.** (a) Comparison of reasoning accuracy across GRPO configurations. (b) Attention …

**Figure 3.** Overview of reward configurations and training dynamics. (a) Pairwise heatmap …

**Figure 4.** Attention Score Masking Whitespace Filling to Increase Mask Area. Distribution …

**Figure 5.** Attention Foreground Masks Images on PhyX Training Dataset

**Figure 6.** Mean Attention Map Layered over images on PhyX Training Dataset

**Figure 7.** Mean Attention Map Layered over images on PhyX Training Dataset

Original abstract

Physical reasoning over visual inputs demands tight integration of visual perception, domain knowledge, and multi-step symbolic inference. Yet even state-of-the-art Vision Language Models (VLMs) fall far short of human performance on physics benchmarks. While post-training algorithms such as Supervised Fine-Tuning (SFT) and Group Relative Policy Optimization (GRPO) have demonstrated strong reasoning gains in language models, how reward design shapes VLM physical reasoning behavior remains poorly understood. We present a systematic reward ablation study for GRPO-based VLM training on physical reasoning. We compare four reward signals of increasing semantic richness: format compliance, answer accuracy, a composite rubric reward (answer correctness, physics principle identification, and unit consistency), and a novel internal reward derived from model attention weights over input image regions. We evaluate on PhyX, a 3,000-problem benchmark spanning six physics domains and six reasoning types across multiple-choice and open-ended formats, using IBM Granite Vision 3.3 (2B). Across both formats, GRPO with accuracy-based rewards outperforms SFT on most domains, though gains vary substantially by reward type and domain. Reward design does not uniformly improve performance. Instead, it induces domain-specific reasoning behaviors. Accuracy-based rewards provide the strongest overall gains. Rubric rewards improve structured reasoning quality without consistent accuracy improvements. Attention-based rewards enhance spatial reasoning while degrading performance in symbolic domains. Our internal attention-weight reward requires no spatial annotations and improves spatial relation accuracy from 0.27 to 0.50, suggesting that supervising where the model attends during generation is a promising direction for visually grounded physical reasoning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The attention-derived reward is a practical new angle for spatial gains without labels, but the whole study sits on one model and one benchmark so the domain patterns are hard to generalize.

The punchline is that pulling a reward straight from the model's attention weights over image regions lifts spatial relation accuracy from 0.27 to 0.50 on PhyX with no extra annotations. That part is new and worth trying if you work on grounded VLMs. The rest of the paper is a straightforward ablation of four reward types under GRPO versus SFT on the same 2B Granite Vision backbone across six physics domains and two answer formats. Accuracy rewards come out strongest overall, the rubric version helps structured outputs, and attention helps spatial while hurting symbolic domains. They report the numbers cleanly enough to see the pattern.

What the paper does well is run the comparison systematically on a multi-domain benchmark and introduce an internal reward that avoids manual spatial labels. That combination gives readers something concrete to replicate or extend.

The soft spots are exactly what the stress-test flags. All runs use the identical 2B model, fixed schedule, and single PhyX split. No error bars, no statistical tests, no hyperparameter sweeps, and no cross-model checks. That means the observed differences could easily come from interactions between the reward, this particular model, and how PhyX was built rather than from general properties of the reward signals. The explanation for why attention degrades symbolic domains is reasonable but stays post-hoc.

This paper is for people already doing VLM fine-tuning or physical reasoning benchmarks who want a new reward variant to test. It is not transformative, but the empirical setup is honest and the attention trick is a fresh enough idea that a serious referee should look at it. I would send it to review rather than desk reject.

Referee Report

2 major / 2 minor

Summary. The paper conducts an empirical ablation study of four reward signals (format compliance, answer accuracy, composite rubric, and internal attention-weight) for GRPO post-training of the IBM Granite Vision 3.3 (2B) VLM on the PhyX benchmark (3,000 problems across six physics domains and six reasoning types). It claims that accuracy-based GRPO outperforms SFT on most domains, that rubric rewards improve structured reasoning quality without consistent accuracy gains, and that the novel attention-weight reward (requiring no spatial annotations) raises spatial-relation accuracy from 0.27 to 0.50 while degrading symbolic-domain performance, demonstrating that reward design induces domain-specific behaviors rather than uniform improvement.

Significance. If the central empirical patterns prove robust, the work offers a useful contribution to reward engineering for visually grounded reasoning in VLMs by showing that richer semantic rewards do not automatically translate to better accuracy and by introducing an annotation-free internal attention reward that targets spatial grounding. The domain-specific effects and the concrete accuracy lift on spatial relations are the most actionable findings for practitioners.

major comments (2)

[Abstract] Abstract and results sections: the reported lift in spatial-relation accuracy from 0.27 to 0.50 with the attention-weight reward is presented without error bars, number of runs, or statistical significance tests, leaving open whether the difference is reliable or could be explained by training stochasticity.
[Experimental setup] Experimental setup (throughout results): all runs use a single fixed 2B backbone, fixed training schedule, and the identical PhyX train/test split with no cross-model controls, hyperparameter sweeps, or benchmark variants. This makes it impossible to isolate the causal effect of the reward signals from interactions with this particular model-benchmark pair, directly undermining the claim that accuracy-based rewards and attention rewards produce generalizable domain-specific behaviors.

minor comments (2)

The abstract states that gains 'vary substantially by reward type and domain' but does not include a summary table of per-domain accuracies for all four rewards versus SFT; adding such a table would make the domain-specific patterns immediately verifiable.
No training curves, validation loss trajectories, or convergence diagnostics are referenced, which would help readers assess whether the reported differences reflect stable policy optimization or early stopping artifacts.
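The per-domain summary table requested in the first minor comment is cheap to produce from per-problem outcomes. The sketch below shows one way to aggregate and render it; the record layout, the function names, and the `domain_a` and `SFT`/`accuracy` labels in the usage lines are placeholders, not the paper's data or naming.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# (domain, training configuration, answered correctly); layout is an assumption.
Record = Tuple[str, str, bool]


def per_domain_accuracy(records: Iterable[Record]) -> Dict[str, Dict[str, float]]:
    """Return accuracy[domain][configuration] from per-problem outcomes."""
    hits = defaultdict(lambda: defaultdict(int))
    totals = defaultdict(lambda: defaultdict(int))
    for domain, config, correct in records:
        totals[domain][config] += 1
        hits[domain][config] += int(correct)
    return {
        domain: {cfg: hits[domain][cfg] / n for cfg, n in by_cfg.items()}
        for domain, by_cfg in totals.items()
    }


def render_table(accuracy: Dict[str, Dict[str, float]], configs: List[str]) -> str:
    """Render one row per domain and one column per configuration."""
    rows = ["domain".ljust(12) + "".join(c.rjust(10) for c in configs)]
    for domain in sorted(accuracy):
        cells = "".join(
            (f"{accuracy[domain][c]:.2f}" if c in accuracy[domain] else "-").rjust(10)
            for c in configs
        )
        rows.append(domain.ljust(12) + cells)
    return "\n".join(rows)


# Placeholder outcomes for illustration only.
demo = [("domain_a", "SFT", True), ("domain_a", "SFT", False), ("domain_a", "accuracy", True)]
table = render_table(per_domain_accuracy(demo), ["SFT", "accuracy"])
```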

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our empirical study of reward signals for GRPO training of VLMs. We address each major comment below and specify the revisions we will make to improve the robustness and transparency of our claims.

Referee: [Abstract] Abstract and results sections: the reported lift in spatial-relation accuracy from 0.27 to 0.50 with the attention-weight reward is presented without error bars, number of runs, or statistical significance tests, leaving open whether the difference is reliable or could be explained by training stochasticity.

Authors: We agree that the reported accuracy improvement lacks supporting statistical details, which limits confidence in its reliability. In the revised manuscript we will perform at least three independent training runs with different random seeds for the attention-weight reward condition and the relevant baselines. We will report mean accuracies together with standard deviations and conduct a paired statistical significance test (e.g., t-test) between conditions. These results and the associated error bars will be added to the relevant tables and figures in the results section; the abstract will be updated to reflect the more robust presentation of the 0.27-to-0.50 lift.

revision: yes

Referee: [Experimental setup] Experimental setup (throughout results): all runs use a single fixed 2B backbone, fixed training schedule, and the identical PhyX train/test split with no cross-model controls, hyperparameter sweeps, or benchmark variants. This makes it impossible to isolate the causal effect of the reward signals from interactions with this particular model-benchmark pair, directly undermining the claim that accuracy-based rewards and attention rewards produce generalizable domain-specific behaviors.

Authors: We acknowledge that restricting the study to a single 2B model and fixed schedule prevents strong claims of generalizability across architectures or training regimes. The experimental design deliberately holds the backbone and schedule constant precisely to isolate the effect of each reward signal; varying multiple factors simultaneously would confound attribution of the observed domain-specific behaviors. We will add a dedicated Limitations paragraph that explicitly states the scope of the findings and notes that interactions with other models or benchmarks remain to be explored. We do not plan to add cross-model or hyperparameter-sweep experiments in this revision because of computational cost, but the added discussion will prevent overstatement of generalizability.

revision: partial
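The seed-level comparison promised in the first response can be sketched with the standard library alone. The function below computes a paired t statistic over per-seed accuracies, pairing runs that share a seed; the function name is illustrative and the numbers in the usage lines are placeholders, not results from the paper.

```python
from math import sqrt
from statistics import mean, stdev
from typing import Sequence, Tuple


def paired_t_statistic(
    baseline: Sequence[float], treatment: Sequence[float]
) -> Tuple[float, int]:
    """Return (t, degrees of freedom) for paired per-seed accuracies."""
    if len(baseline) != len(treatment):
        raise ValueError("each seed needs one baseline and one treatment run")
    if len(baseline) < 2:
        raise ValueError("need at least two seeds to estimate variance")
    diffs = [t - b for b, t in zip(baseline, treatment)]
    spread = stdev(diffs)  # sample standard deviation of the differences
    if spread == 0:
        raise ValueError("identical differences across seeds; t is undefined")
    return mean(diffs) / (spread / sqrt(len(diffs))), len(diffs) - 1


# Placeholder per-seed accuracies for illustration only.
t_stat, dof = paired_t_statistic([0.2, 0.3, 0.4], [0.4, 0.6, 0.5])
```

With three seeds there are only two degrees of freedom, so the two-sided 5% critical value is about 4.30; a lift can look large and still fall short of significance, which argues for more seeds than the stated minimum.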

Circularity Check

0 steps flagged

No circularity in empirical ablation study

The paper reports an empirical ablation comparing four reward signals (format, accuracy, rubric, attention-weight) for GRPO training of a fixed 2B VLM on the PhyX benchmark. All central claims rest on held-out performance numbers across domains and formats; no equations, derivations, or first-principles results are presented. The attention-weight reward is introduced and evaluated directly via measured accuracy lifts (0.27 to 0.50) without any self-referential fitting or renaming of inputs. No load-bearing step reduces to the paper's own definitions or prior self-citations by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper relies on standard assumptions of RL fine-tuning (reward signals guide policy improvement) and the validity of the PhyX benchmark as a proxy for physical reasoning. No new physical laws or entities are postulated.

axioms (1)

Domain assumption: GRPO training with the listed reward signals will produce measurable differences in VLM behavior on held-out physics problems.
Invoked throughout the ablation design and result interpretation.
