RationalRewards: Reasoning Rewards Scale Visual Generation Both Training and Test Time

· 2026 · cs.AI · arXiv 2604.11626

4 Pith papers cite this work. Polarity classification is still indexing.

abstract

Most reward models for visual generation reduce rich human judgments to a single unexplained score, discarding the reasoning that underlies preference. We show that teaching reward models to produce explicit, multi-dimensional critiques before scoring transforms them from passive evaluators into active optimization tools, improving generators in two complementary ways: at training time, structured rationales provide interpretable, fine-grained rewards for reinforcement learning; at test time, a Generate-Critique-Refine loop turns critiques into targeted prompt revisions that improve outputs without any parameter updates. To train such a reward model without costly rationale annotations, we introduce Preference-Anchored Rationalization (PARROT), a principled framework that recovers high-quality rationales from readily available preference data through anchored generation, consistency filtering, and distillation. The resulting model, RationalRewards (8B), achieves state-of-the-art preference prediction among open-source reward models, competitive with Gemini-2.5-Pro, while using 10-20x less training data than comparable baselines. As an RL reward, it consistently improves text-to-image and image-editing generators beyond scalar alternatives. Most strikingly, its test-time critique-and-refine loop matches or exceeds RL-based fine-tuning on several benchmarks, suggesting that structured reasoning can unlock latent capabilities in existing generators that suboptimal prompts fail to elicit.
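The test-time Generate-Critique-Refine loop described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `Critique` structure, the function names, the mean-of-dimensions aggregate, the `target` threshold, and the toy generator and critic are all assumptions introduced here. The only property taken from the abstract is that a multi-dimensional critique drives a prompt revision and no generator parameters are updated.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Tuple


@dataclass
class Critique:
    """Hypothetical critique format: per-dimension scores plus a prompt revision."""
    scores: Dict[str, float]   # one score per dimension, higher is better
    rationale: str             # free-text explanation behind the scores
    revised_prompt: str        # suggested prompt for the next round


def generate_critique_refine(
    request: str,
    generate: Callable[[str], str],
    critique: Callable[[str, str], Critique],
    max_rounds: int = 3,
    target: float = 0.9,
) -> Tuple[str, float]:
    """Run generate -> critique -> revise-prompt rounds; return the best output seen."""
    best_output, best_score = "", float("-inf")
    prompt = request
    for _ in range(max_rounds):
        output = generate(prompt)
        # Always judge against the original request, not the revised prompt.
        result = critique(request, output)
        # Assumption: aggregate the dimensions with a plain mean.
        score = sum(result.scores.values()) / len(result.scores)
        if score > best_score:
            best_output, best_score = output, score
        if score >= target:
            break
        # Only the prompt changes between rounds; the generator is untouched.
        prompt = result.revised_prompt
    return best_output, best_score


# Toy stand-ins so the sketch runs end to end (purely illustrative).
REQUIRED = ["red", "cube", "table"]


def toy_generate(prompt: str) -> str:
    # A real generator would return an image; here the "image" is a string.
    return f"image<{prompt}>"


def toy_critique(request: str, output: str) -> Critique:
    missing = [word for word in REQUIRED if word not in output]
    scores = {
        "faithfulness": 1.0 - len(missing) / len(REQUIRED),
        "quality": 1.0,
    }
    revised = request if not missing else f"{request}, emphasizing: {', '.join(missing)}"
    return Critique(scores, f"missing elements: {missing}", revised)


if __name__ == "__main__":
    print(generate_critique_refine("a cube", toy_generate, toy_critique))
```

In this toy run the first output omits two required elements, the critique proposes a revised prompt naming them, and the second round reaches the target score, so the loop stops early.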

representative citing papers

Bad Seeing or Bad Thinking? Rewarding Perception for Vision-Language Reasoning

cs.AI · 2026-05-13 · unverdicted · novelty 6.0

A new RL method called MoCA with Perception Verification rewards perceptual fidelity independently to improve both seeing and thinking in VLMs.

PanoWorld: Towards Spatial Supersensing in 360$^\circ$ Panorama World

cs.CV · 2026-05-13 · unverdicted · novelty 6.0 · 2 refs

PanoWorld adds spherical spatial cross-attention and pano-native training data to MLLMs for improved spatial reasoning on ERP panoramas, outperforming baselines on new and existing benchmarks.

G$^2$TR: Generation-Guided Visual Token Reduction for Separate-Encoder Unified Multimodal Models

cs.CV · 2026-05-12 · conditional · novelty 6.0

G²TR reduces visual tokens and prefill compute by 1.94x in separate-encoder UMMs via generation-guided importance from VAE latent consistency, balanced selection, and merging, while preserving reasoning accuracy and editing quality.

Starve to Perceive: Taming Lazy Perception in VLMs with Constrained Visual Bandwidth

cs.CV · 2026-05-18 · unverdicted · novelty 5.0

Constraining visual token budget per observation during VLM training forces genuine active perception and delivers 5% average relative improvement without auxiliary losses or architecture changes.

citing papers explorer


Bad Seeing or Bad Thinking? Rewarding Perception for Vision-Language Reasoning
cs.AI · 2026-05-13 · unverdicted · none · ref 16 · internal anchor
PanoWorld: Towards Spatial Supersensing in 360$^\circ$ Panorama World
cs.CV · 2026-05-13 · unverdicted · none · ref 42 · 2 links · internal anchor
G$^2$TR: Generation-Guided Visual Token Reduction for Separate-Encoder Unified Multimodal Models
cs.CV · 2026-05-12 · conditional · none · ref 31 · internal anchor
Starve to Perceive: Taming Lazy Perception in VLMs with Constrained Visual Bandwidth
cs.CV · 2026-05-18 · unverdicted · none · ref 33 · internal anchor
