Recognition: unknown
RationalRewards: Reasoning Rewards Scale Visual Generation Both Training and Test Time
Pith reviewed 2026-05-10 15:32 UTC · model grok-4.3
The pith
Teaching reward models to generate multi-dimensional critiques before scoring turns them into active tools that improve visual generators during training and at test time.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Reward models that first produce multi-dimensional rationales before assigning scores function as active optimization tools rather than passive evaluators. Their rationales serve as structured rewards for RL training of text-to-image and image-editing generators and also drive a test-time Generate-Critique-Refine procedure that revises prompts to improve outputs without parameter updates. This behavior is enabled by Preference-Anchored Rationalization (PARROT), which recovers rationales from readily available preference data via anchored generation, consistency filtering, and distillation. The trained RationalRewards (8B) model achieves leading preference-prediction accuracy among open-souce
What carries the argument
Preference-Anchored Rationalization (PARROT), a framework that recovers high-quality rationales from preference pairs through anchored generation, consistency filtering, and distillation, enabling reward models to output multi-dimensional critiques before scoring.
Load-bearing premise
The rationales recovered by PARROT through anchored generation, consistency filtering, and distillation are sufficiently high-quality and unbiased to serve as effective training signals for multi-dimensional critiques that genuinely improve downstream optimization.
What would settle it
If the Generate-Critique-Refine loop produces no measurable improvement on benchmarks where RL fine-tuning with the same reward model does improve performance, the test-time mechanism would be shown not to capture useful optimization signals.
Figures
read the original abstract
Most reward models for visual generation reduce rich human judgments to a single unexplained score, discarding the reasoning that underlies preference. We show that teaching reward models to produce explicit, multi-dimensional critiques before scoring transforms them from passive evaluators into active optimization tools, improving generators in two complementary ways: at training time, structured rationales provide interpretable, fine-grained rewards for reinforcement learning; at test time, a Generate-Critique-Refine loop turns critiques into targeted prompt revisions that improve outputs without any parameter updates. To train such a reward model without costly rationale annotations, we introduce Preference-Anchored Rationalization (PARROT), a principled framework that recovers high-quality rationales from readily available preference data through anchored generation, consistency filtering, and distillation. The resulting model, RationalRewards (8B), achieves state-of-the-art preference prediction among open-source reward models, competitive with Gemini-2.5-Pro, while using 10-20x less training data than comparable baselines. As an RL reward, it consistently improves text-to-image and image-editing generators beyond scalar alternatives. Most strikingly, its test-time critique-and-refine loop matches or exceeds RL-based fine-tuning on several benchmarks, suggesting that structured reasoning can unlock latent capabilities in existing generators that suboptimal prompts fail to elicit.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that training reward models to output explicit multi-dimensional critiques before scoring, via a new Preference-Anchored Rationalization (PARROT) framework that extracts rationales from preference pairs through anchored generation, consistency filtering, and distillation, yields an 8B RationalRewards model. This model achieves SOTA open-source preference prediction (competitive with Gemini-2.5-Pro on less data), serves as a stronger RL reward for text-to-image and editing generators than scalar baselines, and enables a test-time Generate-Critique-Refine loop that matches or exceeds RL fine-tuning on benchmarks.
Significance. If the empirical claims hold after verification, the work would be significant for shifting reward models from opaque scalars to interpretable, actionable reasoning tools usable at both training (RL) and test time (prompt revision without updates). The reported data efficiency and dual-use of critiques represent a potentially useful direction for visual generation optimization, though the absence of supporting details limits current assessment.
major comments (3)
- [Abstract and §3] Abstract and §3 (PARROT): The central claim that PARROT recovers high-quality, unbiased rationales via anchored generation, consistency filtering, and distillation is load-bearing for all downstream results, yet no human agreement metrics, rationale fidelity ablations, or comparisons against gold rationales are reported to confirm this assumption.
- [§4] §4 (Experiments): The abstract asserts SOTA preference prediction and RL/test-time gains, but reports no experimental details, full baseline comparisons, statistical significance tests, or ablation studies, making it impossible to assess whether the data support the claims.
- [§4.1] §4.1 (Preference prediction results): Rationales are recovered from the identical preference data used to train the model, creating a circularity risk where reported gains may reflect distribution fitting rather than independent generalization; this is not addressed with held-out rationale validation or bias analysis.
minor comments (2)
- [§3] Clarify how the multi-dimensional critique dimensions are defined, scored, and aggregated into the final reward signal.
- [Related Work] Add missing references to prior work on rationale-augmented rewards or critique-based refinement in vision-language models.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment below and will revise the manuscript accordingly to improve transparency and rigor.
read point-by-point responses
-
Referee: [Abstract and §3] The central claim that PARROT recovers high-quality, unbiased rationales via anchored generation, consistency filtering, and distillation is load-bearing for all downstream results, yet no human agreement metrics, rationale fidelity ablations, or comparisons against gold rationales are reported to confirm this assumption.
Authors: We acknowledge that direct human validation metrics would provide stronger support for the quality of recovered rationales. The current manuscript relies on consistency filtering, distillation, and downstream empirical gains as proxies for rationale quality. We will add human agreement metrics on a sampled subset of rationales, ablations isolating each PARROT stage to measure impact on rationale fidelity, and comparisons against any available gold rationales from the source preference datasets. These additions will appear in an expanded §3. revision: yes
-
Referee: [§4] The abstract asserts SOTA preference prediction and RL/test-time gains, but reports no experimental details, full baseline comparisons, statistical significance tests, or ablation studies, making it impossible to assess whether the data support the claims.
Authors: We agree that the main text should contain more complete experimental information. The full manuscript includes setup details, but we will expand §4 to report full baseline comparisons (including all relevant open- and closed-source models), statistical significance tests across multiple seeds, and additional ablations on critique dimensions, RL reward usage, and the Generate-Critique-Refine loop. Extended tables and implementation specifics will be moved to the appendix. revision: yes
-
Referee: [§4.1] Rationales are recovered from the identical preference data used to train the model, creating a circularity risk where reported gains may reflect distribution fitting rather than independent generalization; this is not addressed with held-out rationale validation or bias analysis.
Authors: This is a legitimate concern. Preference prediction is already evaluated on held-out test splits disjoint from training data. To further address circularity, we will add held-out rationale generation experiments on unseen preference pairs and include a bias analysis examining correlations across critique dimensions. These results will be reported in §4.1 with updated splits and analysis. revision: partial
Circularity Check
No significant circularity detected; claims rest on independent empirical benchmarks.
full rationale
The paper describes an empirical pipeline: PARROT generates rationales from existing preference pairs via anchored generation, filtering, and distillation; RationalRewards is then trained on the resulting (preference, rationale, score) tuples. Performance is measured on separate test sets for preference prediction accuracy, RL-based generator improvement, and test-time Generate-Critique-Refine gains. No equations, uniqueness theorems, or first-principles derivations are presented that reduce by construction to the training inputs. No self-citations are invoked as load-bearing justification for the core method or results. The reported gains are therefore not forced by the construction of the inputs themselves but are subject to standard empirical verification on held-out data.
Axiom & Free-Parameter Ledger
invented entities (2)
-
PARROT framework
no independent evidence
-
RationalRewards model
no independent evidence
Forward citations
Cited by 2 Pith papers
-
Bad Seeing or Bad Thinking? Rewarding Perception for Vision-Language Reasoning
A new RL method called MoCA with Perception Verification rewards perceptual fidelity independently to improve both seeing and thinking in VLMs.
-
PanoWorld: Towards Spatial Supersensing in 360$^\circ$ Panorama World
PanoWorld adds spherical geometry to MLLMs via cross-attention and pano-specific instruction data, yielding better performance on panoramic spatial reasoning benchmarks than standard perspective-based pipelines.
Reference graph
Works this paper leans on
-
[1]
URLhttps://arxiv.org/abs/2410.12832. OpenAI. Gpt-image-1. https://platform.openai.com/docs/guides/image-generation? image-generation-model=gpt-image-1 , 2025. OpenAI’s image generation model. Ac- cessed September 2025. Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas M¨uller, Joe Penna, and Robin Rombach. SDXL: Improving lat...
-
[2]
The preference label y is provided as a hint, focusing generation on rationales consistent with the observed preference
Phase 1 (Rationale Generation)constructs the variational posterior qϕ(z|x , y) by prompting a teacher VLM with preference-anchored instructions. The preference label y is provided as a hint, focusing generation on rationales consistent with the observed preference
-
[3]
Phase 2 (Consistency Filtering)maximizes Term 1, Eqϕ [logP θ(y|x , z)], by retaining only rationales z for which the preference y can be recovered from (x, z) alone (Eq. 2). This restricts qϕ’s effective support to the high-likelihood region, ensuring predictive sufficiency
-
[4]
A person planting a tree with cat. HD. Realistic style
Phase 3 (Foresight Distillation)minimizes Term 2, DKL(qϕ(z|x , y)∥Pθ(z|x)) , by training the student model Pθ(z|x) to generate rationales without access to y. Since qϕ 19 Preprint. Under review. RL (LoRA) Training Steps0 300 Figure 12: Text-to-Image RL using scalar reward model demonstrates reward hacking – while the reward increases, the visual quality o...
-
[5]
show its work
Physical and Visual Quality: ## Justification: … However, there are **notable physical and anatomical flaws**: -**Hand Structure Deformity**: The person's right hand (touching the soil) has an unnatural, elongated thumb … -**Abnormal Element Overlap / Implausible Interaction**: The cat's front paw is placed against the person's extended hand, but the physic...
2020
-
[7]
-3 (Minor mismatch):Most relevant elements are preserved, but a few aspects are missing or incorrectly handled
Image Faithfulness(How well are the non-edited parts and key input elements preserved?) -4 (Uses input fully):All relevant elements from the input are accurately preserved or transformed as instructed. -3 (Minor mismatch):Most relevant elements are preserved, but a few aspects are missing or incorrectly handled. -2 (Partial mismatch):Some elements are car...
-
[8]
No visible artifacts
Physical and Visual Quality(Technical errors, composition, realism, and physics) -4 (No noticeable flaws):The image is physically plausible. No visible artifacts. -3 (Minor flaws):Small inaccuracies that are noticeable but not strongly disruptive. -2 (Some flaws):Clear physical or visual errors that disrupt the image. -1 (Severe flaws):Major physical/visu...
-
[9]
-3 (Mostly match):Minor misspellings or inconsistent capitalization
Text Rendering(Only if the instruction involves generating text) -4 (Full match):Text is correct, legible, and integrated well. -3 (Mostly match):Minor misspellings or inconsistent capitalization. -2 (Partial match):Major misspellings or distorted text. -1 (Major deviations):Text is unreadable, severely distorted, or missing. (Use N/A if no text generatio...
-
[10]
Image Faithfulness
Text Faithfulness: ## Justification:[Detailed comparison] ## Score A:[float]## Score B:[float]## Winner:[A/B/Tie] 2--4. (same structure for remaining aspects) # Summary:[Overall comparison summary] Text-to-Image Variant.For text-to-image generation, the prompt is modified as follows: (1) only two images are provided (Generated Image A and Generated Image ...
-
[11]
The Source Image (First image)
-
[12]
To do this, you must first assess the image on four critical aspects, provide justifications and absolute scores in 1--4 scale
The Edited Image (Second image) Your task is to evaluate the Edited Image against the Source Image and the User Instruction. To do this, you must first assess the image on four critical aspects, provide justifications and absolute scores in 1--4 scale. About the scores: you should try to givefloat scores. For example, float values are important to reflect...
-
[13]
No hallucinations or unrequested changes
Text Faithfulness(How accurately does the output follow the instruction?) -4 (Full match):All key elements (objects, colors, actions) are represented exactly as described. No hallucinations or unrequested changes. 23 Preprint. Under review. -3 (Minor mismatch):Most key elements are present, but minor details are missing, incorrect, or slightly inaccurate....
-
[15]
No visible artifacts (seams, blurring, noise)
Physical and Visual Quality(Technical errors, composition, realism, and physics) -4 (No noticeable flaws):The image is physically plausible (correct lighting, shadows, geometry, anatomy). No visible artifacts (seams, blurring, noise). -3 (Minor flaws):Small inaccuracies that are noticeable but not strongly disruptive (e.g., slight lighting mismatch, minor...
-
[16]
-3 (Mostly match):Minor misspellings or inconsistent capitalization
Text Rendering(Only if the instruction involves generating text) -4 (Full match):Text is correct, legible, and integrated well. -3 (Mostly match):Minor misspellings or inconsistent capitalization. -2 (Partial match):Major misspellings or distorted text. -1 (Major deviations):Text is unreadable, severely distorted, or missing. (Use N/A if no text generatio...
-
[19]
Under review
Physical and Visual Quality: ## Score:[ float score ] ## Justification:[Detailed explanation of the score] 24 Preprint. Under review
-
[20]
First, thecritique promptevaluates a single generated image across four dimensions with natural language justification
Text Rendering: ## Score:[ float score or N/A ] ## Justification:[Detailed explanation of the score] # Summary:[Summary of the evaluation] C.4 Generate–Critique–Refine (GCR) Loop Prompts The GCR loop at test time (Section 2.2, Figure 6) uses the trained RationalRewards model in two stages. First, thecritique promptevaluates a single generated image across...
-
[21]
No hallucinations or unrequested changes
Text Faithfulness(How accurately does the output follow the instruction?) -4 (Full match):All key elements (objects, colors, actions) are represented exactly as described. No hallucinations or unrequested changes. -3 (Minor mismatch):Most key elements are present, but minor details are missing, incorrect, or slightly inaccurate. -2 (Some mismatch):Some ke...
-
[22]
-3 (Minor mismatch):Most relevant elements are preserved, but a few aspects (e.g., background details, lighting consistency) are missing or incorrectly handled
Image Faithfulness(How well are the non-edited parts and key input elements preserved?) -4 (Uses input fully):All relevant elements from the input (background, style, lighting, identity) are accurately preserved or transformed as instructed. -3 (Minor mismatch):Most relevant elements are preserved, but a few aspects (e.g., background details, lighting con...
-
[23]
No visible artifacts (seams, blurring, noise)
Physical and Visual Quality(Technical errors, composition, realism, and physics) -4 (No noticeable flaws):The image is physically plausible (correct lighting, shadows, geometry, anatomy). No visible artifacts (seams, blurring, noise). -3 (Minor flaws):Small inaccuracies that are noticeable but not strongly disruptive (e.g., slight lighting mismatch, minor...
-
[24]
-3 (Mostly match):Minor misspellings or inconsistent capitalization
Text Rendering(Only if the instruction involves generating text) -4 (Full match):Text is correct, legible, and integrated well. -3 (Mostly match):Minor misspellings or inconsistent capitalization. -2 (Partial match):Major misspellings or distorted text. -1 (Major deviations):Text is unreadable, severely distorted, or missing. (Use N/A if no text generatio...
-
[25]
Text Faithfulness: ## Score:[ float score ] ## Justification:[Detailed explanation of the score]
-
[26]
Image Faithfulness: ## Score:[ float score ] ## Justification:[Detailed explanation of the score]
-
[27]
Physical and Visual Quality: ## Score:[ float score ] ## Justification:[Detailed explanation of the score]
-
[28]
floating
Text Rendering: ## Score:[ float score or N/A ] ## Justification:[Detailed explanation of the score] # Summary:[Summary of the evaluation] # User Request Refinement: ## Refinement Comments:[Explanation of why the original instruction needs refinement and what constraints should be added] ## Refined Request:[Improved, more specific instruction that address...
2025
-
[29]
Image A contains a clear sunset in the background
Visual hallucination: The teacher generates a rationale describing visual content not present in the images (e.g., “Image A contains a clear sunset in the background” when no sunset is visible), leading to an incorrect preference prediction when the label hint is removed
-
[30]
Label-ignoring rationales: Despite the preference anchor, the teacher occasionally gen- erates a rationale that favors the non-preferred image, particularly when the quality difference between images is subtle
-
[31]
Both images are of reasonable quality
Vague, non-predictive reasoning: The rationale provides generic praise or criticism (e.g., “Both images are of reasonable quality”) without sufficient discriminative detail to distinguish between the two options. F.3 Evaluation Benchmark Summary Table 13 summarizes all evaluation benchmarks used in this work. Benchmark Task # Samples Evaluation Protocol M...
2025
-
[32]
Teacher Model Dependence.The quality of RationalRewards is upper-bounded by the teacher model (Qwen3-VL-32B-Instruct) used to generate training rationales. In domains where the teacher exhibits systematic blind spots—such as fine-grained physics simulation, culturally specific aesthetics, or specialized technical content—the student model inherits these l...
-
[33]
The teacher VLM in- troduces additional biases from its own pretraining data
Bias Inheritance.Preference datasets (EditReward, HPDv3, RapidData) encode the aesthetic preferences and cultural assumptions of their annotators. The teacher VLM in- troduces additional biases from its own pretraining data. RationalRewards may therefore systematically favor certain visual styles, demographics, or content types. We have not conducted a co...
-
[34]
Latent Capability Hypothesis.Our finding that test-time prompt tuning matches or exceeds RL-based fine-tuning (Section 3.2) supports the hypothesis that generators har- bor latent capabilities under-elicited by suboptimal prompts. However, this remains a working hypothesis: we have not validated it at the representation level (e.g., by probing internal ac...
-
[35]
While this corresponds to a natural boundary in our scoring rubric (Appendix D), we have not conducted a comprehensive sensitivity analysis across all benchmarks and generators
Threshold Sensitivity.The GCR loop uses a fixed threshold of 3.0 to trigger refinement. While this corresponds to a natural boundary in our scoring rubric (Appendix D), we have not conducted a comprehensive sensitivity analysis across all benchmarks and generators. The optimal threshold may vary by generator capability and task difficulty
-
[36]
Language and Domain Scope.All evaluation in this work is conducted on English- language benchmarks. The transferability of RationalRewards’ structured critiques to other languages, as well as to non-photorealistic domains (e.g., 3D rendering, video generation, scientific visualization), remains untested. G.2 Broader Impact RationalRewards and the PARROT f...
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.