pith. machine review for the scientific record.

arxiv: 2604.11626 · v2 · submitted 2026-04-13 · 💻 cs.AI · cs.LG

Recognition: unknown

RationalRewards: Reasoning Rewards Scale Visual Generation Both Training and Test Time


Pith reviewed 2026-05-10 15:32 UTC · model grok-4.3

classification 💻 cs.AI cs.LG
keywords reward model · visual generation · text-to-image · reinforcement learning · test-time refinement · preference rationalization · image editing · multi-dimensional critique

The pith

Teaching reward models to generate multi-dimensional critiques before scoring turns them into active tools that improve visual generators during training and at test time.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Most reward models for visual generation collapse human preferences into a single unexplained score. This paper shows that first requiring the model to output explicit multi-dimensional critiques produces richer signals that can guide generator improvement in two distinct regimes. At training time the critiques supply fine-grained, interpretable rewards for reinforcement learning. At test time they power a Generate-Critique-Refine loop that revises the input prompt to produce better outputs without any weight updates. To obtain the necessary training data without expensive human rationale labels, the authors introduce Preference-Anchored Rationalization (PARROT), which recovers high-quality rationales from ordinary preference pairs through anchored generation, consistency filtering, and distillation. The resulting 8B RationalRewards model reaches state-of-the-art preference prediction among open-source reward models while using 10-20 times less data than prior baselines, and its test-time refinement loop matches or exceeds the gains obtained from full RL fine-tuning on several benchmarks.
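
The test-time loop described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `Critique` container, the `generate` and `critique` callables, the `max_rounds` cap, and the "any dimension below threshold" trigger are all assumptions made for the example. Only the fixed threshold of 3.0 and the absence of weight updates come from the paper's own text. Images are represented as plain strings (for example a file path) to keep the sketch self-contained.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional, Tuple


@dataclass
class Critique:
    """Hypothetical container for one reward-model critique."""
    scores: Dict[str, Optional[float]]  # per-dimension score; None marks N/A
    refined_request: str                # critic's proposed rewrite of the request


def needs_refinement(critique: Critique, threshold: float = 3.0) -> bool:
    """Assumed trigger: any applicable dimension scores below the threshold."""
    return any(s is not None and s < threshold for s in critique.scores.values())


def generate_critique_refine(
    request: str,
    generate: Callable[[str], str],
    critique: Callable[[str, str], Critique],
    max_rounds: int = 3,
    threshold: float = 3.0,
) -> Tuple[str, str]:
    """Return (final_prompt, final_image) after at most max_rounds revisions."""
    prompt = request
    image = generate(prompt)
    for _ in range(max_rounds):
        # Assumption: the critic always judges against the original request.
        result = critique(request, image)
        if not needs_refinement(result, threshold):
            break
        prompt = result.refined_request
        image = generate(prompt)  # the generator's weights are never updated
    return prompt, image
```

Passing the generator and the critic in as callables keeps the loop independent of any particular model API; in practice both would wrap real model calls.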

Core claim

Reward models that first produce multi-dimensional rationales before assigning scores function as active optimization tools rather than passive evaluators. Their rationales serve as structured rewards for RL training of text-to-image and image-editing generators and also drive a test-time Generate-Critique-Refine procedure that revises prompts to improve outputs without parameter updates. This behavior is enabled by Preference-Anchored Rationalization (PARROT), which recovers rationales from readily available preference data via anchored generation, consistency filtering, and distillation. The trained RationalRewards (8B) model achieves leading preference-prediction accuracy among open-source reward models.
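
To make "structured rewards" concrete, here is a hedged sketch of turning a critique into a scalar RL reward. The four dimension names and the 1-4 scale come from the paper's rubric; the exact output layout matched by the regular expression, the function names, and the aggregation rule (unweighted mean of applicable dimensions, rescaled to [0, 1]) are assumptions for illustration. The paper's own aggregation is not described on this page.

```python
import re
from typing import Dict, Optional

DIMENSIONS = (
    "Text Faithfulness",
    "Image Faithfulness",
    "Physical and Visual Quality",
    "Text Rendering",
)

# Assumed layout: "<Dimension>: ## Score: <float or N/A>"
_SCORE_RE = re.compile(
    r"(?P<dim>" + "|".join(re.escape(d) for d in DIMENSIONS) + r")"
    r"\s*:\s*##\s*Score\s*:\s*(?P<val>N/A|\d+(?:\.\d+)?)"
)


def parse_scores(critique_text: str) -> Dict[str, Optional[float]]:
    """Extract per-dimension scores; N/A becomes None."""
    scores: Dict[str, Optional[float]] = {}
    for match in _SCORE_RE.finditer(critique_text):
        value = match.group("val")
        scores[match.group("dim")] = None if value == "N/A" else float(value)
    return scores


def aggregate_reward(
    scores: Dict[str, Optional[float]], lo: float = 1.0, hi: float = 4.0
) -> float:
    """Assumed rule: mean of applicable dimensions, rescaled from [lo, hi] to [0, 1]."""
    applicable = [s for s in scores.values() if s is not None]
    if not applicable:
        raise ValueError("no applicable dimension scores found")
    mean = sum(applicable) / len(applicable)
    return (mean - lo) / (hi - lo)
```

Keeping parsing and aggregation separate means a different weighting scheme can be swapped in without touching the parser.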

What carries the argument

Preference-Anchored Rationalization (PARROT), a framework that recovers high-quality rationales from preference pairs through anchored generation, consistency filtering, and distillation, enabling reward models to output multi-dimensional critiques before scoring.
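
The three PARROT stages can be read against a standard evidence lower bound. The block below is a sketch consistent with the two terms named in the paper's extracted phase descriptions (Term 1 maximized by consistency filtering, Term 2 minimized by distillation); the paper's own equation numbering and exact form are not reproduced on this page, so treat the layout as an assumption. Here x is the comparison tuple, y the preference label, and z the rationale.

```latex
\log P_\theta(y \mid x)
  \;\ge\;
  \underbrace{\mathbb{E}_{q_\phi(z \mid x, y)}\bigl[\log P_\theta(y \mid x, z)\bigr]}_{\text{Term 1}}
  \;-\;
  \underbrace{D_{\mathrm{KL}}\bigl(q_\phi(z \mid x, y)\,\|\,P_\theta(z \mid x)\bigr)}_{\text{Term 2}}

% With q_\phi held fixed, the entropy of q_\phi does not depend on \theta, so
\arg\min_\theta D_{\mathrm{KL}}\bigl(q_\phi(z \mid x, y)\,\|\,P_\theta(z \mid x)\bigr)
  \;=\;
  \arg\max_\theta \mathbb{E}_{q_\phi(z \mid x, y)}\bigl[\log P_\theta(z \mid x)\bigr]
```

The second line is why the distillation stage amounts to supervised fine-tuning of the student on the filtered teacher rationales.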

Load-bearing premise

The rationales recovered by PARROT through anchored generation, consistency filtering, and distillation are sufficiently high-quality and unbiased to serve as effective training signals for multi-dimensional critiques that genuinely improve downstream optimization.
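
The consistency-filtering step that this premise leans on can be sketched as below. The idea, taken from the paper's description, is to keep a rationale only if the preference label can be recovered from the inputs and the rationale alone, with the label hint removed. The `Candidate` type and the `predict_without_label` callable are hypothetical names introduced for this example; in practice the predictor would be a model call.

```python
from typing import Callable, Iterable, List, NamedTuple


class Candidate(NamedTuple):
    """Hypothetical record: one teacher rationale for one preference pair."""
    pair_id: str
    rationale: str
    label: str  # observed preference: "A", "B", or "Tie"


def consistency_filter(
    candidates: Iterable[Candidate],
    predict_without_label: Callable[[str, str], str],
) -> List[Candidate]:
    """Keep rationales from which the observed label is recoverable.

    predict_without_label(pair_id, rationale) must return "A", "B", or "Tie"
    without ever seeing the label hint used during rationale generation.
    """
    return [
        c
        for c in candidates
        if predict_without_label(c.pair_id, c.rationale) == c.label
    ]
```

Vague rationales and rationales that favor the non-preferred image fail this check, which matches two of the failure modes the paper lists for its teacher model.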

What would settle it

If the Generate-Critique-Refine loop produces no measurable improvement on benchmarks where RL fine-tuning with the same reward model does improve performance, the test-time mechanism would be shown not to capture useful optimization signals.

Figures

Figures reproduced from arXiv: 2604.11626 by Cong Wei, Fangzhen Lin, Haozhe Wang, Jiaming Liu, Weiming Ren, Wenhu Chen.

Figure 1: Train-Time RL and Test-Time Prompt Tuning (PT) with RationalRewards on text and image-to-image generation benchmarks. (Left) Comparison on image editing benchmarks. RL with RationalRewards outperforms prior open-source generators. Crucially, we find that test-time PT with RationalRewards alone can surpass expensive RL. (Right) Breakdown results on text-to-image benchmark UniGenBench++.
Figure 2: RationalRewards is a reasoning-based reward model that produces structured rationales before assigning scores, enabling dual-space optimization for image generation. (a) As a reward model, it improves RL-based fine-tuning of generators over scalar baselines; (b) as a test-time optimizer, its Generate–Critique–Refine loop matches or surpasses RL-based optimization on multiple benchmarks without parameter up…
Figure 3: RL (LoRA) training on Qwen-Image using scalar rewards encounter reward hacking (bottom row): as training reward continues to grow, generation quality starts to degenerate, because black box rewards mislead visual generators with biases. In contrast, RationalRewards (top row) sustains generation quality with stable reward growth.
Figure 4: We implement Preference-Anchored Rationalization as a practical three-phase pipeline.
Figure 5: Example pointwise scores rated by RationalRewards for image/text-to-image generations (rationales omitted). RationalRewards evaluates each result across multiple dimensions.
Figure 6: Qualitative results on image/text-to-image tasks optimized with reinforcement learning (RL) and prompt tuning (PT) using RationalRewards.
Figure 7: Test-Time Prompt Refinement via “Generate-Critique-Refine” loop with RationalRewards.
Figure 8: RationalRewards (a) enables explainable quality control for data curation; (b) serves as a multi-dimensional reward model driven by transparent rationales; (c) serves as a preference-calibrated test-time prompt tuner that trades compute for better generation quality; (d) fuels regional flaw grounding and dense visual rewards.
Figure 9: RL with RationalRewards on Qwen-Image (text-to-image generator) and Flux-Kontext [dev] (image-to-image editing). The reward standard-deviation gradually decays as training proceeds. Crucially, the evaluation reward curve on held-out eval-set align well with the score curve on target test benchmarks.
Figure 10: The evolution of generation quality of RL using …
Figure 11: Training curves comparison between …
Figure 12: Text-to-Image RL using scalar reward model demonstrates reward hacking – …
Figure 13: Illustration of Critique Visualization. RationalRewards first analyzes the image and provides critique rationales, then summarizes them and generates referring expressions for GroundingDINO and SAM to produce segmentation masks for problematic regions.
original abstract

Most reward models for visual generation reduce rich human judgments to a single unexplained score, discarding the reasoning that underlies preference. We show that teaching reward models to produce explicit, multi-dimensional critiques before scoring transforms them from passive evaluators into active optimization tools, improving generators in two complementary ways: at training time, structured rationales provide interpretable, fine-grained rewards for reinforcement learning; at test time, a Generate-Critique-Refine loop turns critiques into targeted prompt revisions that improve outputs without any parameter updates. To train such a reward model without costly rationale annotations, we introduce Preference-Anchored Rationalization (PARROT), a principled framework that recovers high-quality rationales from readily available preference data through anchored generation, consistency filtering, and distillation. The resulting model, RationalRewards (8B), achieves state-of-the-art preference prediction among open-source reward models, competitive with Gemini-2.5-Pro, while using 10-20x less training data than comparable baselines. As an RL reward, it consistently improves text-to-image and image-editing generators beyond scalar alternatives. Most strikingly, its test-time critique-and-refine loop matches or exceeds RL-based fine-tuning on several benchmarks, suggesting that structured reasoning can unlock latent capabilities in existing generators that suboptimal prompts fail to elicit.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper claims that training reward models to output explicit multi-dimensional critiques before scoring, via a new Preference-Anchored Rationalization (PARROT) framework that extracts rationales from preference pairs through anchored generation, consistency filtering, and distillation, yields an 8B RationalRewards model. This model achieves SOTA open-source preference prediction (competitive with Gemini-2.5-Pro on less data), serves as a stronger RL reward for text-to-image and editing generators than scalar baselines, and enables a test-time Generate-Critique-Refine loop that matches or exceeds RL fine-tuning on benchmarks.

Significance. If the empirical claims hold after verification, the work would be significant for shifting reward models from opaque scalars to interpretable, actionable reasoning tools usable at both training (RL) and test time (prompt revision without updates). The reported data efficiency and dual-use of critiques represent a potentially useful direction for visual generation optimization, though the absence of supporting details limits current assessment.

major comments (3)
  1. [Abstract and §3] Abstract and §3 (PARROT): The central claim that PARROT recovers high-quality, unbiased rationales via anchored generation, consistency filtering, and distillation is load-bearing for all downstream results, yet no human agreement metrics, rationale fidelity ablations, or comparisons against gold rationales are reported to confirm this assumption.
  2. [§4] §4 (Experiments): The abstract asserts SOTA preference prediction and RL/test-time gains, but reports no experimental details, full baseline comparisons, statistical significance tests, or ablation studies, making it impossible to assess whether the data support the claims.
  3. [§4.1] §4.1 (Preference prediction results): Rationales are recovered from the identical preference data used to train the model, creating a circularity risk where reported gains may reflect distribution fitting rather than independent generalization; this is not addressed with held-out rationale validation or bias analysis.
minor comments (2)
  1. [§3] Clarify how the multi-dimensional critique dimensions are defined, scored, and aggregated into the final reward signal.
  2. [Related Work] Add missing references to prior work on rationale-augmented rewards or critique-based refinement in vision-language models.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and will revise the manuscript accordingly to improve transparency and rigor.

point-by-point responses
  1. Referee: [Abstract and §3] The central claim that PARROT recovers high-quality, unbiased rationales via anchored generation, consistency filtering, and distillation is load-bearing for all downstream results, yet no human agreement metrics, rationale fidelity ablations, or comparisons against gold rationales are reported to confirm this assumption.

    Authors: We acknowledge that direct human validation metrics would provide stronger support for the quality of recovered rationales. The current manuscript relies on consistency filtering, distillation, and downstream empirical gains as proxies for rationale quality. We will add human agreement metrics on a sampled subset of rationales, ablations isolating each PARROT stage to measure impact on rationale fidelity, and comparisons against any available gold rationales from the source preference datasets. These additions will appear in an expanded §3. revision: yes

  2. Referee: [§4] The abstract asserts SOTA preference prediction and RL/test-time gains, but reports no experimental details, full baseline comparisons, statistical significance tests, or ablation studies, making it impossible to assess whether the data support the claims.

    Authors: We agree that the main text should contain more complete experimental information. The full manuscript includes setup details, but we will expand §4 to report full baseline comparisons (including all relevant open- and closed-source models), statistical significance tests across multiple seeds, and additional ablations on critique dimensions, RL reward usage, and the Generate-Critique-Refine loop. Extended tables and implementation specifics will be moved to the appendix. revision: yes

  3. Referee: [§4.1] Rationales are recovered from the identical preference data used to train the model, creating a circularity risk where reported gains may reflect distribution fitting rather than independent generalization; this is not addressed with held-out rationale validation or bias analysis.

    Authors: This is a legitimate concern. Preference prediction is already evaluated on held-out test splits disjoint from training data. To further address circularity, we will add held-out rationale generation experiments on unseen preference pairs and include a bias analysis examining correlations across critique dimensions. These results will be reported in §4.1 with updated splits and analysis. revision: partial
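
The disjointness claim in the last response is straightforward to audit. Below is a minimal sketch of such a check, assuming each preference pair can be identified by its request text and two image identifiers; the `Pair` layout, the normalization (case-folding, trimming, order-insensitive image IDs), and the function names are assumptions, not details from the paper.

```python
import hashlib
from typing import Iterable, Set, Tuple

Pair = Tuple[str, str, str]  # (request, image_a_id, image_b_id), hypothetical layout


def pair_key(pair: Pair) -> str:
    """Stable key that ignores A/B ordering and trivial request formatting."""
    request, image_a, image_b = pair
    first, second = sorted((image_a, image_b))
    payload = "\x1f".join((request.strip().lower(), first, second))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def overlapping_keys(train: Iterable[Pair], heldout: Iterable[Pair]) -> Set[str]:
    """Return keys present in both splits; an empty set means disjoint."""
    return {pair_key(p) for p in train} & {pair_key(p) for p in heldout}
```

A check like this only catches exact pair reuse. Near-duplicate images or paraphrased requests would need a similarity-based comparison instead.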

Circularity Check

0 steps flagged

No significant circularity detected; claims rest on independent empirical benchmarks.

full rationale

The paper describes an empirical pipeline: PARROT generates rationales from existing preference pairs via anchored generation, filtering, and distillation; RationalRewards is then trained on the resulting (preference, rationale, score) tuples. Performance is measured on separate test sets for preference prediction accuracy, RL-based generator improvement, and test-time Generate-Critique-Refine gains. No equations, uniqueness theorems, or first-principles derivations are presented that reduce by construction to the training inputs. No self-citations are invoked as load-bearing justification for the core method or results. The reported gains are therefore not forced by the construction of the inputs themselves but are subject to standard empirical verification on held-out data.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 2 invented entities

The central claims depend on the unverified quality of PARROT-generated rationales and the assumption that multi-dimensional critiques provide superior optimization signals compared to scalar scores, with no external validation or formal guarantees provided.

invented entities (2)
  • PARROT framework no independent evidence
    purpose: Recover high-quality rationales from preference data without direct annotations
    Core method introduced to enable training of the critique-producing reward model.
  • RationalRewards model no independent evidence
    purpose: 8B parameter reward model that outputs critiques before scores
    The trained model whose performance underpins all reported improvements.

pith-pipeline@v0.9.0 · 5539 in / 1403 out tokens · 35904 ms · 2026-05-10T15:32:17.099273+00:00 · methodology


Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Bad Seeing or Bad Thinking? Rewarding Perception for Vision-Language Reasoning

    cs.AI 2026-05 unverdicted novelty 6.0

    A new RL method called MoCA with Perception Verification rewards perceptual fidelity independently to improve both seeing and thinking in VLMs.

  2. PanoWorld: Towards Spatial Supersensing in 360$^\circ$ Panorama World

    cs.CV 2026-05 unverdicted novelty 6.0

    PanoWorld adds spherical geometry to MLLMs via cross-attention and pano-specific instruction data, yielding better performance on panoramic spatial reasoning benchmarks than standard perspective-based pipelines.

Reference graph

Works this paper leans on

32 extracted references · 1 canonical work pages · cited by 2 Pith papers

  1. [1]

    URL https://arxiv.org/abs/2410.12832. OpenAI. Gpt-image-1. https://platform.openai.com/docs/guides/image-generation?image-generation-model=gpt-image-1, 2025. OpenAI’s image generation model. Accessed September 2025. Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach. SDXL: Improving lat...

  2. [2]

    The preference label y is provided as a hint, focusing generation on rationales consistent with the observed preference

    Phase 1 (Rationale Generation) constructs the variational posterior qϕ(z|x, y) by prompting a teacher VLM with preference-anchored instructions. The preference label y is provided as a hint, focusing generation on rationales consistent with the observed preference

  3. [3]

    Phase 2 (Consistency Filtering) maximizes Term 1, E_qϕ[log Pθ(y|x, z)], by retaining only rationales z for which the preference y can be recovered from (x, z) alone (Eq. 2). This restricts qϕ’s effective support to the high-likelihood region, ensuring predictive sufficiency

  4. [4]

    A person planting a tree with cat. HD. Realistic style

    Phase 3 (Foresight Distillation) minimizes Term 2, D_KL(qϕ(z|x, y) ∥ Pθ(z|x)), by training the student model Pθ(z|x) to generate rationales without access to y. Since qϕ is fixed, this reduces to maximizing E_qϕ[log Pθ(z|x)], which is precisely the standard supervised fine-tuning (SFT) objective on the filtered posterio… Figure 12: Text-to-Image RL using scalar reward model demonstrates reward hacking – while the reward increases, the visual quality o...

  5. [5]

    show its work

    Physical and Visual Quality: ## Justification: … However, there are **notable physical and anatomical flaws**:
      - **Hand Structure Deformity**: The person's right hand (touching the soil) has an unnatural, elongated thumb …
      - **Abnormal Element Overlap / Implausible Interaction**: The cat's front paw is placed against the person's extended hand, but the physic...

  6. [7]

    3 (Minor mismatch): Most relevant elements are preserved, but a few aspects are missing or incorrectly handled

    Image Faithfulness (How well are the non-edited parts and key input elements preserved?)
      - 4 (Uses input fully): All relevant elements from the input are accurately preserved or transformed as instructed.
      - 3 (Minor mismatch): Most relevant elements are preserved, but a few aspects are missing or incorrectly handled.
      - 2 (Partial mismatch): Some elements are car...

  7. [8]

    No visible artifacts

    Physical and Visual Quality (Technical errors, composition, realism, and physics)
      - 4 (No noticeable flaws): The image is physically plausible. No visible artifacts.
      - 3 (Minor flaws): Small inaccuracies that are noticeable but not strongly disruptive.
      - 2 (Some flaws): Clear physical or visual errors that disrupt the image.
      - 1 (Severe flaws): Major physical/visu...

  8. [9]

    3 (Mostly match): Minor misspellings or inconsistent capitalization

    Text Rendering (Only if the instruction involves generating text)
      - 4 (Full match): Text is correct, legible, and integrated well.
      - 3 (Mostly match): Minor misspellings or inconsistent capitalization.
      - 2 (Partial match): Major misspellings or distorted text.
      - 1 (Major deviations): Text is unreadable, severely distorted, or missing.
      (Use N/A if no text generatio...

  9. [10]

    Image Faithfulness

    Text Faithfulness: ## Justification: [Detailed comparison] ## Score A: [float] ## Score B: [float] ## Winner: [A/B/Tie] 2--4. (same structure for remaining aspects) # Summary: [Overall comparison summary] Text-to-Image Variant. For text-to-image generation, the prompt is modified as follows: (1) only two images are provided (Generated Image A and Generated Image ...

  10. [11]

    The Source Image (First image)

  11. [12]

    To do this, you must first assess the image on four critical aspects, provide justifications and absolute scores in 1--4 scale

    The Edited Image (Second image) Your task is to evaluate the Edited Image against the Source Image and the User Instruction. To do this, you must first assess the image on four critical aspects, provide justifications and absolute scores in 1--4 scale. About the scores: you should try to give float scores. For example, float values are important to reflect...

  12. [13]

    No hallucinations or unrequested changes

    Text Faithfulness (How accurately does the output follow the instruction?)
      - 4 (Full match): All key elements (objects, colors, actions) are represented exactly as described. No hallucinations or unrequested changes.
      - 3 (Minor mismatch): Most key elements are present, but minor details are missing, incorrect, or slightly inaccurate....

  13. [15]

    No visible artifacts (seams, blurring, noise)

    Physical and Visual Quality (Technical errors, composition, realism, and physics)
      - 4 (No noticeable flaws): The image is physically plausible (correct lighting, shadows, geometry, anatomy). No visible artifacts (seams, blurring, noise).
      - 3 (Minor flaws): Small inaccuracies that are noticeable but not strongly disruptive (e.g., slight lighting mismatch, minor...

  14. [16]

    3 (Mostly match): Minor misspellings or inconsistent capitalization

    Text Rendering (Only if the instruction involves generating text)
      - 4 (Full match): Text is correct, legible, and integrated well.
      - 3 (Mostly match): Minor misspellings or inconsistent capitalization.
      - 2 (Partial match): Major misspellings or distorted text.
      - 1 (Major deviations): Text is unreadable, severely distorted, or missing.
      (Use N/A if no text generatio...

  15. [19]

    Under review

    Physical and Visual Quality: ## Score: [float score] ## Justification: [Detailed explanation of the score]

  16. [20]

    First, the critique prompt evaluates a single generated image across four dimensions with natural language justification

    Text Rendering: ## Score: [float score or N/A] ## Justification: [Detailed explanation of the score] # Summary: [Summary of the evaluation] C.4 Generate–Critique–Refine (GCR) Loop Prompts. The GCR loop at test time (Section 2.2, Figure 6) uses the trained RationalRewards model in two stages. First, the critique prompt evaluates a single generated image across...

  17. [21]

    No hallucinations or unrequested changes

    Text Faithfulness (How accurately does the output follow the instruction?)
      - 4 (Full match): All key elements (objects, colors, actions) are represented exactly as described. No hallucinations or unrequested changes.
      - 3 (Minor mismatch): Most key elements are present, but minor details are missing, incorrect, or slightly inaccurate.
      - 2 (Some mismatch): Some ke...

  18. [22]

    3 (Minor mismatch): Most relevant elements are preserved, but a few aspects (e.g., background details, lighting consistency) are missing or incorrectly handled

    Image Faithfulness (How well are the non-edited parts and key input elements preserved?)
      - 4 (Uses input fully): All relevant elements from the input (background, style, lighting, identity) are accurately preserved or transformed as instructed.
      - 3 (Minor mismatch): Most relevant elements are preserved, but a few aspects (e.g., background details, lighting con...

  19. [23]

    No visible artifacts (seams, blurring, noise)

    Physical and Visual Quality (Technical errors, composition, realism, and physics)
      - 4 (No noticeable flaws): The image is physically plausible (correct lighting, shadows, geometry, anatomy). No visible artifacts (seams, blurring, noise).
      - 3 (Minor flaws): Small inaccuracies that are noticeable but not strongly disruptive (e.g., slight lighting mismatch, minor...

  20. [24]

    3 (Mostly match): Minor misspellings or inconsistent capitalization

    Text Rendering (Only if the instruction involves generating text)
      - 4 (Full match): Text is correct, legible, and integrated well.
      - 3 (Mostly match): Minor misspellings or inconsistent capitalization.
      - 2 (Partial match): Major misspellings or distorted text.
      - 1 (Major deviations): Text is unreadable, severely distorted, or missing.
      (Use N/A if no text generatio...

  21. [25]

    Text Faithfulness: ## Score:[ float score ] ## Justification:[Detailed explanation of the score]

  22. [26]

    Image Faithfulness: ## Score:[ float score ] ## Justification:[Detailed explanation of the score]

  23. [27]

    Physical and Visual Quality: ## Score:[ float score ] ## Justification:[Detailed explanation of the score]

  24. [28]

    floating

    Text Rendering: ## Score:[ float score or N/A ] ## Justification:[Detailed explanation of the score] # Summary:[Summary of the evaluation] # User Request Refinement: ## Refinement Comments:[Explanation of why the original instruction needs refinement and what constraints should be added] ## Refined Request:[Improved, more specific instruction that address...

  25. [29]

    Image A contains a clear sunset in the background

    Visual hallucination: The teacher generates a rationale describing visual content not present in the images (e.g., “Image A contains a clear sunset in the background” when no sunset is visible), leading to an incorrect preference prediction when the label hint is removed

  26. [30]

    Label-ignoring rationales: Despite the preference anchor, the teacher occasionally generates a rationale that favors the non-preferred image, particularly when the quality difference between images is subtle

  27. [31]

    Both images are of reasonable quality

    Vague, non-predictive reasoning: The rationale provides generic praise or criticism (e.g., “Both images are of reasonable quality”) without sufficient discriminative detail to distinguish between the two options. F.3 Evaluation Benchmark Summary Table 13 summarizes all evaluation benchmarks used in this work. Benchmark Task # Samples Evaluation Protocol M...

  28. [32]

    Teacher Model Dependence. The quality of RationalRewards is upper-bounded by the teacher model (Qwen3-VL-32B-Instruct) used to generate training rationales. In domains where the teacher exhibits systematic blind spots—such as fine-grained physics simulation, culturally specific aesthetics, or specialized technical content—the student model inherits these l...

  29. [33]

    The teacher VLM introduces additional biases from its own pretraining data

    Bias Inheritance. Preference datasets (EditReward, HPDv3, RapidData) encode the aesthetic preferences and cultural assumptions of their annotators. The teacher VLM introduces additional biases from its own pretraining data. RationalRewards may therefore systematically favor certain visual styles, demographics, or content types. We have not conducted a co...

  30. [34]

    Latent Capability Hypothesis. Our finding that test-time prompt tuning matches or exceeds RL-based fine-tuning (Section 3.2) supports the hypothesis that generators harbor latent capabilities under-elicited by suboptimal prompts. However, this remains a working hypothesis: we have not validated it at the representation level (e.g., by probing internal ac...

  31. [35]

    While this corresponds to a natural boundary in our scoring rubric (Appendix D), we have not conducted a comprehensive sensitivity analysis across all benchmarks and generators

    Threshold Sensitivity. The GCR loop uses a fixed threshold of 3.0 to trigger refinement. While this corresponds to a natural boundary in our scoring rubric (Appendix D), we have not conducted a comprehensive sensitivity analysis across all benchmarks and generators. The optimal threshold may vary by generator capability and task difficulty

  32. [36]

    Language and Domain Scope. All evaluation in this work is conducted on English-language benchmarks. The transferability of RationalRewards’ structured critiques to other languages, as well as to non-photorealistic domains (e.g., 3D rendering, video generation, scientific visualization), remains untested. G.2 Broader Impact RationalRewards and the PARROT f...