Generation Navigator: A State-Aware Agentic Framework for Image Generation

Jinming Liu; Ruoyu Feng; Wenjun Zeng; Xin Jin; Yuqi Wang

arxiv: 2605.17969 · v1 · pith:LOQV2DOInew · submitted 2026-05-18 · 💻 cs.CV

Generation Navigator: A State-Aware Agentic Framework for Image Generation

Jinming Liu , Ruoyu Feng , Yuqi Wang , Wenjun Zeng , Xin Jin This is my paper

Pith reviewed 2026-05-20 12:37 UTC · model grok-4.3

classification 💻 cs.CV

keywords text-to-image generationmulti-turn agentsreinforcement learningcredit assignmentagentic frameworkpolicy optimizationimage quality metrics

0 comments

The pith

A state-aware agent learns to steer multi-turn image generation with a trajectory-level objective that rewards quality peaks, retention of gains, and turn efficiency.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper treats text-to-image generation as a sequence of decisions where the agent sees the current image state and picks the next action to better match user intent. Prior systems rely on prompt rewriting or fixed rules that do not adapt to how quality changes across turns. The core fix is PRE-GRPO, which decomposes the reward so the agent is credited for reaching high quality, for keeping that quality instead of letting it drop, and for stopping once further turns add nothing. Experiments report clear gains over baselines, including a WISE score of 0.90 and 79 percent reasoning accuracy on a dedicated benchmark. A reader would care because the method points toward reliable automation of the trial-and-error process that still dominates practical image creation.

Core claim

Generation Navigator reformulates image generation as a state-conditioned action-making problem in which an agent learns to output the next steering action based on the evolving image. Standard reinforcement learning fails here because a single end-of-trajectory reward gives equal credit to every prior action and cannot tell improving moves from degrading or wasteful ones. PRE-GRPO resolves the issue by explicitly scoring trajectories on three axes: discovery of a high-quality image (Peak), preservation of that quality in subsequent turns (Retention), and avoidance of unnecessary steps (Efficiency). The resulting agent produces higher-quality and more faithful outputs than rule-driven or non

What carries the argument

PRE-GRPO, a trajectory-level reinforcement learning objective that rewards discovery of a high-quality image (Peak), preservation of quality across later turns (Retention), and reduction of superfluous turns (Efficiency).

If this is right

The agent can adapt its next action to the actual quality trajectory instead of following static rewriting rules.
Early high-quality outputs are protected from later degradation rather than being overwritten.
Unproductive turns are reduced while final image quality stays high or improves.
Benchmark scores rise on both general image quality measures and specialized reasoning tests.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same reward decomposition could be tested in other sequential generative domains such as video or 3D asset creation.
Better state representations might further improve the agent's ability to detect when quality has peaked.
Trajectory-aware objectives of this kind may lower the amount of human feedback needed to train reliable generative agents.

Load-bearing premise

The three explicit components of Peak, Retention, and Efficiency supply the correct credit signals for individual actions inside multi-turn rollouts without missing other dynamics or adding new biases.

What would settle it

A controlled ablation that removes one or more of the Peak, Retention, or Efficiency terms from the objective and still matches or exceeds the reported WISE and reasoning scores on the same benchmarks would show the decomposition is not required.

Figures

Figures reproduced from arXiv: 2605.17969 by Jinming Liu, Ruoyu Feng, Wenjun Zeng, Xin Jin, Yuqi Wang.

**Figure 1.** Figure 1: The necessity of dynamic state-conditioned actions and trajectory-level rewards. (a) Action choice is state-dependent: a preliminary study on T2I-ReasonBench [Sun et al., 2025] shows that different actions have respective advantages, and dynamic action making is superior to fixed workflows. (b) RL training relying only on a best-image-score reward struggles to distinguish regressive trajectories or unneces… view at source ↗

**Figure 2.** Figure 2: Overview of Generation Navigator. (a) At each turn, the navigator observes the prompt and interaction history, then outputs a structured action with one of three action choices—STOP, REFINE, or REGENERATE—along with a revised prompt. Until the turn budget is exhausted, or STOP is selected, the final output is selected as the highest-scored candidate across the entire trajectory. (b) PRE-GRPO decomposes the… view at source ↗

**Figure 3.** Figure 3: Training-stage ablation reveals progressive gains on complex tasks and learned restraint on simple tasks. For complex reasoning and knowledge-intensive tasks (T2I-ReasonBench and WISE), multi-turn action making and policy optimization yield steady improvements. For simple tasks (GenEval), TF Agent and SFT degrade below the one-shot baseline due to the narrow action distribution caused by hand-crafted decis… view at source ↗

**Figure 4.** Figure 4: PRE-GRPO action analysis and quality–latency trade-off. (a) Reward components: PRE-GRPO outperforms final-score-only and best-score-only reward variants, and both the retention and efficiency terms contribute to the final performance. Here, “-peak”, “-ret.” and “-eff.” denote reward variants w/o the peak, retention, and efficiency terms, respectively. (b) Action behavior: PRE-GRPO shifts the action distrib… view at source ↗

**Figure 5.** Figure 5: Robustness and transfer across system components. The learned state-conditioned action policy shows consistent transfer across navigator, generator, and reviewer choices. (a) Navigator backbone: training improves state-conditioned action-making performance across 4B, 7B and 8B navigators, indicating the robustness of our training method across different navigators. (b) Generator transfer: The navigator tra… view at source ↗

**Figure 6.** Figure 6: Prompt-pool distribution used for data construction. (a) Per-dimension score distribution [PITH_FULL_IMAGE:figures/full_fig_p014_6.png] view at source ↗

**Figure 7.** Figure 7: Training data contamination analysis. From left to right: embedding cosine similarity [PITH_FULL_IMAGE:figures/full_fig_p016_7.png] view at source ↗

**Figure 8.** Figure 8: Hyperparameter studies. We sweep the terminal retention weight α, turn-cost weight β, and maximum turn budget Tmax. Dashed vertical lines mark the default setting used in the main experiments. The dotted horizontal baseline in the first two panels denotes the corresponding SFT baseline. Both α and β consistently outperform the baseline over a wide range of values, demonstrating that our method is robust to… view at source ↗

**Figure 10.** Figure 10: Across both reward variants, best-score selection outperforms final-output selection. This indicates that, in a multi-turn generation trajectory, the last image is not always the strongest candidate according to the evaluation score, so preserving the best-scored candidate is a simple and more reliable output rule. The combination of best-step reward and best-score selection yields the strongest result (7… view at source ↗

**Figure 9.** Figure 9: Controlled sampling-budget comparison on T2I-ReasonBench. We compare one-shot generation, best-of-3 controls with benchmark selection, prompt-enhanced controls, fixed-workflow agents, training-free state-conditioned action making, and trained action-policy variants. The small gains from best-of-3 selection isolate the effect of additional sampling, while the larger gains from state-conditioned and trained … view at source ↗

**Figure 10.** Figure 10: Best-score vs. final-score selection on T2I-ReasonBench. We ablate the training reward (final-step only vs. best-step) and the inference selector (final output vs. best-scored trajectory candidate). Best-score selection consistently improves over final-output selection under both reward choices. lowers the average to 1.67 while achieving the best GenEval score (0.8843), suggesting that its trajectory-leve… view at source ↗

**Figure 11.** Figure 11: Quality–latency trade-off as the turn budget increases from 1 to 4. Users can adjust the [PITH_FULL_IMAGE:figures/full_fig_p021_11.png] view at source ↗

**Figure 12.** Figure 12: Quality–latency trade-off on WISE. Generation Navigator achieves substantially higher WISE scores than IRG while using much lower latency, especially in the no-CoT setting. 22 [PITH_FULL_IMAGE:figures/full_fig_p022_12.png] view at source ↗

read the original abstract

Despite rapid advances in text-to-image generation, faithfully realizing user intent remains challenging, often requiring manual multi-turn trial and error. To automate this process, existing systems rely on either simple prompt rewriting or closed-loop agents driven by hand-crafted rules, rather than learning to adapt actions to the evolving generation process. In this paper, we reformulate image generation as a state-conditioned action-making problem and propose Generation Navigator, a multi-turn T2I agent that learns to dynamically steer the generation trajectory and output the next action. However, training this agent via reinforcement learning introduces a critical credit assignment challenge: naively rewarding a trajectory based solely on a single state assigns equal credit to all actions in the rollout, ignores the quality dynamics across turns, and fails to distinguish actions that improve the trajectory from those that degrade it or waste turns without progress. We resolve this with PRE-GRPO (Peak-Retention-Efficiency Group Relative Policy Optimization), a trajectory-level reinforcement learning objective that explicitly rewards discovering a high-quality image (Peak), avoiding subsequent quality degradation across turns (Retention), and minimizing unnecessary turns (Efficiency). Experiments show substantial improvements across benchmarks, reaching a WISE score of 0.90 and 79.06% reasoning accuracy on T2I-ReasonBench.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper frames multi-turn image generation as a learned state-conditioned RL task and introduces PRE-GRPO to fix credit assignment, but the whole claim rests on an unspecified quality scorer whose reliability is not shown.

read the letter

The main thing to know is that the authors treat iterative text-to-image refinement as a trajectory where the agent picks actions based on the current image state, and they replace the usual single final reward with PRE-GRPO. This objective separately scores the discovery of a high-quality image, whether quality holds up in later turns, and whether extra turns are avoided. That split directly attacks the credit-assignment problem they describe, where a naive end-of-trajectory reward would treat every step the same even if some steps degrade the result or add nothing useful. The formulation looks like a clean way to make the policy learn from the actual dynamics of the generation process rather than from hand-coded rules or simple prompt rewriting. The reported numbers on WISE and T2I-ReasonBench are presented as substantial gains, which suggests the method can produce measurable differences on existing benchmarks. The soft spot is exactly the one the stress-test note flags: the paper never defines how quality is scored at each turn. Without knowing whether this is a frozen VLM, a CLIP variant, a learned reward model, or something else, and without any check on how well that scorer matches human intent on the failure modes the agent is supposed to fix, it is hard to know whether the three reward terms are actually giving credit for the right things or just amplifying whatever biases the scorer already has. The abstract also gives no detail on baselines, ablations, or statistical tests, so the size of the improvement is difficult to interpret. This work is aimed at researchers who build agentic systems for controllable generation and who already work with RL on generative models. A reader in that niche can extract the reward decomposition and the state-conditioning idea even if the experiments need more scrutiny. The paper has a concrete technical proposal plus benchmark results, so it deserves to go to peer review rather than a desk reject; the referees will need to press on the quality function and the experimental controls.

Referee Report

2 major / 2 minor

Summary. The paper claims to address challenges in faithfully realizing user intent in text-to-image generation by proposing Generation Navigator, a multi-turn agent that learns to dynamically steer the generation process. It introduces PRE-GRPO, a trajectory-level RL objective that rewards Peak (discovering high-quality image), Retention (avoiding quality degradation), and Efficiency (minimizing unnecessary turns) to solve credit assignment issues in multi-turn rollouts. The approach is evaluated on benchmarks, achieving a WISE score of 0.90 and 79.06% reasoning accuracy on T2I-ReasonBench.

Significance. The proposed framework and PRE-GRPO objective have the potential to significantly improve automated multi-turn image generation by learning adaptive behaviors rather than using fixed rules. If the quality metric used in the reward terms reliably tracks progress toward user intent, this could lead to more effective agentic systems. The reported performance gains suggest practical benefits, but the significance hinges on providing more details on the implementation and validation of the core components.

major comments (2)

Method section, PRE-GRPO formulation: The PRE-GRPO objective relies on a scalar quality score computed at each turn to define the Peak, Retention, and Efficiency terms, but the manuscript provides no explicit definition of this quality function (e.g., specific VLM, CLIP variant, or learned model, and whether it is frozen or trained). This is load-bearing for the central claim that PRE-GRPO solves credit assignment without introducing new biases, as poor correlation with user intent would undermine the decomposition.
Experiments section: The reported gains (WISE score of 0.90 and 79.06% on T2I-ReasonBench) are presented without details on baselines, statistical tests, or ablations isolating the contribution of each PRE-GRPO component. This makes it difficult to confirm that improvements stem directly from the proposed objective rather than implementation choices.

minor comments (2)

Abstract: The claim of 'substantial improvements across benchmarks' would be strengthened by briefly naming the base T2I model and primary comparison methods.
Notation and figures: The state-conditioned action space and trajectory dynamics could be clarified with a diagram or pseudocode to aid readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major comment below and will revise the manuscript to incorporate the requested clarifications.

read point-by-point responses

Referee: Method section, PRE-GRPO formulation: The PRE-GRPO objective relies on a scalar quality score computed at each turn to define the Peak, Retention, and Efficiency terms, but the manuscript provides no explicit definition of this quality function (e.g., specific VLM, CLIP variant, or learned model, and whether it is frozen or trained). This is load-bearing for the central claim that PRE-GRPO solves credit assignment without introducing new biases, as poor correlation with user intent would undermine the decomposition.

Authors: We agree that the manuscript lacks an explicit definition of the quality function, which is necessary to evaluate the central claims. We will revise the method section to provide a full specification of this function, including the model architecture, training status (frozen or otherwise), and any supporting validation of its alignment with user intent. revision: yes
Referee: Experiments section: The reported gains (WISE score of 0.90 and 79.06% on T2I-ReasonBench) are presented without details on baselines, statistical tests, or ablations isolating the contribution of each PRE-GRPO component. This makes it difficult to confirm that improvements stem directly from the proposed objective rather than implementation choices.

Authors: We agree that the experiments section would be strengthened by additional details. We will revise to include explicit baseline comparisons, results from statistical significance tests across multiple runs, and ablations that isolate the individual contributions of the Peak, Retention, and Efficiency terms. revision: yes

Circularity Check

0 steps flagged

PRE-GRPO is a proposed RL objective addressing an identified credit-assignment issue; no reduction to inputs by construction

full rationale

The paper describes a credit assignment problem in naive trajectory-level RL for multi-turn T2I agents (equal credit to all actions, ignoring quality dynamics across turns) and introduces PRE-GRPO as an explicit decomposition into Peak, Retention, and Efficiency terms. This is a design choice for the reward structure rather than a derivation that reduces to prior equations, fitted parameters, or self-citations. The central claim is supported by empirical results on WISE and T2I-ReasonBench rather than by algebraic equivalence or load-bearing self-reference. No self-definitional, fitted-input, or uniqueness-imported steps appear in the provided abstract or description.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 2 invented entities

The claim rests on modeling image generation as an MDP-like process and on the new reward design in PRE-GRPO; no machine-checked proofs or shipped code are mentioned.

free parameters (1)

Balancing coefficients among Peak, Retention, and Efficiency terms
The trajectory-level objective combines three reward aspects whose relative importance is not derived from first principles and would require tuning.

axioms (1)

domain assumption Text-to-image generation can be usefully reformulated as a state-conditioned sequential decision process.
The central reformulation in the abstract depends on this modeling choice being appropriate for the task.

invented entities (2)

Generation Navigator agent no independent evidence
purpose: To output the next action that steers the generation trajectory
New agent architecture introduced to implement the state-aware framework.
PRE-GRPO objective no independent evidence
purpose: To provide trajectory-level credit assignment via Peak, Retention, and Efficiency
Custom reinforcement learning objective proposed to solve the identified credit assignment problem.

pith-pipeline@v0.9.0 · 5761 in / 1535 out tokens · 72095 ms · 2026-05-20T12:37:36.783425+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel (J-cost uniqueness) echoes

?

echoes
ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

PRE-GRPO ... explicitly rewards discovering a high-quality image (Peak), avoiding subsequent quality degradation across turns (Retention), and minimizing unnecessary turns (Efficiency). ... R(τi) = Pi + α·Ri − β·Ei + γ·Fi
IndisputableMonolith/Foundation/ArithmeticFromLogic.lean LogicNat recovery and embed_strictMono unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

trajectory-level reinforcement learning objective

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

26 extracted references · 26 canonical work pages

[1]

Pilot study details.Appendix B expands the setup behind the introductory pilot study, including the three-turn workflow and the fixed-action versus preference-reference compari- son

work page
[2]

This section supports the training-data claims in Section 4.1 and clarifies how action-trajectory supervision differs from ordinary prompt-image data

Data pipeline and distribution.Appendix C describes the construction of the 103K struc- tured trajectories, including prompt-pool scoring, targeted prompt augmentation, branch- and-select exploration, and trajectory filtering. This section supports the training-data claims in Section 4.1 and clarifies how action-trajectory supervision differs from ordinar...

work page
[3]

Agent prompts.Appendix D gives the complete reviewer and navigator prompt templates

work page
[4]

6, the turn budget Tmax, and representative reward-scale examples

Hyperparameter studies.Appendix E analyzes the reward weights α and β in Eq. 6, the turn budget Tmax, and representative reward-scale examples. These studies explain why the default PRE-GRPO configuration balances candidate discovery, terminal retention, and turn efficiency

work page
[5]

This isolates the net contribution of state- conditioned action making from gains caused by extra sampling budget

Controlled sampling-budget comparison.Appendix F compares one-shot generation, best-of-3 selection, prompt enhancement, fixed-workflow agents, and trained action-policy variants under a single T2I-ReasonBench view. This isolates the net contribution of state- conditioned action making from gains caused by extra sampling budget

work page
[6]

Best-score versus final-score selection.Appendix G ablates the interaction between trajectory-level reward choice and inference-time output selection, explaining why Genera- tion Navigator returns the highest-scored candidate along the trajectory

work page
[7]

Average generation turns.Appendix H reports the average number of generation turns on GenEval as an indirect signal of action calibration on simple compositional prompts

work page
[8]

Quality–latency trade-off.Appendix I studies how the maximum turn budget affects accuracy and latency for the no-CoT Navigator

work page
[9]

WISE latency analysis.Appendix J compares the quality–latency trade-off on WISE against generator-only and agent baselines

work page
[10]

The examples expose the actual prompts, reviewer feedback, actions, and selected images behind the aggregate results

Qualitative visualizations.Appendix K presents representative multi-turn cases across textual image design, counting, spatial relations, scientific reasoning, and two-object gener- ation. The examples expose the actual prompts, reviewer feedback, actions, and selected images behind the aggregate results

work page
[11]

Human evaluation.Appendix L reports a pairwise human study comparing human prefer- ences with reviewer-induced preferences, providing an empirical check that the automatic reviewer is a useful signal for PRE-GRPO trajectory optimization

work page
[12]

B Pilot Study Details Pilot study details (Section 1).We construct a three-turn workflow on T2I-ReasonBench to compare action strategies

Limitations and future directions.Appendix M discusses computational trade-offs, and the use of reviewers as environment signals. B Pilot Study Details Pilot study details (Section 1).We construct a three-turn workflow on T2I-ReasonBench to compare action strategies. In the first turn, each prompt is rewritten by Doubao-Seed1.5 and then fed to FLUX.2-Klei...

work page
[13]

This captures paraphrase-level semantic overlap

Embedding cosine similarity.We encode each benchmark prompt and each training prompt with the sentence-transformers/all-MiniLM-L6-v2 encoder [Reimers and Gurevych, 2019, Wang et al., 2020] and report the maximum cosine similarity between each benchmark prompt and its nearest neighbor in the training pool. This captures paraphrase-level semantic overlap. 2...

work page 2019
[14]

5-gram containment.Measures the fraction of a benchmark prompt’s 5-grams that appear in its nearest training prompt, capturing asymmetric substring inclusion

work page
[15]

8-gram containment.Following the contamination protocol used in PaLM [Chowdhery et al., 2023], we flag a benchmark sample as potentially contaminated if ≥70% of its 8-grams are contained in any training prompt

work page 2023
[16]

13-gram collision.Following GPT-3 [Brown et al., 2020], we check for exact 13-gram matches between benchmark and training prompts. 0.0 0.2 0.4 0.6 0.8 1.0 max similarity to prompt pool 0% 20% 40% 60% 80% 100%cumulative percent Embedding cosine (ECDF) benchmark GenEval T2I-ReasonBench WISE 0.0 0.2 0.4 0.6 0.8 1.0 max similarity to prompt pool 0 20 40 60 80...

work page 2020
[17]

Keep the thinking process concise (within 512 tokens)

Analyze the **Current Image** strictly against the **User Request**. Keep the thinking process concise (within 512 tokens)

work page
[18]

holding a red cup

Provide a detailed diagnosis of flaws. **Inputs**: - **User Request**: {user_request} - **Current Image**: (Visual Input) **Evaluation Criteria**: **1. Aesthetic Quality**: **Aesthetic & Technical Quality Scoring Rules (0.0-5.0)**: Evaluate the overall aesthetic appeal of the image and provide a score: Assess the image for technical flaws (artifacts, anat...

work page
[19]

**Rephrase**: Describe the subject using different adjectives or synonyms

work page
[20]

**Reorder**: Move the missing or distorted elements mentioned in the ‘diagnosis‘ to the **very beginning** of the prompt

work page
[21]

garbled/messy

**Simplify vs. Enrich**: - If diagnosis says "garbled/messy" -> **Simplify** details, focus on main subject. - If diagnosis says "boring/wrong style" -> **Enrich** with style modifiers (e.g., "cinematic lighting", "concept art"). 17 0.02 0.2 0.5 1.0 α 76.0 76.5 77.0 77.5 78.0 78.5Reasoning SFT Terminal-retention weight 0.01 0.02 0.03 0.04 β 75.5 76.0 76.5...

work page
[22]

Add [obj]

**Format**: Use "Add [obj]", "Remove [obj]", "Change [obj] to [obj]", or "Make [obj] [action]"

work page
[23]

it/him/her

**Specificity**: Explicitly state the target. AVOID "it/him/her". Use "the panda", "the red car"

work page
[24]

Do not fix everything at once

**Single Focus**: Target **ONLY ONE** specific area mentioned in the ‘diagnosis‘. Do not fix everything at once

work page
[25]

Use REGENERATE for that

**Avoid Anatomy**: Do NOT try to fix eyes/gaze via I2I. Use REGENERATE for that

work page
[26]

Make the tall warrior wear a red cape

**Example**: "Make the tall warrior wear a red cape." (NOT "Change clothes"). **Task 3: Output** Only provide a valid JSON object. ‘‘‘json { "decision": "STOP" | "REGENERATE" | "REFINE", "reasoning": "Explain why you chose this action based on the diagnosis.", "revised_prompt": "String OR null. If STOP, null. If REGENERATE, the full new T2I prompt. If REF...

work page 2000

[1] [1]

Pilot study details.Appendix B expands the setup behind the introductory pilot study, including the three-turn workflow and the fixed-action versus preference-reference compari- son

work page

[2] [2]

This section supports the training-data claims in Section 4.1 and clarifies how action-trajectory supervision differs from ordinary prompt-image data

Data pipeline and distribution.Appendix C describes the construction of the 103K struc- tured trajectories, including prompt-pool scoring, targeted prompt augmentation, branch- and-select exploration, and trajectory filtering. This section supports the training-data claims in Section 4.1 and clarifies how action-trajectory supervision differs from ordinar...

work page

[3] [3]

Agent prompts.Appendix D gives the complete reviewer and navigator prompt templates

work page

[4] [4]

6, the turn budget Tmax, and representative reward-scale examples

Hyperparameter studies.Appendix E analyzes the reward weights α and β in Eq. 6, the turn budget Tmax, and representative reward-scale examples. These studies explain why the default PRE-GRPO configuration balances candidate discovery, terminal retention, and turn efficiency

work page

[5] [5]

This isolates the net contribution of state- conditioned action making from gains caused by extra sampling budget

Controlled sampling-budget comparison.Appendix F compares one-shot generation, best-of-3 selection, prompt enhancement, fixed-workflow agents, and trained action-policy variants under a single T2I-ReasonBench view. This isolates the net contribution of state- conditioned action making from gains caused by extra sampling budget

work page

[6] [6]

Best-score versus final-score selection.Appendix G ablates the interaction between trajectory-level reward choice and inference-time output selection, explaining why Genera- tion Navigator returns the highest-scored candidate along the trajectory

work page

[7] [7]

Average generation turns.Appendix H reports the average number of generation turns on GenEval as an indirect signal of action calibration on simple compositional prompts

work page

[8] [8]

Quality–latency trade-off.Appendix I studies how the maximum turn budget affects accuracy and latency for the no-CoT Navigator

work page

[9] [9]

WISE latency analysis.Appendix J compares the quality–latency trade-off on WISE against generator-only and agent baselines

work page

[10] [10]

The examples expose the actual prompts, reviewer feedback, actions, and selected images behind the aggregate results

Qualitative visualizations.Appendix K presents representative multi-turn cases across textual image design, counting, spatial relations, scientific reasoning, and two-object gener- ation. The examples expose the actual prompts, reviewer feedback, actions, and selected images behind the aggregate results

work page

[11] [11]

Human evaluation.Appendix L reports a pairwise human study comparing human prefer- ences with reviewer-induced preferences, providing an empirical check that the automatic reviewer is a useful signal for PRE-GRPO trajectory optimization

work page

[12] [12]

B Pilot Study Details Pilot study details (Section 1).We construct a three-turn workflow on T2I-ReasonBench to compare action strategies

Limitations and future directions.Appendix M discusses computational trade-offs, and the use of reviewers as environment signals. B Pilot Study Details Pilot study details (Section 1).We construct a three-turn workflow on T2I-ReasonBench to compare action strategies. In the first turn, each prompt is rewritten by Doubao-Seed1.5 and then fed to FLUX.2-Klei...

work page

[13] [13]

This captures paraphrase-level semantic overlap

Embedding cosine similarity.We encode each benchmark prompt and each training prompt with the sentence-transformers/all-MiniLM-L6-v2 encoder [Reimers and Gurevych, 2019, Wang et al., 2020] and report the maximum cosine similarity between each benchmark prompt and its nearest neighbor in the training pool. This captures paraphrase-level semantic overlap. 2...

work page 2019

[14] [14]

5-gram containment.Measures the fraction of a benchmark prompt’s 5-grams that appear in its nearest training prompt, capturing asymmetric substring inclusion

work page

[15] [15]

8-gram containment.Following the contamination protocol used in PaLM [Chowdhery et al., 2023], we flag a benchmark sample as potentially contaminated if ≥70% of its 8-grams are contained in any training prompt

work page 2023

[16] [16]

13-gram collision.Following GPT-3 [Brown et al., 2020], we check for exact 13-gram matches between benchmark and training prompts. 0.0 0.2 0.4 0.6 0.8 1.0 max similarity to prompt pool 0% 20% 40% 60% 80% 100%cumulative percent Embedding cosine (ECDF) benchmark GenEval T2I-ReasonBench WISE 0.0 0.2 0.4 0.6 0.8 1.0 max similarity to prompt pool 0 20 40 60 80...

work page 2020

[17] [17]

Keep the thinking process concise (within 512 tokens)

Analyze the **Current Image** strictly against the **User Request**. Keep the thinking process concise (within 512 tokens)

work page

[18] [18]

holding a red cup

Provide a detailed diagnosis of flaws. **Inputs**: - **User Request**: {user_request} - **Current Image**: (Visual Input) **Evaluation Criteria**: **1. Aesthetic Quality**: **Aesthetic & Technical Quality Scoring Rules (0.0-5.0)**: Evaluate the overall aesthetic appeal of the image and provide a score: Assess the image for technical flaws (artifacts, anat...

work page

[19] [19]

**Rephrase**: Describe the subject using different adjectives or synonyms

work page

[20] [20]

**Reorder**: Move the missing or distorted elements mentioned in the ‘diagnosis‘ to the **very beginning** of the prompt

work page

[21] [21]

garbled/messy

**Simplify vs. Enrich**: - If diagnosis says "garbled/messy" -> **Simplify** details, focus on main subject. - If diagnosis says "boring/wrong style" -> **Enrich** with style modifiers (e.g., "cinematic lighting", "concept art"). 17 0.02 0.2 0.5 1.0 α 76.0 76.5 77.0 77.5 78.0 78.5Reasoning SFT Terminal-retention weight 0.01 0.02 0.03 0.04 β 75.5 76.0 76.5...

work page

[22] [22]

Add [obj]

**Format**: Use "Add [obj]", "Remove [obj]", "Change [obj] to [obj]", or "Make [obj] [action]"

work page

[23] [23]

it/him/her

**Specificity**: Explicitly state the target. AVOID "it/him/her". Use "the panda", "the red car"

work page

[24] [24]

Do not fix everything at once

**Single Focus**: Target **ONLY ONE** specific area mentioned in the ‘diagnosis‘. Do not fix everything at once

work page

[25] [25]

Use REGENERATE for that

**Avoid Anatomy**: Do NOT try to fix eyes/gaze via I2I. Use REGENERATE for that

work page

[26] [26]

Make the tall warrior wear a red cape

**Example**: "Make the tall warrior wear a red cape." (NOT "Change clothes"). **Task 3: Output** Only provide a valid JSON object. ‘‘‘json { "decision": "STOP" | "REGENERATE" | "REFINE", "reasoning": "Explain why you chose this action based on the diagnosis.", "revised_prompt": "String OR null. If STOP, null. If REGENERATE, the full new T2I prompt. If REF...

work page 2000