Training-Free Reward-Guided Image Editing via Trajectory Optimal Control

Jaemin Kim; Jinho Chang; Jong Chul Ye

arxiv: 2509.25845 · v3 · submitted 2025-09-30 · 💻 cs.CV · cs.AI

Pith reviewed 2026-05-18 13:10 UTC · model grok-4.3

keywords: reward-guided editing, diffusion models, training-free, optimal control, image editing, adjoint states, trajectory optimization

The pith

Treating the reverse diffusion process as a controllable trajectory enables training-free reward-guided image editing.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to show that image editing with diffusion models can be improved by formulating it as a trajectory optimal control problem. This involves treating the reverse diffusion process as a controllable path starting from the source image and using adjoint state updates to guide it toward maximizing a reward. A sympathetic reader would care if this leads to better editing results that keep the original image's content intact while aligning with the reward objective. The experiments indicate it outperforms previous training-free methods without falling into reward hacking. If the claim holds, it offers a practical way to apply rewards in editing without extra training steps.

Core claim

We formulate the editing process as a trajectory optimal control problem where the reverse process of a diffusion model is treated as a controllable trajectory originating from the source image, and the adjoint states are iteratively updated to steer the editing process. Through extensive experiments across distinct editing tasks, our approach significantly outperforms existing inversion-based training-free guidance baselines, achieving a superior balance between reward maximization and fidelity to the source image without reward hacking.

What carries the argument

Trajectory optimal control on the reverse diffusion process, using iterative adjoint state updates to steer the editing trajectory from the source image.
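
As a rough illustration of what iterative adjoint updates on a controlled trajectory can look like, here is a minimal NumPy sketch. It is not the paper's algorithm: the linear drift `A`, the quadratic toy reward, the control weight `lam`, the step size `lr`, and the explicit-Euler discretization are all assumptions chosen so the example stays small and checkable. The state starts at a stand-in "source" vector, a backward pass computes the adjoints, and each iteration takes one gradient-ascent step on the controls.

```python
import numpy as np

# Toy linear drift standing in for a pretrained velocity field (assumption).
A = 0.5 * np.array([[0.0, -1.0], [1.0, 0.0]])

def drift(x):
    return A @ x

def drift_jac(x):
    # Jacobian of the toy drift; constant because the drift is linear.
    return A

def reward(x, target):
    # Toy terminal reward: negative squared distance to a target point.
    return -0.5 * np.sum((x - target) ** 2)

def reward_grad(x, target):
    return target - x

def rollout(x_src, controls, dt):
    """Explicit-Euler rollout of x_{k+1} = x_k + dt * (drift(x_k) + u_k)."""
    xs = [np.asarray(x_src, dtype=float)]
    for u in controls:
        xs.append(xs[-1] + dt * (drift(xs[-1]) + u))
    return np.stack(xs)

def adjoints(xs, target, dt):
    """Backward pass: a_N = grad R(x_N), a_k = (I + dt * J_k)^T a_{k+1}."""
    n = len(xs) - 1
    a = np.zeros_like(xs)
    a[n] = reward_grad(xs[n], target)
    for k in range(n - 1, -1, -1):
        a[k] = a[k + 1] + dt * drift_jac(xs[k]).T @ a[k + 1]
    return a

def optimize_controls(x_src, target, n_steps=20, lam=1.0, lr=0.5, n_iters=200):
    """Gradient ascent on J = R(x_N) - (lam / 2) * dt * sum_k ||u_k||^2.

    The exact gradient w.r.t. u_k is dt * (a_{k+1} - lam * u_k); the factor dt
    is absorbed into the step size lr.
    """
    dt = 1.0 / n_steps
    controls = np.zeros((n_steps, len(x_src)))
    for _ in range(n_iters):
        xs = rollout(x_src, controls, dt)
        a = adjoints(xs, target, dt)
        controls += lr * (a[1:] - lam * controls)
    return controls, rollout(x_src, controls, dt)

if __name__ == "__main__":
    x_src = np.array([1.0, 0.0])    # stand-in for the source image
    target = np.array([0.0, 2.0])   # point where the toy reward peaks
    free = rollout(x_src, np.zeros((20, 2)), 1.0 / 20)
    _, steered = optimize_controls(x_src, target)
    print("uncontrolled reward:", reward(free[-1], target))
    print("controlled reward:  ", reward(steered[-1], target))
```

In this toy, a larger `lam` keeps the endpoint closer to the uncontrolled one and a smaller `lam` buys more reward, which is the reward-fidelity tension the page describes. Whether the paper balances the two in this way is not something the sketch can confirm.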

If this is right

The method outperforms inversion-based training-free guidance baselines on reward maximization.
It maintains higher fidelity to the source image during edits.
It avoids reward hacking in the guided editing process.
It works across distinct editing tasks without requiring model retraining or inversion tuning.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

This control-based view of diffusion could be adapted to video editing by extending trajectory control over time.
Similar optimal control techniques might improve reward guidance in other iterative generation processes beyond images.
Combining multiple rewards through multi-objective optimal control could support more complex editing goals.

Load-bearing premise

The reverse diffusion process can be modeled as a controllable dynamical system with computable adjoint states that can be updated iteratively without instability.

What would settle it

A direct comparison on editing benchmarks where the method shows no improvement in the reward-fidelity balance or exhibits unstable outputs would disprove the effectiveness of the trajectory control approach.
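
One way to make "no improvement in the reward-fidelity balance" concrete is a Pareto-dominance check over (reward, LPIPS) pairs. The sketch below is only an illustration of that reading; the method names and numbers in the usage note are hypothetical and are not results from the paper.

```python
def dominates(a, b):
    """True if a is no worse than b on both axes and strictly better on one.

    Each entry is (reward, lpips): higher reward is better, lower LPIPS is better.
    """
    reward_a, lpips_a = a
    reward_b, lpips_b = b
    no_worse = reward_a >= reward_b and lpips_a <= lpips_b
    strictly_better = reward_a > reward_b or lpips_a < lpips_b
    return no_worse and strictly_better

def pareto_front(methods):
    """Names of methods that no other method dominates, in insertion order."""
    return [
        name
        for name, score in methods.items()
        if not any(dominates(other, score)
                   for other_name, other in methods.items()
                   if other_name != name)
    ]
```

For example, with hypothetical scores `{"a": (1.0, 0.30), "b": (0.8, 0.35), "c": (0.5, 0.10)}`, method `b` is dominated by `a`, while `a` and `c` are both on the front. A benchmark in which the proposed method is dominated by a baseline would be the kind of evidence this section describes.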

Figures

Figures reproduced from arXiv: 2509.25845 by Jaemin Kim, Jinho Chang, Jong Chul Ye.

**Figure 1.** Reward-guided image editing samples with unconditional diffusion and flow-matching models. Reward-guided edited samples across various tasks, such as (a) Human preference, (b) Style transfer, (c) Counterfactual generation, and (d) Text-guided image editing.

**Figure 2.** Methodology overview. Given a source image $x_1$, our method first generates its corresponding initial trajectory. We then progressively refine this trajectory by solving a reward-guided optimal control problem. This process steers the path into an optimized trajectory, whose endpoint is the final edited image $x_1^{u^*}$.

**Figure 3.** Qualitative comparison on (a) Human preference, (b) Style transfer, (c) Counterfactual generation, and (d) Text-guided image editing. Each image's target reward is written in yellow.

Table fragment captured with the caption (truncated in the source). Column groups: target reward (ImageReward), validation metrics (HPSv2, CLIPScore, Aesthetic), source preservation (LPIPS, CLIP-Isrc).

| Method | ImageReward [↑] | HPSv2 [↑] | CLIPScore [↑] | Aesthetic [↑] | LPIPS [↓] | CLIP-Isrc [↑] |
|---|---|---|---|---|---|---|
| None | 0.1542 | 0.2385 | 0.2887 | 6.0516 | 0.0000 | 1.0000 |
| Gradient Ascent | 1.9088 | 0.2247 | 0.2877 | 5.5775 | 0.1… | … |

**Figure 4.** (a) Reward-fidelity trade-off for different methods. (b) Selection of different initial trajec…

**Figure 5.** Qualitative ablation study on different choices of hyperparameters for the depth…

**Figure 6.** An example question from our user study survey.

Original abstract

Recent advancements in diffusion and flow-matching models have demonstrated remarkable capabilities in high-fidelity image synthesis. A prominent line of research involves reward-guided guidance, which steers the generation process during inference to align with specific objectives. However, leveraging this reward-guided approach to the task of image editing, which requires preserving the semantic content of the source image while enhancing a target reward, is largely unexplored. In this work, we introduce a novel framework for training-free, reward-guided image editing. We formulate the editing process as a trajectory optimal control problem where the reverse process of a diffusion model is treated as a controllable trajectory originating from the source image, and the adjoint states are iteratively updated to steer the editing process. Through extensive experiments across distinct editing tasks, we demonstrate that our approach significantly outperforms existing inversion-based training-free guidance baselines, achieving a superior balance between reward maximization and fidelity to the source image without reward hacking.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper frames reward-guided editing as trajectory optimal control with adjoints and reports stronger experimental tradeoffs than inversion baselines, but the derivations and stability details stay thin.

The main thing to know is that this paper treats the diffusion reverse process as a controllable trajectory starting from a source image, then uses adjoint states to iteratively steer it toward a target reward while trying to keep fidelity. The authors position the whole thing as training-free and claim it beats existing inversion-based guidance methods on the reward-fidelity balance without reward hacking.

Referee Report

2 major / 2 minor

Summary. The paper proposes a training-free reward-guided image editing framework that formulates the diffusion reverse process as a trajectory optimal control problem. The source image serves as the initial state of a controllable dynamical system, with adjoint states iteratively updated to steer the trajectory toward maximizing a target reward while preserving semantic fidelity. Experiments across multiple editing tasks claim significant outperformance over inversion-based training-free baselines, achieving a superior reward-fidelity balance without reward hacking.

Significance. If the adjoint propagation is shown to be numerically stable without hidden per-task tuning or regularization, the work would offer a principled optimal-control derivation for reward-guided editing in diffusion models. This could strengthen training-free guidance methods by replacing heuristic approaches with trajectory optimization, and the reported experimental gains suggest practical value for tasks requiring controlled semantic edits. The absence of free parameters in the core formulation would be a notable strength if verified.

major comments (2)

§3.2 (Adjoint Update Rules): The iterative adjoint state computation for steering the reverse diffusion trajectory is presented as the core mechanism, yet no explicit stability analysis, damping term, or step-size adaptation is provided to address accumulation of discretization or gradient errors over the full reverse chain. This is load-bearing for the central claim, because the skeptic's concern about instability in high-dimensional settings directly challenges whether the method can reliably avoid deviation from the source without requiring post-hoc adjustments.
§4.1 (Reward Formulation): The precise mathematical definition of the reward function and its embedding into the optimal control objective (e.g., how the terminal cost or running cost is defined) is not detailed enough to confirm that the reported balance between reward maximization and source fidelity emerges directly from the derivation rather than from implicit scaling or task-specific choices.
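
To illustrate the kind of numerical diagnostic the first comment asks for, the sketch below tracks the norm of an adjoint vector under an explicit-Euler backward recursion for a fixed, hypothetical Jacobian. It is a toy under stated assumptions (constant Jacobian, unit time horizon, explicit Euler) and says nothing about how the paper's adjoints actually behave; it only shows how a coarse discretization can make a backward recursion blow up where a finer one does not.

```python
import numpy as np

def adjoint_norms(jac, a_terminal, n_steps):
    """Norms of a_k under the backward recursion a_k = (I + dt * J^T) a_{k+1}.

    jac is a constant Jacobian (assumption), the horizon is 1, and dt = 1 / n_steps.
    Entry 0 is the norm of the terminal adjoint; the last entry is the norm of
    the adjoint after all n_steps backward updates.
    """
    dt = 1.0 / n_steps
    step = np.eye(jac.shape[0]) + dt * jac.T
    a = np.asarray(a_terminal, dtype=float)
    norms = [np.linalg.norm(a)]
    for _ in range(n_steps):
        a = step @ a
        norms.append(np.linalg.norm(a))
    return np.array(norms)

if __name__ == "__main__":
    stiff_jac = -50.0 * np.eye(2)  # hypothetical stiff, contractive dynamics
    for n in (10, 100):
        norms = adjoint_norms(stiff_jac, [1.0, 0.0], n)
        print(f"n_steps={n}: final adjoint norm = {norms[-1]:.3e}")
```

With 10 steps the per-step factor is `1 - 50 * 0.1 = -4`, so the norm grows geometrically; with 100 steps the factor is `0.5` and the norm decays. Plotting such norms against the number of steps is one plausible form for the "error norms versus number of steps" diagnostics mentioned in the rebuttal further down.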

minor comments (2)

Figure 2 caption: The visualization of adjoint trajectories would benefit from explicit annotation of the time steps at which updates occur to clarify the iterative process.
Related Work section: The discussion of prior inversion-based methods could include a direct comparison table of computational overhead to better contextualize the training-free advantage.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and describe the revisions that will be incorporated to clarify the technical details and strengthen the central claims.

Referee: §3.2 (Adjoint Update Rules): The iterative adjoint state computation for steering the reverse diffusion trajectory is presented as the core mechanism, yet no explicit stability analysis, damping term, or step-size adaptation is provided to address accumulation of discretization or gradient errors over the full reverse chain. This is load-bearing for the central claim, because the skeptic's concern about instability in high-dimensional settings directly challenges whether the method can reliably avoid deviation from the source without requiring post-hoc adjustments.

Authors: We agree that an explicit stability analysis is important for addressing potential concerns about error accumulation in high-dimensional adjoint propagation. Our current formulation derives the adjoint updates directly from the Pontryagin maximum principle applied to the diffusion dynamics, which constrains deviations through the control cost and initial-state anchoring; empirical results across editing tasks show stable trajectories without per-task tuning or added regularization. In the revision we will add a new subsection to §3.2 that supplies both a first-order error bound on discretization and gradient propagation and additional numerical diagnostics (error norms versus number of steps) confirming that no hidden adjustments are required to preserve source fidelity. Revision: yes.

Referee: §4.1 (Reward Formulation): The precise mathematical definition of the reward function and its embedding into the optimal control objective (e.g., how the terminal cost or running cost is defined) is not detailed enough to confirm that the reported balance between reward maximization and source fidelity emerges directly from the derivation rather than from implicit scaling or task-specific choices.

Authors: We thank the referee for highlighting this presentational gap. The reward enters the optimal-control objective strictly as a terminal cost R(x_T), while source fidelity is enforced by the fixed initial condition x_0 (the source image) together with the quadratic control cost that penalizes large deviations from the uncontrolled diffusion trajectory. In the revised manuscript we will expand §4.1 to state the full objective functional explicitly, including the precise definitions of the terminal reward term and the running control cost, thereby showing that the observed reward-fidelity trade-off follows directly from the derivation without auxiliary scaling factors. Revision: yes.
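
For orientation, a generic objective of the shape described in this response can be written as follows. This is a textbook-style sketch in the rebuttal's notation (x_0 as the source, x_T as the endpoint), not the paper's stated functional: the additive control u_t, the drift f, and the weight λ on the running cost are assumptions introduced here for illustration.

```latex
% Assumed generic form: terminal reward plus quadratic running control cost.
\max_{u}\; J(u) \;=\; R(x_T) \;-\; \frac{\lambda}{2}\int_0^T \lVert u_t \rVert^2 \, dt
\quad \text{s.t.} \quad \dot{x}_t = f(x_t, t) + u_t, \qquad x_0 = x_{\mathrm{src}} .

% Pontryagin conditions for this assumed form, with adjoint state a_t:
H(x, a, u, t) = a^\top \bigl( f(x, t) + u \bigr) - \frac{\lambda}{2}\lVert u \rVert^2 ,
\qquad
\dot{a}_t = -\Bigl(\frac{\partial f}{\partial x}(x_t, t)\Bigr)^{\!\top} a_t ,
\qquad
a_T = \nabla_x R(x_T) ,
\qquad
u_t^{*} = \frac{1}{\lambda}\, a_t .
```

Under these assumptions the weight λ alone sets how far the controlled trajectory may drift from the uncontrolled one, so whether the paper's trade-off is truly free of "auxiliary scaling factors" depends on how its own running cost is defined.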

Circularity Check

0 steps flagged

No circularity: optimal control derivation is independent of fitted inputs

The paper presents the editing framework as a direct application of trajectory optimal control to the diffusion reverse process, treating it as a controllable dynamical system and iteratively updating adjoint states. No quoted equations or sections reduce the central result to a self-definition, a fitted parameter renamed as prediction, or a load-bearing self-citation chain. The performance claims rest on external experimental comparisons to baselines rather than quantities forced by the method's own construction. The derivation chain is therefore self-contained against standard optimal control and diffusion literature.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the diffusion reverse process being treatable as a controllable trajectory and on the existence of stable adjoint updates; no explicit free parameters or invented entities are named in the abstract.

axioms (1)

Domain assumption: The reverse process of a pre-trained diffusion model can be modeled as a deterministic or stochastic dynamical system amenable to adjoint-based optimal control.
Invoked when the editing process is cast as a trajectory optimal control problem originating from the source image.

pith-pipeline@v0.9.0 · 5682 in / 1221 out tokens · 34271 ms · 2026-05-18T13:10:19.121273+00:00 · methodology

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

RewardFlow: Generate Images by Optimizing What You Reward
cs.CV · 2026-04 · unverdicted · novelty 7.0

RewardFlow unifies differentiable rewards including a new VQA-based one and uses a prompt-aware adaptive policy with Langevin dynamics to achieve state-of-the-art image editing and compositional generation.

Reference graph

Works this paper leans on

16 extracted references · 16 canonical work pages · cited by 1 Pith paper · 8 internal anchors

[1] Kevin Black, Michael Janner, Yilun Du, Ilya Kostrikov, and Sergey Levine. Training diffusion models with reinforcement learning. arXiv preprint arXiv:2305.13301.

[2] Daniel Geng and Andrew Owens. Motion guidance: Diffusion-based image editing with differentiable motion estimators. arXiv preprint arXiv:2401.18085.

[3] Ian J Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572.

[4] Yutong He, Naoki Murata, Chieh-Hsin Lai, Yuhta Takida, Toshimitsu Uesaka, Dongjun Kim, Wei-Hsiang Liao, Yuki Mitsufuji, J Zico Kolter, Ruslan Salakhutdinov, et al. Manifold preserving guided diffusion. arXiv preprint arXiv:2311.16424.

[5] Amir Hertz, Ron Mokady, Jay Tenenbaum, Kfir Aberman, Yael Pritch, and Daniel Cohen-Or. Prompt-to-prompt image editing with cross attention control. arXiv preprint arXiv:2208.01626.

[6] Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. Progressive growing of GANs for improved quality, stability, and variation. arXiv preprint arXiv:1710.10196.

[7] Vladimir Kulikov, Matan Kleiner, Inbar Huberman-Spiegelglas, and Tomer Michaeli. Flowedit: Inversion-free text-based editing using pre-trained flow models. arXiv preprint arXiv:2412.08629, 2024.

[8] Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow matching for generative modeling. arXiv preprint arXiv:2210.02747.

[9] Xingchao Liu, Chengyue Gong, and Qiang Liu. Flow straight and fast: Learning to generate and transfer data with rectified flow. arXiv preprint arXiv:2209.03003.

[10] Chenlin Meng, Yutong He, Yang Song, Jiaming Song, Jiajun Wu, Jun-Yan Zhu, and Stefano Ermon. SDEdit: Guided image synthesis and editing with stochastic differential equations. arXiv preprint arXiv:2108.01073.

[11] Litu Rout, Yujia Chen, Nataniel Ruiz, Abhishek Kumar, Constantine Caramanis, Sanjay Shakkottai, and Wen-Sheng Chu. Rb-modulation: Training-free personalization of diffusion models using stochastic optimal control. arXiv preprint arXiv:2405.17401.

[12] Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. In International Conference on Learning Representations, 2021a.
Jiaming Song, Qinsheng Zhang, Hongxu Yin, Morteza Mardani, Ming-Yu Liu, Jan Kautz, Yongxin Chen, and Arash Vahdat. Loss-guided diffusion models for plug-and-play controllable generation....

[13] Xiaoshi Wu, Yiming Hao, Keqiang Sun, Yixiong Chen, Feng Zhu, Rui Zhao, and Hongsheng Li. Human preference score v2: A solid benchmark for evaluating human preferences of text-to-image synthesis. arXiv preprint arXiv:2306.09341.

[14] Kaizhen Zhu, Mokai Pan, Yuexin Ma, Yanwei Fu, Jingyi Yu, Jingya Wang, and Ye Shi. Unidb: A unified diffusion bridge framework via stochastic optimal control. arXiv preprint arXiv:2502.05749.

[15] Appendix excerpt: Note that while we describe the most naive gradient ascent in Algorithm 2 and Algorithm 3, more advanced optimizers can also be utilized for more stable optimization. Empirically, we find that even a single optimization step per iteration is sufficient to achieve stable optimization while maintaining alignment with the PMP conditions. A.2 HYPERPARAMETER...

[16] Appendix excerpt (B.2 Connection between optimal control term and guided sampling): In this section, we discuss how the suggested method can be related to the guided sampling methods. In the diffusion model sampling process with the noisy sample x̂_t, DPS and many of the suggested guided sampling variations (Chung et al., 2023; Yu et al., 2...
