Finite Difference Flow Optimization for RL Post-Training of Text-to-Image Models

Angjoo Kanazawa; David McAllister; Janne Hellsten; Miika Aittala; Samuli Laine; Tero Karras; Timo Aila

arxiv: 2603.12893 · v2 · pith:ZEOQ7X6Bnew · submitted 2026-03-13 · cs.CV · cs.AI · cs.LG · cs.NE · stat.ML

keywords: image, models, quality, sampling, action, alignment, flow, learning

Abstract

Reinforcement learning (RL) has become a standard technique for post-training diffusion-based image synthesis models, as it enables learning from reward signals to explicitly improve desirable aspects such as image quality and prompt alignment. In this paper, we propose an online RL variant that reduces the variance in the model updates by sampling paired trajectories and pulling the flow velocity in the direction of the more favorable image. Unlike existing methods that treat each sampling step as a separate policy action, we consider the entire sampling process as a single action. We experiment with both high-quality vision language models and off-the-shelf quality metrics for rewards, and evaluate the outputs using a broad set of metrics. Our method converges faster and yields higher output quality and prompt alignment than previous approaches.
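
The paired-trajectory idea in the abstract can be illustrated with a toy sketch. This is not the paper's algorithm: the constant-velocity "flow", the quadratic `reward`, the reward-gap scaling, and every name and hyperparameter below are assumptions chosen only to make the example small and runnable. It draws two noise seeds, runs the whole sampler for each (treating the full sampling process as one action), and nudges the velocity toward whichever output scored higher.

```python
import numpy as np

def reward(x, target):
    # Hypothetical reward: higher when the sample is closer to the target.
    return -np.sum((x - target) ** 2)

def sample(theta, noise, steps=8):
    # Euler integration of a toy flow whose velocity is the constant vector theta.
    x = noise.copy()
    for _ in range(steps):
        x = x + theta / steps
    return x

def paired_update(theta, target, rng, lr=0.005):
    # Two independent noise seeds give a pair of complete sampling trajectories.
    xa = sample(theta, rng.standard_normal(theta.shape))
    xb = sample(theta, rng.standard_normal(theta.shape))
    # Finite difference of the reward between the two outputs.
    dr = reward(xa, target) - reward(xb, target)
    # Move the velocity along (xa - xb), signed and scaled by the reward gap,
    # so it is pulled toward the more favorable output (an assumed update rule).
    return theta + lr * dr * (xa - xb)

def train(dim=4, iters=3000, seed=0):
    rng = np.random.default_rng(seed)
    target = np.full(dim, 2.0)   # hypothetical "ideal image"
    theta = np.zeros(dim)        # toy velocity parameters
    for _ in range(iters):
        theta = paired_update(theta, target, rng)
    return theta, target

if __name__ == "__main__":
    theta, target = train()
    print("distance to target:", np.linalg.norm(theta - target))
```

In this toy setting the expected update is proportional to the reward gradient, because the reward difference between the pair is a finite-difference estimate of the directional derivative along `xa - xb`. A real text-to-image model would replace `theta` with network weights and `reward` with a learned or VLM-based scorer.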

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Test-Time Gradient Guidance of Flow Policies in Reinforcement Learning
cs.LG 2026-06 unverdicted novelty 7.0

QGF performs test-time policy optimization for flow models in RL by guiding a behavior-cloned reference policy with value-function gradients, achieving strong results on high-dimensional offline RL benchmarks without ...

RAVEN: Real-time Autoregressive Video Extrapolation with Consistency-model GRPO
cs.CV 2026-05 unverdicted novelty 6.0

RAVEN aligns training and inference for causal autoregressive video diffusion via interleaved rollout repacking and introduces CM-GRPO for direct RL on consistency-model kernels, claiming better quality than recent baselines.