Diff-Instruct with Diffused Reward: Towards Principled One-step Generator RL

Guang Lin; Haoyang Zheng; Junyi Wu; Ruizhe Zhang; Weijian Luo

arxiv: 2605.24001 · v2 · pith:ZTXKHUTEnew · submitted 2026-05-18 · 💻 cs.CV · cs.AI · cs.LG

Junyi Wu, Weijian Luo, Haoyang Zheng, Ruizhe Zhang, Guang Lin

Pith reviewed 2026-06-30 18:13 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI · cs.LG

keywords one-step text-to-image generation · reinforcement learning · RLHF · diffusion models · distribution alignment · reward optimization · trajectory-level alignment


The pith

DIDR propagates the RLHF-optimal reward-tilted clean-image distribution across all noise levels to align one-step generators.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Diff-Instruct with Diffused Reward (DIDR) to fix the mismatch between terminal reward optimization and generative dynamics in one-step text-to-image models. It derives a data-free framework from integral KL minimization that spreads the reward-tilted distribution from clean images through the entire diffusion trajectory. This objective shares the same minimizer as direct clean-image RLHF and produces the Diffused Reward Score as a correction to the reference score. A practical Diffused Reward Proxy estimates the score via short-step denoising. Experiments show the resulting one-step models Pareto-dominate prior baselines and can exceed a 50-step teacher on preference alignment.

Core claim

DIDR is a trajectory-level alignment method derived from integral KL minimization. It propagates the RLHF-optimal reward-tilted clean-image distribution across all noise levels along the diffusion trajectory. The objective admits the same minimizer as clean-image RLHF and naturally induces the Diffused Reward Score, which serves as a reward-driven correction to the reference score function. The Diffused Reward Proxy supplies an efficient estimator of this score through differentiable short-step denoising.
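For orientation, the quantities named in this claim can be written out in the standard KL-regularized form. This is a sketch under assumptions, not the paper's own notation: the reference clean-image density p_ref, the reward r, the temperature β, and the forward noising kernel q_t are symbols introduced here for illustration, and the paper's exact definition of the Diffused Reward Score may differ in scaling or weighting.

```latex
% Assumed notation (not taken from the paper): p_ref, r, \beta, q_t.
% Reward-tilted clean-image distribution and its diffused marginal:
p^{*}_{0}(x_0) = \frac{1}{Z}\, p_{\mathrm{ref}}(x_0)\, \exp\!\big(r(x_0)/\beta\big),
\qquad
p^{*}_{t}(x_t) = \int q_t(x_t \mid x_0)\, p^{*}_{0}(x_0)\, \mathrm{d}x_0 .

% Bayes' rule under the reference model rewrites the marginal as
p^{*}_{t}(x_t) = \frac{1}{Z}\, p_{\mathrm{ref},t}(x_t)\,
  \mathbb{E}_{p_{\mathrm{ref}}(x_0 \mid x_t)}\!\big[\exp\!\big(r(x_0)/\beta\big)\big],

% so the score at noise level t splits into a reference term and a reward term:
\nabla_{x_t} \log p^{*}_{t}(x_t)
  = \underbrace{\nabla_{x_t} \log p_{\mathrm{ref},t}(x_t)}_{\text{reference score}}
  + \underbrace{\nabla_{x_t} \log
      \mathbb{E}_{p_{\mathrm{ref}}(x_0 \mid x_t)}\!\big[\exp\!\big(r(x_0)/\beta\big)\big]}_{\text{reward-driven correction}} .
```

Read this way, the second term plays the role the paper assigns to the Diffused Reward Score, and the posterior expectation inside it is what a short-step denoising estimator would have to approximate.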

What carries the argument

Integral KL minimization objective that propagates the reward-tilted distribution across the diffusion trajectory and induces the Diffused Reward Score (DRS) as a correction term.
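A small numerical check can make the "correction term" idea tangible. The sketch below is a toy, not the paper's experiment: it assumes a 1D reference N(0, 1), a forward process x_t = alpha * x0 + sigma * eps, and a linear reward r(x0) = x0, all chosen here so that the correction has a closed form to compare against. The function names are hypothetical.

```python
import numpy as np

# Toy assumptions (not from the paper): 1D reference q0 = N(0, 1),
# forward process x_t = alpha * x0 + sigma * eps, linear reward r(x0) = x0.
GRID = np.linspace(-12.0, 12.0, 20001)  # integration grid over x0


def reward(x0):
    """Hypothetical reward used only for this toy."""
    return x0


def log_expected_tilt(x_t, alpha, sigma, beta):
    """log E[exp(r(x0)/beta) | x_t] under the reference posterior, by quadrature."""
    # Unnormalised log posterior of x0 given x_t: N(0, 1) prior times Gaussian kernel.
    log_post = -0.5 * GRID**2 - 0.5 * ((x_t - alpha * GRID) / sigma) ** 2
    log_num = log_post + reward(GRID) / beta
    # Ratio of two Riemann sums; the grid spacing cancels in the ratio.
    return np.logaddexp.reduce(log_num) - np.logaddexp.reduce(log_post)


def correction_numeric(x_t, alpha, sigma, beta, eps=1e-4):
    """Central finite difference of log_expected_tilt with respect to x_t."""
    hi = log_expected_tilt(x_t + eps, alpha, sigma, beta)
    lo = log_expected_tilt(x_t - eps, alpha, sigma, beta)
    return (hi - lo) / (2.0 * eps)


def correction_closed_form(alpha, sigma, beta):
    """Gaussian algebra for this toy: alpha / (beta * (alpha^2 + sigma^2))."""
    return alpha / (beta * (alpha**2 + sigma**2))


if __name__ == "__main__":
    # Compare the two at low, medium, and high noise.
    for alpha, sigma in [(0.99, 0.1), (0.8, 0.6), (0.3, 0.95)]:
        num = correction_numeric(0.5, alpha, sigma, beta=0.5)
        print(alpha, sigma, num, correction_closed_form(alpha, sigma, 0.5))
```

In this toy the correction is constant in x_t and shrinks as alpha decreases, which is the expected behaviour when the noisy sample carries less information about the clean one. Real rewards and image-scale models have no such closed form, which is where an estimator like the DRP would come in.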

If this is right

One-step generators achieve the identical optimal distribution as clean-image RLHF.
Optimization no longer trades image fidelity for higher reward by exploiting stochastic degrees of freedom.
The framework transfers to large DiT backbones and yields single-step models that surpass their multi-step teachers in preference alignment.
Consistent Pareto dominance holds over existing one-step SDXL baselines under the same reward signals.
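Since "Pareto dominance" carries much of the empirical claim, it may help to pin down what the term requires. The following is a minimal sketch assuming every metric is higher-is-better; the metric tuples in the usage lines are made-up placeholders, not results from the paper.

```python
def pareto_dominates(a, b):
    """Return True if score vector `a` Pareto-dominates `b`.

    Assumes all metrics are higher-is-better: `a` must be at least as good
    on every metric and strictly better on at least one.
    """
    if len(a) != len(b):
        raise ValueError("score vectors must have the same length")
    at_least_as_good = all(x >= y for x, y in zip(a, b))
    strictly_better = any(x > y for x, y in zip(a, b))
    return at_least_as_good and strictly_better


# Hypothetical (reward, fidelity) pairs, for illustration only.
model_a = (0.90, 0.75)
model_b = (0.85, 0.75)
model_c = (0.95, 0.60)

a_over_b = pareto_dominates(model_a, model_b)  # better reward, equal fidelity
a_over_c = pareto_dominates(model_a, model_c)  # trades reward for fidelity
```

Under this definition a model that gains reward while losing fidelity does not dominate, which is exactly the failure mode the paper attributes to prior one-step RL methods.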

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same integral-KL propagation principle could be applied to other stochastic generative processes such as flow-matching models.
If the diffused correction works at every noise level, the approach may generalize to reward alignment for video or 3D generation without retraining the full trajectory.
The method suggests that explicit trajectory-level objectives can replace separate distribution-matching and reward stages in future one-step RL pipelines.

Load-bearing premise

The integral KL minimization objective correctly propagates the reward-tilted distribution across the full diffusion trajectory without bias from the noise schedule or the short-step DRP estimator approximation.

What would settle it

Train a model under DIDR and directly compare its zero-noise marginal distribution against the distribution obtained by optimizing the same reward on clean images only; systematic mismatch would falsify the propagation claim.
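One way such a comparison could be set up, at least in a low-dimensional toy, is sketched below. It is an assumption-laden illustration rather than the paper's protocol: the target is approximated by importance-weighting reference samples with exp(r / beta), and the mismatch statistic (largest gap between matching quantiles) is chosen here for simplicity. The reference N(0, 1), the reward r(x) = x, and all function names are hypothetical.

```python
import numpy as np


def tilted_quantiles(ref_samples, ref_rewards, beta, qs):
    """Quantiles of the reward-tilted reference via self-normalised importance weights."""
    logw = ref_rewards / beta
    w = np.exp(logw - logw.max())  # subtract the max for numerical stability
    w /= w.sum()
    order = np.argsort(ref_samples)
    cdf = np.cumsum(w[order])  # weighted empirical CDF over sorted samples
    idx = np.minimum(np.searchsorted(cdf, qs), len(ref_samples) - 1)
    return ref_samples[order][idx]


def max_quantile_gap(model_samples, ref_samples, ref_rewards, beta, qs=None):
    """Largest absolute gap between model quantiles and tilted-reference quantiles."""
    if qs is None:
        qs = np.linspace(0.05, 0.95, 19)
    target = tilted_quantiles(ref_samples, ref_rewards, beta, qs)
    return float(np.max(np.abs(np.quantile(model_samples, qs) - target)))


# Toy setup: reference N(0, 1) with reward r(x) = x, so the tilted target is N(1/beta, 1).
rng = np.random.default_rng(0)
beta = 2.0
ref = rng.standard_normal(200_000)
aligned = rng.standard_normal(200_000) + 1.0 / beta  # stands in for a well-aligned generator
unaligned = rng.standard_normal(200_000)             # stands in for a generator that ignored the reward

gap_aligned = max_quantile_gap(aligned, ref, ref, beta)
gap_unaligned = max_quantile_gap(unaligned, ref, ref, beta)
```

For images, the quantile gap would need to be replaced by a distributional distance in a feature space, and importance weighting degrades quickly as beta shrinks, so this only illustrates the logic of the test.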

Original abstract

Recent advances in one-step text-to-image generation have enabled real-time synthesis with remarkable efficiency and quality. Previous reinforcement learning methods for one-step generators combine image-space reward optimization with diffusion noisy-space distribution matching. This paradigm brings challenges due to a mismatch between terminal reward optimization and the underlying generative dynamics. As a result, optimization tends to exploit stochastic degrees of freedom, often improving reward at the expense of image fidelity. To address this issue, we propose Diff-Instruct with Diffused Reward (DIDR), a data-free trajectory-level alignment framework derived from Integral KL minimization. DIDR propagates the RLHF-optimal reward-tilted clean-image distribution across all noise levels along the diffusion trajectory. We show that this objective admits the same minimizer as clean-image RLHF, while naturally inducing the Diffused Reward Score (DRS), which acts as a reward-driven correction to the reference score function. To make this practical, we further introduce the Diffused Reward Proxy (DRP), an efficient estimator of DRS based on differentiable short-step denoising. Extensive experiments demonstrate that DIDR consistently Pareto-dominates existing one-step SDXL baselines. Moreover, when transferred to a 6B DiT backbone (Z-Image), DIDR surpasses its 50-step teacher in preference alignment while requiring only a single generation step.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

DIDR's integral KL construction for one-step alignment is a reasonable attempt at fixing reward-dynamics mismatch, but the same-minimizer claim rests on unshown steps that the stress-test flags correctly.


The new piece is the Diffused Reward Score derived from integral KL minimization, meant to push the reward-tilted clean distribution through the entire diffusion trajectory for one-step generators. This is positioned as different from prior one-step RL that mixes image reward with noisy distribution matching. The paper reports that the objective shares the minimizer with standard clean-image RLHF and that the practical DRP estimator lets them train without data.

Experiments claim Pareto dominance over SDXL one-step baselines and, on a 6B DiT, better preference scores than the 50-step teacher in one step. Those results are the concrete part worth checking.

The soft spot is exactly the one the stress-test raises: the abstract gives no derivation showing that the integral KL is free of bias from the noise schedule or from the short-step DRP approximation. If either introduces error, the same-minimizer property does not hold and DRS becomes an approximate correction rather than an exact one. Without the full derivation or controls for schedule dependence, the central claim stays unverified.

This is for people working on real-time diffusion and RLHF alignment. The thinking engages the right problem and cites the relevant prior one-step methods. It deserves a serious referee to examine the math and the experimental controls, even if revisions are likely needed on the approximation details.

Referee Report

2 major / 2 minor

Summary. The paper proposes Diff-Instruct with Diffused Reward (DIDR), a data-free trajectory-level alignment method for one-step text-to-image generators derived from Integral KL minimization. It claims that DIDR propagates the RLHF-optimal reward-tilted clean-image distribution across all noise levels, admits the same minimizer as clean-image RLHF, induces a Diffused Reward Score (DRS) as a reward-driven correction to the reference score, and employs a practical Diffused Reward Proxy (DRP) estimator via short-step differentiable denoising. Experiments reportedly show consistent Pareto dominance over one-step SDXL baselines and, on a 6B DiT backbone, surpassing a 50-step teacher model in preference alignment with a single generation step.

Significance. If the same-minimizer property and unbiased propagation of the reward-tilted distribution are rigorously established, the work would offer a principled alternative to existing reward-optimization approaches for one-step diffusion models, potentially resolving the mismatch between terminal rewards and generative dynamics while remaining data-free. The introduction of DRS as a correction term and the DRP estimator could influence trajectory-level RL methods in diffusion models more broadly.

major comments (2)

[Abstract] Abstract: The central claim that the integral KL objective 'admits the same minimizer as clean-image RLHF' and propagates the reward-tilted distribution without bias is load-bearing for the 'principled' framing, yet the abstract (and available text) provides no derivation steps, schedule-independence proof, or error analysis for the DRP short-step approximation; this must be supplied explicitly, e.g., via an expanded §3 or appendix with the relevant integral and minimizer equivalence.
[Abstract] Abstract: The claim that DIDR "naturally induc[es] the Diffused Reward Score (DRS)" as an exact reward-driven correction rests on the integral KL construction, but no analysis is given showing that the practical DRP estimator (or any implicit fitting) preserves independence from the noise schedule; any schedule dependence would undermine the same-minimizer property and render DRS a biased correction.

minor comments (2)

[Abstract] The abstract introduces new terms (Diffused Reward Score, Diffused Reward Proxy) without immediate notational definitions or forward references to their formal definitions in the main text.
Experimental claims of Pareto dominance and surpassing the teacher model would benefit from explicit controls or ablations addressing the DRP approximation error in the main results section.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback emphasizing the need for explicit theoretical support of our central claims. We will revise the manuscript to strengthen the presentation of the derivations while preserving the core contributions.


Referee: [Abstract] Abstract: The central claim that the integral KL objective 'admits the same minimizer as clean-image RLHF' and propagates the reward-tilted distribution without bias is load-bearing for the 'principled' framing, yet the abstract (and available text) provides no derivation steps, schedule-independence proof, or error analysis for the DRP short-step approximation; this must be supplied explicitly, e.g., via an expanded §3 or appendix with the relevant integral and minimizer equivalence.

Authors: We agree that the abstract, being a concise summary, does not include full derivation steps. The integral KL minimization, same-minimizer equivalence to clean-image RLHF, and unbiased propagation of the reward-tilted distribution are derived in Section 3 of the manuscript. To address the comment directly, the revision will expand §3 with explicit step-by-step derivations of the integral objective and minimizer equivalence, add a schedule-independence proof, and include an appendix with error bounds and analysis for the DRP short-step approximation.
revision: yes

Referee: [Abstract] Abstract: The claim that DIDR "naturally induc[es] the Diffused Reward Score (DRS)" as an exact reward-driven correction rests on the integral KL construction, but no analysis is given showing that the practical DRP estimator (or any implicit fitting) preserves independence from the noise schedule; any schedule dependence would undermine the same-minimizer property and render DRS a biased correction.

Authors: The natural induction of DRS as a reward-driven correction follows from the integral KL construction detailed in Section 3. We acknowledge that explicit analysis of the DRP estimator's schedule independence is not fully elaborated. In the revision we will add a dedicated subsection analyzing the short-step differentiable denoising approximation, showing that it preserves the key independence properties (and thus the same-minimizer guarantee) under the conditions and schedules used in the experiments, while discussing any practical biases and mitigation strategies.
revision: yes

Circularity Check

1 step flagged

Same-minimizer property follows by construction from Integral KL objective

specific steps

self definitional [Abstract]
"a data-free trajectory-level alignment framework derived from Integral KL minimization. DIDR propagates the RLHF-optimal reward-tilted clean-image distribution across all noise levels along the diffusion trajectory. We show that this objective admits the same minimizer as clean-image RLHF, while naturally inducing the Diffused Reward Score (DRS)"

The Integral KL objective is defined specifically to propagate the reward-tilted clean-image distribution across noise levels; therefore the property that it admits the same minimizer as clean-image RLHF holds by the construction of the objective itself rather than as an independent result.

full rationale

The paper's central theoretical claim—that the DIDR objective admits the same minimizer as clean-image RLHF—rests on defining the objective via Integral KL minimization explicitly constructed to propagate the reward-tilted distribution across the diffusion trajectory. This equivalence is therefore a direct consequence of the objective's definitional setup rather than an independent derivation. The provided text offers no separate verification, external benchmark, or schedule-independent proof beyond the construction itself. The DRP estimator is introduced separately as a practical approximation, but the load-bearing same-minimizer assertion reduces to the input definition.
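The "by construction" reading can be spelled out in a few lines. The sketch below assumes a generic integral-KL objective with a positive weighting w(t) and a Gaussian forward kernel shared by the generator and the target; the symbols p_{θ,t}, p*_t, and w are notation introduced here, and the paper's actual objective may be parameterized differently.

```latex
% Assumed notation: p_{\theta,t} is the generator's output distribution diffused to
% noise level t, p^{*}_{t} is the reward-tilted clean distribution diffused to level t,
% and w(t) > 0 is a weighting over noise levels.
\mathcal{L}(\theta)
  = \int_{0}^{T} w(t)\,
    \mathrm{KL}\!\big(p_{\theta,t} \,\|\, p^{*}_{t}\big)\, \mathrm{d}t \;\ge\; 0 .

% (i) If p_{\theta,0} = p^{*}_{0}, both sides pass through the same forward kernel,
%     so p_{\theta,t} = p^{*}_{t} for every t and \mathcal{L}(\theta) = 0.
% (ii) If \mathcal{L}(\theta) = 0, then p_{\theta,t} = p^{*}_{t} for almost every t.
%      Convolution with a Gaussian is injective on probability distributions
%      (its characteristic function never vanishes), hence p_{\theta,0} = p^{*}_{0}.
\mathcal{L}(\theta) = 0
  \;\Longleftrightarrow\;
  p_{\theta,0} = p^{*}_{0} .
```

In this exact setting the minimizer does not depend on w(t) or on the noise schedule. That independence is only guaranteed when the generator can reach the target and the scores are exact; with a restricted generator family or an approximate score estimator, the weighting can change which solution is found, which is the gap the referee comments point at.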

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

Abstract-only review limits visibility; ledger reflects only elements explicitly named in the provided text.

axioms (1)

domain assumption: Integral KL minimization propagates the RLHF-optimal reward-tilted distribution across noise levels while preserving the clean-image minimizer.
Invoked as the derivation basis for DIDR admitting the same minimizer as clean-image RLHF.

invented entities (2)

Diffused Reward Score (DRS): no independent evidence
purpose: Reward-driven correction to the reference score function at each noise level.
New score function induced by the trajectory-level objective.

Diffused Reward Proxy (DRP): no independent evidence
purpose: Efficient estimator of DRS via differentiable short-step denoising.
Practical implementation component required to apply the framework.

pith-pipeline@v0.9.1-grok · 5776 in / 1350 out tokens · 37591 ms · 2026-06-30T18:13:01.228335+00:00 · methodology


Reference graph

Works this paper leans on

31 extracted references · 1 canonical work page

[1]

s_ref is trained for 10,000 steps via DSM on samples from q_0 (Adam, lr = 3 × 10⁻⁴, batch 2,048)

Reference model. s_ref is trained for 10,000 steps via DSM on samples from q_0 (Adam, lr = 3 × 10⁻⁴, batch 2,048). Weights are frozen afterwards
[2]

g_θ is initialised by regressing against 30-step DDIM samples from s_ref using matched noise (3,000 steps, Adam, lr = 10⁻³, batch 2,048)

Generator distillation. g_θ is initialised by regressing against 30-step DDIM samples from s_ref using matched noise (3,000 steps, Adam, lr = 10⁻³, batch 2,048). 3. Teaching Assistant initialisation. s_ψ is copied from s_ref and fine-tuned during alignment
[3]

The two methods differ only in s_target

Alignment. 6,000 outer steps; each step runs 5 TA DSM updates (Adam, lr = 3 × 10⁻⁴) followed by one generator update (Adam, lr = 10⁻⁴, batch 2,048). The two methods differ only in s_target:

Method    Reward location
DI++      endpoint x_0 only
DIDR      every noise level via DRS

Evaluation. Final generators are evaluated on n = 10,000 samples using the hard decision boundary x > 0....

2023
[4]

Text-alignment metrics

uses the LAION aesthetic predictor, which estimates visual appeal from CLIP image embeddings and does not directly measure prompt faithfulness. Text-alignment metrics. CLIPScore (Hessel et al., 2021; Radford et al., 2021) measures image–text compatibility using CLIP embedding similarity. DPG-Bench (Hu et al., 2024) evaluates semantic adherence on dense pr...

work page arXiv 2021
[5]

A tranquil dormant mountain dusted with fresh snow under a clear pale blue daytime sky, calm still air, soft diffuse light, muted cool whites and blues, serene and peaceful alpine landscape
[6]

Close-up macro of a human eye with a vivid blue iris, glittering sparkles and cosmic galaxy-like reflections, extreme detail, macro photography
[7]

Portrait of a young African woman wearing a blue and yellow headwrap and a pearl earring, classical painting style, warm studio lighting, elegant and dignified
[8]

A majestic tall sailing ship navigating through a glowing cosmic nebula, teal and cyan hues, dramatic fantasy scene, epic scale
[9]

A vivid explosion of colorful paint splashing in mid-air, rainbow hues of orange, yellow, magenta, blue and purple, high-speed photography, dark background
[10]

A tabby cat leaping through the air in a frozen action shot, photorealistic, natural daylight background
[11]

An assorted sushi platter with rolls and nigiri, glowing neon blue ring lighting, futuristic and vibrant food photography
[12]

A lush bouquet of deep red roses in a round green ceramic vase, soft natural window light, elegant floral still life
[13]

A woman’s portrait blending photorealism with blue watercolor ink splashes, artistic double-exposure style, ethereal and dreamy

Figure 5: Qualitative Comparison. SDXL comparison:
[14]

Dramatic close-up portrait of a rugged elderly man with a weathered face, white beard, dark flat cap, overcast sea in the background
[15]

A baby elephant in rain boots jumping in puddles under a rainbow, cheerful children’s book illustration
[16]

An oil painting of a polar bear mother and cub drifting on an ice floe at Arctic dusk, warm amber sky.

Z-Image comparison:
[17]

A beautiful young woman in a white linen dress, Mediterranean coastal town background, warm golden hour light, photorealistic portrait
[18]

A majestic white stallion galloping through crashing ocean waves at sunrise, cinematic photography
[19]

A young wizard girl casting a glowing spell in a floating sky library, anime key visual style.

Figure 7: Temperature Ablation
[20]

Aurora borealis over a calm reflective lake surrounded by snow-covered pine trees, winter night
[21]

Volcanic eruption with lava flows meeting snow and ice, dramatic fire and steam, epic landscape
[22]

White arctic fox standing in snow, winter landscape, photorealistic wildlife photography
[23]

Desert campsite at night with glowing orange tents and campfire, Milky Way visible overhead

Figure 8: Multi-step Comparison
[24]

Japanese ukiyo-e woodblock print of Mount Fuji with cherry blossom trees and a lone figure on a wooden bridge
[25]

Cute anthropomorphic fox in a cozy autumn café, illustrated anime art style, fallen leaves, warm tones
[26]

Aerial drone view of a tropical volcanic island with a turquoise crater lake, surrounded by ocean
[27]

Portrait of a Rajasthani woman with traditional Indian jewelry and a colorful headscarf, photorealistic

Figure 9: Failure Cases
[28]

A woman placing a tray of food into a kitchen oven
[29]

A large crowd of people standing in a queue outdoors
[30]

A round wooden dining table in a bright room
[31]

A person cutting raw meat on a kitchen counter surrounded by fresh vegetables
