AdvDMD: Adversarial Reward Meets DMD For High-Quality Few-Step Generation
Pith reviewed 2026-05-07 12:15 UTC · model grok-4.3
The pith
AdvDMD repurposes the adversarially trained discriminator from DMD2 as an online-updated reward inside a single SDE-based distillation process. The reported outcome is a 4-step model that beats its 40-step SD3.5 teacher on DPG-Bench, gains for SD3 on GenEval, and a 2-step Qwen-Image model that beats TwinFlow.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Experimental results demonstrate that the 4-step AdvDMD outperforms the original 40-step model for SD3.5 on DPG-Bench, while achieving significant performance gains for SD3 on the GenEval. On Qwen-Image, our 2-step AdvDMD achieves superior performance over TwinFlow.
Load-bearing premise
The adversarially trained discriminator from DMD2 can serve as a reliable reward model when trained on both intermediate and final denoising states and updated online, providing holistic supervision without reward hacking or instability.
Original abstract
Diffusion models offer superior generation quality at the expense of extensive sampling steps. Distillation methods, with Distribution Matching Distillation (DMD) as a popular example, can mitigate this issue, but performance degradation remains pronounced when sampling steps are limited. Reinforcement learning (RL) has been leveraged to improve the few-step generation quality during distillation, with the potential to even surpass the performance of the teacher model. However, existing approaches are combinatorial in nature, merely integrating an RL process with the distillation process, which introduces unnecessary complexities. To address this gap, we propose AdvDMD, a method that seamlessly unifies DMD distillation and RL. Specifically, AdvDMD employs the adversarially trained discriminator from DMD2 as the reward model, which assigns low scores to generated images and high scores to real ones. It is trained on both intermediate and final states of the denoising process and updated online with the distilled model, enabling a holistic supervision of the sampling trajectories and mitigating reward hacking. We adopt a unified SDE backward simulation and a different training schedule for DMD and RL to enable a more stable and efficient training. Experimental results demonstrate that the 4-step AdvDMD outperforms the original 40-step model for SD3.5 on DPG-Bench, while achieving significant performance gains for SD3 on the GenEval. On Qwen-Image, our 2-step AdvDMD achieves superior performance over TwinFlow.
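To make the mechanism in the abstract concrete, here is a toy sketch. It is not the paper's implementation. It assumes a 1-D setting in which real data is N(0, 1), the "distilled model" is a single learnable mean `mu`, and the discriminator is one linear logit per noise level (a noisier intermediate state and the final state). The discriminator is updated online in the same loop, and its logit is used as the reward. The generator simply follows the reward's gradient, which stands in for the RL update the paper describes. All names (`train_toy`, `noise_levels`, the learning rates) are hypothetical.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_toy(steps=1500, batch=64, lr_d=0.1, lr_g=0.05, seed=0):
    rng = random.Random(seed)
    real_mean = 0.0            # assumed "real data" distribution: N(0, 1)
    mu = -2.0                  # generator parameter: samples are mu + N(0, 1)
    noise_levels = [1.0, 0.0]  # extra noise std: intermediate state, final state
    # One linear discriminator [w, b] per noise level; logit = w * x + b.
    disc = [[0.0, 0.0] for _ in noise_levels]

    for _step in range(steps):
        reward_grad = 0.0
        for k, s in enumerate(noise_levels):
            w, b = disc[k]
            gw = gb = 0.0
            for _i in range(batch):
                x_real = real_mean + rng.gauss(0, 1) + s * rng.gauss(0, 1)
                x_fake = mu + rng.gauss(0, 1) + s * rng.gauss(0, 1)
                p_real = sigmoid(w * x_real + b)
                p_fake = sigmoid(w * x_fake + b)
                # Gradient of log D(real) + log(1 - D(fake)) w.r.t. w and b.
                gw += (1.0 - p_real) * x_real - p_fake * x_fake
                gb += (1.0 - p_real) - p_fake
            # Online discriminator update (gradient ascent), every step.
            disc[k][0] = w + lr_d * gw / batch
            disc[k][1] = b + lr_d * gb / batch
            # Reward = discriminator logit; d(logit)/d(mu) = w for each sample.
            reward_grad += w
        # Generator ascends the reward averaged over both noise levels.
        mu += lr_g * reward_grad / len(noise_levels)
    return mu, disc
```

In this toy, `mu` moves from -2 toward the real mean because the discriminator keeps adapting as the generator moves. If `disc` were frozen while its slope was still positive, the reward would stay linear in `mu` and the generator could keep climbing it indefinitely, which is a small-scale picture of the reward hacking the abstract says online updates are meant to mitigate.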
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes AdvDMD, which unifies Distribution Matching Distillation (DMD) with reinforcement learning by repurposing the adversarially trained discriminator from DMD2 as an online reward model. The discriminator is trained on both intermediate and final denoising states and updated jointly with the distilled model to provide holistic trajectory supervision. A unified SDE backward simulation and adjusted DMD/RL training schedule are introduced for stability. The central empirical claims are that 4-step AdvDMD outperforms the original 40-step SD3.5 teacher on DPG-Bench, yields significant gains for SD3 on GenEval, and that 2-step AdvDMD surpasses TwinFlow on Qwen-Image.
Significance. If the reported gains are robustly validated, the work would be significant for efficient diffusion sampling. It provides a non-combinatorial integration of adversarial rewards into distillation, potentially allowing few-step models to exceed multi-step teachers while addressing reward hacking through online holistic supervision. This extends DMD2 in a natural direction and could influence reward-model design in generative distillation pipelines.
major comments (2)
- Experimental Results section: the central claim that 4-step AdvDMD outperforms the 40-step SD3.5 model on DPG-Bench is load-bearing for the paper's contribution, yet the manuscript supplies no error bars, run counts, statistical tests, or ablation studies isolating the effect of online discriminator updates and intermediate-state training; without these, it is impossible to rule out post-hoc selection or confounding factors in the reported superiority.
- Method section (description of discriminator as reward): the assertion that training on intermediate and final states plus online updates 'mitigates reward hacking' is central to the unification but lacks any quantitative analysis, failure-case examination, or comparison to a fixed discriminator baseline; this assumption directly supports the claim of stable holistic supervision.
minor comments (3)
- The abstract states 'significant performance gains' for SD3 on GenEval without quoting the actual metric values or baselines, reducing clarity for readers.
- Notation for the unified SDE backward simulation could be formalized with an explicit equation contrasting it to standard DMD SDE to improve reproducibility.
- Figure captions and table headers should explicitly list all compared methods and metrics to avoid ambiguity when presenting the 2-step vs. TwinFlow results.
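For context on the second minor comment, the general reverse-time SDE from score-based diffusion modelling is shown below, together with its deterministic probability-flow counterpart. This is the standard textbook form, with drift f, diffusion coefficient g, and marginal density p_t. It is not the paper's unified formulation or the exact SDE used in DMD, neither of which is reproduced on this page.

```latex
% Reverse-time SDE (time runs backward, \bar{w} is a reverse-time Wiener process)
\mathrm{d}x = \left[ f(x,t) - g(t)^{2}\,\nabla_{x}\log p_{t}(x) \right]\mathrm{d}t + g(t)\,\mathrm{d}\bar{w}

% Probability-flow ODE with the same marginals p_t
\mathrm{d}x = \left[ f(x,t) - \tfrac{1}{2}\,g(t)^{2}\,\nabla_{x}\log p_{t}(x) \right]\mathrm{d}t
```

An explicit equation of this kind in the manuscript, written in the paper's own notation, would let readers see exactly where the unified simulation departs from the baseline.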
Simulated Author's Rebuttal
We thank the referee for the thoughtful and constructive review. The feedback highlights important aspects of statistical rigor and empirical validation that will strengthen the manuscript. We address each major comment below and outline the revisions we will make.
- Referee: Experimental Results section: the central claim that 4-step AdvDMD outperforms the 40-step SD3.5 model on DPG-Bench is load-bearing for the paper's contribution, yet the manuscript supplies no error bars, run counts, statistical tests, or ablation studies isolating the effect of online discriminator updates and intermediate-state training; without these, it is impossible to rule out post-hoc selection or confounding factors in the reported superiority.
  Authors: We agree that the absence of error bars, run counts, statistical tests, and targeted ablations weakens the robustness of the central claim. In the revised manuscript, we will report all main results with error bars computed across at least three independent random seeds, explicitly state the number of runs performed, and include statistical significance tests (e.g., paired t-tests) comparing AdvDMD against the teacher model. We will also add ablation studies that separately disable online discriminator updates and intermediate-state training to isolate their individual contributions and demonstrate that the reported gains are not due to post-hoc selection or confounding factors.
  revision: yes
- Referee: Method section (description of discriminator as reward): the assertion that training on intermediate and final states plus online updates 'mitigates reward hacking' is central to the unification but lacks any quantitative analysis, failure-case examination, or comparison to a fixed discriminator baseline; this assumption directly supports the claim of stable holistic supervision.
  Authors: We acknowledge that the current manuscript provides only a qualitative description of how intermediate-state training and online updates mitigate reward hacking, without supporting quantitative evidence. In the revision, we will add a direct comparison to a fixed (non-updated) discriminator baseline, report quantitative metrics such as average reward scores and trajectory consistency over training, and include an analysis of potential failure cases (e.g., mode collapse or reward exploitation) with evidence that the holistic supervision prevents them. These additions will be placed in both the Method and Experiments sections.
  revision: yes
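The statistical reporting promised in the first response (mean and spread over seeds, plus a paired t-test against the teacher) could be computed roughly as follows. This is a minimal sketch using only the Python standard library. The score lists are placeholders invented for illustration and are not results from the paper, and the helper name `paired_t` is hypothetical.

```python
import math
import statistics


def paired_t(a, b):
    """Return (mean difference, t statistic, degrees of freedom) for paired samples."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equal-length samples with at least 2 pairs")
    diffs = [x - y for x, y in zip(a, b)]
    mean_d = statistics.mean(diffs)
    sd_d = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    if sd_d == 0:
        raise ValueError("all paired differences are identical; t is undefined")
    t = mean_d / (sd_d / math.sqrt(len(diffs)))
    return mean_d, t, len(diffs) - 1


# Placeholder scores, one per random seed. These are NOT results from the paper.
placeholder_student = [0.70, 0.72, 0.74]
placeholder_teacher = [0.69, 0.70, 0.71]

mean_d, t_stat, dof = paired_t(placeholder_student, placeholder_teacher)

# Mean and sample standard deviation, as one would report for error bars.
student_mean = statistics.mean(placeholder_student)
student_sd = statistics.stdev(placeholder_student)

# Two-sided 5% critical value of Student's t with 2 degrees of freedom.
T_CRIT_DF2 = 4.303
significant = abs(t_stat) > T_CRIT_DF2
```

With these placeholder numbers the t statistic is about 3.46, below the critical value of 4.303, so the difference would not reach the 5% level even though the student wins on every seed. Three seeds leave only two degrees of freedom, which is worth keeping in mind when weighing the promised "at least three" runs.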
Circularity Check
No significant circularity detected
The paper presents AdvDMD as a unification of DMD distillation with an adversarially trained discriminator (from DMD2) serving as an online-updated reward model on both intermediate and final denoising states, combined with a unified SDE backward simulation and adjusted training schedule. No equations, derivations, or self-definitional reductions are visible in the abstract or the described method that would make any claimed prediction equivalent to its inputs by construction. The central claims rest on experimental benchmarks (e.g., 4-step AdvDMD outperforming 40-step SD3.5) rather than on a closed loop of fitted parameters or load-bearing self-citations that reduce to unverified premises. The online adaptation of the discriminator introduces an external dependency but does not create internal circularity.