Precise: SDE-Consistent Stochastic Sampling for RL Post-Training of Flow-Matching Models

Jade Zou; Jiangfeng Xiong; Jianwei Zhang; Junzhe Li; Liefeng Bo; Qi Tian; Tao Huang; Weijie Kong; Yue Wu; Zhao Zhong

arXiv: 2605.23522 · v1 · pith:LZVDKUWVnew · submitted 2026-05-22 · 💻 cs.LG · cs.AI · cs.CV


Pith reviewed 2026-05-25 05:03 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.CV

keywords: flow matching, stochastic sampling, reinforcement learning, SDE discretization, post-training, generative models, diffusion models


The pith

Precise maintains SDE consistency in stochastic sampling for flow-matching models by freezing the clean-latent posterior mean, enabling faster RL post-training.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that converting the deterministic ODE trajectory of flow-matching models into an SDE for RL requires careful choice of noise schedule and faithful discretization to avoid excess noise or heuristic failures. Standard samplers deviate from the intended process at low step counts, which harms reward optimization stability. Precise introduces an approximation that freezes the clean-latent posterior mean to enforce SDE consistency throughout the trajectory. This produces more stable exploration and leads to faster convergence in RL post-training while reaching higher alignment metrics.

Core claim

Precise is a stochastic sampler that balances effective exploration with stability through an SDE schedule and keeps the denoising trajectory consistent with the underlying flow-matching SDE via a novel approximation that freezes the clean-latent posterior mean, which resolves the excess discretization noise present in existing samplers.

What carries the argument

The SDE-consistent stochastic sampler using frozen clean-latent posterior mean approximation, which prevents deviation from the flow-matching process during discretization at small step counts.
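
The transition quoted from Theorem 4.3 in the Lean-theorems section of this page can be sketched numerically. What follows is a minimal NumPy illustration, not the authors' implementation: the function name `frozen_mean_step`, the scalar argument `A` standing in for A(t′, t), and the point-mass toy data are assumptions made here, and the paper's actual schedule for A is not given on this page. With point-mass data the posterior mean is known exactly, so a single step should reproduce the marginal mean (1 − t′)μ and standard deviation t′ whatever the step size.

```python
import numpy as np

def frozen_mean_step(z_t, z0_hat, t, t_next, A, rng):
    """One stochastic step t -> t_next with z0_hat held fixed over the step.

    Implements the transition quoted from Theorem 4.3 on this page:
      z_{t'} = (1-t') z0_hat
               + (t'/t) exp(-A/2) (z_t - (1-t) z0_hat)
               + t' sqrt(1 - exp(-A)) w,   w ~ N(0, I)
    A >= 0 is a hypothetical scalar standing in for A(t', t).
    """
    w = rng.standard_normal(z_t.shape)
    residual = z_t - (1.0 - t) * z0_hat      # equals t times the implied noise
    return ((1.0 - t_next) * z0_hat
            + (t_next / t) * np.exp(-0.5 * A) * residual
            + t_next * np.sqrt(1.0 - np.exp(-A)) * w)

# Toy check (assumption for illustration): all data sits at the point mu,
# so the clean-latent posterior mean is exactly mu at every time.
rng = np.random.default_rng(0)
mu, t, t_next, A = 2.0, 0.8, 0.4, 0.7
n = 200_000
z_t = (1.0 - t) * mu + t * rng.standard_normal(n)   # marginal at time t
z_next = frozen_mean_step(z_t, np.full(n, mu), t, t_next, A, rng)

sample_mean = z_next.mean()   # should be close to (1 - t_next) * mu
sample_std = z_next.std()     # should be close to t_next
print(sample_mean, sample_std)
```

The carried-over noise and the fresh noise have squared weights exp(−A) and 1 − exp(−A), which sum to one; that is why the marginal at t′ is preserved in this toy for any value of `A`.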

If this is right

RL post-training converges faster and with greater stability when the sampler maintains SDE consistency.
Alignment metrics such as PickScore and HPSv2.1 reach state-of-the-art levels under the new sampler.
Wall-clock training time drops by 13.1 to 53.2 percent while matching or exceeding prior best in-domain performance.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same freezing approximation might reduce discretization artifacts in non-RL sampling tasks for flow-matching or diffusion models.
The approach could be tested on video or 3D generation pipelines where step count is similarly constrained.
If the posterior-mean freeze generalizes, it may simplify SDE schedule design across different generative architectures.

Load-bearing premise

The discretization deviation observed in the toy example is the dominant reason for instability and slow convergence when standard samplers are used on full-scale flow-matching models in RL.

What would settle it

Measure the actual discretization error magnitude of standard samplers versus Precise at the exact step counts, noise levels, and latent resolutions used in the paper's RL experiments to see whether the error gap accounts for the reported stability and speed differences.
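
One way to see why a naive discretization can add excess noise is a generic Ornstein-Uhlenbeck toy, sketched below. This is an illustration under stated assumptions rather than the paper's experiment: it tracks the variance of a unit-variance noise coordinate u under du = −½u dτ + dW, whose exact step has the same exp(−A/2) and sqrt(1 − exp(−A)) coefficients as the transition quoted from Theorem 4.3 on this page, and compares it with an Euler-Maruyama step of the same size. The function names and the total budget `total_A` are hypothetical.

```python
import math

def exact_step_var(v, A):
    # Exact OU transition: u' = exp(-A/2) u + sqrt(1 - exp(-A)) w
    return math.exp(-A) * v + (1.0 - math.exp(-A))

def euler_maruyama_step_var(v, A):
    # Euler-Maruyama transition: u' = (1 - A/2) u + sqrt(A) w
    return (1.0 - 0.5 * A) ** 2 * v + A

def variance_after(step_var, total_A, n_steps, v0=1.0):
    """Propagate the variance through n_steps equal steps of size total_A / n_steps."""
    v = v0
    for _ in range(n_steps):
        v = step_var(v, total_A / n_steps)
    return v

total_A = 4.0  # hypothetical total amount of injected stochasticity
for n_steps in (5, 10, 20, 80):
    v_exact = variance_after(exact_step_var, total_A, n_steps)
    v_em = variance_after(euler_maruyama_step_var, total_A, n_steps)
    print(n_steps, v_exact, v_em)
```

Under these assumptions the exact step keeps the variance at 1 for any step count, while a single Euler-Maruyama step from unit variance yields 1 + A²/4, and the inflation shrinks only as the step count grows. Whether a gap of this kind accounts for the reported RL results at scale is the open question raised above.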

Figures

Figures reproduced from arXiv: 2605.23522 by Jade Zou, Jiangfeng Xiong, Jianwei Zhang, Junzhe Li, Liefeng Bo, Qi Tian, Tao Huang, Weijie Kong, Yue Wu, Zhao Zhong.

**Figure 2.** SD3.5-M clean-latent stability under forward-style renoising on adjacent logSNR-grid pairs; the band ...

**Figure 3.** Training trajectories under the two main protocols. Higher is better.

**Figure 4.** FLUX.2 Klein training trajectories under the 20-NFE protocol. Higher is better.

**Figure 5.** Ablations on NFE and exploration strength. Left: NFE ablation on PickScore. Right: ...

**Figure 6.** Qualitative examples under the 10-NFE protocol.

**Figure 7.** Qualitative examples under the 30-NFE protocol.

**Figure 8.** CPS outer-ring mass on the equal-mass double-ring distribution. The target value is ...

Original abstract

Reinforcement learning (RL) has become an effective way to improve prompt alignment and perceptual quality in diffusion and flow-matching generators. A critical step for applying online RL to flow matching is turning the deterministic sampling trajectory into a stochastic policy, typically by replacing the reverse-time Ordinary Differential Equation (ODE) with a Stochastic Differential Equation (SDE). The stochastic sampler, controlling the exploration behavior and denoising dynamics, is thus part of the policy, and its design can significantly affect the reward optimization performance. We break down the sampler design into two interdependent components: choosing the right amount of stochastic exploration, and discretizing the resulting SDE faithfully at the small step counts used in RL. To address the first component, we analyze the inherent tension between exploration and stability in denoising and derive an SDE schedule that balances the two. Turning to the discretization challenge, we use a toy example to show that existing samplers can deviate from the flow-matching process, either by introducing excessive discretization noise or by relying on heuristic rules that do not guarantee convergence to the data distribution. To address these issues, we propose Precise, a new stochastic sampler that balances effective exploration with stability. Crucially, Precise keeps the denoising trajectory SDE-consistent through a novel approximation that freezes the clean-latent posterior mean, resolving the excess noise issue in standard samplers. Extensive experiments demonstrate that this formulation leads to significantly faster and more stable reward optimization via reinforcement learning, achieving state-of-the-art alignment scores (e.g., PickScore, HPSv2.1) while requiring 13.1-53.2% less wall-clock training time to match the best in-domain performance of prior samplers.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Precise adds a freezing approximation for the clean-latent posterior mean to enforce SDE consistency in RL sampling for flow-matching models, but the toy-example diagnosis is the only concrete support shown and its relevance at scale remains untested.


The core claim is that freezing the clean-latent posterior mean during discretization keeps the sampler on the intended SDE trajectory and removes the excess noise that standard methods introduce. This is presented as the fix that lets RL reward optimization run faster and more stably on flow-matching generators. The paper also derives an SDE schedule by balancing exploration against stability in the denoising process. Both pieces are practical and address a real operational tension when turning deterministic flow-matching trajectories into stochastic policies for online RL.

The toy example that diagnoses discretization drift in prior samplers is a clear way to illustrate the problem before scaling up. That kind of diagnostic step is useful even if the numbers are small. The experiments are said to show 13.1-53.2% less wall-clock time to reach prior best alignment scores on PickScore and HPSv2.1, which would matter for anyone running these loops at production scale.

The soft spot is the leap from the toy case to full-resolution models. The abstract gives no indication that the observed discretization error is the dominant bottleneck once step counts and image sizes reach the values used in the RL runs; other sources of variance, such as reward-model noise or policy-gradient variance, could easily matter more. Without ablations that isolate the freezing step, or direct checks that the sampled trajectories actually stay SDE-consistent at scale, the performance gains are hard to attribute. The derivation of the SDE schedule is also described only at a high level.

This work is aimed at groups already doing RL post-training on diffusion or flow models and hitting sampler-related instability. A reader who needs a drop-in stochastic policy for that setting could extract the schedule and the freezing rule and test them directly.

The paper deserves a serious referee because the engineering problem is concrete and the proposed approximation is specific enough to be checked against existing SDE discretizations. Referees can ask for the missing verification steps and ablations without the work being dismissed outright.

Referee Report

3 major / 2 minor

Summary. The paper claims that existing stochastic samplers for flow-matching models in RL post-training introduce excess discretization noise or rely on non-convergent heuristics. It derives an SDE schedule balancing exploration and stability, diagnoses the issue via a toy example, and introduces Precise, which maintains SDE-consistency via a novel approximation that freezes the clean-latent posterior mean. This is asserted to yield significantly faster and more stable reward optimization, SOTA alignment scores (PickScore, HPSv2.1), and 13.1-53.2% less wall-clock training time.

Significance. If the central empirical claims hold and the toy-example diagnosis generalizes, the work would supply a principled, SDE-consistent stochastic policy for RL fine-tuning of flow-matching generators. The explicit separation of schedule design from discretization fidelity, together with the posterior-mean freezing approximation, could become a reusable component for stable online RL in this model class.

major comments (3)

[Toy example / §3] Toy-example diagnosis (abstract and §3): the claim that discretization deviation is the primary bottleneck is supported only by the toy case; no evidence is given that this error dominates over SDE-schedule choice, reward-model variance, or policy-gradient noise once step counts and resolutions reach those used in the reported RL experiments.
[Experiments / §5] Experimental validation (abstract and §5): the abstract asserts faster, more stable optimization and SOTA scores but supplies no quantitative results, error bars, ablation tables, or direct verification that the claimed SDE consistency is achieved or that the approximation converges to the data distribution at the operating point.
[Method / §4] SDE-consistency claim (abstract and §4): the novel approximation is described as restoring consistency by freezing the clean-latent posterior mean, yet no equation, convergence proof, or numerical check is referenced showing that the resulting trajectory satisfies the target SDE at the small step counts used in RL.

minor comments (2)

Notation for the derived SDE schedule should be introduced with an explicit equation number rather than described only in prose.
The abstract would be clearer if it cited the specific section containing the toy-example figures and the RL-experiment tables.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the thoughtful and constructive comments. We respond to each major comment below and indicate planned revisions to the manuscript.


Referee: [Toy example / §3] Toy-example diagnosis (abstract and §3): the claim that discretization deviation is the primary bottleneck is supported only by the toy case; no evidence is given that this error dominates over SDE-schedule choice, reward-model variance, or policy-gradient noise once step counts and resolutions reach those used in the reported RL experiments.

Authors: The toy example in §3 is designed to isolate and diagnose the discretization deviation mechanism under conditions where the exact SDE solution is known. We agree that it does not quantify the relative magnitude of this error against other sources in the full RL setting. The performance differences reported in §5 provide supporting evidence that addressing SDE consistency improves outcomes, but we will add an explicit discussion in the revised §3 addressing the potential interplay with reward-model variance and policy-gradient noise at the step counts used in our experiments.

Revision: yes

Referee: [Experiments / §5] Experimental validation (abstract and §5): the abstract asserts faster, more stable optimization and SOTA scores but supplies no quantitative results, error bars, ablation tables, or direct verification that the claimed SDE consistency is achieved or that the approximation converges to the data distribution at the operating point.

Authors: Section 5 contains the quantitative results (PickScore, HPSv2.1, and the reported 13.1-53.2% wall-clock reductions) along with comparisons to prior samplers. We will revise the manuscript to include error bars on the main metrics, add ablation tables isolating the contribution of the posterior-mean freezing step, and insert a numerical verification that the sampled trajectories remain consistent with the target SDE at the operating step counts.

Revision: yes

Referee: [Method / §4] SDE-consistency claim (abstract and §4): the novel approximation is described as restoring consistency by freezing the clean-latent posterior mean, yet no equation, convergence proof, or numerical check is referenced showing that the resulting trajectory satisfies the target SDE at the small step counts used in RL.

Authors: Section 4 presents the conceptual description of the approximation. In the revision we will add the explicit update equation and a numerical check measuring trajectory deviation from the target SDE at the step counts employed in the RL experiments. A formal convergence proof lies outside the scope of the present work, which builds on existing flow-matching theory; the added numerical check will serve as empirical support.

Revision: partial

Circularity Check

0 steps flagged

No significant circularity; derivation chain is self-contained


The paper's core steps consist of an analysis of exploration-stability tension to derive an SDE schedule, followed by a toy example demonstrating discretization deviations in prior samplers, and the introduction of a novel approximation (freezing the clean-latent posterior mean) to enforce SDE-consistency. None of these reduce by construction to fitted inputs, self-definitions, or self-citation chains; the claims rest on independent analytical reasoning and empirical results rather than tautological equivalences. No load-bearing equations or parameters are shown to be renamed predictions or imported uniqueness theorems from the authors' prior work.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Based solely on the abstract, the central claim rests on the domain assumption that replacing the ODE with an SDE produces a valid policy for RL, plus the modeling choice that freezing the posterior mean is a faithful approximation rather than an ad-hoc fix. No free parameters or invented entities are explicitly named.

axioms (1)

Domain assumption: The reverse-time ODE of flow matching can be replaced by an SDE to create a stochastic policy without changing the marginal data distribution at convergence.
Invoked when the abstract states that turning the deterministic trajectory into a stochastic policy is a critical step for online RL.



Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear

Theorem 4.3 (Exact transition under frozen posterior mean). ... $z_{t'} = (1-t')\,\hat{z}_0(t) + \frac{t'}{t}\, e^{-A(t',t)/2}\,\bigl(z_t - (1-t)\,\hat{z}_0(t)\bigr) + t'\sqrt{1 - e^{-A(t',t)}}\; w$

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction
Relation between the paper passage and the cited Recognition theorem: unclear

Lemma 4.1 (First-order logSNR decomposition) ... $\Delta\lambda_{\mathrm{vel}} = \frac{2\,\Delta t}{t(1-t)} + o(\Delta t)$
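
The first-order expression quoted from Lemma 4.1 can be checked numerically under one assumption that this page does not state explicitly: that the interpolation is z_t = (1 − t) z_0 + t ε, so the logSNR is λ(t) = 2 log((1 − t)/t). Reading Δλ_vel as the logSNR change from moving the time index from t to t − Δt is also an assumption made here. The helper names below are hypothetical.

```python
import math

def log_snr(t):
    # Assumes z_t = (1 - t) z_0 + t eps: signal scale (1 - t), noise scale t.
    return 2.0 * math.log((1.0 - t) / t)

def first_order_delta(t, dt):
    # Leading term quoted from Lemma 4.1: 2 * dt / (t * (1 - t))
    return 2.0 * dt / (t * (1.0 - t))

t = 0.7
for dt in (1e-1, 1e-2, 1e-3):
    exact = log_snr(t - dt) - log_snr(t)     # true logSNR gain of the step
    approx = first_order_delta(t, dt)        # first-order prediction
    # The remainder divided by dt should shrink with dt if it is o(dt).
    print(dt, exact, approx, abs(exact - approx) / dt)
```

The printed ratio in the last column decreasing with `dt` is what an o(Δt) remainder looks like in practice; at the coarse steps typical of low-NFE sampling the first-order term alone is visibly less accurate.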

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

23 extracted references · 23 canonical work pages · 15 internal anchors

[1] Kevin Black, Michael Janner, Yilun Du, Ilya Kostrikov, and Sergey Levine. Training diffusion models with reinforcement learning. arXiv preprint arXiv:2305.13301.
[2] Siyu Cao, Hangting Chen, Peng Chen, Yiji Cheng, Yutao Cui, Xinchi Deng, Ying Dong, Kipper Gong, Tianpeng Gu, Xiusen Gu, et al. HunyuanImage 3.0 technical report. arXiv preprint arXiv:2509.23951.
[3] Zheng Ding and Weirui Ye. TreeGRPO: Tree-advantage GRPO for online RL post-training of diffusion models. arXiv preprint arXiv:2512.08153.
[4] Yu Gao, Haoyuan Guo, Tuyen Hoang, Weilin Huang, Lu Jiang, Fangyuan Kong, Huixia Li, Jiashi Li, Liang Li, Xiaojie Li, et al. Seedance 1.0: Exploring the boundaries of video generation models. arXiv preprint arXiv:2506.09113.
[5] Jack Hessel, Ari Holtzman, Maxwell Forbes, Ronan Le Bras, and Yejin Choi. CLIPScore: A reference-free evaluation metric for image captioning. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp. 7514–7528.
[6] Weijie Kong, Qi Tian, Zijian Zhang, Rox Min, Zuozhuo Dai, Jin Zhou, Jiangfeng Xiong, Xin Li, Bo Wu, Jianwei Zhang, et al. HunyuanVideo: A systematic framework for large video generative models. arXiv preprint arXiv:2412.03603.
[7] Junzhe Li, Yutao Cui, Tao Huang, Yinping Ma, Chun Fan, Yiming Cheng, Miles Yang, Zhao Zhong, and Liefeng Bo. MixGRPO: Unlocking flow-based GRPO efficiency with mixed ODE-SDE. arXiv preprint arXiv:2507.21802, 2025a. Yuming Li, Yikai Wang, Yuying Zhu, Zhongyu Zhao, Ming Lu, Qi She, and Shanghang Zhang. Branchgrpo: Stable and efficient grpo with structured br...
[8] Jie Liu, Gongye Liu, Jiajun Liang, Yangguang Li, Jiaheng Liu, Xintao Wang, Pengfei Wan, Di Zhang, and Wanli Ouyang. Flow-GRPO: Training flow matching models via online RL. arXiv preprint arXiv:2505.05470.
[9] Xingchao Liu, Chengyue Gong, and Qiang Liu. Flow straight and fast: Learning to generate and transfer data with rectified flow. arXiv preprint arXiv:2209.03003.
[10] Simian Luo, Yiqin Tan, Longbo Huang, Jian Li, and Hang Zhao. Latent consistency models: Synthesizing high-resolution images with few-step inference. arXiv preprint arXiv:2310.04378.
[11] Tim Salimans and Jonathan Ho. Progressive distillation for fast sampling of diffusion models. arXiv preprint arXiv:2202.00512.
[12] Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300.
[13] Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502, 2020a. Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. arXiv preprint arXiv:2011.13456, 2020b. Yang Song, Prafulla ...
[14] Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, Jianyuan Zeng, Jiayu Wang, Jingfeng Zhang, Jingren Zhou, Jinkai Wang, Jixuan Chen, Kai Zhu, Kang Zhao, Keyu Yan, Lianghua Huang, Mengyang Feng, Ningyi Zhang, Pandeng Li, Pingyu Wu, Ruihang Chu, Ruili Feng, Shiwei Zhang, Siyang Sun, Tao Fang, T... Wan: Open and advanced large-scale video generative models.
[15] Feng Wang and Zihao Yu. Coefficients-preserving sampling for reinforcement learning with flow matching. arXiv preprint arXiv:2509.05952.
[16] Jing Wang, Jiajun Liang, Jie Liu, Henglin Liu, Gongye Liu, Jun Zheng, Wanyuan Pang, Ao Ma, Zhenyu Xie, Xintao Wang, et al. GRPO-Guard: Mitigating implicit over-optimization in flow matching via regulated clipping. arXiv preprint arXiv:2510.22319, 2025a. Yibin Wang, Yuhang Zang, Hao Li, Cheng Jin, and Jiaqi Wang. Unified reward model for multimodal understa...
[17] Xiaoshi Wu, Yiming Hao, Keqiang Sun, Yixiong Chen, Feng Zhu, Rui Zhao, and Hongsheng Li. Human preference score v2: A solid benchmark for evaluating human preferences of text-to-image synthesis. arXiv preprint arXiv:2306.09341.
[18] Zeyue Xue, Jie Wu, Yu Gao, Fangyuan Kong, Lingting Zhu, Mengzhao Chen, Zhiheng Liu, Wei Liu, Qiushan Guo, Weilin Huang, et al. DanceGRPO: Unleashing GRPO on visual generation. arXiv preprint arXiv:2505.07818.
[19] Qinsheng Zhang and Yongxin Chen. Fast sampling of diffusion models with exponential integrator. arXiv preprint arXiv:2204.13902.
[20] Kaiwen Zheng, Huayu Chen, Haotian Ye, Haoxiang Wang, Qinsheng Zhang, Kai Jiang, Hang Su, Stefano Ermon, Jun Zhu, and Ming-Yu Liu. DiffusionNFT: Online diffusion reinforcement with forward process. arXiv preprint arXiv:2509.16117.
[21] and CPS (Wang & Yu, 2025), and η=1.5 for PRECISE, matching the main experimental protocol. As shown in Figure 1, CPS samples remain biased toward the inner ring even at N=80. Figure 8 isolates the large-N regime by tracking the CPS outer-ring mass as the NFE increases to N=1280. The target outer-ring mass is 0.5, but the curve does not approach that value ...
[22] Stability AI Community License FLUX.2 Klein 4B Base Black Forest Labs FLUX.2 Klein 4B Base (Black Forest Labs, 2025
[23] MIT CLIPScore / CLIP CLIPScore with OpenAI CLIP (clip-vit-large-patch14) (Hessel et al., 2021; Radford et al.,
