
arxiv: 2605.23522 · v1 · pith:LZVDKUWVnew · submitted 2026-05-22 · 💻 cs.LG · cs.AI · cs.CV

Precise: SDE-Consistent Stochastic Sampling for RL Post-Training of Flow-Matching Models

Pith reviewed 2026-05-25 05:03 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.CV
keywords flow matching · stochastic sampling · reinforcement learning · SDE discretization · post-training · generative models · diffusion models

The pith

Precise maintains SDE consistency in stochastic sampling for flow-matching models by freezing the clean-latent posterior mean, enabling faster RL post-training.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that converting the deterministic ODE trajectory of flow-matching models into an SDE for RL requires careful choice of noise schedule and faithful discretization to avoid excess noise or heuristic failures. Standard samplers deviate from the intended process at low step counts, which harms reward optimization stability. Precise introduces an approximation that freezes the clean-latent posterior mean to enforce SDE consistency throughout the trajectory. This produces more stable exploration and leads to faster convergence in RL post-training while reaching higher alignment metrics.
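To make the ODE-to-SDE conversion concrete, here is a minimal sketch of one reverse-time Euler-Maruyama step treated as a Gaussian policy. It is a generic baseline step, not the Precise update. Several things are assumptions not taken from this page: the rectified-flow interpolation x_t = (1 - t) * x0 + t * eps with t = 1 as noise, the hypothetical `velocity` callable standing in for the learned model, and the scalar state and the name `sde_step`.

```python
import math
import random


def sde_step(x, t, dt, velocity, sigma, rng):
    """One reverse-time Euler-Maruyama step from t to t - dt (dt > 0, t > 0).

    Assumes x_t = (1 - t) * x0 + t * eps. `velocity(x, t)` is a hypothetical
    stand-in for the learned flow-matching model; `sigma` is the diffusion
    coefficient at time t. Returns the next sample and the log-density of
    that sample under the Gaussian transition, which is the quantity a
    policy-gradient method needs from the sampler.
    """
    v = velocity(x, t)
    # Marginal score recovered from the velocity under the assumed interpolation.
    score = -(x + (1.0 - t) * v) / t
    # Drift of the marginal-preserving SDE, integrated backwards in t.
    mean = x - (v - 0.5 * sigma ** 2 * score) * dt
    std = sigma * math.sqrt(dt)
    x_next = mean + std * rng.gauss(0.0, 1.0)
    log_prob = (
        -0.5 * ((x_next - mean) / std) ** 2
        - math.log(std)
        - 0.5 * math.log(2.0 * math.pi)
    )
    return x_next, log_prob
```

With `sigma` set close to zero the step approaches a plain Euler step of the ODE; larger values buy exploration at the cost of more injected noise per step, which is the tension the paragraph above describes.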

Core claim

Precise is a stochastic sampler that balances effective exploration with stability through an SDE schedule and keeps the denoising trajectory consistent with the underlying flow-matching SDE via a novel approximation that freezes the clean-latent posterior mean, which resolves the excess discretization noise present in existing samplers.

What carries the argument

An SDE-consistent stochastic sampler built on a frozen clean-latent posterior-mean approximation, which keeps the discretized trajectory from deviating from the flow-matching process at small step counts.
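For orientation, the quantity being frozen can be written in terms of the velocity field. The identities below hold under the common rectified-flow interpolation; that convention (t = 1 is noise, t = 0 is data) and the symbols are assumptions, since this page does not reproduce the paper's notation or its update rule.

```latex
\begin{align*}
% Assumed interpolation: t = 1 is pure noise, t = 0 is data.
x_t &= (1 - t)\,x_0 + t\,\epsilon, \qquad \epsilon \sim \mathcal{N}(0, I) \\
% Velocity learned by flow matching.
v(x_t, t) &= \mathbb{E}[\epsilon - x_0 \mid x_t] \\
% Clean-latent posterior mean implied by the velocity.
\hat{x}_0(x_t, t) &= \mathbb{E}[x_0 \mid x_t] = x_t - t\,v(x_t, t) \\
% Noise posterior mean implied by the velocity.
\hat{\epsilon}(x_t, t) &= \mathbb{E}[\epsilon \mid x_t] = x_t + (1 - t)\,v(x_t, t) \\
% Score of the marginal density p_t.
\nabla_x \log p_t(x_t) &= -\,\hat{\epsilon}(x_t, t) / t
\end{align*}
```

Read this way, "freezing the clean-latent posterior mean" would mean holding \hat{x}_0 fixed over a discretization step; how Precise uses the frozen value is not specified here.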

If this is right

  • RL post-training converges faster and with greater stability when the sampler maintains SDE consistency.
  • Alignment metrics such as PickScore and HPSv2.1 reach state-of-the-art levels under the new sampler.
  • Matching the best in-domain performance of prior samplers takes 13.1 to 53.2 percent less wall-clock training time.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same freezing approximation might reduce discretization artifacts in non-RL sampling tasks for flow-matching or diffusion models.
  • The approach could be tested on video or 3D generation pipelines where step count is similarly constrained.
  • If the posterior-mean freeze generalizes, it may simplify SDE schedule design across different generative architectures.

Load-bearing premise

The discretization deviation observed in the toy example is the dominant reason for instability and slow convergence when standard samplers are used on full-scale flow-matching models in RL.

What would settle it

Measure the actual discretization error magnitude of standard samplers versus Precise at the exact step counts, noise levels, and latent resolutions used in the paper's RL experiments to see whether the error gap accounts for the reported stability and speed differences.
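A much smaller version of that measurement can be done in closed form on a one-dimensional Gaussian toy, sketched below. This is not the paper's toy example or its experimental setup. The assumptions are the Gaussian data distribution N(0, s^2), the interpolation x_t = (1 - t) * x0 + t * eps, a constant diffusion coefficient `eta`, and the helper names. Because every step is linear-Gaussian, the variance of the Euler-Maruyama sample can be propagated exactly with no Monte Carlo noise.

```python
def marginal_var(t, s):
    """Variance of x_t = (1 - t) * x0 + t * eps, x0 ~ N(0, s^2), eps ~ N(0, 1)."""
    return (1.0 - t) ** 2 * s ** 2 + t ** 2


def em_final_var(num_steps, s, eta):
    """Variance of the final sample after Euler-Maruyama from t = 1 to t = 0.

    Uses the exact velocity and score of the toy, so any gap between the
    returned value and s^2 comes from discretization alone.
    Each step has the form x_next = a * x + b * z, so V_next = a^2 * V + b^2.
    """
    dt = 1.0 / num_steps
    var = 1.0  # x_1 ~ N(0, 1)
    for i in range(num_steps):
        t = 1.0 - i * dt
        m = marginal_var(t, s)
        k_v = (t - (1.0 - t) * s ** 2) / m  # exact velocity is k_v * x
        drift = k_v + 0.5 * eta ** 2 / m    # coefficient of x in v - (eta^2 / 2) * score
        a = 1.0 - drift * dt                # time runs backwards, so the drift is subtracted
        var = a * a * var + eta ** 2 * dt   # noise injected by this step
    return var


if __name__ == "__main__":
    s, eta = 0.1, 1.0
    for n in (10, 100, 1000, 5000):
        print(n, em_final_var(n, s, eta), "target", s ** 2)
```

The last step always injects variance eta^2 / num_steps that is never denoised, so at small step counts the final variance cannot fall below that floor regardless of how accurate the model is. Whether an effect of this kind dominates at the paper's operating point is exactly what the measurement above would need to show.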

Figures

Figures reproduced from arXiv: 2605.23522 by Jade Zou, Jiangfeng Xiong, Jianwei Zhang, Junzhe Li, Liefeng Bo, Qi Tian, Tao Huang, Weijie Kong, Yue Wu, Zhao Zhong.

Figure 1. Left: Sampler design has two coupled axes: the exploration-stability balance and SDE consistency.
Figure 2. SD3.5-M clean-latent stability under forward-style renoising on adjacent logSNR-grid pairs; the band …
Figure 3. Training trajectories under the two main protocols. Higher is better.
Figure 4. FLUX.2 Klein training trajectories under the 20-NFE protocol. Higher is better.
Figure 5. Ablations on NFE and exploration strength. Left: NFE ablation on PickScore. Right: …
Figure 6. Qualitative examples under the 10-NFE protocol.
Figure 7. Qualitative examples under the 30-NFE protocol.
Figure 8. CPS outer-ring mass on the equal-mass double-ring distribution. The target value is …

Original abstract

Reinforcement learning (RL) has become an effective way to improve prompt alignment and perceptual quality in diffusion and flow-matching generators. A critical step for applying online RL to flow matching is turning the deterministic sampling trajectory into a stochastic policy, typically by replacing the reverse-time Ordinary Differential Equation (ODE) with a Stochastic Differential Equation (SDE). The stochastic sampler, controlling the exploration behavior and denoising dynamics, is thus part of the policy, and its design can significantly affect the reward optimization performance. We break down the sampler design into two interdependent components: choosing the right amount of stochastic exploration, and discretizing the resulting SDE faithfully at the small step counts used in RL. To address the first component, we analyze the inherent tension between exploration and stability in denoising and derive an SDE schedule that balances the two. Turning to the discretization challenge, we use a toy example to show that existing samplers can deviate from the flow-matching process, either by introducing excessive discretization noise or by relying on heuristic rules that do not guarantee convergence to the data distribution. To address these issues, we propose Precise, a new stochastic sampler that balances effective exploration with stability. Crucially, Precise keeps the denoising trajectory SDE-consistent through a novel approximation that freezes the clean-latent posterior mean, resolving the excess noise issue in standard samplers. Extensive experiments demonstrate that this formulation leads to significantly faster and more stable reward optimization via reinforcement learning, achieving state-of-the-art alignment scores (e.g., PickScore, HPSv2.1) while requiring 13.1-53.2% less wall-clock training time to match the best in-domain performance of prior samplers.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper claims that existing stochastic samplers for flow-matching models in RL post-training introduce excess discretization noise or rely on non-convergent heuristics. It derives an SDE schedule balancing exploration and stability, diagnoses the issue via a toy example, and introduces Precise, which maintains SDE-consistency via a novel approximation that freezes the clean-latent posterior mean. This is asserted to yield significantly faster and more stable reward optimization, SOTA alignment scores (PickScore, HPSv2.1), and 13.1-53.2% less wall-clock training time.

Significance. If the central empirical claims hold and the toy-example diagnosis generalizes, the work would supply a principled, SDE-consistent stochastic policy for RL fine-tuning of flow-matching generators. The explicit separation of schedule design from discretization fidelity, together with the posterior-mean freezing approximation, could become a reusable component for stable online RL in this model class.

major comments (3)
  1. [Toy example / §3] Toy-example diagnosis (abstract and §3): the claim that discretization deviation is the primary bottleneck is supported only by the toy case; no evidence is given that this error dominates over SDE-schedule choice, reward-model variance, or policy-gradient noise once step counts and resolutions reach those used in the reported RL experiments.
  2. [Experiments / §5] Experimental validation (abstract and §5): the abstract asserts faster, more stable optimization and SOTA scores but supplies no quantitative results, error bars, ablation tables, or direct verification that the claimed SDE consistency is achieved or that the approximation converges to the data distribution at the operating point.
  3. [Method / §4] SDE-consistency claim (abstract and §4): the novel approximation is described as restoring consistency by freezing the clean-latent posterior mean, yet no equation, convergence proof, or numerical check is referenced showing that the resulting trajectory satisfies the target SDE at the small step counts used in RL.
minor comments (2)
  1. Notation for the derived SDE schedule should be introduced with an explicit equation number rather than described only in prose.
  2. The abstract would be clearer if it cited the specific section containing the toy-example figures and the RL-experiment tables.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the thoughtful and constructive comments. We respond to each major comment below and indicate planned revisions to the manuscript.

Point-by-point responses
  1. Referee: [Toy example / §3] Toy-example diagnosis (abstract and §3): the claim that discretization deviation is the primary bottleneck is supported only by the toy case; no evidence is given that this error dominates over SDE-schedule choice, reward-model variance, or policy-gradient noise once step counts and resolutions reach those used in the reported RL experiments.

    Authors: The toy example in §3 is designed to isolate and diagnose the discretization deviation mechanism under conditions where the exact SDE solution is known. We agree that it does not quantify the relative magnitude of this error against other sources in the full RL setting. The performance differences reported in §5 provide supporting evidence that addressing SDE consistency improves outcomes, but we will add an explicit discussion in the revised §3 addressing the potential interplay with reward-model variance and policy-gradient noise at the step counts used in our experiments. revision: yes

  2. Referee: [Experiments / §5] Experimental validation (abstract and §5): the abstract asserts faster, more stable optimization and SOTA scores but supplies no quantitative results, error bars, ablation tables, or direct verification that the claimed SDE consistency is achieved or that the approximation converges to the data distribution at the operating point.

    Authors: Section 5 contains the quantitative results (PickScore, HPSv2.1, and the reported 13.1-53.2% wall-clock reductions) along with comparisons to prior samplers. We will revise the manuscript to include error bars on the main metrics, add ablation tables isolating the contribution of the posterior-mean freezing step, and insert a numerical verification that the sampled trajectories remain consistent with the target SDE at the operating step counts. revision: yes

  3. Referee: [Method / §4] SDE-consistency claim (abstract and §4): the novel approximation is described as restoring consistency by freezing the clean-latent posterior mean, yet no equation, convergence proof, or numerical check is referenced showing that the resulting trajectory satisfies the target SDE at the small step counts used in RL.

    Authors: Section 4 presents the conceptual description of the approximation. In the revision we will add the explicit update equation and a numerical check measuring trajectory deviation from the target SDE at the step counts employed in the RL experiments. A formal convergence proof lies outside the scope of the present work, which builds on existing flow-matching theory; the added numerical check will serve as empirical support. revision: partial

Circularity Check

0 steps flagged

No significant circularity; derivation chain is self-contained


The paper's core steps consist of an analysis of exploration-stability tension to derive an SDE schedule, followed by a toy example demonstrating discretization deviations in prior samplers, and the introduction of a novel approximation (freezing the clean-latent posterior mean) to enforce SDE-consistency. None of these reduce by construction to fitted inputs, self-definitions, or self-citation chains; the claims rest on independent analytical reasoning and empirical results rather than tautological equivalences. No load-bearing equations or parameters are shown to be renamed predictions or imported uniqueness theorems from the authors' prior work.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Based solely on the abstract, the central claim rests on the domain assumption that replacing the ODE with an SDE produces a valid policy for RL, plus the modeling choice that freezing the posterior mean is a faithful approximation rather than an ad-hoc fix. No free parameters or invented entities are explicitly named.

axioms (1)
  • domain assumption: The reverse-time ODE of flow matching can be replaced by an SDE to create a stochastic policy without changing the marginal data distribution at convergence.
    Invoked when the abstract states that turning the deterministic trajectory into a stochastic policy is a critical step for online RL.
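The reason this assumption is standard can be checked with the Fokker-Planck equation. The derivation below is the textbook argument in continuous time, written in the sampling direction with an assumed reparameterization tau = 1 - t and an arbitrary diffusion coefficient g; it says nothing about what happens after discretization, which is the paper's concern.

```latex
\begin{align*}
% Sampling direction: \tau = 1 - t increases from noise to data, q_\tau = p_{1-\tau}.
\text{ODE:}\quad & \frac{dx}{d\tau} = -v(x, \tau)
  \;\Rightarrow\; \partial_\tau q_\tau = \nabla \cdot (q_\tau\, v) \\
\text{SDE:}\quad & dx = \Big[-v + \tfrac{g^2}{2}\,\nabla \log q_\tau\Big]\, d\tau + g\, dW \\
\Rightarrow\quad & \partial_\tau q_\tau
  = -\nabla \cdot \Big(q_\tau \Big[-v + \tfrac{g^2}{2}\,\nabla \log q_\tau\Big]\Big)
    + \tfrac{g^2}{2}\,\Delta q_\tau \\
  &= \nabla \cdot (q_\tau\, v) - \tfrac{g^2}{2}\,\Delta q_\tau + \tfrac{g^2}{2}\,\Delta q_\tau
   = \nabla \cdot (q_\tau\, v)
\end{align*}
```

The step from the third to the fourth line uses q \nabla \log q = \nabla q. Both processes therefore share the same marginals for any g, provided the score is exact and the SDE is integrated exactly.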



