
arxiv: 2606.21086 · v1 · pith:TUNFEOYDnew · submitted 2026-06-19 · 💻 cs.RO

ReFPO: Reflow Regularization for Flow Matching Policy Gradients

Pith reviewed 2026-06-26 14:37 UTC · model grok-4.3

classification 💻 cs.RO
keywords flow matching · policy gradients · reinforcement learning · reflow regularization · generative policies · one-step inference · robotic control · geometric regularization

The pith

Flow matching policy gradients implicitly perform advantage-weighted Reflow, so an explicit geometric regularizer added in one line stabilizes training and supports accurate one-step inference.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that the gradient updates in Flow Matching Policy Gradients function as an implicit advantage-weighted Reflow process. This geometric view motivates adding an explicit Reflow regularizer to the method. The resulting ReFPO approach requires only a single line of code change and no extra computation or distillation stages. It reduces proxy-ratio spikes during training and supports high-fidelity one-step generation that matches or exceeds multi-step results. Experiments on control tasks from simple grids to complex humanoid robots confirm improved performance and robustness to discretization.

Core claim

The gradient updates in Flow Matching Policy Gradients can be interpreted as an implicit advantage-weighted Reflow process. Building on this, ReFPO introduces an explicit geometric regularizer implemented with a single line of code change. This regularization reduces CFM proxy-ratio spikes, stabilizes training, and enables high-fidelity one-step inference often matching or exceeding multi-step performance across GridWorld, MuJoCo Playground, and Humanoid Control tasks.
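
To see why proxy-ratio spikes matter for PPO-style training, here is a minimal sketch. It assumes the CFM proxy ratio takes the exponentiated loss-difference form associated with FPO [24], which this summary does not spell out. The names `cfm_proxy_ratio`, `clipped_surrogate`, and `clip_eps`, and all numeric values, are illustrative and not taken from the paper's code.

```python
import numpy as np

def cfm_proxy_ratio(loss_old, loss_new):
    """Proxy likelihood ratio from per-sample CFM losses (assumed form).

    loss_old / loss_new: CFM losses of one action under the old and the
    current policy, evaluated on the same Monte Carlo (time, noise) draws.
    """
    diff = np.asarray(loss_old, dtype=float) - np.asarray(loss_new, dtype=float)
    return float(np.exp(diff.mean()))

def clipped_surrogate(ratio, advantage, clip_eps=0.2):
    """Standard PPO clipped objective for a single sample."""
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    return float(min(ratio * advantage, clipped * advantage))

# Hypothetical loss values: a drop of 2.0 in the CFM loss already
# inflates the ratio to exp(2), far outside the clip range.
spiky_ratio = cfm_proxy_ratio([3.0, 3.0], [1.0, 1.0])
calm_ratio = cfm_proxy_ratio([3.0, 3.0], [3.0, 3.0])
print(spiky_ratio, clipped_surrogate(spiky_ratio, advantage=1.0))
print(calm_ratio, clipped_surrogate(calm_ratio, advantage=1.0))
```

Under this assumed form the ratio is an exponential of a loss difference, so modest changes in velocity-field error can produce large ratio swings. That is one plausible reading of the spikes the paper's diagnostics track.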

What carries the argument

The advantage-weighted Reflow process interpretation of FPO gradients, which motivates the explicit Reflow geometric regularizer.

If this is right

  • Reduces CFM proxy-ratio spikes during training
  • Stabilizes PPO-style training without auxiliary stages
  • Enables high-fidelity one-step inference that matches or exceeds multi-step
  • Improves average performance and discretization robustness in control tasks
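
The link between path straightness and one-step inference in the list above can be illustrated with a toy example. This is a sketch under stated assumptions, not the paper's policy: the two velocity fields, the fixed-step Euler sampler `euler_sample`, and all numbers are hypothetical. It shows that a constant (straight) velocity field gives the same endpoint with 1 or 10 Euler steps, while a curved field does not.

```python
import numpy as np

def euler_sample(velocity, a0, num_steps):
    """Integrate da/dt = velocity(a, t) from t=0 to t=1 with fixed-step Euler."""
    a = np.asarray(a0, dtype=float)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        a = a + dt * velocity(a, k * dt)
    return a

# Toy 2-D "action" problem; every value here is illustrative.
noise = np.array([0.5, 0.5])      # starting sample a0
target = np.array([1.0, -2.0])    # action the flow should reach at t=1
bend = np.array([2.0, 1.0])       # sideways component that curves the path

def straight_field(a, t):
    # Constant velocity along the chord from noise to target.
    return target - noise

def curved_field(a, t):
    # Same exact endpoint (the cosine term integrates to zero over [0, 1]),
    # but the path bends, so a single coarse Euler step misses the target.
    return (target - noise) + bend * np.cos(2.0 * np.pi * t)

for name, field in [("straight", straight_field), ("curved", curved_field)]:
    one_step = euler_sample(field, noise, num_steps=1)
    ten_step = euler_sample(field, noise, num_steps=10)
    gap = np.linalg.norm(one_step - ten_step)
    print(f"{name}: 1-step vs 10-step gap = {gap:.4f}")
```

The 1-step versus 10-step gap printed here is a toy analogue of the gap shaded in the paper's Figure 3: the straighter the learned flow, the less the result depends on the number of integration steps.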

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This geometric regularization might apply to other generative policy optimization methods beyond flow matching.
  • Explicit path rectification could further improve sample efficiency in high-dimensional control.
  • Stable one-step inference opens possibilities for real-time deployment of generative policies in robotics.

Load-bearing premise

The gradient updates in Flow Matching Policy Gradients can be interpreted as an implicit advantage-weighted Reflow process.

What would settle it

If experiments showed that the explicit Reflow regularizer neither reduces CFM proxy-ratio spikes nor improves one-step inference relative to standard FPO, the paper's central empirical claim would be falsified.

Figures

Figures reproduced from arXiv: 2606.21086 by Chengsi Yao, Fan Feng, Ge Wang, Honghao Cai, Jiahao Yang, Jinke Ren, Shenhao Yan, Shuguang Cui, Xi Li, Yatong Han, Yibo Peng, Yiming Zhao, Zhen Li.

Figure 1
Figure 1: Overview of ReFPO. Generative models, particularly flow matching and diffusion, have become powerful policy representations in reinforcement learning (RL) by enabling the capture of complex, multimodal action distributions. Unlike traditional Gaussian policies, flow-based policies can represent non-convex behaviors essential for high-dimensional tasks. However, their practical deployment is limited by…
Figure 2
Figure 2: Visualizations on the Multimodal Grid World task. The top and bottom rows correspond to FPO and ReFPO, respectively. Left columns: learned vector fields representing the policy's action distribution. Right columns: rollout trajectories starting from fixed initial states to goal zones. Both visualizations are shown for 10-step and 1-step generation settings.
Figure 3
Figure 3: MuJoCo Playground evaluation. FPO and ReFPO rewards on 10 DM Control Suite tasks over 100M steps. Shaded regions show the 10-step/1-step gap; narrower gaps indicate better one-step consistency.
Figure 4
Figure 4: Policy-ratio diagnostics. On PointMass, FingerSpin, and BallInCup, reward drops in FPO coincide with large proxy-ratio spikes, while ReFPO keeps the ratio smaller and the reward curves more stable. Beyond cumulative reward, we report two flow diagnostics. Straightness Error is the MSE between the learned velocity field vθ and the straight target direction (a1 − a0); lower values indicate a more linear samp…
Figure 5
Figure 5: Performance and efficiency analysis. (a) Comparison of FPO and ReFPO on Humanoid Control. (b) Action generation latency, where ReFPO-1step achieves significantly lower inference time. Implementation Details and Metrics. The agent has 24 actuated joints (72 DoF) and is trained in Isaac Gym [23] using Puffer-PHC [20]. Policies receive proprioception and goal-conditioned targets under…
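
The Straightness Error mentioned in the Figure 4 caption (the MSE between the learned velocity field and the straight direction a1 − a0) can be sketched as follows. The caption is truncated, so where and how the paper evaluates the field is not visible here. This sketch assumes uniform time sampling and evaluation on the linear interpolant between a0 and a1. The function `straightness_error` and both toy velocity fields are hypothetical.

```python
import numpy as np

def straightness_error(velocity_fn, a0, a1, num_times=16, seed=0):
    """MSE between velocity_fn and the straight direction (a1 - a0).

    a0, a1: arrays of shape (batch, action_dim) with paired start/end samples.
    The evaluation points (uniform t, linear interpolant) are an assumption.
    """
    a0 = np.asarray(a0, dtype=float)
    a1 = np.asarray(a1, dtype=float)
    rng = np.random.default_rng(seed)
    t = rng.uniform(size=(num_times, a0.shape[0], 1))
    a_t = (1.0 - t) * a0 + t * a1      # points on the straight path
    v = velocity_fn(a_t, t)            # shape (num_times, batch, action_dim)
    return float(np.mean((v - (a1 - a0)) ** 2))

rng = np.random.default_rng(1)
a0 = rng.normal(size=(4, 2))
a1 = rng.normal(size=(4, 2))

def perfectly_straight(a_t, t):
    # Always points along the chord from a0 to a1.
    return np.broadcast_to(a1 - a0, a_t.shape)

def biased(a_t, t):
    # Off by a constant 0.5 in every coordinate.
    return np.broadcast_to(a1 - a0, a_t.shape) + 0.5

print(straightness_error(perfectly_straight, a0, a1))
print(straightness_error(biased, a0, a1))
```

A field that always points along the chord scores zero, and the score grows with the squared deviation from that direction, which matches the caption's reading that lower values indicate a more linear path.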
Original abstract

We present Reflow-regularized Flow Matching Policy Gradients (ReFPO), a simple online RL method that adds explicit Reflow regularization to FPO for efficient flow-based control. We uncover a key structural property: the gradient updates in Flow Matching Policy Gradients (FPO) can be interpreted as an implicit advantage-weighted Reflow process, providing a new geometric perspective on flow-based policy gradients. Building on this insight, ReFPO introduces an explicit geometric regularizer that can be implemented with a single line of code change without incurring additional computational overhead or auxiliary distillation stages. By synergizing advantage-guided updates with path rectification, our method reduces CFM proxy-ratio spikes, stabilizes PPO-style training, and enables high-fidelity one-step inference that often matches or exceeds multi-step performance. We experimentally demonstrate that ReFPO improves average performance and discretization robustness across GridWorld, MuJoCo Playground, and high-dimensional Humanoid Control tasks, providing a scalable and stable approach for generative policies in complex physical simulations.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces ReFPO, which augments Flow Matching Policy Gradients (FPO) with an explicit Reflow regularization term. It claims that FPO gradient updates constitute an implicit advantage-weighted Reflow process, motivating a single-line geometric regularizer that reduces CFM proxy-ratio spikes, stabilizes training, and supports high-fidelity one-step inference. Experiments on GridWorld, MuJoCo Playground, and Humanoid control tasks report improved average performance and discretization robustness.

Significance. If the structural property is rigorously established and the regularizer proves effective without hidden costs, the work supplies a lightweight stabilization technique for flow-based generative policies in RL. The geometric framing could inform future policy-gradient designs that combine advantage weighting with path rectification, and the single-line implementation claim is a practical strength if verified.

major comments (2)
  1. [Section 3 (Structural Property and Motivation)] The central structural claim—that FPO gradients are an implicit advantage-weighted Reflow process—is load-bearing for the motivation of the explicit regularizer, yet no derivation is supplied that aligns the CFM velocity field, advantage weighting, and path-rectification terms. Without this step-by-step equivalence (or an ablation isolating the implicit Reflow component), the geometric justification remains an unverified modeling choice rather than a necessary consequence of the FPO objective.
  2. [§4 (Method)] §4 (Method) and Algorithm 1: the claim that the regularizer incurs “no additional computational overhead” and requires only “a single line of code change” must be supported by explicit complexity analysis and a side-by-side code diff; the current description does not quantify the extra gradient term’s cost relative to the base FPO update.
minor comments (2)
  1. [Abstract and §5] Abstract and §5 (Experiments): performance claims are stated without error bars, number of seeds, or statistical tests; tables or figures should report mean ± std across runs to substantiate “improves average performance.”
  2. [§2 (Preliminaries)] Notation: the distinction between the CFM proxy ratio and the advantage-weighted Reflow objective is introduced without a clear equation reference; a dedicated notation table or inline definitions would improve readability.
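
The reporting requested in minor comment 1 could look like the sketch below. The helper `summarize_runs` is hypothetical, and the input values are placeholders chosen for easy arithmetic, not results from the paper.

```python
import numpy as np

def summarize_runs(final_returns):
    """Mean, sample standard deviation, and standard error across seeds."""
    r = np.asarray(final_returns, dtype=float)
    n = r.size
    # ddof=1 gives the unbiased sample estimate; undefined for a single seed.
    std = r.std(ddof=1) if n > 1 else float("nan")
    return {
        "n": int(n),
        "mean": float(r.mean()),
        "std": float(std),
        "sem": float(std / np.sqrt(n)),
    }

# Placeholder per-seed returns, NOT numbers from the paper.
placeholder_returns = [1.0, 2.0, 3.0, 4.0, 5.0]
summary = summarize_runs(placeholder_returns)
print(f"{summary['mean']:.2f} ± {summary['std']:.2f} (n={summary['n']})")
```

Reporting the seed count next to the spread lets a reader judge whether a difference between FPO and ReFPO is larger than run-to-run variation.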

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments. We address each major comment below and commit to revisions that strengthen the manuscript where the points are valid.

Point-by-point responses
  1. Referee: [Section 3 (Structural Property and Motivation)] The central structural claim—that FPO gradients are an implicit advantage-weighted Reflow process—is load-bearing for the motivation of the explicit regularizer, yet no derivation is supplied that aligns the CFM velocity field, advantage weighting, and path-rectification terms. Without this step-by-step equivalence (or an ablation isolating the implicit Reflow component), the geometric justification remains an unverified modeling choice rather than a necessary consequence of the FPO objective.

    Authors: We agree that the structural claim is central and that the manuscript lacks an explicit derivation. In the revision we will add a step-by-step derivation in Section 3 that aligns the CFM velocity field, advantage weighting, and path-rectification terms. We will also include an ablation isolating the implicit Reflow component. revision: yes

  2. Referee: [§4 (Method)] §4 (Method) and Algorithm 1: the claim that the regularizer incurs “no additional computational overhead” and requires only “a single line of code change” must be supported by explicit complexity analysis and a side-by-side code diff; the current description does not quantify the extra gradient term’s cost relative to the base FPO update.

    Authors: We acknowledge that the overhead and single-line claims require explicit support. In the revision we will add a complexity analysis of the extra gradient term relative to base FPO and include a side-by-side code diff (in the main text or appendix) to demonstrate the change and quantify cost. revision: yes

Circularity Check

0 steps flagged

No circularity: structural interpretation presented as modeling insight without reduction to fitted inputs or self-citation chains

full rationale

The paper states an interpretation of FPO gradients as an implicit advantage-weighted Reflow process and uses it to motivate an explicit regularizer, but the provided text contains no equations, fitting procedures, or self-citations that reduce the claimed property or regularizer to the inputs by construction. The regularizer is introduced as an additive change rather than a tautological renaming or statistical forcing of a fitted quantity. No load-bearing step equates the output to the input via definition or prior self-work, making the derivation self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No mathematical details, loss functions, or modeling assumptions are visible in the abstract, so the ledger is empty.



Reference graph

Works this paper leans on

51 extracted references · 11 linked inside Pith

  [1] Kevin Black, Michael Janner, Yilun Du, Ilya Kostrikov, and Sergey Levine. Training diffusion models with reinforcement learning. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=YCWjhGrJFD
  [2] Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. Openai gym. arXiv preprint arXiv:1606.01540, 2016
  [3] Tianyi Chen, Haitong Ma, Na Li, Kai Wang, and Bo Dai. One-step flow policy mirror descent. arXiv preprint arXiv:2507.23675, 2025
  [4] Zhenglin Cheng, Peng Sun, Jianguo Li, and Tao Lin. Twinflow: Realizing one-step generation on large models with self-adversarial flows. arXiv preprint arXiv:2512.05150, 2025
  [5] Cheng Chi, Zhenjia Xu, Siyuan Feng, Eric Cousineau, Yilun Du, Benjamin Burchfiel, Russ Tedrake, and Shuran Song. Diffusion policy: Visuomotor policy learning via action diffusion. The International Journal of Robotics Research, 44(10-11):1684–1704, 2025
  [6] Jiajun Fan, Shuaike Shen, Chaoran Cheng, Yuxin Chen, Chumeng Liang, and Ge Liu. Online reward-weighted fine-tuning of flow matching with wasserstein regularization. In The Thirteenth International Conference on Learning Representations, 2025
  [7] Kevin Frans, Danijar Hafner, Sergey Levine, and Pieter Abbeel. One step diffusion via shortcut models. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=OlzB6LnXcS
  [8] Zhengyang Geng, Mingyang Deng, Xingjian Bai, J Zico Kolter, and Kaiming He. Mean flows for one-step generative modeling. arXiv preprint arXiv:2505.13447, 2025
  [9] Philippe Hansen-Estruch, Ilya Kostrikov, Michael Janner, Jakub Grudzien Kuba, and Sergey Levine. Idql: Implicit q-learning as an actor-critic method with diffusion policies. arXiv preprint arXiv:2304.10573, 2023
  [10] Michael Janner, Yilun Du, Joshua Tenenbaum, and Sergey Levine. Planning with diffusion for flexible behavior synthesis. In International Conference on Machine Learning, pages 9902–9915. PMLR, 2022
  [11] Diederik P Kingma and Ruiqi Gao. Understanding diffusion objectives as the ELBO with simple data augmentation. In Thirty-seventh Conference on Neural Information Processing Systems, 2023. URL https://openreview.net/forum?id=NnMEadcdyD
  [12] Nikita Kornilov, Petr Mokrov, Alexander Gasnikov, and Aleksandr Korotin. Optimal flow matching: Learning straight trajectories in just one step. Advances in Neural Information Processing Systems, 37:104180–104204, 2024
  [13] Junzhe Li, Yutao Cui, Tao Huang, Yinping Ma, Chun Fan, Miles Yang, and Zhao Zhong. Mixgrpo: Unlocking flow-based grpo efficiency with mixed ode-sde. arXiv preprint arXiv:2507.21802, 2025
  [14] Qiyang Li, Zhiyuan Zhou, and Sergey Levine. Reinforcement learning with action chunking. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025. URL https://openreview.net/forum?id=XUks1Y96NR
  [15] Shanchuan Lin, Ceyuan Yang, Zhijie Lin, Hao Chen, and Haoqi Fan. Adversarial flow models. arXiv preprint arXiv:2511.22475, 2025
  [16] Huadai Liu, Jialei Wang, Rongjie Huang, Yang Liu, Heng Lu, Zhou Zhao, and Wei Xue. Flashaudio: Rectified flow for fast and high-fidelity text-to-audio generation. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 13694–13710, 2025
  [17] Jie Liu, Gongye Liu, Jiajun Liang, Yangguang Li, Jiaheng Liu, Xintao Wang, Pengfei Wan, Di Zhang, and Wanli Ouyang. Flow-grpo: Training flow matching models via online rl. arXiv preprint arXiv:2505.05470, 2025
  [18] Xingchao Liu, Chengyue Gong, and Qiang Liu. Flow straight and fast: Learning to generate and transfer data with rectified flow. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=XVjTT1nw5z
  [19] Tianze Luo, Haotian Yuan, and Zhuang Liu. Soflow: Solution flow models for one-step generative modeling. arXiv preprint arXiv:2512.15657, 2025
  [20] Zhengyi Luo, Jinkun Cao, Kris Kitani, Weipeng Xu, et al. Perpetual humanoid control for real-time simulated avatars. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 10895–10904, 2023
  [21] Lei Lv, Yunfei Li, Yu Luo, Fuchun Sun, Tao Kong, Jiafeng Xu, and Xiao Ma. Flow-based policy for online reinforcement learning. arXiv preprint arXiv:2506.12811, 2025
  [22] Naureen Mahmood, Nima Ghorbani, Nikolaus F Troje, Gerard Pons-Moll, and Michael J Black. Amass: Archive of motion capture as surface shapes. In Proceedings of the IEEE/CVF international conference on computer vision, pages 5442–5451, 2019
  [23] Viktor Makoviychuk, Lukasz Wawrzyniak, Yunrong Guo, Michelle Lu, Kier Storey, Miles Macklin, David Hoeller, Nikita Rudin, Arthur Allshire, Ankur Handa, et al. Isaac gym: High performance gpu-based physics simulation for robot learning. arXiv preprint arXiv:2108.10470, 2021
  [24] David McAllister, Songwei Ge, Brent Yi, Chung Min Kim, Ethan Weber, Hongsuk Choi, Haiwen Feng, and Angjoo Kanazawa. Flow matching policy gradients. arXiv preprint arXiv:2507.21053, 2025
  [25] Thanh Nguyen and Chang D Yoo. Revisiting diffusion q-learning: From iterative denoising to one-step action generation. arXiv preprint arXiv:2508.13904, 2025
  [26] Seohong Park, Qiyang Li, and Sergey Levine. Flow q-learning. arXiv preprint arXiv:2502.02538, 2025
  [27] Xue Bin Peng, Pieter Abbeel, Sergey Levine, and Michiel Van de Panne. Deepmimic: Example-guided deep reinforcement learning of physics-based character skills. ACM Transactions On Graphics (TOG), 37(4):1–14, 2018
  [28] Allen Z. Ren, Justin Lidard, Lars Lien Ankile, Anthony Simeonov, Pulkit Agrawal, Anirudha Majumdar, Benjamin Burchfiel, Hongkai Dai, and Max Simchowitz. Diffusion policy policy optimization. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=mEpqHvbD2h
  [29] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017
  [30] Younggyo Seo, Carmelo Sferrazza, Haoran Geng, Michal Nauman, Zhao-Heng Yin, and Pieter Abbeel. Fasttd3: Simple, fast, and capable reinforcement learning for humanoid control. arXiv preprint arXiv:2505.22642, 2025
  [31] Shiv Shankar and Tomas Geffner. Learning straight flows by learning curved interpolants. In ICLR 2025 Workshop on Deep Generative Model in Machine Learning: Theory, Principle and Efficacy, 2025. URL https://openreview.net/forum?id=9bJ2PJFNX4
  [32] Yang Song, Prafulla Dhariwal, Mark Chen, and Ilya Sutskever. Consistency models, 2023. URL https://arxiv.org/abs/2303.01469
  [33] Yuval Tassa, Yotam Doron, Alistair Muldal, Tom Erez, Yazhe Li, Diego de Las Casas, David Budden, Abbas Abdolmaleki, Josh Merel, Andrew Lefrancq, et al. Deepmind control suite. arXiv preprint arXiv:1801.00690, 2018
  [34] Emanuel Todorov, Tom Erez, and Yuval Tassa. Mujoco: A physics engine for model-based control. In 2012 IEEE/RSJ international conference on intelligent robots and systems, pages 5026–5033. IEEE, 2012
  [35] Alexander Tong, Kilian Fatras, Nikolay Malkin, Guillaume Huguet, Yanlei Zhang, Jarrid Rector-Brooks, Guy Wolf, and Yoshua Bengio. Improving and generalizing flow-based generative models with minibatch optimal transport. Transactions on Machine Learning Research, pages 1–34, 2024
  [36] Mark Towers, Ariel Kwiatkowski, Jordan Terry, John U Balis, Gianluca De Cola, Tristan Deleu, Manuel Goulão, Andreas Kallinteris, Markus Krimmel, Arjun KG, et al. Gymnasium: A standard interface for reinforcement learning environments. arXiv preprint arXiv:2407.17032, 2024
  [37] Saran Tunyasuvunakool, Alistair Muldal, Yotam Doron, Siqi Liu, Steven Bohez, Josh Merel, Tom Erez, Timothy Lillicrap, Nicolas Heess, and Yuval Tassa. dm_control: Software and tasks for continuous control. Software Impacts, 6:100022, 2020
  [38] Zeyuan Wang, Da Li, Yulin Chen, Ye Shi, Liang Bai, Tianyuan Yu, and Yanwei Fu. One-step generative policies with q-learning: A reformulation of meanflow. arXiv preprint arXiv:2511.13035, 2025
  [39] Zhendong Wang, Jonathan J Hunt, and Mingyuan Zhou. Diffusion policies as an expressive policy class for offline reinforcement learning. In The Eleventh International Conference on Learning Representations, 2023
  [40] Ling Yang, Zixiang Zhang, Zhilong Zhang, Xingchao Liu, Minkai Xu, Wentao Zhang, Chenlin Meng, Stefano Ermon, and Bin Cui. Consistency flow matching: Defining straight flows with velocity consistency. CoRR, 2024
  [41] Tianwei Yin, Michaël Gharbi, Richard Zhang, Eli Shechtman, Fredo Durand, William T Freeman, and Taesung Park. One-step diffusion with distribution matching distillation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 6613–6623, 2024
  [42] Kevin Zakka, Baruch Tabanpour, Qiayuan Liao, Mustafa Haiderbhai, Samuel Holt, Jing Yuan Luo, Arthur Allshire, Erik Frey, Koushil Sreenath, Lueder A Kahrs, et al. Mujoco playground. arXiv preprint arXiv:2502.08844, 2025
  [43] Shiyuan Zhang, Weitong Zhang, and Quanquan Gu. Energy-weighted flow matching for offline reinforcement learning. In The Thirteenth International Conference on Learning Representations,
  [44] URL https://openreview.net/forum?id=HA0oLUvuGI
  [45] Tonghe Zhang, Chao Yu, Sichang Su, and Yu Wang. Reinflow: Fine-tuning flow matching policy with online reinforcement learning. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025
  [46] Xinxi Zhang, Shiwei Tan, Quang Nguyen, Quan Dao, Ligong Han, Xiaoxiao He, Tunyu Zhang, Alen Mrdovic, and Dimitris Metaxas. Flow straighter and faster: Efficient one-step generative modeling via meanflow on rectified trajectories. arXiv preprint arXiv:2511.23342, 2025
  [47] Zhangkai Wu, Xuhui Fan, Hongyu Wu, and Longbing Cao. SCot: Unifying consistency models and rectified flows via straight-consistent trajectories. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025. URL https://openreview.net/forum?id=GV82iAD70j
  [48] Linqi Zhou, Mathias Parger, Ayaan Haque, and Jiaming Song. Terminal velocity matching. arXiv preprint arXiv:2511.19797, 2025
  [49] Huminhao Zhu, Fangyikang Wang, Tianyu Ding, Qing Qu, and Zhihui Zhu. Analyzing and mitigating model collapse in rectified flow models. arXiv preprint arXiv:2412.08175, 2024
  [50] Yuanzhi Zhu, Xingchao Liu, and Qiang Liu. Slimflow: Training smaller one-step diffusion models with rectified flow. In European Conference on Computer Vision, pages 342–359. Springer, 2024
  [51] Yuanzhi Zhu, Xi Wang, Stéphane Lathuilière, and Vicky Kalogeiton. Di[M]O: Distilling masked diffusion models into one-step generator. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 18606–18618, 2025

A Code and Supplementary Videos

The source code for ReFPO is included in the supplementary materials to ensure the rep...