pith. machine review for the scientific record.

arxiv: 2605.14274 · v1 · submitted 2026-05-14 · 💻 cs.CV

Recognition: no theorem link

CreFlow: Corrective Reflow for Sparse-Reward Embodied Video Diffusion RL

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 02:39 UTC · model grok-4.3

classification 💻 cs.CV
keywords video diffusion · reinforcement learning · Linear Temporal Logic · embodied manipulation · bimanual tasks · reward modeling · corrective reflow · sparse rewards

The pith

CreFlow uses automatically generated Linear Temporal Logic rewards plus corrective reflow to align video diffusion rollouts with embodied task rules, lifting downstream success by 23.8 points.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Video diffusion models trained on mixed data often produce sequences that look plausible yet break physical rules during robot manipulation. The paper shows how to correct this by converting task requirements into compositions of Linear Temporal Logic constraints that supply precise rewards and flag specific errors in each frame. These signals feed into CreFlow, an online reinforcement learning setup that restricts updates to reward-relevant video regions via a credit-aware loss and steers corrections using positive samples through a reflow term. On eight bimanual tasks the resulting videos match human and simulator judgments more closely and raise real execution success when the generated plans are handed to robots. The result matters because it replaces low-level visual metrics with logic-based verification that works in sparse-reward settings.
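The recipe described above — composing per-frame logical predicates into temporal constraints that yield a sparse reward plus a localized error signal — can be made concrete with a toy sketch. This is an illustration of the general idea, not the paper's implementation; the predicate names and the example task are invented:

```python
# Illustrative sketch: a composition of LTL constraints evaluated over
# per-frame boolean predicates, returning a sparse binary reward and,
# on failure, the violated constraint as a localized error signal.

def ltl_eventually(trace):
    """F p: p holds in at least one frame."""
    return any(trace)

def ltl_always(trace):
    """G p: p holds in every frame."""
    return all(trace)

def ltl_until(p_trace, q_trace):
    """p U q: p holds in every frame strictly before the first frame where q holds."""
    for t, q in enumerate(q_trace):
        if q:
            return all(p_trace[:t])
    return False  # q never holds

def compositional_reward(pred):
    """Conjoin constraints; on failure, name the violated constraint."""
    constraints = [
        ("eventually grasped", ltl_eventually(pred["grasped"])),
        ("always collision-free", ltl_always(pred["no_collision"])),
        ("collision-free until placed", ltl_until(pred["no_collision"], pred["placed"])),
    ]
    for name, satisfied in constraints:
        if not satisfied:
            return 0.0, name
    return 1.0, None

# Toy 5-frame rollout: grasp at frame 1, place at frame 4, no collisions.
pred = {
    "grasped":      [False, True, True, True, True],
    "no_collision": [True, True, True, True, True],
    "placed":       [False, False, False, False, True],
}
reward, violated = compositional_reward(pred)  # → (1.0, None)
```

A violated constraint names exactly which rule failed, which is the kind of localized feedback the masked update below exploits.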

Core claim

The central claim is that an LTL-based compositional reward model paired with the CreFlow framework—built around a credit-aware NFT loss that limits updates to reward-relevant regions and a corrective reflow loss that uses within-group positive samples to estimate the correction direction—stabilizes reinforcement learning updates for high-dimensional video diffusion models, produces rewards better aligned with human and simulator success labels, and raises downstream execution success by 23.8 percentage points across eight bimanual manipulation tasks.

What carries the argument

The CreFlow online RL framework, which pairs automatically formulated Linear Temporal Logic rewards with a credit-aware NFT loss to confine updates and a corrective reflow loss that treats positive samples as an explicit correction estimate.
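Read literally, the two ingredients can be sketched in a few lines. This is a hedged toy illustration of the described mechanics — a mask that confines the update to reward-relevant regions, and within-group positive samples that supply an explicit correction direction — not the paper's actual losses; the array shapes, blending rule, and step size are assumptions:

```python
# Hedged sketch of the CreFlow ingredients as described in the text.
import numpy as np

def creflow_step(rollouts, rewards, mask, step=0.5):
    """rollouts: (G, T, H, W) group of video latents; rewards: (G,) binary;
    mask: (T, H, W) with 1 on task-relevant regions. Returns corrected latents."""
    positives = rollouts[rewards > 0]
    if positives.shape[0] == 0:
        return rollouts  # sparse reward: no positive sample to reflow toward
    # within-group positive mean as an explicit estimate of the correction direction
    direction = positives.mean(axis=0) - rollouts
    # credit-aware masking: the mask's complement stays anchored to the prior
    return rollouts + step * mask * direction

rng = np.random.default_rng(0)
G, T, H, W = 4, 3, 2, 2
rollouts = rng.normal(size=(G, T, H, W))
rewards = np.array([1.0, 0.0, 1.0, 0.0])
mask = np.zeros((T, H, W))
mask[:, 0, :] = 1.0          # only the top row is task-relevant
corrected = creflow_step(rollouts, rewards, mask)
```

Inside the mask every rollout moves toward the positive-sample mean; outside it nothing changes, which is the property the paper credits with keeping high-dimensional updates stable.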

If this is right

  • Reward judgments align more closely with human and simulator success labels than prior visual-metric approaches.
  • Downstream robot execution success rises by 23.8 percentage points on eight bimanual manipulation tasks.
  • Generated videos receive localized error signals that identify exactly where task specifications are violated.
  • High-dimensional diffusion updates remain stable because the credit-aware loss avoids perturbing unrelated video regions.
  • The method operates effectively under sparse rewards typical of embodied video generation.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Automating reward creation via LTL could lower the engineering cost of adapting video models to new robot tasks.
  • The corrective reflow technique might accelerate RL training for other high-dimensional generative models outside video diffusion.
  • Strong results on bimanual tasks suggest the same logic rewards could scale to longer-horizon or multi-agent planning problems.
  • If the LTL formulation generalizes, similar constraint-based rewards could improve video generation in non-robotics domains such as procedural animation.

Load-bearing premise

The automatically formulated Linear Temporal Logic constraints give faithful, localized rewards without needing significant manual engineering or domain-specific tuning.

What would settle it

Retraining on the same eight bimanual tasks with CreFlow and finding no gain in simulator-verified success rates or human-alignment scores over baseline video-RL methods would falsify the improvement claim.

Figures

Figures reproduced from arXiv: 2605.14274 by Minshuo Chen, Philip Torr, Qi Zhu, Ruochen Jiao, Simon Sinong Zhan, Sipeng Chen, Yijiang Li, Zhaoran Wang, Zhenfei Yin, Zhenyang Ni.

Figure 1. Generated manipulation videos can be nearly indistinguishable under existing reward…
Figure 2. Overview of CreFlow. CreFlow turns sparse compositional reward feedback into localized supervision for video diffusion post-training. Instead of applying reward-induced updates uniformly over the full rollout, it identifies the spatio-temporal mask responsible for task success or failure and restricts optimization to these regions. The resulting objective combines localized negative-aware finetuning with a…
Figure 3. Training convergence on three RoboTwin tasks under the same binary compositional reward.
Figure 4. Qualitative ablation on a put_bottles_dustbin rollout. (a) Vanilla DiffusionNFT, with no mask, visibly degrades the dustbin region during RL training. (b) Our group-shared mask M covers the task-relevant entities (arms, bottle); its complement is anchored to the pretrained prior. (c) CreFlow confines the contrastive update to M, preserving off-task visual quality.
Original abstract

Video generation models trained on heterogeneous data with likelihood-surrogate objectives can produce visually plausible rollouts that violate physical constraints in embodied manipulation. Although reinforcement-learning post-training offers a natural route to adapting VGMs, existing video-RL rewards often reduce each rollout to a low-level visual metric, whereas manipulation video evaluation requires logic-based verification of whether the rollout satisfies a compositional task specification. To fill this gap, we introduce a compositional constraint-based reward model for post-training embodied video generation models, which automatically formulates task requirements as a composition of Linear Temporal Logic constraints, providing faithful rewards and localized error information in generated videos. To achieve effective improvement in high-dimensional video generation using these reward signals, we further propose CreFlow, a novel online RL framework with two key designs: i) a credit-aware NFT loss that confines the RL update to reward-relevant regions, preventing perturbations to unrelated regions during post-training; and ii) a corrective reflow loss that leverages within-group positive samples as an explicit estimate of the correction direction, stabilizing and accelerating training. Experiments show that CreFlow yields reward judgments better aligned with human and simulator success labels than existing methods and improves downstream execution success by 23.8 percentage points across eight bimanual manipulation tasks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that CreFlow, an online RL post-training framework for video diffusion models in embodied bimanual manipulation, uses an automatically composed LTL-based reward model to deliver faithful, localized rewards and introduces a credit-aware NFT loss plus a corrective reflow loss to stabilize high-dimensional updates; experiments on eight tasks report superior reward alignment with human/simulator labels and a 23.8 pp gain in downstream execution success.

Significance. If the results hold, the work offers a concrete route to logic-based rewards for sparse-reward video RL, potentially improving physical consistency of generated manipulation videos without hand-crafted dense rewards; the combination of LTL compositionality with targeted diffusion losses could influence post-training of generative models for robotics.

major comments (2)
  1. [§3] §3 (LTL Reward Formulation): the central claim that LTL constraints are 'automatically' composed to yield faithful, localized rewards without significant manual engineering is load-bearing for both the reward model and the reported 23.8 pp gain, yet the manuscript provides insufficient detail on how atomic propositions (object states, contact events, spatial relations) and their thresholds are chosen across the eight tasks; if domain-specific tuning is required, the alignment advantage and downstream improvement cannot be attributed solely to the corrective reflow mechanism.
  2. [§5] §5 (Experiments, Table 2 or equivalent): the headline 23.8 pp success improvement and superior reward alignment are presented as aggregate outcomes, but the ablation isolating the corrective reflow loss from the credit-aware NFT loss and the LTL reward itself is not reported in sufficient granularity; without these controls it remains unclear whether the gains are robust or partly driven by the reward formulation.
minor comments (2)
  1. [§4] Notation for the two new losses (NFT and corrective reflow) should be introduced with explicit equations early in §4 so that the credit-aware masking and within-group positive-sample direction can be traced directly to the update rule.
  2. [Abstract / §5] The abstract and §5 should explicitly name the strongest baseline (e.g., standard PPO or prior video-RL method) against which the 23.8 pp figure is measured.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address the two major comments below and will revise the manuscript to provide the requested details and controls.

Point-by-point responses
  1. Referee: [§3] §3 (LTL Reward Formulation): the central claim that LTL constraints are 'automatically' composed to yield faithful, localized rewards without significant manual engineering is load-bearing for both the reward model and the reported 23.8 pp gain, yet the manuscript provides insufficient detail on how atomic propositions (object states, contact events, spatial relations) and their thresholds are chosen across the eight tasks; if domain-specific tuning is required, the alignment advantage and downstream improvement cannot be attributed solely to the corrective reflow mechanism.

    Authors: We agree that the current manuscript lacks sufficient granularity on proposition definition. In the revision we will add a dedicated subsection (and supplementary table) that enumerates the atomic propositions used for all eight tasks, specifies the exact simulator-derived predicates (e.g., contact distance < 0.05 m, gripper-object alignment within 0.02 rad), and clarifies that thresholds are taken from fixed physical constants rather than per-task tuning. Once the proposition vocabulary is defined once, the LTL formulas are generated automatically from a high-level task template; no additional manual engineering is performed per task. This will make explicit that the reported gains are not driven by hidden reward engineering. revision: yes

  2. Referee: [§5] §5 (Experiments, Table 2 or equivalent): the headline 23.8 pp success improvement and superior reward alignment are presented as aggregate outcomes, but the ablation isolating the corrective reflow loss from the credit-aware NFT loss and the LTL reward itself is not reported in sufficient granularity; without these controls it remains unclear whether the gains are robust or partly driven by the reward formulation.

    Authors: We acknowledge the need for finer-grained controls. The revised manuscript will include an expanded ablation table that reports four variants: (1) LTL reward + credit-aware NFT only, (2) LTL reward + corrective reflow only, (3) full CreFlow, and (4) baseline without either loss. All variants will be evaluated on the same eight tasks with identical training budgets, allowing direct attribution of the 23.8 pp gain to the corrective reflow component. We will also report per-task success rates to demonstrate robustness. revision: yes
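The fixed-threshold predicates described in the first response can be sketched as follows; the field names are hypothetical, and only the 0.05 m and 0.02 rad thresholds come from the response above:

```python
# Hypothetical illustration of fixed-threshold atomic propositions:
# simulator state is mapped to the boolean vocabulary the LTL layer consumes.
CONTACT_DIST_M = 0.05   # fixed physical constant, not per-task tuned
ALIGNMENT_RAD = 0.02

def atomic_propositions(state):
    """One simulator frame -> boolean proposition vocabulary."""
    return {
        "in_contact": state["gripper_object_dist"] < CONTACT_DIST_M,
        "aligned": abs(state["gripper_object_angle"]) < ALIGNMENT_RAD,
    }

frame = {"gripper_object_dist": 0.03, "gripper_object_angle": -0.01}
props = atomic_propositions(frame)  # → {"in_contact": True, "aligned": True}
```

Under the rebuttal's claim, only this vocabulary is defined once; the per-task LTL formulas over it are generated automatically.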

Circularity Check

0 steps flagged

No circularity detected in the derivation chain

Full rationale

The paper's core contributions—the automatic LTL constraint formulation for rewards and the two new losses (credit-aware NFT and corrective reflow)—are introduced as independent mechanisms derived from task specifications and RL stabilization needs. No equations or steps reduce the reported 23.8 pp success gain or reward alignment metrics to a fitted parameter defined by the evaluation data itself. The LTL rewards are composed from atomic propositions tied to simulator states, and downstream execution success is measured separately, keeping the chain self-contained without self-definitional loops, fitted-input predictions, or load-bearing self-citations that collapse the claims.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that LTL constraints can be automatically composed to capture manipulation task requirements and that the proposed losses produce stable updates in video diffusion space; no free parameters or invented entities are explicitly introduced in the abstract.

axioms (1)
  • domain assumption: Linear Temporal Logic can automatically formulate faithful compositional constraints for embodied manipulation tasks.
    Invoked to generate the reward model that supplies localized error signals.

pith-pipeline@v0.9.0 · 5550 in / 1239 out tokens · 26000 ms · 2026-05-15T02:39:41.086820+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

45 extracted references · 45 canonical work pages · 19 internal anchors

  1. [1]

    Cosmos World Foundation Model Platform for Physical AI

    Niket Agarwal, Arslan Ali, Maciej Bala, Yogesh Balaji, et al. Cosmos world foundation model platform for physical AI.arXiv preprint arXiv:2501.03575, 2025

  2. [2]

    Gen2Act: Human Video Generation in Novel Scenarios Enables Generalizable Robot Manipulation

    Homanga Bharadhwaj, Debidatta Dwibedi, Abhinav Gupta, Shubham Tulsiani, Carl Doersch, Ted Xiao, Dhruv Shah, Fei Xia, Dorsa Sadigh, and Sean Kirmani. Gen2Act: Human video generation in novel scenarios enables generalizable robot manipulation. arXiv preprint arXiv:2409.16283, 2024

  3. [3]

    Motus: A Unified Latent Action World Model

    Hongzhe Bi, Hengkai Tan, Shenghao Xie, Zeyuan Wang, Shuhe Huang, Haitian Liu, Ruowen Zhao, Yao Feng, Chendong Xiang, Yinze Rong, et al. Motus: A unified latent action world model.arXiv preprint arXiv:2512.13030, 2025

  4. [4]

    $\pi_{0.5}$: a Vision-Language-Action Model with Open-World Generalization

    Kevin Black, Anthony Brohan, Danny Driess, Adnan Esmaeili, Chelsea Finn, Niccolo Fuentes, Brian Groom, Karol Hausman, Brian Ichter, Szymon Jakubczak, et al. π0.5: a vision-language-action model with open-world generalization, 2025. arXiv preprint arXiv:2504.16054

  5. [5]

    SAM 3: Segment Anything with Concepts

    Nicolas Carion, Laura Gustafson, Yuan-Ting Hu, Shoubhik Debnath, Ronghang Hu, Didac Suris, Chaitanya Ryali, et al. SAM 3: Segment anything with concepts, 2025. arXiv preprint arXiv:2511.16719. Meta AI

  6. [6]

    Large Video Planner Enables Generalizable Robot Control

    Boyuan Chen, Tianyuan Zhang, Haoran Geng, Kiwhan Song, Caiyi Zhang, Peihao Li, William T. Freeman, Jitendra Malik, Pieter Abbeel, Russ Tedrake, Vincent Sitzmann, and Yilun Du. Large video planner enables generalizable robot control.arXiv preprint arXiv:2512.15840, 2025

  7. [7]

    RoboTwin 2.0: A Scalable Data Generator and Benchmark with Strong Domain Randomization for Robust Bimanual Robotic Manipulation

    Tianxing Chen, Zanxin Chen, Baijun Chen, Zijian Cai, Yibin Liu, Zixuan Li, Qiwei Liang, Xianliang Lin, Yiheng Ge, Zhenyu Gu, et al. RoboTwin 2.0: A scalable data generator and benchmark with strong domain randomization for robust bimanual robotic manipulation.arXiv preprint arXiv:2506.18088, 2025

  8. [8]

    ABot-PhysWorld: Interactive world foundation model for robotic manipulation with physics alignment.arXiv preprint arXiv:2603.23376, 2026

    Yuzhi Chen, Ronghan Chen, Dongjie Huo, Yandan Yang, Dekang Qi, Haoyun Liu, Tong Lin, Shuang Zeng, Junjin Xiao, Xinyuan Chang, et al. ABot-PhysWorld: Interactive world foundation model for robotic manipulation with physics alignment.arXiv preprint arXiv:2603.23376, 2026

  9. [9]

    Safebimanual: Diffusion-Based Trajectory Optimization for Safe Bimanual Manipulation

    Haoyuan Deng, Wenkai Guo, Qianzhun Wang, Zhenyu Wu, and Ziwei Wang. Safebimanual: Diffusion-based trajectory optimization for safe bimanual manipulation. arXiv preprint arXiv:2508.18268, 2025

  10. [10]

    Dream2Flow: Bridging video generation and open-world manipulation with 3D object flow.arXiv preprint arXiv:2512.24766, 2025

    Karthik Dharmarajan, Wenlong Huang, Jiajun Wu, Li Fei-Fei, and Ruohan Zhang. Dream2Flow: Bridging video generation and open-world manipulation with 3D object flow.arXiv preprint arXiv:2512.24766, 2025

  11. [11]

    VideoGPA: Distilling Geometry Priors for 3D-Consistent Video Generation

    Hongyang Du, Junjie Ye, Xiaoyan Cong, Runhao Li, Jingcheng Ni, Aman Agarwal, Zeqi Zhou, Zekun Li, Randall Balestriero, and Yue Wang. VideoGPA: Distilling geometry priors for 3D-consistent video generation.arXiv preprint arXiv:2601.23286, 2026

  12. [12]

    Learning Universal Policies via Text-Guided Video Generation

    Yilun Du, Mengjiao Yang, Bo Dai, Hanjun Dai, Ofir Nachum, Joshua B. Tenenbaum, Dale Schuurmans, and Pieter Abbeel. Learning universal policies via text-guided video generation. In Advances in Neural Information Processing Systems (NeurIPS), 2023

  13. [13]

    Video Language Planning

    Yilun Du, Mengjiao Yang, Pete Florence, Fei Xia, Ayzaan Wahid, Brian Ichter, Pierre Sermanet, Tianhe Yu, Pieter Abbeel, Joshua B. Tenenbaum, Leslie Kaelbling, Andy Zeng, and Jonathan Tompson. Video language planning, 2023. arXiv preprint arXiv:2310.10625

  14. [14]

    Vidar: Embodied video diffusion model for generalist manipulation.arXiv preprint arXiv:2507.12898, 2025

    Yao Feng, Hengkai Tan, Xinyi Mao, Chendong Xiang, Guodong Liu, Shuhe Huang, Hang Su, and Jun Zhu. Vidar: Embodied video diffusion model for generalist manipulation.arXiv preprint arXiv:2507.12898, 2025

  15. [15]

    VAMPO: Policy Optimization for Improving Visual Dynamics in Video Action Models

    Zirui Ge, Pengxiang Ding, Baohua Yin, Qishen Wang, Zhiyong Xie, Yemin Wang, Jinbo Wang, Hengtao Li, Runze Suo, Wenxuan Song, et al. VAMPO: Policy optimization for improving visual dynamics in video action models, 2026. arXiv preprint arXiv:2603.19370

  16. [16]

    Video Prediction Policy: A Generalist Robot Policy with Predictive Visual Representations

    Yucheng Hu, Yanjiang Guo, Pengchao Wang, Xiaoyu Chen, Yen-Jen Wang, Jianke Zhang, Koushil Sreenath, Chaochao Lu, and Jianyu Chen. Video prediction policy: A generalist robot policy with predictive visual representations, 2024. arXiv preprint arXiv:2412.14803

  17. [17]

    ReKep: Spatio-Temporal Reasoning of Relational Keypoint Constraints for Robotic Manipulation

    Wenlong Huang, Chen Wang, Yunzhu Li, Ruohan Zhang, and Li Fei-Fei. ReKep: Spatio-temporal reasoning of relational keypoint constraints for robotic manipulation. arXiv preprint arXiv:2409.01652, 2024

  18. [18]

    Mind the Generative Details: Direct Localized Detail Preference Optimization for Video Diffusion Models

    Zitong Huang, Kaidong Zhang, Yukang Ding, Chao Gao, Rui Ding, Ying Chen, and Wangmeng Zuo. Mind the generative details: Direct localized detail preference optimization for video diffusion models, 2026. arXiv preprint arXiv:2601.04068

  19. [19]

    Aligning Text-to-Image Models using Human Feedback

    Kimin Lee, Hao Liu, Moonkyung Ryu, Olivia Watkins, Yuqing Du, Craig Boutilier, Pieter Abbeel, Mohammad Ghavamzadeh, and Shixiang Shane Gu. Aligning text-to-image models using human feedback.arXiv preprint arXiv:2302.12192, 2023

  20. [20]

    MixGRPO: Unlocking Flow-based GRPO Efficiency with Mixed ODE-SDE

    Junzhe Li, Yutao Cui, Tao Huang, Yinping Ma, Chun Fan, Miles Yang, Zhao Zhong, and Liefeng Bo. MixGRPO: Unlocking flow-based GRPO efficiency with mixed ODE-SDE, 2025. arXiv preprint arXiv:2507.21802

  21. [21]

    Embodied agent interface: Benchmarking llms for embodied decision making.Advances in Neural Information Processing Systems, 37: 100428–100534, 2024

    Manling Li, Shiyu Zhao, Qineng Wang, Kangrui Wang, Yu Zhou, Sanjana Srivastava, Cem Gokmen, Tony Lee, Li E Li, Ruohan Zhang, et al. Embodied agent interface: Benchmarking llms for embodied decision making.Advances in Neural Information Processing Systems, 37: 100428–100534, 2024

  22. [22]

    Flow-GRPO: Training Flow Matching Models via Online RL

    Jie Liu, Gongye Liu, Jiajun Liang, Yangguang Li, Jiaheng Liu, Xintao Wang, Pengfei Wan, Di Zhang, and Wanli Ouyang. Flow-GRPO: Training flow matching models via online RL. arXiv preprint arXiv:2505.05470

  24. [24]

    Improving Video Generation with Human Feedback

    Jie Liu, Gongye Liu, Jiajun Liang, Ziyang Yuan, Xiaokun Liu, Mingwu Zheng, Xiele Wu, Qiulin Wang, Menghan Xia, Xintao Wang, et al. Improving video generation with human feedback, 2025. arXiv preprint arXiv:2501.13918

  25. [25]

    Dit4dit: Jointly modeling video dynamics and actions for generalizable robot control.arXiv preprint arXiv:2603.10448, 2026

    Teli Ma, Jia Zheng, Zifan Wang, Chunli Jiang, Andy Cui, Junwei Liang, and Shuo Yang. Dit4dit: Jointly modeling video dynamics and actions for generalizable robot control.arXiv preprint arXiv:2603.10448, 2026

  26. [26]

    HPSv3: Towards Wide-Spectrum Human Preference Score

    Yuhang Ma, Xiaoshi Wu, Keqiang Sun, and Hongsheng Li. HPSv3: Towards wide-spectrum human preference score. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 15086–15095, 2025

  27. [27]

    Seeing What Matters: Visual Preference Policy Optimization for Visual Generation

    Ziqi Ni et al. Seeing what matters: Visual preference policy optimization for visual generation. arXiv preprint arXiv:2511.18719

  29. [29]

    The intrinsic dimension of images and its impact on learning

    Phillip Pope, Chen Zhu, Ahmed Abdelkader, Micah Goldblum, and Tom Goldstein. The intrinsic dimension of images and its impact on learning. In International Conference on Learning Representations (ICLR), 2021. arXiv preprint arXiv:2104.08894

  30. [30]

    Learning transferable visual models from natural language supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR, 2021

  31. [31]

    High-Resolution Image Synthesis with Latent Diffusion Models

    Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022. arXiv preprint arXiv:2112.10752

  32. [32]

    AnyPos: Automated Task-Agnostic Actions for Bimanual Manipulation

    Hengkai Tan, Yao Feng, Xinyi Mao, Shuhe Huang, Guodong Liu, Zhongkai Hao, Hang Su, and Jun Zhu. AnyPos: Automated task-agnostic actions for bimanual manipulation, 2025. arXiv preprint arXiv:2507.12768

  33. [33]

    Gemini: A Family of Highly Capable Multimodal Models

    Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, Katie Millican, et al. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023

  34. [34]

    TAGRPO: Boosting GRPO on Image-to-Video Generation with Direct Trajectory Alignment

    Jin Wang, Jianxiang Lu, Guangzheng Xu, Comi Chen, Haoyu Yang, Linqing Wang, Peng Chen, Mingtao Chen, Zhichao Hu, Longhuang Wu, et al. TAGRPO: Boosting GRPO on image-to-video generation with direct trajectory alignment, 2026. arXiv preprint arXiv:2601.05729

  35. [35]

    GRPO-Guard: Mitigating implicit over-optimization in flow matching via regulated clipping, 2025

    Jing Wang, Jiajun Liang, Jie Liu, Henglin Liu, Gongye Liu, Jun Zheng, Wanyuan Pang, Ao Ma, Zhenyu Xie, Xintao Wang, Meng Wang, Pengfei Wan, and Xiaodan Liang. GRPO-Guard: Mitigating implicit over-optimization in flow matching via regulated clipping, 2025. arXiv preprint arXiv:2510.22319

  36. [36]

    EV A: Aligning video world models with executable robot actions via inverse dynamics rewards, 2026

    Ruixiang Wang, Qingming Liu, Yueci Deng, Guiliang Liu, Zhen Liu, and Kui Jia. EV A: Aligning video world models with executable robot actions via inverse dynamics rewards, 2026. arXiv preprint arXiv:2603.17808

  37. [37]

    WorldCompass: Reinforcement learning for long-horizon world models, 2026

    Zehan Wang, Tengfei Wang, Haiyu Zhang, Xuhui Zuo, Junta Wu, Haoyuan Wang, Wenqiang Sun, Zhenwei Wang, Chenjie Cao, Hengshuang Zhao, et al. WorldCompass: Reinforcement learning for long-horizon world models, 2026. arXiv preprint arXiv:2602.09022

  38. [38]

    DenseDPO: Fine-grained temporal preference optimization for video diffusion models, 2025

    Ziyi Wu, Anil Kag, Ivan Skorokhodov, Willi Menapace, Ashkan Mirzaei, Igor Gilitschenski, Sergey Tulyakov, and Aliaksandr Siarohin. DenseDPO: Fine-grained temporal preference optimization for video diffusion models, 2025. arXiv preprint arXiv:2506.03517

  39. [39]

    DanceGRPO: Unleashing GRPO on Visual Generation

    Zeyue Xue, Jie Wu, Yu Gao, Fangyuan Kong, Lingting Zhu, Mengzhao Chen, Zhiheng Liu, Wei Liu, Qiushan Guo, Weilin Huang, and Ping Luo. DanceGRPO: Unleashing GRPO on visual generation, 2025. arXiv preprint arXiv:2505.07818

  40. [40]

    World Action Models are Zero-shot Policies

    Seonghyeon Ye, Yunhao Ge, Kaiyuan Zheng, Shenyuan Gao, Sihyun Yu, George Kurian, Suneel Indupuru, You Liang Tan, Chuning Zhu, Jiannan Xiang, et al. World action models are zero-shot policies.arXiv preprint arXiv:2602.15922, 2026

  41. [41]

    SENTINEL: A multi-level formal framework for safety evaluation of foundation model-based embodied agents.arXiv preprint arXiv:2510.12985, 2025

    Simon Sinong Zhan, Yao Liu, Philip Wang, Zinan Wang, Qineng Wang, Yiyan Peng, Zhian Ruan, Xiangyu Shi, Xinyu Cao, Frank Yang, Kangrui Wang, Huajie Shao, Manling Li, and Qi Zhu. SENTINEL: A multi-level formal framework for safety evaluation of foundation model-based embodied agents.arXiv preprint arXiv:2510.12985, 2025

  42. [42]

    EmboAlign: Aligning Video Generation with Compositional Constraints for Zero-Shot Manipulation

    Gehao Zhang, Zhenyang Ni, Payal Mohapatra, Han Liu, Ruohan Zhang, and Qi Zhu. EmboAlign: Aligning video generation with compositional constraints for zero-shot manipulation. arXiv preprint arXiv:2603.05757, 2026

  43. [43]

    DiffusionNFT: Online Diffusion Reinforcement with Forward Process

    Kaiwen Zheng, Huayu Chen, Haotian Ye, Haoxiang Wang, Qinsheng Zhang, Kai Jiang, Hang Su, Stefano Ermon, Jun Zhu, and Ming-Yu Liu. DiffusionNFT: Online diffusion reinforcement with forward process, 2025. arXiv preprint arXiv:2509.16117

  44. [44]

    Manifold-aware exploration for reinforcement learning in video generation, 2026

    Mingzhe Zheng, Weijie Kong, Yue Wu, Dengyang Jiang, Yue Ma, Xuanhua He, Bin Lin, Kaixiong Gong, Zhao Zhong, Liefeng Bo, et al. Manifold-aware exploration for reinforcement learning in video generation, 2026. arXiv preprint arXiv:2603.21872

  45. [45]

    Code-as-Monitor: Constraint-Aware Visual Programming for Reactive and Proactive Robotic Failure Detection

    Enshen Zhou, Qi Su, Cheng Chi, Zhizheng Zhang, Zhongyuan Wang, Tiejun Huang, Lu Sheng, and He Wang. Code-as-Monitor: Constraint-aware visual programming for reactive and proactive robotic failure detection. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 6919–6929, 2025