pith. machine review for the scientific record.

arxiv: 2605.14274 · v1 · submitted 2026-05-14 · 💻 cs.CV

Recognition: no theorem link

CreFlow: Corrective Reflow for Sparse-Reward Embodied Video Diffusion RL

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 02:39 UTC · model grok-4.3

classification 💻 cs.CV
keywords video diffusion · reinforcement learning · Linear Temporal Logic · embodied manipulation · bimanual tasks · reward modeling · corrective reflow · sparse rewards

The pith

CreFlow uses automatically generated Linear Temporal Logic rewards plus corrective reflow to align video diffusion rollouts with embodied task rules, lifting downstream success by 23.8 points.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Video diffusion models trained on mixed data often produce sequences that look plausible yet break physical rules during robot manipulation. The paper shows how to correct this by converting task requirements into compositions of Linear Temporal Logic constraints that supply precise rewards and flag specific errors in each frame. These signals feed into CreFlow, an online reinforcement learning setup that restricts updates to reward-relevant video regions via a credit-aware loss and steers corrections using positive samples through a reflow term. On eight bimanual tasks the resulting videos match human and simulator judgments more closely and raise real execution success when the generated plans are handed to robots. The result matters because it replaces low-level visual metrics with logic-based verification that works in sparse-reward settings.
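The recipe described above — composing per-frame logical predicates into temporal constraints that yield a sparse reward plus a localized error signal — can be made concrete with a toy sketch. This is an illustration of the general idea, not the paper's implementation; the predicate names and the example task are invented:

```python
# Illustrative sketch: a composition of LTL constraints evaluated over
# per-frame boolean predicates, returning a sparse binary reward and,
# on failure, the violated constraint as a localized error signal.

def ltl_eventually(trace):
    """F p: p holds in at least one frame."""
    return any(trace)

def ltl_always(trace):
    """G p: p holds in every frame."""
    return all(trace)

def ltl_until(p_trace, q_trace):
    """p U q: p holds in every frame strictly before the first frame where q holds."""
    for t, q in enumerate(q_trace):
        if q:
            return all(p_trace[:t])
    return False  # q never holds

def compositional_reward(pred):
    """Conjoin constraints; on failure, name the violated constraint."""
    constraints = [
        ("eventually grasped", ltl_eventually(pred["grasped"])),
        ("always collision-free", ltl_always(pred["no_collision"])),
        ("collision-free until placed", ltl_until(pred["no_collision"], pred["placed"])),
    ]
    for name, satisfied in constraints:
        if not satisfied:
            return 0.0, name
    return 1.0, None

# Toy 5-frame rollout: grasp at frame 1, place at frame 4, no collisions.
pred = {
    "grasped":      [False, True, True, True, True],
    "no_collision": [True, True, True, True, True],
    "placed":       [False, False, False, False, True],
}
reward, violated = compositional_reward(pred)  # → (1.0, None)
```

A violated constraint names exactly which rule failed, which is the kind of localized feedback the masked update below exploits.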

Core claim

The central claim is that an LTL-based compositional reward model paired with the CreFlow framework—built around a credit-aware NFT loss that limits updates to reward-relevant regions and a corrective reflow loss that uses within-group positive samples to estimate the correction direction—stabilizes reinforcement learning updates for high-dimensional video diffusion models, produces rewards better aligned with human and simulator success labels, and raises downstream execution success by 23.8 percentage points across eight bimanual manipulation tasks.

What carries the argument

The CreFlow online RL framework, which pairs automatically formulated Linear Temporal Logic rewards with a credit-aware NFT loss to confine updates and a corrective reflow loss that treats positive samples as an explicit correction estimate.
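Read literally, the two ingredients can be sketched in a few lines. This is a hedged toy illustration of the described mechanics — a mask that confines the update to reward-relevant regions, and within-group positive samples that supply an explicit correction direction — not the paper's actual losses; the array shapes, blending rule, and step size are assumptions:

```python
# Hedged sketch of the CreFlow ingredients as described in the text.
import numpy as np

def creflow_step(rollouts, rewards, mask, step=0.5):
    """rollouts: (G, T, H, W) group of video latents; rewards: (G,) binary;
    mask: (T, H, W) with 1 on task-relevant regions. Returns corrected latents."""
    positives = rollouts[rewards > 0]
    if positives.shape[0] == 0:
        return rollouts  # sparse reward: no positive sample to reflow toward
    # within-group positive mean as an explicit estimate of the correction direction
    direction = positives.mean(axis=0) - rollouts
    # credit-aware masking: the mask's complement stays anchored to the prior
    return rollouts + step * mask * direction

rng = np.random.default_rng(0)
G, T, H, W = 4, 3, 2, 2
rollouts = rng.normal(size=(G, T, H, W))
rewards = np.array([1.0, 0.0, 1.0, 0.0])
mask = np.zeros((T, H, W))
mask[:, 0, :] = 1.0          # only the top row is task-relevant
corrected = creflow_step(rollouts, rewards, mask)
```

Inside the mask every rollout moves toward the positive-sample mean; outside it nothing changes, which is the property the paper credits with keeping high-dimensional updates stable.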

If this is right

  • Reward judgments align more closely with human and simulator success labels than prior visual-metric approaches.
  • Downstream robot execution success rises by 23.8 percentage points on eight bimanual manipulation tasks.
  • Generated videos receive localized error signals that identify exactly where task specifications are violated.
  • High-dimensional diffusion updates remain stable because the credit-aware loss avoids perturbing unrelated video regions.
  • The method operates effectively under sparse rewards typical of embodied video generation.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Automating reward creation via LTL could lower the engineering cost of adapting video models to new robot tasks.
  • The corrective reflow technique might accelerate RL training for other high-dimensional generative models outside video diffusion.
  • Strong results on bimanual tasks suggest the same logic rewards could scale to longer-horizon or multi-agent planning problems.
  • If the LTL formulation generalizes, similar constraint-based rewards could improve video generation in non-robotics domains such as procedural animation.

Load-bearing premise

The automatically formulated Linear Temporal Logic constraints give faithful, localized rewards without needing significant manual engineering or domain-specific tuning.

What would settle it

Retraining on the same eight bimanual tasks with CreFlow and finding no gain in simulator-verified success rates or human-alignment scores over baseline video-RL methods would falsify the improvement claim.

Figures

Figures reproduced from arXiv: 2605.14274 by Minshuo Chen, Philip Torr, Qi Zhu, Ruochen Jiao, Simon Sinong Zhan, Sipeng Chen, Yijiang Li, Zhaoran Wang, Zhenfei Yin, Zhenyang Ni.

Figure 1. Generated manipulation videos can be nearly indistinguishable under existing reward…
Figure 2. Overview of CreFlow. CreFlow turns sparse compositional reward feedback into localized supervision for video diffusion post-training. Instead of applying reward-induced updates uniformly over the full rollout, it identifies the spatio-temporal mask responsible for task success or failure and restricts optimization to these regions. The resulting objective combines localized negative-aware finetuning with a…
Figure 3. Training convergence on three RoboTwin tasks under the same binary compositional reward.
Figure 4. Qualitative ablation on a put_bottles_dustbin rollout. (a) Vanilla DiffusionNFT, with no mask, visibly degrades the dustbin region during RL training. (b) Our group-shared mask M covers the task-relevant entities (arms, bottle); its complement is anchored to the pretrained prior. (c) CreFlow confines the contrastive update to M, preserving off-task visual quality.
Original abstract

Video generation models trained on heterogeneous data with likelihood-surrogate objectives can produce visually plausible rollouts that violate physical constraints in embodied manipulation. Although reinforcement-learning post-training offers a natural route to adapting VGMs, existing video-RL rewards often reduce each rollout to a low-level visual metric, whereas manipulation video evaluation requires logic-based verification of whether the rollout satisfies a compositional task specification. To fill this gap, we introduce a compositional constraint-based reward model for post-training embodied video generation models, which automatically formulates task requirements as a composition of Linear Temporal Logic constraints, providing faithful rewards and localized error information in generated videos. To achieve effective improvement in high-dimensional video generation using these reward signals, we further propose CreFlow, a novel online RL framework with two key designs: i) a credit-aware NFT loss that confines the RL update to reward-relevant regions, preventing perturbations to unrelated regions during post-training; and ii) a corrective reflow loss that leverages within-group positive samples as an explicit estimate of the correction direction, stabilizing and accelerating training. Experiments show that CreFlow yields reward judgments better aligned with human and simulator success labels than existing methods and improves downstream execution success by 23.8 percentage points across eight bimanual manipulation tasks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that CreFlow, an online RL post-training framework for video diffusion models in embodied bimanual manipulation, uses an automatically composed LTL-based reward model to deliver faithful, localized rewards and introduces a credit-aware NFT loss plus a corrective reflow loss to stabilize high-dimensional updates; experiments on eight tasks report superior reward alignment with human/simulator labels and a 23.8 pp gain in downstream execution success.

Significance. If the results hold, the work offers a concrete route to logic-based rewards for sparse-reward video RL, potentially improving physical consistency of generated manipulation videos without hand-crafted dense rewards; the combination of LTL compositionality with targeted diffusion losses could influence post-training of generative models for robotics.

major comments (2)
  1. [§3] §3 (LTL Reward Formulation): the central claim that LTL constraints are 'automatically' composed to yield faithful, localized rewards without significant manual engineering is load-bearing for both the reward model and the reported 23.8 pp gain, yet the manuscript provides insufficient detail on how atomic propositions (object states, contact events, spatial relations) and their thresholds are chosen across the eight tasks; if domain-specific tuning is required, the alignment advantage and downstream improvement cannot be attributed solely to the corrective reflow mechanism.
  2. [§5] §5 (Experiments, Table 2 or equivalent): the headline 23.8 pp success improvement and superior reward alignment are presented as aggregate outcomes, but the ablation isolating the corrective reflow loss from the credit-aware NFT loss and the LTL reward itself is not reported in sufficient granularity; without these controls it remains unclear whether the gains are robust or partly driven by the reward formulation.
minor comments (2)
  1. [§4] Notation for the two new losses (NFT and corrective reflow) should be introduced with explicit equations early in §4 so that the credit-aware masking and within-group positive-sample direction can be traced directly to the update rule.
  2. [Abstract / §5] The abstract and §5 should explicitly name the strongest baseline (e.g., standard PPO or prior video-RL method) against which the 23.8 pp figure is measured.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address the two major comments below and will revise the manuscript to provide the requested details and controls.

Point-by-point responses
  1. Referee: [§3] §3 (LTL Reward Formulation): the central claim that LTL constraints are 'automatically' composed to yield faithful, localized rewards without significant manual engineering is load-bearing for both the reward model and the reported 23.8 pp gain, yet the manuscript provides insufficient detail on how atomic propositions (object states, contact events, spatial relations) and their thresholds are chosen across the eight tasks; if domain-specific tuning is required, the alignment advantage and downstream improvement cannot be attributed solely to the corrective reflow mechanism.

    Authors: We agree that the current manuscript lacks sufficient granularity on proposition definition. In the revision we will add a dedicated subsection (and supplementary table) that enumerates the atomic propositions used for all eight tasks, specifies the exact simulator-derived predicates (e.g., contact distance < 0.05 m, gripper-object alignment within 0.02 rad), and clarifies that thresholds are taken from fixed physical constants rather than per-task tuning. Once the proposition vocabulary is defined once, the LTL formulas are generated automatically from a high-level task template; no additional manual engineering is performed per task. This will make explicit that the reported gains are not driven by hidden reward engineering. revision: yes

  2. Referee: [§5] §5 (Experiments, Table 2 or equivalent): the headline 23.8 pp success improvement and superior reward alignment are presented as aggregate outcomes, but the ablation isolating the corrective reflow loss from the credit-aware NFT loss and the LTL reward itself is not reported in sufficient granularity; without these controls it remains unclear whether the gains are robust or partly driven by the reward formulation.

    Authors: We acknowledge the need for finer-grained controls. The revised manuscript will include an expanded ablation table that reports four variants: (1) LTL reward + credit-aware NFT only, (2) LTL reward + corrective reflow only, (3) full CreFlow, and (4) baseline without either loss. All variants will be evaluated on the same eight tasks with identical training budgets, allowing direct attribution of the 23.8 pp gain to the corrective reflow component. We will also report per-task success rates to demonstrate robustness. revision: yes
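The fixed-threshold predicates described in the first response can be sketched as follows; the field names are hypothetical, and only the 0.05 m and 0.02 rad thresholds come from the response above:

```python
# Hypothetical illustration of fixed-threshold atomic propositions:
# simulator state is mapped to the boolean vocabulary the LTL layer consumes.
CONTACT_DIST_M = 0.05   # fixed physical constant, not per-task tuned
ALIGNMENT_RAD = 0.02

def atomic_propositions(state):
    """One simulator frame -> boolean proposition vocabulary."""
    return {
        "in_contact": state["gripper_object_dist"] < CONTACT_DIST_M,
        "aligned": abs(state["gripper_object_angle"]) < ALIGNMENT_RAD,
    }

frame = {"gripper_object_dist": 0.03, "gripper_object_angle": -0.01}
props = atomic_propositions(frame)  # → {"in_contact": True, "aligned": True}
```

Under the rebuttal's claim, only this vocabulary is defined once; the per-task LTL formulas over it are generated automatically.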

Circularity Check

0 steps flagged

No circularity detected in the derivation chain

Full rationale

The paper's core contributions—the automatic LTL constraint formulation for rewards and the two new losses (credit-aware NFT and corrective reflow)—are introduced as independent mechanisms derived from task specifications and RL stabilization needs. No equations or steps reduce the reported 23.8 pp success gain or reward alignment metrics to a fitted parameter defined by the evaluation data itself. The LTL rewards are composed from atomic propositions tied to simulator states, and downstream execution success is measured separately, keeping the chain self-contained without self-definitional loops, fitted-input predictions, or load-bearing self-citations that collapse the claims.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that LTL constraints can be automatically composed to capture manipulation task requirements and that the proposed losses produce stable updates in video diffusion space; no free parameters or invented entities are explicitly introduced in the abstract.

axioms (1)
  • domain assumption: Linear Temporal Logic can automatically formulate faithful compositional constraints for embodied manipulation tasks.
    Invoked to generate the reward model that supplies localized error signals.

pith-pipeline@v0.9.0 · 5550 in / 1239 out tokens · 26000 ms · 2026-05-15T02:39:41.086820+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

45 extracted references · 45 canonical work pages · 19 internal anchors

  1. [1]

    Cosmos World Foundation Model Platform for Physical AI

    Niket Agarwal, Arslan Ali, Maciej Bala, Yogesh Balaji, et al. Cosmos world foundation model platform for physical AI.arXiv preprint arXiv:2501.03575, 2025

  2. [2]

    Gen2Act: Human Video Generation in Novel Scenarios Enables Generalizable Robot Manipulation

    Homanga Bharadhwaj, Debidatta Dwibedi, Abhinav Gupta, Shubham Tulsiani, Carl Doersch, Ted Xiao, Dhruv Shah, Fei Xia, Dorsa Sadigh, and Sean Kirmani. Gen2Act: Human video generation in novel scenarios enables generalizable robot manipulation. arXiv preprint arXiv:2409.16283, 2024

  3. [3]

    Motus: A Unified Latent Action World Model

    Hongzhe Bi, Hengkai Tan, Shenghao Xie, Zeyuan Wang, Shuhe Huang, Haitian Liu, Ruowen Zhao, Yao Feng, Chendong Xiang, Yinze Rong, et al. Motus: A unified latent action world model.arXiv preprint arXiv:2512.13030, 2025

  4. [4]

    $\pi_{0.5}$: a Vision-Language-Action Model with Open-World Generalization

    Kevin Black, Anthony Brohan, Danny Driess, Adnan Esmaeili, Chelsea Finn, Niccolo Fuentes, Brian Groom, Karol Hausman, Brian Ichter, Szymon Jakubczak, et al. π0.5: a vision-language-action model with open-world generalization, 2025. arXiv preprint arXiv:2504.16054

  5. [5]

    SAM 3: Segment Anything with Concepts

    Nicolas Carion, Laura Gustafson, Yuan-Ting Hu, Shoubhik Debnath, Ronghang Hu, Didac Suris, Chaitanya Ryali, et al. SAM 3: Segment anything with concepts, 2025. arXiv preprint arXiv:2511.16719. Meta AI

  6. [6]

    Large Video Planner Enables Generalizable Robot Control

    Boyuan Chen, Tianyuan Zhang, Haoran Geng, Kiwhan Song, Caiyi Zhang, Peihao Li, William T. Freeman, Jitendra Malik, Pieter Abbeel, Russ Tedrake, Vincent Sitzmann, and Yilun Du. Large video planner enables generalizable robot control.arXiv preprint arXiv:2512.15840, 2025

  7. [7]

    RoboTwin 2.0: A Scalable Data Generator and Benchmark with Strong Domain Randomization for Robust Bimanual Robotic Manipulation

    Tianxing Chen, Zanxin Chen, Baijun Chen, Zijian Cai, Yibin Liu, Zixuan Li, Qiwei Liang, Xianliang Lin, Yiheng Ge, Zhenyu Gu, et al. RoboTwin 2.0: A scalable data generator and benchmark with strong domain randomization for robust bimanual robotic manipulation.arXiv preprint arXiv:2506.18088, 2025

  8. [8]

    ABot-PhysWorld: Interactive world foundation model for robotic manipulation with physics alignment.arXiv preprint arXiv:2603.23376, 2026

    Yuzhi Chen, Ronghan Chen, Dongjie Huo, Yandan Yang, Dekang Qi, Haoyun Liu, Tong Lin, Shuang Zeng, Junjin Xiao, Xinyuan Chang, et al. ABot-PhysWorld: Interactive world foundation model for robotic manipulation with physics alignment.arXiv preprint arXiv:2603.23376, 2026

  9. [9]

    Safebimanual: Diffusion-Based Trajectory Optimization for Safe Bimanual Manipulation

    Haoyuan Deng, Wenkai Guo, Qianzhun Wang, Zhenyu Wu, and Ziwei Wang. Safebimanual: Diffusion-based trajectory optimization for safe bimanual manipulation. arXiv preprint arXiv:2508.18268, 2025

  10. [10]

    Dream2Flow: Bridging video generation and open-world manipulation with 3D object flow.arXiv preprint arXiv:2512.24766, 2025

    Karthik Dharmarajan, Wenlong Huang, Jiajun Wu, Li Fei-Fei, and Ruohan Zhang. Dream2Flow: Bridging video generation and open-world manipulation with 3D object flow.arXiv preprint arXiv:2512.24766, 2025

  11. [11]

    VideoGPA: Distilling Geometry Priors for 3D-Consistent Video Generation

    Hongyang Du, Junjie Ye, Xiaoyan Cong, Runhao Li, Jingcheng Ni, Aman Agarwal, Zeqi Zhou, Zekun Li, Randall Balestriero, and Yue Wang. VideoGPA: Distilling geometry priors for 3D-consistent video generation.arXiv preprint arXiv:2601.23286, 2026

  12. [12]

    Learning Universal Policies via Text-Guided Video Generation

    Yilun Du, Mengjiao Yang, Bo Dai, Hanjun Dai, Ofir Nachum, Joshua B. Tenenbaum, Dale Schuurmans, and Pieter Abbeel. Learning universal policies via text-guided video generation. In Advances in Neural Information Processing Systems (NeurIPS), 2023

  13. [13]

    Video Language Planning

    Yilun Du, Mengjiao Yang, Pete Florence, Fei Xia, Ayzaan Wahid, Brian Ichter, Pierre Sermanet, Tianhe Yu, Pieter Abbeel, Joshua B. Tenenbaum, Leslie Kaelbling, Andy Zeng, and Jonathan Tompson. Video language planning, 2023. arXiv preprint arXiv:2310.10625

  14. [14]

    Vidar: Embodied video diffusion model for generalist manipulation.arXiv preprint arXiv:2507.12898, 2025

    Yao Feng, Hengkai Tan, Xinyi Mao, Chendong Xiang, Guodong Liu, Shuhe Huang, Hang Su, and Jun Zhu. Vidar: Embodied video diffusion model for generalist manipulation.arXiv preprint arXiv:2507.12898, 2025

  15. [15]

    VAMPO: Policy Optimization for Improving Visual Dynamics in Video Action Models

    Zirui Ge, Pengxiang Ding, Baohua Yin, Qishen Wang, Zhiyong Xie, Yemin Wang, Jinbo Wang, Hengtao Li, Runze Suo, Wenxuan Song, et al. VAMPO: Policy optimization for improving visual dynamics in video action models, 2026. arXiv preprint arXiv:2603.19370

  16. [16]

    Video Prediction Policy: A Generalist Robot Policy with Predictive Visual Representations

    Yucheng Hu, Yanjiang Guo, Pengchao Wang, Xiaoyu Chen, Yen-Jen Wang, Jianke Zhang, Koushil Sreenath, Chaochao Lu, and Jianyu Chen. Video prediction policy: A generalist robot policy with predictive visual representations, 2024. arXiv preprint arXiv:2412.14803

  17. [17]

    ReKep: Spatio-Temporal Reasoning of Relational Keypoint Constraints for Robotic Manipulation

    Wenlong Huang, Chen Wang, Yunzhu Li, Ruohan Zhang, and Li Fei-Fei. ReKep: Spatio-temporal reasoning of relational keypoint constraints for robotic manipulation. arXiv preprint arXiv:2409.01652, 2024

  18. [18]

    Mind the Generative Details: Direct Localized Detail Preference Optimization for Video Diffusion Models

    Zitong Huang, Kaidong Zhang, Yukang Ding, Chao Gao, Rui Ding, Ying Chen, and Wangmeng Zuo. Mind the generative details: Direct localized detail preference optimization for video diffusion models, 2026. arXiv preprint arXiv:2601.04068

  19. [19]

    Aligning Text-to-Image Models using Human Feedback

    Kimin Lee, Hao Liu, Moonkyung Ryu, Olivia Watkins, Yuqing Du, Craig Boutilier, Pieter Abbeel, Mohammad Ghavamzadeh, and Shixiang Shane Gu. Aligning text-to-image models using human feedback.arXiv preprint arXiv:2302.12192, 2023

  20. [20]

    MixGRPO: Unlocking Flow-based GRPO Efficiency with Mixed ODE-SDE

    Junzhe Li, Yutao Cui, Tao Huang, Yinping Ma, Chun Fan, Miles Yang, Zhao Zhong, and Liefeng Bo. MixGRPO: Unlocking flow-based GRPO efficiency with mixed ODE-SDE, 2025. arXiv preprint arXiv:2507.21802

  21. [21]

    Embodied agent interface: Benchmarking llms for embodied decision making.Advances in Neural Information Processing Systems, 37: 100428–100534, 2024

    Manling Li, Shiyu Zhao, Qineng Wang, Kangrui Wang, Yu Zhou, Sanjana Srivastava, Cem Gokmen, Tony Lee, Li E Li, Ruohan Zhang, et al. Embodied agent interface: Benchmarking llms for embodied decision making.Advances in Neural Information Processing Systems, 37: 100428–100534, 2024

  22. [22]

    Flow-GRPO: Training Flow Matching Models via Online RL

    Jie Liu, Gongye Liu, Jiajun Liang, Yangguang Li, Jiaheng Liu, Xintao Wang, Pengfei Wan, Di Zhang, and Wanli Ouyang. Flow-GRPO: Training flow matching models via online RL. arXiv preprint arXiv:2505.05470

  24. [24]

    Improving Video Generation with Human Feedback

    Jie Liu, Gongye Liu, Jiajun Liang, Ziyang Yuan, Xiaokun Liu, Mingwu Zheng, Xiele Wu, Qiulin Wang, Menghan Xia, Xintao Wang, et al. Improving video generation with human feedback, 2025. arXiv preprint arXiv:2501.13918

  25. [25]

    Dit4dit: Jointly modeling video dynamics and actions for generalizable robot control.arXiv preprint arXiv:2603.10448, 2026

    Teli Ma, Jia Zheng, Zifan Wang, Chunli Jiang, Andy Cui, Junwei Liang, and Shuo Yang. Dit4dit: Jointly modeling video dynamics and actions for generalizable robot control.arXiv preprint arXiv:2603.10448, 2026

  26. [26]

    HPSv3: Towards Wide-Spectrum Human Preference Score

    Yuhang Ma, Xiaoshi Wu, Keqiang Sun, and Hongsheng Li. HPSv3: Towards wide-spectrum human preference score. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 15086–15095, 2025

  27. [27]

    Seeing What Matters: Visual Preference Policy Optimization for Visual Generation

    Ziqi Ni et al. Seeing what matters: Visual preference policy optimization for visual generation. arXiv preprint arXiv:2511.18719

  29. [29]

    The intrinsic dimension of images and its impact on learning

    Phillip Pope, Chen Zhu, Ahmed Abdelkader, Micah Goldblum, and Tom Goldstein. The intrinsic dimension of images and its impact on learning. In International Conference on Learning Representations (ICLR), 2021. arXiv preprint arXiv:2104.08894

  30. [30]

    Learning transferable visual models from natural language supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR, 2021

  31. [31]

    High-Resolution Image Synthesis with Latent Diffusion Models

    Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022. arXiv preprint arXiv:2112.10752

  32. [32]

    AnyPos: Automated Task-Agnostic Actions for Bimanual Manipulation

    Hengkai Tan, Yao Feng, Xinyi Mao, Shuhe Huang, Guodong Liu, Zhongkai Hao, Hang Su, and Jun Zhu. AnyPos: Automated task-agnostic actions for bimanual manipulation, 2025. arXiv preprint arXiv:2507.12768

  33. [33]

    Gemini: A Family of Highly Capable Multimodal Models

    Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, Katie Millican, et al. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023

  34. [34]

    TAGRPO: Boosting GRPO on Image-to-Video Generation with Direct Trajectory Alignment

    Jin Wang, Jianxiang Lu, Guangzheng Xu, Comi Chen, Haoyu Yang, Linqing Wang, Peng Chen, Mingtao Chen, Zhichao Hu, Longhuang Wu, et al. TAGRPO: Boosting GRPO on image-to-video generation with direct trajectory alignment, 2026. arXiv preprint arXiv:2601.05729

  35. [35]

    GRPO-Guard: Mitigating implicit over-optimization in flow matching via regulated clipping, 2025

    Jing Wang, Jiajun Liang, Jie Liu, Henglin Liu, Gongye Liu, Jun Zheng, Wanyuan Pang, Ao Ma, Zhenyu Xie, Xintao Wang, Meng Wang, Pengfei Wan, and Xiaodan Liang. GRPO-Guard: Mitigating implicit over-optimization in flow matching via regulated clipping, 2025. arXiv preprint arXiv:2510.22319

  36. [36]

    EV A: Aligning video world models with executable robot actions via inverse dynamics rewards, 2026

    Ruixiang Wang, Qingming Liu, Yueci Deng, Guiliang Liu, Zhen Liu, and Kui Jia. EV A: Aligning video world models with executable robot actions via inverse dynamics rewards, 2026. arXiv preprint arXiv:2603.17808

  37. [37]

    WorldCompass: Reinforcement learning for long-horizon world models, 2026

    Zehan Wang, Tengfei Wang, Haiyu Zhang, Xuhui Zuo, Junta Wu, Haoyuan Wang, Wenqiang Sun, Zhenwei Wang, Chenjie Cao, Hengshuang Zhao, et al. WorldCompass: Reinforcement learning for long-horizon world models, 2026. arXiv preprint arXiv:2602.09022

  38. [38]

    DenseDPO: Fine-grained temporal preference optimization for video diffusion models, 2025

    Ziyi Wu, Anil Kag, Ivan Skorokhodov, Willi Menapace, Ashkan Mirzaei, Igor Gilitschenski, Sergey Tulyakov, and Aliaksandr Siarohin. DenseDPO: Fine-grained temporal preference optimization for video diffusion models, 2025. arXiv preprint arXiv:2506.03517

  39. [39]

    DanceGRPO: Unleashing GRPO on Visual Generation

    Zeyue Xue, Jie Wu, Yu Gao, Fangyuan Kong, Lingting Zhu, Mengzhao Chen, Zhiheng Liu, Wei Liu, Qiushan Guo, Weilin Huang, and Ping Luo. DanceGRPO: Unleashing GRPO on visual generation, 2025. arXiv preprint arXiv:2505.07818

  40. [40]

    World Action Models are Zero-shot Policies

    Seonghyeon Ye, Yunhao Ge, Kaiyuan Zheng, Shenyuan Gao, Sihyun Yu, George Kurian, Suneel Indupuru, You Liang Tan, Chuning Zhu, Jiannan Xiang, et al. World action models are zero-shot policies.arXiv preprint arXiv:2602.15922, 2026

  41. [41]

    SENTINEL: A multi-level formal framework for safety evaluation of foundation model-based embodied agents.arXiv preprint arXiv:2510.12985, 2025

    Simon Sinong Zhan, Yao Liu, Philip Wang, Zinan Wang, Qineng Wang, Yiyan Peng, Zhian Ruan, Xiangyu Shi, Xinyu Cao, Frank Yang, Kangrui Wang, Huajie Shao, Manling Li, and Qi Zhu. SENTINEL: A multi-level formal framework for safety evaluation of foundation model-based embodied agents.arXiv preprint arXiv:2510.12985, 2025

  42. [42]

    EmboAlign: Aligning Video Generation with Compositional Constraints for Zero-Shot Manipulation

    Gehao Zhang, Zhenyang Ni, Payal Mohapatra, Han Liu, Ruohan Zhang, and Qi Zhu. EmboAlign: Aligning video generation with compositional constraints for zero-shot manipulation. arXiv preprint arXiv:2603.05757, 2026

  43. [43]

    DiffusionNFT: Online Diffusion Reinforcement with Forward Process

    Kaiwen Zheng, Huayu Chen, Haotian Ye, Haoxiang Wang, Qinsheng Zhang, Kai Jiang, Hang Su, Stefano Ermon, Jun Zhu, and Ming-Yu Liu. DiffusionNFT: Online diffusion reinforcement with forward process, 2025. arXiv preprint arXiv:2509.16117

  44. [44]

    Manifold-aware exploration for reinforcement learning in video generation, 2026

    Mingzhe Zheng, Weijie Kong, Yue Wu, Dengyang Jiang, Yue Ma, Xuanhua He, Bin Lin, Kaixiong Gong, Zhao Zhong, Liefeng Bo, et al. Manifold-aware exploration for reinforcement learning in video generation, 2026. arXiv preprint arXiv:2603.21872

  45. [45]

    Code-as-Monitor: Constraint-Aware Visual Programming for Reactive and Proactive Robotic Failure Detection

    Enshen Zhou, Qi Su, Cheng Chi, Zhizheng Zhang, Zhongyuan Wang, Tiejun Huang, Lu Sheng, and He Wang. Code-as-Monitor: Constraint-aware visual programming for reactive and proactive robotic failure detection. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 6919–6929, 2025