CreFlow: Corrective Reflow for Sparse-Reward Embodied Video Diffusion RL
Pith reviewed 2026-05-15 02:39 UTC · model grok-4.3
The pith
CreFlow uses automatically generated Linear Temporal Logic rewards plus corrective reflow to align video diffusion rollouts with embodied task rules, lifting downstream success by 23.8 points.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that an LTL-based compositional reward model paired with the CreFlow framework—built around a credit-aware NFT loss that limits updates to reward-relevant regions and a corrective reflow loss that uses within-group positive samples to estimate the correction direction—stabilizes reinforcement learning updates for high-dimensional video diffusion models, produces rewards better aligned with human and simulator success labels, and raises downstream execution success by 23.8 percentage points across eight bimanual manipulation tasks.
What carries the argument
The CreFlow online RL framework, which pairs automatically formulated Linear Temporal Logic rewards with a credit-aware NFT loss to confine updates and a corrective reflow loss that treats positive samples as an explicit correction estimate.
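The update rules are not spelled out in this summary, but both mechanisms are concrete enough to sketch. Below is a minimal PyTorch-style sketch, assuming a rectified-flow velocity parameterization, a per-region credit mask derived from the LTL reward's localized error signal, and a positive sample assigned to each rollout within its group; the function names, tensor shapes, and correction-direction estimate are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def credit_aware_nft_loss(pred_vel, target_vel, credit_mask, advantage):
    """Credit-aware masking: confine the flow-matching update to
    reward-relevant regions of the video, leaving unrelated regions
    unperturbed during post-training.

    pred_vel, target_vel: (B, T, C, H, W) predicted / target velocities
    credit_mask:          (B, T, 1, H, W) in [0, 1], 1 = reward-relevant
    advantage:            (B,) group-normalized reward advantage
    """
    per_elem = (pred_vel - target_vel).pow(2)                  # pointwise error
    masked = per_elem * credit_mask                            # zero out unrelated regions
    denom = credit_mask.expand_as(per_elem).flatten(1).sum(1).clamp(min=1.0)
    per_sample = masked.flatten(1).sum(1) / denom              # mean over credited region
    return (advantage * per_sample).mean()                     # advantage-weighted update

def corrective_reflow_loss(pred_vel, x_t, x_pos, t):
    """Corrective reflow: a within-group positive sample x_pos serves as
    an explicit estimate of the correction direction. Under the
    rectified-flow convention x_t = t * x_1 + (1 - t) * x_0, the
    straight-line direction from x_t toward a clean sample x_pos is
    (x_pos - x_t) / (1 - t); the model velocity is regressed onto it.

    x_pos: (B, T, C, H, W) positive sample assigned to each rollout,
           e.g. the group's highest-reward sample broadcast over the batch
    t:     (B,) flow time in [0, 1)
    """
    denom = (1.0 - t).clamp(min=1e-3).view(-1, 1, 1, 1, 1)
    correction = (x_pos - x_t) / denom                         # estimated correction direction
    return F.mse_loss(pred_vel, correction)
```

How positives are selected within a group is a design choice left open here; the sketch only fixes the direction estimate that the reflow loss regresses onto.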
If this is right
- Reward judgments align more closely with human and simulator success labels than prior visual-metric approaches.
- Downstream robot execution success rises by 23.8 percentage points on eight bimanual manipulation tasks.
- Generated videos receive localized error signals that identify exactly where task specifications are violated.
- High-dimensional diffusion updates remain stable because the credit-aware loss avoids perturbing unrelated video regions.
- The method operates effectively under sparse rewards typical of embodied video generation.
Where Pith is reading between the lines
- Automating reward creation via LTL could lower the engineering cost of adapting video models to new robot tasks.
- The corrective reflow technique might accelerate RL training for other high-dimensional generative models outside video diffusion.
- Strong results on bimanual tasks suggest the same logic rewards could scale to longer-horizon or multi-agent planning problems.
- If the LTL formulation generalizes, similar constraint-based rewards could improve video generation in non-robotics domains such as procedural animation.
Load-bearing premise
The automatically formulated Linear Temporal Logic constraints give faithful, localized rewards without needing significant manual engineering or domain-specific tuning.
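To make this premise concrete, here is a minimal Python sketch of a compositional LTL reward evaluated over a per-frame trace of boolean atomic propositions. The operators follow standard LTL semantics on finite traces; the proposition names and the example task formula are illustrative assumptions, and the paper's automation pipeline for producing formulas is not reproduced here.

```python
from typing import Callable, List, Sequence

Trace = Sequence[dict]                    # one {proposition: bool} dict per frame
Formula = Callable[[Trace, int], bool]    # satisfaction of a formula at frame i

def atom(name: str) -> Formula:
    return lambda tr, i: tr[i][name]

def always(f: Formula) -> Formula:        # G f: f holds at every frame from i on
    return lambda tr, i: all(f(tr, j) for j in range(i, len(tr)))

def eventually(f: Formula) -> Formula:    # F f: f holds at some frame >= i
    return lambda tr, i: any(f(tr, j) for j in range(i, len(tr)))

def until(f: Formula, g: Formula) -> Formula:  # f U g: f holds until g becomes true
    return lambda tr, i: any(
        g(tr, k) and all(f(tr, j) for j in range(i, k)) for k in range(i, len(tr))
    )

def compositional_reward(constraints: List[Formula], trace: Trace) -> float:
    # Fraction of satisfied constraints; each violated conjunct also
    # localizes which sub-specification failed, and at which frames.
    return sum(c(trace, 0) for c in constraints) / len(constraints)

# Illustrative task: keep the cup upright until it rests on the plate,
# eventually open the gripper, and never collide.
constraints = [
    until(atom("cup_upright"), atom("cup_on_plate")),
    eventually(atom("gripper_open")),
    always(lambda tr, i: not tr[i]["collision"]),
]
```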
What would settle it
Retraining on the same eight bimanual tasks with CreFlow and finding no gain in simulator-verified success rates or human-alignment scores over baseline video-RL methods would falsify the improvement claim.
Original abstract
Video generation models (VGMs) trained on heterogeneous data with likelihood-surrogate objectives can produce visually plausible rollouts that violate physical constraints in embodied manipulation. Although reinforcement-learning post-training offers a natural route to adapting VGMs, existing video-RL rewards often reduce each rollout to a low-level visual metric, whereas manipulation video evaluation requires logic-based verification of whether the rollout satisfies a compositional task specification. To fill this gap, we introduce a compositional constraint-based reward model for post-training embodied video generation models, which automatically formulates task requirements as a composition of Linear Temporal Logic constraints, providing faithful rewards and localized error information in generated videos. To achieve effective improvement in high-dimensional video generation using these reward signals, we further propose CreFlow, a novel online RL framework with two key designs: i) a credit-aware NFT loss that confines the RL update to reward-relevant regions, preventing perturbations to unrelated regions during post-training; and ii) a corrective reflow loss that leverages within-group positive samples as an explicit estimate of the correction direction, stabilizing and accelerating training. Experiments show that CreFlow yields reward judgments better aligned with human and simulator success labels than existing methods and improves downstream execution success by 23.8 percentage points across eight bimanual manipulation tasks.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that CreFlow, an online RL post-training framework for video diffusion models in embodied bimanual manipulation, uses an automatically composed LTL-based reward model to deliver faithful, localized rewards and introduces a credit-aware NFT loss plus a corrective reflow loss to stabilize high-dimensional updates; experiments on eight tasks report superior reward alignment with human/simulator labels and a 23.8 pp gain in downstream execution success.
Significance. If the results hold, the work offers a concrete route to logic-based rewards for sparse-reward video RL, potentially improving physical consistency of generated manipulation videos without hand-crafted dense rewards; the combination of LTL compositionality with targeted diffusion losses could influence post-training of generative models for robotics.
major comments (2)
- [§3] §3 (LTL Reward Formulation): the central claim that LTL constraints are 'automatically' composed to yield faithful, localized rewards without significant manual engineering is load-bearing for both the reward model and the reported 23.8 pp gain, yet the manuscript provides insufficient detail on how atomic propositions (object states, contact events, spatial relations) and their thresholds are chosen across the eight tasks; if domain-specific tuning is required, the alignment advantage and downstream improvement cannot be attributed solely to the corrective reflow mechanism.
- [§5] §5 (Experiments, Table 2 or equivalent): the headline 23.8 pp success improvement and superior reward alignment are presented as aggregate outcomes, but the ablation isolating the corrective reflow loss from the credit-aware NFT loss and the LTL reward itself is not reported in sufficient granularity; without these controls it remains unclear whether the gains are robust or partly driven by the reward formulation.
minor comments (2)
- [§4] Notation for the two new losses (NFT and corrective reflow) should be introduced with explicit equations early in §4 so that the credit-aware masking and within-group positive-sample direction can be traced directly to the update rule.
- [Abstract / §5] The abstract and §5 should explicitly name the strongest baseline (e.g., standard PPO or prior video-RL method) against which the 23.8 pp figure is measured.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address the two major comments below and will revise the manuscript to provide the requested details and controls.
Point-by-point responses
Referee: [§3] §3 (LTL Reward Formulation): the central claim that LTL constraints are 'automatically' composed to yield faithful, localized rewards without significant manual engineering is load-bearing for both the reward model and the reported 23.8 pp gain, yet the manuscript provides insufficient detail on how atomic propositions (object states, contact events, spatial relations) and their thresholds are chosen across the eight tasks; if domain-specific tuning is required, the alignment advantage and downstream improvement cannot be attributed solely to the corrective reflow mechanism.
Authors: We agree that the current manuscript lacks sufficient granularity on proposition definition. In the revision we will add a dedicated subsection (and supplementary table) that enumerates the atomic propositions used for all eight tasks, specifies the exact simulator-derived predicates (e.g., contact distance < 0.05 m, gripper-object alignment within 0.02 rad), and clarifies that thresholds are taken from fixed physical constants rather than per-task tuning. Once the proposition vocabulary is defined once, the LTL formulas are generated automatically from a high-level task template; no additional manual engineering is performed per task. This will make explicit that the reported gains are not driven by hidden reward engineering. revision: yes
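A minimal sketch of what such fixed-constant predicates could look like, using the two thresholds quoted in the response above; the simulator state fields and predicate names are illustrative assumptions.

```python
import numpy as np

# Fixed physical constants shared across all eight tasks (no per-task tuning),
# matching the thresholds quoted in the response above.
CONTACT_DIST_M = 0.05
ALIGN_TOL_RAD = 0.02

def in_contact(gripper_pos: np.ndarray, object_pos: np.ndarray) -> bool:
    """Contact predicate: gripper within 5 cm of the object."""
    return float(np.linalg.norm(gripper_pos - object_pos)) < CONTACT_DIST_M

def aligned(gripper_yaw: float, object_yaw: float) -> bool:
    """Alignment predicate: angular error within 0.02 rad, wrapped to [-pi, pi]."""
    err = (gripper_yaw - object_yaw + np.pi) % (2 * np.pi) - np.pi
    return abs(err) < ALIGN_TOL_RAD
```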
Referee: [§5] §5 (Experiments, Table 2 or equivalent): the headline 23.8 pp success improvement and superior reward alignment are presented as aggregate outcomes, but the ablation isolating the corrective reflow loss from the credit-aware NFT loss and the LTL reward itself is not reported in sufficient granularity; without these controls it remains unclear whether the gains are robust or partly driven by the reward formulation.
Authors: We acknowledge the need for finer-grained controls. The revised manuscript will include an expanded ablation table that reports four variants: (1) LTL reward + credit-aware NFT only, (2) LTL reward + corrective reflow only, (3) full CreFlow, and (4) baseline without either loss. All variants will be evaluated on the same eight tasks with identical training budgets, allowing direct attribution of the 23.8 pp gain to the corrective reflow component. We will also report per-task success rates to demonstrate robustness. revision: yes
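For clarity, the four promised variants can be read as an ablation grid over the two losses; the flag names in this sketch are illustrative, not the authors' configuration schema.

```python
# Illustrative ablation grid; all variants share the LTL reward, the same
# eight tasks, and identical training budgets, so per-task success deltas
# are attributable to the toggled components.
ABLATIONS = [
    {"name": "credit-aware NFT only",   "credit_aware_nft": True,  "corrective_reflow": False},
    {"name": "corrective reflow only",  "credit_aware_nft": False, "corrective_reflow": True},
    {"name": "full CreFlow",            "credit_aware_nft": True,  "corrective_reflow": True},
    {"name": "baseline (neither loss)", "credit_aware_nft": False, "corrective_reflow": False},
]
```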
Circularity Check
No circularity detected in the derivation chain
full rationale
The paper's core contributions—the automatic LTL constraint formulation for rewards and the two new losses (credit-aware NFT and corrective reflow)—are introduced as independent mechanisms derived from task specifications and RL stabilization needs. No equations or steps reduce the reported 23.8 pp success gain or reward alignment metrics to a fitted parameter defined by the evaluation data itself. The LTL rewards are composed from atomic propositions tied to simulator states, and downstream execution success is measured separately, keeping the chain self-contained without self-definitional loops, fitted-input predictions, or load-bearing self-citations that collapse the claims.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Linear Temporal Logic can automatically formulate faithful compositional constraints for embodied manipulation tasks