Recognition: 1 theorem link · Lean theorem
F2F-AP: Flow-to-Future Asynchronous Policy for Real-time Dynamic Manipulation
Pith reviewed 2026-05-13 21:32 UTC · model grok-4.3
The pith
Predicted object flow lets an asynchronous robotic policy synthesize future observations and compensate for action latency in dynamic tasks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By predicting object flow to synthesize future observations and aligning their visual features with ground-truth future states through a flow-based contrastive objective, the asynchronous policy acquires the ability to plan and move proactively, thereby offsetting inherent latency and succeeding at manipulation of actively moving objects.
What carries the argument
Flow-to-future synthesis: predicted object flow generates anticipated visual observations that are aligned to real future frames by contrastive learning, supplying the policy with forward context for latency-compensating actions.
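A minimal sketch of what such a flow-based contrastive objective could look like, assuming a PyTorch-style setup. The encoder, the bilinear flow-warping step, and the InfoNCE form with a 0.1 temperature are illustrative assumptions, not the paper's exact formulation.

```python
# Hypothetical sketch of a flow-based contrastive alignment objective.
# The encoder, the flow-warping step, and the InfoNCE form are assumptions.
import torch
import torch.nn.functional as F

def warp_with_flow(obs, flow):
    """Synthesize an anticipated future frame by warping the current observation
    with predicted object flow.
    obs:  (B, C, H, W) current images
    flow: (B, 2, H, W) predicted per-pixel displacement in pixels (x, y)
    """
    _, _, h, w = obs.shape
    ys, xs = torch.meshgrid(
        torch.arange(h, device=obs.device, dtype=obs.dtype),
        torch.arange(w, device=obs.device, dtype=obs.dtype),
        indexing="ij",
    )
    # Shift each pixel by its predicted flow, then normalize to [-1, 1] for grid_sample.
    new_x = 2.0 * (xs + flow[:, 0]) / (w - 1) - 1.0
    new_y = 2.0 * (ys + flow[:, 1]) / (h - 1) - 1.0
    grid = torch.stack((new_x, new_y), dim=-1)          # (B, H, W, 2)
    return F.grid_sample(obs, grid, align_corners=True)

def flow_contrastive_loss(encoder, obs_t, flow_pred, obs_future, temperature=0.1):
    """InfoNCE-style loss: features of flow-synthesized observations are pulled toward
    features of the matching ground-truth future frames and pushed away from the
    other samples in the batch."""
    synth_future = warp_with_flow(obs_t, flow_pred)
    z_synth = F.normalize(encoder(synth_future), dim=-1)   # (B, D)
    z_real = F.normalize(encoder(obs_future), dim=-1)      # (B, D)
    logits = z_synth @ z_real.t() / temperature            # (B, B) similarity matrix
    targets = torch.arange(z_synth.size(0), device=logits.device)
    return F.cross_entropy(logits, targets)
```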
If this is right
- The policy issues actions that already anticipate where the object will be after the control delay (see the control-loop sketch after this list).
- Success rates rise on tasks that require continuous tracking of moving targets.
- The same framework can be applied to any asynchronous controller that receives delayed visual feedback.
- Training requires only the contrastive alignment plus the usual task reward, without extra real-time simulation.
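A minimal sketch of the asynchronous control loop this implies. The interfaces (`camera`, `predict_flow`, `warp_with_flow`, `policy`, `robot`) and the fixed 150 ms latency budget are placeholder assumptions, not names from the paper.

```python
# Hypothetical asynchronous control loop with flow-based latency compensation.
# All interfaces and the latency constant are assumptions for illustration.
import time

LATENCY_S = 0.15  # assumed end-to-end inference plus transport delay, in seconds

def control_step(camera, predict_flow, warp_with_flow, policy, robot):
    obs = camera.read()                      # frame captured at time t
    # Predict object flow over the latency horizon, then synthesize roughly the
    # frame the world will show when the action actually lands (t + LATENCY_S).
    flow = predict_flow(obs, horizon=LATENCY_S)
    anticipated_obs = warp_with_flow(obs, flow)
    action_chunk = policy(anticipated_obs)   # slow inference runs on the synthesized frame
    robot.execute_async(action_chunk)        # hand off so the next step can start immediately

def run(camera, predict_flow, warp_with_flow, policy, robot, hz=10.0):
    period = 1.0 / hz
    while True:
        start = time.monotonic()
        control_step(camera, predict_flow, warp_with_flow, policy, robot)
        time.sleep(max(0.0, period - (time.monotonic() - start)))
```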
Where Pith is reading between the lines
- The method may transfer to domains such as autonomous navigation where scene prediction can offset actuator lag.
- If the contrastive alignment generalizes across lighting and viewpoint changes, the approach could reduce reliance on high-frequency sensing.
- Combining the flow predictor with learned dynamics models might allow longer-horizon proactive plans without increasing latency.
Load-bearing premise
The flow predictor must generate future images whose visual features line up closely enough with actual future images that the contrastive objective produces useful planning signals.
What would settle it
Run the same dynamic grasping trials with and without the flow-to-future module; if success rate and responsiveness do not rise measurably when the module is added, the central claim is falsified.
Original abstract
Asynchronous inference has emerged as a prevalent paradigm in robotic manipulation, achieving significant progress in ensuring trajectory smoothness and efficiency. However, a systemic challenge remains unresolved, as inherent latency causes generated actions to inevitably lag behind the real-time environment. This issue is particularly exacerbated in dynamic scenarios, where such temporal misalignment severely compromises the policy's ability to interpret and react to rapidly evolving surroundings. In this paper, we propose a novel framework that leverages predicted object flow to synthesize future observations, incorporating a flow-based contrastive learning objective to align the visual feature representations of predicted observations with ground-truth future states. Empowered by this anticipated visual context, our asynchronous policy gains the capacity for proactive planning and motion, enabling it to explicitly compensate for latency and robustly execute manipulation tasks involving actively moving objects. Experimental results demonstrate that our approach significantly enhances responsiveness and success rates in complex dynamic manipulation tasks.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes F2F-AP, a framework for real-time dynamic robotic manipulation that predicts object flow to synthesize future observations and applies a flow-based contrastive learning objective to align visual feature representations of these synthesized observations with ground-truth future states. This anticipated context enables an asynchronous policy to perform proactive planning and explicitly compensate for inference latency when interacting with actively moving objects, with experiments claiming improved responsiveness and success rates in complex tasks.
Significance. If the central mechanism holds, the work addresses a practical bottleneck in asynchronous robotic policies by turning latency from a liability into an opportunity for anticipation. The combination of flow-based future synthesis with contrastive feature alignment is a targeted contribution to dynamic manipulation, potentially enabling more robust performance on moving targets without requiring faster hardware. Strengths include the explicit focus on latency compensation and the use of an existing flow estimator rather than end-to-end prediction.
major comments (2)
- [§3 (method) and abstract] The central claim (abstract and §3) that contrastive alignment of flow-synthesized observations supplies 'sufficiently accurate anticipated visual context' for proactive compensation rests on an unverified assumption: that embedding similarity under the contrastive loss implies the geometric and spatial fidelity (e.g., object centroids, contact surfaces) needed for correct anticipatory actions. Contrastive objectives do not constrain pixel-level or 3D accuracy; if flow prediction errors accumulate, the policy may receive misaligned features even when the loss is minimized. No ablation or quantitative metric (flow endpoint error, future-state reconstruction error, or policy sensitivity to flow noise) is reported to test this.
- [§4 (experiments)] The experimental validation (presumably §4) reports improved success rates but provides no error bars, statistical significance tests, or comparison against a strong baseline that uses the same asynchronous policy without the flow-contrastive module. Without these, it is impossible to determine whether gains are attributable to the proposed alignment or to other factors such as training data or architecture changes.
minor comments (2)
- [§3] Notation for the contrastive loss and flow prediction modules should be introduced with explicit equations rather than prose descriptions to allow reproducibility.
- [§4] Figure captions and axis labels in the experimental results should include units and clarify whether 'success rate' is per-episode or per-timestep.
Simulated Author's Rebuttal
We thank the referee for the constructive comments, which help clarify the assumptions underlying our central claims and the need for stronger experimental validation. We address each major comment below and will revise the manuscript accordingly to incorporate additional analyses and comparisons.
Point-by-point responses
-
Referee: [§3 (method) and abstract] The central claim (abstract and §3) that contrastive alignment of flow-synthesized observations supplies 'sufficiently accurate anticipated visual context' for proactive compensation rests on an unverified assumption: that embedding similarity under the contrastive loss implies the geometric and spatial fidelity (e.g., object centroids, contact surfaces) needed for correct anticipatory actions. Contrastive objectives do not constrain pixel-level or 3D accuracy; if flow prediction errors accumulate, the policy may receive misaligned features even when the loss is minimized. No ablation or quantitative metric (flow endpoint error, future-state reconstruction error, or policy sensitivity to flow noise) is reported to test this.
Authors: We acknowledge that contrastive feature alignment primarily operates at the embedding level and does not inherently enforce pixel-level geometric fidelity. The flow estimator is used as an off-the-shelf module, and the contrastive objective is intended to provide high-level anticipated context rather than precise reconstruction. To address this, the revised manuscript will include quantitative metrics such as flow endpoint error on synthesized observations, future-state reconstruction error, and an ablation evaluating policy sensitivity to injected flow noise. These additions will directly test whether the aligned features remain effective for anticipatory actions under realistic prediction errors. revision: yes
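A minimal sketch, under assumed tensor shapes, of the three diagnostics the response names: flow endpoint error, future-frame reconstruction error, and policy sensitivity to injected flow noise. The additive Gaussian noise model and the `policy`/`warp_with_flow` callables are illustrative assumptions, not the authors' protocol.

```python
# Hypothetical sketch of the promised diagnostics; shapes and noise model are assumed.
import torch

def endpoint_error(flow_pred, flow_gt):
    """Average endpoint error (EPE) between predicted and ground-truth flow.
    Both tensors: (B, 2, H, W), displacements in pixels."""
    return torch.linalg.norm(flow_pred - flow_gt, dim=1).mean()

def reconstruction_error(synth_future, real_future):
    """Mean absolute pixel error between flow-synthesized and real future frames."""
    return (synth_future - real_future).abs().mean()

def flow_noise_sensitivity(policy, obs, flow_pred, warp_with_flow, sigmas=(0.0, 1.0, 2.0, 4.0)):
    """Measure how far the policy's output drifts as Gaussian noise (in pixels) is
    injected into the predicted flow; large drift suggests brittle anticipation."""
    base_action = policy(warp_with_flow(obs, flow_pred))
    drifts = {}
    for sigma in sigmas:
        noisy_flow = flow_pred + sigma * torch.randn_like(flow_pred)
        noisy_action = policy(warp_with_flow(obs, noisy_flow))
        drifts[sigma] = (noisy_action - base_action).norm().item()
    return drifts
```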
-
Referee: [§4 (experiments)] The experimental validation (presumably §4) reports improved success rates but provides no error bars, statistical significance tests, or comparison against a strong baseline that uses the same asynchronous policy without the flow-contrastive module. Without these, it is impossible to determine whether gains are attributable to the proposed alignment or to other factors such as training data or architecture changes.
Authors: We agree that the current results lack statistical rigor and an isolated baseline comparison. The revised version will report success rates with error bars (standard deviation across multiple random seeds), include statistical significance tests (e.g., paired t-tests between conditions), and add a direct ablation using the identical asynchronous policy architecture and training data but without the flow-contrastive module. This will isolate the contribution of the proposed alignment from other factors. revision: yes
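A minimal sketch of the promised statistical comparison, assuming per-seed success rates for runs with and without the flow-contrastive module; the seed count and the illustrative numbers in the commented-out call are made up purely to show the shape of the comparison.

```python
# Hypothetical sketch: mean +/- std success rates per condition and a paired t-test
# across seeds. Data layout and seed count are assumptions.
import numpy as np
from scipy import stats

def compare_conditions(success_with, success_without):
    """success_with / success_without: arrays of per-seed success rates collected
    under identical policies, training data, and evaluation episodes."""
    with_mean, with_std = np.mean(success_with), np.std(success_with, ddof=1)
    wo_mean, wo_std = np.mean(success_without), np.std(success_without, ddof=1)
    t_stat, p_value = stats.ttest_rel(success_with, success_without)  # paired across seeds
    print(f"with module:    {with_mean:.3f} +/- {with_std:.3f}")
    print(f"without module: {wo_mean:.3f} +/- {wo_std:.3f}")
    print(f"paired t-test:  t = {t_stat:.2f}, p = {p_value:.4f}")
    return t_stat, p_value

# Example call with made-up numbers, purely to show the expected input shape:
# compare_conditions(np.array([0.82, 0.79, 0.85, 0.81, 0.84]),
#                    np.array([0.61, 0.66, 0.58, 0.63, 0.60]))
```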
Circularity Check
No significant circularity; the derivation relies on external flow estimation and contrastive alignment without self-referential reduction.
Full rationale
The paper's core proposal uses predicted object flow to synthesize future observations, then applies a flow-based contrastive objective to align visual features of those synthesized observations with ground-truth future states. This chain does not reduce by construction to a fitted parameter or self-defined quantity; the contrastive loss operates on independently estimated flow and external ground-truth frames. No equations are presented that force the proactive compensation claim to equal its inputs, and the method description invokes standard external flow estimation rather than a self-citation load-bearing uniqueness theorem. The derivation therefore remains self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel (tagged: unclear)
unclear: relation between the paper passage and the cited Recognition theorem.
Paper passage: "flow-based contrastive learning objective to align the visual feature representations of predicted observations with ground-truth future states"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.