WARPED: Wrist-Aligned Rendering for Robot Policy Learning from Egocentric Human Demonstrations
Pith reviewed 2026-05-10 15:09 UTC · model grok-4.3
The pith
WARPED synthesizes realistic wrist-view robot observations from monocular egocentric human videos to train policies that match teleoperated performance with 5-8 times less data collection time.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We present WARPED, a framework that synthesizes realistic wrist-view observations from human demonstration videos collected with an egocentric RGB camera. The system leverages vision foundation models to initialize the interactive scene, employs a hand-object interaction pipeline to track the hand and manipulated object and retarget the trajectories to a robotic end-effector, and synthesizes photo-realistic wrist-view observations via Gaussian Splatting to directly train a robotic policy. We demonstrate that WARPED achieves success rates comparable to policies trained on teleoperated demonstration data for five tabletop manipulation tasks, while requiring 5-8x less data collection time.
What carries the argument
The WARPED synthesis pipeline, which initializes scenes from vision models, tracks and retargets human hand-object trajectories to robot end-effectors, then renders wrist views via Gaussian Splatting.
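To make the hand-off between these stages concrete, here is a minimal Python sketch of the data flow. Every name and signature is hypothetical (the paper exposes no API); the stages are passed in as callables so the skeleton stays self-contained and only the sequencing is asserted.

```python
from typing import Any, Callable, List, Sequence, Tuple

def warped_synthesize(
    video: Any,                                           # egocentric RGB frames
    init_scene: Callable[[Any], Any],                     # 1. foundation models -> interactive scene
    track_hand_object: Callable[[Any, Any], Tuple[Sequence, Sequence]],  # 2a. hand/object tracking
    retarget: Callable[[Sequence, Sequence], List[Any]],  # 2b. -> end-effector SE(3) trajectory
    fit_splats: Callable[[Any, Any], Any],                # 3a. Gaussian Splatting reconstruction
    render_wrist: Callable[[Any, Any], Any],              # 3b. splats + wrist-camera pose -> image
) -> List[Tuple[Any, Any]]:
    """Hypothetical sketch of the WARPED stages; each stage is an injected callable."""
    scene = init_scene(video)                             # initialize the interactive scene
    hand_traj, object_traj = track_hand_object(video, scene)
    ee_traj = retarget(hand_traj, object_traj)            # human hand -> robot end-effector
    splats = fit_splats(scene, video)                     # photo-realistic scene representation
    observations = [render_wrist(splats, pose) for pose in ee_traj]  # synthesized wrist views
    return list(zip(observations, ee_traj))               # (observation, action) training pairs
```

The only point of the sketch is that the output has the same (wrist image, end-effector action) shape as teleoperated wrist-camera data, which is why the downstream policy-training step can stay unchanged.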
If this is right
- Policies can be trained directly from monocular egocentric RGB videos without multiview camera rigs or depth sensors.
- Demonstration collection time drops by a factor of 5-8 compared with teleoperation while preserving success rates on tabletop manipulation.
- The same human video can supply training data for wrist-view robot policies that transfer to real execution.
- Training bypasses viewpoint mismatch issues because the rendered observations align with the robot's wrist camera.
Where Pith is reading between the lines
- The approach could extend to longer-horizon or contact-rich tasks if tracking and rendering remain reliable under greater motion complexity.
- Consumer phones or head-mounted cameras might replace specialized robot data rigs, lowering barriers for collecting diverse demonstrations.
- Combining the synthesis step with existing imitation learning algorithms could further reduce the total number of human trials needed.
Load-bearing premise
The rendered wrist views and retargeted trajectories must be accurate enough that policies trained on the synthetic data execute successfully on the real robot without major losses from artifacts, tracking mistakes, or viewpoint shifts.
What would settle it
If a policy trained solely on WARPED-synthesized data achieves substantially lower success rates on the physical robot than an otherwise identical policy trained on teleoperated demonstrations for the same five tasks, the central claim would not hold.
Original abstract
Recent advancements in learning from human demonstration have shown promising results in addressing the scalability and high cost of data collection required to train robust visuomotor policies. However, existing approaches are often constrained by a reliance on multiview camera setups, depth sensors, or custom hardware and are typically limited to policy execution from third-person or egocentric cameras. In this paper, we present WARPED, a framework designed to synthesize realistic wrist-view observations from human demonstration videos to facilitate the training of visuomotor policies using only monocular RGB data. With data collected from an egocentric RGB camera, our system leverages vision foundation models to initialize the interactive scene. A hand-object interaction pipeline is then employed to track the hand and manipulated object and retarget the trajectories to a robotic end-effector. Lastly, photo-realistic wrist-view observations are synthesized via Gaussian Splatting to directly train a robotic policy. We demonstrate that WARPED achieves success rates comparable to policies trained on teleoperated demonstration data for five tabletop manipulation tasks, while requiring 5-8x less data collection time.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents WARPED, a pipeline that converts monocular egocentric RGB human demonstration videos into wrist-view training data for visuomotor policies. It uses vision foundation models for scene initialization, tracks and retargets hand-object interactions to a robot end-effector, and renders photo-realistic wrist observations via Gaussian Splatting. The central empirical claim is that policies trained on this synthesized data achieve success rates comparable to those trained on teleoperated demonstrations across five tabletop manipulation tasks, while requiring 5-8x less data collection time.
Significance. If the synthesized wrist views and retargeted trajectories prove distributionally close to real robot camera streams, the method could meaningfully reduce the cost and hardware requirements of demonstration collection for robot policy learning. The approach of leveraging only egocentric RGB plus off-the-shelf foundation models and 3DGS for wrist-aligned rendering offers a practical route to scaling data without multiview rigs or teleoperation hardware.
major comments (3)
- [§4] §4 (Experiments): The headline claim of 'comparable success rates' on five tasks is stated without reporting the number of evaluation trials per task, standard deviations or confidence intervals, statistical tests against the teleoperation baseline, or breakdown of failure modes. This absence makes it impossible to evaluate whether the performance is statistically equivalent or merely directionally similar.
- [§3.3] §3.3 (Gaussian Splatting rendering) and §3.2 (trajectory retargeting): No quantitative metrics are supplied for rendering fidelity (e.g., PSNR/SSIM on held-out wrist views) or retargeting accuracy (e.g., SE(3) trajectory error). Without these, it is unclear whether residual floaters, depth-ordering errors, or viewpoint mismatches remain that policies could exploit in simulation but that would degrade real-robot execution.
- [§4.1] §4.1 (Data collection and baselines): The 5-8x reduction in data collection time is asserted but not accompanied by a breakdown of time spent on human recording versus any post-processing or optimization steps required for the WARPED pipeline, nor by a direct comparison of total human effort including setup of the egocentric capture rig.
minor comments (2)
- [Abstract] The abstract lists 'five tabletop manipulation tasks' but does not name them; adding the task names would improve readability.
- [Figure 3] Figure 3 (qualitative results) would benefit from explicit labels indicating which images are real robot wrist views versus synthesized views.
Simulated Author's Rebuttal
We thank the referee for their insightful comments on our work. We address each of the major comments below and outline the changes we will make to the manuscript to address the concerns raised.
Point-by-point responses
Referee: [§4] §4 (Experiments): The headline claim of 'comparable success rates' on five tasks is stated without reporting the number of evaluation trials per task, standard deviations or confidence intervals, statistical tests against the teleoperation baseline, or breakdown of failure modes. This absence makes it impossible to evaluate whether the performance is statistically equivalent or merely directionally similar.
Authors: We agree that providing more rigorous statistical analysis and details on the evaluation protocol would improve the clarity and credibility of our results. In the revised version of the manuscript, we will include the number of evaluation trials conducted per task (typically 20-30 trials), report means with standard deviations, and include statistical comparisons (e.g., paired t-tests or Wilcoxon tests) against the teleoperation baseline. Additionally, we will add a section detailing common failure modes observed in both WARPED-trained and teleop-trained policies to allow for a more nuanced comparison. revision: yes
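As a sketch of the evaluation the response commits to, the comparison can be done in a few lines with SciPy; the trial count and all success counts below are placeholders, not results from the paper.

```python
import numpy as np
from scipy import stats

trials = 25  # hypothetical number of evaluation trials per task
warped_successes = np.array([21, 19, 23, 18, 20])   # five tasks, placeholder counts
teleop_successes = np.array([22, 20, 22, 19, 21])   # placeholder counts

warped_rate = warped_successes / trials
teleop_rate = teleop_successes / trials
print(f"WARPED mean success {warped_rate.mean():.2f} (sd {warped_rate.std(ddof=1):.2f})")
print(f"teleop mean success {teleop_rate.mean():.2f} (sd {teleop_rate.std(ddof=1):.2f})")

# Paired comparison across the five tasks; Wilcoxon avoids a normality assumption at n = 5.
stat, p = stats.wilcoxon(warped_rate, teleop_rate)
print(f"Wilcoxon signed-rank: statistic={stat:.1f}, p={p:.3f}")

# Per-task exact binomial confidence intervals are another cheap addition.
for i, k in enumerate(warped_successes):
    ci = stats.binomtest(int(k), trials).proportion_ci(confidence_level=0.95)
    print(f"task {i + 1}: {k}/{trials} successes, 95% CI [{ci.low:.2f}, {ci.high:.2f}]")
```

With only five tasks the paired test has limited power, so the per-task confidence intervals carry at least as much weight as the omnibus p-value.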
Referee: [§3.3] §3.3 (Gaussian Splatting rendering) and §3.2 (trajectory retargeting): No quantitative metrics are supplied for rendering fidelity (e.g., PSNR/SSIM on held-out wrist views) or retargeting accuracy (e.g., SE(3) trajectory error). Without these, it is unclear whether residual floaters, depth-ordering errors, or viewpoint mismatches remain that policies could exploit in simulation but that would degrade real-robot execution.
Authors: We recognize the importance of intermediate quantitative evaluations for the rendering and retargeting components. While the end-to-end policy success rate on physical robots serves as our primary validation metric, we will incorporate quantitative assessments in the revised manuscript. Specifically, we will report PSNR and SSIM values for the Gaussian Splatting renders on a set of held-out wrist-view images, and provide SE(3) pose error metrics for the retargeted trajectories compared to ground-truth robot executions where available. This will help demonstrate the fidelity of the synthesized data. revision: yes
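For orientation, both metric families named above are available off the shelf; the helper functions and array conventions below are illustrative assumptions, not the authors' evaluation code.

```python
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def rendering_fidelity(rendered: np.ndarray, real: np.ndarray):
    """PSNR/SSIM between a synthesized wrist view and a held-out real wrist image (HxWx3 uint8)."""
    psnr = peak_signal_noise_ratio(real, rendered, data_range=255)
    ssim = structural_similarity(real, rendered, channel_axis=-1, data_range=255)
    return psnr, ssim

def se3_trajectory_error(T_est: np.ndarray, T_ref: np.ndarray):
    """Mean translation (m) and rotation (rad) error between pose sequences of shape (N, 4, 4)."""
    trans_err = np.linalg.norm(T_est[:, :3, 3] - T_ref[:, :3, 3], axis=1)
    # Relative rotation R_est @ R_ref^T, then the geodesic angle recovered from its trace.
    R_rel = np.einsum("nij,nkj->nik", T_est[:, :3, :3], T_ref[:, :3, :3])
    cos_angle = np.clip((np.trace(R_rel, axis1=1, axis2=2) - 1.0) / 2.0, -1.0, 1.0)
    return trans_err.mean(), np.arccos(cos_angle).mean()
```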
Referee: [§4.1] §4.1 (Data collection and baselines): The 5-8x reduction in data collection time is asserted but not accompanied by a breakdown of time spent on human recording versus any post-processing or optimization steps required for the WARPED pipeline, nor by a direct comparison of total human effort including setup of the egocentric capture rig.
Authors: The reported 5-8x reduction specifically measures the active human demonstration collection time: egocentric video recording versus the setup and execution time for teleoperation. Post-processing steps in WARPED, such as tracking, retargeting, and Gaussian Splatting optimization, are fully automated and do not require additional human time beyond initial setup. We will revise §4.1 to include a detailed time breakdown table, specifying human recording time, rig setup time for the egocentric camera (which is a simple head-mounted setup), and noting that compute time is separate from human effort. This will clarify the total human effort comparison. revision: yes
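The promised breakdown is simple bookkeeping once active human time and automated compute time are separated; every duration below is an invented placeholder used only to show the arithmetic, not a measurement from the paper.

```python
# Hypothetical per-demonstration timings (minutes); none of these numbers come from the paper.
demos = 50
ego_setup, ego_per_demo = 2.0, 0.5          # head-mounted RGB camera: quick setup, short clips
teleop_setup, teleop_per_demo = 15.0, 3.0   # teleoperation rig: longer setup and execution

human_time_warped = ego_setup + demos * ego_per_demo      # tracking/retargeting/3DGS run unattended
human_time_teleop = teleop_setup + demos * teleop_per_demo
print(f"active human time: WARPED {human_time_warped:.0f} min, teleop {human_time_teleop:.0f} min")
print(f"reduction factor ~ {human_time_teleop / human_time_warped:.1f}x")
```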
Circularity Check
No circularity detected; the empirical pipeline is validated against an external teleoperation baseline
Full rationale
The paper describes a data-synthesis pipeline (vision foundation models for scene initialization, hand-object tracking and retargeting, Gaussian Splatting for wrist views) whose output is used to train policies that are then evaluated on real-robot success rates against an independent teleoperation baseline. No equations, fitted parameters, or self-citations are presented as load-bearing derivations that reduce to the inputs by construction. The central claim (comparable success with 5-8x less collection time) is framed as a direct experimental comparison rather than a self-referential prediction or a renamed known result. This is the expected outcome for a methods paper: the circularity check finds nothing to flag.
Axiom & Free-Parameter Ledger
free parameters (1)
- Gaussian Splatting optimization parameters (see the sketch after this ledger)
axioms (2)
- Domain assumption: Vision foundation models can reliably initialize an interactive scene from monocular egocentric RGB video.
- Domain assumption: Hand-object tracking and retargeting to a robotic end-effector preserve task-relevant motion for policy learning.
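To make the single free-parameter entry concrete (the sketch promised above): it refers to the optimization knobs of a standard Gaussian Splatting fit. The values below are commonly used defaults of the reference 3DGS implementation, given only for orientation; the paper's actual settings are not stated here.

```python
# Typical 3D Gaussian Splatting optimization hyperparameters (reference-implementation
# defaults, not values reported by the WARPED paper).
gs_optimization_params = {
    "iterations": 30_000,
    "sh_degree": 3,                     # spherical-harmonics order for view-dependent color
    "position_lr_init": 1.6e-4,
    "feature_lr": 2.5e-3,
    "opacity_lr": 0.05,
    "scaling_lr": 5e-3,
    "rotation_lr": 1e-3,
    "densify_grad_threshold": 2e-4,     # clone/split Gaussians above this view-space gradient
    "densify_until_iter": 15_000,
    "opacity_reset_interval": 3_000,
    "lambda_dssim": 0.2,                # weight of the D-SSIM term in the photometric loss
}
```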