Recognition: 2 theorem links
LAMP: Lift Image-Editing as General 3D Priors for Open-world Manipulation
Pith reviewed 2026-05-10 16:53 UTC · model grok-4.3
The pith
Lifting image edits into 3D transformations supplies precise guidance for robotic manipulation in new environments.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By lifting the implicit 2D spatial cues encoded in image-editing results into 3D space, LAMP derives precise inter-object transformations that serve as generalizable priors for manipulation tasks, achieving strong zero-shot performance in open-world settings.
What carries the argument
The lifting process that converts 2D spatial cues from image edits into continuous 3D inter-object transformations as geometry-aware representations.
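The paper's quoted pipeline later credits the Umeyama algorithm for this alignment step. As a hedged illustration (a standard building block, not the authors' implementation), a minimal NumPy sketch of Umeyama alignment, which recovers a similarity transform from paired 3D points such as depth-unprojected correspondences before and after an edit:

```python
import numpy as np

def umeyama(src, dst, with_scale=True):
    """Least-squares similarity transform mapping src onto dst.

    src, dst: (N, 3) arrays of corresponding 3D points, e.g. object points
    unprojected from depth before and after an image edit. Returns (s, R, t)
    such that dst ~= s * (src @ R.T) + t.
    """
    mu_s, mu_d = src.mean(axis=0), dst.mean(axis=0)
    xs, xd = src - mu_s, dst - mu_d
    cov = xd.T @ xs / len(src)                 # 3x3 cross-covariance
    U, D, Vt = np.linalg.svd(cov)
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        S[2, 2] = -1.0                         # guard against reflections
    R = U @ S @ Vt
    var_src = (xs ** 2).sum() / len(src)
    s = float((D * np.diag(S)).sum() / var_src) if with_scale else 1.0
    t = mu_d - s * R @ mu_s
    return s, R, t
```

With exact correspondences the recovery is exact; with noisy depth the result is the least-squares optimum, which is where the referee's depth-ambiguity concern below enters.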
Where Pith is reading between the lines
- The same lifting idea could apply to other 2D-to-3D conversion problems such as scene reconstruction from casual photos.
- Pairing these geometric priors with existing vision-language models might yield planners that combine spatial accuracy with semantic understanding.
- Deployment on physical robots in cluttered, changing real-world scenes would provide a direct test of whether the derived transformations transfer beyond simulation.
Load-bearing premise
2D spatial cues from image editing can be reliably lifted into accurate, continuous 3D inter-object transformations without requiring explicit 3D supervision or task-specific fine-tuning.
What would settle it
An experiment showing that the 3D transformations extracted from image edits deviate substantially from measured ground-truth object poses or that zero-shot manipulation success rates remain unchanged from baselines without 3D lifting would disprove the central claim.
Figures
read the original abstract
Human-like generalization in the open world remains a fundamental challenge for robotic manipulation. Existing learning-based methods, including reinforcement learning, imitation learning, and vision-language-action models (VLAs), often struggle with novel tasks and unseen environments. Another promising direction is to explore generalizable representations that capture fine-grained spatial and geometric relations for open-world manipulation. While large language models (LLMs) and vision-language models (VLMs) provide strong semantic reasoning based on language or annotated 2D representations, their limited 3D awareness restricts their applicability to fine-grained manipulation. To address this, we propose LAMP, which lifts image-editing as 3D priors to extract inter-object 3D transformations as continuous, geometry-aware representations. Our key insight is that image-editing inherently encodes rich 2D spatial cues, and lifting these implicit cues into 3D transformations provides fine-grained and accurate guidance for open-world manipulation. Extensive experiments demonstrate that LAMP delivers precise 3D transformations and achieves strong zero-shot generalization in open-world manipulation. Project page: https://zju3dv.github.io/LAMP/.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes LAMP, a method that repurposes image-editing models to extract implicit 2D spatial cues and lifts them into continuous 3D inter-object transformations as general priors for open-world robotic manipulation. The central claim is that this lifting yields fine-grained, geometry-aware 3D guidance without task-specific fine-tuning or explicit 3D supervision, enabling strong zero-shot generalization beyond what LLMs, VLMs, or standard learning-based approaches achieve.
Significance. If the 2D-to-3D lifting produces verifiably accurate metric transformations, the work would offer a practical route to 3D-aware manipulation by leveraging abundant 2D generative models, reducing reliance on 3D datasets or per-task training. This could meaningfully advance open-world robotics if the priors are shown to be more than projective heuristics.
Major comments (2)
- [Experiments] Experiments section: the abstract and method claim 'precise 3D transformations' and 'geometry-aware' representations, yet no quantitative 3D error metrics (e.g., rotation or translation error against ground-truth poses) or ablations comparing lifted 3D deltas to direct 3D supervision are reported; downstream task success alone does not establish that the priors are metric 3D rather than 2D appearance-based.
- [Method] Method section (lifting procedure): the inversion from 2D edit cues to continuous 3D transformations is presented without explicit handling of depth ambiguity or projective scale; if the lifting relies on off-the-shelf depth estimators or optimization without 3D consistency losses, the 'accurate' claim risks being circular with the image editor's 2D training distribution.
Minor comments (2)
- [Abstract] Abstract: 'extensive experiments' and 'strong zero-shot generalization' are asserted without any numerical results, baseline comparisons, or task counts, which reduces clarity even for a high-level summary.
- Notation: the symbols used for the lifted 3D transformation (e.g., rotation and translation components) should be defined once in the main text rather than only in supplementary material.
Simulated Author's Rebuttal
We thank the referee for the thoughtful and constructive feedback. We address the two major comments point by point below, clarifying our approach and indicating planned revisions to strengthen the manuscript.
read point-by-point responses
-
Referee: Experiments section: the abstract and method claim 'precise 3D transformations' and 'geometry-aware' representations, yet no quantitative 3D error metrics (e.g., rotation or translation error against ground-truth poses) or ablations comparing lifted 3D deltas to direct 3D supervision are reported; downstream task success alone does not establish that the priors are metric 3D rather than 2D appearance-based.
Authors: We agree that direct quantitative 3D metrics would provide stronger evidence for the metric accuracy of the lifted transformations. Our current evaluation prioritizes downstream zero-shot manipulation success across diverse open-world tasks to demonstrate practical utility where explicit 3D ground truth is typically unavailable. In the revised version, we will add quantitative evaluations on controlled datasets with available ground-truth poses (e.g., synthetic scenes and selected real-world captures), reporting mean rotation and translation errors. We will also include an ablation comparing our lifted 3D priors against baselines that use direct 3D supervision or raw 2D cues. This will help isolate the contribution of the 3D lifting step. revision: yes
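The metrics promised in this response are standard; as a sketch (not taken from the paper), the geodesic rotation error and Euclidean translation error between an estimated and a ground-truth pose can be computed as:

```python
import numpy as np

def pose_errors(R_est, t_est, R_gt, t_gt):
    """Rotation error in degrees (geodesic distance on SO(3)) and
    translation error (Euclidean norm) against a ground-truth pose."""
    R_delta = R_est.T @ R_gt                       # relative rotation
    cos_angle = np.clip((np.trace(R_delta) - 1.0) / 2.0, -1.0, 1.0)
    rot_err_deg = float(np.degrees(np.arccos(cos_angle)))
    trans_err = float(np.linalg.norm(np.asarray(t_est) - np.asarray(t_gt)))
    return rot_err_deg, trans_err
```

Reporting the means of these two quantities over a pose set is exactly the "mean rotation and translation errors" the rebuttal commits to.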
-
Referee: Method section (lifting procedure): the inversion from 2D edit cues to continuous 3D transformations is presented without explicit handling of depth ambiguity or projective scale; if the lifting relies on off-the-shelf depth estimators or optimization without 3D consistency losses, the 'accurate' claim risks being circular with the image editor's 2D training distribution.
Authors: We appreciate the concern regarding depth ambiguity and scale. The lifting procedure combines off-the-shelf monocular depth estimation with a multi-view consistency optimization that enforces geometric constraints across edited image pairs, including scale normalization based on known camera intrinsics and object size priors from the scene. This is not purely circular, as the 2D editing model supplies appearance-consistent cues while the lifting step introduces explicit 3D geometric reasoning. We will revise the method section to provide a clearer step-by-step description of the ambiguity resolution process, including the optimization objective and any consistency losses used. We will also add a limitations paragraph discussing residual depth ambiguities in highly occluded or textureless scenes. revision: partial
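The scale normalization the authors invoke presupposes metric back-projection through known intrinsics. A minimal sketch of that prerequisite step (a generic pinhole-camera computation, not the paper's code), unprojecting pixels with estimated depths into camera-frame 3D points:

```python
import numpy as np

def unproject(pixels, depth, K):
    """Back-project pixel coordinates into camera-frame 3D points.

    pixels: (N, 2) pixel coordinates, depth: (N,) metric depths,
    K: 3x3 pinhole intrinsics. Returns (N, 3) points X whose
    projection K @ X is proportional to [u, v, 1] and whose z equals depth.
    """
    uv1 = np.column_stack([pixels, np.ones(len(pixels))])
    rays = uv1 @ np.linalg.inv(K).T      # viewing rays with unit z-component
    return rays * depth[:, None]
```

Any scale error in the monocular depth estimate propagates linearly through this map, which is why the consistency optimization described above is load-bearing for the "accurate" claim.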
Circularity Check
No significant circularity in LAMP's lifting of image-editing cues to 3D priors
full rationale
The paper presents LAMP as a method that takes outputs from external pre-trained image-editing models (providing 2D spatial cues) and applies a lifting procedure to obtain 3D inter-object transformations. No equations, self-definitions, or fitted parameters are shown that would make the claimed 3D priors equivalent to the input 2D edits by construction. The abstract and description contain no load-bearing self-citations, uniqueness theorems from the same authors, or ansatzes smuggled in via prior work. The central claim remains an independent methodological proposal relying on external models rather than re-deriving results from its own manipulation data or self-referential inputs. This is the common case: a self-contained method evaluated against external models and benchmarks rather than against its own outputs.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear?
unclear: Relation between the paper passage and the cited Recognition theorem.
We propose LAMP, which lifts image-editing as 3D priors to extract inter-object 3D transformations as continuous, geometry-aware representations... using monocular depth estimator (e.g., VGGT), DINOv3 features, Umeyama algorithm, and unified scale s_a = s_p.
-
IndisputableMonolith/Foundation/AlexanderDuality.lean · alexander_duality_circle_linking · unclear?
unclear: Relation between the paper passage and the cited Recognition theorem.
Hierarchical 2D-3D fused filtering... DBSCAN within K-Means clusters... cross-state point cloud registration.
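The quoted pipeline ("DBSCAN within K-Means clusters") suggests a coarse partition followed by per-cluster density filtering. A hedged pure-NumPy sketch of that idea, pairing plain Lloyd-iteration K-Means with DBSCAN's core-point criterion (illustrative only; the paper's actual parameters and implementation are not given here):

```python
import numpy as np

def kmeans_labels(points, k, iters=20, seed=0):
    """Coarse partition of (N, 3) points via plain Lloyd iterations."""
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), size=k, replace=False)].copy()
    for _ in range(iters):
        d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = d2.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = points[labels == j].mean(axis=0)
    return labels

def density_filter(points, labels, eps=0.5, min_pts=5):
    """Within each coarse cluster, keep only points meeting DBSCAN's
    core-point criterion: at least min_pts neighbors within radius eps."""
    keep = np.zeros(len(points), dtype=bool)
    for j in np.unique(labels):
        idx = np.where(labels == j)[0]
        d2 = ((points[idx][:, None] - points[idx]) ** 2).sum(-1)
        keep[idx] = (d2 <= eps ** 2).sum(axis=1) >= min_pts
    return keep
```

The effect is to discard sparse depth outliers before cross-state registration, which is where such filtering would matter in the pipeline the passage describes.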
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ah- mad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report.arXiv preprint arXiv:2303.08774,
work page internal anchor Pith review Pith/arXiv arXiv
-
[2]
Playing hard exploration games by watching youtube.Advances in neural information processing systems, 31, 2018
Yusuf Aytar, Tobias Pfaff, David Budden, Thomas Paine, Ziyu Wang, and Nando De Freitas. Playing hard exploration games by watching youtube.Advances in neural information processing systems, 31, 2018. 1
2018
-
[3]
Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, et al. Qwen technical report.arXiv preprint arXiv:2309.16609, 2023. 2
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[4]
Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2. 5-vl technical report.arXiv preprint arXiv:2502.13923, 2025. 2
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[5]
Rt-h: Action hierarchies using language
Suneel Belkhale, Tianli Ding, Ted Xiao, Pierre Sermanet, Quan Vuong, Jonathan Tompson, Yevgen Chebotar, De- bidatta Dwibedi, and Dorsa Sadigh. Rt-h: Action hierarchies using language. InRobotics: Science and Systems, 2024. 1
2024
-
[6]
Gen2act: Hu- man video generation in novel scenarios enables generaliz- able robot manipulation
Homanga Bharadhwaj, Debidatta Dwibedi, Abhinav Gupta, Shubham Tulsiani, Carl Doersch, Ted Xiao, Dhruv Shah, Fei Xia, Dorsa Sadigh, and Sean Kirmani. Gen2act: Hu- man video generation in novel scenarios enables generaliz- able robot manipulation. In1st Workshop on X-Embodiment Robot Learning. 2, 3
-
[7]
Track2act: Predicting point tracks from internet videos enables generalizable robot manipula- tion
Homanga Bharadhwaj, Roozbeh Mottaghi, Abhinav Gupta, and Shubham Tulsiani. Track2act: Predicting point tracks from internet videos enables generalizable robot manipula- tion. InEuropean Conference on Computer Vision (ECCV),
-
[8]
Robotic grasping and contact: A review
Antonio Bicchi and Vijay Kumar. Robotic grasping and contact: A review. InProceedings 2000 ICRA. Millennium conference. IEEE international conference on robotics and automation. Symposia proceedings (Cat. No. 00CH37065), pages 348–353. IEEE, 2000. 1
2000
-
[9]
Zero-shot robotic manipulation with pre-trained image-editing diffusion models
Kevin Black, Mitsuhiko Nakamoto, Pranav Atreya, Homer Rich Walke, Chelsea Finn, Aviral Kumar, and Sergey Levine. Zero-shot robotic manipulation with pre-trained image-editing diffusion models. InThe Twelfth International Conference on Learning Representations. 3
-
[10]
In9th Annual Conference on Robot Learning, 2025
Kevin Black, Noah Brown, James Darpinian, Karan Dha- balia, Danny Driess, Adnan Esmail, Michael Robert Equi, Chelsea Finn, Niccolo Fusai, Manuel Y Galliker, et al.π 0.5: a vision-language-action model with open-world generaliza- tion. In9th Annual Conference on Robot Learning, 2025. 1, 3
2025
-
[11]
RT-1: Robotics Transformer for Real-World Control at Scale
Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Joseph Dabis, Chelsea Finn, Keerthana Gopalakr- ishnan, Karol Hausman, Alex Herzog, Jasmine Hsu, et al. Rt-1: Robotics transformer for real-world control at scale. arXiv preprint arXiv:2212.06817, 2022. 2, 3
work page internal anchor Pith review arXiv 2022
-
[12]
In- structpix2pix: Learning to follow image editing instructions
Tim Brooks, Aleksander Holynski, and Alexei A Efros. In- structpix2pix: Learning to follow image editing instructions. InProceedings of the IEEE/CVF conference on computer vi- sion and pattern recognition, pages 18392–18402, 2023. 3
2023
-
[13]
Haonan Chen, Jingxiang Guo, Bangjun Wang, Tianrui Zhang, Xuchuan Huang, Boren Zheng, Yiwen Hou, Chenrui Tie, Jiajun Deng, and Lin Shao. Goal-vla: Image-generative vlms as object-centric world models empowering zero-shot robot manipulation.arXiv preprint arXiv:2506.23919, 2025. 3, 4
-
[14]
Neural shape mating: Self-supervised object assembly with adversarial shape priors
Yun-Chun Chen, Haoda Li, Dylan Turpin, Alec Jacobson, and Animesh Garg. Neural shape mating: Self-supervised object assembly with adversarial shape priors. InProceed- ings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12724–12733, 2022. 2, 3
2022
-
[15]
Putting the object back into video object segmentation
Ho Kei Cheng, Seoung Wug Oh, Brian Price, Joon-Young Lee, and Alexander Schwing. Putting the object back into video object segmentation. In2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 3151–3161, 2024. 1
2024
-
[16]
3d-fixup: Advancing photo editing with 3d priors
Yen-Chi Cheng, Krishna Kumar Singh, Jae Shin Yoon, Alexander Schwing, Liang-Yan Gui, Matheus Gadelha, Paul Guerrero, and Nanxuan Zhao. 3d-fixup: Advancing photo editing with 3d priors. InProceedings of the Special Inter- est Group on Computer Graphics and Interactive Techniques Conference Conference Papers, pages 1–10, 2025. 2
2025
-
[17]
Diffusion policy: Visuomotor policy learning via action dif- fusion.The International Journal of Robotics Research, 44 (10-11):1684–1704, 2025
Cheng Chi, Zhenjia Xu, Siyuan Feng, Eric Cousineau, Yilun Du, Benjamin Burchfiel, Russ Tedrake, and Shuran Song. Diffusion policy: Visuomotor policy learning via action dif- fusion.The International Journal of Robotics Research, 44 (10-11):1684–1704, 2025. 1
2025
-
[18]
Local neural descriptor fields: Locally conditioned object representations for manipulation
Ethan Chun, Yilun Du, Anthony Simeonov, Tomas Lozano- Perez, and Leslie Kaelbling. Local neural descriptor fields: Locally conditioned object representations for manipulation. In2023 IEEE International Conference on Robotics and Au- tomation (ICRA), pages 1830–1836. IEEE, 2023. 2
2023
-
[19]
Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blis- tein, Ori Ram, Dan Zhang, Evan Rosen, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities.arXiv preprint arXiv:2507.06261, 2025. 4
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[20]
Karthik Dharmarajan, Wenlong Huang, Jiajun Wu, Li Fei- Fei, and Ruohan Zhang. Dream2flow: Bridging video gen- eration and open-world manipulation with 3d object flow. arXiv preprint arXiv:2512.24766, 2025. 3
-
[21]
Learning universal policies via text-guided video genera- tion.Advances in neural information processing systems, 36:9156–9172, 2023
Yilun Du, Sherry Yang, Bo Dai, Hanjun Dai, Ofir Nachum, Josh Tenenbaum, Dale Schuurmans, and Pieter Abbeel. Learning universal policies via text-guided video genera- tion.Advances in neural information processing systems, 36:9156–9172, 2023. 3
2023
-
[22]
Flowbot3d: Learning 3d ar- ticulation flow to manipulate articulated objects.Robotics Science and Systems 2022, 2022
Ben Eisner and Harry Zhang. Flowbot3d: Learning 3d ar- ticulation flow to manipulate articulated objects.Robotics Science and Systems 2022, 2022. 2
2022
-
[23]
Anygrasp: Robust and efficient grasp perception in spa- tial and temporal domains.IEEE Transactions on Robotics, 39(5):3929–3945, 2023
Hao-Shu Fang, Chenxi Wang, Hongjie Fang, Minghao Gou, Jirong Liu, Hengxu Yan, Wenhai Liu, Yichen Xie, and Cewu Lu. Anygrasp: Robust and efficient grasp perception in spa- tial and temporal domains.IEEE Transactions on Robotics, 39(5):3929–3945, 2023. 6 7
2023
-
[24]
Moka: Open-world robotic manipulation through mark- based visual prompting.Robotics: Science and Systems (RSS), 2024
Kuan Fang, Fangchen Liu, Pieter Abbeel, and Sergey Levine. Moka: Open-world robotic manipulation through mark- based visual prompting.Robotics: Science and Systems (RSS), 2024. 2
2024
-
[25]
Foundation models in robotics: Applications, challenges, and the future.The International Journal of Robotics Research, 44(5):701–739,
Roya Firoozi, Johnathan Tucker, Stephen Tian, Anirudha Majumdar, Jiankai Sun, Weiyu Liu, Yuke Zhu, Shuran Song, Ashish Kapoor, Karol Hausman, et al. Foundation models in robotics: Applications, challenges, and the future.The International Journal of Robotics Research, 44(5):701–739,
-
[26]
Llmdet: Learning strong open-vocabulary object detectors under the supervision of large language models
Shenghao Fu, Qize Yang, Qijie Mo, Junkai Yan, Xihan Wei, Jingke Meng, Xiaohua Xie, and Wei-Shi Zheng. Llmdet: Learning strong open-vocabulary object detectors under the supervision of large language models. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 14987–14997, 2025. 4
2025
-
[27]
Flip: Flow-centric generative planning as general-purpose manipulation world model
Chongkai Gao, Haozhuo Zhang, Zhixuan Xu, Cai Zhehao, and Lin Shao. Flip: Flow-centric generative planning as general-purpose manipulation world model. InThe Thir- teenth International Conference on Learning Representa- tions. 3
-
[28]
Dong Guo, Faming Wu, Feida Zhu, Fuxing Leng, Guang Shi, Haobin Chen, Haoqi Fan, Jian Wang, Jianyu Jiang, Jiawei Wang, et al. Seed1. 5-vl technical report.arXiv preprint arXiv:2505.07062, 2025. 2
work page internal anchor Pith review arXiv 2025
-
[29]
Flowdreamer: A rgb-d world model with flow- based motion representations for robot manipulation.IEEE Robotics and Automation Letters, 2026
Jun Guo, Xiaojian Ma, Yikai Wang, Min Yang, Huaping Liu, and Qing Li. Flowdreamer: A rgb-d world model with flow- based motion representations for robot manipulation.IEEE Robotics and Automation Letters, 2026. 3
2026
-
[30]
Dextreme: Transfer of agile in-hand ma- nipulation from simulation to reality
Ankur Handa, Arthur Allshire, Viktor Makoviychuk, Aleksei Petrenko, Ritvik Singh, Jingzhou Liu, Denys Makoviichuk, Karl Van Wyk, Alexander Zhurkevich, Balakumar Sundar- alingam, et al. Dextreme: Transfer of agile in-hand ma- nipulation from simulation to reality. In2023 IEEE Inter- national Conference on Robotics and Automation (ICRA), pages 5977–5984. IE...
2023
-
[31]
Modem: Accelerating visual model-based reinforcement learning with demonstra- tions
Nicklas Hansen, Yixin Lin, Hao Su, Xiaolong Wang, Vikash Kumar, and Aravind Rajeswaran. Modem: Accelerating visual model-based reinforcement learning with demonstra- tions. InThe Eleventh International Conference on Learning Representations. 1
-
[32]
Visuomotor control in multi-object scenes using object-aware representations
Negin Heravi, Ayzaan Wahid, Corey Lynch, Pete Florence, Travis Armstrong, Jonathan Tompson, Pierre Sermanet, Jeannette Bohg, and Debidatta Dwibedi. Visuomotor control in multi-object scenes using object-aware representations. In 2023 IEEE International Conference on Robotics and Au- tomation (ICRA), pages 9515–9522. IEEE, 2023. 2
2023
-
[33]
Learning particle-based world model from human for robot dexterous manipulation
Zhengdong Hong, Y Liu, H Hou, B Ai, J Wang, T Mu, Y Qin, J Gu, and H Su. Learning particle-based world model from human for robot dexterous manipulation. In3rd RSS Workshop on Dexterous Manipulation: Learning and Con- trol with Diverse Data, 2025. 1
2025
-
[34]
Look before you leap: Unveiling the power of gpt-4v in robotic vision-language planning
Yingdong Hu, Fanqi Lin, Tong Zhang, Li Yi, and Yang Gao. Look before you leap: Unveiling the power of gpt-4v in robotic vision-language planning. InFirst Workshop on Vision-Language Models for Navigation and Manipulation at ICRA 2024. 4
2024
-
[35]
Copa: General robotic manipulation through spa- tial constraints of parts with foundation models
Haoxu Huang, Fanqi Lin, Yingdong Hu, Shengjie Wang, and Yang Gao. Copa: General robotic manipulation through spa- tial constraints of parts with foundation models. In2024 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 9488–9495. IEEE, 2024. 2, 3, 8, 10
2024
-
[36]
VoxPoser: Composable 3D Value Maps for Robotic Manipulation with Language Models
Wenlong Huang, Chen Wang, Ruohan Zhang, Yunzhu Li, Jiajun Wu, and Li Fei-Fei. V oxposer: Composable 3d value maps for robotic manipulation with language models.arXiv preprint arXiv:2307.05973, 2023. 2, 3, 8, 9
work page internal anchor Pith review arXiv 2023
-
[37]
Rekep: Spatio-temporal reasoning of rela- tional keypoint constraints for robotic manipulation
Wenlong Huang, Chen Wang, Yunzhu Li, Ruohan Zhang, and Li Fei-Fei. Rekep: Spatio-temporal reasoning of rela- tional keypoint constraints for robotic manipulation. InCon- ference on Robot Learning, pages 4573–4602. PMLR, 2025. 2, 3, 8, 9, 10
2025
-
[38]
Wenlong Huang, Yu-Wei Chao, Arsalan Mousavian, Ming- Yu Liu, Dieter Fox, Kaichun Mo, and Li Fei-Fei. Pointworld: Scaling 3d world models for in-the-wild robotic manipula- tion.arXiv preprint arXiv:2601.03782, 2026. 3
-
[39]
Strictly batch imitation learning by energy-based distribution matching.Advances in Neural Information Processing Sys- tems, 33:7354–7365, 2020
Daniel Jarrett, Ioana Bica, and Mihaela van der Schaar. Strictly batch imitation learning by energy-based distribution matching.Advances in Neural Information Processing Sys- tems, 33:7354–7365, 2020. 1
2020
-
[40]
Real- world robot applications of foundation models: A review
Kento Kawaharazuka, Tatsuya Matsushima, Andrew Gam- bardella, Jiaxian Guo, Chris Paxton, and Andy Zeng. Real- world robot applications of foundation models: A review. Advanced Robotics, 38(18):1232–1254, 2024. 3
2024
-
[41]
Openvla: An open-source vision-language-action model
Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan P Foster, Pannag R Sanketi, Quan Vuong, et al. Openvla: An open-source vision-language-action model. InConference on Robot Learning, pages 2679–2713. PMLR, 2025. 2
2025
-
[42]
Segment any- thing
Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer White- head, Alexander C Berg, Wan-Yen Lo, et al. Segment any- thing. InProceedings of the IEEE/CVF international confer- ence on computer vision, pages 4015–4026, 2023. 4
2023
-
[43]
Graph inverse reinforcement learning from diverse videos
Sateesh Kumar, Jonathan Zamora, Nicklas Hansen, Rishabh Jangir, and Xiaolong Wang. Graph inverse reinforcement learning from diverse videos. InConference on Robot Learn- ing, pages 55–66. PMLR, 2023. 1
2023
-
[44]
Coa-vla: Improving vision-language-action models via visual-text chain-of-affordance
Jinming Li, Yichen Zhu, Zhibin Tang, Junjie Wen, Minjie Zhu, Xiaoyu Liu, Chengmeng Li, Ran Cheng, Yaxin Peng, Yan Peng, et al. Coa-vla: Improving vision-language-action models via visual-text chain-of-affordance. InProceedings of the IEEE/CVF International Conference on Computer Vi- sion, pages 9759–9769, 2025. 3
2025
-
[45]
Cogact: A foundational vision- language-action model for synergizing cognition and action in robotic manipulation.CoRR, 2024
Qixiu Li, Yaobo Liang, Zeyu Wang, Lin Luo, Xi Chen, Mozheng Liao, Fangyun Wei, Yu Deng, Sicheng Xu, Yizhong Zhang, et al. Cogact: A foundational vision- language-action model for synergizing cognition and action in robotic manipulation.CoRR, 2024. 2
2024
-
[46]
Code as policies: Language model programs for embodied control
Jacky Liang, Wenlong Huang, Fei Xia, Peng Xu, Karol Hausman, Brian Ichter, Pete Florence, and Andy Zeng. Code as policies: Language model programs for embodied control. In2023 IEEE International Conference on Robotics and Au- tomation (ICRA), pages 9493–9500. IEEE, 2023. 2 8
2023
-
[47]
Dreamitate: Real-world visuomotor policy learn- ing via video generation
Junbang Liang, Ruoshi Liu, Ege Ozguroglu, Sruthi Sud- hakar, Achal Dave, Pavel Tokmakov, Shuran Song, and Carl V ondrick. Dreamitate: Real-world visuomotor policy learn- ing via video generation. InConference on Robot Learning, pages 3943–3960. PMLR, 2025. 3
2025
-
[48]
Prompting depth anything for 4k resolution accurate metric depth estimation
Haotong Lin, Sida Peng, Jingxiao Chen, Songyou Peng, Ji- aming Sun, Minghuan Liu, Hujun Bao, Jiashi Feng, Xiaowei Zhou, and Bingyi Kang. Prompting depth anything for 4k resolution accurate metric depth estimation. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 17070–17080, 2025. 5
2025
-
[49]
Visual instruction tuning.Advances in neural information processing systems, 36:34892–34916, 2023
Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning.Advances in neural information processing systems, 36:34892–34916, 2023. 2
2023
-
[50]
Distinctive image features from scale- invariant keypoints.International journal of computer vi- sion, 60(2):91–110, 2004
David G Lowe. Distinctive image features from scale- invariant keypoints.International journal of computer vi- sion, 60(2):91–110, 2004. 5
2004
-
[51]
Jigsaw: Learning to assemble multiple fractured objects.Advances in Neural Information Processing Systems, 36:14969–14986, 2023
Jiaxin Lu, Yifan Sun, and Qixing Huang. Jigsaw: Learning to assemble multiple fractured objects.Advances in Neural Information Processing Systems, 36:14969–14986, 2023. 3
2023
-
[52]
Model-based reinforcement learn- ing: A survey.Foundations and Trends® in Machine Learn- ing, 16(1):1–118, 2023
Thomas M Moerland, Joost Broekens, Aske Plaat, Catholijn M Jonker, et al. Model-based reinforcement learn- ing: A survey.Foundations and Trends® in Machine Learn- ing, 16(1):1–118, 2023. 1
2023
-
[53]
Contact-invariant optimization for hand manipulation
Igor Mordatch, Zoran Popovi ´c, and Emanuel Todorov. Contact-invariant optimization for hand manipulation. In Proceedings of the ACM SIGGRAPH/Eurographics sympo- sium on computer animation, pages 137–144, 2012. 1
2012
-
[54]
Pivot: iterative visual prompting elicits actionable knowledge for vlms
Soroush Nasiriany, Fei Xia, Wenhao Yu, Ted Xiao, Jacky Liang, Ishita Dasgupta, Annie Xie, Danny Driess, Ayzaan Wahid, Zhuo Xu, et al. Pivot: iterative visual prompting elicits actionable knowledge for vlms. InProceedings of the 41st International Conference on Machine Learning, pages 37321–37341, 2024. 2
2024
-
[55]
Open x-embodiment: Robotic learning datasets and rt-x models: Open x-embodiment collaboration 0
Abby O’Neill, Abdul Rehman, Abhiram Maddukuri, Ab- hishek Gupta, Abhishek Padalkar, Abraham Lee, Acorn Poo- ley, Agrim Gupta, Ajay Mandlekar, Ajinkya Jain, et al. Open x-embodiment: Robotic learning datasets and rt-x models: Open x-embodiment collaboration 0. In2024 IEEE Inter- national Conference on Robotics and Automation (ICRA), pages 6892–6903. IEEE, 2024. 2
2024
-
[56]
Omnimanip: Towards general robotic manipulation via object-centric interaction primitives as spatial constraints
Mingjie Pan, Jiyao Zhang, Tianshu Wu, Yinghao Zhao, Wen- long Gao, and Hao Dong. Omnimanip: Towards general robotic manipulation via object-centric interaction primitives as spatial constraints. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 17359–17369,
-
[57]
Robotic manipulation by imitating generated videos without physical demonstrations
Shivansh Patel, Shraddhaa Mohan, Hanlin Mai, Unnat Jain, Svetlana Lazebnik, and Yunzhu Li. Robotic manipulation by imitating generated videos without physical demonstrations. InWorkshop on Foundation Models Meet Embodied Agents at CVPR 2025. 2, 3
2025
-
[58]
Two by two: Learning multi-task pairwise objects assembly for generalizable robot manipu- lation
Yu Qi, Yuanchen Ju, Tianming Wei, Chi Chu, Lawson LS Wong, and Huazhe Xu. Two by two: Learning multi-task pairwise objects assembly for generalizable robot manipu- lation. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 17383–17393, 2025. 3, 6, 7, 8
2025
-
[59]
Learning complex dexterous manipulation with deep reinforcement learning and demonstrations.Robotics: Sci- ence and Systems XIV, 2018
Aravind Rajeswaran, Vikash Kumar, Abhishek Gupta, Giu- lia Vezzani, John Schulman, Emanuel Todorov, and Sergey Levine. Learning complex dexterous manipulation with deep reinforcement learning and demonstrations.Robotics: Sci- ence and Systems XIV, 2018. 1
2018
-
[60]
Goal conditioned imitation learning using score- based diffusion policies
Moritz Reuss, Maximilian Li, Xiaogang Jia, and Rudolf Li- outikov. Goal conditioned imitation learning using score- based diffusion policies. InRobotics: Science and Systems,
-
[61]
In-hand dexterous manipulation of piecewise- smooth 3-d objects.The International Journal of Robotics Research, 18(4):355–381, 1999
Daniela Rus. In-hand dexterous manipulation of piecewise- smooth 3-d objects.The International Journal of Robotics Research, 18(4):355–381, 1999. 1
1999
-
[62]
Fast point feature histograms (fpfh) for 3d registration
Radu Bogdan Rusu, Nico Blodow, and Michael Beetz. Fast point feature histograms (fpfh) for 3d registration. In2009 IEEE international conference on robotics and automation, pages 3212–3217. IEEE, 2009. 5
2009
-
[63]
Superglue: Learning feature matching with graph neural networks
Paul-Edouard Sarlin, Daniel DeTone, Tomasz Malisiewicz, and Andrew Rabinovich. Superglue: Learning feature matching with graph neural networks. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 4938–4947, 2020. 5
2020
-
[64]
Erich Schubert, Jörg Sander, Martin Ester, Hans Peter Kriegel, and Xiaowei Xu. DBSCAN revisited, revisited: Why and how you should (still) use DBSCAN. ACM Transactions on Database Systems (TODS), 42(3):1–21, 2017. 5
[65]
Mingyo Seo, Steve Han, Kyutae Sim, Seung Hyeon Bang, Carlos Gonzalez, Luis Sentis, and Yuke Zhu. Deep imitation learning for humanoid loco-manipulation through human teleoperation. In IEEE-RAS International Conference on Humanoid Robots (Humanoids), 2023. 1
[66]
Oriane Siméoni, Huy V. Vo, Maximilian Seitzer, Federico Baldassarre, Maxime Oquab, Cijo Jose, Vasil Khalidov, Marc Szafraniec, Seungeun Yi, Michaël Ramamonjisoa, et al. DINOv3. arXiv preprint arXiv:2508.10104, 2025. 5
[67]
Anthony Simeonov, Yilun Du, Andrea Tagliasacchi, Joshua B Tenenbaum, Alberto Rodriguez, Pulkit Agrawal, and Vincent Sitzmann. Neural descriptor fields: SE(3)-equivariant object representations for manipulation. In 2022 International Conference on Robotics and Automation (ICRA), pages 6394–6400. IEEE, 2022. 2
[68]
Anthony Simeonov, Yilun Du, Yen-Chen Lin, Alberto Rodriguez Garcia, Leslie Pack Kaelbling, Tomás Lozano-Pérez, and Pulkit Agrawal. SE(3)-equivariant relational rearrangement with neural descriptor fields. In Conference on Robot Learning, pages 835–846. PMLR, 2023. 2
[69]
Chan Hee Song, Jiaman Wu, Clayton Washington, Brian M Sadler, Wei-Lun Chao, and Yu Su. LLM-Planner: Few-shot grounded planning for embodied agents with large language models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 2998–3009, 2023. 4
[70]
Austin Stone, Ted Xiao, Yao Lu, Keerthana Gopalakrishnan, Kuang-Huei Lee, Quan Vuong, Paul Wohlhart, Sean Kirmani, Brianna Zitkovich, Fei Xia, et al. Open-world object manipulation using pre-trained vision-language models. In 7th Annual Conference on Robot Learning. 2
[71]
Tao Sun, Liyuan Zhu, Shengyu Huang, Shuran Song, and Iro Armeni. Rectified point flow: Generic point cloud pose estimation. arXiv preprint arXiv:2506.05282, 2025. 2, 3, 4, 5, 9, 10, 11, 13, 14, 15, 22, 23, 24
[72]
Balakumar Sundaralingam, Siva Kumar Sastry Hari, Adam Fishman, Caelan Garrett, Karl Van Wyk, Valts Blukis, Alexander Millane, Helen Oleynikova, Ankur Handa, Fabio Ramos, et al. CuRobo: Parallelized collision-free robot motion generation. In 2023 IEEE International Conference on Robotics and Automation (ICRA), pages 8112–8119. IEEE, 2023.
[73]
Priya Sundaresan, Jennifer Grannen, Brijen Thananjeyan, Ashwin Balakrishna, Michael Laskey, Kevin Stone, Joseph E Gonzalez, and Ken Goldberg. Learning rope manipulation policies using dense object descriptors trained on synthetic depth data. In 2020 IEEE International Conference on Robotics and Automation (ICRA), pages 9411–9418. IEEE, 2020.
[74]
Gemini Team, Petko Georgiev, Ving Ian Lei, Ryan Burnell, Libin Bai, Anmol Gulati, Garrett Tanzer, Damien Vincent, Zhufeng Pan, Shibo Wang, et al. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. arXiv preprint arXiv:2403.05530, 2024. 2
[75]
Gemma Team, Morgane Riviere, Shreya Pathak, Pier Giuseppe Sessa, Cassidy Hardin, Surya Bhupatiraju, Léonard Hussenot, Thomas Mesnard, Bobak Shahriari, Alexandre Ramé, et al. Gemma 2: Improving open language models at a practical size. arXiv preprint arXiv:2408.00118, 2024.
[76]
Jonathan Tremblay, Thang To, Balakumar Sundaralingam, Yu Xiang, Dieter Fox, and Stan Birchfield. Deep object pose estimation for semantic robotic grasping of household objects. In Conference on Robot Learning, pages 306–316. PMLR, 2018. 2
[77]
S. Umeyama. Least-squares estimation of transformation parameters between two point patterns. IEEE Transactions on Pattern Analysis and Machine Intelligence, 13(4):376–380, 1991.
[78]
Jianyuan Wang, Minghao Chen, Nikita Karaev, Andrea Vedaldi, Christian Rupprecht, and David Novotny. VGGT: Visual geometry grounded transformer. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 5294–5306, 2025. 2, 4
[79]
Weihan Wang, Qingsong Lv, Wenmeng Yu, Wenyi Hong, Ji Qi, Yan Wang, Junhui Ji, Zhuoyi Yang, Lei Zhao, Song XiXuan, et al. CogVLM: Visual expert for pretrained language models. Advances in Neural Information Processing Systems, 37:121475–121499, 2024. 2
[80]
Yue Wang and Justin M Solomon. Deep closest point: Learning representations for point cloud registration. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3523–3532, 2019. 5