Recognition: 2 theorem links
CLAMP: Contrastive Learning for 3D Multi-View Action-Conditioned Robotic Manipulation Pretraining
Pith reviewed 2026-05-16 08:21 UTC · model grok-4.3
The pith
CLAMP shows that contrastive pretraining on action-conditioned 3D multi-view observations from simulation transfers to real-world robotic manipulation, yielding superior performance.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
CLAMP establishes that contrastive learning on action-conditioned 3D multi-view representations derived from point clouds, combined with diffusion policy pretraining, produces encoders and policies that, when fine-tuned on few task demonstrations, achieve substantially higher success rates on unseen manipulation tasks in both simulated and real environments.
What carries the argument
The contrastive pretraining objective that associates 3D geometric and positional information from re-rendered multi-view images with robot action patterns, using merged point clouds from RGB-D and extrinsics.
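The pairing objective described above can be sketched as a standard InfoNCE loss over matched (observation, action) embeddings: each matched pair in a batch is a positive, every other pairing a negative. This is a minimal pure-Python illustration under assumed conventions (symmetric-free one-directional loss, temperature 0.1), not the paper's implementation.

```python
import math

def info_nce(obs_embs, act_embs, temperature=0.1):
    """InfoNCE over paired (observation, action) embeddings.

    Matched pairs (same batch index) are positives; all other
    pairings in the batch serve as negatives. A hedged sketch of the
    kind of objective CLAMP describes, not its exact loss."""
    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))

    def normalize(u):
        n = math.sqrt(dot(u, u))
        return [x / n for x in u]

    obs = [normalize(o) for o in obs_embs]
    act = [normalize(a) for a in act_embs]
    n = len(obs)
    loss = 0.0
    for i in range(n):
        # Cosine-similarity logits of observation i against every action.
        logits = [dot(obs[i], act[j]) / temperature for j in range(n)]
        m = max(logits)  # log-sum-exp stabilization
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        loss += log_z - logits[i]  # -log softmax of the positive pair
    return loss / n
```

Correctly paired batches should score near zero, while mismatched pairings are penalized, which is what drives the encoder to associate geometry with action patterns.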
If this is right
- The pretrained encoders capture 3D spatial details essential for precise actions.
- Sample efficiency improves during fine-tuning on real demonstrations.
- Generalization to new tasks increases compared to non-pretrained policies.
- The method works without major changes to the policy architecture.
Where Pith is reading between the lines
- The method could extend to other robot learning domains like navigation or assembly if similar 3D data is available.
- Increasing the diversity of simulated trajectories might further close the sim-to-real gap.
- Applying the same pretraining to different policy architectures beyond diffusion policies could yield similar benefits.
Load-bearing premise
The representations learned from simulated point clouds and actions remain effective despite differences in real-world sensing noise, lighting, and physical dynamics.
What would settle it
A real-world experiment where fine-tuning the CLAMP-pretrained model on limited demos yields success rates no higher than a randomly initialized model or a 2D-pretrained baseline.
Original abstract
Leveraging pre-trained 2D image representations in behavior cloning policies has achieved great success and has become a standard approach for robotic manipulation. However, such representations fail to capture the 3D spatial information about objects and scenes that is essential for precise manipulation. In this work, we introduce Contrastive Learning for 3D Multi-View Action-Conditioned Robotic Manipulation Pretraining (CLAMP), a novel 3D pre-training framework that utilizes point clouds and robot actions. From the merged point cloud computed from RGB-D images and camera extrinsics, we re-render multi-view four-channel image observations with depth and 3D coordinates, including dynamic wrist views, to provide clearer views of target objects for high-precision manipulation tasks. The pre-trained encoders learn to associate the 3D geometric and positional information of objects with robot action patterns via contrastive learning on large-scale simulated robot trajectories. During encoder pre-training, we pre-train a Diffusion Policy to initialize the policy weights for fine-tuning, which is essential for improving fine-tuning sample efficiency and performance. After pre-training, we fine-tune the policy on a limited amount of task demonstrations using the learned image and action representations. We demonstrate that this pre-training and fine-tuning design substantially improves learning efficiency and policy performance on unseen tasks. Furthermore, we show that CLAMP outperforms state-of-the-art baselines across six simulated tasks and five real-world tasks. The project website and videos can be found at https://clamp3d.github.io/CLAMP/.
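The merge-and-re-render pipeline the abstract describes can be sketched in two steps: back-project per-camera depth maps into one world-frame cloud, then rasterize a virtual view as a 4-channel image (depth plus 3D coordinates). Everything here is a simplified assumption: the data layout, the fixed orthographic virtual camera, and the z-buffer rule; the paper's pipeline uses calibrated perspective views, including dynamic wrist views.

```python
def merge_point_clouds(depth_maps, intrinsics, extrinsics):
    """Back-project per-camera depth maps into a single world-frame
    point cloud. Hypothetical layout: depth_maps[k][v][u] in meters,
    intrinsics (fx, fy, cx, cy), extrinsics a 3x4 world-from-camera
    matrix per camera."""
    points = []
    for depth, (fx, fy, cx, cy), T in zip(depth_maps, intrinsics, extrinsics):
        for v, row in enumerate(depth):
            for u, z in enumerate(row):
                if z <= 0:
                    continue  # skip invalid depth readings
                x, y = (u - cx) * z / fx, (v - cy) * z / fy
                p_cam = (x, y, z, 1.0)  # homogeneous camera-frame point
                points.append(tuple(
                    sum(T[i][j] * p_cam[j] for j in range(4))
                    for i in range(3)))
    return points

def render_virtual_view(points, res=4):
    """Re-render the merged cloud from a fixed top-down orthographic
    virtual camera as a 4-channel image (depth + world XYZ), keeping
    the nearest point per pixel (a z-buffer)."""
    img = [[[0.0] * 4 for _ in range(res)] for _ in range(res)]
    for x, y, z in points:
        u, v = int(x * res), int(y * res)  # assumes x, y in [0, 1)
        if 0 <= u < res and 0 <= v < res:
            if img[v][u][0] == 0.0 or z < img[v][u][0]:
                img[v][u] = [z, x, y, z]  # depth channel + 3D coordinates
    return img
```

In the paper each virtual view is a 224x224x4 image; the tiny resolution here only keeps the sketch readable.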
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces CLAMP, a 3D pre-training framework for robotic manipulation that merges RGB-D point clouds using camera extrinsics, re-renders them as multi-view 4-channel images (depth plus 3D coordinates) including dynamic wrist views, and applies contrastive learning on large-scale simulated trajectories to associate geometric features with action patterns. It additionally pre-trains a Diffusion Policy for weight initialization and fine-tunes the resulting encoders and policy on limited task demonstrations, claiming improved sample efficiency and outperformance versus state-of-the-art baselines on six simulated tasks and five real-world tasks.
Significance. If the empirical claims hold, the work would be significant for the robotics community by demonstrating that action-conditioned contrastive pre-training on clean simulated 3D renderings can produce representations that transfer to real RGB-D settings, addressing the spatial limitations of 2D image pre-training and reducing reliance on large real-world datasets.
major comments (2)
- [Abstract] The central claim of outperformance on six simulated and five real-world tasks is stated without quantitative tables, ablation results, baseline details, or statistical tests, so the magnitude and reliability of the reported gains cannot be assessed from the provided text.
- [Experiments] Experiments section (implied by the sim-to-real claims): no ablations are described on depth sensor noise models, calibration perturbations, or domain randomization during pre-training, leaving the robustness of the merged point-cloud re-rendering pipeline to real-world RGB-D artifacts unverified and directly threatening the transfer results on the five real tasks.
minor comments (2)
- [Methods] The exact construction of the four-channel re-rendered images (depth plus 3D coordinates) from the merged point cloud should be formalized with an equation or pseudocode for reproducibility.
- [Introduction] The paper should include a reference to prior contrastive learning works in robotics (e.g., on point-cloud or multi-view representations) to situate the novelty of the action-conditioned objective.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address each major comment point by point below and indicate the revisions we will incorporate.
Point-by-point responses
-
Referee: [Abstract] Abstract: the central claim of outperformance on six simulated and five real-world tasks is stated without any quantitative tables, ablation results, baseline details, or statistical tests, so the magnitude and reliability of the reported gains cannot be assessed from the provided text.
Authors: We acknowledge that the abstract, as a concise summary, does not include quantitative details or tables. The full results—including success rate tables comparing CLAMP to baselines, ablation studies on pretraining components, baseline descriptions, and run-to-run variance—are presented in Section 4 (Experiments). To address the concern, we will revise the abstract to include specific quantitative highlights (e.g., average success rate improvements of X% on simulated tasks and Y% on real tasks) while remaining within length limits. This will better convey the magnitude of gains directly in the abstract. revision: yes
-
Referee: [Experiments] Experiments section (implied by the sim-to-real claims): no ablations are described on depth sensor noise models, calibration perturbations, or domain randomization during pre-training, leaving the robustness of the merged point-cloud re-rendering pipeline to real-world RGB-D artifacts unverified and directly threatening the transfer results on the five real tasks.
Authors: We agree this is an important aspect for strengthening the sim-to-real claims. Our current experiments demonstrate empirical transfer from clean simulated pretraining to real RGB-D tasks, but do not include explicit ablations on sensor noise, calibration perturbations, or domain randomization. We will add a dedicated subsection in the revised Experiments section with these analyses: we will introduce controlled noise models and calibration variations during pretraining and report their impact on downstream fine-tuning performance. This will directly verify robustness of the merged point-cloud pipeline. revision: yes
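The robustness ablation proposed in this response can be sketched as two perturbation helpers, one injecting depth-sensor noise and one jittering camera calibration. The functional forms and parameter values (Gaussian noise, pixel dropout, translation-only jitter) are illustrative assumptions, not taken from the paper.

```python
import random

def perturb_depth(depth, sigma=0.005, dropout=0.02, rng=None):
    """Toy depth-sensor noise model: additive Gaussian noise plus
    random pixel dropout (zeroed readings), applied only to valid
    (positive) depths. Parameter values are illustrative."""
    rng = rng or random.Random(0)
    noisy = []
    for row in depth:
        out = []
        for z in row:
            if z > 0 and rng.random() < dropout:
                out.append(0.0)  # simulate a missing depth return
            elif z > 0:
                out.append(z + rng.gauss(0.0, sigma))
            else:
                out.append(z)  # leave already-invalid pixels untouched
        noisy.append(out)
    return noisy

def perturb_extrinsics(T, trans_sigma=0.002, rng=None):
    """Jitter the translation column of a 3x4 extrinsics matrix to
    mimic calibration error (rotation jitter omitted for brevity)."""
    rng = rng or random.Random(0)
    return [row[:3] + [row[3] + rng.gauss(0.0, trans_sigma)]
            for row in T]
```

Applying these perturbations to the simulated RGB-D inputs during pretraining, then comparing downstream fine-tuning success rates against the clean-data run, is one concrete way to run the verification this rebuttal promises.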
Circularity Check
No circularity in pretraining-to-evaluation chain
Full rationale
The paper describes an empirical pipeline: contrastive pretraining of encoders on large-scale simulated trajectories (using merged point clouds re-rendered as multi-view 4-channel images), joint pretraining of a Diffusion Policy for weight initialization, followed by fine-tuning on limited task demonstrations and evaluation on held-out simulated and real-world tasks. No equations, derivations, or load-bearing steps reduce reported performance gains to quantities defined by fitted parameters inside the paper or to self-referential definitions. The evaluation uses separate data splits and external benchmarks, so the outperformance claims remain independent of the training inputs by construction.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Re-rendered multi-view four-channel images supply clearer 3D geometric and positional cues than raw 2D RGB images for manipulation.
Lean theorems connected to this paper
-
Foundation/AlexanderDuality · alexander_duality_circle_linking (tag: echoes)
echoes: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.
From the merged point cloud computed from RGB-D images and camera extrinsics, we re-render multi-view four-channel image observations with depth and 3D coordinates... We are the first to apply STRING relative positional encoding using 3D coordinates derived from point clouds.
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.