PointACT: Vision-Language-Action Models with Multi-Scale Point-Action Interaction
Pith reviewed 2026-05-21 03:32 UTC · model grok-4.3
The pith
Integrating hierarchical 3D point clouds directly into action decoding raises VLA success rates by roughly 10 percent on the RLBench-10Tasks suite.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
PointACT is a dual-system 3D-aware VLA policy that integrates hierarchical 3D point cloud representations directly into the action decoding process through a multi-scale point-action interaction mechanism with efficient bottleneck window self-attention, enabling evolving action tokens to attend densely to local geometric detail and global scene structure.
What carries the argument
Multi-scale point-action interaction mechanism that lets action tokens attend to hierarchical 3D point clouds at multiple resolutions via bottleneck-window self-attention.
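To make the idea concrete, here is a minimal sketch of action tokens attending to point features at several resolutions. It is not the authors' implementation: it uses plain scaled dot-product cross-attention with no learned projections, and it omits the windowing and bottleneck components named in the paper. The function names, the coarse-to-fine order, the residual update, and the pyramid sizes (16, 64, 256 points) are all assumptions chosen for illustration.

```python
import math
import random

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attend(queries, keys_values):
    """Scaled dot-product attention; keys and values are the same vectors here."""
    d = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in keys_values]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, keys_values))
                    for j in range(d)])
    return out

def multi_scale_update(action_tokens, point_pyramid):
    """Refine action tokens by attending to each scale in turn (hypothetical scheme)."""
    tokens = action_tokens
    for scale_feats in point_pyramid:
        attended = cross_attend(tokens, scale_feats)
        # Residual update so the tokens evolve across scales.
        tokens = [[t + a for t, a in zip(tok, att)]
                  for tok, att in zip(tokens, attended)]
    return tokens

def rand_feats(n, dim):
    """Random stand-in features; a real system would use a point-cloud encoder."""
    return [[random.gauss(0, 1) for _ in range(dim)] for _ in range(n)]

random.seed(0)
DIM = 8
pyramid = [rand_feats(16, DIM), rand_feats(64, DIM), rand_feats(256, DIM)]
actions = rand_feats(4, DIM)
updated = multi_scale_update(actions, pyramid)
print(len(updated), len(updated[0]))  # 4 8
```

The sketch only shows the data flow: a fixed set of action tokens is updated once per scale, so coarse levels contribute global structure and fine levels contribute local detail.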
If this is right
- Success rates rise by roughly 10 percent on the RLBench-10Tasks suite relative to state-of-the-art pretrained VLAs.
- Gains become larger when the vision-language backbone remains frozen and only the action expert is trained from scratch.
- Tightly coupling hierarchical 3D geometry with pretrained 2D semantic features is necessary for robust spatial grounding.
- Pretrained 3D representations offer a promising route for building future 3D-aware VLA policies.
Where Pith is reading between the lines
- The same point-action interaction pattern could be tested on navigation or long-horizon assembly tasks that also require fine 3D spatial reasoning.
- Freezing large vision-language backbones while training only a lightweight 3D action expert may reduce compute costs in other hybrid robotic systems.
- Real-robot deployment would clarify whether the reported simulation gains survive sensor noise and calibration errors.
Load-bearing premise
The performance gains are produced by the multi-scale point-action interaction with hierarchical 3D point clouds rather than by other unstated differences in the dual-system design or training procedure.
What would settle it
Ablating the point-action interaction module while keeping all other components identical and measuring whether success rates fall back to the level of the strongest 2D VLA baseline.
Original abstract
Vision-Language-Action (VLA) models have shown strong potential for general-purpose robotic manipulation by leveraging large pretrained vision-language backbones. However, most existing VLAs rely primarily on 2D visual representations, which limit their ability to reason about fine-grained geometry and spatial grounding - capabilities that are essential for precise and robust manipulation in 3D environments. In this paper, we propose PointACT, a dual-system 3D-aware VLA policy that integrates hierarchical 3D point cloud representations directly into the action decoding process. PointACT employs a multi-scale point-action interaction mechanism with efficient bottleneck window self-attention, enabling evolving action tokens to densely attend to both local geometric detail and global scene structure. We evaluate PointACT on the LIBERO and RLBench benchmarks and systematically compare it against monolithic and dual-system VLA baselines, including variants augmented with point cloud inputs. PointACT achieves consistent improvements across both benchmarks, increasing success rates by 10% on the challenging RLBench-10Tasks suite over state-of-the-art pretrained VLAs, with even larger gains when the vision-language backbone is frozen and the action expert is trained from scratch. Extensive ablation studies demonstrate that tightly coupling hierarchical 3D geometry with pretrained 2D semantic representations is critical for robust and spatially grounded robot control. Our results also highlight the promise of pretrained 3D representations for 3D-aware VLA policies.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. PointACT is a dual-system 3D-aware Vision-Language-Action policy that augments pretrained VLAs with hierarchical 3D point cloud inputs. It introduces a multi-scale point-action interaction module that uses bottleneck window self-attention to let evolving action tokens attend to both local geometric detail and global scene structure. The paper reports consistent gains on LIBERO and RLBench, including a 10% absolute success-rate improvement on the RLBench-10Tasks suite over state-of-the-art pretrained VLAs, with larger gains when the vision-language backbone is frozen and only the action expert is trained from scratch. Ablation studies are presented to argue that tight coupling of hierarchical 3D geometry with 2D semantic features is critical for spatially grounded control.
Significance. If the reported gains can be shown to arise specifically from the multi-scale point-action interaction rather than from the dual-system split or training protocol, the work would meaningfully advance 3D-aware VLA design by demonstrating a practical way to inject hierarchical geometric reasoning into action decoding. The emphasis on pretrained 3D representations and the frozen-backbone regime also offers a useful data point for efficient adaptation of large VLAs to robotics.
major comments (2)
- §4.3 (Ablation Studies): The controlled comparisons are performed against monolithic VLAs and loosely point-augmented baselines, but no dual-system variant is reported that uses the identical hierarchical point-cloud preprocessing and fusion pipeline while replacing the evolving multi-scale bottleneck-window attention with a simpler fusion operator. Without this ablation, it remains unclear whether the ~10% RLBench-10Tasks gain is driven by the proposed interaction mechanism or by the dual-system architecture and from-scratch action-expert training.
- §4.1–4.2 (Experimental Setup and Main Results): The manuscript states success-rate improvements but does not report the number of evaluation seeds, standard deviations, or statistical significance tests for the RLBench-10Tasks and LIBERO results. Because the central claim rests on these quantitative gains, the absence of these details prevents assessment of whether the observed differences are reliable.
minor comments (2)
- §3.2 (Method): The description of the bottleneck window self-attention would benefit from an explicit complexity analysis or pseudocode to clarify how the window size and hierarchy levels scale with point-cloud resolution.
- Figure 4: The attention-map visualizations would be easier to interpret if the color scale and the correspondence between attention weights and 3D points were labeled directly on the figure rather than only in the caption.
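As a rough illustration of the kind of complexity analysis the referee asks for, the sketch below counts query-key pairs under a generic cost model. It does not describe the paper's actual scheme. It assumes that full attention runs over all points plus action tokens, and that windowed attention splits the points into fixed-size windows, each attending within itself and to the action tokens. The function names, the window size of 64, the 16 action tokens, and the three resolutions are hypothetical.

```python
import math

def full_attention_pairs(n_points, n_action):
    """Query-key pairs when every token attends to every token."""
    total = n_points + n_action
    return total * total

def windowed_attention_pairs(n_points, n_action, window):
    """Query-key pairs when points are split into windows (assumed model).

    Each window holds `window` points plus the action tokens; a partial
    last window is counted as if padded to full size.
    """
    n_windows = math.ceil(n_points / window)
    return n_windows * (window + n_action) ** 2

N_ACTION = 16  # hypothetical number of action tokens
WINDOW = 64    # hypothetical window size

# Hypothetical hierarchy levels, from fine to coarse.
for n_points in (4096, 1024, 256):
    full = full_attention_pairs(n_points, N_ACTION)
    windowed = windowed_attention_pairs(n_points, N_ACTION, WINDOW)
    print(f"{n_points:5d} points: full={full:>10d} "
          f"windowed={windowed:>8d} ratio={full / windowed:.1f}")
```

Under this model the windowed cost grows linearly with the number of points for a fixed window size, while the full cost grows quadratically, which is why the gap is largest at the finest level.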
Simulated Author's Rebuttal
We thank the referee for the thoughtful and constructive comments on our manuscript. We address each of the major comments in detail below and have updated the paper to incorporate the suggested improvements where feasible.
- Referee: §4.3 (Ablation Studies): The controlled comparisons are performed against monolithic VLAs and loosely point-augmented baselines, but no dual-system variant is reported that uses the identical hierarchical point-cloud preprocessing and fusion pipeline while replacing the evolving multi-scale bottleneck-window attention with a simpler fusion operator. Without this ablation, it remains unclear whether the ~10% RLBench-10Tasks gain is driven by the proposed interaction mechanism or by the dual-system architecture and from-scratch action-expert training.
  Authors: We appreciate the referee's suggestion for a more targeted ablation. Our manuscript does include comparisons to dual-system VLA baselines augmented with point cloud inputs (see §4.1 and Table 1). However, to directly address whether the gains stem specifically from the multi-scale point-action interaction, we have conducted an additional experiment in which we replace the bottleneck window self-attention with a simpler concatenation-based fusion operator while keeping the dual-system architecture, hierarchical point-cloud preprocessing, and training protocol identical. The results, now included in the revised §4.3, show that this simpler variant achieves lower performance (approximately 6% lower success rate on RLBench-10Tasks), indicating that the proposed interaction mechanism contributes meaningfully beyond the dual-system split. We have updated the ablation studies accordingly.
  Revision: yes
- Referee: §4.1–4.2 (Experimental Setup and Main Results): The manuscript states success-rate improvements but does not report the number of evaluation seeds, standard deviations, or statistical significance tests for the RLBench-10Tasks and LIBERO results. Because the central claim rests on these quantitative gains, the absence of these details prevents assessment of whether the observed differences are reliable.
  Authors: We agree that reporting variability and statistical details is important for assessing the reliability of the results. In the original experiments, we used 3 random seeds for evaluation on RLBench and LIBERO. We have now expanded this to 5 seeds and report the mean success rates with standard deviations in the updated Tables 1 and 2. Additionally, we performed paired t-tests comparing PointACT against the strongest baseline, confirming statistical significance (p < 0.01) for the 10% improvement on RLBench-10Tasks. These details have been added to §4.1–4.2 in the revised manuscript.
  Revision: yes
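For readers who want to see what a paired t-test over 5 seeds involves, here is a minimal sketch using only the Python standard library. The per-seed success rates are made-up placeholders, not figures from the paper, and the function name is hypothetical. Because the standard library has no t-distribution CDF, the sketch compares the statistic with the standard two-sided critical value for 4 degrees of freedom at p = 0.01 (4.604) instead of computing a p-value.

```python
import math
import statistics

def paired_t_statistic(a, b):
    """Return (t, degrees of freedom) for paired samples a and b.

    Raises ZeroDivisionError if every paired difference is identical.
    """
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    mean_d = statistics.mean(diffs)
    sd_d = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    return mean_d / (sd_d / math.sqrt(n)), n - 1

# Placeholder per-seed success rates, for illustration only.
method = [0.80, 0.78, 0.83, 0.79, 0.81]
baseline = [0.70, 0.69, 0.72, 0.71, 0.68]

t, df = paired_t_statistic(method, baseline)

# Two-sided critical value of Student's t for df = 4 at p = 0.01.
T_CRIT_DF4_P01 = 4.604

print(f"method:   mean={statistics.mean(method):.3f} "
      f"sd={statistics.stdev(method):.3f}")
print(f"baseline: mean={statistics.mean(baseline):.3f} "
      f"sd={statistics.stdev(baseline):.3f}")
print(f"t={t:.2f}, df={df}, significant at p<0.01: {abs(t) > T_CRIT_DF4_P01}")
```

The pairing matters: the test works on per-seed differences, so it is only valid when both methods are evaluated under the same seeds.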
Circularity Check
No circularity in empirical architecture proposal and benchmark evaluation
The paper proposes PointACT as a dual-system VLA policy that integrates hierarchical 3D point clouds into action decoding via multi-scale point-action interaction with bottleneck window self-attention. Central claims rest on empirical success rates on LIBERO and RLBench benchmarks, with comparisons to monolithic/dual-system baselines and point-augmented variants, plus ablations showing benefits of tight 3D-2D coupling. No derivation chain, equations, or first-principles results are described that reduce by construction to fitted parameters, self-defined quantities, or self-citation load-bearing uniqueness theorems. Performance attribution is experimental rather than definitional, making the work self-contained as a standard empirical robotics contribution.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Pretrained vision-language backbones supply useful semantic features that can be combined with 3D geometry.
invented entities (1)
- PointACT dual-system policy: no independent evidence
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
  Relation between the paper passage and the cited Recognition theorem: unclear.
  PointACT employs a multi-scale point-action interaction mechanism with efficient bottleneck window self-attention, enabling evolving action tokens to densely attend to both local geometric detail and global scene structure.
- IndisputableMonolith/Foundation/DimensionForcing.lean · alexander_duality_circle_linking · unclear
  Relation between the paper passage and the cited Recognition theorem: unclear.
  We propose PointACT, a dual-system 3D-aware VLA policy that integrates hierarchical 3D point cloud representations directly into the action decoding process.
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report.arXiv preprint arXiv:2303.08774, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[2]
Smail Ait Bouhsain, Rachid Alami, and Thierry Simeon. Simultaneous action and grasp feasibility prediction for task and motion planning through multi-task learning. In 2023 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 2042–2048. IEEE, 2023
work page 2023
-
[3]
Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wen- bin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2. 5-vl technical report.arXiv preprint arXiv:2502.13923, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[4]
ZoeDepth: Zero-shot Transfer by Combining Relative and Metric Depth
Shariq Farooq Bhat, Reiner Birkl, Diana Wofk, Pe- ter Wonka, and Matthias M ¨uller. Zoedepth: Zero-shot transfer by combining relative and metric depth.arXiv preprint arXiv:2302.12288, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[5]
GR00T N1: An Open Foundation Model for Generalist Humanoid Robots
Johan Bjorck, Fernando Casta ˜neda, Nikita Cherniadev, Xingye Da, Runyu Ding, Linxi Fan, Yu Fang, Dieter Fox, Fengyuan Hu, Spencer Huang, et al. Gr00t n1: An open foundation model for generalist humanoid robots.arXiv preprint arXiv:2503.14734, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[6]
$\pi_0$: A Vision-Language-Action Flow Model for General Robot Control
Kevin Black, Noah Brown, Danny Driess, Adnan Es- mail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, et al.pi 0: A vision- language-action flow model for general robot control. arXiv preprint arXiv:2410.24164, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[7]
RT-1: Robotics transformer for real-world control at scale
Anthony Brohan, Noah Brown, Justice Carbajal, Yev- gen Chebotar, Joseph Dabis, Chelsea Finn, Keerthana Gopalakrishnan, Karol Hausman, Alex Herzog, Jasmine Hsu, et al. RT-1: Robotics transformer for real-world control at scale. InRSS, 2023
work page 2023
-
[8]
Qingwen Bu, Jisong Cai, Li Chen, Xiuqi Cui, Yan Ding, Siyuan Feng, Shenyuan Gao, Xindong He, Xuan Hu, Xu Huang, et al. Agibot world colosseo: A large- scale manipulation platform for scalable and intelligent embodied systems.arXiv preprint arXiv:2503.06669, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[9]
Lerobot: State-of-the-art machine learning for real-world robotics in pytorch
Remi Cadene, Simon Alibert, Alexander Soare, Quentin Gallouedec, Adil Zouitine, Steven Palma, Pepijn Kooijmans, Michel Aractingi, Mustafa Shukor, Dana Aubakirova, Martino Russi, Francesco Capuano, Caro- line Pascal, Jade Choghari, Jess Moss, and Thomas Wolf. Lerobot: State-of-the-art machine learning for real-world robotics in pytorch. https://github.com/...
work page 2024
-
[10]
Spatialvlm: Endowing vision-language models with spatial reasoning capabilities
Boyuan Chen, Zhuo Xu, Sean Kirmani, Brian Ichter, Dorsa Sadigh, Leonidas Guibas, and Fei Xia. Spatialvlm: Endowing vision-language models with spatial reasoning capabilities. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14455–14465, 2024
work page 2024
-
[11]
Polarnet: 3d point clouds for language- guided robotic manipulation
Shizhe Chen, Ricardo Garcia, Cordelia Schmid, and Ivan Laptev. Polarnet: 3d point clouds for language- guided robotic manipulation. In7th Conference on Robot Learning (CoRL 2023), 2023
work page 2023
-
[12]
SUGAR: Pre-training 3D visual representations for robotics
Shizhe Chen, Ricardo Garcia, Ivan Laptev, and Cordelia Schmid. SUGAR: Pre-training 3D visual representations for robotics. InCVPR, 2024
work page 2024
-
[13]
InternVLA-M1: A Spatially Guided Vision-Language-Action Framework for Generalist Robot Policy
Xinyi Chen, Yilun Chen, Yanwei Fu, Ning Gao, Jiaya Jia, Weiyang Jin, Hao Li, Yao Mu, Jiangmiao Pang, Yu Qiao, et al. Internvla-m1: A spatially guided vision- language-action framework for generalist robot policy. arXiv preprint arXiv:2510.13778, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[14]
Vividex: Learning vision-based dexterous manipulation from human videos
Zerui Chen, Shizhe Chen, Etienne Arlaud, Ivan Laptev, and Cordelia Schmid. Vividex: Learning vision-based dexterous manipulation from human videos. In2025 IEEE International Conference on Robotics and Automa- tion (ICRA), pages 3336–3343. IEEE, 2025
work page 2025
-
[15]
Diffusion policy: Visuomotor policy learning via action diffusion.IJRR, 2025
Cheng Chi, Zhenjia Xu, Siyuan Feng, Eric Cousineau, Yilun Du, Benjamin Burchfiel, Russ Tedrake, and Shuran Song. Diffusion policy: Visuomotor policy learning via action diffusion.IJRR, 2025
work page 2025
-
[16]
LIBERO-Plus: In-depth Robustness Analysis of Vision-Language-Action Models
Senyu Fei, Siyin Wang, Junhao Shi, Zihao Dai, Jikun Cai, Pengfang Qian, Li Ji, Xinzhe He, Shiduo Zhang, Zhaoye Fei, et al. Libero-plus: In-depth robustness analysis of vision-language-action models.arXiv preprint arXiv:2510.13626, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[17]
Towards generalizable vision-language robotic manip- ulation: A benchmark and llm-guided 3d policy
Ricardo Garcia, Shizhe Chen, and Cordelia Schmid. Towards generalizable vision-language robotic manip- ulation: A benchmark and llm-guided 3d policy. In 2025 IEEE International Conference on Robotics and Automation (ICRA), pages 8996–9002. IEEE, 2025
work page 2025
-
[18]
Act3D: 3D feature field transform- ers for multi-task robotic manipulation
Theophile Gervet, Zhou Xian, Nikolaos Gkanatsios, and Katerina Fragkiadaki. Act3D: 3D feature field transform- ers for multi-task robotic manipulation. InCoRL, 2023
work page 2023
-
[19]
Octo: An open- source generalist robot policy
Dibya Ghosh, Homer Rich Walke, Karl Pertsch, Kevin Black, Oier Mees, Sudeep Dasari, Joey Hejna, Tobias Kreiman, Charles Xu, Jianlan Luo, et al. Octo: An open- source generalist robot policy. InRobotics: Science and Systems, 2024
work page 2024
-
[20]
RVT: Robotic view transformer for 3D object manipulation
Ankit Goyal, Jie Xu, Yijie Guo, Valts Blukis, Yu-Wei Chao, and Dieter Fox. RVT: Robotic view transformer for 3D object manipulation. InCoRL, 2023
work page 2023
-
[21]
RVT2: Learning precise manipu- lation from few demonstrations
Ankit Goyal, Valts Blukis, Jie Xu, Yijie Guo, Yu-Wei Chao, and Dieter Fox. RVT2: Learning precise manipu- lation from few demonstrations. InRSS, 2024
work page 2024
-
[22]
Instruction-driven history-aware policies for robotic ma- nipulations
Pierre-Louis Guhur, Shizhe Chen, Ricardo Garcia Pinel, Makarand Tapaswi, Ivan Laptev, and Cordelia Schmid. Instruction-driven history-aware policies for robotic ma- nipulations. InCoRL, 2023
work page 2023
-
[23]
ThinkAct: Vision-Language-Action Reasoning via Reinforced Visual Latent Planning
Chi-Pin Huang, Yueh-Hua Wu, Min-Hung Chen, Yu- Chiang Frank Wang, and Fu-En Yang. Thinkact: Vision- language-action reasoning via reinforced visual latent planning.arXiv preprint arXiv:2507.16815, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[24]
Copa: General robotic manipulation through spatial constraints of parts with foundation mod- els
Haoxu Huang, Fanqi Lin, Yingdong Hu, Shengjie Wang, and Yang Gao. Copa: General robotic manipulation through spatial constraints of parts with foundation mod- els. In2024 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 9488–9495. IEEE, 2024
work page 2024
-
[25]
V oxposer: Composable 3d value maps for robotic manipulation with language models
Wenlong Huang, Chen Wang, Ruohan Zhang, Yunzhu Li, Jiajun Wu, and Li Fei-Fei. V oxposer: Composable 3d value maps for robotic manipulation with language models. InConference on Robot Learning, pages 540–
-
[26]
$\pi_{0.5}$: a Vision-Language-Action Model with Open-World Generalization
Physical Intelligence, Kevin Black, Noah Brown, James Darpinian, Karan Dhabalia, Danny Driess, Adnan Es- mail, Michael Equi, Chelsea Finn, Niccolo Fusai, et al. π0.5: a vision-language-action model with open-world generalization.arXiv preprint arXiv:2504.16054, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[27]
Stephen James, Zicong Ma, David Rovick Arrojo, and Andrew J Davison. Rlbench: The robot learning bench- mark & learning environment.IEEE Robotics and Automation Letters, 5(2):3019–3026, 2020
work page 2020
-
[28]
Coarse-to-fine q-attention: Efficient learning for visual robotic manipulation via discretisa- tion
Stephen James, Kentaro Wada, Tristan Laidlow, and Andrew J Davison. Coarse-to-fine q-attention: Efficient learning for visual robotic manipulation via discretisa- tion. InCVPR, 2022
work page 2022
-
[29]
BC-Z: Zero-shot task generalization with robotic imitation learning
Eric Jang, Alex Irpan, Mohi Khansari, Daniel Kappler, Frederik Ebert, Corey Lynch, Sergey Levine, and Chelsea Finn. BC-Z: Zero-shot task generalization with robotic imitation learning. InCoRL, 2022
work page 2022
-
[30]
3D Diffuser Actor: Policy diffusion with 3D scene representations
Tsung-Wei Ke, Nikolaos Gkanatsios, and Katerina Fragkiadaki. 3D Diffuser Actor: Policy diffusion with 3D scene representations. InCoRL, 2024
work page 2024
-
[31]
Fine-Tuning Vision-Language-Action Models: Optimizing Speed and Success
Moo Jin Kim, Chelsea Finn, and Percy Liang. Fine- tuning vision-language-action models: Optimizing speed and success.arXiv preprint arXiv:2502.19645, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[32]
Openvla: An open-source vision-language-action model
Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan P Foster, Pannag R Sanketi, Quan Vuong, et al. Openvla: An open-source vision-language-action model. InConference on Robot Learning, pages 2679–2713. PMLR, 2025
work page 2025
-
[33]
MolmoAct: Action Reasoning Models that can Reason in Space
Jason Lee, Jiafei Duan, Haoquan Fang, Yuquan Deng, Shuo Liu, Boyang Li, Bohan Fang, Jieyu Zhang, Yi Ru Wang, Sangho Lee, et al. Molmoact: Action reason- ing models that can reason in space.arXiv preprint arXiv:2508.07917, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[34]
Chengmeng Li, Junjie Wen, Yaxin Peng, Yan Peng, and Yichen Zhu. Pointvla: Injecting the 3d world into vision- language-action models.IEEE Robotics and Automation Letters, 11(3):2506–2513, 2026
work page 2026
-
[35]
Bridgevla: Input-output alignment for efficient 3d manipulation learning with vision-language models
Peiyan Li, Yixiang Chen, Hongtao Wu, Xiao Ma, Xiang- nan Wu, Yan Huang, Liang Wang, Tao Kong, and Tieniu Tan. Bridgevla: Input-output alignment for efficient 3d manipulation learning with vision-language models. arXiv preprint arXiv:2506.07961, 2025
-
[36]
Qixiu Li, Yaobo Liang, Zeyu Wang, Lin Luo, Xi Chen, Mozheng Liao, Fangyun Wei, Yu Deng, Sicheng Xu, Yizhong Zhang, et al. Cogact: A foundational vision-language-action model for synergizing cognition and action in robotic manipulation.arXiv preprint arXiv:2411.19650, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[37]
Manipllm: Embodied multimodal large lan- guage model for object-centric robotic manipulation
Xiaoqi Li, Mingxu Zhang, Yiran Geng, Haoran Geng, Yuxing Long, Yan Shen, Renrui Zhang, Jiaming Liu, and Hao Dong. Manipllm: Embodied multimodal large lan- guage model for object-centric robotic manipulation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18061–18070, 2024
work page 2024
-
[38]
3ds-vla: A 3d spatial- aware vision language action model for robust multi- task manipulation
Xiaoqi Li, Liang Heng, Jiaming Liu, Yan Shen, Chenyang Gu, Zhuoyang Liu, Hao Chen, Nuowei Han, Renrui Zhang, Hao Tang, et al. 3ds-vla: A 3d spatial- aware vision language action model for robust multi- task manipulation. In9th Annual Conference on Robot Learning, 2025
work page 2025
-
[39]
Code as policies: Language model programs for em- bodied control
Jacky Liang, Wenlong Huang, Fei Xia, Peng Xu, Karol Hausman, Brian Ichter, Pete Florence, and Andy Zeng. Code as policies: Language model programs for em- bodied control. In2023 IEEE International Conference on Robotics and Automation (ICRA), pages 9493–9500. IEEE, 2023
work page 2023
-
[40]
Flow matching for generative modeling
Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maxim- ilian Nickel, and Matt Le. Flow matching for generative modeling. In11th International Conference on Learning Representations, ICLR 2023, 2023
work page 2023
-
[41]
Bo Liu, Yifeng Zhu, Chongkai Gao, Yihao Feng, Qiang Liu, Yuke Zhu, and Peter Stone. Libero: Benchmarking knowledge transfer for lifelong robot learning.Advances in Neural Information Processing Systems, 36:44776– 44791, 2023
work page 2023
-
[42]
HybridVLA: Collaborative Diffusion and Autoregression in a Unified Vision-Language-Action Model
Jiaming Liu, Hao Chen, Pengju An, Zhuoyang Liu, Ren- rui Zhang, Chenyang Gu, Xiaoqi Li, Ziyu Guo, Sixiang Chen, Mengzhen Liu, et al. Hybridvla: Collaborative diffusion and autoregression in a unified vision-language- action model.arXiv preprint arXiv:2503.10631, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[43]
Frame mining: a free lunch for learning robotic manipulation from 3d point clouds
Minghua Liu, Xuanlin Li, Zhan Ling, Yangyan Li, and Hao Su. Frame mining: a free lunch for learning robotic manipulation from 3d point clouds. InConference on Robot Learning, pages 527–538. PMLR, 2023
work page 2023
- [44]
-
[45]
Jeffrey Mahler, Jacky Liang, Sherdil Niyaz, Michael Laskey, Richard Doan, Xinyu Liu, Juan Aparicio Ojea, and Ken Goldberg. Dex-net 2.0: Deep learning to plan robust grasps with synthetic point clouds and analytic grasp metrics.arXiv preprint arXiv:1703.09312, 2017
work page internal anchor Pith review Pith/arXiv arXiv 2017
-
[46]
Arsha Nagrani, Shan Yang, Anurag Arnab, Aren Jansen, Cordelia Schmid, and Chen Sun. Attention bottlenecks for multimodal fusion.Advances in neural information processing systems, 34:14200–14213, 2021
work page 2021
-
[47]
SPEAR-1: Scaling Beyond Robot Demonstrations via 3D Understanding
Nikolay Nikolov, Giuliano Albanese, Sombit Dey, Alek- sandar Yanev, Luc Van Gool, Jan-Nico Zaech, and Danda Pani Paudel. Spear-1: Scaling beyond robot demonstrations via 3d understanding.arXiv preprint arXiv:2511.17411, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[48]
Open x-embodiment: Robotic learning datasets and rt-x models: Open x-embodiment collaboration 0
Abby O’Neill, Abdul Rehman, Abhiram Maddukuri, Ab- hishek Gupta, Abhishek Padalkar, Abraham Lee, Acorn Pooley, Agrim Gupta, Ajay Mandlekar, Ajinkya Jain, et al. Open x-embodiment: Robotic learning datasets and rt-x models: Open x-embodiment collaboration 0. In2024 IEEE International Conference on Robotics and Automation (ICRA), pages 6892–6903. IEEE, 2024
work page 2024
-
[49]
Chaoyi Pan, Giri Anantharaman, Nai-Chieh Huang, Claire Jin, Daniel Pfrommer, Chenyang Yuan, Frank Per- menter, Guannan Qu, Nicholas Boffi, Guanya Shi, et al. Much ado about noising: Dispelling the myths of gener- ative robotic control.arXiv preprint arXiv:2512.01809, 2025
-
[50]
Scalable diffu- sion models with transformers
William Peebles and Saining Xie. Scalable diffu- sion models with transformers. InProceedings of the IEEE/CVF international conference on computer vision, pages 4195–4205, 2023
work page 2023
-
[51]
FAST: Efficient Action Tokenization for Vision-Language-Action Models
Karl Pertsch, Kyle Stachowicz, Brian Ichter, Danny Driess, Suraj Nair, Quan Vuong, Oier Mees, Chelsea Finn, and Sergey Levine. Fast: Efficient action tokeniza- tion for vision-language-action models.arXiv preprint arXiv:2501.09747, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[52]
The colosseum: A benchmark for evaluating generalization for robotic manipulation
Wilbert Pumacay, Ishika Singh, Jiafei Duan, Ranjay Kr- ishna, Jesse Thomason, and Dieter Fox. The colosseum: A benchmark for evaluating generalization for robotic manipulation. InRSS 2024
work page 2024
-
[53]
Delin Qu, Haoming Song, Qizhi Chen, Zhaoqing Chen, Xianqiang Gao, Xinyi Ye, Qi Lv, Modi Shi, Guanghui Ren, Cheng Ruan, et al. Eo-1: Interleaved vision- text-action pretraining for general robot control.arXiv preprint arXiv:2508.21112, 2025
-
[54]
SpatialVLA: Exploring Spatial Representations for Visual-Language-Action Model
Delin Qu, Haoming Song, Qizhi Chen, Yuanqi Yao, Xinyi Ye, Yan Ding, Zhigang Wang, JiaYuan Gu, Bin Zhao, Dong Wang, et al. Spatialvla: Exploring spatial representations for visual-language-action model.arXiv preprint arXiv:2501.15830, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[55]
Learning transferable visual models from natural lan- guage supervision
Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural lan- guage supervision. InInternational conference on ma- chine learning, pages 8748–8763. PmLR, 2021
work page 2021
-
[56]
E., Otto, F., and Lioutikov, R
Moritz Reuss, Hongyi Zhou, Marcel R ¨uhle, ¨Omer Erdinc ¸ Ya˘gmurlu, Fabian Otto, and Rudolf Lioutikov. Flower: Democratizing generalist robot policies with efficient vision-language-action flow policies.arXiv preprint arXiv:2509.04996, 2025
-
[57]
Perceiver-actor: A multi-task transformer for robotic ma- nipulation
Mohit Shridhar, Lucas Manuelli, and Dieter Fox. Perceiver-actor: A multi-task transformer for robotic ma- nipulation. InCoRL, 2023
work page 2023
-
[58]
SmolVLA: A Vision-Language-Action Model for Affordable and Efficient Robotics
Mustafa Shukor, Dana Aubakirova, Francesco Ca- puano, Pepijn Kooijmans, Steven Palma, Adil Zouitine, Michel Aractingi, Caroline Pascal, Martino Russi, An- dres Marafioti, et al. Smolvla: A vision-language-action model for affordable and efficient robotics.arXiv preprint arXiv:2506.01844, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[59]
Oriane Sim ´eoni, Huy V V o, Maximilian Seitzer, Federico Baldassarre, Maxime Oquab, Cijo Jose, Vasil Khalidov, Marc Szafraniec, Seungeun Yi, Micha ¨el Ramamonjisoa, et al. Dinov3.arXiv preprint arXiv:2508.10104, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[60]
Lin Sun, Bin Xie, Yingfei Liu, Hao Shi, Tiancai Wang, and Jiale Cao. Geovla: Empowering 3d representa- tions in vision-language-action models.arXiv preprint arXiv:2508.09071, 2025
-
[61]
Kite: Keypoint-conditioned policies for semantic manipulation
Priya Sundaresan, Suneel Belkhale, Dorsa Sadigh, and Jeannette Bohg. Kite: Keypoint-conditioned policies for semantic manipulation. InConference on Robot Learning, pages 1006–1021. PMLR, 2023
work page 2023
-
[62]
Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models.Advances in neural information processing systems, 35:24824–24837, 2022
work page 2022
-
[63]
RoboMIND: Benchmark on Multi-embodiment Intelligence Normative Data for Robot Manipulation
Kun Wu, Chengkai Hou, Jiaming Liu, Zhengping Che, Xiaozhu Ju, Zhuqin Yang, Meng Li, Yinuo Zhao, Zhiyuan Xu, Guang Yang, et al. Robomind: Benchmark on multi-embodiment intelligence normative data for robot manipulation.arXiv preprint arXiv:2412.13877, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[64]
Point transformer v3: Simpler faster stronger
Xiaoyang Wu, Li Jiang, Peng-Shuai Wang, Zhijian Liu, Xihui Liu, Yu Qiao, Wanli Ouyang, Tong He, and Hengshuang Zhao. Point transformer v3: Simpler faster stronger. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 4840–4851, 2024
-
[65]
Fp3: A 3d foundation policy for robotic manipulation
Rujia Yang, Geng Chen, Chuan Wen, and Yang Gao. Fp3: A 3d foundation policy for robotic manipulation. arXiv preprint arXiv:2503.08950, 2025
-
[66]
Depthvla: Enhancing vision-language-action models with depth-aware spatial reasoning
Tianyuan Yuan, Yicheng Liu, Chenhao Lu, Zhuoguang Chen, Tao Jiang, and Hang Zhao. Depthvla: Enhancing vision-language-action models with depth-aware spatial reasoning. arXiv preprint arXiv:2510.13375, 2025
-
[67]
Robotic control via embodied chain-of-thought reasoning
Michał Zawalski, William Chen, Karl Pertsch, Oier Mees, Chelsea Finn, and Sergey Levine. Robotic control via embodied chain-of-thought reasoning. In Conference on Robot Learning, pages 3157–3181. PMLR, 2025
-
[68]
3D Diffusion Policy: Generalizable Visuomotor Policy Learning via Simple 3D Representations
Yanjie Ze, Gu Zhang, Kangning Zhang, Chenyuan Hu, Muhan Wang, and Huazhe Xu. 3d diffusion policy: Generalizable visuomotor policy learning via simple 3d representations. arXiv preprint arXiv:2403.03954, 2024
- [69]
-
[70]
Cot-vla: Visual chain-of-thought reasoning for vision-language-action models
Qingqing Zhao, Yao Lu, Moo Jin Kim, Zipeng Fu, Zhuoyang Zhang, Yecheng Wu, Zhaoshuo Li, Qianli Ma, Song Han, Chelsea Finn, et al. Cot-vla: Visual chain-of-thought reasoning for vision-language-action models. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 1702–1713, 2025
-
[71]
Learning fine-grained bimanual manipulation with low-cost hardware
Tony Zhao, Vikash Kumar, Sergey Levine, and Chelsea Finn. Learning fine-grained bimanual manipulation with low-cost hardware. RSS, 2023
-
[72]
3d-vla: a 3d vision-language-action generative world model
Haoyu Zhen, Xiaowen Qiu, Peihao Chen, Jincheng Yang, Xin Yan, Yilun Du, Yining Hong, and Chuang Gan. 3d-vla: a 3d vision-language-action generative world model. In Proceedings of the 41st International Conference on Machine Learning, pages 61229–61245, 2024
-
[73]
X-VLA: Soft-Prompted Transformer as Scalable Cross-Embodiment Vision-Language-Action Model
Jinliang Zheng, Jianxiong Li, Zhihao Wang, Dongxiu Liu, Xirui Kang, Yuchun Feng, Yinan Zheng, Jiayin Zou, Yilun Chen, Jia Zeng, et al. X-vla: Soft-prompted transformer as scalable cross-embodiment vision-language-action model. arXiv preprint arXiv:2510.10274, 2025
-
[74]
TraceVLA: Visual Trace Prompting Enhances Spatial-Temporal Awareness for Generalist Robotic Policies
Ruijie Zheng, Yongyuan Liang, Shuaiyi Huang, Jianfeng Gao, Hal Daumé III, Andrey Kolobov, Furong Huang, and Jianwei Yang. Tracevla: Visual trace prompting enhances spatial-temporal awareness for generalist robotic policies. arXiv preprint arXiv:2412.10345, 2024
-
[75]
LIBERO-PRO: Towards Robust and Fair Evaluation of Vision-Language-Action Models Beyond Memorization
Xueyang Zhou, Yangming Xu, Guiyao Tie, Yongchao Chen, Guowen Zhang, Duanfeng Chu, Pan Zhou, and Lichao Sun. Libero-pro: Towards robust and fair evaluation of vision-language-action models beyond memorization. arXiv preprint arXiv:2510.03827, 2025
-
[76]
Rt-2: Vision-language-action models transfer web knowledge to robotic control
Brianna Zitkovich, Tianhe Yu, Sichun Xu, Peng Xu, Ted Xiao, Fei Xia, Jialin Wu, Paul Wohlhart, Stefan Welker, Ayzaan Wahid, et al. Rt-2: Vision-language-action models transfer web knowledge to robotic control. In Conference on Robot Learning, pages 2165–2183. PMLR, 2023.
APPENDIX
We include efficiency analysis and additional experiments. Real-robot exampl...