Spatially Prompted Visual Trajectory Prediction for Egocentric Manipulation
Pith reviewed 2026-05-20 05:29 UTC · model grok-4.3
The pith
First-frame spatial prompts allow models to forecast end-effector trajectories more reliably across changing egocentric scenes.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
SP-VTP defines task objectives through static first-frame spatial prompts while the scene evolves, and SPOT solves it by fusing a task encoder for visual and coordinate prompts, an observation encoder for current views plus history, and a trajectory generator that outputs future end-effector motion; under scene-level splits this yields higher accuracy than non-prompted or single-source baselines on the EgoSPT dataset.
What carries the argument
SPOT (Spatially Prompted Object-Target Policy), which encodes first-frame visual and coordinate prompts separately from current visual observations and history, then generates future end-effector trajectories.
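The three-part data flow described above can be sketched as plain functions. This is an interface-level illustration only, not the authors' implementation: the names `SpatialPrompt`, `encode_task`, `encode_observation`, and `generate_trajectory` are hypothetical, the arithmetic inside each function is a placeholder for a learned network, and visual inputs are omitted so that only coordinates and end-effector history are shown.

```python
from dataclasses import dataclass
from typing import List, Tuple

Box = Tuple[float, float, float, float]   # (x_min, y_min, x_max, y_max) in first-frame pixels
Point3 = Tuple[float, float, float]       # end-effector position (x, y, z)


@dataclass
class SpatialPrompt:
    """Hypothetical first-frame prompt: one box for the object, one for the target."""
    object_box: Box
    target_box: Box


def encode_task(prompt: SpatialPrompt) -> List[float]:
    """Placeholder task encoder: box centers stand in for learned prompt features."""
    feats: List[float] = []
    for x0, y0, x1, y1 in (prompt.object_box, prompt.target_box):
        feats += [(x0 + x1) / 2.0, (y0 + y1) / 2.0]
    return feats


def encode_observation(history: List[Point3]) -> List[float]:
    """Placeholder observation encoder: latest position and latest velocity."""
    prev, cur = history[-2], history[-1]
    velocity = [c - p for c, p in zip(cur, prev)]
    return list(cur) + velocity


def generate_trajectory(task: List[float], obs: List[float], horizon: int) -> List[Point3]:
    """Placeholder generator: constant-velocity rollout.

    A learned generator would condition on `task`; this stand-in ignores it
    and only shows where the task features enter the call.
    """
    pos, vel = obs[:3], obs[3:6]
    return [tuple(p + (k + 1) * v for p, v in zip(pos, vel)) for k in range(horizon)]


# The prompt is computed once from the first frame; the observation changes every step.
prompt = SpatialPrompt(object_box=(10, 20, 30, 40), target_box=(100, 120, 140, 160))
history = [(0.0, 0.0, 0.0), (0.1, 0.0, 0.05)]
future = generate_trajectory(encode_task(prompt), encode_observation(history), horizon=3)
```

The point of the split is that `encode_task` runs once per episode while `encode_observation` runs at every step, which is where the static-versus-evolving tension discussed later in this review arises.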
If this is right
- Robotic systems can receive manipulation goals through simple pointing or boxing gestures instead of language or task IDs.
- Cross-scene generalization improves because the prompt supplies explicit object and target locations rather than relying on learned scene priors.
- The static-prompt setting scales to cluttered environments where multiple similar objects exist.
- Trajectory prediction becomes a direct, vision-centric output rather than an intermediate step in a larger planning pipeline.
Where Pith is reading between the lines
- The same first-frame prompting mechanism could be paired with online visual servoing to correct trajectories when objects shift unexpectedly.
- Extending the prompt to include 3D depth or surface normals at the boxed locations might further reduce ambiguity in placement tasks.
- The EgoSPT collection protocol could be reused to benchmark hybrid language-plus-spatial conditioning for more complex multi-step manipulations.
Load-bearing premise
The initial spatial prompt on the first frame continues to specify the correct object and goal even after the scene configuration and object positions have changed during the trajectory.
What would settle it
A controlled test in which the same first-frame prompt is used but the target object is moved or occluded midway through the sequence, checking whether prediction error rises sharply compared with an updated-prompt baseline.
Original abstract
Robotic manipulation is often specified through language instructions or task identifiers, yet cluttered environments with similar objects are better handled by spatially indicating what to move and where to place it. Addressing the vision-centric challenge of object and goal specification, we present, to the best of our knowledge, the first formalization of Spatially Prompted Visual Trajectory Prediction (SP-VTP). This novel setting utilizes initial spatial prompts (like bounding boxes or points) to define task objectives, tasking the model with forecasting future end-effector trajectories from egocentric streams. To study this problem, we collect and annotate EgoSPT, a dataset of egocentric spatially prompted manipulation trajectories with first-frame object and target grounding annotations and recovered 3D end-effector motion. SP-VTP is challenging because the task specification is static, while the scene configuration evolves over time. To solve this problem, we propose SPOT (Spatially Prompted Object-Target Policy), which combines a task encoder for first-frame visual and coordinate spatial prompts, an observation encoder for current visual and history context, and a trajectory generator for future end-effector motion. Experiments under strict scene-level splits show that SPOT improves cross-scene trajectory prediction over non-prompted or single-source prompted baselines. Together, EgoSPT and SPOT establish a new spatial prompting problem, SP-VTP, as a simple and scalable task condition for egocentric manipulation.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper formalizes Spatially Prompted Visual Trajectory Prediction (SP-VTP) as a vision-centric task for egocentric manipulation, where first-frame spatial prompts (bounding boxes or points) specify the object and placement goal. It introduces the EgoSPT dataset of annotated egocentric trajectories with recovered 3D end-effector motion and proposes the SPOT model, which encodes initial prompts via a dedicated task encoder while an observation encoder processes current frames and history to generate future trajectories. Experiments under scene-level splits report improvements over non-prompted and single-source prompted baselines.
Significance. If the empirical gains are shown to be robust, this work provides a practical and scalable task-specification mechanism for cluttered scenes where language is ambiguous. The EgoSPT dataset and the separation of task and observation encoders are clear contributions that could support further research in vision-based robotics. The scene-level split protocol is a strength for assessing generalization.
major comments (2)
- [§5] §5 (Experiments): The central claim of cross-scene improvement rests on quantitative gains, yet the manuscript provides no error bars, no details on baseline re-implementations, and no ablation on high-displacement or occlusion subsets. This leaves open whether reported gains derive from prompt utility or simply richer visual history, directly affecting the load-bearing assumption that first-frame prompts remain sufficient as scenes evolve.
- [§4.1] §4.1 (Model Architecture): The task encoder ingests only first-frame coordinates and visual prompts while the observation encoder handles evolving frames; no mechanism or analysis is described for maintaining object correspondence after grasp, displacement, or partial occlusion. This architectural choice is central to the cross-scene generalization claim but is not tested against the static-vs-evolving tension noted in the abstract.
minor comments (2)
- [Abstract] The abstract and §3 would benefit from an explicit statement of the precise metrics (e.g., ADE, FDE) and the numerical improvements, rather than the qualitative claim that SPOT 'improves' prediction.
- [§4] Notation for the task encoder output and its fusion with the observation encoder is introduced without accompanying equations, making the forward pass difficult to follow.
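For readers unfamiliar with the metrics the referee names as examples, the following is a minimal sketch of average displacement error (ADE) and final displacement error (FDE) for one predicted 3D trajectory. It uses the common definitions (mean per-step Euclidean distance, and the distance at the last step); whether the paper reports exactly these metrics is not stated in this review, and the function name `ade_fde` and the sample trajectories are illustrative assumptions.

```python
import math
from typing import List, Tuple

Point3 = Tuple[float, float, float]


def ade_fde(pred: List[Point3], gt: List[Point3]) -> Tuple[float, float]:
    """Return (ADE, FDE) for two equal-length trajectories."""
    if len(pred) != len(gt) or not pred:
        raise ValueError("trajectories must be non-empty and of equal length")
    # Euclidean distance between predicted and ground-truth position at each step.
    dists = [math.dist(p, g) for p, g in zip(pred, gt)]
    return sum(dists) / len(dists), dists[-1]


# Made-up trajectories: the per-step errors are 0, 1, and 2.
pred = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
gt = [(0.0, 0.0, 0.0), (1.0, 1.0, 0.0), (2.0, 2.0, 0.0)]
ade, fde = ade_fde(pred, gt)
```

Because the error grows along this toy trajectory, FDE exceeds ADE, which is one reason both numbers are usually reported together.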
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed comments. We address each major comment point by point below, with clear indications of planned revisions to the manuscript.
- Referee: [§5] §5 (Experiments): The central claim of cross-scene improvement rests on quantitative gains, yet the manuscript provides no error bars, no details on baseline re-implementations, and no ablation on high-displacement or occlusion subsets. This leaves open whether reported gains derive from prompt utility or simply richer visual history, directly affecting the load-bearing assumption that first-frame prompts remain sufficient as scenes evolve.
  Authors: We agree that the absence of error bars, implementation details, and subset ablations weakens the strength of the empirical claims. In the revised version we will add error bars reporting standard deviation across three independent training runs to all quantitative tables in Section 5. We will also expand the supplementary material with a dedicated subsection detailing the exact re-implementation choices, hyperparameters, and training schedules for every baseline. Finally, we will introduce new ablations that isolate performance on high-displacement and occlusion subsets; these results will be used to quantify the incremental benefit of the spatial prompts over visual history alone.
  revision: yes
- Referee: [§4.1] §4.1 (Model Architecture): The task encoder ingests only first-frame coordinates and visual prompts while the observation encoder handles evolving frames; no mechanism or analysis is described for maintaining object correspondence after grasp, displacement, or partial occlusion. This architectural choice is central to the cross-scene generalization claim but is not tested against the static-vs-evolving tension noted in the abstract.
  Authors: The SPOT design deliberately factors the problem into a static task encoder that receives only the first-frame prompts and a dynamic observation encoder that receives the evolving visual stream and history. This separation is intended to address the static-versus-evolving tension stated in the abstract. While we do not introduce an explicit object tracker, the observation encoder is expected to maintain implicit correspondence through learned visual features. We acknowledge that the manuscript currently lacks both a clear discussion of this design choice and supporting analysis. In revision we will add a paragraph to Section 4.1 explaining the implicit correspondence mechanism and will include qualitative trajectory visualizations on sequences exhibiting grasp, displacement, and partial occlusion to illustrate how the model behaves under these conditions.
  revision: partial
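The error bars promised in the first response (standard deviation across three independent training runs) can be computed along these lines. This is a minimal sketch under stated assumptions: the helper `summarize_runs` is hypothetical, the scores are made-up placeholders rather than results from the paper, and the sample standard deviation (n - 1 denominator) is one reasonable choice that the rebuttal does not specify.

```python
import statistics
from typing import Dict, List


def summarize_runs(scores: Dict[str, List[float]]) -> Dict[str, str]:
    """Format each method's per-run scores as 'mean +/- sample std'."""
    return {
        name: f"{statistics.mean(vals):.3f} +/- {statistics.stdev(vals):.3f}"
        for name, vals in scores.items()
    }


# Placeholder numbers for three runs of one method; NOT taken from the paper.
summary = summarize_runs({"placeholder_model": [0.10, 0.12, 0.14]})
```

With only three runs the standard deviation is itself a noisy estimate, so listing the per-run values alongside the summary would make the comparison easier to audit.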
Circularity Check
No significant circularity; empirical gains measured against explicit baselines on held-out scenes.
full rationale
The paper defines a new task (SP-VTP), releases a new dataset (EgoSPT) with first-frame annotations, and proposes an architecture (SPOT) that encodes static prompts separately from evolving visual observations. The central claim is an empirical improvement in cross-scene trajectory prediction under strict scene-level splits, evaluated against explicitly described non-prompted and single-source baselines. No mathematical derivation, fitted parameter, or self-citation chain is presented that reduces the reported result to the inputs by construction. The evaluation protocol and baseline comparisons are independent of the model definition itself.
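A scene-level split of the kind the evaluation relies on can be sketched as follows. The review does not describe how the EgoSPT splits were actually built, so this is an assumption-laden illustration: the function `scene_level_split`, the clip and scene identifiers, and the 20% test fraction are all hypothetical. The property it demonstrates is that every clip from a given scene lands on the same side of the split.

```python
import random
from typing import Dict, List, Tuple


def scene_level_split(
    clip_to_scene: Dict[str, str], test_fraction: float = 0.2, seed: int = 0
) -> Tuple[List[str], List[str]]:
    """Split clips so that no scene contributes to both train and test."""
    scenes = sorted(set(clip_to_scene.values()))
    random.Random(seed).shuffle(scenes)  # seeded shuffle for reproducibility
    n_test = max(1, round(len(scenes) * test_fraction))
    test_scenes = set(scenes[:n_test])
    train = sorted(c for c, s in clip_to_scene.items() if s not in test_scenes)
    test = sorted(c for c, s in clip_to_scene.items() if s in test_scenes)
    return train, test


# Hypothetical clip-to-scene mapping: five scenes, two clips each.
clips = {
    f"clip_{i:02d}": f"scene_{i // 2}" for i in range(10)
}
train_clips, test_clips = scene_level_split(clips)
```

Splitting by clip instead of by scene would let the same backgrounds and object layouts appear in both sets, which is the leakage a scene-level protocol is meant to rule out.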
Axiom & Free-Parameter Ledger
free parameters (1)
- SPOT model hyperparameters
axioms (1)
- domain assumption: First-frame spatial prompts remain valid task specifications as the visual scene evolves.