arxiv: 2605.20085 · v1 · pith:W37Y5T27new · submitted 2026-05-19 · 💻 cs.CV

Spatially Prompted Visual Trajectory Prediction for Egocentric Manipulation

Pith reviewed 2026-05-20 05:29 UTC · model grok-4.3

classification 💻 cs.CV
keywords egocentric manipulation · trajectory prediction · spatial prompting · robotic vision · end-effector forecasting · SP-VTP · EgoSPT dataset

The pith

First-frame spatial prompts allow models to forecast end-effector trajectories more reliably across changing egocentric scenes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper formalizes Spatially Prompted Visual Trajectory Prediction, a setting in which initial bounding boxes or points on the first frame specify both the object to move and the placement target. From egocentric video streams, the model must then predict the full future path of the robot's end-effector. To support this task the authors release the EgoSPT dataset of annotated manipulation sequences and introduce the SPOT architecture, which separately encodes the static spatial prompt, the evolving visual observations, and the history before generating the trajectory. Experiments with strict scene-level splits demonstrate that this dual-source prompting improves prediction accuracy over baselines that receive either no prompt or only one source of spatial information.
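
The scene-level split protocol can be illustrated with a short sketch. It assumes each episode record carries a `scene` field, which is a hypothetical layout and not the EgoSPT annotation schema, and it holds out whole scenes so that no scene contributes episodes to both training and validation.

```python
from collections import defaultdict


def scene_level_split(episodes, held_out_scenes):
    """Split episodes so that every scene lands entirely in one partition.

    `episodes` is a list of dicts with a hypothetical "scene" key;
    `held_out_scenes` is the set of scene names reserved for evaluation.
    """
    split = defaultdict(list)
    for episode in episodes:
        part = "val" if episode["scene"] in held_out_scenes else "train"
        split[part].append(episode)
    return split["train"], split["val"]


# Toy records; the names are illustrative, not taken from EgoSPT.
episodes = [
    {"id": "ep0", "scene": "scene1"},
    {"id": "ep1", "scene": "scene1"},
    {"id": "ep2", "scene": "scene2"},
    {"id": "ep3", "scene": "scene3"},
]
train, val = scene_level_split(episodes, held_out_scenes={"scene3"})

# No scene appears on both sides of the split.
train_scenes = {e["scene"] for e in train}
val_scenes = {e["scene"] for e in val}
assert train_scenes.isdisjoint(val_scenes)
```

Splitting by scene instead of by episode is what makes the reported numbers a test of cross-scene generalization: a per-episode random split would let the model see every layout during training.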

Core claim

SP-VTP specifies task objectives through static first-frame spatial prompts while the scene evolves. SPOT addresses this setting by combining three components: a task encoder for the visual and coordinate prompts, an observation encoder for the current view plus history, and a trajectory generator that outputs future end-effector motion. Under scene-level splits on the EgoSPT dataset, this yields higher accuracy than non-prompted or single-source baselines.

What carries the argument

SPOT (Spatially Prompted Object-Target Policy), which encodes first-frame visual and coordinate prompts separately from current visual observations and history, then generates future end-effector trajectories.
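
As an illustration of the coordinate half of the prompt, the sketch below turns a pixel-space bounding box into a normalized one. The `(x_min, y_min, x_max, y_max)` layout, the division by image width and height, and the function name are assumptions made for this example; the paper's exact prompt encoding is not described on this page.

```python
def normalize_box(box, image_width, image_height):
    """Map a pixel-space box (x_min, y_min, x_max, y_max) into [0, 1] coordinates."""
    x_min, y_min, x_max, y_max = box
    inside_x = 0 <= x_min < x_max <= image_width
    inside_y = 0 <= y_min < y_max <= image_height
    if not (inside_x and inside_y):
        raise ValueError(f"box {box} is not valid for a {image_width}x{image_height} image")
    return (
        x_min / image_width,
        y_min / image_height,
        x_max / image_width,
        y_max / image_height,
    )


# Hypothetical first-frame prompts on a 640x480 frame (assumed values).
object_prompt = normalize_box((64, 48, 320, 240), 640, 480)
target_prompt = normalize_box((320, 240, 640, 480), 640, 480)
print(object_prompt)  # (0.1, 0.1, 0.5, 0.5)
```

Normalized coordinates keep the prompt independent of image resolution, which is one plausible reason to normalize before encoding.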

If this is right

  • Robotic systems can receive manipulation goals through simple pointing or boxing gestures instead of language or task IDs.
  • Cross-scene generalization improves because the prompt supplies explicit object and target locations rather than relying on learned scene priors.
  • The static-prompt setting scales to cluttered environments where multiple similar objects exist.
  • Trajectory prediction becomes a direct, vision-centric output rather than an intermediate step in a larger planning pipeline.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same first-frame prompting mechanism could be paired with online visual servoing to correct trajectories when objects shift unexpectedly.
  • Extending the prompt to include 3D depth or surface normals at the boxed locations might further reduce ambiguity in placement tasks.
  • The EgoSPT collection protocol could be reused to benchmark hybrid language-plus-spatial conditioning for more complex multi-step manipulations.

Load-bearing premise

The initial spatial prompt on the first frame continues to specify the correct object and goal even after the scene configuration and object positions have changed during the trajectory.

What would settle it

A controlled test in which the same first-frame prompt is used but the target object is moved or occluded midway through the sequence, checking whether prediction error rises sharply compared with an updated-prompt baseline.

Figures

Figures reproduced from arXiv: 2605.20085 by Xinyu Zhou, Yifan Li, Yu Kong, Yunhao Ge.

Figure 1
Figure 1: Illustration of the EgoSPT dataset. We use a modified UMI device equipped with an iPhone and a GoPro to collect EgoSPT, an egocentric visual trajectory dataset containing five forks and nine targets, including three plates, three bowls, and three cups. EgoSPT covers three scenes designed to evaluate different policy capabilities.
Figure 2
Figure 2: Overview of the proposed SPOT framework. Given a first-frame task input with object …
Figure 3
Figure 3: Ablation on tuning DINOv2-Base. We compare the default frozen DINOv2-Base encoder with an unfrozen variant on the full validation split.
Figure 5
Figure 5: Ablation on history horizon. We compare history horizons of 0, 4, 8, and 12 on the full validation split. Stars mark the best value for each metric. Lower is better for all metrics. Trajectory head. We compare the default flow-matching head with a diffusion head under the same task, observation tokens, and the architecture …
Figure 4
Figure 4: Ablation on trajectory head. We compare the default flow-matching head with a diffusion head on the full validation split. Lower is better for all metrics. Tuning vision backbone. We compare the default frozen DINOv2-Base encoder with an unfrozen variant that updates the visual backbone during policy training. This ablation tests whether adapting the visual features to EgoSPT improves trajectory predicti…
Figure 6
Figure 6: Full-trajectory visualization. We show stitched predictions over full episodes together with ground-truth 3D trajectories, first-frame spatial prompts, and per-frame position errors.
Figure 7
Figure 7: First-chunk visualization. We show the first-frame prompt, current observation, predicted and ground-truth trajectory chunks, and chunk-level Pos. L2, Rot. L2, and Grip. L1 curves. … and end of the episode, with a final error of 0.0525, but increases in the middle and late stages as chunk-level errors accumulate under larger camera and EE motion. These results suggest that SPOT captures the intended spatial …
Figure 8
Figure 8: More visualizations of the modified UMI device.
Figure 9
Figure 9: Annotation interface. Annotators label the manipulated object and target region on the first egocentric frame. The resulting bounding boxes are stored in the annotation JSON and used as spatial prompts for SP-VTP. … missing prompts, missing zarr arrays, or too few valid frames. For a sample at timestep t, the loader reads the first frame, the current RGB frame, normalized object/target prompts, a history win…
Figure 10
Figure 10: Annotation modification interface. After initial annotation, annotators inspect each video and correct frame-0 object/target bounding boxes directly in the merged annotation file. … pose and gripper arrays share compatible temporal indexing, and first-frame boxes are valid in the original image coordinate system. We also check that scene/task/episode names in the annotations match directory names after path…
Figure 11
Figure 11: Additional full-trajectory visualization for Scene 1. We show a representative stitched full-episode prediction under a structured layout.
Figure 12
Figure 12: Additional full-trajectory visualization for Scene 2. We show a representative stitched full-episode prediction under a cluttered layout.
Figure 13
Figure 13: Additional full-trajectory visualization for Scene 3. We show a representative stitched full-episode prediction under diverse cluttered subscenes.
Figure 14
Figure 14: Additional first-chunk visualization for three scenes. We present representative first-chunk visualization results for the three scenes, ordered from top to bottom.
Original abstract

Robotic manipulation is often specified through language instructions or task identifiers, yet cluttered environments with similar objects are better handled by spatially indicating what to move and where to place it. Addressing the vision-centric challenge of object and goal specification, we present, to the best of our knowledge, the first formalization of Spatially Prompted Visual Trajectory Prediction (SP-VTP). This novel setting utilizes initial spatial prompts (like bounding boxes or points) to define task objectives, tasking the model with forecasting future end-effector trajectories from egocentric streams. To study this problem, we collect and annotate EgoSPT, a dataset of egocentric spatially prompted manipulation trajectories with first-frame object and target grounding annotations and recovered 3D end-effector motion. SP-VTP is challenging because the task specification is static, while the scene configuration evolves over time. To solve this problem, we propose SPOT (Spatially Prompted Object-Target Policy), which combines a task encoder for first-frame visual and coordinate spatial prompts, an observation encoder for current visual and history context, and a trajectory generator for future end-effector motion. Experiments under strict scene-level splits show that SPOT improves cross-scene trajectory prediction over non-prompted or single-source prompted baselines. Together, EgoSPT and SPOT establish a new spatial prompting problem, SP-VTP, as a simple and scalable task condition for egocentric manipulation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper formalizes Spatially Prompted Visual Trajectory Prediction (SP-VTP) as a vision-centric task for egocentric manipulation, where first-frame spatial prompts (bounding boxes or points) specify the object and placement goal. It introduces the EgoSPT dataset of annotated egocentric trajectories with recovered 3D end-effector motion and proposes the SPOT model, which encodes initial prompts via a dedicated task encoder while an observation encoder processes current frames and history to generate future trajectories. Experiments under scene-level splits report improvements over non-prompted and single-source prompted baselines.

Significance. If the empirical gains are shown to be robust, this work provides a practical and scalable task-specification mechanism for cluttered scenes where language is ambiguous. The EgoSPT dataset and the separation of task and observation encoders are clear contributions that could support further research in vision-based robotics. The scene-level split protocol is a strength for assessing generalization.

major comments (2)
  1. [§5] §5 (Experiments): The central claim of cross-scene improvement rests on quantitative gains, yet the manuscript provides no error bars, no details on baseline re-implementations, and no ablation on high-displacement or occlusion subsets. This leaves open whether reported gains derive from prompt utility or simply richer visual history, directly affecting the load-bearing assumption that first-frame prompts remain sufficient as scenes evolve.
  2. [§4.1] §4.1 (Model Architecture): The task encoder ingests only first-frame coordinates and visual prompts while the observation encoder handles evolving frames; no mechanism or analysis is described for maintaining object correspondence after grasp, displacement, or partial occlusion. This architectural choice is central to the cross-scene generalization claim but is not tested against the static-vs-evolving tension noted in the abstract.
minor comments (2)
  1. [Abstract] The abstract and §3 would benefit from an explicit statement of the precise metrics (e.g., ADE, FDE) and the numerical improvements, rather than the qualitative claim that SPOT 'improves' prediction.
  2. [§4] Notation for the task encoder output and its fusion with the observation encoder is introduced without accompanying equations, making the forward pass difficult to follow.
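
For reference, the two metrics named as examples in the first minor comment, average displacement error (ADE) and final displacement error (FDE), can be computed as in the sketch below. It uses their standard definitions on 3D positions; whether the paper reports these or other metrics is not established on this page, and the trajectories are toy values.

```python
import math


def displacement_errors(predicted, ground_truth):
    """Return (ADE, FDE) for two equal-length lists of 3D points."""
    if len(predicted) != len(ground_truth) or not predicted:
        raise ValueError("trajectories must be non-empty and of equal length")
    # Euclidean distance between matching timesteps.
    distances = [math.dist(p, g) for p, g in zip(predicted, ground_truth)]
    ade = sum(distances) / len(distances)  # average over all timesteps
    fde = distances[-1]                    # error at the final timestep only
    return ade, fde


# Toy trajectories with per-step errors of 0, 1, and 2 units.
predicted = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
ground_truth = [(0.0, 0.0, 0.0), (1.0, 1.0, 0.0), (2.0, 0.0, 2.0)]
ade, fde = displacement_errors(predicted, ground_truth)
print(ade, fde)  # 1.0 2.0
```

ADE rewards trajectories that stay close throughout, while FDE isolates the error at the end of the horizon, so reporting both separates path quality from endpoint accuracy.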

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed comments. We address each major comment point by point below, with clear indications of planned revisions to the manuscript.

Point-by-point responses
  1. Referee: [§5] §5 (Experiments): The central claim of cross-scene improvement rests on quantitative gains, yet the manuscript provides no error bars, no details on baseline re-implementations, and no ablation on high-displacement or occlusion subsets. This leaves open whether reported gains derive from prompt utility or simply richer visual history, directly affecting the load-bearing assumption that first-frame prompts remain sufficient as scenes evolve.

    Authors: We agree that the absence of error bars, implementation details, and subset ablations weakens the strength of the empirical claims. In the revised version we will add error bars reporting standard deviation across three independent training runs to all quantitative tables in Section 5. We will also expand the supplementary material with a dedicated subsection detailing the exact re-implementation choices, hyperparameters, and training schedules for every baseline. Finally, we will introduce new ablations that isolate performance on high-displacement and occlusion subsets; these results will be used to quantify the incremental benefit of the spatial prompts over visual history alone. revision: yes

  2. Referee: [§4.1] §4.1 (Model Architecture): The task encoder ingests only first-frame coordinates and visual prompts while the observation encoder handles evolving frames; no mechanism or analysis is described for maintaining object correspondence after grasp, displacement, or partial occlusion. This architectural choice is central to the cross-scene generalization claim but is not tested against the static-vs-evolving tension noted in the abstract.

    Authors: The SPOT design deliberately factors the problem into a static task encoder that receives only the first-frame prompts and a dynamic observation encoder that receives the evolving visual stream and history. This separation is intended to address the static-versus-evolving tension stated in the abstract. While we do not introduce an explicit object tracker, the observation encoder is expected to maintain implicit correspondence through learned visual features. We acknowledge that the manuscript currently lacks both a clear discussion of this design choice and supporting analysis. In revision we will add a paragraph to Section 4.1 explaining the implicit correspondence mechanism and will include qualitative trajectory visualizations on sequences exhibiting grasp, displacement, and partial occlusion to illustrate how the model behaves under these conditions. revision: partial
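
The error bars promised in the first response, a standard deviation across three independent training runs, amount to a small aggregation step. The sketch below shows one way to report it; the numbers are made up for illustration and are not results from the paper, and the helper name is hypothetical.

```python
import statistics


def summarize_runs(values):
    """Return (mean, sample standard deviation) for a metric measured over several runs."""
    return statistics.mean(values), statistics.stdev(values)


# Made-up position errors from three hypothetical training seeds.
position_error_by_seed = [0.050, 0.054, 0.052]
mean, std = summarize_runs(position_error_by_seed)
print(f"{mean:.3f} ± {std:.3f}")  # 0.052 ± 0.002
```

With only three runs the sample standard deviation is itself a noisy estimate, so a gap between methods that is smaller than the reported spread should be read cautiously.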

Circularity Check

0 steps flagged

No significant circularity; empirical gains measured against explicit baselines on held-out scenes.

full rationale

The paper defines a new task (SP-VTP), releases a new dataset (EgoSPT) with first-frame annotations, and proposes an architecture (SPOT) that encodes static prompts separately from evolving visual observations. The central claim is an empirical improvement in cross-scene trajectory prediction under strict scene-level splits, evaluated against explicitly described non-prompted and single-source baselines. No mathematical derivation, fitted parameter, or self-citation chain is presented that reduces the reported result to the inputs by construction. The evaluation protocol and baseline comparisons are independent of the model definition itself.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that static first-frame spatial prompts suffice for a dynamic scene plus standard deep-learning training assumptions; no new physical entities are postulated.

free parameters (1)
  • SPOT model hyperparameters
    Training-time choices for the task encoder, observation encoder, and trajectory generator that are fitted to the EgoSPT data.
axioms (1)
  • domain assumption: First-frame spatial prompts remain valid task specifications as the visual scene evolves.
    Invoked in the problem definition and in the design of the task encoder.

pith-pipeline@v0.9.0 · 5785 in / 1334 out tokens · 44827 ms · 2026-05-20T05:29:39.909530+00:00 · methodology


Reference graph

Works this paper leans on

59 extracted references · 59 canonical work pages · 4 internal anchors

  1. [1]

    Hot3d: Hand and object tracking in 3d from egocentric multi-view videos

    Prithviraj Banerjee, Sindi Shkodrani, Pierre Moulon, Shreyas Hampali, Shangchen Han, Fan Zhang, Linguang Zhang, Jade Fountain, Edward Miller, Selen Basol, et al. Hot3d: Hand and object tracking in 3d from egocentric multi-view videos. InIEEE Conf. Comput. Vis. Pattern Recog., pages 7061–7071. IEEE, 2025

  2. [2]

    Uncertainty-aware state space transformer for egocentric 3d hand trajectory forecasting

    Wentao Bao, Lele Chen, Libing Zeng, Zhong Li, Yi Xu, Junsong Yuan, and Yu Kong. Uncertainty-aware state space transformer for egocentric 3d hand trajectory forecasting. InInt. Conf. Comput. Vis., pages 13702–13711, 2023

  3. [3]

    Kevin Black, Noah Brown, James Darpinian, Karan Dhabalia, Danny Driess, Adnan Esmail, Michael Robert Equi, Chelsea Finn, Niccolo Fusai, Manuel Y . Galliker, Dibya Ghosh, Lachy Groom, Karol Hausman, brian ichter, Szymon Jakubczak, Tim Jones, Liyiming Ke, Devin LeBlanc, Sergey Levine, Adrian Li-Bell, Mohith Mothukuri, Suraj Nair, Karl Pertsch, Allen Z. Ren,...

  4. [4]

    Perception Encoder: The best visual embeddings are not at the output of the network

    Daniel Bolya, Po-Yao Huang, Peize Sun, Jang Hyun Cho, Andrea Madotto, Chen Wei, Tengyu Ma, Jiale Zhi, Jathushan Rajasegaran, Hanoona Rasheed, Junke Wang, Marco Monteiro, Hu Xu, Shiyu Dong, Nikhila Ravi, Daniel Li, Piotr Dollár, and Christoph Feichtenhofer. Perception encoder: The best visual embeddings are not at the output of the network.arXiv:2504.13181, 2025

  5. [5]

    Universal manipulation interface: In-the-wild robot teaching without in-the-wild robots

    Cheng Chi, Zhenjia Xu, Chuer Pan, Eric Cousineau, Benjamin Burchfiel, Siyuan Feng, Russ Tedrake, and Shuran Song. Universal manipulation interface: In-the-wild robot teaching without in-the-wild robots. InRobotics: Science and Systems, 2024

  6. [6]

    Diffusion policy: Visuomotor policy learning via action diffusion

    Cheng Chi, Zhenjia Xu, Siyuan Feng, Eric Cousineau, Yilun Du, Benjamin Burchfiel, Russ Tedrake, and Shuran Song. Diffusion policy: Visuomotor policy learning via action diffusion. Int. J. Robot. Res., 44(10-11):1684–1704, 2025

  7. [7]

    Rescaling egocentric vision: Collection, pipeline and challenges for epic-kitchens-100.Int

    Dima Damen, Hazel Doughty, Giovanni Maria Farinella, Antonino Furnari, Evangelos Kazakos, Jian Ma, Davide Moltisanti, Jonathan Munro, Toby Perrett, Will Price, et al. Rescaling egocentric vision: Collection, pipeline and challenges for epic-kitchens-100.Int. J. Comput. Vis., 130(1): 33–55, 2022

  8. [8]

    Egopat3dv2: Predicting 3d action target from 2d egocentric vision for human-robot interaction

    Irving Fang, Yuzhong Chen, Yifan Wang, Jianghan Zhang, Qiushi Zhang, Jiali Xu, Xibo He, Weibo Gao, Hao Su, Yiming Li, et al. Egopat3dv2: Predicting 3d action target from 2d egocentric vision for human-robot interaction. InIEEE Int. Conf. Robot. Autom., pages 3036–3043. IEEE, 2024

  9. [9]

    Eva-02: A visual representation for neon genesis,

    Yuxin Fang, Quan Sun, Xinggang Wang, Tiejun Huang, Xinlong Wang, and Yue Cao. Eva-02: A visual representation for neon genesis.arXiv:2303.11331, 2023

  10. [10]

    Rvt2: Learning precise manipulation from few demonstrations.Robotics: Science and Systems, 2024

    Ankit Goyal, Valts Blukis, Jie Xu, Yijie Guo, Yu-Wei Chao, and Dieter Fox. Rvt2: Learning precise manipulation from few demonstrations.Robotics: Science and Systems, 2024

  11. [11]

    Ego4d: Around the world in 3,000 hours of egocentric video

    Kristen Grauman, Andrew Westbury, Eugene Byrne, Zachary Chavis, Antonino Furnari, Rohit Girdhar, Jackson Hamburger, Hao Jiang, Miao Liu, Xingyu Liu, et al. Ego4d: Around the world in 3,000 hours of egocentric video. InIEEE Conf. Comput. Vis. Pattern Recog., pages 18995–19012, 2022

  12. [12]

    Ego-exo4d: Understanding skilled human activity from first-and third-person perspectives

    Kristen Grauman, Andrew Westbury, Lorenzo Torresani, Kris Kitani, Jitendra Malik, Triantafyl- los Afouras, Kumar Ashutosh, Vijay Baiyya, Siddhant Bansal, Bikram Boote, et al. Ego-exo4d: Understanding skilled human activity from first-and third-person perspectives. InIEEE Conf. Comput. Vis. Pattern Recog., pages 19383–19400, 2024

  13. [13]

    Umi-on-air: Embodiment-aware guidance for embodiment- agnostic visuomotor policies,

    Harsh Gupta, Xiaofeng Guo, Huy Ha, Chuer Pan, Muqing Cao, Dongjae Lee, Sebastian Scherer, Shuran Song, and Guanya Shi. Umi-on-air: Embodiment-aware guidance for embodiment- agnostic visuomotor policies, 2025. URLhttps://arxiv.org/abs/2510.02614. 10

  14. [14]

    UMI-on-legs: Making manipulation policies mobile with manipulation-centric whole-body controllers

    Huy Ha, Yihuai Gao, Zipeng Fu, Jie Tan, and Shuran Song. UMI-on-legs: Making manipulation policies mobile with manipulation-centric whole-body controllers. InConf. Robot Learn., 2024. URLhttps://openreview.net/forum?id=3i7j8ZPnbm

  15. [15]

    Emag: Ego-motion aware and generalizable 2d hand forecasting from egocentric videos

    Masashi Hatano, Ryo Hachiuma, and Hideo Saito. Emag: Ego-motion aware and generalizable 2d hand forecasting from egocentric videos. InEur. Conf. Comput. Vis., pages 119–136, 2024

  16. [16]

    Denoising diffusion probabilistic models

    Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. InAdv. Neural Inform. Process. Syst., volume 33, pages 6840–6851, 2020

  17. [17]

    V oxposer: Composable 3d value maps for robotic manipulation with language models

    Wenlong Huang, Chen Wang, Ruohan Zhang, Yunzhu Li, Jiajun Wu, and Li Fei-Fei. V oxposer: Composable 3d value maps for robotic manipulation with language models. InConf. Robot Learn., 2023

  18. [18]

    Egomimic: Scaling imitation learning via egocentric video

    Simar Kareer, Dhruv Patel, Ryan Punamiya, Pranay Mathur, Shuo Cheng, Chen Wang, Judy Hoffman, and Danfei Xu. Egomimic: Scaling imitation learning via egocentric video. InIEEE Int. Conf. Robot. Autom., pages 13226–13233. IEEE, 2025

  19. [19]

    OpenVLA: An open-source vision-language-action model

    Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan P Foster, Pannag R Sanketi, Quan Vuong, Thomas Kollar, Ben- jamin Burchfiel, Russ Tedrake, Dorsa Sadigh, Sergey Levine, Percy Liang, and Chelsea Finn. OpenVLA: An open-source vision-language-action model. InConf. Robot Learn., 2024

  20. [20]

    Segment anything

    Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C Berg, Wan-Yen Lo, et al. Segment anything. InInt. Conf. Comput. Vis., pages 4015–4026, 2023

  21. [21]

    H2o: Two hands manipulating objects for first person interaction recognition

    Taein Kwon, Bugra Tekin, Jan Stühmer, Federica Bogo, and Marc Pollefeys. H2o: Two hands manipulating objects for first person interaction recognition. InInt. Conf. Comput. Vis., pages 10138–10148, 2021

  22. [22]

    Hamster: Hier- archical action models for open-world robot manipulation.arXiv preprint arXiv:2502.05485, 2025

    Yi Li, Yuquan Deng, Jesse Zhang, Joel Jang, Marius Memmel, Raymond Yu, Caelan Reed Garrett, Fabio Ramos, Dieter Fox, Anqi Li, Abhishek Gupta, and Ankit Goyal. Hamster: Hier- archical action models for open-world robot manipulation.arXiv preprint arXiv:2502.05485, 2025

  23. [23]

    Egocentric prediction of action target in 3d

    Yiming Li, Ziang Cao, Andrew Liang, Benjamin Liang, Luoyao Chen, Hang Zhao, and Chen Feng. Egocentric prediction of action target in 3d. InIEEE Conf. Comput. Vis. Pattern Recog., pages 20971–20980. IEEE, 2022

  24. [24]

    Data scaling laws in imitation learning for robotic manipulation

    Fanqi Lin, Yingdong Hu, Pingyue Sheng, Chuan Wen, Jiacheng You, and Yang Gao. Data scaling laws in imitation learning for robotic manipulation. InInt. Conf. Learn. Represent.,

  25. [25]

    URLhttps://openreview.net/forum?id=pISLZG7ktL

  26. [26]

    Flow matching for generative modeling

    Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maximilian Nickel, and Matthew Le. Flow matching for generative modeling. InInt. Conf. Learn. Represent., 2020

  27. [27]

    Fastumi-100k: Advancing data-driven robotic manipulation with a large-scale umi-style dataset.arXiv preprint arXiv:2510.08022, 2025

    Kehui Liu, Zhongjie Jia, Yang Li, Pengan Chen, Song Liu, Xin Liu, Pingrui Zhang, Haoming Song, Xinyi Ye, Nieqing Cao, et al. Fastumi-100k: Advancing data-driven robotic manipulation with a large-scale umi-style dataset.arXiv preprint arXiv:2510.08022, 2025

  28. [28]

    Joint hand motion and interaction hotspots prediction from egocentric videos

    Shaowei Liu, Subarna Tripathi, Somdeb Majumdar, and Xiaolong Wang. Joint hand motion and interaction hotspots prediction from egocentric videos. InIEEE Conf. Comput. Vis. Pattern Recog., pages 3282–3292, 2022

  29. [29]

    Maniwav: Learning robot manipulation from in-the-wild audio-visual data

    Zeyi Liu, Cheng Chi, Eric Cousineau, Naveen Kuppuswamy, Benjamin Burchfiel, and Shuran Song. Maniwav: Learning robot manipulation from in-the-wild audio-visual data. InConf. Robot Learn., pages 947–962. PMLR, 2025

  30. [30]

    Grounding video models to actions through goal conditioned exploration

    Yunhao Luo and Yilun Du. Grounding video models to actions through goal conditioned exploration. InInt. Conf. Learn. Represent., 2025. URL https://openreview.net/forum? id=G6dMvRuhFr. 11

  31. [31]

    Language conditioned imitation learning over unstructured data

    Corey Lynch and Pierre Sermanet. Language conditioned imitation learning over unstructured data. InRobotics: Science and Systems, 2021. URL https://arxiv.org/abs/2005.07648

  32. [32]

    Diff-ip2d: Diffusion-based hand- object interaction prediction on egocentric videos

    Junyi Ma, Xieyuanli Chen, Jingyi Xu, and Hesheng Wang. Diff-ip2d: Diffusion-based hand- object interaction prediction on egocentric videos. InIEEE/RSJ Int. Conf. Intell. Robots Syst., pages 4291–4298. IEEE, 2025

  33. [33]

    Nerf: Representing scenes as neural radiance fields for view synthesis

    Ben Mildenhall, Pratul P Srinivasan, Matthew Tancik, Jonathan T Barron, Ravi Ramamoor- thi, and Ren Ng. Nerf: Representing scenes as neural radiance fields for view synthesis. Communications of the ACM, 65(1):99–106, 2021

  34. [34]

    Humanoid manipulation interface: Humanoid whole-body manipulation from robot-free demonstrations.arXiv preprint arXiv:2602.06643, 2026

    Ruiqian Nai, Boyuan Zheng, Junming Zhao, Haodong Zhu, Sicong Dai, Zunhao Chen, Yihang Hu, Yingdong Hu, Tong Zhang, Chuan Wen, et al. Humanoid manipulation interface: Humanoid whole-body manipulation from robot-free demonstrations.arXiv preprint arXiv:2602.06643, 2026

  35. [35]

    Visual reinforcement learning with imagined goals.Adv

    Ashvin V Nair, Vitchyr Pong, Murtaza Dalal, Shikhar Bahl, Steven Lin, and Sergey Levine. Visual reinforcement learning with imagined goals.Adv. Neural Inform. Process. Syst., 31, 2018

  36. [36]

    Octo: An open-source generalist robot policy

    Octo Model Team, Dibya Ghosh, Homer Walke, Karl Pertsch, Kevin Black, Oier Mees, Sudeep Dasari, Joey Hejna, Charles Xu, Jianlan Luo, Tobias Kreiman, You Liang Tan, Lawrence Yun- liang Chen, Pannag Sanketi, Quan Vuong, Ted Xiao, Dorsa Sadigh, Chelsea Finn, and Sergey Levine. Octo: An open-source generalist robot policy. InRobotics: Science and Systems, Del...

  37. [37]

    Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy V . V o, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, Mahmoud Assran, Nicolas Ballas, Wojciech Galuba, Russell Howes, Po-Yao Huang, Shang-Wen Li, Ishan Misra, Michael Rabbat, Vasu Sharma, Gabriel Synnaeve, Hu Xu, Hervé Jegou, Julien Mairal, Patric...

  38. [38]

    Goal-conditioned imitation learning using score-based diffusion policies

    Moritz Reuss, Maximilian Li, Xiaogang Jia, and Rudolf Lioutikov. Goal-conditioned imitation learning using score-based diffusion policies. InRobotics: Science and Systems, 2023. URL https://www.roboticsproceedings.org/rss19/p028.pdf

  39. [39]

    Legato: Cross-embodiment imitation using a grasping tool.IEEE Robot

    Mingyo Seo, H Andy Park, Shenli Yuan, Yuke Zhu, and Luis Sentis. Legato: Cross-embodiment imitation using a grasping tool.IEEE Robot. Autom. Lett., 10(3):2854–2861, 2025

  40. [40]

    Cliport: What and where pathways for robotic manipulation

    Mohit Shridhar, Lucas Manuelli, and Dieter Fox. Cliport: What and where pathways for robotic manipulation. InConf. Robot Learn., pages 894–906. PMLR, 2022

  41. [41]

    Perceiver-actor: A multi-task transformer for robotic manipulation

    Mohit Shridhar, Lucas Manuelli, and Dieter Fox. Perceiver-actor: A multi-task transformer for robotic manipulation. InConf. Robot Learn., pages 785–799. PMLR, 2023

  42. [42]

    Universal planning networks: Learning generalizable representations for visuomotor control

    Aravind Srinivas, Allan Jabri, Pieter Abbeel, Sergey Levine, and Chelsea Finn. Universal planning networks: Learning generalizable representations for visuomotor control. InInt. Conf. Machi. Learn., pages 4732–4741. PMLR, 2018

  43. [43]

    Fourier features let networks learn high frequency functions in low dimensional domains

    Matthew Tancik, Pratul Srinivasan, Ben Mildenhall, Sara Fridovich-Keil, Nithin Raghavan, Utkarsh Singhal, Ravi Ramamoorthi, Jonathan Barron, and Ren Ng. Fourier features let networks learn high frequency functions in low dimensional domains. InAdv. Neural Inform. Process. Syst., volume 33, pages 7537–7547, 2020

  44. [44]

    Egotracks: A long-term egocentric visual object tracking dataset.Adv

    Hao Tang, Kevin J Liang, Kristen Grauman, Matt Feiszli, and Weiyao Wang. Egotracks: A long-term egocentric visual object tracking dataset.Adv. Neural Inform. Process. Syst., 36: 75716–75739, 2023

  45. [45]

    Dexwild: Dexterous human interactions for in-the-wild robot policies.Robotics: Science and Systems, 2025

    Tony Tao, Mohan Kumar Srirama, Jason Jingzhou Liu, Kenneth Shaw, and Deepak Pathak. Dexwild: Dexterous human interactions for in-the-wild robot policies.Robotics: Science and Systems, 2025. 12

  46. [46]

    Gemini Robotics 1.5: Pushing the Frontier of Generalist Robots with Advanced Embodied Reasoning, Thinking, and Motion Transfer

    Gemini Robotics Team, Abbas Abdolmaleki, Saminda Abeyruwan, Joshua Ainslie, Jean- Baptiste Alayrac, Montserrat Gonzalez Arenas, Ashwin Balakrishna, Nathan Batchelor, Alex Bewley, Jeff Bingham, et al. Gemini robotics 1.5: Pushing the frontier of generalist robots with advanced embodied reasoning, thinking, and motion transfer.arXiv preprint arXiv:2510.03342, 2025

  47. [47]

    Attention is all you need

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Adv. Neural Inform. Process. Syst., volume 30, 2017

  48. [48]

    VP-VLA: Visual Prompting as an Interface for Vision-Language-Action Models

    Zixuan Wang, Yuxin Chen, Yuqi Liu, Jinhui Ye, Pengguang Chen, Changsheng Lu, Shu Liu, and Jiaya Jia. Vp-vla: Visual prompting as an interface for vision-language-action models. arXiv preprint arXiv:2603.22003, 2026

  49. [49]

    Libra-VLA: Achieving Learning Equilibrium via Asynchronous Coarse-to-Fine Dual-System

    Yifei Wei, Linqing Zhong, Yi Liu, Yuxiang Lu, Xindong He, Maoqing Yao, and Guanghui Ren. Libra-vla: Achieving learning equilibrium via asynchronous coarse-to-fine dual-system. arXiv preprint arXiv:2604.24921, 2026

  50. [50]

    Momanipvla: Transferring vision-language-action models for general mobile manipulation

    Zhenyu Wu, Yuheng Zhou, Xiuwei Xu, Ziwei Wang, and Haibin Yan. Momanipvla: Transferring vision-language-action models for general mobile manipulation. In IEEE Conf. Comput. Vis. Pattern Recog., pages 1714–1723, 2025

  51. [51]

    Dexumi: Using human hand as the universal manipulation interface for dexterous manipulation

    Mengda Xu, Han Zhang, Yifan Hou, Zhenjia Xu, Linxi Fan, Manuela Veloso, and Shuran Song. Dexumi: Using human hand as the universal manipulation interface for dexterous manipulation. In Conf. Robot Learn., pages 437–459. PMLR, 2025

  52. [52]

    Magma: A foundation model for multimodal ai agents

    Jianwei Yang, Reuben Tan, Qianhui Wu, Ruijie Zheng, Baolin Peng, Yongyuan Liang, Yu Gu, Mu Cai, Seonghyeon Ye, Joel Jang, Yuquan Deng, and Jianfeng Gao. Magma: A foundation model for multimodal ai agents. In IEEE Conf. Comput. Vis. Pattern Recog., pages 14203–14214, June 2025

  53. [53]

    Multi-task reinforcement learning with soft modularization

    Ruihan Yang, Huazhe Xu, Yi Wu, and Xiaolong Wang. Multi-task reinforcement learning with soft modularization. In H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin, editors, Adv. Neural Inform. Process. Syst., volume 33, pages 4767–4777. Curran Associates, Inc., 2020. URL https://proceedings.neurips.cc/paper/2020/file/32cfdce9631d8c7906e8...

  54. [54]

    Meta-world: A benchmark and evaluation for multi-task and meta reinforcement learning

    Tianhe Yu, Deirdre Quillen, Zhanpeng He, Ryan Julian, Karol Hausman, Chelsea Finn, and Sergey Levine. Meta-world: A benchmark and evaluation for multi-task and meta reinforcement learning. In Conf. Robot Learn., pages 1094–1100. PMLR, 2020

  55. [55]

    Robotic control via embodied chain-of-thought reasoning

    Michał Zawalski, William Chen, Karl Pertsch, Oier Mees, Chelsea Finn, and Sergey Levine. Robotic control via embodied chain-of-thought reasoning. In Conf. Robot Learn., pages 3157–

  56. [56]

    Sigmoid loss for language image pre-training

    Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. Sigmoid loss for language image pre-training. In Int. Conf. Comput. Vis., pages 11975–11986, 2023

  57. [57]

    Learning fine-grained bimanual manipulation with low-cost hardware

    Tony Z. Zhao, Vikash Kumar, Sergey Levine, and Chelsea Finn. Learning fine-grained bimanual manipulation with low-cost hardware. In Robotics: Science and Systems, 2023

  58. [58]

    Fastumi: A scalable and hardware-independent universal manipulation interface with dataset

    Zhaxizhuoma, Kehui Liu, Chuyue Guan, Zhongjie Jia, Ziniu Wu, Xin Liu, Tianyu Wang, Shuai Liang, Pengan Chen, Pingrui Zhang, Haoming Song, Delin Qu, Dong Wang, Zhigang Wang, Nieqing Cao, Yan Ding, Bin Zhao, and Xuelong Li. Fastumi: A scalable and hardware-independent universal manipulation interface with dataset. arXiv, 2025. URL https://arxiv.org/abs/2409.19499

  59. [59]

    RT-2: Vision-language-action models transfer web knowledge to robotic control

    Brianna Zitkovich, Tianhe Yu, Sichun Xu, Peng Xu, Ted Xiao, Fei Xia, Jialin Wu, Paul Wohlhart, Stefan Welker, Ayzaan Wahid, Quan Vuong, Vincent Vanhoucke, Huong Tran, Radu Soricut, Anikait Singh, Jaspiar Singh, Pierre Sermanet, Pannag R Sanketi, Grecia Salazar, Michael S Ryoo, Krista Reymann, Kanishka Rao, Karl Pertsch, Igor Mordatch, Henryk Michalewski, ...