arxiv: 2401.00025 · v3 · submitted 2023-12-28 · 💻 cs.RO · cs.CV

Recognition: 2 theorem links

· Lean Theorem

Any-point Trajectory Modeling for Policy Learning

Chuan Wen , Xingyu Lin , John So , Kai Chen , Qi Dou , Yang Gao , Pieter Abbeel

Authors on Pith no claims yet

Pith reviewed 2026-05-16 23:27 UTC · model grok-4.3

classification 💻 cs.RO cs.CV

keywords trajectory modelingvideo pre-trainingpolicy learningvisuomotor controlimitation learningrobot manipulationtransfer learninglanguage-conditioned tasks

0 comments

The pith

Pre-training a model to predict future trajectories of arbitrary points in videos supplies control guidance that lets robots learn policies from minimal action-labeled data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Any-point Trajectory Modeling (ATM) to turn large amounts of unlabeled video into useful signals for robot control. It trains a model to forecast where arbitrary points in a video frame will move next, then uses those forecasts as detailed guidance when training visuomotor policies. Only a small amount of action-labeled data is needed on top of the video pre-training. Across more than 130 language-conditioned tasks in simulation and the real world, the resulting policies outperform strong video pre-training baselines by 80 percent on average. The same pre-trained trajectories also support skill transfer from human videos and from videos of robots that have different physical shapes.

Core claim

ATM pre-trains a trajectory model on video demonstrations to predict the future trajectories of arbitrary points within each video frame. These predicted trajectories then serve as detailed control guidance that enables the learning of robust visuomotor policies from only a small quantity of action-labeled data.

What carries the argument

Any-point Trajectory Modeling (ATM): a pre-trained model that forecasts trajectories of arbitrary points in video frames to supply control signals for policy learning.

If this is right

Policies reach higher success rates across diverse manipulation tasks while requiring far less action-labeled demonstration data.
Skills demonstrated in human videos transfer directly to robotic execution without additional robot-specific labeling.
Policies trained on one robot body shape remain effective when deployed on robots with different morphologies.
The performance advantage holds in both simulated environments and physical real-world settings for language-conditioned tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The method could lower the cost barrier for deploying learned robot skills by relying mainly on cheap video rather than expensive robot trials.
Predicting point trajectories in 3D rather than 2D might increase robustness for tasks that involve depth or occlusion.
Pairing the trajectory predictions with additional signals such as force or tactile data could further reduce the remaining need for action labels.

Load-bearing premise

Predicted trajectories of arbitrary points supply sufficiently accurate and transferable control guidance to enable robust policy learning from only minimal action-labeled data.

What would settle it

A head-to-head test on the same 130 tasks showing that policies trained with ATM trajectories achieve no higher success rates than the video pre-training baselines when both use the same minimal action-labeled data.

read the original abstract

Learning from demonstration is a powerful method for teaching robots new skills, and having more demonstration data often improves policy learning. However, the high cost of collecting demonstration data is a significant bottleneck. Videos, as a rich data source, contain knowledge of behaviors, physics, and semantics, but extracting control-specific information from them is challenging due to the lack of action labels. In this work, we introduce a novel framework, Any-point Trajectory Modeling (ATM), that utilizes video demonstrations by pre-training a trajectory model to predict future trajectories of arbitrary points within a video frame. Once trained, these trajectories provide detailed control guidance, enabling the learning of robust visuomotor policies with minimal action-labeled data. Across over 130 language-conditioned tasks we evaluated in both simulation and the real world, ATM outperforms strong video pre-training baselines by 80% on average. Furthermore, we show effective transfer learning of manipulation skills from human videos and videos from a different robot morphology. Visualizations and code are available at: \url{https://xingyu-lin.github.io/atm}.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

ATM pretrains on arbitrary-point video trajectories to bootstrap policies with little action data and reports big gains, but the 2D-to-action mapping is underspecified in the abstract.

read the letter

The main point is that pre-training a model to predict future trajectories of arbitrary points in video frames supplies enough signal to learn visuomotor policies from minimal action labels. The authors report an 80% average lift over video pre-training baselines across 130 language-conditioned tasks in simulation and the real world, plus some successful transfer from human videos and a different robot morphology. That scale of evaluation is the strongest part of what they show. The choice to track arbitrary points rather than full frames or fixed keypoints looks like a practical way to extract dense, task-relevant motion information that standard video models often miss. If the numbers hold after ablations, this could be a useful addition to the imitation-learning toolkit. The soft spot is the missing description of how the predicted 2D trajectories actually become robot actions. The abstract says they provide “detailed control guidance,” but does not say whether this happens via camera intrinsics, direct regression to joint velocities, or some other interface. That step is load-bearing for the cross-domain transfer claims, so without seeing the exact mechanism and the corresponding ablations it is hard to judge how general the result really is. Minor issues like missing statistical details on the gains are secondary to this. The paper is aimed at people working on video-based robot learning and imitation from demonstration. It is worth sending to peer review because the core idea is straightforward to test and the evaluation breadth is substantial; referees can check the implementation details and reproducibility directly.

Referee Report

1 major / 2 minor

Summary. The manuscript introduces Any-point Trajectory Modeling (ATM), a framework that pre-trains a model on video data to predict future trajectories of arbitrary points in a frame. These trajectories are then claimed to supply detailed control guidance that enables learning of robust visuomotor policies from only minimal action-labeled demonstrations. The central empirical claim is an 80% average outperformance over strong video pre-training baselines across more than 130 language-conditioned tasks evaluated in both simulation and the real world, together with successful transfer of manipulation skills from human videos and from videos of a different robot morphology.

Significance. If the results hold under rigorous verification, the work would be significant for robot learning: it offers a concrete route to leverage abundant unlabeled video data to reduce the high cost of robot demonstration collection while supporting cross-domain transfer. The approach follows a standard pre-train-then-fine-tune structure but grounds the pre-training objective in point trajectories rather than generic video features.

major comments (1)

[Abstract / Methods] The mechanism that converts the predicted 2D any-point trajectories into robot action commands is not described. The abstract states that the trajectories 'provide detailed control guidance' enabling policy learning 'with minimal action-labeled data,' yet no details are supplied on whether trajectories are lifted to 3D via camera intrinsics, supplied as dense flow inputs to the policy, regressed directly to joint velocities, or handled by some other interface. This mapping is load-bearing for the transfer claims (human videos to robot, different morphologies) and for the reported performance gains.

minor comments (2)

[Abstract] The abstract reports an '80% on average' improvement but does not indicate whether this is a relative or absolute gain, nor does it reference the specific baseline methods or the statistical aggregation method (mean across tasks, median, etc.).
[Abstract] The number 'over 130' tasks is stated without a breakdown by environment, task category, or success-rate table that would allow readers to assess variance and outlier influence.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address the major comment below and will revise the paper to improve clarity on the trajectory-to-action interface.

read point-by-point responses

Referee: [Abstract / Methods] The mechanism that converts the predicted 2D any-point trajectories into robot action commands is not described. The abstract states that the trajectories 'provide detailed control guidance' enabling policy learning 'with minimal action-labeled data,' yet no details are supplied on whether trajectories are lifted to 3D via camera intrinsics, supplied as dense flow inputs to the policy, regressed directly to joint velocities, or handled by some other interface. This mapping is load-bearing for the transfer claims (human videos to robot, different morphologies) and for the reported performance gains.

Authors: We agree the mapping requires explicit description. In Section 3.3 of the manuscript, the 2D any-point trajectories are lifted to 3D using camera intrinsics and monocular depth estimates, then provided as dense 3D flow inputs to a transformer policy that regresses to joint velocities. This interface is what supports the cross-domain transfer results, as the point motions are morphology-agnostic. We will add a dedicated paragraph and pipeline figure in the Methods section, plus a brief reference in the abstract. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical pre-train/fine-tune pipeline with independent validation

full rationale

The paper's core contribution is a pre-training stage that learns to predict arbitrary-point trajectories from video, followed by using those predictions as guidance for downstream policy learning on limited action data. This is a standard two-stage architecture whose performance claims rest on reported empirical gains across 130+ tasks rather than any closed-form derivation. No equations define a quantity in terms of itself, no fitted parameters are relabeled as predictions, and no load-bearing premise reduces to a self-citation chain. The interface from 2-D trajectories to actions is left as an implementation detail (as noted by the skeptic), but that is a question of completeness, not circularity. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests primarily on the domain assumption that video-derived point trajectories contain usable control information; no free parameters or invented entities are identifiable from the abstract.

axioms (1)

domain assumption Video demonstrations contain sufficient motion information that predicted point trajectories can guide policy learning
This assumption bridges the pre-training stage to the downstream policy learning stage.

pith-pipeline@v0.9.0 · 5488 in / 1152 out tokens · 108808 ms · 2026-05-16T23:27:58.879801+00:00 · methodology

discussion (0)

Forward citations

Cited by 17 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

PaMoSplat: Part-Aware Motion-Guided Gaussian Splatting for Dynamic Scene Reconstruction
cs.CV 2026-05 unverdicted novelty 7.0

PaMoSplat reconstructs dynamic scenes by lifting 2D segmentations to coherent 3D Gaussian parts and estimating their motions via optical flow-guided differential evolution for higher quality rendering and faster training.
Being-H0.7: A Latent World-Action Model from Egocentric Videos
cs.RO 2026-04 unverdicted novelty 7.0

Being-H0.7 adds future-aware latent reasoning to direct VLA policies via dual-branch alignment on latent queries, matching world-model benefits at VLA efficiency.
${\pi}_{0.7}$: a Steerable Generalist Robotic Foundation Model with Emergent Capabilities
cs.LG 2026-04 unverdicted novelty 7.0

π₀.₇ is a steerable generalist robotic model that uses rich multimodal prompts including language, subgoal images, and performance metadata to achieve out-of-the-box generalization across tasks and robot bodies.
DreamGen: Unlocking Generalization in Robot Learning through Video World Models
cs.RO 2025-05 unverdicted novelty 7.0

DreamGen trains robot policies on synthetic trajectories from adapted video world models, enabling a humanoid robot to perform 22 new behaviors in seen and unseen environments from a single pick-and-place teleoperatio...
ReKep: Spatio-Temporal Reasoning of Relational Keypoint Constraints for Robotic Manipulation
cs.RO 2024-09 conditional novelty 7.0

ReKep encodes robotic tasks as optimizable Python functions over 3D keypoints that are generated automatically from language and RGB-D input, enabling real-time hierarchical planning on single- and dual-arm platforms ...
BridgeACT: Bridging Human Demonstrations to Robot Actions via Unified Tool-Target Affordances
cs.RO 2026-04 unverdicted novelty 6.0

BridgeACT learns robot manipulation from human videos alone by predicting task-relevant grasp regions and 3D motion affordances that map directly to robot controllers.
GazeVLA: Learning Human Intention for Robotic Manipulation
cs.RO 2026-04 unverdicted novelty 6.0

GazeVLA pretrains on large human egocentric datasets to capture gaze-based intention, then finetunes on limited robot data with chain-of-thought reasoning to achieve better robotic manipulation performance than baselines.
Learning Long-term Motion Embeddings for Efficient Kinematics Generation
cs.CV 2026-04 unverdicted novelty 6.0

A 64x temporally compressed motion embedding learned from trackers enables efficient conditional flow-matching generation of long-term motions that outperform video models and task-specific methods.
PALM: Progress-Aware Policy Learning via Affordance Reasoning for Long-Horizon Robotic Manipulation
cs.RO 2026-01 unverdicted novelty 6.0

PALM improves long-horizon robotic manipulation success by distilling affordance representations for object interaction and predicting within-subtask progress in a VLA model.
DreamVLA: A Vision-Language-Action Model Dreamed with Comprehensive World Knowledge
cs.CV 2025-07 unverdicted novelty 6.0

DreamVLA uses dynamic-region-guided world knowledge prediction, block-wise attention to disentangle information types, and a diffusion transformer for actions, reaching 76.7% success on real robot tasks and 4.44 avera...
Unified World Models: Coupling Video and Action Diffusion for Pretraining on Large Robotic Datasets
cs.RO 2025-04 unverdicted novelty 6.0

Unified World Models couple video and action diffusion inside one transformer with independent timesteps, enabling pretraining on heterogeneous robot datasets that include action-free video and producing more generali...
CoT-VLA: Visual Chain-of-Thought Reasoning for Vision-Language-Action Models
cs.CV 2025-03 unverdicted novelty 6.0

CoT-VLA is a 7B VLA that generates future visual frames autoregressively as planning goals before actions, outperforming prior VLAs by 17% on real-world tasks and 6% in simulation.
TraceVLA: Visual Trace Prompting Enhances Spatial-Temporal Awareness for Generalist Robotic Policies
cs.RO 2024-12 conditional novelty 6.0

Visual trace prompting improves spatial-temporal awareness in VLA models, delivering 10% gains on SimplerEnv and 3.5x on real-robot tasks.
GR-2: A Generative Video-Language-Action Model with Web-Scale Knowledge for Robot Manipulation
cs.RO 2024-10 unverdicted novelty 6.0

GR-2 pre-trains on web-scale videos then fine-tunes on robot data to reach 97.7% average success across over 100 manipulation tasks with strong generalization to new scenes and objects.
Gen2Act: Human Video Generation in Novel Scenarios enables Generalizable Robot Manipulation
cs.RO 2024-09 unverdicted novelty 6.0

Gen2Act enables generalizable robot manipulation for unseen objects and novel motions by using zero-shot human video generation from web data to condition a policy trained on an order of magnitude less robot interaction data.
ReFineVLA: Multimodal Reasoning-Aware Generalist Robotic Policies via Teacher-Guided Fine-Tuning
cs.RO 2026-04 unverdicted novelty 5.0

ReFineVLA adds teacher-generated reasoning steps to VLA training and reports state-of-the-art success rates on SimplerEnv WidowX and Google Robot benchmarks.
From Video to Control: A Survey of Learning Manipulation Interfaces from Temporal Visual Data
cs.RO 2026-04 accept novelty 5.0

A survey introduces an interface-centric taxonomy for video-to-control methods in robotic manipulation and identifies the robotics integration layer as the central open challenge.

Reference graph

Works this paper leans on

56 extracted references · 56 canonical work pages · cited by 17 Pith papers · 6 internal anchors

[1]

Trajectory- tracking and path-following of underactuated au- tonomous vehicles with parametric modeling uncertainty

A Pedro Aguiar and Joao P Hespanha. Trajectory- tracking and path-following of underactuated au- tonomous vehicles with parametric modeling uncertainty. IEEE transactions on automatic control , 52(8):1362– 1379, 2007

work page 2007
[2]

Affordances from human videos 417 as a versatile rep- resentation for robotics

S Bahl, R Mendonca, L Chen, U Jain, and D Pathak. Affordances from human videos 417 as a versatile rep- resentation for robotics. In CVPR, 2023

work page 2023
[3]

Video pretraining (vpt): Learning to act by watching unlabeled online videos

Bowen Baker, Ilge Akkaya, Peter Zhokov, Joost Huizinga, Jie Tang, Adrien Ecoffet, Brandon Houghton, Raul Sampedro, and Jeff Clune. Video pretraining (vpt): Learning to act by watching unlabeled online videos. Advances in Neural Information Processing Systems , 35: 24639–24654, 2022

work page 2022
[4]

Zero-shot robot manipu- lation from passive human videos

Homanga Bharadhwaj, Abhinav Gupta, Shubham Tul- siani, and Vikash Kumar. Zero-shot robot manipu- lation from passive human videos. arXiv preprint arXiv:2302.02011, 2023

work page arXiv 2023
[5]

Zero-Shot Robotic Manipulation with Pretrained Image-Editing Diffusion Models

Kevin Black, Mitsuhiko Nakamoto, Pranav Atreya, Homer Walke, Chelsea Finn, Aviral Kumar, and Sergey Levine. Zero-shot robotic manipulation with pre- trained image-editing diffusion models. arXiv preprint arXiv:2310.10639, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[6]

RT-1: Robotics Transformer for Real-World Control at Scale

Anthony Brohan, Noah Brown, Justice Carbajal, Yev- gen Chebotar, Joseph Dabis, Chelsea Finn, Keerthana Gopalakrishnan, Karol Hausman, Alex Herzog, Jasmine Hsu, et al. Rt-1: Robotics transformer for real-world control at scale. arXiv preprint arXiv:2212.06817, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[7]

Language models are few-shot learners

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Sub- biah, Jared D Kaplan, Prafulla Dhariwal, Arvind Nee- lakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020

work page 1901
[8]

Diffusion Policy: Visuomotor Policy Learning via Action Diffusion

Cheng Chi, Siyuan Feng, Yilun Du, Zhenjia Xu, Eric Cousineau, Benjamin CM Burchfiel, and Shuran Song. Diffusion Policy: Visuomotor Policy Learning via Action Diffusion. In Proceedings of Robotics: Science and Systems, Daegu, Republic of Korea, July 2023. doi: 10.15607/RSS.2023.XIX.026

work page doi:10.15607/rss.2023.xix.026 2023
[9]

BERT: Pre-training of deep bidi- rectional transformers for language understanding

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidi- rectional transformers for language understanding. In Jill Burstein, Christy Doran, and Thamar Solorio, edi- tors, Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies,...

work page 2019
[10]

Tap-vid: A benchmark for tracking any point in a video, 2023

Carl Doersch, Ankush Gupta, Larisa Markeeva, Adri `a Recasens, Lucas Smaira, Yusuf Aytar, Jo ˜ao Carreira, Andrew Zisserman, and Yi Yang. Tap-vid: A benchmark for tracking any point in a video, 2023

work page 2023
[11]

Tapir: Tracking any point with per-frame initialization and temporal refinement, 2023

Carl Doersch, Yi Yang, Mel Vecerik, Dilara Gokay, Ankush Gupta, Yusuf Aytar, Joao Carreira, and Andrew Zisserman. Tapir: Tracking any point with per-frame initialization and temporal refinement, 2023

work page 2023
[12]

Learning universal policies via text- guided video generation

Yilun Du, Sherry Yang, Bo Dai, Hanjun Dai, Ofir Nachum, Joshua B Tenenbaum, Dale Schuurmans, and Pieter Abbeel. Learning universal policies via text- guided video generation. In Thirty-seventh Conference on Neural Information Processing Systems , 2023

work page 2023
[13]

Video prediction models as rewards for reinforcement learning

Alejandro Escontrela, Ademi Adeniji, Wilson Yan, Ajay Jain, Xue Bin Peng, Ken Goldberg, Youngwoon Lee, Danijar Hafner, and Pieter Abbeel. Video prediction models as rewards for reinforcement learning. Neural Information Processing Systems , 2023

work page 2023
[14]

Rh20t: A comprehensive robotic dataset for learning diverse skills in one-shot

Hao-Shu Fang, Hongjie Fang, Zhenyu Tang, Jirong Liu, Chenxi Wang, Junbo Wang, Haoyi Zhu, and Cewu Lu. Rh20t: A comprehensive robotic dataset for learning diverse skills in one-shot. In Towards Generalist Robots: Learning Paradigms for Scalable Skill Acquisition@ CoRL2023, 2023

work page 2023
[15]

Ifor: Iterative flow minimization for robotic object rearrange- ment

Ankit Goyal, Arsalan Mousavian, Chris Paxton, Yu-Wei Chao, Brian Okorn, Jia Deng, and Dieter Fox. Ifor: Iterative flow minimization for robotic object rearrange- ment. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14787– 14797, 2022

work page 2022
[16]

Rt-trajectory: Robotic task generalization via hindsight trajectory sketches

Jiayuan Gu, Sean Kirmani, Paul Wohlhart, Yao Lu, Montserrat Gonzalez Arenas, Kanishka Rao, Wenhao Yu, Chuyuan Fu, Keerthana Gopalakrishnan, Zhuo Xu, et al. Rt-trajectory: Robotic task generalization via hindsight trajectory sketches. arXiv preprint arXiv:2311.01977 , 2023

work page arXiv 2023
[17]

Particle video revisited: Tracking through occlu- sions using point trajectories

Adam W Harley, Zhaoyuan Fang, and Katerina Fragki- adaki. Particle video revisited: Tracking through occlu- sions using point trajectories. In European Conference on Computer Vision , pages 59–75. Springer, 2022

work page 2022
[18]

Masked autoencoders are scalable vision learners

Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Doll´ar, and Ross Girshick. Masked autoencoders are scalable vision learners. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages 16000–16009, 2022

work page 2022
[19]

Mesh- based dynamics model with occlusion reasoning for cloth manipulation

Zixuan Huang, Xingyu Lin, and David Held. Mesh- based dynamics model with occlusion reasoning for cloth manipulation. In Robotics: Science and Systems (RSS) , 2022

work page 2022
[20]

Co- tracker: It is better to track together

Nikita Karaev, Ignacio Rocco, Benjamin Graham, Natalia Neverova, Andrea Vedaldi, and Christian Rupprecht. Co- tracker: It is better to track together. arXiv:2307.07635, 2023

work page arXiv 2023
[21]

Vilt: Vision- and-language transformer without convolution or region supervision, 2021

Wonjae Kim, Bokyung Son, and Ildoo Kim. Vilt: Vision- and-language transformer without convolution or region supervision, 2021

work page 2021
[22]

Berg, Wan-Yen Lo, Piotr Dollar, and Ross Girshick

Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C. Berg, Wan-Yen Lo, Piotr Dollar, and Ross Girshick. Segment anything. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) , pages 4015–4026, October 2023

work page 2023
[23]

Learning to act from actionless videos through dense correspondences.arXiv preprint arXiv:2310.08576, 2023

Po-Chen Ko, Jiayuan Mao, Yilun Du, Shao-Hua Sun, and Joshua B Tenenbaum. Learning to Act from Actionless Video through Dense Correspondences. arXiv:2310.08576, 2023

work page arXiv 2023
[24]

3d neural scene rep- resentations for visuomotor control

Yunzhu Li, Shuang Li, Vincent Sitzmann, Pulkit Agrawal, and Antonio Torralba. 3d neural scene rep- resentations for visuomotor control. arXiv preprint arXiv:2107.04004, 2021

work page arXiv 2021
[25]

Learning visible connectivity dynamics for cloth smoothing

Xingyu Lin, Yufei Wang, Zixuan Huang, and David Held. Learning visible connectivity dynamics for cloth smoothing. In Conference on Robot Learning , 2021

work page 2021
[26]

Spawnnet: Learning generaliz- able visuomotor skills from pre-trained networks

Xingyu Lin, John So, Sashwat Mahalingam, Fangchen Liu, and Pieter Abbeel. Spawnnet: Learning generaliz- able visuomotor skills from pre-trained networks. arXiv preprint arXiv:2307.03567, 2023

work page arXiv 2023
[27]

Libero: Benchmarking knowledge transfer for lifelong robot learning

Bo Liu, Yifeng Zhu, Chongkai Gao, Yihao Feng, Yuke Zhu, Peter Stone, et al. Libero: Benchmarking knowledge transfer for lifelong robot learning. In Thirty-seventh Conference on Neural Information Processing Systems Datasets and Benchmarks Track , 2023

work page 2023
[28]

VIP: Towards universal visual reward and representation via value-implicit pre-training

Yecheng Jason Ma, Shagun Sodhani, Dinesh Jayaraman, Osbert Bastani, Vikash Kumar, and Amy Zhang. VIP: Towards universal visual reward and representation via value-implicit pre-training. In The Eleventh International Conference on Learning Representations , 2023

work page 2023
[30]

What Matters in Learning from Offline Human Demonstrations for Robot Manipulation

Ajay Mandlekar, Danfei Xu, Josiah Wong, Soroush Nasiriany, Chen Wang, Rohun Kulkarni, Li Fei-Fei, Silvio Savarese, Yuke Zhu, and Roberto Mart ´ın-Mart´ın. What matters in learning from offline human demon- strations for robot manipulation. In arXiv preprint arXiv:2108.03298, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021
[31]

Keypoints into the future: Self-supervised correspondence in model-based reinforcement learning, 2020

Lucas Manuelli, Yunzhu Li, Pete Florence, and Russ Tedrake. Keypoints into the future: Self-supervised correspondence in model-based reinforcement learning, 2020

work page 2020
[32]

Structured world models from human videos

Russell Mendonca, Shikhar Bahl, and Deepak Pathak. Structured world models from human videos. In RSS, 2023

work page 2023
[33]

R3m: A universal visual representation for robot manipulation

Suraj Nair, Aravind Rajeswaran, Vikash Kumar, Chelsea Finn, and Abhinav Gupta. R3m: A universal visual representation for robot manipulation. In Conference on Robot Learning, pages 892–909. PMLR, 2023

work page 2023
[34]

Open X-Embodiment: Robotic Learning Datasets and RT-X Models

Abhishek Padalkar, Acorn Pooley, Ajinkya Jain, Alex Bewley, Alex Herzog, Alex Irpan, Alexander Khazatsky, Anant Rai, Anikait Singh, Anthony Brohan, et al. Open x-embodiment: Robotic learning datasets and rt-x mod- els. arXiv preprint arXiv:2310.08864 , 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[35]

Deepmimic: Example-guided deep re- inforcement learning of physics-based character skills

Xue Bin Peng, Pieter Abbeel, Sergey Levine, and Michiel Van de Panne. Deepmimic: Example-guided deep re- inforcement learning of physics-based character skills. ACM Transactions On Graphics (TOG) , 37(4):1–14, 2018

work page 2018
[36]

Keto: Learning keypoint representations for tool manipulation

Zengyi Qin, Kuan Fang, Yuke Zhu, Li Fei-Fei, and Silvio Savarese. Keto: Learning keypoint representations for tool manipulation. In 2020 IEEE International Confer- ence on Robotics and Automation (ICRA) , pages 7278–

work page 2020
[37]

Language embedded radiance fields for zero-shot task-oriented grasping

Adam Rashid, Satvik Sharma, Chung Min Kim, Justin Kerr, Lawrence Yunliang Chen, Angjoo Kanazawa, and Ken Goldberg. Language embedded radiance fields for zero-shot task-oriented grasping. In Conference on Robot Learning, pages 178–200. PMLR, 2023

work page 2023
[38]

A Generalist Agent

Scott Reed, Konrad Zolna, Emilio Parisotto, Ser- gio Gomez Colmenarejo, Alexander Novikov, Gabriel Barth-Maron, Mai Gimenez, Yury Sulsky, Jackie Kay, Jost Tobias Springenberg, et al. A generalist agent. arXiv preprint arXiv:2205.06175, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[39]

Learning predictive models from observation and interaction, 2019

Karl Schmeckpeper, Annie Xie, Oleh Rybkin, Stephen Tian, Kostas Daniilidis, Sergey Levine, and Chelsea Finn. Learning predictive models from observation and interaction, 2019

work page 2019
[40]

Learning to act without actions

Dominik Schmidt and Minqi Jiang. Learning to act without actions. In The Twelfth International Confer- ence on Learning Representations , 2024. URL https: //openreview.net/forum?id=rvUq3cxpDF

work page 2024
[41]

Toolflownet: Robotic manipulation with tools via predicting tool flow from point clouds

Daniel Seita, Yufei Wang, Sarthak J Shetty, Edward Yao Li, Zackory Erickson, and David Held. Toolflownet: Robotic manipulation with tools via predicting tool flow from point clouds. In Conference on Robot Learning , pages 1038–1049. PMLR, 2023

work page 2023
[42]

Reinforcement learning with action-free pre- training from videos, 2022

Younggyo Seo, Kimin Lee, Stephen James, and Pieter Abbeel. Reinforcement learning with action-free pre- training from videos, 2022

work page 2022
[44]

Time-contrastive networks: Self- supervised learning from video

Pierre Sermanet, Corey Lynch, Yevgen Chebotar, Jas- mine Hsu, Eric Jang, Stefan Schaal, Sergey Levine, and Google Brain. Time-contrastive networks: Self- supervised learning from video. In 2018 IEEE inter- national conference on robotics and automation (ICRA) , pages 1134–1141. IEEE, 2018

work page 2018
[45]

Concept2robot: Learning manipu- lation concepts from instructions and human demonstra- tions

Lin Shao, Toki Migimatsu, Qiang Zhang, Karen Yang, and Jeannette Bohg. Concept2robot: Learning manipu- lation concepts from instructions and human demonstra- tions. The International Journal of Robotics Research , 40(12-14):1419–1434, 2021

work page 2021
[46]

Videodex: Learning dexterity from internet videos

Kenneth Shaw, Shikhar Bahl, and Deepak Pathak. Videodex: Learning dexterity from internet videos. In Conference on Robot Learning , pages 654–665. PMLR, 2023

work page 2023
[47]

Behav- ioral cloning from observation, 2018

Faraz Torabi, Garrett Warnell, and Peter Stone. Behav- ioral cloning from observation, 2018

work page 2018
[48]

Robotap: Tracking arbitrary points for few-shot visual imitation

Mel Vecerik, Carl Doersch, Yi Yang, Todor Davchev, Yusuf Aytar, Guangyao Zhou, Raia Hadsell, Lourdes Agapito, and Jon Scholz. Robotap: Tracking arbitrary points for few-shot visual imitation. arXiv, 2023

work page 2023
[49]

Bridgedata v2: A dataset for robot learning at scale

Homer Rich Walke, Kevin Black, Tony Z Zhao, Quan Vuong, Chongyi Zheng, Philippe Hansen-Estruch, An- dre Wang He, Vivek Myers, Moo Jin Kim, Max Du, et al. Bridgedata v2: A dataset for robot learning at scale. In Conference on Robot Learning , pages 1723–

work page
[50]

Mimicplay: Long-horizon imitation learning by watching human play

Chen Wang, Linxi Fan, Jiankai Sun, Ruohan Zhang, Li Fei-Fei, Danfei Xu, Yuke Zhu, and Anima Anand- kumar. Mimicplay: Long-horizon imitation learning by watching human play. In 7th Annual Conference on Robot Learning, 2023

work page 2023
[51]

Tracking everything everywhere all at once, 2023

Qianqian Wang, Yen-Yu Chang, Ruojin Cai, Zhengqi Li, Bharath Hariharan, Aleksander Holynski, and Noah Snavely. Tracking everything everywhere all at once, 2023

work page 2023
[52]

Fighting fire with fire: Avoiding dnn shortcuts through priming

Chuan Wen, Jianing Qian, Jierui Lin, Jiaye Teng, Dinesh Jayaraman, and Yang Gao. Fighting fire with fire: Avoiding dnn shortcuts through priming. In International Conference on Machine Learning , pages 23723–23750. PMLR, 2022

work page 2022
[53]

Gello: A general, low-cost, and intuitive teleoperation framework for robot manipulators

Philipp Wu, Yide Shentu, Zhongke Yi, Xingyu Lin, and Pieter Abbeel. Gello: A general, low-cost, and intuitive teleoperation framework for robot manipulators. arXiv preprint arXiv:2309.13037, 2023

work page arXiv 2023
[54]

Learn- ing by watching: Physical imitation of manipulation skills from human videos

Haoyu Xiong, Quanzhou Li, Yun-Chun Chen, Homanga Bharadhwaj, Samarth Sinha, and Animesh Garg. Learn- ing by watching: Physical imitation of manipulation skills from human videos. In 2021 IEEE/RSJ Inter- national Conference on Intelligent Robots and Systems (IROS), pages 7827–7834. IEEE, 2021

work page 2021
[55]

Learning Interactive Real-World Simulators

Mengjiao Yang, Yilun Du, Kamyar Ghasemipour, Jonathan Tompson, Dale Schuurmans, and Pieter Abbeel. Learning interactive real-world simulators. arXiv preprint arXiv:2310.06114, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[56]

Deep imitation learning for complex manipulation tasks from virtual reality teleoperation

Tianhao Zhang, Zoe McCarthy, Owen Jow, Dennis Lee, Xi Chen, Ken Goldberg, and Pieter Abbeel. Deep imitation learning for complex manipulation tasks from virtual reality teleoperation. In 2018 IEEE International Conference on Robotics and Automation (ICRA) , pages 5628–5635. IEEE, 2018

work page 2018
[57]

Pointodyssey: A large- scale synthetic dataset for long-term point tracking

Yang Zheng, Adam W Harley, Bokui Shen, Gordon Wetzstein, and Leonidas J Guibas. Pointodyssey: A large- scale synthetic dataset for long-term point tracking. In Proceedings of the IEEE/CVF International Conference on Computer Vision , pages 19855–19865, 2023. APPENDIX A ADDITIONAL EXPERIMENTAL RESULTS A. Simulation Experiments Numerical results. We report ...

work page 2023
[58]

pick up the milk and place it in the basket

random tracking2. filter & retrackvideo “pick up the milk and place it in the basket” Fig. 13: Given a video (left), we query 1000 randomly sampled points using an off-the-shelf TAP model (middle), where each colored dot represents the starting position of a track. We then filter the tracks using a heuristic of their position displacement across the video...

work page