GR-2: A Generative Video-Language-Action Model with Web-Scale Knowledge for Robot Manipulation
Pith reviewed 2026-05-12 01:04 UTC · model grok-4.3
The pith
A robot model pre-trained on 38 million internet videos reaches 97.7 percent success across over 100 manipulation tasks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
GR-2 is a generative video-language-action model first pre-trained on 38 million video clips and over 50 billion tokens from the internet to capture the dynamics of the world. It is subsequently fine-tuned on robot trajectories for both video generation and action prediction. This produces a single model that achieves an average success rate of 97.7 percent across more than 100 tasks and generalizes to novel backgrounds, environments, objects, and tasks. The model also scales effectively as its size increases.
What carries the argument
Two-stage generative video-language-action architecture: web-scale video pre-training to learn dynamics followed by fine-tuning on robot trajectories for joint video and action prediction.
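To make the load-bearing recipe concrete, here is a minimal sketch of what the two objectives could look like: stage one trains next-video-token prediction on tokenized web video alone; stage two adds an action head and optimizes video generation and action prediction jointly on robot trajectories. Every module name, dimension, and loss weight below is an illustrative assumption; GR-2's published tokenizer and action decoder differ.

```python
# Illustrative two-stage objective; not GR-2's actual architecture.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyVLA(nn.Module):
    """Toy video-language-action model: one shared trunk, two heads."""
    def __init__(self, d=256, vocab=8192, action_dim=7):
        super().__init__()
        self.embed = nn.Embedding(vocab, d)
        self.trunk = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=d, nhead=4, batch_first=True),
            num_layers=2,
        )
        self.video_head = nn.Linear(d, vocab)        # next-video-token logits
        self.action_head = nn.Linear(d, action_dim)  # continuous action readout

    def forward(self, tokens):
        h = self.trunk(self.embed(tokens))
        return self.video_head(h), self.action_head(h[:, -1])

def pretrain_loss(model, video_tokens):
    """Stage 1: next-token prediction on tokenized web video only."""
    logits, _ = model(video_tokens[:, :-1])
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           video_tokens[:, 1:].reshape(-1))

def finetune_loss(model, traj_tokens, actions, w_action=1.0):
    """Stage 2: joint video generation + action prediction on robot data."""
    logits, pred_action = model(traj_tokens[:, :-1])
    video_nll = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                                traj_tokens[:, 1:].reshape(-1))
    return video_nll + w_action * F.mse_loss(pred_action, actions)
```

The load-bearing choice in this sketch is the shared trunk: stage two reads actions out of the same sequence model that learned video dynamics in stage one, which is what makes the transfer claim testable.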
If this is right
- A single model can handle over 100 distinct manipulation tasks without task-specific retraining or architectures.
- Strong generalization to unseen objects and environments reduces the volume of robot data needed for new deployments.
- Performance improves as model size grows, indicating further gains are possible with additional compute.
- Joint prediction of future video and actions supports planning by allowing the model to simulate outcomes before acting.
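On the last point, a hedged sketch of what simulating outcomes before acting could mean in practice: naive random-shooting control, where candidate action chunks are scored by how close the imagined future lands to a goal embedding. The `model.imagine` helper and the goal embedding are hypothetical stand-ins; the paper does not describe GR-2 being deployed this way.

```python
# Hypothetical planning-by-imagination loop; `model.imagine` is a stand-in
# for rolling the generative video head forward under candidate actions.
import torch

def plan_by_imagination(model, obs_tokens, goal_emb,
                        n_candidates=16, horizon=8, action_dim=7):
    best_score, best_actions = float("-inf"), None
    for _ in range(n_candidates):
        actions = torch.randn(horizon, action_dim)       # naive proposals
        future_emb = model.imagine(obs_tokens, actions)  # imagined outcome
        score = torch.cosine_similarity(future_emb, goal_emb, dim=-1).mean()
        if score.item() > best_score:
            best_score, best_actions = score.item(), actions
    return best_actions
```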
Where Pith is reading between the lines
- Web video pre-training could serve as a cheaper alternative to large-scale robot data collection for building generalist agents.
- The same pre-training approach might transfer to other embodied tasks such as navigation or tool use if video dynamics are sufficiently shared.
- Future extensions could test whether adding explicit physics simulation during pre-training further closes any remaining domain gap.
- Public release of the model weights would let other groups measure how well the claimed generalization holds on their own robot platforms.
Load-bearing premise
Dynamics knowledge extracted from internet videos transfers directly to physical robot control despite differences in viewpoint, embodiment, and lighting.
What would settle it
A controlled experiment showing that GR-2's success rate drops sharply below 50 percent on manipulation tasks involving physical interactions rarely seen in typical internet videos, such as precise insertion of delicate parts under novel lighting.
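A hedged sketch of how that test could be scored, assuming each evaluation task is hand-tagged by how often its contact regime appears in web video; the split names, counts, and the 50 percent threshold simply mirror the criterion above.

```python
# Illustrative falsification scorer; task names and numbers are made up.
def falsification_check(results):
    """results: (task, split, successes, trials) with split in
    {'video-common', 'video-rare'}."""
    totals = {}
    for _, split, s, n in results:
        agg = totals.setdefault(split, [0, 0])
        agg[0], agg[1] = agg[0] + s, agg[1] + n
    rates = {k: s / n for k, (s, n) in totals.items()}
    # The falsifier fires if the rarely-filmed split collapses below 50%.
    return rates, rates.get("video-rare", 1.0) < 0.5

rates, falsified = falsification_check([
    ("pick-place-mug", "video-common", 49, 50),
    ("insert-delicate-pin", "video-rare", 17, 50),
])
```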
Original abstract
We present GR-2, a state-of-the-art generalist robot agent for versatile and generalizable robot manipulation. GR-2 is first pre-trained on a vast number of Internet videos to capture the dynamics of the world. This large-scale pre-training, involving 38 million video clips and over 50 billion tokens, equips GR-2 with the ability to generalize across a wide range of robotic tasks and environments during subsequent policy learning. Following this, GR-2 is fine-tuned for both video generation and action prediction using robot trajectories. It exhibits impressive multi-task learning capabilities, achieving an average success rate of 97.7% across more than 100 tasks. Moreover, GR-2 demonstrates exceptional generalization to new, previously unseen scenarios, including novel backgrounds, environments, objects, and tasks. Notably, GR-2 scales effectively with model size, underscoring its potential for continued growth and application. Project page: \url{https://gr2-manipulation.github.io}.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents GR-2, a generative video-language-action model for robot manipulation. It is first pre-trained on 38 million internet video clips (over 50 billion tokens) to capture world dynamics, then fine-tuned on robot trajectories for joint video generation and action prediction. The central empirical claim is an average success rate of 97.7% across more than 100 tasks together with strong generalization to novel backgrounds, environments, objects, and tasks, plus favorable scaling with model size.
Significance. If the performance numbers and generalization claims are substantiated by controlled experiments, the work would demonstrate that web-scale video pre-training can supply transferable dynamics knowledge that improves robotic policy learning, potentially reducing reliance on large amounts of robot-specific data. This would be a notable data-efficiency result in the generalist robot manipulation literature.
Major comments (3)
- [§4] §4 (Experiments): The manuscript reports a 97.7% average success rate and broad generalization but provides no ablation that holds model architecture, fine-tuning data volume, and task suite fixed while removing the 38M internet-video pre-training stage. Without this control, the headline attribution of performance to web-scale dynamics knowledge remains untested and compatible with explanations based solely on robot-trajectory fine-tuning scale or task curation. (A minimal sketch of such a controlled configuration follows this list.)
- [§4.1] §4.1 and Table 2: Evaluation protocols are insufficiently specified; the text does not state the number of trials per task, the precise success criteria (e.g., end-effector tolerance, object pose thresholds), or whether error bars reflect multiple random seeds or environment variations. These details are load-bearing for the generalization claims across >100 tasks.
- [§3.2] §3.2 (Pre-training and fine-tuning): The domain-shift argument (human-centric 2D video to 3D robot proprioception and contact) is acknowledged but not quantified; no analysis or auxiliary experiment measures how much dynamics knowledge actually transfers versus being re-learned during the robot fine-tuning phase.
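As a concrete rendering of the control the first major comment asks for, here is a hedged configuration sketch: hold everything fixed and toggle only the initialization. Every field name and value is illustrative, not from the paper.

```python
# Hypothetical ablation grid isolating the pre-training stage.
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class RunConfig:
    init: str             # "web-video-pretrained" or "random"
    model_params: int     # held fixed across both arms
    finetune_episodes: int
    task_suite: str

base = RunConfig(init="web-video-pretrained", model_params=700_000_000,
                 finetune_episodes=50_000, task_suite="gr2-100task")
ablation = replace(base, init="random")   # the single controlled difference
```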
Minor comments (3)
- [Abstract] The abstract states 'more than 100 tasks' while the main text should give the exact count, task taxonomy, and breakdown of success rates by category (e.g., pick-and-place vs. articulated objects).
- [Figure 5] Figure captions and axis labels in the scaling plots should explicitly state the x-axis metric (parameter count or FLOPs) and whether the fine-tuning data volume was held constant across model sizes.
- [§2] The related-work section should include a direct comparison paragraph with contemporaneous video-language-action models (e.g., RT-2, PaLM-E) that also use large-scale pre-training, highlighting architectural and data differences.
Simulated Author's Rebuttal
We thank the referee for the constructive and insightful comments on our manuscript. We address each of the major comments point by point below, and we plan to incorporate revisions to improve the paper accordingly.
Point-by-point responses
-
Referee: §4 (Experiments): The manuscript reports a 97.7% average success rate and broad generalization but provides no ablation that holds model architecture, fine-tuning data volume, and task suite fixed while removing the 38M internet-video pre-training stage. Without this control, the headline attribution of performance to web-scale dynamics knowledge remains untested and compatible with explanations based solely on robot-trajectory fine-tuning scale or task curation.
Authors: We agree that a controlled ablation isolating the pre-training stage would provide the strongest evidence for the benefits of web-scale video pre-training. Unfortunately, training our model architecture from random initialization on only the robot trajectories is not feasible due to the substantial computational resources required and the limited scale of available robot data. We instead demonstrate the value of pre-training through comparisons with non-pretrained baselines and through scaling experiments. We will revise Section 4 to include a more explicit discussion of these limitations and the supporting evidence from our current experiments.
Revision: partial
-
Referee: §4.1 and Table 2: Evaluation protocols are insufficiently specified; the text does not state the number of trials per task, the precise success criteria (e.g., end-effector tolerance, object pose thresholds), or whether error bars reflect multiple random seeds or environment variations. These details are load-bearing for the generalization claims across >100 tasks.
Authors: Thank you for highlighting this issue. We will update the manuscript to provide a clear description of the evaluation protocol in Section 4.1. Specifically, we will state that each task is evaluated over 20 trials, detail the success criteria involving object pose and end-effector position thresholds, and clarify that the reported results and any error bars are computed over multiple random seeds and environment configurations to account for variations.
Revision: yes
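For context on what 20 trials per task can and cannot establish, the arithmetic below is a standard Wilson score interval (ordinary statistics, not from the paper): even a perfect 20 of 20 leaves a 95 percent lower confidence bound near 84 percent for that task.

```python
import math

def wilson_interval(successes, trials, z=1.96):
    """95% Wilson score interval for a binomial success rate."""
    p = successes / trials
    denom = 1 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = (z / denom) * math.sqrt(
        p * (1 - p) / trials + z * z / (4 * trials ** 2))
    return center - half, center + half

lo, hi = wilson_interval(20, 20)   # approx (0.839, 1.000)
```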
-
Referee: §3.2 (Pre-training and fine-tuning): The domain-shift argument (human-centric 2D video to 3D robot proprioception and contact) is acknowledged but not quantified; no analysis or auxiliary experiment measures how much dynamics knowledge actually transfers versus being re-learned during the robot fine-tuning phase.
Authors: We acknowledge that quantifying the exact amount of transferred dynamics knowledge versus re-learning during fine-tuning would be valuable. Our current experiments show significant performance improvements attributable to pre-training, but we do not include a direct measurement of transfer. In the revised manuscript, we will expand Section 3.2 with additional discussion on the domain shift and outline potential methods for future quantification, such as through intermediate representation analysis.
Revision: partial
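One hedged sketch of what that intermediate representation analysis could look like: extract features for the same batch of robot frames from the pre-trained and fine-tuned models, then compare them with linear centered kernel alignment (CKA). High similarity would suggest the policy reuses pre-trained dynamics features rather than relearning them. Nothing below comes from the paper, and the random tensors stand in for real feature extractions.

```python
import torch

def linear_cka(X, Y):
    """Linear CKA between feature matrices of shape (n_samples, dim);
    returns a scalar in [0, 1]."""
    X = X - X.mean(dim=0, keepdim=True)
    Y = Y - Y.mean(dim=0, keepdim=True)
    hsic = (Y.T @ X).norm(p="fro") ** 2
    return (hsic / ((X.T @ X).norm(p="fro") * (Y.T @ Y).norm(p="fro"))).item()

# Stand-ins for features of the same robot frames before/after fine-tuning.
feats_pretrained = torch.randn(512, 256)
feats_finetuned = torch.randn(512, 256)
similarity = linear_cka(feats_pretrained, feats_finetuned)
```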
Circularity Check
No circularity in empirical training and evaluation pipeline
Full rationale
The paper describes a two-stage empirical process: large-scale pre-training on 38M internet video clips followed by fine-tuning on robot trajectories, with performance measured via direct task success rates (97.7% average across >100 tasks). No equations, derivations, or first-principles predictions are presented. Claims about generalization rest on observed evaluation outcomes rather than any quantity that reduces to its inputs by construction. Self-citations (if present) do not supply load-bearing uniqueness theorems or ansatzes for the results; the work is self-contained as standard large-scale model training and benchmarking.
Axiom & Free-Parameter Ledger
Axioms (1)
- Domain assumption: internet videos contain transferable dynamics knowledge for robot manipulation.
Forward citations
Cited by 40 Pith papers
-
ALAM: Algebraically Consistent Latent Action Model for Vision-Language-Action Models
ALAM creates algebraically consistent latent action transitions from videos to act as auxiliary generative targets, raising robot policy success rates from 47.9% to 85.0% on MetaWorld MT50 and 94.1% to 98.1% on LIBERO.
-
LoopVLA: Learning Sufficiency in Recurrent Refinement for Vision-Language-Action Models
LoopVLA adds recurrent refinement and learned sufficiency estimation to VLA models, cutting parameters 45% and raising throughput 1.7x while matching baseline task success on LIBERO and VLA-Arena.
-
One Token Per Frame: Reconsidering Visual Bandwidth in World Models for VLA Policy
Reducing visual input to one token per frame in VLA world models maintains or improves long-horizon performance on MetaWorld, LIBERO, and real-robot tasks.
-
NoiseGate: Learning Per-Latent Timestep Schedules as Information Gating in World Action Models
NoiseGate learns per-latent timestep schedules as an information-gating policy in diffusion-based world action models, yielding consistent gains on RoboTwin manipulation tasks.
-
Being-H0.7: A Latent World-Action Model from Egocentric Videos
Being-H0.7 adds future-aware latent reasoning to direct VLA policies via dual-branch alignment on latent queries, matching world-model benefits at VLA efficiency.
-
Privileged Foresight Distillation: Zero-Cost Future Correction for World Action Models
Privileged Foresight Distillation distills the residual difference in action predictions with versus without future context into a current-only adapter, yielding consistent gains on LIBERO and RoboTwin benchmarks.
-
${\pi}_{0.7}$: a Steerable Generalist Robotic Foundation Model with Emergent Capabilities
π₀.₇ is a steerable generalist robotic model that uses rich multimodal prompts including language, subgoal images, and performance metadata to achieve out-of-the-box generalization across tasks and robot bodies.
-
ViVa: A Video-Generative Value Model for Robot Reinforcement Learning
ViVa turns a video generator into a value model for robot RL that jointly forecasts future states and task value, yielding better performance on real-world box assembly when integrated with RECAP.
-
HarmoWAM: Harmonizing Generalizable and Precise Manipulation via Adaptive World Action Models
HarmoWAM unifies predictive and reactive control in world action models via an adaptive gating mechanism to deliver improved zero-shot generalization and precision in robotic manipulation.
-
PriorVLA: Prior-Preserving Adaptation for Vision-Language-Action Models
PriorVLA preserves pretrained priors in VLA models through a frozen Prior Expert and trained Adaptation Expert, delivering better robot manipulation performance than full fine-tuning with only 25% of the parameter updates.
-
ALAM: Algebraically Consistent Latent Action Model for Vision-Language-Action Models
ALAM introduces algebraic consistency regularization on latent action transitions from videos, raising VLA success rates from 47.9% to 85.0% on MetaWorld MT50 and 94.1% to 98.1% on LIBERO.
-
One Token Per Frame: Reconsidering Visual Bandwidth in World Models for VLA Policy
Reducing visual input to one token per frame in world models for vision-language-action policies maintains long-horizon performance while improving success rates on MetaWorld, LIBERO, and real-robot tasks.
-
One Token Per Frame: Reconsidering Visual Bandwidth in World Models for VLA Policy
Reducing visual input to one token per frame via adaptive attention pooling and a unified flow-matching objective improves long-horizon performance in VLA policies on MetaWorld, LIBERO, and real-robot tasks.
-
DenseStep2M: A Scalable, Training-Free Pipeline for Dense Instructional Video Annotation
A scalable training-free pipeline using video segmentation, filtering, and off-the-shelf multimodal models creates DenseStep2M, a dataset of 100K videos and 2M detailed instructional steps that improves dense captioni...
-
Learning Human-Intention Priors from Large-Scale Human Demonstrations for Robotic Manipulation
MoT-HRA learns embodiment-agnostic human-intention priors from the HA-2.2M dataset of 2.2M human video episodes through a three-expert hierarchy to improve robotic motion plausibility and robustness under distribution shift.
-
GazeVLA: Learning Human Intention for Robotic Manipulation
GazeVLA pretrains on large human egocentric datasets to capture gaze-based intention, then finetunes on limited robot data with chain-of-thought reasoning to achieve better robotic manipulation performance than baselines.
-
CorridorVLA: Explicit Spatial Constraints for Generative Action Heads via Sparse Anchors
CorridorVLA improves VLA models by using predicted sparse anchors to impose explicit spatial corridors on action trajectories, yielding 3.4-12.4% success rate gains on LIBERO-Plus with GR00T-Corr reaching 83.21%.
-
Human Cognition in Machines: A Unified Perspective of World Models
The paper introduces a unified framework for world models that fully incorporates all cognitive functions from Cognitive Architecture Theory, highlights under-researched areas in motivation and meta-cognition, and pro...
-
Device-Conditioned Neural Architecture Search for Efficient Robotic Manipulation
DC-QFA trains one supernet over architectures and bit-widths, then runs a fast per-device search plus multi-step distillation to deliver 2-3x faster robotic policies across hardware with negligible success-rate drop.
-
VAG: Dual-Stream Video-Action Generation for Embodied Data Synthesis
VAG is a synchronized dual-stream flow-matching framework that generates aligned video-action pairs for synthetic embodied data synthesis and policy pretraining.
-
SIM1: Physics-Aligned Simulator as Zero-Shot Data Scaler in Deformable Worlds
SIM1 converts sparse real demonstrations into high-fidelity synthetic data through physics-aligned simulation, yielding policies that match real-data performance at a 1:15 ratio with 90% zero-shot success on deformabl...
-
Multi-View Video Diffusion Policy: A 3D Spatio-Temporal-Aware Video Action Model
MV-VDP jointly predicts multi-view RGB and heatmap videos via diffusion to achieve data-efficient, robust robotic manipulation policies.
-
Fast-WAM: Do World Action Models Need Test-time Future Imagination?
Fast-WAM shows that explicit future imagination at test time is not required for strong WAM performance; video modeling during training provides the main benefit.
-
World Action Models are Zero-shot Policies
DreamZero uses a 14B video diffusion model as a World Action Model to achieve over 2x better zero-shot generalization on real robots than state-of-the-art VLAs, real-time 7Hz closed-loop control, and cross-embodiment ...
-
GR00T N1: An Open Foundation Model for Generalist Humanoid Robots
GR00T N1 is a new open VLA foundation model for humanoid robots that outperforms imitation learning baselines in simulation and shows strong performance on real-world bimanual manipulation tasks.
-
FAST: Efficient Action Tokenization for Vision-Language-Action Models
FAST applies discrete cosine transform to robot action sequences for efficient tokenization, enabling autoregressive VLAs to succeed on high-frequency dexterous tasks and scale to 10k hours of data while matching diff...
-
CogACT: A Foundational Vision-Language-Action Model for Synergizing Cognition and Action in Robotic Manipulation
CogACT is a new VLA model that uses a conditioned diffusion action transformer to achieve over 35% higher average success rates than OpenVLA in simulation and 55% in real-robot experiments while generalizing to new ro...
-
From Where Things Are to What They Are For: Benchmarking Spatial-Functional Intelligence in Multimodal LLMs
SFI-Bench shows current multimodal LLMs struggle to integrate spatial memory with functional reasoning and external knowledge in video tasks.
-
Cortex 2.0: Grounding World Models in Real-World Industrial Deployment
Cortex 2.0 introduces world-model-based planning that generates and scores future trajectories to outperform reactive vision-language-action baselines on industrial robotic tasks including pick-and-place, sorting, and...
-
StableIDM: Stabilizing Inverse Dynamics Model against Manipulator Truncation via Spatio-Temporal Refinement
StableIDM stabilizes inverse dynamics models under manipulator truncation by combining robot-centric masking, directional spatial feature aggregation, and temporal dynamics refinement, yielding 12.1% higher strict act...
-
M100: An Orchestrated Dataflow Architecture Powering General AI Computing
M100 is a tensor-based dataflow architecture that eliminates heavy caching through compiler-managed data streams, claiming higher utilization and better performance than GPGPUs for AD and LLM inference tasks.
-
ComSim: Building Scalable Real-World Robot Data Generation via Compositional Simulation
Compositional Simulation generates scalable real-world robot training data by combining classical simulation with neural simulation in a closed-loop real-sim-real augmentation pipeline.
-
From Video to Control: A Survey of Learning Manipulation Interfaces from Temporal Visual Data
A survey introduces an interface-centric taxonomy for video-to-control methods in robotic manipulation and identifies the robotics integration layer as the central open challenge.
-
WorldVLA: Towards Autoregressive Action World Model
WorldVLA unifies VLA and world models in one autoregressive system, shows they boost each other, and adds an attention mask to stop error buildup when generating action chunks.
-
SpatialVLA: Exploring Spatial Representations for Visual-Language-Action Model
SpatialVLA adds 3D-aware position encoding and adaptive discretized action grids to visual-language-action models, enabling strong zero-shot performance and fine-tuning on new robot setups after pre-training on 1.1 mi...
-
World Action Models: The Next Frontier in Embodied AI
The paper introduces World Action Models as a new paradigm unifying predictive world modeling with action generation in embodied foundation models and provides a taxonomy of existing approaches.
-
RLDX-1 Technical Report
RLDX-1 achieves 86.8% success on complex ALLEX humanoid manipulation tasks where prior VLAs reach only around 40%.
-
RLDX-1 Technical Report
RLDX-1 outperforms frontier VLAs such as π0.5 and GR00T N1.6 on dexterous manipulation benchmarks, reaching 86.8% success on ALLEX humanoid tasks versus around 40% for the baselines.
-
Vision-and-Language Navigation for UAVs: Progress, Challenges, and a Research Roadmap
A survey of UAV vision-and-language navigation that establishes a methodological taxonomy, reviews resources and challenges, and proposes a forward-looking research roadmap.
-
Cosmos World Foundation Model Platform for Physical AI
The Cosmos platform supplies open-source pre-trained world models and supporting tools for building fine-tunable digital world simulations to train Physical AI.
Reference graph
Works this paper leans on
-
[1]
Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. GPT-4 technical report. arXiv preprint arXiv:2303.08774, 2023
-
[2]
SAM 2: Segment Anything in Images and Videos
Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman Rädle, Chloe Rolland, Laura Gustafson, et al. SAM 2: Segment anything in images and videos. arXiv preprint arXiv:2408.00714, 2024
-
[3]
Video generation models as world simulators
Tim Brooks, Bill Peebles, Connor Holmes, Will DePue, Yufei Guo, Li Jing, David Schnurr, Joe Taylor, Troy Luhman, Eric Luhman, Clarence Ng, Ricky Wang, and Aditya Ramesh. Video generation models as world simulators. 2024
-
[4]
Language Models are Few-Shot Learners
Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. arXiv preprint arXiv:2005.14165, 2020
-
[5]
Unleashing Large-Scale Video Generative Pre-training for Visual Robot Manipulation
Hongtao Wu, Ya Jing, Chilam Cheang, Guangzeng Chen, Jiafeng Xu, Xinghang Li, Minghuan Liu, Hang Li, and Tao Kong. Unleashing large-scale video generative pre-training for visual robot manipulation. arXiv preprint arXiv:2312.13139, 2023
-
[6]
Learning transferable visual models from natural language supervision
Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021
-
[7]
Taming transformers for high-resolution image synthesis
Patrick Esser, Robin Rombach, and Bjorn Ommer. Taming transformers for high-resolution image synthesis. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 12873–12883, 2021
-
[8]
Howto100m: Learning a text-video embedding by watching hundred million narrated video clips
Antoine Miech, Dimitri Zhukov, Jean-Baptiste Alayrac, Makarand Tapaswi, Ivan Laptev, and Josef Sivic. Howto100m: Learning a text-video embedding by watching hundred million narrated video clips. In Proceedings of the IEEE/CVF international conference on computer vision, pages 2630–2640, 2019
-
[9]
Ego4D: Around the world in 3,000 hours of egocentric video
Kristen Grauman, Andrew Westbury, Eugene Byrne, Zachary Chavis, Antonino Furnari, Rohit Girdhar, Jackson Hamburger, Hao Jiang, Miao Liu, Xingyu Liu, et al. Ego4D: Around the world in 3,000 hours of egocentric video. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18995–19012, 2022
-
[10]
Raghav Goyal, Samira Ebrahimi Kahou, Vincent Michalski, Joanna Materzynska, Susanne Westphal, Heuna Kim, Valentin Haenel, Ingo Fruend, Peter Yianilos, Moritz Mueller-Freitag, et al. The "something something" video database for learning and evaluating visual common sense. In Proceedings of the IEEE international conference on computer vision, pages 5842–5...
-
[11]
Scaling egocentric vision: The epic-kitchens dataset
Dima Damen, Hazel Doughty, Giovanni Maria Farinella, Sanja Fidler, Antonino Furnari, Evangelos Kazakos, Davide Moltisanti, Jonathan Munro, Toby Perrett, Will Price, et al. Scaling egocentric vision: The epic-kitchens dataset. In Proceedings of the European conference on computer vision (ECCV), pages 720–736, 2018
-
[12]
A short note on the kinetics-700 human action dataset
Joao Carreira, Eric Noland, Chloe Hillier, and Andrew Zisserman. A short note on the kinetics-700 human action dataset. arXiv preprint arXiv:1907.06987, 2019
-
[13]
MediaPipe: A Framework for Building Perception Pipelines
Camillo Lugaresi, Jiuqiang Tang, Hadon Nash, Chris McClanahan, Esha Uboweja, Michael Hays, Fan Zhang, Chuo-Ling Chang, Ming Guang Yong, Juhyun Lee, et al. Mediapipe: A framework for building perception pipelines. arXiv preprint arXiv:1906.08172, 2019
-
[14]
Open-Sora: Democratizing efficient video production for all
Zangwei Zheng, Xiangyu Peng, Tianji Yang, Chenhui Shen, Shenggui Li, Hongxin Liu, Yukun Zhou, Tianyi Li, and Yang You. Open-Sora: Democratizing efficient video production for all, March 2024
-
[15]
RT-1: Robotics Transformer for Real-World Control at Scale
Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Joseph Dabis, Chelsea Finn, Keerthana Gopalakrishnan, Karol Hausman, Alex Herzog, Jasmine Hsu, et al. RT-1: Robotics transformer for real-world control at scale. arXiv preprint arXiv:2212.06817, 2022
-
[16]
Bridgedata v2: A dataset for robot learning at scale
Homer Rich Walke, Kevin Black, Tony Z Zhao, Quan Vuong, Chongyi Zheng, Philippe Hansen-Estruch, Andre Wang He, Vivek Myers, Moo Jin Kim, Max Du, et al. Bridgedata v2: A dataset for robot learning at scale. In Conference on Robot Learning, pages 1723–1736. PMLR, 2023
-
[17]
Learning structured output representation using deep conditional generative models
Kihyuk Sohn, Honglak Lee, and Xinchen Yan. Learning structured output representation using deep conditional generative models. Advances in neural information processing systems, 28, 2015
-
[18]
Auto-Encoding Variational Bayes
Diederik P Kingma and Max Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013
-
[19]
Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware
Tony Z Zhao, Vikash Kumar, Sergey Levine, and Chelsea Finn. Learning fine-grained bimanual manipulation with low-cost hardware. arXiv preprint arXiv:2304.13705, 2023
-
[20]
MOMA-Force: Visual-force imitation for real-world mobile manipulation
Taozheng Yang, Ya Jing, Hongtao Wu, Jiafeng Xu, Kuankuan Sima, Guangzeng Chen, Qie Sima, and Tao Kong. MOMA-Force: Visual-force imitation for real-world mobile manipulation. In 2023 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 6847–6852. IEEE, 2023
-
[21]
Oier Mees, Lukas Hermann, Erick Rosete-Beas, and Wolfram Burgard. CALVIN: A benchmark for language-conditioned policy learning for long-horizon robot manipulation tasks. IEEE Robotics and Automation Letters (RA-L), 7(3):7327–7334, 2022
-
[22]
Denoising diffusion probabilistic models
Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in neural information processing systems, 33:6840–6851, 2020
-
[23]
Alina Kuznetsova, Hassan Rom, Neil Alldrin, Jasper Uijlings, Ivan Krasin, Jordi Pont-Tuset, Shahab Kamali, Stefan Popov, Matteo Malloci, Alexander Kolesnikov, et al. The open images dataset v4: Unified image classification, object detection, and visual relationship detection at scale. International journal of computer vision, 128(7):1956–1981, 2020
-
[24]
Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C Berg, Wan-Yen Lo, et al. Segment anything. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4015–4026, 2023
-
[25]
Latte: Latent Diffusion Transformer for Video Generation
Xin Ma, Yaohui Wang, Gengyun Jia, Xinyuan Chen, Ziwei Liu, Yuan-Fang Li, Cunjian Chen, and Yu Qiao. Latte: Latent diffusion transformer for video generation. arXiv preprint arXiv:2401.03048, 2024
-
[26]
Homanga Bharadhwaj, Jay Vakil, Mohit Sharma, Abhinav Gupta, Shubham Tulsiani, and Vikash Kumar. RoboAgent: Generalization and efficiency in robot manipulation via semantic augmentations and action chunking. arXiv preprint arXiv:2309.01918, 2023
-
[27]
What matters in language conditioned robotic imitation learning over unstructured data
Oier Mees, Lukas Hermann, and Wolfram Burgard. What matters in language conditioned robotic imitation learning over unstructured data. IEEE Robotics and Automation Letters, 7(4):11205–11212, 2022
-
[28]
Vision-language foundation models as effective robot imitators
Xinghang Li, Minghuan Liu, Hanbo Zhang, Cunjun Yu, Jie Xu, Hongtao Wu, Chilam Cheang, Ya Jing, Weinan Zhang, Huaping Liu, et al. Vision-language foundation models as effective robot imitators. arXiv preprint arXiv:2311.01378, 2023
-
[29]
RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control
Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Xi Chen, Krzysztof Choromanski, Tianli Ding, Danny Driess, Avinava Dubey, Chelsea Finn, et al. RT-2: Vision-language-action models transfer web knowledge to robotic control. arXiv preprint arXiv:2307.15818, 2023
-
[30]
VIMA: General robot manipulation with multimodal prompts
Yunfan Jiang, Agrim Gupta, Zichen Zhang, Guanzhi Wang, Yongqiang Dou, Yanjun Chen, Li Fei-Fei, Anima Anandkumar, Yuke Zhu, and Linxi Fan. VIMA: General robot manipulation with multimodal prompts. arXiv preprint arXiv:2210.03094, 2(3):6, 2022
-
[31]
Language conditioned imitation learning over unstructured data
Corey Lynch and Pierre Sermanet. Language conditioned imitation learning over unstructured data. arXiv preprint arXiv:2005.07648, 2020
-
[32]
BC-Z: Zero-shot task generalization with robotic imitation learning
Eric Jang, Alex Irpan, Mohi Khansari, Daniel Kappler, Frederik Ebert, Corey Lynch, Sergey Levine, and Chelsea Finn. BC-Z: Zero-shot task generalization with robotic imitation learning. In Conference on Robot Learning, pages 991–1002. PMLR, 2022
-
[33]
Multimodal diffusion transformer: Learning versatile behavior from multimodal goals
Moritz Reuss, Ömer Erdinç Yağmurlu, Fabian Wenzel, and Rudolf Lioutikov. Multimodal diffusion transformer: Learning versatile behavior from multimodal goals. In Robotics: Science and Systems, 2024
-
[34]
Scaling up and distilling down: Language-guided robot skill acquisition
Huy Ha, Pete Florence, and Shuran Song. Scaling up and distilling down: Language-guided robot skill acquisition. In Conference on Robot Learning, pages 3766–3777. PMLR, 2023
-
[35]
CLIPort: What and where pathways for robotic manipulation
Mohit Shridhar, Lucas Manuelli, and Dieter Fox. CLIPort: What and where pathways for robotic manipulation. In Conference on robot learning, pages 894–906. PMLR, 2022
-
[36]
Octo: An Open-Source Generalist Robot Policy
Octo Model Team, Dibya Ghosh, Homer Walke, Karl Pertsch, Kevin Black, Oier Mees, Sudeep Dasari, Joey Hejna, Tobias Kreiman, Charles Xu, et al. Octo: An open-source generalist robot policy. arXiv preprint arXiv:2405.12213, 2024
- [37]
-
[38]
Open X-Embodiment: Robotic Learning Datasets and RT-X Models
Abhishek Padalkar, Acorn Pooley, Ajinkya Jain, Alex Bewley, Alex Herzog, Alex Irpan, Alexander Khazatsky, Anant Rai, Anikait Singh, Anthony Brohan, et al. Open x-embodiment: Robotic learning datasets and rt-x models. arXiv preprint arXiv:2310.08864, 2023
-
[39]
Scott Reed, Konrad Zolna, Emilio Parisotto, Sergio Gomez Colmenarejo, Alexander Novikov, Gabriel Barth-Maron, Mai Gimenez, Yury Sulsky, Jackie Kay, Jost Tobias Springenberg, et al. A generalist agent. arXiv preprint arXiv:2205.06175, 2022
-
[40]
OpenVLA: An Open-Source Vision-Language-Action Model
Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan Foster, Grace Lam, Pannag Sanketi, et al. OpenVLA: An open-source vision-language-action model. arXiv preprint arXiv:2406.09246, 2024
-
[41]
Perceiver-actor: A multi-task transformer for robotic manipulation
Mohit Shridhar, Lucas Manuelli, and Dieter Fox. Perceiver-actor: A multi-task transformer for robotic manipulation. In Conference on Robot Learning, pages 785–799. PMLR, 2023
-
[42]
ChainedDiffuser: Unifying trajectory diffusion and keypose prediction for robotic manipulation
Zhou Xian, Nikolaos Gkanatsios, Theophile Gervet, Tsung-Wei Ke, and Katerina Fragkiadaki. ChainedDiffuser: Unifying trajectory diffusion and keypose prediction for robotic manipulation. In 7th Annual Conference on Robot Learning, 2023
-
[43]
3D Diffuser Actor: Policy diffusion with 3D scene representations
Tsung-Wei Ke, Nikolaos Gkanatsios, and Katerina Fragkiadaki. 3d diffuser actor: Policy diffusion with 3d scene representations. arXiv preprint arXiv:2402.10885, 2024
-
[44]
Act3D: 3d feature field transformers for multi-task robotic manipulation
Theophile Gervet, Zhou Xian, Nikolaos Gkanatsios, and Katerina Fragkiadaki. Act3D: 3d feature field transformers for multi-task robotic manipulation. In 7th Annual Conference on Robot Learning, 2023
-
[45]
RoboCat: A self-improving foundation agent for robotic manipulation
Konstantinos Bousmalis, Giulia Vezzani, Dushyant Rao, Coline Devin, Alex X Lee, Maria Bauza, Todor Davchev, Yuxiang Zhou, Agrim Gupta, Akhil Raju, et al. RoboCat: A self-improving foundation agent for robotic manipulation. arXiv preprint arXiv:2306.11706, 2023
-
[46]
Transporters with visual foresight for solving unseen rearrangement tasks
Hongtao Wu, Jikai Ye, Xin Meng, Chris Paxton, and Gregory S Chirikjian. Transporters with visual foresight for solving unseen rearrangement tasks. In 2022 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 10756–10763. IEEE, 2022
-
[47]
Daniel Seita, Pete Florence, Jonathan Tompson, Erwin Coumans, Vikas Sindhwani, Ken Goldberg, and Andy Zeng. Learning to rearrange deformable cables, fabrics, and bags with goal-conditioned transporter networks. In 2021 IEEE International Conference on Robotics and Automation (ICRA), pages 4568–4575. IEEE, 2021
-
[48]
Goal-conditioned end-to-end visuomotor control for versatile skill primitives
Oliver Groth, Chia-Man Hung, Andrea Vedaldi, and Ingmar Posner. Goal-conditioned end-to-end visuomotor control for versatile skill primitives. In 2021 IEEE International Conference on Robotics and Automation (ICRA), pages 1319–1325. IEEE, 2021
-
[49]
Wish you were here: Hindsight goal selection for long-horizon dexterous manipulation
Todor Davchev, Oleg Sushkov, Jean-Baptiste Regli, Stefan Schaal, Yusuf Aytar, Markus Wulfmeier, and Jon Scholz. Wish you were here: Hindsight goal selection for long-horizon dexterous manipulation. arXiv preprint arXiv:2112.00597, 2021
-
[50]
Masked autoencoders are scalable vision learners
Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Girshick. Masked autoencoders are scalable vision learners. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 16000–16009, 2022
-
[51]
Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners, 2020
-
[52]
Masked visual pre-training for motor control
Tete Xiao, Ilija Radosavovic, Trevor Darrell, and Jitendra Malik. Masked visual pre-training for motor control. arXiv preprint arXiv:2203.06173, 2022
-
[53]
Language-driven representation learning for robotics
Siddharth Karamcheti, Suraj Nair, Annie S Chen, Thomas Kollar, Chelsea Finn, Dorsa Sadigh, and Percy Liang. Language-driven representation learning for robotics. arXiv preprint arXiv:2302.12766, 2023
-
[54]
R3M: A universal visual representation for robot manipulation
Suraj Nair, Aravind Rajeswaran, Vikash Kumar, Chelsea Finn, and Abhinav Gupta. R3M: A universal visual representation for robot manipulation. arXiv preprint arXiv:2203.12601, 2022
-
[55]
Robot learning with sensorimotor pre-training
Ilija Radosavovic, Baifeng Shi, Letian Fu, Ken Goldberg, Trevor Darrell, and Jitendra Malik. Robot learning with sensorimotor pre-training. In Conference on Robot Learning, pages 683–693. PMLR, 2023
-
[56]
Mastering Diverse Domains through World Models
Danijar Hafner, Jurgis Pasukonis, Jimmy Ba, and Timothy Lillicrap. Mastering diverse domains through world models. arXiv preprint arXiv:2301.04104, 2023
-
[57]
Chuan Wen, Xingyu Lin, John So, Kai Chen, Qi Dou, Yang Gao, and Pieter Abbeel. Any-point trajectory modeling for policy learning. arXiv preprint arXiv:2401.00025, 2023
-
[58]
Learning interactive real-world simulators
Mengjiao Yang, Yilun Du, Kamyar Ghasemipour, Jonathan Tompson, Dale Schuurmans, and Pieter Abbeel. Learning interactive real-world simulators. arXiv preprint arXiv:2310.06114, 2023
-
[59]
Masked world models for visual control
Younggyo Seo, Danijar Hafner, Hao Liu, Fangchen Liu, Stephen James, Kimin Lee, and Pieter Abbeel. Masked world models for visual control. In Conference on Robot Learning, pages 1332–1344. PMLR, 2023
-
[60]
Real-world robot learning with masked visual pre-training
Ilija Radosavovic, Tete Xiao, Stephen James, Pieter Abbeel, Jitendra Malik, and Trevor Darrell. Real-world robot learning with masked visual pre-training. In Conference on Robot Learning, pages 416–426. PMLR, 2023
-
[61]
Exploring visual pre-training for robot manipulation: Datasets, models and methods
Ya Jing, Xuelin Zhu, Xingbin Liu, Qie Sima, Taozheng Yang, Yunhai Feng, and Tao Kong. Exploring visual pre-training for robot manipulation: Datasets, models and methods. In 2023 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 11390–11395. IEEE, 2023
-
[62]
Curl: Contrastive unsupervised representations for reinforcement learning
Michael Laskin, Aravind Srinivas, and Pieter Abbeel. Curl: Contrastive unsupervised representations for reinforcement learning. In International conference on machine learning, pages 5639–5650. PMLR, 2020
-
[63]
Time-contrastive networks: Self-supervised learning from video
Pierre Sermanet, Corey Lynch, Yevgen Chebotar, Jasmine Hsu, Eric Jang, Stefan Schaal, Sergey Levine, and Google Brain. Time-contrastive networks: Self-supervised learning from video. In 2018 IEEE international conference on robotics and automation (ICRA), pages 1134–1141. IEEE, 2018
-
[64]
David Ha and Jürgen Schmidhuber. World models. arXiv preprint arXiv:1803.10122, 2018
-
[65]
Video prediction models as rewards for reinforcement learning
Alejandro Escontrela, Ademi Adeniji, Wilson Yan, Ajay Jain, Xue Bin Peng, Ken Goldberg, Youngwoon Lee, Danijar Hafner, and Pieter Abbeel. Video prediction models as rewards for reinforcement learning. Advances in Neural Information Processing Systems, 36, 2024
-
[66]
Learning universal policies via text-guided video generation
Yilun Du, Sherry Yang, Bo Dai, Hanjun Dai, Ofir Nachum, Josh Tenenbaum, Dale Schuurmans, and Pieter Abbeel. Learning universal policies via text-guided video generation. Advances in Neural Information Processing Systems, 36, 2024
-
[67]
Video language planning
Yilun Du, Mengjiao Yang, Pete Florence, Fei Xia, Ayzaan Wahid, Brian Ichter, Pierre Sermanet, Tianhe Yu, Pieter Abbeel, Joshua B Tenenbaum, et al. Video language planning. arXiv preprint arXiv:2310.10625, 2023
-
[68]
Deep visual foresight for planning robot motion
Chelsea Finn and Sergey Levine. Deep visual foresight for planning robot motion. In 2017 IEEE International Conference on Robotics and Automation (ICRA), pages 2786–2793. IEEE, 2017
-
[69]
MaskViT: Masked visual pre-training for video prediction
Agrim Gupta, Stephen Tian, Yunzhi Zhang, Jiajun Wu, Roberto Martín-Martín, and Li Fei-Fei. MaskViT: Masked visual pre-training for video prediction. arXiv preprint arXiv:2206.11894, 2022
-
[70]
Video pretraining (vpt): Learning to act by watching unlabeled online videos
Bowen Baker, Ilge Akkaya, Peter Zhokov, Joost Huizinga, Jie Tang, Adrien Ecoffet, Brandon Houghton, Raul Sampedro, and Jeff Clune. Video pretraining (vpt): Learning to act by watching unlabeled online videos. Advances in Neural Information Processing Systems, 35:24639–24654, 2022
-
[71]
SpawnNet: Learning generalizable visuomotor skills from pre-trained network
Xingyu Lin, John So, Sashwat Mahalingam, Fangchen Liu, and Pieter Abbeel. SpawnNet: Learning generalizable visuomotor skills from pre-trained network. In 2024 IEEE International Conference on Robotics and Automation (ICRA), pages 4781–4787. IEEE, 2024