pith. machine review for the scientific record.

arxiv: 2503.14734 · v2 · submitted 2025-03-18 · 💻 cs.RO · cs.AI · cs.LG

Recognition: 2 theorem links

· Lean Theorem

GR00T N1: An Open Foundation Model for Generalist Humanoid Robots

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 19:02 UTC · model grok-4.3

classification 💻 cs.RO · cs.AI · cs.LG
keywords humanoid robot · foundation model · vision-language-action · imitation learning · bimanual manipulation · diffusion transformer · robot learning · generalist autonomy

The pith

GR00T N1 is an open foundation model for humanoid robots that outperforms prior imitation learning methods on simulation benchmarks and on real bimanual manipulation tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents GR00T N1 as a generalist robot foundation model built to handle novel situations and new tasks with limited additional data. It uses a dual-system design: one part processes vision and language instructions to understand the scene, while the other part generates smooth actions in real time. The model is trained end-to-end on a combination of real robot trajectories, human videos, and synthetic data. In tests it beats standard imitation learning baselines across simulation benchmarks for several robot types. On a real Fourier GR-1 humanoid it completes language-guided two-handed manipulation tasks effectively while requiring relatively little task-specific data.

Core claim

GR00T N1, a Vision-Language-Action model with a dual-system architecture consisting of a vision-language module and a diffusion transformer module trained jointly end-to-end, outperforms state-of-the-art imitation learning baselines on standard simulation benchmarks across multiple robot embodiments when trained on a heterogeneous mixture of real-robot trajectories, human videos, and synthetically generated datasets, and achieves strong performance with high data efficiency when deployed for language-conditioned bimanual manipulation on the Fourier GR-1 humanoid robot.

What carries the argument

Dual-system Vision-Language-Action architecture in which a vision-language module interprets the environment and instructions and feeds into a diffusion transformer that generates fluid motor actions, with the modules tightly coupled and trained end-to-end on mixed real, video, and synthetic data.
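
The dual-system coupling is easier to see in code. The sketch below shows, in PyTorch, one way such a pairing could be wired: a toy vision-language encoder (System 2) produces context tokens, and a toy diffusion-style action head (System 1) denoises an action chunk conditioned on them, with a single loss trained end to end. Module sizes, the noising schedule, and the batch keys are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch of a dual-system VLA policy (assumed structure, not the authors' code).
import torch
import torch.nn as nn

class VisionLanguageEncoder(nn.Module):
    """Stand-in for the vision-language backbone (System 2); the real model would be a pretrained VLM."""
    def __init__(self, d_model=256, vocab_size=1000):
        super().__init__()
        self.img_proj = nn.Linear(3 * 32 * 32, d_model)      # toy image embedding (one token per image)
        self.txt_embed = nn.Embedding(vocab_size, d_model)   # toy instruction-token embedding
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, image, text_ids):
        img_tok = self.img_proj(image.flatten(1)).unsqueeze(1)     # (B, 1, D)
        txt_tok = self.txt_embed(text_ids)                         # (B, T, D)
        return self.encoder(torch.cat([img_tok, txt_tok], dim=1))  # context tokens for System 1

class DiffusionActionHead(nn.Module):
    """Stand-in for the diffusion transformer (System 1): predicts the noise on an action chunk."""
    def __init__(self, action_dim=16, d_model=256):
        super().__init__()
        self.act_proj = nn.Linear(action_dim, d_model)
        self.time_proj = nn.Linear(1, d_model)
        layer = nn.TransformerDecoderLayer(d_model, nhead=4, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=2)
        self.out = nn.Linear(d_model, action_dim)

    def forward(self, noisy_actions, t, context):
        h = self.act_proj(noisy_actions) + self.time_proj(t.view(-1, 1, 1))
        return self.out(self.decoder(h, context))

def training_step(vlm, action_head, batch, optimizer):
    """One joint end-to-end step: a single denoising loss backpropagates through both modules."""
    context = vlm(batch["image"], batch["text_ids"])
    noise = torch.randn_like(batch["actions"])
    t = torch.rand(batch["actions"].shape[0])                 # noise level in [0, 1]
    noisy = batch["actions"] + t.view(-1, 1, 1) * noise       # toy forward-noising schedule
    loss = nn.functional.mse_loss(action_head(noisy, t, context), noise)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Hypothetical usage: one optimizer over both modules makes the training genuinely joint.
# opt = torch.optim.Adam(list(vlm.parameters()) + list(head.parameters()), lr=1e-4)
```

The point of the sketch is the gradient path: because both modules sit under one denoising loss, the vision-language features are shaped by what the action head needs, which is what "tightly coupled and trained end-to-end" amounts to here.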

If this is right

  • A single generalist model can be applied across multiple robot embodiments without separate retraining from scratch.
  • Language-conditioned bimanual tasks become feasible on physical humanoids with reduced task-specific data collection.
  • Training on mixed human videos and synthetic data can substitute for large volumes of robot-only trajectories.
  • Real-time action generation through diffusion allows fluid motion while the vision-language component handles reasoning.
  • Open release of the model enables broader testing and adaptation by other researchers working on humanoid platforms.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the data mixture generalizes as claimed, similar foundation models could be built for non-humanoid robots by swapping embodiment-specific action heads.
  • The high data efficiency observed in real deployments suggests that scaling the synthetic and video portions could further reduce the cost of real-robot data collection.
  • Success on bimanual manipulation opens the possibility of extending the same architecture to longer-horizon household tasks that require sequencing multiple skills.
  • The open nature of the model allows direct comparison of its internal representations against those learned by task-specific methods to identify where generalization occurs.

Load-bearing premise

The heterogeneous mixture of real-robot trajectories, human videos, and synthetic datasets supplies enough coverage and quality for the model to generalize to new situations and different robot hardware in real-world use.
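
A minimal sketch of how such a mixture could be sampled during training, assuming three illustrative sources and hand-picked weights (neither is taken from the paper):

```python
import random

def mixture_batches(sources, weights, batch_size=32):
    """Yield training batches drawn from several data sources in proportion to `weights`."""
    names = list(sources)
    while True:
        picks = random.choices(names, weights=weights, k=batch_size)
        yield [next(sources[name]) for name in picks]

# Hypothetical usage with iterators over each source:
# sources = {"real_robot": iter(real_ds), "human_video": iter(video_ds), "synthetic": iter(sim_ds)}
# for batch in mixture_batches(sources, weights=[0.4, 0.3, 0.3]):
#     ...train on batch...
```

Whether the premise holds is then a question about these sources and weights: if the mixture under-covers a target embodiment or task family, no sampling scheme recovers the missing signal.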

What would settle it

Deploy the trained model on a different humanoid robot or in a substantially new task environment and measure whether it continues to outperform imitation learning baselines or requires extensive new data collection to match prior performance levels.
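
A sketch of that transfer test, with the rollout function, policies, and environment passed in as hypothetical placeholders rather than the paper's interfaces:

```python
def success_rate(rollout, policy, env, n_trials=50):
    """Fraction of trials in which `rollout` reports task success (True/False)."""
    return sum(1 for _ in range(n_trials) if rollout(policy, env)) / n_trials

def transfer_report(rollout, pretrained, baseline, env, finetune_episodes):
    """Compare the pretrained generalist to a per-embodiment baseline on a new robot."""
    return {
        "pretrained_success": success_rate(rollout, pretrained, env),
        "baseline_success": success_rate(rollout, baseline, env),
        "finetune_episodes_used": finetune_episodes,  # data cost to reach this performance
    }
```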

read the original abstract

General-purpose robots need a versatile body and an intelligent mind. Recent advancements in humanoid robots have shown great promise as a hardware platform for building generalist autonomy in the human world. A robot foundation model, trained on massive and diverse data sources, is essential for enabling the robots to reason about novel situations, robustly handle real-world variability, and rapidly learn new tasks. To this end, we introduce GR00T N1, an open foundation model for humanoid robots. GR00T N1 is a Vision-Language-Action (VLA) model with a dual-system architecture. The vision-language module (System 2) interprets the environment through vision and language instructions. The subsequent diffusion transformer module (System 1) generates fluid motor actions in real time. Both modules are tightly coupled and jointly trained end-to-end. We train GR00T N1 with a heterogeneous mixture of real-robot trajectories, human videos, and synthetically generated datasets. We show that our generalist robot model GR00T N1 outperforms the state-of-the-art imitation learning baselines on standard simulation benchmarks across multiple robot embodiments. Furthermore, we deploy our model on the Fourier GR-1 humanoid robot for language-conditioned bimanual manipulation tasks, achieving strong performance with high data efficiency.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces GR00T N1, an open Vision-Language-Action (VLA) foundation model for generalist humanoid robots featuring a dual-system architecture: a vision-language module (System 2) that interprets environments and language instructions, coupled with a diffusion transformer module (System 1) that generates real-time motor actions. Both modules are jointly trained end-to-end on a heterogeneous mixture of real-robot trajectories, human videos, and synthetically generated datasets. The central claims are that GR00T N1 outperforms state-of-the-art imitation learning baselines on standard simulation benchmarks across multiple robot embodiments and achieves strong performance with high data efficiency when deployed on the Fourier GR-1 humanoid for language-conditioned bimanual manipulation tasks.

Significance. If the empirical results hold with detailed supporting evidence, this would be a notable contribution to open foundation models in robotics. The dual-system VLA design, scale of heterogeneous training data, cross-embodiment simulation results, and real-world humanoid deployment represent concrete progress toward generalist robot autonomy. The decision to release the model openly is a particular strength that supports reproducibility and follow-on work.

major comments (3)
  1. [§4] Experiments and Results: The claim of outperformance over SOTA imitation learning baselines on multi-embodiment simulation benchmarks is central but presented without the specific quantitative metrics, baseline details, number of runs, or statistical significance tests needed to verify the magnitude and reliability of the improvement. This directly affects assessment of the headline result.
  2. [§3.2] Training Data: The heterogeneous mixture of real-robot trajectories, human videos, and synthetic data is treated as sufficient for cross-embodiment generalization and real-world transfer by construction, yet no per-source ablations, coverage metrics, or distribution-shift experiments are reported. This assumption is load-bearing for both the simulation outperformance and the GR-1 deployment claims; without it the joint VLA + diffusion training may not reliably close the domain gaps.
  3. [§5] Real-world Deployment: The GR-1 bimanual manipulation results are summarized as 'strong performance' and 'high data efficiency,' but lack concrete success rates, trial counts, task variations, failure modes, or comparisons to alternative methods. These details are required to substantiate the practical impact of the model.
minor comments (2)
  1. [§2] The dual-system terminology (System 1 / System 2) is introduced without explicit mapping to the cognitive-science analogy or to the precise information flow between the VLM and diffusion transformer; a short clarifying paragraph would improve readability.
  2. Figure 1 (architecture diagram) would benefit from explicit arrows or annotations showing the end-to-end joint training objective and the real-time inference path.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback, which helps clarify the empirical support for our claims. We address each major comment below and commit to revisions that add the requested details without altering the core contributions.

read point-by-point responses
  1. Referee: §4 (Experiments and Results): The claim of outperformance over SOTA imitation learning baselines on multi-embodiment simulation benchmarks is central but presented without the specific quantitative metrics, baseline details, number of runs, or statistical significance tests needed to verify the magnitude and reliability of the improvement. This directly affects assessment of the headline result.

    Authors: We agree that the simulation results require more granular reporting to allow independent verification. In the revised manuscript we will expand §4 with exact success rates and performance deltas for GR00T N1 versus each baseline, full baseline implementation details (including training hyperparameters and data used), the number of evaluation runs per task (minimum 5 seeds), and statistical tests (e.g., 95% confidence intervals and paired t-tests). Updated tables and text will present these metrics clearly. revision: yes
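
For concreteness, the kind of reporting committed to here could be computed as in the following sketch; the per-seed success rates are placeholders, not results from the paper.

```python
# Paired comparison across seeds: mean gap, 95% confidence interval, and a paired t-test.
import numpy as np
from scipy import stats

groot = np.array([0.78, 0.81, 0.75, 0.80, 0.79])      # per-seed success rates (hypothetical)
baseline = np.array([0.65, 0.70, 0.62, 0.68, 0.66])

diff = groot - baseline
t_stat, p_value = stats.ttest_rel(groot, baseline)     # paired t-test across seeds
ci_low, ci_high = stats.t.interval(0.95, df=len(diff) - 1,
                                   loc=diff.mean(), scale=stats.sem(diff))
print(f"mean gap = {diff.mean():.3f}, 95% CI = ({ci_low:.3f}, {ci_high:.3f}), p = {p_value:.4f}")
```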

  2. Referee: §3.2 (Training Data): The heterogeneous mixture of real-robot trajectories, human videos, and synthetic data is treated as sufficient for cross-embodiment generalization and real-world transfer by construction, yet no per-source ablations, coverage metrics, or distribution-shift experiments are reported. This assumption is load-bearing for both the simulation outperformance and the GR-1 deployment claims; without it the joint VLA + diffusion training may not reliably close the domain gaps.

    Authors: We recognize that per-source ablations would strengthen the data-mixture claims. The revised §3.2 will include dataset coverage statistics (sizes, diversity metrics) and a discussion of observed distribution shifts across sources. Full per-source ablations were not completed due to compute limits in this initial study; we will add partial ablations on key subsets and note their implications for generalization. The cross-embodiment simulation results provide supporting evidence for the joint-training approach. revision: partial
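
A leave-one-source-out ablation of the kind being discussed could follow the sketch below; `train_fn` and `evaluate_fn` are assumed stand-ins for the full training and benchmark pipeline, not the paper's API.

```python
def ablate_sources(sources, train_fn, evaluate_fn):
    """Evaluate the full mixture, then re-train with each source held out in turn."""
    results = {"full_mixture": evaluate_fn(train_fn(sources))}
    for held_out in sources:
        subset = {name: data for name, data in sources.items() if name != held_out}
        results[f"without_{held_out}"] = evaluate_fn(train_fn(subset))
    return results

# Hypothetical usage:
# results = ablate_sources({"real_robot": ..., "human_video": ..., "synthetic": ...},
#                          train_fn=train_policy, evaluate_fn=run_benchmarks)
```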

  3. Referee: §5 (Real-world Deployment): The GR-1 bimanual manipulation results are summarized as 'strong performance' and 'high data efficiency,' but lack concrete success rates, trial counts, task variations, failure modes, or comparisons to alternative methods. These details are required to substantiate the practical impact of the model.

    Authors: We agree that concrete metrics are essential. The revised §5 will report specific success rates, total trial counts, task variations tested, observed failure modes, and direct comparisons to alternative methods (where available). These additions will quantify the claimed performance and data efficiency on the Fourier GR-1 platform. revision: yes

Circularity Check

0 steps flagged

No circularity: purely empirical training and evaluation results

full rationale

The paper presents GR00T N1 as a trained VLA model using a heterogeneous data mixture, with performance claims based on direct evaluation against imitation learning baselines in simulation and real-robot deployments. No derivation chain, mathematical predictions, fitted parameters renamed as outputs, or self-citation load-bearing steps are present in the abstract or described methodology. All results are experimental outcomes from end-to-end training and testing, with no reduction of claims to inputs by construction.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 0 invented entities

The claim depends on the training data being sufficiently diverse and on the architecture's ability to integrate the two systems without additional unstated assumptions.

free parameters (2)
  • Diffusion transformer parameters
    The weights and hyperparameters of the diffusion model are learned from data.
  • Vision-language model parameters
    Parameters of the vision-language module trained jointly.
axioms (2)
  • domain assumption Joint end-to-end training of the vision-language and action modules leads to better performance than separate training.
    Invoked in the model design description.
  • domain assumption The data mixture from real robots, humans, and synthetic sources is representative enough for generalization.
    Central to the training strategy.

pith-pipeline@v0.9.0 · 5698 in / 1476 out tokens · 159046 ms · 2026-05-10T19:02:15.801386+00:00 · methodology

discussion (0)


Forward citations

Cited by 60 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. SafeManip: A Property-Driven Benchmark for Temporal Safety Evaluation in Robotic Manipulation

    cs.RO 2026-05 unverdicted novelty 8.0

    SafeManip is a new benchmark that applies LTLf monitors to assess temporal safety properties across eight categories in robotic manipulation, demonstrating that task success frequently fails to ensure safe execution i...

  2. Open-H-Embodiment: A Large-Scale Dataset for Enabling Foundation Models in Medical Robotics

    cs.RO 2026-04 conditional novelty 8.0

    Open-H-Embodiment is the largest open multi-embodiment medical robotics dataset, used to train GR00T-H, the first open vision-language-action model that achieves end-to-end suturing completion where prior models fail.

  3. FlowHijack: A Dynamics-Aware Backdoor Attack on Flow-Matching Vision-Language-Action Models

    cs.CV 2026-03 unverdicted novelty 8.0

    FlowHijack is the first dynamics-aware backdoor attack on flow-matching VLAs that achieves high success rates with stealthy triggers while preserving benign performance and making malicious actions kinematically indis...

  4. RotVLA: Rotational Latent Action for Vision-Language-Action Model

    cs.RO 2026-05 unverdicted novelty 7.0

    RotVLA models latent actions as continuous SO(n) rotations with triplet-frame supervision and flow-matching to reach 98.2% success on LIBERO and 89.6%/88.5% on RoboTwin2.0 using a 1.7B-parameter model.

  5. From Imagined Futures to Executable Actions: Mixture of Latent Actions for Robot Manipulation

    cs.RO 2026-05 unverdicted novelty 7.0

    MoLA infers a mixture of latent actions from generated future videos via modality-aware inverse dynamics models to improve robot manipulation policies.

  6. Premover: Fast Vision-Language-Action Control by Acting Before Instructions Are Complete

    cs.RO 2026-05 unverdicted novelty 7.0

    Premover enables VLA policies to act on partial instructions by precomputing focus maps from intermediate backbone layers, reducing wall-clock time 13.6 percent on LIBERO while preserving 95 percent success rate.

  7. DreamAvoid: Critical-Phase Test-Time Dreaming to Avoid Failures in VLA Policies

    cs.RO 2026-05 unverdicted novelty 7.0

    DreamAvoid uses a Dream Trigger, Action Proposer, and Dream Evaluator trained on success/failure/boundary data to let VLA policies avoid critical-phase failures via test-time future dreaming.

  8. Dynamic Execution Commitment of Vision-Language-Action Models

    cs.CV 2026-05 unverdicted novelty 7.0

    A3 determines the execution horizon in VLA models as the longest prefix of actions that passes consensus-based verification and sequential consistency checks.

  9. RIO: Flexible Real-Time Robot I/O for Cross-Embodiment Robot Learning

    cs.RO 2026-05 unverdicted novelty 7.0

    RIO introduces a lightweight open-source framework that abstracts real-time robot I/O to support easy switching between embodiments and platforms for collecting data and deploying VLAs.

  10. Offline Policy Evaluation for Manipulation Policies via Discounted Liveness Formulation

    cs.RO 2026-05 conditional novelty 7.0

    A liveness-based Bellman operator enables conservative offline policy evaluation for manipulation tasks by encoding task progression and reducing truncation bias from finite horizons.

  11. Overcoming Dynamics-Blindness: Training-Free Pace-and-Path Correction for VLA Models

    cs.RO 2026-05 unverdicted novelty 7.0

    Pace-and-Path Correction decomposes a quadratic cost minimization into orthogonal pace and path channels to correct chunked actions in VLA models, raising success rates by up to 28.8% in dynamic settings.

  12. CapVector: Learning Transferable Capability Vectors in Parametric Space for Vision-Language-Action Models

    cs.CV 2026-05 unverdicted novelty 7.0

    Capability vectors extracted from parameter differences between standard and auxiliary-finetuned VLA models can be merged into pretrained weights to match auxiliary-training performance while reducing computational ov...

  13. VEGA: Visual Encoder Grounding Alignment for Spatially-Aware Vision-Language-Action Models

    cs.RO 2026-05 unverdicted novelty 7.0

    VEGA improves spatial reasoning in VLA models for robotics by aligning visual encoder features with 3D-supervised DINOv2 representations via a temporary projector and cosine similarity loss.

  14. LoopVLA: Learning Sufficiency in Recurrent Refinement for Vision-Language-Action Models

    cs.AI 2026-05 unverdicted novelty 7.0

    LoopVLA adds recurrent refinement and learned sufficiency estimation to VLA models, cutting parameters 45% and raising throughput 1.7x while matching baseline task success on LIBERO and VLA-Arena.

  15. Octopus Protocol: One-Shot Hardware Discovery and Control for AI Agents via Infrastructure-as-Prompts

    cs.RO 2026-05 unverdicted novelty 7.0

    Octopus Protocol enables one-shot hardware onboarding for AI agents by running a five-stage LLM-driven pipeline that probes devices, infers capabilities, generates an MCP server, and deploys it for closed-loop control.

  16. Preserving Foundational Capabilities in Flow-Matching VLAs through Conservative SFT

    cs.RO 2026-05 unverdicted novelty 7.0

    ConSFT prevents catastrophic forgetting in fine-tuning flow-matching VLAs by dynamically scaling gradients based on model confidence, retaining over 20% more pre-trained capability than standard SFT without prior data...

  17. One Token Per Frame: Reconsidering Visual Bandwidth in World Models for VLA Policy

    cs.CV 2026-05 conditional novelty 7.0

    Reducing visual input to one token per frame in VLA world models maintains or improves long-horizon performance on MetaWorld, LIBERO, and real-robot tasks.

  18. NoiseGate: Learning Per-Latent Timestep Schedules as Information Gating in World Action Models

    cs.RO 2026-05 unverdicted novelty 7.0

    NoiseGate learns per-latent timestep schedules as an information-gating policy in diffusion-based world action models, yielding consistent gains on RoboTwin manipulation tasks.

  19. Latent Bridge: Feature Delta Prediction for Efficient Dual-System Vision-Language-Action Model Inference

    cs.RO 2026-05 unverdicted novelty 7.0

    Latent Bridge predicts VLM feature deltas to reduce VLM calls by 50-75% in dual-system VLA models while retaining 95-100% performance and achieving 1.65-1.73x speedup across LIBERO, RoboCasa, and ALOHA benchmarks.

  20. Phone2Act: A Low-Cost, Hardware-Agnostic Teleoperation System for Scalable VLA Data Collection

    cs.RO 2026-05 unverdicted novelty 7.0

    Phone2Act is a smartphone-based teleoperation system that collects synchronized multi-camera robot manipulation data in LeRobot format without custom hardware, validated by fine-tuning GR00T-N1.5 to 90% success on a r...

  21. Being-H0.7: A Latent World-Action Model from Egocentric Videos

    cs.RO 2026-04 unverdicted novelty 7.0

    Being-H0.7 adds future-aware latent reasoning to direct VLA policies via dual-branch alignment on latent queries, matching world-model benefits at VLA efficiency.

  22. DiscreteRTC: Discrete Diffusion Policies are Natural Asynchronous Executors

    cs.RO 2026-04 unverdicted novelty 7.0

    Discrete diffusion policies support native asynchronous execution via unmasking for real-time chunking, delivering higher success rates and 0.7x inference cost versus flow-matching RTC on dynamic robotics benchmarks a...

  23. Characterizing Vision-Language-Action Models across XPUs: Constraints and Acceleration for On-Robot Deployment

    cs.RO 2026-04 unverdicted novelty 7.0

    VLA models exhibit a compute-bound VLM phase followed by a memory-bound action phase on edge hardware; DP-Cache and V-AEFusion reduce redundancy and enable pipeline parallelism for up to 6x speedup on NPUs with margin...

  24. Modular Sensory Stream for Integrating Physical Feedback in Vision-Language-Action Models

    cs.RO 2026-04 unverdicted novelty 7.0

    MoSS augments VLAs with decoupled modality streams for multiple physical signals, achieving synergistic gains in real-world robot tasks via joint attention and auxiliary future-signal prediction.

  25. CodeGraphVLP: Code-as-Planner Meets Semantic-Graph State for Non-Markovian Vision-Language-Action Models

    cs.RO 2026-04 unverdicted novelty 7.0

    CodeGraphVLP uses a semantic-graph state and executable code planner to enable reliable long-horizon non-Markovian robot manipulation, improving task success and lowering latency over standard VLA baselines.

  26. EmbodiedMidtrain: Bridging the Gap between Vision-Language Models and Vision-Language-Action Models via Mid-training

    cs.CV 2026-04 unverdicted novelty 7.0

    EmbodiedMidtrain mid-trains VLMs on curated VLA-aligned data subsets to improve downstream performance on robot manipulation benchmarks.

  27. RoboWM-Bench: A Benchmark for Evaluating World Models in Robotic Manipulation

    cs.RO 2026-04 unverdicted novelty 7.0

    RoboWM-Bench evaluates video world models by converting their outputs into executable robot actions and running them on manipulation tasks, showing that physical inconsistencies remain common.

  28. RoboWM-Bench: A Benchmark for Evaluating World Models in Robotic Manipulation

    cs.RO 2026-04 unverdicted novelty 7.0

    RoboWM-Bench evaluates video world models by converting their manipulation video predictions into executable actions validated in simulation, showing that visual plausibility does not guarantee physical executability.

  29. Using large language models for embodied planning introduces systematic safety risks

    cs.AI 2026-04 unverdicted novelty 7.0

    LLM planners for robots often produce dangerous plans even when planning succeeds, with safety awareness staying flat as model scale improves planning ability.

  30. ${\pi}_{0.7}$: a Steerable Generalist Robotic Foundation Model with Emergent Capabilities

    cs.LG 2026-04 unverdicted novelty 7.0

    π₀.₇ is a steerable generalist robotic model that uses rich multimodal prompts including language, subgoal images, and performance metadata to achieve out-of-the-box generalization across tasks and robot bodies.

  31. [Emerging Ideas] Artificial Tripartite Intelligence: A Bio-Inspired, Sensor-First Architecture for Physical AI

    cs.AI 2026-04 unverdicted novelty 7.0

    ATI is a tripartite bio-inspired architecture for physical AI that co-designs sensing and inference, shown in a camera prototype to raise accuracy from 53.8% to 88% and cut remote invocations by 43.3%.

  32. Flow Motion Policy: Manipulator Motion Planning with Flow Matching Models

    cs.RO 2026-04 unverdicted novelty 7.0

    Flow Motion Policy uses flow matching to model distributions over feasible manipulator paths, enabling best-of-N sampling with post-generation collision filtering to improve success and efficiency over prior neural an...

  33. DFM-VLA: Iterative Action Refinement for Robot Manipulation via Discrete Flow Matching

    cs.RO 2026-03 conditional novelty 7.0

    DFM-VLA uses discrete flow matching to iteratively refine action tokens in VLA models, outperforming autoregressive and diffusion baselines with 4.44 average success length on CALVIN and 95.7% success on LIBERO.

  34. VP-VLA: Visual Prompting as an Interface for Vision-Language-Action Models

    cs.RO 2026-03 unverdicted novelty 7.0

    VP-VLA decouples high-level reasoning from low-level control in VLA models by rendering spatial anchors as visual prompts directly in the RGB observation space, outperforming end-to-end baselines.

  35. Towards Generalizable Robotic Manipulation in Dynamic Environments

    cs.CV 2026-03 unverdicted novelty 7.0

    DOMINO dataset and PUMA architecture enable better dynamic robotic manipulation by incorporating motion history, delivering 6.3% higher success rates than prior VLA models.

  36. AR-VLA: True Autoregressive Action Expert for Vision-Language-Action Models

    cs.RO 2026-03 unverdicted novelty 7.0

    AR-VLA introduces a standalone autoregressive action expert with long-lived memory that generates context-aware continuous actions for VLAs, replacing chunk-based heads with smoother trajectories and maintained task success.

  37. ReCogDrive: A Reinforced Cognitive Framework for End-to-End Autonomous Driving

    cs.CV 2025-06 unverdicted novelty 7.0

    ReCogDrive unifies VLM scene understanding with a diffusion planner reinforced by DiffGRPO to reach state-of-the-art results on NAVSIM and Bench2Drive benchmarks.

  38. Hand-in-the-Loop: Improving Dexterous VLA via Seamless Interventional Correction

    cs.RO 2026-05 unverdicted novelty 6.0

    HandITL blends human intent with policy execution to eliminate gesture jumps in dexterous VLA interventions, cutting jitter by 99.8%, grasp failures by 87.5%, and yielding 19% better refined policies.

  39. WarmPrior: Straightening Flow-Matching Policies with Temporal Priors

    cs.LG 2026-05 unverdicted novelty 6.0

    Replacing Gaussian noise with a temporally grounded prior from recent actions straightens flow-matching paths and improves success rates in robotic manipulation and prior-space RL.

  40. FrameSkip: Learning from Fewer but More Informative Frames in VLA Training

    cs.RO 2026-05 unverdicted novelty 6.0

    FrameSkip improves VLA policy training success from 66.50% to 76.15% by selecting high-importance frames and retaining only 20% of unique frames across three benchmarks.

  41. Guide, Think, Act: Interactive Embodied Reasoning in Vision-Language-Action Models

    cs.RO 2026-05 conditional novelty 6.0

    GTA-VLA conditions VLA models on user spatial priors to produce a unified spatial-visual chain-of-thought, reaching 81.2% success on SimplerEnv WidowX and improving performance under out-of-distribution shifts.

  42. D-VLA: A High-Concurrency Distributed Asynchronous Reinforcement Learning Framework for Vision-Language-Action Models

    cs.AI 2026-05 unverdicted novelty 6.0

    D-VLA uses plane decoupling and a swimlane pipeline to deliver higher throughput and linear speedup than prior RL frameworks when training billion- and trillion-parameter VLA models on benchmarks like LIBERO.

  43. D-VLA: A High-Concurrency Distributed Asynchronous Reinforcement Learning Framework for Vision-Language-Action Models

    cs.AI 2026-05 unverdicted novelty 6.0

    D-VLA introduces plane decoupling and a swimlane asynchronous pipeline to achieve high-concurrency RL training and linear scalability for billion- to trillion-parameter vision-language-action models.

  44. Towards Long-horizon Embodied Agents with Tool-Aligned Vision-Language-Action Models

    cs.RO 2026-05 unverdicted novelty 6.0

    VLAs-as-Tools pairs a VLM planner with specialized VLA executors via a new interface and Tool-Aligned Post-Training to raise long-horizon robot success rates on LIBERO-Long and RoboTwin benchmarks.

  45. GuidedVLA: Specifying Task-Relevant Factors via Plug-and-Play Action Attention Specialization

    cs.RO 2026-05 unverdicted novelty 6.0

    GuidedVLA improves VLA success rates by manually supervising separate attention heads in the action decoder with auxiliary signals for task-relevant factors.

  46. From Reaction to Anticipation: Proactive Failure Recovery through Agentic Task Graph for Robotic Manipulation

    cs.RO 2026-05 unverdicted novelty 6.0

    AgentChord models manipulation tasks as directed graphs enriched with anticipatory recovery branches, using specialized agents to enable immediate, low-latency failure responses and improve success on long-horizon bim...

  47. Overcoming Dynamics-Blindness: Training-Free Pace-and-Path Correction for VLA Models

    cs.RO 2026-05 unverdicted novelty 6.0

    Pace-and-Path Correction is a closed-form inference-time operator that decomposes a quadratic cost minimization into orthogonal pace compression and path offset channels to correct dynamics-blindness in chunked-action...

  48. PriorVLA: Prior-Preserving Adaptation for Vision-Language-Action Models

    cs.RO 2026-05 unverdicted novelty 6.0

    PriorVLA preserves pretrained priors in VLA models through a frozen Prior Expert and trained Adaptation Expert, delivering better robot manipulation performance than full fine-tuning with only 25% of the parameter updates.

  49. VISOR: A Vision-Language Model-based Test Oracle for Testing Robot

    cs.SE 2026-05 unverdicted novelty 6.0

    VISOR applies VLMs to automate robot test oracles for correctness and quality assessment while reporting uncertainty, with evaluation on GPT and Gemini showing trade-offs in precision and recall but poor uncertainty c...

  50. Adaptive Action Chunking via Multi-Chunk Q Value Estimation

    cs.LG 2026-05 unverdicted novelty 6.0

    ACH lets RL policies dynamically pick action chunk lengths by jointly estimating Q-values for all candidate lengths via a single Transformer pass.

  51. Lakestream: A Consistent and Brokerless Data Plane for Large Foundation Model Training

    cs.DC 2026-05 unverdicted novelty 6.0

    Lakestream provides a consistent brokerless object-store-native data plane for large foundation model training using transactional global batches and decentralized adaptive commit.

  52. StereoPolicy: Improving Robotic Manipulation Policies via Stereo Perception

    cs.RO 2026-05 unverdicted novelty 6.0

    StereoPolicy fuses stereo image pairs via a Stereo Transformer on pretrained 2D encoders to boost robotic manipulation policies, showing gains over monocular, RGB-D, point cloud, and multi-view methods in simulations ...

  53. One Token Per Frame: Reconsidering Visual Bandwidth in World Models for VLA Policy

    cs.CV 2026-05 unverdicted novelty 6.0

    Reducing visual input to one token per frame via adaptive attention pooling and a unified flow-matching objective improves long-horizon performance in VLA policies on MetaWorld, LIBERO, and real-robot tasks.

  54. One Token Per Frame: Reconsidering Visual Bandwidth in World Models for VLA Policy

    cs.CV 2026-05 unverdicted novelty 6.0

    Reducing visual input to one token per frame in world models for vision-language-action policies maintains long-horizon performance while improving success rates on MetaWorld, LIBERO, and real-robot tasks.

  55. EggHand: A Multimodal Foundation Model for Egocentric Hand Pose Forecasting

    cs.CV 2026-05 unverdicted novelty 6.0

    EggHand unifies VLA action decoding with viewpoint-aware video-text encoding to forecast egocentric hand poses, achieving SOTA accuracy on EgoExo4D while remaining robust to ego-motion and controllable via language prompts.

  56. Escaping the Diversity Trap in Robotic Manipulation via Anchor-Centric Adaptation

    cs.RO 2026-05 unverdicted novelty 6.0

    Anchor-Centric Adaptation escapes the diversity trap by prioritizing repeated demonstrations at core anchors over broad coverage, yielding higher success rates under fixed data budgets in robotic manipulation.

  57. AT-VLA: Adaptive Tactile Injection for Enhanced Feedback Reaction in Vision-Language-Action Models

    cs.RO 2026-05 unverdicted novelty 6.0

    AT-VLA introduces adaptive tactile injection and a dual-stream tactile reaction mechanism to integrate real-time tactile feedback into pretrained VLA models for contact-rich robotic manipulation.

  58. HumanNet: Scaling Human-centric Video Learning to One Million Hours

    cs.CV 2026-05 unverdicted novelty 6.0

    HumanNet is a 1M-hour human-centric video dataset with interaction annotations that enables better vision-language-action model performance than equivalent robot data in a controlled test.

  59. TriRelVLA: Triadic Relational Structure for Generalizable Embodied Manipulation

    cs.CV 2026-05 unverdicted novelty 6.0

    TriRelVLA introduces triadic object-hand-task relational representations and a task-grounded graph transformer with a relational bottleneck to improve generalization in robotic manipulation across scenes, objects, and tasks.

  60. Adaptive Q-Chunking for Offline-to-Online Reinforcement Learning

    cs.LG 2026-05 unverdicted novelty 6.0

    Adaptive Q-Chunking selects optimal action chunk sizes at each state via normalized advantage comparisons to outperform fixed chunk sizes in offline-to-online RL on robot benchmarks.

Reference graph

Works this paper leans on

103 extracted references · 103 canonical work pages · cited by 129 Pith papers · 17 internal anchors

  1. [1]

    Cosmos World Foundation Model Platform for Physical AI

    Niket Agarwal, Arslan Ali, Maciej Bala, Yogesh Balaji, Erik Barker, Tiffany Cai, Prithvijit Chattopadhyay, Yongxin Chen, Yin Cui, Yifan Ding, et al. Cosmos world foundation model platform for physical ai.arXiv preprint arXiv:2501.03575, 2025. 6, 18

  2. [2]

    AgiBot World Colosseo: A Large-scale Manipulation Platform for Scalable and Intelligent Embodied Systems

    AgiBot-World-Contributors et al. AgiBot World Colosseo: A Large-scale Manipulation Platform for Scalable and Intelligent Embodied Systems.arXiv preprint arXiv:2503.06669, 2025. 9, 17

  3. [3]

    Flamingo: a visual language model for few-shot learning

    Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, Roman Ring, Eliza Rutherford, Serkan Cabi, Tengda Han, Zhitao Gong, Sina Samangooei, Marianne Monteiro, Jacob L Menick, Sebastian Borgeaud, Andy Brock, Aida Nematzadeh, Sahand Sharifzadeh, Mikołaj Bińko...

  4. [4]

    Aloha 2: An enhanced low-cost hardware for bimanual teleoperation.arXiv preprint arXiv:2405.02292, 2024

    Jorge Aldaco, Travis Armstrong, Robert Baruch, Jeff Bingham, Sanky Chan, Kenneth Draper, Debidatta Dwibedi, Chelsea Finn, Pete Florence, Spencer Goodrich, et al. Aloha 2: An enhanced low-cost hardware for bimanual teleoperation.arXiv preprint arXiv:2405.02292, 2024. 17

  5. [5]

    SmolLM2: When Smol Goes Big -- Data-Centric Training of a Small Language Model

    Loubna Ben Allal, Anton Lozhkov, Elie Bakouch, Gabriel Martín Blázquez, Guilherme Penedo, Lewis Tunstall, Andrés Marafioti, Hynek Kydlíček, Agustín Piqueres Lajarín, Vaibhav Srivastav, et al. Smollm2: When smol goes big–data-centric training of a small language model.arXiv preprint arXiv:2502.02737,

  6. [6]

    Video pretraining (vpt): Learning to act by watching unlabeled online videos

    Bowen Baker, Ilge Akkaya, Peter Zhokov, Joost Huizinga, Jie Tang, Adrien Ecoffet, Brandon Houghton, Raul Sampedro, and Jeff Clune. Video pretraining (vpt): Learning to act by watching unlabeled online videos. Advances in Neural Information Processing Systems, 35:24639–24654, 2022. 8, 17

  7. [7]

    Gen2Act: Human Video Generation in Novel Scenarios enables Generalizable Robot Manipulation

    Homanga Bharadhwaj, Debidatta Dwibedi, Abhinav Gupta, Shubham Tulsiani, Carl Doersch, Ted Xiao, Dhruv Shah, Fei Xia, Dorsa Sadigh, and Sean Kirmani. Gen2act: Human video generation in novel scenarios enables generalizable robot manipulation.arXiv preprint arXiv:2409.16283, 2024. 18

  8. [8]

    Track2act: Predicting point tracks from internet videos enables diverse zero-shot robot manipulation.arXiv e-prints, pages arXiv–2405, 2024

    Homanga Bharadhwaj, Roozbeh Mottaghi, Abhinav Gupta, and Shubham Tulsiani. Track2act: Predicting point tracks from internet videos enables diverse zero-shot robot manipulation.arXiv e-prints, pages arXiv–2405, 2024. 18

  9. [9]

    Roboagent: Generalization and efficiency in robot manipulation via semantic augmentations and action chunking

    Homanga Bharadhwaj, Jay Vakil, Mohit Sharma, Abhinav Gupta, Shubham Tulsiani, and Vikash Kumar. Roboagent: Generalization and efficiency in robot manipulation via semantic augmentations and action chunking. In2024 IEEE International Conference on Robotics and Automation (ICRA), 2024. 9

  10. [10]

    $\pi_0$: A Vision-Language-Action Flow Model for General Robot Control

    Kevin Black, Noah Brown, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, et al. π0: A vision-language-action flow model for general robot control. arXiv preprint arXiv:2410.24164, 2024. 3, 5, 17

  11. [11]

    On the Opportunities and Risks of Foundation Models

    Rishi Bommasani, Drew A Hudson, Ehsan Adeli, Russ Altman, Simran Arora, Sydney von Arx, Michael S Bernstein, Jeannette Bohg, Antoine Bosselut, Emma Brunskill, et al. On the opportunities and risks of foundation models.arXiv preprint arXiv:2108.07258, 2021. 17

  12. [13]

    RT-1: Robotics Transformer for Real-World Control at Scale

    Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Joseph Dabis, Chelsea Finn, Keerthana Gopalakrishnan, Karol Hausman, Alex Herzog, Jasmine Hsu, et al. Rt-1: Robotics transformer for real- world control at scale.arXiv preprint arXiv:2212.06817, 2022. 17

  13. [14]

    RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control

    Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Xi Chen, Krzysztof Choromanski, Tianli Ding, Danny Driess, Avinava Dubey, Chelsea Finn, Pete Florence, Chuyuan Fu, Montse Gonzalez Arenas, Keerthana Gopalakrishnan, Kehang Han, Karol Hausman, Alex Herzog, Jasmine Hsu, Brian Ichter, Alex Irpan, Nikhil Joshi, Ryan Julian, Dmitry Kalashnikov, Yuh...

  14. [15]

    Do as i can, not as i say: Grounding language in robotic affordances

    Anthony Brohan, Yevgen Chebotar, Chelsea Finn, Karol Hausman, Alexander Herzog, Daniel Ho, Julian Ibarz, Alex Irpan, Eric Jang, Ryan Julian, et al. Do as i can, not as i say: Grounding language in robotic affordances. InConference on Robot Learning, pages 287–318. PMLR, 2023. 17

  15. [16]

    Do as i can, not as i say: Grounding language in robotic affordances

    Anthony Brohan, Yevgen Chebotar, Chelsea Finn, Karol Hausman, Alexander Herzog, Daniel Ho, Julian Ibarz, Alex Irpan, Eric Jang, Ryan Julian, et al. Do as i can, not as i say: Grounding language in robotic affordances. InConference on robot learning, pages 287–318. PMLR, 2023. 17

  16. [17]

    Video generation models as world simulators

    Tim Brooks, Bill Peebles, Connor Holmes, Will DePue, Yufei Guo, Li Jing, David Schnurr, Joe Taylor, Troy Luhman, Eric Luhman, et al. Video generation models as world simulators. 2024.URL https://openai. com/research/video-generation-models-as-world-simulators, 3:1, 2024. 6

  17. [18]

    Lerobot: State-of-the-art machine learning for real-world robotics in pytorch.https://github.com/ huggingface/lerobot, 2024

    Remi Cadene, Simon Alibert, Alexander Soare, Quentin Gallouedec, Adil Zouitine, and Thomas Wolf. Lerobot: State-of-the-art machine learning for real-world robotics in pytorch.https://github.com/ huggingface/lerobot, 2024. 21

  18. [19]

    GR-2: A Generative Video-Language-Action Model with Web-Scale Knowledge for Robot Manipulation

    Chi-Lam Cheang, Guangzeng Chen, Ya Jing, Tao Kong, Hang Li, Yifeng Li, Yuxiao Liu, Hongtao Wu, Jiafeng Xu, Yichu Yang, Hanbo Zhang, and Minzhao Zhu. Gr-2: A generative video-language-action model with web-scale knowledge for robot manipulation.arXiv preprint arXiv:2410.06158, 2024. 17

  19. [20]

    Genaug: Retargeting behaviors to unseen situations via generative augmentation

    Zoey Chen, Sho Kiami, Abhishek Gupta, and Vikash Kumar. Genaug: Retargeting behaviors to unseen situations via generative augmentation.arXiv preprint arXiv:2302.06671, 2023. 18

  20. [21]

    Diffusion policy: Visuomotor policy learning via action diffusion.The International Journal of Robotics Research, 2024

    Cheng Chi, Zhenjia Xu, Siyuan Feng, Eric Cousineau, Yilun Du, Benjamin Burchfiel, Russ Tedrake, and Shuran Song. Diffusion policy: Visuomotor policy learning via action diffusion.The International Journal of Robotics Research, 2024. 14

  21. [22]

    Universal manipulation interface: In-the-wild robot teaching without in-the-wild robots

    Cheng Chi, Zhenjia Xu, Chuer Pan, Eric Cousineau, Benjamin Burchfiel, Siyuan Feng, Russ Tedrake, and Shuran Song. Universal manipulation interface: In-the-wild robot teaching without in-the-wild robots. In Proceedings of Robotics: Science and Systems (RSS), 2024. 18

  22. [23]

    Imitating task and motion planning with visuomotor transformers

    Murtaza Dalal, Ajay Mandlekar, Caelan Reed Garrett, Ankur Handa, Ruslan Salakhutdinov, and Dieter Fox. Imitating task and motion planning with visuomotor transformers. In Conference on Robot Learning, pages 2565–2593. PMLR, 2023. 18

  23. [24]

    Scaling egocentric vision: The epic-kitchens dataset

    Dima Damen, Hazel Doughty, Giovanni Maria Farinella, Sanja Fidler, Antonino Furnari, Evangelos Kazakos, Davide Moltisanti, Jonathan Munro, Toby Perrett, Will Price, et al. Scaling egocentric vision: The epic-kitchens dataset. InProceedings of the European conference on computer vision (ECCV), pages 720–736, 2018. 11, 18

  24. [25]

    Telemoma: A modular and versatile teleoperation system for mobile manipulation

    Shivin Dass, Wensi Ai, Yuqian Jiang, Samik Singh, Jiaheng Hu, Ruohan Zhang, Peter Stone, Ben Abbatematteo, and Roberto Martín-Martín. Telemoma: A modular and versatile teleoperation system for mobile manipulation. In2nd Workshop on Mobile Manipulation and Embodied Intelligence at ICRA 2024, 2024. 17

  25. [26]

    PaLM-E: An Embodied Multimodal Language Model

    Danny Driess, Fei Xia, Mehdi SM Sajjadi, Corey Lynch, Aakanksha Chowdhery, Brian Ichter, Ayzaan Wahid, Jonathan Tompson, Quan Vuong, Tianhe Yu, et al. Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378, 2023. 17

  26. [27]

    Bridge Data: Boosting Generalization of Robotic Skills with Cross-Domain Datasets

    Frederik Ebert, Yanlai Yang, Karl Schmeckpeper, Bernadette Bucher, Georgios Georgakis, Kostas Daniilidis, Chelsea Finn, and Sergey Levine. Bridge Data: Boosting Generalization of Robotic Skills with Cross-Domain Datasets. In Proceedings of Robotics: Science and Systems, New York City, NY, USA, June

  27. [28]

    doi: 10.15607/RSS.2022.XVIII.063. 17

  28. [29]

    Rh20t: A robotic dataset for learning diverse skills in one-shot

    Hao-Shu Fang, Hongjie Fang, Zhenyu Tang, Jirong Liu, Junbo Wang, Haoyi Zhu, and Cewu Lu. Rh20t: A robotic dataset for learning diverse skills in one-shot. InRSS 2023 Workshop on Learning for Task and Motion Planning, 2023. 11

  29. [30]

    Airexo: Low-cost exoskeletons for learning whole-arm manipulation in the wild

    Hongjie Fang, Hao-Shu Fang, Yiming Wang, Jieji Ren, Jingjing Chen, Ruo Zhang, Weiming Wang, and Cewu Lu. Airexo: Low-cost exoskeletons for learning whole-arm manipulation in the wild. In2024 IEEE International Conference on Robotics and Automation (ICRA), pages 15031–15038. IEEE, 2024. 18

  30. [31]

    Mobile aloha: Learning bimanual mobile manipulation with low-cost whole-body teleoperation

    Zipeng Fu, Tony Z Zhao, and Chelsea Finn. Mobile aloha: Learning bimanual mobile manipulation with low-cost whole-body teleoperation. In2024 IEEE International Conference on Robotics and Automation (ICRA), 2024. 17

  31. [32]

    Skillmimicgen: Automated demonstration generation for efficient skill learning and deployment. arXiv preprint arXiv:2410.18907, 2024

    Caelan Garrett, Ajay Mandlekar, Bowen Wen, and Dieter Fox. Skillmimicgen: Automated demonstration generation for efficient skill learning and deployment.arXiv preprint arXiv:2410.18907, 2024. 18

  32. [33]

    The "Something Something" Video Database for Learning and Evaluating Visual Common Sense

    Raghav Goyal, Samira Ebrahimi Kahou, Vincent Michalski, Joanna Materzynska, Susanne Westphal, Heuna Kim, Valentin Haenel, Ingo Fruend, Peter Yianilos, Moritz Mueller-Freitag, et al. The "something something" video database for learning and evaluating visual common sense. In Proceedings of the IEEE international conference on computer vision, pages 5842–585...

  33. [34]

    Ego4d: Around the world in 3,000 hours of egocentric video

    Kristen Grauman, Andrew Westbury, Eugene Byrne, Zachary Chavis, Antonino Furnari, Rohit Girdhar, Jackson Hamburger, Hao Jiang, Miao Liu, Xingyu Liu, et al. Ego4d: Around the world in 3,000 hours of egocentric video. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 18995–19012, 2022. 11, 18

  34. [35]

    Ego-exo4d: Understanding skilled human activity from first-and third-person perspectives

    Kristen Grauman, Andrew Westbury, Lorenzo Torresani, Kris Kitani, Jitendra Malik, Triantafyllos Afouras, Kumar Ashutosh, Vijay Baiyya, Siddhant Bansal, Bikram Boote, et al. Ego-exo4d: Understanding skilled human activity from first- and third-person perspectives. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19383–194...

  35. [36]

    Maniskill2: A unified benchmark for generalizable manipulation skills

    Jiayuan Gu, Fanbo Xiang, Xuanlin Li, Zhan Ling, Xiqiang Liu, Tongzhou Mu, Yihe Tang, Stone Tao, Xinyue Wei, Yunchao Yao, et al. Maniskill2: A unified benchmark for generalizable manipulation skills. In The Eleventh International Conference on Learning Representations, 2023. 18

  36. [37]

    Scaling up and distilling down: Language-guided robot skill acquisition

    Huy Ha, Pete Florence, and Shuran Song. Scaling up and distilling down: Language-guided robot skill acquisition. In Conference on Robot Learning, pages 3766–3777. PMLR, 2023. 18

  37. [38]

    LoRA: Low-rank adaptation of large language models

    Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models. InInternational Conference on Learning Representations, 2022. 24

  38. [39]

    Adaflow: Imitation learning with variance-adaptive flow-based policies, 2024

    Xixi Hu, Bo Liu, Xingchao Liu, and Qiang Liu. Adaflow: Imitation learning with variance-adaptive flow-based policies.arXiv preprint arXiv:2402.04292, 2024. 17

  39. [40]

    An embodied generalist agent in 3d world

    Jiangyong Huang, Silong Yong, Xiaojian Ma, Xiongkun Linghu, Puhao Li, Yan Wang, Qing Li, Song-Chun Zhu, Baoxiong Jia, and Siyuan Huang. An embodied generalist agent in 3d world. InProceedings of the International Conference on Machine Learning (ICML), 2024. 17

  40. [41]

    Inner monologue: Embodied reasoning through planning with language models

    Wenlong Huang, Fei Xia, Ted Xiao, Harris Chan, Jacky Liang, Pete Florence, Andy Zeng, Jonathan Tompson, Igor Mordatch, Yevgen Chebotar, et al. Inner monologue: Embodied reasoning through planning with language models. In6th Annual Conference on Robot Learning. 17

  41. [42]

    Grounded decoding: Guiding text generation with grounded models for embodied agents.Advances in Neural Information Processing Systems, 36:59636–59661, 2023

    Wenlong Huang, Fei Xia, Dhruv Shah, Danny Driess, Andy Zeng, Yao Lu, Pete Florence, Igor Mordatch, Sergey Levine, Karol Hausman, et al. Grounded decoding: Guiding text generation with grounded models for embodied agents.Advances in Neural Information Processing Systems, 36:59636–59661, 2023. 17

  42. [43]

    Open teach: A versatile teleoperation system for robotic manipulation

    Aadhithya Iyer, Zhuoran Peng, Yinlong Dai, Irmak Guzey, Siddhant Haldar, Soumith Chintala, and Lerrel Pinto. Open teach: A versatile teleoperation system for robotic manipulation. InCoRL 2024 Workshop on Mastering Robot Manipulation in a World of Abundant Data. 17

  43. [44]

    Rlbench: The robot learning benchmark & learning environment.IEEE Robotics and Automation Letters, 5(2):3019–3026, 2020

    Stephen James, Zicong Ma, David Rovick Arrojo, and Andrew J Davison. Rlbench: The robot learning benchmark & learning environment.IEEE Robotics and Automation Letters, 5(2):3019–3026, 2020. 18

  44. [45]

    Vima: robot manipulation with multimodal prompts

    Yunfan Jiang, Agrim Gupta, Zichen Zhang, Guanzhi Wang, Yongqiang Dou, Yanjun Chen, Li Fei-Fei, Anima Anandkumar, Yuke Zhu, and Linxi Fan. Vima: robot manipulation with multimodal prompts. In Proceedings of the 40th International Conference on Machine Learning, pages 14975–15022, 2023. 5

  45. [46]

    Dexmimicgen: Automated data generation for bimanual dexterous manipulation via imitation learning

    Zhenyu Jiang, Yuqi Xie, Kevin Lin, Zhenjia Xu, Weikang Wan, Ajay Mandlekar, Linxi Fan, and Yuke Zhu. Dexmimicgen: Automated data generation for bimanual dexterous manipulation via imitation learning

  46. [47]

    Thinking, Fast and Slow

    Daniel Kahneman. Thinking, fast and slow. 2011. 2

  47. [48]

    Language-driven representation learning for robotics

    Siddharth Karamcheti, Suraj Nair, Annie S Chen, Thomas Kollar, Chelsea Finn, Dorsa Sadigh, and Percy Liang. Language-driven representation learning for robotics.arXiv preprint arXiv:2302.12766, 2023. 18

  48. [49]

    EgoMimic: Scaling imitation learning via egocentric video, 2024

    Simar Kareer, Dhruv Patel, Ryan Punamiya, Pranay Mathur, Shuo Cheng, Chen Wang, Judy Hoffman, and Danfei Xu. EgoMimic: Scaling imitation learning via egocentric video, 2024. 18

  49. [50]

    Alexander Khazatsky, Karl Pertsch, Suraj Nair, Ashwin Balakrishna, Sudeep Dasari, Siddharth Karamcheti, Soroush Nasiriany, Mohan Kumar Srirama, Lawrence Yunliang Chen, Kirsty Ellis, Peter David Fagan, Joey Hejna, Masha Itkina, Marion Lepert, Yecheng Jason Ma, Patrick Tree Miller, Jimmy Wu, Suneel Belkhale, Shivin Dass, Huy Ha, Arhan Jain, Abraham Lee, Youngwoon...

  50. [51]

    OpenVLA: An Open-Source Vision-Language-Action Model

    Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan Foster, Grace Lam, Pannag Sanketi, Quan Vuong, Thomas Kollar, Benjamin Burchfiel, Russ Tedrake, Dorsa Sadigh, Sergey Levine, Percy Liang, and Chelsea Finn. Openvla: An open-source vision-language-action model.arXiv preprint arXiv:2406.09246, 2024. 17

  51. [52]

    Vision-language foundation models as effective robot imitators.arXiv preprint arXiv:2311.01378, 2023

    Xinghang Li, Minghuan Liu, Hanbo Zhang, Cunjun Yu, Jie Xu, Hongtao Wu, Chilam Cheang, Ya Jing, Weinan Zhang, Huaping Liu, et al. Vision-language foundation models as effective robot imitators.arXiv preprint arXiv:2311.01378, 2023. 17

  52. [53]

    Eagle 2: Building post-training data strategies from scratch for frontier vision-language models

    Zhiqi Li, Guo Chen, Shilong Liu, Shihao Wang, Vibashan VS, Yishen Ji, Shiyi Lan, Hao Zhang, Yilin Zhao, Subhashree Radhakrishnan, et al. Eagle 2: Building post-training data strategies from scratch for frontier vision-language models.arXiv preprint arXiv:2501.14818, 2025. 3, 4, 17

  53. [54]

    Code as policies: Language model programs for embodied control

    Jacky Liang, Wenlong Huang, Fei Xia, Peng Xu, Karol Hausman, Brian Ichter, Pete Florence, and Andy Zeng. Code as policies: Language model programs for embodied control. In2023 IEEE International Conference on Robotics and Automation (ICRA), pages 9493–9500. IEEE, 2023. 17

  54. [55]

    Text2motion: from natural language instructions to feasible plans.Autonomous Robots, 2023

    Kevin Lin, Christopher Agia, Toki Migimatsu, Marco Pavone, and Jeannette Bohg. Text2motion: from natural language instructions to feasible plans.Autonomous Robots, 2023. 17

  55. [56]

    Stiv: Scalable text and image conditioned video generation.arXiv preprint arXiv:2412.07730, 2024

    Zongyu Lin, Wei Liu, Chen Chen, Jiasen Lu, Wenze Hu, Tsu-Jui Fu, Jesse Allardice, Zhengfeng Lai, Liangchen Song, Bowen Zhang, et al. Stiv: Scalable text and image conditioned video generation.arXiv preprint arXiv:2412.07730, 2024. 6

  56. [57]

    Flow matching for generative modeling

    Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maximilian Nickel, and Matthew Le. Flow matching for generative modeling. InThe Eleventh International Conference on Learning Representations. 3, 17

  57. [58]

    Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow

    Xingchao Liu, Chengyue Gong, and Qiang Liu. Flow straight and fast: Learning to generate and transfer data with rectified flow.arXiv preprint arXiv:2209.03003, 2022. 17

  58. [59]

    Hoi4d: A 4d egocentric dataset for category-level human-object interaction

    Yunze Liu, Yun Liu, Che Jiang, Kangbo Lyu, Weikang Wan, Hao Shen, Boqiang Liang, Zhoujie Fu, He Wang, and Li Yi. Hoi4d: A 4d egocentric dataset for category-level human-object interaction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 21013–21022, June 2022. 11

  59. [60]

    Interactive language: Talking to robots in real time, 2022

    Corey Lynch, Ayzaan Wahid, Jonathan Tompson, Tianli Ding, James Betker, Robert Baruch, Travis Armstrong, and Pete Florence. Interactive language: Talking to robots in real time, 2022. 9

  60. [61]

    Interactive language: Talking to robots in real time.IEEE Robotics and Automation Letters, 2023

    Corey Lynch, Ayzaan Wahid, Jonathan Tompson, Tianli Ding, James Betker, Robert Baruch, Travis Armstrong, and Pete Florence. Interactive language: Talking to robots in real time.IEEE Robotics and Automation Letters, 2023. 17

  61. [62]

    Cacti: A framework for scalable multi-task multi-scene visual imitation learning

    Zhao Mandi, Homanga Bharadhwaj, Vincent Moens, Shuran Song, Aravind Rajeswaran, and Vikash Kumar. Cacti: A framework for scalable multi-task multi-scene visual imitation learning.arXiv preprint arXiv:2212.05711, 2022. 18

  62. [63]

    RoboTurk: A Crowdsourcing Platform for Robotic Skill Learning through Imitation

    Ajay Mandlekar, Yuke Zhu, Animesh Garg, Jonathan Booher, Max Spero, Albert Tung, Julian Gao, John Emmons, Anchit Gupta, Emre Orbay, Silvio Savarese, and Li Fei-Fei. RoboTurk: A Crowdsourcing Platform for Robotic Skill Learning through Imitation. In Conference on Robot Learning, 2018. 17

  63. [64]

    Scaling robot supervision to hundreds of hours with roboturk: Robotic manipulation dataset through human reasoning and dexterity

    Ajay Mandlekar, Jonathan Booher, Max Spero, Albert Tung, Anchit Gupta, Yuke Zhu, Animesh Garg, Silvio Savarese, and Li Fei-Fei. Scaling robot supervision to hundreds of hours with roboturk: Robotic manipulation dataset through human reasoning and dexterity. In 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 1048–1055. IEEE,...

  64. [65]

    Human-in-the-loop imitation learning using remote teleoperation.arXiv preprint arXiv:2012.06733, 2020

    Ajay Mandlekar, Danfei Xu, Roberto Martín-Martín, Yuke Zhu, Li Fei-Fei, and Silvio Savarese. Human- in-the-loop imitation learning using remote teleoperation.arXiv preprint arXiv:2012.06733, 2020. 17

  65. [66]

    What matters in learning from offline human demonstrations for robot manipulation

    Ajay Mandlekar, Danfei Xu, Josiah Wong, Soroush Nasiriany, Chen Wang, Rohun Kulkarni, Li Fei-Fei, Silvio Savarese, Yuke Zhu, and Roberto Martín-Martín. What matters in learning from offline human demonstrations for robot manipulation. InConference on Robot Learning (CoRL), 2021. 14

  66. [67]

    Mimicgen: A data generation system for scalable robot learning using human demonstrations

    Ajay Mandlekar, Soroush Nasiriany, Bowen Wen, Iretiayo Akinola, Yashraj Narang, Linxi Fan, Yuke Zhu, and Dieter Fox. Mimicgen: A data generation system for scalable robot learning using human demonstrations. InConference on Robot Learning, 2023. 6, 12, 18

  67. [68]

    Howto100m: Learning a text-video embedding by watching hundred million narrated video clips

    Antoine Miech, Dimitri Zhukov, Jean-Baptiste Alayrac, Makarand Tapaswi, Ivan Laptev, and Josef Sivic. Howto100m: Learning a text-video embedding by watching hundred million narrated video clips. In Proceedings of the IEEE/CVF international conference on computer vision, pages 2630–2640, 2019. 18

  68. [69]

    Scaling Open-Vocabulary Object Detection

    Matthias Minderer, Alexey A. Gritsenko, and Neil Houlsby. Scaling open-vocabulary object detection. In Thirty-seventh Conference on Neural Information Processing Systems, 2023. 24

  69. [70]

    Ray: A distributed framework for emerging {AI} applications

    Philipp Moritz, Robert Nishihara, Stephanie Wang, Alexey Tumanov, Richard Liaw, Eric Liang, Melih Elibol, Zongheng Yang, William Paul, Michael I Jordan, et al. Ray: A distributed framework for emerging {AI} applications. In13th USENIX symposium on operating systems design and implementation (OSDI 18), pages 561–577, 2018. 8

  70. [71]

    R3M: A Universal Visual Representation for Robot Manipulation

    Suraj Nair, Aravind Rajeswaran, Vikash Kumar, Chelsea Finn, and Abhinav Gupta. R3m: A universal visual representation for robot manipulation.arXiv preprint arXiv:2203.12601, 2022. 18

  71. [72]

    Robocasa: Large-scale simulation of everyday tasks for generalist robots

    Soroush Nasiriany, Abhiram Maddukuri, Lance Zhang, Adeet Parikh, Aaron Lo, Abhishek Joshi, Ajay Mandlekar, and Yuke Zhu. Robocasa: Large-scale simulation of everyday tasks for generalist robots. In Robotics: Science and Systems (RSS), 2024. 10, 11, 12, 14, 18

  72. [73]

    Osmo platform, 2025

    NVIDIA. Osmo platform, 2025. URL https://developer.nvidia.com/osmo. Accessed: 2025-03-12. 8

  73. [74]

    Octo: An open-source generalist robot policy

    Octo Model Team, Dibya Ghosh, Homer Walke, Karl Pertsch, Kevin Black, Oier Mees, Sudeep Dasari, Joey Hejna, Charles Xu, Jianlan Luo, Tobias Kreiman, You Liang Tan, Lawrence Yunliang Chen, Pannag Sanketi, Quan Vuong, Ted Xiao, Dorsa Sadigh, Chelsea Finn, and Sergey Levine. Octo: An open-source generalist robot policy. InProceedings of Robotics: Science and...

  74. [75]

    Open X-Embodiment: Robotic learning datasets and RT-X models

    Open X-Embodiment Collaboration et al. Open X-Embodiment: Robotic learning datasets and RT-X models. International Conference on Robotics and Automation, 2024. 1, 9

  75. [76]

    Open x-embodiment: Robotic learning datasets and rt-x models: Open x-embodiment collaboration 0

    Abby O’Neill, Abdul Rehman, Abhiram Maddukuri, Abhishek Gupta, Abhishek Padalkar, Abraham Lee, Acorn Pooley, Agrim Gupta, Ajay Mandlekar, Ajinkya Jain, et al. Open x-embodiment: Robotic learning datasets and rt-x models: Open x-embodiment collaboration 0. In2024 IEEE International Conference on Robotics and Automation (ICRA), 2024. 17

  76. [77]

    Scalable diffusion models with transformers

    William Peebles and Saining Xie. Scalable diffusion models with transformers. In Proceedings of the IEEE/CVF international conference on computer vision, pages 4195–4205, 2023. 4

  77. [78]

    Motion tracks: A unified representation for human-robot transfer in few- shot imitation learning, 2025

    Juntao Ren, Priya Sundaresan, Dorsa Sadigh, Sanjiban Choudhury, and Jeannette Bohg. Motion tracks: A unified representation for human-robot transfer in few-shot imitation learning.arXiv preprint arXiv:2501.06994, 2025. 18

  78. [79]

    Videoworld: Exploring knowledge learning from unlabeled videos.arXiv preprint arXiv:2501.09781, 2025

    Zhongwei Ren, Yunchao Wei, Xun Guo, Yao Zhao, Bingyi Kang, Jiashi Feng, and Xiaojie Jin. Videoworld: Exploring knowledge learning from unlabeled videos.arXiv preprint arXiv:2501.09781, 2025. 6

  79. [80]

    Assembly101: A Large-Scale Multi-View Video Dataset for Understanding Procedural Activities

    F. Sener, D. Chatterjee, D. Shelepov, K. He, D. Singhania, R. Wang, and A. Yao. Assembly101: A large-scale multi-view video dataset for understanding procedural activities.CVPR 2022, 2022. 11

  80. [81]

    Legato: Cross-Embodiment Imitation Using a Grasping Tool

    Mingyo Seo, H. Andy Park, Shenli Yuan, Yuke Zhu, and Luis Sentis. Legato: Cross-embodiment imitation using a grasping tool. IEEE Robotics and Automation Letters (RA-L), 2025. 18

Showing first 80 references.