pith. machine review for the scientific record.

arxiv: 2503.14734 · v2 · submitted 2025-03-18 · 💻 cs.RO · cs.AI · cs.LG

Recognition: 2 theorem links

· Lean Theorem

GR00T N1: An Open Foundation Model for Generalist Humanoid Robots

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 19:02 UTC · model grok-4.3

classification 💻 cs.RO · cs.AI · cs.LG
keywords humanoid robot · foundation model · vision-language-action · imitation learning · bimanual manipulation · diffusion transformer · robot learning · generalist autonomy

The pith

GR00T N1 is an open foundation model for humanoid robots that outperforms prior imitation learning methods on simulation benchmarks and on real bimanual manipulation tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents GR00T N1 as a generalist robot foundation model built to handle novel situations and new tasks with limited additional data. It uses a dual-system design: one part processes vision and language instructions to understand the scene, while the other part generates smooth actions in real time. The model is trained end-to-end on a combination of real robot trajectories, human videos, and synthetic data. In tests it beats standard imitation learning baselines across simulation benchmarks for several robot types. On a real Fourier GR-1 humanoid it completes language-guided two-handed manipulation tasks effectively while requiring relatively little task-specific data.

Core claim

GR00T N1, a Vision-Language-Action model with a dual-system architecture consisting of a vision-language module and a diffusion transformer module trained jointly end-to-end, outperforms state-of-the-art imitation learning baselines on standard simulation benchmarks across multiple robot embodiments when trained on a heterogeneous mixture of real-robot trajectories, human videos, and synthetically generated datasets, and achieves strong performance with high data efficiency when deployed for language-conditioned bimanual manipulation on the Fourier GR-1 humanoid robot.

What carries the argument

Dual-system Vision-Language-Action architecture in which a vision-language module interprets the environment and instructions and feeds into a diffusion transformer that generates fluid motor actions, with the modules tightly coupled and trained end-to-end on mixed real, video, and synthetic data.
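
The dual-system coupling is easier to see in code. The sketch below shows, in PyTorch, one way such a pairing could be wired: a toy vision-language encoder (System 2) produces context tokens, and a toy diffusion-style action head (System 1) denoises an action chunk conditioned on them, with a single loss trained end to end. Module sizes, the noising schedule, and the batch keys are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch of a dual-system VLA policy (assumed structure, not the authors' code).
import torch
import torch.nn as nn

class VisionLanguageEncoder(nn.Module):
    """Stand-in for the vision-language backbone (System 2); the real model would be a pretrained VLM."""
    def __init__(self, d_model=256, vocab_size=1000):
        super().__init__()
        self.img_proj = nn.Linear(3 * 32 * 32, d_model)      # toy image embedding (one token per image)
        self.txt_embed = nn.Embedding(vocab_size, d_model)   # toy instruction-token embedding
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, image, text_ids):
        img_tok = self.img_proj(image.flatten(1)).unsqueeze(1)     # (B, 1, D)
        txt_tok = self.txt_embed(text_ids)                         # (B, T, D)
        return self.encoder(torch.cat([img_tok, txt_tok], dim=1))  # context tokens for System 1

class DiffusionActionHead(nn.Module):
    """Stand-in for the diffusion transformer (System 1): predicts the noise on an action chunk."""
    def __init__(self, action_dim=16, d_model=256):
        super().__init__()
        self.act_proj = nn.Linear(action_dim, d_model)
        self.time_proj = nn.Linear(1, d_model)
        layer = nn.TransformerDecoderLayer(d_model, nhead=4, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=2)
        self.out = nn.Linear(d_model, action_dim)

    def forward(self, noisy_actions, t, context):
        h = self.act_proj(noisy_actions) + self.time_proj(t.view(-1, 1, 1))
        return self.out(self.decoder(h, context))

def training_step(vlm, action_head, batch, optimizer):
    """One joint end-to-end step: a single denoising loss backpropagates through both modules."""
    context = vlm(batch["image"], batch["text_ids"])
    noise = torch.randn_like(batch["actions"])
    t = torch.rand(batch["actions"].shape[0])                 # noise level in [0, 1]
    noisy = batch["actions"] + t.view(-1, 1, 1) * noise       # toy forward-noising schedule
    loss = nn.functional.mse_loss(action_head(noisy, t, context), noise)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Hypothetical usage: one optimizer over both modules makes the training genuinely joint.
# opt = torch.optim.Adam(list(vlm.parameters()) + list(head.parameters()), lr=1e-4)
```

The point of the sketch is the gradient path: because both modules sit under one denoising loss, the vision-language features are shaped by what the action head needs, which is what "tightly coupled and trained end-to-end" amounts to here.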

If this is right

  • A single generalist model can be applied across multiple robot embodiments without separate retraining from scratch.
  • Language-conditioned bimanual tasks become feasible on physical humanoids with reduced task-specific data collection.
  • Training on mixed human videos and synthetic data can substitute for large volumes of robot-only trajectories.
  • Real-time action generation through diffusion allows fluid motion while the vision-language component handles reasoning.
  • Open release of the model enables broader testing and adaptation by other researchers working on humanoid platforms.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the data mixture generalizes as claimed, similar foundation models could be built for non-humanoid robots by swapping embodiment-specific action heads.
  • The high data efficiency observed in real deployments suggests that scaling the synthetic and video portions could further reduce the cost of real-robot data collection.
  • Success on bimanual manipulation opens the possibility of extending the same architecture to longer-horizon household tasks that require sequencing multiple skills.
  • The open nature of the model allows direct comparison of its internal representations against those learned by task-specific methods to identify where generalization occurs.

Load-bearing premise

The heterogeneous mixture of real-robot trajectories, human videos, and synthetic datasets supplies enough coverage and quality for the model to generalize to new situations and different robot hardware in real-world use.
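
A minimal sketch of how such a mixture could be sampled during training, assuming three illustrative sources and hand-picked weights (neither is taken from the paper):

```python
import random

def mixture_batches(sources, weights, batch_size=32):
    """Yield training batches drawn from several data sources in proportion to `weights`."""
    names = list(sources)
    while True:
        picks = random.choices(names, weights=weights, k=batch_size)
        yield [next(sources[name]) for name in picks]

# Hypothetical usage with iterators over each source:
# sources = {"real_robot": iter(real_ds), "human_video": iter(video_ds), "synthetic": iter(sim_ds)}
# for batch in mixture_batches(sources, weights=[0.4, 0.3, 0.3]):
#     ...train on batch...
```

Whether the premise holds is then a question about these sources and weights: if the mixture under-covers a target embodiment or task family, no sampling scheme recovers the missing signal.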

What would settle it

Deploy the trained model on a different humanoid robot or in a substantially new task environment and measure whether it continues to outperform imitation learning baselines or requires extensive new data collection to match prior performance levels.
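
A sketch of that transfer test, with the rollout function, policies, and environment passed in as hypothetical placeholders rather than the paper's interfaces:

```python
def success_rate(rollout, policy, env, n_trials=50):
    """Fraction of trials in which `rollout` reports task success (True/False)."""
    return sum(1 for _ in range(n_trials) if rollout(policy, env)) / n_trials

def transfer_report(rollout, pretrained, baseline, env, finetune_episodes):
    """Compare the pretrained generalist to a per-embodiment baseline on a new robot."""
    return {
        "pretrained_success": success_rate(rollout, pretrained, env),
        "baseline_success": success_rate(rollout, baseline, env),
        "finetune_episodes_used": finetune_episodes,  # data cost to reach this performance
    }
```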

read the original abstract

General-purpose robots need a versatile body and an intelligent mind. Recent advancements in humanoid robots have shown great promise as a hardware platform for building generalist autonomy in the human world. A robot foundation model, trained on massive and diverse data sources, is essential for enabling the robots to reason about novel situations, robustly handle real-world variability, and rapidly learn new tasks. To this end, we introduce GR00T N1, an open foundation model for humanoid robots. GR00T N1 is a Vision-Language-Action (VLA) model with a dual-system architecture. The vision-language module (System 2) interprets the environment through vision and language instructions. The subsequent diffusion transformer module (System 1) generates fluid motor actions in real time. Both modules are tightly coupled and jointly trained end-to-end. We train GR00T N1 with a heterogeneous mixture of real-robot trajectories, human videos, and synthetically generated datasets. We show that our generalist robot model GR00T N1 outperforms the state-of-the-art imitation learning baselines on standard simulation benchmarks across multiple robot embodiments. Furthermore, we deploy our model on the Fourier GR-1 humanoid robot for language-conditioned bimanual manipulation tasks, achieving strong performance with high data efficiency.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces GR00T N1, an open Vision-Language-Action (VLA) foundation model for generalist humanoid robots featuring a dual-system architecture: a vision-language module (System 2) that interprets environments and language instructions, coupled with a diffusion transformer module (System 1) that generates real-time motor actions. Both modules are jointly trained end-to-end on a heterogeneous mixture of real-robot trajectories, human videos, and synthetically generated datasets. The central claims are that GR00T N1 outperforms state-of-the-art imitation learning baselines on standard simulation benchmarks across multiple robot embodiments and achieves strong performance with high data efficiency when deployed on the Fourier GR-1 humanoid for language-conditioned bimanual manipulation tasks.

Significance. If the empirical results hold with detailed supporting evidence, this would be a notable contribution to open foundation models in robotics. The dual-system VLA design, scale of heterogeneous training data, cross-embodiment simulation results, and real-world humanoid deployment represent concrete progress toward generalist robot autonomy. The decision to release the model openly is a particular strength that supports reproducibility and follow-on work.

major comments (3)
  1. [§4] Experiments and Results: The claim of outperformance over SOTA imitation learning baselines on multi-embodiment simulation benchmarks is central but presented without the specific quantitative metrics, baseline details, number of runs, or statistical significance tests needed to verify the magnitude and reliability of the improvement. This directly affects assessment of the headline result.
  2. [§3.2] Training Data: The heterogeneous mixture of real-robot trajectories, human videos, and synthetic data is treated as sufficient for cross-embodiment generalization and real-world transfer by construction, yet no per-source ablations, coverage metrics, or distribution-shift experiments are reported. This assumption is load-bearing for both the simulation outperformance and the GR-1 deployment claims; without it the joint VLA + diffusion training may not reliably close the domain gaps.
  3. [§5] Real-world Deployment: The GR-1 bimanual manipulation results are summarized as 'strong performance' and 'high data efficiency,' but lack concrete success rates, trial counts, task variations, failure modes, or comparisons to alternative methods. These details are required to substantiate the practical impact of the model.
minor comments (2)
  1. [§2] The dual-system terminology (System 1 / System 2) is introduced without explicit mapping to the cognitive-science analogy or to the precise information flow between the VLM and diffusion transformer; a short clarifying paragraph would improve readability.
  2. Figure 1 (architecture diagram) would benefit from explicit arrows or annotations showing the end-to-end joint training objective and the real-time inference path.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback, which helps clarify the empirical support for our claims. We address each major comment below and commit to revisions that add the requested details without altering the core contributions.

read point-by-point responses
  1. Referee: §4 (Experiments and Results): The claim of outperformance over SOTA imitation learning baselines on multi-embodiment simulation benchmarks is central but presented without the specific quantitative metrics, baseline details, number of runs, or statistical significance tests needed to verify the magnitude and reliability of the improvement. This directly affects assessment of the headline result.

    Authors: We agree that the simulation results require more granular reporting to allow independent verification. In the revised manuscript we will expand §4 with exact success rates and performance deltas for GR00T N1 versus each baseline, full baseline implementation details (including training hyperparameters and data used), the number of evaluation runs per task (minimum 5 seeds), and statistical tests (e.g., 95% confidence intervals and paired t-tests). Updated tables and text will present these metrics clearly. revision: yes
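
For concreteness, the kind of reporting committed to here could be computed as in the following sketch; the per-seed success rates are placeholders, not results from the paper.

```python
# Paired comparison across seeds: mean gap, 95% confidence interval, and a paired t-test.
import numpy as np
from scipy import stats

groot = np.array([0.78, 0.81, 0.75, 0.80, 0.79])      # per-seed success rates (hypothetical)
baseline = np.array([0.65, 0.70, 0.62, 0.68, 0.66])

diff = groot - baseline
t_stat, p_value = stats.ttest_rel(groot, baseline)     # paired t-test across seeds
ci_low, ci_high = stats.t.interval(0.95, df=len(diff) - 1,
                                   loc=diff.mean(), scale=stats.sem(diff))
print(f"mean gap = {diff.mean():.3f}, 95% CI = ({ci_low:.3f}, {ci_high:.3f}), p = {p_value:.4f}")
```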

  2. Referee: §3.2 (Training Data): The heterogeneous mixture of real-robot trajectories, human videos, and synthetic data is treated as sufficient for cross-embodiment generalization and real-world transfer by construction, yet no per-source ablations, coverage metrics, or distribution-shift experiments are reported. This assumption is load-bearing for both the simulation outperformance and the GR-1 deployment claims; without it the joint VLA + diffusion training may not reliably close the domain gaps.

    Authors: We recognize that per-source ablations would strengthen the data-mixture claims. The revised §3.2 will include dataset coverage statistics (sizes, diversity metrics) and a discussion of observed distribution shifts across sources. Full per-source ablations were not completed due to compute limits in this initial study; we will add partial ablations on key subsets and note their implications for generalization. The cross-embodiment simulation results provide supporting evidence for the joint-training approach. revision: partial
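
A leave-one-source-out ablation of the kind being discussed could follow the sketch below; `train_fn` and `evaluate_fn` are assumed stand-ins for the full training and benchmark pipeline, not the paper's API.

```python
def ablate_sources(sources, train_fn, evaluate_fn):
    """Evaluate the full mixture, then re-train with each source held out in turn."""
    results = {"full_mixture": evaluate_fn(train_fn(sources))}
    for held_out in sources:
        subset = {name: data for name, data in sources.items() if name != held_out}
        results[f"without_{held_out}"] = evaluate_fn(train_fn(subset))
    return results

# Hypothetical usage:
# results = ablate_sources({"real_robot": ..., "human_video": ..., "synthetic": ...},
#                          train_fn=train_policy, evaluate_fn=run_benchmarks)
```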

  3. Referee: §5 (Real-world Deployment): The GR-1 bimanual manipulation results are summarized as 'strong performance' and 'high data efficiency,' but lack concrete success rates, trial counts, task variations, failure modes, or comparisons to alternative methods. These details are required to substantiate the practical impact of the model.

    Authors: We agree that concrete metrics are essential. The revised §5 will report specific success rates, total trial counts, task variations tested, observed failure modes, and direct comparisons to alternative methods (where available). These additions will quantify the claimed performance and data efficiency on the Fourier GR-1 platform. revision: yes

Circularity Check

0 steps flagged

No circularity: purely empirical training and evaluation results

full rationale

The paper presents GR00T N1 as a trained VLA model using a heterogeneous data mixture, with performance claims based on direct evaluation against imitation learning baselines in simulation and real-robot deployments. No derivation chain, mathematical predictions, fitted parameters renamed as outputs, or self-citation load-bearing steps are present in the abstract or described methodology. All results are experimental outcomes from end-to-end training and testing, with no reduction of claims to inputs by construction.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 0 invented entities

The claim depends on the training data being sufficiently diverse and on the architecture's ability to integrate the two systems without additional unstated assumptions.

free parameters (2)
  • Diffusion transformer parameters
    The weights and hyperparameters of the diffusion model are learned from data.
  • Vision-language model parameters
    Parameters of the vision-language module trained jointly.
axioms (2)
  • domain assumption Joint end-to-end training of the vision-language and action modules leads to better performance than separate training.
    Invoked in the model design description.
  • domain assumption The data mixture from real robots, humans, and synthetic sources is representative enough for generalization.
    Central to the training strategy.

pith-pipeline@v0.9.0 · 5698 in / 1476 out tokens · 159046 ms · 2026-05-10T19:02:15.801386+00:00 · methodology

discussion (0)


Forward citations

Cited by 60 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. SafeManip: A Property-Driven Benchmark for Temporal Safety Evaluation in Robotic Manipulation

    cs.RO 2026-05 unverdicted novelty 8.0

    SafeManip is a new benchmark that applies LTLf monitors to assess temporal safety properties across eight categories in robotic manipulation, demonstrating that task success frequently fails to ensure safe execution i...

  2. Open-H-Embodiment: A Large-Scale Dataset for Enabling Foundation Models in Medical Robotics

    cs.RO 2026-04 conditional novelty 8.0

    Open-H-Embodiment is the largest open multi-embodiment medical robotics dataset, used to train GR00T-H, the first open vision-language-action model that achieves end-to-end suturing completion where prior models fail.

  3. FlowHijack: A Dynamics-Aware Backdoor Attack on Flow-Matching Vision-Language-Action Models

    cs.CV 2026-03 unverdicted novelty 8.0

    FlowHijack is the first dynamics-aware backdoor attack on flow-matching VLAs that achieves high success rates with stealthy triggers while preserving benign performance and making malicious actions kinematically indis...

  4. RotVLA: Rotational Latent Action for Vision-Language-Action Model

    cs.RO 2026-05 unverdicted novelty 7.0

    RotVLA models latent actions as continuous SO(n) rotations with triplet-frame supervision and flow-matching to reach 98.2% success on LIBERO and 89.6%/88.5% on RoboTwin2.0 using a 1.7B-parameter model.

  5. From Imagined Futures to Executable Actions: Mixture of Latent Actions for Robot Manipulation

    cs.RO 2026-05 unverdicted novelty 7.0

    MoLA infers a mixture of latent actions from generated future videos via modality-aware inverse dynamics models to improve robot manipulation policies.

  6. Premover: Fast Vision-Language-Action Control by Acting Before Instructions Are Complete

    cs.RO 2026-05 unverdicted novelty 7.0

    Premover enables VLA policies to act on partial instructions by precomputing focus maps from intermediate backbone layers, reducing wall-clock time 13.6 percent on LIBERO while preserving 95 percent success rate.

  7. DreamAvoid: Critical-Phase Test-Time Dreaming to Avoid Failures in VLA Policies

    cs.RO 2026-05 unverdicted novelty 7.0

    DreamAvoid uses a Dream Trigger, Action Proposer, and Dream Evaluator trained on success/failure/boundary data to let VLA policies avoid critical-phase failures via test-time future dreaming.

  8. Dynamic Execution Commitment of Vision-Language-Action Models

    cs.CV 2026-05 unverdicted novelty 7.0

    A3 determines the execution horizon in VLA models as the longest prefix of actions that passes consensus-based verification and sequential consistency checks.

  9. RIO: Flexible Real-Time Robot I/O for Cross-Embodiment Robot Learning

    cs.RO 2026-05 unverdicted novelty 7.0

    RIO introduces a lightweight open-source framework that abstracts real-time robot I/O to support easy switching between embodiments and platforms for collecting data and deploying VLAs.

  10. Offline Policy Evaluation for Manipulation Policies via Discounted Liveness Formulation

    cs.RO 2026-05 conditional novelty 7.0

    A liveness-based Bellman operator enables conservative offline policy evaluation for manipulation tasks by encoding task progression and reducing truncation bias from finite horizons.

  11. Overcoming Dynamics-Blindness: Training-Free Pace-and-Path Correction for VLA Models

    cs.RO 2026-05 unverdicted novelty 7.0

    Pace-and-Path Correction decomposes a quadratic cost minimization into orthogonal pace and path channels to correct chunked actions in VLA models, raising success rates by up to 28.8% in dynamic settings.

  12. CapVector: Learning Transferable Capability Vectors in Parametric Space for Vision-Language-Action Models

    cs.CV 2026-05 unverdicted novelty 7.0

    Capability vectors extracted from parameter differences between standard and auxiliary-finetuned VLA models can be merged into pretrained weights to match auxiliary-training performance while reducing computational ov...

  13. VEGA: Visual Encoder Grounding Alignment for Spatially-Aware Vision-Language-Action Models

    cs.RO 2026-05 unverdicted novelty 7.0

    VEGA improves spatial reasoning in VLA models for robotics by aligning visual encoder features with 3D-supervised DINOv2 representations via a temporary projector and cosine similarity loss.

  14. LoopVLA: Learning Sufficiency in Recurrent Refinement for Vision-Language-Action Models

    cs.AI 2026-05 unverdicted novelty 7.0

    LoopVLA adds recurrent refinement and learned sufficiency estimation to VLA models, cutting parameters 45% and raising throughput 1.7x while matching baseline task success on LIBERO and VLA-Arena.

  15. Octopus Protocol: One-Shot Hardware Discovery and Control for AI Agents via Infrastructure-as-Prompts

    cs.RO 2026-05 unverdicted novelty 7.0

    Octopus Protocol enables one-shot hardware onboarding for AI agents by running a five-stage LLM-driven pipeline that probes devices, infers capabilities, generates an MCP server, and deploys it for closed-loop control.

  16. Preserving Foundational Capabilities in Flow-Matching VLAs through Conservative SFT

    cs.RO 2026-05 unverdicted novelty 7.0

    ConSFT prevents catastrophic forgetting in fine-tuning flow-matching VLAs by dynamically scaling gradients based on model confidence, retaining over 20% more pre-trained capability than standard SFT without prior data...

  17. One Token Per Frame: Reconsidering Visual Bandwidth in World Models for VLA Policy

    cs.CV 2026-05 conditional novelty 7.0

    Reducing visual input to one token per frame in VLA world models maintains or improves long-horizon performance on MetaWorld, LIBERO, and real-robot tasks.

  18. NoiseGate: Learning Per-Latent Timestep Schedules as Information Gating in World Action Models

    cs.RO 2026-05 unverdicted novelty 7.0

    NoiseGate learns per-latent timestep schedules as an information-gating policy in diffusion-based world action models, yielding consistent gains on RoboTwin manipulation tasks.

  19. Latent Bridge: Feature Delta Prediction for Efficient Dual-System Vision-Language-Action Model Inference

    cs.RO 2026-05 unverdicted novelty 7.0

    Latent Bridge predicts VLM feature deltas to reduce VLM calls by 50-75% in dual-system VLA models while retaining 95-100% performance and achieving 1.65-1.73x speedup across LIBERO, RoboCasa, and ALOHA benchmarks.

  20. Phone2Act: A Low-Cost, Hardware-Agnostic Teleoperation System for Scalable VLA Data Collection

    cs.RO 2026-05 unverdicted novelty 7.0

    Phone2Act is a smartphone-based teleoperation system that collects synchronized multi-camera robot manipulation data in LeRobot format without custom hardware, validated by fine-tuning GR00T-N1.5 to 90% success on a r...

  21. Being-H0.7: A Latent World-Action Model from Egocentric Videos

    cs.RO 2026-04 unverdicted novelty 7.0

    Being-H0.7 adds future-aware latent reasoning to direct VLA policies via dual-branch alignment on latent queries, matching world-model benefits at VLA efficiency.

  22. DiscreteRTC: Discrete Diffusion Policies are Natural Asynchronous Executors

    cs.RO 2026-04 unverdicted novelty 7.0

    Discrete diffusion policies support native asynchronous execution via unmasking for real-time chunking, delivering higher success rates and 0.7x inference cost versus flow-matching RTC on dynamic robotics benchmarks a...

  23. Characterizing Vision-Language-Action Models across XPUs: Constraints and Acceleration for On-Robot Deployment

    cs.RO 2026-04 unverdicted novelty 7.0

    VLA models exhibit a compute-bound VLM phase followed by a memory-bound action phase on edge hardware; DP-Cache and V-AEFusion reduce redundancy and enable pipeline parallelism for up to 6x speedup on NPUs with margin...

  24. Modular Sensory Stream for Integrating Physical Feedback in Vision-Language-Action Models

    cs.RO 2026-04 unverdicted novelty 7.0

    MoSS augments VLAs with decoupled modality streams for multiple physical signals, achieving synergistic gains in real-world robot tasks via joint attention and auxiliary future-signal prediction.

  25. CodeGraphVLP: Code-as-Planner Meets Semantic-Graph State for Non-Markovian Vision-Language-Action Models

    cs.RO 2026-04 unverdicted novelty 7.0

    CodeGraphVLP uses a semantic-graph state and executable code planner to enable reliable long-horizon non-Markovian robot manipulation, improving task success and lowering latency over standard VLA baselines.

  26. EmbodiedMidtrain: Bridging the Gap between Vision-Language Models and Vision-Language-Action Models via Mid-training

    cs.CV 2026-04 unverdicted novelty 7.0

    EmbodiedMidtrain mid-trains VLMs on curated VLA-aligned data subsets to improve downstream performance on robot manipulation benchmarks.

  27. RoboWM-Bench: A Benchmark for Evaluating World Models in Robotic Manipulation

    cs.RO 2026-04 unverdicted novelty 7.0

    RoboWM-Bench evaluates video world models by converting their outputs into executable robot actions and running them on manipulation tasks, showing that physical inconsistencies remain common.

  28. RoboWM-Bench: A Benchmark for Evaluating World Models in Robotic Manipulation

    cs.RO 2026-04 unverdicted novelty 7.0

    RoboWM-Bench evaluates video world models by converting their manipulation video predictions into executable actions validated in simulation, showing that visual plausibility does not guarantee physical executability.

  29. Using large language models for embodied planning introduces systematic safety risks

    cs.AI 2026-04 unverdicted novelty 7.0

    LLM planners for robots often produce dangerous plans even when planning succeeds, with safety awareness staying flat as model scale improves planning ability.

  30. ${\pi}_{0.7}$: a Steerable Generalist Robotic Foundation Model with Emergent Capabilities

    cs.LG 2026-04 unverdicted novelty 7.0

    π₀.₇ is a steerable generalist robotic model that uses rich multimodal prompts including language, subgoal images, and performance metadata to achieve out-of-the-box generalization across tasks and robot bodies.

  31. [Emerging Ideas] Artificial Tripartite Intelligence: A Bio-Inspired, Sensor-First Architecture for Physical AI

    cs.AI 2026-04 unverdicted novelty 7.0

    ATI is a tripartite bio-inspired architecture for physical AI that co-designs sensing and inference, shown in a camera prototype to raise accuracy from 53.8% to 88% and cut remote invocations by 43.3%.

  32. Flow Motion Policy: Manipulator Motion Planning with Flow Matching Models

    cs.RO 2026-04 unverdicted novelty 7.0

    Flow Motion Policy uses flow matching to model distributions over feasible manipulator paths, enabling best-of-N sampling with post-generation collision filtering to improve success and efficiency over prior neural an...

  33. DFM-VLA: Iterative Action Refinement for Robot Manipulation via Discrete Flow Matching

    cs.RO 2026-03 conditional novelty 7.0

    DFM-VLA uses discrete flow matching to iteratively refine action tokens in VLA models, outperforming autoregressive and diffusion baselines with 4.44 average success length on CALVIN and 95.7% success on LIBERO.

  34. VP-VLA: Visual Prompting as an Interface for Vision-Language-Action Models

    cs.RO 2026-03 unverdicted novelty 7.0

    VP-VLA decouples high-level reasoning from low-level control in VLA models by rendering spatial anchors as visual prompts directly in the RGB observation space, outperforming end-to-end baselines.

  35. Towards Generalizable Robotic Manipulation in Dynamic Environments

    cs.CV 2026-03 unverdicted novelty 7.0

    DOMINO dataset and PUMA architecture enable better dynamic robotic manipulation by incorporating motion history, delivering 6.3% higher success rates than prior VLA models.

  36. AR-VLA: True Autoregressive Action Expert for Vision-Language-Action Models

    cs.RO 2026-03 unverdicted novelty 7.0

    AR-VLA introduces a standalone autoregressive action expert with long-lived memory that generates context-aware continuous actions for VLAs, replacing chunk-based heads with smoother trajectories and maintained task success.

  37. ReCogDrive: A Reinforced Cognitive Framework for End-to-End Autonomous Driving

    cs.CV 2025-06 unverdicted novelty 7.0

    ReCogDrive unifies VLM scene understanding with a diffusion planner reinforced by DiffGRPO to reach state-of-the-art results on NAVSIM and Bench2Drive benchmarks.

  38. Hand-in-the-Loop: Improving Dexterous VLA via Seamless Interventional Correction

    cs.RO 2026-05 unverdicted novelty 6.0

    HandITL blends human intent with policy execution to eliminate gesture jumps in dexterous VLA interventions, cutting jitter by 99.8%, grasp failures by 87.5%, and yielding 19% better refined policies.

  39. WarmPrior: Straightening Flow-Matching Policies with Temporal Priors

    cs.LG 2026-05 unverdicted novelty 6.0

    Replacing Gaussian noise with a temporally grounded prior from recent actions straightens flow-matching paths and improves success rates in robotic manipulation and prior-space RL.

  40. FrameSkip: Learning from Fewer but More Informative Frames in VLA Training

    cs.RO 2026-05 unverdicted novelty 6.0

    FrameSkip improves VLA policy training success from 66.50% to 76.15% by selecting high-importance frames and retaining only 20% of unique frames across three benchmarks.

  41. Guide, Think, Act: Interactive Embodied Reasoning in Vision-Language-Action Models

    cs.RO 2026-05 conditional novelty 6.0

    GTA-VLA conditions VLA models on user spatial priors to produce a unified spatial-visual chain-of-thought, reaching 81.2% success on SimplerEnv WidowX and improving performance under out-of-distribution shifts.

  42. D-VLA: A High-Concurrency Distributed Asynchronous Reinforcement Learning Framework for Vision-Language-Action Models

    cs.AI 2026-05 unverdicted novelty 6.0

    D-VLA uses plane decoupling and a swimlane pipeline to deliver higher throughput and linear speedup than prior RL frameworks when training billion- and trillion-parameter VLA models on benchmarks like LIBERO.

  43. D-VLA: A High-Concurrency Distributed Asynchronous Reinforcement Learning Framework for Vision-Language-Action Models

    cs.AI 2026-05 unverdicted novelty 6.0

    D-VLA introduces plane decoupling and a swimlane asynchronous pipeline to achieve high-concurrency RL training and linear scalability for billion- to trillion-parameter vision-language-action models.

  44. Towards Long-horizon Embodied Agents with Tool-Aligned Vision-Language-Action Models

    cs.RO 2026-05 unverdicted novelty 6.0

    VLAs-as-Tools pairs a VLM planner with specialized VLA executors via a new interface and Tool-Aligned Post-Training to raise long-horizon robot success rates on LIBERO-Long and RoboTwin benchmarks.

  45. GuidedVLA: Specifying Task-Relevant Factors via Plug-and-Play Action Attention Specialization

    cs.RO 2026-05 unverdicted novelty 6.0

    GuidedVLA improves VLA success rates by manually supervising separate attention heads in the action decoder with auxiliary signals for task-relevant factors.

  46. From Reaction to Anticipation: Proactive Failure Recovery through Agentic Task Graph for Robotic Manipulation

    cs.RO 2026-05 unverdicted novelty 6.0

    AgentChord models manipulation tasks as directed graphs enriched with anticipatory recovery branches, using specialized agents to enable immediate, low-latency failure responses and improve success on long-horizon bim...

  47. Overcoming Dynamics-Blindness: Training-Free Pace-and-Path Correction for VLA Models

    cs.RO 2026-05 unverdicted novelty 6.0

    Pace-and-Path Correction is a closed-form inference-time operator that decomposes a quadratic cost minimization into orthogonal pace compression and path offset channels to correct dynamics-blindness in chunked-action...

  48. PriorVLA: Prior-Preserving Adaptation for Vision-Language-Action Models

    cs.RO 2026-05 unverdicted novelty 6.0

    PriorVLA preserves pretrained priors in VLA models through a frozen Prior Expert and trained Adaptation Expert, delivering better robot manipulation performance than full fine-tuning with only 25% of the parameter updates.

  49. VISOR: A Vision-Language Model-based Test Oracle for Testing Robot

    cs.SE 2026-05 unverdicted novelty 6.0

    VISOR applies VLMs to automate robot test oracles for correctness and quality assessment while reporting uncertainty, with evaluation on GPT and Gemini showing trade-offs in precision and recall but poor uncertainty c...

  50. Adaptive Action Chunking via Multi-Chunk Q Value Estimation

    cs.LG 2026-05 unverdicted novelty 6.0

    ACH lets RL policies dynamically pick action chunk lengths by jointly estimating Q-values for all candidate lengths via a single Transformer pass.

  51. Lakestream: A Consistent and Brokerless Data Plane for Large Foundation Model Training

    cs.DC 2026-05 unverdicted novelty 6.0

    Lakestream provides a consistent brokerless object-store-native data plane for large foundation model training using transactional global batches and decentralized adaptive commit.

  52. StereoPolicy: Improving Robotic Manipulation Policies via Stereo Perception

    cs.RO 2026-05 unverdicted novelty 6.0

    StereoPolicy fuses stereo image pairs via a Stereo Transformer on pretrained 2D encoders to boost robotic manipulation policies, showing gains over monocular, RGB-D, point cloud, and multi-view methods in simulations ...

  53. One Token Per Frame: Reconsidering Visual Bandwidth in World Models for VLA Policy

    cs.CV 2026-05 unverdicted novelty 6.0

    Reducing visual input to one token per frame via adaptive attention pooling and a unified flow-matching objective improves long-horizon performance in VLA policies on MetaWorld, LIBERO, and real-robot tasks.

  54. One Token Per Frame: Reconsidering Visual Bandwidth in World Models for VLA Policy

    cs.CV 2026-05 unverdicted novelty 6.0

    Reducing visual input to one token per frame in world models for vision-language-action policies maintains long-horizon performance while improving success rates on MetaWorld, LIBERO, and real-robot tasks.

  55. EggHand: A Multimodal Foundation Model for Egocentric Hand Pose Forecasting

    cs.CV 2026-05 unverdicted novelty 6.0

    EggHand unifies VLA action decoding with viewpoint-aware video-text encoding to forecast egocentric hand poses, achieving SOTA accuracy on EgoExo4D while remaining robust to ego-motion and controllable via language prompts.

  56. Escaping the Diversity Trap in Robotic Manipulation via Anchor-Centric Adaptation

    cs.RO 2026-05 unverdicted novelty 6.0

    Anchor-Centric Adaptation escapes the diversity trap by prioritizing repeated demonstrations at core anchors over broad coverage, yielding higher success rates under fixed data budgets in robotic manipulation.

  57. AT-VLA: Adaptive Tactile Injection for Enhanced Feedback Reaction in Vision-Language-Action Models

    cs.RO 2026-05 unverdicted novelty 6.0

    AT-VLA introduces adaptive tactile injection and a dual-stream tactile reaction mechanism to integrate real-time tactile feedback into pretrained VLA models for contact-rich robotic manipulation.

  58. HumanNet: Scaling Human-centric Video Learning to One Million Hours

    cs.CV 2026-05 unverdicted novelty 6.0

    HumanNet is a 1M-hour human-centric video dataset with interaction annotations that enables better vision-language-action model performance than equivalent robot data in a controlled test.

  59. TriRelVLA: Triadic Relational Structure for Generalizable Embodied Manipulation

    cs.CV 2026-05 unverdicted novelty 6.0

    TriRelVLA introduces triadic object-hand-task relational representations and a task-grounded graph transformer with a relational bottleneck to improve generalization in robotic manipulation across scenes, objects, and tasks.

  60. Adaptive Q-Chunking for Offline-to-Online Reinforcement Learning

    cs.LG 2026-05 unverdicted novelty 6.0

    Adaptive Q-Chunking selects optimal action chunk sizes at each state via normalized advantage comparisons to outperform fixed chunk sizes in offline-to-online RL on robot benchmarks.

Reference graph

Works this paper leans on

103 extracted references · 103 canonical work pages · cited by 129 Pith papers · 17 internal anchors

  1. [1]

    Cosmos World Foundation Model Platform for Physical AI

    Niket Agarwal, Arslan Ali, Maciej Bala, Yogesh Balaji, Erik Barker, Tiffany Cai, Prithvijit Chattopadhyay, Yongxin Chen, Yin Cui, Yifan Ding, et al. Cosmos world foundation model platform for physical ai.arXiv preprint arXiv:2501.03575, 2025. 6, 18

  2. [2]

    AgiBot World Colosseo: A Large-scale Manipulation Platform for Scalable and Intelligent Embodied Systems

    AgiBot-World-Contributors et al. AgiBot World Colosseo: A Large-scale Manipulation Platform for Scalable and Intelligent Embodied Systems.arXiv preprint arXiv:2503.06669, 2025. 9, 17

  3. [3]

    Flamingo: a visual language model for few-shot learning

    Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, Roman Ring, Eliza Rutherford, Serkan Cabi, Tengda Han, Zhitao Gong, Sina Samangooei, Marianne Monteiro, Jacob L Menick, Sebastian Borgeaud, Andy Brock, Aida Nematzadeh, Sahand Sharifzadeh, Mikołaj Bińko...

  4. [4]

    Aloha 2: An enhanced low-cost hardware for bimanual teleoperation.arXiv preprint arXiv:2405.02292, 2024

    Jorge Aldaco, Travis Armstrong, Robert Baruch, Jeff Bingham, Sanky Chan, Kenneth Draper, Debidatta Dwibedi, Chelsea Finn, Pete Florence, Spencer Goodrich, et al. Aloha 2: An enhanced low-cost hardware for bimanual teleoperation.arXiv preprint arXiv:2405.02292, 2024. 17

  5. [5]

    SmolLM2: When Smol Goes Big -- Data-Centric Training of a Small Language Model

    Loubna Ben Allal, Anton Lozhkov, Elie Bakouch, Gabriel Martín Blázquez, Guilherme Penedo, Lewis Tunstall, Andrés Marafioti, Hynek Kydlíček, Agustín Piqueres Lajarín, Vaibhav Srivastav, et al. Smollm2: When smol goes big–data-centric training of a small language model.arXiv preprint arXiv:2502.02737,

  6. [6]

    Video pretraining (vpt): Learning to act by watching unlabeled online videos

    Bowen Baker, Ilge Akkaya, Peter Zhokov, Joost Huizinga, Jie Tang, Adrien Ecoffet, Brandon Houghton, Raul Sampedro, and Jeff Clune. Video pretraining (vpt): Learning to act by watching unlabeled online videos. Advances in Neural Information Processing Systems, 35:24639–24654, 2022. 8, 17

  7. [7]

    Gen2Act: Human Video Generation in Novel Scenarios enables Generalizable Robot Manipulation

    Homanga Bharadhwaj, Debidatta Dwibedi, Abhinav Gupta, Shubham Tulsiani, Carl Doersch, Ted Xiao, Dhruv Shah, Fei Xia, Dorsa Sadigh, and Sean Kirmani. Gen2act: Human video generation in novel scenarios enables generalizable robot manipulation.arXiv preprint arXiv:2409.16283, 2024. 18

  8. [8]

    Track2act: Predicting point tracks from internet videos enables diverse zero-shot robot manipulation.arXiv e-prints, pages arXiv–2405, 2024

    Homanga Bharadhwaj, Roozbeh Mottaghi, Abhinav Gupta, and Shubham Tulsiani. Track2act: Predicting point tracks from internet videos enables diverse zero-shot robot manipulation.arXiv e-prints, pages arXiv–2405, 2024. 18

  9. [9]

    Roboagent: Generalization and efficiency in robot manipulation via semantic augmentations and action chunking

    Homanga Bharadhwaj, Jay Vakil, Mohit Sharma, Abhinav Gupta, Shubham Tulsiani, and Vikash Kumar. Roboagent: Generalization and efficiency in robot manipulation via semantic augmentations and action chunking. In2024 IEEE International Conference on Robotics and Automation (ICRA), 2024. 9

  10. [10]

    $\pi_0$: A Vision-Language-Action Flow Model for General Robot Control

    Kevin Black, Noah Brown, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, et al. π0: A vision-language-action flow model for general robot control. arXiv preprint arXiv:2410.24164, 2024. 3, 5, 17

  11. [11]

    On the Opportunities and Risks of Foundation Models

    Rishi Bommasani, Drew A Hudson, Ehsan Adeli, Russ Altman, Simran Arora, Sydney von Arx, Michael S Bernstein, Jeannette Bohg, Antoine Bosselut, Emma Brunskill, et al. On the opportunities and risks of foundation models.arXiv preprint arXiv:2108.07258, 2021. 17

  12. [13]

    RT-1: Robotics Transformer for Real-World Control at Scale

    Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Joseph Dabis, Chelsea Finn, Keerthana Gopalakrishnan, Karol Hausman, Alex Herzog, Jasmine Hsu, et al. Rt-1: Robotics transformer for real- world control at scale.arXiv preprint arXiv:2212.06817, 2022. 17

  13. [14]

    RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control

    Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Xi Chen, Krzysztof Choromanski, Tianli Ding, Danny Driess, Avinava Dubey, Chelsea Finn, Pete Florence, Chuyuan Fu, Montse Gonzalez Arenas, Keerthana Gopalakrishnan, Kehang Han, Karol Hausman, Alex Herzog, Jasmine Hsu, Brian Ichter, Alex Irpan, Nikhil Joshi, Ryan Julian, Dmitry Kalashnikov, Yuh...

  14. [15]

    Do as i can, not as i say: Grounding language in robotic affordances

    Anthony Brohan, Yevgen Chebotar, Chelsea Finn, Karol Hausman, Alexander Herzog, Daniel Ho, Julian Ibarz, Alex Irpan, Eric Jang, Ryan Julian, et al. Do as i can, not as i say: Grounding language in robotic affordances. InConference on Robot Learning, pages 287–318. PMLR, 2023. 17

  15. [16]

    Do as i can, not as i say: Grounding language in robotic affordances

    Anthony Brohan, Yevgen Chebotar, Chelsea Finn, Karol Hausman, Alexander Herzog, Daniel Ho, Julian Ibarz, Alex Irpan, Eric Jang, Ryan Julian, et al. Do as i can, not as i say: Grounding language in robotic affordances. InConference on robot learning, pages 287–318. PMLR, 2023. 17

  16. [17]

    Video generation models as world simulators

    Tim Brooks, Bill Peebles, Connor Holmes, Will DePue, Yufei Guo, Li Jing, David Schnurr, Joe Taylor, Troy Luhman, Eric Luhman, et al. Video generation models as world simulators. 2024.URL https://openai. com/research/video-generation-models-as-world-simulators, 3:1, 2024. 6

  17. [18]

    Lerobot: State-of-the-art machine learning for real-world robotics in pytorch.https://github.com/ huggingface/lerobot, 2024

    Remi Cadene, Simon Alibert, Alexander Soare, Quentin Gallouedec, Adil Zouitine, and Thomas Wolf. Lerobot: State-of-the-art machine learning for real-world robotics in pytorch.https://github.com/ huggingface/lerobot, 2024. 21

  18. [19]

    GR-2: A Generative Video-Language-Action Model with Web-Scale Knowledge for Robot Manipulation

    Chi-Lam Cheang, Guangzeng Chen, Ya Jing, Tao Kong, Hang Li, Yifeng Li, Yuxiao Liu, Hongtao Wu, Jiafeng Xu, Yichu Yang, Hanbo Zhang, and Minzhao Zhu. Gr-2: A generative video-language-action model with web-scale knowledge for robot manipulation.arXiv preprint arXiv:2410.06158, 2024. 17

  19. [20]

    Genaug: Retargeting behaviors to unseen situations via generative augmentation

    Zoey Chen, Sho Kiami, Abhishek Gupta, and Vikash Kumar. Genaug: Retargeting behaviors to unseen situations via generative augmentation.arXiv preprint arXiv:2302.06671, 2023. 18

  20. [21]

    Diffusion policy: Visuomotor policy learning via action diffusion.The International Journal of Robotics Research, 2024

    Cheng Chi, Zhenjia Xu, Siyuan Feng, Eric Cousineau, Yilun Du, Benjamin Burchfiel, Russ Tedrake, and Shuran Song. Diffusion policy: Visuomotor policy learning via action diffusion.The International Journal of Robotics Research, 2024. 14

  21. [22]

    Universal manipulation interface: In-the-wild robot teaching without in-the-wild robots

    Cheng Chi, Zhenjia Xu, Chuer Pan, Eric Cousineau, Benjamin Burchfiel, Siyuan Feng, Russ Tedrake, and Shuran Song. Universal manipulation interface: In-the-wild robot teaching without in-the-wild robots. In Proceedings of Robotics: Science and Systems (RSS), 2024. 18

  22. [23]

    Imitating task and motion planning with visuomotor transformers

    Murtaza Dalal, Ajay Mandlekar, Caelan Reed Garrett, Ankur Handa, Ruslan Salakhutdinov, and Dieter Fox. Imitating task and motion planning with visuomotor transformers. In Conference on Robot Learning, pages 2565–2593. PMLR, 2023. 18

  23. [24]

    Scaling egocentric vision: The epic-kitchens dataset

    Dima Damen, Hazel Doughty, Giovanni Maria Farinella, Sanja Fidler, Antonino Furnari, Evangelos Kazakos, Davide Moltisanti, Jonathan Munro, Toby Perrett, Will Price, et al. Scaling egocentric vision: The epic-kitchens dataset. InProceedings of the European conference on computer vision (ECCV), pages 720–736, 2018. 11, 18

  24. [25]

    Telemoma: A modular and versatile teleoperation system for mobile manipulation

    Shivin Dass, Wensi Ai, Yuqian Jiang, Samik Singh, Jiaheng Hu, Ruohan Zhang, Peter Stone, Ben Abbatematteo, and Roberto Martín-Martín. Telemoma: A modular and versatile teleoperation system for mobile manipulation. In2nd Workshop on Mobile Manipulation and Embodied Intelligence at ICRA 2024, 2024. 17

  25. [26]

    PaLM-E: An Embodied Multimodal Language Model

    Danny Driess, Fei Xia, Mehdi SM Sajjadi, Corey Lynch, Aakanksha Chowdhery, Brian Ichter, Ayzaan Wahid, Jonathan Tompson, Quan Vuong, Tianhe Yu, et al. Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378, 2023. 17

  26. [27]

    Bridge Data: Boosting Generalization of Robotic Skills with Cross-Domain Datasets

    Frederik Ebert, Yanlai Yang, Karl Schmeckpeper, Bernadette Bucher, Georgios Georgakis, Kostas Daniilidis, Chelsea Finn, and Sergey Levine. Bridge Data: Boosting Generalization of Robotic Skills with Cross-Domain Datasets. In Proceedings of Robotics: Science and Systems, New York City, NY, USA, June

  27. [28]

    doi: 10.15607/RSS.2022.XVIII.063. 17

  28. [29]

    Rh20t: A robotic dataset for learning diverse skills in one-shot

    Hao-Shu Fang, Hongjie Fang, Zhenyu Tang, Jirong Liu, Junbo Wang, Haoyi Zhu, and Cewu Lu. Rh20t: A robotic dataset for learning diverse skills in one-shot. InRSS 2023 Workshop on Learning for Task and Motion Planning, 2023. 11

  29. [30]

    Airexo: Low-cost exoskeletons for learning whole-arm manipulation in the wild

    Hongjie Fang, Hao-Shu Fang, Yiming Wang, Jieji Ren, Jingjing Chen, Ruo Zhang, Weiming Wang, and Cewu Lu. Airexo: Low-cost exoskeletons for learning whole-arm manipulation in the wild. In2024 IEEE International Conference on Robotics and Automation (ICRA), pages 15031–15038. IEEE, 2024. 18

  30. [31]

    Mobile aloha: Learning bimanual mobile manipulation with low-cost whole-body teleoperation

    Zipeng Fu, Tony Z Zhao, and Chelsea Finn. Mobile aloha: Learning bimanual mobile manipulation with low-cost whole-body teleoperation. In2024 IEEE International Conference on Robotics and Automation (ICRA), 2024. 17

  31. [32]

    Skillmimicgen: Automated demonstration generation for efficient skill learning and deployment. arXiv preprint arXiv:2410.18907, 2024

    Caelan Garrett, Ajay Mandlekar, Bowen Wen, and Dieter Fox. Skillmimicgen: Automated demonstration generation for efficient skill learning and deployment.arXiv preprint arXiv:2410.18907, 2024. 18

  32. [33]

    The "Something Something" Video Database for Learning and Evaluating Visual Common Sense

    Raghav Goyal, Samira Ebrahimi Kahou, Vincent Michalski, Joanna Materzynska, Susanne Westphal, Heuna Kim, Valentin Haenel, Ingo Fruend, Peter Yianilos, Moritz Mueller-Freitag, et al. The "something something" video database for learning and evaluating visual common sense. In Proceedings of the IEEE international conference on computer vision, pages 5842–585...

  33. [34]

    Ego4d: Around the world in 3,000 hours of egocentric video

    Kristen Grauman, Andrew Westbury, Eugene Byrne, Zachary Chavis, Antonino Furnari, Rohit Girdhar, Jackson Hamburger, Hao Jiang, Miao Liu, Xingyu Liu, et al. Ego4d: Around the world in 3,000 hours of egocentric video. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 18995–19012, 2022. 11, 18

  34. [35]

    Ego-exo4d: Understanding skilled human activity from first-and third-person perspectives

    Kristen Grauman, Andrew Westbury, Lorenzo Torresani, Kris Kitani, Jitendra Malik, Triantafyllos Afouras, Kumar Ashutosh, Vijay Baiyya, Siddhant Bansal, Bikram Boote, et al. Ego-exo4d: Understanding skilled human activity from first- and third-person perspectives. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19383–194...

  35. [36]

    Maniskill2: A unified benchmark for generalizable manipulation skills

    Jiayuan Gu, Fanbo Xiang, Xuanlin Li, Zhan Ling, Xiqiang Liu, Tongzhou Mu, Yihe Tang, Stone Tao, Xinyue Wei, Yunchao Yao, et al. Maniskill2: A unified benchmark for generalizable manipulation skills. In The Eleventh International Conference on Learning Representations, 2023. 18

  36. [37]

    Scaling up and distilling down: Language-guided robot skill acquisition

    Huy Ha, Pete Florence, and Shuran Song. Scaling up and distilling down: Language-guided robot skill acquisition. In Conference on Robot Learning, pages 3766–3777. PMLR, 2023. 18

  37. [38]

    LoRA: Low-rank adaptation of large language models

    Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models. InInternational Conference on Learning Representations, 2022. 24

  38. [39]

    Adaflow: Imitation learning with variance-adaptive flow-based policies, 2024

    Xixi Hu, Bo Liu, Xingchao Liu, and Qiang Liu. Adaflow: Imitation learning with variance-adaptive flow-based policies.arXiv preprint arXiv:2402.04292, 2024. 17

  39. [40]

    An embodied generalist agent in 3d world

    Jiangyong Huang, Silong Yong, Xiaojian Ma, Xiongkun Linghu, Puhao Li, Yan Wang, Qing Li, Song-Chun Zhu, Baoxiong Jia, and Siyuan Huang. An embodied generalist agent in 3d world. InProceedings of the International Conference on Machine Learning (ICML), 2024. 17

  40. [41]

    Inner monologue: Embodied reasoning through planning with language models

    Wenlong Huang, Fei Xia, Ted Xiao, Harris Chan, Jacky Liang, Pete Florence, Andy Zeng, Jonathan Tompson, Igor Mordatch, Yevgen Chebotar, et al. Inner monologue: Embodied reasoning through planning with language models. In6th Annual Conference on Robot Learning. 17

  41. [42]

    Grounded decoding: Guiding text generation with grounded models for embodied agents.Advances in Neural Information Processing Systems, 36:59636–59661, 2023

    Wenlong Huang, Fei Xia, Dhruv Shah, Danny Driess, Andy Zeng, Yao Lu, Pete Florence, Igor Mordatch, Sergey Levine, Karol Hausman, et al. Grounded decoding: Guiding text generation with grounded models for embodied agents.Advances in Neural Information Processing Systems, 36:59636–59661, 2023. 17

  42. [43]

    Open teach: A versatile teleoperation system for robotic manipulation

    Aadhithya Iyer, Zhuoran Peng, Yinlong Dai, Irmak Guzey, Siddhant Haldar, Soumith Chintala, and Lerrel Pinto. Open teach: A versatile teleoperation system for robotic manipulation. InCoRL 2024 Workshop on Mastering Robot Manipulation in a World of Abundant Data. 17

  43. [44]

    Rlbench: The robot learning benchmark & learning environment.IEEE Robotics and Automation Letters, 5(2):3019–3026, 2020

    Stephen James, Zicong Ma, David Rovick Arrojo, and Andrew J Davison. Rlbench: The robot learning benchmark & learning environment.IEEE Robotics and Automation Letters, 5(2):3019–3026, 2020. 18

  44. [45]

    Vima: robot manipulation with multimodal prompts

    Yunfan Jiang, Agrim Gupta, Zichen Zhang, Guanzhi Wang, Yongqiang Dou, Yanjun Chen, Li Fei-Fei, Anima Anandkumar, Yuke Zhu, and Linxi Fan. Vima: robot manipulation with multimodal prompts. In Proceedings of the 40th International Conference on Machine Learning, pages 14975–15022, 2023. 5

  45. [46]

    Dexmimicgen: Automated data generation for bimanual dexterous manipulation via imitation learning

    Zhenyu Jiang, Yuqi Xie, Kevin Lin, Zhenjia Xu, Weikang Wan, Ajay Mandlekar, Linxi Fan, and Yuke Zhu. Dexmimicgen: Automated data generation for bimanual dexterous manipulation via imitation learning

  46. [47]

    Thinking, Fast and Slow

    Daniel Kahneman. Thinking, fast and slow. 2011. 2

  47. [48]

    Language-driven representation learning for robotics

    Siddharth Karamcheti, Suraj Nair, Annie S Chen, Thomas Kollar, Chelsea Finn, Dorsa Sadigh, and Percy Liang. Language-driven representation learning for robotics.arXiv preprint arXiv:2302.12766, 2023. 18

  48. [49]

    EgoMimic: Scaling imitation learning via egocentric video, 2024

    Simar Kareer, Dhruv Patel, Ryan Punamiya, Pranay Mathur, Shuo Cheng, Chen Wang, Judy Hoffman, and Danfei Xu. EgoMimic: Scaling imitation learning via egocentric video, 2024. 18

  49. [50]

    Alexander Khazatsky, Karl Pertsch, Suraj Nair, Ashwin Balakrishna, Sudeep Dasari, Siddharth Karamcheti, Soroush Nasiriany, Mohan Kumar Srirama, Lawrence Yunliang Chen, Kirsty Ellis, Peter David Fagan, Joey Hejna, Masha Itkina, Marion Lepert, Yecheng Jason Ma, Patrick Tree Miller, Jimmy Wu, Suneel Belkhale, Shivin Dass, Huy Ha, Arhan Jain, Abraham Lee, Youngwoon...

  50. [51]

    OpenVLA: An Open-Source Vision-Language-Action Model

    Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan Foster, Grace Lam, Pannag Sanketi, Quan Vuong, Thomas Kollar, Benjamin Burchfiel, Russ Tedrake, Dorsa Sadigh, Sergey Levine, Percy Liang, and Chelsea Finn. Openvla: An open-source vision-language-action model.arXiv preprint arXiv:2406.09246, 2024. 17

  51. [52]

    Vision-language foundation models as effective robot imitators.arXiv preprint arXiv:2311.01378, 2023

    Xinghang Li, Minghuan Liu, Hanbo Zhang, Cunjun Yu, Jie Xu, Hongtao Wu, Chilam Cheang, Ya Jing, Weinan Zhang, Huaping Liu, et al. Vision-language foundation models as effective robot imitators.arXiv preprint arXiv:2311.01378, 2023. 17

  52. [53]

    Eagle 2: Building post-training data strategies from scratch for frontier vision-language models

    Zhiqi Li, Guo Chen, Shilong Liu, Shihao Wang, Vibashan VS, Yishen Ji, Shiyi Lan, Hao Zhang, Yilin Zhao, Subhashree Radhakrishnan, et al. Eagle 2: Building post-training data strategies from scratch for frontier vision-language models.arXiv preprint arXiv:2501.14818, 2025. 3, 4, 17

  53. [54]

    Code as policies: Language model programs for embodied control

    Jacky Liang, Wenlong Huang, Fei Xia, Peng Xu, Karol Hausman, Brian Ichter, Pete Florence, and Andy Zeng. Code as policies: Language model programs for embodied control. In2023 IEEE International Conference on Robotics and Automation (ICRA), pages 9493–9500. IEEE, 2023. 17

  54. [55]

    Text2motion: from natural language instructions to feasible plans.Autonomous Robots, 2023

    Kevin Lin, Christopher Agia, Toki Migimatsu, Marco Pavone, and Jeannette Bohg. Text2motion: from natural language instructions to feasible plans.Autonomous Robots, 2023. 17

  55. [56]

    Stiv: Scalable text and image conditioned video generation.arXiv preprint arXiv:2412.07730, 2024

    Zongyu Lin, Wei Liu, Chen Chen, Jiasen Lu, Wenze Hu, Tsu-Jui Fu, Jesse Allardice, Zhengfeng Lai, Liangchen Song, Bowen Zhang, et al. Stiv: Scalable text and image conditioned video generation.arXiv preprint arXiv:2412.07730, 2024. 6

  56. [57]

    Flow matching for generative modeling

    Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maximilian Nickel, and Matthew Le. Flow matching for generative modeling. InThe Eleventh International Conference on Learning Representations. 3, 17

  57. [58]

    Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow

    Xingchao Liu, Chengyue Gong, and Qiang Liu. Flow straight and fast: Learning to generate and transfer data with rectified flow.arXiv preprint arXiv:2209.03003, 2022. 17

  58. [59]

    Hoi4d: A 4d egocentric dataset for category-level human-object interaction

    Yunze Liu, Yun Liu, Che Jiang, Kangbo Lyu, Weikang Wan, Hao Shen, Boqiang Liang, Zhoujie Fu, He Wang, and Li Yi. Hoi4d: A 4d egocentric dataset for category-level human-object interaction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 21013–21022, June 2022. 11

  59. [60]

    Interactive language: Talking to robots in real time, 2022

    Corey Lynch, Ayzaan Wahid, Jonathan Tompson, Tianli Ding, James Betker, Robert Baruch, Travis Armstrong, and Pete Florence. Interactive language: Talking to robots in real time, 2022. 9

  60. [61]

    Interactive language: Talking to robots in real time.IEEE Robotics and Automation Letters, 2023

    Corey Lynch, Ayzaan Wahid, Jonathan Tompson, Tianli Ding, James Betker, Robert Baruch, Travis Armstrong, and Pete Florence. Interactive language: Talking to robots in real time.IEEE Robotics and Automation Letters, 2023. 17

  61. [62]

    Cacti: A framework for scalable multi-task multi-scene visual imitation learning

    Zhao Mandi, Homanga Bharadhwaj, Vincent Moens, Shuran Song, Aravind Rajeswaran, and Vikash Kumar. Cacti: A framework for scalable multi-task multi-scene visual imitation learning.arXiv preprint arXiv:2212.05711, 2022. 18

  62. [63]

    RoboTurk: A Crowdsourcing Platform for Robotic Skill Learning through Imitation

    Ajay Mandlekar, Yuke Zhu, Animesh Garg, Jonathan Booher, Max Spero, Albert Tung, Julian Gao, John Emmons, Anchit Gupta, Emre Orbay, Silvio Savarese, and Li Fei-Fei. RoboTurk: A Crowdsourcing Platform for Robotic Skill Learning through Imitation. In Conference on Robot Learning, 2018. 17

  63. [64]

    Scaling robot supervision to hundreds of hours with roboturk: Robotic manipulation dataset through human reasoning and dexterity

    Ajay Mandlekar, Jonathan Booher, Max Spero, Albert Tung, Anchit Gupta, Yuke Zhu, Animesh Garg, Silvio Savarese, and Li Fei-Fei. Scaling robot supervision to hundreds of hours with roboturk: Robotic manipulation dataset through human reasoning and dexterity. In 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 1048–1055. IEEE,...

  64. [65]

    Human-in-the-loop imitation learning using remote teleoperation.arXiv preprint arXiv:2012.06733, 2020

    Ajay Mandlekar, Danfei Xu, Roberto Martín-Martín, Yuke Zhu, Li Fei-Fei, and Silvio Savarese. Human- in-the-loop imitation learning using remote teleoperation.arXiv preprint arXiv:2012.06733, 2020. 17

  65. [66]

    What matters in learning from offline human demonstrations for robot manipulation

    Ajay Mandlekar, Danfei Xu, Josiah Wong, Soroush Nasiriany, Chen Wang, Rohun Kulkarni, Li Fei-Fei, Silvio Savarese, Yuke Zhu, and Roberto Martín-Martín. What matters in learning from offline human demonstrations for robot manipulation. InConference on Robot Learning (CoRL), 2021. 14

  66. [67]

    Mimicgen: A data generation system for scalable robot learning using human demonstrations

    Ajay Mandlekar, Soroush Nasiriany, Bowen Wen, Iretiayo Akinola, Yashraj Narang, Linxi Fan, Yuke Zhu, and Dieter Fox. Mimicgen: A data generation system for scalable robot learning using human demonstrations. InConference on Robot Learning, 2023. 6, 12, 18

  67. [68]

    Howto100m: Learning a text-video embedding by watching hundred million narrated video clips

    Antoine Miech, Dimitri Zhukov, Jean-Baptiste Alayrac, Makarand Tapaswi, Ivan Laptev, and Josef Sivic. Howto100m: Learning a text-video embedding by watching hundred million narrated video clips. In Proceedings of the IEEE/CVF international conference on computer vision, pages 2630–2640, 2019. 18

  68. [69]

    Scaling Open-Vocabulary Object Detection

    Matthias Minderer, Alexey A. Gritsenko, and Neil Houlsby. Scaling open-vocabulary object detection. In Thirty-seventh Conference on Neural Information Processing Systems, 2023. 24

  69. [70]

    Ray: A distributed framework for emerging {AI} applications

    Philipp Moritz, Robert Nishihara, Stephanie Wang, Alexey Tumanov, Richard Liaw, Eric Liang, Melih Elibol, Zongheng Yang, William Paul, Michael I Jordan, et al. Ray: A distributed framework for emerging {AI} applications. In13th USENIX symposium on operating systems design and implementation (OSDI 18), pages 561–577, 2018. 8

  70. [71]

    R3M: A Universal Visual Representation for Robot Manipulation

    Suraj Nair, Aravind Rajeswaran, Vikash Kumar, Chelsea Finn, and Abhinav Gupta. R3m: A universal visual representation for robot manipulation.arXiv preprint arXiv:2203.12601, 2022. 18

  71. [72]

    Robocasa: Large-scale simulation of everyday tasks for generalist robots

    Soroush Nasiriany, Abhiram Maddukuri, Lance Zhang, Adeet Parikh, Aaron Lo, Abhishek Joshi, Ajay Mandlekar, and Yuke Zhu. Robocasa: Large-scale simulation of everyday tasks for generalist robots. In Robotics: Science and Systems (RSS), 2024. 10, 11, 12, 14, 18

  72. [73]

    Osmo platform, 2025

    NVIDIA. Osmo platform, 2025. URL https://developer.nvidia.com/osmo. Accessed: 2025-03-12. 8

  73. [74]

    Octo: An open-source generalist robot policy

    Octo Model Team, Dibya Ghosh, Homer Walke, Karl Pertsch, Kevin Black, Oier Mees, Sudeep Dasari, Joey Hejna, Charles Xu, Jianlan Luo, Tobias Kreiman, You Liang Tan, Lawrence Yunliang Chen, Pannag Sanketi, Quan Vuong, Ted Xiao, Dorsa Sadigh, Chelsea Finn, and Sergey Levine. Octo: An open-source generalist robot policy. InProceedings of Robotics: Science and...

  74. [75]

    Open X-Embodiment: Robotic learning datasets and RT-X models

    Open X-Embodiment Collaboration et al. Open X-Embodiment: Robotic learning datasets and RT-X models. International Conference on Robotics and Automation, 2024. 1, 9

  75. [76]

    Open x-embodiment: Robotic learning datasets and rt-x models: Open x-embodiment collaboration 0

    Abby O’Neill, Abdul Rehman, Abhiram Maddukuri, Abhishek Gupta, Abhishek Padalkar, Abraham Lee, Acorn Pooley, Agrim Gupta, Ajay Mandlekar, Ajinkya Jain, et al. Open x-embodiment: Robotic learning datasets and rt-x models: Open x-embodiment collaboration 0. In2024 IEEE International Conference on Robotics and Automation (ICRA), 2024. 17

  76. [77]

    Scalable diffusion models with transformers

    William Peebles and Saining Xie. Scalable diffusion models with transformers. In Proceedings of the IEEE/CVF international conference on computer vision, pages 4195–4205, 2023. 4

  77. [78]

    Motion tracks: A unified representation for human-robot transfer in few- shot imitation learning, 2025

    Juntao Ren, Priya Sundaresan, Dorsa Sadigh, Sanjiban Choudhury, and Jeannette Bohg. Motion tracks: A unified representation for human-robot transfer in few-shot imitation learning.arXiv preprint arXiv:2501.06994, 2025. 18

  78. [79]

    Videoworld: Exploring knowledge learning from unlabeled videos.arXiv preprint arXiv:2501.09781, 2025

    Zhongwei Ren, Yunchao Wei, Xun Guo, Yao Zhao, Bingyi Kang, Jiashi Feng, and Xiaojie Jin. Videoworld: Exploring knowledge learning from unlabeled videos.arXiv preprint arXiv:2501.09781, 2025. 6

  79. [80]

    Assembly101: A Large-Scale Multi-View Video Dataset for Understanding Procedural Activities

    F. Sener, D. Chatterjee, D. Shelepov, K. He, D. Singhania, R. Wang, and A. Yao. Assembly101: A large-scale multi-view video dataset for understanding procedural activities.CVPR 2022, 2022. 11

  80. [81]

    Legato: Cross-Embodiment Imitation Using a Grasping Tool

    Mingyo Seo, H. Andy Park, Shenli Yuan, Yuke Zhu, and Luis Sentis. Legato: Cross-embodiment imitation using a grasping tool. IEEE Robotics and Automation Letters (RA-L), 2025. 18

Showing first 80 references.