pith. machine review for the scientific record.

arxiv: 2405.12213 · v2 · submitted 2024-05-20 · 💻 cs.RO · cs.LG

Recognition: no theorem link

Octo: An Open-Source Generalist Robot Policy

Authors on Pith: no claims yet

Pith reviewed 2026-05-11 00:21 UTC · model grok-4.3

classification 💻 cs.RO cs.LG
keywords generalist robot policy · pretrained policies · finetuning · transformer model · robot manipulation · policy adaptation · open source model · diverse datasets

The pith

A large policy pretrained on hundreds of thousands of robot trajectories can be finetuned in hours to work with new sensors and actions on different robots.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tries to establish that pretraining a large policy on diverse robot manipulation data creates a reusable base model for many tasks and platforms. This base can then be adapted quickly to new robots that have different sensors and ways of moving, using only modest amounts of new data and standard hardware. A reader would care because this shifts robot learning away from training every application from scratch toward reusing and customizing a shared starting point. If correct, the result would make capable robot behaviors easier to develop and deploy across varied setups without large new data collections or long training runs.

Core claim

The paper establishes that a transformer-based policy trained on a large collection of robot manipulation trajectories can accept instructions through language or goal images, and can be finetuned to new observation and action spaces across nine robotic platforms, all within a few hours on standard consumer GPUs, providing a versatile initialization for generalist policies.

What carries the argument

A transformer architecture that processes sequences of observations and instructions to predict actions, serving as the reusable initialization for adaptation to new robot configurations.
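
As a sketch only (the paper's full model is larger and uses more elaborate tokenizers and action heads), the following PyTorch toy shows the data flow described here: observation and instruction tokens pass through a shared transformer backbone, and learned readout tokens feed a small action head that is cheap to re-initialize for a new robot. All module names and sizes are illustrative assumptions, not the paper's values.

    import torch
    import torch.nn as nn

    class TinyGeneralistPolicy(nn.Module):
        def __init__(self, d_model=256, n_heads=4, n_layers=4,
                     img_dim=512, instr_dim=384, action_dim=7, horizon=4):
            super().__init__()
            # Modality-specific projections map raw features into a shared token space.
            self.img_proj = nn.Linear(img_dim, d_model)      # e.g. per-patch image features
            self.instr_proj = nn.Linear(instr_dim, d_model)  # e.g. frozen language embedding
            # Learned readout tokens gather context for the action head.
            self.readout = nn.Parameter(torch.zeros(1, horizon, d_model))
            layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
            self.backbone = nn.TransformerEncoder(layer, n_layers)
            # The action head is small, so swapping it for a new action space is cheap.
            self.action_head = nn.Linear(d_model, action_dim)

        def forward(self, img_tokens, instr_tokens):
            b = img_tokens.shape[0]
            tokens = torch.cat([self.instr_proj(instr_tokens),
                                self.img_proj(img_tokens),
                                self.readout.expand(b, -1, -1)], dim=1)
            out = self.backbone(tokens)
            # Only the readout positions feed the action head.
            return self.action_head(out[:, -self.readout.shape[1]:])

    policy = TinyGeneralistPolicy()
    actions = policy(torch.randn(2, 16, 512), torch.randn(2, 8, 384))
    print(actions.shape)  # torch.Size([2, 4, 7])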

Load-bearing premise

That pretraining on varied robot data yields features general enough to transfer effectively to new hardware through limited finetuning, without requiring large amounts of additional data or extensive hyperparameter tuning.
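
A minimal sketch of what this premise means operationally, assuming (as an illustration, not from the paper) that adapting to a new robot replaces only light-weight input and output modules around a pretrained backbone. The backbone below is a toy stand-in for a real pretrained transformer checkpoint, and all names are hypothetical.

    import torch
    import torch.nn as nn

    d_model = 256
    # Toy stand-in for a pretrained backbone; in practice this would be a
    # transformer with weights restored from a checkpoint.
    backbone = nn.Sequential(nn.Linear(d_model, d_model), nn.GELU(),
                             nn.Linear(d_model, d_model))

    # New robot: different sensor dimensionality and action space, so only the
    # input projection and action head are freshly initialized.
    new_obs_proj = nn.Linear(64, d_model)    # hypothetical new sensor features
    new_action_head = nn.Linear(d_model, 6)  # hypothetical new action space

    params = (list(new_obs_proj.parameters()) + list(new_action_head.parameters())
              + list(backbone.parameters()))  # or freeze the backbone to save compute
    opt = torch.optim.AdamW(params, lr=3e-4)

    for step in range(100):  # "limited finetuning": a short budget on target data
        obs = torch.randn(32, 64)            # placeholder target-robot batch
        expert_action = torch.randn(32, 6)   # placeholder demonstrations
        pred = new_action_head(backbone(new_obs_proj(obs)))
        loss = nn.functional.mse_loss(pred, expert_action)
        opt.zero_grad()
        loss.backward()
        opt.step()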

What would settle it

An experiment on a new robot setup where finetuning the pretrained policy yields performance no better than training an identical model from random initialization on the same limited data would disprove the central result.
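
The shape of that control is simple to state in code. Below is a toy, matched-budget comparison on synthetic regression data that stands in for the real protocol; the training routine, the data, and the near-solution "pretrained" weights are illustrative assumptions, not the paper's experiment.

    import torch
    import torch.nn as nn

    # Shared target-robot data and training budget: the control described above.
    g = torch.Generator().manual_seed(0)
    x = torch.randn(256, 16, generator=g)
    w_true = torch.randn(16, 4, generator=g)
    y = x @ w_true

    def finetune(model, steps=50, lr=1e-2):
        opt = torch.optim.SGD(model.parameters(), lr=lr)
        for _ in range(steps):
            loss = nn.functional.mse_loss(model(x), y)
            opt.zero_grad()
            loss.backward()
            opt.step()
        return loss.item()

    scratch = nn.Linear(16, 4, bias=False)     # random initialization
    pretrained = nn.Linear(16, 4, bias=False)
    with torch.no_grad():                      # stand-in for loading a checkpoint
        pretrained.weight.copy_(w_true.T + 0.1 * torch.randn(4, 16, generator=g))

    print("from scratch:", finetune(scratch))
    print("pretrained  :", finetune(pretrained))
    # The claim predicts a clearly lower final loss for the pretrained start
    # under this matched budget; parity or worse would be the disconfirming outcome.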

original abstract

Large policies pretrained on diverse robot datasets have the potential to transform robotic learning: instead of training new policies from scratch, such generalist robot policies may be finetuned with only a little in-domain data, yet generalize broadly. However, to be widely applicable across a range of robotic learning scenarios, environments, and tasks, such policies need to handle diverse sensors and action spaces, accommodate a variety of commonly used robotic platforms, and finetune readily and efficiently to new domains. In this work, we aim to lay the groundwork for developing open-source, widely applicable, generalist policies for robotic manipulation. As a first step, we introduce Octo, a large transformer-based policy trained on 800k trajectories from the Open X-Embodiment dataset, the largest robot manipulation dataset to date. It can be instructed via language commands or goal images and can be effectively finetuned to robot setups with new sensory inputs and action spaces within a few hours on standard consumer GPUs. In experiments across 9 robotic platforms, we demonstrate that Octo serves as a versatile policy initialization that can be effectively finetuned to new observation and action spaces. We also perform detailed ablations of design decisions for the Octo model, from architecture to training data, to guide future research on building generalist robot models.

Editorial analysis

A structured set of objections, weighed in public.

Referee report, simulated author's rebuttal, circularity check, and an axiom and free-parameter ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces Octo, a large transformer-based policy pretrained on 800k trajectories from the Open X-Embodiment dataset. It claims that this model, which can be instructed via language commands or goal images, serves as a versatile initialization that can be effectively finetuned to new robotic platforms with different observation and action spaces, supported by experiments across 9 platforms and by ablations of architecture and data choices intended to guide future generalist robot model development.

Significance. If the central claim holds, Octo would provide a valuable open-source starting point for robot learning that reduces per-task data collection needs. The release of the model, code, and detailed ablations on design decisions (architecture, training data) are concrete strengths that can accelerate community progress on generalist policies.

major comments (2)
  1. [Experiments (across 9 platforms)] The experiments across 9 robotic platforms demonstrate successful finetuning, but the manuscript does not report results for an identical transformer architecture trained from random initialization on the same target data volume and compute budget. This comparison is needed to isolate whether the 800k-trajectory pretraining supplies transferable features or whether any sufficiently large model would succeed under the reported finetuning protocol.
  2. [Experiments and ablations] The weakest assumption, that modest finetuning on new sensors/actions yields reliable performance because of pretraining rather than because of hyperparameter choices or model scale, is not directly tested. Adding from-scratch baselines (or at minimum reporting the finetuning hyperparameter search effort and data volume per platform) would make the versatility claim well-supported rather than merely suggestive.
minor comments (1)
  1. [Abstract] The abstract and introduction use 'a few hours on standard consumer GPUs' without specifying exact GPU type, batch size, or number of steps; adding these details would improve reproducibility.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for their constructive feedback and for recognizing the potential impact of Octo as an open-source generalist robot policy. We address the major comments below, focusing on strengthening the experimental validation.

point-by-point responses
  1. Referee: [Experiments (across 9 platforms)] The experiments across 9 robotic platforms demonstrate successful finetuning, but the manuscript does not report results for an identical transformer architecture trained from random initialization on the same target data volume and compute budget. This comparison is needed to isolate whether the 800k-trajectory pretraining supplies transferable features or whether any sufficiently large model would succeed under the reported finetuning protocol.

    Authors: We concur that a from-scratch baseline with the same architecture, data volume, and compute would help isolate the contribution of pretraining. However, the scale of the model and the number of platforms make running these additional experiments infeasible with our current computational budget. Our ablations on architecture variants and subsets of the pretraining data provide indirect evidence for the benefits of large-scale pretraining. In the revised manuscript, we will include a dedicated limitations paragraph discussing this and report the exact finetuning data amounts and hyperparameter optimization details for each of the 9 platforms. revision: partial

  2. Referee: [Experiments and ablations] The weakest assumption, that modest finetuning on new sensors/actions yields reliable performance because of pretraining rather than because of hyperparameter choices or model scale, is not directly tested. Adding from-scratch baselines (or at minimum reporting the finetuning hyperparameter search effort and data volume per platform) would make the versatility claim well-supported rather than merely suggestive.

    Authors: We agree that to make the claims more robust, additional controls are valuable. We will update the manuscript to explicitly report the data volume used for finetuning on each platform and describe the extent of hyperparameter tuning performed. While we cannot add full from-scratch runs, we believe the combination of cross-platform results and ablations supports the utility of the pretrained model as a starting point. We will revise the text to emphasize these points more clearly. revision: partial

standing simulated objections (unresolved)
  • Full from-scratch baselines across all nine platforms under matched compute constraints

Circularity Check

0 steps flagged

No significant circularity; the empirical results are obtained independently of the model's training inputs.

full rationale

The paper's central claim is that a transformer policy pretrained on 800k Open X-Embodiment trajectories serves as a versatile initialization for finetuning to new observation and action spaces. This is supported by new experimental runs across 9 robotic platforms with held-out data, not by any derivation that reduces to the pretraining data or fitted parameters by construction. No equations, uniqueness theorems, or self-citations are used to derive performance claims; the work contains no mathematical derivation chain and reports direct empirical outcomes from finetuning protocols. The results therefore rest on external empirical evaluation rather than on self-referential derivation.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 0 invented entities

The central claim rests on standard transformer assumptions and large-scale supervised pretraining; many architectural and optimization choices are free parameters tuned to the robot data.

free parameters (2)
  • transformer architecture hyperparameters
    Number of layers, attention heads, embedding dimension, and context length chosen to fit the 800k-trajectory dataset.
  • pretraining and finetuning optimization settings
    Learning rate schedule, batch size, and number of finetuning steps selected for the reported results.
axioms (1)
  • domain assumption: A transformer can learn general robot control features from heterogeneous trajectory data collected across many embodiments.
    Invoked by the decision to pretrain a single model on the full Open X-Embodiment collection.
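
To make the ledger concrete, the free parameters above can be gathered into a single configuration record; the sketch below is illustrative, and every value is a placeholder rather than a number reported in the paper.

    from dataclasses import dataclass

    @dataclass
    class PolicyConfig:
        # Architecture free parameters named in the ledger (placeholder values).
        n_layers: int = 12
        n_heads: int = 8
        d_model: int = 768
        context_len: int = 10        # observation-history window, in timesteps
        # Pretraining/finetuning optimization free parameters (placeholder values).
        lr_schedule: str = "cosine"
        batch_size: int = 256
        finetune_steps: int = 50_000

    print(PolicyConfig())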

pith-pipeline@v0.9.0 · 5596 in / 1352 out tokens · 37072 ms · 2026-05-11T00:21:26.603830+00:00 · methodology


Forward citations

Cited by 60 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Open-H-Embodiment: A Large-Scale Dataset for Enabling Foundation Models in Medical Robotics

    cs.RO 2026-04 conditional novelty 8.0

    Open-H-Embodiment is the largest open multi-embodiment medical robotics dataset, used to train GR00T-H, the first open vision-language-action model that achieves end-to-end suturing completion where prior models fail.

  2. Test-time Sparsity for Extreme Fast Action Diffusion

    cs.CV 2026-05 unverdicted novelty 7.0

    Test-time sparsity with a parallel pipeline and omnidirectional feature reuse accelerates action diffusion by 5x to 47.5 Hz while cutting FLOPs 92% with no performance loss.

  3. MindVLA-U1: VLA Beats VA with Unified Streaming Architecture for Autonomous Driving

    cs.RO 2026-05 unverdicted novelty 7.0

    MindVLA-U1 introduces a unified streaming VLA with shared backbone, framewise memory, and language-guided action diffusion that surpasses human drivers on WOD-E2E planning metrics.

  4. Beyond World-Frame Action Heads: Motion-Centric Action Frames for Vision-Language-Action Models

    cs.AI 2026-05 unverdicted novelty 7.0

    MCF-Proto adds a motion-centric local action frame and prototype parameterization to VLA models, inducing emergent geometric structure and improved robustness from standard demonstrations alone.

  5. Overcoming Dynamics-Blindness: Training-Free Pace-and-Path Correction for VLA Models

    cs.RO 2026-05 unverdicted novelty 7.0

    Pace-and-Path Correction decomposes a quadratic cost minimization into orthogonal pace and path channels to correct chunked actions in VLA models, raising success rates by up to 28.8% in dynamic settings.

  6. VEGA: Visual Encoder Grounding Alignment for Spatially-Aware Vision-Language-Action Models

    cs.RO 2026-05 unverdicted novelty 7.0

    VEGA improves spatial reasoning in VLA models for robotics by aligning visual encoder features with 3D-supervised DINOv2 representations via a temporary projector and cosine similarity loss.

  7. Trust Region Inverse Reinforcement Learning: Explicit Dual Ascent using Local Policy Updates

    cs.LG 2026-05 unverdicted novelty 7.0

    TRIRL enables explicit dual-ascent IRL via trust-region local policy updates that guarantee monotonic improvement without full RL solves per iteration, outperforming prior imitation methods by 2.4x aggregate IQM and r...

  8. ECHO: Continuous Hierarchical Memory for Vision-Language-Action Models

    cs.RO 2026-05 unverdicted novelty 7.0

    ECHO organizes VLA experiences into a hierarchical memory tree in hyperbolic space via autoencoder and entailment constraints, delivering a 12.8% success-rate gain on LIBERO-Long over the pi0 baseline.

  9. One Token Per Frame: Reconsidering Visual Bandwidth in World Models for VLA Policy

    cs.CV 2026-05 conditional novelty 7.0

    Reducing visual input to one token per frame in VLA world models maintains or improves long-horizon performance on MetaWorld, LIBERO, and real-robot tasks.

  10. Learning to Communicate Locally for Large-Scale Multi-Agent Pathfinding

    cs.AI 2026-05 unverdicted novelty 7.0

    LC-MAPF uses multi-round local communication between neighboring agents in a pre-trained model to outperform prior learning-based MAPF solvers on diverse unseen scenarios while preserving scalability.

  11. OA-WAM: Object-Addressable World Action Model for Robust Robot Manipulation

    cs.RO 2026-05 unverdicted novelty 7.0

    OA-WAM uses persistent address vectors and dynamic content vectors in object slots to enable addressable world-action prediction, improving robustness on manipulation benchmarks under scene changes.

  12. ReflectDrive-2: Reinforcement-Learning-Aligned Self-Editing for Discrete Diffusion Driving

    cs.RO 2026-05 unverdicted novelty 7.0

    ReflectDrive-2 achieves 91.0 PDMS on NAVSIM with camera input by training a discrete diffusion model to self-edit trajectories via RL-aligned AutoEdit.

  13. Thinking in Text and Images: Interleaved Vision--Language Reasoning Traces for Long-Horizon Robot Manipulation

    cs.AI 2026-05 unverdicted novelty 7.0

    A multimodal transformer generates and caches interleaved text-image traces to guide closed-loop actions, achieving 92.4% success on LIBERO-Long and 95.5% average on LIBERO.

  14. Characterizing Vision-Language-Action Models across XPUs: Constraints and Acceleration for On-Robot Deployment

    cs.RO 2026-04 unverdicted novelty 7.0

    VLA models exhibit a compute-bound VLM phase followed by a memory-bound action phase on edge hardware; DP-Cache and V-AEFusion reduce redundancy and enable pipeline parallelism for up to 6x speedup on NPUs with margin...

  15. Mask World Model: Predicting What Matters for Robust Robot Policy Learning

    cs.RO 2026-04 unverdicted novelty 7.0

    Mask World Model predicts semantic mask dynamics with video diffusion and integrates it with a diffusion policy head, outperforming RGB world models on LIBERO and RLBench while showing better real-world generalization...

  16. π₀.₇: a Steerable Generalist Robotic Foundation Model with Emergent Capabilities

    cs.LG 2026-04 unverdicted novelty 7.0

    π₀.₇ is a steerable generalist robotic model that uses rich multimodal prompts including language, subgoal images, and performance metadata to achieve out-of-the-box generalization across tasks and robot bodies.

  17. STRONG-VLA: Decoupled Robustness Learning for Vision-Language-Action Models under Multimodal Perturbations

    cs.RO 2026-04 unverdicted novelty 7.0

    STRONG-VLA uses decoupled two-stage training to improve VLA model robustness, yielding up to 16% higher task success rates under seen and unseen perturbations on the LIBERO benchmark.

  18. ViVa: A Video-Generative Value Model for Robot Reinforcement Learning

    cs.RO 2026-04 unverdicted novelty 7.0

    ViVa turns a video generator into a value model for robot RL that jointly forecasts future states and task value, yielding better performance on real-world box assembly when integrated with RECAP.

  19. Action Images: End-to-End Policy Learning via Multiview Video Generation

    cs.CV 2026-04 unverdicted novelty 7.0

    Action Images turn robot arm motions into interpretable multiview pixel videos, letting video backbones serve as zero-shot policies for end-to-end robot learning.

  20. BiCoord: A Bimanual Manipulation Benchmark towards Long-Horizon Spatial-Temporal Coordination

    cs.RO 2026-04 conditional novelty 7.0

    BiCoord is a new benchmark for long-horizon tightly coordinated bimanual manipulation that includes quantitative metrics and shows existing policies like DP, RDT, Pi0 and OpenVLA-OFT struggle on such tasks.

  21. Diffusion Policy with Bayesian Expert Selection for Active Multi-Target Tracking

    cs.RO 2026-04 unverdicted novelty 7.0

    A Bayesian expert selection framework with variational Bayesian last layers and lower confidence bounds improves diffusion policies for active multi-target tracking.

  22. VP-VLA: Visual Prompting as an Interface for Vision-Language-Action Models

    cs.RO 2026-03 unverdicted novelty 7.0

    VP-VLA decouples high-level reasoning from low-level control in VLA models by rendering spatial anchors as visual prompts directly in the RGB observation space, outperforming end-to-end baselines.

  23. Towards Generalizable Robotic Manipulation in Dynamic Environments

    cs.CV 2026-03 unverdicted novelty 7.0

    DOMINO dataset and PUMA architecture enable better dynamic robotic manipulation by incorporating motion history, delivering 6.3% higher success rates than prior VLA models.

  24. FrameSkip: Learning from Fewer but More Informative Frames in VLA Training

    cs.RO 2026-05 unverdicted novelty 6.0

    FrameSkip improves VLA policy training success from 66.50% to 76.15% by selecting high-importance frames and retaining only 20% of unique frames across three benchmarks.

  25. Guide, Think, Act: Interactive Embodied Reasoning in Vision-Language-Action Models

    cs.RO 2026-05 conditional novelty 6.0

    GTA-VLA conditions VLA models on user spatial priors to produce a unified spatial-visual chain-of-thought, reaching 81.2% success on SimplerEnv WidowX and improving performance under out-of-distribution shifts.

  26. Towards Long-horizon Embodied Agents with Tool-Aligned Vision-Language-Action Models

    cs.RO 2026-05 unverdicted novelty 6.0

    VLAs-as-Tools pairs a VLM planner with specialized VLA executors via a new interface and Tool-Aligned Post-Training to raise long-horizon robot success rates on LIBERO-Long and RoboTwin benchmarks.

  27. MindVLA-U1: VLA Beats VA with Unified Streaming Architecture for Autonomous Driving

    cs.RO 2026-05 unverdicted novelty 6.0

    MindVLA-U1 is the first unified streaming VLA architecture that surpasses human drivers on WOD-E2E planning metrics while matching VA latency and preserving language interfaces.

  28. Overcoming Dynamics-Blindness: Training-Free Pace-and-Path Correction for VLA Models

    cs.RO 2026-05 unverdicted novelty 6.0

    Pace-and-Path Correction is a closed-form inference-time operator that decomposes a quadratic cost minimization into orthogonal pace compression and path offset channels to correct dynamics-blindness in chunked-action...

  29. HarmoWAM: Harmonizing Generalizable and Precise Manipulation via Adaptive World Action Models

    cs.RO 2026-05 unverdicted novelty 6.0

    HarmoWAM unifies predictive and reactive control in world action models via an adaptive gating mechanism to deliver improved zero-shot generalization and precision in robotic manipulation.

  30. PriorVLA: Prior-Preserving Adaptation for Vision-Language-Action Models

    cs.RO 2026-05 unverdicted novelty 6.0

    PriorVLA preserves pretrained priors in VLA models through a frozen Prior Expert and trained Adaptation Expert, delivering better robot manipulation performance than full fine-tuning with only 25% of the parameter updates.

  31. Unified Noise Steering for Efficient Human-Guided VLA Adaptation

    cs.RO 2026-05 unverdicted novelty 6.0

    UniSteer unifies human corrective actions and noise-space RL for VLA adaptation by inverting actions to noise targets, raising success rates from 20% to 90% in 66 minutes across four real-world manipulation tasks.

  32. Retrieve-then-Steer: Online Success Memory for Test-Time Adaptation of Generative VLAs

    cs.RO 2026-05 unverdicted novelty 6.0

    Retrieve-then-steer stores successful observation-action segments in memory, retrieves relevant chunks, filters them, and uses an elite prior with confidence-adaptive guidance to steer a flow-matching action sampler f...

  33. Retrieve-then-Steer: Online Success Memory for Test-Time Adaptation of Generative VLAs

    cs.RO 2026-05 unverdicted novelty 6.0

    A retrieve-then-steer method stores successful robot actions in memory and uses them to steer a frozen VLA's flow-matching sampler for better test-time reliability without parameter updates.

  34. Kintsugi: Learning Policies by Repairing Executable Knowledge Bases

    cs.LG 2026-05 unverdicted novelty 6.0

    Kintsugi learns policies by repairing composable executable knowledge bases through agentic diagnosis, localized typed edits, and deterministic verification gates that admit only improvements.

  35. One Token Per Frame: Reconsidering Visual Bandwidth in World Models for VLA Policy

    cs.CV 2026-05 unverdicted novelty 6.0

    Reducing visual input to one token per frame via adaptive attention pooling and a unified flow-matching objective improves long-horizon performance in VLA policies on MetaWorld, LIBERO, and real-robot tasks.

  36. One Token Per Frame: Reconsidering Visual Bandwidth in World Models for VLA Policy

    cs.CV 2026-05 unverdicted novelty 6.0

    Reducing visual input to one token per frame in world models for vision-language-action policies maintains long-horizon performance while improving success rates on MetaWorld, LIBERO, and real-robot tasks.

  37. Learning to Communicate Locally for Large-Scale Multi-Agent Pathfinding

    cs.AI 2026-05 unverdicted novelty 6.0

    LC-MAPF is a decentralized MAPF solver that uses a learnable multi-round communication module among nearby agents to outperform prior IL and RL methods while preserving scalability.

  38. Escaping the Diversity Trap in Robotic Manipulation via Anchor-Centric Adaptation

    cs.RO 2026-05 unverdicted novelty 6.0

    Anchor-Centric Adaptation escapes the diversity trap by prioritizing repeated demonstrations at core anchors over broad coverage, yielding higher success rates under fixed data budgets in robotic manipulation.

  39. Toward Visually Realistic Simulation: A Benchmark for Evaluating Robot Manipulation in Simulation

    cs.RO 2026-05 unverdicted novelty 6.0

    VISER is a new visually realistic simulation benchmark for robot manipulation tasks that uses PBR materials and MLLM-assisted asset generation, achieving 0.92 Pearson correlation with real-world policy performance.

  40. TriRelVLA: Triadic Relational Structure for Generalizable Embodied Manipulation

    cs.CV 2026-05 unverdicted novelty 6.0

    TriRelVLA introduces triadic object-hand-task relational representations and a task-grounded graph transformer with a relational bottleneck to improve generalization in robotic manipulation across scenes, objects, and tasks.

  41. When Life Gives You BC, Make Q-functions: Extracting Q-values from Behavior Cloning for On-Robot Reinforcement Learning

    cs.RO 2026-05 unverdicted novelty 6.0

    Q2RL extracts Q-functions from BC policies via minimal interactions and applies Q-gating to enable stable offline-to-online RL, outperforming baselines on manipulation benchmarks and achieving up to 100% success on-robot.

  42. ConsisVLA-4D: Advancing Spatiotemporal Consistency in Efficient 3D-Perception and 4D-Reasoning for Robotic Manipulation

    cs.RO 2026-05 unverdicted novelty 6.0

    ConsisVLA-4D adds cross-view semantic alignment, cross-object geometric fusion, and cross-scene dynamic reasoning to VLA models, delivering 21.6% and 41.5% gains plus 2.3x and 2.4x speedups on LIBERO and real-world tasks.

  43. ReflectDrive-2: Reinforcement-Learning-Aligned Self-Editing for Discrete Diffusion Driving

    cs.RO 2026-05 unverdicted novelty 6.0

    ReflectDrive-2 combines masked discrete diffusion with RL-aligned self-editing to generate and refine driving trajectories, reaching 91.0 PDMS on NAVSIM camera-only and 94.8 in best-of-6.

  44. An Efficient Metric for Data Quality Measurement in Imitation Learning

    cs.RO 2026-05 unverdicted novelty 6.0

    Power spectral density of trajectories ranks demonstration quality for imitation learning, enabling rollout-free curation that improves fine-tuned policy success.

  45. Decompose and Recompose: Reasoning New Skills from Existing Abilities for Cross-Task Robotic Manipulation

    cs.RO 2026-05 unverdicted novelty 6.0

    Decompose and Recompose decomposes seen robotic demonstrations into skill-action alignments and recomposes them via visual-semantic retrieval and planning to enable zero-shot cross-task generalization.

  46. VLA-ATTC: Adaptive Test-Time Compute for VLA Models with Relative Action Critic Model

    cs.RO 2026-05 unverdicted novelty 6.0

    VLA-ATTC equips VLA models with adaptive test-time compute via an uncertainty clutch and relative action critic, cutting failure rates by over 50% on LIBERO-LONG.

  47. Learning while Deploying: Fleet-Scale Reinforcement Learning for Generalist Robot Policies

    cs.RO 2026-05 unverdicted novelty 6.0

    Fleet-scale RL framework improves a single generalist VLA policy from deployment data to 95% average success on eight real-world manipulation tasks with 16 dual-arm robots.

  48. $M^2$-VLA: Boosting Vision-Language Models for Generalizable Manipulation via Layer Mixture and Meta-Skills

    cs.RO 2026-04 unverdicted novelty 6.0

    M²-VLA shows that generalized VLMs can serve as direct backbones for robotic manipulation by selectively extracting task-critical features via Mixture of Layers and adding Meta Skill Modules for efficient trajectory learning.

  49. AsyncShield: A Plug-and-Play Edge Adapter for Asynchronous Cloud-based VLA Navigation

    cs.RO 2026-04 unverdicted novelty 6.0

    AsyncShield restores VLA geometric intent from latency via kinematic pose mapping and uses PPO-Lagrangian to balance tracking with LiDAR safety constraints in a plug-and-play module.

  50. Breaking Lock-In: Preserving Steerability under Low-Data VLA Post-Training

    cs.RO 2026-04 unverdicted novelty 6.0

    DeLock mitigates lock-in in low-data VLA post-training via visual grounding preservation and test-time contrastive prompt guidance, outperforming baselines across eight evaluations while matching data-heavy generalist...

  51. GazeVLA: Learning Human Intention for Robotic Manipulation

    cs.RO 2026-04 unverdicted novelty 6.0

    GazeVLA pretrains on large human egocentric datasets to capture gaze-based intention, then finetunes on limited robot data with chain-of-thought reasoning to achieve better robotic manipulation performance than baselines.

  52. CorridorVLA: Explicit Spatial Constraints for Generative Action Heads via Sparse Anchors

    cs.RO 2026-04 unverdicted novelty 6.0

    CorridorVLA improves VLA models by using predicted sparse anchors to impose explicit spatial corridors on action trajectories, yielding 3.4-12.4% success rate gains on LIBERO-Plus with GR00T-Corr reaching 83.21%.

  53. Exploring High-Order Self-Similarity for Video Understanding

    cs.CV 2026-04 unverdicted novelty 6.0

    The MOSS module learns and combines multi-order space-time self-similarity features to enhance temporal dynamics modeling in videos across action recognition, VQA, and robotic tasks.

  54. Temporal Difference Calibration in Sequential Tasks: Application to Vision-Language-Action Models

    cs.RO 2026-04 unverdicted novelty 6.0

    Temporal difference calibration aligns uncertainty estimates in vision-language-action models with their value functions for better sequential performance.

  55. UniT: Toward a Unified Physical Language for Human-to-Humanoid Policy Learning and World Modeling

    cs.RO 2026-04 unverdicted novelty 6.0

    UniT creates a unified physical language via visual anchoring and tri-branch reconstruction to enable scalable human-to-humanoid transfer for policy learning and world modeling.

  56. HELM: Harness-Enhanced Long-horizon Memory for Vision-Language-Action Manipulation

    cs.LG 2026-04 unverdicted novelty 6.0

    HELM raises long-horizon VLA success from 58.4% to 81.5% on LIBERO-LONG by combining episodic memory retrieval, learned failure prediction, and replanning, outperforming context extension or adaptation alone.

  57. ST-π: Structured SpatioTemporal VLA for Robotic Manipulation

    cs.RO 2026-04 unverdicted novelty 6.0

    ST-π structures VLA models by having a spatiotemporal VLM produce causally ordered chunk-level prompts that guide a dual-generator action expert to jointly handle spatial and temporal control in robotic manipulation.

  58. OFlow: Injecting Object-Aware Temporal Flow Matching for Robust Robotic Manipulation

    cs.RO 2026-04 unverdicted novelty 6.0

    OFlow unifies temporal foresight and object-aware reasoning inside a shared latent space via flow matching to improve VLA robustness in robotic manipulation under distribution shifts.

  59. A Hierarchical Spatiotemporal Action Tokenizer for In-Context Imitation Learning in Robotics

    cs.RO 2026-04 unverdicted novelty 6.0

    A two-level hierarchical vector quantization tokenizer that clusters actions spatially and temporally achieves new state-of-the-art results in in-context imitation learning for robotics.

  60. Robotic Manipulation is Vision-to-Geometry Mapping (f(v) → G): Vision-Geometry Backbones over Language and Video Models

    cs.RO 2026-04 unverdicted novelty 6.0

    Vision-geometry backbones using pretrained 3D world models outperform vision-language and video models for robotic manipulation by enabling direct mapping from visual input to geometric actions.
