
Octo: An Open-Source Generalist Robot Policy

Dibya Ghosh, Homer Walke, Karl Pertsch, Kevin Black, Octo Model Team, Oier Mees · 2024 · cs.RO · arXiv 2405.12213

Mixed citation behavior. Most common role is background (68%).

222 Pith papers citing it



abstract

Large policies pretrained on diverse robot datasets have the potential to transform robotic learning: instead of training new policies from scratch, such generalist robot policies may be finetuned with only a little in-domain data, yet generalize broadly. However, to be widely applicable across a range of robotic learning scenarios, environments, and tasks, such policies need to handle diverse sensors and action spaces, accommodate a variety of commonly used robotic platforms, and finetune readily and efficiently to new domains. In this work, we aim to lay the groundwork for developing open-source, widely applicable, generalist policies for robotic manipulation. As a first step, we introduce Octo, a large transformer-based policy trained on 800k trajectories from the Open X-Embodiment dataset, the largest robot manipulation dataset to date. It can be instructed via language commands or goal images and can be effectively finetuned to robot setups with new sensory inputs and action spaces within a few hours on standard consumer GPUs. In experiments across 9 robotic platforms, we demonstrate that Octo serves as a versatile policy initialization that can be effectively finetuned to new observation and action spaces. We also perform detailed ablations of design decisions for the Octo model, from architecture to training data, to guide future research on building generalist robot models.


citation-role summary

background 36 · baseline 14 · dataset 2 · method 1
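The 68% background share quoted at the top of the page can be checked against these role counts, assuming the percentage is taken over the classified citations only (the four counts listed here) rather than over all 222 citing papers. A minimal sketch; the variable names are illustrative and not part of the hub's data format:

```python
# Role counts as listed in the citation-role summary.
role_counts = {"background": 36, "baseline": 14, "dataset": 2, "method": 1}

# Number of classified citations (assumed denominator for the percentages).
total = sum(role_counts.values())

# Share of each role among classified citations, in percent.
shares = {role: 100 * n / total for role, n in role_counts.items()}

for role, pct in shares.items():
    print(f"{role:<10} {role_counts[role]:>3}  {pct:5.1f}%")
```

Under that assumption the counts sum to 53 classified citations, and the background share rounds to the 68% stated in the page header.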

citation-polarity summary

background 36 · baseline 15 · use dataset 2


authors

Dibya Ghosh, Homer Walke, Karl Pertsch, Kevin Black, Octo Model Team, Oier Mees


representative citing papers

HABIT: Human-Aware Behavior and Interaction Training Dataset for Robot Manipulation

cs.RO · 2026-06-30 · unverdicted · novelty 8.0

HABIT is a large-scale robot demonstration dataset for human-present environments that elicits spatiotemporal synchronization, yielding, and gesture grounding behaviors absent from robot-only training data.

Test-time Adversarial Takeover: A Real-time Hijacking Interface against Robotic Diffusion Policies

cs.RO · 2026-06-09 · unverdicted · novelty 8.0

TAKO demonstrates real-time adversarial takeover of robotic diffusion policies via reusable universal patches on visual inputs, achieving 100% success in steering attacker-chosen trajectories across multiple tasks, encoders, and diffusion methods.

Adapting Generalist Robot Policies with Semantic Reinforcement Learning

cs.RO · 2026-06-30 · unverdicted · novelty 7.0

SARL optimizes language prompt inputs to generalist vision-language-action policies through online RL to solve complex long-horizon tasks by composing existing skills.

Pondering the Way: Spatial-perceiving World Action Model for Embodied Navigation

cs.RO · 2026-06-29 · unverdicted · novelty 7.0

SWAM jointly generates intermediate RGB-D sequences and action trajectories from monocular RGB start/goal observations for embodied navigation.

WARP-RM: A Warp-Augmented Relative Progress Reward Model for Data Curation

cs.RO · 2026-06-26 · unverdicted · novelty 7.0

WARP trains a reward model on time-warped successful demonstrations to produce frame-level progress estimates that upweight high-advantage chunks during behavior cloning, maintaining high success rates on suboptimal datasets where vanilla BC fails.

ForesightSafety-VLA: A Unified Diagnostic Safety Benchmark for Vision-Language-Action Models

cs.RO · 2026-06-25 · unverdicted · novelty 7.0

ForesightSafety-VLA creates a diagnostic benchmark for VLA safety with taxonomy across physical, language, and visual risks, showing perception and structure variations cause more safety degradation than language changes in tested models.

LIBERO-Occ: Evaluating and Improving Vision-Language-Action Models under Scene-Induced Occlusion via Viewpoint Imagination

cs.CV · 2026-06-09 · unverdicted · novelty 7.0

Introduces LIBERO-Occ benchmark showing VLA performance drop under occlusion and Viewpoint Imagination method that generates complementary views to improve robustness without extra hardware.

UMI-Bench 1.0: An Open and Reproducible Real-World Benchmark for Tabletop Robotic Manipulation with UMI Data

cs.RO · 2026-06-09 · unverdicted · novelty 7.0

UMI-Bench 1.0 is presented as the first open benchmark dedicated to reproducible real-world evaluation of Universal Manipulation Interface policies.

ProbeAct: Probe-Guided Training-Free Failure Recovery in Vision-Language-Action Models

cs.RO · 2026-06-08 · unverdicted · novelty 7.0

PROBEACT is a plug-and-play intervention framework that combines hidden-state probing, kinematic failure detection, and CBF-based correction to boost success rates of pre-trained VLA models on the LIBERO-plus benchmark from 69.6% to 74.1%.

ReCoVLA: VLM-Guided Reward Compilation for Failure Recovery in Vision-Language-Action Policies

cs.RO · 2026-06-08 · unverdicted · novelty 7.0

ReCoVLA improves VLA policy reliability by using a VLM as a semantic reward selector to train residual recovery policies in simulation, raising average success from 36.7% to 66.7% in sim and achieving 61.7% in zero-shot sim-to-real physical tests.

Targeting World Models to Compromise Robot Learning Pipelines

cs.RO · 2026-06-08 · unverdicted · novelty 7.0

World models introduce a stealthy poisoning vector into robot learning pipelines where malicious prompts or dynamics in teleoperated data activate only during synthetic trajectory generation, enabling backdoors in downstream policies.

Back to the Familiar Future: Failure Recovery for VLA Policies via Pre-Imagined Milestone Selection

cs.RO · 2026-06-08 · unverdicted · novelty 7.0

B2FF pre-generates a milestone bank of familiar future states from the clean initial observation and uses a recoverability-aware selector to guide VLA policies back from deviations, raising average success rate from 56.3% to 74.0% on failure-injected LIBERO.

Q-VGM: Q-Guided Value-Gradient Matching for Flow-Matching VLA Policies

cs.RO · 2026-06-06 · unverdicted · novelty 7.0

Q-VGM introduces value-gradient matching via VGG-Flow to improve flow-matching VLA policies with a Cal-QL critic, achieving success rate lifts on LIBERO, RoboTwin, and real-robot tasks.

NextMotionQA: Benchmarking and Judging Human Motion Understanding with Vision-Language Models

cs.CV · 2026-06-03 · unverdicted · novelty 7.0

NextMotionQA benchmark reveals VLMs have critical gaps in fine-grained human motion understanding and align with experts on coarse judgment (κ=0.70) but not fine-grained (κ=0.10).

Denoising Tells When to Replan: Denoising-Variance Adaptive Chunking for Flow-Based Robot Policies

cs.RO · 2026-06-02 · unverdicted · novelty 7.0

DVAC uses denoising variance as an intrinsic signal to adaptively chunk actions in flow-based robot policies, improving success rates and cutting replans on LIBERO, RoboTwin, CALVIN, and real-world tasks.

Same Weights, Different Robot: A Deployment Safety View of VLA Policies

cs.CR · 2026-06-02 · unverdicted · novelty 7.0

The paper identifies a deployment safety gap in VLA policies where identical checkpoints can be executable-inequivalent due to action metadata mismatches, supported by a derived closed-form transform and empirical drift measurements on LIBERO benchmarks.

BOKBO (Best of K Bad Options): Calibrated Abstention for VLA Policies

cs.LG · 2026-05-28 · unverdicted · novelty 7.0

BOKBO is the first conformal abstention method for K-sample VLA policies that supplies finite-sample distribution-free guarantees on executed violation rates, with global and Mondrian per-task variants.

MiraBench: Evaluating Action-Conditioned Reliability in Robotic World Models

cs.AI · 2026-05-28 · unverdicted · novelty 7.0

MiraBench defines action-conditioned reliability via three levels (physics adherence, action-following fidelity, optimism bias detection) and applies it to 12 model configurations using a 16,000-judgment human corpus, finding visual fidelity a poor proxy for action fidelity, no reliable scale benefi

Point Tracking Improves World Action Models

cs.RO · 2026-05-22 · unverdicted · novelty 7.0

JOPAT jointly models pixels, point tracks, and actions in a diffusion transformer and reports gains over pixel-only baselines on long-horizon robot tasks with occlusion and off-screen motion.

Understanding Multimodal Failure in Action-Chunking Behavioral Cloning

cs.LG · 2026-05-21 · unverdicted · novelty 7.0

The paper identifies distinct failure mechanisms: excessive posterior-prior regularization erases mode information in latent policies, while smooth base-to-action maps limit mode coverage in generative policies.

EvoScene-VLA: Evolving Scene Beliefs Inside the Action Decoder for Chunked Robot Control

cs.RO · 2026-05-21 · conditional · novelty 7.0

EvoScene-VLA maintains an action-updated scene prior across control chunks in VLA policies, raising success rates on RoboTwin tasks from 87.2% to 89.1% fixed and 86.1% to 88.5% randomized while outperforming baselines on a real robot.

DISC: Decoupling Instruction from State-Conditioned Control via Policy Generation

cs.RO · 2026-05-20 · unverdicted · novelty 7.0

A hypernetwork generates complete task-specific visuomotor policy parameters from instructions alone to structurally eliminate observation leakage in language-conditioned robotic control.

Beyond Binary Success: A Diagnostic Meta-Evaluation Framework for Fine-Grained Manipulation

cs.RO · 2026-05-19 · unverdicted · novelty 7.0

MetaFine reconstructs benchmarks into diagnostic scenarios to evaluate vision-language-action models on fine-grained manipulation, exposing dimension-specific failures and identifying the visual encoder as a key bottleneck.

RoboFlow4D: A Lightweight Flow World Model Toward Real-Time Flow-Guided Robotic Manipulation

cs.RO · 2026-05-17 · unverdicted · novelty 7.0

RoboFlow4D is an end-to-end lightweight flow world model that predicts multi-frame 3D flows from visual observations and textual instructions to provide explicit planning for real-time robotic manipulation.

citing papers explorer

Showing 22 of 222 citing papers.

Benchmarking Vision-Language-Action Models on SO-101: Failure and Recovery Analysis

cs.RO · 2026-06-07 · unverdicted · none · ref 10 · internal anchor

Introduces SO-101 benchmark for VLA and imitation learning policies on four tasks, showing pretrained VLAs outperform baseline but with high task dependence and execution instability as main failure mode.

Unified Video-Action Joint Denoising for Dexterous Action and Data Generation

cs.CV · 2026-06-02 · unverdicted · none · ref 2 · internal anchor

Donk is a unified video-action denoising model that generates dexterous hand trajectories and videos under language, image, and state conditioning while also serving as a text-conditioned data engine.

WALL-WM: Carving World Action Modeling at the Event Joints

cs.RO · 2026-06-01 · unverdicted · none · ref 70 · internal anchor

WALL-WM introduces event-grounded Vision-Language-Action pretraining that uses semantic events as the atomic unit to address granularity mismatch in world action models and reports state-of-the-art generalization.

GE-Sim 2.0: A Roadmap Towards Comprehensive Closed-loop Video World Simulators for Robotic Manipulation

cs.RO · 2026-05-26 · unverdicted · none · ref 23 · internal anchor

GE-Sim 2.0 is a video-based closed-loop simulator for robotic manipulation that adds state expert, world judge, and acceleration modules on top of prior video generation to support policy learning and evaluation.

VR-DAgger: Immersive VR for Dexterous Data Collection and Uncertainty-Guided On-Policy Correction

cs.RO · 2026-05-26 · unverdicted · none · ref 2 · internal anchor

VR-DAgger is a VR-centered human-in-the-loop framework that applies MC dropout uncertainty to select and correct failure segments in diffusion policy rollouts, yielding up to 23 percentage point gains over behavioral cloning and 40% lower per-sample collection time on three dexterous tasks.

Can Predicted Dynamics Exist in the Physical World?

cs.RO · 2026-05-23 · unverdicted · none · ref 9 · internal anchor

Physical admissibility is defined as a prediction-control interface using kinematic, dynamic, and composed-horizon conditions to reject invalid dynamics proposals, with AUC 0.957 on LeRobot PushT and 87-89% prevention of invalid actions in interventions.

$\pi_0$-EqM: Equilibrium Matching for Closed-Loop Vision-Language-Action Control

cs.RO · 2026-05-22 · unverdicted · none · ref 3 · internal anchor

Equilibrium Matching decoder substitution in π₀ improves RoboTwin success from 40.4% to 50.2% across 19 tasks and reaches 87.0% on LIBERO-10.

Towards Robotic Dexterous Hand Intelligence: A Survey

cs.RO · 2026-05-13 · unverdicted · none · ref 110 · internal anchor

A structured survey of dexterous robotic hand research that reviews hardware, control methods, data resources, and benchmarks while identifying major limitations and future directions.

AttenA+: Rectifying Action Inequality in Robotic Foundation Models

cs.RO · 2026-05-13 · unverdicted · none · ref 6 · 2 links · internal anchor

AttenA+ reweights action training objectives in VLA and WAM models via inverse velocity attention to prioritize kinematically critical segments, yielding small benchmark gains.

Vision-Language-Action Safety: Threats, Challenges, Evaluations, and Mechanisms

cs.RO · 2026-04-26 · accept · none · ref 64 · internal anchor

A literature survey that unifies fragmented work on attacks, defenses, evaluations, and deployment challenges for Vision-Language-Action models in robotics.

A Co-Evolutionary Theory of Human-AI Coexistence: Mutualism, Governance, and Dynamics in Complex Societies

cs.CY · 2026-04-24 · unverdicted · none · ref 29 · internal anchor

Human-AI coexistence is best modeled as conditional mutualism under governance, formalized as a multiplex dynamical system whose simulations show stable high-coexistence equilibria only under balanced institutional oversight.

Aligning Perception, Reasoning, Modeling and Interaction: A Survey on Physical AI

cs.AI · 2025-10-06 · unverdicted · none · ref 227 · internal anchor

A survey of physical AI that distinguishes theoretical physics reasoning from applied understanding and synthesizes advances in symbolic reasoning, embodied systems, and generative models to advocate for physics-grounded world models.

Vision-Language-Action Models: Experimental Insights from a Real-World UR5 Platform

cs.RO · 2026-06-29 · unverdicted · none · ref 8 · internal anchor

Real-robot trials with OpenVLA on a UR5e arm show consistent offline-to-closed-loop gaps driven by action semantics, coordinate conventions, temporal alignment, image preprocessing, and dataset quality rather than model capacity.

Building a Scalable, Reproducible, Evaluatable, and Closed-Loop Simulation Environment Foundation for Embodied Intelligence

cs.RO · 2026-06-26 · unverdicted · none · ref 4 · 2 links · internal anchor

Presents a four-layer cloud-native framework for scalable, reproducible simulation-based training and evaluation in embodied AI.

Trust Region Inverse Reinforcement Learning: Explicit Dual Ascent using Local Policy Updates

cs.LG · 2026-05-10 · unreviewed · ref 9 · internal anchor

Open-H-Embodiment: A Large-Scale Dataset for Enabling Foundation Models in Medical Robotics

cs.RO · 2026-04-22 · unreviewed · ref 16 · internal anchor

A Hierarchical Spatiotemporal Action Tokenizer for In-Context Imitation Learning in Robotics

cs.RO · 2026-04-16 · unreviewed · ref 2 · 2 links · internal anchor

Vision-Language-Action Jump-Starting for Reinforcement Learning Robotic Agents

cs.LG · 2026-04-15 · unreviewed · ref 29 · internal anchor

Hierarchical Planning with Latent World Models

cs.LG · 2026-04-03 · unreviewed · ref 46 · internal anchor

Towards Generalizable Robotic Manipulation in Dynamic Environments

cs.CV · 2026-03-16 · unreviewed · ref 50 · internal anchor

Learning to Feel the Future: DreamTacVLA for Contact-Rich Manipulation

cs.RO · 2025-12-29 · unreviewed · ref 27 · internal anchor

LIBERO-PRO: Towards Robust and Fair Evaluation of Vision-Language-Action Models Beyond Memorization

cs.CV · 2025-10-04 · unreviewed · ref 22 · internal anchor
