super hub Mixed citations

Octo: An Open-Source Generalist Robot Policy

Dibya Ghosh, Homer Walke, Karl Pertsch, Kevin Black, Octo Model Team, Oier Mees · 2024 · cs.RO · arXiv 2405.12213

Mixed citation behavior. Most common role is background (68%).

244 Pith papers citing it

Background 68% of classified citations

open full Pith review browse 244 citing papers more from Dibya Ghosh arXiv PDF

abstract

Large policies pretrained on diverse robot datasets have the potential to transform robotic learning: instead of training new policies from scratch, such generalist robot policies may be finetuned with only a little in-domain data, yet generalize broadly. However, to be widely applicable across a range of robotic learning scenarios, environments, and tasks, such policies need to handle diverse sensors and action spaces, accommodate a variety of commonly used robotic platforms, and finetune readily and efficiently to new domains. In this work, we aim to lay the groundwork for developing open-source, widely applicable, generalist policies for robotic manipulation. As a first step, we introduce Octo, a large transformer-based policy trained on 800k trajectories from the Open X-Embodiment dataset, the largest robot manipulation dataset to date. It can be instructed via language commands or goal images and can be effectively finetuned to robot setups with new sensory inputs and action spaces within a few hours on standard consumer GPUs. In experiments across 9 robotic platforms, we demonstrate that Octo serves as a versatile policy initialization that can be effectively finetuned to new observation and action spaces. We also perform detailed ablations of design decisions for the Octo model, from architecture to training data, to guide future research on building generalist robot models.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 36 baseline 14 dataset 2 method 1

citation-polarity summary

background 36 baseline 15 use dataset 2

claims ledger

abstract Large policies pretrained on diverse robot datasets have the potential to transform robotic learning: instead of training new policies from scratch, such generalist robot policies may be finetuned with only a little in-domain data, yet generalize broadly. However, to be widely applicable across a range of robotic learning scenarios, environments, and tasks, such policies need to handle diverse sensors and action spaces, accommodate a variety of commonly used robotic platforms, and finetune readily and efficiently to new domains. In this work, we aim to lay the groundwork for developing open-so

authors

Dibya Ghosh Homer Walke Karl Pertsch Kevin Black Octo Model Team Oier Mees

co-cited works

representative citing papers

HABIT: Human-Aware Behavior and Interaction Training Dataset for Robot Manipulation

cs.RO · 2026-06-30 · unverdicted · novelty 8.0

HABIT is a large-scale robot demonstration dataset for human-present environments that elicits spatiotemporal synchronization, yielding, and gesture grounding behaviors absent from robot-only training data.

Test-time Adversarial Takeover: A Real-time Hijacking Interface against Robotic Diffusion Policies

cs.RO · 2026-06-09 · unverdicted · novelty 8.0

TAKO demonstrates real-time adversarial takeover of robotic diffusion policies via reusable universal patches on visual inputs, achieving 100% success in steering attacker-chosen trajectories across multiple tasks, encoders, and diffusion methods.

Embodied.cpp: A Portable Inference Runtime of Embodied AI Models on Heterogeneous Robots

cs.RO · 2026-07-02 · unverdicted · novelty 7.0

Embodied.cpp introduces a portable C++ inference runtime with modular layers for deploying VLA and WAM models on heterogeneous robots, reporting 100% and 91% task success on two models plus memory reduction on a WAM benchmark.

Adapting Generalist Robot Policies with Semantic Reinforcement Learning

cs.RO · 2026-06-30 · unverdicted · novelty 7.0

SARL optimizes language prompt inputs to generalist vision-language-action policies through online RL to solve complex long-horizon tasks by composing existing skills.

Pondering the Way: Spatial-perceiving World Action Model for Embodied Navigation

cs.RO · 2026-06-29 · unverdicted · novelty 7.0

SWAM jointly generates intermediate RGB-D sequences and action trajectories from monocular RGB start/goal observations for embodied navigation.

WARP-RM: A Warp-Augmented Relative Progress Reward Model for Data Curation

cs.RO · 2026-06-26 · unverdicted · novelty 7.0

WARP trains a reward model on time-warped successful demonstrations to produce frame-level progress estimates that upweight high-advantage chunks during behavior cloning, maintaining high success rates on suboptimal datasets where vanilla BC fails.

ForesightSafety-VLA: A Unified Diagnostic Safety Benchmark for Vision-Language-Action Models

cs.RO · 2026-06-25 · unverdicted · novelty 7.0

ForesightSafety-VLA creates a diagnostic benchmark for VLA safety with taxonomy across physical, language, and visual risks, showing perception and structure variations cause more safety degradation than language changes in tested models.

Mix-QVLA: Task-Evidence-Aware Mixed-Precision Quantization of Vision-Language-Action Models

cs.CV · 2026-06-17 · unverdicted · novelty 7.0

Mix-QVLA is a task-evidence-aware mixed-precision PTQ framework for VLA models that preserves task-relevant evidence via evidence-mass and attribution-distribution metrics to guide bit allocation under memory and BitOps constraints.

Space Is Intelligence: Neural Semigroup Superposition for Riemannian Metric Generation

cs.RO · 2026-06-17 · unverdicted · novelty 7.0

A single Encoder-Router network uses semigroup superposition of frame, modulation, and coefficient parameters to produce a scene-specific Riemannian metric field that supports zero-shot geodesic planning after training on one two-obstacle scene.

EBench: Elemental Diagnosis of Generalist Mobile Manipulation Policies

cs.RO · 2026-06-16 · unverdicted · novelty 7.0

EBench is a benchmark that evaluates generalist mobile manipulation policies on 26 tasks across 5 capability and 4 generalization dimensions, revealing distinct capability profiles among models with similar success rates.

Trajectory-Level Redirection Attacks on Vision-Language-Action Models

cs.RO · 2026-06-11 · unverdicted · novelty 7.0

A prompt-only attack called command-preserving trajectory redirection can steer VLA robot behavior to attacker-chosen physical outcomes while the text still appears to match the intended task.

LIBERO-Occ: Evaluating and Improving Vision-Language-Action Models under Scene-Induced Occlusion via Viewpoint Imagination

cs.CV · 2026-06-09 · unverdicted · novelty 7.0

Introduces LIBERO-Occ benchmark showing VLA performance drop under occlusion and Viewpoint Imagination method that generates complementary views to improve robustness without extra hardware.

UMI-Bench 1.0: An Open and Reproducible Real-World Benchmark for Tabletop Robotic Manipulation with UMI Data

cs.RO · 2026-06-09 · unverdicted · novelty 7.0

UMI-Bench 1.0 is presented as the first open benchmark dedicated to reproducible real-world evaluation of Universal Manipulation Interface policies.

ProbeAct: Probe-Guided Training-Free Failure Recovery in Vision-Language-Action Models

cs.RO · 2026-06-08 · unverdicted · novelty 7.0

PROBEACT is a plug-and-play intervention framework that combines hidden-state probing, kinematic failure detection, and CBF-based correction to boost success rates of pre-trained VLA models on the LIBERO-plus benchmark from 69.6% to 74.1%.

ReCoVLA: VLM-Guided Reward Compilation for Failure Recovery in Vision-Language-Action Policies

cs.RO · 2026-06-08 · unverdicted · novelty 7.0

ReCoVLA improves VLA policy reliability by using a VLM as a semantic reward selector to train residual recovery policies in simulation, raising average success from 36.7% to 66.7% in sim and achieving 61.7% in zero-shot sim-to-real physical tests.

Targeting World Models to Compromise Robot Learning Pipelines

cs.RO · 2026-06-08 · unverdicted · novelty 7.0

World models introduce a stealthy poisoning vector into robot learning pipelines where malicious prompts or dynamics in teleoperated data activate only during synthetic trajectory generation, enabling backdoors in downstream policies.

Back to the Familiar Future: Failure Recovery for VLA Policies via Pre-Imagined Milestone Selection

cs.RO · 2026-06-08 · unverdicted · novelty 7.0

B2FF pre-generates a milestone bank of familiar future states from the clean initial observation and uses a recoverability-aware selector to guide VLA policies back from deviations, raising average success rate from 56.3% to 74.0% on failure-injected LIBERO.

Q-VGM: Q-Guided Value-Gradient Matching for Flow-Matching VLA Policies

cs.RO · 2026-06-06 · unverdicted · novelty 7.0

Q-VGM introduces value-gradient matching via VGG-Flow to improve flow-matching VLA policies with a Cal-QL critic, achieving success rate lifts on LIBERO, RoboTwin, and real-robot tasks.

NextMotionQA: Benchmarking and Judging Human Motion Understanding with Vision-Language Models

cs.CV · 2026-06-03 · unverdicted · novelty 7.0

NextMotionQA benchmark reveals VLMs have critical gaps in fine-grained human motion understanding and align with experts on coarse judgment (κ=0.70) but not fine-grained (κ=0.10).

Denoising Tells When to Replan: Denoising-Variance Adaptive Chunking for Flow-Based Robot Policies

cs.RO · 2026-06-02 · unverdicted · novelty 7.0

DVAC uses denoising variance as an intrinsic signal to adaptively chunk actions in flow-based robot policies, improving success rates and cutting replans on LIBERO, RoboTwin, CALVIN, and real-world tasks.

Same Weights, Different Robot: A Deployment Safety View of VLA Policies

cs.CR · 2026-06-02 · unverdicted · novelty 7.0

The paper identifies a deployment safety gap in VLA policies where identical checkpoints can be executable-inequivalent due to action metadata mismatches, supported by a derived closed-form transform and empirical drift measurements on LIBERO benchmarks.

BOKBO (Best of K Bad Options): Calibrated Abstention for VLA Policies

cs.LG · 2026-05-28 · unverdicted · novelty 7.0

BOKBO is the first conformal abstention method for K-sample VLA policies that supplies finite-sample distribution-free guarantees on executed violation rates, with global and Mondrian per-task variants.

MiraBench: Evaluating Action-Conditioned Reliability in Robotic World Models

cs.AI · 2026-05-28 · unverdicted · novelty 7.0

MiraBench defines action-conditioned reliability via three levels (physics adherence, action-following fidelity, optimism bias detection) and applies it to 12 model configurations using a 16,000-judgment human corpus, finding visual fidelity a poor proxy for action fidelity, no reliable scale benefi

Point Tracking Improves World Action Models

cs.RO · 2026-05-22 · unverdicted · novelty 7.0

JOPAT jointly models pixels, point tracks, and actions in a diffusion transformer and reports gains over pixel-only baselines on long-horizon robot tasks with occlusion and off-screen motion.

citing papers explorer

Showing 50 of 244 citing papers.

HABIT: Human-Aware Behavior and Interaction Training Dataset for Robot Manipulation cs.RO · 2026-06-30 · unverdicted · none · ref 17 · internal anchor
HABIT is a large-scale robot demonstration dataset for human-present environments that elicits spatiotemporal synchronization, yielding, and gesture grounding behaviors absent from robot-only training data.
Test-time Adversarial Takeover: A Real-time Hijacking Interface against Robotic Diffusion Policies cs.RO · 2026-06-09 · unverdicted · none · ref 15 · internal anchor
TAKO demonstrates real-time adversarial takeover of robotic diffusion policies via reusable universal patches on visual inputs, achieving 100% success in steering attacker-chosen trajectories across multiple tasks, encoders, and diffusion methods.
Embodied.cpp: A Portable Inference Runtime of Embodied AI Models on Heterogeneous Robots cs.RO · 2026-07-02 · unverdicted · none · ref 18 · internal anchor
Embodied.cpp introduces a portable C++ inference runtime with modular layers for deploying VLA and WAM models on heterogeneous robots, reporting 100% and 91% task success on two models plus memory reduction on a WAM benchmark.
Adapting Generalist Robot Policies with Semantic Reinforcement Learning cs.RO · 2026-06-30 · unverdicted · none · ref 7 · internal anchor
SARL optimizes language prompt inputs to generalist vision-language-action policies through online RL to solve complex long-horizon tasks by composing existing skills.
Pondering the Way: Spatial-perceiving World Action Model for Embodied Navigation cs.RO · 2026-06-29 · unverdicted · none · ref 37 · internal anchor
SWAM jointly generates intermediate RGB-D sequences and action trajectories from monocular RGB start/goal observations for embodied navigation.
WARP-RM: A Warp-Augmented Relative Progress Reward Model for Data Curation cs.RO · 2026-06-26 · unverdicted · none · ref 5 · internal anchor
WARP trains a reward model on time-warped successful demonstrations to produce frame-level progress estimates that upweight high-advantage chunks during behavior cloning, maintaining high success rates on suboptimal datasets where vanilla BC fails.
ForesightSafety-VLA: A Unified Diagnostic Safety Benchmark for Vision-Language-Action Models cs.RO · 2026-06-25 · unverdicted · none · ref 3 · internal anchor
ForesightSafety-VLA creates a diagnostic benchmark for VLA safety with taxonomy across physical, language, and visual risks, showing perception and structure variations cause more safety degradation than language changes in tested models.
Mix-QVLA: Task-Evidence-Aware Mixed-Precision Quantization of Vision-Language-Action Models cs.CV · 2026-06-17 · unverdicted · none · ref 11 · internal anchor
Mix-QVLA is a task-evidence-aware mixed-precision PTQ framework for VLA models that preserves task-relevant evidence via evidence-mass and attribution-distribution metrics to guide bit allocation under memory and BitOps constraints.
Space Is Intelligence: Neural Semigroup Superposition for Riemannian Metric Generation cs.RO · 2026-06-17 · unverdicted · none · ref 13 · internal anchor
A single Encoder-Router network uses semigroup superposition of frame, modulation, and coefficient parameters to produce a scene-specific Riemannian metric field that supports zero-shot geodesic planning after training on one two-obstacle scene.
EBench: Elemental Diagnosis of Generalist Mobile Manipulation Policies cs.RO · 2026-06-16 · unverdicted · none · ref 15 · internal anchor
EBench is a benchmark that evaluates generalist mobile manipulation policies on 26 tasks across 5 capability and 4 generalization dimensions, revealing distinct capability profiles among models with similar success rates.
Trajectory-Level Redirection Attacks on Vision-Language-Action Models cs.RO · 2026-06-11 · unverdicted · none · ref 17 · internal anchor
A prompt-only attack called command-preserving trajectory redirection can steer VLA robot behavior to attacker-chosen physical outcomes while the text still appears to match the intended task.
LIBERO-Occ: Evaluating and Improving Vision-Language-Action Models under Scene-Induced Occlusion via Viewpoint Imagination cs.CV · 2026-06-09 · unverdicted · none · ref 28 · internal anchor
Introduces LIBERO-Occ benchmark showing VLA performance drop under occlusion and Viewpoint Imagination method that generates complementary views to improve robustness without extra hardware.
UMI-Bench 1.0: An Open and Reproducible Real-World Benchmark for Tabletop Robotic Manipulation with UMI Data cs.RO · 2026-06-09 · unverdicted · none · ref 13 · internal anchor
UMI-Bench 1.0 is presented as the first open benchmark dedicated to reproducible real-world evaluation of Universal Manipulation Interface policies.
ProbeAct: Probe-Guided Training-Free Failure Recovery in Vision-Language-Action Models cs.RO · 2026-06-08 · unverdicted · none · ref 18 · internal anchor
PROBEACT is a plug-and-play intervention framework that combines hidden-state probing, kinematic failure detection, and CBF-based correction to boost success rates of pre-trained VLA models on the LIBERO-plus benchmark from 69.6% to 74.1%.
ReCoVLA: VLM-Guided Reward Compilation for Failure Recovery in Vision-Language-Action Policies cs.RO · 2026-06-08 · unverdicted · none · ref 20 · internal anchor
ReCoVLA improves VLA policy reliability by using a VLM as a semantic reward selector to train residual recovery policies in simulation, raising average success from 36.7% to 66.7% in sim and achieving 61.7% in zero-shot sim-to-real physical tests.
Targeting World Models to Compromise Robot Learning Pipelines cs.RO · 2026-06-08 · unverdicted · none · ref 6 · internal anchor
World models introduce a stealthy poisoning vector into robot learning pipelines where malicious prompts or dynamics in teleoperated data activate only during synthetic trajectory generation, enabling backdoors in downstream policies.
Back to the Familiar Future: Failure Recovery for VLA Policies via Pre-Imagined Milestone Selection cs.RO · 2026-06-08 · unverdicted · none · ref 4 · internal anchor
B2FF pre-generates a milestone bank of familiar future states from the clean initial observation and uses a recoverability-aware selector to guide VLA policies back from deviations, raising average success rate from 56.3% to 74.0% on failure-injected LIBERO.
Q-VGM: Q-Guided Value-Gradient Matching for Flow-Matching VLA Policies cs.RO · 2026-06-06 · unverdicted · none · ref 3 · internal anchor
Q-VGM introduces value-gradient matching via VGG-Flow to improve flow-matching VLA policies with a Cal-QL critic, achieving success rate lifts on LIBERO, RoboTwin, and real-robot tasks.
NextMotionQA: Benchmarking and Judging Human Motion Understanding with Vision-Language Models cs.CV · 2026-06-03 · unverdicted · none · ref 48 · internal anchor
NextMotionQA benchmark reveals VLMs have critical gaps in fine-grained human motion understanding and align with experts on coarse judgment (κ=0.70) but not fine-grained (κ=0.10).
Denoising Tells When to Replan: Denoising-Variance Adaptive Chunking for Flow-Based Robot Policies cs.RO · 2026-06-02 · unverdicted · none · ref 6 · internal anchor
DVAC uses denoising variance as an intrinsic signal to adaptively chunk actions in flow-based robot policies, improving success rates and cutting replans on LIBERO, RoboTwin, CALVIN, and real-world tasks.
Same Weights, Different Robot: A Deployment Safety View of VLA Policies cs.CR · 2026-06-02 · unverdicted · none · ref 18 · internal anchor
The paper identifies a deployment safety gap in VLA policies where identical checkpoints can be executable-inequivalent due to action metadata mismatches, supported by a derived closed-form transform and empirical drift measurements on LIBERO benchmarks.
BOKBO (Best of K Bad Options): Calibrated Abstention for VLA Policies cs.LG · 2026-05-28 · unverdicted · none · ref 14 · internal anchor
BOKBO is the first conformal abstention method for K-sample VLA policies that supplies finite-sample distribution-free guarantees on executed violation rates, with global and Mondrian per-task variants.
MiraBench: Evaluating Action-Conditioned Reliability in Robotic World Models cs.AI · 2026-05-28 · unverdicted · none · ref 38 · internal anchor
MiraBench defines action-conditioned reliability via three levels (physics adherence, action-following fidelity, optimism bias detection) and applies it to 12 model configurations using a 16,000-judgment human corpus, finding visual fidelity a poor proxy for action fidelity, no reliable scale benefi
Point Tracking Improves World Action Models cs.RO · 2026-05-22 · unverdicted · none · ref 69 · internal anchor
JOPAT jointly models pixels, point tracks, and actions in a diffusion transformer and reports gains over pixel-only baselines on long-horizon robot tasks with occlusion and off-screen motion.
Understanding Multimodal Failure in Action-Chunking Behavioral Cloning cs.LG · 2026-05-21 · unverdicted · none · ref 28 · internal anchor
The paper identifies distinct failure mechanisms: excessive posterior-prior regularization erases mode information in latent policies, while smooth base-to-action maps limit mode coverage in generative policies.
EvoScene-VLA: Evolving Scene Beliefs Inside the Action Decoder for Chunked Robot Control cs.RO · 2026-05-21 · conditional · none · ref 26 · internal anchor
EvoScene-VLA maintains an action-updated scene prior across control chunks in VLA policies, raising success rates on RoboTwin tasks from 87.2% to 89.1% fixed and 86.1% to 88.5% randomized while outperforming baselines on a real robot.
DISC: Decoupling Instruction from State-Conditioned Control via Policy Generation cs.RO · 2026-05-20 · unverdicted · none · ref 41 · internal anchor
A hypernetwork generates complete task-specific visuomotor policy parameters from instructions alone to structurally eliminate observation leakage in language-conditioned robotic control.
Beyond Binary Success: A Diagnostic Meta-Evaluation Framework for Fine-Grained Manipulation cs.RO · 2026-05-19 · unverdicted · none · ref 19 · internal anchor
MetaFine reconstructs benchmarks into diagnostic scenarios to evaluate vision-language-action models on fine-grained manipulation, exposing dimension-specific failures and identifying the visual encoder as a key bottleneck.
RoboFlow4D: A Lightweight Flow World Model Toward Real-Time Flow-Guided Robotic Manipulation cs.RO · 2026-05-17 · unverdicted · none · ref 26 · internal anchor
RoboFlow4D is an end-to-end lightweight flow world model that predicts multi-frame 3D flows from visual observations and textual instructions to provide explicit planning for real-time robotic manipulation.
SkiP: When to Skip and When to Refine for Efficient Robot Manipulation cs.RO · 2026-05-15 · unverdicted · none · ref 32 · internal anchor
SkiP introduces action relabeling and Motion Spectrum Keying to skip redundant steps in robot trajectories, cutting executed steps by 15-40% while maintaining success rates across 72 simulated and 3 real tasks.
DSSP: Diffusion State Space Policy with Full-History Encoding cs.RO · 2026-05-14 · conditional · none · ref 50 · internal anchor
DSSP is a history-conditioned diffusion state space policy that uses SSMs to encode full observation streams with an auxiliary dynamics objective and hierarchical fusion, achieving SOTA results with reduced model size in robot manipulation.
Test-time Sparsity for Extreme Fast Action Diffusion cs.CV · 2026-05-13 · unverdicted · none · ref 25 · internal anchor
Test-time sparsity with a parallel pipeline and omnidirectional feature reuse accelerates action diffusion by 5x to 47.5 Hz while cutting FLOPs 92% with no performance loss.
Beyond World-Frame Action Heads: Motion-Centric Action Frames for Vision-Language-Action Models cs.AI · 2026-05-12 · unverdicted · none · ref 47 · internal anchor
MCF-Proto adds a motion-centric local action frame and prototype parameterization to VLA models, inducing emergent geometric structure and improved robustness from standard demonstrations alone.
Overcoming Dynamics-Blindness: Training-Free Pace-and-Path Correction for VLA Models cs.RO · 2026-05-12 · unverdicted · none · ref 6 · 2 links · internal anchor
Pace-and-Path Correction decomposes a quadratic cost minimization into orthogonal pace and path channels to correct chunked actions in VLA models, raising success rates by up to 28.8% in dynamic settings.
VEGA: Visual Encoder Grounding Alignment for Spatially-Aware Vision-Language-Action Models cs.RO · 2026-05-11 · unverdicted · none · ref 37 · internal anchor
VEGA improves spatial reasoning in VLA models for robotics by aligning visual encoder features with 3D-supervised DINOv2 representations via a temporary projector and cosine similarity loss.
ECHO: Continuous Hierarchical Memory for Vision-Language-Action Models cs.RO · 2026-05-09 · unverdicted · none · ref 17 · internal anchor
ECHO organizes VLA experiences into a hierarchical memory tree in hyperbolic space via autoencoder and entailment constraints, delivering a 12.8% success-rate gain on LIBERO-Long over the pi0 baseline.
One Token Per Frame: Reconsidering Visual Bandwidth in World Models for VLA Policy cs.CV · 2026-05-08 · conditional · none · ref 41 · 3 links · internal anchor
Reducing visual input to one token per frame in VLA world models maintains or improves long-horizon performance on MetaWorld, LIBERO, and real-robot tasks.
Learning to Communicate Locally for Large-Scale Multi-Agent Pathfinding cs.AI · 2026-05-08 · unverdicted · none · ref 13 · 2 links · internal anchor
LC-MAPF uses multi-round local communication between neighboring agents in a pre-trained model to outperform prior learning-based MAPF solvers on diverse unseen scenarios while preserving scalability.
OA-WAM: Object-Addressable World Action Model for Robust Robot Manipulation cs.RO · 2026-05-07 · unverdicted · none · ref 57 · internal anchor
OA-WAM uses persistent address vectors and dynamic content vectors in object slots to enable addressable world-action prediction, improving robustness on manipulation benchmarks under scene changes.
Thinking in Text and Images: Interleaved Vision--Language Reasoning Traces for Long-Horizon Robot Manipulation cs.AI · 2026-05-01 · unverdicted · none · ref 18 · internal anchor
A multimodal transformer generates and caches interleaved text-image traces to guide closed-loop actions, achieving 92.4% success on LIBERO-Long and 95.5% average on LIBERO.
Characterizing Vision-Language-Action Models across XPUs: Constraints and Acceleration for On-Robot Deployment cs.RO · 2026-04-27 · unverdicted · none · ref 22 · internal anchor
VLA models exhibit a compute-bound VLM phase followed by a memory-bound action phase on edge hardware; DP-Cache and V-AEFusion reduce redundancy and enable pipeline parallelism for up to 6x speedup on NPUs with marginal task degradation.
Mask World Model: Predicting What Matters for Robust Robot Policy Learning cs.RO · 2026-04-21 · unverdicted · none · ref 33 · internal anchor
Mask World Model predicts semantic mask dynamics with video diffusion and integrates it with a diffusion policy head, outperforming RGB world models on LIBERO and RLBench while showing better real-world generalization and texture robustness.
${\pi}_{0.7}$: a Steerable Generalist Robotic Foundation Model with Emergent Capabilities cs.LG · 2026-04-16 · unverdicted · none · ref 3 · internal anchor
π₀.₇ is a steerable generalist robotic model that uses rich multimodal prompts including language, subgoal images, and performance metadata to achieve out-of-the-box generalization across tasks and robot bodies.
STRONG-VLA: Decoupled Robustness Learning for Vision-Language-Action Models under Multimodal Perturbations cs.RO · 2026-04-11 · unverdicted · none · ref 13 · internal anchor
STRONG-VLA uses decoupled two-stage training to improve VLA model robustness, yielding up to 16% higher task success rates under seen and unseen perturbations on the LIBERO benchmark.
ViVa: A Video-Generative Value Model for Robot Reinforcement Learning cs.RO · 2026-04-09 · unverdicted · none · ref 46 · internal anchor
ViVa turns a video generator into a value model for robot RL that jointly forecasts future states and task value, yielding better performance on real-world box assembly when integrated with RECAP.
Action Images: End-to-End Policy Learning via Multiview Video Generation cs.CV · 2026-04-07 · unverdicted · none · ref 54 · internal anchor
Action Images turn robot arm motions into interpretable multiview pixel videos, letting video backbones serve as zero-shot policies for end-to-end robot learning.
BiCoord: A Bimanual Manipulation Benchmark towards Long-Horizon Spatial-Temporal Coordination cs.RO · 2026-04-07 · conditional · none · ref 53 · internal anchor
BiCoord is a new benchmark for long-horizon tightly coordinated bimanual manipulation that includes quantitative metrics and shows existing policies like DP, RDT, Pi0 and OpenVLA-OFT struggle on such tasks.
Diffusion Policy with Bayesian Expert Selection for Active Multi-Target Tracking cs.RO · 2026-04-03 · unverdicted · none · ref 22 · internal anchor
A Bayesian expert selection framework with variational Bayesian last layers and lower confidence bounds improves diffusion policies for active multi-target tracking.
VP-VLA: Visual Prompting as an Interface for Vision-Language-Action Models cs.RO · 2026-03-23 · unverdicted · none · ref 36 · internal anchor
VP-VLA decouples high-level reasoning from low-level control in VLA models by rendering spatial anchors as visual prompts directly in the RGB observation space, outperforming end-to-end baselines.
Learning Physics from Pretrained Video Models: A Multimodal Continuous and Sequential World Interaction Models for Robotic Manipulation cs.RO · 2026-02-18 · unverdicted · none · ref 52 · internal anchor
PhysGen uses video models to learn physics for robots, outperforming baselines by up to 13.8% on Libero and matching specialized models in real-world tasks.

Octo: An Open-Source Generalist Robot Policy

hub tools

citation-role summary

citation-polarity summary

claims ledger

authors

co-cited works

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer