super hub Mixed citations

Octo: An Open-Source Generalist Robot Policy

Dibya Ghosh, Homer Walke, Karl Pertsch, Kevin Black, Octo Model Team, Oier Mees · 2024 · cs.RO · arXiv 2405.12213

Mixed citation behavior. Most common role is background (68%).

215 Pith papers citing it

Background 68% of classified citations

open full Pith review browse 215 citing papers more from Dibya Ghosh arXiv PDF

abstract

Large policies pretrained on diverse robot datasets have the potential to transform robotic learning: instead of training new policies from scratch, such generalist robot policies may be finetuned with only a little in-domain data, yet generalize broadly. However, to be widely applicable across a range of robotic learning scenarios, environments, and tasks, such policies need to handle diverse sensors and action spaces, accommodate a variety of commonly used robotic platforms, and finetune readily and efficiently to new domains. In this work, we aim to lay the groundwork for developing open-source, widely applicable, generalist policies for robotic manipulation. As a first step, we introduce Octo, a large transformer-based policy trained on 800k trajectories from the Open X-Embodiment dataset, the largest robot manipulation dataset to date. It can be instructed via language commands or goal images and can be effectively finetuned to robot setups with new sensory inputs and action spaces within a few hours on standard consumer GPUs. In experiments across 9 robotic platforms, we demonstrate that Octo serves as a versatile policy initialization that can be effectively finetuned to new observation and action spaces. We also perform detailed ablations of design decisions for the Octo model, from architecture to training data, to guide future research on building generalist robot models.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 36 baseline 14 dataset 2 method 1

citation-polarity summary

background 36 baseline 15 use dataset 2

claims ledger

abstract Large policies pretrained on diverse robot datasets have the potential to transform robotic learning: instead of training new policies from scratch, such generalist robot policies may be finetuned with only a little in-domain data, yet generalize broadly. However, to be widely applicable across a range of robotic learning scenarios, environments, and tasks, such policies need to handle diverse sensors and action spaces, accommodate a variety of commonly used robotic platforms, and finetune readily and efficiently to new domains. In this work, we aim to lay the groundwork for developing open-so

authors

Dibya Ghosh Homer Walke Karl Pertsch Kevin Black Octo Model Team Oier Mees

co-cited works

representative citing papers

HABIT: Human-Aware Behavior and Interaction Training Dataset for Robot Manipulation

cs.RO · 2026-06-30 · unverdicted · novelty 8.0

HABIT is a large-scale robot demonstration dataset for human-present environments that elicits spatiotemporal synchronization, yielding, and gesture grounding behaviors absent from robot-only training data.

Adapting Generalist Robot Policies with Semantic Reinforcement Learning

cs.RO · 2026-06-30 · unverdicted · novelty 7.0

SARL optimizes language prompt inputs to generalist vision-language-action policies through online RL to solve complex long-horizon tasks by composing existing skills.

Pondering the Way: Spatial-perceiving World Action Model for Embodied Navigation

cs.RO · 2026-06-29 · unverdicted · novelty 7.0

SWAM jointly generates intermediate RGB-D sequences and action trajectories from monocular RGB start/goal observations for embodied navigation.

WARP-RM: A Warp-Augmented Relative Progress Reward Model for Data Curation

cs.RO · 2026-06-26 · unverdicted · novelty 7.0

WARP trains a reward model on time-warped successful demonstrations to produce frame-level progress estimates that upweight high-advantage chunks during behavior cloning, maintaining high success rates on suboptimal datasets where vanilla BC fails.

ForesightSafety-VLA: A Unified Diagnostic Safety Benchmark for Vision-Language-Action Models

cs.RO · 2026-06-25 · unverdicted · novelty 7.0

ForesightSafety-VLA creates a diagnostic benchmark for VLA safety with taxonomy across physical, language, and visual risks, showing perception and structure variations cause more safety degradation than language changes in tested models.

ProbeAct: Probe-Guided Training-Free Failure Recovery in Vision-Language-Action Models

cs.RO · 2026-06-08 · unverdicted · novelty 7.0

PROBEACT is a plug-and-play intervention framework that combines hidden-state probing, kinematic failure detection, and CBF-based correction to boost success rates of pre-trained VLA models on the LIBERO-plus benchmark from 69.6% to 74.1%.

ReCoVLA: VLM-Guided Reward Compilation for Failure Recovery in Vision-Language-Action Policies

cs.RO · 2026-06-08 · unverdicted · novelty 7.0

ReCoVLA improves VLA policy reliability by using a VLM as a semantic reward selector to train residual recovery policies in simulation, raising average success from 36.7% to 66.7% in sim and achieving 61.7% in zero-shot sim-to-real physical tests.

Targeting World Models to Compromise Robot Learning Pipelines

cs.RO · 2026-06-08 · unverdicted · novelty 7.0

World models introduce a stealthy poisoning vector into robot learning pipelines where malicious prompts or dynamics in teleoperated data activate only during synthetic trajectory generation, enabling backdoors in downstream policies.

Back to the Familiar Future: Failure Recovery for VLA Policies via Pre-Imagined Milestone Selection

cs.RO · 2026-06-08 · unverdicted · novelty 7.0

B2FF pre-generates a milestone bank of familiar future states from the clean initial observation and uses a recoverability-aware selector to guide VLA policies back from deviations, raising average success rate from 56.3% to 74.0% on failure-injected LIBERO.

Q-VGM: Q-Guided Value-Gradient Matching for Flow-Matching VLA Policies

cs.RO · 2026-06-06 · unverdicted · novelty 7.0

Q-VGM introduces value-gradient matching via VGG-Flow to improve flow-matching VLA policies with a Cal-QL critic, achieving success rate lifts on LIBERO, RoboTwin, and real-robot tasks.

NextMotionQA: Benchmarking and Judging Human Motion Understanding with Vision-Language Models

cs.CV · 2026-06-03 · unverdicted · novelty 7.0

NextMotionQA benchmark reveals VLMs have critical gaps in fine-grained human motion understanding and align with experts on coarse judgment (κ=0.70) but not fine-grained (κ=0.10).

Denoising Tells When to Replan: Denoising-Variance Adaptive Chunking for Flow-Based Robot Policies

cs.RO · 2026-06-02 · unverdicted · novelty 7.0

DVAC uses denoising variance as an intrinsic signal to adaptively chunk actions in flow-based robot policies, improving success rates and cutting replans on LIBERO, RoboTwin, CALVIN, and real-world tasks.

Same Weights, Different Robot: A Deployment Safety View of VLA Policies

cs.CR · 2026-06-02 · unverdicted · novelty 7.0

The paper identifies a deployment safety gap in VLA policies where identical checkpoints can be executable-inequivalent due to action metadata mismatches, supported by a derived closed-form transform and empirical drift measurements on LIBERO benchmarks.

BOKBO (Best of K Bad Options): Calibrated Abstention for VLA Policies

cs.LG · 2026-05-28 · unverdicted · novelty 7.0

BOKBO is the first conformal abstention method for K-sample VLA policies that supplies finite-sample distribution-free guarantees on executed violation rates, with global and Mondrian per-task variants.

MiraBench: Evaluating Action-Conditioned Reliability in Robotic World Models

cs.AI · 2026-05-28 · unverdicted · novelty 7.0

MiraBench defines action-conditioned reliability via three levels (physics adherence, action-following fidelity, optimism bias detection) and applies it to 12 model configurations using a 16,000-judgment human corpus, finding visual fidelity a poor proxy for action fidelity, no reliable scale benefi

Point Tracking Improves World Action Models

cs.RO · 2026-05-22 · unverdicted · novelty 7.0

JOPAT jointly models pixels, point tracks, and actions in a diffusion transformer and reports gains over pixel-only baselines on long-horizon robot tasks with occlusion and off-screen motion.

Understanding Multimodal Failure in Action-Chunking Behavioral Cloning

cs.LG · 2026-05-21 · unverdicted · novelty 7.0

The paper identifies distinct failure mechanisms: excessive posterior-prior regularization erases mode information in latent policies, while smooth base-to-action maps limit mode coverage in generative policies.

EvoScene-VLA: Evolving Scene Beliefs Inside the Action Decoder for Chunked Robot Control

cs.RO · 2026-05-21 · conditional · novelty 7.0

EvoScene-VLA maintains an action-updated scene prior across control chunks in VLA policies, raising success rates on RoboTwin tasks from 87.2% to 89.1% fixed and 86.1% to 88.5% randomized while outperforming baselines on a real robot.

DISC: Decoupling Instruction from State-Conditioned Control via Policy Generation

cs.RO · 2026-05-20 · unverdicted · novelty 7.0

A hypernetwork generates complete task-specific visuomotor policy parameters from instructions alone to structurally eliminate observation leakage in language-conditioned robotic control.

Beyond Binary Success: A Diagnostic Meta-Evaluation Framework for Fine-Grained Manipulation

cs.RO · 2026-05-19 · unverdicted · novelty 7.0

MetaFine reconstructs benchmarks into diagnostic scenarios to evaluate vision-language-action models on fine-grained manipulation, exposing dimension-specific failures and identifying the visual encoder as a key bottleneck.

RoboFlow4D: A Lightweight Flow World Model Toward Real-Time Flow-Guided Robotic Manipulation

cs.RO · 2026-05-17 · unverdicted · novelty 7.0

RoboFlow4D is an end-to-end lightweight flow world model that predicts multi-frame 3D flows from visual observations and textual instructions to provide explicit planning for real-time robotic manipulation.

SkiP: When to Skip and When to Refine for Efficient Robot Manipulation

cs.RO · 2026-05-15 · unverdicted · novelty 7.0

SkiP introduces action relabeling and Motion Spectrum Keying to skip redundant steps in robot trajectories, cutting executed steps by 15-40% while maintaining success rates across 72 simulated and 3 real tasks.

DSSP: Diffusion State Space Policy with Full-History Encoding

cs.RO · 2026-05-14 · conditional · novelty 7.0

DSSP is a history-conditioned diffusion state space policy that uses SSMs to encode full observation streams with an auxiliary dynamics objective and hierarchical fusion, achieving SOTA results with reduced model size in robot manipulation.

Test-time Sparsity for Extreme Fast Action Diffusion

cs.CV · 2026-05-13 · unverdicted · novelty 7.0

Test-time sparsity with a parallel pipeline and omnidirectional feature reuse accelerates action diffusion by 5x to 47.5 Hz while cutting FLOPs 92% with no performance loss.

citing papers explorer

Showing 50 of 215 citing papers.

Trust Your Instincts: Confidence-Driven Test-Time RL for Vision-Language-Action Models cs.RO · 2026-06-29 · unverdicted · none · ref 41 · internal anchor
T^2VLA is a test-time reinforcement learning framework for VLAs that uses internal confidence to define intrinsic rewards via similarity to high-confidence expert demonstrations and a dual-expert bootstrapping mechanism.
Translation as a Bridging Action: Transferring Manipulation Skills from Humans to Robots cs.RO · 2026-06-26 · unverdicted · none · ref 50 · internal anchor
A relative wrist translation bridging action with a vision-language-action model using interleaved tokens and attention masking transfers human manipulation skills to robots more effectively than 6DoF actions.
MotionWAM: Towards Foundation World Action Models for Real-Time Humanoid Loco-Manipulation cs.RO · 2026-06-08 · unverdicted · none · ref 27 · internal anchor
MotionWAM conditions a policy on intermediate features from a video world model to predict unified whole-body motion tokens, enabling real-time humanoid loco-manipulation that outperforms VLA baselines by over 30% on nine Unitree G1 tasks.
MotionVLA: Injecting Geometric Motion into Vision-Language-Action Model cs.RO · 2026-06-06 · unverdicted · none · ref 29 · internal anchor
MotionVLA converts short past video windows into compact trajectory-field tokens to supply motion-consistent evidence for vision-language-action robot policies, improving long-horizon manipulation.
SynthICL: Scalable In-context Imitation Learning with Synthetic Data cs.RO · 2026-06-06 · unverdicted · none · ref 7 · internal anchor
SynthICL trains flow-matching transformer policies for in-context imitation learning entirely from synthetic RGB data and reports 79% average success on 16 unseen real manipulation tasks with one test-time demonstration.
LARA: Latent Action Representation Alignment for Vision-Language-Action Models cs.CV · 2026-06-05 · unverdicted · none · ref 15 · 2 links · internal anchor
LARA jointly optimizes LAM and VLA models via representation alignment to improve robotic manipulation performance using human videos.
TempoVLA: Learning Speed-Controllable Vision-Language-Action Policies cs.RO · 2026-06-04 · unverdicted · none · ref 5 · internal anchor
TempoVLA learns a single VLA policy with controllable execution speed via variable-speed trajectory augmentation and explicit speed conditioning.
Flow-based Policy Adaptation without Policy Updates cs.RO · 2026-06-04 · unverdicted · none · ref 7 · internal anchor
GLOVES learns flow models from limited expert demonstrations to selectively correct actions from non-expert policies or operators toward expert distributions using reverse-flow OOD detection as an intervention gate.
AffordanceVLA: A Vision-Language-Action Model Empowering Action Generation through Affordance-Aware Understanding cs.RO · 2026-06-04 · unverdicted · none · ref 61 · internal anchor
AffordanceVLA proposes a VLA model with affordance-aware modules (Which2Act, Where2Act, How2Act) in a Mixture-of-Transformer trained in three stages to improve robotic manipulation.
FlowPRO: Reward-Free Reinforced Fine-Tuning of Flow-Matching VLAs via Proximalized Preference Optimization cs.RO · 2026-06-03 · unverdicted · none · ref 4 · internal anchor
FlowPRO applies proximalized preference optimization to flow-matching VLAs with intervention-rollback data to reach higher success rates on long-horizon bimanual tasks without rewards or critics.
OpenEAI-Platform: An Open-source Embodied Artificial Intelligence Hardware-Software Unified Platform cs.RO · 2026-06-02 · conditional · none · ref 26 · internal anchor
OpenEAI-Platform delivers an open-source low-cost robotic arm and VLA model that outperforms commercial arms and matches large pretrained baselines on four real-world manipulation tasks using limited open data.
See Less, Specify More: Visual Evidence Budgets for Generalizable VLAs cs.RO · 2026-06-01 · unverdicted · none · ref 10 · internal anchor
S2 improves generalization in vision-language-action models by using goal-preserving refined language guidance and explicit visual evidence budgets, raising mean subtask success from 54.2% to 79.0% on eight real-robot tasks compared to pi0.5.
RoboDream: Compositional World Models for Scalable Robot Data Synthesis cs.RO · 2026-06-01 · unverdicted · none · ref 2 · internal anchor
RoboDream is a compositional world model that decouples robot trajectory execution from environment synthesis to generate scalable, diverse manipulation data, shown in abstract to improve downstream policies.
RoboSemanticBench: Diagnosing Semantic Grounding in Action Prediction for VLA Models cs.RO · 2026-06-01 · unverdicted · none · ref 30 · internal anchor
RoboSemanticBench reveals that representative VLA models grasp blocks successfully but select the semantically correct answer at near-random rates, indicating a gap between backbone semantics and action prediction.
PHASOR: Phase-Anchored Universal Action Representations for Humanoid Embodiments cs.RO · 2026-06-01 · unverdicted · none · ref 8 · internal anchor
PHASOR factorizes motion into an FFT-based phase manifold and pose branch with semantic distillation to produce a cross-embodiment, human-anchored action embedding space for humanoid robots.
SKIP: Sparse Keyframe Interpolation Paradigm for Efficient Embodied World Models cs.RO · 2026-05-30 · unverdicted · none · ref 4 · internal anchor
SKIP achieves 4.16x faster dense video rollouts for robot world models by synthesizing only multimodal-identified keyframes and interpolating the rest, preserving policy training effectiveness with minimal success rate drops.
Continuous Reasoning for Vision-Language-Action cs.RO · 2026-05-29 · unverdicted · none · ref 4 · internal anchor
Continuous Reasoning for VLA introduces a shared Gaussian latent for continuous thoughts, trained with self-verification to improve action prediction on LIBERO-PRO and real robots.
HARP-VLA: Human-Robot Aligned Representation Learning for Vision-Language-Action Model cs.RO · 2026-05-29 · unverdicted · none · ref 13 · internal anchor
HARP aligns human-robot visual and latent action representations via paired bridges and unpaired dynamics supervision to boost VLA policy performance on manipulation tasks.
RoboWits: Unexpected Challenges for Robotic Creative Problem Solving cs.RO · 2026-05-28 · unverdicted · none · ref 37 · internal anchor
RoboWits benchmark with 238 tasks shows pre-trained VLAs succeed on seed tasks but fail on mutated ones, highlighting brittleness in reasoning.
VisualThink-VLA: Visual Intermediate Reasoning for Effective and Low-Latency Vision-Language-Action Policies cs.CV · 2026-05-28 · unverdicted · none · ref 3 · internal anchor
VISUALTHINK-VLA uses visual evidence tokens and selective routing to reach top success rates on VLA benchmarks while cutting reasoning latency from multi-second to sub-second levels.
VLAConf: Calibrated Task-Success Confidence for Vision-Language-Action Models cs.RO · 2026-05-28 · unverdicted · none · ref 5 · internal anchor
VLAConf is a one-class discriminative method that estimates step-wise task-success confidence for VLA models via anomaly scoring on frozen representations plus step-conditioned modeling, shown to be more efficient than ensemble or probability baselines on LIBERO and real robots.
FineVLA: Fine-Grained Instruction Alignment for Steerable Vision-Language-Action Policies cs.RO · 2026-05-26 · unverdicted · none · ref 27 · internal anchor
FineVLA unifies robot datasets into 47k fine-grained trajectories, adds a VLM annotator and benchmark, and shows that mixing fine-grained and goal-level instructions improves steerable control without hurting task success.
Afford-VLA: Action-Aligned Visual Planning via Internalized Affordance cs.RO · 2026-05-22 · unverdicted · none · ref 30 · internal anchor
Afford-VLA internalizes task-conditioned affordance as an explicit visual planning interface within VLA models via learnable <AFF> tokens, achieving SOTA on LIBERO and SimplerEnv benchmarks.
Agentic-VLA: Efficient Online Adaptation for Vision-Language-Action Models cs.RO · 2026-05-21 · unverdicted · none · ref 12 · internal anchor
Agentic-VLA enables efficient online adaptation of VLA models, delivering +12.3% on long-horizon tasks, +28.5% in 1-shot learning, and 2.4x faster convergence on LIBERO through three new components.
Action with Visual Primitives cs.RO · 2026-05-21 · unverdicted · none · ref 5 · internal anchor
AVP architecture has VLM emit visual-primitive tokens to condition flow-matching action expert, yielding 27.61% higher success rate than pi_0.5 on real-robot pick-and-place tasks.
RoHIL: Robust Human-in-the-Loop Robotic Reinforcement Learning Against Illumination Variations cs.RO · 2026-05-19 · unverdicted · none · ref 14 · internal anchor
RoHIL adapts human-in-the-loop RL policies to new illumination conditions offline by combining world-model image relighting, illumination-retention replay, and anchored Bellman regularisation, improving shifted-light performance while preserving source performance on four real-robot tasks.
Mechanisms of Misgeneralization in Physical Sequence Modeling cs.LG · 2026-05-19 · unverdicted · none · ref 146 · internal anchor
Generative sequence models for physical tasks exhibit physical misgeneralization where local prediction errors propagate through physical measurements to distort aggregate distributions over quantities like distance or energy; a data deviation kernel explains and predicts the shifts and supports a内核
LACE: Latent Visual Representation for Cross-Embodiment Learning cs.RO · 2026-05-16 · unverdicted · none · ref 21 · internal anchor
LACE aligns human-robot visual features via semantic distribution matching on corresponding body parts plus Gram loss, yielding 65% better zero-shot policy transfer than baseline DINO.
Ada-Diffuser: Latent-Aware Adaptive Diffusion for Decision-Making cs.LG · 2026-05-15 · unverdicted · none · ref 11 · internal anchor
Ada-Diffuser is a causal diffusion model that jointly learns observed interaction structure and underlying latent dynamics from minimal observations for adaptive planning and policy learning.
UAM: A Dual-Stream Perspective on Forgetting in VLA Training cs.CV · 2026-05-15 · unverdicted · none · ref 40 · internal anchor
UAM adds a Dorsal Expert initialized from a generative model and trained on visual dynamics prediction to preserve over 95% of VLM multimodal ability in VLA training while achieving top success rates on manipulation tasks including OOD cases.
FrameSkip: Learning from Fewer but More Informative Frames in VLA Training cs.RO · 2026-05-13 · unverdicted · none · ref 18 · internal anchor
FrameSkip improves VLA policy training success from 66.50% to 76.15% by selecting high-importance frames and retaining only 20% of unique frames across three benchmarks.
Guide, Think, Act: Interactive Embodied Reasoning in Vision-Language-Action Models cs.RO · 2026-05-13 · conditional · none · ref 30 · internal anchor
GTA-VLA conditions VLA models on user spatial priors to produce a unified spatial-visual chain-of-thought, reaching 81.2% success on SimplerEnv WidowX and improving performance under out-of-distribution shifts.
Towards Long-horizon Embodied Agents with Tool-Aligned Vision-Language-Action Models cs.RO · 2026-05-13 · unverdicted · none · ref 22 · internal anchor
VLAs-as-Tools pairs a VLM planner with specialized VLA executors via a new interface and Tool-Aligned Post-Training to raise long-horizon robot success rates on LIBERO-Long and RoboTwin benchmarks.
MindVLA-U1: VLA Beats VA with Unified Streaming Architecture for Autonomous Driving cs.RO · 2026-05-12 · unverdicted · none · ref 74 · 2 links · internal anchor
MindVLA-U1 is the first unified streaming VLA architecture that surpasses human drivers on WOD-E2E planning metrics while matching VA latency and preserving language interfaces.
HarmoWAM: Harmonizing Generalizable and Precise Manipulation via Adaptive World Action Models cs.RO · 2026-05-11 · unverdicted · none · ref 46 · internal anchor
HarmoWAM unifies predictive and reactive control in world action models via an adaptive gating mechanism to deliver improved zero-shot generalization and precision in robotic manipulation.
PriorVLA: Prior-Preserving Adaptation for Vision-Language-Action Models cs.RO · 2026-05-11 · unverdicted · none · ref 9 · internal anchor
PriorVLA preserves pretrained priors in VLA models through a frozen Prior Expert and trained Adaptation Expert, delivering better robot manipulation performance than full fine-tuning with only 25% of the parameter updates.
Unified Noise Steering for Efficient Human-Guided VLA Adaptation cs.RO · 2026-05-11 · unverdicted · none · ref 5 · internal anchor
UniSteer unifies human corrective actions and noise-space RL for VLA adaptation by inverting actions to noise targets, raising success rates from 20% to 90% in 66 minutes across four real-world manipulation tasks.
Retrieve-then-Steer: Online Success Memory for Test-Time Adaptation of Generative VLAs cs.RO · 2026-05-11 · unverdicted · none · ref 25 · 2 links · internal anchor
A retrieve-then-steer method stores successful robot actions in memory and uses them to steer a frozen VLA's flow-matching sampler for better test-time reliability without parameter updates.
Kintsugi: Learning Policies by Repairing Executable Knowledge Bases cs.LG · 2026-05-10 · unverdicted · none · ref 28 · internal anchor
Kintsugi learns policies by repairing composable executable knowledge bases through agentic diagnosis, localized typed edits, and deterministic verification gates that admit only improvements.
Escaping the Diversity Trap in Robotic Manipulation via Anchor-Centric Adaptation cs.RO · 2026-05-08 · unverdicted · none · ref 22 · internal anchor
Anchor-Centric Adaptation escapes the diversity trap by prioritizing repeated demonstrations at core anchors over broad coverage, yielding higher success rates under fixed data budgets in robotic manipulation.
Toward Visually Realistic Simulation: A Benchmark for Evaluating Robot Manipulation in Simulation cs.RO · 2026-05-07 · unverdicted · none · ref 35 · internal anchor
VISER is a new visually realistic simulation benchmark for robot manipulation tasks that uses PBR materials and MLLM-assisted asset generation, achieving 0.92 Pearson correlation with real-world policy performance.
TriRelVLA: Triadic Relational Structure for Generalizable Embodied Manipulation cs.CV · 2026-05-07 · unverdicted · none · ref 20 · internal anchor
TriRelVLA introduces triadic object-hand-task relational representations and a task-grounded graph transformer with a relational bottleneck to improve generalization in robotic manipulation across scenes, objects, and tasks.
When Life Gives You BC, Make Q-functions: Extracting Q-values from Behavior Cloning for On-Robot Reinforcement Learning cs.RO · 2026-05-06 · unverdicted · none · ref 34 · 2 links · internal anchor
Q2RL extracts Q-values from a BC policy and applies Q-gating to enable efficient offline-to-online RL, outperforming baselines on D4RL/robomimic tasks and achieving up to 100% success on real-robot manipulation in 1-2 hours.
ConsisVLA-4D: Advancing Spatiotemporal Consistency in Efficient 3D-Perception and 4D-Reasoning for Robotic Manipulation cs.RO · 2026-05-06 · unverdicted · none · ref 68 · internal anchor
ConsisVLA-4D adds cross-view semantic alignment, cross-object geometric fusion, and cross-scene dynamic reasoning to VLA models, delivering 21.6% and 41.5% gains plus 2.3x and 2.4x speedups on LIBERO and real-world tasks.
ReflectDrive-2: Reinforcement-Learning-Aligned Self-Editing for Discrete Diffusion Driving cs.RO · 2026-05-06 · unverdicted · none · ref 54 · 2 links · internal anchor
ReflectDrive-2 combines masked discrete diffusion with RL-aligned self-editing to generate and refine driving trajectories, reaching 91.0 PDMS on NAVSIM camera-only and 94.8 in best-of-6.
An Efficient Metric for Data Quality Measurement in Imitation Learning cs.RO · 2026-05-02 · unverdicted · none · ref 3 · internal anchor
Power spectral density of trajectories ranks demonstration quality for imitation learning, enabling rollout-free curation that improves fine-tuned policy success.
Decompose and Recompose: Reasoning New Skills from Existing Abilities for Cross-Task Robotic Manipulation cs.RO · 2026-05-02 · unverdicted · none · ref 20 · internal anchor
Decompose and Recompose decomposes seen robotic demonstrations into skill-action alignments and recomposes them via visual-semantic retrieval and planning to enable zero-shot cross-task generalization.
VLA-ATTC: Adaptive Test-Time Compute for VLA Models with Relative Action Critic Model cs.RO · 2026-05-02 · unverdicted · none · ref 21 · internal anchor
VLA-ATTC equips VLA models with adaptive test-time compute via an uncertainty clutch and relative action critic, cutting failure rates by over 50% on LIBERO-LONG.
Learning While Deploying: Fleet-Scale Reinforcement Learning for Generalist Robot Policies cs.RO · 2026-05-01 · unverdicted · none · ref 3 · 2 links · internal anchor
LWD is a fleet-scale offline-to-online RL framework that continually improves pretrained VLA policies using autonomous rollouts and human interventions, reaching 95% average success on real-world manipulation tasks.
$M^2$-VLA: Boosting Vision-Language Models for Generalizable Manipulation via Layer Mixture and Meta-Skills cs.RO · 2026-04-27 · unverdicted · none · ref 19 · internal anchor
M²-VLA shows that generalized VLMs can serve as direct backbones for robotic manipulation by selectively extracting task-critical features via Mixture of Layers and adding Meta Skill Modules for efficient trajectory learning.

Octo: An Open-Source Generalist Robot Policy

hub tools

citation-role summary

citation-polarity summary

claims ledger

authors

co-cited works

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer