HABIT is a large-scale robot demonstration dataset for human-present environments that elicits spatiotemporal synchronization, yielding, and gesture grounding behaviors absent from robot-only training data.
super hub Mixed citations
Octo: An Open-Source Generalist Robot Policy
Mixed citation behavior. Most common role is background (68%).
abstract
Large policies pretrained on diverse robot datasets have the potential to transform robotic learning: instead of training new policies from scratch, such generalist robot policies may be finetuned with only a little in-domain data, yet generalize broadly. However, to be widely applicable across a range of robotic learning scenarios, environments, and tasks, such policies need to handle diverse sensors and action spaces, accommodate a variety of commonly used robotic platforms, and finetune readily and efficiently to new domains. In this work, we aim to lay the groundwork for developing open-source, widely applicable, generalist policies for robotic manipulation. As a first step, we introduce Octo, a large transformer-based policy trained on 800k trajectories from the Open X-Embodiment dataset, the largest robot manipulation dataset to date. It can be instructed via language commands or goal images and can be effectively finetuned to robot setups with new sensory inputs and action spaces within a few hours on standard consumer GPUs. In experiments across 9 robotic platforms, we demonstrate that Octo serves as a versatile policy initialization that can be effectively finetuned to new observation and action spaces. We also perform detailed ablations of design decisions for the Octo model, from architecture to training data, to guide future research on building generalist robot models.
hub tools
citation-role summary
citation-polarity summary
claims ledger
- abstract Large policies pretrained on diverse robot datasets have the potential to transform robotic learning: instead of training new policies from scratch, such generalist robot policies may be finetuned with only a little in-domain data, yet generalize broadly. However, to be widely applicable across a range of robotic learning scenarios, environments, and tasks, such policies need to handle diverse sensors and action spaces, accommodate a variety of commonly used robotic platforms, and finetune readily and efficiently to new domains. In this work, we aim to lay the groundwork for developing open-so
authors
co-cited works
representative citing papers
SARL optimizes language prompt inputs to generalist vision-language-action policies through online RL to solve complex long-horizon tasks by composing existing skills.
SWAM jointly generates intermediate RGB-D sequences and action trajectories from monocular RGB start/goal observations for embodied navigation.
WARP trains a reward model on time-warped successful demonstrations to produce frame-level progress estimates that upweight high-advantage chunks during behavior cloning, maintaining high success rates on suboptimal datasets where vanilla BC fails.
ForesightSafety-VLA creates a diagnostic benchmark for VLA safety with taxonomy across physical, language, and visual risks, showing perception and structure variations cause more safety degradation than language changes in tested models.
PROBEACT is a plug-and-play intervention framework that combines hidden-state probing, kinematic failure detection, and CBF-based correction to boost success rates of pre-trained VLA models on the LIBERO-plus benchmark from 69.6% to 74.1%.
ReCoVLA improves VLA policy reliability by using a VLM as a semantic reward selector to train residual recovery policies in simulation, raising average success from 36.7% to 66.7% in sim and achieving 61.7% in zero-shot sim-to-real physical tests.
World models introduce a stealthy poisoning vector into robot learning pipelines where malicious prompts or dynamics in teleoperated data activate only during synthetic trajectory generation, enabling backdoors in downstream policies.
B2FF pre-generates a milestone bank of familiar future states from the clean initial observation and uses a recoverability-aware selector to guide VLA policies back from deviations, raising average success rate from 56.3% to 74.0% on failure-injected LIBERO.
Q-VGM introduces value-gradient matching via VGG-Flow to improve flow-matching VLA policies with a Cal-QL critic, achieving success rate lifts on LIBERO, RoboTwin, and real-robot tasks.
NextMotionQA benchmark reveals VLMs have critical gaps in fine-grained human motion understanding and align with experts on coarse judgment (κ=0.70) but not fine-grained (κ=0.10).
DVAC uses denoising variance as an intrinsic signal to adaptively chunk actions in flow-based robot policies, improving success rates and cutting replans on LIBERO, RoboTwin, CALVIN, and real-world tasks.
The paper identifies a deployment safety gap in VLA policies where identical checkpoints can be executable-inequivalent due to action metadata mismatches, supported by a derived closed-form transform and empirical drift measurements on LIBERO benchmarks.
BOKBO is the first conformal abstention method for K-sample VLA policies that supplies finite-sample distribution-free guarantees on executed violation rates, with global and Mondrian per-task variants.
MiraBench defines action-conditioned reliability via three levels (physics adherence, action-following fidelity, optimism bias detection) and applies it to 12 model configurations using a 16,000-judgment human corpus, finding visual fidelity a poor proxy for action fidelity, no reliable scale benefi
JOPAT jointly models pixels, point tracks, and actions in a diffusion transformer and reports gains over pixel-only baselines on long-horizon robot tasks with occlusion and off-screen motion.
The paper identifies distinct failure mechanisms: excessive posterior-prior regularization erases mode information in latent policies, while smooth base-to-action maps limit mode coverage in generative policies.
EvoScene-VLA maintains an action-updated scene prior across control chunks in VLA policies, raising success rates on RoboTwin tasks from 87.2% to 89.1% fixed and 86.1% to 88.5% randomized while outperforming baselines on a real robot.
A hypernetwork generates complete task-specific visuomotor policy parameters from instructions alone to structurally eliminate observation leakage in language-conditioned robotic control.
MetaFine reconstructs benchmarks into diagnostic scenarios to evaluate vision-language-action models on fine-grained manipulation, exposing dimension-specific failures and identifying the visual encoder as a key bottleneck.
RoboFlow4D is an end-to-end lightweight flow world model that predicts multi-frame 3D flows from visual observations and textual instructions to provide explicit planning for real-time robotic manipulation.
SkiP introduces action relabeling and Motion Spectrum Keying to skip redundant steps in robot trajectories, cutting executed steps by 15-40% while maintaining success rates across 72 simulated and 3 real tasks.
DSSP is a history-conditioned diffusion state space policy that uses SSMs to encode full observation streams with an auxiliary dynamics objective and hierarchical fusion, achieving SOTA results with reduced model size in robot manipulation.
Test-time sparsity with a parallel pipeline and omnidirectional feature reuse accelerates action diffusion by 5x to 47.5 Hz while cutting FLOPs 92% with no performance loss.
citing papers explorer
-
Trust Your Instincts: Confidence-Driven Test-Time RL for Vision-Language-Action Models
T^2VLA is a test-time reinforcement learning framework for VLAs that uses internal confidence to define intrinsic rewards via similarity to high-confidence expert demonstrations and a dual-expert bootstrapping mechanism.
-
Translation as a Bridging Action: Transferring Manipulation Skills from Humans to Robots
A relative wrist translation bridging action with a vision-language-action model using interleaved tokens and attention masking transfers human manipulation skills to robots more effectively than 6DoF actions.
-
MotionWAM: Towards Foundation World Action Models for Real-Time Humanoid Loco-Manipulation
MotionWAM conditions a policy on intermediate features from a video world model to predict unified whole-body motion tokens, enabling real-time humanoid loco-manipulation that outperforms VLA baselines by over 30% on nine Unitree G1 tasks.
-
MotionVLA: Injecting Geometric Motion into Vision-Language-Action Model
MotionVLA converts short past video windows into compact trajectory-field tokens to supply motion-consistent evidence for vision-language-action robot policies, improving long-horizon manipulation.
-
SynthICL: Scalable In-context Imitation Learning with Synthetic Data
SynthICL trains flow-matching transformer policies for in-context imitation learning entirely from synthetic RGB data and reports 79% average success on 16 unseen real manipulation tasks with one test-time demonstration.
-
LARA: Latent Action Representation Alignment for Vision-Language-Action Models
LARA jointly optimizes LAM and VLA models via representation alignment to improve robotic manipulation performance using human videos.
-
TempoVLA: Learning Speed-Controllable Vision-Language-Action Policies
TempoVLA learns a single VLA policy with controllable execution speed via variable-speed trajectory augmentation and explicit speed conditioning.
-
Flow-based Policy Adaptation without Policy Updates
GLOVES learns flow models from limited expert demonstrations to selectively correct actions from non-expert policies or operators toward expert distributions using reverse-flow OOD detection as an intervention gate.
-
AffordanceVLA: A Vision-Language-Action Model Empowering Action Generation through Affordance-Aware Understanding
AffordanceVLA proposes a VLA model with affordance-aware modules (Which2Act, Where2Act, How2Act) in a Mixture-of-Transformer trained in three stages to improve robotic manipulation.
-
FlowPRO: Reward-Free Reinforced Fine-Tuning of Flow-Matching VLAs via Proximalized Preference Optimization
FlowPRO applies proximalized preference optimization to flow-matching VLAs with intervention-rollback data to reach higher success rates on long-horizon bimanual tasks without rewards or critics.
-
OpenEAI-Platform: An Open-source Embodied Artificial Intelligence Hardware-Software Unified Platform
OpenEAI-Platform delivers an open-source low-cost robotic arm and VLA model that outperforms commercial arms and matches large pretrained baselines on four real-world manipulation tasks using limited open data.
-
See Less, Specify More: Visual Evidence Budgets for Generalizable VLAs
S2 improves generalization in vision-language-action models by using goal-preserving refined language guidance and explicit visual evidence budgets, raising mean subtask success from 54.2% to 79.0% on eight real-robot tasks compared to pi0.5.
-
RoboDream: Compositional World Models for Scalable Robot Data Synthesis
RoboDream is a compositional world model that decouples robot trajectory execution from environment synthesis to generate scalable, diverse manipulation data, shown in abstract to improve downstream policies.
-
RoboSemanticBench: Diagnosing Semantic Grounding in Action Prediction for VLA Models
RoboSemanticBench reveals that representative VLA models grasp blocks successfully but select the semantically correct answer at near-random rates, indicating a gap between backbone semantics and action prediction.
-
PHASOR: Phase-Anchored Universal Action Representations for Humanoid Embodiments
PHASOR factorizes motion into an FFT-based phase manifold and pose branch with semantic distillation to produce a cross-embodiment, human-anchored action embedding space for humanoid robots.
-
SKIP: Sparse Keyframe Interpolation Paradigm for Efficient Embodied World Models
SKIP achieves 4.16x faster dense video rollouts for robot world models by synthesizing only multimodal-identified keyframes and interpolating the rest, preserving policy training effectiveness with minimal success rate drops.
-
Continuous Reasoning for Vision-Language-Action
Continuous Reasoning for VLA introduces a shared Gaussian latent for continuous thoughts, trained with self-verification to improve action prediction on LIBERO-PRO and real robots.
-
HARP-VLA: Human-Robot Aligned Representation Learning for Vision-Language-Action Model
HARP aligns human-robot visual and latent action representations via paired bridges and unpaired dynamics supervision to boost VLA policy performance on manipulation tasks.
-
RoboWits: Unexpected Challenges for Robotic Creative Problem Solving
RoboWits benchmark with 238 tasks shows pre-trained VLAs succeed on seed tasks but fail on mutated ones, highlighting brittleness in reasoning.
-
VisualThink-VLA: Visual Intermediate Reasoning for Effective and Low-Latency Vision-Language-Action Policies
VISUALTHINK-VLA uses visual evidence tokens and selective routing to reach top success rates on VLA benchmarks while cutting reasoning latency from multi-second to sub-second levels.
-
VLAConf: Calibrated Task-Success Confidence for Vision-Language-Action Models
VLAConf is a one-class discriminative method that estimates step-wise task-success confidence for VLA models via anomaly scoring on frozen representations plus step-conditioned modeling, shown to be more efficient than ensemble or probability baselines on LIBERO and real robots.
-
FineVLA: Fine-Grained Instruction Alignment for Steerable Vision-Language-Action Policies
FineVLA unifies robot datasets into 47k fine-grained trajectories, adds a VLM annotator and benchmark, and shows that mixing fine-grained and goal-level instructions improves steerable control without hurting task success.
-
Afford-VLA: Action-Aligned Visual Planning via Internalized Affordance
Afford-VLA internalizes task-conditioned affordance as an explicit visual planning interface within VLA models via learnable <AFF> tokens, achieving SOTA on LIBERO and SimplerEnv benchmarks.
-
Agentic-VLA: Efficient Online Adaptation for Vision-Language-Action Models
Agentic-VLA enables efficient online adaptation of VLA models, delivering +12.3% on long-horizon tasks, +28.5% in 1-shot learning, and 2.4x faster convergence on LIBERO through three new components.
-
Action with Visual Primitives
AVP architecture has VLM emit visual-primitive tokens to condition flow-matching action expert, yielding 27.61% higher success rate than pi_0.5 on real-robot pick-and-place tasks.
-
RoHIL: Robust Human-in-the-Loop Robotic Reinforcement Learning Against Illumination Variations
RoHIL adapts human-in-the-loop RL policies to new illumination conditions offline by combining world-model image relighting, illumination-retention replay, and anchored Bellman regularisation, improving shifted-light performance while preserving source performance on four real-robot tasks.
-
Mechanisms of Misgeneralization in Physical Sequence Modeling
Generative sequence models for physical tasks exhibit physical misgeneralization where local prediction errors propagate through physical measurements to distort aggregate distributions over quantities like distance or energy; a data deviation kernel explains and predicts the shifts and supports a内核
-
LACE: Latent Visual Representation for Cross-Embodiment Learning
LACE aligns human-robot visual features via semantic distribution matching on corresponding body parts plus Gram loss, yielding 65% better zero-shot policy transfer than baseline DINO.
-
Ada-Diffuser: Latent-Aware Adaptive Diffusion for Decision-Making
Ada-Diffuser is a causal diffusion model that jointly learns observed interaction structure and underlying latent dynamics from minimal observations for adaptive planning and policy learning.
-
UAM: A Dual-Stream Perspective on Forgetting in VLA Training
UAM adds a Dorsal Expert initialized from a generative model and trained on visual dynamics prediction to preserve over 95% of VLM multimodal ability in VLA training while achieving top success rates on manipulation tasks including OOD cases.
-
FrameSkip: Learning from Fewer but More Informative Frames in VLA Training
FrameSkip improves VLA policy training success from 66.50% to 76.15% by selecting high-importance frames and retaining only 20% of unique frames across three benchmarks.
-
Guide, Think, Act: Interactive Embodied Reasoning in Vision-Language-Action Models
GTA-VLA conditions VLA models on user spatial priors to produce a unified spatial-visual chain-of-thought, reaching 81.2% success on SimplerEnv WidowX and improving performance under out-of-distribution shifts.
-
Towards Long-horizon Embodied Agents with Tool-Aligned Vision-Language-Action Models
VLAs-as-Tools pairs a VLM planner with specialized VLA executors via a new interface and Tool-Aligned Post-Training to raise long-horizon robot success rates on LIBERO-Long and RoboTwin benchmarks.
-
MindVLA-U1: VLA Beats VA with Unified Streaming Architecture for Autonomous Driving
MindVLA-U1 is the first unified streaming VLA architecture that surpasses human drivers on WOD-E2E planning metrics while matching VA latency and preserving language interfaces.
-
HarmoWAM: Harmonizing Generalizable and Precise Manipulation via Adaptive World Action Models
HarmoWAM unifies predictive and reactive control in world action models via an adaptive gating mechanism to deliver improved zero-shot generalization and precision in robotic manipulation.
-
PriorVLA: Prior-Preserving Adaptation for Vision-Language-Action Models
PriorVLA preserves pretrained priors in VLA models through a frozen Prior Expert and trained Adaptation Expert, delivering better robot manipulation performance than full fine-tuning with only 25% of the parameter updates.
-
Unified Noise Steering for Efficient Human-Guided VLA Adaptation
UniSteer unifies human corrective actions and noise-space RL for VLA adaptation by inverting actions to noise targets, raising success rates from 20% to 90% in 66 minutes across four real-world manipulation tasks.
-
Retrieve-then-Steer: Online Success Memory for Test-Time Adaptation of Generative VLAs
A retrieve-then-steer method stores successful robot actions in memory and uses them to steer a frozen VLA's flow-matching sampler for better test-time reliability without parameter updates.
-
Kintsugi: Learning Policies by Repairing Executable Knowledge Bases
Kintsugi learns policies by repairing composable executable knowledge bases through agentic diagnosis, localized typed edits, and deterministic verification gates that admit only improvements.
-
Escaping the Diversity Trap in Robotic Manipulation via Anchor-Centric Adaptation
Anchor-Centric Adaptation escapes the diversity trap by prioritizing repeated demonstrations at core anchors over broad coverage, yielding higher success rates under fixed data budgets in robotic manipulation.
-
Toward Visually Realistic Simulation: A Benchmark for Evaluating Robot Manipulation in Simulation
VISER is a new visually realistic simulation benchmark for robot manipulation tasks that uses PBR materials and MLLM-assisted asset generation, achieving 0.92 Pearson correlation with real-world policy performance.
-
TriRelVLA: Triadic Relational Structure for Generalizable Embodied Manipulation
TriRelVLA introduces triadic object-hand-task relational representations and a task-grounded graph transformer with a relational bottleneck to improve generalization in robotic manipulation across scenes, objects, and tasks.
-
When Life Gives You BC, Make Q-functions: Extracting Q-values from Behavior Cloning for On-Robot Reinforcement Learning
Q2RL extracts Q-values from a BC policy and applies Q-gating to enable efficient offline-to-online RL, outperforming baselines on D4RL/robomimic tasks and achieving up to 100% success on real-robot manipulation in 1-2 hours.
-
ConsisVLA-4D: Advancing Spatiotemporal Consistency in Efficient 3D-Perception and 4D-Reasoning for Robotic Manipulation
ConsisVLA-4D adds cross-view semantic alignment, cross-object geometric fusion, and cross-scene dynamic reasoning to VLA models, delivering 21.6% and 41.5% gains plus 2.3x and 2.4x speedups on LIBERO and real-world tasks.
-
ReflectDrive-2: Reinforcement-Learning-Aligned Self-Editing for Discrete Diffusion Driving
ReflectDrive-2 combines masked discrete diffusion with RL-aligned self-editing to generate and refine driving trajectories, reaching 91.0 PDMS on NAVSIM camera-only and 94.8 in best-of-6.
-
An Efficient Metric for Data Quality Measurement in Imitation Learning
Power spectral density of trajectories ranks demonstration quality for imitation learning, enabling rollout-free curation that improves fine-tuned policy success.
-
Decompose and Recompose: Reasoning New Skills from Existing Abilities for Cross-Task Robotic Manipulation
Decompose and Recompose decomposes seen robotic demonstrations into skill-action alignments and recomposes them via visual-semantic retrieval and planning to enable zero-shot cross-task generalization.
-
VLA-ATTC: Adaptive Test-Time Compute for VLA Models with Relative Action Critic Model
VLA-ATTC equips VLA models with adaptive test-time compute via an uncertainty clutch and relative action critic, cutting failure rates by over 50% on LIBERO-LONG.
-
Learning While Deploying: Fleet-Scale Reinforcement Learning for Generalist Robot Policies
LWD is a fleet-scale offline-to-online RL framework that continually improves pretrained VLA policies using autonomous rollouts and human interventions, reaching 95% average success on real-world manipulation tasks.
-
$M^2$-VLA: Boosting Vision-Language Models for Generalizable Manipulation via Layer Mixture and Meta-Skills
M²-VLA shows that generalized VLMs can serve as direct backbones for robotic manipulation by selectively extracting task-critical features via Mixture of Layers and adding Meta Skill Modules for efficient trajectory learning.