$\pi_0$: A Vision-Language-Action Flow Model for General Robot Control
Canonical reference. 72% of citing Pith papers cite this work as background.
abstract
Robot learning holds tremendous promise to unlock the full potential of flexible, general, and dexterous robot systems, as well as to address some of the deepest questions in artificial intelligence. However, bringing robot learning to the level of generality required for effective real-world systems faces major obstacles in terms of data, generalization, and robustness. In this paper, we discuss how generalist robot policies (i.e., robot foundation models) can address these challenges, and how we can design effective generalist robot policies for complex and highly dexterous tasks. We propose a novel flow matching architecture built on top of a pre-trained vision-language model (VLM) to inherit Internet-scale semantic knowledge. We then discuss how this model can be trained on a large and diverse dataset from multiple dexterous robot platforms, including single-arm robots, dual-arm robots, and mobile manipulators. We evaluate our model in terms of its ability to perform tasks in zero shot after pre-training, follow language instructions from people and from a high-level VLM policy, and its ability to acquire new skills via fine-tuning. Our results cover a wide variety of tasks, such as laundry folding, table cleaning, and assembling boxes.
citing papers explorer
-
HABIT: Human-Aware Behavior and Interaction Training Dataset for Robot Manipulation
HABIT is a large-scale robot demonstration dataset for human-present environments that elicits spatiotemporal synchronization, yielding, and gesture grounding behaviors absent from robot-only training data.
-
Test-time Adversarial Takeover: A Real-time Hijacking Interface against Robotic Diffusion Policies
TAKO demonstrates real-time adversarial takeover of robotic diffusion policies via reusable universal patches on visual inputs, achieving 100% success in steering attacker-chosen trajectories across multiple tasks, encoders, and diffusion methods.
-
Where to Look: Can Foundation Models Reach a Target Viewpoint Through Active Exploration?
Introduces the TVR active viewpoint-matching task and TVRBench indoor simulation benchmark, where foundation models start at low single-digit success rates but reach 51.4% after visual-action SFT and multi-turn GRPO post-training.
-
TAVIS: A Benchmark for Egocentric Active Vision and Anticipatory Gaze in Imitation Learning
TAVIS is a released benchmark showing active vision improves imitation learning in a task-dependent manner, multi-task policies struggle with shifts, and imitation produces human-like anticipatory gaze.
-
Membership Inference Attacks on Vision-Language-Action Models
Vision-language-action models are highly vulnerable to membership inference attacks, including practical black-box versions that exploit generated actions and motion trajectories.
-
Adversarial Imitation Learning with General Function Approximation: Theoretical Analysis and Practical Algorithms
OPT-AIL provides the first provably efficient adversarial imitation learning algorithms under general function approximation, achieving polynomial expert sample and interaction complexity.
-
RoboLab: A High-Fidelity Simulation Benchmark for Analysis of Task Generalist Policies
RoboLab is a new simulation benchmark with 120 tasks across visual, procedural, and relational axes that quantifies generalization gaps and perturbation sensitivity in task-generalist robotic policies.
-
OopsieVerse: A Safety Benchmark with Damage-Aware Simulation for Robot Manipulation
OOPSIEVERSE is a new damage-aware simulation benchmark for household robot manipulation that converts contact, thermal, and fluid signals into task-agnostic damage metrics and demonstrates uses in safer policy learning and benchmarking.
-
Adapting Generalist Robot Policies with Semantic Reinforcement Learning
SARL optimizes language prompt inputs to generalist vision-language-action policies through online RL to solve complex long-horizon tasks by composing existing skills.
-
Labimus: A Simulation and Benchmark for Humanoid Dexterous Manipulation in Chemical Laboratory
Labimus is the first benchmark for humanoid dexterous manipulation in organic chemistry laboratories, exposing a gap between task completion and required experimental precision.
-
Pondering the Way: Spatial-perceiving World Action Model for Embodied Navigation
SWAM jointly generates intermediate RGB-D sequences and action trajectories from monocular RGB start/goal observations for embodied navigation.
-
SurgVLA-Bench: Towards Evaluating Vision-Language-Action Models for Laparoscopic Surgical Robotics
SurgVLA-Bench supplies a hierarchical task taxonomy and multi-dimensional evaluation framework for VLA models in laparoscopic robotics simulation, showing autoregressive models excel at semantics while flow-matching models achieve higher precision but all fall short due to endoscopic view constraint
-
ForesightSafety-VLA: A Unified Diagnostic Safety Benchmark for Vision-Language-Action Models
ForesightSafety-VLA creates a diagnostic benchmark for VLA safety with taxonomy across physical, language, and visual risks, showing perception and structure variations cause more safety degradation than language changes in tested models.
-
LIBERO-Safety: A Comprehensive Benchmark for Physical and Semantic Safety in Vision-Language-Action Models
LIBERO-Safety supplies a scalable benchmark, data-generation pipeline, and 19,664-demonstration dataset that exposes a generalization-safety tension in current VLA models where diverse training improves collision avoidance but task success stays limited by trajectory quality and semantic understandi
-
Dynamic Execution Horizon Prediction for Chunk-based Robot Policies
DEHP adds an online-RL horizon predictor to frozen chunk policies, yielding higher success on precise and long-horizon robot manipulation by adapting chunk length to task stage.
-
LIBERO-Occ: Evaluating and Improving Vision-Language-Action Models under Scene-Induced Occlusion via Viewpoint Imagination
Introduces the LIBERO-Occ benchmark, which shows that VLA performance drops under occlusion, and the Viewpoint Imagination method, which generates complementary views to improve robustness without extra hardware.
-
UMI-Bench 1.0: An Open and Reproducible Real-World Benchmark for Tabletop Robotic Manipulation with UMI Data
UMI-Bench 1.0 is presented as the first open benchmark dedicated to reproducible real-world evaluation of Universal Manipulation Interface policies.
-
ProbeAct: Probe-Guided Training-Free Failure Recovery in Vision-Language-Action Models
PROBEACT is a plug-and-play intervention framework that combines hidden-state probing, kinematic failure detection, and CBF-based correction to boost success rates of pre-trained VLA models on the LIBERO-plus benchmark from 69.6% to 74.1%.
-
ReCoVLA: VLM-Guided Reward Compilation for Failure Recovery in Vision-Language-Action Policies
ReCoVLA improves VLA policy reliability by using a VLM as a semantic reward selector to train residual recovery policies in simulation, raising average success from 36.7% to 66.7% in sim and achieving 61.7% in zero-shot sim-to-real physical tests.
-
Targeting World Models to Compromise Robot Learning Pipelines
World models introduce a stealthy poisoning vector into robot learning pipelines where malicious prompts or dynamics in teleoperated data activate only during synthetic trajectory generation, enabling backdoors in downstream policies.
-
Back to the Familiar Future: Failure Recovery for VLA Policies via Pre-Imagined Milestone Selection
B2FF pre-generates a milestone bank of familiar future states from the clean initial observation and uses a recoverability-aware selector to guide VLA policies back from deviations, raising average success rate from 56.3% to 74.0% on failure-injected LIBERO.
-
PhysAgent: Automating Physics-Based 4D Synthesis via Trajectory-Grounded Multi-Agent Feedback
PhysAgent is a simulator-in-the-loop multi-agent system that automates physically grounded 4D synthesis from multimodal prompts by using trajectory feedback from vision models and LLM reasoning to optimize force fields.
-
Reinforcement Learning for Flow-Matching Policies with Density Transport
RLDT fine-tunes pretrained flow-matching policies for continuous control by aligning them to a max-entropy RL transport field constructed via SVGD, using expected-target estimation for stable multi-step updates.
-
X-Tokenizer: A Multimodal Action Tokenizer for Vision-Language-Action Pretraining
X-Tokenizer creates semantic action tokens via asymmetric residual quantization and contrastive pretraining on large trajectory data, outperforming prior methods like FAST on robotic tasks.
-
ActProbe: Action-Space Probe for Early Failure Detection of Generative Robot Policies
ActProbe is an action-space detector that uses temporal consistency error and action chunk magnitude from policy outputs, mapped via LSTM-MLP, to predict failures earlier than baselines across policies and real-robot tasks.
-
Q-VGM: Q-Guided Value-Gradient Matching for Flow-Matching VLA Policies
Q-VGM introduces value-gradient matching via VGG-Flow to improve flow-matching VLA policies with a Cal-QL critic, achieving success rate lifts on LIBERO, RoboTwin, and real-robot tasks.
-
Beyond Waypoints: A Trajectory-Centric Waypointing Paradigm for Vision-Language Navigation
The paper introduces a Trajectory Waypoint paradigm with a TSDF-guided diffusion policy and trajectory-enhanced navigator that achieves better performance on VLN-CE benchmarks by ensuring waypoint reachability and planning-execution consistency.
-
ActionMap: Robot Policy Learning via Voxel Action Heatmap
ActionMap introduces a voxel heatmap action head for VLA models that improves policy learning by exploiting geometric structure in the action space.
-
PiL-World: A Chunk-Wise World Model for VLA Policy-in-the-Loop Evaluation
PiL-World introduces a chunk-wise world model for closed-loop VLA policy evaluation that reduces the gap between simulated and real success rates from 63.2% to 12.0% on three dual-arm manipulation tasks by conditioning on action-derived visual control and latent histories while training on both succ
-
NextMotionQA: Benchmarking and Judging Human Motion Understanding with Vision-Language Models
NextMotionQA benchmark reveals VLMs have critical gaps in fine-grained human motion understanding and align with experts on coarse judgment (κ=0.70) but not fine-grained (κ=0.10).
-
Benchmarking Visual State Tracking in Multimodal Video Understanding
VSTAT benchmark shows state-of-the-art MLLMs perform far below humans and only modestly above answer-prior baselines on visual state tracking, failing at visual perception despite correct textual reasoning.
-
TTT-VLA: Test-Time Latent Prompt Optimization for Vision-Language-Action Models
TTT-VLA performs test-time training for VLA models by optimizing only a latent prompt on new interaction data via a proxy self-supervised signal, yielding higher task success rates on SimplerEnv in single- and multi-embodiment settings.
-
PhAIL: A Real-Robot VLA Benchmark and Distributional Methodology
PhAIL provides an open benchmark and distributional evaluation method for real-robot VLA policies using time-to-success CDF, HRT scoring, and KS significance tests.
-
Imitation Learning for Robot Assistance in Open Surgery: A Multi-Policy Evaluation on Suture Following
Benchmarking ACT, Diffusion Policy, SmolVLA, and π0 on suture following yields 50-75% success under ideal conditions and 92% stitch completion with π0 in a surgeon-robot trial.
-
How VLAs Fail Differently: Black-Box Action Monitoring Reveals Architecture-Specific Failure Signatures
VLA models exhibit architecture-specific failure signatures at the motor-command level, with direction reversal as a universal predictor and velocity monitoring ineffective for continuous models.
-
Can VLA Models Learn from Real-World Data Continually without Forgetting?
VLA models exhibit catastrophic forgetting on a new real-world dataset of four sequential manipulation tasks; the paper evaluates how experience-replay implementation factors affect mitigation.
-
Point Tracking Improves World Action Models
JOPAT jointly models pixels, point tracks, and actions in a diffusion transformer and reports gains over pixel-only baselines on long-horizon robot tasks with occlusion and off-screen motion.
-
GesVLA: Gesture-Aware Vision-Language-Action Model Embedded Representations
GesVLA encodes gesture features directly into the latent space of VLA models using a dual-VLM architecture and a rendering-based data pipeline, yielding improved target grounding in real robotic tasks.
-
Factored Diffusion Policies: Compositionally Generalized Robot Control with a Single Score Network
A single diffusion policy network with per-factor null-token dropout enables additive score composition for robot control under conditional independence, with a trajectory-tube certificate, shown to generalize on drone racing tasks.
-
Understanding Multimodal Failure in Action-Chunking Behavioral Cloning
The paper identifies distinct failure mechanisms: excessive posterior-prior regularization erases mode information in latent policies, while smooth base-to-action maps limit mode coverage in generative policies.
-
EvoScene-VLA: Evolving Scene Beliefs Inside the Action Decoder for Chunked Robot Control
EvoScene-VLA maintains an action-updated scene prior across control chunks in VLA policies, raising success rates on RoboTwin tasks from 87.2% to 89.1% fixed and 86.1% to 88.5% randomized while outperforming baselines on a real robot.
-
CrossVLA: Cross-Paradigm Post-Training and Inference Optimization for Vision-Language-Action Models
CrossVLA develops a surrogate log-probability estimator for DPO on flow-matching VLAs, shows DoRA outperforming LoRA by +10.4 pp mean on LIBERO, and identifies inference bottlenecks with limited caching gains.
-
DISC: Decoupling Instruction from State-Conditioned Control via Policy Generation
A hypernetwork generates complete task-specific visuomotor policy parameters from instructions alone to structurally eliminate observation leakage in language-conditioned robotic control.
-
Rethinking Muon Beyond Pretraining: Spectral Failures and High-Pass Remedies for VLA and RLVR
Pion modifies Muon's Newton-Schulz iterations into a controllable high-pass filter that anchors dominant singular values at 1 while suppressing noisy tails, outperforming Muon and AdamW in VLA and RLVR regimes.
-
Dexora: Open-source VLA for High-DoF Bimanual Dexterity
Dexora is the first open-source VLA system for dual-arm dual-hand high-DoF manipulation, trained on 100K simulated and 10K real teleoperated trajectories with a discriminator-weighted diffusion policy, achieving 66.7% dexterous success versus 51.7% for baselines.
-
RoboFlow4D: A Lightweight Flow World Model Toward Real-Time Flow-Guided Robotic Manipulation
RoboFlow4D is an end-to-end lightweight flow world model that predicts multi-frame 3D flows from visual observations and textual instructions to provide explicit planning for real-time robotic manipulation.
-
Event-Grounded Sparse Autoencoders for Vision-Language-Action Policies
Event-grounded SAE analysis in VLA policies produces stronger causal effects on robot behavior than standard methods by anchoring features to clustered end-effector keyframes across simulations and real-robot tests.
-
Learn Where Outcomes Diverge: Efficient VLA RL via Probabilistic Chunk Masking
PCM uses success-failure action variance to probabilistically select and mask chunks for gradient updates in GRPO, matching standard success rates with 2.38x wall-clock speedup and 60% lower memory on LIBERO benchmarks.
-
SkiP: When to Skip and When to Refine for Efficient Robot Manipulation
SkiP introduces action relabeling and Motion Spectrum Keying to skip redundant steps in robot trajectories, cutting executed steps by 15-40% while maintaining success rates across 72 simulated and 3 real tasks.
-
Realtime-VLA FLASH: Speculative Inference Framework for Diffusion-based VLAs
A new speculative inference system speeds up diffusion VLAs to 19.1 ms average latency (3.04x faster) on LIBERO by replacing most full 58 ms inferences with 7.8 ms draft rounds while preserving task performance.