WARP trains a reward model on time-warped successful demonstrations to produce frame-level progress estimates that upweight high-advantage chunks during behavior cloning, maintaining high success rates on suboptimal datasets where vanilla BC fails.
Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware
Canonical reference. 72% of citing Pith papers cite this work as background.
abstract
Fine manipulation tasks, such as threading cable ties or slotting a battery, are notoriously difficult for robots because they require precision, careful coordination of contact forces, and closed-loop visual feedback. Performing these tasks typically requires high-end robots, accurate sensors, or careful calibration, which can be expensive and difficult to set up. Can learning enable low-cost and imprecise hardware to perform these fine manipulation tasks? We present a low-cost system that performs end-to-end imitation learning directly from real demonstrations, collected with a custom teleoperation interface. Imitation learning, however, presents its own challenges, particularly in high-precision domains: errors in the policy can compound over time, and human demonstrations can be non-stationary. To address these challenges, we develop a simple yet novel algorithm, Action Chunking with Transformers (ACT), which learns a generative model over action sequences. ACT allows the robot to learn 6 difficult tasks in the real world, such as opening a translucent condiment cup and slotting a battery with 80-90% success, with only 10 minutes worth of demonstrations. Project website: https://tonyzhaozh.github.io/aloha/
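The abstract describes ACT as learning a generative model over action sequences rather than single actions. As a rough illustration of the chunking idea only (not the authors' implementation), the sketch below queries a policy once per chunk and executes the predicted actions open-loop, so a rollout of a given length needs fewer policy decisions. The names `rollout_with_chunks`, `toy_policy`, and `toy_env_step` are hypothetical stand-ins, and the 1-D toy dynamics are an assumption made purely for demonstration.

```python
import numpy as np


def rollout_with_chunks(policy, env_step, obs, horizon, chunk_size):
    """Run `horizon` steps, querying the policy once per chunk of actions."""
    actions_taken = []
    num_queries = 0
    t = 0
    while t < horizon:
        # One policy call yields `chunk_size` future actions.
        chunk = policy(obs, chunk_size)  # shape: (chunk_size, action_dim)
        num_queries += 1
        # Execute the chunk open-loop, truncating at the end of the episode.
        for action in chunk[: horizon - t]:
            obs = env_step(obs, action)
            actions_taken.append(action)
            t += 1
    return np.array(actions_taken), num_queries


def toy_policy(obs, chunk_size):
    """Hypothetical policy: constant steps that push a 1-D state toward 0."""
    return np.full((chunk_size, 1), -0.1 * np.sign(obs))


def toy_env_step(obs, action):
    """Hypothetical dynamics: the state is shifted by the action."""
    return obs + float(action[0])


if __name__ == "__main__":
    for k in (1, 4):
        actions, queries = rollout_with_chunks(
            toy_policy, toy_env_step, obs=1.0, horizon=10, chunk_size=k
        )
        print(f"chunk_size={k}: {len(actions)} actions from {queries} policy queries")
```

With a chunk size of 4, the 10-step toy rollout needs 3 policy queries instead of 10. Fewer decision points over the same task duration is the intuition behind chunking as a way to limit compounding errors; the real system's model architecture and training procedure are not represented here.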
representative citing papers
ForesightSafety-VLA creates a diagnostic benchmark for VLA safety with taxonomy across physical, language, and visual risks, showing perception and structure variations cause more safety degradation than language changes in tested models.
NAC adapts multi-scale RVQGAN audio codecs with kinematic-specific losses to produce ordered action tokens that yield lower reconstruction error and higher task success than prior tokenizers in VLA models.
Geometric diversity of demonstration trajectories exhibits an inverted-U effect on imitation learning success, with the peak shifting lower as mastery increases via more data, easier tasks, or stronger priors.
FAFM performs flow matching in the frequency domain using DCT on action sequences to produce continuous temporally consistent robotic actions with a Sobolev-style smoothness regularizer.
PAINT reframes asynchronous flow-based action chunking as an initial noise selection problem solved via backward Euler inversion and a repainting rule.
Sleeping Robots uses frozen critics and actor snapshots as compact memories to define surrogate objectives combined via Nash bargaining for offline consolidation of shared robot policies in sequential skill learning.
Transformer warm-starts cut SCP iterations by up to 28% and runtime by 23% for space manipulator terminal guidance while preserving cost distributions and improving feasibility projection robustness.
FTP-1 is the first foundation tactile policy pretrained on ~3000 hours of data from 26 sources across 21 sensors that improves performance on seen setups by 17.2% and transfers to unseen sensors with 31% success rate gain.
Ambient Diffusion Policy enables better imitation learning from suboptimal robot data by leveraging spectral properties to restrict data usage to specific diffusion times.
DEHP adds an online-RL horizon predictor to frozen chunk policies, yielding higher success on precise and long-horizon robot manipulation by adapting chunk length to task stage.
X-Tokenizer creates semantic action tokens via asymmetric residual quantization and contrastive pretraining on large trajectory data, outperforming prior methods like FAST on robotic tasks.
ActProbe is an action-space detector that uses temporal consistency error and action chunk magnitude from policy outputs, mapped via LSTM-MLP, to predict failures earlier than baselines across policies and real-robot tasks.
Affordance2Action introduces A2A-Bench, a manipulation-oriented benchmark for scene-level task-conditioned affordance grounding covering single- and multi-region correspondences, plus an annotation pipeline, and reports gaps in existing segmentation and VLM baselines.
DVAC uses denoising variance as an intrinsic signal to adaptively chunk actions in flow-based robot policies, improving success rates and cutting replans on LIBERO, RoboTwin, CALVIN, and real-world tasks.
The paper identifies a deployment safety gap in VLA policies where identical checkpoints can be executable-inequivalent due to action metadata mismatches, supported by a derived closed-form transform and empirical drift measurements on LIBERO benchmarks.
PhAIL provides an open benchmark and distributional evaluation method for real-robot VLA policies using time-to-success CDF, HRT scoring, and KS significance tests.
VLA architectures exhibit architecture-specific failure signatures at the motor-command level, with direction reversal as a universal predictor and velocity monitoring ineffective for continuous models.
Soft RTC uses partially denoised states for overlap tokens and token-wise blending to reduce action delta and jerk by ~9% versus hard RTC while matching solve rates on Kinetix levels.
JOPAT jointly models pixels, point tracks, and actions in a diffusion transformer and reports gains over pixel-only baselines on long-horizon robot tasks with occlusion and off-screen motion.
The paper identifies distinct failure mechanisms: excessive posterior-prior regularization erases mode information in latent policies, while smooth base-to-action maps limit mode coverage in generative policies.
RoboFlow4D is an end-to-end lightweight flow world model that predicts multi-frame 3D flows from visual observations and textual instructions to provide explicit planning for real-time robotic manipulation.
DSSP is a history-conditioned diffusion state space policy that uses SSMs to encode full observation streams with an auxiliary dynamics objective and hierarchical fusion, achieving SOTA results with reduced model size in robot manipulation.
A new speculative inference system speeds up diffusion VLAs to 19.1 ms average latency (3.04x faster) on LIBERO by replacing most full 58 ms inferences with 7.8 ms draft rounds while preserving task performance.
citing papers explorer
-
AutoSpeed: Annotation-Free Stage-Adaptive Motion Speed Learning for Robot Manipulation
AutoSpeed optimizes visuomotor policies over candidate trajectories at varying speeds using a composite cost of prediction error versus horizon length, with DCT-based modulation, yielding shorter execution times and higher success rates while producing speeds that align with task stages.
-
UniTacVLA: Unified Tactile Understanding and Prediction in Vision Language Action Models
UniTacVLA builds a state-aware and dynamics-aware tactile prior via unified latent space, tactile chain-of-thought, and mixed real/predicted feedback controller to boost dexterous manipulation performance.
-
TactX: Learning Shared Tactile Representations Across Diverse Sensors
TactX learns a shared latent representation across three tactile sensor modalities via joint training on paired contacts, enabling zero-shot policy transfer and higher success on pick-and-place, insertion, wiping, and reorientation tasks.
-
ReactiveBFM: Reactive Closed-Loop Motion Planning Towards Universal Humanoid Whole-Body Control
ReactiveBFM introduces a real-time closed-loop planning-control system for humanoids using curriculum-based error recovery and asynchronous replanning, achieving 93.1% success under severe perturbations in sim-to-sim tests.
-
Chronos: A Physics-Informed Full-History Framework for Non-Markovian Long-Horizon Manipulation
Chronos elevates full observation history to the policy's latent state via selective SSM tokens and a Schrödinger-inspired acceleration bridge, achieving large gains on memory-dependent robot tasks with fewer parameters.
-
SA-VLA: State-aware tokenizer for improving Vision-Language-Action Models' performance
SA-VLA adds state conditioning to VQ-based action tokenization in VLA policies, expanding each discrete token's effective support to state-dependent actions and raising average success rates from 0.29 to 0.56 on 12 sim tasks and 0.15 to 0.33 on 3 real tasks.
-
Critical Interval MSE: Toward Reliable Offline Validation for Robot Manipulation Policies
CI-MSE improves Spearman's rank correlation between offline validation error and real rollout performance from -0.61 (raw MSE) to -0.87 across policy checkpoints in simulation and real-world robot manipulation experiments.
-
Trust Your Instincts: Confidence-Driven Test-Time RL for Vision-Language-Action Models
T^2VLA is a test-time reinforcement learning framework for VLAs that uses internal confidence to define intrinsic rewards via similarity to high-confidence expert demonstrations and a dual-expert bootstrapping mechanism.
-
Hierarchical Policy Learning via Spectral Decomposition
Causal Spectral Policy decomposes actions spectrally into coarse motion from obs/language and conditional fine corrections, outperforming baselines on precision manipulation tasks.
-
The Speedup Paradox: Rethinking Inference Speed-Quality Trade-off in Embodied Tasks
TISED decomposes inference optimization effects on embodied tasks and identifies paradoxical outcomes where faster per-step inference can increase task completion time on static tasks or raise success rates on dynamic tasks.
-
DIM-WAM: World-Action Modeling with Diverse Historical Event Memory
DiM-WAM is a memory-augmented world-action model that integrates multi-scale historical events and global task progress to improve long-horizon robot manipulation performance.
-
PAMAE: Phase-Aware-MoE Action Experts Towards Reliable Flow-Matching Vision-Language-Action Policies
PAMAE adds a phase-aware router and expert mixture to flow-matching VLA models, yielding up to 9.2% higher task success on multi-stage manipulation simulations via two-stage training.
-
Improving Vision-Language-Action Model Fine-Tuning with Structured Stage and Keyframe Supervision
StaKe adds lightweight auxiliary heads for manipulation stage identification and next-gripper-transition keyframe prediction to VLA fine-tuning, reporting relative success rate gains of 14% in bimanual simulation and 56% on single-arm real-robot tasks.
-
Tactile-WAM: Touch-Aware World Action Model with Tactile Asymmetric Attention
Tactile-WAM with TAAM improves mean success rate by 38.9% overall and 86% on contact-rich tasks on ManiFeel by using VideoClean mask and touch-aware bias to prevent tactile pollution in world action models.
-
Foresight: Failure Detection for Long-Horizon Robotic Manipulation with Action-Conditioned World Model Latents
Foresight detects failures in long-horizon robotic manipulation using latents from action-conditioned world models trained only on task-level labels and calibrated via functional conformal prediction.
-
Verifiable Foundation Models for Robot Safety
FEARL decomposes robot policies into an expressive Controller and a small verifiable Safety module to enable formal verification of safety constraints while retaining foundation-model task performance.
-
ARP: Enhancing Quantized Skill Abstractions via Visual Alignment and Iterative Refinement for Robotic Manipulation
ARP enhances quantized skill abstractions in imitation learning by coupling visual grounding via contrastive alignment with execution refinement via IRH, reporting SOTA results on LIBERO, Meta-World, and real-robot tasks.
-
RARM: Confidence-Gated Progress Reward Modeling for RL in Manipulation
RARM is a lightweight visual comparator trained once on general videos that supplies dense progress rewards to RL by matching rollout clips to a reference demonstration and gating rewards on match confidence.
-
Imitation from Heterogeneous Demonstrations using Grounded Latent-Action World Models
GLAM learns a shared latent action space grounded in consistent future observation prediction across heterogeneous data sources to train improved behavioral cloning policies for robot manipulation tasks.
-
Remember what you did?: Learning Behavioral Memories for Partially Observable Object Manipulation
CAMP learns a compressed behavioral memory from action history to enable success in long-horizon partially observable object manipulation without extra supervision, showing gains over baselines in real-robot and simulation tests.
-
Vesta: A Generalist Embodied Reasoning Model
Vesta is a unified embodied generalist model that outperforms specialist baselines by over 20% on average and improves real-world robotic task success by over 35%.
-
Finetuning Vision-Language-Action Models Requires Fewer Layers Than You Think
VLA models exhibit layer-wise redundancy allowing up to 50% depth compression via training-free CKA-based removal, yielding faster fine-tuning and inference with no performance loss on robot tasks.
-
Belt-Finger: An Affordable Soft Belt-Driven Gripper for Dexterous In-Hand Manipulation
A double-soft-belt finger module adds translation, pitch, and roll to parallel grippers for improved in-hand manipulation at low cost.
-
EventVLA: Event-Driven Visual Evidence Memory for Long-Horizon Vision-Language-Action Policies
EventVLA introduces foundational visual anchors and a Keyframe Evidence Memory module that predicts future keyframe probabilities from VLA embeddings to improve long-horizon task success by an average of 40% on 17 simulation and 4 real-world tasks.
-
Tri-Info: Generalizable, Interpretable Failure Prediction for VLA Models via Information Theory
Tri-Info uses three information theory signals on action diversity, temporal consistency, and state coupling to predict VLA model failures with cross-domain generalization to 83% real-world accuracy.
-
One Demo is Worth a Thousand Trajectories: Action-View Augmentation for Visuomotor Policies
A framework augments single fisheye demonstrations into multiple novel-view trajectories with obstacles via fisheye-adapted Gaussian Splatting and trajectory optimization, raising policy success rates in original and modified scenes.
-
Motion-Focused Latent Action Enables Cross-Embodiment VLA Training from Human EgoVideos
A Hybrid Disentangled VQ-VAE with physical masks creates a cross-embodiment action codebook from human videos, allowing VLA pre-training that adapts to new embodiments with only 50 trajectories.
-
Contrastive Action-Image Pre-training for Visuomotor Control
CAIP learns action-aligned visual representations via contrastive pre-training on human hand keypoints from egocentric video, outperforming DINOv2, SigLIP, MVP, and R3M with >30% gains on real dexterous manipulation tasks.
-
An Embodied Simulation Platform, Benchmark, and Data-Efficient Augmentation Framework for Wet-Lab Robotics
Pipette supplies an open wet-lab simulation platform, 11-task benchmark, and perturbation-based augmentation pipeline that raises VLA success rates on sample handling and device tasks from limited demonstrations.
-
Blind Dexterous Grasping via Real2Sim2Real Tactile Policy Learning
Real2Sim tactile calibration, layout-aware encoder pretraining, and diffusion policy aggregation from object-specific RL experts enable 27% real-world success in blind grasping on a LEAP Hand for 10 seen and 10 unseen objects.
-
Hand-centric Human-to-Robot Trajectory Transfer from Video Demonstrations via Open-World Contact Localization
HOWTransfer recovers 3D hand motion from video, localizes contact intervals via hand-object cues, generates multi-modal grasp hypotheses, and edits trajectories to produce diverse robot-executable motions achieving 86% success.
-
GHOST: Hierarchical Sub-Goal Policies for Generalizing Robot Manipulation
GHOST improves generalization in robot manipulation via hierarchical factorization into 3D sub-goal prediction from RGB-D views and a goal-conditioned low-level controller, enabling human video integration without action retargeting.
-
MotionWAM: Towards Foundation World Action Models for Real-Time Humanoid Loco-Manipulation
MotionWAM conditions a policy on intermediate features from a video world model to predict unified whole-body motion tokens, enabling real-time humanoid loco-manipulation that outperforms VLA baselines by over 30% on nine Unitree G1 tasks.
-
RGB-S: Image-Aligned Tactile Saliency for Robust Dexterous Manipulation
RGB-S projects tactile contacts onto images as force-modulated Gaussian saliency maps via kinematics and zero-initialized conditioning, raising real-world occluded dexterous manipulation success by 26.7 percentage points over implicit baselines.
-
SIMPLE: Simulation-Based Policy Learning and Evaluation for Humanoid Loco-manipulation
SIMPLE is a new large-scale simulation benchmark for humanoid loco-manipulation that integrates accurate dynamics and photorealistic rendering and demonstrates policy transfer from simulation to physical robots.
-
SynthICL: Scalable In-context Imitation Learning with Synthetic Data
SynthICL trains flow-matching transformer policies for in-context imitation learning entirely from synthetic RGB data and reports 79% average success on 16 unseen real manipulation tasks with one test-time demonstration.
-
vla.cpp: A Unified Inference Runtime for Vision-Language-Action Models
vla.cpp is a unified C++ runtime that serves multiple VLA architectures with flow-matching and diffusion patterns, matching SOTA performance on LIBERO while running on low-memory embedded hardware.
-
Flow-based Policy Adaptation without Policy Updates
GLOVES learns flow models from limited expert demonstrations to selectively correct actions from non-expert policies or operators toward expert distributions using reverse-flow OOD detection as an intervention gate.
-
Double Preconditioning (DoPr): Optimization for Test-Time Performance, not Validation Loss
Double preconditioning (DoPr) improves downstream task performance in test-time feedback settings without consistent gains in validation loss.
-
CLAW: Learning Continuous Latent Action World Models via Adversarial Latent Regularization
CLAW is an end-to-end self-supervised method that learns semantically meaningful continuous latent actions and predictive world models from action-free videos to support imitation learning and goal-directed planning.
-
Revisiting Embodied Chain-of-Thought for Generalizable Robot Manipulation
ERVLA trains on a 978k-trajectory embodied CoT corpus using reasoning as supervision with dropout, then predicts actions without CoT at test time, reaching 86.9% on LIBERO-Plus and 53.2% on VLABench.
-
OpenEAI-Platform: An Open-source Embodied Artificial Intelligence Hardware-Software Unified Platform
OpenEAI-Platform delivers an open-source low-cost robotic arm and VLA model that outperforms commercial arms and matches large pretrained baselines on four real-world manipulation tasks using limited open data.
-
Policy-based Foveated Imaging and Perception
A task-aware policy learned via reinforcement learning allocates high-resolution pixels on dual-stream sensors in real time, outperforming fixed or non-predictive baselines under tight pixel budgets in both simulation and 200 MP hardware tests.
-
Intercepting the Future: Latent-Space Predictive World Model for Dynamic VLA Manipulation
AHEAD augments frozen VLAs with a 4.9M-parameter latent world model that forecasts future visual features using optical-flow motion cues, achieving 79-97% success on dynamic simulation tasks and high real-robot success rates where baselines score near zero.
-
Expanding Spatial and Temporal Context for Robotic Imitation Learning With Scene Graphs
Dynamic scene graphs serve as explicit memory to improve imitation learning policies for spatial-temporal reasoning under partial observability in mobile and tabletop manipulation.
-
PACE: Phase-Aware Chunk Execution for Robot Policies with Action Chunking
PACE dynamically selects execution horizons for action chunks in robot policies by detecting low-speed transition points in predicted speed profiles, raising success rates from 57.8% to 64.2% on 50 simulation tasks and from 50.7% to 70.4% in real-robot tests.
-
Continuous Reasoning for Vision-Language-Action
Continuous Reasoning for VLA introduces a shared Gaussian latent for continuous thoughts, trained with self-verification to improve action prediction on LIBERO-PRO and real robots.
-
Any-ttach: Quick End-effector Swapping Enables Manipulation Dexterity with Simplicity
Any-ttach shows that rapid end-effector swapping combined with demonstration collection and task planning enables reliable multi-tool skills in long-horizon tasks such as sandwich making.
-
RoboWits: Unexpected Challenges for Robotic Creative Problem Solving
RoboWits benchmark with 238 tasks shows pre-trained VLAs succeed on seed tasks but fail on mutated ones, highlighting brittleness in reasoning.
-
Mag-VLA: Vision-Language-Action Model for Bimanual Magnetically Actuated Microrobot Manipulation
Mag-VLA uses a LoRA-adapted Qwen2.5-VL-7B with a phase classifier and ACT decoder on a new teleoperated dataset to reach 90% approach and 50-80% transport success in bimanual magnetic microrobot tasks.