Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware

Chelsea Finn, Sergey Levine, Tony Z. Zhao, Vikash Kumar · 2023 · cs.RO · arXiv 2304.13705

Canonical reference. 72% of citing Pith papers cite this work as background.

264 Pith papers citing it

Background: 72% of classified citations

abstract

Fine manipulation tasks, such as threading cable ties or slotting a battery, are notoriously difficult for robots because they require precision, careful coordination of contact forces, and closed-loop visual feedback. Performing these tasks typically requires high-end robots, accurate sensors, or careful calibration, which can be expensive and difficult to set up. Can learning enable low-cost and imprecise hardware to perform these fine manipulation tasks? We present a low-cost system that performs end-to-end imitation learning directly from real demonstrations, collected with a custom teleoperation interface. Imitation learning, however, presents its own challenges, particularly in high-precision domains: errors in the policy can compound over time, and human demonstrations can be non-stationary. To address these challenges, we develop a simple yet novel algorithm, Action Chunking with Transformers (ACT), which learns a generative model over action sequences. ACT allows the robot to learn 6 difficult tasks in the real world, such as opening a translucent condiment cup and slotting a battery with 80-90% success, with only 10 minutes worth of demonstrations. Project website: https://tonyzhaozh.github.io/aloha/
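The abstract's idea of predicting action sequences rather than single actions can be illustrated with a minimal sketch. This is not the authors' implementation: `rollout_with_chunks`, `toy_policy`, and `toy_step` are hypothetical names, the environment is a one-dimensional toy, and the learned generative model is replaced by a hand-written stand-in. It only shows the execution pattern, in which the policy is queried once per chunk of k actions, so an episode of a given length involves fewer policy decisions at which errors can compound.

```python
from typing import Callable, List, Sequence, Tuple

Observation = float
Action = float


def rollout_with_chunks(
    policy: Callable[[Observation], Sequence[Action]],
    step: Callable[[Observation, Action], Observation],
    obs: Observation,
    horizon: int,
) -> Tuple[List[Action], int]:
    """Run `horizon` steps, querying the policy only when the current chunk is used up."""
    executed: List[Action] = []
    policy_calls = 0
    chunk: List[Action] = []
    for _ in range(horizon):
        if not chunk:
            # Predict the next sequence of actions from the latest observation.
            chunk = list(policy(obs))
            policy_calls += 1
        action = chunk.pop(0)
        obs = step(obs, action)
        executed.append(action)
    return executed, policy_calls


def toy_policy(obs: Observation, k: int = 4, target: float = 1.0) -> List[Action]:
    # Hypothetical stand-in for a learned model: k equal steps toward `target`.
    delta = (target - obs) / k
    return [delta] * k


def toy_step(obs: Observation, action: Action) -> Observation:
    # Hypothetical toy dynamics: the action is added to the state.
    return obs + action


actions, calls = rollout_with_chunks(toy_policy, toy_step, obs=0.0, horizon=8)
print(len(actions), calls)
```

With a chunk size of 4 and a horizon of 8, the toy policy is queried twice instead of eight times.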

citation-role summary

background 36 · method 8 · baseline 4 · dataset 1 · other 1
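The 72% background share quoted near the top of the page can be checked against these role counts. The short sketch below assumes that the five counts listed here make up the full set of classified citations; the variable names are illustrative only.

```python
# Role counts copied from the citation-role summary above.
role_counts = {"background": 36, "method": 8, "baseline": 4, "dataset": 1, "other": 1}

# Assumption: these five roles are all of the classified citations.
total = sum(role_counts.values())

# Percentage share of each role, rounded to the nearest whole percent.
shares = {role: round(100 * n / total) for role, n in role_counts.items()}

print(total, shares["background"])
```

Under that assumption there are 50 classified citations, and 36 of 50 gives the 72% background share.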

citation-polarity summary

background 36 · use method 7 · baseline 4 · unclear 2 · use dataset 1

representative citing papers

WARP-RM: A Warp-Augmented Relative Progress Reward Model for Data Curation

cs.RO · 2026-06-26 · unverdicted · novelty 7.0

WARP trains a reward model on time-warped successful demonstrations to produce frame-level progress estimates that upweight high-advantage chunks during behavior cloning, maintaining high success rates on suboptimal datasets where vanilla BC fails.

ForesightSafety-VLA: A Unified Diagnostic Safety Benchmark for Vision-Language-Action Models

cs.RO · 2026-06-25 · unverdicted · novelty 7.0 · 2 refs

ForesightSafety-VLA creates a diagnostic benchmark for VLA safety with taxonomy across physical, language, and visual risks, showing perception and structure variations cause more safety degradation than language changes in tested models.

NAC: Neural Action Codec for Vision-Language-Action Models

cs.RO · 2026-06-19 · unverdicted · novelty 7.0

NAC adapts multi-scale RVQGAN audio codecs with kinematic-specific losses to produce ordered action tokens that yield lower reconstruction error and higher task success than prior tokenizers in VLA models.

Geometric Entropy: When Trajectory Diversity Helps and Hurts in Imitation Learning

cs.RO · 2026-06-18 · unverdicted · novelty 7.0

Geometric diversity of demonstration trajectories exhibits an inverted-U effect on imitation learning success, with the peak shifting lower as mastery increases via more data, easier tasks, or stronger priors.

Frequency-Aware Flow Matching for Continuous and Consistent Robotic Action Generation

cs.RO · 2026-06-18 · unverdicted · novelty 7.0

FAFM performs flow matching in the frequency domain using DCT on action sequences to produce continuous temporally consistent robotic actions with a Sobolev-style smoothness regularizer.

Start Right, Arrive Right: Asynchronous Execution via Initial Noise Selection

cs.RO · 2026-06-18 · unverdicted · novelty 7.0

PAINT reframes asynchronous flow-based action chunking as an initial noise selection problem solved via backward Euler inversion and a repainting rule.

When Robots Sleep: Offline Skill Consolidation for Shared-Policy Robot Learning

cs.RO · 2026-06-16 · unverdicted · novelty 7.0

Sleeping Robots uses frozen critics and actor snapshots as compact memories to define surrogate objectives combined via Nash bargaining for offline consolidation of shared robot policies in sequential skill learning.

Transformer-Based Warm-Starting for Feasible and Optimal Terminal Approach to Tumbling Objects with Space Manipulators

cs.RO · 2026-06-15 · unverdicted · novelty 7.0

Transformer warm-starts cut SCP iterations by up to 28% and runtime by 23% for space manipulator terminal guidance while preserving cost distributions and improving feasibility projection robustness.

FTP-1: A Generalist Foundation Tactile Policy Across Tactile Sensors for Contact-Rich Manipulation

cs.RO · 2026-06-11 · unverdicted · novelty 7.0

FTP-1 is the first foundation tactile policy pretrained on ~3000 hours of data from 26 sources across 21 sensors that improves performance on seen setups by 17.2% and transfers to unseen sensors with 31% success rate gain.

Ambient Diffusion Policy: Imitation Learning from Suboptimal Data in Robotics

cs.RO · 2026-06-10 · unverdicted · novelty 7.0

Ambient Diffusion Policy enables better imitation learning from suboptimal robot data by leveraging spectral properties to restrict data usage to specific diffusion times.

Dynamic Execution Horizon Prediction for Chunk-based Robot Policies

cs.RO · 2026-06-09 · unverdicted · novelty 7.0

DEHP adds an online-RL horizon predictor to frozen chunk policies, yielding higher success on precise and long-horizon robot manipulation by adapting chunk length to task stage.

X-Tokenizer: A Multimodal Action Tokenizer for Vision-Language-Action Pretraining

cs.CV · 2026-06-07 · unverdicted · novelty 7.0

X-Tokenizer creates semantic action tokens via asymmetric residual quantization and contrastive pretraining on large trajectory data, outperforming prior methods like FAST on robotic tasks.

ActProbe: Action-Space Probe for Early Failure Detection of Generative Robot Policies

cs.RO · 2026-06-07 · unverdicted · novelty 7.0

ActProbe is an action-space detector that uses temporal consistency error and action chunk magnitude from policy outputs, mapped via LSTM-MLP, to predict failures earlier than baselines across policies and real-robot tasks.

Affordance2Action: Task-Conditioned Scene-level Affordance Grounding for Real-Time Manipulation

cs.RO · 2026-06-02 · unverdicted · novelty 7.0

Affordance2Action introduces A2A-Bench, a manipulation-oriented benchmark for scene-level task-conditioned affordance grounding covering single- and multi-region correspondences, plus an annotation pipeline, and reports gaps in existing segmentation and VLM baselines.

Denoising Tells When to Replan: Denoising-Variance Adaptive Chunking for Flow-Based Robot Policies

cs.RO · 2026-06-02 · unverdicted · novelty 7.0

DVAC uses denoising variance as an intrinsic signal to adaptively chunk actions in flow-based robot policies, improving success rates and cutting replans on LIBERO, RoboTwin, CALVIN, and real-world tasks.

Same Weights, Different Robot: A Deployment Safety View of VLA Policies

cs.CR · 2026-06-02 · unverdicted · novelty 7.0

The paper identifies a deployment safety gap in VLA policies where identical checkpoints can be executable-inequivalent due to action metadata mismatches, supported by a derived closed-form transform and empirical drift measurements on LIBERO benchmarks.

PhAIL: A Real-Robot VLA Benchmark and Distributional Methodology

cs.RO · 2026-05-28 · unverdicted · novelty 7.0

PhAIL provides an open benchmark and distributional evaluation method for real-robot VLA policies using time-to-success CDF, HRT scoring, and KS significance tests.

How VLAs Fail Differently: Black-Box Action Monitoring Reveals Architecture-Specific Failure Signatures

cs.RO · 2026-05-27 · unverdicted · novelty 7.0

VLA architectures exhibit architecture-specific failure signatures at the motor-command level, with direction reversal as a universal predictor and velocity monitoring ineffective for continuous models.

Action-Prior Denoising for Smooth Real-Time Chunking

cs.RO · 2026-05-25 · unverdicted · novelty 7.0

Soft RTC uses partially denoised states for overlap tokens and token-wise blending to reduce action delta and jerk by ~9% versus hard RTC while matching solve rates on Kinetix levels.

Point Tracking Improves World Action Models

cs.RO · 2026-05-22 · unverdicted · novelty 7.0

JOPAT jointly models pixels, point tracks, and actions in a diffusion transformer and reports gains over pixel-only baselines on long-horizon robot tasks with occlusion and off-screen motion.

Understanding Multimodal Failure in Action-Chunking Behavioral Cloning

cs.LG · 2026-05-21 · unverdicted · novelty 7.0

The paper identifies distinct failure mechanisms: excessive posterior-prior regularization erases mode information in latent policies, while smooth base-to-action maps limit mode coverage in generative policies.

RoboFlow4D: A Lightweight Flow World Model Toward Real-Time Flow-Guided Robotic Manipulation

cs.RO · 2026-05-17 · unverdicted · novelty 7.0

RoboFlow4D is an end-to-end lightweight flow world model that predicts multi-frame 3D flows from visual observations and textual instructions to provide explicit planning for real-time robotic manipulation.

DSSP: Diffusion State Space Policy with Full-History Encoding

cs.RO · 2026-05-14 · conditional · novelty 7.0

DSSP is a history-conditioned diffusion state space policy that uses SSMs to encode full observation streams with an auxiliary dynamics objective and hierarchical fusion, achieving SOTA results with reduced model size in robot manipulation.

Realtime-VLA FLASH: Speculative Inference Framework for Diffusion-based VLAs

cs.RO · 2026-05-13 · unverdicted · novelty 7.0

A new speculative inference system speeds up diffusion VLAs to 19.1 ms average latency (3.04x faster) on LIBERO by replacing most full 58 ms inferences with 7.8 ms draft rounds while preserving task performance.

citing papers explorer

Showing 50 of 264 citing papers.

AutoSpeed: Annotation-Free Stage-Adaptive Motion Speed Learning for Robot Manipulation cs.RO · 2026-07-01 · unverdicted · none · ref 40 · internal anchor
AutoSpeed optimizes visuomotor policies over candidate trajectories at varying speeds using a composite cost of prediction error versus horizon length, with DCT-based modulation, yielding shorter execution times and higher success rates while producing speeds that align with task stages.
UniTacVLA: Unified Tactile Understanding and Prediction in Vision Language Action Models cs.RO · 2026-06-30 · unverdicted · none · ref 18 · internal anchor
UniTacVLA builds a state-aware and dynamics-aware tactile prior via unified latent space, tactile chain-of-thought, and mixed real/predicted feedback controller to boost dexterous manipulation performance.
TactX: Learning Shared Tactile Representations Across Diverse Sensors cs.RO · 2026-06-30 · unverdicted · none · ref 21 · internal anchor
TactX learns a shared latent representation across three tactile sensor modalities via joint training on paired contacts, enabling zero-shot policy transfer and higher success on pick-and-place, insertion, wiping, and reorientation tasks.
ReactiveBFM: Reactive Closed-Loop Motion Planning Towards Universal Humanoid Whole-Body Control cs.RO · 2026-06-29 · unverdicted · none · ref 13 · internal anchor
ReactiveBFM introduces a real-time closed-loop planning-control system for humanoids using curriculum-based error recovery and asynchronous replanning, achieving 93.1% success under severe perturbations in sim-to-sim tests.
Chronos: A Physics-Informed Full-History Framework for Non-Markovian Long-Horizon Manipulation cs.RO · 2026-06-29 · unverdicted · none · ref 30 · internal anchor
Chronos elevates full observation history to the policy's latent state via selective SSM tokens and a Schrödinger-inspired acceleration bridge, achieving large gains on memory-dependent robot tasks with fewer parameters.
SA-VLA: State-aware tokenizer for improving Vision-Language-Action Models' performance cs.RO · 2026-06-29 · unverdicted · none · ref 27 · internal anchor
SA-VLA adds state conditioning to VQ-based action tokenization in VLA policies, expanding each discrete token's effective support to state-dependent actions and raising average success rates from 0.29 to 0.56 on 12 sim tasks and 0.15 to 0.33 on 3 real tasks.
Critical Interval MSE: Toward Reliable Offline Validation for Robot Manipulation Policies cs.RO · 2026-06-29 · unverdicted · none · ref 18 · internal anchor
CI-MSE improves Spearman's rank correlation between offline validation error and real rollout performance from -0.61 (raw MSE) to -0.87 across policy checkpoints in simulation and real-world robot manipulation experiments.
Trust Your Instincts: Confidence-Driven Test-Time RL for Vision-Language-Action Models cs.RO · 2026-06-29 · unverdicted · none · ref 48 · internal anchor
T^2VLA is a test-time reinforcement learning framework for VLAs that uses internal confidence to define intrinsic rewards via similarity to high-confidence expert demonstrations and a dual-expert bootstrapping mechanism.
Hierarchical Policy Learning via Spectral Decomposition cs.RO · 2026-06-28 · unverdicted · none · ref 3 · internal anchor
Causal Spectral Policy decomposes actions spectrally into coarse motion from obs/language and conditional fine corrections, outperforming baselines on precision manipulation tasks.
The Speedup Paradox: Rethinking Inference Speed-Quality Trade-off in Embodied Tasks cs.RO · 2026-06-26 · unverdicted · none · ref 48 · 2 links · internal anchor
TISED decomposes inference optimization effects on embodied tasks and identifies paradoxical outcomes where faster per-step inference can increase task completion time on static tasks or raise success rates on dynamic tasks.
DIM-WAM: World-Action Modeling with Diverse Historical Event Memory cs.RO · 2026-06-26 · unverdicted · none · ref 48 · internal anchor
DiM-WAM is a memory-augmented world-action model that integrates multi-scale historical events and global task progress to improve long-horizon robot manipulation performance.
PAMAE: Phase-Aware-MoE Action Experts Towards Reliable Flow-Matching Vision-Language-Action Policies cs.RO · 2026-06-25 · unverdicted · none · ref 22 · internal anchor
PAMAE adds a phase-aware router and expert mixture to flow-matching VLA models, yielding up to 9.2% higher task success on multi-stage manipulation simulations via two-stage training.
Improving Vision-Language-Action Model Fine-Tuning with Structured Stage and Keyframe Supervision cs.RO · 2026-06-25 · unverdicted · none · ref 42 · internal anchor
StaKe adds lightweight auxiliary heads for manipulation stage identification and next-gripper-transition keyframe prediction to VLA fine-tuning, reporting relative success rate gains of 14% in bimanual simulation and 56% on single-arm real-robot tasks.
Tactile-WAM: Touch-Aware World Action Model with Tactile Asymmetric Attention cs.RO · 2026-06-25 · unverdicted · none · ref 27 · internal anchor
Tactile-WAM with TAAM improves mean success rate by 38.9% overall and 86% on contact-rich tasks on ManiFeel by using VideoClean mask and touch-aware bias to prevent tactile pollution in world action models.
Foresight: Failure Detection for Long-Horizon Robotic Manipulation with Action-Conditioned World Model Latents cs.RO · 2026-06-22 · unverdicted · none · ref 22 · internal anchor
Foresight detects failures in long-horizon robotic manipulation using latents from action-conditioned world models trained only on task-level labels and calibrated via functional conformal prediction.
Verifiable Foundation Models for Robot Safety cs.RO · 2026-06-22 · unverdicted · none · ref 18 · internal anchor
FEARL decomposes robot policies into an expressive Controller and a small verifiable Safety module to enable formal verification of safety constraints while retaining foundation-model task performance.
ARP: Enhancing Quantized Skill Abstractions via Visual Alignment and Iterative Refinement for Robotic Manipulation cs.RO · 2026-06-21 · unverdicted · none · ref 1 · internal anchor
ARP enhances quantized skill abstractions in imitation learning by coupling visual grounding via contrastive alignment with execution refinement via IRH, reporting SOTA results on LIBERO, Meta-World, and real-robot tasks.
RARM: Confidence-Gated Progress Reward Modeling for RL in Manipulation cs.RO · 2026-06-20 · unverdicted · none · ref 51 · internal anchor
RARM is a lightweight visual comparator trained once on general videos that supplies dense progress rewards to RL by matching rollout clips to a reference demonstration and gating rewards on match confidence.
Imitation from Heterogeneous Demonstrations using Grounded Latent-Action World Models cs.RO · 2026-06-19 · unverdicted · none · ref 2 · internal anchor
GLAM learns a shared latent action space grounded in consistent future observation prediction across heterogeneous data sources to train improved behavioral cloning policies for robot manipulation tasks.
Remember what you did?: Learning Behavioral Memories for Partially Observable Object Manipulation cs.RO · 2026-06-19 · unverdicted · none · ref 22 · internal anchor
CAMP learns a compressed behavioral memory from action history to enable success in long-horizon partially observable object manipulation without extra supervision, showing gains over baselines in real-robot and simulation tests.
Vesta: A Generalist Embodied Reasoning Model cs.RO · 2026-06-18 · unverdicted · none · ref 158 · internal anchor
Vesta is a unified embodied generalist model that outperforms specialist baselines by over 20% on average and improves real-world robotic task success by over 35%.
Finetuning Vision-Language-Action Models Requires Fewer Layers Than You Think cs.RO · 2026-06-18 · unverdicted · none · ref 25 · internal anchor
VLA models exhibit layer-wise redundancy allowing up to 50% depth compression via training-free CKA-based removal, yielding faster fine-tuning and inference with no performance loss on robot tasks.
Belt-Finger: An Affordable Soft Belt-Driven Gripper for Dexterous In-Hand Manipulation cs.RO · 2026-06-18 · unverdicted · none · ref 28 · internal anchor
A double-soft-belt finger module adds translation, pitch, and roll to parallel grippers for improved in-hand manipulation at low cost.
EventVLA: Event-Driven Visual Evidence Memory for Long-Horizon Vision-Language-Action Policies cs.CV · 2026-06-18 · unverdicted · none · ref 5 · 2 links · internal anchor
EventVLA introduces foundational visual anchors and a Keyframe Evidence Memory module that predicts future keyframe probabilities from VLA embeddings to improve long-horizon task success by an average of 40% on 17 simulation and 4 real-world tasks.
Tri-Info: Generalizable, Interpretable Failure Prediction for VLA Models via Information Theory cs.RO · 2026-06-18 · unverdicted · none · ref 46 · internal anchor
Tri-Info uses three information theory signals on action diversity, temporal consistency, and state coupling to predict VLA model failures with cross-domain generalization to 83% real-world accuracy.
One Demo is Worth a Thousand Trajectories: Action-View Augmentation for Visuomotor Policies cs.RO · 2026-06-17 · unverdicted · none · ref 2 · internal anchor
A framework augments single fisheye demonstrations into multiple novel-view trajectories with obstacles via fisheye-adapted Gaussian Splatting and trajectory optimization, raising policy success rates in original and modified scenes.
Motion-Focused Latent Action Enables Cross-Embodiment VLA Training from Human EgoVideos cs.CV · 2026-06-17 · unverdicted · none · ref 23 · 2 links · internal anchor
A Hybrid Disentangled VQ-VAE with physical masks creates a cross-embodiment action codebook from human videos, allowing VLA pre-training that adapts to new embodiments with only 50 trajectories.
Contrastive Action-Image Pre-training for Visuomotor Control cs.RO · 2026-06-15 · unverdicted · none · ref 59 · internal anchor
CAIP learns action-aligned visual representations via contrastive pre-training on human hand keypoints from egocentric video, outperforming DINOv2, SigLIP, MVP, and R3M with >30% gains on real dexterous manipulation tasks.
An Embodied Simulation Platform, Benchmark, and Data-Efficient Augmentation Framework for Wet-Lab Robotics cs.RO · 2026-06-11 · unverdicted · none · ref 19 · internal anchor
Pipette supplies an open wet-lab simulation platform, 11-task benchmark, and perturbation-based augmentation pipeline that raises VLA success rates on sample handling and device tasks from limited demonstrations.
Blind Dexterous Grasping via Real2Sim2Real Tactile Policy Learning cs.RO · 2026-06-10 · unverdicted · none · ref 46 · internal anchor
Real2Sim tactile calibration, layout-aware encoder pretraining, and diffusion policy aggregation from object-specific RL experts enable 27% real-world success in blind grasping on a LEAP Hand for 10 seen and 10 unseen objects.
Hand-centric Human-to-Robot Trajectory Transfer from Video Demonstrations via Open-World Contact Localization cs.RO · 2026-06-09 · unverdicted · none · ref 57 · internal anchor
HOWTransfer recovers 3D hand motion from video, localizes contact intervals via hand-object cues, generates multi-modal grasp hypotheses, and edits trajectories to produce diverse robot-executable motions achieving 86% success.
GHOST: Hierarchical Sub-Goal Policies for Generalizing Robot Manipulation cs.RO · 2026-06-08 · unverdicted · none · ref 36 · internal anchor
GHOST improves generalization in robot manipulation via hierarchical factorization into 3D sub-goal prediction from RGB-D views and a goal-conditioned low-level controller, enabling human video integration without action retargeting.
MotionWAM: Towards Foundation World Action Models for Real-Time Humanoid Loco-Manipulation cs.RO · 2026-06-08 · unverdicted · none · ref 24 · internal anchor
MotionWAM conditions a policy on intermediate features from a video world model to predict unified whole-body motion tokens, enabling real-time humanoid loco-manipulation that outperforms VLA baselines by over 30% on nine Unitree G1 tasks.
RGB-S: Image-Aligned Tactile Saliency for Robust Dexterous Manipulation cs.RO · 2026-06-07 · unverdicted · none · ref 13 · internal anchor
RGB-S projects tactile contacts onto images as force-modulated Gaussian saliency maps via kinematics and zero-initialized conditioning, raising real-world occluded dexterous manipulation success by 26.7 percentage points over implicit baselines.
SIMPLE: Simulation-Based Policy Learning and Evaluation for Humanoid Loco-manipulation cs.RO · 2026-06-06 · unverdicted · none · ref 44 · internal anchor
SIMPLE is a new large-scale simulation benchmark for humanoid loco-manipulation that integrates accurate dynamics and photorealistic rendering and demonstrates policy transfer from simulation to physical robots.
SynthICL: Scalable In-context Imitation Learning with Synthetic Data cs.RO · 2026-06-06 · unverdicted · none · ref 14 · internal anchor
SynthICL trains flow-matching transformer policies for in-context imitation learning entirely from synthetic RGB data and reports 79% average success on 16 unseen real manipulation tasks with one test-time demonstration.
vla.cpp: A Unified Inference Runtime for Vision-Language-Action Models cs.RO · 2026-06-06 · conditional · none · ref 40 · internal anchor
vla.cpp is a unified C++ runtime that serves multiple VLA architectures with flow-matching and diffusion patterns, matching SOTA performance on LIBERO while running on low-memory embedded hardware.
Flow-based Policy Adaptation without Policy Updates cs.RO · 2026-06-04 · unverdicted · none · ref 9 · internal anchor
GLOVES learns flow models from limited expert demonstrations to selectively correct actions from non-expert policies or operators toward expert distributions using reverse-flow OOD detection as an intervention gate.
Double Preconditioning (DoPr): Optimization for Test-Time Performance, not Validation Loss cs.LG · 2026-06-04 · unverdicted · none · ref 16 · internal anchor
Double preconditioning (DoPr) improves downstream task performance in test-time feedback settings without consistent gains in validation loss.
CLAW: Learning Continuous Latent Action World Models via Adversarial Latent Regularization cs.RO · 2026-06-02 · unverdicted · none · ref 69 · internal anchor
CLAW is an end-to-end self-supervised method that learns semantically meaningful continuous latent actions and predictive world models from action-free videos to support imitation learning and goal-directed planning.
Revisiting Embodied Chain-of-Thought for Generalizable Robot Manipulation cs.RO · 2026-06-02 · unverdicted · none · ref 62 · internal anchor
ERVLA trains on a 978k-trajectory embodied CoT corpus using reasoning as supervision with dropout, then predicts actions without CoT at test time, reaching 86.9% on LIBERO-Plus and 53.2% on VLABench.
OpenEAI-Platform: An Open-source Embodied Artificial Intelligence Hardware-Software Unified Platform cs.RO · 2026-06-02 · conditional · none · ref 19 · internal anchor
OpenEAI-Platform delivers an open-source low-cost robotic arm and VLA model that outperforms commercial arms and matches large pretrained baselines on four real-world manipulation tasks using limited open data.
Policy-based Foveated Imaging and Perception cs.CV · 2026-06-01 · unverdicted · none · ref 137 · internal anchor
A task-aware policy learned via reinforcement learning allocates high-resolution pixels on dual-stream sensors in real time, outperforming fixed or non-predictive baselines under tight pixel budgets in both simulation and 200 MP hardware tests.
Intercepting the Future: Latent-Space Predictive World Model for Dynamic VLA Manipulation cs.RO · 2026-06-01 · unverdicted · none · ref 7 · internal anchor
AHEAD augments frozen VLAs with a 4.9M-parameter latent world model that forecasts future visual features using optical-flow motion cues, achieving 79-97% success on dynamic simulation tasks and high real-robot success rates where baselines score near zero.
Expanding Spatial and Temporal Context for Robotic Imitation Learning With Scene Graphs cs.RO · 2026-05-31 · unverdicted · none · ref 66 · internal anchor
Dynamic scene graphs serve as explicit memory to improve imitation learning policies for spatial-temporal reasoning under partial observability in mobile and tabletop manipulation.
PACE: Phase-Aware Chunk Execution for Robot Policies with Action Chunking cs.RO · 2026-05-30 · unverdicted · none · ref 29 · internal anchor
PACE dynamically selects execution horizons for action chunks in robot policies by detecting low-speed transition points in predicted speed profiles, raising success rates from 57.8% to 64.2% on 50 simulation tasks and from 50.7% to 70.4% in real-robot tests.
Continuous Reasoning for Vision-Language-Action cs.RO · 2026-05-29 · unverdicted · none · ref 33 · internal anchor
Continuous Reasoning for VLA introduces a shared Gaussian latent for continuous thoughts, trained with self-verification to improve action prediction on LIBERO-PRO and real robots.
Any-ttach: Quick End-effector Swapping Enables Manipulation Dexterity with Simplicity cs.RO · 2026-05-28 · unverdicted · none · ref 43 · internal anchor
Any-ttach shows that rapid end-effector swapping combined with demonstration collection and task planning enables reliable multi-tool skills in long-horizon tasks such as sandwich making.
RoboWits: Unexpected Challenges for Robotic Creative Problem Solving cs.RO · 2026-05-28 · unverdicted · none · ref 54 · internal anchor
RoboWits benchmark with 238 tasks shows pre-trained VLAs succeed on seed tasks but fail on mutated ones, highlighting brittleness in reasoning.
Mag-VLA: Vision-Language-Action Model for Bimanual Magnetically Actuated Microrobot Manipulation cs.RO · 2026-05-27 · unverdicted · none · ref 9 · internal anchor
Mag-VLA uses a LoRA-adapted Qwen2.5-VL-7B with a phase classifier and ACT decoder on a new teleoperated dataset to reach 90% approach and 50-80% transport success in bimanual magnetic microrobot tasks.
