MolmoAct2: Action Reasoning Models for Real-world Deployment

2026 · cs.RO · arXiv 2605.02881

13 Pith papers cite this work. Polarity classification is still indexing.

abstract

Vision-Language-Action (VLA) models aim to provide a single generalist controller for robots, but today's systems fall short on the criteria that matter for real-world deployment. Frontier models are closed, open-weight alternatives are tied to expensive hardware, reasoning-augmented policies pay prohibitive latency for their grounding, and fine-tuned success rates remain below the threshold for dependable use. We present MolmoAct2, a fully open action reasoning model built for practical deployment, advancing its predecessor along five axes. We introduce MolmoER, a VLM backbone specialized for spatial and embodied reasoning, trained on a 3.3M-sample corpus with a specialize-then-rehearse recipe. We release three new datasets spanning low-to-medium cost platforms, including MolmoAct2-BimanualYAM, 720 hours of teleoperated bimanual trajectories that constitute the largest open bimanual dataset to date, together with quality-filtered Franka (DROID) and SO100/101 subsets. We provide OpenFAST, an open-weight, open-data action tokenizer trained on millions of trajectories across five embodiments. We redesign the architecture to graft a flow-matching continuous-action expert onto a discrete-token VLM via per-layer KV-cache conditioning. Finally, we propose MolmoThink, an adaptive-depth reasoning variant that re-predicts depth tokens only for scene regions that change between timesteps, retaining geometric grounding at a fraction of prior latency. In the most extensive empirical study of any open VLA to date, spanning 7 simulation and real-world benchmarks, MolmoAct2 outperforms strong baselines including Pi-05, while MolmoER surpasses GPT-5 and Gemini Robotics ER-1.5 across 13 embodied-reasoning benchmarks. We release model weights, training code, and complete training data. Project page: https://allenai.org/blog/molmoact2

representative citing papers

NAC: Neural Action Codec for Vision-Language-Action Models

cs.RO · 2026-06-19 · unverdicted · novelty 7.0

NAC adapts multi-scale RVQGAN audio codecs with kinematic-specific losses to produce ordered action tokens that yield lower reconstruction error and higher task success than prior tokenizers in VLA models.

HumanScale: Egocentric Human Video Can Outperform Real-Robot Data for Embodied Pretraining

cs.CV · 2026-06-18 · unverdicted · novelty 7.0

Processed egocentric human video outperforms teleoperated real-robot trajectories as pretraining data for embodied foundation models, delivering 24% lower validation loss and 52.5-90% higher task success rates under matched post-training protocols.

VoLo: A Physical Orchestrator for Open-Vocabulary Long-Horizon Manipulation

cs.RO · 2026-06-05 · unverdicted · novelty 7.0

VoLoAgent uses a VLM to steer heterogeneous robot capabilities as interruptible tools for long-horizon manipulation and introduces the RoboVoLo benchmark, claiming substantial outperformance over single VLA/VLM or tool-based systems with real-robot validation.

Sequential Planning via Anchored Robotic Keypoints

cs.RO · 2026-06-29 · unverdicted · novelty 6.0

SPARK reaches 43.7% success on six LIBERO-PRO cells by LLM-generated typed behavior trees plus multi-prompt perception and recovery, more than doubling CaP-Agent0 and VLA baselines.

Training Vision-Language-Action Models with Dense Embodied Chain-of-Thought Supervision

cs.RO · 2026-06-29 · unverdicted · novelty 6.0 · 2 refs

ZR-0 is a dual-stream VLA model trained with dense ECoT supervision on 60M frames from 400K trajectories to enable cross-embodiment transfer in simulation and real-world settings.

LA4VLA: Learning to Act without Seeing via Language-Action Pretraining

cs.RO · 2026-06-25 · unverdicted · novelty 6.0

LA4VLA creates a 33K language-action dataset from existing demos and shows that pretraining on language-action pairs before or alongside vision-language-action training boosts success rates in sim and real robot tasks.

VLA-FAIL: Efficient Task Failure Detection for Finetuned Vision-Language-Action Models

cs.LG · 2026-06-19 · unverdicted · novelty 6.0

VLA-FAIL introduces last-layer Mahalanobis distance and action chunk consistency detectors that together enable early, reliable failure detection in finetuned VLAs without failure data or expensive sampling.

MolmoMotion: Forecasting Point Trajectories in 3D with Language Instruction

cs.CV · 2026-06-17 · unverdicted · novelty 6.0

Introduces a new task of goal-conditioned 3D point motion forecasting along with a 1.16M-video dataset, a 111-category benchmark, and a model that outperforms baselines while transferring to robotics and video generation.

Guava: An Effective and Universal Harness for Embodied Manipulation

cs.RO · 2026-06-16 · unverdicted · novelty 6.0

Guava harness enables 4B open-source models to achieve performance comparable to frontier models on embodied manipulation tasks by distilling capabilities from under 2K simulation trajectories using three identified design principles.

Qwen-RobotManip Technical Report: Alignment Unlocks Scale for Robotic Manipulation Foundation Models

cs.RO · 2026-06-16 · unverdicted · novelty 6.0

Qwen-RobotManip applies unified alignment across representation, motion, and behavior to enable large-scale training on heterogeneous manipulation data, yielding emergent generalization on out-of-distribution robotic benchmarks.

Learning What to Say to Your VLA: Mostly Harmless Vision Language Action Model Steering

cs.RO · 2026-06-10 · unverdicted · novelty 6.0

A search-and-distill framework with conformalized improvement head produces a language feedback policy that boosts frozen VLA performance by 24.7% in simulation and 65% on hardware while guaranteeing harmlessness on perturbations.

SPARC: Reliable Spatial Annotations from Robot Demonstrations at Scale

cs.RO · 2026-06-11 · unverdicted · novelty 5.0

SPARC generates reliable spatial annotations for robot demonstrations by leveraging spatio-temporal task structure, outperforming detection baselines on localization accuracy while retaining more samples and enabling competitive model performance without manual annotations.

Coarse-to-Control: Action-Token Planning for Vision-Language-Action Models

cs.RO · 2026-06-05 · unverdicted · novelty 5.0

Coarse-to-Control adds planning via coarse action tokens in the same vocabulary as control actions, improving VLA performance on long-horizon manipulation tasks.

citing papers explorer

NAC: Neural Action Codec for Vision-Language-Action Models
cs.RO · 2026-06-19 · unverdicted · none · ref 21 · internal anchor
NAC adapts multi-scale RVQGAN audio codecs with kinematic-specific losses to produce ordered action tokens that yield lower reconstruction error and higher task success than prior tokenizers in VLA models.

HumanScale: Egocentric Human Video Can Outperform Real-Robot Data for Embodied Pretraining
cs.CV · 2026-06-18 · unverdicted · none · ref 11 · internal anchor
Processed egocentric human video outperforms teleoperated real-robot trajectories as pretraining data for embodied foundation models, delivering 24% lower validation loss and 52.5-90% higher task success rates under matched post-training protocols.

VoLo: A Physical Orchestrator for Open-Vocabulary Long-Horizon Manipulation
cs.RO · 2026-06-05 · unverdicted · none · ref 17 · internal anchor
VoLoAgent uses a VLM to steer heterogeneous robot capabilities as interruptible tools for long-horizon manipulation and introduces the RoboVoLo benchmark, claiming substantial outperformance over single VLA/VLM or tool-based systems with real-robot validation.

Sequential Planning via Anchored Robotic Keypoints
cs.RO · 2026-06-29 · unverdicted · none · ref 5 · internal anchor
SPARK reaches 43.7% success on six LIBERO-PRO cells by LLM-generated typed behavior trees plus multi-prompt perception and recovery, more than doubling CaP-Agent0 and VLA baselines.

Training Vision-Language-Action Models with Dense Embodied Chain-of-Thought Supervision
cs.RO · 2026-06-29 · unverdicted · none · ref 23 · 2 links · internal anchor
ZR-0 is a dual-stream VLA model trained with dense ECoT supervision on 60M frames from 400K trajectories to enable cross-embodiment transfer in simulation and real-world settings.

LA4VLA: Learning to Act without Seeing via Language-Action Pretraining
cs.RO · 2026-06-25 · unverdicted · none · ref 13 · internal anchor
LA4VLA creates a 33K language-action dataset from existing demos and shows that pretraining on language-action pairs before or alongside vision-language-action training boosts success rates in sim and real robot tasks.

VLA-FAIL: Efficient Task Failure Detection for Finetuned Vision-Language-Action Models
cs.LG · 2026-06-19 · unverdicted · none · ref 5 · internal anchor
VLA-FAIL introduces last-layer Mahalanobis distance and action chunk consistency detectors that together enable early, reliable failure detection in finetuned VLAs without failure data or expensive sampling.

MolmoMotion: Forecasting Point Trajectories in 3D with Language Instruction
cs.CV · 2026-06-17 · unverdicted · none · ref 23 · internal anchor
Introduces a new task of goal-conditioned 3D point motion forecasting along with a 1.16M-video dataset, a 111-category benchmark, and a model that outperforms baselines while transferring to robotics and video generation.

Guava: An Effective and Universal Harness for Embodied Manipulation
cs.RO · 2026-06-16 · unverdicted · none · ref 3 · internal anchor
Guava harness enables 4B open-source models to achieve performance comparable to frontier models on embodied manipulation tasks by distilling capabilities from under 2K simulation trajectories using three identified design principles.

Qwen-RobotManip Technical Report: Alignment Unlocks Scale for Robotic Manipulation Foundation Models
cs.RO · 2026-06-16 · unverdicted · none · ref 13 · internal anchor
Qwen-RobotManip applies unified alignment across representation, motion, and behavior to enable large-scale training on heterogeneous manipulation data, yielding emergent generalization on out-of-distribution robotic benchmarks.

Learning What to Say to Your VLA: Mostly Harmless Vision Language Action Model Steering
cs.RO · 2026-06-10 · unverdicted · none · ref 8 · internal anchor
A search-and-distill framework with conformalized improvement head produces a language feedback policy that boosts frozen VLA performance by 24.7% in simulation and 65% on hardware while guaranteeing harmlessness on perturbations.

SPARC: Reliable Spatial Annotations from Robot Demonstrations at Scale
cs.RO · 2026-06-11 · unverdicted · none · ref 8 · internal anchor
SPARC generates reliable spatial annotations for robot demonstrations by leveraging spatio-temporal task structure, outperforming detection baselines on localization accuracy while retaining more samples and enabling competitive model performance without manual annotations.

Coarse-to-Control: Action-Token Planning for Vision-Language-Action Models
cs.RO · 2026-06-05 · unverdicted · none · ref 22 · internal anchor
Coarse-to-Control adds planning via coarse action tokens in the same vocabulary as control actions, improving VLA performance on long-horizon manipulation tasks.
