MolmoAct: Action Reasoning Models that can Reason in Space

Jason Lee, Jiafei Duan, Haoquan Fang, Yuquan Deng, Shuo Liu, Boyang Li · 2025 · cs.RO · arXiv 2508.07917

Canonical reference. 82% of classified citations from Pith papers cite this work as background.

71 Pith papers cite it

abstract

Reasoning is central to purposeful action, yet most robotic foundation models map perception and instructions directly to control, which limits adaptability, generalization, and semantic grounding. We introduce Action Reasoning Models (ARMs), a class of robotic foundation models that integrate perception, planning, and control through a structured three-stage pipeline. Our model, MolmoAct, encodes observations and instructions into depth-aware perception tokens, generates mid-level spatial plans as editable trajectory traces, and predicts precise low-level actions, enabling explainable and steerable behavior. MolmoAct-7B-D achieves strong performance across simulation and real-world settings: 70.5% zero-shot accuracy on SimplerEnv Visual Matching tasks, surpassing closed-source Pi-0 and GR00T N1.5; 86.6% average success on LIBERO, including an additional 6.3% gain over ThinkAct on long-horizon tasks; and in real-world fine-tuning, an additional 10% (single-arm) and an additional 22.7% (bimanual) task progression over Pi-0-FAST. It also outperforms baselines by an additional 23.3% on out-of-distribution generalization and achieves top human-preference scores for open-ended instruction following and trajectory steering. Furthermore, we release, for the first time, the MolmoAct Dataset -- a mid-training robot dataset comprising over 10,000 high quality robot trajectories across diverse scenarios and tasks. Training with this dataset yields an average 5.5% improvement in general performance over the base model. We release all model weights, training code, our collected dataset, and our action reasoning dataset, establishing MolmoAct as both a state-of-the-art robotics foundation model and an open blueprint for building ARMs that transform perception into purposeful action through structured reasoning. Blogpost: https://allenai.org/blog/molmoact

citation-role summary

background 14 · dataset 2 · baseline 1

citation-polarity summary

background 14 · use dataset 2 · baseline 1
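
The 82% background share quoted near the top of this page appears to follow from these counts, assuming the percentage is taken over the classified citations only rather than over all 71 citing papers. A minimal Python sketch of that arithmetic (the `role_counts` name is illustrative and not part of any Pith tooling):

```python
# Counts copied from the citation-role summary; the dict name is hypothetical.
role_counts = {"background": 14, "dataset": 2, "baseline": 1}

# Assumption: the share is computed over citations that received a role label.
classified = sum(role_counts.values())
background_share = role_counts["background"] / classified

print(f"{classified} classified citations")
print(f"background share: {background_share:.0%}")
```

Under that assumption, 14 of 17 classified citations are background, which rounds to the 82% shown.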

representative citing papers

Molmo2: Open Weights and Data for Vision-Language Models with Video Understanding and Grounding

cs.CV · 2026-01-15 · unverdicted · novelty 8.0

Molmo2 delivers state-of-the-art open-weight video VLMs with new grounding datasets and training methods that outperform prior open models and match or exceed some proprietary ones on pointing and tracking tasks.

NAC: Neural Action Codec for Vision-Language-Action Models

cs.RO · 2026-06-19 · unverdicted · novelty 7.0

NAC adapts multi-scale RVQGAN audio codecs with kinematic-specific losses to produce ordered action tokens that yield lower reconstruction error and higher task success than prior tokenizers in VLA models.

MuseVLA: An Adaptive Multimodal Sensing Vision-Language-Action Model for Robotic Manipulation

cs.RO · 2026-06-16 · unverdicted · novelty 7.0

MuseVLA adds on-demand sensor selection via tokens and converts readings into grounded sensor images for multimodal fusion, reporting 80.6% average success on real-robot dexterous tasks that need non-visual sensing.

Trajectory-Level Redirection Attacks on Vision-Language-Action Models

cs.RO · 2026-06-11 · unverdicted · novelty 7.0

A prompt-only attack called command-preserving trajectory redirection can steer VLA robot behavior to attacker-chosen physical outcomes while the text still appears to match the intended task.

Back to the Familiar Future: Failure Recovery for VLA Policies via Pre-Imagined Milestone Selection

cs.RO · 2026-06-08 · unverdicted · novelty 7.0

B2FF pre-generates a milestone bank of familiar future states from the clean initial observation and uses a recoverability-aware selector to guide VLA policies back from deviations, raising average success rate from 56.3% to 74.0% on failure-injected LIBERO.

VoLo: A Physical Orchestrator for Open-Vocabulary Long-Horizon Manipulation

cs.RO · 2026-06-05 · unverdicted · novelty 7.0

VoLoAgent uses a VLM to steer heterogeneous robot capabilities as interruptible tools for long-horizon manipulation and introduces the RoboVoLo benchmark, claiming substantial outperformance over single VLA/VLM or tool-based systems with real-robot validation.

Colosseum V2: Benchmarking Generalization for Vision Language Action Models

cs.RO · 2026-05-26 · unverdicted · novelty 7.0

Introduces Colosseum V2 benchmark for evaluating VLA model generalization in robotic manipulation with 28 tasks, revealing limitations in current methods and sim-real correlations.

CoRAL: Contact-Rich Adaptive LLM-based Control for Robotic Manipulation

cs.RO · 2026-05-04 · unverdicted · novelty 7.0 · 2 refs

CoRAL lets LLMs act as adaptive cost designers for motion planners while using VLM priors and online identification to handle unknown physics, achieving over 50% higher success rates than baselines in unseen contact-rich robotic scenarios.

Thinking in Text and Images: Interleaved Vision--Language Reasoning Traces for Long-Horizon Robot Manipulation

cs.AI · 2026-05-01 · unverdicted · novelty 7.0

A multimodal transformer generates and caches interleaved text-image traces to guide closed-loop actions, achieving 92.4% success on LIBERO-Long and 95.5% average on LIBERO.

Being-H0.7: A Latent World-Action Model from Egocentric Videos

cs.RO · 2026-04-30 · unverdicted · novelty 7.0

Being-H0.7 adds future-aware latent reasoning to direct VLA policies via dual-branch alignment on latent queries, matching world-model benefits at VLA efficiency.

CF-VLA: Efficient Coarse-to-Fine Action Generation for Vision-Language-Action Policies

cs.CV · 2026-04-27 · unverdicted · novelty 7.0

CF-VLA uses a coarse initialization over endpoint velocity followed by single-step refinement to achieve strong performance with low inference steps on CALVIN, LIBERO, and real-robot tasks.

Action Images: End-to-End Policy Learning via Multiview Video Generation

cs.CV · 2026-04-07 · unverdicted · novelty 7.0

Action Images turn robot arm motions into interpretable multiview pixel videos, letting video backbones serve as zero-shot policies for end-to-end robot learning.

Token Warping Helps MLLMs Look from Nearby Viewpoints

cs.CV · 2026-04-03 · unverdicted · novelty 7.0

Backward token warping in ViT-based MLLMs enables reliable reasoning from nearby viewpoints by preserving semantic coherence better than pixel-wise warping or fine-tuning baselines.

DFM-VLA: Iterative Action Refinement for Robot Manipulation via Discrete Flow Matching

cs.RO · 2026-03-27 · conditional · novelty 7.0

DFM-VLA uses discrete flow matching to iteratively refine action tokens in VLA models, outperforming autoregressive and diffusion baselines with 4.44 average success length on CALVIN and 95.7% success on LIBERO.

Learning Physics from Pretrained Video Models: A Multimodal Continuous and Sequential World Interaction Models for Robotic Manipulation

cs.RO · 2026-02-18 · unverdicted · novelty 7.0

PhysGen uses video models to learn physics for robots, outperforming baselines by up to 13.8% on LIBERO and matching specialized models in real-world tasks.

LocalNav: Distilling Frontier VLMs and Embodied RL for On-Device Object Goal Navigation

cs.RO · 2026-06-26 · unverdicted · novelty 6.0

Distillation from frontier VLMs plus E-RLVR regularization produces a 4B local model that achieves 34.5% SR on OVON while cutting inference latency by 82.8%.

Direct Action-Head Injection of A Grounded 3D Point Unlocks Spatial and Task Generalization

cs.RO · 2026-06-26 · unverdicted · novelty 6.0

Direct 3D point grounding injected into the action head via a two-layer MLP and adaptive layer norm boosts VLA success rates by 32-46 points on spatial and task perturbations in LIBERO-PRO.

VLA-FAIL: Efficient Task Failure Detection for Finetuned Vision-Language-Action Models

cs.LG · 2026-06-19 · unverdicted · novelty 6.0

VLA-FAIL introduces last-layer Mahalanobis distance and action chunk consistency detectors that together enable early, reliable failure detection in finetuned VLAs without failure data or expensive sampling.

Vesta: A Generalist Embodied Reasoning Model

cs.RO · 2026-06-18 · unverdicted · novelty 6.0

Vesta is a unified embodied generalist model that outperforms specialist baselines by over 20% on average and improves real-world robotic task success by over 35%.

ImageWAM: Do World Action Models Really Need Video Generation, or Just Image Editing?

cs.CV · 2026-06-17 · unverdicted · novelty 6.0

ImageWAM shows image editing models can replace video generation in world action models, delivering better performance with 6x lower FLOPs and 4x lower latency by using edit-derived KV caches as compact context.

MaskWAM: Unifying Mask Prompting and Prediction for World-Action Models

cs.CV · 2026-06-11 · unverdicted · novelty 6.0

MaskWAM unifies mask prompting and prediction in world-action models via Mixture of Transformers to improve robotic policy generalization on language-ambiguous tasks.

Learning What to Say to Your VLA: Mostly Harmless Vision Language Action Model Steering

cs.RO · 2026-06-10 · unverdicted · novelty 6.0

A search-and-distill framework with conformalized improvement head produces a language feedback policy that boosts frozen VLA performance by 24.7% in simulation and 65% on hardware while guaranteeing harmlessness on perturbations.

LARA: Latent Action Representation Alignment for Vision-Language-Action Models

cs.CV · 2026-06-05 · unverdicted · novelty 6.0 · 2 refs

LARA jointly optimizes LAM and VLA models via representation alignment to improve robotic manipulation performance using human videos.

AxisGuide: Grounding Robot Action Coordinate System in RGB Observations for Robust Visuomotor Manipulation

cs.RO · 2026-06-04 · unverdicted · novelty 6.0

AxisGuide augments RGB images with rendered robot base-frame axis cues to improve generalization of visuomotor manipulation policies under distribution shifts.

citing papers explorer

Showing 50 of 71 citing papers.

Molmo2: Open Weights and Data for Vision-Language Models with Video Understanding and Grounding cs.CV · 2026-01-15 · unverdicted · none · ref 73 · internal anchor
Molmo2 delivers state-of-the-art open-weight video VLMs with new grounding datasets and training methods that outperform prior open models and match or exceed some proprietary ones on pointing and tracking tasks.
NAC: Neural Action Codec for Vision-Language-Action Models cs.RO · 2026-06-19 · unverdicted · none · ref 10 · internal anchor
NAC adapts multi-scale RVQGAN audio codecs with kinematic-specific losses to produce ordered action tokens that yield lower reconstruction error and higher task success than prior tokenizers in VLA models.
MuseVLA: An Adaptive Multimodal Sensing Vision-Language-Action Model for Robotic Manipulation cs.RO · 2026-06-16 · unverdicted · none · ref 12 · internal anchor
MuseVLA adds on-demand sensor selection via tokens and converts readings into grounded sensor images for multimodal fusion, reporting 80.6% average success on real-robot dexterous tasks that need non-visual sensing.
Trajectory-Level Redirection Attacks on Vision-Language-Action Models cs.RO · 2026-06-11 · unverdicted · none · ref 12 · internal anchor
A prompt-only attack called command-preserving trajectory redirection can steer VLA robot behavior to attacker-chosen physical outcomes while the text still appears to match the intended task.
Back to the Familiar Future: Failure Recovery for VLA Policies via Pre-Imagined Milestone Selection cs.RO · 2026-06-08 · unverdicted · none · ref 32 · internal anchor
B2FF pre-generates a milestone bank of familiar future states from the clean initial observation and uses a recoverability-aware selector to guide VLA policies back from deviations, raising average success rate from 56.3% to 74.0% on failure-injected LIBERO.
VoLo: A Physical Orchestrator for Open-Vocabulary Long-Horizon Manipulation cs.RO · 2026-06-05 · unverdicted · none · ref 35 · internal anchor
VoLoAgent uses a VLM to steer heterogeneous robot capabilities as interruptible tools for long-horizon manipulation and introduces the RoboVoLo benchmark, claiming substantial outperformance over single VLA/VLM or tool-based systems with real-robot validation.
Colosseum V2: Benchmarking Generalization for Vision Language Action Models cs.RO · 2026-05-26 · unverdicted · none · ref 41 · internal anchor
Introduces Colosseum V2 benchmark for evaluating VLA model generalization in robotic manipulation with 28 tasks, revealing limitations in current methods and sim-real correlations.
CoRAL: Contact-Rich Adaptive LLM-based Control for Robotic Manipulation cs.RO · 2026-05-04 · unverdicted · none · ref 15 · 2 links · internal anchor
CoRAL lets LLMs act as adaptive cost designers for motion planners while using VLM priors and online identification to handle unknown physics, achieving over 50% higher success rates than baselines in unseen contact-rich robotic scenarios.
Thinking in Text and Images: Interleaved Vision--Language Reasoning Traces for Long-Horizon Robot Manipulation cs.AI · 2026-05-01 · unverdicted · none · ref 13 · internal anchor
A multimodal transformer generates and caches interleaved text-image traces to guide closed-loop actions, achieving 92.4% success on LIBERO-Long and 95.5% average on LIBERO.
Being-H0.7: A Latent World-Action Model from Egocentric Videos cs.RO · 2026-04-30 · unverdicted · none · ref 51 · internal anchor
Being-H0.7 adds future-aware latent reasoning to direct VLA policies via dual-branch alignment on latent queries, matching world-model benefits at VLA efficiency.
CF-VLA: Efficient Coarse-to-Fine Action Generation for Vision-Language-Action Policies cs.CV · 2026-04-27 · unverdicted · none · ref 22 · internal anchor
CF-VLA uses a coarse initialization over endpoint velocity followed by single-step refinement to achieve strong performance with low inference steps on CALVIN, LIBERO, and real-robot tasks.
Action Images: End-to-End Policy Learning via Multiview Video Generation cs.CV · 2026-04-07 · unverdicted · none · ref 31 · internal anchor
Action Images turn robot arm motions into interpretable multiview pixel videos, letting video backbones serve as zero-shot policies for end-to-end robot learning.
Token Warping Helps MLLMs Look from Nearby Viewpoints cs.CV · 2026-04-03 · unverdicted · none · ref 46 · internal anchor
Backward token warping in ViT-based MLLMs enables reliable reasoning from nearby viewpoints by preserving semantic coherence better than pixel-wise warping or fine-tuning baselines.
DFM-VLA: Iterative Action Refinement for Robot Manipulation via Discrete Flow Matching cs.RO · 2026-03-27 · conditional · none · ref 11 · internal anchor
DFM-VLA uses discrete flow matching to iteratively refine action tokens in VLA models, outperforming autoregressive and diffusion baselines with 4.44 average success length on CALVIN and 95.7% success on LIBERO.
Learning Physics from Pretrained Video Models: A Multimodal Continuous and Sequential World Interaction Models for Robotic Manipulation cs.RO · 2026-02-18 · unverdicted · none · ref 29 · internal anchor
PhysGen uses video models to learn physics for robots, outperforming baselines by up to 13.8% on LIBERO and matching specialized models in real-world tasks.
LocalNav: Distilling Frontier VLMs and Embodied RL for On-Device Object Goal Navigation cs.RO · 2026-06-26 · unverdicted · none · ref 15 · internal anchor
Distillation from frontier VLMs plus E-RLVR regularization produces a 4B local model that achieves 34.5% SR on OVON while cutting inference latency by 82.8%.
Direct Action-Head Injection of A Grounded 3D Point Unlocks Spatial and Task Generalization cs.RO · 2026-06-26 · unverdicted · none · ref 16 · internal anchor
Direct 3D point grounding injected into the action head via a two-layer MLP and adaptive layer norm boosts VLA success rates by 32-46 points on spatial and task perturbations in LIBERO-PRO.
VLA-FAIL: Efficient Task Failure Detection for Finetuned Vision-Language-Action Models cs.LG · 2026-06-19 · unverdicted · none · ref 4 · internal anchor
VLA-FAIL introduces last-layer Mahalanobis distance and action chunk consistency detectors that together enable early, reliable failure detection in finetuned VLAs without failure data or expensive sampling.
Vesta: A Generalist Embodied Reasoning Model cs.RO · 2026-06-18 · unverdicted · none · ref 58 · internal anchor
Vesta is a unified embodied generalist model that outperforms specialist baselines by over 20% on average and improves real-world robotic task success by over 35%.
ImageWAM: Do World Action Models Really Need Video Generation, or Just Image Editing? cs.CV · 2026-06-17 · unverdicted · none · ref 54 · internal anchor
ImageWAM shows image editing models can replace video generation in world action models, delivering better performance with 6x lower FLOPs and 4x lower latency by using edit-derived KV caches as compact context.
MaskWAM: Unifying Mask Prompting and Prediction for World-Action Models cs.CV · 2026-06-11 · unverdicted · none · ref 24 · internal anchor
MaskWAM unifies mask prompting and prediction in world-action models via Mixture of Transformers to improve robotic policy generalization on language-ambiguous tasks.
Learning What to Say to Your VLA: Mostly Harmless Vision Language Action Model Steering cs.RO · 2026-06-10 · unverdicted · none · ref 7 · internal anchor
A search-and-distill framework with conformalized improvement head produces a language feedback policy that boosts frozen VLA performance by 24.7% in simulation and 65% on hardware while guaranteeing harmlessness on perturbations.
LARA: Latent Action Representation Alignment for Vision-Language-Action Models cs.CV · 2026-06-05 · unverdicted · none · ref 55 · 2 links · internal anchor
LARA jointly optimizes LAM and VLA models via representation alignment to improve robotic manipulation performance using human videos.
AxisGuide: Grounding Robot Action Coordinate System in RGB Observations for Robust Visuomotor Manipulation cs.RO · 2026-06-04 · unverdicted · none · ref 16 · internal anchor
AxisGuide augments RGB images with rendered robot base-frame axis cues to improve generalization of visuomotor manipulation policies under distribution shifts.
Revisiting Embodied Chain-of-Thought for Generalizable Robot Manipulation cs.RO · 2026-06-02 · unverdicted · none · ref 26 · internal anchor
ERVLA trains on a 978k-trajectory embodied CoT corpus using reasoning as supervision with dropout, then predicts actions without CoT at test time, reaching 86.9% on LIBERO-Plus and 53.2% on VLABench.
Cosmos 3: Omnimodal World Models for Physical AI cs.CV · 2026-06-01 · unverdicted · none · ref 10 · internal anchor
Cosmos 3 presents a unified omnimodal world model family based on mixture-of-transformers that processes language, vision, audio, and action for Physical AI applications.
See Less, Specify More: Visual Evidence Budgets for Generalizable VLAs cs.RO · 2026-06-01 · unverdicted · none · ref 28 · internal anchor
S2 improves generalization in vision-language-action models by using goal-preserving refined language guidance and explicit visual evidence budgets, raising mean subtask success from 54.2% to 79.0% on eight real-robot tasks compared to pi0.5.
Continuous Reasoning for Vision-Language-Action cs.RO · 2026-05-29 · unverdicted · none · ref 20 · internal anchor
Continuous Reasoning for VLA introduces a shared Gaussian latent for continuous thoughts, trained with self-verification to improve action prediction on LIBERO-PRO and real robots.
RoboWits: Unexpected Challenges for Robotic Creative Problem Solving cs.RO · 2026-05-28 · unverdicted · none · ref 23 · internal anchor
RoboWits benchmark with 238 tasks shows pre-trained VLAs succeed on seed tasks but fail on mutated ones, highlighting brittleness in reasoning.
ReasonBreak: Probing Vulnerabilities in Reasoning-Enabled Vision-Language-Action Models for Autonomous Driving cs.CR · 2026-05-27 · unverdicted · none · ref 24 · internal anchor
ReasonBreak demonstrates up to 89% attack success on reasoning and 72% on trajectories in NVIDIA Alpamayo VLA models via black-box textual perturbations, introducing a reasoning-aware evaluation framework and benchmark for autonomous driving.
TacO: Benchmarking Tactile Sensors for Object Manipulation cs.RO · 2026-05-21 · unverdicted · none · ref 18 · internal anchor
The paper provides a task-driven benchmark comparing visual, acoustic, magnetic, and resistive tactile sensors on three manipulation tasks and concludes that sensor utility depends on modality, material friction, and task specifics.
Guide, Think, Act: Interactive Embodied Reasoning in Vision-Language-Action Models cs.RO · 2026-05-13 · conditional · none · ref 20 · internal anchor
GTA-VLA conditions VLA models on user spatial priors to produce a unified spatial-visual chain-of-thought, reaching 81.2% success on SimplerEnv WidowX and improving performance under out-of-distribution shifts.
Escaping the Diversity Trap in Robotic Manipulation via Anchor-Centric Adaptation cs.RO · 2026-05-08 · unverdicted · none · ref 24 · internal anchor
Anchor-Centric Adaptation escapes the diversity trap by prioritizing repeated demonstrations at core anchors over broad coverage, yielding higher success rates under fixed data budgets in robotic manipulation.
MolmoAct2: Action Reasoning Models for Real-world Deployment cs.RO · 2026-05-04 · unverdicted · none · ref 23 · 2 links · internal anchor
MolmoAct2 is an open VLA model that outperforms baselines like Pi-05 on 7 benchmarks and whose backbone surpasses GPT-5 on 13 embodied-reasoning tasks through new datasets, specialized training, and architecture changes for lower latency.
$M^2$-VLA: Boosting Vision-Language Models for Generalizable Manipulation via Layer Mixture and Meta-Skills cs.RO · 2026-04-27 · unverdicted · none · ref 27 · internal anchor
M²-VLA shows that generalized VLMs can serve as direct backbones for robotic manipulation by selectively extracting task-critical features via Mixture of Layers and adding Meta Skill Modules for efficient trajectory learning.
Robotic Manipulation is Vision-to-Geometry Mapping ($f(v) \rightarrow G$): Vision-Geometry Backbones over Language and Video Models cs.RO · 2026-04-14 · unverdicted · none · ref 28 · internal anchor
Vision-geometry backbones using pretrained 3D world models outperform vision-language and video models for robotic manipulation by enabling direct mapping from visual input to geometric actions.
A1: A Fully Transparent Open-Source, Adaptive and Efficient Truncated Vision-Language-Action Model cs.RO · 2026-04-07 · unverdicted · none · ref 20 · internal anchor
A1 is a transparent VLA framework achieving state-of-the-art robot manipulation success with up to 72% lower latency via adaptive layer truncation and inter-layer flow matching.
ExpertGen: Scalable Sim-to-Real Expert Policy Learning from Imperfect Behavior Priors cs.RO · 2026-03-16 · conditional · none · ref 17 · internal anchor
ExpertGen generates high-success expert policies in simulation from imperfect priors by freezing a diffusion behavior model and optimizing its initial noise via RL, then distills them for real-robot deployment.
Robometer: Scaling General-Purpose Robotic Reward Models via Trajectory Comparisons cs.RO · 2026-03-02 · unverdicted · none · ref 92 · internal anchor
Robometer combines intra-trajectory progress supervision with inter-trajectory preference supervision on a 1M-trajectory dataset to learn more generalizable robotic reward functions than prior methods.
Universal Pose Pretraining for Generalizable Vision-Language-Action Policies cs.CV · 2026-02-23 · unverdicted · none · ref 19 · internal anchor
Pose-VLA uses a decoupled two-stage pre-training with discrete pose tokens to extract universal 3D spatial priors from 3D datasets and robotic trajectories, achieving 79.5% success on RoboTwin 2.0 and 96.0% on LIBERO.
VLANeXt: Recipes for Building Strong VLA Models cs.CV · 2026-02-20 · conditional · none · ref 21 · internal anchor
VLANeXt distills 12 design insights from a unified VLA study into a model that outperforms prior methods on LIBERO benchmarks while releasing code for further exploration.
World Action Models are Zero-shot Policies cs.RO · 2026-02-17 · unverdicted · none · ref 59 · internal anchor
DreamZero uses a 14B video diffusion model as a World Action Model to achieve over 2x better zero-shot generalization on real robots than state-of-the-art VLAs, real-time 7Hz closed-loop control, and cross-embodiment transfer with 10-30 minutes of data.
Steerable Vision-Language-Action Policies for Embodied Reasoning and Hierarchical Control cs.RO · 2026-02-13 · unverdicted · none · ref 29 · internal anchor
Steerable VLAs trained on rich synthetic commands at subtask, motion, and pixel levels enable VLMs to steer robot behavior more effectively, outperforming prior hierarchical baselines on real-world manipulation and generalization tasks.
ABot-M0: VLA Foundation Model for Robotic Manipulation with Action Manifold Learning cs.CV · 2026-02-11 · unverdicted · none · ref 23 · internal anchor
ABot-M0 unifies heterogeneous robot data into a 6-million-trajectory dataset and introduces Action Manifold Learning to predict stable actions on a low-dimensional manifold using a DiT backbone.
Towards Long-Lived Robots: Continual Learning VLA Models via Reinforcement Fine-Tuning cs.RO · 2026-02-11 · unverdicted · none · ref 32 · internal anchor
LifeLong-RFT applies chunking-level on-policy reinforcement learning with Quantized Action Consistency Reward, Continuous Trajectory Alignment Reward, and Format Compliance Reward to fine-tune VLA models, achieving a 22% average success rate gain over supervised fine-tuning on the LIBERO benchmark's
Self-Supervised Bootstrapping of Action-Predictive Embodied Reasoning cs.RO · 2026-02-09 · unverdicted · none · ref 77 · internal anchor
R&B-EnCoRe uses self-supervised importance-weighted variational inference to distill action-predictive reasoning datasets that improve VLA performance on manipulation, navigation, and driving tasks without external verifiers.
Vision-aligned Latent Reasoning for Multi-modal Large Language Model cs.CV · 2026-02-04 · unverdicted · none · ref 17 · internal anchor
VaLR generates vision-aligned latent tokens before each reasoning step to preserve perceptual cues, improving VSI-Bench accuracy from 33.0% to 52.9%.
Mull-Tokens: Modality-Agnostic Latent Thinking cs.CV · 2025-12-11 · unverdicted · none · ref 29 · internal anchor
Mull-Tokens are modality-agnostic latent tokens that enable free-form multimodal thinking and deliver up to 16% gains on spatial reasoning benchmarks.
SPEAR-1: Scaling Beyond Robot Demonstrations via 3D Understanding cs.RO · 2025-11-21 · unverdicted · none · ref 20 · internal anchor
SPEAR-1 combines a 3D-enriched VLM with embodied control to match or exceed existing robotic foundation models using 20 times fewer robot demonstrations.
BOP-ASK: Object-Interaction Reasoning for Vision-Language Models cs.CV · 2025-11-20 · unverdicted · none · ref 29 · internal anchor
BOP-ASK supplies 150k images and 33M QA pairs across six tasks to improve VLMs on precise 3D object interaction reasoning and spatial planning.
