hub Canonical reference

InternVLA-M1: A Spatially Guided Vision-Language-Action Framework for Generalist Robot Policy

Xinyi Chen, Yilun Chen, Yanwei Fu, Ning Gao, Jiaya Jia, Weiyang Jin · 2025 · cs.RO · arXiv 2510.13778

Canonical reference. 79% of citing Pith papers cite this work as background.

28 Pith papers citing it

Background 79% of classified citations

open full Pith review browse 28 citing papers arXiv PDF

abstract

We introduce InternVLA-M1, a unified framework for spatial grounding and robot control that advances instruction-following robots toward scalable, general-purpose intelligence. Its core idea is spatially guided vision-language-action training, where spatial grounding serves as the critical link between instructions and robot actions. InternVLA-M1 employs a two-stage pipeline: (i) spatial grounding pre-training on over 2.3M spatial reasoning data to determine ``where to act'' by aligning instructions with visual, embodiment-agnostic positions, and (ii) spatially guided action post-training to decide ``how to act'' by generating embodiment-aware actions through plug-and-play spatial prompting. This spatially guided training recipe yields consistent gains: InternVLA-M1 outperforms its variant without spatial guidance by +14.6% on SimplerEnv Google Robot, +17% on WidowX, and +4.3% on LIBERO Franka, while demonstrating stronger spatial reasoning capability in box, point, and trace prediction. To further scale instruction following, we built a simulation engine to collect 244K generalizable pick-and-place episodes, enabling a 6.2% average improvement across 200 tasks and 3K+ objects. In real-world clustered pick-and-place, InternVLA-M1 improved by 7.3%, and with synthetic co-training, achieved +20.6% on unseen objects and novel configurations. Moreover, in long-horizon reasoning-intensive scenarios, it surpassed existing works by over 10%. These results highlight spatially guided training as a unifying principle for scalable and resilient generalist robots. Code and models are available at https://github.com/InternRobotics/InternVLA-M1.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 10 baseline 2 dataset 1 method 1

citation-polarity summary

background 11 baseline 2 use method 1

representative citing papers

From Imagined Futures to Executable Actions: Mixture of Latent Actions for Robot Manipulation

cs.RO · 2026-05-12 · unverdicted · novelty 7.0

MoLA infers a mixture of latent actions from generated future videos via modality-aware inverse dynamics models to improve robot manipulation policies.

LoopVLA: Learning Sufficiency in Recurrent Refinement for Vision-Language-Action Models

cs.AI · 2026-05-11 · unverdicted · novelty 7.0

LoopVLA adds recurrent refinement and learned sufficiency estimation to VLA models, cutting parameters 45% and raising throughput 1.7x while matching baseline task success on LIBERO and VLA-Arena.

OA-WAM: Object-Addressable World Action Model for Robust Robot Manipulation

cs.RO · 2026-05-07 · unverdicted · novelty 7.0

OA-WAM uses persistent address vectors and dynamic content vectors in object slots to enable addressable world-action prediction, improving robustness on manipulation benchmarks under scene changes.

Agentic World Modeling: Foundations, Capabilities, Laws, and Beyond

cs.AI · 2026-04-24 · unverdicted · novelty 7.0

Proposes a levels x laws taxonomy for world models in AI agents, defining L1-L3 capabilities across physical, digital, social, and scientific regimes while reviewing over 400 works to outline a roadmap for advanced agentic modeling.

HiPolicy: Hierarchical Multi-Frequency Action Chunking for Policy Learning

cs.RO · 2026-04-07 · unverdicted · novelty 7.0

HiPolicy is a new hierarchical multi-frequency action chunking method for imitation learning that jointly generates coarse and fine action sequences with entropy-guided execution to improve performance and efficiency in robotic manipulation.

Towards Generalizable Robotic Manipulation in Dynamic Environments

cs.CV · 2026-03-16 · unverdicted · novelty 7.0

DOMINO dataset and PUMA architecture enable better dynamic robotic manipulation by incorporating motion history, delivering 6.3% higher success rates than prior VLA models.

FrameSkip: Learning from Fewer but More Informative Frames in VLA Training

cs.RO · 2026-05-13 · unverdicted · novelty 6.0

FrameSkip improves VLA policy training success from 66.50% to 76.15% by selecting high-importance frames and retaining only 20% of unique frames across three benchmarks.

Guide, Think, Act: Interactive Embodied Reasoning in Vision-Language-Action Models

cs.RO · 2026-05-13 · conditional · novelty 6.0

GTA-VLA conditions VLA models on user spatial priors to produce a unified spatial-visual chain-of-thought, reaching 81.2% success on SimplerEnv WidowX and improving performance under out-of-distribution shifts.

ForgeVLA: Federated Vision-Language-Action Learning without Language Annotations

cs.CV · 2026-05-08 · unverdicted · novelty 6.0

ForgeVLA enables federated VLA model training from unlabeled vision-action pairs by recovering language via embodied classifiers and using contrastive planning plus adaptive aggregation to avoid feature collapse.

PRTS: A Primitive Reasoning and Tasking System via Contrastive Representations

cs.AI · 2026-04-30 · unverdicted · novelty 6.0

PRTS pretrains VLA models with contrastive goal-conditioned RL to embed goal-reachability probabilities from offline data, yielding SOTA results on robotic benchmarks especially for long-horizon and novel instructions.

Unmasking the Illusion of Embodied Reasoning in Vision-Language-Action Models

cs.RO · 2026-04-20 · unverdicted · novelty 6.0

State-of-the-art vision-language-action models catastrophically fail dynamic embodied reasoning due to lexical-kinematic shortcuts, behavioral inertia, and semantic feature collapse caused by architectural bottlenecks, as shown by the new BeTTER benchmark with real-world validation.

OFlow: Injecting Object-Aware Temporal Flow Matching for Robust Robotic Manipulation

cs.RO · 2026-04-20 · unverdicted · novelty 6.0

OFlow unifies temporal foresight and object-aware reasoning inside a shared latent space via flow matching to improve VLA robustness in robotic manipulation under distribution shifts.

VADF: Vision-Adaptive Diffusion Policy Framework for Efficient Robotic Manipulation

cs.RO · 2026-04-17 · unverdicted · novelty 6.0

VADF adds an Adaptive Loss Network for hard-negative training sampling and a Hierarchical Vision Task Segmenter for adaptive noise scheduling during inference to speed convergence and reduce timeouts in diffusion robotic policies.

SIM1: Physics-Aligned Simulator as Zero-Shot Data Scaler in Deformable Worlds

cs.RO · 2026-04-09 · unverdicted · novelty 6.0

SIM1 converts sparse real demonstrations into high-fidelity synthetic data through physics-aligned simulation, yielding policies that match real-data performance at a 1:15 ratio with 90% zero-shot success on deformable manipulation.

A1: A Fully Transparent Open-Source, Adaptive and Efficient Truncated Vision-Language-Action Model

cs.RO · 2026-04-07 · unverdicted · novelty 6.0

A1 is a transparent VLA framework achieving state-of-the-art robot manipulation success with up to 72% lower latency via adaptive layer truncation and inter-layer flow matching.

SABER: A Stealthy Agentic Black-Box Attack Framework for Vision-Language-Action Models

cs.RO · 2026-03-26 · unverdicted · novelty 6.0

SABER uses a trained ReAct agent to produce bounded adversarial edits to robot instructions, cutting task success by 20.6% and increasing execution length and violations on the LIBERO benchmark across six VLA models.

ABot-M0: VLA Foundation Model for Robotic Manipulation with Action Manifold Learning

cs.CV · 2026-02-11 · unverdicted · novelty 6.0

ABot-M0 unifies heterogeneous robot data into a 6-million-trajectory dataset and introduces Action Manifold Learning to predict stable actions on a low-dimensional manifold using a DiT backbone.

A Pragmatic VLA Foundation Model

cs.RO · 2026-01-26 · unverdicted · novelty 6.0

LingBot-VLA is a VLA foundation model trained on massive real robot data that shows superior generalization across tasks and platforms with fast training throughput.

AsyncVLA: Asynchronous Flow Matching for Vision-Language-Action Models

cs.RO · 2025-11-18 · unverdicted · novelty 6.0

AsyncVLA adds asynchronous flow matching and a confidence rater to VLA models so they can generate actions on flexible schedules and selectively refine low-confidence tokens before execution.

PointACT: Vision-Language-Action Models with Multi-Scale Point-Action Interaction

cs.RO · 2026-05-20 · unverdicted · novelty 5.0

PointACT proposes a 3D-aware dual-system VLA policy using multi-scale point-action interaction with bottleneck window self-attention, achieving 10% higher success rates on RLBench-10Tasks over prior pretrained VLAs.

Humanoid Whole-Body Manipulation via Active Spatial Brain and Generalizable Action Cerebellum

cs.RO · 2026-05-20 · unverdicted · novelty 5.0

A multi-agent LLM framework for humanoid loco-manipulation that separates active spatial perception and task planning from generalizable action generation without task-specific real-robot data.

RoVLA: Multi-Consistency Constraints for Robust Vision-Language-Action Models

cs.RO · 2026-05-19 · unverdicted · novelty 5.0

RoVLA enforces instructional, evolutionary, and observational consistency to improve robustness of VLA policies on manipulation benchmarks and real robots.

X-Imitator: Spatial-Aware Imitation Learning via Bidirectional Action-Pose Interaction

cs.RO · 2026-05-12 · unverdicted · novelty 5.0

X-Imitator is a bidirectional action-pose interaction framework for spatial-aware imitation learning that outperforms vanilla policies and explicit pose guidance on 24 simulated and 3 real-world robotic tasks.

Learning Action Manifold with Multi-view Latent Priors for Robotic Manipulation

cs.RO · 2026-05-12 · unverdicted · novelty 5.0

The method uses multi-view diffusion priors and action manifold learning to resolve depth ambiguity and improve action prediction in VLA robotic manipulation models, reporting higher success rates than baselines on LIBERO, RoboTwin, and real-robot tasks.

citing papers explorer

Showing 28 of 28 citing papers.

From Imagined Futures to Executable Actions: Mixture of Latent Actions for Robot Manipulation cs.RO · 2026-05-12 · unverdicted · none · ref 12 · internal anchor
MoLA infers a mixture of latent actions from generated future videos via modality-aware inverse dynamics models to improve robot manipulation policies.
LoopVLA: Learning Sufficiency in Recurrent Refinement for Vision-Language-Action Models cs.AI · 2026-05-11 · unverdicted · none · ref 24 · internal anchor
LoopVLA adds recurrent refinement and learned sufficiency estimation to VLA models, cutting parameters 45% and raising throughput 1.7x while matching baseline task success on LIBERO and VLA-Arena.
OA-WAM: Object-Addressable World Action Model for Robust Robot Manipulation cs.RO · 2026-05-07 · unverdicted · none · ref 14 · internal anchor
OA-WAM uses persistent address vectors and dynamic content vectors in object slots to enable addressable world-action prediction, improving robustness on manipulation benchmarks under scene changes.
Agentic World Modeling: Foundations, Capabilities, Laws, and Beyond cs.AI · 2026-04-24 · unverdicted · none · ref 53 · internal anchor
Proposes a levels x laws taxonomy for world models in AI agents, defining L1-L3 capabilities across physical, digital, social, and scientific regimes while reviewing over 400 works to outline a roadmap for advanced agentic modeling.
HiPolicy: Hierarchical Multi-Frequency Action Chunking for Policy Learning cs.RO · 2026-04-07 · unverdicted · none · ref 5 · internal anchor
HiPolicy is a new hierarchical multi-frequency action chunking method for imitation learning that jointly generates coarse and fine action sequences with entropy-guided execution to improve performance and efficiency in robotic manipulation.
Towards Generalizable Robotic Manipulation in Dynamic Environments cs.CV · 2026-03-16 · unverdicted · none · ref 12 · internal anchor
DOMINO dataset and PUMA architecture enable better dynamic robotic manipulation by incorporating motion history, delivering 6.3% higher success rates than prior VLA models.
FrameSkip: Learning from Fewer but More Informative Frames in VLA Training cs.RO · 2026-05-13 · unverdicted · none · ref 2 · internal anchor
FrameSkip improves VLA policy training success from 66.50% to 76.15% by selecting high-importance frames and retaining only 20% of unique frames across three benchmarks.
Guide, Think, Act: Interactive Embodied Reasoning in Vision-Language-Action Models cs.RO · 2026-05-13 · conditional · none · ref 8 · internal anchor
GTA-VLA conditions VLA models on user spatial priors to produce a unified spatial-visual chain-of-thought, reaching 81.2% success on SimplerEnv WidowX and improving performance under out-of-distribution shifts.
ForgeVLA: Federated Vision-Language-Action Learning without Language Annotations cs.CV · 2026-05-08 · unverdicted · none · ref 12 · internal anchor
ForgeVLA enables federated VLA model training from unlabeled vision-action pairs by recovering language via embodied classifiers and using contrastive planning plus adaptive aggregation to avoid feature collapse.
PRTS: A Primitive Reasoning and Tasking System via Contrastive Representations cs.AI · 2026-04-30 · unverdicted · none · ref 7 · internal anchor
PRTS pretrains VLA models with contrastive goal-conditioned RL to embed goal-reachability probabilities from offline data, yielding SOTA results on robotic benchmarks especially for long-horizon and novel instructions.
Unmasking the Illusion of Embodied Reasoning in Vision-Language-Action Models cs.RO · 2026-04-20 · unverdicted · none · ref 45 · internal anchor
State-of-the-art vision-language-action models catastrophically fail dynamic embodied reasoning due to lexical-kinematic shortcuts, behavioral inertia, and semantic feature collapse caused by architectural bottlenecks, as shown by the new BeTTER benchmark with real-world validation.
OFlow: Injecting Object-Aware Temporal Flow Matching for Robust Robotic Manipulation cs.RO · 2026-04-20 · unverdicted · none · ref 12 · internal anchor
OFlow unifies temporal foresight and object-aware reasoning inside a shared latent space via flow matching to improve VLA robustness in robotic manipulation under distribution shifts.
VADF: Vision-Adaptive Diffusion Policy Framework for Efficient Robotic Manipulation cs.RO · 2026-04-17 · unverdicted · none · ref 2 · internal anchor
VADF adds an Adaptive Loss Network for hard-negative training sampling and a Hierarchical Vision Task Segmenter for adaptive noise scheduling during inference to speed convergence and reduce timeouts in diffusion robotic policies.
SIM1: Physics-Aligned Simulator as Zero-Shot Data Scaler in Deformable Worlds cs.RO · 2026-04-09 · unverdicted · none · ref 14 · internal anchor
SIM1 converts sparse real demonstrations into high-fidelity synthetic data through physics-aligned simulation, yielding policies that match real-data performance at a 1:15 ratio with 90% zero-shot success on deformable manipulation.
A1: A Fully Transparent Open-Source, Adaptive and Efficient Truncated Vision-Language-Action Model cs.RO · 2026-04-07 · unverdicted · none · ref 5 · internal anchor
A1 is a transparent VLA framework achieving state-of-the-art robot manipulation success with up to 72% lower latency via adaptive layer truncation and inter-layer flow matching.
SABER: A Stealthy Agentic Black-Box Attack Framework for Vision-Language-Action Models cs.RO · 2026-03-26 · unverdicted · none · ref 40 · internal anchor
SABER uses a trained ReAct agent to produce bounded adversarial edits to robot instructions, cutting task success by 20.6% and increasing execution length and violations on the LIBERO benchmark across six VLA models.
ABot-M0: VLA Foundation Model for Robotic Manipulation with Action Manifold Learning cs.CV · 2026-02-11 · unverdicted · none · ref 15 · internal anchor
ABot-M0 unifies heterogeneous robot data into a 6-million-trajectory dataset and introduces Action Manifold Learning to predict stable actions on a low-dimensional manifold using a DiT backbone.
A Pragmatic VLA Foundation Model cs.RO · 2026-01-26 · unverdicted · none · ref 9 · internal anchor
LingBot-VLA is a VLA foundation model trained on massive real robot data that shows superior generalization across tasks and platforms with fast training throughput.
AsyncVLA: Asynchronous Flow Matching for Vision-Language-Action Models cs.RO · 2025-11-18 · unverdicted · none · ref 10 · internal anchor
AsyncVLA adds asynchronous flow matching and a confidence rater to VLA models so they can generate actions on flexible schedules and selectively refine low-confidence tokens before execution.
PointACT: Vision-Language-Action Models with Multi-Scale Point-Action Interaction cs.RO · 2026-05-20 · unverdicted · none · ref 13 · internal anchor
PointACT proposes a 3D-aware dual-system VLA policy using multi-scale point-action interaction with bottleneck window self-attention, achieving 10% higher success rates on RLBench-10Tasks over prior pretrained VLAs.
Humanoid Whole-Body Manipulation via Active Spatial Brain and Generalizable Action Cerebellum cs.RO · 2026-05-20 · unverdicted · none · ref 5 · internal anchor
A multi-agent LLM framework for humanoid loco-manipulation that separates active spatial perception and task planning from generalizable action generation without task-specific real-robot data.
RoVLA: Multi-Consistency Constraints for Robust Vision-Language-Action Models cs.RO · 2026-05-19 · unverdicted · none · ref 14 · internal anchor
RoVLA enforces instructional, evolutionary, and observational consistency to improve robustness of VLA policies on manipulation benchmarks and real robots.
X-Imitator: Spatial-Aware Imitation Learning via Bidirectional Action-Pose Interaction cs.RO · 2026-05-12 · unverdicted · none · ref 11 · internal anchor
X-Imitator is a bidirectional action-pose interaction framework for spatial-aware imitation learning that outperforms vanilla policies and explicit pose guidance on 24 simulated and 3 real-world robotic tasks.
Learning Action Manifold with Multi-view Latent Priors for Robotic Manipulation cs.RO · 2026-05-12 · unverdicted · none · ref 80 · internal anchor
The method uses multi-view diffusion priors and action manifold learning to resolve depth ambiguity and improve action prediction in VLA robotic manipulation models, reporting higher success rates than baselines on LIBERO, RoboTwin, and real-robot tasks.
Dynamic Skill Lifecycle Management for Agentic Reinforcement Learning cs.LG · 2026-05-11 · unverdicted · none · ref 9 · 2 links · internal anchor
SLIM dynamically optimizes the active external skill set in agentic RL via leave-one-skill-out marginal contribution estimates and lifecycle operations, delivering a 7.1% average gain over baselines on ALFWorld and SearchQA while showing some skills remain externally useful.
HiVLA: A Visual-Grounded-Centric Hierarchical Embodied Manipulation System cs.CV · 2026-04-15 · unverdicted · none · ref 11 · 2 links · internal anchor
HiVLA decouples VLM-based semantic planning with visual grounding from a cascaded cross-attention DiT action expert, outperforming end-to-end VLAs on long-horizon and fine-grained manipulation.
World Action Models: The Next Frontier in Embodied AI cs.RO · 2026-05-12 · unverdicted · none · ref 173 · internal anchor
The paper introduces World Action Models as a new paradigm unifying predictive world modeling with action generation in embodied foundation models and provides a taxonomy of existing approaches.
Dual-Anchoring: Addressing State Drift in Vision-Language Navigation cs.CV · 2026-04-19 · unreviewed · ref 120 · internal anchor

InternVLA-M1: A Spatially Guided Vision-Language-Action Framework for Generalist Robot Policy

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer