InternVLA-M1: A Spatially Guided Vision-Language-Action Framework for Generalist Robot Policy
Pith reviewed 2026-05-14 20:02 UTC · model grok-4.3
The pith
Spatially guided pre-training on millions of examples teaches robots where to act before how, yielding gains up to 17 percent on standard benchmarks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
InternVLA-M1 shows that a two-stage spatially guided vision-language-action pipeline, with spatial grounding pre-training on 2.3 million examples followed by spatially prompted action post-training, improves both spatial reasoning accuracy and embodiment-specific action success across simulation suites and real-world clustered manipulation tasks.
What carries the argument
The two-stage pipeline of spatial grounding pre-training that produces visual position prompts followed by spatially guided action post-training that consumes those prompts to generate robot actions.
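To make that division of labor concrete, here is a minimal sketch of the pipeline under the paper's description: a stage-1 grounder emits a box, point, or trace prompt, and a stage-2 action expert consumes it to produce an action. All class and method names here are illustrative placeholders, not the paper's actual interfaces.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

@dataclass
class SpatialPrompt:
    kind: str                            # "box", "point", or "trace"
    coords: List[Tuple[float, float]]    # normalized image coordinates

class SpatialGrounder:
    """Stage 1: spatial grounding model pre-trained on embodiment-agnostic data."""
    def predict(self, image, instruction: str) -> SpatialPrompt:
        # In the paper this role is played by a VLM fine-tuned on ~2.3M spatial
        # reasoning samples; here we simply return a placeholder point prompt.
        return SpatialPrompt(kind="point", coords=[(0.5, 0.5)])

class ActionExpert:
    """Stage 2: embodiment-aware action head conditioned on spatial prompts."""
    def act(self, image, instruction: str, prompt: Optional[SpatialPrompt]) -> List[float]:
        # Placeholder 7-DoF action (e.g. end-effector deltas plus gripper command).
        return [0.0] * 7

def spatially_guided_step(grounder: SpatialGrounder, expert: ActionExpert,
                          image, instruction: str) -> List[float]:
    prompt = grounder.predict(image, instruction)   # "where to act"
    return expert.act(image, instruction, prompt)   # "how to act"
```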
If this is right
- Outperforms the no-spatial-guidance baseline by 14.6 percent on SimplerEnv Google Robot, 17 percent on WidowX, and 4.3 percent on LIBERO Franka.
- Delivers an average 6.2 percent lift across 200 simulated tasks after adding 244K pick-and-place episodes.
- Raises real-world clustered pick-and-place success by 7.3 percent and by 20.6 percent on unseen objects when synthetic data is added.
- Improves performance by more than 10 percent in long-horizon, reasoning-heavy scenarios.
Where Pith is reading between the lines
- If the spatial prompts generalize across robot bodies, the same pre-training stage could shorten adaptation time when new hardware is introduced.
- The explicit separation of spatial localization from motor generation may apply to other embodied agents that must act on visual instructions.
- Large-scale spatial reasoning datasets collected independently of any robot body could become a reusable first step for training generalist controllers.
Load-bearing premise
That spatial grounding learned on embodiment-agnostic data transfers effectively when inserted as prompts into embodiment-specific action training.
What would settle it
An ablation that replaces the learned spatial prompts with random or absent positions during the second-stage action training and measures whether the reported success-rate gains on SimplerEnv, WidowX, and LIBERO disappear.
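One way to run that decisive ablation is sketched below: train the stage-2 policy three times with identical data and compute, swapping only the source of the spatial prompt. The training and evaluation hooks (train_stage2, evaluate, the benchmark handles) are hypothetical stand-ins, not the paper's code.

```python
import random

def prompt_for(grounder, image, instruction, arm: str):
    """Spatial prompt fed to stage-2 training, chosen per ablation arm."""
    if arm == "learned":
        return grounder.predict(image, instruction)           # the paper's setting
    if arm == "random":
        return {"kind": "point", "coords": [(random.random(), random.random())]}
    return None                                               # "absent": no prompt at all

def run_ablation(grounder, train_stage2, evaluate, episodes, benchmarks):
    """Train one stage-2 policy per arm and evaluate all arms on the same benchmarks."""
    results = {}
    for arm in ("learned", "random", "absent"):
        policy = train_stage2(
            episodes,
            prompt_fn=lambda img, txt, a=arm: prompt_for(grounder, img, txt, a),
        )
        results[arm] = {bench: evaluate(policy, bench) for bench in benchmarks}
    # If the spatial prompts are load-bearing, the "learned" arm should retain the
    # reported gains on SimplerEnv, WidowX, and LIBERO while the other arms lose them.
    return results
```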
read the original abstract
We introduce InternVLA-M1, a unified framework for spatial grounding and robot control that advances instruction-following robots toward scalable, general-purpose intelligence. Its core idea is spatially guided vision-language-action training, where spatial grounding serves as the critical link between instructions and robot actions. InternVLA-M1 employs a two-stage pipeline: (i) spatial grounding pre-training on over 2.3M spatial reasoning data to determine ``where to act'' by aligning instructions with visual, embodiment-agnostic positions, and (ii) spatially guided action post-training to decide ``how to act'' by generating embodiment-aware actions through plug-and-play spatial prompting. This spatially guided training recipe yields consistent gains: InternVLA-M1 outperforms its variant without spatial guidance by +14.6% on SimplerEnv Google Robot, +17% on WidowX, and +4.3% on LIBERO Franka, while demonstrating stronger spatial reasoning capability in box, point, and trace prediction. To further scale instruction following, we built a simulation engine to collect 244K generalizable pick-and-place episodes, enabling a 6.2% average improvement across 200 tasks and 3K+ objects. In real-world clustered pick-and-place, InternVLA-M1 improved by 7.3%, and with synthetic co-training, achieved +20.6% on unseen objects and novel configurations. Moreover, in long-horizon reasoning-intensive scenarios, it surpassed existing works by over 10%. These results highlight spatially guided training as a unifying principle for scalable and resilient generalist robots. Code and models are available at https://github.com/InternRobotics/InternVLA-M1.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces InternVLA-M1, a unified vision-language-action framework that employs a two-stage pipeline: (i) spatial grounding pre-training on over 2.3M embodiment-agnostic samples to align instructions with visual positions, and (ii) spatially guided action post-training that uses plug-and-play spatial prompts (box, point, trace) to generate embodiment-specific actions. It reports consistent gains over a no-spatial-guidance variant (+14.6% on SimplerEnv Google Robot, +17% on WidowX, +4.3% on LIBERO Franka), plus further improvements from a new 244K pick-and-place simulation dataset (6.2% average across 200 tasks), real-world clustered pick-and-place (+7.3%, +20.6% with synthetic co-training on unseen objects), and long-horizon scenarios (>10% over prior work).
Significance. If the reported gains are causally attributable to the learned spatial grounding and that grounding transfers effectively, the work supplies a concrete, scalable training recipe that decouples spatial reasoning from embodiment-specific control, potentially advancing generalist robot policies. The public release of code and models is a clear strength that enables direct reproduction and extension.
major comments (3)
- [Abstract / §4] Abstract and §4 (results): the central claim that spatial guidance produces the reported deltas (+14.6% SimplerEnv, +17% WidowX, +4.3% LIBERO) rests on comparison to a “variant without spatial guidance,” yet the manuscript provides no information on whether this baseline matches total training compute, data volume, prompt format, or number of epochs. Without these controls the causal contribution of stage-1 spatial grounding cannot be isolated.
- [Abstract / Methods] Abstract and methods: concrete percentage improvements are stated without accompanying details on baseline implementations, statistical tests (e.g., standard error or significance), data splits, or ablation controls. Full verification of whether the numbers support the spatially-guided-training thesis therefore requires the complete experimental section.
- [Abstract] Abstract: the transfer assumption—that embodiment-agnostic box/point/trace predictions from the 2.3M-sample stage-1 pre-training align with the spatial requirements of successful actions on the target robots—is stated but not supported by any quantitative alignment analysis or failure-case study.
minor comments (2)
- [Abstract] The abstract states “over 2.3M spatial reasoning data”; an exact count and brief breakdown of dataset sources would improve precision.
- Ensure all result tables and figures include explicit captions, axis labels, and legends so that spatial-prediction and action-success metrics are immediately interpretable.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed comments. We address each major point below, providing clarifications on the experimental controls already present in the full manuscript while committing to explicit revisions that strengthen the presentation of our results.
read point-by-point responses
Referee: [Abstract / §4] Abstract and §4 (results): the central claim that spatial guidance produces the reported deltas (+14.6% SimplerEnv, +17% WidowX, +4.3% LIBERO) rests on comparison to a “variant without spatial guidance,” yet the manuscript provides no information on whether this baseline matches total training compute, data volume, prompt format, or number of epochs. Without these controls the causal contribution of stage-1 spatial grounding cannot be isolated.
Authors: We appreciate the referee's emphasis on isolating the causal effect. The no-spatial-guidance variant was trained with identical total compute, data volume (2.3M pre-training samples plus the 244K post-training episodes), number of epochs, and prompt formatting as the full InternVLA-M1 model; the only difference is the omission of spatial prompts during stage-2 action post-training. These matched controls are described in §4.1 and the supplementary material. We will add an explicit statement of these controls to the abstract and include a summary table in the revised §4. revision: yes
Referee: [Abstract / Methods] Abstract and methods: concrete percentage improvements are stated without accompanying details on baseline implementations, statistical tests (e.g., standard error or significance), data splits, or ablation controls. Full verification of whether the numbers support the spatially-guided-training thesis therefore requires the complete experimental section.
Authors: The complete experimental section (§4) already specifies baseline implementations (RT-2, Octo, and internal ablations), training/evaluation data splits, and ablation studies on the spatial components. We will expand the abstract with cross-references to these sections and add standard error bars together with statistical significance tests (paired t-tests) to the results tables in the revision. revision: yes
Referee: [Abstract] Abstract: the transfer assumption—that embodiment-agnostic box/point/trace predictions from the 2.3M-sample stage-1 pre-training align with the spatial requirements of successful actions on the target robots—is stated but not supported by any quantitative alignment analysis or failure-case study.
Authors: The consistent cross-embodiment gains and the stage-1 spatial prediction accuracies reported in §3.2 provide indirect quantitative support for the transfer. We agree that a direct alignment analysis would further strengthen the claim. We will add a new subsection in the revised §4 that reports correlation metrics between stage-1 grounding accuracy and downstream success rates together with representative failure cases. revision: partial
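The analyses committed to in these responses could look roughly like the following: a paired t-test over per-task success rates for the significance claim, and a per-task correlation for the grounding-to-success transfer claim. This is a minimal sketch assuming such per-task numbers are available; the function names and data layout are illustrative, not taken from the paper.

```python
import numpy as np
from scipy import stats

def paired_gain(full_rates, ablated_rates):
    """Paired t-test over per-task success rates: full model vs. no-spatial-guidance variant."""
    full = np.asarray(full_rates, dtype=float)
    abl = np.asarray(ablated_rates, dtype=float)
    diff = full - abl
    t_stat, p_value = stats.ttest_rel(full, abl)
    return {"mean_gain": diff.mean(), "sem": stats.sem(diff), "t": t_stat, "p": p_value}

def grounding_vs_success(grounding_acc, success_rates):
    """Per-task correlation between stage-1 grounding accuracy and downstream success."""
    r, p = stats.pearsonr(grounding_acc, success_rates)
    return {"pearson_r": r, "p": p}
```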
Circularity Check
No circularity: empirical results on external benchmarks
full rationale
The paper's central claims consist of measured performance deltas (+14.6% SimplerEnv, +17% WidowX, +4.3% LIBERO) obtained by comparing the full two-stage model against an ablated variant on standard external robot benchmarks and real-world tests. These outcomes are not obtained by fitting parameters inside the model equations and then relabeling the fit as a prediction, nor do any derivations reduce to self-definitions or self-citation chains. The spatial-grounding stage uses embodiment-agnostic data whose outputs are plugged into the action stage, but the reported gains are falsifiable against held-out robot tasks and objects rather than being tautological with the training inputs.
Axiom & Free-Parameter Ledger
axioms (2)
- standard math: Standard transformer-based vision-language model training procedures and optimization assumptions hold.
- domain assumption: Spatial positions predicted in the pre-training stage can be used as effective prompts for action generation in the post-training stage.
Forward citations
Cited by 22 Pith papers
- From Imagined Futures to Executable Actions: Mixture of Latent Actions for Robot Manipulation
MoLA infers a mixture of latent actions from generated future videos via modality-aware inverse dynamics models to improve robot manipulation policies.
- LoopVLA: Learning Sufficiency in Recurrent Refinement for Vision-Language-Action Models
LoopVLA adds recurrent refinement and learned sufficiency estimation to VLA models, cutting parameters 45% and raising throughput 1.7x while matching baseline task success on LIBERO and VLA-Arena.
- OA-WAM: Object-Addressable World Action Model for Robust Robot Manipulation
OA-WAM uses persistent address vectors and dynamic content vectors in object slots to enable addressable world-action prediction, improving robustness on manipulation benchmarks under scene changes.
- Agentic World Modeling: Foundations, Capabilities, Laws, and Beyond
Proposes a levels x laws taxonomy for world models in AI agents, defining L1-L3 capabilities across physical, digital, social, and scientific regimes while reviewing over 400 works to outline a roadmap for advanced ag...
- HiPolicy: Hierarchical Multi-Frequency Action Chunking for Policy Learning
HiPolicy is a new hierarchical multi-frequency action chunking method for imitation learning that jointly generates coarse and fine action sequences with entropy-guided execution to improve performance and efficiency ...
- FrameSkip: Learning from Fewer but More Informative Frames in VLA Training
FrameSkip improves VLA policy training success from 66.50% to 76.15% by selecting high-importance frames and retaining only 20% of unique frames across three benchmarks.
- Guide, Think, Act: Interactive Embodied Reasoning in Vision-Language-Action Models
GTA-VLA conditions VLA models on user spatial priors to produce a unified spatial-visual chain-of-thought, reaching 81.2% success on SimplerEnv WidowX and improving performance under out-of-distribution shifts.
- Dynamic Skill Lifecycle Management for Agentic Reinforcement Learning
SLIM dynamically optimizes active external skills in agentic RL via leave-one-skill-out marginal contribution estimates and three lifecycle operations, outperforming baselines by 7.1% on ALFWorld and SearchQA while sh...
- ForgeVLA: Federated Vision-Language-Action Learning without Language Annotations
ForgeVLA enables federated VLA model training from unlabeled vision-action pairs by recovering language via embodied classifiers and using contrastive planning plus adaptive aggregation to avoid feature collapse.
- PRTS: A Primitive Reasoning and Tasking System via Contrastive Representations
PRTS pretrains VLA models with contrastive goal-conditioned RL to embed goal-reachability probabilities from offline data, yielding SOTA results on robotic benchmarks especially for long-horizon and novel instructions.
- Unmasking the Illusion of Embodied Reasoning in Vision-Language-Action Models
State-of-the-art vision-language-action models catastrophically fail dynamic embodied reasoning due to lexical-kinematic shortcuts, behavioral inertia, and semantic feature collapse caused by architectural bottlenecks...
- OFlow: Injecting Object-Aware Temporal Flow Matching for Robust Robotic Manipulation
OFlow unifies temporal foresight and object-aware reasoning inside a shared latent space via flow matching to improve VLA robustness in robotic manipulation under distribution shifts.
- VADF: Vision-Adaptive Diffusion Policy Framework for Efficient Robotic Manipulation
VADF adds an Adaptive Loss Network for hard-negative training sampling and a Hierarchical Vision Task Segmenter for adaptive noise scheduling during inference to speed convergence and reduce timeouts in diffusion robo...
- SIM1: Physics-Aligned Simulator as Zero-Shot Data Scaler in Deformable Worlds
SIM1 converts sparse real demonstrations into high-fidelity synthetic data through physics-aligned simulation, yielding policies that match real-data performance at a 1:15 ratio with 90% zero-shot success on deformabl...
- A1: A Fully Transparent Open-Source, Adaptive and Efficient Truncated Vision-Language-Action Model
A1 is a transparent VLA framework achieving state-of-the-art robot manipulation success with up to 72% lower latency via adaptive layer truncation and inter-layer flow matching.
- SABER: A Stealthy Agentic Black-Box Attack Framework for Vision-Language-Action Models
SABER uses a trained ReAct agent to produce bounded adversarial edits to robot instructions, cutting task success by 20.6% and increasing execution length and violations on the LIBERO benchmark across six VLA models.
- X-Imitator: Spatial-Aware Imitation Learning via Bidirectional Action-Pose Interaction
X-Imitator is a bidirectional action-pose interaction framework for spatial-aware imitation learning that outperforms vanilla policies and explicit pose guidance on 24 simulated and 3 real-world robotic tasks.
- Learning Action Manifold with Multi-view Latent Priors for Robotic Manipulation
The method uses multi-view diffusion priors and action manifold learning to resolve depth ambiguity and improve action prediction in VLA robotic manipulation models, reporting higher success rates than baselines on LI...
- Dual-Anchoring: Addressing State Drift in Vision-Language Navigation
Dual-Anchoring adds explicit progress tokens and retrospective landmark verification to VLN agents, cutting state drift and lifting success rate 15.2% overall with 24.7% gains on long trajectories.
- HiVLA: A Visual-Grounded-Centric Hierarchical Embodied Manipulation System
HiVLA decouples VLM-based semantic planning with visual grounding from a cascaded cross-attention DiT action expert, outperforming end-to-end VLAs on long-horizon and fine-grained manipulation.
- HiVLA: A Visual-Grounded-Centric Hierarchical Embodied Manipulation System
HiVLA decouples VLM-based semantic planning from DiT-based motor control via structured plans and cascaded cross-attention to outperform end-to-end VLA baselines in long-horizon and fine-grained manipulation.
- World Action Models: The Next Frontier in Embodied AI
The paper introduces World Action Models as a new paradigm unifying predictive world modeling with action generation in embodied foundation models and provides a taxonomy of existing approaches.