InternVLA-M1: A Spatially Guided Vision-Language-Action Framework for Generalist Robot Policy
Pith reviewed 2026-05-14 20:02 UTC · model grok-4.3
The pith
Spatially guided pre-training on millions of examples teaches robots where to act before how, yielding gains up to 17 percent on standard benchmarks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
InternVLA-M1 shows that a two-stage spatially guided vision-language-action pipeline, with spatial grounding pre-training on 2.3 million examples followed by spatially prompted action post-training, improves both spatial reasoning accuracy and embodiment-specific action success across simulation suites and real-world clustered manipulation tasks.
What carries the argument
The two-stage pipeline of spatial grounding pre-training that produces visual position prompts followed by spatially guided action post-training that consumes those prompts to generate robot actions.
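To make that division of labor concrete, here is a minimal sketch of the pipeline under the paper's description: a stage-1 grounder emits a box, point, or trace prompt, and a stage-2 action expert consumes it to produce an action. All class and method names here are illustrative placeholders, not the paper's actual interfaces.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

@dataclass
class SpatialPrompt:
    kind: str                            # "box", "point", or "trace"
    coords: List[Tuple[float, float]]    # normalized image coordinates

class SpatialGrounder:
    """Stage 1: spatial grounding model pre-trained on embodiment-agnostic data."""
    def predict(self, image, instruction: str) -> SpatialPrompt:
        # In the paper this role is played by a VLM fine-tuned on ~2.3M spatial
        # reasoning samples; here we simply return a placeholder point prompt.
        return SpatialPrompt(kind="point", coords=[(0.5, 0.5)])

class ActionExpert:
    """Stage 2: embodiment-aware action head conditioned on spatial prompts."""
    def act(self, image, instruction: str, prompt: Optional[SpatialPrompt]) -> List[float]:
        # Placeholder 7-DoF action (e.g. end-effector deltas plus gripper command).
        return [0.0] * 7

def spatially_guided_step(grounder: SpatialGrounder, expert: ActionExpert,
                          image, instruction: str) -> List[float]:
    prompt = grounder.predict(image, instruction)   # "where to act"
    return expert.act(image, instruction, prompt)   # "how to act"
```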
If this is right
- Outperforms the no-spatial-guidance baseline by 14.6 percent on SimplerEnv Google Robot, 17 percent on WidowX, and 4.3 percent on LIBERO Franka.
- Delivers an average 6.2 percent lift across 200 simulated tasks after adding 244K pick-and-place episodes.
- Raises real-world clustered pick-and-place success by 7.3 percent and by 20.6 percent on unseen objects when synthetic data is added.
- Improves performance by more than 10 percent in long-horizon, reasoning-heavy scenarios.
Where Pith is reading between the lines
- If the spatial prompts generalize across robot bodies, the same pre-training stage could shorten adaptation time when new hardware is introduced.
- The explicit separation of spatial localization from motor generation may apply to other embodied agents that must act on visual instructions.
- Large-scale spatial reasoning datasets collected independently of any robot body could become a reusable first step for training generalist controllers.
Load-bearing premise
That spatial grounding learned on embodiment-agnostic data transfers effectively when inserted as prompts into embodiment-specific action training.
What would settle it
An ablation that replaces the learned spatial prompts with random or absent positions during the second-stage action training and measures whether the reported success-rate gains on SimplerEnv, WidowX, and LIBERO disappear.
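One way to run that decisive ablation is sketched below: train the stage-2 policy three times with identical data and compute, swapping only the source of the spatial prompt. The training and evaluation hooks (train_stage2, evaluate, the benchmark handles) are hypothetical stand-ins, not the paper's code.

```python
import random

def prompt_for(grounder, image, instruction, arm: str):
    """Spatial prompt fed to stage-2 training, chosen per ablation arm."""
    if arm == "learned":
        return grounder.predict(image, instruction)           # the paper's setting
    if arm == "random":
        return {"kind": "point", "coords": [(random.random(), random.random())]}
    return None                                               # "absent": no prompt at all

def run_ablation(grounder, train_stage2, evaluate, episodes, benchmarks):
    """Train one stage-2 policy per arm and evaluate all arms on the same benchmarks."""
    results = {}
    for arm in ("learned", "random", "absent"):
        policy = train_stage2(
            episodes,
            prompt_fn=lambda img, txt, a=arm: prompt_for(grounder, img, txt, a),
        )
        results[arm] = {bench: evaluate(policy, bench) for bench in benchmarks}
    # If the spatial prompts are load-bearing, the "learned" arm should retain the
    # reported gains on SimplerEnv, WidowX, and LIBERO while the other arms lose them.
    return results
```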
read the original abstract
We introduce InternVLA-M1, a unified framework for spatial grounding and robot control that advances instruction-following robots toward scalable, general-purpose intelligence. Its core idea is spatially guided vision-language-action training, where spatial grounding serves as the critical link between instructions and robot actions. InternVLA-M1 employs a two-stage pipeline: (i) spatial grounding pre-training on over 2.3M spatial reasoning data to determine ``where to act'' by aligning instructions with visual, embodiment-agnostic positions, and (ii) spatially guided action post-training to decide ``how to act'' by generating embodiment-aware actions through plug-and-play spatial prompting. This spatially guided training recipe yields consistent gains: InternVLA-M1 outperforms its variant without spatial guidance by +14.6% on SimplerEnv Google Robot, +17% on WidowX, and +4.3% on LIBERO Franka, while demonstrating stronger spatial reasoning capability in box, point, and trace prediction. To further scale instruction following, we built a simulation engine to collect 244K generalizable pick-and-place episodes, enabling a 6.2% average improvement across 200 tasks and 3K+ objects. In real-world clustered pick-and-place, InternVLA-M1 improved by 7.3%, and with synthetic co-training, achieved +20.6% on unseen objects and novel configurations. Moreover, in long-horizon reasoning-intensive scenarios, it surpassed existing works by over 10%. These results highlight spatially guided training as a unifying principle for scalable and resilient generalist robots. Code and models are available at https://github.com/InternRobotics/InternVLA-M1.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces InternVLA-M1, a unified vision-language-action framework that employs a two-stage pipeline: (i) spatial grounding pre-training on over 2.3M embodiment-agnostic samples to align instructions with visual positions, and (ii) spatially guided action post-training that uses plug-and-play spatial prompts (box, point, trace) to generate embodiment-specific actions. It reports consistent gains over a no-spatial-guidance variant (+14.6% on SimplerEnv Google Robot, +17% on WidowX, +4.3% on LIBERO Franka), plus further improvements from a new 244K pick-and-place simulation dataset (6.2% average across 200 tasks), real-world clustered pick-and-place (+7.3%, +20.6% with synthetic co-training on unseen objects), and long-horizon scenarios (>10% over prior work).
Significance. If the reported gains are causally attributable to the learned spatial grounding and that grounding transfers effectively, the work supplies a concrete, scalable training recipe that decouples spatial reasoning from embodiment-specific control, potentially advancing generalist robot policies. The public release of code and models is a clear strength that enables direct reproduction and extension.
major comments (3)
- [Abstract / §4] Abstract and §4 (results): the central claim that spatial guidance produces the reported deltas (+14.6% SimplerEnv, +17% WidowX, +4.3% LIBERO) rests on comparison to a “variant without spatial guidance,” yet the manuscript provides no information on whether this baseline matches total training compute, data volume, prompt format, or number of epochs. Without these controls the causal contribution of stage-1 spatial grounding cannot be isolated.
- [Abstract / Methods] Abstract and methods: concrete percentage improvements are stated without accompanying details on baseline implementations, statistical tests (e.g., standard error or significance), data splits, or ablation controls. Full verification of whether the numbers support the spatially-guided-training thesis therefore requires the complete experimental section.
- [Abstract] Abstract: the transfer assumption—that embodiment-agnostic box/point/trace predictions from the 2.3M-sample stage-1 pre-training align with the spatial requirements of successful actions on the target robots—is stated but not supported by any quantitative alignment analysis or failure-case study.
minor comments (2)
- [Abstract] The abstract states “over 2.3M spatial reasoning data”; an exact count and brief breakdown of dataset sources would improve precision.
- Ensure all result tables and figures include explicit captions, axis labels, and legends so that spatial-prediction and action-success metrics are immediately interpretable.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed comments. We address each major point below, providing clarifications on the experimental controls already present in the full manuscript while committing to explicit revisions that strengthen the presentation of our results.
read point-by-point responses
Referee: [Abstract / §4] Abstract and §4 (results): the central claim that spatial guidance produces the reported deltas (+14.6% SimplerEnv, +17% WidowX, +4.3% LIBERO) rests on comparison to a “variant without spatial guidance,” yet the manuscript provides no information on whether this baseline matches total training compute, data volume, prompt format, or number of epochs. Without these controls the causal contribution of stage-1 spatial grounding cannot be isolated.
Authors: We appreciate the referee's emphasis on isolating the causal effect. The no-spatial-guidance variant was trained with identical total compute, data volume (2.3M pre-training samples plus the 244K post-training episodes), number of epochs, and prompt formatting as the full InternVLA-M1 model; the only difference is the omission of spatial prompts during stage-2 action post-training. These matched controls are described in §4.1 and the supplementary material. We will add an explicit statement of these controls to the abstract and include a summary table in the revised §4. revision: yes
Referee: [Abstract / Methods] Abstract and methods: concrete percentage improvements are stated without accompanying details on baseline implementations, statistical tests (e.g., standard error or significance), data splits, or ablation controls. Full verification of whether the numbers support the spatially-guided-training thesis therefore requires the complete experimental section.
Authors: The complete experimental section (§4) already specifies baseline implementations (RT-2, Octo, and internal ablations), training/evaluation data splits, and ablation studies on the spatial components. We will expand the abstract with cross-references to these sections and add standard error bars together with statistical significance tests (paired t-tests) to the results tables in the revision. revision: yes
Referee: [Abstract] Abstract: the transfer assumption—that embodiment-agnostic box/point/trace predictions from the 2.3M-sample stage-1 pre-training align with the spatial requirements of successful actions on the target robots—is stated but not supported by any quantitative alignment analysis or failure-case study.
Authors: The consistent cross-embodiment gains and the stage-1 spatial prediction accuracies reported in §3.2 provide indirect quantitative support for the transfer. We agree that a direct alignment analysis would further strengthen the claim. We will add a new subsection in the revised §4 that reports correlation metrics between stage-1 grounding accuracy and downstream success rates together with representative failure cases. revision: partial
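The analyses committed to in these responses could look roughly like the following: a paired t-test over per-task success rates for the significance claim, and a per-task correlation for the grounding-to-success transfer claim. This is a minimal sketch assuming such per-task numbers are available; the function names and data layout are illustrative, not taken from the paper.

```python
import numpy as np
from scipy import stats

def paired_gain(full_rates, ablated_rates):
    """Paired t-test over per-task success rates: full model vs. no-spatial-guidance variant."""
    full = np.asarray(full_rates, dtype=float)
    abl = np.asarray(ablated_rates, dtype=float)
    diff = full - abl
    t_stat, p_value = stats.ttest_rel(full, abl)
    return {"mean_gain": diff.mean(), "sem": stats.sem(diff), "t": t_stat, "p": p_value}

def grounding_vs_success(grounding_acc, success_rates):
    """Per-task correlation between stage-1 grounding accuracy and downstream success."""
    r, p = stats.pearsonr(grounding_acc, success_rates)
    return {"pearson_r": r, "p": p}
```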
Circularity Check
No circularity: empirical results on external benchmarks
full rationale
The paper's central claims consist of measured performance deltas (+14.6% SimplerEnv, +17% WidowX, +4.3% LIBERO) obtained by comparing the full two-stage model against an ablated variant on standard external robot benchmarks and real-world tests. These outcomes are not obtained by fitting parameters inside the model equations and then relabeling the fit as a prediction, nor do any derivations reduce to self-definitions or self-citation chains. The spatial-grounding stage uses embodiment-agnostic data whose outputs are plugged into the action stage, but the reported gains are falsifiable against held-out robot tasks and objects rather than being tautological with the training inputs.
Axiom & Free-Parameter Ledger
axioms (2)
- standard math: Standard transformer-based vision-language model training procedures and optimization assumptions hold.
- domain assumption: Spatial positions predicted in the pre-training stage can be used as effective prompts for action generation in the post-training stage.
Forward citations
Cited by 22 Pith papers
- From Imagined Futures to Executable Actions: Mixture of Latent Actions for Robot Manipulation
MoLA infers a mixture of latent actions from generated future videos via modality-aware inverse dynamics models to improve robot manipulation policies.
- LoopVLA: Learning Sufficiency in Recurrent Refinement for Vision-Language-Action Models
LoopVLA adds recurrent refinement and learned sufficiency estimation to VLA models, cutting parameters 45% and raising throughput 1.7x while matching baseline task success on LIBERO and VLA-Arena.
- OA-WAM: Object-Addressable World Action Model for Robust Robot Manipulation
OA-WAM uses persistent address vectors and dynamic content vectors in object slots to enable addressable world-action prediction, improving robustness on manipulation benchmarks under scene changes.
- Agentic World Modeling: Foundations, Capabilities, Laws, and Beyond
Proposes a levels x laws taxonomy for world models in AI agents, defining L1-L3 capabilities across physical, digital, social, and scientific regimes while reviewing over 400 works to outline a roadmap for advanced ag...
- HiPolicy: Hierarchical Multi-Frequency Action Chunking for Policy Learning
HiPolicy is a new hierarchical multi-frequency action chunking method for imitation learning that jointly generates coarse and fine action sequences with entropy-guided execution to improve performance and efficiency ...
- FrameSkip: Learning from Fewer but More Informative Frames in VLA Training
FrameSkip improves VLA policy training success from 66.50% to 76.15% by selecting high-importance frames and retaining only 20% of unique frames across three benchmarks.
- Guide, Think, Act: Interactive Embodied Reasoning in Vision-Language-Action Models
GTA-VLA conditions VLA models on user spatial priors to produce a unified spatial-visual chain-of-thought, reaching 81.2% success on SimplerEnv WidowX and improving performance under out-of-distribution shifts.
- Dynamic Skill Lifecycle Management for Agentic Reinforcement Learning
SLIM dynamically optimizes active external skills in agentic RL via leave-one-skill-out marginal contribution estimates and three lifecycle operations, outperforming baselines by 7.1% on ALFWorld and SearchQA while sh...
- ForgeVLA: Federated Vision-Language-Action Learning without Language Annotations
ForgeVLA enables federated VLA model training from unlabeled vision-action pairs by recovering language via embodied classifiers and using contrastive planning plus adaptive aggregation to avoid feature collapse.
- PRTS: A Primitive Reasoning and Tasking System via Contrastive Representations
PRTS pretrains VLA models with contrastive goal-conditioned RL to embed goal-reachability probabilities from offline data, yielding SOTA results on robotic benchmarks especially for long-horizon and novel instructions.
- Unmasking the Illusion of Embodied Reasoning in Vision-Language-Action Models
State-of-the-art vision-language-action models catastrophically fail dynamic embodied reasoning due to lexical-kinematic shortcuts, behavioral inertia, and semantic feature collapse caused by architectural bottlenecks...
- OFlow: Injecting Object-Aware Temporal Flow Matching for Robust Robotic Manipulation
OFlow unifies temporal foresight and object-aware reasoning inside a shared latent space via flow matching to improve VLA robustness in robotic manipulation under distribution shifts.
- VADF: Vision-Adaptive Diffusion Policy Framework for Efficient Robotic Manipulation
VADF adds an Adaptive Loss Network for hard-negative training sampling and a Hierarchical Vision Task Segmenter for adaptive noise scheduling during inference to speed convergence and reduce timeouts in diffusion robo...
- SIM1: Physics-Aligned Simulator as Zero-Shot Data Scaler in Deformable Worlds
SIM1 converts sparse real demonstrations into high-fidelity synthetic data through physics-aligned simulation, yielding policies that match real-data performance at a 1:15 ratio with 90% zero-shot success on deformabl...
- A1: A Fully Transparent Open-Source, Adaptive and Efficient Truncated Vision-Language-Action Model
A1 is a transparent VLA framework achieving state-of-the-art robot manipulation success with up to 72% lower latency via adaptive layer truncation and inter-layer flow matching.
- SABER: A Stealthy Agentic Black-Box Attack Framework for Vision-Language-Action Models
SABER uses a trained ReAct agent to produce bounded adversarial edits to robot instructions, cutting task success by 20.6% and increasing execution length and violations on the LIBERO benchmark across six VLA models.
- X-Imitator: Spatial-Aware Imitation Learning via Bidirectional Action-Pose Interaction
X-Imitator is a bidirectional action-pose interaction framework for spatial-aware imitation learning that outperforms vanilla policies and explicit pose guidance on 24 simulated and 3 real-world robotic tasks.
- Learning Action Manifold with Multi-view Latent Priors for Robotic Manipulation
The method uses multi-view diffusion priors and action manifold learning to resolve depth ambiguity and improve action prediction in VLA robotic manipulation models, reporting higher success rates than baselines on LI...
- Dual-Anchoring: Addressing State Drift in Vision-Language Navigation
Dual-Anchoring adds explicit progress tokens and retrospective landmark verification to VLN agents, cutting state drift and lifting success rate 15.2% overall with 24.7% gains on long trajectories.
- HiVLA: A Visual-Grounded-Centric Hierarchical Embodied Manipulation System
HiVLA decouples VLM-based semantic planning with visual grounding from a cascaded cross-attention DiT action expert, outperforming end-to-end VLAs on long-horizon and fine-grained manipulation.
- HiVLA: A Visual-Grounded-Centric Hierarchical Embodied Manipulation System
HiVLA decouples VLM-based semantic planning from DiT-based motor control via structured plans and cascaded cross-attention to outperform end-to-end VLA baselines in long-horizon and fine-grained manipulation.
- World Action Models: The Next Frontier in Embodied AI
The paper introduces World Action Models as a new paradigm unifying predictive world modeling with action generation in embodied foundation models and provides a taxonomy of existing approaches.