pith. machine review for the scientific record.

arxiv: 2510.03827 · v1 · submitted 2025-10-04 · 💻 cs.CV · cs.RO

Recognition: 3 Lean theorem links

LIBERO-PRO: Towards Robust and Fair Evaluation of Vision-Language-Action Models Beyond Memorization

Authors on Pith · no claims yet

Pith reviewed 2026-05-17 06:15 UTC · model grok-4.3

classification 💻 cs.CV cs.RO
keywords Vision-Language-Action models · LIBERO benchmark · generalization · memorization · robustness evaluation · task perturbations · robotics · imitation learning

The pith

Vision-Language-Action models achieve over 90 percent success on standard benchmarks yet drop to zero percent when objects, instructions, or environments are perturbed.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces LIBERO-PRO, an extended version of the LIBERO benchmark that adds systematic perturbations in four dimensions: the objects being manipulated, the initial states of the scene, the wording of the task instructions, and the overall environment layout. It reports that models scoring above 90 percent on the original LIBERO evaluation fall to 0.0 percent success under these changes. The drop occurs because the models continue to output the same memorized action sequences even when the target object is replaced by an irrelevant item or the instruction is corrupted. A sympathetic reader would care because the result shows that current evaluation practices can certify models as capable when they have only learned to replay training data.

Core claim

Although existing models achieve over 90% accuracy under the standard LIBERO evaluation, their performance collapses to 0.0% under our generalized setting. This discrepancy exposes the models' reliance on rote memorization of action sequences and environment layouts from the training set, rather than genuine task understanding or environmental perception.

What carries the argument

LIBERO-PRO, the extended benchmark that applies controlled perturbations across four dimensions (manipulated objects, initial states, task instructions, and environments) to distinguish memorization from comprehension.
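The benchmark's logic can be pictured as a simple evaluation harness that measures the gap between standard and perturbed success rates. This is an illustrative sketch only, not the authors' released code; `policy`, `make_task`, and the perturbation names are hypothetical stand-ins.

```python
# Illustrative sketch of a LIBERO-PRO-style evaluation loop.
# `policy` and `make_task` are hypothetical stand-ins, not the released API.

PERTURBATIONS = ["none", "object", "initial_state", "instruction", "environment"]

def success_rate(policy, make_task, perturbation, n_episodes=50):
    """Fraction of episodes the policy solves under one perturbation type."""
    successes = 0
    for seed in range(n_episodes):
        task = make_task(seed=seed, perturbation=perturbation)
        successes += int(policy.rollout(task))
    return successes / n_episodes

def evaluation_gap(policy, make_task):
    """Standard-vs-worst-perturbed gap; a large gap suggests memorization."""
    standard = success_rate(policy, make_task, "none")
    perturbed = {p: success_rate(policy, make_task, p) for p in PERTURBATIONS[1:]}
    return standard - min(perturbed.values()), perturbed
```

Under this framing, the paper's headline result is a gap of roughly 0.9: above 90 percent on the unperturbed setting and 0.0 percent under the worst perturbation.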

If this is right

  • Models keep executing grasping actions even after the target object has been replaced with an irrelevant item.
  • Model outputs remain unchanged when given corrupted or messy task instructions.
  • Standard LIBERO-style evaluations produce inflated accuracy numbers that do not reflect real task understanding.
  • Future development should prioritize generalization checks instead of single-environment memorization tests.
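The second bullet, that outputs remain unchanged under corrupted instructions, suggests a direct probe: compare a policy's action sequence under a clean instruction with its sequence under a corrupted one. A minimal sketch, assuming actions are lists of per-step numeric vectors; the tolerance is an illustrative assumption, not a value from the paper.

```python
# Hypothetical instruction-invariance probe: if a policy's actions are
# (near-)identical under a corrupted instruction, it is likely ignoring
# the language input entirely.

def max_action_deviation(actions_clean, actions_corrupted):
    """Largest per-step absolute difference between two action sequences."""
    assert len(actions_clean) == len(actions_corrupted)
    return max(
        max(abs(a - b) for a, b in zip(step_a, step_b))
        for step_a, step_b in zip(actions_clean, actions_corrupted)
    )

def ignores_instruction(actions_clean, actions_corrupted, tol=1e-3):
    """Flag a policy whose output barely changes when the instruction changes."""
    return max_action_deviation(actions_clean, actions_corrupted) < tol
```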

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar hidden memorization problems may exist in other robotics and imitation-learning benchmarks that use fixed layouts and instructions.
  • Adding controlled variations during training could reduce the gap between standard and perturbed performance.
  • Real-world robot deployment often encounters unexpected object changes or phrasing differences, so the same failure mode could appear outside the lab.

Load-bearing premise

The specific changes made to objects, starting conditions, instructions, and environments constitute fair tests of generalization rather than introducing unrelated difficulties no model could handle.

What would settle it

A model that continues to achieve high success rates on the perturbed LIBERO-PRO tasks while still succeeding on the original benchmark would show that the observed collapse is not caused by reliance on memorization.

read the original abstract

LIBERO has emerged as a widely adopted benchmark for evaluating Vision-Language-Action (VLA) models; however, its current training and evaluation settings are problematic, often leading to inflated performance estimates and preventing fair model comparison. To address these issues, we introduce LIBERO-PRO, an extended LIBERO benchmark that systematically evaluates model performance under reasonable perturbations across four dimensions: manipulated objects, initial states, task instructions, and environments. Experimental results reveal that, although existing models achieve over 90% accuracy under the standard LIBERO evaluation, their performance collapses to 0.0% under our generalized setting. Crucially, this discrepancy exposes the models' reliance on rote memorization of action sequences and environment layouts from the training set, rather than genuine task understanding or environmental perception. For instance, models persist in executing grasping actions when the target object is replaced with irrelevant items, and their outputs remain unchanged even when given corrupted instructions or even messy tokens. These findings expose the severe flaws in current evaluation practices, and we call on the community to abandon misleading methodologies in favor of robust assessments of model generalization and comprehension. Our code is available at: https://github.com/Zxy-MLlab/LIBERO-PRO.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces LIBERO-PRO, an extension of the LIBERO benchmark for Vision-Language-Action (VLA) models. It applies systematic perturbations across four dimensions—manipulated objects, initial states, task instructions, and environments—and reports that models achieving over 90% success on standard LIBERO evaluations drop to 0.0% under the perturbed conditions. The authors interpret this performance collapse as evidence that current models rely on rote memorization of training sequences and layouts rather than genuine task understanding or environmental perception, with examples including persistent grasping actions on replaced objects and unchanged outputs under corrupted instructions. The work provides code for the new benchmark and calls for abandoning current evaluation practices in favor of more robust assessments.

Significance. If the perturbed tasks can be shown to remain solvable by agents with genuine perception and comprehension, this benchmark would be a valuable contribution to VLA evaluation by exposing overestimation in existing protocols and providing a concrete tool for testing generalization. The release of code supports reproducibility. However, the significance is currently limited by the lack of verification that the perturbations preserve task solvability, which directly affects whether the results support the memorization interpretation over alternative explanations such as introduced perceptual or grounding difficulties.

major comments (2)
  1. [Abstract and experimental results] The claim that 0.0% success under the four perturbation types demonstrates reliance on memorization (rather than understanding) is load-bearing and requires that the perturbed tasks remain solvable by any policy that correctly perceives the scene and parses the goal. No oracle, human baseline, or other evidence is provided confirming solvability after object replacement with irrelevant items, initial-state changes, instruction corruption, or environment swaps. This leaves open the possibility that the perturbations introduce unrelated difficulties (e.g., broken grasp affordances or language grounding failures) that would affect even non-memorizing agents.
  2. [Perturbation methodology] The manuscript does not provide sufficient detail on how the specific perturbations are generated and applied (e.g., criteria for selecting 'irrelevant items' or constructing 'messy tokens'). Without these specifications, it is difficult to assess whether the tests are fair probes of generalization or inadvertently create unsolvable variants, which is central to interpreting the 0.0% results.
minor comments (1)
  1. [Abstract] The statement 'over 90% accuracy' would benefit from specifying the exact models evaluated and their individual scores to allow readers to assess the baseline performance more precisely.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback on our manuscript. We address each major comment below, providing clarifications and committing to revisions that strengthen the interpretability of LIBERO-PRO without altering the core findings.

read point-by-point responses
  1. Referee: [Abstract and experimental results] The claim that 0.0% success under the four perturbation types demonstrates reliance on memorization (rather than understanding) is load-bearing and requires that the perturbed tasks remain solvable by any policy that correctly perceives the scene and parses the goal. No oracle, human baseline, or other evidence is provided confirming solvability after object replacement with irrelevant items, initial-state changes, instruction corruption, or environment swaps. This leaves open the possibility that the perturbations introduce unrelated difficulties (e.g., broken grasp affordances or language grounding failures) that would affect even non-memorizing agents.

    Authors: We agree that explicit verification of solvability would make the memorization interpretation more robust and rule out alternative explanations such as introduced perceptual difficulties. The observed failure modes (e.g., persistent grasping at original locations despite object replacement, or invariant outputs under instruction corruption) are consistent with rote memorization of training sequences rather than scene understanding. Nevertheless, to directly address the concern, we will add a human baseline evaluation in the revised manuscript: participants will be shown the perturbed scenes and instructions and asked to complete the tasks, confirming that the variants remain solvable when genuine perception and comprehension are applied. We will also include qualitative examples demonstrating that core affordances (graspable objects, reachable states) are preserved. This addition will be incorporated into the experimental results section. revision: yes

  2. Referee: [Perturbation methodology] The manuscript does not provide sufficient detail on how the specific perturbations are generated and applied (e.g., criteria for selecting 'irrelevant items' or constructing 'messy tokens'). Without these specifications, it is difficult to assess whether the tests are fair probes of generalization or inadvertently create unsolvable variants, which is central to interpreting the 0.0% results.

    Authors: We acknowledge that additional methodological detail would improve transparency and allow readers to better evaluate the fairness of the perturbations. The current manuscript describes the four dimensions at a high level, but we will expand the methodology section in the revision to specify the generation process. For object replacement, 'irrelevant items' are drawn from the LIBERO object vocabulary excluding task-relevant objects, with selection prioritizing similar physical properties (size, shape, mass) to preserve graspability. For instruction perturbations, 'messy tokens' are generated via controlled random token substitution or insertion from a fixed vocabulary while preserving overall sentence length and basic syntax. We will also add pseudocode and point to specific functions in the released repository that implement these steps. These clarifications will help confirm that the perturbations test generalization rather than creating fundamentally unsolvable tasks. revision: yes
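The instruction-corruption procedure the rebuttal describes can be sketched as follows. This is a minimal illustration of random token substitution from a fixed vocabulary while preserving sentence length; the noise vocabulary and substitution rate are assumptions, not values from the paper or its repository.

```python
import random

# Minimal sketch of the 'messy token' corruption described in the rebuttal:
# random token substitution from a fixed vocabulary, preserving sentence
# length. NOISE_VOCAB and the rate of 0.3 are illustrative assumptions.

NOISE_VOCAB = ["zork", "blix", "qune", "vamp", "trell"]

def corrupt_instruction(instruction, rate=0.3, rng=None):
    """Replace a fraction of tokens with noise tokens, keeping length fixed."""
    rng = rng or random.Random(0)
    tokens = instruction.split()
    n_swap = max(1, int(len(tokens) * rate))
    for idx in rng.sample(range(len(tokens)), n_swap):
        tokens[idx] = rng.choice(NOISE_VOCAB)
    return " ".join(tokens)
```

Because the substitution is length-preserving and the noise tokens are disjoint from the task vocabulary, a policy with genuine language grounding should at minimum behave differently on the corrupted string.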

Circularity Check

0 steps flagged

No circularity: empirical benchmark study with independent test conditions

full rationale

The manuscript is an empirical evaluation paper that defines four perturbation dimensions (object replacement, initial-state change, instruction corruption, environment swap) and reports measured success rates on existing VLA models, dropping from >90% on standard LIBERO to 0.0% on the perturbed sets. No equations, fitted parameters, or first-principles derivations are present; the performance numbers are direct experimental outputs rather than predictions derived from prior fits or self-referential definitions. The interpretation that the drop indicates memorization is an external claim about the results, not a step that reduces to the inputs by construction. The work is therefore self-contained against the external LIBERO benchmark and external model checkpoints, with no load-bearing self-citation chains or ansatzes imported from prior author work.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim depends on the domain assumption that the introduced perturbations validly probe understanding rather than creating unsolvable variants.

axioms (1)
  • domain assumption Perturbations to objects, initial states, task instructions, and environments serve as valid probes of genuine task understanding and environmental perception.
    This assumption is required to interpret the observed performance collapse as evidence of memorization rather than task alteration.

pith-pipeline@v0.9.0 · 5539 in / 1154 out tokens · 55040 ms · 2026-05-17T06:15:03.919808+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • Cost.JcostCore Jcost_pos_of_ne_one · echoes

    ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

    models persist in executing grasping actions when the target object is replaced with irrelevant items, and their outputs remain unchanged even when given corrupted instructions or even messy tokens

  • Foundation.LawOfExistence defect_zero_iff_one · unclear

    Relation between the paper passage and the cited Recognition theorem.

    high scores primarily reflect rote memorization of training data rather than genuine task understanding or execution ability

  • Foundation.DimensionForcing dimension_forced · unclear

    Relation between the paper passage and the cited Recognition theorem.

    LIBERO-PRO, an extended LIBERO benchmark that systematically evaluates model performance under reasonable perturbations across four dimensions

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 19 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. RoboLab: A High-Fidelity Simulation Benchmark for Analysis of Task Generalist Policies

    cs.RO 2026-04 unverdicted novelty 8.0

    RoboLab is a new simulation benchmark with 120 tasks across visual, procedural, and relational axes that quantifies generalization gaps and perturbation sensitivity in task-generalist robotic policies.

  2. OA-WAM: Object-Addressable World Action Model for Robust Robot Manipulation

    cs.RO 2026-05 unverdicted novelty 7.0

    OA-WAM uses persistent address vectors and dynamic content vectors in object slots to enable addressable world-action prediction, improving robustness on manipulation benchmarks under scene changes.

  3. Atomic-Probe Governance for Skill Updates in Compositional Robot Policies

    cs.RO 2026-04 unverdicted novelty 7.0

    A cross-version swap protocol reveals dominant skills that swing composition success by up to 50 percentage points, and an atomic probe with selective revalidation governs updates at lower cost than always re-testing ...

  4. VP-VLA: Visual Prompting as an Interface for Vision-Language-Action Models

    cs.RO 2026-03 unverdicted novelty 7.0

    VP-VLA decouples high-level reasoning from low-level control in VLA models by rendering spatial anchors as visual prompts directly in the RGB observation space, outperforming end-to-end baselines.

  5. TMRL: Diffusion Timestep-Modulated Pretraining Enables Exploration for Efficient Policy Finetuning

    cs.RO 2026-05 unverdicted novelty 6.0

    TMRL bridges behavioral cloning pretraining and RL finetuning via diffusion noise and timestep modulation to enable controlled exploration, improving sample efficiency and enabling real-world robot training in under one hour.

  6. Atomic-Probe Governance for Skill Updates in Compositional Robot Policies

    cs.RO 2026-04 unverdicted novelty 6.0

    Empirical study on robosuite tasks reveals a dominant-skill effect in compositions and shows that an atomic probe approximates full revalidation for skill updates at much lower cost.

  7. Breaking Lock-In: Preserving Steerability under Low-Data VLA Post-Training

    cs.RO 2026-04 unverdicted novelty 6.0

    DeLock mitigates lock-in in low-data VLA post-training via visual grounding preservation and test-time contrastive prompt guidance, outperforming baselines across eight evaluations while matching data-heavy generalist...

  8. Unmasking the Illusion of Embodied Reasoning in Vision-Language-Action Models

    cs.RO 2026-04 unverdicted novelty 6.0

    State-of-the-art vision-language-action models catastrophically fail dynamic embodied reasoning due to lexical-kinematic shortcuts, behavioral inertia, and semantic feature collapse caused by architectural bottlenecks...

  9. Grounded World Model for Semantically Generalizable Planning

    cs.RO 2026-04 conditional novelty 6.0

    A vision-language-aligned world model turns visuomotor MPC into a language-following planner that reaches 87% success on 288 unseen semantic tasks where standard VLAs drop to 22%.

  10. RoboLab: A High-Fidelity Simulation Benchmark for Analysis of Task Generalist Policies

    cs.RO 2026-04 unverdicted novelty 6.0

    RoboLab is a photorealistic simulation benchmark with 120 tasks and perturbation analysis to evaluate true generalization and robustness of robotic foundation models.

  11. Uncovering Linguistic Fragility in Vision-Language-Action Models via Diversity-Aware Red Teaming

    cs.RO 2026-04 unverdicted novelty 6.0

    DAERT generates diverse adversarial instructions via a uniform policy in RL to drop VLA task success rates from 93.33% to 5.85% on benchmarks with models like π0 and OpenVLA.

  12. Adaptive Action Chunking at Inference-time for Vision-Language-Action Models

    cs.RO 2026-04 unverdicted novelty 6.0

    Adaptive Action Chunking uses action entropy to dynamically adjust chunk sizes in VLA models, improving performance on simulated and real robotic manipulation tasks.

  13. ROBOGATE: Adaptive Failure Discovery for Safe Robot Policy Deployment via Two-Stage Boundary-Focused Sampling

    cs.RO 2026-03 unverdicted novelty 6.0

    ROBOGATE applies adaptive boundary-focused sampling in simulation to discover robot policy failure boundaries, revealing a 97.65 percentage point performance gap for a VLA model between LIBERO and industrial scenarios.

  14. Steerable Vision-Language-Action Policies for Embodied Reasoning and Hierarchical Control

    cs.RO 2026-02 unverdicted novelty 6.0

    Steerable VLAs trained on rich synthetic commands at subtask, motion, and pixel levels enable VLMs to steer robot behavior more effectively, outperforming prior hierarchical baselines on real-world manipulation and ge...

  15. The Scaling Law of Evaluation Failure: Why Simple Averaging Collapses Under Data Sparsity and Item Difficulty Gaps, and How Item Response Theory Recovers Ground Truth Across Domains

    cs.LG 2026-05 unverdicted novelty 5.0

    Simple averaging of evaluation scores degrades in rank correlation with ground truth under data sparsity and difficulty variation, while a two-parameter logistic Item Response Theory model maintains high correlation a...

  16. vla-eval: A Unified Evaluation Harness for Vision-Language-Action Models

    cs.AI 2026-03 accept novelty 5.0

    vla-eval decouples VLA model inference from benchmark execution via WebSocket and Docker, supporting 14 benchmarks with up to 47x speedup and reproducing published scores across six codebases.

  17. World Action Models: The Next Frontier in Embodied AI

    cs.RO 2026-05 unverdicted novelty 4.0

    The paper introduces World Action Models as a new paradigm unifying predictive world modeling with action generation in embodied foundation models and provides a taxonomy of existing approaches.

  18. Vision-Language-Action Safety: Threats, Challenges, Evaluations, and Mechanisms

    cs.RO 2026-04 accept novelty 4.0

    A literature survey that unifies fragmented work on attacks, defenses, evaluations, and deployment challenges for Vision-Language-Action models in robotics.

  19. OmniVLA-RL: A Vision-Language-Action Model with Spatial Understanding and Online RL

    cs.RO 2026-04 unverdicted novelty 4.0

    OmniVLA-RL uses a mix-of-transformers architecture and flow-matching reformulated as SDE with group segmented policy optimization to surpass prior VLA models on LIBERO benchmarks.

Reference graph

Works this paper leans on

25 extracted references · 25 canonical work pages · cited by 17 Pith papers · 13 internal anchors
