RoVLA: Multi-Consistency Constraints for Robust Vision-Language-Action Models

Jingzhou Luo; Liang Lin; Xinshuai Song; Yang Liu; Yifan Wen; Yongjie Bai

arXiv: 2605.19678 · v1 · pith:NUYFJ3J4new · submitted 2026-05-19 · cs.RO

Jingzhou Luo, Yifan Wen, Yongjie Bai, Xinshuai Song, Yang Liu, Liang Lin

Pith reviewed 2026-05-20 04:44 UTC · model grok-4.3

classification: cs.RO

keywords: vision-language-action models, robustness, consistency constraints, embodied manipulation, instruction semantics, observation perturbation, trajectory evolution, generalization

The pith

Enforcing consistency under instruction rewrites, trajectory steps, and observation disturbances lets vision-language-action models generalize better to task and visual shifts.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that vision-language-action models often depend on shallow patterns in training data and therefore break when instructions are rephrased, when the robot advances through a task, or when camera images and joint readings change slightly. To fix this, RoVLA adds three consistency terms to the training loss so that the same action is predicted under each of these transformations. Instructional consistency keeps outputs stable for semantically equivalent language commands. Evolutionary consistency keeps action intent coherent as the robot moves forward in time. Observational consistency keeps predictions unchanged after small visual or proprioceptive disturbances. If these terms succeed, the model should rely on stable couplings between semantics, states, and actions rather than on training-set accidents, producing stronger results on both simulated benchmarks and physical robots.

Core claim

RoVLA incorporates three complementary consistency constraints into end-to-end vision-language-action policy training. Instructional Consistency requires the model to output identical actions for semantically equivalent instruction rewrites. Evolutionary Consistency requires coherent action predictions across successive steps of a trajectory. Observational Consistency requires unchanged predictions before and after targeted visual and proprioceptive perturbations. By minimizing violations of these invariances, the training process reduces dependence on superficial correlations present in the data distribution and yields policies that remain effective under task and observation shifts.
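As a rough illustration of what such a constraint penalizes, the sketch below measures how far a policy's action prediction moves when its input is transformed. It is not the paper's formulation: the loss equations are not reproduced on this page, and `mse`, `consistency_penalty`, and `toy_policy` are hypothetical names chosen for the example. The mean-squared-error distance is likewise an assumption.

```python
from typing import Callable, Sequence

Vector = Sequence[float]


def mse(a: Vector, b: Vector) -> float:
    """Mean squared difference between two equal-length, non-empty action vectors."""
    if len(a) != len(b) or not a:
        raise ValueError("action vectors must be non-empty and of equal length")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def consistency_penalty(
    policy: Callable[[Vector], Vector],
    original: Vector,
    transformed: Vector,
) -> float:
    """Penalty that is zero when the policy predicts the same action for both inputs.

    Assumption: the distance is MSE; the paper may use a different measure.
    """
    return mse(policy(original), policy(transformed))


def toy_policy(observation: Vector) -> Vector:
    """Hypothetical stand-in for a VLA policy: a fixed map from features to actions."""
    return [2.0 * observation[0], observation[1] + 1.0]


clean = [1.0, 2.0]
perturbed = [1.5, 2.0]  # small disturbance on the first feature only
penalty = consistency_penalty(toy_policy, clean, perturbed)
```

A penalty of zero means the prediction is invariant to the transformation. Adding such a term to the task loss would push training toward that invariance.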

What carries the argument

Multi-consistency constraints (Instructional Consistency, Evolutionary Consistency, and Observational Consistency) that penalize changes in action predictions under semantically equivalent, temporally progressive, and sensor-perturbed inputs.

If this is right

Policies trained with the three consistency terms outperform standard baselines on LIBERO-Plus and RoboTwin 2.0 benchmarks.
The same policies maintain higher success rates when task descriptions or visual conditions differ from training.
Real-world manipulation experiments show improved reliability under the same shifts.
The model relies less on spurious correlations and more on stable semantic-state-action relationships.
No additional large-scale pretraining or post-hoc adaptation is required to obtain the robustness gains.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same consistency approach could be transferred to other embodied sequence tasks such as navigation or multi-step assembly.
Combining the constraints with existing large-scale vision-language pretraining might produce even stronger zero-shot behavior.
Explicit invariance modeling offers a data-efficient route to robustness that does not require ever-larger training corpora.
One could measure whether the constraints also reduce sensitivity to changes in robot morphology or gripper type.

Load-bearing premise

The chosen transformations are assumed to represent the distribution shifts that matter in real deployment without creating new failure modes or over-constraining the policy.

What would settle it

A controlled test in which a RoVLA-trained model is evaluated on paraphrased instructions and perturbed observations that were never used as consistency examples during training; if performance drops to the level of ordinary baselines, the claimed robustness benefit does not hold.
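The comparison described above reduces to a difference of success rates on held-out shifts. A minimal sketch, assuming per-episode success flags are available for both policies; the function names and the `margin` parameter are hypothetical, and the outcome lists are made-up placeholders rather than results from the paper:

```python
from typing import Sequence


def success_rate(outcomes: Sequence[bool]) -> float:
    """Fraction of episodes that succeeded."""
    if not outcomes:
        raise ValueError("need at least one episode")
    return sum(1 for ok in outcomes if ok) / len(outcomes)


def benefit_holds(
    rovla_held_out: Sequence[bool],
    baseline_held_out: Sequence[bool],
    margin: float = 0.0,
) -> bool:
    """True if the consistency-trained policy beats the baseline by more than
    `margin` on paraphrases and perturbations never used as consistency examples."""
    gap = success_rate(rovla_held_out) - success_rate(baseline_held_out)
    return gap > margin


# Placeholder outcomes for illustration only; not data from the paper.
rovla_runs = [True, True, False, True]
baseline_runs = [True, False, False, False]
print(benefit_holds(rovla_runs, baseline_runs, margin=0.1))
```

A real evaluation would need many more episodes per condition and a significance test before drawing a conclusion either way.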

Figures

Figures reproduced from arXiv: 2605.19678 by Jingzhou Luo, Liang Lin, Xinshuai Song, Yang Liu, Yifan Wen, Yongjie Bai.

**Figure 2.** Overview of RoVLA. (a) RoVLA adopts a dual-system backbone with high-level semantic extraction and low-level ...

**Figure 3.** Visualization of the real-world evaluation tasks. We ...

**Figure 4.** Qualitative examples on LIBERO-Plus. Under vari...

**Figure 5.** Qualitative examples on RoboTwin 2.0. Representative rollout snapshots, mainly under the Randomized environment, ...

**Figure 6.** Qualitative examples on the real-world evaluation tasks visualized from the wrist-mounted camera view. Represen...

Original abstract

Vision-Language-Action (VLA) models have shown strong performance on embodied manipulation, yet they remain brittle under visual observation changes, paraphrased language instructions, and compounded perturbations. This limitation suggests that existing methods still rely heavily on shallow correlations in the training distribution, rather than learning stable couplings among task semantics, environment states, and action generation. Although recent efforts improve robustness through larger-scale training, post-training adaptation, or enhanced predictive modeling, they rarely enforce invariance-oriented consistency within the end-to-end policy itself. To address this issue, we propose RoVLA, a robust vision-language-action framework with multi-consistency constraints. RoVLA enforces consistency under three complementary transformations: instruction semantics, trajectory evolution, and observation perturbation. Specifically, Instructional Consistency (IC) promotes stable grounding under semantically equivalent instruction rewrites, Evolutionary Consistency (EC) preserves coherent action intent throughout the generation process, and Observational Consistency (OC) improves robustness to visual and proprioceptive perturbations by enforcing consistent predictions before and after targeted disturbances. By explicitly modeling these invariances during training, RoVLA reduces reliance on superficial correlations and improves robustness and generalization. Experiments on LIBERO-Plus, RoboTwin 2.0, and real-world manipulation tasks show that RoVLA consistently outperforms strong baseline methods and exhibits superior robustness under diverse task and observation shifts. These results demonstrate the effectiveness of multi-consistency learning for robust embodied control. Codes will be available at https://github.com/HCPLab-SYSU/RoVLA.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript introduces RoVLA, a vision-language-action model that applies multi-consistency constraints during training: Instructional Consistency (IC) under semantically equivalent instruction rewrites, Evolutionary Consistency (EC) across trajectory steps to preserve action intent, and Observational Consistency (OC) under targeted visual and proprioceptive perturbations. The central claim is that explicitly enforcing these invariances reduces reliance on superficial correlations in the training distribution, yielding improved robustness and generalization. Experiments on LIBERO-Plus, RoboTwin 2.0, and real-world manipulation tasks are reported to show consistent outperformance over strong baselines under task and observation shifts.

Significance. If the experimental results hold after isolating the contribution of the consistency terms, the work could meaningfully advance robust embodied control by providing an end-to-end mechanism for learning stable couplings among semantics, states, and actions. The complementary nature of the three consistency types and the planned code release are positive features that support reproducibility and further investigation.

major comments (1)

[Experiments] Experiments section: the manuscript must include a control experiment that trains a baseline on the identical set of augmented data (semantic rewrites, trajectory steps, and disturbances) using only the standard supervised loss, without the IC/EC/OC consistency terms. Without this ablation, it remains unclear whether the reported robustness gains under LIBERO-Plus and RoboTwin 2.0 shifts arise from the proposed multi-consistency mechanism or simply from the stronger supervision signal provided by the transformed pairs. The control would directly address the concern that the consistency losses may be redundant with the augmentations themselves.

minor comments (2)

[Abstract] Abstract: quantitative metrics, baseline names, ablation summaries, and statistical tests are absent, making it difficult to assess the magnitude and reliability of the claimed outperformance.
[Method] The description of the three transformations should clarify whether they are applied only at training time or also at test time, and how the consistency losses are balanced with the primary task loss (e.g., via coefficients or scheduling).

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback and detailed review of our manuscript. We have carefully considered the major comment on the Experiments section and agree that the requested control experiment will strengthen the paper by better isolating the contribution of the multi-consistency constraints.

Referee: [Experiments] Experiments section: the manuscript must include a control experiment that trains a baseline on the identical set of augmented data (semantic rewrites, trajectory steps, and disturbances) using only the standard supervised loss, without the IC/EC/OC consistency terms. Without this ablation, it remains unclear whether the reported robustness gains under LIBERO-Plus and RoboTwin 2.0 shifts arise from the proposed multi-consistency mechanism or simply from the stronger supervision signal provided by the transformed pairs. The control would directly address the concern that the consistency losses may be redundant with the augmentations themselves.

Authors: We agree that this control experiment is essential to rule out the possibility that robustness gains arise merely from the augmented data rather than the consistency losses themselves. In the revised manuscript, we will add results from training the baseline model on the identical augmented dataset (semantic rewrites, trajectory steps, and disturbances) but using only the standard supervised loss without the IC, EC, or OC terms. These results will be reported on LIBERO-Plus and RoboTwin 2.0 under the same task and observation shifts, with direct comparisons to the full RoVLA model to demonstrate the specific benefit of the multi-consistency mechanism.

Revision: yes
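The promised control amounts to holding the data pipeline fixed and switching off only the consistency terms. A minimal sketch of that pairing, assuming a hypothetical `TrainConfig` record; the field names and weight values are illustrative and do not come from the paper or its code release:

```python
from dataclasses import dataclass, field, replace
from typing import Dict


@dataclass(frozen=True)
class TrainConfig:
    """Hypothetical training configuration used only for this illustration."""
    use_augmented_data: bool
    consistency_weights: Dict[str, float] = field(default_factory=dict)


def control_for(full: TrainConfig) -> TrainConfig:
    """Same data and settings as `full`, with every consistency weight set to zero."""
    zeroed = {name: 0.0 for name in full.consistency_weights}
    return replace(full, consistency_weights=zeroed)


full_run = TrainConfig(
    use_augmented_data=True,
    consistency_weights={"ec": 1.0, "oc": 1.0},  # illustrative values
)
control_run = control_for(full_run)
```

Because the two configurations differ only in the consistency weights, any gap between the resulting policies can be attributed to the consistency terms rather than to the augmented data.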

Circularity Check

0 steps flagged

No circularity: consistency losses and evaluation metrics remain independent

The paper defines IC, EC, and OC as auxiliary consistency losses applied to transformed inputs (semantically equivalent instructions, trajectory steps, and perturbed observations) during training. These losses are not mathematically equivalent to the reported success metrics, which are measured on held-out tasks and distribution shifts in LIBERO-Plus, RoboTwin 2.0, and real-world settings. No equations reduce the robustness claims to the training objectives by construction, no self-citations serve as load-bearing uniqueness theorems, and no fitted parameters are relabeled as predictions. The derivation from multi-consistency training to empirical gains is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

Abstract supplies no explicit loss equations or hyperparameter values; the ledger is therefore populated from the high-level description of the three consistency mechanisms.

free parameters (1)

Consistency loss coefficients
Weights balancing instructional, evolutionary, and observational losses against the primary imitation or reinforcement objective; these must be chosen or tuned.
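One plausible way these coefficients enter training is a weighted sum, sketched below. This is an assumption rather than the paper's stated objective: the abstract gives no loss equations, and the names `total_loss`, `ec`, and `oc` and all numeric values are illustrative. Only two terms are weighted here because the appendix excerpt quoted on this page states that IC introduces no additional explicit loss.

```python
from typing import Mapping


def total_loss(
    task_loss: float,
    consistency_losses: Mapping[str, float],
    weights: Mapping[str, float],
) -> float:
    """Primary objective plus a weighted sum of consistency terms.

    Every term in `consistency_losses` must have a matching entry in `weights`.
    """
    missing = set(consistency_losses) - set(weights)
    if missing:
        raise KeyError(f"no weight given for: {sorted(missing)}")
    return task_loss + sum(
        weights[name] * value for name, value in consistency_losses.items()
    )


# Illustrative values only.
loss = total_loss(
    task_loss=1.0,
    consistency_losses={"ec": 0.5, "oc": 0.25},
    weights={"ec": 2.0, "oc": 4.0},
)
```

Setting every weight to zero recovers the plain task loss, which is why these coefficients count as free parameters that must be chosen or tuned.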

axioms (1)

Domain assumption: enforcing prediction invariance under the three defined transformations improves robustness to real-world distribution shifts.
This premise underpins the entire training procedure and is not derived from first principles within the paper.

A. More Implementation Details of Instructional Consistency

We provide more implementation details for the instruction rewriting process employed by "Instructional Consistency" (IC). IC does not introduce an additional explicit loss. Instead, it expands each sin...

[1] [1]

Robot manip- ulation based on embodied visual perception: A survey,

S. Wang, M. N. Nikolić, T. L. Lam, Q. Gao, R. Ding, and T. Zhang, “Robot manip- ulation based on embodied visual perception: A survey, ”CAAI Transactions on Intelligence Technology, vol. 10, no. 4, pp. 945–958, 2025

work page 2025

[2] [2]

An image is worth 16x16 words: Transformers for image recognition at scale,

A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, and N. Houlsby, “An image is worth 16x16 words: Transformers for image recognition at scale, ” in International Conference on Learning Representations, 2021

work page 2021

[3] [3]

Pointnet++: Deep hierarchical feature learn- ing on point sets in a metric space,

C. R. Qi, L. Yi, H. Su, and L. J. Guibas, “Pointnet++: Deep hierarchical feature learn- ing on point sets in a metric space, ”Advances in neural information processing systems, vol. 30, 2017

work page 2017

[4] [4]

Dspnet: Dual-vision scene perception for robust 3d question answering,

J. Luo, Y. Liu, W. Chen, Z. Li, Y. Wang, G. Li, and L. Lin, “Dspnet: Dual-vision scene perception for robust 3d question answering, ” inProceedings of the Computer Vision and Pattern Recognition Conference, pp. 14169–14178, 2025

work page 2025

[5] [5]

A survey on large language models for automated planning,

M. Aghzal, E. Plaku, G. J. Stein, and Z. Yao, “A survey on large language models for automated planning, ”arXiv preprint arXiv:2502.12435, 2025

work page arXiv 2025

[6] [6]

Qwen3 Technical Report

A. Yang, A. Li, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Gao, C. Huang, C. Lv, et al., “Qwen3 technical report, ”arXiv preprint arXiv:2505.09388, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[7] [7]

InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency

W. Wang, Z. Gao, L. Gu, H. Pu, L. Cui, X. Wei, Z. Liu, L. Jing, S. Ye, J. Shao,et al., “Internvl3. 5: Advancing open-source multimodal models in versatility, reasoning, and efficiency, ”arXiv preprint arXiv:2508.18265, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[8] [8]

Scalable diffusion models with transformers,

W. Peebles and S. Xie, “Scalable diffusion models with transformers, ” in Proceedings of the IEEE/CVF international conference on computer vision, pp. 4195–4205, 2023

work page 2023

[9] [9]

Flow Matching for Generative Modeling

Y. Lipman, R. T. Chen, H. Ben-Hamu, M. Nickel, and M. Le, “Flow matching for generative modeling, ”arXiv preprint arXiv:2210.02747, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022

[10] C. Chi, Z. Xu, S. Feng, E. Cousineau, Y. Du, B. Burchfiel, R. Tedrake, and S. Song, “Diffusion policy: Visuomotor policy learning via action diffusion,” The International Journal of Robotics Research, vol. 44, no. 10-11, pp. 1684–1704, 2025.

[11] Y. Zhong, F. Bai, S. Cai, X. Huang, Z. Chen, X. Zhang, Y. Wang, S. Guo, T. Guan, K. N. Lui, et al., “A survey on vision-language-action models: An action tokenization perspective,” arXiv preprint arXiv:2507.01925, 2025.

[12] M. J. Kim, K. Pertsch, S. Karamcheti, T. Xiao, A. Balakrishna, S. Nair, R. Rafailov, E. Foster, G. Lam, P. Sanketi, et al., “OpenVLA: An open-source vision-language-action model,” arXiv preprint arXiv:2406.09246, 2024.

[13] K. Black, N. Brown, D. Driess, A. Esmail, M. Equi, C. Finn, N. Fusai, L. Groom, K. Hausman, B. Ichter, et al., “$\pi_0$: A vision-language-action flow model for general robot control,” arXiv preprint arXiv:2410.24164, 2024.

[14] X. Chen, Y. Chen, Y. Fu, N. Gao, J. Jia, W. Jin, H. Li, Y. Mu, J. Pang, Y. Qiao, et al., “InternVLA-M1: A spatially guided vision-language-action framework for generalist robot policy,” arXiv preprint arXiv:2510.13778, 2025.

[15] J. Bjorck, F. Castañeda, N. Cherniadev, X. Da, R. Ding, L. Fan, Y. Fang, D. Fox, F. Hu, S. Huang, et al., “GR00T N1: An open foundation model for generalist humanoid robots,” arXiv preprint arXiv:2503.14734, 2025.

[16] K. Black, N. Brown, J. Darpinian, K. Dhabalia, D. Driess, A. Esmail, M. R. Equi, C. Finn, N. Fusai, M. Y. Galliker, et al., “$\pi_{0.5}$: A vision-language-action model with open-world generalization,” in 9th Annual Conference on Robot Learning, 2025.

[17] NVIDIA GEAR Team, “GR00T N1.6: An Improved Open Foundation Model for Generalist Humanoid Robots,” 2025. Technical report.

[18] A. Brohan, N. Brown, J. Carbajal, Y. Chebotar, J. Dabis, C. Finn, K. Gopalakrishnan, K. Hausman, A. Herzog, J. Hsu, et al., “RT-1: Robotics transformer for real-world control at scale,” arXiv preprint arXiv:2212.06817, 2022.

[19] B. Zitkovich, T. Yu, S. Xu, P. Xu, T. Xiao, F. Xia, J. Wu, P. Wohlhart, S. Welker, A. Wahid, et al., “RT-2: Vision-language-action models transfer web knowledge to robotic control,” in Conference on Robot Learning, pp. 2165–2183, PMLR, 2023.

[20] S. Fei, S. Wang, J. Shi, Z. Dai, J. Cai, P. Qian, L. Ji, X. He, S. Zhang, Z. Fei, et al., “LIBERO-Plus: In-depth robustness analysis of vision-language-action models,” arXiv preprint arXiv:2510.13626, 2025.

[21] T. Chen, Z. Chen, B. Chen, Z. Cai, Y. Liu, Z. Li, Q. Liang, X. Lin, Y. Ge, Z. Gu, et al., “RoboTwin 2.0: A scalable data generator and benchmark with strong domain randomization for robust bimanual robotic manipulation,” arXiv preprint arXiv:2506.18088, 2025.

[22] T. Wang, C. Han, J. Liang, W. Yang, D. Liu, L. X. Zhang, Q. Wang, J. Luo, and R. Tang, “Exploring the adversarial vulnerabilities of vision-language-action models in robotics,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6948–6958, 2025.

[23] S. Yang, H. Li, Y. Chen, B. Wang, Y. Tian, T. Wang, H. Wang, F. Zhao, Y. Liao, and J. Pang, “InstructVLA: Vision-language-action instruction tuning from understanding to manipulation,” arXiv preprint arXiv:2507.17520, 2025.

[24] S. Tan, K. Dou, Y. Zhao, and P. Krähenbühl, “Interactive post-training for vision-language-action models,” arXiv preprint arXiv:2505.17016, 2025.

[25] J. Cen, C. Yu, H. Yuan, Y. Jiang, S. Huang, J. Guo, Y. Gao, Z. Chen, J. Yu, X. Wang, et al., “WorldVLA: Towards autoregressive action world model,” arXiv preprint arXiv:2506.21539, 2025.

[26] Y. Wang, X. Li, W. Wang, J. Zhang, Y. Li, Y. Chen, J. Zhang, H. Xu, Z. Zhang, D. Wang, et al., “UniVLA: Unified vision-language-action model,” arXiv preprint arXiv:2506.19850, 2025.

[27] Y. Liu, W. Chen, Y. Bai, X. Liang, G. Li, W. Gao, and L. Lin, “Aligning cyber space with physical world: A comprehensive survey on embodied AI,” IEEE/ASME Transactions on Mechatronics, 2025.

[28] A. O’Neill, A. Rehman, A. Maddukuri, A. Gupta, A. Padalkar, A. Lee, A. Pooley, A. Gupta, A. Mandlekar, A. Jain, et al., “Open X-Embodiment: Robotic learning datasets and RT-X models: Open X-Embodiment collaboration,” in 2024 IEEE International Conference on Robotics and Automation (ICRA), pp. 6892–6903, IEEE, 2024.

[29] Octo Model Team, D. Ghosh, H. Walke, K. Pertsch, K. Black, O. Mees, S. Dasari, J. Hejna, T. Kreiman, C. Xu, et al., “Octo: An open-source generalist robot policy,” arXiv preprint arXiv:2405.12213, 2024.

[30] Q. Bu, J. Cai, L. Chen, X. Cui, Y. Ding, S. Feng, S. Gao, X. He, X. Hu, X. Huang, et al., “AgiBot World Colosseo: A large-scale manipulation platform for scalable and intelligent embodied systems,” arXiv preprint arXiv:2503.06669, 2025.

[31] Physical Intelligence, A. Amin, R. Aniceto, A. Balakrishna, K. Black, K. Conley, G. Connors, J. Darpinian, K. Dhabalia, J. DiCarlo, et al., “$\pi^{*}_{0.6}$: A VLA that learns from experience,” arXiv preprint arXiv:2511.14759, 2025.

[32] M. J. Kim, C. Finn, and P. Liang, “Fine-tuning vision-language-action models: Optimizing speed and success,” arXiv preprint arXiv:2502.19645, 2025.

[33] K. Pertsch, K. Stachowicz, B. Ichter, D. Driess, S. Nair, Q. Vuong, O. Mees, C. Finn, and S. Levine, “FAST: Efficient action tokenization for vision-language-action models,” arXiv preprint arXiv:2501.09747, 2025.

[34] Z. Wang, Z. Zhou, J. Song, Y. Huang, Z. Shu, and L. Ma, “VLATest: Testing and evaluating vision-language-action models for robotic manipulation,” Proceedings of the ACM on Software Engineering, vol. 2, no. FSE, pp. 1615–1638, 2025.

[35] J. Cen, S. Huang, Y. Yuan, K. Li, H. Yuan, C. Yu, Y. Jiang, J. Guo, X. Li, H. Luo, et al., “RynnVLA-002: A unified vision-language-action and world model,” arXiv preprint arXiv:2511.17502, 2025.

[36] H. Bi, H. Tan, S. Xie, Z. Wang, S. Huang, H. Liu, R. Zhao, Y. Feng, C. Xiang, Y. Rong, et al., “Motus: A unified latent action world model,” arXiv preprint arXiv:2512.13030, 2025.

[37] G. Lu, W. Guo, C. Zhang, Y. Zhou, H. Jiang, Z. Gao, Y. Tang, and Z. Wang, “VLA-RL: Towards masterful and general robotic manipulation with scalable reinforcement learning,” arXiv preprint arXiv:2505.18719, 2025.

[38] H. Li, Y. Zuo, J. Yu, Y. Zhang, Z. Yang, K. Zhang, X. Zhu, Y. Zhang, T. Chen, G. Cui, et al., “SimpleVLA-RL: Scaling VLA training via reinforcement learning,” arXiv preprint arXiv:2509.09674, 2025.

[39] H. Zhang, S. Zhang, J. Jin, Q. Zeng, R. Li, and D. Wang, “RobustVLA: Robustness-aware reinforcement post-training for vision-language-action models,” arXiv preprint arXiv:2511.01331, 2025.

[40] A. Tarvainen and H. Valpola, “Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results,” Advances in Neural Information Processing Systems, vol. 30, 2017.

[41] T. Miyato, S.-i. Maeda, M. Koyama, and S. Ishii, “Virtual adversarial training: A regularization method for supervised and semi-supervised learning,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 41, no. 8, pp. 1979–1993, 2018.

[42] K. Sohn, D. Berthelot, N. Carlini, Z. Zhang, H. Zhang, C. A. Raffel, E. D. Cubuk, A. Kurakin, and C.-L. Li, “FixMatch: Simplifying semi-supervised learning with consistency and confidence,” Advances in Neural Information Processing Systems, vol. 33, pp. 596–608, 2020.

[43] D. Yarats, I. Kostrikov, and R. Fergus, “Image augmentation is all you need: Regularizing deep reinforcement learning from pixels,” in International Conference on Learning Representations, 2020.

[44] A. Madry, A. Makelov, L. Schmidt, D. Tsipras, and A. Vladu, “Towards deep learning models resistant to adversarial attacks,” arXiv preprint arXiv:1706.06083, 2017.

[45] I. J. Goodfellow, J. Shlens, and C. Szegedy, “Explaining and harnessing adversarial examples,” CoRR, vol. abs/1412.6572, 2014.

[46] B. Liu, Y. Zhu, C. Gao, Y. Feng, Q. Liu, Y. Zhu, and P. Stone, “LIBERO: Benchmarking knowledge transfer for lifelong robot learning,” Advances in Neural Information Processing Systems, vol. 36, pp. 44776–44791, 2023.

[47] I. Loshchilov, “Decoupled weight decay regularization,” arXiv preprint arXiv:1711.05101, 2017.

A More Implementation Details of Instructional Consistency

We provide more implementation details for the instruction rewriting process employed by “Instructional Consistency” (IC). IC does not introduce an additional explicit loss. Instead, it expands each sin...
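The appendix text above says that IC adds no explicit loss term and works by expanding instructions. One way to picture this is as a data-level augmentation: every semantically equivalent rewrite of an instruction is paired with the same action target, so the ordinary training loss already pushes the policy toward identical outputs across rewrites. The following Python sketch illustrates that reading only. The `Sample` type, the `expand_with_rewrites` helper, the rewrite table, and the example data are all hypothetical and are not taken from the paper, whose actual rewriting procedure is not shown in this excerpt.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Sample:
    """One training example: a language instruction and its target action."""
    instruction: str
    action: Tuple[float, ...]


def expand_with_rewrites(
    samples: List[Sample],
    rewrites: Dict[str, List[str]],
) -> List[Sample]:
    """Pair each rewrite of an instruction with the original action target.

    Assumption: rewrites are precomputed and semantically equivalent to the
    original instruction. No extra loss term is introduced; the expanded
    samples are meant to be fed to the usual supervised objective.
    """
    expanded: List[Sample] = []
    for sample in samples:
        # Keep the original wording first, then add each distinct rewrite.
        variants = [sample.instruction]
        for text in rewrites.get(sample.instruction, []):
            if text not in variants:
                variants.append(text)
        expanded.extend(Sample(v, sample.action) for v in variants)
    return expanded


# Hypothetical data, for illustration only.
demo = [Sample("pick up the red block", (0.1, -0.2, 0.0))]
table = {"pick up the red block": ["grasp the red block", "lift the red block"]}
augmented = expand_with_rewrites(demo, table)
```

Under these assumptions, `augmented` holds the original sample plus one sample per rewrite, all sharing the same action tuple. Whether rewrites are generated offline or sampled per batch is a design choice this sketch leaves open.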