Pre-VLA: Preemptive Runtime Verification for Reliable Vision-Language-Action and World-Model Rollouts

Zhen Sun, Yongjian Guo, Haoran Sun, Luqiao Wang, Wei Lu, Jiachi Ji, Shengzhe Ji, Junwu Xiong, Zhijun Meng

arXiv: 2605.22446 · v1 · pith:QONCUQTKnew · submitted 2026-05-21 · cs.CV · cs.AI · cs.RO

Pith reviewed 2026-05-22 06:57 UTC · model grok-4.3

classification cs.CV · cs.AI · cs.RO

keywords runtime verification · vision-language-action models · world models · action safety · preemptive filtering · LIBERO benchmark · embodied AI · resampling scheduler

The pith

Pre-VLA adds preemptive checks to filter bad actions and raise VLA success rates from 31 to 38 percent.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper introduces Pre-VLA, a system that verifies action chunks from vision-language-action models before they are carried out or fed into world models. It uses a multimodal backbone and dual-branch head to score safety and advantage, then resamples poor candidates within a time budget. The approach tackles uncertainty in learned policies that otherwise causes robot failures or inefficient simulations. Readers should care because it offers a practical way to make large embodied models more dependable without retraining the whole system. Experiments indicate this raises average success on standard benchmarks while keeping verification fast.

Core claim

Pre-VLA is a unified runtime verification architecture that performs preemptive action validity assessment using an efficient multimodal backbone with modality-aware pooling and a lightweight dual-branch head to predict safety confidence and critic-derived advantage scores. It is trained with a multi-task objective that combines Focal classification, advantage regression, and soft-threshold calibration. At deployment, a dual-mode preemptive resampling scheduler filters low-quality actions and triggers adaptive resampling under limited computation budget, leading to higher closed-loop success and less error buildup in rollouts.
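The pooling and head described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: mean pooling per modality, the three modality names, the linear branches, and every function and parameter name (`modality_aware_pool`, `dual_branch_head`, `w_safety`, `w_adv`) are assumptions made for the example.

```python
import math
import random

def masked_mean(hidden, mask):
    """Average the token vectors whose mask entry is 1 (zeros if none match)."""
    picked = [h for h, m in zip(hidden, mask) if m]
    if not picked:
        return [0.0] * len(hidden[0])
    dim = len(picked[0])
    return [sum(v[i] for v in picked) / len(picked) for i in range(dim)]

def modality_aware_pool(hidden, modality_ids,
                        modalities=("vision", "language", "action")):
    """Pool each modality separately, then concatenate the pooled vectors.
    Assumption: mean pooling; the paper's pooling operator is not given here."""
    pooled = []
    for name in modalities:
        mask = [1 if m == name else 0 for m in modality_ids]
        pooled.extend(masked_mean(hidden, mask))
    return pooled

def linear(x, weights, bias):
    return sum(w * xi for w, xi in zip(weights, x)) + bias

def dual_branch_head(features, params):
    """Return (safety confidence in (0, 1), unbounded advantage score).
    Assumption: one linear layer per branch on shared pooled features."""
    safety_logit = linear(features, params["w_safety"], params["b_safety"])
    advantage = linear(features, params["w_adv"], params["b_adv"])
    safety = 1.0 / (1.0 + math.exp(-safety_logit))
    return safety, advantage

# Toy input standing in for frozen-backbone hidden states (hypothetical sizes).
rng = random.Random(0)
dim = 4
modality_ids = ["vision"] * 3 + ["language"] * 2 + ["action"] * 2
hidden = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in modality_ids]

features = modality_aware_pool(hidden, modality_ids)
params = {
    "w_safety": [rng.uniform(-1, 1) for _ in features], "b_safety": 0.0,
    "w_adv": [rng.uniform(-1, 1) for _ in features], "b_adv": 0.0,
}
safety, advantage = dual_branch_head(features, params)
print(len(features), round(safety, 3), round(advantage, 3))
```

Keeping the two branches on one shared feature vector is what makes the head cheap: the backbone runs once per candidate chunk and only two small projections are added.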

What carries the argument

Lightweight dual-branch head that outputs safety confidence and advantage scores for action chunks, paired with a dual-mode resampling scheduler.
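A budgeted resampling loop of the kind this mechanism implies might look as follows. It is a sketch under stated assumptions: the acceptance rule (both scores above fixed thresholds), the fallback to the best-scored candidate, and the names `verify_and_resample`, `safety_min`, `advantage_min`, `budget_s`, and `max_tries` are all hypothetical. The two scheduler modes the paper mentions are not specified here, so the sketch covers only a single mode.

```python
import time

def verify_and_resample(sample_chunk, score_chunk, safety_min=0.5,
                        advantage_min=0.0, budget_s=0.5, max_tries=4):
    """Return (chunk, tries, accepted).

    Accept the first candidate whose safety and advantage both clear their
    thresholds. If the time budget or try limit runs out first, fall back to
    the best candidate seen, ranked by (safety, advantage).
    """
    deadline = time.monotonic() + budget_s
    best, best_key = None, None
    tries = 0
    while tries < max_tries:
        chunk = sample_chunk()                    # ask the policy for a candidate
        safety, advantage = score_chunk(chunk)    # run the verifier on it
        tries += 1
        if safety >= safety_min and advantage >= advantage_min:
            return chunk, tries, True
        key = (safety, advantage)
        if best_key is None or key > best_key:
            best, best_key = chunk, key
        if time.monotonic() >= deadline:          # budget exhausted, stop early
            break
    return best, tries, False

# Toy candidates: (name, safety, advantage). Values are made up for the demo.
candidates = iter([("a", 0.2, -0.1), ("b", 0.4, 0.3), ("c", 0.9, 0.2)])
chunk, tries, accepted = verify_and_resample(
    lambda: next(candidates), lambda c: (c[1], c[2]), budget_s=5.0)
print(chunk[0], tries, accepted)  # c 3 True
```

The fallback matters for the failure mode raised later on this page: a verifier that rejects everything should degrade to the least-bad candidate rather than stall execution.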

If this is right

Increases average closed-loop success rate from 30.79% to 37.62% over baseline on LIBERO.
Decreases the number of steps required to complete tasks.
Keeps average verification time at 183.9 milliseconds per action chunk.
Reduces error accumulation when generating world-model rollouts.
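For scale, the headline numbers above work out as follows. The success rates and latency are the reported figures; the serial-throughput line is only an upper bound that assumes verification runs alone, with no policy inference or resampling overhead.

```python
baseline = 30.79      # reported average success rate of the baseline (%)
with_prevla = 37.62   # reported average success rate with Pre-VLA (%)
verify_ms = 183.9     # reported average verification time per action chunk

abs_gain = with_prevla - baseline            # percentage points
rel_gain = abs_gain / baseline * 100         # relative improvement (%)
chunks_per_s = 1000.0 / verify_ms            # serial verifications per second

print(f"{abs_gain:.2f} pp absolute, {rel_gain:.1f}% relative")  # 6.83 pp absolute, 22.2% relative
print(f"{chunks_per_s:.1f} verifications per second")           # 5.4 verifications per second
```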

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

This verification approach could help stabilize longer planning horizons by catching mistakes before they compound.
It might generalize to other robot learning setups where action uncertainty is a problem.
Real-world tests could check if the added latency still allows responsive control in dynamic environments.

Load-bearing premise

The dual-branch head produces safety and advantage predictions that work well on unseen actions without causing too many unnecessary resamples or stalls.

What would settle it

If adding Pre-VLA to a VLA model on new tasks fails to improve success rates or causes frequent execution halts due to false alarms, the method's reliability would be questioned.

Figures

Figures reproduced from arXiv: 2605.22446 by Haoran Sun, Jiachi Ji, Junwu Xiong, Luqiao Wang, Shengzhe Ji, Wei Lu, Yongjian Guo, Zhen Sun, Zhijun Meng.

**Figure 2.** Overview of the ARGUS runtime safety verification framework. The VLA generates candidate action chunks from …

**Figure 3.** The core idea is to reuse the multimodal perceptual …

**Figure 3.** Overall architecture of Pre-VLA with dual-modal data used during training. … into the backbone for encoding. During the training of Pre-VLA, all parameters of the backbone are frozen to preserve its original generative capability. We then extract the final-layer hidden states H_t of the backbone as high-dimensional feature representations for subsequent verification. 2) Modality-Aware Feature Pooling: Since t…

**Figure 4.** Closed-loop execution comparison with and without …

**Figure 5.** Closed-loop performance comparison across four LIBERO suites.

**Figure 6.** World Model rollout comparison with and without …

read the original abstract

While large vision-language-action (VLA) models and generative world models (WM) have advanced long-horizon embodied intelligence, their practical deployment remains challenged by uncertainty in learning-based action generation. Low-quality actions may cause physical failures during execution or lead to misleading world-model rollouts with redundant rendering costs. To address this issue, we propose Pre-VLA, a unified runtime verification architecture that performs preemptive action validity assessment before physical execution or world-model imagination. Pre-VLA leverages an efficient multimodal backbone with modality-aware pooling and a lightweight dual-branch head to predict both safety confidence and critic-derived advantage scores for candidate action chunks. To handle severe class imbalance and unstable boundary decisions, we train Pre-VLA with a multi-task objective combining Focal classification, advantage regression, and soft-threshold calibration. During deployment, a dual-mode preemptive resampling scheduler filters low-quality actions and triggers adaptive resampling under a limited computation budget. Experiments on the LIBERO benchmark show that Pre-VLA improves the average closed-loop success rate across four suites from 30.79% to 37.62% over RynnVLA-002, reduces task execution steps, achieves 183.9 ms average forward verification time per action chunk, and mitigates error accumulation in world-model rollouts.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Pre-VLA adds a lightweight runtime filter for VLA action chunks with reported LIBERO gains, but the safety and advantage scores lack direct validation on generalization.


Pre-VLA is a system for checking the quality of action chunks in vision-language-action models before they are used in physical execution or world model rollouts. The reported result is an improvement in average closed-loop success rate on the LIBERO benchmark from 30.79 percent to 37.62 percent compared to the RynnVLA-002 baseline, along with faster task completion and reduced error in simulations.

The paper does a few things well. It integrates modality-aware pooling into the backbone for handling different input types efficiently. The dual-branch head predicts both safety confidence and critic-derived advantage, which allows for a more nuanced filtering than a single score. Training with focal loss addresses the imbalance between good and bad actions, while the soft-threshold calibration helps with boundary decisions. The dual-mode resampling scheduler operates under a limited budget, which is a realistic constraint for deployment. The timing result of 183.9 milliseconds per action chunk shows the approach stays lightweight.

The soft spots are mainly in the supporting evidence. The abstract does not include error bars on the success rates or any ablation studies that isolate the contribution of the verification head versus the scheduler. There are also no reported metrics on the head's performance itself, such as how well the safety scores correlate with actual outcomes or the rate of false negatives that might halt execution unnecessarily. This leaves open the question of whether the scores generalize reliably to new action sequences in the different LIBERO suites, as noted in the stress-test.

Overall, this paper is for practitioners in robotics and AI who are working on making learned controllers more dependable for real-world use. It offers a concrete method that can be added on top of existing VLA models. I would recommend sending it for peer review. The benchmark results provide a starting point for discussion, and referees can request the additional diagnostics needed to confirm the claims about the verification component.

Referee Report

3 major / 2 minor

Summary. The manuscript proposes Pre-VLA, a unified runtime verification architecture for vision-language-action (VLA) models and generative world models. It features an efficient multimodal backbone with modality-aware pooling and a lightweight dual-branch head that predicts safety confidence and critic-derived advantage scores for candidate action chunks. The model is trained using a multi-task objective combining Focal classification, advantage regression, and soft-threshold calibration. A dual-mode preemptive resampling scheduler filters low-quality actions under a limited computation budget. On the LIBERO benchmark, Pre-VLA improves the average closed-loop success rate across four suites from 30.79% to 37.62% compared to RynnVLA-002, reduces task execution steps, achieves 183.9 ms average forward verification time per action chunk, and mitigates error accumulation in world-model rollouts.

Significance. If the performance improvements are robustly attributable to the preemptive verification mechanism, this work could advance the reliability of embodied AI systems by addressing uncertainty in action generation and preventing misleading world-model rollouts. The approach offers a practical solution for runtime safety in long-horizon tasks, potentially reducing physical failures and computational waste. The reported verification time suggests feasibility for real-time deployment.

major comments (3)

[Abstract] Abstract: The reported improvement in closed-loop success rate from 30.79% to 37.62% provides no error bars, no statistical significance tests, and no ablation isolating the dual-branch head from the resampling scheduler. This directly undermines attribution of the gains to reliable safety confidence and advantage predictions on out-of-distribution chunks.
[Training objective] Training description: No details are given on how critic advantage labels were obtained for the regression branch. This is load-bearing for the central claim, as label quality determines whether the dual-branch head can produce generalizable scores without excessive false negatives that stall execution.
[Experiments] Evaluation: No predictor-level metrics (AUC, ECE, false-negative rate on held-out chunks) are supplied for the lightweight dual-branch head. Without these, the assumption that the head generalizes to unseen action chunks under the four LIBERO suites cannot be verified and remains the weakest link in supporting the 6.83 percentage-point gain.
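The predictor-level metrics requested in the third comment are cheap to compute once held-out chunks are labeled. The sketch below shows one conventional way to get AUC, ECE, and false-negative rate from safety scores. It assumes label 1 means "chunk is valid/safe" and that the safety score estimates P(label = 1); the toy scores and labels are made up and are not results from the paper.

```python
def auc(scores, labels):
    """Probability that a random positive outscores a random negative
    (ties count half). Quadratic in sample count, fine for a sketch."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def ece(scores, labels, n_bins=10):
    """Expected calibration error with equal-width confidence bins."""
    bins = [[] for _ in range(n_bins)]
    for s, y in zip(scores, labels):
        idx = min(int(s * n_bins), n_bins - 1)
        bins[idx].append((s, y))
    err = 0.0
    for b in bins:
        if b:
            conf = sum(s for s, _ in b) / len(b)   # mean predicted probability
            acc = sum(y for _, y in b) / len(b)    # observed positive rate
            err += len(b) / len(scores) * abs(conf - acc)
    return err

def false_negative_rate(scores, labels, threshold=0.5):
    """Share of truly valid chunks (label 1) that the threshold rejects."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    return sum(s < threshold for s in pos) / len(pos)

# Made-up held-out scores and labels, for illustration only.
scores = [0.9, 0.8, 0.4, 0.3, 0.2]
labels = [1, 1, 1, 0, 0]
print(auc(scores, labels), round(ece(scores, labels), 2),
      round(false_negative_rate(scores, labels), 3))
```

In this toy case the ranking is perfect (AUC of 1.0) yet one valid chunk in three is rejected at a 0.5 threshold, which is exactly the gap between discrimination and operating-point behavior that the comment asks the authors to report.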

minor comments (2)

[Abstract] The abstract refers to 'four suites' of LIBERO without naming them; explicit identification would aid reproducibility.
[Method] The soft-threshold calibration parameter is mentioned as a free parameter but its precise integration into the multi-task loss is not illustrated, which could be clarified with a short equation or pseudocode.
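To make the second minor comment concrete, here is one way a three-term objective of this shape could be written for a single sample. Only the focal term follows a published definition (Lin et al., with the common defaults gamma = 2 and alpha = 0.25). The squared-error advantage term, the margin-hinge form of the calibration term, the threshold `tau`, the `margin`, and the loss weights are placeholder assumptions, since the paper's exact formulation is not given on this page.

```python
import math

def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Binary focal loss: -alpha_t * (1 - p_t)^gamma * log(p_t)."""
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha
    p_t = min(max(p_t, 1e-7), 1.0)          # guard against log(0)
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)

def advantage_loss(pred, target):
    """Assumption: plain squared error against the critic-derived label."""
    return (pred - target) ** 2

def soft_threshold_penalty(p, y, tau=0.5, margin=0.1):
    """Hypothetical calibration term: penalize predictions that sit on the
    wrong side of the threshold tau, or closer to it than `margin`."""
    signed_gap = (p - tau) if y == 1 else (tau - p)
    return max(0.0, margin - signed_gap)

def multitask_loss(p, adv_pred, y, adv_target,
                   w_cls=1.0, w_adv=1.0, w_cal=0.5):
    """Weighted sum of the three terms (weights are illustrative)."""
    return (w_cls * focal_loss(p, y)
            + w_adv * advantage_loss(adv_pred, adv_target)
            + w_cal * soft_threshold_penalty(p, y))

confident = multitask_loss(p=0.90, adv_pred=0.3, y=1, adv_target=0.3)
borderline = multitask_loss(p=0.55, adv_pred=0.3, y=1, adv_target=0.3)
print(confident < borderline)  # True
```

Under this placeholder form, a correct but borderline prediction (0.55 against a 0.5 threshold) is penalized by both the focal term and the margin term, which is the behavior a "soft-threshold" term would plausibly target.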

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. We have reviewed each major comment carefully and provide point-by-point responses below, along with commitments to revisions that will strengthen the presentation of our results and methods.


Referee: [Abstract] Abstract: The reported improvement in closed-loop success rate from 30.79% to 37.62% provides no error bars, no statistical significance tests, and no ablation isolating the dual-branch head from the resampling scheduler. This directly undermines attribution of the gains to reliable safety confidence and advantage predictions on out-of-distribution chunks.

Authors: We agree that the current reporting in the abstract lacks error bars, statistical tests, and a dedicated ablation to isolate the dual-branch head from the resampling scheduler. In the revised manuscript we will add error bars computed over multiple random seeds, report the results of statistical significance tests, and include an ablation study that separates the contributions of the dual-branch head and the preemptive resampling scheduler. These additions will better support attribution of the observed gains to the safety confidence and advantage predictions.
Revision: yes

Referee: [Training objective] Training description: No details are given on how critic advantage labels were obtained for the regression branch. This is load-bearing for the central claim, as label quality determines whether the dual-branch head can produce generalizable scores without excessive false negatives that stall execution.

Authors: We acknowledge that the manuscript does not currently provide sufficient detail on the generation of critic advantage labels for the regression branch. We will expand the training objective section in the revision to fully describe the label acquisition process, including the critic model employed, the computation of advantage scores, and any preprocessing steps used to mitigate label noise or imbalance.
Revision: yes

Referee: [Experiments] Evaluation: No predictor-level metrics (AUC, ECE, false-negative rate on held-out chunks) are supplied for the lightweight dual-branch head. Without these, the assumption that the head generalizes to unseen action chunks under the four LIBERO suites cannot be verified and remains the weakest link in supporting the 6.83 percentage-point gain.

Authors: We concur that predictor-level metrics are necessary to substantiate the generalization of the dual-branch head. In the revised experiments section we will report AUC, expected calibration error (ECE), and false-negative rates evaluated on held-out action chunks drawn from the LIBERO suites. These metrics will directly address the verification of the head's performance on out-of-distribution chunks.
Revision: yes
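The error bars and significance test promised in the first response could be produced along these lines. This is a sketch of one reasonable choice (mean and standard deviation across seeds, plus an exact paired sign-flip test), not the analysis the authors commit to. The per-seed success rates are synthetic placeholders chosen only to exercise the code.

```python
import itertools
import statistics

def mean_and_std(values):
    """Sample mean and sample standard deviation across seeds."""
    return statistics.mean(values), statistics.stdev(values)

def paired_sign_flip_pvalue(a, b):
    """Exact two-sided sign-flip test on per-seed differences b - a.

    Under the null hypothesis each difference is equally likely to have
    either sign, so all 2^n sign patterns are enumerated.
    """
    diffs = [y - x for x, y in zip(a, b)]
    observed = abs(sum(diffs))
    hits = 0
    total = 0
    for signs in itertools.product((1, -1), repeat=len(diffs)):
        total += 1
        if abs(sum(s * d for s, d in zip(signs, diffs))) >= observed - 1e-12:
            hits += 1
    return hits / total

# Synthetic per-seed success rates (placeholders, NOT results from the paper).
baseline_runs = [0.50, 0.52, 0.49, 0.51, 0.53]
verified_runs = [0.57, 0.58, 0.56, 0.59, 0.58]

print(mean_and_std(baseline_runs))
print(mean_and_std(verified_runs))
print(paired_sign_flip_pvalue(baseline_runs, verified_runs))  # 0.0625
```

With five seeds the smallest achievable two-sided p-value is 2/32 = 0.0625, so a seed-level test of this kind needs at least six paired runs before it can cross the conventional 0.05 level.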

Circularity Check

0 steps flagged

No significant circularity; empirical gains measured on external benchmark

full rationale

The paper's central claims consist of measured closed-loop success rates on the external LIBERO benchmark (improving from 30.79% to 37.62% over the named baseline RynnVLA-002) together with runtime metrics such as 183.9 ms verification time. These quantities are obtained by direct evaluation on held-out suites rather than by any internal equation that reduces the reported success rate to a fitted parameter or self-referential definition. The training procedure (Focal loss + advantage regression + soft-threshold calibration on a dual-branch head) is described as a standard multi-task objective; no derivation step equates the final performance numbers to the training inputs by construction. No self-citation chain, uniqueness theorem, or ansatz smuggling is invoked to justify the architecture. The derivation chain therefore remains self-contained against an independent external benchmark.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

The central claim rests on standard supervised-learning assumptions plus the untested premise that the critic-derived advantage labels are sufficiently accurate to guide resampling.

free parameters (1)

soft-threshold calibration parameter
Introduced to stabilize boundary decisions under class imbalance; its value is chosen during training.

axioms (1)

domain assumption: The multimodal backbone extracts features that are linearly separable enough for the dual-branch head to produce useful safety and advantage predictions.
Invoked when the paper states that the backbone plus lightweight head suffices for preemptive assessment.



Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Passage: lightweight dual-branch head to predict both safety confidence and critic-derived advantage scores... multi-task objective combining Focal classification, advantage regression, and soft-threshold calibration

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · reality_from_one_distinction
Relation between the paper passage and the cited Recognition theorem: unclear
Passage: Experiments on the LIBERO benchmark show that Pre-VLA improves the average closed-loop success rate... 183.9 ms average forward verification time

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

39 extracted references · 39 canonical work pages · 14 internal anchors

[1] M. J. Kim, K. Pertsch, S. Karamcheti, T. Xiao, A. Balakrishna, S. Nair, R. Rafailov, E. P. Foster, P. R. Sanketi, Q. Vuong et al., “Openvla: An open-source vision-language-action model,” in Conference on Robot Learning. PMLR, 2025, pp. 2679–2713.

[2] J. Bjorck, F. Castañeda, N. Cherniadev, X. Da, R. Ding, L. Fan, Y. Fang, D. Fox, F. Hu, S. Huang et al., “Gr00t n1: An open foundation model for generalist humanoid robots,” arXiv preprint arXiv:2503.14734, 2025.

[3] T. Wan, A. Wang, B. Ai, B. Wen, C. Mao, C.-W. Xie, D. Chen, F. Yu, H. Zhao, J. Yang et al., “Wan: Open and advanced large-scale video generative models,” arXiv preprint arXiv:2503.20314, 2025.

[4] W. Huang, H. Sun, Y. Guo, Y. Ma, H. Li, J. Long, Z. Mo, Z. Guan, Y. Guo, S. Di et al., “Noisegate: Learning per-latent timestep schedules as information gating in world action models,” arXiv preprint arXiv:2605.07794, 2026.

[5] Y. Ma, Z. Song, Y. Zhuang, J. Hao, and I. King, “A survey on vision-language-action models for embodied ai,” arXiv preprint arXiv:2405.14093, 2024.

[6] Y. Zhong, F. Bai, S. Cai, X. Huang, Z. Chen, X. Zhang, Y. Wang, S. Guo, T. Guan, K. N. Lui et al., “A survey on vision-language-action models: An action tokenization perspective,” arXiv preprint arXiv:2507.01925, 2025.

[7] D. Zhang, J. Sun, C. Hu, X. Wu, Z. Yuan, R. Zhou, F. Shen, and Q. Zhou, “Pure vision language action (vla) models: A comprehensive survey,” arXiv preprint arXiv:2509.19012, 2025.

[8] C. Zhou, H. Sun, H. Yang, J. Long, J. Xiong, L. Wang, M. Luo, Q. Yang, S. Di, S. Wang et al., “Thousand-gpu large-scale training and optimization recipe for ai-native cloud embodied intelligence infrastructure,” arXiv preprint arXiv:2603.11101, 2026.

[9] Z. Jiang, S. Zhou, Y. Jiang, Z. Huang, M. Wei, Y. Chen, T. Zhou, Z. Guo, H. Lin, Q. Zhang et al., “Wovr: World models as reliable simulators for post-training vla policies with rl,” arXiv preprint arXiv:2602.13977, 2026.

[10] Z. Feng, R. Xue, L. Yuan, Y. Yu, N. Ding, M. Liu, B. Gao, J. Sun, X. Zheng, and G. Wang, “Multi-agent embodied ai: Advances and future directions,” 2025. [Online]. Available: https://arxiv.org/abs/2505.05108
[11] J. Cen, C. Yu, H. Yuan, Y. Jiang, S. Huang, J. Guo, X. Li, Y. Song, H. Luo, F. Wang et al., “Worldvla: Towards autoregressive action world model,” arXiv preprint arXiv:2506.21539, 2025.

[12] Z. Jiang, K. Liu, Y. Qin, S. Tian, Y. Zheng, M. Zhou, C. Yu, H. Li, and D. Zhao, “World4rl: Diffusion world models for policy refinement with reinforcement learning for robotic manipulation,” 2026. [Online]. Available: https://arxiv.org/abs/2509.19080

[13] J. Gao, Y. Guo, Z. Guan, W. Huang, W. Ma, X. Xiao, J. Xiong, and S. Wen, “Sword: Style-robust world models as simulators via dynamic latent bootstrapping for vla policy post-training,” arXiv preprint arXiv:2605.07288, 2026.

[14] Z. Guan, H. Sun, Y. Guo, S. Di, X. Bai, J. Long, T. Zhao, M. Luo, C. Zhou, Y. Guo et al., “Rl-vla3: Reinforcement learning vla accelerating via full asynchronism,” arXiv preprint arXiv:2602.05765, 2026.

[15] R. Caldas, J. A. Piñera García, M. Schiopu, P. Pelliccione, G. Rodrigues, and T. Berger, “Runtime verification and field-based testing for ros-based robotic systems,” IEEE Transactions on Software Engineering, vol. 50, no. 10, pp. 2544–2567, 2024.

[16] X. Guan, Y. Liu, X. Lu, B. Cao, B. He, X. Han, L. Sun, J. Lou, B. Yu, Y. Lu et al., “Search, verify and feedback: Towards next generation post-training paradigm of foundation models via verifier engineering,” arXiv preprint arXiv:2411.11504, 2024.

[17] J. S. Betzer, J. Boudjadar, M. Frasheri, and P. Talasila, “Digital twin enabled runtime verification for autonomous mobile robots under uncertainty,” arXiv preprint arXiv:2412.09913, 2024.

[18] L. Wang, Z. Ying, X. Yang, Q. Zou, Z. Yin, T. Li, J. Yang, Y. Yang, A. Liu, and X. Liu, “Robosafe: Safeguarding embodied agents via executable safety logic,” arXiv preprint arXiv:2512.21220, 2025.

[19] R. Xu, H. Lin, W. Jeon, H. Feng, Y. Zou, L. Sun, J. Gorman, E. Tolstaya, S. Tang, B. White et al., “Wod-e2e: Waymo open dataset for end-to-end driving in challenging long-tail scenarios,” arXiv preprint arXiv:2510.26125, 2025.

[20] M. Visca, S. Kuutti, R. Powell, Y. Gao, and S. Fallah, “Deep learning traversability estimator for mobile robots in unstructured environments,” in Annual Conference Towards Autonomous Robotic Systems. Springer, 2021, pp. 203–213.
[21] L. Zhao, F. Han, Q. Ling, H. Han, Z. Yao, W. Liu, and Z. Zhou, “A survey on class imbalance learning algorithms in complex scenarios,” IEEE Access, 2025.

[22] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár, “Focal loss for dense object detection,” in Proceedings of the IEEE international conference on computer vision, 2017, pp. 2980–2988.

[23] B. Liu, Y. Zhu, C. Gao, Y. Feng, Q. Liu, Y. Zhu, and P. Stone, “Libero: Benchmarking knowledge transfer for lifelong robot learning,” Advances in Neural Information Processing Systems, vol. 36, pp. 44776–44791, 2023.

[24] J. Cen, S. Huang, Y. Yuan, K. Li, H. Yuan, C. Yu, Y. Jiang, J. Guo, X. Li, H. Luo et al., “Rynnvla-002: A unified vision-language-action and world model,” arXiv preprint arXiv:2511.17502, 2025.

[25] K. Black, N. Brown, D. Driess, A. Esmail, M. Equi, C. Finn, N. Fusai, L. Groom, K. Hausman, B. Ichter et al., “pi0: A vision-language-action flow model for general robot control,” arXiv preprint arXiv:2410.24164, 2024.

[26] P. Intelligence, K. Black, N. Brown, J. Darpinian, K. Dhabalia, D. Driess, A. Esmail, M. Equi, C. Finn, N. Fusai et al., “pi0.5: a vision-language-action model with open-world generalization,” arXiv preprint arXiv:2504.16054, 2025.

[27] Y. Guo, L. X. Shi, J. Chen, and C. Finn, “Ctrl-world: A controllable generative world model for robot manipulation,” arXiv preprint arXiv:2510.10125, 2025.

[28] G. Team, B. Wang, B. Li, C. Ni, G. Huang, G. Zhao, H. Li, J. Li, J. Lv, J. Liu et al., “Gigabrain-0.5 m*: a vla that learns from world model-based reinforcement learning,” arXiv preprint arXiv:2602.12099, 2026.

[29] Z. Fangqi, Y. Zhengyang, H. Zicong, S. Quanxin, M. Xiao, and G. Song, “Wmpo: World model-based policy optimization for vision-language-action models,” arXiv preprint arXiv:2511.09515, 2025. [Online]. Available: https://arxiv.org/abs/2511.09515

[30] X. Liu, Z. Bai, H. Ci, K. Y. Ma, and M. Z. Shou, “World-vla-loop: Closed-loop learning of video world model and vla policy,” arXiv preprint arXiv:2602.06508, 2026.
[31] A. D. Ames, S. Coogan, M. Egerstedt, G. Notomista, K. Sreenath, and P. Tabuada, “Control barrier functions: Theory and applications,” in 2019 18th European control conference (ECC). IEEE, 2019, pp. 3420–3431.

[32] Y. Luo and T. Ma, “Learning barrier certificates: Towards safe reinforcement learning with zero training-time violations,” Advances in Neural Information Processing Systems, vol. 34, pp. 25621–25632, 2021.

[33] G. Dalal, K. Dvijotham, M. Vecerik, T. Hester, C. Paduraru, and Y. Tassa, “Safe exploration in continuous action spaces,” arXiv preprint arXiv:1801.08757, 2018.

[34] Z. Deng, Y. Guo, C. Han, W. Ma, J. Xiong, S. Wen, and Y. Xiang, “Ai agents under threat: A survey of key security challenges and future pathways,” ACM Computing Surveys, vol. 57, no. 7, pp. 1–36, 2025.

[35] H. Bharadhwaj, A. Kumar, N. Rhinehart, S. Levine, F. Shkurti, and A. Garg, “Conservative safety critics for exploration,” arXiv preprint arXiv:2010.14497, 2020.

[36] B. Chen, D. Martí Monsó, Y. Du, M. Simchowitz, R. Tedrake, and V. Sitzmann, “Diffusion forcing: Next-token prediction meets full-sequence diffusion,” Advances in Neural Information Processing Systems, vol. 37, pp. 24081–24125, 2024.

[37] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, “Proximal policy optimization algorithms,” arXiv preprint arXiv:1707.06347, 2017.

[38] C. Team, “Chameleon: Mixed-modal early-fusion foundation models,” arXiv preprint arXiv:2405.09818, 2024.

[39] C. Yu, Y. Wang, Z. Guo, H. Lin, S. Xu, H. Zang, Q. Zhang, Y. Wu, C. Zhu, J. Hu, Z. Huang, M. Wei, Y. Xie, K. Yang, B. Dai, Z. Xu, J. Du, X. Wang, X. Fu, L. Shi, Z. Liu, K. Chen, W. Liu, G. Liu, B. Li, J. Yang, Z. Yang, G. Dai, and Y. Wang, “Rlinf: Flexible and efficient large-scale reinforcement learning via macro-to-micro flow transformation,” 2025. ...


work page arXiv 2026

[10] [10]

Multi-agent embodied ai: Advances and future directions,

Z. Feng, R. Xue, L. Yuan, Y . Yu, N. Ding, M. Liu, B. Gao, J. Sun, X. Zheng, and G. Wang, “Multi-agent embodied ai: Advances and future directions,” 2025. [Online]. Available: https://arxiv.org/abs/2505.05108

work page arXiv 2025

[11] [11]

WorldVLA: Towards Autoregressive Action World Model

J. Cen, C. Yu, H. Yuan, Y . Jiang, S. Huang, J. Guo, X. Li, Y . Song, H. Luo, F. Wanget al., “Worldvla: Towards autoregressive action world model,”arXiv preprint arXiv:2506.21539, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[12] [12]

World4rl: Diffusion world models for policy refinement with reinforcement learning for robotic manipulation,

Z. Jiang, K. Liu, Y . Qin, S. Tian, Y . Zheng, M. Zhou, C. Yu, H. Li, and D. Zhao, “World4rl: Diffusion world models for policy refinement with reinforcement learning for robotic manipulation,” 2026. [Online]. Available: https://arxiv.org/abs/2509.19080

work page arXiv 2026

[13] [13]

Sword: Style-Robust World Models as Simulators via Dynamic Latent Bootstrapping for VLA Policy Post-Training

J. Gao, Y . Guo, Z. Guan, W. Huang, W. Ma, X. Xiao, J. Xiong, and S. Wen, “Sword: Style-robust world models as simulators via dynamic latent bootstrapping for vla policy post-training,”arXiv preprint arXiv:2605.07288, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[14] [14]

RL-VLA$^3$: A Flexible and Asynchronous Reinforcement Learning Framework for VLA Training

Z. Guan, H. Sun, Y . Guo, S. Di, X. Bai, J. Long, T. Zhao, M. Luo, C. Zhou, Y . Guoet al., “Rl-vla3: Reinforcement learning vla accelerating via full asynchronism,”arXiv preprint arXiv:2602.05765, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[15] [15]

Runtime verification and field-based testing for ros-based robotic systems,

R. Caldas, J. A. Pi ˜nera Garc´ıa, M. Schiopu, P. Pelliccione, G. Rodrigues, and T. Berger, “Runtime verification and field-based testing for ros-based robotic systems,”IEEE Transactions on Software Engineering, vol. 50, no. 10, pp. 2544–2567, 2024

work page 2024

[16] [16]

Search, verify and feedback: Towards next generation post- training paradigm of foundation models via verifier engineering,

X. Guan, Y . Liu, X. Lu, B. Cao, B. He, X. Han, L. Sun, J. Lou, B. Yu, Y . Luet al., “Search, verify and feedback: Towards next generation post- training paradigm of foundation models via verifier engineering,”arXiv preprint arXiv:2411.11504, 2024

work page arXiv 2024

[17] [17]

Digital twin enabled runtime verification for autonomous mobile robots under un- certainty,

J. S. Betzer, J. Boudjadar, M. Frasheri, and P. Talasila, “Digital twin enabled runtime verification for autonomous mobile robots under un- certainty,”arXiv preprint arXiv:2412.09913, 2024

work page arXiv 2024

[18] [18]

Robosafe: Safeguarding embodied agents via executable safety logic,

L. Wang, Z. Ying, X. Yang, Q. Zou, Z. Yin, T. Li, J. Yang, Y . Yang, A. Liu, and X. Liu, “Robosafe: Safeguarding embodied agents via executable safety logic,”arXiv preprint arXiv:2512.21220, 2025

work page arXiv 2025

[19] [19]

Wod-e2e: Waymo open dataset for end-to-end driving in challenging long-tail scenarios.arXiv preprint arXiv:2510.26125,

R. Xu, H. Lin, W. Jeon, H. Feng, Y . Zou, L. Sun, J. Gorman, E. Tolstaya, S. Tang, B. Whiteet al., “Wod-e2e: Waymo open dataset for end-to-end driving in challenging long-tail scenarios,”arXiv preprint arXiv:2510.26125, 2025

work page arXiv 2025

[20] [20]

Deep learning traversability estimator for mobile robots in unstructured environments,

M. Visca, S. Kuutti, R. Powell, Y . Gao, and S. Fallah, “Deep learning traversability estimator for mobile robots in unstructured environments,” inAnnual Conference Towards Autonomous Robotic Systems. Springer, 2021, pp. 203–213

work page 2021

[21] [21]

A survey on class imbalance learning algorithms in complex scenarios,

L. Zhao, F. Han, Q. Ling, H. Han, Z. Yao, W. Liu, and Z. Zhou, “A survey on class imbalance learning algorithms in complex scenarios,” IEEE Access, 2025

work page 2025

[22] [22]

Focal loss for dense object detection,

T.-Y . Lin, P. Goyal, R. Girshick, K. He, and P. Doll ´ar, “Focal loss for dense object detection,” inProceedings of the IEEE international conference on computer vision, 2017, pp. 2980–2988

work page 2017

[23] [23]

Libero: Benchmarking knowledge transfer for lifelong robot learning,

B. Liu, Y . Zhu, C. Gao, Y . Feng, Q. Liu, Y . Zhu, and P. Stone, “Libero: Benchmarking knowledge transfer for lifelong robot learning,”Advances in Neural Information Processing Systems, vol. 36, pp. 44 776–44 791, 2023

work page 2023

[24] [24]

Rynnvla-002: A unified vision-language-action and world model,

J. Cen, S. Huang, Y . Yuan, K. Li, H. Yuan, C. Yu, Y . Jiang, J. Guo, X. Li, H. Luoet al., “Rynnvla-002: A unified vision-language-action and world model,”arXiv preprint arXiv:2511.17502, 2025

work page arXiv 2025

[25] [25]

$\pi_0$: A Vision-Language-Action Flow Model for General Robot Control

K. Black, N. Brown, D. Driess, A. Esmail, M. Equi, C. Finn, N. Fusai, L. Groom, K. Hausman, B. Ichteret al., “pi0: A vision-language-action flow model for general robot control,”arXiv preprint arXiv:2410.24164, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[26] [26]

$\pi_{0.5}$: a Vision-Language-Action Model with Open-World Generalization

P. Intelligence, K. Black, N. Brown, J. Darpinian, K. Dhabalia, D. Driess, A. Esmail, M. Equi, C. Finn, N. Fusaiet al., “pi0.5: a vision- language-action model with open-world generalization,”arXiv preprint arXiv:2504.16054, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[27] [27]

Ctrl-World: A Controllable Generative World Model for Robot Manipulation

Y . Guo, L. X. Shi, J. Chen, and C. Finn, “Ctrl-world: A control- lable generative world model for robot manipulation,”arXiv preprint arXiv:2510.10125, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[28] [28]

Gigabrain-0.5 m*: a vla that learns from world model-based reinforcement learning,

G. Team, B. Wang, B. Li, C. Ni, G. Huang, G. Zhao, H. Li, J. Li, J. Lv, J. Liuet al., “Gigabrain-0.5 m*: a vla that learns from world model-based reinforcement learning,”arXiv preprint arXiv:2602.12099, 2026

work page arXiv 2026

[29] [29]

Wmpo: World model-based policy optimization for vision-language-action models.arXiv preprint arXiv:2511.09515, 2025

Z. Fangqi, Y . Zhengyang, H. Zicong, S. Quanxin, M. Xiao, and G. Song, “Wmpo: World model-based policy optimization for vision- language-action models,”arXiv preprint arXiv:2511.09515, 2025. [Online]. Available: https://arxiv.org/abs/2511.09515

work page arXiv 2025

[30] [30]

World-vla-loop: Closed-loop learning of video world model and vla policy,

X. Liu, Z. Bai, H. Ci, K. Y . Ma, and M. Z. Shou, “World-vla-loop: Closed-loop learning of video world model and vla policy,”arXiv preprint arXiv:2602.06508, 2026

work page arXiv 2026

[31] [31]

Control barrier functions: Theory and applications,

A. D. Ames, S. Coogan, M. Egerstedt, G. Notomista, K. Sreenath, and P. Tabuada, “Control barrier functions: Theory and applications,” in2019 18th European control conference (ECC). Ieee, 2019, pp. 3420–3431

work page 2019

[32] [32]

Learning barrier certificates: Towards safe reinforce- ment learning with zero training-time violations,

Y . Luo and T. Ma, “Learning barrier certificates: Towards safe reinforce- ment learning with zero training-time violations,”Advances in Neural Information Processing Systems, vol. 34, pp. 25 621–25 632, 2021

work page 2021

[33] [33]

Safe Exploration in Continuous Action Spaces

G. Dalal, K. Dvijotham, M. Vecerik, T. Hester, C. Paduraru, and Y . Tassa, “Safe exploration in continuous action spaces,”arXiv preprint arXiv:1801.08757, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018

[34] [34]

Ai agents under threat: A survey of key security challenges and future pathways,

Z. Deng, Y . Guo, C. Han, W. Ma, J. Xiong, S. Wen, and Y . Xiang, “Ai agents under threat: A survey of key security challenges and future pathways,”ACM Computing Surveys, vol. 57, no. 7, pp. 1–36, 2025

work page 2025

[35] [35]

Conservative safety critics for exploration,

H. Bharadhwaj, A. Kumar, N. Rhinehart, S. Levine, F. Shkurti, and A. Garg, “Conservative safety critics for exploration,”arXiv preprint arXiv:2010.14497, 2020

work page arXiv 2010

[36] [36]

Diffusion forcing: Next-token prediction meets full- sequence diffusion,

B. Chen, D. Mart ´ı Mons´o, Y . Du, M. Simchowitz, R. Tedrake, and V . Sitzmann, “Diffusion forcing: Next-token prediction meets full- sequence diffusion,”Advances in Neural Information Processing Sys- tems, vol. 37, pp. 24 081–24 125, 2024

work page 2024

[37] [37]

Proximal Policy Optimization Algorithms

J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, “Prox- imal policy optimization algorithms,”arXiv preprint arXiv:1707.06347, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017

[38] [38]

Chameleon: Mixed-Modal Early-Fusion Foundation Models

C. Team, “Chameleon: Mixed-modal early-fusion foundation models,” arXiv preprint arXiv:2405.09818, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[39] [39]

Rlinf: Flexible and efficient large-scale reinforcement learning via macro-to-micro flow transformation,

C. Yu, Y . Wang, Z. Guo, H. Lin, S. Xu, H. Zang, Q. Zhang, Y . Wu, C. Zhu, J. Hu, Z. Huang, M. Wei, Y . Xie, K. Yang, B. Dai, Z. Xu, J. Du, X. Wang, X. Fu, L. Shi, Z. Liu, K. Chen, W. Liu, G. Liu, B. Li, J. Yang, Z. Yang, G. Dai, and Y . Wang, “Rlinf: Flexible and efficient large-scale reinforcement learning via macro-to-micro flow transformation,” 2025. ...

work page arXiv 2025