
arxiv: 2605.22446 · v1 · pith:QONCUQTKnew · submitted 2026-05-21 · 💻 cs.CV · cs.AI · cs.RO

Pre-VLA: Preemptive Runtime Verification for Reliable Vision-Language-Action and World-Model Rollouts

Pith reviewed 2026-05-22 06:57 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI · cs.RO
keywords runtime verification · vision-language-action models · world models · action safety · preemptive filtering · LIBERO benchmark · embodied AI · resampling scheduler

The pith

Pre-VLA adds preemptive checks that filter low-quality action chunks, raising average closed-loop success on LIBERO from 30.79% to 37.62%.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper introduces Pre-VLA, a system that verifies action chunks from vision-language-action models before they are carried out or fed into world models. It uses a multimodal backbone and dual-branch head to score safety and advantage, then resamples poor candidates within a time budget. The approach tackles uncertainty in learned policies that otherwise causes robot failures or inefficient simulations. Readers should care because it offers a practical way to make large embodied models more dependable without retraining the whole system. Experiments indicate this raises average success on standard benchmarks while keeping verification fast.

Core claim

Pre-VLA is a unified runtime verification architecture that performs preemptive action validity assessment using an efficient multimodal backbone with modality-aware pooling and a lightweight dual-branch head to predict safety confidence and critic-derived advantage scores. It is trained with a multi-task objective that combines Focal classification, advantage regression, and soft-threshold calibration. At deployment, a dual-mode preemptive resampling scheduler filters low-quality actions and triggers adaptive resampling under limited computation budget, leading to higher closed-loop success and less error buildup in rollouts.
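The filter-then-resample loop described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's scheduler: `propose`, `verify`, `safety_threshold`, `max_attempts`, and `budget_s` are hypothetical names, the two deployment modes are not modeled, and the fallback rule (return the rejected candidate with the highest advantage) is an assumption made only so the function always returns something.

```python
import time
from typing import Callable, List, Optional, Tuple

Chunk = List[float]  # hypothetical stand-in for an action chunk


def preemptive_resample(
    propose: Callable[[], Chunk],
    verify: Callable[[Chunk], Tuple[float, float]],
    safety_threshold: float = 0.5,
    max_attempts: int = 4,
    budget_s: float = 1.0,
) -> Tuple[Optional[Chunk], bool, int]:
    """Return (chunk, accepted, attempts).

    propose() samples a candidate chunk from the policy.
    verify(chunk) returns (safety_confidence, advantage_score).
    """
    start = time.monotonic()
    best_chunk: Optional[Chunk] = None
    best_advantage = float("-inf")
    attempts = 0
    # Stop when either the attempt limit or the wall-clock budget is exhausted.
    while attempts < max_attempts and time.monotonic() - start < budget_s:
        chunk = propose()
        safety, advantage = verify(chunk)
        attempts += 1
        if safety >= safety_threshold:
            # Accept the first candidate the verifier considers safe enough.
            return chunk, True, attempts
        if advantage > best_advantage:
            # Assumed fallback: remember the most promising rejected candidate.
            best_chunk, best_advantage = chunk, advantage
    return best_chunk, False, attempts
```

A caller would check the `accepted` flag before executing the returned chunk or passing it to a world model. With the reported 183.9 ms average verification time, a 1.0 s budget would leave room for roughly five verification passes, before counting the cost of sampling the candidates themselves.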

What carries the argument

Lightweight dual-branch head that outputs safety confidence and advantage scores for action chunks, paired with a dual-mode resampling scheduler.
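To make the shape of such a head concrete, here is a minimal pure-Python sketch. It assumes that modality-aware pooling means averaging the hidden states of each modality separately and concatenating the results, and that each branch is a single linear layer; neither detail is confirmed by this page, and all function and parameter names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Vector = List[float]


def mean_pool(tokens: Sequence[Vector]) -> Vector:
    """Average a non-empty group of token vectors, dimension by dimension."""
    if not tokens:
        raise ValueError("cannot pool an empty modality group")
    return [sum(column) / len(tokens) for column in zip(*tokens)]


def modality_aware_pool(
    hidden: Sequence[Vector], modality: Sequence[str], order: Sequence[str]
) -> Vector:
    """Pool each modality's tokens separately, then concatenate (assumed scheme)."""
    pooled: Vector = []
    for name in order:
        group = [h for h, m in zip(hidden, modality) if m == name]
        pooled.extend(mean_pool(group))
    return pooled


def linear(x: Vector, w: Vector, b: float) -> float:
    """Dot product plus bias."""
    return sum(xi * wi for xi, wi in zip(x, w)) + b


def dual_branch_head(
    features: Vector, w_safety: Vector, b_safety: float, w_adv: Vector, b_adv: float
) -> Tuple[float, float]:
    """Return (safety confidence in (0, 1), unbounded advantage score)."""
    safety = 1.0 / (1.0 + math.exp(-linear(features, w_safety, b_safety)))
    advantage = linear(features, w_adv, b_adv)
    return safety, advantage
```

Both branches read the same pooled features, so the extra cost over the frozen backbone is two small projections per action chunk.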

If this is right

  • Increases average closed-loop success rate across four LIBERO suites from 30.79% to 37.62% over the RynnVLA-002 baseline.
  • Decreases the number of steps required to complete tasks.
  • Keeps average verification time at 183.9 milliseconds per action chunk.
  • Reduces error accumulation when generating world-model rollouts.
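As a quick check on the first bullet, the absolute and relative gains implied by the two reported averages work out as follows. The relative figure is derived here from those numbers and is not stated in the paper.

```latex
\Delta_{\text{abs}} = 37.62\% - 30.79\% = 6.83\ \text{percentage points},
\qquad
\Delta_{\text{rel}} = \frac{6.83}{30.79} \approx 22.2\%
```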

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This verification approach could help stabilize longer planning horizons by catching mistakes before they compound.
  • It might generalize to other robot learning setups where action uncertainty is a problem.
  • Real-world tests could check if the added latency still allows responsive control in dynamic environments.

Load-bearing premise

The dual-branch head produces safety and advantage predictions that work well on unseen actions without causing too many unnecessary resamples or stalls.

What would settle it

If adding Pre-VLA to a VLA model on new tasks fails to improve success rates or causes frequent execution halts due to false alarms, the method's reliability would be questioned.

Figures

Figures reproduced from arXiv: 2605.22446 by Haoran Sun, Jiachi Ji, Junwu Xiong, Luqiao Wang, Shengzhe Ji, Wei Lu, Yongjian Guo, Zhen Sun, Zhijun Meng.

Figure 1: Overview of the dual-mode runtime verification
Figure 2: Overview of the ARGUS runtime safety verification framework. The VLA generates candidate action chunks from …
Figure 3: The core idea is to reuse the multimodal perceptual …
Figure 3: Overall architecture of Pre-VLA with dual-modal data used during training. … into the backbone for encoding. During the training of Pre-VLA, all parameters of the backbone are frozen to preserve its original generative capability. We then extract the final-layer hidden states H_t of the backbone as high-dimensional feature representations for subsequent verification. 2) Modality-Aware Feature Pooling: Since t…
Figure 4: Closed-loop execution comparison with and without …
Figure 5: Closed-loop performance comparison across four LIBERO suites.
Figure 6: World Model rollout comparison with and without …
Original abstract

While large vision-language-action (VLA) models and generative world models (WM) have advanced long-horizon embodied intelligence, their practical deployment remains challenged by uncertainty in learning-based action generation. Low-quality actions may cause physical failures during execution or lead to misleading world-model rollouts with redundant rendering costs. To address this issue, we propose Pre-VLA, a unified runtime verification architecture that performs preemptive action validity assessment before physical execution or world-model imagination. Pre-VLA leverages an efficient multimodal backbone with modality-aware pooling and a lightweight dual-branch head to predict both safety confidence and critic-derived advantage scores for candidate action chunks. To handle severe class imbalance and unstable boundary decisions, we train Pre-VLA with a multi-task objective combining Focal classification, advantage regression, and soft-threshold calibration. During deployment, a dual-mode preemptive resampling scheduler filters low-quality actions and triggers adaptive resampling under a limited computation budget. Experiments on the LIBERO benchmark show that Pre-VLA improves the average closed-loop success rate across four suites from 30.79% to 37.62% over RynnVLA-002, reduces task execution steps, achieves 183.9 ms average forward verification time per action chunk, and mitigates error accumulation in world-model rollouts.
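The abstract names three loss terms, but this page does not reproduce their exact form. The sketch below shows one way such a per-sample objective could be assembled. The focal term follows the standard binary focal loss; the squared-error regression, the sigmoid-based soft-threshold penalty, and every name and default value (`tau`, `temperature`, `w_adv`, `w_cal`) are assumptions for illustration, not the authors' formulation.

```python
import math


def _clip(p: float, eps: float = 1e-7) -> float:
    """Keep probabilities away from 0 and 1 so log() stays finite."""
    return min(max(p, eps), 1.0 - eps)


def focal_loss(p: float, y: int, gamma: float = 2.0, alpha: float = 0.25) -> float:
    """Standard binary focal loss: -alpha_t * (1 - p_t)^gamma * log(p_t)."""
    p = _clip(p)
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


def soft_threshold_penalty(p: float, y: int, tau: float, temperature: float = 0.1) -> float:
    """Assumed calibration term: cross-entropy on a smoothed 'p >= tau' decision."""
    soft = _clip(1.0 / (1.0 + math.exp(-(p - tau) / temperature)))
    return -(y * math.log(soft) + (1 - y) * math.log(1.0 - soft))


def multitask_loss(
    p: float,
    adv_pred: float,
    y: int,
    adv_target: float,
    tau: float = 0.5,
    w_adv: float = 1.0,
    w_cal: float = 0.1,
) -> float:
    """Weighted sum of classification, advantage regression, and calibration terms."""
    regression = (adv_pred - adv_target) ** 2  # assumed squared error
    return (
        focal_loss(p, y)
        + w_adv * regression
        + w_cal * soft_threshold_penalty(p, y, tau)
    )
```

The focal term down-weights easy examples through the `(1 - p_t)^gamma` factor, which is the usual motivation for using it under class imbalance.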

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript proposes Pre-VLA, a unified runtime verification architecture for vision-language-action (VLA) models and generative world models. It features an efficient multimodal backbone with modality-aware pooling and a lightweight dual-branch head that predicts safety confidence and critic-derived advantage scores for candidate action chunks. The model is trained using a multi-task objective combining Focal classification, advantage regression, and soft-threshold calibration. A dual-mode preemptive resampling scheduler filters low-quality actions under a limited computation budget. On the LIBERO benchmark, Pre-VLA improves the average closed-loop success rate across four suites from 30.79% to 37.62% compared to RynnVLA-002, reduces task execution steps, achieves 183.9 ms average forward verification time per action chunk, and mitigates error accumulation in world-model rollouts.

Significance. If the performance improvements are robustly attributable to the preemptive verification mechanism, this work could advance the reliability of embodied AI systems by addressing uncertainty in action generation and preventing misleading world-model rollouts. The approach offers a practical solution for runtime safety in long-horizon tasks, potentially reducing physical failures and computational waste. The reported verification time suggests feasibility for real-time deployment.

major comments (3)
  1. [Abstract] Abstract: The reported improvement in closed-loop success rate from 30.79% to 37.62% provides no error bars, no statistical significance tests, and no ablation isolating the dual-branch head from the resampling scheduler. This directly undermines attribution of the gains to reliable safety confidence and advantage predictions on out-of-distribution chunks.
  2. [Training objective] Training description: No details are given on how critic advantage labels were obtained for the regression branch. This is load-bearing for the central claim, as label quality determines whether the dual-branch head can produce generalizable scores without excessive false negatives that stall execution.
  3. [Experiments] Evaluation: No predictor-level metrics (AUC, ECE, false-negative rate on held-out chunks) are supplied for the lightweight dual-branch head. Without these, the assumption that the head generalizes to unseen action chunks under the four LIBERO suites cannot be verified and remains the weakest link in supporting the 6.83 percentage-point gain.
minor comments (2)
  1. [Abstract] The abstract refers to 'four suites' of LIBERO without naming them; explicit identification would aid reproducibility.
  2. [Method] The soft-threshold calibration parameter is mentioned as a free parameter but its precise integration into the multi-task loss is not illustrated, which could be clarified with a short equation or pseudocode.
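The predictor-level metrics requested in major comment 3 are straightforward to compute once held-out chunks carry ground-truth labels. The sketch below is a plain-Python illustration, assuming label 1 means a valid chunk, so a false negative is a valid chunk the verifier would reject. The ECE variant bins by predicted positive probability, and all names are hypothetical.

```python
from typing import List, Sequence, Tuple


def auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Probability that a random positive outscores a random negative (ties count half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def expected_calibration_error(
    scores: Sequence[float], labels: Sequence[int], n_bins: int = 10
) -> float:
    """Size-weighted gap between mean predicted probability and observed positive rate."""
    bins: List[List[Tuple[float, int]]] = [[] for _ in range(n_bins)]
    for s, y in zip(scores, labels):
        index = min(int(s * n_bins), n_bins - 1)  # s == 1.0 falls in the last bin
        bins[index].append((s, y))
    total = len(scores)
    error = 0.0
    for members in bins:
        if members:
            confidence = sum(s for s, _ in members) / len(members)
            positive_rate = sum(y for _, y in members) / len(members)
            error += len(members) / total * abs(positive_rate - confidence)
    return error


def false_negative_rate(
    scores: Sequence[float], labels: Sequence[int], threshold: float = 0.5
) -> float:
    """Fraction of valid chunks (label 1) scored below the acceptance threshold."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    if not pos:
        raise ValueError("false-negative rate needs at least one positive")
    return sum(1 for s in pos if s < threshold) / len(pos)
```

Reporting the false-negative rate at the deployed threshold matters most here, since each false negative triggers a resample that consumes part of the computation budget.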

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. We have reviewed each major comment carefully and provide point-by-point responses below, along with commitments to revisions that will strengthen the presentation of our results and methods.

Point-by-point responses
  1. Referee: [Abstract] Abstract: The reported improvement in closed-loop success rate from 30.79% to 37.62% provides no error bars, no statistical significance tests, and no ablation isolating the dual-branch head from the resampling scheduler. This directly undermines attribution of the gains to reliable safety confidence and advantage predictions on out-of-distribution chunks.

    Authors: We agree that the current reporting in the abstract lacks error bars, statistical tests, and a dedicated ablation to isolate the dual-branch head from the resampling scheduler. In the revised manuscript we will add error bars computed over multiple random seeds, report the results of statistical significance tests, and include an ablation study that separates the contributions of the dual-branch head and the preemptive resampling scheduler. These additions will better support attribution of the observed gains to the safety confidence and advantage predictions. revision: yes

  2. Referee: [Training objective] Training description: No details are given on how critic advantage labels were obtained for the regression branch. This is load-bearing for the central claim, as label quality determines whether the dual-branch head can produce generalizable scores without excessive false negatives that stall execution.

    Authors: We acknowledge that the manuscript does not currently provide sufficient detail on the generation of critic advantage labels for the regression branch. We will expand the training objective section in the revision to fully describe the label acquisition process, including the critic model employed, the computation of advantage scores, and any preprocessing steps used to mitigate label noise or imbalance. revision: yes

  3. Referee: [Experiments] Evaluation: No predictor-level metrics (AUC, ECE, false-negative rate on held-out chunks) are supplied for the lightweight dual-branch head. Without these, the assumption that the head generalizes to unseen action chunks under the four LIBERO suites cannot be verified and remains the weakest link in supporting the 6.83 percentage-point gain.

    Authors: We concur that predictor-level metrics are necessary to substantiate the generalization of the dual-branch head. In the revised experiments section we will report AUC, expected calibration error (ECE), and false-negative rates evaluated on held-out action chunks drawn from the LIBERO suites. These metrics will directly address the verification of the head's performance on out-of-distribution chunks. revision: yes
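The error bars and significance tests promised in the first response could take several forms. One simple option, sketched below, is a mean and sample standard deviation over per-seed success rates plus a two-sided two-proportion z-test on pooled episode counts. The choice of test is an assumption, the function names are hypothetical, and no episode counts are given on this page, so none are filled in here.

```python
import math
from statistics import mean, stdev
from typing import Sequence, Tuple


def mean_and_std(rates: Sequence[float]) -> Tuple[float, float]:
    """Mean and sample standard deviation of per-seed success rates (needs >= 2 seeds)."""
    return mean(rates), stdev(rates)


def two_proportion_z_test(
    successes_a: int, n_a: int, successes_b: int, n_b: int
) -> Tuple[float, float]:
    """Two-sided z-test for a difference in success rates; returns (z, p_value)."""
    p_a = successes_a / n_a
    p_b = successes_b / n_b
    pooled = (successes_a + successes_b) / (n_a + n_b)
    standard_error = math.sqrt(pooled * (1.0 - pooled) * (1.0 / n_a + 1.0 / n_b))
    if standard_error == 0.0:
        raise ValueError("pooled success rate is 0 or 1; the test is undefined")
    z = (p_b - p_a) / standard_error
    # Two-sided p-value under the standard normal: 2 * (1 - Phi(|z|)).
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_value
```

The z-test treats episodes as independent trials; if episodes within a task or seed are correlated, a paired or bootstrap analysis over tasks would be the more cautious choice.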

Circularity Check

0 steps flagged

No significant circularity; empirical gains measured on external benchmark

Full rationale

The paper's central claims consist of measured closed-loop success rates on the external LIBERO benchmark (improving from 30.79% to 37.62% over the named baseline RynnVLA-002) together with runtime metrics such as 183.9 ms verification time. These quantities are obtained by direct evaluation on held-out suites rather than by any internal equation that reduces the reported success rate to a fitted parameter or self-referential definition. The training procedure (Focal loss + advantage regression + soft-threshold calibration on a dual-branch head) is described as a standard multi-task objective; no derivation step equates the final performance numbers to the training inputs by construction. No self-citation chain, uniqueness theorem, or ansatz smuggling is invoked to justify the architecture. The derivation chain therefore remains self-contained against an independent external benchmark.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on standard supervised-learning assumptions plus the untested premise that the critic-derived advantage labels are sufficiently accurate to guide resampling.

free parameters (1)
  • soft-threshold calibration parameter
    Introduced to stabilize boundary decisions under class imbalance; its value is chosen during training.
axioms (1)
  • domain assumption: The multimodal backbone extracts features that are linearly separable enough for the dual-branch head to produce useful safety and advantage predictions.
    Invoked when the paper states that the backbone plus lightweight head suffices for preemptive assessment.

pith-pipeline@v0.9.0 · 5790 in / 1267 out tokens · 35227 ms · 2026-05-22T06:57:57.576043+00:00 · methodology


