pith. sign in

arxiv: 2606.11266 · v1 · pith:I5VMF5U6new · submitted 2026-06-09 · 💻 cs.LG

Seeing Before Colliding: Anticipatory Safe RL with Frozen Vision-Language Models

Pith reviewed 2026-06-27 14:04 UTC · model grok-4.3

classification 💻 cs.LG
keywords constrained reinforcement learningvision-language modelsanticipatory safetyLagrangian methodsSafety-GymnasiumMetaDrivecost signals
0
0 comments X

The pith

A frozen vision-language model supplies anticipatory cost signals that let constrained RL avoid collisions before they start in high-speed tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Standard constrained RL receives cost signals only after collisions begin, which arrives too late when motion is fast and irreversible. The paper shows that a frozen VLM can be inserted directly into the Lagrangian multiplier update to generate per-step anticipatory costs. Four components achieve this: independent reward and cost paths through CLIP, an augmented VLM-Lagrange update, a Bayes-optimal confidence gate on the CLIP margin, and the resulting VLMPPOLag algorithm. On the principal Safety-Gymnasium FormulaOne L2 benchmark the combined method is the only configuration that keeps both substantive return and cost within budget across most seeds, while five constraint-aware baselines each fail at least one of those criteria. The same mechanism lowers catastrophe rates on a held-out MetaDrive task and shows consistent directional transfer elsewhere.

Core claim

The paper establishes that a frozen VLM can be integrated as an anticipatory cost term inside the CMDP Lagrangian update through the VLM-Lagrange mechanism together with confidence gating, producing the VLMPPOLag+Conf algorithm that simultaneously retains substantive return (Jr≈40) and holds cost within budget on a majority of seeds in Safety-Gymnasium FormulaOne L2, where five constraint-aware baselines each fail at least one requirement.

What carries the argument

VLM-Lagrange, the augmented multiplier update that adds a per-step VLM cost as an anticipatory term, together with Decoupled Dual-Path CLIP and Confidence Gating derived from a logistic noise model on the CLIP margin.

If this is right

  • Only the VLM-augmented method satisfies both return and cost criteria simultaneously on the main racing benchmark.
  • The approach reduces catastrophe rate from 41% to 26% on held-out MetaDrive Medium.
  • Directionally consistent safety gains appear when transferred to Bullet Safety-Gym.
  • Observed failures on certain MetaDrive difficulties trace to Lagrangian regulation pathology rather than the VLM cost signal itself.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Anticipatory VLM signals could be paired with other sensor modalities to improve cost prediction without retraining the language model.
  • The confidence-gating construction offers a template for incorporating any noisy external predictor into constrained optimization loops.
  • Because the VLM remains frozen, the method may transfer to physical robots whose visual input differs from simulation without additional VLM fine-tuning.

Load-bearing premise

The frozen VLM produces per-step cost signals that are sufficiently accurate and low-noise to serve as a useful anticipatory term inside the Lagrangian multiplier update without requiring fine-tuning or introducing systematic bias that the confidence gate cannot correct.

What would settle it

Re-running VLMPPOLag+Conf on Safety-Gymnasium FormulaOne L2 under the stated protocol and observing that it fails to retain Jr≈40 while holding cost within budget on a majority of the five seeds would falsify the central performance claim.

Figures

Figures reproduced from arXiv: 2606.11266 by Cody Fleming, Samuel Tetteh.

Figure 1
Figure 1. Figure 1: Anticipatory safety from a frozen VLM. (a) Per-step CLIP danger signal cvlm (green) rises several timesteps before the environment cost (red) on a single FormulaOne L2 rollout. The Lagrange multiplier λ is updated between epochs using the epoch mean cvlm via Eq. (3); the per-step trace illustrates the signal that this mean accumulates. (b) Held-out MetaDrive Medium: the mechanism cuts catastrophe rate from… view at source ↗
Figure 2
Figure 2. Figure 2: Lagrange multiplier dynamics. Multiplier λ tracking on FormulaOne-L2. 0 10 20 30 40 50 Epoch 0.00 0.25 0.50 0.75 1.00 1.25 L a g r a n g e m ultiplie r FormulaOne-L2 PPOLag-Decoupled VLMPPOLag VLMPPOLag+Conf Interpretation of [PITH_FULL_IMAGE:figures/full_fig_p006_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Training curves on FormulaOne-L0/L1/L2: episode return JR (top row) and episode cost JC (bottom row); faded per-seed traces with bold seed-mean overlays. 0 20 40 60 E pis o d e R e t u r n J R FormulaOne-L0 0 20 40 60 FormulaOne-L1 0 20 40 60 FormulaOne-L2 0 10 20 30 40 50 Epoch 0 5 10 15 20 25 E pis o d e C o s t J C 0 10 20 30 40 50 Epoch 0 50 100 150 200 250 300 0 10 20 30 40 50 Epoch 0 50 100 150 200 2… view at source ↗
Figure 4
Figure 4. Figure 4: First-person camera frames from the three FormulaOne difficulty levels in Safety [PITH_FULL_IMAGE:figures/full_fig_p014_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: MetaDrive held-out catastrophe rates before (left, default [PITH_FULL_IMAGE:figures/full_fig_p014_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Prompt-template sensitivity (held-out catastrophe rate, v1 vs. v2 vs. v3) on Bullet and [PITH_FULL_IMAGE:figures/full_fig_p016_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Visual rendering of Table [PITH_FULL_IMAGE:figures/full_fig_p016_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Per-seed FormulaOne learning curves for key methods across L0/L1/L2 (3 seeds each). [PITH_FULL_IMAGE:figures/full_fig_p017_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: Phase B 5-seed per-seed learning curves for VLMPPOLag+Conf on Formu￾laOne L0/L1/L2 (top: JR; bottom: JC ). Bold blue: mean across the 5 Phase B seeds {42, 123, 456, 789, 1024} with a ±1 std band; thin blue: individual seeds. Dashed coral: ungated VLMPPOLag baseline mean (3 seeds, for reference). Red dashed line: cost budget d=25; soft red wash marks the infeasible region. The two post-submission seeds (789… view at source ↗
Figure 10
Figure 10. Figure 10: Aggregated FormulaOne learning curves (return top, cost bottom; L0/L1/L2 columns). [PITH_FULL_IMAGE:figures/full_fig_p018_10.png] view at source ↗
Figure 11
Figure 11. Figure 11: Pairwise Welch’s t-test on FormulaOne L2 for episodic return JR. Left: p-values (green = significant at α=0.05). Right: t-statistics (positive = row method > column method). D.3 One-sided permutation tests on FormulaOne L2 Welch’s t-test relies on a Gaussian-tail approximation that has low power at small seed counts and is sensitive to the variance estimate. We therefore additionally report exact one-side… view at source ↗
Figure 12
Figure 12. Figure 12: Pairwise Welch’s t-test on FormulaOne L2 for episodic cost JC . Negative t-statistics indicate the row method is safer than the column method [PITH_FULL_IMAGE:figures/full_fig_p019_12.png] view at source ↗
Figure 13
Figure 13. Figure 13: Cost–return Pareto frontiers across FormulaOne L0/L1/L2. The ideal region is the bottom [PITH_FULL_IMAGE:figures/full_fig_p020_13.png] view at source ↗
Figure 14
Figure 14. Figure 14: Where the VLM signal enters the safe-RL loop. Four injection points evaluated in Section D.7; in each panel the coloured arrow marks the injection edge and the box outlined in the same colour marks the target component of the safe-RL update. (1) Decoupled adds an auxiliary VLM advantage directly to the policy gradient. (2) Decoupled + Confidence (our default) is identical but multiplies the contribution b… view at source ↗
Figure 15
Figure 15. Figure 15: Held-out catastrophe rate by VLM injection mode on FormulaOne L1 (left) and L2 (right). [PITH_FULL_IMAGE:figures/full_fig_p023_15.png] view at source ↗
Figure 16
Figure 16. Figure 16: Sensitivity of VLMPPOLag+Conf on FormulaOne L2 to [PITH_FULL_IMAGE:figures/full_fig_p023_16.png] view at source ↗
Figure 17
Figure 17. Figure 17: Mechanism of the confidence gate (§3). Top: the gating function κ(m) = |2σ(s(m − c)) − 1| at the two operating points used in this paper: prior-symmetric (s, c)=(100, 0) used in the main table, and empirically calibrated (ˆs, cˆ)≈(320, 0.020) recovered from the F1-L2 random-policy buffer via Eq. (5). The two configurations place the steep part of σ at different locations on the CLIP margin axis. Bottom: a… view at source ↗
Figure 18
Figure 18. Figure 18: Empirical validation of the confidence gate over 2400 held-out frames (200 frames [PITH_FULL_IMAGE:figures/full_fig_p027_18.png] view at source ↗
Figure 19
Figure 19. Figure 19: shows that the calibrated κ achieves AUC 0.82 on L1 and 0.78 on L2. The prior-symmetric configuration performs near chance on both levels (AUC 0.13 on L1, 0.34 on L2) because the gate is saturated (κ≈1 uniformly), making all frames indistinguishable by threshold. This contrast is the clearest empirical case for calibration: the gate is a meaningful danger predictor at the calibrated setting and a degenera… view at source ↗
Figure 20
Figure 20. Figure 20: Bullet SafetyCarReach-v0 learning curves at 1M (top) and 2M (bottom) training steps. Each panel shows episode return (left) and episode cost (right). Shaded: ±1 std across 3 seeds. return collapse). The remaining three seeds converge near the empirical attractor λ≈0.5–0.7. With the multiplier learning rate η1=0.035 and a 50-epoch training budget (§4), λ has insufficient time to recover from an unlucky ear… view at source ↗
Figure 21
Figure 21. Figure 21: MetaDrive learning curves (Easy left, Medium center, Hard right). Easy: 3 seeds; Medium [PITH_FULL_IMAGE:figures/full_fig_p033_21.png] view at source ↗
Figure 22
Figure 22. Figure 22: MetaDrive held-out evaluation summary after the seed-leak fix. [PITH_FULL_IMAGE:figures/full_fig_p034_22.png] view at source ↗
Figure 23
Figure 23. Figure 23: Per-seed return distributions on MetaDrive (Easy / Medium / Hard) for PPOLag baseline [PITH_FULL_IMAGE:figures/full_fig_p035_23.png] view at source ↗
Figure 24
Figure 24. Figure 24: Unified held-out evaluation across all three generalisation environments (Bullet, MetaDrive [PITH_FULL_IMAGE:figures/full_fig_p035_24.png] view at source ↗
Figure 25
Figure 25. Figure 25: Four key moments from a VLMPPOLag+Conf evaluation episode on FormulaOne L2 [PITH_FULL_IMAGE:figures/full_fig_p036_25.png] view at source ↗
Figure 26
Figure 26. Figure 26: CLIP ViT-B/32 attention rollout on three real FormulaOne frames (one per difficulty level, [PITH_FULL_IMAGE:figures/full_fig_p037_26.png] view at source ↗
read the original abstract

The cost signal that constrained-RL algorithms optimize against is almost always reactive: the simulator emits a non-zero cost only after a collision has begun, and the Lagrange multiplier of PPO-Lagrangian grows only after the episode budget has been exceeded. At race speeds, where collisions are instantaneous and irreversible, any safety mechanism that waits for cost to accumulate is structurally too late. We present VLM-Safe-RL, a framework that integrates a frozen vision-language model into the CMDP Lagrangian update as an anticipatory cost term. The framework comprises four contributions: (i) Decoupled Dual-Path CLIP, independent reward/cost paths that respect the CMDP's factorization; (ii) VLM-Lagrange, an augmented multiplier update that incorporates a per-step VLM cost as an anticipatory term; (iii) Confidence Gating, a Bayes-optimal weight derived from a logistic noise model on the CLIP margin; and (iv) VLMPPOLag, the composed algorithm. On Safety-Gymnasium FormulaOne L2, our principal evaluation ($n{=}5$ seeds, $10^{6}$ steps, budget $d_{\text{lim}}{=}25$) VLMPPOLag$+$Conf is the only configuration in our default budget comparison that simultaneously retains substantive return ($J_r{\approx}40$) and holds cost within budget on a majority of seeds; the five constraint-aware baselines (PPOLag, CPO, CPPOPID, CPO-CLG, PPOLag-RND) each fail at least one requirement. The mechanism generalizes to held-out MetaDrive Medium (catastrophe rate $41\%{\to}26\%$, 95\% bootstrap CI $[-26,-5]$\,pp) and shows directionally consistent transfer to Bullet Safety-Gym; we report honestly where it does not (MetaDrive Easy/Hard, Qwen2-VL backbone) and trace the Hard failure to a Lagrangian-regulation pathology rather than the VLM signal itself. To our knowledge, this is the first work to use frozen VLM signals as an anticipatory cost term inside the CMDP Lagrangian update.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes VLM-Safe-RL, a framework that augments constrained RL (specifically PPO-Lagrangian) with a frozen vision-language model to provide anticipatory per-step cost signals inside the CMDP Lagrangian multiplier update. Key components are Decoupled Dual-Path CLIP (separate reward/cost paths), VLM-Lagrange (augmented multiplier update), a Bayes-optimal Confidence Gate derived from a logistic noise model on the CLIP margin, and the composed VLMPPOLag+Conf algorithm. On the principal benchmark (Safety-Gymnasium FormulaOne L2, n=5 seeds, 10^6 steps, d_lim=25), VLMPPOLag+Conf is reported as the only method that simultaneously achieves substantive return (Jr≈40) and stays within cost budget on a majority of seeds, while five baselines fail at least one criterion; directional transfer is shown on MetaDrive Medium and Bullet Safety-Gym with honest reporting of failures on other settings.

Significance. If the VLM-derived anticipatory costs prove reliable, the approach offers a practical route to proactive safety in high-speed continuous-control tasks where reactive simulator costs arrive too late. Strengths include the use of an external frozen VLM (no fine-tuning required), explicit factorization respecting CMDP structure, multiple baseline comparisons with 5 seeds, bootstrap CIs on transfer results, and transparent reporting of negative results (e.g., MetaDrive Easy/Hard, Qwen2-VL). These elements make the empirical claims more falsifiable than typical safe-RL papers that rely solely on end-to-end success.

major comments (2)
  1. [Evaluation / principal benchmark results] The headline claim on FormulaOne L2 (that VLMPPOLag+Conf is the sole method satisfying both return and cost-budget criteria) is load-bearing on the assumption that the frozen VLM produces per-step costs that are genuinely anticipatory and low-bias. No separate calibration or validation is provided that measures VLM margin against ground-truth future collision events (e.g., collision within the next 5–10 simulator steps) on the same trajectories, independent of the full RL training loop. Without this, end-to-end improvement cannot isolate the contribution of the anticipatory term from other algorithmic changes (Lagrange augmentation, gating).
  2. [Method / Confidence Gating] The logistic noise model used to derive the Bayes-optimal weight for the confidence gate (contribution iii) is presented as producing the gating weight, yet the manuscript does not report an ablation that removes the VLM cost entirely while retaining the gate structure, nor does it show that the gate actually corrects systematic bias in the VLM signal rather than acting as a generic regularizer.
minor comments (2)
  1. [Method] Notation for the VLM cost term and its integration into the multiplier update should be made fully explicit with an equation number in the main text rather than only in the appendix.
  2. [Evaluation] The bootstrap CI reported for MetaDrive Medium transfer is useful; the same style of uncertainty quantification should be added for the FormulaOne L2 seed-level results.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thoughtful review and for highlighting both the potential of the approach and areas for strengthening the empirical claims. We respond to each major comment below and commit to revisions that address the concerns raised.

read point-by-point responses
  1. Referee: [Evaluation / principal benchmark results] The headline claim on FormulaOne L2 (that VLMPPOLag+Conf is the sole method satisfying both return and cost-budget criteria) is load-bearing on the assumption that the frozen VLM produces per-step costs that are genuinely anticipatory and low-bias. No separate calibration or validation is provided that measures VLM margin against ground-truth future collision events (e.g., collision within the next 5–10 simulator steps) on the same trajectories, independent of the full RL training loop. Without this, end-to-end improvement cannot isolate the contribution of the anticipatory term from other algorithmic changes (Lagrange augmentation, gating).

    Authors: We agree that an independent calibration of the VLM signal would better isolate its anticipatory contribution. In the revised manuscript, we will add a new experiment section that evaluates the VLM margin on held-out trajectories collected from a converged policy. We will report the correlation between the VLM cost signal and actual future collisions (within 5-10 steps) using metrics such as precision-recall or AUC, separate from the RL training. This addresses the concern about isolating the VLM's role. revision: yes

  2. Referee: [Method / Confidence Gating] The logistic noise model used to derive the Bayes-optimal weight for the confidence gate (contribution iii) is presented as producing the gating weight, yet the manuscript does not report an ablation that removes the VLM cost entirely while retaining the gate structure, nor does it show that the gate actually corrects systematic bias in the VLM signal rather than acting as a generic regularizer.

    Authors: We will incorporate the suggested ablation in the revised version. Specifically, we will compare the full model against a variant where the VLM cost is set to zero (or replaced with a non-informative signal) but the gating weight is still computed (perhaps from a fixed distribution or zero margin). This will demonstrate whether the gate provides benefit beyond a generic regularizer. Additionally, we will include analysis showing the gate's effect on reducing variance or bias in the cost estimates. revision: yes

Circularity Check

0 steps flagged

No circularity; empirical results use external frozen VLM with direct benchmark comparisons

full rationale

The paper's core contribution is an empirical framework (VLM-Safe-RL) that augments CMDP Lagrangian updates with per-step signals from a frozen external VLM. Principal claims rest on reported performance metrics (return, cost within budget) across Safety-Gymnasium FormulaOne L2, MetaDrive, and Bullet Safety-Gym, with explicit seed counts, bootstrap CIs, and comparisons to five independent baselines. No equations, derivations, or fitted parameters are presented that reduce the claimed improvements to quantities defined by construction from the inputs. The VLM is treated as an off-the-shelf black-box source; no self-citation chain or ansatz is invoked to justify uniqueness or force the outcome. The method is self-contained against external benchmarks and does not rename known results or smuggle assumptions via prior author work.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

Abstract-only review provides no visibility into fitted parameters or additional modeling choices; the ledger therefore records only the core domain assumptions required for the central claim.

axioms (2)
  • domain assumption A frozen VLM can extract per-step visual features that correlate with future collision risk without task-specific fine-tuning
    Required for the Decoupled Dual-Path CLIP and VLM-Lagrange components to function as anticipatory signals
  • ad hoc to paper The logistic noise model on the CLIP margin produces a Bayes-optimal weight for gating the VLM cost
    Introduced as part of Confidence Gating; no external validation cited in abstract

pith-pipeline@v0.9.1-grok · 5927 in / 1283 out tokens · 21151 ms · 2026-06-27T14:04:16.350514+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

67 extracted references · 7 linked inside Pith

  1. [1]

    Altman.Constrained Markov Decision Processes

    E. Altman.Constrained Markov Decision Processes. CRC Press, 1999

  2. [2]

    Stooke, J

    A. Stooke, J. Achiam, and P. Abbeel. Responsive safety in reinforcement learning by PID Lagrangian methods. InInternational Conference on Machine Learning (ICML), 2020

  3. [3]

    Radford, J

    A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, G. Krueger, and I. Sutskever. Learning transferable visual models from natural language supervision. InInternational Conference on Machine Learning (ICML), 2021

  4. [4]

    Brohan, N

    A. Brohan, N. Brown, J. Carbajal, Y . Chebotar, X. Chen, K. Choromanski, T. Ding, D. Driess, A. Dubey, C. Finn, et al. RT-2: Vision-language-action models transfer web knowledge to robotic control. InConference on Robot Learning (CoRL), 2023

  5. [5]

    Driess, F

    D. Driess, F. Xia, M. S. M. Sajjadi, C. Lynch, A. Chowdhery, B. Ichter, A. Wahid, J. Tompson, Q. Vuong, T. Yu, et al. PaLM-E: An embodied multimodal language model. InInternational Conference on Machine Learning (ICML), 2023

  6. [6]

    Zhang, Y

    B. Zhang, Y . Zhang, J. Ji, Y . Lei, J. Dai, Y . Chen, and Y . Yang. SafeVLA: Towards safety alignment of vision-language-action models via constrained learning.arXiv preprint arXiv:2503.03480, 2025

  7. [7]

    L. Fan, G. Wang, Y . Jiang, A. Mandlekar, Y . Yang, H. Zhu, A. Tang, D.-A. Huang, Y . Zhu, and A. Anandkumar. MineDojo: Building open-ended embodied agents with internet-scale knowledge. InAdvances in Neural Information Processing Systems (NeurIPS) Datasets and Benchmarks, 2022

  8. [8]

    M. Kwon, S. M. Xie, K. Bullard, and D. Sadigh. Reward design with language models. In International Conference on Learning Representations (ICLR), 2023

  9. [9]

    Rocamonde, V

    J. Rocamonde, V . Montesinos, E. Nava, E. Perez, and D. Lindner. Vision-language models are zero-shot reward models for reinforcement learning. InInternational Conference on Learning Representations (ICLR), 2024

  10. [10]

    Huang, Z

    Z. Huang, Z. Sheng, Y . Qu, J. You, and S. Chen. VLM-RL: A unified vision language models and reinforcement learning framework for safe autonomous driving.arXiv preprint arXiv:2412.15544, 2024

  11. [11]

    J. Ji, J. Zhou, B. Zhang, J. Dai, X. Pan, R. Sun, W. Huang, Y . Geng, M. Liu, and Y . Yang. OmniSafe: An infrastructure for accelerating safe reinforcement learning research.arXiv preprint arXiv:2305.09304, 2023

  12. [12]

    J. Ji, B. Zhang, J. Zhou, X. Pan, W. Huang, R. Sun, Y . Geng, Y . Zhong, J. Dai, and Y . Yang. Safety Gymnasium: A unified safe reinforcement learning benchmark. InAdvances in Neural Information Processing Systems (NeurIPS) Datasets and Benchmarks, 2023

  13. [13]

    Q. Li, Z. Peng, L. Feng, Q. Zhang, Z. Xue, and B. Zhou. MetaDrive: Composing diverse driving scenarios for generalizable reinforcement learning.IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 45(3):3461–3475, 2023

  14. [14]

    M. Ahn, A. Brohan, N. Brown, Y . Chebotar, O. Cortes, B. David, C. Finn, C. Fu, K. Gopalakr- ishnan, K. Hausman, A. Herzog, D. Ho, et al. Do as i can, not as i say: Grounding language in robotic affordances. InConference on Robot Learning (CoRL), 2022

  15. [15]

    Huang, C

    W. Huang, C. Wang, R. Zhang, Y . Li, J. Wu, and L. Fei-Fei. V oxPoser: Composable 3D value maps for robotic manipulation with language models. InConference on Robot Learning (CoRL), 2023. 9

  16. [16]

    H. Liu, C. Li, Q. Wu, and Y . J. Lee. Visual instruction tuning. InAdvances in Neural Information Processing Systems (NeurIPS), 2023

  17. [17]

    P. Wang, S. Bai, S. Tan, S. Wang, Z. Fan, J. Bai, K. Chen, X. Liu, J. Wang, W. Ge, Y . Fan, K. Dang, M. Du, X. Ren, R. Men, D. Liu, C. Zhou, J. Zhou, and J. Lin. Qwen2-VL: En- hancing vision-language model’s perception of the world at any resolution.arXiv preprint arXiv:2409.12191, 2024

  18. [18]

    A. Y . Ng, D. Harada, and S. Russell. Policy invariance under reward transformations: Theory and application to reward shaping. InInternational Conference on Machine Learning (ICML), 1999

  19. [19]

    T. Xie, S. Zhao, C. H. Wu, Y . Liu, Q. Luo, V . Zhong, Y . Yang, and T. Yu. Text2Reward: Reward shaping with language models for reinforcement learning. InInternational Conference on Learning Representations (ICLR), 2024

  20. [20]

    Y . J. Ma, W. Liang, G. Wang, D.-A. Huang, O. Bastani, D. Jayaraman, Y . Zhu, L. Fan, and A. Anandkumar. Eureka: Human-level reward design via coding large language models. In International Conference on Learning Representations (ICLR), 2024

  21. [21]

    Haarnoja, A

    T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. InInternational Conference on Machine Learning (ICML), 2018

  22. [22]

    Amodei, C

    D. Amodei, C. Olah, J. Steinhardt, P. Christiano, J. Schulman, and D. Mané. Concrete problems in AI safety.arXiv preprint arXiv:1606.06565, 2016

  23. [23]

    Shalev-Shwartz, S

    S. Shalev-Shwartz, S. Shammah, and A. Shashua. Safe, multi-agent, reinforcement learning for autonomous driving.arXiv preprint arXiv:1610.03295, 2016

  24. [24]

    García and F

    J. García and F. Fernández. A comprehensive survey on safe reinforcement learning.Journal of Machine Learning Research, 16(1):1437–1480, 2015

  25. [25]

    Brunke, M

    L. Brunke, M. Greeff, A. W. Hall, Z. Yuan, S. Zhou, J. Panerati, and A. P. Schoellig. Safe learning in robotics: From learning-based control to safe reinforcement learning.Annual Review of Control, Robotics, and Autonomous Systems, 5:411–444, 2022

  26. [26]

    Achiam, D

    J. Achiam, D. Held, A. Tamar, and P. Abbeel. Constrained policy optimization. InInternational Conference on Machine Learning (ICML), 2017

  27. [27]

    Zhang, Q

    Y . Zhang, Q. Vuong, and K. W. Ross. FOCOPS: First order constrained optimization in policy space. InAdvances in Neural Information Processing Systems (NeurIPS), 2020

  28. [28]

    T.-Y . Yang, J. Rosca, K. Narasimhan, and P. J. Ramadge. Projection-based constrained policy optimization. InInternational Conference on Learning Representations (ICLR), 2020

  29. [29]

    L. Yang, J. Ji, J. Dai, L. Zhang, B. Zhou, P. Li, Y . Yang, and G. Pan. Constrained update projection approach to safe policy optimization. InAdvances in Neural Information Processing Systems (NeurIPS), 2022

  30. [30]

    Todorov, T

    E. Todorov, T. Erez, and Y . Tassa. MuJoCo: A physics engine for model-based control. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 5026–5033. IEEE, 2012

  31. [31]

    T. Xu, Y . Liang, and G. Lan. Crpo: A new approach for safe reinforcement learning with convergence guarantee. InInternational Conference on Machine Learning (ICML), 2021

  32. [32]

    Sootla, A

    A. Sootla, A. I. Cowen-Rivers, T. Jafferjee, Z. Wang, D. H. Mguni, J. Wang, and H. Ammar. Sauté RL: Almost surely safe reinforcement learning using state augmentation. InInternational Conference on Machine Learning (ICML), 2022. 10

  33. [33]

    Henderson, R

    P. Henderson, R. Islam, P. Bachman, J. Pineau, D. Precup, and D. Meger. Deep reinforcement learning that matters.AAAI Conference on Artificial Intelligence, 2018

  34. [34]

    Agarwal, M

    R. Agarwal, M. Schwarzer, P. S. Castro, A. Courville, and M. G. Bellemare. Deep reinforcement learning at the edge of the statistical precipice. InAdvances in Neural Information Processing Systems (NeurIPS), 2021

  35. [35]

    Chen and R

    J. Chen and R. Chandra. Dynamic control barrier function regulation with vision-language models for safe, adaptive, and realtime visual navigation.arXiv preprint arXiv:2603.21142, 2026

  36. [36]

    Jeong, Y

    J. Jeong, Y . Zou, T. Kim, D. Zhang, A. Ravichandran, and O. Dabeer. WinCLIP: Zero-/few-shot anomaly classification and segmentation. InIEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023

  37. [37]

    K. P. Murphy.Machine Learning: A Probabilistic Perspective. MIT Press, 2012

  38. [38]

    Santos, Z

    L. Santos, Z. Li, L. Peters, S. Bansal, and A. Bajcsy. Updating robot safety representations online from natural language feedback.arXiv preprint arXiv:2409.14580, 2024

  39. [39]

    Dosovitskiy, L

    A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, and N. Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. InInternational Conference on Learning Representations (ICLR), 2021

  40. [40]

    J. Platt. Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. InAdvances in Large Margin Classifiers, volume 10, pages 61–74. MIT Press, 1999

  41. [41]

    C. Guo, G. Pleiss, Y . Sun, and K. Q. Weinberger. On calibration of modern neural networks. InProceedings of the 34th International Conference on Machine Learning (ICML), pages 1321–1330, 2017

  42. [42]

    Schulman, F

    J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov. Proximal policy optimization algorithms.arXiv preprint arXiv:1707.06347, 2017

  43. [43]

    Efron and R

    B. Efron and R. Tibshirani. Bootstrap methods for standard errors, confidence intervals, and other measures of statistical accuracy.Statistical Science, 1:54–75, 1986

  44. [44]

    Gronauer

    S. Gronauer. Bullet-safety-gym: A framework for constrained reinforcement learning. Technical report, Technical University of Munich, 2022. URL https://github.com/ SvenGronauer/Bullet-Safety-Gym

  45. [45]

    Burda, H

    Y . Burda, H. Edwards, A. Storkey, and O. Klimov. Exploration by random network distillation. arXiv preprint arXiv:1810.12894, 2018

  46. [46]

    Coumans and Y

    E. Coumans and Y . Bai. PyBullet: A python module for physics simulation for games, robotics and machine learning. InGitHub repository, 2016.http://pybullet.org

  47. [47]

    Zhang, L

    L. Zhang, L. Shen, L. Yang, S. Chen, B. Yuan, X. Wang, and D. Tao. Penalized proximal policy optimization for safe reinforcement learning.arXiv preprint arXiv:2205.11814, 2022

  48. [48]

    Abnar and W

    S. Abnar and W. Zuidema. Quantifying attention flow in transformers. InAnnual Meeting of the Association for Computational Linguistics (ACL), 2020

  49. [49]

    H. Liu, C. Li, Y . Li, and Y . J. Lee. LLaV A-NeXT: Improved reasoning, OCR, and world knowledge.arXiv preprint arXiv:2408.03326, 2024

  50. [50]

    V . S. Borkar.Stochastic Approximation: A Dynamical Systems Viewpoint. Cambridge University Press, 2009

  51. [51]

    Steps/sec

    H. J. Kushner and G. G. Yin.Stochastic Approximation and Recursive Algorithms and Applica- tions. Springer, 2nd edition, 2003. 11 A Appendix A.1 Hyperparameters Table 5 provides a complete specification of the hyperparameters shared across all environments and methods. All algorithms use identical optimisation, PPO/CPO, and Lagrange settings; only the VLM...

  52. [52]

    the racecar is centred on the track and driving safely

    “the racecar is centred on the track and driving safely”

  53. [53]

    the racecar is following the racing line smoothly

    “the racecar is following the racing line smoothly”

  54. [54]

    the racecar is making steady forward progress

    “the racecar is making steady forward progress”

  55. [55]

    the car is driving efficiently without collisions

    “the car is driving efficiently without collisions” 14 Negative prompts (L=4, cost shaping):

  56. [56]

    the racecar is about to crash into the barrier

    “the racecar is about to crash into the barrier”

  57. [57]

    the racecar is off the track and unsafe

    “the racecar is off the track and unsafe”

  58. [58]

    the car is colliding with an obstacle

    “the car is colliding with an obstacle”

  59. [59]

    the car is driving in the wrong direction

    “the car is driving in the wrong direction” C.2 Design principles We followed four principles in writing the prompts: • Action-oriented language.Prompts describebehaviours(“driving safely”, “about to crash”) rather than static states. This aligns the VLM signal with the CMDP objective of shaping action selection rather than scoring frames. • Semantic dive...

  60. [60]

    Decoupled (critic only).The VLM score cvlm is consumed only by an auxiliary cost critic VC; the policy update receives the standard environment cost

  61. [61]

    (4)), down-weighting frames the VLM is uncertain about

    Decoupled + Confidence.As Decoupled, but the auxiliary contribution is gated by the per-frame confidenceκ(s)(Eq. (4)), down-weighting frames the VLM is uncertain about

  62. [62]

    cvlm is mixed into theenvironment costthat the critic regresses on, ctot(s)=cenv(s)+α cvlm(s); the policy then trades VLM and env signals through a single Lagrangian

    Coupled (reward shaping). cvlm is mixed into theenvironment costthat the critic regresses on, ctot(s)=cenv(s)+α cvlm(s); the policy then trades VLM and env signals through a single Lagrangian

  63. [63]

    VLM-as-reward

    VLG (VLM-Lagrangian Gate). cvlm enters via the Lagrange-multiplier update, λ←λ+ η g(cvlm, κ) (VC −d) , matching the Rocamonde-style “VLM-as-reward” usage from [ 9] but adapted to the constraint side. Each mode is paired with all three base safe-RL algorithms (PPO, PPO-Lag, CPO) on FormulaOne L1 and L2, evaluated on the held-out seeds 10000–10019 (20 deter...

  64. [64]

    κ Policy π Cost critic VC Lagrangian λ Policy update ∇θJ

    Decoupled ∇θJ − λvlm ⋅ ∇θ[ svlm(s) log π(a|s) ] PPO-Lag F1-L1: 15% Environment Camera frame VLM score svlm, conf. κ Policy π Cost critic VC Lagrangian λ Policy update ∇θJ

  65. [65]

    κ Policy π Cost critic VC Lagrangian λ Policy update ∇θJ

    Decoupled + Confidence ∇θJ − λvlm ⋅ κ(s) ⋅ ∇θ[ svlm(s) log π(a|s) ] PPO-Lag F1-L1: 7% ★ best Environment Camera frame VLM score svlm, conf. κ Policy π Cost critic VC Lagrangian λ Policy update ∇θJ

  66. [66]

    κ Policy π Cost critic VC Lagrangian λ Policy update ∇θJ

    Coupled ctot(s) = cenv(s) + α ⋅ svlm(s) CPO F1-L1: 15% Environment Camera frame VLM score svlm, conf. κ Policy π Cost critic VC Lagrangian λ Policy update ∇θJ

  67. [67]

    VLG” (VLM-Lagrangian Gate) is distinct from “CLG

    VLG (VLM-Lagrangian Gate) λ ← λ + η ⋅ g(svlm, κ) ⋅ (VC − d) PPO F1-L1: 47% CPO L1: 47% ✗ Where the VLM signal enters the safe-RL loop (coloured arrow = injection edge; matching box = injection target) Figure 14:Where the VLM signal enters the safe-RL loop.Four injection points evaluated in Section D.7; in each panel thecoloured arrowmarks the injection ed...