Constrained Policy Optimization with Cantelli-Bounded Value-at-Risk

Jan-Peter Calliess; Rohan Tangri

arxiv: 2601.22993 · v3 · submitted 2026-01-30 · 💻 cs.LG · stat.ML

Pith reviewed 2026-05-16 09:35 UTC · model grok-4.3

keywords: constrained policy optimization, value-at-risk, Cantelli inequality, safe reinforcement learning, trust region methods, chance constraints, moment-based bounds

The pith

VaR-CPO approximates Value-at-Risk constraints via Cantelli's inequality and empirically achieves zero constraint violations during training in feasible reinforcement learning environments.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents VaR-CPO as a method for solving reinforcement learning problems where policies must satisfy Value-at-Risk limits on cumulative costs. It replaces the non-differentiable VaR with a bound derived from the mean and variance of cost returns using Cantelli's inequality, then embeds this surrogate inside a trust-region update similar to Constrained Policy Optimization. The approach yields both practical safety during learning and theoretical worst-case guarantees on how much the policy can improve and how much the constraint can be violated per step. A reader would care because real-world RL applications often require strict limits on rare but expensive failures, and existing methods routinely breach those limits during exploration. If the method works as stated, it supplies a concrete route to training policies that remain safe from the first update onward.

Core claim

VaR-CPO is a sample-efficient algorithm that optimizes policies subject to Value-at-Risk constraints on cumulative costs by replacing the non-differentiable VaR with a Cantelli inequality bound on the first two moments, and extends the constrained policy optimization framework to derive worst-case guarantees on improvement and violation.

What carries the argument

The Cantelli-bounded Value-at-Risk surrogate inside the trust-region policy optimization loop, which converts the original probabilistic cost constraint into a differentiable moment-based penalty while preserving safety properties.
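
As a minimal sketch of that surrogate, assume the cost return has mean `mean` and variance `var` under the current policy, `threshold` is the VaR limit ρ, and `eps` is the allowed tail probability ε. The function names below are hypothetical and are not taken from the paper's code; the sketch only evaluates the moment-based bound and does not perform the trust-region update.

```python
def cantelli_tail_bound(mean, var, threshold):
    """Cantelli upper bound on P(cost return >= threshold).

    Only informative when the threshold lies strictly above the mean;
    otherwise the trivial bound 1.0 is returned.
    """
    if var < 0.0:
        raise ValueError("variance must be non-negative")
    gap = threshold - mean
    if gap <= 0.0:
        return 1.0
    return var / (var + gap * gap)


def surrogate_satisfied(mean, var, threshold, eps):
    """True when the moment-based surrogate constraint holds at level eps."""
    return cantelli_tail_bound(mean, var, threshold) <= eps
```

With exact moments, a policy that passes `surrogate_satisfied` also satisfies the original tail-probability constraint, because the Cantelli bound never underestimates the true tail probability. The converse does not hold, which is where the conservatism comes from.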

If this is right

Zero constraint violations occur throughout training whenever the environment is feasible.
Worst-case analytic bounds hold on both per-step policy improvement and per-step constraint violation.
The method stays sample-efficient while being more conservative than standard baselines on safety metrics.
The same trust-region machinery used in CPO now applies directly to moment-based VaR constraints.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar Cantelli-style bounds could be substituted for other chance constraints that lack closed-form gradients.
The approach may scale to continuous control tasks once reliable online estimates of cost mean and variance become available.
Running the same algorithm on environments with narrower feasibility margins would quantify how much extra conservatism the bound introduces.

Load-bearing premise

Cantelli's inequality supplies a sufficiently tight and conservative upper bound on the true Value-at-Risk so that enforcing the surrogate still prevents actual violations in practice.

What would settle it

A controlled experiment in which a policy trained to satisfy the Cantelli surrogate still produces cost returns that exceed the prescribed VaR threshold with probability greater than the allowed level.
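
One way such a check could be scripted, as a rough sketch: collect cost returns from rollouts of the trained policy and compare the observed exceedance frequency with the allowed level. The names `cost_returns`, `threshold`, and `eps` are illustrative assumptions, and a point estimate like this ignores sampling error, so a real experiment would also need a confidence interval.

```python
def empirical_violation_rate(cost_returns, threshold):
    """Fraction of sampled cost returns that exceed the VaR threshold."""
    if not cost_returns:
        raise ValueError("need at least one cost return")
    exceed = sum(1 for c in cost_returns if c > threshold)
    return exceed / len(cost_returns)


def violates_var_constraint(cost_returns, threshold, eps):
    """True when the observed exceedance frequency is above the allowed level."""
    return empirical_violation_rate(cost_returns, threshold) > eps
```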

Original abstract

We introduce the Value-at-Risk Constrained Policy Optimization algorithm (VaR-CPO), a sample efficient and conservative method designed to optimize Value-at-Risk (VaR) constrained reinforcement learning (RL) problems. Empirically, we demonstrate that VaR-CPO is capable of safe exploration, achieving zero constraint violations during training in feasible environments, a critical property that baseline methods fail to uphold. To overcome the inherent non-differentiability of the VaR constraint, we employ Cantelli's inequality to obtain a tractable approximation based on the first two moments of the cost return. Additionally, by extending the trust-region framework of the Constrained Policy Optimization (CPO) method, we provide worst-case bounds for both policy improvement and constraint violation during the training process.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

VaR-CPO swaps in a Cantelli moment bound for the VaR constraint inside CPO and reports zero violations in the tested cases, but the surrogate is loose enough that the safety claim stays empirical.

The paper introduces VaR-CPO, which replaces the non-differentiable VaR constraint with a Cantelli upper bound on the tail probability using only the first two moments of the cost return, then carries the usual CPO trust-region analysis over to this surrogate. The experiments show it maintains zero constraint violations during training where standard baselines do not, and the method stays sample-efficient within the same trust-region style updates as CPO. That combination is new and the implementation looks clean for anyone already working in constrained RL. The trust-region bounds on policy improvement and surrogate violation are derived in the expected way and match the style of the original CPO paper. The main limitation is that Cantelli's inequality is distribution-free and can be arbitrarily loose for many cost distributions, so enforcing the surrogate does not automatically enforce the original probabilistic VaR constraint. Sample estimates of the mean and variance add further unaccounted error, and the paper's worst-case bounds apply only to the approximated quantity. The zero-violation result is therefore an empirical observation rather than a transferred guarantee. This is useful for RL practitioners who need a conservative, off-the-shelf way to add probabilistic cost constraints without changing the overall CPO pipeline. A reader who knows CPO will see the incremental step immediately and can judge whether the extra conservatism is acceptable for their domain. I would send it to peer review; the core construction is straightforward and the experiments are on point, even if the gap between surrogate and true VaR needs tighter analysis in revision.
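
To give a feel for how loose the bound can be, the sketch below compares the safety margin implied by the Cantelli surrogate with the margin a Gaussian cost return would need. The Gaussian case is purely an illustrative assumption made here (the paper's bound is distribution-free), and the function names are hypothetical.

```python
import math
from statistics import NormalDist


def cantelli_multiplier(eps):
    """k such that `mean + k * std <= threshold` is equivalent to the
    Cantelli surrogate holding at level eps (threshold above the mean)."""
    if not 0.0 < eps < 1.0:
        raise ValueError("eps must lie strictly between 0 and 1")
    return math.sqrt((1.0 - eps) / eps)


def gaussian_multiplier(eps):
    """k such that a Gaussian cost return exceeds mean + k * std
    with probability exactly eps."""
    return NormalDist().inv_cdf(1.0 - eps)


def gaussian_tail_at_cantelli_threshold(eps):
    """True Gaussian tail probability at the threshold Cantelli requires."""
    return 1.0 - NormalDist().cdf(cantelli_multiplier(eps))
```

For `eps = 0.05` the Cantelli margin is roughly 4.36 standard deviations against roughly 1.64 for a Gaussian, so under that assumption the surrogate demands far more headroom than the true constraint needs.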

Referee Report

2 major / 1 minor

Summary. The manuscript introduces VaR-CPO, a sample-efficient algorithm for Value-at-Risk (VaR) constrained reinforcement learning. It approximates the non-differentiable VaR constraint via Cantelli's inequality applied to the first two moments of the cost return, extends the trust-region framework of Constrained Policy Optimization (CPO) to derive worst-case bounds on policy improvement and constraint violation, and reports empirical results showing zero constraint violations during training in feasible environments where baseline methods incur violations.

Significance. If the Cantelli surrogate is shown to be sufficiently conservative such that enforcing it transfers to the original probabilistic VaR constraint and if the worst-case bounds are rigorously derived for the true quantity, the work would provide a meaningful contribution to safe RL by enabling conservative policy updates with explicit safety guarantees during training.

major comments (2)

[Abstract and theoretical analysis] Abstract and theoretical section: the worst-case bounds on constraint violation are stated to apply during training, yet they are derived only for the Cantelli surrogate (P(cost >= mu + t) <= sigma^2 / (sigma^2 + t^2)); because this upper bound can be arbitrarily loose for many distributions and because mu and sigma are estimated from finite samples, it is unclear whether the bounds guarantee satisfaction of the original VaR constraint, which underpins the central zero-violation claim.
[Empirical evaluation] Empirical evaluation: the claim of zero constraint violations is load-bearing for the paper's contribution, but the provided description lacks visible details on moment estimation error, number of independent runs, or statistical verification that the surrogate enforcement indeed produced no true VaR violations across the tested environments.

minor comments (1)

[Algorithm description] Clarify in the algorithm description how the first two moments are estimated from trajectories and whether any bias-correction is applied.
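
For context on what this comment is asking for, a plain Monte Carlo estimator might look like the sketch below. It is an assumption about one possible estimator, not the paper's procedure: each trajectory is taken to be a list of per-step costs, `gamma` is a hypothetical discount factor, and `unbiased=True` applies Bessel's correction to the variance.

```python
def discounted_return(costs, gamma):
    """Discounted sum of per-step costs for a single trajectory."""
    total = 0.0
    for c in reversed(costs):
        total = c + gamma * total
    return total


def estimate_moments(trajectories, gamma=1.0, unbiased=True):
    """Sample mean and variance of the cost return over a batch of trajectories."""
    returns = [discounted_return(t, gamma) for t in trajectories]
    n = len(returns)
    if n < 2:
        raise ValueError("need at least two trajectories to estimate variance")
    mean = sum(returns) / n
    sum_sq = sum((r - mean) ** 2 for r in returns)
    # Bessel's correction divides by n - 1 instead of n.
    var = sum_sq / (n - 1 if unbiased else n)
    return mean, var
```

Whether the correction matters depends on batch size: with few trajectories per update the uncorrected variance is biased low, which would make the surrogate slightly less conservative than intended.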

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback. We address each major comment point-by-point below, clarifying the scope of our theoretical guarantees and committing to expanded empirical details in the revision.

Referee: [Abstract and theoretical analysis] Abstract and theoretical section: the worst-case bounds on constraint violation are stated to apply during training, yet they are derived only for the Cantelli surrogate (P(cost >= mu + t) <= sigma^2 / (sigma^2 + t^2)); because this upper bound can be arbitrarily loose for many distributions and because mu and sigma are estimated from finite samples, it is unclear whether the bounds guarantee satisfaction of the original VaR constraint, which underpins the central zero-violation claim.

Authors: We appreciate the referee's careful reading. Cantelli's inequality supplies a distribution-free upper bound, so constraining the surrogate probability to at most ε guarantees that the true tail probability is at most ε whenever the moments μ and σ² are known exactly. Our worst-case bounds on constraint violation are therefore derived for the surrogate and inherit this conservative relationship. We acknowledge that the bound can be loose for some distributions and that finite-sample moment estimates render the guarantee approximate rather than strict for the original VaR. The zero-violation claim in the paper is therefore primarily empirical. In the revised manuscript we will explicitly state that the theoretical guarantees apply to the surrogate, discuss the implications of moment estimation error, and qualify the zero-violation statement accordingly. revision: partial

Referee: [Empirical evaluation] Empirical evaluation: the claim of zero constraint violations is load-bearing for the paper's contribution, but the provided description lacks visible details on moment estimation error, number of independent runs, or statistical verification that the surrogate enforcement indeed produced no true VaR violations across the tested environments.

Authors: We agree that additional experimental details are warranted. In the revised version we will report: (i) that the first two moments are estimated empirically from the batch of sampled trajectories at each policy update, (ii) that all results are averaged over 10 independent random seeds per environment, and (iii) a summary table confirming that no true VaR violations occurred in any run when the surrogate constraint was enforced. We will also include a brief analysis of observed moment estimation variability across seeds. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation applies external inequality and extends prior framework

The paper derives its VaR surrogate by applying Cantelli's inequality (a standard, distribution-free result) to the first two moments of the cost return; this is not self-definitional or fitted by construction. The trust-region updates extend the CPO framework from external prior literature rather than self-citation chains. No equations reduce the target constraint to the surrogate by renaming or tautology, and empirical zero-violation results are presented as observations under the surrogate rather than forced predictions. The chain remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the validity of Cantelli's inequality as a conservative surrogate for VaR and on the assumption that the first two moments of the cost return are sufficient statistics for the safety constraint. No free parameters or invented entities are introduced in the abstract.

axioms (1)

standard math: Cantelli's inequality supplies a valid upper bound on the probability that a random variable exceeds a threshold given only its mean and variance.
Invoked to replace the non-differentiable VaR constraint with a moment-based surrogate.
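
The step from the inequality to the surrogate can be written out as follows. This is a reconstruction under stated assumptions rather than a quotation from the paper: C denotes the cost return (a symbol introduced here), μ(π) and σ²(π) are its mean and variance under policy π, ρ is the VaR threshold with ρ > μ(π), and ε in (0, 1) is the allowed tail probability.

```latex
\begin{align*}
\Pr\big(C - \mu(\pi) \ge t\big)
  &\le \frac{\sigma^2(\pi)}{\sigma^2(\pi) + t^2},
  \qquad t > 0, \\
\Pr\big(C \ge \rho\big)
  &\le \frac{\sigma^2(\pi)}{\sigma^2(\pi) + [\rho - \mu(\pi)]^2}
  \qquad \text{(setting } t = \rho - \mu(\pi) > 0\text{)}, \\
\frac{\sigma^2(\pi)}{\sigma^2(\pi) + [\rho - \mu(\pi)]^2} \le \varepsilon
  &\iff
  \mu(\pi) + \sigma(\pi)\sqrt{\frac{1 - \varepsilon}{\varepsilon}} \le \rho .
\end{align*}
```

The last line shows that the surrogate amounts to a mean-plus-scaled-standard-deviation constraint, which is differentiable in the two moments even though the original quantile constraint is not.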

pith-pipeline@v0.9.0 · 5428 in / 1308 out tokens · 32579 ms · 2026-05-16T09:35:17.857407+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

To overcome the inherent non-differentiability of the VaR constraint, we employ Cantelli's inequality to obtain a tractable approximation based on the first two moments of the cost return... σ²(π)/(σ²(π) + [ρ−μ(π)]²) ≤ ε

IndisputableMonolith/Foundation/ArithmeticFromLogic.lean absolute_floor_iff_bare_distinguishability unclear

We provide a comprehensive worst-case analysis of the constraint violation, extending the guarantees of the original trust-region based CPO algorithm to the VaR setting.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.