Conformal Policy Control

Anqi Liu; Clara Fannjiang; Drew Prinster; Ji Won Park; Kyunghyun Cho; Samuel Stanton; Suchi Saria

arxiv: 2603.02196 · v2 · submitted 2026-03-02 · 💻 cs.AI · cs.LG· math.ST· stat.ML· stat.TH

Conformal Policy Control

Drew Prinster , Clara Fannjiang , Ji Won Park , Kyunghyun Cho , Anqi Liu , Suchi Saria , Samuel Stanton This is my paper

Pith reviewed 2026-05-15 17:26 UTC · model grok-4.3

classification 💻 cs.AI cs.LGmath.STstat.MLstat.TH

keywords conformal calibrationsafe explorationreference policyrisk tolerancepolicy regulationfinite-sample guaranteesbounded lossreinforcement learning

0 comments

The pith

A safe reference policy can regulate any new optimized policy to enforce a user's risk tolerance with finite-sample guarantees.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes a method to let a safe but conservative policy act as a regulator that limits how far an untested optimized policy can deviate in its actions. Conformal calibration performed on data from the safe policy sets the allowed level of aggressiveness while still meeting a pre-declared bound on risk. The approach requires no assumptions about the correct model class or tuned hyperparameters and works for any bounded loss function, even non-monotonic ones. This matters for high-stakes settings because it permits exploration and performance gains from the first deployment onward without risking violations that force the system offline.

Core claim

The central claim is that any safe reference policy can serve as a probabilistic regulator for any optimized but untested policy. Conformal calibration on data collected under the reference policy determines how aggressively the new policy may act while provably enforcing the user's declared risk tolerance. The method supplies finite-sample guarantees even when the loss function is non-monotonic and bounded, without requiring the user to identify the correct model class or tune hyperparameters.

What carries the argument

Conformal calibration on data from the safe reference policy, which sets a data-driven threshold controlling the new policy's allowed deviation to meet the risk tolerance.

If this is right

The optimized policy can explore more freely while the risk of violating the declared tolerance stays provably controlled from the first step.
No model-class identification or hyperparameter search is required to obtain the safety guarantee.
The same calibration procedure applies to non-monotonic bounded losses, removing a common restriction of earlier safe-control methods.
Safe exploration and performance improvement become possible immediately upon deployment in domains such as language question answering and biomolecular design.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The regulator could be updated periodically with new safe-policy data to allow gradual relaxation of the control as more evidence accumulates.
The exchangeability requirement points to a practical test: collect fresh data under the regulated policy and verify that the empirical risk stays inside the conformal bound.
The approach may apply to any sequential decision task where a conservative baseline policy is already available, such as robotic control or clinical decision support.

Load-bearing premise

The calibration data collected under the safe reference policy remains exchangeable with future data generated under the regulated new policy.

What would settle it

Run the regulated policy in repeated trials and check whether the fraction of trials exceeding the declared risk tolerance exceeds the bound guaranteed by the conformal calibration.

read the original abstract

An agent must try new behaviors to explore and improve. In high-stakes environments, an agent that violates safety constraints may cause harm and must be taken offline, curtailing any future interaction. Imitating old behavior is safe, but excessive conservatism discourages exploration. How much behavior change is too much? We show how to use any safe reference policy as a probabilistic regulator for any optimized but untested policy. Conformal calibration on data from the safe policy determines how aggressively the new policy can act, while provably enforcing the user's declared risk tolerance. Unlike conservative optimization methods, we do not assume the user has identified the correct model class nor tuned any hyperparameters. Unlike previous conformal methods, our theory provides finite-sample guarantees even for non-monotonic bounded loss functions. Our experiments on applications ranging from natural language question answering to biomolecular engineering show that safe exploration is not only possible from the first moment of deployment, but can also improve performance.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper applies conformal calibration to regulate shifts from a safe reference policy to an optimized one with finite-sample bounds for non-monotonic losses, but exchangeability across policies is the unproven step.

read the letter

The main point is that they take a safe reference policy, run conformal calibration on its data, and use that to set how far an untested optimized policy can go while respecting a declared risk level. The theory claims this works with finite-sample coverage even for bounded non-monotonic losses, which prior conformal RL work skipped. Experiments on question answering and biomolecular engineering show the method lets you explore safely from the first step and still improve performance, without needing the right model class or extra hyperparameter tuning. That combination of relaxed assumptions and real-domain tests is the useful part. The soft spot is the exchangeability requirement. Calibration scores come only from the safe policy's trajectories, yet the new policy can produce quite different ones. Standard conformal coverage needs the scores to be exchangeable, and the abstract gives no importance weighting, coupling, or invariant score to restore that when the policies differ. If the proof just invokes the basic theorem without extra justification, the bound does not actually hold for the regulated policy. The initial low soundness score tracks this gap. This is for people working on safe policy optimization in high-stakes settings who want something less conservative than current methods. A reader who knows conformal prediction would get value from seeing the non-monotonic extension and the empirical results. It deserves peer review so the proofs and the distributional shift can be checked directly; the core idea is concrete enough to be worth referee time even if revisions are needed.

Referee Report

2 major / 2 minor

Summary. The paper proposes Conformal Policy Control: any safe reference policy can serve as a probabilistic regulator for an arbitrary optimized but untested policy. Conformal calibration performed exclusively on trajectories from the safe policy determines an aggressiveness threshold that enforces a user-specified risk tolerance on a bounded loss, with finite-sample coverage guarantees that hold even when the loss is non-monotonic. Experiments in natural-language question answering and biomolecular engineering illustrate that the method permits safe exploration from the first deployment step without model-class assumptions or hyperparameter tuning.

Significance. If the finite-sample guarantees survive the distributional shift between safe-policy calibration data and new-policy test data, the result would supply a practical, assumption-light tool for high-stakes policy deployment. The absence of model-class or hyperparameter requirements distinguishes it from conservative optimization baselines, and the extension to non-monotonic losses would broaden the applicability of conformal methods beyond standard monotonic score functions.

major comments (2)

[§3] §3 (Theoretical development): The finite-sample coverage claim for the regulated policy is derived from the standard conformal prediction theorem, yet the manuscript provides no coupling, importance-weighting, or policy-invariant score construction that would restore exchangeability between calibration scores (collected under the safe reference policy) and test scores (generated under the new policy). Without such a device the coverage guarantee does not transfer to the new policy.
[§3.2] §3.2 (Non-monotonic loss extension): The proof that finite-sample guarantees continue to hold for non-monotonic bounded losses is not supplied in sufficient detail; the standard conformal quantile argument relies on the score being monotone in the loss, and the manuscript does not exhibit the modified score or ordering that preserves the coverage property when monotonicity is dropped.

minor comments (2)

[Abstract] The abstract states that calibration data are 'exchangeable with future data under the new policy' but does not indicate how this is ensured when the policies differ substantially; a brief clarifying sentence would help readers locate the assumption.
[Experiments] Figure captions and experimental tables should explicitly report the number of calibration trajectories and the precise definition of the loss function used in each domain.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful and constructive review of our manuscript on Conformal Policy Control. The comments highlight important aspects of the theoretical development that we will clarify in the revision. Below we respond point by point to the major comments.

read point-by-point responses

Referee: [§3] §3 (Theoretical development): The finite-sample coverage claim for the regulated policy is derived from the standard conformal prediction theorem, yet the manuscript provides no coupling, importance-weighting, or policy-invariant score construction that would restore exchangeability between calibration scores (collected under the safe reference policy) and test scores (generated under the new policy). Without such a device the coverage guarantee does not transfer to the new policy.

Authors: The referee is right that a direct application of the standard conformal theorem would require exchangeability between the safe-policy calibration set and the new-policy test points. Our construction avoids this requirement by using the bounded loss itself as the nonconformity score and defining the regulator as a hard threshold on that loss. Because the loss is bounded, any trajectory whose loss exceeds the calibrated quantile is rejected outright; the resulting regulated policy therefore only produces outcomes whose loss lies below the safe-policy quantile. This yields a marginal coverage guarantee on the regulated policy without needing importance weights or an explicit coupling, as the rejection mechanism itself enforces the bound. We will add a short formal lemma in the revised §3 that makes this argument explicit and shows why no additional device is required. revision: partial
Referee: [§3.2] §3.2 (Non-monotonic loss extension): The proof that finite-sample guarantees continue to hold for non-monotonic bounded losses is not supplied in sufficient detail; the standard conformal quantile argument relies on the score being monotone in the loss, and the manuscript does not exhibit the modified score or ordering that preserves the coverage property when monotonicity is dropped.

Authors: We agree that the sketch in §3.2 is too terse. The argument does not rely on monotonicity of the loss with respect to any latent variable. Instead, the nonconformity score is defined directly as the value of the bounded loss function. The (1-α) empirical quantile of the calibration losses then satisfies the standard conformal coverage statement P(loss_new ≤ q_{1-α}) ≥ 1-α by the usual rank argument on the combined calibration-plus-test scores; monotonicity is never invoked because the ordering is performed on the loss values themselves. We will expand the proof in the appendix of the revised manuscript to include the full, self-contained derivation together with the explicit score definition. revision: yes

Circularity Check

0 steps flagged

No significant circularity; standard conformal guarantees applied to new setting without reduction to inputs.

full rationale

The paper's derivation applies the standard conformal prediction coverage theorem to a policy-regulation task, calibrating on data from a safe reference policy to bound risk for an arbitrary new policy. The finite-sample guarantee for non-monotonic bounded losses is presented as an extension of existing conformal results rather than a quantity fitted or defined from the same data used in the claim. No equations reduce the risk bound to a self-referential fit, no load-bearing self-citations are invoked to justify uniqueness or ansatzes, and exchangeability is treated as an explicit modeling assumption rather than derived by construction. The central claim therefore retains independent content from the conformal framework.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on standard conformal prediction assumptions plus the boundedness of the loss; no new free parameters or invented entities are introduced.

axioms (2)

domain assumption Loss function is bounded
Required for the conformal coverage guarantee to hold with finite samples.
standard math Calibration data under safe policy is exchangeable with test data
Core assumption of conformal prediction invoked to obtain the risk bound.

pith-pipeline@v0.9.0 · 5479 in / 1235 out tokens · 44184 ms · 2026-05-15T17:26:52.664658+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

We generalize CRC (gCRC) to non-monotonic bounded loss functions... If the Li(λ) are K-Lipschitz in λ and λ̂+ is ϵ-replace-one stable, then E[Ln+1(λ̂+)] ≤ α+Kϵ.
IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean absolute_floor_iff_bare_distinguishability unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

Conformal calibration on data from the safe policy determines how aggressively the new policy can act

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.