Calibrate-Then-Delegate: Safety Monitoring with Risk and Budget Guarantees via Model Cascades

David Watson; Edoardo Pona; Mehran Hosseini; Milad Kazemi; Nicola Paoletti; Osvaldo Simeone; Yali Du

arxiv: 2604.14251 · v1 · submitted 2026-04-15 · 💻 cs.LG

Calibrate-Then-Delegate: Safety Monitoring with Risk and Budget Guarantees via Model Cascades

Edoardo Pona , Milad Kazemi , Mehran Hosseini , Yali Du , David Watson , Osvaldo Simeone , Nicola Paoletti This is my paper

Pith reviewed 2026-05-10 13:03 UTC · model grok-4.3

classification 💻 cs.LG

keywords model cascadessafety monitoringdelegation valuecalibrationmultiple hypothesis testingLLM safetybudget guaranteesrisk control

0 comments

The pith

A delegation-value probe calibrated via multiple hypothesis testing enables model cascades that guarantee computation budgets while improving safety detection over uncertainty-based escalation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Calibrate-Then-Delegate to monitor LLM safety at scale by using a cheap probe to screen inputs and escalating difficult cases to an expensive expert model only when worthwhile. It replaces uncertainty-based delegation with a delegation value probe that directly predicts the benefit of escalation from the same internal representations. The method calibrates a threshold on this signal using held-out data and multiple hypothesis testing to deliver finite-sample probabilistic guarantees on the fraction of inputs sent to the expert. On four safety datasets, the resulting system meets budget limits at every level while delivering higher accuracy than uncertainty baselines and avoiding unnecessary escalations without needing group labels.

Core claim

CTD builds on a novel delegation value probe that directly predicts the benefit of escalation to the expert model and calibrates a threshold on the DV signal via multiple hypothesis testing to yield finite-sample guarantees on the delegation rate under the data distribution encountered at deployment.

What carries the argument

The delegation value probe, a lightweight model operating on the same internal representations as the safety probe, which directly predicts escalation benefit and supports threshold calibration for budget control.

If this is right

Safety monitoring accuracy improves at every fixed computation budget compared with uncertainty-based delegation.
Over-delegation is reduced because escalation occurs only when the probe predicts a net benefit.
Budget allocation automatically adapts to input difficulty in a streaming, per-instance fashion.
Probabilistic guarantees on delegation rate hold without requiring group labels or distributional assumptions on subgroups.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same DV-probe-plus-calibration pattern could be applied to other cost-sensitive decision cascades such as medical triage or autonomous system monitoring.
If internal representations remain informative across model updates, the approach reduces the frequency of full expert calls in long-running production systems.
The multiple hypothesis testing step could be reused for other finite-sample risk controls in sequential ML pipelines where exact budget adherence matters.
Evaluating CTD across a wider range of base models and safety tasks would test whether the internal-representation assumption generalizes beyond the four datasets studied.

Load-bearing premise

The delegation value probe can reliably predict the actual benefit of escalation to the expert model on unseen instances, and the multiple hypothesis testing procedure yields valid finite-sample guarantees on the delegation rate.

What would settle it

Run the calibrated CTD on a fresh held-out sample drawn from the same distribution as the calibration set and check whether the observed delegation fraction exceeds the claimed probabilistic bound or whether safety metrics fail to exceed those of an uncertainty-based cascade at the same average budget.

Figures

Figures reproduced from arXiv: 2604.14251 by David Watson, Edoardo Pona, Mehran Hosseini, Milad Kazemi, Nicola Paoletti, Osvaldo Simeone, Yali Du.

**Figure 2.** Figure 2: Example failure modes of uncertainty-based delegation (strong expert, [PITH_FULL_IMAGE:figures/full_fig_p006_2.png] view at source ↗

**Figure 3.** Figure 3: Cascade performance vs. delegation budget with a strong expert (Gemma-3-27B-IT, [PITH_FULL_IMAGE:figures/full_fig_p007_3.png] view at source ↗

**Figure 4.** Figure 4: Same as Figure [PITH_FULL_IMAGE:figures/full_fig_p008_4.png] view at source ↗

**Figure 5.** Figure 5: Mean v(x, y) of the top-k examples ranked by each signal, as a function of selection fraction k/N. (Left): strong expert. (Right): weak expert. For both experts, the DV probe (orange) outperforms the uncertainty signal (blue) and the random (i.e., no-ranking) baseline (dashed grey). The ground-truth ranking (green) provides an upper bound. 5.4 Delegation Value Probe Performance Our trained DV probe d(x) pr… view at source ↗

**Figure 6.** Figure 6: Difference in DV top-k performance (continuous − binary) vs. delegation budget for three batch sizes. Top: strong expert (Gemma-27B). Bottom: weak expert (Llama-1B). Positive values indicate that the continuous formulation outperforms the binary one. The advantage grows with the budget, consistent with the finer-grained ranking provided by the continuous target [PITH_FULL_IMAGE:figures/full_fig_p013_6.png] view at source ↗

**Figure 7.** Figure 7: Strong expert (Gemma-3-27B-IT): cascade performance at [PITH_FULL_IMAGE:figures/full_fig_p014_7.png] view at source ↗

**Figure 8.** Figure 8: Weak expert (Llama-3.2-1B-Instruct): cascade performance at [PITH_FULL_IMAGE:figures/full_fig_p015_8.png] view at source ↗

**Figure 9.** Figure 9: Group-level mean d(x) vs ground-truth mean v(x, y) for each of the four evaluation datasets and both experts. The DV probe recovers the group-level ordering of delegation benefit without group supervision. 15 [PITH_FULL_IMAGE:figures/full_fig_p015_9.png] view at source ↗

**Figure 10.** Figure 10: Composition of the top-k delegated set by dataset group as a function of budget α, for DV top-k (left column) and Unc. top-k (right column), with strong (top row) and weak (bottom row) experts. Dashed white lines mark cumulative base rates; a method with no group preference would track these lines. The DV signal over-represents high-v groups at low budgets and gradually includes lower-v groups as α grows.… view at source ↗

**Figure 11.** Figure 11: Distribution of realised delegation rates across 500 random calibration splits [PITH_FULL_IMAGE:figures/full_fig_p016_11.png] view at source ↗

**Figure 12.** Figure 12: Cascade performance with AUROC error as the Pareto-testing performance risk [PITH_FULL_IMAGE:figures/full_fig_p017_12.png] view at source ↗

**Figure 13.** Figure 13: Cascade performance with AUROC error as the Pareto-testing performance risk [PITH_FULL_IMAGE:figures/full_fig_p017_13.png] view at source ↗

read the original abstract

Monitoring LLM safety at scale requires balancing cost and accuracy: a cheap latent-space probe can screen every input, but hard cases should be escalated to a more expensive expert. Existing cascades delegate based on probe uncertainty, but uncertainty is a poor proxy for delegation benefit, as it ignores whether the expert would actually correct the error. To address this problem, we introduce Calibrate-Then-Delegate (CTD), a model-cascade approach that provides probabilistic guarantees on the computation cost while enabling instance-level (streaming) decisions. CTD builds on a novel delegation value (DV) probe, a lightweight model operating on the same internal representations as the safety probe that directly predicts the benefit of escalation. To enforce budget constraints, CTD calibrates a threshold on the DV signal using held-out data via multiple hypothesis testing, yielding finite-sample guarantees on the delegation rate. Evaluated on four safety datasets, CTD consistently outperforms uncertainty-based delegation at every budget level, avoids harmful over-delegation, and adapts budget allocation to input difficulty without requiring group labels.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

CTD swaps uncertainty for a benefit-predicting DV probe and adds MHT calibration to deliver finite-sample budget control in LLM safety cascades, but the gains rest on unproven generalization.

read the letter

CTD replaces the standard uncertainty threshold in model cascades with a delegation value probe that directly estimates when escalation to the expert will correct an error. It then calibrates the DV threshold on held-out data using multiple hypothesis testing to produce finite-sample guarantees on the delegation rate under a budget constraint. This is the main new element: moving from a proxy signal to an explicit benefit model plus a non-parametric calibration step for the budget side. The approach is evaluated on four safety datasets and reports consistent outperformance over uncertainty-based delegation at every budget level, plus reduced over-delegation and adaptation to input difficulty without group labels. Those results give the method a practical flavor that prior cascade papers sometimes lack. The soft spots sit in the two linked assumptions the stress-test already flagged. The DV probe is trained on the same internal representations as the safety model, so it is not obvious that its benefit predictions will hold on truly new instances or that P(expert fixes error | DV) stays well-calibrated outside the training distribution. The MHT calibration supplies guarantees only on the held-out set; any shift at deployment can break the finite-sample control. The abstract and method sketch supply no ablations on probe generalization, no sensitivity analysis on the calibration set size, and no direct checks against distribution shift, which leaves the central claims under-supported. This paper is aimed at engineers and researchers who build production safety monitors for large models and need explicit cost control. Readers working on cascades or budgeted inference will find the calibration technique worth examining. It shows clear engagement with the practical constraints of the problem and supplies enough experimental grounding to justify referee time, even if the generalization questions will require revisions. I would send it for peer review.

Referee Report

3 major / 2 minor

Summary. The paper introduces Calibrate-Then-Delegate (CTD), a model-cascade framework for LLM safety monitoring. A lightweight delegation-value (DV) probe operating on the same internal representations as the safety model predicts the benefit of escalation to an expert; a threshold on the DV score is then calibrated on held-out data via multiple hypothesis testing to deliver finite-sample guarantees on the delegation (budget) rate. Experiments on four safety datasets report that CTD outperforms uncertainty-based delegation at every budget level, avoids harmful over-delegation, and allocates budget adaptively to input difficulty without requiring group labels.

Significance. If the DV probe generalizes and the multiple-hypothesis-testing calibration yields valid finite-sample control under deployment conditions, the work supplies a principled, instance-level mechanism that simultaneously improves safety performance and supplies explicit probabilistic budget guarantees—features that would be valuable for scalable, cost-controlled LLM safety pipelines.

major comments (3)

[Abstract, §3] Abstract and §3 (Method): The central claim of 'finite-sample guarantees on the delegation rate' is asserted via multiple hypothesis testing on held-out data, yet no theorem statement, proof sketch, or explicit list of assumptions (i.i.d. sampling, no distribution shift between calibration and deployment) appears. Without this, it is impossible to verify whether the reported guarantees follow from the calibration procedure or survive the distribution shift that the skeptic correctly flags as a weakest assumption.
[§3.2, §4] §4 (Experiments) and §3.2 (DV probe): The performance superiority at every budget level rests on the DV probe accurately predicting actual expert correction benefit on unseen instances. The manuscript supplies no calibration plots, correlation metrics, or ablation showing P(expert fixes error | DV score) on held-out or shifted data; the reported outperformance therefore cannot be separated from possible overfitting of the probe to the calibration distribution.
[§4] §4 (Experiments): While consistent outperformance is claimed across four datasets, the text provides no statistical significance tests, standard-error bars across random seeds, or sensitivity analysis to the single free parameter (DV threshold). This leaves the robustness of the 'avoids harmful over-delegation' and 'adapts budget allocation' claims unverified.

minor comments (2)

[Abstract] The abstract states results on 'four safety datasets' without naming them; the datasets should be identified in the abstract or first paragraph of the introduction.
[§3] Notation for the DV probe output, calibrated threshold, and delegation indicator is introduced piecemeal; a compact symbol table would improve readability.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and will revise the manuscript accordingly to strengthen the presentation of guarantees, probe validation, and experimental robustness.

read point-by-point responses

Referee: [Abstract, §3] Abstract and §3 (Method): The central claim of 'finite-sample guarantees on the delegation rate' is asserted via multiple hypothesis testing on held-out data, yet no theorem statement, proof sketch, or explicit list of assumptions (i.i.d. sampling, no distribution shift between calibration and deployment) appears. Without this, it is impossible to verify whether the reported guarantees follow from the calibration procedure or survive the distribution shift that the skeptic correctly flags as a weakest assumption.

Authors: We agree that a formal theorem would improve verifiability. In the revised manuscript we will insert a theorem statement in §3 that precisely characterizes the finite-sample control on the delegation rate achieved by the multiple-hypothesis-testing threshold calibration. A proof sketch will be added to the appendix. The assumptions will be stated explicitly: (i) calibration samples are i.i.d. draws from the data-generating distribution, and (ii) deployment inputs are drawn from the same distribution (no shift). We will also note, as the skeptic already flags in the paper, that the no-shift assumption is the weakest link and that the guarantees are conditional on it; we will discuss sensitivity to mild shifts in the limitations section. revision: yes
Referee: [§3.2, §4] §4 (Experiments) and §3.2 (DV probe): The performance superiority at every budget level rests on the DV probe accurately predicting actual expert correction benefit on unseen instances. The manuscript supplies no calibration plots, correlation metrics, or ablation showing P(expert fixes error | DV score) on held-out or shifted data; the reported outperformance therefore cannot be separated from possible overfitting of the probe to the calibration distribution.

Authors: We will add, in the revised §4 and appendix, calibration plots of DV score versus observed expert benefit together with quantitative correlation metrics (Pearson and Spearman) computed on the held-out calibration split of each dataset. We will also include an ablation that reports the empirical probability that the expert corrects an error conditional on binned DV scores. While the four datasets already provide some distributional diversity, we acknowledge that explicit shifted-data experiments are absent; we will add a limitations paragraph discussing the risk of overfitting to the calibration distribution and the consequent need for periodic re-calibration in deployment. revision: yes
Referee: [§4] §4 (Experiments): While consistent outperformance is claimed across four datasets, the text provides no statistical significance tests, standard-error bars across random seeds, or sensitivity analysis to the single free parameter (DV threshold). This leaves the robustness of the 'avoids harmful over-delegation' and 'adapts budget allocation' claims unverified.

Authors: We will revise the experimental section to report standard-error bars computed over five independent random seeds for all curves. We will add paired statistical significance tests (t-tests with Bonferroni correction) comparing CTD against each baseline at every budget level. Finally, we will include a sensitivity analysis in the appendix that varies the DV threshold around the calibrated value and shows the resulting changes in delegation rate, safety improvement, and over-delegation metrics. These additions will directly substantiate the robustness claims. revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper trains a delegation value (DV) probe on internal representations to predict escalation benefit (a distinct target from the safety probe's output), then applies multiple hypothesis testing on held-out data to calibrate a threshold and obtain finite-sample guarantees on delegation rate. This uses external calibration data and standard statistical procedures rather than reducing any claim to a fitted parameter or self-defined quantity by construction. No load-bearing self-citations, ansatzes, or renamings of known results are present in the described chain; the guarantees remain statistically grounded outside the model's internal fits.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 1 invented entities

The method rests on the assumption that a lightweight probe can forecast escalation value and that statistical calibration on held-out data transfers to deployment; these are not derived from first principles but introduced to enable the guarantees.

free parameters (1)

DV threshold
Chosen via multiple hypothesis testing on held-out data to enforce a target delegation rate.

axioms (1)

domain assumption The DV probe produces a signal whose ordering reflects true escalation benefit on new inputs
Invoked to justify instance-level delegation decisions and the calibration procedure.

invented entities (1)

Delegation Value (DV) probe no independent evidence
purpose: Lightweight model that predicts the benefit of escalating to the expert safety model
Operates on the same internal representations as the safety probe but is trained to output a delegation benefit signal.

pith-pipeline@v0.9.0 · 5504 in / 1383 out tokens · 49102 ms · 2026-05-10T13:03:54.012592+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

4 extracted references · 4 canonical work pages

[1]

write newline

" write newline "" before.all 'output.state := FUNCTION n.dashify 't := "" t empty not t #1 #1 substring "-" = t #1 #2 substring "--" = not "--" * t #2 global.max substring 't := t #1 #1 substring "-" = "-" * t #2 global.max substring 't := while if t #1 #1 substring * t #2 global.max substring 't := if while FUNCTION format.date year duplicate empty "emp...

work page
[2]

@esa (Ref

\@ifxundefined[1] #1\@undefined \@firstoftwo \@secondoftwo \@ifnum[1] #1 \@firstoftwo \@secondoftwo \@ifx[1] #1 \@firstoftwo \@secondoftwo [2] @ #1 \@temptokena #2 #1 @ \@temptokena \@ifclassloaded agu2001 natbib The agu2001 class already includes natbib coding, so you should not add it explicitly Type <Return> for now, but then later remove the command n...

work page
[3]

\@lbibitem[] @bibitem@first@sw\@secondoftwo \@lbibitem[#1]#2 \@extra@b@citeb \@ifundefined br@#2\@extra@b@citeb \@namedef br@#2 \@nameuse br@#2\@extra@b@citeb \@ifundefined b@#2\@extra@b@citeb @num @parse #2 @tmp #1 NAT@b@open@#2 NAT@b@shut@#2 \@ifnum @merge>\@ne @bibitem@first@sw \@firstoftwo \@ifundefined NAT@b*@#2 \@firstoftwo @num @NAT@ctr \@secondoft...

work page
[4]

Victoria Beckham

@open @close @open @close and [1] URL: #1 \@ifundefined chapter * \@mkboth \@ifxundefined @sectionbib * \@mkboth * \@mkboth\@gobbletwo \@ifclassloaded amsart * \@ifclassloaded amsbook * \@ifxundefined @heading @heading NAT@ctr thebibliography [1] @ \@biblabel @NAT@ctr \@bibsetup #1 @NAT@ctr @ @openbib .11em \@plus.33em \@minus.07em 4000 4000 `\.\@m @bibit...

work page arXiv 2023

[1] [1]

write newline

" write newline "" before.all 'output.state := FUNCTION n.dashify 't := "" t empty not t #1 #1 substring "-" = t #1 #2 substring "--" = not "--" * t #2 global.max substring 't := t #1 #1 substring "-" = "-" * t #2 global.max substring 't := while if t #1 #1 substring * t #2 global.max substring 't := if while FUNCTION format.date year duplicate empty "emp...

work page

[2] [2]

@esa (Ref

\@ifxundefined[1] #1\@undefined \@firstoftwo \@secondoftwo \@ifnum[1] #1 \@firstoftwo \@secondoftwo \@ifx[1] #1 \@firstoftwo \@secondoftwo [2] @ #1 \@temptokena #2 #1 @ \@temptokena \@ifclassloaded agu2001 natbib The agu2001 class already includes natbib coding, so you should not add it explicitly Type <Return> for now, but then later remove the command n...

work page

[3] [3]

\@lbibitem[] @bibitem@first@sw\@secondoftwo \@lbibitem[#1]#2 \@extra@b@citeb \@ifundefined br@#2\@extra@b@citeb \@namedef br@#2 \@nameuse br@#2\@extra@b@citeb \@ifundefined b@#2\@extra@b@citeb @num @parse #2 @tmp #1 NAT@b@open@#2 NAT@b@shut@#2 \@ifnum @merge>\@ne @bibitem@first@sw \@firstoftwo \@ifundefined NAT@b*@#2 \@firstoftwo @num @NAT@ctr \@secondoft...

work page

[4] [4]

Victoria Beckham

@open @close @open @close and [1] URL: #1 \@ifundefined chapter * \@mkboth \@ifxundefined @sectionbib * \@mkboth * \@mkboth\@gobbletwo \@ifclassloaded amsart * \@ifclassloaded amsbook * \@ifxundefined @heading @heading NAT@ctr thebibliography [1] @ \@biblabel @NAT@ctr \@bibsetup #1 @NAT@ctr @ @openbib .11em \@plus.33em \@minus.07em 4000 4000 `\.\@m @bibit...

work page arXiv 2023