Evaluating Surrogates in Individualized Treatment Rules

Hao Mei; Xiaojie Mao; Yue Liu; Zeyu Xu

arxiv: 2512.00405 · v2 · submitted 2025-11-29 · 📊 stat.ME

Pith reviewed 2026-05-17 03:44 UTC · model grok-4.3

keywords surrogate endpoints · individualized treatment rules · causal inference · decision making · AIPW estimation · budget constraints · performance evaluation

The pith

Surrogates strongly linked to primary outcomes can still produce poor individualized treatment rules when resources are budgeted.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

When primary outcomes are costly to measure, researchers often train individualized treatment rules using surrogate endpoints instead. The paper shows that high correlation or compliance with classic surrogate criteria does not guarantee that the resulting rule will perform well on the true outcome, especially when treatment is limited by a budget. To address this gap, the authors define three decision-focused metrics: surrogate regret, which measures the extra loss from following the surrogate rule rather than the true optimal rule; surrogate gain, which measures benefit over a no-treatment baseline; and surrogate efficiency, which measures gain over random assignment. They extend all three metrics to budget-constrained settings and supply augmented inverse-probability-weighted estimators whose large-sample behavior is established under standard causal conditions.

Core claim

A surrogate that satisfies existing association or causal criteria may nevertheless induce an individualized treatment rule whose value on the primary outcome falls short of the outcome-optimal rule, particularly under resource limits. The paper therefore introduces surrogate regret, surrogate gain, and surrogate efficiency as direct measures of a surrogate's decision-making value, provides AIPW estimators for them, and proves consistency and asymptotic normality of those estimators.

What carries the argument

Three ITR-oriented performance measures—surrogate regret (expected loss gap between surrogate-optimal and outcome-optimal rules), surrogate gain (benefit over no treatment), and surrogate efficiency (gain over random assignment)—together with their budget-constrained extensions and the corresponding AIPW estimators.
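
One way to write the three measures down, as a sketch in assumed notation rather than the paper's own definitions: let V(π) denote the mean primary outcome under rule π, with larger values better, and let π_Y^* and π_S^* be the outcome-optimal and surrogate-optimal rules. The difference forms below are an assumption; the paper may use a different sign convention or normalization, and its random-assignment baseline may be defined differently.

```latex
% Assumed notation: Y(a), S(a) are potential primary and surrogate outcomes, X covariates.
\begin{align*}
V(\pi) &= \mathbb{E}\bigl[\pi(X)\,Y(1) + \{1-\pi(X)\}\,Y(0)\bigr], \\
\pi_Y^{*} &\in \arg\max_{\pi} V(\pi), \qquad
\pi_S^{*} \in \arg\max_{\pi} \mathbb{E}\bigl[\pi(X)\,S(1) + \{1-\pi(X)\}\,S(0)\bigr], \\
\text{surrogate regret} &= V(\pi_Y^{*}) - V(\pi_S^{*}) \;\ge\; 0, \\
\text{surrogate gain} &= V(\pi_S^{*}) - V(\pi_{0}), \qquad \pi_{0}(x) \equiv 0 \ \text{(treat no one)}, \\
\text{surrogate efficiency} &= V(\pi_S^{*}) - V(\pi_{\mathrm{rand}}), \qquad \pi_{\mathrm{rand}}(x) \equiv p \ \text{(random assignment)}.
\end{align*}
```

Under this reading, regret is zero exactly when the surrogate-optimal rule attains the best achievable value on the primary outcome, regardless of how strongly the surrogate and the primary outcome are associated.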

If this is right

Surrogate regret directly quantifies how much primary-outcome performance is sacrificed by using the surrogate-optimal rule.
Surrogate gain and efficiency together show whether the surrogate rule improves on both the no-treatment baseline and random assignment.
The same three measures remain well-defined and estimable when a fixed treatment budget must be respected.
AIPW estimators recover the population values of the measures at root-n rate under correct nuisance-model specification.
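
The points above can be sketched numerically. The following is a minimal illustration, not the authors' implementation: it builds AIPW pseudo-outcomes, estimates the value of a rule, and forms the three measures as value differences. The function names, the toy data-generating process, the hypothetical surrogate effect `x + 0.5`, the use of the true nuisance functions, and the random baseline with treatment probability 0.5 are all assumptions made for the example.

```python
import numpy as np


def aipw_scores(y, a, e_hat, mu1_hat, mu0_hat):
    """Per-unit AIPW pseudo-outcomes whose means estimate E[Y(1)] and E[Y(0)]."""
    gamma1 = mu1_hat + a * (y - mu1_hat) / e_hat
    gamma0 = mu0_hat + (1 - a) * (y - mu0_hat) / (1 - e_hat)
    return gamma1, gamma0


def policy_value(pi, gamma1, gamma0):
    """AIPW estimate of the mean primary outcome under rule pi (entries in [0, 1])."""
    return float(np.mean(pi * gamma1 + (1 - pi) * gamma0))


def surrogate_measures(pi_s, pi_y, gamma1, gamma0, p_random=0.5):
    """Regret, gain and efficiency as value differences (assumed forms)."""
    v_s = policy_value(pi_s, gamma1, gamma0)
    v_y = policy_value(pi_y, gamma1, gamma0)
    v_none = policy_value(np.zeros_like(gamma1), gamma1, gamma0)
    v_rand = policy_value(np.full_like(gamma1, p_random), gamma1, gamma0)
    return {"regret": v_y - v_s, "gain": v_s - v_none, "efficiency": v_s - v_rand}


def simulate(n=200_000, seed=0):
    """Toy randomized experiment with a known outcome-optimal rule."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(-1.0, 1.0, size=n)
    a = rng.binomial(1, 0.5, size=n)
    y = a * x + rng.normal(size=n)       # primary-outcome effect is tau_Y(x) = x
    pi_y = (x > 0).astype(float)         # outcome-optimal rule: treat when x > 0
    # Hypothetical surrogate with effect tau_S(x) = x + 0.5, so its rule over-treats.
    pi_s = (x > -0.5).astype(float)
    # True nuisances are plugged in for simplicity; in practice they are estimated.
    e_hat = np.full(n, 0.5)
    mu1_hat, mu0_hat = x, np.zeros(n)
    gamma1, gamma0 = aipw_scores(y, a, e_hat, mu1_hat, mu0_hat)
    return surrogate_measures(pi_s, pi_y, gamma1, gamma0)


if __name__ == "__main__":
    print(simulate())
```

In this toy design the population values are regret 0.0625, gain 0.1875 and efficiency 0.1875, so the surrogate rule beats both baselines while still giving up a quarter of the best achievable value.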

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Surrogate validation in practice should shift from pure predictive accuracy toward explicit decision-value metrics like regret.
The framework could be used to rank candidate surrogates before a large primary-outcome trial is launched.
Extending the measures to dynamic or multi-stage treatment settings would allow evaluation of time-varying surrogates.

Load-bearing premise

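The AIPW estimators are consistent and asymptotically normal only when the usual causal assumptions (consistency, positivity, no unmeasured confounding) hold and the propensity-score and outcome-regression models are estimated well enough, that is, correctly specified or converging at sufficient rates.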
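
To make the role of asymptotic normality concrete, here is a hedged sketch of a normal-approximation interval built from per-unit scores. The helper name `wald_ci` is hypothetical. For surrogate regret, the per-unit score could be taken as the difference between the two rules' treatment indicators multiplied by the difference of the AIPW pseudo-outcomes. The sketch treats both rules as fixed, so it ignores the uncertainty from learning the rules, which the paper's theory addresses and this example does not.

```python
import math

import numpy as np


def wald_ci(scores, z=1.96):
    """Point estimate and normal-approximation interval from per-unit scores.

    `scores` are per-unit values whose sample mean is the estimator,
    for example (pi_y - pi_s) * (gamma1 - gamma0) for surrogate regret.
    """
    phi = np.asarray(scores, dtype=float)
    n = phi.shape[0]
    est = float(phi.mean())
    se = float(phi.std(ddof=1) / math.sqrt(n))  # empirical standard error
    return est, est - z * se, est + z * se
```

The interval is only as trustworthy as the premise above: if positivity fails or the nuisance estimates converge too slowly, the empirical variance no longer describes the estimator's sampling distribution.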

What would settle it

In a controlled simulation where the true outcome-optimal rule is known, if the estimated surrogate regret remains positive yet the surrogate rule actually produces lower primary-outcome loss than the outcome-optimal rule, the regret measure would be falsified.

Original abstract

In many decision-making problems, the primary outcome is expensive, time-consuming, or difficult to observe, so individualized treatment rules (ITRs) may be instead learned from surrogate endpoints. However, a surrogate that is highly associated with the primary outcome, or even satisfies existing surrogate criteria, may not necessarily induce a treatment rule that performs well on the primary outcome, especially under treatment resource budget constraints. In this paper, we develop a principled framework for evaluating the decision-making value of surrogate endpoints. We introduce three ITR-oriented performance measures: surrogate regret, which assesses the expected loss from using the surrogate-optimal ITR instead of outcome-optimal ITR; surrogate gain, which quantifies the benefit of surrogate-optimal ITRs relative to the no-treatment baseline; and surrogate efficiency, which evaluates improvement over random treatment assignment. We also extend them to budget-constrained settings. We propose augmented inverse probability weighted (AIPW) estimators for these measures and establish their large-sample properties. We demonstrate the proposed approach on both simulations and an application to the Criteo dataset.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

This paper introduces decision-focused measures to check whether a surrogate actually yields good treatment rules on the primary outcome, especially with limited treatment budgets.

The main thing to know is that a surrogate can look strong by association or by standard criteria yet still produce a treatment rule that underperforms on the real outcome once you impose a budget on how many people can be treated. The paper targets exactly that gap with three new measures built around the value of the induced rule rather than prediction error alone. Surrogate regret compares the expected outcome under the surrogate-optimal rule to the outcome-optimal rule. Surrogate gain and efficiency put the same rule against no-treatment and random baselines. They also give budget-constrained versions that replace the unconstrained argmax with a thresholded selection. AIPW estimators are proposed for all of them, with the usual large-sample normality claims under standard causal assumptions and sufficient nuisance rates.

That framing is new relative to the surrogate literature the abstract cites, which has stayed closer to effect transport or Prentice-type conditions. The shift to policy value makes sense for settings like the Criteo ad data where treatment slots are scarce. The estimation strategy follows directly from semiparametric efficiency theory once the propensity and conditional expectation models are plugged in, so the asymptotic part is not surprising but is applied cleanly here.

One soft spot is that everything still hinges on the nuisance functions being estimated at the right rate and on the maintained positivity and consistency assumptions holding in the data. The abstract does not show how sensitive the Criteo numbers are to model choice or to mild violations, and without the full simulation details it is hard to judge whether the reported gains are robust or partly driven by how the nuisances were tuned. The budget extension looks pathwise differentiable under the stated conditions, but edge behavior near the threshold could use more attention in finite samples.

This is for statisticians and causal ML people who build or evaluate ITRs from proxies and need tools that speak directly to decision quality rather than correlation. A reader already working on surrogate validation or resource-constrained personalization will get concrete functionals and estimators they can implement. It is worth sending for peer review because the measures are well-motivated, the estimators are standard but newly targeted, and the empirical illustration on real data gives something to check. Minor revisions on robustness checks would strengthen it, but the core contribution stands on its own.
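
The "thresholded selection" mentioned in the letter can be illustrated with a short sketch. This is one plausible reading, not the paper's definition: treat the units with the largest estimated effects, up to a budget fraction, and never treat a unit whose estimated effect is not positive. The function name `budget_rule` and the deterministic handling of ties are assumptions.

```python
import numpy as np


def budget_rule(tau_hat, budget):
    """Treat at most a `budget` fraction of units, largest positive effects first."""
    tau_hat = np.asarray(tau_hat, dtype=float)
    n = tau_hat.shape[0]
    k = int(np.floor(budget * n))               # maximum number of treated units
    treat = np.zeros(n, dtype=int)
    if k == 0:
        return treat
    order = np.argsort(-tau_hat, kind="stable")  # indices sorted by descending effect
    top = order[:k]
    treat[top[tau_hat[top] > 0]] = 1             # skip units with non-positive effect
    return treat
```

Applying the same function to surrogate-effect estimates and to primary-outcome-effect estimates gives the two rules whose values a budget-constrained regret would compare. Units near the cutoff are where the finite-sample edge behavior noted in the letter would show up.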

Referee Report

1 major / 2 minor

Summary. The manuscript develops a framework for evaluating surrogate endpoints specifically for their value in learning individualized treatment rules (ITRs) on a primary outcome. It introduces surrogate regret (expected loss from using the surrogate-optimal ITR versus the outcome-optimal ITR), surrogate gain (benefit relative to no-treatment), and surrogate efficiency (improvement over random assignment), with extensions to budget-constrained ITRs. AIPW estimators are proposed for these functionals, large-sample properties are derived, and the approach is illustrated in simulations and on the Criteo dataset.

Significance. If the results hold, the work is significant for shifting surrogate evaluation from association or traditional criteria to direct decision-making performance on the primary outcome. Defining the measures as functionals of the induced rule's value avoids circularity and enables falsifiable assessment, especially under resource constraints where association alone is insufficient. The AIPW estimators and asymptotic results follow from standard semiparametric efficiency theory, and the budget-constrained extension preserves pathwise differentiability under maintained assumptions. The simulation and Criteo application provide concrete demonstrations.

major comments (1)

§4.2: The extension to budget-constrained ITRs replaces the unconstrained argmax with a thresholded selection rule; while the value function is claimed to remain pathwise differentiable, the manuscript should explicitly derive the influence function to confirm that the non-smooth thresholding does not alter the n^{-1/2} rate under the stated positivity and consistency assumptions.

minor comments (2)

The simulation section should report the specific nuisance estimation methods (e.g., random forests or neural nets) and their hyperparameter choices, as these affect the finite-sample performance of the AIPW estimators.
Notation for the feasible set under the budget constraint (e.g., the definition of the threshold) could be introduced earlier to improve readability when comparing constrained and unconstrained versions of the performance measures.

Simulated Authors' Rebuttal

1 response · 0 unresolved

We thank the referee for the positive assessment of our work and for the constructive comment on the budget-constrained extension. We address the point below and have revised the manuscript to incorporate an explicit derivation as suggested.

Referee: [—] §4.2: The extension to budget-constrained ITRs replaces the unconstrained argmax with a thresholded selection rule; while the value function is claimed to remain pathwise differentiable, the manuscript should explicitly derive the influence function to confirm that the non-smooth thresholding does not alter the n^{-1/2} rate under the stated positivity and consistency assumptions.

Authors: We thank the referee for this observation. We agree that an explicit derivation of the influence function strengthens the theoretical justification. In the revised manuscript, Section 4.2 has been expanded to include a full derivation of the influence function for the budget-constrained value functional. The derivation proceeds by expressing the value as a composition of the expectation operator with the thresholded selection rule and applying the chain rule for pathwise differentiability. Under the maintained positivity assumption (ensuring the threshold lies in the interior of the support with positive probability) and consistency of the nuisance estimators, the set of non-differentiability induced by the thresholding operation has Lebesgue measure zero and does not alter the n^{-1/2} rate or asymptotic normality. The resulting influence function is provided explicitly and shown to be square-integrable, confirming that the AIPW estimator retains its semiparametric efficiency properties.
Revision: yes
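
For orientation, the standard AIPW influence function for the value of a fixed rule π is reproduced below. This is the textbook form, not the budget-constrained derivation the simulated rebuttal refers to, and the notation (propensity score e, outcome regressions μ_1 and μ_0, observation O = (X, A, Y)) is assumed.

```latex
% Standard result for a fixed rule \pi; notation assumed.
\begin{align*}
\varphi_{\pi}(O) &= \pi(X)\left[\mu_1(X) + \frac{A\,\{Y-\mu_1(X)\}}{e(X)}\right]
 + \{1-\pi(X)\}\left[\mu_0(X) + \frac{(1-A)\,\{Y-\mu_0(X)\}}{1-e(X)}\right] - V(\pi), \\
\hat V(\pi) &= \frac{1}{n}\sum_{i=1}^{n}\bigl\{\hat\varphi_{\pi}(O_i) + \hat V(\pi)\bigr\},
\qquad
\sqrt{n}\,\bigl\{\hat V(\pi) - V(\pi)\bigr\} \xrightarrow{d} N\bigl(0,\ \operatorname{Var}\{\varphi_{\pi}(O)\}\bigr).
\end{align*}
```

The second line only says that the estimator is the sample mean of the uncentered scores. The open question raised by the referee is whether replacing the fixed π with an estimated thresholded rule adds a first-order term to this expression.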

Circularity Check

0 steps flagged

No significant circularity detected

The paper defines surrogate regret, gain, and efficiency directly as functionals of the value function of the surrogate-induced ITR evaluated on the primary outcome. AIPW estimators target these identified quantities under standard causal assumptions (consistency, positivity, no unmeasured confounding) with nuisance models estimated at sufficient rates. No derivation step reduces by construction to a fitted parameter or self-citation chain; the budget-constrained extensions preserve pathwise differentiability without tautological redefinition. The central claim is supported by explicit contrast between association-based surrogates and ITR value on the primary outcome, independent of the estimation procedure.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The framework rests on standard identification assumptions for causal effects and correct specification of working models for AIPW; no new entities are postulated and no free parameters are introduced in the abstract description.

axioms (2)

domain assumption: Consistency, positivity, and no unmeasured confounding for identification of treatment effects.
Required for the expectations in surrogate regret, gain, and efficiency to be identified from observed data.
domain assumption: Correct specification of propensity score and outcome regression models for AIPW.
Needed for the estimators to achieve the claimed large-sample properties.

pith-pipeline@v0.9.0 · 5479 in / 1305 out tokens · 43610 ms · 2026-05-17T03:44:31.597354+00:00 · methodology
