
arxiv: 2512.00405 · v2 · submitted 2025-11-29 · 📊 stat.ME

Evaluating Surrogates in Individualized Treatment Rules

Pith reviewed 2026-05-17 03:44 UTC · model grok-4.3

classification 📊 stat.ME
keywords surrogate endpoints · individualized treatment rules · causal inference · decision making · AIPW estimation · budget constraints · performance evaluation

The pith

Surrogates strongly linked to primary outcomes can still produce poor individualized treatment rules when resources are budgeted.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

When primary outcomes are costly to measure, researchers often train individualized treatment rules using surrogate endpoints instead. The paper shows that high correlation or compliance with classic surrogate criteria does not guarantee that the resulting rule will perform well on the true outcome, especially when treatment is limited by a budget. To address this gap, the authors define three decision-focused metrics: surrogate regret, which measures the extra loss from following the surrogate rule rather than the true optimal rule; surrogate gain, which measures benefit over a no-treatment baseline; and surrogate efficiency, which measures gain over random assignment. They extend all three metrics to budget-constrained settings and supply augmented inverse-probability-weighted estimators whose large-sample behavior is established under standard causal conditions.

Core claim

A surrogate that satisfies existing association or causal criteria may nevertheless induce an individualized treatment rule whose value on the primary outcome falls short of the outcome-optimal rule, particularly under resource limits. The paper therefore introduces surrogate regret, surrogate gain, and surrogate efficiency as direct measures of a surrogate's decision-making value, provides AIPW estimators for them, and proves consistency and asymptotic normality of those estimators.
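
As a hedged sketch, the three measures can be written in terms of a value function. The notation below is assumed for illustration and is not taken from the paper: $Y$ is the primary outcome, $S$ the surrogate, larger outcomes are better, $\pi_0$ treats nobody, and $\pi_{\mathrm{rand}}$ assigns treatment at random. The paper's exact normalization (for example, differences versus ratios) may differ.

```latex
\begin{align*}
V(\pi) &= \mathbb{E}\bigl[Y(\pi(X))\bigr], \\
\pi_Y^{*} &\in \arg\max_{\pi} V(\pi), \qquad
\pi_S^{*} \in \arg\max_{\pi} \mathbb{E}\bigl[S(\pi(X))\bigr], \\
\text{surrogate regret} &= V(\pi_Y^{*}) - V(\pi_S^{*}) \;\ge\; 0, \\
\text{surrogate gain} &= V(\pi_S^{*}) - V(\pi_0), \qquad \pi_0(x) \equiv 0, \\
\text{surrogate efficiency} &= V(\pi_S^{*}) - V(\pi_{\mathrm{rand}}).
\end{align*}
```

Under this reading, regret is nonnegative by construction because $\pi_Y^{*}$ maximizes $V$, while gain and efficiency can take either sign.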

What carries the argument

Three ITR-oriented performance measures—surrogate regret (expected loss gap between surrogate-optimal and outcome-optimal rules), surrogate gain (benefit over no treatment), and surrogate efficiency (gain over random assignment)—together with their budget-constrained extensions and the corresponding AIPW estimators.

If this is right

  • Surrogate regret directly quantifies how much primary-outcome performance is sacrificed by using the surrogate-optimal rule.
  • Surrogate gain and efficiency together show whether the surrogate rule improves on both the no-treatment baseline and random assignment.
  • The same three measures remain well-defined and estimable when a fixed treatment budget must be respected.
  • AIPW estimators recover the population values of the measures at root-n rate under correct nuisance-model specification.
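
A minimal sketch of an AIPW value estimate for a fixed binary treatment rule follows. It uses the standard textbook AIPW form and is not the authors' implementation; the function name `aipw_value` and the toy data-generating process are hypothetical. The true nuisance functions are plugged in here only so the result can be checked against a known value (0.25 for this toy design); in practice they would be estimated, commonly with cross-fitting.

```python
import numpy as np

def aipw_value(y, a, rule, e_hat, mu1_hat, mu0_hat):
    """AIPW estimate of the mean primary outcome under a binary rule.

    y, a     : observed primary outcome and binary treatment (0/1)
    rule     : treatment the rule recommends for each unit (0/1)
    e_hat    : estimated propensity score P(A=1 | X)
    mu1_hat  : estimated outcome regression E[Y | X, A=1]
    mu0_hat  : estimated outcome regression E[Y | X, A=0]
    Returns the point estimate and a plug-in standard error.
    """
    y, a, rule = (np.asarray(v, dtype=float) for v in (y, a, rule))
    e_hat, mu1_hat, mu0_hat = (np.asarray(v, dtype=float)
                               for v in (e_hat, mu1_hat, mu0_hat))
    # Outcome model and treatment probability along the rule's recommendation.
    mu_rule = np.where(rule == 1, mu1_hat, mu0_hat)
    p_rule = np.where(rule == 1, e_hat, 1.0 - e_hat)
    # Units whose observed treatment agrees with the rule get a residual correction.
    follows = (a == rule).astype(float)
    scores = mu_rule + follows / p_rule * (y - mu_rule)
    return scores.mean(), scores.std(ddof=1) / np.sqrt(len(scores))

# Toy randomized design (assumed): effect on the primary outcome equals x.
rng = np.random.default_rng(0)
n = 20_000
x = rng.uniform(-1.0, 1.0, size=n)
a = rng.binomial(1, 0.5, size=n)
y = a * x + rng.normal(size=n)
rule = (x > 0).astype(int)  # treat only units with a positive effect

est, se = aipw_value(y, a, rule,
                     e_hat=np.full(n, 0.5), mu1_hat=x, mu0_hat=np.zeros(n))
print(f"AIPW value estimate: {est:.3f} (SE {se:.3f})")
```

Differences of such value estimates across rules (surrogate-optimal, outcome-optimal, no treatment, random) would then give plug-in versions of regret, gain, and efficiency.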

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Surrogate validation in practice should shift from pure predictive accuracy toward explicit decision-value metrics like regret.
  • The framework could be used to rank candidate surrogates before a large primary-outcome trial is launched.
  • Extending the measures to dynamic or multi-stage treatment settings would allow evaluation of time-varying surrogates.

Load-bearing premise

The consistency and asymptotic normality of the AIPW estimators depend on the usual causal assumptions (consistency, positivity, no unmeasured confounding) and on adequately specified propensity-score and outcome-regression models.

What would settle it

In a controlled simulation where the true outcome-optimal rule is known, if the estimated surrogate regret remains positive yet the surrogate rule actually produces lower primary-outcome loss than the outcome-optimal rule, the regret measure would be falsified.
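
A controlled check of this kind could start from a ground truth like the sketch below. The effect functions `tau_y` and `tau_s` are hypothetical choices, not the paper's simulation design, and random assignment is assumed to treat at the same rate as the surrogate rule. Values are reported relative to the no-treatment baseline, so a rule's value is the mean of `tau_y` over the units it treats.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=200_000)

# Hypothetical conditional treatment effects (assumptions for illustration).
tau_y = x          # effect on the primary outcome
tau_s = x + 0.3    # effect on the surrogate: perfectly correlated, but shifted

rule_y = (tau_y > 0).astype(float)  # outcome-optimal rule
rule_s = (tau_s > 0).astype(float)  # surrogate-optimal rule

def value_gain(rule):
    """Primary-outcome value of a rule relative to treating nobody."""
    return float(np.mean(tau_y * rule))

regret = value_gain(rule_y) - value_gain(rule_s)
gain = value_gain(rule_s)
# Random assignment at the surrogate rule's treatment rate (assumed comparator).
random_gain = float(rule_s.mean() * tau_y.mean())
efficiency = gain - random_gain

print(f"regret={regret:.4f} gain={gain:.4f} efficiency={efficiency:.4f}")
```

Here the surrogate effect is perfectly correlated with the primary-outcome effect, yet the surrogate rule also treats units with `x` between -0.3 and 0, for whom treatment lowers the primary outcome, so the true regret is strictly positive. Estimated regret from an AIPW procedure could then be compared against this known value.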

Original abstract

In many decision-making problems, the primary outcome is expensive, time-consuming, or difficult to observe, so individualized treatment rules (ITRs) may be instead learned from surrogate endpoints. However, a surrogate that is highly associated with the primary outcome, or even satisfies existing surrogate criteria, may not necessarily induce a treatment rule that performs well on the primary outcome, especially under treatment resource budget constraints. In this paper, we develop a principled framework for evaluating the decision-making value of surrogate endpoints. We introduce three ITR-oriented performance measures: surrogate regret, which assesses the expected loss from using the surrogate-optimal ITR instead of outcome-optimal ITR; surrogate gain, which quantifies the benefit of surrogate-optimal ITRs relative to the no-treatment baseline; and surrogate efficiency, which evaluates improvement over random treatment assignment. We also extend them to budget-constrained settings. We propose augmented inverse probability weighted (AIPW) estimators for these measures and establish their large-sample properties. We demonstrate the proposed approach on both simulations and an application to the Criteo dataset.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript develops a framework for evaluating surrogate endpoints specifically for their value in learning individualized treatment rules (ITRs) on a primary outcome. It introduces surrogate regret (expected loss from using the surrogate-optimal ITR versus the outcome-optimal ITR), surrogate gain (benefit relative to no-treatment), and surrogate efficiency (improvement over random assignment), with extensions to budget-constrained ITRs. AIPW estimators are proposed for these functionals, large-sample properties are derived, and the approach is illustrated in simulations and on the Criteo dataset.

Significance. If the results hold, the work is significant for shifting surrogate evaluation from association or traditional criteria to direct decision-making performance on the primary outcome. Defining the measures as functionals of the induced rule's value avoids circularity and enables falsifiable assessment, especially under resource constraints where association alone is insufficient. The AIPW estimators and asymptotic results follow from standard semiparametric efficiency theory, and the budget-constrained extension preserves pathwise differentiability under maintained assumptions. The simulation and Criteo application provide concrete demonstrations.

major comments (1)
  1. §4.2: The extension to budget-constrained ITRs replaces the unconstrained argmax with a thresholded selection rule; while the value function is claimed to remain pathwise differentiable, the manuscript should explicitly derive the influence function to confirm that the non-smooth thresholding does not alter the n^{-1/2} rate under the stated positivity and consistency assumptions.
minor comments (2)
  1. The simulation section should report the specific nuisance estimation methods (e.g., random forests or neural nets) and their hyperparameter choices, as these affect the finite-sample performance of the AIPW estimators.
  2. Notation for the feasible set under the budget constraint (e.g., the definition of the threshold) could be introduced earlier to improve readability when comparing constrained and unconstrained versions of the performance measures.
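
To make the thresholded selection rule discussed in the comments above concrete, here is a minimal sketch under stated assumptions. The function name `budgeted_rule`, the convention that the budget is a maximum treated fraction, and the choice to never treat units with non-positive scores are all hypothetical; the paper's own threshold definition may differ.

```python
import numpy as np

def budgeted_rule(score, budget):
    """Treat the units with the highest positive scores, up to a budget.

    score  : estimated benefit score per unit (e.g. a surrogate effect estimate)
    budget : maximum fraction of units that may be treated, in [0, 1]
    Returns a 0/1 array of treatment recommendations.
    """
    score = np.asarray(score, dtype=float)
    n_treat = int(np.floor(budget * len(score)))
    rule = np.zeros(len(score), dtype=int)
    if n_treat == 0:
        return rule
    # Rank units by descending score and keep the top n_treat candidates.
    top = np.argsort(-score, kind="stable")[:n_treat]
    # Among those, treat only units whose score is positive.
    rule[top[score[top] > 0]] = 1
    return rule

scores = [0.5, -0.2, 0.9, 0.1, 0.3]
half_budget = budgeted_rule(scores, budget=0.5)  # at most 2 of 5 units treated
full_budget = budgeted_rule(scores, budget=1.0)  # budget no longer binds
print(half_budget, full_budget)
```

The implied threshold is the score of the last treated unit. Because the rule changes discontinuously as scores cross that threshold, this is the non-smooth step the referee asks the authors to handle explicitly.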

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the positive assessment of our work and for the constructive comment on the budget-constrained extension. We address the point below and have revised the manuscript to incorporate an explicit derivation as suggested.

Point-by-point responses
  1. Referee: §4.2: The extension to budget-constrained ITRs replaces the unconstrained argmax with a thresholded selection rule; while the value function is claimed to remain pathwise differentiable, the manuscript should explicitly derive the influence function to confirm that the non-smooth thresholding does not alter the n^{-1/2} rate under the stated positivity and consistency assumptions.

    Authors: We thank the referee for this observation. We agree that an explicit derivation of the influence function strengthens the theoretical justification. In the revised manuscript, Section 4.2 has been expanded to include a full derivation of the influence function for the budget-constrained value functional. The derivation proceeds by expressing the value as a composition of the expectation operator with the thresholded selection rule and applying the chain rule for pathwise differentiability. Under the maintained positivity assumption (ensuring the threshold lies in the interior of the support with positive probability) and consistency of the nuisance estimators, the set of non-differentiability induced by the thresholding operation has Lebesgue measure zero and does not alter the n^{-1/2} rate or asymptotic normality. The resulting influence function is provided explicitly and shown to be square-integrable, confirming that the AIPW estimator retains its semiparametric efficiency properties.

    Revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper defines surrogate regret, gain, and efficiency directly as functionals of the value function of the surrogate-induced ITR evaluated on the primary outcome. AIPW estimators target these identified quantities under standard causal assumptions (consistency, positivity, no unmeasured confounding) with nuisance models estimated at sufficient rates. No derivation step reduces by construction to a fitted parameter or self-citation chain; the budget-constrained extensions preserve pathwise differentiability without tautological redefinition. The central claim is supported by explicit contrast between association-based surrogates and ITR value on the primary outcome, independent of the estimation procedure.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The framework rests on standard identification assumptions for causal effects and correct specification of working models for AIPW; no new entities are postulated and no free parameters are introduced in the abstract description.

axioms (2)
  • domain assumption: Consistency, positivity, and no unmeasured confounding for identification of treatment effects.
    Required for the expectations in surrogate regret, gain, and efficiency to be identified from observed data.
  • domain assumption: Correct specification of propensity score and outcome regression models for AIPW.
    Needed for the estimators to achieve the claimed large-sample properties.

pith-pipeline@v0.9.0 · 5479 in / 1305 out tokens · 43610 ms · 2026-05-17T03:44:31.597354+00:00 · methodology

