arxiv: 2605.12051 · v1 · submitted 2026-05-12 · 💻 cs.LG


Learning plug-in surrogate endpoints for randomized experiments

Alessandro-Umberto Margueritte, Ahmet Zahid Balcıoğlu, Jesse Krijthe, Dave Zachariah, Fredrik D. Johansson


Pith reviewed 2026-05-13 06:10 UTC · model grok-4.3


keywords surrogate endpoints · randomized experiments · plug-in surrogates · effect predictiveness · causal inference · treatment effect estimation · machine learning for experiments


The pith

Plug-in composite surrogates learned by directly modeling the surrogate effect predict primary treatment effects more accurately than prior methods.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops practical methods to learn surrogate endpoints that can replace long-term primary outcomes in randomized experiments. These surrogates are defined as functions of post-treatment variables that plug directly into the analysis in place of the real outcome. The authors focus on maximizing how well the treatment effect measured on the surrogate predicts the effect on the primary outcome, creating criteria that can be optimized from data. This addresses the fact that many earlier formal definitions of good surrogates involve quantities that cannot be identified or estimated in practice. If the approach works, researchers could design shorter or less costly trials whose results reliably indicate what a full study on the primary outcome would have found.

Core claim

We propose two methods for learning plug-in composite surrogates that maximize effect predictiveness, characterize conditions under which such surrogates can yield unbiased estimates of the primary effect, and show through synthetic experiments with known effects and a real-world experiment that directly modeling the surrogate effect produces endpoints whose estimated treatment effects are more predictive of the primary effect than those from established methods.

What carries the argument

Plug-in composite surrogates, which are functions of post-treatment variables that substitute directly for the primary outcome when estimating treatment effects, optimized to maximize the predictiveness of the surrogate effect for the primary effect.
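
To make the plug-in idea concrete, here is a minimal sketch of substituting a composite surrogate for the primary outcome in a difference-in-means analysis. The toy data-generating process, the linear form of `composite_surrogate`, and all function names are assumptions for illustration; they are not taken from the paper, and the surrogate here is fixed by hand rather than learned.

```python
import numpy as np

def difference_in_means(outcome, treatment):
    """Standard randomized-experiment estimate: mean(treated) - mean(control)."""
    outcome = np.asarray(outcome, dtype=float)
    treated = np.asarray(treatment).astype(bool)
    return outcome[treated].mean() - outcome[~treated].mean()

def composite_surrogate(S):
    """Hypothetical composite surrogate: a fixed linear function of S."""
    return 2.0 * S[:, 0] + 1.0 * S[:, 1]

# Toy experiment (assumed): treatment shifts two post-treatment variables,
# and the primary outcome depends on them linearly plus noise.
rng = np.random.default_rng(0)
n = 4000
T = rng.integers(0, 2, size=n)
S = np.column_stack([rng.normal(0.5 * T, 1.0), rng.normal(-0.2 * T, 1.0)])
Y = 2.0 * S[:, 0] + 1.0 * S[:, 1] + rng.normal(0.0, 1.0, size=n)

# Same analysis, with h(S) plugged in where Y would normally go.
tau_primary = difference_in_means(Y, T)
tau_surrogate = difference_in_means(composite_surrogate(S), T)
print(f"primary effect: {tau_primary:.3f}, surrogate effect: {tau_surrogate:.3f}")
```

In this toy setup the surrogate reproduces the systematic part of the primary outcome, so the two estimates differ only by sampling noise; with real data no such agreement is assured.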

If this is right

Plug-in surrogates can be found that produce unbiased estimates of the primary treatment effect in representative scenarios.
Direct modeling of the surrogate effect outperforms established surrogate-learning approaches on both synthetic data with known ground truth and real experimental data.
The learned surrogates allow substitution into standard randomized-experiment analyses without additional adjustments.
The framework applies to settings where observing the primary outcome on the full cohort is prohibitively expensive.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same learning procedure could be tested on observational data under appropriate identification assumptions for the surrogate effect.
If the predictiveness criterion generalizes, trial designers could use it to select which short-term variables to measure when planning studies.
The characterization of unbiasedness might guide which post-treatment variables to collect to improve surrogate quality.

Load-bearing premise

That the effect predictiveness of a plug-in surrogate can be learned and optimized from data in a way that generalizes beyond the observed distribution without relying on unidentifiable causal quantities.

What would settle it

The claim would be undermined if, in a new randomized experiment with both surrogate and primary outcomes observed, the treatment effect estimated from the learned surrogate deviated from the primary effect by more than the effect estimated from baseline surrogates.
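
The comparison described above could be run along the following lines. This is a sketch only: the function names and the four-unit toy arrays are hypothetical, and a real check would also need uncertainty estimates for both deviations.

```python
import numpy as np

def diff_in_means(values, treatment):
    """Difference in group means between treated and control units."""
    values = np.asarray(values, dtype=float)
    treated = np.asarray(treatment).astype(bool)
    return values[treated].mean() - values[~treated].mean()

def surrogate_deviation(surrogate_values, primary, treatment):
    """Absolute gap between the surrogate-based and primary effect estimates."""
    return abs(diff_in_means(surrogate_values, treatment)
               - diff_in_means(primary, treatment))

# Hypothetical new experiment with both outcomes observed.
treatment = np.array([0, 0, 1, 1])
primary = np.array([1.0, 2.0, 3.0, 4.0])
learned = np.array([0.0, 1.0, 2.0, 3.2])    # values of a learned surrogate
baseline = np.array([0.0, 0.0, 1.0, 1.0])   # values of a baseline surrogate

dev_learned = surrogate_deviation(learned, primary, treatment)
dev_baseline = surrogate_deviation(baseline, primary, treatment)
print(dev_learned <= dev_baseline)
```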

Figures

Figures reproduced from arXiv: 2605.12051 by Ahmet Zahid Balcıoğlu, Alessandro-Umberto Margueritte, Dave Zachariah, Fredrik D. Johansson, Jesse Krijthe.

**Figure 3.** Comparison of $R^2$ computed via potential outcomes and regression across reported results. We omit results from scenarios b) and c) as the CATE is constant and $R^2$ is not meaningful.

… squared error within the leaf. For a leaf $L$, the optimal leaf prediction is the weighted mean $\hat{\mu}(L) = \frac{\sum_{i \in L} w_i \hat{h}_i}{\sum_{i \in L} w_i}$, and the corresponding leaf loss is $\mathrm{Err}(L) = \sum_{i \in L} w_i$ …
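
The caption's remark that $R^2$ is not meaningful when the CATE is constant can be illustrated with a short sketch. It assumes that $R^2$ here is the usual coefficient of determination, with the surrogate's conditional effect used directly as the prediction of the primary conditional effect; the paper's exact definition may differ, and `effect_r2` is a hypothetical helper name.

```python
import numpy as np

def effect_r2(tau_primary, tau_surrogate):
    """R^2 of surrogate conditional effects as predictions of primary ones.

    Returns nan when the primary effects have no variance, because the
    denominator of R^2 is then zero.
    """
    tau_primary = np.asarray(tau_primary, dtype=float)
    tau_surrogate = np.asarray(tau_surrogate, dtype=float)
    ss_tot = np.sum((tau_primary - tau_primary.mean()) ** 2)
    if np.isclose(ss_tot, 0.0):
        return float("nan")
    ss_res = np.sum((tau_primary - tau_surrogate) ** 2)
    return 1.0 - ss_res / ss_tot

# Heterogeneous effects: R^2 is well defined.
r2_varying = effect_r2([1.0, 2.0, 3.0, 4.0], [1.1, 1.9, 3.2, 3.8])

# Constant primary effect: total sum of squares is zero, so R^2 is undefined.
r2_constant = effect_r2([2.0, 2.0, 2.0], [1.9, 2.0, 2.1])
print(r2_varying, r2_constant)
```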

Original abstract

Surrogate endpoints are used in place of long-term outcomes in randomized experiments when observing the real outcome for a large enough cohort is prohibitively expensive or impractical. A short-term surrogate is good if the result of an experiment using the surrogate is predictive of the result of a hypothetical study using the real outcome. Much attention has been paid to formalizing this property in causal terms, but most criteria are unidentifiable and cannot be turned into practical algorithms for learning surrogate endpoints from data. To address this, we study plug-in composite surrogates, functions of post-treatment variables that may be substituted directly for the primary outcome in a randomized experiment. We propose two methods for learning plug-in surrogates that maximize effect predictiveness, and characterize the possibility of finding endpoints that yield unbiased effect estimates in representative scenarios. Finally, in both synthetic experiments with known effects and in data from a real-world experiment, we find that our method, based on directly modeling the surrogate effect, returns plug-in endpoints more predictive of the primary effect than established methods.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper gives two practical algorithms for learning plug-in surrogates that maximize effect predictiveness and beat baselines on synthetic data and one real dataset, but leaves generalization to new experiments untested.

The core contribution is two concrete methods for learning plug-in composite surrogates, that is, functions of post-treatment variables, optimized to make the surrogate-based treatment effect estimate as predictive as possible of the primary effect. One method models the surrogate effect directly and comes out ahead in their comparisons. This framing lets them work with an identifiable, optimizable criterion instead of the unidentifiable causal conditions common in earlier surrogate literature. They also sketch when such plug-ins can produce unbiased estimates in representative cases, which is useful grounding. The synthetic experiments with known ground-truth effects and the single real-world dataset give some empirical backing for the performance claim over established methods. That is the part worth paying attention to if you work on experiment design where long-term outcomes are expensive.

The main limitation is robustness to shifts. The surrogate is fitted to the observed joint distribution in the training data, so any change in how the primary outcome relates to the post-treatment variables, or in the distribution of those variables, can break the learned predictiveness. The paper does not report cross-experiment validation or sensitivity checks under distribution shift, which leaves open how well the learned mapping travels to a new randomized trial.

Readers working on causal ML or surrogate endpoints in medicine and policy will find the methods and comparisons directly usable. The work is coherent on its own terms and engages the literature without obvious internal contradictions, so it deserves a serious referee who can press on the generalization question and ask for more extensive validation. I would send it to review rather than desk reject.

Referee Report

1 major / 0 minor

Summary. The paper claims that plug-in composite surrogates (functions of post-treatment variables) can be learned from data to maximize effect predictiveness, i.e., how well the surrogate-based treatment effect predicts the primary-effect estimate. It proposes two such methods, one based on directly modeling the surrogate effect, characterizes conditions for unbiased effect estimates in representative scenarios, and reports that this method outperforms established approaches on synthetic data with known ground-truth effects as well as one real-world experiment.

Significance. If the central claim holds, the work supplies a practical, identifiable, and optimizable alternative to unidentifiable causal criteria for surrogate selection. The use of synthetic data with known effects for validation plus a real dataset, together with the unbiasedness characterization, are concrete strengths that support applicability in settings where long-term primary outcomes are expensive to measure.

major comments (1)

Experimental validation (synthetic experiments and real-world dataset): the reported gains in predictiveness are shown only when training and evaluation distributions coincide. No cross-experiment validation, hold-out experiments with shifted post-treatment marginals, or sensitivity analysis to changes in effect size or conditional law of the primary given the surrogates is described. Because the surrogate is a fitted function of post-treatment variables, any such shift can break the learned predictiveness mapping; this directly limits the scope of the central claim that the method returns reliable plug-in endpoints for new experiments.
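
The kind of sensitivity check the referee asks for might look like the sketch below. Everything in it is an assumption for illustration: the simulated experiments, the coefficient vectors, and above all the surrogate, which is a plain least-squares fit of the primary outcome on the post-treatment variables rather than either of the paper's methods.

```python
import numpy as np

def diff_in_means(values, treatment):
    """Difference in group means between treated and control units."""
    values = np.asarray(values, dtype=float)
    treated = np.asarray(treatment).astype(bool)
    return values[treated].mean() - values[~treated].mean()

def simulate(n, beta, rng):
    """Hypothetical experiment: treatment shifts S, and Y = S @ beta + noise."""
    T = rng.integers(0, 2, size=n)
    S = rng.normal(size=(n, 2)) + np.outer(T, [1.0, 0.5])
    Y = S @ beta + rng.normal(size=n)
    return T, S, Y

rng = np.random.default_rng(0)
beta_train = np.array([1.0, 1.0])
beta_shift = np.array([1.0, -1.0])  # conditional law of Y given S changes

# Stand-in surrogate: ordinary least squares of Y on S in the training experiment.
T0, S0, Y0 = simulate(5000, beta_train, rng)
design = np.column_stack([np.ones(len(S0)), S0])
coef, *_ = np.linalg.lstsq(design, Y0, rcond=None)

def surrogate(S):
    """Apply the fitted linear surrogate to new post-treatment variables."""
    return np.column_stack([np.ones(len(S)), S]) @ coef

def effect_gap(T, S, Y):
    """Absolute gap between surrogate-based and primary effect estimates."""
    return abs(diff_in_means(surrogate(S), T) - diff_in_means(Y, T))

gap_same = effect_gap(*simulate(5000, beta_train, rng))
gap_shift = effect_gap(*simulate(5000, beta_shift, rng))
print(f"same distribution: {gap_same:.3f}, shifted: {gap_shift:.3f}")
```

In this toy setup the gap stays near zero when the new experiment matches the training one and grows when the relation between the primary outcome and the post-treatment variables changes, which is the failure mode the comment describes.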

Simulated Authors' Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive comments. We address the major concern on experimental validation below.


Referee: Experimental validation (synthetic experiments and real-world dataset): the reported gains in predictiveness are shown only when training and evaluation distributions coincide. No cross-experiment validation, hold-out experiments with shifted post-treatment marginals, or sensitivity analysis to changes in effect size or conditional law of the primary given the surrogates is described. Because the surrogate is a fitted function of post-treatment variables, any such shift can break the learned predictiveness mapping; this directly limits the scope of the central claim that the method returns reliable plug-in endpoints for new experiments.

Authors: We agree that robustness to distribution shifts is important for claiming reliable use in new experiments. Our current experiments follow the standard setup in surrogate learning where the surrogate is trained and evaluated on data from the same distribution, which ensures the effect predictiveness metric is identifiable. The paper's theoretical characterization of unbiasedness conditions explicitly delineates when the plug-in surrogate yields valid estimates, providing guidance for deployment in similar settings. To strengthen the empirical support, we will add cross-experiment validation on additional real-world datasets and sensitivity analyses to effect-size and conditional-distribution shifts in the revision.

Revision: yes

Circularity Check

0 steps flagged

No circularity; empirical validation on held-out synthetic and real data is independent of the fitting procedure.

The paper defines plug-in surrogates as functions of post-treatment variables and proposes two optimization procedures that directly maximize an effect-predictiveness criterion estimated from data in which both surrogate and primary outcomes are jointly observed. Performance is then evaluated on synthetic data generated with known ground-truth effects and on a separate real-world experiment; neither the reported superiority nor the unbiasedness characterizations reduce to the fitted values by construction. No self-citation is invoked as a uniqueness theorem, no ansatz is smuggled, and no known empirical pattern is merely renamed. The derivation chain therefore remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Abstract-only review limits visibility into specific parameters or assumptions; the work rests on standard causal assumptions for randomized experiments and the premise that effect predictiveness is a learnable and useful criterion.

axioms (1)

Domain assumption: Plug-in surrogates are functions of post-treatment variables that can be directly substituted for the primary outcome in analysis of randomized experiments.
Explicitly stated as the object of study in the abstract.

