Learning plug-in surrogate endpoints for randomized experiments
Pith reviewed 2026-05-13 06:10 UTC · model grok-4.3
The pith
Plug-in composite surrogates learned by directly modeling the surrogate effect predict primary treatment effects more accurately than prior methods.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We propose two methods for learning plug-in composite surrogates that maximize effect predictiveness, characterize conditions under which such surrogates can yield unbiased estimates of the primary effect, and show through synthetic experiments with known effects and a real-world experiment that directly modeling the surrogate effect produces endpoints whose estimated treatment effects are more predictive of the primary effect than those from established methods.
What carries the argument
Plug-in composite surrogates, which are functions of post-treatment variables that substitute directly for the primary outcome when estimating treatment effects, optimized to maximize the predictiveness of the surrogate effect for the primary effect.
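To make the plug-in idea concrete, here is a minimal sketch in Python with NumPy. It is an illustration under stated assumptions, not the paper's method: the data-generating process, the linear surrogate, and the helper names `diff_in_means` and `plug_in_effect` are all hypothetical. The surrogate is simply substituted for the primary outcome in a standard difference-in-means analysis.

```python
import numpy as np

def diff_in_means(outcome, treatment):
    """Difference-in-means estimate of the average treatment effect."""
    outcome = np.asarray(outcome, dtype=float)
    treated = np.asarray(treatment).astype(bool)
    return outcome[treated].mean() - outcome[~treated].mean()

def plug_in_effect(surrogate_fn, post_treatment, treatment):
    """Effect estimate obtained by substituting f(S) for the primary outcome."""
    return diff_in_means(surrogate_fn(post_treatment), treatment)

# Synthetic randomized experiment (assumed for illustration only).
rng = np.random.default_rng(0)
n = 20_000
t = rng.integers(0, 2, size=n)                           # randomized treatment
s = rng.normal(size=(n, 2)) + np.outer(t, [1.0, 0.5])    # post-treatment variables
true_w = np.array([2.0, -1.0])
y = s @ true_w + rng.normal(size=n)                      # primary outcome

# Hypothetical composite surrogate: a linear function of S.
surrogate = lambda s_: s_ @ true_w

primary_effect = diff_in_means(y, t)
surrogate_effect = plug_in_effect(surrogate, s, t)
print(f"primary effect:   {primary_effect:.3f}")
print(f"surrogate effect: {surrogate_effect:.3f}")
```

In this toy setup the treatment affects the primary outcome only through the post-treatment variables, so the two estimates should agree up to sampling noise. That agreement is a property of the simulation, not a general guarantee.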
If this is right
- Plug-in surrogates can be found that produce unbiased estimates of the primary treatment effect in representative scenarios.
- Direct modeling of the surrogate effect outperforms established surrogate-learning approaches on both synthetic data with known ground truth and real experimental data.
- The learned surrogates allow substitution into standard randomized-experiment analyses without additional adjustments.
- The framework applies to settings where observing the primary outcome on the full cohort is prohibitively expensive.
Where Pith is reading between the lines
- The same learning procedure could be tested on observational data under appropriate identification assumptions for the surrogate effect.
- If the predictiveness criterion generalizes, trial designers could use it to select which short-term variables to measure when planning studies.
- The characterization of unbiasedness might guide which post-treatment variables to collect to improve surrogate quality.
Load-bearing premise
The effect predictiveness of a plug-in surrogate can be learned and optimized from data in a way that generalizes beyond the observed distribution, without relying on unidentifiable causal quantities.
What would settle it
The claim would be refuted if, in a new randomized experiment with both surrogate and primary outcomes observed, the treatment effect estimated from the learned surrogate deviated from the primary effect by more than the effects estimated from baseline surrogates.
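The check described here can be sketched as a small comparison. The function names and the numbers below are hypothetical, and the sketch assumes that "baseline surrogates" means the best-performing baseline. In practice a single experiment is noisy, so a real test would also need to account for sampling uncertainty.

```python
def deviation(surrogate_effect, primary_effect):
    """Absolute gap between a surrogate-based and the primary effect estimate."""
    return abs(surrogate_effect - primary_effect)

def learned_surrogate_fails(learned_effect, baseline_effects, primary_effect):
    """True if the learned surrogate deviates more than the best baseline does."""
    best_baseline = min(deviation(b, primary_effect) for b in baseline_effects)
    return deviation(learned_effect, primary_effect) > best_baseline

# Hypothetical effect estimates from one new experiment (illustrative numbers).
primary = 1.50
baselines = [1.10, 1.95]

print(learned_surrogate_fails(1.42, baselines, primary))  # False
print(learned_surrogate_fails(2.30, baselines, primary))  # True
```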
Original abstract
Surrogate endpoints are used in place of long-term outcomes in randomized experiments when observing the real outcome for a large enough cohort is prohibitively expensive or impractical. A short-term surrogate is good if the result of an experiment using the surrogate is predictive of the result of a hypothetical study using the real outcome. Much attention has been paid to formalizing this property in causal terms, but most criteria are unidentifiable and cannot be turned into practical algorithms for learning surrogate endpoints from data. To address this, we study plug-in composite surrogates, functions of post-treatment variables that may be substituted directly for the primary outcome in a randomized experiment. We propose two methods for learning plug-in surrogates that maximize effect predictiveness, and characterize the possibility of finding endpoints that yield unbiased effect estimates in representative scenarios. Finally, in both synthetic experiments with known effects and in data from a real-world experiment, we find that our method, based on directly modeling the surrogate effect, returns plug-in endpoints more predictive of the primary effect than established methods.
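The quantities the abstract refers to can be written down in standard potential-outcomes notation. The symbols below are assumptions made for illustration and may differ from the paper's own notation: $T \in \{0,1\}$ is the randomized treatment, $S$ the post-treatment variables, $Y$ the primary outcome, and $f$ a candidate plug-in surrogate.

```latex
% Primary effect and plug-in surrogate effect (assumed notation)
\tau_Y = \mathbb{E}[Y(1) - Y(0)], \qquad
\tau_f = \mathbb{E}\big[f(S(1)) - f(S(0))\big]

% Under randomization, both are identified by differences in means
\tau_f = \mathbb{E}[f(S) \mid T = 1] - \mathbb{E}[f(S) \mid T = 0]

% Bias of the plug-in surrogate; f is unbiased when b(f) = 0
b(f) = \tau_f - \tau_Y

% One sufficient condition, by the tower property:
% if \mathbb{E}[Y \mid S, T = t] = f(S) for t \in \{0, 1\}, then
\mathbb{E}[Y \mid T = t]
  = \mathbb{E}\big[\mathbb{E}[Y \mid S, T = t] \mid T = t\big]
  = \mathbb{E}[f(S) \mid T = t]
  \;\Longrightarrow\; \tau_f = \tau_Y
```

The sufficient condition shown is a textbook one, offered only as an example. It is not necessarily among the scenarios the paper characterizes.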
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that plug-in composite surrogates (functions of post-treatment variables) can be learned from data to maximize effect predictiveness, i.e., how well the surrogate-based treatment effect predicts the primary-effect estimate. It proposes two such methods, one based on directly modeling the surrogate effect, characterizes conditions for unbiased effect estimates in representative scenarios, and reports that this method outperforms established approaches on synthetic data with known ground-truth effects as well as one real-world experiment.
Significance. If the central claim holds, the work supplies a practical, identifiable, and optimizable alternative to unidentifiable causal criteria for surrogate selection. Validation on synthetic data with known effects and on a real dataset, together with the unbiasedness characterization, is a concrete strength that supports applicability in settings where long-term primary outcomes are expensive to measure.
major comments (1)
- Experimental validation (synthetic experiments and real-world dataset): the reported gains in predictiveness are shown only when training and evaluation distributions coincide. No cross-experiment validation, hold-out experiments with shifted post-treatment marginals, or sensitivity analysis to changes in effect size or conditional law of the primary given the surrogates is described. Because the surrogate is a fitted function of post-treatment variables, any such shift can break the learned predictiveness mapping; this directly limits the scope of the central claim that the method returns reliable plug-in endpoints for new experiments.
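The concern about shifted conditional laws can be illustrated with a small simulation. This is a sketch under stated assumptions, not a reproduction of the paper's experiments: the surrogate is fitted by ordinary least squares of the primary outcome on the post-treatment variables (a simple stand-in for a learned surrogate), and the helper names `simulate` and `gap` and all parameter values are hypothetical.

```python
import numpy as np

def diff_in_means(outcome, treatment):
    """Difference-in-means estimate of the average treatment effect."""
    treated = np.asarray(treatment).astype(bool)
    return outcome[treated].mean() - outcome[~treated].mean()

def simulate(rng, n, s_shift, y_coef):
    """One randomized experiment: T shifts S, and Y depends linearly on S."""
    t = rng.integers(0, 2, size=n)
    s = rng.normal(size=(n, 2)) + np.outer(t, s_shift)
    y = s @ np.asarray(y_coef) + rng.normal(size=n)
    return t, s, y

rng = np.random.default_rng(1)

# Training experiment: fit a linear surrogate (with intercept) by least squares.
t_tr, s_tr, y_tr = simulate(rng, 20_000, [1.0, 0.5], [2.0, -1.0])
design = np.column_stack([s_tr, np.ones(len(s_tr))])
w, *_ = np.linalg.lstsq(design, y_tr, rcond=None)

def surrogate(s):
    """Fitted plug-in surrogate f(S) = S w + intercept."""
    return np.column_stack([s, np.ones(len(s))]) @ w

def gap(t, s, y):
    """Absolute gap between surrogate-based and primary effect estimates."""
    return abs(diff_in_means(surrogate(s), t) - diff_in_means(y, t))

# New experiment from the same distribution as training.
same = gap(*simulate(rng, 20_000, [1.0, 0.5], [2.0, -1.0]))
# New experiment where the conditional law of Y given S has changed.
shifted = gap(*simulate(rng, 20_000, [1.0, 0.5], [2.0, 1.0]))

print(f"gap, same distribution:     {same:.3f}")
print(f"gap, shifted Y | S mapping: {shifted:.3f}")
```

In this toy example the surrogate tracks the primary effect closely when the new experiment matches the training distribution, and misses it by a wide margin once the relationship between the post-treatment variables and the primary outcome changes.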
Simulated Author's Rebuttal
We thank the referee for their constructive comments. We address the major concern on experimental validation below.
- Referee: Experimental validation (synthetic experiments and real-world dataset): the reported gains in predictiveness are shown only when training and evaluation distributions coincide. No cross-experiment validation, hold-out experiments with shifted post-treatment marginals, or sensitivity analysis to changes in effect size or conditional law of the primary given the surrogates is described. Because the surrogate is a fitted function of post-treatment variables, any such shift can break the learned predictiveness mapping; this directly limits the scope of the central claim that the method returns reliable plug-in endpoints for new experiments.
  Authors: We agree that robustness to distribution shifts is important for claiming reliable use in new experiments. Our current experiments follow the standard setup in surrogate learning, in which the surrogate is trained and evaluated on data from the same distribution, which ensures the effect predictiveness metric is identifiable. The paper's theoretical characterization of unbiasedness conditions explicitly delineates when the plug-in surrogate yields valid estimates, providing guidance for deployment in similar settings. To strengthen the empirical support, we will add cross-experiment validation on additional real-world datasets and sensitivity analyses to effect-size and conditional-distribution shifts in the revision.
  Revision: yes
Circularity Check
No circularity; empirical validation on held-out synthetic and real data is independent of the fitting procedure.
The paper defines plug-in surrogates as functions of post-treatment variables and proposes two optimization procedures that directly maximize an effect-predictiveness criterion estimated from data in which both surrogate and primary outcomes are jointly observed. Performance is then evaluated on synthetic data generated with known ground-truth effects and on a separate real-world experiment; neither the reported superiority nor the unbiasedness characterizations reduce to the fitted values by construction. No self-citation is invoked as a uniqueness theorem, no ansatz is smuggled, and no known empirical pattern is merely renamed. The derivation chain therefore remains self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Plug-in surrogates are functions of post-treatment variables that can be directly substituted for the primary outcome in analysis of randomized experiments.