pith. machine review for the scientific record.

arxiv: 2605.12051 · v1 · submitted 2026-05-12 · 💻 cs.LG

Recognition: no theorem link

Learning plug-in surrogate endpoints for randomized experiments


Pith reviewed 2026-05-13 06:10 UTC · model grok-4.3

classification 💻 cs.LG
keywords surrogate endpoints · randomized experiments · plug-in surrogates · effect predictiveness · causal inference · treatment effect estimation · machine learning for experiments

The pith

Plug-in composite surrogates learned by directly modeling the surrogate effect predict primary treatment effects more accurately than prior methods.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops practical methods to learn surrogate endpoints that can replace long-term primary outcomes in randomized experiments. These surrogates are defined as functions of post-treatment variables that plug directly into the analysis in place of the real outcome. The authors focus on maximizing how well the treatment effect measured on the surrogate predicts the effect on the primary outcome, creating criteria that can be optimized from data. This addresses the fact that many earlier formal definitions of good surrogates involve quantities that cannot be identified or estimated in practice. If the approach works, researchers could design shorter or less costly trials whose results reliably indicate what a full study on the primary outcome would have found.

Core claim

We propose two methods for learning plug-in composite surrogates that maximize effect predictiveness, characterize conditions under which such surrogates can yield unbiased estimates of the primary effect, and show through synthetic experiments with known effects and a real-world experiment that directly modeling the surrogate effect produces endpoints whose estimated treatment effects are more predictive of the primary effect than those from established methods.

What carries the argument

Plug-in composite surrogates, which are functions of post-treatment variables that substitute directly for the primary outcome when estimating treatment effects, optimized to maximize the predictiveness of the surrogate effect for the primary effect.
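
To make the "plug-in" idea concrete, here is a minimal sketch, not the paper's method: a composite surrogate h(S) is substituted for the primary outcome Y inside an ordinary difference-in-means estimator. The data-generating process, the function names, and the particular h are all hypothetical and chosen so that Y depends on treatment only through S.

```python
import numpy as np

def difference_in_means(outcome, treatment):
    """Estimate the average treatment effect in a randomized experiment."""
    outcome = np.asarray(outcome, dtype=float)
    treatment = np.asarray(treatment)
    return outcome[treatment == 1].mean() - outcome[treatment == 0].mean()

def plug_in_effect(h, post_treatment, treatment):
    """Use the surrogate h(S) in place of the primary outcome Y."""
    return difference_in_means(h(post_treatment), treatment)

# Hypothetical synthetic experiment: treatment shifts two short-term
# variables S, and Y depends on treatment only through S.
rng = np.random.default_rng(0)
n = 20_000
t = rng.integers(0, 2, size=n)
s = rng.normal(size=(n, 2)) + np.outer(t, [1.0, 0.5])
y = 2.0 * s[:, 0] - 1.0 * s[:, 1] + rng.normal(size=n)

def h(s):
    # Illustrative composite surrogate; it equals E[Y | S] by construction here.
    return 2.0 * s[:, 0] - 1.0 * s[:, 1]

primary_effect = difference_in_means(y, t)   # uses the long-term outcome
surrogate_effect = plug_in_effect(h, s, t)   # uses only short-term variables
print(primary_effect, surrogate_effect)
```

In this toy setup the true effect is 2.0 * 1.0 - 1.0 * 0.5 = 1.5, and both estimates should land near it. With a direct effect of treatment on Y that bypasses S, the two would generally disagree.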

If this is right

  • Plug-in surrogates can be found that produce unbiased estimates of the primary treatment effect in representative scenarios.
  • Direct modeling of the surrogate effect outperforms established surrogate-learning approaches on both synthetic data with known ground truth and real experimental data.
  • The learned surrogates allow substitution into standard randomized-experiment analyses without additional adjustments.
  • The framework applies to settings where observing the primary outcome on the full cohort is prohibitively expensive.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same learning procedure could be tested on observational data under appropriate identification assumptions for the surrogate effect.
  • If the predictiveness criterion generalizes, trial designers could use it to select which short-term variables to measure when planning studies.
  • The characterization of unbiasedness might guide which post-treatment variables to collect to improve surrogate quality.

Load-bearing premise

That the effect predictiveness of a plug-in surrogate can be learned and optimized from data in a way that generalizes beyond the observed distribution without relying on unidentifiable causal quantities.

What would settle it

The claim would be undermined if, in a new randomized experiment with both surrogate and primary outcomes observed, the treatment effect estimated from the learned surrogate deviated from the primary effect by more than the effects estimated from baseline surrogates do.
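
The check described here can be sketched as a comparison of two absolute deviations. This is an illustrative sketch only: the function names and the four-unit toy data are hypothetical, and a real comparison would also need to account for sampling uncertainty, which this sketch ignores.

```python
import numpy as np

def effect(outcome, treatment):
    """Difference in mean outcome between treated and control units."""
    outcome = np.asarray(outcome, dtype=float)
    treatment = np.asarray(treatment)
    return outcome[treatment == 1].mean() - outcome[treatment == 0].mean()

def deviation_from_primary(surrogate_values, y, t):
    """Absolute gap between the surrogate effect and the primary effect."""
    return abs(effect(surrogate_values, t) - effect(y, t))

# Hypothetical toy data: two control units, two treated units.
t = np.array([0, 0, 1, 1])
y = np.array([1.0, 2.0, 3.0, 4.0])          # primary outcome
learned = np.array([0.9, 2.1, 3.0, 4.2])    # values of a learned surrogate
baseline = np.array([1.0, 1.0, 2.0, 2.0])   # values of a baseline surrogate

dev_learned = deviation_from_primary(learned, y, t)
dev_baseline = deviation_from_primary(baseline, y, t)
learned_is_closer = dev_learned <= dev_baseline
print(dev_learned, dev_baseline, learned_is_closer)
```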

Figures

Figures reproduced from arXiv: 2605.12051 by Ahmet Zahid Balcıoğlu, Alessandro-Umberto Margueritte, Dave Zachariah, Fredrik D. Johansson, Jesse Krijthe.

Figure 2: Six scenarios for surrogate learning.
Figure 3: Comparison of R² computed via potential outcomes and regression across reported results. We omit results from scenarios b) and c) as the CATE is constant and R² is not meaningful.

… squared error within the leaf. For a leaf $L$, the optimal leaf prediction is the weighted mean $\hat{\mu}(L) = \frac{\sum_{i \in L} w_i \hat{h}_i}{\sum_{i \in L} w_i}$, and the corresponding leaf loss is $\mathrm{Err}(L) = \sum_{i \in L} w_i$ …
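
The leaf prediction quoted in this excerpt is a weighted mean, which can be checked on a small worked example. The leaf loss is truncated in the source; the sketch below assumes it is the weighted squared error around the leaf prediction, which is the loss the weighted mean minimizes. The function names and numbers are hypothetical.

```python
import numpy as np

def leaf_prediction(h_hat, w):
    """Weighted mean of the per-unit targets h_hat with weights w."""
    h_hat = np.asarray(h_hat, dtype=float)
    w = np.asarray(w, dtype=float)
    return np.sum(w * h_hat) / np.sum(w)

def leaf_loss(h_hat, w):
    """ASSUMPTION: weighted squared error around the leaf prediction."""
    h_hat = np.asarray(h_hat, dtype=float)
    w = np.asarray(w, dtype=float)
    mu = leaf_prediction(h_hat, w)
    return np.sum(w * (h_hat - mu) ** 2)

h_hat = [1.0, 2.0, 4.0]
w = [1.0, 1.0, 2.0]
mu = leaf_prediction(h_hat, w)   # (1*1 + 1*2 + 2*4) / 4 = 2.75
loss = leaf_loss(h_hat, w)       # 3.0625 + 0.5625 + 3.125 = 6.75
print(mu, loss)
```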
Original abstract

Surrogate endpoints are used in place of long-term outcomes in randomized experiments when observing the real outcome for a large enough cohort is prohibitively expensive or impractical. A short-term surrogate is good if the result of an experiment using the surrogate is predictive of the result of a hypothetical study using the real outcome. Much attention has been paid to formalizing this property in causal terms, but most criteria are unidentifiable and cannot be turned into practical algorithms for learning surrogate endpoints from data. To address this, we study plug-in composite surrogates, functions of post-treatment variables that may be substituted directly for the primary outcome in a randomized experiment. We propose two methods for learning plug-in surrogates that maximize effect predictiveness, and characterize the possibility of finding endpoints that yield unbiased effect estimates in representative scenarios. Finally, in both synthetic experiments with known effects and in data from a real-world experiment, we find that our method, based on directly modeling the surrogate effect, returns plug-in endpoints more predictive of the primary effect than established methods.
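
The objects named in the abstract can be written out in potential-outcome terms. The notation below is assumed for illustration (Y for the primary outcome, S for the post-treatment variables, h for the surrogate function, with Y(t) and S(t) the potential outcomes under treatment t); the paper's own definitions may differ.

```latex
\begin{align*}
\tau_Y &= \mathbb{E}\big[\,Y(1) - Y(0)\,\big] && \text{(primary effect)} \\
\tau_h &= \mathbb{E}\big[\,h(S(1)) - h(S(0))\,\big] && \text{(plug-in surrogate effect)} \\
\tau_h &= \tau_Y && \text{(unbiasedness of the plug-in endpoint)}
\end{align*}
```

Under randomization, both effects are identified by a difference in arm means, of Y for the primary effect and of h(S) for the surrogate effect, which is what makes the substitution "plug-in".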

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The paper claims that plug-in composite surrogates (functions of post-treatment variables) can be learned from data to maximize effect predictiveness, i.e., how well the surrogate-based treatment effect predicts the primary-effect estimate. It proposes two such methods, one based on directly modeling the surrogate effect, characterizes conditions for unbiased effect estimates in representative scenarios, and reports that this method outperforms established approaches on synthetic data with known ground-truth effects as well as one real-world experiment.

Significance. If the central claim holds, the work supplies a practical, identifiable, and optimizable alternative to unidentifiable causal criteria for surrogate selection. The use of synthetic data with known effects for validation plus a real dataset, together with the unbiasedness characterization, are concrete strengths that support applicability in settings where long-term primary outcomes are expensive to measure.

major comments (1)
  1. Experimental validation (synthetic experiments and real-world dataset): the reported gains in predictiveness are shown only when training and evaluation distributions coincide. No cross-experiment validation, hold-out experiments with shifted post-treatment marginals, or sensitivity analysis to changes in effect size or conditional law of the primary given the surrogates is described. Because the surrogate is a fitted function of post-treatment variables, any such shift can break the learned predictiveness mapping; this directly limits the scope of the central claim that the method returns reliable plug-in endpoints for new experiments.
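
The kind of sensitivity analysis the referee asks for can be sketched on synthetic data. This is a hypothetical illustration, not the paper's procedure: the surrogate is a least-squares fit of Y on S (a simple stand-in, not the paper's learner), and the shift changes the conditional law of Y given S in the new experiment. All names and coefficients are assumptions.

```python
import numpy as np

def simulate(n, y_coef, rng):
    """Randomized experiment in which Y depends on treatment only through S."""
    t = rng.integers(0, 2, size=n)
    s = rng.normal(size=(n, 2)) + np.outer(t, [1.0, 0.5])
    y = s @ np.asarray(y_coef) + rng.normal(size=n)
    return t, s, y

def effect(outcome, t):
    """Difference in mean outcome between treated and control units."""
    return outcome[t == 1].mean() - outcome[t == 0].mean()

rng = np.random.default_rng(0)

# Fit a linear stand-in surrogate on the training experiment.
t_tr, s_tr, y_tr = simulate(20_000, [2.0, -1.0], rng)
design = np.column_stack([np.ones(len(s_tr)), s_tr])
beta, *_ = np.linalg.lstsq(design, y_tr, rcond=None)

def h(s):
    return beta[0] + s @ beta[1:]

# Evaluate in new experiments whose Y-given-S relationship is shifted.
gaps = {}
for shift in [0.0, 0.5, 1.0]:
    t_new, s_new, y_new = simulate(20_000, [2.0 + shift, -1.0], rng)
    gaps[shift] = abs(effect(h(s_new), t_new) - effect(y_new, t_new))

print(gaps)
```

With no shift the surrogate effect tracks the primary effect closely; as the coefficient on the first surrogate variable moves away from its training value, the gap grows roughly in proportion to the shift.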

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive comments. We address the major concern on experimental validation below.

Point-by-point responses
  1. Referee: Experimental validation (synthetic experiments and real-world dataset): the reported gains in predictiveness are shown only when training and evaluation distributions coincide. No cross-experiment validation, hold-out experiments with shifted post-treatment marginals, or sensitivity analysis to changes in effect size or conditional law of the primary given the surrogates is described. Because the surrogate is a fitted function of post-treatment variables, any such shift can break the learned predictiveness mapping; this directly limits the scope of the central claim that the method returns reliable plug-in endpoints for new experiments.

    Authors: We agree that robustness to distribution shifts is important for claiming reliable use in new experiments. Our current experiments follow the standard setup in surrogate learning where the surrogate is trained and evaluated on data from the same distribution, which ensures the effect predictiveness metric is identifiable. The paper's theoretical characterization of unbiasedness conditions explicitly delineates when the plug-in surrogate yields valid estimates, providing guidance for deployment in similar settings. To strengthen the empirical support, we will add cross-experiment validation on additional real-world datasets and sensitivity analyses to effect-size and conditional-distribution shifts in the revision.

    revision: yes

Circularity Check

0 steps flagged

No circularity; empirical validation on held-out synthetic and real data is independent of fitting procedure

full rationale

The paper defines plug-in surrogates as functions of post-treatment variables and proposes two optimization procedures that directly maximize an effect-predictiveness criterion estimated from data in which both surrogate and primary outcomes are jointly observed. Performance is then evaluated on synthetic data generated with known ground-truth effects and on a separate real-world experiment; neither the reported superiority nor the unbiasedness characterizations reduce to the fitted values by construction. No self-citation is invoked as a uniqueness theorem, no ansatz is smuggled, and no known empirical pattern is merely renamed. The derivation chain therefore remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Abstract-only review limits visibility into specific parameters or assumptions; the work rests on standard causal assumptions for randomized experiments and the premise that effect predictiveness is a learnable and useful criterion.

axioms (1)
  • domain assumption: Plug-in surrogates are functions of post-treatment variables that can be directly substituted for the primary outcome in analysis of randomized experiments.
    Explicitly stated as the object of study in the abstract.

pith-pipeline@v0.9.0 · 5496 in / 1220 out tokens · 111585 ms · 2026-05-13T06:10:30.595896+00:00 · methodology

