Blending Proxy Metrics with a North Star

Winston Chou

arxiv: 2606.21745 · v1 · pith:XMCEPMYLnew · submitted 2026-06-19 · 📊 stat.ME

Pith reviewed 2026-06-26 13:14 UTC · model grok-4.3

classification 📊 stat.ME

keywords proxy metrics · north star metric · A/B testing · online experimentation · blending weights · experiment design · statistical power

The pith

An optimal blending method combines proxy metrics with the north star metric using weights that depend on experiment power and proxy quality.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper proposes an optimal blending approach for using proxy metrics alongside a north star metric in A/B testing. The method adjusts the weight given to each metric smoothly: more weight goes to the north star as the experiment's statistical power grows, and more weight goes to the proxy as its quality relative to the north star increases. A sympathetic reader would care because this resolves the common dilemma of whether to trust quick but imperfect proxies or the slower but more accurate north star. The framework also changes how experiments should be designed, with better proxies leading to smaller and more frequent tests. Historical experiments can supply the data needed to estimate the right weights and sizes for future tests.

Core claim

The paper claims that an optimal blending approach exists which smoothly guides decision-making towards the north star as the power of the experiment increases and away from the north star as the quality of the proxy metric improves. This decision-making framework carries direct implications for the design of individual experiments and of entire experimentation programs: experimenters equipped with better proxy metrics should run smaller and more experiments, while those with worse proxies should run larger and fewer ones. The optimal blending weights and experiment sizes can be estimated from past experiments, and the approach has been applied in practice to an experimentation program.

What carries the argument

Optimal blending weights that vary with experiment power and a quantifiable measure of proxy quality relative to the north star.
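
As a rough illustration of how such weights might behave, the sketch below uses a toy Gaussian model that is an assumption of this note, not the paper's derivation. True effects on the north star and the proxy are taken to be jointly normal with correlation `rho`, each estimate adds independent noise with variance `v / n`, and the blend uses the best-linear-predictor coefficients for the true north-star effect, rescaled to sum to one. The function name, parameters, and default values are all hypothetical.

```python
def blend_weights(n, rho, s_ns=1.0, s_px=1.0, v_ns=4.0e6, v_px=2.5e5):
    """Toy blending weights (north star, proxy) at per-arm sample size n.

    Assumed model (illustrative, not taken from the paper):
      true effects ~ bivariate normal, sds s_ns and s_px, correlation rho
      estimate = true effect + independent noise with variance v / n
    Requires 0 <= rho < 1 so that both coefficients are non-negative.
    """
    if not 0.0 <= rho < 1.0:
        raise ValueError("rho must lie in [0, 1)")
    cov = rho * s_ns * s_px          # covariance of the true effects
    a = s_ns**2 + v_ns / n           # variance of the north-star estimate
    d = s_px**2 + v_px / n           # variance of the proxy estimate
    det = a * d - cov**2
    # Best linear predictor of the true north-star effect: solve
    # [[a, cov], [cov, d]] @ b = [s_ns**2, cov] in closed form.
    b_ns = (d * s_ns**2 - cov**2) / det
    b_px = cov * (a - s_ns**2) / det
    total = b_ns + b_px
    return b_ns / total, b_px / total


# Weights at a few sample sizes for a moderately good proxy.
for n in (1e5, 1e6, 1e7, 1e8):
    w_ns, w_px = blend_weights(n, rho=0.7)
    print(f"n={n:.0e}  north star={w_ns:.3f}  proxy={w_px:.3f}")
```

In this toy model the proxy weight falls as `n` grows and rises with `rho`, which matches the qualitative behaviour described above. It is not meant to reproduce the paper's actual weights or the simplex-constraint behaviour reported for Figure 6.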

If this is right

With better proxy metrics, experimenters should run smaller and more experiments.
With higher-powered experiments, more weight should shift to the north star metric.
Worse proxy metrics imply running larger and fewer experiments.
Historical experiments can be used to estimate the optimal weights and sizes for future tests.
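
One simple way to back out proxy quality from past experiments is a method-of-moments correction: the variance of the estimated effects equals the variance of the true effects plus the average sampling variance, so subtracting the mean squared standard error removes the attenuation. The sketch below assumes independent noise across the two metrics and known standard errors. The function names and inputs are hypothetical, and the paper may well use a different estimator.

```python
import math
import random
import statistics


def sample_cov(xs, ys):
    """Unbiased sample covariance of two equal-length sequences."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)


def naive_correlation(ns_est, px_est):
    """Correlation of the raw estimates (attenuated by sampling noise)."""
    sd = math.sqrt(statistics.variance(ns_est) * statistics.variance(px_est))
    return sample_cov(ns_est, px_est) / sd


def corrected_correlation(ns_est, px_est, ns_se, px_se):
    """Noise-corrected correlation between the true effects.

    Assumes (for illustration) that noise is independent across the two
    metrics, so the covariance needs no correction, only the variances.
    """
    var_ns = statistics.variance(ns_est) - statistics.fmean([s**2 for s in ns_se])
    var_px = statistics.variance(px_est) - statistics.fmean([s**2 for s in px_se])
    if var_ns <= 0 or var_px <= 0:
        raise ValueError("estimated signal variance is not positive")
    return sample_cov(ns_est, px_est) / math.sqrt(var_ns * var_px)


def simulate_history(k, rho, noise_sd, seed=0):
    """Hypothetical history of k experiments with unit-variance true effects."""
    rng = random.Random(seed)
    ns_est, px_est = [], []
    for _ in range(k):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        tau_ns = z1
        tau_px = rho * z1 + math.sqrt(1 - rho**2) * z2
        ns_est.append(tau_ns + rng.gauss(0, noise_sd))
        px_est.append(tau_px + rng.gauss(0, noise_sd))
    return ns_est, px_est


ns, px = simulate_history(k=20_000, rho=0.8, noise_sd=1.0)
se = [1.0] * len(ns)
print("naive:", naive_correlation(ns, px))
print("corrected:", corrected_correlation(ns, px, se, se))
```

With only a handful of historical experiments the corrected variance can come out negative, which is why the function raises rather than returning a value; a real program would need a more careful estimator in that regime.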

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same blending logic could apply to any setting that trades off fast but noisy signals against slower but accurate outcomes.
Organizations could reduce overall experimentation costs by investing in higher-quality proxies that allow more tests per unit of time.
The framework could be tested by comparing blended versus single-metric decisions in controlled simulations where the true long-term effect is known in advance.
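
A simulation of that kind could be sketched as follows. The setup is an assumption made here for illustration: true effects are standard bivariate normal, estimates add independent Gaussian noise, a change ships whenever the decision statistic is positive, and the blend uses best-linear-predictor coefficients rather than the paper's own weights. All names and default values are hypothetical.

```python
import math
import random


def simulate_returns(n_sims=50_000, rho=0.7, noise_ns=2.0, noise_px=0.5, seed=0):
    """Mean true north-star gain per experiment under three ship rules.

    Toy setup (an assumption of this sketch): true effects are standard
    bivariate normal with correlation rho; each estimate adds independent
    Gaussian noise with the given standard deviation; a change ships when
    the decision statistic is positive.
    """
    rng = random.Random(seed)
    # Best-linear-predictor coefficients for the true north-star effect.
    a, d = 1.0 + noise_ns**2, 1.0 + noise_px**2
    det = a * d - rho**2
    b_ns, b_px = (d - rho**2) / det, rho * noise_ns**2 / det
    totals = {"north_star": 0.0, "proxy": 0.0, "blend": 0.0}
    for _ in range(n_sims):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        tau_ns = z1                                   # true north-star effect
        tau_px = rho * z1 + math.sqrt(1 - rho**2) * z2  # true proxy effect
        est_ns = tau_ns + rng.gauss(0, noise_ns)
        est_px = tau_px + rng.gauss(0, noise_px)
        stats = {
            "north_star": est_ns,
            "proxy": est_px,
            "blend": b_ns * est_ns + b_px * est_px,
        }
        for rule, stat in stats.items():
            if stat > 0:              # ship only when the statistic is positive
                totals[rule] += tau_ns
    return {rule: total / n_sims for rule, total in totals.items()}


for rule, value in simulate_returns().items():
    print(f"{rule:>10}: {value:.3f}")
```

Because the true north-star effect is known inside the simulation, each rule can be scored by the gain it actually ships, which is the comparison the bullet above describes.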

Load-bearing premise

A quantifiable and stable measure of proxy quality relative to the north star exists, and historical experiments provide unbiased estimates of the optimal blending weights.

What would settle it

Apply the estimated blending weights to a new set of experiments with known long-term north star outcomes and check whether the blended decisions match the north star outcomes more closely than decisions based on the proxy or north star alone.

Figures

Figures reproduced from arXiv: 2606.21745 by Winston Chou.

**Figure 1.** Cumulative Returns (left) and Optimal Blending Weights (right) by n. For very large n, it is better to shift decision-making towards the north star. By construction, the blended metric uses the optimal weights at any given n, and so smoothly concentrates the weights on the north star as n grows.

**Figure 2.** Empirical Power Heatmaps across n and ρ.

**Figure 3.** False Positive Risk Heatmaps across n and ρ. Finally, I assess the robustness of the core insights to heavy-tailed treatment effect distributions by repeating the above simulations for a multivariate t-distribution with ν = 3 degrees of freedom.

**Figure 4.** Robustness to Heavy Tails: Returns, Power, and False Positive Risk, Comparing ν = ∞ (Gaussian) vs ν = 3.

**Figure 5.** Treatment effects on the proxy (clicks, x-axis) vs. the north star (plays, y-axis) across 15 experiments; error bars are 95% CIs. Two fits are shown: raw OLS (black, solid) and cross-fold OLS (red, dashed). Axes suppressed for confidentiality. I find that, at the median experiment size of n ≈ 5 million per arm, the optimal blend assigns 52% weight to clicks and 48% to plays.

**Figure 6.** Approximately optimal blending weights vs. per-arm sample size n (log scale), using cross-fold parameter estimates from 15 experiments. The simplex constraint binds below n ≈ 2 million (weights load entirely on clicks); above it both weights are strictly positive and the weight on plays rises with n. Typical experiment size is approximately 5 million per arm (grey dotted vertical).

**Figure 7.** Expected return per experiment vs. per-arm sample size n (log scale) for three decision rules: plays-only (blue, dotted), clicks-only (orange, solid), and the optimal blend (green, dash-dot). The blended rule dominates both single-metric rules, and clicks-only earns more than plays-only across the entire range of empirically observed sample sizes.

Original abstract

Proxy metrics are widely used to improve the precision and velocity of online experimentation (aka A/B testing). Although proxies are often motivated by long-term outcomes that the experimenter does not observe, in many settings they are used alongside a contemporaneous but statistically insensitive north star. This can lead to a practical dilemma: when should experimenters trust the proxy metric, and when should they trust the north star? In this paper, I propose an optimal blending approach that smoothly guides decision-making towards the north star as the power of the experiment increases and away from the north star as the quality of the proxy metric improves. I study the implications of this decision-making framework for the design of experiments and of experimentation programs. Equipped with better (worse) proxy metrics, experimenters should run smaller and more (larger and fewer) experiments. I show how to leverage past experiments to estimate optimal blending weights and experiment sizes. Lastly, I describe the real-world application of the methodology to an experimentation program at Netflix.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper gives a usable rule for blending proxy and north-star decisions in A/B tests, with clear design implications, but the optimality claim hinges on clean historical estimates whose robustness is not shown in the abstract.

The main takeaway is a blending rule that shifts weight toward the north star as experiment power grows and toward the proxy as its quality improves, plus guidance on how that changes experiment sizing and program structure. The Netflix application is the part that feels most grounded.

What stands out as new is the specific functional form for the weights and the derived advice that stronger proxies should produce smaller but more frequent experiments. The paper walks through the estimation step using past experiments, which is the practical piece most teams would actually use.

The soft spot is the dependence on historical data for the weights. If the proxy-north-star link drifts or if past experiments were selected on observed effects, the estimated blending parameters and recommended sizes will be mis-calibrated. The abstract states that historical experiments are used but does not show any check for stationarity or selection bias, so the optimality result rests on an assumption that may not hold in real programs.

This is written for practitioners who run large online experimentation systems. Someone managing an A/B platform at a tech company would find the sizing implications and the estimation recipe directly usable. A methods reader would want to see the derivation and any sensitivity checks before treating the optimality claim as settled.

It deserves peer review because it tackles a real operational problem with a concrete proposal and a deployed example. The estimation and robustness questions are the natural points for referees to press.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes an optimal blending framework for proxy metrics and a contemporaneous but low-powered north star in online A/B testing. Blending weights are derived to shift decision-making toward the north star as experiment power grows and away from it as proxy quality improves; the weights and recommended experiment sizes are estimated from historical experiments. The paper examines design implications (smaller/more experiments with better proxies) and reports a real-world application at Netflix.

Significance. If the optimality derivation holds and the historical estimation is shown to be robust, the framework would supply a concrete, tunable rule for the common proxy-versus-north-star dilemma, directly affecting experiment sizing and program-level resource allocation in large-scale experimentation platforms.

major comments (2)

[Abstract / estimation procedure] The central optimality claim rests on the existence of a stable, unbiased estimate of proxy quality (correlation with the north star) obtained from past experiments. The abstract states that historical experiments are used to estimate the weights, but supplies no derivation showing that this estimation remains valid under non-stationarity or selection on observed proxy effects; if either violation occurs, the derived weights become mis-calibrated for future use.
[Implications for experiment design] The claim that experimenters should run smaller and more (larger and fewer) experiments when equipped with better (worse) proxies follows directly from the blending rule, yet the manuscript does not report a sensitivity analysis or simulation demonstrating that the recommended sizes remain approximately optimal when the proxy-north-star correlation is estimated with sampling error.

minor comments (2)

Notation for the blending weights and the proxy-quality parameter should be introduced with explicit definitions and distinguished from any data-dependent estimates.
The Netflix application section would benefit from a table contrasting the blended decisions against a pure-proxy and a pure-north-star baseline on the same set of experiments.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments, which help clarify the assumptions and robustness of our proposed framework. We address each major comment in turn.

Referee: [Abstract / estimation procedure] The central optimality claim rests on the existence of a stable, unbiased estimate of proxy quality (correlation with the north star) obtained from past experiments. The abstract states that historical experiments are used to estimate the weights, but supplies no derivation showing that this estimation remains valid under non-stationarity or selection on observed proxy effects; if either violation occurs, the derived weights become mis-calibrated for future use.

Authors: The manuscript assumes that the proxy quality, measured by the correlation with the north star, can be reliably estimated from historical experiments and remains stable. We do not provide a formal proof of unbiasedness under non-stationarity or selection bias, as the focus is on the blending framework itself. However, we recognize this as a valid concern. In the revision, we will expand the estimation section to include a discussion of these assumptions, potential biases, and practical recommendations for mitigating them, such as using time-weighted historical data or monitoring for drift. We will also note this as a limitation. revision: yes
Referee: [Implications for experiment design] The claim that experimenters should run smaller and more (larger and fewer) experiments when equipped with better (worse) proxies follows directly from the blending rule, yet the manuscript does not report a sensitivity analysis or simulation demonstrating that the recommended sizes remain approximately optimal when the proxy-north-star correlation is estimated with sampling error.

Authors: We agree that a sensitivity analysis would be valuable to assess how sampling variability in the estimated correlation affects the recommended experiment sizes. We will add a simulation study in the revised manuscript that introduces noise to the correlation estimate based on the number of historical experiments and evaluates the resulting variation in optimal sizes and blending weights. This will demonstrate the conditions under which the design recommendations remain robust. revision: yes
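
A sensitivity check of the kind promised here might be sketched as follows. Everything in it is an assumption made for illustration: the proxy weight comes from a toy Gaussian formula with unit-variance true effects, and the uncertainty in the estimated correlation is drawn from the Fisher z approximation as if the k historical effects were observed without noise, which understates the real uncertainty. The noise variances are arbitrary values, not calibrated to the paper.

```python
import math
import random
import statistics


def proxy_weight(n, rho, v_ns=4.0e6, v_px=2.5e5):
    """Toy proxy weight with unit-variance true effects (illustrative only)."""
    num = rho * v_ns / n
    return num / (num + (1 - rho**2) + v_px / n)


def weight_spread(k, rho=0.7, n=5e6, draws=2000, seed=0):
    """Std. dev. of the proxy weight when rho is estimated from k experiments.

    Uses the Fisher z approximation atanh(r) ~ N(atanh(rho), 1 / (k - 3)).
    """
    rng = random.Random(seed)
    z0, sd = math.atanh(rho), 1.0 / math.sqrt(k - 3)
    weights = []
    for _ in range(draws):
        r = math.tanh(rng.gauss(z0, sd))
        r = min(max(r, 0.0), 0.999)   # keep the toy formula in its valid range
        weights.append(proxy_weight(n, r))
    return statistics.pstdev(weights)


for k in (15, 50, 200):
    print(f"k={k:>3}  spread of proxy weight={weight_spread(k):.3f}")
```

The spread shrinks as the number of historical experiments grows, which gives a rough sense of how many past tests would be needed before the estimated weights stabilise under these assumptions.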

Circularity Check

0 steps flagged

No significant circularity: derivation self-contained with standard historical estimation

The paper derives an optimal blending rule that trades proxy quality against experiment power and then separately shows how to estimate the resulting weights from historical experiments. This estimation is a conventional calibration step using external data and does not reduce the claimed optimality condition to a tautology or fitted input by construction. No self-citations, uniqueness theorems, or ansatzes imported from prior author work appear in the abstract or description. The central proposal remains independent of the fitted values it later calibrates.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The framework rests on the ability to define proxy quality and to estimate blending parameters from historical data; these are treated as domain assumptions rather than derived quantities.

free parameters (1)

blending weights
Estimated from past experiments to determine optimal mixing between proxy and north star.

axioms (1)

domain assumption: A stable and quantifiable relationship exists between proxy metric quality and the north star that permits optimal blending.
Invoked to justify the existence of an optimal decision rule.

pith-pipeline@v0.9.1-grok · 5686 in / 1147 out tokens · 26602 ms · 2026-06-26T13:14:07.894360+00:00 · methodology

Reference graph

Works this paper leans on

23 extracted references · 8 canonical work pages

[1]

The Review of Economic Studies (2025)

Athey, S., Chetty, R., Imbens, G.W., Kang, H.: The surrogate index: Combining short-term proxies to estimate long-term treatment effects more rapidly and precisely. The Review of Economic Studies (2025). https://doi.org/10.1093/restud/rdaf087, advance article

work page doi:10.1093/restud/rdaf087 2025
[2]

In: AEA Papers and Proceedings

Azevedo, E.M., Deng, A., Montiel Olea, J.L., Weyl, E.G.: Empirical bayes estimation of treatment effects with many a/b tests: An overview. In: AEA Papers and Proceedings. vol. 109, pp. 43–47. American Economic Association (2019)

2019
[3]

Journal of Political Economy 128(12) (2020)

Azevedo, E.M., Deng, A., Olea, J.L.M., Rao, J., Weyl, E.G.: A/b testing with fat tails. Journal of Political Economy 128(12) (2020). https://doi.org/10.1086/710607, https://www.journals.uchicago.edu/doi/abs/10.1086/710607

work page doi:10.1086/710607 2020
[4]

Journal of Economic Theory 210, 105646 (2023)

Azevedo, E.M., Mao, D., Olea, J.L.M., Velez, A.: The a/b testing problem with gaussian priors. Journal of Economic Theory 210, 105646 (2023)

2023
[5]

In: Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining

Bibaut, A., Chou, W., Ejdemyr, S., Kallus, N.: Learning the covariance of treatment effects across many weak experiments. In: Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. pp. 153–162 (2024). https://doi.org/10.1145/3637528.3672034

work page doi:10.1145/3637528.3672034 2024
[6]

Biostatistics 1(1), 49–67 (2000)

Buyse, M., Molenberghs, G., Burzykowski, T., Renard, D., Geys, H.: The validation of surrogate endpoints in meta-analyses of randomized experiments. Biostatistics 1(1), 49–67 (2000)

2000
[7]

In: Proceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining (2025)

Chou, W., Gray, C., Kallus, N., Bibaut, A., Ejdemyr, S.: Evaluating decision rules across many weak experiments. In: Proceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining (2025). https://doi.org/10.1145/3711896.3737217

work page doi:10.1145/3711896.3737217 2025
[8]

In: The World Wide Web Conference

Coey, D., Cunningham, T.: Improving treatment effect estimators through experiment splitting. In: The World Wide Web Conference. pp. 285–295 (2019)

2019
[9]

Cunningham, T., Kim, J.: Interpreting experiments with multiple outcomes (2022)

2022
[10]

In: Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining

Deng, A., Hagar, L., Stevens, N.T., Xifara, T., Gandhi, A.: Metric decomposition in a/b tests. In: Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. pp. 4885–4895 (2024). https://doi.org/10.1145/3637528.3671556

work page doi:10.1145/3637528.3671556 2024
[11]

In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining

Deng, A., Shi, X.: Data-driven metric development for online controlled experiments: Seven lessons learned. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. pp. 77–86 (2016)

2016
[12]

Perspectives on psychological science 9(6), 641–651 (2014)

Gelman, A., Carlin, J.: Beyond power calculations: Assessing type s (sign) and type m (magnitude) errors. Perspectives on psychological science 9(6), 641–651 (2014)

2014
[13]

In: Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining

Kohavi, R., Chen, N.: False positives in a/b tests. In: Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. pp. 5240–5250 (2024). https://doi.org/10.1145/3637528.3671631

work page doi:10.1145/3637528.3671631 2024
[14]

The American Statistician 78(2), 135–149 (2024)

Larsen, N., Stallrich, J., Sengupta, S., Deng, A., Kohavi, R., Stevens, N.T.: Statistical challenges in online controlled experiments: A review of a/b testing methodology. The American Statistician 78(2), 135–149 (2024)

2024
[15]

In: Proceedings of the 24th ACM SIGKDD international conference on knowledge discovery & data mining

Lee, M.R., Shen, M.: Winner’s curse: Bias estimation for total effects of features in online controlled experiments. In: Proceedings of the 24th ACM SIGKDD international conference on knowledge discovery & data mining. pp. 491–499 (2018)

2018
[16]

Controlled clinical trials 23(6), 607–625 (2002)

Molenberghs, G., Buyse, M., Geys, H., Renard, D., Burzykowski, T., Alonso, A.: Statistical challenges in the evaluation of surrogate endpoints in randomized trials. Controlled clinical trials 23(6), 607–625 (2002)

2002
[17]

Statistics in medicine 8(4), 431–440 (1989)

Prentice, R.L.: Surrogate endpoints in clinical trials: definition and operational criteria. Statistics in medicine 8(4), 431–440 (1989)

1989
[18]

In: Proceedings of the 26th ACM Conference on Economics and Computation (2025)

Sudijono, T., Ejdemyr, S., Lal, A., Tingley, M.: Optimizing returns from experimentation programs. In: Proceedings of the 26th ACM Conference on Economics and Computation (2025). https://doi.org/10.1145/3736252.3742638, https://dl.acm.org/doi/abs/10.1145/3736252.3742638

work page doi:10.1145/3736252.3742638 2025
[19]

In: Proceedings of the 16th ACM SIGKDD international conference on Knowledge discovery and data mining

Tang, D., Agarwal, A., O’Brien, D., Meyer, M.: Overlapping experiment infrastructure: More, better, faster experimentation. In: Proceedings of the 16th ACM SIGKDD international conference on Knowledge discovery and data mining. pp. 17–26 (2010)

2010
[20]

In: Proceedings of the 41st International Conference on Machine Learning

Tran, A., Bibaut, A., Kallus, N.: Inferring the long-term causal effects of long-term treatments from short-term experiments. In: Proceedings of the 41st International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 235, pp. 48565–48577. PMLR (2024), https://proceedings.mlr.press/v235/tran24b.html

2024
[21]

In: Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining

Tripuraneni, N., Richardson, L., D’Amour, A., Soriano, J., Yadlowsky, S.: Choosing a proxy metric from past experiments. In: Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. pp. 5803–5812 (2024). https://doi.org/10.1145/3637528.3671543

work page doi:10.1145/3637528.3671543 2024
[22]

arXiv preprint arXiv:2311.11922 (2023)

Zhang, V., Zhao, M., Le, A., Kallus, N., et al.: Evaluating the surrogate index as a decision-making tool using 200 a/b tests at netflix. arXiv preprint arXiv:2311.11922 (2023)

arXiv 2023
[23]

Applied Stochastic Models in Business and Industry 41(2), e70003 (2025)

Zito, A., Greaves, D., Soriano, J., Richardson, L.: Pareto optimal proxy metrics. Applied Stochastic Models in Business and Industry 41(2), e70003 (2025)

2025
