PROXIMA: A Reliability Scoring Framework for Proxy Metrics in Online Controlled Experiments
Pith reviewed 2026-05-10 12:16 UTC · model grok-4.3
The pith
PROXIMA scores proxy metrics by checking if they produce correct launch decisions and flags failing user segments.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
PROXIMA is a lightweight diagnostic framework that scores proxy reliability through a composite of three dimensions: normalised effect correlation, directional accuracy, and segment-level fragility rate. It directly audits whether a candidate proxy leads to correct launch decisions rather than attempting to predict long-term treatment effects, and it identifies the specific user segments where the proxy fails. Validation across 80 simulated A/B tests on the Criteo Uplift and KuaiRec datasets shows early engagement metrics achieving composite reliabilities of 0.80 and 0.62 respectively, with 98.4 percent average agreement to an oracle policy. Fragility analysis indicates 68 percent segment-he
What carries the argument
The composite reliability score built from normalised effect correlation, directional accuracy, and segment-level fragility rate.
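The composite can be sketched in a few lines. This is a minimal illustration only: neither the paper's abstract nor this review gives the exact formulas, so the normalisation of the correlation to [0, 1], the sign-agreement definition of directional accuracy, the use of `1 - fragility_rate` as a stability term, and the equal default weights are all assumptions, as are the function names.

```python
from math import sqrt


def pearson(xs, ys):
    """Plain Pearson correlation; returns 0.0 when either input is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def directional_accuracy(proxy_effects, outcome_effects):
    """Share of experiments where both effects fall on the same side of zero.

    An effect of exactly zero is treated as "not positive" (assumption).
    """
    pairs = list(zip(proxy_effects, outcome_effects))
    return sum((p > 0) == (o > 0) for p, o in pairs) / len(pairs)


def composite_reliability(proxy_effects, outcome_effects, fragility_rate,
                          weights=(1 / 3, 1 / 3, 1 / 3)):
    """Weighted sum of three components, each scaled to [0, 1].

    ASSUMPTIONS (not taken from the paper): correlation is mapped from
    [-1, 1] to [0, 1]; fragility enters as 1 - fragility_rate; weights
    are equal by default and sum to 1.
    """
    corr_norm = (pearson(proxy_effects, outcome_effects) + 1) / 2
    accuracy = directional_accuracy(proxy_effects, outcome_effects)
    stability = 1 - fragility_rate
    w_corr, w_dir, w_frag = weights
    return w_corr * corr_norm + w_dir * accuracy + w_frag * stability


# Hypothetical per-experiment treatment effects for one candidate proxy.
proxy = [0.10, 0.20, -0.05, 0.30]
outcome = [0.02, 0.05, -0.01, 0.04]
score = composite_reliability(proxy, outcome, fragility_rate=0.25)
```

Keeping every component on the same [0, 1] scale means the weights can be read directly as the relative importance of each dimension.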
If this is right
- Proxies can be ranked and selected according to their composite reliability score before use in production experiments.
- Segments flagged for high fragility can be isolated or monitored with additional metrics to avoid masked failures.
- The full composite distinguishes reliable from unreliable proxies more effectively than correlation alone.
- Early engagement metrics qualify as sufficiently reliable for launch decisions in the advertising and recommendation domains tested.
- Directional accuracy above 96 percent holds even when segment fragility differs sharply across domains.
Where Pith is reading between the lines
- Teams could run the fragility component on historical data to decide in advance which user groups need separate long-term tracking.
- The framework might reduce costly ship/no-ship errors by surfacing proxies that look good in aggregate but fail for large subgroups.
- Domains with high measured fragility may benefit from maintaining a small set of parallel proxies instead of relying on one.
- Sensitivity results suggest that dropping any one of the three components would weaken the ability to screen proxies.
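Running the fragility component on historical data could look roughly like the sketch below. It assumes a segment counts as fragile when the proxy effect and the long-term effect point in opposite directions within that segment; the paper's actual fragility definition is not given here, and the segment names and effect values are hypothetical.

```python
def fragile_segments(segment_effects):
    """Flag segments whose proxy and long-term effects disagree in sign.

    segment_effects maps a segment name to a pair
    (proxy_effect, long_term_effect). A segment is flagged when the
    product of the two effects is negative (ASSUMED definition).
    Returns the flagged names and the share of segments flagged.
    """
    flagged = [name for name, (proxy, long_term) in segment_effects.items()
               if proxy * long_term < 0]
    return flagged, len(flagged) / len(segment_effects)


# Hypothetical historical effects per user segment.
history = {
    "new_users": (0.05, -0.02),    # proxy up, long-term outcome down
    "power_users": (0.03, 0.04),
    "casual": (0.01, 0.02),
    "dormant": (-0.01, -0.03),
}
flagged, rate = fragile_segments(history)
```

Segments returned in `flagged` are the candidates for separate long-term tracking.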
Load-bearing premise
The 80 simulated A/B tests built from the Criteo and KuaiRec datasets accurately reflect the heterogeneity, treatment effects, and decision scenarios of real production experiments.
What would settle it
Running PROXIMA on a large collection of real production A/B tests that include known long-term outcomes and measuring whether the predicted decision agreement matches the observed outcomes.
Original abstract
Online A/B testing at scale relies on proxy metrics: short-term, easily measured signals used in place of slow-moving long-term outcomes. When the proxy-outcome relationship is heterogeneous across user segments, aggregate correlation can mask directional failures akin to Simpson's Paradox, leading to costly ship/no-ship errors. We introduce PROXIMA (Proxy Metric Validation Framework for Online Experiments), a lightweight diagnostic framework that scores proxy reliability through a composite of three complementary dimensions: normalised effect correlation, directional accuracy, and segment-level fragility rate. Unlike surrogate-index approaches that predict long-term treatment effects, PROXIMA directly audits whether a candidate proxy leads to correct launch decisions and flags the user segments where it fails. We validate PROXIMA on two public datasets, the Criteo Uplift corpus (14M observations, advertising) and KuaiRec (7K users, video recommendation), using 80 simulated A/B tests. Early engagement metrics achieve a composite reliability of 0.80 on Criteo and 0.62 on KuaiRec, yielding 98.4% average decision agreement with an oracle policy. Fragility analysis reveals that recommendation domains exhibit substantially higher segment-level heterogeneity (68% fragility) than advertising (13%), yet directional accuracy remains above 96% in both cases. A sensitivity analysis over the weight space confirms that no single component suffices and that the composite provides substantially better discrimination between reliable and unreliable proxies than correlation alone. Code and reproduction scripts are available at: https://github.com/Avinash-Amudala/PROXIMA
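The masking effect the abstract describes can be reproduced with a toy example. The numbers below are invented for illustration and are not drawn from Criteo or KuaiRec: within each of two hypothetical segments the proxy and the outcome move in opposite directions, yet pooling the segments yields a strongly positive correlation.

```python
from math import sqrt


def pearson(xs, ys):
    """Plain Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical (proxy, outcome) pairs per segment.
segments = {
    "segment_a": [(1, 5), (2, 4), (3, 3)],
    "segment_b": [(6, 10), (7, 9), (8, 8)],
}

# Correlation inside each segment: negative in both.
within = {
    name: pearson([p for p, _ in pairs], [o for _, o in pairs])
    for name, pairs in segments.items()
}

# Correlation after pooling all points: positive.
pooled_pairs = [pair for pairs in segments.values() for pair in pairs]
pooled = pearson([p for p, _ in pooled_pairs], [o for _, o in pooled_pairs])
```

A proxy screened on `pooled` alone would look reliable here, which is why a segment-level check is needed alongside aggregate correlation.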
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces PROXIMA, a lightweight diagnostic framework for assessing proxy metric reliability in online controlled experiments. It scores proxies via a composite of three components—normalized effect correlation, directional accuracy, and segment-level fragility rate—explicitly auditing whether proxies lead to correct launch decisions rather than relying on aggregate correlation. Validation uses two public datasets (Criteo Uplift with 14M observations and KuaiRec with 7K users) and 80 simulated A/B tests, reporting composite reliabilities of 0.80 and 0.62 for early engagement metrics, 98.4% average decision agreement with an oracle policy, higher fragility in recommendation domains (68%) than advertising (13%), and a sensitivity analysis showing the composite outperforms correlation alone. Code is provided for reproducibility.
Significance. If the simulation-based validation holds, PROXIMA offers a practical, decision-focused alternative to surrogate-index methods for proxy selection in large-scale A/B testing, directly addressing segment-level heterogeneity and Simpson's paradox risks. The use of public datasets, explicit sensitivity checks over weights, and open code are strengths that support reproducibility and allow external scrutiny of the discrimination power of the three-component score.
major comments (1)
- [Validation on simulated A/B tests] The empirical claims rest entirely on 80 simulated A/B tests constructed on the Criteo and KuaiRec datasets (abstract and validation section). The manuscript provides no quantitative diagnostics—such as moment matching against real logged production experiments, checks for interference effects, or sensitivity to variations in segment-level treatment-effect heterogeneity—to establish that these simulations faithfully reproduce the decision thresholds, confounding structure, and heterogeneity patterns of production online controlled experiments. Without such checks, the reported composite reliabilities (0.80/0.62), 98.4% oracle agreement, and fragility rates (13%/68%) cannot be confidently transferred beyond the simulated setting.
minor comments (2)
- [Introduction and Methods] The exact definitions and normalization procedures for the three components (especially 'normalised effect correlation' and 'segment-level fragility rate') are described at a high level in the abstract and introduction; including the precise formulas and any hyperparameters in the main text or an appendix would improve clarity and reproducibility.
- [Sensitivity analysis] The sensitivity analysis over the weight space is mentioned but lacks detail on the range of weights explored and the exact discrimination metric used to compare the composite against correlation alone; a table or figure showing the full weight-sensitivity results would strengthen the claim that no single component suffices.
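One way to make the requested weight-sensitivity detail concrete is a grid sweep over the weight simplex. The sketch below is an illustration under stated assumptions, not the authors' procedure: the grid step, the worst-case score gap used as a discrimination metric, and all component values are hypothetical. Each component triple is (normalised correlation, directional accuracy, 1 - fragility rate).

```python
def simplex_grid(steps=10):
    """All weight triples on a regular grid whose entries sum to 1."""
    return [(i / steps, j / steps, (steps - i - j) / steps)
            for i in range(steps + 1)
            for j in range(steps + 1 - i)]


def composite(components, weights):
    """Weighted sum of the three component scores."""
    return sum(w * c for w, c in zip(weights, components))


def worst_case_gap(weights, reliable, unreliable_list):
    """Smallest score margin between the reliable proxy and any unreliable one.

    HYPOTHETICAL discrimination metric, chosen only for illustration.
    """
    good = composite(reliable, weights)
    return min(good - composite(bad, weights) for bad in unreliable_list)


# Hypothetical proxies: each unreliable one fails on a different component.
RELIABLE = (0.90, 0.95, 0.90)
UNRELIABLE = [
    (0.40, 0.95, 0.90),  # weak correlation
    (0.90, 0.50, 0.90),  # poor directional accuracy
    (0.90, 0.95, 0.20),  # high fragility
]

grid = simplex_grid(10)
best = max(grid, key=lambda w: worst_case_gap(w, RELIABLE, UNRELIABLE))
correlation_only = worst_case_gap((1.0, 0.0, 0.0), RELIABLE, UNRELIABLE)
```

In this toy setup any single-component weighting has a worst-case gap of zero, because one of the unreliable proxies matches the reliable one on that component, while the best grid point puts positive weight on all three. A table of `worst_case_gap` over `grid` is the kind of artefact the comment asks for.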
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on the validation of our simulation framework. We address the major comment below and will revise the manuscript to improve transparency and contextualize the results.
-
Referee: [Validation on simulated A/B tests] The empirical claims rest entirely on 80 simulated A/B tests constructed on the Criteo and KuaiRec datasets (abstract and validation section). The manuscript provides no quantitative diagnostics—such as moment matching against real logged production experiments, checks for interference effects, or sensitivity to variations in segment-level treatment-effect heterogeneity—to establish that these simulations faithfully reproduce the decision thresholds, confounding structure, and heterogeneity patterns of production online controlled experiments. Without such checks, the reported composite reliabilities (0.80/0.62), 98.4% oracle agreement, and fragility rates (13%/68%) cannot be confidently transferred beyond the simulated setting.
Authors: We agree that the validation relies exclusively on 80 simulated A/B tests derived from the public Criteo Uplift and KuaiRec datasets, and that the manuscript does not include direct quantitative diagnostics such as moment matching to real production experiments or explicit interference checks. The simulations are constructed directly from the observed user-level data in these corpora to retain authentic segment sizes, outcome distributions, and treatment-effect heterogeneity, which enables reproducible study of proxy decision errors including Simpson's paradox. However, we do not have access to proprietary production A/B test logs that would be required for moment matching or interference analysis. In the revised manuscript we will: (1) expand the validation section with a precise description of the simulation procedure and the data moments it preserves; (2) add further sensitivity analyses that systematically vary the degree of segment-level treatment-effect heterogeneity; and (3) insert a dedicated Limitations section that explicitly discusses the simulated nature of the experiments, the lack of interference modeling, and the resulting limits on direct transfer to production settings. These changes will make the scope and assumptions of the reported metrics (0.80/0.62 reliability, 98.4% oracle agreement, 13%/68% fragility) clearer without overstating generalizability.
Revision: yes
- Moment matching against real logged production experiments and checks for interference effects, because such proprietary data are not available to us.
Circularity Check
No significant circularity; framework defined independently of validation data
The paper defines PROXIMA as a composite of three explicitly stated dimensions (normalised effect correlation, directional accuracy, segment-level fragility rate) and applies it to externally simulated A/B tests on public datasets. Reported metrics (0.80/0.62 reliability, 98.4% oracle agreement) are computed outputs from those simulations rather than inputs; the sensitivity analysis over weights is an empirical check showing the composite outperforms singles, not a definitional reduction. No self-citations, fitted parameters renamed as predictions, or ansatzes appear in the abstract or described chain. The derivation remains self-contained against the external oracle benchmark.
Axiom & Free-Parameter Ledger
free parameters (1)
- component weights in composite reliability score
axioms (2)
- domain assumption An oracle policy based on true long-term outcomes provides the correct ground-truth launch decision for each simulated test.
- domain assumption The simulated A/B tests preserve the statistical properties of real experiments on the chosen datasets.
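The first axiom can be made concrete with a small decision audit. This sketch assumes a simple ship rule (launch when the estimated effect exceeds a threshold, zero by default) for both the proxy and the oracle; the paper's actual decision rule is not stated here, and the function name and example effects are hypothetical.

```python
from collections import Counter


def audit_decisions(proxy_effects, long_term_effects, threshold=0.0):
    """Compare proxy-based launch decisions with oracle decisions.

    ASSUMED rule: ship when the effect is strictly above `threshold`.
    Returns the agreement rate plus counts of the two error types:
    false_ship (proxy ships, oracle would not) and
    missed_ship (proxy holds back, oracle would ship).
    """
    counts = Counter()
    for proxy, long_term in zip(proxy_effects, long_term_effects):
        proxy_ship = proxy > threshold
        oracle_ship = long_term > threshold
        if proxy_ship == oracle_ship:
            counts["agree"] += 1
        elif proxy_ship:
            counts["false_ship"] += 1
        else:
            counts["missed_ship"] += 1
    total = sum(counts.values())
    if total == 0:
        raise ValueError("at least one experiment is required")
    return {
        "agreement": counts["agree"] / total,
        "false_ship": counts["false_ship"],
        "missed_ship": counts["missed_ship"],
    }


# Hypothetical effects from four simulated tests.
report = audit_decisions([0.1, 0.2, -0.1, 0.3], [0.05, -0.02, -0.03, 0.01])
```

Separating the two error types matters because shipping a harmful change and withholding a beneficial one usually carry different costs.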