arXiv: 2604.14352 · v1 · submitted 2026-04-15 · stat.ME · cs.LG · stat.AP

PROXIMA: A Reliability Scoring Framework for Proxy Metrics in Online Controlled Experiments

Pith reviewed 2026-05-10 12:16 UTC · model grok-4.3

classification: stat.ME · cs.LG · stat.AP
keywords: proxy metrics · online A/B testing · reliability scoring · segment heterogeneity · directional accuracy · fragility rate · launch decisions · Simpson's paradox

The pith

PROXIMA scores proxy metrics by checking if they produce correct launch decisions and flags failing user segments.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents PROXIMA as a framework to score the reliability of short-term proxy metrics that stand in for long-term outcomes in online A/B tests. It tackles cases where overall correlation hides segment-specific failures that can cause wrong decisions about shipping changes. The score combines normalised effect correlation, directional accuracy, and segment-level fragility rate into one composite measure. Tests on 80 simulated experiments from the Criteo and KuaiRec datasets show early engagement proxies reaching 0.80 and 0.62 reliability while matching oracle decisions 98.4 percent of the time on average. The work also finds that recommendation domains have much higher segment fragility than advertising domains, yet directional accuracy stays high in both.

Core claim

PROXIMA is a lightweight diagnostic framework that scores proxy reliability through a composite of three dimensions: normalised effect correlation, directional accuracy, and segment-level fragility rate. It directly audits whether a candidate proxy leads to correct launch decisions rather than attempting to predict long-term treatment effects, and it identifies the specific user segments where the proxy fails. Validation across 80 simulated A/B tests on the Criteo Uplift and KuaiRec datasets shows early engagement metrics achieving composite reliabilities of 0.80 and 0.62 respectively, with 98.4 percent average agreement to an oracle policy. Fragility analysis indicates 68 percent segment-he

What carries the argument

The composite reliability score built from normalised effect correlation, directional accuracy, and segment-level fragility rate.
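
As a minimal sketch of how the three components could combine, the Python below uses the weighted-sum form given in the Figure 3 caption (w_C · C + w_DA · DA + w_FR · (1 − FR)). The paper's exact normalisation is not reproduced on this page, so the mapping of Pearson's r to [0, 1] via (r + 1) / 2, the equal default weights, the zero sign threshold, the toy effect vectors, and all function names are assumptions made for illustration.

```python
from math import sqrt


def pearson(x, y):
    """Plain Pearson correlation; returns 0.0 if either vector is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)


def normalised_correlation(tau_proxy, tau_long):
    # ASSUMPTION: rescale r from [-1, 1] to [0, 1]; the paper may normalise differently.
    return (pearson(tau_proxy, tau_long) + 1.0) / 2.0


def directional_accuracy(tau_proxy, tau_long):
    # Share of experiments where proxy and long-term effects have the same sign.
    agree = sum((p > 0) == (l > 0) for p, l in zip(tau_proxy, tau_long))
    return agree / len(tau_proxy)


def composite_reliability(c, da, fr, w_c=1 / 3, w_da=1 / 3, w_fr=1 / 3):
    # Weighted sum from the Figure 3 caption; equal weights are an ASSUMPTION.
    return w_c * c + w_da * da + w_fr * (1.0 - fr)


# Hypothetical experiment-level effects for five experiments.
tau_proxy = [0.8, -0.2, 0.5, 0.1, -0.4]
tau_long = [0.6, -0.1, 0.3, -0.05, -0.5]

c = normalised_correlation(tau_proxy, tau_long)
da = directional_accuracy(tau_proxy, tau_long)
score = composite_reliability(c, da, fr=0.2)  # fr supplied by hand here
print(f"C={c:.3f} DA={da:.2f} composite={score:.3f}")
```

In this toy input the fourth experiment has a positive proxy effect and a negative long-term effect, so directional accuracy drops to 4 of 5 even though the correlation stays high.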

If this is right

  • Proxies can be ranked and selected according to their composite reliability score before use in production experiments.
  • Segments flagged for high fragility can be isolated or monitored with additional metrics to avoid masked failures.
  • The full composite distinguishes reliable from unreliable proxies more effectively than correlation alone.
  • Early engagement metrics qualify as sufficiently reliable for launch decisions in the advertising and recommendation domains tested.
  • Directional accuracy above 96 percent holds even when segment fragility differs sharply across domains.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Teams could run the fragility component on historical data to decide in advance which user groups need separate long-term tracking.
  • The framework might reduce costly ship/no-ship errors by surfacing proxies that look good in aggregate but fail for large subgroups.
  • Domains with high measured fragility may benefit from maintaining a small set of parallel proxies instead of relying on one.
  • Sensitivity results suggest that dropping any one of the three components would weaken the ability to screen proxies.
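
The first extension above (running the fragility component on historical data) might look like the following sketch. It computes a per-segment sign-flip rate, in the spirit of the Figure 9 caption, and flags segments above a cut-off. The record layout, the segment names, and the 0.3 cut-off are hypothetical; the paper's own fragility definition may differ.

```python
from collections import defaultdict


def segment_flip_rates(records):
    """records: iterable of (segment, proxy_effect, long_effect),
    one entry per experiment-segment cell."""
    flips = defaultdict(int)
    totals = defaultdict(int)
    for segment, proxy_eff, long_eff in records:
        totals[segment] += 1
        # A sign flip: the proxy and the long-term outcome point in opposite directions.
        if (proxy_eff > 0) != (long_eff > 0):
            flips[segment] += 1
    return {seg: flips[seg] / totals[seg] for seg in totals}


def fragile_segments(rates, threshold=0.3):
    # ASSUMPTION: the 0.3 cut-off is illustrative, not a value from the paper.
    return sorted(seg for seg, rate in rates.items() if rate > threshold)


# Hypothetical historical effects: four experiments, two segments.
records = [
    ("new_users", 0.4, 0.2), ("new_users", 0.3, -0.1),
    ("new_users", 0.2, -0.3), ("new_users", -0.1, -0.2),
    ("tenured", 0.5, 0.4), ("tenured", -0.2, -0.1),
    ("tenured", 0.1, 0.3), ("tenured", 0.2, 0.1),
]

rates = segment_flip_rates(records)
flagged = fragile_segments(rates)
print(rates, flagged)
```

Segments returned by `fragile_segments` would be the candidates for separate long-term tracking.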

Load-bearing premise

The 80 simulated A/B tests built from the Criteo and KuaiRec datasets accurately reflect the heterogeneity, treatment effects, and decision scenarios of real production experiments.

What would settle it

Running PROXIMA on a large collection of real production A/B tests that include known long-term outcomes and measuring whether the predicted decision agreement matches the observed outcomes.
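
A rough sketch of that check, assuming each real experiment yields one proxy effect and one long-term effect: compare the launch call made on the proxy with the oracle call made on the long-term outcome, and count the two kinds of disagreement. The launch rule (ship when the effect exceeds zero) is an assumption; production systems typically add significance tests and guardrails. The effect values and function names are hypothetical.

```python
def ship(effect, threshold=0.0):
    """Launch rule (ASSUMPTION): ship when the estimated effect exceeds the threshold."""
    return effect > threshold


def audit_decisions(tau_proxy, tau_long, threshold=0.0):
    """Compare proxy-based launch calls with oracle calls made on long-term effects."""
    agree = false_ships = missed_wins = 0
    for p, l in zip(tau_proxy, tau_long):
        proxy_call, oracle_call = ship(p, threshold), ship(l, threshold)
        if proxy_call == oracle_call:
            agree += 1
        elif proxy_call:
            false_ships += 1  # proxy says ship, long-term outcome says no
        else:
            missed_wins += 1  # proxy says hold, long-term outcome says ship
    return {
        "agreement": agree / len(tau_proxy),
        "false_ships": false_ships,
        "missed_wins": missed_wins,
    }


# Hypothetical effects from five experiments with known long-term outcomes.
tau_proxy = [0.30, 0.10, -0.20, 0.05, -0.10]
tau_long = [0.20, -0.10, -0.30, 0.10, 0.15]

report = audit_decisions(tau_proxy, tau_long)
print(report)
```

Splitting disagreements into false ships and missed wins matters because the two errors usually carry different costs.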

Figures

Figures reproduced from arXiv: 2604.14352 by Avinash Amudala.

Figure 1: High-level architecture of PROXIMA. Historical experiment data is processed by the core engine (composite scoring, fragility detection, decision simulation) and surfaced via an API and dashboard.
Section 3.3, Composite Reliability Score (excerpt): Let $\tau_{\text{proxy}} = (\tau^{1}_{\text{proxy}}, \dots, \tau^{E}_{\text{proxy}})$ and $\tau_{\text{long}} = (\tau^{1}_{\text{long}}, \dots, \tau^{E}_{\text{long}})$ be the vectors of experiment-level effects. We define three component scores: Normalised Eff…
Figure 2: End-to-end PROXIMA workflows: data processing (a), composite scoring (b), decision simulation (c), and fragility analysis (d).
Figure 3: Stacked decomposition of the composite reliability score. Each bar shows the contribution of the normalised correlation (w_C · C, blue), directional accuracy (w_DA · DA, green), and segment stability (w_FR · (1 − FR), orange) components
Figure 4: Proxy metric reliability across datasets.
Figure 7: Proxy versus long-term treatment effects
Figure 8: Bootstrap distribution of composite reli
Figure 9: Segment-level fragility profiles. Each bar shows the sign-flip rate for a specific region/device/tenure segment. Left: a moderate proxy with localised fragility. Right: a pathological proxy with widespread fragility.
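
The Figure 8 caption points to a bootstrap over a reliability score. As a hedged sketch of that idea, the code below resamples experiments with replacement and reports a percentile interval. Directional accuracy stands in for the full composite to keep the example short; the resampling unit (whole experiments), the 2000 resamples, the fixed seed, and the toy effect pairs are all assumptions rather than details taken from the paper.

```python
import random


def directional_accuracy(pairs):
    """pairs: list of (proxy_effect, long_effect), one per experiment."""
    return sum((p > 0) == (l > 0) for p, l in pairs) / len(pairs)


def bootstrap_interval(pairs, statistic, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval, resampling experiments with replacement."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(pairs)
    stats = sorted(
        statistic([pairs[rng.randrange(n)] for _ in range(n)])
        for _ in range(n_resamples)
    )
    lo_idx = round((alpha / 2) * n_resamples)
    hi_idx = min(n_resamples - 1, round((1 - alpha / 2) * n_resamples))
    return stats[lo_idx], stats[hi_idx]


# Hypothetical effects for ten experiments; two of them disagree in sign.
pairs = [
    (0.4, 0.3), (0.2, 0.1), (-0.3, -0.2), (0.1, -0.1), (0.5, 0.4),
    (-0.2, -0.1), (0.3, 0.2), (-0.1, 0.2), (0.6, 0.5), (0.2, 0.3),
]

point = directional_accuracy(pairs)
lo, hi = bootstrap_interval(pairs, directional_accuracy)
print(f"point={point:.2f} interval=({lo:.2f}, {hi:.2f})")
```

With only ten experiments the interval is wide, which is the practical reason to report a distribution rather than a single score.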
Original abstract

Online A/B testing at scale relies on proxy metrics -- short-term, easily-measured signals used in place of slow-moving long-term outcomes. When the proxy-outcome relationship is heterogeneous across user segments, aggregate correlation can mask directional failures akin to Simpson's Paradox, leading to costly ship/no-ship errors. We introduce PROXIMA (Proxy Metric Validation Framework for Online Experiments), a lightweight diagnostic framework that scores proxy reliability through a composite of three complementary dimensions: normalised effect correlation, directional accuracy, and segment-level fragility rate. Unlike surrogate-index approaches that predict long-term treatment effects, PROXIMA directly audits whether a candidate proxy leads to correct launch decisions and flags the user segments where it fails. We validate PROXIMA on two public datasets -- the Criteo Uplift corpus (14M observations, advertising) and KuaiRec (7K users, video recommendation) -- using 80 simulated A/B tests. Early engagement metrics achieve a composite reliability of 0.80 on Criteo and 0.62 on KuaiRec, yielding 98.4% average decision agreement with an oracle policy. Fragility analysis reveals that recommendation domains exhibit substantially higher segment-level heterogeneity (68% fragility) than advertising (13%), yet directional accuracy remains above 96% in both cases. A sensitivity analysis over the weight space confirms that no single component suffices and that the composite provides substantially better discrimination between reliable and unreliable proxies than correlation alone. Code and reproduction scripts are available at: https://github.com/Avinash-Amudala/PROXIMA
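
The masking effect the abstract describes can be illustrated with a small worked example. All numbers are hypothetical, and the example shows heterogeneity hiding a segment-level failure rather than a full Simpson's reversal: the traffic-weighted aggregate looks healthy on both the proxy and the long-term outcome, while one segment moves in opposite directions.

```python
# name: (traffic share, proxy effect, long-term effect); all values hypothetical
segments = {
    "core_users": (0.9, 0.10, 0.08),
    "new_users": (0.1, 0.05, -0.20),
}


def aggregate(segments, index):
    """Traffic-weighted average of the effect stored at position `index`."""
    return sum(values[0] * values[index] for values in segments.values())


agg_proxy = aggregate(segments, 1)  # 0.9 * 0.10 + 0.1 * 0.05
agg_long = aggregate(segments, 2)   # 0.9 * 0.08 + 0.1 * (-0.20)

# Segments where the proxy and the long-term outcome disagree in sign.
flipped = [
    name for name, (_, proxy_eff, long_eff) in segments.items()
    if (proxy_eff > 0) != (long_eff > 0)
]

print(f"aggregate proxy={agg_proxy:.3f}, aggregate long-term={agg_long:.3f}")
print("sign-flipped segments:", flipped)
```

An aggregate-only audit would approve this proxy; a segment-level check surfaces that new users are harmed while the proxy still rises for them.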

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper introduces PROXIMA, a lightweight diagnostic framework for assessing proxy metric reliability in online controlled experiments. It scores proxies via a composite of three components (normalised effect correlation, directional accuracy, and segment-level fragility rate), explicitly auditing whether proxies lead to correct launch decisions rather than relying on aggregate correlation. Validation uses two public datasets (Criteo Uplift with 14M observations and KuaiRec with 7K users) and 80 simulated A/B tests. It reports composite reliabilities of 0.80 (Criteo) and 0.62 (KuaiRec) for early engagement metrics, 98.4% average decision agreement with an oracle policy, higher fragility in recommendation domains (68%) than in advertising (13%), and a sensitivity analysis showing that the composite outperforms correlation alone. Code is provided for reproducibility.

Significance. If the simulation-based validation holds, PROXIMA offers a practical, decision-focused alternative to surrogate-index methods for proxy selection in large-scale A/B testing, directly addressing segment-level heterogeneity and Simpson's paradox risks. The use of public datasets, explicit sensitivity checks over weights, and open code are strengths that support reproducibility and allow external scrutiny of the discrimination power of the three-component score.

major comments (1)
  1. [Validation on simulated A/B tests] The empirical claims rest entirely on 80 simulated A/B tests constructed on the Criteo and KuaiRec datasets (abstract and validation section). The manuscript provides no quantitative diagnostics—such as moment matching against real logged production experiments, checks for interference effects, or sensitivity to variations in segment-level treatment-effect heterogeneity—to establish that these simulations faithfully reproduce the decision thresholds, confounding structure, and heterogeneity patterns of production online controlled experiments. Without such checks, the reported composite reliabilities (0.80/0.62), 98.4% oracle agreement, and fragility rates (13%/68%) cannot be confidently transferred beyond the simulated setting.
minor comments (2)
  1. [Introduction and Methods] The exact definitions and normalization procedures for the three components (especially 'normalised effect correlation' and 'segment-level fragility rate') are described at a high level in the abstract and introduction; including the precise formulas and any hyperparameters in the main text or an appendix would improve clarity and reproducibility.
  2. [Sensitivity analysis] The sensitivity analysis over the weight space is mentioned but lacks detail on the range of weights explored and the exact discrimination metric used to compare the composite against correlation alone; a table or figure showing the full weight-sensitivity results would strengthen the claim that no single component suffices.
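
To make the second minor comment concrete, a weight sweep of the kind being requested could be sketched as follows. The grid step, the component scores for the two example proxies, and the use of a simple score gap as the discrimination measure are all assumptions; the paper's actual weight range and metric are not stated on this page.

```python
def simplex_grid(step=0.1):
    """Yield weight triples (w_c, w_da, w_fr) that sum to 1 on a regular grid."""
    n = round(1 / step)
    for i in range(n + 1):
        for j in range(n + 1 - i):
            k = n - i - j
            yield (i / n, j / n, k / n)


def composite(components, weights):
    c, da, fr = components
    w_c, w_da, w_fr = weights
    # Weighted-sum form from the Figure 3 caption.
    return w_c * c + w_da * da + w_fr * (1 - fr)


# Hypothetical (C, DA, FR) component scores for two candidate proxies.
reliable = (0.90, 0.95, 0.10)
unreliable = (0.85, 0.60, 0.70)

# ASSUMPTION: discrimination is measured as the gap between the two composites.
gaps = {w: composite(reliable, w) - composite(unreliable, w) for w in simplex_grid()}

corr_only_gap = gaps[(1.0, 0.0, 0.0)]  # correlation used on its own
best_w = max(gaps, key=gaps.get)
print(f"{len(gaps)} weightings; correlation-only gap={corr_only_gap:.2f}; "
      f"largest gap at weights {best_w}")
```

Because the score is linear in the weights, the gap for any single pair of proxies is always largest at a corner of the simplex. A sweep therefore only becomes informative when it is summarised across many proxies, which is why a table or figure of the full results would help.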

Simulated Author's Rebuttal

1 response · 1 unresolved

We thank the referee for the constructive feedback on the validation of our simulation framework. We address the major comment below and will revise the manuscript to improve transparency and contextualize the results.

point-by-point responses
  1. Referee: [Validation on simulated A/B tests] The empirical claims rest entirely on 80 simulated A/B tests constructed on the Criteo and KuaiRec datasets (abstract and validation section). The manuscript provides no quantitative diagnostics—such as moment matching against real logged production experiments, checks for interference effects, or sensitivity to variations in segment-level treatment-effect heterogeneity—to establish that these simulations faithfully reproduce the decision thresholds, confounding structure, and heterogeneity patterns of production online controlled experiments. Without such checks, the reported composite reliabilities (0.80/0.62), 98.4% oracle agreement, and fragility rates (13%/68%) cannot be confidently transferred beyond the simulated setting.

    Authors: We agree that the validation relies exclusively on 80 simulated A/B tests derived from the public Criteo Uplift and KuaiRec datasets, and that the manuscript does not include direct quantitative diagnostics such as moment matching to real production experiments or explicit interference checks. The simulations are constructed directly from the observed user-level data in these corpora to retain authentic segment sizes, outcome distributions, and treatment-effect heterogeneity, which enables reproducible study of proxy decision errors including Simpson's paradox. However, we do not have access to proprietary production A/B test logs that would be required for moment matching or interference analysis. In the revised manuscript we will: (1) expand the validation section with a precise description of the simulation procedure and the data moments it preserves; (2) add further sensitivity analyses that systematically vary the degree of segment-level treatment-effect heterogeneity; and (3) insert a dedicated Limitations section that explicitly discusses the simulated nature of the experiments, the lack of interference modeling, and the resulting limits on direct transfer to production settings. These changes will make the scope and assumptions of the reported metrics (0.80/0.62 reliability, 98.4% oracle agreement, 13%/68% fragility) clearer without overstating generalizability.

    revision: yes

standing simulated objections not resolved
  • Moment matching against real logged production experiments and checks for interference effects, because such proprietary data are not available to us.

Circularity Check

0 steps flagged

No significant circularity; framework defined independently of validation data

full rationale

The paper defines PROXIMA as a composite of three explicitly stated dimensions (normalised effect correlation, directional accuracy, segment-level fragility rate) and applies it to externally simulated A/B tests on public datasets. Reported metrics (0.80/0.62 reliability, 98.4% oracle agreement) are computed outputs from those simulations rather than inputs; the sensitivity analysis over weights is an empirical check showing the composite outperforms singles, not a definitional reduction. No self-citations, fitted parameters renamed as predictions, or ansatzes appear in the abstract or described chain. The derivation remains self-contained against the external oracle benchmark.

Axiom & Free-Parameter Ledger

1 free parameter · 2 axioms · 0 invented entities

The framework rests on standard A/B testing assumptions and introduces no new physical entities; the only free parameters are the weights in the composite score whose robustness is checked via sensitivity analysis.

free parameters (1)
  • component weights in composite reliability score
    The three dimensions are combined into a single score; sensitivity analysis is performed over the weight space, implying the weights are chosen parameters.
axioms (2)
  • domain assumption: An oracle policy based on true long-term outcomes provides the correct ground-truth launch decision for each simulated test.
    Used to compute the 98.4% decision agreement metric.
  • domain assumption: The simulated A/B tests preserve the statistical properties of real experiments on the chosen datasets.
    Required for the validation numbers to generalize.


