Ensuring Trustworthy Online A/B Testing: Addressing Five Key Questions on CUPED

Bokui Wan; Jinyong Ma; Yifan Guo; Yongli Qin; Yu Zhang

arxiv: 2606.18750 · v1 · pith:MJPHGVUNnew · submitted 2026-06-17 · 📊 stat.AP · cs.LG

Pith reviewed 2026-06-26 19:02 UTC · model grok-4.3

keywords CUPED · A/B testing · variance estimation · multi-arm experiments · two-stage sampling · online experimentation · pre-experiment data · treatment effect

The pith

In multi-arm experiments and two-stage sampling designs, standard variance estimators after CUPED produce severely misleading inferences.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines five practical questions about applying CUPED to reduce variance in online A/B tests while keeping estimates unbiased. It compares different post-adjustment estimators, checks when regression-based versions remain valid, and supplies matching variance formulas. The central extension shows that in multi-arm trials and two-stage sampling the usual variance formulas break down even after the CUPED adjustment is applied. This matters for any platform that runs large-scale experiments on features, pricing, or user experience, because wrong variance numbers can flip significance calls and launch decisions. The recommended fixes have already been put into production use.

Core claim

CUPED preserves unbiasedness of the average treatment effect, yet in multi-arm experiments and two-stage sampling designs the standard variance estimators attached to the adjusted estimator are invalid and can produce severely misleading inferences about treatment effects.

What carries the argument

CUPED adjustment of the outcome using pre-experiment data, paired with regression-based estimation and specially derived robust variance estimators that account for the multi-arm and two-stage structure.
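
As a point of reference, the basic CUPED adjustment can be sketched in a few lines. This is a minimal textbook version, not the paper's estimator: it assumes a single pre-experiment covariate, a pooled coefficient theta = Cov(Y, X) / Var(X), and simulated data whose coefficients (0.8, 0.1, 0.6) are made up for illustration. The function name `cuped_adjust` is hypothetical.

```python
import numpy as np

def cuped_adjust(y, x):
    """Return CUPED-adjusted outcomes y - theta * (x - mean(x)) and theta."""
    y = np.asarray(y, dtype=float)
    x = np.asarray(x, dtype=float)
    # theta = Cov(y, x) / Var(x), estimated on the pooled sample (assumption)
    theta = np.cov(y, x, ddof=1)[0, 1] / np.var(x, ddof=1)
    return y - theta * (x - x.mean()), theta

# Simulated two-arm experiment; all numbers below are illustrative assumptions.
rng = np.random.default_rng(0)
n = 20_000
x = rng.normal(size=n)                  # hypothetical pre-experiment metric
t = rng.integers(0, 2, size=n)          # 50/50 random assignment
y = 1.0 + 0.8 * x + 0.1 * t + rng.normal(scale=0.6, size=n)

y_adj, theta = cuped_adjust(y, x)
raw_effect = y[t == 1].mean() - y[t == 0].mean()
adj_effect = y_adj[t == 1].mean() - y_adj[t == 0].mean()
# Ratio below 1 means the adjusted outcome has lower sample variance.
var_ratio = y_adj.var(ddof=1) / y.var(ddof=1)
print(f"theta={theta:.3f} raw={raw_effect:.3f} adj={adj_effect:.3f} ratio={var_ratio:.3f}")
```

Because theta is the in-sample regression slope, the adjusted outcome's sample variance cannot exceed the raw one; how much it shrinks depends on how strongly the covariate correlates with the outcome.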

If this is right

Comparing post-CUPED estimators identifies the specification that minimizes variance while keeping the estimator unbiased.
Regression-based CUPED adjustments require tailored robust variance methods rather than off-the-shelf formulas to stay valid.
In multi-arm experiments the usual variance estimator after CUPED fails to give correct inference.
The same failure occurs in two-stage sampling designs.
Adopting the paper's variance methods restores trustworthy inference in these common but complex settings.
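
To make "robust variance methods" concrete, here is a sketch of a regression-adjusted estimate with a heteroskedasticity-robust (Eicker-Huber-White, HC0) standard error for a two-arm test. It is the standard textbook sandwich formula applied to a regression with a centered covariate and a treatment interaction, not the tailored estimator the paper derives, and it does not cover the multi-arm or two-stage cases. The function name `lin_regression_hc` and the simulated coefficients are assumptions.

```python
import numpy as np

def lin_regression_hc(y, t, x):
    """OLS of y on [1, t, xc, t*xc] with xc = x - mean(x).

    Returns (tau_hat, se_hc0): the coefficient on t and its HC0 standard error.
    """
    y, t, x = (np.asarray(a, dtype=float) for a in (y, t, x))
    xc = x - x.mean()                                   # center the covariate
    X = np.column_stack([np.ones_like(y), t, xc, t * xc])
    XtX_inv = np.linalg.inv(X.T @ X)
    beta = XtX_inv @ X.T @ y
    resid = y - X @ beta
    # Sandwich: (X'X)^-1 [X' diag(e^2) X] (X'X)^-1
    meat = (X * resid[:, None] ** 2).T @ X
    cov = XtX_inv @ meat @ XtX_inv
    return beta[1], np.sqrt(cov[1, 1])

# Illustrative data with unequal noise across arms (assumed, not from the paper).
rng = np.random.default_rng(1)
n = 10_000
x = rng.normal(size=n)
t = rng.integers(0, 2, size=n)
noise = rng.normal(size=n) * (1 + t)                    # treated arm is noisier
y = 2.0 + 0.5 * x + 0.3 * t + 0.4 * t * x + noise

tau_hat, se = lin_regression_hc(y, t, x)
print(f"tau_hat={tau_hat:.3f} se={se:.3f}")
```

Centering the covariate before forming the interaction is what lets the coefficient on `t` be read as an average effect rather than the effect at x = 0.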

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Experimentation platforms that already use CUPED should replace their default variance calculations in multi-arm and two-stage workflows.
The same variance-bias pattern may appear in other pre-experiment adjustment techniques once the design moves beyond simple two-arm randomization.
Decision rules that rely on p-values or confidence intervals will need recalibration when these robust estimators are introduced.
Further analytic work could derive analogous variance corrections for three-stage or network-based sampling schemes.

Load-bearing premise

Pre-experiment data stays sufficiently correlated with the outcome and free of post-randomization contamination even when the design involves multiple arms or two-stage sampling.

What would settle it

A simulation or live multi-arm experiment in which the empirical coverage rate of nominal 95 percent confidence intervals computed from the standard variance formula falls well below or above 95 percent.
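
A coverage check of this kind could be set up roughly as follows. The sketch is only a measurement harness: it simulates a three-arm experiment, applies CUPED with a pooled theta, builds nominal 95 percent intervals from per-arm sample variances of the adjusted outcome, and counts how often they contain the true effect. The data-generating process (homoskedastic noise, equal allocation, coefficient 0.7) is a benign assumption chosen for brevity and is not designed to reproduce the failure the paper reports; the function name `coverage_of_naive_ci` is hypothetical.

```python
import numpy as np

def coverage_of_naive_ci(n=3000, reps=300, effects=(0.0, 0.1, 0.2), seed=0):
    """Empirical coverage of naive 95% CIs for each treatment arm vs control."""
    rng = np.random.default_rng(seed)
    effects = np.asarray(effects, dtype=float)
    k = effects.size
    hits = np.zeros(k - 1)
    for _ in range(reps):
        x = rng.normal(size=n)                     # pre-experiment covariate
        arm = rng.integers(0, k, size=n)           # uniform assignment to k arms
        y = 0.7 * x + effects[arm] + rng.normal(size=n)
        # Pooled CUPED coefficient across all arms (assumption)
        theta = np.cov(y, x, ddof=1)[0, 1] / np.var(x, ddof=1)
        y_adj = y - theta * (x - x.mean())
        y0 = y_adj[arm == 0]
        for j in range(1, k):
            yj = y_adj[arm == j]
            est = yj.mean() - y0.mean()
            # "Naive" SE: two-sample formula applied to adjusted outcomes
            se = np.sqrt(yj.var(ddof=1) / yj.size + y0.var(ddof=1) / y0.size)
            hits[j - 1] += abs(est - (effects[j] - effects[0])) <= 1.96 * se
    return hits / reps

rates = coverage_of_naive_ci()
print(rates)  # one coverage rate per treatment arm
```

To probe the paper's claim, one would swap in a data-generating process closer to production conditions (unequal allocation, heterogeneous effects, arm-specific variances) and compare the naive interval against the corrected one.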

Figures

Figures reproduced from arXiv: 2606.18750 by Bokui Wan, Jinyong Ma, Yifan Guo, Yongli Qin, Yu Zhang.

Figure 2: Variance reduction results of τ̂_1 and (τ̂_2)_corrected relative to the standard difference-in-means estimator across five real-world experiments. Finally, we utilize real-world data from ByteDance’s experimentation platform to evaluate the performance of τ̂_1 and (τ̂_2)_corrected. All displayed experiments are derived from the platform’s core business metrics, encompassing GMV and user feedback such as likes,…

Original abstract

A/B testing has become the gold standard for data-driven decision-making in large-scale online experimentation, providing critical guidance for feature launch, pricing optimization, and user experience enhancement. To maximize statistical sensitivity, many technology companies routinely employ Controlled-experiment Using Pre-Experiment Data (CUPED), a technique that achieves substantial variance reduction while preserving the unbiasedness of estimating the average treatment effect. Despite its widespread adoption, several critical methodological and practical nuances of CUPED remain underexplored. This paper systematically addresses five frequently encountered yet overlooked questions regarding the application of CUPED. First, we provide a comparative analysis of various post-CUPED estimators to identify the optimal adjustment specification. Second, we evaluate the validity of regression-based adjustments and delineate robust variance estimation methods tailored for such frameworks. Finally, we extend our investigation to complex but common scenarios, including multi-arm experiments and two-stage sampling designs. Our findings reveal that in these settings, naive reliance on standard variance estimators can lead to severely misleading inferences. By offering rigorous theoretical insights and extensive experimental validation, this work deepens the conceptual understanding of CUPED. Notably, the recommended methodologies have been successfully deployed and integrated into ByteDance's experimentation platform.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper gives practical pointers on CUPED variance in multi-arm and two-stage designs but the key claim about misleading standard estimators rests on derivations we cannot check from the abstract alone.

The main takeaway is that this work walks through five concrete questions on CUPED use, compares post-adjustment estimators, and extends the method to multi-arm experiments and two-stage sampling. It flags that standard variance formulas can produce bad inferences in those settings and backs the point with theory, simulations, and a claim of deployment at ByteDance.

What stands out is the focus on real operational questions that experimenters actually hit, plus the direct comparison of adjustment specs. Extending CUPED to those designs is the clearest addition over prior work, and if the variance corrections hold, they could reduce errors in large-scale platforms.

The soft spot is the central claim on misleading inferences. The abstract asserts rigorous theory, yet the stress-test note correctly flags that we need to see whether the multi-arm variance expansion includes the cross-arm covariance blocks from shared pre-experiment covariates and whether the two-stage case properly handles sampling weights. Without those steps shown, the size of the discrepancy stays unclear and the word "severely" is hard to judge. The pre-experiment data assumptions also look standard but could bite in practice.

This is aimed at applied statisticians and experimenters at tech companies who already use CUPED and want tighter variance handling. A reader running thousands of tests per year would find the comparisons and extensions useful even if the math needs tightening.

I would send it to peer review. The topic matters and the claims are checkable with the full derivations and code.

Referee Report

2 major / 2 minor

Summary. The manuscript addresses five key questions on CUPED for online A/B testing. It compares post-CUPED estimators to identify optimal adjustment specifications, evaluates the validity of regression-based adjustments along with tailored robust variance methods, and extends the analysis to multi-arm experiments and two-stage sampling designs. The central claim is that naive reliance on standard variance estimators in these complex settings produces severely misleading inferences, supported by theoretical insights and extensive experimental validation; the recommended methods have been deployed at ByteDance.

Significance. If the variance derivations and experimental results hold, the work would strengthen reliable inference in industry-scale A/B testing by clarifying CUPED behavior under multi-arm and two-stage structures, directly benefiting platforms that already use CUPED for variance reduction.

major comments (2)

[Multi-arm experiments section] The claim that standard variance estimators produce severely misleading inferences after CUPED requires an explicit expansion (via the sandwich or delta method) of the variance formula that includes the cross-arm Cov(Ŷ_pre, Ŷ_post) blocks induced by shared pre-experiment covariates; without this derivation the magnitude of the discrepancy is unquantified and the central claim on misleading inferences rests on an unshown term.
[Two-stage sampling designs section] The paper must demonstrate how first-stage sampling probabilities alter the effective regression weights and induce bias in the usual variance estimator; the current treatment appears to omit these terms, which is load-bearing for the assertion that naive estimators are severely misleading in this design.
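
For readers unfamiliar with why a two-stage design can upset a variance formula that treats units as independent draws, the following generic sketch may help. It samples clusters first and units within clusters second, then compares a naive standard error of the mean with one computed from cluster means. It assumes equal cluster sizes and equal-probability sampling, so the sampling weights the referee asks about do not appear, and it is not the paper's design or estimator. All names and parameter values are hypothetical.

```python
import numpy as np

def two_stage_sample(rng, n_clusters=200, m=50, cluster_sd=0.5, unit_sd=1.0):
    """Stage 1: draw cluster-level means. Stage 2: draw m units per cluster."""
    cluster_means = rng.normal(scale=cluster_sd, size=n_clusters)
    return cluster_means[:, None] + rng.normal(scale=unit_sd, size=(n_clusters, m))

def naive_and_cluster_se(sample):
    """Return (naive SE treating all units as independent, SE from cluster means)."""
    naive = sample.std(ddof=1) / np.sqrt(sample.size)
    cluster_means = sample.mean(axis=1)
    clustered = cluster_means.std(ddof=1) / np.sqrt(cluster_means.size)
    return naive, clustered

rng = np.random.default_rng(0)
sample = two_stage_sample(rng)
naive_se, cluster_se = naive_and_cluster_se(sample)
print(f"naive SE={naive_se:.4f} cluster-level SE={cluster_se:.4f}")
```

With between-cluster variation present, the naive figure understates the uncertainty because units in the same cluster share a common component; the paper's two-stage analysis would additionally need to account for unequal first-stage probabilities and the CUPED adjustment.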

minor comments (2)

The abstract refers to 'five key questions' but enumerates only three main areas; listing the five questions explicitly (perhaps in an introductory table or enumerated list) would improve readability.
Clarify whether all reported variance formulas are analytically derived or include any fitted components, consistent with the soundness assessment.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major comment below and will revise the manuscript accordingly to strengthen the theoretical derivations.

Referee: [Multi-arm experiments section] the claim that standard variance estimators produce severely misleading inferences after CUPED requires an explicit expansion (via sandwich or delta-method) of the variance formula that includes the cross-arm Cov(Ŷ_pre, Ŷ_post) blocks induced by shared pre-experiment covariates; without this derivation the magnitude of the discrepancy is unquantified and the central claim on misleading inferences rests on an unshown term.

Authors: We agree that an explicit sandwich or delta-method expansion including the cross-arm Cov(Ŷ_pre, Ŷ_post) terms is needed to fully quantify the discrepancy. In the revision we will add this derivation in the multi-arm section (and appendix) to make the magnitude of the bias in naive estimators transparent.

revision: yes

Referee: [Two-stage sampling designs section] the paper must demonstrate how first-stage sampling probabilities alter the effective regression weights and induce bias in the usual variance estimator; the current treatment appears to omit these terms, which is load-bearing for the assertion that naive estimators are severely misleading in this design.

Authors: We acknowledge that the two-stage section should explicitly derive how first-stage sampling probabilities modify the regression weights and bias the naive variance estimator. We will expand the section with the relevant weighted formulas and bias terms in the revision.

revision: yes

Circularity Check

0 steps flagged

No circularity: methodological comparisons and extensions remain independent of fitted inputs or self-citations

The paper's core contributions consist of comparative analysis of post-CUPED estimators, evaluation of regression adjustments with robust variance methods, and extensions to multi-arm/two-stage designs. These rest on standard statistical theory for variance estimation and bias analysis rather than any self-referential derivation. No equations reduce a prediction to a fitted parameter by construction, no uniqueness theorems are imported from prior self-work, and no ansatz is smuggled via citation. The abstract and described structure indicate external validation through experiments and deployment, keeping the derivation chain self-contained against benchmarks outside the paper's own fits.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities are stated.
