Multi-Experiment Analysis

Reza Hosseini

arxiv: 2604.16671 · v1 · submitted 2026-04-17 · 📊 stat.ME

Pith reviewed 2026-05-10 07:14 UTC · model grok-4.3

keywords multi-experiment analysis · overlapping experiments · online controlled experiments · joint estimation · treatment effects · A/B testing · conditional effects

The pith

Multi-Experiment Analysis produces corrected individual, combined, and conditional treatment effects from overlapping experiments.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Online controlled experiments often overlap on the same users, which can bias results and make it hard to understand the impact of feature combinations. The paper proposes Multi-Experiment Analysis as a way to jointly estimate effects even with partial or full overlaps and multiple variants. This yields three useful quantities: corrected effects for single experiments that adjust for overlaps, effects from launching any mix of variants, and effects of one variant conditional on others being present or absent. The method avoids the need for special traffic allocation or pre-designed factorial experiments. Simulations and a production deployment at scale support its consistency and practical value.
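One way to pin down the three quantities, for the simplest case of two experiments with one treatment arm each, is in potential-outcomes notation. The symbols below ($\mu$, $\tau$) are assumptions of this sketch and are not taken from the paper, whose own notation is not shown in the reviewed text.

```latex
\begin{align*}
\mu(a_1, a_2) &= \mathbb{E}\bigl[\,Y(a_1, a_2)\,\bigr], \qquad a_1, a_2 \in \{0, 1\} \\
\tau_{\mathrm{comb}} &= \mu(1, 1) - \mu(0, 0) \\
\tau_{1 \mid a_2} &= \mu(1, a_2) - \mu(0, a_2) \\
\tau_{1 \mid 1} - \tau_{1 \mid 0} &= \mu(1, 1) - \mu(1, 0) - \mu(0, 1) + \mu(0, 0)
\end{align*}
```

Here $\tau_{\mathrm{comb}}$ is the effect of launching both variants, and $\tau_{1 \mid a_2}$ is the effect of experiment 1 with experiment 2 fixed at arm $a_2$. A corrected individual effect would then correspond to a particular choice of $a_2$ (for example $a_2 = 0$ if the other experiment is not launched); which convention the paper adopts is not stated in the reviewed text. The last line is the interaction: when it is nonzero, the conditional effects differ and no single "individual effect" describes every launch scenario.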

Core claim

The core discovery is that a joint model incorporating the observed structure of experiment overlaps can deliver consistent estimates of individual treatment effects adjusted for concurrent tests, the overall effect of any desired combination of variants from different experiments, and the effect of a given variant conditional on the status of variants in other experiments, all without requiring factorial experimental designs or restrictions on how traffic is shared.
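To make the three estimate types concrete, here is a minimal sketch for two binary experiments with partial overlap. It uses plain cell means, which is a stand-in and not the MEA estimator (the paper's joint model is not given in the reviewed text), and it only works because every arm combination is observed. The function names, coefficients, and the 60% overlap rate are hypothetical, and reading the "corrected individual effect" as "the other experiment held at control" is an assumption.

```python
import numpy as np

def simulate(n, rng, b1=1.0, b2=0.5, b12=-0.3, overlap=0.6):
    """Two experiments on shared users; only a fraction `overlap` enters experiment 2."""
    a1 = rng.integers(0, 2, n)                    # arm of experiment 1 (0 = control)
    in2 = rng.random(n) < overlap                 # enrolled in experiment 2?
    a2 = np.where(in2, rng.integers(0, 2, n), 0)  # non-enrolled users see control
    y = 2.0 + b1 * a1 + b2 * a2 + b12 * a1 * a2 + rng.normal(0.0, 1.0, n)
    return a1, a2, y

def cell_means(a1, a2, y):
    """Mean outcome in each (a1, a2) cell of the 2x2 layout."""
    return {(i, j): y[(a1 == i) & (a2 == j)].mean() for i in (0, 1) for j in (0, 1)}

def three_estimates(mu):
    return {
        # assumed reading of "corrected individual": other experiment at control
        "individual_exp1_other_at_control": mu[1, 0] - mu[0, 0],
        "combined_both_launched": mu[1, 1] - mu[0, 0],
        "exp1_given_exp2_launched": mu[1, 1] - mu[0, 1],
    }

rng = np.random.default_rng(0)
a1, a2, y = simulate(200_000, rng)
est = three_estimates(cell_means(a1, a2, y))

# Naive comparison that ignores experiment 2: a traffic-weighted blend of the
# two conditional effects, which matches no single launch scenario.
naive = y[a1 == 1].mean() - y[a1 == 0].mean()

for name, value in est.items():
    print(f"{name}: {value:.3f}")
print(f"naive_exp1: {naive:.3f}")
```

With these hypothetical coefficients the true values are 1.0, 1.2, and 0.7 respectively, while the naive difference targets 1.0 - 0.3 × 0.3 = 0.91, since 30% of users receive the second treatment.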

What carries the argument

The joint estimation model that uses the fully observed overlap structure to recover the three types of unbiased effect estimates.

Load-bearing premise

The overlaps between experiments must be fully observed and the joint model must be able to recover unbiased estimates without additional unstated restrictions on the response surface or assignment mechanism.

What would settle it

If simulations with known true effects show that the MEA estimates are biased or their confidence intervals do not achieve the nominal coverage rate, that would indicate the method does not produce consistent estimates.
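The check described above can be sketched as a small Monte Carlo study: simulate data with a known effect, compute an estimate and a 95% interval in each replication, then report average bias and the fraction of intervals that contain the truth. The estimator here is a simple difference of cell means with a normal-approximation interval, used only as a placeholder for MEA; the coefficients, sample sizes, and function names are hypothetical.

```python
import numpy as np

B1, B2, B12 = 1.0, 0.5, -0.3          # hypothetical true coefficients
TRUTH = B1 + B2 + B12                 # true effect of launching both variants

def one_replication(n, rng):
    """Simulate one fully overlapping pair of experiments; return (estimate, std. error)."""
    a1 = rng.integers(0, 2, n)
    a2 = rng.integers(0, 2, n)
    y = B1 * a1 + B2 * a2 + B12 * a1 * a2 + rng.normal(0.0, 1.0, n)
    both = y[(a1 == 1) & (a2 == 1)]
    neither = y[(a1 == 0) & (a2 == 0)]
    estimate = both.mean() - neither.mean()
    std_err = np.sqrt(both.var(ddof=1) / both.size + neither.var(ddof=1) / neither.size)
    return estimate, std_err

def bias_and_coverage(n_reps=1000, n=2000, seed=0, z=1.96):
    """Average bias and empirical coverage of the z-interval over n_reps replications."""
    rng = np.random.default_rng(seed)
    draws = np.array([one_replication(n, rng) for _ in range(n_reps)])
    estimates, std_errs = draws[:, 0], draws[:, 1]
    covered = np.abs(estimates - TRUTH) <= z * std_errs
    return estimates.mean() - TRUTH, covered.mean()

bias, coverage = bias_and_coverage()
print(f"bias={bias:+.4f}  coverage={coverage:.3f}  (nominal 0.95)")
```

A persistent nonzero bias, or coverage well below 0.95 as the number of replications grows, would be the kind of evidence the paragraph above has in mind. Note that this only tests the estimator against the data-generating process chosen by the simulator.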

Figures

Figures reproduced from arXiv: 2604.16671 by Reza Hosseini.

**Figure 3.** Production violation (FLAG): two notification exper…

**Figure 4.** Combination analysis: checkmarks indicate rele…

**Figure 5.** Conditional analysis: only cells matching the fixed…

**Figure 6.** MEA system architecture showing the hybrid ap…

**Figure 7.** Causal graph for k = 2. The assumption is equivalent to the absence of the dashed red edges A_j → S_i (j ≠ i): no experiment’s arm assignment shifts another experiment’s trigger rate. Node labels: A1, A2, S1, S2, U, Y.

**Figure 8.** Convergence of simulation-based delta estimates…

**Figure 9.** Distribution of 1000 MEA estimates with one ex…

**Figure 10.** MEA estimate convergence as sample size in…
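The Figure 7 caption describes the assumption as no experiment's arm assignment shifting another experiment's trigger rate. One simple diagnostic in that spirit, offered as an assumption of this sketch and not as the paper's procedure, is a two-proportion z-test comparing the trigger rate of experiment i across the arms of experiment j. The function name and the simulated rates are hypothetical.

```python
import math
import numpy as np

def trigger_rate_z_test(arm_j, trig_i):
    """Two-proportion z-test: does experiment i's trigger rate differ by arm of experiment j?"""
    arm_j = np.asarray(arm_j)
    trig_i = np.asarray(trig_i, dtype=float)
    n1, n0 = (arm_j == 1).sum(), (arm_j == 0).sum()
    p1, p0 = trig_i[arm_j == 1].mean(), trig_i[arm_j == 0].mean()
    pooled = trig_i.mean()                       # pooled rate under the null
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n0))
    z = (p1 - p0) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))   # two-sided normal p-value
    return z, p_value

rng = np.random.default_rng(0)
n = 50_000
arm_2 = rng.integers(0, 2, n)                                # arm of experiment 2
s1_ok = rng.random(n) < 0.30                                 # S1 ignores A2
s1_bad = rng.random(n) < np.where(arm_2 == 1, 0.35, 0.30)    # A2 shifts S1 (violation)

z_ok, p_ok = trigger_rate_z_test(arm_2, s1_ok)
z_bad, p_bad = trigger_rate_z_test(arm_2, s1_bad)
print(f"invariance holds:    z={z_ok:+.2f}, p={p_ok:.3f}")
print(f"invariance violated: z={z_bad:+.2f}, p={p_bad:.2e}")
```

In practice such a test would be run for every ordered pair of experiments, so some multiple-testing correction would be needed; a non-significant result also does not prove the edge is absent.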

read the original abstract

Online controlled experiments face growing challenges from overlapping tests on shared traffic, where interactions between concurrent experiments obscure insights into feature combinations and produce effect estimates that do not correspond to any actionable launch scenario. While traffic splitting, layering, and sequential execution (non-concurrent) mitigate some of these issues, they require coordination overhead and can reduce experimentation velocity. We propose Multi-Experiment Analysis (MEA), a methodology for consistent joint estimation in the presence of arbitrary partial or full overlaps and multiple variants. MEA produces three types of estimates: (1) corrected individual treatment effects that account for the presence of overlapping experiments, (2) combined effects of launching any desired combination of variants across experiments, and (3) conditional effects of an experiment's variant given that specific variants of other experiments are launched or deramped -- all without requiring factorial pre-design or traffic restrictions. We validate the approach through comprehensive simulations confirming consistency and correct coverage. We report on production deployment at scale, illustrate the methodology through real-world use cases, and share practical lessons learned -- including system design, adoption patterns, and insights from production use.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

MEA describes a joint approach to overlapping experiments that produces corrected, combined, and conditional estimates, but combined and conditional results for unseen variant mixes rest on modeling assumptions that the abstract does not detail or robustness-test.

MEA gives a way to analyze multiple overlapping experiments jointly and produce corrected individual effects, combined effects for any variant mix, and conditional effects given other variants. That's the core offering. It does a solid job describing the operational pain of overlaps and how layering or sequential tests fall short. The three estimate types are clearly motivated by what teams actually need when deciding launches. Reporting on production use and lessons learned adds practical weight, and the simulations are presented as confirming consistency and coverage.

The main concern is that getting unbiased estimates for combinations never seen together requires some structure on the response or assignment process. The abstract claims the method works for arbitrary overlaps, but without the equations or the exact joint model, it's hard to see what restrictions are imposed. Simulations can look good if they use the same model as the estimator, so they don't rule out bias from misspecification in real data. The stress-test point about unobserved combinations needing an implicit model seems to hold based on what's shown.

This is aimed at experimenters and platform builders at companies with heavy overlapping test traffic. Someone looking for a deployable approach to increase testing velocity without coordination overhead would find the use cases helpful. It deserves a serious referee because the problem is common and the claims are specific enough to check against the full details and assumptions. I would send it for peer review to get feedback on the modeling assumptions and validation.
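The letter's concern about combinations never seen together can be illustrated with a toy case. In the sketch below no user ever receives both treatments, an additive least-squares model is fitted, and the combined effect is extrapolated as the sum of the two coefficients. The additive model is chosen purely for illustration and is not claimed to be MEA's model; the coefficients and function name are hypothetical.

```python
import numpy as np

def additive_combined_estimate(n, b12, rng, b1=1.0, b2=0.5):
    """Fit an additive model when the (1, 1) cell is never observed; extrapolate to it."""
    cell = rng.integers(0, 3, n)     # 0: both control, 1: only exp 1, 2: only exp 2
    a1 = (cell == 1).astype(float)
    a2 = (cell == 2).astype(float)
    # The interaction term is always zero in the observed data, whatever b12 is.
    y = b1 * a1 + b2 * a2 + b12 * a1 * a2 + rng.normal(0.0, 1.0, n)
    X = np.column_stack([np.ones(n), a1, a2])   # additive design, no interaction column
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef[1] + coef[2]                    # extrapolated effect of launching both

rng = np.random.default_rng(0)
est_none = additive_combined_estimate(100_000, 0.0, rng)   # true combined effect 1.5
est_neg = additive_combined_estimate(100_000, -0.3, rng)   # true combined effect 1.2
print(f"no interaction:   estimate={est_none:.3f}, truth=1.500")
print(f"with interaction: estimate={est_neg:.3f}, truth=1.200")
```

Both runs target about 1.5 because the observed data carry no information about the interaction. A simulation whose data-generating process is additive would therefore report consistency and good coverage, while the same estimator would be off by the full interaction (0.3 here) if one exists in production traffic.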

Referee Report

2 major / 1 minor

Summary. The manuscript introduces Multi-Experiment Analysis (MEA), a methodology for joint estimation of treatment effects from online controlled experiments with arbitrary partial or full overlaps. It claims to deliver three types of estimates—corrected individual treatment effects accounting for overlaps, combined effects for any desired combination of variants, and conditional effects given specific variants in other experiments—without requiring factorial pre-design or traffic restrictions. The approach is validated via simulations showing consistency and correct coverage, with additional illustration through production deployment at scale and practical lessons.

Significance. If the central claims hold, MEA would represent a meaningful advance in online experimentation methodology by enabling flexible analysis of concurrent tests on shared traffic, potentially increasing experimentation velocity while producing actionable estimates. The reported production deployment and lessons learned provide concrete evidence of practical utility beyond theoretical claims. However, the absence of explicit model equations or assumptions in the provided abstract limits assessment of whether the method achieves unbiasedness for unobserved combinations under realistic conditions.

major comments (2)

[Abstract] Abstract: the claim that combined and conditional effects can be recovered unbiasedly for variant combinations that never co-occur in the data is load-bearing for the central contribution, yet no joint model, response-surface assumptions (additivity, interaction order, or parametric form), or identification strategy is stated; simulations are reported to confirm consistency, but this only verifies behavior under the (unstated) data-generating process and does not address bias under plausible misspecification.
[Methods] The weakest assumption noted—that overlaps are fully observed and a joint model recovers unbiased estimates without further restrictions—directly affects the validity of estimates (2) and (3); the manuscript must explicitly define the model (likely in the methods section) and demonstrate that identification does not rely on untestable restrictions that would be violated in production traffic.

minor comments (1)

[Abstract] The abstract would be strengthened by a single sentence summarizing the key modeling assumptions or functional form used for the joint estimation.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed comments, which highlight important opportunities to improve the clarity and rigor of our presentation of Multi-Experiment Analysis (MEA). We agree that the abstract and methods sections would benefit from more explicit statements of the joint model, assumptions, and identification strategy. We address each major comment below and commit to revisions that strengthen these aspects without altering the core claims supported by our simulations and production deployment.

Referee: [Abstract] Abstract: the claim that combined and conditional effects can be recovered unbiasedly for variant combinations that never co-occur in the data is load-bearing for the central contribution, yet no joint model, response-surface assumptions (additivity, interaction order, or parametric form), or identification strategy is stated; simulations are reported to confirm consistency, but this only verifies behavior under the (unstated) data-generating process and does not address bias under plausible misspecification.

Authors: We agree that the abstract should more explicitly reference the modeling framework that supports recovery of effects for unobserved combinations. The full manuscript (Section 3) defines the joint model used by MEA, which employs a parametric structure allowing consistent estimation and extrapolation under the stated assumptions. Simulations in Section 4 confirm consistency and coverage when data are generated from this model. To address the concern about misspecification, we will revise the abstract to briefly note the key assumptions and identification approach, and we will add a dedicated discussion subsection on robustness to plausible departures from the assumed response surface, including additional simulation results under misspecification.

Revision: yes

Referee: [Methods] The weakest assumption noted—that overlaps are fully observed and a joint model recovers unbiased estimates without further restrictions—directly affects the validity of estimates (2) and (3); the manuscript must explicitly define the model (likely in the methods section) and demonstrate that identification does not rely on untestable restrictions that would be violated in production traffic.

Authors: We acknowledge that explicit definition of the model and assumptions is required for full assessment of estimates (2) and (3). In the revised manuscript we will expand the Methods section to present the complete joint model equations, enumerate all assumptions (including full observability of overlaps and the parametric form enabling recovery for non-co-occurring combinations), and provide a clear identification argument. We will also incorporate a discussion, informed by our production deployment, of how these assumptions align with real traffic conditions and any practical safeguards or diagnostics used to monitor potential violations.

Revision: yes

Circularity Check

0 steps flagged

No circularity: derivation is self-contained with external validation

The abstract describes a proposed methodology (MEA) for joint estimation under overlaps, with three types of estimates produced from observed data. No equations, fitted parameters, or derivation chain are shown that would reduce a claimed prediction to a self-defined input or self-citation. Validation is stated to come from comprehensive simulations confirming consistency and coverage, plus production deployment; these are independent checks rather than tautological. No self-citation load-bearing steps, uniqueness theorems, or ansatz smuggling appear in the provided text. The central claim therefore rests on the model's ability to recover effects from partial overlaps, which is presented as a modeling choice rather than a definitional reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no explicit free parameters, axioms, or invented entities; the method is presented as a statistical procedure whose internal assumptions are not enumerated.

