Multi-Experiment Analysis
Pith reviewed 2026-05-10 07:14 UTC · model grok-4.3
The pith
Multi-Experiment Analysis produces corrected individual, combined, and conditional treatment effects from overlapping experiments.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The core claim is that a joint model incorporating the observed structure of experiment overlaps can deliver consistent estimates of three quantities: (1) individual treatment effects adjusted for concurrent tests, (2) the overall effect of any desired combination of variants from different experiments, and (3) the effect of a given variant conditional on the status of variants in other experiments. It does so without requiring factorial experimental designs or restrictions on how traffic is shared.
What carries the argument
The joint estimation model that uses the fully observed overlap structure to recover the three types of unbiased effect estimates.
Load-bearing premise
The overlaps between experiments must be fully observed and the joint model must be able to recover unbiased estimates without additional unstated restrictions on the response surface or assignment mechanism.
What would settle it
If simulations with known true effects show that the MEA estimates are biased or their confidence intervals do not achieve the nominal coverage rate, that would indicate the method does not produce consistent estimates.
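A check of this kind could be set up roughly as follows. This is a minimal sketch, not the paper's simulation design or the MEA estimator. It assumes two fully overlapping experiments with independent 50/50 assignment, a saturated linear response surface with a single interaction term, Gaussian noise, and classical OLS standard errors. The function names and all parameter values are hypothetical.

```python
import numpy as np

def simulate_once(rng, n=2000, beta=(1.0, 0.5, 0.3, 0.2), sigma=1.0):
    """Draw one synthetic dataset for two fully overlapping experiments A and B."""
    a = rng.integers(0, 2, size=n)  # independent 50/50 assignment for A
    b = rng.integers(0, 2, size=n)  # independent 50/50 assignment for B
    # Columns: intercept, A, B, A*B (assumed saturated response surface).
    X = np.column_stack([np.ones(n), a, b, a * b]).astype(float)
    y = X @ np.asarray(beta) + rng.normal(0.0, sigma, size=n)
    return X, y

def ols_with_se(X, y):
    """Return OLS coefficients and classical (homoskedastic) standard errors."""
    xtx_inv = np.linalg.inv(X.T @ X)
    coef = xtx_inv @ X.T @ y
    resid = y - X @ coef
    dof = X.shape[0] - X.shape[1]
    sigma2 = resid @ resid / dof
    se = np.sqrt(sigma2 * np.diag(xtx_inv))
    return coef, se

def coverage_study(n_reps=500, seed=0, beta=(1.0, 0.5, 0.3, 0.2)):
    """Estimate bias and 95% interval coverage for each coefficient."""
    rng = np.random.default_rng(seed)
    truth = np.asarray(beta)
    z = 1.96  # normal approximation for a 95% interval
    estimates = np.empty((n_reps, truth.size))
    covered = np.empty((n_reps, truth.size), dtype=bool)
    for r in range(n_reps):
        X, y = simulate_once(rng, beta=beta)
        coef, se = ols_with_se(X, y)
        estimates[r] = coef
        covered[r] = np.abs(coef - truth) <= z * se
    bias = estimates.mean(axis=0) - truth
    return bias, covered.mean(axis=0)

bias, coverage = coverage_study()
print("bias per coefficient:    ", np.round(bias, 3))
print("coverage per coefficient:", np.round(coverage, 3))
```

Bias far from zero, or coverage far from 0.95, would be the failure signal described above. Passing this check only shows good behavior when the fitted model matches the data-generating process.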
Original abstract
Online controlled experiments face growing challenges from overlapping tests on shared traffic, where interactions between concurrent experiments obscure insights into feature combinations and produce effect estimates that do not correspond to any actionable launch scenario. While traffic splitting, layering, and sequential execution (non-concurrent) mitigate some of these issues, they require coordination overhead and can reduce experimentation velocity. We propose Multi-Experiment Analysis (MEA), a methodology for consistent joint estimation in the presence of arbitrary partial or full overlaps and multiple variants. MEA produces three types of estimates: (1) corrected individual treatment effects that account for the presence of overlapping experiments, (2) combined effects of launching any desired combination of variants across experiments, and (3) conditional effects of an experiment's variant given that specific variants of other experiments are launched or deramped -- all without requiring factorial pre-design or traffic restrictions. We validate the approach through comprehensive simulations confirming consistency and correct coverage. We report on production deployment at scale, illustrate the methodology through real-world use cases, and share practical lessons learned -- including system design, adoption patterns, and insights from production use.
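To make the estimate types in the abstract concrete, the sketch below works through a toy case with two experiments, each with one treatment variant. The response surface (a linear model with a single interaction term), the class name `TwoExperimentModel`, its methods, and the coefficient values are illustrative assumptions; the abstract does not state the paper's model. The mapping of these quantities onto MEA's three outputs is a reading of the abstract, not the paper's definition.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TwoExperimentModel:
    """Assumed mean response: E[y] = b0 + b_a*A + b_b*B + b_ab*A*B."""
    b0: float
    b_a: float
    b_b: float
    b_ab: float

    def mean(self, a: int, b: int) -> float:
        """Mean outcome when A's variant is a (0/1) and B's variant is b (0/1)."""
        return self.b0 + self.b_a * a + self.b_b * b + self.b_ab * a * b

    def conditional_effect_a(self, b: int) -> float:
        """Effect of launching A's variant given B's variant is on (1) or off (0)."""
        return self.mean(1, b) - self.mean(0, b)

    def combined_effect(self) -> float:
        """Effect of launching both variants versus launching neither."""
        return self.mean(1, 1) - self.mean(0, 0)

    def blended_effect_a(self, share_b: float) -> float:
        """Target of a naive A-only comparison when a fraction share_b of A's
        traffic is independently exposed to B's variant."""
        return ((1 - share_b) * self.conditional_effect_a(0)
                + share_b * self.conditional_effect_a(1))

# Hypothetical coefficients, chosen only for illustration.
model = TwoExperimentModel(b0=1.0, b_a=0.5, b_b=0.25, b_ab=-0.125)

effect_a_b_off = model.conditional_effect_a(0)  # 0.5
effect_a_b_on = model.conditional_effect_a(1)   # 0.375
effect_both = model.combined_effect()           # 0.625
effect_naive = model.blended_effect_a(0.5)      # 0.4375

print(effect_a_b_off, effect_a_b_on, effect_both, effect_naive)
```

With a nonzero interaction, the naive A-only number (0.4375 here) matches neither launch scenario for B. This is the kind of non-actionable estimate the abstract describes.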
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces Multi-Experiment Analysis (MEA), a methodology for joint estimation of treatment effects from online controlled experiments with arbitrary partial or full overlaps. It claims to deliver three types of estimates—corrected individual treatment effects accounting for overlaps, combined effects for any desired combination of variants, and conditional effects given specific variants in other experiments—without requiring factorial pre-design or traffic restrictions. The approach is validated via simulations showing consistency and correct coverage, with additional illustration through production deployment at scale and practical lessons.
Significance. If the central claims hold, MEA would represent a meaningful advance in online experimentation methodology by enabling flexible analysis of concurrent tests on shared traffic, potentially increasing experimentation velocity while producing actionable estimates. The reported production deployment and lessons learned provide concrete evidence of practical utility beyond theoretical claims. However, the absence of explicit model equations or assumptions in the provided abstract limits assessment of whether the method achieves unbiasedness for unobserved combinations under realistic conditions.
major comments (2)
- [Abstract] Abstract: the claim that combined and conditional effects can be recovered unbiasedly for variant combinations that never co-occur in the data is load-bearing for the central contribution, yet no joint model, response-surface assumptions (additivity, interaction order, or parametric form), or identification strategy is stated; simulations are reported to confirm consistency, but this only verifies behavior under the (unstated) data-generating process and does not address bias under plausible misspecification.
- [Methods] The weakest assumption noted—that overlaps are fully observed and a joint model recovers unbiased estimates without further restrictions—directly affects the validity of estimates (2) and (3); the manuscript must explicitly define the model (likely in the methods section) and demonstrate that identification does not rely on untestable restrictions that would be violated in production traffic.
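The identification concern in these comments can be illustrated with a small synthetic example. The sketch assumes a hypothetical traffic design in which the treated variants of two experiments never co-occur, and a true response surface with a nonzero interaction. Neither assumption comes from the paper, and the additive fit stands in for one possible parametric restriction, not for MEA itself. Variable names and values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 30000

# Hypothetical design: each unit is in exactly one of three cells, so the
# combination (A on, B on) is never observed.
cell = rng.integers(0, 3, size=n)   # 0: neither, 1: A only, 2: B only
a = (cell == 1).astype(float)
b = (cell == 2).astype(float)

# Assumed true response surface with an interaction term.
b0, b_a, b_b, b_ab = 1.0, 0.5, 0.3, -0.4
y = b0 + b_a * a + b_b * b + b_ab * a * b + rng.normal(0.0, 1.0, size=n)

# The saturated design has an all-zero interaction column, so the
# interaction coefficient cannot be identified from these data.
X_sat = np.column_stack([np.ones(n), a, b, a * b])
rank_saturated = np.linalg.matrix_rank(X_sat)

# An additive model is identified, but it extrapolates to the unseen
# combination by assuming the interaction is zero.
X_add = np.column_stack([np.ones(n), a, b])
coef_add, *_ = np.linalg.lstsq(X_add, y, rcond=None)

combined_additive = coef_add[1] + coef_add[2]
combined_true = b_a + b_b + b_ab
gap = combined_additive - combined_true

print("rank of saturated design:", rank_saturated, "of", X_sat.shape[1])
print("additive combined effect:", round(float(combined_additive), 3))
print("true combined effect:    ", round(combined_true, 3))
print("extrapolation gap:       ", round(float(gap), 3))
```

In this setup the gap is close to the negative of the omitted interaction, and collecting more data from the same design does not shrink it. This is why the referee asks for the model and its restrictions to be stated explicitly.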
minor comments (1)
- [Abstract] The abstract would be strengthened by a single sentence summarizing the key modeling assumptions or functional form used for the joint estimation.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed comments, which highlight important opportunities to improve the clarity and rigor of our presentation of Multi-Experiment Analysis (MEA). We agree that the abstract and methods sections would benefit from more explicit statements of the joint model, assumptions, and identification strategy. We address each major comment below and commit to revisions that strengthen these aspects without altering the core claims supported by our simulations and production deployment.
- Referee: [Abstract] Abstract: the claim that combined and conditional effects can be recovered unbiasedly for variant combinations that never co-occur in the data is load-bearing for the central contribution, yet no joint model, response-surface assumptions (additivity, interaction order, or parametric form), or identification strategy is stated; simulations are reported to confirm consistency, but this only verifies behavior under the (unstated) data-generating process and does not address bias under plausible misspecification.
  Authors: We agree that the abstract should more explicitly reference the modeling framework that supports recovery of effects for unobserved combinations. The full manuscript (Section 3) defines the joint model used by MEA, which employs a parametric structure allowing consistent estimation and extrapolation under the stated assumptions. Simulations in Section 4 confirm consistency and coverage when data are generated from this model. To address the concern about misspecification, we will revise the abstract to briefly note the key assumptions and identification approach, and we will add a dedicated discussion subsection on robustness to plausible departures from the assumed response surface, including additional simulation results under misspecification.
  Revision: yes
- Referee: [Methods] The weakest assumption noted, that overlaps are fully observed and a joint model recovers unbiased estimates without further restrictions, directly affects the validity of estimates (2) and (3); the manuscript must explicitly define the model (likely in the methods section) and demonstrate that identification does not rely on untestable restrictions that would be violated in production traffic.
  Authors: We acknowledge that explicit definition of the model and assumptions is required for full assessment of estimates (2) and (3). In the revised manuscript we will expand the Methods section to present the complete joint model equations, enumerate all assumptions (including full observability of overlaps and the parametric form enabling recovery for non-co-occurring combinations), and provide a clear identification argument. We will also incorporate a discussion, informed by our production deployment, of how these assumptions align with real traffic conditions and any practical safeguards or diagnostics used to monitor potential violations.
  Revision: yes
Circularity Check
No circularity: derivation is self-contained with external validation
The abstract describes a proposed methodology (MEA) for joint estimation under overlaps, with three types of estimates produced from observed data. No equations, fitted parameters, or derivation chain are shown that would reduce a claimed prediction to a self-defined input or self-citation. Validation is stated to come from comprehensive simulations confirming consistency and coverage, plus production deployment; these are independent checks rather than tautological. No self-citation load-bearing steps, uniqueness theorems, or ansatz smuggling appear in the provided text. The central claim therefore rests on the model's ability to recover effects from partial overlaps, which is presented as a modeling choice rather than a definitional reduction.
A Assumption Checking
MEA's correctness relies on a causal assumption, Arm-Trigger Invariance, which we state here, connect to a causa...