
arXiv: 2604.16671 · v1 · submitted 2026-04-17 · 📊 stat.ME

Multi-Experiment Analysis

Pith reviewed 2026-05-10 07:14 UTC · model grok-4.3

classification 📊 stat.ME
keywords multi-experiment analysis · overlapping experiments · online controlled experiments · joint estimation · treatment effects · A/B testing · conditional effects

The pith

Multi-Experiment Analysis produces corrected individual, combined, and conditional treatment effects from overlapping experiments.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Online controlled experiments often overlap on the same users, which can bias results and make it hard to understand the impact of feature combinations. The paper proposes Multi-Experiment Analysis as a way to jointly estimate effects even with partial or full overlaps and multiple variants. This yields three useful quantities: corrected effects for single experiments that adjust for overlaps, effects from launching any mix of variants, and effects of one variant conditional on others being present or absent. The method avoids the need for special traffic allocation or pre-designed factorial experiments. Simulations and a production deployment at scale support its consistency and practical value.

Core claim

The core claim is that a joint model incorporating the observed structure of experiment overlaps can deliver consistent estimates of three quantities: (1) individual treatment effects adjusted for concurrent tests, (2) the overall effect of any desired combination of variants from different experiments, and (3) the effect of a given variant conditional on the status of variants in other experiments. None of these requires a factorial experimental design or restrictions on how traffic is shared.
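The three quantities can be made concrete with a small simulation. The sketch below is not the paper's estimator (the abstract does not state the joint model). It assumes two fully overlapping two-arm experiments with independent 50/50 assignment and a saturated linear model with a single interaction term. The function names, the coefficient values, and the reading of the "corrected individual effect" as the effect with the other experiment held at control are all illustrative assumptions.

```python
import numpy as np

def simulate(n, b1=1.0, b2=0.5, b12=-0.3, seed=0):
    """Hypothetical data: two fully overlapping experiments, independent 50/50 arms."""
    rng = np.random.default_rng(seed)
    a1 = rng.integers(0, 2, n)          # arm of experiment 1 (0 = control)
    a2 = rng.integers(0, 2, n)          # arm of experiment 2 (0 = control)
    y = 2.0 + b1 * a1 + b2 * a2 + b12 * a1 * a2 + rng.normal(0, 1, n)
    return a1, a2, y

def fit_saturated(a1, a2, y):
    """Least-squares fit of y ~ 1 + a1 + a2 + a1:a2 (assumed model, not the paper's)."""
    X = np.column_stack([np.ones_like(y), a1, a2, a1 * a2])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef                          # [intercept, b1, b2, b12]

a1, a2, y = simulate(200_000)
_, b1_hat, b2_hat, b12_hat = fit_saturated(a1, a2, y)

# Naive per-experiment readout: averages over experiment 2's traffic mix.
naive = y[a1 == 1].mean() - y[a1 == 0].mean()

individual = b1_hat                      # variant 1 launched, experiment 2 at control
combined = b1_hat + b2_hat + b12_hat     # both variants launched together
conditional = b1_hat + b12_hat           # variant 1, given variant 2 is launched

print(f"naive={naive:.3f} individual={individual:.3f} "
      f"combined={combined:.3f} conditional={conditional:.3f}")
```

With a nonzero interaction, the naive difference in means for experiment 1 averages over experiment 2's 50/50 mix, so it matches neither launch scenario, while each of the three model-based quantities corresponds to one specific scenario.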

What carries the argument

The joint estimation model that uses the fully observed overlap structure to recover the three types of unbiased effect estimates.

Load-bearing premise

The overlaps between experiments must be fully observed and the joint model must be able to recover unbiased estimates without additional unstated restrictions on the response surface or assignment mechanism.

What would settle it

If simulations with known true effects show that the MEA estimates are biased or their confidence intervals do not achieve the nominal coverage rate, that would indicate the method does not produce consistent estimates.
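A check of this kind can be sketched as a small Monte Carlo loop. The example below is an assumption-laden stand-in, not the paper's simulation study: it uses a hypothetical data-generating process with two fully overlapping experiments, estimates the combined effect from the two relevant cell means, and counts how often a normal-approximation 95% interval covers the known truth. The function name and all parameter values are illustrative.

```python
import numpy as np

def coverage_check(reps=1000, n=2000, b1=1.0, b2=0.5, b12=-0.3, seed=1):
    """Return (bias, coverage) of a cell-mean estimate of the combined effect."""
    rng = np.random.default_rng(seed)
    truth = b1 + b2 + b12                # known combined effect in this simulation
    estimates, hits = [], 0
    for _ in range(reps):
        a1 = rng.integers(0, 2, n)
        a2 = rng.integers(0, 2, n)
        y = b1 * a1 + b2 * a2 + b12 * a1 * a2 + rng.normal(0, 1, n)
        both = y[(a1 == 1) & (a2 == 1)]  # users exposed to both variants
        none = y[(a1 == 0) & (a2 == 0)]  # users in both controls
        est = both.mean() - none.mean()
        se = np.sqrt(both.var(ddof=1) / both.size + none.var(ddof=1) / none.size)
        hits += int(est - 1.96 * se <= truth <= est + 1.96 * se)
        estimates.append(est)
    return float(np.mean(estimates) - truth), hits / reps

bias, coverage = coverage_check()
print(f"bias={bias:.4f} coverage={coverage:.3f}")
```

A bias clearly away from zero, or coverage well below 0.95, under a data-generating process the method claims to handle would be the kind of evidence described above.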

Figures

Figures reproduced from arXiv: 2604.16671 by Reza Hosseini.

Figure 1: L-shape partitioning showing triggering regions
Figure 3: Production violation (FLAG): two notification exper…
Figure 4: Combination analysis: checkmarks indicate rele…
Figure 5: Conditional analysis: only cells matching the fixed…
Figure 6: MEA system architecture showing the hybrid ap…
Figure 7: Causal graph for k = 2 over the nodes A_1, A_2, S_1, S_2, U, and Y. The assumption is equivalent to the absence of the dashed red edges A_j → S_i (j ≠ i): no experiment's arm assignment shifts another experiment's trigger rate.
Figure 8: Convergence of simulation-based delta estimates…
Figure 9: Distribution of 1000 MEA estimates with one ex…
Figure 10: MEA estimate convergence as sample size in…
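The Figure 7 caption states the assumption as: no experiment's arm assignment shifts another experiment's trigger rate. One simple way to probe this in data is to compare experiment i's trigger rate across the arms of experiment j. The sketch below uses a pooled two-proportion z-test for that purpose. It is a generic diagnostic under assumed binary trigger and arm indicators, not the paper's assumption check, and the function name and simulated rates are hypothetical.

```python
import math
import numpy as np

def trigger_rate_z(s_i, a_j):
    """Pooled two-proportion z-test: does experiment i's trigger rate differ by arm of j?"""
    s_i = np.asarray(s_i, dtype=float)   # 1 if the user triggered experiment i
    a_j = np.asarray(a_j)                # arm of experiment j (0 or 1)
    n1, n0 = int((a_j == 1).sum()), int((a_j == 0).sum())
    p1, p0 = s_i[a_j == 1].mean(), s_i[a_j == 0].mean()
    pooled = s_i.mean()
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n0))
    z = (p1 - p0) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))   # two-sided normal p-value
    return z, p_value

rng = np.random.default_rng(3)
n = 100_000
a2 = rng.integers(0, 2, n)

# Invariant case: experiment 1's trigger rate is 30% regardless of a2.
s1_ok = rng.random(n) < 0.30
# Violating case: experiment 2's treatment arm raises the trigger rate to 35%.
s1_bad = rng.random(n) < (0.30 + 0.05 * a2)

z_ok, p_ok = trigger_rate_z(s1_ok, a2)
z_bad, p_bad = trigger_rate_z(s1_bad, a2)
print(f"invariant: z={z_ok:.2f}  violated: z={z_bad:.2f}")
```

A large |z| flags a pair of experiments for closer inspection; a small one does not prove the assumption holds, since the test only has power against shifts in the marginal trigger rate.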
Original abstract

Online controlled experiments face growing challenges from overlapping tests on shared traffic, where interactions between concurrent experiments obscure insights into feature combinations and produce effect estimates that do not correspond to any actionable launch scenario. While traffic splitting, layering, and sequential execution (non-concurrent) mitigate some of these issues, they require coordination overhead and can reduce experimentation velocity. We propose Multi-Experiment Analysis (MEA), a methodology for consistent joint estimation in the presence of arbitrary partial or full overlaps and multiple variants. MEA produces three types of estimates: (1) corrected individual treatment effects that account for the presence of overlapping experiments, (2) combined effects of launching any desired combination of variants across experiments, and (3) conditional effects of an experiment's variant given that specific variants of other experiments are launched or deramped -- all without requiring factorial pre-design or traffic restrictions. We validate the approach through comprehensive simulations confirming consistency and correct coverage. We report on production deployment at scale, illustrate the methodology through real-world use cases, and share practical lessons learned -- including system design, adoption patterns, and insights from production use.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces Multi-Experiment Analysis (MEA), a methodology for joint estimation of treatment effects from online controlled experiments with arbitrary partial or full overlaps. It claims to deliver three types of estimates—corrected individual treatment effects accounting for overlaps, combined effects for any desired combination of variants, and conditional effects given specific variants in other experiments—without requiring factorial pre-design or traffic restrictions. The approach is validated via simulations showing consistency and correct coverage, with additional illustration through production deployment at scale and practical lessons.

Significance. If the central claims hold, MEA would represent a meaningful advance in online experimentation methodology by enabling flexible analysis of concurrent tests on shared traffic, potentially increasing experimentation velocity while producing actionable estimates. The reported production deployment and lessons learned provide concrete evidence of practical utility beyond theoretical claims. However, the absence of explicit model equations or assumptions in the provided abstract limits assessment of whether the method achieves unbiasedness for unobserved combinations under realistic conditions.

major comments (2)
  1. [Abstract] Abstract: the claim that combined and conditional effects can be recovered unbiasedly for variant combinations that never co-occur in the data is load-bearing for the central contribution, yet no joint model, response-surface assumptions (additivity, interaction order, or parametric form), or identification strategy is stated; simulations are reported to confirm consistency, but this only verifies behavior under the (unstated) data-generating process and does not address bias under plausible misspecification.
  2. [Methods] The weakest assumption noted—that overlaps are fully observed and a joint model recovers unbiased estimates without further restrictions—directly affects the validity of estimates (2) and (3); the manuscript must explicitly define the model (likely in the methods section) and demonstrate that identification does not rely on untestable restrictions that would be violated in production traffic.
minor comments (1)
  1. [Abstract] The abstract would be strengthened by a single sentence summarizing the key modeling assumptions or functional form used for the joint estimation.
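The misspecification concern in the major comments can be illustrated with a toy case. The sketch below assumes a hypothetical setup in which the two treatment variants never co-occur, fits an additive model (chosen here purely for illustration; the abstract does not say which model MEA uses), and extrapolates to the unobserved combination. All names and values are assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 300_000
b1, b2, b12 = 1.0, 0.5, -0.3             # true main effects and interaction

# Only three cells are ever observed: (0,0), (1,0), (0,1). The (1,1) cell is empty.
cell = rng.integers(0, 3, n)
a1 = (cell == 1).astype(float)
a2 = (cell == 2).astype(float)
y = b1 * a1 + b2 * a2 + b12 * a1 * a2 + rng.normal(0, 1, n)   # a1 * a2 is always 0 here

# Additive working model: y ~ 1 + a1 + a2 (no interaction term can be identified).
X = np.column_stack([np.ones(n), a1, a2])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)

additive_combined = coef[1] + coef[2]    # extrapolated effect of launching both
true_combined = b1 + b2 + b12            # what launching both would actually do
gap = additive_combined - true_combined  # close to -b12 when the interaction is real

print(f"extrapolated={additive_combined:.3f} truth={true_combined:.3f} gap={gap:.3f}")
```

The fit itself is consistent for the two main effects, yet the extrapolated combined effect is off by roughly the size of the interaction, and nothing in the observed data can reveal that. This is why the identification strategy for never-co-occurring combinations matters.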

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed comments, which highlight important opportunities to improve the clarity and rigor of our presentation of Multi-Experiment Analysis (MEA). We agree that the abstract and methods sections would benefit from more explicit statements of the joint model, assumptions, and identification strategy. We address each major comment below and commit to revisions that strengthen these aspects without altering the core claims supported by our simulations and production deployment.

Point-by-point responses
  1. Referee: [Abstract] Abstract: the claim that combined and conditional effects can be recovered unbiasedly for variant combinations that never co-occur in the data is load-bearing for the central contribution, yet no joint model, response-surface assumptions (additivity, interaction order, or parametric form), or identification strategy is stated; simulations are reported to confirm consistency, but this only verifies behavior under the (unstated) data-generating process and does not address bias under plausible misspecification.

    Authors: We agree that the abstract should more explicitly reference the modeling framework that supports recovery of effects for unobserved combinations. The full manuscript (Section 3) defines the joint model used by MEA, which employs a parametric structure allowing consistent estimation and extrapolation under the stated assumptions. Simulations in Section 4 confirm consistency and coverage when data are generated from this model. To address the concern about misspecification, we will revise the abstract to briefly note the key assumptions and identification approach, and we will add a dedicated discussion subsection on robustness to plausible departures from the assumed response surface, including additional simulation results under misspecification. revision: yes

  2. Referee: [Methods] The weakest assumption noted—that overlaps are fully observed and a joint model recovers unbiased estimates without further restrictions—directly affects the validity of estimates (2) and (3); the manuscript must explicitly define the model (likely in the methods section) and demonstrate that identification does not rely on untestable restrictions that would be violated in production traffic.

    Authors: We acknowledge that explicit definition of the model and assumptions is required for full assessment of estimates (2) and (3). In the revised manuscript we will expand the Methods section to present the complete joint model equations, enumerate all assumptions (including full observability of overlaps and the parametric form enabling recovery for non-co-occurring combinations), and provide a clear identification argument. We will also incorporate a discussion, informed by our production deployment, of how these assumptions align with real traffic conditions and any practical safeguards or diagnostics used to monitor potential violations. revision: yes

Circularity Check

0 steps flagged

No circularity: derivation is self-contained with external validation

Full rationale

The abstract describes a proposed methodology (MEA) for joint estimation under overlaps, with three types of estimates produced from observed data. No equations, fitted parameters, or derivation chain are shown that would reduce a claimed prediction to a self-defined input or self-citation. Validation is stated to come from comprehensive simulations confirming consistency and coverage, plus production deployment; these are independent checks rather than tautological. No self-citation load-bearing steps, uniqueness theorems, or ansatz smuggling appear in the provided text. The central claim therefore rests on the model's ability to recover effects from partial overlaps, which is presented as a modeling choice rather than a definitional reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no explicit free parameters, axioms, or invented entities; the method is presented as a statistical procedure whose internal assumptions are not enumerated.



A Assumption Checking

MEA’s correctness relies on a causal assumption, Arm-Trigger Invariance, which we state here, connect to a causa...