Nobody Puts Bonferroni in a Corner
Pith reviewed 2026-05-10 17:08 UTC · model grok-4.3
The pith
Bonferroni correction controls error rates in online experiments while remaining competitive in power once the test family is limited to success metrics.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Bonferroni correction supplies unconditional simultaneous confidence intervals for every metric, is uniquely convenient for pre-experiment sample-size planning, and incurs only a modest power penalty relative to Holm or Hommel once the family of tests is restricted to success metrics; when guardrail metrics are incorrectly folded into the family the power gap shrinks to near zero.
What carries the argument
The restriction of the Bonferroni denominator to the count of success metrics alone, justified by the intersection-union logic applied to guardrail and quality metrics.
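As a rough illustration of what restricting the denominator means in practice, the sketch below computes two-sided simultaneous confidence intervals where the family size is the number of success metrics only. It assumes normally distributed estimators. The function name `bonferroni_intervals` and the example estimates and standard errors are hypothetical and do not come from the paper.

```python
from statistics import NormalDist


def bonferroni_intervals(estimates, std_errors, alpha=0.05):
    """Two-sided simultaneous CIs; the family is the success metrics passed in."""
    m = len(estimates)  # Bonferroni denominator = number of success metrics
    z = NormalDist().inv_cdf(1 - alpha / (2 * m))
    return [(est - z * se, est + z * se) for est, se in zip(estimates, std_errors)]


# Hypothetical effect estimates and standard errors for three success metrics.
success_estimates = [0.8, 0.3, 1.1]
success_std_errors = [0.3, 0.25, 0.5]

intervals = bonferroni_intervals(success_estimates, success_std_errors)
for lo, hi in intervals:
    print(f"[{lo:.3f}, {hi:.3f}]")
```

Guardrail and quality metrics are deliberately left out of the inputs, so adding more of them would not widen these intervals.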
If this is right
- Pre-experiment sample size calculations become straightforward because the correction factor is known in advance.
- Unconditional simultaneous confidence intervals are available for every metric without additional computational cost.
- When the family is correctly limited to success metrics, the fraction of experiments that ship drops by only four to five percentage points compared with Holm or Hommel.
- When few metrics are truly non-null, the power gap between Bonferroni and more complex procedures vanishes regardless of family specification.
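The sample-size point in the first bullet can be sketched with the textbook two-sample z-test formula, with the per-metric significance level set to alpha divided by the number of success metrics. This is a minimal sketch and not the paper's calculator: the function `n_per_group`, the equal-variance assumption, and the example inputs are all illustrative assumptions, and the power target applies to each metric individually.

```python
from math import ceil
from statistics import NormalDist


def n_per_group(delta, sigma, n_success_metrics, alpha=0.05, power=0.8):
    """Per-group sample size for a two-sided two-sample z-test at level alpha / m."""
    nd = NormalDist()
    z_alpha = nd.inv_cdf(1 - alpha / (2 * n_success_metrics))  # Bonferroni-adjusted
    z_power = nd.inv_cdf(power)
    return ceil(2 * (sigma * (z_alpha + z_power) / delta) ** 2)


# Hypothetical planning inputs: detect a 0.1 standard-deviation effect.
for m in (1, 2, 4):
    print(m, n_per_group(delta=0.1, sigma=1.0, n_success_metrics=m))
```

Because the adjusted level depends only on the metric count, the required sample size is available before any data are collected, which is not true of procedures whose thresholds depend on the observed p-values.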
Where Pith is reading between the lines
- Teams running many experiments may gain more by pruning the number of success metrics than by switching to a more elaborate multiple-testing procedure.
- Clear documentation of which metrics count as success metrics versus guardrails becomes a first-order design choice.
- Platforms could expose the success-metric count as an explicit input to their sample-size calculators.
Load-bearing premise
Guardrail and quality metrics are evaluated with intersection-union logic and therefore cannot increase the overall false-positive rate for the deployment decision.
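One way to make the premise concrete is the standard intersection-union bound sketched below. The notation is an assumption introduced here, not taken from the paper: $S$ success metrics each tested at level $\alpha/S$, $G$ guardrail metrics each tested for non-inferiority at level $\alpha$, $R_s$ the event that success metric $s$ is declared significant, $P_g$ the event that guardrail $g$ passes, and $D$ the deployment event.

```latex
D \;=\; \Bigl(\bigcup_{s=1}^{S} R_s\Bigr) \,\cap\, \Bigl(\bigcap_{g=1}^{G} P_g\Bigr)

% Case 1: no success metric truly improves.
\Pr(D) \;\le\; \Pr\Bigl(\bigcup_{s=1}^{S} R_s\Bigr)
       \;\le\; \sum_{s=1}^{S} \Pr(R_s)
       \;\le\; S \cdot \frac{\alpha}{S} \;=\; \alpha

% Case 2: some guardrail g_0 is truly inferior.
\Pr(D) \;\le\; \Pr\bigl(P_{g_0}\bigr) \;\le\; \alpha
```

In both cases the guardrail events only shrink $D$, which is why, under this deployment rule, they do not need to enter the Bonferroni denominator.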
What would settle it
An experiment or simulation in which guardrail metrics are instead tested with union logic and the resulting false-positive rate for deployment decisions exceeds the nominal level.
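A toy version of such a simulation is sketched below. It assumes independent standard-normal test statistics and a worst-case scenario in which every null is true: no success metric improves and every guardrail sits exactly at its non-inferiority margin, so any deployment counts as a false positive. The function `false_deploy_rates` and the "union" rule it implements (deploy if any metric, success or guardrail, clears the success-only threshold) are hypothetical constructions for illustration.

```python
import random
from statistics import NormalDist


def false_deploy_rates(n_success=2, n_guardrails=4, alpha=0.05,
                       n_sims=20000, seed=0):
    """Return (intersection-union rate, union rate) of false deployments."""
    rng = random.Random(seed)
    nd = NormalDist()
    z_success = nd.inv_cdf(1 - alpha / n_success)  # Bonferroni over success metrics only
    z_guard = nd.inv_cdf(1 - alpha)                # each guardrail at full alpha
    iu_deploys = 0
    union_deploys = 0
    for _ in range(n_sims):
        success_z = [rng.gauss(0.0, 1.0) for _ in range(n_success)]
        guard_z = [rng.gauss(0.0, 1.0) for _ in range(n_guardrails)]
        any_success = any(z > z_success for z in success_z)
        all_guard_pass = all(z > z_guard for z in guard_z)
        # Intersection-union rule: a success win AND every guardrail passing.
        if any_success and all_guard_pass:
            iu_deploys += 1
        # Hypothetical union rule: any metric clearing the success-only threshold.
        if any_success or any(z > z_success for z in guard_z):
            union_deploys += 1
    return iu_deploys / n_sims, union_deploys / n_sims


iu_rate, union_rate = false_deploy_rates()
print(f"intersection-union: {iu_rate:.4f}, union: {union_rate:.4f}")
```

With two success metrics and four guardrails, the union rule's rate for independent tests is analytically 1 - (1 - 0.025)^6, roughly 0.14, well above the nominal 0.05, while the intersection-union rate stays below it.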
Original abstract
We argue that Bonferroni correction is a better choice for online experimentation than it is commonly given credit for. The case rests on four considerations. First, it is the simplest broadly implementable FWER-controlling method that produces unconditional simultaneous confidence intervals for every metric. Second, in a well-specified decision framework, guardrail and quality metrics use intersection-union logic and cannot inflate the false positive rate, so the Bonferroni denominator is the number of success metrics only, not the total metric count. Third, it is uniquely tractable for pre-experiment sample size calculations. Fourth, we contextualise the power cost empirically. Drawing on a simulation study and an empirical analysis of 1,296 experiments run on Spotify's experimentation platform, Confidence, we show that the power loss relative to more sophisticated FWER methods depends on both how the correction family is specified and how many metrics are truly non-null. When guardrail metrics are incorrectly included in the family, Holm and Hommel are nearly indistinguishable from Bonferroni. When the family is correctly restricted to success metrics only, they gain roughly 4–5 percentage points in ship rate (the fraction of experiments where the treatment is deployed). When few metrics are truly non-null, the gap narrows to near zero regardless of method.
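To see where Holm's extra power comes from, and why it disappears when few metrics are non-null, the following sketch compares the two procedures on hand-picked p-values. The p-values are hypothetical and not drawn from the paper's data. Hommel is omitted for brevity.

```python
def bonferroni_reject(p_values, alpha=0.05):
    """Reject every hypothesis whose p-value is at most alpha / m."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


def holm_reject(p_values, alpha=0.05):
    """Holm step-down: compare sorted p-values to alpha/m, alpha/(m-1), ..."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, idx in enumerate(order):
        if p_values[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break  # stop at the first non-rejection
    return reject


# Several small p-values: Holm relaxes the threshold after each rejection.
several_signals = [0.010, 0.015, 0.030, 0.500]
print(bonferroni_reject(several_signals))  # [True, False, False, False]
print(holm_reject(several_signals))        # [True, True, False, False]

# Only one small p-value: the two procedures agree.
one_signal = [0.010, 0.400, 0.600, 0.500]
print(bonferroni_reject(one_signal))       # [True, False, False, False]
print(holm_reject(one_signal))             # [True, False, False, False]
```

Holm never rejects fewer hypotheses than Bonferroni, but it can only reject more after a first rejection has already occurred, so its advantage depends on several metrics carrying real effects.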
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that Bonferroni correction is underappreciated for FWER control in online experimentation. It rests on four considerations: (1) simplicity combined with unconditional simultaneous confidence intervals for all metrics, (2) in a well-specified decision framework guardrail/quality metrics follow intersection-union logic and therefore do not inflate the deployment false-positive rate, so the Bonferroni denominator equals only the number of success metrics, (3) unique tractability for pre-experiment sample-size calculations, and (4) empirical quantification via simulation and 1,296 Spotify experiments showing that Holm/Hommel ship-rate gains are only 4–5 percentage points when the family is correctly restricted to success metrics and near zero when guardrails are included or few metrics are non-null.
Significance. If the central claims hold, the work has practical significance for statistical practice in online experimentation. It supplies both a clear decision-theoretic justification for family specification and reproducible empirical evidence on the magnitude of power differences across methods, helping practitioners weigh simplicity against modest power gains. The emphasis on correct family definition and the use of real platform data are strengths.
major comments (2)
- [§3] §3 (decision framework): The assertion that guardrail and quality metrics follow intersection-union logic and therefore cannot inflate the overall false-positive rate for deployment decisions is load-bearing for the recommendation to restrict the Bonferroni family to success metrics only. A short formal statement or small illustrative example showing how the intersection-union rule interacts with the platform’s deployment criteria would make the argument fully self-contained.
- [§4.2] §4.2 (empirical analysis): The classification rules that assign the 1,296 experiments’ metrics to success, guardrail, or quality categories are essential for interpreting the reported 4–5 pp ship-rate difference. Explicit, reproducible criteria (or at least a representative example) should be supplied so readers can judge generalizability.
minor comments (3)
- [Abstract] Abstract: the phrase “near-zero gap when few metrics are truly non-null” is vague; a quantitative threshold (e.g., “≤2 non-null metrics”) would improve precision.
- [Throughout] Throughout: ensure every acronym (FWER, CI, etc.) is defined at first use.
- [Figures] Figure captions: add a brief note on how the simulation parameters were chosen to match the Spotify data distribution.
Simulated Author's Rebuttal
We thank the referee for the positive evaluation and for identifying two points where additional clarity would strengthen the manuscript. Both suggestions align with our goal of making the decision-theoretic and empirical arguments fully self-contained. We address each comment below and will incorporate the requested material in the revision.
Point-by-point responses
- Referee: [§3] §3 (decision framework): The assertion that guardrail and quality metrics follow intersection-union logic and therefore cannot inflate the overall false-positive rate for deployment decisions is load-bearing for the recommendation to restrict the Bonferroni family to success metrics only. A short formal statement or small illustrative example showing how the intersection-union rule interacts with the platform’s deployment criteria would make the argument fully self-contained.
  Authors: We agree that a concise formal statement and illustrative example will make the intersection-union argument self-contained. In the revised §3 we will add a short formal paragraph stating that, under the platform’s deployment rule (deploy only if all guardrail/quality metrics pass their thresholds), the family-wise error rate for the deployment decision is controlled by the intersection-union test; hence only the success-metric family requires multiplicity adjustment. We will also include a minimal numerical example with two guardrails and one success metric to illustrate that a false positive on a guardrail cannot produce an erroneous deployment.
  Revision: yes
- Referee: [§4.2] §4.2 (empirical analysis): The classification rules that assign the 1,296 experiments’ metrics to success, guardrail, or quality categories are essential for interpreting the reported 4–5 pp ship-rate difference. Explicit, reproducible criteria (or at least a representative example) should be supplied so readers can judge generalizability.
  Authors: We accept that explicit classification criteria are needed for reproducibility. In the revised §4.2 we will add a dedicated subsection listing the operational rules used at Spotify (e.g., success metrics are those tied to the primary business objective of the experiment; guardrails are pre-specified safety or quality thresholds; quality metrics are secondary diagnostic measures). We will also provide a representative example from one of the 1,296 experiments showing how a given metric was assigned to each category.
  Revision: yes
Circularity Check
No significant circularity; derivation is self-contained
Full rationale
The paper's arguments rest on standard multiple-testing procedures (Bonferroni, Holm, Hommel) whose FWER properties are externally established, plus an independent empirical analysis of 1,296 Spotify experiments and simulations. The restriction of the correction family to success metrics is framed as an explicit modeling choice within a stated decision framework rather than a quantity derived from the data or from self-referential equations. No step equates a claimed prediction to a fitted input by construction, no load-bearing uniqueness theorem is imported via self-citation, and the reported ship-rate gains are direct numerical outputs of the external dataset under the stated premises.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: the decision framework for experiments is well-specified, such that guardrail metrics follow intersection-union testing logic.