Nobody Puts Bonferroni in a Corner

Mårten Schultzberg

arxiv: 2604.09256 · v1 · submitted 2026-04-10 · 📊 stat.ME

Pith reviewed 2026-05-10 17:08 UTC · model grok-4.3

classification 📊 stat.ME

keywords Bonferroni correction · online experimentation · family-wise error rate · A/B testing · multiple testing · ship rate · power analysis · guardrail metrics

The pith

Bonferroni correction controls error rates in online experiments while remaining competitive in power once the test family is limited to success metrics.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that Bonferroni correction deserves more use in online A/B testing because it is the simplest method that yields unconditional simultaneous confidence intervals for all metrics. In a properly specified decision process, guardrail and quality metrics are evaluated under intersection-union logic and so do not raise the chance of falsely deploying a bad treatment; the correction factor then applies only to the smaller set of success metrics. Empirical checks on 1,296 real experiments and accompanying simulations show that more sophisticated procedures such as Holm or Hommel raise the ship rate (the fraction of experiments in which the treatment is deployed) by only four to five percentage points when the family is correctly restricted, and that the advantage all but disappears when few metrics truly differ from zero or when guardrails are mistakenly included in the family.

Core claim

Bonferroni correction supplies unconditional simultaneous confidence intervals for every metric, is uniquely convenient for pre-experiment sample-size planning, and incurs only a modest power penalty relative to Holm or Hommel once the family of tests is restricted to success metrics; when guardrail metrics are incorrectly folded into the family the power gap shrinks to near zero.

What carries the argument

The restriction of the Bonferroni denominator to the count of success metrics alone, justified by the intersection-union logic applied to guardrail and quality metrics.
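To make the effect of the denominator concrete, here is a minimal sketch of two-sided Bonferroni simultaneous confidence intervals under a normal approximation. The function names and the metric counts (2 success metrics, 8 guardrails) are hypothetical and are not taken from the paper; the point is only that the critical value depends on how many metrics are counted in the family.

```python
from statistics import NormalDist


def critical_value(family_size, alpha=0.05):
    """Two-sided Bonferroni critical value for a family of the given size."""
    return NormalDist().inv_cdf(1 - alpha / (2 * family_size))


def bonferroni_intervals(estimates, std_errors, alpha=0.05):
    """Simultaneous intervals; the family is exactly the metrics passed in."""
    z = critical_value(len(estimates), alpha)
    return [(est - z * se, est + z * se)
            for est, se in zip(estimates, std_errors)]


# Hypothetical experiment: 2 success metrics and 8 guardrail metrics.
z_success_only = critical_value(2)    # family = success metrics only
z_all_metrics = critical_value(10)    # family = every metric (too conservative)

# Illustrative effect estimates and standard errors for the 2 success metrics.
intervals = bonferroni_intervals([0.020, 0.010], [0.005, 0.004])
```

The smaller family gives a smaller critical value and therefore narrower intervals for the same data, which is where the power recovered by restricting the family comes from.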

If this is right

Pre-experiment sample size calculations become straightforward because the correction factor is known in advance.
Unconditional simultaneous confidence intervals are available for every metric without additional computational cost.
When the family is correctly limited to success metrics, Bonferroni's ship rate is only four to five percentage points lower than that of Holm or Hommel.
When few metrics are truly non-null, the power gap between Bonferroni and more complex procedures vanishes regardless of family specification.
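The sample-size point can be sketched with the textbook formula for a two-sided, two-sample z-test on a difference in means, with the per-metric level set to alpha divided by the number of success metrics. This is an illustrative assumption rather than the paper's own calculation: the function name, the equal-variance setup, and the per-metric power target are all hypothetical choices.

```python
import math
from statistics import NormalDist


def sample_size_per_group(mde, sd, n_success_metrics, alpha=0.05, power=0.8):
    """Units per group needed to detect `mde` on one metric.

    Assumes equal group sizes, a common standard deviation `sd`, and a
    two-sided test at level alpha / n_success_metrics (Bonferroni).
    """
    norm = NormalDist()
    z_alpha = norm.inv_cdf(1 - alpha / (2 * n_success_metrics))
    z_beta = norm.inv_cdf(power)
    return math.ceil(2 * (sd * (z_alpha + z_beta) / mde) ** 2)


# The correction factor is known before any data arrive, so the required
# sample size is a closed-form function of the success-metric count.
sizes = {m: sample_size_per_group(mde=0.1, sd=1.0, n_success_metrics=m)
         for m in (1, 2, 5, 10)}
```

Because the adjusted level is fixed in advance, nothing in the calculation depends on the unknown number of true effects, which is not the case for step-wise procedures such as Holm or Hommel.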

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Teams running many experiments may gain more by pruning the number of success metrics than by switching to a more elaborate multiple-testing procedure.
Clear documentation of which metrics count as success metrics versus guardrails becomes a first-order design choice.
Platforms could expose the success-metric count as an explicit input to their sample-size calculators.

Load-bearing premise

Guardrail and quality metrics are evaluated with intersection-union logic and therefore cannot increase the overall false-positive rate for the deployment decision.
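The premise can be written out for a small case. The following is a sketch of the standard intersection-union bound, not a statement taken from the paper. It assumes one success metric and two guardrails, with $S$ the event that the success-metric test rejects at level $\alpha$ and $G_j$ the event that guardrail $j$'s non-inferiority test rejects at level $\alpha$; the symbols are introduced here for illustration.

```latex
\begin{align*}
\text{Ship} &= S \cap G_1 \cap G_2, \\
\Pr(\text{Ship}) &\le \Pr(S) \le \alpha
  && \text{if the success metric has no true improvement}, \\
\Pr(\text{Ship}) &\le \Pr(G_j) \le \alpha
  && \text{if guardrail } j \text{ is truly inferior } (j \in \{1, 2\}).
\end{align*}
```

Each guardrail can only remove experiments from the ship set, so no adjustment across guardrails is needed. With $m_s$ success metrics, $S$ becomes a union of $m_s$ rejection events, and testing each at level $\alpha / m_s$ restores $\Pr(S) \le \alpha$ by the union bound.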

What would settle it

An experiment or simulation in which guardrail metrics are instead tested with union logic and the resulting false-positive rate for deployment decisions exceeds the nominal level.
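A rough simulation along these lines is sketched below. Everything in it is an assumption made for illustration: metrics are independent, test statistics are standard normal, all true effects are zero, and the "union" rule is one hypothetical reading in which a significant improvement on any guardrail also triggers shipping without enlarging the Bonferroni denominator.

```python
import random
from statistics import NormalDist


def simulate_ship_rates(n_success=3, n_guardrail=4, alpha=0.05,
                        margin_shift=3.0, n_trials=20000, seed=1):
    """Monte Carlo ship rates when no metric has a true effect.

    Returns (rate with intersection-union guardrails,
             rate with the hypothetical union rule).
    """
    rng = random.Random(seed)
    norm = NormalDist()
    z_success = norm.inv_cdf(1 - alpha / n_success)  # one-sided, Bonferroni
    z_single = norm.inv_cdf(1 - alpha)               # one-sided, uncorrected
    ships_iu = 0
    ships_union = 0
    for _ in range(n_trials):
        # Standardised effect estimates; every true effect is zero.
        success_z = [rng.gauss(0.0, 1.0) for _ in range(n_success)]
        guardrail_z = [rng.gauss(0.0, 1.0) for _ in range(n_guardrail)]
        any_success = any(z > z_success for z in success_z)
        # Non-inferiority statistic: estimate plus margin, in standard-error units.
        all_noninferior = all(z + margin_shift > z_single for z in guardrail_z)
        # Hypothetical union rule: a guardrail "win" also counts as a reason to ship.
        any_guardrail_win = any(z > z_single for z in guardrail_z)
        ships_iu += any_success and all_noninferior
        ships_union += any_success or any_guardrail_win
    return ships_iu / n_trials, ships_union / n_trials


iu_rate, union_rate = simulate_ship_rates()
```

Under these assumptions the intersection-union rate stays at or below the nominal 5% level, while the union variant exceeds it by a wide margin; how closely this mirrors any real platform depends on the correlation between metrics and on the actual decision rule.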

Original abstract

We argue that Bonferroni correction is a better choice for online experimentation than it is commonly given credit for. The case rests on four considerations. First, it is the simplest broadly implementable FWER-controlling method that produces unconditional simultaneous confidence intervals for every metric. Second, in a well-specified decision framework, guardrail and quality metrics use intersection-union logic and cannot inflate the false positive rate, so the Bonferroni denominator is the number of success metrics only, not the total metric count. Third, it is uniquely tractable for pre-experiment sample size calculations. Fourth, we contextualise the power cost empirically. Drawing on a simulation study and an empirical analysis of 1,296 experiments run on Spotify's experimentation platform, Confidence, we show that the power loss relative to more sophisticated FWER methods depends on both how the correction family is specified and how many metrics are truly non-null. When guardrail metrics are incorrectly included in the family, Holm and Hommel are nearly indistinguishable from Bonferroni. When the family is correctly restricted to success metrics only, they gain roughly 4--5 percentage points in ship rate (the fraction of experiments where the treatment is deployed). When few metrics are truly non-null, the gap narrows to near zero regardless of method.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

Bonferroni holds up better than its reputation once you restrict the family to success metrics, with the Spotify data showing only a modest 4-5 point ship-rate gap versus Holm or Hommel.

The paper's core contribution is the empirical check on how much power you actually lose with Bonferroni in a realistic online setting. They pull 1,296 experiments from Spotify's platform and pair that with a simulation that varies the number of non-null metrics. When the family is limited to success metrics, Holm and Hommel improve the ship rate by roughly 4-5 points. When guardrails are folded in by mistake or when few metrics move, the gap shrinks to near zero. That gives practitioners a concrete number instead of just asymptotic comparisons.
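For readers who want to see where Holm can differ from Bonferroni, the sketch below implements both rejection rules for a single family of p-values. The function names and the example p-values are hypothetical. It counts rejected hypotheses only and does not reproduce the paper's ship-rate rule; Hommel is omitted for brevity.

```python
def bonferroni_reject(p_values, alpha=0.05):
    """Reject every hypothesis whose p-value is at most alpha / m."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


def holm_reject(p_values, alpha=0.05):
    """Holm step-down: compare sorted p-values with alpha/m, alpha/(m-1), ..."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, i in enumerate(order):
        if p_values[i] <= alpha / (m - rank):
            reject[i] = True
        else:
            break  # stop at the first non-rejection
    return reject


# Illustrative p-values for three success metrics (not from the paper).
p_values = [0.030, 0.010, 0.020]
bonferroni_result = bonferroni_reject(p_values)
holm_result = holm_reject(p_values)
```

Holm always rejects at least everything Bonferroni rejects, and the two agree on the smallest p-value. Extra rejections are only possible when several metrics carry signal, which is consistent with the gap narrowing when few metrics move.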

Referee Report

2 major / 3 minor

Summary. The paper claims that Bonferroni correction is underappreciated for FWER control in online experimentation. It rests on four considerations: (1) simplicity combined with unconditional simultaneous confidence intervals for all metrics, (2) in a well-specified decision framework guardrail/quality metrics follow intersection-union logic and therefore do not inflate the deployment false-positive rate, so the Bonferroni denominator equals only the number of success metrics, (3) unique tractability for pre-experiment sample-size calculations, and (4) empirical quantification via simulation and 1,296 Spotify experiments showing that Holm/Hommel ship-rate gains are only 4–5 percentage points when the family is correctly restricted to success metrics and near zero when guardrails are included or few metrics are non-null.

Significance. If the central claims hold, the work has practical significance for statistical practice in online experimentation. It supplies both a clear decision-theoretic justification for family specification and reproducible empirical evidence on the magnitude of power differences across methods, helping practitioners weigh simplicity against modest power gains. The emphasis on correct family definition and the use of real platform data are strengths.

major comments (2)

[§3] (decision framework): The assertion that guardrail and quality metrics follow intersection-union logic and therefore cannot inflate the overall false-positive rate for deployment decisions is load-bearing for the recommendation to restrict the Bonferroni family to success metrics only. A short formal statement or small illustrative example showing how the intersection-union rule interacts with the platform’s deployment criteria would make the argument fully self-contained.
[§4.2] (empirical analysis): The classification rules that assign the 1,296 experiments’ metrics to success, guardrail, or quality categories are essential for interpreting the reported 4–5 pp ship-rate difference. Explicit, reproducible criteria (or at least a representative example) should be supplied so readers can judge generalizability.

minor comments (3)

[Abstract] The phrase “near-zero gap when few metrics are truly non-null” is vague; a quantitative threshold (e.g., “≤2 non-null metrics”) would improve precision.
[Throughout] Ensure every acronym (FWER, CI, etc.) is defined at first use.
[Figures] Figure captions: add a brief note on how the simulation parameters were chosen to match the Spotify data distribution.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the positive evaluation and for identifying two points where additional clarity would strengthen the manuscript. Both suggestions align with our goal of making the decision-theoretic and empirical arguments fully self-contained. We address each comment below and will incorporate the requested material in the revision.

Referee: [§3] (decision framework): The assertion that guardrail and quality metrics follow intersection-union logic and therefore cannot inflate the overall false-positive rate for deployment decisions is load-bearing for the recommendation to restrict the Bonferroni family to success metrics only. A short formal statement or small illustrative example showing how the intersection-union rule interacts with the platform’s deployment criteria would make the argument fully self-contained.

Authors: We agree that a concise formal statement and illustrative example will make the intersection-union argument self-contained. In the revised §3 we will add a short formal paragraph stating that, under the platform’s deployment rule (deploy only if all guardrail/quality metrics pass their thresholds), the family-wise error rate for the deployment decision is controlled by the intersection-union test; hence only the success-metric family requires multiplicity adjustment. We will also include a minimal numerical example with two guardrails and one success metric to illustrate that a false positive on a guardrail cannot produce an erroneous deployment.

Revision: yes

Referee: [§4.2] (empirical analysis): The classification rules that assign the 1,296 experiments’ metrics to success, guardrail, or quality categories are essential for interpreting the reported 4–5 pp ship-rate difference. Explicit, reproducible criteria (or at least a representative example) should be supplied so readers can judge generalizability.

Authors: We accept that explicit classification criteria are needed for reproducibility. In the revised §4.2 we will add a dedicated subsection listing the operational rules used at Spotify (e.g., success metrics are those tied to the primary business objective of the experiment; guardrails are pre-specified safety or quality thresholds; quality metrics are secondary diagnostic measures). We will also provide a representative example from one of the 1,296 experiments showing how a given metric was assigned to each category.

Revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation is self-contained

The paper's arguments rest on standard multiple-testing procedures (Bonferroni, Holm, Hommel) whose FWER properties are externally established, plus an independent empirical analysis of 1,296 Spotify experiments and simulations. The restriction of the correction family to success metrics is framed as an explicit modeling choice within a stated decision framework rather than a quantity derived from the data or from self-referential equations. No step equates a claimed prediction to a fitted input by construction, no load-bearing uniqueness theorem is imported via self-citation, and the reported ship-rate gains are direct numerical outputs of the external dataset under the stated premises.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper relies on established multiple-testing theory without introducing new free parameters or invented entities; the contribution is contextual application and empirical evaluation.

axioms (1)

Domain assumption: The decision framework for experiments is well-specified such that guardrail metrics follow intersection-union testing logic. This assumption is required to restrict the Bonferroni family to success metrics only.

Reference graph

Works this paper leans on

22 extracted references · 22 canonical work pages

[1] Benjamini, Y., & Hochberg, Y. (1995). Controlling the false discovery rate: A practical and powerful approach to multiple testing. Journal of the Royal Statistical Society: Series B, 57(1), 289–300.
[2] Benjamini, Y., & Yekutieli, D. (2001). The control of the false discovery rate in multiple testing under dependency. Annals of Statistics, 29(4), 1165–1188.
[3] Benjamini, Y., & Yekutieli, D. (2005). False discovery rate-adjusted multiple confidence intervals for selected parameters. Journal of the American Statistical Association, 100(469), 71–81.
[4] Berman, R., & Van den Bulte, C. (2022). False discovery in A/B testing. Management Science, 68(9), 6762–6782.
[5] Brannath, W., Kluge, L., & Scharpenberg, M. (2024). Informative simultaneous confidence intervals for graphical test procedures. arXiv preprint, arXiv:2402.13719.
[6] Ferreira, J. A., & Zwinderman, A. H. (2006). Approximate power and sample size calculations with the Benjamini-Hochberg method. The International Journal of Biostatistics, 2(1), Article 8.
[7] Gelman, A., & Carlin, J. (2014). Beyond power calculations: Assessing Type S (sign) and Type M (magnitude) errors. Perspectives on Psychological Science, 9(6), 641–651.
[8] Gao, P., Ware, J. H., & Mehta, C. (2008). Sample size re-estimation for adaptive sequential design in clinical trials. Journal of Biopharmaceutical Statistics, 18(6), 1184–1196.
[9] Guilbaud, O. (2008). Simultaneous confidence regions corresponding to Holm's step-down procedure and other closed-testing procedures. Biometrical Journal, 50(5), 678–692.
[10] Guilbaud, O. (2012). Simultaneous confidence regions for closed tests, including Holm-, Hochberg-, and Hommel-related procedures. Biometrical Journal, 54(3), 317–342.
[11] Johari, R., Koomen, P., Pekelis, L., & Walsh, D. (2022). Always valid inference: Continuous monitoring of A/B tests. Operations Research, 70(3), 1806–1821.
[12] Kohavi, R., Tang, D., & Xu, Y. (2020). Trustworthy Online Controlled Experiments: A Practical Guide to A/B Testing. Cambridge University Press.
[13] Lan, K. K. G., & DeMets, D. L. (1983). Discrete sequential boundaries for clinical trials. Biometrika, 70(3), 659–663.
[14] Liao, J. J. Z., et al. (2018). Defining information fractions in group sequential clinical trials with multiple endpoints. Contemporary Clinical Trials Communications, 17(3), 235–246.
[15] Negi, A., & Wooldridge, J. M. (2021). Revisiting regression adjustment in experiments with heterogeneous treatment effects. Econometric Reviews, 40(5), 504–534.
[16] Nyholt, D. R. (2004). A simple correction for multiple testing for single-nucleotide polymorphisms in linkage disequilibrium with each other. American Journal of Human Genetics, 74(4), 765–769.
[17] Perneger, T. V. (1998). What's wrong with Bonferroni adjustments. BMJ, 316(7139), 1236–1238.
[18] VanderWeele, T. J., & Mathur, M. B. (2018). Some desirable properties of the Bonferroni correction: Is the Bonferroni correction really so bad? American Journal of Epidemiology, 188(3), 617–618.
[19] Schultzberg, M., Ankargren, S., & Frånberg, M. (2026). Risk-aware product decisions in A/B tests with multiple metrics. Journal of Statistical Planning and Inference, pii S0378375826000212.
[20] Spotify Engineering. (2023). Choosing a sequential testing framework - comparisons and discussions. https://engineering.atspotify.com/2023/03/choosing-sequential-testing-framework-comparisons-and-discussions
[21] Strassburger, K., & Bretz, F. (2008). Compatible simultaneous lower confidence bounds for the Holm procedure and other Bonferroni-based closed tests. Statistics in Medicine, 27(24), 4914–4927.
[22] Vickerstaff, V., Omar, R. Z., & Ambler, G. (2019). Methods to adjust for multiple comparisons in the analysis and sample size calculation of randomised controlled trials with multiple primary outcomes. BMC Medical Research Methodology, 19, 129.
