p-Hacking Inflates Type I Error Rates in the Error Statistical Approach but not in the Formal Inference Approach
Pith reviewed 2026-05-22 11:24 UTC · model grok-4.3
The pith
P-hacking inflates Type I error rates in the error statistical approach but not in the formal inference approach.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
In the error statistical approach the actual familywise error rate is the relevant quantity because it applies to the complete test procedure that includes both reported and unreported tests; for two independent tests each at alpha = 0.05 this rate equals 1 - (1 - 0.05)^2 = 0.0975 (reported as 0.098), which exceeds the nominal 0.05 and therefore constitutes Type I error inflation. In the formal inference approach the actual familywise error rate is irrelevant because the researcher reports no statistical inference about the intersection null hypothesis and the actual rate therefore supplies no license for inferences about the individual reported hypotheses; only the nominal error rate remains pertinent.
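The arithmetic behind the actual familywise error rate can be checked with a short sketch. The function name `familywise_error_rate` is hypothetical and not taken from the paper; the sketch assumes independent tests and that every null hypothesis is true.

```python
def familywise_error_rate(alpha: float, k: int) -> float:
    """Probability of at least one false rejection across k independent
    tests, each run at level alpha, assuming every null is true."""
    return 1.0 - (1.0 - alpha) ** k


nominal = 0.05
actual_two_tests = familywise_error_rate(nominal, 2)

# Two independent tests at alpha = 0.05, as in the core claim.
print(round(actual_two_tests, 4))  # 0.0975
```

The gap between `actual_two_tests` and `nominal` is what the error statistical approach counts as inflation; the same function shows the gap widening as `k` grows.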
What carries the argument
The contrast between the actual familywise error rate, which tracks the full set of tests actually performed, and the nominal error rate attached only to the hypotheses that are reported.
If this is right
- Under the error statistical approach any demonstration of p-hacking automatically demonstrates inflation of the actual Type I error rate.
- Under the formal inference approach p-hacking leaves the nominal error rate for each reported test unchanged and therefore does not invalidate the reported inferences.
- Methods for reducing p-hacking must be justified separately for each approach rather than assumed to apply equally.
- Conceptual discussions of p-hacking need to specify whether the actual or the nominal error rate is under discussion.
Where Pith is reading between the lines
- Debates about p-hacking may hinge on whether researchers implicitly treat their tests as a single joint procedure even when they do not say so.
- Journals could require authors to declare which philosophy of testing they follow so that readers can judge whether error-rate complaints apply.
- Empirical work could test how often published papers make claims that effectively concern an intersection null without acknowledging it.
- The distinction may extend to other selective-reporting practices such as optional stopping or covariate selection.
Load-bearing premise
That researchers following the formal inference approach never report a statistical inference about the intersection null hypothesis covering all the tests they ran.
What would settle it
An examination of published papers showing whether authors routinely state conclusions about the joint null when they have conducted multiple tests without correction.
Original abstract
p-hacking occurs when researchers conduct multiple significance tests (e.g., p1;H0,1 and p2;H0,2) and then selectively report tests that yield desirable (usually significant) results (e.g., p2 < 0.05;H0,2) without correcting for multiple testing (e.g., 0.05/2 = 0.025). In the present article, I consider p-hacking in the context of two philosophies of significance testing - the error statistical approach and the formal inference approach. I argue that although p-hacking inflates Type I error rates in the error statistical approach, it does not inflate them in the formal inference approach. Specifically, in the error statistical approach, the "actual" familywise error rate (e.g., 1 - [1 - 0.05]^2 = 0.098 for two independent tests) is relevant because it covers both the reported and unreported tests in the "actual" test procedure (i.e., p1;H0,1 and p2;H0,2). In this approach, Type I error rate inflation occurs because the "actual" error rate (0.098) is higher than the nominal error rate (0.05). In contrast, in the formal inference approach, the "actual" familywise error rate is irrelevant because (a) the researcher does not report a statistical inference about the corresponding intersection null hypothesis (i.e., H0,1 & H0,2), and (b) the "actual" familywise error rate does not license inferences about the reported individual hypotheses (i.e., H0,2). Instead, in the formal inference approach, only the nominal error rate is relevant, and a comparison with the "actual" error rate is inappropriate. Implications for conceptualizing, demonstrating, and reducing p-hacking are discussed.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript distinguishes two philosophies of significance testing and claims that p-hacking inflates Type I error rates in the error statistical approach (because the actual familywise error rate of the full procedure, including unreported tests, exceeds the nominal rate) but not in the formal inference approach (because only the nominal rate for the reported individual hypothesis matters, as the researcher draws no inference about the intersection null).
Significance. If the distinction is sustained, the paper offers a philosophically grounded way to conceptualize p-hacking that could inform statistical education and practice by suggesting that concerns about error-rate inflation are approach-specific rather than universal. The explicit contrast between actual and nominal error rates is a clear contribution, though its force depends on whether the formal-inference framing is accepted as standard frequentist practice.
major comments (2)
- [Abstract / formal inference paragraph] Abstract and the paragraph defining the formal inference approach: the claim that the actual familywise error rate is irrelevant because 'the researcher does not report a statistical inference about the corresponding intersection null hypothesis' is load-bearing for the central thesis. In a frequentist framework the Type I error of the reported rejection is the probability that the actual data-dependent procedure rejects a true null; for two independent tests with both nulls true, selective reporting yields P(report false rejection) > 0.05 even when only one hypothesis is claimed. The manuscript provides no derivation showing why the selection step leaves the error rate of the reported inference exactly at the nominal level.
- [Implications section] The section on implications for demonstrating p-hacking: the argument that only nominal rates matter in formal inference would be strengthened by an explicit comparison to post-selection inference results (e.g., the probability of reporting a rejection under the observed selection rule). Without this, the claim that 'a comparison with the actual error rate is inappropriate' remains definitional rather than demonstrated.
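The referee's point that selective reporting yields P(report false rejection) > 0.05 can be illustrated with a small Monte Carlo sketch. This is not a derivation from the manuscript. It assumes both nulls are true, that p-values are independent and uniform on [0, 1] (as for a continuous test statistic under a true null), and that the researcher reports a rejection whenever either test is significant. The function name and parameters are hypothetical.

```python
import random


def simulate_selective_reporting(alpha=0.05, n_studies=200_000, seed=0):
    """Estimate how often a study reports at least one (false) rejection
    when two independent tests are run and both nulls are true."""
    rng = random.Random(seed)
    reported = 0
    for _ in range(n_studies):
        p1 = rng.random()  # p-value for H0,1 under a true null (assumed uniform)
        p2 = rng.random()  # p-value for H0,2 under a true null (assumed uniform)
        # Assumed selection rule: report whichever test came out significant.
        if min(p1, p2) < alpha:
            reported += 1
    return reported / n_studies


rate = simulate_selective_reporting()
print(f"estimated P(report false rejection) = {rate:.4f}")
```

Under these assumptions the estimate should land near 1 - (1 - 0.05)^2 = 0.0975 rather than 0.05, which is the quantity the referee asks the manuscript to address.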
minor comments (2)
- [Abstract] Notation such as 'p1;H0,1' and 'p2;H0,2' is introduced without a clear definition or example; a small illustrative table showing the actual versus nominal rates for n=2 independent tests would improve readability.
- [References / discussion] The manuscript would benefit from citing the selective-inference and post-selection literature (e.g., works on conditional inference after data-dependent testing) to locate the formal-inference position relative to existing frequentist treatments of selection.
Simulated Author's Rebuttal
We thank the referee for their constructive comments on our manuscript. These comments have prompted us to clarify several key aspects of our argument regarding the distinction between the error statistical and formal inference approaches. Below, we address each major comment in turn.
Point-by-point responses
-
Referee: [Abstract / formal inference paragraph] Abstract and the paragraph defining the formal inference approach: the claim that the actual familywise error rate is irrelevant because 'the researcher does not report a statistical inference about the corresponding intersection null hypothesis' is load-bearing for the central thesis. In a frequentist framework the Type I error of the reported rejection is the probability that the actual data-dependent procedure rejects a true null; for two independent tests with both nulls true, selective reporting yields P(report false rejection) > 0.05 even when only one hypothesis is claimed. The manuscript provides no derivation showing why the selection step leaves the error rate of the reported inference exactly at the nominal level.
Authors: We acknowledge that this is a central claim and that the manuscript would benefit from a more explicit derivation. In the formal inference approach, the Type I error rate is attached to the specific hypothesis test that is reported, not to the data-dependent selection process that led to reporting it. The selection affects which hypothesis is tested but does not change the nominal error rate associated with the reported test's p-value. We have added a short derivation in the revised manuscript to illustrate that, under this framing, the error rate for the reported individual hypothesis remains at the nominal level because the inference concerns only that hypothesis. However, we note that this does not contradict the referee's observation about the overall probability of reporting a false rejection; rather, it reflects a difference in what is considered the relevant error rate in each approach.
Revision: partial
-
Referee: [Implications section] The section on implications for demonstrating p-hacking: the argument that only nominal rates matter in formal inference would be strengthened by an explicit comparison to post-selection inference results (e.g., the probability of reporting a rejection under the observed selection rule). Without this, the claim that 'a comparison with the actual error rate is inappropriate' remains definitional rather than demonstrated.
Authors: We agree that an explicit comparison would strengthen the argument. Post-selection inference typically adjusts the error rates to account for the selection rule, providing guarantees on the actual error rate of the selected inference. In contrast, the formal inference approach does not perform such an adjustment because it does not aim to control the error rate of the selection procedure. We have revised the implications section to include a brief discussion contrasting our view with post-selection inference methods, emphasizing that the formal inference approach treats the reported test independently of the selection mechanism.
Revision: yes
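One possible reading of the authors' position, that the error rate attached to a specific reported test stays at the nominal level, can be sketched numerically. This is an illustration under stated assumptions, not the derivation the authors say they added: both nulls are true, the two p-values are independent and uniform on [0, 1], and the quantity tracked is the rejection rate for H0,2 alone. The function name is hypothetical.

```python
import random


def rejection_rates_for_test_2(alpha=0.05, n_studies=200_000, seed=1):
    """Return (marginal, conditional) false rejection rates for H0,2:
    over all studies, and over studies where test 1 was also significant."""
    rng = random.Random(seed)
    rejections_all = 0
    n_test1_significant = 0
    rejections_given_test1 = 0
    for _ in range(n_studies):
        p1 = rng.random()  # p-value for H0,1 (assumed uniform under the null)
        p2 = rng.random()  # p-value for H0,2 (assumed uniform under the null)
        reject_2 = p2 < alpha
        rejections_all += reject_2
        if p1 < alpha:
            n_test1_significant += 1
            rejections_given_test1 += reject_2
    return (rejections_all / n_studies,
            rejections_given_test1 / n_test1_significant)


marginal, conditional = rejection_rates_for_test_2()
print(f"P(reject H0,2) = {marginal:.4f}")
print(f"P(reject H0,2 | test 1 significant) = {conditional:.4f}")
```

Under independence both estimates should sit near 0.05, which matches the nominal rate for the individual test; the referee's objection concerns a different quantity, the probability that the whole procedure reports some false rejection.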
Circularity Check
No circularity; argument rests on explicit definitional distinctions between two testing philosophies
full rationale
The paper presents a conceptual distinction: in the error statistical approach the actual familywise error rate (covering reported and unreported tests) is the relevant quantity, while in the formal inference approach only the nominal rate for the reported individual hypothesis matters because no inference is drawn about the intersection null. This follows directly from the stated definitions of each approach rather than any derivation that reduces to its own inputs. No equations, fitted parameters, self-citations, or ansatzes are invoked in a load-bearing way. The central claim is therefore self-contained as a clarification of differing error-rate relevance criteria and does not exhibit any of the enumerated circularity patterns.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption The error statistical approach evaluates error rates over the actual test procedure that includes both reported and unreported tests.
- domain assumption In the formal inference approach, statistical inferences are drawn only about individually reported hypotheses and not about intersection null hypotheses.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean, reality_from_one_distinction (tag: unclear)
Paper passage: In the formal inference approach, the 'actual' familywise error rate is irrelevant because (a) the researcher does not report a statistical inference about the corresponding intersection null hypothesis (i.e., H0,1 ∩ H0,2), and (b) the 'actual' familywise error rate does not license inferences about the reported individual hypotheses (i.e., H0,2).
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel (tag: unclear)
Paper passage: Type I error rates are based on formally reported inferences, not 'actual' test procedures.
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.