p-Hacking Inflates Type I Error Rates in the Error Statistical Approach but not in the Formal Inference Approach

Mark Rubin

arxiv: 2602.21792 · v3 · pith:HLG6RPUJnew · submitted 2026-02-25 · 📊 stat.OT

Pith reviewed 2026-05-22 11:24 UTC · model grok-4.3

classification 📊 stat.OT

keywords: p-hacking · Type I error · familywise error rate · error statistical approach · formal inference approach · significance testing · multiple testing

The pith

P-hacking inflates Type I error rates in the error statistical approach but not in the formal inference approach.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper distinguishes two ways of thinking about significance testing when researchers run several tests and report only the ones that look significant. In the error statistical approach, the actual familywise error rate covers every test performed, reported or not, so selective reporting pushes the actual error rate above the nominal level (for example, 0.05). In the formal inference approach, only the nominal rate for each reported test counts, because no claim is made about the joint null hypothesis that covers all tests at once. A reader should care because the practical force of p-hacking accusations then depends on which philosophy of testing one accepts.

Core claim

In the error statistical approach, the actual familywise error rate is the relevant quantity because it applies to the complete test procedure, which includes both reported and unreported tests. For two independent tests, each at alpha = 0.05, this rate is 1 - (1 - 0.05)^2 = 0.0975 (reported as 0.098), which exceeds the nominal 0.05 and therefore constitutes Type I error inflation. In the formal inference approach, the actual familywise error rate is irrelevant because the researcher reports no statistical inference about the intersection null hypothesis, and because the actual rate does not license inferences about the individual reported hypotheses; only the nominal error rate remains pertinent.
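
The arithmetic in the core claim can be checked with a minimal sketch. The function name `familywise_error_rate` is an illustrative choice, not something the paper defines, and the formula assumes independent tests with every null hypothesis true.

```python
def familywise_error_rate(alpha: float, num_tests: int) -> float:
    """Probability of at least one false rejection among `num_tests`
    independent tests when every null hypothesis is true."""
    return 1.0 - (1.0 - alpha) ** num_tests


# Two independent tests, each at alpha = 0.05, as in the paper's example.
actual_rate = familywise_error_rate(alpha=0.05, num_tests=2)
print(round(actual_rate, 4))  # 0.0975, which the paper rounds to 0.098
```

With a single test the function returns the nominal 0.05, so the gap between 0.0975 and 0.05 comes entirely from counting the second, possibly unreported, test.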

What carries the argument

The contrast between the actual familywise error rate, which tracks the full set of tests actually performed, and the nominal error rate attached only to the hypotheses that are reported.

If this is right

Under the error statistical approach any demonstration of p-hacking automatically demonstrates inflation of the actual Type I error rate.
Under the formal inference approach p-hacking leaves the nominal error rate for each reported test unchanged and therefore does not invalidate the reported inferences.
Methods for reducing p-hacking must be justified separately for each approach rather than assumed to apply equally.
Conceptual discussions of p-hacking need to specify whether the actual or the nominal error rate is under discussion.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Debates about p-hacking may hinge on whether researchers implicitly treat their tests as a single joint procedure even when they do not say so.
Journals could require authors to declare which philosophy of testing they follow so that readers can judge whether error-rate complaints apply.
Empirical work could test how often published papers make claims that effectively concern an intersection null without acknowledging it.
The distinction may extend to other selective-reporting practices such as optional stopping or covariate selection.

Load-bearing premise

That researchers following the formal inference approach never report a statistical inference about the intersection null hypothesis covering all the tests they ran.

What would settle it

An examination of published papers that would show whether authors routinely state conclusions about the joint null when they have conducted multiple tests without correction.

Original abstract

p-hacking occurs when researchers conduct multiple significance tests (e.g., p1;H0,1 and p2;H0,2) and then selectively report tests that yield desirable (usually significant) results (e.g., p2 < 0.05;H0,2) without correcting for multiple testing (e.g., 0.05/2 = 0.025). In the present article, I consider p-hacking in the context of two philosophies of significance testing: the error statistical approach and the formal inference approach. I argue that although p-hacking inflates Type I error rates in the error statistical approach, it does not inflate them in the formal inference approach. Specifically, in the error statistical approach, the "actual" familywise error rate (e.g., 1 - [1 - 0.05]^2 = 0.098 for two independent tests) is relevant because it covers both the reported and unreported tests in the "actual" test procedure (i.e., p1;H0,1 and p2;H0,2). In this approach, Type I error rate inflation occurs because the "actual" error rate (0.098) is higher than the nominal error rate (0.05). In contrast, in the formal inference approach, the "actual" familywise error rate is irrelevant because (a) the researcher does not report a statistical inference about the corresponding intersection null hypothesis (i.e., H0,1 & H0,2), and (b) the "actual" familywise error rate does not license inferences about the reported individual hypotheses (i.e., H0,2). Instead, in the formal inference approach, only the nominal error rate is relevant, and a comparison with the "actual" error rate is inappropriate. Implications for conceptualizing, demonstrating, and reducing p-hacking are discussed.
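
The abstract's correction example (0.05/2 = 0.025) is a Bonferroni adjustment. The sketch below shows what that adjustment does to the familywise rate for two independent tests. The function names are hypothetical, and independence and true nulls are assumptions of the sketch rather than claims taken from the paper.

```python
def bonferroni_alpha(nominal_alpha: float, num_tests: int) -> float:
    """Per-test threshold after a Bonferroni correction."""
    return nominal_alpha / num_tests


def familywise_error_rate(per_test_alpha: float, num_tests: int) -> float:
    """Chance of at least one false rejection among independent tests
    when every null hypothesis is true."""
    return 1.0 - (1.0 - per_test_alpha) ** num_tests


adjusted_alpha = bonferroni_alpha(nominal_alpha=0.05, num_tests=2)
corrected_rate = familywise_error_rate(adjusted_alpha, num_tests=2)

print(adjusted_alpha)            # 0.025
print(round(corrected_rate, 6))  # 0.049375, at or below the nominal 0.05
```

In error statistical terms, the correction brings the rate of the full two-test procedure back under 0.05. Whether that comparison is meaningful at all is the point on which the two approaches disagree.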

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper claims p-hacking inflates Type I error only under the error statistical view because that view tracks the full actual procedure, while the formal inference view cares only about the nominal rate on the reported test.

The key point is that Rubin separates two philosophies of significance testing and concludes that p-hacking raises actual error rates in one but not the other. Under the error statistical approach the familywise rate across every test that was run matters, so selective reporting pushes the rate above the nominal 0.05. Under the formal inference approach only the reported individual hypothesis is claimed, so the actual rate across the selection process is set aside and the nominal level is treated as sufficient.

Referee Report

2 major / 2 minor

Summary. The manuscript distinguishes two philosophies of significance testing and claims that p-hacking inflates Type I error rates in the error statistical approach (because the actual familywise error rate of the full procedure, including unreported tests, exceeds the nominal rate) but not in the formal inference approach (because only the nominal rate for the reported individual hypothesis matters, as the researcher draws no inference about the intersection null).

Significance. If the distinction is sustained, the paper offers a philosophically grounded way to conceptualize p-hacking that could inform statistical education and practice by suggesting that concerns about error-rate inflation are approach-specific rather than universal. The explicit contrast between actual and nominal error rates is a clear contribution, though its force depends on whether the formal-inference framing is accepted as standard frequentist practice.

major comments (2)

[Abstract / formal inference paragraph] Abstract and the paragraph defining the formal inference approach: the claim that the actual familywise error rate is irrelevant because 'the researcher does not report a statistical inference about the corresponding intersection null hypothesis' is load-bearing for the central thesis. In a frequentist framework the Type I error of the reported rejection is the probability that the actual data-dependent procedure rejects a true null; for two independent tests with both nulls true, selective reporting yields P(report false rejection) > 0.05 even when only one hypothesis is claimed. The manuscript provides no derivation showing why the selection step leaves the error rate of the reported inference exactly at the nominal level.
[Implications section] The section on implications for demonstrating p-hacking: the argument that only nominal rates matter in formal inference would be strengthened by an explicit comparison to post-selection inference results (e.g., the probability of reporting a rejection under the observed selection rule). Without this, the claim that 'a comparison with the actual error rate is inappropriate' remains definitional rather than demonstrated.
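
The quantity the first major comment refers to can be illustrated with a small Monte Carlo sketch. It assumes two independent tests with both nulls true, so each p-value is uniform on [0, 1], and it uses a hypothetical function name `simulate_selective_reporting`. It is an illustration of the referee's point, not a derivation from the manuscript.

```python
import random


def simulate_selective_reporting(num_studies: int, alpha: float = 0.05, seed: int = 0):
    """Simulate studies with two independent tests whose nulls are both true.

    Under a true null a p-value is uniform on [0, 1], so each p-value is
    drawn with rng.random(). Returns two proportions:
      - procedure_rate: studies in which at least one test is significant,
        so a false rejection is available to report;
      - marginal_rate: studies in which test 2, taken alone, is significant.
    """
    rng = random.Random(seed)
    any_significant = 0
    test2_significant = 0
    for _ in range(num_studies):
        p1, p2 = rng.random(), rng.random()
        if p1 < alpha or p2 < alpha:
            any_significant += 1
        if p2 < alpha:
            test2_significant += 1
    return any_significant / num_studies, test2_significant / num_studies


procedure_rate, marginal_rate = simulate_selective_reporting(num_studies=200_000)
print(f"procedure-level rate: {procedure_rate:.3f}")
print(f"marginal rate for test 2: {marginal_rate:.3f}")
```

The procedure-level proportion should land near 0.0975 and the marginal proportion near 0.05. The two approaches differ on which of these numbers counts as the Type I error rate of the reported inference.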

minor comments (2)

[Abstract] Notation such as 'p1;H0,1' and 'p2;H0,2' is introduced without a clear definition or example; a small illustrative table showing the actual versus nominal rates for n=2 independent tests would improve readability.
[References / discussion] The manuscript would benefit from citing the selective-inference and post-selection literature (e.g., works on conditional inference after data-dependent testing) to locate the formal-inference position relative to existing frequentist treatments of selection.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments on our manuscript. These comments have prompted us to clarify several key aspects of our argument regarding the distinction between the error statistical and formal inference approaches. Below, we address each major comment in turn.

Referee: [Abstract / formal inference paragraph] Abstract and the paragraph defining the formal inference approach: the claim that the actual familywise error rate is irrelevant because 'the researcher does not report a statistical inference about the corresponding intersection null hypothesis' is load-bearing for the central thesis. In a frequentist framework the Type I error of the reported rejection is the probability that the actual data-dependent procedure rejects a true null; for two independent tests with both nulls true, selective reporting yields P(report false rejection) > 0.05 even when only one hypothesis is claimed. The manuscript provides no derivation showing why the selection step leaves the error rate of the reported inference exactly at the nominal level.

Authors: We acknowledge that this is a central claim and that the manuscript would benefit from a more explicit derivation. In the formal inference approach, the Type I error rate is attached to the specific hypothesis test that is reported, not to the data-dependent selection process that led to reporting it. The selection affects which hypothesis is tested but does not change the nominal error rate associated with the reported test's p-value. We have added a short derivation in the revised manuscript to illustrate that, under this framing, the error rate for the reported individual hypothesis remains at the nominal level because the inference concerns only that hypothesis. However, we note that this does not contradict the referee's observation about the overall probability of reporting a false rejection; rather, it reflects a difference in what is considered the relevant error rate in each approach. revision: partial
Referee: [Implications section] The section on implications for demonstrating p-hacking: the argument that only nominal rates matter in formal inference would be strengthened by an explicit comparison to post-selection inference results (e.g., the probability of reporting a rejection under the observed selection rule). Without this, the claim that 'a comparison with the actual error rate is inappropriate' remains definitional rather than demonstrated.

Authors: We agree that an explicit comparison would strengthen the argument. Post-selection inference typically adjusts the error rates to account for the selection rule, providing guarantees on the actual error rate of the selected inference. In contrast, the formal inference approach does not perform such an adjustment because it does not aim to control the error rate of the selection procedure. We have revised the implications section to include a brief discussion contrasting our view with post-selection inference methods, emphasizing that the formal inference approach treats the reported test independently of the selection mechanism. revision: yes
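
The contrast drawn in this exchange can be written out for the two-test case. The derivation below is a sketch under assumptions that do not come from the paper: both nulls are true, the tests are independent, and the selection rule is "report whichever test is significant". A simple Šidák-style threshold adjustment stands in for the more general post-selection methods the referee mentions. The `align*` environment requires `amsmath`.

```latex
% Assumptions (illustrative, not from the paper): both nulls true,
% independent tests, p-values uniform on [0, 1].
\begin{align*}
\Pr(p_2 < \alpha) &= \alpha = 0.05
  && \text{(nominal rate of the reported test)} \\
\Pr\bigl(\min(p_1, p_2) < \alpha\bigr) &= 1 - (1 - \alpha)^2 = 0.0975
  && \text{(rate of the full procedure)} \\
1 - (1 - \alpha^{*})^2 = \alpha
  \;&\Longrightarrow\; \alpha^{*} = 1 - \sqrt{1 - \alpha} \approx 0.0253
  && \text{(per-test threshold that restores 0.05)}
\end{align*}
```

The first line is the rate the formal inference approach treats as relevant, the second is the rate the error statistical approach treats as relevant, and the third shows the kind of adjustment that controls the second rate.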

Circularity Check

0 steps flagged

No circularity; argument rests on explicit definitional distinctions between two testing philosophies

The paper presents a conceptual distinction: in the error statistical approach the actual familywise error rate (covering reported and unreported tests) is the relevant quantity, while in the formal inference approach only the nominal rate for the reported individual hypothesis matters because no inference is drawn about the intersection null. This follows directly from the stated definitions of each approach rather than any derivation that reduces to its own inputs. No equations, fitted parameters, self-citations, or ansatzes are invoked in a load-bearing way. The central claim is therefore self-contained as a clarification of differing error-rate relevance criteria and does not exhibit any of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim depends on background acceptance of two distinct philosophies of significance testing and on the premise that only reported inferences matter in the formal approach.

axioms (2)

domain assumption: The error statistical approach evaluates error rates over the actual test procedure that includes both reported and unreported tests.
This premise is invoked to explain why the actual familywise error rate becomes relevant under p-hacking.
domain assumption: In the formal inference approach, statistical inferences are drawn only about individually reported hypotheses and not about intersection null hypotheses.
This premise is used to argue that the actual familywise error rate does not affect the validity of reported inferences.

pith-pipeline@v0.9.0 · 5884 in / 1408 out tokens · 67380 ms · 2026-05-22T11:24:08.109430+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: In the formal inference approach, the 'actual' familywise error rate is irrelevant because (a) the researcher does not report a statistical inference about the corresponding intersection null hypothesis (i.e., H0,1 ∩ H0,2), and (b) the 'actual' familywise error rate does not license inferences about the reported individual hypotheses (i.e., H0,2).

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: Type I error rates are based on formally reported inferences, not 'actual' test procedures.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.