Applied Statistics Requires Scientific Context
Pith reviewed 2026-05-13 19:54 UTC · model grok-4.3
The pith
Applied statistics needs nuanced scientific context rather than any universal significance threshold.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The application and interpretation of statistical methods require careful consideration of foundational contextual issues, which include both elusive background assumptions and quantifiable features of a study area. A recent re-formulation of the p-value as a measure of divergence between observed data and modeling assumptions is used to demonstrate this role in two randomized trials. Success with low significance thresholds in genome-wide association studies and particle physics is attributed to the accompanying validity-checking gauntlets and contextual considerations rather than the thresholds themselves. Therefore, the adoption of a universal threshold should be abandoned as a goal of statistics reform.
What carries the argument
Re-formulation of the p-value as a measure of divergence between an observed dataset and the set of assumptions used to construct the statistical measure.
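The divergence reading can be made concrete with a minimal sketch. It is not taken from the paper: the estimate, standard error, and variance-inflation factor below are hypothetical, and the normal approximation is an assumption. The sketch computes a two-sided p-value for the same estimate under two different assumption sets and also reports the surprisal (S-value, -log2 p) used in the related compatibility literature.

```python
import math


def two_sided_p(estimate: float, se: float, null_value: float = 0.0) -> float:
    """Two-sided p-value under a normal approximation (an assumption here)."""
    z = (estimate - null_value) / se
    return math.erfc(abs(z) / math.sqrt(2))


def s_value(p: float) -> float:
    """Surprisal in bits: S = -log2(p)."""
    return -math.log2(p)


# Hypothetical numbers for illustration only; not from either trial.
estimate, se = 0.30, 0.12

# Same data, two assumption sets: independent observations versus a
# hypothetical variance inflation of 1.5 (for example, from clustering).
for label, inflation in (("independence assumed", 1.0),
                         ("variance inflated x1.5", 1.5)):
    p = two_sided_p(estimate, se * math.sqrt(inflation))
    print(f"{label:24s} p = {p:.4f}  S = {s_value(p):.2f} bits")
```

The point of the comparison is that the p-value moves when a background assumption changes even though the data do not, which is why it reads naturally as a measure of divergence from the whole assumption set rather than from the test hypothesis alone.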
If this is right
- Ignoring foundational context can lead to misinterpretation of results even when low p-values are obtained.
- Reform efforts in statistics should prioritize integration of domain-specific assumptions over standardization of thresholds.
- The two randomized trial examples show that different scientific contexts produce different valid interpretations of the same statistical output.
- Validity-checking procedures must be tailored to the specific assumptions of each field rather than applied uniformly.
Where Pith is reading between the lines
- Greater collaboration between statisticians and domain scientists would be needed to identify the relevant contextual assumptions for each application.
- Fields without strong pre-existing validity gauntlets might benefit from stricter (numerically lower) thresholds to reduce false positives until such checks are developed.
- This view implies that statistical education should emphasize case-by-case contextual reasoning over mastery of fixed rules.
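The threshold point in the list above can be illustrated with a standard back-of-the-envelope calculation. This is a sketch under strong assumptions that do not come from the paper: the prior share of true effects and the power are hypothetical, power is held fixed across thresholds for simplicity, and every modeling assumption behind each test is taken to hold exactly.

```python
def false_positive_share(alpha: float, power: float, prior_true: float) -> float:
    """Expected share of 'significant' results that are false positives.

    Assumes each test is valid, so alpha is the true false-positive rate.
    """
    false_pos = alpha * (1.0 - prior_true)   # true nulls that cross the threshold
    true_pos = power * prior_true            # real effects that cross the threshold
    return false_pos / (false_pos + true_pos)


# Hypothetical inputs: 10% of tested hypotheses are real effects, 80% power.
for alpha in (0.05, 0.005, 5e-8):
    share = false_positive_share(alpha, power=0.8, prior_true=0.1)
    print(f"alpha = {alpha:<8g} false-positive share = {share:.2e}")
```

The calculation only captures the arithmetic of thresholds. It says nothing about violated assumptions, which is the part the paper argues the validity-checking gauntlets address.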
Load-bearing premise
The success of low significance thresholds in genome-wide association studies and particle physics stems primarily from the accompanying validity-checking gauntlets rather than from the thresholds themselves.
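For orientation, the thresholds at issue can be put on a common scale. The specific values below are the conventional ones (a two-sided p of 5e-8 for genome-wide significance and a one-sided 5-sigma criterion for discovery in particle physics). They are not stated in the text above and are included here as an assumption about which thresholds are meant.

```python
import math
from statistics import NormalDist


def one_sided_p_from_sigma(n_sigma: float) -> float:
    """Upper-tail probability of a standard normal beyond n_sigma."""
    return 0.5 * math.erfc(n_sigma / math.sqrt(2))


def sigma_from_two_sided_p(p: float) -> float:
    """Absolute z-score whose two-sided tail area equals p."""
    return -NormalDist().inv_cdf(p / 2)


print(f"5 sigma, one-sided p:   {one_sided_p_from_sigma(5.0):.3g}")
print(f"p = 5e-8 (two-sided):   {sigma_from_two_sided_p(5e-8):.2f} sigma")
print(f"p = 0.05 (two-sided):   {sigma_from_two_sided_p(0.05):.2f} sigma")
```

Both conventions sit above five standard deviations, far from the familiar 0.05 cutoff, which is the contrast the premise relies on.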
What would settle it
Finding a new scientific domain that achieves reliable discoveries with low significance thresholds while lacking extensive validity-checking procedures would challenge the claim that context and checks, not the thresholds, drive success.
Original abstract
Statistical methods are indispensable to scientific inference. However, there exists a longstanding tension across a wide range of scientific disciplines about the role that "context" should play in the application of statistical methods and the interpretation of statistical results. Though frequently invoked, the notion of "scientific context" refers to at least two distinct concepts: a set of foundational nuanced and elusive background assumptions and substantive features of a given area of study that shape the validity and reliability of statistical methods; and more quantifiable contextual issues that affect the performance of statistical methods and interpretation of statistical results. I argue here that the application and interpretation of statistical methods requires careful consideration of foundational contextual issues. To motivate the arguments, I review a recent re-formulation of the p-value as a measure of divergence between an observed dataset and a set of assumptions used to construct statistical measures. I use this framework to illustrate the role that context plays in two randomized trials: on low-dose aspirin for pregnancy loss, and a new inhibitor of a key biochemical pathway affecting ankylosing spondylitis. Finally, I note that the adoption of low significance thresholds in genome-wide association studies and high energy particle physics has been successful more so because of extensive validity-checking gauntlets and contextual considerations that have accompanied these low thresholds, not because of the low thresholds themselves. I use these illustrations and arguments to suggest that (i) the adoption of a universal threshold for significance testing should be abandoned as a goal of statistics reform; and (ii) the validity and optimal use of applied statistical tools requires careful consideration of nuanced scientific context.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript argues that applied statistical methods require careful consideration of scientific context—both foundational background assumptions and quantifiable features—for valid application and interpretation. It reviews a recent reformulation of the p-value as a divergence measure between data and assumptions, uses this to analyze two randomized trials (low-dose aspirin for pregnancy loss and a biochemical inhibitor for ankylosing spondylitis), and claims that the success of low significance thresholds in GWAS and particle physics arises primarily from accompanying validity-checking gauntlets rather than the thresholds themselves. On this basis, it recommends abandoning universal significance thresholds as a goal of statistics reform and prioritizing context-specific use of tools.
Significance. If the arguments hold, the paper could usefully redirect statistics reform discussions away from fixed thresholds toward context-aware practices, with potential benefits for reliability in applied fields. The concrete trial examples provide clear illustrations of how context shapes interpretation, and the emphasis on validity-checking procedures in high-stakes domains is a constructive observation.
major comments (2)
- [Abstract and GWAS/particle physics discussion] Abstract and the section discussing GWAS/particle physics: The claim that low thresholds succeeded 'more so because of extensive validity-checking gauntlets and contextual considerations that have accompanied these low thresholds, not because of the low thresholds themselves' is load-bearing for the central recommendation to abandon universal thresholds. No quantitative decomposition, counterfactual, or separating evidence is supplied to isolate the contribution of the threshold value from the gauntlets; the two randomized-trial examples show context affects interpretation but do not address this attribution.
- [p-value reformulation review] Section reviewing the p-value reformulation: The framework is invoked to illustrate context's role, yet the manuscript provides no formal derivation, simulation study, or direct comparison within the paper to demonstrate how the divergence measure alters conclusions relative to standard p-value usage in the cited trials.
minor comments (2)
- [Abstract] The abstract introduces two distinct concepts of 'context' but does not explicitly label or separate them in the subsequent trial analyses, which could improve clarity for readers.
- No tables or figures are referenced in the provided text; if any are present, ensure they directly support the trial interpretations or the GWAS contrast.
Simulated Author's Rebuttal
We thank the referee for these constructive comments. We address each major point below with planned revisions where feasible.
- Referee: [Abstract and GWAS/particle physics discussion] The claim that low thresholds succeeded 'more so because of extensive validity-checking gauntlets and contextual considerations that have accompanied these low thresholds, not because of the low thresholds themselves' is load-bearing for the central recommendation to abandon universal thresholds. No quantitative decomposition, counterfactual, or separating evidence is supplied to isolate the contribution of the threshold value from the gauntlets; the two randomized-trial examples show context affects interpretation but do not address this attribution.
  Authors: We acknowledge that the manuscript supplies no quantitative decomposition or counterfactual to isolate the threshold value from the validity-checking procedures. The claim rests on historical observation: GWAS and particle physics apply low thresholds exclusively within integrated validation frameworks, and we argue this integration, rather than the threshold alone, drives reliability. The trial examples illustrate context's general role in interpretation but do not quantify the attribution. In revision we will add a paragraph clarifying the evidence as observational and historical, explicitly noting the absence of counterfactual analysis as a limitation while retaining the recommendation to prioritize context-specific practices.
  Revision: partial
- Referee: [p-value reformulation review] Section reviewing the p-value reformulation: The framework is invoked to illustrate context's role, yet the manuscript provides no formal derivation, simulation study, or direct comparison within the paper to demonstrate how the divergence measure alters conclusions relative to standard p-value usage in the cited trials.
  Authors: The section reviews an existing reformulation from prior literature to supply a conceptual lens for discussing context; no new derivation is offered. To make the illustration more concrete, we will add a brief simulation or side-by-side comparison in the revised manuscript that applies both the standard p-value and the divergence measure to the cited trial data, highlighting how contextual assumptions change conclusions.
  Revision: yes
- No quantitative decomposition or counterfactual evidence is supplied to isolate the contribution of low thresholds from validity-checking gauntlets in the success of GWAS and particle physics.
Circularity Check
No significant circularity; conceptual argument is self-contained
The paper advances a conceptual position that scientific context must inform statistical application and that universal significance thresholds should be abandoned. It motivates this via a reviewed p-value reformulation (treated as external input) and two trial illustrations plus cross-field examples. No equations, fitted quantities, or self-referential definitions appear; the central claims do not reduce to their own inputs by construction. External examples supply the evidentiary load rather than any tautological restatement or self-citation chain. The derivation chain therefore remains independent of the paper's own outputs.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Statistical methods depend on foundational, nuanced background assumptions specific to each scientific domain.