On the probability of a causal inference is robust for internal validity
Pith reviewed 2026-05-25 19:02 UTC · model grok-4.3
The pith
The PIV is the probability that a null hypothesis rejection on observed data persists after adding counterfactual outcomes, serving as a robustness index for causal inferences.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper establishes that the PIV functions as a robustness index. The PIV is defined as the probability of rejecting the null hypothesis again on the observed sample combined with the counterfactuals, given that the same null was already rejected on the observed sample alone. Under either a frequentist or a Bayesian framework, the PIV of an inference can be bounded from the analyst's bounded beliefs about the counterfactuals, which is useful when the unconfoundedness assumption is dubious. The PIV is equivalent to statistical power when the null hypothesis statistical testing (NHST) is considered to be based on both the observed sample and the counterfactuals.
What carries the argument
The PIV itself: the conditional probability that a rejection of the null hypothesis on the observed sample alone persists once the counterfactual outcomes are folded in as an additional sample.
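In symbols, the definition can be sketched as follows. The notation is introduced here for illustration and is not taken from the paper: $D^{ob}$ is the observed sample, $D^{un}$ the unobserved counterfactual sample, $T$ the test statistic, and $c_\alpha$ the critical value of a one-sided level-$\alpha$ test.

```latex
\mathrm{PIV}
  \;=\;
  \Pr\!\big(\, T(D^{ob} \cup D^{un}) > c_\alpha \;\big|\; T(D^{ob}) > c_\alpha \,\big)
```

Read this way, the conditioning event is the rejection already obtained on the observed data, and the event of interest is the rejection that would be obtained if the counterfactuals were also available.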
If this is right
- A researcher can place numerical bounds on the robustness of a causal claim without observing the counterfactuals, by stating bounds on beliefs about them.
- When the test is viewed as already incorporating counterfactuals, the PIV reduces exactly to ordinary statistical power.
- The eight-step procedure supplies a concrete workflow for evaluating internal validity of any observational causal inference.
- The same bounding logic applies under both frequentist and Bayesian interpretations of the test.
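The link between belief bounds and power can be illustrated with a toy calculation. The sketch below is not the paper's procedure: it assumes a one-sided two-sample z-test with known, equal variances, and the function names (`power_one_sided_z`, `power_bounds`) are hypothetical. It only shows that, in this simple setting, power is monotone in the effect size, so an interval of beliefs about the effect maps to an interval of power values.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal, used for quantiles and the CDF


def power_one_sided_z(delta, n_per_arm, sigma=1.0, alpha=0.05):
    """Power of a one-sided two-sample z-test with known, equal variances."""
    se = sigma * sqrt(2.0 / n_per_arm)         # standard error of the mean difference
    z_crit = STD_NORMAL.inv_cdf(1.0 - alpha)   # rejection threshold for the z statistic
    return STD_NORMAL.cdf(delta / se - z_crit)


def power_bounds(delta_lo, delta_hi, n_per_arm, sigma=1.0, alpha=0.05):
    """Map a belief interval on the effect to an interval on power.

    Power is increasing in delta for this test, so the interval endpoints
    map directly to the lower and upper power values.
    """
    return (
        power_one_sided_z(delta_lo, n_per_arm, sigma, alpha),
        power_one_sided_z(delta_hi, n_per_arm, sigma, alpha),
    )


# Illustrative belief interval: the effect in the full sample lies in [0.1, 0.4].
lower, upper = power_bounds(0.1, 0.4, n_per_arm=100)
print(f"power lies between {lower:.3f} and {upper:.3f}")
```

The monotonicity that makes the endpoint mapping valid is specific to this toy test; other tests or belief sets would need the extremes to be searched rather than read off the endpoints.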
Where Pith is reading between the lines
- The PIV framing could be applied to missing-data problems outside causal inference by treating the missing cases as the counterfactual sample.
- One could test the bounding procedure by generating data with known counterfactual distributions and checking whether the derived bounds contain the realized rejection probability.
- The approach supplies a probabilistic alternative to deterministic sensitivity analyses that vary one parameter at a time.
- Integration with existing software for power analysis might allow direct computation of PIV bounds once belief intervals on counterfactual means or variances are supplied.
Load-bearing premise
Counterfactual outcomes can be treated as an additional sample whose influence on the test statistic permits probabilistic bounding from subjective beliefs about those outcomes.
What would settle it
A simulation in which the true distribution of counterfactual outcomes is fully known, the actual frequency of continued null rejection with the full data is computed, and this frequency is checked against the interval obtained from the paper's bounding procedure applied to deliberately limited beliefs.
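A minimal sketch of such a simulation follows. Everything in it is an assumption made for illustration rather than the paper's bounding procedure: unit-variance normal outcomes, independent potential outcomes, a one-sided z-test at the 5% level, and the hypothetical function name `simulate_piv`. It estimates the frequency of continued rejection on the full ("ideal") sample among replications that rejected on the observed sample, then compares the value at a known counterfactual effect with the values at the two ends of a belief interval.

```python
import random
from math import sqrt
from statistics import NormalDist

Z_CRIT = NormalDist().inv_cdf(0.95)  # one-sided 5% critical value


def simulate_piv(delta_ob, delta_un, n_t=50, n_c=50, reps=20000, seed=0):
    """Monte Carlo estimate of P(reject on ideal sample | reject on observed sample).

    delta_ob: treatment effect in the observed sample (toy assumption).
    delta_un: treatment effect in the counterfactual sample (toy assumption).
    """
    rng = random.Random(seed)
    n = n_t + n_c
    rejected_ob = 0
    rejected_both = 0
    for _ in range(reps):
        # Observed arm means: the mean of k draws from N(mu, 1) is N(mu, 1/k).
        yt_ob = rng.gauss(delta_ob, 1 / sqrt(n_t))
        yc_ob = rng.gauss(0.0, 1 / sqrt(n_c))
        z_ob = (yt_ob - yc_ob) / sqrt(1 / n_t + 1 / n_c)
        if z_ob <= Z_CRIT:
            continue  # condition on rejection in the observed sample
        rejected_ob += 1
        # Counterfactual means: treated outcomes for the n_c controls,
        # control outcomes for the n_t treated units.
        yt_un = rng.gauss(delta_un, 1 / sqrt(n_c))
        yc_un = rng.gauss(0.0, 1 / sqrt(n_t))
        # Ideal sample: every unit contributes both potential outcomes.
        yt_id = (n_t * yt_ob + n_c * yt_un) / n
        yc_id = (n_c * yc_ob + n_t * yc_un) / n
        z_id = (yt_id - yc_id) / sqrt(2 / n)
        if z_id > Z_CRIT:
            rejected_both += 1
    return rejected_both / rejected_ob if rejected_ob else float("nan")


# Belief interval on the counterfactual effect: [0.0, 0.4]; the "true" value is 0.2.
piv_lo = simulate_piv(0.5, delta_un=0.0)
piv_true = simulate_piv(0.5, delta_un=0.2)
piv_hi = simulate_piv(0.5, delta_un=0.4)
print(piv_lo, piv_true, piv_hi)
```

Because the three runs share a seed, they use the same underlying random draws, so the ordering of the three estimates reflects the effect of `delta_un` rather than Monte Carlo noise. A real check of the paper's procedure would replace the endpoint evaluation with the bounds the paper derives.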
Original abstract
The internal validity of observational study is often subject to debate. In this study, we define the counterfactuals as the unobserved sample and intend to quantify its relationship with the null hypothesis statistical testing (NHST). We propose the probability of a causal inference is robust for internal validity, i.e., the PIV, as a robustness index of causal inference. Formally, the PIV is the probability of rejecting the null hypothesis again based on both the observed sample and the counterfactuals, provided the same null hypothesis has already been rejected based on the observed sample. Under either frequentist or Bayesian framework, one can bound the PIV of an inference based on his bounded belief about the counterfactuals, which is often needed when the unconfoundedness assumption is dubious. The PIV is equivalent to statistical power when the NHST is thought to be based on both the observed sample and the counterfactuals. We summarize the process of evaluating internal validity with the PIV into an eight-step procedure and illustrate it with an empirical example (i.e., Hong and Raudenbush (2005)).
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes the “probability of a causal inference is robust for internal validity” (PIV) as a robustness index for causal inferences from observational data. Formally, the PIV is defined as the conditional probability of rejecting the null hypothesis H0 again when the test is based on both the observed sample and the counterfactual outcomes, given that H0 was already rejected based on the observed sample alone. The authors claim that the PIV can be bounded using subjective beliefs about the counterfactuals under either a frequentist or a Bayesian framework, that it is equivalent to statistical power when the test incorporates both samples, and that an eight-step procedure can be used to evaluate internal validity, illustrated with the Hong and Raudenbush (2005) example.
Significance. If the central definition were rigorously grounded, the PIV could provide a quantitative index for sensitivity to unconfoundedness violations. The manuscript offers no machine-checked proofs, reproducible code, parameter-free derivations, or falsifiable predictions; the contribution rests entirely on the conceptual proposal and the eight-step procedure.
major comments (3)
- [Abstract] Abstract (formal definition of PIV): The quantity P(reject H0 on observed+counterfactuals | reject H0 on observed) is not formally defined in the frequentist NHST framework employed for the original test. Counterfactual outcomes are fixed but unobserved; without an explicit joint distribution or worst-case measure over their values, the conditional probability and its bounds cannot be derived.
- [Abstract] Abstract (equivalence claim): The stated equivalence of PIV to statistical power when the NHST is 'thought to be based on both' inherits the same definitional ambiguity, because power is defined with respect to a sampling distribution under the null or alternative, not over fixed potential outcomes.
- [Abstract] Abstract and eight-step procedure: All numerical illustrations and the robustness index rest on the load-bearing assumption that counterfactuals can be treated as an additional sample whose distribution is subjectively bounded to compute a conditional rejection probability; this assumption is not justified within standard frequentist potential-outcomes theory.
minor comments (2)
- [Abstract] The abstract refers to 'bounded belief about the counterfactuals' without specifying how this belief is translated into numerical bounds on the rejection probability.
- Notation for counterfactual outcomes should be introduced explicitly and distinguished from random variables to avoid conflating fixed potential outcomes with a probability space.
Simulated Author's Rebuttal
We thank the referee for their constructive comments on our manuscript proposing the PIV as a robustness measure for causal inferences. We address each of the major comments point by point below. While we maintain that the conceptual contribution is valuable for sensitivity analysis in observational studies, we acknowledge the need for greater formal rigor in some aspects and will revise accordingly where appropriate.
Point-by-point responses
- Referee: [Abstract] Abstract (formal definition of PIV): The quantity P(reject H0 on observed+counterfactuals | reject H0 on observed) is not formally defined in the frequentist NHST framework employed for the original test. Counterfactual outcomes are fixed but unobserved; without an explicit joint distribution or worst-case measure over their values, the conditional probability and its bounds cannot be derived.
  Authors: We agree that a more precise formalization is needed. In the revised manuscript, we will define PIV more rigorously by specifying that the bounds on counterfactual outcomes are incorporated via a set of possible distributions or values consistent with the analyst's beliefs, and the conditional probability is computed as the infimum or range over these possibilities. This treats the counterfactuals as fixed but unknown, with the probability arising from the sampling distribution of the observed data conditional on the bounds. This is an extension of standard NHST to include sensitivity parameters.
  Revision: yes
- Referee: [Abstract] Abstract (equivalence claim): The stated equivalence of PIV to statistical power when the NHST is 'thought to be based on both' inherits the same definitional ambiguity, because power is defined with respect to a sampling distribution under the null or alternative, not over fixed potential outcomes.
  Authors: The equivalence is conceptual: PIV represents the power of a test that would be conducted if the counterfactual outcomes were observed, but since they are not, we bound it using beliefs about them. We will revise the manuscript to clarify that it is not a direct equivalence but an analogy to power under the extended sample, and remove any implication of strict mathematical equivalence without the additional bounding framework.
  Revision: yes
- Referee: [Abstract] Abstract and eight-step procedure: All numerical illustrations and the robustness index rest on the load-bearing assumption that counterfactuals can be treated as an additional sample whose distribution is subjectively bounded to compute a conditional rejection probability; this assumption is not justified within standard frequentist potential-outcomes theory.
  Authors: This is the core of our proposal, which intentionally extends beyond standard theory to provide a practical tool for assessing internal validity when unconfoundedness may be violated. Similar to other sensitivity analyses (e.g., those using Rosenbaum's bounds or partial identification), we allow subjective input on counterfactual distributions. The eight-step procedure makes these assumptions explicit and falsifiable by the reader. We do not claim justification within unmodified standard theory but as a new robustness index; thus, no change to this aspect is planned.
  Revision: no
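The first response describes computing the conditional probability as an infimum or range over the counterfactual values consistent with the analyst's beliefs. One way to write that is sketched below. The notation is assumed here and does not appear in the source: $\Theta_B$ is the set of counterfactual distributions (or values) the analyst considers plausible, $\theta$ indexes that set, $D^{ob}$ and $D^{un}$ are the observed and counterfactual samples, $T$ is the test statistic, and $c_\alpha$ is the critical value.

```latex
\underline{\mathrm{PIV}}
  = \inf_{\theta \in \Theta_B}
    \Pr\nolimits_{\theta}\!\big( T(D^{ob} \cup D^{un}) > c_\alpha \;\big|\; T(D^{ob}) > c_\alpha \big),
\qquad
\overline{\mathrm{PIV}}
  = \sup_{\theta \in \Theta_B}
    \Pr\nolimits_{\theta}\!\big( T(D^{ob} \cup D^{un}) > c_\alpha \;\big|\; T(D^{ob}) > c_\alpha \big)
```

Under this reading, the referee's objection becomes a question of what probability measure $\Pr_\theta$ refers to once the counterfactuals are treated as fixed, which the responses leave to the revised manuscript.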
Circularity Check
No circularity in PIV definition or bounding claim.
full rationale
The paper introduces PIV via an explicit formal definition as the conditional probability P(reject H0 on observed+counterfactuals | reject H0 on observed). It states that bounds follow from subjective beliefs about counterfactuals under frequentist or Bayesian views and notes an equivalence to power under a joint-sample interpretation. No quoted equations, procedures, or self-citations reduce this definition or the bounding claim to a fitted parameter, prior self-result, or input by construction. The eight-step procedure and empirical illustration rest on the definitional proposal itself rather than any hidden reduction, so the derivation chain is self-contained.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Counterfactuals can be treated as an unobserved sample for the purposes of null hypothesis statistical testing.
invented entities (1)
- PIV: no independent evidence
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean, reality_from_one_distinction (unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: PIV is the probability of rejecting the null hypothesis again based on both the observed sample and the counterfactuals, provided the same null hypothesis has already been rejected based on the observed sample.
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel (unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: Theorem 2: ... probit(PIV) = f(Y_un_t, Y_un_c) ...
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1] Alexander, Karl L., Doris R. Entwisle, and Susan L. Dauber. 2003. On the success of failure: A reassessment of the effects of retention in the primary school grades. New York, NY: Cambridge University Press.
- [2] Allen, Chiharu S., Qi Chen, Victor L. Willson, and Jan N. Hughes. 2009. “Quality of research design moderates effects of grade retention on achievement: A meta-analytic, multilevel analysis.” Educational Evaluation and Policy Analysis 31(4): 480-499.
- [3] Boos, Dennis D., and Leonard A. Stefanski. 2011. “P-value precision and reproducibility.” The American Statistician 65(4): 213-221.
- [4] Cohen, Jacob. 1988. Statistical power analysis for the behavioral sciences. Hillsdale, NJ: Lawrence Erlbaum Associates.
- [5] Cohen, Jacob. 1992. “A power primer.” Psychological Bulletin 112(1): 155-159.
- [6] Copas, John B., and H. G. Li. 1997. “Inference for non-random samples.” Journal of the Royal Statistical Society: Series B (Statistical Methodology) 59(1): 55-95.
- [7] Diaconis, Persi, and Donald Ylvisaker. 1979. “Conjugate priors for exponential families.” The Annals of Statistics 7: 269-281.
- [8] Diaconis, Persi, and Donald Ylvisaker. 1985. “Quantifying prior opinion.” Bayesian Statistics 2: 133-156.
- [9] Espinosa, Valeria, Tirthankar Dasgupta, and Donald B. Rubin. 2016. “A Bayesian perspective on the analysis of unreplicated factorial experiments using potential outcomes.” Technometrics 58(1): 62-73.
- [10] Doros, Gheorghe, and Andrew B. Geier. 2005. “Probability of replication revisited: Comment on ‘An alternative to null-hypothesis significance tests’.” Psychological Science 16(12): 1005-1006.
- [11] Frank, Kenneth A. 2000. “Impact of a confounding variable on a regression coefficient.” Sociological Methods & Research 29(2): 147-194.
- [12] Frank, Kenneth A., and Kyung-Seok Min. 2007. “Indices of robustness for sample representation.” Sociological Methodology 37: 349-392.
- [13] Frank, Kenneth A., Spiro J. Maroulis, Minh Q. Duong, and Benjamin M. Kelcey. 2013. “What would it take to change an inference? Using Rubin’s Causal Model to interpret the robustness of causal inferences.” Educational Evaluation and Policy Analysis 35: 437-460.
- [14] Greenwald, Anthony G., Richard Gonzalez, Richard J. Harris, and Donald Guthrie. 1996. “Effect sizes and p values: what should be reported and what should be replicated?” Psychophysiology 33(2): 175-183.
- [15] Heckman, James J. 2005. “The scientific model of causality.” Sociological Methodology 35(1): 1-97.
- [16] Hoff, Peter D. 2009. A first course in Bayesian statistical methods. New York, NY: Springer Science & Business Media.
- [17] Holland, Paul W. 1986. “Statistics and causal inference.” Journal of the American Statistical Association 81(396): 945-960.
- [18] Hong, Guanglei. 2010. “Marginal mean weighting through stratification: adjustment for selection bias in multilevel data.” Journal of Educational and Behavioral Statistics 35(5): 499-531.
- [19] Hong, Guanglei, and Stephen W. Raudenbush. 2005. “Effects of kindergarten retention policy on children’s cognitive growth in reading and mathematics.” Educational Evaluation and Policy Analysis 27: 205-224.
- [20] Hosman, Carrie A., Ben B. Hansen, and Paul W. Holland. 2010. “The sensitivity of linear regression coefficients’ confidence limits to the omission of a confounder.” The Annals of Applied Statistics 4(2): 849-870.
- [21] Imai, Kosuke, Gary King, and Elizabeth A. Stuart. 2008. “Misunderstandings between experimentalists and observationalists about causal inference.” Journal of the Royal Statistical Society: Series A (Statistics in Society) 171(2): 481-502.
- [22] Imbens, Guido W. 2004. “Nonparametric estimation of average treatment effects under exogeneity: A review.” The Review of Economics and Statistics 86: 4-29.
- [23] Imbens, Guido W., and Donald B. Rubin. 2015. Causal inference for statistics, social, and biomedical sciences: An introduction. New York, NY: Cambridge University Press.
- [24] Iverson, Geoffrey J., Eric-Jan Wagenmakers, and Michael D. Lee. 2010. “A model-averaging approach to replication: The case of prep.” Psychological Methods 15(2): 172-181.
- [25] Killeen, Peter R. 2005. “An alternative to null-hypothesis significance tests.” Psychological Science 16(5): 345-353.
- [26] Li, T. 2018. The Bayesian paradigm of robustness indices of causal inferences. Unpublished doctoral dissertation. Michigan State University, East Lansing.
- [27] Lin, Danyu Y., Bruce M. Psaty, and Richard A. Kronmal. 1998. “Assessing the sensitivity of regression results to unmeasured confounders in observational studies.” Biometrics: 948-963.
- [28] Manski, Charles F. 1990. “Nonparametric bounds on treatment effects.” The American Economic Review 80(2): 319.
- [29] Manski, Charles F. 1995. Identification problems in the social sciences. Harvard University Press.
- [30] Manski, Charles F., and Daniel S. Nagin. 1998. “Bounding disagreements about treatment effects: A case study of sentencing and recidivism.” Sociological Methodology 28(1): 99-137.
- [31] Masten, Matthew A., and Alexandre Poirier. 2018. “Identification of treatment effects under conditional partial independence.” Econometrica 86(1): 317-351.
- [32] McCandless, Lawrence C., Paul Gustafson, and Adrian Levy. 2007. “Bayesian sensitivity analysis for unmeasured confounding in observational studies.” Statistics in Medicine 26(11): 2331-2347.
- [33] McCandless, Lawrence C., Paul Gustafson, Adrian R. Levy, and Sylvia Richardson. 2012. “Hierarchical priors for bias parameters in Bayesian sensitivity analysis for unmeasured confounding.” Statistics in Medicine 31(4): 383-396.
- [34] McCandless, Lawrence C., and Paul Gustafson. 2017. “A comparison of Bayesian and Monte Carlo sensitivity analysis for unmeasured confounding.” Statistics in Medicine 36(18): 2887-2901.
- [35] Murnane, Richard J., and John B. Willett. 2011. Methods matter: Improving causal inference in educational and social science research. New York, NY: Oxford University Press.
- [36] Pearl, Judea, and Dana Mackenzie. 2018. The Book of Why: The New Science of Cause and Effect. New York, NY: Basic Books.
- [37] Posavac, Emil J. 2002. “Using p values to estimate the probability of a statistically significant replication.” Understanding Statistics: Statistical Issues in Psychology, Education, and the Social Sciences 1(2): 101-112.
- [38] Robins, James M., Andrea Rotnitzky, and Daniel O. Scharfstein. 2000. “Sensitivity analysis for selection bias and unmeasured confounding in missing data and causal inference models.” In Statistical models in epidemiology, the environment, and clinical trials (pp. 1-94). New York, NY: Springer.
- [39] Rosenbaum, Paul R. 1986. “Dropping out of high school in the United States: An observational study.” Journal of Educational Statistics 11(3): 207-224.
- [40] Rosenbaum, Paul R. 1987. “Sensitivity analysis for certain permutation inferences in matched observational studies.” Biometrika 74(1): 13-26.
- [41] Rosenbaum, Paul R. 1991. “Sensitivity analysis for matched case-control studies.” Biometrics: 87-100.
- [42] Rosenbaum, Paul R. 2002. Observational Studies. New York, NY: Springer.
- [43] Rosenbaum, Paul R. 2010. Design of Observational Studies. New York, NY: Springer.
- [44] Rubin, Donald B. 2004. “Teaching statistical inference for causal effects in experiments and observational studies.” Journal of Educational and Behavioral Statistics 29(3): 343-367.
- [45] Rubin, Donald B. 2005. “Causal inference using potential outcomes: Design, modeling, decisions.” Journal of the American Statistical Association 100(469): 322-331.
- [46] Rubin, Donald B. 2007. “The design versus the analysis of observational studies for causal effects: parallels with the design of randomized trials.” Statistics in Medicine 26(1): 20-36.
- [47] Rubin, Donald B. 2008. “For objective causal inference, design trumps analysis.” The Annals of Applied Statistics 2(3): 808-840.
- [48] Schafer, Joseph L., and Joseph Kang. 2008. “Average causal effects from nonrandomized studies: a practical guide and simulated example.” Psychological Methods 13(4): 279.
- [49] Schneider, Barbara, Martin Carnoy, Jeremy Kilpatrick, William H. Schmidt, and Richard J. Shavelson. 2007. Estimating causal effects using experimental and observational design. American Educational Research Association.
- [50] Shadish, William R., Thomas D. Cook, and Donald T. Campbell. 2002. Experimental and quasi-experimental designs for generalized causal inference. New York, NY: Houghton Mifflin.
- [51] Shao, Jun, and Shein-Chung Chow. 2002. “Reproducibility probability in clinical trials.” Statistics in Medicine 21(12): 1727-1742.
- [52] Sobel, Michael E. 1996. “An introduction to causal inference.” Sociological Methods & Research 24(3): 353-379.
- [53] VanderWeele, Tyler J. 2008. “Sensitivity analysis: distributional assumptions and confounding assumptions.” Biometrics 64(2): 645-649.
Appendix: Proofs of Theorem 1 and Theorem 2
Proof of Theorem 1: First, the distribution of [symbol lost in extraction] could be derived based on the following pivotal quantity: [equation (A1), garbled in extraction; a pivotal quantity following N(0,1)]. The pivotal quantity (...