arxiv: 1906.08726 · v1 · pith:5WEJ5RHGnew · submitted 2019-06-20 · 📊 stat.AP · econ.EM · stat.ME · stat.OT

On the probability of a causal inference is robust for internal validity

Pith reviewed 2026-05-25 19:02 UTC · model grok-4.3

classification 📊 stat.AP · econ.EM · stat.ME · stat.OT
keywords causal inference · internal validity · robustness · counterfactuals · null hypothesis testing · observational studies · sensitivity analysis · PIV

The pith

The PIV is the probability that a null hypothesis rejection on observed data persists after adding counterfactual outcomes, serving as a robustness index for causal inferences.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper defines the probability of a causal inference being robust for internal validity (PIV) to quantify how secure a causal claim remains when the unconfoundedness assumption is questionable. Counterfactuals are treated as an unobserved additional sample, and PIV is the conditional probability that the null hypothesis is rejected again once those outcomes are included, given that it was already rejected on the observed sample alone. Bounds on the PIV follow from bounded beliefs about the counterfactuals, available under either frequentist or Bayesian reasoning. The index equals statistical power when the test is imagined to have already used the full data including counterfactuals. An eight-step procedure is given for applying the index, demonstrated on an education example.

Core claim

The paper establishes that the PIV functions as a robustness index. It is defined as the probability of rejecting the null hypothesis again based on both the observed sample and the counterfactuals, given that the same null was already rejected on the observed sample alone. Under either a frequentist or a Bayesian framework, the PIV of an inference can be bounded using bounded beliefs about the counterfactuals, which is useful when the unconfoundedness assumption is dubious. The PIV is equivalent to statistical power when the NHST is considered to be based on both the observed sample and the counterfactuals.
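
The definition stated here can be written compactly as a conditional probability. The notation is an assumption made for illustration, not taken from the paper: D^{ob} stands for the observed sample and D^{un} for the counterfactual (unobserved) sample.

```latex
\mathrm{PIV}
  = P\!\left(\text{reject } H_0 \text{ on } D^{ob} \cup D^{un}
      \;\middle|\; \text{reject } H_0 \text{ on } D^{ob}\right)
```

Written this way, the open question raised later in the referee report is visible at a glance: the probability measure P over D^{un} has to come from somewhere, and the paper's answer is the analyst's bounded beliefs.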

What carries the argument

The PIV itself: the conditional probability that a null hypothesis rejection on the observed sample alone persists after the counterfactual outcomes are folded in as an additional sample.

If this is right

  • A researcher can place numerical bounds on the robustness of a causal claim without observing the counterfactuals, by stating bounds on beliefs about them.
  • When the test is viewed as already incorporating counterfactuals, the PIV reduces exactly to ordinary statistical power.
  • The eight-step procedure supplies a concrete workflow for evaluating internal validity of any observational causal inference.
  • The same bounding logic applies under both frequentist and Bayesian interpretations of the test.
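
As a rough illustration of the first two points, the sketch below takes the "PIV as power" reading literally and turns a belief interval into PIV bounds. It is not the paper's eight-step procedure. The one-sided z-test, the known standard error, and the names `power_one_sided_z` and `piv_bounds` are all assumptions introduced here.

```python
from statistics import NormalDist


def power_one_sided_z(effect, std_error, alpha=0.05):
    """Power of a one-sided z-test of H0: effect <= 0 at level alpha.

    Assumption: the combined-sample estimator is normal with known std_error.
    """
    z_crit = NormalDist().inv_cdf(1 - alpha)
    return 1 - NormalDist().cdf(z_crit - effect / std_error)


def piv_bounds(effect_low, effect_high, std_error, alpha=0.05):
    """Bounds on power over a belief interval for the combined-sample effect.

    Power is increasing in the effect, so the endpoints give the bounds.
    """
    return (
        power_one_sided_z(effect_low, std_error, alpha),
        power_one_sided_z(effect_high, std_error, alpha),
    )


# Hypothetical belief: the effect in the full (observed + counterfactual)
# sample lies somewhere between 0.1 and 0.4, with standard error 0.1.
low, high = piv_bounds(effect_low=0.1, effect_high=0.4, std_error=0.1)
print(f"PIV between {low:.3f} and {high:.3f}")
```

Because power is monotone in the effect size under these assumptions, only the two endpoints of the belief interval need to be evaluated; a non-monotone test statistic would require a search over the whole interval.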

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The PIV framing could be applied to missing-data problems outside causal inference by treating the missing cases as the counterfactual sample.
  • One could test the bounding procedure by generating data with known counterfactual distributions and checking whether the derived bounds contain the realized rejection probability.
  • The approach supplies a probabilistic alternative to deterministic sensitivity analyses that vary one parameter at a time.
  • Integration with existing software for power analysis might allow direct computation of PIV bounds once belief intervals on counterfactual means or variances are supplied.

Load-bearing premise

Counterfactual outcomes can be treated as an additional sample whose influence on the test statistic permits probabilistic bounding from subjective beliefs about those outcomes.

What would settle it

A simulation in which the true distribution of counterfactual outcomes is fully known, the actual frequency of continued null rejection with the full data is computed, and this frequency is checked against the interval obtained from the paper's bounding procedure applied to deliberately limited beliefs.
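
A toy version of such a simulation might look like the sketch below. Everything in it is an assumption made for illustration: unit-variance normal outcomes, a one-sided z-test, the function name `simulate_piv`, and the parameter values. The interval is obtained by rerunning the simulation at the endpoints of a belief interval for the counterfactual mean, not by the paper's bounding procedure.

```python
import random
from statistics import NormalDist

Z_CRIT = NormalDist().inv_cdf(0.95)  # one-sided test at alpha = 0.05


def simulate_piv(delta_obs, delta_un, n_obs, n_un, reps=20000, seed=0):
    """Estimate P(reject on observed + counterfactual | reject on observed)."""
    rng = random.Random(seed)
    n_all = n_obs + n_un
    rejected_obs = 0
    rejected_both = 0
    for _ in range(reps):
        # Draw the two sample means directly (unit-variance outcomes assumed).
        mean_obs = rng.gauss(delta_obs, n_obs ** -0.5)
        mean_un = rng.gauss(delta_un, n_un ** -0.5)
        if mean_obs * n_obs ** 0.5 > Z_CRIT:
            rejected_obs += 1
            # Pool both samples and repeat the same test.
            mean_all = (n_obs * mean_obs + n_un * mean_un) / n_all
            if mean_all * n_all ** 0.5 > Z_CRIT:
                rejected_both += 1
    return rejected_both / rejected_obs if rejected_obs else float("nan")


# "True" counterfactual mean is 0.2; the analyst only believes it is in [0, 0.4].
truth = simulate_piv(0.4, 0.2, 50, 50)
lower = simulate_piv(0.4, 0.0, 50, 50)
upper = simulate_piv(0.4, 0.4, 50, 50)
print(f"interval [{lower:.3f}, {upper:.3f}], realized {truth:.3f}")
```

The three runs share a seed, so they use common random numbers and the containment holds by construction in this toy setting. A real check would replace the endpoint runs with the interval produced by the paper's procedure and use independent draws.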

Figures

Figures reproduced from arXiv: 1906.08726 by Kenneth A. Frank, Tenglong Li.

Figure 1: Illustrates the conceptualization of the unobserved sample in Hong & Raudenbush (2005) for the simple estimator. The observed outcome $Y^{ob}_{r,i}$ symbolizes the reading score of any retained student whose counterfactual outcome is $Y^{un}_{p,i}$. Likewise, the observed outcome $Y^{ob}_{p,j}$ represents the reading score of any promoted student whose counterfactual outcome is $Y^{un}_{r,j}$. The unobserved sample consists of …
Original abstract

The internal validity of observational study is often subject to debate. In this study, we define the counterfactuals as the unobserved sample and intend to quantify its relationship with the null hypothesis statistical testing (NHST). We propose the probability of a causal inference is robust for internal validity, i.e., the PIV, as a robustness index of causal inference. Formally, the PIV is the probability of rejecting the null hypothesis again based on both the observed sample and the counterfactuals, provided the same null hypothesis has already been rejected based on the observed sample. Under either frequentist or Bayesian framework, one can bound the PIV of an inference based on his bounded belief about the counterfactuals, which is often needed when the unconfoundedness assumption is dubious. The PIV is equivalent to statistical power when the NHST is thought to be based on both the observed sample and the counterfactuals. We summarize the process of evaluating internal validity with the PIV into an eight-step procedure and illustrate it with an empirical example (i.e., Hong and Raudenbush (2005)).

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes the probability of a causal inference being robust for internal validity (PIV) as a robustness index for causal inferences from observational data. Formally, PIV is defined as the conditional probability of rejecting the null hypothesis H0 again when the test is based on both the observed sample and the counterfactual outcomes, given that H0 was already rejected based on the observed sample alone. The authors claim that PIV can be bounded using subjective beliefs about the counterfactuals under either a frequentist or a Bayesian framework, that it is equivalent to statistical power when the test incorporates both samples, and that an eight-step procedure can be used to evaluate internal validity, illustrated with the Hong and Raudenbush (2005) example.

Significance. If the central definition were rigorously grounded, the PIV could provide a quantitative index for sensitivity to unconfoundedness violations. The manuscript offers no machine-checked proofs, reproducible code, parameter-free derivations, or falsifiable predictions; the contribution rests entirely on the conceptual proposal and the eight-step procedure.

major comments (3)
  1. [Abstract] Abstract (formal definition of PIV): The quantity P(reject H0 on observed+counterfactuals | reject H0 on observed) is not formally defined in the frequentist NHST framework employed for the original test. Counterfactual outcomes are fixed but unobserved; without an explicit joint distribution or worst-case measure over their values, the conditional probability and its bounds cannot be derived.
  2. [Abstract] Abstract (equivalence claim): The stated equivalence of PIV to statistical power when the NHST is 'thought to be based on both' inherits the same definitional ambiguity, because power is defined with respect to a sampling distribution under the null or alternative, not over fixed potential outcomes.
  3. [Abstract] Abstract and eight-step procedure: All numerical illustrations and the robustness index rest on the load-bearing assumption that counterfactuals can be treated as an additional sample whose distribution is subjectively bounded to compute a conditional rejection probability; this assumption is not justified within standard frequentist potential-outcomes theory.
minor comments (2)
  1. [Abstract] The abstract refers to 'bounded belief about the counterfactuals' without specifying how this belief is translated into numerical bounds on the rejection probability.
  2. Notation for counterfactual outcomes should be introduced explicitly and distinguished from random variables to avoid conflating fixed potential outcomes with a probability space.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive comments on our manuscript proposing the PIV as a robustness measure for causal inferences. We address each of the major comments point by point below. While we maintain that the conceptual contribution is valuable for sensitivity analysis in observational studies, we acknowledge the need for greater formal rigor in some aspects and will revise accordingly where appropriate.

Point-by-point responses
  1. Referee: [Abstract] Abstract (formal definition of PIV): The quantity P(reject H0 on observed+counterfactuals | reject H0 on observed) is not formally defined in the frequentist NHST framework employed for the original test. Counterfactual outcomes are fixed but unobserved; without an explicit joint distribution or worst-case measure over their values, the conditional probability and its bounds cannot be derived.

    Authors: We agree that a more precise formalization is needed. In the revised manuscript, we will define PIV more rigorously by specifying that the bounds on counterfactual outcomes are incorporated via a set of possible distributions or values consistent with the analyst's beliefs, and the conditional probability is computed as the infimum or range over these possibilities. This treats the counterfactuals as fixed but unknown, with the probability arising from the sampling distribution of the observed data conditional on the bounds. This is an extension of standard NHST to include sensitivity parameters. revision: yes

  2. Referee: [Abstract] Abstract (equivalence claim): The stated equivalence of PIV to statistical power when the NHST is 'thought to be based on both' inherits the same definitional ambiguity, because power is defined with respect to a sampling distribution under the null or alternative, not over fixed potential outcomes.

    Authors: The equivalence is conceptual: PIV represents the power of a test that would be conducted if the counterfactual outcomes were observed, but since they are not, we bound it using beliefs about them. We will revise the manuscript to clarify that it is not a direct equivalence but an analogy to power under the extended sample, and remove any implication of strict mathematical equivalence without the additional bounding framework. revision: yes

  3. Referee: [Abstract] Abstract and eight-step procedure: All numerical illustrations and the robustness index rest on the load-bearing assumption that counterfactuals can be treated as an additional sample whose distribution is subjectively bounded to compute a conditional rejection probability; this assumption is not justified within standard frequentist potential-outcomes theory.

    Authors: This is the core of our proposal, which intentionally extends beyond standard theory to provide a practical tool for assessing internal validity when unconfoundedness may be violated. Similar to other sensitivity analyses (e.g., those using Rosenbaum's bounds or partial identification), we allow subjective input on counterfactual distributions. The eight-step procedure makes these assumptions explicit and falsifiable by the reader. We do not claim justification within unmodified standard theory but as a new robustness index; thus, no change to this aspect is planned. revision: no
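
The formalization promised in the first response could take roughly the following shape. The symbols are hypothetical and do not appear in the paper or the rebuttal: Θ denotes the set of counterfactual distributions (or values) consistent with the analyst's beliefs, and P_θ is the sampling probability with the counterfactuals governed by θ.

```latex
\underline{\mathrm{PIV}}
  = \inf_{\theta \in \Theta}
    P_\theta\!\left(\text{reject } H_0 \text{ on } D^{ob} \cup D^{un}
      \;\middle|\; \text{reject } H_0 \text{ on } D^{ob}\right),
\qquad
\overline{\mathrm{PIV}}
  = \sup_{\theta \in \Theta}
    P_\theta\!\left(\text{reject } H_0 \text{ on } D^{ob} \cup D^{un}
      \;\middle|\; \text{reject } H_0 \text{ on } D^{ob}\right)
```

Under this reading the reported robustness index is the interval between the two quantities, and the referee's objection reduces to whether Θ and P_θ are specified well enough for the infimum and supremum to be computable.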

Circularity Check

0 steps flagged

No circularity in PIV definition or bounding claim.

full rationale

The paper introduces PIV via an explicit formal definition as the conditional probability P(reject H0 on observed+counterfactuals | reject H0 on observed). It states that bounds follow from subjective beliefs about counterfactuals under frequentist or Bayesian views and notes an equivalence to power under a joint-sample interpretation. No quoted equations, procedures, or self-citations reduce this definition or the bounding claim to a fitted parameter, prior self-result, or input by construction. The eight-step procedure and empirical illustration rest on the definitional proposal itself rather than any hidden reduction, so the derivation chain is self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

Limited information from abstract only; no free parameters or additional axioms explicitly stated.

axioms (1)
  • domain assumption: Counterfactuals can be treated as an unobserved sample for the purposes of null hypothesis statistical testing.
    This is foundational to the definition of PIV as per the abstract.
invented entities (1)
  • PIV (no independent evidence)
    purpose: Robustness index for internal validity of causal inference
    Newly defined probability measure without mentioned external evidence or validation in abstract.


Reference graph

Works this paper leans on

53 extracted references · 53 canonical work pages

  1. [1]

    Alexander, Karl L., Doris R. Entwisle, and Susan L. Dauber. 2003. On the success of failure: A reassessment of the effects of retention in the primary school grades. New York, NY: Cambridge University Press

  2. [2]

    Quality of research design moderates effects of grade retention on achievement: A meta-analytic, multilevel analysis

    Allen, Chiharu S., Qi Chen, Victor L. Willson, and Jan N. Hughes. 2009. “Quality of research design moderates effects of grade retention on achievement: A meta-analytic, multilevel analysis.” Educational Evaluation and Policy Analysis 31(4): 480-499

  3. [3]

    P-value precision and reproducibility

    Boos, Dennis D., and Leonard A. Stefanski. 2011. “P-value precision and reproducibility.” The American Statistician 65(4): 213-221

  4. [4]

    Cohen, Jacob. 1988. Statistical power analysis for the behavioral sciences. Hillsdale, NJ: Lawrence Erlbaum Associates

  5. [5]

    A power primer

    Cohen, Jacob. 1992. “A power primer.” Psychological Bulletin 112(1): 155-159

  6. [6]

    Inference for non‐random samples

    Copas, John B., and H. G. Li. 1997. “Inference for non‐random samples.” Journal of the Royal Statistical Society: Series B (Statistical Methodology) 59(1): 55-95

  7. [7]

    Conjugate priors for exponential families

    Diaconis, Persi, and Donald Ylvisaker. 1979. “Conjugate priors for exponential families.” The Annals of Statistics 7: 269–281

  8. [8]

    Quantifying prior opinion

    Diaconis, Persi, and Donald Ylvisaker. 1985. “Quantifying prior opinion.” Bayesian Statistics 2: 133–156

  9. [9]

    A Bayesian perspective on the analysis of unreplicated factorial experiments using potential outcomes

    Espinosa, Valeria, Tirthankar Dasgupta, and Donald B. Rubin. 2016. “A Bayesian perspective on the analysis of unreplicated factorial experiments using potential outcomes.” Technometrics 58(1): 62-73

  10. [10]

    Probability of replication revisited: Comment on “An alternative to null-hypothesis significance tests”

    Doros, Gheorghe, and Andrew B. Geier. 2005. “Probability of replication revisited: Comment on ‘An alternative to null-hypothesis significance tests.’” Psychological Science 16(12): 1005-1006

  11. [11]

    Impact of a confounding variable on a regression coefficient

    Frank, Kenneth A. 2000. “Impact of a confounding variable on a regression coefficient.” Sociological Methods & Research 29(2): 147-194

  12. [12]

    Indices of robustness for sample representation

    Frank, Kenneth A., and Kyung-Seok Min. 2007. “Indices of robustness for sample representation.” Sociological Methodology 37: 349–392

  13. [13]

    What would it take to change an inference? Using Rubin’s Causal Model to interpret the robustness of causal inferences

    Frank, Kenneth A., Spiro J. Maroulis, Minh Q. Duong, and Benjamin M. Kelcey. 2013. “What would it take to change an inference? Using Rubin’s Causal Model to interpret the robustness of causal inferences.” Educational Evaluation and Policy Analysis 35: 437–460

  14. [14]

    Effect sizes and p values: what should be reported and what should be replicated?

    Greenwald, Anthony G., Richard Gonzalez, Richard J. Harris, and Donald Guthrie. 1996. “Effect sizes and p values: what should be reported and what should be replicated?” Psychophysiology 33(2): 175-183

  15. [15]

    The scientific model of causality

    Heckman, James J. 2005. “The scientific model of causality.” Sociological Methodology 35(1): 1-97

  16. [16]

    Hoff, Peter D. 2009. A first course in Bayesian statistical methods. New York, NY: Springer Science & Business Media

  17. [17]

    Statistics and causal inference

    Holland, Paul W. 1986. “Statistics and causal inference.” Journal of the American Statistical Association 81(396): 945-960

  18. [18]

    Marginal mean weighting through stratification: adjustment for selection bias in multilevel data

    Hong, Guanglei. 2010. “Marginal mean weighting through stratification: adjustment for selection bias in multilevel data.” Journal of Educational and Behavioral Statistics 35(5): 499-531

  19. [19]

    Effects of kindergarten retention policy on children’s cognitive growth in reading and mathematics

    Hong, Guanglei, and Stephen W. Raudenbush. 2005. “Effects of kindergarten retention policy on children’s cognitive growth in reading and mathematics.” Educational Evaluation and Policy Analysis 27: 205–224

  20. [20]

    The sensitivity of linear regression coefficients’ confidence limits to the omission of a confounder

    Hosman, Carrie A., Ben B. Hansen, and Paul W. Holland. 2010. “The sensitivity of linear regression coefficients’ confidence limits to the omission of a confounder.” The Annals of Applied Statistics 4(2): 849-870

  21. [21]

    Misunderstandings between experimentalists and observationalists about causal inference

    Imai, Kosuke, Gary King, and Elizabeth A. Stuart. 2008. “Misunderstandings between experimentalists and observationalists about causal inference.” Journal of the Royal Statistical Society: Series A (Statistics in Society) 171(2): 481-502

  22. [22]

    Nonparametric estimation of average treatment effects under exogeneity: A review

    Imbens, Guido W. 2004. “Nonparametric estimation of average treatment effects under exogeneity: A review.” The Review of Economics and Statistics 86: 4-29

  23. [23]

    Imbens, Guido W., and Donald B. Rubin. 2015. Causal inference for statistics, social, and biomedical sciences: An introduction. New York, NY: Cambridge University Press

  24. [24]

    A model-averaging approach to replication: The case of prep

    Iverson, Geoffrey J., Eric-Jan Wagenmakers, and Michael D. Lee. 2010. “A model-averaging approach to replication: The case of prep.” Psychological Methods 15(2): 172-181

  25. [25]

    An alternative to null-hypothesis significance tests

    Killeen, Peter R. 2005. “An alternative to null-hypothesis significance tests.” Psychological Science 16(5): 345-353

  26. [26]

    Li, Tenglong. 2018. The Bayesian Paradigm of Robustness Indices of Causal Inferences. Unpublished doctoral dissertation. East Lansing, MI: Michigan State University

  27. [27]

    Assessing the sensitivity of regression results to unmeasured confounders in observational studies

    Lin, Danyu Y., Bruce M. Psaty, and Richard A. Kronmal. 1998. “Assessing the sensitivity of regression results to unmeasured confounders in observational studies.” Biometrics: 948-963

  28. [28]

    Nonparametric bounds on treatment effects

    Manski, Charles F. 1990. “Nonparametric bounds on treatment effects.” The American Economic Review 80(2): 319

  29. [29]

    Manski, Charles F. 1995. Identification problems in the social sciences. Harvard University Press

  30. [30]

    Bounding disagreements about treatment effects: A case study of sentencing and recidivism

    Manski, Charles F., and Daniel S. Nagin. 1998. “Bounding disagreements about treatment effects: A case study of sentencing and recidivism.” Sociological Methodology 28(1): 99-137

  31. [31]

    Identification of treatment effects under conditional partial independence

    Masten, Matthew A., and Alexandre Poirier. 2018. “Identification of treatment effects under conditional partial independence.” Econometrica 86(1): 317-351

  32. [32]

    Bayesian sensitivity analysis for unmeasured confounding in observational studies

    McCandless, Lawrence C., Paul Gustafson, and Adrian Levy. 2007. “Bayesian sensitivity analysis for unmeasured confounding in observational studies.” Statistics in Medicine 26(11): 2331-2347

  33. [33]

    Hierarchical priors for bias parameters in Bayesian sensitivity analysis for unmeasured confounding

    McCandless, Lawrence C., Paul Gustafson, Adrian R. Levy, and Sylvia Richardson. 2012. “Hierarchical priors for bias parameters in Bayesian sensitivity analysis for unmeasured confounding.” Statistics in Medicine 31(4): 383-396

  34. [34]

    A comparison of Bayesian and Monte Carlo sensitivity analysis for unmeasured confounding

    McCandless, Lawrence C., and Paul Gustafson. 2017. “A comparison of Bayesian and Monte Carlo sensitivity analysis for unmeasured confounding.” Statistics in Medicine 36(18): 2887-2901

  35. [35]

    Murnane, Richard J., and John B. Willett. 2011. Methods matter: Improving causal inference in educational and social science research. New York, NY: Oxford University Press

  36. [36]

    Pearl, Judea, and Dana Mackenzie. 2018. The Book of Why: The New Science of Cause and Effect. New York, NY: Basic Books

  37. [37]

    Using p values to estimate the probability of a statistically significant replication

    Posavac, Emil J. 2002. “Using p values to estimate the probability of a statistically significant replication.” Understanding Statistics: Statistical Issues in Psychology, Education, and the Social Sciences 1(2): 101-112

  38. [38]

    Sensitivity analysis for selection bias and unmeasured confounding in missing data and causal inference models

    Robins, James M., Andrea Rotnitzky, and Daniel O. Scharfstein. 2000. “Sensitivity analysis for selection bias and unmeasured confounding in missing data and causal inference models.” In Statistical models in epidemiology, the environment, and clinical trials (pp. 1-94). Springer, New York, NY

  39. [39]

    Dropping out of high school in the United States: An observational study

    Rosenbaum, Paul R. 1986. “Dropping out of high school in the United States: An observational study.” Journal of Educational Statistics 11(3): 207-224

  40. [40]

    Sensitivity analysis for certain permutation inferences in matched observational studies

    Rosenbaum, Paul R. 1987. “Sensitivity analysis for certain permutation inferences in matched observational studies.” Biometrika 74(1): 13-26

  41. [41]

    Sensitivity analysis for matched case-control studies

    Rosenbaum, Paul R. 1991. “Sensitivity analysis for matched case-control studies.” Biometrics: 87-100

  42. [42]

    Rosenbaum, Paul R. 2002. Observational Studies. New York, NY: Springer

  43. [43]

    Rosenbaum, Paul R. 2010. Design of Observational Studies. New York, NY: Springer

  44. [44]

    Teaching statistical inference for causal effects in experiments and observational studies

    Rubin, Donald B. 2004. “Teaching statistical inference for causal effects in experiments and observational studies.” Journal of Educational and Behavioral Statistics 29(3): 343-367

  45. [45]

    Causal inference using potential outcomes: Design, modeling, decisions

    Rubin, Donald B. 2005. “Causal inference using potential outcomes: Design, modeling, decisions.” Journal of the American Statistical Association 100(469): 322-331

  46. [46]

    The design versus the analysis of observational studies for causal effects: parallels with the design of randomized trials

    Rubin, Donald B. 2007. “The design versus the analysis of observational studies for causal effects: parallels with the design of randomized trials.” Statistics in Medicine 26(1): 20-36

  47. [47]

    For objective causal inference, design trumps analysis

    Rubin, Donald B. 2008. “For objective causal inference, design trumps analysis.” The Annals of Applied Statistics 2(3): 808-840

  48. [48]

    Average causal effects from nonrandomized studies: a practical guide and simulated example

    Schafer, Joseph L., and Joseph Kang. 2008. “Average causal effects from nonrandomized studies: a practical guide and simulated example.” Psychological Methods 13(4): 279

  49. [49]

    Schneider, Barbara, Martin Carnoy, Jeremy Kilpatrick, William H. Schmidt, and Richard J. Shavelson. 2007. Estimating causal effects using experimental and observational design. American Educational Research Association

  50. [50]

    Shadish, William R., Thomas D. Cook, and Donald T. Campbell. 2002. Experimental and quasi-experimental designs for generalized causal inference. New York, NY: Houghton Mifflin

  51. [51]

    Reproducibility probability in clinical trials

    Shao, Jun, and Shein‐Chung Chow. 2002. “Reproducibility probability in clinical trials.” Statistics in Medicine 21(12): 1727-1742

  52. [52]

    An introduction to causal inference

    Sobel, Michael E. 1996. “An introduction to causal inference.” Sociological Methods & Research 24(3): 353-379

  53. [53]

    Sensitivity analysis: distributional assumptions and confounding assumptions

    VanderWeele, Tyler J. 2008. “Sensitivity analysis: distributional assumptions and confounding assumptions.” Biometrics 64(2): 645-649

    Appendix. Proofs of Theorem 1 and Theorem 2. Proof of Theorem 1: First, the distribution of [symbol lost in extraction] could be derived based on a pivotal quantity (A1) that follows N(0,1) [equation garbled in extraction]. The pivotal quantity (...