On the probability of a causal inference is robust for internal validity

Kenneth A. Frank; Tenglong Li

arXiv: 1906.08726 · v1 · pith:5WEJ5RHG · submitted 2019-06-20 · stat.AP · econ.EM · stat.ME · stat.OT


Pith reviewed 2026-05-25 19:02 UTC · model grok-4.3

classification: stat.AP · econ.EM · stat.ME · stat.OT

keywords: causal inference, internal validity, robustness, counterfactuals, null hypothesis testing, observational studies, sensitivity analysis, PIV


The pith

The PIV is the probability that a null hypothesis rejection on observed data persists after adding counterfactual outcomes, serving as a robustness index for causal inferences.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper defines the probability of a causal inference being robust for internal validity (PIV) to quantify how secure a causal claim remains when the unconfoundedness assumption is questionable. Counterfactuals are treated as an unobserved additional sample, and PIV is the conditional probability that the null hypothesis is rejected again once those outcomes are included, given that it was already rejected on the observed sample alone. Bounds on the PIV follow from bounded beliefs about the counterfactuals, available under either frequentist or Bayesian reasoning. The index equals statistical power when the test is imagined to have already used the full data including counterfactuals. An eight-step procedure is given for applying the index, demonstrated on an education example.

Core claim

The paper establishes that the PIV functions as a robustness index. The PIV is defined as the probability of rejecting the null hypothesis again based on both the observed sample and the counterfactuals, given that the same null was already rejected on the observed sample alone. Under either the frequentist or the Bayesian framework, the PIV of an inference can be bounded from bounded beliefs about the counterfactuals, which is useful when the unconfoundedness assumption is dubious. The PIV is equivalent to statistical power when the NHST is considered to be based on both the observed sample and the counterfactuals.
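Written as a formula, the definition is an ordinary conditional probability. The event labels $R_{ob}$ (H0 rejected on the observed sample) and $R_{ob+un}$ (H0 rejected on the observed sample plus the counterfactuals) are introduced here for illustration and are not the paper's notation. The second equality is simply the definition of conditional probability and requires $\Pr(R_{ob}) > 0$. Which probability measure $\Pr$ denotes (repeated sampling, or beliefs about the counterfactuals) is the point disputed in the editorial analysis on this page.

$$
\mathrm{PIV} = \Pr\left(R_{ob+un} \mid R_{ob}\right) = \frac{\Pr\left(R_{ob+un} \cap R_{ob}\right)}{\Pr\left(R_{ob}\right)}
$$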

What carries the argument

The PIV itself, the conditional probability that a null hypothesis rejection on the observed sample alone continues after the counterfactual outcomes are folded in as an additional sample.
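A minimal Monte Carlo sketch of that mechanism follows. The assumptions are ours, not the paper's:

- Outcomes are compared with a large-sample one-sided z test.
- Each unit's missing counterfactual is drawn from a normal belief distribution. Its means (`cf_mean_t`, `cf_mean_c`) and spread (`cf_sd`) are hypothetical inputs.
- The completed arms are treated as independent samples, ignoring the pairing of each unit's two outcomes.

The observed data are held fixed, so the probability comes only from the belief about the counterfactuals. This corresponds to a Bayesian-style reading of the PIV. All function names are illustrative.

```python
import math
import random
from statistics import NormalDist, fmean


def _mean_var(xs):
    """Sample mean and unbiased sample variance."""
    m = fmean(xs)
    return m, sum((x - m) ** 2 for x in xs) / (len(xs) - 1)


def rejects_one_sided(treated, control, alpha=0.05):
    """Large-sample one-sided z test of H0: mean(treated) - mean(control) <= 0."""
    m_t, v_t = _mean_var(treated)
    m_c, v_c = _mean_var(control)
    z = (m_t - m_c) / math.sqrt(v_t / len(treated) + v_c / len(control))
    return z > NormalDist().inv_cdf(1 - alpha)


def piv_monte_carlo(y_t_obs, y_c_obs, cf_mean_t, cf_mean_c, cf_sd,
                    alpha=0.05, n_draws=2000, seed=0):
    """Hypothetical Monte Carlo PIV under a normal belief about the counterfactuals.

    cf_mean_t: believed mean of the treated outcomes the control units would have had.
    cf_mean_c: believed mean of the control outcomes the treated units would have had.
    Returns None when the observed-sample test does not reject, since the PIV
    is defined only conditional on that first rejection.
    """
    if not rejects_one_sided(y_t_obs, y_c_obs, alpha):
        return None
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_draws):
        # One imagined draw of the unobserved sample.
        cf_t = [rng.gauss(cf_mean_t, cf_sd) for _ in y_c_obs]
        cf_c = [rng.gauss(cf_mean_c, cf_sd) for _ in y_t_obs]
        # Fold the counterfactuals in: each arm now holds an outcome for every unit.
        if rejects_one_sided(list(y_t_obs) + cf_t, list(y_c_obs) + cf_c, alpha):
            hits += 1
    return hits / n_draws


# Made-up illustrative data: 50 units per arm, observed means 12 and 10.
offsets = [NormalDist().inv_cdf((k + 0.5) / 50) for k in range(50)]
y_t = [12 + d for d in offsets]
y_c = [10 + d for d in offsets]
print(piv_monte_carlo(y_t, y_c, cf_mean_t=12, cf_mean_c=10, cf_sd=1.0))  # close to 1
print(piv_monte_carlo(y_t, y_c, cf_mean_t=10, cf_mean_c=12, cf_sd=1.0))  # close to 0
```

Beliefs that reproduce the observed group means leave the rejection intact in almost every draw. Beliefs that swap the means cancel the observed gap, so the PIV collapses toward zero.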

If this is right

A researcher can place numerical bounds on the robustness of a causal claim without observing the counterfactuals, by stating bounds on beliefs about them.
When the test is viewed as already incorporating counterfactuals, the PIV reduces exactly to ordinary statistical power.
The eight-step procedure supplies a concrete workflow for evaluating internal validity of any observational causal inference.
The same bounding logic applies under both frequentist and Bayesian interpretations of the test.
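The ordinary power function that the PIV is said to reduce to can be sketched for a one-sided z test. The normal approximation, the one-sided alternative, and the name `one_sided_z_power` are assumptions made for illustration. If the counterfactuals were observed, every unit would contribute an outcome to both arms. In that case `se` would be the standard error of the full-sample difference in means.

```python
from statistics import NormalDist


def one_sided_z_power(effect, se, alpha=0.05):
    """Power of the one-sided z test of H0: effect <= 0 when the true effect is `effect`.

    Assumes the estimate is approximately N(effect, se**2) and that H0 is
    rejected when estimate / se exceeds the (1 - alpha) standard normal quantile.
    """
    z_crit = NormalDist().inv_cdf(1 - alpha)
    # P(estimate / se > z_crit) = P(Z > z_crit - effect / se)
    return 1 - NormalDist().cdf(z_crit - effect / se)


print(one_sided_z_power(effect=0.2, se=0.1))
```

At `effect = 0` the function returns `alpha`. At `effect = (z_{1-alpha} + z_{0.8}) * se` it returns 0.8, the conventional target in power analysis.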

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The PIV framing could be applied to missing-data problems outside causal inference by treating the missing cases as the counterfactual sample.
One could test the bounding procedure by generating data with known counterfactual distributions and checking whether the derived bounds contain the realized rejection probability.
The approach supplies a probabilistic alternative to deterministic sensitivity analyses that vary one parameter at a time.
Integration with existing software for power analysis might allow direct computation of PIV bounds once belief intervals on counterfactual means or variances are supplied.
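One way such a computation might look is sketched below. It is a sketch, not the paper's procedure, and it proceeds in three steps:

1. Read the PIV as the one-sided z-test power on the full sample, following the equivalence claim.
2. Express the full-sample effect through the observed means and two unknown counterfactual means.
3. Report the minimum and maximum power over a belief box for those two means.

The standard error of the full-sample estimate, `se_full`, is taken as a fixed input. A belief interval on counterfactual variances could be handled by also looping over a range of `se_full` values. All names are hypothetical.

```python
from itertools import product
from statistics import NormalDist


def full_sample_effect(n_t, n_c, mean_t_obs, mean_c_obs, cf_mean_t, cf_mean_c):
    """Difference in means once every unit contributes to both arms.

    cf_mean_t: assumed mean treated outcome of the n_c control units.
    cf_mean_c: assumed mean control outcome of the n_t treated units.
    """
    n = n_t + n_c
    treated_mean = (n_t * mean_t_obs + n_c * cf_mean_t) / n
    control_mean = (n_c * mean_c_obs + n_t * cf_mean_c) / n
    return treated_mean - control_mean


def piv_bounds(n_t, n_c, mean_t_obs, mean_c_obs, se_full,
               cf_t_interval, cf_c_interval, alpha=0.05):
    """(min, max) of full-sample one-sided z-test power over a belief box.

    The effect increases with cf_mean_t and decreases with cf_mean_c, and power
    increases with the effect, so the extremes occur at the four box corners.
    """
    z_crit = NormalDist().inv_cdf(1 - alpha)
    powers = [
        1 - NormalDist().cdf(
            z_crit
            - full_sample_effect(n_t, n_c, mean_t_obs, mean_c_obs, m_t, m_c) / se_full
        )
        for m_t, m_c in product(cf_t_interval, cf_c_interval)
    ]
    return min(powers), max(powers)


print(piv_bounds(50, 50, 12.0, 10.0, se_full=0.2,
                 cf_t_interval=(10.0, 12.0), cf_c_interval=(10.0, 12.0)))
```

Take 50 units per arm, observed means of 12 and 10, `se_full = 0.2`, and both counterfactual means believed to lie in [10, 12]. The range then runs from `alpha` (where the counterfactual means exactly cancel the observed gap) to essentially 1.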

Load-bearing premise

Counterfactual outcomes can be treated as an additional sample whose influence on the test statistic permits probabilistic bounding from subjective beliefs about those outcomes.

What would settle it

A simulation in which the true distribution of counterfactual outcomes is fully known, the actual frequency of continued null rejection with the full data is computed, and this frequency is checked against the interval obtained from the paper's bounding procedure applied to deliberately limited beliefs.
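A small version of that check can be sketched with a hypothetical data-generating process in which both potential outcomes are known for every unit. A confounder `u` raises both the outcome and the chance of treatment, and the treatment effect `tau` is constant. Among replicates whose observed-sample test rejects, the fraction whose full-data test also rejects is the realized conditional rejection frequency. That frequency can then be compared with an interval from any bounding procedure. The one-sided z test, the normal errors, and all function names are assumptions.

```python
import math
import random
from statistics import NormalDist, fmean


def rejects_one_sided(treated, control, alpha=0.05):
    """Large-sample one-sided z test of H0: mean(treated) - mean(control) <= 0."""
    if len(treated) < 2 or len(control) < 2:
        return False
    def mean_var(xs):
        m = fmean(xs)
        return m, sum((x - m) ** 2 for x in xs) / (len(xs) - 1)
    m_t, v_t = mean_var(treated)
    m_c, v_c = mean_var(control)
    z = (m_t - m_c) / math.sqrt(v_t / len(treated) + v_c / len(control))
    return z > NormalDist().inv_cdf(1 - alpha)


def realized_piv(tau, n=200, reps=200, alpha=0.05, seed=1):
    """Share of observed-test rejections that survive on the full potential outcomes."""
    rng = random.Random(seed)
    rejected_obs = rejected_both = 0
    for _ in range(reps):
        u = [rng.gauss(0.0, 1.0) for _ in range(n)]          # unmeasured confounder
        y0 = [ui + rng.gauss(0.0, 1.0) for ui in u]          # outcome without treatment
        y1 = [y + tau for y in y0]                           # constant effect (assumed)
        treated = [ui + rng.gauss(0.0, 1.0) > 0 for ui in u] # confounded assignment
        obs_t = [y1[i] for i in range(n) if treated[i]]
        obs_c = [y0[i] for i in range(n) if not treated[i]]
        if rejects_one_sided(obs_t, obs_c, alpha):
            rejected_obs += 1
            # Full data: every unit's Y1 against every unit's Y0.
            if rejects_one_sided(y1, y0, alpha):
                rejected_both += 1
    return None if rejected_obs == 0 else rejected_both / rejected_obs


def bounds_cover(bounds, value):
    """True if a (lo, hi) interval contains the realized frequency."""
    lo, hi = bounds
    return value is not None and lo <= value <= hi


freq = realized_piv(tau=0.0)
print(freq, bounds_cover((0.0, 0.1), freq))
```

With `tau = 0`, the confounding makes the observed test reject almost every time. The full-data test never rejects, because the two potential-outcome vectors are identical, so the realized frequency is 0. With `tau = 1` the realized frequency is 1. An interval from the bounding procedure that excluded these values would fail the check.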

Figures

Figures reproduced from arXiv: 1906.08726 by Kenneth A. Frank, Tenglong Li.

**Figure 1.** Conceptualization of the unobserved sample in Hong & Raudenbush (2005) for the simple estimator. The observed outcome $Y^{ob}_{r,i}$ is the reading score of any retained student $i$, whose counterfactual outcome is $Y^{un}_{p,i}$. Likewise, the observed outcome $Y^{ob}_{p,j}$ is the reading score of any promoted student $j$, whose counterfactual outcome is $Y^{un}_{r,j}$. The unobserved sample consists of …
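The layout in the figure can be turned into a worked comparison of estimators. Assume that the simple estimator is the difference between mean reading scores under retention and under promotion; the caption does not spell this out. There are $n_r$ retained students indexed by $i$ and $n_p$ promoted students indexed by $j$, with $n = n_r + n_p$. The relevant means are:

- $\bar{Y}^{ob}_{r}$ and $\bar{Y}^{ob}_{p}$: the observed group means.
- $\bar{Y}^{un}_{r} = n_p^{-1}\sum_j Y^{un}_{r,j}$: the mean counterfactual retained score of the promoted students.
- $\bar{Y}^{un}_{p} = n_r^{-1}\sum_i Y^{un}_{p,i}$: the mean counterfactual promoted score of the retained students.

The observed-sample and full-sample estimators are then

$$
\hat{\delta}_{ob} = \bar{Y}^{ob}_{r} - \bar{Y}^{ob}_{p},
\qquad
\hat{\delta}_{ob+un} = \frac{n_r \bar{Y}^{ob}_{r} + n_p \bar{Y}^{un}_{r}}{n} - \frac{n_p \bar{Y}^{ob}_{p} + n_r \bar{Y}^{un}_{p}}{n}.
$$

If each counterfactual mean equals the observed mean of the same condition ($\bar{Y}^{un}_{r} = \bar{Y}^{ob}_{r}$ and $\bar{Y}^{un}_{p} = \bar{Y}^{ob}_{p}$), the two estimators coincide. In general, bounds on the two unobserved means translate linearly into bounds on $\hat{\delta}_{ob+un}$.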

Original abstract

The internal validity of an observational study is often subject to debate. In this study, we define the counterfactuals as the unobserved sample and intend to quantify its relationship with null hypothesis statistical testing (NHST). We propose the probability of a causal inference is robust for internal validity, i.e., the PIV, as a robustness index of causal inference. Formally, the PIV is the probability of rejecting the null hypothesis again based on both the observed sample and the counterfactuals, provided the same null hypothesis has already been rejected based on the observed sample. Under either the frequentist or the Bayesian framework, one can bound the PIV of an inference based on one's bounded belief about the counterfactuals, which is often needed when the unconfoundedness assumption is dubious. The PIV is equivalent to statistical power when the NHST is thought to be based on both the observed sample and the counterfactuals. We summarize the process of evaluating internal validity with the PIV into an eight-step procedure and illustrate it with an empirical example (Hong and Raudenbush 2005).

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

PIV tries to turn counterfactuals into an extra sample for a conditional rejection probability, but that quantity is not defined under the frequentist NHST the paper starts from.


The paper defines PIV as the probability of rejecting the null again on the combined observed-plus-counterfactual data, given that the null was already rejected on the observed data alone. The authors claim this probability can be bounded from beliefs about the counterfactuals, and they give an eight-step procedure plus an example from Hong and Raudenbush (2005). That procedure is the most concrete part; applied users who want a checklist for internal validity might find it usable. The equivalence to statistical power, when the test is imagined to include the counterfactuals, follows directly once the setup is granted. The illustration shows how the steps would look in practice, which is better than leaving the idea purely abstract.

The main problem is the one flagged in the stress test. Frequentist NHST defines rejection regions from the sampling distribution of the test statistic under the null for the observed sample. The counterfactual outcomes, however, are fixed but unknown; they are not random variables with a distribution that can be subjectively bounded. Without an explicit joint probability measure or a clear worst-case construction over those fixed values, the conditional probability P(reject on combined | reject on observed) has no formal meaning, so the bounding claims rest on an undefined object. The paper mentions both frequentist and Bayesian frameworks, but the original test is NHST, and the abstract does not resolve the frequentist case. This issue is load-bearing for the whole index.

The work engages the literature on causal robustness at a basic level and is not internally contradictory on its own terms, but the foundational step needs fixing before the index can be used. It is aimed at applied statisticians who do observational causal work and want a quantitative robustness number. I would bring it to a reading group to discuss whether the probability can be made rigorous, but I would not cite it in its current form. It deserves peer review because a workable version of this kind of index could be useful once the definition is repaired, even though heavy revision would be required.

Referee Report

3 major / 2 minor

Summary. The paper proposes the probability of a causal inference is robust for internal validity (PIV) as a robustness index for causal inferences from observational data. Formally, PIV is defined as the conditional probability of rejecting the null hypothesis H0 again when the test is based on both the observed sample and the counterfactual outcomes, given that H0 was already rejected based on the observed sample alone. The authors claim that PIV can be bounded using subjective beliefs about the counterfactuals under either frequentist or Bayesian frameworks, that it is equivalent to statistical power when the test incorporates both samples, and that an eight-step procedure can be used to evaluate internal validity, illustrated with the Hong and Raudenbush (2005) example.

Significance. If the central definition were rigorously grounded, the PIV could provide a quantitative index for sensitivity to unconfoundedness violations. The manuscript offers no machine-checked proofs, reproducible code, parameter-free derivations, or falsifiable predictions; the contribution rests entirely on the conceptual proposal and the eight-step procedure.

major comments (3)

[Abstract] Abstract (formal definition of PIV): The quantity P(reject H0 on observed+counterfactuals | reject H0 on observed) is not formally defined in the frequentist NHST framework employed for the original test. Counterfactual outcomes are fixed but unobserved; without an explicit joint distribution or worst-case measure over their values, the conditional probability and its bounds cannot be derived.
[Abstract] Abstract (equivalence claim): The stated equivalence of PIV to statistical power when the NHST is 'thought to be based on both' inherits the same definitional ambiguity, because power is defined with respect to a sampling distribution under the null or alternative, not over fixed potential outcomes.
[Abstract] Abstract and eight-step procedure: All numerical illustrations and the robustness index rest on the load-bearing assumption that counterfactuals can be treated as an additional sample whose distribution is subjectively bounded to compute a conditional rejection probability; this assumption is not justified within standard frequentist potential-outcomes theory.

minor comments (2)

[Abstract] The abstract refers to 'bounded belief about the counterfactuals' without specifying how this belief is translated into numerical bounds on the rejection probability.
Notation for counterfactual outcomes should be introduced explicitly and distinguished from random variables to avoid conflating fixed potential outcomes with a probability space.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive comments on our manuscript proposing the PIV as a robustness measure for causal inferences. We address each of the major comments point by point below. While we maintain that the conceptual contribution is valuable for sensitivity analysis in observational studies, we acknowledge the need for greater formal rigor in some aspects and will revise accordingly where appropriate.


Referee: [Abstract] Abstract (formal definition of PIV): The quantity P(reject H0 on observed+counterfactuals | reject H0 on observed) is not formally defined in the frequentist NHST framework employed for the original test. Counterfactual outcomes are fixed but unobserved; without an explicit joint distribution or worst-case measure over their values, the conditional probability and its bounds cannot be derived.

Authors: We agree that a more precise formalization is needed. In the revised manuscript, we will define PIV more rigorously by specifying that the bounds on counterfactual outcomes are incorporated via a set of possible distributions or values consistent with the analyst's beliefs, and the conditional probability is computed as the infimum or range over these possibilities. This treats the counterfactuals as fixed but unknown, with the probability arising from the sampling distribution of the observed data conditional on the bounds. This is an extension of standard NHST to include sensitivity parameters. revision: yes
Referee: [Abstract] Abstract (equivalence claim): The stated equivalence of PIV to statistical power when the NHST is 'thought to be based on both' inherits the same definitional ambiguity, because power is defined with respect to a sampling distribution under the null or alternative, not over fixed potential outcomes.

Authors: The equivalence is conceptual: PIV represents the power of a test that would be conducted if the counterfactual outcomes were observed, but since they are not, we bound it using beliefs about them. We will revise the manuscript to clarify that it is not a direct equivalence but an analogy to power under the extended sample, and remove any implication of strict mathematical equivalence without the additional bounding framework. revision: yes
Referee: [Abstract] Abstract and eight-step procedure: All numerical illustrations and the robustness index rest on the load-bearing assumption that counterfactuals can be treated as an additional sample whose distribution is subjectively bounded to compute a conditional rejection probability; this assumption is not justified within standard frequentist potential-outcomes theory.

Authors: This is the core of our proposal, which intentionally extends beyond standard theory to provide a practical tool for assessing internal validity when unconfoundedness may be violated. Similar to other sensitivity analyses (e.g., those using Rosenbaum's bounds or partial identification), we allow subjective input on counterfactual distributions. The eight-step procedure makes these assumptions explicit and falsifiable by the reader. We do not claim justification within unmodified standard theory but as a new robustness index; thus, no change to this aspect is planned. revision: no

Circularity Check

0 steps flagged

No circularity in PIV definition or bounding claim.

full rationale

The paper introduces PIV via an explicit formal definition as the conditional probability P(reject H0 on observed+counterfactuals | reject H0 on observed). It states that bounds follow from subjective beliefs about counterfactuals under frequentist or Bayesian views and notes an equivalence to power under a joint-sample interpretation. No quoted equations, procedures, or self-citations reduce this definition or the bounding claim to a fitted parameter, prior self-result, or input by construction. The eight-step procedure and empirical illustration rest on the definitional proposal itself rather than any hidden reduction, so the derivation chain is self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

Limited information from abstract only; no free parameters or additional axioms explicitly stated.

axioms (1)

domain assumption: Counterfactuals can be treated as an unobserved sample for the purposes of null hypothesis statistical testing.
This is foundational to the definition of PIV as per the abstract.

invented entities (1)

PIV (no independent evidence)
purpose: Robustness index for internal validity of causal inference
A newly defined probability measure; the abstract mentions no external evidence or validation for it.

pith-pipeline@v0.9.0 · 5724 in / 1298 out tokens · 39217 ms · 2026-05-25T19:02:00.108116+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

Relation between the paper passage and the cited Recognition theorem.

PIV is the probability of rejecting the null hypothesis again based on both the observed sample and the counterfactuals, provided the same null hypothesis has already been rejected based on the observed sample.
IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

Relation between the paper passage and the cited Recognition theorem.

Theorem 2: ... probit(PIV) = f(Y_un_t, Y_un_c) ...

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

53 extracted references · 53 canonical work pages

[1] Alexander, Karl L., Doris R. Entwisle, and Susan L. Dauber. 2003. On the success of failure: A reassessment of the effects of retention in the primary school grades. New York, NY: Cambridge University Press.
[2] Allen, Chiharu S., Qi Chen, Victor L. Willson, and Jan N. Hughes. 2009. “Quality of research design moderates effects of grade retention on achievement: A meta-analytic, multilevel analysis.” Educational Evaluation and Policy Analysis 31(4): 480-499.
[3] Boos, Dennis D., and Leonard A. Stefanski. 2011. “P-value precision and reproducibility.” The American Statistician 65(4): 213-221.
[4] Cohen, Jacob. 1988. Statistical power analysis for the behavioral sciences. Hillsdale, NJ: Lawrence Erlbaum Associates.
[5] Cohen, Jacob. 1992. “A power primer.” Psychological Bulletin 112(1): 155-159.
[6] Copas, John B., and H. G. Li. 1997. “Inference for non-random samples.” Journal of the Royal Statistical Society: Series B (Statistical Methodology) 59(1): 55-95.
[7] Diaconis, Persi, and Donald Ylvisaker. 1979. “Conjugate priors for exponential families.” The Annals of Statistics 7: 269-281.
[8] Diaconis, Persi, and Donald Ylvisaker. 1985. “Quantifying prior opinion.” Bayesian Statistics 2: 133-156.
[9] Espinosa, Valeria, Tirthankar Dasgupta, and Donald B. Rubin. 2016. “A Bayesian perspective on the analysis of unreplicated factorial experiments using potential outcomes.” Technometrics 58(1): 62-73.
[10] Doros, Gheorghe, and Andrew B. Geier. 2005. “Probability of replication revisited: Comment on ‘An alternative to null-hypothesis significance tests.’” Psychological Science 16(12): 1005-1006.
[11] Frank, Kenneth A. 2000. “Impact of a confounding variable on a regression coefficient.” Sociological Methods & Research 29(2): 147-194.
[12] Frank, Kenneth A., and Kyung-Seok Min. 2007. “Indices of robustness for sample representation.” Sociological Methodology 37: 349-392.
[13] Frank, Kenneth A., Spiro J. Maroulis, Minh Q. Duong, and Benjamin M. Kelcey. 2013. “What would it take to change an inference? Using Rubin’s Causal Model to interpret the robustness of causal inferences.” Educational Evaluation and Policy Analysis 35: 437-460.
[14] Greenwald, Anthony G., Richard Gonzalez, Richard J. Harris, and Donald Guthrie. 1996. “Effect sizes and p values: What should be reported and what should be replicated?” Psychophysiology 33(2): 175-183.
[15] Heckman, James J. 2005. “The scientific model of causality.” Sociological Methodology 35(1): 1-97.
[16] Hoff, Peter D. 2009. A first course in Bayesian statistical methods. New York, NY: Springer Science & Business Media.
[17] Holland, Paul W. 1986. “Statistics and causal inference.” Journal of the American Statistical Association 81(396): 945-960.
[18] Hong, Guanglei. 2010. “Marginal mean weighting through stratification: Adjustment for selection bias in multilevel data.” Journal of Educational and Behavioral Statistics 35(5): 499-531.
[19] Hong, Guanglei, and Stephen W. Raudenbush. 2005. “Effects of kindergarten retention policy on children’s cognitive growth in reading and mathematics.” Educational Evaluation and Policy Analysis 27: 205-224.
[20] Hosman, Carrie A., Ben B. Hansen, and Paul W. Holland. 2010. “The sensitivity of linear regression coefficients’ confidence limits to the omission of a confounder.” The Annals of Applied Statistics 4(2): 849-870.
[21] Imai, Kosuke, Gary King, and Elizabeth A. Stuart. 2008. “Misunderstandings between experimentalists and observationalists about causal inference.” Journal of the Royal Statistical Society: Series A (Statistics in Society) 171(2): 481-502.
[22] Imbens, Guido W. 2004. “Nonparametric estimation of average treatment effects under exogeneity: A review.” The Review of Economics and Statistics 86: 4-29.
[23] Imbens, Guido W., and Donald B. Rubin. 2015. Causal inference for statistics, social, and biomedical sciences: An introduction. New York, NY: Cambridge University Press.
[24] Iverson, Geoffrey J., Eric-Jan Wagenmakers, and Michael D. Lee. 2010. “A model-averaging approach to replication: The case of prep.” Psychological Methods 15(2): 172-181.
[25] Killeen, Peter R. 2005. “An alternative to null-hypothesis significance tests.” Psychological Science 16(5): 345-353.
[26] Li, Tenglong. 2018. The Bayesian paradigm of robustness indices of causal inferences. Unpublished doctoral dissertation, Michigan State University, East Lansing.
[27] Lin, Danyu Y., Bruce M. Psaty, and Richard A. Kronmal. 1998. “Assessing the sensitivity of regression results to unmeasured confounders in observational studies.” Biometrics: 948-963.
[28] Manski, Charles F. 1990. “Nonparametric bounds on treatment effects.” The American Economic Review 80(2): 319.
[29] Manski, Charles F. 1995. Identification problems in the social sciences. Harvard University Press.
[30] Manski, Charles F., and Daniel S. Nagin. 1998. “Bounding disagreements about treatment effects: A case study of sentencing and recidivism.” Sociological Methodology 28(1): 99-137.
[31] Masten, Matthew A., and Alexandre Poirier. 2018. “Identification of treatment effects under conditional partial independence.” Econometrica 86(1): 317-351.
[32] McCandless, Lawrence C., Paul Gustafson, and Adrian Levy. 2007. “Bayesian sensitivity analysis for unmeasured confounding in observational studies.” Statistics in Medicine 26(11): 2331-2347.
[33] McCandless, Lawrence C., Paul Gustafson, Adrian R. Levy, and Sylvia Richardson. 2012. “Hierarchical priors for bias parameters in Bayesian sensitivity analysis for unmeasured confounding.” Statistics in Medicine 31(4): 383-396.
[34] McCandless, Lawrence C., and Paul Gustafson. 2017. “A comparison of Bayesian and Monte Carlo sensitivity analysis for unmeasured confounding.” Statistics in Medicine 36(18): 2887-2901.
[35] Murnane, Richard J., and John B. Willett. 2011. Methods matter: Improving causal inference in educational and social science research. New York, NY: Oxford University Press.
[36] Pearl, Judea, and Dana Mackenzie. 2018. The Book of Why: The New Science of Cause and Effect. New York, NY: Basic Books.
[37] Posavac, Emil J. 2002. “Using p values to estimate the probability of a statistically significant replication.” Understanding Statistics: Statistical Issues in Psychology, Education, and the Social Sciences 1(2): 101-112.
[38] Robins, James M., Andrea Rotnitzky, and Daniel O. Scharfstein. 2000. “Sensitivity analysis for selection bias and unmeasured confounding in missing data and causal inference models.” In Statistical models in epidemiology, the environment, and clinical trials, 1-94. New York, NY: Springer.
[39] Rosenbaum, Paul R. 1986. “Dropping out of high school in the United States: An observational study.” Journal of Educational Statistics 11(3): 207-224.
[40] Rosenbaum, Paul R. 1987. “Sensitivity analysis for certain permutation inferences in matched observational studies.” Biometrika 74(1): 13-26.
[41] Rosenbaum, Paul R. 1991. “Sensitivity analysis for matched case-control studies.” Biometrics: 87-100.
[42] Rosenbaum, Paul R. 2002. Observational Studies. New York, NY: Springer.
[43] Rosenbaum, Paul R. 2010. Design of Observational Studies. New York, NY: Springer.
[44] Rubin, Donald B. 2004. “Teaching statistical inference for causal effects in experiments and observational studies.” Journal of Educational and Behavioral Statistics 29(3): 343-367.
[45] Rubin, Donald B. 2005. “Causal inference using potential outcomes: Design, modeling, decisions.” Journal of the American Statistical Association 100(469): 322-331.
[46] Rubin, Donald B. 2007. “The design versus the analysis of observational studies for causal effects: Parallels with the design of randomized trials.” Statistics in Medicine 26(1): 20-36.
[47] Rubin, Donald B. 2008. “For objective causal inference, design trumps analysis.” The Annals of Applied Statistics 2(3): 808-840.
[48] Schafer, Joseph L., and Joseph Kang. 2008. “Average causal effects from nonrandomized studies: A practical guide and simulated example.” Psychological Methods 13(4): 279.
[49] Schneider, Barbara, Martin Carnoy, Jeremy Kilpatrick, William H. Schmidt, and Richard J. Shavelson. 2007. Estimating causal effects using experimental and observational design. American Educational Research Association.
[50] Shadish, William R., Thomas D. Cook, and Donald T. Campbell. 2002. Experimental and quasi-experimental designs for generalized causal inference. New York, NY: Houghton Mifflin.
[51] Shao, Jun, and Shein-Chung Chow. 2002. “Reproducibility probability in clinical trials.” Statistics in Medicine 21(12): 1727-1742.
[52] Sobel, Michael E. 1996. “An introduction to causal inference.” Sociological Methods & Research 24(3): 353-379.
[53] VanderWeele, Tyler J. 2008. “Sensitivity analysis: Distributional assumptions and confounding assumptions.” Biometrics 64(2): 645-649.

Appendix: Proofs of Theorem 1 and Theorem 2. Proof of Theorem 1: First, the distribution of … could be derived based on the following pivotal quantity, which follows N(0, 1): … (A1). The pivotal quantity (…

[1] [1]

Entwisle, and Susan L

Alexander, Karl L., Doris R. Entwisle, and Susan L. Dauber. 2003. On the success of failure: A reassessment of the effects of retention in the primary school grades. New York, NY: Cambridge University Press

work page 2003

[2] [2]

Quality of research design moderates effects of grade retention on achievement: A meta-analytic, multilevel analysis

Allen, Chiharu S., Qi Chen, Victor L. Willson, and Jan N. Hughes. 2009. “Quality of research design moderates effects of grade retention on achievement: A meta-analytic, multilevel analysis.” Educational Evaluation and Policy Analysis 31(4): 480-499

work page 2009

[3] [3]

P-value precision and reproducibility

Boos, Dennis D., and Leonard A. Stefanski. 2011. “P-value precision and reproducibility.” The American Statistician 65(4): 213-221

work page 2011

[4] [4]

Cohen, Jacob. 1988. Statistical power analysis for the behavioral sciences. Hillsdale, NJ: Lawrence Earlbaum Associates

work page 1988

[5] [5]

A power primer

Cohen, Jacob. 1992. “A power primer.” Psychological bulletin 112(1): 155-159

work page 1992

[6] [6]

Inference for non‐random samples

Copas, John B., and H. G. Li. 1997. “Inference for non‐random samples.” Journal of the Royal Statistical Society: Series B (Statistical Methodology) 59(1): 55-95

Diaconis, Persi, and Donald Ylvisaker. 1979. “Conjugate priors for exponential families.” The Annals of Statistics 7: 269-281

Diaconis, Persi, and Donald Ylvisaker. 1985. “Quantifying prior opinion.” Bayesian Statistics 2: 133-156

Espinosa, Valeria, Tirthankar Dasgupta, and Donald B. Rubin. 2016. “A Bayesian perspective on the analysis of unreplicated factorial experiments using potential outcomes.” Technometrics 58(1): 62-73

Doros, Gheorghe, and Andrew B. Geier. 2005. “Probability of replication revisited: Comment on ‘An alternative to null-hypothesis significance tests.’” Psychological Science 16(12): 1005-1006

Frank, Kenneth A. 2000. “Impact of a confounding variable on a regression coefficient.” Sociological Methods & Research 29(2): 147-194

Frank, Kenneth A., and Kyung-Seok Min. 2007. “Indices of robustness for sample representation.” Sociological Methodology 37: 349-392

Frank, Kenneth A., Spiro J. Maroulis, Minh Q. Duong, and Benjamin M. Kelcey. 2013. “What would it take to change an inference? Using Rubin’s Causal Model to interpret the robustness of causal inferences.” Educational Evaluation and Policy Analysis 35: 437-460

Greenwald, Anthony G., Richard Gonzalez, Richard J. Harris, and Donald Guthrie. 1996. “Effect sizes and p values: what should be reported and what should be replicated?” Psychophysiology 33(2): 175-183

Heckman, James J. 2005. “The scientific model of causality.” Sociological Methodology 35(1): 1-97

Hoff, Peter D. 2009. A first course in Bayesian statistical methods. New York, NY: Springer Science & Business Media

Holland, Paul W. 1986. “Statistics and causal inference.” Journal of the American Statistical Association 81(396): 945-960

Hong, Guanglei. 2010. “Marginal mean weighting through stratification: adjustment for selection bias in multilevel data.” Journal of Educational and Behavioral Statistics 35(5): 499-531

Hong, Guanglei, and Stephen W. Raudenbush. 2005. “Effects of kindergarten retention policy on children’s cognitive growth in reading and mathematics.” Educational Evaluation and Policy Analysis 27: 205-224

Hosman, Carrie A., Ben B. Hansen, and Paul W. Holland. 2010. “The sensitivity of linear regression coefficients’ confidence limits to the omission of a confounder.” The Annals of Applied Statistics 4(2): 849-870

Imai, Kosuke, Gary King, and Elizabeth A. Stuart. 2008. “Misunderstandings between experimentalists and observationalists about causal inference.” Journal of the Royal Statistical Society: Series A (Statistics in Society) 171(2): 481-502

Imbens, Guido W. 2004. “Nonparametric estimation of average treatment effects under exogeneity: A review.” The Review of Economics and Statistics 86: 4-29

Imbens, Guido W., and Donald B. Rubin. 2015. Causal inference for statistics, social, and biomedical sciences: An introduction. New York, NY: Cambridge University Press

Iverson, Geoffrey J., Eric-Jan Wagenmakers, and Michael D. Lee. 2010. “A model-averaging approach to replication: The case of p_rep.” Psychological Methods 15(2): 172-181

Killeen, Peter R. 2005. “An alternative to null-hypothesis significance tests.” Psychological Science 16(5): 345-353

Li, Tenglong. 2018. The Bayesian Paradigm of Robustness Indices of Causal Inferences. Unpublished doctoral dissertation, Michigan State University, East Lansing

Lin, Danyu Y., Bruce M. Psaty, and Richard A. Kronmal. 1998. “Assessing the sensitivity of regression results to unmeasured confounders in observational studies.” Biometrics: 948-963

Manski, Charles F. 1990. “Nonparametric bounds on treatment effects.” The American Economic Review 80(2): 319

Manski, Charles F. 1995. Identification problems in the social sciences. Harvard University Press

Manski, Charles F., and Daniel S. Nagin. 1998. “Bounding disagreements about treatment effects: A case study of sentencing and recidivism.” Sociological Methodology 28(1): 99-137

Masten, Matthew A., and Alexandre Poirier. 2018. “Identification of treatment effects under conditional partial independence.” Econometrica 86(1): 317-351

McCandless, Lawrence C., Paul Gustafson, and Adrian Levy. 2007. “Bayesian sensitivity analysis for unmeasured confounding in observational studies.” Statistics in Medicine 26(11): 2331-2347

McCandless, Lawrence C., Paul Gustafson, Adrian R. Levy, and Sylvia Richardson. 2012. “Hierarchical priors for bias parameters in Bayesian sensitivity analysis for unmeasured confounding.” Statistics in Medicine 31(4): 383-396

McCandless, Lawrence C., and Paul Gustafson. 2017. “A comparison of Bayesian and Monte Carlo sensitivity analysis for unmeasured confounding.” Statistics in Medicine 36(18): 2887-2901

Murnane, Richard J., and John B. Willett. 2011. Methods matter: Improving causal inference in educational and social science research. New York, NY: Oxford University Press

Pearl, Judea, and Dana Mackenzie. 2018. The Book of Why: The New Science of Cause and Effect. New York, NY: Basic Books

Posavac, Emil J. 2002. “Using p values to estimate the probability of a statistically significant replication.” Understanding Statistics: Statistical Issues in Psychology, Education, and the Social Sciences 1(2): 101-112

Robins, James M., Andrea Rotnitzky, and Daniel O. Scharfstein. 2000. “Sensitivity analysis for selection bias and unmeasured confounding in missing data and causal inference models.” In Statistical models in epidemiology, the environment, and clinical trials (pp. 1-94). Springer, New York, NY

Rosenbaum, Paul R. 1986. “Dropping out of high school in the United States: An observational study.” Journal of Educational Statistics 11(3): 207-224

Rosenbaum, Paul R. 1987. “Sensitivity analysis for certain permutation inferences in matched observational studies.” Biometrika 74(1): 13-26

Rosenbaum, Paul R. 1991. “Sensitivity analysis for matched case-control studies.” Biometrics: 87-100

Rosenbaum, Paul R. 2002. Observational Studies. New York, NY: Springer

Rosenbaum, Paul R. 2010. Design of Observational Studies. New York, NY: Springer

Rubin, Donald B. 2004. “Teaching statistical inference for causal effects in experiments and observational studies.” Journal of Educational and Behavioral Statistics 29(3): 343-367

Rubin, Donald B. 2005. “Causal inference using potential outcomes: Design, modeling, decisions.” Journal of the American Statistical Association 100(469): 322-331

Rubin, Donald B. 2007. “The design versus the analysis of observational studies for causal effects: parallels with the design of randomized trials.” Statistics in Medicine 26(1): 20-36

Rubin, Donald B. 2008. “For objective causal inference, design trumps analysis.” The Annals of Applied Statistics 2(3): 808-840

Schafer, Joseph L., and Joseph Kang. 2008. “Average causal effects from nonrandomized studies: a practical guide and simulated example.” Psychological Methods 13(4): 279

Schneider, Barbara, Martin Carnoy, Jeremy Kilpatrick, William H. Schmidt, and Richard J. Shavelson. 2007. Estimating causal effects using experimental and observational design. American Educational Research Association

Shadish, William R., Thomas D. Cook, and Donald T. Campbell. 2002. Experimental and quasi-experimental designs for generalized causal inference. New York, NY: Houghton Mifflin

Shao, Jun, and Shein‐Chung Chow. 2002. “Reproducibility probability in clinical trials.” Statistics in Medicine 21(12): 1727-1742

Sobel, Michael E. 1996. “An introduction to causal inference.” Sociological Methods & Research 24(3): 353-379

VanderWeele, Tyler J. 2008. “Sensitivity analysis: distributional assumptions and confounding assumptions.” Biometrics 64(2): 645-649

Appendix

Proofs of Theorem 1 and Theorem 2

Proof of theorem 1: First, the distribution of  could be derived based on the following pivotal quantity: (0,1) id id tc id id tc YY YY N − − −   (A1) The pivotal quantity (...
