
arxiv: 2605.20633 · v1 · pith:BTAQP5YRnew · submitted 2026-05-20 · 📊 stat.ME · stat.AP

Application of Propensity Score Models and Causal Estimators in Observational Studies under Model Misspecification

Pith reviewed 2026-05-21 03:09 UTC · model grok-4.3

classification 📊 stat.ME stat.AP
keywords propensity score · causal inference · model misspecification · inverse probability weighting · augmented inverse probability weighting · observational studies · simulation study

The pith

Augmented inverse probability weighting stays stable for causal estimates when models are misspecified

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests how response surface modeling, inverse probability weighting, and augmented inverse probability weighting perform when the propensity score or outcome models are wrong. It runs extensive simulations that vary the type and degree of misspecification, sample size, and covariate correlations, while comparing logistic regression to random forests, support vector machines, and linear discriminant analysis for estimating propensity scores. The simulations show that augmented inverse probability weighting keeps bias and variance low in most cases because it is doubly robust. Inverse probability weighting breaks down quickly with misspecified propensity scores or unstable machine-learning weights. Response surface modeling works only when the outcome model is correct. The same patterns appear in applications to the ACTG175 trial and Alzheimer's neuroimaging data.

Core claim

AIPW consistently provides robust and stable estimates across most scenarios due to its doubly robust property, whereas IPW is highly sensitive to PS misspecification and unstable PS estimates produced by flexible machine learning methods. RSM performs well only when the outcome model is correctly specified.

What carries the argument

The doubly robust property of the augmented inverse probability weighting estimator, which combines inverse probability weights with an outcome regression to remain consistent if either the propensity score model or the outcome model is correct.
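
For reference, one standard way to write the AIPW estimator of the average treatment effect is given below. The notation is an assumption made for this note, since the page does not reproduce the paper's own symbols: T_i is the treatment indicator, Y_i the outcome, X_i the covariates, \hat{e} the fitted propensity score, and \hat{m}_t the fitted outcome regression under treatment level t.

```latex
\hat{\tau}_{\mathrm{AIPW}}
  = \frac{1}{n} \sum_{i=1}^{n} \left[
      \hat{m}_1(X_i) - \hat{m}_0(X_i)
      + \frac{T_i \,\bigl(Y_i - \hat{m}_1(X_i)\bigr)}{\hat{e}(X_i)}
      - \frac{(1 - T_i)\,\bigl(Y_i - \hat{m}_0(X_i)\bigr)}{1 - \hat{e}(X_i)}
    \right]
```

If the outcome regression is correct, the residuals Y_i - \hat{m}_t(X_i) have conditional mean zero, so the two weighted terms vanish in expectation whatever weights are used. If instead the propensity score is correct, the weighted residual terms cancel the bias of the outcome regression in expectation. Either condition alone is enough for consistency, which is the sense of "doubly robust" here.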

If this is right

  • AIPW reduces sensitivity to errors in specifying the propensity score model.
  • Machine learning methods for propensity scores should be paired with doubly robust estimators rather than used with plain inverse probability weighting.
  • Response surface modeling delivers unbiased estimates only when the outcome model is correctly specified.
  • Real-data analyses gain reliability by comparing multiple estimators rather than relying on a single approach.
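
The pattern in these bullets can be reproduced in a toy setting. The sketch below is not the paper's simulation design: it assumes a single binary confounder, a made-up true effect of 2.0, and "models" that are simple stratum means, with misspecification induced by ignoring the confounder. All function names are hypothetical.

```python
import random
from statistics import mean

def simulate(n, seed=0):
    """Toy data (an assumption, not the paper's design): binary confounder X,
    true PS of 0.8 when X == 1 and 0.2 otherwise, Y = 1 + 2*T + 3*X + noise."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x = int(rng.random() < 0.5)
        t = int(rng.random() < (0.8 if x else 0.2))
        y = 1.0 + 2.0 * t + 3.0 * x + rng.gauss(0.0, 1.0)
        data.append((x, t, y))
    return data

def fit_ps(data, use_x):
    """Propensity score as a treated fraction; ignoring X misspecifies it."""
    groups = {}
    for x, t, _ in data:
        groups.setdefault(x if use_x else 0, []).append(t)
    table = {k: mean(v) for k, v in groups.items()}
    return lambda x: table[x if use_x else 0]

def fit_outcome(data, use_x):
    """Outcome model as cell means of Y; ignoring X misspecifies it."""
    groups = {}
    for x, t, y in data:
        groups.setdefault((x if use_x else 0, t), []).append(y)
    table = {k: mean(v) for k, v in groups.items()}
    return lambda x, t: table[(x if use_x else 0, t)]

def rsm(data, m):
    """Response surface modeling: average predicted treated-minus-control gap."""
    return mean(m(x, 1) - m(x, 0) for x, _, _ in data)

def ipw(data, e):
    """Inverse probability weighting (Horvitz-Thompson form)."""
    return mean(t * y / e(x) - (1 - t) * y / (1 - e(x)) for x, t, y in data)

def aipw(data, e, m):
    """Augmented IPW: outcome-model prediction plus weighted residuals."""
    return mean(
        m(x, 1) - m(x, 0)
        + t * (y - m(x, 1)) / e(x)
        - (1 - t) * (y - m(x, 0)) / (1 - e(x))
        for x, t, y in data
    )

data = simulate(20000, seed=1)
e_ok, e_bad = fit_ps(data, True), fit_ps(data, False)
m_ok, m_bad = fit_outcome(data, True), fit_outcome(data, False)

results = {
    "RSM, outcome correct": rsm(data, m_ok),
    "RSM, outcome wrong": rsm(data, m_bad),
    "IPW, PS correct": ipw(data, e_ok),
    "IPW, PS wrong": ipw(data, e_bad),
    "AIPW, PS correct / outcome wrong": aipw(data, e_ok, m_bad),
    "AIPW, PS wrong / outcome correct": aipw(data, e_bad, m_ok),
    "AIPW, both wrong": aipw(data, e_bad, m_bad),
}
for name, est in results.items():
    print(f"{name:36s} {est:6.3f}  (true effect 2.0)")
```

In this toy setup the estimators with at least one correct nuisance model should land close to 2.0, while the misspecified RSM and IPW estimates and the doubly misspecified AIPW estimate should sit well above it. Stratum means stand in for logistic regression and the machine-learning fits compared in the paper, so the sketch says nothing about the weight instability the paper attributes to flexible learners.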

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Analysts working with high-dimensional or complex observational data may default to doubly robust methods when employing flexible machine learning for confounding adjustment.
  • The results suggest that simulation-based comparisons under controlled misspecification can help select estimators before applying them to real studies.
  • Extending the evaluation to targeted maximum likelihood estimation or other doubly robust variants could test whether similar robustness holds beyond AIPW.

Load-bearing premise

The simulated scenarios with varying levels of PS and outcome model misspecification, sample sizes, and covariate correlation structures adequately capture the types and degrees of misspecification that occur in real observational data applications.

What would settle it

A new simulation or real dataset in which the true causal effect is known and AIPW exhibits larger bias or poorer coverage than IPW under severe double misspecification of both propensity score and outcome models.


Propensity score (PS) methods are widely used in observational studies to reduce confounding and estimate causal treatment effects. However, the validity of PS-based causal estimators depends heavily on correct model specification, and model misspecification may lead to substantial bias and instability. In this study, we systematically evaluate the performance of commonly used causal estimators, including response surface modeling (RSM), inverse probability weighting (IPW), and augmented inverse probability weighting (AIPW), under varying levels of PS and outcome model misspecification. We compare classical logistic regression with several machine learning approaches for PS estimation, including random forests (RF), support vector machines (SVM), and linear discriminant analysis (LDA). Extensive simulation studies were conducted under multiple scenarios defined by combinations of correctly specified and misspecified PS and outcome models, varying sample sizes, and different covariate correlation structures. Estimator performance was assessed using bias, absolute bias, root mean squared error, empirical standard error, and confidence interval width. Results demonstrate that AIPW consistently provides robust and stable estimates across most scenarios due to its doubly robust property, whereas IPW is highly sensitive to PS misspecification and unstable PS estimates produced by flexible machine learning methods. RSM performs well only when the outcome model is correctly specified. Real-world applications using the ACTG175 clinical trial and the Alzheimer's Disease Neuroimaging Initiative (ADNI) dataset further illustrate the practical implications of estimator choice and PS modeling strategy. Overall, our findings highlight the importance of integrating flexible machine learning approaches within doubly robust frameworks to improve causal effect estimation in observational studies.
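
The performance metrics named in the abstract can be computed from Monte Carlo replicates roughly as follows. This is a sketch under stated assumptions: the abstract does not define the metrics, so "absolute bias" is taken here as the mean absolute error across replicates, and CI width as the average width of a Wald interval built from per-replicate standard errors. The function name `summarize` is hypothetical.

```python
from math import sqrt
from statistics import mean, stdev

def summarize(estimates, std_errors, truth, z=1.96):
    """Summarize Monte Carlo replicates of one estimator.

    estimates  : point estimates, one per replicate
    std_errors : estimated standard errors, one per replicate
    truth      : the known true effect in the simulation
    z          : normal quantile for the Wald interval (assumed 95% level)
    """
    errors = [est - truth for est in estimates]
    return {
        "bias": mean(errors),
        # Assumed definition: mean absolute error across replicates.
        "abs_bias": mean(abs(err) for err in errors),
        "rmse": sqrt(mean(err * err for err in errors)),
        # Spread of the point estimates themselves.
        "empirical_se": stdev(estimates),
        # Assumed definition: average width of estimate +/- z * SE.
        "ci_width": mean(2 * z * se for se in std_errors),
    }

summary = summarize([1.0, 2.0, 3.0], [0.5, 0.5, 0.5], truth=2.0)
for name, value in summary.items():
    print(f"{name:13s} {value:.4f}")
```

Reporting bias alongside RMSE and empirical standard error separates systematic error from variability, which matters when an estimator such as IPW is unbiased in principle but unstable in practice.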

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript evaluates the performance of response surface modeling (RSM), inverse probability weighting (IPW), and augmented inverse probability weighting (AIPW) for causal effect estimation in observational studies under varying degrees of propensity score (PS) and outcome model misspecification. Simulations compare logistic regression with machine learning methods (random forests, SVM, LDA) for PS estimation across combinations of correct/misspecified models, sample sizes, and covariate correlation structures, using metrics such as bias, absolute bias, RMSE, and CI width. Real-data illustrations are provided on the ACTG175 and ADNI datasets. The central claim is that AIPW yields robust estimates due to its doubly robust property while IPW is sensitive to PS misspecification and unstable ML-based PS estimates.

Significance. If the simulation design adequately represents realistic misspecification patterns, the results would offer practical guidance for selecting doubly robust estimators when pairing flexible ML methods with PS-based causal inference. The inclusion of two real datasets adds applied relevance, though the absence of ground truth limits confirmatory power.

major comments (2)
  1. [Simulation Study] Simulation section: the construction of misspecification scenarios (explicit combinations of correct/misspecified logistic or ML models) does not include omitted interactions, non-monotonic effects, or high-dimensional sparse signals that commonly arise in observational data; this directly affects whether the reported stability ordering (AIPW robust, IPW unstable) generalizes beyond the chosen simulation grid.
  2. [Real-World Applications] Real-data applications: the ACTG175 and ADNI examples lack ground truth, so they cannot independently confirm the simulation-derived ranking of estimators; without additional benchmarks or sensitivity checks, these sections do not strengthen the central claim.
minor comments (2)
  1. [Abstract and Methods] The abstract and methods would benefit from explicit statements of the number of Monte Carlo replications and the precise functional forms used to induce misspecification.
  2. [Introduction] Notation for the doubly robust property and the definitions of the estimators could be introduced earlier with a short equation to aid readers.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We are grateful to the referee for their thorough review and valuable suggestions. We respond to the major comments point by point below, indicating where revisions will be made to the manuscript.

  1. Referee: [Simulation Study] Simulation section: the construction of misspecification scenarios (explicit combinations of correct/misspecified logistic or ML models) does not include omitted interactions, non-monotonic effects, or high-dimensional sparse signals that commonly arise in observational data; this directly affects whether the reported stability ordering (AIPW robust, IPW unstable) generalizes beyond the chosen simulation grid.

    Authors: We acknowledge that our simulation scenarios do not encompass all possible forms of misspecification, such as omitted interactions or non-monotonic effects. Our design emphasizes misspecification arising from the choice between parametric logistic regression and machine learning methods for the propensity score model, which is central to the paper's focus on integrating ML with causal estimators. In the revised manuscript, we will include additional simulation scenarios that incorporate omitted interactions and non-monotonic relationships in the data generating process to better evaluate the generalizability of the AIPW robustness. For high-dimensional sparse signals, we will discuss this as a limitation and suggest it for future work, as expanding to very high dimensions may require substantial additional computational resources. revision: partial

  2. Referee: [Real-World Applications] Real-data applications: the ACTG175 and ADNI examples lack ground truth, so they cannot independently confirm the simulation-derived ranking of estimators; without additional benchmarks or sensitivity checks, these sections do not strengthen the central claim.

    Authors: We agree with the referee that the real-data examples cannot confirm the simulation results due to the lack of ground truth. These applications are presented to demonstrate the implementation and potential discrepancies in estimates when applying the methods to real observational data. To strengthen this section, we will incorporate additional sensitivity checks, including alternative model specifications and bootstrap-based comparisons of estimator variability. We will also revise the text to emphasize that these examples serve to illustrate practical considerations rather than to validate the simulation findings. revision: yes
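
A bootstrap-based comparison of estimator variability, as proposed in the response above, could be sketched as follows. This is an illustration only, not the authors' procedure: the data are simulated, the estimator is a placeholder difference in means, and the names `bootstrap_se` and `diff_in_means` are hypothetical.

```python
import random
from statistics import mean, stdev

def diff_in_means(sample):
    """Placeholder estimator: unadjusted treated-minus-control mean outcome."""
    treated = [y for t, y in sample if t == 1]
    control = [y for t, y in sample if t == 0]
    return mean(treated) - mean(control)

def bootstrap_se(sample, estimator, n_boot=500, seed=0):
    """Nonparametric bootstrap: resample units with replacement, re-estimate,
    and report the standard deviation of the replicate estimates."""
    rng = random.Random(seed)
    n = len(sample)
    reps = []
    for _ in range(n_boot):
        resample = [sample[rng.randrange(n)] for _ in range(n)]
        reps.append(estimator(resample))
    return stdev(reps)

# Simulated (treatment, outcome) pairs with an assumed true effect of 2.0.
rng = random.Random(42)
sample = []
for _ in range(400):
    t = int(rng.random() < 0.5)
    sample.append((t, 1.0 + 2.0 * t + rng.gauss(0.0, 1.0)))

se = bootstrap_se(sample, diff_in_means)
print(f"estimate = {diff_in_means(sample):.3f}, bootstrap SE = {se:.3f}")
```

To compare RSM, IPW, and AIPW this way, each estimator passed to `bootstrap_se` would need to refit its propensity score and outcome models inside every resample, so that the reported variability includes the uncertainty from estimating those models.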

Circularity Check

0 steps flagged

No significant circularity in empirical simulation and application study


The paper's claims rest on direct computation of bias, RMSE, and related metrics from explicitly constructed simulation scenarios (combinations of correct/misspecified logistic and ML models for PS and outcome) plus applications to external datasets ACTG175 and ADNI. These performance results are generated independently of the estimators themselves and do not reduce to fitted parameters or self-referential definitions. The doubly robust property of AIPW is invoked as a pre-existing theoretical fact rather than derived here, and no self-citations, ansatzes, or uniqueness theorems from the authors appear as load-bearing steps. The evaluation chain is therefore self-contained against the controlled inputs and external data.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The study relies on standard causal inference assumptions and controlled simulation designs rather than new free parameters or invented entities. Specific simulation settings function as design choices rather than fitted parameters.

axioms (1)
  • domain assumption No unmeasured confounding and positivity assumptions hold for causal identification in the observational data and simulations
    These are invoked implicitly as the foundation for propensity score methods to estimate causal effects in observational studies.
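
In potential-outcomes notation these two assumptions are usually written as below. The symbols are an assumption made for this note rather than the paper's own: Y(1) and Y(0) are the potential outcomes, T the treatment indicator, X the measured covariates, and e(X) the propensity score.

```latex
% No unmeasured confounding (conditional exchangeability)
\bigl(Y(0),\, Y(1)\bigr) \perp T \mid X

% Positivity (overlap)
0 < e(X) = \Pr(T = 1 \mid X) < 1 \quad \text{with probability 1}
```

Positivity is what keeps the inverse weights 1/e(X) and 1/(1 - e(X)) finite. Estimated propensity scores close to 0 or 1 produce very large weights, which is one route to the IPW instability described in the core claim.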

pith-pipeline@v0.9.0 · 5840 in / 1387 out tokens · 41440 ms · 2026-05-21T03:09:39.736115+00:00 · methodology


