Propensity Score Weighting to Ensure Balance in Key Subgroups or Strata: A Practical Guide
Pith reviewed 2026-05-10 12:03 UTC · model grok-4.3
The pith
When patient subgroups differ substantially in prognosis, exposure likelihood, or covariate effects, stratify propensity score weighting by those clinical groups to prioritize balance and reduce confounding.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that a stratified propensity score weighting approach may be appropriate when prognosis differs substantially between patient subgroups, when likelihood of exposure differs across clinical subgroups, or when covariate-exposure associations differ substantially between subgroups. The method stratifies the analysis by indication, reason for admission, or other clinical risk factors and performs the weighting within each stratum, so that the exposure groups end up balanced in their stratum composition. The guidance is aimed in particular at electronic health records and administrative medical data.
What carries the argument
Stratified propensity score weighting, which applies separate weighting procedures within predefined clinical strata to enforce balance in subgroup composition between treatment groups.
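As a hedged illustration of what "separate weighting procedures within strata" could mean, the quantities can be written as follows. This page gives no formulas, so the notation is an assumption: $S$ is the clinical stratum, $A$ the binary exposure, $X$ the covariates, and the weights shown are the usual inverse probability of treatment weights for an average treatment effect, which may not be the estimand the paper targets.

```latex
e_s(x) = \Pr(A = 1 \mid X = x,\; S = s), \qquad
w_i = \frac{A_i}{e_{S_i}(X_i)} + \frac{1 - A_i}{1 - e_{S_i}(X_i)} .
```

Estimating $e_s$ separately in each stratum lets every coefficient of the propensity model vary by stratum, which corresponds to the covariate-subgroup interactions named in condition (iii) of the abstract.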
Load-bearing premise
The chosen clinical subgroups are meaningful, well-defined in the data, and performing separate weighting within them will not introduce new biases or excessively reduce statistical power.
What would settle it
A simulation or real dataset analysis in which the true treatment effect is known, the subgroups are misspecified or overlapping, and the stratified weights produce larger bias or worse balance diagnostics than a single pooled propensity score model.
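A check of this kind could be set up along the following lines. This is a minimal base R sketch under stated assumptions, not an analysis from the paper: the data-generating model, the parameter values, the 0.3 relabelling probability, and the helper names `ps_within`, `ipw_diff`, and `one_rep` are all hypothetical. It compares a pooled propensity model (stratum main effects only), weights fitted within the true strata, and weights fitted within deliberately misclassified strata.

```r
# Hypothetical simulation sketch; all names and parameter values are assumptions.
set.seed(2)
true_effect <- 1   # known treatment effect built into the outcome model

# Propensity scores from separate logistic models fitted within groups g.
ps_within <- function(d, g) {
  ps <- numeric(nrow(d))
  for (k in unique(g)) {
    idx <- g == k
    ps[idx] <- fitted(glm(a ~ age, family = binomial(), data = d[idx, ]))
  }
  ps
}

# Weighted (Hajek-type) difference in mean outcome between exposure groups.
ipw_diff <- function(d, ps) {
  w <- ifelse(d$a == 1, 1 / ps, 1 / (1 - ps))
  weighted.mean(d$y[d$a == 1], w[d$a == 1]) -
    weighted.mean(d$y[d$a == 0], w[d$a == 0])
}

one_rep <- function(n_per = 300) {
  n   <- 3 * n_per
  s   <- rep(1:3, each = n_per)   # true clinical stratum
  age <- rnorm(n)                 # standardized covariate
  # Exposure model: intercept and age slope both differ by stratum.
  a   <- rbinom(n, 1, plogis(c(-1, 0, 1)[s] + c(0.5, 0, -0.5)[s] * age))
  # Outcome: prognosis differs by stratum; the treatment effect is constant.
  y   <- true_effect * a + 2 * s + age + rnorm(n)
  d   <- data.frame(s, age, a, y)
  # Misspecified strata: each label is redrawn at random with probability 0.3.
  s_bad <- ifelse(runif(n) < 0.3, sample(1:3, n, replace = TRUE), s)

  # Pooled model: one fit with stratum main effects, no interactions.
  ps_pooled <- fitted(glm(a ~ age + factor(s), family = binomial(), data = d))
  c(pooled        = ipw_diff(d, ps_pooled),
    stratified    = ipw_diff(d, ps_within(d, d$s)),
    misclassified = ipw_diff(d, ps_within(d, s_bad)))
}

res  <- t(replicate(100, one_rep()))
bias <- colMeans(res) - true_effect
print(round(bias, 3))
```

The printed vector gives the average estimation error of each approach over 100 replicates. Which approach comes out ahead depends on the assumed data-generating model, so the sketch is a template for the comparison and not evidence for either side.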
Original abstract
Propensity score weighting approaches have been widely implemented in clinical research to estimate the effects of a treatment or exposure while mitigating the risk of confounding in the absence of random assignment. In practice, when working with large electronic health records (EHR) or administrative datasets to evaluate health quality outcomes at the institutional level, or evaluate supportive care interventions for a wide range of hospitalized patients, it may be advisable to stratify the propensity score weighting approach by indication, reason for admission, or other clinical risk factors due to the potential for substantial heterogeneity across subgroups of patients with complex care needs. A stratified approach may be appropriate if (i) prognosis differs substantially between patient subgroups such that achieving balance in the composition of these strata between exposure/treatment groups should be prioritized, (ii) likelihood of exposure differs substantially across clinical subgroups, or (iii) the covariate-exposure associations are expected to differ substantially between subgroups (i.e. there are covariate-subgroup interactions in the exposure/treatment propensity model). For example, we may want to evaluate the impact of prophylactic anticoagulant use for venous thromboembolism prevention in elderly patients admitted to hospital for a wide array of conditions. The purpose of this article is to outline an approach to implementing propensity score weighting with stratification by clinical groups. We also provide guidance on best practices with particular focus on EHR and administrative medical data, and population health settings.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript is a practical guide for implementing propensity score weighting stratified by pre-specified clinical subgroups or strata when analyzing large electronic health records or administrative datasets. It identifies three conditions under which stratification may be advisable: (i) substantial differences in prognosis between patient subgroups, (ii) substantial differences in exposure likelihood across subgroups, or (iii) substantial differences in covariate-exposure associations between subgroups. The paper illustrates the approach with the example of evaluating prophylactic anticoagulant use for venous thromboembolism prevention in elderly hospitalized patients and provides best practices focused on EHR/administrative data and population health settings.
Significance. If the recommendations hold, the guide addresses a recurring practical challenge in causal inference with heterogeneous populations by prioritizing balance within key strata, which can reduce bias from effect heterogeneity or propensity model misspecification. It translates established principles from the causal inference literature into actionable advice for applied researchers, offering value in real-world data settings where unstratified weighting may fail to achieve adequate balance. As a descriptive rather than theoretical contribution, its significance rests on the clarity and specificity of the implementation guidance provided.
Minor comments (3)
- The manuscript would benefit from a step-by-step outline or pseudocode for the stratified propensity score weighting procedure, including how weights are computed and combined across strata.
- The example of prophylactic anticoagulant use is referenced but lacks concrete details on data structure, subgroup definitions, or before/after balance metrics to demonstrate the method in practice.
- Consider adding citations to key references on propensity score stratification and effect heterogeneity (e.g., work extending Rosenbaum and Rubin) to ground the three conditions in the existing literature.
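One way to picture the step-by-step outline requested in the first comment is the base R sketch below. It is an illustration written for this page and not the manuscript's procedure: the simulated data, the names `dat`, `fit_stratum`, and `weighted`, the choice of average-treatment-effect weights, and the within-arm rescaling are all assumptions.

```r
# Minimal sketch: propensity score weighting fitted separately within strata.
# All object names and parameter values are hypothetical.
set.seed(1)

# Simulate three clinical strata with different exposure prevalence and
# different age-exposure associations (a covariate-by-stratum interaction).
n_s   <- c(300, 300, 300)
b0    <- c(-1, 0, 1)       # stratum-specific intercepts (log-odds scale)
b_age <- c(0.5, 0, -0.5)   # stratum-specific age coefficients
dat <- do.call(rbind, lapply(1:3, function(s) {
  age <- rnorm(n_s[s])     # standardized age
  data.frame(stratum = s, age = age,
             treat = rbinom(n_s[s], 1, plogis(b0[s] + b_age[s] * age)))
}))

# Steps 1-3: within one stratum, fit the propensity model, form inverse
# probability weights, and rescale each arm's weights to the stratum size.
fit_stratum <- function(d) {
  ps <- fitted(glm(treat ~ age, family = binomial(), data = d))
  w  <- ifelse(d$treat == 1, 1 / ps, 1 / (1 - ps))
  # After rescaling, both arms represent the full stratum.
  d$w <- w * nrow(d) / ave(w, d$treat, FUN = sum)
  d
}

# Step 4: stack the weighted strata back into one data set.
weighted <- do.call(rbind, lapply(split(dat, dat$stratum), fit_stratum))

# Step 5: compare the weighted stratum composition of the two arms.
comp <- xtabs(w ~ stratum + treat, data = weighted)
print(prop.table(comp, margin = 2))
```

Because each arm's weights are rescaled to sum to the stratum size, both columns of the printed table show the same stratum shares by construction. Covariate balance within strata (here, `age`) still has to be checked separately, for example with weighted standardized mean differences.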
Simulated Author's Rebuttal
We thank the referee for their positive review and recommendation of minor revision. The report provides a helpful summary of the manuscript's focus on stratified propensity score weighting for balancing key subgroups in observational studies using EHR and administrative data. No specific major comments or points requiring clarification were raised.
Circularity Check
No significant circularity detected
The manuscript is a practical guide outlining when and how to apply propensity score weighting within pre-specified clinical strata. Its central content consists of three standard descriptive conditions for preferring stratification (prognosis differences, exposure likelihood differences, and covariate-exposure interactions), which are presented as established considerations from the causal inference literature rather than as derivations, fitted parameters, or self-referential claims. No equations, ansatzes, uniqueness theorems, or load-bearing self-citations appear; the argument is self-contained descriptive guidance without any reduction of outputs to inputs by construction.
Appendix: R Code for Demonstration
library(tidyverse)
library(cobalt)
library(knitr)
# set random seed
set.seed(21082025)
# parameters for simulating baseline characteristic data
n <- c(30, 50, 70, 30)
mu_age <- c(60, 45, 70, 50)
sigma_age...