arxiv: 2604.14407 · v1 · submitted 2026-04-15 · 📊 stat.ME

Propensity Score Weighting to Ensure Balance in Key Subgroups or Strata: A Practical Guide

Pith reviewed 2026-05-10 12:03 UTC · model grok-4.3

classification 📊 stat.ME
keywords propensity score weighting · stratification · subgroups · electronic health records · confounding · causal inference · balance · administrative data

The pith

When patient subgroups differ substantially in prognosis, exposure likelihood, or covariate effects, stratify propensity score weighting by those clinical groups to prioritize balance and reduce confounding.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Propensity score weighting estimates treatment effects in observational data by balancing measured covariates between exposed and unexposed groups. In large electronic health record or administrative datasets, patients often fall into heterogeneous clinical subgroups, defined for example by reason for hospital admission or risk profile, and standard weighting may leave residual imbalance in subgroup composition between exposure groups. The paper argues that stratification becomes advisable when prognosis varies markedly across subgroups, when treatment assignment probabilities differ across subgroups, or when the relationships between covariates and treatment vary by subgroup. Under these conditions, fitting or adjusting weights separately within strata ensures that the treated and untreated groups contain similar proportions of each key subgroup. The result is a practical implementation guide focused on best practices for institutional-level or population-health analyses.

Core claim

The central claim is that a stratified propensity score weighting approach should be used when prognosis differs substantially between patient subgroups, likelihood of exposure differs across clinical subgroups, or covariate-exposure associations differ substantially between subgroups. This method involves stratifying the analysis by indication, reason for admission, or other clinical risk factors and performing weighting within those strata to achieve balance in the composition of the strata between exposure groups, with particular guidance for electronic health records and administrative medical data.
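
The second and third conditions can be read as statements about which parameters of the exposure model vary by stratum. The following is a minimal sketch in notation that is assumed here rather than taken from the paper: $A$ is the exposure indicator, $X$ the covariate vector, $S$ the clinical stratum, and $\hat e_s(X)$ the fitted propensity score in stratum $s$. The weights shown are inverse probability of treatment weights, one common choice; the paper may target a different estimand.

```latex
\begin{align*}
\text{pooled model:}\quad
  & \operatorname{logit} \Pr(A = 1 \mid X) = \alpha + \beta^{\top} X \\
\text{stratified model:}\quad
  & \operatorname{logit} \Pr(A = 1 \mid X, S = s) = \alpha_s + \beta_s^{\top} X \\
\text{weights:}\quad
  & w_i = \frac{A_i}{\hat e_{S_i}(X_i)} + \frac{1 - A_i}{1 - \hat e_{S_i}(X_i)}
\end{align*}
```

Under this reading, differing exposure likelihood across subgroups corresponds to intercepts $\alpha_s$ that vary with $s$, and differing covariate-exposure associations correspond to slopes $\beta_s$ that vary with $s$, which a pooled model with a single $\beta$ cannot represent.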

What carries the argument

Stratified propensity score weighting, which applies separate weighting procedures within predefined clinical strata to enforce balance in subgroup composition between treatment groups.
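
To make "balance in subgroup composition" concrete, here is a minimal Python sketch, not the authors' implementation. It assumes the simplest possible propensity model within each stratum (the observed treated proportion, with no covariates), and the stratum labels and patient counts are hypothetical. A real analysis would fit a covariate-adjusted propensity model in each stratum.

```python
from collections import Counter, defaultdict


def stratified_weights(strata, treated):
    """Inverse probability of treatment weights computed within each stratum.

    Assumption for illustration: the propensity score in a stratum is simply
    that stratum's observed treated proportion (an intercept-only model).
    """
    n_total = Counter(strata)
    n_treated = Counter(s for s, a in zip(strata, treated) if a == 1)
    weights = []
    for s, a in zip(strata, treated):
        if n_treated[s] in (0, n_total[s]):
            raise ValueError(f"stratum {s!r} has no overlap between exposure groups")
        # Equals 1 / e for treated patients and 1 / (1 - e) for untreated patients.
        n_same_arm = n_treated[s] if a == 1 else n_total[s] - n_treated[s]
        weights.append(n_total[s] / n_same_arm)
    return weights


def weighted_composition(strata, treated, weights, arm):
    """Weighted share of each stratum within one exposure group."""
    totals = defaultdict(float)
    for s, a, w in zip(strata, treated, weights):
        if a == arm:
            totals[s] += w
    grand_total = sum(totals.values())
    return {s: t / grand_total for s, t in totals.items()}


# Hypothetical data: 20 patients in two admission-reason strata.
strata = ["cardiac"] * 10 + ["infection"] * 10
treated = [1] * 8 + [0] * 2 + [1] * 2 + [0] * 8

weights = stratified_weights(strata, treated)
comp_treated = weighted_composition(strata, treated, weights, arm=1)
comp_untreated = weighted_composition(strata, treated, weights, arm=0)
print(comp_treated, comp_untreated)
```

Before weighting, 8 of the 10 treated patients but only 2 of the 10 untreated patients are in the cardiac stratum. After weighting, each stratum makes up half of both exposure groups.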

Load-bearing premise

The chosen clinical subgroups are meaningful and well defined in the data, and weighting separately within them will not introduce new biases or excessively reduce statistical power.

What would settle it

A simulation or real dataset analysis in which the true treatment effect is known, the subgroups are misspecified or overlapping, and the stratified weights produce larger bias or worse balance diagnostics than a single pooled propensity score model.
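
The shape of such a check can be sketched with a toy dataset in which the treatment effect is known by construction. This is an illustration under strong assumptions, not the paper's simulation: all counts, outcomes, and labels are hypothetical, the weights use each stratum's treated proportion as the propensity score, and the misspecified labels `X` and `Y` are deliberately built to be unrelated to prognosis. It shows only that badly chosen strata can fail to remove bias; it does not compare against a covariate-adjusted pooled model.

```python
from collections import Counter

TRUE_EFFECT = 1.0  # known by construction in this toy dataset


def stratum_weights(labels, treated):
    """Inverse probability weights from each stratum's treated proportion."""
    n = Counter(labels)
    n1 = Counter(s for s, a in zip(labels, treated) if a == 1)
    return [n[s] / n1[s] if a == 1 else n[s] / (n[s] - n1[s])
            for s, a in zip(labels, treated)]


def weighted_effect(outcome, treated, weights):
    """Difference in weighted mean outcome, treated minus untreated."""
    def arm_mean(arm):
        num = sum(w * y for y, a, w in zip(outcome, treated, weights) if a == arm)
        den = sum(w for a, w in zip(treated, weights) if a == arm)
        return num / den
    return arm_mean(1) - arm_mean(0)


# Each cell: (true prognostic stratum, misspecified label, treated, patient count)
cells = [
    ("low", "X", 1, 4), ("low", "X", 0, 1), ("high", "X", 1, 1), ("high", "X", 0, 4),
    ("low", "Y", 1, 4), ("low", "Y", 0, 1), ("high", "Y", 1, 1), ("high", "Y", 0, 4),
]
baseline = {"low": 10.0, "high": 20.0}  # untreated outcome by true stratum

labels_true, labels_wrong, treated, outcome = [], [], [], []
for true_s, wrong_s, a, count in cells:
    for _ in range(count):
        labels_true.append(true_s)
        labels_wrong.append(wrong_s)
        treated.append(a)
        outcome.append(baseline[true_s] + TRUE_EFFECT * a)

bias_correct = weighted_effect(
    outcome, treated, stratum_weights(labels_true, treated)) - TRUE_EFFECT
bias_misspecified = weighted_effect(
    outcome, treated, stratum_weights(labels_wrong, treated)) - TRUE_EFFECT
print(bias_correct, bias_misspecified)
```

With the true strata the weighted estimate recovers the effect, so the bias is zero. With the uninformative labels every patient receives the same weight and the estimate keeps the full confounding bias of the unweighted comparison.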

Figures

Figures reproduced from arXiv: 2604.14407 by Amol A. Verma, Emma K. Mackay, Fahad Razak, Surain B. Roberts.

Figure 1. Age Distribution by Stratum and Exposure Group for …

Original abstract

Propensity score weighting approaches have been widely implemented in clinical research to estimate the effects of a treatment or exposure while mitigating the risk of confounding in the absence of random assignment. In practice, when working with large electronic health records (EHR) or administrative datasets to evaluate health quality outcomes at the institutional level, or evaluate supportive care interventions for a wide range of hospitalized patients, it may be advisable to stratify the propensity score weighting approach by indication, reason for admission, or other clinical risk factors due to the potential for substantial heterogeneity across subgroups of patients with complex care needs. A stratified approach may be appropriate if (i) prognosis differs substantially between patient subgroups such that achieving balance in the composition of these strata between exposure/treatment groups should be prioritized, (ii) likelihood of exposure differs substantially across clinical subgroups, or (iii) the covariate-exposure associations are expected to differ substantially between subgroups (i.e. there are covariate-subgroup interactions in the exposure/treatment propensity model). For example, we may want to evaluate the impact of prophylactic anticoagulant use for venous thromboembolism prevention in elderly patients admitted to hospital for a wide array of conditions. The purpose of this article is to outline an approach to implementing propensity score weighting with stratification by clinical groups. We also provide guidance on best practices with particular focus on EHR and administrative medical data, and population health settings.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 3 minor

Summary. The manuscript is a practical guide for implementing propensity score weighting stratified by pre-specified clinical subgroups or strata when analyzing large electronic health records or administrative datasets. It identifies three conditions under which stratification may be advisable: (i) substantial differences in prognosis between patient subgroups, (ii) substantial differences in exposure likelihood across subgroups, or (iii) substantial differences in covariate-exposure associations between subgroups. The paper illustrates the approach with the example of evaluating prophylactic anticoagulant use for venous thromboembolism prevention in elderly hospitalized patients and provides best practices focused on EHR/administrative data and population health settings.

Significance. If the recommendations hold, the guide addresses a recurring practical challenge in causal inference with heterogeneous populations by prioritizing balance within key strata, which can reduce bias from effect heterogeneity or propensity model misspecification. It translates established principles from the causal inference literature into actionable advice for applied researchers, offering value in real-world data settings where unstratified weighting may fail to achieve adequate balance. As a descriptive rather than theoretical contribution, its significance rests on the clarity and specificity of the implementation guidance provided.

minor comments (3)
  1. The manuscript would benefit from a step-by-step outline or pseudocode for the stratified propensity score weighting procedure, including how weights are computed and combined across strata.
  2. The example of prophylactic anticoagulant use is referenced but lacks concrete details on data structure, subgroup definitions, or before/after balance metrics to demonstrate the method in practice.
  3. Consider adding citations to key references on propensity score stratification and effect heterogeneity (e.g., work extending Rosenbaum and Rubin) to ground the three conditions in the existing literature.
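
As an illustration of the before/after balance metrics that the second comment asks for, the following Python sketch computes a standardized mean difference for age with and without stratified weights. It is a hedged example, not the paper's code: the cohort is hypothetical, the weights use each stratum's treated proportion as the propensity score, and the denominator is the pooled unweighted standard deviation, which is one convention among several.

```python
from collections import Counter
from statistics import variance


def weighted_mean(values, weights):
    return sum(w * v for v, w in zip(values, weights)) / sum(weights)


def smd(x, treated, weights=None):
    """Standardized mean difference of covariate x between exposure groups.

    Assumption: the denominator is the pooled unweighted standard deviation,
    so the before and after values are on the same scale.
    """
    if weights is None:
        weights = [1.0] * len(x)
    x1 = [v for v, a in zip(x, treated) if a == 1]
    x0 = [v for v, a in zip(x, treated) if a == 0]
    w1 = [w for w, a in zip(weights, treated) if a == 1]
    w0 = [w for w, a in zip(weights, treated) if a == 0]
    pooled_sd = ((variance(x1) + variance(x0)) / 2) ** 0.5
    return (weighted_mean(x1, w1) - weighted_mean(x0, w0)) / pooled_sd


# Hypothetical cohort: (stratum, treated, age). Age is distributed identically
# across exposure groups within each stratum, but the strata differ in
# exposure rates.
rows = (
    [("young", 1, v) for v in (40, 50)]
    + [("young", 0, v) for v in (40, 50) * 3]
    + [("old", 1, v) for v in (70, 80) * 3]
    + [("old", 0, v) for v in (70, 80)]
)
strata = [r[0] for r in rows]
treated = [r[1] for r in rows]
ages = [r[2] for r in rows]

# Stratified weights: stratum size divided by the size of the patient's own arm.
n_total = Counter(strata)
n_arm = Counter(zip(strata, treated))
weights = [n_total[s] / n_arm[(s, a)] for s, a in zip(strata, treated)]

smd_before = smd(ages, treated)
smd_after = smd(ages, treated, weights)
print(smd_before, smd_after)
```

In this constructed example the unweighted difference is large because treated patients are concentrated in the older stratum, and the weighted difference vanishes because age is balanced within strata. In real data the weighted value would be compared against a threshold chosen by the analyst; an absolute value below 0.1 is a commonly used rule of thumb.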

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive review and recommendation of minor revision. The report provides a helpful summary of the manuscript's focus on stratified propensity score weighting for balancing key subgroups in observational studies using EHR and administrative data. No specific major comments or points requiring clarification were raised.

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The manuscript is a practical guide outlining when and how to apply propensity score weighting within pre-specified clinical strata. Its central content consists of three standard descriptive conditions for preferring stratification (prognosis differences, exposure likelihood differences, and covariate-exposure interactions), which are presented as established considerations from the causal inference literature rather than as derivations, fitted parameters, or self-referential claims. No equations, ansatzes, uniqueness theorems, or load-bearing self-citations appear; the argument is self-contained descriptive guidance without any reduction of outputs to inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is a methodological guide paper with no new mathematical models, parameters, or entities introduced.

pith-pipeline@v0.9.0 · 5559 in / 1039 out tokens · 55120 ms · 2026-05-10T12:03:14.237126+00:00 · methodology


Reference graph

Works this paper leans on

33 extracted references · 33 canonical work pages · 1 internal anchor

  1. [1]

    Hennessy, S. et al. Real-World Data and Real-World Evidence in Regulatory Decision Making: Report Summary From the Council for International Organizations of Medical Sciences (CIOMS) Working Group XIII. Pharmacoepidemiology and Drug Safety 34, e70117 (2025)

  2. [2]

    Patel, D. et al. Use of external comparators for health technology assessment submissions based on single-arm trials. Value in Health 24, 1118–1125 (2021)

  3. [3]

    Hernán, M. A. & Robins, J. M. Using big data to emulate a target trial when a randomized trial is not available. American Journal of Epidemiology 183, 758–764 (2016)

  4. [4]

    Hernán, M. A., Wang, W. & Leaf, D. E. Target trial emulation: a framework for causal inference from observational data. Journal of the American Medical Association 328, 2446–2447 (2022)

  5. [5]

    Hernán, M. A. & Robins, J. M. Causal inference: What if (Chapman & Hall/CRC, Boca Raton, 2020)

  6. [6]

    Ho, D. E., Imai, K., King, G. & Stuart, E. A. Matching as nonparametric preprocessing for reducing model dependence in parametric causal inference. Political Analysis 15, 199–236 (2007)

  7. [7]

    Rosenbaum, P. R. & Rubin, D. B. The central role of the propensity score in observational studies for causal effects. Biometrika 70, 41–55 (1983)

  8. [8]

    Austin, P. C. An introduction to propensity score methods for reducing the effects of confounding in observational studies. Multivariate Behavioral Research 46, 399–424 (2011)

  9. [9]

    Williamson, E., Morley, R., Lucas, A. & Carpenter, J. Propensity scores: from naive enthusiasm to intuitive understanding. Statistical Methods in Medical Research 21, 273–293 (2012)

  10. [10]

    Zubizarreta, J. R. Stable weights that balance covariates for estimation with incomplete outcome data. Journal of the American Statistical Association 110, 910–922 (2015)

  11. [11]

    Funk, M. J. et al. Doubly robust estimation of causal effects. American Journal of Epidemiology 173, 761–767 (2011)

  12. [12]

    Schuler, M. S. & Rose, S. Targeted maximum likelihood estimation for causal inference in observational studies. American Journal of Epidemiology 185, 65–73 (2017)

  13. [13]

    Cook, T. D. & DeMets, D. L. in Introduction to Statistical Methods for Clinical Trials 1–28 (Chapman & Hall/CRC, Boca Raton, 2007)

  14. [14]

    Cook, T. D. & DeMets, D. L. in Introduction to Statistical Methods for Clinical Trials 141–170 (Chapman & Hall/CRC, Boca Raton, 2007)

  15. [15]

    Phillippo, D. M. et al. Effect modification and non-collapsibility together may lead to conflicting treatment decisions: A review of marginal and conditional estimands and recommendations for decision-making. Research Synthesis Methods, 1–27 (2025)

  16. [16]

    Remiro-Azócar, A. et al. Marginal and conditional summary measures: transportability and compatibility across studies. arXiv: 2507.21925 [stat.ME] (2025). https://arxiv.org/abs/2507.21925

  17. [17]

    International Council for Harmonisation of Technical Requirements for Pharmaceuticals for Human Use. Addendum on estimands and sensitivity analysis in clinical trials to the guideline on statistical principles for clinical trials E9(R1) https://database.ich.org/sites/default/files/E9-R1 2019

  18. [18]

    Xu, S. et al. Use of stabilized inverse propensity scores as weights to directly estimate relative risk and its confidence intervals. Value in Health 13, 273–277 (2010)

  19. [19]

    Gupta, A. et al. Transportability of nonlocal real-world evidence and its relevance to health technology assessment: a primer. Journal of Comparative Effectiveness Research 14, e250041 (2025)

  20. [20]

    Mansournia, M. A., Nazemipour, M., Naimi, A. I., Collins, G. S. & Campbell, M. J. Reflection on modern methods: demystifying robust standard errors for epidemiologists. International Journal of Epidemiology 50, 346–351 (2021)

  21. [21]

    Austin, P. C. Variance estimation when using inverse probability of treatment weighting (IPTW) with survival analysis. Statistics in Medicine 35, 5642–5655 (2016)

  22. [22]

    Austin, P. C. Bootstrap vs asymptotic variance estimation when using propensity score weighting with continuous and binary outcomes. Statistics in Medicine 41, 4426–4443 (2022)

  23. [23]

    Little, R. J. & Rubin, D. B. in Statistical Analysis with Missing Data 41–58 (John Wiley & Sons, 2002)

  24. [24]

    Phillippo, D., Ades, A., Dias, S, Palmer, S & Abrams KR and , W. N. NICE DSU Technical Support Document 18: Methods for population-adjusted indirect comparisons in submission to NICE. Available from https://sheffield.ac.uk/nice-dsu/tsds/full-list. 2016

  25. [25]

    Golinelli, D., Ridgeway, G., Rhoades, H., Tucker, J. & Wenzel, S. Bias and variance trade-offs when combining propensity score weighting and regression: with an application to HIV status and homeless men. Health Services and Outcomes Research Methodology 12, 104–118 (2012)

  26. [26]

    Austin, P. C. Balance diagnostics for comparing the distribution of baseline covariates between treatment groups in propensity-score matched samples. Statistics in Medicine 28, 3083–3107 (2009)

  27. [27]

    Latimer, N. NICE DSU technical support document 14: survival analysis for economic evaluations alongside clinical trials-extrapolation with patient-level data. Available from https://sheffield.ac.uk/nic 2011

  28. [28]

    VanderWeele, T. J. Principles of confounder selection. European Journal of Epidemiology 34, 211–219 (2019)

  29. [29]

    Harrell, F. in Regression Modelling Strategies: With applications to linear models, logistic and ordinal regression, and survival analysis 13–44 (Springer, 2015)

  30. [30]

    Harrell, F. in Regression Modelling Strategies: With applications to linear models, logistic and ordinal regression, and survival analysis 63–102 (Springer, 2015)

  31. [31]

    Scola, G. et al. Implementation of the trial emulation approach in medical research: a scoping review. BMC Medical Research Methodology 23 (2023)

  32. [32]

    Zuo, H., Yu, L., Campbell, S. M., Yamamoto, S. S. & Yuan, Y. The implementation of target trial emulation for causal inference: a scoping review. Journal of Clinical Epidemiology 162, 29–37 (2023)

  33. [33]

    Stuart, E. A. Matching methods for causal inference: A review and a look forward. Statistical Science 25, 1–21 (2010)

Appendix: R Code for Demonstration

    library(tidyverse)
    library(cobalt)
    library(knitr)

    # set random seed
    set.seed(21082025)

    # parameters for simulating baseline characteristic data
    n <- c(30, 50, 70, 30)
    mu_age <- c(60, 45, 70, 50)
    sigma_age...