Propensity Score Weighting to Ensure Balance in Key Subgroups or Strata: A Practical Guide

Amol A. Verma; Emma K. Mackay; Fahad Razak; Surain B. Roberts

arxiv: 2604.14407 · v1 · submitted 2026-04-15 · 📊 stat.ME

Pith reviewed 2026-05-10 12:03 UTC · model grok-4.3

keywords: propensity score weighting · stratification · subgroups · electronic health records · confounding · causal inference · balance · administrative data

The pith

When patient subgroups differ substantially in prognosis, exposure likelihood, or covariate effects, stratify propensity score weighting by those clinical groups to prioritize balance and reduce confounding.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Propensity score weighting estimates treatment effects in observational data by balancing measured covariates between exposed and unexposed groups. In large electronic health records or administrative datasets, patients often fall into heterogeneous clinical subgroups such as different reasons for hospital admission or risk profiles, where standard weighting may leave residual imbalances in subgroup composition. The paper argues that stratification becomes advisable precisely when prognosis varies markedly across groups, treatment assignment probabilities differ, or the relationships between covariates and treatment vary by subgroup. Under these conditions, fitting or adjusting weights separately within strata ensures the treated and untreated groups have similar proportions of each key subgroup. The result is a practical implementation guide focused on best practices for institutional-level or population-health analyses.

Core claim

The central claim is that a stratified propensity score weighting approach should be used when prognosis differs substantially between patient subgroups, likelihood of exposure differs across clinical subgroups, or covariate-exposure associations differ substantially between subgroups. This method involves stratifying the analysis by indication, reason for admission, or other clinical risk factors and performing weighting within those strata to achieve balance in the composition of the strata between exposure groups, with particular guidance for electronic health records and administrative medical data.
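One way to write down the balance target described here is sketched below. The notation is an assumption made for illustration and does not come from the paper: A is a binary exposure, X the measured covariates, and S the clinical stratum. The weights shown are standard inverse probability of treatment weights for an average treatment effect; the paper may target a different estimand or use stabilized weights.

```latex
% Assumed notation (not from the paper): A = exposure (0/1), X = covariates, S = stratum.
% Stratum-specific propensity score and the resulting inverse probability weight:
e_s(x) = \Pr(A = 1 \mid X = x,\; S = s), \qquad
w_i = \frac{A_i}{e_{S_i}(X_i)} + \frac{1 - A_i}{1 - e_{S_i}(X_i)}

% Balance in stratum composition: for every stratum s and each arm a \in \{0, 1\},
% the weighted share of stratum s should be close to its share in the full cohort.
\frac{\sum_i w_i \,\mathbf{1}[A_i = a]\,\mathbf{1}[S_i = s]}
     {\sum_i w_i \,\mathbf{1}[A_i = a]} \;\approx\; \Pr(S = s)
```

Condition (iii) in the claim corresponds to letting the form of e_s(x) differ across strata, rather than forcing one shared model with S entering only as a main effect.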

What carries the argument

Stratified propensity score weighting, which applies separate weighting procedures within predefined clinical strata to enforce balance in subgroup composition between treatment groups.
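A minimal sketch of this idea, under simplifying assumptions that are not taken from the paper: one binary covariate `x`, two hypothetical strata (`cardiac`, `surgical`), and a propensity score estimated separately in each stratum as the exposed fraction in each (stratum, covariate) cell. With a single binary covariate this matches what a saturated per-stratum logistic model would fit. The function names and the toy cohort are illustrative only.

```python
from collections import defaultdict


def stratified_ipw(rows):
    """Inverse probability weights with the propensity estimated per (stratum, x) cell.

    rows: list of dicts with keys 'stratum', 'x' (discrete covariate), 'a' (0/1 exposure).
    """
    n = defaultdict(int)          # patients per cell
    n_exposed = defaultdict(int)  # exposed patients per cell
    for r in rows:
        cell = (r["stratum"], r["x"])
        n[cell] += 1
        n_exposed[cell] += r["a"]

    weights = []
    for r in rows:
        cell = (r["stratum"], r["x"])
        e = n_exposed[cell] / n[cell]  # estimated propensity in this cell
        if e == 0.0 or e == 1.0:
            # No overlap: the cell contains only exposed or only unexposed patients.
            raise ValueError(f"no overlap in cell {cell}")
        weights.append(1 / e if r["a"] == 1 else 1 / (1 - e))
    return weights


def stratum_shares(rows, weights, arm):
    """Weighted share of each stratum within one exposure arm."""
    totals = defaultdict(float)
    for r, w in zip(rows, weights):
        if r["a"] == arm:
            totals[r["stratum"]] += w
    grand_total = sum(totals.values())
    return {s: t / grand_total for s, t in totals.items()}


# Hypothetical toy cohort: (stratum, x, number of patients, number exposed).
cells = [
    ("cardiac", 0, 8, 2),
    ("cardiac", 1, 12, 9),
    ("surgical", 0, 20, 4),
    ("surgical", 1, 10, 5),
]
rows = [
    {"stratum": s, "x": x, "a": int(i < k)}
    for s, x, n, k in cells
    for i in range(n)
]

weights = stratified_ipw(rows)
raw = [1.0] * len(rows)  # unweighted comparison
for arm in (1, 0):
    print("arm", arm, "raw:", stratum_shares(rows, raw, arm),
          "weighted:", stratum_shares(rows, weights, arm))
```

In this toy cohort the cardiac stratum makes up 55% of exposed and 30% of unexposed patients before weighting, and 40% of both arms after weighting, which is its share of the full cohort. Real analyses would replace the cell proportions with a fitted propensity model per stratum and would need to handle sparse cells, which this sketch only flags with an error.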

Load-bearing premise

The chosen clinical subgroups are meaningful, well-defined in the data, and performing separate weighting within them will not introduce new biases or excessively reduce statistical power.

What would settle it

A simulation or real dataset analysis in which the true treatment effect is known, the subgroups are misspecified or overlapping, and the stratified weights produce larger bias or worse balance diagnostics than a single pooled propensity score model.
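A comparison of this kind could be set up along the following lines. This is a sketch of a test harness, not the paper's simulation: the data-generating process, the effect size, the `mislabel` parameter, and all function names are assumptions chosen for illustration. The pooled estimator here deliberately ignores the stratum altogether, which is a cruder comparator than a pooled model that includes stratum as a covariate.

```python
import random
from collections import defaultdict

TRUE_EFFECT = 1.0  # assumed treatment effect, known by construction


def simulate(n, mislabel, rng):
    """Return rows (s_obs, x, a, y).

    s_obs is the recorded stratum; it is flipped with probability `mislabel`
    to mimic misspecified subgroup definitions.
    """
    # Exposure probability by (true stratum, covariate); the covariate-exposure
    # association has opposite sign in the two strata.
    p_exposed = {(0, 0): 0.2, (0, 1): 0.4, (1, 0): 0.8, (1, 1): 0.6}
    rows = []
    for _ in range(n):
        s = int(rng.random() < 0.5)
        x = int(rng.random() < 0.5)
        a = int(rng.random() < p_exposed[(s, x)])
        # Prognosis differs strongly by true stratum (coefficient 3.0).
        y = TRUE_EFFECT * a + 3.0 * s + 1.0 * x + rng.gauss(0.0, 1.0)
        s_obs = 1 - s if rng.random() < mislabel else s
        rows.append((s_obs, x, a, y))
    return rows


def ipw_estimate(rows, cell_of):
    """Weighted difference in mean outcome between exposed and unexposed.

    The propensity score is the exposed fraction within each cell returned by `cell_of`.
    """
    n = defaultdict(int)
    n_exposed = defaultdict(int)
    for row in rows:
        n[cell_of(row)] += 1
        n_exposed[cell_of(row)] += row[2]

    sum_wy = [0.0, 0.0]  # indexed by exposure arm
    sum_w = [0.0, 0.0]
    for row in rows:
        a, y = row[2], row[3]
        e = n_exposed[cell_of(row)] / n[cell_of(row)]
        w = 1 / e if a == 1 else 1 / (1 - e)
        sum_wy[a] += w * y
        sum_w[a] += w
    return sum_wy[1] / sum_w[1] - sum_wy[0] / sum_w[0]


rng = random.Random(2026)
clean = simulate(20000, mislabel=0.0, rng=rng)
noisy = simulate(20000, mislabel=0.5, rng=rng)

pooled = ipw_estimate(clean, cell_of=lambda r: r[1])                  # cells by x only
stratified = ipw_estimate(clean, cell_of=lambda r: (r[0], r[1]))      # cells by (stratum, x)
stratified_noisy = ipw_estimate(noisy, cell_of=lambda r: (r[0], r[1]))

print("true effect:", TRUE_EFFECT)
print("pooled, stratum ignored:", round(pooled, 2))
print("stratified, correct strata:", round(stratified, 2))
print("stratified, uninformative strata:", round(stratified_noisy, 2))
```

With correctly recorded strata, the stratified estimate should land near the true effect while the stratum-blind estimate is confounded by prognosis. With `mislabel=0.5` the recorded stratum carries no information, so stratifying on it removes none of that confounding. The harness does not by itself show stratified weighting doing worse than a well-specified pooled model; probing that would need a richer pooled comparator, small or overlapping strata, and balance diagnostics alongside the bias.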

Figures

Figures reproduced from arXiv: 2604.14407 by Amol A. Verma, Emma K. Mackay, Fahad Razak, Surain B. Roberts.

Original abstract

Propensity score weighting approaches have been widely implemented in clinical research to estimate the effects of a treatment or exposure while mitigating the risk of confounding in the absence of random assignment. In practice, when working with large electronic health records (EHR) or administrative datasets to evaluate health quality outcomes at the institutional level, or evaluate supportive care interventions for a wide range of hospitalized patients, it may be advisable to stratify the propensity score weighting approach by indication, reason for admission, or other clinical risk factors due to the potential for substantial heterogeneity across subgroups of patients with complex care needs. A stratified approach may be appropriate if (i) prognosis differs substantially between patient subgroups such that achieving balance in the composition of these strata between exposure/treatment groups should be prioritized, (ii) likelihood of exposure differs substantially across clinical subgroups, or (iii) the covariate-exposure associations are expected to differ substantially between subgroups (i.e. there are covariate-subgroup interactions in the exposure/treatment propensity model). For example, we may want to evaluate the impact of prophylactic anticoagulant use for venous thromboembolism prevention in elderly patients admitted to hospital for a wide array of conditions. The purpose of this article is to outline an approach to implementing propensity score weighting with stratification by clinical groups. We also provide guidance on best practices with particular focus on EHR and administrative medical data, and population health settings.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

This is a practical checklist for stratifying propensity score weights by clinical subgroups in EHR studies, but it adds no new methods or evidence.

The core message is that this paper is a how-to guide, not a research contribution. It tells readers when to stratify propensity score weighting by things like admission reason or risk factors, and it lists three standard conditions: large differences in prognosis across groups, differences in exposure probability, or differences in covariate-exposure links. The prophylactic anticoagulant example is concrete and relevant for hospital data work. For applied people running observational analyses on messy administrative records, that kind of short decision list can be handy to avoid obvious balance problems in heterogeneous populations. It also flags the usual EHR issues like institutional-level evaluation and wide patient mixes. That part is straightforward and matches what most causal inference texts already say about effect heterogeneity and propensity model misspecification.

The paper does not claim to invent anything new, and it does not run any simulations or real-data comparisons to show how much the stratified version changes estimates or standard errors versus ordinary weighting. It stays descriptive. The conditions themselves are logically sound but rest on the assumption that the chosen strata are clinically meaningful and that splitting the sample will not create new selection issues or power loss in smaller cells. No discussion appears on how to pick cut-points, how to handle continuous covariates within strata, or what the asymptotic properties look like. The guidance on best practices is brief and does not drill into missing data patterns or high-dimensional covariate selection that often dominate EHR work.

This piece is written for clinical epidemiologists and health services researchers who need a quick reference rather than for statisticians looking for technical advances. It could reasonably go to peer review at a methods-oriented clinical journal where practitioners might pick up the checklist and apply it. A pure statistics outlet would likely desk-reject it for lack of novelty or formal results.

Referee Report

0 major / 3 minor

Summary. The manuscript is a practical guide for implementing propensity score weighting stratified by pre-specified clinical subgroups or strata when analyzing large electronic health records or administrative datasets. It identifies three conditions under which stratification may be advisable: (i) substantial differences in prognosis between patient subgroups, (ii) substantial differences in exposure likelihood across subgroups, or (iii) substantial differences in covariate-exposure associations between subgroups. The paper illustrates the approach with the example of evaluating prophylactic anticoagulant use for venous thromboembolism prevention in elderly hospitalized patients and provides best practices focused on EHR/administrative data and population health settings.

Significance. If the recommendations hold, the guide addresses a recurring practical challenge in causal inference with heterogeneous populations by prioritizing balance within key strata, which can reduce bias from effect heterogeneity or propensity model misspecification. It translates established principles from the causal inference literature into actionable advice for applied researchers, offering value in real-world data settings where unstratified weighting may fail to achieve adequate balance. As a descriptive rather than theoretical contribution, its significance rests on the clarity and specificity of the implementation guidance provided.

minor comments (3)

The manuscript would benefit from a step-by-step outline or pseudocode for the stratified propensity score weighting procedure, including how weights are computed and combined across strata.
The example of prophylactic anticoagulant use is referenced but lacks concrete details on data structure, subgroup definitions, or before/after balance metrics to demonstrate the method in practice.
Consider adding citations to key references on propensity score stratification and effect heterogeneity (e.g., work extending Rosenbaum and Rubin) to ground the three conditions in the existing literature.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive review and recommendation of minor revision. The report provides a helpful summary of the manuscript's focus on stratified propensity score weighting for balancing key subgroups in observational studies using EHR and administrative data. No specific major comments or points requiring clarification were raised.

Circularity Check

0 steps flagged

No significant circularity detected

The manuscript is a practical guide outlining when and how to apply propensity score weighting within pre-specified clinical strata. Its central content consists of three standard descriptive conditions for preferring stratification (prognosis differences, exposure likelihood differences, and covariate-exposure interactions), which are presented as established considerations from the causal inference literature rather than as derivations, fitted parameters, or self-referential claims. No equations, ansatzes, uniqueness theorems, or load-bearing self-citations appear; the argument is self-contained descriptive guidance without any reduction of outputs to inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is a methodological guide paper with no new mathematical models, parameters, or entities introduced.

pith-pipeline@v0.9.0 · 5559 in / 1039 out tokens · 55120 ms · 2026-05-10T12:03:14.237126+00:00 · methodology

Reference graph

Works this paper leans on

33 extracted references · 33 canonical work pages · 1 internal anchor

[1] Hennessy, S. et al. Real-World Data and Real-World Evidence in Regulatory Decision Making: Report Summary From the Council for International Organizations of Medical Sciences (CIOMS) Working Group XIII. Pharmacoepidemiology and Drug Safety 34, e70117 (2025)

[2] Patel, D. et al. Use of external comparators for health technology assessment submissions based on single-arm trials. Value in Health 24, 1118–1125 (2021)

[3] Hernán, M. A. & Robins, J. M. Using big data to emulate a target trial when a randomized trial is not available. American Journal of Epidemiology 183, 758–764 (2016)

[4] Hernán, M. A., Wang, W. & Leaf, D. E. Target trial emulation: a framework for causal inference from observational data. Journal of the American Medical Association 328, 2446–2447 (2022)

[5] Hernán, M. A. & Robins, J. M. Causal inference: What if (Chapman & Hall/CRC, Boca Raton, 2020)

[6] Ho, D. E., Imai, K., King, G. & Stuart, E. A. Matching as nonparametric preprocessing for reducing model dependence in parametric causal inference. Political Analysis 15, 199–236 (2007)

[7] Rosenbaum, P. R. & Rubin, D. B. The central role of the propensity score in observational studies for causal effects. Biometrika 70, 41–55 (1983)

[8] Austin, P. C. An introduction to propensity score methods for reducing the effects of confounding in observational studies. Multivariate Behavioral Research 46, 399–424 (2011)

[9] Williamson, E., Morley, R., Lucas, A. & Carpenter, J. Propensity scores: from naive enthusiasm to intuitive understanding. Statistical Methods in Medical Research 21, 273–293 (2012)

[10] Zubizarreta, J. R. Stable weights that balance covariates for estimation with incomplete outcome data. Journal of the American Statistical Association 110, 910–922 (2015)

[11] Funk, M. J. et al. Doubly robust estimation of causal effects. American Journal of Epidemiology 173, 761–767 (2011)

[12] Schuler, M. S. & Rose, S. Targeted maximum likelihood estimation for causal inference in observational studies. American Journal of Epidemiology 185, 65–73 (2017)

[13] Cook, T. D. & DeMets, D. L. in Introduction to Statistical Methods for Clinical Trials 1–28 (Chapman & Hall/CRC, Boca Raton, 2007)

[14] Cook, T. D. & DeMets, D. L. in Introduction to Statistical Methods for Clinical Trials 141–170 (Chapman & Hall/CRC, Boca Raton, 2007)

[15] Phillippo, D. M. et al. Effect modification and non-collapsibility together may lead to conflicting treatment decisions: A review of marginal and conditional estimands and recommendations for decision-making. Research Synthesis Methods, 1–27 (2025)

[16] Remiro-Azócar, A. et al. Marginal and conditional summary measures: transportability and compatibility across studies. arXiv: 2507.21925 [stat.ME] (2025). https://arxiv.org/abs/2507.21925

[17] International Council for Harmonisation of Technical Requirements for Pharmaceuticals for Human Use. Addendum on estimands and sensitivity analysis in clinical trials to the guideline on statistical principles for clinical trials E9(R1) https://database.ich.org/sites/default/files/E9-R1 2019

[18] Xu, S. et al. Use of stabilized inverse propensity scores as weights to directly estimate relative risk and its confidence intervals. Value in Health 13, 273–277 (2010)

[19] Gupta, A. et al. Transportability of nonlocal real-world evidence and its relevance to health technology assessment: a primer. Journal of Comparative Effectiveness Research 14, e250041 (2025)

[20] Mansournia, M. A., Nazemipour, M., Naimi, A. I., Collins, G. S. & Campbell, M. J. Reflection on modern methods: demystifying robust standard errors for epidemiologists. International Journal of Epidemiology 50, 346–351 (2021)

[21] Austin, P. C. Variance estimation when using inverse probability of treatment weighting (IPTW) with survival analysis. Statistics in Medicine 35, 5642–5655 (2016)

[22] Austin, P. C. Bootstrap vs asymptotic variance estimation when using propensity score weighting with continuous and binary outcomes. Statistics in Medicine 41, 4426–4443 (2022)

[23] Little, R. J. & Rubin, D. B. in Statistical Analysis with Missing Data 41–58 (John Wiley & Sons, 2002)

[24] Phillippo, D., Ades, A., Dias, S., Palmer, S., Abrams, K. R. & Welton, N. NICE DSU Technical Support Document 18: Methods for population-adjusted indirect comparisons in submission to NICE. Available from https://sheffield.ac.uk/nice-dsu/tsds/full-list (2016)

[25] Golinelli, D., Ridgeway, G., Rhoades, H., Tucker, J. & Wenzel, S. Bias and variance trade-offs when combining propensity score weighting and regression: with an application to HIV status and homeless men. Health Services and Outcomes Research Methodology 12, 104–118 (2012)

[26] Austin, P. C. Balance diagnostics for comparing the distribution of baseline covariates between treatment groups in propensity-score matched samples. Statistics in Medicine 28, 3083–3107 (2009)

[27] Latimer, N. NICE DSU technical support document 14: survival analysis for economic evaluations alongside clinical trials-extrapolation with patient-level data. Available from https://sheffield.ac.uk/nic 2011

[28] VanderWeele, T. J. Principles of confounder selection. European Journal of Epidemiology 34, 211–219 (2019)

[29] Harrell, F. in Regression Modelling Strategies: With applications to linear models, logistic and ordinal regression, and survival analysis 13–44 (Springer, 2015)

[30] Harrell, F. in Regression Modelling Strategies: With applications to linear models, logistic and ordinal regression, and survival analysis 63–102 (Springer, 2015)

[31] Scola, G. et al. Implementation of the trial emulation approach in medical research: a scoping review. BMC Medical Research Methodology 23 (2023)

[32] Zuo, H., Yu, L., Campbell, S. M., Yamamoto, S. S. & Yuan, Y. The implementation of target trial emulation for causal inference: a scoping review. Journal of Clinical Epidemiology 162, 29–37 (2023)

[33] Stuart, E. A. Matching methods for causal inference: A review and a look forward. Statistical Science 25, 1–21 (2010)

Appendix: R Code for Demonstration

    library(tidyverse)
    library(cobalt)
    library(knitr)
    #set random seed
    set.seed(21082025)
    #parameters for simulating baseline characteristic data
    n <- c(30, 50, 70, 30)
    mu_age <- c(60, 45, 70, 50)
    sigma_age...
