Improving Bias Correction Standards by Quantifying its Effects on Treatment Outcomes

Alexandre Abraham; Andrés Hoyos Idrobo

arXiv: 2407.14861 · v3 · submitted 2024-07-20 · stat.ML · cs.LG · stat.ME

Pith reviewed 2026-05-23 22:36 UTC · model grok-4.3

classification: stat.ML · cs.LG · stat.ME

keywords: propensity score matching · average treatment effect · bias correction · model selection · A2A metric · synthetic tasks · health databases

The pith

A2A metric selects better propensity score matches and cuts ATE estimation errors by up to 50 percent

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces A2A, a metric that builds artificial matching tasks with known outcomes to test how well different propensity score matching methods recover the true average treatment effect. When paired with standardized mean difference, A2A narrows the set of acceptable matching methods. This selection step cuts ATE estimation errors by as much as half on synthetic data and shrinks predicted ATE variability by up to 90 percent on both synthetic and real datasets. The authors also release an automated pipeline and Python package to make the full process reproducible.
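The two-stage narrowing described above can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the `MethodReport` fields, the 0.1 SMD cutoff (a common convention, not a value taken from the paper), the method names, and the toy numbers. It is not the popmatch API or the authors' selection rule.

```python
from dataclasses import dataclass


@dataclass
class MethodReport:
    """Hypothetical per-method summary; field names are illustrative."""
    name: str
    max_abs_smd: float           # worst covariate imbalance after matching
    artificial_ate_error: float  # mean |estimated - known ATE| on artificial tasks


def select_methods(reports, smd_threshold=0.1, keep=1):
    """Keep methods that pass the balance check, then rank by artificial-task error."""
    balanced = [r for r in reports if r.max_abs_smd < smd_threshold]
    ranked = sorted(balanced, key=lambda r: r.artificial_ate_error)
    return [r.name for r in ranked[:keep]]


# Toy values, invented for the example only.
reports = [
    MethodReport("nearest_neighbor", max_abs_smd=0.04, artificial_ate_error=0.30),
    MethodReport("caliper", max_abs_smd=0.08, artificial_ate_error=0.12),
    MethodReport("optimal", max_abs_smd=0.15, artificial_ate_error=0.05),
]
selected = select_methods(reports)
```

In this toy case the method with the lowest artificial-task error is excluded because it fails the balance check, so the second stage only ranks methods that were already valid under SMD.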

Core claim

A2A constructs artificial matching tasks that mirror the original data but include known outcomes, allowing direct measurement of each matching method's accuracy from propensity score estimation through to ATE calculation. Combined with standardized mean difference, it selects methods that produce more reliable treatment effect estimates.
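To make the idea of a task with a known outcome concrete, here is a minimal sketch of a generic synthetic task where the true effect is fixed in advance, so the error of a matching estimate can be measured directly. This is not the A2A construction from the paper: the data-generating process, the gradient-ascent logistic fit, and the 1-nearest-neighbor matching with replacement are all assumptions chosen for brevity. Because the simulated effect is constant, the effect on the treated equals the ATE.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def make_task(n=4000, true_ate=2.0, seed=0):
    """Simulate one confounded task: x drives both treatment and outcome."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=n)
    t = (rng.random(n) < sigmoid(1.5 * x)).astype(float)
    y = true_ate * t + 3.0 * x + rng.normal(size=n)
    return x, t, y


def fit_propensity(x, t, steps=500, lr=0.5):
    """Logistic regression on [1, x] by plain gradient ascent."""
    X = np.column_stack([np.ones_like(x), x])
    w = np.zeros(2)
    for _ in range(steps):
        w += lr * X.T @ (t - sigmoid(X @ w)) / len(x)
    return sigmoid(X @ w)


def matched_effect(ps, t, y):
    """Match each treated unit to the control with the closest propensity score."""
    treated, control = t == 1, t == 0
    order = np.argsort(ps[control])
    sorted_ps = ps[control][order]
    pos = np.searchsorted(sorted_ps, ps[treated])
    left = np.clip(pos - 1, 0, len(sorted_ps) - 1)
    right = np.clip(pos, 0, len(sorted_ps) - 1)
    use_left = np.abs(ps[treated] - sorted_ps[left]) <= np.abs(sorted_ps[right] - ps[treated])
    match = order[np.where(use_left, left, right)]
    return float(np.mean(y[treated] - y[control][match]))


x, t, y = make_task()
naive = float(y[t == 1].mean() - y[t == 0].mean())
matched = matched_effect(fit_propensity(x, t), t, y)
naive_error, matched_error = abs(naive - 2.0), abs(matched - 2.0)
```

Repeating this over many simulated tasks and several matching methods gives one error score per method, which is the kind of quantity a metric built on known outcomes can rank.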

What carries the argument

A2A, a metric that generates artificial tasks with known ground-truth outcomes to evaluate matching performance on outcome correction using covariates not involved in selection.

Load-bearing premise

The artificial matching tasks created by A2A reflect the same selection bias and covariate-outcome relationships that exist in the real data.

What would settle it

On a dataset with held-out real outcomes, the A2A-selected match produces larger ATE error than a non-selected valid match.

Figures

Figures reproduced from arXiv: 2407.14861 by Alexandre Abraham, Andrés Hoyos Idrobo.

**Figure 1.** Propensity score matching pipeline. Blue boxes indicate steps where the practitioner makes choices. Red boxes indicate steps ending with a validation. Backward arrows show points where the practitioner may revisit previous decisions.

Excerpt from §2 Methods: PSM consists of finding, in a control and treated population, two exchangeable subsets for comparison. For generality, we refer to the two initial sets as X0 and X…

**Figure 2.** Note that throughout the paper, a propensity method is deemed …

Original abstract

With the growing access to administrative health databases, retrospective studies have become crucial evidence for medical treatments. Yet, non-randomized studies frequently face selection biases, requiring mitigation strategies. Propensity score matching (PSM) addresses these biases by selecting comparable populations, allowing for analysis without further methodological constraints. However, PSM has several drawbacks. Different matching methods can produce significantly different Average Treatment Effects (ATE) for the same task, even when meeting all validation criteria. To prevent cherry-picking the best method, public authorities must involve field experts and engage in extensive discussions with researchers. To address this issue, we introduce a novel metric, A2A, to reduce the number of valid matches. A2A constructs artificial matching tasks that mirror the original ones but with known outcomes, assessing each matching method's performance comprehensively from propensity estimation to ATE estimation. When combined with Standardized Mean Difference, A2A enhances the precision of model selection, resulting in a reduction of up to 50% in ATE estimation errors across synthetic tasks and up to 90% in predicted ATE variability across both synthetic and real-world datasets. To our knowledge, A2A is the first metric capable of evaluating outcome correction accuracy using covariates not involved in selection. Computing A2A requires solving hundreds of PSMs, we therefore automate all manual steps of the PSM pipeline. We integrate PSM methods from Python and R, our automated pipeline, a new metric, and reproducible experiments into popmatch, our new Python package, to enhance reproducibility and accessibility to bias correction methods.
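For readers unfamiliar with the Standardized Mean Difference mentioned in the abstract, the sketch below computes it for one covariate using the usual definition: the difference in group means divided by the square root of the average of the two group variances. The function name is illustrative, and the paper may use a different variant (for example, a different denominator).

```python
import numpy as np


def standardized_mean_difference(x_treated, x_control):
    """SMD of one covariate between treated and control groups."""
    x_treated = np.asarray(x_treated, dtype=float)
    x_control = np.asarray(x_control, dtype=float)
    # Pooled standard deviation: square root of the mean of the sample variances.
    pooled_sd = np.sqrt((x_treated.var(ddof=1) + x_control.var(ddof=1)) / 2.0)
    return float((x_treated.mean() - x_control.mean()) / pooled_sd)


smd = standardized_mean_difference([1, 2, 3], [0, 1, 2])
```

A match is typically considered balanced on a covariate when the absolute SMD falls below a small threshold; 0.1 is a common convention, though the threshold used in the paper is not stated here.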

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces the A2A metric to improve selection among propensity score matching (PSM) methods for bias correction in retrospective health studies. A2A constructs artificial matching tasks that mirror the original data-generating process but with known outcomes, allowing evaluation of methods from propensity estimation through ATE calculation. When combined with Standardized Mean Difference (SMD), A2A is claimed to reduce ATE estimation errors by up to 50% on synthetic tasks and ATE variability by up to 90% on both synthetic and real-world datasets. The authors also present an automated PSM pipeline and the popmatch Python package integrating methods from Python and R.

Significance. If the artificial tasks reliably preserve the original selection bias and covariate-outcome relationships, A2A could reduce reliance on expert judgment for PSM method selection and improve reproducibility of ATE estimates from observational data. The open-source package with automated pipeline and reproducible experiments is a concrete contribution that could increase accessibility of bias-correction methods.

major comments (3)

[Abstract and §3] (A2A construction): the central claim that A2A-selected methods transfer to real data rests on artificial tasks 'mirror[ing] the original ones but with known outcomes.' No quantitative validation is provided that the construction preserves selection bias magnitude, partial correlations between covariates and outcome, or treatment-effect heterogeneity. Without such checks, performance gains on artificial tasks do not necessarily imply the reported 50% error reduction and 90% variability reduction on real data.
[Results] (synthetic and real experiments): the headline quantitative improvements ('up to 50% in ATE estimation errors' and 'up to 90% in predicted ATE variability') are stated without error bars, number of replications, generation details for the synthetic tasks, or statistical significance tests. This information is required to evaluate whether the gains are robust or sensitive to the particular artificial-task construction.
[Methods] (covariate handling): the claim that A2A is 'the first metric capable of evaluating outcome correction accuracy using covariates not involved in selection' requires an explicit algorithm or pseudocode showing how outcomes are assigned to the artificial tasks while holding selected covariates out of the propensity model. The current description leaves this step underspecified.
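One way the check requested in the first major comment could be approached is to compare the propensity score distribution of the original task with that of an artificial task. The sketch below approximates the 1-Wasserstein distance between two one-dimensional samples by averaging absolute differences of matching quantiles. The function name and the quantile grid size are assumptions, and this is only one of several plausible distances; it is not a procedure described in the paper.

```python
import numpy as np


def quantile_w1(a, b, n_quantiles=200):
    """Approximate 1-Wasserstein distance between two 1-D samples."""
    q = (np.arange(n_quantiles) + 0.5) / n_quantiles  # midpoints of the quantile grid
    return float(np.mean(np.abs(np.quantile(a, q) - np.quantile(b, q))))


# Stand-in propensity scores; real use would pass scores from the two tasks.
rng = np.random.default_rng(0)
original_scores = rng.beta(2.0, 5.0, size=1000)
shifted_scores = original_scores + 0.1
distance_same = quantile_w1(original_scores, original_scores)
distance_shifted = quantile_w1(original_scores, shifted_scores)
```

A distance near zero suggests the artificial task reproduces the overall shape of the selection mechanism; it says nothing on its own about covariate-outcome relationships, which would need a separate check.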

minor comments (2)

[Introduction] The introduction would benefit from a short comparison table contrasting A2A with existing selection criteria (SMD alone, balance tests) to clarify the incremental contribution.
[Figures] Ensure all figures reporting ATE variability include the underlying sample sizes and the exact PSM methods being compared.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments, which highlight important areas for strengthening the manuscript. We agree that additional validation, statistical details, and algorithmic specification will improve clarity and credibility. We outline revisions below to address each point.

Referee: [Abstract and §3] (A2A construction): the central claim that A2A-selected methods transfer to real data rests on artificial tasks 'mirror[ing] the original ones but with known outcomes.' No quantitative validation is provided that the construction preserves selection bias magnitude, partial correlations between covariates and outcome, or treatment-effect heterogeneity. Without such checks, performance gains on artificial tasks do not necessarily imply the reported 50% error reduction and 90% variability reduction on real data.

Authors: We agree that quantitative checks on how well the artificial tasks preserve key properties would strengthen the transferability argument. In the revision we will add comparisons of selection bias magnitude (via propensity score distribution distances), partial correlations between covariates and outcome, and treatment-effect heterogeneity measures between the original and constructed tasks. These will be reported in an expanded §3 and the results section.

Revision: yes

Referee: [Results] (synthetic and real experiments): the headline quantitative improvements ('up to 50% in ATE estimation errors' and 'up to 90% in predicted ATE variability') are stated without error bars, number of replications, generation details for the synthetic tasks, or statistical significance tests. This information is required to evaluate whether the gains are robust or sensitive to the particular artificial-task construction.

Authors: We accept that the current reporting lacks necessary statistical rigor. The revised manuscript will include error bars (standard errors across replications), the exact number of replications performed, full generation parameters for the synthetic data, and statistical significance tests (e.g., paired t-tests with p-values) comparing A2A+SMD selection against baselines.

Revision: yes

Referee: [Methods] (covariate handling): the claim that A2A is 'the first metric capable of evaluating outcome correction accuracy using covariates not involved in selection' requires an explicit algorithm or pseudocode showing how outcomes are assigned to the artificial tasks while holding selected covariates out of the propensity model. The current description leaves this step underspecified.

Authors: We will insert a new subsection with pseudocode and a step-by-step algorithm that explicitly describes outcome generation and assignment while excluding designated covariates from the propensity model. This will make the covariate-handling procedure fully reproducible and will support the claim with concrete implementation details.

Revision: yes
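The paired comparison mentioned in the second response can be sketched as follows, assuming each replication yields one ATE error for the baseline selection and one for the A2A+SMD selection on the same task. The function name and the sample values are illustrative and are not results from the paper. Only the t statistic is computed here; for a p-value, `scipy.stats.ttest_rel` performs the same paired test.

```python
import math


def paired_t_statistic(errors_a, errors_b):
    """t statistic of the paired differences errors_a[i] - errors_b[i]."""
    if len(errors_a) != len(errors_b):
        raise ValueError("paired samples must have the same length")
    diffs = [a - b for a, b in zip(errors_a, errors_b)]
    n = len(diffs)
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)  # sample variance
    return mean / math.sqrt(var / n)


# Made-up errors for three replications, for illustration only.
baseline_errors = [2.0, 4.0, 6.0]
selected_errors = [1.0, 2.0, 3.0]
t_stat = paired_t_statistic(baseline_errors, selected_errors)
```

Pairing by replication matters because both selections are evaluated on the same simulated task, so task-level difficulty cancels out in the differences.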

Circularity Check

0 steps flagged

No circularity: A2A metric and reported gains rest on external empirical evaluation, not self-definition or fitted inputs.

full rationale

The paper defines A2A via construction of artificial tasks with known outcomes and reports error reductions from applying it to select PSM methods on both synthetic and real datasets. No equations, fitting procedures, or self-citations are described that would reduce the claimed 50% or 90% improvements to tautological inputs by construction. The central claims are presented as empirical outcomes of the new metric rather than redefinitions or renamings of prior results.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the untested premise that performance on artificially constructed tasks transfers to real data; no free parameters are mentioned in the abstract.

axioms (1)

domain assumption: Propensity score matching can mitigate selection biases when comparable populations are selected.
Stated as the motivation for PSM in the abstract.

invented entities (1)

A2A metric (no independent evidence)
purpose: To construct artificial matching tasks with known outcomes for comprehensive evaluation of PSM methods.
Newly introduced in the paper as the core contribution.

Reference graph

Works this paper leans on

24 extracted references · 24 canonical work pages

[1] Austin, P.C.: A critical appraisal of propensity-score matching in the medical literature between 1996 and 2003. Statistics in medicine 27(12), 2037–2049 (2008)
[2] Chen, H., Harinen, T., Lee, J.Y., Yung, M., Zhao, Z.: Causalml: Python package for causal machine learning. arXiv preprint arXiv:2002.11631 (2020)
[3] Ester, M., Kriegel, H.P., Sander, J., Xu, X., et al.: A density-based algorithm for discovering clusters in large spatial databases with noise. In: KDD. vol. 96.34, pp. 226–231 (1996)
[4] Grose, E., Wilson, S., Barkun, J., Bertens, K., Martel, G., Balaa, F., Abou Khalil, J.: Use of propensity score methodology in contemporary high-impact surgical literature. Journal of the American College of Surgeons 230(1), 101–112 (2020)
[5] HAS: Méthodologie pour le développement clinique des dispositifs médicaux. Tech. rep., Haute autorité de Santé (2021)
[6] Heinrich, C., Maffioli, A., Vazquez, G.: A primer for applying propensity-score matching. Impact-Evaluation Guidelines (2010)
[7] Herbold, S.: Autorank: A python package for automated ranking of classifiers. Journal of Open Source Software 5(48), 2173 (2020). https://doi.org/10.21105/joss.02173
[8] King, G., Nielsen, R.: Why propensity scores should not be used for matching. Political analysis 27(4), 435–454 (2019)
[9] Kline, A., Luo, Y.: Psmpy: a package for retrospective cohort matching in python. In: 2022 44th Annual International Conference of the IEEE Engineering in Medicine & Biology Society (EMBC). pp. 1354–1357. IEEE (2022)
[10] Lee, J., Little, T.D.: A practical guide to propensity score analysis for applied clinical research. Behaviour Research and Therapy 98, 76–90 (2017)
[11] Li, M.: Using the propensity score method to estimate causal effects: A review and practical guide. Organizational Research Methods 16(2), 188–226 (2013)
[12] McGill, M.: An evaluation of factors affecting document ranking by information retrieval systems. ERIC (1979)
[13] McMurry, T.L., Hu, Y., Blackstone, E.H., Kozower, B.D.: Propensity scores: methods, considerations, and applications in the journal of thoracic and cardiovascular surgery. The Journal of thoracic and cardiovascular surgery 150(1), 14–19 (2015)
[14] Nie, X., Wager, S.: Quasi-oracle estimation of heterogeneous treatment effects. Biometrika 108(2), 299–319 (2021)
[15] Nowak, M.M., Niemczyk, M., Florczyk, M., Kurzyna, M., Pączek, L.: Effect of statins on all-cause mortality in adults: A systematic review and meta-analysis of propensity score-matched studies. Journal of Clinical Medicine 11(19), 5643 (2022)
[16] Olmos, A., Govindasamy, P.: Randomized experiments vs. propensity scores matching: A meta-analysis (2014)
[17] Platt, J., et al.: Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. Advances in large margin classifiers 10(3), 61–74 (1999)
[18] Randolph, J.J., Falbe, K.: A step-by-step guide to propensity score matching in r. Practical Assessment, Research & Evaluation 19 (2014)
[19] Rosenbaum, P.R., Rubin, D.B.: The central role of the propensity score in observational studies for causal effects. Biometrika 70(1), 41–55 (1983)
[20] Stuart, E.A., King, G., Imai, K., Ho, D.: Matchit: nonparametric preprocessing for parametric causal inference. Journal of statistical software (2011)
[21] Thoemmes, F.J., Kim, E.S.: A systematic review of propensity score methods in the social sciences. Multivariate behavioral research 46(1), 90–118 (2011)
[22] Wang, J.: To use or not to use propensity score matching? Pharmaceutical Statistics 20(1), 15–24 (2021)
[23] Yao, X.I., Wang, X., Speicher, P.J., Hwang, E.S., Cheng, P., Harpole, D.H., Berry, M.F., Schrag, D., Pang, H.H.: Reporting and guidelines in propensity score analysis: a systematic review of cancer and cancer surgical studies. JNCI: Journal of the National Cancer Institute 109(8), djw323 (2017)
[24] Zhao, P., Su, X., Ge, T., Fan, J.: Propensity score and proximity matching using random forest. Contemporary clinical trials 47, 85–92 (2016)

Excerpt from Appendix A (Algorithms): Let x_j be the jth row of X. Data: Data ({X0, X1}, Y), cluster membership probabilities p ∈ [0, 1]^2, loss function L, number of iterations K. c_j ← Assign x_j ∈ X0 to a ...
