How to measure intra-physician variability in clinical decision-making?

Alaedine Benani; Damien Grosgeorge; Emmanuel Messas; Jérôme Salomon; Liza Hettal; Pierre Meneton; Sai Sagireddy; Sylvain Bodard; Xavier Tannier

arxiv: 2605.28212 · v1 · pith:RQUAHSBDnew · submitted 2026-05-27 · 📊 stat.AP


Pith reviewed 2026-06-29 09:42 UTC · model grok-4.3

classification 📊 stat.AP

keywords intra-physician variability · prescribing variability · clinical decision-making · discordance analysis · matching methods · synthetic benchmarking · physician quality metric · GLMM


The pith

Learned-Weights matching estimates intra-physician prescribing variability with the lowest error among eight tested methods on synthetic data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper benchmarks eight methods for quantifying the probability that a single physician makes discordant decisions on comparable patients. Learned-Weights matching yields the smallest mean absolute error against a synthetic ground truth, with Mutual-Information-weighted matching and Random Forest proximity close behind. All eight methods recover physician rank orderings with high Spearman correlation when variability groups are clearly separated, though performance drops for unsupervised methods under continuous heterogeneity. The evaluation supplies open-source estimators that could convert observed prescribing inconsistency into a measurable clinician-level quality metric once applied to real data.

Core claim

On synthetic data generated across 94 conditions, Learned-Weights matching recovers intra-physician discordance probability with mean absolute error of 0.027, followed by Mutual-Information-weighted matching at 0.028 and Random Forest proximity at 0.034. When physician variability groups are well separated, every method preserves the ground-truth rank ordering of physicians with Spearman correlation above 0.89. Under a continuous-heterogeneity model, supervised feature-weighted methods and the Bayesian GLMM retain moderate rank fidelity while unsupervised approaches fall to 0.28–0.35.
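The two headline metrics can be made concrete with a small sketch. It assumes that per-physician estimated and ground-truth discordance rates are available as plain lists; the function names and the numbers are invented for illustration and are not taken from the paper.

```python
from statistics import mean


def mean_absolute_error(estimated, truth):
    """Average absolute gap between estimated and true discordance rates."""
    return mean(abs(e - t) for e, t in zip(estimated, truth))


def average_ranks(values):
    """Rank values from 1..n; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(estimated, truth):
    """Spearman correlation, computed as the Pearson correlation of ranks."""
    rx, ry = average_ranks(estimated), average_ranks(truth)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical per-physician values, invented for illustration only.
true_rate = [0.00, 0.09, 0.16, 0.21, 0.25]
estimated = [0.02, 0.08, 0.19, 0.20, 0.28]

mae = mean_absolute_error(estimated, true_rate)
rho = spearman(estimated, true_rate)
```

In this toy case the estimates are off by 0.02 on average but keep the physicians in the same order, so the rank correlation is 1. The two metrics answer different questions: MAE measures calibration of the point estimate, Spearman measures whether the ranking survives.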

What carries the argument

Benchmarking of eight discordance estimators (Euclidean, Mahalanobis, Learned-Weights, Genetic Mahalanobis, Random Forest proximity, Mutual-Information-weighted, Latent Profile Analysis, Bayesian binomial GLMM) against synthetic ground-truth variability.
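A minimal sketch of what a matching-based discordance estimator could look like, under stated assumptions: each patient of one physician is paired with that physician's nearest other patient under a weighted Euclidean distance, and the estimate is the share of pairs with different decisions. The function names, the caliper, and the toy data are hypothetical; the paper's actual estimators may differ in how they match and how weights are obtained.

```python
def weighted_distance(a, b, weights):
    """Weighted Euclidean distance between two covariate vectors."""
    return sum(w * (x - y) ** 2 for w, x, y in zip(weights, a, b)) ** 0.5


def discordance_rate(patients, decisions, weights, caliper):
    """Share of nearest-neighbour pairs (within caliper) with different decisions.

    All patients are assumed to belong to the same physician.
    """
    pairs = 0
    discordant = 0
    for i, p in enumerate(patients):
        best_j, best_d = None, float("inf")
        for j, q in enumerate(patients):
            if j == i:
                continue
            d = weighted_distance(p, q, weights)
            if d < best_d:
                best_j, best_d = j, d
        # Only count pairs that are close enough to be called comparable.
        if best_j is not None and best_d <= caliper:
            pairs += 1
            discordant += int(decisions[i] != decisions[best_j])
    return discordant / pairs if pairs else float("nan")


# Toy data: the decision depends only on the first covariate.
patients = [(0.0, 0.0), (0.0, 5.0), (1.0, 0.1), (1.0, 5.1)]
decisions = [0, 0, 1, 1]

uniform_rate = discordance_rate(patients, decisions, (1.0, 1.0), caliper=2.0)
weighted_rate = discordance_rate(patients, decisions, (1.0, 0.0), caliper=2.0)
```

With uniform weights the irrelevant second covariate drives the matching and every pair looks discordant; with the weight on the irrelevant covariate set to zero the same physician looks perfectly consistent. This is one way to read why feature-weighted matching can outperform plain Euclidean matching, although the sketch does not reproduce how the paper learns its weights.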

If this is right

Learned-Weights matching supplies the most accurate point estimate of a physician's discordance rate on synthetic cases.
All methods maintain physician ordering when variability clusters are distinct, enabling relative quality ranking.
Supervised weighted methods remain usable under continuous heterogeneity where unsupervised methods lose rank signal.
Validated estimators would allow routine tracking of intra-physician inconsistency alongside existing between-physician studies.
Open-source implementations could be applied directly to electronic health record prescribing logs.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same matching approach could be tested on non-prescribing decisions such as diagnostic test ordering or referral patterns.
Integration with patient outcome registries would allow direct testing of whether high estimated variability correlates with worse clinical results.
Real-time dashboard versions could flag physicians whose recent decisions deviate from their own historical patterns on matched cases.
Extension to multi-center data would reveal whether the relative performance of the eight methods holds across different health systems.

Load-bearing premise

The process used to generate the synthetic patient and physician data accurately reflects the structure and distribution of real intra-physician variability.

What would settle it

Application of the same eight methods to actual observational prescribing records yields mean absolute errors substantially higher than the synthetic benchmarks or Spearman correlations below 0.6 even when groups appear separated.

Figures

Figures reproduced from arXiv: 2605.28212 by Alaedine Benani, Damien Grosgeorge, Emmanuel Messas, Jérôme Salomon, Liza Hettal, Pierre Meneton, Sai Sagireddy, Sylvain Bodard, Xavier Tannier.

**Figure 1.** Overview of the synthetic dataset generation. 2.1. Synthetic data generation. Benchmarking intra-physician variability methods requires a known ground truth. We illustrate our approach with a frequent clinical use case: statin prescription in patients at elevated cardiovascular risk. We generate synthetic patient cohorts in …

**Figure 2.** Physician inconsistency across all methods on the SCORE2 experiment (J = 20). Top-left: per-physician GLMM Pearson-residual overdispersion $\widehat{OD}_j$. Top-centre: per-method discordance rates, coloured by physician group ($p_{\text{high}}$ ∈ {1.00, 0.90, 0.80, 0.70, 0.50}). Top-right: per-physician $\Delta$ = method − ground truth; dashed line at zero indicates perfect agreement; values in parentheses are the mean $\bar{\Delta}$. Bottom: …

**Figure 3.** Continuous-heterogeneity SCORE2 experiment: correlation between prediction and true discordance rate. Each point is one of the J = 50 physicians. For discordance-rate methods, the dashed diagonal indicates equality between the estimated discordance rate and the true $D^\star_j$. Spearman correlations against the true $D^\star_j$ are printed in the upper-left corner of each panel. The GLMM is analysed separately because…

**Figure 4.** Pearson residuals from the GLMM (SCORE2 experiment). Residuals are centred near zero; the spread reflects the unexplained patient-level variability that drives $\widehat{OD}_j$.

**Figure 5.** Calibration curve of the GLMM (SCORE2 experiment). Observed prescription frequencies versus predicted probabilities. … cohort is $\hat{g}_1$ = 0.70 (mean 6.48%, SD 1.48%). The eight other covariates retain their main-benchmark marginals. Results …

**Figure 6.** Bootstrap mean delta versus manual pairing across cohort and physician panel size (SCORE2 experiment). Sub-panels correspond to $n_{\text{physicians}}$ ∈ {5, 10, 20, 50, 100}; abscissa is $n_{\text{patients}}$ ∈ {5 000, 10 000, 20 000, 30 000}; ordinate is the bootstrap mean delta (B = 10 replicates per cell). Vertical bars are 95% bootstrap CIs. Six matching methods are reported (Euclidean, Mahalanobis, Learned Weights, Mutual …

**Figure 7.** Sensitivity B.1. SCORE2 with Gaussian-copula correlation $\rho_{\text{copula}}$ = 0.8 between non-HDL and LDL cholesterol.

**Figure 8.** Sensitivity B.2. SCORE2 with right-skewed lognormal HbA1c marginal.

read the original abstract

Intra-physician prescribing variability, the probability that one physician issues discordant decisions for two patients deemed comparable on observed covariates, holds great impact in quality of care, safety and cost. However, there are no known validated measurement methods. Here, we benchmark eight methods (Euclidean, Mahalanobis, Learned-Weights, Genetic Mahalanobis, Random Forest proximity, Mutual-Information-weighted, Latent Profile Analysis and Bayesian binomial generalized linear mixed model) against a synthetic ground truth across 94 experimental conditions. Learned-Weights matching achieves the lowest mean absolute error (0.027), followed by Mutual-Information-weighted matching (0.028) and RF Proximity (0.034). All eight discordance-analysis methods preserve the physician rank ordering with high fidelity (Spearman > 0.89 versus the ground truth on the SCORE2 experiment), as long as the physician variability groups are well separated. Under a continuous-heterogeneity physician model, rank preservation degrades substantially for unsupervised methods (Spearman = [0.28, 0.35]) but is retained by supervised feature-weighted methods and the GLMM (Spearman = [0.62, 0.68]). This controlled methodological evaluation is a foundation for validation on observational prescribing data. Once validated on observational prescribing data, these evaluated open-source estimators could turn prescribing inconsistency into a routinely measurable clinician-level quality metric, systematically complementing the existing literature on between-physician variation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

Synthetic benchmark ranks learned-weights matching highest for intra-physician variability but leaves real-data performance untested.


The main takeaway is that on their synthetic tests, learned-weights matching recovers the ground-truth discordance levels with the lowest error (MAE 0.027), followed closely by mutual-information weighting, while all eight methods keep physician rank order intact when variability groups are clearly separated.

The paper's contribution is the controlled head-to-head on 94 conditions plus the SCORE2 setup. It generates independent synthetic ground truth, then measures absolute error and Spearman correlation for each method. That produces concrete numbers showing supervised feature-weighted approaches and the GLMM hold up better than unsupervised ones when heterogeneity is continuous rather than discrete.

The evaluation is transparent about its limits. The abstract states outright that real observational prescribing data validation is future work, so the reported rankings are presented as preparatory.

The soft spot is the synthetic generator itself. If the way discordance probabilities are created shares structure with the supervised weighting or mixed-model approaches, the performance ordering could be partly built in rather than discovered. The stress-test note correctly flags this risk, and the abstract gives no details that would rule it out. Without the generator being fully method-agnostic, the claim that supervised methods are superior rests on an assumption that needs checking.

This is for researchers working on clinician-level quality metrics or variation in care. A reader who needs a starting point for measuring intra-physician inconsistency will find the comparison useful as a reference, even if the numbers cannot yet be taken as general.

It deserves peer review. The synthetic design is structured and the question is practical; referees can push on the generator and the path to real data.

Referee Report

1 major / 1 minor

Summary. The paper benchmarks eight methods (Euclidean, Mahalanobis, Learned-Weights, Genetic Mahalanobis, Random Forest proximity, Mutual-Information-weighted, Latent Profile Analysis, and Bayesian binomial GLMM) for estimating intra-physician prescribing variability (discordance probability for comparable patients) against synthetic ground truth across 94 experimental conditions plus a SCORE2 experiment. It reports Learned-Weights matching yields the lowest MAE (0.027), followed by Mutual-Information-weighted (0.028) and RF Proximity (0.034); all methods preserve physician rank order with Spearman >0.89 when variability groups are well separated, while supervised weighted methods and the GLMM retain better rank fidelity (Spearman 0.62-0.68) than unsupervised ones (0.28-0.35) under continuous heterogeneity. The work is positioned as a controlled methodological foundation for future validation on real observational prescribing data, with open-source estimators intended to enable clinician-level quality metrics.

Significance. If the synthetic benchmark is representative and method-agnostic, the evaluation offers concrete guidance on method selection for quantifying intra-physician discordance, together with reproducible performance metrics (MAE, Spearman correlations) and open-source code. This could support development of a new, routinely measurable quality indicator that complements existing between-physician variation studies. The use of independently generated synthetic ground truth and the explicit reporting of error metrics and rank correlations are strengths.

major comments (1)

[Methods (synthetic data generation)] Synthetic data generation (Methods section): The headline rankings—Learned-Weights MAE 0.027, superiority of supervised weighted methods under continuous heterogeneity (Spearman 0.62-0.68 vs. 0.28-0.35 for unsupervised), and overall claim that the benchmark is a reliable foundation—rest on the generator producing discordance probabilities independently of the evaluated methods. The manuscript must supply the precise generative mechanism (e.g., how patient covariates, physician effects, and discordance labels are sampled) with explicit checks that the process does not embed correlations favoring feature-weighting or GLMM approaches; without this, the observed performance ordering cannot be treated as robust.

minor comments (1)

[Abstract] Abstract: the interval notation 'Spearman = [0.28, 0.35]' and '[0.62, 0.68]' is ambiguous without stating whether these are ranges across conditions or 95% intervals; clarify in the abstract and main text.
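The two readings of such an interval differ in what they summarise, which a short sketch can make visible. It assumes one Spearman value per experimental condition; the values, function names, and the choice of a percentile bootstrap for the mean are all illustrative assumptions, not details from the paper.

```python
import random
from statistics import mean


def range_across_conditions(values):
    """Smallest and largest value observed across conditions."""
    return min(values), max(values)


def bootstrap_percentile_interval(values, n_resamples=2000, level=0.95, seed=0):
    """Percentile bootstrap interval for the mean of the values."""
    rng = random.Random(seed)
    n = len(values)
    # Resample with replacement and record the mean of each resample.
    means = sorted(mean(rng.choices(values, k=n)) for _ in range(n_resamples))
    tail = (1.0 - level) / 2.0
    lower = means[int(tail * n_resamples)]
    upper = means[int((1.0 - tail) * n_resamples) - 1]
    return lower, upper


# Hypothetical Spearman correlations, one per condition (invented values).
spearman_by_condition = [0.41, 0.47, 0.52, 0.55, 0.60]

observed_range = range_across_conditions(spearman_by_condition)
mean_interval = bootstrap_percentile_interval(spearman_by_condition)
```

The range describes the spread of individual conditions, whereas the bootstrap interval describes uncertainty about the average and is never wider than the range. Reporting which of the two a bracketed pair denotes removes the ambiguity the comment points to.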

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive feedback on our benchmark study. The single major comment raises an important point about transparency in the synthetic data generator, which we address below.


Referee: Synthetic data generation (Methods section): The headline rankings—Learned-Weights MAE 0.027, superiority of supervised weighted methods under continuous heterogeneity (Spearman 0.62-0.68 vs. 0.28-0.35 for unsupervised), and overall claim that the benchmark is a reliable foundation—rest on the generator producing discordance probabilities independently of the evaluated methods. The manuscript must supply the precise generative mechanism (e.g., how patient covariates, physician effects, and discordance labels are sampled) with explicit checks that the process does not embed correlations favoring feature-weighting or GLMM approaches; without this, the observed performance ordering cannot be treated as robust.

Authors: We agree that the precise generative mechanism must be fully specified for the benchmark results to be interpretable and robust. The current manuscript provides a high-level description but omits the full sampling equations and diagnostic checks. In the revised manuscript we will add a dedicated subsection to Methods that details: (i) covariate generation X ~ MVN(0, Σ) with explicit Σ; (ii) physician-specific effects drawn either from discrete groups or a continuous Beta distribution depending on the experimental condition; (iii) ground-truth discordance probability p computed via an independent logistic function of patient and physician features that is deliberately not aligned with any of the eight evaluated estimators; and (iv) Bernoulli sampling of the binary discordance label. We will also include supplementary material with correlation matrices and ablation plots confirming that no method-specific structure was inadvertently introduced. These additions will directly address the concern about potential favoritism toward weighted or GLMM approaches.

Revision: yes
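Steps (i) to (iv) of the simulated response can be sketched as a small generator. This is an illustration of the four steps as worded in the rebuttal, not the paper's generator: Σ is taken to be the identity, and the coefficients, group effects, Beta parameters, and function names are all hypothetical.

```python
import math
import random


def sigmoid(z):
    """Logistic function mapping log-odds to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def simulate(n_physicians, n_patients, n_features, continuous, seed=0):
    """Return rows of (physician id, covariates, probability p, binary label)."""
    rng = random.Random(seed)
    # Hypothetical covariate coefficients for the logistic model in step (iii).
    beta = [rng.gauss(0.0, 0.5) for _ in range(n_features)]
    # Hypothetical discrete physician effects on the log-odds scale.
    group_effects = [-2.0, -1.0, 0.0]
    rows = []
    for j in range(n_physicians):
        # (ii) physician effect: discrete group or continuous Beta draw.
        if continuous:
            u = rng.betavariate(2.0, 5.0)
            u = min(max(u, 1e-9), 1.0 - 1e-9)  # keep the logit finite
            effect = math.log(u / (1.0 - u))
        else:
            effect = group_effects[j % len(group_effects)]
        for _ in range(n_patients):
            # (i) covariates X ~ MVN(0, I); identity covariance is an assumption.
            x = [rng.gauss(0.0, 1.0) for _ in range(n_features)]
            # (iii) probability from a logistic function of patient and physician terms.
            p = sigmoid(effect + sum(b * v for b, v in zip(beta, x)))
            # (iv) Bernoulli draw of the binary label.
            label = 1 if rng.random() < p else 0
            rows.append((j, x, p, label))
    return rows


rows = simulate(n_physicians=4, n_patients=25, n_features=3, continuous=True, seed=1)
```

Because every step draws from a seeded generator, the same seed reproduces the same cohort, which is what makes a known ground truth available for comparing estimators. A correlated Σ would require sampling through a Cholesky factor, which this sketch leaves out.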

Circularity Check

0 steps flagged

No circularity detected; methods benchmarked on independent synthetic ground truth


The paper evaluates eight discordance-analysis methods (Euclidean, Mahalanobis, Learned-Weights, etc.) by direct comparison to a separately generated synthetic ground truth across 94 conditions and the SCORE2 experiment. Reported metrics such as MAE 0.027 and Spearman correlations >0.89 are empirical performance measures against this external benchmark, not quantities derived by construction from fitted parameters or self-citations. No equations, uniqueness theorems, or ansatzes are invoked that reduce the central claims to the inputs tautologically. The work is a controlled methodological comparison whose validity rests on the independence of the data generator from the evaluated estimators, making the derivation self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No free parameters, axioms, or invented entities are described in the abstract.

pith-pipeline@v0.9.1-grok · 5825 in / 989 out tokens · 31597 ms · 2026-06-29T09:42:36.546750+00:00 · methodology


Reference graph

Works this paper leans on

17 extracted references · 6 canonical work pages

[1]

Virani SS, Kennedy KF, Akeroyd JM, et al. Variation in Lipid-Lowering Therapy Use in Patients With Low-Density Lipoprotein Cholesterol ≥190 mg/dL: Insights From the National Cardiovascular Data Registry-Practice Innovation and Clinical Excellence Registry. Circ Cardiovasc Qual Outcomes. 2018;11(5):e004652

2018
[2]

Between-practice variation in chronic obstructive pulmonary disease diagnosis guideline compliance: an observational study

Bottle A, Adamson A, Hayhoe B, Quint JK. Between-practice variation in chronic obstructive pulmonary disease diagnosis guideline compliance: an observational study. BJGP Open. 2026;10(1). doi:10.3399/BJGPO.2024.0263

work page doi:10.3399/bjgpo.2024.0263 2026
[3]

Is the “practice style” hypothesis relevant for general practitioners? An analysis of antibiotics prescription for acute rhinopharyngitis

Mousquès J, Renaud T, Scemama O. Is the “practice style” hypothesis relevant for general practitioners? An analysis of antibiotics prescription for acute rhinopharyngitis. Soc Sci Med. 2010;70(8):1176–1184. doi:10.1016/j.socscimed.2009.12.016

work page doi:10.1016/j.socscimed.2009.12.016 2010
[4]

Variability in prostate and seminal vesicle delineations defined on magnetic resonance images, a multi-observer, -center and -sequence study

Nyholm T, Jonsson J, Söderström K, et al. Variability in prostate and seminal vesicle delineations defined on magnetic resonance images, a multi-observer, -center and -sequence study. Radiat Oncol. 2013;8:126. doi:10.1186/1748-717X-8-126

work page doi:10.1186/1748-717x-8-126 2013
[5]

Inter- and intra-observer variability in contouring of the prostate gland on planning computed tomography and cone beam computed tomography

Choi HJ, Kim YS, Lee SH, et al. Inter- and intra-observer variability in contouring of the prostate gland on planning computed tomography and cone beam computed tomography. Acta Oncol. 2011;50(4):539–546. doi:10.3109/0284186X.2011.562916

work page doi:10.3109/0284186x.2011.562916 2011
[6]

Intra- and inter-physician variability in target volume delineation in radiation therapy

Das IJ, Compton JJ, Bajaj A, Johnstone PA. Intra- and inter-physician variability in target volume delineation in radiation therapy. J Radiat Res. 2021. doi:10.1093/jrr/rrab080

work page doi:10.1093/jrr/rrab080 2021
[7]

Inter- and intra-physician variability in insulin injection adjustments compared with Bayesian algorithm recommendations in type 1 diabetes

Kobayati A, Tsoukas MA, Garfield N, et al. Inter- and intra-physician variability in insulin injection adjustments compared with Bayesian algorithm recommendations in type 1 diabetes. Diabetologia. 2026;69(4):872–882

2026
[8]

SCORE2 risk prediction algorithms: new models to estimate 10-year risk of cardiovascular disease in Europe

SCORE2 Working Group and ESC Cardiovascular Risk Collaboration. SCORE2 risk prediction algorithms: new models to estimate 10-year risk of cardiovascular disease in Europe. Eur Heart J. 2021;42(25):2439–2454

2021
[9]

SCORE2-OP risk prediction algorithms: estimating incident cardiovascular event risk in older persons in four geographical risk regions

SCORE2-OP Working Group and ESC Cardiovascular Risk Collaboration. SCORE2-OP risk prediction algorithms: estimating incident cardiovascular event risk in older persons in four geographical risk regions. Eur Heart J. 2021;42(25):2455–2467

2021
[10]

Genetic Matching for Estimating Causal Effects: A General Multivariate Matching Method for Achieving Balance in Observational Studies

Diamond A, Sekhon JS. Genetic Matching for Estimating Causal Effects: A General Multivariate Matching Method for Achieving Balance in Observational Studies. Rev Econ Stat. 2013

2013
[11]

Random forests

Breiman L. Random forests. Mach Learn. 2001;45(1):5–32

2001
[12]

Unsupervised learning with random forest predictors

Shi T, Horvath S. Unsupervised learning with random forest predictors. J Comput Graph Stat. 2006;15(1):118–138

2006
[13]

Estimating mutual information

Kraskov A, Stögbauer H, Grassberger P. Estimating mutual information. Phys Rev E. 2004;69(6):066138

2004
[14]

Latent class models for clustering: A comparison with K-means

Vermunt JK, Magidson J. Latent class models for clustering: A comparison with K-means. In: Hagenaars JA, McCutcheon AL, editors. Applied Latent Class Analysis. Cambridge: Cambridge University Press; 2002. p. 89–106. doi:10.1017/CBO9780511499531.004

work page doi:10.1017/cbo9780511499531.004 2002
[15]

Constructing a control group using multivariate matched sampling methods that incorporate the propensity score

Rosenbaum PR, Rubin DB. Constructing a control group using multivariate matched sampling methods that incorporate the propensity score. Am Stat. 1985;39(1):33–38

1985
[16]

Variational inference: a review for statisticians

Blei DM, Kucukelbir A, McAuliffe JD. Variational inference: a review for statisticians. J Am Stat Assoc. 2017;112(518):859–877

2017
[17]

statsmodels: econometric and statistical modeling with Python

Seabold S, Perktold J. statsmodels: econometric and statistical modeling with Python. Proc 9th Python Sci Conf. 2010:92–96.

2010

Appendix A. Distribution of Pearson residuals from the GLMM. The distribution of Pearson residuals is normal, and plotted in Fig 4.

Appendix B. Calibration curve of the GLMM. The calibration curve of the GLMM for the SCORE2 experiment, show...
