pith. sign in

arxiv: 2605.28212 · v1 · pith:RQUAHSBDnew · submitted 2026-05-27 · 📊 stat.AP

How to measure intra-physician variability in clinical decision-making?

Pith reviewed 2026-06-29 09:42 UTC · model grok-4.3

classification 📊 stat.AP
keywords intra-physician variabilityprescribing variabilityclinical decision-makingdiscordance analysismatching methodssynthetic benchmarkingphysician quality metricGLMM
0
0 comments X

The pith

Learned-Weights matching estimates intra-physician prescribing variability with the lowest error among eight tested methods on synthetic data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper benchmarks eight methods for quantifying the probability that a single physician makes discordant decisions on comparable patients. Learned-Weights matching yields the smallest mean absolute error against a synthetic ground truth, with Mutual-Information-weighted matching and Random Forest proximity close behind. All eight methods recover physician rank orderings with high Spearman correlation when variability groups are clearly separated, though performance drops for unsupervised methods under continuous heterogeneity. The evaluation supplies open-source estimators that could convert observed prescribing inconsistency into a measurable clinician-level quality metric once applied to real data.

Core claim

On synthetic data generated across 94 conditions, Learned-Weights matching recovers intra-physician discordance probability with mean absolute error of 0.027, followed by Mutual-Information-weighted matching at 0.028 and Random Forest proximity at 0.034. When physician variability groups are well separated, every method preserves the ground-truth rank ordering of physicians with Spearman correlation above 0.89. Under a continuous-heterogeneity model, supervised feature-weighted methods and the Bayesian GLMM retain moderate rank fidelity while unsupervised approaches fall to 0.28–0.35.

What carries the argument

Benchmarking of eight discordance estimators (Euclidean, Mahalanobis, Learned-Weights, Genetic Mahalanobis, Random Forest proximity, Mutual-Information-weighted, Latent Profile Analysis, Bayesian binomial GLMM) against synthetic ground-truth variability.

If this is right

  • Learned-Weights matching supplies the most accurate point estimate of a physician's discordance rate on synthetic cases.
  • All methods maintain physician ordering when variability clusters are distinct, enabling relative quality ranking.
  • Supervised weighted methods remain usable under continuous heterogeneity where unsupervised methods lose rank signal.
  • Validated estimators would allow routine tracking of intra-physician inconsistency alongside existing between-physician studies.
  • Open-source implementations could be applied directly to electronic health record prescribing logs.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same matching approach could be tested on non-prescribing decisions such as diagnostic test ordering or referral patterns.
  • Integration with patient outcome registries would allow direct testing of whether high estimated variability correlates with worse clinical results.
  • Real-time dashboard versions could flag physicians whose recent decisions deviate from their own historical patterns on matched cases.
  • Extension to multi-center data would reveal whether the relative performance of the eight methods holds across different health systems.

Load-bearing premise

The process used to generate the synthetic patient and physician data accurately reflects the structure and distribution of real intra-physician variability.

What would settle it

Application of the same eight methods to actual observational prescribing records yields mean absolute errors substantially higher than the synthetic benchmarks or Spearman correlations below 0.6 even when groups appear separated.

Figures

Figures reproduced from arXiv: 2605.28212 by Alaedine Benani, Damien Grosgeorge, Emmanuel Messas, J\'er\^ome Salomon, Liza Hettal, Pierre Meneton, Sai Sagireddy, Sylvain Bodard, Xavier Tannier.

Figure 1
Figure 1. Figure 1: Overview of the synthetic dataset generation. 2.1. Synthetic data generation. Benchmarking intra-physician variability methods requires a known ground truth. We illustrate our approach with a frequent clinical use case: statin pre￾scription in patients at elevated cardiovascular risk. We generate synthetic patient cohorts in [PITH_FULL_IMAGE:figures/full_fig_p003_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Physician inconsistency across all methods on the SCORE2 experiment (J = 20). Top-Left: per-physician GLMM Pearson-residual overdispersion ODdj . Top-Centre: per-method discordance rates, coloured by physician group (p high ∈ {1.00, 0.90, 0.80, 0.70, 0.50}). Top-right: per-physician ∆ = method − ground truth; dashed line at zero in￾dicates perfect agreement; values in parentheses are the mean ∆. ¯ Bottom: … view at source ↗
Figure 3
Figure 3. Figure 3: Continuous-heterogeneity SCORE2 experiment: correla￾tion between prediction and true discordance rate Each point is one of the J = 50 physicians. For discordance-rate methods, the dashed diagonal indicates equality between the estimated discordance rate and the true D⋆ j . Spearman correlations against the true D⋆ j are printed in the upper-left corner of each panel. The GLMM is analysed separately because… view at source ↗
Figure 4
Figure 4. Figure 4: Pearson residuals from the GLMM (SCORE2 experiment). Resid￾uals are centred near zero; the spread reflects the unexplained patient-level variability that drives ODdj . SCORE2 cohort ( [PITH_FULL_IMAGE:figures/full_fig_p021_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Calibration curve of the GLMM (SCORE2 experiment). Observed prescription frequencies versus predicted probabilities. cohort is ˆg1 = 0.70 (mean 6.48%, SD 1.48%). The eight other covariates retain their main￾benchmark marginals. Results [PITH_FULL_IMAGE:figures/full_fig_p022_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Bootstrap mean delta versus manual pairing across cohort and physician panel size (SCORE2 experiment). Sub-panels correspond to nphysicians ∈ {5, 10, 20, 50, 100}; abscissa is npatients ∈ {5 000, 10 000, 20 000, 30 000}; ordinate is the bootstrap mean delta (B = 10 replicates per cell). Vertical bars are 95% bootstrap CIs. Six matching meth￾ods are reported (Euclidean, Mahalanobis, Learned Weights, Mutual … view at source ↗
Figure 7
Figure 7. Figure 7: Sensitivity B.1. SCORE2 with Gaussian-copula correlation ρcopula = 0.8 between non-HDL and LDL cholesterol [PITH_FULL_IMAGE:figures/full_fig_p024_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Sensitivity B.2. SCORE2 with right-skewed lognormal HbA1c marginal [PITH_FULL_IMAGE:figures/full_fig_p024_8.png] view at source ↗
read the original abstract

Intra-physician prescribing variability, the probability that one physician issues discordant decisions for two patients deemed comparable on observed covariates, holds great impact in quality of care, safety and cost. However, there are no known validated measurement methods. Here, we benchmark eight methods (Euclidean, Mahalanobis, Learned-Weights, Genetic Mahalanobis, Random Forest proximity, Mutual-Information-weighted, Latent Profile Analysis and Bayesian binomial generalized linear mixed model) against a synthetic ground truth across 94 experimental conditions. Learned-Weights matching achieves the lowest mean absolute error (0.027), followed by Mutual-Information-weighted matching (0.028) and RF Proximity (0.034). All eight discordance-analysis methods preserve the physician rank ordering with high fidelity (Spearman > 0.89 versus the ground truth on the SCORE2 experiment), as long as the physician variability groups are well separated. Under a continuous-heterogeneity physician model, rank preservation degrades substantially for unsupervised methods (Spearman = [0.28, 0.35]) but is retained by supervised feature-weighted methods and the GLMM (Spearman = [0.62, 0.68]). This controlled methodological evaluation is a foundation for validation on observational prescribing data. Once validated on observational prescribing data, these evaluated open-source estimators could turn prescribing inconsistency into a routinely measurable clinician-level quality metric, systematically complementing the existing literature on between-physician variation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper benchmarks eight methods (Euclidean, Mahalanobis, Learned-Weights, Genetic Mahalanobis, Random Forest proximity, Mutual-Information-weighted, Latent Profile Analysis, and Bayesian binomial GLMM) for estimating intra-physician prescribing variability (discordance probability for comparable patients) against synthetic ground truth across 94 experimental conditions plus a SCORE2 experiment. It reports Learned-Weights matching yields the lowest MAE (0.027), followed by Mutual-Information-weighted (0.028) and RF Proximity (0.034); all methods preserve physician rank order with Spearman >0.89 when variability groups are well separated, while supervised weighted methods and the GLMM retain better rank fidelity (Spearman 0.62-0.68) than unsupervised ones (0.28-0.35) under continuous heterogeneity. The work is positioned as a controlled methodological foundation for future validation on real observational prescribing data, with open-source estimators intended to enable clinician-level quality metrics.

Significance. If the synthetic benchmark is representative and method-agnostic, the evaluation supplies concrete guidance on method selection for quantifying intra-physician discordance and supplies reproducible performance metrics (MAE, Spearman correlations) plus open-source code. This could support development of a new, routinely measurable quality indicator that complements existing between-physician variation studies. The use of independently generated synthetic ground truth and explicit reporting of error metrics and rank correlations are explicit strengths.

major comments (1)
  1. [Methods (synthetic data generation)] Synthetic data generation (Methods section): The headline rankings—Learned-Weights MAE 0.027, superiority of supervised weighted methods under continuous heterogeneity (Spearman 0.62-0.68 vs. 0.28-0.35 for unsupervised), and overall claim that the benchmark is a reliable foundation—rest on the generator producing discordance probabilities independently of the evaluated methods. The manuscript must supply the precise generative mechanism (e.g., how patient covariates, physician effects, and discordance labels are sampled) with explicit checks that the process does not embed correlations favoring feature-weighting or GLMM approaches; without this, the observed performance ordering cannot be treated as robust.
minor comments (1)
  1. [Abstract] Abstract: the interval notation 'Spearman = [0.28, 0.35]' and '[0.62, 0.68]' is ambiguous without stating whether these are ranges across conditions or 95% intervals; clarify in the abstract and main text.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for their constructive feedback on our benchmark study. The single major comment raises an important point about transparency in the synthetic data generator, which we address below.

read point-by-point responses
  1. Referee: Synthetic data generation (Methods section): The headline rankings—Learned-Weights MAE 0.027, superiority of supervised weighted methods under continuous heterogeneity (Spearman 0.62-0.68 vs. 0.28-0.35 for unsupervised), and overall claim that the benchmark is a reliable foundation—rest on the generator producing discordance probabilities independently of the evaluated methods. The manuscript must supply the precise generative mechanism (e.g., how patient covariates, physician effects, and discordance labels are sampled) with explicit checks that the process does not embed correlations favoring feature-weighting or GLMM approaches; without this, the observed performance ordering cannot be treated as robust.

    Authors: We agree that the precise generative mechanism must be fully specified for the benchmark results to be interpretable and robust. The current manuscript provides a high-level description but omits the full sampling equations and diagnostic checks. In the revised manuscript we will add a dedicated subsection to Methods that details: (i) covariate generation X ~ MVN(0, Σ) with explicit Σ; (ii) physician-specific effects drawn either from discrete groups or a continuous Beta distribution depending on the experimental condition; (iii) ground-truth discordance probability p computed via an independent logistic function of patient and physician features that is deliberately not aligned with any of the eight evaluated estimators; and (iv) Bernoulli sampling of the binary discordance label. We will also include supplementary material with correlation matrices and ablation plots confirming that no method-specific structure was inadvertently introduced. These additions will directly address the concern about potential favoritism toward weighted or GLMM approaches. revision: yes

Circularity Check

0 steps flagged

No circularity detected; methods benchmarked on independent synthetic ground truth

full rationale

The paper evaluates eight discordance-analysis methods (Euclidean, Mahalanobis, Learned-Weights, etc.) by direct comparison to a separately generated synthetic ground truth across 94 conditions and the SCORE2 experiment. Reported metrics such as MAE 0.027 and Spearman correlations >0.89 are empirical performance measures against this external benchmark, not quantities derived by construction from fitted parameters or self-citations. No equations, uniqueness theorems, or ansatzes are invoked that reduce the central claims to the inputs tautologically. The work is a controlled methodological comparison whose validity rests on the independence of the data generator from the evaluated estimators, making the derivation self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No free parameters, axioms, or invented entities are described in the abstract.

pith-pipeline@v0.9.1-grok · 5825 in / 989 out tokens · 31597 ms · 2026-06-29T09:42:36.546750+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

17 extracted references · 6 canonical work pages

  1. [1]

    Virani SS, Kennedy KF, Akeroyd JM, et al. Variation in Lipid-Lowering Therapy Use in Patients With Low-Density Lipoprotein Cholesterol≥190 mg/dL: Insights From the Na- tional Cardiovascular Data Registry-Practice Innovation and Clinical Excellence Registry. Circ Cardiovasc Qual Outcomes. 2018;11(5):e004652

  2. [2]

    Between-practice variation in chronic ob- structive pulmonary disease diagnosis guideline compliance: an observational study

    Bottle A, Adamson A, Hayhoe B, Quint JK. Between-practice variation in chronic ob- structive pulmonary disease diagnosis guideline compliance: an observational study. BJGP Open. 2026;10(1). doi:10.3399/BJGPO.2024.0263

  3. [3]

    practice style

    Mousqu` es J, Renaud T, Scemama O. Is the “practice style” hypothesis relevant for general practitioners? An analysis of antibiotics prescription for acute rhinopharyngitis. Soc Sci Med. 2010;70(8):1176–1184. doi:10.1016/j.socscimed.2009.12.016

  4. [4]

    Variability in prostate and seminal vesicle delineations defined on magnetic resonance images, a multi-observer, -center and -sequence study

    Nyholm T, Jonsson J, S¨ oderstr¨ om K, et al. Variability in prostate and seminal vesicle delineations defined on magnetic resonance images, a multi-observer, -center and -sequence study. Radiat Oncol. 2013;8:126. doi:10.1186/1748-717X-8-126

  5. [5]

    Inter- and intra-observer variability in contouring of the prostate gland on planning computed tomography and cone beam computed tomography

    Choi HJ, Kim YS, Lee SH, et al. Inter- and intra-observer variability in contouring of the prostate gland on planning computed tomography and cone beam computed tomography. Acta Oncol. 2011;50(4):539–546. doi:10.3109/0284186X.2011.562916 20 A. BENANI ET AL

  6. [6]

    Intra- and inter-physician variability in target volume delineation in radiation therapy

    Das IJ, Compton JJ, Bajaj A, Johnstone PA. Intra- and inter-physician variability in target volume delineation in radiation therapy. J Radiat Res. 2021. doi:10.1093/jrr/rrab080

  7. [7]

    Inter- and intra-physician variability in in- sulin injection adjustments compared with Bayesian algorithm recommendations in type 1 diabetes

    Kobayati A, Tsoukas MA, Garfield N, et al. Inter- and intra-physician variability in in- sulin injection adjustments compared with Bayesian algorithm recommendations in type 1 diabetes. Diabetologia. 2026;69(4):872–882

  8. [8]

    SCORE2 risk pre- diction algorithms: new models to estimate 10-year risk of cardiovascular disease in Eu- rope

    SCORE2 Working Group and ESC Cardiovascular Risk Collaboration. SCORE2 risk pre- diction algorithms: new models to estimate 10-year risk of cardiovascular disease in Eu- rope. Eur Heart J. 2021;42(25):2439–2454

  9. [9]

    SCORE2-OP risk prediction algorithms: estimating incident cardiovascular event risk in older persons in four geographical risk regions

    SCORE2-OP Working Group and ESC Cardiovascular Risk Collaboration. SCORE2-OP risk prediction algorithms: estimating incident cardiovascular event risk in older persons in four geographical risk regions. Eur Heart J. 2021;42(25):2455–2467

  10. [10]

    Genetic Matching for Estimating Causal Effects: A General Multivariate Matching Method for Achieving Balance in Observational Studies

    Diamond A, Sekhon JS. Genetic Matching for Estimating Causal Effects: A General Multivariate Matching Method for Achieving Balance in Observational Studies. Rev Econ Stat. 2013

  11. [11]

    Random forests

    Breiman L. Random forests. Mach Learn. 2001;45(1):5–32

  12. [12]

    Unsupervised learning with random forest predictors

    Shi T, Horvath S. Unsupervised learning with random forest predictors. J Comput Graph Stat. 2006;15(1):118–138

  13. [13]

    Estimating mutual information

    Kraskov A, St¨ ogbauer H, Grassberger P. Estimating mutual information. Phys Rev E. 2004;69(6):066138

  14. [14]

    Latent class models for clustering: A comparison with K-means

    Vermunt JK, Magidson J. Latent class models for clustering: A comparison with K-means. In: Hagenaars JA, McCutcheon AL, editors. Applied Latent Class Analysis. Cambridge: Cambridge University Press; 2002. p. 89–106. doi:10.1017/CBO9780511499531.004

  15. [15]

    Constructing a control group using multivariate matched sampling methods that incorporate the propensity score

    Rosenbaum PR, Rubin DB. Constructing a control group using multivariate matched sampling methods that incorporate the propensity score. Am Stat. 1985;39(1):33–38

  16. [16]

    Variational inference: a review for statisticians

    Blei DM, Kucukelbir A, McAuliffe JD. Variational inference: a review for statisticians. J Am Stat Assoc. 2017;112(518):859–877

  17. [17]

    statsmodels: econometric and statistical modeling with Python

    Seabold S, Perktold J. statsmodels: econometric and statistical modeling with Python. Proc 9th Python Sci Conf. 2010:92–96. AppendixA.Distribution of Pearson residuals from the GLMM The distribution of Pearson residuals is normal, and plotted in Fig 4 AppendixB.Calibration curve of the GLMM The calibration curve of the GLMM for the SCORE2 experiment, show...