pith. machine review for the scientific record.

arxiv: 2604.16450 · v1 · submitted 2026-04-07 · 💻 cs.CY · cs.LG · q-bio.QM

Recognition: no theorem link

FairLogue: Evaluating Intersectional Fairness across Clinical Machine Learning Use Cases using the All of Us Research Program

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 18:02 UTC · model grok-4.3

classification 💻 cs.CY · cs.LG · q-bio.QM
keywords intersectional fairness · clinical machine learning · All of Us dataset · fairness auditing · counterfactual analysis · healthcare disparities

The pith

Intersectional fairness evaluation of two clinical ML models on All of Us data detects larger disparities than single-axis analyses, yet most match those expected from randomized group membership.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces FairLogue as a toolkit to audit fairness at the intersection of demographic attributes in clinical machine learning. It replicates two published models using the All of Us dataset—one for SSRI bleeding risk and one for stroke risk in atrial fibrillation patients—and measures disparities across race, gender, and their combinations. Observational metrics show bigger gaps when groups are combined, yet counterfactual analysis finds that most of these gaps are similar in size to what would occur if demographic labels were shuffled randomly. This approach matters because single-axis fairness tests may overlook how biases compound across multiple identities in healthcare data.

Core claim

Intersectional evaluation revealed larger disparities than single-axis analyses; however, counterfactual diagnostics indicated that most observed disparities were comparable to those expected under randomized group membership.

What carries the argument

FairLogue toolkit for computing observational fairness metrics on intersectional subgroups and performing counterfactual diagnostics to assess whether disparities are attributable to group membership rather than chance.
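
The toolkit's code is not reproduced in this review, but the shape of the computation is simple. As a minimal sketch, assuming a pandas DataFrame of model predictions joined with demographics (the column names, metric choice, and function are illustrative, not FairLogue's API), an observational demographic parity audit over intersectional cells could look like:

    import pandas as pd

    def demographic_parity_gaps(df: pd.DataFrame, group_cols: list[str],
                                pred_col: str = "y_pred") -> pd.Series:
        """Positive-prediction rate per subgroup, minus the overall rate."""
        overall = df[pred_col].mean()
        # Grouping on several columns at once yields the intersectional cells
        # (e.g., race x gender) rather than each axis separately.
        return df.groupby(group_cols)[pred_col].mean() - overall

    # Hypothetical usage: single-axis vs. intersectional audits of one model.
    # demographic_parity_gaps(df, ["race"])
    # demographic_parity_gaps(df, ["race", "gender"])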

If this is right

  • Single-axis fairness evaluations can underestimate the scale of disparities present in combined demographic groups.
  • Counterfactual analysis offers a method to test if observed disparities in clinical predictions are driven by actual group membership.
  • The toolkit enables more nuanced auditing of bias in healthcare machine learning applications.
  • Results from All of Us data suggest the need for intersectional perspectives in fairness assessments to avoid misattributing bias sources.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If counterfactual diagnostics are valid, many apparent intersectional biases may stem from unmeasured confounders or correlations rather than direct demographic effects, pointing toward targeted data collection improvements.
  • Applying similar auditing to additional clinical tasks could test whether the pattern of random-comparable disparities holds across other prediction problems.
  • Broader use might encourage development of fairness tools that integrate counterfactual checks as standard practice in medical AI evaluation.

Load-bearing premise

The counterfactual analysis can reliably separate disparities caused by actual group membership from those arising under random assignment or unmeasured confounders in the observational All of Us data.

What would settle it

Running the same fairness metrics and counterfactual tests on a controlled synthetic dataset where group labels are randomly permuted versus one with engineered causal group effects would show whether the diagnostics correctly classify the disparities in each case.
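
As a hedged sketch of how those two synthetic conditions could be built (all names, sizes, and effect magnitudes here are illustrative, not the paper's design):

    import numpy as np

    rng = np.random.default_rng(0)
    n = 10_000

    # Null condition: group labels assigned independently of the predictions,
    # so any measured disparity is chance.
    group_null = rng.integers(0, 4, size=n)        # four intersectional cells
    preds_null = rng.binomial(1, 0.3, size=n)

    # Causal condition: an engineered group effect on the positive rate.
    group_causal = rng.integers(0, 4, size=n)
    base_rates = np.array([0.2, 0.3, 0.4, 0.5])    # illustrative per-cell rates
    preds_causal = rng.binomial(1, base_rates[group_causal])

A valid diagnostic should report the null condition as comparable to randomized group membership and flag the causal condition as exceeding it.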

Figures

Figures reproduced from arXiv: 2604.16450 by Nick Souligne, Vignesh Subbian.

Figure 1
Figure 1. Results for the random forest model. view at source ↗
Figure 2
Figure 2. Fairness comparison across demographics for SSRI-associated bleeding prediction (random forest models). Heatmaps show demographic parity (DP), equalized odds false positive rate (EO FPR), and equal opportunity difference (EOD) gaps for each SSRI cohort. Results are shown separately for single-axis (race and gender only) and intersectional (race × gender) subgroups. Color intensity reflects the magnitude of t… view at source ↗
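
A note on the caption's abbreviations: the referee observes that the paper does not spell out its exact formulas, so the following are the standard definitions from the fairness literature, not necessarily the paper's, for a subgroup a against a reference group a′:

    DP gap     = P(Ŷ = 1 | A = a) − P(Ŷ = 1 | A = a′)
    EO FPR gap = P(Ŷ = 1 | Y = 0, A = a) − P(Ŷ = 1 | Y = 0, A = a′)
    EOD gap    = P(Ŷ = 1 | Y = 1, A = a) − P(Ŷ = 1 | Y = 1, A = a′)

Here Ŷ is the model's prediction, Y the observed outcome, and A ranges over single-axis (race, gender) or intersectional (race × gender) subgroups.
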
read the original abstract

Intersectional biases in healthcare data can produce compound disparities in clinical machine learning models, yet most fairness evaluations assess demographic attributes independently. FairLogue, a toolkit for intersectional fairness auditing, was applied across multiple clinical prediction tasks to evaluate disparities across combined demographic groups. Using the All of Us dataset, two published models were selected for replication and evaluation: (A) prediction of selective serotonin reuptake inhibitor associated bleeding events and (B) two-year stroke risk in patients with atrial fibrillation. Observational fairness metrics were computed across race, gender, and intersectional subgroups, followed by counterfactual analysis to evaluate whether disparities were attributable to group membership. Intersectional evaluation revealed larger disparities than single-axis analyses; however, counterfactual diagnostics indicated that most observed disparities were comparable to those expected under randomized group membership. These results highlight the importance of intersectional fairness auditing and demonstrate how FairLogue provides deeper insight into bias in clinical machine learning systems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces FairLogue, a toolkit for intersectional fairness auditing of clinical ML models. It replicates two published models on the All of Us dataset—one for SSRI-associated bleeding events and one for two-year stroke risk in atrial fibrillation patients—computes standard observational fairness metrics across race, gender, and their intersections, and applies counterfactual analysis to test whether observed disparities exceed those expected under randomized group membership. The central finding is that intersectional disparities are larger than single-axis ones, yet most are statistically comparable to the randomized baseline.

Significance. If the counterfactual diagnostics prove robust, the work usefully demonstrates that intersectional auditing can surface compound disparities missed by single-axis checks and supplies a reusable toolkit for such evaluations on a large, diverse cohort. The replication of existing models and focus on real clinical tasks are positive; however, the headline equivalence result hinges on an untestable assumption in observational data.

major comments (2)
  1. [Methods / Results] The counterfactual randomization step (described in the Methods and used for the key comparison in Results) is not identifiable from observational All of Us data without strong, untestable assumptions about unmeasured confounding (SES, genetics, care access). No explicit construction details, balance diagnostics, or sensitivity analyses are provided, so the claim that observed disparities match the randomized baseline cannot be verified and risks being an artifact of residual associations.
  2. [Abstract / Methods] Abstract and Methods give no information on model architectures, train/test splits, exact definitions of the fairness metrics, statistical tests for equivalence, or how counterfactuals were constructed (e.g., label permutation vs. resampling vs. causal model). These omissions make the central claim—that intersectional disparities are larger yet comparable to random—impossible to assess from the provided text.
minor comments (2)
  1. [Results] Clarify notation for intersectional subgroups and ensure all figures include confidence intervals or p-values for the disparity comparisons.
  2. [Discussion] Add a limitations subsection explicitly discussing the observational nature of the data and the assumptions required for the counterfactual analysis.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments, which highlight opportunities to strengthen the clarity and rigor of our manuscript. We address each major comment point by point below.

read point-by-point responses
  1. Referee: [Methods / Results] The counterfactual randomization step (described in the Methods and used for the key comparison in Results) is not identifiable from observational All of Us data without strong, untestable assumptions about unmeasured confounding (SES, genetics, care access). No explicit construction details, balance diagnostics, or sensitivity analyses are provided, so the claim that observed disparities match the randomized baseline cannot be verified and risks being an artifact of residual associations.

    Authors: The counterfactual analysis is a permutation test that randomly reassigns demographic group labels (including intersections) to the fixed predictions and outcomes, generating a null distribution of fairness metrics under randomized group membership. This is a non-parametric statistical test for whether observed disparities exceed those expected by chance under independence; it makes no causal claims and requires no assumptions about unmeasured confounding or identifiability of effects. We agree the Methods lacked explicit details on the procedure. In revision we will add the exact construction (group-label permutation with 1000 iterations), any diagnostics, and sensitivity analyses (e.g., varying iteration count). This will allow verification that the comparison is not an artifact. revision: yes

  2. Referee: [Abstract / Methods] Abstract and Methods give no information on model architectures, train/test splits, exact definitions of the fairness metrics, statistical tests for equivalence, or how counterfactuals were constructed (e.g., label permutation vs. resampling vs. causal model). These omissions make the central claim—that intersectional disparities are larger yet comparable to random—impossible to assess from the provided text.

    Authors: We agree these details are missing and limit assessment. The revised manuscript will expand the Abstract and Methods to specify: replicated model architectures (including original publication details), train/test split procedures, exact definitions and formulas for all fairness metrics, statistical tests for comparability to the randomized baseline (permutation p-values), and the counterfactual method as group-label permutation. These additions will make the central claims fully reproducible and assessable. revision: yes
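
The group-label permutation described in the first response is a standard non-parametric procedure; a minimal sketch under that reading (the metric function and all names are illustrative, not the paper's code):

    import numpy as np

    def permutation_pvalue(metric, groups, y_true, y_pred,
                           n_iter=1000, seed=0):
        """One-sided permutation p-value for an observed disparity.

        `metric(groups, y_true, y_pred)` returns a scalar disparity, e.g.
        the largest demographic parity gap across intersectional cells.
        """
        rng = np.random.default_rng(seed)
        observed = metric(groups, y_true, y_pred)
        null = np.empty(n_iter)
        for i in range(n_iter):
            # Predictions and outcomes stay fixed; only the group labels are
            # shuffled -- the "randomized group membership" baseline.
            null[i] = metric(rng.permutation(groups), y_true, y_pred)
        # The +1 correction keeps the p-value valid for finite permutations.
        return (1 + np.sum(null >= observed)) / (1 + n_iter)

Disparities with large p-values under this null are the ones the paper calls "comparable to those expected under randomized group membership."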

Circularity Check

0 steps flagged

No circularity: empirical fairness metrics applied to observational data

full rationale

The paper describes an empirical application of standard observational fairness metrics and counterfactual comparisons on the All of Us dataset for two clinical prediction tasks. No mathematical derivation, first-principles result, or prediction is claimed that reduces by construction to fitted parameters, self-definitions, or self-citations. The central findings (larger intersectional disparities but comparability to randomized-group baselines) are direct computations from the data rather than tautological outputs. This is a normal non-circular empirical evaluation.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The work is purely empirical and draws on standard fairness auditing techniques from prior literature without introducing new mathematical axioms, free parameters, or postulated entities.

pith-pipeline@v0.9.0 · 5469 in / 1045 out tokens · 67266 ms · 2026-05-10T18:02:13.934585+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

17 extracted references · 17 canonical work pages

  1. [1]

    Translating intersectionality to fair machine learning in health sciences

    Lett E, La Cava WG. Translating intersectionality to fair machine learning in health sciences. Nat Mach Intell. 2023;5(5):476–479. doi:10.1038/s42256-023-00651-3

  2. [2]

    An intersectional framework for counterfactual fairness in risk prediction

    Wastvedt S, Huling J, Wolfson J. An intersectional framework for counterfactual fairness in risk prediction. Biostatistics. 2023;kxad021. doi:10.1093/biostatistics/kxad021

  3. [3]

Bias mitigation for machine learning classifiers: a comprehensive survey

    Hort M, Chen Z, Zhang JM, Harman M, Sarro F. Bias mitigation for machine learning classifiers: a comprehensive survey. arXiv. 2023. Available from: http://arxiv.org/abs/2207.07068

  4. [4]

    Examining inclusivity: the use of AI and diverse populations in health and social care: a systematic review

    Marko JGO, Neagu CD, Anand PB. Examining inclusivity: the use of AI and diverse populations in health and social care: a systematic review. BMC Med Inform Decis Mak. 2025;25:57. doi:10.1186/s12911-025-02884-1

  5. [5]

    Stigma, biomarkers, and algorithmic bias: recommendations for precision behavioral health with artificial intelligence

Walsh CG, Chaudhry B, Dua P, Goodman KW, Kaplan B, Kavuluru R, et al. Stigma, biomarkers, and algorithmic bias: recommendations for precision behavioral health with artificial intelligence. JAMIA Open. 2020;3(1):9–15. doi:10.1093/jamiaopen/ooz054

  6. [6]

Racial and ethnic disparities in primary open-angle glaucoma clinical trials: a systematic review and meta-analysis

    Allison K, Patel DG, Greene L. Racial and ethnic disparities in primary open-angle glaucoma clinical trials: a systematic review and meta-analysis. JAMA Netw Open. 2021;4(5):e218348. doi:10.1001/jamanetworkopen.2021.8348

  7. [7]

Dissecting racial bias in an algorithm used to manage the health of populations

    Obermeyer Z, Powers B, Vogeli C, Mullainathan S. Dissecting racial bias in an algorithm used to manage the health of populations. Science. 2019;366(6464):447–453. doi:10.1126/science.aax2342

  8. [8]

    Predictive analytics for glaucoma using data from the All of Us research program

    Baxter SL, Saseendrakumar BR, Paul P, Kim J, Bonomi L, Kuo TT, et al. Predictive analytics for glaucoma using data from the All of Us research program. Am J Ophthalmol. 2021;227:74–86. doi:10.1016/j.ajo.2021.01.008

  9. [9]

    Fair and interpretable models for survival analysis

Rahman MM, Purushotham S. Fair and interpretable models for survival analysis. In: Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. 2022. p. 1452–1462. doi:10.1145/3534678.3539259

  10. [10]

    Improving fairness in AI models on electronic health records: the case for federated learning methods

    Poulain R, Bin Tarek MF, Beheshti R. Improving fairness in AI models on electronic health records: the case for federated learning methods. In: Proceedings of the 2023 ACM Conference on Fairness, Accountability, and Transparency. 2023. p. 1599–1608. doi:10.1145/3593013.3594102

  11. [11]

    An intersectional definition of fairness

    Foulds J, Islam R, Keya KN, Pan S. An intersectional definition of fairness. arXiv. 2019. doi:10.48550/arXiv.1807.08362

  12. [12]

    Fair admission risk prediction with proportional multicalibration

    La Cava WG, Lett E, Wan G. Fair admission risk prediction with proportional multicalibration. Proc Mach Learn Res. 2023;209:350–378

  13. [13]

    FairLogue: a toolkit for intersectional fairness analysis in clinical machine learning models [Internet]

Subbian V, Souligne N. FairLogue: a toolkit for intersectional fairness analysis in clinical machine learning models [Internet]. GitHub; 2026. Available from: https://github.com/vsubbian/FairLogue

  14. [14]

Using machine learning to develop a clinical prediction model for SSRI-associated bleeding: a feasibility study

    Goyal J, Ng DQ, Zhang K, Chan A, Lee J, Zheng K, et al. Using machine learning to develop a clinical prediction model for SSRI-associated bleeding: a feasibility study. BMC Med Inform Decis Mak. 2023;23(1):105. doi:10.1186/s12911-023-02206-3

  15. [15]

    Fair prediction of 2-year stroke risk in patients with atrial fibrillation

    Gao J, Mar P, Tang ZZ, Chen G. Fair prediction of 2-year stroke risk in patients with atrial fibrillation. J Am Med Inform Assoc. 2024;31(12):2820–2828. doi:10.1093/jamia/ocae170

  16. [16]

    Data and statistics dissemination policy [Internet]

    All of Us Research Program. Data and statistics dissemination policy [Internet]. 2024 Jan 18. Available from: https://support.researchallofus.org/hc/en-us/articles/22346276580372-Data-and-Statistics-Dissemination-Policy

  17. [17]

Developing and sustaining inclusive language in biomedical informatics communications: an AMIA Board of Directors endorsed paper on the Inclusive Language and Context Style Guidelines

    Bear Don't Walk O 4th, Haldar S, Wei DH, Huang H, Rivera RL, Fan JW, et al. Developing and sustaining inclusive language in biomedical informatics communications: an AMIA Board of Directors endorsed paper on the Inclusive Language and Context Style Guidelines. J Am Med Inform Assoc. 2025;32(8):1380–1387. doi:10.1093/jamia/ocaf096