Recognition: no theorem link
FairLogue: Evaluating Intersectional Fairness across Clinical Machine Learning Use Cases using the All of Us Research Program
Pith reviewed 2026-05-10 18:02 UTC · model grok-4.3
The pith
Intersectional fairness evaluation of two clinical ML models on All of Us data detects larger disparities than single-axis analyses, yet most of those disparities match what would be expected under randomized group membership.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Intersectional evaluation revealed larger disparities than single-axis analyses; however, counterfactual diagnostics indicated that most observed disparities were comparable to those expected under randomized group membership.
What carries the argument
FairLogue toolkit for computing observational fairness metrics on intersectional subgroups and performing counterfactual diagnostics to assess whether disparities are attributable to group membership rather than chance.
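To make this concrete, below is a minimal sketch of per-subgroup metric computation over combined race × gender strata; it is illustrative only (the column names, toy data, and the TPR-gap summary are assumptions, not the FairLogue API).

```python
# Minimal sketch: per-subgroup true-positive rates over intersectional strata.
# Illustrative only -- column names, toy data, and the TPR-gap summary are
# assumptions, not the FairLogue API.
import pandas as pd

def subgroup_tpr(df, group_cols, y_true="y_true", y_pred="y_pred"):
    """True-positive rate within each subgroup defined by group_cols."""
    positives = df[df[y_true] == 1]
    return positives.groupby(group_cols)[y_pred].mean()

# Toy cohort: one row per patient, binary outcome and binary prediction.
df = pd.DataFrame({
    "race":   ["A", "A", "B", "B", "A", "B", "A", "B"],
    "gender": ["F", "M", "F", "M", "F", "M", "M", "F"],
    "y_true": [1, 1, 1, 1, 1, 1, 0, 0],
    "y_pred": [1, 0, 1, 0, 1, 1, 0, 1],
})

single_axis    = subgroup_tpr(df, ["race"])             # race alone
intersectional = subgroup_tpr(df, ["race", "gender"])   # race x gender
gap = intersectional.max() - intersectional.min()       # disparity summary
print(intersectional)
print("intersectional TPR gap:", gap)
```

The same grouping call with a single column gives the single-axis view, which is what allows intersectional gaps to be compared directly against single-axis ones.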
If this is right
- Single-axis fairness evaluations can underestimate the scale of disparities present in combined demographic groups.
- Counterfactual analysis offers a way to test whether observed disparities in clinical predictions are driven by actual group membership.
- The toolkit enables more nuanced auditing of bias in healthcare machine learning applications.
- Results from All of Us data suggest the need for intersectional perspectives in fairness assessments to avoid misattributing bias sources.
Where Pith is reading between the lines
- If counterfactual diagnostics are valid, many apparent intersectional biases may stem from unmeasured confounders or correlations rather than direct demographic effects, pointing toward targeted data collection improvements.
- Applying similar auditing to additional clinical tasks could test whether the pattern of random-comparable disparities holds across other prediction problems.
- Broader use might encourage development of fairness tools that integrate counterfactual checks as standard practice in medical AI evaluation.
Load-bearing premise
The counterfactual analysis can reliably separate disparities caused by actual group membership from those arising under random assignment or unmeasured confounders in the observational All of Us data.
What would settle it
Running the same fairness metrics and counterfactual tests on a controlled synthetic dataset where group labels are randomly permuted versus one with engineered causal group effects would show whether the diagnostics correctly classify the disparities in each case.
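One way to run such a check is a small simulation, sketched below under stated assumptions (the data-generating process, four-subgroup structure, effect size, and helper functions are illustrative, not the paper's code): one cohort gets purely random subgroup labels, another gets an engineered causal effect in which a single subgroup receives noisier predictions, and a sound permutation diagnostic should flag only the second.

```python
# Illustrative validation of a permutation diagnostic on synthetic data.
# Assumptions: TPR gap as the disparity metric, four subgroups, 1,000 shuffles.
import numpy as np

rng = np.random.default_rng(0)

def tpr_gap(groups, y_true, y_pred):
    """Max-min gap in true-positive rate across subgroups."""
    rates = [y_pred[(groups == g) & (y_true == 1)].mean()
             for g in np.unique(groups)]
    return max(rates) - min(rates)

def permutation_pvalue(groups, y_true, y_pred, n_iter=1000):
    """Share of shuffled-label gaps at least as large as the observed gap."""
    observed = tpr_gap(groups, y_true, y_pred)
    null = [tpr_gap(rng.permutation(groups), y_true, y_pred)
            for _ in range(n_iter)]
    return float(np.mean([g >= observed for g in null]))

n = 4000
y_true = rng.integers(0, 2, n)

# Case 1: subgroup labels assigned completely at random (no causal effect).
groups_null = rng.integers(0, 4, n)
y_pred_null = np.where(rng.random(n) < 0.85, y_true, 1 - y_true)

# Case 2: engineered causal effect -- subgroup 3 gets much noisier predictions.
groups_causal = rng.integers(0, 4, n)
acc = np.where(groups_causal == 3, 0.60, 0.90)
y_pred_causal = np.where(rng.random(n) < acc, y_true, 1 - y_true)

print("random labels:     p =", permutation_pvalue(groups_null, y_true, y_pred_null))
print("engineered effect: p =", permutation_pvalue(groups_causal, y_true, y_pred_causal))
# A sound diagnostic should return a large p-value in case 1 and p near 0 in case 2.
```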
Original abstract
Intersectional biases in healthcare data can produce compound disparities in clinical machine learning models, yet most fairness evaluations assess demographic attributes independently. FairLogue, a toolkit for intersectional fairness auditing, was applied across multiple clinical prediction tasks to evaluate disparities across combined demographic groups. Using the All of Us dataset, two published models were selected for replication and evaluation: (A) prediction of selective serotonin reuptake inhibitor associated bleeding events and (B) two-year stroke risk in patients with atrial fibrillation. Observational fairness metrics were computed across race, gender, and intersectional subgroups, followed by counterfactual analysis to evaluate whether disparities were attributable to group membership. Intersectional evaluation revealed larger disparities than single-axis analyses; however, counterfactual diagnostics indicated that most observed disparities were comparable to those expected under randomized group membership. These results highlight the importance of intersectional fairness auditing and demonstrate how FairLogue provides deeper insight into bias in clinical machine learning systems.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces FairLogue, a toolkit for intersectional fairness auditing of clinical ML models. It replicates two published models on the All of Us dataset—one for SSRI-associated bleeding events and one for two-year stroke risk in atrial fibrillation patients—computes standard observational fairness metrics across race, gender, and their intersections, and applies counterfactual analysis to test whether observed disparities exceed those expected under randomized group membership. The central finding is that intersectional disparities are larger than single-axis ones, yet most are statistically comparable to the randomized baseline.
Significance. If the counterfactual diagnostics prove robust, the work usefully demonstrates that intersectional auditing can surface compound disparities missed by single-axis checks and supplies a reusable toolkit for such evaluations on a large, diverse cohort. The replication of existing models and focus on real clinical tasks are positive; however, the headline equivalence result hinges on an untestable assumption in observational data.
major comments (2)
- [Methods / Results] The counterfactual randomization step (described in the Methods and used for the key comparison in Results) is not identifiable from observational All of Us data without strong, untestable assumptions about unmeasured confounding (SES, genetics, care access). No explicit construction details, balance diagnostics, or sensitivity analyses are provided, so the claim that observed disparities match the randomized baseline cannot be verified and risks being an artifact of residual associations.
- [Abstract / Methods] Abstract and Methods give no information on model architectures, train/test splits, exact definitions of the fairness metrics, statistical tests for equivalence, or how counterfactuals were constructed (e.g., label permutation vs. resampling vs. causal model). These omissions make the central claim—that intersectional disparities are larger yet comparable to random—impossible to assess from the provided text.
minor comments (2)
- [Results] Clarify notation for intersectional subgroups and ensure all figures include confidence intervals or p-values for the disparity comparisons.
- [Discussion] Add a limitations subsection explicitly discussing the observational nature of the data and the assumptions required for the counterfactual analysis.
Simulated Author's Rebuttal
We thank the referee for their constructive comments, which highlight opportunities to strengthen the clarity and rigor of our manuscript. We address each major comment point by point below.
Point-by-point responses
Referee: [Methods / Results] The counterfactual randomization step (described in the Methods and used for the key comparison in Results) is not identifiable from observational All of Us data without strong, untestable assumptions about unmeasured confounding (SES, genetics, care access). No explicit construction details, balance diagnostics, or sensitivity analyses are provided, so the claim that observed disparities match the randomized baseline cannot be verified and risks being an artifact of residual associations.
Authors: The counterfactual analysis is a permutation test that randomly reassigns demographic group labels (including intersections) to the fixed predictions and outcomes, generating a null distribution of fairness metrics under randomized group membership (sketched schematically after these responses). It is a non-parametric test of whether observed disparities exceed those expected by chance under independence; it makes no causal claims and requires no assumptions about unmeasured confounding or identifiability of effects. We agree the Methods lacked explicit details on the procedure. In revision we will add the exact construction (group-label permutation with 1,000 iterations), the associated null-distribution diagnostics, and sensitivity analyses (e.g., varying the iteration count). These additions will allow readers to verify that the comparison is not an artifact. revision: yes
Referee: [Abstract / Methods] Abstract and Methods give no information on model architectures, train/test splits, exact definitions of the fairness metrics, statistical tests for equivalence, or how counterfactuals were constructed (e.g., label permutation vs. resampling vs. causal model). These omissions make the central claim—that intersectional disparities are larger yet comparable to random—impossible to assess from the provided text.
Authors: We agree these details are missing and limit assessment. The revised manuscript will expand the Abstract and Methods to specify: replicated model architectures (including original publication details), train/test split procedures, exact definitions and formulas for all fairness metrics, statistical tests for comparability to the randomized baseline (permutation p-values), and the counterfactual method as group-label permutation. These additions will make the central claims fully reproducible and assessable. revision: yes
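Read concretely, the procedure described in the first response is a group-label permutation test; the sketch below illustrates one such diagnostic under stated assumptions (the selection-rate metric, placeholder arrays, and function names are illustrative, not the authors' implementation): demographic labels are repeatedly shuffled against fixed predictions to build a null distribution for the disparity metric.

```python
# Schematic group-label permutation diagnostic: shuffle demographic labels
# against fixed predictions to build a null distribution for a disparity metric.
# Assumptions: selection-rate gap as the metric, placeholder subgroup labels.
import numpy as np

def selection_rate_gap(groups, y_pred):
    """Max-min gap in positive-prediction rate across subgroups."""
    rates = [y_pred[groups == g].mean() for g in np.unique(groups)]
    return max(rates) - min(rates)

def permutation_null(groups, y_pred, metric, n_iter=1000, seed=0):
    """Null distribution of a disparity metric under shuffled group labels."""
    rng = np.random.default_rng(seed)
    return np.array([metric(rng.permutation(groups), y_pred)
                     for _ in range(n_iter)])

# Placeholder arrays standing in for fixed model outputs and subgroup labels.
rng = np.random.default_rng(1)
groups = rng.choice(["race1-F", "race1-M", "race2-F", "race2-M"], size=2000)
y_pred = rng.integers(0, 2, size=2000)

observed = selection_rate_gap(groups, y_pred)
null = permutation_null(groups, y_pred, selection_rate_gap)
p_value = float(np.mean(null >= observed))
print(f"observed gap {observed:.3f}, "
      f"null 95th pct {np.quantile(null, 0.95):.3f}, p = {p_value:.3f}")
```

A disparity that clears the upper tail of the null distribution is attributed to group membership rather than chance; one that falls inside it is the "comparable to random" pattern reported in the core claim.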
Circularity Check
No circularity: empirical fairness metrics applied to observational data
Full rationale
The paper describes an empirical application of standard observational fairness metrics and counterfactual comparisons on the All of Us dataset for two clinical prediction tasks. No mathematical derivation, first-principles result, or prediction is claimed that reduces by construction to fitted parameters, self-definitions, or self-citations. The central findings (larger intersectional disparities, but comparability to randomized-group baselines) are direct computations from the data rather than tautological outputs. This is a standard, non-circular empirical evaluation.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
- [1] Lett E, La Cava WG. Translating intersectionality to fair machine learning in health sciences. Nat Mach Intell. 2023;5(5):476–479. doi:10.1038/s42256-023-00651-3
- [2] Wastvedt S, Huling J, Wolfson J. An intersectional framework for counterfactual fairness in risk prediction. Biostatistics. 2023;kxad021. doi:10.1093/biostatistics/kxad021
- [3] Hort M, Chen Z, Zhang JM, Harman M, Sarro F. Bias mitigation for machine learning classifiers: a comprehensive survey. arXiv. 2023. Available from: http://arxiv.org/abs/2207.07068
- [4] Marko JGO, Neagu CD, Anand PB. Examining inclusivity: the use of AI and diverse populations in health and social care: a systematic review. BMC Med Inform Decis Mak. 2025;25:57. doi:10.1186/s12911-025-02884-1
- [5] Walsh CG, Chaudhry B, Dua P, Goodman KW, Kaplan B, Kavuluru R, et al. Stigma, biomarkers, and algorithmic bias: recommendations for precision behavioral health with artificial intelligence. JAMIA Open. 2020;3(1):9–15. doi:10.1093/jamiaopen/ooz054
- [6] Allison K, Patel DG, Greene L. Racial and ethnic disparities in primary open-angle glaucoma clinical trials: a systematic review and meta-analysis. JAMA Netw Open. 2021;4(5):e218348. doi:10.1001/jamanetworkopen.2021.8348
- [7] Obermeyer Z, Powers B, Vogeli C, Mullainathan S. Dissecting racial bias in an algorithm used to manage the health of populations. Science. 2019;366(6464):447–453. doi:10.1126/science.aax2342
- [8] Baxter SL, Saseendrakumar BR, Paul P, Kim J, Bonomi L, Kuo TT, et al. Predictive analytics for glaucoma using data from the All of Us research program. Am J Ophthalmol. 2021;227:74–86. doi:10.1016/j.ajo.2021.01.008
- [9] Rahman MM, Purushotham S. Fair and interpretable models for survival analysis. In: Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. 2022. p. 1452–1462. doi:10.1145/3534678.3539259
- [10] Poulain R, Bin Tarek MF, Beheshti R. Improving fairness in AI models on electronic health records: the case for federated learning methods. In: Proceedings of the 2023 ACM Conference on Fairness, Accountability, and Transparency. 2023. p. 1599–1608. doi:10.1145/3593013.3594102
- [11] Foulds J, Islam R, Keya KN, Pan S. An intersectional definition of fairness. arXiv. 2019. doi:10.48550/arXiv.1807.08362
- [12] La Cava WG, Lett E, Wan G. Fair admission risk prediction with proportional multicalibration. Proc Mach Learn Res. 2023;209:350–378.
- [13] Subbian V, Souligne N. FairLogue: a toolkit for intersectional fairness analysis in clinical machine learning models [Internet]. GitHub; 2026. Available from: https://github.com/vsubbian/FairLogue
- [14] Goyal J, Ng DQ, Zhang K, Chan A, Lee J, Zheng K, et al. Using machine learning to develop a clinical prediction model for SSRI-associated bleeding: a feasibility study. BMC Med Inform Decis Mak. 2023;23(1):105. doi:10.1186/s12911-023-02206-3
- [15] Gao J, Mar P, Tang ZZ, Chen G. Fair prediction of 2-year stroke risk in patients with atrial fibrillation. J Am Med Inform Assoc. 2024;31(12):2820–2828. doi:10.1093/jamia/ocae170
- [16] All of Us Research Program. Data and statistics dissemination policy [Internet]. 2024 Jan 18. Available from: https://support.researchallofus.org/hc/en-us/articles/22346276580372-Data-and-Statistics-Dissemination-Policy
- [17] Bear Don't Walk O 4th, Haldar S, Wei DH, Huang H, Rivera RL, Fan JW, et al. Developing and sustaining inclusive language in biomedical informatics communications: an AMIA Board of Directors endorsed paper on the Inclusive Language and Context Style Guidelines. J Am Med Inform Assoc. 2025;32(8):1380–1387. doi:10.1093/jamia/ocaf096