A Framework for Exploring and Disentangling Intersectional Bias: A Case Study in Fetal Ultrasound

Aasa Feragen; Anders Christensen; Aya Elgebaly; Benjamin Laine Jønch Jurgensen; Claes Ladefoged; Joris Fournel; Kamil Mikolaj; Martin Tolsgaard

arXiv: 2605.02942 · v1 · submitted 2026-05-01 · 💻 cs.LG · cs.CV · eess.IV

Aya Elgebaly, Joris Fournel, Benjamin Laine Jønch Jurgensen, Kamil Mikolaj, Anders Christensen, Martin Tolsgaard, Claes Ladefoged, Aasa Feragen

Pith reviewed 2026-05-09 19:38 UTC · model grok-4.3

classification 💻 cs.LG · cs.CV · eess.IV

keywords intersectional bias · fetal ultrasound · medical AI fairness · pixel spacing · image acquisition · confounding factors · deep learning · bias detection

The pith

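A framework combining unsupervised slice discovery, factor-wise analysis, and intersectional evaluation shows that pixel spacing is a consistent driver of performance differences in fetal ultrasound weight estimation, with gains of up to 24% in selected subgroups. The effect is partly explained by gestational age but persists across BMI levels.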

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a structured framework to explore intersectional bias in medical AI for image-based tasks like fetal ultrasound, where disparities arise from image quality shaped by acquisition conditions and patient factors rather than representation alone. By applying unsupervised slice discovery, systematic factor-wise analysis, and targeted intersectional evaluation to a dataset of over 94,000 images, it identifies pixel spacing as a consistent driver of performance differences: higher pixel spacing is associated with improvements of up to 24% in selected subgroups, for both the deep learning model and the Hadlock formula. This matters because pixel spacing is often adapted in cases of high maternal BMI or low gestational age, which introduces confounding that standard fairness methods might miss. The analysis shows that part of the pixel spacing effect is explained by gestational age, but the improvements persist across BMI levels, which underscores the need for acquisition-aware bias evaluation.

Core claim

The authors present a structured framework to explore and detect intersectional bias in image-based medical AI, integrating unsupervised slice discovery, factor-wise analysis, and intersectional evaluation. Applied to over 94,000 fetal ultrasound images for weight estimation using both a deep learning model and the Hadlock formula, the framework identifies pixel spacing as a consistent driver of performance, with higher spacing associated with improvements of up to 24% in selected subgroups. Because pixel spacing is often adapted for high BMI or low gestational age (GA), this effect is at risk of confounding. The intersectional analysis shows that part of the effect is explained by GA, while the improvements persist across BMI levels, underscoring the need for acquisition-aware evaluations.

What carries the argument

The structured framework integrating unsupervised slice discovery to find performance-varying data subgroups, systematic factor-wise analysis across demographic, clinical and acquisition variables, and targeted intersectional evaluation to isolate interactions and confounding.
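The factor-wise step can be illustrated with a minimal sketch. It assumes that Mean Relative Error (MRE) is the mean of |predicted − true| / true and that each factor has already been binned into named levels. The record fields, level names, and toy values are hypothetical and are not taken from the paper.

```python
from collections import defaultdict


def mean_relative_error(records):
    """Mean of |pred - true| / true over a list of records (assumed MRE definition)."""
    return sum(abs(r["pred"] - r["true"]) / r["true"] for r in records) / len(records)


def factor_wise_gaps(records, factors):
    """For each factor, compute MRE per subgroup and the worst-minus-best gap."""
    report = {}
    for factor in factors:
        groups = defaultdict(list)
        for r in records:
            groups[r[factor]].append(r)
        mre = {level: mean_relative_error(rs) for level, rs in groups.items()}
        best = min(mre, key=mre.get)    # subgroup with the lowest error
        worst = max(mre, key=mre.get)   # subgroup with the highest error
        report[factor] = {
            "mre": mre,
            "best": best,
            "worst": worst,
            "gap": mre[worst] - mre[best],
        }
    return report


# Hypothetical toy records: fetal weight in grams, binned pixel spacing (ps) and BMI.
records = [
    {"true": 1000.0, "pred": 1100.0, "ps": "low", "bmi": "normal"},
    {"true": 2000.0, "pred": 2300.0, "ps": "low", "bmi": "high"},
    {"true": 3000.0, "pred": 3150.0, "ps": "high", "bmi": "normal"},
    {"true": 3500.0, "pred": 3605.0, "ps": "high", "bmi": "high"},
]

report = factor_wise_gaps(records, ["ps", "bmi"])
for factor, row in report.items():
    print(factor, row["best"], row["worst"], round(row["gap"], 3))
```

Ranking factors by their gap gives a first indication of which variables deserve a closer intersectional look. A large gap on its own does not show that the factor causes the difference.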

If this is right

Pixel spacing should be included as a variable in bias assessments for ultrasound-based AI models.
Both deep learning models and clinical regression formulas like Hadlock exhibit similar performance sensitivities to acquisition parameters.
Gestational age accounts for part but not all of the performance benefits tied to higher pixel spacing.
Medical AI fairness evaluations must account for interactions between acquisition settings and patient characteristics.
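The idea that gestational age can account for part of a pixel spacing effect can be made concrete with a small stratified comparison. The sketch below uses invented numbers, hypothetical field names, and the assumed definition MRE = mean of |predicted − true| / true. It compares the marginal MRE gap between low and high pixel spacing with the gap inside each gestational age stratum.

```python
from collections import defaultdict


def relative_error(r):
    return abs(r["pred"] - r["true"]) / r["true"]


def mre_by(records, *factors):
    """MRE for every observed combination of levels of the given factors."""
    groups = defaultdict(list)
    for r in records:
        groups[tuple(r[f] for f in factors)].append(relative_error(r))
    return {key: sum(v) / len(v) for key, v in groups.items()}


def make(ps, ga, true, pred, n):
    """Create n identical toy records (hypothetical data)."""
    return [{"ps": ps, "ga": ga, "true": true, "pred": pred} for _ in range(n)]


# Low pixel spacing is over-represented among early scans, which are also harder.
records = (
    make("low", "early", 500.0, 600.0, 3)
    + make("low", "late", 3000.0, 3300.0, 1)
    + make("high", "early", 500.0, 575.0, 1)
    + make("high", "late", 3000.0, 3150.0, 3)
)

# Marginal comparison: ignores gestational age entirely.
marginal = mre_by(records, "ps")
marginal_gap = marginal[("low",)] - marginal[("high",)]

# Intersectional comparison: pixel spacing gap inside each GA stratum.
cells = mre_by(records, "ps", "ga")
within_ga_gap = {
    ga: cells[("low", ga)] - cells[("high", ga)] for ga in ("early", "late")
}

print(round(marginal_gap, 3), {k: round(v, 3) for k, v in within_ga_gap.items()})
```

In this toy data the marginal gap is twice the within-stratum gap: half of the apparent pixel spacing effect comes from the uneven mix of gestational ages, and half remains after stratifying.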

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

This framework could apply to other medical imaging tasks where acquisition protocols adapt to patient body characteristics.
Standardizing pixel spacing in protocols might reduce confounding but requires testing for impacts on image quality in difficult cases.
Future audits could add pixel spacing as an explicit input feature to models to check if it mitigates subgroup differences.
The method points to value in examining raw imaging parameters together with demographics during fairness reviews.

Load-bearing premise

That unsupervised slice discovery combined with systematic factor-wise analysis can reliably disentangle demographic, clinical, and acquisition factors without missing important unmeasured confounders or producing spurious associations.
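To make the premise concrete, here is a minimal sketch of one generic form of slice discovery: cluster samples in an embedding space, then rank the clusters by mean error. This summary does not state which slice discovery algorithm the paper uses, so plain k-means is a stand-in, and the embeddings and error values are invented.

```python
import math


def kmeans(points, k, iters=20):
    """Plain Lloyd's k-means, initialised deterministically with the first k points."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: each point goes to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def rank_slices(labels, errors):
    """Return (slice id, mean error, size) tuples sorted from worst to best."""
    stats = {}
    for lab, err in zip(labels, errors):
        total, count = stats.get(lab, (0.0, 0))
        stats[lab] = (total + err, count + 1)
    return sorted(
        ((lab, total / count, count) for lab, (total, count) in stats.items()),
        key=lambda item: -item[1],
    )


# Hypothetical 2-D embeddings and per-sample relative errors.
points = [(0.0, 0.0), (5.0, 5.0), (0.1, 0.2), (5.1, 4.9), (0.2, 0.1), (4.9, 5.2)]
errors = [0.20, 0.05, 0.18, 0.04, 0.22, 0.06]

labels = kmeans(points, k=2)
ranked = rank_slices(labels, errors)
print(labels, ranked)
```

The clusters themselves carry no labels. Interpreting them requires the follow-up step of comparing factor distributions between the worst and best slices, which is where unmeasured confounders can go unnoticed.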

What would settle it

Re-training and evaluating the models on a dataset where pixel spacing is held fixed while balancing BMI and gestational age across subgroups, then checking whether performance differences disappear.
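The data-selection half of such a test could look like the sketch below: keep a single pixel spacing level, then downsample every BMI × gestational age cell to the size of the smallest cell. Field names, level names, and counts are hypothetical, and the sketch covers only the construction of the evaluation subset, not re-training.

```python
import random
from collections import defaultdict


def balanced_fixed_ps(records, ps_level, seed=0):
    """Keep one pixel-spacing level and balance the (BMI, GA) cells by downsampling."""
    rng = random.Random(seed)  # fixed seed so the subset is reproducible
    cells = defaultdict(list)
    for r in records:
        if r["ps"] == ps_level:
            cells[(r["bmi"], r["ga"])].append(r)
    if not cells:
        return []
    n = min(len(members) for members in cells.values())
    subset = []
    for key in sorted(cells):
        subset.extend(rng.sample(cells[key], n))
    return subset


# Hypothetical cell sizes for the "high" pixel-spacing level.
counts = {
    ("normal", "early"): 5,
    ("normal", "late"): 3,
    ("high", "early"): 2,
    ("high", "late"): 4,
}
records = [
    {"ps": "high", "bmi": bmi, "ga": ga, "id": f"{bmi}-{ga}-{i}"}
    for (bmi, ga), n in counts.items()
    for i in range(n)
]
records.append({"ps": "low", "bmi": "high", "ga": "early", "id": "excluded"})

subset = balanced_fixed_ps(records, "high")
print(len(subset))
```

Downsampling to the smallest cell discards data, so with rare combinations (for example high BMI at low gestational age) the balanced subset may be too small to detect a difference.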

Figures

Figures reproduced from arXiv: 2605.02942 by Aasa Feragen, Anders Christensen, Aya Elgebaly, Benjamin Laine Jønch Jurgensen, Claes Ladefoged, Joris Fournel, Kamil Mikolaj, Martin Tolsgaard.

**Figure 2.** Unsupervised slice discovery results. (left) Distribution of factors in worst vs. best-performing slices. (right) Radar comparison of subgroup characteristics. Higher GA and PS dominate the best-performing slice (Slice 5), whereas lower ranges of both factors are more prevalent in the worst-performing slice (Slice 9). Intra-slice analysis confirms that lower GA and lower PS are consistently associated with…

**Figure 3.** Full-dataset global radar plot from the Structured Stratified Performance Analysis. Each axis represents the MRE gap between the best (green) and worst (red) performing subgroups for every factor. Larger radial values indicate greater performance variability across subgroups. […] 3.3 Structured Stratified Analysis: Global stratified evaluation revealed significant performance variability across clinical and a…

**Figure 4.** Factor-wise subgroup analysis from the Structured Stratified Performance Analysis. The Mean Relative Error (MRE) is shown for all subgroups within each factor, comparing the deep learning (DL) model (green) with the Hadlock formula (red). […] groups, indicating that the PS-associated improvement extends beyond challenging imaging conditions. The Hadlock formula showed relative differences of 17% in the high-…

**Figure 5.** Intersectional analysis of pixel spacing (PS) vs. maternal BMI and…

Original abstract

Bias in medical AI is often framed as a problem of representation. However, in image-based tasks such as fetal ultrasound, performance disparities can arise even when representation is adequate, because predictive accuracy depends strongly on image quality. Image quality is shaped by acquisition conditions and operator expertise, as well as patient-dependent factors such as maternal body mass index (BMI), all of which may correlate with sensitive demographic features. Consequently, observed disparities may reflect the combined influence of demographic, clinical, and acquisition-related factors rather than data imbalance alone, and may obscure underlying interaction or confounding effects. We propose a structured framework to explore and detect intersectional bias, combining unsupervised slice discovery, systematic factor-wise analysis, and targeted intersectional evaluation. In a case study of over 94,000 ultrasound images for fetal weight estimation, we analyze bias in a state-of-the-art deep learning (DL) model and the clinical standard Hadlock, a regression formula using biometric measurements. Pixel spacing (PS), a parameter considered suboptimal in current acquisition protocols, emerged as a consistent driver of performance differences, with higher PS associated with improvements of up to 24% in selected subgroups for both models. Because PS is often adapted in cases of high BMI or low gestational age (GA), this effect carries a substantial risk of confounding. Our intersectional analysis revealed that part of the PS-associated signal is explained by GA, while PS-related improvements persist across BMI strata, highlighting the importance of acquisition-aware and interaction-aware evaluation in medical AI fairness research.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper gives a workable framework for spotting acquisition biases like pixel spacing in fetal ultrasound beyond demographics, but the unsupervised disentanglement from GA and BMI confounders is only moderately supported.

The main takeaway is a structured way to combine unsupervised slice discovery with factor-wise and intersectional checks to look at bias in fetal weight estimation from ultrasound. They run this on over 94,000 images, compare a deep learning model to the Hadlock formula, and report that higher pixel spacing links to performance gains up to 24% in some subgroups. They also show partial mediation by gestational age while the signal holds across BMI levels, which flags a real risk that acquisition choices confound demographic effects.

Referee Report

2 major / 2 minor

Summary. The paper introduces a framework combining unsupervised slice discovery, systematic factor-wise analysis, and targeted intersectional evaluation to detect and disentangle intersectional biases in medical imaging AI. In a case study using over 94,000 fetal ultrasound images for weight estimation, it compares a state-of-the-art deep learning model against the clinical Hadlock regression formula and identifies pixel spacing (PS) as a consistent performance driver, with higher PS linked to up to 24% improvements in selected subgroups for both models. The analysis indicates partial mediation by gestational age (GA) while PS effects persist across BMI strata, emphasizing risks of confounding from acquisition parameters that correlate with demographic and clinical factors.

Significance. If the disentanglement holds, the work is significant for shifting focus in medical AI fairness from representation alone to acquisition and interaction effects that can produce performance disparities even with adequate data balance. The large dataset and concrete quantitative findings on PS (including subgroup-specific gains and partial GA mediation) provide actionable insights for protocol design and evaluation practices. Strengths include the empirical grounding on real clinical data and the structured framework that enables exploration beyond standard parity metrics.

major comments (2)

Abstract: The central claim that PS drives up to 24% performance gains with partial GA mediation but persistence across BMI strata depends on the unsupervised slice discovery plus factor-wise analysis successfully isolating PS effects from correlated confounders (e.g., operator adaptation or equipment choices for high-BMI/low-GA cases). No quantitative validation is described, such as mutual information between discovered slices and PS after conditioning on BMI/GA or ablation of the discovery step, leaving the attribution vulnerable to spurious associations from known clinical adaptation rules.
Case study results (implied): The reported subgroup improvements and intersectional findings lack mention of controls for multiple testing across the numerous demographic, clinical, and acquisition factors examined; without such adjustments, the 24% figure and persistence claims risk overstatement from chance findings in the large but factor-rich analysis.
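The conditional mutual information check mentioned in the first major comment can be sketched for discrete variables with a plug-in estimator. In the setting discussed here, `xs` could be a binned pixel spacing level, `ys` a high-error flag, and `zs` a combined BMI and GA stratum; these encodings and the toy inputs are assumptions, not taken from the paper.

```python
import math
from collections import Counter


def conditional_mutual_information(xs, ys, zs):
    """Plug-in estimate of I(X; Y | Z) in nats for discrete sequences.

    Uses I(X;Y|Z) = sum over (x,y,z) of p(x,y,z) * log[p(z) p(x,y,z) / (p(x,z) p(y,z))],
    with every probability replaced by an empirical frequency.
    """
    n = len(xs)
    n_xyz = Counter(zip(xs, ys, zs))
    n_xz = Counter(zip(xs, zs))
    n_yz = Counter(zip(ys, zs))
    n_z = Counter(zs)
    cmi = 0.0
    for (x, y, z), count in n_xyz.items():
        cmi += (count / n) * math.log(
            n_z[z] * count / (n_xz[(x, z)] * n_yz[(y, z)])
        )
    return cmi


# Case 1: X and Y are both fully determined by Z, so nothing is left after conditioning.
z1 = [0, 0, 1, 1]
cmi_explained = conditional_mutual_information(z1, z1, z1)

# Case 2: inside every stratum of Z, Y still tracks X exactly.
z2 = [0, 0, 1, 1]
x2 = [0, 1, 0, 1]
cmi_residual = conditional_mutual_information(x2, x2, z2)

print(cmi_explained, cmi_residual)
```

A value near zero would suggest that the stratifying variables account for the association, while a clearly positive value points to a residual link. The plug-in estimator is biased upward when cells are sparse, so a permutation baseline within strata would be a sensible companion.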

minor comments (2)

Abstract: Specify the exact performance metric underlying the '24% improvement' (e.g., mean absolute error reduction in fetal weight estimation) and whether it is relative or absolute, to allow precise interpretation of the effect sizes.
Abstract: Provide the precise dataset size (beyond 'over 94,000'), inclusion/exclusion criteria, and train/validation/test splits to support reproducibility of the quantitative findings on PS.
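The distinction between relative and absolute improvement raised in the first minor comment is easy to see with a worked example. The error values below are hypothetical and chosen only so that the relative figure comes out at 24%.

```python
def improvement(baseline_error, new_error):
    """Return (absolute, relative) reduction in error."""
    absolute = baseline_error - new_error
    relative = absolute / baseline_error
    return absolute, relative


# Hypothetical MRE values: 10.0% in one subgroup, 7.6% in another.
absolute, relative = improvement(0.100, 0.076)
print(f"absolute: {absolute:.3f} ({absolute * 100:.1f} percentage points)")
print(f"relative: {relative:.0%}")
```

Here a 24% relative improvement corresponds to only 2.4 percentage points of MRE, which is why the two readings lead to very different impressions of effect size.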

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their detailed and constructive comments, which highlight important aspects of validating our framework's ability to isolate effects and ensure statistical robustness. We address each major comment below and outline revisions to strengthen the manuscript accordingly.

Referee: Abstract: The central claim that PS drives up to 24% performance gains with partial GA mediation but persistence across BMI strata depends on the unsupervised slice discovery plus factor-wise analysis successfully isolating PS effects from correlated confounders (e.g., operator adaptation or equipment choices for high-BMI/low-GA cases). No quantitative validation is described, such as mutual information between discovered slices and PS after conditioning on BMI/GA or ablation of the discovery step, leaving the attribution vulnerable to spurious associations from known clinical adaptation rules.

Authors: We acknowledge that the manuscript does not include explicit quantitative validation steps such as conditional mutual information or ablation of the slice discovery component. The framework is designed as an exploratory tool that first identifies performance slices via unsupervised methods and then attributes differences through systematic factor-wise and intersectional stratification, which in our case study pointed to PS with partial GA mediation and persistence across BMI. To directly address the concern about potential spurious associations, we will add in the revision: (1) an ablation study comparing performance attributions with and without the unsupervised discovery step, and (2) conditional mutual information analysis between PS and model performance metrics after conditioning on BMI and GA. These additions will provide stronger evidence for the isolation of effects. revision: yes
Referee: Case study results (implied): The reported subgroup improvements and intersectional findings lack mention of controls for multiple testing across the numerous demographic, clinical, and acquisition factors examined; without such adjustments, the 24% figure and persistence claims risk overstatement from chance findings in the large but factor-rich analysis.

Authors: The referee is correct that the large number of factors examined introduces a multiple-testing concern that was not addressed in the original submission. In the revised manuscript, we will apply false discovery rate (FDR) control across all subgroup and intersectional analyses. We will report both unadjusted and adjusted p-values for the key performance differences, including the up to 24% improvements and the persistence of PS effects across BMI strata, to ensure the claims are statistically robust and not overstated due to chance findings. revision: yes
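The rebuttal promises FDR control without naming a procedure. As one common choice, the Benjamini-Hochberg step-up adjustment can be sketched as follows; the p-values are invented for illustration.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices by ascending p
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotone adjusted values.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical unadjusted p-values for four subgroup comparisons.
p_values = [0.01, 0.04, 0.03, 0.20]
adjusted = benjamini_hochberg(p_values)
significant = [q <= 0.05 for q in adjusted]
print([round(q, 4) for q in adjusted], significant)
```

In this toy example three comparisons fall below 0.05 before adjustment but only one survives it, which is the kind of shrinkage the referee is concerned about.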

Circularity Check

0 steps flagged

No circularity: empirical framework applied to observational data

The paper proposes a framework of unsupervised slice discovery plus factor-wise and intersectional analysis, then applies it to a large fetal ultrasound dataset to report observational associations (e.g., PS-linked performance gains). No derivation chain, first-principles prediction, or mathematical result is claimed that reduces by construction to fitted parameters, self-definitions, or prior self-citations. The 24% figure and PS-GA/BMI relations are data-driven outputs, not inputs renamed as predictions. Self-citations, if present, are not load-bearing for the central empirical claims. This is a standard non-circular empirical study.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The framework relies on standard assumptions from unsupervised learning and statistical analysis in medical imaging without introducing new free parameters or invented entities; the contribution is the integration and application rather than new foundational elements.

axioms (2)

Domain assumption: Unsupervised slice discovery identifies subgroups that are relevant to performance disparities and bias. Invoked as the starting point for exploring hidden patterns without predefined labels.
Domain assumption: Systematic factor-wise and intersectional analysis can separate the influences of acquisition parameters from demographic and clinical factors. Central premise required for the claim that confounding can be detected and partially explained.

Reference graph

Works this paper leans on

25 extracted references · 25 canonical work pages

[1] Andreasen, L.A., Tabor, A., Nørgaard, L.N., Taksøe-Vester, C.A., Krebs, L., Jørgensen, F.S., Jepsen, I.E., Sharif, H., Zingenberg, H., Rosthøj, S., et al.: Why we succeed and fail in detecting fetal growth restriction: A population-based study. Acta Obstetricia et Gynecologica Scandinavica 100(5), 893–899 (2021)
[2] Chung, K., Han, C.S.: Obstetric ultrasound imaging in the patient with obesity. American Journal of Obstetrics and Gynecology 151(3), 333–337 (1985)
[3] Dawood, T., Stucchi, G., Feragen, A.: Racial disparities persist beyond data representation in medical imaging—even predictive uncertainty fails to capture them. In: Medical Imaging with Deep Learning-Short Papers (2025)
[4] Eyuboglu, S., et al.: Domino: Discovering systematic errors with cross-modal embeddings. arXiv preprint arXiv:2203.14960 (2022)
[5] Fournel, J., et al.: The cervix in context: Bias assessment in preterm birth prediction. In: Proceedings of the 3rd International Workshop on Fairness of AI in Medical Imaging (FAIMI). pp. 43–52. Springer (2025)
[6] Glocker, B., Jones, C., Roschewitz, M., Winzeck, S.: Risk of bias in chest radiography deep learning foundation models. Radiology: Artificial Intelligence 5(6), e230060 (2023)
[7] Hadlock, F.P., Harrist, R.B., Sharman, R.S., Deter, R.L., Park, S.K.: Estimation of fetal weight with the use of head, body, and femur measurements—a prospective study. American Journal of Obstetrics and Gynecology 151(3), 333–337 (1985)
[8] Han, C.S., Holliman, K.: How to optimize imaging in the obese gravida. Contemporary OB/GYN (2019)
[9] Jiménez-Sánchez, A., Juodelyte, D., Chamberlain, B., Cheplygina, V.: Detecting shortcuts in medical images-a case study in chest x-rays. In: 2023 IEEE 20th international symposium on biomedical imaging (ISBI). pp. 1–5. IEEE (2023)
[10] Johnson, N., Cabrera, Á.A., Plumb, G., Talwalkar, A.: Where does my model underperform? a human evaluation of slice discovery algorithms. In: Proceedings of the AAAI Conference on Human Computation and Crowdsourcing. vol. 11, pp. 65–76 (2023)
[11] Larrazabal, A.J., Nieto, N., Peterson, V., Milone, D.H., Ferrante, E.: Gender imbalance in medical imaging datasets produces biased classifiers for computer-aided diagnosis. Proceedings of the National Academy of Sciences 117(23), 12592–12594 (2020)
[12] Lee, H., Yang, S., Chu, Y.: Equitable AI in healthcare: Navigating sex, gender, and intersectional biases in diagnostics. In: Sex, Gender, and Emerging Technology in Healthcare: Mitigating Bias and Fostering Equity: From Biology to Care: Sex and Gender Impacts on Health and Medicine, pp. 197–229. Springer (2026)
[13] Maršál, K., et al.: Intrauterine growth curves based on ultrasonically estimated foetal weights. Acta Paediatrica 85, 843–848 (1996)
[14] Mikołaj, K.W., et al.: Predicting abnormal fetal growth using deep learning. npj Digital Medicine 8, 318 (2025)
[15] Morchi, L., Mariani, A., Diodato, A., Tognarelli, S., Cafarelli, A., Menciassi, A.: Acoustic coupling quantification in ultrasound-guided focused ultrasound surgery: Simulation-based evaluation and experimental feasibility study. Ultrasound in Medicine & Biology 46(12), 3305–3316 (2020)
[16] Nazer, L.H., Zatarah, R., Waldrip, S., Ke, J.X.C., Moukheiber, M., Khanna, A.K., et al.: Bias in artificial intelligence algorithms and recommendations for mitigation. PLOS Digital Health 2(6), e0000278 (2023)
[17] Norori, N., Hu, Q., Aellen, F.M., Faraci, F.D., Tzovara, A.: Addressing bias in big data and AI for health care: A call for open science. Patterns 2(10) (2021)
[18] Oakden-Rayner, L., Dunnmon, J., Carneiro, G., Ré, C.: Hidden stratification causes clinically meaningful failures in machine learning for medical imaging. In: Proceedings of the ACM conference on health, inference, and learning. pp. 151–159 (2020)
[19] Olesen, V., Weng, N., Feragen, A., Petersen, E.: Slicing through bias: Explaining performance gaps in medical image analysis using slice discovery methods. arXiv preprint arXiv:2406.12142 (2024)
[20] Puyol-Antón, E., Ruijsink, B., Piechnik, S.K., Neubauer, S., Petersen, S.E., Razavi, R., King, A.P.: Fairness in cardiac MR image analysis: an investigation of bias due to data imbalance in deep learning based segmentation. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. pp. 413–423. Springer (2021)
[21] Roski, J., Chapman, W., Heffner, J., Trivedi, R., Del Fiol, G., Kukafka, R., et al.: How artificial intelligence is changing health and health care. In: Matheny, M., Israni, S.T., Ahmed, M., Whicher, D. (eds.) Artificial Intelligence in Health Care: The Hope, the Hype, the Promise, the Peril, pp. 65–98. The National Academies Press (2019)
[22] Seyyed-Kalantari, L., et al.: Underdiagnosis bias of artificial intelligence algorithms applied to chest radiographs in under-served patient populations. Nature Medicine 27(12), 2176–2182 (2021)
[23] Stanley, E.A., Wilms, M., Forkert, N.D.: Disproportionate subgroup impacts and other challenges of fairness in artificial intelligence for medical image analysis. In: Workshop on the Ethical and Philosophical Issues in Medical Imaging. pp. 14–25. Springer (2022)
[24] World Health Organization: Training in diagnostic ultrasound: Essentials, principles and standards. Tech. Rep. Technical Report Series No. 875, World Health Organization, Geneva (1998)
[25] Zhang, Y., Dunn, A.G., Naseem, U., Kim, J.: Intersectional fairness in vision-language models for medical image disease classification. arXiv preprint arXiv:2512.15249 (2025)
