arxiv: 2605.02942 · v1 · submitted 2026-05-01 · 💻 cs.LG · cs.CV · eess.IV

A Framework for Exploring and Disentangling Intersectional Bias: A Case Study in Fetal Ultrasound

Pith reviewed 2026-05-09 19:38 UTC · model grok-4.3

classification 💻 cs.LG · cs.CV · eess.IV
keywords intersectional bias · fetal ultrasound · medical AI fairness · pixel spacing · image acquisition · confounding factors · deep learning · bias detection

The pith

A framework combining unsupervised slice discovery and factor-wise analysis identifies pixel spacing as a driver of performance differences of up to 24% in fetal ultrasound weight-estimation models, an effect at risk of confounding with BMI and gestational age.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a structured framework to explore intersectional bias in medical AI for image-based tasks like fetal ultrasound, where disparities arise from image quality shaped by acquisition conditions and patient factors rather than from representation alone. By applying unsupervised slice discovery, systematic factor-wise analysis, and targeted intersectional evaluation to a dataset of over 94,000 images, it identifies pixel spacing as a consistent driver of performance, with higher pixel spacing associated with improvements of up to 24% in selected subgroups for both the deep learning model and the Hadlock formula. This matters because pixel spacing is often adapted to maternal BMI or gestational age, introducing confounding that standard fairness methods might miss. The analysis shows that part of the pixel spacing effect is explained by gestational age, while the improvements persist across BMI levels, which argues for acquisition-aware bias evaluation.

Core claim

The authors present a structured framework to explore and detect intersectional bias in image-based medical AI, integrating unsupervised slice discovery, factor-wise analysis, and intersectional evaluation. Applied to over 94,000 fetal ultrasound images for weight estimation using both a deep learning model and the Hadlock formula, the framework identifies pixel spacing as a consistent driver of performance, with higher spacing associated with improvements of up to 24% in selected subgroups. Because pixel spacing is often adapted for high BMI or low gestational age (GA), the effect is at risk of confounding: the intersectional analysis attributes part of it to GA, while the improvements persist across BMI levels, underscoring the need for acquisition-aware evaluation.

What carries the argument

The structured framework integrating unsupervised slice discovery to find performance-varying data subgroups, systematic factor-wise analysis across demographic, clinical and acquisition variables, and targeted intersectional evaluation to isolate interactions and confounding.
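The factor-wise step can be illustrated with a minimal Python sketch. It is not the authors' implementation: the record layout, the function names `mean_relative_error` and `factor_wise_gaps`, and the toy numbers are all assumptions made for illustration. It computes the Mean Relative Error (MRE) per subgroup of each factor and the gap between the worst and best subgroup, which is the quantity the paper's radar plot is described as showing.

```python
from collections import defaultdict

def mean_relative_error(pairs):
    """Mean of |predicted - true| / true over (predicted, true) pairs."""
    return sum(abs(p - t) / t for p, t in pairs) / len(pairs)

def factor_wise_gaps(records, factors):
    """For each factor, compute the MRE per subgroup and the worst-best gap.

    records: list of dicts with keys 'pred', 'true' and one key per factor.
    Returns {factor: (gap, {subgroup: mre})}.
    """
    result = {}
    for factor in factors:
        groups = defaultdict(list)
        for r in records:
            groups[r[factor]].append((r["pred"], r["true"]))
        mre = {g: mean_relative_error(v) for g, v in groups.items()}
        result[factor] = (max(mre.values()) - min(mre.values()), mre)
    return result

# Toy records with invented weights in grams (hypothetical, not from the paper).
records = [
    {"pred": 1100, "true": 1000, "ps": "low", "bmi": "high"},
    {"pred": 2200, "true": 2000, "ps": "low", "bmi": "normal"},
    {"pred": 1050, "true": 1000, "ps": "high", "bmi": "high"},
    {"pred": 2100, "true": 2000, "ps": "high", "bmi": "normal"},
]
gaps = factor_wise_gaps(records, ["ps", "bmi"])
for factor, (gap, per_group) in gaps.items():
    print(factor, round(gap, 3), per_group)
```

In this toy data the pixel spacing factor shows a gap while BMI does not, because each BMI subgroup contains one low-spacing and one high-spacing image. A marginal analysis like this cannot by itself separate correlated factors, which is why the framework adds an intersectional step.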

If this is right

  • Pixel spacing should be included as a variable in bias assessments for ultrasound-based AI models.
  • Both deep learning models and clinical regression formulas like Hadlock exhibit similar performance sensitivities to acquisition parameters.
  • Gestational age accounts for part but not all of the performance benefits tied to higher pixel spacing.
  • Medical AI fairness evaluations must account for interactions between acquisition settings and patient characteristics.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This framework could apply to other medical imaging tasks where acquisition protocols adapt to patient body characteristics.
  • Standardizing pixel spacing in protocols might reduce confounding but requires testing for impacts on image quality in difficult cases.
  • Future audits could add pixel spacing as an explicit input feature to models to check if it mitigates subgroup differences.
  • The method points to value in examining raw imaging parameters together with demographics during fairness reviews.

Load-bearing premise

That unsupervised slice discovery combined with systematic factor-wise analysis can reliably disentangle demographic, clinical, and acquisition factors without missing important unmeasured confounders or producing spurious associations.

What would settle it

Re-training and evaluating the models on a dataset where pixel spacing is held fixed while balancing BMI and gestational age across subgroups, then checking whether performance differences disappear.
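A lighter observational analogue of this test, which does not replace the retraining experiment, is to compare low and high pixel spacing inside each GA or BMI stratum. The sketch below assumes per-image relative errors are already available; the keys `rel_err`, `ps`, `ga`, `bmi` and the function name `stratified_ps_effect` are hypothetical, and the numbers are invented.

```python
from collections import defaultdict

def stratified_ps_effect(records, stratum_key):
    """Relative MRE reduction from low to high pixel spacing within each stratum.

    records: dicts with 'rel_err' (per-image relative error), 'ps' in
    {'low', 'high'}, and stratum_key (for example a GA or BMI bin).
    Returns {stratum: (mre_low - mre_high) / mre_low}. Strata that lack one
    of the two PS levels, or have zero low-PS error, are skipped.
    """
    cells = defaultdict(lambda: defaultdict(list))
    for r in records:
        cells[r[stratum_key]][r["ps"]].append(r["rel_err"])
    effect = {}
    for stratum, by_ps in cells.items():
        if "low" in by_ps and "high" in by_ps:
            low = sum(by_ps["low"]) / len(by_ps["low"])
            high = sum(by_ps["high"]) / len(by_ps["high"])
            if low > 0:
                effect[stratum] = (low - high) / low
    return effect

# Invented toy data: one image per (stratum, PS level) cell.
toy = [
    {"rel_err": 0.10, "ps": "low", "ga": "early", "bmi": "normal"},
    {"rel_err": 0.08, "ps": "high", "ga": "early", "bmi": "normal"},
    {"rel_err": 0.10, "ps": "low", "ga": "late", "bmi": "high"},
    {"rel_err": 0.05, "ps": "high", "ga": "late", "bmi": "high"},
]
by_ga = stratified_ps_effect(toy, "ga")
by_bmi = stratified_ps_effect(toy, "bmi")
print(by_ga, by_bmi)
```

If the within-stratum effect shrinks toward zero once a factor is held fixed, that factor plausibly explains part of the marginal effect. If it stays positive in every stratum, the effect persists across that factor, which is the pattern the paper reports for BMI.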

Figures

Figures reproduced from arXiv: 2605.02942 by Aasa Feragen, Anders Christensen, Aya Elgebaly, Benjamin Laine Jønch Jurgensen, Claes Ladefoged, Joris Fournel, Kamil Mikolaj, Martin Tolsgaard.

Figure 1: Overview of the proposed framework for discovering and disentan…
Figure 2: Unsupervised slice discovery results. (left) Distribution of factors in worst vs. best-performing slices. (right) Radar comparison of subgroup characteristics. Higher GA and PS dominate the best-performing slice (Slice 5), whereas lower ranges of both factors are more prevalent in the worst-performing slice (Slice 9). Intra-slice analysis confirms that lower GA and lower PS are consistently associated with…
Figure 3: Full-dataset global radar plot from the Structured Stratified Performance Analysis. Each axis represents the MRE gap between the best (green) and worst (red) performing subgroups for every factor. Larger radial values indicate greater performance variability across subgroups. 3.3 Structured Stratified Analysis: Global stratified evaluation revealed significant performance variability across clinical and a…
Figure 4: Factor-wise subgroup analysis from the Structured Stratified Performance Analysis. The Mean Relative Error (MRE) is shown for all subgroups within each factor, comparing the deep learning (DL) model (green) with the Hadlock formula (red). … groups, indicating that the PS-associated improvement extends beyond challenging imaging conditions. The Hadlock formula showed relative differences of 17% in the high-…
Figure 5: Intersectional analysis of pixel spacing (PS) vs. maternal BMI and…
Original abstract

Bias in medical AI is often framed as a problem of representation. However, in image-based tasks such as fetal ultrasound, performance disparities can arise even when representation is adequate, because predictive accuracy depends strongly on image quality. Image quality is shaped by acquisition conditions and operator expertise, as well as patient-dependent factors such as maternal body mass index (BMI), all of which may correlate with sensitive demographic features. Consequently, observed disparities may reflect the combined influence of demographic, clinical, and acquisition-related factors rather than data imbalance alone, and may obscure underlying interaction or confounding effects. We propose a structured framework to explore and detect intersectional bias, combining unsupervised slice discovery, systematic factor-wise analysis, and targeted intersectional evaluation. In a case study of over 94,000 ultrasound images for fetal weight estimation, we analyze bias in a state-of-the-art deep learning (DL) model and the clinical standard Hadlock, a regression formula using biometric measurements. Pixel spacing (PS), a parameter considered suboptimal in current acquisition protocols, emerged as a consistent driver of performance differences, with higher PS associated with improvements of up to 24% in selected subgroups for both models. Because PS is often adapted in cases of high BMI or low gestational age (GA), this effect carries a substantial risk of confounding. Our intersectional analysis revealed that part of the PS-associated signal is explained by GA, while PS-related improvements persist across BMI strata, highlighting the importance of acquisition-aware and interaction-aware evaluation in medical AI fairness research.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces a framework combining unsupervised slice discovery, systematic factor-wise analysis, and targeted intersectional evaluation to detect and disentangle intersectional biases in medical imaging AI. In a case study using over 94,000 fetal ultrasound images for weight estimation, it compares a state-of-the-art deep learning model against the clinical Hadlock regression formula and identifies pixel spacing (PS) as a consistent performance driver, with higher PS linked to up to 24% improvements in selected subgroups for both models. The analysis indicates partial mediation by gestational age (GA) while PS effects persist across BMI strata, emphasizing risks of confounding from acquisition parameters that correlate with demographic and clinical factors.

Significance. If the disentanglement holds, the work is significant for shifting focus in medical AI fairness from representation alone to acquisition and interaction effects that can produce performance disparities even with adequate data balance. The large dataset and concrete quantitative findings on PS (including subgroup-specific gains and partial GA mediation) provide actionable insights for protocol design and evaluation practices. Strengths include the empirical grounding on real clinical data and the structured framework that enables exploration beyond standard parity metrics.

major comments (2)
  1. Abstract: The central claim that PS drives up to 24% performance gains with partial GA mediation but persistence across BMI strata depends on the unsupervised slice discovery plus factor-wise analysis successfully isolating PS effects from correlated confounders (e.g., operator adaptation or equipment choices for high-BMI/low-GA cases). No quantitative validation is described, such as mutual information between discovered slices and PS after conditioning on BMI/GA or ablation of the discovery step, leaving the attribution vulnerable to spurious associations from known clinical adaptation rules.
  2. Case study results (implied): The reported subgroup improvements and intersectional findings lack mention of controls for multiple testing across the numerous demographic, clinical, and acquisition factors examined; without such adjustments, the 24% figure and persistence claims risk overstatement from chance findings in the large but factor-rich analysis.
minor comments (2)
  1. Abstract: Specify the exact performance metric underlying the '24% improvement' (e.g., mean absolute error reduction in fetal weight estimation) and whether it is relative or absolute, to allow precise interpretation of the effect sizes.
  2. Abstract: Provide the precise dataset size (beyond 'over 94,000'), inclusion/exclusion criteria, and train/validation/test splits to support reproducibility of the quantitative findings on PS.
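The conditional mutual information check suggested in the first major comment can be sketched as follows. This is a plug-in estimator for discrete variables, assuming pixel spacing, slice membership, and the conditioning factors have already been binned; the function name and the toy sequences are illustrative, and the paper does not describe any such computation.

```python
from collections import Counter
from math import log

def conditional_mutual_information(xs, ys, zs):
    """Plug-in estimate of I(X; Y | Z) in nats for discrete sequences.

    Uses I(X;Y|Z) = sum p(x,y,z) * log( p(x,y,z) p(z) / (p(x,z) p(y,z)) ),
    with probabilities replaced by empirical frequencies.
    """
    n = len(xs)
    n_xyz = Counter(zip(xs, ys, zs))
    n_xz = Counter(zip(xs, zs))
    n_yz = Counter(zip(ys, zs))
    n_z = Counter(zs)
    cmi = 0.0
    for (x, y, z), c in n_xyz.items():
        cmi += (c / n) * log(c * n_z[z] / (n_xz[(x, z)] * n_yz[(y, z)]))
    return cmi

# Toy case 1: X (say, a PS bin) is fully determined by Z (say, a GA bin),
# so X carries no information about Y once Z is known.
cmi_explained = conditional_mutual_information(
    [0, 0, 1, 1], [0, 1, 0, 1], [0, 0, 1, 1]
)
# Toy case 2: X equals Y and is unrelated to Z, so the dependence survives.
cmi_persistent = conditional_mutual_information(
    [0, 1, 0, 1], [0, 1, 0, 1], [0, 0, 1, 1]
)
print(cmi_explained, cmi_persistent)
```

Here `zs` could hold tuples such as `(bmi_bin, ga_bin)` to condition on both factors at once. Plug-in estimates are biased upward on small samples with many bins, so a permutation baseline within each stratum would be a sensible companion to any real use.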

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their detailed and constructive comments, which highlight important aspects of validating our framework's ability to isolate effects and ensure statistical robustness. We address each major comment below and outline revisions to strengthen the manuscript accordingly.

  1. Referee: Abstract: The central claim that PS drives up to 24% performance gains with partial GA mediation but persistence across BMI strata depends on the unsupervised slice discovery plus factor-wise analysis successfully isolating PS effects from correlated confounders (e.g., operator adaptation or equipment choices for high-BMI/low-GA cases). No quantitative validation is described, such as mutual information between discovered slices and PS after conditioning on BMI/GA or ablation of the discovery step, leaving the attribution vulnerable to spurious associations from known clinical adaptation rules.

    Authors: We acknowledge that the manuscript does not include explicit quantitative validation steps such as conditional mutual information or ablation of the slice discovery component. The framework is designed as an exploratory tool that first identifies performance slices via unsupervised methods and then attributes differences through systematic factor-wise and intersectional stratification, which in our case study pointed to PS with partial GA mediation and persistence across BMI. To directly address the concern about potential spurious associations, we will add in the revision: (1) an ablation study comparing performance attributions with and without the unsupervised discovery step, and (2) conditional mutual information analysis between PS and model performance metrics after conditioning on BMI and GA. These additions will provide stronger evidence for the isolation of effects. revision: yes

  2. Referee: Case study results (implied): The reported subgroup improvements and intersectional findings lack mention of controls for multiple testing across the numerous demographic, clinical, and acquisition factors examined; without such adjustments, the 24% figure and persistence claims risk overstatement from chance findings in the large but factor-rich analysis.

    Authors: The referee is correct that the large number of factors examined introduces a multiple-testing concern that was not addressed in the original submission. In the revised manuscript, we will apply false discovery rate (FDR) control across all subgroup and intersectional analyses. We will report both unadjusted and adjusted p-values for the key performance differences, including the up to 24% improvements and the persistence of PS effects across BMI strata, to ensure the claims are statistically robust and not overstated due to chance findings. revision: yes
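For readers unfamiliar with FDR control, the Benjamini-Hochberg step-up procedure is one common choice. The rebuttal does not say which FDR method would be used, so the sketch below is an assumption, and the p-values are invented.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, scaling by m / rank and taking a
    # running minimum so adjusted values are monotone in the raw p-values.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Invented p-values for four subgroup comparisons.
raw = [0.01, 0.04, 0.03, 0.5]
adj = benjamini_hochberg(raw)
print(adj)
```

A comparison is reported as significant at FDR level q when its adjusted p-value is at most q. Reporting both raw and adjusted values, as the authors propose, lets readers see which subgroup findings depend on the correction.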

Circularity Check

0 steps flagged

No circularity: empirical framework applied to observational data


The paper proposes a framework of unsupervised slice discovery plus factor-wise and intersectional analysis, then applies it to a large fetal ultrasound dataset to report observational associations (e.g., PS-linked performance gains). No derivation chain, first-principles prediction, or mathematical result is claimed that reduces by construction to fitted parameters, self-definitions, or prior self-citations. The 24% figure and PS-GA/BMI relations are data-driven outputs, not inputs renamed as predictions. Self-citations, if present, are not load-bearing for the central empirical claims. This is a standard non-circular empirical study.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The framework relies on standard assumptions from unsupervised learning and statistical analysis in medical imaging without introducing new free parameters or invented entities; the contribution is the integration and application rather than new foundational elements.

axioms (2)
  • domain assumption: Unsupervised slice discovery identifies subgroups that are relevant to performance disparities and bias.
    Invoked as the starting point for exploring hidden patterns without predefined labels.
  • domain assumption: Systematic factor-wise and intersectional analysis can separate the influences of acquisition parameters from demographic and clinical factors.
    Central premise required for the claim that confounding can be detected and partially explained.

