Fairness Beyond Demographics: Optimizing Performance Across Appearance-Based Hidden Cohorts in Medical Imaging

Cuong Nguyen; Gustavo Carneiro; Kevin Wells; Milad Masroor

arxiv: 2605.29827 · v1 · pith:SGBJOWG5new · submitted 2026-05-28 · 💻 cs.CV

Fairness Beyond Demographics: Optimizing Performance Across Appearance-Based Hidden Cohorts in Medical Imaging

Milad Masroor , Cuong Nguyen , Kevin Wells , Gustavo Carneiro This is my paper

Pith reviewed 2026-06-29 08:03 UTC · model grok-4.3

classification 💻 cs.CV

keywords fairnessmedical imaginghidden cohortslabel-free trainingappearance clusteringperformance disparitiesdemographic attributesunsupervised grouping

0 comments

The pith

Optimizing fairness across appearance-based image clusters cuts performance gaps on demographic attributes without using any demographic labels.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Medical imaging models show unequal accuracy across patient groups, and existing fixes focus on balancing known demographics such as sex or age. This creates problems when multiple demographics are considered together because subgroups become too small, and it misses other hidden sources of error. The paper proposes label-free hidden-cohort fairness training, which first groups images into K clusters based solely on visual appearance and then applies fairness constraints across those clusters. The method is tested on a new benchmark called HIDFairBench. Results show reduced disparities across both single and multiple demographic attributes even though no demographic information is provided during training.

Core claim

The label-free hidden-cohort fairness (LHCF) training paradigm clusters images into K appearance-based cohorts and optimizes fairness over these cohorts rather than over visible demographic attributes, yielding state-of-the-art reductions in performance disparities across single and multiple demographic attributes on HIDFairBench despite never using demographic labels.

What carries the argument

Label-free hidden-cohort fairness (LHCF) training paradigm, which discovers latent subpopulations through appearance-based clustering of images into K cohorts and then enforces fairness optimization across those cohorts.

If this is right

Fairness optimization becomes feasible for combinations of multiple demographic attributes without suffering from sparse subgroup data.
Models can be made more equitable across visible demographics even when those labels are unavailable at training time.
Latent sources of model error beyond known demographics become accessible for optimization.
The same training procedure applies uniformly to single-attribute and multi-attribute fairness settings.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The clustering step could be replaced by other unsupervised grouping methods to test whether appearance is the most effective signal for hidden cohorts.
If the discovered cohorts align with clinical variables not used in training, the method might surface new patient stratifications worth studying.
Deployment in hospitals would require checking whether the appearance clusters remain stable across different scanners and imaging protocols.

Load-bearing premise

Grouping images by visual appearance into a chosen number of cohorts will locate the main subpopulations that cause model errors and unfair performance rather than capturing imaging artifacts or other non-causal patterns.

What would settle it

Training a model with LHCF on a dataset and then observing that performance gaps on held-out demographic splits remain as large as or larger than those from standard training.

Figures

Figures reproduced from arXiv: 2605.29827 by Cuong Nguyen, Gustavo Carneiro, Kevin Wells, Milad Masroor.

**Figure 2.** Figure 2: CD diagrams comparing Classic and LHCF learning across all datasets [PITH_FULL_IMAGE:figures/full_fig_p007_2.png] view at source ↗

**Figure 3.** Figure 3: Sensitivity analysis of LHCF (FairDi+ResNet18 backbone) with respect [PITH_FULL_IMAGE:figures/full_fig_p007_3.png] view at source ↗

**Figure 4.** Figure 4: Sensitivity analysis of LHCF (FairDi) with respect to the backbone [PITH_FULL_IMAGE:figures/full_fig_p008_4.png] view at source ↗

**Figure 5.** Figure 5: DAC vs. LHCF (FairDi+ResNet18 backbone) on HAM10000 dataset. [PITH_FULL_IMAGE:figures/full_fig_p008_5.png] view at source ↗

read the original abstract

Medical image analysis models can exhibit performance disparities across patient subgroups, threatening clinical safety and fairness. Existing methods typically address this issue by optimizing accuracy and fairness metrics for visible demographic attributes (e.g., sex or age) considered in isolation. This strategy not only overlooks potentially more informative latent stratifications, which may reveal deeper sources of model error and inequity, but also fails to scale when multiple demographic attributes are considered simultaneously due to the resulting sparsity of training data within each subgroup. We deal with these issues by introducing the label-free hidden-cohort fairness (LHCF) training paradigm that instead of maximizing fairness over visible demographic attributes, it optimizes fairness across latent subpopulations discovered from image appearance. By clustering images into K appearance-based cohorts and applying fairness optimization over them, LHCF uncovers underlying sources of model error and avoids the combinatorial sparsity of multi-demographic attributes, reducing disparities across both single and multiple demographic attributes. We demonstrate on our proposed fairness benchmark, HIDFairBench, that LHCF provides state-of-the-art fairness results on single and multiple demographic attributes, despite never using demographic labels for training. Our results position hidden-cohort fairness as a practical, scalable, and robust alternative to demographic-based fairness optimization for trustworthy medical image analysis.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper's core idea—clustering images by appearance to create label-free cohorts for fairness optimization—directly targets the sparsity and label problems in demographic fairness, but the abstract gives no evidence that those clusters align with actual performance gaps rather than artifacts.

read the letter

The new piece here is the LHCF setup: instead of optimizing over explicit demographic buckets, it clusters training images into K appearance-based groups and runs fairness constraints across those groups. This removes the need for demographic labels at training time and avoids the combinatorial sparsity that comes with crossing multiple attributes. The HIDFairBench benchmark is also new and aimed at this setting.

That framing is useful for real clinical data where labels are patchy. It is a clear departure from the usual demographic-attribute DRO or reweighting work.

The soft spot is the missing link between the clusters and the disparities the method claims to fix. Appearance clustering can easily pick up scanner effects, contrast variations, or acquisition artifacts that have nothing to do with patient subgroups. The abstract reports SOTA fairness numbers on held-out demographic labels, but does not show that the discovered cohorts have higher error variance tied to those labels or that the clusters are stable when artifacts are perturbed. Without that check, the gains could be unrelated to the intended mechanism.

The paper is aimed at fairness researchers in medical imaging who are frustrated with label requirements. It is coherent on its own terms and engages the right literature, so it deserves a full referee to examine the clustering details, the choice of K, the exact fairness objective, and any validation that the cohorts track demographic error patterns.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces Label-free Hidden-Cohort Fairness (LHCF), which clusters medical images into K appearance-based cohorts via embeddings and applies group-fairness optimization (e.g., group DRO) over those cohorts rather than visible demographic labels. The central claim is that this yields state-of-the-art fairness on both single- and multi-attribute demographic disparities on the proposed HIDFairBench benchmark, without ever using demographic labels during training, while avoiding sparsity issues that arise when optimizing over multiple visible attributes simultaneously.

Significance. If the appearance-based cohorts reliably identify the subpopulations responsible for performance disparities, the approach would be significant: it offers a scalable, label-free alternative to demographic fairness methods and directly addresses the combinatorial sparsity problem for multiple attributes. The introduction of HIDFairBench as a dedicated fairness benchmark is also a positive contribution.

major comments (3)

[Abstract] Abstract and method description: the SOTA fairness claim on demographic attributes (single and multi-attribute) without demographic labels during training rests on the unverified assumption that appearance-based clusters align with the actual sources of demographic disparity rather than scanner artifacts, acquisition parameters, or other non-causal visual factors. No section demonstrates that the discovered cohorts exhibit higher error variance tied to held-out demographic labels or remain stable under semantic-preserving perturbations.
[Method (clustering and optimization)] The paper lists K as the sole free parameter yet provides no analysis of sensitivity to K or justification that the fairness gains are not artifacts of the particular choice of K; this directly affects the claim that the method is a robust alternative to demographic-based optimization.
[Experiments] No statistical significance tests, confidence intervals, or ablation on the clustering step are described that would confirm the reported fairness improvements are attributable to the hidden-cohort mechanism rather than the underlying model or optimization details.

minor comments (2)

[Abstract] The abstract refers to 'our proposed fairness benchmark, HIDFairBench' but does not specify dataset composition, cohort sizes, or how the benchmark ensures that demographic labels are strictly held out from training.
[Method] Notation for the fairness objective (e.g., the precise group DRO formulation over cohorts) is not introduced in the provided text, making it difficult to verify independence from demographic labels.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive and detailed comments. We respond to each major comment below and indicate where revisions will be made to address valid concerns.

read point-by-point responses

Referee: [Abstract] Abstract and method description: the SOTA fairness claim on demographic attributes (single and multi-attribute) without demographic labels during training rests on the unverified assumption that appearance-based clusters align with the actual sources of demographic disparity rather than scanner artifacts, acquisition parameters, or other non-causal visual factors. No section demonstrates that the discovered cohorts exhibit higher error variance tied to held-out demographic labels or remain stable under semantic-preserving perturbations.

Authors: We agree that the manuscript does not include explicit verification that the clusters align with demographic sources of disparity (as opposed to other visual factors) or demonstrate higher error variance tied to held-out labels and stability under perturbations. The current evidence consists of downstream improvements in demographic fairness metrics. In revision we will add analyses correlating cluster membership with held-out demographic attributes and evaluating cluster stability under semantic-preserving perturbations. revision: yes
Referee: [Method (clustering and optimization)] The paper lists K as the sole free parameter yet provides no analysis of sensitivity to K or justification that the fairness gains are not artifacts of the particular choice of K; this directly affects the claim that the method is a robust alternative to demographic-based optimization.

Authors: We concur that sensitivity analysis for K is required to substantiate robustness. The revised manuscript will include experiments across a range of K values showing that fairness gains remain consistent. revision: yes
Referee: [Experiments] No statistical significance tests, confidence intervals, or ablation on the clustering step are described that would confirm the reported fairness improvements are attributable to the hidden-cohort mechanism rather than the underlying model or optimization details.

Authors: The absence of statistical tests, confidence intervals, and clustering ablations is a valid observation. We will add these elements, including significance testing with intervals and an ablation isolating the contribution of the hidden-cohort optimization. revision: yes

Circularity Check

0 steps flagged

No circularity: method and claims are self-contained without reduction to inputs or self-citations

full rationale

The paper introduces LHCF by describing clustering of images into appearance-based cohorts followed by fairness optimization (e.g., group DRO) over those cohorts, explicitly without demographic labels. No equations, derivations, or self-citations appear in the provided text that would make any result equivalent to its inputs by construction. The benchmark HIDFairBench is proposed as an evaluation tool separate from the training procedure, and the SOTA claim rests on empirical demonstration rather than definitional equivalence. This satisfies the default expectation of a non-circular paper.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 1 invented entities

The central claim rests on the untested premise that visual appearance clusters capture the relevant fairness-relevant subpopulations; K is an explicit free parameter whose selection is not justified in the abstract.

free parameters (1)

K
Number of appearance-based cohorts; must be chosen before clustering and directly affects the fairness optimization groups.

axioms (1)

domain assumption Appearance-based image clustering will surface latent subpopulations that are more informative for fairness than visible demographic attributes.
Invoked when the paper states that clustering uncovers underlying sources of model error and inequity.

invented entities (1)

hidden cohorts no independent evidence
purpose: Latent subpopulations discovered from image appearance for fairness optimization.
New construct introduced as the core unit of the LHCF paradigm; no independent evidence of their causal link to model error is provided in the abstract.

pith-pipeline@v0.9.1-grok · 5755 in / 1308 out tokens · 25960 ms · 2026-06-29T08:03:11.146882+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

41 extracted references · 5 canonical work pages · 1 internal anchor

[1]

In MICCAI (2025), 594–603

Bissoto, A.et al.Subgroup Performance Analysis in Hidden Stratifications. In MICCAI (2025), 594–603

2025
[2]

In ACM FAccT (2018), 77–91

Buolamwini, J.et al.Gender shades: Intersectional accuracy disparities in com- mercial gender classification. In ACM FAccT (2018), 77–91

2018
[3]

Scientific Data10,123 (2023)

Cai, H.et al.An online mammography database with biopsy confirmed types. Scientific Data10,123 (2023)

2023
[4]

NeurIPS34, 22405–22418 (2021)

Cha, J.et al.Swad: Domain generalization by seeking flat minima. NeurIPS34, 22405–22418 (2021)

2021
[5]

In CVPR (2024), 12017–12026

Chakraborty, R.et al.Exmap: Leveraging explainability heatmaps for unsuper- vised group robustness to spurious correlations. In CVPR (2024), 12017–12026

2024
[6]

Christodoulou,E.et al.Confidenceintervalsuncovered:Arewereadyforreal-world medical imaging AI? In MICCAI (2024), 124–132

2024
[7]

In NESY (2024), 403–418

Collevati, M.et al.Leveraging Neurosymbolic AI for Slice Discovery. In NESY (2024), 403–418

2024
[8]

DeGrave, A.et al.AI for radiographic COVID-19 detection selects shortcuts over signal. Nat. Mach. Intell. 3, 610–619. 2021

2021
[9]

P.et al.Maximum likelihood from incomplete data via the EM algorithm

Dempster, A. P.et al.Maximum likelihood from incomplete data via the EM algorithm. JRSSSB (methodological)39,1–22 (1977)

1977
[10]

Domino: Discovering systematic errors with cross-modal embeddings.arXiv preprint arXiv:2203.14960, 2022

Eyuboglu, S.et al.Domino: Discovering systematic errors with cross-modal em- beddings. arXiv preprint arXiv:2203.14960 (2022)

work page arXiv 2022
[11]

In CVPR (2023), 19358–19369

Fang, Y.et al.Eva: Exploring the limits of masked visual representation learning at scale. In CVPR (2023), 19358–19369

2023
[12]

G.et al.The clinician and dataset shift in artificial intelligence

Finlayson, S. G.et al.The clinician and dataset shift in artificial intelligence. New England Journal of Medicine385,283–286 (2021)

2021
[13]

The use of ranks to avoid the assumption of normality implicit in the analysis of variance

Friedman, M. The use of ranks to avoid the assumption of normality implicit in the analysis of variance. Journal of the American Statistical Association32,675–701 (1937)

1937
[14]

In CVPR Workshop (2021), 1820– 1828

Groh, M.et al.Evaluating deep neural networks trained on clinical images in dermatology with the Fitzpatrick 17k dataset. In CVPR Workshop (2021), 1820– 1828

2021
[15]

In ICML (2018), 1929–1938

Hashimoto, T.et al.Fairness without demographics in repeated loss minimization. In ICML (2018), 1929–1938

2018
[16]

In CVPR (2016), 770– 778

He, K.et al.Deep residual learning for image recognition. In CVPR (2016), 770– 778

2016
[17]

IEEE TPAMI43,4037–4058 (2020)

Jing, L.et al.Self-supervised visual feature learning with deep neural networks: A survey. IEEE TPAMI43,4037–4058 (2020)

2020
[18]

M.et al.Distribution shift detection for the postmarket surveillance of medical AI algorithms: a retrospective simulation study

Koch, L. M.et al.Distribution shift detection for the postmarket surveillance of medical AI algorithms: a retrospective simulation study. NPJ Digital Medicine7, 120 (2024)

2024
[19]

In CVPR (2024), 12289–12301

Luo, Y.et al.FairCLIP: Harnessing fairness in vision-language learning. In CVPR (2024), 12289–12301

2024
[20]

Luo, Y.et al.FairVision: Equitable Deep Learning for Eye Disease Screening via Fair Identity Scaling. 2024. arXiv:2310.02492 [cs.CV].https://arxiv.org/abs/ 2310.02492

work page arXiv 2024
[21]

C.et al.Systematic outperformance of 112 dermatologists in multi- class skin cancer image classification by convolutional neural networks

Maron, R. C.et al.Systematic outperformance of 112 dermatologists in multi- class skin cancer image classification by convolutional neural networks. European Journal of Cancer119,57–65 (2019). 10 Masroor et al

2019
[22]

arXiv preprint arXiv:2411.11939 (2024)

Masroor, M.et al.Fair Distillation: Teaching Fairness from Biased Teachers in Medical Imaging. arXiv preprint arXiv:2411.11939 (2024)

work page arXiv 2024
[23]

Nemenyi,P.B.Distribution-freemultiplecomparisons.(PrincetonUniversity,1963)

1963
[24]

In CHIL (2020), 151–159

Oakden-Rayner, L.et al.Hidden stratification causes clinically meaningful failures in machine learning for medical imaging. In CHIL (2020), 151–159

2020
[25]

In FAIMI Workshop, MICCAI (2024)

Olesen, V.et al.Slicing through bias: Explaining performance gaps in medical im- age analysis using slice discovery methods. In FAIMI Workshop, MICCAI (2024)

2024
[26]

DINOv2: Learning Robust Visual Features without Supervision

Oquab, M.et al.Dinov2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193 (2023)

work page internal anchor Pith review Pith/arXiv arXiv 2023
[27]

arXiv preprint arXiv:2207.04104 (2022)

Plumb, G.et al.Towards a more rigorous science of blindspot discovery in image classification models. arXiv preprint arXiv:2207.04104 (2022)

work page arXiv 2022
[28]

PLoS medicine 15,e1002686 (2018)

Rajpurkar, P.et al.Deep learning for chest radiograph diagnosis: A retrospective comparison of the CheXNeXt algorithm to practicing radiologists. PLoS medicine 15,e1002686 (2018)

2018
[29]

In ICLR (2020)

Sagawa, S.et al.Distributionally robust neural networks for group shifts: On the importance of regularization for worst-case generalization. In ICLR (2020)

2020
[30]

Estimating the dimension of a model

Schwarz, G. Estimating the dimension of a model. The annals of statistics, 461–464 (1978)

1978
[31]

In CVPR (2022), 16742–16751

Seo, S.et al.Unsupervised learning of debiased representations with pseudo- attributes. In CVPR (2022), 16742–16751

2022
[32]

Nature medicine 27,2176–2182 (2021)

Seyyed-Kalantari, L.et al.Underdiagnosis bias of artificial intelligence algorithms applied to chest radiographs in under-served patient populations. Nature medicine 27,2176–2182 (2021)

2021
[33]

NeurIPS33,19339–19352 (2020)

Sohoni, N.et al.No subclass left behind: Fine-grained robustness in coarse-grained classification problems. NeurIPS33,19339–19352 (2020)

2020
[34]

In ECCV (2024)

Tian, Y.et al.FairDomain: Achieving Fairness in Cross-Domain Medical Image Segmentation and Classification. In ECCV (2024)

2024
[35]

In ICLR (2024)

Tian, Y.et al.FairSeg: A Large-Scale Medical Image Segmentation Dataset for Fairness Learning Using Segment Anything Model with Fair Error-Bound Scaling. In ICLR (2024)

2024
[36]

Scientific data5,1–9 (2018)

Tschandl, P.et al.The HAM10000 dataset, a large collection of multi-source der- matoscopic images of common pigmented skin lesions. Scientific data5,1–9 (2018)

2018
[37]

Vapnik, V. N. An overview of statistical learning theory. IEEE Trans. Neural Netw. 10,988–999 (1999)

1999
[38]

In EMNLP (2022), 3876–3887

Wang, Z.et al.Medclip: Contrastive learning from unpaired medical images and text. In EMNLP (2022), 3876–3887

2022
[39]

BMJ369(2020)

Wynants, L.et al.Prediction models for diagnosis and prognosis of covid-19: sys- tematic review and critical appraisal. BMJ369(2020)

2020
[40]

R.et al.Variable generalization performance of a deep learning model to detect pneumonia in chest radiographs: a cross-sectional study

Zech, J. R.et al.Variable generalization performance of a deep learning model to detect pneumonia in chest radiographs: a cross-sectional study. PLoS medicine15, e1002683 (2018)

2018
[41]

In ICLR (2023)

Zong, Y.et al.MEDFAIR: Benchmarking fairness for medical imaging. In ICLR (2023)

2023

[1] [1]

In MICCAI (2025), 594–603

Bissoto, A.et al.Subgroup Performance Analysis in Hidden Stratifications. In MICCAI (2025), 594–603

2025

[2] [2]

In ACM FAccT (2018), 77–91

Buolamwini, J.et al.Gender shades: Intersectional accuracy disparities in com- mercial gender classification. In ACM FAccT (2018), 77–91

2018

[3] [3]

Scientific Data10,123 (2023)

Cai, H.et al.An online mammography database with biopsy confirmed types. Scientific Data10,123 (2023)

2023

[4] [4]

NeurIPS34, 22405–22418 (2021)

Cha, J.et al.Swad: Domain generalization by seeking flat minima. NeurIPS34, 22405–22418 (2021)

2021

[5] [5]

In CVPR (2024), 12017–12026

Chakraborty, R.et al.Exmap: Leveraging explainability heatmaps for unsuper- vised group robustness to spurious correlations. In CVPR (2024), 12017–12026

2024

[6] [6]

Christodoulou,E.et al.Confidenceintervalsuncovered:Arewereadyforreal-world medical imaging AI? In MICCAI (2024), 124–132

2024

[7] [7]

In NESY (2024), 403–418

Collevati, M.et al.Leveraging Neurosymbolic AI for Slice Discovery. In NESY (2024), 403–418

2024

[8] [8]

DeGrave, A.et al.AI for radiographic COVID-19 detection selects shortcuts over signal. Nat. Mach. Intell. 3, 610–619. 2021

2021

[9] [9]

P.et al.Maximum likelihood from incomplete data via the EM algorithm

Dempster, A. P.et al.Maximum likelihood from incomplete data via the EM algorithm. JRSSSB (methodological)39,1–22 (1977)

1977

[10] [10]

Domino: Discovering systematic errors with cross-modal embeddings.arXiv preprint arXiv:2203.14960, 2022

Eyuboglu, S.et al.Domino: Discovering systematic errors with cross-modal em- beddings. arXiv preprint arXiv:2203.14960 (2022)

work page arXiv 2022

[11] [11]

In CVPR (2023), 19358–19369

Fang, Y.et al.Eva: Exploring the limits of masked visual representation learning at scale. In CVPR (2023), 19358–19369

2023

[12] [12]

G.et al.The clinician and dataset shift in artificial intelligence

Finlayson, S. G.et al.The clinician and dataset shift in artificial intelligence. New England Journal of Medicine385,283–286 (2021)

2021

[13] [13]

The use of ranks to avoid the assumption of normality implicit in the analysis of variance

Friedman, M. The use of ranks to avoid the assumption of normality implicit in the analysis of variance. Journal of the American Statistical Association32,675–701 (1937)

1937

[14] [14]

In CVPR Workshop (2021), 1820– 1828

Groh, M.et al.Evaluating deep neural networks trained on clinical images in dermatology with the Fitzpatrick 17k dataset. In CVPR Workshop (2021), 1820– 1828

2021

[15] [15]

In ICML (2018), 1929–1938

Hashimoto, T.et al.Fairness without demographics in repeated loss minimization. In ICML (2018), 1929–1938

2018

[16] [16]

In CVPR (2016), 770– 778

He, K.et al.Deep residual learning for image recognition. In CVPR (2016), 770– 778

2016

[17] [17]

IEEE TPAMI43,4037–4058 (2020)

Jing, L.et al.Self-supervised visual feature learning with deep neural networks: A survey. IEEE TPAMI43,4037–4058 (2020)

2020

[18] [18]

M.et al.Distribution shift detection for the postmarket surveillance of medical AI algorithms: a retrospective simulation study

Koch, L. M.et al.Distribution shift detection for the postmarket surveillance of medical AI algorithms: a retrospective simulation study. NPJ Digital Medicine7, 120 (2024)

2024

[19] [19]

In CVPR (2024), 12289–12301

Luo, Y.et al.FairCLIP: Harnessing fairness in vision-language learning. In CVPR (2024), 12289–12301

2024

[20] [20]

Luo, Y.et al.FairVision: Equitable Deep Learning for Eye Disease Screening via Fair Identity Scaling. 2024. arXiv:2310.02492 [cs.CV].https://arxiv.org/abs/ 2310.02492

work page arXiv 2024

[21] [21]

C.et al.Systematic outperformance of 112 dermatologists in multi- class skin cancer image classification by convolutional neural networks

Maron, R. C.et al.Systematic outperformance of 112 dermatologists in multi- class skin cancer image classification by convolutional neural networks. European Journal of Cancer119,57–65 (2019). 10 Masroor et al

2019

[22] [22]

arXiv preprint arXiv:2411.11939 (2024)

Masroor, M.et al.Fair Distillation: Teaching Fairness from Biased Teachers in Medical Imaging. arXiv preprint arXiv:2411.11939 (2024)

work page arXiv 2024

[23] [23]

Nemenyi,P.B.Distribution-freemultiplecomparisons.(PrincetonUniversity,1963)

1963

[24] [24]

In CHIL (2020), 151–159

Oakden-Rayner, L.et al.Hidden stratification causes clinically meaningful failures in machine learning for medical imaging. In CHIL (2020), 151–159

2020

[25] [25]

In FAIMI Workshop, MICCAI (2024)

Olesen, V.et al.Slicing through bias: Explaining performance gaps in medical im- age analysis using slice discovery methods. In FAIMI Workshop, MICCAI (2024)

2024

[26] [26]

DINOv2: Learning Robust Visual Features without Supervision

Oquab, M.et al.Dinov2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193 (2023)

work page internal anchor Pith review Pith/arXiv arXiv 2023

[27] [27]

arXiv preprint arXiv:2207.04104 (2022)

Plumb, G.et al.Towards a more rigorous science of blindspot discovery in image classification models. arXiv preprint arXiv:2207.04104 (2022)

work page arXiv 2022

[28] [28]

PLoS medicine 15,e1002686 (2018)

Rajpurkar, P.et al.Deep learning for chest radiograph diagnosis: A retrospective comparison of the CheXNeXt algorithm to practicing radiologists. PLoS medicine 15,e1002686 (2018)

2018

[29] [29]

In ICLR (2020)

Sagawa, S.et al.Distributionally robust neural networks for group shifts: On the importance of regularization for worst-case generalization. In ICLR (2020)

2020

[30] [30]

Estimating the dimension of a model

Schwarz, G. Estimating the dimension of a model. The annals of statistics, 461–464 (1978)

1978

[31] [31]

In CVPR (2022), 16742–16751

Seo, S.et al.Unsupervised learning of debiased representations with pseudo- attributes. In CVPR (2022), 16742–16751

2022

[32] [32]

Nature medicine 27,2176–2182 (2021)

Seyyed-Kalantari, L.et al.Underdiagnosis bias of artificial intelligence algorithms applied to chest radiographs in under-served patient populations. Nature medicine 27,2176–2182 (2021)

2021

[33] [33]

NeurIPS33,19339–19352 (2020)

Sohoni, N.et al.No subclass left behind: Fine-grained robustness in coarse-grained classification problems. NeurIPS33,19339–19352 (2020)

2020

[34] [34]

In ECCV (2024)

Tian, Y.et al.FairDomain: Achieving Fairness in Cross-Domain Medical Image Segmentation and Classification. In ECCV (2024)

2024

[35] [35]

In ICLR (2024)

Tian, Y.et al.FairSeg: A Large-Scale Medical Image Segmentation Dataset for Fairness Learning Using Segment Anything Model with Fair Error-Bound Scaling. In ICLR (2024)

2024

[36] [36]

Scientific data5,1–9 (2018)

Tschandl, P.et al.The HAM10000 dataset, a large collection of multi-source der- matoscopic images of common pigmented skin lesions. Scientific data5,1–9 (2018)

2018

[37] [37]

Vapnik, V. N. An overview of statistical learning theory. IEEE Trans. Neural Netw. 10,988–999 (1999)

1999

[38] [38]

In EMNLP (2022), 3876–3887

Wang, Z.et al.Medclip: Contrastive learning from unpaired medical images and text. In EMNLP (2022), 3876–3887

2022

[39] [39]

BMJ369(2020)

Wynants, L.et al.Prediction models for diagnosis and prognosis of covid-19: sys- tematic review and critical appraisal. BMJ369(2020)

2020

[40] [40]

R.et al.Variable generalization performance of a deep learning model to detect pneumonia in chest radiographs: a cross-sectional study

Zech, J. R.et al.Variable generalization performance of a deep learning model to detect pneumonia in chest radiographs: a cross-sectional study. PLoS medicine15, e1002683 (2018)

2018

[41] [41]

In ICLR (2023)

Zong, Y.et al.MEDFAIR: Benchmarking fairness for medical imaging. In ICLR (2023)

2023