People-Centred Medical Image Analysis

Cuong Nguyen; David Rosewarne; Gustavo Carneiro; Kevin Wells; Milad Masroor; Tahir Hassan; Thanh-Toan Do; Yuanhong Chen; Zheng Zhang

arxiv: 2604.26991 · v1 · submitted 2026-04-28 · 💻 cs.LG · cs.AI

Zheng Zhang, Milad Masroor, Cuong Nguyen, Tahir Hassan, Yuanhong Chen, David Rosewarne, Kevin Wells, Thanh-Toan Do, Gustavo Carneiro

Pith reviewed 2026-05-07 16:30 UTC · model grok-4.3

keywords: medical image analysis · AI fairness · human-AI collaboration · workflow integration · clinical adoption · dynamic gating · benchmark

The pith

PecMan uses a dynamic gating mechanism to jointly optimize fairness, accuracy, and clinician workload in medical image analysis.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Medical AI systems often achieve high accuracy but struggle with adoption due to biases across patient groups and poor fit with clinical routines. The paper contends that addressing fairness and human collaboration together, rather than separately, under the constraint of limited clinician time, can create more viable tools. It introduces PecMan, a framework with a gating system that decides case assignments, along with a benchmark to measure the balance of these factors. Experiments indicate this integrated approach outperforms methods that handle the issues in isolation.

Core claim

The central claim is that a people-centred approach to medical image analysis achieves better combined accuracy, fairness across diverse populations, and workflow integration than prior methods that address these goals separately. PecMan implements this approach through dynamic gating that routes each case to AI, a human clinician, or joint review while respecting workload limits.

What carries the argument

The dynamic gating mechanism within PecMan, which assigns each medical image case to AI alone, clinician alone, or both, subject to overall clinician availability constraints, while pursuing joint optimization of diagnostic accuracy and fairness.
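To make the routing idea concrete, here is a minimal sketch of budget-constrained case assignment. It is not the authors' gating model, which is learned; it swaps in a fixed greedy rule purely for illustration. The names `Case`, `route_cases`, `scores`, and `clinician_budget` are hypothetical, and the sketch assumes that each case carries an estimated benefit per route and that any route involving a clinician costs one unit of clinician time.

```python
from dataclasses import dataclass
from typing import Dict, List

# Routes that consume clinician time (assumption: one unit each).
CLINICIAN_ROUTES = ("clinician", "both")


@dataclass
class Case:
    case_id: str
    # Hypothetical gate outputs: estimated benefit of each route,
    # keyed by "ai", "clinician", and "both".
    scores: Dict[str, float]


def best_clinician_route(case: Case):
    """Return the better clinician-involving route and its gain over AI alone."""
    route = max(CLINICIAN_ROUTES, key=lambda r: case.scores[r])
    return route, case.scores[route] - case.scores["ai"]


def route_cases(cases: List[Case], clinician_budget: int) -> Dict[str, str]:
    """Greedy routing: every case defaults to AI alone, then the cases that
    gain most from clinician involvement are upgraded until the budget is spent."""
    assignments = {c.case_id: "ai" for c in cases}
    candidates = []
    for c in cases:
        route, gain = best_clinician_route(c)
        if gain > 0:  # only involve a clinician when it is expected to help
            candidates.append((gain, c.case_id, route))
    candidates.sort(key=lambda t: t[0], reverse=True)
    for _, case_id, route in candidates[: max(clinician_budget, 0)]:
        assignments[case_id] = route
    return assignments


cases = [
    Case("a", {"ai": 0.90, "clinician": 0.80, "both": 0.85}),
    Case("b", {"ai": 0.50, "clinician": 0.90, "both": 0.70}),
    Case("c", {"ai": 0.60, "clinician": 0.70, "both": 0.95}),
    Case("d", {"ai": 0.40, "clinician": 0.50, "both": 0.45}),
]
print(route_cases(cases, clinician_budget=2))
```

Raising `clinician_budget` lets more cases reach a clinician, and setting it to zero reduces the sketch to an AI-only system. A learned gate would replace the fixed `scores` with model outputs and could also fold a fairness term into the ranking.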

If this is right

Performance biases that hinder regulatory approval can be mitigated by explicit fairness optimization.
Clinician adoption increases when AI does not disrupt established workflows or overload staff.
Trade-offs between the three goals can be quantified and managed using the FairHAI benchmark.
The framework demonstrates consistent gains over methods optimizing only subsets of these objectives.
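As an illustration of how such trade-offs might be quantified, the sketch below computes an overall AUC, an equity-scaled AUC (ES-AUC), and coverage. The page names ES-AUC and coverage in the figure captions but does not define them, so the definitions here are assumptions: ES-AUC is taken as the overall AUC divided by one plus the summed absolute gaps between overall and per-group AUC, and coverage is taken as the fraction of cases handled by AI alone. The function names are hypothetical.

```python
def auc(labels, scores):
    """AUC as the probability that a positive case outscores a negative one
    (ties count as half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def es_auc(labels, scores, groups):
    """Assumed definition: overall AUC / (1 + sum over groups of |overall - group AUC|)."""
    overall = auc(labels, scores)
    gap = 0.0
    for g in sorted(set(groups)):
        idx = [i for i, a in enumerate(groups) if a == g]
        group_auc = auc([labels[i] for i in idx], [scores[i] for i in idx])
        gap += abs(overall - group_auc)
    return overall / (1.0 + gap)


def coverage(routes):
    """Assumed definition: fraction of cases decided by AI alone."""
    return sum(r == "ai" for r in routes) / len(routes)


labels = [1, 0, 1, 0, 1, 0, 1, 0]
scores = [0.9, 0.1, 0.8, 0.2, 0.6, 0.7, 0.4, 0.3]
groups = ["female"] * 4 + ["male"] * 4
print(auc(labels, scores), es_auc(labels, scores, groups))
print(coverage(["ai", "both", "ai", "clinician"]))
```

Under this definition ES-AUC equals the overall AUC only when every group matches it, so a model that is accurate on average but uneven across groups is penalised. Sweeping the workload budget and recording ES-AUC at each coverage level would produce curves of the kind the figure captions describe.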

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Applying similar gating logic could help in other high-stakes domains with scarce expert time.
Real-world deployment might require adapting the workload model to specific hospital schedules and team structures.
The benchmark provides a template for testing other human-AI systems on fairness and integration metrics simultaneously.

Load-bearing premise

The assumption that clinician availability can be modeled as a simple dynamic constraint that captures real clinical settings without overlooking workflow disruptions or introducing new barriers.

What would settle it

If a study in an actual clinic finds that using PecMan results in lower overall diagnostic quality or higher clinician burnout than using separate fairness and deferral tools, the joint optimization benefit would be falsified.

Figures

Figures reproduced from arXiv: 2604.26991 by Cuong Nguyen, David Rosewarne, Gustavo Carneiro, Kevin Wells, Milad Masroor, Tahir Hassan, Thanh-Toan Do, Yuanhong Chen, Zheng Zhang.

**Figure 1.** PecMan: A unified framework that jointly optimises fairness and human-AI collaboration. A gating mechanism selects the appropriate cohort-specific AI model and determines whether clinician input is needed, ensuring high accuracy, balanced group performance, and adherence to workload constraints. Although AI fairness, L2D, and L2C all aim to improve AI-assisted medical decision-making, they have traditiona…

**Figure 2.** Step 0 – Backbone Training: PecMan initialises its backbone model using the FIS loss [82], which jointly optimises overall classification accuracy and fairness across patient groups. Here the groups are defined by the sensitive attribute sex, with values “male” and “female”.

**Figure 3.** Step 1 – Group-specific Model Training: This step focuses on training classifiers tailored to individual patient cohorts, enabling fairness-aware performance across demographic groups.

weights are defined as follows:

$$
s^{I}(x, y, B) = \frac{\exp\big(\ell_{\mathrm{BCE}}(h_{\phi}(f_{\theta}(x)), y)\big)}{\sum_{(\tilde{x}, \tilde{M}, \tilde{y}, \tilde{a}) \in B} \exp\big(\ell_{\mathrm{BCE}}(h_{\phi}(f_{\theta}(\tilde{x})), \tilde{y})\big)},
\qquad
s^{G}(a, B) = \frac{\exp\big(\mathrm{DOT}(L(B), L_{a}(B))\big)}{\sum_{j \in A} \exp\big(\mathrm{DOT}(L(B), L_{j}(B))\big)},
\tag{3}
$$

where $\mathrm{DOT}(L(B), L_{a}(B))$ is the optim…

**Figure 4.** Step 2 – L2D+L2C Unbiased Training: PecMan trains the gating and consolidator models using the FIS loss, enabling unbiased decision-making that combines L2D and L2C strategies.

**Figure 5.** The AUC vs. coverage (top row) and ES-AUC vs. coverage (bottom row) of com…

**Figure 6.** Performance analysis of PecMan on the testing samples of HAM10000. (a) The…

**Figure 7.** The cohort-specific AUC (a,b), overall AUC (c), and ES-AUC vs. coverage…

**Figure 8.** Training time of PecMan and competing methods on the HAM10000 dataset.

**Figure 9.** Inference time of PecMan and competing methods on the HAM10000 dataset.
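The weights in Eq. (3), quoted in the Figure 3 caption, are both softmax normalisations: the instance weight is a softmax over per-sample BCE losses within a batch, and the group weight is a softmax over per-group DOT values. The sketch below shows that arithmetic only. The caption is truncated before DOT is defined, so DOT values are treated as precomputed inputs rather than reproduced, and the function names and example numbers are hypothetical.

```python
import math


def bce(p, y, eps=1e-12):
    """Binary cross-entropy for a predicted probability p and a label y in {0, 1}."""
    p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))


def softmax(values):
    """Softmax with the usual max subtraction for numerical stability."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def instance_weights(probs, labels):
    """s^I: softmax over per-sample BCE losses in one batch.
    probs stands in for the classifier outputs h_phi(f_theta(x))."""
    return softmax([bce(p, y) for p, y in zip(probs, labels)])


def group_weights(dot_by_group):
    """s^G: softmax over per-group DOT values, supplied precomputed
    because the DOT definition is cut off in the caption."""
    names = sorted(dot_by_group)
    return dict(zip(names, softmax([dot_by_group[g] for g in names])))


# Hypothetical batch: three positive cases with decreasing model confidence.
w_inst = instance_weights(probs=[0.9, 0.6, 0.2], labels=[1, 1, 1])
w_group = group_weights({"female": 0.3, "male": 0.1})
print(w_inst, w_group)
```

Because the exponent grows with the loss, samples the model fits poorly receive larger instance weights, and groups with larger DOT values receive larger group weights. How these weights enter the FIS loss is not stated on this page.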

Original abstract

Recent advances in data-centric medical AI have produced highly accurate diagnostic systems, but the emphasis on data curation and performance metrics has not translated into widespread clinical adoption. We conjecture that this limited uptake stems from insufficient attention dedicated to the optimisation of fair performance across diverse patient populations and to workflow integration: performance biases can create regulatory barriers, and poorly integrated automation can disrupt clinical routines, degrade the quality of human-AI collaboration, and reduce clinicians' willingness to adopt AI tools. Prior work on workflow integration (e.g., Learning to Defer (L2D) and Learning to Complement (L2C)) and AI fairness has typically examined these challenges in isolation, overlooking their natural interdependence and the practical constraints of clinical environments, such as restricted clinician availability. We propose People-Centred Medical Image Analysis (PecMan), a human-AI framework that jointly optimises fairness, diagnostic accuracy, and workflow effectiveness through a dynamic gating mechanism that assigns cases to AI, clinicians, or both under clinician workload constraints. We also introduce the Fairness and Human-Centred AI (FairHAI) benchmark for evaluating trade-offs between accuracy, fairness, and clinician workload. Experiments using this benchmark show that PecMan consistently outperforms existing methods, paving the way for more trustworthy and clinically viable AI systems. Code will be available upon paper acceptance.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The paper proposes People-Centred Medical Image Analysis (PecMan), a human-AI framework that jointly optimizes fairness, diagnostic accuracy, and workflow effectiveness through a dynamic gating mechanism that assigns cases to AI, clinicians, or both under clinician workload constraints. It introduces the FairHAI benchmark for evaluating trade-offs between accuracy, fairness, and clinician workload, and reports that experiments show PecMan consistently outperforms existing methods.

Significance. If the results hold and the modeled constraints align with real clinical environments, this work would be significant in advancing clinically viable medical AI by addressing the interdependence of fairness and workflow integration, areas previously studied in isolation. The FairHAI benchmark could serve as a useful tool for future research in human-centred AI.

Major comments (1)

The central claim that PecMan outperforms baselines on FairHAI depends on the dynamic gating jointly optimizing under a modeled clinician availability constraint. However, this treats availability as a clean resource allocation problem, while real clinical settings introduce unmodeled factors including communication costs, decision latency, EHR integration friction, and variable case complexity that could invert the trade-offs. Without validation that the synthetic constraint matches observed clinical logs, the outperformance does not establish clinical viability.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their insightful comments on our manuscript. We address the major concern point by point below, acknowledging the limitations of our modeled constraints while clarifying the scope of our claims.

Point-by-point responses

Referee: The central claim that PecMan outperforms baselines on FairHAI depends on the dynamic gating jointly optimizing under a modeled clinician availability constraint. However, this treats availability as a clean resource allocation problem, while real clinical settings introduce unmodeled factors including communication costs, decision latency, EHR integration friction, and variable case complexity that could invert the trade-offs. Without validation that the synthetic constraint matches observed clinical logs, the outperformance does not establish clinical viability.

Authors: We agree that the clinician availability constraint in PecMan and FairHAI is modeled as a simplified resource allocation problem and does not incorporate additional real-world factors such as communication costs, decision latency, EHR integration friction, and variable case complexity. The FairHAI benchmark is a controlled, synthetic environment intended to isolate and evaluate the effects of joint optimization of fairness, accuracy, and workflow under workload constraints. Our central claim is limited to outperformance within this benchmark; we do not assert that the results establish clinical viability. In the revised manuscript, we will expand the limitations and discussion sections to explicitly address these unmodeled factors, analyze how they could alter the observed trade-offs, and propose directions for empirical validation against clinical logs and real workflow data.

Revision: yes

Circularity Check

0 steps flagged

No circularity in derivation chain; framework and benchmark are independently proposed

Full rationale

The paper introduces PecMan as a joint optimization framework via dynamic gating under workload constraints, together with the FairHAI benchmark. Its performance claims rest on experimental comparisons rather than on any closed-form derivation, fitted parameter renamed as a prediction, or self-referential definition. No equations, ansatzes, or uniqueness theorems are presented in the provided text that reduce to inputs by construction. Prior work on L2D/L2C is cited as external context, and no self-citation carries the central claim. The derivation chain is self-contained as a proposal validated by new experiments.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only view yields no explicit free parameters, axioms, or invented entities; the framework description implies unstated modeling choices for gating and workload but provides no details for auditing.

pith-pipeline@v0.9.0 · 5554 in / 1025 out tokens · 48759 ms · 2026-05-07T16:30:00.962720+00:00 · methodology

Reference graph

Works this paper leans on

92 extracted references · 6 canonical work pages

[1] M. Li, Y. Jiang, Y. Zhang, H. Zhu, Medical image analysis using deep learning algorithms, Frontiers in Public Health 11 (2023) 1273253
[2] The Royal College of Radiologists, Clinical radiology workforce census 2023, Tech. rep., The Royal College of Radiologists (2023)
[3] E. J. Topol, High-performance medicine: The convergence of human and artificial intelligence, Nature Medicine 25 (1) (2019) 44–56
[4] M. M. Abuzaid, W. Elshami, H. Tekin, B. Issa, Assessment of the willingness of radiologists and radiographers to accept the integration of artificial intelligence into radiology practice, Academic Radiology 29 (1) (2022) 87–94
[5] A. Derevianko, S. F. M. Pizzoli, F. Pesapane, A. Rotili, D. Monzani, R. Grasso, E. Cassano, G. Pravettoni, The use of artificial intelligence (AI) in the radiology field: What is the state of doctor–patient communication in cancer diagnosis?, Cancers 15 (2) (2023) 470
[6] S. Zhang, Y. Li, W. Liu, Q. Chu, Y. Chen, A decade of review in global regulation and research of artificial intelligence medical devices (2015-2025), Frontiers in Medicine 12 (2025) 1630408
[7] C. Jones, J. Thornton, J. C. Wyatt, Artificial intelligence and clinical decision support: clinicians’ perspectives on trust, trustworthiness, and liability, Medical Law Review 31 (4) (2023) 501–520
[8] E. Kumah, Artificial intelligence in healthcare and its implications for patient centered care, Discover Public Health 22 (1) (2025) 524
[9] S. S. Jain, S. Goto, J. L. Hall, S. S. Khan, C. A. MacRae, C. Ofori, C. Pegus, M. Pencina, E. D. Peterson, L. H. Schwamm, et al., Pragmatic approaches to the evaluation and monitoring of artificial intelligence in health care: A science advisory from the American Heart Association, Circulation 152 (23) (2025) e433–e442
[10] E. U. Alum, O. P.-C. Ugwu, Artificial intelligence in personalized medicine: transforming diagnosis and treatment, Discover Applied Sciences 7 (3) (2025) 193
[11] L. A. Celi, J. Cellini, M.-L. Charpignon, E. C. Dee, F. Dernoncourt, R. Eber, W. G. Mitchell, L. Moukheiber, J. Schirmer, J. Situ, et al., Sources of bias in artificial intelligence that perpetuate healthcare disparities - a global review, PLOS Digital Health 1 (3) (2022) e0000022
[12] L. Oakden-Rayner, J. Dunnmon, G. Carneiro, C. Ré, Hidden stratification causes clinically meaningful failures in machine learning for medical imaging, in: ACM CHIL, 2020, pp. 151–159
[13] M. A. Ricci Lara, R. Echeveste, E. Ferrante, Addressing fairness in artificial intelligence for medical imaging, Nature Communications 13 (1) (2022) 4581
[14] D. Madras, T. Pitassi, R. Zemel, Predict responsibly: Improving fairness and accuracy by learning to defer, in: NeurIPS, Vol. 31, 2018
[15] B. Wilder, E. Horvitz, E. Kamar, Learning to complement humans, in: International Joint Conference on Artificial Intelligence, 2021
[16] Y. Zong, Y. Yang, T. Hospedales, MEDFAIR: Benchmarking fairness for medical imaging, in: ICLR, 2023
[17] T. Iqbal, M. Masud, B. Amin, C. Feely, M. Faherty, T. Jones, M. Tierney, A. Shahzad, P. Vazquez, Towards integration of artificial intelligence into medical devices as a real-time recommender system for personalised healthcare: State-of-the-art and future prospects, Health Sciences Review (2024)
[18] N. Quadrianto, V. Sharmanska, O. Thomas, Discovering fair representations in the data domain, in: CVPR, 2019, pp. 8227–8236
[19] Y. Zhang, J. Sang, Towards accuracy-fairness paradox: Adversarial example-based data augmentation for visual debiasing, in: ACM Multimedia, 2020
[20] V. V. Ramaswamy, S. S. Kim, O. Russakovsky, Fair attribute classification through latent space de-biasing, in: CVPR, 2021, pp. 9301–9310
[21] S. Park, J. Lee, P. Lee, S. Hwang, D. Kim, H. Byun, Fair contrastive learning for facial attribute classification, in: CVPR, 2022, pp. 10389–10398
[22] Y. Roh, K. Lee, S. Whang, C. Suh, Fr-train: A mutual information-based approach to fair and robust training, in: ICML, PMLR, 2020, pp. 8147–8157
[23] M. B. Zafar, I. Valera, M. G. Rodriguez, K. P. Gummadi, Fairness constraints: Mechanisms for fair classification, in: AISTATS, PMLR, 2017, pp. 962–970
[24] B. H. Zhang, B. Lemoine, M. Mitchell, Mitigating unwanted biases with adversarial learning, in: AIES, 2018, pp. 335–340
[25] Z. Wang, X. Dong, H. Xue, Z. Zhang, W. Chiu, T. Wei, K. Ren, Fairness-aware adversarial perturbation towards bias mitigation for deployed deep models, in: CVPR, 2022, pp. 10379–10388
[26] M. P. Kim, A. Ghorbani, J. Zou, Multiaccuracy: Black-box post-processing for fairness in classification, in: AIES, 2019, pp. 247–254
[27] J. Herington, M. D. McCradden, K. Creel, R. Boellaard, E. C. Jones, A. K. Jha, A. Rahmim, P. J. Scott, J. J. Sunderland, R. L. Wahl, et al., Ethical considerations for artificial intelligence in medical imaging: deployment and governance, Journal of Nuclear Medicine 64 (10) (2023) 1509–1515
[28] Z. Obermeyer, B. Powers, C. Vogeli, S. Mullainathan, Dissecting racial bias in an algorithm used to manage the health of populations, Science 366 (6464) (2019)
[29] A. J. Larrazabal, N. Nieto, V. Peterson, D. H. Milone, E. Ferrante, Gender imbalance in medical imaging datasets produces biased classifiers for computer-aided diagnosis, National Academy of Sciences 117 (23) (2020) 12592–12594
[30] V. C. Nitesh, SMOTE: Synthetic minority over-sampling technique, Journal of Artificial Intelligence Research 16 (1) (2002) 321
[31] G. Pleiss, M. Raghavan, F. Wu, J. Kleinberg, K. Q. Weinberger, On fairness and calibration, in: NeurIPS, Vol. 30, 2017
[32] B. Kim, H. Kim, K. Kim, S. Kim, J. Kim, Learning not to learn: Training deep neural networks with biased data, in: CVPR, 2019, pp. 9012–9020
[33] D. Madras, E. Creager, T. Pitassi, R. Zemel, Learning adversarially fair and transferable representations, in: ICML, PMLR, 2018, pp. 3384–3393
[34] H. Zhao, A. Coston, T. Adel, G. J. Gordon, Conditional learning of fair representations, arXiv:1910.07162 (2019)
[35] S. Sagawa, P. W. Koh, T. B. Hashimoto, P. Liang, Distributionally robust neural networks for group shifts: On the importance of regularization for worst-case generalization, in: ICLR, 2020
[36] J. Cha, S. Chun, K. Lee, H.-C. Cho, S. Park, Y. Lee, S. Park, Swad: Domain generalization by seeking flat minima, in: NeurIPS, Vol. 34, 2021, pp. 22405–22418
[37] E. Tartaglione, C. A. Barbano, M. Grangetto, End: Entangling and disentangling deep representations for bias correction, in: CVPR, 2021, pp. 13508–13517
[38] M. H. Sarhan, N. Navab, A. Eslami, S. Albarqouni, Fairness by learning orthogonal disentangled representations, in: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXIX 16, Springer, 2020, pp. 746–761
[39] Y. Tian, M. Shi, Y. Luo, A. Kouhana, T. Elze, M. Wang, Fairseg: A large-scale medical image segmentation dataset for fairness learning using segment anything model with fair error-bound scaling, in: ICLR, 2024
[40] Y. Luo, M. Shi, M. O. Khan, M. M. Afzal, H. Huang, S. Yuan, Y. Tian, L. Song, A. Kouhana, T. Elze, et al., FairCLIP: Harnessing fairness in vision-language learning, in: CVPR, 2024, pp. 12289–12301
[42] A. Rosenfeld, M. D. Solbach, J. K. Tsotsos, Totally looks like-how humans compare, compared to machines, in: IEEE Conf. Comput. Vis. Pattern Recog. Worksh., 2018, pp. 1961–1964
[43] T. Serre, Deep learning: the good, the bad, and the ugly, Annual Review of Vision Science 5 (2019) 399–426
[44] E. Kamar, S. Hacker, E. Horvitz, Combining human and machine intelligence in large-scale crowdsourcing, in: International Conference on Autonomous Agents and Multiagent Systems, Vol. 12, 2012, pp. 467–474
[45] E. K. Chiou, J. D. Lee, Trusting automation: Designing for responsivity and resilience, Human Factors 65 (1) (2023) 137–165
[46] Z. Lu, M. Yin, Human reliance on machine learning models when performance feedback is limited: Heuristics and risks, in: CHI Conference on Human Factors in Computing Systems, 2021, pp. 1–16
[47] M. Yin, J. Wortman Vaughan, H. Wallach, Understanding the effect of accuracy on trust in machine learning models, in: CHI Conference on Human Factors in Computing Systems, 2019, pp. 1–12
[48] D. Shin, The effects of explainability and causability on perception, trust, and acceptance: Implications for explainable AI, International Journal of Human-Computer Studies 146 (2021) 102551
[49] K. Weitz, D. Schiller, R. Schlagowski, T. Huber, E. André, "Do you trust me?" Increasing user-trust by integrating virtual agents in explainable AI interaction design, in: ACM International Conference on Intelligent Virtual Agents, 2019, pp. 7–9
[50] G. Bansal, B. Nushi, E. Kamar, E. Horvitz, D. S. Weld, Is the most accurate AI the best teammate? Optimizing AI for teamwork, in: AAAI Conference on Artificial Intelligence, Vol. 35(13), 2021, pp. 11405–11414
[51] N. Agarwal, A. Moehring, P. Rajpurkar, T. Salz, Combining human expertise with artificial intelligence: experimental evidence from radiology, Tech. rep., National Bureau of Economic Research (2023)
[52] K. Vodrahalli, R. Daneshjou, T. Gerstenberg, J. Zou, Do humans trust advice more if it comes from AI? An analysis of human-AI interactions, in: AIES, 2022, pp. 763–777
[53] X. Wu, L. Xiao, Y. Sun, J. Zhang, T. Ma, L. He, A survey of human-in-the-loop for machine learning, Future Generation Computer Systems 135 (C) (2022) 364–381. doi:10.1016/j.future.2022.05.014. URL https://doi.org/10.1016/j.future.2022.05.014
[54] V. Keswani, M. Lease, K. Kenthapadi, Towards unbiased and accurate deferral to multiple experts, in: AIES, 2021, pp. 154–165
[55] H. Narasimhan, W. Jitkrittum, A. K. Menon, A. Rawat, S. Kumar, Post-hoc estimators for learning to defer to an expert, in: NeurIPS, Vol. 35, 2022
[56] A. Mao, C. Mohri, M. Mohri, Y. Zhong, Two-stage learning to defer with multiple experts, in: NeurIPS, 2023
[57] Z. Zhang, C. Nguyen, K. Wells, T.-T. Do, D. Rosewarne, G. Carneiro, Coverage-constrained human-AI cooperation with multiple experts, in: AAAI, 2026
[58] C. Cortes, G. DeSalvo, M. Mohri, Learning with rejection, in: ALT, Springer, 2016
[59] N. Charoenphakdee, Z. Cui, Y. Zhang, M. Sugiyama, Classification with rejection based on cost-sensitive classification, in: ICML, PMLR, 2021, pp. 1507–1517
[60] M. Raghu, K. Blumer, G. Corrado, J. Kleinberg, Z. Obermeyer, S. Mullainathan, The algorithmic automation problem: Prediction, triage, and human effort, in: Machine Learning for Health Symposium, 2018
[61] N. Okati, A. De, M. Rodriguez, Differentiable learning under triage 34 (2021) 9140–9151
[62] H. Mozannar, D. Sontag, Consistent estimators for learning to defer to an expert, in: ICML, PMLR, 2020, pp. 7076–7087
[63] R. Verma, E. Nalisnick, Calibrated learning to defer with one-vs-all classifiers, in: ICML, PMLR, 2022, pp. 22184–22202
[64] H. Mozannar, H. Lang, D. Wei, P. Sattigeri, S. Das, D. Sontag, Who should predict? Exact algorithms for learning to defer to humans, in: AISTATS, PMLR, 2023, pp. 10520–10545
[65] M.-A. Charusaie, H. Mozannar, D. Sontag, S. Samadi, Sample efficient learning of predictors that complement humans, in: ICML, PMLR, 2022, pp. 2972–3005
[66] Y. Cao, H. Mozannar, L. Feng, H. Wei, B. An, In defense of softmax parametrization for calibrated and consistent learning to defer, in: NeurIPS, Vol. 36, 2024
[67] E. Straitouri, L. Wang, N. Okati, M. G. Rodriguez, Improving expert predictions with conformal prediction, in: ICML, PMLR, 2023, pp. 32633–32653
[68] S. Liu, Y. Cao, Q. Zhang, L. Feng, B. An, Mitigating underfitting in learning to defer with consistent losses, in: AISTATS, 2024
[69] H. Mozannar, A. Satyanarayan, D. Sontag, Teaching humans when to defer to a classifier via exemplars, in: AAAI Conference on Artificial Intelligence, Vol. 36, 2022, pp. 5323–5331
[70] R. Verma, D. Barrejón, E. Nalisnick, On the calibration of learning to defer to multiple experts, in: ICML Workshop on HMCT, 2022
[71] R. Verma, D. Barrejón, E. Nalisnick, Learning to defer to multiple experts: Consistent surrogate losses, confidence calibration, and conformal ensembles, in: AISTATS, PMLR, 2023, pp. 11415–11434
[72] V. Babbar, U. Bhatt, A. Weller, On the utility of prediction sets in human-AI teams, in: International Joint Conference on Artificial Intelligence, 2022
[73] A. Mao, M. Mohri, Y. Zhong, Principled approaches for learning to defer with multiple experts, in: International Symposium on Artificial Intelligence and Mathematics, 2024
[74] P. Hemmer, L. Thede, M. Vössing, J. Jakubik, N. Kühl, Learning to defer with limited expert predictions, in: AAAI Conference on Artificial Intelligence, Vol. 37, 2023, pp. 6002–6011
[75] D. Tailor, A. Patra, R. Verma, P. Manggala, E. Nalisnick, Learning to defer to a population: A meta-learning approach, in: AISTATS, 2024
[76] D. Leitão, P. Saleiro, M. A. Figueiredo, P. Bizarro, Human-AI collaboration in decision-making: Beyond learning to defer, in: ICML Workshop on Human-Machine Collaboration and Teaming, 2022
[77] M. Steyvers, H. Tejeda, G. Kerrigan, P. Smyth, Bayesian modeling of human–AI complementarity, National Academy of Sciences 119 (11) (2022) e2111547119
[78] G. Kerrigan, P. Smyth, M. Steyvers, Combining human predictions with model probabilities via confusion matrices and calibration, in: NeurIPS, Vol. 34, 2021, pp. 4421–4434
[79] M. Liu, J. Wei, Y. Liu, J. Davis, Do humans and machines have the same eyes? Human-machine perceptual differences on image classification, arXiv:2304.08733 (2023)
[80] P. Hemmer, S. Schellhammer, M. Vössing, J. Jakubik, G. Satzger, Forming effective human-AI teams: Building machine learning models that complement the capabilities of multiple experts, in: L. D. Raedt (Ed.), International Joint Conference on Artificial Intelligence, International Joint Conferences on Artificial Intelligence Organization, 2022, pp. 2478–... doi:10.24963/ijcai.2022/344
[81] Z. Zhang, W. Ai, K. Wells, D. Rosewarne, T.-T. Do, G. Carneiro, Learning to complement and to defer to multiple users, in: ECCV, Springer, 2025, pp. 144–162

Showing first 80 references.

[1] [1]

M. Li, Y. Jiang, Y. Zhang, H. Zhu, Medical image analysis using deep learning algorithms, Frontiers in Public Health 11 (2023) 1273253

2023

[2] [2]

T. R. C. of Radiologists, Clinical radiology workforce census 2023, Tech. rep., The Royal College of Radiologists (2023). 21

2023

[3] [3]

E. J. Topol, High-performance medicine: The convergence of human and artificial intelligence, Nature medicine 25 (1) (2019) 44–56

2019

[4] [4]

M. M. Abuzaid, W. Elshami, H. Tekin, B. Issa, Assessment of the will- ingness of radiologists and radiographers to accept the integration of artificial intelligence into radiology practice, Academic Radiology 29 (1) (2022) 87–94

2022

[5] [5]

Derevianko, S

A. Derevianko, S. F. M. Pizzoli, F. Pesapane, A. Rotili, D. Monzani, R. Grasso, E. Cassano, G. Pravettoni, The use of artificial intelligence (ai) in the radiology field: What is the state of doctor–patient commu- nication in cancer diagnosis?, Cancers 15 (2) (2023) 470

2023

[6] [6]

Zhang, Y

S. Zhang, Y. Li, W. Liu, Q. Chu, Y. Chen, A decade of review in global regulation and research of artificial intelligence medical devices (2015- 2025), Frontiers in Medicine 12 (2025) 1630408

2015

[7] [7]

Jones, J

C. Jones, J. Thornton, J. C. Wyatt, Artificial intelligence and clinical decision support: clinicians’ perspectives on trust, trustworthiness, and liability, Medical law review 31 (4) (2023) 501–520

2023

[8] [8]

Kumah, Artificial intelligence in healthcare and its implications for patient centered care, Discover Public Health 22 (1) (2025) 524

E. Kumah, Artificial intelligence in healthcare and its implications for patient centered care, Discover Public Health 22 (1) (2025) 524

2025

[9] [9]

S. S. Jain, S. Goto, J. L. Hall, S. S. Khan, C. A. MacRae, C. Ofori, C. Pegus, M. Pencina, E. D. Peterson, L. H. Schwamm, et al., Pragmatic approaches to the evaluation and monitoring of artificial intelligence in health care: A science advisory from the american heart association, Circulation 152 (23) (2025) e433–e442

2025

[10] [10]

E. U. Alum, O. P.-C. Ugwu, Artificial intelligence in personalized medicine: transforming diagnosis and treatment, Discover Applied Sci- ences 7 (3) (2025) 193

2025

[11] [11]

L. A. Celi, J. Cellini, M.-L. Charpignon, E. C. Dee, F. Dernoncourt, R. Eber, W. G. Mitchell, L. Moukheiber, J. Schirmer, J. Situ, et al., Sources of bias in artificial intelligence that perpetuate healthcare dis- parities - a global review, PLOS Digital Health 1 (3) (2022) e0000022. 22

2022

[12] [12]

Oakden-Rayner, J

L. Oakden-Rayner, J. Dunnmon, G. Carneiro, C. Ré, Hidden stratifica- tion causes clinically meaningful failures in machine learning for medical imaging, in: ACM CHIL, 2020, pp. 151–159

2020

[13] [13]

M. A. Ricci Lara, R. Echeveste, E. Ferrante, Addressing fairness in artificialintelligenceformedicalimaging, NatureCommunications13(1) (2022) 4581

2022

[14] [14]

Madras, T

D. Madras, T. Pitassi, R. Zemel, Predict responsibly: Improving fairness and accuracy by learning to defer, in: NeurIPS, Vol. 31, 2018

2018

[15] [15]

Wilder, E

B. Wilder, E. Horvitz, E. Kamar, Learning to complement humans, in: International Joint Conference on Artificial Intelligence, 2021

2021

[16] [16]

Y. Zong, Y. Yang, T. Hospedales, MEDFAIR: Benchmarking fairness for medical imaging, in: ICLR, 2023

2023

[17] [17]

Iqbal, M

T. Iqbal, M. Masud, B. Amin, C. Feely, M. Faherty, T. Jones, M. Tier- ney, A. Shahzad, P. Vazquez, Towards integration of artificial intelli- gence into medical devices as a real-time recommender system for per- sonalised healthcare: State-of-the-art and future prospects, Health Sci- ences Review (2024)

2024

[18] [18]

Quadrianto, V

N. Quadrianto, V. Sharmanska, O. Thomas, Discovering fair represen- tations in the data domain, in: CVPR, 2019, pp. 8227–8236

2019

[19] [19]

Zhang, J

Y. Zhang, J. Sang, Towards accuracy-fairness paradox: Adversarial example-based data augmentation for visual debiasing, in: ACM Multi- media, 2020

2020

[20] [20]

V. V. Ramaswamy, S. S. Kim, O. Russakovsky, Fair attribute classifica- tion through latent space de-biasing, in: CVPR, 2021, pp. 9301–9310

2021

[21] [21]

S. Park, J. Lee, P. Lee, S. Hwang, D. Kim, H. Byun, Fair contrastive learning for facial attribute classification, in: CVPR, 2022, pp. 10389–10398

[22] [22]

Y. Roh, K. Lee, S. Whang, C. Suh, Fr-train: A mutual information-based approach to fair and robust training, in: ICML, PMLR, 2020, pp. 8147–8157

[23] [23]

M. B. Zafar, I. Valera, M. G. Rodriguez, K. P. Gummadi, Fairness constraints: Mechanisms for fair classification, in: AISTATS, PMLR, 2017, pp. 962–970

[24] [24]

B. H. Zhang, B. Lemoine, M. Mitchell, Mitigating unwanted biases with adversarial learning, in: AIES, 2018, pp. 335–340

[25] [25]

Z. Wang, X. Dong, H. Xue, Z. Zhang, W. Chiu, T. Wei, K. Ren, Fairness-aware adversarial perturbation towards bias mitigation for deployed deep models, in: CVPR, 2022, pp. 10379–10388

[26] [26]

M. P. Kim, A. Ghorbani, J. Zou, Multiaccuracy: Black-box post-processing for fairness in classification, in: AIES, 2019, pp. 247–254

[27] [27]

J. Herington, M. D. McCradden, K. Creel, R. Boellaard, E. C. Jones, A. K. Jha, A. Rahmim, P. J. Scott, J. J. Sunderland, R. L. Wahl, et al., Ethical considerations for artificial intelligence in medical imaging: deployment and governance, Journal of Nuclear Medicine 64 (10) (2023) 1509–1515

[28] [28]

Z. Obermeyer, B. Powers, C. Vogeli, S. Mullainathan, Dissecting racial bias in an algorithm used to manage the health of populations, Science 366 (6464) (2019)

[29] [29]

A. J. Larrazabal, N. Nieto, V. Peterson, D. H. Milone, E. Ferrante, Gender imbalance in medical imaging datasets produces biased classifiers for computer-aided diagnosis, National Academy of Sciences 117 (23) (2020) 12592–12594

[30] [30]

N. V. Chawla, K. W. Bowyer, L. O. Hall, W. P. Kegelmeyer, SMOTE: Synthetic minority over-sampling technique, Journal of Artificial Intelligence Research 16 (1) (2002) 321

[31] [31]

G. Pleiss, M. Raghavan, F. Wu, J. Kleinberg, K. Q. Weinberger, On fairness and calibration, in: NeurIPS, Vol. 30, 2017

[32] [32]

B. Kim, H. Kim, K. Kim, S. Kim, J. Kim, Learning not to learn: Training deep neural networks with biased data, in: CVPR, 2019, pp. 9012–9020

[33] [33]

D. Madras, E. Creager, T. Pitassi, R. Zemel, Learning adversarially fair and transferable representations, in: ICML, PMLR, 2018, pp. 3384–3393

[34] [34]

H. Zhao, A. Coston, T. Adel, G. J. Gordon, Conditional learning of fair representations, arXiv:1910.07162 (2019)

[35] [35]

S. Sagawa, P. W. Koh, T. B. Hashimoto, P. Liang, Distributionally robust neural networks for group shifts: On the importance of regularization for worst-case generalization, in: ICLR, 2020

[36] [36]

J. Cha, S. Chun, K. Lee, H.-C. Cho, S. Park, Y. Lee, S. Park, Swad: Domain generalization by seeking flat minima, in: NeurIPS, Vol. 34, 2021, pp. 22405–22418

[37] [37]

E. Tartaglione, C. A. Barbano, M. Grangetto, End: Entangling and disentangling deep representations for bias correction, in: CVPR, 2021, pp. 13508–13517

[38] [38]

M. H. Sarhan, N. Navab, A. Eslami, S. Albarqouni, Fairness by learning orthogonal disentangled representations, in: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXIX 16, Springer, 2020, pp. 746–761

[39] [39]

Y. Tian, M. Shi, Y. Luo, A. Kouhana, T. Elze, M. Wang, Fairseg: A large-scale medical image segmentation dataset for fairness learning using segment anything model with fair error-bound scaling, in: ICLR, 2024

[40] [40]

Y. Luo, M. Shi, M. O. Khan, M. M. Afzal, H. Huang, S. Yuan, Y. Tian, L. Song, A. Kouhana, T. Elze, et al., FairCLIP: Harnessing fairness in vision-language learning, in: CVPR, 2024, pp. 12289–12301

[41] [42]

A. Rosenfeld, M. D. Solbach, J. K. Tsotsos, Totally looks like - how humans compare, compared to machines, in: IEEE Conf. Comput. Vis. Pattern Recog. Worksh., 2018, pp. 1961–1964

[42] [43]

T. Serre, Deep learning: the good, the bad, and the ugly, Annual Review of Vision Science 5 (2019) 399–426

[43] [44]

E. Kamar, S. Hacker, E. Horvitz, Combining human and machine intelligence in large-scale crowdsourcing, in: International Conference on Autonomous Agents and Multiagent Systems, Vol. 12, 2012, pp. 467–474

[44] [45]

E. K. Chiou, J. D. Lee, Trusting automation: Designing for responsivity and resilience, Human Factors 65 (1) (2023) 137–165

[45] [46]

Z. Lu, M. Yin, Human reliance on machine learning models when performance feedback is limited: Heuristics and risks, in: CHI Conference on Human Factors in Computing Systems, 2021, pp. 1–16

[46] [47]

M. Yin, J. Wortman Vaughan, H. Wallach, Understanding the effect of accuracy on trust in machine learning models, in: CHI Conference on Human Factors in Computing Systems, 2019, pp. 1–12

[47] [48]

D. Shin, The effects of explainability and causability on perception, trust, and acceptance: Implications for explainable AI, International Journal of Human-Computer Studies 146 (2021) 102551

[48] [49]

K. Weitz, D. Schiller, R. Schlagowski, T. Huber, E. André, "Do you trust me?" Increasing user-trust by integrating virtual agents in explainable AI interaction design, in: ACM International Conference on Intelligent Virtual Agents, 2019, pp. 7–9

[49] [50]

G. Bansal, B. Nushi, E. Kamar, E. Horvitz, D. S. Weld, Is the most accurate AI the best teammate? Optimizing AI for teamwork, in: AAAI Conference on Artificial Intelligence, Vol. 35(13), 2021, pp. 11405–11414

[50] [51]

N. Agarwal, A. Moehring, P. Rajpurkar, T. Salz, Combining human expertise with artificial intelligence: experimental evidence from radiology, Tech. rep., National Bureau of Economic Research (2023)

[51] [52]

K. Vodrahalli, R. Daneshjou, T. Gerstenberg, J. Zou, Do humans trust advice more if it comes from AI? An analysis of human-AI interactions, in: AIES, 2022, pp. 763–777

[52] [53]

X. Wu, L. Xiao, Y. Sun, J. Zhang, T. Ma, L. He, A survey of human-in-the-loop for machine learning, Future Generation Computer Systems 135 (C) (2022) 364–381, doi:10.1016/j.future.2022.05.014

[53] [54]

V. Keswani, M. Lease, K. Kenthapadi, Towards unbiased and accurate deferral to multiple experts, in: AIES, 2021, pp. 154–165

[54] [55]

H. Narasimhan, W. Jitkrittum, A. K. Menon, A. Rawat, S. Kumar, Post-hoc estimators for learning to defer to an expert, in: NeurIPS, Vol. 35, 2022

[55] [56]

A. Mao, C. Mohri, M. Mohri, Y. Zhong, Two-stage learning to defer with multiple experts, in: NeurIPS, 2023

[56] [57]

Z. Zhang, C. Nguyen, K. Wells, T.-T. Do, D. Rosewarne, G. Carneiro, Coverage-constrained human-ai cooperation with multiple experts, in: AAAI, 2026

[57] [58]

C. Cortes, G. DeSalvo, M. Mohri, Learning with rejection, in: ALT, Springer, 2016

[58] [59]

N. Charoenphakdee, Z. Cui, Y. Zhang, M. Sugiyama, Classification with rejection based on cost-sensitive classification, in: ICML, PMLR, 2021, pp. 1507–1517

[59] [60]

M. Raghu, K. Blumer, G. Corrado, J. Kleinberg, Z. Obermeyer, S. Mullainathan, The algorithmic automation problem: Prediction, triage, and human effort, in: Machine Learning for Health Symposium, 2018

[60] [61]

N. Okati, A. De, M. Rodriguez, Differentiable learning under triage, in: NeurIPS, Vol. 34, 2021, pp. 9140–9151

[61] [62]

H. Mozannar, D. Sontag, Consistent estimators for learning to defer to an expert, in: ICML, PMLR, 2020, pp. 7076–7087

[62] [63]

R. Verma, E. Nalisnick, Calibrated learning to defer with one-vs-all classifiers, in: ICML, PMLR, 2022, pp. 22184–22202

[63] [64]

H. Mozannar, H. Lang, D. Wei, P. Sattigeri, S. Das, D. Sontag, Who should predict? Exact algorithms for learning to defer to humans, in: AISTATS, PMLR, 2023, pp. 10520–10545

[64] [65]

M.-A. Charusaie, H. Mozannar, D. Sontag, S. Samadi, Sample efficient learning of predictors that complement humans, in: ICML, PMLR, 2022, pp. 2972–3005

[65] [66]

Y. Cao, H. Mozannar, L. Feng, H. Wei, B. An, In defense of softmax parametrization for calibrated and consistent learning to defer, in: NeurIPS, Vol. 36, 2024

[66] [67]

E. Straitouri, L. Wang, N. Okati, M. G. Rodriguez, Improving expert predictions with conformal prediction, in: ICML, PMLR, 2023, pp. 32633–32653

[67] [68]

S. Liu, Y. Cao, Q. Zhang, L. Feng, B. An, Mitigating underfitting in learning to defer with consistent losses, in: AISTATS, 2024

[68] [69]

H. Mozannar, A. Satyanarayan, D. Sontag, Teaching humans when to defer to a classifier via exemplars, in: AAAI Conference on Artificial Intelligence, Vol. 36, 2022, pp. 5323–5331

[69] [70]

R. Verma, D. Barrejón, E. Nalisnick, On the calibration of learning to defer to multiple experts, in: ICML Workshop on HMCT, 2022

[70] [71]

R. Verma, D. Barrejón, E. Nalisnick, Learning to defer to multiple experts: Consistent surrogate losses, confidence calibration, and conformal ensembles, in: AISTATS, PMLR, 2023, pp. 11415–11434

[71] [72]

V. Babbar, U. Bhatt, A. Weller, On the utility of prediction sets in human-AI teams, in: International Joint Conference on Artificial Intelligence, 2022

[72] [73]

A. Mao, M. Mohri, Y. Zhong, Principled approaches for learning to defer with multiple experts, in: International Symposium on Artificial Intelligence and Mathematics, 2024

[73] [74]

P. Hemmer, L. Thede, M. Vössing, J. Jakubik, N. Kühl, Learning to defer with limited expert predictions, in: AAAI Conference on Artificial Intelligence, Vol. 37, 2023, pp. 6002–6011

[74] [75]

D. Tailor, A. Patra, R. Verma, P. Manggala, E. Nalisnick, Learning to defer to a population: A meta-learning approach, in: AISTATS, 2024

[75] [76]

D. Leitão, P. Saleiro, M. A. Figueiredo, P. Bizarro, Human-AI collaboration in decision-making: Beyond learning to defer, in: ICML Workshop on Human-Machine Collaboration and Teaming, 2022

[76] [77]

M. Steyvers, H. Tejeda, G. Kerrigan, P. Smyth, Bayesian modeling of human–AI complementarity, National Academy of Sciences 119 (11) (2022) e2111547119

[77] [78]

G. Kerrigan, P. Smyth, M. Steyvers, Combining human predictions with model probabilities via confusion matrices and calibration, in: NeurIPS, Vol. 34, 2021, pp. 4421–4434

[78] [79]

M. Liu, J. Wei, Y. Liu, J. Davis, Do humans and machines have the same eyes? Human-machine perceptual differences on image classification, arXiv:2304.08733 (2023)

[79] [80]

P. Hemmer, S. Schellhammer, M. Vössing, J. Jakubik, G. Satzger, Forming effective human-AI teams: Building machine learning models that complement the capabilities of multiple experts, in: L. D. Raedt (Ed.), International Joint Conference on Artificial Intelligence, International Joint Conferences on Artificial Intelligence Organization, 2022, pp. 2478–..., doi:10.24963/ijcai.2022/344

[80] [81]

Z. Zhang, W. Ai, K. Wells, D. Rosewarne, T.-T. Do, G. Carneiro, Learning to complement and to defer to multiple users, in: ECCV, Springer, 2025, pp. 144–162