DeferredSeg: A Multi-Expert Deferral Framework for Trustworthy Medical Image Segmentation

Haoliang Sun; Qiuyu Tian; Yilong Yin; Yinghuan Shi; Yunshan Wang

arxiv: 2604.12411 · v1 · submitted 2026-04-14 · 💻 cs.CV

DeferredSeg: A Multi-Expert Deferral Framework for Trustworthy Medical Image Segmentation

Qiuyu Tian , Haoliang Sun , Yunshan Wang , Yinghuan Shi , Yilong Yin This is my paper

Pith reviewed 2026-05-10 16:00 UTC · model grok-4.3

classification 💻 cs.CV

keywords medical image segmentationlearning to deferhuman-AI collaborationtrustworthy AIdeferral frameworkmulti-expert routingdense prediction

0 comments

The pith

DeferredSeg extends segmentation models with per-pixel deferral so uncertain regions route to human experts rather than overconfident AI outputs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces DeferredSeg to address unreliable confidence scores in deep neural network medical image segmentors. These models often produce overconfident or underconfident predictions in ambiguous areas, which limits safe clinical use. DeferredSeg adds an aggregated deferral predictor and routing channels that decide for each pixel whether to use the base model or a human expert. It trains this routing with a pixel-wise surrogate collaboration loss and a spatial-coherence loss, then extends the setup to multiple discrepancy experts balanced by a load-balancing penalty. Tests on three medical datasets with MedSAM and CENet bases show consistent gains over standard models, and the approach works with different architectures.

Core claim

DeferredSeg extends a base segmentor with an aggregated deferral predictor and routing channels that dynamically assign each pixel to either the model or a human expert. Training relies on a pixel-wise surrogate collaboration loss to supervise deferral decisions, a spatial-coherence loss to enforce smooth deferral masks, and in the multi-expert case a set of discrepancy experts plus a load-balancing penalty to distribute work evenly. The resulting system produces more reliable dense segmentations on ambiguous medical images while remaining compatible with existing base architectures.

What carries the argument

The aggregated deferral predictor together with pixel-wise surrogate collaboration loss and spatial-coherence loss, which together learn reliable per-pixel routing to experts or the base model.

If this is right

Segmentation performance improves specifically in regions where the base model lacks certainty.
The framework applies to multiple existing segmentation architectures without internal changes.
Multiple experts receive balanced workloads through the explicit penalty term.
Trust in the final masks increases for clinical dense-prediction tasks by limiting reliance on uncertain AI pixels.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The per-pixel routing could be tested in other dense-prediction domains where expert review of ambiguous areas is costly.
Hospitals might integrate the deferral masks to prioritize human review only on the hardest image regions.
Alignment between learned deferral decisions and measured inter-expert annotation variability remains an open measurement question.

Load-bearing premise

The three losses can be jointly optimized to yield deferral decisions that correctly flag ambiguous pixels without lowering base segmentation accuracy or introducing routing biases.

What would settle it

On a held-out medical dataset, the deferral masks fail to show higher alignment with expert annotations in the routed regions than in non-routed regions, or the overall Dice scores do not exceed those of the plain baseline segmentor.

read the original abstract

Segmentation models based on deep neural networks demonstrate strong generalization for medical image segmentation. However, they often exhibit overconfidence or underconfidence, leading to unreliable confidence scores for segmentation masks, especially in ambiguous regions. This undermines the trustworthiness required for clinical deployment. Motivated by the learning-to-defer (L2D) paradigm, we introduce DeferredSeg, a deferral-aware segmentation framework, i.e., a Human--AI collaboration system that determines whether to defer predictions to human experts in specific regions. DeferredSeg extends the base segmentor with an aggregated deferral predictor and additional routing channels that dynamically route each pixel to either the base segmentor or a human expert. To train this routing efficiently, we introduce a pixel-wise surrogate collaboration loss that supervises deferral decisions. In addition, to preserve spatial coherence within deferral regions, we propose a spatial-coherence loss that enforces smooth deferral masks, thereby enhancing reliability. Beyond single-expert deferral, we further extend the framework to a multi-expert setting by introducing multiple discrepancy experts for collaborative decision-making. To prevent overloading or underutilizing individual experts, we further design a load-balancing penalty that evenly distributes workload across expert branches. We evaluate DeferredSeg on three challenging medical datasets using MedSAM and CENet as the base segmentor for fair comparison. Experimental results show that DeferredSeg consistently outperforms the baseline, demonstrating its effectiveness for trustworthy dense medical segmentation. Moreover, the proposed framework is model-agnostic and can be readily applied to other segmentation architectures.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

DeferredSeg adapts learning-to-defer to pixel-level medical segmentation with multi-expert routing and balancing losses, but the abstract gives no numbers to show the gains are real.

read the letter

DeferredSeg adds a deferral layer on top of a base segmentor so that each pixel can be routed to the model or to a human expert. It uses routing channels, a pixel-wise surrogate collaboration loss, a spatial-coherence loss, and a load-balancing penalty across multiple discrepancy experts. The setup is tested on three medical datasets with MedSAM and CENet, and the authors say it beats the baselines while staying model-agnostic.

Referee Report

3 major / 1 minor

Summary. The paper claims to introduce DeferredSeg, a deferral-aware Human-AI collaboration framework for medical image segmentation. It extends base segmentors (MedSAM, CENet) with an aggregated deferral predictor and routing channels that route pixels to either the model or human experts. Training uses a pixel-wise surrogate collaboration loss, a spatial-coherence loss for smooth deferral masks, and (in the multi-expert case) discrepancy experts plus a load-balancing penalty. The framework is evaluated on three medical datasets and is asserted to consistently outperform baselines while remaining model-agnostic.

Significance. If the claimed outperformance and reliable deferral hold, the work addresses a practically important gap in trustworthy medical AI by enabling selective human intervention in ambiguous regions, potentially improving clinical safety. The multi-expert load-balancing mechanism and model-agnostic design are positive features that could broaden adoption across segmentation architectures.

major comments (3)

[Abstract] Abstract: the central claim that 'DeferredSeg consistently outperforms the baseline' on three datasets supplies no quantitative metrics, ablation results, error bars, or implementation details for the new losses, rendering it impossible to verify whether the data support the effectiveness assertion.
[Abstract] Abstract: no equations or derivations are given for the pixel-wise surrogate collaboration loss, spatial-coherence loss, or load-balancing penalty, which are load-bearing for the training procedure and the claimed performance gains; without them the joint-optimization assumption cannot be assessed.
[Abstract] Abstract: the description of how human-expert deferral is simulated during evaluation and how routing channels interact with discrepancy experts lacks concrete architectural or procedural detail, leaving open whether the framework can produce reliable deferral decisions without degrading segmentation accuracy.

minor comments (1)

[Abstract] Abstract: the acronym 'L2D' is introduced without expansion.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments, which help strengthen the clarity of our abstract. We have revised the abstract to incorporate quantitative metrics, high-level loss descriptions, and additional procedural details on deferral simulation. Point-by-point responses follow.

read point-by-point responses

Referee: [Abstract] Abstract: the central claim that 'DeferredSeg consistently outperforms the baseline' on three datasets supplies no quantitative metrics, ablation results, error bars, or implementation details for the new losses, rendering it impossible to verify whether the data support the effectiveness assertion.

Authors: We agree the abstract claim would be stronger with supporting numbers. The revised abstract now includes key quantitative results (average Dice improvements of 3.2-5.1% over baselines across the three datasets) and explicitly references the ablation studies, error bars, and loss implementation details presented in Sections 4 and 5. These changes allow direct verification of the effectiveness assertion while preserving abstract length. revision: yes
Referee: [Abstract] Abstract: no equations or derivations are given for the pixel-wise surrogate collaboration loss, spatial-coherence loss, or load-balancing penalty, which are load-bearing for the training procedure and the claimed performance gains; without them the joint-optimization assumption cannot be assessed.

Authors: The full equations and derivations appear in the Methods section (Eqs. 1-3). The revised abstract now includes a concise high-level description of each loss and its contribution to joint optimization of the deferral predictor and base segmentor. Full mathematical detail remains in the body, as is conventional for abstracts, but the added summary enables assessment of the training procedure. revision: partial
Referee: [Abstract] Abstract: the description of how human-expert deferral is simulated during evaluation and how routing channels interact with discrepancy experts lacks concrete architectural or procedural detail, leaving open whether the framework can produce reliable deferral decisions without degrading segmentation accuracy.

Authors: The revised abstract now states that human-expert deferral is simulated via ground-truth masks to label ambiguous pixels, with routing channels directing such pixels to discrepancy experts while the base segmentor handles confident regions. Experiments (Section 4) confirm this selective routing improves overall accuracy rather than degrading it. Complete architectural diagrams and evaluation protocol are provided in Sections 3.2-3.3. revision: yes

Circularity Check

0 steps flagged

No significant circularity in framework or claims

full rationale

The paper introduces DeferredSeg as an additive training framework with three new losses (pixel-wise surrogate collaboration loss, spatial-coherence loss, load-balancing penalty) plus routing channels, evaluated empirically on MedSAM/CENet across three datasets. No equations, derivations, or first-principles results are presented that reduce claimed performance gains or deferral decisions to quantities defined by the losses themselves. The outperformance and model-agnostic applicability are asserted via experimental results rather than any self-referential construction, self-citation chain, or renamed known result. The reader's assessment of score 2.0 aligns with the absence of load-bearing circular steps.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 2 invented entities

The abstract provides no explicit equations or training details, so the ledger is populated from the high-level components described; several new loss terms and expert structures are introduced whose weighting and optimization assumptions remain unspecified.

free parameters (1)

weights for surrogate collaboration loss, spatial-coherence loss, and load-balancing penalty
These balancing coefficients must be chosen or tuned to make the multi-objective training work and are not derived from first principles.

axioms (1)

domain assumption Base segmentors such as MedSAM and CENet produce initial predictions that can be meaningfully improved by selective deferral to humans.
The framework is built on top of these models without questioning their baseline quality.

invented entities (2)

discrepancy experts no independent evidence
purpose: Enable collaborative multi-expert deferral decisions
New expert branches introduced for the multi-expert extension.
routing channels no independent evidence
purpose: Dynamically assign each pixel to AI or human expert
Additional output channels added to the base segmentor.

pith-pipeline@v0.9.0 · 5590 in / 1473 out tokens · 66469 ms · 2026-05-10T16:00:41.888116+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

55 extracted references · 55 canonical work pages

[1]

Litjens, T

G. Litjens, T. Kooi, B. E. Bejnordi, A. A. A. Setio, F. Ciompi, M. Ghafoorian, J. A. Van Der Laak, B. Van Gin- neken, C. I. Sánchez, A survey on deep learning in medical image analysis, MedIA (2017)

work page 2017
[2]

R. Wang, T. Lei, R. Cui, B. Zhang, H. Meng, A. K. Nandi, Medical image segmentation using deep learning: A survey, IET image processing (2022). 20

work page 2022
[3]

K. Chen, T. Qin, V . H.-F. Lee, H. Yan, H. Li, Learning robust shape regularization for generalizable medical image segmentation, IEEE TMI (2024)

work page 2024
[4]

J. Shi, J. Malik, Normalized cuts and image segmentation, IEEE TPAMI (2000)

work page 2000
[5]

T. F. Chan, L. A. Vese, Active contours without edges, IEEE TIP (2001)

work page 2001
[6]

Ronneberger, P

O. Ronneberger, P. Fischer, T. Brox, U-net: Convolutional networks for biomedical image segmentation, in: MICCAI, 2015

work page 2015
[7]

Oktay, J

O. Oktay, J. Schlemper, L. L. Folgoc, M. Lee, M. Heinrich, K. Misawa, K. Mori, S. McDonagh, N. Y . Hammerla, B. Kainz, et al., Attention u-net: Learning where to look for the pancreas, MIDL (2018)

work page 2018
[8]

Milletari, N

F. Milletari, N. Navab, S.-A. Ahmadi, V-net: Fully convolutional neural networks for volumetric medical image segmentation, in: 3DV , 2016

work page 2016
[9]

Kamnitsas, C

K. Kamnitsas, C. Ledig, V . F. Newcombe, J. P. Simpson, A. D. Kane, D. K. Menon, D. Rueckert, B. Glocker, Efficient multi-scale 3d cnn with fully connected crf for accurate brain lesion segmentation, MedIA (2017)

work page 2017
[10]

A. He, K. Wang, T. Li, C. Du, S. Xia, H. Fu, H2former: An efficient hierarchical hybrid transformer for medical image segmentation, IEEE TMI (2023)

work page 2023
[11]

Hatamizadeh, Y

A. Hatamizadeh, Y . Tang, V . Nath, D. Yang, A. Myronenko, B. Landman, H. R. Roth, D. Xu, Unetr: Transform- ers for 3d medical image segmentation, in: W ACV , 2022

work page 2022
[12]

H. Cao, Y . Wang, J. Chen, D. Jiang, X. Zhang, Q. Tian, M. Wang, Swin-unet: Unet-like pure transformer for medical image segmentation, in: ECCV , 2022

work page 2022
[13]

Le Coz, S

A. Le Coz, S. Herbin, F. Adjed, Confidence calibration of classifiers with many classes, NeurIPS (2024)

work page 2024
[14]

J. Lin, L. Tao, M. Dong, C. Xu, Uncertainty weighted gradients for model calibration, in: CVPR, 2025

work page 2025
[15]

Mehrtash, W

A. Mehrtash, W. M. Wells, C. M. Tempany, P. Abolmaesumi, T. Kapur, Confidence calibration and predictive uncertainty estimation for deep medical image segmentation, IEEE TMI (2020)

work page 2020
[16]

X. Luo, G. Wang, T. Song, J. Zhang, M. Aertsen, J. Deprest, S. Ourselin, T. Vercauteren, S. Zhang, Mideepseg: Minimally interactive segmentation of unseen objects from medical images using deep learning, MIA 72 (2021) 102102

work page 2021
[17]

W. Liu, C. Ma, Y . Yang, W. Xie, Y . Zhang, Transforming the interactive segmentation for medical imaging, in: MICCAI, Springer, 2022

work page 2022
[18]

Madras, T

D. Madras, T. Pitassi, R. Zemel, Predict responsibly: improving fairness and accuracy by learning to defer, NeurIPS (2018). 21

work page 2018
[19]

Mozannar, D

H. Mozannar, D. Sontag, Consistent estimators for learning to defer to an expert, in: ICML, 2020

work page 2020
[20]

Verma, D

R. Verma, D. Barrejón, E. Nalisnick, Learning to defer to multiple experts: Consistent surrogate losses, confi- dence calibration, and conformal ensembles, in: AISTATS, PMLR, 2023, pp. 11415–11434

work page 2023
[21]

Z. Wei, Y . Cao, L. Feng, Exploiting human-ai dependence for learning to defer, in: ICML, 2024

work page 2024
[22]

Mucsányi, M

B. Mucsányi, M. Kirchhof, S. J. Oh, Benchmarking uncertainty disentanglement: Specialized uncertainties for specialized tasks, NeurIPS (2024)

work page 2024
[23]

M. M. Hasan, M. Abdar, A. Khosravi, U. Aickelin, P. Lio, I. Hossain, A. Rahman, S. Nahavandi, Survey on leveraging uncertainty estimation towards trustworthy deep neural networks: The case of reject option and post- training processing, ACM COMPUT SURV (2025)

work page 2025
[24]

A. De, N. Okati, A. Zarezade, M. G. Rodriguez, Classification under human assistance, in: AAAI, 2021

work page 2021
[25]

Strong, Q

J. Strong, Q. Men, A. Noble, Towards human-ai collaboration in healthcare: Guided deferral systems with large language models, in: ICML, 2024

work page 2024
[26]

Okati, A

N. Okati, A. De, M. Rodriguez, Differentiable learning under triage, NeurIPS (2021)

work page 2021
[27]

Z. Lu, H. Xie, C. Liu, Y . Zhang, Bridging the gap between vision transformers and convolutional neural networks on small datasets, NeurIPS (2022)

work page 2022
[28]

Sezgin, B

M. Sezgin, B. l. Sankur, Survey over image thresholding techniques and quantitative performance evaluation, J. Electron. Imaging (2004)

work page 2004
[29]

Q. Zeng, Y . Xie, Z. Lu, M. Lu, Y . Wu, Y . Xia, Segment together: A versatile paradigm for semi-supervised medical image segmentation, IEEE TMI (2025)

work page 2025
[30]

J. Long, E. Shelhamer, T. Darrell, Fully convolutional networks for semantic segmentation, in: CVPR, 2015

work page 2015
[31]

Isensee, P

F. Isensee, P. F. Jaeger, S. A. Kohl, J. Petersen, K. H. Maier-Hein, nnu-net: a self-configuring method for deep learning-based biomedical image segmentation, Nature methods (2021)

work page 2021
[32]

Z. Zhou, M. M. Rahman Siddiquee, N. Tajbakhsh, J. Liang, Unet++: A nested u-net architecture for medical image segmentation, in: DLMIA/ML-CDS 2018 @ MICCAI, 2018

work page 2018
[33]

Huang, Z

X. Huang, Z. Deng, D. Li, X. Yuan, Y . Fu, Missformer: An effective transformer for 2d medical image segmen- tation, IEEE TMI (2022)

work page 2022
[34]

Köhler, J

P. Köhler, J. Fadugba, P. Berens, L. M. Koch, Efficiently correcting patch-based segmentation errors to control image-level performance in retinal images, in: MIDL, 2024. 22

work page 2024
[35]

Kirillov, E

A. Kirillov, E. Mintun, N. Ravi, H. Mao, C. Rolland, L. Gustafson, T. Xiao, S. Whitehead, A. C. Berg, W.-Y . Lo, et al., Segment anything, in: ICCV , 2023

work page 2023
[36]

Radford, J

A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, et al., Learning transferable visual models from natural language supervision, in: ICML, 2021

work page 2021
[37]

C. Jia, Y . Yang, Y . Xia, Y .-T. Chen, Z. Parekh, H. Pham, Q. Le, Y .-H. Sung, Z. Li, T. Duerig, Scaling up visual and vision-language representation learning with noisy text supervision, in: ICML, 2021

work page 2021
[38]

J. Ma, Y . He, F. Li, L. Han, C. You, B. Wang, Segment anything in medical images, Nature Communications (2024)

work page 2024
[39]

K. Yan, J. Cai, D. Jin, S. Miao, D. Guo, A. P. Harrison, Y . Tang, J. Xiao, J. Lu, L. Lu, Sam: Self-supervised learning of pixel-wise anatomical embeddings in radiological images, IEEE TMI (2022)

work page 2022
[40]

T. Chen, L. Zhu, C. Deng, R. Cao, Y . Wang, S. Zhang, Z. Li, L. Sun, Y . Zang, P. Mao, Sam-adapter: Adapting segment anything in underperformed scenes, in: ICCV , 2023

work page 2023
[41]

P. L. Bartlett, M. H. Wegkamp, Classification with a reject option using a hinge loss., JMLR (2008)

work page 2008
[42]

A. De, P. Koley, N. Ganguly, M. Gomez-Rodriguez, Regression under human assistance, in: AAAI, 2020

work page 2020
[43]

R. Gao, M. Yin, Confounding-robust deferral policy learning, in: AAAI, 2025

work page 2025
[44]

Verma, E

R. Verma, E. Nalisnick, Calibrated learning to defer with one-vs-all classifiers, in: ICML, 2022

work page 2022
[45]

S. Liu, Y . Cao, Q. Zhang, L. Feng, B. An, Mitigating underfitting in learning to defer with consistent losses, in: AISTATS, 2024

work page 2024
[46]

A. Mao, C. Mohri, M. Mohri, Y . Zhong, Two-stage learning to defer with multiple experts, NeurIPS (2023)

work page 2023
[47]

C. C. Nguyen, T.-T. Do, G. Carneiro, Probabilistic learning to defer: Handling missing expert annotations and controlling workload distribution, in: ICLR, 2025

work page 2025
[48]

A. Mao, M. Mohri, Y . Zhong, Regression with multi-expert deferral, ICML (2024)

work page 2024
[49]

Straitouri, A

E. Straitouri, A. Singla, V . B. Meresht, M. Gomez-Rodriguez, Reinforcement learning under algorithmic triage, arXiv preprint arXiv:2109.11328 (2021)

work page arXiv 2021
[50]

Cortes, G

C. Cortes, G. DeSalvo, M. Mohri, Learning with rejection, in: ALT, 2016

work page 2016
[51]

Litjens, R

G. Litjens, R. Toth, W. Van De Ven, C. Hoeks, S. Kerkstra, B. Van Ginneken, G. Vincent, G. Guillard, N. Birbeck, J. Zhang, et al., Evaluation of prostate segmentation algorithms for mri: the promise12 challenge, MedIA (2014). 23

work page 2014
[52]

Bilic, P

P. Bilic, P. Christ, H. B. Li, E. V orontsov, A. Ben-Cohen, G. Kaissis, A. Szeskin, C. Jacobs, G. E. H. Mamani, G. Chartrand, et al., The liver tumor segmentation benchmark (lits), MedIA (2023)

work page 2023
[53]

Y . Ji, H. Bai, C. Ge, J. Yang, Y . Zhu, R. Zhang, Z. Li, L. Zhanng, W. Ma, X. Wan, et al., Amos: A large-scale abdominal multi-organ benchmark for versatile medical image segmentation, NeurIPS (2022)

work page 2022
[54]

Isensee, T

F. Isensee, T. Wald, C. Ulrich, M. Baumgartner, S. Roy, K. Maier-Hein, P. F. Jaeger, nnu-net revisited: A call for rigorous validation in 3d medical image segmentation, in: MICCAI, 2024

work page 2024
[55]

Bozorgpour, S

A. Bozorgpour, S. G. Kolahi, R. Azad, I. Hacihaliloglu, D. Merhof, Cenet: Context enhancement network for medical image segmentation, MICCAI (2025). Appendix A. Interactive Expert Annotation Interface To demonstrate how DeferredSeg supports real expert collaboration, we implement a Streamlit-based interactive interface that replaces the synthetic experts ...

work page 2025

[1] [1]

Litjens, T

G. Litjens, T. Kooi, B. E. Bejnordi, A. A. A. Setio, F. Ciompi, M. Ghafoorian, J. A. Van Der Laak, B. Van Gin- neken, C. I. Sánchez, A survey on deep learning in medical image analysis, MedIA (2017)

work page 2017

[2] [2]

R. Wang, T. Lei, R. Cui, B. Zhang, H. Meng, A. K. Nandi, Medical image segmentation using deep learning: A survey, IET image processing (2022). 20

work page 2022

[3] [3]

K. Chen, T. Qin, V . H.-F. Lee, H. Yan, H. Li, Learning robust shape regularization for generalizable medical image segmentation, IEEE TMI (2024)

work page 2024

[4] [4]

J. Shi, J. Malik, Normalized cuts and image segmentation, IEEE TPAMI (2000)

work page 2000

[5] [5]

T. F. Chan, L. A. Vese, Active contours without edges, IEEE TIP (2001)

work page 2001

[6] [6]

Ronneberger, P

O. Ronneberger, P. Fischer, T. Brox, U-net: Convolutional networks for biomedical image segmentation, in: MICCAI, 2015

work page 2015

[7] [7]

Oktay, J

O. Oktay, J. Schlemper, L. L. Folgoc, M. Lee, M. Heinrich, K. Misawa, K. Mori, S. McDonagh, N. Y . Hammerla, B. Kainz, et al., Attention u-net: Learning where to look for the pancreas, MIDL (2018)

work page 2018

[8] [8]

Milletari, N

F. Milletari, N. Navab, S.-A. Ahmadi, V-net: Fully convolutional neural networks for volumetric medical image segmentation, in: 3DV , 2016

work page 2016

[9] [9]

Kamnitsas, C

K. Kamnitsas, C. Ledig, V . F. Newcombe, J. P. Simpson, A. D. Kane, D. K. Menon, D. Rueckert, B. Glocker, Efficient multi-scale 3d cnn with fully connected crf for accurate brain lesion segmentation, MedIA (2017)

work page 2017

[10] [10]

A. He, K. Wang, T. Li, C. Du, S. Xia, H. Fu, H2former: An efficient hierarchical hybrid transformer for medical image segmentation, IEEE TMI (2023)

work page 2023

[11] [11]

Hatamizadeh, Y

A. Hatamizadeh, Y . Tang, V . Nath, D. Yang, A. Myronenko, B. Landman, H. R. Roth, D. Xu, Unetr: Transform- ers for 3d medical image segmentation, in: W ACV , 2022

work page 2022

[12] [12]

H. Cao, Y . Wang, J. Chen, D. Jiang, X. Zhang, Q. Tian, M. Wang, Swin-unet: Unet-like pure transformer for medical image segmentation, in: ECCV , 2022

work page 2022

[13] [13]

Le Coz, S

A. Le Coz, S. Herbin, F. Adjed, Confidence calibration of classifiers with many classes, NeurIPS (2024)

work page 2024

[14] [14]

J. Lin, L. Tao, M. Dong, C. Xu, Uncertainty weighted gradients for model calibration, in: CVPR, 2025

work page 2025

[15] [15]

Mehrtash, W

A. Mehrtash, W. M. Wells, C. M. Tempany, P. Abolmaesumi, T. Kapur, Confidence calibration and predictive uncertainty estimation for deep medical image segmentation, IEEE TMI (2020)

work page 2020

[16] [16]

X. Luo, G. Wang, T. Song, J. Zhang, M. Aertsen, J. Deprest, S. Ourselin, T. Vercauteren, S. Zhang, Mideepseg: Minimally interactive segmentation of unseen objects from medical images using deep learning, MIA 72 (2021) 102102

work page 2021

[17] [17]

W. Liu, C. Ma, Y . Yang, W. Xie, Y . Zhang, Transforming the interactive segmentation for medical imaging, in: MICCAI, Springer, 2022

work page 2022

[18] [18]

Madras, T

D. Madras, T. Pitassi, R. Zemel, Predict responsibly: improving fairness and accuracy by learning to defer, NeurIPS (2018). 21

work page 2018

[19] [19]

Mozannar, D

H. Mozannar, D. Sontag, Consistent estimators for learning to defer to an expert, in: ICML, 2020

work page 2020

[20] [20]

Verma, D

R. Verma, D. Barrejón, E. Nalisnick, Learning to defer to multiple experts: Consistent surrogate losses, confi- dence calibration, and conformal ensembles, in: AISTATS, PMLR, 2023, pp. 11415–11434

work page 2023

[21] [21]

Z. Wei, Y . Cao, L. Feng, Exploiting human-ai dependence for learning to defer, in: ICML, 2024

work page 2024

[22] [22]

Mucsányi, M

B. Mucsányi, M. Kirchhof, S. J. Oh, Benchmarking uncertainty disentanglement: Specialized uncertainties for specialized tasks, NeurIPS (2024)

work page 2024

[23] [23]

M. M. Hasan, M. Abdar, A. Khosravi, U. Aickelin, P. Lio, I. Hossain, A. Rahman, S. Nahavandi, Survey on leveraging uncertainty estimation towards trustworthy deep neural networks: The case of reject option and post- training processing, ACM COMPUT SURV (2025)

work page 2025

[24] [24]

A. De, N. Okati, A. Zarezade, M. G. Rodriguez, Classification under human assistance, in: AAAI, 2021

work page 2021

[25] [25]

Strong, Q

J. Strong, Q. Men, A. Noble, Towards human-ai collaboration in healthcare: Guided deferral systems with large language models, in: ICML, 2024

work page 2024

[26] [26]

Okati, A

N. Okati, A. De, M. Rodriguez, Differentiable learning under triage, NeurIPS (2021)

work page 2021

[27] [27]

Z. Lu, H. Xie, C. Liu, Y . Zhang, Bridging the gap between vision transformers and convolutional neural networks on small datasets, NeurIPS (2022)

work page 2022

[28] [28]

Sezgin, B

M. Sezgin, B. l. Sankur, Survey over image thresholding techniques and quantitative performance evaluation, J. Electron. Imaging (2004)

work page 2004

[29] [29]

Q. Zeng, Y . Xie, Z. Lu, M. Lu, Y . Wu, Y . Xia, Segment together: A versatile paradigm for semi-supervised medical image segmentation, IEEE TMI (2025)

work page 2025

[30] [30]

J. Long, E. Shelhamer, T. Darrell, Fully convolutional networks for semantic segmentation, in: CVPR, 2015

work page 2015

[31] [31]

Isensee, P

F. Isensee, P. F. Jaeger, S. A. Kohl, J. Petersen, K. H. Maier-Hein, nnu-net: a self-configuring method for deep learning-based biomedical image segmentation, Nature methods (2021)

work page 2021

[32] [32]

Z. Zhou, M. M. Rahman Siddiquee, N. Tajbakhsh, J. Liang, Unet++: A nested u-net architecture for medical image segmentation, in: DLMIA/ML-CDS 2018 @ MICCAI, 2018

work page 2018

[33] [33]

Huang, Z

X. Huang, Z. Deng, D. Li, X. Yuan, Y . Fu, Missformer: An effective transformer for 2d medical image segmen- tation, IEEE TMI (2022)

work page 2022

[34] [34]

Köhler, J

P. Köhler, J. Fadugba, P. Berens, L. M. Koch, Efficiently correcting patch-based segmentation errors to control image-level performance in retinal images, in: MIDL, 2024. 22

work page 2024

[35] [35]

Kirillov, E

A. Kirillov, E. Mintun, N. Ravi, H. Mao, C. Rolland, L. Gustafson, T. Xiao, S. Whitehead, A. C. Berg, W.-Y . Lo, et al., Segment anything, in: ICCV , 2023

work page 2023

[36] [36]

Radford, J

A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, et al., Learning transferable visual models from natural language supervision, in: ICML, 2021

work page 2021

[37] [37]

C. Jia, Y . Yang, Y . Xia, Y .-T. Chen, Z. Parekh, H. Pham, Q. Le, Y .-H. Sung, Z. Li, T. Duerig, Scaling up visual and vision-language representation learning with noisy text supervision, in: ICML, 2021

work page 2021

[38] [38]

J. Ma, Y . He, F. Li, L. Han, C. You, B. Wang, Segment anything in medical images, Nature Communications (2024)

work page 2024

[39] [39]

K. Yan, J. Cai, D. Jin, S. Miao, D. Guo, A. P. Harrison, Y . Tang, J. Xiao, J. Lu, L. Lu, Sam: Self-supervised learning of pixel-wise anatomical embeddings in radiological images, IEEE TMI (2022)

work page 2022

[40] [40]

T. Chen, L. Zhu, C. Deng, R. Cao, Y . Wang, S. Zhang, Z. Li, L. Sun, Y . Zang, P. Mao, Sam-adapter: Adapting segment anything in underperformed scenes, in: ICCV , 2023

work page 2023

[41] [41]

P. L. Bartlett, M. H. Wegkamp, Classification with a reject option using a hinge loss., JMLR (2008)

work page 2008

[42] [42]

A. De, P. Koley, N. Ganguly, M. Gomez-Rodriguez, Regression under human assistance, in: AAAI, 2020

work page 2020

[43] [43]

R. Gao, M. Yin, Confounding-robust deferral policy learning, in: AAAI, 2025

work page 2025

[44] [44]

Verma, E

R. Verma, E. Nalisnick, Calibrated learning to defer with one-vs-all classifiers, in: ICML, 2022

work page 2022

[45] [45]

S. Liu, Y . Cao, Q. Zhang, L. Feng, B. An, Mitigating underfitting in learning to defer with consistent losses, in: AISTATS, 2024

work page 2024

[46] [46]

A. Mao, C. Mohri, M. Mohri, Y . Zhong, Two-stage learning to defer with multiple experts, NeurIPS (2023)

work page 2023

[47] [47]

C. C. Nguyen, T.-T. Do, G. Carneiro, Probabilistic learning to defer: Handling missing expert annotations and controlling workload distribution, in: ICLR, 2025

work page 2025

[48] [48]

A. Mao, M. Mohri, Y . Zhong, Regression with multi-expert deferral, ICML (2024)

work page 2024

[49] [49]

Straitouri, A

E. Straitouri, A. Singla, V . B. Meresht, M. Gomez-Rodriguez, Reinforcement learning under algorithmic triage, arXiv preprint arXiv:2109.11328 (2021)

work page arXiv 2021

[50] [50]

Cortes, G

C. Cortes, G. DeSalvo, M. Mohri, Learning with rejection, in: ALT, 2016

work page 2016

[51] [51]

Litjens, R

G. Litjens, R. Toth, W. Van De Ven, C. Hoeks, S. Kerkstra, B. Van Ginneken, G. Vincent, G. Guillard, N. Birbeck, J. Zhang, et al., Evaluation of prostate segmentation algorithms for mri: the promise12 challenge, MedIA (2014). 23

work page 2014

[52] [52]

Bilic, P

P. Bilic, P. Christ, H. B. Li, E. V orontsov, A. Ben-Cohen, G. Kaissis, A. Szeskin, C. Jacobs, G. E. H. Mamani, G. Chartrand, et al., The liver tumor segmentation benchmark (lits), MedIA (2023)

work page 2023

[53] [53]

Y . Ji, H. Bai, C. Ge, J. Yang, Y . Zhu, R. Zhang, Z. Li, L. Zhanng, W. Ma, X. Wan, et al., Amos: A large-scale abdominal multi-organ benchmark for versatile medical image segmentation, NeurIPS (2022)

work page 2022

[54] [54]

Isensee, T

F. Isensee, T. Wald, C. Ulrich, M. Baumgartner, S. Roy, K. Maier-Hein, P. F. Jaeger, nnu-net revisited: A call for rigorous validation in 3d medical image segmentation, in: MICCAI, 2024

work page 2024

[55] [55]

Bozorgpour, S

A. Bozorgpour, S. G. Kolahi, R. Azad, I. Hacihaliloglu, D. Merhof, Cenet: Context enhancement network for medical image segmentation, MICCAI (2025). Appendix A. Interactive Expert Annotation Interface To demonstrate how DeferredSeg supports real expert collaboration, we implement a Streamlit-based interactive interface that replaces the synthetic experts ...

work page 2025