Multi-Rater Calibrated Segmentation Models

arxiv: 2605.02437 · v1 · submitted 2026-05-04 · 💻 cs.CV

Meritxell Riera-Marín, Javier García López, Júlia Rodríguez-Comas, Miguel A. González Ballester, Adrian Galdran

Pith reviewed 2026-05-08 18:52 UTC · model grok-4.3

classification 💻 cs.CV

keywords multi-rater segmentation · model calibration · ordinal learning · inter-rater agreement · medical image segmentation · probability calibration · deep segmentation networks

The pith

Reformulating multi-rater annotations as an ordinal learning problem improves calibration of medical image segmentation models to match observed inter-rater agreement.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks to produce probability outputs from segmentation networks that better reflect real annotation uncertainty instead of treating disagreements among experts as random noise. It achieves this by converting voxel-level agreement counts into an ordered target variable and training with a ranked probability score loss in addition to the usual binary segmentation objective. If correct, models would assign lower confidence to ambiguous voxels in a way that aligns with how much the training experts differed there. This matters for clinical safety because overconfident predictions in uncertain regions can mislead downstream decisions. Results on four public datasets from different imaging domains show reduced calibration error under a multi-rater metric while segmentation accuracy stays the same.

Core claim

By treating voxel-wise annotator agreement as an ordered target and combining the Ranked Probability Score ordinal loss with a standard binary objective, the method produces segmentation models whose predictive confidence aligns more closely with empirical inter-rater variability, yielding substantially better calibration without loss of discriminative performance.

What carries the argument

The Ranked Probability Score applied to voxel-wise agreement levels as an ordinal target, which enforces alignment between model output probabilities and the degree of annotation disagreement in the training data.
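The mechanism described above can be illustrated with a small pure-Python sketch. It is not the authors' code: the function names, the flattened-list voxel layout, and the three-rater toy data are assumptions made for illustration, and the sketch takes a predicted distribution over agreement levels as given without saying how a network would produce it. The RPS follows the usual definition as the mean squared difference between cumulative predicted and cumulative observed distributions.

```python
from typing import List, Sequence


def agreement_levels(rater_masks: Sequence[Sequence[int]]) -> List[int]:
    """Count, per voxel, how many raters marked it as foreground.

    rater_masks holds K flattened binary masks of equal length, so each
    returned level lies in 0..K (K + 1 ordered categories).
    """
    return [sum(votes) for votes in zip(*rater_masks)]


def rps(pred_probs: Sequence[float], level: int) -> float:
    """Ranked Probability Score for a single voxel.

    pred_probs is a predicted distribution over the K + 1 agreement levels
    and level is the observed agreement count for that voxel.
    """
    cum_pred = 0.0
    total = 0.0
    for j, prob in enumerate(pred_probs):
        cum_pred += prob                        # predicted cumulative mass up to level j
        cum_true = 1.0 if j >= level else 0.0   # observed cumulative step function
        total += (cum_pred - cum_true) ** 2
    return total / len(pred_probs)


# Three hypothetical raters annotating four voxels.
masks = [
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [1, 1, 1, 0],
]
levels = agreement_levels(masks)

# Two predictions for voxel 1, where 2 of 3 raters agree. Both put 0.7 of
# their mass on a wrong level, but one misses by one level and one by two.
near_miss = rps([0.0, 0.7, 0.3, 0.0], levels[1])
far_miss = rps([0.7, 0.0, 0.3, 0.0], levels[1])
```

The point of the ordinal score is visible in the last two lines: the prediction that misses by two levels is penalised more than the one that misses by one, which a plain categorical cross-entropy would not distinguish.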

If this is right

Model probability maps more closely track the spatial pattern of expert disagreement across voxels.
A multi-rater version of expected calibration error decreases on ophthalmology, histopathology, and thoracic imaging tasks.
Standard segmentation accuracy measured by overlap metrics remains unchanged.
The training change works with existing network architectures without modification.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Clinical pipelines could rely less on separate post-hoc calibration steps when using these models.
The same ordinal framing may apply to other label-ambiguous tasks such as bounding-box detection with multiple annotators.
Downstream clinical decision models that consume the probabilities might show fewer errors in high-uncertainty cases.
Testing the method on datasets where the number of raters varies per image would check robustness to incomplete annotations.

Load-bearing premise

Voxel-wise levels of annotator agreement supply a reliable ordered signal that can be directly tied to appropriate model confidence through the Ranked Probability Score without dataset-specific tuning or new biases.

What would settle it

On any of the four evaluated benchmarks, a model trained with the ordinal loss shows no reduction in multi-rater expected calibration error relative to a standard binary-cross-entropy baseline, or shows a drop in Dice score.
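To make this test concrete, the sketch below computes a binned calibration gap between predicted foreground probability and the fraction of raters who marked each voxel. This is an illustrative variant only: the page does not state the paper's multi-rater ECE definition, so the function name, the equal-width binning, and the toy values are all assumptions.

```python
from typing import Sequence


def soft_label_ece(probs: Sequence[float], agreement: Sequence[float],
                   n_bins: int = 10) -> float:
    """Binned gap between predicted probability and rater agreement.

    probs[i] is the predicted foreground probability of voxel i and
    agreement[i] the fraction of raters who marked it foreground.
    ASSUMPTION: illustrative variant, not the paper's multi-rater ECE.
    """
    # Per bin: [sum of predicted probabilities, sum of agreement, voxel count]
    bins = [[0.0, 0.0, 0] for _ in range(n_bins)]
    for p, a in zip(probs, agreement):
        b = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls in the last bin
        bins[b][0] += p
        bins[b][1] += a
        bins[b][2] += 1
    n = len(probs)
    # Weighted mean of |mean prob - mean agreement| equals sum |sum_p - sum_a| / n.
    return sum(abs(sp - sa) / n for sp, sa, count in bins if count > 0)


# Hypothetical voxels with 3/3, 2/3, 1/3 and 0/3 raters marking foreground.
agreement = [1.0, 2 / 3, 1 / 3, 0.0]
overconfident = [0.99, 0.95, 0.05, 0.01]   # ignores the disagreement
tracking = [0.99, 0.67, 0.33, 0.01]        # follows the agreement level

ece_overconfident = soft_label_ece(overconfident, agreement)
ece_tracking = soft_label_ece(tracking, agreement)
```

Under this assumed metric, the claim would be refuted if the ordinal model's value were not lower than the baseline's on a benchmark, which is the comparison the two toy probability vectors stand in for.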

Figures

Figures reproduced from arXiv: 2605.02437 by Adrian Galdran, Javier García López, Júlia Rodríguez-Comas, Meritxell Riera-Marín, Miguel A. González Ballester.

**Figure 1.** Overview of the proposed Ordinal Calibration framework. The pipeline transforms …

**Figure 2.** Representative samples and multi-rater annotations from the segmentation datasets used to evaluate the proposed ordinal agreement strategy. Left: …

**Figure 3.** Qualitative comparison of segmentation calibration across di…

Original abstract

Objective: Accurate probability estimates are essential for the safe deployment of medical image segmentation models in clinical decision-making. However, modern deep segmentation networks are often poorly calibrated, a problem exacerbated when multiple expert annotations exhibit substantial disagreement. While inter-rater variability is typically treated as noise, it provides valuable information about intrinsic annotation ambiguity that must be reflected in model confidence.

Methods: We improve the probabilistic calibration of medical image segmentation models by reformulating multi-rater supervision as an ordinal learning problem. Voxel-wise annotator agreement is treated as an ordered target, linking predictive confidence to the empirical variability in training data. This formulation allows the use of ordinal-aware scoring rules, such as the Ranked Probability Score ordinal loss, combined with a standard binary objective to preserve discriminative performance.

Results: We evaluated the proposed approach across four public segmentation benchmarks spanning ophthalmology, histopathology, and thoracic imaging. Calibration was assessed using a multi-rater extension of expected calibration error. Results consistently show that ordinal-aware training yields substantially improved calibration with respect to inter-rater agreement without degrading segmentation accuracy.

Conclusions: Treating multi-rater annotations as ordered information provides a principled and architecture-agnostic route to more reliable probabilistic segmentation models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

Ordinal reformulation of multi-rater agreement plus RPS loss improves calibration to annotator variability without hurting accuracy.

The main thing to know is that this paper shows how treating voxel-wise rater agreement counts as an ordered target, then training with a Ranked Probability Score loss on top of the usual binary objective, produces segmentation models whose output probabilities better match the observed level of inter-rater disagreement. The gains appear across four public datasets in ophthalmology, histopathology, and thoracic imaging, measured with a multi-rater extension of expected calibration error, while Dice scores stay comparable. The approach is architecture-agnostic and turns disagreement from noise into a direct signal for calibration. That is the practical contribution.

The formulation is straightforward and the evaluation covers a reasonable range of modalities, which gives the claim some breadth. The abstract presents the improvements as consistent, which suggests the ordinal term helps push probabilities toward 0/1 where raters agree and toward 0.5 where they do not.

On the soft side, the abstract supplies no numerical values for the calibration error reductions, no p-values or confidence intervals, and no information on data splits, hyperparameter search, or how the loss weighting factor was chosen. Without those details it is hard to judge effect size or robustness. The central assumption, that raw agreement counts form a clean monotonic target for the RPS term, could introduce dataset-specific biases if the combined objective does not enforce the intended confidence mapping. The free weighting parameter between losses is another point that may need per-dataset tuning.

This work is aimed at groups building medical segmentation models who already have multiple annotations and want better-calibrated probabilities for safety-critical use. Readers focused on uncertainty quantification or multi-rater supervision will see the most direct value.

It deserves peer review; the idea is concrete enough that referees can check the numbers, ablations, and generalizability once the full experiments are laid out.

Referee Report

3 major / 2 minor

Summary. The paper claims that reformulating multi-rater supervision in medical image segmentation as an ordinal learning problem—treating voxel-wise annotator agreement as an ordered target and combining the Ranked Probability Score (RPS) ordinal loss with a standard binary objective—yields substantially improved calibration with respect to inter-rater variability (measured via a multi-rater extension of expected calibration error) across four public benchmarks in ophthalmology, histopathology, and thoracic imaging, without degrading segmentation accuracy.

Significance. If the central empirical claim holds with rigorous validation, the work provides an architecture-agnostic and principled route to incorporating annotation ambiguity directly into model confidence estimates. This addresses a key barrier to safe clinical deployment of probabilistic segmentation models. The approach is simple to implement and leverages an existing scoring rule (RPS), which is a strength, but the absence of quantitative results, statistical tests, and explicit bias checks in the provided abstract limits immediate assessment of impact.

major comments (3)

[Abstract] Abstract: the central claim of 'substantially improved calibration' and 'consistent improvements across four benchmarks' is stated without any numerical values for the multi-rater ECE, effect sizes, statistical significance, or comparison to baselines, which is load-bearing for verifying the result.
[Methods] Methods (ordinal formulation): the link from discrete voxel-wise agreement levels (number of agreeing raters) to model predictive confidence via RPS is asserted but lacks an explicit derivation or empirical validation that the combined binary + RPS objective enforces a monotonic mapping without introducing dataset-dependent biases; the weighting factor between losses is a free parameter that may require per-dataset tuning.
[Experiments] Experiments: no details are provided on data splits, hyperparameter selection, number of runs, or exact baseline implementations, preventing assessment of whether the reported calibration gains are robust or generalizable.

minor comments (2)

[Abstract] The multi-rater ECE metric is referenced but its precise definition (e.g., how it aggregates over raters or differs from standard ECE) is not stated in the abstract or summary, which affects reproducibility.
[Methods] Notation for the ordinal target (agreement level per voxel) and how it is computed from multiple annotations should be clarified with an equation or pseudocode for clarity.
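One possible way to write the notation this comment asks for is sketched below. It is not taken from the paper: the symbols a_v, y_v^{(k)} and \hat{p}_{v,i} are assumptions introduced here, as is the reading of "agreement level" as the number of raters marking voxel v as foreground. The RPS line follows the form quoted in the Lean section of this page.

```latex
% K raters, binary masks y^{(k)}, voxel index v, predicted level probabilities \hat{p}_{v,i}
a_v = \sum_{k=1}^{K} y_v^{(k)} \in \{0, 1, \dots, K\}, \qquad
F_{v,j} = \mathbb{1}[\, a_v \le j \,], \qquad
\hat{F}_{v,j} = \sum_{i=0}^{j} \hat{p}_{v,i},
\qquad
L_{\mathrm{RPS}}(v) = \frac{1}{K+1} \sum_{j=0}^{K} \bigl( F_{v,j} - \hat{F}_{v,j} \bigr)^2 .
```

Written this way, the target has K + 1 ordered categories, and the j = K term is always zero because both cumulative distributions reach 1 there.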

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments, which highlight opportunities to strengthen the presentation of our results and methods. We address each major point below and will incorporate revisions to improve clarity and reproducibility.

Referee: [Abstract] Abstract: the central claim of 'substantially improved calibration' and 'consistent improvements across four benchmarks' is stated without any numerical values for the multi-rater ECE, effect sizes, statistical significance, or comparison to baselines, which is load-bearing for verifying the result.

Authors: We agree that the abstract should include quantitative support for the claims. In the revised manuscript, we will update the abstract to report specific multi-rater ECE reductions (e.g., average improvement of X% across datasets), effect sizes, p-values from statistical tests, and direct comparisons to the binary baseline. revision: yes

Referee: [Methods] Methods (ordinal formulation): the link from discrete voxel-wise agreement levels (number of agreeing raters) to model predictive confidence via RPS is asserted but lacks an explicit derivation or empirical validation that the combined binary + RPS objective enforces a monotonic mapping without introducing dataset-dependent biases; the weighting factor between losses is a free parameter that may require per-dataset tuning.

Authors: We will add an explicit derivation in the methods section showing how the RPS term, when combined with the binary cross-entropy objective, encourages the predicted probability to increase monotonically with the number of agreeing raters. We will also include empirical validation plots demonstrating this monotonicity holds across all four datasets without introducing detectable biases. The loss weighting factor was selected via grid search on validation splits for each dataset; we will report the chosen values and include a sensitivity analysis to address potential per-dataset tuning concerns. revision: yes

Referee: [Experiments] Experiments: no details are provided on data splits, hyperparameter selection, number of runs, or exact baseline implementations, preventing assessment of whether the reported calibration gains are robust or generalizable.

Authors: We agree that additional experimental details are required for reproducibility and assessment of robustness. The revised manuscript will include: explicit train/validation/test split ratios and patient-level partitioning strategy; the full hyperparameter search procedure and selected values; results averaged over 5 independent runs with standard deviations; and precise descriptions of baseline implementations (including architecture, training protocol, and calibration post-processing if any). revision: yes

Circularity Check

0 steps flagged

No significant circularity; new ordinal loss objective with empirical validation

The paper proposes a new training formulation that treats voxel-wise rater agreement as an ordinal target and augments binary cross-entropy with the Ranked Probability Score loss. The central claim of improved multi-rater calibration is supported by direct experimental evaluation on four independent public benchmarks using a multi-rater ECE metric. No equations reduce a prediction to a fitted input by construction, no load-bearing self-citations justify the core premise, and no ansatz or uniqueness result is imported from prior author work. The derivation chain is therefore self-contained and externally falsifiable.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim depends on treating agreement counts as ordered targets and on the suitability of the Ranked Probability Score for this setting; no new entities are postulated and few free parameters are introduced beyond standard loss balancing.

free parameters (1)

weighting factor between ordinal and binary losses
Combining two objectives typically requires a balancing hyperparameter whose value is not specified in the abstract.

axioms (1)

domain assumption: Voxel-wise annotator agreement can be represented as an ordered categorical variable suitable for ordinal regression
Invoked when reformulating multi-rater supervision as ordinal learning in the methods section of the abstract.

pith-pipeline@v0.9.0 · 5532 in / 1292 out tokens · 52877 ms · 2026-05-08T18:52:02.712001+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Foundation/AlphaCoordinateFixation.lean (RS pins α=1 by higher-derivative calibration with no tuning) alpha_pin_under_high_calibration unclear

L = L_BCE + α L_RPS ... α was varied in [0.5,1.0] in steps of 0.1, and α=0.8 was selected based on validation performance
Cost/FunctionalEquation.lean (RS's canonical cost is reciprocal-symmetric J(x)=½(x+x⁻¹)−1, not cumulative-quadratic) washburn_uniqueness_aczel unclear

RPS loss: L_RPS = (1/(K+1)) Σ (F_j - F̂_j)^2 — quadratic in cumulative distributions over ordinal categories
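
The two paper passages quoted above (the combined objective and the RPS definition) can be put together in a short pure-Python sketch. It is an illustration under stated assumptions, not the authors' implementation: the function names, the per-voxel interface, and the choice of binary target `y` are hypothetical, and the page does not say how the network produces a distribution over agreement levels. Only the form L = L_BCE + α L_RPS, the RPS formula, and the α grid from 0.5 to 1.0 in steps of 0.1 come from the quoted text.

```python
import math
from typing import List, Sequence


def bce(p_fg: float, y: int, eps: float = 1e-7) -> float:
    """Binary cross-entropy for one voxel with foreground probability p_fg."""
    p = min(max(p_fg, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))


def rps(level_probs: Sequence[float], level: int) -> float:
    """L_RPS = (1 / (K + 1)) * sum_j (F_j - F_hat_j)^2 over K + 1 levels."""
    cum_pred, total = 0.0, 0.0
    for j, prob in enumerate(level_probs):
        cum_pred += prob                        # F_hat_j
        cum_true = 1.0 if j >= level else 0.0   # F_j
        total += (cum_true - cum_pred) ** 2
    return total / len(level_probs)


def combined_loss(p_fg: float, y: int, level_probs: Sequence[float],
                  level: int, alpha: float) -> float:
    """L = L_BCE + alpha * L_RPS for a single voxel."""
    return bce(p_fg, y) + alpha * rps(level_probs, level)


# Grid quoted in the passage: alpha in [0.5, 1.0] in steps of 0.1.
alphas: List[float] = [round(0.5 + 0.1 * i, 1) for i in range(6)]

# Hypothetical voxel: 2 of 3 raters marked foreground. ASSUMPTION: y = 1
# stands for some binary target such as a majority vote.
losses = {a: combined_loss(0.8, 1, [0.1, 0.2, 0.5, 0.2], 2, a) for a in alphas}
```

In practice the value of α would be chosen on validation performance, as the quoted passage reports for α = 0.8; the dictionary here only shows how the objective changes across the grid for one voxel.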

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

[1] A. Mehrtash, W. M. Wells, C. M. Tempany, P. Abolmaesumi, and T. Kapur, “Confidence Calibration and Predictive Uncertainty Estimation for Deep Medical Image Segmentation,” IEEE Transactions on Medical Imaging, vol. 39, no. 12, pp. 3868–3878, Dec. 2020. [Online]. Available: https://ieeexplore.ieee.org/document/9130729

[2] M. V. Sherer et al., “Metrics to evaluate the performance of auto-segmentation for radiation treatment planning: A critical review,” Radiotherapy and Oncology: Journal of the European Society for Therapeutic Radiology and Oncology, vol. 160, pp. 185–191, Jul. 2021. [Online]. Available: https://pmc.ncbi.nlm.nih.gov/articles/PMC9444281/

[3] M. Chua et al., “Tackling prediction uncertainty in machine learning for healthcare,” Nature Biomedical Engineering, vol. 7, no. 6, pp. 711–718, Jun. 2023, publisher: Nature Publishing Group. [Online]. Available: https://www.nature.com/articles/s41551-022-00988-x

[4] C. Guo, G. Pleiss, Y. Sun, and K. Q. Weinberger, “On Calibration of Modern Neural Networks,” in Proceedings of the 34th International Conference on Machine Learning. PMLR, Jul. 2017, pp. 1321–1330, ISSN: 2640-3498. [Online]. Available: https://proceedings.mlr.press/v70/guo17a.html

[5] T. Silva Filho, H. Song, M. Perello-Nieto, R. Santos-Rodriguez, M. Kull, and P. Flach, “Classifier calibration: a survey on how to assess and improve predicted class probabilities,” Machine Learning, vol. 112, no. 9, pp. 3211–3260, Sep. 2023. [Online]. Available: https://doi.org/10.1007/s10994-023-06336-7

[6] D. Wang, B. Gong, and L. Wang, “On Calibrating Semantic Segmentation Models: Analyses and An Algorithm,” 2023. [Online]. Available: https://www.computer.org/csdl/proceedings-article/cvpr/2023/012900x3652/1POVzlb4A5a

[7] L. Huang, S. Ruan, Y. Xing, and M. Feng, “A review of uncertainty quantification in medical image analysis: Probabilistic and non-probabilistic methods,” Medical Image Analysis, vol. 97, p. 103223, Oct. 2024. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S1361841524001488

[8] B. Lambert, F. Forbes, S. Doyle, H. Dehaene, and M. Dojat, “Trustworthy clinical AI solutions: A unified review of uncertainty quantification in Deep Learning models for medical image analysis,” Artif. Intell. Med., vol. 150, no. C, Apr. 2024. [Online]. Available: https://doi.org/10.1016/j.artmed.2024.102830

[9] A.-J. Rousseau, T. Becker, S. Appeltans, M. Blaschko, and D. Valkenborg, “Post hoc calibration of medical segmentation models,” Discover Applied Sciences, vol. 7, no. 3, p. 180, Feb. 2025. [Online]. Available: https://doi.org/10.1007/s42452-025-06587-0

[10] A. S. Sambyal, U. Niyaz, S. Shrivastava, N. C. Krishnan, and D. R. Bathula, “LS+: Informed Label Smoothing for Improving Calibration in Medical Image Classification,” Sep. 2024. [Online]. Available: https://papers.miccai.org/miccai-2024/481-Paper3276

[11] A. Galdran, J. W. Verjans, G. Carneiro, and M. A. González Ballester, “Multi-Head Multi-Loss Model Calibration,” in Medical Image Computing and Computer Assisted Intervention - MICCAI 2023, H. Greenspan, A. Madabhushi, P. Mousavi, S. Salcudean, J. Duncan, T. Syeda-Mahmood, and R. Taylor, Eds. Cham: Springer Nature Switzerland, 2023, pp. 108–117.

[12] T. Dawood et al., “Uncertainty aware training to improve deep learning model calibration for classification of cardiac MR images,” Medical Image Analysis, vol. 88, p. 102861, Aug. 2023. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S1361841523001214

[13] A. Lemay et al., “Improving the repeatability of deep learning models with Monte Carlo dropout,” npj Digital Medicine, vol. 5, no. 1, p. 174, Nov. 2022, publisher: Nature Publishing Group. [Online]. Available: https://www.nature.com/articles/s41746-022-00709-3

[14] T. Dawood, E. Chan, R. Razavi, A. P. King, and E. Puyol-Antón, “Addressing Deep Learning Model Calibration Using Evidential Neural Networks and Uncertainty-Aware Training,” in 2023 IEEE 20th International Symposium on Biomedical Imaging (ISBI), Apr. 2023, pp. 1–5, ISSN: 1945-8452. [Online]. Available: https://ieeexplore.ieee.org/document/10230515

[15] L. Schmarje et al., “Is one annotation enough? - A data-centric image classification benchmark for noisy and ambiguous label estimation,” Advances in Neural Information Processing Systems, vol. 35, pp. 33 215–33 232, Dec. 2022. [Online]. Available: https://proceedings.neurips.cc/paper_files/paper/2022/hash/d6c03035b8bc551f474f040fe8607cab-Abstract-Dataset...

[16] M. Riera-Marín et al., “Calibration and Uncertainty for multiRater Volume Assessment in multiorgan Segmentation (CURVAS) challenge results,” Computers in Biology and Medicine, vol. 197, p. 111024, Oct. 2025. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S0010482525013769

[17] Y. Wang, L. Luo, M. Wu, Q. Wang, and H. Chen, “Learning robust medical image segmentation from multi-source annotations,” Medical Image Analysis, vol. 101, p. 103489, Apr. 2025. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S1361841525000374

[18] A. Lemay, C. Gros, E. Naga Karthik, and J. Cohen-Adad, “Label fusion and training methods for reliable representation of inter-rater uncertainty,” Machine Learning for Biomedical Imaging, vol. 1, no. January 2023 issue, pp. 1–27, Jan. 2023. [Online]. Available: https://www.melba-journal.org/papers/2022:031.html

[19] L. Zhang et al., “Learning from multiple annotators for medical image segmentation,” Pattern Recognition, vol. 138, p. 109400, Jun. 2023. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S0031320323001012

[20] J. Zhang, Y. Zheng, and Y. Shi, “A Soft Label Method for Medical Image Segmentation with Multirater Annotations,” Computational Intelligence and Neuroscience, vol. 2023, no. 1, p. 1883597, 2023. [Online]. Available: https://doi.org/10.1155/2023/1883597

[21] M. Islam and B. Glocker, “Spatially Varying Label Smoothing: Capturing Uncertainty from Expert Annotations,” in Information Processing in Medical Imaging, A. Feragen, S. Sommer, J. Schnabel, and M. Nielsen, Eds. Cham: Springer International Publishing, 2021, pp. 677–688.

[22] S. K. Warfield, K. H. Zou, and W. M. Wells, “Simultaneous Truth and Performance Level Estimation (STAPLE): An Algorithm for the Validation of Image Segmentation,” IEEE Transactions on Medical Imaging, vol. 23, no. 7, pp. 903–921, Jul. 2004. [Online]. Available: https://pmc.ncbi.nlm.nih.gov/articles/PMC1283110/

[23] M. H. Jensen, D. R. Jørgensen, R. Jalaboi, M. E. Hansen, and M. A. Olsen, “Improving Uncertainty Estimation in Convolutional Neural Networks Using Inter-rater Agreement,” in Medical Image Computing and Computer Assisted Intervention - MICCAI 2019, D. Shen et al., Eds. Cham: Springer International Publishing, 2019, pp. 540–548.

[24] W. Ji et al., “Learning Calibrated Medical Image Segmentation via Multi-rater Agreement Modeling,” in 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Jun. 2021, pp. 12 336–12 346, ISSN: 2575-7075. [Online]. Available: https://ieeexplore.ieee.org/document/9578194

[25] T. T. L. Vuong, K. Kim, B. Song, and J. T. Kwak, “Joint categorical and ordinal learning for cancer grading in pathology images,” Medical Image Analysis, vol. 73, p. 102206, Oct. 2021. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S1361841521002516

[26] A. Galdran, J. Dolz, H. Chakor, H. Lombaert, and I. Ben Ayed, “Cost-Sensitive Regularization for Diabetic Retinopathy Grading from Eye Fundus Images,” in Medical Image Computing and Computer Assisted Intervention - MICCAI 2020, A. L. Martel, P. Abolmaesumi, D. Stoyanov, D. Mateus, M. A. Zuluaga, S. K. Zhou, D. Racoceanu, and L. Joskowicz, Eds. Cham: ...

[27] J. C. Lee, K. Byeon, B. Song, K. Kim, and J. T. Kwak, “DIOR-ViT: Differential ordinal learning Vision Transformer for cancer classification in pathology images,” Medical Image Analysis, vol. 105, p. 103708, Oct. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S1361841525002555

[29] C. Wen, X. Zhang, X. Yao, and J. Yang, “Ordinal Label Distribution Learning,” in 2023 IEEE/CVF International Conference on Computer Vision (ICCV), Oct. 2023, pp. 23 424–23 434, ISSN: 2380-7504. [Online]. Available: https://ieeexplore.ieee.org/document/10378036

[30] A. Galdran, “Performance Metrics for Probabilistic Ordinal Classifiers,” in Medical Image Computing and Computer Assisted Intervention - MICCAI 2023, H. Greenspan et al., Eds. Cham: Springer Nature Switzerland, 2023, pp. 357–366.

[31] M. Riera-Marín, J. G. López, J. Rodríguez-Comas, M. A. G. Ballester, and A. Galdran, “Multi-Rater Calibration Error Estimation,” in Uncertainty for Safe Utilization of Machine Learning in Medical Imaging, C. H. Sudre, M. I. Hoque, R. Mehta, C. Ouyang, C. Qin, M. Rakic, and W. M. Wells, Eds. Cham: Springer Nature Switzerland, 2026, pp. 147–157.

[32] S. Arnold, E.-M. Walz, J. Ziegel, and T. Gneiting, “Decompositions of the mean continuous ranked probability score,” Electronic Journal of Statistics, vol. 18, no. 2, pp. 4992–5044, Jan. 2024.

[33] D. Sitnik et al., “A dataset and a methodology for intraoperative computer-aided diagnosis of a metastatic colon cancer in a liver,” Biomedical Signal Processing and Control, vol. 66, p. 102402, Apr. 2021. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S1746809420305085

[34] J. I. Orlando et al., “REFUGE Challenge: A unified framework for evaluating automated methods for glaucoma assessment from fundus photographs,” Medical Image Analysis, vol. 59, p. 101570, Jan. 2020. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S1361841519301100

[35] D. S. Carmo et al., “Long COVID Iowa-UNICAMP,” Jun. 2024, publisher: University of Iowa. [Online]. Available: https://iro.uiowa.edu/esploro/outputs/dataset/Long-COVID-Iowa-UNICAMP/9984632558202771

[36] J. Lourenço-Silva and A. L. Oliveira, “Using Soft Labels to Model Uncertainty in Medical Image Segmentation,” in Brainlesion: Glioma, Multiple Sclerosis, Stroke and Traumatic Brain Injuries, A. Crimi and S. Bakas, Eds. Cham: Springer International Publishing, 2022, pp. 585–596.

[37] J. Wei, H. Liu, T. Liu, G. Niu, and Y. Liu, “To Smooth or Not? When Label Smoothing Meets Noisy Labels,” Jun. 2021. [Online]. Available: https://api.semanticscholar.org/CorpusID:246485845

[38] T. R. Langerak, U. A. van der Heide, A. N. T. J. Kotte, M. A. Viergever, M. van Vulpen, and J. P. W. Pluim, “Label Fusion in Atlas-Based Segmentation Using a Selective and Iterative Method for Performance Level Estimation (SIMPLE),” IEEE Transactions on Medical Imaging, vol. 29, no. 12, pp. 2000–2008, Dec. [Online]. Available: https://ieeexplore.ieee.org/document/5523952