Explaining Uncertainty in Multiple Sclerosis Cortical Lesion Segmentation Beyond Prediction Errors

Adrien Depeursinge; Alessandro Cagol; Anna Stölting; Cristina Granziera; Daniel Reich; Delphine Ribes; Erin S. Beck; Haris Tsagkas; Henning Müller; Mario Ocampo-Pineda

arxiv: 2504.04814 · v3 · submitted 2025-04-07 · 📡 eess.IV · cs.CV

Nataliia Molchanova, Pedro M. Gordaliza, Alessandro Cagol, Mario Ocampo-Pineda, Po-Jui Lu, Matthias Weigel, Xinjie Chen, Erin S. Beck, Haris Tsagkas, Daniel Reich, Anna Stölting, Pietro Maggi, Delphine Ribes, Adrien Depeursinge, Cristina Granziera, Henning Müller, Meritxell Bach Cuadra

Pith reviewed 2026-05-22 21:07 UTC · model grok-4.3

keywords multiple sclerosis · cortical lesion segmentation · uncertainty quantification · deep ensembles · interpretability · lesion size · explainable AI · image segmentation

The pith

Instance-wise uncertainty in cortical lesion segmentation relates to lesion size, shape, and cortical involvement.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces an interpretability framework to analyze what drives predictive uncertainty in deep ensemble models for cortical lesion segmentation in multiple sclerosis. It moves beyond uncertainty-error correlations to examine clinical and imaging factors such as lesion size, shape, and cortical involvement. The analysis finds strong relations between these factors and uncertainty, which match the elements expert raters identify as reducing their own annotation confidence. The framework is tested across two datasets with 206 patients and nearly 2000 lesions, covering both standard and distribution-shifted conditions.

Core claim

The central claim is that instance-wise uncertainty is strongly related to lesion size, shape, and cortical involvement. Expert rater feedback confirms that similar factors impede annotator confidence. Evaluations on two datasets (206 patients, almost 2000 lesions) under both in-domain and distribution-shift conditions highlight the utility of the framework in different scenarios.

What carries the argument

The interpretability framework for lesion-scale predictive uncertainty that relates deep ensemble outputs to medical factors like size, shape, and cortical involvement rather than prediction errors alone.
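As a rough illustration of what a lesion-scale uncertainty value could look like, the sketch below averages the foreground probabilities of several ensemble members and then takes the mean voxel entropy over one lesion. This is an assumption made for illustration: this page does not state which uncertainty measure the paper uses, and the function names and toy probabilities are hypothetical.

```python
import math


def binary_entropy(p):
    """Entropy (in nats) of a Bernoulli variable with foreground probability p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log(p) + (1.0 - p) * math.log(1.0 - p))


def lesion_uncertainty(member_probs):
    """Mean voxel entropy of the ensemble-averaged prediction for one lesion.

    member_probs: list with one entry per ensemble member; each entry is a
    list of foreground probabilities, one per voxel of the lesion.
    """
    n_members = len(member_probs)
    n_voxels = len(member_probs[0])
    total = 0.0
    for v in range(n_voxels):
        # Average the members first, then measure entropy of the mean.
        mean_p = sum(member[v] for member in member_probs) / n_members
        total += binary_entropy(mean_p)
    return total / n_voxels


# Hypothetical toy data: three ensemble members, two lesions.
confident_lesion = [[0.95, 0.97, 0.96], [0.94, 0.98, 0.95], [0.96, 0.96, 0.97]]
ambiguous_lesion = [[0.9, 0.2], [0.3, 0.8], [0.4, 0.5]]

u_confident = lesion_uncertainty(confident_lesion)
u_ambiguous = lesion_uncertainty(ambiguous_lesion)
```

A per-lesion score of this kind is what would then be related to lesion size, shape, or cortical involvement. Other aggregations (for example, maximum voxel entropy or disagreement between member segmentations) are equally plausible choices.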

If this is right

Uncertainty values track lesion size, shape, and cortical involvement more than raw prediction errors.
The same lesion properties that raise model uncertainty also lower human annotator confidence.
The framework remains informative under both in-domain and distribution-shift conditions.
Results hold across two independent datasets covering 206 patients and nearly 2000 lesions.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Segmentation models could be retrained with targeted augmentation on small or irregularly shaped lesions to lower uncertainty in those cases.
Uncertainty maps might serve as a triage tool to flag lesions for mandatory human review based on size and shape cues.
The same analysis approach could be applied to other lesion segmentation tasks where boundary and location properties vary widely.

Load-bearing premise

The assumption that uncertainty estimates produced by deep ensembles capture the same factors that reduce human annotator confidence, as validated only through qualitative expert feedback rather than quantitative inter-rater agreement metrics on the same lesions.

What would settle it

A direct quantitative comparison of uncertainty values against inter-rater agreement scores for the identical set of lesions, checking whether higher uncertainty aligns with lower agreement.
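One way such a comparison could be set up is a rank correlation between per-lesion uncertainty and per-lesion agreement. The sketch below implements Spearman's coefficient in plain Python. The choice of Spearman correlation and the toy values are assumptions for illustration; they are not data or methods from the paper.

```python
def average_ranks(values):
    """Return 1-based ranks, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical per-lesion values: model uncertainty and inter-rater agreement.
uncertainty = [0.10, 0.25, 0.40, 0.55, 0.70]
agreement = [0.92, 0.85, 0.70, 0.66, 0.41]

rho = spearman(uncertainty, agreement)
```

A strongly negative `rho` on real data would support the premise that higher model uncertainty aligns with lower rater agreement; a value near zero would undercut it.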

Original abstract

Trustworthy artificial intelligence (AI) is essential in healthcare, particularly for high-stakes tasks like medical image segmentation. Explainable AI and uncertainty quantification significantly enhance AI reliability by addressing key attributes such as robustness, usability, and explainability. Despite extensive technical advances in uncertainty quantification for medical imaging, understanding the clinical informativeness and interpretability of uncertainty remains limited. This study presents an interpretability framework for analyzing lesion-scale predictive uncertainty in cortical lesion segmentation in multiple sclerosis using deep ensembles. The analysis shifts the focus from the uncertainty-error relationship towards clinically relevant medical and engineering factors. Our findings reveal that instance-wise uncertainty is strongly related to lesion size, shape, and cortical involvement. Expert rater feedback confirms that similar factors impede annotator confidence. Evaluations conducted on two datasets (206 patients, almost 2000 lesions) under both in-domain and distribution-shift conditions highlight the utility of the framework in different scenarios.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper shifts uncertainty analysis in MS cortical lesion segmentation toward lesion size, shape and cortical involvement with some expert input, but the human alignment rests on qualitative feedback only.

The main thing here is the move away from uncertainty-error correlation toward linking deep-ensemble uncertainty to concrete lesion properties: size, shape, and cortical involvement. They run this on two datasets covering 206 patients and nearly 2000 lesions, including distribution-shift cases, which gives the correlations some breadth. Expert rater comments are included to suggest the same factors affect human annotators. That framing is a reasonable practical step for making uncertainty maps more usable in neurology workflows. The analysis stays empirical and dataset-driven rather than circular, and the sample size is decent for lesion-level work.

The soft spot is the human side. The claim that ensemble uncertainty captures the same drivers as annotator confidence rests on qualitative feedback alone; no lesion-wise inter-rater Dice, disagreement rates, or other quantitative metrics on the identical lesions are mentioned. The abstract also omits effect sizes and statistical tests, so the strength of the reported relations is hard to judge.

This is aimed at researchers doing uncertainty quantification in medical imaging who want a more clinically grounded interpretation. A reader already working on MS segmentation or explainable AI would get some value from the framing and the shift/out-of-domain checks. It is not a methods breakthrough, but the question it asks is useful enough that it should go to peer review rather than desk rejection; the main fixes would be adding quantitative human metrics and clearer effect sizes.

Referee Report

2 major / 2 minor

Summary. The paper presents an interpretability framework for lesion-scale predictive uncertainty in multiple sclerosis cortical lesion segmentation using deep ensembles. It shifts analysis from the uncertainty-error relationship to clinically relevant factors, claiming that instance-wise uncertainty is strongly related to lesion size, shape, and cortical involvement, with expert rater feedback indicating that similar factors impede annotator confidence. Evaluations are performed on two datasets (206 patients, nearly 2000 lesions) under in-domain and distribution-shift conditions.

Significance. If the central correlations hold after addressing validation gaps, the framework offers a practical approach to interpreting uncertainty estimates in medical segmentation beyond raw error metrics, linking them to interpretable clinical properties. The multi-dataset evaluation including distribution shifts and the focus on instance-wise (lesion-level) analysis are strengths that could improve AI trustworthiness in neurology imaging.

major comments (2)

[Expert rater feedback] The claim that ensemble uncertainty and human annotator confidence are driven by the same factors (size, shape, cortical involvement) rests on qualitative expert feedback alone. No quantitative metrics such as lesion-wise inter-rater Dice scores, disagreement rates, or agreement statistics are reported on the identical lesions used for the uncertainty analysis, leaving the alignment between the two uncertainty sources under-supported.
[Results] The reported correlations between uncertainty and lesion properties lack quantitative effect sizes, correlation coefficients, p-values, or statistical tests. Details on measurement protocols for lesion size, shape, and cortical involvement (e.g., how boundaries were defined, controls for confounding variables like scanner type) are also absent, weakening the 'strongly related' claim.
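For readers unfamiliar with the metric named in the first major comment, the sketch below computes a Dice score between two raters' delineations of a single lesion, each represented as a set of voxel coordinates. The coordinates are hypothetical toy data, and how lesions would be matched between raters is left out; it is shown only to make the requested quantity concrete.

```python
def dice(mask_a, mask_b):
    """Dice overlap between two voxel sets: 2 * |A & B| / (|A| + |B|)."""
    a, b = set(mask_a), set(mask_b)
    if not a and not b:
        # Two empty masks agree perfectly by convention.
        return 1.0
    return 2.0 * len(a & b) / (len(a) + len(b))


# Hypothetical voxel coordinates (x, y, z) marked by two raters for one lesion.
rater_1 = {(10, 12, 5), (10, 13, 5), (11, 12, 5), (11, 13, 5)}
rater_2 = {(10, 13, 5), (11, 12, 5), (11, 13, 5), (12, 13, 5)}

score = dice(rater_1, rater_2)  # 3 shared voxels, 4 + 4 in total: 0.75
```

Computing this per lesion requires at least two independent annotations of the same lesions.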

minor comments (2)

[Abstract] The abstract states 'almost 2000 lesions' without a precise count; providing the exact number would improve reproducibility.
[Methods] Notation for uncertainty metrics and lesion properties could be standardized across figures and text for clarity.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment point by point below, indicating where revisions have been made or where limitations prevent full resolution.

Referee: [Expert rater feedback] The claim that ensemble uncertainty and human annotator confidence are driven by the same factors (size, shape, cortical involvement) rests on qualitative expert feedback alone. No quantitative metrics such as lesion-wise inter-rater Dice scores, disagreement rates, or agreement statistics are reported on the identical lesions used for the uncertainty analysis, leaving the alignment between the two uncertainty sources under-supported.

Authors: We agree that the alignment between ensemble uncertainty and annotator confidence relies on qualitative expert feedback rather than quantitative inter-rater metrics. We do not possess multiple independent annotations on the identical lesions analyzed for uncertainty, precluding computation of lesion-wise Dice scores or agreement statistics. In the revised manuscript we have added explicit clarification in the discussion that this support is qualitative only, to prevent overstatement of the claim.
revision: partial

Referee: [Results] The reported correlations between uncertainty and lesion properties lack quantitative effect sizes, correlation coefficients, p-values, or statistical tests. Details on measurement protocols for lesion size, shape, and cortical involvement (e.g., how boundaries were defined, controls for confounding variables like scanner type) are also absent, weakening the 'strongly related' claim.

Authors: We thank the referee for highlighting these omissions. The revised manuscript now reports correlation coefficients, effect sizes, p-values, and appropriate statistical tests for the relationships between uncertainty and lesion properties. We have also expanded the methods section with detailed protocols for measuring lesion size, shape, and cortical involvement, including boundary definitions and controls for confounders such as scanner type.
revision: yes

standing simulated objections not resolved

We lack multiple expert annotations on the identical lesions, which prevents reporting quantitative inter-rater metrics such as Dice scores or disagreement rates.

Circularity Check

0 steps flagged

No significant circularity; empirical analysis is data-driven and self-contained

The paper reports empirical correlations between ensemble-derived uncertainty and lesion properties (size, shape, cortical involvement) observed across two datasets with nearly 2000 lesions. These relations are measured directly from the data rather than derived via equations that reduce to fitted inputs or self-definitions. No load-bearing self-citations, uniqueness theorems, or ansatzes are invoked to force the central claims. The qualitative expert feedback serves as supplementary interpretation and does not close a definitional loop. The work is therefore self-contained against external benchmarks with no reduction of predictions to inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities are stated. The work relies on standard assumptions of deep ensemble uncertainty estimation and the validity of expert rater feedback as a proxy for clinical difficulty.

pith-pipeline@v0.9.0 · 5766 in / 1060 out tokens · 26963 ms · 2026-05-22T21:07:03.111818+00:00 · methodology

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Lost in the Folds: When Cross-Validation Is Not a Deep Ensemble for Uncertainty Estimation
cs.CV 2026-05 unverdicted novelty 7.0

K-fold CV ensembles and deep ensembles produce distinct uncertainty behaviors, with deep ensembles improving calibration and failure detection while CV ensembles correlate more with inter-rater variability.

Lost in the Folds: When Cross-Validation Is Not a Deep Ensemble for Uncertainty Estimation
cs.CV 2026-05 conditional novelty 6.0

K-fold CV ensembles differ from deep ensembles in uncertainty properties for medical segmentation, with DE improving calibration and failure detection while CV ensembles can better correlate with inter-rater variability.