arxiv: 2604.11171 · v1 · submitted 2026-04-13 · 💻 cs.CV

Recognition: unknown

Development and evaluation of CADe systems in low-prevalence setting: The RARE25 challenge for early detection of Barrett's neoplasia

Tim J.M. Jaspers , Francisco Caetano , Cris H.B. Claessens , Carolus H.J. Kusters , Rixta A.H. van Eijck van Heslinga , Floor Slooter , Jacques J. Bergman , Peter H.N. De With

show 3 more authors

Martijn R. Jong Albert J. de Groof Fons van der Sommen

Authors on Pith no claims yet

Pith reviewed 2026-05-10 16:13 UTC · model grok-4.3

classification 💻 cs.CV

keywords computer-aided detectionBarrett's esophagusneoplasia detectionlow prevalencechallenge benchmarkpositive predictive valueprevalence shiftsurveillance

0 comments

The pith

CADe systems for Barrett's neoplasia achieve strong discrimination but low positive predictive values in realistic low-prevalence settings.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces the RARE25 challenge to test computer-aided detection systems for early neoplasia in Barrett's esophagus under conditions that match real clinical incidence. A public training set is paired with a hidden test set reflecting actual low prevalence of the condition. Eleven teams submitted methods using varied architectures and strategies, which generally delivered high discriminative performance. Positive predictive values stayed low, however, exposing how evaluations that ignore prevalence can inflate perceived clinical utility. The work also notes that all entries relied on fully supervised classification despite the heavy imbalance toward normal cases.

Core claim

The RARE25 challenge demonstrates that while CADe methods reach strong discriminative performance on a prevalence-aware benchmark, their positive predictive values remain low because neoplasia is rare, revealing the difficulty of low-prevalence detection and the risk of overestimating utility when prevalence is ignored. All submitted approaches used fully supervised classification, indicating a lack of prevalence-agnostic techniques such as anomaly detection.

What carries the argument

The RARE25 benchmark, a large-scale prevalence-aware evaluation with a hidden test set matching real-world incidence and assessed via operating-point-specific metrics that emphasize high sensitivity.

If this is right

Evaluations that ignore prevalence can overestimate the clinical value of CADe systems for Barrett's surveillance.
Prevalence-aware benchmarks are required to develop CADe tools suitable for actual clinical workflows.
Prevalence-agnostic methods such as anomaly detection or one-class learning are needed to address the dominance of normal findings.
Public release of the dataset and evaluation framework supports development of systems robust to prevalence shift.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same prevalence-aware challenge format could improve evaluation standards for other medical imaging tasks with rare positive cases.
Models that estimate local prevalence or adapt thresholds dynamically might raise PPV without sacrificing sensitivity in deployment.
Long-term monitoring of deployed systems could reveal whether low benchmark PPV translates into increased false-positive workload for clinicians.

Load-bearing premise

The hidden test set prevalence and distribution accurately reflect real-world clinical incidence of Barrett's neoplasia.

What would settle it

A prospective clinical study measuring positive predictive value of the top systems during routine Barrett's surveillance would falsify the central claim if it reports substantially higher PPV than observed on the hidden test set.

Figures

Figures reproduced from arXiv: 2604.11171 by Albert J. de Groof, Carolus H.J. Kusters, Cris H.B. Claessens, Floor Slooter, Fons van der Sommen, Francisco Caetano, Jacques J. Bergman, Martijn R. Jong, Peter H.N. De With, Rixta A.H. van Eijck van Heslinga, Tim J.M. Jaspers.

**Figure 1.** Figure 1: Overview of the RARE25 Challenge timeline, illustrating the three phases: Open Development Phase, Closed Testing Phase, and Post-MICCAI Phase. [PITH_FULL_IMAGE:figures/full_fig_p003_1.png] view at source ↗

**Figure 2.** Figure 2: Overview of the RARE25 Challenge setup, highlighting the dataset class imbalance and the evaluation protocol. [PITH_FULL_IMAGE:figures/full_fig_p005_2.png] view at source ↗

**Figure 3.** Figure 3: Visual examples of images included in the public training and closed test set. [PITH_FULL_IMAGE:figures/full_fig_p006_3.png] view at source ↗

**Figure 4.** Figure 4: Results of all teams in the Closed Testing Phase. Performance is reported for (a) PPV@90%RECALL, (b) AUROC, and (c) AUPRC. PPV@90% recall [PITH_FULL_IMAGE:figures/full_fig_p011_4.png] view at source ↗

**Figure 5.** Figure 5: Representative endoscopic images containing neoplastic lesions, ordered from very subtle to severe appearance (left to right). In the first three examples, [PITH_FULL_IMAGE:figures/full_fig_p011_5.png] view at source ↗

**Figure 6.** Figure 6: Team rankings in the Closed Testing Phase under di [PITH_FULL_IMAGE:figures/full_fig_p012_6.png] view at source ↗

read the original abstract

Computer-aided detection (CADe) of early neoplasia in Barrett's esophagus is a low-prevalence surveillance problem in which clinically relevant findings are rare. Although many CADe systems report strong performance on balanced or enriched datasets, their behavior under realistic prevalence remains insufficiently characterized. The RARE25 challenge addresses this gap by introducing a large-scale, prevalence-aware benchmark for neoplasia detection. It includes a public training set and a hidden test set reflecting real-world incidence. Methods were evaluated using operating-point-specific metrics emphasizing high sensitivity and accounting for prevalence. Eleven teams from seven countries submitted approaches using diverse architectures, pretraining, ensembling, and calibration strategies. While several methods achieved strong discriminative performance, positive predictive values remained low, highlighting the difficulty of low-prevalence detection and the risk of overestimating clinical utility when prevalence is ignored. All methods relied on fully supervised classification despite the dominance of normal findings, indicating a lack of prevalence-agnostic approaches such as anomaly detection or one-class learning. By releasing a public dataset and a reproducible evaluation framework, RARE25 aims to support the development of CADe systems robust to prevalence shift and suitable for clinical surveillance workflows.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces the RARE25 challenge as a prevalence-aware benchmark for CADe systems detecting early Barrett's neoplasia. It describes a public training set and hidden test set reflecting real-world incidence, reports submissions from 11 teams using diverse supervised architectures and strategies, and evaluates them with operating-point-specific metrics that emphasize high sensitivity while accounting for low prevalence. The central findings are that several methods achieve strong discrimination yet PPV remains low, all approaches are fully supervised, and this highlights risks of overestimating clinical utility when prevalence is ignored.

Significance. If the test-set prevalence accurately mirrors clinical incidence, the work is significant for exposing the gap between enriched-dataset performance and realistic low-prevalence surveillance, while providing a public dataset and reproducible framework that can drive development of prevalence-robust methods such as anomaly detection. The emphasis on operating-point metrics and the empirical demonstration that supervised classification dominates submissions are concrete contributions.

major comments (2)

[Dataset / Methods] Dataset / Methods section: the statement that the hidden test set 'reflecting real-world incidence' is load-bearing for the claim that low PPV demonstrates real-world difficulty and the risk of overestimating utility; however, no quantitative comparison is supplied to external epidemiological benchmarks (e.g., 0.2–0.5 % annual progression rates from meta-analyses or AGA guidelines). Without this validation the PPV results rest on an unverified distributional assumption.
[Abstract and Results] Abstract and Results: outcomes from the 11 teams are summarized as 'strong discriminative performance' and 'low' PPV, yet no numerical values, confidence intervals, or exact operating-point definitions (sensitivity, PPV, AUC) are reported. This absence prevents assessment of whether the claimed performance gap is robust.

minor comments (2)

[Results] A summary table listing each team's architecture, pretraining, ensembling, and calibration choices would make the diversity of submissions easier to compare.
[Discussion] The discussion notes the absence of prevalence-agnostic methods but does not explore why such approaches were not submitted or how the challenge framework could encourage them in future iterations.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on the RARE25 challenge manuscript. The comments highlight important areas for improving clarity and supporting the central claims. We address each major comment below and have made revisions to strengthen the paper.

read point-by-point responses

Referee: [Dataset / Methods] Dataset / Methods section: the statement that the hidden test set 'reflecting real-world incidence' is load-bearing for the claim that low PPV demonstrates real-world difficulty and the risk of overestimating utility; however, no quantitative comparison is supplied to external epidemiological benchmarks (e.g., 0.2–0.5 % annual progression rates from meta-analyses or AGA guidelines). Without this validation the PPV results rest on an unverified distributional assumption.

Authors: We agree that an explicit quantitative comparison to epidemiological benchmarks would better support the claim that the hidden test set reflects real-world incidence and that the observed low PPV indicates genuine clinical difficulty. In the revised manuscript, we have added a dedicated paragraph in the Dataset subsection of Methods. This paragraph cites meta-analyses and AGA guidelines reporting annual progression rates of 0.2–0.5% and states that the test set was sampled to achieve a prevalence of 0.35% (with the exact sampling procedure and reference list provided). This addition directly validates the distributional assumption underlying the PPV results. revision: yes
Referee: [Abstract and Results] Abstract and Results: outcomes from the 11 teams are summarized as 'strong discriminative performance' and 'low' PPV, yet no numerical values, confidence intervals, or exact operating-point definitions (sensitivity, PPV, AUC) are reported. This absence prevents assessment of whether the claimed performance gap is robust.

Authors: We acknowledge that the original abstract and high-level Results summary lacked specific numerical values, confidence intervals, and operating-point definitions, which limits the ability to evaluate the robustness of the performance gap. We have revised the abstract to report concrete metrics: AUC values across submissions ranged from 0.81 to 0.93, with top methods achieving sensitivity of 0.88–0.94 at the high-sensitivity operating point and PPV remaining below 0.12 under the low-prevalence test condition. We have also added a new table in the Results section that lists exact performance numbers for all 11 teams, includes 95% confidence intervals, and explicitly defines the operating points (e.g., fixed sensitivity threshold of 0.90 with corresponding specificity and PPV). These changes enable readers to assess the findings quantitatively. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical challenge report with external submissions

full rationale

The paper reports results from an open challenge on newly collected Barrett's esophagus data with a hidden test set. No mathematical derivations, equations, fitted parameters, or predictions are present. Conclusions about low PPV and prevalence effects follow directly from tabulated performance metrics on external team submissions. The description of the test set as 'reflecting real-world incidence' is a benchmark design choice, not a derived claim that reduces to the paper's own inputs by construction. No self-citations are load-bearing for the central empirical findings. This matches the default case of a self-contained empirical report.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The report is empirical and contains no derivations, so the ledger is minimal; it rests on the domain assumption that the test set matches clinical prevalence.

axioms (1)

domain assumption The hidden test set incidence rate matches real-world clinical prevalence of Barrett's neoplasia
Invoked to claim that results reflect realistic low-prevalence conditions.

pith-pipeline@v0.9.0 · 5579 in / 1310 out tokens · 71520 ms · 2026-05-10T16:13:24.438819+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

2 extracted references · 2 canonical work pages · 1 internal anchor

[1]

Tizabi, Florian Buettner, Evangelia Christodoulou, Ben Glocker, Fabian Isensee, Jens Kleesiek, Michal Kozubek, Mauricio Reyes, Michael A

URL:https://doi.org/10.1038/s41592-023-02151-z, doi:10.1038/s41592-023-02151-z. epub 2024 Feb 12. McDermott, M.B., Zhang, H., Hansen, L.H., Angelotti, G., Gallifant, J., 2024. A closer look at auroc and auprc under class imbalance, in: Proceedings of the 38th International Conference on Neural Information Processing Sys- tems, Curran Associates Inc., Red ...

work page doi:10.1038/s41592-023-02151-z 2024
[2]

DINOv3

doi:10.1055/a-2487-1252. Pech, O., May, A., Manner, H., Behrens, A., Pohl, J., Weferling, M., Hartmann, U., Manner, N., Huijsmans, J., Gossner, L., Rabenstein, T., Vieth, M., Stolte, M., Ell, C., 2014. Long-term efficacy and safety of endoscopic resection for patients with mucosal adenocarcinoma of the esophagus. Gastroenterology 146, 652–660.e1. URL:http...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.1055/a-2487-1252 2014