Benchmarking Deep Learning and Vision Foundation Models for Atypical vs. Normal Mitosis Classification with Cross-Dataset Evaluation

Christof A. Bertram; Christopher Kaltenecker; Jonas Ammeling; Katharina Breininger; Marc Aubreville; Nils Porsche; Robert Klopfleisch; Rutger H.J. Fick; Sweta Banerjee; Taryn A. Donovan

arxiv: 2506.21444 · v4 · submitted 2025-06-26 · 💻 cs.CV


Sweta Banerjee, Viktoria Weiss, Taryn A. Donovan, Rutger H.J. Fick, Thomas Conrad, Jonas Ammeling, Nils Porsche, Robert Klopfleisch, Christopher Kaltenecker, Katharina Breininger, Marc Aubreville, Christof A. Bertram


Pith reviewed 2026-05-19 07:36 UTC · model grok-4.3

classification 💻 cs.CV

keywords atypical mitosis · mitotic figure classification · deep learning · foundation models · transfer learning · LoRA fine-tuning · breast cancer · cross-dataset evaluation


The pith

Transfer learning and LoRA fine-tuning enable balanced accuracies up to 0.81 for atypical mitosis classification across in-domain and held-out datasets.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper benchmarks deep learning models for separating atypical mitotic figures from normal ones in breast cancer pathology images. Atypical mitoses serve as an independent marker of tumor malignancy but are rare, morphologically subtle, and prone to low agreement among experts. The study tests end-to-end convolutional networks, foundation models with linear probing, and foundation models adapted via low-rank fine-tuning on the AMi-Br collection plus two newly introduced held-out sets drawn from TCGA and MIDOG++ cohorts. Results demonstrate that these transfer-learning methods sustain solid performance even when the test data come from different sources and acquisition conditions.

Core claim

Building on the Atypical Mitosis dataset for Breast Cancer (AMi-Br), this study presents a comprehensive benchmark comparing deep learning approaches for automated atypical mitotic figure (AMF) classification, including end-to-end trained deep learning models, foundation models with linear probing, and foundation models fine-tuned with low-rank adaptation (LoRA). For rigorous evaluation, we further introduce two new held-out AMF datasets - AtNorM-Br, a dataset of mitotic figures from the TCGA breast cancer cohort, and AtNorM-MD, a multi-domain dataset of mitotic figures from a subset of the MIDOG++ training set. We found average balanced accuracy values of up to 0.8135, 0.7788, and 0.7723 on the in-domain AMi-Br and the out-of-domain AtNorM-Br and AtNorM-MD datasets, respectively.
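Balanced accuracy, the metric behind the three figures quoted in the core claim, is the mean of the per-class recalls, so the rare atypical class counts as much as the common normal class. A minimal sketch in plain Python follows; the labels are made up for illustration and the function name is an assumption, not taken from the paper's repository.

```python
def balanced_accuracy(y_true, y_pred, classes=(0, 1)):
    """Mean of per-class recalls; every class weighs the same regardless of size."""
    recalls = []
    for c in classes:
        # Positions whose reference label is class c
        idx = [i for i, t in enumerate(y_true) if t == c]
        if not idx:
            raise ValueError(f"class {c} has no reference samples")
        hits = sum(1 for i in idx if y_pred[i] == c)
        recalls.append(hits / len(idx))
    return sum(recalls) / len(recalls)


# Toy, made-up labels: 1 = atypical (rare), 0 = normal (common)
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 0, 0, 1]

plain_accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
bal_acc = balanced_accuracy(y_true, y_pred)

# A classifier that always answers "normal" looks good on plain accuracy
# but sits at chance level on balanced accuracy.
always_normal = balanced_accuracy(y_true, [0] * len(y_true))
```

On this toy data plain accuracy is 0.8 while balanced accuracy is 0.6875, and the always-normal predictor scores 0.5, which is why the metric suits a task with strong class imbalance.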

What carries the argument

Vision foundation models fine-tuned with low-rank adaptation (LoRA) on atypical versus normal mitotic figure images, evaluated through cross-dataset protocols that include newly constructed held-out collections.
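The low-rank adaptation idea can be sketched without any deep learning framework: a frozen weight matrix W is left untouched and a trainable update (alpha / r) · B · A of rank r is added to it. The dimensions, rank, and scaling value below are illustrative assumptions; this page does not state which layers, rank, or alpha the paper used.

```python
import random


def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector (list)."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def lora_forward(x, w, a, b, alpha):
    """Return W x + (alpha / r) * B (A x), with W frozen and A, B trainable."""
    rank = len(a)
    base = matvec(w, x)               # frozen pretrained path
    update = matvec(b, matvec(a, x))  # low-rank trainable path
    scale = alpha / rank
    return [h + scale * u for h, u in zip(base, update)]


def trainable_params(d_in, d_out, rank):
    """Parameters in A (rank x d_in) plus B (d_out x rank)."""
    return rank * (d_in + d_out)


rng = random.Random(0)
d_in, d_out, rank, alpha = 8, 6, 2, 4.0  # illustrative sizes only
w = [[rng.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]
a = [[rng.gauss(0, 0.01) for _ in range(d_in)] for _ in range(rank)]
b = [[0.0] * rank for _ in range(d_out)]  # B starts at zero, so the update is zero at first
x = [rng.gauss(0, 1) for _ in range(d_in)]

out_init = lora_forward(x, w, a, b, alpha)
frozen_out = matvec(w, x)

# Trainable parameters for a hypothetical 768 x 768 projection at rank 8
small = trainable_params(768, 768, 8)
full = 768 * 768
```

Because B is initialized to zero, the adapted layer reproduces the pretrained output before training starts, and the hypothetical 768 x 768 layer needs 12,288 trainable parameters instead of 589,824.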

Load-bearing premise

The labels assigned to mitotic figures in the new AtNorM-Br and AtNorM-MD datasets reliably reflect true atypical versus normal status and remain consistent when different pathologists or staining protocols are used.

What would settle it

A multi-rater study that quantifies inter-pathologist disagreement on the same mitotic figures used in AtNorM-Br and AtNorM-MD and then measures whether model accuracy tracks that disagreement level would directly test whether reported balanced accuracies capture biological signal or annotation noise.
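One common way to quantify disagreement between two pathologists in such a study is Cohen's kappa, which corrects raw agreement for the agreement expected by chance. The sketch below uses invented ratings; the paper's datasets supply no second-rater labels according to this page, and the choice of kappa over other agreement statistics is an assumption.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Chance-corrected agreement between two raters over the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty item list")
    n = len(rater_a)
    observed = sum(x == y for x, y in zip(rater_a, rater_b)) / n
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    labels = set(count_a) | set(count_b)
    # Agreement expected if both raters labeled independently at their own base rates
    expected = sum((count_a[l] / n) * (count_b[l] / n) for l in labels)
    if expected == 1.0:
        return 1.0  # convention chosen here: both raters used one identical label
    return (observed - expected) / (1 - expected)


# Invented ratings for ten mitotic figures
rater_a = ["atypical", "normal", "normal", "atypical", "normal",
           "normal", "normal", "atypical", "normal", "normal"]
rater_b = ["atypical", "normal", "atypical", "normal", "normal",
           "normal", "normal", "atypical", "normal", "normal"]

kappa = cohens_kappa(rater_a, rater_b)
self_kappa = cohens_kappa(rater_a, rater_a)
```

Here the raters agree on 8 of 10 figures, yet kappa is only about 0.52 because most of that agreement is expected by chance when normal figures dominate.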

Original abstract

Atypical mitosis marks a deviation in the cell division process that has been shown to be an independent prognostic marker for tumor malignancy. However, atypical mitosis classification remains challenging due to low prevalence, at times subtle morphological differences from normal mitotic figures, low inter-rater agreement among pathologists, and class imbalance in datasets. Building on the Atypical Mitosis dataset for Breast Cancer (AMi-Br), this study presents a comprehensive benchmark comparing deep learning approaches for automated atypical mitotic figure (AMF) classification, including end-to-end trained deep learning models, foundation models with linear probing, and foundation models fine-tuned with low-rank adaptation (LoRA). For rigorous evaluation, we further introduce two new held-out AMF datasets - AtNorM-Br, a dataset of mitotic figures from the TCGA breast cancer cohort, and AtNorM-MD, a multi-domain dataset of mitotic figures from a subset of the MIDOG++ training set. We found average balanced accuracy values of up to 0.8135, 0.7788, and 0.7723 on the in-domain AMi-Br and the out-of-domain AtNorm-Br and AtNorM-MD datasets, respectively. Our work shows that atypical mitotic figure classification, while being a challenging problem, can be effectively addressed through the use of recent advances in transfer learning and model fine-tuning techniques. We make all code and data used in this paper available in this GitHub repository: https://github.com/DeepMicroscopy/AMi-Br_Benchmark.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript benchmarks deep learning approaches—including end-to-end trained CNNs, vision foundation models with linear probing, and LoRA fine-tuning—for atypical versus normal mitosis classification. It extends the existing AMi-Br dataset by introducing two new held-out test sets (AtNorM-Br from TCGA breast cancer slides and AtNorM-MD from a MIDOG++ subset) and reports peak balanced accuracies of 0.8135 (in-domain AMi-Br), 0.7788 (AtNorM-Br), and 0.7723 (AtNorM-MD). The authors conclude that recent transfer-learning and fine-tuning techniques can effectively address this clinically relevant but challenging task, and they release all code and data.

Significance. If the reported cross-dataset numbers are reliable, the work supplies a concrete, reproducible benchmark for a prognostic marker with known low prevalence and inter-rater variability. The explicit out-of-domain evaluation on two independently sourced held-out collections and the public release of code, data, and models are clear strengths that directly support the effectiveness claim and enable follow-on research.

major comments (1)

[§3] §3 (Construction of AtNorM-Br and AtNorM-MD): No inter-rater agreement statistics, multi-pathologist consensus protocol, or side-by-side comparison of annotation guidelines with those used for AMi-Br are provided. Because the introduction itself identifies low inter-rater agreement as an intrinsic difficulty, the absence of these details leaves open the possibility that the reported out-of-domain balanced accuracies (0.7788 and 0.7723) partly reflect labeling shifts rather than model generalization.

minor comments (2)

[Abstract] Abstract: The out-of-domain dataset is referred to once as “AtNorm-Br” and once as “AtNorM-Br”; consistent naming throughout the manuscript would reduce confusion.
[Results] Table captions and result sections: Balanced-accuracy values are given to four decimal places while standard deviations or confidence intervals are not shown; adding variability estimates would strengthen the quantitative claims.
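One way to obtain the variability estimate requested in the second minor comment is a percentile bootstrap over the test samples. The sketch below is an assumption about how such an interval could be computed, not the authors' procedure; the labels are synthetic, and the resampling is stratified by class so that every replicate contains both classes.

```python
import random


def balanced_accuracy(y_true, y_pred):
    """Mean of the recalls for class 0 (normal) and class 1 (atypical)."""
    recalls = []
    for c in (0, 1):
        idx = [i for i, t in enumerate(y_true) if t == c]
        if not idx:
            raise ValueError(f"class {c} has no reference samples")
        recalls.append(sum(y_pred[i] == c for i in idx) / len(idx))
    return sum(recalls) / 2


def bootstrap_ci(y_true, y_pred, n_boot=2000, level=0.95, seed=0):
    """Percentile bootstrap interval for balanced accuracy, resampled per class."""
    rng = random.Random(seed)
    by_class = {c: [i for i, t in enumerate(y_true) if t == c] for c in (0, 1)}
    scores = []
    for _ in range(n_boot):
        # Draw, with replacement, as many items from each class as it originally had
        sample = [rng.choice(idx) for idx in by_class.values() for _ in idx]
        scores.append(balanced_accuracy([y_true[i] for i in sample],
                                        [y_pred[i] for i in sample]))
    scores.sort()
    lo = scores[round((1 - level) / 2 * n_boot)]
    hi = scores[round((1 + level) / 2 * n_boot) - 1]
    return lo, hi


# Synthetic test set: 40 atypical (recall 0.75) and 160 normal (recall 0.85)
y_true = [1] * 40 + [0] * 160
y_pred = [1] * 30 + [0] * 10 + [0] * 136 + [1] * 24

point = balanced_accuracy(y_true, y_pred)
lo, hi = bootstrap_ci(y_true, y_pred)
```

The point estimate on this synthetic set is 0.80; the interval width is driven mostly by the small atypical class, which is the situation the minor comment is concerned with.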

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive feedback. We address the major comment below and indicate the revisions that will be incorporated to improve transparency around dataset annotation.


Referee: §3 (Construction of AtNorM-Br and AtNorM-MD): No inter-rater agreement statistics, multi-pathologist consensus protocol, or side-by-side comparison of annotation guidelines with those used for AMi-Br are provided. Because the introduction itself identifies low inter-rater agreement as an intrinsic difficulty, the absence of these details leaves open the possibility that the reported out-of-domain balanced accuracies (0.7788 and 0.7723) partly reflect labeling shifts rather than model generalization.

Authors: We agree that additional details on annotation are warranted given the known inter-rater variability challenges highlighted in the introduction. In the revised manuscript we will expand Section 3 with: (i) an explicit side-by-side comparison showing that the morphological criteria for atypical mitoses in AtNorM-Br and AtNorM-MD are identical to those used for AMi-Br; (ii) a description of the annotation protocol (single board-certified pathologist per new dataset, performed to maintain consistency with the original AMi-Br guidelines). Inter-rater agreement statistics are not available for the new collections because they were created with single-expert annotation for practical reasons. We will add a dedicated limitations paragraph acknowledging that this leaves open the possibility of labeling shifts and discussing how this affects interpretation of the cross-dataset balanced accuracies as pure measures of generalization.

revision: yes

Circularity Check

0 steps flagged

Empirical benchmarking study with direct held-out evaluation exhibits no circularity


The paper is a standard empirical benchmarking study that trains models on the AMi-Br dataset and reports balanced accuracy directly on held-out images from AMi-Br (in-domain) plus two newly introduced out-of-domain sets (AtNorM-Br from TCGA and AtNorM-MD from MIDOG++). No equations, derivations, or fitted parameters are present; performance numbers are measured outcomes rather than quantities defined in terms of themselves. The central claim rests on observable classification results against external annotations, not on any self-referential construction, self-citation chain, or renamed ansatz. The annotation-reliability concern raised in the skeptic note is a validity issue for the ground-truth labels, not a source of circularity in the reported pipeline.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on standard machine-learning assumptions about data representativeness and label quality rather than any new free parameters, axioms invented for this paper, or postulated entities. Pre-trained foundation models are treated as external inputs from prior literature.

axioms (2)

[domain assumption] Ground-truth labels in the new AtNorM-Br and AtNorM-MD datasets are sufficiently accurate and consistent to serve as reliable benchmarks for generalization.
Invoked when reporting out-of-domain balanced accuracies as evidence of effective classification.
[standard math] Standard i.i.d. assumptions hold sufficiently for cross-dataset evaluation to measure true generalization.
Implicit in the use of held-out test sets to evaluate model performance.



Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Passage: We found average balanced accuracy values of up to 0.8135, 0.7788, and 0.7723 on the in-domain AMi-Br and the out-of-domain AtNorm-Br and AtNorM-MD datasets, respectively. ... foundation models fine-tuned with low-rank adaptation (LoRA)

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction
Relation between the paper passage and the cited Recognition theorem: unclear
Passage: Atypical mitosis marks a deviation in the cell division process ... low inter-rater agreement among pathologists, and class imbalance in datasets.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.