Benchmarking Deep Learning and Vision Foundation Models for Atypical vs. Normal Mitosis Classification with Cross-Dataset Evaluation
Pith reviewed 2026-05-19 07:36 UTC · model grok-4.3
The pith
Transfer learning and LoRA fine-tuning enable balanced accuracies up to 0.81 for atypical mitosis classification across in-domain and held-out datasets.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Building on the Atypical Mitosis dataset for Breast Cancer (AMi-Br), this study presents a comprehensive benchmark comparing deep learning approaches for automated atypical mitotic figure (AMF) classification, including end-to-end trained deep learning models, foundation models with linear probing, and foundation models fine-tuned with low-rank adaptation (LoRA). For rigorous evaluation, we further introduce two new held-out AMF datasets - AtNorM-Br, a dataset of mitotic figures from the TCGA breast cancer cohort, and AtNorM-MD, a multi-domain dataset of mitotic figures from a subset of the MIDOG++ training set. We found average balanced accuracy values of up to 0.8135, 0.7788, and 0.7723 on the in-domain AMi-Br and the out-of-domain AtNorM-Br and AtNorM-MD datasets, respectively.
What carries the argument
Vision foundation models fine-tuned with low-rank adaptation (LoRA) on atypical versus normal mitotic figure images, evaluated through cross-dataset protocols that include newly constructed held-out collections.
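To make the low-rank adaptation idea concrete, here is a minimal sketch of a single LoRA-adapted linear layer in plain Python. It is an illustration only, not the paper's implementation: the class name `LoRALinear`, the helper `matmul`, and the `rank` and `alpha` values are assumptions chosen for the example. The pretrained weight `W` stays frozen, and only the two small factors `A` and `B` would be trained, so the effective weight is `W + (alpha / rank) * B @ A`.

```python
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


class LoRALinear:
    """Frozen weight W (out x in) plus a low-rank update (alpha / rank) * B @ A.

    Hypothetical illustration; real fine-tuning would use a deep learning framework.
    """

    def __init__(self, weight, rank, alpha, seed=0):
        rng = random.Random(seed)
        out_dim, in_dim = len(weight), len(weight[0])
        self.weight = weight          # frozen pretrained weight, never updated
        self.scale = alpha / rank     # scaling applied to the low-rank update
        # A starts with small random values, B starts at zero, so the
        # adapted layer initially behaves exactly like the frozen layer.
        self.A = [[rng.gauss(0.0, 0.01) for _ in range(in_dim)] for _ in range(rank)]
        self.B = [[0.0] * rank for _ in range(out_dim)]

    def effective_weight(self):
        """Return W + scale * (B @ A) as a list of rows."""
        delta = matmul(self.B, self.A)
        return [[w + self.scale * d for w, d in zip(w_row, d_row)]
                for w_row, d_row in zip(self.weight, delta)]

    def forward(self, x):
        """Apply the adapted layer to one input vector of length in_dim."""
        return [sum(w * xi for w, xi in zip(row, x)) for row in self.effective_weight()]

    def trainable_parameters(self):
        """Count entries of A and B, the only parameters that would be trained."""
        return len(self.A) * len(self.A[0]) + len(self.B) * len(self.B[0])
```

For a weight of shape `out x in`, the trainable parameter count is `rank * (in + out)` instead of `in * out`, which is what makes this kind of fine-tuning cheap for large foundation models when the rank is small.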
Load-bearing premise
The labels assigned to mitotic figures in the new AtNorM-Br and AtNorM-MD datasets reliably reflect true atypical versus normal status and remain consistent when different pathologists or staining protocols are used.
What would settle it
A multi-rater study that quantifies inter-pathologist disagreement on the same mitotic figures used in AtNorM-Br and AtNorM-MD and then measures whether model accuracy tracks that disagreement level would directly test whether reported balanced accuracies capture biological signal or annotation noise.
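One common way to quantify disagreement between two raters is Cohen's kappa, which corrects raw agreement for the agreement expected by chance. The sketch below is an assumption about how such a multi-rater study could be scored, not something described in the paper. The function name `cohens_kappa` and the label strings are hypothetical.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labeling the same items.

    kappa = (p_observed - p_expected) / (1 - p_expected), where p_expected is
    the chance agreement implied by each rater's label frequencies.
    """
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        # Both raters used one identical label for every item: kappa is undefined.
        return float("nan")
    return (observed - expected) / (1.0 - expected)


# Example with four mitotic figures labeled by two hypothetical pathologists.
pathologist_1 = ["atypical", "normal", "normal", "normal"]
pathologist_2 = ["atypical", "normal", "atypical", "normal"]
kappa = cohens_kappa(pathologist_1, pathologist_2)
```

In this toy example the raters agree on three of four figures (0.75 observed agreement), chance agreement is 0.5, and kappa is therefore 0.5. Computing kappa per subset of figures and comparing it with model accuracy on the same subsets would be one way to check whether accuracy tracks annotation disagreement.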
Original abstract
Atypical mitosis marks a deviation in the cell division process that has been shown to be an independent prognostic marker for tumor malignancy. However, atypical mitosis classification remains challenging due to low prevalence, at times subtle morphological differences from normal mitotic figures, low inter-rater agreement among pathologists, and class imbalance in datasets. Building on the Atypical Mitosis dataset for Breast Cancer (AMi-Br), this study presents a comprehensive benchmark comparing deep learning approaches for automated atypical mitotic figure (AMF) classification, including end-to-end trained deep learning models, foundation models with linear probing, and foundation models fine-tuned with low-rank adaptation (LoRA). For rigorous evaluation, we further introduce two new held-out AMF datasets - AtNorM-Br, a dataset of mitotic figures from the TCGA breast cancer cohort, and AtNorM-MD, a multi-domain dataset of mitotic figures from a subset of the MIDOG++ training set. We found average balanced accuracy values of up to 0.8135, 0.7788, and 0.7723 on the in-domain AMi-Br and the out-of-domain AtNorm-Br and AtNorM-MD datasets, respectively. Our work shows that atypical mitotic figure classification, while being a challenging problem, can be effectively addressed through the use of recent advances in transfer learning and model fine-tuning techniques. We make all code and data used in this paper available in this GitHub repository: https://github.com/DeepMicroscopy/AMi-Br_Benchmark.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript benchmarks deep learning approaches—including end-to-end trained CNNs, vision foundation models with linear probing, and LoRA fine-tuning—for atypical versus normal mitosis classification. It extends the existing AMi-Br dataset by introducing two new held-out test sets (AtNorM-Br from TCGA breast cancer slides and AtNorM-MD from a MIDOG++ subset) and reports peak balanced accuracies of 0.8135 (in-domain AMi-Br), 0.7788 (AtNorM-Br), and 0.7723 (AtNorM-MD). The authors conclude that recent transfer-learning and fine-tuning techniques can effectively address this clinically relevant but challenging task, and they release all code and data.
Significance. If the reported cross-dataset numbers are reliable, the work supplies a concrete, reproducible benchmark for a prognostic marker with known low prevalence and inter-rater variability. The explicit out-of-domain evaluation on two independently sourced held-out collections and the public release of code and data are clear strengths that directly support the effectiveness claim and enable follow-on research.
major comments (1)
- [§3] §3 (Construction of AtNorM-Br and AtNorM-MD): No inter-rater agreement statistics, multi-pathologist consensus protocol, or side-by-side comparison of annotation guidelines with those used for AMi-Br are provided. Because the introduction itself identifies low inter-rater agreement as an intrinsic difficulty, the absence of these details leaves open the possibility that the reported out-of-domain balanced accuracies (0.7788 and 0.7723) partly reflect labeling shifts rather than model generalization.
minor comments (2)
- [Abstract] Abstract: The out-of-domain dataset is referred to once as “AtNorm-Br” and once as “AtNorM-Br”; consistent naming throughout the manuscript would reduce confusion.
- [Results] Table captions and result sections: Balanced-accuracy values are given to four decimal places while standard deviations or confidence intervals are not shown; adding variability estimates would strengthen the quantitative claims.
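The variability estimates requested in the last minor comment could, for instance, be obtained with a percentile bootstrap over the test set. The sketch below is one possible approach under stated assumptions, not the authors' evaluation code. The function names `balanced_accuracy` and `bootstrap_ci`, the number of resamples, and the toy labels are all hypothetical. Balanced accuracy is computed here as the mean of per-class recall, which is why a classifier that always predicts the majority class scores 0.5 on a two-class problem regardless of class imbalance.

```python
import random


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall over the classes present in y_true."""
    recalls = []
    for cls in sorted(set(y_true)):
        idx = [i for i, t in enumerate(y_true) if t == cls]
        recalls.append(sum(y_pred[i] == cls for i in idx) / len(idx))
    return sum(recalls) / len(recalls)


def bootstrap_ci(y_true, y_pred, n_resamples=1000, level=0.95, seed=0):
    """Percentile bootstrap interval for balanced accuracy.

    Resamples test items with replacement and recomputes the metric each time.
    """
    rng = random.Random(seed)
    n = len(y_true)
    scores = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        scores.append(balanced_accuracy([y_true[i] for i in idx],
                                        [y_pred[i] for i in idx]))
    scores.sort()
    # Clamp the percentile positions so they always index a valid score.
    lo = max(0, int((1 - level) / 2 * n_resamples))
    hi = min(n_resamples - 1, int((1 + level) / 2 * n_resamples) - 1)
    return scores[lo], scores[hi]


# Toy imbalanced example: 1 = atypical (rare), 0 = normal.
y_true = [1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 1]
point_estimate = balanced_accuracy(y_true, y_pred)
lower, upper = bootstrap_ci(y_true, y_pred)
```

Reporting the interval next to each four-decimal point estimate would show whether differences between models, or between in-domain and out-of-domain results, exceed sampling noise. Repeating training over several random seeds would be a complementary way to capture run-to-run variability.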
Simulated Author's Rebuttal
We thank the referee for their constructive feedback. We address the major comment below and indicate the revisions that will be incorporated to improve transparency around dataset annotation.
Point-by-point responses
- Referee: §3 (Construction of AtNorM-Br and AtNorM-MD): No inter-rater agreement statistics, multi-pathologist consensus protocol, or side-by-side comparison of annotation guidelines with those used for AMi-Br are provided. Because the introduction itself identifies low inter-rater agreement as an intrinsic difficulty, the absence of these details leaves open the possibility that the reported out-of-domain balanced accuracies (0.7788 and 0.7723) partly reflect labeling shifts rather than model generalization.
Authors: We agree that additional details on annotation are warranted given the known inter-rater variability challenges highlighted in the introduction. In the revised manuscript we will expand Section 3 with: (i) an explicit side-by-side comparison showing that the morphological criteria for atypical mitoses in AtNorM-Br and AtNorM-MD are identical to those used for AMi-Br; and (ii) a description of the annotation protocol (a single board-certified pathologist per new dataset, chosen to maintain consistency with the original AMi-Br guidelines). Inter-rater agreement statistics are not available for the new collections because they were created with single-expert annotation for practical reasons. We will add a dedicated limitations paragraph acknowledging that this leaves open the possibility of labeling shifts and discussing how this affects interpretation of the cross-dataset balanced accuracies as pure measures of generalization.
Revision: yes
Circularity Check
Empirical benchmarking study with direct held-out evaluation exhibits no circularity
Full rationale
The paper is a standard empirical benchmarking study that trains models on the AMi-Br dataset and reports balanced accuracy directly on held-out images from AMi-Br (in-domain) plus two newly introduced out-of-domain sets (AtNorM-Br from TCGA and AtNorM-MD from MIDOG++). No equations, derivations, or fitted parameters are present; performance numbers are measured outcomes rather than quantities defined in terms of themselves. The central claim rests on observable classification results against external annotations, not on any self-referential construction, self-citation chain, or renamed ansatz. The annotation-reliability concern raised in the referee report is a validity issue for the ground-truth labels, not a circularity reduction in the reported pipeline.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: Ground-truth labels in the new AtNorM-Br and AtNorM-MD datasets are sufficiently accurate and consistent to serve as reliable benchmarks for generalization.
- standard math: Standard i.i.d. assumptions hold sufficiently for cross-dataset evaluation to measure true generalization.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: We found average balanced accuracy values of up to 0.8135, 0.7788, and 0.7723 on the in-domain AMi-Br and the out-of-domain AtNorm-Br and AtNorM-MD datasets, respectively. ... foundation models fine-tuned with low-rank adaptation (LoRA)
- IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: Atypical mitosis marks a deviation in the cell division process ... low inter-rater agreement among pathologists, and class imbalance in datasets.
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.