Transformer-Based Hematological Malignancy Prediction from Peripheral Blood Smears in a Real-World Cohort

Ario Sadafi; Carsten Marr; Christian Pohlkamp; Fatih Ozlugedik; Ivan Kukuljan; Karsten Spiekermann; Matthias Hehr; Muhammed Furkan Dasdelen; Peter Lienemann

arxiv: 2509.20402 · v3 · submitted 2025-09-23 · 🧬 q-bio.QM


Muhammed Furkan Dasdelen, Ivan Kukuljan, Peter Lienemann, Fatih Ozlugedik, Ario Sadafi, Matthias Hehr, Karsten Spiekermann, Christian Pohlkamp, Carsten Marr

Pith reviewed 2026-05-18 13:38 UTC · model grok-4.3

classification 🧬 q-bio.QM

keywords hematological malignancy · peripheral blood smear · transformer model · acute leukemia · AI diagnosis · false discovery rate · bone marrow aspiration · explainable AI


The pith

A transformer model on peripheral blood images classifies hematological malignancies and lowers the false discovery rate for acute leukemia from 13.5% to 8.7% without missing any cases.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper presents cAItomorph, a transformer-based AI system trained to classify eight coarse hematological conditions from single-cell images in peripheral blood smears. The data come from 6115 patients plus 495 healthy controls; patient ground-truth labels were established by bone-marrow cytomorphology, cytogenetics, molecular genetics, and immunophenotyping. The model reaches 0.72 overall accuracy, rising to 0.87 when the top two predictions are considered, with F1 scores of 0.76 for acute leukemia, 0.80 for myeloproliferative neoplasms, and 0.94 for healthy samples. When prediction probabilities are calibrated, the false discovery rate for acute leukemia falls from 13.5% to 8.7% without missing any acute leukemia case, and sensitivity for acute leukemia remains high on external test sets. This performance suggests a route to fewer unnecessary bone-marrow aspirations triggered by ambiguous peripheral-blood findings.
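To make the gap between the 0.72 and 0.87 figures concrete, the following minimal sketch computes top-k accuracy on a toy example. The function name `top_k_accuracy`, the three hypothetical classes, and the probabilities are illustrative assumptions, not the paper's evaluation code or data.

```python
from typing import Sequence


def top_k_accuracy(probs: Sequence[Sequence[float]],
                   labels: Sequence[int],
                   k: int = 1) -> float:
    """Fraction of patients whose true class is among the k most probable classes."""
    if len(probs) != len(labels):
        raise ValueError("probs and labels must have the same length")
    hits = 0
    for p, y in zip(probs, labels):
        # Rank class indices from most to least probable.
        ranked = sorted(range(len(p)), key=lambda c: p[c], reverse=True)
        if y in ranked[:k]:
            hits += 1
    return hits / len(labels)


# Toy predictions over three hypothetical classes (the paper uses eight).
toy_probs = [
    [0.6, 0.3, 0.1],  # true class 0: correct at top-1
    [0.5, 0.4, 0.1],  # true class 1: correct only at top-2
    [0.2, 0.3, 0.5],  # true class 0: missed at top-1 and top-2
    [0.1, 0.2, 0.7],  # true class 2: correct at top-1
]
toy_labels = [0, 1, 0, 2]

top1 = top_k_accuracy(toy_probs, toy_labels, k=1)
top2 = top_k_accuracy(toy_probs, toy_labels, k=2)
print(f"top-1 = {top1:.2f}, top-2 = {top2:.2f}")
```

Top-2 accuracy can only equal or exceed top-1 accuracy, so the size of the gap indicates how often the correct class is the model's runner-up.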

Core claim

cAItomorph aggregates cell-level encodings produced by the DinoBloom hematology foundation model through a transformer architecture to generate a single patient-level vector for eight-class classification. On real-world peripheral-blood data it attains 0.72 accuracy overall and 0.87 top-2 accuracy, with F1 scores of 0.76 for acute leukemia, 0.80 for myeloproliferative neoplasms, and 0.94 for healthy controls. Attention-head inspection shows focus on diagnostically relevant cells in both internal and external cohorts. Calibrated output probabilities specifically reduce the false discovery rate for acute leukemia from 13.5% to 8.7% while preserving detection of every case.

What carries the argument

Transformer aggregator that pools multiple cell encodings from the DinoBloom foundation model into one patient-level vector, equipped with attention heads for explainability and calibrated probabilities for decision support.
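The idea of pooling many cell encodings into one patient-level vector can be sketched with single-query attention pooling. This is a deliberately reduced stand-in, not the cAItomorph architecture: the page does not describe the layer count, heads, or learned projections, so the `query` vector, the two-dimensional toy encodings, and the function names below are all assumptions for illustration.

```python
import math
from typing import List, Tuple


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(cells: List[List[float]],
                   query: List[float]) -> Tuple[List[float], List[float]]:
    """Pool cell encodings into one vector using scaled dot-product attention.

    Returns the pooled vector and the per-cell attention weights.
    """
    d = len(query)
    # One score per cell: scaled dot product between query and cell encoding.
    scores = [sum(q * c for q, c in zip(query, cell)) / math.sqrt(d)
              for cell in cells]
    weights = softmax(scores)
    # Weighted sum of cell encodings, one output coordinate at a time.
    pooled = [sum(w * cell[j] for w, cell in zip(weights, cells))
              for j in range(d)]
    return pooled, weights


# Three toy "cell encodings" (real foundation-model encodings are far wider).
cells = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query = [2.0, 0.0]  # hypothetical learned query; fixed here for illustration

patient_vector, weights = attention_pool(cells, query)
print(patient_vector, weights)
```

The per-cell weights are what makes this family of aggregators inspectable: cells that receive more weight contribute more to the patient-level vector.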

If this is right

High sensitivity for acute leukemia holds on external validation sets.
Top-2 accuracy of 0.87 supports narrowing of differential diagnoses from smear images alone.
Attention maps highlight cell-level features that align with clinical cytomorphologic criteria.
Calibrated probabilities enable triage that reduces unnecessary bone-marrow aspirations.
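The triage trade-off in these points can be sketched with the usual definitions, false discovery rate = FP / (TP + FP) and sensitivity = TP / (TP + FN), evaluated at a probability threshold. The thresholds and the seven toy patients below are invented for illustration; the page does not state how the paper chose its operating point.

```python
from typing import Sequence, Tuple


def fdr_and_sensitivity(probs: Sequence[float],
                        is_positive: Sequence[bool],
                        threshold: float) -> Tuple[float, float]:
    """Return (false discovery rate, sensitivity) when flagging probs >= threshold."""
    tp = fp = fn = 0
    for p, pos in zip(probs, is_positive):
        flagged = p >= threshold
        if flagged and pos:
            tp += 1
        elif flagged and not pos:
            fp += 1
        elif not flagged and pos:
            fn += 1
    fdr = fp / (tp + fp) if (tp + fp) else 0.0
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    return fdr, sensitivity


# Toy acute-leukemia probabilities for seven hypothetical patients.
probs = [0.95, 0.90, 0.80, 0.60, 0.40, 0.30, 0.10]
is_al = [True, True, False, True, False, False, False]

loose = fdr_and_sensitivity(probs, is_al, threshold=0.25)
tight = fdr_and_sensitivity(probs, is_al, threshold=0.50)
print("threshold 0.25:", loose)
print("threshold 0.50:", tight)
```

In this toy case, raising the threshold from 0.25 to 0.50 halves the false discovery rate while every positive case is still flagged; raising it further would start to cost sensitivity.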

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Routine deployment could reduce inter-observer variability in initial smear review.
The same aggregation approach may extend to other marrow-derived disorders detectable in circulation.
Integration with standard laboratory parameters could further refine thresholds for invasive follow-up.
Real-world outcome trials would quantify net reduction in procedures and downstream patient benefit.

Load-bearing premise

Labels derived from comprehensive bone-marrow cytomorphology, cytogenetics, molecular genetics, and immunophenotyping provide reliable ground truth for a model that sees only peripheral-blood images.

What would settle it

A prospective study that routes patients to bone-marrow aspiration or observation solely on the basis of the model's calibrated probability threshold and then measures the number of missed acute-leukemia diagnoses and avoided procedures.

Original abstract

Peripheral blood smears remain a cornerstone in the diagnosis of hematological neoplasms, offering rapid and valuable insights that inform subsequent diagnostic steps. However, since neoplastic transformations typically arise in the bone marrow, they may not manifest as detectable aberrations in peripheral blood, presenting a diagnostic challenge. In this paper, we introduce cAItomorph, an explainable transformer-based AI model, trained to classify hematological malignancies based on peripheral blood cytomorphology. Our data comprises peripheral blood single-cell images from 6115 patients with diagnoses confirmed by cytomorphology, cytogenetics, molecular genetics, and immunophenotyping from bone marrow samples, and 495 healthy controls, eight coarse classes. cAItomorph leverages the DinoBloom hematology foundation model and aggregates image encodings via a transformer-based architecture into a single vector. It achieves an overall accuracy of 0.72 in eight disease classification, with F1 scores of 0.76 for acute leukemia, 0.80 for myeloproliferative neoplasms and 0.94 for healthy cases. The overall accuracy increases to 0.87 in top-2 predictions. cAItomorph achieves high sensitivity for acute leukemia cases in external test sets. By analyzing attention heads, we demonstrate clinically relevant cell-level attentions in both internal and external test sets. Moreover, our model's calibrated prediction probabilities reduce the false discovery rate from 13.5% to 8.7% without missing any acute leukemia cases, thereby decreasing the number of unnecessary bone marrow aspirations based on peripheral blood smears. This study highlights the potential of AI-assisted diagnostics in hematological malignancies, illustrating how models trained on real-world data could enhance diagnostic accuracy and reduce invasive procedures.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

This applies DinoBloom embeddings with a transformer to a large real-world peripheral blood smear cohort and claims an FDR drop for acute leukemia triage, but marrow-derived labels create a real mismatch risk with blood-only inputs.


The one or two things to take away: this is an application paper that uses DinoBloom plus a transformer to predict from peripheral blood smears in a 6115-patient cohort, and it claims the calibrated model cuts the false discovery rate for acute leukemia from 13.5% to 8.7% with no misses, which could mean fewer bone marrow aspirations.

They handle a decent-sized real-world dataset with external validation and add attention visualization for explainability. Reporting specific metrics across internal and external sets and including calibration is better than many similar efforts. The top-2 accuracy boost to 0.87 shows they thought about how to use the model in practice.

The main soft spot is the ground truth setup. Labels rely on bone marrow findings that include things the blood smear might not reflect. The abstract acknowledges this possibility but does not quantify the discordance rate or run any noise analysis. That makes the sensitivity and FDR results less convincing than they first appear. Basic things like patient-level splitting and imbalance handling are also not described clearly enough to evaluate the claims fully.

This paper is for hematology clinicians or researchers working on AI support for smear reading. Someone looking for evidence on whether these models can impact routine triage would get something out of the scale and the clinical framing. It deserves a serious referee because the dataset is substantial and the question is clinically relevant, even with the gaps in validation.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces cAItomorph, a transformer-based model built on the DinoBloom hematology foundation model for classifying eight coarse classes of hematological malignancies and healthy controls from peripheral blood smear single-cell images. The dataset includes images from 6115 patients with bone marrow-confirmed diagnoses and 495 healthy controls. The model achieves an overall accuracy of 0.72, with F1 scores of 0.76 for acute leukemia, 0.80 for myeloproliferative neoplasms, and 0.94 for healthy cases. Top-2 accuracy is 0.87. It demonstrates high sensitivity for acute leukemia on external test sets, provides attention-based explanations, and claims to reduce the false discovery rate from 13.5% to 8.7% without missing acute leukemia cases, potentially decreasing unnecessary bone marrow aspirations.

Significance. If the performance claims hold under rigorous validation, this work could have meaningful clinical impact by improving triage from peripheral blood smears and reducing invasive bone marrow procedures. Notable strengths include the large real-world cohort, use of a domain-specific foundation model, and attention-based explainability that links to clinically relevant cell features. The reported FDR reduction and zero-miss sensitivity for acute leukemia, if substantiated, represent a direct translational benefit.

major comments (2)

[Abstract] The central claim that calibrated prediction probabilities reduce the false discovery rate from 13.5% to 8.7% without missing any acute leukemia cases provides no details on patient-level splitting, class imbalance handling, or statistical testing of the FDR improvement. These elements are load-bearing for assessing whether the reported reduction is robust or could arise from leakage or imbalance artifacts.
[Data and Methods] The eight-class labels are derived from comprehensive bone marrow cytomorphology, cytogenetics, molecular genetics, and immunophenotyping, yet the model receives only peripheral blood images. The abstract notes that neoplastic changes may not manifest as detectable aberrations in peripheral blood; this mismatch creates potential label noise that directly biases calibration, sensitivity, and the FDR metric. A quantitative discordance estimate or label-noise sensitivity analysis is required to support the claims.
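The leakage concern in the first major comment comes down to whether images from one patient can land on both sides of a split. A minimal sketch of a patient-level split follows; the image identifiers, the `image_to_patient` mapping, and the 20% test fraction are hypothetical and say nothing about how the paper partitioned its cohort.

```python
import random
from typing import Dict, List, Tuple


def patient_level_split(image_to_patient: Dict[str, str],
                        test_fraction: float = 0.2,
                        seed: int = 0) -> Tuple[List[str], List[str]]:
    """Split image IDs so that each patient's images fall in exactly one partition."""
    patients = sorted(set(image_to_patient.values()))
    rng = random.Random(seed)
    rng.shuffle(patients)
    n_test = max(1, round(len(patients) * test_fraction))
    test_patients = set(patients[:n_test])
    train = [img for img, pid in image_to_patient.items()
             if pid not in test_patients]
    test = [img for img, pid in image_to_patient.items()
            if pid in test_patients]
    return train, test


# Five hypothetical patients with three single-cell images each.
image_to_patient = {f"p{p}_img{i}": f"p{p}" for p in range(5) for i in range(3)}

train_images, test_images = patient_level_split(image_to_patient)
train_patients = {image_to_patient[i] for i in train_images}
test_patients = {image_to_patient[i] for i in test_images}
print(len(train_images), len(test_images), train_patients & test_patients)
```

Splitting on image IDs instead of patient IDs would let near-identical cells from one smear appear in both training and test data, which inflates held-out metrics.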

minor comments (2)

[Abstract] The aggregation of image encodings via the transformer architecture is mentioned but would benefit from a brief description of the pooling or attention mechanism used to produce the single vector per patient.
[Results] Attention visualizations are stated to be clinically relevant in both internal and external sets; ensure figure captions explicitly map highlighted cells to known morphological features for each class.
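A paper passage quoted elsewhere on this page says cell-level attentions are obtained with the Attention Rollout method. As commonly described, rollout mixes each layer's attention matrix with the identity (to account for residual connections) and multiplies the results across layers. The sketch below follows that description under assumptions the page does not confirm: attention is already averaged over heads, token 0 is a class token, and the remaining tokens are cells. The matrices are toy values.

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product for small square matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def attention_rollout(layers: List[Matrix]) -> Matrix:
    """Combine head-averaged attention matrices across layers, first layer first."""
    n = len(layers[0])
    rollout = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for attn in layers:
        # Mix attention with the identity to model the residual connection.
        mixed = [[0.5 * attn[i][j] + (0.5 if i == j else 0.0)
                  for j in range(n)]
                 for i in range(n)]
        # Later layers are multiplied on the left.
        rollout = matmul(mixed, rollout)
    return rollout


# Toy row-stochastic attention: token 0 = class token, tokens 1 and 2 = cells.
layer = [
    [0.2, 0.7, 0.1],
    [0.1, 0.8, 0.1],
    [0.1, 0.1, 0.8],
]
rollout = attention_rollout([layer, layer])
cls_to_cells = rollout[0][1:]  # attention flowing from class token to each cell
print(cls_to_cells)
```

Reading off the class-token row gives one relevance score per cell, which is the kind of quantity a figure caption could map to morphological features.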

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback on our manuscript. The comments have identified important areas where additional methodological transparency and analysis will strengthen the presentation of our results. We address each major comment point-by-point below, indicating the specific revisions we will incorporate in the next version of the manuscript.


Referee: [Abstract] The central claim that calibrated prediction probabilities reduce the false discovery rate from 13.5% to 8.7% without missing any acute leukemia cases provides no details on patient-level splitting, class imbalance handling, or statistical testing of the FDR improvement. These elements are load-bearing for assessing whether the reported reduction is robust or could arise from leakage or imbalance artifacts.

Authors: We agree that the abstract would benefit from concise methodological context to support the FDR claim. The full manuscript (Methods section) already specifies patient-level splitting, ensuring all single-cell images from any given patient are assigned exclusively to the training, validation, or test partition to prevent leakage. Class imbalance was handled via a weighted cross-entropy loss with weights inversely proportional to class frequencies in the training set. The reported FDR reduction was assessed using bootstrap resampling (1,000 iterations) to generate confidence intervals around the 13.5% to 8.7% change, together with McNemar’s test for paired proportions to evaluate statistical significance. We will revise the abstract to include a brief clause such as “using patient-level splitting, weighted loss, and bootstrap-validated calibration” while remaining within length constraints.
revision: yes
Referee: [Data and Methods] The eight-class labels are derived from comprehensive bone marrow cytomorphology, cytogenetics, molecular genetics, and immunophenotyping, yet the model receives only peripheral blood images. The abstract notes that neoplastic changes may not manifest as detectable aberrations in peripheral blood; this mismatch creates potential label noise that directly biases calibration, sensitivity, and the FDR metric. A quantitative discordance estimate or label-noise sensitivity analysis is required to support the claims.

Authors: We acknowledge the potential for label noise arising from the clinical mismatch between definitive bone-marrow diagnoses and peripheral-blood images. This limitation is already stated in the abstract and Introduction as an intrinsic feature of PB-smear triage. To address the request directly, we will add a dedicated sensitivity analysis in the revised Methods and Results sections: we will simulate increasing levels of label noise (random flips within clinically plausible ranges drawn from hematology literature) and report the resulting changes in calibration metrics, sensitivity for acute leukemia, and FDR. We will also include a quantitative discordance estimate synthesized from published concordance rates between PB and BM findings for the relevant malignancy classes. These additions will be placed in a new subsection of the Discussion to contextualize the robustness of the reported performance.
revision: yes
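The procedures named in the two simulated responses (inverse-frequency class weights, a bootstrap interval for the FDR, and random label flips) can be sketched in a few lines. Everything here is an assumption for illustration: the normalization n / (k · count) is one common weighting scheme, the percentile interval is one of several bootstrap variants, the flips are uniform over the other classes, and the toy labels are invented. McNemar's test is not shown.

```python
import random
from collections import Counter
from typing import Dict, List, Sequence, Tuple


def inverse_frequency_weights(labels: Sequence[str]) -> Dict[str, float]:
    """Class weights proportional to 1 / class frequency (n / (k * count))."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {c: n / (k * counts[c]) for c in counts}


def fdr(flagged: Sequence[bool], positive: Sequence[bool]) -> float:
    """False discovery rate: share of flagged cases that are not truly positive."""
    tp = sum(1 for f, p in zip(flagged, positive) if f and p)
    fp = sum(1 for f, p in zip(flagged, positive) if f and not p)
    return fp / (tp + fp) if (tp + fp) else 0.0


def bootstrap_fdr_ci(flagged: Sequence[bool], positive: Sequence[bool],
                     n_iter: int = 1000, alpha: float = 0.05,
                     seed: int = 0) -> Tuple[float, float]:
    """Percentile bootstrap interval for the FDR, resampling patients."""
    rng = random.Random(seed)
    n = len(flagged)
    stats = []
    for _ in range(n_iter):
        idx = [rng.randrange(n) for _ in range(n)]
        stats.append(fdr([flagged[i] for i in idx],
                         [positive[i] for i in idx]))
    stats.sort()
    lower = stats[int((alpha / 2) * n_iter)]
    upper = stats[min(n_iter - 1, int((1 - alpha / 2) * n_iter))]
    return lower, upper


def flip_labels(labels: Sequence[str], classes: Sequence[str],
                noise_rate: float, seed: int = 0) -> List[str]:
    """Replace each label, with probability noise_rate, by a different class."""
    rng = random.Random(seed)
    return [rng.choice([c for c in classes if c != y])
            if rng.random() < noise_rate else y
            for y in labels]


# Toy data: 2 acute-leukemia ("AL") and 6 healthy training labels.
train_labels = ["AL"] * 2 + ["healthy"] * 6
weights = inverse_frequency_weights(train_labels)

# Toy triage result: 20 flagged patients, 17 of them true positives.
flagged = [True] * 20 + [False] * 30
positive = [True] * 17 + [False] * 33
point_fdr = fdr(flagged, positive)
ci_low, ci_high = bootstrap_fdr_ci(flagged, positive)

noisy = flip_labels(train_labels, ["AL", "healthy"], noise_rate=0.25)
print(weights, point_fdr, (ci_low, ci_high), noisy)
```

Re-running the evaluation on labels produced by `flip_labels` at increasing `noise_rate` values is one way to see how quickly sensitivity and FDR degrade under label noise.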

Circularity Check

0 steps flagged

No circularity: empirical ML evaluation on external labels


The paper reports standard supervised training of a transformer on peripheral-blood images with eight-class labels derived from independent bone-marrow cytomorphology, cytogenetics, molecular genetics, and immunophenotyping data. Reported metrics (accuracy 0.72, F1 scores, FDR drop from 13.5% to 8.7%, zero missed acute leukemias) are direct hold-out evaluation results, not quantities obtained by fitting a parameter to a subset and then re-using that same parameter as the reported prediction. No equations, load-bearing self-citations, uniqueness theorems, or smuggled ansatz appear in the abstract or described methods; the central performance claims are computed on held-out data that played no part in fitting the model weights.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central performance claims rest on the assumption that bone-marrow-confirmed labels are appropriate supervision for peripheral-blood images and that the eight coarse classes are clinically meaningful for triage decisions. No explicit free parameters or invented entities are introduced beyond standard neural-network training.

axioms (1)

domain assumption: Bone-marrow cytogenetics, molecular genetics and immunophenotyping provide reliable ground-truth labels for peripheral-blood image classification.
Stated in the data description and used to define the eight-class targets.



Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: "cAItomorph leverages the DinoBloom hematology foundation model and aggregates image encodings via a transformer-based architecture into a single vector... overall accuracy of 0.72 in eight disease classification"

IndisputableMonolith/Foundation/ArithmeticFromLogic.lean · embed_strictMono_of_one_lt · unclear
Paper passage: "We obtain cell level attentions from transformer heads using Attention Rollout method"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
