WBCBench 2026: A Challenge for Robust White Blood Cell Classification Under Class Imbalance
Pith reviewed 2026-05-10 15:28 UTC · model grok-4.3
The pith
WBCBench 2026 creates a benchmark that tests white blood cell classifiers on severe class imbalance, patient-level splits, and synthetic domain shifts.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
WBCBench 2026 consists of single-site microscopic blood smear images annotated by expert hematopathologists and organized into a two-phase challenge. Phase one supplies pristine training data; phase two adds degraded images with split-specific severity distributions of noise, blur, and illumination changes to emulate domain shift. Patient-level separation is enforced throughout, and macro-averaged F1 is the primary ranking metric.
What carries the argument
The two-phase benchmark structure that applies controlled synthetic perturbations to a patient-separated collection of 13 morphologically distinct white blood cell classes, scored by macro-averaged F1.
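To make the role of the metric concrete, here is a minimal sketch of macro-averaged F1 in plain Python. It is not the challenge's open-source evaluator; the three-class toy data and the choice to score an absent class as 0 are assumptions made for illustration. It shows why a classifier that always predicts the most common class can reach high accuracy and still rank poorly.

```python
def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of per-class F1 scores over `labels`."""
    scores = []
    for c in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        # Assumption: a class with no true and no predicted samples scores 0.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(labels)


# Toy imbalanced label set (hypothetical counts, 3 classes instead of 13).
labels = ["neutrophil", "lymphocyte", "blast"]
y_true = ["neutrophil"] * 90 + ["lymphocyte"] * 8 + ["blast"] * 2

# A degenerate classifier that always predicts the majority class.
y_majority = ["neutrophil"] * len(y_true)

accuracy = sum(t == p for t, p in zip(y_true, y_majority)) / len(y_true)
score = macro_f1(y_true, y_majority, labels)
print(f"accuracy={accuracy:.3f} macro_f1={score:.3f}")
```

Because every class contributes equally to the mean, the two rare classes pull the score down even though they account for only 10 of the 100 samples. For real use, scikit-learn's `f1_score(..., average="macro")` computes the same quantity; whether the challenge evaluator treats edge cases identically is not stated here.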
If this is right
- Methods must address imbalance across 13 classes without simply favoring the most common types.
- Patient-level splits block data leakage and require models to generalize across different individuals.
- The added image degradations allow direct measurement of robustness to realistic quality variations.
- Standardized evaluation and open evaluator enable consistent ranking of submitted solutions.
- Phase-two results quantify the performance drop when domain shift is introduced after training.
Where Pith is reading between the lines
- Success on the benchmark may indicate which models are more likely to maintain accuracy when moved to new hospitals with different equipment.
- The design could push development of techniques that extract features stable across both class frequencies and image quality changes.
- Extending the benchmark with actual multi-site collections would test whether the synthetic perturbations match natural variations.
- The patient-separation rule might apply usefully to other medical imaging tasks where individual variation matters.
Load-bearing premise
The single-site images with expert labels and the controlled synthetic perturbations will produce a difficulty distribution that predicts performance on real multi-site clinical data.
What would settle it
Compare the leaderboard methods from this benchmark against their accuracy on an independent set of blood smear images collected from multiple sites, scanners, and staining protocols.
Original abstract
We present WBCBench 2026, an ISBI challenge and benchmark for automated WBC classification designed to stress-test algorithms under three key difficulties: (i) severe class imbalance across 13 morphologically fine-grained WBC classes, (ii) strict patient-level separation between training, validation and test sets, and (iii) synthetic scanner- and setting-induced domain shift via controlled noise, blur and illumination perturbations. All images are single-site microscopic blood smear acquisitions with standardised staining and expert hematopathologist annotations. This paper reviews the challenge and summarises the proposed solutions and final outcomes. The benchmark is organised into two phases. Phase 1 provides a pristine training set. Phase 2 introduces degraded images with split-specific severity distributions for train, validation and test, emulating a realistic shift between development and deployment conditions. We specify a standardised submission schema, open-source evaluator, and macro-averaged F1 score as the primary ranking metric.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents WBCBench 2026, an ISBI challenge benchmark for automated white blood cell classification. It is designed to stress-test algorithms under severe class imbalance across 13 morphologically fine-grained classes, strict patient-level separation between train/validation/test sets, and synthetic domain shifts applied via controlled noise, blur, and illumination perturbations to single-site expert-annotated blood smear images. The benchmark is organized into two phases (pristine data in Phase 1; split-specific severity degradations in Phase 2 to emulate development-to-deployment shifts), with a standardized submission schema, open-source evaluator, and macro-averaged F1 score as the primary metric. The paper also reviews submitted solutions and final challenge outcomes.
Significance. If the controlled synthetic perturbations induce difficulty distributions that meaningfully correlate with real multi-site clinical variations, WBCBench 2026 would provide a valuable, reproducible testbed for developing robust WBC classifiers that handle class imbalance and patient-level generalization. The explicit design rules, open-source evaluator, and focus on macro F1 are strengths that support fair comparisons; the patient-level splits and two-phase structure directly target common failure modes in hematology imaging.
major comments (2)
- [Phase 2 description] Phase 2 description (synthetic domain shift): The claim that controlled noise, blur, and illumination perturbations emulate realistic scanner- and setting-induced shifts is not accompanied by any calibration, statistical matching, or comparison to observed distributions from real multi-site data (e.g., inter-lab staining variability or microscope optics differences). This is load-bearing for the central claim that the benchmark stress-tests algorithms under deployment-like conditions.
- [Benchmark construction] Benchmark construction section: While patient-level separation is explicitly stated, no quantitative verification (e.g., checks for residual patient or acquisition leakage across splits) is reported, which is necessary to confirm that the strict separation is achieved in the released data partitions.
minor comments (2)
- [Abstract / Introduction] The abstract and introduction would benefit from a brief table or paragraph summarizing the class distribution and total image counts per split to allow immediate assessment of the imbalance severity.
- [Phase 2 description] Perturbation parameters (e.g., exact noise variance ranges or blur kernel sizes per severity level) should be listed explicitly rather than described qualitatively, to enable exact reproduction by future users.
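As an illustration of what an explicit parameter listing could look like, the sketch below applies blur, an illumination gain, and Gaussian noise at a severity level drawn from a split-specific distribution. Every number in `SEVERITY` and `SPLIT_WEIGHTS`, and every function name, is a hypothetical placeholder; the benchmark's actual parameters and pipeline are not reproduced here.

```python
import random

# Hypothetical severity levels: placeholder values for illustration only,
# not the parameters used by the benchmark.
SEVERITY = {
    0: {"blur_radius": 0, "gain": 1.00, "noise_sigma": 0.00},  # pristine
    1: {"blur_radius": 1, "gain": 0.90, "noise_sigma": 0.02},  # mild
    2: {"blur_radius": 2, "gain": 0.75, "noise_sigma": 0.05},  # severe
}

# Hypothetical split-specific severity distributions (weights for levels 0, 1, 2).
SPLIT_WEIGHTS = {
    "train": [0.6, 0.3, 0.1],
    "val": [0.4, 0.4, 0.2],
    "test": [0.2, 0.4, 0.4],
}


def box_blur(img, radius):
    """Mean filter over a (2*radius+1)^2 window, truncated at the borders."""
    h, w = len(img), len(img[0])
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            window = [
                img[j][i]
                for j in range(max(0, y - radius), min(h, y + radius + 1))
                for i in range(max(0, x - radius), min(w, x + radius + 1))
            ]
            row.append(sum(window) / len(window))
        out.append(row)
    return out


def degrade(img, severity, rng):
    """Blur, scale illumination, add Gaussian noise, then clip to [0, 1]."""
    p = SEVERITY[severity]
    blurred = box_blur(img, p["blur_radius"])
    return [
        [min(1.0, max(0.0, v * p["gain"] + rng.gauss(0.0, p["noise_sigma"]))) for v in row]
        for row in blurred
    ]


def sample_severity(split, rng):
    """Draw a severity level from the split's (hypothetical) distribution."""
    return rng.choices([0, 1, 2], weights=SPLIT_WEIGHTS[split])[0]


rng = random.Random(0)  # fixed seed so the degradation is reproducible
image = [[(x + y) % 2 * 1.0 for x in range(6)] for y in range(6)]  # toy grayscale patch
level = sample_severity("test", rng)
degraded = degrade(image, level, rng)
```

Publishing a table like `SEVERITY` together with the per-split weights and the random seed is the kind of detail that would let future users regenerate the degraded images exactly.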
Simulated Author's Rebuttal
We thank the referee for the constructive comments on our manuscript. We address each major comment point by point below, indicating planned revisions where appropriate.
Point-by-point responses
- Referee: [Phase 2 description] Phase 2 description (synthetic domain shift): The claim that controlled noise, blur, and illumination perturbations emulate realistic scanner- and setting-induced shifts is not accompanied by any calibration, statistical matching, or comparison to observed distributions from real multi-site data (e.g., inter-lab staining variability or microscope optics differences). This is load-bearing for the central claim that the benchmark stress-tests algorithms under deployment-like conditions.
  Authors: We acknowledge that the manuscript does not include direct calibration or statistical matching to real multi-site distributions, as the source images are single-site acquisitions. The synthetic perturbations were chosen to represent common, reproducible imaging artifacts (noise, blur, illumination) that frequently arise in clinical deployment. We will revise the Phase 2 description to clarify that these constitute controlled synthetic shifts simulating plausible deployment variations rather than claiming exact emulation of specific real-world multi-site statistics. A limitations paragraph will be added to discuss this scope explicitly while preserving the benchmark's utility as a standardized, reproducible stress test.
  Revision: partial
- Referee: [Benchmark construction] Benchmark construction section: While patient-level separation is explicitly stated, no quantitative verification (e.g., checks for residual patient or acquisition leakage across splits) is reported, which is necessary to confirm that the strict separation is achieved in the released data partitions.
  Authors: We agree that explicit verification strengthens the claim. The splits were constructed by grouping all images by patient ID and assigning entire patient groups exclusively to one partition. In the revision we will add a verification subsection reporting the number of unique patients per split, confirming zero patient-ID overlap across train/validation/test, and describing metadata checks performed to exclude acquisition leakage. These details will be included in the benchmark construction section.
  Revision: yes
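The split construction and overlap check described in this response can be sketched in a few lines. This is a minimal illustration, not the authors' code: the record layout `(image_id, patient_id)`, the 70/15/15 fractions, and the function names are all assumptions.

```python
import random
from collections import defaultdict


def split_by_patient(records, fractions=(0.7, 0.15, 0.15), seed=0):
    """Assign every image of a patient to exactly one of train/val/test.

    `records` is a list of (image_id, patient_id) pairs. The fractions apply
    to patients, not images, and are hypothetical.
    """
    patients = sorted({pid for _, pid in records})
    random.Random(seed).shuffle(patients)
    n_train = round(len(patients) * fractions[0])
    n_val = round(len(patients) * fractions[1])

    assignment = {}
    for i, pid in enumerate(patients):
        if i < n_train:
            assignment[pid] = "train"
        elif i < n_train + n_val:
            assignment[pid] = "val"
        else:
            assignment[pid] = "test"

    splits = defaultdict(list)
    for image_id, pid in records:
        splits[assignment[pid]].append((image_id, pid))
    return dict(splits)


def patient_overlap(splits):
    """Return patient IDs that occur in more than one split (should be empty)."""
    first_seen = {}
    leaked = set()
    for name, items in splits.items():
        for _, pid in items:
            if first_seen.setdefault(pid, name) != name:
                leaked.add(pid)
    return leaked


# Toy data: 200 images from 20 hypothetical patients.
records = [(f"img{i:03d}", f"p{i % 20:02d}") for i in range(200)]
splits = split_by_patient(records)
patients_per_split = {k: len({pid for _, pid in v}) for k, v in splits.items()}
leaked = patient_overlap(splits)
```

Reporting `patients_per_split` alongside an empty `leaked` set is the kind of quantitative evidence the referee asks for. It only covers patient-ID leakage; acquisition-level leakage would need separate metadata checks, as the authors note.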
Circularity Check
No circularity: benchmark definition is self-contained
full rationale
The paper defines WBCBench 2026 as a challenge dataset with 13-class imbalance, patient-level splits, and controlled synthetic perturbations on single-site annotated smears. No equations, fitted parameters, predictions, or derivations appear in the abstract or described structure. The contribution is the benchmark specification itself (training/validation/test phases, evaluator, macro-F1 metric) rather than any result obtained from prior quantities. No self-citation chains, ansatzes, or uniqueness claims are invoked to support load-bearing steps. This matches the default non-circular case for a dataset/challenge paper.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel
  Relation between the paper passage and the cited Recognition theorem: unclear
  Paper passage: severe class imbalance across 13 morphologically fine-grained WBC classes... synthetic scanner- and setting-induced domain shift via controlled noise, blur and illumination perturbations... macro-averaged F1 score
- IndisputableMonolith/Foundation/DimensionForcing.lean, alexander_duality_circle_linking
  Relation between the paper passage and the cited Recognition theorem: unclear
  Paper passage: Phase 2 introduces degraded images with split-specific severity distributions... Table 2. Severity-dependent parameters
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1] INTRODUCTION. White blood cell (WBC) morphology is central to diagnosing and monitoring haematologic and immunologic disorders, including leukaemia, myelodysplastic syndromes and severe infections. In routine practice, haematologists inspect Wright–Giemsa stained peripheral blood smears to quantify and characterise WBC types such as neutrophils, lymphocy...
- [2] EXISTING DATASETS AND CHALLENGES. Several public WBC datasets (e.g. Raabin-WBC [6], ALL-IDB [7] and related collections) have stimulated research in this area. Yet most exhibit modest sample sizes, coarse class taxonomies, or inadequate documentation of patient-level splits and acquisition conditions. Moreover, there is still no widely accepted benchmark t...
- [3] WBCBENCH 2026 DATASET. 3.1. Clinical context and annotation. WBCBench 2026 comprises 55,012 microscopic images derived from 493 patients. All images in the dataset are microscopic peripheral blood smear patches acquired at a single institution using standardised Wright–Giemsa staining and a fixed imaging pipeline. Cells originate from patients routinely i...
- [4] BASELINES AND EVALUATION. A. Baseline Models. The challenge does not prescribe a specific modelling approach; participants are free to design arbitrary architectures and training strategies. To provide a reference point, we implement two baselines [8]. • Convolutional networks. ResNet-50 [9] initialised from ImageNet pretraining, fine-tuned end-to-end with...
- [5] RESULTS AND DISCUSSION. A total of 241 teams registered for WBCBench 2026, spanning academia, industry and independent researchers, among which 101 teams submitted at least one valid set of predictions [11–16]. Among the participants, 73 (72%) exceeded the ResNet-50 baseline (0.635) and 66 (65%) surpassed the stronger Swin-Tiny baseline (0.643). 7 teams achieved m...
- [6] CONCLUSION. We presented WBCBench 2026, an ISBI challenge and benchmark targeting robust white blood cell classification under realistic class imbalance and synthetic domain shift. The dataset comprises single-site, expert-annotated blood smear images spanning 13 WBC classes, including blasts and other rare subtypes. With 241 registered teams and 101 val...
- [7] COMPLIANCE WITH ETHICAL STANDARDS. This study was performed in line with the principles of the Declaration of Helsinki. The dataset was commercially obtained from Chulalongkorn University. Additional ethical approval was not required, as confirmed by the license
- [8] X. Fuentes-Arderiu, M. García-Panyella, and D. Dot-Bach, “Between-examiner reproducibility in manual differential leukocyte counting,” Accred Qual Assur, 2007
- [9] Y. Zhao et al., “Performance evaluation of the digital morphology analyser Sysmex DI-60 for white blood cell differentials in abnormal samples,” Scientific Reports, 2024
- [10] H. Kim, M. Hur, H. Kim, S. Kim, H. Moon, and Y. Yun, “Performance of automated digital cell imaging analyzer Sysmex DI-60,” Clinical Chemistry and Laboratory Medicine (CCLM), vol. 56, no. 1, 2018
- [11] Y. Tabe, T. Yamamoto, I. Maenou, R. Nakai, M. Idei, T. Horii, T. Miida, and A. Ohsaka, “Performance evaluation of the digital cell imaging analyzer DI-60 integrated into the fully automated Sysmex XN hematology analyzer system,” Clinical Chemistry and Laboratory Medicine (CCLM), vol. 53, no. 2, 2015
- [12] M. Nam, S. Yoon, M. Hur, G. Lee, et al., “Digital morphology analyzer Sysmex DI-60 vs. manual counting for white blood cell differentials in leukopenic samples: a comparative assessment of risk and turnaround time,” Annals of Laboratory Medicine, vol. 42, no. 4, 2022
- [13] Zahra Mousavi Kouzehkanan, Sepehr Saghari, Sajad Tavakoli, Peyman Rostami, Mohammadjavad Abaszadeh, Farzaneh Mirzadeh, Esmaeil Shahabi Satlsar, Maryam Gheidishahran, Fatemeh Gorgi, Saeed Mohammadi, and Reshad Hosseini, “A large dataset of white blood cells containing cell locations and types, along with segmented nuclei and cytoplasm,” Scientific Reports, ...
- [14] Ruggero Donida Labati, Vincenzo Piuri, and Fabio Scotti, “ALL-IDB: The acute lymphoblastic leukemia image database for image processing,” in Proc. IEEE Int. Conf. Image Process. (ICIP), 2011, pp. 2045–2048
- [15] Lewis Clifton, Xin Tian, Duangdao Palasuwan, Phandee Watanaboonyongcharoen, Ponlapat Rojnuckarin, and Nantheera Anantrasirichai, “Mamba-based ensemble learning for white blood cell classification,” in IEEE International Symposium on Biomedical Imaging (ISBI), 2026
- [16] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778
- [17] Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo, “Swin transformer: Hierarchical vision transformer using shifted windows,” in Proceedings of the IEEE/CVF international conference on computer vision (ICCV), 2021, pp. 10012–10022
- [18] Siddharth Srivastava, Adam Smith, Scott Brooks, Jack Bacon, and Till Bretschneider, “Ensemble of small classifiers for imbalanced white blood cell classification,” in 2026 IEEE International Symposium on Biomedical Imaging (ISBI), 2026
- [19] Antony Gitau, Martin Paulson, Bjørn-Jostein Singstad, Karl Thomas Hjelmervik, Ola Marius Lysaker, and Veralia Gabriela Sanchez, “Multi-stage fine-tuning of pathology foundation models with head-diverse ensembling for white blood cell classification,” in 2026 IEEE International Symposium on Biomedical Imaging (ISBI), 2026
- [20] Fan Xiao, Zirui Chen, Jilan Xu, and Junlin Hou, “Foundation model enhanced hierarchical learning for white blood cell classification,” in 2026 IEEE International Symposium on Biomedical Imaging (ISBI), 2026
- [21] Duc T. Nguyen, Hoang-Long Nguyen, and Huy-Hieu Pham, “Synergizing deep learning and biological heuristics for extreme long-tail white blood cell classification,” in 2026 IEEE International Symposium on Biomedical Imaging (ISBI), 2026
- [22] Luu Le, Hoang-Loc Cao, Ha-Hieu Pham, Thanh-Huy Nguyen, and Ulas Bagci, “Robust white blood cell classification with stain-normalized decoupled learning and ensembling,” in 2026 IEEE International Symposium on Biomedical Imaging (ISBI), 2026
- [23] Tingkwong Ng, Ruyi Dai, and Hao Chen, “A hierarchical ensemble inference pipeline for robust white blood cell classification under domain shifts,” in 2026 IEEE International Symposium on Biomedical Imaging (ISBI), 2026