arxiv: 2604.10797 · v1 · submitted 2026-04-12 · 💻 cs.CV

WBCBench 2026: A Challenge for Robust White Blood Cell Classification Under Class Imbalance

Pith reviewed 2026-05-10 15:28 UTC · model grok-4.3

classification 💻 cs.CV
keywords white blood cell classification · class imbalance · domain shift · benchmark dataset · microscopic image analysis · robust classification · medical diagnostics · patient-level separation

The pith

WBCBench 2026 creates a benchmark that tests white blood cell classifiers on severe class imbalance, patient-level splits, and synthetic domain shifts.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces WBCBench 2026 as a benchmark and challenge for automated classification of white blood cells. It incorporates three main stresses: heavy imbalance among 13 fine-grained classes, strict separation of training, validation, and test data by individual patients, and controlled synthetic changes to image quality that mimic scanner and lighting variations. The setup divides the task into two phases, starting with clean images and then adding degradations with different severity levels across splits. A standard submission format and macro-averaged F1 score serve as the evaluation method. If the benchmark works as intended, it would better predict how classification models perform when moved from development labs to varied real-world hospital settings.

Core claim

WBCBench 2026 consists of single-site microscopic blood smear images annotated by expert hematopathologists, organized into a two-phase challenge where phase one supplies pristine training data and phase two adds degraded images with split-specific severity distributions of noise, blur, and illumination changes to emulate domain shift, while enforcing patient-level separation throughout and using macro-averaged F1 as the primary ranking metric.

What carries the argument

The two-phase benchmark structure that applies controlled synthetic perturbations to a patient-separated collection of 13 morphologically distinct white blood cell classes, scored by macro-averaged F1.
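
To make the choice of metric concrete, here is a minimal sketch of macro-averaged F1 in plain Python. It is not the challenge's official evaluator: the function name `macro_f1`, the toy 90/8/2 label counts, and the convention that a class with no true and no predicted samples scores 0 are all assumptions made for illustration.

```python
from collections import Counter

def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of per-class F1 scores over `labels`."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class gains a false positive
            fn[t] += 1  # true class gains a false negative
    scores = []
    for c in labels:
        denom = 2 * tp[c] + fp[c] + fn[c]
        # Assumed convention: an empty class contributes an F1 of 0.
        scores.append(2 * tp[c] / denom if denom else 0.0)
    return sum(scores) / len(labels)

# Toy 3-class example with a 90/8/2 imbalance (illustrative numbers only).
labels = ["neutrophil", "lymphocyte", "blast"]
y_true = ["neutrophil"] * 90 + ["lymphocyte"] * 8 + ["blast"] * 2
majority = ["neutrophil"] * 100  # always predict the most common class

accuracy = sum(t == p for t, p in zip(y_true, majority)) / len(y_true)
score = macro_f1(y_true, majority, labels)
print(f"accuracy={accuracy:.3f} macro_f1={score:.3f}")
```

In this toy case the majority-class predictor reaches 0.90 accuracy but only about 0.316 macro-F1, because the two rare classes each contribute an F1 of zero to the average. That is the behaviour that makes macro-F1 a reasonable ranking metric under heavy imbalance.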

If this is right

  • Methods must address imbalance across 13 classes without simply favoring the most common types.
  • Patient-level splits block data leakage and require models to generalize across different individuals.
  • The added image degradations allow direct measurement of robustness to realistic quality variations.
  • A standardized submission format and an open-source evaluator enable consistent ranking of submitted solutions.
  • Phase-two results quantify the performance drop when domain shift is introduced after training.
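
One common way to keep a model from simply favoring the frequent classes is to weight the training loss by inverse class frequency. The sketch below is an assumption about how a participant might do this, not a technique the paper prescribes; the helper name `inverse_frequency_weights` and the label counts are hypothetical. The formula `n_samples / (n_classes * class_count)` is the usual "balanced" heuristic.

```python
from collections import Counter

def inverse_frequency_weights(labels):
    """Return {class: weight} with weight = n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {c: n_samples / (n_classes * m) for c, m in counts.items()}

# Hypothetical training labels with a 90/8/2 imbalance.
train_labels = ["neutrophil"] * 90 + ["lymphocyte"] * 8 + ["blast"] * 2
weights = inverse_frequency_weights(train_labels)

for cls, w in sorted(weights.items(), key=lambda kv: kv[1]):
    print(f"{cls:<12} weight={w:.3f}")
```

With these weights every class contributes the same total weight to the loss, so rare classes such as blasts are no longer drowned out by the common ones. Whether this helps on the benchmark itself is an empirical question the sketch does not answer.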

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Success on the benchmark may indicate which models are more likely to maintain accuracy when moved to new hospitals with different equipment.
  • The design could push development of techniques that extract features stable across both class frequencies and image quality changes.
  • Extending the benchmark with actual multi-site collections would test whether the synthetic perturbations match natural variations.
  • The patient-separation rule might apply usefully to other medical imaging tasks where individual variation matters.

Load-bearing premise

The single-site images with expert labels and the controlled synthetic perturbations will produce a difficulty distribution that predicts performance on real multi-site clinical data.

What would settle it

Evaluate the benchmark's leaderboard methods on an independent set of blood smear images collected from multiple sites, scanners, and staining protocols, and check whether their benchmark ranking carries over.

Original abstract

We present WBCBench 2026, an ISBI challenge and benchmark for automated WBC classification designed to stress-test algorithms under three key difficulties: (i) severe class imbalance across 13 morphologically fine-grained WBC classes, (ii) strict patient-level separation between training, validation and test sets, and (iii) synthetic scanner- and setting-induced domain shift via controlled noise, blur and illumination perturbations. All images are single-site microscopic blood smear acquisitions with standardised staining and expert hematopathologist annotations. This paper reviews the challenge and summarises the proposed solutions and final outcomes. The benchmark is organised into two phases. Phase 1 provides a pristine training set. Phase 2 introduces degraded images with split-specific severity distributions for train, validation and test, emulating a realistic shift between development and deployment conditions. We specify a standardised submission schema, open-source evaluator, and macro-averaged F1 score as the primary ranking metric.
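
The idea of split-specific severity distributions can be sketched as follows. Everything numeric here is hypothetical: the abstract does not give the number of severity levels or their probabilities per split, so the `SEVERITY_DIST` table and the function `assign_severities` are assumptions chosen only to show a shift toward heavier degradation from train to test.

```python
import random

# Hypothetical probabilities of each severity level (0 = clean) per split.
# The real distributions used by the challenge are not stated in this text.
SEVERITY_DIST = {
    "train": {0: 0.7, 1: 0.2, 2: 0.1},
    "val":   {0: 0.4, 1: 0.4, 2: 0.2},
    "test":  {0: 0.2, 1: 0.4, 2: 0.4},
}

def assign_severities(image_ids, split, seed=0):
    """Draw one severity level per image from the distribution of its split."""
    rng = random.Random(seed)  # fixed seed keeps the assignment reproducible
    levels = list(SEVERITY_DIST[split])
    probs = [SEVERITY_DIST[split][s] for s in levels]
    return {i: rng.choices(levels, weights=probs, k=1)[0] for i in image_ids}

ids = [f"img_{i}" for i in range(1000)]
for split in ("train", "val", "test"):
    assigned = assign_severities(ids, split)
    mean_level = sum(assigned.values()) / len(assigned)
    print(f"{split:<5} mean severity = {mean_level:.2f}")
```

Under these assumed tables the test split is degraded more heavily on average than the training split, which is one way to emulate a gap between development and deployment conditions.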

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript presents WBCBench 2026, an ISBI challenge benchmark for automated white blood cell classification. It is designed to stress-test algorithms under severe class imbalance across 13 morphologically fine-grained classes, strict patient-level separation between train/validation/test sets, and synthetic domain shifts applied via controlled noise, blur, and illumination perturbations to single-site expert-annotated blood smear images. The benchmark is organized into two phases (pristine data in Phase 1; split-specific severity degradations in Phase 2 to emulate development-to-deployment shifts), with a standardized submission schema, open-source evaluator, and macro-averaged F1 score as the primary metric. The paper also reviews submitted solutions and final challenge outcomes.

Significance. If the controlled synthetic perturbations induce difficulty distributions that meaningfully correlate with real multi-site clinical variations, WBCBench 2026 would provide a valuable, reproducible testbed for developing robust WBC classifiers that handle class imbalance and patient-level generalization. The explicit design rules, open-source evaluator, and focus on macro F1 are strengths that support fair comparisons; the patient-level splits and two-phase structure directly target common failure modes in hematology imaging.

major comments (2)
  1. [Phase 2 description] Phase 2 description (synthetic domain shift): The claim that controlled noise, blur, and illumination perturbations emulate realistic scanner- and setting-induced shifts is not accompanied by any calibration, statistical matching, or comparison to observed distributions from real multi-site data (e.g., inter-lab staining variability or microscope optics differences). This is load-bearing for the central claim that the benchmark stress-tests algorithms under deployment-like conditions.
  2. [Benchmark construction] Benchmark construction section: While patient-level separation is explicitly stated, no quantitative verification (e.g., checks for residual patient or acquisition leakage across splits) is reported, which is necessary to confirm that the strict separation is achieved in the released data partitions.
minor comments (2)
  1. [Abstract / Introduction] The abstract and introduction would benefit from a brief table or paragraph summarizing the class distribution and total image counts per split to allow immediate assessment of the imbalance severity.
  2. [Phase 2 description] Perturbation parameters (e.g., exact noise variance ranges or blur kernel sizes per severity level) should be listed explicitly rather than described qualitatively, to enable exact reproduction by future users.
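
To illustrate what an explicit parameter listing could look like, the following sketch applies blur, an illumination gain, and additive Gaussian noise to a grayscale image stored as nested lists with values in [0, 1]. The `SEVERITY` table, the order of operations, and the use of a box (mean) filter in place of whatever blur kernel the challenge used are all assumptions; none of these values come from the paper.

```python
import random

# Hypothetical severity table; the paper's real parameter ranges are not listed here.
SEVERITY = {
    0: {"sigma": 0.00, "radius": 0, "gain": 1.0},
    1: {"sigma": 0.02, "radius": 1, "gain": 0.9},
    2: {"sigma": 0.05, "radius": 2, "gain": 0.7},
}

def clip01(x):
    """Clamp a pixel value to the valid range [0, 1]."""
    return min(1.0, max(0.0, x))

def box_blur(img, radius):
    """Mean filter over a (2r+1) x (2r+1) window, truncated at the image border."""
    if radius == 0:
        return [row[:] for row in img]
    h, w = len(img), len(img[0])
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            vals = [img[j][i]
                    for j in range(max(0, y - radius), min(h, y + radius + 1))
                    for i in range(max(0, x - radius), min(w, x + radius + 1))]
            row.append(sum(vals) / len(vals))
        out.append(row)
    return out

def degrade(img, level, seed=0):
    """Apply blur, then illumination gain, then additive Gaussian noise."""
    p = SEVERITY[level]
    rng = random.Random(seed)  # seeded so the degradation is reproducible
    blurred = box_blur(img, p["radius"])
    return [[clip01(v * p["gain"] + rng.gauss(0.0, p["sigma"])) for v in row]
            for row in blurred]

patch = [[0.0, 1.0, 0.0], [1.0, 0.5, 1.0], [0.0, 1.0, 0.0]]
degraded = degrade(patch, level=2)
```

Publishing a table of this kind together with the random seed would let future users regenerate the degraded images exactly, which is the point of the referee's request.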

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on our manuscript. We address each major comment point by point below, indicating planned revisions where appropriate.

  1. Referee: [Phase 2 description] Phase 2 description (synthetic domain shift): The claim that controlled noise, blur, and illumination perturbations emulate realistic scanner- and setting-induced shifts is not accompanied by any calibration, statistical matching, or comparison to observed distributions from real multi-site data (e.g., inter-lab staining variability or microscope optics differences). This is load-bearing for the central claim that the benchmark stress-tests algorithms under deployment-like conditions.

    Authors: We acknowledge that the manuscript does not include direct calibration or statistical matching to real multi-site distributions, as the source images are single-site acquisitions. The synthetic perturbations were chosen to represent common, reproducible imaging artifacts (noise, blur, illumination) that frequently arise in clinical deployment. We will revise the Phase 2 description to clarify that these constitute controlled synthetic shifts simulating plausible deployment variations rather than claiming exact emulation of specific real-world multi-site statistics. A limitations paragraph will be added to discuss this scope explicitly while preserving the benchmark's utility as a standardized, reproducible stress test. revision: partial

  2. Referee: [Benchmark construction] Benchmark construction section: While patient-level separation is explicitly stated, no quantitative verification (e.g., checks for residual patient or acquisition leakage across splits) is reported, which is necessary to confirm that the strict separation is achieved in the released data partitions.

    Authors: We agree that explicit verification strengthens the claim. The splits were constructed by grouping all images by patient ID and assigning entire patient groups exclusively to one partition. In the revision we will add a verification subsection reporting the number of unique patients per split, confirming zero patient-ID overlap across train/validation/test, and describing metadata checks performed to exclude acquisition leakage. These details will be included in the benchmark construction section. revision: yes
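
The procedure described in this response, grouping images by patient ID, assigning whole patients to one partition, and then confirming zero overlap, can be sketched as below. The split fractions, the ID formats, and the function names `patient_level_split` and `patient_overlap` are hypothetical; the paper's actual partition sizes are not given in this text.

```python
import random
from collections import defaultdict

def patient_level_split(records, fractions=(0.7, 0.15, 0.15), seed=0):
    """records: iterable of (image_id, patient_id). Whole patients go to one split."""
    by_patient = defaultdict(list)
    for image_id, patient_id in records:
        by_patient[patient_id].append(image_id)
    patients = sorted(by_patient)           # sort first so the shuffle is reproducible
    random.Random(seed).shuffle(patients)
    n = len(patients)
    cut1 = round(n * fractions[0])
    cut2 = cut1 + round(n * fractions[1])
    groups = {"train": patients[:cut1],
              "val": patients[cut1:cut2],
              "test": patients[cut2:]}
    return {name: {p: by_patient[p] for p in ps} for name, ps in groups.items()}

def patient_overlap(split):
    """Return the set of patient IDs that appear in more than one partition."""
    names = list(split)
    overlap = set()
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            overlap |= set(split[a]) & set(split[b])
    return overlap

# Hypothetical records: 20 patients with 3 images each.
records = [(f"img_{p}_{k}", f"patient_{p:03d}") for p in range(20) for k in range(3)]
split = patient_level_split(records)

for name, patients in split.items():
    print(name, "patients:", len(patients))
print("patients shared across splits:", len(patient_overlap(split)))
```

A check like `patient_overlap` only rules out leakage through the patient ID. Acquisition-level leakage, such as images from the same slide or session filed under different IDs, would need the separate metadata checks the authors mention.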

Circularity Check

0 steps flagged

No circularity: benchmark definition is self-contained

The paper defines WBCBench 2026 as a challenge dataset with 13-class imbalance, patient-level splits, and controlled synthetic perturbations on single-site annotated smears. No equations, fitted parameters, predictions, or derivations appear in the abstract or described structure. The contribution is the benchmark specification itself (training/validation/test phases, evaluator, macro-F1 metric) rather than any result obtained from prior quantities. No self-citation chains, ansatzes, or uniqueness claims are invoked to support load-bearing steps. This matches the default non-circular case for a dataset/challenge paper.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is a benchmark proposal paper containing no mathematical derivations, fitted parameters, or postulated entities; the contribution is the dataset construction rules and evaluation protocol.

pith-pipeline@v0.9.0 · 5496 in / 1116 out tokens · 42878 ms · 2026-05-10T15:28:56.642357+00:00 · methodology

Reference graph

Works this paper leans on

23 extracted references · 23 canonical work pages · 1 internal anchor

  1. [1]

    INTRODUCTION White blood cell (WBC) morphology is central to diagnosing and monitoring haematologic and immunologic disorders, including leukaemia, myelodysplastic syndromes and severe infections. In routine practice, haematologists inspect Wright–Giemsa stained peripheral blood smears to quantify and characterise WBC types such as neutrophils, lymphocy...

  2. [2]

    Raabin-WBC [6], ALL-IDB [7] and related collections) have stimulated research in this area

    EXISTING DATASETS AND CHALLENGES Several public WBC datasets (e.g. Raabin-WBC [6], ALL-IDB [7] and related collections) have stimulated research in this area. Yet most exhibit modest sample sizes, coarse class taxonomies, or inadequate documentation of patient-level splits and acquisition conditions. Moreover, there is still no widely accepted benchmark t...

  3. [3]

    Clinical context and annotation WBCBench 2026 comprises 55,012 microscopic images derived from 493 patients

    WBCBENCH 2026 DATASET 3.1. Clinical context and annotation WBCBench 2026 comprises 55,012 microscopic images derived from 493 patients. All images in the dataset are microscopic peripheral blood smear patches acquired at a single institution using standardised Wright–Giemsa staining and a fixed imaging pipeline. Cells originate from patients routinely i...

  4. [4]

    Baseline Models The challenge does not prescribe a specific modelling approach; participants are free to design arbitrary architectures and training strategies

    BASELINES AND EVALUATION A. Baseline Models The challenge does not prescribe a specific modelling approach; participants are free to design arbitrary architectures and training strategies. To provide a reference point, we implement two baselines [8]. • Convolutional networks. ResNet-50 [9] initialised from ImageNet pretraining, fine-tuned end-to-end with...

  5. [5]

    spikiness score

    RESULTS AND DISCUSSION A total of 241 teams registered for WBCBench 2026, spanning academia, industry and independent researchers, among which 101 teams submitted at least one valid set of predictions [11–16]. Among the participants, 73 (72%) exceeded the ResNet-50 baseline (0.635) and 66 (65%) surpassed the stronger Swin-Tiny baseline (0.643). 7 teams achieved m...

  6. [6]

    The dataset comprises single-site, expert-annotated blood smear images spanning 13 WBC classes, including blasts and other rare subtypes

    CONCLUSION We presented WBCBench 2026, an ISBI challenge and benchmark targeting robust white blood cell classification under realistic class imbalance and synthetic domain shift. The dataset comprises single-site, expert-annotated blood smear images spanning 13 WBC classes, including blasts and other rare subtypes. With 241 registered teams and 101 val...

  7. [7]

    The dataset was commercially obtained from Chulalongkorn University

    COMPLIANCE WITH ETHICAL STANDARDS This study was performed in line with the principles of the Declaration of Helsinki. The dataset was commercially obtained from Chulalongkorn University. Additional ethical approval was not required, as confirmed by the license

  8. [8]

    Between-examiner reproducibility in manual differential leukocyte counting,

    X. Fuentes-Arderiu, M. García-Panyella, and D. Dot-Bach, “Between-examiner reproducibility in manual differential leukocyte counting,” Accred Qual Assur, 2007

  9. [9]

    Performance evaluation of the digital morphology analyser sysmex DI-60 for white blood cell differentials in abnormal samples,

    Y. Zhao et al., “Performance evaluation of the digital morphology analyser sysmex DI-60 for white blood cell differentials in abnormal samples,” Scientific Reports, 2024

  10. [10]

    Performance of automated digital cell imaging analyzer sysmex DI-60,

    H. Kim, M. Hur, H. Kim, S. Kim, H. Moon, and Y. Yun, “Performance of automated digital cell imaging analyzer sysmex DI-60,” Clinical Chemistry and Laboratory Medicine (CCLM), vol. 56, no. 1, 2018

  11. [11]

    Performance evaluation of the digital cell imaging analyzer DI-60 integrated into the fully automated sysmex xn hematology analyzer system,

    Y. Tabe, T. Yamamoto, I. Maenou, R. Nakai, M. Idei, T. Horii, T. Miida, and A. Ohsaka, “Performance evaluation of the digital cell imaging analyzer DI-60 integrated into the fully automated sysmex xn hematology analyzer system,” Clinical Chemistry and Laboratory Medicine (CCLM), vol. 53, no. 2, 2015

  12. [12]

    Digital morphology analyzer sysmex DI-60 vs. manual counting for white blood cell differentials in leukopenic samples: a comparative assessment of risk and turnaround time,

    M. Nam, S. Yoon, M. Hur, G. Lee, et al., “Digital morphology analyzer sysmex DI-60 vs. manual counting for white blood cell differentials in leukopenic samples: a comparative assessment of risk and turnaround time,” Annals of laboratory medicine, vol. 42, no. 4, 2022

  13. [13]

    A large dataset of white blood cells containing cell locations and types, along with segmented nuclei and cytoplasm,

    Zahra Mousavi Kouzehkanan, Sepehr Saghari, Sajad Tavakoli, Peyman Rostami, Mohammadjavad Abaszadeh, Farzaneh Mirzadeh, Esmaeil Shahabi Satlsar, Maryam Gheidishahran, Fatemeh Gorgi, Saeed Mohammadi, and Reshad Hosseini, “A large dataset of white blood cells containing cell locations and types, along with segmented nuclei and cytoplasm,” Scientific Reports, ...

  14. [14]

    ALL-IDB: The acute lymphoblastic leukemia image database for image processing,

    Ruggero Donida Labati, Vincenzo Piuri, and Fabio Scotti, “ALL-IDB: The acute lymphoblastic leukemia image database for image processing,” in Proc. IEEE Int. Conf. Image Process. (ICIP), 2011, pp. 2045–2048

  15. [15]

    Mamba-based ensemble learning for white blood cell classification,

    Lewis Clifton, Xin Tian, Duangdao Palasuwan, Phandee Watanaboonyongcharoen, Ponlapat Rojnuckarin, and Nantheera Anantrasirichai, “Mamba-based ensemble learning for white blood cell classification,” in IEEE International Symposium on Biomedical Imaging (ISBI), 2026

  16. [16]

    Deep residual learning for image recognition,

    Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778

  17. [17]

    Swin transformer: Hierarchical vision transformer using shifted windows,

    Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo, “Swin transformer: Hierarchical vision transformer using shifted windows,” in Proceedings of the IEEE/CVF international conference on computer vision (ICCV), 2021, pp. 10012–10022

  18. [18]

    Ensemble of small classifiers for imbalanced white blood cell classification,

    Siddharth Srivastava, Adam Smith, Scott Brooks, Jack Bacon, and Till Bretschneider, “Ensemble of small classifiers for imbalanced white blood cell classification,” in 2026 IEEE International Symposium on Biomedical Imaging (ISBI), 2026

  19. [19]

    Multi-stage fine-tuning of pathology foundation models with head-diverse ensembling for white blood cell classification,

    Antony Gitau, Martin Paulson, Bjørn-Jostein Singstad, Karl Thomas Hjelmervik, Ola Marius Lysaker, and Veralia Gabriela Sanchez, “Multi-stage fine-tuning of pathology foundation models with head-diverse ensembling for white blood cell classification,” in 2026 IEEE International Symposium on Biomedical Imaging (ISBI), 2026

  20. [20]

    Foundation model enhanced hierarchical learning for white blood cell classification,

    Fan Xiao, Zirui Chen, Jilan Xu, and Junlin Hou, “Foundation model enhanced hierarchical learning for white blood cell classification,” in 2026 IEEE International Symposium on Biomedical Imaging (ISBI), 2026

  21. [21]

    Synergizing deep learning and biological heuristics for extreme long-tail white blood cell classification,

    Duc T. Nguyen, Hoang-Long Nguyen, and Huy-Hieu Pham, “Synergizing deep learning and biological heuristics for extreme long-tail white blood cell classification,” in 2026 IEEE International Symposium on Biomedical Imaging (ISBI), 2026

  22. [22]

    Robust white blood cell classification with stain-normalized decoupled learning and ensembling,

    Luu Le, Hoang-Loc Cao, Ha-Hieu Pham, Thanh-Huy Nguyen, and Ulas Bagci, “Robust white blood cell classification with stain-normalized decoupled learning and ensembling,” in 2026 IEEE International Symposium on Biomedical Imaging (ISBI), 2026

  23. [23]

    A hierarchical ensemble inference pipeline for robust white blood cell classification under domain shifts,

    Tingkwong Ng, Ruyi Dai, and Hao Chen, “A hierarchical ensemble inference pipeline for robust white blood cell classification under domain shifts,” in 2026 IEEE International Symposium on Biomedical Imaging (ISBI), 2026