A Multi Center Breast FNAC Whole-Slide Cytology Dataset for AI-Assisted Patch-Wise Classification Using C1 to C5 Reporting Categories
Pith reviewed 2026-06-30 06:17 UTC · model grok-4.3
The pith
A multi-center dataset of 470 breast FNAC whole-slide images yields 7398 expert-labeled patches for C1-C5 patch-wise AI classification.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors present a dataset of 470 whole-slide images from 321 patients that produces 7398 PNG patches carrying expert-verified C1 to C5 reporting labels, gathered across centers with Papanicolaou or May-Grunwald Giemsa staining, scanned on a Hamamatsu NanoZoomer, and released in full with supporting annotation and metadata files for patch-wise classification tasks.
What carries the argument
The extraction of 7398 labeled PNG patches from 446 annotated whole-slide images using C1-C5 reporting categories.
If this is right
- Models trained on the patches can perform patch-wise classification into the five standard reporting categories.
- The multi-center origin allows testing of model robustness across staining protocols and sites.
- The released NDPI files and GeoJSON annotations support both patch-level and whole-slide experiments.
- Public availability on Zenodo enables direct reuse without new data collection.
- The accompanying code lowers the barrier for other groups to inspect or extend the data.
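As an illustration of how the GeoJSON annotations mentioned above might be inspected, the following is a minimal sketch that reads a GeoJSON FeatureCollection and reports the bounding box of each polygon. It relies only on the standard GeoJSON layout (features, geometry, coordinates); the property name `category` and the example coordinates are assumptions made for illustration and are not taken from the released files.

```python
import json


def polygon_bounding_boxes(geojson_text):
    """Return (label, x_min, y_min, x_max, y_max) for each Polygon feature."""
    collection = json.loads(geojson_text)
    boxes = []
    for feature in collection.get("features", []):
        geometry = feature.get("geometry") or {}
        if geometry.get("type") != "Polygon":
            continue  # this sketch ignores all other geometry types
        exterior_ring = geometry["coordinates"][0]  # first ring is the outer boundary
        xs = [point[0] for point in exterior_ring]
        ys = [point[1] for point in exterior_ring]
        # "category" is a hypothetical property name, not confirmed by the release
        label = (feature.get("properties") or {}).get("category")
        boxes.append((label, min(xs), min(ys), max(xs), max(ys)))
    return boxes


# Hypothetical annotation with one square region, in pixel coordinates.
example = json.dumps({
    "type": "FeatureCollection",
    "features": [{
        "type": "Feature",
        "properties": {"category": "C2"},
        "geometry": {
            "type": "Polygon",
            "coordinates": [[[100, 200], [612, 200], [612, 712], [100, 712], [100, 200]]],
        },
    }],
})

print(polygon_bounding_boxes(example))
```

The data dictionary shipped with the release should be consulted for the actual property names before adapting this to the real annotation files.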
Where Pith is reading between the lines
- Models developed from this data might reduce inter-observer variability in FNAC interpretation if they generalize beyond the training patches.
- The dual-staining design could allow future work to quantify how staining choice affects AI performance.
- Release of the full 950 GB package sets a practical example for sharing large cytology imaging collections.
Load-bearing premise
Expert labels assigned to the extracted patches are accurate and consistent enough to serve as reliable ground truth.
What would settle it
Independent review by additional pathologists finding substantial disagreement with the supplied C1-C5 labels on more than a small fraction of patches.
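One common way to quantify such disagreement is Cohen's kappa, which corrects raw agreement for the agreement expected by chance. The sketch below is a plain-Python illustration of that statistic, not a procedure described in the paper; the two label lists are invented solely to show the call.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters who labeled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(labels_a)
    # Fraction of items on which the two raters agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    # Agreement expected if both raters labeled independently at their own rates.
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical category throughout
    return (observed - expected) / (1.0 - expected)


# Invented labels: the supplied patch labels versus one hypothetical re-reviewer.
supplied = ["C1", "C2", "C2", "C5", "C5", "C3"]
reviewer = ["C1", "C2", "C2", "C5", "C4", "C3"]
print(round(cohen_kappa(supplied, reviewer), 3))
```

Because C1 to C5 are ordered categories, a weighted kappa that penalizes distant disagreements more heavily could be a reasonable alternative.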
Original abstract
We present a multi-center breast fine needle aspiration cytology (FNAC) dataset designed for patch-wise classification using C1 to C5 reporting labels. The prospective dataset includes 321 patients and 470 whole-slide images (WSIs) collected from participating tertiary medical centers in India between May 2023 and March 2026. Slides were stained using Papanicolaou (190 WSIs) or May-Grunwald Giemsa (280 WSIs), scanned on a Hamamatsu NanoZoomer S360 at 40X magnification and 0.25 microns per pixel, and stored directly in NDPI format. Across the 470 WSIs, 446 WSIs contain annotated patch regions, yielding 7,398 PNG image patches with expert-verified C1 to C5 labels. The release includes NDPI WSIs, WSI-level GeoJSON annotation files, extracted patch images, deidentified metadata, a data dictionary, a validation summary, a manifest linking WSIs to Zenodo records, and code for dataset inspection and reuse. The complete dataset is approximately 950 GB and is available through Zenodo.
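The counts in the abstract can be cross-checked with simple arithmetic. The sketch below uses only the figures stated above; the 512-pixel patch side is an illustrative assumption, since the abstract does not give the patch dimensions.

```python
# Figures taken from the abstract.
PAP_WSIS = 190
MGG_WSIS = 280
TOTAL_WSIS = 470
ANNOTATED_WSIS = 446
PATCHES = 7398
PATIENTS = 321
MICRONS_PER_PIXEL = 0.25

# The two staining groups should account for every WSI.
assert PAP_WSIS + MGG_WSIS == TOTAL_WSIS

unannotated_wsis = TOTAL_WSIS - ANNOTATED_WSIS
patches_per_annotated_wsi = PATCHES / ANNOTATED_WSIS
wsis_per_patient = TOTAL_WSIS / PATIENTS


def patch_side_microns(side_pixels):
    """Physical side length of a square patch at the stated scan resolution."""
    return side_pixels * MICRONS_PER_PIXEL


print(unannotated_wsis)                     # 24
print(round(patches_per_annotated_wsi, 2))  # 16.59
print(round(wsis_per_patient, 2))           # 1.46
print(patch_side_microns(512))              # 128.0 (512 px is an assumed size)
```

An average of about 1.46 WSIs per patient means some patients contribute several slides, so train and test splits are safer when made at patient level rather than patch level.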
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents a multi-center prospective dataset of breast FNAC whole-slide images collected from tertiary centers in India (321 patients, 470 WSIs stained with Pap or MGG, scanned at 40X on Hamamatsu NanoZoomer), from which 7,398 PNG patches with C1-C5 labels have been extracted. The release includes NDPI files, GeoJSON annotations, patches, metadata, and supporting code on Zenodo (~950 GB).
Significance. A publicly released multi-center cytology dataset with both Pap and MGG staining and explicit patch extraction is uncommon and could support development of patch-wise classifiers for the C1-C5 reporting system if label provenance is adequately documented.
major comments (1)
- [Abstract] Abstract (final sentence) and the description of patch labeling: the claim that the 7,398 patches carry 'expert-verified C1 to C5 labels' provides no information on (a) the number of pathologists involved, (b) whether labels were assigned directly at patch level or propagated from WSI-level reports, (c) any inter-rater agreement statistic, or (d) adjudication rules for staining or center differences. Because the central utility of the release is supervised patch classification, this omission leaves the noise level of the ground truth unquantified and is load-bearing for the dataset's claimed suitability.
Simulated Author's Rebuttal
We thank the referee for their detailed review and for emphasizing the need for clear documentation of label provenance, which is essential for the utility of this dataset in supervised learning. We address the major comment below and will revise the manuscript to incorporate the requested clarifications.
Point-by-point responses
Referee: [Abstract] Abstract (final sentence) and the description of patch labeling: the claim that the 7,398 patches carry 'expert-verified C1 to C5 labels' provides no information on (a) the number of pathologists involved, (b) whether labels were assigned directly at patch level or propagated from WSI-level reports, (c) any inter-rater agreement statistic, or (d) adjudication rules for staining or center differences. Because the central utility of the release is supervised patch classification, this omission leaves the noise level of the ground truth unquantified and is load-bearing for the dataset's claimed suitability.
Authors: We agree that the manuscript would benefit from explicit details on how the C1-C5 labels were obtained. (a) Labels derive from the original clinical FNAC reports generated by pathologists at the participating tertiary centers in India; the precise number of pathologists was not recorded during data collection. (b) Labels were determined at the WSI level from the clinical reports and propagated to the patches extracted from annotated regions; no separate patch-level review was conducted. (c) No inter-rater agreement metrics were computed. (d) No formal adjudication process was used for staining (Pap vs. MGG) or inter-center variations; these are recorded in the metadata for user awareness. We will add a dedicated subsection in the Methods describing the label provenance and update the abstract to avoid overstatement of 'expert-verified' without context. This will help quantify potential label noise for downstream users.
Revision: yes
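The propagation described in point (b) of the simulated rebuttal can be sketched as a simple lookup in which every patch inherits the report category of its parent WSI. This is an illustration of the idea only, not the authors' code; the identifiers and the two mapping structures are hypothetical.

```python
def propagate_wsi_labels(wsi_labels, patch_to_wsi):
    """Assign each patch the report category of its parent WSI."""
    patch_labels = {}
    for patch_id, wsi_id in patch_to_wsi.items():
        if wsi_id not in wsi_labels:
            raise KeyError(f"no WSI-level label for {wsi_id}")
        patch_labels[patch_id] = wsi_labels[wsi_id]
    return patch_labels


# Hypothetical identifiers, used only to show the call.
wsi_labels = {"wsi_001": "C5", "wsi_002": "C2"}
patch_to_wsi = {
    "wsi_001_patch_00": "wsi_001",
    "wsi_001_patch_01": "wsi_001",
    "wsi_002_patch_00": "wsi_002",
}

print(propagate_wsi_labels(wsi_labels, patch_to_wsi))
```

Under this scheme a patch showing only benign-appearing cells on a slide reported as C5 would still carry the C5 label, which is one concrete source of the label noise raised by the referee.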
Circularity Check
No circularity: descriptive dataset release with no derivations
Full rationale
The paper is a data-release document describing collection of 470 WSIs and 7,398 patches with C1-C5 labels. It contains no equations, predictions, fitted parameters, or derivation chains. Claims rest on data collection and expert verification statements, none of which reduce to self-definition or self-citation by construction. No load-bearing steps exist to analyze.
Patil, Abhijeet, Jain, Garima, Sethi, Amit. A Multi-Center Breast FNAC Cytology Dataset for AI-Assisted Patch-wise Classification Using C1–C5 Reporting Categories. Zenodo (2026). DOI: https://doi.org/10.5281/zenodo.20763900
Competing interests: The authors declare no competing interests.
Funding: Not applicable.