
arxiv: 2606.30209 · v1 · pith:HXEEHYPXnew · submitted 2026-06-29 · 💻 cs.CV · cs.AI

A Multi Center Breast FNAC Whole-Slide Cytology Dataset for AI-Assisted Patch-Wise Classification Using C1 to C5 Reporting Categories

Pith reviewed 2026-06-30 06:17 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords: breast FNAC · cytology dataset · whole-slide images · patch classification · C1-C5 categories · multi-center · AI-assisted diagnosis · pathology imaging

The pith

A multi-center dataset of 470 breast FNAC whole-slide images yields 7398 expert-labeled patches for C1-C5 patch-wise AI classification.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a prospective dataset collected from multiple tertiary centers in India for training AI models on breast fine needle aspiration cytology. It covers 321 patients and 470 whole-slide images scanned at 40X, using two different staining methods, from which 7398 patches were extracted and assigned C1 to C5 labels by experts. The release package supplies the raw NDPI files, GeoJSON annotations, PNG patches, metadata, and inspection code. A sympathetic reader would care because publicly available, multi-center labeled cytology data at this scale remains limited for developing diagnostic AI tools.

Core claim

The authors present a dataset of 470 whole-slide images from 321 patients that produces 7398 PNG patches carrying expert-verified C1 to C5 reporting labels, gathered across centers with Papanicolaou or May-Grunwald Giemsa staining, scanned on a Hamamatsu NanoZoomer, and released in full with supporting annotation and metadata files for patch-wise classification tasks.

What carries the argument

The extraction of 7398 labeled PNG patches from 446 annotated whole-slide images using C1-C5 reporting categories.
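As a rough illustration of how WSI-level GeoJSON annotations could be turned into patch bounding boxes, the sketch below reads polygon features and returns one box per annotated region. It assumes a FeatureCollection with Polygon geometries in pixel coordinates and a label stored under `properties.classification.name` (a QuPath-style convention). Both are assumptions, not details confirmed by the paper, and the released files may be organized differently.

```python
import json

def annotation_boxes(geojson_text):
    """Return (label, x_min, y_min, x_max, y_max) for each Polygon feature."""
    data = json.loads(geojson_text)
    # Accept either a FeatureCollection or a bare list of features.
    features = data["features"] if isinstance(data, dict) else data
    boxes = []
    for feature in features:
        geometry = feature["geometry"]
        if geometry["type"] != "Polygon":
            continue  # this sketch ignores other geometry types
        ring = geometry["coordinates"][0]  # exterior ring of the polygon
        xs = [point[0] for point in ring]
        ys = [point[1] for point in ring]
        # ASSUMPTION: label key follows a QuPath-style layout; adjust to the real schema.
        properties = feature.get("properties", {})
        label = properties.get("classification", {}).get("name")
        boxes.append((label, min(xs), min(ys), max(xs), max(ys)))
    return boxes

# Hypothetical annotation: one 512 x 512 pixel region labeled C2.
example = json.dumps({
    "type": "FeatureCollection",
    "features": [{
        "type": "Feature",
        "geometry": {
            "type": "Polygon",
            "coordinates": [[[100, 200], [612, 200], [612, 712], [100, 712], [100, 200]]],
        },
        "properties": {"classification": {"name": "C2"}},
    }],
})
boxes = annotation_boxes(example)
```

At the reported 0.25 microns per pixel, a 512-pixel box like the one in this toy example would span 128 microns on the slide.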

If this is right

  • Models trained on the patches can perform patch-wise classification into the five standard reporting categories.
  • The multi-center origin allows testing of model robustness across staining protocols and sites.
  • The released NDPI files and GeoJSON annotations support both patch-level and whole-slide experiments.
  • Public availability on Zenodo enables direct reuse without new data collection.
  • The accompanying code lowers the barrier for other groups to inspect or extend the data.
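Testing robustness across stains or sites presupposes a split in which all patches from one patient (or one center) stay on the same side. A minimal sketch of such a group-aware split follows; the field names `patient_id`, `stain`, and `patch` are hypothetical and may not match the released metadata.

```python
import random
from collections import defaultdict

def split_by_group(records, group_key, test_fraction=0.2, seed=0):
    """Split records so that no group (e.g. a patient) appears on both sides."""
    groups = defaultdict(list)
    for record in records:
        groups[record[group_key]].append(record)
    group_ids = sorted(groups)  # sort first so the shuffle is reproducible
    random.Random(seed).shuffle(group_ids)
    n_test = max(1, round(len(group_ids) * test_fraction))
    test_ids = set(group_ids[:n_test])
    train = [r for g in group_ids if g not in test_ids for r in groups[g]]
    test = [r for g in group_ids if g in test_ids for r in groups[g]]
    return train, test

# Hypothetical toy records; these field names are assumptions, not the released schema.
records = [
    {"patch": f"p{i}.png", "patient_id": f"P{i % 5}", "stain": "PAP" if i % 2 else "MGG"}
    for i in range(20)
]
train, test = split_by_group(records, "patient_id")
```

Passing a center or stain field as `group_key` instead would give a leave-sites-out or leave-stain-out style evaluation.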

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Models developed from this data might reduce inter-observer variability in FNAC interpretation if they generalize beyond the training patches.
  • The dual-staining design could allow future work to quantify how staining choice affects AI performance.
  • Release of the full 950 GB package sets a practical example for sharing large cytology imaging collections.

Load-bearing premise

Expert labels assigned to the extracted patches are accurate and consistent enough to serve as reliable ground truth.

What would settle it

Independent review by additional pathologists finding substantial disagreement with the supplied C1-C5 labels on more than a small fraction of patches.
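One common way to quantify such disagreement is Cohen's kappa between the released labels and an independent reviewer's labels on the same patches. The sketch below is a plain, unweighted implementation; the label lists are made-up placeholders, not data from the release.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Unweighted Cohen's kappa for two raters labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    # Chance agreement: product of each rater's marginal frequency per category.
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical category throughout
    return (observed - expected) / (1 - expected)

# Placeholder labels for eight patches (illustrative only).
released = ["C2", "C2", "C5", "C3", "C4", "C2", "C5", "C1"]
reviewed = ["C2", "C2", "C5", "C4", "C4", "C2", "C5", "C1"]
kappa = cohens_kappa(released, reviewed)
```

Because C1 to C5 are ordered categories, a weighted kappa that penalizes a C2 versus C5 disagreement more than C3 versus C4 may be the more informative choice; the unweighted form is shown only for brevity.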

Figures

Figures reproduced from arXiv: 2606.30209 by Abhijeet Patil, Amit Sethi, Arvind Kumar, Basumitra Das, B. G. Malathi, Biswajit Dey, Deepali Tirkey, Deepika Hemranjani, Garima Jain, Indu R. Nair, Jatin Kashyap, Manveen Kaur, Nilam Adhav, Niraj Kumari, Nishi Halduniya, Pulkit Verma, Rakesh Kumar Gupta, Ranjana Solanki, Ratan Konjengbam, Ravindra Karle, Sandeep Mathur, Sanghamitra Pati, Saurav Banerjee, Sharat Kumar, Shashank Nath Singh, Shivani Kalhan, Shruti Gupta, Simmi Kharb, Sucheta Devi Khuraijam, Sunil Kumar Komanapalli, Sunita Singh, Surabhi Jain, Sushma Khuraijam, Tanaya Kulkarni, Uma Handa, Vaishali Gaikwad, Vandana Raphael, Vidya C., Yogender P..

Figure 1: Overall workflow for breast FNAC dataset creation, slide digitization, WSI-level patch annotation, patch ...
Original abstract

We present a multi-center breast fine needle aspiration cytology (FNAC) dataset designed for patch-wise classification using C1 to C5 reporting labels. The prospective dataset includes 321 patients and 470 whole-slide images (WSIs) collected from participating tertiary medical centers in India between May 2023 and March 2026. Slides were stained using Papanicolaou (190 WSIs) or May-Grunwald Giemsa (280 WSIs), scanned on a Hamamatsu NanoZoomer S360 at 40X magnification and 0.25 microns per pixel, and stored directly in NDPI format. Across the 470 WSIs, 446 WSIs contain annotated patch regions, yielding 7,398 PNG image patches with expert-verified C1 to C5 labels. The release includes NDPI WSIs, WSI-level GeoJSON annotation files, extracted patch images, deidentified metadata, a data dictionary, a validation summary, a manifest linking WSIs to Zenodo records, and code for dataset inspection and reuse. The complete dataset is approximately 950 GB and is available through Zenodo.
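The counts in the abstract can be cross-checked with simple arithmetic. The sketch below uses only the figures stated above and derives a few summary ratios from them; the derived values are averages and say nothing about how patches are distributed across individual slides.

```python
# Figures taken directly from the abstract.
patients = 321
total_wsis = 470
pap_wsis = 190
mgg_wsis = 280
annotated_wsis = 446
patches = 7398

# The two stain counts should account for every slide.
stain_counts_consistent = (pap_wsis + mgg_wsis == total_wsis)

# Slides that contribute no annotated patches.
unannotated_wsis = total_wsis - annotated_wsis

# Average ratios (not per-slide distributions).
mean_patches_per_annotated_wsi = patches / annotated_wsis
mean_wsis_per_patient = total_wsis / patients
pap_share = pap_wsis / total_wsis

print(f"unannotated WSIs: {unannotated_wsis}")
print(f"mean patches per annotated WSI: {mean_patches_per_annotated_wsi:.2f}")
print(f"mean WSIs per patient: {mean_wsis_per_patient:.2f}")
print(f"Papanicolaou share of WSIs: {pap_share:.1%}")
```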

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The manuscript presents a multi-center prospective dataset of breast FNAC whole-slide images collected from tertiary centers in India (321 patients, 470 WSIs stained with Pap or MGG, scanned at 40X on Hamamatsu NanoZoomer), from which 7,398 PNG patches with C1-C5 labels have been extracted. The release includes NDPI files, GeoJSON annotations, patches, metadata, and supporting code on Zenodo (~950 GB).

Significance. A publicly released multi-center cytology dataset with both Pap and MGG staining and explicit patch extraction is uncommon and could support development of patch-wise classifiers for the C1-C5 reporting system if label provenance is adequately documented.

major comments (1)
  1. [Abstract] Abstract (final sentence) and the description of patch labeling: the claim that the 7,398 patches carry 'expert-verified C1 to C5 labels' provides no information on (a) the number of pathologists involved, (b) whether labels were assigned directly at patch level or propagated from WSI-level reports, (c) any inter-rater agreement statistic, or (d) adjudication rules for staining or center differences. Because the central utility of the release is supervised patch classification, this omission leaves the noise level of the ground truth unquantified and is load-bearing for the dataset's claimed suitability.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their detailed review and for emphasizing the need for clear documentation of label provenance, which is essential for the utility of this dataset in supervised learning. We address the major comment below and will revise the manuscript to incorporate the requested clarifications.

Point-by-point responses
  1. Referee: [Abstract] Abstract (final sentence) and the description of patch labeling: the claim that the 7,398 patches carry 'expert-verified C1 to C5 labels' provides no information on (a) the number of pathologists involved, (b) whether labels were assigned directly at patch level or propagated from WSI-level reports, (c) any inter-rater agreement statistic, or (d) adjudication rules for staining or center differences. Because the central utility of the release is supervised patch classification, this omission leaves the noise level of the ground truth unquantified and is load-bearing for the dataset's claimed suitability.

    Authors: We agree that the manuscript would benefit from explicit details on how the C1-C5 labels were obtained. (a) Labels derive from the original clinical FNAC reports generated by pathologists at the participating tertiary centers in India; the precise number of pathologists was not recorded during data collection. (b) Labels were determined at the WSI level from the clinical reports and propagated to the patches extracted from annotated regions; no separate patch-level review was conducted. (c) No inter-rater agreement metrics were computed. (d) No formal adjudication process was used for staining (Pap vs. MGG) or inter-center variations; these are recorded in the metadata for user awareness. We will add a dedicated subsection in the Methods describing the label provenance and update the abstract to avoid overstatement of 'expert-verified' without context. This will help quantify potential label noise for downstream users.

    Revision: yes
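The propagation described in point (b) of this simulated rebuttal can be sketched as a simple lookup: every patch inherits the report-level category of its parent WSI. The identifiers below are hypothetical, and the released manifest may use different names.

```python
from collections import Counter

def propagate_labels(wsi_labels, patch_to_wsi):
    """Give every patch the report-level category of its parent WSI."""
    return {patch: wsi_labels[wsi] for patch, wsi in patch_to_wsi.items()}

# Hypothetical identifiers; these are illustrative, not taken from the release.
wsi_labels = {"wsi_001": "C2", "wsi_002": "C5"}
patch_to_wsi = {
    "wsi_001_p0.png": "wsi_001",
    "wsi_001_p1.png": "wsi_001",
    "wsi_002_p0.png": "wsi_002",
}
patch_labels = propagate_labels(wsi_labels, patch_to_wsi)

# Patch counts per category overstate the number of independent labeling decisions:
# under this scheme there is only one decision per WSI.
patches_per_label = Counter(patch_labels.values())
independent_label_sources = len(set(patch_to_wsi.values()))
```

Under this scheme, patches from one slide share a single label regardless of their local content, which is one reason to split training and evaluation data by WSI or patient instead of by patch.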

Circularity Check

0 steps flagged

No circularity: descriptive dataset release with no derivations


The paper is a data-release document describing collection of 470 WSIs and 7,398 patches with C1-C5 labels. It contains no equations, predictions, fitted parameters, or derivation chains. Claims rest on data collection and expert verification statements, none of which reduce to self-definition or self-citation by construction. No load-bearing steps exist to analyze.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is a data-release paper; no mathematical derivations, fitted parameters, background axioms, or new postulated entities are introduced.



Dataset record: Patil, Abhijeet, Jain, Garima, Sethi, Amit. A Multi-Center Breast FNAC Cytology Dataset for AI-Assisted Patch-wise Classification Using C1–C5 Reporting Categories. Zenodo (2026). DOI: https://doi.org/10.5281/zenodo.20763900

Competing interests: The authors declare no competing interests.

Funding: Not applicable.