Artificial intelligence application in lymphoma diagnosis with Vision Transformer using weakly supervised training
Pith reviewed 2026-05-10 13:14 UTC · model grok-4.3
The pith
A Vision Transformer trained with weak supervision on 100,000 patches classifies anaplastic large cell lymphoma versus classic Hodgkin lymphoma at 91.85 percent accuracy.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors report that a Vision Transformer trained via weak supervision on 100,000 image patches reaches 91.85 percent accuracy, an F1 score of 0.92, and an AUC of 0.98 on held-out test data when classifying anaplastic large cell lymphoma against classic Hodgkin lymphoma. They conclude that this performance qualifies the model for clinical deep-learning pipelines that rely on automated image patch extraction.
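For readers who want to see what the three reported metrics measure, here is a minimal sketch in plain Python. The toy labels and scores, the 0.5 decision threshold, and the choice of class 1 as the positive class are all assumptions made for illustration. The paper's abstract does not say how its metrics were thresholded or averaged.

```python
from typing import List


def accuracy(y_true: List[int], y_pred: List[int]) -> float:
    """Fraction of patches whose predicted class matches the label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1_score(y_true: List[int], y_pred: List[int], positive: int = 1) -> float:
    """Harmonic mean of precision and recall for the positive class."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def roc_auc(y_true: List[int], scores: List[float]) -> float:
    """AUC as the probability that a positive outranks a negative (ties count half)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data, not from the paper: 1 and 0 stand for the two lymphoma classes.
y_true = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.6, 0.2, 0.1]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed threshold of 0.5

acc = accuracy(y_true, y_pred)
f1 = f1_score(y_true, y_pred)
auc = roc_auc(y_true, scores)
print(f"accuracy={acc:.4f} f1={f1:.4f} auc={auc:.4f}")
```

Accuracy and F1 depend on the chosen threshold, while AUC depends only on how the scores rank the two classes. That is one way an AUC of 0.98 can sit alongside an accuracy near 92 percent.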
What carries the argument
Vision Transformer architecture trained under weak supervision, where every patch automatically receives the diagnostic label of the whole-slide image from which it was extracted.
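The labeling rule described above can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' pipeline: the function name `propagate_slide_labels`, the slide and patch identifiers, and the dictionary layout are all hypothetical.

```python
from typing import Dict, List, Tuple


def propagate_slide_labels(
    slide_labels: Dict[str, str],
    patches_by_slide: Dict[str, List[str]],
) -> List[Tuple[str, str]]:
    """Give every patch the diagnostic label of its parent whole-slide image."""
    labeled: List[Tuple[str, str]] = []
    for slide_id, patch_ids in patches_by_slide.items():
        label = slide_labels[slide_id]  # one diagnosis per slide
        for patch_id in patch_ids:
            # No per-patch review: stroma, necrosis, or artifact patches
            # receive the same label as diagnostic ones.
            labeled.append((patch_id, label))
    return labeled


# Hypothetical identifiers, for illustration only.
slide_labels = {"slide_A": "ALCL", "slide_B": "cHL"}
patches_by_slide = {
    "slide_A": ["slide_A_p0", "slide_A_p1"],
    "slide_B": ["slide_B_p0"],
}

labeled_patches = propagate_slide_labels(slide_labels, patches_by_slide)
print(labeled_patches)
```

The inner loop has no step where a pathologist confirms that a given patch shows diagnostic morphology. That omission is both the source of the annotation savings and the source of the label noise discussed later in the referee report.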
If this is right
- Training no longer requires pathologists to label every individual patch.
- The model can slot into existing workflows that automatically extract patches from scanned slides.
- Vision Transformers become viable for this morphology task even when only slide-level labels are available.
- Performance remains high enough to support further development of automated diagnostic support tools.
Where Pith is reading between the lines
- The same weak-supervision recipe could be tried on other lymphoma subtypes or additional cancer types where slide-level reports already exist.
- Hospitals could run such models on incoming slides to pre-screen cases and flag those needing urgent pathologist review.
- Combining the model output with other clinical data might further improve diagnostic consistency across different labs.
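One way the pre-screening idea above might look in code is sketched here. Everything in it is an assumption: the paper describes patch-level classification only, so the mean-probability aggregation, the 0.3 and 0.7 thresholds, and the `screen_slide` function are hypothetical choices made for illustration.

```python
from statistics import mean
from typing import Dict, List, Union


def screen_slide(
    patch_probs: List[float],
    low: float = 0.3,   # assumed threshold, not from the paper
    high: float = 0.7,  # assumed threshold, not from the paper
) -> Dict[str, Union[float, str, bool]]:
    """Aggregate per-patch ALCL probabilities into one slide-level call."""
    score = mean(patch_probs)
    if score >= high:
        call = "ALCL"
    elif score <= low:
        call = "cHL"
    else:
        call = "uncertain"
    # Slides without a confident call are flagged for pathologist review.
    return {"score": score, "call": call, "needs_review": call == "uncertain"}


# Hypothetical per-patch probabilities for three slides.
confident_alcl = screen_slide([0.9, 0.8, 0.95])
ambiguous = screen_slide([0.5, 0.6, 0.4])
confident_chl = screen_slide([0.1, 0.2])
print(confident_alcl["call"], ambiguous["call"], confident_chl["call"])
```

Averaging is only one possible aggregation. Majority voting or attention-based pooling are common alternatives, and nothing in the source indicates which, if any, the authors would use.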
Load-bearing premise
Slide-level labels supply accurate enough supervision for reliable patch-level predictions, and the independent test set matches the variability of real clinical cases.
What would settle it
If the model's accuracy fell well below 80 percent on a new test collection in which pathologists label individual patches directly, that would show the approach does not meet the claimed suitability for clinical use.
Original abstract
Vision transformers (ViT) have been shown to allow for more flexible feature detection and can outperform convolutional neural network (CNN) when pre-trained on sufficient data. Due to their promising feature detection capabilities, we deployed ViTs for morphological classification of anaplastic large cell lymphoma (ALCL) versus classic Hodgkin lymphoma (cHL). We had previously designed a ViT model which was trained on a small dataset of 1,200 image patches in fully supervised training. That model achieved a diagnostic accuracy of 100% and an F1 score of 1.0 on the independent test set. Since fully supervised training is not a practical method due to lack of expertise resources in both the training and testing phases, we conducted a recent study on a modified approach to training data (weakly supervised training) and show that labeling training image patch automatically at the slide level of each whole-slide-image is a more practical solution for clinical use of Vision Transformer. Our ViT model, trained on a larger dataset of 100,000 image patches, yields evaluation metrics with significant accuracy, F1 score, and area under the curve (AUC) at 91.85%, 0.92, and 0.98, respectively. These are respectable values that qualify this ViT model, with weakly supervised training, as a suitable tool for a deep learning module in clinical model development using automated image patch extraction.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents a Vision Transformer (ViT) model for distinguishing anaplastic large cell lymphoma (ALCL) from classic Hodgkin lymphoma (cHL) via weakly supervised training. It contrasts a prior fully supervised ViT achieving 100% accuracy and F1=1.0 on 1,200 patches with a new model trained on 100,000 automatically slide-level-labeled patches that reports 91.85% accuracy, 0.92 F1 score, and 0.98 AUC on an independent test set, claiming this qualifies the approach as suitable for clinical deep-learning modules using automated patch extraction.
Significance. If the weak-supervision results prove robust after proper validation, the work could demonstrate a scalable route to ViT-based lymphoma classification that reduces expert annotation costs compared with fully supervised patch labeling. The reported AUC of 0.98 on a larger dataset would support further development of automated pathology tools, provided label noise and distribution-shift issues are addressed.
major comments (2)
- [Abstract] The reported metrics (91.85% accuracy, 0.92 F1, 0.98 AUC) are presented without any description of data sources, number of WSIs, train-test split details, patch extraction procedure, how slide-level labels were propagated to patches, baseline comparisons, or statistical validation (e.g., confidence intervals). This information is load-bearing for the central claim that the model is suitable for clinical use.
- [Abstract] The performance drop from 100% accuracy (fully supervised, 1,200 patches) to 91.85% (weakly supervised, 100k patches) is not analyzed in light of the known risk that slide-level labels assign the same class to all patches within a WSI, many of which contain stroma, necrosis, normal lymphoid tissue, or artifacts rather than diagnostic lymphoma morphology. This label noise is a plausible explanation for the observed drop and directly challenges the suitability claim.
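The first comment asks for statistical validation such as confidence intervals. As a rough illustration of what that could look like, the sketch below computes a Wilson score interval for the reported accuracy. The test-set size is not stated in the abstract, so `HYPOTHETICAL_N = 1000` is a placeholder assumption. The interval also treats patches as independent, which patches cut from the same slide generally are not.

```python
import math
from typing import Tuple


def wilson_interval(p_hat: float, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 gives about 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    denom = 1.0 + z * z / n
    center = (p_hat + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n)) / denom
    return center - half_width, center + half_width


REPORTED_ACCURACY = 0.9185  # from the abstract
HYPOTHETICAL_N = 1000       # assumed test-set size; the abstract does not give one

low, high = wilson_interval(REPORTED_ACCURACY, HYPOTHETICAL_N)
print(f"95% interval for accuracy at n={HYPOTHETICAL_N}: [{low:.4f}, {high:.4f}]")
```

Because patches from one slide are correlated, an interval computed per patch will tend to be too narrow. Resampling whole slides rather than individual patches, for example with a slide-level bootstrap, would be a more conservative alternative.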
Simulated Author's Rebuttal
We thank the referee for their thorough review and insightful comments on our work. We address each major comment below and have prepared revisions to the manuscript to incorporate the suggested improvements.
Point-by-point responses
- Referee: [Abstract] The reported metrics (91.85% accuracy, 0.92 F1, 0.98 AUC) are presented without any description of data sources, number of WSIs, train-test split details, patch extraction procedure, how slide-level labels were propagated to patches, baseline comparisons, or statistical validation (e.g., confidence intervals). This information is load-bearing for the central claim that the model is suitable for clinical use.
  Authors: We concur that the abstract would benefit from additional context to support the reported metrics and the claim of suitability for clinical use. The manuscript's Methods section details the data sources (pathology slides from our institution), the number of whole-slide images, the automated patch extraction process, slide-level label assignment to patches, and comparisons to the prior fully supervised model. We will revise the abstract to briefly mention the scale (100,000 patches from multiple WSIs) and refer readers to the full methods for specifics. Additionally, we will include statistical validation such as confidence intervals in the results. This change will be implemented in the revised version.
  revision: yes
- Referee: [Abstract] The performance drop from 100% accuracy (fully supervised, 1,200 patches) to 91.85% (weakly supervised, 100k patches) is not analyzed in light of the known risk that slide-level labels assign the same class to all patches within a WSI, many of which contain stroma, necrosis, normal lymphoid tissue, or artifacts rather than diagnostic lymphoma morphology. This label noise is a plausible explanation for the observed drop and directly challenges the suitability claim.
  Authors: The observation regarding label noise is valid and represents a known limitation of weak supervision. The performance drop is expected when moving from a small, expertly curated patch dataset to a large set with automatically propagated slide-level labels that include non-diagnostic regions. Our approach prioritizes scalability and reduced expert annotation requirements, which is crucial for clinical deployment. The maintained high AUC of 0.98 demonstrates that the model learns relevant features despite the noise. We will add a dedicated analysis and discussion of label noise effects in the revised manuscript, including potential mitigation strategies, and will moderate the suitability claim to reflect this as an initial demonstration requiring further clinical validation. We believe this addresses the concern while preserving the contribution.
  revision: yes
Circularity Check
No circularity: purely empirical ML evaluation
The paper reports an empirical study of training a Vision Transformer on 100k image patches for ALCL vs cHL classification under weakly supervised slide-level labeling, with standard accuracy/F1/AUC metrics on an independent test set. No mathematical derivations, equations, fitted parameters, or ansatzes appear; the central claims are direct experimental outcomes rather than reductions of inputs by construction. Prior self-citation to a small fully-supervised run is present but is not load-bearing for any derivation and does not create a self-referential chain. The work is self-contained against external benchmarks and contains none of the enumerated circularity patterns.
Axiom & Free-Parameter Ledger
free parameters (1)
- ViT model weights and training hyperparameters
axioms (1)
- domain assumption: Slide-level labels are reliable proxies for the presence of diagnostic features in individual patches
Reference graph
Works this paper leans on
- [1] Razzak MI, Naz S, Zaib A. Deep Learning for Medical Image Processing: Overview, Challenges and the Future. In: Dey N., Ashour A., Borra S. (eds) Classification in BioApps. First ed. Springer International Publishing; 2018:323-350
- [2] MIT Technol. Rev. 2013. Available at (last accessed on 10/30/18): https://www.technologyreview.com/s/513696/deep-learning
- [3] LeCun, Y, Bengio, Y, Hinton, G. Deep learning. Nature. 2015;521:436–444
- [4] Janowczyk, A, Madabhushi, A. Deep learning for digital pathology image analysis: A comprehensive tutorial with selected use cases. J Pathol Inform. 2016;7:29
- [5] WHO Classification of Tumours: Haematolymphoid Tumours, 5th Edition, Volume 11, 2024. WHO Classification of Tumours Editorial Board. 69008 Lyon, France: International Agency for Research on Cancer (IARC)
- [6] Choras RS. Feature Extraction for CBIR and Biometrics Applications. 7th WSEAS International Conference on Applied Computer Science. Vol. 7. 2007
- [7] Marsland S. Machine learning: an algorithmic perspective. Chapman and Hall/CRC, 2011
- [8] Mitchell TM. Machine learning. Vol. 1. No. 9. New York: McGraw-Hill, 1997
- [9] Roy K, Banik D, Bhattacharjee D, Nasipuri M. Patch-based system for classification of breast histology images using deep learning. Computerized Medical Imaging and Graphics 71 (2019): 90-103
- [10] Waqas, A, Bui, MM, Glassy, EF, et al. Revolutionizing Digital Pathology With the Power of Generative Artificial Intelligence and Foundation Models. Laboratory Investigation. Volume 103, Issue 11, 100255, November 2023
- [11] Vaswani A, Shazeer, N, Parmar, N, et al. Attention Is All You Need. 31st Conference on Neural Information Processing Systems (NIPS 2017)
- [12] Vorontsov, E, Bozkurt, A, Casson, A, Shaikovski, G, Zelechowski, M, Severson, K, et al. A foundation model for clinical-grade computational pathology and rare cancers detection. Nature Medicine, Volume 30, October 2024, 2924–2935
- [13] Zhang, J, Lu, JD, Chen, B, et al. Vision transformer introduces a new vitality to the classification of renal pathology. BMC Nephrology (2024) 25:337
- [14] Khedr, OS, Wahed, ME, Al‑Attar, AR, et al. The classification of the bladder cancer based on Vision Transformers (ViT). Nature Scientific Reports (2023) 13:20639
- [15] Beyer, L, et al. Better plain ViT baselines for ImageNet-1k. Google Brain Research, Zurich. https://github.com/google-research/big_vision
- [16] Dosovitskiy, A, Beyer, L, Kolesnikov, A, et al. An image is worth 16x16 words: transformers for image recognition at scale. Proceedings of ICLR 2021
- [17] Rivera D, Banerjee A, Zhang R, El Achi H, Wahed A, Ho L, et al. (2025) Vision Transformers for Diagnostic Classification of Lymphomas: A Matched Comparison with a Convolutional Neural Network, 21st Century Pathol, Volume 5 (1): 160
- [18] Rivera, D, Ali, K, Zhang, R, Mai, B, El Achi, H, Armstrong, J, Wahed, A, Nguyen, A. Deep Learning-Based Morphological Classification between Classical Hodgkin Lymphoma and Anaplastic Large Cell Lymphoma: A Proof of Concept and Literature Review, 21st Century Pathology, Volume 4 (1): 159
- [19] Mohri, M, Rostamizadeh, A, Talwalkar, A. Foundations of Machine Learning. MIT Press, Second Edition, 2018
- [20] Li, Z, Cong, Y, Chen, X, et al. Vision transformer-based weakly supervised histopathological image analysis of primary brain tumors. iScience 26, 105872, January 20, 2023
- [21] Smistad, E, Østvik, A, Pedersen, A. High performance neural network inference, streaming, and visualization of medical images using FAST. IEEE Access, volume 7, 2019
- [22] Bankhead, P., et al. QuPath: Open source software for digital pathology image analysis. Scientific Reports (2017). https://doi.org/10.1038/s41598-017-17204-5
- [23] Evaluation metrics in machine learning. https://www.geeksforgeeks.org/machine-learning/metrics-for-machine-learning-model/
- [24] Chaurasia, A, Toohey, PW, Harris, H, Hewitt, AW. Multi-resolution vision transformer model for histopathological skin cancer subtype classification using whole slide images. Computers in Biology and Medicine. Volume 196, Part A, September 2025, 110724
Legends
Figure 1. Representative histology of anaplastic large cell lymphoma (L) and classical Hodgkin ...