
arxiv: 2606.22892 · v1 · pith:XNUWIVSVnew · submitted 2026-06-22 · 📡 eess.IV · cs.CV

IViT: A Novel Interpretable Visual Transformer for Skin Disease Detection

Pith reviewed 2026-06-26 06:40 UTC · model grok-4.3

classification 📡 eess.IV cs.CV
keywords skin disease detection · interpretable vision transformer · quadratic programming · feature selection · few-shot medical imaging · activation map alignment · multi-objective loss

The pith

A quadratic programming constraint on vision transformers selects skin-disease features aligned with clinical logic while keeping accuracy within 0.21 percent of the baseline.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents IViT as a vision transformer that incorporates quadratic programming to address black-box opacity and few-shot limitations in skin disease detection. It adapts pre-trained models for limited medical data and applies a discrete QP feature selection step plus a multi-objective loss to cut redundancy and align activations with lesion areas. Results across six datasets report 93.80 percent accuracy, 29.5 percent lower feature redundancy, and core activation regions that match areas clinicians examine. The approach aims to deliver both competitive performance and explanations that track diagnostic reasoning.

Core claim

IViT builds a discrete QP feature selection framework that screens generic and discriminative features consistent with clinical diagnostic logic. A multi-objective loss then reduces feature redundancy and optimizes activation distribution without degrading classification performance, yielding 93.80 percent accuracy on six standard datasets with core activations matching clinically relevant lesion regions.

What carries the argument

The discrete quadratic programming feature selection framework that screens features for consistency with clinical logic while preserving classification performance.
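
To make the idea of a discrete QP over features concrete, here is a minimal sketch. It is not the paper's formulation, which this page does not reproduce. It assumes a generic relevance-versus-redundancy objective, alpha * x^T Q x - (1 - alpha) * f^T x over binary x with exactly k ones, and solves it by brute force on made-up toy numbers. The names `relevance`, `redundancy`, `alpha`, and `k` are all hypothetical.

```python
from itertools import combinations


def qp_objective(selected, relevance, redundancy, alpha=0.5):
    """Value of alpha * x^T Q x - (1 - alpha) * f^T x for a binary vector x.

    `selected` lists the indices i where x_i = 1 (assumed encoding).
    """
    pairwise = sum(redundancy[i][j] for i in selected for j in selected)
    linear = sum(relevance[i] for i in selected)
    return alpha * pairwise - (1 - alpha) * linear


def select_features(relevance, redundancy, k, alpha=0.5):
    """Exhaustively minimize the objective over all subsets of size k.

    Brute force is only viable for tiny n; it stands in for a real solver.
    """
    n = len(relevance)
    best = min(
        combinations(range(n), k),
        key=lambda s: qp_objective(s, relevance, redundancy, alpha),
    )
    return list(best)


# Toy inputs (made up): features 0 and 1 are both relevant but near-duplicates.
relevance = [0.9, 0.85, 0.6, 0.1]
redundancy = [
    [1.0, 0.95, 0.1, 0.0],
    [0.95, 1.0, 0.1, 0.0],
    [0.1, 0.1, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
]

chosen = select_features(relevance, redundancy, k=2)
print(chosen)  # [0, 2]
```

In this toy case the quadratic term penalizes picking both near-duplicate features, so the second slot goes to the less relevant but non-redundant feature 2. Whether the paper's objective has this shape is an assumption.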

If this is right

  • The model supplies explanations that directly reference lesion locations used in clinical practice.
  • Reduced feature redundancy lowers storage and compute demands during deployment on limited hardware.
  • Transfer learning plus the QP step enables competitive results when only small numbers of labeled medical images are available.
  • The same constraint pattern could be applied to other transformer architectures that require both accuracy and built-in interpretability.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the QP selection proves stable across imaging modalities, it could replace post-hoc explanation tools in other diagnostic pipelines.
  • The alignment between activations and clinical regions offers a direct test for whether learned features capture medically meaningful patterns rather than dataset artifacts.
  • Extending the framework to multi-label or longitudinal skin data would test whether the clinical-logic constraint generalizes beyond single-image classification.

Load-bearing premise

The quadratic programming step is assumed to identify features that remain consistent with clinical diagnostic logic without causing a meaningful drop in classification accuracy.

What would settle it

Running the model on the same six datasets and finding that core activation regions no longer overlap with clinically identified lesion areas, or that accuracy falls more than 1 percent below the baseline, would falsify the central claim.
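
The accuracy half of this criterion can be written down as a small check. The sketch below assumes that "0.21% lower" and "1 percent" both mean absolute percentage points, which the abstract does not state explicitly; the function names are hypothetical.

```python
def implied_baseline(model_acc, gap):
    """Baseline accuracy implied by a model accuracy and a reported gap (points)."""
    return model_acc + gap


def falsifies_accuracy_claim(model_acc, baseline_acc, tolerance=1.0):
    """True if the model trails the baseline by more than `tolerance` points."""
    return (baseline_acc - model_acc) > tolerance


# Figures from the abstract: 93.80% accuracy, 0.21 points below the baseline.
baseline = implied_baseline(93.80, 0.21)
print(round(baseline, 2))                          # 94.01
print(falsifies_accuracy_claim(93.80, baseline))   # False
# Hypothetical replication result, not a reported number.
print(falsifies_accuracy_claim(92.90, baseline))   # True
```

Under this reading the implied baseline is 94.01%, so a replication would need to land below 93.01% to trip the accuracy condition.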

Figures

Figures reproduced from arXiv: 2606.22892 by Di Lin, Haibiao Li, WeiWei Wu, Xue Jiang, Yanxi Li, Yugang Chi.

Figure 1. Labels shown: AI Inspection Model; Patient Database; AI-Assisted Medical Diagnosis System; Offline Medical Consultation; Medical Training; Remote Medical Consultation; Medical Image Analysis; Drug Discovery.
Figure 3. Transfer Learning ViT Training on the Few
Figure 2. Flowchart of the IViT Classification Algorithm Framework
Figure 5. Training accuracy comparison: Pre-trained weights vs. Random initialization
Figure 6. Comparison of ViT Model Performance
which is determined by the appearance, color and morphology of the skin lesions. Acne is dominated by follicular papules, pustules and cysts, with red or dark red in color. The lesions are scattered, isolated and tend to occur around the follicular ostia. The core features focus on papules, pustules, oily skin and cysts. Psoriasis manifests as well-defined red plaques co…
Original abstract

The clinical diagnosis of skin diseases is susceptible to interference from inter-class similarity of skin lesions, and over-reliance on clinicians'experience easily leads to subjective bias. Although existing deep learning aided diagnosis methods achieve competitive accuracy, they suffer from the black-box opacity of Vision Transformer (ViT) and poor adaptability to medical few-shot scenarios. Moreover, mainstream explainable algorithms generally face the bottleneck of significant accuracy degradation when improving interpretability. This paper proposes an interpretable ViT (IViT) constrained by Quadratic Programming (QP). The introduced pre-trained transfer learning adapts to few-shot feature extraction. A discrete QP feature selection framework is constructed to screen generic and discriminative features consistent with clinical diagnostic logic. A multi-objective loss function is designed to reduce feature redundancy and optimize activation distribution while preserving classification performance. Experimental results on six standard skin disease datasets show that IViT achieves an accuracy of 93.80%, only 0.21% lower than the baseline, with feature redundancy reduced by 29.5%. Its core activation regions are consistent with clinically concerned lesion areas. The proposed model balances accuracy and interpretability, providing a reliable solution for the clinical deployment of few-shot intelligent skin disease diagnosis.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript proposes an interpretable Vision Transformer (IViT) for skin disease detection using Quadratic Programming (QP) constraints. It incorporates pre-trained transfer learning for few-shot adaptation, a discrete QP feature selection framework to identify generic and discriminative features aligned with clinical logic, and a multi-objective loss to minimize feature redundancy while maintaining classification performance. On six standard datasets, IViT achieves 93.80% accuracy (0.21% below baseline) with 29.5% redundancy reduction, and core activation regions consistent with lesion areas.

Significance. If the QP-selected features can be shown to align with clinical reasoning without performance loss, the work would meaningfully advance interpretable models for few-shot medical imaging by addressing the accuracy-interpretability trade-off in ViTs. The reported metrics indicate only marginal accuracy degradation alongside substantial redundancy reduction, which is a positive indicator for practical deployment if the clinical consistency claim holds under quantitative scrutiny.

major comments (3)
  1. [Abstract] The central claim that the discrete QP feature selection framework screens features 'consistent with clinical diagnostic logic' is supported only by qualitative activation-map consistency and the 0.21% accuracy gap. No quantitative overlap metrics (Dice, IoU) or correlations with dermatologist-annotated lesion attributes are reported on the six datasets, yet this claim is load-bearing for the interpretability guarantee in few-shot medical deployment.
  2. [Experimental results] The 93.80% accuracy and 29.5% redundancy reduction are reported without derivation details, error bars, dataset splits, a cross-validation procedure, or statistical significance tests. This limits evaluation of whether the QP selections are robust or merely optimized to the internal multi-objective loss.
  3. [Method] QP framework: The multi-objective loss and discrete QP feature selection are defined internally to optimize the reported metrics. Without the full equations, it remains unclear whether the claimed reductions are independent of the fitting choices or circular by construction.
minor comments (2)
  1. [Abstract] Typo in 'clinicians'experience' (missing space after apostrophe).
  2. Notation: The term 'QP' is introduced without an initial expansion or reference to the quadratic programming formulation used.
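
The overlap metrics named in major comment 1 are simple to compute once a binary lesion mask exists. The sketch below assumes a hypothetical activation map thresholded at 0.5 and a hypothetical annotated lesion mask, both made up for illustration; neither comes from the paper.

```python
def binarize(activation, threshold=0.5):
    """Turn an activation map (rows of floats) into a 0/1 mask."""
    return [[1 if v >= threshold else 0 for v in row] for row in activation]


def dice_and_iou(pred_mask, true_mask):
    """Return (Dice, IoU) for two equally sized 0/1 masks."""
    pred = [v for row in pred_mask for v in row]
    true = [v for row in true_mask for v in row]
    inter = sum(1 for p, t in zip(pred, true) if p and t)
    pred_sum, true_sum = sum(pred), sum(true)
    union = pred_sum + true_sum - inter
    if union == 0:
        # Both masks empty: treat as perfect agreement by convention.
        return 1.0, 1.0
    dice = 2 * inter / (pred_sum + true_sum)
    iou = inter / union
    return dice, iou


# Made-up 4x4 activation map and lesion annotation.
activation = [
    [0.1, 0.2, 0.1, 0.0],
    [0.2, 0.9, 0.8, 0.1],
    [0.1, 0.7, 0.6, 0.1],
    [0.0, 0.1, 0.6, 0.0],
]
lesion = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
]

dice, iou = dice_and_iou(binarize(activation), lesion)
print(round(dice, 3), round(iou, 3))  # 0.889 0.8
```

The threshold is a free choice that changes both scores, so any reported Dice or IoU would need to state it.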

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment point by point below, indicating where revisions will be incorporated.

Point-by-point responses
  1. Referee: [Abstract] The central claim that the discrete QP feature selection framework screens features 'consistent with clinical diagnostic logic' is supported only by qualitative activation-map consistency and the 0.21% accuracy gap. No quantitative overlap metrics (Dice, IoU) or correlations with dermatologist-annotated lesion attributes are reported on the six datasets, yet this claim is load-bearing for the interpretability guarantee in few-shot medical deployment.

    Authors: The manuscript supports the interpretability claim through qualitative activation-map consistency with lesion areas, as explicitly stated in the abstract and experimental results. No quantitative metrics such as Dice or IoU were computed because the standard datasets lack the required pixel-level dermatologist annotations. This qualitative demonstration aligns with common practices in medical imaging interpretability studies. We stand by the presented evidence and will not add unsubstantiated quantitative claims. revision: no

  2. Referee: [Experimental results] The 93.80% accuracy and 29.5% redundancy reduction are reported without derivation details, error bars, dataset splits, a cross-validation procedure, or statistical significance tests. This limits evaluation of whether the QP selections are robust or merely optimized to the internal multi-objective loss.

    Authors: We agree that additional experimental details are needed for full evaluation. The revised manuscript will include dataset splits, cross-validation procedure, error bars from repeated runs, and statistical significance tests to substantiate the reported accuracy and redundancy reduction. revision: yes

  3. Referee: [Method] QP framework: The multi-objective loss and discrete QP feature selection are defined internally to optimize the reported metrics. Without the full equations, it remains unclear whether the claimed reductions are independent of the fitting choices or circular by construction.

    Authors: The Method section provides the full equations for the discrete QP feature selection and multi-objective loss. The QP step is formulated as an independent optimization for feature discriminativeness and genericity prior to loss-based training, avoiding circularity. We will add explicit clarification and restate the key equations in the revision to address this concern. revision: partial
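
The error bars and significance test promised in response 2 could take roughly the following form. This is a sketch under stated assumptions: the per-fold accuracies are made-up placeholders rather than results from the paper, and the exact paired sign-flip permutation test is one reasonable choice, not necessarily what the authors will use.

```python
from itertools import product
from statistics import mean, stdev


def summarize(accs):
    """Mean and sample standard deviation across repeated runs or folds."""
    return mean(accs), stdev(accs)


def paired_sign_flip_pvalue(a, b):
    """Exact two-sided sign-flip permutation test on paired differences.

    Enumerates all 2^n sign assignments, so it is only suitable for small n.
    """
    diffs = [x - y for x, y in zip(a, b)]
    observed = abs(mean(diffs))
    hits = 0
    total = 0
    for signs in product((1, -1), repeat=len(diffs)):
        total += 1
        flipped = [s * d for s, d in zip(signs, diffs)]
        if abs(mean(flipped)) >= observed - 1e-12:
            hits += 1
    return hits / total


# Placeholder per-fold accuracies (hypothetical, five folds each).
baseline_runs = [94.1, 93.9, 94.2, 93.8, 94.0]
ivit_runs = [93.9, 93.7, 94.0, 93.5, 93.9]

print(summarize(baseline_runs))
print(summarize(ivit_runs))
p = paired_sign_flip_pvalue(baseline_runs, ivit_runs)
print(p)  # 0.0625
```

With only five folds the smallest attainable two-sided p-value is 2/32 = 0.0625, so a handful of runs cannot show significance at the 0.05 level under this test, whichever direction the difference points.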

Circularity Check

2 steps flagged

QP framework and multi-objective loss make clinical consistency and redundancy reduction claims tautological by design

specific steps
  1. self-definitional [Abstract]
    "A discrete QP feature selection framework is constructed to screen generic and discriminative features consistent with clinical diagnostic logic."

    The framework is built with the explicit goal of producing features consistent with clinical logic; the later assertion that the selected features exhibit this consistency therefore restates the design objective rather than deriving it from data or external criteria.

  2. fitted input called prediction [Abstract]
    "A multi-objective loss function is designed to reduce feature redundancy and optimize activation distribution while preserving classification performance. Experimental results ... with feature redundancy reduced by 29.5%."

    The loss is constructed to minimize redundancy; the reported 29.5% reduction is the direct numerical outcome of optimizing that loss on the datasets, rendering the reduction a fitted result rather than an a-priori prediction.

full rationale

The paper defines the discrete QP feature selection and multi-objective loss explicitly to enforce the properties later reported as results (consistency with clinical logic, 29.5% redundancy reduction). These outcomes therefore follow from the construction and fitting choices rather than constituting independent derivations or predictions. Accuracy preservation is shown empirically and is non-circular, but the interpretability guarantees rest on the internal definitions without external quantitative validation (e.g., Dice overlap with annotations). No load-bearing self-citations appear. This yields moderate circularity confined to the interpretability claims.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract supplies no explicit free parameters, axioms, or invented entities; full text would be required to enumerate them.

pith-pipeline@v0.9.1-grok · 5755 in / 1059 out tokens · 24665 ms · 2026-06-26T06:40:11.920914+00:00 · methodology

