From Pixels to Explanations: Interpretable Diabetic Retinopathy Grading with CNN-Transformer Ensembles, Visual Explainability and Vision-Language Models

Carmine Gravino; Fabio Palomba; Pir Bakhsh Khokhar; Sarang Shaikh; Sule Yildirim Yayilgan

arxiv: 2604.23079 · v1 · submitted 2026-04-25 · 💻 cs.CV · cs.AI


Pith reviewed 2026-05-08 08:45 UTC · model grok-4.3


keywords diabetic retinopathy · ensemble learning · Grad-CAM · vision-language models · model interpretability · fundus image analysis · quadratic weighted kappa · APTOS 2019


The pith

An ensemble of CNN and transformer models paired with Grad-CAM++ maps and vision-language rationales produces accurate and inspectable diabetic retinopathy grades.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper builds a pipeline that grades diabetic retinopathy severity from fundus photographs while generating both visual heatmaps and short textual explanations. It evaluates six CNN and transformer backbones on the APTOS 2019 dataset under stratified five-fold cross-validation, then tests hard voting, weighted soft voting, stacking, and a hybrid class-level fusion approach. Weighted soft voting emerges as the most consistent performer. The work further shows that Grad-CAM++ maps localize relevant retinal regions and that vision-language models can produce grade-aligned textual rationales when prompted conservatively on the image and model output. A reader would care because most high-accuracy medical image classifiers remain opaque, which restricts their safe use in screening programs where clinicians must understand and trust each decision.

Core claim

Modern CNN backbones such as ResNet-50 and ConvNeXt-Tiny establish strong single-model baselines, with cross-validated quadratic weighted kappa (QWK) up to 0.919 and 0.914, respectively. Ensembling via weighted soft voting raises QWK to 0.934 ± 0.017 across folds. Hybrid class-level fusion remains competitive yet shows no statistically reliable gain over standard fusion. Grad-CAM++ yields plausible though coarse localization, while VLM rationales stay grade-consistent and trade clinical completeness against template-level semantic similarity.
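Since every headline number here is a quadratic weighted kappa, a minimal sketch of the standard definition may help: QWK = 1 − Σ w_ij·O_ij / Σ w_ij·E_ij, with w_ij = (i − j)² / (N − 1)², O the observed confusion matrix, and E the matrix expected from the two marginal histograms. The function name and the five-grade default are assumptions for illustration; this is not the authors' evaluation code.

```python
def quadratic_weighted_kappa(y_true, y_pred, num_classes=5):
    """Cohen's kappa with quadratic weights for ordinal labels 0..num_classes-1.

    Illustrative sketch only. Raises ZeroDivisionError when the expected
    disagreement is zero (e.g. every label falls in one class).
    """
    n = len(y_true)
    # observed[i][j]: number of samples with true grade i and predicted grade j
    observed = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(y_true, y_pred):
        observed[t][p] += 1

    true_hist = [sum(row) for row in observed]
    pred_hist = [sum(observed[i][j] for i in range(num_classes))
                 for j in range(num_classes)]

    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(num_classes):
        for j in range(num_classes):
            # Penalty grows with the squared distance between grades.
            w = (i - j) ** 2 / (num_classes - 1) ** 2
            expected = true_hist[i] * pred_hist[j] / n
            weighted_observed += w * observed[i][j]
            weighted_expected += w * expected
    return 1.0 - weighted_observed / weighted_expected


if __name__ == "__main__":
    # Toy labels, not APTOS data: one prediction is off by a single grade.
    print(quadratic_weighted_kappa([0, 1, 2, 3, 4, 2], [0, 1, 2, 3, 4, 3]))
```

Because the weight is quadratic in the grade distance, confusing grade 0 with grade 4 costs sixteen times as much as confusing adjacent grades, which is why the metric suits ordinal severity scales.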

What carries the argument

Weighted soft voting ensemble of CNN and transformer backbones, augmented by Grad-CAM++ attribution maps and vision-language models conditioned on the fundus image plus classifier output to generate short textual rationales.
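As a rough illustration of weighted soft voting, the sketch below averages per-model class probabilities with per-model weights and takes the arg-max grade. How the paper derives its weights is not stated on this page, so the weights and probabilities here are hypothetical, as is the function name.

```python
def weighted_soft_vote(prob_sets, weights):
    """Fuse class-probability vectors from several models for one image.

    prob_sets: one probability list per model, all of the same length.
    weights:   one non-negative weight per model (hypothetical values here;
               the paper's weighting scheme is not described on this page).
    Returns (predicted_grade, fused_probabilities).
    """
    total = sum(weights)
    num_classes = len(prob_sets[0])
    fused = [
        sum(w * probs[c] for w, probs in zip(weights, prob_sets)) / total
        for c in range(num_classes)
    ]
    grade = max(range(num_classes), key=lambda c: fused[c])
    return grade, fused


if __name__ == "__main__":
    # Three hypothetical backbones scoring one fundus image over grades 0-4.
    probs = [
        [0.10, 0.60, 0.20, 0.05, 0.05],
        [0.10, 0.30, 0.50, 0.05, 0.05],
        [0.05, 0.50, 0.35, 0.05, 0.05],
    ]
    grade, fused = weighted_soft_vote(probs, weights=[0.90, 0.80, 0.85])
    print(grade, [round(p, 3) for p in fused])
```

Unlike hard voting, which counts each model's top grade once, this fusion keeps the full probability vectors, so a confident model can outweigh two uncertain ones.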

If this is right

Weighted soft voting delivers the highest and most stable quadratic weighted kappa across cross-validation folds.
Hybrid class-level fusion offers no statistically significant improvement over ordinary fusion in paired comparisons.
Grad-CAM++ maps supply coarse but plausible localization of features used by the model.
VLM rationales remain consistent with the predicted grade while showing a measurable trade-off between completeness and template similarity.
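The paired comparisons above are reported with Holm-adjusted p-values. A minimal sketch of the Holm step-down adjustment follows; which test produced the raw p-values is not stated on this page, so the inputs are hypothetical.

```python
def holm_adjust(p_values):
    """Return Holm step-down adjusted p-values in the original input order.

    Sketch of the standard procedure: sort the raw p-values ascending,
    multiply the k-th smallest (k = 0, 1, ...) by (m - k), enforce
    monotonicity with a running maximum, and cap at 1.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        scaled = (m - rank) * p_values[idx]
        running_max = max(running_max, scaled)
        adjusted[idx] = min(1.0, running_max)
    return adjusted


if __name__ == "__main__":
    # Hypothetical raw p-values from three paired fold comparisons.
    print(holm_adjust([0.01, 0.04, 0.03]))
    print(holm_adjust([0.50, 0.60]))
```

Adjusted values are capped at 1, so an adjusted p-value of 1.000 means only that the scaled raw value reached or exceeded the cap.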

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The pipeline could let clinicians verify AI suggestions before acting on them during routine screening.
Adding patient metadata to the VLM prompt might increase the clinical relevance of the generated rationales.
The method's use of publicly available models could ease deployment in telemedicine settings with limited specialist access.

Load-bearing premise

The generated Grad-CAM++ maps and VLM rationales are sufficiently accurate and clinically meaningful to support real-world diagnostic use.

What would settle it

A panel of ophthalmologists rating Grad-CAM++ maps as highlighting non-diagnostic regions or VLM rationales as factually wrong or unhelpful on a large held-out set of APTOS cases would falsify the interpretability claim.

Figures

Figures reproduced from arXiv: 2604.23079 by Carmine Gravino, Fabio Palomba, Pir Bakhsh Khokhar, Sarang Shaikh, Sule Yildirim Yayilgan.

**Figure 1.** Distribution of DR grades (0–4) in APTOS 2019. The imbalance across severity levels motivates imbalance-aware training and the use of complementary evaluation metrics (e.g., macro-F1 and QWK).

**Figure 3.** Tripanel example for Class 0 (No DR): original image, Grad-CAM++ overlay with predicted probabilities, and VLM explanation.

**Figure 5.** Tripanel example for Class 3 (Severe DR): …

**Figure 6.** Tripanel example for Class 4 (Proliferative …)

Original abstract

The quality of diabetic retinopathy (DR) screening relies on the ability to correctly grade severity; however, many deep-learning (DL) classifiers cannot be easily interpreted in the clinical context. This study presents a methodology that combines strong discriminative models with multimodal explanations, converting retinal pixels into clinically interpretable outputs. Using the APTOS 2019 benchmark, we evaluated six representative CNN- and transformer-based backbones under a controlled protocol with stratified five-fold cross-validation. We then compared ensembling strategies (hard voting, weighted soft voting, stacking) and investigated a hybrid class-level fusion variant to exploit grade-specific advantages. For interpretability, we produced Grad-CAM++ visual attribution maps and short textual rationales using vision-language models (VLMs) conditioned on the fundus image and classifier outputs under conservative prompting constraints. Modern CNN backbones (ResNet-50 and ConvNeXt-Tiny) provided the strongest single-model baselines, with cross-validated QWK up to 0.919 and 0.914, respectively. Ensembling improved ordinal agreement, and weighted soft voting was the most consistent across folds (QWK 0.934 +/- 0.017). Hybrid class-level fusion was competitive but did not yield a statistically reliable improvement over standard fusion in paired fold comparisons (Holm-adjusted p >= 1.000). For explanation quality, Grad-CAM++ offered plausible but coarse localization, and VLM rationales were generally grade-consistent. Quantitatively, VLM variants showed a trade-off between clinical completeness and template-level semantic similarity (coverage 0.700 vs. BERTScore 0.072), while image-text alignment was comparable (CLIPScore approximately 0.34).

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

private letter to a colleague

The paper delivers a clean empirical win on APTOS with weighted soft voting at QWK 0.934 but the interpretability claims sit on unvalidated proxies.


The main thing to know is that weighted soft voting of the CNN and transformer backbones reaches 0.934 QWK with a tight standard deviation across the five folds on APTOS 2019. That beats the single-model numbers they report and holds up in the paired comparisons they ran. The hybrid class-level fusion they tried did not improve things enough to matter statistically.

They also layer on Grad-CAM++ maps and short VLM rationales, then measure the text outputs with coverage, BERTScore, and CLIPScore to show the usual completeness-versus-similarity trade-off. Those numbers are new even if the pieces are not. The controlled five-fold protocol and the direct comparison of voting strategies are the parts that feel solid. They kept the evaluation straightforward and reported the Holm-adjusted p-values, which is better than many applied medical imaging papers.

The soft spots are exactly where the stress-test note flags them. The explanations are called plausible and grade-consistent, but the only support is those proxy scores and the authors' own qualitative read. No ophthalmologist ratings, no expert lesion maps, and no inter-rater numbers appear, so the claim that the outputs are clinically interpretable stays unanchored. The methods also give little on hyperparameter search or preprocessing, which makes the performance numbers harder to reproduce.

This is useful for someone who needs a ready-to-run DR grading pipeline with basic visuals and text attached. It is not the place to look for new methods or for explanations that have been checked against real clinical judgment. The empirical comparison is clean enough that it deserves peer review, though any referee will want stronger validation on the explanation side before the interpretability half of the title can be taken at face value.

Referee Report

2 major / 0 minor

Summary. This manuscript describes an ensemble-based approach to grading diabetic retinopathy severity using CNN and transformer architectures on the APTOS 2019 dataset. It evaluates various ensembling methods including weighted soft voting, which achieves the highest cross-validated QWK of 0.934 ± 0.017. Additionally, it integrates Grad-CAM++ for visual attributions and vision-language models for generating textual explanations, assessing their quality through quantitative proxies and qualitative observations.

Significance. The results demonstrate that ensembling can modestly improve ordinal agreement in DR grading, and the multimodal explanation pipeline is a step toward interpretable medical AI. Strengths include the use of five-fold stratified cross-validation, reporting of standard deviations, and statistical comparisons (Holm-adjusted p-values). If the explanation components were clinically validated, this could have practical significance for screening programs. Currently, the interpretability aspect relies on unverified assumptions about the meaningfulness of the generated maps and rationales.

major comments (2)

[Abstract] Details on hyperparameter search, data preprocessing steps, and handling of potential label noise in the APTOS 2019 dataset are not provided, making it difficult to reproduce or fully assess the robustness of the reported QWK values and statistical tests.

[Interpretability evaluation, as described in the abstract] The assessment of explanation quality uses only proxy metrics (e.g., coverage 0.700, BERTScore 0.072, CLIPScore ~0.34) and qualitative statements like 'plausible but coarse' without any ophthalmologist ratings, inter-rater reliability, or comparison to expert-annotated lesion maps. This does not sufficiently support the claim of producing 'clinically interpretable outputs'.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thorough review and constructive feedback. We address each major comment below and have revised the manuscript where feasible to improve reproducibility and clarify the scope of our interpretability claims.


Referee: [Abstract] Details on hyperparameter search, data preprocessing steps, and handling of potential label noise in the APTOS 2019 dataset are not provided, making it difficult to reproduce or fully assess the robustness of the reported QWK values and statistical tests.

Authors: We agree that these implementation details are necessary for full reproducibility. In the revised manuscript we have added an expanded Methods subsection that specifies the hyperparameter search (grid search over learning rates [1e-5, 1e-4, 1e-3], batch sizes [16, 32], and optimizers with 5-fold CV selection), the preprocessing pipeline (224×224 resizing, ImageNet mean/std normalization, and on-the-fly augmentations of random horizontal flips and rotations), and our treatment of label noise (reliance on the dataset-provided expert labels without additional denoising, while noting known annotation variability in APTOS 2019). These additions directly address the concern and allow readers to replicate the experimental protocol.

revision: yes

Referee: [Interpretability evaluation, as described in the abstract] The assessment of explanation quality uses only proxy metrics (e.g., coverage 0.700, BERTScore 0.072, CLIPScore ~0.34) and qualitative statements like 'plausible but coarse' without any ophthalmologist ratings, inter-rater reliability, or comparison to expert-annotated lesion maps. This does not sufficiently support the claim of producing 'clinically interpretable outputs'.

Authors: We acknowledge that proxy metrics and qualitative observations alone do not constitute clinical validation. Direct ophthalmologist ratings, inter-rater reliability, or comparison against expert lesion annotations would require a separate human-subject study with appropriate ethics approval and is outside the scope of the present work. In the revision we have (i) removed or qualified phrases such as “clinically interpretable outputs” in the abstract and conclusions, (ii) added an explicit limitations paragraph stating that the current evaluation is preliminary and that clinical validation remains future work, and (iii) retained the proxy metrics as an initial quantitative sanity check while clarifying their limitations. This constitutes a partial revision that addresses the referee’s concern without over-claiming.

revision: partial
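The hyperparameter grid described in the first response can be enumerated roughly as follows. The learning rates and batch sizes come from the rebuttal text; the optimizer names are placeholders because the rebuttal does not name them, and `select_best` with its scoring callable is a hypothetical helper, not the authors' code.

```python
from itertools import product

LEARNING_RATES = [1e-5, 1e-4, 1e-3]   # from the rebuttal
BATCH_SIZES = [16, 32]                # from the rebuttal
# Placeholder names only: the rebuttal says "optimizers" without listing them.
OPTIMIZERS = ["optimizer_a", "optimizer_b"]


def build_grid(optimizers=OPTIMIZERS):
    """Enumerate every (learning rate, batch size, optimizer) combination."""
    return [
        {"lr": lr, "batch_size": bs, "optimizer": opt}
        for lr, bs, opt in product(LEARNING_RATES, BATCH_SIZES, optimizers)
    ]


def select_best(grid, cv_score):
    """Pick the config with the highest score.

    cv_score is a caller-supplied callable mapping a config dict to its
    mean validation score over the five folds (hypothetical interface).
    """
    return max(grid, key=cv_score)


if __name__ == "__main__":
    grid = build_grid()
    print(len(grid), "configurations")
```

With two placeholder optimizers the grid has 3 × 2 × 2 = 12 configurations, each of which would need a full five-fold run under the protocol the rebuttal describes.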

Circularity Check

0 steps flagged

Empirical evaluation on public benchmark with no derivation chain


The paper reports results from training and evaluating six CNN/transformer backbones on the APTOS 2019 dataset under stratified 5-fold CV, then compares standard ensemble methods (hard voting, weighted soft voting, stacking) plus a hybrid class-level fusion variant. All performance numbers are direct cross-validated metrics (QWK, accuracy) computed on held-out folds; explanation quality is assessed via standard proxy scores (coverage, BERTScore, CLIPScore) plus qualitative description of Grad-CAM++ maps and VLM outputs. No equations, fitted parameters renamed as predictions, ansatzes, or uniqueness theorems appear. No self-citations are invoked as load-bearing premises. The entire chain is external to the paper's own outputs and therefore non-circular.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The paper rests on standard supervised learning assumptions and the representativeness of the APTOS 2019 benchmark; no new entities are postulated.

axioms (1)

domain assumption: Stratified five-fold cross-validation produces unbiased estimates of generalization performance on the APTOS 2019 distribution.
Invoked when reporting cross-validated QWK scores.
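For readers unfamiliar with the stratified split this axiom refers to, here is a minimal sketch that assigns each sample to one of five folds while keeping the grade proportions roughly equal in every fold. The function name, seed, and toy class counts are assumptions; the paper's actual splitting code is not shown on this page.

```python
import random
from collections import defaultdict


def stratified_folds(labels, k=5, seed=0):
    """Return a fold id (0..k-1) for each sample, balanced per class.

    Illustrative sketch: indices of each class are shuffled and dealt
    round-robin into the k folds, so per-class counts differ by at most
    one between folds.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    fold_of = [None] * len(labels)
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        for position, idx in enumerate(indices):
            fold_of[idx] = position % k
    return fold_of


if __name__ == "__main__":
    # Hypothetical imbalanced grade counts, not the real APTOS distribution.
    labels = [0] * 50 + [1] * 10 + [2] * 20 + [3] * 5 + [4] * 15
    folds = stratified_folds(labels)
    for f in range(5):
        held_out = [labels[i] for i, fold in enumerate(folds) if fold == f]
        print(f, {g: held_out.count(g) for g in range(5)})
```

Each fold then serves once as the held-out set while the other four are used for training, which is what keeps minority grades represented in every evaluation split.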

pith-pipeline@v0.9.0 · 5640 in / 1257 out tokens · 53941 ms · 2026-05-08T08:45:13.151855+00:00 · methodology

