
arxiv: 2603.08235 · v2 · pith:V32CC5BRnew · submitted 2026-03-09 · 💻 cs.CV · cs.AI

Exploring Deep Learning and Ultra-Widefield Imaging for Diabetic Retinopathy and Macular Edema

Pith reviewed 2026-05-21 11:54 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords diabetic retinopathy · macular edema · ultra-widefield imaging · deep learning · vision transformers · feature fusion · frequency domain · image quality assessment

The pith

Deep learning models using ultra-widefield images detect referable diabetic retinopathy and macular edema with consistently strong performance across architectures.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper benchmarks deep learning methods on ultra-widefield images for three tasks: assessing image quality, spotting referable diabetic retinopathy, and identifying diabetic macular edema. It compares convolutional networks to newer vision transformers and foundation models, processing both standard color images and frequency representations, then combines features from multiple models for better results. Explanations come from Grad-CAM heatmaps. A sympathetic reader would care because these eye conditions cause preventable blindness in working-age adults and the wider field of view in ultra-widefield imaging could catch more cases earlier than standard photos.
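The Grad-CAM step mentioned above can be illustrated with a minimal NumPy sketch. It assumes the feature maps and the gradients of the class score have already been extracted from some convolutional layer; the function name `grad_cam` and the toy arrays are illustrative and are not taken from the paper.

```python
import numpy as np

def grad_cam(activations: np.ndarray, gradients: np.ndarray) -> np.ndarray:
    """Combine feature maps into a Grad-CAM heatmap.

    activations: (K, H, W) feature maps from a chosen layer.
    gradients:   (K, H, W) gradients of the class score w.r.t. those maps.
    Returns an (H, W) heatmap scaled to [0, 1].
    """
    # Channel weights: global-average-pool the gradients over space.
    weights = gradients.mean(axis=(1, 2))                      # (K,)
    # Weighted sum of feature maps, then ReLU to keep positive evidence.
    cam = np.maximum(np.tensordot(weights, activations, axes=1), 0.0)
    # Normalize for display; an all-zero map is returned unchanged.
    peak = cam.max()
    return cam / peak if peak > 0 else cam

# Toy input: channel 0 supports the class, channel 1 opposes it.
acts = np.array([[[1.0, 0.0], [0.0, 0.0]],
                 [[0.0, 0.0], [0.0, 2.0]]])
grads = np.stack([np.ones((2, 2)), -np.ones((2, 2))])
heatmap = grad_cam(acts, grads)
```

In practice the heatmap would be upsampled to the input resolution and overlaid on the ultra-widefield image; that step is omitted here.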

Core claim

Using the UWF4DR Challenge dataset, state-of-the-art deep learning models achieve consistently strong performance across all tested architectures on image quality assessment, referable diabetic retinopathy identification, and diabetic macular edema identification. This performance highlights the competitiveness of vision transformers and foundation models as well as the value of feature-level fusion and frequency-domain representations for ultra-widefield image analysis.

What carries the argument

Benchmarking CNNs, vision transformers, and foundation models in spatial RGB and frequency domains with feature-level fusion and Grad-CAM analysis on the UWF4DR Challenge dataset for the three clinical tasks.
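As a rough illustration of a frequency-domain input, the sketch below computes a centered log-magnitude spectrum and clips it at the 99th percentile, in the spirit of the "DFT Magnitude (clipped 99%)" label on Figure 3. The log scaling, the grayscale input, and the function name `dft_magnitude` are assumptions; the paper's exact preprocessing is not stated on this page.

```python
import numpy as np

def dft_magnitude(gray: np.ndarray, clip_percentile: float = 99.0) -> np.ndarray:
    """Return a centered log-magnitude spectrum scaled to [0, 1]."""
    # 2D DFT with the zero-frequency term shifted to the image center.
    spectrum = np.fft.fftshift(np.fft.fft2(gray.astype(np.float64)))
    # Log scaling compresses the large dynamic range of the magnitudes.
    magnitude = np.log1p(np.abs(spectrum))
    # Clip extreme values (mostly the DC term) at the chosen percentile.
    ceiling = np.percentile(magnitude, clip_percentile)
    clipped = np.minimum(magnitude, ceiling)
    return clipped / ceiling if ceiling > 0 else clipped

# Toy stand-in for a grayscale ultra-widefield image.
rng = np.random.default_rng(0)
image = rng.random((32, 32))
freq_input = dft_magnitude(image)
```

The resulting array has the same height and width as the input, so it can be fed to the same architectures as the RGB image after channel replication or a similar adaptation.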

If this is right

  • Deep learning models deliver strong results on ultra-widefield image quality assessment.
  • Referable diabetic retinopathy identification works reliably with the tested approaches.
  • Diabetic macular edema detection improves with the wider field and advanced models.
  • Feature-level fusion adds robustness across different model types.
  • Frequency-domain representations complement spatial processing for ultra-widefield analysis.
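Feature-level fusion, as referred to in the points above, can be sketched as concatenating per-image embeddings from several models and training one small classifier on the result. The sketch below is a minimal NumPy version under stated assumptions: the embeddings are synthetic, the per-model z-scoring and the logistic-regression head are illustrative choices, and none of the names come from the paper.

```python
import numpy as np

def fuse_features(*embeddings: np.ndarray) -> np.ndarray:
    """Z-score each model's embedding matrix, then concatenate per image."""
    scaled = []
    for emb in embeddings:
        std = emb.std(axis=0)
        scaled.append((emb - emb.mean(axis=0)) / np.where(std > 0, std, 1.0))
    return np.concatenate(scaled, axis=1)

def fit_logistic_head(x: np.ndarray, y: np.ndarray, lr: float = 0.1, steps: int = 500):
    """Fit a logistic-regression head with plain gradient descent."""
    w, b = np.zeros(x.shape[1]), 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(x @ w + b)))   # predicted probabilities
        grad = p - y                              # gradient of the log loss
        w -= lr * (x.T @ grad) / len(y)
        b -= lr * grad.mean()
    return w, b

# Synthetic embeddings standing in for two backbones (hypothetical sizes).
rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=200)
emb_a = rng.normal(size=(200, 8)) + 1.5 * labels[:, None]
emb_b = rng.normal(size=(200, 4)) + 1.0 * labels[:, None]

fused = fuse_features(emb_a, emb_b)               # shape (200, 12)
w, b = fit_logistic_head(fused, labels)
train_acc = np.mean(((fused @ w + b) > 0) == labels)
```

This toy run scores on its own training data. A real evaluation would compute the scaling statistics on the training split only and report metrics on held-out images.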

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Eye clinics might switch to ultra-widefield cameras paired with these models to capture more peripheral retina in routine visits.
  • Frequency-domain processing could help when image quality varies due to different camera hardware.
  • Adding the Grad-CAM maps to doctor review workflows might increase trust and speed up screening.
  • Testing the same pipeline on images from broader age and ethnic groups would check how well results hold in diverse populations.

Load-bearing premise

The labels and imaging conditions in the UWF4DR Challenge dataset are representative of real-world clinical ultra-widefield images so that reported performance will translate to practical screening.

What would settle it

A new collection of ultra-widefield images from different clinics and cameras, labeled independently, on which the trained models show substantially lower accuracy than reported on the challenge set.

Figures

Figures reproduced from arXiv: 2603.08235 by Aythami Morales, Guillermo González de Rivera, Julian Fierrez, Pablo Jimenez-Lizcano, Ruben Tolosana, Ruben Vera-Rodriguez, Sergio Romero-Tapiador.

Figure 1: Graphical representation of the proposed framework, summarizing the main components and processing stages of […]
Figure 2: UWF4DR tasks. (A–B) Quality Assessment: (A) Gradable image with good focus, contrast, and clear structures; (B) […]
Figure 3: DFT Magnitude (clipped 99%) comparison: (A) Grad […]
Figure 4: Grad-CAM examples. Task 1 (Quality Assessment): (A–B) Correct gradable predictions focus on optic disc/vessels; […]
Original abstract

Diabetic retinopathy (DR) and diabetic macular edema (DME) are leading causes of preventable blindness among working-age adults. Traditional approaches in the literature focus on standard color fundus photography (CFP) for the detection of these conditions. Nevertheless, recent ultra-widefield imaging (UWF) offers a significantly wider field of view in comparison to CFP. Motivated by this, the present study explores state-of-the-art deep learning (DL) methods and UWF imaging on three clinically relevant tasks: i) image quality assessment for UWF, ii) identification of referable diabetic retinopathy (RDR), and iii) identification of DME. Using the publicly available UWF4DR Challenge dataset, released as part of the MICCAI 2024 conference, we benchmark DL models in the spatial (RGB) and frequency domains, including popular convolutional neural networks (CNNs) as well as recent vision transformers (ViTs) and foundation models. In addition, we explore a final feature-level fusion to increase robustness. Finally, we also analyze the decisions of the DL models using Grad-CAM, increasing the explainability. Our proposal achieves consistently strong performance across all architectures, underscoring the competitiveness of emerging ViTs and foundation models and the promise of feature-level fusion and frequency-domain representations for UWF analysis.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The manuscript benchmarks deep learning models including CNNs, Vision Transformers, and foundation models for three tasks on ultra-widefield images from the public UWF4DR Challenge dataset (MICCAI 2024): image quality assessment, referable diabetic retinopathy detection, and diabetic macular edema detection. It compares spatial and frequency-domain representations, applies feature-level fusion, and uses Grad-CAM for explainability, claiming consistently strong performance that demonstrates the competitiveness of ViTs/foundation models and the value of fusion and frequency-domain methods for UWF analysis.

Significance. If the performance holds under broader testing, the work could advance automated DR/DME screening by leveraging UWF's wider field of view and modern architectures like ViTs. Use of a public challenge dataset aids reproducibility, and inclusion of explainability is a positive step toward clinical utility. However, significance is constrained by single-dataset evaluation without external validation or domain-shift testing, reducing confidence that results generalize beyond the challenge data.

major comments (2)
  1. Abstract: The claim of 'consistently strong performance across all architectures' is presented without any quantitative metrics (e.g., AUC, sensitivity, specificity), confidence intervals, data-split details, or statistical tests. This absence directly undermines evaluation of the central claim regarding ViT competitiveness and the promise of fusion/frequency-domain methods.
  2. Experiments section: All benchmarking occurs exclusively on the single UWF4DR Challenge dataset with no external validation cohort, multi-center data, or domain-shift experiments described. This is load-bearing for the claim that results underscore the 'promise ... for UWF analysis,' as performance consistency may reflect dataset-specific properties rather than methodological advantages.
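The metrics the referee asks for in the first comment can be computed along the following lines. This is a minimal sketch on made-up labels and scores: the 0.5 threshold, the percentile bootstrap, and all function names are assumptions, and none of the numbers relate to the paper's results.

```python
import numpy as np

def auc_score(y_true, scores) -> float:
    """AUC as the probability that a positive outranks a negative (ties count half)."""
    y_true, scores = np.asarray(y_true), np.asarray(scores, dtype=float)
    pos, neg = scores[y_true == 1], scores[y_true == 0]
    wins = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (wins + 0.5 * ties) / (len(pos) * len(neg))

def sensitivity_specificity(y_true, scores, threshold: float = 0.5):
    """Sensitivity and specificity when scores >= threshold count as positive."""
    y_true, pred = np.asarray(y_true), np.asarray(scores) >= threshold
    tp, fn = np.sum(pred & (y_true == 1)), np.sum(~pred & (y_true == 1))
    tn, fp = np.sum(~pred & (y_true == 0)), np.sum(pred & (y_true == 0))
    return tp / (tp + fn), tn / (tn + fp)

def bootstrap_auc_ci(y_true, scores, n_boot: int = 1000, alpha: float = 0.05, seed: int = 0):
    """Percentile bootstrap confidence interval for the AUC."""
    rng = np.random.default_rng(seed)
    y_true, scores = np.asarray(y_true), np.asarray(scores, dtype=float)
    stats = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(y_true), size=len(y_true))
        if y_true[idx].min() == y_true[idx].max():
            continue  # skip resamples that contain a single class
        stats.append(auc_score(y_true[idx], scores[idx]))
    return tuple(np.percentile(stats, [100 * alpha / 2, 100 * (1 - alpha / 2)]))

# Made-up labels and model scores for eight images.
y = [0, 0, 1, 1, 0, 1, 1, 0]
s = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.6, 0.7]
auc = auc_score(y, s)                       # 13 of 16 pairs ranked correctly
sens, spec = sensitivity_specificity(y, s)  # 3 of 4 positives, 3 of 4 negatives
ci_low, ci_high = bootstrap_auc_ci(y, s)
```

Running the same functions on an independent cohort and comparing the intervals is one simple way to probe the domain-shift concern raised in the second comment.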

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address the major comments point by point below, with revisions made where they strengthen the manuscript without misrepresenting our results.

  1. Referee: Abstract: The claim of 'consistently strong performance across all architectures' is presented without any quantitative metrics (e.g., AUC, sensitivity, specificity), confidence intervals, data-split details, or statistical tests. This absence directly undermines evaluation of the central claim regarding ViT competitiveness and the promise of fusion/frequency-domain methods.

    Authors: We agree that the abstract would be strengthened by including quantitative support for the claim. The manuscript already reports detailed AUC, sensitivity, specificity, and statistical comparisons in the Experiments section and Tables 2–4. We have revised the abstract to incorporate representative metrics (e.g., peak AUC values per task) while respecting length constraints.
    Revision: yes

  2. Referee: Experiments section: All benchmarking occurs exclusively on the single UWF4DR Challenge dataset with no external validation cohort, multi-center data, or domain-shift experiments described. This is load-bearing for the claim that results underscore the 'promise ... for UWF analysis,' as performance consistency may reflect dataset-specific properties rather than methodological advantages.

    Authors: We acknowledge that single-dataset evaluation limits strong generalization claims. The UWF4DR dataset is the official public benchmark from the MICCAI 2024 challenge, chosen specifically to enable reproducible comparisons. We have added an expanded limitations paragraph in the Discussion that explicitly addresses domain-shift risks and calls for future multi-center validation. The observed consistency across CNNs, ViTs, and foundation models on this standardized data still supports the methodological points regarding fusion and frequency-domain processing.
    Revision: partial

Circularity Check

0 steps flagged

No circularity: empirical benchmarking on external public dataset


The paper is a standard empirical benchmarking study that evaluates off-the-shelf CNNs, ViTs, and foundation models on the publicly released UWF4DR Challenge dataset for three classification tasks. It reports experimental results in spatial and frequency domains, applies feature-level fusion, and uses Grad-CAM for visualization. No derivations, equations, fitted parameters renamed as predictions, or self-citation chains are present that could reduce any claim to its own inputs by construction. All performance numbers derive from direct evaluation on an external challenge dataset rather than from internal definitions or prior author results.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

Because this is an empirical benchmarking study, its central claims rest on two assumptions: that the UWF4DR dataset labels are representative, and that standard training procedures on this data produce detectors that generalize to clinical ultra-widefield images.

free parameters (1)
  • Model selection and training hyperparameters
    Choice of specific CNN, ViT, and foundation model variants plus their optimization settings are selected to achieve the reported performance.
axioms (1)
  • Domain assumption: the UWF4DR Challenge dataset provides accurate ground-truth labels for the three tasks
    All reported results depend on the correctness and consistency of the challenge annotations.



