Bridging Restoration and Diagnosis: A Comprehensive Benchmark for Retinal Fundus Enhancement
Pith reviewed 2026-05-13 17:11 UTC · model grok-4.3
The pith
EyeBench-V2 evaluates fundus enhancement models by clinical task performance and expert review rather than pixel metrics.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
EyeBench-V2 bridges restoration and diagnosis by supplying a unified benchmark that measures enhancement models through multi-dimensional clinical alignment: performance on vessel segmentation, DR grading, and lesion segmentation, plus expert manual review of lesion alterations, color shifts, and introduced artifacts, all on a dataset that supports fair paired and unpaired comparisons.
What carries the argument
The EyeBench-V2 benchmark itself: its downstream clinical tasks and expert-guided manual assessment protocol together enable unified evaluation of enhancement methods.
Where Pith is reading between the lines
- The benchmark could drive training objectives that directly optimize for segmentation and grading accuracy rather than image similarity alone.
- Similar task-oriented benchmarks might be applied to enhancement in other medical imaging areas to link restoration more tightly to diagnostic outcomes.
- Current generative models may need architectural changes to avoid introducing artifacts that the expert protocol flags as clinically harmful.
Load-bearing premise
The chosen downstream tasks and expert assessment protocol accurately reflect real clinical diagnostic utility and the dataset represents typical real-world fundus images and noise.
What would settle it
A controlled reader study showing that models ranked highest by EyeBench-V2 produce no measurable gain in ophthalmologist diagnostic accuracy on enhanced images compared with originals.
Original abstract
Over the past decade, generative models have demonstrated success in enhancing fundus images. However, the evaluation of these models remains a challenge. A benchmark for fundus image enhancement is needed for three main reasons: (1) Conventional denoising metrics such as PSNR and SSIM fail to capture clinically relevant features, such as lesion preservation and vessel morphology consistency, limiting their applicability in real-world settings; (2) There is a lack of unified evaluation protocols that address both paired and unpaired enhancement methods, particularly those guided by clinical expertise; and (3) An evaluation framework should provide actionable insights to guide future advancements in clinically aligned enhancement models. To address these gaps, we introduce EyeBench-V2, a benchmark designed to bridge the gap between enhancement model performance and clinical utility. Our work offers three key contributions: (1) Multi-dimensional clinical alignment through downstream evaluations: Beyond standard enhancement metrics, we assess performance across clinically meaningful tasks including vessel segmentation, diabetic retinopathy (DR) grading, generalization to unseen noise patterns, and lesion segmentation. (2) Expert-guided evaluation design: We curate a novel dataset enabling fair comparisons between paired and unpaired enhancement methods, accompanied by a structured manual assessment protocol by medical experts, which evaluates clinically critical aspects such as lesion structure alterations, background color shifts, and the introduction of artificial structures. (3) Actionable insights: Our benchmark provides a rigorous, task-oriented analysis of existing generative models, equipping clinical researchers with the evidence needed to make informed decisions, while also identifying limitations in current methods to inform the design of next-generation enhancement models.
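For concreteness, the two pixel metrics the abstract critiques can be sketched in a few lines. This is a minimal NumPy illustration (a single-window SSIM rather than the windowed average, and not the implementation used by the benchmark; production code typically uses scikit-image):

```python
import numpy as np

def psnr(x, y, data_range=1.0):
    """Peak signal-to-noise ratio in dB between two images scaled to [0, data_range]."""
    mse = np.mean((x - y) ** 2)
    if mse == 0:
        return float("inf")
    return 10.0 * np.log10(data_range ** 2 / mse)

def global_ssim(x, y, data_range=1.0):
    """Simplified single-window SSIM; the full metric averages over local windows."""
    c1 = (0.01 * data_range) ** 2
    c2 = (0.03 * data_range) ** 2
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    return ((2 * mx * my + c1) * (2 * cov + c2)) / ((mx ** 2 + my ** 2 + c1) * (vx + vy + c2))

rng = np.random.default_rng(0)
clean = rng.random((64, 64))
noisy = np.clip(clean + rng.normal(0.0, 0.05, clean.shape), 0.0, 1.0)
print(psnr(clean, noisy), global_ssim(clean, noisy))
```

Both scores are near their maxima here even though Gaussian noise of this kind never occurs in fundus acquisition, which is exactly the mismatch between pixel fidelity and clinical relevance that motivates the benchmark.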
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces EyeBench-V2, a benchmark for retinal fundus image enhancement. It argues that standard metrics such as PSNR and SSIM fail to capture clinically relevant features like lesion preservation and vessel morphology. The benchmark evaluates generative models through downstream clinical tasks including vessel segmentation, diabetic retinopathy grading, lesion segmentation, and generalization to unseen noise patterns, supplemented by expert-guided manual assessment on a curated dataset supporting paired and unpaired method comparisons. The goal is to deliver actionable insights that align enhancement performance with diagnostic utility and guide future model development.
Significance. If the chosen downstream tasks and expert protocol are validated as reliable proxies, EyeBench-V2 could establish a much-needed standardized evaluation framework in ophthalmic image enhancement. This would help the community move beyond generic image-quality metrics toward assessments that better reflect real diagnostic value, potentially improving model selection for clinical deployment and highlighting specific weaknesses in current generative approaches.
major comments (3)
- [Abstract] Abstract, point (1): The claim that downstream tasks (vessel segmentation, DR grading, lesion segmentation) bridge enhancement performance to clinical utility is load-bearing but unsupported. No correlation analysis, inter-rater reliability statistics, or evidence linking task improvements to actual diagnostic accuracy is provided, leaving the proxy assumption untested.
- [Contributions] Contributions (2): The expert-guided evaluation design is described at a high level but lacks concrete protocol details such as scoring rubrics for lesion structure alterations, color shifts, and artificial structures, expert selection criteria, or quantitative agreement measures. Without these, reproducibility and clinical fidelity cannot be assessed.
- [Contributions] Contributions (3) and dataset description: The assertion of 'actionable insights' and fair paired/unpaired comparisons depends on the curated dataset representing real-world noise patterns across camera types and populations. No statistics on dataset size, diversity, or noise modeling are supplied, weakening the generalization claims.
minor comments (1)
- [Abstract] The abstract mentions 'generalization to unseen noise patterns' without defining the patterns or the protocol used to introduce them; adding one sentence of clarification would improve readability.
Simulated Author's Rebuttal
We thank the referee for the thoughtful and constructive review of our manuscript on EyeBench-V2. The comments highlight important areas for strengthening the presentation of our clinical-alignment claims, evaluation protocols, and dataset details. We address each major comment below and outline the specific revisions planned for the next version of the paper.
Point-by-point responses
Referee: [Abstract] Abstract, point (1): The claim that downstream tasks (vessel segmentation, DR grading, lesion segmentation) bridge enhancement performance to clinical utility is load-bearing but unsupported. No correlation analysis, inter-rater reliability statistics, or evidence linking task improvements to actual diagnostic accuracy is provided, leaving the proxy assumption untested.
Authors: We agree that an explicit correlation analysis between enhancement outputs and downstream diagnostic accuracy would provide stronger support for the proxy assumption. The selected tasks (vessel segmentation, DR grading, lesion segmentation) were chosen because they are established clinical endpoints in ophthalmology literature, directly tied to diagnostic decisions. In the revision we will add a dedicated paragraph in the abstract and methods sections justifying these proxies with supporting references, report inter-rater reliability statistics for the expert assessments, and include a supplementary correlation table (e.g., Spearman rank between perceptual scores and task metrics) computed on the existing evaluation data. This will make the bridging claim more robust without requiring new experiments. revision: partial
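The supplementary correlation table the authors promise could be computed along these lines. The scores below are hypothetical placeholders (in the proposed analysis each row would be one enhancement model evaluated by the benchmark); the tie-free rank correlation here is a sketch, while SciPy's `spearmanr` also handles ties:

```python
import numpy as np

def spearman_rho(a, b):
    """Spearman rank correlation for tie-free score vectors (Pearson on ranks)."""
    ra = np.argsort(np.argsort(a)).astype(float)
    rb = np.argsort(np.argsort(b)).astype(float)
    ra -= ra.mean()
    rb -= rb.mean()
    return float((ra * rb).sum() / np.sqrt((ra ** 2).sum() * (rb ** 2).sum()))

# Hypothetical per-model scores; one entry per enhancement model.
psnr_per_model = np.array([28.1, 30.4, 26.7, 31.2, 29.0])  # perceptual metric
dice_per_model = np.array([0.71, 0.74, 0.69, 0.73, 0.75])  # vessel-segmentation Dice

print(spearman_rho(psnr_per_model, dice_per_model))  # → 0.6
```

A low or unstable rho in such a table would quantify the referee's concern: perceptual ranking and downstream-task ranking need not agree.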
Referee: [Contributions] Contributions (2): The expert-guided evaluation design is described at a high level but lacks concrete protocol details such as scoring rubrics for lesion structure alterations, color shifts, and artificial structures, expert selection criteria, or quantitative agreement measures. Without these, reproducibility and clinical fidelity cannot be assessed.
Authors: We appreciate this observation. The original description was kept concise, but we recognize that full reproducibility requires the complete protocol. In the revised manuscript we will expand the expert evaluation subsection to include: (i) the full 5-point scoring rubrics for lesion structure, color fidelity, and artifact introduction; (ii) expert selection criteria (board-certified ophthalmologists with at least five years of fundus-image experience); and (iii) quantitative agreement metrics (Cohen’s kappa and percentage agreement) computed across the three experts. These additions will be placed in the main text with the rubrics also provided as supplementary material. revision: yes
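The agreement statistic the authors plan to report could be computed as follows. The rubric scores are invented for illustration (the real protocol would cover three experts and report pairwise kappa plus percentage agreement); scikit-learn's `cohen_kappa_score` provides an equivalent off-the-shelf implementation:

```python
import numpy as np

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa: agreement between two raters, corrected for chance."""
    a, b = np.asarray(rater_a), np.asarray(rater_b)
    observed = np.mean(a == b)  # raw percentage agreement
    # Chance agreement: product of each rater's marginal label frequencies.
    chance = sum(np.mean(a == c) * np.mean(b == c) for c in np.union1d(a, b))
    return float((observed - chance) / (1.0 - chance))

# Hypothetical 5-point rubric scores (e.g., lesion-structure fidelity)
# from two experts over eight enhanced images.
expert_a = [5, 4, 4, 3, 5, 2, 4, 3]
expert_b = [5, 4, 3, 3, 5, 2, 4, 4]

print(round(cohens_kappa(expert_a, expert_b), 3))  # → 0.652
```

Values in the 0.6-0.8 range are conventionally read as substantial agreement, which is the kind of evidence the referee asks for to support clinical fidelity of the protocol.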
Referee: [Contributions] Contributions (3) and dataset description: The assertion of 'actionable insights' and fair paired/unpaired comparisons depends on the curated dataset representing real-world noise patterns across camera types and populations. No statistics on dataset size, diversity, or noise modeling are supplied, weakening the generalization claims.
Authors: We acknowledge the need for explicit dataset statistics. The revised version will add a new table and accompanying text detailing: total image count, breakdown by camera manufacturer (Topcon, Zeiss, Canon, etc.), patient demographics (age range, ethnicity distribution where available), and the noise-modeling procedure (synthetic noise calibrated to real acquisition artifacts observed across the source cameras). These statistics will directly support the claims of real-world representativeness and fair paired/unpaired comparisons. revision: yes
Circularity Check
No circularity: benchmark proposal uses external proxies without self-referential derivation
Full rationale
The paper introduces EyeBench-V2 as an evaluation framework for fundus enhancement models, relying on downstream tasks (vessel segmentation, DR grading, lesion segmentation) and an expert assessment protocol. No mathematical derivations, equations, or parameter fittings exist that could reduce to self-definition or fitted inputs called predictions. The central premise—that these tasks and protocols bridge to clinical utility—is presented as an assumption supported by curation choices, not derived internally or justified via self-citation chains. No uniqueness theorems, ansatzes smuggled through citations, or renamings of known results appear. The work is self-contained as a benchmark definition whose validity hinges on external clinical correlation, not circular construction.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: Downstream tasks such as vessel segmentation and DR grading serve as valid indicators of clinical utility for enhanced fundus images.
- domain assumption: Structured expert review can reliably detect clinically critical alterations such as lesion structure changes or artificial structures.