Towards Clinically Interpretable Ophthalmic VQA via Spatially-Grounded Lesion Evidence
Pith reviewed 2026-05-22 07:25 UTC · model grok-4.3
The pith
Spatially grounding lesions in retinal images makes ophthalmic VQA models more accurate and clinically transparent.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors report that incorporating lesion-level visual evidence consistently improves model performance on both answer accuracy and lesion-level reasoning metrics. The evidence comes from localizing 15,595 lesions in 10,719 images on the ETDRS grid and generating 72,706 questions from them. They take this result as demonstrating that explicit spatial grounding is necessary for reliable and explainable ophthalmic visual question answering.
What carries the argument
A three-stage pipeline that annotates lesions, uses the ETDRS grid to map them to nine standardized retinal regions, and generates questions from that evidence, paired with a dual evaluation of answer accuracy and lesion-level reasoning.
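As a rough illustration of the nine-region mapping, the sketch below assigns a lesion centre to an ETDRS subfield. It assumes the standard grid geometry (concentric circles of 1, 3, and 6 mm diameter centred on the fovea, with the two rings split into superior, nasal, inferior, and temporal quadrants) and a conventional fundus photograph orientation in which the nasal side lies toward the image right for a right eye. The function name `etdrs_region`, the millimetre offsets, and the `outside_grid` label are hypothetical; the paper's own localization procedure is not described on this page.

```python
import math

# Standard ETDRS grid radii in millimetres (circle diameters of 1, 3, and 6 mm).
CENTRAL_RADIUS_MM = 0.5
INNER_RADIUS_MM = 1.5
OUTER_RADIUS_MM = 3.0


def etdrs_region(dx_mm, dy_mm, eye="right"):
    """Map a lesion centre to one of the nine ETDRS subfields.

    dx_mm, dy_mm: offset from the fovea in image coordinates
    (x grows to the right, y grows downward). Hypothetical interface.
    """
    radius = math.hypot(dx_mm, dy_mm)
    if radius <= CENTRAL_RADIUS_MM:
        return "central"
    if radius > OUTER_RADIUS_MM:
        # Assumption: lesions beyond the outer ring get a separate label.
        return "outside_grid"
    ring = "inner" if radius <= INNER_RADIUS_MM else "outer"

    # Flip y so that 90 degrees points up (superior) in the image.
    angle = math.degrees(math.atan2(-dy_mm, dx_mm)) % 360
    if 45 <= angle < 135:
        quadrant = "superior"
    elif 225 <= angle < 315:
        quadrant = "inferior"
    else:
        points_right = angle < 45 or angle >= 315
        # Assumption: the nasal side is image-right for a right eye.
        nasal = points_right if eye == "right" else not points_right
        quadrant = "nasal" if nasal else "temporal"
    return f"{ring}_{quadrant}"
```

A lesion 1 mm to the image right of the fovea would fall in `inner_nasal` for a right eye and `inner_temporal` for a left eye under these assumptions. The quadrant boundaries sit on the 45-degree diagonals, which is why a lesion near a boundary can change region with a small shift in position.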
If this is right
- Models using lesion-level evidence achieve higher accuracy across open-ended, closed-ended, single-choice, and multiple-choice question formats.
- Transparency improves because answers can be directly evaluated against the specific lesion region used for reasoning.
- The benchmark enables consistent, standardized assessment of clinical interpretability in ophthalmic VQA tasks.
- Models trained with this explicit grounding can produce explanations that trace back to defined anatomical retinal areas.
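One way to make the accuracy-plus-evidence evaluation concrete is to score each prediction twice: once on the answer and once on the lesion regions it cites. The sketch below is an illustration under stated assumptions, not the benchmark's actual metric definitions, which are not given on this page; `score_prediction`, exact-match answer scoring, and set-level F1 over region labels are all hypothetical choices.

```python
def region_f1(predicted, gold):
    """Set-level F1 between predicted and gold region labels (assumed metric)."""
    predicted, gold = set(predicted), set(gold)
    if not predicted and not gold:
        return 1.0  # nothing to cite and nothing cited
    overlap = len(predicted & gold)
    if overlap == 0:
        return 0.0
    precision = overlap / len(predicted)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)


def score_prediction(pred_answer, gold_answer, pred_regions, gold_regions):
    """Return an answer score and a grounding score for one question."""
    # Assumption: answers are compared by case-insensitive exact match.
    answer_correct = pred_answer.strip().lower() == gold_answer.strip().lower()
    return {
        "answer_correct": answer_correct,
        "region_f1": region_f1(pred_regions, gold_regions),
    }
```

Keeping the two scores separate makes it possible to spot a model that answers correctly while pointing at the wrong region, which is the failure mode an accuracy-only benchmark cannot see.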
Where Pith is reading between the lines
- The same lesion-localization approach could transfer to other medical imaging VQA settings where spatial evidence matters, such as radiology.
- If the nine-region ETDRS division overlooks certain subtle or peripheral lesions, gains in performance may not extend to all real-world imaging variations.
- Adopting the benchmark during model training might produce systems whose reasoning steps align more closely with standard clinical review processes.
Load-bearing premise
Spatially localizing all lesions using the ETDRS grid ensures anatomical consistency and clinical validity for the generated questions and model evaluations.
What would settle it
A review by ophthalmologists finding that a substantial fraction of the generated questions lack clinical validity, or an experiment where ungrounded models achieve equal or higher accuracy than grounded models on a new set of real clinical cases.
Original abstract
Visual Question Answering (VQA) holds great promise for clinical support, particularly in ophthalmology, where retinal fundus photography is essential for diagnosis. However, ophthalmic VQA benchmarks primarily emphasize answer accuracy, neglecting the explicit visual evidence necessary for clinical interpretability. In this work, we introduce FundusGround, a new benchmark for clinically interpretable ophthalmic VQA with spatially-grounded lesion evidence. Specifically, we propose a three-stage pipeline that collects 10,719 fundus images with 15,595 image-level meticulously annotated lesions. To ensure anatomical consistency and clinical validity, all lesions are spatially localized using the Early Treatment Diabetic Retinopathy Study (ETDRS) grid, enabling standardized mapping to nine clinically meaningful retinal regions. Built upon this structured lesion evidence, 72,706 questions are then generated spanning four formats: open-ended, closed-ended, single-choice, and multiple-choice. We further benchmark multiple general- and medical- large vision-language models using dual metrics for answer accuracy and lesion-level reasoning. The experiments demonstrate that incorporating lesion-level visual evidence consistently improves model performance and transparency, highlighting the necessity of explicit spatial grounding for reliable and explainable ophthalmic VQA.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces FundusGround, a benchmark for clinically interpretable ophthalmic VQA. It presents a three-stage pipeline that collects 10,719 fundus images containing 15,595 image-level annotated lesions, all spatially localized to nine retinal regions via the ETDRS grid to ensure anatomical consistency. This structured evidence is used to generate 72,706 questions across open-ended, closed-ended, single-choice, and multiple-choice formats. The work then benchmarks general- and medical-domain large vision-language models with dual metrics of answer accuracy and lesion-level reasoning, claiming that explicit lesion-level visual evidence improves both performance and transparency and that spatial grounding is necessary for reliable ophthalmic VQA.
Significance. If the reported gains hold after proper validation, the dataset would constitute a valuable resource for the field by supplying the first large-scale ophthalmic VQA benchmark that explicitly ties questions to spatially localized lesion evidence. The scale (10k+ images, 72k questions) and the shift toward lesion-level interpretability address a documented gap in existing ophthalmic VQA work, potentially supporting more clinically trustworthy model development.
major comments (2)
- [Abstract / three-stage pipeline] Abstract, three-stage pipeline paragraph: the claim that ETDRS-grid localization 'ensures anatomical consistency and clinical validity' for all 15,595 lesions is load-bearing for the necessity-of-spatial-grounding argument. The ETDRS grid was designed for DR severity grading and uses coarse macular-centered sectors; if a non-negligible fraction of lesions involve fine-grained positions or non-DR pathologies, the generated questions may not actually require or test the nine-region mapping, weakening both the performance improvement and the transparency conclusions.
- [Abstract / Experiments] Abstract / Experiments section: the manuscript states that lesion-level evidence 'consistently improves model performance and transparency' yet supplies no annotation validation (accuracy, inter-rater agreement), no error bars on the reported metrics, and no breakdown of results by lesion type or question format. These omissions leave the central empirical claim unverified and constitute a soundness issue for a benchmarking paper.
minor comments (1)
- [Abstract] The abstract reports 72,706 questions derived from 10,719 images and 15,595 lesions; the main text should include an explicit accounting of how questions are sampled per image/lesion to support reproducibility.
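The headline counts already imply some average ratios, which can serve as a sanity check on any per-image or per-lesion accounting the authors add. The short sketch below only divides the three numbers quoted in the abstract; the function name `dataset_ratios` is hypothetical, and the averages say nothing about how questions are actually distributed or sampled.

```python
# Counts quoted in the abstract.
NUM_IMAGES = 10_719
NUM_LESIONS = 15_595
NUM_QUESTIONS = 72_706


def dataset_ratios(images, lesions, questions):
    """Average ratios implied by the headline counts (not the sampling scheme)."""
    return {
        "lesions_per_image": lesions / images,
        "questions_per_image": questions / images,
        "questions_per_lesion": questions / lesions,
    }


ratios = dataset_ratios(NUM_IMAGES, NUM_LESIONS, NUM_QUESTIONS)
for name, value in ratios.items():
    print(f"{name}: {value:.2f}")
```

This works out to roughly 1.45 lesions per image, 6.78 questions per image, and 4.66 questions per lesion on average.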
Simulated Author's Rebuttal
We thank the referee for their constructive comments on our manuscript introducing FundusGround. We address each of the major comments point by point below and describe the revisions we intend to make to strengthen the paper.
Point-by-point responses
- Referee: [Abstract / three-stage pipeline] Abstract, three-stage pipeline paragraph: the claim that ETDRS-grid localization 'ensures anatomical consistency and clinical validity' for all 15,595 lesions is load-bearing for the necessity-of-spatial-grounding argument. The ETDRS grid was designed for DR severity grading and uses coarse macular-centered sectors; if a non-negligible fraction of lesions involve fine-grained positions or non-DR pathologies, the generated questions may not actually require or test the nine-region mapping, weakening both the performance improvement and the transparency conclusions.
  Authors: We appreciate the referee pointing out the potential overstatement in our claim regarding the ETDRS grid. The ETDRS grid provides a standardized framework for dividing the retina into nine clinically relevant regions, which we used to ensure consistent anatomical localization across the dataset. This choice supports the clinical interpretability of the VQA questions by mapping lesions to meaningful retinal areas. However, we acknowledge that the grid is coarse for certain fine-grained lesions or non-DR pathologies and may not capture all nuances. In the revised version, we will change the wording in the abstract and pipeline description to 'facilitates anatomical consistency and clinical relevance' and include a new subsection discussing the applicability and limitations of the ETDRS grid for diverse ophthalmic lesions, along with statistics on the proportion of DR versus other pathologies in the dataset.
  revision: partial
- Referee: [Abstract / Experiments] Abstract / Experiments section: the manuscript states that lesion-level evidence 'consistently improves model performance and transparency' yet supplies no annotation validation (accuracy, inter-rater agreement), no error bars on the reported metrics, and no breakdown of results by lesion type or question format. These omissions leave the central empirical claim unverified and constitute a soundness issue for a benchmarking paper.
  Authors: We agree that these elements are crucial for validating the empirical claims in a benchmarking study. The current manuscript includes details on the three-stage pipeline and model evaluations, but we recognize the gaps in reporting. For the revision, we will incorporate: inter-rater agreement metrics for the lesion annotations (e.g., Cohen's kappa or percentage agreement from the annotation process), error bars (standard deviation across multiple evaluation runs or seeds) for all accuracy and reasoning metrics, and comprehensive breakdowns of results stratified by lesion type (DR-related vs. others) and question format (open-ended, closed-ended, etc.). These additions will provide stronger evidence for the improvements from lesion-level visual evidence and address the soundness concerns.
  revision: yes
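The two statistics promised in the rebuttal are simple to compute. The sketch below is a minimal illustration, assuming two annotators who each assign one region label per lesion and a handful of per-seed accuracy scores; the helper names `cohens_kappa` and `mean_and_std` are hypothetical and none of the inputs come from the paper.

```python
from collections import Counter
from statistics import mean, stdev


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each rater's marginal label frequencies.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


def mean_and_std(scores):
    """Mean and sample standard deviation across runs or seeds."""
    return mean(scores), stdev(scores)
```

For example, with hypothetical labels `["central", "central", "inner_nasal", "inner_nasal"]` from one rater and `["central", "inner_nasal", "inner_nasal", "inner_nasal"]` from another, observed agreement is 0.75, chance agreement is 0.5, and kappa is 0.5. Kappa is preferable to raw percentage agreement here because a few regions may dominate the label distribution, which inflates agreement by chance.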
Circularity Check
Dataset construction and empirical benchmarking with no derivational circularity
This is a constructive dataset paper introducing FundusGround via a three-stage pipeline of image collection, ETDRS-grid lesion localization, and VQA question generation, followed by benchmarking of vision-language models on accuracy and reasoning metrics. No equations, fitted parameters, or predictions appear that reduce claims to inputs by construction. The central claim that lesion-level spatial grounding improves performance and transparency rests on direct empirical comparisons within the new benchmark rather than self-definitional loops, self-citation chains, or renamed known results. The work is self-contained against external benchmarks and does not invoke uniqueness theorems or ansatzes from prior author work.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Lesions can be accurately annotated on fundus images and mapped to ETDRS grid regions to ensure anatomical consistency and clinical validity.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/AlexanderDuality.lean, theorem alexander_duality_circle_linking (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "all lesions are spatially localized using the Early Treatment Diabetic Retinopathy Study (ETDRS) grid, enabling standardized mapping to nine clinically meaningful retinal regions"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.