Towards Clinically Interpretable Ophthalmic VQA via Spatially-Grounded Lesion Evidence

Bo Liu; Chengcheng Zhu; Huazhu Fu; Jiang Liu; Meng Wang; Xingyue Wang; Zhixuan Zhang

arxiv: 2605.22414 · v1 · pith:CKRGY6RZnew · submitted 2026-05-21 · 💻 cs.CV · cs.AI

Pith reviewed 2026-05-22 07:25 UTC · model grok-4.3

keywords: ophthalmic VQA · lesion localization · ETDRS grid · fundus images · clinical interpretability · visual grounding · retinal regions

The pith

Spatially grounding lesions in retinal images makes ophthalmic VQA models more accurate and clinically transparent.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents FundusGround, a benchmark that links visual questions about eye fundus photos to specific lesion locations mapped onto a standard nine-region grid. By annotating 15,595 lesions across 10,719 images and generating 72,706 questions, the work tests whether models benefit from explicit spatial evidence rather than image-wide features alone. Experiments with large vision-language models show consistent gains in both answer correctness and the ability to reference the right lesion when spatial grounding is provided. This setup addresses the gap in current ophthalmic VQA systems that prioritize accuracy without requiring visible reasoning steps.

Core claim

The authors establish that incorporating lesion-level visual evidence, achieved through ETDRS grid localization of 15,595 lesions in 10,719 images to create 72,706 questions, consistently improves model performance on answer accuracy and lesion-level reasoning metrics, demonstrating the necessity of explicit spatial grounding for reliable and explainable ophthalmic visual question answering.

What carries the argument

The three-stage pipeline that annotates lesions with the ETDRS grid for standardized mapping to nine retinal regions and generates questions with dual evaluation on accuracy and reasoning.
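To make the nine-region mapping concrete, here is a minimal Python sketch of how a lesion centroid could be assigned to an ETDRS region. It assumes the conventional grid geometry: concentric circles of 1, 3, and 6 mm diameter centered on the fovea, with the inner and outer rings split into superior, inferior, nasal, and temporal quadrants along the diagonals. The function name `etdrs_region`, the millimetre coordinate convention, and the laterality handling are illustrative assumptions, not the paper's implementation.

```python
import math

# Assumed ETDRS radii in millimetres (circle diameters of 1, 3 and 6 mm).
CENTRAL_R, INNER_R, OUTER_R = 0.5, 1.5, 3.0


def etdrs_region(dx_mm, dy_mm, eye="OD"):
    """Map a lesion offset from the fovea to one of nine ETDRS regions.

    dx_mm: horizontal offset, positive toward the right of the image.
    dy_mm: vertical offset, positive toward the top of the image.
    eye:   "OD" (right eye) or "OS" (left eye); assumes the usual fundus
           photo orientation where the optic disc (nasal side) appears on
           the right for OD and on the left for OS.
    Returns None for points outside the 6 mm grid.
    """
    if eye not in ("OD", "OS"):
        raise ValueError("eye must be 'OD' or 'OS'")
    r = math.hypot(dx_mm, dy_mm)
    if r <= CENTRAL_R:
        return "central"
    if r > OUTER_R:
        return None  # outside the grid; the benchmark may handle this differently
    ring = "inner" if r <= INNER_R else "outer"

    # Quadrants are centred on the cardinal directions, so the boundaries
    # sit on the 45 degree diagonals.
    angle = math.degrees(math.atan2(dy_mm, dx_mm)) % 360
    if 45 <= angle < 135:
        quadrant = "superior"
    elif 225 <= angle < 315:
        quadrant = "inferior"
    else:
        points_right = angle < 45 or angle >= 315
        nasal_is_right = eye == "OD"
        quadrant = "nasal" if points_right == nasal_is_right else "temporal"
    return f"{ring}_{quadrant}"
```

The nine possible labels are `central` plus four inner and four outer quadrants. In practice the offsets would first have to be converted from pixels to millimetres using the fovea position and an image scale, which this sketch leaves out.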

If this is right

Models using lesion-level evidence achieve higher accuracy across open-ended, closed-ended, single-choice, and multiple-choice question formats.
Transparency improves because answers can be directly evaluated against the specific lesion region used for reasoning.
The benchmark enables consistent, standardized assessment of clinical interpretability in ophthalmic VQA tasks.
Models trained with this explicit grounding can produce explanations that trace back to defined anatomical retinal areas.
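The dual evaluation behind these points can be pictured with a small Python sketch. It scores each prediction twice: once on the answer (exact match after simple normalisation) and once on the cited evidence (set overlap between predicted and reference lesion-type and region pairs). The paper's actual metric definitions are not given in this summary, so the F1-style evidence score and the names `answer_accuracy` and `evidence_f1` are assumptions for illustration only.

```python
def answer_accuracy(predictions, references):
    """Fraction of answers that match the reference after simple normalisation."""
    if not references:
        return 0.0
    hits = sum(p.strip().lower() == r.strip().lower()
               for p, r in zip(predictions, references))
    return hits / len(references)


def evidence_f1(predicted_pairs, reference_pairs):
    """F1 overlap between predicted and reference (lesion_type, region) pairs."""
    predicted, reference = set(predicted_pairs), set(reference_pairs)
    if not predicted and not reference:
        return 1.0
    overlap = len(predicted & reference)
    if overlap == 0:
        return 0.0
    precision = overlap / len(predicted)
    recall = overlap / len(reference)
    return 2 * precision * recall / (precision + recall)


# Hypothetical example: the answer is right, but only one of two cited lesions is.
acc = answer_accuracy(["Moderate NPDR"], ["moderate npdr"])
f1 = evidence_f1(
    [("hemorrhage", "inner_superior"), ("hard exudate", "outer_nasal")],
    [("hemorrhage", "inner_superior"), ("hard exudate", "inner_temporal")],
)
print(acc, f1)  # 1.0 0.5
```

A correct answer backed by partly wrong evidence shows up as full accuracy with a reduced evidence score, which is the distinction a lesion-level metric is meant to expose.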

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same lesion-localization approach could transfer to other medical imaging VQA settings where spatial evidence matters, such as radiology.
If the nine-region ETDRS division overlooks certain subtle or peripheral lesions, gains in performance may not extend to all real-world imaging variations.
Adopting the benchmark during model training might produce systems whose reasoning steps align more closely with standard clinical review processes.

Load-bearing premise

Spatially localizing all lesions using the ETDRS grid ensures anatomical consistency and clinical validity for the generated questions and model evaluations.

What would settle it

A review by ophthalmologists finding that a substantial fraction of the generated questions lack clinical validity, or an experiment where ungrounded models achieve equal or higher accuracy than grounded models on a new set of real clinical cases.
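One way to frame the second test is as a paired comparison on the same cases. The Python sketch below estimates a percentile bootstrap confidence interval for the accuracy difference between a grounded and an ungrounded model; an interval that includes zero or lies below it would count against the claim. The per-case correctness lists are made-up placeholders, and the function name and defaults are assumptions rather than anything taken from the paper.

```python
import random


def paired_bootstrap_ci(grounded, ungrounded, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for mean(grounded) - mean(ungrounded).

    Both inputs are per-case correctness flags (1 or 0) on the same cases,
    so each resample draws case indices once and applies them to both models.
    """
    if len(grounded) != len(ungrounded) or not grounded:
        raise ValueError("inputs must be non-empty and the same length")
    rng = random.Random(seed)
    n = len(grounded)
    diffs = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        diffs.append(sum(grounded[i] - ungrounded[i] for i in idx) / n)
    diffs.sort()
    lo = diffs[int((alpha / 2) * n_resamples)]
    hi = diffs[int((1 - alpha / 2) * n_resamples) - 1]
    observed = (sum(grounded) - sum(ungrounded)) / n
    return observed, lo, hi


# Placeholder data: 1 = correct, 0 = incorrect, aligned by case.
grounded_correct = [1, 1, 1, 0, 1, 1, 1, 1, 0, 1]
ungrounded_correct = [1, 0, 1, 0, 0, 1, 1, 0, 0, 1]
observed, lo, hi = paired_bootstrap_ci(grounded_correct, ungrounded_correct)
```

Resampling case indices jointly keeps the pairing intact, so cases that are hard for both models do not inflate the apparent difference.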

Figures

Figures reproduced from arXiv: 2605.22414 by Bo Liu, Chengcheng Zhu, Huazhu Fu, Jiang Liu, Meng Wang, Xingyue Wang, Zhixuan Zhang.

**Figure 1.** Three-stage pipeline for constructing FundusGround. Stage 1 defines clinically meaningful retinal regions using the ETDRS grid; Stage 2 outputs fine-grained visual annotations (in both image- and lesion- levels); Stage 3 constructs and filters multiple ophthalmic VQA types grounded in the structured lesion-aware visual evidence.

**Figure 2.** Dataset statistics of FundusGround, including disease category distribution (a), lesion type distribution (b), and region-wise lesion distribution (c).

**Figure 3.** Qualitative comparison on a DR grading example. Green denotes correct predictions, while red denotes incorrect parts.

Original abstract

Visual Question Answering (VQA) holds great promise for clinical support, particularly in ophthalmology, where retinal fundus photography is essential for diagnosis. However, ophthalmic VQA benchmarks primarily emphasize answer accuracy, neglecting the explicit visual evidence necessary for clinical interpretability. In this work, we introduce FundusGround, a new benchmark for clinically interpretable ophthalmic VQA with spatially-grounded lesion evidence. Specifically, we propose a three-stage pipeline that collects 10,719 fundus images with 15,595 image-level meticulously annotated lesions. To ensure anatomical consistency and clinical validity, all lesions are spatially localized using the Early Treatment Diabetic Retinopathy Study (ETDRS) grid, enabling standardized mapping to nine clinically meaningful retinal regions. Built upon this structured lesion evidence, 72,706 questions are then generated spanning four formats: open-ended, closed-ended, single-choice, and multiple-choice. We further benchmark multiple general- and medical- large vision-language models using dual metrics for answer accuracy and lesion-level reasoning. The experiments demonstrate that incorporating lesion-level visual evidence consistently improves model performance and transparency, highlighting the necessity of explicit spatial grounding for reliable and explainable ophthalmic VQA.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

FundusGround gives a new dataset with ETDRS-mapped lesions and 72k questions, but the validation details stay thin.

The main thing to know is that this paper ships FundusGround, a benchmark that annotates 10,719 fundus images with 15,595 lesions placed on the nine-region ETDRS grid and then generates 72,706 questions in four formats. The claim is that forcing models to use these explicit lesion locations improves both answer accuracy and transparency in ophthalmic VQA. They benchmark several vision-language models and report consistent gains when the lesion evidence is supplied.

That pipeline for collecting and structuring the data is the concrete new piece; prior ophthalmic VQA sets did not tie answers to standardized spatial regions at this scale. The dual evaluation of accuracy plus lesion-level reasoning is a sensible way to check whether the model is actually looking at the right spots.

The soft spots sit in the missing checks. The abstract gives no inter-rater agreement numbers or expert validation for the lesion placements, so it is difficult to judge how reliable the grounding actually is. The ETDRS grid itself is coarse by design for diabetic retinopathy grading; if many questions involve fine lesion positions or non-DR conditions, the spatial signal may be weaker than presented. The reported improvements also lack error bars or full baseline tables in the summary, which makes the size of the gain hard to gauge.

This work is aimed at researchers who build or evaluate medical VQA systems and want datasets that push toward clinical interpretability. Anyone looking for grounded retinal data to test explainability methods would get direct value from the resource. It deserves a serious referee because the dataset and the grounding idea are substantial enough to warrant detailed feedback on annotation quality and grid resolution, even if the current evidence is still preliminary.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces FundusGround, a benchmark for clinically interpretable ophthalmic VQA. It presents a three-stage pipeline that collects 10,719 fundus images containing 15,595 image-level annotated lesions, all spatially localized to nine retinal regions via the ETDRS grid to ensure anatomical consistency. This structured evidence is used to generate 72,706 questions across open-ended, closed-ended, single-choice, and multiple-choice formats. The work then benchmarks general- and medical-domain large vision-language models with dual metrics of answer accuracy and lesion-level reasoning, claiming that explicit lesion-level visual evidence improves both performance and transparency and that spatial grounding is necessary for reliable ophthalmic VQA.

Significance. If the reported gains hold after proper validation, the dataset would constitute a valuable resource for the field by supplying the first large-scale ophthalmic VQA benchmark that explicitly ties questions to spatially localized lesion evidence. The scale (10k+ images, 72k questions) and the shift toward lesion-level interpretability address a documented gap in existing ophthalmic VQA work, potentially supporting more clinically trustworthy model development.

major comments (2)

[Abstract / three-stage pipeline] Abstract, three-stage pipeline paragraph: the claim that ETDRS-grid localization 'ensures anatomical consistency and clinical validity' for all 15,595 lesions is load-bearing for the necessity-of-spatial-grounding argument. The ETDRS grid was designed for DR severity grading and uses coarse macular-centered sectors; if a non-negligible fraction of lesions involve fine-grained positions or non-DR pathologies, the generated questions may not actually require or test the nine-region mapping, weakening both the performance improvement and the transparency conclusions.
[Abstract / Experiments] Abstract / Experiments section: the manuscript states that lesion-level evidence 'consistently improves model performance and transparency' yet supplies no annotation validation (accuracy, inter-rater agreement), no error bars on the reported metrics, and no breakdown of results by lesion type or question format. These omissions leave the central empirical claim unverified and constitute a soundness issue for a benchmarking paper.

minor comments (1)

[Abstract] The abstract reports 72,706 questions derived from 10,719 images and 15,595 lesions; the main text should include an explicit accounting of how questions are sampled per image/lesion to support reproducibility.
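As a starting point for that accounting, the headline counts already imply some average rates. The short Python calculation below derives them from the three totals in the abstract; it says nothing about the actual per-image or per-lesion distribution, which only the authors can report.

```python
# Totals reported in the abstract.
n_images = 10_719
n_lesions = 15_595
n_questions = 72_706

lesions_per_image = n_lesions / n_images
questions_per_image = n_questions / n_images
questions_per_lesion = n_questions / n_lesions

print(f"lesions per image:    {lesions_per_image:.2f}")     # 1.45
print(f"questions per image:  {questions_per_image:.2f}")   # 6.78
print(f"questions per lesion: {questions_per_lesion:.2f}")  # 4.66
```

Whether questions are tied to individual lesions, to whole images, or to both is not stated in the summary, so these figures are rough averages only.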

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments on our manuscript introducing FundusGround. We address each of the major comments point by point below and describe the revisions we intend to make to strengthen the paper.

Referee: [Abstract / three-stage pipeline] Abstract, three-stage pipeline paragraph: the claim that ETDRS-grid localization 'ensures anatomical consistency and clinical validity' for all 15,595 lesions is load-bearing for the necessity-of-spatial-grounding argument. The ETDRS grid was designed for DR severity grading and uses coarse macular-centered sectors; if a non-negligible fraction of lesions involve fine-grained positions or non-DR pathologies, the generated questions may not actually require or test the nine-region mapping, weakening both the performance improvement and the transparency conclusions.

Authors: We appreciate the referee pointing out the potential overstatement in our claim regarding the ETDRS grid. The ETDRS grid provides a standardized framework for dividing the retina into nine clinically relevant regions, which we used to ensure consistent anatomical localization across the dataset. This choice supports the clinical interpretability of the VQA questions by mapping lesions to meaningful retinal areas. However, we acknowledge that the grid is coarser for certain fine-grained lesions or non-DR pathologies and may not capture all nuances. In the revised version, we will revise the wording in the abstract and pipeline description to 'facilitates anatomical consistency and clinical relevance' and include a new subsection discussing the applicability and limitations of the ETDRS grid for diverse ophthalmic lesions, along with statistics on the proportion of DR versus other pathologies in the dataset.

revision: partial

Referee: [Abstract / Experiments] Abstract / Experiments section: the manuscript states that lesion-level evidence 'consistently improves model performance and transparency' yet supplies no annotation validation (accuracy, inter-rater agreement), no error bars on the reported metrics, and no breakdown of results by lesion type or question format. These omissions leave the central empirical claim unverified and constitute a soundness issue for a benchmarking paper.

Authors: We agree that these elements are crucial for validating the empirical claims in a benchmarking study. The current manuscript includes details on the three-stage pipeline and model evaluations, but we recognize the gaps in reporting. For the revision, we will incorporate: inter-rater agreement metrics for the lesion annotations (e.g., Cohen's kappa or percentage agreement from the annotation process), error bars (standard deviation across multiple evaluation runs or seeds) for all accuracy and reasoning metrics, and comprehensive breakdowns of results stratified by lesion type (DR-related vs. others) and question format (open-ended, closed-ended, etc.). These additions will provide stronger evidence for the improvements from lesion-level visual evidence and address the soundness concerns.

revision: yes
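For readers unfamiliar with the agreement statistic mentioned in the authors' response, here is a minimal Python sketch of Cohen's kappa for two annotators assigning ETDRS region labels to the same lesions. The labels and annotator data are invented for illustration; the manuscript summary does not report any such figures.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("inputs must be non-empty and the same length")
    n = len(labels_a)
    # Observed agreement: share of items given the same label by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's label frequencies.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Invented region labels from two hypothetical annotators.
annotator_1 = ["central", "central", "inner_superior",
               "inner_superior", "outer_temporal", "outer_temporal"]
annotator_2 = ["central", "central", "inner_superior",
               "outer_temporal", "outer_temporal", "outer_temporal"]
kappa = cohen_kappa(annotator_1, annotator_2)
```

In this toy case the annotators agree on five of six lesions, chance agreement is one third, and kappa works out to 0.75.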

Circularity Check

0 steps flagged

Dataset construction and empirical benchmarking with no derivational circularity

This is a constructive dataset paper introducing FundusGround via a three-stage pipeline of image collection, ETDRS-grid lesion localization, and VQA question generation, followed by benchmarking of vision-language models on accuracy and reasoning metrics. No equations, fitted parameters, or predictions appear that reduce claims to inputs by construction. The central claim that lesion-level spatial grounding improves performance and transparency rests on direct empirical comparisons within the new benchmark rather than self-definitional loops, self-citation chains, or renamed known results. The work is self-contained against external benchmarks and does not invoke uniqueness theorems or ansatzes from prior author work.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that ETDRS grid mapping yields clinically valid regions and that lesion annotations are accurate enough to support question generation and model evaluation; no free parameters or invented entities are introduced.

axioms (1)

domain assumption: Lesions can be accurately annotated on fundus images and mapped to ETDRS grid regions to ensure anatomical consistency and clinical validity.
Invoked in the abstract to justify the spatial localization step for standardized mapping to nine retinal regions.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AlexanderDuality.lean · alexander_duality_circle_linking · relation between the paper passage and the cited theorem: unclear

Paper passage: "all lesions are spatially localized using the Early Treatment Diabetic Retinopathy Study (ETDRS) grid, enabling standardized mapping to nine clinically meaningful retinal regions"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
