pith · machine review for the scientific record

arxiv: 2604.08322 · v1 · submitted 2026-04-09 · 💻 cs.CV

Recognition: 2 Lean theorem links

Fundus-R1: Training a Fundus-Reading MLLM with Knowledge-Aware Reasoning on Public Data

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 18:12 UTC · model grok-4.3

classification 💻 cs.CV
keywords: fundus imaging · multimodal large language models · reasoning traces · retrieval-augmented generation · reinforcement learning · retinal disease detection · public datasets · vision-language models

The pith

A specialized fundus-reading multimodal model can be trained to outperform generic versions using only public datasets and image-level labels.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that a reasoning-enhanced MLLM for fundus images can be developed without private clinical reports by auto-generating knowledge-aware reasoning traces from public data and training with an enhanced reinforcement learning process. Most of the training data carries only image-level labels, so the method first retrieves ophthalmic knowledge to link visual findings to those labels, then adds a process reward that favors self-consistent reasoning traces during rollouts. A sympathetic reader would care because prior work depended on inaccessible in-house samples, restricting development of tools for early retinal disease detection to a few groups. The resulting model, Fundus-R1, shows clear gains over its generic base and over versions trained without the traces across three benchmarks. This removes a practical barrier to reproducibility and broader participation in fundus AI research.

Core claim

We train Fundus-R1 exclusively on public datasets in which over 94 percent of samples have only image-level labels. A RAG-based procedure first composes image-specific reasoning traces that connect visual findings identified by a generic MLLM to the available labels through ophthalmic knowledge. We then apply RLVR augmented by a process reward that encourages self-consistency of each generated trace. On FunBench, Omni-Fundus and GMAI-Fundus the resulting model outperforms both its generic counterpart and a stronger post-trained edition that omits the generated traces.

What carries the argument

The RAG-based method that composes image-specific, knowledge-aware reasoning traces linking visual findings to image labels.
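
To make that mechanism concrete, the sketch below shows one way such a trace-composition step could look: a generic MLLM proposes visual findings, a retriever pulls ophthalmic knowledge for the image-level label, and a language model composes a trace that argues from findings to the label. The component objects (mllm, retriever, llm), the prompt wording, and the dataclass are illustrative assumptions, not the paper's actual pipeline.

```python
# Illustrative sketch of knowledge-aware trace composition for an image that
# carries only an image-level label. Component objects are placeholders.
from dataclasses import dataclass

@dataclass
class TraceExample:
    image_path: str
    label: str            # image-level label, e.g. "moderate diabetic retinopathy"
    findings: list[str]   # visual findings proposed by a generic MLLM
    knowledge: list[str]  # retrieved ophthalmic knowledge snippets
    trace: str            # composed reasoning trace used for post-training

def compose_trace(image_path: str, label: str, mllm, retriever, llm) -> TraceExample:
    # 1. Ask a generic MLLM to describe the visible retinal findings.
    findings = mllm.describe(image_path, prompt="List the visible retinal findings.")
    # 2. Retrieve ophthalmic knowledge that connects such findings to the label.
    knowledge = retriever.search(f"{label}: typical fundus findings", k=5)
    # 3. Compose a trace that reasons from findings, via the knowledge,
    #    to the known label, which serves as the verifiable answer.
    trace = llm.generate(
        f"Findings: {findings}\nKnowledge: {knowledge}\n"
        f"Write step-by-step clinical reasoning that concludes with: {label}"
    )
    return TraceExample(image_path, label, findings, knowledge, trace)
```

The property the rest of the argument leans on is step 3: every trace terminates in a verifiable label, so the RLVR stage can later reward rollouts whose reasoning stays consistent with the findings extracted in step 1.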

If this is right

  • Fundus-R1 outperforms its generic counterpart and a post-trained ablation on three public fundus-reading benchmarks.
  • Training succeeds when more than 94 percent of the data carries only image-level labels rather than full clinical reports.
  • The process reward for reasoning-trace self-consistency measurably improves the RLVR stage.
  • The same pipeline removes the previous dependence on private in-house samples for fundus MLLM development.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The RAG-plus-process-reward recipe could be tested on other medical imaging domains such as chest X-rays or pathology slides where expert reports are scarce.
  • If the generated traces prove robust, the method offers a route to more interpretable and reproducible medical vision-language models without proprietary data.
  • Broader adoption would allow research groups lacking private datasets to contribute directly to tools for retinal anomaly detection.

Load-bearing premise

The automatically generated reasoning traces are accurate and useful enough that training on them with the added process reward produces measurable gains over the generic base model.

What would settle it

If a model trained without the RAG-generated traces or without the process reward shows no improvement or degrades relative to the generic base on FunBench, Omni-Fundus and GMAI-Fundus, the central claim would be falsified.
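
Operationally, that test is an ablation grid: the same evaluation run over the generic base, a no-trace variant, a no-process-reward variant, and the full model. A minimal harness under those assumptions follows; model handles, benchmark loaders, and the exact-match metric are placeholders, not the paper's evaluation code.

```python
# Hedged sketch of the ablation comparison that would settle the claim.
# Models are represented as callables; loaders, names, and metric are placeholders.
from typing import Callable, Iterable, Tuple

Example = Tuple[str, str, str]        # (image_path, question, gold_answer)
Model = Callable[[str, str], str]     # model(image_path, question) -> prediction

def accuracy(model: Model, examples: Iterable[Example]) -> float:
    """Exact-match accuracy of a model over one benchmark split."""
    pairs = [(model(img, q), gold) for img, q, gold in examples]
    return sum(p.strip().lower() == g.strip().lower() for p, g in pairs) / max(len(pairs), 1)

def run_ablation(variants: dict[str, Model],
                 benchmarks: dict[str, list[Example]]) -> dict[str, dict[str, float]]:
    # Expected variants: generic base, post-trained without traces,
    # trained without the process reward, and the full Fundus-R1 recipe.
    # The central claim fails if the full recipe does not beat the others
    # on FunBench, Omni-Fundus and GMAI-Fundus.
    return {name: {bench: accuracy(model, data) for bench, data in benchmarks.items()}
            for name, model in variants.items()}
```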

Figures

Figures reproduced from arXiv: 2604.08322 by Bangxiang Lan, Jianfeng Dong, Jiazhen Liu, Jingyu Liu, Kaiheng Qian, Qijie Wei, Xirong Li, Yuchuan Deng, Zijie Xin.

Figure 1. Showcases of multimodal large language model (MLLM) based fundus image reading. Our proposed …
Figure 2. Proposed method for generating visual-finding-embedded reasoning traces. Given a task label …
Figure 3. Our process reward r_pro. Best viewed digitally. Our design of r_pro,i is guided by the following considerations. Ideally, we want M to produce a correct answer based on correct visual findings. Therefore, when ŷ_i matches y, r_pro,i should verify that τ aligns with VF[I], i.e. the visual findings previously extracted from the training image I. However, such a criterion might be over-strict, especially in t…
Figure 4. The use of the process reward r_pro makes the model achieve larger answer rewards with shorter output length.
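
Figure 3's caption pins down the shape of the process reward: when a rollout's answer ŷ_i matches the gold label y, r_pro,i should check that the trace τ aligns with the visual findings VF[I] previously extracted from image I. A minimal reading of that design as code follows; the alignment test here is a simple coverage check over finding strings, and the weighting of the two reward terms is an assumption for illustration, not a value reported in the paper.

```python
# Hedged sketch of a self-consistency process reward in the spirit of Figure 3.
# The coverage-based alignment test and the weight w_pro are assumptions.
def process_reward(trace: str, answer: str, gold: str, findings: list[str],
                   min_coverage: float = 0.5) -> float:
    if answer.strip().lower() != gold.strip().lower():
        return 0.0                         # only correct answers earn process credit
    trace_lc = trace.lower()
    covered = sum(f.lower() in trace_lc for f in findings)
    coverage = covered / max(len(findings), 1)
    return 1.0 if coverage >= min_coverage else 0.0

def rollout_reward(trace: str, answer: str, gold: str, findings: list[str],
                   w_pro: float = 0.5) -> float:
    # Verifiable answer reward (RLVR) plus the process term for the same rollout.
    r_ans = float(answer.strip().lower() == gold.strip().lower())
    return r_ans + w_pro * process_reward(trace, answer, gold, findings)
```

The min_coverage threshold is one way to soften the "over-strict" criterion the caption warns about; the paper's actual relaxation is not visible in the extracted text.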
read the original abstract

Fundus imaging such as CFP, OCT and UWF is crucial for the early detection of retinal anomalies and diseases. Fundus image understanding, due to its knowledge-intensive nature, poses a challenging vision-language task. An emerging approach to addressing the task is to post-train a generic multimodal large language model (MLLM), either by supervised finetuning (SFT) or by reinforcement learning with verifiable rewards (RLVR), on a considerable amount of in-house samples paired with high-quality clinical reports. However, these valuable samples are not publicly accessible, which not only hinders reproducibility but also practically limits research to few players. To overcome the barrier, we make a novel attempt to train a reasoning-enhanced fundus-reading MLLM, which we term Fundus-R1, using exclusively public datasets, wherein over 94% of the data are annotated with only image-level labels. Our technical contributions are two-fold. First, we propose a RAG-based method for composing image-specific, knowledge-aware reasoning traces. Such auto-generated traces link visual findings identified by a generic MLLM to the image labels in terms of ophthalmic knowledge. Second, we enhance RLVR with a process reward that encourages self-consistency of the generated reasoning trace in each rollout. Extensive experiments on three fundus-reading benchmarks, i.e., FunBench, Omni-Fundus and GMAI-Fundus, show that Fundus-R1 clearly outperforms multiple baselines, including its generic counterpart (Qwen2.5-VL) and a stronger edition post-trained without using the generated traces. This work paves the way for training powerful fundus-reading MLLMs with publicly available data.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces Fundus-R1, a reasoning-enhanced MLLM for fundus image understanding (CFP, OCT, UWF) trained exclusively on public datasets (over 94% with only image-level labels). It proposes a RAG-based method to auto-generate image-specific, knowledge-aware reasoning traces linking generic MLLM visual findings to labels via ophthalmic knowledge, plus an enhanced RLVR procedure with an added process reward for self-consistency of reasoning traces per rollout. Experiments on FunBench, Omni-Fundus and GMAI-Fundus are said to show clear outperformance versus Qwen2.5-VL and a stronger post-trained baseline that omits the generated traces.

Significance. If the empirical claims hold after verification, the work would be significant for medical vision-language modeling: it shows a reproducible path to specialized fundus MLLMs without proprietary clinical reports, using only public data plus automated knowledge injection. The combination of RAG trace generation and process-reward RLVR is a concrete technical step toward more interpretable, knowledge-intensive medical MLLMs and could broaden participation beyond labs with private data access.

major comments (2)
  1. [Abstract] Abstract: the central claim that 'Fundus-R1 clearly outperforms multiple baselines, including its generic counterpart (Qwen2.5-VL) and a stronger edition post-trained without using the generated traces' is unsupported by any quantitative numbers, error bars, dataset sizes, or baseline-training details in the visible text; without these the outperformance cannot be assessed and the contribution of the knowledge-aware traces cannot be isolated.
  2. [Technical contributions / Method] The RAG-based trace generation method (technical contributions paragraph): no expert validation, factual-accuracy metric, or human evaluation of the auto-generated reasoning traces is reported. Given that >94% of training data carries only image-level labels, any systematic misalignment between the traces and true clinical reasoning would make the reported gains on the three benchmarks attributable to RLVR scaffolding or data scale rather than the claimed knowledge-aware content.
minor comments (1)
  1. [Abstract] The acronym RLVR is used without prior expansion; while context implies reinforcement learning with verifiable rewards, an explicit definition on first use would improve readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive review. We address each major comment below and outline the revisions we will make to strengthen the manuscript.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the central claim that 'Fundus-R1 clearly outperforms multiple baselines, including its generic counterpart (Qwen2.5-VL) and a stronger edition post-trained without using the generated traces' is unsupported by any quantitative numbers, error bars, dataset sizes, or baseline-training details in the visible text; without these the outperformance cannot be assessed and the contribution of the knowledge-aware traces cannot be isolated.

    Authors: We agree that the abstract would be clearer with quantitative support. In the revised version we will insert concise performance highlights (e.g., accuracy or F1 gains on FunBench, Omni-Fundus and GMAI-Fundus) together with references to the exact dataset sizes, training configurations and the ablation baseline that omits the generated traces. These numbers already appear in the experimental tables and will now be summarized in the abstract so that the claimed outperformance and the incremental value of the knowledge-aware traces can be assessed directly from the abstract. revision: yes

  2. Referee: [Technical contributions / Method] The RAG-based trace generation method (technical contributions paragraph): no expert validation, factual-accuracy metric, or human evaluation of the auto-generated reasoning traces is reported. Given that >94% of training data carries only image-level labels, any systematic misalignment between the traces and true clinical reasoning would make the reported gains on the three benchmarks attributable to RLVR scaffolding or data scale rather than the claimed knowledge-aware content.

    Authors: We accept the referee’s point that the absence of direct validation leaves open the possibility that gains arise from RLVR scaffolding rather than the content of the traces. We will revise the method section to include (1) representative examples of the RAG-generated traces with the ophthalmic knowledge sources used, (2) a factual-accuracy check performed on a held-out subset of traces by comparing them against publicly available clinical guidelines, and (3) an expanded discussion of the ablation that isolates the traces (the “stronger edition post-trained without using the generated traces”). While a large-scale expert review was not feasible within the original scope, the added analysis will allow readers to judge the alignment between the auto-generated reasoning and clinical knowledge. revision: partial
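
The promised factual-accuracy check can be read as a small audit loop: split each held-out trace into claims and ask a judge whether each claim is supported by publicly available guideline text. A hedged sketch under those assumptions (the guideline corpus and the supports judge are placeholders; the rebuttal does not specify either):

```python
# Illustrative audit of auto-generated traces against public clinical guidelines.
# The guideline passages and the `supports` judge are placeholders.
from typing import Callable

def trace_claims(trace: str) -> list[str]:
    """Naively split a reasoning trace into sentence-level claims."""
    return [s.strip() for s in trace.replace("\n", " ").split(".") if s.strip()]

def factual_accuracy(traces: list[str],
                     guideline_passages: list[str],
                     supports: Callable[[str, list[str]], bool]) -> float:
    """Fraction of trace claims judged consistent with the guideline corpus."""
    judged = supported = 0
    for trace in traces:
        for claim in trace_claims(trace):
            judged += 1
            supported += int(supports(claim, guideline_passages))
    return supported / max(judged, 1)
```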

Circularity Check

0 steps flagged

No circularity in derivation chain

full rationale

The paper describes an empirical method using RAG to generate reasoning traces from public data and a generic MLLM, followed by RLVR training with a self-consistency process reward. Performance is evaluated via experiments on external benchmarks against baselines including a no-trace variant. No equations, fitted parameters, or self-referential definitions are present that would make any claimed result equivalent to its inputs by construction. The central gains are presented as experimental outcomes, not derived tautologically from author-defined quantities or self-citations.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central claim rests on the unverified premise that auto-generated reasoning traces accurately link visual findings to image labels and that the added process reward measurably improves model quality; no free parameters, axioms, or invented entities are explicitly introduced.

pith-pipeline@v0.9.0 · 5638 in / 1262 out tokens · 55770 ms · 2026-05-10T18:12:23.544767+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

47 extracted references · 6 canonical work pages · 4 internal anchors

  1. [1]

    Hrvoje Bogunović, Freerk Venhuizen, et al. 2019. RETOUCH: The retinal OCT fluid detection and segmentation benchmark and challenge. TMI 38, 8 (2019), 1858–1874

  2. [2]

    Ling-Ping Cen, Jie Ji, Jian-Wei Lin, Si-Tong Ju, Hong-Jie Lin, Tai-Ping Li, Yun Wang, Jian-Feng Yang, Yu-Fen Liu, Shaoying Tan, et al. 2021. Automatic detection of 39 fundus diseases and conditions in retinal photographs using deep neural networks. NComms 12, 1 (2021), 4828

  3. [3]

    Harrison Chase. 2022. LangChain. https://github.com/langchain-ai/langchain

  4. [4]

    Junying Chen, Chi Gui, Ruyi Ouyang, Anningzhe Gao, Shunian Chen, Guiming Hardy Chen, Xidong Wang, Zhenyang Cai, Ke Ji, Xiang Wan, et al. 2024. Towards injecting medical visual knowledge into multimodal llms at scale. In EMNLP. 7346–7370

  5. [5]

    Zhe Chen, Weiyun Wang, Yue Cao, Yangzhou Liu, Zhangwei Gao, Erfei Cui, Jinguo Zhu, Shenglong Ye, Hao Tian, Zhaoyang Liu, et al. 2024. Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling. arXiv preprint arXiv:2412.05271 (2024)

  6. [6]

    Wei Dai, Peilin Chen, Chanakya Ekbote, and Paul Pu Liang. 2025. QoQ-Med: Building Multimodal Clinical Foundation Models with Domain-Aware GRPO Training. In NeurIPS

  7. [7]

    DateCazuki. 2022. TOP: Classifier using fundus image dataset provided by Tsukazaki Hospital. https://github.com/DateCazuki/Fundus_Diagnosis. Dataset of fundus images from Tsukazaki Hospital, used for multi-disease classification. Accessed: 2025-04-10

  8. [8]

    Haodong Duan, Junming Yang, Yuxuan Qiao, Xinyu Fang, Lin Chen, Yuan Liu, Xiaoyi Dong, Yuhang Zang, Pan Zhang, Jiaqi Wang, et al. 2024. VLMEvalKit: An open-source toolkit for evaluating large multi-modality models. In Proceedings of the 32nd ACM International Conference on Multimedia. 11198–11201

  9. [9]

    Peyman Gholami, Priyanka Roy, Mohana Kuppuswamy Parthasarathy, and Vasudevan Lakshminarayanan. 2020. OCTID: Optical coherence tomography image database. Computers & Electrical Engineering 81 (2020), 106532

  10. [10]

    Yutao Hu, Tianbin Li, et al. 2024. OmniMedVQA: A new large-scale comprehensive evaluation benchmark for medical LVLM. In CVPR

  11. [11]

    Xiaoling Huang, Xiangyin Kong, Ziyan Shen, Jing Ouyang, Yunxiang Li, Kai Jin, and Juan Ye. 2023. GRAPE: A multi-modal dataset of longitudinal follow-up visual field and fundus images for glaucoma management. Scientific Data 10, 1 (2023), 520

  12. [12]

    Daniel S. Kermany, Michael Goldbaum, Wenjia Cai, Carolina C. S. Valentim, Huiying Liang, Sally L. Baxter, Alex McKeown, Ge Yang, Xiaokang Wu, Fangbing Yan, et al. 2018. Identifying Medical Diagnoses and Treatable Diseases by Image-Based Deep Learning. Cell 172, 5 (Feb. 2018), 1122–1131.e9

  13. [13]

    Hoda Kheradfallah, Janarthanam Jothi Balaji, Varadharajan Jayakumar, Mohammed Abdul Rasheed, and Vasudevan Lakshminarayanan. 2022. Annotation and segmentation of diabetic retinopathy lesions: an explainable AI application. In Medical Imaging 2022: Computer-Aided Diagnosis, Vol. 12033. SPIE, 502–511

  14. [14]

    Mikhail Kulyabin, Aleksei Zhdanov, et al. 2024. OCTDL: Optical coherence tomography dataset for image-based deep learning methods. Scientific Data 11, 1 (2024), 365

  15. [15]

    Jiajia Li, Zhouyu Guan, et al. 2024. Integrated image-based deep learning and language models for primary diabetes care. Nature Medicine 30, 10 (2024), 2886–2896

  16. [16]

    Ning Li, Tao Li, Chunyu Hu, Kai Wang, and Hong Kang. 2021. A benchmark of ocular disease intelligent recognition: One shot for multi-disease detection. In BMO

  17. [17]

    Sijing Li, Tianwei Lin, Lingshuai Lin, Wenqiao Zhang, Jiang Liu, Xiaoda Yang, Juncheng Li, Yucheng He, Xiaohui Song, Jun Xiao, Yueting Zhuang, and Beng Chin Ooi. 2025. EyecareGPT: Boosting Comprehensive Ophthalmology Understanding with Tailored Dataset, Benchmark and Model. In ACMMM

  18. [18]

    Tao Li, Yingqi Gao, Kai Wang, Song Guo, Hanruo Liu, and Hong Kang. 2019. Diagnostic assessment of deep learning algorithms for diabetic retinopathy screening. Information Sciences 501 (2019), 511–522

  19. [19]

    Xirong Li, Yang Zhou, Jie Wang, Hailan Lin, Jianchun Zhao, Dayong Ding, Weihong Yu, and Youxin Chen. 2021. Multi-modal multi-instance learning for retinal disease recognition. In ACMMM

  20. [20]

    Zihan Li, Diping Song, Zefeng Yang, Deming Wang, Fei Li, Xiulan Zhang, Paul E Kinahan, and Yu Qiao. 2025. VisionUnite: A Vision-Language Foundation Model for Ophthalmology Enhanced with Clinical Knowledge. IEEE Transactions on Pattern Analysis and Machine Intelligence (2025)

  21. [21]

    Tianwei Lin, Wenqiao Zhang, Sijing Li, Yuqian Yuan, Binhe Yu, Haoyuan Li, Wanggui He, Hao Jiang, Mengze Li, Xiaohui Song, Siliang Tang, Jun Xiao, Hui Lin, Yueting Zhuang, and Beng Chin Ooi. 2025. HealthGPT: A Medical Large Vision-Language Model for Unifying Comprehension and Generation via Heterogeneous Knowledge Adaptation. In ICML

  22. [22]

    Ruhan Liu, Xiangning Wang, Qiang Wu, Ling Dai, Xi Fang, Tao Yan, Jaemin Son, Shiqi Tang, Jiang Li, Zijian Gao, et al. 2022. DeepDRiD: Diabetic retinopathy—grading and image quality estimation challenge. Patterns 3, 6 (2022)

  23. [23]

    Xinyao Liu and Diping Song. 2025. Constructing Ophthalmic MLLM for Positioning-diagnosis Collaboration Through Clinical Cognitive Chain Reasoning. In ICCV

  24. [24]

    Samiksha Pachade, Prasanna Porwal, Dhanshree Thulkar, Manesh Kokare, Girish Deshmukh, Vivek Sahasrabuddhe, Luca Giancardo, Gwenolé Quellec, and Fabrice Mériaudeau. 2021. Retinal fundus multi-disease image dataset (RFMiD): A dataset for multi-disease detection research. Data 6, 2 (2021), 14

  25. [25]

    Jiazhen Pan, Che Liu, Junde Wu, Fenglin Liu, Jiayuan Zhu, Hongwei Bran Li, Chen Chen, Cheng Ouyang, and Daniel Rueckert. 2025. MedVLM-R1: Incentivizing Medical Reasoning Capability of Vision-Language Models (VLMs) via Reinforcement Learning. In MICCAI

  26. [26]

    Prasanna Porwal, Samiksha Pachade, Ravi Kamble, Manesh Kokare, Girish Deshmukh, Vivek Sahasrabuddhe, and Fabrice Meriaudeau. 2018. Indian diabetic retinopathy image dataset (IDRiD): a database for diabetic retinopathy screening research. Data 3, 3 (2018), 25

  27. [27]

    Zhenyue Qin, Yu Yin, Dylan Campbell, Xuansheng Wu, Ke Zou, Yih-Chung Tham, Ninghao Liu, Xiuzhen Zhang, and Qingyu Chen. 2025. LMOD: A Large Multimodal Ophthalmology Dataset and Benchmark for Large Vision-Language Models. In NAACL

  28. [28]

    Jianing Qiu, Jian Wu, Hao Wei, Peilun Shi, Minqing Zhang, Yunyun Sun, Lin Li, Hanruo Liu, Hongyi Liu, Simeng Hou, et al. 2024. Development and validation of a multimodal multitask vision foundation model for generalist ophthalmic artificial intelligence. NEJM AI 1, 12 (2024), AIoa2300221

  29. [29]

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. 2024. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300 (2024)

  30. [30]

    Guangming Sheng, Chi Zhang, Zilingfeng Ye, Xibin Wu, Wang Zhang, Ru Zhang, Yanghua Peng, Haibin Lin, and Chuan Wu. 2025. HybridFlow: A flexible and efficient RLHF framework. In EuroSys. 1279–1297

  31. [31]

    Saman Sotoudeh-Paima, Ata Jodeiri, Fedra Hajizadeh, and Hamid Soltanian-Zadeh. 2022. Multi-scale convolutional neural network for automated AMD classification using retinal OCT images. Computers in Biology and Medicine 144 (2022), 105368

  32. [32]

    Qwen Team. 2025. Qwen2.5-VL. https://qwenlm.github.io/blog/qwen2.5-vl/

  33. [33]

    Qwen Team. 2025. Qwen3-Max: Just Scale it

  34. [34]

    Rongsheng Wang. 2025. Med-R1: Encourage Medical LLM to engage in deep thinking similar to DeepSeek-R1. https://github.com/WangRongsheng/Med-R1

  35. [35]

    Weisen Wang, Xirong Li, Zhiyan Xu, Weihong Yu, Jianchun Zhao, Dayong Ding, and Youxin Chen. 2022. Learning Two-Stream CNN for Multi-Modal Age-Related Macular Degeneration Categorization. IEEE Journal of Biomedical and Health Informatics 26, 8 (2022), 4111–4122

  36. [36]

    Qijie Wei, Xirong Li, Weihong Yu, Xiao Zhang, Yongpeng Zhang, Bojie Hu, Bin Mo, Di Gong, Ning Chen, Dayong Ding, et al. 2021. Learn to segment retinal lesions and beyond. In ICPR

  37. [37]

    Qijie Wei, Kaiheng Qian, and Xirong Li. 2025. FunBench: Benchmarking Fundus Reading Skills of MLLMs. In MICCAI

  38. [38]

    Ruiqi Wu, Yuang Yao, et al. 2025. Bridging the Gap in Ophthalmic AI: MM-Retinal-Reason Dataset and OphthaReason Model toward Dynamic Multimodal Reasoning. arXiv preprint arXiv:2508.16129 (2025)

  39. [39]

    Guowei Xu, Peng Jin, Ziang Wu, Hao Li, Yibing Song, Lichao Sun, and Li Yuan

  40. [40]

    Llava-cot: Let vision language models reason step-by-step. In ICCV

  41. [41]

    Weiwen Xu, Hou Pong Chan, Long Li, Mahani Aljunied, Ruifeng Yuan, Jianyu Wang, Chenghao Xiao, Guizhen Chen, Chaoqun Liu, Zhaodonghui Li, et al

  42. [42]

    Lingshu: A Generalist Foundation Model for Unified Multimodal Medical Understanding and Reasoning. arXiv preprint arXiv:2506.07044 (2025)

  43. [43]

    Huanjin Yao, Jiaxing Huang, Wenhao Wu, Jingyi Zhang, Yibo Wang, Shunyu Liu, Yingjie Wang, Yuxin Song, Haocheng Feng, Li Shen, et al. 2025. Mulberry: Empowering MLLM with o1-like reasoning and reflection via collective Monte Carlo tree search. In NeurIPS

  44. [44]

    Jin Ye, Guoan Wang, Yanjun Li, Zhongying Deng, Wei Li, Tianbin Li, Haodong Duan, Ziyan Huang, Yanzhou Su, Benyou Wang, et al. 2024. GMAI-MMBench: A Comprehensive Multimodal Evaluation Benchmark Towards General Medical AI. In NeurIPS

  45. [45]

    Xin Ye, Shucheng He, Xiaxing Zhong, Jiafeng Yu, Shangchao Yang, Yingjiao Shen, Yiqi Chen, Yaqi Wang, Xingru Huang, and Lijun Shen. 2023. OIMHS: An optical coherence tomography image dataset based on macular hole manual segmentation. Scientific Data 10, 1 (2023), 769

  46. [46]

    Yaowei Zheng, Richong Zhang, Junhao Zhang, Yanhan Ye, Zheyan Luo, Zhangchi Feng, and Yongqiang Ma. 2024. LlamaFactory: Unified Efficient Fine-Tuning of 100+ Language Models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics. http://arxiv.org/abs/2403.13372

  47. [47]

    Wenhui Zhu, Xin Li, Xiwen Chen, Peijie Qiu, Vamsi Krishna Vasa, Xuanzhao Dong, Yanxi Chen, Natasha Lepore, Oana Dumitrascu, Yi Su, et al. 2025. RetinalGPT: A Retinal Clinical Preference Conversational Assistant Powered by Large Vision-Language Models. arXiv preprint arXiv:2503.03987 (2025)