Recognition: 2 theorem links
Fundus-R1: Training a Fundus-Reading MLLM with Knowledge-Aware Reasoning on Public Data
Pith reviewed 2026-05-10 18:12 UTC · model grok-4.3
The pith
A specialized fundus-reading multimodal model can be trained, using only public datasets and image-level labels, to outperform its generic counterpart.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We train Fundus-R1 exclusively on public datasets in which over 94 percent of samples carry only image-level labels. A retrieval-augmented generation (RAG) procedure first composes image-specific reasoning traces that connect visual findings identified by a generic MLLM to the available labels through ophthalmic knowledge. We then apply reinforcement learning with verifiable rewards (RLVR), augmented by a process reward that encourages self-consistency of each generated trace. On FunBench, Omni-Fundus and GMAI-Fundus, the resulting model outperforms both its generic counterpart and a stronger post-trained edition that omits the generated traces.
What carries the argument
The RAG-based method that composes image-specific, knowledge-aware reasoning traces linking visual findings to image labels.
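To make this mechanism concrete, here is a minimal sketch of what such a trace composer could look like. The paper does not publish its implementation, so every name below (describe_findings, retrieve_knowledge, the prompt template) is a hypothetical stand-in for the described data flow: findings from a generic MLLM, retrieved ophthalmic knowledge, and the image-level label are assembled into a reasoning trace that later serves as training supervision.

```python
# Hypothetical sketch of the RAG-based trace-composition step.
# Data flow only: generic-MLLM findings + retrieved ophthalmic knowledge
# + image-level label -> an image-specific reasoning trace.

from dataclasses import dataclass

@dataclass
class LabeledImage:
    image_path: str
    label: str  # image-level label, e.g. "diabetic retinopathy, grade 2"

def describe_findings(image_path: str) -> str:
    """Ask a generic MLLM (e.g. Qwen2.5-VL) for visual findings; stubbed."""
    raise NotImplementedError("call a multimodal model")

def retrieve_knowledge(query: str, k: int = 3) -> list[str]:
    """Retrieve k ophthalmic knowledge snippets; stubbed. In practice this
    could be a vector store over textbooks and guidelines (e.g. via LangChain)."""
    raise NotImplementedError("plug in a retriever")

def compose_trace(sample: LabeledImage) -> str:
    """Assemble the prompt that a text LLM would complete into a trace."""
    findings = describe_findings(sample.image_path)
    snippets = retrieve_knowledge(f"{findings} {sample.label}")
    return (
        "Visual findings:\n" + findings + "\n\n"
        "Ophthalmic knowledge:\n" + "\n".join(snippets) + "\n\n"
        "Explain step by step why these findings support the label "
        f"'{sample.label}'."
    )
```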
If this is right
- Fundus-R1 outperforms its generic counterpart and a post-trained ablation on three public fundus-reading benchmarks.
- Training succeeds when more than 94 percent of the data carries only image-level labels rather than full clinical reports.
- The process reward for reasoning-trace self-consistency measurably improves the RLVR stage (one way such a reward could be realized is sketched after this list).
- The same pipeline removes the previous dependence on private in-house samples for fundus MLLM development.
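How might a self-consistency process reward be folded into RLVR? A minimal sketch follows, assuming a scalar reward per rollout that sums a verifiable outcome term and a weighted process term. The weighting, the consistency scorer, and all function names are assumptions for illustration, not the paper's implementation.

```python
# Sketch of RLVR with an added process reward, under the assumptions above.
# verify_answer is the standard verifiable outcome reward (answer vs. label);
# trace_consistency stands in for the paper's self-consistency signal, e.g. a
# judge model scoring whether the reasoning trace actually entails the answer.

def verify_answer(predicted: str, label: str) -> float:
    """Outcome reward: 1.0 iff the final answer matches the image-level label."""
    return 1.0 if predicted.strip().lower() == label.strip().lower() else 0.0

def trace_consistency(trace: str, predicted: str) -> float:
    """Process reward in [0, 1]: does the trace support its own final answer?
    Stubbed; could be an entailment model or an LLM judge."""
    raise NotImplementedError("plug in a consistency scorer")

def rollout_reward(trace: str, predicted: str, label: str,
                   process_weight: float = 0.3) -> float:
    # The outcome term keeps optimization anchored to verifiable labels; the
    # process term penalizes rollouts whose traces contradict their answers.
    return (verify_answer(predicted, label)
            + process_weight * trace_consistency(trace, predicted))
```

Under a GRPO-style optimizer, such per-rollout rewards would then be normalized within each group of rollouts before computing advantages.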
Where Pith is reading between the lines
- The RAG-plus-process-reward recipe could be tested on other medical imaging domains such as chest X-rays or pathology slides where expert reports are scarce.
- If the generated traces prove robust, the method offers a route to more interpretable and reproducible medical vision-language models without proprietary data.
- Broader adoption would allow research groups lacking private datasets to contribute directly to tools for retinal anomaly detection.
Load-bearing premise
The automatically generated reasoning traces are accurate and useful enough that training on them with the added process reward produces measurable gains over the generic base model.
What would settle it
If Fundus-R1 fails to outperform the generic base on FunBench, Omni-Fundus and GMAI-Fundus, or if variants trained without the RAG-generated traces or without the process reward match the full model, the central claim would be falsified.
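The ablation this implies can be written down directly. The sketch below is a hypothetical evaluation harness: the variant checkpoint names and the evaluate stub are placeholders, not released artifacts (VLMEvalKit, which appears in the paper's reference list, is one plausible backend).

```python
# Hypothetical ablation harness for the falsification test described above.
# Checkpoint names are placeholders; evaluate() must be backed by a real
# benchmark runner (e.g. VLMEvalKit) to produce actual numbers.

VARIANTS = {
    "generic-base": "Qwen2.5-VL",                   # no post-training
    "no-traces": "fundus-posttrained-no-traces",    # RLVR without RAG traces
    "no-process-reward": "fundus-traces-outcome-only",
    "full": "Fundus-R1",
}
BENCHMARKS = ["FunBench", "Omni-Fundus", "GMAI-Fundus"]

def evaluate(checkpoint: str, benchmark: str) -> float:
    """Return benchmark accuracy for a checkpoint; stubbed here."""
    raise NotImplementedError("back this with a benchmark runner")

def run_ablation() -> None:
    for bench in BENCHMARKS:
        scores = {name: evaluate(ckpt, bench) for name, ckpt in VARIANTS.items()}
        # The central claim survives only if "full" beats "generic-base" and
        # both ablations fall measurably short of "full" on each benchmark.
        print(bench, scores)
```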
Original abstract
Fundus imaging such as CFP, OCT and UWF is crucial for the early detection of retinal anomalies and diseases. Fundus image understanding, due to its knowledge-intensive nature, poses a challenging vision-language task. An emerging approach to addressing the task is to post-train a generic multimodal large language model (MLLM), either by supervised finetuning (SFT) or by reinforcement learning with verifiable rewards (RLVR), on a considerable amount of in-house samples paired with high-quality clinical reports. However, these valuable samples are not publicly accessible, which not only hinders reproducibility but also practically limits research to few players. To overcome the barrier, we make a novel attempt to train a reasoning-enhanced fundus-reading MLLM, which we term Fundus-R1, using exclusively public datasets, wherein over 94% of the data are annotated with only image-level labels. Our technical contributions are two-fold. First, we propose a RAG-based method for composing image-specific, knowledge-aware reasoning traces. Such auto-generated traces link visual findings identified by a generic MLLM to the image labels in terms of ophthalmic knowledge. Second, we enhance RLVR with a process reward that encourages self-consistency of the generated reasoning trace in each rollout. Extensive experiments on three fundus-reading benchmarks, i.e., FunBench, Omni-Fundus and GMAI-Fundus, show that Fundus-R1 clearly outperforms multiple baselines, including its generic counterpart (Qwen2.5-VL) and a stronger edition post-trained without using the generated traces. This work paves the way for training powerful fundus-reading MLLMs with publicly available data.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Fundus-R1, a reasoning-enhanced MLLM for fundus image understanding (CFP, OCT, UWF) trained exclusively on public datasets (over 94% with only image-level labels). It proposes a RAG-based method to auto-generate image-specific, knowledge-aware reasoning traces linking generic MLLM visual findings to labels via ophthalmic knowledge, plus an enhanced RLVR procedure with an added process reward for self-consistency of reasoning traces per rollout. Experiments on FunBench, Omni-Fundus and GMAI-Fundus are said to show clear outperformance versus Qwen2.5-VL and a stronger post-trained baseline that omits the generated traces.
Significance. If the empirical claims hold after verification, the work would be significant for medical vision-language modeling: it shows a reproducible path to specialized fundus MLLMs without proprietary clinical reports, using only public data plus automated knowledge injection. The combination of RAG trace generation and process-reward RLVR is a concrete technical step toward more interpretable, knowledge-intensive medical MLLMs and could broaden participation beyond labs with private data access.
Major comments (2)
- [Abstract] The central claim that 'Fundus-R1 clearly outperforms multiple baselines, including its generic counterpart (Qwen2.5-VL) and a stronger edition post-trained without using the generated traces' is unsupported by quantitative numbers, error bars, dataset sizes, or baseline-training details in the visible text; without these, the outperformance cannot be assessed and the contribution of the knowledge-aware traces cannot be isolated.
- [Technical contributions / Method] No expert validation, factual-accuracy metric, or human evaluation of the auto-generated reasoning traces is reported for the RAG-based trace generation method. Given that over 94% of the training data carries only image-level labels, any systematic misalignment between the traces and true clinical reasoning would make the reported gains on the three benchmarks attributable to RLVR scaffolding or data scale rather than to the claimed knowledge-aware content.
Minor comments (1)
- [Abstract] The acronym RLVR is used without prior expansion; while context implies reinforcement learning with verifiable rewards, an explicit definition on first use would improve readability.
Simulated Author's Rebuttal
We thank the referee for the thoughtful and constructive review. We address each major comment below and outline the revisions we will make to strengthen the manuscript.
Point-by-point responses
Referee: [Abstract] The central claim that 'Fundus-R1 clearly outperforms multiple baselines, including its generic counterpart (Qwen2.5-VL) and a stronger edition post-trained without using the generated traces' is unsupported by quantitative numbers, error bars, dataset sizes, or baseline-training details in the visible text; without these, the outperformance cannot be assessed and the contribution of the knowledge-aware traces cannot be isolated.
Authors: We agree that the abstract would be clearer with quantitative support. In the revised version we will insert concise performance highlights (e.g., accuracy or F1 gains on FunBench, Omni-Fundus and GMAI-Fundus), together with pointers to the exact dataset sizes, training configurations and the ablation baseline that omits the generated traces. These numbers already appear in the experimental tables and will now be summarized up front, so that the claimed outperformance and the incremental value of the knowledge-aware traces can be assessed directly from the abstract. Revision: yes.
Referee: [Technical contributions / Method] No expert validation, factual-accuracy metric, or human evaluation of the auto-generated reasoning traces is reported for the RAG-based trace generation method. Given that over 94% of the training data carries only image-level labels, any systematic misalignment between the traces and true clinical reasoning would make the reported gains on the three benchmarks attributable to RLVR scaffolding or data scale rather than to the claimed knowledge-aware content.
Authors: We accept the referee's point that the absence of direct validation leaves open the possibility that the gains arise from RLVR scaffolding rather than from the content of the traces. We will revise the method section to include (1) representative examples of the RAG-generated traces together with the ophthalmic knowledge sources used, (2) a factual-accuracy check performed on a held-out subset of traces by comparing them against publicly available clinical guidelines, and (3) an expanded discussion of the ablation that isolates the traces (the "stronger edition post-trained without using the generated traces"). While a large-scale expert review was not feasible within the original scope, the added analysis will allow readers to judge the alignment between the auto-generated reasoning and clinical knowledge. Revision: partial.
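A minimal version of the proposed factual-accuracy check could look like the sketch below: split each generated trace into atomic claims and score each claim against retrieved guideline text. Both helpers are illustrative stand-ins under that assumption, not the authors' protocol.

```python
# Sketch of a trace factual-accuracy check against clinical guidelines.
# split_into_claims is deliberately naive; a real pipeline would extract
# atomic claims with an LLM and verify each via retrieval plus entailment.

def split_into_claims(trace: str) -> list[str]:
    """Naive sentence-level claim splitter."""
    return [s.strip() for s in trace.split(".") if s.strip()]

def supported_by_guidelines(claim: str) -> bool:
    """Stub: retrieve guideline passages and test whether they entail the claim."""
    raise NotImplementedError("plug in a retriever and an entailment checker")

def trace_accuracy(trace: str) -> float:
    """Fraction of a trace's claims that are supported by guideline text."""
    claims = split_into_claims(trace)
    if not claims:
        return 0.0
    return sum(supported_by_guidelines(c) for c in claims) / len(claims)
```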
Circularity Check
No circularity in derivation chain
Full rationale
The paper describes an empirical method using RAG to generate reasoning traces from public data and a generic MLLM, followed by RLVR training with a self-consistency process reward. Performance is evaluated via experiments on external benchmarks against baselines including a no-trace variant. No equations, fitted parameters, or self-referential definitions are present that would make any claimed result equivalent to its inputs by construction. The central gains are presented as experimental outcomes, not derived tautologically from author-defined quantities or self-citations.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem is unclear.
  Passage: "RAG-based method for composing image-specific, knowledge-aware reasoning traces... enhance RLVR with a process reward that encourages self-consistency"
- IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem is unclear.
  Passage: "Extensive experiments on three fundus-reading benchmarks... Fundus-R1 clearly outperforms"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1] Hrvoje Bogunović, Freerk Venhuizen, et al. 2019. RETOUCH: The retinal OCT fluid detection and segmentation benchmark and challenge. TMI 38, 8 (2019), 1858–1874.
- [2] Ling-Ping Cen, Jie Ji, Jian-Wei Lin, Si-Tong Ju, Hong-Jie Lin, Tai-Ping Li, Yun Wang, Jian-Feng Yang, Yu-Fen Liu, Shaoying Tan, et al. 2021. Automatic detection of 39 fundus diseases and conditions in retinal photographs using deep neural networks. Nature Communications 12, 1 (2021), 4828.
- [3] Harrison Chase. 2022. LangChain. https://github.com/langchain-ai/langchain
- [4] Junying Chen, Chi Gui, Ruyi Ouyang, Anningzhe Gao, Shunian Chen, Guiming Hardy Chen, Xidong Wang, Zhenyang Cai, Ke Ji, Xiang Wan, et al. 2024. Towards injecting medical visual knowledge into multimodal LLMs at scale. In EMNLP. 7346–7370.
- [5] Zhe Chen, Weiyun Wang, Yue Cao, Yangzhou Liu, Zhangwei Gao, Erfei Cui, Jinguo Zhu, Shenglong Ye, Hao Tian, Zhaoyang Liu, et al. 2024. Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling. arXiv preprint arXiv:2412.05271 (2024).
- [6] Wei Dai, Peilin Chen, Chanakya Ekbote, and Paul Pu Liang. 2025. QoQ-Med: Building Multimodal Clinical Foundation Models with Domain-Aware GRPO Training. In NeurIPS.
- [7] DateCazuki. 2022. TOP: Classifier using fundus image dataset provided by Tsukazaki Hospital. https://github.com/DateCazuki/Fundus_Diagnosis. Dataset of fundus images from Tsukazaki Hospital, used for multi-disease classification. Accessed: 2025-04-10.
- [8] Haodong Duan, Junming Yang, Yuxuan Qiao, Xinyu Fang, Lin Chen, Yuan Liu, Xiaoyi Dong, Yuhang Zang, Pan Zhang, Jiaqi Wang, et al. 2024. VLMEvalKit: An open-source toolkit for evaluating large multi-modality models. In Proceedings of the 32nd ACM International Conference on Multimedia. 11198–11201.
- [9] Peyman Gholami, Priyanka Roy, Mohana Kuppuswamy Parthasarathy, and Vasudevan Lakshminarayanan. 2020. OCTID: Optical coherence tomography image database. Computers & Electrical Engineering 81 (2020), 106532.
- [10] Yutao Hu, Tianbin Li, et al. 2024. OmniMedVQA: A new large-scale comprehensive evaluation benchmark for medical LVLM. In CVPR.
- [11] Xiaoling Huang, Xiangyin Kong, Ziyan Shen, Jing Ouyang, Yunxiang Li, Kai Jin, and Juan Ye. 2023. GRAPE: A multi-modal dataset of longitudinal follow-up visual field and fundus images for glaucoma management. Scientific Data 10, 1 (2023), 520.
- [12] Daniel S. Kermany, Michael Goldbaum, Wenjia Cai, Carolina C. S. Valentim, Huiying Liang, Sally L. Baxter, Alex McKeown, Ge Yang, Xiaokang Wu, Fangbing Yan, et al. 2018. Identifying Medical Diagnoses and Treatable Diseases by Image-Based Deep Learning. Cell 172, 5 (Feb. 2018), 1122–1131.e9.
- [13] Hoda Kheradfallah, Janarthanam Jothi Balaji, Varadharajan Jayakumar, Mohammed Abdul Rasheed, and Vasudevan Lakshminarayanan. 2022. Annotation and segmentation of diabetic retinopathy lesions: an explainable AI application. In Medical Imaging 2022: Computer-Aided Diagnosis, Vol. 12033. SPIE, 502–511.
- [14] Mikhail Kulyabin, Aleksei Zhdanov, et al. 2024. OCTDL: Optical coherence tomography dataset for image-based deep learning methods. Scientific Data 11, 1 (2024), 365.
- [15] Jiajia Li, Zhouyu Guan, et al. 2024. Integrated image-based deep learning and language models for primary diabetes care. Nature Medicine 30, 10 (2024), 2886–2896.
- [16] Ning Li, Tao Li, Chunyu Hu, Kai Wang, and Hong Kang. 2021. A benchmark of ocular disease intelligent recognition: One shot for multi-disease detection. In BMO.
- [17] Sijing Li, Tianwei Lin, Lingshuai Lin, Wenqiao Zhang, Jiang Liu, Xiaoda Yang, Juncheng Li, Yucheng He, Xiaohui Song, Jun Xiao, Yueting Zhuang, and Beng Chin Ooi. 2025. EyecareGPT: Boosting Comprehensive Ophthalmology Understanding with Tailored Dataset, Benchmark and Model. In ACMMM.
- [18] Tao Li, Yingqi Gao, Kai Wang, Song Guo, Hanruo Liu, and Hong Kang. 2019. Diagnostic assessment of deep learning algorithms for diabetic retinopathy screening. Information Sciences 501 (2019), 511–522.
- [19] Xirong Li, Yang Zhou, Jie Wang, Hailan Lin, Jianchun Zhao, Dayong Ding, Weihong Yu, and Youxin Chen. 2021. Multi-modal multi-instance learning for retinal disease recognition. In ACMMM.
- [20] Zihan Li, Diping Song, Zefeng Yang, Deming Wang, Fei Li, Xiulan Zhang, Paul E. Kinahan, and Yu Qiao. 2025. VisionUnite: A Vision-Language Foundation Model for Ophthalmology Enhanced with Clinical Knowledge. IEEE Transactions on Pattern Analysis and Machine Intelligence (2025).
- [21] Tianwei Lin, Wenqiao Zhang, Sijing Li, Yuqian Yuan, Binhe Yu, Haoyuan Li, Wanggui He, Hao Jiang, Mengze Li, Xiaohui Song, Siliang Tang, Jun Xiao, Hui Lin, Yueting Zhuang, and Beng Chin Ooi. 2025. HealthGPT: A Medical Large Vision-Language Model for Unifying Comprehension and Generation via Heterogeneous Knowledge Adaptation. In ICML.
- [22] Ruhan Liu, Xiangning Wang, Qiang Wu, Ling Dai, Xi Fang, Tao Yan, Jaemin Son, Shiqi Tang, Jiang Li, Zijian Gao, et al. 2022. DeepDRiD: Diabetic retinopathy—grading and image quality estimation challenge. Patterns 3, 6 (2022).
- [23] Xinyao Liu and Diping Song. 2025. Constructing Ophthalmic MLLM for Positioning-Diagnosis Collaboration Through Clinical Cognitive Chain Reasoning. In ICCV.
- [24] Samiksha Pachade, Prasanna Porwal, Dhanshree Thulkar, Manesh Kokare, Girish Deshmukh, Vivek Sahasrabuddhe, Luca Giancardo, Gwenolé Quellec, and Fabrice Mériaudeau. 2021. Retinal fundus multi-disease image dataset (RFMiD): A dataset for multi-disease detection research. Data 6, 2 (2021), 14.
- [25] Jiazhen Pan, Che Liu, Junde Wu, Fenglin Liu, Jiayuan Zhu, Hongwei Bran Li, Chen Chen, Cheng Ouyang, and Daniel Rueckert. 2025. MedVLM-R1: Incentivizing Medical Reasoning Capability of Vision-Language Models (VLMs) via Reinforcement Learning. In MICCAI.
- [26] Prasanna Porwal, Samiksha Pachade, Ravi Kamble, Manesh Kokare, Girish Deshmukh, Vivek Sahasrabuddhe, and Fabrice Meriaudeau. 2018. Indian diabetic retinopathy image dataset (IDRiD): A database for diabetic retinopathy screening research. Data 3, 3 (2018), 25.
- [27] Zhenyue Qin, Yu Yin, Dylan Campbell, Xuansheng Wu, Ke Zou, Yih-Chung Tham, Ninghao Liu, Xiuzhen Zhang, and Qingyu Chen. 2025. LMOD: A Large Multimodal Ophthalmology Dataset and Benchmark for Large Vision-Language Models. In NAACL.
- [28] Jianing Qiu, Jian Wu, Hao Wei, Peilun Shi, Minqing Zhang, Yunyun Sun, Lin Li, Hanruo Liu, Hongyi Liu, Simeng Hou, et al. 2024. Development and validation of a multimodal multitask vision foundation model for generalist ophthalmic artificial intelligence. NEJM AI 1, 12 (2024), AIoa2300221.
- [29] Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y. K. Li, Yang Wu, et al. 2024. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300 (2024).
- [30] Guangming Sheng, Chi Zhang, Zilingfeng Ye, Xibin Wu, Wang Zhang, Ru Zhang, Yanghua Peng, Haibin Lin, and Chuan Wu. 2025. HybridFlow: A flexible and efficient RLHF framework. In EuroSys. 1279–1297.
- [31] Saman Sotoudeh-Paima, Ata Jodeiri, Fedra Hajizadeh, and Hamid Soltanian-Zadeh. 2022. Multi-scale convolutional neural network for automated AMD classification using retinal OCT images. Computers in Biology and Medicine 144 (2022), 105368.
- [32] Qwen Team. 2025. Qwen2.5-VL. https://qwenlm.github.io/blog/qwen2.5-vl/
- [33] Qwen Team. 2025. Qwen3-Max: Just Scale It.
- [34] Rongsheng Wang. 2025. Med-R1: Encourage Medical LLM to engage in deep thinking similar to DeepSeek-R1. https://github.com/WangRongsheng/Med-R1
- [35] Weisen Wang, Xirong Li, Zhiyan Xu, Weihong Yu, Jianchun Zhao, Dayong Ding, and Youxin Chen. 2022. Learning Two-Stream CNN for Multi-Modal Age-Related Macular Degeneration Categorization. IEEE Journal of Biomedical and Health Informatics 26, 8 (2022), 4111–4122.
- [36] Qijie Wei, Xirong Li, Weihong Yu, Xiao Zhang, Yongpeng Zhang, Bojie Hu, Bin Mo, Di Gong, Ning Chen, Dayong Ding, et al. 2021. Learn to segment retinal lesions and beyond. In ICPR.
- [37] Qijie Wei, Kaiheng Qian, and Xirong Li. 2025. FunBench: Benchmarking Fundus Reading Skills of MLLMs. In MICCAI.
- [38]
- [39] Guowei Xu, Peng Jin, Ziang Wu, Hao Li, Yibing Song, Lichao Sun, and Li Yuan. LLaVA-CoT: Let vision language models reason step-by-step. In ICCV.
- [41] Weiwen Xu, Hou Pong Chan, Long Li, Mahani Aljunied, Ruifeng Yuan, Jianyu Wang, Chenghao Xiao, Guizhen Chen, Chaoqun Liu, Zhaodonghui Li, et al. 2025. Lingshu: A Generalist Foundation Model for Unified Multimodal Medical Understanding and Reasoning. arXiv preprint arXiv:2506.07044 (2025).
- [43] Huanjin Yao, Jiaxing Huang, Wenhao Wu, Jingyi Zhang, Yibo Wang, Shunyu Liu, Yingjie Wang, Yuxin Song, Haocheng Feng, Li Shen, et al. 2025. Mulberry: Empowering MLLM with o1-like reasoning and reflection via collective Monte Carlo tree search. In NeurIPS.
- [44] Jin Ye, Guoan Wang, Yanjun Li, Zhongying Deng, Wei Li, Tianbin Li, Haodong Duan, Ziyan Huang, Yanzhou Su, Benyou Wang, et al. 2024. GMAI-MMBench: A Comprehensive Multimodal Evaluation Benchmark Towards General Medical AI. In NeurIPS.
- [45] Xin Ye, Shucheng He, Xiaxing Zhong, Jiafeng Yu, Shangchao Yang, Yingjiao Shen, Yiqi Chen, Yaqi Wang, Xingru Huang, and Lijun Shen. 2023. OIMHS: An optical coherence tomography image dataset based on macular hole manual segmentation. Scientific Data 10, 1 (2023), 769.
- [46] Yaowei Zheng, Richong Zhang, Junhao Zhang, Yanhan Ye, Zheyan Luo, Zhangchi Feng, and Yongqiang Ma. 2024. LlamaFactory: Unified Efficient Fine-Tuning of 100+ Language Models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics. http://arxiv.org/abs/2403.13372
- [47] Wenhui Zhu, Xin Li, Xiwen Chen, Peijie Qiu, Vamsi Krishna Vasa, Xuanzhao Dong, Yanxi Chen, Natasha Lepore, Oana Dumitrascu, Yi Su, et al. 2025. RetinalGPT: A Retinal Clinical Preference Conversational Assistant Powered by Large Vision-Language Models. arXiv preprint arXiv:2503.03987 (2025).