Recognition: 2 theorem links
Fundus-R1: Training a Fundus-Reading MLLM with Knowledge-Aware Reasoning on Public Data
Pith reviewed 2026-05-10 18:12 UTC · model grok-4.3
The pith
A specialized fundus-reading multimodal model can be trained, using only public datasets and image-level labels, to outperform its generic counterpart.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We train Fundus-R1 exclusively on public datasets in which over 94 percent of samples carry only image-level labels. A retrieval-augmented generation (RAG) procedure first composes image-specific reasoning traces that connect visual findings identified by a generic MLLM to the available labels through ophthalmic knowledge. We then apply reinforcement learning with verifiable rewards (RLVR), augmented by a process reward that encourages self-consistency of each generated trace. On FunBench, Omni-Fundus and GMAI-Fundus, the resulting model outperforms both its generic counterpart and a stronger post-trained edition that omits the generated traces.
What carries the argument
The RAG-based method that composes image-specific, knowledge-aware reasoning traces linking visual findings to image labels.
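To make this mechanism concrete, here is a minimal sketch of what such a trace composer could look like. The paper does not publish its implementation, so every name below (describe_findings, retrieve_knowledge, the prompt template) is a hypothetical stand-in for the described data flow: findings from a generic MLLM, retrieved ophthalmic knowledge, and the image-level label are assembled into a reasoning trace that later serves as training supervision.

```python
# Hypothetical sketch of the RAG-based trace-composition step.
# Data flow only: generic-MLLM findings + retrieved ophthalmic knowledge
# + image-level label -> an image-specific reasoning trace.

from dataclasses import dataclass

@dataclass
class LabeledImage:
    image_path: str
    label: str  # image-level label, e.g. "diabetic retinopathy, grade 2"

def describe_findings(image_path: str) -> str:
    """Ask a generic MLLM (e.g. Qwen2.5-VL) for visual findings; stubbed."""
    raise NotImplementedError("call a multimodal model")

def retrieve_knowledge(query: str, k: int = 3) -> list[str]:
    """Retrieve k ophthalmic knowledge snippets; stubbed. In practice this
    could be a vector store over textbooks and guidelines (e.g. via LangChain)."""
    raise NotImplementedError("plug in a retriever")

def compose_trace(sample: LabeledImage) -> str:
    """Assemble the prompt that a text LLM would complete into a trace."""
    findings = describe_findings(sample.image_path)
    snippets = retrieve_knowledge(f"{findings} {sample.label}")
    return (
        "Visual findings:\n" + findings + "\n\n"
        "Ophthalmic knowledge:\n" + "\n".join(snippets) + "\n\n"
        "Explain step by step why these findings support the label "
        f"'{sample.label}'."
    )
```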
If this is right
- Fundus-R1 outperforms its generic counterpart and a post-trained ablation on three public fundus-reading benchmarks.
- Training succeeds when more than 94 percent of the data carries only image-level labels rather than full clinical reports.
- The process reward for reasoning-trace self-consistency measurably improves the RLVR stage (one way such a reward could be realized is sketched after this list).
- The same pipeline removes the previous dependence on private in-house samples for fundus MLLM development.
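How might a self-consistency process reward be folded into RLVR? A minimal sketch follows, assuming a scalar reward per rollout that sums a verifiable outcome term and a weighted process term. The weighting, the consistency scorer, and all function names are assumptions for illustration, not the paper's implementation.

```python
# Sketch of RLVR with an added process reward, under the assumptions above.
# verify_answer is the standard verifiable outcome reward (answer vs. label);
# trace_consistency stands in for the paper's self-consistency signal, e.g. a
# judge model scoring whether the reasoning trace actually entails the answer.

def verify_answer(predicted: str, label: str) -> float:
    """Outcome reward: 1.0 iff the final answer matches the image-level label."""
    return 1.0 if predicted.strip().lower() == label.strip().lower() else 0.0

def trace_consistency(trace: str, predicted: str) -> float:
    """Process reward in [0, 1]: does the trace support its own final answer?
    Stubbed; could be an entailment model or an LLM judge."""
    raise NotImplementedError("plug in a consistency scorer")

def rollout_reward(trace: str, predicted: str, label: str,
                   process_weight: float = 0.3) -> float:
    # The outcome term keeps optimization anchored to verifiable labels; the
    # process term penalizes rollouts whose traces contradict their answers.
    return (verify_answer(predicted, label)
            + process_weight * trace_consistency(trace, predicted))
```

Under a GRPO-style optimizer, such per-rollout rewards would then be normalized within each group of rollouts before computing advantages.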
Where Pith is reading between the lines
- The RAG-plus-process-reward recipe could be tested on other medical imaging domains such as chest X-rays or pathology slides where expert reports are scarce.
- If the generated traces prove robust, the method offers a route to more interpretable and reproducible medical vision-language models without proprietary data.
- Broader adoption would allow research groups lacking private datasets to contribute directly to tools for retinal anomaly detection.
Load-bearing premise
The automatically generated reasoning traces are accurate and useful enough that training on them with the added process reward produces measurable gains over the generic base model.
What would settle it
If Fundus-R1 fails to outperform the generic base on FunBench, Omni-Fundus and GMAI-Fundus, or if variants trained without the RAG-generated traces or without the process reward match the full model, the central claim would be falsified.
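The ablation this implies can be written down directly. The sketch below is a hypothetical evaluation harness: the variant checkpoint names and the evaluate stub are placeholders, not released artifacts (VLMEvalKit, which appears in the paper's reference list, is one plausible backend).

```python
# Hypothetical ablation harness for the falsification test described above.
# Checkpoint names are placeholders; evaluate() must be backed by a real
# benchmark runner (e.g. VLMEvalKit) to produce actual numbers.

VARIANTS = {
    "generic-base": "Qwen2.5-VL",                   # no post-training
    "no-traces": "fundus-posttrained-no-traces",    # RLVR without RAG traces
    "no-process-reward": "fundus-traces-outcome-only",
    "full": "Fundus-R1",
}
BENCHMARKS = ["FunBench", "Omni-Fundus", "GMAI-Fundus"]

def evaluate(checkpoint: str, benchmark: str) -> float:
    """Return benchmark accuracy for a checkpoint; stubbed here."""
    raise NotImplementedError("back this with a benchmark runner")

def run_ablation() -> None:
    for bench in BENCHMARKS:
        scores = {name: evaluate(ckpt, bench) for name, ckpt in VARIANTS.items()}
        # The central claim survives only if "full" beats "generic-base" and
        # both ablations fall measurably short of "full" on each benchmark.
        print(bench, scores)
```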
Original abstract
Fundus imaging such as CFP, OCT and UWF is crucial for the early detection of retinal anomalies and diseases. Fundus image understanding, due to its knowledge-intensive nature, poses a challenging vision-language task. An emerging approach to addressing the task is to post-train a generic multimodal large language model (MLLM), either by supervised finetuning (SFT) or by reinforcement learning with verifiable rewards (RLVR), on a considerable amount of in-house samples paired with high-quality clinical reports. However, these valuable samples are not publicly accessible, which not only hinders reproducibility but also practically limits research to few players. To overcome the barrier, we make a novel attempt to train a reasoning-enhanced fundus-reading MLLM, which we term Fundus-R1, using exclusively public datasets, wherein over 94% of the data are annotated with only image-level labels. Our technical contributions are two-fold. First, we propose a RAG-based method for composing image-specific, knowledge-aware reasoning traces. Such auto-generated traces link visual findings identified by a generic MLLM to the image labels in terms of ophthalmic knowledge. Second, we enhance RLVR with a process reward that encourages self-consistency of the generated reasoning trace in each rollout. Extensive experiments on three fundus-reading benchmarks, i.e., FunBench, Omni-Fundus and GMAI-Fundus, show that Fundus-R1 clearly outperforms multiple baselines, including its generic counterpart (Qwen2.5-VL) and a stronger edition post-trained without using the generated traces. This work paves the way for training powerful fundus-reading MLLMs with publicly available data.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Fundus-R1, a reasoning-enhanced MLLM for fundus image understanding (CFP, OCT, UWF) trained exclusively on public datasets (over 94% with only image-level labels). It proposes a RAG-based method to auto-generate image-specific, knowledge-aware reasoning traces linking generic MLLM visual findings to labels via ophthalmic knowledge, plus an enhanced RLVR procedure with an added process reward for self-consistency of reasoning traces per rollout. Experiments on FunBench, Omni-Fundus and GMAI-Fundus are said to show clear outperformance versus Qwen2.5-VL and a stronger post-trained baseline that omits the generated traces.
Significance. If the empirical claims hold after verification, the work would be significant for medical vision-language modeling: it shows a reproducible path to specialized fundus MLLMs without proprietary clinical reports, using only public data plus automated knowledge injection. The combination of RAG trace generation and process-reward RLVR is a concrete technical step toward more interpretable, knowledge-intensive medical MLLMs and could broaden participation beyond labs with private data access.
Major comments (2)
- [Abstract] The central claim that 'Fundus-R1 clearly outperforms multiple baselines, including its generic counterpart (Qwen2.5-VL) and a stronger edition post-trained without using the generated traces' is unsupported by quantitative numbers, error bars, dataset sizes, or baseline-training details in the visible text; without these, the outperformance cannot be assessed and the contribution of the knowledge-aware traces cannot be isolated.
- [Technical contributions / Method] No expert validation, factual-accuracy metric, or human evaluation of the auto-generated reasoning traces is reported for the RAG-based trace generation method. Given that over 94% of the training data carries only image-level labels, any systematic misalignment between the traces and true clinical reasoning would make the reported gains on the three benchmarks attributable to RLVR scaffolding or data scale rather than to the claimed knowledge-aware content.
Minor comments (1)
- [Abstract] The acronym RLVR is used without prior expansion; while context implies reinforcement learning with verifiable rewards, an explicit definition on first use would improve readability.
Simulated Author's Rebuttal
We thank the referee for the thoughtful and constructive review. We address each major comment below and outline the revisions we will make to strengthen the manuscript.
Point-by-point responses
Referee: [Abstract] The central claim that 'Fundus-R1 clearly outperforms multiple baselines, including its generic counterpart (Qwen2.5-VL) and a stronger edition post-trained without using the generated traces' is unsupported by quantitative numbers, error bars, dataset sizes, or baseline-training details in the visible text; without these, the outperformance cannot be assessed and the contribution of the knowledge-aware traces cannot be isolated.
Authors: We agree that the abstract would be clearer with quantitative support. In the revised version we will insert concise performance highlights (e.g., accuracy or F1 gains on FunBench, Omni-Fundus and GMAI-Fundus), together with pointers to the exact dataset sizes, training configurations and the ablation baseline that omits the generated traces. These numbers already appear in the experimental tables and will now be summarized up front, so that the claimed outperformance and the incremental value of the knowledge-aware traces can be assessed directly from the abstract. Revision: yes.
Referee: [Technical contributions / Method] No expert validation, factual-accuracy metric, or human evaluation of the auto-generated reasoning traces is reported for the RAG-based trace generation method. Given that over 94% of the training data carries only image-level labels, any systematic misalignment between the traces and true clinical reasoning would make the reported gains on the three benchmarks attributable to RLVR scaffolding or data scale rather than to the claimed knowledge-aware content.
Authors: We accept the referee's point that the absence of direct validation leaves open the possibility that the gains arise from RLVR scaffolding rather than from the content of the traces. We will revise the method section to include (1) representative examples of the RAG-generated traces together with the ophthalmic knowledge sources used, (2) a factual-accuracy check performed on a held-out subset of traces by comparing them against publicly available clinical guidelines, and (3) an expanded discussion of the ablation that isolates the traces (the "stronger edition post-trained without using the generated traces"). While a large-scale expert review was not feasible within the original scope, the added analysis will allow readers to judge the alignment between the auto-generated reasoning and clinical knowledge. Revision: partial.
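A minimal version of the proposed factual-accuracy check could look like the sketch below: split each generated trace into atomic claims and score each claim against retrieved guideline text. Both helpers are illustrative stand-ins under that assumption, not the authors' protocol.

```python
# Sketch of a trace factual-accuracy check against clinical guidelines.
# split_into_claims is deliberately naive; a real pipeline would extract
# atomic claims with an LLM and verify each via retrieval plus entailment.

def split_into_claims(trace: str) -> list[str]:
    """Naive sentence-level claim splitter."""
    return [s.strip() for s in trace.split(".") if s.strip()]

def supported_by_guidelines(claim: str) -> bool:
    """Stub: retrieve guideline passages and test whether they entail the claim."""
    raise NotImplementedError("plug in a retriever and an entailment checker")

def trace_accuracy(trace: str) -> float:
    """Fraction of a trace's claims that are supported by guideline text."""
    claims = split_into_claims(trace)
    if not claims:
        return 0.0
    return sum(supported_by_guidelines(c) for c in claims) / len(claims)
```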
Circularity Check
No circularity in derivation chain
Full rationale
The paper describes an empirical method using RAG to generate reasoning traces from public data and a generic MLLM, followed by RLVR training with a self-consistency process reward. Performance is evaluated via experiments on external benchmarks against baselines including a no-trace variant. No equations, fitted parameters, or self-referential definitions are present that would make any claimed result equivalent to its inputs by construction. The central gains are presented as experimental outcomes, not derived tautologically from author-defined quantities or self-citations.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem is unclear.
  Passage: "RAG-based method for composing image-specific, knowledge-aware reasoning traces... enhance RLVR with a process reward that encourages self-consistency"
- IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem is unclear.
  Passage: "Extensive experiments on three fundus-reading benchmarks... Fundus-R1 clearly outperforms"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1] Hrvoje Bogunović, Freerk Venhuizen, et al. 2019. RETOUCH: The retinal OCT fluid detection and segmentation benchmark and challenge. TMI 38, 8 (2019), 1858–1874.
- [2] Ling-Ping Cen, Jie Ji, Jian-Wei Lin, Si-Tong Ju, Hong-Jie Lin, Tai-Ping Li, Yun Wang, Jian-Feng Yang, Yu-Fen Liu, Shaoying Tan, et al. 2021. Automatic detection of 39 fundus diseases and conditions in retinal photographs using deep neural networks. Nature Communications 12, 1 (2021), 4828.
- [3] Harrison Chase. 2022. LangChain. https://github.com/langchain-ai/langchain
- [4] Junying Chen, Chi Gui, Ruyi Ouyang, Anningzhe Gao, Shunian Chen, Guiming Hardy Chen, Xidong Wang, Zhenyang Cai, Ke Ji, Xiang Wan, et al. 2024. Towards injecting medical visual knowledge into multimodal LLMs at scale. In EMNLP. 7346–7370.
- [5] Zhe Chen, Weiyun Wang, Yue Cao, Yangzhou Liu, Zhangwei Gao, Erfei Cui, Jinguo Zhu, Shenglong Ye, Hao Tian, Zhaoyang Liu, et al. 2024. Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling. arXiv preprint arXiv:2412.05271 (2024).
- [6] Wei Dai, Peilin Chen, Chanakya Ekbote, and Paul Pu Liang. 2025. QoQ-Med: Building Multimodal Clinical Foundation Models with Domain-Aware GRPO Training. In NeurIPS.
- [7] DateCazuki. 2022. TOP: Classifier using fundus image dataset provided by Tsukazaki Hospital. https://github.com/DateCazuki/Fundus_Diagnosis. Dataset of fundus images from Tsukazaki Hospital, used for multi-disease classification. Accessed: 2025-04-10.
- [8] Haodong Duan, Junming Yang, Yuxuan Qiao, Xinyu Fang, Lin Chen, Yuan Liu, Xiaoyi Dong, Yuhang Zang, Pan Zhang, Jiaqi Wang, et al. 2024. VLMEvalKit: An open-source toolkit for evaluating large multi-modality models. In Proceedings of the 32nd ACM International Conference on Multimedia. 11198–11201.
- [9] Peyman Gholami, Priyanka Roy, Mohana Kuppuswamy Parthasarathy, and Vasudevan Lakshminarayanan. 2020. OCTID: Optical coherence tomography image database. Computers & Electrical Engineering 81 (2020), 106532.
- [10] Yutao Hu, Tianbin Li, et al. 2024. OmniMedVQA: A new large-scale comprehensive evaluation benchmark for medical LVLM. In CVPR.
- [11] Xiaoling Huang, Xiangyin Kong, Ziyan Shen, Jing Ouyang, Yunxiang Li, Kai Jin, and Juan Ye. 2023. GRAPE: A multi-modal dataset of longitudinal follow-up visual field and fundus images for glaucoma management. Scientific Data 10, 1 (2023), 520.
- [12] Daniel S. Kermany, Michael Goldbaum, Wenjia Cai, Carolina C. S. Valentim, Huiying Liang, Sally L. Baxter, Alex McKeown, Ge Yang, Xiaokang Wu, Fangbing Yan, et al. 2018. Identifying Medical Diagnoses and Treatable Diseases by Image-Based Deep Learning. Cell 172, 5 (Feb. 2018), 1122–1131.e9.
- [13] Hoda Kheradfallah, Janarthanam Jothi Balaji, Varadharajan Jayakumar, Mohammed Abdul Rasheed, and Vasudevan Lakshminarayanan. 2022. Annotation and segmentation of diabetic retinopathy lesions: an explainable AI application. In Medical Imaging 2022: Computer-Aided Diagnosis, Vol. 12033. SPIE, 502–511.
- [14] Mikhail Kulyabin, Aleksei Zhdanov, et al. 2024. OCTDL: Optical coherence tomography dataset for image-based deep learning methods. Scientific Data 11, 1 (2024), 365.
- [15] Jiajia Li, Zhouyu Guan, et al. 2024. Integrated image-based deep learning and language models for primary diabetes care. Nature Medicine 30, 10 (2024), 2886–2896.
- [16] Ning Li, Tao Li, Chunyu Hu, Kai Wang, and Hong Kang. 2021. A benchmark of ocular disease intelligent recognition: One shot for multi-disease detection. In BMO.
- [17] Sijing Li, Tianwei Lin, Lingshuai Lin, Wenqiao Zhang, Jiang Liu, Xiaoda Yang, Juncheng Li, Yucheng He, Xiaohui Song, Jun Xiao, Yueting Zhuang, and Beng Chin Ooi. 2025. EyecareGPT: Boosting Comprehensive Ophthalmology Understanding with Tailored Dataset, Benchmark and Model. In ACMMM.
- [18] Tao Li, Yingqi Gao, Kai Wang, Song Guo, Hanruo Liu, and Hong Kang. 2019. Diagnostic assessment of deep learning algorithms for diabetic retinopathy screening. Information Sciences 501 (2019), 511–522.
- [19] Xirong Li, Yang Zhou, Jie Wang, Hailan Lin, Jianchun Zhao, Dayong Ding, Weihong Yu, and Youxin Chen. 2021. Multi-modal multi-instance learning for retinal disease recognition. In ACMMM.
- [20] Zihan Li, Diping Song, Zefeng Yang, Deming Wang, Fei Li, Xiulan Zhang, Paul E. Kinahan, and Yu Qiao. 2025. VisionUnite: A Vision-Language Foundation Model for Ophthalmology Enhanced with Clinical Knowledge. IEEE Transactions on Pattern Analysis and Machine Intelligence (2025).
- [21] Tianwei Lin, Wenqiao Zhang, Sijing Li, Yuqian Yuan, Binhe Yu, Haoyuan Li, Wanggui He, Hao Jiang, Mengze Li, Xiaohui Song, Siliang Tang, Jun Xiao, Hui Lin, Yueting Zhuang, and Beng Chin Ooi. 2025. HealthGPT: A Medical Large Vision-Language Model for Unifying Comprehension and Generation via Heterogeneous Knowledge Adaptation. In ICML.
- [22] Ruhan Liu, Xiangning Wang, Qiang Wu, Ling Dai, Xi Fang, Tao Yan, Jaemin Son, Shiqi Tang, Jiang Li, Zijian Gao, et al. 2022. DeepDRiD: Diabetic retinopathy—grading and image quality estimation challenge. Patterns 3, 6 (2022).
- [23] Xinyao Liu and Diping Song. 2025. Constructing Ophthalmic MLLM for Positioning-Diagnosis Collaboration Through Clinical Cognitive Chain Reasoning. In ICCV.
- [24] Samiksha Pachade, Prasanna Porwal, Dhanshree Thulkar, Manesh Kokare, Girish Deshmukh, Vivek Sahasrabuddhe, Luca Giancardo, Gwenolé Quellec, and Fabrice Mériaudeau. 2021. Retinal fundus multi-disease image dataset (RFMiD): A dataset for multi-disease detection research. Data 6, 2 (2021), 14.
- [25] Jiazhen Pan, Che Liu, Junde Wu, Fenglin Liu, Jiayuan Zhu, Hongwei Bran Li, Chen Chen, Cheng Ouyang, and Daniel Rueckert. 2025. MedVLM-R1: Incentivizing Medical Reasoning Capability of Vision-Language Models (VLMs) via Reinforcement Learning. In MICCAI.
- [26] Prasanna Porwal, Samiksha Pachade, Ravi Kamble, Manesh Kokare, Girish Deshmukh, Vivek Sahasrabuddhe, and Fabrice Meriaudeau. 2018. Indian diabetic retinopathy image dataset (IDRiD): A database for diabetic retinopathy screening research. Data 3, 3 (2018), 25.
- [27] Zhenyue Qin, Yu Yin, Dylan Campbell, Xuansheng Wu, Ke Zou, Yih-Chung Tham, Ninghao Liu, Xiuzhen Zhang, and Qingyu Chen. 2025. LMOD: A Large Multimodal Ophthalmology Dataset and Benchmark for Large Vision-Language Models. In NAACL.
- [28] Jianing Qiu, Jian Wu, Hao Wei, Peilun Shi, Minqing Zhang, Yunyun Sun, Lin Li, Hanruo Liu, Hongyi Liu, Simeng Hou, et al. 2024. Development and validation of a multimodal multitask vision foundation model for generalist ophthalmic artificial intelligence. NEJM AI 1, 12 (2024), AIoa2300221.
- [29] Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y. K. Li, Yang Wu, et al. 2024. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300 (2024).
- [30] Guangming Sheng, Chi Zhang, Zilingfeng Ye, Xibin Wu, Wang Zhang, Ru Zhang, Yanghua Peng, Haibin Lin, and Chuan Wu. 2025. HybridFlow: A flexible and efficient RLHF framework. In EuroSys. 1279–1297.
- [31] Saman Sotoudeh-Paima, Ata Jodeiri, Fedra Hajizadeh, and Hamid Soltanian-Zadeh. 2022. Multi-scale convolutional neural network for automated AMD classification using retinal OCT images. Computers in Biology and Medicine 144 (2022), 105368.
- [32] Qwen Team. 2025. Qwen2.5-VL. https://qwenlm.github.io/blog/qwen2.5-vl/
- [33] Qwen Team. 2025. Qwen3-Max: Just Scale It.
- [34] Rongsheng Wang. 2025. Med-R1: Encourage Medical LLM to engage in deep thinking similar to DeepSeek-R1. https://github.com/WangRongsheng/Med-R1
- [35] Weisen Wang, Xirong Li, Zhiyan Xu, Weihong Yu, Jianchun Zhao, Dayong Ding, and Youxin Chen. 2022. Learning Two-Stream CNN for Multi-Modal Age-Related Macular Degeneration Categorization. IEEE Journal of Biomedical and Health Informatics 26, 8 (2022), 4111–4122.
- [36] Qijie Wei, Xirong Li, Weihong Yu, Xiao Zhang, Yongpeng Zhang, Bojie Hu, Bin Mo, Di Gong, Ning Chen, Dayong Ding, et al. 2021. Learn to segment retinal lesions and beyond. In ICPR.
- [37] Qijie Wei, Kaiheng Qian, and Xirong Li. 2025. FunBench: Benchmarking Fundus Reading Skills of MLLMs. In MICCAI.
- [38]
- [39] Guowei Xu, Peng Jin, Ziang Wu, Hao Li, Yibing Song, Lichao Sun, and Li Yuan. LLaVA-CoT: Let vision language models reason step-by-step. In ICCV.
- [41] Weiwen Xu, Hou Pong Chan, Long Li, Mahani Aljunied, Ruifeng Yuan, Jianyu Wang, Chenghao Xiao, Guizhen Chen, Chaoqun Liu, Zhaodonghui Li, et al. 2025. Lingshu: A Generalist Foundation Model for Unified Multimodal Medical Understanding and Reasoning. arXiv preprint arXiv:2506.07044 (2025).
- [43] Huanjin Yao, Jiaxing Huang, Wenhao Wu, Jingyi Zhang, Yibo Wang, Shunyu Liu, Yingjie Wang, Yuxin Song, Haocheng Feng, Li Shen, et al. 2025. Mulberry: Empowering MLLM with o1-like reasoning and reflection via collective Monte Carlo tree search. In NeurIPS.
- [44] Jin Ye, Guoan Wang, Yanjun Li, Zhongying Deng, Wei Li, Tianbin Li, Haodong Duan, Ziyan Huang, Yanzhou Su, Benyou Wang, et al. 2024. GMAI-MMBench: A Comprehensive Multimodal Evaluation Benchmark Towards General Medical AI. In NeurIPS.
- [45] Xin Ye, Shucheng He, Xiaxing Zhong, Jiafeng Yu, Shangchao Yang, Yingjiao Shen, Yiqi Chen, Yaqi Wang, Xingru Huang, and Lijun Shen. 2023. OIMHS: An optical coherence tomography image dataset based on macular hole manual segmentation. Scientific Data 10, 1 (2023), 769.
- [46] Yaowei Zheng, Richong Zhang, Junhao Zhang, Yanhan Ye, Zheyan Luo, Zhangchi Feng, and Yongqiang Ma. 2024. LlamaFactory: Unified Efficient Fine-Tuning of 100+ Language Models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics. http://arxiv.org/abs/2403.13372
- [47] Wenhui Zhu, Xin Li, Xiwen Chen, Peijie Qiu, Vamsi Krishna Vasa, Xuanzhao Dong, Yanxi Chen, Natasha Lepore, Oana Dumitrascu, Yi Su, et al. 2025. RetinalGPT: A Retinal Clinical Preference Conversational Assistant Powered by Large Vision-Language Models. arXiv preprint arXiv:2503.03987 (2025).