arxiv: 2604.25720 · v1 · submitted 2026-04-28 · 💻 cs.CV · cs.CL

Recognition: unknown

Toward Multimodal Conversational AI for Age-Related Macular Degeneration

Ran Gu , Benjamin Hou , M\'elanie H\'ebert , Asmita Indurkar , Yifan Yang , Emily Y. Chew , Tiarn\'an D. L. Keenan , Zhiyong Lu

Authors on Pith no claims yet

Pith reviewed 2026-05-07 16:50 UTC · model grok-4.3

classification 💻 cs.CV cs.CL

keywords age-related macular degenerationmultimodal large language modelsfundus photographyvisual question answeringclinical dialogueAMD diagnosissimulated conversations

0 comments

The pith

OcularChat, an MLLM fine-tuned on simulated patient-physician dialogues, classifies age-related macular degeneration from fundus photos with high accuracy and provides interactive clinical explanations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces OcularChat to combine image analysis with conversational reasoning for diagnosing AMD. It fine-tunes a multimodal large language model using hundreds of thousands of simulated dialogues paired with color fundus photographs. This allows the model to not only detect features like drusen and pigment changes but also explain its reasoning in dialogue. Sympathetic readers would care because current deep learning tools give only static labels, while this approach supports real clinical use through explanations and interaction. The results show it beats other MLLMs and scores higher with ophthalmologists on grading rubrics.

Core claim

OcularChat, fine-tuned from Qwen2.5-VL on 705,850 simulated dialogues and 46,167 CFPs, achieves accuracies of 0.954, 0.849, and 0.678 for advanced AMD, pigmentary abnormalities, and drusen size on AREDS data, outperforming existing MLLMs, and receives higher mean scores from ophthalmologists (3.503 vs 2.833 for advanced AMD, etc.) on a 5-point rubric while enabling diagnostic reasoning and interactive dialogue.

What carries the argument

OcularChat, the fine-tuned multimodal LLM that integrates visual question answering on color fundus photographs with clinically meaningful dialogue generated from simulated patient-physician interactions.

If this is right

MLLMs can produce accurate AMD severity classifications from images while generating reasoned predictions.
The model supports patient counseling and clinical decision-making through interactive explanations.
Performance remains strong across datasets like AREDS and AREDS2.
Ophthalmologist evaluations confirm higher clinical utility compared to baseline models.
Such systems enable interpretable image-based diagnosis beyond static predictions.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar fine-tuning approaches could extend to other retinal diseases or general medical imaging tasks.
Integration into electronic health records might allow real-time assistance during patient visits.
Further training on actual patient interactions could improve generalization to edge cases not captured in simulations.
Deployment in telemedicine platforms could make expert-level AMD assessment more accessible in underserved areas.

Load-bearing premise

The simulated patient-physician dialogues accurately represent the questions, reasoning, and edge cases that occur in real clinical practice with actual patients and ophthalmologists.

What would settle it

If real-world testing with live ophthalmologist-patient interactions on unseen cases shows OcularChat's accuracy dropping below existing MLLMs or receiving lower clinical scores than baselines, the claim of clinical usefulness would be falsified.

read the original abstract

Despite strong performance of deep learning models in retinal disease detection, most systems produce static predictions without clinical reasoning or interactive explanation. Recent advances in multimodal large language models (MLLMs) integrate diagnostic predictions with clinically meaningful dialogue to support clinical decision-making and patient counseling. In this study, OcularChat, an MLLM, was fine-tuned from Qwen2.5-VL using simulated patient-physician dialogues to diagnose age-related macular degeneration (AMD) through visual question answering on color fundus photographs (CFPs). A total of 705,850 simulated dialogues paired with 46,167 CFPs were generated to train OcularChat to identify key AMD features and produce reasoned predictions. OcularChat demonstrated strong classification performance in AREDS, achieving accuracies of 0.954, 0.849, and 0.678 for the three diagnostic tasks: advanced AMD, pigmentary abnormalities, and drusen size, significantly outperforming existing MLLMs. On AREDS2, OcularChat remained the top-performing method on all tasks. Across three independent ophthalmologist graders, OcularChat achieved higher mean scores than a strong baseline model for advanced AMD (3.503 vs. 2.833), pigmentary abnormalities (3.272 vs. 2.828), drusen size (3.064 vs. 2.433), and overall impression (2.978 vs. 2.464) on a 5-point clinical grading rubric. Beyond strong objective performance in AMD severity classification, OcularChat demonstrated the ability to provide diagnostic reasoning, clinically relevant explanations, and interactive dialogue, with high performance in subjective ophthalmologist evaluation. These findings suggest that MLLMs may enable accurate, interpretable, and clinically useful image-based diagnosis and classification of AMD.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

OcularChat posts strong AMD classification numbers and better grader scores than baselines after fine-tuning on simulated dialogues, but the lack of any validation for those simulations is the load-bearing gap.

read the letter

The paper fine-tunes Qwen2.5-VL on 705,850 simulated patient-physician dialogues paired with 46,167 color fundus photos to create OcularChat for conversational AMD diagnosis. It reports accuracies of 0.954 on advanced AMD, 0.849 on pigmentary abnormalities, and 0.678 on drusen size in AREDS, staying ahead of other MLLMs on AREDS2 as well. Three ophthalmologists also gave its explanations and overall impressions higher marks on a 5-point rubric than the baseline model.

Referee Report

3 major / 2 minor

Summary. The paper introduces OcularChat, an MLLM fine-tuned from Qwen2.5-VL on 705,850 simulated patient-physician dialogues paired with 46,167 CFPs, for interactive AMD diagnosis via visual question answering on color fundus photographs. It reports classification accuracies of 0.954/0.849/0.678 on AREDS for advanced AMD, pigmentary abnormalities, and drusen size (outperforming other MLLMs), similar top performance on AREDS2, and superior mean scores from three ophthalmologist graders on a 5-point rubric for reasoning, explanations, and overall impression.

Significance. If the simulated dialogues prove representative of real clinical interactions, the work could meaningfully advance interpretable multimodal AI for retinal diagnostics by combining high classification accuracy with clinically relevant dialogue and explanations. Strengths include evaluation on public benchmarks (AREDS, AREDS2) and human expert grading; however, the absence of simulation validation limits claims of real-world generalization and clinical utility.

major comments (3)

[Methods (Dialogue Simulation)] Methods section on dialogue generation: No details are provided on how the 705,850 simulated dialogues were created, including the underlying generator, prompt strategies, diversity/edge-case controls, or any validation against real clinical transcripts or patient-physician interactions. This is load-bearing for the central claims, as the reported accuracies, outperformance, and subjective grader scores all depend on the simulations faithfully capturing real-world question distributions and reasoning patterns.
[Results (Classification Performance)] Results section on classification performance: The accuracies (0.954, 0.849, 0.678 on AREDS) and outperformance over baselines are presented without statistical testing (p-values, confidence intervals, or multiple-comparison corrections), making it impossible to assess whether differences are significant or could arise from simulation-test alignment.
[Human Evaluation] Human evaluation section: The ophthalmologist grading results (e.g., 3.503 vs. 2.833 for advanced AMD) lack details on blinding, inter-rater agreement (e.g., Fleiss' kappa), rubric calibration, or how graders were instructed, weakening the reliability of the subjective superiority claims.

minor comments (2)

[Abstract] Abstract: Baseline MLLM names and their exact scores are not listed, only that OcularChat 'significantly outperforming' them; adding this would improve readability.
[Discussion] The paper would benefit from a dedicated limitations subsection explicitly addressing potential simulation bias and generalization risks.

Simulated Author's Rebuttal

3 responses · 1 unresolved

We thank the referee for the constructive and detailed comments, which highlight important areas for improving the clarity and rigor of our work. We address each major comment point by point below, indicating the revisions we will make to the manuscript.

read point-by-point responses

Referee: [Methods (Dialogue Simulation)] Methods section on dialogue generation: No details are provided on how the 705,850 simulated dialogues were created, including the underlying generator, prompt strategies, diversity/edge-case controls, or any validation against real clinical transcripts or patient-physician interactions. This is load-bearing for the central claims, as the reported accuracies, outperformance, and subjective grader scores all depend on the simulations faithfully capturing real-world question distributions and reasoning patterns.

Authors: We agree that the Methods section requires substantially more detail on the dialogue simulation process to support the central claims. In the revised manuscript, we will add a dedicated subsection describing the full pipeline: the base LLM used for generation, the specific prompt strategies and templates (including system prompts for role-playing patient and physician), mechanisms for controlling diversity and incorporating edge cases (e.g., atypical presentations or ambiguous questions), and how dialogues were paired with the 46,167 CFPs. We will also include representative examples of simulated dialogues in the supplementary materials. Regarding validation against real clinical transcripts, we note that no such direct comparison was performed, as large-scale privacy-compliant real-world dialogue datasets are not publicly available; we will explicitly state this as a limitation and describe how the simulations were grounded in AMD clinical guidelines and expert ophthalmologist input to approximate realistic question distributions and reasoning patterns. revision: yes
Referee: [Results (Classification Performance)] Results section on classification performance: The accuracies (0.954, 0.849, 0.678 on AREDS) and outperformance over baselines are presented without statistical testing (p-values, confidence intervals, or multiple-comparison corrections), making it impossible to assess whether differences are significant or could arise from simulation-test alignment.

Authors: We acknowledge that the absence of statistical testing limits the interpretability of the performance comparisons. In the revised Results section and associated tables, we will add bootstrap-derived 95% confidence intervals for all accuracy metrics and perform appropriate statistical tests (e.g., McNemar's test for paired comparisons on the classification tasks) with Bonferroni correction for multiple comparisons. These additions will allow readers to assess the significance of OcularChat's outperformance over the baseline MLLMs. revision: yes
Referee: [Human Evaluation] Human evaluation section: The ophthalmologist grading results (e.g., 3.503 vs. 2.833 for advanced AMD) lack details on blinding, inter-rater agreement (e.g., Fleiss' kappa), rubric calibration, or how graders were instructed, weakening the reliability of the subjective superiority claims.

Authors: We agree that greater transparency is needed for the human evaluation. In the revised manuscript, we will expand the Human Evaluation section to specify: the blinding protocol (graders were not informed of model identities or which outputs came from which system), the exact instructions and rubric calibration process provided to the three ophthalmologists, and the inter-rater reliability (we will compute and report Fleiss' kappa on the 5-point scores). These details will be added to strengthen the credibility of the subjective grading results. revision: yes

standing simulated objections not resolved

Direct validation of the simulated dialogues against real clinical transcripts cannot be performed, as no suitable large-scale, de-identified real-world patient-physician dialogue datasets for AMD are available to the authors.

Circularity Check

0 steps flagged

No circularity: performance evaluated on external public datasets with no definitional or fitted-input reductions

full rationale

The paper fine-tunes OcularChat on 705,850 simulated dialogues paired to 46,167 CFPs and then reports classification accuracies (0.954/0.849/0.678) plus ophthalmologist rubric scores on the independent AREDS and AREDS2 datasets. No equations, fitted parameters, or self-referential definitions appear in the provided text. The evaluation metrics are computed on external benchmarks rather than on quantities defined from the training simulations themselves, so the reported results do not reduce to the inputs by construction. The derivation chain therefore remains self-contained against external data.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no explicit free parameters, axioms, or invented entities; the approach implicitly assumes that fine-tuning on generated dialogues transfers to clinical utility without additional domain-specific constraints.

pith-pipeline@v0.9.0 · 5664 in / 1314 out tokens · 50291 ms · 2026-05-07T16:50:31.445566+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

6 extracted references · 6 canonical work pages · 1 internal anchor

[1]

In addiMon, Med-Gemma40, a VLM ﬁne-tuned for medical applicaMons, was included to provide a domain-adapted baseline model for comparison

Fine-tuned model inference Image: Question: What can you inspect from this retinal image? Arethere any pigmentation changes in the retina?… The exam indicates that thepatientdoes not have advanced age-related macular degeneration (ADVAMD), and there are no significant drusen deposits.No, there is no sign to show that the patient has pigmentary changes.… 1...

work page doi:10.1016/s2214-109x(13)70145-1 2020
[2]

Development and ValidaMon of a Deep Learning Algorithm for DetecMon of DiabeMc ReMnopathy in ReMnal Fundus Photographs

Gulshan V, Peng L, Coram M, et al. Development and ValidaMon of a Deep Learning Algorithm for DetecMon of DiabeMc ReMnopathy in ReMnal Fundus Photographs. JAMA. 2016;316(22):2402-2410. doi:10.1001/jama.2016.17216 10. Ting DSW, Cheung CY , Lim G, et al. Development and ValidaMon of a Deep Learning System for DiabeMc ReMnopathy and Related Eye Diseases Usin...

work page doi:10.1001/jama.2016.17216 2016
[3]

DeepLensNet: Deep Learning Automated Diagnosis and QuanMtaMve ClassiﬁcaMon of Cataract Type and Severity

Keenan TDL, Chen Q, Agrón E, et al. DeepLensNet: Deep Learning Automated Diagnosis and QuanMtaMve ClassiﬁcaMon of Cataract Type and Severity. Ophthalmology. 2022;129(5):571-584. doi:10.1016/j.ophtha.2021.12.017 18. Tham YC, Goh JHL, Anees A, et al. DetecMng visually signiﬁcant cataract using reMnal photograph-based deep learning. Nat Aging. 2022;2(3):264-...

work page doi:10.1016/j.ophtha.2021.12.017 2022
[4]

Hidden ﬂaws behind expert-level accuracy of mulMmodal GPT-4 vision in medicine

Jin Q, Chen F, Zhou Y , et al. Hidden ﬂaws behind expert-level accuracy of mulMmodal GPT-4 vision in medicine. NPJ Digit Med. 2024;7(1):190. Published 2024 Jul 23. doi:10.1038/s41746-024-01185-7 26. Hamamci IE, Er S, Sekuboyina A, et al. GenerateCT: Text-condiMonal generaMon of 3D chest CT volumes. arXiv preprint arXiv:2305.16037. Published May 26, 2023. ...

work page doi:10.1038/s41746-024-01185-7 2024
[5]

MM-REACT: PrompMng ChatGPT for MulMmodal Reasoning and AcMon

Yang Z, Li L, Wang J, Lin K, Azarnasab E, Ahmed F, Liu Z, Liu C, Zeng M, Wang L. MM-REACT: PrompMng ChatGPT for MulMmodal Reasoning and AcMon. arXiv. Preprint posted March 20, 2023. arXiv:2303.11381. Accessed February 6, 2026. hups://arxiv.org/abs/2303.11381 34. Qin Z, Li R, Chen C, et al. LMOD+: A Comprehensive MulMmodal Dataset and Benchmark for Develop...

work page doi:10.1001/archopht.123.11.1570 2023
[6]

MedGemma Technical Report

Sellergren A, Kazemzadeh S, Jaroensri T, et al. MedGemma Technical Report. arXiv. Preprint last revised July 12, 2025. arXiv:2507.05201. doi:10.48550/arXiv.2507.05201. Accessed February 6, 2026. hups://arxiv.org/abs/2507.05201 41. Tatham AJ, Medeiros FA. DetecMng Structural Progression in Glaucoma with OpMcal Coherence Tomography. Ophthalmology. 2017;124(...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2507.05201 2025