PubMed-Ophtha: An open resource for training ophthalmology vision-language models on scientific literature

Carsten Eickhoff; Philipp Berens; Verena Jasmin Hallitschke

arXiv: 2605.02720 · v1 · submitted 2026-05-04 · cs.CV · cs.CL

Pith reviewed 2026-05-08 18:42 UTC · model grok-4.3

keywords: ophthalmology · vision-language models · image-caption dataset · PubMed Central · figure extraction · panel decomposition · medical imaging

The pith

PubMed-Ophtha releases 102,023 ophthalmology image-caption pairs extracted at full resolution from 15,842 scientific articles.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a large hierarchical dataset to fill the gap in high-quality image-text resources needed for training vision-language models in ophthalmology. Figures are pulled directly from article PDFs at full resolution, broken into individual panels with identifiers, and paired with split captions. Each image receives labels for imaging modality and the presence of annotation marks. This scale and structure allow models to learn from real medical literature rather than limited curated collections.

Core claim

We present PubMed-Ophtha, a hierarchical dataset of 102,023 ophthalmological image-caption pairs extracted from 15,842 open-access articles in PubMed Central. Figures are extracted directly from article PDFs at full resolution and decomposed into their constituent panels, panel identifiers, and individual images. Each image is annotated with its imaging modality and a mark status. Figure captions are split into panel-level subcaptions using a two-step LLM approach.
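
The hierarchy described here (figure, panels, panel identifiers, individual images with a modality and a mark status) can be pictured as nested records. The following is a minimal Python sketch under stated assumptions: the class and field names (`Figure`, `Panel`, `ImageRegion`, `has_marks`, `article_id`) are hypothetical and are not taken from the released dataset schema. Only the four modality categories come from the paper.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

# (x_min, y_min, x_max, y_max) in pixels of the full-resolution figure
Box = Tuple[float, float, float, float]

# The four image type categories named in the paper
MODALITIES = ("CFP", "OCT", "Retinal Imaging", "Other")


@dataclass
class ImageRegion:
    box: Box
    modality: str      # one of MODALITIES
    has_marks: bool    # True if arrows or similar annotation marks are present

    def __post_init__(self) -> None:
        if self.modality not in MODALITIES:
            raise ValueError(f"unknown modality: {self.modality!r}")


@dataclass
class Panel:
    box: Box
    identifier: Optional[str]   # e.g. "A"; None if the panel has no identifier
    subcaption: str
    images: List[ImageRegion] = field(default_factory=list)


@dataclass
class Figure:
    article_id: str             # hypothetical article key
    caption: str
    panels: List[Panel] = field(default_factory=list)

    def image_caption_pairs(self) -> List[Tuple[ImageRegion, str]]:
        """Flatten the hierarchy into (image, subcaption) pairs."""
        return [(img, p.subcaption) for p in self.panels for img in p.images]


# Hypothetical example figure with two panels, one image each
fig = Figure(
    article_id="PMC-EXAMPLE",
    caption="(A) Fundus photograph. (B) OCT scan with arrows.",
    panels=[
        Panel((0, 0, 100, 100), "A", "Fundus photograph.",
              [ImageRegion((5, 5, 95, 95), "CFP", False)]),
        Panel((110, 0, 210, 100), "B", "OCT scan with arrows.",
              [ImageRegion((115, 5, 205, 95), "OCT", True)]),
    ],
)
pairs = fig.image_caption_pairs()
```

Keeping the panel as the unit that owns both the subcaption and its images means one flattening step yields image-caption pairs, while the figure-level caption stays available for context.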

What carries the argument

The PubMed-Ophtha dataset pipeline, which extracts figures from PDFs at full resolution, decomposes them into panels, classifies imaging modalities, and splits captions via LLM into panel-specific subcaptions.

If this is right

Panel-level subcaptions let models learn from multi-panel figures, which datasets with a single caption per figure cannot resolve to individual panels.
Modality and mark annotations support training of models that distinguish color fundus photography from optical coherence tomography and ignore arrows or labels.
Release of ground-truth annotations, trained detection models, and the full extraction pipeline allows other groups to extend or audit the resource.
The dataset scale of 102,023 pairs provides sufficient volume for pre-training or fine-tuning large vision-language architectures in this domain.
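
As an illustration of how the modality and mark annotations might be used downstream, the sketch below selects images of one modality and optionally drops those that carry annotation marks. The record keys (`modality`, `has_marks`, `subcaption`) are assumed for illustration and may not match the released column names. The records are made up.

```python
from typing import Dict, Iterable, List


def select_images(records: Iterable[Dict], modality: str,
                  allow_marks: bool = False) -> List[Dict]:
    """Keep records of one modality; drop marked images unless allow_marks is True."""
    return [
        r for r in records
        if r["modality"] == modality and (allow_marks or not r["has_marks"])
    ]


# Hypothetical records, not taken from the dataset
records = [
    {"modality": "CFP", "has_marks": False, "subcaption": "Fundus photograph."},
    {"modality": "CFP", "has_marks": True, "subcaption": "Fundus photograph, arrow marks lesion."},
    {"modality": "OCT", "has_marks": False, "subcaption": "OCT B-scan."},
]
clean_cfp = select_images(records, "CFP")
all_cfp = select_images(records, "CFP", allow_marks=True)
```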

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same extraction approach could be rerun on newer PubMed updates to keep the resource current without manual effort.
Models trained here may transfer to clinical image interpretation tasks even if the original papers contain only research-grade images.
Similar pipelines applied to other medical fields could reduce reliance on expensive manual dataset creation across healthcare AI.
If downstream models show clear gains on held-out ophthalmology benchmarks, the dataset would demonstrate that literature-derived pairs are a viable alternative to expert-annotated collections.

Load-bearing premise

Automated PDF figure extraction, panel decomposition, and LLM-based caption splitting produce image-text pairs accurate enough and free of systematic errors to support effective training of ophthalmology vision-language models.

What would settle it

Train a vision-language model on the released PubMed-Ophtha pairs and measure whether it improves ophthalmology-specific tasks such as panel-level captioning or modality-aware visual question answering compared with models trained on smaller or noisier alternatives.

Figures

Figures reproduced from arXiv: 2605.02720 by Carsten Eickhoff, Philipp Berens, Verena Jasmin Hallitschke.

**Figure 1.** Overview of the dataset extraction pipeline. (A) Articles are filtered by keywords to select those relevant to ophthalmological retinal imaging. (B) A heuristic detects figures and their captions in the article PDF and extracts them at full resolution. (C) Additional figure-level information, such as in-text mentions, is retrieved from the BIOMEDICA dataset. (D) The final dataset contains individual panels…

**Figure 2.** Detection performance and failure cases. (A) Example detections from the test set, showing panel (teal), image (purple), and panel identifier (blue) bounding boxes across three figures of varying complexity. (B) Precision-recall curves at an IoU threshold of 0.75 for (i) panel and panel identifier detection and (ii) image type detection across the four image type categories (CFP, OCT, Retinal Imaging, Othe…

**Figure 3.** Caption splitting and subcaption assignment. (A) Original figure with its full caption. (B) Result of the caption splitting step: the full caption is decomposed into panel-level subcaptions, each associated with a panel identifier. (C) Result of the panel assembly step: each subcaption is assigned to its corresponding panel, and panel identifier locations are matched to the detected panel bounding boxes. d…
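
The panel assembly step in the Figure 3 caption can be sketched as a geometric matching problem. The code below is an illustration only, not the authors' implementation: it assumes the subcaptions have already been produced (by an LLM in the paper) and keyed by identifier label, and it uses a simple greedy rule, sending each detected identifier to the nearest panel box that is still free. The function names and the greedy rule are assumptions.

```python
import math
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def point_to_box_distance(x: float, y: float, box: Box) -> float:
    """Euclidean distance from a point to a box; 0.0 if the point lies inside."""
    dx = max(box[0] - x, 0.0, x - box[2])
    dy = max(box[1] - y, 0.0, y - box[3])
    return math.hypot(dx, dy)


def assign_subcaptions(panels: List[Box],
                       identifiers: List[Tuple[str, Box]],
                       subcaptions: Dict[str, str]) -> Dict[int, Tuple[str, str]]:
    """Map each panel index to (identifier label, subcaption text).

    Each detected identifier goes to the closest panel that is still free.
    """
    assigned: Dict[int, Tuple[str, str]] = {}
    for label, id_box in identifiers:
        # Use the centre of the identifier box as its location
        cx = (id_box[0] + id_box[2]) / 2
        cy = (id_box[1] + id_box[3]) / 2
        free = [i for i in range(len(panels)) if i not in assigned]
        if not free:
            break
        best = min(free, key=lambda i: point_to_box_distance(cx, cy, panels[i]))
        assigned[best] = (label, subcaptions.get(label, ""))
    return assigned


# Hypothetical two-panel figure; identifier "B" sits just outside its panel
panels = [(0, 0, 100, 100), (110, 0, 210, 100)]
identifiers = [("A", (2, 2, 12, 12)), ("B", (104, 2, 109, 12))]
subcaptions = {"A": "Fundus photograph.", "B": "OCT scan."}
result = assign_subcaptions(panels, identifiers, subcaptions)
```

A greedy nearest-box rule is easy to audit but can mis-assign identifiers in dense layouts. A global assignment over all identifier and panel pairs would be a natural refinement.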

**Figure 4.** Examples of extracted panels with their identifiers, subcaptions, and the detected images.

Original abstract

Vision-language models hold considerable promise for ophthalmology, but their development depends on large-scale, high-quality image-text datasets that remain scarce. We present PubMed-Ophtha, a hierarchical dataset of 102,023 ophthalmological image-caption pairs extracted from 15,842 open-access articles in PubMed Central. Unlike existing datasets, figures are extracted directly from article PDFs at full resolution and decomposed into their constituent panels, panel identifiers, and individual images. Each image is annotated with its imaging modality -- color fundus photography, optical coherence tomography, retinal imaging, or other -- and a mark status indicating the presence of annotation marks such as arrows. Figure captions are split into panel-level subcaptions using a two-step LLM approach, achieving a mean average sentence BLEU score of 0.913 on human-annotated data. Panel and image detection models reach a mAP@0.50 of 0.909 and 0.892, respectively, and figure extraction achieves a median IoU of 0.997. To support reproducibility, we additionally release the human-annotated ground-truth data, all trained models, and the full dataset generation pipeline.
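
To make the caption-splitting score concrete, here is a simplified sentence-level BLEU in plain Python: clipped n-gram precisions up to 4-grams, their geometric mean, and a brevity penalty. This is a sketch under assumptions. The tokenization, smoothing, and averaging scheme behind the reported 0.913 are not stated here, so the whitespace tokenizer and the unsmoothed formula below are stand-ins, and the example sentences are made up.

```python
import math
from collections import Counter
from typing import List


def ngrams(tokens: List[str], n: int) -> Counter:
    """Count all contiguous n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def sentence_bleu(reference: str, hypothesis: str, max_n: int = 4) -> float:
    """Unsmoothed single-reference BLEU with whitespace tokenization.

    Returns 0.0 if any n-gram order has no match, which includes
    hypotheses shorter than max_n tokens.
    """
    ref, hyp = reference.split(), hypothesis.split()
    if not hyp:
        return 0.0
    log_precision_sum = 0.0
    for n in range(1, max_n + 1):
        hyp_counts, ref_counts = ngrams(hyp, n), ngrams(ref, n)
        total = sum(hyp_counts.values())
        # Clip each hypothesis n-gram count by its count in the reference
        clipped = sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
        if total == 0 or clipped == 0:
            return 0.0
        log_precision_sum += math.log(clipped / total) / max_n
    # Brevity penalty punishes hypotheses shorter than the reference
    bp = 1.0 if len(hyp) >= len(ref) else math.exp(1 - len(ref) / len(hyp))
    return bp * math.exp(log_precision_sum)


reference = "the macula shows drusen in the left eye"
exact = sentence_bleu(reference, reference)
near = sentence_bleu(reference, "the macula shows drusen in the right eye")
```

For reported results, an established implementation with documented settings is preferable to a hand-rolled one, since BLEU values depend strongly on tokenization and smoothing.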

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

PubMed-Ophtha is a practical dataset release with strong pipeline metrics and full reproducibility artifacts, though its downstream value for VLMs remains to be shown.

The paper delivers a large ophthalmology image-text dataset extracted from PubMed Central papers, with full-resolution figures broken into panels, modality labels, and split captions. They report concrete numbers on the extraction steps: median IoU of 0.997 for figures, mAP@0.50 around 0.9 for panel and image detection, and BLEU 0.913 for the LLM caption splitter on held-out human data. They also ship the ground-truth annotations, trained models, and the full pipeline code.

That combination makes the resource immediately usable and auditable, which is the real strength here. Dataset papers often stop at description; this one gives users the tools to inspect and filter the output themselves. The work is empirical and avoids any parameter fitting or circular claims, so the validation metrics stand on their own.

The main limitation is the absence of any VLM training or retrieval experiments that would show whether the pairs actually improve model performance over existing medical datasets. That gap is common in resource papers, but it means the practical payoff is still prospective. Minor issues include limited error analysis on edge cases like complex multi-panel figures or low-quality scans, though the human-annotated test sets help.

This is aimed at groups building or fine-tuning vision-language models for retinal imaging and related clinical tasks. It is worth a serious referee because the contribution is a concrete, open artifact with measurable extraction quality rather than just another claim of scale.

Referee Report

0 major / 3 minor

Summary. The paper introduces PubMed-Ophtha, a hierarchical dataset of 102,023 ophthalmological image-caption pairs extracted from 15,842 open-access PubMed Central articles. Figures are extracted directly from PDFs at full resolution, decomposed into panels with identifiers and individual images, annotated with imaging modalities (color fundus photography, optical coherence tomography, retinal imaging, or other) and mark status, and paired with panel-level subcaptions obtained via a two-step LLM-based splitting method. The authors report concrete performance metrics on held-out human annotations: mean sentence BLEU of 0.913 for caption splitting, mAP@0.50 of 0.909 for panel detection and 0.892 for image detection, and median IoU of 0.997 for figure extraction. They release the human-annotated ground-truth data, trained models, and full dataset generation pipeline to support reproducibility.
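
The detection and extraction figures in this summary all rest on box overlap. The sketch below computes intersection over union (IoU) for two boxes and then a single precision and recall point at a chosen IoU threshold, using greedy matching in order of detection score. It is a simplified illustration: mAP additionally averages precision over recall levels and classes, which is not shown, and the helper names and example boxes are hypothetical.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def area(box: Box) -> float:
    return max(0.0, box[2] - box[0]) * max(0.0, box[3] - box[1])


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0


def precision_recall(detections: List[Tuple[Box, float]],
                     truths: List[Box],
                     iou_threshold: float = 0.5) -> Tuple[float, float]:
    """Match detections (box, score) to ground truth greedily by score."""
    matched = set()
    true_positives = 0
    for box, _score in sorted(detections, key=lambda d: d[1], reverse=True):
        best_j, best_iou = None, iou_threshold
        for j, truth in enumerate(truths):
            if j in matched:
                continue  # each ground-truth box may be matched once
            overlap = iou(box, truth)
            if overlap >= best_iou:
                best_j, best_iou = j, overlap
        if best_j is not None:
            matched.add(best_j)
            true_positives += 1
    precision = true_positives / len(detections) if detections else 0.0
    recall = true_positives / len(truths) if truths else 0.0
    return precision, recall


# Hypothetical boxes: two correct detections and one false positive
truths = [(0, 0, 10, 10), (20, 20, 30, 30)]
detections = [((0, 0, 10, 10), 0.9), ((20, 20, 30, 28), 0.8), ((50, 50, 60, 60), 0.7)]
p, r = precision_recall(detections, truths, iou_threshold=0.75)
```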

Significance. If the reported extraction fidelity holds, this work provides a valuable open resource that can accelerate development of ophthalmology-specific vision-language models by supplying a large-scale, hierarchically structured, modality-annotated image-text corpus that is currently scarce in the field. The full-resolution PDF extraction, panel decomposition, and release of ground-truth annotations, trained models, and the complete pipeline are particular strengths that enable community auditing, filtering, and extension of the data.

minor comments (3)

A comparison table with existing ophthalmology or medical image-caption datasets (size, structure, extraction method, and annotation granularity) would better highlight the advantages of PubMed-Ophtha.
Report the distribution of imaging modalities and mark statuses across the 102,023 pairs to allow users to assess potential class imbalance or biases in the dataset.
Provide additional details on the human annotation protocol for the ground-truth evaluation set, including the number of annotators and any measures of inter-annotator agreement.
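
The distribution requested in the second comment is cheap to compute once the annotations are loaded. A minimal sketch, assuming each record exposes hypothetical `modality` and `has_marks` keys (the released column names may differ) and using made-up records:

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def label_distribution(records: Iterable[Dict]) -> Dict[Tuple[str, bool], float]:
    """Fraction of records for each (modality, mark status) combination."""
    counts = Counter((r["modality"], r["has_marks"]) for r in records)
    total = sum(counts.values())
    return {key: n / total for key, n in counts.items()} if total else {}


# Hypothetical records, not taken from the dataset
sample = [
    {"modality": "CFP", "has_marks": False},
    {"modality": "CFP", "has_marks": False},
    {"modality": "OCT", "has_marks": True},
    {"modality": "Other", "has_marks": False},
]
distribution = label_distribution(sample)
```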

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive and accurate summary of our manuscript, for recognizing the value of PubMed-Ophtha as an open resource for ophthalmology vision-language models, and for recommending acceptance. No major comments were raised in the report.

Circularity Check

0 steps flagged

No significant circularity

The paper describes an empirical dataset construction pipeline for extracting and annotating ophthalmology image-caption pairs from PubMed Central PDFs, with validation via held-out human annotations (panel mAP@0.50 = 0.909, image mAP@0.50 = 0.892, figure extraction median IoU = 0.997, caption splitting BLEU = 0.913). No mathematical derivations, equations, predictions, or fitted parameters are present that could reduce to inputs by construction. All performance claims rely on external ground-truth annotations and released models/pipeline rather than self-referential steps. No self-citation load-bearing elements or ansatz smuggling appear in the central claims.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The paper contributes a new dataset and extraction pipeline rather than a mathematical derivation. It relies on standard computer vision and LLM capabilities without introducing new physical entities or free parameters that are fitted to support a central claim.

axioms (2)

Domain assumption: PDF parsing and computer vision models can reliably extract and decompose figures into panels at full resolution. Invoked in the figure extraction and panel detection steps described in the abstract.
Domain assumption: Large language models can accurately split figure captions into panel-specific subcaptions. Basis for the two-step LLM approach with reported BLEU evaluation.

pith-pipeline@v0.9.0 · 5514 in / 1406 out tokens · 87119 ms · 2026-05-08T18:42:44.201721+00:00 · methodology

Author contributions

Conceptualization: VH, CE, PB; Methodology: VH, CE, PB; Software: VH; Validation: VH; Data Curation: VH; Writing - Original Draft: VH, PB; Writing - Review & Editing: CE; Visualization: VH, PB; Supervision: PB, CE; Funding acquisition: PB.

Competing interests

None to declare.

Acknowledgments

We thank Camila Roa, Sarah Müller, Ifeoma Nwabufo, Jan-Niklas Böhm, Fabio Seel, Samuel Ofosu Mensah, Simone Ebert, Rita González Márquez and Julius Gervelmeyer for annotating PubMed-Ophtha-Annotation.

Funding

We thank the Hertie Foundation and the Carl Zeiss Foundation (CZ Nexus: Certification and Foundations of Safe Machine Learning Systems in Healthcare) for funding. PB and CE are members of the Cluster of Excellence 2064 "Machine Learning – New Perspectives for Science" funded by the German Research Foundation (DFG).

A. Full dataset column description

Table 2. Overview of the fields i...
