PubMed-Ophtha: An open resource for training ophthalmology vision-language models on scientific literature
Pith reviewed 2026-05-08 18:42 UTC · model grok-4.3
The pith
PubMed-Ophtha releases 102,023 ophthalmology image-caption pairs extracted at full resolution from 15,842 scientific articles.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We present PubMed-Ophtha, a hierarchical dataset of 102,023 ophthalmological image-caption pairs extracted from 15,842 open-access articles in PubMed Central. Figures are extracted directly from article PDFs at full resolution and decomposed into their constituent panels, panel identifiers, and individual images. Each image is annotated with its imaging modality and a mark status. Figure captions are split into panel-level subcaptions using a two-step LLM approach.
What carries the argument
The PubMed-Ophtha dataset pipeline, which extracts figures from PDFs at full resolution, decomposes them into panels, classifies imaging modalities, and splits captions via LLM into panel-specific subcaptions.
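The hierarchy described above (figure, then panels, then individual images) can be pictured as nested records. The following is a minimal sketch only: the class names, field names, and label strings are hypothetical and are not taken from the released dataset schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

# Hypothetical schema for illustration; not the released column names.
Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max) in pixels


@dataclass
class Image:
    box: Box
    modality: str    # e.g. "color fundus photography", "optical coherence tomography"
    has_marks: bool  # True if arrows or similar annotation marks are present


@dataclass
class Panel:
    box: Box
    identifier: Optional[str]  # panel label such as "A"; None if the panel has no label
    subcaption: str            # panel-level text split out of the figure caption
    images: List[Image] = field(default_factory=list)


@dataclass
class Figure:
    article_id: str
    caption: str
    panels: List[Panel] = field(default_factory=list)


def image_caption_pairs(figure: Figure) -> List[Tuple[Image, str]]:
    """Flatten one figure into (image, subcaption) training pairs."""
    return [(img, panel.subcaption) for panel in figure.panels for img in panel.images]
```

Flattening at the image level is one plausible way to arrive at image-caption pairs; whether the released pairs are defined per image or per panel is not stated in this summary.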
If this is right
- Panel-level subcaptions let models handle multi-panel figures, which datasets with a single caption per figure cannot address.
- Modality and mark annotations support training of models that distinguish color fundus photography from optical coherence tomography and ignore arrows or labels.
- Release of ground-truth annotations, trained detection models, and the full extraction pipeline allows other groups to extend or audit the resource.
- The dataset scale of 102,023 pairs provides sufficient volume for pre-training or fine-tuning large vision-language architectures in this domain.
Where Pith is reading between the lines
- The same extraction approach could be rerun on newer PubMed updates to keep the resource current without manual effort.
- Models trained on this dataset may transfer to clinical image interpretation tasks even if the source papers contain only research-grade images.
- Similar pipelines applied to other medical fields could reduce reliance on expensive manual dataset creation across healthcare AI.
- If downstream models show clear gains on held-out ophthalmology benchmarks, the dataset would demonstrate that literature-derived pairs are a viable alternative to expert-annotated collections.
Load-bearing premise
Automated PDF figure extraction, panel decomposition, and LLM-based caption splitting produce image-text pairs accurate enough and free of systematic errors to support effective training of ophthalmology vision-language models.
What would settle it
Train a vision-language model on the released PubMed-Ophtha pairs and measure whether it improves ophthalmology-specific tasks such as panel-level captioning or modality-aware visual question answering compared with models trained on smaller or noisier alternatives.
Original abstract
Vision-language models hold considerable promise for ophthalmology, but their development depends on large-scale, high-quality image-text datasets that remain scarce. We present PubMed-Ophtha, a hierarchical dataset of 102,023 ophthalmological image-caption pairs extracted from 15,842 open-access articles in PubMed Central. Unlike existing datasets, figures are extracted directly from article PDFs at full resolution and decomposed into their constituent panels, panel identifiers, and individual images. Each image is annotated with its imaging modality -- color fundus photography, optical coherence tomography, retinal imaging, or other -- and a mark status indicating the presence of annotation marks such as arrows. Figure captions are split into panel-level subcaptions using a two-step LLM approach, achieving a mean average sentence BLEU score of 0.913 on human-annotated data. Panel and image detection models reach a mAP@0.50 of 0.909 and 0.892, respectively, and figure extraction achieves a median IoU of 0.997. To support reproducibility, we additionally release the human-annotated ground-truth data, all trained models, and the full dataset generation pipeline.
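For readers unfamiliar with the figure extraction metric, intersection over union (IoU) can be sketched as below. This assumes extracted figures and ground truth are compared as axis-aligned bounding boxes; the abstract does not spell out the comparison, and the function names are illustrative.

```python
from statistics import median


def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x_min, y_min, x_max, y_max)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def median_iou(pairs):
    """Median IoU over (predicted_box, ground_truth_box) pairs."""
    return median(box_iou(pred, gt) for pred, gt in pairs)
```

A median near 1 means that at least half of the extracted regions almost exactly coincide with the annotated ones; it says little about the worst cases, which a minimum or a low percentile would expose.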
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims to introduce PubMed-Ophtha, a hierarchical dataset of 102,023 ophthalmological image-caption pairs extracted from 15,842 open-access PubMed Central articles. Figures are extracted directly from PDFs at full resolution, decomposed into panels with identifiers and individual images, annotated with imaging modalities (color fundus photography, optical coherence tomography, retinal imaging, or other) and mark status, and paired with panel-level subcaptions obtained via a two-step LLM-based splitting method. The authors report concrete performance metrics on held-out human annotations: mean sentence BLEU of 0.913 for caption splitting, mAP@0.50 of 0.909 for panel detection and 0.892 for image detection, and median IoU of 0.997 for figure extraction. They release the human-annotated ground-truth data, trained models, and full dataset generation pipeline to support reproducibility.
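The detection figures quoted in the summary are average precision at an IoU threshold of 0.50. The sketch below shows one common way to compute it for a single class, using greedy matching by descending score and all-point interpolation. These choices are assumptions; the paper's evaluation tooling may differ in its matching and interpolation details.

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x_min, y_min, x_max, y_max)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def average_precision(detections, ground_truth, iou_threshold=0.5):
    """Single-class AP.

    detections: list of (image_id, score, box)
    ground_truth: dict mapping image_id to a list of boxes
    """
    n_gt = sum(len(boxes) for boxes in ground_truth.values())
    if n_gt == 0:
        return 0.0
    matched = {img: [False] * len(boxes) for img, boxes in ground_truth.items()}
    hits = []
    # Visit detections from most to least confident.
    for img, _score, box in sorted(detections, key=lambda d: d[1], reverse=True):
        best_iou, best_j = 0.0, -1
        for j, gt_box in enumerate(ground_truth.get(img, [])):
            if not matched[img][j]:
                iou = box_iou(box, gt_box)
                if iou > best_iou:
                    best_iou, best_j = iou, j
        if best_j >= 0 and best_iou >= iou_threshold:
            matched[img][best_j] = True  # each ground-truth box is matched at most once
            hits.append(1)
        else:
            hits.append(0)
    # Precision and recall after each detection.
    precisions, recalls, true_pos = [], [], 0
    for k, hit in enumerate(hits, start=1):
        true_pos += hit
        precisions.append(true_pos / k)
        recalls.append(true_pos / n_gt)
    # Make precision non-increasing in recall (interpolation envelope).
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])
    # Area under the interpolated precision-recall curve.
    ap, prev_recall = 0.0, 0.0
    for p, r in zip(precisions, recalls):
        ap += (r - prev_recall) * p
        prev_recall = r
    return ap
```

mAP averages this quantity over classes; with a single class (for example, "panel") it reduces to AP.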
Significance. If the reported extraction fidelity holds, this work provides a valuable open resource that can accelerate development of ophthalmology-specific vision-language models by supplying a large-scale, hierarchically structured, modality-annotated image-text corpus that is currently scarce in the field. The full-resolution PDF extraction, panel decomposition, and release of ground-truth annotations, trained models, and the complete pipeline are particular strengths that enable community auditing, filtering, and extension of the data.
minor comments (3)
- A comparison table with existing ophthalmology or medical image-caption datasets (size, structure, extraction method, and annotation granularity) would better highlight the advantages of PubMed-Ophtha.
- Report the distribution of imaging modalities and mark statuses across the 102,023 pairs to allow users to assess potential class imbalance or biases in the dataset.
- Provide additional details on the human annotation protocol for the ground-truth evaluation set, including the number of annotators and any measures of inter-annotator agreement.
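The class-balance check requested in the second comment is straightforward once per-image labels are available. A minimal sketch, assuming each record is a dict with hypothetical keys `modality` and `has_marks` (the released column names may differ):

```python
from collections import Counter


def label_distribution(records):
    """Return {(modality, has_marks): (count, fraction)} for an iterable of records."""
    counts = Counter((r["modality"], r["has_marks"]) for r in records)
    total = sum(counts.values())
    return {key: (n, n / total) for key, n in counts.items()}
```

Reporting the joint distribution rather than two separate marginals would also show whether annotation marks are concentrated in particular modalities.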
Simulated Author's Rebuttal
We thank the referee for their positive and accurate summary of our manuscript, for recognizing the value of PubMed-Ophtha as an open resource for ophthalmology vision-language models, and for recommending acceptance. No major comments were raised in the report.
Circularity Check
No significant circularity
The paper describes an empirical dataset construction pipeline for extracting and annotating ophthalmology image-caption pairs from PubMed Central PDFs, with validation via held-out human annotations (panel mAP@0.50 = 0.909, image mAP@0.50 = 0.892, figure extraction median IoU = 0.997, caption splitting BLEU = 0.913). No mathematical derivations, equations, predictions, or fitted parameters are present that could reduce to inputs by construction. All performance claims rely on external ground-truth annotations and released models/pipeline rather than self-referential steps. No self-citation load-bearing elements or ansatz smuggling appear in the central claims.
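To make the caption splitting score concrete, the sketch below computes a simplified sentence-level BLEU and averages it over pairs of reference and predicted subcaptions. Whitespace tokenization, n-grams up to length 4, and add-one smoothing for n > 1 are assumptions made here for illustration; the paper's exact BLEU configuration is not given in this summary.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def sentence_bleu(reference, hypothesis, max_n=4):
    """Simplified sentence BLEU with one reference (assumed smoothing: add-one for n > 1)."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref or not hyp:
        return 0.0
    log_precision = 0.0
    for n in range(1, max_n + 1):
        hyp_counts, ref_counts = ngrams(hyp, n), ngrams(ref, n)
        # Clipped matches: each hypothesis n-gram counts at most as often as in the reference.
        overlap = sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
        total = sum(hyp_counts.values())
        if n == 1:
            if overlap == 0:
                return 0.0  # no shared words at all
            precision = overlap / total
        else:
            precision = (overlap + 1) / (total + 1)
        log_precision += math.log(precision) / max_n
    # Brevity penalty for hypotheses shorter than the reference.
    bp = 1.0 if len(hyp) >= len(ref) else math.exp(1 - len(ref) / len(hyp))
    return bp * math.exp(log_precision)


def mean_sentence_bleu(pairs):
    """Average sentence BLEU over (reference, hypothesis) pairs."""
    scores = [sentence_bleu(ref, hyp) for ref, hyp in pairs]
    return sum(scores) / len(scores) if scores else 0.0
```

Because BLEU rewards n-gram overlap, a high score indicates that predicted subcaptions reuse the reference wording closely; it does not by itself confirm that each subcaption was assigned to the correct panel.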
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption PDF parsing and computer vision models can reliably extract and decompose figures into panels at full resolution.
- domain assumption Large language models can accurately split figure captions into panel-specific subcaptions.
V. Hallitschke et al., PubMed-Ophtha: An open resource for training ophthalmology vision-language models on scientific literature
AUTHOR CONTRIBUTIONS
Conceptualization: VH, CE, PB; Methodology: VH, CE, PB; Software: VH; Validation: VH; Data Curation: VH; Writing - Original Draft: VH, PB; Writing - Review & Editing: CE; Visualization: VH, PB; Supervision: PB, CE; Funding acquisition: PB.
COMPETING INTERESTS
None to declare.
ACKNOWLEDGMENTS
We thank Camila Roa, Sarah Müller, Ifeoma Nwabufo, Jan-Niklas Böhm, Fabio Seel, Samuel Ofosu Mensah, Simone Ebert, Rita González Márquez and Julius Gervelmeyer for annotating PubMed-Ophtha-Annotation.
FUNDING
We thank the Hertie Foundation and the Carl Zeiss Foundation (CZ Nexus: Certification and Foundations of Safe Machine Learning Systems in Healthcare) for funding. PB and CE are members of the Cluster of Excellence 2064 "Machine Learning – New Perspectives for Science" funded by the German Research Foundation (DFG).
A. FULL DATASET COLUMN DESCRIPTION
Table 2. Overview of the fields i...