arxiv: 2605.02720 · v1 · submitted 2026-05-04 · 💻 cs.CV · cs.CL

PubMed-Ophtha: An open resource for training ophthalmology vision-language models on scientific literature

Pith reviewed 2026-05-08 18:42 UTC · model grok-4.3

classification 💻 cs.CV cs.CL
keywords ophthalmology · vision-language models · image-caption dataset · PubMed Central · figure extraction · panel decomposition · medical imaging

The pith

PubMed-Ophtha releases 102,023 ophthalmology image-caption pairs extracted at full resolution from 15,842 scientific articles.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a large hierarchical dataset to fill the gap in high-quality image-text resources needed for training vision-language models in ophthalmology. Figures are pulled directly from article PDFs at full resolution, broken into individual panels with identifiers, and paired with split captions. Each image receives labels for imaging modality and the presence of annotation marks. This scale and structure allow models to learn from real medical literature rather than limited curated collections.

Core claim

We present PubMed-Ophtha, a hierarchical dataset of 102,023 ophthalmological image-caption pairs extracted from 15,842 open-access articles in PubMed Central. Figures are extracted directly from article PDFs at full resolution and decomposed into their constituent panels, panel identifiers, and individual images. Each image is annotated with its imaging modality and a mark status. Figure captions are split into panel-level subcaptions using a two-step LLM approach.

What carries the argument

The PubMed-Ophtha dataset pipeline, which extracts figures from PDFs at full resolution, decomposes them into panels, classifies imaging modalities, and splits captions via LLM into panel-specific subcaptions.
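
To make the hierarchy concrete, the following is a minimal sketch of how a figure, its panels, and their images might be represented and flattened into image-caption pairs. All class names, field names, the box convention, and the fallback to the full caption are assumptions made for illustration; they are not the released dataset schema. The modality labels follow the four categories named in the paper.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

# Bounding box as (x_min, y_min, x_max, y_max) in pixels (assumed convention).
Box = Tuple[float, float, float, float]


@dataclass
class Image:
    box: Box
    modality: str    # "CFP", "OCT", "Retinal Imaging", or "Other"
    has_marks: bool  # True if annotation marks such as arrows are present


@dataclass
class Panel:
    box: Box
    identifier: Optional[str]  # e.g. "A"; None if no identifier was found
    subcaption: Optional[str]  # panel-level text; None if splitting failed
    images: List[Image] = field(default_factory=list)


@dataclass
class Figure:
    article_id: str  # hypothetical article key
    caption: str     # full, unsplit figure caption
    panels: List[Panel] = field(default_factory=list)


def image_caption_pairs(fig: Figure):
    """Flatten a figure into (image, text) pairs.

    Falls back to the full caption when a panel has no subcaption
    (an illustrative choice, not necessarily what the authors do).
    """
    pairs = []
    for panel in fig.panels:
        text = panel.subcaption or fig.caption
        for img in panel.images:
            pairs.append((img, text))
    return pairs


fig = Figure(
    article_id="PMC-EXAMPLE",
    caption="(A) Fundus photograph. (B) OCT scan.",
    panels=[
        Panel((0, 0, 100, 100), "A", "Fundus photograph.",
              [Image((5, 5, 95, 95), "CFP", False)]),
        Panel((100, 0, 200, 100), "B", "OCT scan.",
              [Image((105, 5, 195, 95), "OCT", True)]),
    ],
)
pairs = image_caption_pairs(fig)
```

Keeping panels and images as separate levels means a panel that contains several images can share one subcaption without duplicating the panel metadata.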

If this is right

  • Panel-level subcaptions enable models to handle multi-panel figures, which datasets with a single caption per figure cannot address.
  • Modality and mark annotations support training of models that distinguish color fundus photography from optical coherence tomography and ignore arrows or labels.
  • Release of ground-truth annotations, trained detection models, and the full extraction pipeline allows other groups to extend or audit the resource.
  • The dataset scale of 102,023 pairs provides sufficient volume for pre-training or fine-tuning large vision-language architectures in this domain.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same extraction pipeline could be rerun on newer PubMed Central releases to keep the resource current with little manual effort.
  • Models trained here may transfer to clinical image interpretation tasks even if the original papers contain only research-grade images.
  • Similar pipelines applied to other medical fields could reduce reliance on expensive manual dataset creation across healthcare AI.
  • If downstream models show clear gains on held-out ophthalmology benchmarks, the dataset would demonstrate that literature-derived pairs are a viable alternative to expert-annotated collections.

Load-bearing premise

Automated PDF figure extraction, panel decomposition, and LLM-based caption splitting produce image-text pairs accurate enough and free of systematic errors to support effective training of ophthalmology vision-language models.

What would settle it

Train a vision-language model on the released PubMed-Ophtha pairs and measure whether it improves ophthalmology-specific tasks such as panel-level captioning or modality-aware visual question answering compared with models trained on smaller or noisier alternatives.

Figures

Figures reproduced from arXiv: 2605.02720 by Carsten Eickhoff, Philipp Berens, Verena Jasmin Hallitschke.

Figure 1: Overview of the dataset extraction pipeline. (A) Articles are filtered by keywords to select those relevant to ophthalmological retinal imaging. (B) A heuristic detects figures and their captions in the article PDF and extracts them at full resolution. (C) Additional figure-level information, such as in-text mentions, is retrieved from the BIOMEDICA dataset. (D) The final dataset contains individual panels…

Figure 2: Detection performance and failure cases. (A) Example detections from the test set, showing panel (teal), image (purple), and panel identifier (blue) bounding boxes across three figures of varying complexity. (B) Precision-recall curves at an IoU threshold of 0.75 for (i) panel and panel identifier detection and (ii) image type detection across the four image type categories (CFP, OCT, Retinal Imaging, Othe…

Figure 3: Caption splitting and subcaption assignment. (A) Original figure with its full caption. (B) Result of the caption splitting step: the full caption is decomposed into panel-level subcaptions, each associated with a panel identifier. (C) Result of the panel assembly step: each subcaption is assigned to its corresponding panel, and panel identifier locations are matched to the detected panel bounding boxes. d…

Figure 4: Examples of extracted panels with their identifiers, subcaptions, and the detected images.
Original abstract

Vision-language models hold considerable promise for ophthalmology, but their development depends on large-scale, high-quality image-text datasets that remain scarce. We present PubMed-Ophtha, a hierarchical dataset of 102,023 ophthalmological image-caption pairs extracted from 15,842 open-access articles in PubMed Central. Unlike existing datasets, figures are extracted directly from article PDFs at full resolution and decomposed into their constituent panels, panel identifiers, and individual images. Each image is annotated with its imaging modality -- color fundus photography, optical coherence tomography, retinal imaging, or other -- and a mark status indicating the presence of annotation marks such as arrows. Figure captions are split into panel-level subcaptions using a two-step LLM approach, achieving a mean average sentence BLEU score of 0.913 on human-annotated data. Panel and image detection models reach a mAP@0.50 of 0.909 and 0.892, respectively, and figure extraction achieves a median IoU of 0.997. To support reproducibility, we additionally release the human-annotated ground-truth data, all trained models, and the full dataset generation pipeline.
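
For readers unfamiliar with the figure-extraction metric, the sketch below shows how intersection over union (IoU) between a predicted and an annotated box is commonly computed, and how a median over several figures is taken. The box format `(x_min, y_min, x_max, y_max)` and the toy coordinates are assumptions for illustration only; they are not taken from the paper's evaluation data.

```python
from statistics import median


def iou(a, b):
    """Intersection over union of two axis-aligned boxes (x0, y0, x1, y1)."""
    ax0, ay0, ax1, ay1 = a
    bx0, by0, bx1, by1 = b
    # Width and height of the overlap; zero if the boxes do not intersect.
    iw = max(0.0, min(ax1, bx1) - max(ax0, bx0))
    ih = max(0.0, min(ay1, by1) - max(ay0, by0))
    inter = iw * ih
    union = (ax1 - ax0) * (ay1 - ay0) + (bx1 - bx0) * (by1 - by0) - inter
    return inter / union if union > 0 else 0.0


# Toy boxes, invented for this example.
predicted = [(0, 0, 100, 100), (10, 10, 60, 60), (0, 0, 50, 50)]
annotated = [(0, 0, 100, 100), (10, 10, 60, 50), (25, 0, 75, 50)]

scores = [iou(p, g) for p, g in zip(predicted, annotated)]
median_iou = median(scores)
```

A median is less sensitive than a mean to a small number of badly cropped figures, so a high median IoU alone does not rule out a tail of failures.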

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 3 minor

Summary. The paper claims to introduce PubMed-Ophtha, a hierarchical dataset of 102,023 ophthalmological image-caption pairs extracted from 15,842 open-access PubMed Central articles. Figures are extracted directly from PDFs at full resolution, decomposed into panels with identifiers and individual images, annotated with imaging modalities (color fundus photography, optical coherence tomography, retinal imaging, or other) and mark status, and paired with panel-level subcaptions obtained via a two-step LLM-based splitting method. The authors report concrete performance metrics on held-out human annotations: mean sentence BLEU of 0.913 for caption splitting, mAP@0.50 of 0.909 for panel detection and 0.892 for image detection, and median IoU of 0.997 for figure extraction. They release the human-annotated ground-truth data, trained models, and full dataset generation pipeline to support reproducibility.

Significance. If the reported extraction fidelity holds, this work provides a valuable open resource that can accelerate development of ophthalmology-specific vision-language models by supplying a large-scale, hierarchically structured, modality-annotated image-text corpus that is currently scarce in the field. The full-resolution PDF extraction, panel decomposition, and release of ground-truth annotations, trained models, and the complete pipeline are particular strengths that enable community auditing, filtering, and extension of the data.

minor comments (3)
  1. A comparison table with existing ophthalmology or medical image-caption datasets (size, structure, extraction method, and annotation granularity) would better highlight the advantages of PubMed-Ophtha.
  2. Report the distribution of imaging modalities and mark statuses across the 102,023 pairs to allow users to assess potential class imbalance or biases in the dataset.
  3. Provide additional details on the human annotation protocol for the ground-truth evaluation set, including the number of annotators and any measures of inter-annotator agreement.
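
The distribution requested in minor comment 2 could be tabulated along these lines. This is a rough sketch: the record layout and the field names `modality` and `has_marks` are hypothetical, and the four toy records are invented rather than drawn from the dataset.

```python
from collections import Counter

# Invented records standing in for rows of the released dataset.
records = [
    {"modality": "CFP", "has_marks": False},
    {"modality": "OCT", "has_marks": True},
    {"modality": "OCT", "has_marks": False},
    {"modality": "Other", "has_marks": True},
]


def share_table(rows, key):
    """Map each label under `key` to (count, fraction of all rows)."""
    counts = Counter(row[key] for row in rows)
    total = sum(counts.values())
    return {label: (n, n / total) for label, n in counts.most_common()}


modality_shares = share_table(records, "modality")
mark_shares = share_table(records, "has_marks")

for label, (n, frac) in modality_shares.items():
    print(f"{label:<16} {n:>6} {frac:6.1%}")
```

Cross-tabulating modality against mark status in the same way would show whether annotation marks are concentrated in particular modalities.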

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive and accurate summary of our manuscript, for recognizing the value of PubMed-Ophtha as an open resource for ophthalmology vision-language models, and for recommending acceptance. No major comments were raised in the report.

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper describes an empirical dataset construction pipeline for extracting and annotating ophthalmology image-caption pairs from PubMed Central PDFs, with validation via held-out human annotations (panel mAP@0.50 = 0.909, image mAP@0.50 = 0.892, figure extraction median IoU = 0.997, caption splitting BLEU = 0.913). No mathematical derivations, equations, predictions, or fitted parameters are present that could reduce to inputs by construction. All performance claims rely on external ground-truth annotations and released models/pipeline rather than self-referential steps. No self-citation load-bearing elements or ansatz smuggling appear in the central claims.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The paper contributes a new dataset and extraction pipeline rather than a mathematical derivation. It relies on standard computer vision and LLM capabilities without introducing new physical entities or free parameters that are fitted to support a central claim.

axioms (2)
  • domain assumption PDF parsing and computer vision models can reliably extract and decompose figures into panels at full resolution.
    Invoked in the figure extraction and panel detection steps described in the abstract.
  • domain assumption Large language models can accurately split figure captions into panel-specific subcaptions.
    Basis for the two-step LLM approach with reported BLEU evaluation.
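
To illustrate what the caption-splitting task takes in and produces, here is a deliberately simple rule-based stand-in. It is not the paper's two-step LLM approach: it assumes panel identifiers appear as single letters in parentheses, such as "(A)", which many real captions do not follow. The function name and the example caption are made up for this sketch.

```python
import re

# Assumption: identifiers look like "(A)" or "(b)". Ranges such as "(A-C)"
# or identifiers like "A," are not handled by this toy pattern.
PANEL_ID = re.compile(r"\(([A-Za-z])\)")


def split_caption(caption: str) -> dict:
    """Split a caption into {identifier: subcaption} at each "(X)" marker."""
    matches = list(PANEL_ID.finditer(caption))
    parts = {}
    for i, m in enumerate(matches):
        # The subcaption runs up to the next marker, or to the end of the text.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(caption)
        parts[m.group(1).upper()] = caption[m.end():end].strip()
    return parts


caption = (
    "(A) Articles are filtered by keywords. "
    "(B) A heuristic detects figures and their captions."
)
subcaptions = split_caption(caption)
```

Text shared by all panels, identifiers that follow the description, and grouped identifiers all break this rule, which is the kind of variation that motivates a learned splitter and a human-annotated evaluation set.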

pith-pipeline@v0.9.0 · 5514 in / 1406 out tokens · 87119 ms · 2026-05-08T18:42:44.201721+00:00 · methodology


Author contributions

Conceptualization: VH, CE, PB; Methodology: VH, CE, PB; Software: VH; Validation: VH; Data Curation: VH; Writing - Original Draft: VH, PB; Writing - Review & Editing: CE; Visualization: VH, PB; Supervision: PB, CE; Funding acquisition: PB.

Competing interests

None to declare.

Acknowledgments

We thank Camila Roa, Sarah Müller, Ifeoma Nwabufo, Jan-Niklas Böhm, Fabio Seel, Samuel Ofosu Mensah, Simone Ebert, Rita González Márquez and Julius Gervelmeyer for annotating PubMed-Ophtha-Annotation.

Funding

We thank the Hertie Foundation and the Carl Zeiss Foundation (CZ Nexus: Certification and Foundations of Safe Machine Learning Systems in Healthcare) for funding. PB and CE are members of the Cluster of Excellence 2064 "Machine Learning – New Perspectives for Science" funded by the German Research Foundation (DFG).

A. Full dataset column description

Table 2. Overview of the fields i...