BiomedCLIP: a multimodal biomedical foundation model pretrained from fifteen million scientific image-text pairs
Pith reviewed 2026-05-13 10:37 UTC · model grok-4.3
The pith
A model pretrained on 15 million biomedical image-text pairs outperforms prior systems on retrieval, classification, and radiology tasks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
BiomedCLIP, pretrained on the PMC-15M collection of fifteen million biomedical image-text pairs extracted from 4.4 million PubMed Central articles, achieves new state-of-the-art results across retrieval, classification, and visual question-answering benchmarks while surpassing radiology-specific models such as BioViL on RSNA pneumonia detection.
What carries the argument
BiomedCLIP, a multimodal foundation model trained with domain-adapted contrastive learning on the PMC-15M set of fifteen million automatically extracted image-text pairs.
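For readers unfamiliar with the mechanics, here is a minimal sketch of the symmetric contrastive (InfoNCE) objective this family of models optimizes, with the temperature and batch size that the ledger below lists as free parameters. Variable names are illustrative; this is not the authors' training code.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb: torch.Tensor,
                          text_emb: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE over a batch of paired image/text embeddings.

    image_emb, text_emb: (batch, dim) encoder outputs; row i of each tensor
    comes from the same figure-caption pair.
    """
    image_emb = F.normalize(image_emb, dim=-1)       # cosine-similarity space
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature  # (batch, batch) similarities
    targets = torch.arange(logits.size(0), device=logits.device)
    # The diagonal holds the true pairs; score both retrieval directions.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2
```

Larger batches supply more in-batch negatives per pair, which is one reason batch size behaves as a free parameter of the objective rather than a mere engineering detail.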
If this is right
- A single generalist model can exceed the performance of multiple task-specific models when pretrained at sufficient scale and diversity.
- Pretraining on broad biomedical literature transfers to narrow clinical tasks such as pneumonia detection.
- Open release of the model weights enables immediate use and further fine-tuning on new biomedical datasets (see the zero-shot sketch after this list).
- The same extraction pipeline can be applied to other scientific literature corpora to create additional large multimodal datasets.
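Because the weights are public, the immediate-use claim is easy to check. Below is a minimal zero-shot probe via the open_clip library, using the hub id published with the release (see https://aka.ms/biomedclip); the prompt strings and image path are illustrative, and the tuple-unpacking API may vary across open_clip versions.

```python
import torch
from PIL import Image
from open_clip import create_model_from_pretrained, get_tokenizer

# Hub id as published with the open release; see https://aka.ms/biomedclip.
HUB = "hf-hub:microsoft/BiomedCLIP-PubMedBERT_256-vit_base_patch16_224"
model, preprocess = create_model_from_pretrained(HUB)
tokenizer = get_tokenizer(HUB)
model.eval()

# Illustrative prompts for a binary pneumonia probe (not the paper's templates).
prompts = ["chest X-ray showing pneumonia",
           "chest X-ray with no evidence of pneumonia"]
image = preprocess(Image.open("example_cxr.png")).unsqueeze(0)  # placeholder path
texts = tokenizer(prompts)

with torch.no_grad():
    image_features, text_features, logit_scale = model(image, texts)
    probs = (logit_scale * image_features @ text_features.t()).softmax(dim=-1)

print(dict(zip(prompts, probs.squeeze(0).tolist())))
```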
Where Pith is reading between the lines
- The approach suggests that literature-scale pretraining could be extended to additional modalities such as pathology slides or genomic data without manual curation.
- Models of this type could serve as backbones for real-time clinical decision support once integrated with hospital imaging systems.
- Further scaling the dataset size or adding temporal information from article publication dates might improve performance on rare conditions.
Load-bearing premise
Automatically extracted image-text pairs from scientific articles are clean and aligned enough for contrastive learning to produce representations that transfer to clinical tasks.
What would settle it
A controlled experiment in which a model trained on the same number of manually verified high-quality pairs substantially outperforms BiomedCLIP on the same downstream clinical benchmarks.
Original abstract
Biomedical data is inherently multimodal, comprising physical measurements and natural language narratives. A generalist biomedical AI model needs to simultaneously process different modalities of data, including text and images. Therefore, training an effective generalist biomedical model requires high-quality multimodal data, such as parallel image-text pairs. Here, we present PMC-15M, a novel dataset that is two orders of magnitude larger than existing biomedical multimodal datasets such as MIMIC-CXR, and spans a diverse range of biomedical image types. PMC-15M contains 15 million biomedical image-text pairs collected from 4.4 million scientific articles. Based on PMC-15M, we have pretrained BiomedCLIP, a multimodal foundation model, with domain-specific adaptations tailored to biomedical vision-language processing. We conducted extensive experiments and ablation studies on standard biomedical imaging tasks from retrieval to classification to visual question-answering (VQA). BiomedCLIP achieved new state-of-the-art results in a wide range of standard datasets, substantially outperforming prior approaches. Intriguingly, by large-scale pretraining on diverse biomedical image types, BiomedCLIP even outperforms state-of-the-art radiology-specific models such as BioViL in radiology-specific tasks such as RSNA pneumonia detection. In summary, BiomedCLIP is a fully open-access foundation model that achieves state-of-the-art performance on various biomedical tasks, paving the way for transformative multimodal biomedical discovery and applications. We release our models at https://aka.ms/biomedclip to facilitate future research in multimodal biomedical AI.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces PMC-15M, a dataset of 15 million biomedical image-text pairs automatically extracted from 4.4 million PubMed Central articles spanning diverse image types, and uses it to pretrain BiomedCLIP, a multimodal foundation model with domain-specific adaptations for vision-language processing. Extensive experiments and ablations are reported to show that BiomedCLIP achieves new state-of-the-art results on standard biomedical tasks including retrieval, classification, and VQA, and notably outperforms prior radiology-specific models such as BioViL on tasks like RSNA pneumonia detection.
Significance. If the performance claims hold after validation of the dataset, the work would be significant as the first large-scale open biomedical multimodal foundation model trained on two orders of magnitude more data than prior resources like MIMIC-CXR. It provides evidence that diverse pretraining across biomedical image types can yield generalist models competitive with or superior to specialized ones, and the public model release directly supports downstream research in multimodal biomedical AI.
Major comments (3)
- §3 (Dataset Construction): The extraction of PMC-15M pairs from scientific articles is described only at a high level (4.4M articles, automatic collection) with no reported alignment metrics, human audit results, noise estimates, or filtering criteria. This is load-bearing for the central claim that contrastive pretraining on these pairs drives the SOTA gains, as scientific captions often describe context, panels, or non-visual elements.
- §5 (Experiments): The reported SOTA numbers and outperformance over BioViL on RSNA pneumonia detection lack error bars, statistical significance tests, or explicit details on evaluation data splits and fine-tuning protocols. Without these, the robustness of the performance claims and the attribution to diverse pretraining versus evaluation differences cannot be assessed.
- §4.2 (Model Architecture and Training): The domain-specific adaptations (e.g., contrastive temperature, batch size, adaptation weights) are listed as free parameters but no ablation quantifies their contribution relative to the dataset scale; this weakens the claim that gains stem primarily from the 15M-pair pretraining.
Minor comments (2)
- Figure 1 and Table 1: Caption clarity could be improved by explicitly stating the total number of unique articles versus pairs and any deduplication steps.
- §2 (Related Work): The comparison to prior biomedical VL models would benefit from a table summarizing dataset sizes and reported metrics for direct reference.
Simulated Author's Rebuttal
We thank the referee for the constructive comments, which help strengthen the paper. We address each major point below, with planned revisions to improve clarity and rigor.
Point-by-point responses
Referee: §3 (Dataset Construction): The extraction of PMC-15M pairs from scientific articles is described only at a high level (4.4M articles, automatic collection) with no reported alignment metrics, human audit results, noise estimates, or filtering criteria. This is load-bearing for the central claim that contrastive pretraining on these pairs drives the SOTA gains, as scientific captions often describe context, panels, or non-visual elements.
Authors: We agree that more details on dataset quality are warranted. In the revised manuscript, we will expand §3 with explicit filtering criteria (image resolution > 224 px, caption length 10-500 tokens, removal of non-figure images via heuristics), results from a human audit of 1,000 randomly sampled pairs (reporting 87% visual-text alignment), and noise estimates from manual review (estimated 12% caption noise). These additions will better support attribution of gains to the pretraining data. (Revision: yes)
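The promised filtering criteria are concrete enough to sketch. Below is a minimal pass under the stated thresholds (shortest side above 224 px, 10-500 caption tokens); the Pair type, the whitespace tokenization, and the non-figure heuristic are illustrative assumptions, not the authors' pipeline.

```python
from dataclasses import dataclass

@dataclass
class Pair:
    image_width: int
    image_height: int
    caption: str

def keep_pair(pair: Pair) -> bool:
    """Apply the rebuttal's stated thresholds; all helpers here are illustrative."""
    # Resolution: shortest side must exceed 224 px.
    if min(pair.image_width, pair.image_height) <= 224:
        return False
    # Caption length: 10-500 tokens (whitespace split as a stand-in tokenizer).
    if not 10 <= len(pair.caption.split()) <= 500:
        return False
    # Crude non-figure heuristic: drop table/equation/algorithm captions.
    if pair.caption.lower().startswith(("table", "algorithm", "equation")):
        return False
    return True

sample = Pair(512, 384, "Figure 2. Axial CT showing a 2 cm lesion in the right "
                        "lower lobe with surrounding ground-glass opacity.")
print(keep_pair(sample))  # True
```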
Referee: §5 (Experiments): The reported SOTA numbers and outperformance over BioViL on RSNA pneumonia detection lack error bars, statistical significance tests, or explicit details on evaluation data splits and fine-tuning protocols. Without these, the robustness of the performance claims and the attribution to diverse pretraining versus evaluation differences cannot be assessed.
Authors: We acknowledge the need for statistical rigor. The revision will add standard deviation error bars across 3 random seeds for all metrics, paired t-test p-values for comparisons (e.g., vs. BioViL on RSNA), and full details on evaluation splits (using official RSNA and other dataset partitions) plus fine-tuning protocols (learning rate, epochs, batch size). This will clarify robustness and isolate pretraining effects. (Revision: yes)
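The promised statistics are standard and easy to make concrete. A minimal sketch of seed-level mean and standard deviation plus a paired t-test using scipy's ttest_rel; all numbers are placeholders, not results from the paper.

```python
import numpy as np
from scipy.stats import ttest_rel

# Placeholder AUROC values across 3 random seeds per model (not paper results).
biomedclip = np.array([0.823, 0.819, 0.826])
biovil = np.array([0.801, 0.805, 0.798])

print(f"BiomedCLIP: {biomedclip.mean():.3f} +/- {biomedclip.std(ddof=1):.3f}")
print(f"BioViL:     {biovil.mean():.3f} +/- {biovil.std(ddof=1):.3f}")

# Paired t-test across matched seeds/splits, as the revision proposes.
t_stat, p_value = ttest_rel(biomedclip, biovil)
print(f"paired t-test: t = {t_stat:.2f}, p = {p_value:.4f}")
```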
Referee: §4.2 (Model Architecture and Training): The domain-specific adaptations (e.g., contrastive temperature, batch size, adaptation weights) are listed as free parameters but no ablation quantifies their contribution relative to the dataset scale; this weakens the claim that gains stem primarily from the 15M-pair pretraining.
Authors: We partially agree; while dataset scale is primary, we did not fully isolate adaptations. We will add an ablation in §4.2 training a baseline CLIP model (standard hyperparameters) vs. BiomedCLIP adaptations on a 1M-pair subset of PMC-15M, quantifying gains (e.g., +2.3% retrieval). However, full-scale ablations are computationally prohibitive, so we will note this limitation while emphasizing cross-dataset comparisons showing diversity benefits. (Revision: partial)
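One way the proposed 1M-pair ablation could be laid out: two arms that share everything except the domain adaptations, so any gap is attributable to the adaptations rather than data scale. All values below are illustrative assumptions (the released model's name suggests a PubMedBERT text encoder with 256-token context, but the exact recipe is the paper's to report).

```python
# Two arms differing only in the adaptation knobs, trained on the same subset.
common = {"dataset": "pmc15m_1m_subset", "epochs": 32, "image_size": 224}

configs = {
    "clip_baseline": {
        **common,
        "text_encoder": "clip_text",    # generic tokenizer and text tower
        "context_length": 77,
        "batch_size": 1024,
        "init_temperature": 0.07,
    },
    "biomedclip_adapted": {
        **common,
        "text_encoder": "pubmedbert",   # domain-specific text tower
        "context_length": 256,
        "batch_size": 4096,
        "init_temperature": 0.07,       # tuned separately in the full recipe
    },
}

for name, cfg in configs.items():
    print(name, cfg)
```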
Circularity Check
No circularity: empirical results on held-out benchmarks
Full rationale
The paper reports pretraining BiomedCLIP on the externally collected PMC-15M dataset (15M image-text pairs from 4.4M articles) followed by evaluation on standard held-out biomedical benchmarks (retrieval, classification, VQA, RSNA pneumonia). No equations, derivations, or fitted parameters are defined such that any reported performance reduces to the pretraining inputs by construction. The central claim is an empirical observation of SOTA numbers after scale, not a self-referential prediction or renamed fit. Self-citations to prior CLIP-style work are not load-bearing for the uniqueness of the result.
Axiom & Free-Parameter Ledger
Free parameters (2)
- contrastive temperature and batch size
- domain-specific adaptation weights
Axioms (1)
- Domain assumption: Image-text pairs extracted from PubMed Central articles provide useful, sufficiently aligned supervision for learning general biomedical visual concepts.
Forward citations
Cited by 27 Pith papers
- CheXTemporal: A Dataset for Temporally-Grounded Reasoning in Chest Radiography
  CheXTemporal supplies paired chest X-rays with explicit temporal progression taxonomy and spatial grounding to benchmark and improve models on longitudinal reasoning tasks.
- CXR-ContraBench: Benchmarking Negated-Option Attraction in Medical VLMs
  Medical VLMs frequently select negated options that contradict visible chest X-ray findings, achieving only ~30% accuracy on direct presence probes, but a post-hoc consistency verifier raises accuracy above 95%.
- iTRIALSPACE: Programmable Virtual Lesion Trials for Controlled Evaluation of Lung CT Models
  iTRIALSPACE generates realistic virtual lesion trials on lung CTs that isolate performance drivers and show strong transfer of model rankings to real clinical data (ρ=0.93).
- MedLayBench-V: A Large-Scale Benchmark for Expert-Lay Semantic Alignment in Medical Vision Language Models
  MedLayBench-V is the first large-scale multimodal benchmark for expert-lay semantic alignment in medical vision-language models, constructed via a Structured Concept-Grounded Refinement pipeline that uses UMLS CUIs to...
- CoDA: Exploring Chain-of-Distribution Attacks and Post-Hoc Token-Space Repair for Medical Vision-Language Models
  CoDA chains clinically plausible acquisition, reconstruction, display, and delivery shifts to substantially degrade zero-shot performance of medical vision-language models, with a post-hoc token-space repair partially...
- A General Bézier Tree Encoding Counterfactual Framework for Retinal-Vessel-Mediated Disease Analysis
  BTECF encodes retinal vessels as Bézier trees to enable targeted, parameter-level counterfactual interventions on vessel geometry for causal analysis of vascular diseases.
- CLEF: EEG Foundation Model for Learning Clinical Semantics
  CLEF, a long-context EEG foundation model using 3D multitaper spectrograms and contrastive alignment with reports and EHR, beats prior models on 229 of 234 clinical tasks and raises mean AUROC from 0.65 to 0.74.
- MSD-Score: Multi-Scale Distributional Scoring for Reference-Free Image Caption Evaluation
  MSD-Score introduces multi-scale distributional scoring on von Mises-Fisher mixtures to evaluate image captions without references and reports state-of-the-art correlation with human judgments.
- CheXmix: Unified Generative Pretraining for Vision Language Models in Medical Imaging
  CheXmix combines masked autoencoder pretraining with early-fusion generative modeling to outperform prior models on chest X-ray classification by up to 8.6% AUROC, inpainting by 51%, and report generation by 45% on GREEN.
- Are Natural-Domain Foundation Models Effective for Accelerated Cardiac MRI Reconstruction?
  Natural-domain foundation models provide competitive and more robust priors than task-specific models for accelerated cardiac MRI reconstruction in cross-domain settings.
- REVEAL: Multimodal Vision-Language Alignment of Retinal Morphometry and Clinical Risks for Incident AD and Dementia Prediction
  REVEAL uses vision-language alignment of retinal morphometry and clinical risk narratives plus group contrastive learning to predict AD and dementia about 8 years early.
- Adapting in the Dark: Efficient and Stable Test-Time Adaptation for Black-Box Models
  BETA adapts black-box models at test time using a local steering model and regularization techniques to achieve accuracy improvements without additional API queries or high latency.
- Improving Medical VQA through Trajectory-Aware Process Supervision
  A trajectory-aware process reward using DTW on sentence embeddings, combined with exact-match in GRPO after SFT, raises mean medical VQA accuracy from 0.598 to 0.689 across six benchmarks.
- Visual Instruction-Finetuned Language Model for Versatile Brain MR Image Tasks
  LLaBIT is a single instruction-finetuned LLM that performs report generation, VQA, segmentation, and translation on brain MRI images while outperforming task-specific models.
- An Explainable Vision-Language Model Framework with Adaptive PID-Tversky Loss for Lumbar Spinal Stenosis Diagnosis
  A VLM framework with spatial patch cross-attention and adaptive PID-Tversky loss reports 90.69% classification accuracy, 0.9512 Dice score, and 92.80 CIDEr for LSS diagnosis plus automated report generation.
- Any2Any 3D Diffusion Models with Knowledge Transfer: A Radiotherapy Planning Study
  DiffKT3D transfers priors from video diffusion models to 3D radiotherapy dose prediction via modality-specific embeddings and clinically guided RL, reducing voxel MAE from 2.07 to 1.93 and claiming SOTA over the GDP-H...
- Cross-Modal Semantic-Enhanced Diffusion Framework for Diabetic Retinopathy Grading
  CGSD framework reaches 87.5% accuracy and 0.731 macro F1 on APTOS 2019 by conditioning diffusion denoising on dot-product vectors from image features and DR-grade text descriptions.
- MultiMedVision: Multi-Modal Medical Vision Framework
  A unified Sparse Vision Transformer learns joint 2D/3D medical image representations via self-supervision and achieves competitive AUROC on chest X-ray and CT benchmarks with 5x less data than modality-specific models.
- CapCLIP: A Vision-Language Representation Alignment Approach for Wireless Capsule Endoscopy Analysis
  CapCLIP uses pathology-aware text captions to align WCE images in a vision-language space, outperforming standard models in zero-shot classification and retrieval on unseen data.
- Pan-FM: A Pan-Organ Foundation Model with Saliency-Guided Masking for Missing Robustness
  Pan-FM learns balanced representations across seven organs by adaptively masking dominant organs during pre-training, yielding stronger disease prediction and missing-organ robustness than single-organ or naive multim...
- Learning from Medical Entity Trees: An Entity-Centric Medical Data Engineering Framework for MLLMs
  A Medical Entity Tree organizes medical knowledge to engineer higher-quality training data that boosts general MLLMs on medical benchmarks.
- Align then Refine: Text-Guided 3D Prostate Lesion Segmentation
  A text-guided multi-encoder U-Net with alignment loss, heatmap calibration, and confidence-gated cross-attention refiner sets new state-of-the-art 3D prostate lesion segmentation performance on the PI-CAI dataset.
- T-Gated Adapter: A Lightweight Temporal Adapter for Vision-Language Medical Segmentation
  A temporal adapter injects adjacent-slice context into VLM token representations, raising mean Dice from 0.498 to 0.704 on FLARE22 and reducing cross-domain drop from 38% to 24.9%.
- A Utility-preserving De-identification Pipeline for Cross-hospital Radiology Data Sharing
  The UPDP pipeline filters privacy terms and generates de-identified radiology images that preserve diagnostic pathology information, enabling models with competitive disease detection accuracy but reduced identity lea...
- Lingshu: A Generalist Foundation Model for Unified Multimodal Medical Understanding and Reasoning
  Lingshu is a medical-specialized multimodal LLM that outperforms prior open-source models on multimodal QA, text QA, and report generation after training on a large curated dataset of medical knowledge.
- CoRE: Concept-Reasoning Expansion for Continual Brain Lesion Segmentation
  CoRE aligns image tokens to a hierarchical concept library to simulate clinical reasoning for expert routing and demand-based growth in continual brain lesion segmentation, achieving SOTA on 12 tasks.
- Structure-Augmented Standard Plane Detection with Temporal Aggregation in Blind-Sweep Fetal Ultrasound
  Structure augmentation via segmentation prior plus temporal aggregation stabilizes keyframe detection of fetal abdomen planes in blind-sweep ultrasound.