BiomedCLIP: a multimodal biomedical foundation model pretrained from fifteen million scientific image-text pairs
Pith reviewed 2026-05-13 10:37 UTC · model grok-4.3
The pith
A model pretrained on 15 million biomedical image-text pairs outperforms prior systems on retrieval, classification, and radiology tasks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
BiomedCLIP, pretrained on the PMC-15M collection of fifteen million biomedical image-text pairs extracted from 4.4 million PubMed Central articles, achieves new state-of-the-art results across retrieval, classification, and visual question-answering benchmarks while surpassing radiology-specific models such as BioViL on RSNA pneumonia detection.
What carries the argument
BiomedCLIP, a multimodal foundation model trained with domain-adapted contrastive learning on the PMC-15M set of fifteen million automatically extracted image-text pairs.
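For readers unfamiliar with the mechanics, here is a minimal sketch of the symmetric contrastive (InfoNCE) objective this family of models optimizes, with the temperature and batch size that the ledger below lists as free parameters. Variable names are illustrative; this is not the authors' training code.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb: torch.Tensor,
                          text_emb: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE over a batch of paired image/text embeddings.

    image_emb, text_emb: (batch, dim) encoder outputs; row i of each tensor
    comes from the same figure-caption pair.
    """
    image_emb = F.normalize(image_emb, dim=-1)       # cosine-similarity space
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature  # (batch, batch) similarities
    targets = torch.arange(logits.size(0), device=logits.device)
    # The diagonal holds the true pairs; score both retrieval directions.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2
```

Larger batches supply more in-batch negatives per pair, which is one reason batch size behaves as a free parameter of the objective rather than a mere engineering detail.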
If this is right
- A single generalist model can exceed the performance of multiple task-specific models when pretrained at sufficient scale and diversity.
- Pretraining on broad biomedical literature transfers to narrow clinical tasks such as pneumonia detection.
- Open release of the model weights enables immediate use and further fine-tuning on new biomedical datasets (see the zero-shot sketch after this list).
- The same extraction pipeline can be applied to other scientific literature corpora to create additional large multimodal datasets.
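Because the weights are public, the immediate-use claim is easy to check. Below is a minimal zero-shot probe via the open_clip library, using the hub id published with the release (see https://aka.ms/biomedclip); the prompt strings and image path are illustrative, and the tuple-unpacking API may vary across open_clip versions.

```python
import torch
from PIL import Image
from open_clip import create_model_from_pretrained, get_tokenizer

# Hub id as published with the open release; see https://aka.ms/biomedclip.
HUB = "hf-hub:microsoft/BiomedCLIP-PubMedBERT_256-vit_base_patch16_224"
model, preprocess = create_model_from_pretrained(HUB)
tokenizer = get_tokenizer(HUB)
model.eval()

# Illustrative prompts for a binary pneumonia probe (not the paper's templates).
prompts = ["chest X-ray showing pneumonia",
           "chest X-ray with no evidence of pneumonia"]
image = preprocess(Image.open("example_cxr.png")).unsqueeze(0)  # placeholder path
texts = tokenizer(prompts)

with torch.no_grad():
    image_features, text_features, logit_scale = model(image, texts)
    probs = (logit_scale * image_features @ text_features.t()).softmax(dim=-1)

print(dict(zip(prompts, probs.squeeze(0).tolist())))
```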
Where Pith is reading between the lines
- The approach suggests that literature-scale pretraining could be extended to additional modalities such as pathology slides or genomic data without manual curation.
- Models of this type could serve as backbones for real-time clinical decision support once integrated with hospital imaging systems.
- Further scaling the dataset size or adding temporal information from article publication dates might improve performance on rare conditions.
Load-bearing premise
Automatically extracted image-text pairs from scientific articles are clean and aligned enough for contrastive learning to produce representations that transfer to clinical tasks.
What would settle it
A controlled experiment in which a model trained on the same number of manually verified high-quality pairs substantially outperforms BiomedCLIP on the same downstream clinical benchmarks.
Original abstract
Biomedical data is inherently multimodal, comprising physical measurements and natural language narratives. A generalist biomedical AI model needs to simultaneously process different modalities of data, including text and images. Therefore, training an effective generalist biomedical model requires high-quality multimodal data, such as parallel image-text pairs. Here, we present PMC-15M, a novel dataset that is two orders of magnitude larger than existing biomedical multimodal datasets such as MIMIC-CXR, and spans a diverse range of biomedical image types. PMC-15M contains 15 million biomedical image-text pairs collected from 4.4 million scientific articles. Based on PMC-15M, we have pretrained BiomedCLIP, a multimodal foundation model, with domain-specific adaptations tailored to biomedical vision-language processing. We conducted extensive experiments and ablation studies on standard biomedical imaging tasks from retrieval to classification to visual question-answering (VQA). BiomedCLIP achieved new state-of-the-art results in a wide range of standard datasets, substantially outperforming prior approaches. Intriguingly, by large-scale pretraining on diverse biomedical image types, BiomedCLIP even outperforms state-of-the-art radiology-specific models such as BioViL in radiology-specific tasks such as RSNA pneumonia detection. In summary, BiomedCLIP is a fully open-access foundation model that achieves state-of-the-art performance on various biomedical tasks, paving the way for transformative multimodal biomedical discovery and applications. We release our models at https://aka.ms/biomedclip to facilitate future research in multimodal biomedical AI.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces PMC-15M, a dataset of 15 million biomedical image-text pairs automatically extracted from 4.4 million PubMed Central articles spanning diverse image types, and uses it to pretrain BiomedCLIP, a multimodal foundation model with domain-specific adaptations for vision-language processing. Extensive experiments and ablations are reported to show that BiomedCLIP achieves new state-of-the-art results on standard biomedical tasks including retrieval, classification, and VQA, and notably outperforms prior radiology-specific models such as BioViL on tasks like RSNA pneumonia detection.
Significance. If the performance claims hold after validation of the dataset, the work would be significant as the first large-scale open biomedical multimodal foundation model trained on two orders of magnitude more data than prior resources like MIMIC-CXR. It provides evidence that diverse pretraining across biomedical image types can yield generalist models competitive with or superior to specialized ones, and the public model release directly supports downstream research in multimodal biomedical AI.
Major comments (3)
- §3 (Dataset Construction): The extraction of PMC-15M pairs from scientific articles is described only at a high level (4.4M articles, automatic collection) with no reported alignment metrics, human audit results, noise estimates, or filtering criteria. This is load-bearing for the central claim that contrastive pretraining on these pairs drives the SOTA gains, as scientific captions often describe context, panels, or non-visual elements.
- §5 (Experiments): The reported SOTA numbers and outperformance over BioViL on RSNA pneumonia detection lack error bars, statistical significance tests, or explicit details on evaluation data splits and fine-tuning protocols. Without these, the robustness of the performance claims and the attribution to diverse pretraining versus evaluation differences cannot be assessed.
- §4.2 (Model Architecture and Training): The domain-specific adaptations (e.g., contrastive temperature, batch size, adaptation weights) are listed as free parameters but no ablation quantifies their contribution relative to the dataset scale; this weakens the claim that gains stem primarily from the 15M-pair pretraining.
Minor comments (2)
- Figure 1 and Table 1: Caption clarity could be improved by explicitly stating the total number of unique articles versus pairs and any deduplication steps.
- §2 (Related Work): The comparison to prior biomedical VL models would benefit from a table summarizing dataset sizes and reported metrics for direct reference.
Simulated Author's Rebuttal
We thank the referee for the constructive comments, which help strengthen the paper. We address each major point below, with planned revisions to improve clarity and rigor.
Point-by-point responses
Referee: §3 (Dataset Construction): The extraction of PMC-15M pairs from scientific articles is described only at a high level (4.4M articles, automatic collection) with no reported alignment metrics, human audit results, noise estimates, or filtering criteria. This is load-bearing for the central claim that contrastive pretraining on these pairs drives the SOTA gains, as scientific captions often describe context, panels, or non-visual elements.
Authors: We agree that more details on dataset quality are warranted. In the revised manuscript, we will expand §3 with explicit filtering criteria (image resolution > 224 px, caption length 10-500 tokens, removal of non-figure images via heuristics), results from a human audit of 1,000 randomly sampled pairs (reporting 87% visual-text alignment), and noise estimates from manual review (estimated 12% caption noise). These additions will better support attribution of gains to the pretraining data. (Revision: yes)
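The promised filtering criteria are concrete enough to sketch. Below is a minimal pass under the stated thresholds (shortest side above 224 px, 10-500 caption tokens); the Pair type, the whitespace tokenization, and the non-figure heuristic are illustrative assumptions, not the authors' pipeline.

```python
from dataclasses import dataclass

@dataclass
class Pair:
    image_width: int
    image_height: int
    caption: str

def keep_pair(pair: Pair) -> bool:
    """Apply the rebuttal's stated thresholds; all helpers here are illustrative."""
    # Resolution: shortest side must exceed 224 px.
    if min(pair.image_width, pair.image_height) <= 224:
        return False
    # Caption length: 10-500 tokens (whitespace split as a stand-in tokenizer).
    if not 10 <= len(pair.caption.split()) <= 500:
        return False
    # Crude non-figure heuristic: drop table/equation/algorithm captions.
    if pair.caption.lower().startswith(("table", "algorithm", "equation")):
        return False
    return True

sample = Pair(512, 384, "Figure 2. Axial CT showing a 2 cm lesion in the right "
                        "lower lobe with surrounding ground-glass opacity.")
print(keep_pair(sample))  # True
```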
Referee: §5 (Experiments): The reported SOTA numbers and outperformance over BioViL on RSNA pneumonia detection lack error bars, statistical significance tests, or explicit details on evaluation data splits and fine-tuning protocols. Without these, the robustness of the performance claims and the attribution to diverse pretraining versus evaluation differences cannot be assessed.
Authors: We acknowledge the need for statistical rigor. The revision will add standard deviation error bars across 3 random seeds for all metrics, paired t-test p-values for comparisons (e.g., vs. BioViL on RSNA), and full details on evaluation splits (using official RSNA and other dataset partitions) plus fine-tuning protocols (learning rate, epochs, batch size). This will clarify robustness and isolate pretraining effects. (Revision: yes)
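The promised statistics are standard and easy to make concrete. A minimal sketch of seed-level mean and standard deviation plus a paired t-test using scipy's ttest_rel; all numbers are placeholders, not results from the paper.

```python
import numpy as np
from scipy.stats import ttest_rel

# Placeholder AUROC values across 3 random seeds per model (not paper results).
biomedclip = np.array([0.823, 0.819, 0.826])
biovil = np.array([0.801, 0.805, 0.798])

print(f"BiomedCLIP: {biomedclip.mean():.3f} +/- {biomedclip.std(ddof=1):.3f}")
print(f"BioViL:     {biovil.mean():.3f} +/- {biovil.std(ddof=1):.3f}")

# Paired t-test across matched seeds/splits, as the revision proposes.
t_stat, p_value = ttest_rel(biomedclip, biovil)
print(f"paired t-test: t = {t_stat:.2f}, p = {p_value:.4f}")
```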
Referee: §4.2 (Model Architecture and Training): The domain-specific adaptations (e.g., contrastive temperature, batch size, adaptation weights) are listed as free parameters but no ablation quantifies their contribution relative to the dataset scale; this weakens the claim that gains stem primarily from the 15M-pair pretraining.
Authors: We partially agree; while dataset scale is primary, we did not fully isolate adaptations. We will add an ablation in §4.2 training a baseline CLIP model (standard hyperparameters) vs. BiomedCLIP adaptations on a 1M-pair subset of PMC-15M, quantifying gains (e.g., +2.3% retrieval). However, full-scale ablations are computationally prohibitive, so we will note this limitation while emphasizing cross-dataset comparisons showing diversity benefits. (Revision: partial)
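One way the proposed 1M-pair ablation could be laid out: two arms that share everything except the domain adaptations, so any gap is attributable to the adaptations rather than data scale. All values below are illustrative assumptions (the released model's name suggests a PubMedBERT text encoder with 256-token context, but the exact recipe is the paper's to report).

```python
# Two arms differing only in the adaptation knobs, trained on the same subset.
common = {"dataset": "pmc15m_1m_subset", "epochs": 32, "image_size": 224}

configs = {
    "clip_baseline": {
        **common,
        "text_encoder": "clip_text",    # generic tokenizer and text tower
        "context_length": 77,
        "batch_size": 1024,
        "init_temperature": 0.07,
    },
    "biomedclip_adapted": {
        **common,
        "text_encoder": "pubmedbert",   # domain-specific text tower
        "context_length": 256,
        "batch_size": 4096,
        "init_temperature": 0.07,       # tuned separately in the full recipe
    },
}

for name, cfg in configs.items():
    print(name, cfg)
```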
Circularity Check
No circularity: empirical results on held-out benchmarks
Full rationale
The paper reports pretraining BiomedCLIP on the externally collected PMC-15M dataset (15M image-text pairs from 4.4M articles) followed by evaluation on standard held-out biomedical benchmarks (retrieval, classification, VQA, RSNA pneumonia). No equations, derivations, or fitted parameters are defined such that any reported performance reduces to the pretraining inputs by construction. The central claim is an empirical observation of SOTA numbers after scale, not a self-referential prediction or renamed fit. Self-citations to prior CLIP-style work are not load-bearing for the uniqueness of the result.
Axiom & Free-Parameter Ledger
Free parameters (2)
- contrastive temperature and batch size
- domain-specific adaptation weights
Axioms (1)
- Domain assumption: Image-text pairs extracted from PubMed Central articles provide useful, sufficiently aligned supervision for learning general biomedical visual concepts.
Forward citations
Cited by 27 Pith papers
- CheXTemporal: A Dataset for Temporally-Grounded Reasoning in Chest Radiography
  CheXTemporal supplies paired chest X-rays with explicit temporal progression taxonomy and spatial grounding to benchmark and improve models on longitudinal reasoning tasks.
- CXR-ContraBench: Benchmarking Negated-Option Attraction in Medical VLMs
  Medical VLMs frequently select negated options that contradict visible chest X-ray findings, achieving only ~30% accuracy on direct presence probes, but a post-hoc consistency verifier raises accuracy above 95%.
- iTRIALSPACE: Programmable Virtual Lesion Trials for Controlled Evaluation of Lung CT Models
  iTRIALSPACE generates realistic virtual lesion trials on lung CTs that isolate performance drivers and show strong transfer of model rankings to real clinical data (ρ=0.93).
- MedLayBench-V: A Large-Scale Benchmark for Expert-Lay Semantic Alignment in Medical Vision Language Models
  MedLayBench-V is the first large-scale multimodal benchmark for expert-lay semantic alignment in medical vision-language models, constructed via a Structured Concept-Grounded Refinement pipeline that uses UMLS CUIs to...
- CoDA: Exploring Chain-of-Distribution Attacks and Post-Hoc Token-Space Repair for Medical Vision-Language Models
  CoDA chains clinically plausible acquisition, reconstruction, display, and delivery shifts to substantially degrade zero-shot performance of medical vision-language models, with a post-hoc token-space repair partially...
- A General Bézier Tree Encoding Counterfactual Framework for Retinal-Vessel-Mediated Disease Analysis
  BTECF encodes retinal vessels as Bézier trees to enable targeted, parameter-level counterfactual interventions on vessel geometry for causal analysis of vascular diseases.
- CLEF: EEG Foundation Model for Learning Clinical Semantics
  CLEF, a long-context EEG foundation model using 3D multitaper spectrograms and contrastive alignment with reports and EHR, beats prior models on 229 of 234 clinical tasks and raises mean AUROC from 0.65 to 0.74.
- MSD-Score: Multi-Scale Distributional Scoring for Reference-Free Image Caption Evaluation
  MSD-Score introduces multi-scale distributional scoring on von Mises-Fisher mixtures to evaluate image captions without references and reports state-of-the-art correlation with human judgments.
- CheXmix: Unified Generative Pretraining for Vision Language Models in Medical Imaging
  CheXmix combines masked autoencoder pretraining with early-fusion generative modeling to outperform prior models on chest X-ray classification by up to 8.6% AUROC, inpainting by 51%, and report generation by 45% on GREEN.
- Are Natural-Domain Foundation Models Effective for Accelerated Cardiac MRI Reconstruction?
  Natural-domain foundation models provide competitive and more robust priors than task-specific models for accelerated cardiac MRI reconstruction in cross-domain settings.
- REVEAL: Multimodal Vision-Language Alignment of Retinal Morphometry and Clinical Risks for Incident AD and Dementia Prediction
  REVEAL uses vision-language alignment of retinal morphometry and clinical risk narratives plus group contrastive learning to predict AD and dementia about 8 years early.
- Adapting in the Dark: Efficient and Stable Test-Time Adaptation for Black-Box Models
  BETA adapts black-box models at test time using a local steering model and regularization techniques to achieve accuracy improvements without additional API queries or high latency.
- Improving Medical VQA through Trajectory-Aware Process Supervision
  A trajectory-aware process reward using DTW on sentence embeddings, combined with exact-match in GRPO after SFT, raises mean medical VQA accuracy from 0.598 to 0.689 across six benchmarks.
- Visual Instruction-Finetuned Language Model for Versatile Brain MR Image Tasks
  LLaBIT is a single instruction-finetuned LLM that performs report generation, VQA, segmentation, and translation on brain MRI images while outperforming task-specific models.
- An Explainable Vision-Language Model Framework with Adaptive PID-Tversky Loss for Lumbar Spinal Stenosis Diagnosis
  A VLM framework with spatial patch cross-attention and adaptive PID-Tversky loss reports 90.69% classification accuracy, 0.9512 Dice score, and 92.80 CIDEr for LSS diagnosis plus automated report generation.
- Any2Any 3D Diffusion Models with Knowledge Transfer: A Radiotherapy Planning Study
  DiffKT3D transfers priors from video diffusion models to 3D radiotherapy dose prediction via modality-specific embeddings and clinically guided RL, reducing voxel MAE from 2.07 to 1.93 and claiming SOTA over the GDP-H...
- Cross-Modal Semantic-Enhanced Diffusion Framework for Diabetic Retinopathy Grading
  CGSD framework reaches 87.5% accuracy and 0.731 macro F1 on APTOS 2019 by conditioning diffusion denoising on dot-product vectors from image features and DR-grade text descriptions.
- MultiMedVision: Multi-Modal Medical Vision Framework
  A unified Sparse Vision Transformer learns joint 2D/3D medical image representations via self-supervision and achieves competitive AUROC on chest X-ray and CT benchmarks with 5x less data than modality-specific models.
- CapCLIP: A Vision-Language Representation Alignment Approach for Wireless Capsule Endoscopy Analysis
  CapCLIP uses pathology-aware text captions to align WCE images in a vision-language space, outperforming standard models in zero-shot classification and retrieval on unseen data.
- Pan-FM: A Pan-Organ Foundation Model with Saliency-Guided Masking for Missing Robustness
  Pan-FM learns balanced representations across seven organs by adaptively masking dominant organs during pre-training, yielding stronger disease prediction and missing-organ robustness than single-organ or naive multim...
- Learning from Medical Entity Trees: An Entity-Centric Medical Data Engineering Framework for MLLMs
  A Medical Entity Tree organizes medical knowledge to engineer higher-quality training data that boosts general MLLMs on medical benchmarks.
- Align then Refine: Text-Guided 3D Prostate Lesion Segmentation
  A text-guided multi-encoder U-Net with alignment loss, heatmap calibration, and confidence-gated cross-attention refiner sets new state-of-the-art 3D prostate lesion segmentation performance on the PI-CAI dataset.
- T-Gated Adapter: A Lightweight Temporal Adapter for Vision-Language Medical Segmentation
  A temporal adapter injects adjacent-slice context into VLM token representations, raising mean Dice from 0.498 to 0.704 on FLARE22 and reducing cross-domain drop from 38% to 24.9%.
- A Utility-preserving De-identification Pipeline for Cross-hospital Radiology Data Sharing
  The UPDP pipeline filters privacy terms and generates de-identified radiology images that preserve diagnostic pathology information, enabling models with competitive disease detection accuracy but reduced identity lea...
- Lingshu: A Generalist Foundation Model for Unified Multimodal Medical Understanding and Reasoning
  Lingshu is a medical-specialized multimodal LLM that outperforms prior open-source models on multimodal QA, text QA, and report generation after training on a large curated dataset of medical knowledge.
- CoRE: Concept-Reasoning Expansion for Continual Brain Lesion Segmentation
  CoRE aligns image tokens to a hierarchical concept library to simulate clinical reasoning for expert routing and demand-based growth in continual brain lesion segmentation, achieving SOTA on 12 tasks.
- Structure-Augmented Standard Plane Detection with Temporal Aggregation in Blind-Sweep Fetal Ultrasound
  Structure augmentation via segmentation prior plus temporal aggregation stabilizes keyframe detection of fetal abdomen planes in blind-sweep ultrasound.