NeuroQA is a large-scale 3D brain MRI visual question answering benchmark with verified image-grounded QA pairs, multi-domain coverage, and baseline evaluations showing current models lag behind text-only performance.
BiomedCLIP: a multimodal biomedical foundation model pretrained from fifteen million scientific image-text pairs
Canonical reference. 70% of citing Pith papers cite this work as background.
abstract
Biomedical data is inherently multimodal, comprising physical measurements and natural language narratives. A generalist biomedical AI model needs to simultaneously process different modalities of data, including text and images. Therefore, training an effective generalist biomedical model requires high-quality multimodal data, such as parallel image-text pairs. Here, we present PMC-15M, a novel dataset that is two orders of magnitude larger than existing biomedical multimodal datasets such as MIMIC-CXR, and spans a diverse range of biomedical image types. PMC-15M contains 15 million biomedical image-text pairs collected from 4.4 million scientific articles. Based on PMC-15M, we have pretrained BiomedCLIP, a multimodal foundation model, with domain-specific adaptations tailored to biomedical vision-language processing. We conducted extensive experiments and ablation studies on standard biomedical imaging tasks from retrieval to classification to visual question-answering (VQA). BiomedCLIP achieved new state-of-the-art results in a wide range of standard datasets, substantially outperforming prior approaches. Intriguingly, by large-scale pretraining on diverse biomedical image types, BiomedCLIP even outperforms state-of-the-art radiology-specific models such as BioViL in radiology-specific tasks such as RSNA pneumonia detection. In summary, BiomedCLIP is a fully open-access foundation model that achieves state-of-the-art performance on various biomedical tasks, paving the way for transformative multimodal biomedical discovery and applications. We release our models at https://aka.ms/biomedclip to facilitate future research in multimodal biomedical AI.
representative citing papers
CheXTemporal supplies paired chest X-rays with explicit temporal progression taxonomy and spatial grounding to benchmark and improve models on longitudinal reasoning tasks.
SonoCLIP presents a mask-guided region-aware vision-language foundation model pretrained on 1.44M fetal ultrasound images, demonstrating superior zero-shot performance.
Donor-driven nodule properties in synthetic CT transfer to real lung CT vision-language tasks while host-driven anatomy properties do not, enabling a label-free diagnostic for model routing.
MetaCLIP-CMR applies CLIP-style contrastive learning to cardiac MRI by treating acquisition metadata as text labels, delivering 86.8% modality and 86.5% view accuracy plus top Dice scores on ACDC/M&Ms segmentation with far less pre-training data than recent large-scale CMR models.
PHOEBI is a benchmark dataset and LCO evaluation protocol for open-world multi-label bacterial species identification from phase-contrast microscopy of polymicrobial samples.
Introduces MMBU benchmark for VLMs in biomedicine and demonstrates that established benchmarks mask perception deficiencies in evaluated models.
EchoPilot delivers state-of-the-art training-free ultrasound video segmentation from a single point prompt by introducing scale-space semantic prompting via S.E.E.D. and reliability-gated memory updates.
EchoVQA is the first large-scale VQA dataset for echocardiography spanning high- and low-quality images across views, with acquisition guidance questions, paired with a low-parameter multimodal prompt model that reports SOTA on several benchmarks.
HalluCXR benchmark shows 61.9-82.3% hallucination rates across VLMs on MIMIC-CXR images, identifies patterns such as length-based risk and over-fabrication of common findings, and demonstrates ensemble mitigation that cuts fabrication by up to 84.8%.
MedCRP-CL discovers semantic modalities online via CRP from text prompts and maintains modality-specific LoRA adapters with intra-modality EWC, achieving 73.3% Dice and 4.1% forgetting on 16 tasks while using 6x fewer parameters than the best baseline.
Next-acceleration-scale autoregressive prediction in discrete latent space with on-policy privileged information distillation yields improved MRI reconstructions from sparse measurements on the fastMRI benchmark.
Medical VLMs frequently select negated options that contradict visible chest X-ray findings, achieving only ~30% accuracy on direct presence probes, but a post-hoc consistency verifier raises accuracy above 95%.
iTRIALSPACE generates realistic virtual lesion trials on lung CTs that isolate performance drivers and show strong transfer of model rankings to real clinical data (ρ=0.93).
CoDA chains clinically plausible acquisition, reconstruction, display, and delivery shifts to substantially degrade zero-shot performance of medical vision-language models, with a post-hoc token-space repair partially recovering accuracy.
CardioBench is a new public benchmark that standardizes eight echocardiography datasets into four regression and five classification tasks to evaluate foundation model generalization.
GRAPE augments prototype medical image classifiers with graph attention for co-occurrence, a mismatch safety check, and open-vocabulary anchoring to support incremental addition of findings from single examples.
PlantMicro benchmark shows current VLMs achieve low accuracy (e.g. GPT-5 at 34.93% on pathogen classification) on fine-grained microscopic plant image tasks.
CAFM is a four-stage framework that anchors EHR foundation models to patient cohorts via deviation-aware curation, cohort-conditioned pretraining, multimodal alignment, and clinician refinement to improve interpretability and trustworthiness.
Benchmark study shows zero-shot VLMs achieve near-random results (kappa <=0.10) on individual student videos but moderate agreement (kappa ~0.60) on scene-level images, with up to 32-point accuracy swings from prompt changes alone.
AI rewriting tasks that standardize radiology reports erode cross-modal image-text alignment more than they erode clinical entities or hedging language, creating a dissociation termed the slop paradox.
Frozen ViT embeddings in chest radiography suppress small-lesion signal at the CLS token but recover it via patch-local pooling on the same forward pass across multiple models and large cohorts.
OGKD injects inter-class geometry into teacher targets for two distillation losses (GAD on global tokens, LGD on patches) and reports 1.7-2.8% average accuracy gains over prior VLM adaptation methods on 11 medical datasets.
TC-LIA detects mirage in VLMs via layer-wise image patch to question alignment in CLIP encoders, reaching 94.6-94.7% three-class accuracy and under 3% mirage rate across five domains and twelve backbones.
citing papers explorer
DIYHealth Suite: Dataset, Model, and Benchmark for Health Management at Home
DIYHealth Suite introduces a large home-care dataset, DIYHealthGPT model with Hybrid Hyper Low-Rank Adaptation, and DIYHealthBench, claiming SOTA results on 11 tasks over general and medical baselines.