pith. sign in

hub Mixed citations

Llava-med:Trainingalargelanguage-and- vision assistant for biomedicine in one day

Mixed citation behavior. Most common role is background (57%).

22 Pith papers citing it
Background 57% of classified citations
abstract

Conversational generative AI has demonstrated remarkable promise for empowering biomedical practitioners, but current investigations focus on unimodal text. Multimodal conversational AI has seen rapid progress by leveraging billions of image-text pairs from the public web, but such general-domain vision-language models still lack sophistication in understanding and conversing about biomedical images. In this paper, we propose a cost-efficient approach for training a vision-language conversational assistant that can answer open-ended research questions of biomedical images. The key idea is to leverage a large-scale, broad-coverage biomedical figure-caption dataset extracted from PubMed Central, use GPT-4 to self-instruct open-ended instruction-following data from the captions, and then fine-tune a large general-domain vision-language model using a novel curriculum learning method. Specifically, the model first learns to align biomedical vocabulary using the figure-caption pairs as is, then learns to master open-ended conversational semantics using GPT-4 generated instruction-following data, broadly mimicking how a layperson gradually acquires biomedical knowledge. This enables us to train a Large Language and Vision Assistant for BioMedicine (LLaVA-Med) in less than 15 hours (with eight A100s). LLaVA-Med exhibits excellent multimodal conversational capability and can follow open-ended instruction to assist with inquiries about a biomedical image. On three standard biomedical visual question answering datasets, LLaVA-Med outperforms previous supervised state-of-the-art on certain metrics. To facilitate biomedical multimodal research, we will release our instruction-following data and the LLaVA-Med model.

hub tools

citation-role summary

background 4 baseline 1 method 1 other 1

citation-polarity summary

representative citing papers

Agent AI: Surveying the Horizons of Multimodal Interaction

cs.AI · 2024-01-07 · unverdicted · novelty 4.0

The paper defines Agent AI as interactive multimodal systems that perceive grounded data and generate embodied actions, arguing this approach can mitigate hallucinations in foundation models.

Foundation Models in Biomedical Imaging: Turning Hype into Reality

q-bio.QM · 2025-12-17 · unverdicted · novelty 4.0

Foundation models excel at pattern recognition in biomedical imaging but lack causal reasoning, robustness, and safety for real-world use, so they should augment rather than replace clinical expertise according to the proposed REAL-FM assessment framework.

Improved Baselines with Visual Instruction Tuning

cs.CV · 2023-10-05 · conditional · novelty 4.0

Simple changes to LLaVA using CLIP-ViT-L-336px, an MLP connector, and academic VQA data yield state-of-the-art results on 11 benchmarks with only 1.2M public examples and one-day training on 8 A100 GPUs.

A Survey on Multimodal Large Language Models

cs.CV · 2023-06-23 · accept · novelty 3.0

This survey organizes the architectures, training strategies, data, evaluation methods, extensions, and challenges of Multimodal Large Language Models.

citing papers explorer

Showing 22 of 22 citing papers.