{"work":{"id":"19dcd63e-57db-4bc7-83cb-d96d41270f55","openalex_id":null,"doi":null,"arxiv_id":"2306.00890","raw_key":null,"title":"LLaVA-Med: Training a Large Language-and-Vision Assistant for Biomedicine in One Day","authors":null,"authors_text":"Li, C","year":2023,"venue":"cs.CV","abstract":"Conversational generative AI has demonstrated remarkable promise for empowering biomedical practitioners, but current investigations focus on unimodal text. Multimodal conversational AI has seen rapid progress by leveraging billions of image-text pairs from the public web, but such general-domain vision-language models still lack sophistication in understanding and conversing about biomedical images. In this paper, we propose a cost-efficient approach for training a vision-language conversational assistant that can answer open-ended research questions of biomedical images. The key idea is to leverage a large-scale, broad-coverage biomedical figure-caption dataset extracted from PubMed Central, use GPT-4 to self-instruct open-ended instruction-following data from the captions, and then fine-tune a large general-domain vision-language model using a novel curriculum learning method. Specifically, the model first learns to align biomedical vocabulary using the figure-caption pairs as is, then learns to master open-ended conversational semantics using GPT-4 generated instruction-following data, broadly mimicking how a layperson gradually acquires biomedical knowledge. This enables us to train a Large Language and Vision Assistant for BioMedicine (LLaVA-Med) in less than 15 hours (with eight A100s). LLaVA-Med exhibits excellent multimodal conversational capability and can follow open-ended instruction to assist with inquiries about a biomedical image. On three standard biomedical visual question answering datasets, LLaVA-Med outperforms previous supervised state-of-the-art on certain metrics. 
To facilitate biomedical multimodal research, we will release our instruction-following data and the LLaVA-Med model.","external_url":"https://arxiv.org/abs/2306.00890","cited_by_count":null,"metadata_source":"pith","metadata_fetched_at":"2026-05-24T04:13:53.141689+00:00","pith_arxiv_id":"2306.00890","created_at":"2026-05-10T12:10:23.159045+00:00","updated_at":"2026-05-24T04:13:53.141689+00:00","title_quality_ok":true,"display_title":"LLaVA-Med: Training a Large Language-and-Vision Assistant for Biomedicine in One Day","render_title":"LLaVA-Med: Training a Large Language-and-Vision Assistant for Biomedicine in One Day"},"hub":{"state":{"work_id":"19dcd63e-57db-4bc7-83cb-d96d41270f55","tier":"hub","tier_reason":"10+ Pith inbound or 1,000+ external citations","pith_inbound_count":22,"external_cited_by_count":null,"distinct_field_count":5,"first_pith_cited_at":"2023-06-23T15:21:52+00:00","last_pith_cited_at":"2026-05-18T18:50:56+00:00","author_build_status":"not_needed","summary_status":"needed","contexts_status":"needed","graph_status":"needed","ask_index_status":"not_needed","reader_status":"not_needed","recognition_status":"not_needed","updated_at":"2026-05-26T18:06:46.369001+00:00","tier_text":"hub"},"tier":"hub","role_counts":[{"context_role":"background","n":4},{"context_role":"baseline","n":1},{"context_role":"method","n":1},{"context_role":"other","n":1}],"polarity_counts":[{"context_polarity":"background","n":4},{"context_polarity":"baseline","n":1},{"context_polarity":"support","n":1},{"context_polarity":"unclear","n":1}],"runs":{},"summary":{},"graph":{},"authors":[]}}