hub Mixed citations

Llava-med:Trainingalargelanguage-and- vision assistant for biomedicine in one day

Li, C · 2023 · cs.CV · arXiv 2306.00890

Mixed citation behavior. Most common role is background (57%).

22 Pith papers citing it

Background 57% of classified citations

open full Pith review browse 22 citing papers arXiv PDF

abstract

Conversational generative AI has demonstrated remarkable promise for empowering biomedical practitioners, but current investigations focus on unimodal text. Multimodal conversational AI has seen rapid progress by leveraging billions of image-text pairs from the public web, but such general-domain vision-language models still lack sophistication in understanding and conversing about biomedical images. In this paper, we propose a cost-efficient approach for training a vision-language conversational assistant that can answer open-ended research questions of biomedical images. The key idea is to leverage a large-scale, broad-coverage biomedical figure-caption dataset extracted from PubMed Central, use GPT-4 to self-instruct open-ended instruction-following data from the captions, and then fine-tune a large general-domain vision-language model using a novel curriculum learning method. Specifically, the model first learns to align biomedical vocabulary using the figure-caption pairs as is, then learns to master open-ended conversational semantics using GPT-4 generated instruction-following data, broadly mimicking how a layperson gradually acquires biomedical knowledge. This enables us to train a Large Language and Vision Assistant for BioMedicine (LLaVA-Med) in less than 15 hours (with eight A100s). LLaVA-Med exhibits excellent multimodal conversational capability and can follow open-ended instruction to assist with inquiries about a biomedical image. On three standard biomedical visual question answering datasets, LLaVA-Med outperforms previous supervised state-of-the-art on certain metrics. To facilitate biomedical multimodal research, we will release our instruction-following data and the LLaVA-Med model.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 4 baseline 1 method 1 other 1

citation-polarity summary

background 4 baseline 1 support 1 unclear 1

representative citing papers

Detecting and Evaluating Medical Hallucinations in Large Vision Language Models

cs.CV · 2024-06-14 · unverdicted · novelty 7.0

Presents Med-HallMark benchmark, MediHall Score metric, and MediHallDetector model for hallucination detection and evaluation in medical LVLMs.

DermAgent: A Self-Reflective Agentic System for Dermatological Image Analysis with Multi-Tool Reasoning and Traceable Decision-Making

cs.CV · 2026-05-14 · unverdicted · novelty 7.0

DermAgent orchestrates seven vision-language tools in a Plan-Execute-Reflect loop with dual-modality retrieval from 413k cases and a critic module to outperform GPT-4o by 17.6% in zero-shot dermatological diagnosis accuracy.

ProtoMedAgent: Multimodal Clinical Interpretability via Privacy-Aware Agentic Workflows

cs.CV · 2026-05-13 · unverdicted · novelty 7.0

ProtoMedAgent uses a privacy-aware agentic workflow with neuro-symbolic bottlenecks to achieve 91.2% faithfulness in clinical report generation, significantly outperforming standard RAG methods on a large patient cohort.

CXR-ContraBench: Benchmarking Negated-Option Attraction in Medical VLMs

cs.CV · 2026-05-07 · conditional · novelty 7.0

Medical VLMs frequently select negated options that contradict visible chest X-ray findings, achieving only ~30% accuracy on direct presence probes, but a post-hoc consistency verifier raises accuracy above 95%.

Vision-Language Foundation Models for Comprehensive Automated Pavement Condition Assessment

cs.CV · 2026-04-09 · unverdicted · novelty 7.0

Instruction-tuned vision-language model PaveGPT, trained on a large unified pavement dataset, achieves substantial gains over general models in comprehensive, standard-compliant pavement condition assessment.

R1-VL: Learning to Reason with Multimodal Large Language Models via Step-wise Group Relative Policy Optimization

cs.AI · 2025-03-17 · conditional · novelty 7.0

R1-VL uses StepGRPO with rule-based StepRAR and StepRVR rewards to let MLLMs learn step-by-step reasoning beyond imitation of positive paths.

Infection-Reasoner: A Compact Vision-Language Model for Wound Infection Classification with Evidence-Grounded Clinical Reasoning

cs.CV · 2026-04-21 · unverdicted · novelty 6.0

Infection-Reasoner, a 4B VLM, reaches 86.8% accuracy on wound infection classification while producing rationales rated mostly correct by experts, via GPT-5.1 distillation followed by reinforcement learning.

CoME-VL: Scaling Complementary Multi-Encoder Vision-Language Learning

cs.CV · 2026-04-03 · unverdicted · novelty 6.0

CoME-VL fuses contrastive and self-supervised vision encoders via entropy-guided multi-layer aggregation and RoPE cross-attention to improve vision-language model performance on benchmarks.

An Explainable Vision-Language Model Framework with Adaptive PID-Tversky Loss for Lumbar Spinal Stenosis Diagnosis

cs.CV · 2026-04-02 · unverdicted · novelty 6.0

A VLM framework with spatial patch cross-attention and adaptive PID-Tversky loss reports 90.69% classification accuracy, 0.9512 Dice score, and 92.80 CIDEr for LSS diagnosis plus automated report generation.

RoiMAM: Region-of-Interest Medical Attention Model for Efficient Vision-Language Understanding

cs.CV · 2026-05-15 · unverdicted · novelty 5.0

RoiMAM integrates a training-free ROI Generation Module with Semantic Selective Suppression and a Text Prompt Enhancer to produce a compact VLM that reports 2 percent and 4.6 percent accuracy gains on SLAKE and PMC-VQA at less than 20 percent the size of MedVInT-TD.

Preserving Knowledge in Large Language Model with Model-Agnostic Self-Decompression

cs.CL · 2024-06-17 · unverdicted · novelty 5.0

Introduces Tree Generation (TG-SFT) to generate synthetic instruction-tuning data from LLMs, reducing catastrophic forgetting when fine-tuning MLLMs on domain-specific or multimodal data.

AgentRx: A Benchmark Study of LLM Agents for Multimodal Clinical Prediction Tasks

cs.AI · 2026-05-11 · unverdicted · novelty 5.0 · 2 refs

Single-agent LLM frameworks outperform naive multi-agent systems in multimodal clinical risk prediction tasks and are better calibrated.

RadLite: Multi-Task LoRA Fine-Tuning of Small Language Models for CPU-Deployable Radiology AI

cs.CL · 2026-05-01 · unverdicted · novelty 5.0

LoRA fine-tuning of 3-4B SLMs on 162K multi-task radiology data yields strong performance deployable on consumer CPUs at 4-8 tokens/second.

Medical Image Understanding Improves Survival Prediction via Visual Instruction Tuning

cs.CV · 2026-04-20 · unverdicted · novelty 5.0

A vision-language model pre-trained via instruction tuning on CT-report pairs improves survival prediction accuracy over baselines, especially when clinical data alone is weak, while also producing text answers to clinical questions.

Aligning Modalities in Vision Large Language Models via Preference Fine-tuning

cs.LG · 2024-02-18 · unverdicted · novelty 5.0

POVID generates AI-created preference data to fine-tune vision-language models with DPO, reducing hallucinations and improving benchmark scores.

Agent AI: Surveying the Horizons of Multimodal Interaction

cs.AI · 2024-01-07 · unverdicted · novelty 4.0

The paper defines Agent AI as interactive multimodal systems that perceive grounded data and generate embodied actions, arguing this approach can mitigate hallucinations in foundation models.

MedXIAOHE: A Comprehensive Recipe for Building Medical MLLMs

cs.CL · 2026-02-13 · unverdicted · novelty 4.0

MedXIAOHE is a medical MLLM that claims state-of-the-art benchmark performance through specialized pretraining to cover long-tail diseases and RL-based reasoning training.

Foundation Models in Biomedical Imaging: Turning Hype into Reality

q-bio.QM · 2025-12-17 · unverdicted · novelty 4.0

Foundation models excel at pattern recognition in biomedical imaging but lack causal reasoning, robustness, and safety for real-world use, so they should augment rather than replace clinical expertise according to the proposed REAL-FM assessment framework.

Improved Baselines with Visual Instruction Tuning

cs.CV · 2023-10-05 · conditional · novelty 4.0

Simple changes to LLaVA using CLIP-ViT-L-336px, an MLP connector, and academic VQA data yield state-of-the-art results on 11 benchmarks with only 1.2M public examples and one-day training on 8 A100 GPUs.

Data-Centric Foundation Models in Computational Healthcare: A Survey

cs.LG · 2024-01-04 · unverdicted · novelty 3.0

The paper surveys data-centric strategies for foundation models in computational healthcare and supplies a curated list of related models and datasets.

A Survey on Multimodal Large Language Models

cs.CV · 2023-06-23 · accept · novelty 3.0

This survey organizes the architectures, training strategies, data, evaluation methods, extensions, and challenges of Multimodal Large Language Models.

MedFM-Robust: Benchmarking Robustness of Medical Foundation Models

cs.CV · 2026-05-18 · 2 refs

citing papers explorer

Showing 22 of 22 citing papers.

Detecting and Evaluating Medical Hallucinations in Large Vision Language Models cs.CV · 2024-06-14 · unverdicted · none · ref 11 · internal anchor
Presents Med-HallMark benchmark, MediHall Score metric, and MediHallDetector model for hallucination detection and evaluation in medical LVLMs.
DermAgent: A Self-Reflective Agentic System for Dermatological Image Analysis with Multi-Tool Reasoning and Traceable Decision-Making cs.CV · 2026-05-14 · unverdicted · none · ref 16
DermAgent orchestrates seven vision-language tools in a Plan-Execute-Reflect loop with dual-modality retrieval from 413k cases and a critic module to outperform GPT-4o by 17.6% in zero-shot dermatological diagnosis accuracy.
ProtoMedAgent: Multimodal Clinical Interpretability via Privacy-Aware Agentic Workflows cs.CV · 2026-05-13 · unverdicted · none · ref 21
ProtoMedAgent uses a privacy-aware agentic workflow with neuro-symbolic bottlenecks to achieve 91.2% faithfulness in clinical report generation, significantly outperforming standard RAG methods on a large patient cohort.
CXR-ContraBench: Benchmarking Negated-Option Attraction in Medical VLMs cs.CV · 2026-05-07 · conditional · none · ref 16
Medical VLMs frequently select negated options that contradict visible chest X-ray findings, achieving only ~30% accuracy on direct presence probes, but a post-hoc consistency verifier raises accuracy above 95%.
Vision-Language Foundation Models for Comprehensive Automated Pavement Condition Assessment cs.CV · 2026-04-09 · unverdicted · none · ref 11
Instruction-tuned vision-language model PaveGPT, trained on a large unified pavement dataset, achieves substantial gains over general models in comprehensive, standard-compliant pavement condition assessment.
R1-VL: Learning to Reason with Multimodal Large Language Models via Step-wise Group Relative Policy Optimization cs.AI · 2025-03-17 · conditional · none · ref 19
R1-VL uses StepGRPO with rule-based StepRAR and StepRVR rewards to let MLLMs learn step-by-step reasoning beyond imitation of positive paths.
Infection-Reasoner: A Compact Vision-Language Model for Wound Infection Classification with Evidence-Grounded Clinical Reasoning cs.CV · 2026-04-21 · unverdicted · none · ref 31
Infection-Reasoner, a 4B VLM, reaches 86.8% accuracy on wound infection classification while producing rationales rated mostly correct by experts, via GPT-5.1 distillation followed by reinforcement learning.
CoME-VL: Scaling Complementary Multi-Encoder Vision-Language Learning cs.CV · 2026-04-03 · unverdicted · none · ref 34
CoME-VL fuses contrastive and self-supervised vision encoders via entropy-guided multi-layer aggregation and RoPE cross-attention to improve vision-language model performance on benchmarks.
An Explainable Vision-Language Model Framework with Adaptive PID-Tversky Loss for Lumbar Spinal Stenosis Diagnosis cs.CV · 2026-04-02 · unverdicted · none · ref 20
A VLM framework with spatial patch cross-attention and adaptive PID-Tversky loss reports 90.69% classification accuracy, 0.9512 Dice score, and 92.80 CIDEr for LSS diagnosis plus automated report generation.
RoiMAM: Region-of-Interest Medical Attention Model for Efficient Vision-Language Understanding cs.CV · 2026-05-15 · unverdicted · none · ref 6 · internal anchor
RoiMAM integrates a training-free ROI Generation Module with Semantic Selective Suppression and a Text Prompt Enhancer to produce a compact VLM that reports 2 percent and 4.6 percent accuracy gains on SLAKE and PMC-VQA at less than 20 percent the size of MedVInT-TD.
Preserving Knowledge in Large Language Model with Model-Agnostic Self-Decompression cs.CL · 2024-06-17 · unverdicted · none · ref 30 · internal anchor
Introduces Tree Generation (TG-SFT) to generate synthetic instruction-tuning data from LLMs, reducing catastrophic forgetting when fine-tuning MLLMs on domain-specific or multimodal data.
AgentRx: A Benchmark Study of LLM Agents for Multimodal Clinical Prediction Tasks cs.AI · 2026-05-11 · unverdicted · none · ref 41 · 2 links
Single-agent LLM frameworks outperform naive multi-agent systems in multimodal clinical risk prediction tasks and are better calibrated.
RadLite: Multi-Task LoRA Fine-Tuning of Small Language Models for CPU-Deployable Radiology AI cs.CL · 2026-05-01 · unverdicted · none · ref 14
LoRA fine-tuning of 3-4B SLMs on 162K multi-task radiology data yields strong performance deployable on consumer CPUs at 4-8 tokens/second.
Medical Image Understanding Improves Survival Prediction via Visual Instruction Tuning cs.CV · 2026-04-20 · unverdicted · none · ref 21
A vision-language model pre-trained via instruction tuning on CT-report pairs improves survival prediction accuracy over baselines, especially when clinical data alone is weak, while also producing text answers to clinical questions.
Aligning Modalities in Vision Large Language Models via Preference Fine-tuning cs.LG · 2024-02-18 · unverdicted · none · ref 158
POVID generates AI-created preference data to fine-tune vision-language models with DPO, reducing hallucinations and improving benchmark scores.
Agent AI: Surveying the Horizons of Multimodal Interaction cs.AI · 2024-01-07 · unverdicted · none · ref 205 · internal anchor
The paper defines Agent AI as interactive multimodal systems that perceive grounded data and generate embodied actions, arguing this approach can mitigate hallucinations in foundation models.
MedXIAOHE: A Comprehensive Recipe for Building Medical MLLMs cs.CL · 2026-02-13 · unverdicted · none · ref 37
MedXIAOHE is a medical MLLM that claims state-of-the-art benchmark performance through specialized pretraining to cover long-tail diseases and RL-based reasoning training.
Foundation Models in Biomedical Imaging: Turning Hype into Reality q-bio.QM · 2025-12-17 · unverdicted · none · ref 40
Foundation models excel at pattern recognition in biomedical imaging but lack causal reasoning, robustness, and safety for real-world use, so they should augment rather than replace clinical expertise according to the proposed REAL-FM assessment framework.
Improved Baselines with Visual Instruction Tuning cs.CV · 2023-10-05 · conditional · none · ref 31
Simple changes to LLaVA using CLIP-ViT-L-336px, an MLP connector, and academic VQA data yield state-of-the-art results on 11 benchmarks with only 1.2M public examples and one-day training on 8 A100 GPUs.
Data-Centric Foundation Models in Computational Healthcare: A Survey cs.LG · 2024-01-04 · unverdicted · none · ref 160 · internal anchor
The paper surveys data-centric strategies for foundation models in computational healthcare and supplies a curated list of related models and datasets.
A Survey on Multimodal Large Language Models cs.CV · 2023-06-23 · accept · none · ref 160
This survey organizes the architectures, training strategies, data, evaluation methods, extensions, and challenges of Multimodal Large Language Models.
MedFM-Robust: Benchmarking Robustness of Medical Foundation Models cs.CV · 2026-05-18 · unreviewed · ref 11 · 2 links · internal anchor

Llava-med:Trainingalargelanguage-and- vision assistant for biomedicine in one day

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer