UniMedVL: Unifying Medical Multimodal Understanding and Generation through Observation-Knowledge-Analysis

Bin Fu; Chaoyang Zhang; Chenglong Ma; Cheng Tang; Chenhui Gou; Diping Song; Guang Yang; Huihui Xu; Jiashi Lin; Jin Ye

arxiv: 2510.15710 · v3 · pith:3UCOXXPEnew · submitted 2025-10-17 · 💻 cs.CV

Junzhi Ning, Wei Li, Cheng Tang, Jiashi Lin, Chenglong Ma, Chaoyang Zhang, Jiyao Liu, Ying Chen, Shujian Gao, Yuandong Pu, Huihui Xu, Chenhui Gou, Ziyan Huang, Yi Xin, Qi Qin, Diping Song, Bin Fu, Guang Yang, Yuanfeng Ji, Tianbin Li, Yanzhou Su, Jin Ye, Shixiang Tang, Zhongying Deng, Lihao Liu, Ming Hu, Junjun He

classification 💻 cs.CV

keywords medical · generation · understanding · unimedvl · multimodal · unified · dataset · first

Medical workflows routinely combine reading images with producing visual and textual outputs, making both image understanding and generation central to medical AI. Most existing systems, however, address these abilities in isolated models, losing the shared knowledge that a unified architecture could exploit. To bridge this gap, we present UniMedVL, the first unified medical model that seamlessly integrates multimodal understanding and generation capabilities within a single model without switching weights. We achieve this via a tailored progressive training pipeline where understanding and generation mutually reinforce each other. To effectively train UniMedVL, we curate UniMedVL-5M, the first large-scale medical dataset comprising over 5.6M instances across 8 medical imaging modalities, tailored for multimodal input-output tasks in unified medical understanding and generation. Experimental results demonstrate that UniMedVL achieves competitive performance on five medical understanding benchmarks. Crucially, UniMedVL natively supports diverse interleaved generation tasks, e.g., virtual staining, super-resolution, cross-modal synthesis, essential for complex medical workflows. Our code and dataset are publicly available.

Forward citations

Cited by 5 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

MMRareBench: A Rare-Disease Multimodal and Multi-Image Medical Benchmark
cs.CV 2026-04 unverdicted novelty 8.0

MMRareBench is the first rare-disease benchmark for multimodal and multi-image clinical evaluation of MLLMs, revealing fragmented capabilities, low treatment-planning scores, and medical models underperforming general...

MMRareBench: A Rare-Disease Multimodal and Multi-Image Medical Benchmark
cs.CV 2026-04 unverdicted novelty 8.0

MMRareBench provides 1,756 QA pairs and 7,958 images from PMC rare-disease cases to evaluate 23 MLLMs, revealing low treatment-planning scores and medical models underperforming general models on multi-image tasks due...

SiMing-Bench: Evaluating Procedural Correctness from Continuous Interactions in Clinical Skill Videos
cs.CV 2026-04 unverdicted novelty 7.0

SiMing-Bench shows current MLLMs have weak agreement with physicians on procedural correctness in clinical videos, with intermediate step judgments remaining poor even when overall scores look acceptable.

SynerMedGen: Synergizing Medical Multimodal Understanding with Generation via Task Alignment
cs.CV 2026-05 unverdicted novelty 5.0

SynerMedGen introduces generation-aligned understanding tasks and a two-stage training strategy that enables strong zero-shot medical image synthesis performance and outperforms specialized models when generation trai...

Clinical Cognition Alignment for Gastrointestinal Diagnosis with Multimodal LLMs
cs.CV 2026-03 unverdicted novelty 5.0

CogAlign uses hierarchical supervised fine-tuning on clinical cognition data plus counterfactual RL to align MLLMs with expert diagnostic pathways and enforce causal lesion grounding for GI endoscopy diagnosis.