pith.

arXiv: 2510.15710 · v3 · pith:3UCOXXPE · submitted 2025-10-17 · 💻 cs.CV

UniMedVL: Unifying Medical Multimodal Understanding and Generation through Observation-Knowledge-Analysis

classification 💻 cs.CV
keywords: medical, generation, understanding, unimedvl, multimodal, unified, dataset, first

Medical workflows routinely combine reading images with producing visual and textual outputs, making both image understanding and generation central to medical AI. Most existing systems, however, address these abilities in isolated models, losing the shared knowledge that a unified architecture could exploit. To bridge this gap, we present UniMedVL, the first unified medical model that seamlessly integrates multimodal understanding and generation capabilities within a single model without switching weights. We achieve this via a tailored progressive training pipeline in which understanding and generation mutually reinforce each other. To effectively train UniMedVL, we curate UniMedVL-5M, the first large-scale medical dataset comprising over 5.6M instances across 8 medical imaging modalities, tailored for multimodal input-output tasks in unified medical understanding and generation. Experimental results demonstrate that UniMedVL achieves competitive performance on five medical understanding benchmarks. Crucially, UniMedVL natively supports diverse interleaved generation tasks, such as virtual staining, super-resolution, and cross-modal synthesis, which are essential for complex medical workflows. Our code and dataset are publicly available.

Forward citations

Cited by 5 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. MMRareBench: A Rare-Disease Multimodal and Multi-Image Medical Benchmark

    cs.CV 2026-04 unverdicted novelty 8.0

    MMRareBench is the first rare-disease benchmark for multimodal and multi-image clinical evaluation of MLLMs, revealing fragmented capabilities, low treatment-planning scores, and medical models underperforming general...

  2. MMRareBench: A Rare-Disease Multimodal and Multi-Image Medical Benchmark

    cs.CV 2026-04 unverdicted novelty 8.0

    MMRareBench provides 1,756 QA pairs and 7,958 images from PMC rare-disease cases to evaluate 23 MLLMs, revealing low treatment-planning scores and medical models underperforming general models on multi-image tasks due...

  3. SiMing-Bench: Evaluating Procedural Correctness from Continuous Interactions in Clinical Skill Videos

    cs.CV 2026-04 unverdicted novelty 7.0

    SiMing-Bench shows current MLLMs have weak agreement with physicians on procedural correctness in clinical videos, with intermediate step judgments remaining poor even when overall scores look acceptable.

  4. SynerMedGen: Synergizing Medical Multimodal Understanding with Generation via Task Alignment

    cs.CV 2026-05 unverdicted novelty 5.0

    SynerMedGen introduces generation-aligned understanding tasks and a two-stage training strategy that enables strong zero-shot medical image synthesis performance and outperforms specialized models when generation trai...

  5. Clinical Cognition Alignment for Gastrointestinal Diagnosis with Multimodal LLMs

    cs.CV 2026-03 unverdicted novelty 5.0

    CogAlign uses hierarchical supervised fine-tuning on clinical cognition data plus counterfactual RL to align MLLMs with expert diagnostic pathways and enforce causal lesion grounding for GI endoscopy diagnosis.