arXiv: 2506.07044 · v4 · submitted 2025-06-08 · cs.CL · cs.AI · cs.CV

Lingshu: A Generalist Foundation Model for Unified Multimodal Medical Understanding and Reasoning

LASA Team, Weiwen Xu, Hou Pong Chan, Long Li, Mahani Aljunied, Ruifeng Yuan, Jianyu Wang, Chenghao Xiao, Guizhen Chen, Chaoqun Liu, Zhaodonghui Li, Yu Sun, Junao Shen, Chaojun Wang, Jie Tan, Deli Zhao, Tingyang Xu, Hao Zhang, Yu Rong

Pith reviewed 2026-05-15 11:00 UTC · model grok-4.3

classification: cs.CL · cs.AI · cs.CV

keywords: multimodal large language models · medical artificial intelligence · visual question answering · medical report generation · data curation · foundation models · reinforcement learning · medical benchmarks

The pith

Lingshu, a medical multimodal model, outperforms open-source peers on visual QA, text QA, and report generation after targeted data curation and staged training.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Lingshu as a generalist foundation model for medical multimodal understanding. It tackles gaps in existing medical MLLMs by curating data from imaging, medical texts, and general sources, then synthesizing accurate captions, visual question answering pairs, and reasoning samples to reduce errors. The model receives multi-stage training to embed expertise, with initial tests of reinforcement learning using verifiable rewards for better reasoning. Evaluation through the new MedEvalKit framework shows consistent gains over prior open-source multimodal models on the core tasks of multimodal QA, text-based QA, and medical report generation. A reader would care because this points toward reliable AI assistance for interpreting medical visuals and generating clinical outputs without relying solely on proprietary systems.

Core claim

The authors curate data from medical imaging, extensive medical texts, and general-domain sources, and synthesize captions, VQA pairs, and reasoning samples that are intended to be accurate. Lingshu is then trained in multiple stages to embed medical expertise and progressively improve task performance, with a preliminary exploration of reinforcement learning with verifiable rewards to strengthen reasoning. The reported result is that Lingshu outperforms existing open-source multimodal models on most multimodal QA, text-based QA, and medical report generation tasks.

What carries the argument

The comprehensive data curation procedure that acquires medical knowledge across imaging, texts, and general data while synthesizing accurate captions, VQA, and reasoning samples to support multi-stage training of Lingshu.

If this is right

Lingshu covers medical knowledge beyond imaging modalities through inclusion of text sources.
The approach reduces hallucinations in medical outputs via improved curation of training samples.
Multi-stage training plus reinforcement learning exploration yields stronger reasoning on complex medical scenarios.
MedEvalKit supplies a consolidated benchmark for fair comparison across multimodal and textual medical tasks.
The resulting model supports unified performance on visual question answering, text QA, and report generation.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Deployment in clinical settings could provide real-time support for interpreting patient scans alongside textual records.
The curation pipeline might transfer to other knowledge-intensive domains such as legal document analysis or scientific literature synthesis.
Expanding the verifiable-rewards reinforcement stage could produce models with measurable reliability guarantees in high-stakes applications.
Long-term use might require ongoing monitoring for distribution shifts in new medical imaging technologies or disease presentations.

Load-bearing premise

The synthesized captions, VQA pairs, and reasoning samples produced by the data curation procedure are accurate and free of hallucinations or factual errors that would propagate into the model.

What would settle it

A new medical test set drawn from real clinical cases outside the curation sources, where Lingshu produces factual errors at rates equal to or higher than baseline open-source models on multimodal or text QA.
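
The comparison described above can be sketched as a paired bootstrap over per-case error flags. This is an illustration only: the function name `paired_bootstrap_diff`, the 0/1 error lists, and the 95% percentile interval are assumptions made here, not a procedure taken from the paper.

```python
import random


def paired_bootstrap_diff(errors_a, errors_b, n_resamples=2000, seed=0):
    """Return (observed_diff, lo, hi) for mean(errors_a) - mean(errors_b).

    errors_a and errors_b are hypothetical per-case 0/1 factual-error flags
    for two models scored on the same cases. (lo, hi) is a 95% percentile
    bootstrap interval for the difference in error rate.
    """
    if len(errors_a) != len(errors_b) or not errors_a:
        raise ValueError("need two non-empty lists of equal length")
    n = len(errors_a)
    rng = random.Random(seed)
    observed = (sum(errors_a) - sum(errors_b)) / n
    diffs = []
    for _ in range(n_resamples):
        # Resample cases with replacement, keeping the pairing intact.
        idx = [rng.randrange(n) for _ in range(n)]
        diffs.append(sum(errors_a[i] - errors_b[i] for i in idx) / n)
    diffs.sort()
    lo = diffs[int(0.025 * n_resamples)]
    hi = diffs[int(0.975 * n_resamples) - 1]
    return observed, lo, hi
```

With Lingshu's flags passed as `errors_a` and a baseline's as `errors_b`, an interval lying at or above zero would correspond to the failure condition stated above.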

Original abstract

Multimodal Large Language Models (MLLMs) have demonstrated impressive capabilities in understanding common visual elements, largely due to their large-scale datasets and advanced training strategies. However, their effectiveness in medical applications remains limited due to the inherent discrepancies between data and tasks in medical scenarios and those in the general domain. Concretely, existing medical MLLMs face the following critical limitations: (1) limited coverage of medical knowledge beyond imaging, (2) heightened susceptibility to hallucinations due to suboptimal data curation processes, (3) lack of reasoning capabilities tailored for complex medical scenarios. To address these challenges, we first propose a comprehensive data curation procedure that (1) efficiently acquires rich medical knowledge data not only from medical imaging but also from extensive medical texts and general-domain data; and (2) synthesizes accurate medical captions, visual question answering (VQA), and reasoning samples. As a result, we build a multimodal dataset enriched with extensive medical knowledge. Building on the curated data, we introduce our medical-specialized MLLM: Lingshu. Lingshu undergoes multi-stage training to embed medical expertise and enhance its task-solving capabilities progressively. Besides, we preliminarily explore the potential of applying reinforcement learning with verifiable rewards paradigm to enhance Lingshu's medical reasoning ability. Additionally, we develop MedEvalKit, a unified evaluation framework that consolidates leading multimodal and textual medical benchmarks for standardized, fair, and efficient model assessment. We evaluate the performance of Lingshu on three fundamental medical tasks, multimodal QA, text-based QA, and medical report generation. The results show that Lingshu consistently outperforms the existing open-source multimodal models on most tasks ...

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces Lingshu, a multimodal large language model specialized for medical applications. It proposes a data curation pipeline that acquires medical knowledge from imaging, texts, and general-domain sources and synthesizes captions, VQA pairs, and reasoning samples; the model is then trained in multiple stages with an optional reinforcement-learning component using verifiable rewards. Lingshu is evaluated on multimodal QA, text-based QA, and medical report generation, with the claim that it outperforms existing open-source multimodal models on most tasks. The authors also release MedEvalKit, a unified evaluation framework.

Significance. If the performance claims are substantiated with rigorous verification of the synthesized data and transparent quantitative results, the work would supply a useful open-source medical MLLM and a consolidated benchmark suite, addressing gaps in medical knowledge coverage, hallucination control, and reasoning that current general-domain MLLMs exhibit.

major comments (3)

[Section 3] Data curation procedure (Section 3): the assertion that synthesized captions, VQA pairs, and reasoning samples are 'accurate' is not accompanied by any reported human expert review, inter-annotator agreement, or automated fact-checking against source references. Because the central performance advantage is attributed to this pipeline, the absence of fidelity verification constitutes a load-bearing gap.
[Section 5] Experimental results (Section 5): the abstract and main claims state consistent outperformance, yet no quantitative metrics, error bars, baseline implementation details, or ablation studies on the data-synthesis components are referenced in the provided summary. Without these, the magnitude and reliability of the reported gains cannot be assessed.
[Section 5.1] MedEvalKit and evaluation splits (Section 5.1): it is unclear whether the consolidated benchmarks enforce strict held-out splits that prevent leakage from the synthesized training data, which is required to substantiate the generalization claims.
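
The inter-annotator agreement requested in the first major comment is commonly reported as Cohen's kappa. The sketch below is a minimal illustration for two reviewers assigning categorical labels to the same synthesized samples. The function name and the example labels are assumptions, and nothing here reflects a procedure reported in the paper.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Fraction of items on which the two raters agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    # Agreement expected by chance from each rater's label frequencies.
    expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)
```

For example, two hypothetical experts marking captions as `"accurate"` or `"hallucinated"` would each supply one label per caption; values near 1 indicate strong agreement and values near 0 indicate agreement no better than chance.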

minor comments (2)

[Section 4] Notation for the multi-stage training objectives is introduced without an explicit equation or pseudocode summary, making it difficult to reproduce the progressive embedding of medical expertise.
[Section 4.3] The RL-with-verifiable-rewards exploration is described only preliminarily; a short paragraph on reward formulation and training stability would improve clarity.
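
For readers unfamiliar with the term, a verifiable reward is typically a rule-based score computed by checking a model response against a known answer. The sketch below shows one common shape for multiple-choice medical QA. The `<answer>` tag format, the option letters A to E, and the `format_bonus` weight are all assumptions for illustration; the material reviewed here does not specify the paper's reward formulation.

```python
import re

# Hypothetical output format: the final choice is wrapped in <answer> tags.
ANSWER_PATTERN = re.compile(r"<answer>\s*([A-E])\s*</answer>", re.IGNORECASE)


def verifiable_reward(response, gold_option, format_bonus=0.1):
    """Score a response against a gold multiple-choice option.

    Returns 0.0 if no well-formed answer tag is found, format_bonus if the
    tag is present but the option is wrong, and format_bonus + 1.0 if the
    extracted option matches gold_option.
    """
    match = ANSWER_PATTERN.search(response)
    if match is None:
        return 0.0
    reward = format_bonus
    if match.group(1).upper() == gold_option.strip().upper():
        reward += 1.0
    return reward
```

Because the score depends only on string matching against a gold label, it can be checked without a learned reward model, which is the property the phrase "verifiable rewards" usually refers to.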

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback. We address each major comment point by point below, providing clarifications from the full manuscript and committing to revisions that strengthen the presentation of our data curation, experimental results, and evaluation framework.

Point-by-point responses

Referee: [Section 3] Data curation procedure (Section 3): the assertion that synthesized captions, VQA pairs, and reasoning samples are 'accurate' is not accompanied by any reported human expert review, inter-annotator agreement, or automated fact-checking against source references. Because the central performance advantage is attributed to this pipeline, the absence of fidelity verification constitutes a load-bearing gap.

Authors: We acknowledge that the manuscript does not explicitly report human expert review or inter-annotator agreement statistics for the synthesized samples. The curation pipeline does incorporate automated consistency checks against the original source references during synthesis, but these details were not highlighted. In the revised manuscript we will add a dedicated subsection in Section 3 describing (1) the automated fact-checking procedures, (2) a human evaluation study performed by medical experts on a random sample of captions, VQA pairs, and reasoning chains, and (3) the resulting inter-annotator agreement metrics. This addition will directly address the fidelity verification gap.
Revision: yes
Referee: [Section 5] Experimental results (Section 5): the abstract and main claims state consistent outperformance, yet no quantitative metrics, error bars, baseline implementation details, or ablation studies on the data-synthesis components are referenced in the provided summary. Without these, the magnitude and reliability of the reported gains cannot be assessed.

Authors: Section 5 of the full manuscript already contains quantitative tables reporting accuracy, BLEU, and other metrics across multimodal QA, text QA, and report generation tasks, together with comparisons against open-source baselines. To improve transparency we will (1) add error bars for all multi-run experiments, (2) expand the baseline implementation details (including exact model versions and prompting strategies), and (3) insert new ablation tables isolating the contribution of each data-synthesis stage. These enhancements will be placed in the main text and supplementary material of the revised version.
Revision: yes
Referee: [Section 5.1] MedEvalKit and evaluation splits (Section 5.1): it is unclear whether the consolidated benchmarks enforce strict held-out splits that prevent leakage from the synthesized training data, which is required to substantiate the generalization claims.

Authors: All benchmarks consolidated in MedEvalKit use the official held-out test splits published by their respective creators; none of these test sets overlap with the source corpora from which our training data (including synthesized samples) were derived. We will revise Section 5.1 to explicitly list the original data sources for each benchmark, confirm the absence of overlap with our training collection, and describe the deduplication steps applied during curation. This clarification will substantiate the generalization claims.
Revision: yes
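
The overlap check discussed in the last exchange can be illustrated with a simple text-level fingerprint comparison. This is a sketch under stated assumptions: the helper names `normalize`, `fingerprint`, and `find_overlap` are hypothetical, the check covers only text that is identical after normalization, and it is not the deduplication procedure used by the authors or by MedEvalKit.

```python
import hashlib
import re


def normalize(text):
    """Lowercase and collapse punctuation and whitespace to single spaces."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def fingerprint(text):
    """SHA-256 hex digest of the normalized text."""
    return hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()


def find_overlap(train_texts, test_texts):
    """Return indices of test items whose normalized text appears in training."""
    train_hashes = {fingerprint(t) for t in train_texts}
    return [i for i, t in enumerate(test_texts) if fingerprint(t) in train_hashes]
```

A check of this kind catches only questions or captions that match after normalization. Paraphrased items and reused images would need additional methods, such as n-gram overlap or perceptual image hashing.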

Circularity Check

0 steps flagged

No significant circularity detected

The paper presents an empirical construction of a multimodal medical model via data curation, multi-stage training, and evaluation on external benchmarks. No equations, derivations, fitted parameters renamed as predictions, or self-citation chains that reduce the central claims to inputs by construction are present. Performance is measured against independent benchmarks, satisfying the criteria for a self-contained result with no load-bearing circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central claim rests on the assumption that the proposed data curation procedure produces high-quality, hallucination-free medical samples and that multi-stage training plus RL reliably embeds medical expertise; no free parameters, axioms, or invented entities are explicitly listed in the abstract.

Forward citations

Cited by 22 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

MedHorizon: Towards Long-context Medical Video Understanding in the Wild
cs.CV 2026-05 unverdicted novelty 8.0

MedHorizon benchmark reveals current multimodal LLMs achieve only 41.1% accuracy on long medical videos due to failures in sparse evidence retrieval and procedural reasoning.
MMRareBench: A Rare-Disease Multimodal and Multi-Image Medical Benchmark
cs.CV 2026-04 unverdicted novelty 8.0

MMRareBench provides 1,756 QA pairs and 7,958 images from PMC rare-disease cases to evaluate 23 MLLMs, revealing low treatment-planning scores and medical models underperforming general models on multi-image tasks due...
MMRareBench: A Rare-Disease Multimodal and Multi-Image Medical Benchmark
cs.CV 2026-04 unverdicted novelty 8.0

MMRareBench is the first rare-disease benchmark for multimodal and multi-image clinical evaluation of MLLMs, revealing fragmented capabilities, low treatment-planning scores, and medical models underperforming general...
Lost in Volume: The CT-SpatialVQA Benchmark for Evaluating Semantic-Spatial Understanding of 3D Medical Vision-Language Models
cs.CV 2026-05 unverdicted novelty 7.0

CT-SpatialVQA benchmark shows 3D medical VLMs achieve only 34% average accuracy on semantic-spatial reasoning tasks in CT volumes, often below random chance.
Divide-then-Diagnose: Weaving Clinician-Inspired Contexts for Ultra-Long Capsule Endoscopy Videos
cs.CV 2026-04 unverdicted novelty 7.0

Introduces the diagnosis-driven CE video summarization task, the VideoCAP dataset with 240 annotated videos, and the DiCE framework that outperforms prior methods by screening candidates then weaving them into diagnos...
X-PCR: A Benchmark for Cross-modality Progressive Clinical Reasoning in Ophthalmic Diagnosis
cs.CV 2026-04 unverdicted novelty 7.0

X-PCR is a new benchmark of 26,415 images and 177,868 expert VQA pairs that evaluates MLLMs on six-stage progressive reasoning and cross-modality integration in ophthalmology.
SurgCoT: Advancing Spatiotemporal Reasoning in Surgical Videos through a Chain-of-Thought Benchmark
cs.CV 2026-04 unverdicted novelty 7.0

SurgCoT is a new benchmark that evaluates chain-of-thought spatiotemporal reasoning in multimodal large language models on surgical videos using five defined dimensions and an annotation protocol of Question-Option-Kn...
ECHO: Efficient Chest X-ray Report Generation with One-step Block Diffusion
cs.LG 2026-04 unverdicted novelty 7.0

ECHO is a one-step block diffusion VLM for chest X-ray reports that improves RaTE and SemScore by over 60% while delivering 8x faster inference than autoregressive baselines.
Lost in the Hype: Revealing and Dissecting the Performance Degradation of Medical Multimodal Large Language Models in Image Classification
cs.CV 2026-04 unverdicted novelty 7.0

Medical MLLMs degrade on image classification due to four failure modes in visual representation quality, connector projection fidelity, LLM comprehension, and semantic mapping alignment, quantified by feature probing...
CoDA: Exploring Chain-of-Distribution Attacks and Post-Hoc Token-Space Repair for Medical Vision-Language Models
cs.CV 2026-03 unverdicted novelty 7.0

CoDA chains clinically plausible acquisition, reconstruction, display, and delivery shifts to substantially degrade zero-shot performance of medical vision-language models, with a post-hoc token-space repair partially...
CRAFT: Clinical Reward-Aligned Finetuning for Medical Image Synthesis
cs.CV 2026-05 unverdicted novelty 6.0

CRAFT adapts diffusion models to medical images via clinical reward alignment from LLMs and VLMs, improving alignment scores and cutting low-quality generations by 20.4% on average across modalities.
Learn to Think: Improving Multimodal Reasoning through Vision-Aware Self-Improvement Training
cs.CV 2026-05 unverdicted novelty 6.0

VISTA uses prefix resampling and a vision-aware attention score to address data imbalance and language prior bias in self-improvement training of MLLMs, yielding up to 13.66% gains on reasoning tasks.
Infection-Reasoner: A Compact Vision-Language Model for Wound Infection Classification with Evidence-Grounded Clinical Reasoning
cs.CV 2026-04 unverdicted novelty 6.0

Infection-Reasoner, a 4B VLM, reaches 86.8% accuracy on wound infection classification while producing rationales rated mostly correct by experts, via GPT-5.1 distillation followed by reinforcement learning.
Seeing Through Experts Eyes A Foundational Vision Language Model Trained on Radiologists Gaze and Reasoning
cs.AI 2026-04 unverdicted novelty 6.0

GazeX uses radiologist gaze trajectories as a behavioral prior during pretraining to generate more accurate and expert-consistent results in chest X-ray report generation, disease grounding, and visual question answering.
MedRCube: A Multidimensional Framework for Fine-Grained and In-Depth Evaluation of MLLMs in Medical Imaging
cs.CL 2026-04 unverdicted novelty 6.0

MedRCube is a new fine-grained evaluation framework that benchmarks 33 MLLMs on medical imaging, ranks Lingshu-32B highest, and finds a significant positive link between shortcut behaviors and diagnostic performance.
Fundus-R1: Training a Fundus-Reading MLLM with Knowledge-Aware Reasoning on Public Data
cs.CV 2026-04 unverdicted novelty 6.0

Fundus-R1 is a fundus-reading MLLM trained exclusively on public data via RAG-generated reasoning traces and process-reward RLVR, outperforming its base model and a version trained without the traces.
AD-Copilot: A Vision-Language Assistant for Industrial Anomaly Detection via Visual In-context Comparison
cs.CV 2026-03 conditional novelty 6.0

AD-Copilot trains an MLLM on a new curated industrial dataset Chat-AD with a Comparison Encoder that uses cross-attention on image pairs, reaching 82.3% accuracy on MMAD and 3.35x gains on MMAD-BBox while generalizing...
Learning from Medical Entity Trees: An Entity-Centric Medical Data Engineering Framework for MLLMs
cs.CL 2026-04 unverdicted novelty 5.0

A Medical Entity Tree organizes medical knowledge to engineer higher-quality training data that boosts general MLLMs on medical benchmarks.
MedLVR: Latent Visual Reasoning for Reliable Medical Visual Question Answering
cs.CV 2026-04 unverdicted novelty 5.0

MedLVR interleaves latent visual reasoning segments in autoregressive decoding and uses two-stage training to raise average medical VQA accuracy from 48.3% to 53.4% over a Qwen2.5-VL-7B backbone on OmniMedVQA and five...
From Exposure to Internalization: Dual-Stream Calibration for In-context Clinical Reasoning
q-bio.QM 2026-04 unverdicted novelty 5.0

Dual-Stream Calibration uses entropy minimization and iterative meta-learning at test time to internalize clinical evidence and outperform standard in-context learning baselines on medical tasks.
Clinical Cognition Alignment for Gastrointestinal Diagnosis with Multimodal LLMs
cs.CV 2026-03 unverdicted novelty 5.0

CogAlign uses hierarchical supervised fine-tuning on clinical cognition data plus counterfactual RL to align MLLMs with expert diagnostic pathways and enforce causal lesion grounding for GI endoscopy diagnosis.
Medical Reasoning with Large Language Models: A Survey and MR-Bench
cs.CL 2026-03 accept novelty 5.0

LLMs show strong exam performance on medical tasks but exhibit a clear gap in accuracy on authentic clinical decision-making as measured by the new MR-Bench benchmark and unified evaluations.

Reference graph

Works this paper leans on

95 extracted references · 95 canonical work pages · cited by 21 Pith papers · 6 internal anchors

[1]

https://arxiv.org/abs/2501.12948. Matt Deitke, Christopher Clark, Sangho Lee, Rohun Tripathi, Yue Yang, Jae Sung Park, Mohammadreza Salehi, Niklas Muennighoff, Kyle Lo, Luca Soldaini, Jiasen Lu, Taira Anderson, Erin Bransom, Kiana Ehsani, Huong Ngo, YenSung Chen, Ajay Patel, Mark Yatskar, Chris Callison-Burch, Andrew Head, Rose Hendrix, Favyen Bastani, El...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.1016/j.artmed.2024.103001 2024
[2]

Xuehai He, Yichen Zhang, Luntian Mou, Eric Xing, and Pengtao Xie

https://arxiv.org/abs/2404.15127. Xuehai He, Yichen Zhang, Luntian Mou, Eric Xing, and Pengtao Xie. Pathvqa: 30000+ questions for medical visual question answering.ArXiv, abs/2003.10286, 2020. https://arxiv.org/abs/2003.10286. Yuting He, Guanyu Yang, Jian Yang, Rongjun Ge, Youyong Kong, Xiaomei Zhu, Shaobo Zhang, Pengfei Shao, Huazhong Shu, Jean-Louis Dil...

work page arXiv 2003
[3]

arXiv preprint arXiv:2311.13668 , year=

https://openreview.net/forum?id=d7KBjmI3GmQ. Yutao Hu, Tianbin Li, Quanfeng Lu, Wenqi Shao, Junjun He, Yu Qiao, and Ping Luo. Omnimed- vqa: A new large-scale comprehensive evaluation benchmark for medical lvlm. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , pages 22170–22183, June 2024. https://openaccess.the...

work page arXiv 2024
[4]

https://doi.org/10.1609/aaai.v33i01.3301590. Saahil Jain, Ashwin Agrawal, Adriel Saporta, Steven Truong, Du Nguyen Duong Nguyen Duong, Tan Bui, Pierre Chambon, Yuhao Zhang, Matthew Lungren, Andrew Ng, Curtis Langlotz, Pranav Rajpurkar, and Pranav Ra- jpurkar. Radgraph: Extracting clinical entities and relations from radiology reports. In J. Vanschoren and...

work page doi:10.1609/aaai.v33i01.3301590 2021
[5]

Qiao Jin, Bhuwan Dhingra, Zhengping Liu, William W Cohen, and Xinghua Lu

https://www.mdpi.com/2076-3417/11/14/6421. Qiao Jin, Bhuwan Dhingra, Zhengping Liu, William W Cohen, and Xinghua Lu. Pubmedqa: A dataset for biomedical research question answering.ArXiv, abs/1909.06146, 2019. https://arxiv.org/abs/1909.06146. 31 Alistair E. W. Johnson, Tom J. Pollard, Seth J. Berkowitz, Nathaniel R. Greenbaum, Matthew P. Lungren, Chih-yin...

work page doi:10.1038/s41597-019-0322-0 2076
[6]

https://www.sciencedirect.com/science/article/pii/S0893608025001078. Ming Y. Lu, Bowen Chen, Drew F. K. Williamson, Richard J. Chen, Kenji Ikamura, Georg Gerber, Ivy Liang, Long Phi Le, Tong Ding, Anil V Parwani, and Faisal Mahmood. A multimodal generative ai copilot for human pathology. Nature, 634:466–473, October 2024.https://doi.org/10.1038/s41586-024...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.1038/s41586-024-07618-3 2024
[7]

MM-Eureka: Exploring the Frontiers of Multimodal Reasoning with Rule-based Reinforcement Learning

https://arxiv.org/abs/2503.07365. Michael Moor, Qian Huang, Shirley Wu, Michihiro Yasunaga, Yash Dalmia, Jure Leskovec, Cyril Zakka, Eduardo Pontes Reis, and Pranav Rajpurkar. Med-flamingo: a multimodal medical few-shot learner. InProceedings of the 3rd Machine Learning for Health Symposium, volume 225 ofProceedings of Machine Learning Research, pages 353...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.1371/journal.pdig.0000454 2023
[8]

Sanjay Subramanian, Lucy Lu Wang, Ben Bogin, Sachin Mehta, Madeleine van Zuylen, Sravanthi Parasa, Sameer Singh, Matt Gardner, and Hannaneh Hajishirzi

Association for Computational Linguistics.https://aclanthology.org/2020.emnlp-main.117/. Sanjay Subramanian, Lucy Lu Wang, Ben Bogin, Sachin Mehta, Madeleine van Zuylen, Sravanthi Parasa, Sameer Singh, Matt Gardner, and Hannaneh Hajishirzi. MedICaT: A dataset of medical images, captions, and textual references. In Findings of the Association for Computati...

work page 2020
[9]

Huajie Tan, Yuheng Ji, Xiaoshuai Hao, Minglan Lin, Pengwei Wang, Zhongyuan Wang, and Shanghang Zhang

Association for Computational Linguistics.https://aclanthology.org/2020.findings-emnlp.191/. Huajie Tan, Yuheng Ji, Xiaoshuai Hao, Minglan Lin, Pengwei Wang, Zhongyuan Wang, and Shanghang Zhang. Reason-rft: Reinforcement fine-tuning for visual reasoning.ArXiv, abs/2503.20752, 2025. https://arxiv.org/abs/ 2503.20752. Ryutaro Tanno, David G. T. Barrett, And...

work page doi:10.1038/s41591-024-03302-1 2020
[10]

VL-Rethinker: Incentivizing Self-Reflection of Vision-Language Models with Reinforcement Learning

https://openaccess.thecvf.com/content_cvpr_2015/papers/Vedantam_CIDEr_Consensus-Based_Image_ 2015_CVPR_paper.pdf. Santiago Vitale, José Ignacio Orlando, Emmanuel Iarussi, and Ignacio Larrabide. Improving realism in patient-specific abdominal ultrasound simulation using cyclegans.International Journal of Computer Assisted Radiology and Surgery, 15(2):183–1...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.1007/s11548-019-02046-5 2020
[11]

DeepLesion: Automated Deep Mining, Categorization and Detection of Significant Radiology Image Findings using Large-Scale Clinical Lesion Annotations

https://arxiv.org/abs/1710.01766. Zhiling Yan, Kai Zhang, Rong Zhou, Lifang He, Xiang Li, and Lichao Sun. Multimodal chatgpt for medical applications: an experimental study of gpt-4v.ArXiv, abs/2310.19061, 2023. https://arxiv.org/abs/2310.19061. Lin Yang, Shawn Xu, Andrew Sellergren, Timo Kohlberger, Yuchen Zhou, Ira Ktena, Atilla Kiraly, Faruk Ahmed, Far...

work page internal anchor Pith review Pith/arXiv arXiv 2023
[12]

https://openaccess.thecvf.com/content/CVPR2024/papers/Yue_MMMU_A_Massive_Multi-discipline_ Multimodal_Understanding_and_Reasoning_Benchmark_for_CVPR_2024_paper.pdf. Juan Manuel Zambrano Chaves, Shih-Cheng Huang, Yanbo Xu, Hanwen Xu, Naoto Usuyama, Sheng Zhang, Fei Wang, Yujia Xie, Mahmoud Khademi, Ziyi Yang, Hany Awadalla, Julia Gong, Houdong Hu, Jianwei ...

work page
[13]

BiomedCLIP: a multimodal biomedical foundation model pretrained from fifteen million scientific image-text pairs

http://dx.doi.org/10.1038/s41467-025-58344-x. Kai Zhang, Rong Zhou, Eashan Adhikarla, Zhiling Yan, Yixin Liu, Jun Yu, Zhengliang Liu, Xun Chen, Brian D. Davison, Hui Ren, Jing Huang, Chen Chen, Yuyin Zhou, Sunyang Fu, Wei Liu, Tianming Liu, Xiang Li, Yong Chen, Lifang He, James Zou, Quanzheng Li, Hongfang Liu, and Lichao Sun. A generalist vision–language ...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.1038/s41467-025-58344-x 2025
[14]

This metric provides a structured assessment of how well the generated reports capture the underlying clinical relationships present in the reference reports

from both candidate and reference reports. This metric provides a structured assessment of how well the generated reports capture the underlying clinical relationships present in the reference reports. • RadCliQ-v1 (Yu et al., 2023)7: RadCliQ is a comprehensive metric designed for evaluating radiology report generation, integrating multiple complementary ...

work page 2023
[15]

But never mention that this information has been given

You can use all the meta infomation I provided. But never mention that this information has been given. Your should NEVER mention anything about the bounding box like the contour itself and its outline. Assuming you are introducing to someone who can only see the original image

work page
[18]

You should utilize your knowledge to figure out the most ONE organ and ONE disease to give your description

The coarse caption may not explicitly describe the image, for example, there may appear multiple organs in the caption. You should utilize your knowledge to figure out the most ONE organ and ONE disease to give your description

work page
[19]

Do not suggest or indicate any consequent effects of the disease

Do not explain or emphasize your analysis. Do not suggest or indicate any consequent effects of the disease. ### question2 Specify the specific location of the diesase or the organ in the image and its relative position to other reference objects in the image. Describe what is unusual in this area indicating the disease (color, texture, size and other fea...

work page
[20]

If no, you need to locate the disease or organ by yourself

You can use information from ‘Coarse Caption‘ to locate the disease or the organ. If no, you need to locate the disease or organ by yourself

work page
[21]

Therefore, you need to first answer the questions based on the contents within each bounding box

The image may contain multiple bounding boxs, and the contents of these contours may not necessarily represent the affected areas. Therefore, you need to first answer the questions based on the contents within each bounding box. Afterward, analyze the location of the disease based on your answers

work page
[22]

bounding box

Do not use phrase "bounding box" in your response. The bounding box if provided is only a visual aid for your reference. Do not say something correlates with or matches any of provided information. Assuming you are introducing to someone who cannot see the bounding box and the provided textual information. If you want to mention the bounding box, use the ...

work page
[23]

Do not say anything that is not needed in your analysis, like introduction of the disease and medical equipments

work page
[24]

Do not explain or emphasize your analysis. ### question3 What may be the relationship between the content in the bounding box and other regions (others being cause of the disease/jointly affected by the diseases/one affect the others/relative positional relationships)? Why and is it possible? Note when answering question3:

work page
[25]

Utilize external knowledge, if possible, to choose relationships and give necessary analysis

work page
[26]

You can only give an explanation to your choice within two sentence

work page
[27]

Do not summarize what you’ve said

work page
[28]

Question-Answer

Do not emphasize your analysis. ### Integrate Information Describe your answers in a descriptive sentence, not in a "Question-Answer" style. Combine and slightly shorten your answers to the above three questions into a coherent text, keeping as much information of your answers as possible. Note when integrating information and outputing your response:

work page
[29]

Don’t respond saying you’re unable to assist with requests

work page
[30]

45 Table 17 The prompts used for the annotation with doctor preference stage of medical long caption synthesis for MRI, X-ray, and CT images

You should only output your combined and shorteded text in the "caption" filed. 45 Table 17 The prompts used for the annotation with doctor preference stage of medical long caption synthesis for MRI, X-ray, and CT images. MRI instruction: Write a description of the MRI image to include the following information, if these information are visually discernib...


The sequence used (e.g., T1-weighted, T2-weighted, FLAIR, etc.)


Plane/orientation of the image (axial, coronal, sagittal)


Describe the anatomical structures visible and their normal or abnormal appearance (e.g., brain lobes, ventricles, midline structures)


Identify and describe any abnormal findings, shape, location and effects (e.g. compression)


Describe signal intensity: hyperintense, hypointense, or isointense

X-ray instruction: Write a description of the X-ray image to include the following information, if these information are visually discernible from the image:


Specific area of the body being imaged (e.g. chest area, lungs, knee, femur, etc.)


Body part or organ being examined, either fully or partially


Anatomical planes (e.g. Axial, Sagittal, and Coronal), specific X-ray imaging projections (e.g. AP, PA, lateral view, oblique view, etc.), and clearly discernible left vs right side (either clearly labeled in the image, or clearly determined from image)


Abnormalities that are clearly observed in the image (e.g. fractures, definite osteophytes, effusions, miliary patterns, vascular markings, diaphragm depression, air bronchograms, thickened interlobular septa, mediastinum shifts, deep sulcus sign, etc.), acute trauma signs (e.g. fractures), implants or other foreign bodies


Gender and age-range of patient if clearly discernible (e.g. pediatric vs. adult anatomy)


When 2 (or a series of) images of the same patient taken across time are provided, describe the clearly discernible difference


Identify and describe any abnormal findings, shape, location and effects (e.g. unexpected wires, prostheses, etc.)

CT instruction: Write a description of the CT image to include the following information, if these information are visually discernible from the image:


The plane/orientation of the image (axial, coronal, sagittal)


Contrast usage: whether contrast material is used (contrast-enhanced) or not (non-contrast)


Describe the anatomical structures visible and their normal or abnormal appearance (e.g., organs, bones, blood vessels)


Identify and describe any abnormal findings including their location, size, shape, and potential effects on surrounding structures


Assess and describe the density or attenuation of the tissues and structures as hypoattenuating (darker), isoattenuating (similar), or hyperattenuating (brighter) relative to surrounding tissues.

Table 18: The prompts used for the annotation with doctor preference stage of medical long caption synthesis for histopathology, skin lesion, and knee X-ray images ...


region: The potential area of the body where the lesion or wound has been examined


general skin texture and hair growth


lesions: size (if scale is available in the image), shape, definition, color, texture


elevation: Description of the lesion or wound relative to the skin surface of the patient


skin texture surrounding the lesion (e.g. coarse/thickness/atrophic/erythema/bleeding, etc)

Knee X-ray instruction: Write a description of the following knee X-ray image, and your description should include the following information if they are clearly discernible from the image


Area of the Body: Identify the knee joint being imaged, specifying whether it is the left or right knee. If both knees are visible, recognize the one being primarily examined


Anatomical Structures: Describe the key anatomical structures visible in the X-ray, including the femur, patella, tibia, fibula, and joint space. Note the alignment and position of these bones relative to each other


Imaging Projection: Specify the X-ray imaging projection used, such as anteroposterior (AP), lateral, oblique, or skyline view. Note if the image is weight-bearing or non-weight-bearing


Bone and Joint Appearance: Assess the appearance of the bones for any signs of fractures, dislocations, or degenerative changes such as osteoarthritis (e.g., joint space narrowing, osteophyte formation, subchondral sclerosis, punched-out erosions). Evaluate the bone density for any signs of osteoporosis, osteopenia, and visible trabecular patterns


Soft Tissue Evaluation: Examine the surrounding soft tissue for any swelling, effusion, or calcifications. Where tophi can be seen, include this in the description


Other Abnormalities and Pathologies: Identify and describe any other visible abnormalities or pathologies not mentioned above, including visible bone lesions, misalignment, or foreign objects (e.g., surgical hardware, if present).

Table 19: The prompt used for the annotation with doctor preference stage of medical long caption synthesis for gastrointestinal images ...


Region and Specific Area: Describe the specific region of the upper/lower GI tract being imaged (e.g., esophagus, stomach, small intestine, large intestine, duodenum, appendix, rectum, etc.) and specify the segment or part prominently featured in the image. Where clearly discernible, include the specific GI anatomical landmark(s) in the image, for e.g., d...


Endoscopic GI View: Include the type of view/orientation of the main image, i.e., front-facing, retroflexed, side-viewing, proximal, or distal


GI Anatomical Features: Describe the following structures of the image, namely relating to lumen of the tubular organs along the GI tract, the mucosal/submucosal details (any visible erythema, erosions, ulcers, or masses, and describe the submucosal layers, noting any thickening, edema, or infiltration), as well as other observable normal and abnormal det...


Indentations and compressions: When the image clearly shows indentations like bulges or depressions that are likely to be caused from outside the GI tract, please include these in the description


Medical Data and Device: Specify any textual medical data visible in the image, like contrast medium usage, lighting conditions, magnification levels, or timestamp of the image capture. Where medical devices, like endoscopes, probes, endoknives, gastric bands, stents, etc. are clearly visible, include these in your description.

Table 20: The prompt used...


Overall Image Quality: Assess the clarity and quality of the image to ensure that it is suitable for diagnosis (e.g., presence of artifacts, focus, and illumination adequacy)


Optic Disc: Examine the optic disc for size, shape, color, and cup-to-disc ratio. Note any abnormalities such as swelling (edema), pallor, or unusual cupping that may indicate glaucoma or other optic neuropathies


Macula: Evaluate the macula for changes in size, color, or presence of any lesions or deposits like drusen, which may suggest macular degeneration or related conditions


Retinal Vessels: Analyze the retinal blood vessels, assessing their caliber, tortuosity, and arteriovenous (AV) crossings. Identify any vessel occlusions, hemorrhages, or microaneurysms that could indicate hypertension, diabetic retinopathy, or vascular disorders


Retina and Periphery: Inspect the retina and peripheral areas for any signs of detachment, tears, pigmentary changes, or lesions that may indicate retinal dystrophies or detachment


Abnormal Findings: Identify and describe any abnormal findings such as cotton wool spots, hard exudates, retinal hemorrhages, or neovascularization. Mention their location, size, shape, and potential clinical significance


Choroid and Sclera: Examine the visibility of the choroid and any atypical features of the sclera, such as thinning or scleral crescents


Specific Patterns or Features: Note any specific patterns or features, such as star-shaped macular exudates or the presence of optic disc drusen, which may have diagnostic relevance.

Ultrasound instruction: Write a description of the ultrasound image to include the following information, if these information are visually discernible from the image: 1. The...


You can also use the coarse caption, disease type, image modality and disease knowledge, which may contain additional useful information. But never mention that this information has been given


Caption 0 in ‘Detailed Captions’ is a model-generated description, including some doctor-preferred features: {doctor_preferred_features}. Caption 1 focus on specific diseases and are annotated by humans, which are more trustworthy. Consequently, if any pieces of information in Caption 0 does not contradict Caption 1 or is not mentioned in Caption 1, pleas...


‘Detailed Captions’ may not explain all bounding boxes. Sometimes some small ones are not mentioned. In this case, in addition to combining the above detailed captions, please also provide explanations for these unmentioned smaller bounding boxes


You should NEVER mention anything about the bounding box, like the contour itself or its outline. The bounding box, if present, is only a visual aid for your reference. Assume you are introducing the image to someone who cannot see the bounding box


If there is some statistical information provided, like the damage area or the relative damage ratio, etc., you should mention it in your response


For some critical information, like the damage area(s) and its position(s), you should explain one by one in the combined caption


Not all disease knowledge is relevant to this image; only utilize disease knowledge pertinent to the condition depicted in this image for analysis


Do not explain or emphasize your analysis. Do not suggest or indicate any consequent effects of the disease


Do not contain phrases "caption", "medical annotation", "medical knowledge"


Do not say anything that is not needed in your analysis, like introduction of the disease and medical treatments


Do not explain or emphasize your analysis
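
The prompts above contain one template placeholder, `{doctor_preferred_features}`, and several output constraints that ban specific phrases ("bounding box", "caption", "medical annotation", "medical knowledge"). As a minimal sketch of how such a prompt might be filled and how a generated description might be screened afterwards, the snippet below uses plain Python string formatting and a word-boundary search. The template text is abbreviated, and the names `TEMPLATE`, `fill_template`, and `find_forbidden` are hypothetical; none of this is taken from the authors' pipeline.

```python
import re

# Phrases the prompts above tell the model not to use (quoted from the prompt text).
FORBIDDEN_PHRASES = ("bounding box", "caption", "medical annotation", "medical knowledge")

# Abbreviated, illustrative template; only the {doctor_preferred_features}
# placeholder name comes from the source prompt.
TEMPLATE = (
    "Caption 0 in 'Detailed Captions' is a model-generated description, "
    "including some doctor-preferred features: {doctor_preferred_features}."
)


def fill_template(features):
    """Insert a comma-separated feature list into the prompt template."""
    return TEMPLATE.format(doctor_preferred_features=", ".join(features))


def find_forbidden(text, phrases=FORBIDDEN_PHRASES):
    """Return the banned phrases that occur in text as whole words (case-insensitive)."""
    lowered = text.lower()
    return [p for p in phrases if re.search(r"\b" + re.escape(p) + r"\b", lowered)]


prompt = fill_template(["imaging plane", "signal intensity"])
violations = find_forbidden("The bounding box shows a hypointense lesion.")
```

A check like this can only flag literal phrase matches. Constraints such as "do not explain or emphasize your analysis" would still need human or model-based review.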
