Efficient Multimodal Clinical Question Answering for Pulmonary Embolism Risk Assessment

Bin Chen; Hong Jia; Junyan Wang; Lingyan Ruan; Ting Dang; Xiangyuan Xue; Yan Gao; Yang Yu

arxiv: 2606.22442 · v1 · pith:GE7Y6PA7new · submitted 2026-06-21 · 💻 cs.AI

Efficient Multimodal Clinical Question Answering for Pulmonary Embolism Risk Assessment

Xiangyuan Xue , Yang Yu , Yan Gao , Junyan Wang , Bin Chen , Lingyan Ruan , Ting Dang , Hong Jia This is my paper

Pith reviewed 2026-06-26 11:08 UTC · model grok-4.3

classification 💻 cs.AI

keywords multimodal large language modelspulmonary embolismclinical question answeringCTPAelectronic health recordsrisk assessmentbenchmark evaluation

0 comments

The pith

Compact multimodal models perform better on pulmonary embolism tasks when both CT images and electronic health records are available.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper constructs a benchmark of eight diagnostic and prognostic question-answering tasks on a dataset of over 23,000 CTPA studies paired with EHR data. It tests efficient multimodal large language models under CTPA-only, EHR-only, and combined inputs using zero- and few-shot prompting. Performance is higher when EHR evidence is included, especially for diagnosis tasks relative to prognostic ones such as readmission prediction. A reader would care because the setting directly matches routine clinical workflow for a high-risk condition, suggesting smaller models could support timely risk assessment without large-scale compute.

Core claim

On the INSPECT dataset, models such as Gemma4 E4B and Gemma4 E2B achieve stronger results in the CTPA+EHR setting than in single-modality conditions, with diagnosis tasks outperforming prognostic tasks; this pattern indicates that compact multimodal models hold potential for early-stage PE risk detection and explanation.

What carries the argument

Eight structured clinical question-answering tasks evaluated across CTPA-Only, EHR-Only, and CTPA+EHR input combinations with zero-shot and few-shot prompting.

If this is right

Combining CTPA and EHR inputs improves model performance over either modality alone.
Diagnosis tasks reach higher accuracy than prognostic tasks such as readmission prediction.
Efficient models show their strongest results under the combined CTPA+EHR condition.
The observed pattern supports use of compact multimodal models for early PE risk assessment.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same task formulation could be applied to other conditions that combine imaging with longitudinal records, such as heart failure.
Real-world deployment would require testing whether the QA format integrates into existing radiology reporting systems.
If the advantage of combined inputs holds across model families, it would favor multimodal training over separate unimodal pipelines.

Load-bearing premise

The eight chosen QA tasks and the zero/few-shot prompting setups are sufficient to show the true capabilities of the models without further controls for model size or prompt design.

What would settle it

A controlled experiment in which larger non-multimodal models or differently engineered prompts match or exceed the reported performance on the same eight tasks would undermine the claim that compact multimodal models are particularly promising.

Figures

Figures reproduced from arXiv: 2606.22442 by Bin Chen, Hong Jia, Junyan Wang, Lingyan Ruan, Ting Dang, Xiangyuan Xue, Yan Gao, Yang Yu.

**Figure 1.** Figure 1: Representative structured clinical QA examples. Each panel shows the model-visible evidence type, task-specific question, reference answer for illustration only, and structured model response. Reference answers and radiology report impressions are not provided to the model during inference [PITH_FULL_IMAGE:figures/full_fig_p004_1.png] view at source ↗

read the original abstract

Pulmonary embolism (PE) is a high risk cardiopulmonary condition whose management requires both timely diagnosis and reliable assessment of future clinical risk. Because PE care routinely combines computed tomography pulmonary angiography (CTPA), radiology interpretation, and longitudinal electronic health record (EHR) evidence, it provides a clinically meaningful setting for evaluating compact multimodal language models. In this work, we build a benchmark using efficient multimodal large language models (MLLMs) on INSPECT, a multimodal PE dataset containing 23,248 CTPA studies from 19,402 patients. We formulate eight diagnostic and prognostic tasks as structured clinical question answering problems and evaluate on typical efficient MLLMs under CTPA-Only, EHR-Only, and CTPA+EHR settings with zero-shot and few-shot prompting. Results show that Gemma4 E4B and Gemma4 E2B perform more strongly when EHR evidence is available, especially under CTPA+EHR input. Task level analysis further shows that PE diagnosis achieves higher performance than prognostic tasks, particularly readmission prediction. These observations suggest that compact multimodal models have the great potential in early stage PE risk detection and explanation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Benchmark paper formulates eight PE QA tasks on INSPECT and shows EHR helps, but lacks numbers, baselines, and controls to back the potential claim.

read the letter

This paper turns pulmonary embolism work into eight structured QA tasks on the INSPECT dataset of over 23k CTPA studies. It evaluates compact MLLMs like Gemma4 E4B and E2B under CTPA-only, EHR-only, and combined inputs with zero- and few-shot prompting.

What it does cleanly is report that adding EHR evidence lifts results over imaging alone and that diagnosis tasks outperform prognosis ones such as readmission prediction. Those patterns match how clinicians actually combine data for PE and give a usable starting benchmark for multimodal clinical QA.

The soft spots are the missing controls. The abstract gives only directional trends with no absolute scores, no error bars, no statistical tests, and no comparisons to standard clinical rules or non-LLM tabular models. There are also no ablations on prompt variants or model scale. Without those, the differences cannot be confidently attributed to the multimodal models rather than setup choices.

The claim that these compact models have great potential for early PE risk detection therefore rests on thin evidence. The work is an empirical benchmark rather than a new method, so its value hinges on execution details that are not visible here.

This is mainly for researchers already testing MLLMs on narrow clinical tasks who need a new dataset and task split to try. A serious editor could send it to review if the full paper supplies the tables, baselines, and method specifics that are absent from the abstract; otherwise it stays too preliminary.

Referee Report

3 major / 2 minor

Summary. The paper introduces a benchmark for evaluating compact multimodal large language models (MLLMs) on pulmonary embolism (PE) risk assessment using the INSPECT dataset of 23,248 CTPA studies. It formulates eight diagnostic and prognostic tasks as clinical QA problems and evaluates models such as Gemma4 E4B and E2B under CTPA-only, EHR-only, and combined settings with zero- and few-shot prompting. The central claim is that these efficient models show stronger performance when EHR evidence is added (especially in CTPA+EHR), that diagnostic tasks outperform prognostic ones (e.g., readmission prediction), and that compact MLLMs therefore have great potential for early-stage PE detection and explanation.

Significance. If the empirical results were supported by absolute metrics, statistical tests, and appropriate baselines, the work could usefully demonstrate the viability of parameter-efficient multimodal models for a clinically relevant multimodal task where both imaging and longitudinal records matter. The creation of structured QA tasks on a large PE dataset is a concrete contribution that could support future reproducible evaluations.

major comments (3)

[Abstract] Abstract: only directional statements are supplied ("perform more strongly", "higher performance", "great potential") with no absolute accuracy/F1 scores, confidence intervals, statistical significance tests, or data tables. This prevents any assessment of whether the observed trends are large enough to support the clinical-potential claim.
[Abstract] Abstract / evaluation setup: no non-LLM baselines (rule-based PE risk scores such as Wells or Geneva criteria, or standard tabular EHR classifiers) are reported. Without these controls it is impossible to determine whether any advantage is attributable to the compact MLLMs themselves rather than to the QA task formulation or prompting.
[Abstract] Abstract: the eight QA tasks and zero/few-shot CTPA/EHR protocols are presented without ablations on prompt variants, model-scale controls, or confounding factors. Consequently the directional trends cannot be confidently attributed to the multimodal compact models rather than setup artifacts.

minor comments (2)

[Abstract] The manuscript should include a table or figure summarizing the eight QA tasks with their exact question templates and label definitions.
Dataset statistics (patient demographics, label distributions per task) are referenced but not shown; these should appear in a dedicated table.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback on the abstract and evaluation design. We address each major comment below, indicating revisions where appropriate to strengthen the presentation of results.

read point-by-point responses

Referee: [Abstract] Abstract: only directional statements are supplied ("perform more strongly", "higher performance", "great potential") with no absolute accuracy/F1 scores, confidence intervals, statistical significance tests, or data tables. This prevents any assessment of whether the observed trends are large enough to support the clinical-potential claim.

Authors: We agree that the abstract would benefit from quantitative support. The full manuscript contains the underlying metrics; we will revise the abstract to report representative accuracy and F1 scores (with confidence intervals) for the main CTPA+EHR setting and key tasks. revision: yes
Referee: [Abstract] Abstract / evaluation setup: no non-LLM baselines (rule-based PE risk scores such as Wells or Geneva criteria, or standard tabular EHR classifiers) are reported. Without these controls it is impossible to determine whether any advantage is attributable to the compact MLLMs themselves rather than to the QA task formulation or prompting.

Authors: The work centers on benchmarking compact MLLMs for multimodal clinical QA rather than claiming superiority over all clinical tools. Wells and Geneva scores apply primarily to diagnosis and do not map directly to the prognostic tasks. We will nevertheless add a simple tabular EHR baseline (logistic regression on structured features) in the revised experiments section for context. revision: yes
Referee: [Abstract] Abstract: the eight QA tasks and zero/few-shot CTPA/EHR protocols are presented without ablations on prompt variants, model-scale controls, or confounding factors. Consequently the directional trends cannot be confidently attributed to the multimodal compact models rather than setup artifacts.

Authors: The manuscript already details the task formulations and prompting protocols. To address potential setup artifacts, we will include additional ablations on prompt phrasing and a model-scale comparison (E4B vs. E2B) in the revised results section. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical benchmark with no derivations or self-referential fits

full rationale

The paper formulates eight QA tasks on the INSPECT dataset and reports directional performance trends for Gemma4 models under different input settings and prompting regimes. No equations, fitted parameters, uniqueness theorems, or derivation chains appear in the provided text. Conclusions rest on observed empirical patterns (EHR improves results; diagnosis outperforms prognosis) that are directly testable against external baselines or larger models, with no reduction of outputs to inputs by construction. Self-citation is absent from the abstract and described content, so the evaluation chain remains independent.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No information available from abstract alone to identify free parameters, axioms, or invented entities.

pith-pipeline@v0.9.1-grok · 5744 in / 1076 out tokens · 30241 ms · 2026-06-26T11:08:39.615060+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

32 extracted references · 3 canonical work pages

[1]

European Heart Journal , volume=

2019 ESC Guidelines for the diagnosis and management of acute pulmonary embolism developed in collaboration with the European Respiratory Society (ERS) , author=. European Heart Journal , volume=

2019
[2]

BMJ Open , volume=

Clinical characteristics associated with diagnostic delay of pulmonary embolism in primary care: a retrospective observational study , author=. BMJ Open , volume=
[3]

npj Digital Medicine , volume=

Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines , author=. npj Digital Medicine , volume=
[4]

Scientific Reports , volume=

Multimodal fusion with deep neural networks for leveraging CT imaging and electronic health record: a case study in pulmonary embolism detection , author=. Scientific Reports , volume=
[5]

arXiv preprint arXiv:2111.11665 , year=

RadFusion: Benchmarking Performance and Fairness for Multimodal Pulmonary Embolism Detection from CT and EHR , author=. arXiv preprint arXiv:2111.11665 , year=

work page arXiv
[6]

Advances in Neural Information Processing Systems Datasets and Benchmarks Track , year=

INSPECT: A Multimodal Dataset for Pulmonary Embolism Diagnosis and Prognosis , author=. Advances in Neural Information Processing Systems Datasets and Benchmarks Track , year=
[7]

Archives of Internal Medicine , volume=

Pulmonary embolism mortality in the United States, 1979--1998: an analysis using multiple-cause mortality data , author=. Archives of Internal Medicine , volume=. 2003 , publisher=

1979
[8]

European Journal of Internal Medicine , volume=

Delay and misdiagnosis in sub-massive and non-massive acute pulmonary embolism , author=. European Journal of Internal Medicine , volume=. 2010 , publisher=

2010
[9]

BMJ Open , volume=

Clinical characteristics associated with diagnostic delay of pulmonary embolism in primary care: a retrospective observational study , author=. BMJ Open , volume=. 2017 , publisher=

2017
[10]

Journal of Thrombosis and Haemostasis , volume=

Diagnosing pulmonary embolism: running after the decreasing prevalence of cases among suspected patients , author=. Journal of Thrombosis and Haemostasis , volume=. 2004 , publisher=

2004
[11]

The British Journal of Radiology , volume=

The influence of clinical information on the reporting of CT by radiologists , author=. The British Journal of Radiology , volume=. 2000 , publisher=

2000
[12]

Journal of the American College of Radiology , volume=

Accuracy of information on imaging requisitions: does it matter? , author=. Journal of the American College of Radiology , volume=. 2007 , publisher=

2007
[13]

Scientific Reports , volume=

Multimodal fusion with deep neural networks for leveraging CT imaging and electronic health record: a case-study in pulmonary embolism detection , author=. Scientific Reports , volume=. 2020 , publisher=

2020
[14]

Radiology: Artificial Intelligence , volume=

The RSNA pulmonary embolism CT dataset , author=. Radiology: Artificial Intelligence , volume=. 2021 , publisher=

2021
[15]

Artificial Intelligence in Medicine , volume=

Comparative effectiveness of convolutional neural network (CNN) and recurrent neural network (RNN) architectures for radiology text report classification , author=. Artificial Intelligence in Medicine , volume=. 2019 , publisher=

2019
[16]

arXiv preprint arXiv:2201.11838 , year=

Clinical-Longformer and Clinical-BigBird: Transformers for long clinical sequences , author=. arXiv preprint arXiv:2201.11838 , year=

work page arXiv
[17]

Scientific reports , volume=

Deep learning for pulmonary embolism detection on computed tomography pulmonary angiogram: a systematic review and meta-analysis , author=. Scientific reports , volume=. 2021 , publisher=

2021
[18]

European radiology , volume=

Evaluation of acute pulmonary embolism and clot burden on CTPA with deep learning , author=. European radiology , volume=. 2020 , publisher=

2020
[19]

NPJ digital medicine , volume=

PENet—a scalable deep-learning model for automated diagnosis of pulmonary embolism using volumetric CT imaging , author=. NPJ digital medicine , volume=. 2020 , publisher=

2020
[20]

La radiologia medica , volume=

Prognostic value of CT pulmonary angiography parameters in acute pulmonary embolism , author=. La radiologia medica , volume=. 2021 , publisher=

2021
[21]

Nature medicine , volume=

Large language models in medicine , author=. Nature medicine , volume=. 2023 , publisher=

2023
[22]

Towards visual question answering on pathology images , author=. Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 2: Short Papers) , pages=
[23]

2021 IEEE 18th international symposium on biomedical imaging (ISBI) , pages=

Slake: A semantically-labeled knowledge-enhanced dataset for medical visual question answering , author=. 2021 IEEE 18th international symposium on biomedical imaging (ISBI) , pages=. 2021 , organization=

2021
[24]

Machine learning for health (ML4H) , pages=

Med-flamingo: a multimodal medical few-shot learner , author=. Machine learning for health (ML4H) , pages=. 2023 , organization=

2023
[25]

Advances in Neural Information Processing Systems , volume=

Llava-med: Training a large language-and-vision assistant for biomedicine in one day , author=. Advances in Neural Information Processing Systems , volume=
[26]

Nature , volume=

Large language models encode clinical knowledge , author=. Nature , volume=. 2023 , publisher=

2023
[27]

Nejm Ai , volume=

Towards generalist biomedical AI , author=. Nejm Ai , volume=. 2024 , publisher=

2024
[28]

Nature medicine , volume=

Privacy in the age of medical big data , author=. Nature medicine , volume=. 2019 , publisher=

2019
[29]

Communications medicine , volume=

The future landscape of large language models in medicine , author=. Communications medicine , volume=. 2023 , publisher=

2023
[30]

International Journal of Emerging Trends in Computer Science and Information Technology , pages=

AI and Data Privacy in Healthcare: Compliance with HIPAA, GDPR, and emerging regulations , author=. International Journal of Emerging Trends in Computer Science and Information Technology , pages=
[31]

NPJ digital medicine , volume=

Vision-language model for report generation and outcome prediction in CT pulmonary angiogram , author=. NPJ digital medicine , volume=. 2025 , publisher=

2025
[32]

arXiv preprint arXiv:2509.01554 , year=

Unified Supervision For Vision-Language Modeling in 3D Computed Tomography , author=. arXiv preprint arXiv:2509.01554 , year=

work page arXiv

[1] [1]

European Heart Journal , volume=

2019 ESC Guidelines for the diagnosis and management of acute pulmonary embolism developed in collaboration with the European Respiratory Society (ERS) , author=. European Heart Journal , volume=

2019

[2] [2]

BMJ Open , volume=

Clinical characteristics associated with diagnostic delay of pulmonary embolism in primary care: a retrospective observational study , author=. BMJ Open , volume=

[3] [3]

npj Digital Medicine , volume=

Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines , author=. npj Digital Medicine , volume=

[4] [4]

Scientific Reports , volume=

Multimodal fusion with deep neural networks for leveraging CT imaging and electronic health record: a case study in pulmonary embolism detection , author=. Scientific Reports , volume=

[5] [5]

arXiv preprint arXiv:2111.11665 , year=

RadFusion: Benchmarking Performance and Fairness for Multimodal Pulmonary Embolism Detection from CT and EHR , author=. arXiv preprint arXiv:2111.11665 , year=

work page arXiv

[6] [6]

Advances in Neural Information Processing Systems Datasets and Benchmarks Track , year=

INSPECT: A Multimodal Dataset for Pulmonary Embolism Diagnosis and Prognosis , author=. Advances in Neural Information Processing Systems Datasets and Benchmarks Track , year=

[7] [7]

Archives of Internal Medicine , volume=

Pulmonary embolism mortality in the United States, 1979--1998: an analysis using multiple-cause mortality data , author=. Archives of Internal Medicine , volume=. 2003 , publisher=

1979

[8] [8]

European Journal of Internal Medicine , volume=

Delay and misdiagnosis in sub-massive and non-massive acute pulmonary embolism , author=. European Journal of Internal Medicine , volume=. 2010 , publisher=

2010

[9] [9]

BMJ Open , volume=

Clinical characteristics associated with diagnostic delay of pulmonary embolism in primary care: a retrospective observational study , author=. BMJ Open , volume=. 2017 , publisher=

2017

[10] [10]

Journal of Thrombosis and Haemostasis , volume=

Diagnosing pulmonary embolism: running after the decreasing prevalence of cases among suspected patients , author=. Journal of Thrombosis and Haemostasis , volume=. 2004 , publisher=

2004

[11] [11]

The British Journal of Radiology , volume=

The influence of clinical information on the reporting of CT by radiologists , author=. The British Journal of Radiology , volume=. 2000 , publisher=

2000

[12] [12]

Journal of the American College of Radiology , volume=

Accuracy of information on imaging requisitions: does it matter? , author=. Journal of the American College of Radiology , volume=. 2007 , publisher=

2007

[13] [13]

Scientific Reports , volume=

Multimodal fusion with deep neural networks for leveraging CT imaging and electronic health record: a case-study in pulmonary embolism detection , author=. Scientific Reports , volume=. 2020 , publisher=

2020

[14] [14]

Radiology: Artificial Intelligence , volume=

The RSNA pulmonary embolism CT dataset , author=. Radiology: Artificial Intelligence , volume=. 2021 , publisher=

2021

[15] [15]

Artificial Intelligence in Medicine , volume=

Comparative effectiveness of convolutional neural network (CNN) and recurrent neural network (RNN) architectures for radiology text report classification , author=. Artificial Intelligence in Medicine , volume=. 2019 , publisher=

2019

[16] [16]

arXiv preprint arXiv:2201.11838 , year=

Clinical-Longformer and Clinical-BigBird: Transformers for long clinical sequences , author=. arXiv preprint arXiv:2201.11838 , year=

work page arXiv

[17] [17]

Scientific reports , volume=

Deep learning for pulmonary embolism detection on computed tomography pulmonary angiogram: a systematic review and meta-analysis , author=. Scientific reports , volume=. 2021 , publisher=

2021

[18] [18]

European radiology , volume=

Evaluation of acute pulmonary embolism and clot burden on CTPA with deep learning , author=. European radiology , volume=. 2020 , publisher=

2020

[19] [19]

NPJ digital medicine , volume=

PENet—a scalable deep-learning model for automated diagnosis of pulmonary embolism using volumetric CT imaging , author=. NPJ digital medicine , volume=. 2020 , publisher=

2020

[20] [20]

La radiologia medica , volume=

Prognostic value of CT pulmonary angiography parameters in acute pulmonary embolism , author=. La radiologia medica , volume=. 2021 , publisher=

2021

[21] [21]

Nature medicine , volume=

Large language models in medicine , author=. Nature medicine , volume=. 2023 , publisher=

2023

[22] [22]

Towards visual question answering on pathology images , author=. Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 2: Short Papers) , pages=

[23] [23]

2021 IEEE 18th international symposium on biomedical imaging (ISBI) , pages=

Slake: A semantically-labeled knowledge-enhanced dataset for medical visual question answering , author=. 2021 IEEE 18th international symposium on biomedical imaging (ISBI) , pages=. 2021 , organization=

2021

[24] [24]

Machine learning for health (ML4H) , pages=

Med-flamingo: a multimodal medical few-shot learner , author=. Machine learning for health (ML4H) , pages=. 2023 , organization=

2023

[25] [25]

Advances in Neural Information Processing Systems , volume=

Llava-med: Training a large language-and-vision assistant for biomedicine in one day , author=. Advances in Neural Information Processing Systems , volume=

[26] [26]

Nature , volume=

Large language models encode clinical knowledge , author=. Nature , volume=. 2023 , publisher=

2023

[27] [27]

Nejm Ai , volume=

Towards generalist biomedical AI , author=. Nejm Ai , volume=. 2024 , publisher=

2024

[28] [28]

Nature medicine , volume=

Privacy in the age of medical big data , author=. Nature medicine , volume=. 2019 , publisher=

2019

[29] [29]

Communications medicine , volume=

The future landscape of large language models in medicine , author=. Communications medicine , volume=. 2023 , publisher=

2023

[30] [30]

International Journal of Emerging Trends in Computer Science and Information Technology , pages=

AI and Data Privacy in Healthcare: Compliance with HIPAA, GDPR, and emerging regulations , author=. International Journal of Emerging Trends in Computer Science and Information Technology , pages=

[31] [31]

NPJ digital medicine , volume=

Vision-language model for report generation and outcome prediction in CT pulmonary angiogram , author=. NPJ digital medicine , volume=. 2025 , publisher=

2025

[32] [32]

arXiv preprint arXiv:2509.01554 , year=

Unified Supervision For Vision-Language Modeling in 3D Computed Tomography , author=. arXiv preprint arXiv:2509.01554 , year=

work page arXiv