ECG-R1: Protocol-Guided and Modality-Agnostic MLLM for Reliable ECG Interpretation

Bo Liu; Deyun Zhang; Haoyu Wang; Hongyan Li; Jiarui Jin; Shenda Hong; Xiang Lan; Xian Wu; Xiaocheng Fang; Xingliang Wu

arxiv: 2602.04279 · v3 · pith:B5WUISEHnew · submitted 2026-02-04 · 💻 cs.CL

Jiarui Jin, Haoyu Wang, Xingliang Wu, Xiaocheng Fang, Xiang Lan, Zihan Wang, Deyun Zhang, Bo Liu, Yingying Zhang, Xian Wu, Hongyan Li, Shenda Hong

Pith reviewed 2026-05-21 14:13 UTC · model grok-4.3

keywords: ECG interpretation · multimodal large language models · protocol-guided data generation · reinforcement learning · medical hallucinations · modality dropout · diagnostic AI

The pith

ECG-R1 grounds ECG interpretations in measurable features and monograph rules to reduce hallucinations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces ECG-R1, a multimodal large language model built specifically for reliable ECG interpretation. It constructs training data through protocol-guided generation that follows measurable ECG features along with quantitative thresholds and diagnostic logic from medical monographs. A modality-decoupled architecture with interleaved dropout maintains performance when either the ECG signal or image is unavailable, while reinforcement learning rewards outputs that cite diagnostic evidence. Evaluations of existing proprietary, open-source, and medical MLLMs show frequent severe hallucinations, indicating that direct trust in their ECG outputs is unwarranted without verification.

Core claim

ECG-R1 achieves reliable ECG interpretation by grounding analyses in measurable ECG features and monograph-defined quantitative thresholds and diagnostic logic through protocol-guided instruction data generation, a modality-decoupled architecture with interleaved modality dropout, and reinforcement learning with ECG diagnostic evidence rewards.

What carries the argument

Protocol-Guided Instruction Data Generation together with a modality-decoupled architecture using Interleaved Modality Dropout and Reinforcement Learning with ECG Diagnostic Evidence Rewards.
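The modality dropout idea can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `modality_dropout`, the parameter `p_drop`, and the rule "hide at most one modality per training sample, never both" are assumptions made for illustration, since the summary above does not specify how the interleaving is scheduled.

```python
import random
from typing import Optional, Sequence, Tuple

Modality = Optional[Sequence[float]]


def modality_dropout(
    signal: Modality,
    image: Modality,
    p_drop: float = 0.3,
    rng: Optional[random.Random] = None,
) -> Tuple[Modality, Modality]:
    """Randomly hide at most one modality of a training sample.

    Hypothetical scheme: with probability p_drop, one of the two inputs
    (ECG signal or ECG image) is replaced by None, chosen with equal odds.
    Both inputs are never hidden at once.
    """
    if rng is None:
        rng = random.Random()
    # A sample that already lacks one modality is passed through unchanged.
    if signal is None or image is None:
        return signal, image
    # Most samples keep both modalities.
    if rng.random() >= p_drop:
        return signal, image
    # Otherwise drop exactly one modality.
    if rng.random() < 0.5:
        return None, image
    return signal, None
```

Training on a mix of complete and single-modality samples is what would let a model of this kind answer from either input alone at test time; the actual mixing ratio used in the paper is not stated here.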

If this is right

Existing MLLMs produce widespread severe hallucinations on ECG interpretation tasks.
The modality-decoupled design with dropout improves robustness when signal or image inputs are missing.
Evidence-rewarded reinforcement learning strengthens traceable, rule-following reasoning in outputs.
Independent verification remains necessary for any current MLLM ECG analysis.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same grounding strategy could extend to other medical signal or image interpretation tasks to curb hallucinations.
Protocol-based training may reduce diagnostic errors in settings where clinicians have limited time for full review.
Real-world deployment would require checking whether the model maintains accuracy on rare or borderline ECG patterns not well-represented in monographs.

Load-bearing premise

The protocol-guided process accurately translates monograph thresholds and diagnostic logic into training examples without systematic biases or omissions from the selected ECG features and extraction methods.

What would settle it

Test whether ECG-R1 still produces clinically incorrect outputs on held-out ECG cases where measured feature values cross clear monograph thresholds, or whether removing the protocol guidance causes performance to match ungrounded baselines.
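The threshold-crossing test can be sketched as a simple audit. The rule table below uses common textbook cutoffs chosen for illustration; it is an assumption and is not the paper's monograph protocol. The names `RULES`, `expected_findings`, and `audit_claims` are likewise hypothetical.

```python
from typing import Callable, Dict, List, Set

Features = Dict[str, float]

# Illustrative textbook cutoffs (assumption: NOT taken from the paper's protocol).
RULES: Dict[str, Callable[[Features], bool]] = {
    "tachycardia": lambda f: f["heart_rate_bpm"] > 100,
    "bradycardia": lambda f: f["heart_rate_bpm"] < 60,
    "prolonged PR interval": lambda f: f["pr_ms"] > 200,
    "wide QRS": lambda f: f["qrs_ms"] >= 120,
}


def expected_findings(features: Features) -> Set[str]:
    """Findings that the measured values force under the rule table."""
    return {name for name, rule in RULES.items() if rule(features)}


def audit_claims(features: Features, claimed: Set[str]) -> Dict[str, List[str]]:
    """Compare a model's claimed findings with what the measurements support."""
    expected = expected_findings(features)
    checkable = claimed & set(RULES)  # ignore claims the table cannot judge
    return {
        "missed": sorted(expected - claimed),
        "unsupported": sorted(checkable - expected),
    }


case = {"heart_rate_bpm": 110.0, "pr_ms": 160.0, "qrs_ms": 130.0}
report = audit_claims(case, {"tachycardia", "prolonged PR interval"})
print(report)  # {'missed': ['wide QRS'], 'unsupported': ['prolonged PR interval']}
```

Running an audit of this shape on held-out cases, with and without protocol guidance in training, would separate the two outcomes the test describes.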

Figures

Figures reproduced from arXiv: 2602.04279 by Bo Liu, Deyun Zhang, Haoyu Wang, Hongyan Li, Jiarui Jin, Shenda Hong, Xiang Lan, Xian Wu, Xiaocheng Fang, Xingliang Wu, Yingying Zhang, Zihan Wang.

**Figure 1.** Left: Attribute comparison among general/medical MLLMs, previous ECG-specialized MLLMs, and ECG-R1. General/medical MLLMs typically cannot perform signal analysis and lack high-quality ECG interpretation corpora, which often leads to hallucinated, clinically incorrect interpretations at test time. Previous ECG-specialized MLLMs often construct training corpus by purely prompting LLMs from ECG features, the…

**Figure 2.** Framework of ECG-R1. Instruction generation builds a protocol-guided interpretation corpus by combining ECG grounding features with the monograph protocol. Architecture adopts a decoupled dual-encoder design with lightweight projectors to align modality-specific representations into a shared LLM space. Training follows a two-stage strategy with SFT followed by RL, and integrates IMD to enhance robustness a…

**Figure 3.** Architecture Comparison of GEM and ECG-R1. ECG from Basics to Essentials: Step by Step (Stroobandt et al., 2015). The original procedure is reorganized into five phases: (i) Technical, Rate & Rhythm, (ii) Conduction, Axis & Intervals, (iii) Chamber Hypertrophy & Voltage, (iv) Ischemia, Infarction & Mimics, and (v) Electrolytes & QT. We further enforce differential exclusion with explicit negatives to rule …

**Figure 4.** Modality Missing Results between Time-Series and Image Modalities. features via FeatureDB (Hong et al., 2019). 3.2. Main Evaluation Tasks and Metrics Grounded ECG Interpretation. We evaluate grounded ECG interpretation to assess whether the MLLM attains cardiologist-level competency in basic ECG reading, where accurate diagnosis must be accompanied by fine-grained evidence localization and clinically groun…

**Figure 5.** Qualitative Comparison of ECG-Grounding and our ECG Protocol-Guided Grounding CoT

**Figure 6.** Quantitative Comparison of ECG-Grounding and our ECG Protocol-Guided Grounding CoT

**Figure 7.** Effect of EDER on Mean Output Length and Entropy during RL Training

**Figure 8.** Report: Sinus rhythm with PACs; Left axis deviation; RBBB with left anterior fascicular block; Abnormal ECG. GEM Interpretation: Upon analyzing the provided ECG image and computed measurements, several key features and abnormalities are evident. The rhythm is sinus with occasional premature atrial contractions (PACs), as indicated by irregular RR intervals, particularly the shorter intervals such as 238 ms …

**Figure 9.** Report: Atrial fibrillation with rapid ventricular response with PVC(s); Left axis deviation; RBBB with left anterior fascicular block; Lateral ST-T changes may be due to myocardial ischemia; Abnormal ECG. GEM Interpretation: Upon analyzing the provided ECG image and computed measurements, several key features and abnormalities are evident. The heart rate is elevated at 120 bpm, indicating sinus tachycardia…

**Figure 10.** Report: Sinus rhythm; Left anterior fascicular block; Left ventricular hypertrophy; Abnormal ECG. GEM Interpretation: Upon analyzing the provided ECG image and computed measurements, several key features and abnormalities are evident. The rhythm is sinus, as indicated by the consistent P wave morphology across leads and a regular heart rate of 66 bpm. A notable finding is the left anterior fascicular block…

**Figure 11.** Report: Sinus rhythm; Ventricular bigeminy; Probable left atrial enlargement; Prolonged QT interval. GEM Interpretation: Upon analyzing the provided ECG image and computed measurements, several key features and abnormalities are evident. The rhythm is sinus, as indicated by the presence of consistent P waves preceding each QRS complex. However, there are multiple ventricular premature complexes (VPCs) obse…

Original abstract

Electrocardiography (ECG) serves as an indispensable diagnostic tool in clinical practice, yet existing multimodal large language models (MLLMs) remain unreliable for ECG interpretation, often producing plausible but clinically incorrect analyses. To address this, we propose ECG-R1, the first reasoning ECG MLLM designed for reliable ECG interpretation via three innovations. First, we construct the interpretation corpus using Protocol-Guided Instruction Data Generation, grounding interpretation in measurable ECG features and monograph-defined quantitative thresholds and diagnostic logic. Second, we present a modality-decoupled architecture with Interleaved Modality Dropout to improve robustness and cross-modal consistency when either the ECG signal or ECG image is missing. Third, we present Reinforcement Learning with ECG Diagnostic Evidence Rewards to strengthen evidence-grounded ECG interpretation. Additionally, we systematically evaluate the ECG interpretation capabilities of proprietary, open-source, and medical MLLMs, and provide the first quantitative evidence that severe hallucinations are widespread, suggesting that the public should not directly trust these outputs without independent verification. Code is available at https://github.com/PKUDigitalHealth/ECG-R1.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

ECG-R1 grounds MLLM training in monograph protocols and adds RL evidence rewards plus modality dropout, with a useful hallucination benchmark, but the abstract gives no numbers to show the reliability gains are real.

The key takeaway is that ECG-R1 combines protocol-guided data from medical monographs, interleaved modality dropout, and reinforcement learning with evidence rewards to try to cut down on hallucinations in ECG interpretation. They also run a comparison showing that many existing MLLMs hallucinate a lot on ECG tasks.

The new parts are the specific way they generate instruction data by translating monograph rules into examples, the architecture that handles missing ECG signal or image by dropping modalities during training, and the RL setup that rewards answers based on actual diagnostic evidence rather than just matching references. The hallucination evaluation across proprietary, open-source, and medical models is a solid addition because it gives concrete numbers on how bad the problem is right now.

This approach makes sense for a high-risk area like ECG reading where mistakes matter. Grounding in measurable features like QRS duration and using defined thresholds from monographs is better than free-form generation. The modality dropout should help when one input type is unavailable in practice.

The soft spot is that we don't see the actual performance numbers or ablations here. The abstract mentions the methods but skips the results, so it's tough to tell if these changes really move the needle on reliability. The concern about automated feature extraction is real: if the code that pulls out QTc or ST deviation has errors, the generated training data will have bad examples, and the RL stage might just reinforce those mistakes. I'd want to see error analysis on the feature extraction step and how it affects the final model.

This is for researchers building medical multimodal models or clinicians interested in AI tools for cardiology. Someone working on reliable AI for diagnostics would find the methods and the baseline evaluation worth looking at. It should go to peer review because the problem is important and the ideas are specific enough to test, even if revisions will be needed for the results section.

Referee Report

2 major / 2 minor

Summary. The paper introduces ECG-R1, an MLLM for ECG interpretation that claims to achieve reliability through three innovations: (1) Protocol-Guided Instruction Data Generation that grounds analyses in measurable ECG features using monograph-defined quantitative thresholds and diagnostic logic, (2) a modality-decoupled architecture incorporating Interleaved Modality Dropout to maintain performance when ECG signals or images are absent, and (3) Reinforcement Learning with ECG Diagnostic Evidence Rewards to reinforce evidence-based reasoning. It additionally reports a systematic evaluation of existing proprietary, open-source, and medical MLLMs, providing quantitative evidence that severe hallucinations are widespread and that outputs should not be trusted without verification. Code is released.

Significance. If the methods are shown to reduce hallucinations while preserving diagnostic accuracy, the work would be significant for clinical AI safety, as it directly targets the gap between plausible-sounding and clinically correct ECG analyses. The modality-agnostic design and public code release are practical strengths that could facilitate follow-on research and deployment in resource-constrained settings.

major comments (2)

§3.1 (Protocol-Guided Instruction Data Generation): The central reliability claim rests on the assumption that automated extraction of ECG features (QRS duration, QTc, ST deviation, etc.) accurately reproduces monograph thresholds. No validation metrics, error rates, or comparison against cardiologist annotations for the feature extractor are reported. Extraction errors exceeding the 10–20 ms precision of typical cutoffs would systematically corrupt the generated instruction data and the downstream RL evidence rewards, directly undermining the 'evidence-grounded' training signal.
§5 (Evaluation): The manuscript states it supplies the first quantitative evidence of widespread hallucinations across model classes, yet the abstract and available description contain no specific metrics (e.g., hallucination rate, inter-rater agreement, or ablation on the three proposed components). Without these numbers or error analysis, it is impossible to verify whether the proposed innovations measurably improve reliability over baselines.
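The first major comment turns on how measurement error interacts with hard cutoffs. A minimal sketch of that sensitivity follows, using made-up QRS durations and a 120 ms cutoff chosen only for illustration; the function name `label_flip_rate` is hypothetical and nothing here comes from the paper's data.

```python
from typing import Sequence


def label_flip_rate(
    true_values_ms: Sequence[float],
    error_ms: float,
    cutoff_ms: float,
) -> float:
    """Fraction of measurements whose above/below-cutoff label changes
    when a constant extraction bias of error_ms is added."""
    flips = sum(
        (value >= cutoff_ms) != (value + error_ms >= cutoff_ms)
        for value in true_values_ms
    )
    return flips / len(true_values_ms)


# Hypothetical QRS durations in milliseconds, clustered near the cutoff.
qrs_ms = [90, 100, 112, 118, 122, 130, 145]

# A +10 ms bias pushes 112 and 118 across the 120 ms cutoff: 2 of 7 labels flip.
print(label_flip_rate(qrs_ms, error_ms=10, cutoff_ms=120))

# A -10 ms bias pulls only 122 below the cutoff: 1 of 7 labels flips.
print(label_flip_rate(qrs_ms, error_ms=-10, cutoff_ms=120))
```

The flip rate depends on how many recordings sit within the error band around the cutoff, which is why the referee asks for extractor error figures rather than a qualitative assurance.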

minor comments (2)

[Abstract] The link text 'here' in the abstract should be replaced with the full repository URL for clarity.
[Abstract] Ensure consistent capitalization of 'ECG-R1' and expansion of 'MLLM' on first use.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. We appreciate the focus on strengthening the reliability claims through better validation and clearer quantitative reporting. We address each major comment below and outline the revisions we will make.

Referee: §3.1 (Protocol-Guided Instruction Data Generation): The central reliability claim rests on the assumption that automated extraction of ECG features (QRS duration, QTc, ST deviation, etc.) accurately reproduces monograph thresholds. No validation metrics, error rates, or comparison against cardiologist annotations for the feature extractor are reported. Extraction errors exceeding the 10–20 ms precision of typical cutoffs would systematically corrupt the generated instruction data and the downstream RL evidence rewards, directly undermining the 'evidence-grounded' training signal.

Authors: We agree that explicit validation of the automated feature extractor is essential to support the protocol-guided data generation and the downstream RL rewards. Our extractor combines established open-source ECG signal processing methods with monograph-defined thresholds. While we did not include a dedicated validation analysis in the original submission, we will add one in the revised §3.1. Specifically, we will report mean absolute errors, standard deviations, and agreement rates for key features (QRS duration, QTc, ST deviation) against annotations from two board-certified cardiologists on a held-out set of 500 ECGs. This will confirm that extraction errors remain within clinically acceptable ranges (well below the 10–20 ms thresholds) and do not systematically bias the instruction data or evidence rewards.
Revision: yes

Referee: §5 (Evaluation): The manuscript states it supplies the first quantitative evidence of widespread hallucinations across model classes, yet the abstract and available description contain no specific metrics (e.g., hallucination rate, inter-rater agreement, or ablation on the three proposed components). Without these numbers or error analysis, it is impossible to verify whether the proposed innovations measurably improve reliability over baselines.

Authors: We thank the referee for noting the need for more prominent quantitative details. Section 5 of the full manuscript already contains quantitative hallucination rates across proprietary, open-source, and medical MLLMs, along with diagnostic accuracy comparisons and initial component-wise results. To improve accessibility and address the concern directly, we will (1) update the abstract to include key metrics such as hallucination rates and overall reliability improvements, (2) add an explicit error analysis subsection, and (3) expand the ablation studies to quantify the individual contributions of protocol-guided data generation, interleaved modality dropout, and RL with diagnostic evidence rewards, including inter-rater agreement where relevant. These changes will make it straightforward to verify the impact of our innovations.
Revision: partial
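The validation promised in the first response (mean absolute error, standard deviation, and agreement rate against reference annotations) could be computed along these lines. This is a sketch under stated assumptions: the function name `extractor_agreement`, the single-cutoff notion of agreement, and the four example values are all hypothetical and do not come from the paper.

```python
from statistics import mean, stdev
from typing import Dict, Sequence


def extractor_agreement(
    auto_ms: Sequence[float],
    reference_ms: Sequence[float],
    cutoff_ms: float,
) -> Dict[str, float]:
    """Compare automated measurements with reference annotations.

    Returns the mean absolute error, the sample standard deviation of the
    signed errors, and the fraction of cases where both sources fall on the
    same side of the diagnostic cutoff.
    """
    errors = [a - r for a, r in zip(auto_ms, reference_ms)]
    same_side = [
        (a >= cutoff_ms) == (r >= cutoff_ms)
        for a, r in zip(auto_ms, reference_ms)
    ]
    return {
        "mae_ms": mean(abs(e) for e in errors),
        "sd_ms": stdev(errors),
        "threshold_agreement": sum(same_side) / len(same_side),
    }


# Hypothetical QRS durations: automated extractor vs. annotator reference.
stats = extractor_agreement(
    auto_ms=[100, 118, 125, 140],
    reference_ms=[102, 121, 123, 138],
    cutoff_ms=120,
)
print(stats)
```

In this toy case the mean absolute error is only 2.25 ms, yet one of four cases lands on the wrong side of the cutoff, so a small average error does not by itself guarantee correct threshold labels.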

Circularity Check

0 steps flagged

No circularity: derivation grounded in external monographs and standard RL

The paper constructs its interpretation corpus via Protocol-Guided Instruction Data Generation that explicitly references monograph-defined quantitative thresholds and diagnostic logic, employs a modality-decoupled architecture with interleaved dropout, and applies reinforcement learning using ECG diagnostic evidence rewards. These steps draw from external medical references and established RL methods without any self-definitional reduction, fitted input renamed as prediction, or load-bearing self-citation chain. The central claims remain independent of quantities defined by the model's own outputs or prior author work.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The approach rests on the assumption that existing medical monographs provide complete and unbiased quantitative rules for ECG diagnosis; no new physical entities are postulated and the main free parameters are standard RL hyperparameters whose values are not detailed in the abstract.

free parameters (1)

RL reward scaling factors
Weights balancing different diagnostic evidence components in the reinforcement learning objective; typical in RL but not quantified here.
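To make the role of such weights concrete, here is a minimal sketch of a weighted reward. It is an assumption for illustration only: the two components (diagnosis correctness and evidence coverage by substring match), the names `evidence_reward`, `w_diag`, and `w_evid`, and the default weights are hypothetical, and the paper's actual reward design is not described in this review.

```python
from typing import Sequence


def evidence_reward(
    output_text: str,
    required_evidence: Sequence[str],
    diagnosis_correct: bool,
    w_diag: float = 1.0,
    w_evid: float = 0.5,
) -> float:
    """Weighted sum of a correctness term and an evidence-coverage term.

    Evidence coverage is the fraction of required evidence phrases that
    appear (case-insensitively) in the model output.
    """
    text = output_text.lower()
    cited = sum(1 for item in required_evidence if item.lower() in text)
    coverage = cited / len(required_evidence) if required_evidence else 0.0
    return w_diag * float(diagnosis_correct) + w_evid * coverage


reward = evidence_reward(
    output_text="QRS duration 130 ms with left axis deviation.",
    required_evidence=["QRS duration", "left axis deviation", "PR interval"],
    diagnosis_correct=True,
)
print(reward)  # 1.0 for the diagnosis plus 0.5 * (2/3) for evidence coverage
```

Changing `w_evid` relative to `w_diag` shifts how strongly the policy is pushed toward citing evidence versus only getting the label right, which is why the ledger counts these weights as a free parameter.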

axioms (1)

Domain assumption: Monograph-defined ECG features and thresholds constitute sufficient and accurate ground truth for clinical diagnosis.
Invoked when constructing the protocol-guided instruction corpus.

pith-pipeline@v0.9.0 · 5777 in / 1391 out tokens · 50643 ms · 2026-05-21T14:13:11.299338+00:00 · methodology

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Reasoning Before Diagnosis: Physician-Inspired Structured Thinking for ECG Classification
cs.AI · 2026-05 · unverdicted · novelty 5.0

CardioThink applies structured clinical reasoning stages and Structured Set Policy Optimization (SSPO) to ECG classification, yielding higher diagnostic accuracy and more interpretable rationales than direct predictio...

DeepArrhythmia: Segment-Contextualized ECG Arrhythmia Classification via Selective Evidence Acquisition
cs.LG · 2026-05 · unverdicted · novelty 5.0

DeepArrhythmia introduces a segment-contextualized multimodal framework for beat-level ECG arrhythmia classification that uses tool-grounded evidence extraction and selective acquisition routed by segment-level confidence.