ECG-R1: Protocol-Guided and Modality-Agnostic MLLM for Reliable ECG Interpretation
Pith reviewed 2026-05-21 14:13 UTC · model grok-4.3
The pith
ECG-R1 grounds ECG interpretations in measurable features and monograph rules to reduce hallucinations.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
ECG-R1 achieves reliable ECG interpretation by grounding analyses in measurable ECG features and monograph-defined quantitative thresholds and diagnostic logic through protocol-guided instruction data generation, a modality-decoupled architecture with interleaved modality dropout, and reinforcement learning with ECG diagnostic evidence rewards.
What carries the argument
Protocol-Guided Instruction Data Generation together with a modality-decoupled architecture using Interleaved Modality Dropout and Reinforcement Learning with ECG Diagnostic Evidence Rewards.
If this is right
- Existing MLLMs produce widespread severe hallucinations on ECG interpretation tasks.
- The modality-decoupled design with dropout improves robustness when signal or image inputs are missing.
- Evidence-rewarded reinforcement learning strengthens traceable, rule-following reasoning in outputs.
- Independent verification remains necessary for any current MLLM ECG analysis.
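The robustness claim above depends on the model seeing incomplete inputs during training. The summary does not spell out how Interleaved Modality Dropout is implemented, so the following is only a minimal sketch of one plausible reading: each training sample randomly loses either its signal or its image, but never both. The function name, the `p_drop` value, and the dictionary layout are assumptions made for illustration, not details from the paper.

```python
import random


def interleaved_modality_dropout(batch, p_drop=0.3, rng=None):
    """Randomly hide one modality per sample, never both.

    batch:  list of dicts with keys "signal" and "image" (assumed layout).
    p_drop: hypothetical probability that a sample loses one modality.
    """
    rng = rng or random.Random()
    out = []
    for sample in batch:
        kept = dict(sample)  # shallow copy so the input batch is untouched
        if rng.random() < p_drop:
            # Pick which single modality to hide for this sample.
            dropped = rng.choice(["signal", "image"])
            kept[dropped] = None
        out.append(kept)
    return out
```

Dropping at most one modality per sample keeps every training example answerable, which is the property a missing-input robustness test would rely on.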
Where Pith is reading between the lines
- The same grounding strategy could extend to other medical signal or image interpretation tasks to curb hallucinations.
- Protocol-based training may reduce diagnostic errors in settings where clinicians have limited time for full review.
- Real-world deployment would require checking whether the model maintains accuracy on rare or borderline ECG patterns not well-represented in monographs.
Load-bearing premise
The protocol-guided process accurately translates monograph thresholds and diagnostic logic into training examples without systematic biases or omissions from the selected ECG features and extraction methods.
What would settle it
Test whether ECG-R1 still produces clinically incorrect outputs on held-out ECG cases where measured feature values cross clear monograph thresholds, or whether removing the protocol guidance causes performance to match ungrounded baselines.
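The first test described above can be sketched as a simple consistency check. The code below flags cases where a measured value sits clearly on one side of a cutoff while the model's report claims the opposite. The 120 ms wide-QRS cutoff, the 10 ms margin, and the `Case` structure are illustrative assumptions; a real evaluation would take its thresholds from the monographs the paper uses.

```python
from dataclasses import dataclass

# Illustrative cutoff only; a real protocol would read this from its monograph.
WIDE_QRS_MS = 120.0


@dataclass
class Case:
    qrs_ms: float          # measured QRS duration in milliseconds
    model_says_wide: bool  # whether the model's report claims a wide QRS


def threshold_violations(cases, cutoff=WIDE_QRS_MS, margin=10.0):
    """Return cases where the measurement is clearly on one side of the
    cutoff but the model's report claims the other side."""
    bad = []
    for c in cases:
        clearly_wide = c.qrs_ms >= cutoff + margin
        clearly_narrow = c.qrs_ms <= cutoff - margin
        if (clearly_wide and not c.model_says_wide) or (
            clearly_narrow and c.model_says_wide
        ):
            bad.append(c)
    return bad
```

Cases inside the margin are skipped on purpose, so that borderline measurements do not count against the model.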
Original abstract
Electrocardiography (ECG) serves as an indispensable diagnostic tool in clinical practice, yet existing multimodal large language models (MLLMs) remain unreliable for ECG interpretation, often producing plausible but clinically incorrect analyses. To address this, we propose ECG-R1, the first reasoning ECG MLLM designed for reliable ECG interpretation via three innovations. First, we construct the interpretation corpus using Protocol-Guided Instruction Data Generation, grounding interpretation in measurable ECG features and monograph-defined quantitative thresholds and diagnostic logic. Second, we present a modality-decoupled architecture with Interleaved Modality Dropout to improve robustness and cross-modal consistency when either the ECG signal or ECG image is missing. Third, we present Reinforcement Learning with ECG Diagnostic Evidence Rewards to strengthen evidence-grounded ECG interpretation. Additionally, we systematically evaluate the ECG interpretation capabilities of proprietary, open-source, and medical MLLMs, and provide the first quantitative evidence that severe hallucinations are widespread, suggesting that the public should not directly trust these outputs without independent verification. Code is available here (https://github.com/PKUDigitalHealth/ECG-R1).
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces ECG-R1, an MLLM for ECG interpretation that claims to achieve reliability through three innovations: (1) Protocol-Guided Instruction Data Generation that grounds analyses in measurable ECG features using monograph-defined quantitative thresholds and diagnostic logic, (2) a modality-decoupled architecture incorporating Interleaved Modality Dropout to maintain performance when ECG signals or images are absent, and (3) Reinforcement Learning with ECG Diagnostic Evidence Rewards to reinforce evidence-based reasoning. It additionally reports a systematic evaluation of existing proprietary, open-source, and medical MLLMs, providing quantitative evidence that severe hallucinations are widespread and that outputs should not be trusted without verification. Code is released.
Significance. If the methods are shown to reduce hallucinations while preserving diagnostic accuracy, the work would be significant for clinical AI safety, as it directly targets the gap between plausible-sounding and clinically correct ECG analyses. The modality-agnostic design and public code release are practical strengths that could facilitate follow-on research and deployment in resource-constrained settings.
major comments (2)
- [§3.1] §3.1 (Protocol-Guided Instruction Data Generation): The central reliability claim rests on the assumption that automated extraction of ECG features (QRS duration, QTc, ST deviation, etc.) accurately reproduces monograph thresholds. No validation metrics, error rates, or comparison against cardiologist annotations for the feature extractor are reported. Extraction errors exceeding the 10–20 ms precision of typical cutoffs would systematically corrupt the generated instruction data and the downstream RL evidence rewards, directly undermining the 'evidence-grounded' training signal.
- [§5] §5 (Evaluation): The manuscript states it supplies the first quantitative evidence of widespread hallucinations across model classes, yet the abstract and available description contain no specific metrics (e.g., hallucination rate, inter-rater agreement, or ablation on the three proposed components). Without these numbers or error analysis, it is impossible to verify whether the proposed innovations measurably improve reliability over baselines.
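The first major comment turns on how small measurement errors interact with hard cutoffs. A short worked example makes the concern concrete: using Bazett's formula, QTc = QT / sqrt(RR), a QT extraction error of about 15 ms can move a corrected interval across a threshold and flip the derived label. The 460 ms cutoff and the function names below are assumptions chosen for illustration, not values taken from the paper.

```python
import math


def qtc_bazett(qt_ms, rr_s):
    """Bazett-corrected QT interval: QT (ms) divided by sqrt(RR in seconds)."""
    return qt_ms / math.sqrt(rr_s)


def label_flips(qt_ms, rr_s, qt_error_ms, cutoff_ms=460.0):
    """True if an extraction error in QT changes the 'prolonged QTc' label.

    cutoff_ms is an illustrative threshold, not a value from the paper.
    """
    true_label = qtc_bazett(qt_ms, rr_s) > cutoff_ms
    measured_label = qtc_bazett(qt_ms + qt_error_ms, rr_s) > cutoff_ms
    return true_label != measured_label


# QT = 420 ms at RR = 0.8 s gives a QTc of roughly 469.6 ms (above the cutoff).
# Underestimating QT by 15 ms gives roughly 452.8 ms (below the cutoff).
print(label_flips(420, 0.8, -15))
```

Any instruction example generated from the mislabelled measurement would then teach the wrong rule application, which is the corruption pathway the referee describes.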
minor comments (2)
- [Abstract] The link text 'here' in the abstract should be replaced with the full repository URL for clarity.
- [Abstract] Ensure consistent capitalization of 'ECG-R1' and expansion of 'MLLM' on first use.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback on our manuscript. We appreciate the focus on strengthening the reliability claims through better validation and clearer quantitative reporting. We address each major comment below and outline the revisions we will make.
- Referee: [§3.1] §3.1 (Protocol-Guided Instruction Data Generation): The central reliability claim rests on the assumption that automated extraction of ECG features (QRS duration, QTc, ST deviation, etc.) accurately reproduces monograph thresholds. No validation metrics, error rates, or comparison against cardiologist annotations for the feature extractor are reported. Extraction errors exceeding the 10–20 ms precision of typical cutoffs would systematically corrupt the generated instruction data and the downstream RL evidence rewards, directly undermining the 'evidence-grounded' training signal.
Authors: We agree that explicit validation of the automated feature extractor is essential to support the protocol-guided data generation and the downstream RL rewards. Our extractor combines established open-source ECG signal processing methods with monograph-defined thresholds. While we did not include a dedicated validation analysis in the original submission, we will add one in the revised §3.1. Specifically, we will report mean absolute errors, standard deviations, and agreement rates for key features (QRS duration, QTc, ST deviation) against annotations from two board-certified cardiologists on a held-out set of 500 ECGs. This will confirm that extraction errors remain within clinically acceptable ranges (well below the 10–20 ms thresholds) and do not systematically bias the instruction data or evidence rewards.
revision: yes
- Referee: [§5] §5 (Evaluation): The manuscript states it supplies the first quantitative evidence of widespread hallucinations across model classes, yet the abstract and available description contain no specific metrics (e.g., hallucination rate, inter-rater agreement, or ablation on the three proposed components). Without these numbers or error analysis, it is impossible to verify whether the proposed innovations measurably improve reliability over baselines.
Authors: We thank the referee for noting the need for more prominent quantitative details. Section 5 of the full manuscript already contains quantitative hallucination rates across proprietary, open-source, and medical MLLMs, along with diagnostic accuracy comparisons and initial component-wise results. To improve accessibility and address the concern directly, we will (1) update the abstract to include key metrics such as hallucination rates and overall reliability improvements, (2) add an explicit error analysis subsection, and (3) expand the ablation studies to quantify the individual contributions of protocol-guided data generation, interleaved modality dropout, and RL with diagnostic evidence rewards, including inter-rater agreement where relevant. These changes will make it straightforward to verify the impact of our innovations.
revision: partial
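The two responses above promise mean absolute errors, standard deviations, agreement rates, and inter-rater agreement. A minimal sketch of how such numbers are commonly computed follows, using Cohen's kappa as one standard agreement statistic. The rebuttal does not say which statistic the authors will use, so that choice, the function names, and the sample values are all assumptions.

```python
from statistics import mean, stdev


def error_summary(extracted, reference):
    """Mean absolute error and sample standard deviation of the signed
    error between extractor values and reference annotations."""
    errors = [e - r for e, r in zip(extracted, reference)]
    return {"mae": mean([abs(x) for x in errors]), "sd": stdev(errors)}


def cohen_kappa(a, b):
    """Cohen's kappa for two equal-length label sequences.

    Undefined (division by zero) when chance agreement is exactly 1,
    e.g. when both raters assign one identical label to every item.
    """
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n              # observed agreement
    labels = set(a) | set(b)
    p_e = sum((a.count(l) / n) * (b.count(l) / n) for l in labels)  # chance agreement
    return (p_o - p_e) / (1 - p_e)


# Hypothetical QRS durations in ms: extractor output vs. reference annotation.
summary = error_summary([100, 110, 130], [102, 108, 126])
# Hypothetical binary labels from two raters (1 = finding present).
kappa = cohen_kappa([1, 1, 0, 0], [1, 0, 0, 0])
```

Reporting kappa on threshold-derived labels, alongside raw millisecond errors, would show whether extraction noise actually changes diagnostic categories rather than only the underlying measurements.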
Circularity Check
No circularity: derivation grounded in external monographs and standard RL
The paper constructs its interpretation corpus via Protocol-Guided Instruction Data Generation that explicitly references monograph-defined quantitative thresholds and diagnostic logic, employs a modality-decoupled architecture with interleaved dropout, and applies reinforcement learning using ECG diagnostic evidence rewards. These steps draw from external medical references and established RL methods without any self-definitional reduction, fitted input renamed as prediction, or load-bearing self-citation chain. The central claims remain independent of quantities defined by the model's own outputs or prior author work.
Axiom & Free-Parameter Ledger
free parameters (1)
- RL reward scaling factors
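To show why scaling factors count as free parameters, here is a hedged sketch of a composite reward in which one weight scales diagnostic correctness and another scales evidence coverage. The actual reward used in ECG-R1 is not specified in this summary; the function, its weights `w_diag` and `w_evidence`, and the coverage definition are hypothetical.

```python
def evidence_reward(diagnosis_correct, cited_evidence, required_evidence,
                    w_diag=1.0, w_evidence=0.5):
    """Weighted sum of diagnostic correctness and evidence coverage.

    w_diag and w_evidence are hypothetical scaling factors; coverage is the
    fraction of required evidence items that the model's output cites.
    """
    required = set(required_evidence)
    if required:
        coverage = len(required & set(cited_evidence)) / len(required)
    else:
        coverage = 0.0  # nothing to cite, so no evidence credit is given
    return w_diag * float(diagnosis_correct) + w_evidence * coverage
```

Changing the ratio between the two weights shifts how much the policy is paid for citing evidence versus reaching the right label, so the chosen values would need to be reported and ablated.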
axioms (1)
- domain assumption: Monograph-defined ECG features and thresholds constitute sufficient and accurate ground truth for clinical diagnosis.
Forward citations
Cited by 2 Pith papers
- Reasoning Before Diagnosis: Physician-Inspired Structured Thinking for ECG Classification
CardioThink applies structured clinical reasoning stages and Structured Set Policy Optimization (SSPO) to ECG classification, yielding higher diagnostic accuracy and more interpretable rationales than direct predictio...
- DeepArrhythmia: Segment-Contextualized ECG Arrhythmia Classification via Selective Evidence Acquisition
DeepArrhythmia introduces a segment-contextualized multimodal framework for beat-level ECG arrhythmia classification that uses tool-grounded evidence extraction and selective acquisition routed by segment-level confidence.