ECG-R1: Protocol-Guided and Modality-Agnostic MLLM for Reliable ECG Interpretation
Pith reviewed 2026-05-21 14:13 UTC · model grok-4.3
The pith
ECG-R1 grounds ECG interpretations in measurable features and monograph rules to reduce hallucinations.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
ECG-R1 achieves reliable ECG interpretation through three components: protocol-guided instruction data generation, which grounds analyses in measurable ECG features and in monograph-defined quantitative thresholds and diagnostic logic; a modality-decoupled architecture with interleaved modality dropout; and reinforcement learning with ECG diagnostic evidence rewards.
What carries the argument
Protocol-Guided Instruction Data Generation together with a modality-decoupled architecture using Interleaved Modality Dropout and Reinforcement Learning with ECG Diagnostic Evidence Rewards.
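The summary above does not spell out how Interleaved Modality Dropout is scheduled. As a minimal sketch of the general idea only, assume each training sample carries both an ECG signal and an ECG image, and that one of the two is randomly hidden so the model learns to work from either input alone. The function name `apply_modality_dropout`, the parameter `p_drop`, and the rule that at most one modality is hidden are illustrative assumptions, not the authors' implementation.

```python
import random

def apply_modality_dropout(signal, image, p_drop=0.3, rng=random):
    """Randomly hide one modality of a training sample.

    Assumed rule (not taken from the paper): never hide both inputs,
    and leave samples that already lack a modality untouched.
    """
    if signal is None or image is None:
        return signal, image  # already single-modality, nothing to drop
    if rng.random() < p_drop:
        # Pick which modality to hide with equal probability.
        if rng.random() < 0.5:
            return None, image   # signal hidden, image kept
        return signal, None      # image hidden, signal kept
    return signal, image         # both modalities kept


# Hypothetical usage: a short signal and an image path stand in for real inputs.
rng = random.Random(0)
signal, image = apply_modality_dropout([0.1, 0.2, 0.3], "ecg.png", rng=rng)
```

Passing a seeded `random.Random` instance keeps the masking reproducible across runs, which helps when comparing ablations.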
If this is right
- Existing MLLMs produce widespread severe hallucinations on ECG interpretation tasks.
- The modality-decoupled design with dropout improves robustness when signal or image inputs are missing.
- Evidence-rewarded reinforcement learning strengthens traceable, rule-following reasoning in outputs.
- Independent verification remains necessary for any current MLLM ECG analysis.
Where Pith is reading between the lines
- The same grounding strategy could extend to other medical signal or image interpretation tasks to curb hallucinations.
- Protocol-based training may reduce diagnostic errors in settings where clinicians have limited time for full review.
- Real-world deployment would require checking whether the model maintains accuracy on rare or borderline ECG patterns not well-represented in monographs.
Load-bearing premise
The protocol-guided process accurately translates monograph thresholds and diagnostic logic into training examples without systematic biases or omissions from the selected ECG features and extraction methods.
What would settle it
Test whether ECG-R1 still produces clinically incorrect outputs on held-out ECG cases where measured feature values cross clear monograph thresholds, or whether removing the protocol guidance causes performance to match ungrounded baselines.
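The first half of this test can be sketched as a simple consistency check between a measured feature and the model's stated finding. The example below is a sketch under stated assumptions: it uses 120 ms as a commonly cited cutoff for a wide QRS purely for illustration (the monograph thresholds used by the paper are not listed here), and the `Case` fields, the `margin_ms` parameter, and the function name are hypothetical.

```python
from dataclasses import dataclass

# Illustrative cutoff only: 120 ms is a commonly cited boundary for a wide QRS.
# The thresholds actually used by the paper are not given in this summary.
WIDE_QRS_MS = 120.0

@dataclass
class Case:
    qrs_ms: float          # measured QRS duration in milliseconds
    model_says_wide: bool  # whether the model's interpretation reports a wide QRS

def threshold_violations(cases, margin_ms=10.0):
    """Return cases clearly on one side of the cutoff that the model contradicts."""
    violations = []
    for case in cases:
        if abs(case.qrs_ms - WIDE_QRS_MS) < margin_ms:
            continue  # borderline value: skipped, the test targets clear crossings
        truly_wide = case.qrs_ms >= WIDE_QRS_MS
        if truly_wide != case.model_says_wide:
            violations.append(case)
    return violations


# Hypothetical held-out cases (toy values, not data from the paper).
held_out = [Case(90, False), Case(150, False), Case(125, False),
            Case(160, True), Case(80, True)]
print(len(threshold_violations(held_out)))
```

Skipping values within `margin_ms` of the cutoff keeps the check focused on cases where the measurement unambiguously crosses the threshold, which is what the proposed test calls for.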
Original abstract
Electrocardiography (ECG) serves as an indispensable diagnostic tool in clinical practice, yet existing multimodal large language models (MLLMs) remain unreliable for ECG interpretation, often producing plausible but clinically incorrect analyses. To address this, we propose ECG-R1, the first reasoning ECG MLLM designed for reliable ECG interpretation via three innovations. First, we construct the interpretation corpus using Protocol-Guided Instruction Data Generation, grounding interpretation in measurable ECG features and monograph-defined quantitative thresholds and diagnostic logic. Second, we present a modality-decoupled architecture with Interleaved Modality Dropout to improve robustness and cross-modal consistency when either the ECG signal or ECG image is missing. Third, we present Reinforcement Learning with ECG Diagnostic Evidence Rewards to strengthen evidence-grounded ECG interpretation. Additionally, we systematically evaluate the ECG interpretation capabilities of proprietary, open-source, and medical MLLMs, and provide the first quantitative evidence that severe hallucinations are widespread, suggesting that the public should not directly trust these outputs without independent verification. Code is available at https://github.com/PKUDigitalHealth/ECG-R1.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces ECG-R1, an MLLM for ECG interpretation that claims to achieve reliability through three innovations: (1) Protocol-Guided Instruction Data Generation that grounds analyses in measurable ECG features using monograph-defined quantitative thresholds and diagnostic logic, (2) a modality-decoupled architecture incorporating Interleaved Modality Dropout to maintain performance when ECG signals or images are absent, and (3) Reinforcement Learning with ECG Diagnostic Evidence Rewards to reinforce evidence-based reasoning. It additionally reports a systematic evaluation of existing proprietary, open-source, and medical MLLMs, providing quantitative evidence that severe hallucinations are widespread and that outputs should not be trusted without verification. Code is released.
Significance. If the methods are shown to reduce hallucinations while preserving diagnostic accuracy, the work would be significant for clinical AI safety, as it directly targets the gap between plausible-sounding and clinically correct ECG analyses. The modality-agnostic design and public code release are practical strengths that could facilitate follow-on research and deployment in resource-constrained settings.
major comments (2)
- [§3.1] Protocol-Guided Instruction Data Generation: The central reliability claim rests on the assumption that automated extraction of ECG features (QRS duration, QTc, ST deviation, etc.) accurately reproduces monograph thresholds. No validation metrics, error rates, or comparison against cardiologist annotations for the feature extractor are reported. Extraction errors exceeding the 10–20 ms precision of typical cutoffs would systematically corrupt the generated instruction data and the downstream RL evidence rewards, directly undermining the 'evidence-grounded' training signal.
- [§5] Evaluation: The manuscript states it supplies the first quantitative evidence of widespread hallucinations across model classes, yet the abstract and available description contain no specific metrics (e.g., hallucination rate, inter-rater agreement, or ablation on the three proposed components). Without these numbers or error analysis, it is impossible to verify whether the proposed innovations measurably improve reliability over baselines.
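The concern in the §3.1 comment can be made concrete: a small average extraction error can still flip a case across a diagnostic cutoff. The sketch below computes the mean absolute error of an extractor against reference annotations and the share of cases whose side of the cutoff changes. The function names, the 120 ms cutoff, and the toy values are all assumptions for illustration and do not come from the paper.

```python
from statistics import mean

def mean_absolute_error(extracted, reference):
    """Average absolute difference between extracted and reference values."""
    return mean(abs(e - r) for e, r in zip(extracted, reference))

def threshold_flip_rate(extracted, reference, cutoff):
    """Fraction of cases where extractor and reference fall on opposite sides of the cutoff."""
    flips = sum((e >= cutoff) != (r >= cutoff) for e, r in zip(extracted, reference))
    return flips / len(reference)


# Toy QRS durations in ms (hypothetical, not data from the paper).
extracted = [118, 96, 131, 122]
reference = [124, 95, 128, 119]

print(mean_absolute_error(extracted, reference))
print(threshold_flip_rate(extracted, reference, cutoff=120))
```

With these toy values the mean absolute error is 3.25 ms, yet half of the cases end up on the wrong side of the cutoff. That is why a flip rate near the thresholds is arguably more informative here than an average error alone.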
minor comments (2)
- [Abstract] The link text 'here' in the abstract should be replaced with the full repository URL for clarity.
- [Abstract] Ensure consistent capitalization of 'ECG-R1' and expansion of 'MLLM' on first use.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback on our manuscript. We appreciate the focus on strengthening the reliability claims through better validation and clearer quantitative reporting. We address each major comment below and outline the revisions we will make.
Point-by-point responses
- Referee: [§3.1] Protocol-Guided Instruction Data Generation: The central reliability claim rests on the assumption that automated extraction of ECG features (QRS duration, QTc, ST deviation, etc.) accurately reproduces monograph thresholds. No validation metrics, error rates, or comparison against cardiologist annotations for the feature extractor are reported. Extraction errors exceeding the 10–20 ms precision of typical cutoffs would systematically corrupt the generated instruction data and the downstream RL evidence rewards, directly undermining the 'evidence-grounded' training signal.
  Authors: We agree that explicit validation of the automated feature extractor is essential to support the protocol-guided data generation and the downstream RL rewards. Our extractor combines established open-source ECG signal processing methods with monograph-defined thresholds. While we did not include a dedicated validation analysis in the original submission, we will add one in the revised §3.1. Specifically, we will report mean absolute errors, standard deviations, and agreement rates for key features (QRS duration, QTc, ST deviation) against annotations from two board-certified cardiologists on a held-out set of 500 ECGs. This will confirm that extraction errors remain within clinically acceptable ranges (well below the 10–20 ms thresholds) and do not systematically bias the instruction data or evidence rewards.
  Revision: yes
- Referee: [§5] Evaluation: The manuscript states it supplies the first quantitative evidence of widespread hallucinations across model classes, yet the abstract and available description contain no specific metrics (e.g., hallucination rate, inter-rater agreement, or ablation on the three proposed components). Without these numbers or error analysis, it is impossible to verify whether the proposed innovations measurably improve reliability over baselines.
  Authors: We thank the referee for noting the need for more prominent quantitative details. Section 5 of the full manuscript already contains quantitative hallucination rates across proprietary, open-source, and medical MLLMs, along with diagnostic accuracy comparisons and initial component-wise results. To improve accessibility and address the concern directly, we will (1) update the abstract to include key metrics such as hallucination rates and overall reliability improvements, (2) add an explicit error analysis subsection, and (3) expand the ablation studies to quantify the individual contributions of protocol-guided data generation, interleaved modality dropout, and RL with diagnostic evidence rewards, including inter-rater agreement where relevant. These changes will make it straightforward to verify the impact of our innovations.
  Revision: partial
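Both responses promise agreement statistics without naming a metric. One common choice for two raters assigning categorical labels is Cohen's kappa, which corrects raw agreement for agreement expected by chance. The sketch below is an assumption about how such a figure could be computed, not the metric the authors say they will use. The labels and rater lists are hypothetical.

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must label the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: share of items given the same label by both raters.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement from each rater's marginal label frequencies.
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical labels: "h" = output judged hallucinated, "ok" = judged correct.
rater_1 = ["h", "h", "ok", "ok"]
rater_2 = ["h", "ok", "ok", "ok"]
print(cohens_kappa(rater_1, rater_2))
```

In this toy example the raters agree on three of four items, but chance agreement is 0.5, so kappa is 0.5 rather than 0.75.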
Circularity Check
No circularity: derivation grounded in external monographs and standard RL
The paper constructs its interpretation corpus via Protocol-Guided Instruction Data Generation that explicitly references monograph-defined quantitative thresholds and diagnostic logic, employs a modality-decoupled architecture with interleaved dropout, and applies reinforcement learning using ECG diagnostic evidence rewards. These steps draw from external medical references and established RL methods without any self-definitional reduction, fitted input renamed as prediction, or load-bearing self-citation chain. The central claims remain independent of quantities defined by the model's own outputs or prior author work.
Axiom & Free-Parameter Ledger
free parameters (1)
- RL reward scaling factors
axioms (1)
- Domain assumption: Monograph-defined ECG features and thresholds constitute sufficient and accurate ground truth for clinical diagnosis.
Forward citations
Cited by 2 Pith papers
- Reasoning Before Diagnosis: Physician-Inspired Structured Thinking for ECG Classification
  CardioThink applies structured clinical reasoning stages and Structured Set Policy Optimization (SSPO) to ECG classification, yielding higher diagnostic accuracy and more interpretable rationales than direct predictio...
- DeepArrhythmia: Segment-Contextualized ECG Arrhythmia Classification via Selective Evidence Acquisition
  DeepArrhythmia introduces a segment-contextualized multimodal framework for beat-level ECG arrhythmia classification that uses tool-grounded evidence extraction and selective acquisition routed by segment-level confidence.