ECG-R1: Protocol-Guided and Modality-Agnostic MLLM for Reliable ECG Interpretation

Bo Liu; Deyun Zhang; Haoyu Wang; Hongyan Li; Jiarui Jin; Shenda Hong; Xiang Lan; Xian Wu; Xiaocheng Fang; Xingliang Wu

arxiv: 2602.04279 · v2 · pith:B5WUISEHnew · submitted 2026-02-04 · 💻 cs.CL

Jiarui Jin, Haoyu Wang, Xingliang Wu, Xiaocheng Fang, Xiang Lan, Zihan Wang, Deyun Zhang, Bo Liu, Yingying Zhang, Xian Wu, Hongyan Li, Shenda Hong

Pith reviewed 2026-05-21 14:13 UTC · model grok-4.3

classification 💻 cs.CL

keywords ECG interpretation · multimodal large language models · protocol-guided data generation · reinforcement learning · medical hallucinations · modality dropout · diagnostic AI

The pith

ECG-R1 grounds ECG interpretations in measurable features and monograph rules to reduce hallucinations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces ECG-R1, a multimodal large language model built specifically for reliable ECG interpretation. It constructs training data through protocol-guided generation that follows measurable ECG features along with quantitative thresholds and diagnostic logic from medical monographs. A modality-decoupled architecture with interleaved dropout maintains performance when either the ECG signal or image is unavailable, while reinforcement learning rewards outputs that cite diagnostic evidence. Evaluations of existing proprietary, open-source, and medical MLLMs show frequent severe hallucinations, indicating that direct trust in their ECG outputs is unwarranted without verification.

Core claim

ECG-R1 achieves reliable ECG interpretation by grounding analyses in measurable ECG features and monograph-defined quantitative thresholds and diagnostic logic through protocol-guided instruction data generation, a modality-decoupled architecture with interleaved modality dropout, and reinforcement learning with ECG diagnostic evidence rewards.

What carries the argument

Protocol-Guided Instruction Data Generation together with a modality-decoupled architecture using Interleaved Modality Dropout and Reinforcement Learning with ECG Diagnostic Evidence Rewards.
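
The summary does not spell out how Interleaved Modality Dropout is implemented, so the following is only a minimal sketch of the general idea of training-time modality dropout: with some probability, a sample that has both the ECG signal and the ECG image loses exactly one of them, and never both. The names `EcgSample`, `drop_one_modality`, and `p_drop` are hypothetical, and the 50/50 choice between modalities is an assumption, not the paper's schedule.

```python
import random
from dataclasses import dataclass, replace
from typing import Optional, Sequence


@dataclass(frozen=True)
class EcgSample:
    """Hypothetical container: either field may be None when a modality is missing."""
    signal: Optional[Sequence[float]]
    image_path: Optional[str]


def drop_one_modality(sample: EcgSample, p_drop: float, rng: random.Random) -> EcgSample:
    """Drop at most one modality so the model always keeps some ECG input."""
    # A sample that already lacks a modality is passed through untouched.
    if sample.signal is None or sample.image_path is None:
        return sample
    # Keep both modalities with probability 1 - p_drop.
    if rng.random() >= p_drop:
        return sample
    # Otherwise drop the signal or the image with equal probability (assumed).
    if rng.random() < 0.5:
        return replace(sample, signal=None)
    return replace(sample, image_path=None)
```

Applied per sample inside a data loader, a function like this exposes the model to signal-only, image-only, and paired inputs during training, which is the setting the missing-modality evaluation probes.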

If this is right

Existing MLLMs produce widespread severe hallucinations on ECG interpretation tasks.
The modality-decoupled design with dropout improves robustness when signal or image inputs are missing.
Evidence-rewarded reinforcement learning strengthens traceable, rule-following reasoning in outputs.
Independent verification remains necessary for any current MLLM ECG analysis.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same grounding strategy could extend to other medical signal or image interpretation tasks to curb hallucinations.
Protocol-based training may reduce diagnostic errors in settings where clinicians have limited time for full review.
Real-world deployment would require checking whether the model maintains accuracy on rare or borderline ECG patterns not well-represented in monographs.

Load-bearing premise

The protocol-guided process accurately translates monograph thresholds and diagnostic logic into training examples without systematic biases or omissions from the selected ECG features and extraction methods.

What would settle it

Test whether ECG-R1 still produces clinically incorrect outputs on held-out ECG cases where measured feature values cross clear monograph thresholds, or whether removing the protocol guidance causes performance to match ungrounded baselines.
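
One way to run such a test is to compare the findings a model claims against what the measured values imply under fixed cutoffs. The sketch below assumes two commonly taught cutoffs (QRS duration of at least 120 ms for a wide QRS, PR interval above 200 ms for a prolonged PR); these values, the feature keys, and the finding labels are illustrative assumptions and are not taken from the paper or its monograph protocol.

```python
from dataclasses import dataclass
from typing import Dict, Set


@dataclass(frozen=True)
class ThresholdRule:
    feature: str      # key in the measurement dict (hypothetical naming)
    cutoff: float     # cutoff in milliseconds
    inclusive: bool   # True: value >= cutoff fires; False: value > cutoff fires
    finding: str      # label expected in the interpretation when the rule fires

    def fires(self, value: float) -> bool:
        return value >= self.cutoff if self.inclusive else value > self.cutoff


# Illustrative textbook-style cutoffs, assumed for this sketch only.
RULES = (
    ThresholdRule("qrs_ms", 120.0, True, "wide QRS"),
    ThresholdRule("pr_ms", 200.0, False, "prolonged PR"),
)


def check_claims(measurements: Dict[str, float], claimed: Set[str]) -> Dict[str, Set[str]]:
    """Return findings the model missed and findings the measurements contradict."""
    expected: Set[str] = set()
    ruled_out: Set[str] = set()
    for rule in RULES:
        if rule.feature not in measurements:
            continue  # no measurement, so the rule says nothing either way
        if rule.fires(measurements[rule.feature]):
            expected.add(rule.finding)
        else:
            ruled_out.add(rule.finding)
    return {"missed": expected - claimed, "unsupported": claimed & ruled_out}
```

For example, `check_claims({"qrs_ms": 134.0, "pr_ms": 160.0}, {"prolonged PR"})` reports a missed wide QRS and an unsupported prolonged PR, which is the kind of threshold-crossing error the test above is meant to surface.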

Figures

Figures reproduced from arXiv: 2602.04279 by Bo Liu, Deyun Zhang, Haoyu Wang, Hongyan Li, Jiarui Jin, Shenda Hong, Xiang Lan, Xian Wu, Xiaocheng Fang, Xingliang Wu, Yingying Zhang, Zihan Wang.

**Figure 1.** Left: Attribute comparison among general/medical MLLMs, previous ECG-specialized MLLMs, and ECG-R1. General/medical MLLMs typically cannot perform signal analysis and lack high-quality ECG interpretation corpora, which often leads to hallucinated, clinically incorrect interpretations at test time. Previous ECG-specialized MLLMs often construct training corpus by purely prompting LLMs from ECG features, the…

**Figure 2.** Framework of ECG-R1. Instruction generation builds a protocol-guided interpretation corpus by combining ECG grounding features with the monograph protocol. Architecture adopts a decoupled dual-encoder design with lightweight projectors to align modality-specific representations into a shared LLM space. Training follows a two-stage strategy with SFT followed by RL, and integrates IMD to enhance robustness a…

**Figure 3.** Architecture Comparison of GEM and ECG-R1. ECG from Basics to Essentials: Step by Step (Stroobandt et al., 2015). The original procedure is reorganized into five phases: (i) Technical, Rate & Rhythm, (ii) Conduction, Axis & Intervals, (iii) Chamber Hypertrophy & Voltage, (iv) Ischemia, Infarction & Mimics, and (v) Electrolytes & QT. We further enforce differential exclusion with explicit negatives to rule …

**Figure 4.** Modality Missing Results between Time-Series and Image Modalities. features via FeatureDB (Hong et al., 2019). 3.2. Main Evaluation Tasks and Metrics Grounded ECG Interpretation. We evaluate grounded ECG interpretation to assess whether the MLLM attains cardiologist-level competency in basic ECG reading, where accurate diagnosis must be accompanied by fine-grained evidence localization and clinically groun…

**Figure 5.** Qualitative Comparison of ECG-Grounding and our ECG Protocol-Guided Grounding CoT

**Figure 6.** Quantitative Comparison of ECG-Grounding and our ECG Protocol-Guided Grounding CoT

**Figure 7.** Effect of EDER on Mean Output Length and Entropy during RL Training

**Figure 8.** Report: Sinus rhythm with PACs; Left axis deviation; RBBB with left anterior fascicular block; Abnormal ECG. GEM Interpretation Upon analyzing the provided ECG image and computed measurements, several key features and abnormalities are evident. The rhythm is sinus with occasional premature atrial contractions (PACs), as indicated by irregular RR intervals, particularly the shorter intervals such as 238 ms …

**Figure 9.** Report: Atrial fibrillation with rapid ventricular response with PVC(s); Left axis deviation; RBBB with left anterior fascicular block; Lateral ST-T changes may be due to myocardial ischemia; Abnormal ECG. GEM Interpretation Upon analyzing the provided ECG image and computed measurements, several key features and abnormalities are evident. The heart rate is elevated at 120 bpm, indicating sinus tachycardia…

**Figure 10.** Report: Sinus rhythm; Left anterior fascicular block; Left ventricular hypertrophy; Abnormal ECG. GEM Interpretation Upon analyzing the provided ECG image and computed measurements, several key features and abnormalities are evident. The rhythm is sinus, as indicated by the consistent P wave morphology across leads and a regular heart rate of 66 bpm. A notable finding is the left anterior fascicular block…

**Figure 11.** Report: Sinus rhythm; Ventricular bigeminy; Probable left atrial enlargement; Prolonged QT interval. GEM Interpretation Upon analyzing the provided ECG image and computed measurements, several key features and abnormalities are evident. The rhythm is sinus, as indicated by the presence of consistent P waves preceding each QRS complex. However, there are multiple ventricular premature complexes (VPCs) obse…

Original abstract

Electrocardiography (ECG) serves as an indispensable diagnostic tool in clinical practice, yet existing multimodal large language models (MLLMs) remain unreliable for ECG interpretation, often producing plausible but clinically incorrect analyses. To address this, we propose ECG-R1, the first reasoning ECG MLLM designed for reliable ECG interpretation via three innovations. First, we construct the interpretation corpus using Protocol-Guided Instruction Data Generation, grounding interpretation in measurable ECG features and monograph-defined quantitative thresholds and diagnostic logic. Second, we present a modality-decoupled architecture with Interleaved Modality Dropout to improve robustness and cross-modal consistency when either the ECG signal or ECG image is missing. Third, we present Reinforcement Learning with ECG Diagnostic Evidence Rewards to strengthen evidence-grounded ECG interpretation. Additionally, we systematically evaluate the ECG interpretation capabilities of proprietary, open-source, and medical MLLMs, and provide the first quantitative evidence that severe hallucinations are widespread, suggesting that the public should not directly trust these outputs without independent verification. Code is available at https://github.com/PKUDigitalHealth/ECG-R1.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

ECG-R1 grounds MLLM training in monograph protocols and adds RL evidence rewards plus modality dropout, with a useful hallucination benchmark, but the abstract gives no numbers to show the reliability gains are real.

The key takeaway is that ECG-R1 combines protocol-guided data from medical monographs, interleaved modality dropout, and reinforcement learning with evidence rewards to cut down on hallucinations in ECG interpretation. The authors also run a comparison showing that many existing MLLMs hallucinate heavily on ECG tasks.

The new parts are the specific way they generate instruction data by translating monograph rules into examples, the architecture that handles a missing ECG signal or image by dropping modalities during training, and the RL setup that rewards answers based on actual diagnostic evidence rather than just matching references. The hallucination evaluation across proprietary, open-source, and medical models is a solid addition because it quantifies how bad the problem is right now.

This approach makes sense for a high-risk area like ECG reading, where mistakes matter. Grounding in measurable features like QRS duration and using defined thresholds from monographs is better than free-form generation. The modality dropout should help when one input type is unavailable in practice.

The soft spot is that we don't see the actual performance numbers or ablations here. The abstract describes the methods but skips the results, so it's tough to tell whether these changes really move the needle on reliability. The concern about automated feature extraction is real: if the code that pulls out QTc or ST deviation has errors, the generated training data will contain bad examples, and the RL stage might just reinforce those mistakes. I'd want to see an error analysis of the feature extraction step and how it affects the final model.

This is for researchers building medical multimodal models or clinicians interested in AI tools for cardiology. Someone working on reliable AI for diagnostics would find the methods and the baseline evaluation worth looking at. It should go to peer review because the problem is important and the ideas are specific enough to test, even if revisions will be needed for the results section.

Referee Report

2 major / 2 minor

Summary. The paper introduces ECG-R1, an MLLM for ECG interpretation that claims to achieve reliability through three innovations: (1) Protocol-Guided Instruction Data Generation that grounds analyses in measurable ECG features using monograph-defined quantitative thresholds and diagnostic logic, (2) a modality-decoupled architecture incorporating Interleaved Modality Dropout to maintain performance when ECG signals or images are absent, and (3) Reinforcement Learning with ECG Diagnostic Evidence Rewards to reinforce evidence-based reasoning. It additionally reports a systematic evaluation of existing proprietary, open-source, and medical MLLMs, providing quantitative evidence that severe hallucinations are widespread and that outputs should not be trusted without verification. Code is released.

Significance. If the methods are shown to reduce hallucinations while preserving diagnostic accuracy, the work would be significant for clinical AI safety, as it directly targets the gap between plausible-sounding and clinically correct ECG analyses. The modality-agnostic design and public code release are practical strengths that could facilitate follow-on research and deployment in resource-constrained settings.

major comments (2)

[§3.1] §3.1 (Protocol-Guided Instruction Data Generation): The central reliability claim rests on the assumption that automated extraction of ECG features (QRS duration, QTc, ST deviation, etc.) accurately reproduces monograph thresholds. No validation metrics, error rates, or comparison against cardiologist annotations for the feature extractor are reported. Extraction errors exceeding the 10–20 ms precision of typical cutoffs would systematically corrupt the generated instruction data and the downstream RL evidence rewards, directly undermining the 'evidence-grounded' training signal.
[§5] §5 (Evaluation): The manuscript states it supplies the first quantitative evidence of widespread hallucinations across model classes, yet the abstract and available description contain no specific metrics (e.g., hallucination rate, inter-rater agreement, or ablation on the three proposed components). Without these numbers or error analysis, it is impossible to verify whether the proposed innovations measurably improve reliability over baselines.
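
The validation asked for in the first major comment can be sketched as follows. This is a minimal illustration, not the authors' analysis: it compares automatically extracted values of one feature against expert annotations and reports mean absolute error, the spread of the signed errors, and how often both sources fall on the same side of a diagnostic cutoff. The function name, the dictionary keys, and the example cutoff are assumptions made for illustration.

```python
from statistics import mean, pstdev
from typing import Dict, Sequence


def extractor_agreement(auto: Sequence[float], expert: Sequence[float], cutoff: float) -> Dict[str, float]:
    """Summarize how well extracted values match expert values for one feature."""
    if len(auto) != len(expert) or not auto:
        raise ValueError("auto and expert must be non-empty and of equal length")
    # Signed error per ECG, in the feature's own unit (e.g. milliseconds).
    errors = [a - e for a, e in zip(auto, expert)]
    # True where extractor and expert agree on which side of the cutoff the ECG lies.
    same_side = [(a >= cutoff) == (e >= cutoff) for a, e in zip(auto, expert)]
    return {
        "mae": mean([abs(err) for err in errors]),
        "error_sd": pstdev(errors),
        "cutoff_agreement": sum(same_side) / len(same_side),
    }
```

With extracted QRS durations `[118.0, 122.0, 100.0, 130.0]`, expert values `[121.0, 119.0, 102.0, 128.0]`, and a 120 ms cutoff, the mean absolute error is only 2.5 ms, yet the two sources agree on the cutoff in just half of the cases. Small average errors can therefore still flip labels for ECGs near a threshold, which is why the comment asks for error rates and not only averages.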

minor comments (2)

[Abstract] The link text 'here' in the abstract should be replaced with the full repository URL for clarity.
[Abstract] Ensure consistent capitalization of 'ECG-R1' and expansion of 'MLLM' on first use.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. We appreciate the focus on strengthening the reliability claims through better validation and clearer quantitative reporting. We address each major comment below and outline the revisions we will make.

Referee: [§3.1] §3.1 (Protocol-Guided Instruction Data Generation): The central reliability claim rests on the assumption that automated extraction of ECG features (QRS duration, QTc, ST deviation, etc.) accurately reproduces monograph thresholds. No validation metrics, error rates, or comparison against cardiologist annotations for the feature extractor are reported. Extraction errors exceeding the 10–20 ms precision of typical cutoffs would systematically corrupt the generated instruction data and the downstream RL evidence rewards, directly undermining the 'evidence-grounded' training signal.

Authors: We agree that explicit validation of the automated feature extractor is essential to support the protocol-guided data generation and the downstream RL rewards. Our extractor combines established open-source ECG signal processing methods with monograph-defined thresholds. While we did not include a dedicated validation analysis in the original submission, we will add one in the revised §3.1. Specifically, we will report mean absolute errors, standard deviations, and agreement rates for key features (QRS duration, QTc, ST deviation) against annotations from two board-certified cardiologists on a held-out set of 500 ECGs. This will confirm that extraction errors remain within clinically acceptable ranges (well below the 10–20 ms thresholds) and do not systematically bias the instruction data or evidence rewards.

Revision: yes

Referee: [§5] §5 (Evaluation): The manuscript states it supplies the first quantitative evidence of widespread hallucinations across model classes, yet the abstract and available description contain no specific metrics (e.g., hallucination rate, inter-rater agreement, or ablation on the three proposed components). Without these numbers or error analysis, it is impossible to verify whether the proposed innovations measurably improve reliability over baselines.

Authors: We thank the referee for noting the need for more prominent quantitative details. Section 5 of the full manuscript already contains quantitative hallucination rates across proprietary, open-source, and medical MLLMs, along with diagnostic accuracy comparisons and initial component-wise results. To improve accessibility and address the concern directly, we will (1) update the abstract to include key metrics such as hallucination rates and overall reliability improvements, (2) add an explicit error analysis subsection, and (3) expand the ablation studies to quantify the individual contributions of protocol-guided data generation, interleaved modality dropout, and RL with diagnostic evidence rewards, including inter-rater agreement where relevant. These changes will make it straightforward to verify the impact of our innovations.

Revision: partial

Circularity Check

0 steps flagged

No circularity: derivation grounded in external monographs and standard RL

The paper constructs its interpretation corpus via Protocol-Guided Instruction Data Generation that explicitly references monograph-defined quantitative thresholds and diagnostic logic, employs a modality-decoupled architecture with interleaved dropout, and applies reinforcement learning using ECG diagnostic evidence rewards. These steps draw from external medical references and established RL methods without any self-definitional reduction, fitted input renamed as prediction, or load-bearing self-citation chain. The central claims remain independent of quantities defined by the model's own outputs or prior author work.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The approach rests on the assumption that existing medical monographs provide complete and unbiased quantitative rules for ECG diagnosis; no new physical entities are postulated and the main free parameters are standard RL hyperparameters whose values are not detailed in the abstract.

free parameters (1)

RL reward scaling factors
Weights balancing different diagnostic evidence components in the reinforcement learning objective; typical in RL but not quantified here.
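
To make the role of these scaling factors concrete, here is a minimal sketch of a weighted reward, assuming the total reward is a normalized weighted sum of per-component scores in [0, 1]. The component names (`format`, `diagnosis`, `evidence`) and the weights are hypothetical; the page does not report the paper's actual reward terms or values.

```python
from typing import Mapping


def combined_reward(components: Mapping[str, float], weights: Mapping[str, float]) -> float:
    """Weighted average of component rewards; the weights are the free parameters."""
    missing = set(weights) - set(components)
    if missing:
        raise KeyError(f"no score for weighted components: {sorted(missing)}")
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    # Normalizing by the total weight keeps the result in [0, 1] when each score is.
    return sum(weights[name] * components[name] for name in weights) / total_weight
```

For instance, scores `{"format": 1.0, "diagnosis": 0.5, "evidence": 0.0}` with weights `{"format": 1.0, "diagnosis": 2.0, "evidence": 1.0}` give a reward of 0.5. Changing the weights changes which behavior the policy is pushed toward, which is why unreported values count as free parameters.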

axioms (1)

Domain assumption: Monograph-defined ECG features and thresholds constitute sufficient and accurate ground truth for clinical diagnosis.
Invoked when constructing the protocol-guided instruction corpus.

pith-pipeline@v0.9.0 · 5777 in / 1391 out tokens · 50643 ms · 2026-05-21T14:13:11.299338+00:00 · methodology

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Reasoning Before Diagnosis: Physician-Inspired Structured Thinking for ECG Classification
cs.AI · 2026-05 · unverdicted · novelty 5.0

CardioThink applies structured clinical reasoning stages and Structured Set Policy Optimization (SSPO) to ECG classification, yielding higher diagnostic accuracy and more interpretable rationales than direct predictio...

DeepArrhythmia: Segment-Contextualized ECG Arrhythmia Classification via Selective Evidence Acquisition
cs.LG · 2026-05 · unverdicted · novelty 5.0

DeepArrhythmia introduces a segment-contextualized multimodal framework for beat-level ECG arrhythmia classification that uses tool-grounded evidence extraction and selective acquisition routed by segment-level confidence.

Reference graph

Works this paper leans on

23 extracted references · 23 canonical work pages · cited by 2 Pith papers

[1] Differential Exclusion
**The "Differential Exclusion" Rule:** You must use **both** normal and abnormal findings to reject alternative diagnoses (e.g., "Narrow QRS rules out VT")

[2] Follow this checklist: * **Lead I:** Examine QRS amplitude/duration, ST segment, and T wave morphology
**Systematic Lead Examination:** You must perform a granular analysis of **every lead group** to ensure no pathology is missed. Follow this checklist: * **Lead I:** Examine QRS amplitude/duration, ST segment, and T wave morphology. Look for lateral wall issues (LVH, BBB, Lateral Ischemia). * **Lead II:** Analyze P wave amplitude/duration (Atrial Enl...

[3] Systematic Lead Examination
**Ground Truth Adherence:** Your reasoning must align with the `ECG Report` provided in the input, but you must provide the visual evidence (from the lead scan above) that supports it. ## Analysis Workflow & Output Structure Follow this sequential logic. Inside the `<think>` block, you must explicitly document your findings for the specific leads mention...

[4] Simulate a primary read
**Strict Blind Simulation:** Do NOT mention the existence of the report. Simulate a primary read

[5] T wave inversion in V2-V6,
**Exhaustive Lead Mention:** You must explicitly reference findings from specific leads (e.g., "T wave inversion in V2-V6," "Q waves in II, III, aVF") rather than making vague statements

[6] **FORBIDDEN:** Bullet points, hyphens, numbered lists, or line breaks within a step's analysis
**Formatting Enforcement:** Inside the `<think>` block, under each step header, you must write exactly **ONE single paragraph**. **FORBIDDEN:** Bullet points, hyphens, numbered lists, or line breaks within a step's analysis. The output for each step must look like a continuous block of text. ...

[7] You must wrap the final diagnosis in `<answer>` and `</answer>`
**Tag Compliance:** You must start your response with `<think>` and close it with `</think>` before providing the narrative response. You must wrap the final diagnosis in `<answer>` and `</answer>`

[8] **Internal Alignment Only:** Use the `ECG Report` as an answer key

[9] Warning: Data quality may affect computer interpretation
**Data Quality Check:** If the input `ECG Report` contains a quality warning (e.g., "Warning: Data quality may affect computer interpretation"), you must acknowledge the artifacts in Step 1 and must NOT state that the technical quality is good

[10] Key Diagnostic Evidence
**Strict Fact Adherence:** Strictly adhere to the provided facts; do not hallucinate, fabricate, or invent details not present in the source. ## Input Data **ECG Report (Ground Truth):** {{report}} **ECG Machine Measurements:** {{machine_measurements}} Key Diagnostic Evidence Extraction You are an expert Medical Data Annotator and Reinforcement Learnin...

[11] **step_1_technical_rate_rhythm**: Extract evidence from "Step 1" regarding technical quality, baseline, P-wave morphology, and rhythm type

[12] **step_2_conduction_axis_intervals**: Extract evidence from "Step 2" regarding Axis, PR interval, QRS duration, and conduction blocks

[13] **step_3_chamber_hypertrophy_voltage**: Extract evidence from "Step 3" regarding voltage criteria, R-wave progression, and hypertrophy

[14] **step_4_ischemia_infarction_mimics**: Extract evidence from "Step 4" regarding ST deviation, T waves, Q waves, and mimics (pericarditis, early repolarization)

[15] **step_5_electrolytes_qt**: Extract evidence from "Step 5" regarding QT/QTc intervals and electrolyte signs (Hyper/Hypokalemia)

[16] no ST elevation
**step_6_final_medical_reasoning**: Extract the synthesis logic, final diagnostic assertions, or summary statements found in "Step 6" or the final summary. ### Extraction Rules: - You must extract findings exactly as they appear in the original ECG interpretation text. D...

[17] - Scoring +2 per diagnosis: Each correctly identified key diagnosis with supporting ECG features
DiagnosisAccuracy: Evaluates whether the generated diagnosis is correct, specific, and supported by ECG findings. - Scoring +2 per diagnosis: Each correctly identified key diagnosis with supporting ECG features. +1 per diagnosis: Each mostly correct diagnosis but lacking key supporting details. +0 per diagnosis: Each incorrect or vague diagnosis not suppo...

[18] - Scoring +1 per feature: For each correctly addressed key ECG feature (e.g., rhythm, PR interval, QRS duration, ST segment, T wave morphology)
AnalysisCompleteness: Checks if all key ECG components (rhythm, intervals, waveforms, and lead-specific findings) are discussed. - Scoring +1 per feature: For each correctly addressed key ECG feature (e.g., rhythm, PR interval, QRS duration, ST segment, T wave morphology). +0 per missing feature: For each key feature omitted or inaccurately described

[19] - Scoring +2 per feature or per lead: Each point that strongly supports the diagnosis with clear ECG evidence
AnalysisRelevance: Assesses whether each provided explanation directly supports the diagnosis. - Scoring +2 per feature or per lead: Each point that strongly supports the diagnosis with clear ECG evidence. +1 per feature or per lead: Some points are relevant but not fully justified. +0: Includes unrelated or misleading explanations

[20] LeadEvidenceValidity: Evaluates whether the lead-related statements are diagnostically necessary, correctly grounded, and free of unsupported lead-wise claims, rather than maximizing the number of mentioned leads. - Scoring +2 per key lead/region: For each diagnosis-critical lead (or contiguous lead group / territory) correctly referenced with explicit an...

[21] - Scoring (0-100) 100: ECG findings are comprehensively cited, linked to diagnoses, and cover all relevant ECG features
GroundedECGUnderstanding: Determines if the interpretation references actual ECG features (e.g., QRS amplitude, PR interval) instead of generic terms. - Scoring (0-100) 100: ECG findings are comprehensively cited, linked to diagnoses, and cover all relevant ECG features. 80: ECG findings are explicitly cited and linked to diagnoses. 50: Some ECG reference...

[22] - Scoring (0-100) 100: Findings logically progress to diagnosis with thorough and clear justifications covering all necessary steps
EvidenceBasedReasoning: Evaluates whether the diagnosis follows logical, evidence-supported steps. - Scoring (0-100) 100: Findings logically progress to diagnosis with thorough and clear justifications covering all necessary steps. 80: Findings logically progress to dia...

[23] - Scoring (0-100) 100: The analysis follows a structured clinical approach and considers all relevant clinical factors
RealisticDiagnosticProcess: Assesses if the model mimics how a clinician interprets an ECG, considering all relevant factors. - Scoring (0-100) 100: The analysis follows a structured clinical approach and considers all relevant clinical factors. 80: The analysis follows a structured clinical approach. 50: Some clinical reasoning is present but incomplete....
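
The Tag Compliance rule quoted in entry [7] lends itself to a mechanical check. The sketch below is one possible validator, not code from the ECG-R1 repository: it assumes the response must open with a `<think>` block and that the `<answer>` block appears somewhere after `</think>`. The function name `parse_response` and the returned keys are hypothetical.

```python
import re
from typing import Dict, Optional

# The response must begin with <think>...</think>; DOTALL lets "." span line breaks.
_THINK = re.compile(r"\A<think>(.*?)</think>(.*)\Z", re.DOTALL)
# The final diagnosis must be wrapped in <answer>...</answer> after the reasoning.
_ANSWER = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


def parse_response(text: str) -> Optional[Dict[str, str]]:
    """Return the reasoning and final answer, or None if the tags are not compliant."""
    outer = _THINK.match(text.strip())
    if outer is None:
        return None  # missing or misplaced <think> block
    reasoning, remainder = outer.group(1), outer.group(2)
    inner = _ANSWER.search(remainder)
    if inner is None:
        return None  # no wrapped final diagnosis after the reasoning
    return {"reasoning": reasoning.strip(), "answer": inner.group(1).strip()}
```

A check of this kind could act as a gate before any content-based scoring, so that malformed outputs are rejected without being graded on their diagnosis.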
