arxiv: 2604.10420 · v1 · submitted 2026-04-12 · 💻 cs.LG

Recognition: unknown

CARE-ECG: Causal Agent-based Reasoning for Explainable and Counterfactual ECG Interpretation

Elahe Khatibi , Ziyu Wang , Ankita Sharma , Krishnendu Chakrabarty , Sanaz Rahimi Moosavi , Farshad Firouzi , Amir Rahmani

Authors on Pith no claims yet

Pith reviewed 2026-05-10 16:33 UTC · model grok-4.3

classification 💻 cs.LG

keywords ECG interpretationcausal reasoninglarge language modelsexplainable AIcounterfactual analysisagentic pipelinesstructural causal models

0 comments

The pith

CARE-ECG integrates causal graph inference and structural models into LLM pipelines to enable faithful, counterfactual ECG interpretation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a framework that encodes multi-lead ECG signals into temporally organized latent biomarkers, then infers causal graphs to support probabilistic diagnosis and counterfactual queries. This structure grounds language outputs via causal retrieval-augmented generation and a modular agentic pipeline that verifies responses against history and diagnosis. The result is improved diagnostic accuracy on benchmarks such as Expert-ECG-QA and SCP-mapped PTB-XL, along with more traceable explanations and fewer hallucinations. A sympathetic reader would care because clinical ECG decisions require understanding how alternative physiological states would alter outcomes, something standard signal-to-text alignments lack.

Core claim

CARE-ECG unifies representation learning, causal diagnosis via inferred graphs and structural causal models, and explanation in one pipeline; language outputs are grounded through causal retrieval-augmented generation and an agentic verification loop, producing traceable reasoning that exposes key latent drivers, causal evidence paths, and the effects of alternative physiological states.

What carries the argument

Structural causal models combined with causal retrieval-augmented generation and a modular agentic pipeline that integrates history, diagnosis, and response verification.

If this is right

Diagnostic accuracy rises on expert QA tasks and standard ECG benchmarks while explanation faithfulness increases.
Counterfactual assessment becomes possible, allowing clinicians to query how changes in latent biomarkers would alter outcomes.
Reasoning paths are exposed, showing key latent drivers and causal evidence links rather than opaque text generation.
Hallucinations in language-model ECG outputs decrease through grounding in the inferred causal structure.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same causal-agent structure could extend to other waveform-based diagnostics such as EEG or vital-sign time series.
Explicit causal retrieval may reduce the need for post-hoc explanation methods in medical LLMs.
Traceable paths could support regulatory requirements for auditability in clinical AI systems.

Load-bearing premise

The inferred causal graphs and structural causal models accurately reflect the true physiological causal relationships present in ECG signals.

What would settle it

A set of real interventions (such as known drug effects or electrode placements) where the model's predicted counterfactual ECG changes and diagnosis shifts fail to match observed physiological outcomes in controlled recordings.

Figures

Figures reproduced from arXiv: 2604.10420 by Amir Rahmani, Ankita Sharma, Elahe Khatibi, Farshad Firouzi, Krishnendu Chakrabarty, Sanaz Rahimi Moosavi, Ziyu Wang.

**Figure 1.** Figure 1: Motivation. Typical ECG-LLMs (top) often misdiagnose mimics by hallucinating confirmatory features (e.g., reciprocal Figure 1: Motivation. Typical ECG-LLMs may over-diagnose STEMI in mimics by hallucinating confirmatory evidence (e.g., il ST di)CAREECG iihibilhlil bik(ST [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗

**Figure 1.** Figure 1: The CARE-ECG Framework. (1) Physiological Stream (top) performs causal discovery to identify latent drivers Figure 2: The CARE-ECG Framework. The Physiological Stream encodes ECGs into latent biomarkers and performs causal fd() d blfl lhll d l d [PITH_FULL_IMAGE:figures/full_fig_p004_1.png] view at source ↗

**Figure 4.** Figure 4: Faithfulness–robustness trade-off in the HR–SRS plane across datasets. Each panel shows HR (lower is better) versus [PITH_FULL_IMAGE:figures/full_fig_p009_4.png] view at source ↗

**Figure 5.** Figure 5: Incremental ablation curves for CARE-ECG. Each panel visualizes how key metrics evolve as modules are progressively [PITH_FULL_IMAGE:figures/full_fig_p011_5.png] view at source ↗

**Figure 6.** Figure 6: Radar-style summary of multi-metric trade-offs. [PITH_FULL_IMAGE:figures/full_fig_p011_6.png] view at source ↗

read the original abstract

Large language models (LLMs) enable waveform-to-text ECG interpretation and interactive clinical questioning, yet most ECG-LLM systems still rely on weak signal-text alignment and retrieval without explicit physiological or causal structure. This limits grounding, temporal reasoning, and counterfactual "what-if" analysis central to clinical decision-making. We propose CARE-ECG, a causally structured ECG-language reasoning framework that unifies representation learning, diagnosis, and explanation in a single pipeline. CARE-ECG encodes multi-lead ECGs into temporally organized latent biomarkers, performs causal graph inference for probabilistic diagnosis, and supports counterfactual assessment via structural causal models. To improve faithfulness, CARE-ECG grounds language outputs through causal retrieval-augmented generation and a modular agentic pipeline that integrates history, diagnosis, and response with verification. Across multiple ECG benchmarks and expert QA settings, CARE-ECG improves diagnostic accuracy and explanation faithfulness while reducing hallucinations (e.g., 0.84 accuracy on Expert-ECG-QA and 0.76 on SCP-mapped PTB-XL under GPT-4). Overall, CARE-ECG provides traceable reasoning by exposing key latent drivers, causal evidence paths, and how alternative physiological states would change outcomes.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces CARE-ECG, a unified framework for ECG interpretation that encodes multi-lead signals into temporally organized latent biomarkers, infers causal graphs for probabilistic diagnosis, employs structural causal models for counterfactual assessment, and uses a modular agentic LLM pipeline with causal retrieval-augmented generation to ground explanations. It reports improved diagnostic accuracy and explanation faithfulness with reduced hallucinations across ECG benchmarks, including 0.84 accuracy on Expert-ECG-QA and 0.76 on SCP-mapped PTB-XL under GPT-4, attributing these gains to traceable causal reasoning paths.

Significance. If the inferred causal graphs and SCMs prove physiologically valid, the work could meaningfully advance explainable AI for cardiac diagnostics by enabling counterfactual 'what-if' analysis and faithful grounding of LLM outputs, addressing limitations in current signal-text alignment approaches. The unified pipeline and reported gains on expert QA and standard benchmarks indicate potential for improved clinical trust, though this hinges on rigorous validation of the causal components.

major comments (2)

[Abstract and §4] Abstract and §4 (Experiments): The reported accuracy figures (0.84 on Expert-ECG-QA, 0.76 on PTB-XL) are given without any baseline comparisons, ablation studies isolating the causal graph/SCM contributions, statistical tests, or controls for the underlying LLM, which is load-bearing for the central claim that causal structure drives the accuracy and hallucination-reduction improvements.
[§3.2] §3.2 (Causal Graph Inference): The manuscript provides no details on physiological priors, expert validation of edges, identifiability checks, or stability analysis for the data-driven causal graphs; purely observational inference on ECG data risks capturing correlations rather than true cause-effect relations (e.g., QRS to ST changes), directly undermining the validity of the counterfactual assessments and traceable reasoning claims.

minor comments (2)

[§3.3] The description of the modular agentic pipeline would benefit from a clear diagram showing the flow between history, diagnosis, verification, and causal RAG modules.
[§3.1] Notation for latent biomarkers and causal evidence paths is introduced without an accompanying equation or table summarizing the key variables and their relationships.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback on our manuscript. We address each of the major comments below and outline the revisions we will make to strengthen the paper.

read point-by-point responses

Referee: [Abstract and §4] Abstract and §4 (Experiments): The reported accuracy figures (0.84 on Expert-ECG-QA, 0.76 on PTB-XL) are given without any baseline comparisons, ablation studies isolating the causal graph/SCM contributions, statistical tests, or controls for the underlying LLM, which is load-bearing for the central claim that causal structure drives the accuracy and hallucination-reduction improvements.

Authors: We agree that providing baseline comparisons and ablations is essential to substantiate the contribution of the causal components. The current manuscript highlights the overall performance but does not include these controls. In the revised version, we will add comparisons to standard LLM-based ECG interpretation methods without causal graph inference or SCMs, ablation studies removing the causal graph and counterfactual modules, and appropriate statistical tests (e.g., McNemar's test or paired t-tests) to assess significance. We will also clarify the role of the underlying LLM by including controls where the same LLM is used with and without our causal pipeline. This will directly support the claim regarding the benefits of causal structure. revision: yes
Referee: [§3.2] §3.2 (Causal Graph Inference): The manuscript provides no details on physiological priors, expert validation of edges, identifiability checks, or stability analysis for the data-driven causal graphs; purely observational inference on ECG data risks capturing correlations rather than true cause-effect relations (e.g., QRS to ST changes), directly undermining the validity of the counterfactual assessments and traceable reasoning claims.

Authors: This is a valid concern regarding the validity of the inferred causal graphs. The §3.2 describes a data-driven approach to causal graph inference from the latent biomarkers, but indeed omits specifics on priors, validation, or stability. We will revise this section to include any physiological priors incorporated (such as known ECG feature relationships), details on the inference algorithm used (e.g., if PC algorithm or similar with constraints), and add stability analysis via bootstrapping or sensitivity checks. For expert validation, we will note that while not performed in the current work due to resource constraints, we plan to include it or discuss it as future work. We emphasize that the counterfactual assessments rely on the SCM assumptions, and we will add a limitations section clarifying that the graphs are inferred from observational data and may reflect associations. revision: partial

Circularity Check

0 steps flagged

No circularity detected; framework claims are methodologically independent of outputs

full rationale

The paper describes a proposed pipeline (latent biomarker encoding, causal graph inference, SCM-based counterfactuals, and agentic RAG) that is presented as a new construction rather than a derivation from its own results. No equations, fitted parameters renamed as predictions, or self-citation load-bearing steps appear in the abstract or high-level description. The accuracy and faithfulness improvements are reported as empirical outcomes on benchmarks, not as quantities forced by the method's own definitions. The central assumption (that inferred graphs capture physiology) is an external validity concern, not a circular reduction within the paper's logic. Therefore the derivation chain does not collapse to its inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available; no explicit free parameters, axioms, or invented entities are identifiable from the provided text.

pith-pipeline@v0.9.0 · 5536 in / 1254 out tokens · 55459 ms · 2026-05-10T16:33:35.274953+00:00 · methodology

discussion (0)

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

PerCaM-Health: Personalized Dynamic Causal Graphs for Healthcare Reasoning
cs.LG 2026-05 unverdicted novelty 6.0

PerCaM-Health learns evolving personalized dynamic causal graphs from longitudinal health data to enable more reliable patient-level counterfactual queries than cohort or per-patient baselines.

Reference graph

Works this paper leans on

36 extracted references · 10 canonical work pages · cited by 1 Pith paper · 2 internal anchors

[1]

Hamidreza Alikhani, Ziyu Wang, Anil Kanduri, Pasi Lilieberg, Amir M Rah- mani, and Nikil Dutt. 2024. SEAL: Sensing efficient active learning on wearables through context-awareness. In2024 Design, Automation & Test in Europe Confer- ence & Exhibition (DATE). IEEE, 1–2

2024
[2]

Hamidreza Alikhani, Ziyu Wang, Anil Kanduri, Pasi Liljeberg, Amir M Rahmani, and Nikil Dutt. 2024. EAˆ 2: Energy Efficient Adaptive Active Learning for Smart Wearables. InProceedings of the 29th ACM/IEEE international symposium on low power electronics and design. 1–6

2024
[3]

Seyed Amir Hossein Aqajari, Ziyu Wang, Ali Tazarv, Sina Labbaf, Salar Jafarlou, Brenda Nguyen, Nikil Dutt, Marco Levorato, and Amir M Rahmani. 2024. Enhanc- ing performance and user engagement in everyday stress monitoring: A context- aware active reinforcement learning approach.arXiv preprint arXiv:2407.08215 (2024)

work page arXiv 2024
[4]

1976.Causal modeling

Herbert B Asher. 1976.Causal modeling. Vol. 3. Sage

1976
[5]

Mingsheng Cai, Jiuming Jiang, Wenhao Huang, Che Liu, and Rossella Arcucci. [n. d.]. Towards Generalizable Multimodal ECG Representation Learning with LLM-extracted Clinical Entities. In1st ICML Workshop on Foundation Models for Structured Data
[6]

Tri Dao and Albert Gu. 2024. Transformers are ssms: Generalized models and efficient algorithms through structured state space duality.arXiv preprint arXiv:2405.21060(2024)

work page internal anchor Pith review arXiv 2024
[7]

Zahra Ebrahimi, Mohammad Loni, Masoud Daneshtalab, and Arash Gharehbaghi
[8]

Expert Systems with Applications: X7 (2020), 100033

A review on deep learning methods for ECG arrhythmia classification. Expert Systems with Applications: X7 (2020), 100033

2020
[9]

Brian Gow, Tom Pollard, Larry A Nathanson, Alistair Johnson, Benjamin Moody, Chrystinne Fernandes, Nathaniel Greenbaum, Jonathan W Waks, Parastou Es- lami, Tanner Carbonati, et al. 2023. MIMIC-IV-ECG: Diagnostic Electrocardio- gram Matched Subset.Type: dataset6 (2023), 13–14

2023
[10]

Albert Gu, Karan Goel, and Christopher Ré. 2021. Efficiently modeling long sequences with structured state spaces.arXiv preprint arXiv:2111.00396(2021)

work page internal anchor Pith review arXiv 2021
[11]

Awni Y Hannun, Pranav Rajpurkar, Masoumeh Haghpanahi, Geoffrey H Tison, Codie Bourn, Mintu P Turakhia, and Andrew Y Ng. 2019. Cardiologist-level arrhythmia detection and classification in ambulatory electrocardiograms using a deep neural network.Nature medicine25, 1 (2019), 65–69

2019
[12]

Elahe Khatibi, Ziyu Wang, and Amir M Rahmani. 2025. CDF-RAG: Causal Dynamic Feedback for Adaptive Retrieval-Augmented Generation.arXiv preprint arXiv:2504.12560(2025)

work page arXiv 2025
[13]

Yubin Kim, Hyewon Jeong, Shan Chen, Shuyue Stella Li, Mingyu Lu, Kumail Alhamoud, Jimin Mun, Cristina Grau, Minseok Jung, Rodrigo Gameiro, et al. 2025. Medical hallucinations in foundation models and their impact on healthcare. arXiv preprint arXiv:2503.05777(2025)

work page arXiv 2025
[14]

Kaden McKeen, Sameer Masood, Augustin Toma, Barry Rubin, and Bo Wang
[15]

Ecg-fm: An open electrocardiogram foundation model.JAMIA open8, 5 (2025), ooaf122

2025
[16]

Fatma Murat, Ozal Yildirim, Muhammed Talo, Ulas Baran Baloglu, Yakup Demir, and U Rajendra Acharya. 2020. Application of deep learning techniques for heartbeats detection using ECG signals-analysis and review.Computers in biology and medicine120 (2020), 103726

2020
[17]

Jasmine Chiat Ling Ong, Shelley Yin-Hsi Chang, Wasswa William, Atul J Butte, Nigam H Shah, Lita Sui Tjien Chew, Nan Liu, Finale Doshi-Velez, Wei Lu, Julian Savulescu, et al . 2024. Medical ethics of large language models in medicine. NEJM AI1, 7 (2024), AIra2400038

2024
[18]

Levina Perzhilla, Soumia Siyoucef, Rose Al-Aslani, Muhammad Mahboob Ur Rahman, and Tareq Y Al-Naffouri. 2025. In-situ dehydration monitoring via a stable diffusion-aided single-lead ecg iomt: Ml/dl models shine while llms hallucinate.IEEE Internet of Things Journal(2025)

2025
[19]

Antônio H Ribeiro, Manoel Horta Ribeiro, Gabriela MM Paixão, Derick M Oliveira, Paulo R Gomes, Jéssica A Canazart, Milton PS Ferreira, Carl R Andersson, Peter W Macfarlane, Wagner Meira Jr, et al. 2020. Automatic diagnosis of the 12-lead ECG using a deep neural network.Nature communications11, 1 (2020), 1760

2020
[20]

Sina Shool, Sara Adimi, Reza Saboori Amleshi, Ehsan Bitaraf, Reza Golpira, and Mahmood Tara. 2025. A systematic review of large language model (LLM) evaluations in clinical medicine.BMC Medical Informatics and Decision Making 25, 1 (2025), 117

2025
[21]

Karan Singhal, Tao Tu, Juraj Gottweis, Rory Sayres, Ellery Wulczyn, Mohamed Amin, Le Hou, Kevin Clark, Stephen R Pfohl, Heather Cole-Lewis, et al . 2025. Toward expert-level medical question answering with large language models. Nature Medicine(2025), 1–8

2025
[22]

Karan Singhal, Tao Tu, Jurgen Gottweis, Rory Sayres, Ellery Wulczyn, Le Hou, Kristy Clark, Stephen Pfohl, Heather Cole-Lewis, David Neal, et al. 2024. Toward expert-level medical question answering with large language models.Nature Medicine(2024). https://doi.org/10.1038/s41591-024-03423-7

work page doi:10.1038/s41591-024-03423-7 2024
[23]

Patrick Wagner, Nils Strodthoff, Ralf-Dieter Bousseljot, Dieter Kreiseler, Fatima I Lunze, Wojciech Samek, and Tobias Schaeffter. 2020. PTB-XL, a large publicly available electrocardiography dataset.Scientific data7, 1 (2020), 1–15

2020
[24]

Xu Wang, Jiaju Kang, Puyu Han, Yubao Zhao, Qian Liu, Liwenfei He, Lingqiong Zhang, Lingyun Dai, Yongcheng Wang, and Jie Tao. 2025. ECG-Expert-QA: A Benchmark for Evaluating Medical Large Language Models in Heart Disease Diagnosis.arXiv preprint arXiv:2502.17475(2025)

work page arXiv 2025
[25]

Ziyu Wang, Anil Kanduri, Seyed Amir Hossein Aqajari, Salar Jafarlou, Sanaz R Mousavi, Pasi Liljeberg, Shaista Malik, and Amir M Rahmani. 2024. Ecg unveiled: Analysis of client re-identification risks in real-world ecg datasets. In2024 IEEE 20th International Conference on Body Sensor Networks (BSN). IEEE, 1–4

2024
[26]

Ziyu Wang, Elahe Khatibi, Farshad Firouzi, Sanaz Rahimi Mousavi, Krishnendu Chakrabarty, and Amir M Rahmani. 2025. Linkage Attacks Expose Identity Risks in Public ECG Data Sharing.arXiv preprint arXiv:2508.15850(2025)

work page arXiv 2025
[27]

Ziyu Wang, Elahe Khatibi, and Amir M Rahmani. 2025. MedCoT-RAG: Causal Chain-of-Thought RAG for Medical Question Answering.arXiv preprint arXiv:2508.15849(2025)

work page arXiv 2025
[28]

Ziyu Wang, Hao Li, Di Huang, Hye-Sung Kim, Chae-Won Shin, and Amir M Rahmani. 2025. Healthq: Unveiling questioning capabilities of llm chains in healthcare conversations.Smart Health(2025), 100570

2025
[29]

Zhuhao Wang, Yihua Sun, Zihan Li, Xuan Yang, Fang Chen, and Hongen Liao
[30]

InProceedings of the AAAI Conference on Artificial Intelligence, Vol

Llm-rg4: Flexible and factual radiology report generation across diverse input contexts. InProceedings of the AAAI Conference on Artificial Intelligence, Vol. 39. 8250–8258
[31]

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. 2022. Chain-of-thought prompting elicits reason- ing in large language models.Advances in neural information processing systems 35 (2022), 24824–24837

2022
[33]

arXiv preprint arXiv:2004.08697(2020)

Causalvae: Structured causal disentanglement in variational autoencoder. arXiv preprint arXiv:2004.08697(2020)

work page arXiv 2004
[34]

Mengyue Yang, Furui Liu, Zhitang Chen, Xinwei Shen, Jianye Hao, and Jun Wang
[35]

InProceedings of the IEEE/CVF conference on computer vision and pattern recognition

Causalvae: Disentangled representation learning via neural structural causal models. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition. 9593–9602
[36]

Han Yu, Peikun Guo, and Akane Sano. 2023. Zero-shot ECG diagnosis with large language models and retrieval-augmented generation. InMachine learning for health (ML4H). PMLR, 650–663

2023
[37]

Jaehak Yu, Sejin Park, Soon-Hyun Kwon, Kang-Hee Cho, and Hansung Lee. 2022. AI-based stroke disease prediction system using ECG and PPG bio-signals.Ieee Access10 (2022), 43623–43638. 12

2022