arxiv: 2511.23036 · v2 · submitted 2025-11-28 · 💻 cs.LG · cs.AI

Recognition: 2 theorem links

· Lean Theorem

Delta-XAI: A Unified Framework for Explaining Prediction Changes in Online Time Series Monitoring

Changhun Kim , Yechan Mun , Hyeongwon Jang , Eunseo Lee , Sangchul Hahn , Eunho Yang

Authors on Pith no claims yet

Pith reviewed 2026-05-17 04:05 UTC · model grok-4.3

classification 💻 cs.LG cs.AI

keywords explainable AItime seriesonline monitoringintegrated gradientstemporal dependenciesprediction changesfaithfulness evaluationstreaming data

0 comments

The pith

Delta-XAI wraps existing explanation methods and introduces shifted-window gradients to track why predictions change across time steps in online monitoring.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to make explanations for time-series models useful in streaming settings where each new observation can alter the forecast and where past context matters. Current methods explain one step at a time and ignore how information accumulates, which leaves prediction shifts hard to interpret and hard to evaluate. Delta-XAI therefore supplies a simple wrapper that turns fourteen standard techniques into online versions and adds a test suite that checks whether the explanations stay faithful to the model, sufficient to reproduce the change, and coherent over time. Experiments show that older gradient methods, once given this temporal wrapper, often beat newer approaches; the authors then build SWING on top of Integrated Gradients so that the path of integration deliberately includes earlier observations.

Core claim

Delta-XAI adapts fourteen existing XAI methods through a wrapper function and supplies a principled evaluation suite for the online setting that measures faithfulness, sufficiency, and coherence. When these wrapped methods are tested, classical gradient-based approaches such as Integrated Gradients outperform more recent techniques. The authors therefore introduce Shifted Window Integrated Gradients (SWING), which incorporates past observations into the integration path so that temporal dependencies are captured and out-of-distribution effects are reduced. Extensive experiments across diverse settings and metrics confirm that SWING improves explanation quality for prediction changes.

What carries the argument

Shifted Window Integrated Gradients (SWING), an adaptation of Integrated Gradients that extends the integration path to include past observations and thereby accounts for temporal context in online explanations.

If this is right

Classical gradient methods become competitive or superior once they receive temporal context through the wrapper.
SWING reduces out-of-distribution effects by keeping the explanation path inside the observed history.
The evaluation suite allows direct comparison of how well different methods recover the actual change in prediction.
Effectiveness of the adapted methods holds across multiple datasets, model architectures, and quantitative metrics.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same wrapper pattern could be applied to other sequential domains such as video or sensor streams to test whether the temporal gain generalizes.
Coherence scores may expose cases in which explanations drift even when the underlying process changes gradually.
In high-stakes monitoring, the ability to attribute a shift to specific past steps could guide targeted data collection or model retraining.

Load-bearing premise

That wrapping existing XAI methods together with the new faithfulness-sufficiency-coherence suite correctly reflects genuine temporal dependencies without creating artificial patterns of its own.

What would settle it

On a new streaming dataset where known external events drive clear prediction shifts, check whether SWING attributions recover those events more accurately than standard Integrated Gradients or the other wrapped baselines.

Figures

Figures reproduced from arXiv: 2511.23036 by Changhun Kim, Eunho Yang, Eunseo Lee, Hyeongwon Jang, Sangchul Hahn, Yechan Mun.

**Figure 1.** Figure 1: Motivation for explaining prediction changes through illustrative scenarios that are not generated by actual XAI outputs. (top) Vital signs across T1, T2, T3: risk rises from 10% to 90% then partially recovers (70%). Conventional attribution at T3 misleads, while our method highlights features driving recovery. (bottom) Risk evolves from 10% at T1 to 80% at T2 and slightly increases (85%) at T3 due to dela… view at source ↗

**Figure 2.** Figure 2: Overview of the proposed SWING framework for explaining prediction changes in online patient monitoring. SWING extends conventional Integrated Gradients (IG) by replacing zero-baseline straight paths with line integrals over shifted historical windows and piecewise-linear paths, capturing temporal dynamics and mitigating out-of-distribution effects. 4.2 SWING: SHIFTED WINDOW INTEGRATED GRADIENTS Standard I… view at source ↗

**Figure 3.** Figure 3: Computational efficiency analysis comparing SWING with baselines on the MIMIC-III benchmark. (a) Elapsed real time per sample (sec/sample, log-scale) versus AUPD (K = 50). (b) GPU peak memory consumption per sample (MB/sample) versus AUPD (K = 50) [PITH_FULL_IMAGE:figures/full_fig_p025_3.png] view at source ↗

**Figure 4.** Figure 4: Hyperparameter sensitivity of SWING with respect to nsamples, compared to IG (dotted gray line). 25 [PITH_FULL_IMAGE:figures/full_fig_p025_4.png] view at source ↗

**Figure 5.** Figure 5: Qualitative case study showing attributions extracted with XAI methods on MIMICIII (Johnson et al., 2016) using LSTM (Hochreiter & Schmidhuber, 1997) backbone with T1 = 47 and T2 = 48, i.e., T2 − T1 = 1. The uppermost-left heatmap displays the normalized input features, while the remaining fifteen panels illustrate the attribution heatmaps generated by each XAI method under the Delta-XAI framework, reflec… view at source ↗

**Figure 6.** Figure 6: Qualitative case study showing attributions extracted with XAI methods on MIMICIII (Johnson et al., 2016) using LSTM (Hochreiter & Schmidhuber, 1997) backbone with T1 = 47 and T2 = 48, i.e., T2 − T1 = 1. The uppermost-left heatmap displays the normalized input features, while the remaining fifteen panels illustrate the attribution heatmaps generated by each XAI method under the Delta-XAI framework, reflec… view at source ↗

**Figure 7.** Figure 7: Qualitative case study showing attributions extracted with XAI methods on MIMICIII (Johnson et al., 2016) using LSTM (Hochreiter & Schmidhuber, 1997) backbone with T1 = 47 and T2 = 48, i.e., T2 − T1 = 1. The uppermost-left heatmap displays the normalized input features, while the remaining fifteen panels illustrate the attribution heatmaps generated by each XAI method under the Delta-XAI framework, reflec… view at source ↗

**Figure 8.** Figure 8: Qualitative case study showing attributions extracted with XAI methods on MIMICIII (Johnson et al., 2016) using LSTM (Hochreiter & Schmidhuber, 1997) backbone with T1 = 47 and T2 = 48, i.e., T2 − T1 = 1. The uppermost-left heatmap displays the normalized input features, while the remaining fifteen panels illustrate the attribution heatmaps generated by each XAI method under the Delta-XAI framework, reflec… view at source ↗

**Figure 9.** Figure 9: Qualitative case study showing attributions extracted with XAI methods on MIMICIII (Johnson et al., 2016) using LSTM (Hochreiter & Schmidhuber, 1997) backbone with T1 = 47 and T2 = 48, i.e., T2 − T1 = 1. The uppermost-left heatmap displays the normalized input features, while the remaining fifteen panels illustrate the attribution heatmaps generated by each XAI method under the Delta-XAI framework, reflec… view at source ↗

**Figure 10.** Figure 10: These visualization illustrates case studies for coherence analysis of decompensation risk. Each subfigure (a-d) shows two heatmaps for a MIMIC-III (Johnson et al., 2016) sample processed with a LSTM (Hochreiter & Schmidhuber, 1997) backbone: the left heatmap visualizes normalized input features, and the right heatmap displays SWING’s feature attributions. These panels reveal the clinically relevant featu… view at source ↗

read the original abstract

Explaining online time series monitoring models is crucial across sensitive domains such as healthcare and finance, where temporal and contextual prediction dynamics underpin critical decisions. While recent XAI methods have improved the explainability of time series models, they mostly analyze each time step independently, overlooking temporal dependencies. This results in further challenges: explaining prediction changes is non-trivial, methods fail to leverage online dynamics, and evaluation remains difficult. To address these challenges, we propose Delta-XAI, which adapts 14 existing XAI methods through a wrapper function and introduces a principled evaluation suite for the online setting, assessing diverse aspects, such as faithfulness, sufficiency, and coherence. Experiments reveal that classical gradient-based methods, such as Integrated Gradients (IG), can outperform recent approaches when adapted for temporal analysis. Building on this, we propose Shifted Window Integrated Gradients (SWING), which incorporates past observations in the integration path to systematically capture temporal dependencies and mitigate out-of-distribution effects. Extensive experiments consistently demonstrate the effectiveness of SWING across diverse settings with respect to diverse metrics. Our code is publicly available at https://github.com/AITRICS/Delta-XAI.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Delta-XAI is a wrapper around 14 existing XAI methods for online time series plus a shifted-window IG variant, but the effectiveness claims rest on an abstract with no visible results or implementation details.

read the letter

The main point here is a practical wrapper that lets you run standard XAI methods on prediction shifts in streaming time series, plus SWING, which tweaks Integrated Gradients to pull in past observations during the integration path. The paper flags that most time series explanations treat each step in isolation and miss how attributions should evolve with the online context. Adapting the 14 methods through one interface and adding an evaluation focused on faithfulness, sufficiency, and coherence for this setting is a reasonable move for applied work in monitoring. They also note that classical gradient methods can look stronger once adapted, which is worth testing. SWING's idea of using history to reduce OOD effects in the path is a direct response to a known issue with plain IG on drifting inputs, and the public code lowers the cost of trying it. The soft spot is that we only have the abstract. The claim of consistent effectiveness across settings and metrics cannot be checked for whether the shifted window actually preserves the base method's properties or whether the metrics reward real temporal coherence rather than artifacts from the wrapper. No derivation, pseudocode, or table is shown, so it is unclear if simpler baselines would match the gains. This is aimed at practitioners who need explanations for changing forecasts in healthcare or finance monitors rather than theorists. A reader already working on temporal XAI tooling would find the evaluation suite and the classical-versus-recent comparison useful to extend. I would send it for peer review. The problem is concrete, the wrapper is implementable, and the code makes verification feasible even if the current evidence is thin.

Referee Report

2 major / 2 minor

Summary. The manuscript presents Delta-XAI, a unified framework for explaining prediction changes in online time series monitoring. It adapts 14 existing XAI methods using a wrapper function to account for temporal dependencies, which prior methods overlook by analyzing time steps independently. A new evaluation suite is introduced to assess faithfulness, sufficiency, and coherence in the online setting. The paper finds that adapted classical methods like Integrated Gradients can outperform recent approaches. It proposes Shifted Window Integrated Gradients (SWING) that incorporates past observations in the integration path to capture temporal dependencies and mitigate out-of-distribution effects. Extensive experiments are reported to demonstrate SWING's effectiveness across diverse settings and metrics, with public code available.

Significance. If the results hold, the work is significant for providing a practical solution to explainability challenges in online time series models used in sensitive applications like healthcare and finance. The wrapper-based adaptation offers a way to leverage existing XAI techniques for temporal analysis, while the evaluation suite addresses the difficulty of assessing explanations in dynamic online contexts. SWING's use of shifted windows to include historical data represents a targeted improvement over standard methods to handle online dynamics and OOD issues. Public code availability is a notable strength for reproducibility and community use.

major comments (2)

[Abstract] The central claim that 'extensive experiments consistently demonstrate the effectiveness of SWING across diverse settings with respect to diverse metrics' is load-bearing, yet the abstract provides no datasets, quantitative results, tables, error bars, or ablation details to verify whether the faithfulness/sufficiency/coherence suite captures temporal dependencies without introducing artifacts (as noted in the weakest assumption).
[Abstract] The description of SWING states that it 'incorporates past observations in the integration path to systematically capture temporal dependencies and mitigate out-of-distribution effects,' but provides no equation, pseudocode, or derivation for the shifted-window construction, preventing assessment of whether it preserves original IG properties or differs substantively from simpler baselines.

minor comments (2)

The abstract refers to adapting '14 existing XAI methods' without enumerating them; listing these would clarify the scope of the unified wrapper framework.
The public code link is provided, but the abstract-only format makes it impossible to cross-check reproducibility of the claimed experimental outcomes.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and positive assessment of the work's significance for online time series explainability. We address the two major comments on the abstract below and will prepare a revised version of the manuscript.

read point-by-point responses

Referee: [Abstract] The central claim that 'extensive experiments consistently demonstrate the effectiveness of SWING across diverse settings with respect to diverse metrics' is load-bearing, yet the abstract provides no datasets, quantitative results, tables, error bars, or ablation details to verify whether the faithfulness/sufficiency/coherence suite captures temporal dependencies without introducing artifacts (as noted in the weakest assumption).

Authors: We agree that the abstract would benefit from more concrete support for its central claim to allow readers to better assess the evaluation suite. In the revised abstract we will add the names of the primary datasets, a concise summary of key quantitative results (including error bars where applicable), and a brief reference to ablation studies. We will also include a short clause clarifying how the faithfulness/sufficiency/coherence metrics are designed to respect temporal structure and avoid obvious artifacts. These additions will remain within abstract length limits while strengthening the claim. revision: yes
Referee: [Abstract] The description of SWING states that it 'incorporates past observations in the integration path to systematically capture temporal dependencies and mitigate out-of-distribution effects,' but provides no equation, pseudocode, or derivation for the shifted-window construction, preventing assessment of whether it preserves original IG properties or differs substantively from simpler baselines.

Authors: We acknowledge that the current abstract description of SWING is high-level. We will revise the abstract to include a compact, high-level characterization of the shifted-window construction (e.g., a one-sentence outline of how the integration path is extended over past observations). The full equation, pseudocode, derivation demonstrating preservation of IG axioms, and explicit comparison to simpler baselines appear in the methods section of the full manuscript; we will ensure the abstract points readers to that section for technical detail. revision: yes

Circularity Check

0 steps flagged

No circularity: claims rest on external experiments and wrapper adaptation without self-referential reductions

full rationale

The abstract presents Delta-XAI as a wrapper adapting 14 existing XAI methods plus a new evaluation suite (faithfulness, sufficiency, coherence), notes that gradient-based methods outperform others in experiments, and introduces SWING to incorporate past observations. No equations, derivations, fitted parameters renamed as predictions, or self-citations appear in the provided text. Effectiveness claims are tied to experimental comparisons and public code rather than any internal definition or input that forces the output by construction. This is the common case of a self-contained proposal whose central assertions can be checked externally.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

The work rests on the assumption that standard XAI techniques remain valid when wrapped for temporal sequences and that the new evaluation metrics reflect real-world explanation quality.

axioms (1)

domain assumption Existing gradient-based and perturbation XAI methods can be adapted via a wrapper without losing core properties in online time series
Invoked when proposing Delta-XAI as a unified framework

invented entities (1)

Shifted Window Integrated Gradients (SWING) no independent evidence
purpose: Incorporate past observations into the integration path to capture temporal dependencies
New method introduced to mitigate out-of-distribution effects

pith-pipeline@v0.9.0 · 5487 in / 1249 out tokens · 26143 ms · 2026-05-17T04:05:22.567084+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

SWING … incorporates past observations in the integration path … Online Completeness, Implementation Invariance, and Skew-Symmetry
IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

Theorem 2 (Online Completeness) … sum of SWING attributions equals the prediction difference

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

31 extracted references · 31 canonical work pages · 4 internal anchors

[1]

Language models are few-shot learners.Advances in neural information processing systems, 33:1877–1901,

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners.Advances in neural information processing systems, 33:1877–1901,

work page 1901
[2]

Opportunities and challenges in explainable artificial intelligence (xai): A survey.arXiv preprint arXiv:2006.11371,

Arun Das and Paul Rad. Opportunities and challenges in explainable artificial intelligence (xai): A survey.arXiv preprint arXiv:2006.11371,

work page arXiv 2006
[3]

An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale.arXiv preprint arXiv:2010.11929,

work page internal anchor Pith review Pith/arXiv arXiv 2010
[4]

Deep Learning for Time-Series Analysis

John Cristian Borges Gamboa. Deep learning for time-series analysis.arXiv preprint arXiv:1701.01887,

work page internal anchor Pith review Pith/arXiv arXiv
[5]

Captum: A unified and generic model interpretability library for pytorch.arXiv preprint arXiv:2009.07896,

Narine Kokhlikyan, Vivek Miglani, Miguel Martin, Edward Wang, Bilal Alsallakh, Jonathan Reynolds, Alexander Melnikov, Natalia Kliushkina, Carlos Araya, Siqi Yan, et al. Captum: A unified and generic model interpretability library for pytorch.arXiv preprint arXiv:2009.07896,

work page arXiv 2009
[6]

Temporal dependencies in feature importance for time series predictions.arXiv preprint arXiv:2107.14317,

Kin Kwan Leung, Clayton Rooke, Jonathan Smith, Saba Zuberi, and Maksims V olkovs. Temporal dependencies in feature importance for time series predictions.arXiv preprint arXiv:2107.14317,

work page arXiv
[7]

Timex++: Learning time-series explanations with information bottleneck.arXiv preprint arXiv:2405.09308, 2024a

Zichuan Liu, Tianchun Wang, Jimeng Shi, Xu Zheng, Zhuomin Chen, Lei Song, Wenqian Dong, Jayantha Obeysekera, Farhad Shirani, and Dongsheng Luo. Timex++: Learning time-series explanations with information bottleneck.arXiv preprint arXiv:2405.09308, 2024a. Zichuan Liu, Yingying Zhang, Tianchun Wang, Zefan Wang, Dongsheng Luo, Mengnan Du, Min Wu, Yi Wang, Ch...

work page arXiv
[8]

Introducing a new benchmarked dataset for activity monitoring

Attila Reiss and Didier Stricker. Introducing a new benchmarked dataset for activity monitoring. In 2012 16th international symposium on wearable computers, pp. 108–109. IEEE,

work page 2012
[9]

why should i trust you?

Matthew A Reyna, Christopher S Josef, Russell Jeter, Supreeth P Shashikumar, M Brandon Westover, Shamim Nemati, Gari D Clifford, and Ashish Sharma. Early prediction of sepsis from clinical data: the physionet/computing in cardiology challenge 2019.Critical care medicine, 48(2):210–217, 2020a. Matthew A Reyna, Christopher S Josef, Russell Jeter, Supreeth P...

work page 2019
[10]

Financial time series forecasting with deep learning: A systematic literature review: 2005–2019.Applied soft computing, 90:106181,

Omer Berat Sezer, Mehmet Ugur Gudelek, and Ahmet Murat Ozbayoglu. Financial time series forecasting with deep learning: A systematic literature review: 2005–2019.Applied soft computing, 90:106181,

work page 2005
[11]

Clinical Intervention Prediction and Understanding using Deep Networks

Harini Suresh, Nathan Hunt, Alistair Johnson, Leo Anthony Celi, Peter Szolovits, and Marzyeh Ghassemi. Clinical intervention prediction and understanding using deep networks.arXiv preprint arXiv:1705.08498,

work page internal anchor Pith review Pith/arXiv arXiv
[12]

Yue Wang and Sai Ho Chung

DOI: https://doi.org/10.24432/C57G8X. Yue Wang and Sai Ho Chung. Artificial intelligence in safety-critical systems: a systematic review. Industrial Management & Data Systems, 122(2):442–470,

work page doi:10.24432/c57g8x
[13]

Deep Time Series Models: A Comprehensive Survey and Benchmark

Yuxuan Wang, Haixu Wu, Jiaxiang Dong, Yong Liu, Mingsheng Long, and Jianmin Wang. Deep time series models: A comprehensive survey and benchmark.arXiv preprint arXiv:2407.13278,

work page internal anchor Pith review Pith/arXiv arXiv
[14]

Interpretation of time-series deep models: A survey.arXiv preprint arXiv:2305.14582,

Ziqi Zhao, Yucheng Shi, Shushan Wu, Fan Yang, Wenzhan Song, and Ninghao Liu. Interpretation of time-series deep models: A survey.arXiv preprint arXiv:2305.14582,

work page arXiv
[15]

13 A LIMITATIONS ANDBROADERIMPACTS Limitations.Our proposed approach has few limitations. First, while SWING extends IG by incorporating historical points with shifted window paths, it introduces additional computational overhead compared to simpler methods like standard IG. Second, while our prediction wrapper function can seamlessly incorporate most exi...

work page 2016
[16]

and SHAP (Shapley, 1953; Lundberg,

work page 1953
[17]

Perturbation-based methods like Feature Occlusion (FO) (Suresh et al.,

and DeepLIFT (Shrikumar et al., 2017), compute attributions using model gradients. Perturbation-based methods like Feature Occlusion (FO) (Suresh et al.,

work page 2017
[18]

While these methods have enhanced model explainability, most have been evaluated in vision tasks (Das & Rad, 2020)

measure feature importance by replacing inputs and observing prediction changes. While these methods have enhanced model explainability, most have been evaluated in vision tasks (Das & Rad, 2020). Their application to time series—particularly for explaining prediction changes in online settings—remains limited, despite the importance of capturing temporal...

work page 2020
[19]

are particularly relevant to online prediction tasks. FIT estimates feature importance by comparing predictive distributions under observed and counterfactual inputs using KL divergence, whereas WinIT models delayed effects by assessing how past observations influence future predictions. However, our proposed framework significantly extends beyond these m...

work page 2016
[20]

For example, if T1 corresponds to a window XT1−W+1:T 1 and T2 to XT2−W+1:T 2, then the baseline for g dif- fers depending on which window is active, leading to inconsistency

directly to the wrapper g violates complete- ness, since the baseline input for g depends on whether the evaluation is at T1 or T2. For example, if T1 corresponds to a window XT1−W+1:T 1 and T2 to XT2−W+1:T 2, then the baseline for g dif- fers depending on which window is active, leading to inconsistency. To address this, we compute GradSHAP directly on f...

work page 2020
[21]

=− X t,d φSWING(f,X t,d |T 1 →T 2). 18 Table 4:We evaluate our method on five datasets: three real-world benchmarks—MIMIC-III (Johnson et al., 2016), PhysioNet 2019 (Reyna et al., 2020a), and Activity (Reiss & Stricker, 2012)—and two widely used synthetic datasets—Delayed Spike (Leung et al.,

work page 2016
[22]

and Switch-Feature (Tonekaboni et al., 2020; Liu et al., 2024b). Type Name Task # IDs # Samples / ID Window Size Feature Classes Real-world MIMIC-III Decompensation prediction6,221 5 48 32 2 PhysioNet 2019Sepsis prediction 8,066 5 48 40 2 Activity Human action recognition 5 200 50 12 7 Synthetic Delayed SpikeBinary classification 1,000 5 40 3 2 Switch-Fea...

work page 2020
[23]

The dataset comprises 25 sequences from five individuals, each having around 6,600 time points

and follow the preprocessing protocol from Latent ODEs (Rubanova et al., 2019). The dataset comprises 25 sequences from five individuals, each having around 6,600 time points. We segment each sequence into overlapping windows of 50 time points using a stride of 1 (unlike the stride 25 used in the original Latent ODEs paper). Labels are provided at each ti...

work page 2019
[24]

Switch-Feature.We generate the Switch-Feature dataset following the design in FIT (Tonekaboni et al., 2020)

This modification forces explanation methods to correctly identify the causal spike event rather than simply aligning with the delayed label change. Switch-Feature.We generate the Switch-Feature dataset following the design in FIT (Tonekaboni et al., 2020). Similar to the State dataset, it is constructed based on a three-state hidden Markov model with an ...

work page 2020
[25]

Efficiency.Figure 3 demonstrates that SWING achieves state-of-the-art explanatory quality with- out significant computational overhead

additionally show that SWING preserves its advantages regardless of prediction or label changes and remains stable across both 24- and 72-length windows. Efficiency.Figure 3 demonstrates that SWING achieves state-of-the-art explanatory quality with- out significant computational overhead. Runtime (0.35s/sample) and memory (448 MB) remain comparable to gra...

work page 2019
[26]

2.68±0.001.54±0.003.00±0.011.93±0.003.19±0.011.43±0.003.43±0.011.61±0.000.11±0.00 DeepLIFT (Shrikumar et al., 2017)2.72±0.001.60±0.002.87±0.001.86±0.003.06±0.011.35±0.003.38±0.011.56±0.000.16±0.00 FO (Suresh et al.,

work page 2017
[27]

1.86±0.001.10±0.002.85±0.011.66±0.003.88±0.002.02±0.003.12±0.011.64±0.000.06±0.00 Dynamask (Crabbé & Van Der Schaar, 2021)1.69±0.001.11±0.002.05±0.001.36±0.004.98±0.012.77±0.004.74±0.012.59±0.000.06±0.00 Extrmask (Enguehard,

work page 2021
[28]

2.73±0.001.56±0.003.02±0.001.94±0.003.12±0.001.41±0.003.38±0.001.60±0.000.13±0.00 SWING 3.10±0.011.96±0.002.81±0.011.78±0.012.27±0.010.93±0.002.38±0.001.01±0.000.32±0.02 Table 7:Performance of XAI methods on (top) Activity (Vidulin et al., 2010), (middle) Delayed Spike (Leung et al., 2021), and (bottom) Switch-Feature (Tonekaboni et al.,

work page 2010
[29]

Evaluation is performed by removing the most or least salient 50 feature points per time step, using forward-fill substitution

benchmarks with an LSTM backbone. Evaluation is performed by removing the most or least salient 50 feature points per time step, using forward-fill substitution. Algorithm Removal of Most Salient 50 PointsRemoval of Least Salient 50 PointsCorr.↑CPD↑AUPD↑MPD↑AUMPD↑ CPP↓AUPP↓MPP↓AUMPP↓ IG (Sundararajan et al., 2017)19.80±0.5312.53±0.317.81±0.534.76±0.2817.5...

work page 2017
[30]

Evaluation is performed by removing the most or least salient 50 feature points per time step, using forward-fill substitution. Algorithm Removal of Most Salient 50 PointsRemoval of Least Salient 50 PointsCorr.↑CPD↑AUPD↑MPD↑AUMPD↑ CPP↓AUPP↓MPP↓AUMPP↓ IG (Sundararajan et al., 2017)281.32±0.42222.60±0.37319.69±0.75251.67±0.58381.59±0.71186.83±0.36349.24±0.7...

work page 2017
[31]

arg maxf(X T2−W+1:T 2)) is different or remains the same

vs. arg maxf(X T2−W+1:T 2)) is different or remains the same. Results are reported on the MIMIC-III decompensation benchmark with an LSTM backbone (T2 −T 1 = 1). Evaluation is performed by removing the most or least salient 50 feature points per time step, using forward-fill substitution. Algorithm Removal of Most Salient 50 PointsRemoval of Least Salient...

work page 2017