Recognition: 2 theorem links
Delta-XAI: A Unified Framework for Explaining Prediction Changes in Online Time Series Monitoring
Pith reviewed 2026-05-17 04:05 UTC · model grok-4.3
The pith
Delta-XAI wraps existing explanation methods and introduces shifted-window gradients to track why predictions change across time steps in online monitoring.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Delta-XAI adapts fourteen existing XAI methods through a wrapper function and supplies a principled evaluation suite for the online setting that measures faithfulness, sufficiency, and coherence. When these wrapped methods are tested, classical gradient-based approaches such as Integrated Gradients outperform more recent techniques. The authors therefore introduce Shifted Window Integrated Gradients (SWING), which incorporates past observations into the integration path so that temporal dependencies are captured and out-of-distribution effects are reduced. Extensive experiments across diverse settings and metrics confirm that SWING improves explanation quality for prediction changes.
What carries the argument
Shifted Window Integrated Gradients (SWING), an adaptation of Integrated Gradients that extends the integration path to include past observations and thereby accounts for temporal context in online explanations.
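This excerpt gives no pseudocode for SWING. As a non-authoritative sketch of the core idea, standard Integrated Gradients can be pointed at prediction changes by taking the previous window as the baseline, so the integration path runs through recently observed history rather than toward an arbitrary zero input. The model, weights, and window contents below are invented for illustration and are not the paper's implementation:

```python
import numpy as np

def integrated_gradients(f, grad_f, baseline, x, steps=50):
    """Midpoint-rule IG along the straight path from baseline to x."""
    alphas = (np.arange(steps) + 0.5) / steps
    grads = np.stack([grad_f(baseline + a * (x - baseline)) for a in alphas])
    return (x - baseline) * grads.mean(axis=0)

# Toy monitor: logistic score over a flattened 4-feature window
# (hypothetical weights, not the paper's model).
w = np.array([0.5, -1.0, 2.0, 0.3])
f = lambda x: 1.0 / (1.0 + np.exp(-w @ x))
grad_f = lambda x: f(x) * (1.0 - f(x)) * w

x_prev = np.array([0.2, 0.1, 0.0, 0.5])  # window ending at step T1
x_curr = np.array([0.3, 0.1, 0.9, 0.4])  # window ending at step T2

# SWING-style baseline choice: explain f(x_curr) - f(x_prev) by
# integrating from the past window instead of an all-zeros input.
attr = integrated_gradients(f, grad_f, x_prev, x_curr)
print(np.argmax(np.abs(attr)))  # prints 2: the feature whose jump drives the change
```

IG's completeness property carries over to this baseline choice: the attributions sum (up to discretization error) to f(x_curr) - f(x_prev), so the vector is a decomposition of the prediction change itself.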
If this is right
- Classical gradient methods become competitive or superior once they receive temporal context through the wrapper.
- SWING reduces out-of-distribution effects by keeping the explanation path inside the observed history.
- The evaluation suite allows direct comparison of how well different methods recover the actual change in prediction.
- Effectiveness of the adapted methods holds across multiple datasets, model architectures, and quantitative metrics.
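One way to read the third bullet concretely: a deletion-style faithfulness check reverts the most salient features of the new window to their previous values and measures how quickly the explained prediction change collapses. Everything below (the linear monitor and its exact attributions) is a hypothetical stand-in, not the paper's metric suite:

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=8)
f = lambda x: float(w @ x)  # toy linear monitor over a flattened window

x_prev = rng.normal(size=8)           # window ending at T1
x_curr = x_prev + rng.normal(size=8)  # window ending at T2

attr = w * (x_curr - x_prev)          # exact change attributions for a linear model
delta = f(x_curr) - f(x_prev)

def masked_delta(k):
    """Revert the k most-salient features of x_curr to their T1 values
    (a forward-fill-style substitution) and re-measure the change."""
    order = np.argsort(-np.abs(attr))
    x_masked = x_curr.copy()
    x_masked[order[:k]] = x_prev[order[:k]]
    return f(x_masked) - f(x_prev)

# Faithful attributions: reverting the top-ranked features should shrink
# the remaining prediction change fastest.
drops = [abs(masked_delta(k)) for k in range(len(w) + 1)]
print(drops[0], drops[-1])  # |delta| at k=0; exactly 0 once all features revert
```

A real suite would also rank by least-salient features and compare the two curves, but the core mechanic (mask, re-predict, compare against the original change) is the same.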
Where Pith is reading between the lines
- The same wrapper pattern could be applied to other sequential domains such as video or sensor streams to test whether the temporal gain generalizes.
- Coherence scores may expose cases in which explanations drift even when the underlying process changes gradually.
- In high-stakes monitoring, the ability to attribute a shift to specific past steps could guide targeted data collection or model retraining.
Load-bearing premise
That wrapping existing XAI methods together with the new faithfulness-sufficiency-coherence suite correctly reflects genuine temporal dependencies without creating artificial patterns of its own.
What would settle it
On a new streaming dataset where known external events drive clear prediction shifts, check whether SWING attributions recover those events more accurately than standard Integrated Gradients or the other wrapped baselines.
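The check proposed above can be prototyped on synthetic data: plant a known spike, build a monitor whose prediction shifts when the spike enters its window, and test whether change attributions point at the spike's time step. The stream, weights, and alignment scheme are illustrative assumptions; for a linear model the per-time-step contributions are exact, standing in for what SWING-style attributions would be expected to recover:

```python
import numpy as np

rng = np.random.default_rng(1)
T, spike_t = 40, 25
x = rng.normal(scale=0.1, size=T)
x[spike_t] += 3.0                      # known external event

# Toy monitor: prediction at time t is a weighted sum of the last W points.
W = 10
w = np.linspace(0.1, 1.0, W)
def predict(t):
    return float(w @ x[t - W + 1 : t + 1])

t1, t2 = spike_t - 1, spike_t          # prediction shifts as the spike enters
delta = predict(t2) - predict(t1)

# Linear model: per-time-step contributions to the change are exact.
# Align both windows on absolute time indices before differencing.
contrib_curr = {t2 - W + 1 + i: w[i] * x[t2 - W + 1 + i] for i in range(W)}
contrib_prev = {t1 - W + 1 + i: w[i] * x[t1 - W + 1 + i] for i in range(W)}
attr = {t: contrib_curr.get(t, 0.0) - contrib_prev.get(t, 0.0)
        for t in set(contrib_curr) | set(contrib_prev)}

recovered = max(attr, key=lambda t: abs(attr[t]))
print(recovered == spike_t)  # prints True: the spike dominates the change
```

An attribution method that fails this test on data where the driving event is known by construction is unlikely to be trustworthy on real streams.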
Original abstract
Explaining online time series monitoring models is crucial across sensitive domains such as healthcare and finance, where temporal and contextual prediction dynamics underpin critical decisions. While recent XAI methods have improved the explainability of time series models, they mostly analyze each time step independently, overlooking temporal dependencies. This results in further challenges: explaining prediction changes is non-trivial, methods fail to leverage online dynamics, and evaluation remains difficult. To address these challenges, we propose Delta-XAI, which adapts 14 existing XAI methods through a wrapper function and introduces a principled evaluation suite for the online setting, assessing diverse aspects, such as faithfulness, sufficiency, and coherence. Experiments reveal that classical gradient-based methods, such as Integrated Gradients (IG), can outperform recent approaches when adapted for temporal analysis. Building on this, we propose Shifted Window Integrated Gradients (SWING), which incorporates past observations in the integration path to systematically capture temporal dependencies and mitigate out-of-distribution effects. Extensive experiments consistently demonstrate the effectiveness of SWING across diverse settings with respect to diverse metrics. Our code is publicly available at https://github.com/AITRICS/Delta-XAI.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents Delta-XAI, a unified framework for explaining prediction changes in online time series monitoring. It adapts 14 existing XAI methods using a wrapper function to account for temporal dependencies, which prior methods overlook by analyzing time steps independently. A new evaluation suite is introduced to assess faithfulness, sufficiency, and coherence in the online setting. The paper finds that adapted classical methods like Integrated Gradients can outperform recent approaches. It proposes Shifted Window Integrated Gradients (SWING) that incorporates past observations in the integration path to capture temporal dependencies and mitigate out-of-distribution effects. Extensive experiments are reported to demonstrate SWING's effectiveness across diverse settings and metrics, with public code available.
Significance. If the results hold, the work is significant for providing a practical solution to explainability challenges in online time series models used in sensitive applications like healthcare and finance. The wrapper-based adaptation offers a way to leverage existing XAI techniques for temporal analysis, while the evaluation suite addresses the difficulty of assessing explanations in dynamic online contexts. SWING's use of shifted windows to include historical data represents a targeted improvement over standard methods to handle online dynamics and OOD issues. Public code availability is a notable strength for reproducibility and community use.
major comments (2)
- [Abstract] The central claim that 'extensive experiments consistently demonstrate the effectiveness of SWING across diverse settings with respect to diverse metrics' is load-bearing, yet the abstract provides no datasets, quantitative results, tables, error bars, or ablation details to verify whether the faithfulness/sufficiency/coherence suite captures temporal dependencies without introducing artifacts (as noted in the weakest assumption).
- [Abstract] The description of SWING states that it 'incorporates past observations in the integration path to systematically capture temporal dependencies and mitigate out-of-distribution effects,' but provides no equation, pseudocode, or derivation for the shifted-window construction, preventing assessment of whether it preserves original IG properties or differs substantively from simpler baselines.
minor comments (2)
- The abstract refers to adapting '14 existing XAI methods' without enumerating them; listing these would clarify the scope of the unified wrapper framework.
- The public code link is provided, but the abstract-only format makes it impossible to cross-check reproducibility of the claimed experimental outcomes.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and positive assessment of the work's significance for online time series explainability. We address the two major comments on the abstract below and will prepare a revised version of the manuscript.
Point-by-point responses
-
Referee: [Abstract] The central claim that 'extensive experiments consistently demonstrate the effectiveness of SWING across diverse settings with respect to diverse metrics' is load-bearing, yet the abstract provides no datasets, quantitative results, tables, error bars, or ablation details to verify whether the faithfulness/sufficiency/coherence suite captures temporal dependencies without introducing artifacts (as noted in the weakest assumption).
Authors: We agree that the abstract would benefit from more concrete support for its central claim to allow readers to better assess the evaluation suite. In the revised abstract we will add the names of the primary datasets, a concise summary of key quantitative results (including error bars where applicable), and a brief reference to ablation studies. We will also include a short clause clarifying how the faithfulness/sufficiency/coherence metrics are designed to respect temporal structure and avoid obvious artifacts. These additions will remain within abstract length limits while strengthening the claim.
Revision: yes
-
Referee: [Abstract] The description of SWING states that it 'incorporates past observations in the integration path to systematically capture temporal dependencies and mitigate out-of-distribution effects,' but provides no equation, pseudocode, or derivation for the shifted-window construction, preventing assessment of whether it preserves original IG properties or differs substantively from simpler baselines.
Authors: We acknowledge that the current abstract description of SWING is high-level. We will revise the abstract to include a compact, high-level characterization of the shifted-window construction (e.g., a one-sentence outline of how the integration path is extended over past observations). The full equation, pseudocode, a derivation demonstrating preservation of the IG axioms, and an explicit comparison to simpler baselines appear in the methods section of the full manuscript; we will ensure the abstract points readers to that section for technical detail.
Revision: yes
Circularity Check
No circularity: claims rest on external experiments and wrapper adaptation without self-referential reductions
full rationale
The abstract presents Delta-XAI as a wrapper adapting 14 existing XAI methods plus a new evaluation suite (faithfulness, sufficiency, coherence), notes that gradient-based methods outperform others in experiments, and introduces SWING to incorporate past observations. No equations, derivations, fitted parameters renamed as predictions, or self-citations appear in the provided text. Effectiveness claims are tied to experimental comparisons and public code rather than any internal definition or input that forces the output by construction. This is the common case of a self-contained proposal whose central assertions can be checked externally.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: existing gradient-based and perturbation XAI methods can be adapted via a wrapper without losing core properties in online time series.
invented entities (1)
- Shifted Window Integrated Gradients (SWING): no independent evidence
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
The relation between the paper passage and the cited Recognition theorem is unclear.
Passage: "SWING … incorporates past observations in the integration path … Online Completeness, Implementation Invariance, and Skew-Symmetry"
- IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
The relation between the paper passage and the cited Recognition theorem is unclear.
Passage: "Theorem 2 (Online Completeness) … sum of SWING attributions equals the prediction difference"
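The Online Completeness claim the second link points at, that SWING attributions sum to the prediction difference, can at least be sanity-checked numerically. Below, a small randomly weighted tanh network (an assumption, not the paper's model) plays the monitor; integrating gradients from the earlier window to the current one reproduces the prediction change up to discretization error:

```python
import numpy as np

# Hypothetical two-layer network; gradients computed in closed form.
rng = np.random.default_rng(2)
W1, b1 = rng.normal(size=(5, 6)), rng.normal(size=5)
w2 = rng.normal(size=5)

f = lambda x: float(w2 @ np.tanh(W1 @ x + b1))
def grad_f(x):
    h = np.tanh(W1 @ x + b1)
    return (w2 * (1.0 - h**2)) @ W1

def ig(baseline, x, steps):
    """Midpoint-rule Integrated Gradients from baseline to x."""
    alphas = (np.arange(steps) + 0.5) / steps
    grads = np.stack([grad_f(baseline + a * (x - baseline)) for a in alphas])
    return (x - baseline) * grads.mean(axis=0)

x_t1 = rng.normal(size=6)   # earlier window (the SWING-style baseline)
x_t2 = rng.normal(size=6)   # current window

attr = ig(x_t1, x_t2, steps=400)
gap = abs(attr.sum() - (f(x_t2) - f(x_t1)))
print(gap)  # small, and shrinks as steps grows
```

This is only a numerical spot check of the completeness identity for one baseline choice, not a substitute for the formal theorem it is meant to echo.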
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
[1] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D. Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877–1901, 2020.
[2] Arun Das and Paul Rad. Opportunities and challenges in explainable artificial intelligence (XAI): A survey. arXiv preprint arXiv:2006.11371, 2020.
[3] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.
[4] John Cristian Borges Gamboa. Deep learning for time-series analysis. arXiv preprint arXiv:1701.01887, 2017.
[5] Narine Kokhlikyan, Vivek Miglani, Miguel Martin, Edward Wang, Bilal Alsallakh, Jonathan Reynolds, Alexander Melnikov, Natalia Kliushkina, Carlos Araya, Siqi Yan, et al. Captum: A unified and generic model interpretability library for PyTorch. arXiv preprint arXiv:2009.07896, 2020.
[6] Kin Kwan Leung, Clayton Rooke, Jonathan Smith, Saba Zuberi, and Maksims Volkovs. Temporal dependencies in feature importance for time series predictions. arXiv preprint arXiv:2107.14317, 2021.
[7] Zichuan Liu, Tianchun Wang, Jimeng Shi, Xu Zheng, Zhuomin Chen, Lei Song, Wenqian Dong, Jayantha Obeysekera, Farhad Shirani, and Dongsheng Luo. TimeX++: Learning time-series explanations with information bottleneck. arXiv preprint arXiv:2405.09308, 2024.
[8] Attila Reiss and Didier Stricker. Introducing a new benchmarked dataset for activity monitoring. In 2012 16th International Symposium on Wearable Computers, pp. 108–109. IEEE, 2012.
[9] Matthew A. Reyna, Christopher S. Josef, Russell Jeter, Supreeth P. Shashikumar, M. Brandon Westover, Shamim Nemati, Gari D. Clifford, and Ashish Sharma. Early prediction of sepsis from clinical data: The PhysioNet/Computing in Cardiology Challenge 2019. Critical Care Medicine, 48(2):210–217, 2020.
[10] Omer Berat Sezer, Mehmet Ugur Gudelek, and Ahmet Murat Ozbayoglu. Financial time series forecasting with deep learning: A systematic literature review: 2005–2019. Applied Soft Computing, 90:106181, 2020.
[11] Harini Suresh, Nathan Hunt, Alistair Johnson, Leo Anthony Celi, Peter Szolovits, and Marzyeh Ghassemi. Clinical intervention prediction and understanding using deep networks. arXiv preprint arXiv:1705.08498, 2017.
[12] Yue Wang and Sai Ho Chung. Artificial intelligence in safety-critical systems: A systematic review. Industrial Management & Data Systems, 122(2):442–470, 2022.
[13] Yuxuan Wang, Haixu Wu, Jiaxiang Dong, Yong Liu, Mingsheng Long, and Jianmin Wang. Deep time series models: A comprehensive survey and benchmark. arXiv preprint arXiv:2407.13278, 2024.
[14] Ziqi Zhao, Yucheng Shi, Shushan Wu, Fan Yang, Wenzhan Song, and Ninghao Liu. Interpretation of time-series deep models: A survey. arXiv preprint arXiv:2305.14582, 2023.
discussion (0)