arxiv: 2604.17998 · v1 · submitted 2026-04-20 · 💻 cs.LG


Causally-Constrained Probabilistic Forecasting for Time-Series Anomaly Detection

Pooyan Khosravinia, João Gama, Bruno Veloso


Pith reviewed 2026-05-10 04:58 UTC · model grok-4.3

classification 💻 cs.LG

keywords anomaly detection · causal graphs · time series · transformer · probabilistic forecasting · multivariate data · root cause attribution


The pith

Causal graph constraints in probabilistic forecasting yield state-of-the-art anomaly detection with improved attribution for multivariate time series.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes a Causally Guided Transformer model that integrates an explicit time-lagged causal graph prior into deep sequence modeling for anomaly detection. This framework restricts the main prediction pathway using a hard parent mask from causal discovery, while using a latent Gaussian head to model uncertainty and a gated auxiliary path for residual correlations. A sympathetic reader would care because it offers both high detection performance and better interpretability for root-cause analysis in industrial monitoring systems where failures stem from complex interactions. Experiments demonstrate F1-scores of 96.19% on ASD and 95.32% on SMD benchmarks.

Core claim

The authors establish that constraining a Transformer-based probabilistic forecaster with a hard mask from time-lagged causal graphs enables accurate anomaly detection via negative log-likelihood scores and enhances variable-level attribution through probabilistic methods and counterfactual clamping, achieving superior results on standard benchmarks compared to correlational approaches.

What carries the argument

The hard parent mask from causal discovery that restricts the main forecasting pathway to causal parents, paired with a shadow auxiliary path using stop-gradient for blending residual information without compromising causality.
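The masking step can be illustrated with a minimal sketch. It is not the paper's implementation: the function names `parent_mask` and `masked_softmax`, the (variable, lag) indexing, and the toy scores are all assumptions made for illustration. It shows only how a hard mask forces zero weight onto every lagged input that the causal graph does not list as a parent of the target.

```python
import math

def parent_mask(parents, n_vars, max_lag):
    """Boolean mask[var][lag - 1]: True where (var, lag) is a graph-supported parent."""
    mask = [[False] * max_lag for _ in range(n_vars)]
    for var, lag in parents:
        mask[var][lag - 1] = True
    return mask

def masked_softmax(scores, mask):
    """Softmax over flattened (var, lag) scores; masked-out entries get weight 0."""
    flat = [(s, m) for row_s, row_m in zip(scores, mask)
            for s, m in zip(row_s, row_m)]
    allowed = [s for s, m in flat if m]
    if not allowed:
        raise ValueError("target has no parents in the graph")
    peak = max(allowed)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) if m else 0.0 for s, m in flat]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical target with 3 sensors and lags 1..2; its parents are
# sensor 0 at lag 1 and sensor 2 at lag 2.
mask = parent_mask([(0, 1), (2, 2)], n_vars=3, max_lag=2)
scores = [[0.5, 9.0], [7.0, 7.0], [1.0, 0.5]]  # raw attention scores (made up)
weights = masked_softmax(scores, mask)
print(weights)
```

Even though the non-parent inputs carry the largest raw scores here, they receive no weight, which is the sense in which the mask is "hard" rather than a soft penalty.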

If this is right

Anomalies are detected using adaptive streaming thresholding on negative log-likelihood scores.
Root causes are identified via per-dimension probabilistic attribution and counterfactual clamping.
The approach maintains causal interpretability while leveraging some correlational information through the safety-gated blending.
State-of-the-art performance is achieved on ASD and SMD datasets with F1-scores of 96.19% and 95.32% respectively.
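As a rough sketch of how negative log-likelihood scoring and per-dimension attribution could fit together, the snippet below assumes an independent (diagonal) Gaussian per variable. That independence assumption, the function names, and the toy numbers are illustrative choices, not details confirmed by the paper.

```python
import math

def gaussian_nll(y, mu, sigma):
    """Negative log-likelihood of y under N(mu, sigma^2)."""
    return 0.5 * math.log(2 * math.pi * sigma ** 2) + (y - mu) ** 2 / (2 * sigma ** 2)

def score_step(y, mu, sigma):
    """Return the total anomaly score and its per-dimension contributions."""
    per_dim = [gaussian_nll(a, m, s) for a, m, s in zip(y, mu, sigma)]
    return sum(per_dim), per_dim

def rank_dimensions(per_dim):
    """Variables ordered from largest to smallest NLL contribution."""
    return sorted(range(len(per_dim)), key=lambda i: per_dim[i], reverse=True)

# Toy time step: the forecaster predicts mean 0 and std 1 for all three
# variables, but variable 1 is observed far from its predicted mean.
y = [0.1, 5.0, -0.2]
mu = [0.0, 0.0, 0.0]
sigma = [1.0, 1.0, 1.0]
total, per_dim = score_step(y, mu, sigma)
ranking = rank_dimensions(per_dim)
print(total, ranking)
```

Under the diagonal assumption the total score decomposes exactly into per-variable terms, so the ranking gives a first candidate list for root-cause inspection.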

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

This method could be extended to other deep learning architectures for time-series tasks to improve their interpretability.
In domains with known causal structures, such as physics simulations, the causal mask could be provided directly without discovery.
The blending mechanism might inspire hybrid models in other areas where pure causality is too restrictive.

Load-bearing premise

The time-lagged causal graph prior derived from causal discovery accurately captures the true underlying causal relationships, and the hard mask does not exclude critical predictive information.

What would settle it

Showing that performance does not degrade significantly when the causal mask is ablated or replaced with a non-causal structure on the same benchmarks would falsify the claim that causal constraints are beneficial.
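A test of this kind needs control masks to compare against. The sketch below builds two such controls under stated assumptions: a random mask with the same number of allowed (variable, lag) entries as the discovered one, and a fully connected mask. The helper names and the toy `discovered` mask are hypothetical.

```python
import random

def random_mask_like(mask, seed=0):
    """Shuffle the True/False entries of a (var x lag) mask, keeping its density."""
    rng = random.Random(seed)
    flat = [cell for row in mask for cell in row]
    rng.shuffle(flat)
    width = len(mask[0])
    return [flat[i:i + width] for i in range(0, len(flat), width)]

def full_mask_like(mask):
    """Fully connected control: every (var, lag) entry is allowed."""
    return [[True] * len(row) for row in mask]

def density(mask):
    """Fraction of allowed entries in the mask."""
    cells = [cell for row in mask for cell in row]
    return sum(cells) / len(cells)

# Toy discovered mask for one target: 3 variables, 2 lags, 2 parents.
discovered = [[True, False], [False, False], [False, True]]
control = random_mask_like(discovered, seed=1)
dense = full_mask_like(discovered)
print(density(discovered), density(control), density(dense))
```

Matching the density keeps the comparison about which inputs are allowed rather than how many, so any gap between the discovered and random masks is easier to attribute to the graph itself.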

Figures

Figures reproduced from arXiv: 2604.17998 by Bruno Veloso, João Gama, Pooyan Khosravinia.

**Figure 1.** Overall pipeline of the proposed causal-probabilistic anomaly detection framework. A causal graph prior and windowed lagged multivariate input …

**Figure 2.** Detailed architecture of a target-specific forecasting block. The causal branch applies the parent mask and shared Transformer encoder to produce a …

Original abstract

Anomaly detection in multivariate time series is a central challenge in industrial monitoring, as failures frequently arise from complex temporal dynamics and cross-sensor interactions. While recent deep learning models, including graph neural networks and Transformers, have demonstrated strong empirical performance, most approaches remain primarily correlational and offer limited support for causal interpretation and root-cause localization. This study introduces a causally-constrained probabilistic forecasting framework which is a Causally Guided Transformer (CGT) model for multivariate time-series anomaly detection, integrating an explicit time-lagged causal graph prior with deep sequence modeling. For each target variable, a dedicated forecasting block employs a hard parent mask derived from causal discovery to restrict the main prediction pathway to graph-supported causes, while a latent Gaussian head captures predictive uncertainty. To leverage residual correlational information without compromising the causal representation, a shadow auxiliary path with stop-gradient isolation and a safety-gated blending mechanism is incorporated to suppress non-causal contributions when reliability is low. Anomalies are identified using negative log-likelihood scores with adaptive streaming thresholding, and root-cause variables are determined through per-dimension probabilistic attribution and counterfactual clamping. Experiments on the ASD and SMD benchmarks indicate that the proposed method achieves state-of-the-art detection performance, with F1-scores of 96.19% on ASD and 95.32% on SMD, and enhances variable-level attribution quality. These findings suggest that causal structural priors can improve both robustness and interpretability in detecting deep anomalies in multivariate sensor systems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper's new CGT architecture hard-masks a transformer with a data-derived causal graph and adds a stop-gradient shadow path plus safety-gated blending for probabilistic anomaly detection, claiming SOTA F1 on ASD and SMD, but the gains rest on untested assumptions about the graph quality.


The paper builds a transformer where each forecasting block uses a hard parent mask from causal discovery to limit the main pathway to graph-supported lags, pairs it with a latent Gaussian head for uncertainty, and routes residual correlations through an isolated shadow path with stop-gradient and safety-gated blending. It reports F1 scores of 96.19% on ASD and 95.32% on SMD plus better per-variable attribution via negative log-likelihood and counterfactual clamping. That specific combination of hard causal constraint, shadow isolation, and blending inside a probabilistic forecaster is the concrete addition over prior correlational transformer or GNN baselines for anomaly detection.

Referee Report

2 major / 2 minor

Summary. The paper introduces a Causally Guided Transformer (CGT) for multivariate time-series anomaly detection. It derives a time-lagged causal graph via causal discovery, applies a hard parent mask to restrict the main Transformer forecasting pathway to graph-supported parents, and augments it with a stop-gradient shadow auxiliary path plus safety-gated blending to retain residual correlations. Anomalies are flagged via negative log-likelihood under a latent Gaussian head with adaptive streaming thresholding; root-cause attribution uses per-dimension probabilities and counterfactual clamping. Experiments claim state-of-the-art F1 scores of 96.19% on ASD and 95.32% on SMD together with improved variable-level attribution.

Significance. If the causal structural prior demonstrably improves both detection robustness and interpretability without performance degradation from graph errors, the framework would advance causally-aware anomaly detection for industrial sensor systems. The explicit separation of causal and correlational pathways via stop-gradient isolation is a technically interesting design choice that could support better root-cause localization than purely correlational baselines.

major comments (2)

[§3] §3 (Method, causal graph integration): The central performance claim depends on the time-lagged causal graph obtained by discovery on the training data accurately identifying true parents; the hard mask then restricts the primary pathway. No validation of graph quality (e.g., stability across folds, comparison to domain knowledge, or oracle-graph ablation) is reported, nor is sensitivity to graph perturbations shown. If the mask excludes lagged predictors that carry predictive signal, the NLL-based detector and per-dimension attribution both become unreliable, undermining the SOTA claim.
[§4] §4 (Experiments): The reported F1 scores (96.19% ASD, 95.32% SMD) are presented without error bars, multiple random seeds, or ablations that isolate the contribution of the hard causal mask versus the shadow path or blending mechanism. Without these controls it is impossible to confirm that the causal constraint, rather than other modeling choices or post-hoc threshold tuning, drives the gains.

minor comments (2)

[§3.3] The safety-gated blending mechanism and its reliability estimator are described at a high level; explicit equations for the gating function and how its parameters are learned would improve reproducibility.
[§3] Notation for the stop-gradient operation and the latent Gaussian head parameters should be introduced once and used consistently throughout the method and experiments sections.
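For illustration only, the kind of explicit statement requested in these comments might look like the following. Every symbol and the functional form are hypothetical; nothing here is taken from the manuscript, and the authors' actual gate may differ substantially.

```latex
% Hypothetical notation, not taken from the paper.
% h_c   : representation from the causal (parent-masked) branch
% \mu_c : predicted mean from the causal branch
% \mu_s : predicted mean from the shadow auxiliary branch f_s
% \operatorname{sg}(\cdot) : stop-gradient (identity forward, zero gradient backward)
% r \in [0,1] : reliability estimate; \tau : safety cutoff; w, b : learned gate parameters
\begin{aligned}
\mu_s     &= f_s\bigl(\operatorname{sg}(h_c),\, x\bigr), \\
g         &= \mathbb{1}[\, r \ge \tau \,]\;\sigma(w\, r + b), \\
\hat{\mu} &= (1 - g)\,\mu_c + g\,\mu_s .
\end{aligned}
```

In this form the gate closes completely when the reliability estimate falls below the cutoff, so the prediction reduces to the causal branch alone, and the stop-gradient keeps the shadow branch's loss from altering the causal representation.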

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive comments. We address each major point below and will incorporate the suggested analyses and controls into a revised manuscript to strengthen the empirical support for our claims.


Referee: [§3] §3 (Method, causal graph integration): The central performance claim depends on the time-lagged causal graph obtained by discovery on the training data accurately identifying true parents; the hard mask then restricts the primary pathway. No validation of graph quality (e.g., stability across folds, comparison to domain knowledge, or oracle-graph ablation) is reported, nor is sensitivity to graph perturbations shown. If the mask excludes lagged predictors that carry predictive signal, the NLL-based detector and per-dimension attribution both become unreliable, undermining the SOTA claim.

Authors: We agree that validating the discovered causal graph is essential to support the central claims. The current manuscript emphasizes the integration mechanism and downstream anomaly detection results rather than graph quality diagnostics. In the revision we will add: (i) stability analysis of the time-lagged graph across multiple training folds or bootstrap resamples, (ii) sensitivity experiments that systematically perturb or remove edges from the discovered graph and report resulting changes in F1 and attribution quality, and (iii) an ablation replacing the discovered graph with a fully-connected mask (no causal restriction) and with a random mask of equal density. Because ground-truth causal structures are unavailable for the ASD and SMD benchmarks, an oracle-graph ablation is not feasible; the fully-connected and random-mask controls will instead isolate the contribution of the causal prior. These additions will directly address concerns about whether excluded lagged predictors degrade the NLL detector or attribution. revision: yes
Referee: [§4] §4 (Experiments): The reported F1 scores (96.19% ASD, 95.32% SMD) are presented without error bars, multiple random seeds, or ablations that isolate the contribution of the hard causal mask versus the shadow path or blending mechanism. Without these controls it is impossible to confirm that the causal constraint, rather than other modeling choices or post-hoc threshold tuning, drives the gains.

Authors: We acknowledge that single-run F1 scores and the absence of component-wise ablations limit the strength of the experimental claims. In the revised manuscript we will: (i) rerun all experiments with at least five independent random seeds and report mean F1 scores together with standard deviations, (ii) add explicit ablations that disable the hard parent mask (reverting to full self-attention), remove the stop-gradient shadow path, and disable the safety-gated blending, and (iii) document the adaptive streaming threshold procedure in detail, including its hyper-parameters and sensitivity to initialization, to demonstrate that gains are not driven by post-hoc tuning. These controls will clarify the incremental benefit of the causal mask relative to the auxiliary pathway and other architectural choices. revision: yes
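The multi-seed reporting promised in the response can be summarized with a few lines of standard-library Python. The F1 values below are placeholders for illustration only and are not results from the paper; the function name `summarize` is likewise an assumption.

```python
import statistics

def summarize(f1_by_seed):
    """Mean and sample standard deviation (n - 1 denominator) across seeds."""
    mean = statistics.mean(f1_by_seed)
    std = statistics.stdev(f1_by_seed)
    return mean, std

# Placeholder values for illustration only; NOT results from the paper.
f1_runs = [0.90, 0.92, 0.91, 0.89, 0.93]
mean, std = summarize(f1_runs)
print(f"F1 = {mean:.4f} ± {std:.4f} over {len(f1_runs)} seeds")
```

Reporting the spread alongside the mean lets a reader judge whether the gap to a baseline exceeds seed-to-seed variation.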

Circularity Check

0 steps flagged

No significant circularity; empirical method with independent benchmark validation.


The paper proposes an architectural framework (CGT) that first runs causal discovery on training data to obtain a time-lagged graph, then uses the resulting hard parent mask inside forecasting blocks while adding an isolated shadow path and safety-gated blending. Anomaly scores are computed via negative log-likelihood on held-out test data from ASD and SMD, with F1 and attribution metrics reported as direct empirical outcomes. No equations, self-citations, or fitted parameters are shown to reduce the central performance claims to tautological re-statements of the inputs; the causal mask is a fixed structural prior for each experiment rather than a quantity re-derived from the model's own predictions. The derivation chain therefore remains self-contained and externally falsifiable on standard benchmarks.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 2 invented entities

The central claim rests on the accuracy of the upstream causal discovery step and the assumption that the architectural constraints preserve predictive power; without the full manuscript the exact free parameters in blending and adaptive thresholding cannot be enumerated.

free parameters (2)

safety-gated blending parameters
The mechanism that decides when to suppress non-causal contributions likely requires at least one tunable or fitted parameter.
adaptive streaming threshold parameters
The adaptive thresholding for negative log-likelihood anomaly scores is described as streaming and therefore depends on chosen update rules or hyperparameters.
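To make concrete why a streaming threshold carries free parameters, here is a minimal sketch that swaps in a simple exponentially weighted mean-plus-k-sigma rule. This is a stand-in chosen for brevity, not the thresholding procedure used in the paper; the class name, `alpha`, `k`, and `warmup` are all assumed hyperparameters.

```python
import math

class StreamingThreshold:
    """EWMA mean/variance tracker that flags scores above mean + k * std."""

    def __init__(self, alpha=0.05, k=3.0, warmup=20):
        self.alpha, self.k, self.warmup = alpha, k, warmup
        self.mean, self.var, self.n = 0.0, 0.0, 0

    def update(self, score):
        """Return True if the score is flagged; otherwise fold it into the stats."""
        if self.n >= self.warmup:
            limit = self.mean + self.k * math.sqrt(self.var)
            if score > limit:
                return True  # flagged points do not contaminate the baseline
        if self.n == 0:
            self.mean = score
        else:
            delta = score - self.mean
            self.mean += self.alpha * delta
            self.var = (1 - self.alpha) * (self.var + self.alpha * delta ** 2)
        self.n += 1
        return False

# Toy stream: 50 mildly varying scores followed by one large spike.
det = StreamingThreshold()
scores = [1.0 + 0.1 * ((i % 5) - 2) for i in range(50)] + [10.0]
flags = [det.update(s) for s in scores]
print(flags.index(True))
```

Each of the three constructor arguments changes which points are flagged, which is why the ledger counts the thresholding step among the free parameters.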

axioms (2)

domain assumption The time-lagged causal graph obtained from causal discovery accurately represents the true causal structure of the sensor system.
Invoked to justify the hard parent mask that restricts the main prediction pathway.
domain assumption Restricting the main pathway to graph-supported parents does not cause loss of critical predictive information.
Required for the claim that causal constraints improve rather than degrade forecasting quality.

invented entities (2)

Causally Guided Transformer (CGT) no independent evidence
purpose: Integrate explicit causal graph prior with transformer sequence modeling for anomaly detection
New named architecture proposed to enforce causal constraints.
shadow auxiliary path with stop-gradient isolation no independent evidence
purpose: Capture residual correlational information without contaminating the causal representation
Novel component introduced to balance causal fidelity and performance.


