Mask2Cause: Causal Discovery via Adjacency Constrained Causal Attention
Pith reviewed 2026-05-11 02:02 UTC · model grok-4.3
The pith
Mask2Cause recovers the causal graph of a time series directly inside its forecasting model by constraining attention to an adjacency structure.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that an end-to-end training procedure using Inverted Variable Embedding together with Adjacency-Constrained Masked Attention, optimized under homoscedastic or heteroscedastic forecasting losses, recovers the underlying causal graph as a direct byproduct of the forward pass rather than through separate post-processing.
What carries the argument
The Adjacency-Constrained Masked Attention mechanism, which restricts each variable's attention to a learned or supplied adjacency mask so that only potential causal parents contribute to the prediction of mean and variance.
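A minimal sketch of the masking idea described above, in plain Python. It assumes a supplied binary mask where `adjacency[i][j] == 1` means variable j may act as a parent of variable i. The function name, the toy tokens, and the hard-mask formulation are illustrative assumptions, not the paper's implementation.

```python
import math

def masked_attention(queries, keys, values, adjacency):
    """Scaled dot-product attention over variable tokens with a hard parent mask.

    Hypothetical helper: adjacency[i][j] == 1 lets token i attend to token j;
    masked-out pairs receive exactly zero attention weight.
    """
    n = len(queries)
    d = len(queries[0])
    width = len(values[0])
    outputs, weights = [], []
    for i in range(n):
        allowed = [j for j in range(n) if adjacency[i][j]]
        if not allowed:
            # No admissible parents: emit zeros instead of dividing by zero.
            weights.append([0.0] * n)
            outputs.append([0.0] * width)
            continue
        scores = {
            j: sum(q * k for q, k in zip(queries[i], keys[j])) / math.sqrt(d)
            for j in allowed
        }
        peak = max(scores.values())  # subtract the max for numerical stability
        exps = {j: math.exp(s - peak) for j, s in scores.items()}
        total = sum(exps.values())
        row = [exps[j] / total if j in exps else 0.0 for j in range(n)]
        weights.append(row)
        outputs.append(
            [sum(row[j] * values[j][c] for j in range(n)) for c in range(width)]
        )
    return outputs, weights

# Toy example: three variable tokens, chain-like mask (assumed for illustration).
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
adjacency = [[1, 0, 0], [1, 1, 0], [0, 1, 1]]
outputs, weights = masked_attention(tokens, tokens, tokens, adjacency)
```

In this toy setup, variable 0 can only attend to itself, so its output is its own value vector, and every masked pair contributes nothing to the prediction.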
If this is right
- Causal structure is obtained without a separate graph-extraction step after training.
- Both mean and variance of forecasts are modeled under the same causal attention constraints.
- The recovered graph can be substituted into a new forecaster to reduce its parameter count by more than 70 percent on average while keeping predictive accuracy.
- Performance holds across benchmarks ranging from chaotic dynamical systems to realistic biological simulations.
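To make the parameter-reduction point concrete, here is a toy count. It assumes a hypothetical component-wise forecaster in which each variable's input layer holds `lag * hidden` weights per admitted parent; the function name, graph, and sizes are all illustrative assumptions. The resulting figure describes only this toy graph and is not a number from the paper.

```python
def first_layer_params(adjacency, lag, hidden):
    """Count input-layer weights when variable i reads only its admitted parents.

    Hypothetical accounting: each admitted parent contributes lag * hidden weights.
    """
    return sum(sum(row) * lag * hidden for row in adjacency)

n = 10
# Dense baseline: every variable reads every variable.
dense = [[1] * n for _ in range(n)]
# Assumed sparse graph: each variable depends on itself and its ring predecessor.
sparse = [[1 if j in (i, (i - 1) % n) else 0 for j in range(n)] for i in range(n)]

full = first_layer_params(dense, lag=5, hidden=16)
pruned = first_layer_params(sparse, lag=5, hidden=16)
reduction = 1 - pruned / full  # fraction of input-layer weights removed
```

The saving scales with the sparsity of the recovered graph, so a denser true graph would yield a smaller reduction under the same accounting.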
Where Pith is reading between the lines
- The same masking idea could be tested on non-time-series sequence tasks where known structural constraints exist.
- If the attention recovers stable graphs across different forecast horizons, the method might support causal intervention queries without retraining.
- Domain experts could supply partial adjacency masks to guide discovery in settings where some causal links are already known.
Load-bearing premise
That the attention weights shaped solely by a forecasting loss will converge to the true causal parents rather than to any other set of connections that happen to improve short-term prediction accuracy.
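A small simulation shows how this premise can fail. The sketch assumes a hidden driver `z` that feeds `y` with a lag of one step and `x` with a lag of two steps, with no edge between `y` and `x`. All names and coefficients are illustrative assumptions.

```python
import random

def pearson(a, b):
    """Plain Pearson correlation between two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    var_a = sum((u - mean_a) ** 2 for u in a)
    var_b = sum((v - mean_b) ** 2 for v in b)
    return cov / (var_a * var_b) ** 0.5

rng = random.Random(0)
steps = 2000
z = [rng.gauss(0.0, 1.0) for _ in range(steps)]  # hidden common driver
y = [0.0] * steps
x = [0.0] * steps
for t in range(2, steps):
    y[t] = z[t - 1] + 0.1 * rng.gauss(0.0, 1.0)  # y sees z one step later
    x[t] = z[t - 2] + 0.1 * rng.gauss(0.0, 1.0)  # x sees z two steps later

# y[t-1] and x[t] both carry z[t-2], so past y predicts x without causing it.
lagged_corr = pearson(y[2:steps - 1], x[3:steps])
```

A forecaster that only observes `x` and `y` would lower its error by attending from `x` to past `y`, even though the true graph has no such edge.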
What would settle it
A synthetic time series generated from a known causal graph in which Mask2Cause returns a different adjacency matrix whose corresponding pruned forecaster still achieves lower or equal error than the unpruned baseline.
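The data side of such a test can be sketched as follows. The example assumes a linear VAR(1) process as the known generator and uses the structural Hamming distance to compare adjacency matrices. The `candidate` matrix is a made-up stand-in for a method's output, not a Mask2Cause result.

```python
import random

def simulate_var1(weights, steps, noise_std=0.1, seed=0):
    """Simulate x_t = W x_{t-1} + noise for a known weight matrix W."""
    rng = random.Random(seed)
    n = len(weights)
    state = [0.0] * n
    series = []
    for _ in range(steps):
        state = [
            sum(weights[i][j] * state[j] for j in range(n)) + rng.gauss(0.0, noise_std)
            for i in range(n)
        ]
        series.append(state)
    return series

def true_adjacency(weights):
    """Edge j -> i exists exactly where the generating weight is non-zero."""
    return [[1 if w != 0 else 0 for w in row] for row in weights]

def structural_hamming_distance(a, b):
    """Number of entries on which two adjacency matrices disagree."""
    return sum(1 for ra, rb in zip(a, b) for u, v in zip(ra, rb) if u != v)

# Known chain graph 0 -> 1 -> 2 with self-dependence (assumed for illustration).
weights = [[0.5, 0.0, 0.0], [0.4, 0.5, 0.0], [0.0, 0.4, 0.5]]
series = simulate_var1(weights, steps=200)
truth = true_adjacency(weights)

# Hypothetical recovered graph: one spurious edge and one missed edge.
candidate = [[1, 0, 1], [1, 1, 0], [0, 0, 1]]
shd = structural_hamming_distance(truth, candidate)
```

The decisive step, which this sketch does not cover, is training one forecaster pruned to `candidate` and one unpruned baseline on `series`, then comparing their held-out errors.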
Original abstract
Leveraging deep learning for causal discovery in time series remains challenging because existing neural methods predominantly rely on component-wise architectures that fail to capture shared system dynamics or employ decoupled post-hoc graph extraction that risks overfitting to spurious correlations. We propose Mask2Cause, an end-to-end framework that recovers the underlying causal graph directly during the forecasting forward pass. Our approach introduces an Inverted Variable Embedding and an Adjacency-Constrained Masked Attention mechanism, trained with homoscedastic or heteroscedastic objectives to capture causal influences in both mean and variance. Empirical results on diverse benchmarks, from synthetic chaotic dynamics to realistic biological simulations, demonstrate state-of-the-art causal discovery with significantly reduced parameter complexity compared to standard baselines. We further show that inferred causal structures can be used to reduce parameter count of forecasting models by more than 70% on average while maintaining predictive accuracy.
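The two training objectives can be illustrated with the standard Gaussian forms, assumed here because the page does not give the paper's exact losses. The homoscedastic case reduces to mean squared error, and the heteroscedastic case is a Gaussian negative log-likelihood with a predicted log-variance per target. Function names are hypothetical.

```python
import math

def homoscedastic_loss(targets, means):
    """Mean squared error: Gaussian likelihood with a fixed, shared variance."""
    return sum((y - m) ** 2 for y, m in zip(targets, means)) / len(targets)

def heteroscedastic_loss(targets, means, log_vars):
    """Gaussian negative log-likelihood (constant term dropped).

    Each target has its own predicted log-variance, so the model is penalised
    for being both wrong and overconfident.
    """
    total = 0.0
    for y, m, log_var in zip(targets, means, log_vars):
        total += 0.5 * (log_var + (y - m) ** 2 / math.exp(log_var))
    return total / len(targets)

targets = [1.0, 2.0]
means = [1.0, 4.0]
mse = homoscedastic_loss(targets, means)
# With unit variance (log-variance 0) the NLL is exactly half the MSE.
nll_unit = heteroscedastic_loss(targets, means, [0.0, 0.0])
# Admitting more variance on the badly predicted target lowers the penalty.
nll_wide = heteroscedastic_loss(targets, means, [0.0, math.log(4.0)])
```

Because the variance head is also fed through the masked attention, a parent that affects only the spread of a variable can in principle still earn an edge, which is the point of the heteroscedastic option.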
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes Mask2Cause, an end-to-end framework for causal discovery in time series that recovers the causal graph directly during the forecasting forward pass using an Inverted Variable Embedding and an Adjacency-Constrained Masked Attention mechanism. It is trained with homoscedastic or heteroscedastic objectives to capture causal influences in both mean and variance. The work claims state-of-the-art performance on benchmarks from synthetic chaotic dynamics to biological simulations, along with the ability to reduce the parameter count of forecasting models by more than 70% on average while maintaining predictive accuracy.
Significance. If the central results hold, this would be a significant contribution to causal discovery in time series by providing an integrated approach that avoids component-wise architectures and post-hoc graph extraction. The parameter reduction aspect offers practical benefits for deploying forecasting models. The use of both mean and variance modeling for causal influences is a notable extension.
major comments (3)
- [Abstract] The abstract asserts state-of-the-art results and 70% parameter reduction, yet supplies no quantitative tables, no description of the exact baselines, no error bars, and no discussion of how the method avoids learning non-causal shortcuts that still aid forecasting.
- [Method] The central claim that the learned mask equals the causal graph rests on the assumption that forecasting performance is maximized only by true causal edges; without an identifiability proof, interventional data, or independent verification on ground-truth graphs, the optimization may recover spurious predictive structures instead.
- [Abstract] No explicit causal regularizer is mentioned, raising the risk that the adjacency-constrained attention learns any sparse mask that improves short-term prediction rather than the true causal adjacency.
minor comments (1)
- The abstract could benefit from a brief mention of the specific benchmarks used to allow readers to assess the diversity claimed.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and for recognizing the potential significance of an integrated end-to-end approach to causal discovery in time series. We address each major comment point by point below, with proposed revisions to improve clarity and rigor.
- Referee: [Abstract] The abstract asserts state-of-the-art results and 70% parameter reduction, yet supplies no quantitative tables, no description of the exact baselines, no error bars, and no discussion of how the method avoids learning non-causal shortcuts that still aid forecasting.
Authors: The abstract is a concise summary; all quantitative tables, baseline descriptions, and error bars appear in the Experiments section (Tables 1–4 and associated figures). The baselines are the standard methods listed in Section 4.1. We will revise the abstract to add one sentence briefly noting that the Inverted Variable Embedding and Adjacency-Constrained Masked Attention, together with the forecasting objective, are intended to prioritize causal edges over non-causal predictive shortcuts, with supporting evidence in the empirical results.
Revision: partial
- Referee: [Method] The central claim that the learned mask equals the causal graph rests on the assumption that forecasting performance is maximized only by true causal edges; without an identifiability proof, interventional data, or independent verification on ground-truth graphs, the optimization may recover spurious predictive structures instead.
Authors: We agree that a formal identifiability result is absent. The manuscript instead supplies independent verification on multiple datasets that contain known ground-truth causal graphs (synthetic chaotic systems and biological simulations), where the recovered masks achieve state-of-the-art structural accuracy. We will add an explicit paragraph in a new Limitations subsection discussing the reliance on empirical validation, the absence of interventional data, and the possibility of spurious predictive masks under certain conditions.
Revision: partial
- Referee: [Abstract] No explicit causal regularizer is mentioned, raising the risk that the adjacency-constrained attention learns any sparse mask that improves short-term prediction rather than the true causal adjacency.
Authors: The Adjacency-Constrained Masked Attention embeds the sparsity and causality constraint directly inside the attention operation, so that only edges retained in the mask participate in the forecasting computation; this is further shaped by the homoscedastic or heteroscedastic loss. We will expand the Method section to clarify this built-in regularization effect and add an ablation that removes the adjacency constraint, showing degraded causal-discovery metrics while forecasting performance may remain comparable.
Revision: yes
Circularity Check
No significant circularity in derivation chain
The paper introduces Mask2Cause as an end-to-end neural architecture that embeds an adjacency-constrained masked attention mechanism inside a forecasting model, with the learned mask serving as the recovered causal graph. Training occurs via standard homoscedastic or heteroscedastic forecasting losses on time-series data. This setup does not reduce any claimed result to its inputs by construction: the mask is not defined as causal a priori, no parameter is fitted on a subset and then relabeled as a prediction of the full causal structure, and no load-bearing uniqueness theorem or ansatz is imported via self-citation. The method is presented as a modeling proposal whose validity is assessed empirically against ground-truth causal graphs on synthetic and biological benchmarks. The assumption that forecasting optimization will privilege true causal edges over other sparse predictive masks is a substantive (and potentially falsifiable) modeling hypothesis rather than a tautological reduction, placing any concern under correctness risk rather than circularity.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
We propose Mask2Cause, an end-to-end framework that recovers the underlying causal graph directly during the forecasting forward pass... Adjacency-Constrained Masked Attention mechanism, trained with homoscedastic or heteroscedastic objectives
- IndisputableMonolith/Foundation/BranchSelection.lean · branch_selection · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
$\mathcal{L} = \mathcal{L}_{\text{pred}} + \lambda \cdot \frac{1}{N(N-1)} \sum_{i \neq j} \hat{A}_{ij}$
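The regularised objective quoted above adds the mean off-diagonal entry of the estimated adjacency, scaled by λ, to the prediction loss. The sketch below evaluates it on a small matrix, assuming that Â holds soft edge scores in [0, 1] and that self-loops are excluded, as the i ≠ j sum suggests. The function names and numbers are illustrative.

```python
def sparsity_penalty(adj_scores):
    """Mean of the off-diagonal entries: 1/(N(N-1)) * sum over i != j."""
    n = len(adj_scores)
    off_diagonal = sum(
        adj_scores[i][j] for i in range(n) for j in range(n) if i != j
    )
    return off_diagonal / (n * (n - 1))

def total_loss(pred_loss, adj_scores, lam):
    """Prediction loss plus the lambda-weighted sparsity penalty."""
    return pred_loss + lam * sparsity_penalty(adj_scores)

# Assumed soft adjacency estimate for N = 3 variables.
adj_scores = [
    [1.0, 0.5, 0.0],
    [0.5, 1.0, 0.0],
    [1.0, 1.0, 1.0],
]
penalty = sparsity_penalty(adj_scores)  # (0.5 + 0 + 0.5 + 0 + 1 + 1) / 6
loss = total_loss(pred_loss=2.0, adj_scores=adj_scores, lam=0.1)
```

Normalising by N(N−1) keeps the penalty between 0 and 1 regardless of the number of variables, so a single λ carries roughly the same meaning across systems of different size.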
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1] Jakob Runge et al. Detecting and quantifying causal associations in large nonlinear time series datasets. Science Advances, 5(11), 2019.
- [2] Roxana Pamfil et al. Dynotears: Structure learning from time-series data. In International Conference on Artificial Intelligence and Statistics, 2020.
- [3] Aapo Hyvärinen, Kun Zhang, Shohei Shimizu, and Patrik O. Hoyer. Estimation of a structural vector autoregression model using non-gaussianity. Journal of Machine Learning Research, 11:1709–1731, 2010.
- [4] Jonas Peters, D. Janzing, and B. Schölkopf. Causal inference on time series using restricted structural equation models. In Advances in Neural Information Processing Systems, volume 26, pages 154–162, 2013.
- [5] Clive WJ Granger. Investigating causal relations by econometric models and cross-spectral methods. Econometrica: Journal of the Econometric Society, pages 424–438, 1969.
- [6] Aurelie C Lozano, Naoki Abe, Yan Liu, and Saharon Rosset. Grouped graphical granger modeling methods for temporal causal modeling. In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 577–586. ACM, 2009.
- [7] Helmut Lütkepohl. New Introduction to Multiple Time Series Analysis. Springer Science & Business Media, 2005.
- [8] Alireza Sheikhattar, Sina Miran, Ji Liu, Jonathan B Fritz, Shihab A Shamma, Patrick O Kanold, and Behtash Babadi. Extracting neuronal functional network dynamics via adaptive granger causality analysis. Proceedings of the National Academy of Sciences, 115(17):E3869–E3878, 2018.
- [9] Patrick A Stokes and Patrick L Purdon. A study of problems encountered in granger causality analysis from a neuroscience perspective. Proceedings of the National Academy of Sciences, 114(34):E7063–E7072, 2017.
- [10] Raul Vicente, Michael Wibral, Michael Lindner, and Gordon Pipa. Transfer entropy—a model-free measure of effective connectivity for the neurosciences. Journal of Computational Neuroscience, 30(1):45–67, 2011.
- [11]
- [12] William F Sharpe, Gordon J Alexander, and Jeffery W Bailey. Investments. Prentice Hall, 1968.
- [13] Saurabh Khanna and Vincent Y. F. Tan. Economy statistical recurrent units for inferring nonlinear granger causality. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=SyxV9ANFDH.
- [14] Meike Nauta, Doina Bucur, and Christin Seifert. Causal discovery with attention-based convolutional neural networks. Machine Learning and Knowledge Extraction, 1(1):19, 2019.
- [15] Alex Tank, Ian Covert, Nicholas Foti, Ali Shojaie, and Emily B Fox. Neural granger causality. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(8):4267–4279, 2021.
- [16] Yuxiao Cheng, Lianglong Li, Tingxiong Xiao, Zongren Li, Jinli Suo, Kunlun He, and Qionghai Dai. Cuts+: High-dimensional causal discovery from irregular time-series. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 11525–11533, 2024.
- [17] Lingbai Kong, Wengen Li, Hanchen Yang, Yichao Zhang, Jihong Guan, and Shuigeng Zhou. Causalformer: An interpretable transformer for temporal causal discovery. IEEE Transactions on Knowledge and Data Engineering, 2024.
- [18] Wanqi Zhou, Shuanghao Bai, Shujian Yu, Qibin Zhao, and Badong Chen. Jacobian regularizer-based neural granger causality. arXiv preprint arXiv:2405.08779, 2024.
- [19] Tingzhu Bi, Yicheng Pan, Xinrui Jiang, Huize Sun, Meng Ma, and Ping Wang. Uncle: Towards scalable dynamic causal discovery in non-linear temporal systems, 2025. URL https://arxiv.org/abs/2511.03168
- [20] Yusen Liu, Yong Wang, Yifan Yin, Tianqing Zhu, Xiufeng Liu, and Huan Huo. Causal Discovery with Inverted Self-attention for Multivariate Time Series. In Proceedings of the 29th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD), Lecture Notes in Computer Science, pages 167–179. Springer, 2025. doi: 10.1007/978-981-96-8183-9_14
- [21] Francis X Diebold and Kamil Yilmaz. Measuring financial asset return and volatility spillovers, with application to global equity markets. The Economic Journal, 119(534):158–171, 2009.
- [22] Thilo Womelsdorf, Jan-Mathijs Schoffelen, Robert Oostenveld, Wolf Singer, Robert Desimone, Andreas K Engel, and Pascal Fries. Modulation of neuronal interactions through neuronal synchronization. Science, 316(5831):1609–1612, 2007.
- [23] Yong Liu, Tengge Hu, Haoran Zhang, Haixu Wu, Shiyu Wang, Lintao Ma, and Mingsheng Long. itransformer: Inverted transformers are effective for time series forecasting. In The Twelfth International Conference on Learning Representations, 2024.
- [24] Haixu Wu, Jiehui Xu, Jianmin Wang, and Mingsheng Long. Autoformer: Decomposition transformers with auto-correlation for long-term series forecasting. Advances in Neural Information Processing Systems, 34:22419–22430, 2021.
- [25] Haoyi Zhou, Shanghang Zhang, Jieqi Peng, Shuai Zhang, Jianxin Li, Hui Xiong, and Wancai Zhang. Informer: Beyond efficient transformer for long sequence time-series forecasting. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pages 11106–11115, 2021.
- [26] Christopher J Quinn, Negar Kiyavash, and Todd P Coleman. Directed information graphs. IEEE Transactions on Information Theory, 61(12):6887–6909, 2015.
- [27] Hans Marko. The bidirectional communication theory-a generalization of information theory. IEEE Transactions on Communications, 21(12):1345–1351, 2003.
- [28] Yuxiao Cheng, Runzhao Yang, Tingxiong Xiao, Zongren Li, Jinli Suo, Kunlun He, and Qionghai Dai. Cuts: Neural causal discovery from irregular time-series data. In ICLR, 2023.
- [29] Alexis Bellot, Kim Branson, and Mihaela van der Schaar. Neural graphical modelling in continuous-time: consistency guarantees and algorithms. In International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=SsHBkfeRF9L
- [30] Edward De Brouwer, Adam Arany, Jaak Simm, and Yves Moreau. Latent convergent cross mapping. In International Conference on Learning Representations, 2020.
- [31] Yuxiao Cheng, Ziqian Wang, Tingxiong Xiao, Qin Zhong, Jinli Suo, and Kunlun He. Causaltime: Realistically generated time-series for benchmarking of causal discovery. In The Twelfth International Conference on Learning Representations, 2024.
- [32] Alireza Karimi and Mark R Paul. Extensive chaos in the lorenz-96 model. Chaos: An Interdisciplinary Journal of Nonlinear Science, 20(4), 2010.
- [33] Robert J Prill, Daniel Marbach, Julio Saez-Rodriguez, Peter K Sorger, Leonidas G Alexopoulos, Xiaowei Xue, Neil D Clarke, Gregoire Altan-Bonnet, and Gustavo Stolovitzky. Towards a rigorous assessment of systems biology models: the dream3 challenges. PLoS ONE, 5(2):e9202, 2010.
- [34] Saurabh Khanna and Philippe Vincent-Lamarre. Economy statistical recurrent units for inferring nonlinear granger causality. IEEE Transactions on Pattern Analysis and Machine Intelligence, 43(7):2514–2528, 2021. doi: 10.1109/TPAMI.2021.3065601. arXiv:1802.05842.
A Notation Table. Table 7 summarizes the primary mathematical notation used throughout the mai...
- [35] Representational Contention and Capacity Limits. To use variable k as a lossless conduit for j, the network must allocate specific attention heads and latent subspace dimensions within k's token strictly for j's signal. However, under the Directed Information framework, if j is a true direct causal parent of i (i.e., $I(X_j \to X_i \mid X_{-\{i,j\}}) > 0$), then j contains unique...
- [36] Shared-Weight Disentanglement. Mask2Cause applies a universally shared Feed-Forward Network and final projection head across all variable tokens. This architectural constraint allows us to argue that routing is not preferred by the model even when the network has plenty of representation capacity. Let us assume that node k could partition its latent v...
- [37] Strict Positivity (Non-Determinism): We assume the true joint probability density of the system is strictly positive over the entire state space X: $P(x) > 0 \quad \forall x \in X$ (13). Necessity: This condition guarantees that no variable is a strictly deterministic, noiseless function of another. In purely deterministic systems, information becomes redundant, creating unresolvab...
- [38] Causal Sufficiency: We assume there are no unobserved (hidden) confounding variables that simultaneously influence two or more observed variables in our system X. Necessity: If a hidden confounder exists, the network's optimizer will observe a spurious statistical correlation between the variables and hallucinate a direct causal edge to minimize the forecast...
- [39] Strict Temporal Precedence: We assume that causal influences strictly take time to propagate, precluding instantaneous (intra-step) causal effects. Necessity: Under this assumption, the state of variable i at time t is strictly determined by historical states and independent noise, rather than the concurrent states of other variables. This justifies our model...
- [40] Causal Faithfulness: We assume the observed probability distribution is faithful to the causal graph G. Necessity: This ensures that true causal pathways do not feature "perfect cancellations" (e.g., a positive direct effect perfectly negated by a negative mediated effect). If unfaithful cancellations occurred, the variables would appear statistically indepen...
- [41] Stationarity and Finite Markov Order: We assume the graph topology and system dynamics are invariant over time, and that the conditional transition probabilities satisfy a finite-order Markov property bounded by our look-back window L: $P(x_t \mid x_{0:t-1}) = P(x_t \mid x_{t-L:t-1})$ (14). Necessity: This guarantees that the complete causal footprint required to predict the next...
- [42] Synthetic Generative Systems (VAR, Lorenz-96). We strictly separate hyperparameter tuning from final evaluation to prevent leakage. For each physical configuration (e.g., forcing constant F), we use the 6 available independent dataset realizations, generated with distinct random seeds. Seed 0 is used exclusively as a Calibration Set for hyperparameter tuning. Seeds 1–...
- [43] CausalTime (Static Real-World Proxies). For the fixed CausalTime datasets, where new samples cannot be generated, we employ a chronological split, using the first 20% of the data for tuning. To account for the variance inherent in neural network initialization, we train the model 5 times on the same dataset using different random seeds for weight initializa...
- [44] DREAM3 (Gene Regulatory Networks). Consistent with the evaluation protocol of the baselines we compare against (which utilize the single fixed dataset provided by the challenge), we do not perform multi-seed averaging for this benchmark. We use the first 20% of the data for tuning. We report the final AUROC from the single best model found after tuning on t...
- [45] Mixed Physics. Consistent with the protocol employed for DREAM3, we treat the Mixed Physics benchmark as a fixed dataset challenge. Baseline Configurations. For baseline methods that we ran locally (rather than quoting from published literature), we strictly utilized the hyperparameter configurations specified for each respective benchmark in their original ...
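Two pieces of the evaluation protocol excerpted above, the chronological 20% tuning split and AUROC scoring of a recovered graph, can be sketched as follows. The function names are hypothetical, and excluding self-loops from the AUROC is an assumption; the excerpts do not say how the diagonal is treated.

```python
def chronological_split(series, tune_fraction=0.2):
    """Reserve the earliest fraction of time steps for tuning; no shuffling."""
    cut = int(len(series) * tune_fraction)
    return series[:cut], series[cut:]

def edge_auroc(scores, truth):
    """AUROC over off-diagonal edges via pairwise ranking (ties count half).

    scores[i][j] is a predicted edge strength, truth[i][j] is 0 or 1.
    """
    n = len(truth)
    pairs = [(i, j) for i in range(n) for j in range(n) if i != j]
    positives = [scores[i][j] for i, j in pairs if truth[i][j] == 1]
    negatives = [scores[i][j] for i, j in pairs if truth[i][j] == 0]
    if not positives or not negatives:
        raise ValueError("AUROC needs at least one true edge and one non-edge")
    wins = sum(
        1.0 if p > q else 0.5 if p == q else 0.0
        for p in positives
        for q in negatives
    )
    return wins / (len(positives) * len(negatives))

# Toy data: ten time steps and a made-up score matrix for three variables.
tune, rest = chronological_split(list(range(10)))
truth = [[0, 1, 0], [0, 0, 1], [0, 0, 0]]
scores = [[0.0, 0.9, 0.2], [0.1, 0.0, 0.4], [0.3, 0.6, 0.0]]
auroc = edge_auroc(scores, truth)
```

AUROC is threshold-free, so it scores the ranking of edge strengths rather than any single binarised graph.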