arxiv: 2605.07280 · v1 · submitted 2026-05-08 · 💻 cs.LG · cs.AI

Mask2Cause: Causal Discovery via Adjacency Constrained Causal Attention

Omar Muhammad, Pasupuleti Dhruv Shivkant, Deepak N. Subramani

Pith reviewed 2026-05-11 02:02 UTC · model grok-4.3

keywords causal discovery · time series forecasting · masked attention · graph inference · neural networks · adjacency constraints · parameter reduction

The pith

Mask2Cause recovers the causal graph of a time series directly inside its forecasting model by constraining attention to an adjacency structure.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Mask2Cause as an end-to-end neural approach that discovers causal relationships among variables in time series data while it makes forecasts. Prior neural methods either process variables separately and miss shared dynamics or train a predictor then extract a graph afterward, which can latch onto correlations that do not reflect true causation. Mask2Cause instead builds an inverted variable embedding and an adjacency-constrained masked attention layer whose weights are shaped by the same forecasting objective, allowing the model to learn which past variables influence future values and variances. If the claim holds, the resulting graph can be used to prune a downstream forecaster to less than 30 percent of its original parameters while preserving accuracy on both synthetic chaotic systems and biological simulations.

Core claim

The central claim is that an end-to-end training procedure using Inverted Variable Embedding together with Adjacency-Constrained Masked Attention, optimized under homoscedastic or heteroscedastic forecasting losses, recovers the underlying causal graph as a direct byproduct of the forward pass rather than through separate post-processing.
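The two training objectives named above can be illustrated with a minimal sketch. It assumes the homoscedastic case reduces to mean squared error (a Gaussian likelihood with one fixed variance) and the heteroscedastic case is a Gaussian negative log-likelihood with a predicted log-variance per sample. The function names are hypothetical, and the paper's exact loss may differ.

```python
import math


def gaussian_nll(y, mean, log_var):
    """Negative log-likelihood of y under N(mean, exp(log_var))."""
    return 0.5 * (math.log(2 * math.pi) + log_var + (y - mean) ** 2 / math.exp(log_var))


def homoscedastic_loss(targets, means):
    """Mean squared error; assumed stand-in for a fixed-variance Gaussian objective."""
    return sum((y - m) ** 2 for y, m in zip(targets, means)) / len(targets)


def heteroscedastic_loss(targets, means, log_vars):
    """Average Gaussian NLL with a separate predicted log-variance per sample."""
    return sum(gaussian_nll(y, m, lv) for y, m, lv in zip(targets, means, log_vars)) / len(targets)


# Illustrative values only: a residual of 1.0 is penalised least when the
# predicted variance matches the squared residual (log-variance 0).
loss_matched = heteroscedastic_loss([1.0], [0.0], [0.0])
loss_overconfident = heteroscedastic_loss([1.0], [0.0], [-2.0])
```

Under this reading, the variance head gives the attention mask a second prediction target, so an edge that only affects volatility can still lower the loss.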

What carries the argument

The Adjacency-Constrained Masked Attention mechanism, which restricts each variable's attention to a learned or supplied adjacency mask so that only potential causal parents contribute to the prediction of mean and variance.
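A minimal sketch of attention restricted by an adjacency mask follows. It assumes one token per variable, a hard binary mask in which `adjacency[i][j] == 1` means variable j may influence variable i, and plain scaled dot-product scores. The paper's mask is described as learnable, so this fixed-mask version illustrates the constraint only and is not the authors' implementation.

```python
import math


def masked_attention(queries, keys, values, adjacency):
    """Scaled dot-product attention where row i may only attend to permitted parents.

    Skipping a masked column is equivalent to setting its logit to -inf before
    the softmax. Returns (outputs, weights).
    """
    n = len(queries)
    dim = len(queries[0])
    outputs, all_weights = [], []
    for i in range(n):
        allowed = [j for j in range(n) if adjacency[i][j]]
        if not allowed:
            raise ValueError(f"variable {i} has no permitted parent")
        scores = {
            j: sum(q * k for q, k in zip(queries[i], keys[j])) / math.sqrt(dim)
            for j in allowed
        }
        peak = max(scores.values())  # subtract the max for numerical stability
        exps = {j: math.exp(s - peak) for j, s in scores.items()}
        total = sum(exps.values())
        row = [exps[j] / total if j in exps else 0.0 for j in range(n)]
        all_weights.append(row)
        outputs.append(
            [sum(row[j] * values[j][c] for j in range(n)) for c in range(len(values[0]))]
        )
    return outputs, all_weights


# Three hypothetical variable tokens; variable 0 may only attend to itself.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
mask = [[1, 0, 0], [1, 1, 0], [0, 1, 1]]
outputs, weights = masked_attention(tokens, tokens, tokens, mask)
```

Masked sources receive exactly zero weight, so they cannot contribute to the prediction for that target, which is what lets the mask be read as a candidate parent set.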

If this is right

Causal structure is obtained without a separate graph-extraction step after training.
Both mean and variance of forecasts are modeled under the same causal attention constraints.
The recovered graph can be substituted into a new forecaster to reduce its parameter count by more than 70 percent on average while keeping predictive accuracy.
Performance holds across benchmarks ranging from chaotic dynamical systems to realistic biological simulations.
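The arithmetic behind graph-based pruning can be sketched as follows. The sketch assumes a hypothetical linear forecaster with one weight per (target, source, lag) triple, so that removing a non-edge removes its weights. The paper's downstream forecaster and its parameter accounting are not described in the text above, and the example graph is invented for illustration.

```python
def dense_weight_count(num_vars, lags):
    """Weights in the assumed unpruned forecaster: every source feeds every target."""
    return num_vars * num_vars * lags


def pruned_weight_count(adjacency, lags):
    """Weights kept when only edges present in the adjacency matrix are retained."""
    edges = sum(1 for row in adjacency for entry in row if entry)
    return edges * lags


def reduction_fraction(adjacency, lags):
    """Fraction of weights removed by pruning to the given graph."""
    dense = dense_weight_count(len(adjacency), lags)
    return 1.0 - pruned_weight_count(adjacency, lags) / dense


# Hypothetical 6-variable graph: each variable depends on itself and its predecessor.
num_vars = 6
example_graph = [
    [1 if j in (i, (i - 1) % num_vars) else 0 for j in range(num_vars)]
    for i in range(num_vars)
]
removed = reduction_fraction(example_graph, lags=5)
```

In this toy accounting the saving depends only on graph density, so sparser recovered graphs give larger reductions.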

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same masking idea could be tested on non-time-series sequence tasks where known structural constraints exist.
If the attention recovers stable graphs across different forecast horizons, the method might support causal intervention queries without retraining.
Domain experts could supply partial adjacency masks to guide discovery in settings where some causal links are already known.

Load-bearing premise

That the attention weights shaped solely by a forecasting loss will converge to the true causal parents rather than to any other set of connections that happen to improve short-term prediction accuracy.

What would settle it

A synthetic time series generated from a known causal graph on which Mask2Cause returns a different adjacency matrix, yet the forecaster pruned to that matrix still achieves error lower than or equal to that of the unpruned baseline.
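Checking a recovered graph against a known one is usually done with a ranking metric such as AUROC, which the Figure 2 caption also reports. The sketch below assumes the model exposes a matrix of edge scores and that self-loops are excluded from scoring. Both are assumptions, and the function name is hypothetical.

```python
def edge_auroc(true_adj, scores, include_diagonal=False):
    """AUROC of edge scores against a binary ground-truth adjacency matrix.

    Computed as the probability that a true edge outscores a non-edge,
    counting ties as one half.
    """
    positives, negatives = [], []
    n = len(true_adj)
    for i in range(n):
        for j in range(n):
            if i == j and not include_diagonal:
                continue
            (positives if true_adj[i][j] else negatives).append(scores[i][j])
    if not positives or not negatives:
        raise ValueError("need at least one edge and one non-edge")
    wins = sum((p > q) + 0.5 * (p == q) for p in positives for q in negatives)
    return wins / (len(positives) * len(negatives))


# Hypothetical 3-variable cycle and two invented score matrices.
truth = [[0, 1, 0], [0, 0, 1], [1, 0, 0]]
perfect_scores = [[0.0, 0.9, 0.1], [0.2, 0.0, 0.8], [0.7, 0.3, 0.0]]
flawed_scores = [[0.0, 0.9, 0.1], [0.2, 0.0, 0.8], [0.25, 0.3, 0.0]]
auroc_perfect = edge_auroc(truth, perfect_scores)
auroc_flawed = edge_auroc(truth, flawed_scores)
```

A high AUROC alongside unchanged forecasting error would support the claim; the counterexample described above is a case where forecasting error stays low while this metric drops.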

Figures

Figures reproduced from arXiv: 2605.07280 by Deepak N. Subramani, Omar Muhammad, Pasupuleti Dhruv Shivkant.

**Figure 1.** The Mask2Cause architecture. The model maps a multivariate history into variable-specific tokens. A Transformer encoder, constrained by a learnable adjacency matrix, processes these tokens to predict the next-step state, discovering the causal graph through the forecasting objective. … formulate this structural inference task as a predictive modeling problem, learning the adjacency matrix A by optimizing a fo…

**Figure 2.** Ground truth and predicted causal graph for Lorenz-96 (F = 40, T = 250, AUROC = 0.98).

**Figure 3.** Ground truth for the p = 10 VAR system. E.2 Lorenz-96 Dataset (Nonlinear): The Lorenz-96 model is a continuous-time dynamical system often used to simulate complex atmospheric physics [32]. It consists of p variables $x_1, \dots, x_p$ arranged in a cyclic dependency structure. The evolution of variable $x_i$ is governed by the system of ordinary differential equations (ODEs) $\frac{dx_i}{dt} = (x_{i+1} - x_{i-2})\,x_{i-1} - x_i + F$ (17), where…

**Figure 4.** Ground-truth causal graphs for the p = 10 Lorenz-96 system.

**Figure 5.** Ground-truth causal graphs for the CausalTime datasets.

**Figure 6.** Ground-truth causal graphs for the DREAM3 networks.

**Figure 7.** Ground-truth causal graphs for the Mixed Physics system.

**Figure 8.** Complexity scaling (MSE). (Left) FLOPs vs. sequence length L. The model is virtually insensitive to the look-back window; increasing history from L = 5 to L = 2000 only increases cost from 2.05M to 4.60M FLOPs. (Right) FLOPs vs. variables N. The cost grows significantly with system size, spanning from 1.01M (N = 5) to nearly 2510.85M (N = 2000), driven by the variable-wise projections.

**Figure 9.** Comparison of complexity scaling. (Left) FLOPs vs. sequence length L (with N = 10 fixed). Mask2Cause demonstrates superior scalability with respect to sequence length; increasing history from L = 5 to L = 2000 only increases cost from 2.05M to 4.60M FLOPs. In contrast, baselines exhibit significantly steeper linear scaling: cMLP (0.13M to 51.20M), cLSTM (2.00M to 801.28M), and CUTS+ (16.37M to 6549.68M). (…

**Figure 10.** Hyperparameter sensitivity analysis. The model exhibits high robustness across wide ranges of hyperparameters. Notably, (b) shows that performance remains near-perfect even as sequence length increases to L = 50, demonstrating the embedding's ability to prioritize relevant recent history. … corresponding to P(i). In the case of ARIMAX, the set P(i) is treated as the collection of exogenous regressors, where…
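The Lorenz-96 system quoted in the Figure 3 caption text has a ground-truth graph that can be read off its equation: variable i depends on variables i-2, i-1, i, and i+1, with indices taken cyclically. The sketch below implements that right-hand side, the implied adjacency matrix, and a classical fourth-order Runge-Kutta step. The integrator, step size, and function names are assumptions for illustration, not the paper's data-generation code.

```python
def lorenz96_rhs(state, forcing):
    """dx_i/dt = (x_{i+1} - x_{i-2}) * x_{i-1} - x_i + F with cyclic indices."""
    p = len(state)
    return [
        (state[(i + 1) % p] - state[(i - 2) % p]) * state[(i - 1) % p] - state[i] + forcing
        for i in range(p)
    ]


def lorenz96_adjacency(p):
    """adjacency[i][j] == 1 when x_j appears in the equation for x_i."""
    adjacency = [[0] * p for _ in range(p)]
    for i in range(p):
        for j in (i - 2, i - 1, i, i + 1):
            adjacency[i][j % p] = 1
    return adjacency


def rk4_step(state, forcing, dt):
    """One classical Runge-Kutta step (an assumed choice of integrator)."""
    def shifted(base, slope, scale):
        return [b + scale * s for b, s in zip(base, slope)]

    k1 = lorenz96_rhs(state, forcing)
    k2 = lorenz96_rhs(shifted(state, k1, dt / 2), forcing)
    k3 = lorenz96_rhs(shifted(state, k2, dt / 2), forcing)
    k4 = lorenz96_rhs(shifted(state, k3, dt), forcing)
    return [
        x + dt / 6 * (a + 2 * b + 2 * c + d)
        for x, a, b, c, d in zip(state, k1, k2, k3, k4)
    ]


# The uniform state x_i = F is a fixed point of the dynamics.
fixed_point = [8.0] * 5
derivative_at_fixed_point = lorenz96_rhs(fixed_point, 8.0)
graph = lorenz96_adjacency(10)
```

Simulations would typically start from a small perturbation of the uniform state, since the fixed point itself produces no dynamics to learn from.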

Original abstract

Leveraging deep learning for causal discovery in time series remains challenging because existing neural methods predominantly rely on component-wise architectures that fail to capture shared system dynamics or employ decoupled post-hoc graph extraction that risks overfitting to spurious correlations. We propose Mask2Cause, an end-to-end framework that recovers the underlying causal graph directly during the forecasting forward pass. Our approach introduces an Inverted Variable Embedding and an Adjacency-Constrained Masked Attention mechanism, trained with homoscedastic or heteroscedastic objectives to capture causal influences in both mean and variance. Empirical results on diverse benchmarks, from synthetic chaotic dynamics to realistic biological simulations, demonstrate state-of-the-art causal discovery with significantly reduced parameter complexity compared to standard baselines. We further show that inferred causal structures can be used to reduce parameter count of forecasting models by more than 70% on average while maintaining predictive accuracy.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The architecture folds causal masking into the forecasting pass in a clean way, but the claim that the learned mask equals the true causal graph still rests on an unproven hope that prediction loss alone will select causal edges over other sparse predictors.

The paper puts forward Mask2Cause, which runs an inverted variable embedding through adjacency-constrained masked attention inside the same forward pass used for forecasting. Training uses either homoscedastic or heteroscedastic losses so the mask is supposed to capture influences on both mean and variance. They then use the extracted mask to prune a downstream forecaster and report more than 70 percent parameter reduction with little accuracy loss on synthetic chaotic systems and biological simulations.

That end-to-end integration is the concrete novelty; most prior neural causal methods either separate discovery from prediction or rely on post-hoc extraction that can overfit noise. If the numbers hold, the parameter savings could be useful for anyone already running attention-based time-series models who wants a built-in sparsity mechanism.

The central weakness is exactly the one the stress-test note flags. Nothing in the abstract shows an independent check, such as held-out interventional data or known ground-truth graphs, that the mask recovered under forecasting loss is causal rather than any other sparse structure that happens to predict the next few steps. Without that, the method could still deliver compact models while returning graphs that are only predictively useful. The abstract also gives no tables, no listed baselines, and no error bars, so it is hard to judge how large the claimed gains actually are.

This work is for researchers already inside neural causal discovery or constrained forecasting who want to test whether joint training can replace two-stage pipelines. A serious editor should send it to review so the full experiments and any identifiability arguments can be examined; the idea is coherent enough to merit referee time even if the causal guarantee needs more support.

Referee Report

3 major / 1 minor

Summary. The paper proposes Mask2Cause, an end-to-end framework for causal discovery in time series that recovers the causal graph directly during the forecasting forward pass using an Inverted Variable Embedding and an Adjacency-Constrained Masked Attention mechanism. It is trained with homoscedastic or heteroscedastic objectives to capture causal influences in both mean and variance. The work claims state-of-the-art performance on benchmarks from synthetic chaotic dynamics to biological simulations, along with the ability to reduce the parameter count of forecasting models by more than 70% on average while maintaining predictive accuracy.

Significance. If the central results hold, this would be a significant contribution to causal discovery in time series by providing an integrated approach that avoids component-wise architectures and post-hoc graph extraction. The parameter reduction aspect offers practical benefits for deploying forecasting models. The use of both mean and variance modeling for causal influences is a notable extension.

major comments (3)

[Abstract] The abstract asserts state-of-the-art results and 70% parameter reduction, yet supplies no quantitative tables, no description of the exact baselines, no error bars, and no discussion of how the method avoids learning non-causal shortcuts that still aid forecasting.
[Method] The central claim that the learned mask equals the causal graph rests on the assumption that forecasting performance is maximized only by true causal edges; without an identifiability proof, interventional data, or independent verification on ground-truth graphs, the optimization may recover spurious predictive structures instead.
[Abstract] No explicit causal regularizer is mentioned, raising the risk that the adjacency-constrained attention learns any sparse mask that improves short-term prediction rather than the true causal adjacency.

minor comments (1)

The abstract could benefit from a brief mention of the specific benchmarks used to allow readers to assess the diversity claimed.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback and for recognizing the potential significance of an integrated end-to-end approach to causal discovery in time series. We address each major comment point by point below, with proposed revisions to improve clarity and rigor.

Referee: [Abstract] The abstract asserts state-of-the-art results and 70% parameter reduction, yet supplies no quantitative tables, no description of the exact baselines, no error bars, and no discussion of how the method avoids learning non-causal shortcuts that still aid forecasting.

Authors: The abstract is a concise summary; all quantitative tables, baseline descriptions, and error bars appear in the Experiments section (Tables 1–4 and associated figures). The baselines are the standard methods listed in Section 4.1. We will revise the abstract to add one sentence briefly noting that the Inverted Variable Embedding and Adjacency-Constrained Masked Attention, together with the forecasting objective, are intended to prioritize causal edges over non-causal predictive shortcuts, with supporting evidence in the empirical results.

revision: partial

Referee: [Method] The central claim that the learned mask equals the causal graph rests on the assumption that forecasting performance is maximized only by true causal edges; without an identifiability proof, interventional data, or independent verification on ground-truth graphs, the optimization may recover spurious predictive structures instead.

Authors: We agree that a formal identifiability result is absent. The manuscript instead supplies independent verification on multiple datasets that contain known ground-truth causal graphs (synthetic chaotic systems and biological simulations), where the recovered masks achieve state-of-the-art structural accuracy. We will add an explicit paragraph in a new Limitations subsection discussing the reliance on empirical validation, the absence of interventional data, and the possibility of spurious predictive masks under certain conditions.

revision: partial

Referee: [Abstract] No explicit causal regularizer is mentioned, raising the risk that the adjacency-constrained attention learns any sparse mask that improves short-term prediction rather than the true causal adjacency.

Authors: The Adjacency-Constrained Masked Attention embeds the sparsity and causality constraint directly inside the attention operation, so that only edges retained in the mask participate in the forecasting computation; this is further shaped by the homoscedastic or heteroscedastic loss. We will expand the Method section to clarify this built-in regularization effect and add an ablation that removes the adjacency constraint, showing degraded causal-discovery metrics while forecasting performance may remain comparable.

revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation chain

The paper introduces Mask2Cause as an end-to-end neural architecture that embeds an adjacency-constrained masked attention mechanism inside a forecasting model, with the learned mask serving as the recovered causal graph. Training occurs via standard homoscedastic or heteroscedastic forecasting losses on time-series data. This setup does not reduce any claimed result to its inputs by construction: the mask is not defined as causal a priori, no parameter is fitted on a subset and then relabeled as a prediction of the full causal structure, and no load-bearing uniqueness theorem or ansatz is imported via self-citation. The method is presented as a modeling proposal whose validity is assessed empirically against ground-truth causal graphs on synthetic and biological benchmarks. The assumption that forecasting optimization will privilege true causal edges over other sparse predictive masks is a substantive (and potentially falsifiable) modeling hypothesis rather than a tautological reduction, placing any concern under correctness risk rather than circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; the ledger is populated from the high-level claims only. No explicit free parameters, axioms, or invented entities are stated in the provided text.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "We propose Mask2Cause, an end-to-end framework that recovers the underlying causal graph directly during the forecasting forward pass... Adjacency-Constrained Masked Attention mechanism, trained with homoscedastic or heteroscedastic objectives"

IndisputableMonolith/Foundation/BranchSelection.lean · branch_selection
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: $L = L_{\text{pred}} + \lambda \cdot \frac{1}{N(N-1)} \sum_{i \neq j} \hat{A}_{ij}$
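The quoted objective adds a sparsity penalty to the prediction loss. The sketch below reads the fraction as 1/(N(N-1)), that is, the mean of the off-diagonal entries of the estimated adjacency matrix, weighted by λ. That reading of the flattened formula, and the function names, are assumptions.

```python
def sparsity_penalty(a_hat):
    """Mean of the off-diagonal entries of the estimated adjacency matrix."""
    n = len(a_hat)
    off_diagonal_sum = sum(a_hat[i][j] for i in range(n) for j in range(n) if i != j)
    return off_diagonal_sum / (n * (n - 1))


def total_loss(pred_loss, a_hat, lam):
    """L = L_pred + lambda * mean off-diagonal edge strength."""
    return pred_loss + lam * sparsity_penalty(a_hat)


# Illustrative values only: every off-diagonal edge has strength 0.5.
a_hat = [[1.0, 0.5, 0.5], [0.5, 1.0, 0.5], [0.5, 0.5, 1.0]]
penalty = sparsity_penalty(a_hat)
loss = total_loss(1.0, a_hat, lam=0.1)
```

Because the diagonal is excluded, self-dependence is never penalised under this reading, and λ trades forecasting accuracy against the number and strength of cross-variable edges.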

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

45 extracted references · 45 canonical work pages

[1] Jakob Runge et al. Detecting and quantifying causal associations in large nonlinear time series datasets. Science Advances, 5(11), 2019.
[2] Roxana Pamfil et al. Dynotears: Structure learning from time-series data. In International Conference on Artificial Intelligence and Statistics, 2020.
[3] Aapo Hyvärinen, Kun Zhang, Shohei Shimizu, and Patrik O. Hoyer. Estimation of a structural vector autoregression model using non-gaussianity. Journal of Machine Learning Research, 11:1709–1731, 2010.
[4] Jonas Peters, D. Janzing, and B. Schölkopf. Causal inference on time series using restricted structural equation models. In Advances in Neural Information Processing Systems, volume 26, pages 154–162, 2013.
[5] Clive WJ Granger. Investigating causal relations by econometric models and cross-spectral methods. Econometrica: Journal of the Econometric Society, pages 424–438, 1969.
[6] Aurelie C Lozano, Naoki Abe, Yan Liu, and Saharon Rosset. Grouped graphical granger modeling methods for temporal causal modeling. In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 577–586. ACM, 2009.
[7] Helmut Lütkepohl. New Introduction to Multiple Time Series Analysis. Springer Science & Business Media, 2005.
[8] Alireza Sheikhattar, Sina Miran, Ji Liu, Jonathan B Fritz, Shihab A Shamma, Patrick O Kanold, and Behtash Babadi. Extracting neuronal functional network dynamics via adaptive granger causality analysis. Proceedings of the National Academy of Sciences, 115(17):E3869–E3878, 2018.
[9] Patrick A Stokes and Patrick L Purdon. A study of problems encountered in granger causality analysis from a neuroscience perspective. Proceedings of the National Academy of Sciences, 114(34):E7063–E7072, 2017.
[10] Raul Vicente, Michael Wibral, Michael Lindner, and Gordon Pipa. Transfer entropy—a model-free measure of effective connectivity for the neurosciences. Journal of Computational Neuroscience, 30(1):45–67, 2011.
[11] Olaf Sporns. Networks of the Brain. MIT Press, 2010.
[12] William F Sharpe, Gordon J Alexander, and Jeffery W Bailey. Investments. Prentice Hall, 1968.
[13] Saurabh Khanna and Vincent Y. F. Tan. Economy statistical recurrent units for inferring nonlinear granger causality. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=SyxV9ANFDH.
[14] Meike Nauta, Doina Bucur, and Christin Seifert. Causal discovery with attention-based convolutional neural networks. Machine Learning and Knowledge Extraction, 1(1):19, 2019.
[15] Alex Tank, Ian Covert, Nicholas Foti, Ali Shojaie, and Emily B Fox. Neural granger causality. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(8):4267–4279, 2021.
[16] Yuxiao Cheng, Lianglong Li, Tingxiong Xiao, Zongren Li, Jinli Suo, Kunlun He, and Qionghai Dai. Cuts+: High-dimensional causal discovery from irregular time-series. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 11525–11533, 2024.
[17] Lingbai Kong, Wengen Li, Hanchen Yang, Yichao Zhang, Jihong Guan, and Shuigeng Zhou. Causalformer: An interpretable transformer for temporal causal discovery. IEEE Transactions on Knowledge and Data Engineering, 2024.
[18] Wanqi Zhou, Shuanghao Bai, Shujian Yu, Qibin Zhao, and Badong Chen. Jacobian regularizer-based neural granger causality. arXiv preprint arXiv:2405.08779, 2024.
[19] Tingzhu Bi, Yicheng Pan, Xinrui Jiang, Huize Sun, Meng Ma, and Ping Wang. Uncle: Towards scalable dynamic causal discovery in non-linear temporal systems, 2025. URL https://arxiv.org/abs/2511.03168.
[20] Yusen Liu, Yong Wang, Yifan Yin, Tianqing Zhu, Xiufeng Liu, and Huan Huo. Causal Discovery with Inverted Self-attention for Multivariate Time Series. In Proceedings of the 29th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD), Lecture Notes in Computer Science, pages 167–179. Springer, 2025. doi: 10.1007/978-981-96-8183-9_14.
[21] Francis X Diebold and Kamil Yilmaz. Measuring financial asset return and volatility spillovers, with application to global equity markets. The Economic Journal, 119(534):158–171, 2009.
[22] Thilo Womelsdorf, Jan-Mathijs Schoffelen, Robert Oostenveld, Wolf Singer, Robert Desimone, Andreas K Engel, and Pascal Fries. Modulation of neuronal interactions through neuronal synchronization. Science, 316(5831):1609–1612, 2007.
[23] Yong Liu, Tengge Hu, Haoran Zhang, Haixu Wu, Shiyu Wang, Lintao Ma, and Mingsheng Long. itransformer: Inverted transformers are effective for time series forecasting. In The Twelfth International Conference on Learning Representations, 2024.
[24] Haixu Wu, Jiehui Xu, Jianmin Wang, and Mingsheng Long. Autoformer: Decomposition transformers with auto-correlation for long-term series forecasting. Advances in Neural Information Processing Systems, 34:22419–22430, 2021.
[25] Haoyi Zhou, Shanghang Zhang, Jieqi Peng, Shuai Zhang, Jianxin Li, Hui Xiong, and Wancai Zhang. Informer: Beyond efficient transformer for long sequence time-series forecasting. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pages 11106–11115, 2021.
[26] Christopher J Quinn, Negar Kiyavash, and Todd P Coleman. Directed information graphs. IEEE Transactions on Information Theory, 61(12):6887–6909, 2015.
[27] Hans Marko. The bidirectional communication theory-a generalization of information theory. IEEE Transactions on Communications, 21(12):1345–1351, 2003.
[28] Yuxiao Cheng, Runzhao Yang, Tingxiong Xiao, Zongren Li, Jinli Suo, Kunlun He, and Qionghai Dai. Cuts: Neural causal discovery from irregular time-series data. In ICLR, 2023.
[29] Alexis Bellot, Kim Branson, and Mihaela van der Schaar. Neural graphical modelling in continuous-time: consistency guarantees and algorithms. In International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=SsHBkfeRF9L.
[30] Edward De Brouwer, Adam Arany, Jaak Simm, and Yves Moreau. Latent convergent cross mapping. In International Conference on Learning Representations, 2020.
[31] Yuxiao Cheng, Ziqian Wang, Tingxiong Xiao, Qin Zhong, Jinli Suo, and Kunlun He. Causaltime: Realistically generated time-series for benchmarking of causal discovery. In The Twelfth International Conference on Learning Representations, 2024.
[32] Alireza Karimi and Mark R Paul. Extensive chaos in the lorenz-96 model. Chaos: An Interdisciplinary Journal of Nonlinear Science, 20(4), 2010.
[33] Robert J Prill, Daniel Marbach, Julio Saez-Rodriguez, Peter K Sorger, Leonidas G Alexopoulos, Xiaowei Xue, Neil D Clarke, Gregoire Altan-Bonnet, and Gustavo Stolovitzky. Towards a rigorous assessment of systems biology models: the dream3 challenges. PloS one, 5(2):e9202, 2010.
[34] Saurabh Khanna and Philippe Vincent-Lamarre. Economy statistical recurrent units for inferring nonlinear granger causality. IEEE Transactions on Pattern Analysis and Machine Intelligence, 43(7):2514–2528, 2021. doi: 10.1109/TPAMI.2021.3065601. arXiv:1802.05842. A Notation Table: Table 7 summarizes the primary mathematical notation used throughout the mai...
[35] Representational Contention and Capacity Limits. To use variable k as a lossless conduit for j, the network must allocate specific attention heads and latent subspace dimensions within k's token strictly for j's signal. However, under the Directed Information framework, if j is a true direct causal parent of i (i.e., $I(X_j \to X_i \mid X_{-\{i,j\}}) > 0$), then j contains unique...
[36] Shared-Weight Disentanglement. Mask2Cause applies a universally shared Feed-Forward Network and final projection head across all variable tokens. This constraint in the architecture allows us to argue that routing is not preferred by the model even when the network has plenty of representation capacity. Let us assume that node k could partition its latent v...
[37] Strict Positivity (Non-Determinism): We assume the true joint probability density of the system is strictly positive over the entire state space X: $P(x) > 0 \;\; \forall x \in X$ (13). Necessity: This condition guarantees that no variable is a strictly deterministic, noiseless function of another. In purely deterministic systems, information becomes redundant, creating unresolvab...
[38] Causal Sufficiency: We assume there are no unobserved (hidden) confounding variables that simultaneously influence two or more observed variables in our system X. Necessity: If a hidden confounder exists, the network's optimizer will observe a spurious statistical correlation between the variables and hallucinate a direct causal edge to minimize the forecast...
[39] Strict Temporal Precedence: We assume that causal influences strictly take time to propagate, precluding instantaneous (intra-step) causal effects. Necessity: Under this assumption, the state of variable i at time t is strictly determined by historical states and independent noise, rather than the concurrent states of other variables. This justifies our model...
[40] Causal Faithfulness: We assume the observed probability distribution is faithful to the causal graph G. Necessity: This ensures that true causal pathways do not feature "perfect cancellations" (e.g., a positive direct effect perfectly negated by a negative mediated effect). If unfaithful cancellations occurred, the variables would appear statistically indepen...
[41] Stationarity and Finite Markov Order: We assume the graph topology and system dynamics are invariant over time, and that the conditional transition probabilities satisfy a finite-order Markov property bounded by our look-back window L: $P(x_t \mid x_{0:t-1}) = P(x_t \mid x_{t-L:t-1})$ (14). Necessity: This guarantees that the complete causal footprint required to predict the next...
[42] Synthetic Generative Systems (VAR, Lorenz-96). We strictly separate hyperparameter tuning from final evaluation to prevent leakage. For each physical configuration (e.g., forcing constant F), we use the available 6 independent dataset realizations using distinct random seeds. Seed 0 is used exclusively as a Calibration Set for hyperparameter tuning. Seeds 1–...
[43] CausalTime (Static Real-World Proxies). For the fixed CausalTime datasets, where new samples cannot be generated, we employ a chronological split, using the first 20% of the data for tuning. To account for the variance inherent in neural network initialization, we train the model 5 times on the same dataset using different random seeds for weight initializa...
[44] DREAM3 (Gene Regulatory Networks). Consistent with the evaluation protocol of the baselines we compare against (which utilize the single fixed dataset provided by the challenge), we do not perform multi-seed averaging for this benchmark. We use the first 20% of the data for tuning. We report the final AUROC from the single best model found after tuning on t...
[45] Mixed Physics. Consistent with the protocol employed for DREAM3, we treat the Mixed Physics benchmark as a fixed dataset challenge. Baseline Configurations. For baseline methods that we ran locally (rather than quoting from published literature), we strictly utilized the hyperparameter configurations specified for each respective benchmark in their original ...