pith. machine review for the scientific record.

arxiv: 2605.07280 · v1 · submitted 2026-05-08 · 💻 cs.LG · cs.AI

Mask2Cause: Causal Discovery via Adjacency Constrained Causal Attention

Pith reviewed 2026-05-11 02:02 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords causal discovery · time series forecasting · masked attention · graph inference · neural networks · adjacency constraints · parameter reduction

The pith

Mask2Cause recovers the causal graph of a time series directly inside its forecasting model by constraining attention to an adjacency structure.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Mask2Cause as an end-to-end neural approach that discovers causal relationships among variables in time series data while it makes forecasts. Prior neural methods either process variables separately and miss shared dynamics, or train a predictor and then extract a graph afterward, which can latch onto correlations that do not reflect true causation. Mask2Cause instead builds an inverted variable embedding and an adjacency-constrained masked attention layer whose weights are shaped by the same forecasting objective, allowing the model to learn which past variables influence future values and variances. If the claim holds, the resulting graph can be used to prune a downstream forecaster to less than 30 percent of its original parameters while preserving accuracy on both synthetic chaotic systems and biological simulations.
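As a rough illustration of the "inverted" embedding idea, the sketch below turns each variable's whole look-back window into one token, rather than creating one token per time step. The function name `embed_variables`, the shared projection matrix, and the random inputs are assumptions made for illustration; the paper's actual embedding may be parameterized differently.

```python
import random


def embed_variables(history, weights):
    """Map an L x N history to N tokens of dimension d.

    history: list of L rows, each holding N variable readings.
    weights: hypothetical d x L projection shared by all variables.
    """
    num_steps = len(history)
    num_vars = len(history[0])
    tokens = []
    for j in range(num_vars):
        # Collect variable j's full look-back window.
        series = [history[t][j] for t in range(num_steps)]
        # Project the window to a d-dimensional token.
        token = [sum(row[t] * series[t] for t in range(num_steps)) for row in weights]
        tokens.append(token)
    return tokens


random.seed(0)
L, N, d = 5, 3, 4
history = [[random.gauss(0, 1) for _ in range(N)] for _ in range(L)]
weights = [[random.gauss(0, 1) for _ in range(L)] for _ in range(d)]
tokens = embed_variables(history, weights)
print(len(tokens), len(tokens[0]))  # one token per variable, each of size d
```

Because each token corresponds to exactly one variable, attention between tokens can be read as variable-to-variable influence, which is what makes an adjacency constraint on the attention meaningful.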

Core claim

The central claim is that an end-to-end training procedure using Inverted Variable Embedding together with Adjacency-Constrained Masked Attention, optimized under homoscedastic or heteroscedastic forecasting losses, recovers the underlying causal graph as a direct byproduct of the forward pass rather than through separate post-processing.
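To make the homoscedastic/heteroscedastic distinction concrete, here is a minimal sketch that assumes a Gaussian likelihood (the review text does not state the exact loss form). The homoscedastic case reduces to mean squared error; in the heteroscedastic case the model also predicts a log-variance per output. The function names are hypothetical.

```python
import math


def mse_loss(targets, means):
    """Homoscedastic objective: a fixed noise level, so only the mean matters."""
    return sum((y - m) ** 2 for y, m in zip(targets, means)) / len(targets)


def gaussian_nll(targets, means, log_vars):
    """Heteroscedastic objective: Gaussian negative log-likelihood with a
    predicted log-variance for every output (an assumed parameterization)."""
    total = 0.0
    for y, m, s in zip(targets, means, log_vars):
        total += 0.5 * (s + (y - m) ** 2 / math.exp(s) + math.log(2 * math.pi))
    return total / len(targets)


targets = [1.0, 2.0, 3.0]
means = [1.5, 2.0, 2.0]
print(mse_loss(targets, means))
print(gaussian_nll(targets, means, [0.0, 0.0, 0.0]))
```

With all log-variances fixed at zero, the negative log-likelihood equals half the MSE plus a constant, so the two objectives rank models identically; they diverge only once the variance is allowed to depend on the inputs.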

What carries the argument

The Adjacency-Constrained Masked Attention mechanism, which restricts each variable's attention to a learned or supplied adjacency mask so that only potential causal parents contribute to the prediction of mean and variance.
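The masking step can be sketched in a few lines, under the assumption of a fixed binary mask and single-head scaled dot-product attention. How the paper learns or relaxes the mask during training is not shown here, and `masked_attention` is a hypothetical name.

```python
import math


def masked_attention(queries, keys, values, adjacency):
    """Scaled dot-product attention over variable tokens.

    adjacency[i][j] == 1 means variable j is allowed to influence variable i.
    Each row must allow at least one parent (for example a self-loop).
    """
    n, d = len(queries), len(queries[0])
    outputs, all_weights = [], []
    for i in range(n):
        scores = [
            sum(q * k for q, k in zip(queries[i], keys[j])) / math.sqrt(d)
            if adjacency[i][j]
            else float("-inf")  # disallowed parents get zero weight after softmax
            for j in range(n)
        ]
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        outputs.append(
            [sum(w * v[c] for w, v in zip(weights, values)) for c in range(len(values[0]))]
        )
        all_weights.append(weights)
    return outputs, all_weights


tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
adjacency = [[1, 0, 0], [1, 1, 0], [0, 1, 1]]
outputs, weights = masked_attention(tokens, tokens, tokens, adjacency)
print(weights[0])  # variable 0 may only attend to itself
```

Setting masked scores to negative infinity before the softmax is the usual way to guarantee that excluded variables contribute exactly nothing to the prediction.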

If this is right

  • Causal structure is obtained without a separate graph-extraction step after training.
  • Both mean and variance of forecasts are modeled under the same causal attention constraints.
  • The recovered graph can be substituted into a new forecaster to reduce its parameter count by more than 70 percent on average while keeping predictive accuracy.
  • Performance holds across benchmarks ranging from chaotic dynamical systems to realistic biological simulations.
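The parameter-reduction claim can be illustrated with a toy count. The sketch assumes a purely linear lagged forecaster with one weight per (target, parent, lag) triple and a made-up sparse graph; neither is taken from the paper, and the resulting percentage is not one of its results.

```python
def count_params(adjacency, lags):
    """Weights in a linear lagged forecaster restricted to the graph's edges."""
    edges = sum(sum(row) for row in adjacency)
    return edges * lags


def ring_graph(p):
    """Hypothetical sparse graph: each variable depends on itself and one neighbour."""
    return [[1 if j in (i, (i - 1) % p) else 0 for j in range(p)] for i in range(p)]


p, lags = 10, 5
dense = [[1] * p for _ in range(p)]
sparse = ring_graph(p)
full = count_params(dense, lags)   # 10 * 10 * 5 weights
kept = count_params(sparse, lags)  # 20 edges * 5 lags
print(f"kept {kept}/{full} parameters ({kept / full:.0%})")
```

In this toy setting the saving is simply the graph's sparsity; for a nonlinear forecaster the relationship between edges and parameters would depend on the architecture.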

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same masking idea could be tested on non-time-series sequence tasks where known structural constraints exist.
  • If the attention recovers stable graphs across different forecast horizons, the method might support causal intervention queries without retraining.
  • Domain experts could supply partial adjacency masks to guide discovery in settings where some causal links are already known.

Load-bearing premise

That the attention weights shaped solely by a forecasting loss will converge to the true causal parents rather than to any other set of connections that happen to improve short-term prediction accuracy.

What would settle it

A synthetic time series generated from a known causal graph for which Mask2Cause returns a different adjacency matrix, yet the correspondingly pruned forecaster still achieves error no higher than the unpruned baseline.
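Comparing a recovered adjacency matrix with a known one is typically scored with AUROC, the metric quoted in the paper's Lorenz-96 figure. The sketch below computes it by pairwise comparison of edge scores; the example matrices are invented for illustration and are not outputs of Mask2Cause.

```python
def auroc(scores, labels):
    """Probability that a true edge is scored above a non-edge (ties count half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one edge and one non-edge")
    wins = 0.0
    for sp in pos:
        for sn in neg:
            if sp > sn:
                wins += 1.0
            elif sp == sn:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def flatten(matrix):
    return [v for row in matrix for v in row]


# Hypothetical ground-truth graph and learned edge scores.
truth = [[1, 0, 0], [1, 1, 0], [0, 1, 1]]
scores = [[0.9, 0.1, 0.2], [0.8, 0.7, 0.3], [0.65, 0.6, 0.95]]
result = auroc(flatten(scores), flatten(truth))
print(result)
```

A falsification test along the lines described above would pair a low AUROC with forecasting error that is no worse than the unpruned baseline.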

Figures

Figures reproduced from arXiv: 2605.07280 by Deepak N. Subramani, Omar Muhammad, Pasupuleti Dhruv Shivkant.

Figure 1: The Mask2Cause Architecture. The model maps a multivariate history into variable-specific tokens. A Transformer encoder, constrained by a learnable adjacency matrix, processes these tokens to predict the next-step state, discovering the causal graph through the forecasting objective.
… formulate this structural inference task as a predictive modeling problem, learning the adjacency matrix A by optimizing a fo…
Figure 2: Ground truth and predicted causal graph for Lorenz-96 (F = 40, T = 250, AUROC = 0.98)
Figure 3: Ground Truth for p=10 VAR system
E.2 Lorenz-96 Dataset (Nonlinear). The Lorenz-96 model is a continuous-time dynamic system often used to simulate complex atmospheric physics [32]. It consists of p variables $x_1, \ldots, x_p$ arranged in a cyclic dependency structure. The evolution of variable $x_i$ is governed by the system of ordinary differential equations (ODEs):
$$\frac{dx_i}{dt} = (x_{i+1} - x_{i-2})\,x_{i-1} - x_i + F \tag{17}$$
where…
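The Lorenz-96 equation quoted in the Figure 3 excerpt can be simulated with a short sketch. The right-hand side follows the equation directly; the Runge-Kutta integrator, the step size `dt`, and the initial condition are assumptions, and whether the paper counts the self-dependence on x_i as an edge is not stated in the excerpt.

```python
def lorenz96_derivative(x, forcing):
    """dx_i/dt = (x_{i+1} - x_{i-2}) * x_{i-1} - x_i + F, with cyclic indices."""
    p = len(x)
    return [
        (x[(i + 1) % p] - x[(i - 2) % p]) * x[(i - 1) % p] - x[i] + forcing
        for i in range(p)
    ]


def lorenz96_parents(p):
    """Variables appearing on the right-hand side for each x_i (self included)."""
    return [sorted({(i - 2) % p, (i - 1) % p, i, (i + 1) % p}) for i in range(p)]


def rk4_step(x, forcing, dt):
    """One classical fourth-order Runge-Kutta step."""
    def shifted(base, delta, scale):
        return [b + scale * d for b, d in zip(base, delta)]

    k1 = lorenz96_derivative(x, forcing)
    k2 = lorenz96_derivative(shifted(x, k1, dt / 2), forcing)
    k3 = lorenz96_derivative(shifted(x, k2, dt / 2), forcing)
    k4 = lorenz96_derivative(shifted(x, k3, dt), forcing)
    return [
        xi + dt / 6 * (a + 2 * b + 2 * c + d)
        for xi, a, b, c, d in zip(x, k1, k2, k3, k4)
    ]


def simulate(x0, forcing, dt, steps):
    states = [list(x0)]
    for _ in range(steps):
        states.append(rk4_step(states[-1], forcing, dt))
    return states


p, forcing = 10, 40.0          # p and F as in the paper's Lorenz-96 figures
x0 = [forcing] * p
x0[0] += 0.01                  # small perturbation away from the fixed point
trajectory = simulate(x0, forcing, dt=0.005, steps=250)  # dt is an assumed value
print(len(trajectory), lorenz96_parents(p)[0])
```

The state with every variable equal to F is a fixed point of the equation, which is why the sketch perturbs one coordinate before integrating.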
Figure 4: Ground Truth Causal Graphs for p=10 Lorenz-96 system
Figure 5: Ground Truth Causal Graphs for CausalTime datasets
Figure 6: Ground Truth Causal Graphs for DREAM3 networks
Figure 7: Ground Truth Causal Graphs for Mixed Physics system
Figure 8: Complexity Scaling (MSE). (Left) FLOPs vs. Sequence Length L. The model is virtually insensitive to the look-back window; increasing history from L = 5 to L = 2000 only increases cost from 2.05M to 4.60M FLOPs. (Right) FLOPs vs. Variables N. The cost grows significantly with system size, spanning from 1.01M (N = 5) to nearly 2510.85M (N = 2000), driven by the variable-wise projections.
F.1 Hardware and Com…
Figure 9: Comparison of Complexity Scaling. (Left) FLOPs vs. Sequence Length L (with N = 10 fixed). Mask2Cause demonstrates superior scalability with respect to sequence length; increasing history from L = 5 to L = 2000 only increases cost from 2.05M to 4.60M FLOPs. In contrast, baselines exhibit significantly steeper linear scaling: cMLP (0.13M to 51.20M), cLSTM (2.00M to 801.28M), and CUTS+ (16.37M to 6549.68M). (…
Figure 10: Hyperparameter Sensitivity Analysis. The model exhibits high robustness across wide ranges of hyperparameters. Notably, (b) shows that performance remains near-perfect even as sequence length increases to L = 50, demonstrating the embedding’s ability to prioritize relevant recent history.
… corresponding to P(i). In the case of ARIMAX, the set P(i) is treated as the collection of exogenous regressors, where…
Original abstract

Leveraging deep learning for causal discovery in time series remains challenging because existing neural methods predominantly rely on component-wise architectures that fail to capture shared system dynamics or employ decoupled post-hoc graph extraction that risks overfitting to spurious correlations. We propose $\textbf{Mask2Cause}$, an end-to-end framework that recovers the underlying causal graph directly during the forecasting forward pass. Our approach introduces an Inverted Variable Embedding and an Adjacency-Constrained Masked Attention mechanism, trained with homoscedastic or heteroscedastic objectives to capture causal influences in both mean and variance. Empirical results on diverse benchmarks, from synthetic chaotic dynamics to realistic biological simulations, demonstrate state-of-the-art causal discovery with significantly reduced parameter complexity compared to standard baselines. We further show that inferred causal structures can be used to reduce parameter count of forecasting models by more than 70% on average while maintaining predictive accuracy.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper proposes Mask2Cause, an end-to-end framework for causal discovery in time series that recovers the causal graph directly during the forecasting forward pass using an Inverted Variable Embedding and an Adjacency-Constrained Masked Attention mechanism. It is trained with homoscedastic or heteroscedastic objectives to capture causal influences in both mean and variance. The work claims state-of-the-art performance on benchmarks from synthetic chaotic dynamics to biological simulations, along with the ability to reduce the parameter count of forecasting models by more than 70% on average while maintaining predictive accuracy.

Significance. If the central results hold, this would be a significant contribution to causal discovery in time series by providing an integrated approach that avoids component-wise architectures and post-hoc graph extraction. The parameter reduction aspect offers practical benefits for deploying forecasting models. The use of both mean and variance modeling for causal influences is a notable extension.

major comments (3)
  1. [Abstract] The abstract asserts state-of-the-art results and 70% parameter reduction, yet supplies no quantitative tables, no description of the exact baselines, no error bars, and no discussion of how the method avoids learning non-causal shortcuts that still aid forecasting.
  2. [Method] The central claim that the learned mask equals the causal graph rests on the assumption that forecasting performance is maximized only by true causal edges; without an identifiability proof, interventional data, or independent verification on ground-truth graphs, the optimization may recover spurious predictive structures instead.
  3. [Abstract] No explicit causal regularizer is mentioned, raising the risk that the adjacency-constrained attention learns any sparse mask that improves short-term prediction rather than the true causal adjacency.
minor comments (1)
  1. The abstract could benefit from a brief mention of the specific benchmarks used to allow readers to assess the diversity claimed.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback and for recognizing the potential significance of an integrated end-to-end approach to causal discovery in time series. We address each major comment point by point below, with proposed revisions to improve clarity and rigor.

Point-by-point responses
  1. Referee: [Abstract] The abstract asserts state-of-the-art results and 70% parameter reduction, yet supplies no quantitative tables, no description of the exact baselines, no error bars, and no discussion of how the method avoids learning non-causal shortcuts that still aid forecasting.

    Authors: The abstract is a concise summary; all quantitative tables, baseline descriptions, and error bars appear in the Experiments section (Tables 1–4 and associated figures). The baselines are the standard methods listed in Section 4.1. We will revise the abstract to add one sentence briefly noting that the Inverted Variable Embedding and Adjacency-Constrained Masked Attention, together with the forecasting objective, are intended to prioritize causal edges over non-causal predictive shortcuts, with supporting evidence in the empirical results. revision: partial

  2. Referee: [Method] The central claim that the learned mask equals the causal graph rests on the assumption that forecasting performance is maximized only by true causal edges; without an identifiability proof, interventional data, or independent verification on ground-truth graphs, the optimization may recover spurious predictive structures instead.

    Authors: We agree that a formal identifiability result is absent. The manuscript instead supplies independent verification on multiple datasets that contain known ground-truth causal graphs (synthetic chaotic systems and biological simulations), where the recovered masks achieve state-of-the-art structural accuracy. We will add an explicit paragraph in a new Limitations subsection discussing the reliance on empirical validation, the absence of interventional data, and the possibility of spurious predictive masks under certain conditions. revision: partial

  3. Referee: [Abstract] No explicit causal regularizer is mentioned, raising the risk that the adjacency-constrained attention learns any sparse mask that improves short-term prediction rather than the true causal adjacency.

    Authors: The Adjacency-Constrained Masked Attention embeds the sparsity and causality constraint directly inside the attention operation, so that only edges retained in the mask participate in the forecasting computation; this is further shaped by the homoscedastic or heteroscedastic loss. We will expand the Method section to clarify this built-in regularization effect and add an ablation that removes the adjacency constraint, showing degraded causal-discovery metrics while forecasting performance may remain comparable. revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper introduces Mask2Cause as an end-to-end neural architecture that embeds an adjacency-constrained masked attention mechanism inside a forecasting model, with the learned mask serving as the recovered causal graph. Training occurs via standard homoscedastic or heteroscedastic forecasting losses on time-series data. This setup does not reduce any claimed result to its inputs by construction: the mask is not defined as causal a priori, no parameter is fitted on a subset and then relabeled as a prediction of the full causal structure, and no load-bearing uniqueness theorem or ansatz is imported via self-citation. The method is presented as a modeling proposal whose validity is assessed empirically against ground-truth causal graphs on synthetic and biological benchmarks. The assumption that forecasting optimization will privilege true causal edges over other sparse predictive masks is a substantive (and potentially falsifiable) modeling hypothesis rather than a tautological reduction, placing any concern under correctness risk rather than circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; the ledger is populated from the high-level claims only. No explicit free parameters, axioms, or invented entities are stated in the provided text.

pith-pipeline@v0.9.0 · 5454 in / 1221 out tokens · 30197 ms · 2026-05-11T02:02:32.093112+00:00 · methodology

Reference graph

Works this paper leans on

45 extracted references · 45 canonical work pages

  1. [1] Jakob Runge et al. Detecting and quantifying causal associations in large nonlinear time series datasets. Science Advances, 5(11), 2019.

  2. [2] Roxana Pamfil et al. Dynotears: Structure learning from time-series data. In International Conference on Artificial Intelligence and Statistics, 2020.

  3. [3] Aapo Hyvärinen, Kun Zhang, Shohei Shimizu, and Patrik O. Hoyer. Estimation of a structural vector autoregression model using non-gaussianity. Journal of Machine Learning Research, 11:1709–1731, 2010.

  4. [4] Jonas Peters, D. Janzing, and B. Schölkopf. Causal inference on time series using restricted structural equation models. In Advances in Neural Information Processing Systems, volume 26, pages 154–162, 2013.

  5. [5] Clive WJ Granger. Investigating causal relations by econometric models and cross-spectral methods. Econometrica: journal of the Econometric Society, pages 424–438, 1969.

  6. [6] Aurelie C Lozano, Naoki Abe, Yan Liu, and Saharon Rosset. Grouped graphical granger modeling methods for temporal causal modeling. In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 577–586. ACM, 2009.

  7. [7] Helmut Lütkepohl. New Introduction to Multiple Time Series Analysis. Springer Science & Business Media, 2005.

  8. [8] Alireza Sheikhattar, Sina Miran, Ji Liu, Jonathan B Fritz, Shihab A Shamma, Patrick O Kanold, and Behtash Babadi. Extracting neuronal functional network dynamics via adaptive granger causality analysis. Proceedings of the National Academy of Sciences, 115(17):E3869–E3878, 2018.

  9. [9] Patrick A Stokes and Patrick L Purdon. A study of problems encountered in granger causality analysis from a neuroscience perspective. Proceedings of the National Academy of Sciences, 114(34):E7063–E7072, 2017.

  10. [10] Raul Vicente, Michael Wibral, Michael Lindner, and Gordon Pipa. Transfer entropy—a model-free measure of effective connectivity for the neurosciences. Journal of Computational Neuroscience, 30(1):45–67, 2011.

  11. [11] Olaf Sporns. Networks of the Brain. MIT Press, 2010.

  12. [12] William F Sharpe, Gordon J Alexander, and Jeffery W Bailey. Investments. Prentice Hall, 1968.

  13. [13] Saurabh Khanna and Vincent Y. F. Tan. Economy statistical recurrent units for inferring nonlinear granger causality. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=SyxV9ANFDH.

  14. [14] Meike Nauta, Doina Bucur, and Christin Seifert. Causal discovery with attention-based convolutional neural networks. Machine Learning and Knowledge Extraction, 1(1):19, 2019.

  15. [15] Alex Tank, Ian Covert, Nicholas Foti, Ali Shojaie, and Emily B Fox. Neural granger causality. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(8):4267–4279, 2021.

  16. [16] Yuxiao Cheng, Lianglong Li, Tingxiong Xiao, Zongren Li, Jinli Suo, Kunlun He, and Qionghai Dai. Cuts+: High-dimensional causal discovery from irregular time-series. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 11525–11533, 2024.

  17. [17] Lingbai Kong, Wengen Li, Hanchen Yang, Yichao Zhang, Jihong Guan, and Shuigeng Zhou. Causalformer: An interpretable transformer for temporal causal discovery. IEEE Transactions on Knowledge and Data Engineering, 2024.

  18. [18] Wanqi Zhou, Shuanghao Bai, Shujian Yu, Qibin Zhao, and Badong Chen. Jacobian regularizer-based neural granger causality. arXiv preprint arXiv:2405.08779, 2024.

  19. [19] Tingzhu Bi, Yicheng Pan, Xinrui Jiang, Huize Sun, Meng Ma, and Ping Wang. Uncle: Towards scalable dynamic causal discovery in non-linear temporal systems, 2025. URL https://arxiv.org/abs/2511.03168

  20. [20] Yusen Liu, Yong Wang, Yifan Yin, Tianqing Zhu, Xiufeng Liu, and Huan Huo. Causal Discovery with Inverted Self-attention for Multivariate Time Series. In Proceedings of the 29th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD), Lecture Notes in Computer Science, pages 167–179. Springer, 2025. doi: 10.1007/978-981-96-8183-9_14

  21. [21] Francis X Diebold and Kamil Yilmaz. Measuring financial asset return and volatility spillovers, with application to global equity markets. The Economic Journal, 119(534):158–171, 2009.

  22. [22] Thilo Womelsdorf, Jan-Mathijs Schoffelen, Robert Oostenveld, Wolf Singer, Robert Desimone, Andreas K Engel, and Pascal Fries. Modulation of neuronal interactions through neuronal synchronization. Science, 316(5831):1609–1612, 2007.

  23. [23] Yong Liu, Tengge Hu, Haoran Zhang, Haixu Wu, Shiyu Wang, Lintao Ma, and Mingsheng Long. itransformer: Inverted transformers are effective for time series forecasting. In The Twelfth International Conference on Learning Representations, 2024.

  24. [24] Haixu Wu, Jiehui Xu, Jianmin Wang, and Mingsheng Long. Autoformer: Decomposition transformers with auto-correlation for long-term series forecasting. Advances in neural information processing systems, 34:22419–22430, 2021.

  25. [25] Haoyi Zhou, Shanghang Zhang, Jieqi Peng, Shuai Zhang, Jianxin Li, Hui Xiong, and Wancai Zhang. Informer: Beyond efficient transformer for long sequence time-series forecasting. In Proceedings of the AAAI conference on artificial intelligence, volume 35, pages 11106–11115, 2021.

  26. [26] Christopher J Quinn, Negar Kiyavash, and Todd P Coleman. Directed information graphs. IEEE Transactions on information theory, 61(12):6887–6909, 2015.

  27. [27] Hans Marko. The bidirectional communication theory-a generalization of information theory. IEEE Transactions on communications, 21(12):1345–1351, 2003.

  28. [28] Yuxiao Cheng, Runzhao Yang, Tingxiong Xiao, Zongren Li, Jinli Suo, Kunlun He, and Qionghai Dai. Cuts: Neural causal discovery from irregular time-series data. In ICLR, 2023.

  29. [29] Alexis Bellot, Kim Branson, and Mihaela van der Schaar. Neural graphical modelling in continuous-time: consistency guarantees and algorithms. In International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=SsHBkfeRF9L

  30. [30] Edward De Brouwer, Adam Arany, Jaak Simm, and Yves Moreau. Latent convergent cross mapping. In International Conference on Learning Representations, 2020.

  31. [31] Yuxiao Cheng, Ziqian Wang, Tingxiong Xiao, Qin Zhong, Jinli Suo, and Kunlun He. Causaltime: Realistically generated time-series for benchmarking of causal discovery. In The Twelfth International Conference on Learning Representations, 2024.

  32. [32] Alireza Karimi and Mark R Paul. Extensive chaos in the lorenz-96 model. Chaos: An interdisciplinary journal of nonlinear science, 20(4), 2010.

  33. [33] Robert J Prill, Daniel Marbach, Julio Saez-Rodriguez, Peter K Sorger, Leonidas G Alexopoulos, Xiaowei Xue, Neil D Clarke, Gregoire Altan-Bonnet, and Gustavo Stolovitzky. Towards a rigorous assessment of systems biology models: the dream3 challenges. PloS one, 5(2):e9202, 2010.

  34. [34] Saurabh Khanna and Philippe Vincent-Lamarre. Economy statistical recurrent units for inferring nonlinear granger causality. IEEE Transactions on Pattern Analysis and Machine Intelligence, 43(7):2514–2528, 2021. doi: 10.1109/TPAMI.2021.3065601. arXiv:1802.05842.
    A Notation Table. Table 7 summarizes the primary mathematical notation used throughout the mai...

  35. [35] Representational Contention and Capacity Limits. To use variable k as a lossless conduit for j, the network must allocate specific attention heads and latent subspace dimensions within k’s token strictly for j’s signal. However, under the Directed Information framework, if j is a true direct causal parent of i (i.e., $I(X_j \to X_i \mid X_{-\{i,j\}}) > 0$), then j contains unique...

  36. [36] Shared-Weight Disentanglement. Mask2Cause applies a universally shared Feed-Forward Network and final projection head across all variable tokens. This constraint in the architecture allows us to argue that routing is not preferred by the model even when the network has plenty of representation capacity. Let us assume that node k could partition its latent v...

  37. [37] Strict Positivity (Non-Determinism): We assume the true joint probability density of the system is strictly positive over the entire state space X: $P(x) > 0 \;\; \forall x \in X$ (13). Necessity: This condition guarantees that no variable is a strictly deterministic, noiseless function of another. In purely deterministic systems, information becomes redundant, creating unresolvab...

  38. [38] Causal Sufficiency: We assume there are no unobserved (hidden) confounding variables that simultaneously influence two or more observed variables in our system X. Necessity: If a hidden confounder exists, the network’s optimizer will observe a spurious statistical correlation between the variables and hallucinate a direct causal edge to minimize the forecast...

  39. [39] Strict Temporal Precedence: We assume that causal influences strictly take time to propagate, precluding instantaneous (intra-step) causal effects. Necessity: Under this assumption, the state of variable i at time t is strictly determined by historical states and independent noise, rather than the concurrent states of other variables. This justifies our model...

  40. [40] Causal Faithfulness: We assume the observed probability distribution is faithful to the causal graph G. Necessity: This ensures that true causal pathways do not feature "perfect cancellations" (e.g., a positive direct effect perfectly negated by a negative mediated effect). If unfaithful cancellations occurred, the variables would appear statistically indepen...

  41. [41] Stationarity and Finite Markov Order: We assume the graph topology and system dynamics are invariant over time, and that the conditional transition probabilities satisfy a finite-order Markov property bounded by our look-back window L: $P(x_t \mid x_{0:t-1}) = P(x_t \mid x_{t-L:t-1})$ (14). Necessity: This guarantees that the complete causal footprint required to predict the next...

  42. [42] Synthetic Generative Systems (VAR, Lorenz-96). We strictly separate hyperparameter tuning from final evaluation to prevent leakage. For each physical configuration (e.g., forcing constant F), we use the available 6 independent dataset realizations using distinct random seeds. Seed 0 is used exclusively as a Calibration Set for hyperparameter tuning. Seeds 1–...

  43. [43] CausalTime (Static Real-World Proxies). For the fixed CausalTime datasets, where new samples cannot be generated, we employ a chronological split, using the first 20% of the data for tuning. To account for the variance inherent in neural network initialization, we train the model 5 times on the same dataset using different random seeds for weight initializa...

  44. [44] DREAM3 (Gene Regulatory Networks). Consistent with the evaluation protocol of the baselines we compare against (which utilize the single fixed dataset provided by the challenge), we do not perform multi-seed averaging for this benchmark. We use the first 20% of the data for tuning. We report the final AUROC from the single best model found after tuning on t...

  45. [45] Mixed Physics. Consistent with the protocol employed for DREAM3, we treat the Mixed Physics benchmark as a fixed dataset challenge. Baseline Configurations. For baseline methods that we ran locally (rather than quoting from published literature), we strictly utilized the hyperparameter configurations specified for each respective benchmark in their original ...