pith · machine review for the scientific record

arXiv:2605.07057 · v1 · submitted 2026-05-08 · 💻 cs.LG

Recognition: 2 theorem links (Lean)

Integrating Causal DAGs in Deep RL: Activating Minimal Markovian States with Multi-Order Exposure

Authors on Pith: no claims yet

Pith reviewed 2026-05-11 02:29 UTC · model grok-4.3

classification 💻 cs.LG
keywords: causal reinforcement learning · Markov property · state representation · deep Q-networks · causal DAGs · minimal state · multi-order exposure · controlled redundancy

The pith

Given a longitudinal causal graph over observations, a procedure builds a provably minimal Markov state for RL, yet deep networks require multi-order historical exposures to realize any gains.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper supplies a procedure that turns a given longitudinal causal graph on observed variables into a state representation guaranteed to satisfy the Markov property. Real-world RL often lacks such states from raw data, so this construction fills a basic gap. Tests show that feeding only this minimal state into deep Q-networks produces no reliable improvement, which suggests that neural networks cannot exploit minimality on their own. MOSE (Multi-Order State Exposure) instead supplies the Q-function with several orders of historical versions of the same minimal state at once, and this combination beats both the pure minimal state and ordinary single-window policies on standard benchmarks and synthetic tasks. The results indicate that some controlled redundancy must accompany minimal causal states before their information becomes usable in practice.
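The construction itself is not reproduced on this page, but the general recipe it gestures at can be sketched: in a time-unrolled DAG, a candidate Markov state at time t is the set of current-or-lagged variables that are parents of the next observations and of the reward. The snippet below is a minimal illustration of that parent-set idea, assuming a dictionary encoding of the graph; the two-chain example graph and all variable names are hypothetical, not the paper's Algorithm F.1.

```python
# Hedged sketch of the parent-set recipe: collect every non-action
# parent of the next-step variables in a time-unrolled causal DAG.
# The graph below is a hypothetical two-chain example, not the
# paper's Algorithm F.1.

def minimal_state_candidate(parents, next_vars, action="A_t"):
    """parents maps each node to the set of its parents; next_vars are
    the time-(t+1) observations plus the reward node."""
    state = set()
    for v in next_vars:
        state |= {p for p in parents.get(v, set()) if p != action}
    return state

parents = {
    "X_t+1": {"X_t", "Z_t-1", "A_t"},  # lag-2 effect of Z on X
    "Z_t+1": {"Z_t"},
    "R_t":   {"X_t", "A_t"},           # reward depends on X_t and A_t
}

state = minimal_state_candidate(parents, ["X_t+1", "Z_t+1", "R_t"])
print(sorted(state))  # ['X_t', 'Z_t', 'Z_t-1']
```

On this toy graph the candidate picks up the lagged Z_t-1, exactly the kind of variable a current-observation-only state would drop.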

Core claim

Given a longitudinal causal graph over observed variables, a procedure constructs a provably minimal state representation that satisfies the Markov property. In deep RL, the minimal representation alone fails to improve performance, indicating that neural networks cannot directly exploit Markovian minimality. MOSE addresses this by feeding multi-order historical state constructions into the same Q-function. MOSE consistently outperforms both the minimal state construction and single-window policies on common benchmarks and synthetic datasets. Including the minimal representation alongside MOSE can further improve performance. These results establish that minimal sufficiency is not enough and that controlled redundancy is necessary to unlock the benefit of causal state information.

What carries the argument

MOSE (Multi-Order State Exposure), the mechanism that augments a minimal Markov state derived from a causal DAG with multiple historical orders and supplies them jointly to a standard Q-network.
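Read concretely, the mechanism admits a short sketch: concatenate several history windows of the minimal state and hand them all to one Q-network. The orders (1, 2, 4), the zero-padding at episode start, and the network shape below are assumptions made for illustration, not the paper's implementation.

```python
# Hedged sketch of multi-order state exposure feeding one Q-network.
# Orders, padding scheme, and layer sizes are illustrative assumptions;
# the paper's MOSE implementation may differ.
import torch
import torch.nn as nn

class MultiOrderQNet(nn.Module):
    def __init__(self, state_dim, n_actions, orders=(1, 2, 4)):
        super().__init__()
        self.orders = orders
        in_dim = state_dim * sum(orders)      # all windows concatenated
        self.q = nn.Sequential(
            nn.Linear(in_dim, 256), nn.ReLU(),
            nn.Linear(256, n_actions),
        )

    def forward(self, history):
        # history: (batch, T, state_dim), most recent state last
        windows = []
        for k in self.orders:
            w = history[:, -k:, :]            # last k minimal states
            if w.shape[1] < k:                # zero-pad short episodes
                pad = w.new_zeros(w.shape[0], k - w.shape[1], w.shape[2])
                w = torch.cat([pad, w], dim=1)
            windows.append(w.flatten(1))
        return self.q(torch.cat(windows, dim=1))

qnet = MultiOrderQNet(state_dim=8, n_actions=4)
history = torch.randn(32, 10, 8)      # batch of 10-step state histories
print(qnet(history).shape)            # torch.Size([32, 4])
```

The redundancy is deliberate: the current state appears in every window, which is one plausible reading of the "controlled redundancy" the review keeps returning to.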

If this is right

  • A provably minimal Markovian state can be derived directly from any accurate longitudinal causal DAG over observed variables.
  • Standard deep Q-networks cannot exploit the minimality of a state without additional structure such as multi-order histories.
  • Multi-order exposure of historical states produces higher performance than either the pure minimal state or single-window policies.
  • Combining the minimal state with MOSE yields further gains beyond MOSE alone.
  • The performance pattern holds on common RL benchmarks and on synthetic datasets with known causal structure (a sketch of such a synthetic task follows this list).
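On the last point, a synthetic task with known causal structure can be as simple as a linear-Gaussian structural equation model whose unrolled DAG, and hence whose true minimal state, is fixed by construction. The coefficients, lags, and reward rule below are invented for illustration and are not the paper's benchmark.

```python
# Hedged sketch of a synthetic benchmark with known causal structure:
# a linear-Gaussian SCM whose unrolled DAG (and hence whose true
# minimal state) is fixed by construction. All coefficients, lags,
# and the reward rule are illustrative assumptions.
import numpy as np

def simulate(T, rng):
    X, Z, R = np.zeros(T), np.zeros(T), np.zeros(T)
    A = rng.integers(0, 2, size=T)            # random binary actions
    for t in range(2, T):
        Z[t] = 0.9 * Z[t - 1] + 0.1 * rng.standard_normal()
        # X_t has parents X_{t-1}, Z_{t-2}, and A_{t-1}:
        X[t] = (0.7 * X[t - 1] + 0.5 * Z[t - 2]
                + 0.8 * A[t - 1] + 0.1 * rng.standard_normal())
        R[t] = X[t] - 0.2 * A[t]              # reward from current X, A
    return X, Z, A, R

X, Z, A, R = simulate(T=200, rng=np.random.default_rng(0))
# Known minimal state at time t for this graph: (X_t, Z_t, Z_{t-1}).
```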

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the causal graph must be learned from data rather than provided exactly, small errors could turn the derived state non-Markovian and erase the theoretical guarantee.
  • The same principle of controlled redundancy might apply to other deep RL architectures such as actor-critic or model-based methods.
  • An adaptive choice of which historical orders to expose could replace the fixed multi-order scheme and reduce unnecessary computation.
  • The construction could be tested in partially observable settings where some variables in the causal graph are hidden.

Load-bearing premise

An accurate longitudinal causal graph over the observed variables is supplied as input, and standard neural Q-networks can directly exploit the minimal state when it is augmented with multi-order histories without further architectural changes.

What would settle it

If, on a controlled benchmark where the true causal graph is known exactly, MOSE produces no improvement or produces worse performance than a non-causal baseline that ignores the graph, the claim that multi-order exposure unlocks the benefit of causal states would be refuted.
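Operationally, that test reduces to a per-seed comparison between the two agents. A minimal sketch of the decision rule, with placeholder per-seed returns standing in for real training runs (the numbers below are synthetic placeholders, not the paper's results):

```python
# Hedged sketch of the settling experiment's decision rule. The return
# arrays are placeholders; in the real test they would be per-seed mean
# returns of MOSE and of a non-causal baseline on a benchmark whose
# true causal graph is supplied exactly.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
mose_returns = rng.normal(1.2, 0.1, size=10)       # placeholder values
baseline_returns = rng.normal(1.0, 0.1, size=10)   # placeholder values

t_stat, p = stats.ttest_ind(mose_returns, baseline_returns,
                            equal_var=False)       # Welch's t-test
print(f"MOSE {mose_returns.mean():.2f} vs baseline "
      f"{baseline_returns.mean():.2f} (Welch p = {p:.3f})")
# Refutation condition: MOSE fails to beat (or loses to) the baseline
# with a convincing p-value despite the exact graph being available.
```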

Figures

Figures reproduced from arXiv: 2605.07057 by Jacqueline Maasch, Jiamin Xu, Kyra Gan.

Figure 1. From DAG to MDP. (1) Time-series causal DAG shown at time t ∈ [0, 2], where S_0, S_1, and S_2 are valid states selected by Algorithm F.1. (2) Causal DAG representation of the corresponding MDP [58], a subgraph of (1). For simplicity, the action A_t is assumed to affect only X_{t+1}; when this assumption does not hold, Theorem 4.1 must be modified to also include the parents of actions in the state …
Figure 2. Average per-episode reward, averaged over 17 instances with 2 repetitions each.
Figure 3. Results on GOPHER.
Original abstract

Online reinforcement learning (RL) relies on the Markov property for guaranteed performance, but real-world applications often lack well-defined states given raw observed variables. While causal RL has attracted growing interest, existing work typically assumes Markovian states are provided and focuses on using causality to accelerate learning, leaving a fundamental gap: given a longitudinal causal graph over observed variables, how does one construct MDP states that provably satisfy the Markov property? We address this by providing a procedure that constructs a provably minimal state representation. In deep RL, we observe that the minimal representation alone empirically fails to improve performance, indicating that neural networks cannot directly exploit Markovian minimality. To address this, we propose MOSE (Multi-Order State Exposure), which feeds multi-order historical state constructions into the same Q-function. MOSE consistently outperforms both the minimal state construction and single-window policies on common benchmarks and synthetic datasets. Including the minimal representation alongside MOSE can further improve performance. Our results establish a core principle for causal deep RL: minimal sufficiency is not enough, and controlled redundancy is necessary to unlock the benefit of causal state information.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims to provide a procedure that constructs a provably minimal Markovian state representation from a longitudinal causal graph over observed variables for use in RL. It observes that this minimal representation alone does not improve performance in deep RL, and proposes MOSE which feeds multi-order historical state constructions into the Q-function. MOSE is reported to consistently outperform the minimal state construction and single-window policies on common benchmarks and synthetic datasets. The paper concludes that minimal sufficiency is not enough and controlled redundancy is necessary to unlock the benefit of causal state information in deep RL.

Significance. If the results hold, this work is significant for causal deep RL: it provides a principled way to derive minimal Markov states from causal DAGs and highlights a key practical obstacle to using minimal representations in neural RL agents. The proposal of MOSE as a simple way to add controlled redundancy is a useful contribution. The paper credits prior work on using causal graphs in RL but extends it to state construction and to empirical validation of the redundancy principle. However, the significance depends on the rigor of the proof and experiments, which are not detailed in the abstract.

major comments (2)
  1. [Construction procedure (likely §3)] The claim that the procedure constructs a 'provably minimal state representation' is central but the manuscript does not supply the algorithm steps or a proof sketch in the provided text, preventing assessment of whether the construction indeed satisfies the Markov property without additional assumptions.
  2. [Empirical evaluation (likely §5)] The assertion that 'MOSE consistently outperforms' both minimal and single-window policies lacks any quantitative results, error bars, baseline details, or statistical tests in the abstract, which is load-bearing for the claim that minimal sufficiency is not enough.
minor comments (2)
  1. [Abstract] The abstract mentions 'common benchmarks and synthetic datasets' but does not specify which ones, reducing clarity.
  2. [Notation] The term 'multi-order historical state constructions' is introduced without a formal definition or equation in the summary text.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and the recommendation for major revision. We address each major comment below by referencing the relevant sections of the full manuscript and outlining the revisions we will make to improve clarity and accessibility of the key claims.

Point-by-point responses
  1. Referee: [Construction procedure (likely §3)] The claim that the procedure constructs a 'provably minimal state representation' is central but the manuscript does not supply the algorithm steps or a proof sketch in the provided text, preventing assessment of whether the construction indeed satisfies the Markov property without additional assumptions.

    Authors: Section 3 of the full manuscript presents the complete construction procedure as an algorithm that extracts a minimal set of variables from the longitudinal causal DAG such that the resulting state satisfies the Markov property for the RL process. The section includes pseudocode for the procedure and a proof sketch based on d-separation and the definition of minimal sufficient statistics for the transition and reward functions (an illustrative d-separation check follows these responses). The proof requires no assumptions beyond the given causal graph being a faithful representation of the data-generating process. We will revise the manuscript to move the proof sketch from the appendix into the main text and add an explicit statement of the minimality guarantee. revision: yes

  2. Referee: [Empirical evaluation (likely §5)] The assertion that 'MOSE consistently outperforms' both minimal and single-window policies lacks any quantitative results, error bars, baseline details, or statistical tests in the abstract, which is load-bearing for the claim that minimal sufficiency is not enough.

    Authors: Section 5 reports the full experimental results on standard RL benchmarks and synthetic datasets, including mean returns with standard error bars over 10 random seeds, explicit baseline implementations (minimal state, fixed-window history, and standard DQN), and statistical significance tests confirming MOSE's improvements. To make the abstract self-contained and address the load-bearing nature of the claim, we will revise it to include concise quantitative highlights such as average performance gains and significance levels. revision: yes
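Response 1's appeal to d-separation can be illustrated independently of the paper's algorithm: a candidate state is Markov when, in the time-unrolled graph, the next-step variables are d-separated from the remaining past given the state and the current action. The sketch below checks this on a hypothetical two-lag graph; node names and edges are invented for illustration, and it assumes NetworkX 3.3+ for nx.is_d_separator (earlier releases expose the same check as nx.d_separated).

```python
# Hedged sketch: Markov check for a candidate state via d-separation
# on a hypothetical unrolled graph (not the paper's construction).
# Assumes networkx >= 3.3 for is_d_separator.
import networkx as nx

G = nx.DiGraph([
    ("X0", "X1"), ("X1", "X2"),   # observation chain
    ("Z0", "Z1"), ("Z1", "Z2"),   # second observed chain
    ("Z0", "X2"),                 # lag-2 cross effect
    ("A1", "X2"),                 # action affects only the next X
])
future = {"X2", "Z2"}             # next-step variables at t = 2

def is_markov(candidate, past):
    # Future independent of the remaining past, given state + action.
    return nx.is_d_separator(G, future, past, candidate | {"A1"})

print(is_markov({"X1", "Z1", "Z0"}, {"X0"}))   # True: lagged Z0 included
print(is_markov({"X1", "Z1"}, {"X0", "Z0"}))   # False: edge Z0 -> X2 open
```

The second call fails precisely because the lagged variable was dropped from the state, which is the failure mode the first editorial extension above worries about when the graph is learned imperfectly.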

Circularity Check

0 steps flagged

No significant circularity detected in derivation chain

full rationale

The paper's core claim is a procedure that, given an external longitudinal causal graph over observed variables as input, constructs a provably minimal state representation satisfying the Markov property. This is presented as derived from causal graph properties rather than fitted parameters or self-referential definitions. The subsequent observation that the minimal state alone fails to improve deep RL performance (leading to the MOSE multi-order augmentation) is an empirical finding, not a mathematical reduction to the input. No load-bearing equations, uniqueness theorems, or ansatzes are shown to collapse by construction to the provided causal graph or to self-citations; the central result remains independent of the fitted Q-networks and rests on the external graph plus experimental validation. This is the expected honest non-finding for a method whose inputs are stated as given and whose outputs are not tautological renamings of those inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claims rest on the assumption that a correct longitudinal causal graph is supplied and that the construction rule derived from it yields a state that neural networks can use once augmented with controlled redundancy.

axioms (1)
  • domain assumption: A longitudinal causal graph over observed variables is given and correctly encodes the temporal dependencies.
    The entire construction procedure begins from this supplied graph.
invented entities (1)
  • MOSE (Multi-Order State Exposure): no independent evidence
    purpose: feeding multiple historical orders of the minimal state into the same Q-function.
    New technique introduced to overcome the empirical failure of the minimal state alone.

pith-pipeline@v0.9.0 · 5509 in / 1491 out tokens · 48118 ms · 2026-05-11T02:29:21.405332+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: the paper's claim is directly supported by a theorem in the formal canon.
  • supports: the theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: the paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: the paper appears to rely on the theorem as machinery.
  • contradicts: the paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

94 extracted references · 94 canonical work pages

  1. [1] Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction, volume 1. MIT Press, Cambridge, 1998.
  2. [2] Matthieu Komorowski, Leo A. Celi, Omar Badawi, Anthony C. Gordon, and A. Aldo Faisal. The artificial intelligence clinician learns optimal treatment strategies for sepsis in intensive care. Nature Medicine, 24(11):1716–1720, 2018.
  3. [3] Nhat Chung, Taisei Hanyu, Toan Nguyen, Huy Le, Frederick Bumgarner, Duy Minh Ho Nguyen, Khoa Vo, Kashu Yamazaki, Chase Rainwater, Tung Kieu, et al. Rethinking progression of memory state in robotic manipulation: An object-centric perspective. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 3407–3415, 2026.
  4. [4] George E. Monahan. State of the art—a survey of partially observable Markov decision processes: Theory, models, and algorithms. Management Science, 28(1):1–16, 1982.
  5. [5] Matthew J. Hausknecht and Peter Stone. Deep recurrent Q-learning for partially observable MDPs. In AAAI Fall Symposia, volume 45, page 141, 2015.
  6. [6] Steven Kapturowski, Georg Ostrovski, John Quan, Remi Munos, and Will Dabney. Recurrent experience replay in distributed reinforcement learning. In International Conference on Learning Representations, 2018.
  7. [7] Adrià Puigdomènech Badia, Bilal Piot, Steven Kapturowski, Pablo Sprechmann, Alex Vitvitskyi, Zhaohan Daniel Guo, and Charles Blundell. Agent57: Outperforming the Atari human benchmark. In International Conference on Machine Learning, pages 507–517. PMLR, 2020.
  8. [8] Lili Chen, Kevin Lu, Aravind Rajeswaran, Kimin Lee, Aditya Grover, Misha Laskin, Pieter Abbeel, Aravind Srinivas, and Igor Mordatch. Decision transformer: Reinforcement learning via sequence modeling. Advances in Neural Information Processing Systems, 34:15084–15097, 2021.
  9. [9] Qinqing Zheng, Amy Zhang, and Aditya Grover. Online decision transformer. In International Conference on Machine Learning, pages 27042–27059. PMLR, 2022.
  10. [10] Chengchun Shi, Runzhe Wan, Rui Song, Wenbin Lu, and Ling Leng. Does the Markov decision process fit the data: Testing for the Markov property in sequential decision making. In International Conference on Machine Learning, pages 8807–8817. PMLR, 2020.
  11. [11] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin Riedmiller, Andreas K. Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.
  12. [12] Matteo Hessel, Joseph Modayil, Hado van Hasselt, Tom Schaul, Georg Ostrovski, Will Dabney, Dan Horgan, Bilal Piot, Mohammad Azar, and David Silver. Rainbow: Combining improvements in deep reinforcement learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32, 2018.
  13. [13] Łukasz Kaiser, Mohammad Babaeizadeh, Piotr Miłoś, Błażej Osiński, Roy H. Campbell, Konrad Czechowski, Dumitru Erhan, Chelsea Finn, Piotr Kozakowski, Sergey Levine, et al. Model based reinforcement learning for Atari. In International Conference on Learning Representations, 2020.
  14. [14] Geraud Nangue Tasse, Matthew Riemer, Benjamin Rosman, and Tim Klinger. Finding the framestack: Learning what to remember for non-Markovian reinforcement learning. In Finding the Frame Workshop at RLC 2025, 2025.
  15. [15] Mingxuan Li, Junzhe Zhang, and Elias Bareinboim. Automatic reward shaping from confounded offline data. arXiv preprint arXiv:2505.11478, 2025.
  16. [16] Mingxuan Li, Junzhe Zhang, and Elias Bareinboim. Confounding robust deep reinforcement learning: A causal approach. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025.
  17. [17] Chiyuan Zhang, Oriol Vinyals, Remi Munos, and Samy Bengio. A study on overfitting in deep reinforcement learning. arXiv preprint arXiv:1804.06893, 2018.
  18. [18] Evgenii Nikishin, Max Schwarzer, Pierluca D’Oro, Pierre-Luc Bacon, and Aaron Courville. The primacy bias in deep reinforcement learning. In International Conference on Machine Learning, pages 16828–16847. PMLR, 2022.
  19. [19] Ghada Sokar, Rishabh Agarwal, Pablo Samuel Castro, and Utku Evci. The dormant neuron phenomenon in deep reinforcement learning. In International Conference on Machine Learning, pages 32145–32168. PMLR, 2023.
  20. [20] Karl Cobbe, Oleg Klimov, Chris Hesse, Taehoon Kim, and John Schulman. Quantifying generalization in reinforcement learning. In International Conference on Machine Learning, pages 1282–1289. PMLR, 2019.
  21. [21] M. Fortunato, M. G. Azar, B. Piot, J. Menick, M. Hessel, I. Osband, A. Graves, V. Mnih, R. Munos, D. Hassabis, O. Pietquin, C. Blundell, and S. Legg. Noisy networks for exploration. In International Conference on Learning Representations (ICLR), 2018.
  22. [22] Misha Laskin, Kimin Lee, Adam Stooke, Lerrel Pinto, Pieter Abbeel, and Aravind Srinivas. Reinforcement learning with augmented data. Advances in Neural Information Processing Systems, 33:19884–19895, 2020.
  23. [23] Michael Laskin, Aravind Srinivas, and Pieter Abbeel. CURL: Contrastive unsupervised representations for reinforcement learning. In International Conference on Machine Learning, pages 5639–5650. PMLR, 2020.
  24. [24] Adam Stooke, Kimin Lee, Pieter Abbeel, and Michael Laskin. Decoupling representation learning from reinforcement learning. In International Conference on Machine Learning, pages 9870–9879. PMLR, 2021.
  25. [25] Minne Li, Lisheng Wu, Jun Wang, and Haitham Bou Ammar. Multi-view reinforcement learning. Advances in Neural Information Processing Systems, 32, 2019.
  26. [26] Boyuan Chen, Pieter Abbeel, and Deepak Pathak. Unsupervised learning of visual 3D keypoints for control. In International Conference on Machine Learning, pages 1539–1549. PMLR, 2021.
  27. [27] Rishabh Jangir, Nicklas Hansen, Sambaran Ghosal, Mohit Jain, and Xiaolong Wang. Look closer: Bridging egocentric and third-person views with transformers for robotic manipulation. IEEE Robotics and Automation Letters, 7(2):3046–3053, 2022.
  28. [28] HyeongJoo Hwang, Seokin Seo, Youngsoo Jang, Sungyoon Kim, Geon-Hyeong Kim, Seunghoon Hong, and Kee-Eung Kim. Information-theoretic state space model for multi-view reinforcement learning. In Proceedings of the 40th International Conference on Machine Learning, pages 14249–14282, 2023.
  29. [29] Bin Chen and Yongmiao Hong. Testing for the Markov property in time series. Econometric Theory, 28(1):130–178, 2012.
  30. [30] Yunzhe Zhou, Chengchun Shi, Lexin Li, and Qiwei Yao. Testing for the Markov property in time series via deep conditional generative learning. Journal of the Royal Statistical Society Series B: Statistical Methodology, 85(4):1204–1222, 2023.
  31. [31] Lutong Zou, Ziping Xu, Daiqi Gao, and Susan Murphy. Causal directed acyclic graph-informed reward design. In The Multi-disciplinary Conference on Reinforcement Learning and Decision Making (RLDM), 2025.
  32. [32] Pragya Srivastava, Harman Singh, Rahul Madhavan, Gandharv Patil, Sravanti Addepalli, Arun Suggala, Rengarajan Aravamudhan, Soumya Sharma, Anirban Laha, Aravindan Raghuveer, et al. Robust reward modeling via causal rubrics. In ICML 2025 Workshop on Models of Human Feedback for AI Alignment, 2025.
  33. [33] Mateo Juliani, Mingxuan Li, and Elias Bareinboim. Confounding robust continuous control via automatic reward shaping. In Proceedings of the 25th International Conference on Autonomous Agents and Multiagent Systems, 2026.
  34. [34] Junzhe Zhang and Elias Bareinboim. Designing optimal dynamic treatment regimes: A causal reinforcement learning approach. In International Conference on Machine Learning. PMLR, 2020.
  35. [35] Zizhao Wang, Xuesu Xiao, Zifan Xu, Yuke Zhu, and Peter Stone. Causal dynamics learning for task-independent state abstraction. In International Conference on Machine Learning, pages 23151–23180. PMLR, 2022.
  36. [36] Amy Zhang, Clare Lyle, Shagun Sodhani, Angelos Filos, Marta Kwiatkowska, Joelle Pineau, Yarin Gal, and Doina Precup. Invariant causal prediction for block MDPs. In International Conference on Machine Learning, pages 11214–11224. PMLR, 2020.
  37. [37] Zizhao Wang, Caroline Wang, Xuesu Xiao, Yuke Zhu, and Peter Stone. Building minimal and reusable causal state abstractions for reinforcement learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 15778–15786, 2024.
  38. [38] Daiqi Gao, Hsin-Yu Lai, Predrag Klasnja, and Susan Murphy. Harnessing causality in reinforcement learning with bagged decision times. In The 28th International Conference on Artificial Intelligence and Statistics, 2025.
  39. [39] David Andre, Stuart J. Russell, et al. State abstraction for programmable reinforcement learning agents. In Annual AAAI Conference on Artificial Intelligence, 2002.
  40. [40] David Abel, Dilip Arumugam, Lucas Lehnert, and Michael Littman. State abstractions for lifelong reinforcement learning. In International Conference on Machine Learning, pages 10–19. PMLR, 2018.
  41. [41] Tianwei Ni, Benjamin Eysenbach, Erfan Seyedsalehi, Michel Ma, Clement Gehring, Aditya Mahajan, and Pierre-Luc Bacon. Bridging state and history representations: Understanding self-predictive RL. arXiv preprint arXiv:2401.08898, 2024.
  42. [42] Timothée Lesort, Natalia Díaz-Rodríguez, Jean-François Goudou, and David Filliat. State representation learning for control: An overview. Neural Networks, 108:379–392, 2018.
  43. [43] Kei Ota, Tomoaki Oiki, Devesh Jha, Toshisada Mariyama, and Daniel Nikovski. Can increasing input dimensionality improve deep reinforcement learning? In International Conference on Machine Learning, pages 7424–7433. PMLR, 2020.
  44. [44] Deepak Pathak, Pulkit Agrawal, Alexei A. Efros, and Trevor Darrell. Curiosity-driven exploration by self-supervised prediction. In International Conference on Machine Learning, pages 2778–2787. PMLR, 2017.
  45. [45] Denis Yarats, Ilya Kostrikov, and Rob Fergus. Image augmentation is all you need: Regularizing deep reinforcement learning from pixels. In International Conference on Learning Representations (ICLR), 2021.
  46. [46] Pierre Sermanet, Corey Lynch, Yevgen Chebotar, Jasmine Hsu, Eric Jang, Stefan Schaal, Sergey Levine, and Google Brain. Time-contrastive networks: Self-supervised learning from video. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pages 1134–1141. IEEE, 2018.
  47. [47] Michael Laskin, Hao Liu, Xue Bin Peng, Denis Yarats, Aravind Rajeswaran, and Pieter Abbeel. Unsupervised reinforcement learning with contrastive intrinsic control. Advances in Neural Information Processing Systems, 35:34478–34491, 2022.
  48. [48] Benjamin Eysenbach, Tianjun Zhang, Sergey Levine, and Russ R. Salakhutdinov. Contrastive learning as goal-conditioned reinforcement learning. Advances in Neural Information Processing Systems, 35:35603–35620, 2022.
  49. [49] Judea Pearl. Causality: Models, Reasoning, and Inference. Cambridge University Press, Cambridge, UK, 2000.
  50. [50] Elias Bareinboim, Juan D. Correa, Duligur Ibeling, and Thomas Icard. On Pearl’s Hierarchy and the Foundations of Causal Inference, pages 507–556. Association for Computing Machinery, New York, NY, USA, 1st edition, 2022. ISBN 9781450395861. URL https://doi.org/10.1145/3501714.3501743.
  51. [51] Judea Pearl. Probabilities of causation: Three counterfactual interpretations and their identification. Synthese, 121:93–149, 1999.
  52. [52] Linbo Wang, Thomas Richardson, and James Robins. Causal inference: A tale of three frameworks. arXiv preprint arXiv:2511.21516, 2025.
  53. [53] Jakob Runge. Causal network reconstruction from time series: From theoretical assumptions to practical estimation. Chaos: An Interdisciplinary Journal of Nonlinear Science, 28(7), 2018.
  54. [54] Charles K. Assaad, Emilie Devijver, and Eric Gaussier. Survey and evaluation of causal discovery methods for time series. Journal of Artificial Intelligence Research, 73:767–819, 2022.
  55. [55] Uzma Hasan, Emam Hossain, and Md Osman Gani. A survey on causal discovery methods for IID and time series data. Transactions on Machine Learning Research, 2023.
  56. [56] Jonas Peters, Dominik Janzing, and Bernhard Schölkopf. Causal inference on time series using restricted structural equation models. Advances in Neural Information Processing Systems, 26, 2013.
  57. [57] Peter Auer, Thomas Jaksch, and Ronald Ortner. Near-optimal regret bounds for reinforcement learning. In Advances in Neural Information Processing Systems, volume 21. Curran Associates, Inc., 2008.
  58. [58] Elias Bareinboim, Sanghack Lee, and Junzhe Zhang. An introduction to causal reinforcement learning. arXiv preprint arXiv:2101.06498, 2025.
  59. [59] Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In International Conference on Machine Learning, pages 1861–1870. PMLR, 2018.
  60. [60] Marlos C. Machado, Marc G. Bellemare, Erik Talvitie, Joel Veness, Matthew Hausknecht, and Michael Bowling. Revisiting the arcade learning environment: Evaluation protocols and open problems for general agents (extended abstract). In Proceedings of the 27th International Joint Conference on Artificial Intelligence, pages 5573–5577, 2018. ISBN 9780999241127.
  61. [61] Hado van Hasselt, Arthur Guez, and David Silver. Deep reinforcement learning with double Q-learning, 2015. URL https://arxiv.org/abs/1509.06461.
  62. [62] Patrik Hoyer, Dominik Janzing, Joris M. Mooij, Jonas Peters, and Bernhard Schölkopf. Nonlinear causal discovery with additive noise models. Advances in Neural Information Processing Systems, 21, 2008.
  63. [63] Shohei Shimizu, Takanori Inazumi, Yasuhiro Sogawa, Aapo Hyvärinen, Yoshinobu Kawahara, Takashi Washio, Patrik O. Hoyer, and Kenneth Bollen. DirectLiNGAM: A direct method for learning a linear non-Gaussian structural equation model. Journal of Machine Learning Research, 12(Apr):1225–1248, 2011.
  64. [64] Jonas Peters, Joris M. Mooij, Dominik Janzing, and Bernhard Schölkopf. Causal discovery with continuous additive noise models. Journal of Machine Learning Research, 15:2009–2053, 2014.
  65. [65] K. Zhang and A. Hyvärinen. On the identifiability of the post-nonlinear causal model. In 25th Conference on Uncertainty in Artificial Intelligence (UAI 2009), pages 647–655. AUAI Press, 2009.
  66. [66] Clive W. J. Granger. Investigating causal relations by econometric models and cross-spectral methods. Econometrica: Journal of the Econometric Society, pages 424–438, 1969.
  67. [67] John Geweke. Measurement of linear dependence and feedback between multiple time series. Journal of the American Statistical Association, 77(378):304–313, 1982.
  68. [68] Ričards Marcinkevičs and Julia E. Vogt. Interpretable models for Granger causality using self-explaining neural networks. arXiv preprint arXiv:2101.07600, 2021.
  69. [69] Alex Tank, Ian Covert, Nicholas Foti, Ali Shojaie, and Emily B. Fox. Neural Granger causality. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(8):4267–4279, 2022.
  70. [70] Sindy Löwe, David Madras, Richard Zemel, and Max Welling. Amortized causal discovery: Learning to infer causal graphs from time-series data. In Conference on Causal Learning and Reasoning, pages 509–525. PMLR, 2022.
  71. [71] Daigo Fujiwara, Kazuki Koyama, Keisuke Kiritoshi, Tomomi Okawachi, Tomonori Izumitani, and Shohei Shimizu. Causal discovery for non-stationary non-linear time series data using just-in-time modeling. In Conference on Causal Learning and Reasoning, pages 880–894. PMLR, 2023.
  72. [72] Doris Entner and Patrik O. Hoyer. On causal discovery from time series data using FCI. Probabilistic Graphical Models, 16, 2010.
  73. [73] Jakob Runge, Peer Nowack, Marlene Kretschmer, Seth Flaxman, and Dino Sejdinovic. Detecting and quantifying causal associations in large nonlinear time series datasets. Science Advances, 5(11):eaau4996, 2019.
  74. [74] Jakob Runge. Discovering contemporaneous and lagged causal relations in autocorrelated nonlinear time series datasets. In Proceedings of the 36th Conference on Uncertainty in Artificial Intelligence (UAI), pages 1388–1397, 2020.
  75. [75] Wiebke Günther, Urmi Ninad, and Jakob Runge. Causal discovery for time series from multiple datasets with latent contexts. In Uncertainty in Artificial Intelligence, pages 766–776. PMLR, 2023.
  76. [76] Roxana Pamfil, Nisara Sriwattanaworachai, Shaan Desai, Philip Pilgerstorfer, Konstantinos Georgatzis, Paul Beaumont, and Bryon Aragam. DYNOTEARS: Structure learning from time-series data. In International Conference on Artificial Intelligence and Statistics, pages 1595–1605. PMLR, 2020.
  77. [77] Alexis Bellot, Kim Branson, and Mihaela van der Schaar. Neural graphical modelling in continuous-time: Consistency guarantees and algorithms. In International Conference on Learning Representations, 2022.
  78. [78] Xiangyu Sun, Oliver Schulte, Guiliang Liu, and Pascal Poupart. NTS-NOTEARS: Learning nonparametric DBNs with prior knowledge. In International Conference on Artificial Intelligence and Statistics, pages 1942–1964. PMLR, 2023.
  79. [79] Mingzhou Liu, Xinwei Sun, and Yizhou Wang. Conditional local independence testing for Itô processes with applications to dynamic causal discovery. arXiv preprint arXiv:2506.07844, 2025.
  80. [80] Finnian Lattimore, Tor Lattimore, and Mark D. Reid. Causal bandits: Learning good interventions via causal inference. Advances in Neural Information Processing Systems, 29, 2016.

Showing first 80 references.