arxiv: 2606.24160 · v1 · pith:VLL3AYR5new · submitted 2026-06-23 · 💻 cs.AI

An Introduction to Causal Reinforcement Learning

Pith reviewed 2026-06-26 00:16 UTC · model grok-4.3

classification 💻 cs.AI
keywords causal inference · reinforcement learning · structural causal models · counterfactual reasoning · policy learning · imitation learning · off-policy learning

The pith

Reinforcement learning environments implicitly encode structural causal models that unify online, off-policy, and counterfactual learning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that any environment an RL agent faces can be decomposed into autonomous mechanisms with distinct causal invariances and represented as a structural causal model. Making this implicit model explicit places standard RL methods, data-driven off-policy methods, and explicit causal reasoning under one formal treatment. This connection defines new task classes such as intervening on policies, imitation under causal constraints, and learning from counterfactual outcomes, opening a combined study of causal inference and reinforcement learning.

Core claim

The paper states that any RL environment decomposes as a collection of autonomous mechanisms with different causal invariances, parsimoniously modeled as a structural causal model; every standard RL setting therefore implicitly encodes such a model. This formalization places online learning, off-policy learning, and causal-calculus learning under a single treatment and introduces generalized policy learning, imitation learning, and counterfactual learning as natural extensions.
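
To make the idea of "autonomous mechanisms" concrete, here is a minimal sketch of a two-armed bandit written as a structural causal model. The mechanisms `f_x` and `f_y`, the confounder `U`, and all numbers are hypothetical choices for illustration and are not taken from the paper. The sketch only shows that intervening replaces one mechanism while leaving the others untouched, and that the observational and interventional rewards can then disagree.

```python
import random

# Toy SCM (an assumption for illustration): U -> X, U -> Y, X -> Y.

def f_x(u):
    # Natural (behavior) mechanism: the action "listens to" the hidden U.
    return u

def f_y(x, u):
    # Reward mechanism: pays 1 only when the action differs from U.
    return 1 if x != u else 0

def simulate(n, do_x=None, seed=0):
    """Sample n rounds; if do_x is given, the mechanism f_x is replaced by a constant."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        u = rng.randint(0, 1)                  # exogenous, unobserved
        x = f_x(u) if do_x is None else do_x   # intervention swaps only f_x
        y = f_y(x, u)                          # f_y is left unchanged
        data.append((x, y))
    return data

def mean_reward(data, x):
    ys = [y for (xi, y) in data if xi == x]
    return sum(ys) / len(ys)

obs_value = mean_reward(simulate(10_000), 1)           # estimates E[Y | X = 1]
do_value = mean_reward(simulate(10_000, do_x=1), 1)    # estimates E[Y | do(X = 1)]
print(obs_value, do_value)
```

In this toy model the logged data suggest that arm 1 never pays, while actually pulling arm 1 pays about half the time; the gap comes entirely from the unobserved `U`.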

What carries the argument

The structural causal model of the environment carries the argument: it represents autonomous mechanisms and their causal invariances, which enables joint analysis of the different learning modes.

If this is right

  • Online trial-and-error, reuse of logged data, and explicit counterfactual queries become instances of the same causal process.
  • Policy learning extends to settings that require choosing where to intervene and how to imitate under causal constraints.
  • Counterfactual learning becomes a well-defined task that reasons about outcomes under actions never taken.
  • A broader view of counterfactual learning emerges that treats causal inference and reinforcement learning as two sides of the same structure.
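
The counterfactual bullet above can be illustrated with the standard three-step recipe (abduction, action, prediction) on a tiny deterministic SCM. The mechanism `f_y` below is a hypothetical toy, not the paper's model; it is chosen so that the hidden noise can be recovered exactly from one observed round.

```python
# Toy reward mechanism (an assumption for illustration): Y = X xor U, with U in {0, 1}.
def f_y(x, u):
    return x ^ u

def abduct_u(x_obs, y_obs):
    # Abduction: keep every value of the hidden U consistent with what was observed.
    return [u for u in (0, 1) if f_y(x_obs, u) == y_obs]

def counterfactual_y(x_obs, y_obs, x_alt):
    # Action: replace the observed action by x_alt.
    # Prediction: push each consistent U through the unchanged mechanism f_y.
    return {f_y(x_alt, u) for u in abduct_u(x_obs, y_obs)}

# Observed: action 1 was taken and the reward was 0.
# Query: what would the reward have been under action 0 in that same round?
print(counterfactual_y(1, 0, 0))
```

Because the mechanism is deterministic, the query returns a single outcome; with noisy mechanisms the same steps would yield a distribution over outcomes instead.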

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Agents could be designed to maintain an explicit causal model of the environment and update it from both observed and hypothetical trajectories.
  • Transfer across tasks may reduce to identifying which mechanisms remain invariant when the agent moves to a new environment.
  • Standard regret bounds could be refined by separating the cost of learning mechanisms from the cost of learning their combination.

Load-bearing premise

Every standard reinforcement learning setting already encodes a structural causal model whose mechanisms can unify online, off-policy, and causal-calculus learning without further assumptions on what is observable or identifiable.

What would settle it

An RL environment in which the standard online or off-policy update rules cannot be recovered as special cases of operations on the implied causal mechanisms.
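
One way to read this test is to check, on a small example, that a familiar off-policy estimator agrees with the interventional quantity of an SCM. The sketch below assumes a toy environment in which the logged state `S` is the only common cause of action and reward (no unobserved confounding); the behavior policy `BEHAVIOR` and the reward rule are hypothetical. Under that assumption, inverse propensity weighting (IPW) targets E[Y | do(X = 1)], whereas the naive conditional average does not.

```python
import random

# Hypothetical behavior policy: P(X = 1 | S = s).
BEHAVIOR = {0: 0.2, 1: 0.8}

def log_data(n, seed=0):
    """Logged rounds (s, x, y) from the toy SCM: S -> X, S -> Y, X -> Y."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        s = rng.randint(0, 1)
        x = 1 if rng.random() < BEHAVIOR[s] else 0
        y = 1 if x == s else 0          # reward 1 when the action matches the state
        rows.append((s, x, y))
    return rows

def naive_value(rows, x):
    # Plain conditional average E[Y | X = x]; biased by the dependence of X on S.
    ys = [y for (_, xi, y) in rows if xi == x]
    return sum(ys) / len(ys)

def ipw_value(rows, x):
    # IPW estimate of E[Y | do(X = x)], using the known propensities.
    total = 0.0
    for s, xi, y in rows:
        if xi == x:
            p = BEHAVIOR[s] if x == 1 else 1.0 - BEHAVIOR[s]
            total += y / p
    return total / len(rows)

rows = log_data(20_000)
naive = naive_value(rows, 1)   # close to P(S = 1 | X = 1) = 0.8
ipw = ipw_value(rows, 1)       # close to P(S = 1) = 0.5, the interventional value
print(naive, ipw)
```

A counterexample of the kind described above would be an environment where no such correspondence between the update rule and the mechanisms can be written down.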

Figures

Figures reproduced from arXiv: 2606.24160 by Elias Bareinboim, Junzhe Zhang, Sanghack Lee.

Figure 1: The Agent-Environment interaction from Causal Reinforcement Learning (CRL).
Figure 2: Paper’s roadmap and organization. on this framework, we introduce causal reinforcement learning tasks that consider the interaction capabilities of the learning agent and the prior knowledge of the environment accessible to the agent (Sec. 3.2). We compare the CRL formalisms with reinforcement learning under the standard model assumptions of Markov decision processes, emphasizing that there exists no discr…
Figure 3: Building blocks of Causal RL analysis. (a) Unobserved model of the environment; (b) the
Figure 4: Representation of the CRL agent (right side) interacting with the SCM (middle) through
Figure 5: Observational distributions of the MDP model
Figure 6: Interventional distributions of the MDP model described in Eq. 5
Figure 7: The space of SCMs/Causal diagrams are shown on the left/right side. The true SCM M∗ and the corresponding causal diagram G∗ are explicitly shown. The yellow area represents the subspace where these other SCMs generate the same G∗. In words, there is an edge from endogenous variables Vi to Vj whenever Vj “listens to” Vi for determining its value. Similarly, a bidirected edge between Vi and Vj indicat…
Figure 8: Causal diagrams for (a) a multi-armed bandit (MAB); (b) a Markov decision process
Figure 9: Policy spaces for a 2-stage DTR environment. Every policy π ∈ Π is a sequence of decision rules (π1(X1 | S1), …, πH(XH | SH)). An agent following policy π selects values of actions X following a temporal ordering X1, …, XH. At every step of intervention i = 1, …, H, it performs the following: 1. Observe some state variables Si = si; 2. Select a value of action xi ∼ πi(Xi | Si = si) follo…
Figure 10: Causal diagrams for CDMs representing canonical decision-making models.
Figure 11: Graphical representation of a causal reinforcement learning task
Figure 12: Causal Hierarchy Theorem (CHT) in MDP environments.
Figure 13: Temporal diagram showing an off-policy learning agent interacting with the environment
Figure 14: Temporal diagram showing an online learning agent interacting with the environment
Figure 15: The regret of RCT with varying total number of trials. where C is a universal constant. That is, RCT is able to achieve a sublinear regret R(T, M∗) = O
Figure 16: Simulation results comparing performance of online learning algorithms
Figure 17: Temporal diagram showing a causal identification agent interacting with the environment
Figure 18: Causal diagram satisfying the NUC condition and its manipulated diagrams.
Figure 19: Causal diagram satisfying the sequential backdoor and its manipulated diagrams.
Figure 20: Assumptions under which IPW and DP algorithms are applicable
Figure 21: A front-door graph and its manipulated representations.
Figure 22: A causal diagram of the SCM described in Eq. 197 and its manipulated diagrams.
Figure 23: Temporal diagram showing an offline-to-online learning agent interacting with the envi
Figure 24: Simulation results comparing UCB learner with direct transfer of observational data (UCB-) and standard UCB without any prior observations. Experiment 3 Fig. 24a shows the cumulative regret of UCB- in the MAB environment M∗ described in Example 1 with the suboptimal gap ∆ = 0.1, taking as input 5,000 observational samples drawn from the distribution P(X, Y). The NUC assumption does not hold in this model…
Figure 25: Simulation results comparing UCB+ learner augmented with causal bounds over the expected rewards, standard UCB, and UCB- with direct transfer of observational data. Experiment 4
Figure 26: Simulation results comparing UCB learner optimizing a 2-stage DTR model and RCT determining values of actions X1, X2 uniformly at random. Experiment 5
Figure 27: Simulation results comparing UCB+ learner augmented with causal bounds over the expected rewards, standard UCB, and RCT determining values of action uniformly at random. Decision Horizon / Algorithm / Regret Bound: H = 1, UCB (Alg. 3), O(|D(X)| log(T)/∆); H = 1, UCB+ (Alg. 5), O(|D(X)| log(T)/∆∗); H ≥ 2, UCB (Alg. 6), O
Figure 28: Temporal diagram showing the dynamics of a mixed policy learning while the agent in
Figure 29: (a) True causal diagram G for environment M∗. (b) Hypothesized model in the agent’s mind, after intervention. (c) Structure of the action space. (d) Two agents’ cumulative regret with Thompson sampling (solid line) and UCB (dashed line) together with shaded areas representing 95% confidence interval. Two lines for the All-at-once agent are overlapped. where the endogenous variables V are all binary. The …
Figure 30: Relationships among quantities such as probability distributions and expected rewards
Figure 31: Illustrative examples demonstrating how partial-orders can be obtained from Figure 31a.
Figure 32: Obtaining MUCTs (variables in light blue areas) and IBs (variable in green) under in
Figure 33: A causal diagram G and the visualization of original c-factorization of P(v). (Plot: x-axis Episodes, y-axis Cum. Regret; series POMIS+ID, POMIS, MIS, Brute-force.)
Figure 34: Cumulative regrets of different bandit agents based on Brute-force, MIS, POMIS, and
Figure 35: (a) a causal diagram, (b) abstract representation of a contextual bandit policy, and (c,d,e)
Figure 36: Relationships among the policy spaces based on two aspects.
Figure 37: Causal diagrams where the relevance of some contexts can be further eliminated under
Figure 38: (a) A minimal policy space and (b) its dependency graph derived in Example 50 where
Figure 39: (left) Cumulative regrets (in a log scale) of different arm strategies based on all possible
Figure 40: Temporal diagram showing an agent interacting with the environment for repeated
Figure 41: (a) Illustration of the machines configurations. (b) Table summarizing gamblers natural
Figure 42: Performance of different bandit strategies in the greedy casino example; x-axis represents
Figure 43: Causal diagrams for different interaction regimes in the MAB environment.
Figure 44: Causal diagrams for the MDP environment and its submodel induced by a counterfactual
Figure 45: Illustration of decision flow, fX, where U is taken as input and the natural predilections X′ is returned as output. The process is refined through multiple stages. and MDPs. However, how can an optimal counterfactual policy be learned by computing the counterfactual quantities entailed by the underlying, unknown environment? To illustrate, consider the MAB environment as an example. When the intended ac…
Figure 46: Performance of standard UCB performing atomic interventions and the augmented Ctf-UCB using counterfactual interventions; x-axis represents the total episodes of interactions. The x-axis represents, respectively, the probability of picking an optimal action and the cumulative regret in (a) and (b); the y-axis represents the number of episodes in both (a) and (b). In words, there is an MAB environment such…
Figure 47: Performance of standard UCBVI performing atomic interventions and the augmented Ctf-UCBVI using counterfactual interventions. Environment Structural Assumptions Optimality Autonomy ΠEXP ΠCTF MAB NUC ✓ ✓ ✓ - ✗ ✓ ✗ MDP NUC ✓ ✓ ✓ - ✗ ✓ ✗
Figure 48: Causal diagram for the MDP environment induced by a hybrid policy where extended
Figure 49: Simulations comparing the performance (a) and occupancy composition (b) of
Figure 50: The tail light of the front car is unobserved in highway (aerial) drone data. This means that the agent will try to find a policy π∗ such that π∗ = arg max_{π ∈ ΠEXP} E_π^{M∗}[R(Y)]
Figure 51: A causal diagram and its manipulated subgraphs.
Figure 52: Causal diagrams where X represents an action (shaded blue) and Y represents a latent reward (shaded red). Input covariates of the policy space Π are shaded in light blue. 8.2.1 MINIMAL IMITATION BACKDOOR We will next study causal IRL in more general settings where the NUC assumption does not hold, and there exist unobserved confounders in the demonstration data affecting both actions and other variables i…
Figure 53: Simulation results evaluating causal IRL when imitation backdoor condition holds.
Figure 54: Causal Hierarchy Theorem (CHT) in POMDP environments.

Original abstract

Causal inference provides a set of principles and tools that allow one to combine data and knowledge about an environment to reason with questions of counterfactual nature, i.e., what would have happened had reality been different, even when no data of this unrealized reality is currently available. Reinforcement learning provides methods to learn a policy that optimizes a specific measure (e.g., reward, regret) when the agent is deployed in an environment and pursues an exploratory, trial-and-error approach. These two disciplines have evolved independently and with virtually no interaction between them. We note that they operate over different aspects of the same building block, counterfactual relations, which makes them umbilically connected. Based on these observations, novel learning opportunities arise when this connection is explicitly acknowledged and mathematized. To realize this potential, we note that any environment where the RL agent is deployed can be decomposed as a collection of autonomous mechanisms with different causal invariances, parsimoniously modeled as a structural causal model; any standard RL setting implicitly encodes such a model. This formalization allows us to put under a unifying treatment different modes of learning, including online, off-policy, and causal calculus learning, which appear unrelated in the literature. However, these modalities are not exhaustive: we introduce several natural and pervasive classes of learning settings that entail novel dimensions of analysis. Specifically, we introduce and discuss through causal lenses generalized policy learning, where to intervene, imitation learning, and counterfactual learning. These tasks lead to a broader view of counterfactual learning and suggest great potential for studying causal inference and reinforcement learning side by side, which we call causal reinforcement learning (CRL).

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 1 minor

Summary. The manuscript claims that any RL environment can be decomposed as a collection of autonomous mechanisms with different causal invariances and parsimoniously modeled as a structural causal model (SCM); any standard RL setting implicitly encodes such a model. This formalization unifies online, off-policy, and causal-calculus learning modes under one treatment. The authors further introduce and analyze through causal lenses several new task classes—generalized policy learning, intervention-based learning (where to intervene), imitation learning, and counterfactual learning—arguing that these open a broader view of counterfactual reasoning and motivate the joint study of causal inference and RL, termed causal reinforcement learning (CRL).

Significance. If the proposed modeling perspective is adopted, the work supplies a useful conceptual unification that makes explicit the shared counterfactual substrate of the two fields. By treating the SCM decomposition as a modeling lens rather than a derived theorem, the paper organizes existing RL modalities and surfaces new task dimensions that exploit causal invariances, providing a clear roadmap for future CRL research without requiring additional observability or identifiability assumptions beyond the framing itself.

minor comments (1)
  1. [Abstract] The phrasing 'generalized policy learning, where to intervene, imitation learning, and counterfactual learning' leaves it ambiguous whether 'where to intervene' denotes a distinct task or a sub-component of generalized policy learning; a brief clarifying clause would improve readability.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive assessment of the manuscript and the recommendation to accept. The review accurately captures the core contribution of framing RL environments as structural causal models to unify learning modalities and surface new task classes under the CRL lens.

Circularity Check

0 steps flagged

No significant circularity identified

full rationale

The manuscript is an explicit conceptual introduction that frames the decomposition of any RL environment into an SCM as a modeling perspective rather than a theorem or first-principles derivation. The unification of online, off-policy, and causal-calculus learning follows directly from adopting this framing by definition, with no claimed prediction, fitted parameter, or uniqueness result that reduces to its own inputs. No load-bearing self-citation chains or ansatzes smuggled via prior work appear in the argument; the central claim is presented as a choice of representation that enables the subsequent taxonomy, rendering the derivation self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The framework rests on the domain assumption that RL environments are decomposable into SCMs with autonomous mechanisms; no free parameters or invented entities are introduced in the abstract.

axioms (1)
  • domain assumption: Any environment where the RL agent is deployed can be decomposed as a collection of autonomous mechanisms with different causal invariances, parsimoniously modeled as a structural causal model.
    Directly stated in the abstract as the basis for the entire unification.
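
For readers who want the object behind this axiom written out, the usual textbook definition of a structural causal model can be sketched as follows. The notation follows common usage in the causal inference literature and is an assumption here; the paper's own symbols may differ.

```latex
% Textbook-style SCM definition (notation assumed, not quoted from the paper).
\mathcal{M} = \langle \mathbf{U}, \mathbf{V}, \mathcal{F}, P(\mathbf{U}) \rangle,
\qquad
V_i \leftarrow f_i(\mathit{Pa}_i, U_i), \quad f_i \in \mathcal{F}.

% An atomic intervention do(X = x) replaces only the mechanism f_X
% and leaves every other mechanism unchanged (autonomy):
\mathcal{M}_x = \langle \mathbf{U}, \mathbf{V}, \mathcal{F}_x, P(\mathbf{U}) \rangle,
\qquad
\mathcal{F}_x = \{ f_i : V_i \neq X \} \cup \{ X \leftarrow x \}.

% Expected reward under the intervention, for discrete exogenous variables:
\mathbb{E}[\, Y \mid do(X = x) \,] = \sum_{\mathbf{u}} Y_x(\mathbf{u}) \, P(\mathbf{u}).
```

Here $Y_x(\mathbf{u})$ denotes the value of $Y$ computed in the modified model $\mathcal{M}_x$ for a fixed exogenous setting $\mathbf{u}$.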

pith-pipeline@v0.9.1-grok · 5819 in / 1291 out tokens · 32269 ms · 2026-06-26T00:16:15.492705+00:00 · methodology



    Confounding-robust policy improvement , author=. Proceedings of the 32nd International Conference on Neural Information Processing Systems , pages=

  69. [69]

    Journal of the American Statistical Association , volume=

    Probability Inequalities for Sums of Bounded Random Variables , author=. Journal of the American Statistical Association , volume=

  70. [70]

    General Transportability of Soft Interventions: Completeness Results , url =

    Correa, Juan and Bareinboim, Elias , booktitle =. General Transportability of Soft Interventions: Completeness Results , url =

  71. [71]

    Stabilizing Off-Policy

    Kumar, Aviral and Fu, Justin and Tucker, George and Levine, Sergey , booktitle =. Stabilizing Off-Policy. 2019 , publisher =

  72. [72]

    Advances in Neural Information Processing Systems , volume=

    Characterizing Optimal Mixed Policies: Where to Intervene and What to Observe , author=. Advances in Neural Information Processing Systems , volume=

  73. [73]

    and Bareinboim, E

    Zhang, J. and Bareinboim, E. Can Humans Be Out of the Loop?. 2022

  74. [74]

    Proceedings of the 36th International Conference on Machine Learning , pages =

    Off-Policy Deep Reinforcement Learning without Exploration , author =. Proceedings of the 36th International Conference on Machine Learning , pages =. 2019 , editor =

  75. [75]

    In Proceedings of the 35th Conference on Uncertainty in Artificial Intelligence , year=

    Sanghack Lee and Juan David Correa and Elias Bareinboim , title=. In Proceedings of the 35th Conference on Uncertainty in Artificial Intelligence , year=

  76. [76]

    , title =

    Quionero-Candela, Joaquin and Sugiyama, Masashi and Schwaighofer, Anton and Lawrence, Neil D. , title =. 2009 , isbn =

  77. [77]

    Philip and Didelez, Vanessa

    Dawid, A. Philip and Didelez, Vanessa. Identifying the consequences of dynamic treatment strategies: A decision-theoretic overview. Statist. Surv. 2010. doi:10.1214/10-SS081

  78. [78]

    Cassel, Claes M. and S. Some results on generalized difference estimation and generalized regression estimation for finite populations. Biometrika , volume =. 1976 , month =

  79. [79]

    2018 21st International Conference on Intelligent Transportation Systems (ITSC) , pages=

    The highD Dataset: A Drone Dataset of Naturalistic Vehicle Trajectories on German Highways for Validation of Highly Automated Driving Systems , author=. 2018 21st International Conference on Intelligent Transportation Systems (ITSC) , pages=. 2018 , doi=

  80. [80]

    Proceedings of the National Conference on Artificial Intelligence , volume=

    Identification of joint interventional distributions in recursive semi-Markovian causal models , author=. Proceedings of the National Conference on Artificial Intelligence , volume=. 2006 , organization=
