Bayesian Inverse Transition Learning: Learning Dynamics From Near-Optimal Trajectories

Abhishek Sharma; Finale Doshi-Velez; Leo Benac; Sonali Parbhoo

arXiv: 2411.05174 · v2 · submitted 2024-11-07 · 💻 cs.LG · cs.AI · stat.ML

Pith reviewed 2026-05-23 17:06 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · stat.ML

keywords: inverse transition learning · offline model-based RL · Bayesian dynamics estimation · near-optimal trajectories · transition function constraints · healthcare RL · transfer diagnostics

The pith

Near-optimal expert trajectories supply constraints that improve Bayesian estimates of transition dynamics in offline reinforcement learning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes a method for estimating the unknown transition function T* when data comes only from trajectories of a near-optimal expert. It converts the expert's avoidance of bad actions into explicit constraints that rule out incompatible transition models. These constraints are folded into a Bayesian posterior, which is then used both to select better actions and to judge whether the learned model will transfer. The approach is demonstrated on synthetic tasks and on real ICU data for managing patient hypotension.
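
The idea of folding near-optimality constraints into a posterior can be illustrated with a minimal sketch. It is not the paper's implementation: it assumes a small tabular MDP, a known reward table `R`, a Dirichlet posterior built from transition counts, and plain rejection sampling in place of whatever sampler the authors use. All function names, the tolerance `eps`, and the toy numbers are hypothetical.

```python
import numpy as np

def q_values(T, R, gamma=0.9, iters=500):
    """Value iteration on a tabular MDP; T has shape (S, A, S), R has shape (S, A)."""
    Q = np.zeros(R.shape)
    for _ in range(iters):
        Q = R + gamma * T @ Q.max(axis=1)
    return Q

def expert_is_near_optimal(T, R, expert_actions, eps, gamma=0.9):
    """Check that each observed expert action is within eps of the best action under T."""
    Q = q_values(T, R, gamma)
    return all(Q[s].max() - Q[s, a] <= eps for s, a in expert_actions.items())

def sample_constrained_posterior(counts, R, expert_actions, eps=0.05,
                                 n_samples=20, prior=1.0, max_tries=5000, seed=0):
    """Rejection-sample models from Dirichlet(counts + prior) that keep the expert near-optimal."""
    rng = np.random.default_rng(seed)
    kept = []
    for _ in range(max_tries):
        g = rng.gamma(counts + prior)            # normalized Gamma draws are Dirichlet draws
        T = g / g.sum(axis=-1, keepdims=True)
        if expert_is_near_optimal(T, R, expert_actions, eps):
            kept.append(T)                       # model is compatible with the expert's choices
            if len(kept) == n_samples:
                break
    return np.array(kept)

# Toy 2-state, 2-action problem (all numbers are made up for illustration).
counts = np.zeros((2, 2, 2))
counts[0, 1, 1] = 5                      # expert took action 1 in state 0 and reached state 1
counts[1, 0, 1] = 5                      # expert took action 0 in state 1 and stayed there
R = np.array([[0.0, 0.0], [1.0, 1.0]])   # assumed known reward: state 1 is the good state
expert_actions = {0: 1, 1: 0}

samples = sample_constrained_posterior(counts, R, expert_actions)
print(samples.shape)
```

In this toy problem the expert never tries action 0 in state 0, so the counts say nothing about it. The constraint still rules out sampled models in which that untried action would beat the expert's choice by more than `eps`, which is the sense in which limited coverage carries information.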

Core claim

Inverse Transition Learning derives constraints on T* directly from the near-optimality of observed expert trajectories and incorporates those constraints into a Bayesian posterior over transition functions, yielding improved policies and transfer diagnostics even when state coverage is incomplete.

What carries the argument

Inverse Transition Learning, the procedure that translates near-optimality of expert actions into explicit constraints on the unknown transition function T*.

If this is right

Policies derived from the constrained posterior outperform those from unconstrained Bayesian estimates on the same limited expert data.
The posterior width over T* supplies a practical signal for deciding whether the learned model can be transferred to a new environment.
The same constraint mechanism applies directly to real clinical data such as ICU hypotension management.
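
A rough way to picture the posterior-width signal mentioned above is to measure the spread of posterior draws for each state-action pair. The sketch below is an assumption-laden illustration, not the paper's diagnostic: `posterior_width`, `dirichlet_samples`, and the 0.15 threshold are hypothetical, and the draws come from an unconstrained Dirichlet posterior only so that there is something to measure.

```python
import numpy as np

def posterior_width(samples):
    """Mean posterior std of T(s' | s, a) for each (s, a); samples has shape (N, S, A, S)."""
    return samples.std(axis=0).mean(axis=-1)

def dirichlet_samples(counts, n=2000, prior=1.0, seed=0):
    """Unconstrained Dirichlet posterior draws via normalized Gamma variates."""
    rng = np.random.default_rng(seed)
    g = rng.gamma(counts + prior, size=(n,) + counts.shape)
    return g / g.sum(axis=-1, keepdims=True)

# Toy counts: only the pair (state 0, action 0) is well covered.
counts = np.zeros((2, 2, 2))
counts[0, 0] = [50, 50]

width = posterior_width(dirichlet_samples(counts))
poorly_covered = width > 0.15   # hypothetical threshold for "too uncertain to transfer"
print(width.round(3))
print(poorly_covered)
```

Any array of posterior draws with the same shape could be passed to `posterior_width`; a wide posterior on state-action pairs that matter in the new environment would be a reason for caution before transferring the model.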

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The constraint idea could be reused in any setting where expert behavior is known to be near-optimal but state coverage remains sparse.
Pairing these optimality constraints with other weak priors on dynamics might further tighten the posterior without extra data.
The method implicitly extracts information about which parts of the state space matter most for the expert's objective.

Load-bearing premise

The expert policy is near-optimal, and this fact can be turned into usable constraints on T* without further assumptions about the reward function or policy class.

What would settle it

No measurable gain in policy performance or transfer prediction accuracy when the method is run on synthetic environments that supply known near-optimal trajectories and a ground-truth transition function.
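
A check of this kind could be set up roughly as follows. The sketch assumes a tabular MDP with a known reward table and a known ground-truth transition tensor; the normalization used here (mean value of the planned policy divided by mean optimal value, both evaluated under the true dynamics) is a hypothetical choice and may differ from the "Normalized Value" reported in the paper's figures.

```python
import numpy as np

def greedy_policy(T, R, gamma=0.9, iters=500):
    """Plan with value iteration under model T and return the greedy action per state."""
    Q = np.zeros(R.shape)
    for _ in range(iters):
        Q = R + gamma * T @ Q.max(axis=1)
    return Q.argmax(axis=1)

def policy_value(T, R, policy, gamma=0.9):
    """Exact value of a deterministic policy under dynamics T, via a linear solve."""
    idx = np.arange(R.shape[0])
    P = T[idx, policy]          # (S, S) transition matrix induced by the policy
    r = R[idx, policy]          # (S,) reward collected by the policy
    return np.linalg.solve(np.eye(len(idx)) - gamma * P, r)

def normalized_value(T_est, T_true, R, gamma=0.9):
    """Value of the policy planned on T_est, relative to the optimum, both under T_true."""
    v_est = policy_value(T_true, R, greedy_policy(T_est, R, gamma), gamma)
    v_opt = policy_value(T_true, R, greedy_policy(T_true, R, gamma), gamma)
    return v_est.mean() / v_opt.mean()   # assumes non-negative rewards with a positive optimum

# Toy ground truth: action 1 moves 0 -> 1, action 0 keeps state 1; reward only in state 1.
T_true = np.zeros((2, 2, 2))
T_true[0, 0, 0] = T_true[0, 1, 1] = 1.0
T_true[1, 0, 1] = T_true[1, 1, 0] = 1.0
R = np.array([[0.0, 0.0], [1.0, 1.0]])
T_wrong = T_true[:, ::-1]                # a deliberately bad estimate with actions swapped

print(normalized_value(T_true, T_true, R))
print(normalized_value(T_wrong, T_true, R))
```

Comparing this score for a constrained estimate against an unconstrained one, across many synthetic environments and coverage levels, is the kind of experiment that would show whether the constraints give a measurable gain.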

Figures

Figures reproduced from arXiv: 2411.05174 by Abhishek Sharma, Finale Doshi-Velez, Leo Benac, Sonali Parbhoo.

**Figure 2.** Top row: Normalized Value vs. Coverage for Gridworld (left: Standard Task, middle: Transfer Task), Bottom row: …

**Figure 3.** Most likely next 3 states after prescribing Intra…

**Figure 4.** Example of a subspace of the feasible region de…

**Figure 5.** Visualization of the grid world environment. Each…

**Figure 6.** The grid world environment after the transfer task…

**Figure 7.** (40% stochastic-policy states) Top row: Normalized Value vs. Coverage for Gridworld (left: Standard Task, middle: Transfer Task), Bottom row: Normalized Value vs. Coverage for Randomworlds (left: Standard Task, middle: Transfer Task). Rightmost plots: Normalized Value vs. Bayesian Regret of both Tasks (top: Gridworld, bottom: Randomworlds)

**Figure 8.** (20% stochastic-policy states) Top row: Normalized Value vs. Coverage for Gridworld (left: Standard Task, middle: …

**Figure 9.** (0% stochastic-policy states) Top row: Normalized Value vs. Coverage for Gridworld (left: Standard Task, middle: …

Original abstract

We consider the problem of estimating the transition dynamics $T^*$ from near-optimal expert trajectories in the context of offline model-based reinforcement learning. We develop a novel constraint-based method, Inverse Transition Learning, that treats the limited coverage of the expert trajectories as a \emph{feature}: we use the fact that the expert is near-optimal to inform our estimate of $T^*$. We integrate our constraints into a Bayesian approach. Across both synthetic environments and real healthcare scenarios like Intensive Care Unit (ICU) patient management in hypotension, we demonstrate not only significant improvements in decision-making, but that our posterior can inform when transfer will be successful.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces Inverse Transition Learning, a constraint-based Bayesian method for estimating unknown transition dynamics T* from near-optimal expert trajectories in offline model-based RL. It treats the limited coverage of expert data as a feature by deriving constraints from near-optimality, integrates these into a Bayesian posterior over T*, and reports improved decision-making plus transfer diagnostics on synthetic environments and a real ICU hypotension management task.

Significance. If the derivation of reward-free constraints from near-optimality is valid and the empirical gains hold under proper baselines, the work could be significant for healthcare and other domains where rewards are hard to specify and expert data is sparse, by enabling dynamics learning that also flags transfer risk via the posterior.

major comments (2)

[Abstract and §3] Method derivation: the claim that near-optimality directly supplies inequality constraints on T* without additional assumptions on the reward function or policy class is load-bearing for the entire approach. Standard derivations of optimality-based constraints require either explicit rewards (to compare action values under T*) or a parametric policy class; if the paper's construction invokes an implicit optimality gap, value function, or reward proxy, the resulting Bayesian posterior is misspecified when the true reward differs.
[§4 and experimental sections] The abstract asserts 'significant improvements in decision-making' and transfer diagnostics, yet the provided description supplies no quantitative results, baselines, or validation details (e.g., no reported metrics, comparison methods, or statistical tests). Without these, the central empirical claim cannot be evaluated.

minor comments (1)

[§3] Notation for the constraint set and the Bayesian update should be defined with explicit equations early in §3 to allow readers to verify the claimed reward-free property.

Simulated Author's Rebuttal

2 responses · 0 unresolved

Thank you for the review. We address the two major comments point by point, clarifying the derivation and pointing to the full experimental details in the manuscript.

Referee: [Abstract and §3] Method derivation: the claim that near-optimality directly supplies inequality constraints on T* without additional assumptions on the reward function or policy class is load-bearing for the entire approach. Standard derivations of optimality-based constraints require either explicit rewards (to compare action values under T*) or a parametric policy class; if the paper's construction invokes an implicit optimality gap, value function, or reward proxy, the resulting Bayesian posterior is misspecified when the true reward differs.

Authors: Section 3 derives the inequality constraints on T* solely from the near-optimality of the observed expert trajectories under an unknown reward: for each trajectory segment, the observed actions must be near-optimal for some reward function consistent with the data. This uses only the definition of near-optimality (existence of a reward making the expert policy approximately optimal) without parameterizing the reward, the policy class, or introducing an explicit optimality gap or value-function proxy. The resulting posterior is therefore the distribution over transition models that admit at least one reward rendering the data near-optimal. We maintain that this construction is not misspecified under the paper's stated assumptions; a concrete counter-example where the constraints fail to hold for any reward would be helpful to examine.
Revision: no

Referee: [§4 and experimental sections] The abstract asserts 'significant improvements in decision-making' and transfer diagnostics, yet the provided description supplies no quantitative results, baselines, or validation details (e.g., no reported metrics, comparison methods, or statistical tests). Without these, the central empirical claim cannot be evaluated.

Authors: The complete manuscript (Section 4 and appendix) reports quantitative results on both synthetic environments and the real ICU hypotension dataset. These include policy return improvements versus maximum-likelihood and other inverse-RL baselines, posterior predictive checks for transfer success, and statistical significance tests (e.g., paired t-tests with reported p-values). We apologize if only the high-level summary was available to the referee; the full paper supplies the requested metrics, baselines, and validation details.
Revision: no

Circularity Check

0 steps flagged

No significant circularity; derivation self-contained

The provided abstract and description present Inverse Transition Learning as integrating near-optimality constraints on T* into a Bayesian posterior without reducing any prediction or central claim to a fitted input by construction, self-citation chain, or definitional equivalence. No equations or steps are quoted that exhibit self-definitional loops (e.g., X defined via Y where Y is the output), fitted parameters renamed as predictions, or load-bearing uniqueness imported from prior self-work. The method treats limited expert coverage as a feature using stated near-optimality assumptions, with validation on synthetic and real ICU data serving as external checks. This is the common honest outcome for papers whose core construction remains independent of the target result.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no explicit free parameters, axioms, or invented entities; all fields left empty.

pith-pipeline@v0.9.0 · 5643 in / 1073 out tokens · 39498 ms · 2026-05-23T17:06:55.249064+00:00 · methodology

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Quantifying Potential Observation Missingness in Inverse Reinforcement Learning
cs.LG 2026-05 unverdicted novelty 7.0

A practical algorithm quantifies potential missing observations in IRL by computing minimal perturbations to recorded data that render expert actions optimal.
