FP-IRL: Fokker--Planck Inverse Reinforcement Learning -- A Physics-Constrained Approach to Markov Decision Processes

Chengyang Huang; Gary D. Luker; Kathy E. Luker; Kenneth K. Y. Ho; Krishna Garikipati; Siddhartha Srivastava; Xun Huan

arXiv: 2306.10407 · v3 · submitted 2023-06-17 · 💻 cs.LG · cs.AI · physics.bio-ph · q-bio.CB

Pith reviewed 2026-05-24 08:24 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · physics.bio-ph · q-bio.CB

keywords: inverse reinforcement learning · Fokker-Planck dynamics · Markov decision process · variational system identification · physics-constrained learning · trajectory inference · reward function recovery

The pith

FP-IRL infers both reward and transition functions directly from trajectory data by mapping MDPs to Fokker-Planck dynamics.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Most inverse reinforcement learning methods need the transition function known or estimated in advance, which limits their use when dynamics are unknown. FP-IRL removes this requirement by establishing a correspondence that equates reward maximization in an MDP with free energy minimization under Fokker-Planck dynamics. From observed trajectories the method first infers the underlying potential function through variational system identification. Analytic expressions then recover the full set of MDP elements: reward, transition probabilities, and policy. The approach therefore works on systems where direct sampling of transitions is impossible while keeping the inferred quantities physically interpretable.

Core claim

A correspondence between Markov decision processes and the Fokker-Planck equation links reward maximization in the MDP to free energy minimization in the FP dynamics. Inference of the FP potential function from trajectory data via variational system identification then yields analytic expressions for the reward function, transition function, and policy, allowing all three to be recovered simultaneously without prior access to sampled transitions.
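
For orientation, the correspondence can be written out with the free-energy functional and the conjecture that this page quotes from the paper in its Lean-theorem section. The drift-diffusion form of the Fokker-Planck equation below is the standard gradient-flow form due to Jordan, Kinderlehrer, and Otto and is an assumption about the paper's setup; reading x as the state-action pair (s, a) is likewise an assumption.

```latex
% Fokker-Planck equation in gradient-flow form (assumed; the paper's exact form may differ)
\frac{\partial p}{\partial t} = \nabla \cdot \left( p \, \nabla \psi \right) + \beta^{-1} \Delta p

% Free energy decreased by this flow (functional as quoted from the paper)
F(p) = \int \psi(x)\, p(x)\, dx + \beta^{-1} \int p(x) \log p(x)\, dx

% Its minimizer is the Gibbs density
p_\infty(x) \propto \exp\left( -\beta\, \psi(x) \right)

% Conjecture 3.1 of the paper, with x = (s, a) (assumed identification)
\psi(s, a) = -Q^{\pi}(s, a)
\quad\Longrightarrow\quad
p_\infty(s, a) \propto \exp\left( \beta\, Q^{\pi}(s, a) \right)
```

Read this way, low free energy corresponds to probability mass concentrating where the state-action value is high, which is the sense in which reward maximization and free-energy minimization line up.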

What carries the argument

The MDP-FP correspondence that equates reward maximization with free energy minimization, enabling analytic recovery of reward, transition, and policy from a single inferred potential function.
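
One way to see how a reward can follow analytically from a value-like potential is the policy-evaluation Bellman equation rearranged for the reward. The sketch below is a tabular illustration only, not the paper's continuous-state formulas: the random MDP, the function names `policy_q` and `inverse_bellman`, and the discount `gamma` are hypothetical, and the identification Q = -ψ follows the paper's Conjecture 3.1. The transition tensor is supplied by hand here; in FP-IRL it would itself come from the inferred dynamics.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, gamma = 4, 3, 0.9  # hypothetical sizes and discount

# Hypothetical tabular MDP, used only to exercise the identity.
P = rng.random((n_states, n_actions, n_states))
P /= P.sum(axis=-1, keepdims=True)        # P[s, a, s'] transition probabilities
pi = rng.random((n_states, n_actions))
pi /= pi.sum(axis=-1, keepdims=True)      # pi[s, a] policy probabilities
r_true = rng.normal(size=(n_states, n_actions))


def policy_q(r, P, pi, gamma):
    """Solve Q = r + gamma * P V, where V(s) = sum_a pi(a|s) Q(s, a)."""
    n_s, n_a = r.shape
    # Transition matrix between state-action pairs under the policy.
    P_pi = np.einsum("sat,tb->satb", P, pi).reshape(n_s * n_a, n_s * n_a)
    q = np.linalg.solve(np.eye(n_s * n_a) - gamma * P_pi, r.reshape(-1))
    return q.reshape(n_s, n_a)


def inverse_bellman(Q, P, pi, gamma):
    """Rearranged Bellman equation: r(s, a) = Q(s, a) - gamma * E[V(s')]."""
    V = np.sum(pi * Q, axis=-1)
    return Q - gamma * (P @ V)


Q_true = policy_q(r_true, P, pi, gamma)
psi = -Q_true                              # potential under Conjecture 3.1
r_recovered = inverse_bellman(-psi, P, pi, gamma)
max_reward_error = float(np.max(np.abs(r_recovered - r_true)))
```

In this toy setting the recovered reward matches the one used to build Q up to floating-point error, which is the kind of closed-form recovery the argument relies on.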

If this is right

Inverse reinforcement learning becomes feasible in domains where transition dynamics cannot be sampled or estimated beforehand.
Reward, transition, and policy are recovered together and remain consistent with one another through the shared potential.
The inferred quantities retain physical meaning because they derive from an FP potential.
The method applies to both synthetic test cases and continuous control tasks such as a modified Mountain Car environment.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same physics-constrained inference route could be tested on data from diffusion processes that only approximately obey Fokker-Planck equations.
If the analytic recovery step holds, similar mappings might be sought between other reinforcement learning objectives and known physical evolution equations.
The framework naturally supplies a consistency check: the inferred transition function can be validated against any available partial observations of state changes.

Load-bearing premise

The system dynamics must follow Fokker-Planck equations and the MDP-FP correspondence must permit exact analytic recovery of all components from the potential.

What would settle it

The claim would fail if the recovered reward and transition functions, when inserted back into the MDP, produced trajectories that differ systematically from the original observed data.
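
A check of this kind can be sketched by comparing the distribution of observed states with the distribution of states simulated from the recovered model, using a KL divergence as in the paper's Figure 2(a). The snippet below is a minimal sketch under stated assumptions: the histogram estimator, the function name `histogram_kl`, the smoothing constant, and the Gaussian stand-in samples are all hypothetical and are not taken from the paper.

```python
import numpy as np


def histogram_kl(data_samples, sim_samples, bins=30, eps=1e-9):
    """Estimate D_KL(p || q) from two 1-D sample sets on a shared histogram."""
    data_samples = np.asarray(data_samples, dtype=float)
    sim_samples = np.asarray(sim_samples, dtype=float)
    lo = min(data_samples.min(), sim_samples.min())
    hi = max(data_samples.max(), sim_samples.max())
    p_counts, _ = np.histogram(data_samples, bins=bins, range=(lo, hi))
    q_counts, _ = np.histogram(sim_samples, bins=bins, range=(lo, hi))
    # Additive smoothing keeps the logarithm finite in empty bins.
    p = (p_counts + eps) / np.sum(p_counts + eps)
    q = (q_counts + eps) / np.sum(q_counts + eps)
    return float(np.sum(p * np.log(p / q)))


rng = np.random.default_rng(0)
observed = rng.normal(0.0, 1.0, size=5000)   # stand-in for observed states
faithful = rng.normal(0.0, 1.0, size=5000)   # simulation from a well-recovered model
biased = rng.normal(1.0, 1.0, size=5000)     # simulation from a systematically wrong model

kl_faithful = histogram_kl(observed, faithful)
kl_biased = histogram_kl(observed, biased)
```

A divergence that stays clearly above the level seen between two independent draws from the same model, as `kl_biased` does relative to `kl_faithful` here, would be the systematic difference this test looks for.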

Figures

Figures reproduced from arXiv: 2306.10407 by Chengyang Huang, Gary D. Luker, Kathy E. Luker, Kenneth K. Y. Ho, Krishna Garikipati, Siddhartha Srivastava, Xun Huan.

**Figure 2.** (a) KL divergence $D_{KL}(p_t \| q_t)$ between the data distribution and the simulated probability distribution over time. The errors of the (b) value function and (c) partial derivatives of the value function, estimated as $\left( \frac{1}{|\Omega|} \int_\Omega (f(x) - f_{GT}(x))^2 \, dx \right)^{1/2}$, with the ground-truth state-action value generated using fixed-point iteration (details provided in Appendix B). The results of the convergence analysis…
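
The error measure in this caption is a domain-normalized root-mean-square difference. On a uniform grid it can be sketched as follows; the function name `rms_error`, the `cell_volume` argument, and the quadratic test function are hypothetical, and the uniform-grid quadrature is an assumption rather than the paper's stated discretization.

```python
import numpy as np


def rms_error(f_est, f_gt, cell_volume=1.0):
    """Approximate sqrt( (1/|Omega|) * integral of (f - f_GT)^2 ) on a uniform grid."""
    f_est = np.asarray(f_est, dtype=float)
    f_gt = np.asarray(f_gt, dtype=float)
    domain_volume = cell_volume * f_est.size           # |Omega| for equal-sized cells
    integral = np.sum((f_est - f_gt) ** 2) * cell_volume
    return float(np.sqrt(integral / domain_volume))


x = np.linspace(-1.0, 1.0, 201)
f_ground_truth = x ** 2                 # stand-in for a ground-truth value function
f_estimate = f_ground_truth + 0.1       # estimate with a constant offset of 0.1
offset_error = rms_error(f_estimate, f_ground_truth, cell_volume=x[1] - x[0])
```

Because the cell volume cancels on a uniform grid, the measure reduces to the square root of the mean squared difference, so a constant offset of 0.1 gives an error of about 0.1.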

**Figure 3.** (a) Kymograph for the cancer cell migration data: each column shows the different vari…

Original abstract

Inverse reinforcement learning (IRL) is a powerful paradigm for uncovering the incentive structure that drives agent behavior, by inferring an unknown reward function from observed trajectories within a Markov decision process (MDP). However, most existing IRL methods require access to the transition function, either prescribed or estimated a priori, which poses significant challenges when the underlying dynamics are unknown, unobservable, or not easily sampled. We propose Fokker-Planck inverse reinforcement learning (FP-IRL), a novel physics-constrained IRL framework tailored for systems that can be described by Fokker-Planck (FP) dynamics. FP-IRL simultaneously infers both the reward and transition functions directly from trajectory data, without requiring access to sampled transitions. Our method leverages a correspondence between MDPs and the FP equation, linking reward maximization in MDPs with free energy minimization in FP dynamics. This connection enables inference of the FP potential function using our inference approach of variational system identification, from which the full set of MDP components (reward, transition, and policy) can be recovered using analytic expressions. We demonstrate the effectiveness of FP-IRL through experiments on synthetic benchmarks and a modified version of the Mountain Car problem. Our results show that FP-IRL achieves accurate recovery of agent incentives while preserving computational efficiency and physical interpretability.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes FP-IRL, a physics-constrained IRL method for MDPs whose dynamics admit a Fokker-Planck description. It asserts a direct correspondence between MDP reward maximization and FP free-energy minimization, uses variational system identification to recover the FP potential from trajectories alone, and then applies analytic expressions to extract the reward function, transition kernel, and policy without requiring sampled transitions. Experiments on synthetic benchmarks and a modified Mountain Car task are used to illustrate recovery accuracy and efficiency.

Significance. If the asserted MDP-FP correspondence is exact, parameter-free, and yields unique analytic recoveries, the framework would enable IRL in continuous-state systems with unknown dynamics while preserving physical interpretability; this would be a distinctive contribution relative to standard IRL methods that presuppose or estimate the transition function.

major comments (2)

[Abstract and §3] MDP-FP correspondence: the central claim that reward maximization in an MDP is exactly equivalent to free-energy minimization under FP dynamics, permitting closed-form recovery of r(s,a), P(s'|s,a), and π from the inferred potential alone, is load-bearing. The manuscript must supply the explicit derivation (including the role of the diffusion coefficient and action embedding) showing that the mapping is bijective and independent of quantities fitted inside the variational procedure; without it the simultaneous inference guarantee does not follow.
[§4] Variational system identification and analytic recovery formulas: the expressions that convert the recovered FP potential into the MDP reward and transition must be shown to be free of hidden parameters or discretization artifacts. If the recovery formulas implicitly re-introduce knowledge of the FP operator or require additional constraints on the action space, the claim that both reward and transition are inferred directly from trajectories is undermined.

minor comments (2)

[Experiments] The modified Mountain Car experiment should explicitly state how the continuous FP dynamics are reconciled with the discrete action set and whether any approximation artifacts affect the reported recovery accuracy.
[Notation] Notation for the FP potential and the MDP components should be unified across sections to avoid ambiguity when mapping between the two formalisms.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful and constructive review. The two major comments both concern the need for fuller explicit derivations and parameter-independence arguments in Sections 3 and 4. We address each point below and will incorporate the requested material in the revised manuscript.

Referee: [Abstract and §3] MDP-FP correspondence: the central claim that reward maximization in an MDP is exactly equivalent to free-energy minimization under FP dynamics, permitting closed-form recovery of r(s,a), P(s'|s,a), and π from the inferred potential alone, is load-bearing. The manuscript must supply the explicit derivation (including the role of the diffusion coefficient and action embedding) showing that the mapping is bijective and independent of quantities fitted inside the variational procedure; without it the simultaneous inference guarantee does not follow.

Authors: We agree that an expanded, self-contained derivation is necessary to make the bijectivity and independence claims fully transparent. Section 3 already sketches the MDP-to-FP correspondence, but the revised manuscript will enlarge this section with a complete step-by-step derivation that explicitly treats the diffusion coefficient and the action-embedding map. A short appendix will supply the bijectivity argument and verify that the recovered quantities do not depend on any parameters internal to the variational procedure.

revision: yes

Referee: [§4] Variational system identification and analytic recovery formulas: the expressions that convert the recovered FP potential into the MDP reward and transition must be shown to be free of hidden parameters or discretization artifacts. If the recovery formulas implicitly re-introduce knowledge of the FP operator or require additional constraints on the action space, the claim that both reward and transition are inferred directly from trajectories is undermined.

Authors: We concur that the analytic recovery formulas require an explicit demonstration of parameter independence. In the revision we will augment Section 4 with a dedicated subsection that (i) states all assumptions on the action space, (ii) shows that the conversion expressions contain no hidden parameters or re-introduced FP-operator terms, and (iii) confirms the absence of discretization artifacts under the continuous-state formulation used in the paper.

revision: yes

Circularity Check

0 steps flagged

No significant circularity detected in derivation chain.

The paper's derivation rests on leveraging an asserted correspondence between MDPs and FP dynamics to connect reward maximization with free-energy minimization, followed by variational inference of the potential and analytic recovery of reward/transition/policy. No quoted equations or steps in the abstract or description reduce any claimed prediction to fitted inputs by construction, nor does any load-bearing premise collapse to a self-citation chain or ansatz smuggled from prior author work. The variational system identification and analytic expressions are presented as independent of the target quantities, and the method is tested on external benchmarks, making the chain self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The method rests on the domain assumption that the system obeys Fokker-Planck dynamics and that an exact analytic correspondence to MDP components exists. No free parameters or invented entities are identifiable from the abstract alone.

axioms (1)

domain assumption: The system dynamics admit a Fokker-Planck description and a direct correspondence exists between MDP reward maximization and FP free-energy minimization.
Explicitly stated as the tailoring condition and enabling link in the abstract.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · echoes

ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

Conjecture 3.1. The potential function in FP is equivalent to the negative state-action value function in MDP: ψ(s,a) = −Q^π(s,a).

IndisputableMonolith/Foundation/BranchSelection.lean · branch_selection · echoes

ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

F(p) = ∫ ψ(x) p(x) dx + β⁻¹ ∫ p(x) log p(x) dx … the solution of p_{t+1} = arg min W₂(p_t, p)² + Δt F(p) converges to the FP PDE
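
The free-energy statement in the passage above can be illustrated numerically. The sketch below evolves a 1-D density under the Fokker-Planck equation in gradient-flow form and compares the free energy at the start and end of the run. Everything specific is an assumption made for illustration: the quadratic potential `psi`, the value of `beta`, the grid, the explicit finite-volume scheme, and the function names. It integrates the PDE directly rather than solving the Wasserstein minimization step.

```python
import numpy as np

beta = 1.0                          # inverse temperature (assumed value)
x = np.linspace(-4.0, 4.0, 101)     # 1-D grid standing in for the state-action space
dx = x[1] - x[0]
psi = 0.5 * x ** 2                  # hypothetical potential


def free_energy(p):
    """F(p) = integral of psi*p + (1/beta) * integral of p*log(p), by grid quadrature."""
    safe_p = np.clip(p, 1e-300, None)   # avoid log(0); p*log(p) tends to 0 as p -> 0
    return float(np.sum(psi * p + p * np.log(safe_p) / beta) * dx)


def fp_step(p, dt):
    """One explicit step of dp/dt = d/dx(p * dpsi/dx) + (1/beta) * d2p/dx2."""
    dpsi = np.diff(psi) / dx                          # potential gradient at cell interfaces
    p_mid = 0.5 * (p[1:] + p[:-1])                    # density at cell interfaces
    flux = -dpsi * p_mid - np.diff(p) / (beta * dx)   # probability flux J
    flux = np.concatenate(([0.0], flux, [0.0]))       # zero-flux boundaries conserve mass
    return p - dt * np.diff(flux) / dx


# Start from a narrow bump away from the minimum of the potential.
p = np.exp(-0.5 * ((x - 2.0) / 0.3) ** 2)
p /= p.sum() * dx
f_start = free_energy(p)

for _ in range(8000):               # integrate to t = 8 with dt = 1e-3
    p = fp_step(p, dt=1e-3)
f_end = free_energy(p)

gibbs = np.exp(-beta * psi)         # expected stationary density, exp(-beta * psi)
gibbs /= gibbs.sum() * dx
max_gap = float(np.max(np.abs(p - gibbs)))
total_mass = float(p.sum() * dx)
```

Under these assumptions the free energy at the end of the run is lower than at the start and the density ends close to the Gibbs density proportional to exp(−βψ), which is the minimizer that the quoted functional singles out.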

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
