Inverse Reinforcement Learning with Just Classification and a Few Regressions

Aurelien Bibaut; Lars van der Laan; Nathan Kallus

arXiv: 2509.21172 · v2 · submitted 2025-09-25 · cs.LG · econ.EM · math.OC · stat.ML

Pith reviewed 2026-05-18 14:06 UTC · model grok-4.3

keywords: inverse reinforcement learning · maximum entropy IRL · policy estimation · soft Q-function · reward recovery · function approximation · finite sample guarantees

The pith

Inverse reinforcement learning recovers normalized rewards through policy classification and Q-function regression.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that in the maximum-entropy model of inverse reinforcement learning, normalized rewards can be recovered in three steps: classify the behavior policy from data, use regression to evaluate the policy's soft Q-function via the Bellman equation, and invert the Q-function to the reward. This modular approach, called GenPQR, allows the use of standard classification and regression tools without specialized neural network architectures or anchor-action restrictions. A sympathetic reader would care for two reasons: the finite-sample guarantees separate policy-estimation error from Q-evaluation error, and the theory accommodates large and continuous action spaces.

Core claim

Under the maximum-entropy or Gumbel-shock model with statewise affine normalizations, the normalized reward is recovered by estimating the behavior policy, solving for its soft Q-function through the Bellman equation, and then applying the Q-to-reward inversion. Both stages use off-the-shelf methods, and the procedure yields modular finite-sample guarantees with separate policy and Q-estimation errors.

What carries the argument

Generalized Policy-to-Q-to-Reward (GenPQR), a procedure that estimates the policy, evaluates the soft Q-function, and inverts to the normalized reward.
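The policy-to-Q-to-reward chain can be checked in a small tabular example. The sketch below is an illustration under stated assumptions, not the paper's implementation: it uses the anchor-action special case (reward fixed to zero for action 0), unit temperature, a known transition tensor `P`, and the exact policy rather than an estimate. The function names, the random MDP, and all numeric settings are hypothetical choices made for this example.

```python
import numpy as np

def logsumexp_rows(Q):
    """Numerically stable log-sum-exp over actions, one value per state."""
    m = Q.max(axis=1)
    return m + np.log(np.exp(Q - m[:, None]).sum(axis=1))

def soft_value_iteration(r, P, gamma, iters=2000):
    """Iterate the soft Bellman equation Q = r + gamma * E[V(s')], V = logsumexp_a Q."""
    Q = np.zeros_like(r)
    for _ in range(iters):
        Q = r + gamma * P @ logsumexp_rows(Q)
    return Q

def reward_from_policy(pi, P, gamma, anchor=0):
    """Invert a max-ent policy to the reward normalized by r(s, anchor) = 0."""
    log_pi = np.log(pi)
    n_states = pi.shape[0]
    # Anchor row of the Bellman equation: V = -log pi(anchor|s) + gamma * P_anchor V.
    V = np.linalg.solve(np.eye(n_states) - gamma * P[:, anchor, :], -log_pi[:, anchor])
    Q = log_pi + V[:, None]        # policy-to-Q: Q(s,a) = log pi(a|s) + V(s)
    return Q - gamma * P @ V       # Q-to-reward: r = Q - gamma * E[V(s')]

# Hypothetical random MDP used only for this check.
rng = np.random.default_rng(0)
S, A, gamma = 6, 3, 0.9
P = rng.dirichlet(np.ones(S), size=(S, A))   # P[s, a, s'] transition probabilities
r_true = rng.normal(size=(S, A))
r_true[:, 0] = 0.0                           # anchor-action normalization

Q_true = soft_value_iteration(r_true, P, gamma)
pi = np.exp(Q_true - logsumexp_rows(Q_true)[:, None])   # max-ent behavior policy

r_hat = reward_from_policy(pi, P, gamma)
max_err = np.abs(r_hat - r_true).max()
print("max abs reward error:", max_err)
```

With the exact policy and exact transitions the recovered reward agrees with the true one up to floating-point error, so any error seen in practice would come from the estimation stages, which is the separation the modular guarantees describe.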

If this is right

IRL becomes implementable with general function approximation using standard methods.
Error bounds are modular, allowing independent analysis of policy and value estimation steps.
The method extends to large and continuous action spaces without anchor restrictions.
Reward recovery matches or exceeds that of specialized approaches such as DeepPQR, with a simpler procedure.
Theory makes coverage requirements explicit and is independent of specific training procedures.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The modular structure suggests that many existing RL algorithms for policy estimation and Q-evaluation can be directly repurposed for IRL tasks.
Future work could test the method on real-world datasets where the max-entropy assumption may be approximate.
Connections to fitted Q-evaluation indicate potential for iterative improvement in reward estimates.

Load-bearing premise

The observed actions are generated exactly according to the maximum-entropy policy induced by the unknown normalized reward and its soft Q-function.

What would settle it

A simulation where data is generated from a non-maximum-entropy policy, such as a deterministic one, and GenPQR is applied to see if the recovered reward matches the true one or induces the observed behavior.
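A minimal version of that simulation might look like the sketch below. It is a hedged illustration, not an experiment from the paper: it assumes the anchor-action normalization (reward zero for action 0), known transitions, and exact policies, and it replaces the deterministic policy with an epsilon-greedy one because the inversion takes logarithms of action probabilities. The helper names and all settings are hypothetical.

```python
import numpy as np

def logsumexp_rows(Q):
    """Numerically stable log-sum-exp over actions, one value per state."""
    m = Q.max(axis=1)
    return m + np.log(np.exp(Q - m[:, None]).sum(axis=1))

def soft_q(r, P, gamma, iters=2000):
    """Soft Bellman fixed point by value iteration."""
    Q = np.zeros_like(r)
    for _ in range(iters):
        Q = r + gamma * P @ logsumexp_rows(Q)
    return Q

def invert(pi, P, gamma, anchor=0):
    """Policy-to-Q-to-reward inversion under r(s, anchor) = 0."""
    log_pi = np.log(pi)
    S = pi.shape[0]
    V = np.linalg.solve(np.eye(S) - gamma * P[:, anchor, :], -log_pi[:, anchor])
    return log_pi + V[:, None] - gamma * P @ V

rng = np.random.default_rng(1)
S, A, gamma, eps = 6, 3, 0.9, 0.1
P = rng.dirichlet(np.ones(S), size=(S, A))
r_true = rng.normal(size=(S, A))
r_true[:, 0] = 0.0

Q = soft_q(r_true, P, gamma)

# Well-specified case: actions follow the max-ent policy of the true reward.
pi_maxent = np.exp(Q - logsumexp_rows(Q)[:, None])

# Misspecified case: near-deterministic epsilon-greedy behavior on the same Q.
pi_greedy = np.full((S, A), eps / A)
pi_greedy[np.arange(S), Q.argmax(axis=1)] += 1.0 - eps

err_maxent = np.abs(invert(pi_maxent, P, gamma) - r_true).max()
err_greedy = np.abs(invert(pi_greedy, P, gamma) - r_true).max()
print("max-ent data:        ", err_maxent)
print("epsilon-greedy data: ", err_greedy)
```

The inversion always returns the unique anchor-normalized reward whose max-ent policy equals the policy it is given, so under epsilon-greedy behavior it rationalizes the observed actions but does not return the reward that generated them.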

Original abstract

Inverse reinforcement learning (IRL) aims to infer rewards from observed behavior, but rewards are not identified from the policy alone: many reward--value pairs can rationalize the same actions. Meaningful reward recovery therefore requires a normalization, yet existing normalized IRL methods often rely on anchor-action restrictions or specialized neural architectures. We study reward recovery in the maximum-entropy, or Gumbel-shock, model under a broad class of statewise affine normalizations, with anchor-action constraints as a special case. This yields Generalized Policy-to-$Q$-to-Reward (GenPQR), a modular procedure that estimates the behavior policy, evaluates its soft $Q$-function through the Bellman equation, and recovers the normalized reward. Both stages can be implemented with off-the-shelf classification and regression methods. We prove modular finite-sample guarantees under general function approximation, with separate policy-estimation and $Q$-estimation errors. As a concrete instantiation, we study GenPQR with fitted $Q$-evaluation, reducing IRL to policy estimation followed by regression. Experiments show that GenPQR matches or improves reward recovery relative to DeepPQR while remaining simpler and more modular. Compared with DeepPQR, our theory goes beyond anchor actions, accommodates large and continuous action spaces, makes coverage requirements explicit, and is not tied to a specific neural-network architecture or training procedure.
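The "policy estimation followed by regression" reduction mentioned in the abstract can be sketched from logged transitions alone. The code below is an assumption-laden toy, not the authors' algorithm: it uses the anchor-action normalization (reward zero for action 0), a tabular setting where the "classifier" is a smoothed frequency table and each "regression" is a per-group mean (least squares on one-hot features), and a fitted-Q-style iteration for the anchor action's Q-values. All names and settings are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
S, A, gamma, n, anchor = 4, 3, 0.8, 40_000, 0    # illustrative settings

def sample_rows(probs, rng):
    """Draw one category per row of `probs` by inverting the row-wise CDF."""
    u = rng.random(probs.shape[0])
    idx = (u[:, None] > np.cumsum(probs, axis=1)).sum(axis=1)
    return np.minimum(idx, probs.shape[1] - 1)   # guard against rounding

def group_mean(values, groups, n_groups):
    """Mean of `values` per group; assumes every group is non-empty."""
    sums = np.bincount(groups, weights=values, minlength=n_groups)
    return sums / np.bincount(groups, minlength=n_groups)

# Ground truth, used only to simulate data and to score the estimate.
P = rng.dirichlet(np.ones(S), size=(S, A))
r_true = rng.uniform(-0.5, 0.5, size=(S, A))
r_true[:, anchor] = 0.0
Q = np.zeros((S, A))
for _ in range(500):                             # soft value iteration
    Q = r_true + gamma * P @ np.log(np.exp(Q).sum(axis=1))
pi = np.exp(Q) / np.exp(Q).sum(axis=1, keepdims=True)

# Logged transitions (s, a, s') with actions drawn from the max-ent policy.
s = rng.integers(S, size=n)
a = sample_rows(pi[s], rng)
s_next = sample_rows(P[s, a], rng)

# Stage 1, classification: estimate pi(a|s) from (s, a) pairs.
counts = np.zeros((S, A))
np.add.at(counts, (s, a), 1.0)
log_pi_hat = np.log((counts + 1.0) / (counts + 1.0).sum(axis=1, keepdims=True))

# Stage 2, repeated regression: q0(s) = Q(s, anchor) solves
# q0(s) = gamma * E[q0(s') - log pi(anchor|s') | s, a = anchor].
is_anchor = a == anchor
q0 = np.zeros(S)
for _ in range(200):
    target = gamma * (q0[s_next] - log_pi_hat[s_next, anchor])
    q0 = group_mean(target[is_anchor], s[is_anchor], S)
V_hat = q0 - log_pi_hat[:, anchor]

# Stage 3, one more regression and the inversion:
# r(s,a) = log pi(a|s) + V(s) - gamma * E[V(s') | s, a].
EV_hat = group_mean(V_hat[s_next], s * A + a, S * A).reshape(S, A)
r_hat = log_pi_hat + V_hat[:, None] - gamma * EV_hat

max_err = np.abs(r_hat - r_true).max()
print("max abs reward error:", max_err)
```

No transition model is passed to the estimator: the expectations over next states are replaced by regressions on the logged data. In a non-tabular problem the frequency table and group means would be swapped for any probabilistic classifier and regressor, which is where the coverage requirements highlighted in the abstract would start to matter.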

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

GenPQR splits normalized IRL into off-the-shelf policy classification plus Q regression with separate finite-sample bounds, but the reward inversion only holds under exact max-ent data generation.

The paper's main point is that you can recover a normalized reward by first estimating the behavior policy via classification, then regressing its soft Q-function from the Bellman equation, and finally applying direct algebra. This GenPQR procedure covers a wider family of statewise affine normalizations than just anchor actions and supplies modular finite-sample bounds that keep policy error and Q error separate under general function approximation.

Referee Report

2 major / 2 minor

Summary. The paper proposes Generalized Policy-to-Q-to-Reward (GenPQR) for inverse reinforcement learning in the maximum-entropy (Gumbel-shock) model under statewise affine normalizations. The method estimates the behavior policy via classification, evaluates the soft Q-function by solving the Bellman equation with regression, and recovers the normalized reward by direct algebraic inversion. It supplies modular finite-sample guarantees under general function approximation with separate policy and Q error terms, reduces to policy estimation plus regression in the fitted Q-evaluation case, and reports experiments matching or improving on DeepPQR while handling large/continuous action spaces.

Significance. If the derivations and experiments hold, GenPQR supplies a simpler, modular alternative to architecture-specific IRL methods, with explicit coverage requirements and off-the-shelf ML components. The separation of policy estimation from Q-evaluation with independent bounds, plus the algebraic recovery step, is a clear strength when the generative model is satisfied.

major comments (2)

[Abstract; method and theory sections] The central reward-recovery claim (abstract and method section) via Q-to-reward inversion after Bellman evaluation holds only when the observed trajectories are generated exactly by the max-ent Gumbel-shock policy induced by the unknown normalized reward. Under any deviation from this generative assumption the recovered quantity need not equal the target even with perfect policy and Q estimates; the modular finite-sample bounds inherit the same restriction. This scope limitation is load-bearing for the practical interpretation of the guarantees.
[Theory section] The finite-sample guarantees are stated to be modular with separate error terms, yet the manuscript provides only high-level statements of the bounds without the full derivations or explicit coverage assumptions in the main text. This makes it difficult to verify that the policy-estimation and Q-estimation errors remain independent under general function approximation.

minor comments (2)

[Experiments] Clarify in the experimental section how the coverage assumptions required by the theory are satisfied in the reported simulations and how the empirical reward recovery is measured against the ground-truth normalized reward.
[Experiments] The comparison to DeepPQR would benefit from an explicit statement of which normalization is used in each baseline and whether the same normalization is applied to GenPQR outputs.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful reading and the recommendation for minor revision. We address each major comment below and have revised the manuscript to improve clarity on scope and assumptions while preserving the original contributions.

Referee: [Abstract; method and theory sections] The central reward-recovery claim (abstract and method section) via Q-to-reward inversion after Bellman evaluation holds only when the observed trajectories are generated exactly by the max-ent Gumbel-shock policy induced by the unknown normalized reward. Under any deviation from this generative assumption the recovered quantity need not equal the target even with perfect policy and Q estimates; the modular finite-sample bounds inherit the same restriction. This scope limitation is load-bearing for the practical interpretation of the guarantees.

Authors: We agree that the reward-recovery step and the associated finite-sample guarantees are derived under the assumption that the observed trajectories are generated exactly by the maximum-entropy (Gumbel-shock) policy induced by the unknown normalized reward. This is the standard generative model for maximum-entropy IRL, and the Q-to-reward algebraic inversion is valid precisely under this model; deviations would generally prevent exact recovery even with perfect estimates. To make this scope explicit for readers, we have added a clarifying sentence in the abstract and a dedicated paragraph in the method section stating the generative assumption up front. Revision: yes.
Referee: [Theory section] The finite-sample guarantees are stated to be modular with separate error terms, yet the manuscript provides only high-level statements of the bounds without the full derivations or explicit coverage assumptions in the main text. This makes it difficult to verify that the policy-estimation and Q-estimation errors remain independent under general function approximation.

Authors: The complete derivations of the modular finite-sample bounds, including the explicit coverage assumptions required for the policy and Q-function estimators and the argument establishing independence of the two error terms under general function approximation, appear in the appendix. We acknowledge that a high-level statement of the coverage conditions and a brief proof outline in the main theory section would aid verification. We have therefore expanded the theory section with the key coverage requirements and a short modular decomposition sketch while keeping the full proofs in the appendix. Revision: yes.

Circularity Check

0 steps flagged

Modular estimation with algebraic inversion; no load-bearing self-definition or fitted-input prediction

The derivation chain consists of (1) policy estimation by off-the-shelf classification, (2) soft-Q evaluation by regression on the Bellman equation, and (3) normalized-reward recovery by direct algebraic inversion. Finite-sample bounds are stated separately for the two estimation stages and do not rely on re-fitting a quantity defined to equal the target reward. The max-ent generative assumption is an explicit modeling premise rather than a hidden self-definition; the paper does not rename a fitted parameter as a 'prediction' or smuggle an ansatz via self-citation. The central claim therefore retains independent content and is scored as a normal non-circular result.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the maximum-entropy assumption for the data-generating process and on the statewise affine normalization class to guarantee identifiability of the reward from the policy and Q-function.

axioms (1)

Domain assumption: Observed behavior is generated by the maximum-entropy (Gumbel-shock) policy induced by the unknown normalized reward.
This modeling choice enables the decomposition into policy estimation, Bellman Q-evaluation, and direct reward recovery.
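
The decomposition this assumption enables can be written out in a few lines. The notation below is ours rather than the paper's and makes simplifying assumptions: discrete actions, unit temperature, discount factor $\gamma$, and $s'$ denoting the next state.

```latex
\begin{align*}
\pi(a \mid s) &= \exp\{Q(s,a) - V(s)\}, &
V(s) &= \log \sum_{a'} \exp Q(s,a'), \\
Q(s,a) &= r(s,a) + \gamma\, \mathbb{E}\left[ V(s') \mid s, a \right], \\
r(s,a) &= \log \pi(a \mid s) + V(s) - \gamma\, \mathbb{E}\left[ V(s') \mid s, a \right].
\end{align*}
```

The last line follows by substituting $Q(s,a) = \log \pi(a \mid s) + V(s)$ into the Bellman equation. Any choice of $V$ yields a reward that rationalizes the same policy, which is the non-identification the abstract describes; a normalization selects one $V$. In the anchor-action special case, setting $r(s,a_0) = 0$ gives $V(s) = -\log \pi(a_0 \mid s) + \gamma\, \mathbb{E}[V(s') \mid s, a_0]$, a linear fixed-point equation in $V$.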

pith-pipeline@v0.9.0 · 5782 in / 1503 out tokens · 68760 ms · 2026-05-18T14:06:10.882446+00:00 · methodology

