CROP: Conservative Reward for Model-based Offline Policy Optimization

Hao Li; Mei-Jiang Gui; Shi-Qi Liu; Shuang-Yi Wang; Shu-Hai Li; Xiao-Hu Zhou; Xiao-Liang Xie; Zeng-Guang Hou; Zhen-Qiu Feng

arXiv: 2310.17245 · v2 · submitted 2023-10-26 · cs.LG · cs.AI

Pith reviewed 2026-05-24 06:35 UTC · model grok-4.3

classification: cs.LG · cs.AI

keywords: offline reinforcement learning · model-based RL · conservative reward estimation · distribution shift · policy optimization

The pith

CROP builds a conservative reward estimator by jointly minimizing reward-estimation error and the rewards assigned to random actions, with the aim of curbing overestimation in model-based offline RL.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Offline RL suffers from overestimation due to distribution shift when models generate data outside the collected dataset. CROP proposes a streamlined objective that jointly minimizes estimation error and the rewards assigned to random actions. This yields a robustly conservative reward estimator. Theoretical analysis shows the mechanism produces conservative policy evaluation and mitigates distribution shift. Experiments indicate the simple change to reward estimation delivers competitive performance against existing methods.

Core claim

CROP introduces a streamlined objective that concurrently minimizes estimation error and the rewards of random actions, thereby yielding a robustly conservative reward estimator. Theoretical analysis shows that the designed conservative reward mechanism leads to a conservative policy evaluation and mitigates distribution shift.
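As a rough illustration of the joint objective described above, the following Python sketch evaluates a loss made of a squared reward error on dataset pairs plus β times the estimated reward at a randomly drawn action. The function name `crop_reward_loss`, the uniform action sampler, and the toy reward are assumptions made for this example; they are not taken from the paper, whose code was not yet released.

```python
import random

def crop_reward_loss(r_hat, dataset, beta, action_low, action_high, rng):
    """Mean over the dataset of (r_hat(s, a) - R)^2 + beta * r_hat(s, a_rand).

    a_rand is drawn uniformly from [action_low, action_high]; the uniform
    sampler is an assumption of this sketch.
    """
    total = 0.0
    for s, a, reward in dataset:
        a_rand = rng.uniform(action_low, action_high)
        total += (r_hat(s, a) - reward) ** 2 + beta * r_hat(s, a_rand)
    return total / len(dataset)

def exact_reward(s, a):
    # Hypothetical true reward: highest when the action matches the state.
    return -(a - s) ** 2

# Toy offline data: the behavior policy stays close to a = s.
data_rng = random.Random(0)
dataset = []
for _ in range(256):
    s = data_rng.uniform(-1.0, 1.0)
    a = s + data_rng.gauss(0.0, 0.1)
    dataset.append((s, a, exact_reward(s, a)))

# With beta = 0 the exact reward has zero loss; with beta > 0 the second
# term is driven by the estimated reward at random actions.
loss_plain = crop_reward_loss(exact_reward, dataset, 0.0, -2.0, 2.0, random.Random(1))
loss_cons = crop_reward_loss(exact_reward, dataset, 0.1, -2.0, 2.0, random.Random(1))
print(loss_plain, loss_cons)
```

In this toy setup the penalized loss is lower than the plain one only because the exact reward is already negative away from the data; a trained estimator would be pushed further down at random actions as β grows.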

What carries the argument

A streamlined objective that concurrently minimizes estimation error and the rewards of random actions, yielding a robustly conservative reward estimator.

If this is right

The conservative reward mechanism produces conservative policy evaluation.
Distribution shift is mitigated during offline policy optimization.
CROP achieves competitive performance with existing methods via a simple modification to reward estimation.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The reward conservatism might combine with other offline RL techniques such as explicit pessimism penalties.
Testing the estimator on environments with deliberately inaccurate dynamics models could reveal limits of the transfer to policy evaluation.
The approach may apply beyond model-based settings to model-free methods that also suffer reward overestimation.

Load-bearing premise

That jointly minimizing estimation error and the rewards of random actions produces a robustly conservative reward estimator whose conservatism transfers to policy evaluation without introducing offsetting biases or requiring additional assumptions on model accuracy.

What would settle it

A controlled experiment in which the CROP reward estimator is applied, yet the resulting policy evaluation still overestimates values or distribution shift is not mitigated.

Figures

Figures reproduced from arXiv: 2310.17245 by Hao Li, Mei-Jiang Gui, Shi-Qi Liu, Shuang-Yi Wang, Shu-Hai Li, Xiao-Hu Zhou, Xiao-Liang Xie, Zeng-Guang Hou, Zhen-Qiu Feng.

**Figure 1.** Conservative reward with different β. R and the behavior policy are also shown for comparison. Due to the different sizes and behavior policies of different datasets, the coverage of offline data and the learned model accuracy are different, which affect the selection of conservatism coefficient β and roll-out length k. For each dataset, β is searched from {0.01, 0.05, 0.1, 0.2} and k is searched from {3,…

Original abstract

Offline reinforcement learning (RL) aims to optimize a policy using collected data without online interactions. Model-based approaches are particularly appealing for addressing offline RL challenges because of their capability to mitigate the limitations of data coverage through data generation using models. Nonetheless, a prevalent issue in offline RL is the overestimation caused by distribution shift. This study proposes a novel model-based offline RL algorithm named Conservative Reward for model-based Offline Policy optimization (CROP). CROP introduces a streamlined objective that concurrently minimizes estimation error and the rewards of random actions, thereby yielding a robustly conservative reward estimator. Theoretical analysis shows that the designed conservative reward mechanism leads to a conservative policy evaluation and mitigates distribution shift. Experiments showcase that with the simple modification to reward estimation, CROP can conservatively estimate the reward and achieve competitive performance with existing methods. The source code will be available after acceptance.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

CROP adds a joint penalty on estimation error and random-action rewards for conservative reward estimation in model-based offline RL, but the theory does not appear to bound dynamics-model error, so the conservatism may not carry over to policy evaluation.

CROP proposes a conservative reward estimator for model-based offline RL. It minimizes estimation error while also penalizing the rewards given to random actions, and claims this produces conservative policy evaluation that reduces distribution shift. The new part is that joint objective. Prior work has conservative or pessimistic methods, but this specific pairing of error minimization with random-action penalties seems distinct from what's cited in the abstract.

The paper does a decent job keeping the method simple, just modifying the reward part, and it claims competitive results against other methods without major added complexity. The main weakness is that the theoretical claim may not hold up. The stress test points out that even with a conservative reward, rolling out an inaccurate dynamics model can still cause overestimation. The abstract does not mention any handling of model error in the analysis, so the guarantee might be incomplete. Experiments are described only at a high level with no numbers or details provided, which makes it hard to assess how well it actually works.

This paper is for researchers in offline RL who are looking for straightforward ways to add conservatism to model-based methods. Someone already deep in the area might see it as a small variation rather than a big advance. I would recommend sending it for peer review. The core idea is clear enough that referees can evaluate whether the theory closes the gap on model error and whether the experiments support the claims with proper details.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces CROP, a model-based offline RL algorithm whose core contribution is a reward estimator trained by jointly minimizing estimation error and the rewards assigned to random actions. The authors claim that this objective produces a conservative reward function whose conservatism propagates through model-based policy evaluation, thereby mitigating overestimation from distribution shift. Experiments are reported to show competitive performance against existing offline RL methods.

Significance. A simple, jointly optimized conservative reward estimator could reduce the engineering overhead of conservatism in model-based offline RL. If the propagation from reward conservatism to policy evaluation holds under standard model-error assumptions and the experiments include proper controls, the method would be a modest but practical addition to the conservative offline RL literature.

major comments (2)

[Theoretical analysis] Theoretical analysis section: the claim that the joint objective yields a reward estimator whose conservatism transfers to policy evaluation (via model rollouts) is load-bearing, yet the provided description supplies no explicit error term or bound on residual dynamics-model error. Without such a term, it is unclear whether the reward penalty dominates model-induced bias under standard Lipschitz or bounded-error assumptions on the dynamics.
[Experiments] Experiments section: the abstract asserts competitive performance, yet no quantitative results, error bars, ablation on the random-action penalty coefficient, or comparison of estimated vs. true rewards are referenced. This prevents verification that the conservative term, rather than other implementation choices, drives the reported gains.

minor comments (2)

[Experiments] The abstract states that source code will be released after acceptance; a reproducibility statement with exact hyper-parameter ranges and random seeds should be added to the experimental section.
[Method] Notation for the conservative reward objective should be introduced with an explicit equation number so that later claims about its effect on the value function can be traced directly.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and indicate planned revisions.

Referee: [Theoretical analysis] Theoretical analysis section: the claim that the joint objective yields a reward estimator whose conservatism transfers to policy evaluation (via model rollouts) is load-bearing, yet the provided description supplies no explicit error term or bound on residual dynamics-model error. Without such a term, it is unclear whether the reward penalty dominates model-induced bias under standard Lipschitz or bounded-error assumptions on the dynamics.

Authors: The current analysis shows that the conservative reward produces conservative policy evaluation when the dynamics model is exact. We agree an explicit error term for residual model error is needed to clarify dominance under bounded-error or Lipschitz assumptions. In revision we will add a proposition bounding the total evaluation error as a sum of the reward conservatism term and a model-error term, with a short discussion of when the former can dominate. This is a partial revision because the core claim holds under the exact-model case already analyzed.
revision: partial

Referee: [Experiments] Experiments section: the abstract asserts competitive performance, yet no quantitative results, error bars, ablation on the random-action penalty coefficient, or comparison of estimated vs. true rewards are referenced. This prevents verification that the conservative term, rather than other implementation choices, drives the reported gains.

Authors: The manuscript contains performance tables with means and standard deviations. We accept that an explicit ablation on the penalty coefficient and a direct estimated-vs-true reward comparison are missing. We will add both in the revision: a table varying the coefficient across environments and a figure showing reward estimation error on held-out trajectories. These additions will isolate the contribution of the conservative term.
revision: yes
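
The first response promises a bound that splits evaluation error into a reward term and a model-error term. One standard way to see such a split, offered here as an illustration rather than as the authors' proposition, is the telescoping identity $\hat{V} - V = (I - \gamma \hat{P}_\pi)^{-1}\big[(\hat{r}_\pi - r_\pi) + \gamma(\hat{P}_\pi - P_\pi)V\big]$. The Python sketch below checks it numerically on a hypothetical two-state MDP in which an optimistic model outweighs a uniform reward penalty of 0.05; every name and number is an assumption made for the example.

```python
def under_policy(P, r, pi):
    """Collapse P[s][a][s2] and r[s][a] under pi[s][a] into P_pi[s][s2], r_pi[s]."""
    n_s, n_a = len(r), len(r[0])
    P_pi = [[sum(pi[s][a] * P[s][a][s2] for a in range(n_a)) for s2 in range(n_s)]
            for s in range(n_s)]
    r_pi = [sum(pi[s][a] * r[s][a] for a in range(n_a)) for s in range(n_s)]
    return P_pi, r_pi

def solve_discounted(P_pi, b, gamma, iters=3000):
    """Solve x = b + gamma * P_pi x by fixed-point iteration."""
    n = len(b)
    x = [0.0] * n
    for _ in range(iters):
        x = [b[s] + gamma * sum(P_pi[s][s2] * x[s2] for s2 in range(n))
             for s in range(n)]
    return x

gamma = 0.9
pi = [[0.5, 0.5], [0.5, 0.5]]
r_true = [[0.0, 0.0], [1.0, 1.0]]                     # state 1 is rewarding
r_cons = [[v - 0.05 for v in row] for row in r_true]  # uniformly penalized reward
P_true = [[[0.9, 0.1], [0.9, 0.1]], [[0.9, 0.1], [0.9, 0.1]]]
P_model = [[[0.5, 0.5], [0.5, 0.5]], [[0.5, 0.5], [0.5, 0.5]]]  # optimistic model

P_pi, r_pi = under_policy(P_true, r_true, pi)
Pm_pi, rm_pi = under_policy(P_model, r_cons, pi)
v_true = solve_discounted(P_pi, r_pi, gamma)
v_model = solve_discounted(Pm_pi, rm_pi, gamma)

n = len(v_true)
reward_src = [rm_pi[s] - r_pi[s] for s in range(n)]
model_src = [gamma * sum((Pm_pi[s][s2] - P_pi[s][s2]) * v_true[s2] for s2 in range(n))
             for s in range(n)]
reward_term = solve_discounted(Pm_pi, reward_src, gamma)
model_term = solve_discounted(Pm_pi, model_src, gamma)
residual = max(abs(v_model[s] - v_true[s] - reward_term[s] - model_term[s])
               for s in range(n))
print(reward_term, model_term, residual)
```

In this toy instance the reward term is negative while the model term is positive and larger, so the model-based value exceeds the true value despite the conservative reward. That is the situation the referee's first comment asks the theory to bound or rule out.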

Circularity Check

0 steps flagged

No significant circularity; derivation introduces conservatism via explicit objective and analyzes its consequences separately.

The paper defines a reward estimator via a joint objective (minimize estimation error + penalize random-action rewards) and then states that theoretical analysis shows this yields conservative policy evaluation. No quoted equation reduces the policy-evaluation conservatism directly to the fitted parameters by algebraic identity or by renaming the input fit as an output prediction. The central claim rests on a derived consequence rather than a self-referential definition or self-citation chain that bears the entire load. The method is therefore self-contained against external benchmarks of model-based offline RL.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available; no equations, hyperparameters, or modeling assumptions can be extracted, so the ledger remains empty.

pith-pipeline@v0.9.0 · 5704 in / 1079 out tokens · 22239 ms · 2026-05-24T06:35:51.634469+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel; Jcost_pos_of_ne_one · echoes

ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

$l_r = \mathbb{E}_{\mathcal{D}}\big[(\hat{r}(s,a) - R(s,a))^2 + \beta\,\hat{r}(s,\bar{a})\big]$ … $r(s,a) = R(s,a) - \beta\mu/\bar{\pi}(a|s)$ … Theoretical analysis shows that this conservative reward mechanism leads to a conservative policy evaluation
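
To make the quoted objective concrete, the sketch below minimizes a tabular, single-state version of it in Python: behavior-weighted squared error plus β times the estimate under a random-action distribution μ. Under this sketch's unscaled squared error the pointwise minimizer is R(a) − βμ(a)/(2π̄(a)); the passage above writes βμ/π̄, so the constant presumably depends on how the paper scales its loss, which this page does not show. All names and numbers are illustrative assumptions.

```python
def objective(r_hat, R, pi_b, mu, beta):
    """Sum over actions of pi_b * (r_hat - R)^2 + beta * mu * r_hat, one fixed state."""
    return sum(p * (rh - r) ** 2 + beta * m * rh
               for rh, r, p, m in zip(r_hat, R, pi_b, mu))

def closed_form_minimizer(R, pi_b, mu, beta):
    """Per action, solve 2 * pi_b * (r_hat - R) + beta * mu = 0 for r_hat."""
    return [r - beta * m / (2.0 * p) for r, p, m in zip(R, pi_b, mu)]

def perturbed(values, index, delta):
    """Copy of values with delta added at one index."""
    return [v + delta if i == index else v for i, v in enumerate(values)]

R = [1.0, 1.0, 1.0]          # true reward, identical across actions
pi_b = [0.70, 0.25, 0.05]    # hypothetical behavior policy
mu = [1.0 / 3.0] * 3         # uniform random-action distribution
beta = 0.1

r_star = closed_form_minimizer(R, pi_b, mu, beta)
penalty = [r - rs for r, rs in zip(R, r_star)]
best = objective(r_star, R, pi_b, mu, beta)

# The objective is a convex quadratic, so any small perturbation should raise it.
is_minimum = all(
    objective(perturbed(r_star, i, d), R, pi_b, mu, beta) > best
    for i in range(len(R)) for d in (-1e-3, 1e-3)
)
print(penalty)
print(is_minimum)
```

The penalty grows as the behavior probability shrinks, which is the qualitative behavior the passage attributes to the conservative reward: actions the dataset rarely covers are marked down the most.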

IndisputableMonolith/Foundation/BranchSelection.lean · branch_selection; RCLCombiner_isCoupling_iff · echoes

ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

$\hat{Q}^\pi$ conservatively estimates the true Q-function … $Q^\pi(s_1,a_1) - \hat{Q}^\pi(s_1,a_1) > Q^\pi(s_2,a_2) - \hat{Q}^\pi(s_2,a_2)$ when $\bar{\pi}(a_1|s_1) < \bar{\pi}(a_2|s_2)$
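
The Q-function passage can be illustrated in the exact-model setting with a small Python sketch. It assumes a hypothetical two-state, two-action MDP, applies the penalty form quoted on this page (R − βμ/π̄, here with uniform μ), and evaluates one fixed policy under the same dynamics with the true and the penalized reward. The MDP, the policies, and the function name `evaluate_q` are assumptions for illustration only.

```python
def evaluate_q(P, r, pi, gamma, iters=2000):
    """Iterative policy evaluation of Q for policy pi under dynamics P[s][a][s2]."""
    n_s, n_a = len(r), len(r[0])
    q = [[0.0] * n_a for _ in range(n_s)]
    for _ in range(iters):
        v = [sum(pi[s][a] * q[s][a] for a in range(n_a)) for s in range(n_s)]
        q = [[r[s][a] + gamma * sum(P[s][a][s2] * v[s2] for s2 in range(n_s))
              for a in range(n_a)] for s in range(n_s)]
    return q

P = [[[0.8, 0.2], [0.3, 0.7]],
     [[0.5, 0.5], [0.1, 0.9]]]
R = [[1.0, 0.0], [0.5, 0.5]]
pi_b = [[0.9, 0.1], [0.5, 0.5]]   # behavior policy: action 1 is rare in state 0
mu = 0.5                          # uniform random-action probability
beta = 0.1
pi = [[0.5, 0.5], [0.5, 0.5]]     # policy being evaluated
gamma = 0.9

# Penalized reward in the form quoted on this page.
r_cons = [[R[s][a] - beta * mu / pi_b[s][a] for a in range(2)] for s in range(2)]

q_true = evaluate_q(P, R, pi, gamma)
q_cons = evaluate_q(P, r_cons, pi, gamma)
gap = [[q_true[s][a] - q_cons[s][a] for a in range(2)] for s in range(2)]
print(gap)
```

Because the dynamics are shared and the penalty is nonnegative, every gap is positive, and in this toy instance the gap in state 0 is larger for the rarely taken action. Whether that ordering holds in general depends on conditions stated in the paper's theorem, which this page does not reproduce.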

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

When Policies Cannot Be Retrained: A Unified Closed-Form View of Post-Training Steering in Offline Reinforcement Learning
cs.LG 2026-04 unverdicted novelty 6.0

For diagonal-Gaussian frozen actors, PoE with alpha equals KL adaptation with beta = alpha/(1-alpha); empirically, composition shows an actor-competence ceiling with 4/5/3 HELP/FROZEN/HURT split on D4RL and zero succe...

Reference graph

Works this paper leans on

34 extracted references · 34 canonical work pages · cited by 1 Pith paper · 5 internal anchors

[3] Agarwal, A.; Jiang, N.; and Kakade, S. M. 2019. Reinforcement learning: Theory and algorithms. Seattle, WA: CS Dept. of UW Seattle.

[4] Agarwal, R.; Schuurmans, D.; and Norouzi, M. 2019. An Optimistic Perspective on Offline Reinforcement Learning. In International Conference on Machine Learning.

[5] An, G.; Moon, S.; Kim, J.; and Song, H. O. 2021. Uncertainty-Based Offline Reinforcement Learning with Diversified Q-Ensemble. In Ranzato, M.; Beygelzimer, A.; Dauphin, Y. N.; Liang, P.; and Vaughan, J. W., eds., Annual Conference on Neural Information Processing Systems 2021, 7436-7447.

[6] Bai, C.; Wang, L.; Yang, Z.; Deng, Z.; Garg, A.; Liu, P.; and Wang, Z. 2022. Pessimistic Bootstrapping for Uncertainty-Driven Offline Reinforcement Learning. In The Tenth International Conference on Learning Representations.

[7] Bhardwaj, M.; Xie, T.; Boots, B.; Jiang, N.; and Cheng, C.-A. 2023. Adversarial Model for Offline Reinforcement Learning. ArXiv, abs/2302.11048.

[8] Cheng, C.-A.; Xie, T.; Jiang, N.; and Agarwal, A. 2022. Adversarially Trained Actor Critic for Offline Reinforcement Learning. In Proceedings of the 39th International Conference on Machine Learning, 3852-3878.

[9] Fu, J.; Kumar, A.; Nachum, O.; Tucker, G.; and Levine, S. 2020. D4RL: Datasets for Deep Data-Driven Reinforcement Learning. ArXiv, abs/2004.07219.

[10] Fujimoto, S.; and Gu, S. S. 2021. A Minimalist Approach to Offline Reinforcement Learning. In Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, 20132-20145.

[11] Fujimoto, S.; Meger, D.; and Precup, D. 2019. Off-Policy Deep Reinforcement Learning without Exploration. In Proceedings of the 36th International Conference on Machine Learning, volume 97, 2052-2062.

[12] Ghasemipour, S. K. S.; Schuurmans, D.; and Gu, S. S. 2021. EMaQ: Expected-Max Q-Learning Operator for Simple Yet Effective Offline and Online RL. In Proceedings of the 38th International Conference on Machine Learning, volume 139, 3682-3691.

[13] Haarnoja, T.; Zhou, A.; Hartikainen, K.; Tucker, G.; Ha, S.; Tan, J.; Kumar, V.; Zhu, H.; Gupta, A.; Abbeel, P.; and Levine, S. 2018. Soft Actor-Critic Algorithms and Applications. CoRR, abs/1812.05905.

[14] Kalashnikov, D.; Irpan, A.; Pastor, P.; Ibarz, J.; Herzog, A.; Jang, E.; Quillen, D.; Holly, E.; Kalakrishnan, M.; Vanhoucke, V.; and Levine, S. 2018. QT-Opt: Scalable Deep Reinforcement Learning for Vision-Based Robotic Manipulation. ArXiv, abs/1806.10293.

[15] Kidambi, R.; Rajeswaran, A.; Netrapalli, P.; and Joachims, T. 2020. MOReL: Model-Based Offline Reinforcement Learning. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual.

[16] Kim, B.; and Oh, M. 2023. Model-based Offline Reinforcement Learning with Count-based Conservatism. In the 40th International Conference on Machine Learning.

[17] Kostrikov, I.; Fergus, R.; Tompson, J.; and Nachum, O. 2021. Offline Reinforcement Learning with Fisher Divergence Critic Regularization. In Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, 5774-5783. PMLR.

[18] Kostrikov, I.; Nair, A.; and Levine, S. 2022. Offline Reinforcement Learning with Implicit Q-Learning. In The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022.

[19] Kumar, A.; Fu, J.; Soh, M.; Tucker, G.; and Levine, S. 2019. Stabilizing Off-Policy Q-Learning via Bootstrapping Error Reduction. In Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, December 8-14, 2019, Vancouver, BC, Canada, 11761-11771.

[20] Kumar, A.; Zhou, A.; Tucker, G.; and Levine, S. 2020. Conservative Q-Learning for Offline Reinforcement Learning. In Proceedings of Annual Conference on Neural Information Processing Systems 2020, 1179-1191.

[21] Laroche, R.; Trichelair, P.; and des Combes, R. T. 2019. Safe Policy Improvement with Baseline Bootstrapping. In Chaudhuri, K.; and Salakhutdinov, R., eds., Proceedings of the 36th International Conference on Machine Learning, ICML 2019, 9-15 June 2019, Long Beach, California, USA, volume 97 of Proceedings of Machine Learning Research, 3652-3661. PMLR.

[22] Levine, S.; Kumar, A.; Tucker, G.; and Fu, J. 2020. Offline Reinforcement Learning: Tutorial, Review, and Perspectives on Open Problems. ArXiv, abs/2005.01643.

[23] Lu, C.; Ball, P. J.; Parker-Holder, J.; Osborne, M. A.; and Roberts, S. J. 2022. Revisiting Design Choices in Offline Model Based Reinforcement Learning. In The Tenth International Conference on Learning Representations. OpenReview.net.

[24] Lyu, J.; Li, X.; and Lu, Z. 2022. Double Check Your State Before Trusting It: Confidence-Aware Bidirectional Offline Model-Based Imagination. In NeurIPS.

[25] Lyu, J.; Ma, X.; Li, X.; and Lu, Z. 2022. Mildly Conservative Q-Learning for Offline Reinforcement Learning. In Annual Conference on Neural Information Processing Systems 2022.

[26] Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A. A.; Veness, J.; Bellemare, M. G.; Graves, A.; Riedmiller, M. A.; Fidjeland, A. K.; Ostrovski, G.; Petersen, S.; Beattie, C.; Sadik, A.; Antonoglou, I.; King, H.; Kumaran, D.; Wierstra, D.; Legg, S.; and Hassabis, D. 2015. Human-level control through deep reinforcement learning. Nature, 518: 529-533.

[27] Rafailov, R.; Yu, T.; Rajeswaran, A.; and Finn, C. 2021. Offline Reinforcement Learning from Images with Latent Space Models. In the 3rd Annual Conference on Learning for Dynamics and Control, volume 144, 1154-1168.

[28] Rigter, M.; Lacerda, B.; and Hawes, N. 2022. RAMBO-RL: Robust Adversarial Model-Based Offline Reinforcement Learning. In NeurIPS.

[29] Shi, L.; Li, G.; Wei, Y.; Chen, Y.; and Chi, Y. 2022. Pessimistic Q-Learning for Offline Reinforcement Learning: Towards Optimal Sample Complexity. In International Conference on Machine Learning, volume 162, 19967-20025.

[30] Sun, Y.; Zhang, J.; Jia, C.; Lin, H.; Ye, J.; and Yu, Y. 2023. Model-Bellman Inconsistency for Model-based Offline Reinforcement Learning. In the 40th International Conference on Machine Learning.

[31] Sutton, R. S.; and Barto, A. G. 2005. Reinforcement Learning: An Introduction. IEEE Transactions on Neural Networks, 16: 285-286.

[32] Yu, T.; Kumar, A.; Rafailov, R.; Rajeswaran, A.; Levine, S.; and Finn, C. 2021. COMBO: Conservative Offline Model-Based Policy Optimization. In Ranzato, M.; Beygelzimer, A.; Dauphin, Y. N.; Liang, P.; and Vaughan, J. W., eds., Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, NeurIPS 202...

[33] Yu, T.; Thomas, G.; Yu, L.; Ermon, S.; Zou, J. Y.; Levine, S.; Finn, C.; and Ma, T. 2020. MOPO: Model-based Offline Policy Optimization. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual.

[34] Zhang, S.; Yao, L.; Sun, A.; and Tay, Y. 2017. Deep Learning based Recommender System: A Survey and New Perspectives. ArXiv, abs/1707.07435.
