pith. machine review for the scientific record. sign in

arxiv: 2605.07393 · v1 · submitted 2026-05-08 · 💻 cs.AI

Recognition: 2 theorem links

· Lean Theorem

Offline Policy Optimization with Posterior Sampling

Authors on Pith no claims yet

Pith reviewed 2026-05-11 01:44 UTC · model grok-4.3

classification 💻 cs.AI
keywords offline reinforcement learningmodel-based RLposterior samplingconstrained policy optimizationout-of-distribution generalizationmonotonic improvementconvergence analysis
0
0 comments X

The pith

Posterior sampling from a dynamics model combined with constrained subproblems lets offline RL generalize from consistent OOD data without exploitation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks to fix the core tension in model-based offline reinforcement learning: out-of-distribution samples can reflect real dynamics and aid generalization, yet they also let policies exploit model errors and collapse. It introduces Posterior Sampling-based Policy Optimization, which treats dynamics learning as Bayesian inference to produce an explicit posterior over models. Sampling transitions from this posterior supplies dynamics-consistent data for improvement, while a sequence of constrained policy subproblems keeps updates robust. The authors show that Q-value estimation converges as a stochastic approximation problem and that solving the subproblems produces monotonic policy gains until a fixed point. This matters for any setting where data is limited and off-policy evaluation is unreliable, because it replaces blanket pessimism with uncertainty-aware sampling.

Core claim

The central claim is that dynamics modeling as Bayesian inference yields a posterior that quantifies model fidelity, and that posterior sampling inside constrained policy optimization subproblems lets the learner use dynamics-consistent OOD transitions for generalization while guaranteeing robustness; the Q-function under posterior sampling converges, and the policy improves monotonically with each solved subproblem until convergence.

What carries the argument

Posterior sampling from the learned Bayesian distribution over dynamics models, paired with decomposition of policy optimization into a sequence of constrained subproblems.

If this is right

  • Q-value estimation converges when performed under repeated posterior sampling.
  • Policy performance improves monotonically with each solved constrained subproblem until a stationary point.
  • The approach avoids the performance loss that comes from uniform pessimistic regularization over all OOD data.
  • Standard benchmark tasks show higher final returns than existing offline RL baselines.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same posterior-plus-constraint pattern could be tested in online model-based RL to guide exploration without separate uncertainty bonuses.
  • If the posterior is replaced by an ensemble, the method might extend to settings where full Bayesian inference is intractable.
  • The decomposition into constrained subproblems suggests a way to add safety or budget constraints to other offline algorithms without losing their convergence arguments.

Load-bearing premise

The assumption that samples drawn from the learned posterior accurately represent the unknown true dynamics in out-of-distribution regions and that each constrained subproblem can be solved precisely enough to guarantee the claimed monotonic improvement.

What would settle it

An experiment in which policy returns fail to increase monotonically across iterations or in which the method underperforms strongly pessimistic baselines on tasks where OOD model errors are large and verifiable.

Figures

Figures reproduced from arXiv: 2605.07393 by Dongxu Zhang, Haijun Zhang, Hongqiang Lin, Mingzhe Li, Ning Yang, Yiding Sun.

Figure 1
Figure 1. Figure 1: (Left) Existing pessimism-based policy itera [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Learning and evaluation curves. We observe that the true performance improves near-monotonically, [PITH_FULL_IMAGE:figures/full_fig_p008_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Correlation between uncertainty and the TD target. The target is defined as [PITH_FULL_IMAGE:figures/full_fig_p008_3.png] view at source ↗
read the original abstract

A fundamental challenge in model-based offline reinforcement learning (RL) lies in the trade-off between generalization and robustness against exploitation errors in out-of-distribution (OOD) regions. While OOD samples may capture valid underlying physical dynamics, they also introduce the risk of model exploitation. Existing methods typically address this risk through excessive pessimistic regularization, which ensures robustness but often sacrifices generalization. To overcome this limitation, we propose Posterior Sampling-based Policy Optimization (PSPO), which formulates dynamics modeling as a Bayesian inference process to derive a posterior that explicitly quantifies model fidelity. Through the integration of posterior sampling and constrained policy optimization, our method leverages dynamics-consistent OOD transitions for generalization while ensuring robustness against model exploitation. Theoretically, we formulate Q-value estimation under posterior sampling as a stochastic approximation problem and establish its convergence. We decompose policy optimization into a sequence of constrained subproblems, demonstrating that solving these subproblems guarantees monotonic improvement until convergence. Experiments on standard benchmarks validate that PSPO achieves superior performance compared to state-of-the-art baselines.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes Posterior Sampling-based Policy Optimization (PSPO) for model-based offline RL. It treats dynamics modeling as Bayesian inference to obtain a posterior over models, then integrates posterior sampling with constrained policy optimization to exploit dynamics-consistent OOD transitions for generalization while guarding against model exploitation. The authors claim that Q-value estimation under posterior sampling converges as a stochastic approximation problem and that policy optimization decomposes into a sequence of constrained subproblems whose exact solution guarantees monotonic improvement until convergence. Experiments on standard benchmarks are reported to show superior performance versus SOTA baselines.

Significance. If the convergence and monotonicity results hold under realistic approximations, the work offers a principled middle ground between overly pessimistic regularization and unchecked model exploitation in offline MBRL. The explicit use of posterior sampling to quantify model fidelity and the constrained-subproblem decomposition are technically interesting; successful validation could influence subsequent offline RL algorithms that seek to leverage OOD data safely.

major comments (2)
  1. [§4] §4 (Policy Optimization Decomposition): The central claim that 'solving these subproblems guarantees monotonic improvement until convergence' assumes each constrained subproblem (involving posterior-sampled dynamics and Q-estimates) is solved to exact optimality. With neural-network policies and value functions, only approximate solutions are feasible; no error-propagation bounds or robustness analysis are provided showing that accumulated approximation error preserves monotonicity or the claimed convergence of the stochastic-approximation Q-estimator.
  2. [§3] §3 (Q-value Estimation under Posterior Sampling): The stochastic-approximation convergence argument for Q-values is stated but the manuscript provides neither the full derivation nor the precise conditions (step-size schedules, bounded variance of posterior samples) under which the result holds. Without these details it is impossible to verify whether the claimed convergence survives the offline data distribution and the additional variability introduced by posterior sampling.
minor comments (2)
  1. The abstract and introduction refer to a 'constraint tightness parameter' without specifying how it is chosen or whether it is tuned per environment; a brief discussion of its sensitivity would improve reproducibility.
  2. Notation for the posterior sampling temperature (or variance scaling) is introduced inconsistently across sections; a single consolidated definition would aid readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed comments. We address each major comment below with clarifications and commit to targeted revisions that strengthen the presentation without altering the core contributions.

read point-by-point responses
  1. Referee: [§4] §4 (Policy Optimization Decomposition): The central claim that 'solving these subproblems guarantees monotonic improvement until convergence' assumes each constrained subproblem (involving posterior-sampled dynamics and Q-estimates) is solved to exact optimality. With neural-network policies and value functions, only approximate solutions are feasible; no error-propagation bounds or robustness analysis are provided showing that accumulated approximation error preserves monotonicity or the claimed convergence of the stochastic-approximation Q-estimator.

    Authors: We agree that the monotonic improvement guarantee is formally established only under exact optimality of each constrained subproblem. The manuscript does not supply error-propagation bounds for approximate neural-network solutions, which is a fair critique. In the revision we will (i) explicitly restate the exact-optimality assumption in §4, (ii) add a new paragraph discussing the practical effect of approximation error, and (iii) include additional ablation experiments that monitor policy improvement across iterations on the benchmark suites. These experiments empirically indicate that the improvement direction remains stable despite approximation, although we acknowledge that a rigorous robustness analysis lies beyond the present scope and is left for future work. revision: partial

  2. Referee: [§3] §3 (Q-value Estimation under Posterior Sampling): The stochastic-approximation convergence argument for Q-values is stated but the manuscript provides neither the full derivation nor the precise conditions (step-size schedules, bounded variance of posterior samples) under which the result holds. Without these details it is impossible to verify whether the claimed convergence survives the offline data distribution and the additional variability introduced by posterior sampling.

    Authors: The main text sketches the reduction of posterior-sampling Q-learning to a stochastic-approximation process and asserts convergence, but the complete derivation and the exact technical conditions are indeed omitted. In the revised manuscript we will append the full proof to the supplementary material. The proof will explicitly invoke the Robbins-Monro step-size conditions (∑α_t=∞, ∑α_t²<∞), assume bounded variance of the posterior-sampled targets (which follows from the finite offline dataset and the Lipschitz continuity of the dynamics posterior), and verify that the offline data distribution satisfies the required ergodicity conditions for the mean-field ODE. This addition will make the convergence claim verifiable. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation chain is self-contained

full rationale

The paper formulates dynamics as Bayesian posterior inference on observed data, casts Q-value estimation under posterior sampling as a standard stochastic approximation problem whose convergence follows from established results in stochastic approximation theory, and decomposes policy optimization into constrained subproblems whose exact solution yields monotonic improvement via the standard policy improvement theorem. None of these steps reduce by construction to fitted parameters renamed as predictions, self-citations that are load-bearing, or ansatzes imported from prior author work. The central performance claims rest on the separation between the learned posterior (derived from data) and the subsequent constrained optimization, with no evidence that the claimed generalization or robustness is tautological with the inputs.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 0 invented entities

The central claims rest on standard Bayesian modeling assumptions and the existence of accurate solutions to the constrained subproblems. No new physical entities are introduced.

free parameters (2)
  • posterior sampling temperature or variance scaling
    Likely tuned to control how aggressively OOD samples are used; exact value not stated in abstract.
  • constraint tightness parameter
    Controls the allowed deviation from the data distribution during policy optimization.
axioms (2)
  • domain assumption The learned posterior over dynamics models contains at least some members that are consistent with the true (unknown) dynamics in OOD regions.
    Invoked when claiming that posterior sampling enables safe use of OOD transitions.
  • domain assumption The sequence of constrained policy optimization subproblems can be solved to sufficient accuracy to preserve monotonic improvement.
    Required for the theoretical guarantee stated in the abstract.

pith-pipeline@v0.9.0 · 5477 in / 1518 out tokens · 30307 ms · 2026-05-11T01:44:13.101695+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

46 extracted references · 46 canonical work pages

  1. [1]

    Multi-agent deep reinforcement learning for liquidation strategy analysis.Arxiv, 2019

    Wenhang Bao and Xiao-yang Liu. Multi-agent deep reinforcement learning for liquidation strategy analysis.Arxiv, 2019

  2. [2]

    A general framework for updating belief distributions

    Pier Giovanni Bissiri, Chris C Holmes, and Stephen G Walker. A general framework for updating belief distributions. Journal of the Royal Statistical Society Series B: Statistical Methodology, 2016

  3. [3]

    Offline rl without off-policy evaluation

    David Brandfonbrener, Will Whitney, Rajesh Ranganath, and Joan Bruna. Offline rl without off-policy evaluation. Advances in neural information processing systems, 2021

  4. [4]

    Diversification of adaptive policy for effective offline reinforcement learning

    Yunseon Choi, Li Zhao, Chuheng Zhang, Lei Song, Jiang Bian, and Kee-Eung Kim. Diversification of adaptive policy for effective offline reinforcement learning. InProceedings of the Thirty-Third International Joint Conference on Artificial Intelligence, 2024

  5. [5]

    Deep reinforcement learning in a handful of trials using probabilistic dynamics models.Advances in neural information processing systems, 2018

    Kurtland Chua, Roberto Calandra, Rowan McAllister, and Sergey Levine. Deep reinforcement learning in a handful of trials using probabilistic dynamics models.Advances in neural information processing systems, 2018

  6. [6]

    Online policy optimization for robust markov decision process

    Jing Dong, Jingwei Li, Baoxiang Wang, and Jingzhao Zhang. Online policy optimization for robust markov decision process. InThe 40th Conference on Uncertainty in Artificial Intelligence, 2024

  7. [7]

    D4rl: Datasets for deep data-driven reinforcement learning.Arxiv, 2020

    Justin Fu, Aviral Kumar, Ofir Nachum, George Tucker, and Sergey Levine. D4rl: Datasets for deep data-driven reinforcement learning.Arxiv, 2020

  8. [8]

    Bayesian reinforcement learning: A survey

    Mohammad Ghavamzadeh, Shie Mannor, Joelle Pineau, Aviv Tamar, et al. Bayesian reinforcement learning: A survey. Foundations and Trends® in Machine Learning, 2015

  9. [9]

    Offline rl policies should be trained to be adaptive

    Dibya Ghosh, Anurag Ajay, Pulkit Agrawal, and Sergey Levine. Offline rl policies should be trained to be adaptive. In International Conference on Machine Learning, 2022

  10. [10]

    Model-based offline reinforcement learning with pessimism-modulated dynamics belief.Advances in Neural Information Processing Systems, 2022

    Kaiyang Guo, Shao Yunfeng, and Yanhui Geng. Model-based offline reinforcement learning with pessimism-modulated dynamics belief.Advances in Neural Information Processing Systems, 2022

  11. [11]

    Efficient offline reinforcement learning with relaxed conservatism

    Longyang Huang, Botao Dong, and Weidong Zhang. Efficient offline reinforcement learning with relaxed conservatism. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024

  12. [12]

    Morel: Model-based offline reinforcement learning.Advances in neural information processing systems, 33:21810–21823, 2020

    Rahul Kidambi, Aravind Rajeswaran, Praneeth Netrapalli, and Thorsten Joachims. Morel: Model-based offline reinforcement learning.Advances in neural information processing systems, 33:21810–21823, 2020

  13. [13]

    Offline reinforcement learning with implicit q-learning

    Ilya Kostrikov, Ashvin Nair, and Sergey Levine. Offline reinforcement learning with implicit q-learning. InDeep RL Workshop NeurIPS 2021, 2021

  14. [14]

    Conservative q-learning for offline reinforcement learning.Advances in neural information processing systems, 2020

    Aviral Kumar, Aurick Zhou, George Tucker, and Sergey Levine. Conservative q-learning for offline reinforcement learning.Advances in neural information processing systems, 2020

  15. [15]

    Tucker, and Justin Fu

    Sergey Levine, Aviral Kumar, G. Tucker, and Justin Fu. Offline reinforcement learning: Tutorial, review, and perspectives on open problems.ArXiv, 2020

  16. [16]

    Tw-crl: Time-weighted contrastive reward learning for efficient inverse reinforcement learning

    Yuxuan Li, Yicheng Gao, Ning Yang, and Stephen Xia. Tw-crl: Time-weighted contrastive reward learning for efficient inverse reinforcement learning. InProceedings of the AAAI Conference on Artificial Intelligence, 2026

  17. [17]

    Any-step dynamics model improves future predictions for online and offline reinforcement learning

    Haoxin Lin, Yu-Yan Xu, Yihao Sun, Zhilong Zhang, Yi-Chen Li, Chengxing Jia, Junyin Ye, Jiaji Zhang, and Yang Yu. Any-step dynamics model improves future predictions for online and offline reinforcement learning. InInternational Conference on Learning Representations, 2025

  18. [18]

    Robust regularized policy iteration under transition uncertainty.Arxiv, 2026

    Hongqiang Lin, Zhenghui Fu, Weihao Tang, Pengfei Wang, Yiding Sun, Qixian Huang, and Dongxu Zhang. Robust regularized policy iteration under transition uncertainty.Arxiv, 2026

  19. [19]

    An offline adaptation framework for constrained multi-objective reinforcement learning

    Qian Lin, Zongkai Liu, Danying Mo, and Chao Yu. An offline adaptation framework for constrained multi-objective reinforcement learning. InAdvances in Neural Information Processing Systems, 2024

  20. [20]

    Imagination-limited q-learning for offline reinforcement learning

    Wenhui Liu, Zhijian Wu, Jingchao Wang, Dingjiang Huang, and Shuigeng Zhou. Imagination-limited q-learning for offline reinforcement learning. InProceedings of the Thirty-Fourth International Joint Conference on Artificial Intelligence, 2025. 11 Offline Policy Optimization with Posterior Sampling

  21. [21]

    Soft-robust algorithms for batch reinforcement learning

    Elita A Lobo, Mohammad Ghavamzadeh, and Marek Petrik. Soft-robust algorithms for batch reinforcement learning. Arxiv, 2020

  22. [22]

    Revisiting design choices in offline model based reinforcement learning

    Cong Lu, Philip Ball, Jack Parker-Holder, Michael Osborne, and Stephen J Roberts. Revisiting design choices in offline model based reinforcement learning. InInternational Conference on Learning Representations, 2022

  23. [23]

    Doubly mild generalization for offline reinforce- ment learning

    Yixiu Mao, Cheems Wang, Yun Qu, Yuhang Jiang, and Xiangyang Ji. Doubly mild generalization for offline reinforce- ment learning. InAdvances in neural information processing systems, 2024

  24. [24]

    The generalization gap in offline reinforcement learning

    Ishita Mediratta, Qingfei You, Minqi Jiang, and Roberta Raileanu. The generalization gap in offline reinforcement learning. InInternational Conference on Learning Representations, 2024

  25. [25]

    Human-level control through deep reinforcement learning.nature, 2015

    V olodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning.nature, 2015

  26. [26]

    Model-based reinforcement learning: A survey.Foundations and Trends® in Machine Learning, 2023

    Thomas M Moerland, Joost Broekens, Aske Plaat, Catholijn M Jonker, et al. Model-based reinforcement learning: A survey.Foundations and Trends® in Machine Learning, 2023

  27. [27]

    A unified view of entropy-regularized markov decision processes, 2017

    Gergely Neu, Anders Jonsson, and Vicenc ¸G´omez. A unified view of entropy-regularized markov decision processes, 2017

  28. [28]

    Long-horizon model-based offline reinforcement learning without conservatism.Arxiv, 2025

    Tianwei Ni, Esther Derman, Vineet Jain, Vincent Taboga, Siamak Ravanbakhsh, and Pierre-Luc Bacon. Long-horizon model-based offline reinforcement learning without conservatism.Arxiv, 2025

  29. [29]

    Is value learning really the main bottleneck in offline rl? Advances in Neural Information Processing Systems, 2024

    Seohong Park, Kevin Frans, Sergey Levine, and Aviral Kumar. Is value learning really the main bottleneck in offline rl? Advances in Neural Information Processing Systems, 2024

  30. [30]

    Flow q-learning

    Seohong Park, Qiyang Li, and Sergey Levine. Flow q-learning. InInternational Conference on Machine Learning, 2025

  31. [31]

    Sumo: Search-based uncertainty estimation for model- based offline reinforcement learning.Proceedings of the AAAI Conference on Artificial Intelligence, 2025

    Zhongjian Qiao, Jiafei Lyu, Kechen Jiao, Qi Liu, and Xiu Li. Sumo: Search-based uncertainty estimation for model- based offline reinforcement learning.Proceedings of the AAAI Conference on Artificial Intelligence, 2025

  32. [32]

    Planning for risk-aversion and expected value in mdps

    Marc Rigter, Paul Duckworth, Bruno Lacerda, and Nick Hawes. Planning for risk-aversion and expected value in mdps. InProceedings of the International Conference on Automated Planning and Scheduling, 2022

  33. [33]

    Rambo-rl: Robust adversarial model-based offline reinforcement learning.Advances in neural information processing systems, 2022

    Marc Rigter, Bruno Lacerda, and Nick Hawes. Rambo-rl: Robust adversarial model-based offline reinforcement learning.Advances in neural information processing systems, 2022

  34. [34]

    One risk to rule them all: A risk-sensitive perspective on model-based offline reinforcement learning.Advances in neural information processing systems, 2023

    Marc Rigter, Bruno Lacerda, and Nick Hawes. One risk to rule them all: A risk-sensitive perspective on model-based offline reinforcement learning.Advances in neural information processing systems, 2023

  35. [35]

    A stochastic approximation method.The annals of mathematical statistics, pages 400–407, 1951

    Herbert Robbins and Sutton Monro. A stochastic approximation method.The annals of mathematical statistics, pages 400–407, 1951

  36. [36]

    Trust region policy optimization

    John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. Trust region policy optimization. In International Conference on Machine Learning, 2015

  37. [37]

    Model-bellman inconsistency for model-based offline reinforcement learning

    Yihao Sun, Jiaji Zhang, Chengxing Jia, Haoxin Lin, Junyin Ye, and Yang Yu. Model-bellman inconsistency for model-based offline reinforcement learning. InInternational Conference on Machine Learning, 2023

  38. [38]

    Risk-averse offline reinforcement learning

    N´uria Armengol Urp´ı, Sebastian Curi, and Andreas Krause. Risk-averse offline reinforcement learning. InInternational Conference on Learning Representations, 2021

  39. [39]

    A bayesian approach to robust inverse reinforcement learning

    Ran Wei, Siliang Zeng, Chenliang Li, Alfredo Garcia, Anthony D McDonald, and Mingyi Hong. A bayesian approach to robust inverse reinforcement learning. InConference on Robot Learning, 2023

  40. [40]

    Proactive constrained policy optimization with preemptive penalty

    Ning Yang, Pengyu Wang, Guoqing Liu, Haifeng Zhang, Pin Lyu, and Jun Wang. Proactive constrained policy optimization with preemptive penalty. InProceedings of the AAAI Conference on Artificial Intelligence, 2026

  41. [41]

    RTDiff: Reverse trajectory synthesis via diffusion for offline reinforcement learning

    Qianlan Yang and Yu-Xiong Wang. RTDiff: Reverse trajectory synthesis via diffusion for offline reinforcement learning. InInternational Conference on Learning Representations, 2025

  42. [42]

    Exclusively penalized q-learning for offline reinforcement learning.Advances in Neural Information Processing Systems, 37:113405–113435, 2024

    Junghyuk Yeom, Yonghyeon Jo, Jeongmo Kim, Sanghyeon Lee, and Seungyul Han. Exclusively penalized q-learning for offline reinforcement learning.Advances in Neural Information Processing Systems, 37:113405–113435, 2024. 12 Offline Policy Optimization with Posterior Sampling

  43. [43]

    Mopo: Model-based offline policy optimization.Advances in Neural Information Processing Systems, 2020

    Tianhe Yu, Garrett Thomas, Lantao Yu, Stefano Ermon, James Y Zou, Sergey Levine, Chelsea Finn, and Tengyu Ma. Mopo: Model-based offline policy optimization.Advances in Neural Information Processing Systems, 2020

  44. [44]

    Combo: Conservative offline model-based policy optimization.Advances in neural information processing systems, 2021

    Tianhe Yu, Aviral Kumar, Rafael Rafailov, Aravind Rajeswaran, Sergey Levine, and Chelsea Finn. Combo: Conservative offline model-based policy optimization.Advances in neural information processing systems, 2021

  45. [45]

    Optimistic model rollouts for pessimistic offline policy optimization

    Y Zhai, Y Li, Z Gao, et al. Optimistic model rollouts for pessimistic offline policy optimization. InProceedings of the AAAI Conference on Artificial Intelligence, 2024

  46. [46]

    · · · E T∞,π ∞X t=0 γtr(st, at) #  .(10) Regularized objective function: eJ(π) = Eρ0,π T0∼P(T|D)   E T0,π T1∼P(T|D)

    Zhe Zhang and Xiaoyang Tan. Adaptive reward shifting based on behavior proximity for offline reinforcement learning. InProceedings of the Thirty-Second International Joint Conference on Artificial Intelligence, 2023. 13 Offline Policy Optimization with Posterior Sampling Appendix A Basic Knowledge about Bayesian RL We begin by reviewing the formal framewo...