Offline Constrained Reinforcement Learning under Partial Data Coverage

Ambuj Tewari; Kihyuk Hong; Seokmin Ko

arxiv: 2505.17506 · v2 · submitted 2025-05-23 · 📊 stat.ML · cs.LG

Offline Constrained Reinforcement Learning under Partial Data Coverage

Seokmin Ko , Ambuj Tewari , Kihyuk Hong This is my paper

Pith reviewed 2026-05-19 14:13 UTC · model grok-4.3

classification 📊 stat.ML cs.LG

keywords offline constrained reinforcement learninggeneral function approximationprimal-dual algorithmpartial data coverageconstrained Markov decision processesoracle-efficient methods

0 comments

The pith

PDOCRL returns a near-optimal near-feasible policy in offline constrained RL with partial coverage of an optimal policy and a stronger realizability condition, without access to the data-generating distribution.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes PDOCRL, an oracle-efficient primal-dual algorithm for offline constrained reinforcement learning with general function approximation in discounted constrained Markov decision processes. It relies on a decomposed linear-programming formulation that treats the policy as an explicit optimization variable, which sidesteps policy extraction steps that would otherwise require knowledge of the data-generating distribution. The algorithm only calls standard policy-optimization, online linear-optimization, and linear-minimization oracles. Under a stronger realizability condition that rules out spurious restricted saddle points and given only partial coverage of an optimal policy, it delivers a near-optimal near-feasible output with sample complexity of order epsilon to the minus two. A sympathetic reader would care because prior offline constrained RL methods typically demand full data coverage or explicit knowledge of the behavior distribution, both of which are often unavailable in practice.

Core claim

The paper shows that saddle-point formulations with general function approximation can admit spurious saddle points even when an optimal solution is realizable, and identifies a stronger realizability condition under which every restricted saddle point is optimal. Under this condition together with partial coverage of an optimal policy, the PDOCRL algorithm, built on the decomposed linear-programming formulation, returns a near-optimal near-feasible policy with a tilde-O of epsilon to the minus two sample guarantee while using only standard oracles and without requiring access to the data-generating distribution.

What carries the argument

The decomposed linear-programming formulation that makes the policy an explicit optimization variable, together with the stronger realizability condition ensuring every restricted saddle point is optimal.

If this is right

The method works with only partial coverage of an optimal policy rather than full coverage.
Policy extraction no longer requires knowledge of the data-generating distribution.
Implementation reduces to calls to standard policy-optimization, online linear-optimization, and linear-minimization oracles.
The sample complexity scales as tilde-O of epsilon to the minus two for near-optimal near-feasible output.
The approach applies to general function approximation in discounted constrained MDPs.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same decomposed formulation and stronger realizability condition could be tested in online or model-based constrained RL to see whether similar sample savings appear.
Applications with hard safety constraints, such as robotics or resource allocation, might benefit from the ability to operate under partial optimal-policy coverage.
Relaxing the stronger realizability condition while retaining oracle efficiency remains an open direction suggested by the spurious-saddle-point analysis.

Load-bearing premise

The stronger realizability condition holds so that every restricted saddle point is optimal.

What would settle it

Run the saddle-point formulation on an instance where an optimal solution is realizable but the stronger realizability condition fails, and check whether spurious restricted saddle points appear; or execute PDOCRL on a constrained MDP benchmark where only an optimal policy has partial coverage and measure whether near-optimal near-feasible performance is achieved at the claimed rate.

Figures

Figures reproduced from arXiv: 2505.17506 by Ambuj Tewari, Kihyuk Hong, Seokmin Ko.

read the original abstract

We study offline constrained reinforcement learning with general function approximation in discounted constrained Markov decision processes. Prior methods either require full data coverage for evaluating intermediate policies, lack oracle efficiency, or requires the knowledge of data-generating distribution for policy extraction. We propose PDOCRL, an oracle-efficient primal-dual algorithm based on a decomposed linear-programming formulation that makes the policy an explicit optimization variable. This avoids policy extraction that requires the knowledge of data-generating distribution, and only uses standard policy-optimization, online linear-optimization, and linear-minimization oracles. We show that saddle-point formulations using general function approximation can have spurious saddle points even when an optimal solution is realizable, and identify a stronger realizability condition under which every restricted saddle point is optimal. Under this condition and partial coverage of an optimal policy, PDOCRL returns a near-optimal, near-feasible policy with a \(\widetilde{\mathcal O}(\epsilon^{-2})\) sample guarantee, without access to the data-generating distribution. Empirically, PDOCRL is competitive with strong baselines on standard offline constrained RL benchmarks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper gives a clean way to handle partial coverage in offline constrained RL by making the policy explicit in a decomposed LP and adding a stronger realizability condition to kill spurious saddle points.

read the letter

The main takeaway is that PDOCRL uses a decomposed linear program to treat the policy as an explicit variable. This sidesteps the need for the data-generating distribution when extracting a policy, and it only calls standard oracles for policy optimization, online linear optimization, and linear minimization. Under partial coverage of an optimal policy and their stronger realizability condition, they claim a near-optimal near-feasible policy with Õ(ε^{-2}) samples. That is a direct response to a real limitation in prior offline constrained RL work that often assumes full coverage or extra distribution knowledge. The empirical results on standard benchmarks are competitive with baselines, which adds some practical grounding. They also clearly show that ordinary realizability can leave spurious saddle points even when an optimum is realizable, and the stronger condition ensures every restricted saddle point is optimal. That identification step is the part that feels new rather than routine extension. The soft spot is that the entire sample-complexity guarantee sits on top of this stronger realizability condition. If the chosen function class does not satisfy it, the primal-dual updates can still converge to a non-optimal restricted saddle point and the bound does not apply. It is not obvious how often the condition holds in practice with general function approximation, and the abstract does not give much guidance on checking it. The proofs are not visible here, so the tightness of the analysis and the handling of approximation error remain to be verified. This paper is for people working on theoretical offline RL with constraints and safety requirements. A reader who already knows the saddle-point literature and wants oracle-efficient methods under relaxed coverage would get concrete value from the LP decomposition and the spurious-point diagnosis. The math and citation pattern look internally consistent on the surface. I would send it to peer review rather than desk reject; the fix for spurious saddles is worth expert scrutiny even if the realizability assumption turns out to be the main practical hurdle.

Referee Report

2 major / 2 minor

Summary. The paper proposes PDOCRL, an oracle-efficient primal-dual algorithm for offline constrained RL in discounted CMDPs under general function approximation. It introduces a decomposed LP formulation that treats the policy as an explicit optimization variable, avoiding policy extraction that requires knowledge of the data-generating distribution. The work shows that standard saddle-point formulations admit spurious saddle points even when an optimal solution is realizable, identifies a stronger realizability condition ensuring every restricted saddle point is optimal, and proves that under this condition plus partial coverage of an optimal policy, PDOCRL returns a near-optimal near-feasible policy with sample complexity Õ(ε^{-2}) using only standard policy-optimization, online linear-optimization, and linear-minimization oracles. Empirical results demonstrate competitiveness with baselines on standard offline constrained RL benchmarks.

Significance. If the central claims hold, particularly the sample-complexity guarantee under the stronger realizability condition and partial coverage, the paper advances offline constrained RL by relaxing coverage requirements, eliminating dependence on the data-generating distribution, and maintaining oracle efficiency. The identification of spurious saddle points in general-function-approximation saddle-point methods is a useful diagnostic insight. The use of standard oracles and the explicit LP decomposition are technical strengths that could influence follow-up work on constrained offline RL.

major comments (2)

[§4] §4 (stronger realizability condition): The O(ε^{-2}) sample bound is load-bearing on the stronger realizability assumption that every restricted saddle point is optimal. The manuscript demonstrates that standard realizability permits spurious points, but does not provide sufficient conditions or verification procedures for common approximators (e.g., neural networks) under which the stronger condition holds; without this, the guarantee does not apply in general.
[§3.2] §3.2 (decomposed LP formulation): The equivalence between the decomposed LP and the original constrained RL problem under partial coverage of an optimal policy is central to avoiding data-distribution knowledge. The analysis should explicitly bound any additional error introduced by the decomposition when the coverage is only partial rather than full.

minor comments (2)

[§3] Notation for oracles (policy-optimization, online linear-optimization, linear-minimization) is introduced without a consolidated table; adding one would improve readability when tracing the algorithm's oracle calls.
[Experiments] In the empirical section, the function classes used for the baselines and PDOCRL should be stated explicitly so readers can assess whether the stronger realizability condition is plausibly satisfied in the reported experiments.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We are grateful to the referee for the positive summary and for highlighting areas for improvement. We respond to each major comment below, indicating whether revisions will be made.

read point-by-point responses

Referee: [§4] §4 (stronger realizability condition): The O(ε^{-2}) sample bound is load-bearing on the stronger realizability assumption that every restricted saddle point is optimal. The manuscript demonstrates that standard realizability permits spurious points, but does not provide sufficient conditions or verification procedures for common approximators (e.g., neural networks) under which the stronger condition holds; without this, the guarantee does not apply in general.

Authors: The stronger realizability condition is indeed central to avoiding spurious saddle points and achieving the stated sample complexity. Our manuscript provides a counterexample showing that standard realizability is not sufficient, and defines the stronger condition under which all restricted saddle points are optimal. We do not claim to provide general sufficient conditions or verification methods for arbitrary function approximators such as neural networks, as these would depend on specific model classes and are beyond the scope of this work. This is an honest limitation of the current analysis. We will add a discussion remark in Section 4 to clarify the assumption's scope. revision: partial
Referee: [§3.2] §3.2 (decomposed LP formulation): The equivalence between the decomposed LP and the original constrained RL problem under partial coverage of an optimal policy is central to avoiding data-distribution knowledge. The analysis should explicitly bound any additional error introduced by the decomposition when the coverage is only partial rather than full.

Authors: We appreciate this suggestion. In fact, when the data provides partial coverage of an optimal policy, the decomposed LP is equivalent to the original problem with no additional approximation error from the decomposition itself. The partial coverage ensures that the optimal policy can be represented in the optimization without requiring knowledge of the full data distribution. We will update the manuscript to include an explicit statement that the error is zero under the stated assumptions in the analysis of the decomposed formulation. revision: yes

standing simulated objections not resolved

Providing sufficient conditions or verification procedures for the stronger realizability condition to hold for common approximators such as neural networks.

Circularity Check

0 steps flagged

No circularity; derivation self-contained under independent realizability assumption

full rationale

The paper defines PDOCRL via a decomposed LP formulation using standard oracles (policy optimization, online linear optimization, linear minimization). It mathematically identifies a stronger realizability condition (every restricted saddle point is optimal) to rule out spurious points that exist under standard realizability, then derives the Õ(ε^{-2}) sample bound from partial coverage of an optimal policy plus this condition. No step reduces a prediction or result to a fitted parameter by construction, nor relies on self-citation chains or ansatzes smuggled from prior work. The bound follows from standard primal-dual analysis under the stated assumptions, which are external to the algorithm's outputs and not tautological.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central guarantee rests on a stronger realizability condition for saddle points and partial coverage of an optimal policy; these are domain assumptions rather than derived quantities.

axioms (2)

domain assumption Stronger realizability condition ensuring every restricted saddle point is optimal
Invoked to eliminate spurious saddle points in the general function approximation setting.
domain assumption Partial coverage of an optimal policy in the offline dataset
Required for the sample-complexity bound to hold without full coverage.

pith-pipeline@v0.9.0 · 5720 in / 1248 out tokens · 39738 ms · 2026-05-19T14:13:34.650934+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Cost.FunctionalEquation washburn_uniqueness_aczel echoes

?

echoes
ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

We show that saddle-point formulations using general function approximation can have spurious saddle points even when an optimal solution is realizable, and identify a stronger realizability condition under which every restricted saddle point is optimal.
Foundation.RealityFromDistinction reality_from_one_distinction unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

Under this condition and partial coverage of an optimal policy, PDOCRL returns a near-optimal, near-feasible policy with a ~O(eps^{-2}) sample guarantee

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

24 extracted references · 24 canonical work pages · 1 internal anchor

[1]

Constrained Markov Decision Processes

Eitan Altman. Constrained Markov Decision Processes. V ol. 7. CRC Press, 1999

work page 1999
[2]

Convex optimization

Stephen P Boyd and Lieven Vandenberghe. Convex optimization. Cambridge university press, 2004

work page 2004
[3]

Offline reinforcement learning under value and density-ratio realizability: the power of gaps

Jinglin Chen and Nan Jiang. “Offline reinforcement learning under value and density-ratio realizability: the power of gaps”. In: Uncertainty in Artificial Intelligence. PMLR. 2022, pp. 378–388

work page 2022
[4]

Adversarially trained actor critic for offline reinforcement learning

Ching-An Cheng, Tengyang Xie, Nan Jiang, and Alekh Agarwal. “Adversarially trained actor critic for offline reinforcement learning”. In: International Conference on Machine Learning. PMLR. 2022, pp. 3852–3878

work page 2022
[5]

Offline primal-dual reinforcement learning for linear mdps

Germano Gabbianelli, Gergely Neu, Matteo Papini, and Nneka M Okolo. “Offline primal-dual reinforcement learning for linear mdps”. In: International Conference on Artificial Intelligence and Statistics. PMLR. 2024, pp. 3169–3177

work page 2024
[6]

Lagrangean decomposition: A model yielding stronger La- grangean bounds

Monique Guignard and Siwhan Kim. “Lagrangean decomposition: A model yielding stronger La- grangean bounds”. In: Mathematical programming 39.2 (1987), pp. 215–228

work page 1987
[7]

A Primal-Dual Algorithm for Offline Constrained Reinforcement Learning with Linear MDPs

Kihyuk Hong and Ambuj Tewari. “A Primal-Dual Algorithm for Offline Constrained Reinforcement Learning with Linear MDPs”. In: International Conference on Machine Learning . PMLR. 2024, pp. 18711–18737

work page 2024
[8]

Offline reinforcement learning in large state spaces: Algorithms and guarantees

Nan Jiang and Tengyang Xie. “Offline reinforcement learning in large state spaces: Algorithms and guarantees”. In: Statistical Science (2024)

work page 2024
[9]

Batch policy learning under constraints

Hoang Le, Cameron V oloshin, and Yisong Yue. “Batch policy learning under constraints”. In:Inter- national Conference on Machine Learning. PMLR. 2019, pp. 3703–3712

work page 2019
[10]

Offline Reinforcement Learning: Tutorial, Review, and Perspectives on Open Problems

Sergey Levine, Aviral Kumar, George Tucker, and Justin Fu. “Offline reinforcement learning: Tutorial, review, and perspectives on open problems”. In:arXiv preprint arXiv:2005.01643 (2020)

work page internal anchor Pith review Pith/arXiv arXiv 2005
[11]

Linear programming and sequential decisions

Alan S Manne. “Linear programming and sequential decisions”. In: Management Science 6.3 (1960), pp. 259–267

work page 1960
[12]

Q-learning and Pontryagin’s minimum principle

Prashant Mehta and Sean Meyn. “Q-learning and Pontryagin’s minimum principle”. In:Proceedings of the 48h IEEE Conference on Decision and Control (CDC) held jointly with 2009 28th Chinese Control Conference. IEEE. 2009, pp. 3598–3605

work page 2009
[13]

Finite-Time Bounds for Fitted Value Iteration

Rémi Munos and Csaba Szepesvári. “Finite-Time Bounds for Fitted Value Iteration.” In:Journal of Machine Learning Research 9.5 (2008)

work page 2008
[14]

Efficient global planning in large MDPs via stochastic primal-dual optimization

Gergely Neu and Nneka Okolo. “Efficient global planning in large MDPs via stochastic primal-dual optimization”. In: International Conference on Algorithmic Learning Theory. PMLR. 2023, pp. 1101– 1123

work page 2023
[15]

Offline RL via Feature-Occupancy Gradient Ascent

Gergely Neu and Nneka Okolo. “Offline RL via Feature-Occupancy Gradient Ascent”. In:Interna- tional Conference on Artificial Intelligence and Statistics. PMLR. 2025. 11

work page 2025
[16]

Markov decision processes: discrete stochastic dynamic programming

Martin L Puterman. Markov decision processes: discrete stochastic dynamic programming. John Wiley & Sons, 2014

work page 2014
[17]

Bridging offline rein- forcement learning and imitation learning: A tale of pessimism

Paria Rashidinejad, Banghua Zhu, Cong Ma, Jiantao Jiao, and Stuart Russell. “Bridging offline rein- forcement learning and imitation learning: A tale of pessimism”. In: Advances in Neural Information Processing Systems 34 (2021), pp. 11702–11716

work page 2021
[18]

Optimal conserva- tive offline rl with general function approximation via augmented lagrangian

Paria Rashidinejad, Hanlin Zhu, Kunhe Yang, Stuart Russell, and Jiantao Jiao. “Optimal conserva- tive offline rl with general function approximation via augmented lagrangian”. In: arXiv preprint arXiv:2211.00716 (2022)

work page arXiv 2022
[19]

A Lagrangean relaxation algorithm for the two duty period scheduling problem

Fred Shepardson and Roy E Marsten. “A Lagrangean relaxation algorithm for the two duty period scheduling problem”. In: Management Science 26.3 (1980), pp. 274–281

work page 1980
[20]

Bellman-consistent pessimism for offline reinforcement learning

Tengyang Xie, Ching-An Cheng, Nan Jiang, Paul Mineiro, and Alekh Agarwal. “Bellman-consistent pessimism for offline reinforcement learning”. In:Advances in neural information processing systems 34 (2021), pp. 6683–6694

work page 2021
[21]

When is realizability sufficient for off-policy reinforcement learning?

Andrea Zanette. “When is realizability sufficient for off-policy reinforcement learning?” In:Interna- tional Conference on Machine Learning. PMLR. 2023, pp. 40637–40668

work page 2023
[22]

Offline reinforcement learning with realizability and single-policy concentrability

Wenhao Zhan, Baihe Huang, Audrey Huang, Nan Jiang, and Jason Lee. “Offline reinforcement learning with realizability and single-policy concentrability”. In: Conference on Learning Theory. PMLR. 2022, pp. 2730–2775

work page 2022
[23]

Safe and Efficient: A Primal-Dual Method for Offline Convex CMDPs under Partial Data Coverage

Haobo Zhang, Xiyue Peng, Honghao Wei, and Xin Liu. “Safe and Efficient: A Primal-Dual Method for Offline Convex CMDPs under Partial Data Coverage”. In:Advances in Neural Information Processing Systems 37 (2024), pp. 34239–34269

work page 2024
[24]

Importance weighted actor-critic for optimal conservative offline reinforcement learning

Hanlin Zhu, Paria Rashidinejad, and Jiantao Jiao. “Importance weighted actor-critic for optimal conservative offline reinforcement learning”. In:Advances in Neural Information Processing Systems 36 (2023), pp. 49579–49602. 12 A Analysis for Offline Unconstrained RL A.1 Invariance of Saddle Points Lemma 12 (Lemma 31 in Zhan et al. [22]). Suppose (x∗, y∗) i...

work page 2023

[1] [1]

Constrained Markov Decision Processes

Eitan Altman. Constrained Markov Decision Processes. V ol. 7. CRC Press, 1999

work page 1999

[2] [2]

Convex optimization

Stephen P Boyd and Lieven Vandenberghe. Convex optimization. Cambridge university press, 2004

work page 2004

[3] [3]

Offline reinforcement learning under value and density-ratio realizability: the power of gaps

Jinglin Chen and Nan Jiang. “Offline reinforcement learning under value and density-ratio realizability: the power of gaps”. In: Uncertainty in Artificial Intelligence. PMLR. 2022, pp. 378–388

work page 2022

[4] [4]

Adversarially trained actor critic for offline reinforcement learning

Ching-An Cheng, Tengyang Xie, Nan Jiang, and Alekh Agarwal. “Adversarially trained actor critic for offline reinforcement learning”. In: International Conference on Machine Learning. PMLR. 2022, pp. 3852–3878

work page 2022

[5] [5]

Offline primal-dual reinforcement learning for linear mdps

Germano Gabbianelli, Gergely Neu, Matteo Papini, and Nneka M Okolo. “Offline primal-dual reinforcement learning for linear mdps”. In: International Conference on Artificial Intelligence and Statistics. PMLR. 2024, pp. 3169–3177

work page 2024

[6] [6]

Lagrangean decomposition: A model yielding stronger La- grangean bounds

Monique Guignard and Siwhan Kim. “Lagrangean decomposition: A model yielding stronger La- grangean bounds”. In: Mathematical programming 39.2 (1987), pp. 215–228

work page 1987

[7] [7]

A Primal-Dual Algorithm for Offline Constrained Reinforcement Learning with Linear MDPs

Kihyuk Hong and Ambuj Tewari. “A Primal-Dual Algorithm for Offline Constrained Reinforcement Learning with Linear MDPs”. In: International Conference on Machine Learning . PMLR. 2024, pp. 18711–18737

work page 2024

[8] [8]

Offline reinforcement learning in large state spaces: Algorithms and guarantees

Nan Jiang and Tengyang Xie. “Offline reinforcement learning in large state spaces: Algorithms and guarantees”. In: Statistical Science (2024)

work page 2024

[9] [9]

Batch policy learning under constraints

Hoang Le, Cameron V oloshin, and Yisong Yue. “Batch policy learning under constraints”. In:Inter- national Conference on Machine Learning. PMLR. 2019, pp. 3703–3712

work page 2019

[10] [10]

Offline Reinforcement Learning: Tutorial, Review, and Perspectives on Open Problems

Sergey Levine, Aviral Kumar, George Tucker, and Justin Fu. “Offline reinforcement learning: Tutorial, review, and perspectives on open problems”. In:arXiv preprint arXiv:2005.01643 (2020)

work page internal anchor Pith review Pith/arXiv arXiv 2005

[11] [11]

Linear programming and sequential decisions

Alan S Manne. “Linear programming and sequential decisions”. In: Management Science 6.3 (1960), pp. 259–267

work page 1960

[12] [12]

Q-learning and Pontryagin’s minimum principle

Prashant Mehta and Sean Meyn. “Q-learning and Pontryagin’s minimum principle”. In:Proceedings of the 48h IEEE Conference on Decision and Control (CDC) held jointly with 2009 28th Chinese Control Conference. IEEE. 2009, pp. 3598–3605

work page 2009

[13] [13]

Finite-Time Bounds for Fitted Value Iteration

Rémi Munos and Csaba Szepesvári. “Finite-Time Bounds for Fitted Value Iteration.” In:Journal of Machine Learning Research 9.5 (2008)

work page 2008

[14] [14]

Efficient global planning in large MDPs via stochastic primal-dual optimization

Gergely Neu and Nneka Okolo. “Efficient global planning in large MDPs via stochastic primal-dual optimization”. In: International Conference on Algorithmic Learning Theory. PMLR. 2023, pp. 1101– 1123

work page 2023

[15] [15]

Offline RL via Feature-Occupancy Gradient Ascent

Gergely Neu and Nneka Okolo. “Offline RL via Feature-Occupancy Gradient Ascent”. In:Interna- tional Conference on Artificial Intelligence and Statistics. PMLR. 2025. 11

work page 2025

[16] [16]

Markov decision processes: discrete stochastic dynamic programming

Martin L Puterman. Markov decision processes: discrete stochastic dynamic programming. John Wiley & Sons, 2014

work page 2014

[17] [17]

Bridging offline rein- forcement learning and imitation learning: A tale of pessimism

Paria Rashidinejad, Banghua Zhu, Cong Ma, Jiantao Jiao, and Stuart Russell. “Bridging offline rein- forcement learning and imitation learning: A tale of pessimism”. In: Advances in Neural Information Processing Systems 34 (2021), pp. 11702–11716

work page 2021

[18] [18]

Optimal conserva- tive offline rl with general function approximation via augmented lagrangian

Paria Rashidinejad, Hanlin Zhu, Kunhe Yang, Stuart Russell, and Jiantao Jiao. “Optimal conserva- tive offline rl with general function approximation via augmented lagrangian”. In: arXiv preprint arXiv:2211.00716 (2022)

work page arXiv 2022

[19] [19]

A Lagrangean relaxation algorithm for the two duty period scheduling problem

Fred Shepardson and Roy E Marsten. “A Lagrangean relaxation algorithm for the two duty period scheduling problem”. In: Management Science 26.3 (1980), pp. 274–281

work page 1980

[20] [20]

Bellman-consistent pessimism for offline reinforcement learning

Tengyang Xie, Ching-An Cheng, Nan Jiang, Paul Mineiro, and Alekh Agarwal. “Bellman-consistent pessimism for offline reinforcement learning”. In:Advances in neural information processing systems 34 (2021), pp. 6683–6694

work page 2021

[21] [21]

When is realizability sufficient for off-policy reinforcement learning?

Andrea Zanette. “When is realizability sufficient for off-policy reinforcement learning?” In:Interna- tional Conference on Machine Learning. PMLR. 2023, pp. 40637–40668

work page 2023

[22] [22]

Offline reinforcement learning with realizability and single-policy concentrability

Wenhao Zhan, Baihe Huang, Audrey Huang, Nan Jiang, and Jason Lee. “Offline reinforcement learning with realizability and single-policy concentrability”. In: Conference on Learning Theory. PMLR. 2022, pp. 2730–2775

work page 2022

[23] [23]

Safe and Efficient: A Primal-Dual Method for Offline Convex CMDPs under Partial Data Coverage

Haobo Zhang, Xiyue Peng, Honghao Wei, and Xin Liu. “Safe and Efficient: A Primal-Dual Method for Offline Convex CMDPs under Partial Data Coverage”. In:Advances in Neural Information Processing Systems 37 (2024), pp. 34239–34269

work page 2024

[24] [24]

Importance weighted actor-critic for optimal conservative offline reinforcement learning

Hanlin Zhu, Paria Rashidinejad, and Jiantao Jiao. “Importance weighted actor-critic for optimal conservative offline reinforcement learning”. In:Advances in Neural Information Processing Systems 36 (2023), pp. 49579–49602. 12 A Analysis for Offline Unconstrained RL A.1 Invariance of Saddle Points Lemma 12 (Lemma 31 in Zhan et al. [22]). Suppose (x∗, y∗) i...

work page 2023