ISEP: Implicit Support Expansion for Offline Reinforcement Learning via Stochastic Policy Optimization

Shaoqin Zhu; Xiaoqiang Ji; Yifei Chen

arxiv: 2605.18320 · v1 · pith:PXNMNZNLnew · submitted 2026-05-18 · 💻 cs.LG · cs.AI


Pith reviewed 2026-05-20 12:03 UTC · model grok-4.3


keywords: offline reinforcement learning, support expansion, value function interpolation, stochastic policy optimization, mode collapse, flow matching, bounded error, multimodal optimization


The pith

Offline reinforcement learning can expand action support beyond the behavior policy by interpolating value functions and using stochastic optimization to handle multimodal landscapes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to relax the strict constraints in offline RL that limit policies to the support of the behavior policy, which often blocks optimal behaviors. It establishes that interpolating a value function between in-distribution data and generated policy samples implicitly expands the feasible action support while keeping value errors bounded. This densifies high-reward regions to form a path for improvement. The method then employs stochastic policy optimization, alternating between conservative and optimistic signals, to navigate the resulting multimodal landscape without mode collapse. A successful demonstration would mean offline agents can achieve higher returns on tasks requiring novel actions not present in the dataset.

Core claim

ISEP uses an interpolated value function between in-distribution data and policy samples to implicitly expand the feasible action support, densifying high-reward regions and creating a navigable path for policy improvement while theoretically guaranteeing bounded value error. This is optimized stochastically by alternating between conservative cloning and optimistic expansion signals to avoid invalid actions from mode collapse, and can be realized using Conditional Flow Matching with classifier-free guidance.

What carries the argument

An interpolated value function that combines in-distribution and policy-sampled signals to expand support, optimized via stochastic action selection alternating conservative and optimistic signals.

If this is right

Optimal behaviors outside the immediate support of the behavior policy become discoverable.
High-reward regions are densified to create navigable paths for policy improvement.
Value error remains bounded despite the expansion.
Stochastic optimization prevents mode collapse and invalid actions in the multimodal landscape.
The framework can be instantiated with Conditional Flow Matching for practical implementation.
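
The last point can be illustrated with a toy sampler. The sketch below assumes the standard classifier-free guidance rule, v = v_uncond + w · (v_cond − v_uncond), applied to a flow-matching velocity field and integrated with Euler steps. The velocity functions, the guidance weight `w`, and the two target points are hypothetical stand-ins chosen for illustration; they are not the ISEP-FM networks, and the conditioning signal used in the paper is not specified here.

```python
import numpy as np

def euler_sample(velocity, x0, n_steps=50):
    """Integrate dx/dt = velocity(x, t) from t = 0 to t = 1 with Euler steps."""
    x = np.array(x0, dtype=float)
    dt = 1.0 / n_steps
    for i in range(n_steps):
        t = i * dt  # t never reaches 1, so the division below is safe
        x = x + dt * velocity(x, t)
    return x

def straight_line_velocity(target):
    """Toy velocity field that transports any point to `target` at t = 1."""
    target = np.asarray(target, dtype=float)
    return lambda x, t: (target - x) / (1.0 - t)

def guided_velocity(v_uncond, v_cond, w):
    """Classifier-free guidance: blend unconditional and conditional fields."""
    return lambda x, t: v_uncond(x, t) + w * (v_cond(x, t) - v_uncond(x, t))

# Hypothetical targets: a behavior-like mode and a higher-value mode.
behavior_mode = np.array([-1.0, -1.0])
high_value_mode = np.array([2.0, 2.0])
v_uncond = straight_line_velocity(behavior_mode)
v_cond = straight_line_velocity(high_value_mode)
x0 = np.array([0.5, -0.3])

for w in (0.0, 1.0, 1.5):
    sample = euler_sample(guided_velocity(v_uncond, v_cond, w), x0)
    print(f"w={w}: sample={sample}")
```

In this toy setting `w = 0` recovers the unconditional endpoint, `w = 1` the conditional one, and `w > 1` pushes past it, which is one way to see why the guidance weight controls how far a sample is moved away from behavior-like actions.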

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

This approach may generalize to other constrained optimization problems in machine learning where support expansion is needed.
Testing ISEP on datasets with very sparse action coverage could reveal how much expansion is practically achievable.
Combining ISEP with ensemble methods might further stabilize the value estimates during expansion.

Load-bearing premise

That the interpolated value function produces a well-behaved multimodal landscape where stochastic optimization yields valid actions without uncontrolled extrapolation error.

What would settle it

Running ISEP on a benchmark offline RL task with limited data support and measuring if the learned policy achieves higher returns than baselines on out-of-support actions while value estimates stay close to ground truth.
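
One way to run the second half of that check is to compare critic estimates against discounted returns from evaluation rollouts. The sketch below is a minimal diagnostic under that assumption; `discounted_return` and `value_gap` are hypothetical helper names, and Monte Carlo returns only estimate the value of the rollout policy, so the gap is a proxy rather than a ground-truth error.

```python
import numpy as np

def discounted_return(rewards, gamma):
    """Discounted sum of a reward sequence, accumulated backwards."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

def value_gap(q_estimates, reward_sequences, gamma):
    """Critic estimate minus Monte Carlo return, one entry per rollout.

    Positive entries suggest overestimation at the rollout's first
    state-action pair; negative entries suggest underestimation.
    """
    returns = np.array([discounted_return(r, gamma) for r in reward_sequences])
    return np.asarray(q_estimates, dtype=float) - returns

# Tiny illustration with made-up numbers.
gaps = value_gap([2.0, 1.0], [[1.0, 1.0, 1.0], [0.0, 2.0]], gamma=0.5)
print(gaps)
```

Tracking this gap separately for actions near the dataset and for actions far from it would show whether the error grows as the policy leaves the original support.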

Figures

Figures reproduced from arXiv: 2605.18320 by Shaoqin Zhu, Xiaoqiang Ji, Yifei Chen.

**Figure 1.** Conceptual Illustration of Support Expansion on Disjoint Manifolds. The environment consists of two safe, high-reward "islands" (radius r = 1) surrounded by a low-reward danger background. The dataset is heavily skewed towards the suboptimal island (bottom-left). (Left) Gaussian Failure: A unimodal Gaussian policy attempts to cover both the dense suboptimal data and the sparse optimal data. This results in …

**Figure 2.** Visualization of the offline dataset used in the 2D bandit task. Contour is used to indicate the reward value

**Figure 3.** Effect of Interpolation Parameter p on Policy Performance in a 2D Bandit Task. The figure illustrates the action distributions generated by the policy for various values of p (ranging from 0.0 to 1.0). The color contours represent the reward values, with darker regions indicating lower rewards. The red star marks the optimal center at (2, 2), while the dashed purple circle delineates the danger zone at (4,…

**Figure 4.** Implicit Support Expansion on a Sparse Multimodal Bandit. The background color contours represent the …

**Figure 5.** Sensitivity analysis of the interpolation parameter p across representative D4RL tasks. Each subplot shows the D4RL Normalized Score over training steps for different values of p, spanning locomotion and manipulation domains. Moderate values (p ∈ {0.3, 0.5}) consistently yield the highest normalized scores across all environments. Extreme values (p ≥ 0.9) diverge and are omitted for visual clarity. Solid l…

**Figure 6.** t-SNE Visualization of Action Support Expansion on halfcheetah-medium-replay-v2. We project sampled actions into 2D space, with color contours indicating reward magnitude (brighter colors denote higher rewards). Left: The strictly in-distribution baseline (p = 0.0) remains trapped in dominant low-reward regions due to dataset density bias. Right: ISEP (p = 0.5) exhibits implicit action support expansion, su…

**Figure 7.** Stochastic Selection vs. Deterministic Interpolation. Ablation study comparing ISEP (labeled Stochastic) against naive action interpolation (labeled Deterministic) on (a) halfcheetah-medium-replay and (b) walker2d-medium. The deterministic baseline underperforms because linear averaging between distinct modes often yields "off-manifold" actions. In contrast, stochastic action selection maintains valid beha…
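
The contrast in the Figure 7 caption can be reproduced on a toy problem. The sketch below assumes two well-separated action modes, a Bernoulli(p) switch for the stochastic variant, and a plain convex combination for the deterministic variant; the mode locations, noise scale, and function names are hypothetical and are not taken from the paper's experiments.

```python
import numpy as np

def deterministic_interpolation(a_data, a_policy, p):
    """Average each dataset action with its paired policy sample."""
    return (1.0 - p) * a_data + p * a_policy

def stochastic_selection(a_data, a_policy, p, rng):
    """Per sample, keep the policy action with probability p, else the dataset action."""
    pick_policy = rng.random(len(a_data)) < p
    return np.where(pick_policy[:, None], a_policy, a_data)

def distance_to_nearest_mode(actions, modes):
    """Euclidean distance from each action to its closest mode center."""
    d = np.linalg.norm(actions[:, None, :] - modes[None, :, :], axis=-1)
    return d.min(axis=1)

rng = np.random.default_rng(0)
modes = np.array([[-2.0, -2.0], [2.0, 2.0]])  # hypothetical valid regions
n = 1000
a_data = modes[0] + 0.1 * rng.standard_normal((n, 2))    # dataset-like actions
a_policy = modes[1] + 0.1 * rng.standard_normal((n, 2))  # policy-like actions
p = 0.5

mixed = deterministic_interpolation(a_data, a_policy, p)
selected = stochastic_selection(a_data, a_policy, p, rng)

det_dist = distance_to_nearest_mode(mixed, modes).mean()
sto_dist = distance_to_nearest_mode(selected, modes).mean()
print(f"deterministic: {det_dist:.3f}, stochastic: {sto_dist:.3f}")
```

Averaging places the targets between the two modes, far from either, while selection always returns an action that came from one of the modes, which is the failure and the remedy the caption describes.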

Abstract

Offline reinforcement learning methods typically enforce strict constraints to ensure safety; yet this rigidity often prevents the discovery of optimal behaviors outside the immediate support of the behavior policy. To address this, we propose Implicit Support Expansion via stochastic Policy optimization (ISEP), which leverages a value function interpolated between in-distribution data and policy samples to implicitly expand the feasible action support. This mechanism "densifies" high-reward regions, creating a navigable path for policy improvement while theoretically guaranteeing bounded value error. However, optimizing against this expanded support creates a multimodal landscape where standard deterministic averaging leads to mode collapse and invalid actions. ISEP mitigates this via a stochastic action selection strategy, optimizing the policy by stochastically alternating between conservative cloning and optimistic expansion signals. We instantiate this framework as ISEP-FM using Conditional Flow Matching utilizing classifier-free guidance to effectively capture the interpolated value signal.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

ISEP's interpolation-plus-stochastic-optimization idea targets a real offline RL gap, but the bounded-error claim rests on an unshown condition that may not survive extrapolation.


The main point is that ISEP tries to relax the usual strict support constraints in offline RL by interpolating the value function between the data and the current policy samples, then optimizing the policy stochastically so it can reach higher-reward regions without immediate mode collapse or invalid actions. They implement the optimistic part with classifier-free guided flow matching. That combination is the clearest novelty in the abstract; it is not just another conservative penalty or explicit density constraint.

The framing of densifying high-reward areas to create a navigable path is a reasonable way to describe the goal. The paper correctly notes that deterministic averaging on the resulting multimodal landscape produces bad actions, so the stochastic alternation between cloning and expansion signals is a direct response to that problem. Those pieces are coherent on their own terms.

The soft spot is the theoretical guarantee. The abstract states that the interpolation produces bounded value error, yet gives no derivation or the precise condition under which the bound survives once the policy starts sampling outside the convex hull of the original data. The stress-test concern is on target here: if the flow-matching approximation puts non-negligible mass outside the reliable interpolation region, the value signal becomes uncontrolled extrapolation and the claimed bound is lost. The paper will need to show either a proof that the stochastic selection rule prevents this or at least empirical diagnostics that the error stays controlled in practice. Without that, the central claim is harder to accept at face value.

This is work for the offline RL community, especially people who already work on support expansion or flow-based methods and want to see a new angle on controlled extrapolation. A reader who cares about safety-critical or data-scarce settings could extract a useful idea even if the current write-up leaves the theory thin.

It is worth sending to a serious referee; the problem is genuine and the proposed mechanism is distinct enough that referees can usefully pressure the authors on the missing steps and the experiments.

Referee Report

1 major / 1 minor

Summary. The paper proposes ISEP for offline RL, which interpolates a value function between in-distribution data and policy samples to implicitly expand action support, densifying high-reward regions and creating a path for policy improvement while claiming a theoretical guarantee of bounded value error. It addresses the resulting multimodal landscape (where deterministic optimization causes mode collapse and invalid actions) via stochastic alternation between conservative cloning and optimistic expansion signals, instantiated as ISEP-FM using conditional flow matching with classifier-free guidance.

Significance. If the bounded-value-error guarantee holds under the proposed stochastic optimization, the work would be significant for offline RL: it offers a mechanism to discover behaviors outside the behavior policy support without rigid constraints, while the flow-matching instantiation provides a concrete way to capture the interpolated multimodal signal.

major comments (1)

[Abstract] The paper asserts a 'theoretical guarantee of bounded value error' for the interpolated value function under stochastic policy optimization, yet supplies no derivation, error bounds, or conditions ensuring the bound survives once policy samples exit the convex hull of the data support. This is load-bearing for the central claim, because the method explicitly relies on the interpolation remaining valid rather than becoming uncontrolled extrapolation.

minor comments (1)

The description of the stochastic alternation between cloning and expansion signals would be clearer with explicit pseudocode or a step-by-step algorithmic outline.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive feedback and for identifying the need for greater rigor around our central theoretical claim. We address the major comment below and will strengthen the manuscript accordingly.


Referee: [Abstract] The paper asserts a 'theoretical guarantee of bounded value error' for the interpolated value function under stochastic policy optimization, yet supplies no derivation, error bounds, or conditions ensuring the bound survives once policy samples exit the convex hull of the data support. This is load-bearing for the central claim, because the method explicitly relies on the interpolation remaining valid rather than becoming uncontrolled extrapolation.

Authors: We agree that the current manuscript does not supply a complete formal derivation or explicit error bounds in the main text or appendix. The abstract and Section 4 present only an informal argument that the stochastic alternation between cloning and expansion keeps policy samples sufficiently close to the data support. In the revised version we will add a full proof (new Theorem 1 and Appendix B) that derives a concrete bound on the value error. The proof will show that, under the Lipschitz continuity of the value function and a contraction property induced by the conservative cloning step, the interpolation error remains bounded by O(δ) where δ is the maximum deviation of policy samples from the convex hull of the behavior data; the stochastic schedule explicitly controls δ. We will also revise the abstract to cite the theorem and state the required assumptions. revision: yes
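
The role the Lipschitz assumption would play in such a bound can be seen in a one-line inequality. This is a generic sketch, not the promised Theorem 1: the Lipschitz constant $L_Q$ and the in-support action $a^{\dagger}$ closest to the policy sample $\hat a$ are notation introduced here for illustration.

```latex
\[
\|\hat a - a^{\dagger}\| \le \delta
\quad\Longrightarrow\quad
\bigl| Q(s,\hat a) - Q(s,a^{\dagger}) \bigr|
\;\le\; L_Q \,\|\hat a - a^{\dagger}\|
\;\le\; L_Q\,\delta .
\]
```

Under these assumptions the value at an expanded action differs from the value at a supported action by at most $L_Q\,\delta$, so an $O(\delta)$ statement is informative only if the training procedure actually keeps $\delta$ small, which is the condition the referee asks the authors to establish.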

Circularity Check

0 steps flagged

No significant circularity; derivation introduces independent mechanisms


The provided abstract and description present ISEP as a new framework that interpolates a value function between in-distribution data and policy samples, then applies stochastic optimization via flow matching to expand support while claiming a bounded value error guarantee. No equations or steps are shown that reduce the claimed guarantee or expansion to a fitted parameter renamed as prediction, a self-definition, or a load-bearing self-citation chain. The stochastic alternation between cloning and expansion is described as a mitigation for multimodality rather than a re-expression of prior quantities. The derivation chain therefore remains self-contained against external benchmarks with independent content.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review yields no explicit free parameters, axioms, or invented entities; the interpolation and stochastic selection are described at a conceptual level without numerical fitting details or unstated background lemmas.

pith-pipeline@v0.9.0 · 5676 in / 1146 out tokens · 41144 ms · 2026-05-20T12:03:40.359026+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

$$L_V(\psi) = (1-p)\,\mathbb{E}_{(s,a)\sim\mathcal{D}}\left[L_2^{\tau}\left(\hat Q_\theta(s,a) - V_\psi(s)\right)\right] + p\,\mathbb{E}_{s\sim\mathcal{D},\,\hat a\sim\pi_\phi(\cdot\mid s)}\left[\left(\hat Q_\theta(s,\hat a) - V_\psi(s)\right)^2\right]$$

IndisputableMonolith/Foundation/ArithmeticFromLogic.lean · embed_injective · unclear

$$L_\pi(\phi) = (1-B)\,\mathbb{E}_{(s,a)\sim\mathcal{D}}\left[\omega(s,a)\,\log\pi_\phi(a\mid s)\right] + B\,\mathbb{E}_{s\sim\mathcal{D},\,\hat a\sim\pi_\phi(\cdot\mid s)}\left[\omega(s,\hat a)\,\log\pi_\phi(\hat a\mid s)\right]$$
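
The two loss expressions quoted above can be evaluated numerically as follows. This is a minimal sketch under stated assumptions: the expectile term is taken to be the asymmetric squared loss |τ − 1(u < 0)| · u² from implicit Q-learning, B is treated as a Bernoulli(p) draw (the passage shows it only as a coefficient), and the weights ω and log-probabilities are passed in as precomputed arrays because the passage does not define them. All function names are hypothetical.

```python
import numpy as np

def expectile_loss(u, tau):
    """Asymmetric squared loss |tau - 1(u < 0)| * u**2 (assumed form of the expectile term)."""
    weight = np.where(u < 0, 1.0 - tau, tau)
    return weight * u ** 2

def value_loss(q_data, q_policy, v, p, tau):
    """Interpolated value loss: expectile term on dataset actions,
    squared error on policy-sampled actions, mixed with weight p."""
    in_sample = expectile_loss(q_data - v, tau).mean()
    policy_term = ((q_policy - v) ** 2).mean()
    return (1.0 - p) * in_sample + p * policy_term

def policy_objective(w_data, logp_data, w_policy, logp_policy, b):
    """Quoted L_pi expression for a given switch value b in {0, 1}."""
    return (1 - b) * np.mean(w_data * logp_data) + b * np.mean(w_policy * logp_policy)

# Made-up batch of two states.
q_data = np.array([1.0, 3.0])    # Q at dataset actions
q_policy = np.array([4.0, 4.0])  # Q at policy-sampled actions
v = np.array([2.0, 2.0])         # current value estimates
print(value_loss(q_data, q_policy, v, p=0.5, tau=0.7))

rng = np.random.default_rng(0)
b = int(rng.random() < 0.5)      # assumed Bernoulli(p) switch with p = 0.5
print(policy_objective(np.ones(2), np.array([-1.0, -3.0]),
                       np.array([2.0]), np.array([-0.5]), b))
```

With `p = 0` the value loss reduces to a purely in-sample expectile regression, and with `p = 1` it becomes an ordinary squared error against policy samples; intermediate values trade one off against the other.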

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

45 extracted references · 45 canonical work pages · 3 internal anchors

[2] Sergey Levine, Aviral Kumar, George Tucker, and Justin Fu. Offline reinforcement learning: Tutorial, review, and perspectives on open problems. CoRR, abs/2005.01643, 2020.

[3] Rafael Figueiredo Prudencio, Marcos R. O. A. Maximo, and Esther Luna Colombini. A survey on offline reinforcement learning: Taxonomy, review, and open problems. IEEE Transactions on Neural Networks and Learning Systems, 35(8):10237–10257, August 2024.

[4] Ilya Kostrikov, Ashvin Nair, and Sergey Levine. Offline reinforcement learning with implicit q-learning, 2021.

[7] Scott Fujimoto, David Meger, and Doina Precup. Off-policy deep reinforcement learning without exploration, 2019.

[8] Scott Fujimoto and Shixiang Shane Gu. A minimalist approach to offline reinforcement learning, 2021.

[9] Aviral Kumar, Justin Fu, George Tucker, and Sergey Levine. Stabilizing off-policy q-learning via bootstrapping error reduction. CoRR, abs/1906.00949, 2019.

[10] Jialong Wu, Haixu Wu, Zihan Qiu, Jianmin Wang, and Mingsheng Long. Supported policy optimization for offline reinforcement learning, 2022.

[11] Yifan Wu, George Tucker, and Ofir Nachum. Behavior regularized offline reinforcement learning, 2019.

[12] Aviral Kumar, Aurick Zhou, George Tucker, and Sergey Levine. Conservative q-learning for offline reinforcement learning. CoRR, abs/2006.04779, 2020.

[13] Kevin Frans, Seohong Park, Pieter Abbeel, and Sergey Levine. Diffusion guidance is a controllable policy improvement operator, 2025.

[14] David Brandfonbrener, William F. Whitney, Rajesh Ranganath, and Joan Bruna. Offline rl without off-policy evaluation, 2021.

[15] Haoran Xu, Li Jiang, Jianxiong Li, Zhuoran Yang, Zhaoran Wang, Victor Wai Kin Chan, and Xianyuan Zhan. Offline rl with no ood actions: In-sample learning via implicit value regularization, 2023.

[16] Divyansh Garg, Joey Hejna, Matthieu Geist, and Stefano Ermon. Extreme q-learning: Maxent rl without entropy, 2023.

[17] Yixiu Mao, Qi Wang, Yun Qu, Yuhang Jiang, and Xiangyang Ji. Doubly mild generalization for offline reinforcement learning, 2024.

[18] Jianzhun Shao, Yun Qu, Chen Chen, Hongchang Zhang, and Xiangyang Ji. Counterfactual conservative q learning for offline multi-agent reinforcement learning, 2023.

[19] Ching-An Cheng, Tengyang Xie, Nan Jiang, and Alekh Agarwal. Adversarially trained actor critic for offline reinforcement learning, 2022.

[20] Gaon An, Seungyong Moon, Jang-Hyun Kim, and Hyun Oh Song. Uncertainty-based offline reinforcement learning with diversified q-ensemble, 2021.

[21] Chenjia Bai, Lingxiao Wang, Zhuoran Yang, Zhihong Deng, Animesh Garg, Peng Liu, and Zhaoran Wang. Pessimistic bootstrapping for uncertainty-driven offline reinforcement learning, 2022.

[22] Yixiu Mao, Hongchang Zhang, Chen Chen, Yi Xu, and Xiangyang Ji. Supported trust region optimization for offline reinforcement learning, 2023.

[23] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models, 2020.

[24] Jascha Sohl-Dickstein, Eric A. Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics, 2015.

[25] Yaron Lipman, Ricky T. Q. Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow matching for generative modeling, 2023.

[26] Cheng Chi, Zhenjia Xu, Siyuan Feng, Eric Cousineau, Yilun Du, Benjamin Burchfiel, Russ Tedrake, and Shuran Song. Diffusion policy: Visuomotor policy learning via action diffusion, 2024.

[27] Tonghe Zhang, Chao Yu, Sichang Su, and Yu Wang. Reinflow: Fine-tuning flow matching policy with online reinforcement learning, 2025.

[28] Zhendong Wang, Jonathan J Hunt, and Mingyuan Zhou. Diffusion policies as an expressive policy class for offline reinforcement learning, 2023.

[29] Seohong Park, Qiyang Li, and Sergey Levine. Flow q-learning, 2025.

[30] Suzan Ece Ada, Erhan Oztop, and Emre Ugur. Diffusion policies for out-of-distribution generalization in offline reinforcement learning. IEEE Robotics and Automation Letters, 9(4):3116–3123, April 2024.

[31] Shutong Ding, Ke Hu, Zhenhao Zhang, Kan Ren, Weinan Zhang, Jingyi Yu, Jingya Wang, and Ye Shi. Diffusion-based reinforcement learning via q-weighted variational policy optimization, 2024.

[32] Bingyi Kang, Xiao Ma, Chao Du, Tianyu Pang, and Shuicheng Yan. Efficient diffusion policies for offline reinforcement learning, 2023.

[33] Cheng Lu, Huayu Chen, Jianfei Chen, Hang Su, Chongxuan Li, and Jun Zhu. Contrastive energy prediction for exact energy-guided diffusion sampling in offline reinforcement learning, 2023.

[34] Shiyuan Zhang, Weitong Zhang, and Quanquan Gu. Energy-weighted flow matching for offline reinforcement learning, 2025.

[35] Philippe Hansen-Estruch, Ilya Kostrikov, Michael Janner, Jakub Grudzien Kuba, and Sergey Levine. Idql: Implicit q-learning as an actor-critic method with diffusion policies, 2023.

[36] Huayu Chen, Cheng Lu, Chengyang Ying, Hang Su, and Jun Zhu. Offline reinforcement learning via high-fidelity generative behavior modeling, 2023.

[37] Jan Peters and Stefan Schaal. Reinforcement learning by reward-weighted regression for operational space control. In Proceedings of the 24th international conference on Machine learning, pages 745–750, 2007.

[38] Qing Wang, Jiechao Xiong, Lei Han, Peng Sun, Han Liu, and Tong Zhang. Exponentially weighted imitation learning for batched historical data. In Advances in Neural Information Processing Systems (NeurIPS), volume 31, pages 6291–6300, 2018.

[39] Xue Bin Peng, Aviral Kumar, Grace Zhang, and Sergey Levine. Advantage-weighted regression: Simple and scalable off-policy reinforcement learning. CoRR, abs/1910.00177, 2019.

[40] Justin Fu, Aviral Kumar, Ofir Nachum, George Tucker, and Sergey Levine. D4RL: datasets for deep data-driven reinforcement learning. CoRR, abs/2004.07219, 2020.

[41] Pete Florence, Corey Lynch, Andy Zeng, Oscar Ramirez, Ayzaan Wahid, Laura Downs, Adrian Wong, Johnny Lee, Igor Mordatch, and Jonathan Tompson. Implicit behavioral cloning, 2021.

[42] Gen Li, Yuting Wei, Yuejie Chi, and Yuxin Chen. Softmax policy gradient methods can take exponential time to converge, 2022.

[43] Diganta Misra. Mish: A self regularized non-monotonic activation function, 2020.

[44] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization, 2017.

[45] Laurens van der Maaten and Geoffrey Hinton. Visualizing data using t-sne. Journal of Machine Learning Research, 9(86):2579–2605, 2008.
[46] Sebastian Ruder. An overview of gradient descent optimization algorithms, 2017.

A Theoretical analysis

We proceed by induction on the value iteration steps.

Base Case: Initialize $V_0(s) = 0$ (or any arbitrarily small value such that $V_0(s) \le V^*(s)$). The base case holds trivially.

Inductive Step: Assume that at ...

Expectile Term (In-Sample): First, we bound the updated Q-values.

$$
\begin{aligned}
Q_k(s,a) &= \mathbb{E}_{(r,s')\sim\mathcal{D}(s,a)}\left[r + \gamma V_k(s')\right] && (16)\\
&\le \mathbb{E}_{(r,s')\sim\mathcal{D}(s,a)}\left[r + \gamma V^*(s')\right] \quad \text{(inductive hypothesis } V_k \le V^*\text{)} && (17)\\
&\le Q^*(s,a) \quad \text{(definition of } Q^*\text{)} && (18)
\end{aligned}
$$

By Assumption 1, for any $s$ in the dataset:

$$\mathbb{E}^{\tau}_{a\sim\mathcal{D}(s)}\left[Q_k(s,a)\right] \le \mathbb{E}^{\tau}_{a\sim\mathcal{D}(s)}\left[Q^*(s,a)\right] \le \max_{a\in\mathcal{D}(s)} Q^*(s,a) - \delta_\tau \qquad (19)$$

Invoking Assumption 2, the dataset support is ...

Policy Term (Out-of-Distribution): For the second term, the policy $\pi_\phi$ may query actions $\hat a$ outside the dataset support. In practice, Q-functions on OOD actions can suffer from overestimation due to function-approximation errors. We use the worst-case bound:

$$\mathbb{E}_{\hat a\sim\pi_\phi}\left[Q_k(s,\hat a)\right] \le \frac{2R_{\max}}{1-\gamma}. \qquad (22)$$

In practice, we can crop $Q$ if it ever exceeds the bound. Combining t...
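
The cropping step mentioned with the worst-case bound can be sketched in a few lines. The example reads the extracted expression as the fraction 2·R_max / (1 − γ), which is an interpretation of garbled text; `crop_q` is a hypothetical helper name and the numbers are made up.

```python
import numpy as np

def crop_q(q_values, r_max, gamma):
    """Cap Q estimates at the assumed worst-case bound 2 * r_max / (1 - gamma)."""
    bound = 2.0 * r_max / (1.0 - gamma)
    return np.minimum(q_values, bound)

# With r_max = 1 and gamma = 0.99 the assumed bound is about 200,
# so only the second, inflated estimate is changed.
q = np.array([10.0, 500.0])
print(crop_q(q, r_max=1.0, gamma=0.99))
```

Only the upper side is capped here because the passage speaks of estimates that exceed the bound; whether the authors also clip from below is not stated.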

[1] [2]

Offline Reinforcement Learning: Tutorial, Review, and Perspectives on Open Problems

Sergey Levine, Aviral Kumar, George Tucker, and Justin Fu. Offline reinforcement learning: Tutorial, review, and perspectives on open problems.CoRR, abs/2005.01643, 2020

work page internal anchor Pith review Pith/arXiv arXiv 2005

[2] [3]

Rafael Figueiredo Prudencio, Marcos R. O. A. Maximo, and Esther Luna Colombini. A survey on offline reinforcement learning: Taxonomy, review, and open problems.IEEE Transactions on Neural Networks and Learning Systems, 35(8):10237–10257, August 2024

work page 2024

[3] [4]

Offline reinforcement learning with implicit q-learning, 2021

Ilya Kostrikov, Ashvin Nair, and Sergey Levine. Offline reinforcement learning with implicit q-learning, 2021

work page 2021

[4] [7]

Off-policy deep reinforcement learning without exploration, 2019

Scott Fujimoto, David Meger, and Doina Precup. Off-policy deep reinforcement learning without exploration, 2019

work page 2019

[5] [8]

A minimalist approach to offline reinforcement learning, 2021

Scott Fujimoto and Shixiang Shane Gu. A minimalist approach to offline reinforcement learning, 2021

work page 2021

[6] [9]

Kumar, J

Aviral Kumar, Justin Fu, George Tucker, and Sergey Levine. Stabilizing off-policy q-learning via bootstrapping error reduction.CoRR, abs/1906.00949, 2019

work page arXiv 1906

[7] [10]

Supported policy optimization for offline reinforcement learning, 2022

Jialong Wu, Haixu Wu, Zihan Qiu, Jianmin Wang, and Mingsheng Long. Supported policy optimization for offline reinforcement learning, 2022

work page 2022

[8] [11]

Behavior regularized offline reinforcement learning, 2019

Yifan Wu, George Tucker, and Ofir Nachum. Behavior regularized offline reinforcement learning, 2019. 10 ISEP: Implicit Support Expansion for Offline RL

work page 2019

[9] [12]

Conservative Q-Learning for Offline Reinforcement Learning, August 2020

Aviral Kumar, Aurick Zhou, George Tucker, and Sergey Levine. Conservative q-learning for offline reinforcement learning.CoRR, abs/2006.04779, 2020

work page arXiv 2006

[10] [13]

Diffusion guidance is a controllable policy improvement operator, 2025

Kevin Frans, Seohong Park, Pieter Abbeel, and Sergey Levine. Diffusion guidance is a controllable policy improvement operator, 2025

work page 2025

[11] [14]

Whitney, Rajesh Ranganath, and Joan Bruna

David Brandfonbrener, William F. Whitney, Rajesh Ranganath, and Joan Bruna. Offline rl without off-policy evaluation, 2021

work page 2021

[12] [15]

Offline rl with no ood actions: In-sample learning via implicit value regularization, 2023

Haoran Xu, Li Jiang, Jianxiong Li, Zhuoran Yang, Zhaoran Wang, Victor Wai Kin Chan, and Xianyuan Zhan. Offline rl with no ood actions: In-sample learning via implicit value regularization, 2023

work page 2023

[13] [16]

Extreme q-learning: Maxent rl without entropy, 2023

Divyansh Garg, Joey Hejna, Matthieu Geist, and Stefano Ermon. Extreme q-learning: Maxent rl without entropy, 2023

work page 2023

[14] [17]

Doubly mild generalization for offline reinforce- ment learning, 2024

Yixiu Mao, Qi Wang, Yun Qu, Yuhang Jiang, and Xiangyang Ji. Doubly mild generalization for offline reinforce- ment learning, 2024

work page 2024

[15] [18]

Counterfactual conservative q learning for offline multi-agent reinforcement learning, 2023

Jianzhun Shao, Yun Qu, Chen Chen, Hongchang Zhang, and Xiangyang Ji. Counterfactual conservative q learning for offline multi-agent reinforcement learning, 2023

work page 2023

[16] [19]

Adversarially trained actor critic for offline reinforcement learning, 2022

Ching-An Cheng, Tengyang Xie, Nan Jiang, and Alekh Agarwal. Adversarially trained actor critic for offline reinforcement learning, 2022

work page 2022

[17] [20]

Uncertainty-based offline reinforcement learning with diversified q-ensemble, 2021

Gaon An, Seungyong Moon, Jang-Hyun Kim, and Hyun Oh Song. Uncertainty-based offline reinforcement learning with diversified q-ensemble, 2021

work page 2021

[18] [21]

Pessimistic bootstrapping for uncertainty-driven offline reinforcement learning, 2022

Chenjia Bai, Lingxiao Wang, Zhuoran Yang, Zhihong Deng, Animesh Garg, Peng Liu, and Zhaoran Wang. Pessimistic bootstrapping for uncertainty-driven offline reinforcement learning, 2022

[19] [22]

Yixiu Mao, Hongchang Zhang, Chen Chen, Yi Xu, and Xiangyang Ji. Supported trust region optimization for offline reinforcement learning, 2023

[20] [23]

Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models, 2020

[21] [24]

Jascha Sohl-Dickstein, Eric A. Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics, 2015

[22] [25]

Yaron Lipman, Ricky T. Q. Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow matching for generative modeling, 2023

[23] [26]

Cheng Chi, Zhenjia Xu, Siyuan Feng, Eric Cousineau, Yilun Du, Benjamin Burchfiel, Russ Tedrake, and Shuran Song. Diffusion policy: Visuomotor policy learning via action diffusion, 2024

[24] [27]

Tonghe Zhang, Chao Yu, Sichang Su, and Yu Wang. Reinflow: Fine-tuning flow matching policy with online reinforcement learning, 2025

[25] [28]

Zhendong Wang, Jonathan J Hunt, and Mingyuan Zhou. Diffusion policies as an expressive policy class for offline reinforcement learning, 2023

[26] [29]

Seohong Park, Qiyang Li, and Sergey Levine. Flow q-learning, 2025

[27] [30]

Suzan Ece Ada, Erhan Oztop, and Emre Ugur. Diffusion policies for out-of-distribution generalization in offline reinforcement learning. IEEE Robotics and Automation Letters, 9(4):3116–3123, April 2024

[28] [31]

Shutong Ding, Ke Hu, Zhenhao Zhang, Kan Ren, Weinan Zhang, Jingyi Yu, Jingya Wang, and Ye Shi. Diffusion-based reinforcement learning via q-weighted variational policy optimization, 2024

[29] [32]

Bingyi Kang, Xiao Ma, Chao Du, Tianyu Pang, and Shuicheng Yan. Efficient diffusion policies for offline reinforcement learning, 2023

[30] [33]

Cheng Lu, Huayu Chen, Jianfei Chen, Hang Su, Chongxuan Li, and Jun Zhu. Contrastive energy prediction for exact energy-guided diffusion sampling in offline reinforcement learning, 2023

[31] [34]

Shiyuan Zhang, Weitong Zhang, and Quanquan Gu. Energy-weighted flow matching for offline reinforcement learning, 2025

[32] [35]

Philippe Hansen-Estruch, Ilya Kostrikov, Michael Janner, Jakub Grudzien Kuba, and Sergey Levine. Idql: Implicit q-learning as an actor-critic method with diffusion policies, 2023

[33] [36]

Huayu Chen, Cheng Lu, Chengyang Ying, Hang Su, and Jun Zhu. Offline reinforcement learning via high-fidelity generative behavior modeling, 2023

[34] [37]

Jan Peters and Stefan Schaal. Reinforcement learning by reward-weighted regression for operational space control. In Proceedings of the 24th International Conference on Machine Learning, pages 745–750, 2007.

[35] [38]

Qing Wang, Jiechao Xiong, Lei Han, Peng Sun, Han Liu, and Tong Zhang. Exponentially weighted imitation learning for batched historical data. In Advances in Neural Information Processing Systems (NeurIPS), volume 31, pages 6291–6300, 2018

[36] [39]

Xue Bin Peng, Aviral Kumar, Grace Zhang, and Sergey Levine. Advantage-weighted regression: Simple and scalable off-policy reinforcement learning. CoRR, abs/1910.00177, 2019

[37] [40]

Justin Fu, Aviral Kumar, Ofir Nachum, George Tucker, and Sergey Levine. D4RL: Datasets for deep data-driven reinforcement learning. CoRR, abs/2004.07219, 2020

[38] [41]

Pete Florence, Corey Lynch, Andy Zeng, Oscar Ramirez, Ayzaan Wahid, Laura Downs, Adrian Wong, Johnny Lee, Igor Mordatch, and Jonathan Tompson. Implicit behavioral cloning, 2021

[39] [42]

Gen Li, Yuting Wei, Yuejie Chi, and Yuxin Chen. Softmax policy gradient methods can take exponential time to converge, 2022

[40] [43]

Diganta Misra. Mish: A self regularized non-monotonic activation function, 2020

[41] [44]

Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization, 2017

[42] [45]

Laurens van der Maaten and Geoffrey Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 9(86):2579–2605, 2008

[43] [46]

Sebastian Ruder. An overview of gradient descent optimization algorithms, 2017.

A Theoretical analysis

We proceed by induction on the value iteration steps.

Base Case: Initialize $V_0(s) = 0$ (or any arbitrarily small value such that $V_0(s) \le V^*(s)$). The base case holds trivially.

Inductive Step: Assume that at ...

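Expectile Term (In-Sample): First, we bound the updated Q-values.

\begin{align}
Q_k(s, a) &= \mathbb{E}_{(r, s') \sim \mathcal{D}(s, a)}\left[ r + \gamma V_k(s') \right] \tag{16} \\
&\le \mathbb{E}_{(r, s') \sim \mathcal{D}(s, a)}\left[ r + \gamma V^*(s') \right] \quad \text{(inductive hypothesis } V_k \le V^* \text{)} \tag{17} \\
&\le Q^*(s, a) \quad \text{(definition of } Q^* \text{)} \tag{18}
\end{align}

By Assumption 1, for any $s$ in the dataset:

\begin{equation}
\mathbb{E}^{\tau}_{a \sim \mathcal{D}(s)}\left[ Q_k(s, a) \right] \le \mathbb{E}^{\tau}_{a \sim \mathcal{D}(s)}\left[ Q^*(s, a) \right] \le \max_{a \in \mathcal{D}(s)} Q^*(s, a) - \delta_\tau \tag{19}
\end{equation}

Invoking Assumption 2, the dataset support is ...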

Policy Term (Out-of-Distribution): For the second term, the policy $\pi_\phi$ may query actions $\hat{a}$ outside the dataset support. In practice, Q-functions evaluated on OOD actions can suffer from overestimation due to function-approximation errors. We therefore use the worst-case bound:

\begin{equation}
\mathbb{E}_{\hat{a} \sim \pi_\phi}\left[ Q_k(s, \hat{a}) \right] \le \frac{2 R_{\max}}{1 - \gamma}. \tag{22}
\end{equation}

In practice, we can clip $Q$ if it ever exceeds the bound. Combining t...

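The remark in the Policy Term paragraph that Q can be clipped when it exceeds the worst-case bound 2R_max/(1−γ) can be illustrated with a short NumPy sketch. The function name `clip_q_to_bound`, the example values, and the choice to clip only from above are assumptions made for illustration; they are not details taken from the paper.

```python
import numpy as np


def clip_q_to_bound(q_values, r_max, gamma):
    """Cap Q estimates at the worst-case bound 2 * r_max / (1 - gamma).

    Hypothetical helper: only the upper side is capped, mirroring the
    remark that Q is clipped "if it ever exceeds the bound".
    """
    if not 0.0 <= gamma < 1.0:
        raise ValueError("gamma must lie in [0, 1)")
    bound = 2.0 * r_max / (1.0 - gamma)
    # Element-wise minimum leaves values below the bound untouched.
    return np.minimum(q_values, bound), bound


# Illustrative values only (not from the paper's experiments).
q = np.array([-5.0, 12.0, 250.0])
clipped, bound = clip_q_to_bound(q, r_max=1.0, gamma=0.99)
```

With these example inputs, only the third entry exceeds the bound and is capped; the other two pass through unchanged.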