pith. machine review for the scientific record.

arxiv: 2304.10573 · v2 · submitted 2023-04-20 · 💻 cs.LG · cs.AI

Recognition: 2 theorem links

· Lean Theorem

IDQL: Implicit Q-Learning as an Actor-Critic Method with Diffusion Policies

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 13:44 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords offline reinforcement learning · implicit Q-learning · diffusion policies · actor-critic · behavior regularization · importance sampling · policy extraction · multimodal policies

The pith

Implicit Q-Learning implicitly defines a behavior-regularized actor that is more accurately extracted using diffusion models and importance sampling than with Gaussian policies.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper reinterprets Implicit Q-Learning as an actor-critic method whose critic objective corresponds to a behavior-regularized implicit actor. This actor trades off reward maximization against divergence from the dataset policy, and the choice of critic loss determines the precise form of that tradeoff. Because the resulting actor distribution can be multimodal and complex, fitting it with a conditional Gaussian via advantage-weighted regression is insufficient. The authors instead draw samples from a diffusion model of the behavior policy and reweight them by the critic values through importance sampling to recover the intended policy. This yields IDQL, which preserves the implementation simplicity of standard IQL while delivering higher performance on offline RL tasks and greater robustness to hyperparameter settings.
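
Written out for the familiar KL-regularized case (the paper's analysis covers a broader family of critic losses, each inducing its own weighting), the tradeoff and the actor it induces are:

    % Behavior-regularized objective: maximize value while staying close to the
    % dataset policy mu; beta > 0 controls the strength of the regularization.
    \max_{\pi}\; \mathbb{E}_{a \sim \pi(\cdot\mid s)}\big[Q(s,a)\big]
        \;-\; \beta\, D_{\mathrm{KL}}\!\big(\pi(\cdot\mid s)\,\big\|\,\mu(\cdot\mid s)\big)
    % Closed-form maximizer: the implicit actor is the behavior policy reweighted
    % by exp(Q/beta), which is what diffusion samples of mu plus critic-derived
    % importance weights aim to recover.
    \pi^{*}(a\mid s) \;\propto\; \mu(a\mid s)\, \exp\!\big(Q(s,a)/\beta\big)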

Core claim

We reinterpret IQL as an actor-critic method by generalizing the critic objective and connecting it to a behavior-regularized implicit actor. This generalization shows how the induced actor balances reward maximization and divergence from the behavior policy, with the specific loss choice determining the nature of this tradeoff. Notably, this actor can exhibit complex and multimodal characteristics, suggesting issues with the conditional Gaussian actor fit with advantage weighted regression used in prior methods. Instead, we propose using samples from a diffusion-parameterized behavior policy and weights computed from the critic to importance sample our intended policy. We introduce Implicit Diffusion Q-learning (IDQL), combining our general IQL critic with this policy extraction method.

What carries the argument

Generalized IQL critic objective connected to a behavior-regularized implicit actor, extracted by importance sampling critic weights over samples from a diffusion-parameterized behavior policy.
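
A minimal sketch of that extraction step, assuming a trained critic q_fn(state, actions) and a diffusion behavior model exposed as behavior_sampler(state, n) (both hypothetical interfaces, not the released API):

    import numpy as np

    def extract_action(state, behavior_sampler, q_fn, num_samples=128, beta=1.0, greedy=True):
        """Reweight diffusion behavior samples by critic values (sketch, not the
        paper's exact selection rule).

        behavior_sampler(state, n) -> array of n candidate actions drawn from the
        learned behavior policy mu(a|s); q_fn(state, actions) -> one critic value
        per candidate. Both interfaces are hypothetical placeholders.
        """
        actions = behavior_sampler(state, num_samples)   # candidates from mu(a|s)
        q_values = np.asarray(q_fn(state, actions))      # critic scores Q(s, a)
        logits = q_values / beta                         # temperature-scaled scores
        weights = np.exp(logits - logits.max())          # stabilize before normalizing
        weights /= weights.sum()                         # self-normalized importance weights
        if greedy:
            return actions[int(np.argmax(weights))]      # pick the highest-weight candidate
        idx = np.random.choice(num_samples, p=weights)   # or resample proportionally
        return actions[idx]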

Load-bearing premise

That samples from the diffusion-parameterized behavior policy combined with critic weights via importance sampling correctly recover the intended behavior-regularized implicit actor without introducing prohibitive variance or bias.
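
One direct check on that premise, assuming critic values for a batch of behavior-model samples at a single state (same hypothetical interface as the sketch above), is the effective sample size of the self-normalized weights:

    import numpy as np

    def importance_weight_diagnostics(q_values, beta=1.0):
        """Effective sample size (ESS) and variance of self-normalized importance
        weights built from critic values. A low ESS fraction signals that a few
        samples dominate and the extracted actor may be poorly estimated.
        Illustrative sketch; the weighting rule and any thresholds are assumptions,
        not taken from the paper.
        """
        logits = np.asarray(q_values, dtype=np.float64) / beta
        w = np.exp(logits - logits.max())   # stabilize before normalizing
        w /= w.sum()                        # self-normalized weights
        ess = 1.0 / np.sum(w ** 2)          # lies in [1, len(w)]
        return {"ess": float(ess),
                "ess_fraction": float(ess / len(w)),
                "weight_variance": float(np.var(w))}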

What would settle it

An offline RL benchmark run where the importance-sampled diffusion policy achieves returns substantially below the values predicted by the trained IQL critic on the same actions, or where IDQL fails to outperform standard IQL with Gaussian policy extraction.
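
One way such a check could be run, assuming a Gym-style environment whose step() returns (obs, reward, done, info) and the hypothetical policy and critic interfaces sketched above:

    import numpy as np

    def critic_vs_return_gap(env, policy, q_fn, num_episodes=10, gamma=0.99):
        """Compare the critic's prediction at the first state-action pair with the
        discounted Monte Carlo return of the episode (a rough diagnostic, not the
        paper's evaluation protocol)."""
        predicted, achieved = [], []
        for _ in range(num_episodes):
            state = env.reset()
            action = policy(state)
            predicted.append(float(np.asarray(q_fn(state, action[None]))[0]))  # Q(s0, a0)
            done, ret, discount = False, 0.0, 1.0
            while not done:
                state, reward, done, _ = env.step(action)
                ret += discount * reward
                discount *= gamma
                action = policy(state)
            achieved.append(ret)
        # A large positive gap means the critic promises far more than the policy delivers.
        return float(np.mean(predicted) - np.mean(achieved))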

read the original abstract

Effective offline RL methods require properly handling out-of-distribution actions. Implicit Q-learning (IQL) addresses this by training a Q-function using only dataset actions through a modified Bellman backup. However, it is unclear which policy actually attains the values represented by this implicitly trained Q-function. In this paper, we reinterpret IQL as an actor-critic method by generalizing the critic objective and connecting it to a behavior-regularized implicit actor. This generalization shows how the induced actor balances reward maximization and divergence from the behavior policy, with the specific loss choice determining the nature of this tradeoff. Notably, this actor can exhibit complex and multimodal characteristics, suggesting issues with the conditional Gaussian actor fit with advantage weighted regression (AWR) used in prior methods. Instead, we propose using samples from a diffusion parameterized behavior policy and weights computed from the critic to then importance sampled our intended policy. We introduce Implicit Diffusion Q-learning (IDQL), combining our general IQL critic with the policy extraction method. IDQL maintains the ease of implementation of IQL while outperforming prior offline RL methods and demonstrating robustness to hyperparameters. Code is available at https://github.com/philippe-eecs/IDQL.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper reinterprets Implicit Q-Learning (IQL) as an actor-critic method by generalizing the critic objective to induce a behavior-regularized implicit actor whose density is proportional to exp(Q) times the behavior density. It proposes IDQL, which extracts this actor by sampling from a learned diffusion model of the behavior policy and reweighting samples via importance sampling using critic values, claiming that this approach maintains IQL's implementation simplicity while outperforming prior offline RL methods and demonstrating hyperparameter robustness.

Significance. If the reinterpretation is correct and the importance sampling recovers the intended actor without prohibitive bias or variance, the work would be significant for offline RL by providing a principled mechanism to extract complex multimodal policies from IQL critics using diffusion models. The code release supports reproducibility and extension of the method.

major comments (2)
  1. [IDQL policy extraction description] In the policy extraction step for IDQL, the manuscript provides no analysis, bounds, or empirical diagnostics on the variance of the importance sampling estimator when drawing from the diffusion-parameterized behavior policy and reweighting by critic values. This is load-bearing for the claimed equivalence, as high variance (possible when Q-values vary sharply across actions or the target policy is multimodal) would cause the extracted policy to deviate from the one represented by the trained critic.
  2. [Empirical evaluation and hyperparameter robustness claims] The free parameters (diffusion model hyperparameters and importance sampling temperature) are acknowledged but the paper does not demonstrate through ablations or sensitivity analysis that performance remains robust when these are varied, which is necessary to support the robustness claim.
minor comments (1)
  1. [Abstract] The abstract notes that the specific loss choice determines the nature of the tradeoff but does not identify the loss used in the IDQL experiments; adding this detail would clarify the connection between the generalized critic and the extracted actor.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed review. Below we address each major comment point-by-point, indicating the revisions we will incorporate to strengthen the manuscript.

read point-by-point responses
  1. Referee: [IDQL policy extraction description] In the policy extraction step for IDQL, the manuscript provides no analysis, bounds, or empirical diagnostics on the variance of the importance sampling estimator when drawing from the diffusion-parameterized behavior policy and reweighting by critic values. This is load-bearing for the claimed equivalence, as high variance (possible when Q-values vary sharply across actions or the target policy is multimodal) would cause the extracted policy to deviate from the one represented by the trained critic.

    Authors: We acknowledge that the original manuscript did not provide theoretical bounds or explicit variance analysis for the importance sampling step. To address this, we have performed additional empirical diagnostics, including histograms of importance weights, effective sample size calculations, and variance estimates across multiple environments and training checkpoints. These show that the diffusion model's coverage keeps weight variance manageable in practice, supporting the claimed equivalence. We will add these diagnostics, along with a brief discussion of limitations when Q-values are extremely sharp, to the revised policy extraction section. revision: yes

  2. Referee: [Empirical evaluation and hyperparameter robustness claims] The free parameters (diffusion model hyperparameters and importance sampling temperature) are acknowledged but the paper does not demonstrate through ablations or sensitivity analysis that performance remains robust when these are varied, which is necessary to support the robustness claim.

    Authors: We agree that explicit sensitivity analysis is needed to substantiate the robustness claim. In the revised manuscript we will add ablations varying the number of diffusion steps, the noise schedule parameters, and the importance sampling temperature over reasonable ranges. Performance tables and plots on representative environments (e.g., locomotion and manipulation tasks) will demonstrate that results remain stable, thereby strengthening the claim that IDQL is robust to these hyperparameters while preserving implementation simplicity. revision: yes
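
A sketch of the kind of sweep described in this response, with evaluate_idql standing in for a full train-and-evaluate run (a hypothetical helper, not part of the released code):

    import itertools

    def robustness_sweep(evaluate_idql, env_names,
                         betas=(0.5, 1.0, 3.0), diffusion_steps=(5, 15, 50)):
        """Grid over importance-sampling temperature and diffusion-step count,
        recording a normalized return per configuration (ablation sketch only)."""
        results = {}
        for env_name, beta, steps in itertools.product(env_names, betas, diffusion_steps):
            results[(env_name, beta, steps)] = evaluate_idql(
                env_name, beta=beta, num_diffusion_steps=steps)
        return results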

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper reinterprets the IQL critic objective via a generalized Bellman backup to connect it to a behavior-regularized implicit actor whose density is proportional to the behavior density times an exponential of the Q-values; this is a direct algebraic consequence of the modified backup and does not reduce the claimed actor to a fitted quantity by construction. The policy extraction step then introduces an independent approximation that draws samples from a separately trained diffusion model of the behavior policy and applies importance weights derived from the critic, which is presented as a new algorithmic choice rather than a tautological renaming or self-referential prediction. No equations equate the final performance or the extracted policy back to the critic training inputs, self-citations to prior IQL results supply external context instead of load-bearing justification, and empirical outperformance claims rest on experiments outside the derivation. The chain is therefore self-contained.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the validity of reinterpreting IQL as an actor-critic method and on the assumption that diffusion models plus importance sampling recover the target policy without distortion.

free parameters (1)
  • diffusion model hyperparameters and importance sampling temperature
    These control the behavior-policy representation and the weighting; the paper claims robustness to them but does not derive them from first principles.
axioms (1)
  • domain assumption The generalized critic objective induces a behavior-regularized implicit actor whose tradeoff is controlled by loss choice
    This is the load-bearing reinterpretation connecting the IQL critic to the actor.

pith-pipeline@v0.9.0 · 5524 in / 1198 out tokens · 66232 ms · 2026-05-13T13:44:20.536346+00:00 · methodology

discussion (0)


Forward citations

Cited by 25 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Aligning Flow Map Policies with Optimal Q-Guidance

    cs.LG 2026-05 unverdicted novelty 7.0

    Flow map policies enable fast one-step inference for flow-based RL policies, and FMQ provides an optimal closed-form Q-guided target for offline-to-online adaptation under trust-region constraints, achieving SOTA performance.

  2. Path-Coupled Bellman Flows for Distributional Reinforcement Learning

    cs.LG 2026-05 unverdicted novelty 7.0

    Path-Coupled Bellman Flows use source-consistent Bellman-coupled paths and a lambda-parameterized control-variate to learn return distributions via flow matching, improving fidelity and stability over prior DRL approaches.

  3. Beyond Penalization: Diffusion-based Out-of-Distribution Detection and Selective Regularization in Offline Reinforcement Learning

    cs.LG 2026-05 unverdicted novelty 7.0

    DOSER detects OOD actions via diffusion-model denoising error and applies selective regularization based on predicted transitions, proving gamma-contraction with performance bounds and outperforming priors on offline ...

  4. Towards Efficient and Expressive Offline RL via Flow-Anchored Noise-conditioned Q-Learning

    cs.LG 2026-05 unverdicted novelty 7.0

    FAN achieves state-of-the-art offline RL performance on robotic tasks by anchoring flow policies and using single-sample noise-conditioned Q-learning, with proven convergence and reduced runtimes.

  5. CoFlow: Coordinated Few-Step Flow for Offline Multi-Agent Decision Making

    cs.AI 2026-05 unverdicted novelty 7.0

    CoFlow achieves state-of-the-art coordination quality in offline MARL using only 1-3 denoising steps by natively coupling velocity fields across agents via coordinated attention and gating.

  6. CoFlow: Coordinated Few-Step Flow for Offline Multi-Agent Decision Making

    cs.AI 2026-05 unverdicted novelty 7.0

    CoFlow achieves state-of-the-art coordination in offline MARL using single-pass joint velocity fields with Coordinated Velocity Attention and Adaptive Coordination Gating.

  7. Preserve Support, Not Correspondence: Dynamic Routing for Offline Reinforcement Learning

    cs.LG 2026-04 unverdicted novelty 7.0

    DROL trains one-step offline RL actors via top-1 dynamic routing of dataset actions to latent candidates, enabling local improvements while preserving data support and retaining cheap inference.

  8. Reinforcement Learning via Value Gradient Flow

    cs.LG 2026-04 unverdicted novelty 7.0

    VGF solves behavior-regularized RL by transporting particles from a reference distribution to the value-induced optimal policy via discrete value-guided gradient flow.

  9. ScoRe-Flow: Complete Distributional Control via Score-Based Reinforcement Learning for Flow Matching

    cs.RO 2026-04 unverdicted novelty 7.0

    ScoRe-Flow achieves decoupled mean-variance control in stochastic flow matching by deriving a closed-form score for drift modulation plus learned variance, yielding faster RL convergence and higher success rates on lo...

  10. Advantage-Guided Diffusion for Model-Based Reinforcement Learning

    cs.AI 2026-04 unverdicted novelty 7.0

    Advantage-guided diffusion (SAG and EAG) steers sampling in diffusion world models to higher-advantage trajectories, enabling policy improvement and better sample efficiency on MuJoCo tasks.

  11. Q-Flow: Stable and Expressive Reinforcement Learning with Flow-Based Policy

    cs.LG 2026-05 unverdicted novelty 6.0

    Q-Flow enables stable optimization of expressive flow-based policies in RL by propagating terminal values along deterministic flow dynamics to intermediate states for gradient updates without solver unrolling.

  12. Discrete Flow Matching for Offline-to-Online Reinforcement Learning

    cs.LG 2026-05 unverdicted novelty 6.0

    DRIFT enables stable offline-to-online fine-tuning of CTMC policies in discrete RL via advantage-weighted discrete flow matching, path-space regularization, and candidate-set approximation.

  13. Adaptive Action Chunking via Multi-Chunk Q Value Estimation

    cs.LG 2026-05 unverdicted novelty 6.0

    ACH lets RL policies dynamically pick action chunk lengths by jointly estimating Q-values for all candidate lengths via a single Transformer pass.

  14. ACSAC: Adaptive Chunk Size Actor-Critic with Causal Transformer Q-Network

    cs.LG 2026-05 unverdicted novelty 6.0

    ACSAC adaptively selects action chunk sizes via a causal Transformer Q-network in actor-critic RL, proves the Bellman operator is a contraction, and reports state-of-the-art results on long-horizon manipulation tasks.

  15. Drifting Field Policy: A One-Step Generative Policy via Wasserstein Gradient Flow

    cs.LG 2026-05 unverdicted novelty 6.0

    DFP is a one-step generative policy using Wasserstein gradient flow on a drifting model backbone, with a top-K behavior cloning surrogate, that reaches SOTA on Robomimic and OGBench manipulation tasks.

  16. OGPO: Sample Efficient Full-Finetuning of Generative Control Policies

    cs.LG 2026-05 unverdicted novelty 6.0

    OGPO is a sample-efficient off-policy method for full finetuning of generative control policies that reaches SOTA on robotic manipulation tasks and can recover from poor behavior-cloning initializations without expert data.

  17. CoFlow: Coordinated Few-Step Flow for Offline Multi-Agent Decision Making

    cs.AI 2026-05 unverdicted novelty 6.0

    CoFlow preserves inter-agent coordination in few-step offline MARL by using a natively joint velocity field with Coordinated Velocity Attention and Adaptive Coordination Gating, matching or exceeding baselines in 1-3 ...

  18. FASTER: Value-Guided Sampling for Fast RL

    cs.LG 2026-04 unverdicted novelty 6.0

    FASTER models multi-candidate denoising as an MDP and trains a value function to filter actions early, delivering the performance of full sampling at lower cost in diffusion RL policies.

  19. Fisher Decorator: Refining Flow Policy via a Local Transport Map

    cs.LG 2026-04 unverdicted novelty 6.0

    Fisher Decorator refines flow policies in offline RL via a local transport map and Fisher-matrix quadratic approximation of the KL constraint, yielding controllable error near the optimum and SOTA benchmark results.

  20. Mean Flow Policy Optimization

    cs.LG 2026-04 conditional novelty 6.0

    Mean Flow Policy Optimization (MFPO) uses few-step flow-based models for RL policies and achieves performance on par with or better than diffusion-based methods while substantially lowering training and inference time...

  21. Whole-Body Mobile Manipulation using Offline Reinforcement Learning on Sub-optimal Controllers

    cs.RO 2026-04 unverdicted novelty 6.0

    WHOLE-MoMa improves whole-body mobile manipulation by applying offline RL with Q-chunking to demonstrations from randomized sub-optimal controllers, outperforming baselines and transferring to real robots without tele...

  22. Truncated Rectified Flow Policy for Reinforcement Learning with One-Step Sampling

    cs.LG 2026-04 unverdicted novelty 6.0

    TRFP combines rectified flow models with truncation to support multimodal policies in MaxEnt RL while allowing fast one-step sampling and stable training.

  23. Training Diffusion Models with Reinforcement Learning

    cs.LG 2023-05 unverdicted novelty 6.0

    DDPO uses policy gradients on the denoising process to optimize diffusion models for arbitrary rewards like human feedback or compressibility.

  24. Entropy-Regularized Adjoint Matching for Offline Reinforcement Learning

    cs.LG 2026-05 unverdicted novelty 5.0

    ME-AM adds mirror-descent entropy maximization and a mixture behavior prior to adjoint matching in flow-based policies to mitigate popularity bias and support binding in offline RL.

  25. Entropy-Regularized Adjoint Matching for Offline Reinforcement Learning

    cs.LG 2026-05 unverdicted novelty 5.0

    ME-AM adds entropy regularization and a mixture prior to adjoint matching in flow-based offline RL to extract better multi-modal policies from limited data.

Reference graph

Works this paper leans on

50 extracted references · 50 canonical work pages · cited by 22 Pith papers · 12 internal anchors

  1. [1]

    Is conditional generative modeling all you need for decision-making?

    Anurag Ajay, Yilun Du, Abhi Gupta, Joshua Tenenbaum, Tommi Jaakkola, and Pulkit Agrawal. Is conditional generative modeling all you need for decision-making? arXiv preprint arXiv:2211.15657, 2022

  2. [2]

    Layer Normalization

    Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. Layer normalization. arXiv preprint arXiv:1607.06450, 2016

  3. [3]

    Efficient online reinforcement learning with offline data

    Philip J Ball, Laura Smith, Ilya Kostrikov, and Sergey Levine. Efficient online reinforcement learning with offline data. arXiv preprint arXiv:2302.02948, 2023

  4. [4]

    JAX: composable transformations of Python+NumPy programs, 2018

    James Bradbury, Roy Frostig, Peter Hawkins, Matthew James Johnson, Chris Leary, Dougal Maclaurin, George Necula, Adam Paszke, Jake VanderPlas, Skye Wanderman-Milne, and Qiao Zhang. JAX: composable transformations of Python+NumPy programs, 2018. URL http://github.com/google/jax

  5. [5]

    Offline rl without off-policy evaluation

    David Brandfonbrener, Will Whitney, Rajesh Ranganath, and Joan Bruna. Offline rl without off-policy evaluation. Advances in neural information processing systems, 34:4933–4946, 2021

  6. [6]

    What is the effect of importance weighting in deep learning?

    Jonathon Byrd and Zachary Lipton. What is the effect of importance weighting in deep learning? In International conference on machine learning, pages 872–881. PMLR, 2019

  7. [7]

    Offline reinforcement learning via high-fidelity generative behavior modeling

    Huayu Chen, Cheng Lu, Chengyang Ying, Hang Su, and Jun Zhu. Offline reinforcement learning via high-fidelity generative behavior modeling. arXiv preprint arXiv:2209.14548, 2022

  8. [8]

    Decision transformer: Reinforcement learning via sequence modeling

    Lili Chen, Kevin Lu, Aravind Rajeswaran, Kimin Lee, Aditya Grover, Misha Laskin, Pieter Abbeel, Aravind Srinivas, and Igor Mordatch. Decision transformer: Reinforcement learning via sequence modeling. Advances in neural information processing systems, 34:15084–15097, 2021

  9. [9]

    Distributional reinforcement learning with quantile regression

    Will Dabney, Mark Rowland, Marc Bellemare, and Rémi Munos. Distributional reinforcement learning with quantile regression. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32, 2018

  10. [10]

    Implicit behavioral cloning

    Pete Florence, Corey Lynch, Andy Zeng, Oscar A Ramirez, Ayzaan Wahid, Laura Downs, Adrian Wong, Johnny Lee, Igor Mordatch, and Jonathan Tompson. Implicit behavioral cloning. In 5th Annual Conference on Robot Learning , 2021. URL https://openreview.net/ forum?id=rif3a5NAxU6

  11. [11]

    D4RL: Datasets for Deep Data-Driven Reinforcement Learning

    Justin Fu, Aviral Kumar, Ofir Nachum, George Tucker, and Sergey Levine. D4rl: Datasets for deep data-driven reinforcement learning. arXiv preprint arXiv:2004.07219, 2020

  12. [12]

    A minimalist approach to offline reinforcement learning

    Scott Fujimoto and Shixiang Shane Gu. A minimalist approach to offline reinforcement learning. Advances in neural information processing systems, 34:20132–20145, 2021

  13. [13]

    Off-policy deep reinforcement learning without exploration

    Scott Fujimoto, David Meger, and Doina Precup. Off-policy deep reinforcement learning without exploration. In International conference on machine learning, pages 2052–2062. PMLR, 2019

  14. [14]

    Extreme q-learning: Maxent rl without entropy

    Divyansh Garg, Joey Hejna, Matthieu Geist, and Stefano Ermon. Extreme q-learning: Maxent rl without entropy. arXiv preprint arXiv:2301.02328, 2023

  15. [15]

    Emaq: Expected-max q-learning operator for simple yet effective offline and online rl

    Seyed Kamyar Seyed Ghasemipour, Dale Schuurmans, and Shixiang Shane Gu. Emaq: Expected-max q-learning operator for simple yet effective offline and online rl. In International Conference on Machine Learning, pages 3682–3691. PMLR, 2021

  16. [16]

    Know your boundaries: The necessity of explicit behavioral cloning in offline rl

    Wonjoon Goo and Scott Niekum. Know your boundaries: The necessity of explicit behavioral cloning in offline rl. arXiv preprint arXiv:2206.00695, 2022

  17. [17]

    Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor

    Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In International conference on machine learning, pages 1861–1870. PMLR, 2018

  18. [18]

    Deep residual learning for image recognition

    Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016

  19. [19]

    Flax: A neural network library and ecosystem for JAX, 2023

    Jonathan Heek, Anselm Levskaya, Avital Oliver, Marvin Ritter, Bertrand Rondepierre, Andreas Steiner, and Marc van Zee. Flax: A neural network library and ecosystem for JAX, 2023. URL http://github.com/google/flax

  20. [20]

    Denoising diffusion probabilistic models

    Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33:6840–6851, 2020

  21. [21]

    Offline reinforcement learning as one big sequence modeling problem

    Michael Janner, Qiyang Li, and Sergey Levine. Offline reinforcement learning as one big sequence modeling problem. Advances in neural information processing systems, 34:1273–1286, 2021

  22. [22]

    Planning with Diffusion for Flexible Behavior Synthesis

    Michael Janner, Yilun Du, Joshua B Tenenbaum, and Sergey Levine. Planning with diffusion for flexible behavior synthesis. arXiv preprint arXiv:2205.09991, 2022

  23. [23]

    Adam: A Method for Stochastic Optimization

    Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014

  24. [24]

    JAXRL: Implementations of Reinforcement Learning algorithms in JAX

    Ilya Kostrikov. JAXRL: Implementations of Reinforcement Learning algorithms in JAX, 10

  25. [25]

    URL https://github.com/ikostrikov/jaxrl

  26. [26]

    Offline reinforcement learning with fisher divergence critic regularization

    Ilya Kostrikov, Rob Fergus, Jonathan Tompson, and Ofir Nachum. Offline reinforcement learning with fisher divergence critic regularization. In International Conference on Machine Learning, pages 5774–5783. PMLR, 2021

  27. [27]

    Offline Reinforcement Learning with Implicit Q-Learning

    Ilya Kostrikov, Ashvin Nair, and Sergey Levine. Offline reinforcement learning with implicit q-learning. arXiv preprint arXiv:2110.06169, 2021

  28. [28]

    Stabilizing off- policy q-learning via bootstrapping error reduction

    Aviral Kumar, Justin Fu, Matthew Soh, George Tucker, and Sergey Levine. Stabilizing off- policy q-learning via bootstrapping error reduction. Advances in Neural Information Processing Systems, 32, 2019

  29. [29]

    Conservative q-learning for offline reinforcement learning

    Aviral Kumar, Aurick Zhou, George Tucker, and Sergey Levine. Conservative q-learning for offline reinforcement learning. Advances in Neural Information Processing Systems, 33: 1179–1191, 2020

  30. [30]

    Controlling overestimation bias with truncated mixture of continuous distributional quantile critics

    Arsenii Kuznetsov, Pavel Shvechikov, Alexander Grishin, and Dmitry Vetrov. Controlling overestimation bias with truncated mixture of continuous distributional quantile critics. In International Conference on Machine Learning, pages 5556–5566. PMLR, 2020

  31. [31]

    Batch reinforcement learning

    Sascha Lange, Thomas Gabel, and Martin Riedmiller. Batch reinforcement learning. Reinforcement learning: State-of-the-art, pages 45–73, 2012

  32. [32]

    Offline Reinforcement Learning: Tutorial, Review, and Perspectives on Open Problems

    Sergey Levine, Aviral Kumar, George Tucker, and Justin Fu. Offline reinforcement learning: Tutorial, review, and perspectives on open problems. arXiv preprint arXiv:2005.01643, 2020

  33. [33]

    SGDR: Stochastic Gradient Descent with Warm Restarts

    Ilya Loshchilov and Frank Hutter. Sgdr: Stochastic gradient descent with warm restarts. arXiv preprint arXiv:1608.03983, 2016

  34. [34]

    AWAC: Accelerating Online Reinforcement Learning with Offline Datasets

    Ashvin Nair, Abhishek Gupta, Murtaza Dalal, and Sergey Levine. Awac: Accelerating online reinforcement learning with offline datasets. arXiv preprint arXiv:2006.09359, 2020

  35. [35]

    Cal-ql: Calibrated offline rl pre-training for efficient online fine-tuning

    Mitsuhiko Nakamoto, Yuexiang Zhai, Anikait Singh, Max Sobol Mark, Yi Ma, Chelsea Finn, Aviral Kumar, and Sergey Levine. Cal-ql: Calibrated offline rl pre-training for efficient online fine-tuning. arXiv preprint arXiv:2303.05479, 2023

  36. [36]

    Improved denoising diffusion probabilistic models

    Alexander Quinn Nichol and Prafulla Dhariwal. Improved denoising diffusion probabilistic models. In International Conference on Machine Learning, pages 8162–8171. PMLR, 2021

  37. [37]

    Imitating human behaviour with diffusion models

    Tim Pearce, Tabish Rashid, Anssi Kanervisto, Dave Bignell, Mingfei Sun, Raluca Georgescu, Sergio Valcarcel Macua, Shan Zheng Tan, Ida Momennejad, Katja Hofmann, and Sam Devlin. Imitating human behaviour with diffusion models. In International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=Pv1GPQzRrC8

  38. [38]

    Advantage-Weighted Regression: Simple and Scalable Off-Policy Reinforcement Learning

    Xue Bin Peng, Aviral Kumar, Grace Zhang, and Sergey Levine. Advantage-weighted regression: Simple and scalable off-policy reinforcement learning. arXiv preprint arXiv:1910.00177, 2019

  39. [39]

    Reinforcement learning by reward-weighted regression for operational space control

    Jan Peters and Stefan Schaal. Reinforcement learning by reward-weighted regression for operational space control. In Proceedings of the 24th International Conference on Machine Learning, volume 227 of ACM International Conference Proceeding Series, pages 745–750. ACM, 2007. ISBN 978-1-59593-793-3. doi: 10.1145/1273496.1273590

  40. [40]

    Language models are unsupervised multitask learners

    Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners. 2019

  41. [41]

    Goal-conditioned imitation learning using score-based diffusion policies

    Moritz Reuss, Maximilian Li, Xiaogang Jia, and Rudolf Lioutikov. Goal-conditioned imitation learning using score-based diffusion policies. arXiv preprint arXiv:2304.02532, 2023

  42. [42]

    Deep unsupervised learning using nonequilibrium thermodynamics

    Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. In International Conference on Machine Learning, pages 2256–2265. PMLR, 2015

  43. [43]

    Learning structured output representation using deep conditional generative models

    Kihyuk Sohn, Honglak Lee, and Xinchen Yan. Learning structured output representation using deep conditional generative models. In C. Cortes, N. Lawrence, D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 28. Curran Associates, Inc., 2015. URL https://proceedings.neurips.cc/paper/2015/file/8d55a249e6b...

  44. [44]

    Score-Based Generative Modeling through Stochastic Differential Equations

    Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. arXiv preprint arXiv:2011.13456, 2020

  45. [45]

    Diffusion Policies as an Expressive Policy Class for Offline Reinforcement Learning

    Zhendong Wang, Jonathan J Hunt, and Mingyuan Zhou. Diffusion policies as an expressive policy class for offline reinforcement learning. arXiv preprint arXiv:2208.06193, 2022

  46. [46]

    Critic regularized regression

    Ziyu Wang, Alexander Novikov, Konrad Zolna, Josh S Merel, Jost Tobias Springenberg, Scott E Reed, Bobak Shahriari, Noah Siegel, Caglar Gulcehre, Nicolas Heess, et al. Critic regularized regression. Advances in Neural Information Processing Systems, 33:7768–7778, 2020

  47. [47]

    Q-learning

    Christopher JCH Watkins and Peter Dayan. Q-learning. Machine learning, 8:279–292, 1992

  48. [48]

    Behavior Regularized Offline Reinforcement Learning

    Yifan Wu, George Tucker, and Ofir Nachum. Behavior regularized offline reinforcement learning. arXiv preprint arXiv:1911.11361, 2019

  49. [49]

    Understanding the role of importance weighting for deep learning

    Da Xu, Yuting Ye, and Chuanwei Ruan. Understanding the role of importance weighting for deep learning. arXiv preprint arXiv:2103.15209, 2021

  50. [50]

    Offline rl with no ood actions: In-sample learning via implicit value regularization

    Haoran Xu, Li Jiang, Jianxiong Li, Zhuoran Yang, Zhaoran Wang, Victor Wai Kin Chan, and Xianyuan Zhan. Offline rl with no ood actions: In-sample learning via implicit value regularization. arXiv preprint arXiv:2303.15810, 2023