pith. sign in

hub Canonical reference

Diffusion Policies as an Expressive Policy Class for Offline Reinforcement Learning

Canonical reference. 73% of citing Pith papers cite this work as background.

33 Pith papers citing it
Background 73% of classified citations
abstract

Offline reinforcement learning (RL), which aims to learn an optimal policy using a previously collected static dataset, is an important paradigm of RL. Standard RL methods often perform poorly in this regime due to the function approximation errors on out-of-distribution actions. While a variety of regularization methods have been proposed to mitigate this issue, they are often constrained by policy classes with limited expressiveness that can lead to highly suboptimal solutions. In this paper, we propose representing the policy as a diffusion model, a recent class of highly-expressive deep generative models. We introduce Diffusion Q-learning (Diffusion-QL) that utilizes a conditional diffusion model to represent the policy. In our approach, we learn an action-value function and we add a term maximizing action-values into the training loss of the conditional diffusion model, which results in a loss that seeks optimal actions that are near the behavior policy. We show the expressiveness of the diffusion model-based policy, and the coupling of the behavior cloning and policy improvement under the diffusion model both contribute to the outstanding performance of Diffusion-QL. We illustrate the superiority of our method compared to prior works in a simple 2D bandit example with a multimodal behavior policy. We then show that our method can achieve state-of-the-art performance on the majority of the D4RL benchmark tasks.

hub tools

citation-role summary

background 9 baseline 1 method 1

citation-polarity summary

representative citing papers

Aligning Flow Map Policies with Optimal Q-Guidance

cs.LG · 2026-05-12 · unverdicted · novelty 7.0

Flow map policies enable fast one-step inference for flow-based RL policies, and FMQ provides an optimal closed-form Q-guided target for offline-to-online adaptation under trust-region constraints, achieving SOTA performance.

Muninn: Your Trajectory Diffusion Model But Faster

cs.RO · 2026-05-11 · unverdicted · novelty 7.0

Muninn accelerates diffusion trajectory planners up to 4.6x by spending an uncertainty budget to decide when to cache denoiser outputs, preserving performance and certifying bounded deviation from full computation.

Path-Coupled Bellman Flows for Distributional Reinforcement Learning

cs.LG · 2026-05-07 · unverdicted · novelty 7.0

Path-Coupled Bellman Flows use source-consistent Bellman-coupled paths and a lambda-parameterized control-variate to learn return distributions via flow matching, improving fidelity and stability over prior DRL approaches.

Receding-Horizon Control via Drifting Models

cs.AI · 2026-04-06 · unverdicted · novelty 7.0

Drifting MPC produces a unique distribution over trajectories that trades off data support against optimality and enables efficient receding-horizon planning under unknown dynamics.

HITL-D: Human In The Loop Diffusion Assisted Shared Control

cs.RO · 2026-05-20 · unverdicted · novelty 6.0

HITL-D combines diffusion policies with human input for shared robotic control, reducing required joystick axes and improving speed and workload in manipulation tasks per a 12-participant study.

Refining Compositional Diffusion for Reliable Long-Horizon Planning

cs.RO · 2026-05-04 · unverdicted · novelty 6.0

RCD steers compositional diffusion sampling toward high-density coherent plans by combining reconstruction-error guidance with overlap consistency, outperforming prior methods on locomotion, manipulation, and pixel-based long-horizon tasks.

AdamO: A Collapse-Suppressed Optimizer for Offline RL

cs.LG · 2026-05-03 · unverdicted · novelty 6.0

AdamO modifies Adam with an orthogonality correction to ensure the spectral radius of the TD update operator stays below one, providing a theoretical stability guarantee for offline RL.

FASTER: Value-Guided Sampling for Fast RL

cs.LG · 2026-04-21 · unverdicted · novelty 6.0

FASTER models multi-candidate denoising as an MDP and trains a value function to filter actions early, delivering the performance of full sampling at lower cost in diffusion RL policies.

Fisher Decorator: Refining Flow Policy via a Local Transport Map

cs.LG · 2026-04-20 · unverdicted · novelty 6.0

Fisher Decorator refines flow policies in offline RL via a local transport map and Fisher-matrix quadratic approximation of the KL constraint, yielding controllable error near the optimum and SOTA benchmark results.

Real-Time Execution of Action Chunking Flow Policies

cs.RO · 2025-06-09 · unverdicted · novelty 6.0

Real-time chunking (RTC) allows diffusion- and flow-based action chunking policies to execute smoothly and asynchronously, maintaining high success rates on dynamic tasks even with significant inference latency.

Diffusion Policy Policy Optimization

cs.RO · 2024-09-01 · unverdicted · novelty 6.0

DPPO fine-tunes diffusion policies via policy gradients and outperforms prior RL approaches for diffusion policies and PG-tuned alternatives on robot benchmarks while enabling stable training and hardware deployment.

citing papers explorer

Showing 33 of 33 citing papers.