Diffusion Policies as an Expressive Policy Class for Offline Reinforcement Learning

Zhendong Wang, Jonathan J Hunt, Mingyuan Zhou · 2022 · cs.LG · arXiv 2208.06193

Canonical reference. 73% of citing Pith papers cite this work as background.

33 Pith papers citing it

abstract

Offline reinforcement learning (RL), which aims to learn an optimal policy using a previously collected static dataset, is an important paradigm of RL. Standard RL methods often perform poorly in this regime due to the function approximation errors on out-of-distribution actions. While a variety of regularization methods have been proposed to mitigate this issue, they are often constrained by policy classes with limited expressiveness that can lead to highly suboptimal solutions. In this paper, we propose representing the policy as a diffusion model, a recent class of highly-expressive deep generative models. We introduce Diffusion Q-learning (Diffusion-QL) that utilizes a conditional diffusion model to represent the policy. In our approach, we learn an action-value function and we add a term maximizing action-values into the training loss of the conditional diffusion model, which results in a loss that seeks optimal actions that are near the behavior policy. We show the expressiveness of the diffusion model-based policy, and the coupling of the behavior cloning and policy improvement under the diffusion model both contribute to the outstanding performance of Diffusion-QL. We illustrate the superiority of our method compared to prior works in a simple 2D bandit example with a multimodal behavior policy. We then show that our method can achieve state-of-the-art performance on the majority of the D4RL benchmark tasks.
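
The combined objective described in the abstract (a behavior-cloning diffusion loss plus a term that maximizes action-values) can be sketched roughly as follows. This is a minimal illustration in plain Python with scalar actions, not the authors' implementation: the names `make_alpha_bars`, `bc_diffusion_loss`, `q_improvement_loss`, and `diffusion_ql_style_loss`, the linear noise schedule, the trade-off weight `eta`, and the toy stand-ins for the noise model, critic, and policy sampler are all assumptions made for the example.

```python
import math
import random


def make_alpha_bars(num_steps, beta_min=1e-4, beta_max=0.02):
    """Cumulative products of (1 - beta_i) for an assumed linear noise schedule."""
    alpha_bars, prod = [], 1.0
    for i in range(num_steps):
        beta = beta_min + (beta_max - beta_min) * i / max(num_steps - 1, 1)
        prod *= 1.0 - beta
        alpha_bars.append(prod)
    return alpha_bars


def bc_diffusion_loss(eps_model, states, actions, alpha_bars, rng):
    """Behavior-cloning term: squared error between injected and predicted noise."""
    total = 0.0
    for s, a in zip(states, actions):
        t = rng.randrange(len(alpha_bars))   # random diffusion step
        eps = rng.gauss(0.0, 1.0)            # noise injected into the dataset action
        noisy_a = math.sqrt(alpha_bars[t]) * a + math.sqrt(1.0 - alpha_bars[t]) * eps
        total += (eps - eps_model(noisy_a, s, t)) ** 2
    return total / len(states)


def q_improvement_loss(q_fn, policy_sample, states):
    """Policy-improvement term: negative mean Q-value of policy-sampled actions."""
    return -sum(q_fn(s, policy_sample(s)) for s in states) / len(states)


def diffusion_ql_style_loss(eps_model, q_fn, policy_sample,
                            states, actions, alpha_bars, eta, rng):
    """Sum of the two terms; `eta` (assumed) trades off cloning against improvement."""
    bc = bc_diffusion_loss(eps_model, states, actions, alpha_bars, rng)
    return bc + eta * q_improvement_loss(q_fn, policy_sample, states)


# Toy stand-ins (hypothetical): a real setup would use trained networks.
states = [0.0, 1.0, 2.0, 3.0]
actions = [1.0, -1.0, 1.0, -1.0]           # bimodal behavior data
alpha_bars = make_alpha_bars(5)
eps_model = lambda noisy_a, s, t: 0.0      # untrained model: always predicts zero noise
q_fn = lambda s, a: -(a - 1.0) ** 2        # toy critic that prefers a = 1
policy_sample = lambda s: 0.5              # stand-in for the reverse diffusion chain

loss = diffusion_ql_style_loss(eps_model, q_fn, policy_sample,
                               states, actions, alpha_bars, 2.0, random.Random(0))
print(loss)
```

In a real setup the sampled action would come from running the learned reverse diffusion chain conditioned on the state, so that the gradient of the Q term flows back through the sampler; the constant `policy_sample` above only keeps the sketch short.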

citation-role summary

background 9 · baseline 1 · method 1

citation-polarity summary

background 8 · baseline 1 · unclear 1 · use method 1

representative citing papers

Diffusion Policy: Visuomotor Policy Learning via Action Diffusion

cs.RO · 2023-03-07 · accept · novelty 8.0

Diffusion Policy models robot actions as a conditional diffusion process, outperforming prior state-of-the-art methods by 46.9% on average across 12 manipulation tasks from four benchmarks.

JEDI: Joint Embedding Diffusion World Model for Online Model-Based Reinforcement Learning

cs.LG · 2026-05-13 · unverdicted · novelty 7.0

JEDI is the first online end-to-end latent diffusion world model that trains latents from denoising loss rather than reconstruction, achieving competitive Atari100k results with 43% less VRAM and over 3x faster sampling than pixel diffusion baselines.

Aligning Flow Map Policies with Optimal Q-Guidance

cs.LG · 2026-05-12 · unverdicted · novelty 7.0

Flow map policies enable fast one-step inference for flow-based RL policies, and FMQ provides an optimal closed-form Q-guided target for offline-to-online adaptation under trust-region constraints, achieving SOTA performance.

Block-R1: Rethinking the Role of Block Size in Multi-domain Reinforcement Learning for Diffusion Large Language Models

cs.LG · 2026-05-12 · unverdicted · novelty 7.0 · 2 refs

Introduces Block-R1 benchmark, Block-R1-41K dataset, and a conflict score to handle domain-specific optimal block sizes in RL post-training of diffusion LLMs.

Muninn: Your Trajectory Diffusion Model But Faster

cs.RO · 2026-05-11 · unverdicted · novelty 7.0

Muninn accelerates diffusion trajectory planners up to 4.6x by spending an uncertainty budget to decide when to cache denoiser outputs, preserving performance and certifying bounded deviation from full computation.

Path-Coupled Bellman Flows for Distributional Reinforcement Learning

cs.LG · 2026-05-07 · unverdicted · novelty 7.0

Path-Coupled Bellman Flows use source-consistent Bellman-coupled paths and a lambda-parameterized control-variate to learn return distributions via flow matching, improving fidelity and stability over prior DRL approaches.

Beyond Penalization: Diffusion-based Out-of-Distribution Detection and Selective Regularization in Offline Reinforcement Learning

cs.LG · 2026-05-06 · unverdicted · novelty 7.0

DOSER detects OOD actions via diffusion-model denoising error and applies selective regularization based on predicted transitions, proving gamma-contraction with performance bounds and outperforming priors on offline RL benchmarks.

Receding-Horizon Control via Drifting Models

cs.AI · 2026-04-06 · unverdicted · novelty 7.0

Drifting MPC produces a unique distribution over trajectories that trades off data support against optimality and enables efficient receding-horizon planning under unknown dynamics.

Steering Your Diffusion Policy with Latent Space Reinforcement Learning

cs.RO · 2025-06-18 · unverdicted · novelty 7.0

DSRL steers pretrained diffusion policies for robotics by applying RL to their latent noise inputs, achieving sample-efficient real-world adaptation with only black-box access.

BiTrajDiff: Bidirectional Trajectory Generation with Diffusion Models for Offline Reinforcement Learning

cs.LG · 2025-06-06 · conditional · novelty 7.0

BiTrajDiff augments offline RL datasets by running independent forward and backward diffusion processes from intermediate states, yielding higher performance than prior one-directional data-augmentation baselines on D4RL.

HITL-D: Human In The Loop Diffusion Assisted Shared Control

cs.RO · 2026-05-20 · unverdicted · novelty 6.0

HITL-D combines diffusion policies with human input for shared robotic control, reducing required joystick axes and improving speed and workload in manipulation tasks per a 12-participant study.

Global Convergence of Sampling-Based Nonconvex Optimization through Diffusion-Style Smoothing

cs.LG · 2026-05-15 · unverdicted · novelty 6.0

Recasts sampling-based nonconvex optimization as smoothed gradient descent to obtain non-asymptotic convergence guarantees and introduces the DIDA annealed algorithm that converges to the global optimum.

Ada-Diffuser: Latent-Aware Adaptive Diffusion for Decision-Making

cs.LG · 2026-05-15 · unverdicted · novelty 6.0

Ada-Diffuser is a causal diffusion model that jointly learns observed interaction structure and underlying latent dynamics from minimal observations for adaptive planning and policy learning.

Revisiting Policy Gradients for Restricted Policy Classes: Escaping Myopic Local Optima with $k$-step Policy Gradients

cs.LG · 2026-05-11 · unverdicted · novelty 6.0

The k-step policy gradient converges exponentially close to the optimal deterministic policy in restricted classes, achieving O(1/T) rates under smoothness assumptions without distribution mismatch factors.

Refining Compositional Diffusion for Reliable Long-Horizon Planning

cs.RO · 2026-05-04 · unverdicted · novelty 6.0

RCD steers compositional diffusion sampling toward high-density coherent plans by combining reconstruction-error guidance with overlap consistency, outperforming prior methods on locomotion, manipulation, and pixel-based long-horizon tasks.

AdamO: A Collapse-Suppressed Optimizer for Offline RL

cs.LG · 2026-05-03 · unverdicted · novelty 6.0

AdamO modifies Adam with an orthogonality correction to ensure the spectral radius of the TD update operator stays below one, providing a theoretical stability guarantee for offline RL.

FASTER: Value-Guided Sampling for Fast RL

cs.LG · 2026-04-21 · unverdicted · novelty 6.0

FASTER models multi-candidate denoising as an MDP and trains a value function to filter actions early, delivering the performance of full sampling at lower cost in diffusion RL policies.

Accelerating trajectory optimization with Sobolev-trained diffusion policies

cs.LG · 2026-04-21 · unverdicted · novelty 6.0

Sobolev-trained diffusion policies using trajectories and feedback gains provide warm-starts that reduce trajectory optimization solving time by 2x to 20x while avoiding compounding errors.

Fisher Decorator: Refining Flow Policy via a Local Transport Map

cs.LG · 2026-04-20 · unverdicted · novelty 6.0

Fisher Decorator refines flow policies in offline RL via a local transport map and Fisher-matrix quadratic approximation of the KL constraint, yielding controllable error near the optimum and SOTA benchmark results.

How Does the Lagrangian Guide Safe Reinforcement Learning through Diffusion Models?

cs.LG · 2026-02-02 · unverdicted · novelty 6.0

ALGD augments the Lagrangian to locally convexify the energy landscape in diffusion models, stabilizing safe RL training and generation without changing optimal policies.

Reinforcement Learning with Discrete Diffusion Policies for Combinatorial Action Spaces

cs.LG · 2025-09-26 · unverdicted · novelty 6.0

A method trains discrete diffusion policies for combinatorial RL by matching to a PMD-regularized target distribution, reporting SOTA performance and sample efficiency on DNA generation, macro-action, and multi-agent benchmarks.

Real-Time Execution of Action Chunking Flow Policies

cs.RO · 2025-06-09 · unverdicted · novelty 6.0

Real-time chunking (RTC) allows diffusion- and flow-based action chunking policies to execute smoothly and asynchronously, maintaining high success rates on dynamic tasks even with significant inference latency.

HybridVLA: Collaborative Diffusion and Autoregression in a Unified Vision-Language-Action Model

cs.CV · 2025-03-13 · unverdicted · novelty 6.0

HybridVLA unifies diffusion and autoregression in a single VLA model via collaborative training and ensemble to raise robot manipulation success rates by 14% in simulation and 19% in real-world tasks.

Diffusion Policy Policy Optimization

cs.RO · 2024-09-01 · unverdicted · novelty 6.0

DPPO fine-tunes diffusion policies via policy gradients and outperforms prior RL approaches for diffusion policies and PG-tuned alternatives on robot benchmarks while enabling stable training and hardware deployment.

citing papers explorer

Showing 33 of 33 citing papers.

Diffusion Policy: Visuomotor Policy Learning via Action Diffusion cs.RO · 2023-03-07 · accept · none · ref 10 · internal anchor
Diffusion Policy models robot actions as a conditional diffusion process, outperforming prior state-of-the-art methods by 46.9% on average across 12 manipulation tasks from four benchmarks.
JEDI: Joint Embedding Diffusion World Model for Online Model-Based Reinforcement Learning cs.LG · 2026-05-13 · unverdicted · none · ref 25 · internal anchor
JEDI is the first online end-to-end latent diffusion world model that trains latents from denoising loss rather than reconstruction, achieving competitive Atari100k results with 43% less VRAM and over 3x faster sampling than pixel diffusion baselines.
Aligning Flow Map Policies with Optimal Q-Guidance cs.LG · 2026-05-12 · unverdicted · none · ref 43 · internal anchor
Flow map policies enable fast one-step inference for flow-based RL policies, and FMQ provides an optimal closed-form Q-guided target for offline-to-online adaptation under trust-region constraints, achieving SOTA performance.
Block-R1: Rethinking the Role of Block Size in Multi-domain Reinforcement Learning for Diffusion Large Language Models cs.LG · 2026-05-12 · unverdicted · none · ref 64 · 2 links · internal anchor
Introduces Block-R1 benchmark, Block-R1-41K dataset, and a conflict score to handle domain-specific optimal block sizes in RL post-training of diffusion LLMs.
Muninn: Your Trajectory Diffusion Model But Faster cs.RO · 2026-05-11 · unverdicted · none · ref 62 · internal anchor
Muninn accelerates diffusion trajectory planners up to 4.6x by spending an uncertainty budget to decide when to cache denoiser outputs, preserving performance and certifying bounded deviation from full computation.
Path-Coupled Bellman Flows for Distributional Reinforcement Learning cs.LG · 2026-05-07 · unverdicted · none · ref 22 · internal anchor
Path-Coupled Bellman Flows use source-consistent Bellman-coupled paths and a lambda-parameterized control-variate to learn return distributions via flow matching, improving fidelity and stability over prior DRL approaches.
Beyond Penalization: Diffusion-based Out-of-Distribution Detection and Selective Regularization in Offline Reinforcement Learning cs.LG · 2026-05-06 · unverdicted · none · ref 8 · internal anchor
DOSER detects OOD actions via diffusion-model denoising error and applies selective regularization based on predicted transitions, proving gamma-contraction with performance bounds and outperforming priors on offline RL benchmarks.
Receding-Horizon Control via Drifting Models cs.AI · 2026-04-06 · unverdicted · none · ref 21 · internal anchor
Drifting MPC produces a unique distribution over trajectories that trades off data support against optimality and enables efficient receding-horizon planning under unknown dynamics.
Steering Your Diffusion Policy with Latent Space Reinforcement Learning cs.RO · 2025-06-18 · unverdicted · none · ref 46 · internal anchor
DSRL steers pretrained diffusion policies for robotics by applying RL to their latent noise inputs, achieving sample-efficient real-world adaptation with only black-box access.
BiTrajDiff: Bidirectional Trajectory Generation with Diffusion Models for Offline Reinforcement Learning cs.LG · 2025-06-06 · conditional · none · ref 44 · internal anchor
BiTrajDiff augments offline RL datasets by running independent forward and backward diffusion processes from intermediate states, yielding higher performance than prior one-directional data-augmentation baselines on D4RL.
HITL-D: Human In The Loop Diffusion Assisted Shared Control cs.RO · 2026-05-20 · unverdicted · none · ref 16 · internal anchor
HITL-D combines diffusion policies with human input for shared robotic control, reducing required joystick axes and improving speed and workload in manipulation tasks per a 12-participant study.
Global Convergence of Sampling-Based Nonconvex Optimization through Diffusion-Style Smoothing cs.LG · 2026-05-15 · unverdicted · none · ref 211 · internal anchor
Recasts sampling-based nonconvex optimization as smoothed gradient descent to obtain non-asymptotic convergence guarantees and introduces the DIDA annealed algorithm that converges to the global optimum.
Ada-Diffuser: Latent-Aware Adaptive Diffusion for Decision-Making cs.LG · 2026-05-15 · unverdicted · none · ref 24 · internal anchor
Ada-Diffuser is a causal diffusion model that jointly learns observed interaction structure and underlying latent dynamics from minimal observations for adaptive planning and policy learning.
Revisiting Policy Gradients for Restricted Policy Classes: Escaping Myopic Local Optima with $k$-step Policy Gradients cs.LG · 2026-05-11 · unverdicted · none · ref 49 · internal anchor
The k-step policy gradient converges exponentially close to the optimal deterministic policy in restricted classes, achieving O(1/T) rates under smoothness assumptions without distribution mismatch factors.
Refining Compositional Diffusion for Reliable Long-Horizon Planning cs.RO · 2026-05-04 · unverdicted · none · ref 75 · internal anchor
RCD steers compositional diffusion sampling toward high-density coherent plans by combining reconstruction-error guidance with overlap consistency, outperforming prior methods on locomotion, manipulation, and pixel-based long-horizon tasks.
AdamO: A Collapse-Suppressed Optimizer for Offline RL cs.LG · 2026-05-03 · unverdicted · none · ref 60 · internal anchor
AdamO modifies Adam with an orthogonality correction to ensure the spectral radius of the TD update operator stays below one, providing a theoretical stability guarantee for offline RL.
FASTER: Value-Guided Sampling for Fast RL cs.LG · 2026-04-21 · unverdicted · none · ref 9 · internal anchor
FASTER models multi-candidate denoising as an MDP and trains a value function to filter actions early, delivering the performance of full sampling at lower cost in diffusion RL policies.
Accelerating trajectory optimization with Sobolev-trained diffusion policies cs.LG · 2026-04-21 · unverdicted · none · ref 38 · internal anchor
Sobolev-trained diffusion policies using trajectories and feedback gains provide warm-starts that reduce trajectory optimization solving time by 2x to 20x while avoiding compounding errors.
Fisher Decorator: Refining Flow Policy via a Local Transport Map cs.LG · 2026-04-20 · unverdicted · none · ref 13 · internal anchor
Fisher Decorator refines flow policies in offline RL via a local transport map and Fisher-matrix quadratic approximation of the KL constraint, yielding controllable error near the optimum and SOTA benchmark results.
How Does the Lagrangian Guide Safe Reinforcement Learning through Diffusion Models? cs.LG · 2026-02-02 · unverdicted · none · ref 21 · internal anchor
ALGD augments the Lagrangian to locally convexify the energy landscape in diffusion models, stabilizing safe RL training and generation without changing optimal policies.
Reinforcement Learning with Discrete Diffusion Policies for Combinatorial Action Spaces cs.LG · 2025-09-26 · unverdicted · none · ref 62 · internal anchor
A method trains discrete diffusion policies for combinatorial RL by matching to a PMD-regularized target distribution, reporting SOTA performance and sample efficiency on DNA generation, macro-action, and multi-agent benchmarks.
Real-Time Execution of Action Chunking Flow Policies cs.RO · 2025-06-09 · unverdicted · none · ref 64 · internal anchor
Real-time chunking (RTC) allows diffusion- and flow-based action chunking policies to execute smoothly and asynchronously, maintaining high success rates on dynamic tasks even with significant inference latency.
HybridVLA: Collaborative Diffusion and Autoregression in a Unified Vision-Language-Action Model cs.CV · 2025-03-13 · unverdicted · none · ref 51 · internal anchor
HybridVLA unifies diffusion and autoregression in a single VLA model via collaborative training and ensemble to raise robot manipulation success rates by 14% in simulation and 19% in real-world tasks.
Diffusion Policy Policy Optimization cs.RO · 2024-09-01 · unverdicted · none · ref 100 · internal anchor
DPPO fine-tunes diffusion policies via policy gradients and outperforms prior RL approaches for diffusion policies and PG-tuned alternatives on robot benchmarks while enabling stable training and hardware deployment.
3D Diffuser Actor: Policy Diffusion with 3D Scene Representations cs.RO · 2024-02-16 · conditional · none · ref 39 · internal anchor
3D Diffuser Actor unifies diffusion policies with 3D scene features to set new state-of-the-art results on RLBench and CALVIN robot benchmarks.
Training Diffusion Models with Reinforcement Learning cs.LG · 2023-05-22 · unverdicted · none · ref 29 · internal anchor
DDPO uses policy gradients on the denoising process to optimize diffusion models for arbitrary rewards like human feedback or compressibility.
IDQL: Implicit Q-Learning as an Actor-Critic Method with Diffusion Policies cs.LG · 2023-04-20 · conditional · none · ref 45 · internal anchor
IDQL generalizes IQL into an actor-critic framework and uses diffusion policies for robust policy extraction, outperforming prior offline RL methods.
D$^3$-Subsidy: Online and Sequential Driver Subsidy Decision-Making for Large-Scale Ride-Hailing Market cs.LG · 2026-05-19 · unverdicted · none · ref 21 · 3 links · internal anchor
D³-Subsidy is a prefix-conditioned diffusion model plus Lagrangian mapping that generates city-level subsidy plans from historical data and maps them to per-order incentives while respecting rate caps.
Driving Intents Amplify Planning-Oriented Reinforcement Learning cs.RO · 2026-05-12 · unverdicted · none · ref 20 · 2 links · internal anchor
DIAL expands continuous-action driving policies via intent-conditioned flow matching and multi-intent GRPO, lifting best-of-N preference scores above human demonstrations for the first time on WOD-E2E.
Insider Attacks in Multi-Agent LLM Consensus Systems cs.MA · 2026-05-08 · unverdicted · none · ref 95 · internal anchor
A malicious agent in multi-agent LLM consensus systems can be trained via a surrogate world model and RL to reduce consensus rates and prolong disagreement more effectively than direct prompt attacks.
Efficient Hierarchical Implicit Flow Q-learning for Offline Goal-conditioned Reinforcement Learning cs.LG · 2026-04-10 · unverdicted · none · ref 13 · internal anchor
Proposes mean flow policies and LeJEPA loss to overcome Gaussian policy limits and weak subgoal generation in hierarchical offline GCRL, reporting strong results on OGBench state and pixel tasks.
VIPO: Value Function Inconsistency Penalized Offline Reinforcement Learning cs.LG · 2025-04-16 · unverdicted · none · ref 9 · internal anchor
VIPO improves model-based offline RL by minimizing value function inconsistency between direct data estimates and model predictions, achieving SOTA results on D4RL and NeoRL benchmarks.
BalancedDPO: Adaptive Multi-Metric Alignment cs.CV · 2025-03-16 · unverdicted · none · ref 14 · internal anchor
BalancedDPO applies majority-vote consensus from multiple preference scorers and dynamic reference model updates within DPO to achieve multi-metric alignment for text-to-image diffusion models, reporting improved win rates on Pick-a-Pic, PartiPrompt, and HPD datasets across SD 1.5, 2.1, and SDXL.
