pith. machine review for the scientific record. sign in

arxiv: 2208.06193 · v3 · submitted 2022-08-12 · 💻 cs.LG · stat.ML

Recognition: 2 theorem links

· Lean Theorem

Diffusion Policies as an Expressive Policy Class for Offline Reinforcement Learning

Authors on Pith no claims yet

Pith reviewed 2026-05-15 07:48 UTC · model grok-4.3

classification 💻 cs.LG stat.ML
keywords offline reinforcement learningdiffusion modelspolicy classD4RL benchmarkDiffusion-QLaction-value functionbehavior cloninggenerative models
0
0 comments X

The pith

Diffusion models represent policies in a way that lets offline RL reach state-of-the-art on most D4RL tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Offline reinforcement learning must extract good policies from fixed datasets, yet conventional methods often fail because they cannot handle actions outside the data without large errors. This paper replaces limited policy classes with conditional diffusion models and trains them by adding a term that maximizes a learned action-value function to the usual diffusion objective. The resulting loss pushes the policy toward better actions while keeping them close to observed behavior. The diffusion model's flexibility avoids the suboptimal traps of simpler regularizers, and experiments confirm gains on a multimodal bandit task plus top scores on the bulk of D4RL benchmarks.

Core claim

Representing the policy as a conditional diffusion model and augmenting its training loss with a term that maximizes the action-value function produces a policy that selects high-value actions near the behavior policy; the combination of diffusion expressiveness and the coupled cloning-plus-improvement objective yields better solutions than prior regularization approaches that constrain policy classes.

What carries the argument

Conditional diffusion model for the policy, trained with an augmented loss that adds maximization of a learned action-value function.

If this is right

  • Outperforms prior regularization methods on a simple 2D bandit with multimodal behavior policy.
  • Achieves state-of-the-art performance on the majority of D4RL benchmark tasks.
  • Reduces the impact of function approximation errors on out-of-distribution actions through greater policy expressiveness.
  • Couples behavior cloning and policy improvement inside the same diffusion training procedure.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same diffusion-policy construction could be tested in online RL by continuing diffusion training on fresh interaction data.
  • Other high-capacity generative models might be substituted for diffusion while retaining the value-augmented loss.
  • Scaling the approach to larger, noisier real-world datasets would test whether the observed stability generalizes beyond current benchmarks.

Load-bearing premise

The added action-value term in the diffusion training loss reliably produces policy improvement without destabilizing the generative model or causing mode collapse on real datasets.

What would settle it

Run Diffusion-QL on the complete D4RL suite and check whether the method fails to match or beat the best prior scores or shows clear mode collapse in sampled action distributions.

read the original abstract

Offline reinforcement learning (RL), which aims to learn an optimal policy using a previously collected static dataset, is an important paradigm of RL. Standard RL methods often perform poorly in this regime due to the function approximation errors on out-of-distribution actions. While a variety of regularization methods have been proposed to mitigate this issue, they are often constrained by policy classes with limited expressiveness that can lead to highly suboptimal solutions. In this paper, we propose representing the policy as a diffusion model, a recent class of highly-expressive deep generative models. We introduce Diffusion Q-learning (Diffusion-QL) that utilizes a conditional diffusion model to represent the policy. In our approach, we learn an action-value function and we add a term maximizing action-values into the training loss of the conditional diffusion model, which results in a loss that seeks optimal actions that are near the behavior policy. We show the expressiveness of the diffusion model-based policy, and the coupling of the behavior cloning and policy improvement under the diffusion model both contribute to the outstanding performance of Diffusion-QL. We illustrate the superiority of our method compared to prior works in a simple 2D bandit example with a multimodal behavior policy. We then show that our method can achieve state-of-the-art performance on the majority of the D4RL benchmark tasks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes Diffusion-QL for offline RL, representing the policy as a conditional diffusion model. An action-value function is learned and a maximization term is added to the diffusion training loss, producing a combined objective that seeks optimal actions near the behavior policy. The authors highlight the expressiveness of diffusion policies and their coupling of behavior cloning with improvement; they demonstrate superiority over prior methods on a 2D multimodal bandit toy task and report state-of-the-art results on the majority of D4RL benchmark tasks.

Significance. If the central empirical claims hold under proper controls, the work would be significant for establishing diffusion models as a highly expressive policy class in offline RL. It provides a concrete mechanism to combine behavior cloning and policy improvement within a single generative training objective, addressing expressiveness limitations of prior policy classes. The 2D bandit illustration offers a clear qualitative demonstration of multimodal handling.

major comments (2)
  1. [§3] §3 (method): The description of the Q-augmented diffusion loss states only that it 'seeks optimal actions that are near the behavior policy' without specifying the weighting coefficient between the standard diffusion objective and the action-value term, any scheduling during training, clipping, or regularization on the Q contribution. This weighting is load-bearing for the central claim; without it, the reported gains cannot be isolated from hyperparameter tuning or reduced to behavior cloning.
  2. [§4.2] §4.2 (D4RL experiments): The claim of state-of-the-art performance on the majority of tasks is presented without statistical significance (means and standard deviations over multiple random seeds), ablation studies that isolate the Q-augmentation term from the base diffusion policy, or sensitivity analysis to the loss weighting and other hyperparameters. These omissions leave the support for superiority moderate, as noted in the soundness assessment.
minor comments (1)
  1. [Abstract] Abstract: The phrase 'outstanding performance' is qualitative; consider replacing or supplementing it with a brief quantitative summary of the D4RL improvements.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and will revise the manuscript to provide the requested clarifications and additional experimental details.

read point-by-point responses
  1. Referee: [§3] §3 (method): The description of the Q-augmented diffusion loss states only that it 'seeks optimal actions that are near the behavior policy' without specifying the weighting coefficient between the standard diffusion objective and the action-value term, any scheduling during training, clipping, or regularization on the Q contribution. This weighting is load-bearing for the central claim; without it, the reported gains cannot be isolated from hyperparameter tuning or reduced to behavior cloning.

    Authors: We agree that explicit details on the weighting are necessary. The full manuscript (Equation 3 in §3) defines the objective as the standard diffusion loss plus λ ⋅ E[-Q(s, a)], with λ fixed at 1.0 for all reported experiments. No scheduling, clipping, or extra regularization on the Q term is used, because the diffusion denoising process itself provides the necessary regularization toward the behavior distribution. We will expand §3 with a dedicated paragraph stating the exact formulation, the fixed value of λ, and the rationale for omitting additional controls. revision: yes

  2. Referee: [§4.2] §4.2 (D4RL experiments): The claim of state-of-the-art performance on the majority of tasks is presented without statistical significance (means and standard deviations over multiple random seeds), ablation studies that isolate the Q-augmentation term from the base diffusion policy, or sensitivity analysis to the loss weighting and other hyperparameters. These omissions leave the support for superiority moderate, as noted in the soundness assessment.

    Authors: We acknowledge that statistical reporting and ablations strengthen the claims. All D4RL results were obtained with 5 independent random seeds; we will update the tables in §4.2 to report both mean and standard deviation. We will also add an ablation comparing the full Diffusion-QL objective (λ = 1) against the base diffusion policy (λ = 0) to isolate the Q-augmentation effect, and include a short sensitivity study on λ in the appendix. These revisions will be incorporated in the next version. revision: yes

Circularity Check

0 steps flagged

No significant circularity; new Q-augmented diffusion loss is independently defined

full rationale

The derivation introduces an explicit new loss term that augments the conditional diffusion objective with action-value maximization from a separately learned Q-function. This does not reduce to a quantity defined by previously fitted diffusion parameters, nor does it rely on self-citation chains or uniqueness theorems from the authors' prior work for its justification. Benchmark claims are supported by external D4RL comparisons rather than internal redefinitions. The central method remains self-contained against external baselines.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The approach rests on the domain assumption that diffusion models can represent multimodal action distributions and that the combined loss remains stable during training.

axioms (1)
  • domain assumption Conditional diffusion models can faithfully represent the behavior policy distribution while allowing controlled deviation toward higher-value actions.
    Invoked to justify the policy class choice and the added loss term.

pith-pipeline@v0.9.0 · 5531 in / 1136 out tokens · 22739 ms · 2026-05-15T07:48:52.471767+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Forward citations

Cited by 21 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Diffusion Policy: Visuomotor Policy Learning via Action Diffusion

    cs.RO 2023-03 accept novelty 8.0

    Diffusion Policy models robot actions as a conditional diffusion process, outperforming prior state-of-the-art methods by 46.9% on average across 12 manipulation tasks from four benchmarks.

  2. JEDI: Joint Embedding Diffusion World Model for Online Model-Based Reinforcement Learning

    cs.LG 2026-05 unverdicted novelty 7.0

    JEDI is the first online end-to-end latent diffusion world model that trains latents from denoising loss rather than reconstruction, achieving competitive Atari100k results with 43% less VRAM and over 3x faster sampli...

  3. Aligning Flow Map Policies with Optimal Q-Guidance

    cs.LG 2026-05 unverdicted novelty 7.0

    Flow map policies enable fast one-step inference for flow-based RL policies, and FMQ provides an optimal closed-form Q-guided target for offline-to-online adaptation under trust-region constraints, achieving SOTA performance.

  4. Block-R1: Rethinking the Role of Block Size in Multi-domain Reinforcement Learning for Diffusion Large Language Models

    cs.LG 2026-05 unverdicted novelty 7.0

    Introduces Block-R1 benchmark, Block-R1-41K dataset, and a conflict score to handle domain-specific optimal block sizes in RL post-training of diffusion LLMs.

  5. Block-R1: Rethinking the Role of Block Size in Multi-domain Reinforcement Learning for Diffusion Large Language Models

    cs.LG 2026-05 unverdicted novelty 7.0

    Block-R1 formulates domain block size conflicts in multi-domain RL for dLLMs, releases a 41K-sample dataset with per-sample best block sizes and a conflict score, and provides a benchmark plus simple cross-domain trai...

  6. Muninn: Your Trajectory Diffusion Model But Faster

    cs.RO 2026-05 unverdicted novelty 7.0

    Muninn accelerates diffusion trajectory planners up to 4.6x by spending an uncertainty budget to decide when to cache denoiser outputs, preserving performance and certifying bounded deviation from full computation.

  7. Path-Coupled Bellman Flows for Distributional Reinforcement Learning

    cs.LG 2026-05 unverdicted novelty 7.0

    Path-Coupled Bellman Flows use source-consistent Bellman-coupled paths and a lambda-parameterized control-variate to learn return distributions via flow matching, improving fidelity and stability over prior DRL approaches.

  8. Beyond Penalization: Diffusion-based Out-of-Distribution Detection and Selective Regularization in Offline Reinforcement Learning

    cs.LG 2026-05 unverdicted novelty 7.0

    DOSER detects OOD actions via diffusion-model denoising error and applies selective regularization based on predicted transitions, proving gamma-contraction with performance bounds and outperforming priors on offline ...

  9. Receding-Horizon Control via Drifting Models

    cs.AI 2026-04 unverdicted novelty 7.0

    Drifting MPC produces a unique distribution over trajectories that trades off data support against optimality and enables efficient receding-horizon planning under unknown dynamics.

  10. Driving Intents Amplify Planning-Oriented Reinforcement Learning

    cs.RO 2026-05 unverdicted novelty 6.0

    DIAL uses intent-conditioned CFG and multi-intent GRPO to expand and preserve diverse modes in continuous-action preference RL, lifting RFS to 9.14 and surpassing both prior best (8.5) and human demonstration (8.13).

  11. Revisiting Policy Gradients for Restricted Policy Classes: Escaping Myopic Local Optima with $k$-step Policy Gradients

    cs.LG 2026-05 unverdicted novelty 6.0

    The k-step policy gradient converges exponentially close to the optimal deterministic policy in restricted classes, achieving O(1/T) rates under smoothness assumptions without distribution mismatch factors.

  12. Refining Compositional Diffusion for Reliable Long-Horizon Planning

    cs.RO 2026-05 unverdicted novelty 6.0

    RCD steers compositional diffusion sampling toward high-density coherent plans by combining reconstruction-error guidance with overlap consistency, outperforming prior methods on locomotion, manipulation, and pixel-ba...

  13. AdamO: A Collapse-Suppressed Optimizer for Offline RL

    cs.LG 2026-05 unverdicted novelty 6.0

    AdamO modifies Adam with an orthogonality correction to ensure the spectral radius of the TD update operator stays below one, providing a theoretical stability guarantee for offline RL.

  14. FASTER: Value-Guided Sampling for Fast RL

    cs.LG 2026-04 unverdicted novelty 6.0

    FASTER models multi-candidate denoising as an MDP and trains a value function to filter actions early, delivering the performance of full sampling at lower cost in diffusion RL policies.

  15. Accelerating trajectory optimization with Sobolev-trained diffusion policies

    cs.LG 2026-04 unverdicted novelty 6.0

    Sobolev-trained diffusion policies using trajectories and feedback gains provide warm-starts that reduce trajectory optimization solving time by 2x to 20x while avoiding compounding errors.

  16. Fisher Decorator: Refining Flow Policy via a Local Transport Map

    cs.LG 2026-04 unverdicted novelty 6.0

    Fisher Decorator refines flow policies in offline RL via a local transport map and Fisher-matrix quadratic approximation of the KL constraint, yielding controllable error near the optimum and SOTA benchmark results.

  17. Training Diffusion Models with Reinforcement Learning

    cs.LG 2023-05 unverdicted novelty 6.0

    DDPO uses policy gradients on the denoising process to optimize diffusion models for arbitrary rewards like human feedback or compressibility.

  18. IDQL: Implicit Q-Learning as an Actor-Critic Method with Diffusion Policies

    cs.LG 2023-04 conditional novelty 6.0

    IDQL generalizes IQL into an actor-critic framework and uses diffusion policies for robust policy extraction, outperforming prior offline RL methods.

  19. Driving Intents Amplify Planning-Oriented Reinforcement Learning

    cs.RO 2026-05 unverdicted novelty 5.0

    DIAL expands continuous-action driving policies via intent-conditioned flow matching and multi-intent GRPO, lifting best-of-N preference scores above human demonstrations for the first time on WOD-E2E.

  20. Insider Attacks in Multi-Agent LLM Consensus Systems

    cs.MA 2026-05 unverdicted novelty 5.0

    A malicious agent in multi-agent LLM consensus systems can be trained via a surrogate world model and RL to reduce consensus rates and prolong disagreement more effectively than direct prompt attacks.

  21. Efficient Hierarchical Implicit Flow Q-learning for Offline Goal-conditioned Reinforcement Learning

    cs.LG 2026-04 unverdicted novelty 5.0

    Proposes mean flow policies and LeJEPA loss to overcome Gaussian policy limits and weak subgoal generation in hierarchical offline GCRL, reporting strong results on OGBench state and pixel tasks.

Reference graph

Works this paper leans on

29 extracted references · 29 canonical work pages · cited by 19 Pith papers · 11 internal anchors

  1. [1]

    Is conditional generative model- ing all you need for decision-making?arXiv preprint arXiv:2211.15657,

    Anurag Ajay, Yilun Du, Abhi Gupta, Joshua Tenenbaum, Tommi Jaakkola, and Pulkit Agrawal. Is conditional generative modeling all you need for decision-making? arXiv preprint arXiv:2211.15657,

  2. [2]

    D4RL: Datasets for Deep Data-Driven Reinforcement Learning

    Justin Fu, Aviral Kumar, Ofir Nachum, George Tucker, and Sergey Levine. D4RL: Datasets for deep data-driven reinforcement learning. arXiv preprint arXiv:2004.07219,

  3. [3]

    Off-policy deep reinforcement learning without exploration

    Scott Fujimoto, David Meger, and Doina Precup. Off-policy deep reinforcement learning without exploration. In International Conference on Machine Learning , pp. 2052–2062. PMLR,

  4. [4]

    Know your boundaries: The necessity of explicit behavioral cloning in offline rl

    Wonjoon Goo and Scott Niekum. Know your boundaries: The necessity of explicit behavioral cloning in offline rl. arXiv preprint arXiv:2206.00695,

  5. [5]

    Planning with Diffusion for Flexible Behavior Synthesis

    Michael Janner, Yilun Du, Joshua B Tenenbaum, and Sergey Levine. Planning with diffusion for flexible behavior synthesis. arXiv preprint arXiv:2205.09991,

  6. [6]

    Adam: A Method for Stochastic Optimization

    Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980,

  7. [7]

    Offline Reinforcement Learning with Implicit Q-Learning

    10 Published as a conference paper at ICLR 2023 Ilya Kostrikov, Rob Fergus, Jonathan Tompson, and Ofir Nachum. Offline reinforcement learning with fisher divergence critic regularization. In International Conference on Machine Learning , pp. 5774–5783. PMLR, 2021a. Ilya Kostrikov, Ashvin Nair, and Sergey Levine. Offline reinforcement learning with implici...

  8. [8]

    Continuous control with deep reinforcement learning

    Timothy P Lillicrap, Jonathan J Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971,

  9. [9]

    Dpm-solver: A fast ode solver for diffusion probabilistic model sampling in around 10 steps

    Cheng Lu, Yuhao Zhou, Fan Bao, Jianfei Chen, Chongxuan Li, and Jun Zhu. Dpm-solver: A fast ode solver for diffusion probabilistic model sampling in around 10 steps. arXiv preprint arXiv:2206.00927,

  10. [10]

    Mildly conservative Q-learning for offline reinforcement learning

    Jiafei Lyu, Xiaoteng Ma, Xiu Li, and Zongqing Lu. Mildly conservative Q-learning for offline reinforcement learning. arXiv preprint arXiv:2206.04745,

  11. [11]

    AWAC: Accelerating Online Reinforcement Learning with Offline Datasets

    Ashvin Nair, Abhishek Gupta, Murtaza Dalal, and Sergey Levine. Awac: Accelerating online rein- forcement learning with offline datasets. arXiv preprint arXiv:2006.09359,

  12. [12]

    Imitating human behaviour with diffusion models

    Tim Pearce, Tabish Rashid, Anssi Kanervisto, Dave Bignell, Mingfei Sun, Raluca Georgescu, Ser- gio Valcarcel Macua, Shan Zheng Tan, Ida Momennejad, Katja Hofmann, et al. Imitating human behaviour with diffusion models. arXiv preprint arXiv:2301.10677,

  13. [13]

    Advantage-Weighted Regression: Simple and Scalable Off-Policy Reinforcement Learning

    Xue Bin Peng, Aviral Kumar, Grace Zhang, and Sergey Levine. Advantage-weighted regression: Simple and scalable off-policy reinforcement learning. arXiv preprint arXiv:1910.00177,

  14. [14]

    Progressive Distillation for Fast Sampling of Diffusion Models

    Tim Salimans and Jonathan Ho. Progressive distillation for fast sampling of diffusion models.arXiv preprint arXiv:2202.00512,

  15. [15]

    Behavior transformers: Cloning k modes with one stone

    Nur Muhammad Mahi Shafiullah, Zichen Jeff Cui, Ariuntuya Altanzaya, and Lerrel Pinto. Behavior transformers: Cloning k modes with one stone. arXiv preprint arXiv:2206.11251,

  16. [16]

    Deep Unsupervised Learning using Nonequilibrium Thermodynamics

    Jascha Sohl-Dickstein, Eric A. Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsuper- vised learning using nonequilibrium thermodynamics. ArXiv, abs/1503.03585,

  17. [17]

    Denoising Diffusion Implicit Models

    Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502,

  18. [18]

    11 Published as a conference paper at ICLR 2023 Richard S Sutton and Andrew G Barto

    URL https://openreview.net/ forum?id=PxTIG12RRHS. 11 Published as a conference paper at ICLR 2023 Richard S Sutton and Andrew G Barto. Reinforcement learning: An introduction. MIT press,

  19. [19]

    Diffusion- GAN: Training gans with diffusion

    Zhendong Wang, Huangjie Zheng, Pengcheng He, Weizhu Chen, and Mingyuan Zhou. Diffusion- GAN: Training gans with diffusion. arXiv preprint arXiv:2206.02262,

  20. [20]

    Behavior Regularized Offline Reinforcement Learning

    Yifan Wu, George Tucker, and Ofir Nachum. Behavior regularized offline reinforcement learning. arXiv preprint arXiv:1911.11361,

  21. [21]

    Tackling the generative learning trilemma with denoising diffusion gans

    Zhisheng Xiao, Karsten Kreis, and Arash Vahdat. Tackling the generative learning trilemma with denoising diffusion gans. arXiv preprint arXiv:2112.07804,

  22. [22]

    Truncated diffusion proba- bilistic models and diffusion-based adversarial auto-encoders

    Huangjie Zheng, Pengcheng He, Weizhu Chen, and Mingyuan Zhou. Truncated diffusion proba- bilistic models and diffusion-based adversarial auto-encoders. arXiv preprint arXiv:2202.09671,

  23. [23]

    Actions are again in a real-valued 2D space, a ∈ [−1, 1]2

    12 Published as a conference paper at ICLR 2023 Appendix A M ORE TOY EXPERIMENTS Here we describe an additional toy experiment on a bandit task. Actions are again in a real-valued 2D space, a ∈ [−1, 1]2. The offline data D = {(ai)}10000 i=1 are col- lected by sampling actions equally from four Gaussian distributions with centers µ ∈ {(−0.8, 0.8), (0.8, 0....

  24. [24]

    For behavior-cloning experiments, we observe that only our diffusion model could recover the orig- inal data distribution while the prior regularization methods fail in some way

    The only difference in this experiment is that the samples are now in the corners of the ation space. For behavior-cloning experiments, we observe that only our diffusion model could recover the orig- inal data distribution while the prior regularization methods fail in some way. For example, CV AE could only capture the two diagonal modes and place densi...

  25. [25]

    C E XPERIMENTAL DETAILS We train our algorithm with 2000 epochs for Gym tasks and 1000 epochs for the other tasks, where each epoch consists of 1000 gradient steps

    optimizer for the training of both Diffusion policy and Q networks. C E XPERIMENTAL DETAILS We train our algorithm with 2000 epochs for Gym tasks and 1000 epochs for the other tasks, where each epoch consists of 1000 gradient steps. For the Gym locomotion tasks, we average mean returns over 6 independently trained models and 10 trajectories per mode. For ...

  26. [26]

    for the AntMaze datasets. D O FFLINE MODEL SELECTION For reducing the training cost and picking the best model during training without any interaction with the real environment, we provide a way to properly conduct early stopping for Diffusion- QL. Empirically, we found that Ld loss is a lagging indicator of online performance. Note Ld is the behavior clo...

  27. [27]

    14 Published as a conference paper at ICLR 2023 Table 3: Hyperparameter settings of all selected tasks. Tasks learning rate η max Q backup halfcheetah-medium-v2 3×10−4 1.0 False hopper-medium-v2 3×10−4 1.0 False walker2d-medium-v2 3×10−4 1.0 False halfcheetah-medium-replay-v23×10−4 1.0 False hopper-medium-replay-v2 3×10−4 1.0 False walker2d-medium-replay-...

  28. [28]

    We have shown this results in learning better policies in offlineRL

    G L IMITATIONS AND FUTURE WORK Diffusion policies are highly expressive and hence they can capture multi-modal distributions well. We have shown this results in learning better policies in offlineRL. However, at the inference time, the reverse sampling defined in Equation (1) requires iteratively computingϵθ networks N times, and this can become a bottlen...

  29. [29]

    by replacing the original deterministic actor with a mixture density network (Bishop, 1994), where each mixture component is a Gaussian. Since a Gaussian mixture policy is applied, we replaced minimizing the L2 loss (from TD3+BC) between predicted actions and real actions, with maximizing the likelihood estimate of Gaussian mixtures on real state-action p...