SD2AIL: Adversarial Imitation Learning from Synthetic Demonstrations via Diffusion Models

Pengcheng Li; Qiang Fang; Tong Zhao; Xin Xu; Yixing Lan

arxiv: 2512.18583 · v2 · submitted 2025-12-21 · 💻 cs.LG · cs.RO

Pith reviewed 2026-05-16 20:55 UTC · model grok-4.3

keywords adversarial imitation learning · diffusion models · synthetic demonstrations · prioritized replay · imitation learning · reinforcement learning

The pith

Diffusion models generate synthetic expert trajectories that augment limited demonstrations and improve adversarial imitation learning performance.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes SD2AIL, which trains a diffusion model on scarce expert demonstrations to produce additional synthetic trajectories. These synthetics are supplied to the adversarial discriminator as pseudo-positive examples, expanding the effective expert distribution used to shape the reward signal for the policy. A prioritized expert demonstration replay strategy then selects the most useful samples from the combined real and synthetic pool during training. Experiments across simulation benchmarks show improved returns and stability, with the Hopper task reaching an average return of 3441, 89 points above the prior state-of-the-art. The approach therefore targets the data bottleneck in imitation learning by leveraging generative models instead of collecting more real demonstrations.

Core claim

Embedding a diffusion model inside the discriminator allows the generation of synthetic demonstrations whose distribution is close enough to expert behavior to serve as useful positive examples, while the prioritized replay mechanism selects high-value samples from the enlarged pool to guide more effective policy optimization.

What carries the argument

A diffusion model trained on expert state-action pairs that produces pseudo-expert trajectories fed directly into the discriminator, paired with prioritized expert demonstration replay (PEDR) that ranks and replays the most informative samples.
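The selection step can be pictured with a minimal sketch of proportional prioritized sampling over a mixed pool of real and synthetic samples. The priority rule used here (priority raised to an exponent `alpha`, in the style of prioritized experience replay) is an assumption for illustration only; the actual PEDR criterion is not specified in the material on this page. The names `Demo`, `sampling_probs`, and `sample_prioritized` are hypothetical.

```python
import random
from dataclasses import dataclass


@dataclass
class Demo:
    state: tuple
    action: tuple
    is_synthetic: bool  # True for diffusion-generated pseudo-expert samples
    priority: float     # hypothetical usefulness score; its definition is an assumption


def sampling_probs(pool, alpha=0.6):
    """Proportional prioritization: p_i = priority_i**alpha / sum_j priority_j**alpha."""
    scaled = [d.priority ** alpha for d in pool]
    total = sum(scaled)
    return [s / total for s in scaled]


def sample_prioritized(pool, batch_size, alpha=0.6, rng=None):
    """Draw a batch (with replacement) from the mixed real + synthetic pool."""
    rng = rng or random.Random()
    return rng.choices(pool, weights=sampling_probs(pool, alpha), k=batch_size)


# Toy pool: one real sample and two synthetic ones with made-up priorities.
pool = [
    Demo((0.0,), (0.1,), is_synthetic=False, priority=1.0),
    Demo((0.2,), (0.3,), is_synthetic=True, priority=4.0),
    Demo((0.4,), (0.5,), is_synthetic=True, priority=0.25),
]
batch = sample_prioritized(pool, batch_size=8, rng=random.Random(0))
```

Setting `alpha=0` recovers uniform sampling, which makes the same function usable as a no-prioritization baseline.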

If this is right

Imitation learning agents achieve higher returns without collecting additional real expert trajectories.
Adversarial training becomes more stable when the effective expert set is expanded by generative augmentation.
Prioritized selection from the mixed real-synthetic pool reduces the impact of low-quality samples on the learned reward.
The method scales to tasks where expert data collection is costly or unsafe.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Similar diffusion-based augmentation could extend to other imitation frameworks that rely on expert data matching.
The approach suggests testing whether diffusion models trained in simulation transfer to generating useful synthetics for physical robot tasks.
Ablations on the prioritization criterion would clarify how much of the gain comes from selection versus generation alone.

Load-bearing premise

The diffusion model trained on limited real demonstrations must produce synthetic trajectories close enough to actual expert behavior that the discriminator treats them as informative positives rather than noise.

What would settle it

Training the policy with the synthetic-augmented set yields equal or lower average returns than training with real demonstrations alone, measured across multiple seeds in the same environment.
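The comparison described above could be run as a two-sample test on per-seed returns. The sketch below computes Welch's t statistic and its approximate degrees of freedom with the standard library only. The return values are placeholders chosen for easy arithmetic, not results from the paper, and the function name `welch_t` is hypothetical.

```python
import math
import statistics


def welch_t(a, b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    mean_a, mean_b = statistics.mean(a), statistics.mean(b)
    var_a, var_b = statistics.variance(a), statistics.variance(b)  # sample variances
    se_a, se_b = var_a / len(a), var_b / len(b)
    t = (mean_a - mean_b) / math.sqrt(se_a + se_b)
    df = (se_a + se_b) ** 2 / (
        se_a ** 2 / (len(a) - 1) + se_b ** 2 / (len(b) - 1)
    )
    return t, df


# Placeholder per-seed returns, NOT numbers from the paper.
augmented = [10.0, 12.0, 11.0, 13.0, 14.0]
real_only = [9.0, 10.0, 8.0, 11.0, 12.0]
t_stat, dof = welch_t(augmented, real_only)
```

A t statistic near zero or negative across environments would point toward the outcome that counts against the method; turning the statistic into a p-value needs a t-distribution CDF, which is left out here to keep the sketch dependency-free.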

read the original abstract

Adversarial Imitation Learning (AIL) is a dominant framework in imitation learning that infers rewards from expert demonstrations to guide policy optimization. Although providing more expert demonstrations typically leads to improved performance and greater stability, collecting such demonstrations can be challenging in certain scenarios. Inspired by the success of diffusion models in data generation, we propose SD2AIL, which utilizes synthetic demonstrations via diffusion models. We first employ a diffusion model in the discriminator to generate synthetic demonstrations as pseudo-expert data that augment the expert demonstrations. To selectively replay the most valuable demonstrations from the large pool of (pseudo-) expert demonstrations, we further introduce a prioritized expert demonstration replay strategy (PEDR). The experimental results on simulation tasks demonstrate the effectiveness and robustness of our method. In particular, in the Hopper task, our method achieves an average return of 3441, surpassing the state-of-the-art method by 89. Our code will be available at https://github.com/positron-lpc/SD2AIL.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

SD2AIL gets a solid Hopper score by putting diffusion inside the AIL discriminator and adding prioritized replay, but needs better checks on whether the synthetic demos actually help.

The headline result is a 3441 return on Hopper that beats the prior best by 89, achieved by training a diffusion model on limited expert demos to generate synthetic trajectories and then feeding those into the AIL discriminator with a prioritized replay buffer called PEDR. What is new is the specific placement of the diffusion model inside the discriminator rather than using it separately for data augmentation, plus the PEDR rule for picking valuable pseudo-experts from the expanded pool. This is a clean technical step on top of standard adversarial imitation learning.

The paper does well on the practical side. Collecting expert demos is often the bottleneck in robotics, so any method that makes better use of few real ones is worth looking at. The simulation results on standard tasks suggest the method is stable enough to run without heavy tuning.

The soft spot is that we do not see direct evidence the synthetic trajectories sit close to the expert distribution. There are no reported distribution distances, no sample trajectory comparisons, and no ablation that disables the diffusion generator while keeping PEDR to test whether the replay alone explains the gain. If the generated data drifts, the discriminator could treat it as noise and the reported improvement might not hold up.

This work is for people already doing AIL or diffusion-based RL who want to stretch small expert datasets further. A reader focused on empirical methods in continuous control would find the architecture and the Hopper number useful to build on.

It deserves a serious referee. The central claim is testable and the method is described clearly enough to reproduce, even if the current version needs more diagnostics on the synthetic data quality. Recommendation: Send it to peer review.

Referee Report

3 major / 2 minor

Summary. The paper proposes SD2AIL, an extension of adversarial imitation learning that trains a diffusion model on limited expert demonstrations to generate synthetic trajectories as pseudo-expert data, augments the expert set with these samples, and applies a prioritized expert demonstration replay (PEDR) strategy to selectively replay high-value demonstrations during discriminator training. It evaluates the method on standard MuJoCo tasks and reports concrete performance gains, including an average return of 3441 on Hopper that exceeds the prior state-of-the-art by 89.

Significance. If the reported gains hold under rigorous verification, the work would provide a practical way to mitigate data scarcity in AIL by leveraging diffusion models for trajectory augmentation, with the PEDR component offering a general mechanism for handling large demonstration pools. The concrete benchmark numbers and open-source code commitment are positive elements.

major comments (3)

[§4] §4 (Experimental Results), Hopper row: the headline claim of 3441 average return (+89 over SOTA) is presented without the number of random seeds, standard deviations, or statistical significance tests, which is required to establish that the improvement is robust rather than sensitive to hyperparameter choices or training stochasticity.
[§3.2] §3.2 (Synthetic Demonstration Generation): no distributional distance metrics (MMD, Wasserstein, or state-action occupancy divergence) are reported between the diffusion-generated trajectories and the real expert demonstrations, leaving the central assumption—that the synthetics lie sufficiently close to the expert distribution for the discriminator to treat them as useful positives—unverified.
[§4.3] §4.3 (Ablation Studies): the manuscript contains no ablation that isolates the diffusion-based augmentation from the PEDR replay strategy, so it is impossible to determine whether the observed gains derive from the synthetic data, the prioritization mechanism, or their interaction.
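The isolation requested in the third major comment amounts to a 2 x 2 design over the two components. The sketch below only enumerates the four configurations; the configuration names and dictionary keys are hypothetical and do not come from the paper or its code.

```python
from itertools import product


def ablation_grid():
    """Enumerate the 2 x 2 design: diffusion augmentation on/off, PEDR on/off."""
    names = {
        (False, False): "baseline AIL",
        (True, False): "diffusion augmentation only",
        (False, True): "PEDR only",
        (True, True): "full SD2AIL",
    }
    return [
        {
            "name": names[(use_diffusion, use_pedr)],
            "use_diffusion": use_diffusion,
            "use_pedr": use_pedr,
        }
        for use_diffusion, use_pedr in product([False, True], repeat=2)
    ]


for cfg in ablation_grid():
    print(cfg)
```

In the "PEDR only" cell, prioritization would operate on the real expert demonstrations alone, which is the condition that shows whether replay by itself accounts for the gain.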

minor comments (2)

[Abstract] Abstract: the phrasing 'employ a diffusion model in the discriminator to generate' is ambiguous; clarify whether the diffusion model is trained jointly or separately from the discriminator.
[§2] §2 (Related Work): the discussion of diffusion models in RL could usefully cite more recent works on trajectory generation for imitation learning to better situate the contribution.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback and the recommendation for major revision. We address each major comment below and will revise the manuscript to improve experimental rigor, data quality verification, and component isolation as suggested.

Referee: [§4] §4 (Experimental Results), Hopper row: the headline claim of 3441 average return (+89 over SOTA) is presented without the number of random seeds, standard deviations, or statistical significance tests, which is required to establish that the improvement is robust rather than sensitive to hyperparameter choices or training stochasticity.

Authors: We agree that reporting the number of random seeds, standard deviations, and statistical significance tests is essential to substantiate the robustness of the reported gains. We will revise §4 to include these details for all tasks, with particular attention to the Hopper results, along with appropriate statistical tests comparing against the prior state-of-the-art.
Revision: yes

Referee: [§3.2] §3.2 (Synthetic Demonstration Generation): no distributional distance metrics (MMD, Wasserstein, or state-action occupancy divergence) are reported between the diffusion-generated trajectories and the real expert demonstrations, leaving the central assumption, that the synthetics lie sufficiently close to the expert distribution for the discriminator to treat them as useful positives, unverified.

Authors: We acknowledge that explicit distributional distance metrics would provide stronger verification of the synthetic demonstrations' quality. In the revised manuscript, we will add computations and reporting of metrics such as Maximum Mean Discrepancy (MMD) and Wasserstein distance between the state-action distributions of the diffusion-generated trajectories and the real expert demonstrations in §3.2.
Revision: yes

Referee: [§4.3] §4.3 (Ablation Studies): the manuscript contains no ablation that isolates the diffusion-based augmentation from the PEDR replay strategy, so it is impossible to determine whether the observed gains derive from the synthetic data, the prioritization mechanism, or their interaction.

Authors: We agree that a more explicit isolation of the two components would clarify their individual and combined contributions. We will expand the ablation studies in the revised §4.3 to include separate evaluations of diffusion-based augmentation without PEDR and PEDR without diffusion-based augmentation, in addition to the full combination.
Revision: yes
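The MMD diagnostic mentioned in the second response could look like the following minimal sketch: a biased (V-statistic) estimate of squared MMD with an RBF kernel over state-action vectors. The kernel choice, the bandwidth, and the toy data are assumptions for illustration; nothing here reflects how the authors would compute it.

```python
import math


def rbf(x, y, bandwidth=1.0):
    """Gaussian RBF kernel between two equal-length vectors."""
    sq_dist = sum((xi - yi) ** 2 for xi, yi in zip(x, y))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))


def mmd2_biased(xs, ys, bandwidth=1.0):
    """Biased estimate of squared MMD between samples xs and ys."""
    k_xx = sum(rbf(a, b, bandwidth) for a in xs for b in xs) / (len(xs) ** 2)
    k_yy = sum(rbf(a, b, bandwidth) for a in ys for b in ys) / (len(ys) ** 2)
    k_xy = sum(rbf(a, b, bandwidth) for a in xs for b in ys) / (len(xs) * len(ys))
    return k_xx + k_yy - 2.0 * k_xy


# Toy state-action vectors, NOT data from the paper.
real = [(0.0, 0.0), (1.0, 0.5), (0.5, 1.0)]
near = [(x + 0.05, y + 0.05) for x, y in real]  # synthetics close to the experts
far = [(x + 3.0, y + 3.0) for x, y in real]     # synthetics that have drifted

print(mmd2_biased(real, near), mmd2_biased(real, far))
```

The estimate is zero when both samples are identical and grows as the synthetic sample drifts, which is the behavior a reader would want to see reported alongside the returns.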

Circularity Check

0 steps flagged

No circularity; empirical method with independent experimental validation

The paper presents an algorithmic proposal (diffusion-based synthetic demo generation + PEDR replay inside AIL) whose performance claims rest on direct empirical comparisons in simulation environments. No derivation chain exists that reduces a claimed result to a fitted quantity defined by the same data, a self-citation load-bearing uniqueness theorem, or an ansatz smuggled via prior work. The reported Hopper return of 3441 is an observed outcome, not a quantity forced by construction from the method's own inputs. The central argument is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

2 free parameters · 0 axioms · 0 invented entities

The method introduces a diffusion model whose training objective and sampling procedure are treated as standard, plus a priority function whose exact form is not specified in the abstract. No new physical entities or unstated mathematical axioms are introduced.

free parameters (2)

diffusion model hyperparameters
Number of diffusion steps, noise schedule, and network architecture are chosen and fitted to the expert data distribution.
PEDR priority parameters
The ranking or weighting rule that decides which synthetic demonstrations are replayed most often is a tunable component.
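To make the first free parameter concrete, here is a minimal sketch of a linear noise schedule of the kind introduced with denoising diffusion probabilistic models. The step count and the beta range are the commonly cited DDPM defaults and are assumptions here; the values SD2AIL actually uses are not given on this page.

```python
def linear_betas(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced noise variances beta_1..beta_T."""
    if num_steps == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def alpha_bars(betas):
    """Cumulative products alpha_bar_t = prod_{s<=t} (1 - beta_s)."""
    out, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        out.append(running)
    return out


betas = linear_betas(1000)  # assumed T = 1000
bars = alpha_bars(betas)    # signal fraction remaining after each forward step
```

Changing `num_steps` or the beta range changes how quickly the expert samples are noised, which is why these count as tunable parameters in the ledger.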

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear
Relation between the paper passage and the cited Recognition theorem.

We first employ a diffusion model in the discriminator to generate synthetic demonstrations as pseudo-expert data... prioritized expert demonstration replay strategy (PEDR)... Dϕ(si, ai, ϵ) = 1/T Σ exp(−Lϕ...) and Rϕ = −log(1−Dϕ)
IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear
Relation between the paper passage and the cited Recognition theorem.

The optimization objective is then modified to: min max E[log Dϕ] over πpe + πe + πθ
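The discriminator and reward expressions in the first quoted passage above can be traced numerically with a small sketch. The per-step loss L_ϕ is elided in the quote, so it is treated here as an opaque non-negative number supplied by the caller; the function names and the clamping constant `eps` are hypothetical additions for numerical safety.

```python
import math


def discriminator_from_losses(step_losses):
    """D = (1/T) * sum_t exp(-L_t); with every L_t >= 0 the result lies in (0, 1]."""
    return sum(math.exp(-loss) for loss in step_losses) / len(step_losses)


def reward_from_discriminator(d, eps=1e-8):
    """R = -log(1 - D), with D clamped away from 1 so the log stays finite."""
    d = min(max(d, 0.0), 1.0 - eps)
    return -math.log(1.0 - d)


# Made-up per-step losses for one (state, action) pair over T = 4 diffusion steps.
d_value = discriminator_from_losses([0.2, 0.4, 0.1, 0.3])
reward = reward_from_discriminator(d_value)
```

Low diffusion losses push D toward 1 and the reward upward, so state-action pairs the diffusion model reconstructs well are the ones the policy is rewarded for visiting.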

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

32 extracted references · 32 canonical work pages · 1 internal anchor

[1] Behavior Cloning (BC) [4, 5] employs supervised learning to simply and efficiently replicate expert actions

INTRODUCTION Imitation Learning (IL) learns policies directly from expert demonstrations without predefined reward signals [1, 2, 3], offering an alternative to Reinforcement Learning (RL), which often requires carefully designed reward functions that are difficult to obtain in certain scenarios. Behavior Cloning (BC) [4, 5] employs supervised learnin...

[2] METHOD In this section, we introduce our method SD2AIL, as shown in Fig. 1. First, we use high-quality samples generated by the diffusion model [excerpt interrupted by the page stamp "arXiv:2512.18583v1 [cs.LG] 21 Dec 2025" and by Fig. 1 labels: expert π_e, pseudo-expert π_pe, agent π_θ, PEDR, Environment, D_ϕ(s, a, ϵ), R_ϕ(s, a, ϵ), Learnable, Frozen] (b) Optimizin...

[3] Experimental Setup Dataset. As shown in Fig

EXPERIMENTS 3.1. Experimental Setup Dataset. As shown in Fig. 2, we evaluate the performance of our model on four classic MuJoCo tasks: Ant, Walker, Hopper, and HalfCheetah. All datasets consist of 40 trajectories, each containing 1,000 state–action pairs. Among these, the datasets for Ant, Walker, and HalfCheetah were provided by Kostrikov et al [26], whi...

[4] In the Walker task, our method also achieved a result of 5743, surpassing the baselines. Notably, when there is only one expert trajectory, our method requires significantly fewer time steps to converge in the Hopper and HalfCheetah experiments, with about 210k and 180k steps, respectively. Our method outperforms DRAIL and SMILING across all four tasks,...

[5] First, we adopt a diffusion model in the discriminator of AIL to generate pseudo-expert demonstrations that augment the real expert demonstrations

CONCLUSIONS In this work, we present SD2AIL, a novel Adversarial Imitation Learning from Synthetic Demonstrations via Diffusion Models. First, we adopt a diffusion model in the discriminator of AIL to generate pseudo-expert demonstrations that augment the real expert demonstrations. We further introduce a prioritized expert demonstration replay (PEDR) m...

[6] T. V. Samak, C. V. Samak, and S. Kandhasamy, “Robust behavioral cloning for autonomous vehicles using end-to-end imitation learning,” SAE Int. J. Connected Autom. Veh., vol. 4, no. 3, pp. 279–295, 2021
[7] A. Y. Ng and S. Russell, “Algorithms for inverse reinforcement learning,” in Proc. 17th Int. Conf. Mach. Learn., pp. 1–2, 2000
[8] J. Ho and S. Ermon, “Generative adversarial imitation learning,” in Adv. Neural Inf. Process. Syst., vol. 29, 2016
[9] S. Ross, G. Gordon, and D. Bagnell, “A reduction of imitation learning and structured prediction to no-regret online learning,” in Proc. 14th Int. Conf. Artif. Intell. Stat., pp. 627–635, 2011
[10] P. Florence, C. Lynch, A. Zeng, O. A. Ramirez, A. Wahid, L. Downs, A. Wong, J. Lee, I. Mordatch, and J. Tompson, “Implicit behavioral cloning,” in Conference on Robot Learning, Jan. 2022, pp. 158–168. PMLR
[11] J. Fu, K. Luo, and S. Levine, “Learning robust rewards with adversarial inverse reinforcement learning,” in Proc. 6th Int. Conf. Learn. Represent., 2018
[12] I. Kostrikov, K. K. Agrawal, D. Dwibedi, S. Levine, and J. Tompson, “Discriminator-actor-critic: Addressing sample inefficiency and reward bias in adversarial imitation learning,” in Proc. 7th Int. Conf. Learn. Represent., 2019
[13] L. Song, D. Li, and X. Xu, “Adaptive generative adversarial maximum entropy inverse reinforcement learning,” Information Sciences, vol. 695, p. 121712, 2025
[14] G. Freund, A. Gleave, and S. Levine, “A coupled flow approach to imitation learning,” in Proc. 40th Int. Conf. Mach. Learn., pp. 10357–10372, 2023
[15] R. Dadashi, L. Hussenot, M. Geist, and O. Pietquin, “Primal Wasserstein Imitation Learning,” in Proceedings of the International Conference on Learning Representations (ICLR), 2021
[16] J. Ho, A. Jain, and P. Abbeel, “Denoising diffusion probabilistic models,” in Adv. Neural Inf. Process. Syst., vol. 33, pp. 6840–6851, 2020
[17] Y. Song, J. Sohl-Dickstein, D. P. Kingma, A. Kumar, S. Ermon, and B. Poole, “Score-based generative modeling through stochastic differential equations,” in Proc. Int. Conf. Learn. Represent., 2021
[18] M. Janner, Y. Du, J. B. Tenenbaum, and S. Levine, “Planning with diffusion for flexible behavior synthesis,” in Proc. 39th Int. Conf. Mach. Learn., 2022
[19] M. Reuss, M. Li, X. Jia, and R. Lioutikov, “Goal-conditioned imitation learning using score-based diffusion policies,” in Proc. Robot.: Sci. Syst., 2023
[20] Z. Wang, J. J. Hunt, and M. Zhou, “Diffusion policies as an expressive policy class for offline reinforcement learning,” in Proc. Int. Conf. Learn. Represent., 2023
[21] L. Guanghe, Y. Shan, Z. Zhengbang, T. Long, and W. Zhang, “DiffStitch: Boosting offline reinforcement learning with diffusion-based trajectory stitching,” in Proc. 41st Int. Conf. Mach. Learn., 2024
[22] R. Wu, Y. Chen, G. Swamy, K. Brantley, and W. Sun, “Diffusing states and matching scores: A new framework for imitation learning,” in Proc. Int. Conf. Learn. Represent., 2025
[23] B. Wang, G. Wu, T. Pang, Y. Zhang, and Y. Yin, “DiffAIL: Diffusion adversarial imitation learning,” in Proc. AAAI Conf. Artif. Intell., vol. 38, no. 14, pp. 15447–15455, 2024
[24] C. M. Lai, H. C. Wang, P. C. Hsieh, F. Wang, M. H. Chen, and S. H. Sun, “Diffusion-reward adversarial imitation learning,” in Adv. Neural Inf. Process. Syst., vol. 37, pp. 95456–95487, 2024
[25] T. Schaul, J. Quan, I. Antonoglou, and D. Silver, “Prioritized experience replay,” in Proc. 4th Int. Conf. Learn. Represent., 2016
[26] E. Todorov, T. Erez, and Y. Tassa, “MuJoCo: A physics engine for model-based control,” in Proc. IEEE/RSJ Int. Conf. Intell. Robots Syst., pp. 5026–5033, 2012
[27] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” in Adv. Neural Inf. Process. Syst., vol. 27, 2014
[28] M. Andrychowicz, P. Wolski, R. Ray, J. Schneider, R. Fong, P. Welinder, et al., “Hindsight experience replay,” in Adv. Neural Inf. Process. Syst., pp. 5055–5065, 2017
[29] X. Cao, H. Y. Wan, Y. F. Lin, and S. Han, “High-value prioritized experience replay for off-policy reinforcement learning,” in Proc. IEEE 31st Int. Conf. Tools Artif. Intell., pp. 1510–1514, 2019
[30] T. M. Luu and C. D. Yoo, “Hindsight goal ranking on replay buffer for sparse reward environment,” IEEE Access, vol. 9, pp. 51996–52007, 2021
[31] I. Kostrikov, O. Nachum, and J. Tompson, “Imitation learning via off-policy distribution matching,” in Proc. 8th Int. Conf. Learn. Represent., 2020
[32] M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter, “GANs trained by a two time-scale update rule converge to a local Nash equilibrium,” in Adv. Neural Inf. Process. Syst., vol. 30, pp. 6626–6637, 2017
