arxiv: 2512.08052 · v3 · pith:C7ULMUVUnew · submitted 2025-12-08 · 💻 cs.RO · cs.LG

An Introduction to Deep Reinforcement and Imitation Learning

Pith reviewed 2026-05-21 17:15 UTC · model grok-4.3

classification 💻 cs.RO cs.LG
keywords deep reinforcement learning · imitation learning · embodied agents · Markov decision processes · proximal policy optimization · generative adversarial imitation learning · behavioral cloning · dataset aggregation

The pith

A self-contained introduction presents deep reinforcement and imitation learning for embodied agents through a small set of core algorithms.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Deep Reinforcement Learning and Deep Imitation Learning for embodied agents such as robots and virtual characters. It covers all required mathematical and machine learning concepts as they appear, beginning with Markov Decision Processes. For DRL the text focuses on REINFORCE and Proximal Policy Optimization; for DIL it covers Behavioral Cloning, Dataset Aggregation, and Generative Adversarial Imitation Learning. The approach deliberately stays depth-first and limited in scope so that readers build solid understanding without external references.

Core claim

Embodied agents solve sequential decision-making problems by learning from reward signals or expert demonstrations, and these two families of methods can be understood in depth from a concise, self-contained presentation of a few foundational algorithms and the concepts that support them.

What carries the argument

The depth-first, self-contained treatment that introduces every necessary concept only when required, centered on the progression from Markov Decision Processes through REINFORCE and PPO on the reinforcement side and Behavioral Cloning, DAgger, and GAIL on the imitation side.

If this is right

  • Embodied agents can acquire effective controllers by optimizing against reward signals using methods such as PPO.
  • Controllers can also be acquired by imitating expert demonstrations via techniques such as GAIL without an explicit reward function.
  • Sequential decision problems become approachable once the supporting concepts of Markov Decision Processes and policy gradients are in place.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same limited set of algorithms could be used as a practical starting curriculum when teaching robotics students to implement learning-based controllers.
  • Because every concept is introduced on demand, the material might be directly usable as lecture notes for a short course on learning for physical agents.

Load-bearing premise

A small, fixed collection of foundational algorithms and techniques is enough to give readers an in-depth grasp of both fields without any need for outside material.

What would settle it

A reader who works through the entire document yet still cannot follow the derivation or implementation of one of the listed algorithms without consulting external sources.

Figures

Figures reproduced from arXiv: 2512.08052 by Pedro Santana.

Figure 2.1: PDFs of Gaussian distributions with different means and standard devia…
Figure 2.2: Plot of the natural logarithm function, ln(…
Figure 3.1: The agent-environment interaction cycle in an MDP (adapted from [20]).
Figure 3: illustrates an example trajectory of a rabbit-agent in a simple grid-world…
Figure 3.2: Example of the MDP interaction cycle: a rabbit-agent randomly moving in a…
Figure 3: illustrates the Markov property in a grid-world environment, highlighting…
Figure 3.3: Illustrative grid-world showing a local transition neighbourhood around state…
Figure 3.4: Illustrative MDP example. For simplicity, the visualization of the MDP is…
Figure 3: shows the path an agent takes in a grid-world, along with the non-zero…
Figure 3.5: Discounted return at t = 0, t = 5, and t = 9 along an agent trajectory on a grid-world, with a given γ. 3.3 Exact Methods for MDPs. This section explores the problem of solving MDPs using exact methods, which are guaranteed to find the optimal mapping from states to action probabilities. These methods assume complete knowledge of the MDP, which can limit their practical application in real-world problems…
Figure 3.6: Example of the optimal policy π₁ (left) and a sub-optimal policy π₂ (right) for the problem of a rabbit-agent moving in a grid-world with rewards as follows: landing on a carrot = +5, landing on poison = -5, moving out-of-bounds = -10. See the main text for a complete explanation. 3.3.2 Value Functions. The goal of learning is to discover the optimal policy through experience. A stepping stone toward thi…
Figure 3.7: Example of trajectories induced by two different policies,…
Figure 3: shows the grid world with the values of…
Figure 3.8: Iterative policy evaluation for grid world with…
Figure 3.9: shows the same procedure but for a different action distribution. In this case the random walk is biased towards moving to the right as a consequence of wind presence, P(A_t = north) = 0.25, P(A_t = south) = 0.25, P(A_t = left) = 0.15, P(A_t = right) = 0.35, ∀t. Note how the states with higher value shifted to the left. That is because starting away from the right boundary renders less likely and delays the …
Figure 3.10: Iterative policy evaluation for grid world with…
Figure 3.11: Value iteration for grid world with jumps and no walls. Left: the rules of…
Figure 3.12: Value iteration for grid world with jumps and walls. Left: the rules of the…
Figure 4.1: Gradient vectors (red arrows) of function…
Figure 4.2: Gradient vectors (red arrows) of function…
Figure 4.3: 3D surface plot of the function f(θ) = 10 − θ₁² − 3θ₂². At point θ₀ = (1, 1), f(θ₀) = 10 − 1² − 3 × 1² = 6, and ∇f(θ₀) = (−2 × 1, −6 × 1) = (−2, −6). Note that the second coordinate of the gradient is three times higher than the first, indicating that a change in parameter θ₂ induces a change in f that is three times higher than a change in parameter θ₁. Hence, if the goal is to maximize f, θ₂ sho…
Figure 4: shows the 2D contour plot of the function with the first 35 gradient ascent…
Figure 4.4: 2D contour plot of function f(θ) = 10 − θ₁² − 3θ₂², with gradient vectors as red arrows. Learning Rate Decay. When optimizing a function, it is often beneficial to lower the learning rate as the optimization progresses, that is, as t varies. This approach allows for a more extensive exploration of the parameter space in the initial stages and gradually reduces the step sizes to refine the policy. Ther…
Figure 4.5: Illustration of a policy rollout: the agent interacts with the environment over an episode, generating state-action-return tuples that provide unbiased samples of the policy gradient. Once the rollout is complete, the generated trajectory can be analyzed to evaluate the quantity inside the brackets in Equation 4.4 at each time step. First, for each time step t of the rollout, the discounted return is com…
Figure 4: illustrates the policy-parameter update step for the simplified case of a…
Figure 4.6: Visual illustration of the policy update step, where the red arrows represent…
Figure 4.7: State and actions of the OpenAI CartPole problem. The cart, the pole,…
Figure 4: illustrates the learning progress of the REINFORCE algorithm with de…
Figure 4.8: REINFORCE algorithm with decaying learning rate (but without baseline),…
Figure 4.9: Diagram of an artificial neuron. In an MLP, neurons are organized in layers (see…
Figure 4.10: Diagram of an MLP. The neurons in the MLP are represented by their…
Figure 4.11: Tanh, ReLU, and sigmoid activation functions.
Figure 4: plots the…
Figure 4.12: Architecture of a 4-Layer MLP modeling a policy over…
Figure 4.13: Diagram of an MLP for state-value function modeling.
Figure 4.14: Schematic overview of the state-value function training process.
Figure 4: displays the accumulated reward (undiscounted) per learning episode…
Figure 4.15: Smoothed accumulated reward (undiscounted) per learning episode for…
Figure 4.16: The OpenAI Lunar Lander problem. The objective is to safely land the…
Figure 4: shows the accumulated reward (undiscounted) per learning episode for…
Figure 4.17: REINFORCE algorithm using MLPs for both the policy and state-value…
Figure 4: illustrates an example of an MLP modeling a policy over a single con…
Figure 4.18: Architecture of a 4-Layer MLP modeling a policy over one continuous…
Figure 4.19: Collecting data by rolling out the current policy…
Figure 4.20: Creating the mini-batch list B from the experiences buffer D.
Algorithm 6: CreateMiniBatches(D, M)
  Input: experiences buffer, D
  Input: mini-batch size, M
  Initialize list of mini-batches, B ← []
  Shuffle the experience buffer, D_shuffled ← Shuffle(D)
  While |D_shuffled| > 0
    If |D_shuffled| < M
      Randomly sample M − |D_shuffled| experiences from B, Z
      Append the sampled experiences Z to…
Figure 4: illustrates the process of rollouts collection and subsequent policy update…
Figure 4.21: A PPO iteration: (1) multiple rollouts are collected using…
Figure 4.22: Schematics of three environments in MuJoCo: HalfCheetah (left), Hopper…
Figure 4.23: Smoothed average learning curves of PPO (in black) and VPG (in blue)…
Figure 5.1: The agent-environment interaction cycle in an MDP (adapted from [20]).
Figure 5: depicts an example of an MLP modeling a policy over…
Figure 5.2: Architecture of a 4-Layer MLP modeling a policy over…
Figure 5: illustrates an example of an MLP modeling a policy over a single continuous…
Figure 5.3: Architecture of a 4-Layer MLP modeling a policy over one continuous action.
Figure 5: illustrates how the demonstration dataset…
Figure 5.4: Construction of the demonstration dataset…
Figure 5.5: Behavioral Cloning treats imitation learning as a supervised learning prob…
Figure 5.6: Illustration of covariate shift and compounding errors in Behavioral Cloning.
Figure 5.7: Diagram of the DAgger [15] data aggregation loop for iterations…
Figure 5.8: Zoomed-in view of the rollout loop over time steps…
Figure 5.9: Comparison of average learning performance over multiple runs between…
Figure 5.10: Conceptual illustration of a single training iteration in GAIL [6].
Figure 5.11: High-level summary of the GAIL [6] training loop. The learner’s policy…
Figure 5: shows that GAIL consistently outperforms Behavioral Cloning (BC) in…
Figure 5.12: Schematic of 3D humanoid-like simulation in MuJoCo (left), and compar…
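
The worked numbers in the captions for Figures 4.3 and 4.4 can be checked with a few lines of code. The following is a minimal sketch, not the paper's implementation. It assumes the function is f(θ) = 10 − θ₁² − 3θ₂², as the Figure 4.3 caption states, and runs plain gradient ascent from θ₀ = (1, 1). The fixed learning rate of 0.1 and the helper names `f`, `grad_f`, and `gradient_ascent` are illustrative assumptions. The step count of 35 is borrowed from the truncated caption that mentions "the first 35 gradient ascent".

```python
def f(theta):
    """Objective from the Figure 4.3 caption: f(theta) = 10 - theta1^2 - 3*theta2^2."""
    t1, t2 = theta
    return 10.0 - t1 ** 2 - 3.0 * t2 ** 2


def grad_f(theta):
    """Analytic gradient of f: (-2*theta1, -6*theta2)."""
    t1, t2 = theta
    return (-2.0 * t1, -6.0 * t2)


def gradient_ascent(theta0, learning_rate=0.1, steps=35):
    """Plain gradient ascent with a fixed step size.

    The learning rate is an assumed value chosen for illustration;
    the source page does not state the one used in the paper.
    """
    theta = tuple(theta0)
    path = [theta]
    for _ in range(steps):
        g = grad_f(theta)
        # Move in the direction of the gradient to increase f.
        theta = (theta[0] + learning_rate * g[0],
                 theta[1] + learning_rate * g[1])
        path.append(theta)
    return path


theta0 = (1.0, 1.0)
print(f(theta0))       # 6.0
print(grad_f(theta0))  # (-2.0, -6.0)

path = gradient_ascent(theta0)
print(path[-1], f(path[-1]))  # approaches the maximizer (0, 0), where f = 10
```

With this step size, θ₂ shrinks toward zero faster than θ₁, which matches the caption's observation that the gradient is three times larger along the second coordinate.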
Original abstract

Embodied agents, such as robots and virtual characters, must continuously select actions to execute tasks effectively, solving complex sequential decision-making problems. Given the difficulty of designing such controllers manually, learning-based approaches have emerged as promising alternatives, most notably Deep Reinforcement Learning (DRL) and Deep Imitation Learning (DIL). DRL leverages reward signals to optimize behavior, while DIL uses expert demonstrations to guide learning. This document introduces DRL and DIL in the context of embodied agents, adopting a concise, depth-first approach to the literature. It is self-contained, presenting all necessary mathematical and machine learning concepts as they are needed. It is not intended as a survey of the field; rather, it focuses on a small set of foundational algorithms and techniques, prioritizing in-depth understanding over broad coverage. The material ranges from Markov Decision Processes to REINFORCE and Proximal Policy Optimization (PPO) for DRL, and from Behavioral Cloning to Dataset Aggregation (DAgger) and Generative Adversarial Imitation Learning (GAIL) for DIL.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript is an introductory tutorial on Deep Reinforcement Learning (DRL) and Deep Imitation Learning (DIL) targeted at embodied agents such as robots and virtual characters. It claims to adopt a concise, depth-first approach that is self-contained by introducing all necessary mathematical and machine-learning prerequisites on demand. The scope is deliberately narrow: for DRL it covers Markov Decision Processes through REINFORCE and Proximal Policy Optimization (PPO); for DIL it covers Behavioral Cloning through Dataset Aggregation (DAgger) and Generative Adversarial Imitation Learning (GAIL). The central pedagogical claim is that this limited set of foundational algorithms suffices for in-depth understanding without requiring external references for core concepts.

Significance. If the exposition is accurate and the chosen algorithms are presented with sufficient rigor and clarity, the document could serve as a compact entry point for students and researchers entering learning-based control in robotics. Its value is pedagogical rather than scientific; it does not advance new theorems, empirical results, or theoretical derivations. The deliberate restriction to a small set of methods is a strength for depth but also limits its utility as a standalone reference for the broader literature.

major comments (1)
  1. [Abstract and §1] Abstract and §1 (Introduction): The claim that the selected algorithms (MDPs to PPO; BC to GAIL) provide 'in-depth understanding of the broader fields' without external references is a scope choice rather than a demonstrated result. The manuscript should include a brief explicit justification, perhaps in the introduction, for why these particular algorithms are treated as foundational and sufficient, or acknowledge the trade-off explicitly.
minor comments (2)
  1. Ensure every mathematical symbol (e.g., state space S, action space A, transition function P) is defined at first use and that notation remains consistent across DRL and DIL sections.
  2. Add a short concluding section that points readers to the most important limitations of the covered methods (e.g., sample inefficiency of on-policy DRL, distribution shift in imitation learning) to maintain balance.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the thoughtful review and constructive suggestion. We address the single major comment below and will revise the manuscript accordingly.

point-by-point responses
  1. Referee: [Abstract and §1] Abstract and §1 (Introduction): The claim that the selected algorithms (MDPs to PPO; BC to GAIL) provide 'in-depth understanding of the broader fields' without external references is a scope choice rather than a demonstrated result. The manuscript should include a brief explicit justification, perhaps in the introduction, for why these particular algorithms are treated as foundational and sufficient, or acknowledge the trade-off explicitly.

    Authors: We agree that the selection of these algorithms is a deliberate pedagogical scope choice rather than an empirical demonstration. The current abstract already states that the document 'is not intended as a survey of the field; rather, it focuses on a small set of foundational algorithms and techniques, prioritizing in-depth understanding over broad coverage.' To make this framing more explicit, we will add a short paragraph in the introduction that justifies why MDPs-to-PPO and BC-to-GAIL are treated as core building blocks: they introduce the essential concepts of policy optimization, value estimation, and distribution matching in a self-contained manner, thereby equipping readers to engage with the wider literature. We will also briefly note the inherent trade-off of limited breadth. revision: yes

Circularity Check

0 steps flagged

No significant circularity

full rationale

This document is explicitly framed as a tutorial-style introduction rather than a research contribution containing novel derivations, predictions, or empirical claims. It presents established concepts from MDPs through PPO for DRL and from behavioral cloning through GAIL for DIL, introducing all necessary mathematical and machine-learning prerequisites on demand. No load-bearing steps involve self-definitional equations, fitted inputs renamed as predictions, or self-citation chains that reduce the central assertions to their own inputs. The choices of scope and pedagogical design are not falsifiable technical propositions subject to circularity analysis.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an introductory tutorial document. It contains no new scientific claims, derivations, fitted parameters, or postulated entities. All content draws from the established literature on reinforcement and imitation learning.

pith-pipeline@v0.9.0 · 5702 in / 987 out tokens · 34986 ms · 2026-05-21T17:15:56.548462+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?

  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Robotic Affection -- Opportunities of AI-based haptic interactions to improve social robotic touch through a multi-deep-learning approach

    cs.HC · 2026-05 · unverdicted · novelty 4.0

    A position paper proposes decomposing affective robotic touch into multiple specialized deep learning models for better social human-robot interaction.

Reference graph

Works this paper leans on

25 extracted references · 25 canonical work pages · cited by 1 Pith paper · 5 internal anchors

  [1] Mariusz Bojarski, Davide Del Testa, Daniel Dworakowski, Bernhard Firner, Beat Flepp, Prasoon Goyal, Lawrence D Jackel, Mathew Monfort, Urs Muller, Jiakai Zhang, et al. End to end learning for self-driving cars. arXiv preprint arXiv:1604.07316, 2016.
  [2] Adam Gleave, Mohammad Taufeeque, Juan Rocamonde, Erik Jenner, Steven H Wang, Sam Toyer, Maximilian Ernestus, Nora Belrose, Scott Emmons, and Stuart Russell. imitation: Clean imitation learning implementations. arXiv preprint arXiv:2211.11972, 2022. URL: https://github.com/HumanCompatibleAI/imitation
  [3] I. Goodfellow, Y. Bengio, and A. Courville. Deep Learning. MIT Press, 2016. URL: http://deeplearningbook.org/
  [4] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial networks. Communications of the ACM, 63(11):139–144, 2020.
  [5] Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In International Conference on Machine Learning, pages 1861–1870. PMLR. URL: https://arxiv.org/pdf/1801.01290
  [6] Jonathan Ho and Stefano Ermon. Generative adversarial imitation learning. In Advances in Neural Information Processing Systems, volume 29, 2016.
  [7] Ahmed Hussein, Mohamed Medhat Gaber, Eyad Elyan, and Chrisina Jayne. Imitation learning: A survey of learning methods. ACM Computing Surveys (CSUR), 50(2):1–35, 2017.
  [8] Arthur Juliani, Vincent-Pierre Berges, Ervin Teng, Andrew Cohen, Jonathan Harper, Chris Elion, Chris Goy, Yuan Gao, Hunter Henry, Marwan Mattar, et al. Unity: A general platform for intelligent agents. arXiv preprint arXiv:1809.02627. URL: https://github.com/Unity-Technologies/ml-agents
  [9] Ariel Kwiatkowski, Eduardo Alvarado, Vicky Kalogeiton, C Karen Liu, Julien Pettré, Michiel van de Panne, and Marie-Paule Cani. A survey on reinforcement learning methods in character animation. In Computer Graphics Forum, volume 41, pages 613–639. Wiley Online Library, 2022.
  [10] Timothy P Lillicrap, Jonathan J Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015. URL: https://arxiv.org/pdf/1509.02971
  [11] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015. URL: https://www.nature.com/articles/nature14236
  [12] Lucas Mourot, Ludovic Hoyet, François Le Clerc, François Schnitzler, and Pierre Hellier. A survey on deep learning for skeleton-based human animation. In Computer Graphics Forum, volume 41, pages 122–157. Wiley Online Library, 2022.
  [13] OpenAI. OpenAI Spinning Up in Deep RL, 2018. URL: https://spinningup.openai.com/
  [14] Dean A Pomerleau. Efficient training of artificial neural networks for autonomous navigation. Neural Computation, 3(1):88–97, 1991.
  [15] Stéphane Ross, Geoffrey Gordon, and Drew Bagnell. A reduction of imitation learning and structured prediction to no-regret online learning. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pages 627–635. JMLR Workshop and Conference Proceedings, 2011.
  [16] David E Rumelhart, Geoffrey E Hinton, and Ronald J Williams. Learning representations by back-propagating errors. Nature, 323(6088):533–536, 1986.
  [17] SB3 Team. Stable Baselines 3, 2024. URL: https://github.com/DLR-RM/stable-baselines3/
  [18] John Schulman, Philipp Moritz, Sergey Levine, Michael Jordan, and Pieter Abbeel. High-dimensional continuous control using generalized advantage estimation. arXiv preprint arXiv:1506.02438, 2015.
  [19] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017. URL: https://arxiv.org/pdf/1707.06347
  [20] R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. MIT Press, 2018. URL: https://www.andrew.cmu.edu/course/10-703/textbook/BartoSutton.pdf
  [21] Ronald J Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3):229–256, 1992.
  [22] Maryam Zare, Parham M Kebria, Abbas Khosravi, and Saeid Nahavandi. A survey of imitation learning: Algorithms, recent developments, and challenges. IEEE Transactions on Cybernetics, 2024.
  [23] Boyuan Zheng, Sunny Verma, Jianlong Zhou, Ivor W Tsang, and Fang Chen. Imitation learning: Progress, taxonomies and challenges. IEEE Transactions on Neural Networks and Learning Systems, 35(5):6322–6337, 2022.