A Heuristic Approach for Performance Tuning in RL-based Quadrotor Control via Reward Design and Termination Conditions

Akshit Saradagi; Fausto Mauricio Lagos Suarez; George Nikolakopoulos; Vidya Sumathy

arxiv: 2605.19166 · v1 · pith:7DKXQTUPnew · submitted 2026-05-18 · 💻 cs.RO · cs.LG · math.OC


Pith reviewed 2026-05-20 08:47 UTC · model grok-4.3

classification 💻 cs.RO · cs.LG · math.OC

keywords: reinforcement learning, quadrotor control, reward design, performance tuning, heuristic approach, PPO, setpoint tracking, critically damped response


The pith

Heuristic rules on reward weights and exponential coefficients allow tunable settling times in RL quadrotor control while keeping critically damped behavior and low steady-state error.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a reward design for reinforcement learning that trains quadrotor controllers to follow position and yaw commands with a smooth critically damped response and about two percent steady-state error. Simple heuristic adjustments to the reward weights and the coefficients in the exponential terms shift the policy toward either quicker responses suited to agile maneuvers or slower responses suited to careful inspection tasks. These changes are made without retraining from scratch or losing the desirable damping and accuracy properties. A reader cares because the same training process can produce a family of controllers matched to different application needs such as drone racing or infrastructure monitoring.

Core claim

The authors present a reward structure with dual bandwidth exponentials that, when used with PPO training and episode truncation, yields a baseline policy with critically damped setpoint tracking and low steady-state error. Intuitive heuristic rules then modify the reward weights and exponential coefficients to produce faster acrobatic-like or slower inspection-like settling times while retaining the baseline response characteristics and approximately two percent steady-state error.

What carries the argument

Dual bandwidth exponentials within the reward function that shape the learned policy toward a baseline critically damped response; heuristic scaling of weights and coefficients then controls the speed of convergence.
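The page does not reproduce the paper's reward equation, so the following is only a minimal sketch of what a dual-bandwidth exponential reward could look like, assuming each term is a weighted exponential of the squared tracking error. The function name, the parameters `w_coarse`, `w_fine`, `k_coarse`, `k_fine`, and all default values are hypothetical and chosen for illustration.

```python
import math


def dual_bandwidth_reward(error, w_coarse=0.5, w_fine=0.5, k_coarse=1.0, k_fine=25.0):
    """Illustrative reward: a wide plus a narrow exponential of the tracking error.

    Every name and default value here is an assumption, not the paper's definition.
    """
    e2 = error * error
    # Wide term (small coefficient): still gives a useful signal far from the setpoint.
    coarse = w_coarse * math.exp(-k_coarse * e2)
    # Narrow term (large coefficient): sharp peak that rewards precision near the setpoint.
    fine = w_fine * math.exp(-k_fine * e2)
    return coarse + fine


for err in (0.0, 0.1, 0.5, 2.0):
    print(f"error={err:4.1f}  reward={dual_bandwidth_reward(err):.4f}")
```

In this sketch the weights and the two coefficients are the knobs that the heuristic rules would adjust; how each adjustment shifts settling time is the paper's contribution and is not modeled here.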

If this is right

Training reaches the desired performance in roughly six million time steps for each of the three policies.
The baseline, faster, and slower policies all achieve accurate position and yaw tracking from random initial conditions in one hundred evaluation trials.
Each tuned policy keeps the critically damped character and holds steady-state error near two percent.
Episode truncation conditions support the emergence of the target behavior during learning.
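As a rough illustration of the last point, episode-ending logic is often split into a failure termination and a time-limit truncation, following the `terminated`/`truncated` convention used by Gymnasium environments. The specific conditions and thresholds below (`max_steps`, `max_position_error`, `max_tilt_rad`) are hypothetical; the page does not state which conditions the paper uses.

```python
from dataclasses import dataclass


@dataclass
class EpisodeLimits:
    # All three limits are hypothetical values chosen for illustration.
    max_steps: int = 1000
    max_position_error: float = 3.0  # metres
    max_tilt_rad: float = 1.2        # bound on roll/pitch magnitude


def episode_status(step, position_error, tilt_rad, limits=EpisodeLimits()):
    """Return (terminated, truncated) for the current step."""
    # Failure: the vehicle has drifted too far or tilted too much.
    terminated = (position_error > limits.max_position_error
                  or abs(tilt_rad) > limits.max_tilt_rad)
    # Time limit: the episode is cut short without having failed.
    truncated = (not terminated) and step >= limits.max_steps
    return terminated, truncated


print(episode_status(step=10, position_error=0.5, tilt_rad=0.1))
print(episode_status(step=1000, position_error=0.5, tilt_rad=0.1))
print(episode_status(step=10, position_error=5.0, tilt_rad=0.1))
```

Keeping the two flags separate lets a training loop treat a time-limit cutoff differently from a genuine failure when bootstrapping value estimates.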

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Such reward-based tuning could reduce computational cost by allowing reuse of training runs across multiple performance regimes.
The approach might extend to other continuous control tasks where RL policies need adjustable response speeds.
Further work could explore whether these heuristics remain effective when the quadrotor model includes wind disturbances or payload changes.

Load-bearing premise

That changes to a few reward weights and exponential coefficients will reliably produce the desired shifts in settling time while preserving critical damping and keeping steady-state error close to two percent for random starting conditions.

What would settle it

Train the policies using the faster and slower reward settings, then run one hundred trials from random initial states and check whether settling times change as expected while damping and error stay the same; failure to observe these outcomes would falsify the heuristic rules.
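The check described above needs per-trial numbers. The sketch below shows one way to extract percent overshoot, 2% settling time, and steady-state error from a single recorded step response. The function name `step_metrics` and the synthetic test signal are assumptions for illustration; no paper data is used.

```python
import math


def step_metrics(times, response, setpoint, band=0.02):
    """Percent overshoot, settling time, and steady-state error of one step response.

    Assumes the trial starts at 0 and steps to a positive, nonzero `setpoint`.
    """
    tol = band * abs(setpoint)
    overshoot_pct = max(0.0, (max(response) - setpoint) / setpoint * 100.0)
    # Walk backwards: the settling time is the earliest sample after which
    # the response never leaves the tolerance band again.
    settling_time = None  # stays None if the final sample is outside the band
    for t, y in zip(reversed(times), reversed(response)):
        if abs(y - setpoint) > tol:
            break
        settling_time = t
    steady_state_error_pct = abs(response[-1] - setpoint) / abs(setpoint) * 100.0
    return overshoot_pct, settling_time, steady_state_error_pct


# Synthetic critically damped response with natural frequency 2 rad/s.
omega = 2.0
times = [i * 0.01 for i in range(1001)]
response = [1.0 - (1.0 + omega * t) * math.exp(-omega * t) for t in times]
overshoot, t_settle, sse = step_metrics(times, response, setpoint=1.0)
print(f"overshoot={overshoot:.2f}%  settling_time={t_settle:.2f}s  sse={sse:.6f}%")
```

Running this on every trial of each policy would give the distributions needed to compare settling times while checking that overshoot and steady-state error stay put.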

Figures

Figures reproduced from arXiv: 2605.19166 by Akshit Saradagi, Fausto Mauricio Lagos Suarez, George Nikolakopoulos, Vidya Sumathy.

**Figure 2.** Reinforcement Learning setup and illustration of th

**Figure 3.** Rollout episode mean reward across 5 random seeds

**Figure 5.** Performance profile comparison between the three

**Figure 6.** Motor RPMs average across 5 tests with random

Original abstract

Reinforcement learning (RL)-based quadrotor control policies have achieved impressive performance in tasks such as fast navigation in cluttered environments and drone racing, where the focus is on speed and agility. However, in several applications, such as infrastructure inspection, it is critical to achieve precise, controlled maneuvers with tunable performance. In this article, we present a novel heuristic approach to achieve tunable performance in RL-based Quadrotor control through reward design and termination conditions. We present a novel reward structure containing dual bandwidth exponentials that achieves a baseline critically damped response in setpoint tracking, with low steady-state errors. When trained with a Proximal Policy Optimization (PPO) algorithm, in conjunction with episode truncation conditions, the desired performance is achieved in 6 million time steps in a sample-efficient manner. In order to tune the performance about the baseline behavior, we present intuitive heuristic rules to adjust the reward weights and exponential coefficients to achieve faster (acrobatic-like) and slower (inspection-like) settling time performance, while retaining the baseline critically damped response and approximately 2% steady-state error. We evaluate the three RL policies (baseline, acrobatic, and inspection) across 100 trials and show accurate and tunable performance in position and yaw tracking from random initial conditions, thereby demonstrating the effectiveness of the proposed heuristic approach.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper gives a concrete heuristic for tuning RL quadrotor policies between fast and slow settling via dual-exponential rewards, but the claim that it preserves critically damped behavior is under-supported for nonlinear controllers.


The paper offers a practical heuristic for adjusting RL quadrotor policies to trade settling speed while aiming for low steady-state error. The authors build a reward with two exponential terms of different bandwidths to establish a baseline response, train with PPO plus episode truncation, and then supply rules for shifting the weights and coefficients to get faster or slower behavior. The abstract reports that this reaches usable tracking after 6 million steps and holds up across 100 trials from random starts with roughly 2% error in position and yaw.

That combination of reward structure and explicit adjustment rules looks like the actual new piece relative to standard shaping or curriculum approaches in the quadrotor RL literature. It gives engineers a knob they can turn without retraining from scratch, which could matter for inspection tasks that want deliberate motion or for agile flight that wants quicker response. The training efficiency and the fact that they evaluate three variants on the same task are clear positives.

The soft spots are mostly around verification. The repeated reference to a “critically damped response” comes from linear second-order systems, yet the policies are nonlinear and the paper supplies no overshoot numbers, damping proxies, or step-response metrics to show the property survives the heuristic changes. There are also no baseline comparisons and no error bars or statistical tests on the 100 trials. The heuristics are presented as reliable, but the evidence is limited to the reported simulation outcomes.

This work is aimed at practitioners who already run RL on quadrotors and want tunable performance in simulation. A reader focused on setpoint tracking or reward design for drones would find usable details. It is worth sending for peer review because the idea is straightforward to implement and addresses a concrete need, though referees will probably ask for more quantitative checks on the damping claim and some ablations.
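For context on the term borrowed from linear systems, the textbook second-order definition can be written out as follows. The symbols $\omega_n$ (natural frequency), $\zeta$ (damping ratio), and $t_s$ (2% settling time) are standard notation introduced here, not taken from the paper.

```latex
G(s) = \frac{\omega_n^2}{s^2 + 2\zeta\omega_n s + \omega_n^2},
\qquad
\zeta = 1 \;\Rightarrow\; y(t) = 1 - (1 + \omega_n t)\,e^{-\omega_n t}, \quad t \ge 0 .

% The unit-step response never exceeds 1, so the overshoot is zero.
% 2% settling time: (1 + \omega_n t_s)\,e^{-\omega_n t_s} = 0.02
% \;\Rightarrow\; \omega_n t_s \approx 5.83 .
```

For a learned nonlinear policy, "critically damped" can only mean that measured responses resemble this shape, which is why overshoot and settling-time numbers are the natural evidence to ask for.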

Referee Report

2 major / 2 minor

Summary. The manuscript presents a heuristic approach for tuning RL-based quadrotor control policies via reward design and termination conditions. It introduces a reward structure with dual-bandwidth exponentials that, when optimized with PPO, yields a baseline policy exhibiting critically damped setpoint tracking and approximately 2% steady-state error after 6 million training steps. Intuitive heuristic rules are given for adjusting reward weights and exponential coefficients to produce faster (acrobatic) or slower (inspection) settling times while preserving the baseline response and error level. Three policies are evaluated over 100 trials, showing accurate position and yaw tracking from random initial conditions.

Significance. If the heuristic rules can be shown to reliably produce the claimed tunable settling behavior while preserving low overshoot and steady-state error in the nonlinear setting, the work would offer a practical, sample-efficient alternative to full retraining for adapting quadrotor RL controllers to different task requirements. The focus on reward shaping and episode truncation as direct tuning mechanisms is a useful empirical contribution, though its scope is currently limited to the specific quadrotor dynamics and PPO implementation described.

major comments (2)

[Abstract] The central claim that heuristic adjustments to reward weights and exponential coefficients achieve faster or slower settling times 'while retaining the baseline critically damped response and approximately 2% steady-state error' is load-bearing yet under-supported. The evaluation over 100 trials reports only 'accurate and tunable performance' without quantitative metrics (overshoot, damping-ratio proxy, or step-response characteristics) or statistical tests to verify that the critically damped property survives the adjustments for the nonlinear PPO policies under random initial conditions.
[Evaluation] No baseline comparisons (e.g., to classical PID controllers or untuned PPO policies) or error bars are provided, making it difficult to assess whether the observed tunability is attributable to the proposed heuristics rather than general PPO training variability.

minor comments (2)

The description of the dual-bandwidth exponential terms in the reward function would benefit from an explicit equation or pseudocode to clarify how the two bandwidth parameters interact with the position and velocity errors.
Figure captions for the trajectory plots should include the specific initial conditions and trial count to allow direct comparison with the 100-trial aggregate results.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We have revised the manuscript to strengthen the quantitative support for our claims and to include additional evaluation elements as detailed below.


Referee: [Abstract] The central claim that heuristic adjustments to reward weights and exponential coefficients achieve faster or slower settling times 'while retaining the baseline critically damped response and approximately 2% steady-state error' is load-bearing yet under-supported. The evaluation over 100 trials reports only 'accurate and tunable performance' without quantitative metrics (overshoot, damping-ratio proxy, or step-response characteristics) or statistical tests to verify that the critically damped property survives the adjustments for the nonlinear PPO policies under random initial conditions.

Authors: We agree that the original presentation relied on qualitative descriptions of the response behavior. In the revised manuscript we have added explicit quantitative metrics for each of the three policies: percent overshoot, 2%-settling time, and steady-state error (reported as mean and standard deviation over the 100 trials). We also include a damping-ratio proxy computed from the dominant poles of a second-order fit to the averaged step-response trajectories. These values are now stated in the abstract and supported by new step-response plots and a statistical summary table in the Evaluation section, confirming that the critically damped character and ~2% error level are retained after heuristic adjustment.
Revision: yes

Referee: [Evaluation] No baseline comparisons (e.g., to classical PID controllers or untuned PPO policies) or error bars are provided, making it difficult to assess whether the observed tunability is attributable to the proposed heuristics rather than general PPO training variability.

Authors: We acknowledge the value of error bars and have added them (standard deviation across trials) to all performance plots in the revised Evaluation section. Regarding baselines, the manuscript's scope centers on demonstrating heuristic tunability within the RL setting rather than a comprehensive controller comparison; a full PID benchmark would require additional experimental design outside the current contribution. We have therefore added a short discussion noting this limitation and included a qualitative reference to a standard PID controller tuned for similar quadrotor dynamics, while retaining focus on the RL policies. We believe these changes address the core concern about variability without expanding the paper's primary claims.
Revision: partial
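The statistics mentioned in these responses can be sketched in a few lines. Note that this sketch swaps in a simpler damping-ratio proxy than the one the simulated rebuttal names: it inverts the ideal second-order overshoot formula instead of fitting dominant poles. The function names and the sample overshoot values are made up for illustration and are not results from the paper.

```python
import math
import statistics


def damping_ratio_from_overshoot(overshoot_pct):
    """Damping ratio of the ideal second-order system with this percent overshoot."""
    if overshoot_pct <= 0.0:
        # No overshoot: critically damped or overdamped; reported as 1.0 here.
        return 1.0
    log_os = math.log(overshoot_pct / 100.0)
    return -log_os / math.sqrt(math.pi ** 2 + log_os ** 2)


def summarize(values):
    """Mean and sample standard deviation over trials."""
    return statistics.mean(values), statistics.stdev(values)


# Made-up per-trial overshoot values (percent), for illustration only.
overshoots = [0.0, 0.4, 1.2, 0.0, 0.8]
mean_os, std_os = summarize(overshoots)
proxies = [damping_ratio_from_overshoot(o) for o in overshoots]
print(f"overshoot: {mean_os:.2f} +/- {std_os:.2f} %")
print("damping-ratio proxies:", [round(z, 3) for z in proxies])
```

An overshoot-based proxy saturates at 1.0 and so cannot distinguish critically damped from overdamped responses; a pole-based fit, as described in the response, would be needed for that.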

Circularity Check

0 steps flagged

No circularity; heuristic empirical tuning is self-contained


The paper presents heuristic rules for adjusting reward weights and dual-bandwidth exponential coefficients in a PPO-trained RL policy for quadrotor setpoint tracking. These rules are explicitly introduced as intuitive adjustments to achieve faster or slower settling times around a baseline response, with performance claims supported solely by empirical evaluation across 100 random-initial-condition trials. No derivation chain, fitted-parameter predictions, self-citation load-bearing steps, or ansatz smuggling appears in the provided text; the reward structure and termination conditions are inputs whose outcomes are observed rather than mathematically forced by construction.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 0 invented entities

The central claim depends on the empirical effectiveness of the chosen reward form and the unproven transferability of the heuristic adjustments; no formal derivation of the damping property is supplied.

free parameters (2)

reward weights: adjusted heuristically to shift between baseline, acrobatic, and inspection behaviors
exponential coefficients: chosen to set the two bandwidths that define the baseline critically damped response

axioms (2)

domain assumption: A reward containing dual bandwidth exponentials produces critically damped setpoint tracking with low steady-state error. Invoked to establish the baseline behavior before heuristic tuning.
domain assumption: Episode truncation conditions improve sample efficiency of PPO training for this task. Used to reach desired performance in 6 million time steps.



Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: novel reward structure containing dual bandwidth exponentials that achieves a baseline critically damped response... intuitive heuristic rules to adjust the reward weights and exponential coefficients

IndisputableMonolith/Foundation/DimensionForcing.lean · reality_from_one_distinction · unclear
Paper passage: three policies (baseline, acrobatic, and inspection) ... critically damped response (center), the settling time (up)

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

17 extracted references · 17 canonical work pages

[1] A. T. Azar, A. Koubaa, N. Ali Mohamed, H. A. Ibrahim, Z. F. Ibrahim, M. Kazim, A. Ammar, B. Benjdira, A. M. Khamis, I. A. Hameed, and G. Casalino, “Drone Deep Reinforcement Learning: A Review,” Electronics, vol. 10, no. 9, p. 999, Jan. 2021.

[2] M. Wang, S. Jia, Y. Niu, Y. Liu, C. Yan, and C. Wang, “Agile flights through a moving narrow gap for quadrotors using adaptive curriculum learning,” IEEE Transactions on Intelligent Vehicles, vol. 9, no. 11, pp. 6936–6949, 2024.

[3] M. Kulkarni and K. Alexis, “Reinforcement Learning for Collision-free Flight Exploiting Deep Collision Encoding,” in 2024 IEEE International Conference on Robotics and Automation (ICRA), pp. 15781–15788, May 2024.

[4] E. Kaufmann, L. Bauersfeld, A. Loquercio, M. Müller, V. Koltun, and D. Scaramuzza, “Champion-level drone racing using deep reinforcement learning,” Nature, vol. 620, pp. 982–987, Aug. 2023.

[5] A. Khalid, Z. Mushtaq, S. Arif, K. Zeb, M. A. Khan, and S. Bakshi, “Control schemes for quadrotor uav: Taxonomy and survey,” ACM Comput. Surv., vol. 56, Nov. 2023.

[6] I. Lopez-Sanchez and J. Moreno-Valenzuela, “Pid control of quadrotor uavs: A survey,” Annual Reviews in Control, vol. 56, p. 100900, 2023.

[7] H. Han, J. Cheng, Z. Xi, and B. Yao, “Cascade flight control of quadrotors based on deep reinforcement learning,” IEEE Robotics and Automation Letters, vol. 7, no. 4, pp. 11134–11141, 2022.

[8] Y. Mahran, Z. Gamal, and A. El-Badawy, “Reinforcement learning position control of a quadrotor using soft actor-critic (sac),” in 2024 6th Novel Intelligent and Leading Emerging Sciences Conference (NILES), pp. 72–75, 2024.

[9] P. Kunapuli, J. Welde, D. Jayaraman, and V. Kumar, “Leveling the playing field: Carefully comparing classical and learned controllers for quadrotor trajectory tracking,” 2025.

[10] J. Quan, W. Hu, X. Ma, and G. Chen, “Reinforcement learning stabilization for quadrotor uavs via lipschitz-constrained policy regularization,” Drones, vol. 9, no. 10, 2025.

[11] N. Bernini, M. Bessa, R. Delmas, A. Gold, E. Goubault, R. Pennec, S. Putot, and F. Sillion, “Reinforcement learning with formal performance metrics for quadcopter attitude control under non-nominal contexts,” Engineering Applications of Artificial Intelligence, vol. 127, p. 107090, 2024.

[12] J. Förster, “System identification of the Crazyflie 2.0 nano quadrocopter,” 2015.

[13] M. Greiff, “Modelling and control of the crazyflie quadrotor for aggressive and autonomous flight by optical flow driven state estimation,” 2017.

[14] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, “Proximal policy optimization algorithms,” 2017.

[15] J. Eschmann, Reward Function Design in Reinforcement Learning, pp. 25–33. Cham: Springer International Publishing, 2021.

[16] F. Pardo, A. Tavakoli, V. Levdik, and P. Kormushev, “Time limits in reinforcement learning,” in Proceedings of the 35th International Conference on Machine Learning (J. Dy and A. Krause, eds.), vol. 80 of Proceedings of Machine Learning Research, pp. 4045–4054, PMLR, 10–15 Jul 2018.

[17] J. Panerati, H. Zheng, S. Zhou, J. Xu, A. Prorok, and A. P. Schoellig, “Learning to fly—a gym environment with pybullet physics for reinforcement learning of multi-agent quadcopter control,” in 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 7512–7519, 2021.
