
arxiv: 1906.12082 · v1 · pith:P2Q45MWOnew · submitted 2019-06-28 · 💻 cs.RO · cs.LG

Sample Efficient Learning of Path Following and Obstacle Avoidance Behavior for Quadrotors

Pith reviewed 2026-05-25 14:08 UTC · model grok-4.3

classification 💻 cs.RO cs.LG
keywords imitation learning, quadrotor control, path following, obstacle avoidance, neural network policy, model predictive control, sample efficient learning, robot learning

The pith

A neural network policy trained by imitating a model predictive controller lets quadrotors follow paths while avoiding unseen obstacles.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper establishes that imitation learning can produce a neural network control policy for quadrotors that follows a global reference path and performs local collision avoidance on obstacles absent from training. Demonstrations come from a time-free model predictive path-following controller run on only a few example paths. An adapted version of the same supervisor enables safe exploration during data collection on the real robot. The resulting policy computes actions directly from sensor inputs and generalizes beyond the demonstrated paths.

Core claim

The central claim is that an imitation learning algorithm using a time-free model predictive path-following controller as supervisor produces a neural network policy that reproduces path following with collision avoidance. Due to the generalization ability of neural networks, the resulting policy performs local collision avoidance of unseen obstacles while following a global reference path. The controller generates demonstrations by following few example paths, enabling an easy to implement learning algorithm that is robust to errors of the model used in the model predictive controller. The policy is trained on the real quadrotor using an adapted supervisor for collision-free exploration, so

What carries the argument

An imitation learning algorithm supervised by a time-free model predictive path-following controller. The controller generates demonstrations from a few example paths, and an adapted version of it keeps exploration on the real robot collision-free.
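To make the supervisor-plus-imitation pattern concrete, here is a minimal sketch of a DAgger-style loop on a toy problem. It is not the paper's algorithm: the `supervisor` function is a hypothetical proportional law standing in for the model predictive controller, the dynamics are a one-dimensional single integrator, and the policy is linear rather than a neural network. All names and constants are assumptions chosen for illustration.

```python
import numpy as np

def supervisor(state, target=0.0, gain=1.5):
    """Hypothetical stand-in for the MPC supervisor: a proportional law."""
    return -gain * (state - target)

def step(state, action, dt=0.1):
    """Toy single-integrator dynamics (an assumption, not a quadrotor model)."""
    return state + dt * action

def fit_policy(states, actions):
    """Least-squares fit of a linear policy a = w[0] * s + w[1]."""
    X = np.column_stack([states, np.ones_like(states)])
    w, *_ = np.linalg.lstsq(X, actions, rcond=None)
    return w

def policy(w, state):
    return w[0] * state + w[1]

def train(iterations=5, horizon=30, seed=0):
    rng = np.random.default_rng(seed)
    states, actions = [], []
    w = np.zeros(2)
    for it in range(iterations):
        s = rng.uniform(-1.0, 1.0)
        for _ in range(horizon):
            # Every visited state is labelled with the supervisor's action.
            states.append(s)
            actions.append(supervisor(s))
            # First iteration: the supervisor drives (off-policy data).
            # Later iterations: the learner drives (on-policy data).
            a = supervisor(s) if it == 0 else policy(w, s)
            s = step(s, a)
        # Refit on the aggregated data set after each roll-out.
        w = fit_policy(np.array(states), np.array(actions))
    return w

if __name__ == "__main__":
    # The slope should come out close to -1.5, because the toy supervisor
    # is itself linear and the policy class can represent it exactly.
    print(train())
```

The point of the sketch is the data flow: the supervisor labels states, the learner's own roll-outs decide which states get visited after the first pass, and the policy is refit on everything collected so far.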

If this is right

  • The policy computes control commands directly from sensor inputs and runs in real time without online optimization.
  • Local collision avoidance works for obstacles never encountered during training or demonstration.
  • Training succeeds with a relatively small number of example paths collected on the physical quadrotor.
  • The learned policy remains functional even when the model inside the supervisor contains errors.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same supervisor-plus-imitation pattern could reduce reliance on large simulation datasets for other mobile robots.
  • Reactive avoidance behaviors learned this way might transfer across different sensor suites or vehicle dynamics with modest retraining.
  • Combining a model-based planner for data collection with a neural policy for execution offers one route to safe real-world learning loops.

Load-bearing premise

An adapted version of the supervisor can enable collision-free exploration around the example path without introducing systematic bias into the collected demonstrations or the learned policy.

What would settle it

Fly the trained policy in an environment containing obstacles absent from all training paths; repeated collisions with those obstacles or large deviations from the reference path would falsify the generalization claim.
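One way to turn this test into numbers is to score a logged roll-out for obstacle clearance and for deviation from the reference path. The sketch below assumes 2D positions, circular obstacles of a common radius, and a densely sampled path; the function name `evaluate_rollout` and the thresholds are hypothetical and do not come from the paper.

```python
import numpy as np

def evaluate_rollout(trajectory, path, obstacles, obstacle_radius, max_deviation):
    """Score one roll-out against the two failure modes named above.

    trajectory, path: arrays of shape (N, 2) and (M, 2) with xy positions.
    obstacles: array of shape (K, 2) with obstacle centres (assumed circular).
    """
    trajectory = np.asarray(trajectory, dtype=float)
    path = np.asarray(path, dtype=float)
    obstacles = np.asarray(obstacles, dtype=float)

    # Distance from every trajectory point to every obstacle centre.
    d_obs = np.linalg.norm(trajectory[:, None, :] - obstacles[None, :, :], axis=2)
    min_clearance = d_obs.min() - obstacle_radius

    # Distance from every trajectory point to its nearest path sample.
    d_path = np.linalg.norm(trajectory[:, None, :] - path[None, :, :], axis=2)
    worst_deviation = d_path.min(axis=1).max()

    return {
        "collided": bool(min_clearance <= 0.0),
        "min_clearance": float(min_clearance),
        "max_deviation": float(worst_deviation),
        "within_bound": bool(worst_deviation <= max_deviation),
    }

if __name__ == "__main__":
    x = np.linspace(0.0, 10.0, 101)
    path = np.column_stack([x, np.zeros_like(x)])
    # A roll-out that swerves around an obstacle sitting on the path.
    swerve = np.column_stack([x, np.exp(-(x - 5.0) ** 2)])
    print(evaluate_rollout(swerve, path, [[5.0, 0.0]], 0.5, 1.5))
```

Repeating this over many roll-outs in scenes with obstacles absent from training would give the collision rate and deviation statistics that the generalization claim stands or falls on.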

Figures

Figures reproduced from arXiv: 1906.12082 by Javier Alonso-Mora, Otmar Hilliges, Stefan Stevsic, Tobias Naegeli.

Figure 1: A policy is learned from few, short local collision avoidance and path following maneuvers (red). The learned policy generalizes to unseen scenes and can track long guidance paths (green) through complex environments while successfully avoiding obstacles (blue).
Adjacent paper text: estimation of obstacle positions. Second, function approximators, such as neural networks, can be much more computationally efficient compared t…
Figure 2: Left: Coordinate systems: Global and quadrotor coordinate systems. The quadrotor coordinate system is denoted with a subscript q. Policy inputs and outputs are always calculated in the quadrotor frame. Right: Contouring error approximation: Illustration of the real contouring and lag errors (green) as well as the approximations (orange) used in our MPCC implementation.
Adjacent paper text: where R(φg) is a rotation matrix aro…
Figure 4: Overview: The algorithm for training the policy π(o_t) (left). Off-policy and on-policy steps for data collection (middle and right).
Adjacent paper text: each consisting of 30 neurons with softplus activation and linear neurons in the output layer. Initial weights W are initialized randomly using zero mean normal distribution with standard deviation 0.01. 3) Data collection: To collect data for training, we have two different …
Figure 6: Execution time: Horizon length wrt. execution time of controllers. The control policy imitates a long horizon behavior having the same computation time of 2 · 10^−4 s.
Adjacent paper text: simulation) and in real settings (policy trained on real robot). A. Implementation Details 1) Global path following: The global guidance g coarsely specifies quadrotor motion, but does not need to be aware of obstacles. The policy controls th…
Figure 7: Average flight distance: Distance to collision on different obstacle courses (higher is better). Blue (ours), red (APF)
Figure 8: Comparison trajectories: Trajectories while avoiding a single obstacle positioned on the guidance g.
Adjacent paper text: consequence produces non-smooth trajectories (cf
Figure 9: Policy robustness: Policy performance as a function of the supervisor. The average error from three trained policies are shown. The error is bounded to 50. From experiments, we found that error below 10 gives satisfactory performance. Lower is better.
Adjacent paper text:
TABLE I: COMPARISON WITH THE BASELINE SUPERVISOR
Task                              MPC policy   MPCC policy
Max. tracking deviation z axis    0.847 m      0.077 m
Average flight length             41.67 m      183.…
Figure 10: Left: Policy roll-out: Unseen test scene including long guidance (green), obstacles and flown policy roll-out (blue). Right: Static obstacle. Policy roll-out in the real environment. Three obstacles are positioned along a circular reference.
Adjacent paper text: large deviation from the example path (high train error) and results in poor generalization (high test error). Too large penalization of the contouring cost (K_c = 25.…
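The text extracted alongside Figure 4 mentions hidden layers of 30 neurons with softplus activation, a linear output layer, and weights drawn from a zero-mean normal distribution with standard deviation 0.01. The sketch below shows what such a network could look like in NumPy. The number of hidden layers, the zero-initialized biases, and the observation and action dimensions are assumptions, since the excerpt is truncated before stating them.

```python
import numpy as np

def softplus(x):
    """Numerically stable softplus, log(1 + exp(x))."""
    return np.logaddexp(0.0, x)

def init_policy(obs_dim, act_dim, hidden_layers=2, hidden_units=30, std=0.01, seed=0):
    """Create weights ~ N(0, std^2) and zero biases.

    hidden_layers=2 and the zero biases are assumptions; the excerpt
    only gives the layer width (30) and the weight standard deviation.
    """
    rng = np.random.default_rng(seed)
    sizes = [obs_dim] + [hidden_units] * hidden_layers + [act_dim]
    return [
        (rng.normal(0.0, std, size=(n_in, n_out)), np.zeros(n_out))
        for n_in, n_out in zip(sizes[:-1], sizes[1:])
    ]

def policy_forward(params, obs):
    """Map an observation vector to an action vector."""
    h = np.asarray(obs, dtype=float)
    for W, b in params[:-1]:
        h = softplus(h @ W + b)   # hidden layers: softplus
    W, b = params[-1]
    return h @ W + b              # output layer: linear

if __name__ == "__main__":
    params = init_policy(obs_dim=10, act_dim=4)  # dimensions are placeholders
    print(policy_forward(params, np.zeros(10)).shape)
```

A network of this size needs only a few small matrix products per control step, which fits the summary's point that the policy avoids online optimization.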
Original abstract

In this paper we propose an algorithm for the training of neural network control policies for quadrotors. The learned control policy computes control commands directly from sensor inputs and is hence computationally efficient. An imitation learning algorithm produces a policy that reproduces the behavior of a path following control algorithm with collision avoidance. Due to the generalization ability of neural networks, the resulting policy performs local collision avoidance of unseen obstacles while following a global reference path. The algorithm uses a time-free model predictive path-following controller as a supervisor. The controller generates demonstrations by following few example paths. This enables an easy to implement learning algorithm that is robust to errors of the model used in the model predictive controller. The policy is trained on the real quadrotor, which requires collision-free exploration around the example path. An adapted version of the supervisor is used to enable exploration. Thus, the policy can be trained from a relatively small number of examples on the real quadrotor, making the training sample efficient.
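The Figure 2 caption names contouring and lag errors as the quantities the path-following supervisor works with. As a rough illustration of that decomposition, the sketch below splits the offset between the vehicle and a reference point on the path into a component along the path tangent (lag) and a component perpendicular to it (contouring). The function name, the sign convention for the lag term, and the use of a known tangent are assumptions; the paper's own approximation may differ in detail.

```python
import numpy as np

def path_errors(position, path_point, path_tangent):
    """Split the offset from a path point into lag and contouring parts.

    position:     vehicle position, shape (d,)
    path_point:   reference point on the path, shape (d,)
    path_tangent: path direction at that point (need not be unit length)
    """
    p = np.asarray(position, dtype=float)
    s = np.asarray(path_point, dtype=float)
    t = np.asarray(path_tangent, dtype=float)
    t = t / np.linalg.norm(t)

    offset = p - s
    lag = float(offset @ t)               # signed distance along the path
    contour_vec = offset - lag * t        # remainder, perpendicular to the path
    contour = float(np.linalg.norm(contour_vec))
    return lag, contour

if __name__ == "__main__":
    # Vehicle 1 m ahead of and 2 m beside the reference point.
    print(path_errors([1.0, 2.0], [0.0, 0.0], [1.0, 0.0]))
```

Because both errors are measured against a point on the path rather than against a timed reference, penalizing them does not tie the vehicle to a schedule, which is one reading of "time-free" in the abstract.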

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes an imitation learning algorithm to train a neural network policy for quadrotor control that directly maps sensor inputs to control commands. The policy is trained to reproduce the behavior of a time-free model predictive path-following controller with collision avoidance, using demonstrations generated by following a small number of example paths. An adapted version of the supervisor enables collision-free exploration during real-robot training, with the claim that the resulting policy generalizes to local avoidance of unseen obstacles while following a global reference path. The approach is presented as sample-efficient and robust to MPC model errors.

Significance. If the results hold with the required evidence, the work would demonstrate a practical route to sample-efficient real-world policy learning for aerial robots by using MPC as a supervisor, potentially reducing the data requirements for deploying learned controllers that handle both path following and reactive avoidance. This could influence hybrid model-based/learning approaches in robotics where real-robot data collection is costly or risky.

major comments (2)
  1. [Abstract] Abstract, final paragraph: the central generalization claim (local collision avoidance of unseen obstacles) depends on the adapted supervisor producing exploration trajectories whose state-action distribution does not systematically differ from an unbiased explorer near obstacles. The manuscript provides no quantitative comparison of visited state distributions, no ablation of the adaptation, and no analysis of whether the supervisor's heuristic is reproduced by the policy rather than general avoidance behavior being learned.
  2. [Abstract] The abstract states the central claims but supplies no quantitative results, error metrics, or ablation details on sample efficiency or generalization performance. Without these, the support for the sample-efficiency and robustness assertions cannot be evaluated from the provided text.
minor comments (1)
  1. [Abstract] The term 'time-free' applied to the MPC controller is introduced without definition or reference; this should be clarified in the method description with a brief explanation of how it differs from standard time-parameterized MPC.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on the abstract. The comments correctly identify that the abstract could be strengthened with additional quantitative support and analysis details. We address each major comment below and indicate planned revisions.

Point-by-point responses
  1. Referee: [Abstract] Abstract, final paragraph: the central generalization claim (local collision avoidance of unseen obstacles) depends on the adapted supervisor producing exploration trajectories whose state-action distribution does not systematically differ from an unbiased explorer near obstacles. The manuscript provides no quantitative comparison of visited state distributions, no ablation of the adaptation, and no analysis of whether the supervisor's heuristic is reproduced by the policy rather than general avoidance behavior being learned.

    Authors: We agree that the abstract does not contain a quantitative comparison of state distributions or an ablation of the supervisor adaptation. The full manuscript describes the adapted supervisor and presents experimental results showing generalization to unseen obstacles, but does not include the requested distribution comparison or ablation. We will revise the abstract to reference key supporting metrics from the experiments and will add a short discussion of exploration trajectories and policy behavior in the revised manuscript to clarify the basis for the generalization claim. revision: partial

  2. Referee: [Abstract] The abstract states the central claims but supplies no quantitative results, error metrics, or ablation details on sample efficiency or generalization performance. Without these, the support for the sample-efficiency and robustness assertions cannot be evaluated from the provided text.

    Authors: The abstract is a concise summary, with quantitative results, error metrics, and experimental details on sample efficiency and generalization provided in the body of the manuscript. To directly address the concern, we will revise the abstract to incorporate specific quantitative results (e.g., number of training trajectories and success rates on unseen obstacles) and error metrics supporting the sample-efficiency and robustness claims. revision: yes

Circularity Check

0 steps flagged

No circularity; method uses independent external MPC supervisor

full rationale

The paper describes a standard imitation learning pipeline in which a separate time-free MPC path-following controller (with an adaptation for safe exploration) generates demonstration trajectories that are then used to train a neural policy. No derivation, equation, or claim reduces the learned policy's performance or generalization to a self-referential definition, a fitted parameter renamed as a prediction, or a load-bearing self-citation. The supervisor is external to the policy being trained, and the sample-efficiency claim rests on the empirical properties of neural-network generalization rather than on any internal fitting that forces the reported outcome.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review yields no explicit free parameters, axioms, or invented entities; the approach rests on standard assumptions of imitation learning and MPC that are not detailed here.



Reference graph

Works this paper leans on

21 extracted references · 21 canonical work pages · 1 internal anchor

  1. [1]

    A fully autonomous indoor quadrotor,

    S. Grzonka, G. Grisetti, and W. Burgard, “A fully autonomous indoor quadrotor,” IEEE Transactions on Robotics, vol. 28, pp. 90–100, 2012

  2. [2]

    A model predictive controller for quadrocopter state interception,

    M. W. Mueller and R. D’Andrea, “A model predictive controller for quadrocopter state interception,” Control Conference (ECC), 2013 European, pp. 1383–1389, 2013

  3. [3]

    Continuous-time trajectory optimization for online uav replanning,

    H. Oleynikova, M. Burri, Z. Taylor, J. Nieto, R. Siegwart, and E. Galceran, “Continuous-time trajectory optimization for online uav replanning,” in Intelligent Robots and Systems (IROS), 2016 IEEE/RSJ International Conference on. IEEE, 2016, pp. 5332–5339

  4. [4]

    Incremental micro-uav motion replanning for exploring unknown environments,

    M. Pivtoraiko, D. Mellinger, and V. Kumar, “Incremental micro-uav motion replanning for exploring unknown environments,” in Robotics and Automation (ICRA), 2013 IEEE International Conference on. IEEE, 2013, pp. 2452–2458

  5. [5]

    Obstacle avoidance with sensor uncertainty for small unmanned aircraft,

    E. Frew and R. Sengupta, “Obstacle avoidance with sensor uncertainty for small unmanned aircraft,” in Decision and Control, 2004. CDC. 43rd IEEE Conference on, vol. 1. IEEE, 2004, pp. 614–619

  6. [6]

    Trajectory tracking with collision avoidance for nonholonomic vehicles with acceleration constraints and limited sensing,

    E. J. Rodríguez-Seda, C. Tang, M. W. Spong, and D. M. Stipanović, “Trajectory tracking with collision avoidance for nonholonomic vehicles with acceleration constraints and limited sensing,” The International Journal of Robotics Research, vol. 33, no. 12, pp. 1569–1592, 2014

  7. [7]

    Automated aerial suspended cargo delivery through reinforcement learning,

    A. Faust, I. Palunko, P. Cruz, R. Fierro, and L. Tapia, “Automated aerial suspended cargo delivery through reinforcement learning,” Artificial Intelligence, 2014

  8. [8]

    Playing Atari with Deep Reinforcement Learning

    V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller, “Playing atari with deep reinforcement learning,” arXiv preprint arXiv:1312.5602, 2013

  9. [9]

    A survey on policy search for robotics,

    M. P. Deisenroth, G. Neumann, J. Peters, et al., “A survey on policy search for robotics,” Foundations and Trends® in Robotics, vol. 2, no. 1–2, pp. 1–142, 2013

  10. [10]

    An application of reinforcement learning to aerobatic helicopter flight,

    P. Abbeel, A. Coates, M. Quigley, and A. Y. Ng, “An application of reinforcement learning to aerobatic helicopter flight,” in NIPS, 2007

  11. [11]

    End-to-end training of deep visuomotor policies,

    S. Levine, C. Finn, T. Darrell, and P. Abbeel, “End-to-end training of deep visuomotor policies,” Journal of Machine Learning Research, vol. 17, no. 39, pp. 1–40, 2016

  12. [12]

    Plato: Policy learning using adaptive trajectory optimization,

    G. Kahn, T. Zhang, S. Levine, and P. Abbeel, “Plato: Policy learning using adaptive trajectory optimization,” Robotics and Automation (ICRA), 2017 IEEE International Conference on, pp. 3342–3349, 2017

  13. [13]

    A reduction of imitation learning and structured prediction to no-regret online learning

    S. Ross, G. J. Gordon, and D. Bagnell, “A reduction of imitation learning and structured prediction to no-regret online learning.” in AISTATS, vol. 1, no. 2, 2011, p. 6

  14. [14]

    Learning monocular reactive uav control in cluttered natural environments,

    S. Ross, N. Melik-Barkhudarov, K. S. Shankar, A. Wendel, D. Dey, J. A. Bagnell, and M. Hebert, “Learning monocular reactive uav control in cluttered natural environments,” in Robotics and Automation (ICRA), 2013 IEEE International Conference on. IEEE, 2013, pp. 1765–1772

  15. [15]

    Interactive control of diverse complex characters with neural networks,

    I. Mordatch, K. Lowrey, G. Andrew, Z. Popovic, and E. V. Todorov, “Interactive control of diverse complex characters with neural networks,” in NIPS, 2015

  16. [16]

    Learning deep control policies for autonomous aerial vehicles with mpc-guided policy search,

    T. Zhang, G. Kahn, S. Levine, and P. Abbeel, “Learning deep control policies for autonomous aerial vehicles with mpc-guided policy search,” in Robotics and Automation (ICRA), 2016 IEEE International Conference on. IEEE, 2016, pp. 528–535

  17. [17]

    Model predictive contouring control,

    D. Lam, C. Manzie, and M. Good, “Model predictive contouring control,” in 49th IEEE Conference on Decision and Control (CDC). IEEE, 2010, pp. 6137–6142

  18. [18]

    Real-time planning for automated multi-view drone cinematography,

    T. Nägeli, L. Meier, A. Domahidi, J. Alonso-Mora, and O. Hilliges, “Real-time planning for automated multi-view drone cinematography,” in ACM Transactions on Graphics (Proceedings of SIGGRAPH), 2017

  19. [19]

    FORCES Pro: code generation for embedded optimization,

    A. Domahidi and J. Jerez, “FORCES Pro: code generation for embedded optimization,” 2016, https://www.embotech.com/FORCES-Pro

  20. [20]

    Robot operating system (ros),

    F. Furrer, M. Burri, M. Achtelik, and R. Siegwart, “Robot operating system (ros),” Studies in Computational Intelligence, vol. 625, The Complete Reference (Volume 1), Chapter 23, 2016

  21. [21]

    Design and use paradigms for gazebo, an open-source multi-robot simulator,

    N. Koenig and A. Howard, “Design and use paradigms for gazebo, an open-source multi-robot simulator,” in Intelligent Robots and Systems (IROS), 2004 IEEE/RSJ International Conference on, vol. 3. IEEE, 2004, pp. 2149–2154