Learning an Urban Air Mobility Encounter Model from Expert Preferences

Anne-Claire Le Bihan; Mykel J. Kochenderfer; Sydney M. Katz

arxiv: 1907.05575 · v1 · pith:U6I7FCNZnew · submitted 2019-07-12 · 💻 cs.AI

Pith reviewed 2026-05-24 22:51 UTC · model grok-4.3

keywords urban air mobility · encounter model · preference-based learning · Markov decision process · collision avoidance · expert preferences · stochastic policy · airspace modeling

The pith

An encounter model for urban air mobility is learned from a domain expert's pairwise preferences over simulated trajectories.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a way to generate realistic encounter trajectories for urban air mobility aircraft without access to large operational datasets. It models the problem as a Markov decision process and learns the underlying reward function directly from an expert's answers to pairwise queries about which trajectories look more realistic. Two active querying strategies are tested to extract the most information from each comparison. The result is that plausible models can be produced after only a few minutes of expert time. This approach is positioned as an alternative to data-driven statistical fitting used in prior manned and unmanned aircraft encounter models.

Core claim

The paper claims that extending preference-based learning to an MDP formulation allows the reward function of an encounter model to be recovered from expert pairwise preferences, yielding a stochastic policy that generates trajectories representative of expected UAM operations, and that information-maximizing query selection makes this feasible with minimal expert effort.

What carries the argument

A stochastic policy for a Markov decision process whose reward function is recovered from expert pairwise trajectory preferences via preference-based learning.
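As a rough illustration of how a pairwise answer can update beliefs about a reward function, the sketch below reweights sampled reward parameters after one expert response. It assumes a reward that is linear in trajectory features and a logistic (Bradley-Terry style) response model; neither assumption is confirmed by this page, and all names (`preference_likelihood`, `update_weights`, `phi_a`, `phi_b`) are hypothetical.

```python
import math
import random


def preference_likelihood(w, phi_a, phi_b):
    """P(expert prefers A over B | w) under an assumed logistic model."""
    diff = sum(wi * (a - b) for wi, a, b in zip(w, phi_a, phi_b))
    return 1.0 / (1.0 + math.exp(-diff))


def update_weights(samples, weights, phi_a, phi_b, prefers_a):
    """Reweight reward-parameter samples after one answered query."""
    updated = []
    for w, p in zip(samples, weights):
        like = preference_likelihood(w, phi_a, phi_b)
        updated.append(p * (like if prefers_a else 1.0 - like))
    total = sum(updated)
    return [u / total for u in updated]


def weighted_mean(samples, weights):
    """Mean reward parameters under the given sample weights."""
    dim = len(samples[0])
    return [sum(p * w[i] for w, p in zip(samples, weights)) for i in range(dim)]


# Synthetic prior samples over a 2-dimensional reward weight vector.
rng = random.Random(0)
samples = [[rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)] for _ in range(500)]
prior = [1.0 / len(samples)] * len(samples)

# Hypothetical query: trajectory A scores higher on feature 0, expert picks A.
phi_a, phi_b = [1.0, 0.0], [0.0, 0.0]
posterior = update_weights(samples, prior, phi_a, phi_b, prefers_a=True)
prior_mean = weighted_mean(samples, prior)
post_mean = weighted_mean(samples, posterior)
```

After the update, the mean weight on feature 0 shifts upward, which is the sense in which each answered query narrows the set of plausible reward functions.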

If this is right

New encounter models for UAM can be built before operational data exists.
Collision avoidance algorithms for UAM can be tested against trajectories that encode expert expectations of future traffic patterns.
Preference elicitation with information-maximizing queries reduces the expert time cost to minutes rather than hours or days.
The same framework can replace dataset-dependent fitting for other aircraft types where data is scarce.
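One way to picture information-maximizing query selection is a heuristic in the spirit of active preference-based learning [11]: among candidate queries, pick the one whose worst-case answer still removes the most posterior mass. The sketch below assumes a logistic response model over linear rewards and a finite candidate set; it is not necessarily either of the two querying methods the paper evaluates, and the function names are hypothetical.

```python
import math


def prob_prefers_a(w, phi_a, phi_b):
    """Assumed logistic probability that the expert prefers A given weights w."""
    diff = sum(wi * (a - b) for wi, a, b in zip(w, phi_a, phi_b))
    return 1.0 / (1.0 + math.exp(-diff))


def query_score(samples, phi_a, phi_b):
    """Worst-case expected fraction of posterior mass removed by the answer."""
    p_a = sum(prob_prefers_a(w, phi_a, phi_b) for w in samples) / len(samples)
    # Answer "A" removes about 1 - p_a of the mass; answer "B" removes about p_a.
    return min(p_a, 1.0 - p_a)


def select_query(samples, candidates):
    """Return the candidate (phi_a, phi_b) pair with the highest score."""
    return max(candidates, key=lambda q: query_score(samples, q[0], q[1]))


# Samples disagree about feature 0 but agree about feature 1.
samples = [[1.0, 3.0], [-1.0, 3.0], [2.0, 3.0], [-2.0, 3.0]]
q_split = ([1.0, 0.0], [0.0, 0.0])   # differs on the contested feature
q_agree = ([0.0, 1.0], [0.0, 0.0])   # differs on the settled feature
best = select_query(samples, [q_agree, q_split])
```

Here `best` is `q_split`: a query about a feature the samples already agree on tells the learner almost nothing, so the expert's time is spent on the contested one.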

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The method could extend to other data-poor domains such as novel drone traffic patterns or autonomous surface vehicle encounters.
Once real UAM data arrives, the learned policies could be used as priors that are then refined with observed trajectories.
The approach assumes the expert can reliably judge realism from short trajectory snippets; this may require interface design that is not addressed in the paper.

Load-bearing premise

That an expert's preferences over simulated trajectories will produce encounter statistics that match those of real future urban air mobility operations.

What would settle it

Once real UAM flight data becomes available, compare the distribution of encounter geometries, speeds, and altitudes in the learned model against the empirical distribution from actual flights; significant mismatch would falsify the claim.
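A comparison of this kind could be sketched with a two-sample Kolmogorov-Smirnov statistic computed per variable (for example altitude or speed). The choice of statistic is an assumption made for illustration; the page does not name a metric, and the function name `ks_statistic` is hypothetical.

```python
import bisect


def ks_statistic(xs, ys):
    """Maximum gap between the empirical CDFs of two one-dimensional samples."""
    xs, ys = sorted(xs), sorted(ys)
    gap = 0.0
    # The largest gap between two step functions occurs at an observed value.
    for v in xs + ys:
        cdf_x = bisect.bisect_right(xs, v) / len(xs)
        cdf_y = bisect.bisect_right(ys, v) / len(ys)
        gap = max(gap, abs(cdf_x - cdf_y))
    return gap


# Synthetic placeholders standing in for model output and observed flights.
same = ks_statistic([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
disjoint = ks_statistic([1.0, 2.0, 3.0], [10.0, 11.0, 12.0])
```

A value near 0 indicates closely matching marginal distributions and a value near 1 indicates almost no overlap; what counts as a "significant mismatch" would still need a threshold or a hypothesis test.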

Figures

Figures reproduced from arXiv: 1907.05575 by Anne-Claire Le Bihan, Mykel J. Kochenderfer, Sydney M. Katz.

**Figure 1.** Sample query. Plots are generated by sampling initial states and performing rollouts of the query policies. The initial states are the same for each ...

**Figure 2.** Performance of multiobjective optimization querying method for ...

**Figure 3.** Comparison of convergence for each querying method.

**Figure 4.** Univariate Gaussian kernel density estimate of the distribution of each ...

**Figure 5.** Progression of bivariate Gaussian kernel density estimate of ...

**Figure 6.** Error analysis for both querying methods. All curves are averaged ...

**Figure 7.** Effect of the precision parameter on the distribution of trajectories.

Original abstract

Airspace models have played an important role in the development and evaluation of aircraft collision avoidance systems for both manned and unmanned aircraft. As Urban Air Mobility (UAM) systems are being developed, we need new encounter models that are representative of their operational environment. Developing such models is challenging due to the lack of data on UAM behavior in the airspace. While previous encounter models for other aircraft types rely on large datasets to produce realistic trajectories, this paper presents an approach to encounter modeling that instead relies on expert knowledge. In particular, recent advances in preference-based learning are extended to tune an encounter model from expert preferences. The model takes the form of a stochastic policy for a Markov decision process (MDP) in which the reward function is learned from pairwise queries of a domain expert. We evaluate the performance of two querying methods that seek to maximize the information obtained from each query. Ultimately, we demonstrate a method for generating realistic encounter trajectories with only a few minutes of an expert's time.
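The abstract does not say how the stochastic policy is obtained from the learned reward. One common construction, assumed here purely for illustration, is a softmax over action values scaled by a precision parameter, so that higher precision concentrates probability on the best-valued action. The function names and the example action values are hypothetical.

```python
import math
import random


def softmax_policy(q_values, precision):
    """Action probabilities proportional to exp(precision * Q), computed stably."""
    top = max(q_values)
    exps = [math.exp(precision * (q - top)) for q in q_values]
    total = sum(exps)
    return [e / total for e in exps]


def sample_action(q_values, precision, rng):
    """Draw one action index from the softmax distribution."""
    probs = softmax_policy(q_values, precision)
    return rng.choices(range(len(q_values)), weights=probs, k=1)[0]


q_values = [1.0, 2.0, 0.5]            # hypothetical action values in one state
uniform = softmax_policy(q_values, 0.0)
probs_low = softmax_policy(q_values, 0.5)
probs_high = softmax_policy(q_values, 5.0)
action = sample_action(q_values, 1.0, random.Random(0))
```

With precision 0 the policy is uniform, and as precision grows the trajectories it generates become less varied, which is one way a single scalar can trade off diversity against fidelity to the learned reward.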

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper extends preference-based reinforcement learning to learn a reward function for an MDP-based stochastic policy that generates Urban Air Mobility encounter trajectories. It evaluates two information-maximizing query selection methods and claims that realistic encounters can be produced from only a few minutes of domain-expert pairwise preference input, addressing the absence of operational UAM data.

Significance. If the central claim holds, the work offers a practical alternative to data-driven encounter modeling for emerging UAM collision-avoidance evaluation, where real trajectories do not yet exist. It demonstrates an application of preference-based RL with low expert burden and compares query strategies, which could generalize to other safety-critical domains lacking observational data.

major comments (2)

[Abstract / Evaluation] Abstract and evaluation section: the claim that the generated trajectories are 'realistic' and representative of future UAM operational environments rests on expert preferences alone, with no quantitative metrics, comparison to any external criterion, or validation against plausible dynamics reported. This is load-bearing for the central demonstration.
[Method / Results] The weakest assumption—that expert pairwise preferences over simulated trajectories will yield models representative of actual UAM environments—is invoked when presenting the learned policy as realistic, yet no test of this assumption (e.g., sensitivity analysis or consistency checks) is described.

Simulated Authors' Rebuttal

2 responses · 1 unresolved

We thank the referee for the detailed and constructive report. The comments correctly identify that our evaluation of trajectory realism rests entirely on expert preferences in the absence of operational UAM data. We address each point below and indicate where revisions will be made to strengthen the manuscript.

Referee: [Abstract / Evaluation] Abstract and evaluation section: the claim that the generated trajectories are 'realistic' and representative of future UAM operational environments rests on expert preferences alone, with no quantitative metrics, comparison to any external criterion, or validation against plausible dynamics reported. This is load-bearing for the central demonstration.

Authors: We agree that the central claim relies on expert preferences without external quantitative benchmarks. This limitation is inherent to the problem setting, as the work is motivated by the complete absence of UAM operational data. We will revise the abstract and evaluation section to replace unqualified statements of 'realism' with phrasing such as 'consistent with domain-expert preferences' and add a paragraph discussing how the approach could be validated against future observational data or physics-based plausibility checks. revision: partial
Referee: [Method / Results] The weakest assumption—that expert pairwise preferences over simulated trajectories will yield models representative of actual UAM environments—is invoked when presenting the learned policy as realistic, yet no test of this assumption (e.g., sensitivity analysis or consistency checks) is described.

Authors: The manuscript does not report sensitivity analysis on the learned reward or cross-expert consistency metrics. With a single expert, such checks are necessarily limited; however, we can add an analysis of reward-function stability across different query batches and a dedicated limitations subsection that explicitly flags the assumption and outlines how multi-expert or future-data consistency tests could be performed. revision: partial

standing simulated objections not resolved

Quantitative validation against real UAM trajectory data or external operational criteria, which do not yet exist for this domain.

Circularity Check

0 steps flagged

No significant circularity identified

The paper presents a preference-based RL method that learns an MDP reward function directly from external expert pairwise trajectory preferences. The derivation relies on standard active querying and optimization steps applied to this independent human input rather than any internal model outputs or self-referential fits. No equations reduce generated trajectories to quantities defined by the model's own predictions, no self-citation chains bear the central claim, and the approach is self-contained against the external benchmark of expert judgments. This is the normal non-circular case for a demonstration that starts from the explicit premise of absent operational data.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that expert preferences suffice to define realistic encounters and on the modeling choice of an MDP whose reward is the only learned component.

axioms (1)

domain assumption: Expert pairwise preferences over trajectories accurately reflect the statistics of future UAM operations.
Invoked when the learned policy is asserted to generate realistic encounters without any real flight data for calibration.

pith-pipeline@v0.9.0 · 5704 in / 1139 out tokens · 22072 ms · 2026-05-24T22:51:26.270099+00:00 · methodology

Reference graph

Works this paper leans on

19 extracted references · 19 canonical work pages · 1 internal anchor

[1] B. J. Chludzinski, “Evaluation of TCAS II version 7.1 using the FAA fast-time encounter generator model,” Massachusetts Institute of Technology, Lincoln Laboratory, Project Report ATC-346, 2009.

[2] J. E. Holland, M. J. Kochenderfer, and W. A. Olson, “Optimizing the next generation collision avoidance system for safe, suitable, and acceptable operational performance,” Air Traffic Control Quarterly, vol. 21, no. 3, pp. 275–297, 2013.

[3] M. J. Kochenderfer, M. W. M. Edwards, L. P. Espindle, J. K. Kuchar, and J. D. Griffith, “Airspace encounter models for estimating collision risk,” Journal of Guidance, Control, and Dynamics, vol. 33, no. 2, pp. 487–499, 2010.

[4] E. R. Mueller and M. J. Kochenderfer, “Simulation comparison of collision avoidance algorithms for small multi-rotor aircraft,” in AIAA Modeling and Simulation Technologies Conference, 2016, p. 3674.

[5] M. J. Kochenderfer, L. P. Espindle, J. K. Kuchar, and J. D. Griffith, “Correlated encounter model for cooperative aircraft in the national airspace system version 1.0,” Massachusetts Institute of Technology, Lincoln Laboratory, Project Report ATC-344, 2008.

[6] S. T. Barratt, M. J. Kochenderfer, and S. P. Boyd, “Learning probabilistic trajectory models of aircraft in terminal airspace from position data,” IEEE Transactions on Intelligent Transportation Systems, 2018.

[7] A. Y. Ng and S. Russell, “Algorithms for inverse reinforcement learning,” in International Conference on Machine Learning (ICML), 2000.

[8] R. Akrour, M. Schoenauer, and M. Sebag, “Preference-based policy learning,” in Joint European Conference on Machine Learning and Knowledge Discovery in Databases, Springer, 2011, pp. 12–27.

[9] A. Wilson, A. Fern, and P. Tadepalli, “A Bayesian approach for policy learning from trajectory preference queries,” in Advances in Neural Information Processing Systems (NIPS), 2012.

[10] P. F. Christiano, J. Leike, T. Brown, M. Martic, S. Legg, and D. Amodei, “Deep reinforcement learning from human preferences,” in Advances in Neural Information Processing Systems, 2017, pp. 4299–4307.

[11] D. Sadigh, A. D. Dragan, S. Sastry, and S. A. Seshia, “Active preference-based learning of reward functions,” in Robotics: Science and Systems (RSS), 2017.

[12] E. Bıyık and D. Sadigh, “Batch active preference-based learning of reward functions,” arXiv preprint arXiv:1810.04303, 2018.

[13] M. J. Kochenderfer and T. A. Wheeler, Algorithms for optimization. MIT Press, 2019.

[14] J. R. Lepird, M. P. Owen, and M. J. Kochenderfer, “Bayesian preference elicitation for multiobjective engineering design optimization,” Journal of Aerospace Information Systems, vol. 12, no. 10, pp. 634–645, 2015.

[15] V. S. Iyengar, J. Lee, and M. Campbell, “Evaluating multiple attribute items using queries,” in ACM Conference on Electronic Commerce, ACM, 2001, pp. 144–153.

[16] C. Wirth, R. Akrour, G. Neumann, and J. Fürnkranz, “A survey of preference-based reinforcement learning methods,” Journal of Machine Learning Research, vol. 18, no. 1, pp. 4945–4990, 2017.

[17] M. J. Kochenderfer, Decision making under uncertainty: Theory and application. MIT Press, 2015.

[18] R. Bellman, “The theory of dynamic programming,” Bulletin of the American Mathematical Society, vol. 60, no. 6, pp. 503–515, 1954.

[19] H. Haario, E. Saksman, and J. Tamminen, “An adaptive Metropolis algorithm,” Bernoulli, vol. 7, no. 2, pp. 223–242, 2001.
