Learning an Urban Air Mobility Encounter Model from Expert Preferences
Pith reviewed 2026-05-24 22:51 UTC · model grok-4.3
The pith
An encounter model for urban air mobility is learned from a domain expert's pairwise preferences over simulated trajectories.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper claims that extending preference-based learning to an MDP formulation allows the reward function of an encounter model to be recovered from expert pairwise preferences, yielding a stochastic policy that generates trajectories representative of expected UAM operations, and that information-maximizing query selection makes this feasible with minimal expert effort.
What carries the argument
A stochastic policy for a Markov decision process whose reward function is recovered from expert pairwise trajectory preferences via preference-based learning.
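One way to picture how a reward function could be recovered from pairwise answers is sketched below. It is a minimal illustration and not the paper's implementation: the linear reward over trajectory features, the logistic (Bradley-Terry) answer model, the uniform prior on the unit ball, and the plain Metropolis sampler are all assumptions made for this example, and the feature vectors are made up.

```python
import math
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def preference_prob(w, phi_a, phi_b):
    """P(expert prefers A over B) under an assumed logistic answer model."""
    return 1.0 / (1.0 + math.exp(-(dot(w, phi_a) - dot(w, phi_b))))

def log_posterior(w, answers):
    """Log-likelihood of the answers with a uniform prior on the unit ball."""
    if dot(w, w) > 1.0:
        return -math.inf
    return sum(math.log(preference_prob(w, a, b)) for a, b in answers)

def sample_weights(answers, dim, n_samples=2000, step=0.2, seed=0):
    """Plain Metropolis sampling of reward weights (illustrative only)."""
    rng = random.Random(seed)
    w = [0.0] * dim
    lp = log_posterior(w, answers)
    samples = []
    for _ in range(n_samples):
        cand = [wi + rng.gauss(0.0, step) for wi in w]
        lp_cand = log_posterior(cand, answers)
        # Accept with probability min(1, posterior ratio).
        if rng.random() < math.exp(min(0.0, lp_cand - lp)):
            w, lp = cand, lp_cand
        samples.append(list(w))
    return samples

# Each answer is (features of preferred trajectory, features of the other).
# These hypothetical answers always favor a larger first feature.
answers = [([1.0, 0.0], [0.0, 0.0]),
           ([0.8, 0.5], [0.1, 0.5]),
           ([0.9, 0.2], [0.2, 0.2])] * 5
samples = sample_weights(answers, dim=2)
mean_w = [sum(s[i] for s in samples) / len(samples) for i in range(2)]
```

The posterior mean in `mean_w` is one possible point estimate of the reward weights; the samples themselves can also be reused to decide which query to ask next.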
If this is right
- New encounter models for UAM can be built before operational data exists.
- Collision avoidance algorithms for UAM can be tested against trajectories that encode expert expectations of future traffic patterns.
- Preference elicitation with information-maximizing queries keeps the expert's time cost to a few minutes.
- The same framework can replace dataset-dependent fitting for other aircraft types where data is scarce.
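The idea of an information-maximizing query can be sketched as follows. This is a hedged illustration only: the page does not say which two querying methods the paper compares, so the criterion used here (mutual information between the expert's answer and the reward weights, estimated from posterior samples) and the logistic answer model are assumptions, as are all names and numbers.

```python
import math

def binary_entropy(p):
    """Entropy in bits of a yes/no answer with probability p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def prefer_prob(w, phi_a, phi_b):
    """Assumed logistic model of the expert preferring A over B."""
    diff = sum(wi * (a - b) for wi, a, b in zip(w, phi_a, phi_b))
    return 1.0 / (1.0 + math.exp(-diff))

def information_gain(samples, phi_a, phi_b):
    """Estimated mutual information between the answer and the weights."""
    probs = [prefer_prob(w, phi_a, phi_b) for w in samples]
    mean_p = sum(probs) / len(probs)
    mean_entropy = sum(binary_entropy(p) for p in probs) / len(probs)
    return binary_entropy(mean_p) - mean_entropy

def select_query(samples, candidates):
    """Pick the candidate trajectory pair with the largest estimated gain."""
    return max(candidates, key=lambda q: information_gain(samples, q[0], q[1]))

# Two hypothetical weight samples that disagree about the first feature.
weight_samples = [[5.0, 0.0], [-5.0, 0.0]]
candidates = [
    ([1.0, 0.0], [0.0, 0.0]),  # pair differs in the disputed feature
    ([0.0, 1.0], [0.0, 0.0]),  # pair differs in a feature both samples ignore
]
best = select_query(weight_samples, candidates)
```

A query is informative under this criterion when the current weight samples disagree about its answer, which is why the first pair is chosen over the second.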
Where Pith is reading between the lines
- The method could extend to other data-poor domains such as novel drone traffic patterns or autonomous surface vehicle encounters.
- Once real UAM data arrives, the learned policies could be used as priors that are then refined with observed trajectories.
- The approach assumes the expert can reliably judge realism from short trajectory snippets; this may require interface design that is not addressed in the paper.
Load-bearing premise
That an expert's preferences over simulated trajectories will produce encounter statistics that match those of real future urban air mobility operations.
What would settle it
Once real UAM flight data becomes available, compare the distribution of encounter geometries, speeds, and altitudes in the learned model against the empirical distribution from actual flights; significant mismatch would falsify the claim.
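Such a comparison could take many forms; the sketch below uses a two-sample Kolmogorov-Smirnov statistic on a single quantity as one possible choice. The statistic is an assumption of this example rather than something the page or the paper prescribes, and the altitude values are made-up placeholders, not real or simulated UAM data.

```python
def ks_statistic(xs, ys):
    """Largest gap between the empirical CDFs of two samples."""
    xs, ys = sorted(xs), sorted(ys)
    i = j = 0
    gap = 0.0
    while i < len(xs) and j < len(ys):
        v = min(xs[i], ys[j])
        # Advance both CDFs past every observation equal to v.
        while i < len(xs) and xs[i] <= v:
            i += 1
        while j < len(ys) and ys[j] <= v:
            j += 1
        gap = max(gap, abs(i / len(xs) - j / len(ys)))
    return gap

# Placeholder altitudes in meters (hypothetical values for illustration).
model_altitudes = [60.0, 75.0, 90.0, 110.0, 120.0]
observed_altitudes = [65.0, 80.0, 95.0, 105.0, 150.0]
altitude_gap = ks_statistic(model_altitudes, observed_altitudes)
```

A statistic near 0 means the two samples look alike on that quantity, and a value near 1 means they barely overlap. The same check would need to be repeated for each encounter variable of interest, such as geometry and speed.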
Original abstract
Airspace models have played an important role in the development and evaluation of aircraft collision avoidance systems for both manned and unmanned aircraft. As Urban Air Mobility (UAM) systems are being developed, we need new encounter models that are representative of their operational environment. Developing such models is challenging due to the lack of data on UAM behavior in the airspace. While previous encounter models for other aircraft types rely on large datasets to produce realistic trajectories, this paper presents an approach to encounter modeling that instead relies on expert knowledge. In particular, recent advances in preference-based learning are extended to tune an encounter model from expert preferences. The model takes the form of a stochastic policy for a Markov decision process (MDP) in which the reward function is learned from pairwise queries of a domain expert. We evaluate the performance of two querying methods that seek to maximize the information obtained from each query. Ultimately, we demonstrate a method for generating realistic encounter trajectories with only a few minutes of an expert's time.
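To make the phrase "stochastic policy for a Markov decision process" concrete, the sketch below builds a toy MDP over altitude bands, solves it by value iteration, and samples trajectories from a softmax policy over the resulting action values. Everything in it is a hypothetical stand-in: the state space, the action set, the deterministic transitions, the reward vector, and the `temperature` parameter are assumptions chosen for illustration and are not taken from the paper.

```python
import math
import random

ACTIONS = (-1, 0, 1)  # descend, hold, climb (hypothetical action set)

def step(state, action, n_states):
    """Deterministic move between altitude bands, clipped at the ends."""
    return min(max(state + action, 0), n_states - 1)

def q_values(reward, gamma=0.9, iters=200):
    """Value iteration, then Q(s, a) = r(s) + gamma * V(s')."""
    n = len(reward)
    v = [0.0] * n
    for _ in range(iters):
        v = [max(reward[s] + gamma * v[step(s, a, n)] for a in ACTIONS)
             for s in range(n)]
    return [[reward[s] + gamma * v[step(s, a, n)] for a in ACTIONS]
            for s in range(n)]

def softmax_policy(q_row, temperature=1.0):
    """Action probabilities; higher temperature gives more random behavior."""
    m = max(q_row)
    exps = [math.exp((q - m) / temperature) for q in q_row]
    total = sum(exps)
    return [e / total for e in exps]

def rollout(reward, start, horizon, temperature=1.0, seed=0):
    """Sample one trajectory of altitude bands from the stochastic policy."""
    rng = random.Random(seed)
    q = q_values(reward)
    s, traj = start, [start]
    for _ in range(horizon):
        probs = softmax_policy(q[s], temperature)
        a = rng.choices(ACTIONS, weights=probs)[0]
        s = step(s, a, len(reward))
        traj.append(s)
    return traj

# A stand-in for a learned reward that favors the middle altitude band.
reward = [0.0, 0.0, 1.0, 0.0, 0.0]
trajectory = rollout(reward, start=0, horizon=10)
```

Because actions are sampled rather than chosen greedily, repeated rollouts with different seeds give a spread of trajectories around the behavior the reward favors, which is the property an encounter model needs.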
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper extends preference-based reinforcement learning to learn a reward function for an MDP-based stochastic policy that generates Urban Air Mobility encounter trajectories. It evaluates two information-maximizing query selection methods and claims that realistic encounters can be produced from only a few minutes of domain-expert pairwise preference input, addressing the absence of operational UAM data.
Significance. If the central claim holds, the work offers a practical alternative to data-driven encounter modeling for emerging UAM collision-avoidance evaluation, where real trajectories do not yet exist. It demonstrates an application of preference-based RL with low expert burden and compares query strategies, which could generalize to other safety-critical domains lacking observational data.
major comments (2)
- [Abstract / Evaluation] Abstract and evaluation section: the claim that the generated trajectories are 'realistic' and representative of future UAM operational environments rests on expert preferences alone, with no quantitative metrics, comparison to any external criterion, or validation against plausible dynamics reported. This is load-bearing for the central demonstration.
- [Method / Results] The weakest assumption—that expert pairwise preferences over simulated trajectories will yield models representative of actual UAM environments—is invoked when presenting the learned policy as realistic, yet no test of this assumption (e.g., sensitivity analysis or consistency checks) is described.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive report. The comments correctly identify that our evaluation of trajectory realism rests entirely on expert preferences in the absence of operational UAM data. We address each point below and indicate where revisions will be made to strengthen the manuscript.
-
Referee: [Abstract / Evaluation] Abstract and evaluation section: the claim that the generated trajectories are 'realistic' and representative of future UAM operational environments rests on expert preferences alone, with no quantitative metrics, comparison to any external criterion, or validation against plausible dynamics reported. This is load-bearing for the central demonstration.
Authors: We agree that the central claim relies on expert preferences without external quantitative benchmarks. This limitation is inherent to the problem setting, as the work is motivated by the complete absence of UAM operational data. We will revise the abstract and evaluation section to replace unqualified statements of 'realism' with phrasing such as 'consistent with domain-expert preferences' and add a paragraph discussing how the approach could be validated against future observational data or physics-based plausibility checks. Revision: partial.
-
Referee: [Method / Results] The weakest assumption—that expert pairwise preferences over simulated trajectories will yield models representative of actual UAM environments—is invoked when presenting the learned policy as realistic, yet no test of this assumption (e.g., sensitivity analysis or consistency checks) is described.
Authors: The manuscript does not report sensitivity analysis on the learned reward or cross-expert consistency metrics. With a single expert, such checks are necessarily limited; however, we can add an analysis of reward-function stability across different query batches and a dedicated limitations subsection that explicitly flags the assumption and outlines how multi-expert or future-data consistency tests could be performed. Revision: partial.
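A stability analysis of the kind mentioned in this response could be as simple as comparing the reward weights estimated from different query batches. The sketch below is one hypothetical way to do that, using cosine similarity between weight vectors; the metric, the function names, and the weight values are assumptions for illustration and do not come from the manuscript.

```python
import math
from itertools import combinations

def cosine_similarity(u, v):
    """Cosine of the angle between two reward weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def worst_case_agreement(weight_estimates):
    """Smallest pairwise similarity across all batch estimates."""
    return min(cosine_similarity(u, v)
               for u, v in combinations(weight_estimates, 2))

# Hypothetical weight estimates from three separate query batches.
batch_estimates = [
    [0.70, 0.20, -0.10],
    [0.65, 0.25, -0.05],
    [0.72, 0.15, -0.12],
]
agreement = worst_case_agreement(batch_estimates)
```

Values close to 1 would suggest that the learned reward direction does not depend much on which queries were asked, while low or negative values would flag instability worth reporting.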
- Quantitative validation against real UAM trajectory data or external operational criteria, which do not yet exist for this domain.
Circularity Check
No significant circularity identified
The paper presents a preference-based RL method that learns an MDP reward function directly from external expert pairwise trajectory preferences. The derivation relies on standard active querying and optimization steps applied to this independent human input rather than any internal model outputs or self-referential fits. No equations reduce generated trajectories to quantities defined by the model's own predictions, no self-citation chains bear the central claim, and the approach is self-contained against the external benchmark of expert judgments. This is the normal non-circular case for a demonstration that starts from the explicit premise of absent operational data.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: expert pairwise preferences over trajectories accurately reflect the statistics of future UAM operations.