PilotBench: A Benchmark for General Aviation Agents with Safety Constraints
Pith reviewed 2026-05-10 17:53 UTC · model grok-4.3
The pith
LLMs follow safety instructions in flight trajectory prediction but at the cost of lower numerical precision than traditional forecasters.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
PilotBench reveals a Precision-Controllability Dichotomy in safety-critical aviation prediction: traditional forecasters achieve a superior MAE of 7.01 but lack semantic reasoning capabilities, while LLMs achieve 86-89 percent instruction-following at the cost of a higher MAE of 11-14. This pattern is quantified via the Pilot-Score composite metric on 708 real trajectories spanning nine flight phases, with further LLM degradation observed in complex phases such as Climb and Approach.
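The phase-stratified comparison behind this claim can be illustrated with a minimal sketch that groups absolute errors by flight phase. The function name `mae_by_phase`, the record layout, and the toy numbers are all hypothetical; they are not taken from the paper or from PilotBench data.

```python
from collections import defaultdict


def mae_by_phase(records):
    """Return {phase: mean absolute error} for (phase, y_true, y_pred) records."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for phase, y_true, y_pred in records:
        totals[phase] += abs(y_true - y_pred)
        counts[phase] += 1
    return {phase: totals[phase] / counts[phase] for phase in totals}


# Hypothetical toy records (phase, ground truth, prediction); not benchmark data.
records = [
    ("Approach", 100.0, 101.0),
    ("Approach", 100.0, 99.0),
    ("Climb", 50.0, 56.0),
    ("Climb", 60.0, 52.0),
]
phase_mae = mae_by_phase(records)
```

Comparing the per-phase values for an LLM against those of a traditional forecaster is one way such a gap could be made visible; how the paper aggregates across its 34 channels is not stated on this page.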
What carries the argument
Pilot-Score, a composite evaluation metric that balances 60 percent regression accuracy with 40 percent instruction adherence and safety compliance on synchronized 34-channel telemetry from real aviation trajectories.
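As a rough illustration of the 60/40 weighting, the sketch below combines two sub-scores into a single value. It assumes both sub-scores have already been normalized to [0, 1]; how the paper maps MAE/RMSE to a regression score and how it scores adherence and safety compliance are not specified on this page, so those steps are deliberately left out. The function and parameter names are hypothetical.

```python
def pilot_score(regression_score, adherence_score,
                w_regression=0.6, w_adherence=0.4):
    """Weighted composite of two sub-scores, each assumed to lie in [0, 1].

    The 0.6 / 0.4 defaults mirror the split described in the text; the
    normalization of each sub-score is an assumption of this sketch.
    """
    for name, value in (("regression_score", regression_score),
                        ("adherence_score", adherence_score)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {value}")
    return w_regression * regression_score + w_adherence * adherence_score


# Example: middling regression accuracy, perfect instruction adherence.
example = pilot_score(0.5, 1.0)
```

Keeping the weights as parameters makes it easy to check how sensitive a model ranking is to the 60/40 choice, which the ledger below lists as the one free parameter.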
Load-bearing premise
The 708 selected trajectories and the safety constraints encoded in Pilot-Score adequately represent the full range of real-world general aviation risks and operational variability.
What would settle it
Evaluating the same 41 models on an independent set of flight trajectories drawn from different aircraft types, regions, or weather conditions, and checking whether the precision-controllability split and the phase degradation pattern persist.
Original abstract
As Large Language Models (LLMs) advance toward embodied AI agents operating in physical environments, a fundamental question emerges: can models trained on text corpora reliably reason about complex physics while adhering to safety constraints? We address this through PilotBench, a benchmark evaluating LLMs on safety-critical flight trajectory and attitude prediction. Built from 708 real-world general aviation trajectories spanning nine operationally distinct flight phases with synchronized 34-channel telemetry, PilotBench systematically probes the intersection of semantic understanding and physics-governed prediction through comparative analysis of LLMs and traditional forecasters. We introduce Pilot-Score, a composite metric balancing 60% regression accuracy with 40% instruction adherence and safety compliance. Comparative evaluation across 41 models uncovers a Precision-Controllability Dichotomy: traditional forecasters achieve superior MAE of 7.01 but lack semantic reasoning capabilities, while LLMs gain controllability with 86-89% instruction-following at the cost of 11-14 MAE precision. Phase-stratified analysis further exposes a Dynamic Complexity Gap: LLM performance degrades sharply in high-workload phases such as Climb and Approach, suggesting brittle implicit physics models. These empirical discoveries motivate hybrid architectures combining LLMs' symbolic reasoning with specialized forecasters' numerical precision. PilotBench provides a rigorous foundation for advancing embodied AI in safety-constrained domains.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces PilotBench, a benchmark for evaluating LLMs and traditional forecasters on safety-critical general aviation flight trajectory and attitude prediction. It is built from 708 real-world trajectories across nine phases with 34-channel telemetry, introduces the Pilot-Score metric (60% regression accuracy + 40% instruction adherence/safety compliance), and reports a Precision-Controllability Dichotomy: traditional models achieve MAE 7.01 while LLMs reach 86-89% instruction following at the cost of 11-14 MAE, with sharper degradation in high-workload phases such as Climb and Approach. The work motivates hybrid LLM-forecaster architectures for embodied AI in safety-constrained domains.
Significance. If the benchmark construction and evaluation protocols hold under scrutiny, the work provides a useful empirical testbed for embodied AI research by quantifying trade-offs between numerical precision and semantic controllability on real telemetry data. The scale (41 models, phase-stratified analysis) and introduction of a composite Pilot-Score are strengths that could help guide hybrid system design. However, the absence of external validation against operational statistics limits claims about real-world generalizability.
major comments (2)
- [Benchmark construction] Benchmark construction (abstract and implied §3): The 708 trajectories and Pilot-Score weights lack any external anchoring to NTSB incident statistics, FAA operational data, or expert pilot review. This is load-bearing for the central Precision-Controllability Dichotomy claim, as unrepresentative phase or channel distributions would render the reported MAE gap (7.01 vs 11-14) and LLM degradation in Climb/Approach benchmark-specific rather than general.
- [Evaluation protocol] Evaluation protocol (abstract and results): No details are provided on data preprocessing, LLM prompt engineering, statistical tests, or error-bar computation for the comparative MAE and 86-89% adherence numbers. Without these, the dichotomy cannot be independently verified and the soundness of the 41-model comparison is compromised.
minor comments (1)
- [Pilot-Score definition] Notation for Pilot-Score components could be clarified with an explicit equation or table showing the 60/40 weighting and safety compliance sub-metrics.
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed review. The comments identify key areas where additional transparency and discussion will strengthen the manuscript. We address each major comment below and commit to revisions that improve reproducibility and acknowledge limitations without overstating generalizability.
Point-by-point responses
- Referee: [Benchmark construction] Benchmark construction (abstract and implied §3): The 708 trajectories and Pilot-Score weights lack any external anchoring to NTSB incident statistics, FAA operational data, or expert pilot review. This is load-bearing for the central Precision-Controllability Dichotomy claim, as unrepresentative phase or channel distributions would render the reported MAE gap (7.01 vs 11-14) and LLM degradation in Climb/Approach benchmark-specific rather than general.
Authors: We agree that the absence of explicit anchoring to NTSB incident statistics, FAA operational data, or expert pilot review is a limitation for claims of broad real-world applicability. The 708 trajectories were selected from publicly available general aviation telemetry to span nine operationally distinct phases, but we did not perform distributional matching to incident rates or solicit pilot validation for the 60/40 Pilot-Score weights. The Precision-Controllability Dichotomy and phase-stratified degradations are therefore demonstrated within this specific benchmark rather than proven as universal. In revision we will (a) expand §3 with a detailed data-sourcing and phase-selection protocol, (b) add a limitations subsection that explicitly discusses the lack of external validation and its implications, and (c) note that future work could incorporate NTSB/FAA statistics for re-weighting. These changes will clarify the scope of the reported findings.
Revision: partial
- Referee: [Evaluation protocol] Evaluation protocol (abstract and results): No details are provided on data preprocessing, LLM prompt engineering, statistical tests, or error-bar computation for the comparative MAE and 86-89% adherence numbers. Without these, the dichotomy cannot be independently verified and the soundness of the 41-model comparison is compromised.
Authors: We acknowledge that the submitted version did not provide sufficient methodological detail for independent reproduction. The full manuscript contains descriptions of telemetry preprocessing, prompt templates, and metric computation, but these were not presented at the level of explicit steps, templates, or statistical procedures. In the revised manuscript we will expand the evaluation section to include: (1) precise preprocessing steps (channel-wise z-score normalization, phase segmentation rules, and handling of missing values); (2) the complete prompt templates used for all LLM evaluations, including safety-constraint phrasing; (3) bootstrapped 95% confidence intervals computed over 1,000 resamples for all MAE and adherence figures; and (4) statistical significance tests (paired Wilcoxon signed-rank tests with Bonferroni correction) for the reported model comparisons. These additions will allow full verification of the 41-model results and the observed dichotomy.
Revision: yes
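The bootstrapped 95% confidence intervals promised in item (3) could be computed along the following lines. This is a minimal percentile-bootstrap sketch using only the Python standard library. It assumes one absolute error per trajectory, resampled independently; the rebuttal does not state the authors' resampling unit, and the function name and toy values are hypothetical.

```python
import random


def bootstrap_mae_ci(abs_errors, n_resamples=1000, confidence=0.95, seed=0):
    """Percentile-bootstrap confidence interval for the mean absolute error.

    abs_errors: sequence of per-sample absolute errors (assumed independent).
    Returns a (lower, upper) tuple.
    """
    if not abs_errors:
        raise ValueError("abs_errors must be non-empty")
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    n = len(abs_errors)
    means = []
    for _ in range(n_resamples):
        sample = rng.choices(abs_errors, k=n)  # resample with replacement
        means.append(sum(sample) / n)
    means.sort()
    alpha = 1.0 - confidence
    lo_idx = int((alpha / 2) * (n_resamples - 1))
    hi_idx = int((1 - alpha / 2) * (n_resamples - 1))
    return means[lo_idx], means[hi_idx]


# Hypothetical per-trajectory absolute errors; not benchmark data.
errors = [6.0, 7.5, 8.0, 5.5, 9.0, 7.0, 6.5, 8.5]
ci_low, ci_high = bootstrap_mae_ci(errors)
```

If errors within a trajectory or flight phase are correlated, resampling whole trajectories rather than individual time steps would be the safer choice; which of these the authors intend is not specified.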
Circularity Check
No circularity: purely empirical benchmark from external data
The paper constructs PilotBench directly from 708 real-world trajectories and 34-channel telemetry, defines Pilot-Score as an explicit composite (60% regression + 40% adherence/safety), and reports comparative MAE and instruction-following rates across 41 models. No equations, derivations, fitted parameters renamed as predictions, or self-citations appear in the load-bearing claims. The Precision-Controllability Dichotomy and phase-stratified gaps are observational outcomes on held-out data, not reductions to the paper's own inputs. The benchmark is therefore self-contained against external telemetry sources.
Axiom & Free-Parameter Ledger
free parameters (1)
- Pilot-Score weights = 60/40 split
axioms (1)
- domain assumption: The 708 real-world trajectories and nine flight phases adequately represent general aviation safety-critical scenarios
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean (Jcost definition and uniqueness), washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: We introduce Pilot-Score, a composite metric balancing 60% regression accuracy with 40% instruction adherence and safety compliance... F(MAE,RMSE,c) = 0.6·R(MAE,RMSE) + 0.4·I(c)
- IndisputableMonolith/Foundation/AlexanderDuality.lean (D=3 forcing via circle linking), alexander_duality_circle_linking
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: Nine-phase segmentation... five straight segments P1–P5 and four transition segments T1–T4
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.