
arxiv: 2604.08987 · v1 · submitted 2026-04-10 · 💻 cs.AI

PilotBench: A Benchmark for General Aviation Agents with Safety Constraints

Pith reviewed 2026-05-10 17:53 UTC · model grok-4.3

classification 💻 cs.AI
keywords aviation benchmark · LLM agents · trajectory prediction · safety constraints · embodied AI · flight phases · model comparison · Pilot-Score

The pith

LLMs follow safety instructions in flight trajectory prediction but at the cost of lower numerical precision than traditional forecasters.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

PilotBench evaluates whether large language models can reason about complex flight physics while obeying safety constraints, using 708 real-world general aviation trajectories across nine phases. The benchmark compares 41 models on synchronized telemetry data and introduces Pilot-Score, a metric that weights regression accuracy at 60 percent against instruction adherence and safety compliance at 40 percent. Traditional forecasters deliver the lowest mean absolute error, 7.01, yet fail to interpret semantic instructions, whereas LLMs reach 86 to 89 percent instruction-following rates but incur a mean absolute error of 11 to 14. Phase-level breakdowns reveal that LLM performance drops sharply during high-workload segments such as climb and approach. The results motivate hybrid systems that pair LLM symbolic reasoning with specialized numerical forecasters.

Core claim

PilotBench reveals a Precision-Controllability Dichotomy in safety-critical aviation prediction: traditional forecasters achieve a superior MAE of 7.01 but lack semantic reasoning capabilities, while LLMs achieve 86-89 percent instruction-following at the cost of a higher MAE of 11-14. This pattern is quantified via the Pilot-Score composite metric on 708 real trajectories spanning nine flight phases, with further degradation observed in complex phases such as Climb and Approach.
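The phase-level degradation described above rests on computing MAE separately for each flight phase. A minimal sketch of that stratification follows, assuming each prediction is available as a (phase, predicted, actual) triple; the function name `mae_by_phase`, the record layout, and the sample values are illustrative assumptions, not taken from the paper.

```python
from collections import defaultdict


def mae_by_phase(records):
    """Return {phase: mean absolute error} for (phase, predicted, actual) triples.

    The record layout is a hypothetical simplification: the paper uses
    34-channel telemetry, whereas this sketch scores one scalar per record.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for phase, predicted, actual in records:
        totals[phase] += abs(predicted - actual)
        counts[phase] += 1
    return {phase: totals[phase] / counts[phase] for phase in totals}


# Made-up values for illustration only; these are not results from the paper.
sample = [
    ("Approach", 100.0, 101.0),
    ("Approach", 100.0, 99.0),
    ("Climb", 50.0, 58.0),
    ("Climb", 60.0, 54.0),
]
phase_mae = mae_by_phase(sample)
```

Comparing the per-phase values across model families is one plausible way to reproduce the kind of heatmap the paper reports in its phase analysis.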

What carries the argument

Pilot-Score, a composite evaluation metric that balances 60 percent regression accuracy with 40 percent instruction adherence and safety compliance on synchronized 34-channel telemetry from real aviation trajectories.
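The source states only the 60/40 weighting, not how each component is normalized. The sketch below is therefore one possible reading, under two loudly flagged assumptions: MAE is mapped to a 0-1 accuracy by a linear cap (`mae_cap`, a hypothetical parameter), and instruction adherence and safety compliance are folded into a single `compliance_rate` fraction. It is not the paper's actual formula.

```python
def pilot_score(mae, compliance_rate, mae_cap=20.0,
                w_accuracy=0.6, w_compliance=0.4):
    """Illustrative composite score in [0, 1].

    Assumptions (not from the paper):
      * accuracy = 1 - mae / mae_cap, clipped at 0; mae_cap is hypothetical.
      * compliance_rate merges instruction adherence and safety compliance
        into one fraction in [0, 1].
    Only the 0.6 / 0.4 weights come from the source.
    """
    if mae < 0:
        raise ValueError("mae must be non-negative")
    if not 0.0 <= compliance_rate <= 1.0:
        raise ValueError("compliance_rate must lie in [0, 1]")
    accuracy = max(0.0, 1.0 - mae / mae_cap)
    return w_accuracy * accuracy + w_compliance * compliance_rate


# Hypothetical inputs: half-accurate and half-compliant gives about 0.5.
example = pilot_score(mae=10.0, compliance_rate=0.5)
```

Because the normalization choice changes how a precise but non-compliant forecaster ranks against a compliant but less precise LLM, any reimplementation would need the paper's exact definition before scores are comparable.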

Load-bearing premise

The 708 selected trajectories and the safety constraints encoded in Pilot-Score adequately represent the full range of real-world general aviation risks and operational variability.

What would settle it

Evaluating the same 41 models on an independent collection of flight trajectories collected from different aircraft types, regions, or weather conditions and checking whether the precision-controllability split and phase degradation pattern persist.

Figures

Figures reproduced from arXiv: 2604.08987 by Boyang Wang, Haotian Liu, Yalun Wu, Zhoujun Li.

Figure 1: Synchronized flight-state snapshot from P …
Figure 2: Nine-phase segmentation in PILOTBENCH based on standard traffic pattern
Figure 3: Eight-stage pipeline for building PILOTBENCH
Figure 4: Flight and spatial statistics of PILOTBENCH.
  Adjacent text from the paper: … trajectory and attitude predictions in a fixed schema. Decoding is deterministic with temperature 0 and p=1.0. Experiments ran on eight NVIDIA A100 80GB GPUs; the model roster spans Qwen, DeepSeek, GLM, InternLM, GPT, and Doubao. Results (Table I) show Qwen3-32B achieves the best MAE of 9.54. Scaling helps but non-linearly; architecture and training matter more …
Figure 5: Performance radar: traditional models shown in blue dominate …
Figure 6: Decoupling prompting effects: +ICL benefits MAE precision; +Phy-CoT …
Figure 7: MAE heatmap across flight phases. LLM errors spike in Climb and …
Original abstract

As Large Language Models (LLMs) advance toward embodied AI agents operating in physical environments, a fundamental question emerges: can models trained on text corpora reliably reason about complex physics while adhering to safety constraints? We address this through PilotBench, a benchmark evaluating LLMs on safety-critical flight trajectory and attitude prediction. Built from 708 real-world general aviation trajectories spanning nine operationally distinct flight phases with synchronized 34-channel telemetry, PilotBench systematically probes the intersection of semantic understanding and physics-governed prediction through comparative analysis of LLMs and traditional forecasters. We introduce Pilot-Score, a composite metric balancing 60% regression accuracy with 40% instruction adherence and safety compliance. Comparative evaluation across 41 models uncovers a Precision-Controllability Dichotomy: traditional forecasters achieve superior MAE of 7.01 but lack semantic reasoning capabilities, while LLMs gain controllability with 86-89% instruction-following at the cost of 11-14 MAE precision. Phase-stratified analysis further exposes a Dynamic Complexity Gap: LLM performance degrades sharply in high-workload phases such as Climb and Approach, suggesting brittle implicit physics models. These empirical discoveries motivate hybrid architectures combining LLMs' symbolic reasoning with specialized forecasters' numerical precision. PilotBench provides a rigorous foundation for advancing embodied AI in safety-constrained domains.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces PilotBench, a benchmark for evaluating LLMs and traditional forecasters on safety-critical general aviation flight trajectory and attitude prediction. It is built from 708 real-world trajectories across nine phases with 34-channel telemetry, introduces the Pilot-Score metric (60% regression accuracy + 40% instruction adherence/safety compliance), and reports a Precision-Controllability Dichotomy: traditional models achieve MAE 7.01 while LLMs reach 86-89% instruction following at the cost of 11-14 MAE, with sharper degradation in high-workload phases such as Climb and Approach. The work motivates hybrid LLM-forecaster architectures for embodied AI in safety-constrained domains.

Significance. If the benchmark construction and evaluation protocols hold under scrutiny, the work provides a useful empirical testbed for embodied AI research by quantifying trade-offs between numerical precision and semantic controllability on real telemetry data. The scale (41 models, phase-stratified analysis) and introduction of a composite Pilot-Score are strengths that could help guide hybrid system design. However, the absence of external validation against operational statistics limits claims about real-world generalizability.

major comments (2)
  1. [Benchmark construction] Benchmark construction (abstract and implied §3): The 708 trajectories and Pilot-Score weights lack any external anchoring to NTSB incident statistics, FAA operational data, or expert pilot review. This is load-bearing for the central Precision-Controllability Dichotomy claim, as unrepresentative phase or channel distributions would render the reported MAE gap (7.01 vs 11-14) and LLM degradation in Climb/Approach benchmark-specific rather than general.
  2. [Evaluation protocol] Evaluation protocol (abstract and results): No details are provided on data preprocessing, LLM prompt engineering, statistical tests, or error-bar computation for the comparative MAE and 86-89% adherence numbers. Without these, the dichotomy cannot be independently verified and the soundness of the 41-model comparison is compromised.
minor comments (1)
  1. [Pilot-Score definition] Notation for Pilot-Score components could be clarified with an explicit equation or table showing the 60/40 weighting and safety compliance sub-metrics.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed review. The comments identify key areas where additional transparency and discussion will strengthen the manuscript. We address each major comment below and commit to revisions that improve reproducibility and acknowledge limitations without overstating generalizability.

  1. Referee: [Benchmark construction] Benchmark construction (abstract and implied §3): The 708 trajectories and Pilot-Score weights lack any external anchoring to NTSB incident statistics, FAA operational data, or expert pilot review. This is load-bearing for the central Precision-Controllability Dichotomy claim, as unrepresentative phase or channel distributions would render the reported MAE gap (7.01 vs 11-14) and LLM degradation in Climb/Approach benchmark-specific rather than general.

    Authors: We agree that the absence of explicit anchoring to NTSB incident statistics, FAA operational data, or expert pilot review is a limitation for claims of broad real-world applicability. The 708 trajectories were selected from publicly available general aviation telemetry to span nine operationally distinct phases, but we did not perform distributional matching to incident rates or solicit pilot validation for the 60/40 Pilot-Score weights. The Precision-Controllability Dichotomy and phase-stratified degradations are therefore demonstrated within this specific benchmark rather than proven as universal. In revision we will (a) expand §3 with a detailed data-sourcing and phase-selection protocol, (b) add a limitations subsection that explicitly discusses the lack of external validation and its implications, and (c) note that future work could incorporate NTSB/FAA statistics for re-weighting. These changes will clarify the scope of the reported findings. revision: partial

  2. Referee: [Evaluation protocol] Evaluation protocol (abstract and results): No details are provided on data preprocessing, LLM prompt engineering, statistical tests, or error-bar computation for the comparative MAE and 86-89% adherence numbers. Without these, the dichotomy cannot be independently verified and the soundness of the 41-model comparison is compromised.

    Authors: We acknowledge that the submitted version did not provide sufficient methodological detail for independent reproduction. The full manuscript contains descriptions of telemetry preprocessing, prompt templates, and metric computation, but these were not presented at the level of explicit steps, templates, or statistical procedures. In the revised manuscript we will expand the evaluation section to include: (1) precise preprocessing steps (channel-wise z-score normalization, phase segmentation rules, and handling of missing values); (2) the complete prompt templates used for all LLM evaluations, including safety-constraint phrasing; (3) bootstrapped 95% confidence intervals computed over 1,000 resamples for all MAE and adherence figures; and (4) statistical significance tests (paired Wilcoxon signed-rank tests with Bonferroni correction) for the reported model comparisons. These additions will allow full verification of the 41-model results and the observed dichotomy. revision: yes
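The bootstrapped 95% confidence intervals over 1,000 resamples promised in the rebuttal could be computed roughly as follows. This is a minimal percentile-bootstrap sketch using only the standard library; the function name `bootstrap_mae_ci`, the fixed seed, and the sample errors are assumptions for illustration, and the authors' actual procedure may differ.

```python
import random


def bootstrap_mae_ci(abs_errors, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for the mean absolute error.

    abs_errors: per-sample absolute errors |prediction - truth|.
    Returns (lower, upper) bounds of the (1 - alpha) interval.
    """
    errors = list(abs_errors)
    if not errors:
        raise ValueError("abs_errors must not be empty")
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(errors)
    means = []
    for _ in range(n_resamples):
        # Resample with replacement, keeping the original sample size.
        resample = rng.choices(errors, k=n)
        means.append(sum(resample) / n)
    means.sort()
    lower = means[round((alpha / 2) * n_resamples)]
    upper = means[round((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Made-up absolute errors for illustration only; not data from the paper.
sample_errors = [1.0, 2.0, 3.0, 4.0, 5.0] * 20
ci_low, ci_high = bootstrap_mae_ci(sample_errors)
```

Resampling at the trajectory level rather than the individual-prediction level may be more appropriate when predictions within one flight are correlated; the rebuttal does not say which unit the authors intend to resample.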

Circularity Check

0 steps flagged

No circularity: purely empirical benchmark from external data


The paper constructs PilotBench directly from 708 real-world trajectories and 34-channel telemetry, defines Pilot-Score as an explicit composite (60% regression + 40% adherence/safety), and reports comparative MAE and instruction-following rates across 41 models. No equations, derivations, fitted parameters renamed as predictions, or self-citations appear in the load-bearing claims. The Precision-Controllability Dichotomy and phase-stratified gaps are observational outcomes on held-out data, not reductions to the paper's own inputs. The benchmark is therefore self-contained against external telemetry sources.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The benchmark itself is the main contribution; the only notable free parameter is the weighting inside Pilot-Score, and the central assumption is that the chosen trajectories capture safety-critical behavior.

free parameters (1)
  • Pilot-Score weights = 60/40 split
    60% regression accuracy and 40% instruction adherence chosen to balance numerical precision against safety compliance.
axioms (1)
  • domain assumption: The 708 real-world trajectories and nine flight phases adequately represent general aviation safety-critical scenarios
    Invoked as the foundation for all comparative claims without stated external validation.



