Pre-Flight: A Benchmark for Evaluating Large Language Models on Aviation Operational Knowledge
Pith reviewed 2026-07-03 13:32 UTC · model grok-4.3
The pith
Even the strongest LLM reaches only 82.7 percent on a benchmark of aviation operational knowledge where experts score around 95 percent.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Pre-Flight is an open-source benchmark of 300 multiple choice questions drawn from international standards and airport ground operations material. When scored by accuracy under a standard multiple choice protocol, even the strongest evaluated model reaches 82.7 percent, compared with an informal expert reference of around 95 percent, leaving a substantial and persistent gap below expert-level reliability.
What carries the argument
The Pre-Flight benchmark, a set of 300 multiple choice questions authored and reviewed by aviation practitioners and drawn from international standards and airport ground operations material.
If this is right
- Current models remain insufficiently reliable for aviation business operations even when those operations are described as non-safety-critical.
- Domain-specific benchmarks are required before generative AI can be deployed responsibly in regulated aviation contexts.
- The gap below expert performance has narrowed only gradually despite ongoing model releases.
- The released dataset and evaluation harness enable continuous tracking of progress on this task.
Where Pith is reading between the lines
- Comparable domain-specific benchmarks may be useful in other regulated fields where incorrect reasoning carries high costs.
- Targeted training or retrieval methods could be tested to determine whether they close the observed performance gap.
- The benchmark could serve as a filter for selecting models intended for aviation-adjacent customer or documentation tasks.
Load-bearing premise
The 300 questions form a valid and representative measure of the aviation operational knowledge needed for safe reasoning.
What would settle it
A new model scoring 95 percent or higher on the Pre-Flight questions, or a larger sample of aviation experts scoring substantially below 95 percent on the same questions.
Figures
read the original abstract
Large language models (LLMs) are increasingly proposed for aviation business operations, from documentation and training generation to customer facing assistants. General purpose benchmarks do not measure whether a model reasons safely and correctly about aviation specific operational knowledge, and the high stakes, regulated nature of the domain makes that gap consequential. We present Pre-Flight, an open source benchmark of 300 multiple choice questions drawn from international standards and airport ground operations material, covering international airport ground operations, ICAO and US FAA regulations, aviation general knowledge and complex operational scenarios. Questions were authored and reviewed by practitioners with experience in air traffic management, ground operations and commercial flying. We evaluate a range of contemporary commercial and open weight models using the Inspect evaluation framework, scoring by accuracy under a standard multiple choice protocol, and we maintain the leaderboard on a rolling basis as new models are released. Against an informal expert reference of around 95%, obtained from a low sample quiz of aviation professionals at a conference, even the strongest model evaluated (released in 2026) reaches 82.7%, having improved only gradually from roughly 75% in early 2025. A substantial and persistent gap below expert level reliability therefore remains. We release the dataset, the evaluation harness and the results, and the benchmark is available within the community evaluations package distributed with inspect_evals. We argue that domain specific evaluation of this kind is a necessary precondition for responsible deployment of generative AI in non safety critical aviation operations.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces Pre-Flight, an open-source benchmark of 300 multiple-choice questions on aviation operational knowledge covering international airport ground operations, ICAO and US FAA regulations, aviation general knowledge, and complex scenarios. Questions were authored and reviewed by practitioners with experience in air traffic management, ground operations, and commercial flying. The authors evaluate contemporary LLMs via the Inspect framework under a standard multiple-choice protocol, report that the strongest model (released in 2026) reaches 82.7% accuracy (improving gradually from ~75% in early 2025), and compare this to an informal expert reference of ~95% from a low-sample conference quiz. They release the dataset, evaluation harness, and leaderboard, arguing that domain-specific benchmarks are required for responsible deployment of generative AI in non-safety-critical aviation operations.
Significance. If the benchmark is representative and the expert baseline reliable, the work provides a concrete, reusable instrument for measuring LLM performance on regulated operational knowledge and documents a persistent gap below expert level. The open release of the dataset, harness, and rolling leaderboard is a concrete contribution that enables community tracking of progress.
major comments (2)
- [Abstract] Abstract: the claim of a 'substantial and persistent gap below expert level reliability' rests on the informal expert reference of ~95% obtained from a low-sample conference quiz. No sample size, selection criteria, inter-rater reliability, or confirmation that quiz items match the 300-question benchmark distribution is supplied; this directly weakens the interpretation that 82.7% indicates a meaningful shortfall rather than benchmark difficulty or format effects.
- [Abstract] Abstract / evaluation protocol: the manuscript provides no details on question difficulty calibration, inter-rater agreement among the practitioner reviewers, or statistical properties of the expert reference sample. These omissions are load-bearing for the central claim that the 300 questions form a valid and representative measure of the aviation operational knowledge needed for safe reasoning.
minor comments (1)
- [Abstract] Abstract: the description of gradual improvement 'from roughly 75% in early 2025' to 82.7% would benefit from naming the specific models or providing a table of per-model scores to make the trend verifiable.
Simulated Author's Rebuttal
We thank the referee for the thoughtful review and for highlighting the need for greater transparency around the expert baseline and benchmark validation. We agree that the informal nature of the expert reference requires more explicit qualification in the abstract and will revise the manuscript to address this. We also acknowledge that formal statistical details on reviewer agreement and calibration were not collected during development. Below we respond point by point.
read point-by-point responses
-
Referee: [Abstract] Abstract: the claim of a 'substantial and persistent gap below expert level reliability' rests on the informal expert reference of ~95% obtained from a low-sample conference quiz. No sample size, selection criteria, inter-rater reliability, or confirmation that quiz items match the 300-question benchmark distribution is supplied; this directly weakens the interpretation that 82.7% indicates a meaningful shortfall rather than benchmark difficulty or format effects.
Authors: We agree the expert reference is informal and its limitations should be stated more clearly. The manuscript already labels it as 'informal' and 'low sample,' but we will revise the abstract to add an explicit caveat noting the absence of sample size, selection criteria, and distribution matching. We will also soften the phrasing of the 'substantial and persistent gap' claim to indicate it is relative to this informal baseline and may be influenced by benchmark difficulty or format. These changes will appear in the revised version. revision: yes
-
Referee: [Abstract] Abstract / evaluation protocol: the manuscript provides no details on question difficulty calibration, inter-rater agreement among the practitioner reviewers, or statistical properties of the expert reference sample. These omissions are load-bearing for the central claim that the 300 questions form a valid and representative measure of the aviation operational knowledge needed for safe reasoning.
Authors: The questions were authored and reviewed by multiple practitioners with domain experience, but we did not collect formal inter-rater agreement metrics or perform quantitative difficulty calibration. Statistical properties of the expert quiz sample beyond the informal description are also unavailable. We will revise the manuscript to describe the available review process details and explicitly note the lack of these formal measures, while qualifying claims about benchmark validity to reflect that it is a practitioner-developed instrument rather than a psychometrically validated test. revision: partial
- Sample size, selection criteria, and inter-rater reliability for the conference quiz expert reference
- Formal inter-rater agreement statistics among the practitioner reviewers
- Question difficulty calibration data or item analysis statistics
Circularity Check
No circularity: empirical benchmark construction and direct model evaluation
full rationale
The paper constructs a 300-question multiple-choice benchmark from standards and operations material, has it reviewed by practitioners, and evaluates released LLMs using an external framework (Inspect). No derivations, equations, fitted parameters, or predictions are present. The informal expert reference is stated as a separate low-sample conference quiz and is not derived from or used to define the benchmark itself. No self-citation chains, ansatzes, or renamings reduce any claim to the paper's own inputs. The work is self-contained empirical evaluation against external models and an external (if informal) baseline.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Multiple-choice questions authored and reviewed by domain practitioners provide a valid proxy for aviation operational knowledge.
Reference graph
Works this paper leans on
-
[1]
and Lavie, A
Banerjee, S. and Lavie, A. (2005). METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments. InProceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, pp. 65–72
2005
-
[2]
E., Ré, C., et al
Guha, N., Nyarko, J., Ho, D. E., Ré, C., et al. (2023). LegalBench: A Collaboratively Built Benchmark for Measuring Legal Reasoning in Large Language Models. InAdvances in Neural Information Processing Systems 36 (NeurIPS Datasets and Benchmarks Track). 8
2023
-
[3]
and Steinhardt, J
Hendrycks, D., Burns, C., Basart, S., Zou, A., Mazeika, M., Song, D. and Steinhardt, J. (2021). Measuring Massive Multitask Language Understanding. InInternational Conference on Learning Representations (ICLR)
2021
-
[4]
and Szolovits, P
Jin, D., Pan, E., Oufattole, N., Weng, W.-H., Fang, H. and Szolovits, P. (2021). What Disease Does This Patient Have? A Large-Scale Open Domain Question Answering Dataset from Medical Exams.Applied Sciences, 11(14):6421
2021
-
[5]
Lin, C.-Y. (2004). ROUGE: A Package for Automatic Evaluation of Summaries. InText Summarization Branches Out, pp. 74–81. Association for Computational Linguistics
2004
-
[6]
and Sarkhel, K
Mangortey, E., Singh, S., Chen, S. and Sarkhel, K. (2025). Aviation Language Understanding Evaluation (ALUE) – Large Language Model Benchmark with Aviation Datasets. InAIAA AVIATION FORUM AND ASCEND 2025, paper AIAA 2025-3247 (FAA and MITRE). https://github.com/mitre/alue
2025
-
[7]
and Zhu, W.-J
Papineni, K., Roukos, S., Ward, T. and Zhu, W.-J. (2002). BLEU: A Method for Automatic Evaluation of Machine Translation. InProceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pp. 311–318
2002
-
[8]
Nlp evaluation in trouble: On the need to measure llm data contamination for each benchmark, 2023
Sainz, O., Campos, J. A., García-Ferrero, I., Etxaniz, J., Lopez de Lacalle, O. and Agirre, E. (2023). NLP Evaluation in Trouble: On the Need to Measure LLM Data Contamination for each Benchmark. InFindings of the Association for Computational Linguistics: EMNLP 2023, pp. 10776–10787. arXiv:2310.18018. UK AI Security Institute (2024a).Inspect AI: Framewor...
-
[9]
https://www.faa.gov/aircraft/air_cert/step/roadmap_for_AI_safety_assurance
UKAISecurityInstitute, ArcadiaImpactandVectorInstitute(2024b).Inspect Evals: Community-Contributed LLM Evaluations for Inspect AI.https://github.com/UKGovernmentBEIS/inspect_evals US Federal Aviation Administration (FAA) (2024).Roadmap for Artificial Intelligence Safety Assurance. https://www.faa.gov/aircraft/air_cert/step/roadmap_for_AI_safety_assurance
2024
-
[10]
Measuring short-form factuality in large language models
Wei, J., Karina, N., Chung, H. W., Jiao, Y. J., Papay, S., Glaese, A., Schulman, J. and Fedus, W. (2024). Measuring Short-Form Factuality in Large Language Models. arXiv:2411.04368
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[11]
PilotBench: A Benchmark for General Aviation Agents with Safety Constraints
Wu, Y., Liu, H., Li, Z. and Wang, B. (2026). PilotBench: A Benchmark for General Aviation Agents with Safety Constraints. arXiv:2604.08987. Appendix A. Consistently failed items Nine items were answered incorrectly by every one of the thirty one models in the cohort: eight US regulation (14 CFR) items and one international airport ground operations item. ...
work page internal anchor Pith review Pith/arXiv arXiv 2026
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.