Pre-Flight: A Benchmark for Evaluating Large Language Models on Aviation Operational Knowledge

Alex Brooker; Tim Hughes

arxiv: 2607.01829 · v1 · pith:AZLBWI6Rnew · submitted 2026-07-02 · 💻 cs.AI · cs.CL

Pre-Flight: A Benchmark for Evaluating Large Language Models on Aviation Operational Knowledge

Alex Brooker , Tim Hughes This is my paper

Pith reviewed 2026-07-03 13:32 UTC · model grok-4.3

classification 💻 cs.AI cs.CL

keywords aviationLLM benchmarkoperational knowledgeground operationsregulationssafety evaluationmultiple choice

0 comments

The pith

Even the strongest LLM reaches only 82.7 percent on a benchmark of aviation operational knowledge where experts score around 95 percent.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Pre-Flight, a benchmark of 300 multiple choice questions on international airport ground operations, ICAO and FAA regulations, general aviation knowledge and complex operational scenarios. Questions were written and reviewed by practitioners in air traffic management, ground operations and commercial flying. Evaluation of commercial and open weight models shows the top performer, released in 2026, scoring 82.7 percent accuracy, up only gradually from roughly 75 percent in early 2025. This remains well below an informal expert reference level of around 95 percent obtained from aviation professionals. The authors conclude that domain-specific evaluation is a necessary precondition for responsible deployment of generative AI even in non-safety-critical aviation operations.

Core claim

Pre-Flight is an open-source benchmark of 300 multiple choice questions drawn from international standards and airport ground operations material. When scored by accuracy under a standard multiple choice protocol, even the strongest evaluated model reaches 82.7 percent, compared with an informal expert reference of around 95 percent, leaving a substantial and persistent gap below expert-level reliability.

What carries the argument

The Pre-Flight benchmark, a set of 300 multiple choice questions authored and reviewed by aviation practitioners and drawn from international standards and airport ground operations material.

If this is right

Current models remain insufficiently reliable for aviation business operations even when those operations are described as non-safety-critical.
Domain-specific benchmarks are required before generative AI can be deployed responsibly in regulated aviation contexts.
The gap below expert performance has narrowed only gradually despite ongoing model releases.
The released dataset and evaluation harness enable continuous tracking of progress on this task.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Comparable domain-specific benchmarks may be useful in other regulated fields where incorrect reasoning carries high costs.
Targeted training or retrieval methods could be tested to determine whether they close the observed performance gap.
The benchmark could serve as a filter for selecting models intended for aviation-adjacent customer or documentation tasks.

Load-bearing premise

The 300 questions form a valid and representative measure of the aviation operational knowledge needed for safe reasoning.

What would settle it

A new model scoring 95 percent or higher on the Pre-Flight questions, or a larger sample of aviation experts scoring substantially below 95 percent on the same questions.

Figures

Figures reproduced from arXiv: 2607.01829 by Alex Brooker, Tim Hughes.

**Figure 1.** Figure 1: plots accuracy against model release date and traces the frontier over time. Three findings stand out. First, the gap to expert level performance is large and persistent: the strongest model in the snapshot, GPT-5.5, reaches 82.7%, around twelve points below the informal expert reference. Second, progress has been slow relative to that gap: the frontier rose quickly through 2024, from the mid 60s to the lo… view at source ↗

read the original abstract

Large language models (LLMs) are increasingly proposed for aviation business operations, from documentation and training generation to customer facing assistants. General purpose benchmarks do not measure whether a model reasons safely and correctly about aviation specific operational knowledge, and the high stakes, regulated nature of the domain makes that gap consequential. We present Pre-Flight, an open source benchmark of 300 multiple choice questions drawn from international standards and airport ground operations material, covering international airport ground operations, ICAO and US FAA regulations, aviation general knowledge and complex operational scenarios. Questions were authored and reviewed by practitioners with experience in air traffic management, ground operations and commercial flying. We evaluate a range of contemporary commercial and open weight models using the Inspect evaluation framework, scoring by accuracy under a standard multiple choice protocol, and we maintain the leaderboard on a rolling basis as new models are released. Against an informal expert reference of around 95%, obtained from a low sample quiz of aviation professionals at a conference, even the strongest model evaluated (released in 2026) reaches 82.7%, having improved only gradually from roughly 75% in early 2025. A substantial and persistent gap below expert level reliability therefore remains. We release the dataset, the evaluation harness and the results, and the benchmark is available within the community evaluations package distributed with inspect_evals. We argue that domain specific evaluation of this kind is a necessary precondition for responsible deployment of generative AI in non safety critical aviation operations.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Pre-Flight releases a practical 300-question benchmark on aviation regs and ops, but the claimed expert gap at 95% versus 82.7% for models rests on an unverified low-sample conference quiz.

read the letter

The paper's main output is an open 300-question multiple choice set covering ICAO and FAA material, airport ground operations, and scenario questions, with practitioner authorship and review. They run it through the Inspect framework on a range of models and keep a rolling leaderboard. The dataset and harness are released, which is the part that actually adds something usable.

What stands out is the focus on a regulated domain where general benchmarks fall short. The results show steady but slow gains, with the 2026 model at 82.7%. That gives a concrete data point for anyone tracking domain-specific reliability.

The weak point is the expert baseline. The abstract ties the 95% figure to an informal low-sample conference quiz with no reported participant count, selection method, or check that the quiz items match the benchmark distribution. Without those, the size of the gap is hard to read as a stable reliability shortfall rather than a test-format effect. The description also lacks any mention of difficulty calibration or inter-rater stats on the 300 items themselves.

This is for groups already working on specialized LLM evaluations or safety-adjacent applications. The construction details and released materials would be the useful parts for them.

It should go to peer review because the artifact is new and the motivation is grounded, though the baseline comparison will need tightening.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces Pre-Flight, an open-source benchmark of 300 multiple-choice questions on aviation operational knowledge covering international airport ground operations, ICAO and US FAA regulations, aviation general knowledge, and complex scenarios. Questions were authored and reviewed by practitioners with experience in air traffic management, ground operations, and commercial flying. The authors evaluate contemporary LLMs via the Inspect framework under a standard multiple-choice protocol, report that the strongest model (released in 2026) reaches 82.7% accuracy (improving gradually from ~75% in early 2025), and compare this to an informal expert reference of ~95% from a low-sample conference quiz. They release the dataset, evaluation harness, and leaderboard, arguing that domain-specific benchmarks are required for responsible deployment of generative AI in non-safety-critical aviation operations.

Significance. If the benchmark is representative and the expert baseline reliable, the work provides a concrete, reusable instrument for measuring LLM performance on regulated operational knowledge and documents a persistent gap below expert level. The open release of the dataset, harness, and rolling leaderboard is a concrete contribution that enables community tracking of progress.

major comments (2)

[Abstract] Abstract: the claim of a 'substantial and persistent gap below expert level reliability' rests on the informal expert reference of ~95% obtained from a low-sample conference quiz. No sample size, selection criteria, inter-rater reliability, or confirmation that quiz items match the 300-question benchmark distribution is supplied; this directly weakens the interpretation that 82.7% indicates a meaningful shortfall rather than benchmark difficulty or format effects.
[Abstract] Abstract / evaluation protocol: the manuscript provides no details on question difficulty calibration, inter-rater agreement among the practitioner reviewers, or statistical properties of the expert reference sample. These omissions are load-bearing for the central claim that the 300 questions form a valid and representative measure of the aviation operational knowledge needed for safe reasoning.

minor comments (1)

[Abstract] Abstract: the description of gradual improvement 'from roughly 75% in early 2025' to 82.7% would benefit from naming the specific models or providing a table of per-model scores to make the trend verifiable.

Simulated Author's Rebuttal

2 responses · 3 unresolved

We thank the referee for the thoughtful review and for highlighting the need for greater transparency around the expert baseline and benchmark validation. We agree that the informal nature of the expert reference requires more explicit qualification in the abstract and will revise the manuscript to address this. We also acknowledge that formal statistical details on reviewer agreement and calibration were not collected during development. Below we respond point by point.

read point-by-point responses

Referee: [Abstract] Abstract: the claim of a 'substantial and persistent gap below expert level reliability' rests on the informal expert reference of ~95% obtained from a low-sample conference quiz. No sample size, selection criteria, inter-rater reliability, or confirmation that quiz items match the 300-question benchmark distribution is supplied; this directly weakens the interpretation that 82.7% indicates a meaningful shortfall rather than benchmark difficulty or format effects.

Authors: We agree the expert reference is informal and its limitations should be stated more clearly. The manuscript already labels it as 'informal' and 'low sample,' but we will revise the abstract to add an explicit caveat noting the absence of sample size, selection criteria, and distribution matching. We will also soften the phrasing of the 'substantial and persistent gap' claim to indicate it is relative to this informal baseline and may be influenced by benchmark difficulty or format. These changes will appear in the revised version. revision: yes
Referee: [Abstract] Abstract / evaluation protocol: the manuscript provides no details on question difficulty calibration, inter-rater agreement among the practitioner reviewers, or statistical properties of the expert reference sample. These omissions are load-bearing for the central claim that the 300 questions form a valid and representative measure of the aviation operational knowledge needed for safe reasoning.

Authors: The questions were authored and reviewed by multiple practitioners with domain experience, but we did not collect formal inter-rater agreement metrics or perform quantitative difficulty calibration. Statistical properties of the expert quiz sample beyond the informal description are also unavailable. We will revise the manuscript to describe the available review process details and explicitly note the lack of these formal measures, while qualifying claims about benchmark validity to reflect that it is a practitioner-developed instrument rather than a psychometrically validated test. revision: partial

standing simulated objections not resolved

Sample size, selection criteria, and inter-rater reliability for the conference quiz expert reference
Formal inter-rater agreement statistics among the practitioner reviewers
Question difficulty calibration data or item analysis statistics

Circularity Check

0 steps flagged

No circularity: empirical benchmark construction and direct model evaluation

full rationale

The paper constructs a 300-question multiple-choice benchmark from standards and operations material, has it reviewed by practitioners, and evaluates released LLMs using an external framework (Inspect). No derivations, equations, fitted parameters, or predictions are present. The informal expert reference is stated as a separate low-sample conference quiz and is not derived from or used to define the benchmark itself. No self-citation chains, ansatzes, or renamings reduce any claim to the paper's own inputs. The work is self-contained empirical evaluation against external models and an external (if informal) baseline.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The paper rests on standard assumptions about multiple-choice testing as a proxy for knowledge and on the representativeness of the practitioner-authored questions; no free parameters or new entities are introduced.

axioms (1)

domain assumption Multiple-choice questions authored and reviewed by domain practitioners provide a valid proxy for aviation operational knowledge.
Invoked when interpreting model accuracy scores as evidence of a gap below expert reliability.

pith-pipeline@v0.9.1-grok · 5790 in / 1177 out tokens · 40036 ms · 2026-07-03T13:32:54.095527+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

11 extracted references · 3 canonical work pages · 2 internal anchors

[1]

and Lavie, A

Banerjee, S. and Lavie, A. (2005). METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments. InProceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, pp. 65–72

2005
[2]

E., Ré, C., et al

Guha, N., Nyarko, J., Ho, D. E., Ré, C., et al. (2023). LegalBench: A Collaboratively Built Benchmark for Measuring Legal Reasoning in Large Language Models. InAdvances in Neural Information Processing Systems 36 (NeurIPS Datasets and Benchmarks Track). 8

2023
[3]

and Steinhardt, J

Hendrycks, D., Burns, C., Basart, S., Zou, A., Mazeika, M., Song, D. and Steinhardt, J. (2021). Measuring Massive Multitask Language Understanding. InInternational Conference on Learning Representations (ICLR)

2021
[4]

and Szolovits, P

Jin, D., Pan, E., Oufattole, N., Weng, W.-H., Fang, H. and Szolovits, P. (2021). What Disease Does This Patient Have? A Large-Scale Open Domain Question Answering Dataset from Medical Exams.Applied Sciences, 11(14):6421

2021
[5]

Lin, C.-Y. (2004). ROUGE: A Package for Automatic Evaluation of Summaries. InText Summarization Branches Out, pp. 74–81. Association for Computational Linguistics

2004
[6]

and Sarkhel, K

Mangortey, E., Singh, S., Chen, S. and Sarkhel, K. (2025). Aviation Language Understanding Evaluation (ALUE) – Large Language Model Benchmark with Aviation Datasets. InAIAA AVIATION FORUM AND ASCEND 2025, paper AIAA 2025-3247 (FAA and MITRE). https://github.com/mitre/alue

2025
[7]

and Zhu, W.-J

Papineni, K., Roukos, S., Ward, T. and Zhu, W.-J. (2002). BLEU: A Method for Automatic Evaluation of Machine Translation. InProceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pp. 311–318

2002
[8]

Nlp evaluation in trouble: On the need to measure llm data contamination for each benchmark, 2023

Sainz, O., Campos, J. A., García-Ferrero, I., Etxaniz, J., Lopez de Lacalle, O. and Agirre, E. (2023). NLP Evaluation in Trouble: On the Need to Measure LLM Data Contamination for each Benchmark. InFindings of the Association for Computational Linguistics: EMNLP 2023, pp. 10776–10787. arXiv:2310.18018. UK AI Security Institute (2024a).Inspect AI: Framewor...

work page arXiv 2023
[9]

https://www.faa.gov/aircraft/air_cert/step/roadmap_for_AI_safety_assurance

UKAISecurityInstitute, ArcadiaImpactandVectorInstitute(2024b).Inspect Evals: Community-Contributed LLM Evaluations for Inspect AI.https://github.com/UKGovernmentBEIS/inspect_evals US Federal Aviation Administration (FAA) (2024).Roadmap for Artificial Intelligence Safety Assurance. https://www.faa.gov/aircraft/air_cert/step/roadmap_for_AI_safety_assurance

2024
[10]

Measuring short-form factuality in large language models

Wei, J., Karina, N., Chung, H. W., Jiao, Y. J., Papay, S., Glaese, A., Schulman, J. and Fedus, W. (2024). Measuring Short-Form Factuality in Large Language Models. arXiv:2411.04368

work page internal anchor Pith review Pith/arXiv arXiv 2024
[11]

PilotBench: A Benchmark for General Aviation Agents with Safety Constraints

Wu, Y., Liu, H., Li, Z. and Wang, B. (2026). PilotBench: A Benchmark for General Aviation Agents with Safety Constraints. arXiv:2604.08987. Appendix A. Consistently failed items Nine items were answered incorrectly by every one of the thirty one models in the cohort: eight US regulation (14 CFR) items and one international airport ground operations item. ...

work page internal anchor Pith review Pith/arXiv arXiv 2026

[1] [1]

and Lavie, A

Banerjee, S. and Lavie, A. (2005). METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments. InProceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, pp. 65–72

2005

[2] [2]

E., Ré, C., et al

Guha, N., Nyarko, J., Ho, D. E., Ré, C., et al. (2023). LegalBench: A Collaboratively Built Benchmark for Measuring Legal Reasoning in Large Language Models. InAdvances in Neural Information Processing Systems 36 (NeurIPS Datasets and Benchmarks Track). 8

2023

[3] [3]

and Steinhardt, J

Hendrycks, D., Burns, C., Basart, S., Zou, A., Mazeika, M., Song, D. and Steinhardt, J. (2021). Measuring Massive Multitask Language Understanding. InInternational Conference on Learning Representations (ICLR)

2021

[4] [4]

and Szolovits, P

Jin, D., Pan, E., Oufattole, N., Weng, W.-H., Fang, H. and Szolovits, P. (2021). What Disease Does This Patient Have? A Large-Scale Open Domain Question Answering Dataset from Medical Exams.Applied Sciences, 11(14):6421

2021

[5] [5]

Lin, C.-Y. (2004). ROUGE: A Package for Automatic Evaluation of Summaries. InText Summarization Branches Out, pp. 74–81. Association for Computational Linguistics

2004

[6] [6]

and Sarkhel, K

Mangortey, E., Singh, S., Chen, S. and Sarkhel, K. (2025). Aviation Language Understanding Evaluation (ALUE) – Large Language Model Benchmark with Aviation Datasets. InAIAA AVIATION FORUM AND ASCEND 2025, paper AIAA 2025-3247 (FAA and MITRE). https://github.com/mitre/alue

2025

[7] [7]

and Zhu, W.-J

Papineni, K., Roukos, S., Ward, T. and Zhu, W.-J. (2002). BLEU: A Method for Automatic Evaluation of Machine Translation. InProceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pp. 311–318

2002

[8] [8]

Nlp evaluation in trouble: On the need to measure llm data contamination for each benchmark, 2023

Sainz, O., Campos, J. A., García-Ferrero, I., Etxaniz, J., Lopez de Lacalle, O. and Agirre, E. (2023). NLP Evaluation in Trouble: On the Need to Measure LLM Data Contamination for each Benchmark. InFindings of the Association for Computational Linguistics: EMNLP 2023, pp. 10776–10787. arXiv:2310.18018. UK AI Security Institute (2024a).Inspect AI: Framewor...

work page arXiv 2023

[9] [9]

https://www.faa.gov/aircraft/air_cert/step/roadmap_for_AI_safety_assurance

UKAISecurityInstitute, ArcadiaImpactandVectorInstitute(2024b).Inspect Evals: Community-Contributed LLM Evaluations for Inspect AI.https://github.com/UKGovernmentBEIS/inspect_evals US Federal Aviation Administration (FAA) (2024).Roadmap for Artificial Intelligence Safety Assurance. https://www.faa.gov/aircraft/air_cert/step/roadmap_for_AI_safety_assurance

2024

[10] [10]

Measuring short-form factuality in large language models

Wei, J., Karina, N., Chung, H. W., Jiao, Y. J., Papay, S., Glaese, A., Schulman, J. and Fedus, W. (2024). Measuring Short-Form Factuality in Large Language Models. arXiv:2411.04368

work page internal anchor Pith review Pith/arXiv arXiv 2024

[11] [11]

PilotBench: A Benchmark for General Aviation Agents with Safety Constraints

Wu, Y., Liu, H., Li, Z. and Wang, B. (2026). PilotBench: A Benchmark for General Aviation Agents with Safety Constraints. arXiv:2604.08987. Appendix A. Consistently failed items Nine items were answered incorrectly by every one of the thirty one models in the cohort: eight US regulation (14 CFR) items and one international airport ground operations item. ...

work page internal anchor Pith review Pith/arXiv arXiv 2026