pith. sign in

arxiv: 2607.01829 · v1 · pith:AZLBWI6Rnew · submitted 2026-07-02 · 💻 cs.AI · cs.CL

Pre-Flight: A Benchmark for Evaluating Large Language Models on Aviation Operational Knowledge

Pith reviewed 2026-07-03 13:32 UTC · model grok-4.3

classification 💻 cs.AI cs.CL
keywords aviationLLM benchmarkoperational knowledgeground operationsregulationssafety evaluationmultiple choice
0
0 comments X

The pith

Even the strongest LLM reaches only 82.7 percent on a benchmark of aviation operational knowledge where experts score around 95 percent.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Pre-Flight, a benchmark of 300 multiple choice questions on international airport ground operations, ICAO and FAA regulations, general aviation knowledge and complex operational scenarios. Questions were written and reviewed by practitioners in air traffic management, ground operations and commercial flying. Evaluation of commercial and open weight models shows the top performer, released in 2026, scoring 82.7 percent accuracy, up only gradually from roughly 75 percent in early 2025. This remains well below an informal expert reference level of around 95 percent obtained from aviation professionals. The authors conclude that domain-specific evaluation is a necessary precondition for responsible deployment of generative AI even in non-safety-critical aviation operations.

Core claim

Pre-Flight is an open-source benchmark of 300 multiple choice questions drawn from international standards and airport ground operations material. When scored by accuracy under a standard multiple choice protocol, even the strongest evaluated model reaches 82.7 percent, compared with an informal expert reference of around 95 percent, leaving a substantial and persistent gap below expert-level reliability.

What carries the argument

The Pre-Flight benchmark, a set of 300 multiple choice questions authored and reviewed by aviation practitioners and drawn from international standards and airport ground operations material.

If this is right

  • Current models remain insufficiently reliable for aviation business operations even when those operations are described as non-safety-critical.
  • Domain-specific benchmarks are required before generative AI can be deployed responsibly in regulated aviation contexts.
  • The gap below expert performance has narrowed only gradually despite ongoing model releases.
  • The released dataset and evaluation harness enable continuous tracking of progress on this task.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Comparable domain-specific benchmarks may be useful in other regulated fields where incorrect reasoning carries high costs.
  • Targeted training or retrieval methods could be tested to determine whether they close the observed performance gap.
  • The benchmark could serve as a filter for selecting models intended for aviation-adjacent customer or documentation tasks.

Load-bearing premise

The 300 questions form a valid and representative measure of the aviation operational knowledge needed for safe reasoning.

What would settle it

A new model scoring 95 percent or higher on the Pre-Flight questions, or a larger sample of aviation experts scoring substantially below 95 percent on the same questions.

Figures

Figures reproduced from arXiv: 2607.01829 by Alex Brooker, Tim Hughes.

Figure 1
Figure 1. Figure 1: plots accuracy against model release date and traces the frontier over time. Three findings stand out. First, the gap to expert level performance is large and persistent: the strongest model in the snapshot, GPT-5.5, reaches 82.7%, around twelve points below the informal expert reference. Second, progress has been slow relative to that gap: the frontier rose quickly through 2024, from the mid 60s to the lo… view at source ↗
read the original abstract

Large language models (LLMs) are increasingly proposed for aviation business operations, from documentation and training generation to customer facing assistants. General purpose benchmarks do not measure whether a model reasons safely and correctly about aviation specific operational knowledge, and the high stakes, regulated nature of the domain makes that gap consequential. We present Pre-Flight, an open source benchmark of 300 multiple choice questions drawn from international standards and airport ground operations material, covering international airport ground operations, ICAO and US FAA regulations, aviation general knowledge and complex operational scenarios. Questions were authored and reviewed by practitioners with experience in air traffic management, ground operations and commercial flying. We evaluate a range of contemporary commercial and open weight models using the Inspect evaluation framework, scoring by accuracy under a standard multiple choice protocol, and we maintain the leaderboard on a rolling basis as new models are released. Against an informal expert reference of around 95%, obtained from a low sample quiz of aviation professionals at a conference, even the strongest model evaluated (released in 2026) reaches 82.7%, having improved only gradually from roughly 75% in early 2025. A substantial and persistent gap below expert level reliability therefore remains. We release the dataset, the evaluation harness and the results, and the benchmark is available within the community evaluations package distributed with inspect_evals. We argue that domain specific evaluation of this kind is a necessary precondition for responsible deployment of generative AI in non safety critical aviation operations.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces Pre-Flight, an open-source benchmark of 300 multiple-choice questions on aviation operational knowledge covering international airport ground operations, ICAO and US FAA regulations, aviation general knowledge, and complex scenarios. Questions were authored and reviewed by practitioners with experience in air traffic management, ground operations, and commercial flying. The authors evaluate contemporary LLMs via the Inspect framework under a standard multiple-choice protocol, report that the strongest model (released in 2026) reaches 82.7% accuracy (improving gradually from ~75% in early 2025), and compare this to an informal expert reference of ~95% from a low-sample conference quiz. They release the dataset, evaluation harness, and leaderboard, arguing that domain-specific benchmarks are required for responsible deployment of generative AI in non-safety-critical aviation operations.

Significance. If the benchmark is representative and the expert baseline reliable, the work provides a concrete, reusable instrument for measuring LLM performance on regulated operational knowledge and documents a persistent gap below expert level. The open release of the dataset, harness, and rolling leaderboard is a concrete contribution that enables community tracking of progress.

major comments (2)
  1. [Abstract] Abstract: the claim of a 'substantial and persistent gap below expert level reliability' rests on the informal expert reference of ~95% obtained from a low-sample conference quiz. No sample size, selection criteria, inter-rater reliability, or confirmation that quiz items match the 300-question benchmark distribution is supplied; this directly weakens the interpretation that 82.7% indicates a meaningful shortfall rather than benchmark difficulty or format effects.
  2. [Abstract] Abstract / evaluation protocol: the manuscript provides no details on question difficulty calibration, inter-rater agreement among the practitioner reviewers, or statistical properties of the expert reference sample. These omissions are load-bearing for the central claim that the 300 questions form a valid and representative measure of the aviation operational knowledge needed for safe reasoning.
minor comments (1)
  1. [Abstract] Abstract: the description of gradual improvement 'from roughly 75% in early 2025' to 82.7% would benefit from naming the specific models or providing a table of per-model scores to make the trend verifiable.

Simulated Author's Rebuttal

2 responses · 3 unresolved

We thank the referee for the thoughtful review and for highlighting the need for greater transparency around the expert baseline and benchmark validation. We agree that the informal nature of the expert reference requires more explicit qualification in the abstract and will revise the manuscript to address this. We also acknowledge that formal statistical details on reviewer agreement and calibration were not collected during development. Below we respond point by point.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the claim of a 'substantial and persistent gap below expert level reliability' rests on the informal expert reference of ~95% obtained from a low-sample conference quiz. No sample size, selection criteria, inter-rater reliability, or confirmation that quiz items match the 300-question benchmark distribution is supplied; this directly weakens the interpretation that 82.7% indicates a meaningful shortfall rather than benchmark difficulty or format effects.

    Authors: We agree the expert reference is informal and its limitations should be stated more clearly. The manuscript already labels it as 'informal' and 'low sample,' but we will revise the abstract to add an explicit caveat noting the absence of sample size, selection criteria, and distribution matching. We will also soften the phrasing of the 'substantial and persistent gap' claim to indicate it is relative to this informal baseline and may be influenced by benchmark difficulty or format. These changes will appear in the revised version. revision: yes

  2. Referee: [Abstract] Abstract / evaluation protocol: the manuscript provides no details on question difficulty calibration, inter-rater agreement among the practitioner reviewers, or statistical properties of the expert reference sample. These omissions are load-bearing for the central claim that the 300 questions form a valid and representative measure of the aviation operational knowledge needed for safe reasoning.

    Authors: The questions were authored and reviewed by multiple practitioners with domain experience, but we did not collect formal inter-rater agreement metrics or perform quantitative difficulty calibration. Statistical properties of the expert quiz sample beyond the informal description are also unavailable. We will revise the manuscript to describe the available review process details and explicitly note the lack of these formal measures, while qualifying claims about benchmark validity to reflect that it is a practitioner-developed instrument rather than a psychometrically validated test. revision: partial

standing simulated objections not resolved
  • Sample size, selection criteria, and inter-rater reliability for the conference quiz expert reference
  • Formal inter-rater agreement statistics among the practitioner reviewers
  • Question difficulty calibration data or item analysis statistics

Circularity Check

0 steps flagged

No circularity: empirical benchmark construction and direct model evaluation

full rationale

The paper constructs a 300-question multiple-choice benchmark from standards and operations material, has it reviewed by practitioners, and evaluates released LLMs using an external framework (Inspect). No derivations, equations, fitted parameters, or predictions are present. The informal expert reference is stated as a separate low-sample conference quiz and is not derived from or used to define the benchmark itself. No self-citation chains, ansatzes, or renamings reduce any claim to the paper's own inputs. The work is self-contained empirical evaluation against external models and an external (if informal) baseline.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The paper rests on standard assumptions about multiple-choice testing as a proxy for knowledge and on the representativeness of the practitioner-authored questions; no free parameters or new entities are introduced.

axioms (1)
  • domain assumption Multiple-choice questions authored and reviewed by domain practitioners provide a valid proxy for aviation operational knowledge.
    Invoked when interpreting model accuracy scores as evidence of a gap below expert reliability.

pith-pipeline@v0.9.1-grok · 5790 in / 1177 out tokens · 40036 ms · 2026-07-03T13:32:54.095527+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

11 extracted references · 3 canonical work pages · 2 internal anchors

  1. [1]

    and Lavie, A

    Banerjee, S. and Lavie, A. (2005). METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments. InProceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, pp. 65–72

  2. [2]

    E., Ré, C., et al

    Guha, N., Nyarko, J., Ho, D. E., Ré, C., et al. (2023). LegalBench: A Collaboratively Built Benchmark for Measuring Legal Reasoning in Large Language Models. InAdvances in Neural Information Processing Systems 36 (NeurIPS Datasets and Benchmarks Track). 8

  3. [3]

    and Steinhardt, J

    Hendrycks, D., Burns, C., Basart, S., Zou, A., Mazeika, M., Song, D. and Steinhardt, J. (2021). Measuring Massive Multitask Language Understanding. InInternational Conference on Learning Representations (ICLR)

  4. [4]

    and Szolovits, P

    Jin, D., Pan, E., Oufattole, N., Weng, W.-H., Fang, H. and Szolovits, P. (2021). What Disease Does This Patient Have? A Large-Scale Open Domain Question Answering Dataset from Medical Exams.Applied Sciences, 11(14):6421

  5. [5]

    Lin, C.-Y. (2004). ROUGE: A Package for Automatic Evaluation of Summaries. InText Summarization Branches Out, pp. 74–81. Association for Computational Linguistics

  6. [6]

    and Sarkhel, K

    Mangortey, E., Singh, S., Chen, S. and Sarkhel, K. (2025). Aviation Language Understanding Evaluation (ALUE) – Large Language Model Benchmark with Aviation Datasets. InAIAA AVIATION FORUM AND ASCEND 2025, paper AIAA 2025-3247 (FAA and MITRE). https://github.com/mitre/alue

  7. [7]

    and Zhu, W.-J

    Papineni, K., Roukos, S., Ward, T. and Zhu, W.-J. (2002). BLEU: A Method for Automatic Evaluation of Machine Translation. InProceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pp. 311–318

  8. [8]

    Nlp evaluation in trouble: On the need to measure llm data contamination for each benchmark, 2023

    Sainz, O., Campos, J. A., García-Ferrero, I., Etxaniz, J., Lopez de Lacalle, O. and Agirre, E. (2023). NLP Evaluation in Trouble: On the Need to Measure LLM Data Contamination for each Benchmark. InFindings of the Association for Computational Linguistics: EMNLP 2023, pp. 10776–10787. arXiv:2310.18018. UK AI Security Institute (2024a).Inspect AI: Framewor...

  9. [9]

    https://www.faa.gov/aircraft/air_cert/step/roadmap_for_AI_safety_assurance

    UKAISecurityInstitute, ArcadiaImpactandVectorInstitute(2024b).Inspect Evals: Community-Contributed LLM Evaluations for Inspect AI.https://github.com/UKGovernmentBEIS/inspect_evals US Federal Aviation Administration (FAA) (2024).Roadmap for Artificial Intelligence Safety Assurance. https://www.faa.gov/aircraft/air_cert/step/roadmap_for_AI_safety_assurance

  10. [10]

    Measuring short-form factuality in large language models

    Wei, J., Karina, N., Chung, H. W., Jiao, Y. J., Papay, S., Glaese, A., Schulman, J. and Fedus, W. (2024). Measuring Short-Form Factuality in Large Language Models. arXiv:2411.04368

  11. [11]

    PilotBench: A Benchmark for General Aviation Agents with Safety Constraints

    Wu, Y., Liu, H., Li, Z. and Wang, B. (2026). PilotBench: A Benchmark for General Aviation Agents with Safety Constraints. arXiv:2604.08987. Appendix A. Consistently failed items Nine items were answered incorrectly by every one of the thirty one models in the cohort: eight US regulation (14 CFR) items and one international airport ground operations item. ...