Building Trust in the Skies: A Knowledge-Grounded LLM-based Framework for Aviation Safety
Pith reviewed 2026-05-10 17:27 UTC · model grok-4.3
The pith
LLMs grounded by an aviation safety knowledge graph deliver more accurate and traceable responses than standalone models.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The dual-phase pipeline automates construction of an Aviation Safety Knowledge Graph from text, images, and other records using LLMs, then retrieves from that graph to validate, explain, and constrain every generated answer, producing measurable gains in accuracy and traceability over LLM-only baselines.
What carries the argument
The Aviation Safety Knowledge Graph (ASKG), automatically extracted and refreshed by LLMs, then queried inside the RAG layer to anchor and explain every response.
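As a rough illustration of how graph retrieval can anchor a response, the sketch below stores a handful of triples with source identifiers and returns only the facts that mention an entity named in the query. The triples, source IDs, and function names are hypothetical and not taken from the paper, and simple substring matching stands in for whatever retrieval the actual system uses.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Triple:
    subject: str
    relation: str
    obj: str
    source: str  # hypothetical provenance ID for traceability


# Hypothetical toy graph; real ASKG contents are not given in the source.
GRAPH = [
    Triple("Incident-001", "involved_aircraft", "Cessna 172", "report-A"),
    Triple("Incident-001", "had_cause", "fuel exhaustion", "report-A"),
    Triple("Incident-002", "involved_aircraft", "Boeing 737", "report-B"),
]


def retrieve(query: str, graph: list[Triple]) -> list[Triple]:
    """Return triples whose subject or object is mentioned in the query."""
    q = query.lower()
    return [t for t in graph if t.subject.lower() in q or t.obj.lower() in q]


def grounded_context(query: str, graph: list[Triple]) -> str:
    """Format retrieved facts, each tagged with its source, for an LLM prompt."""
    facts = retrieve(query, graph)
    return "\n".join(
        f"{t.subject} {t.relation} {t.obj} [{t.source}]" for t in facts
    )
```

Calling `grounded_context("What caused Incident-001?", GRAPH)` yields only the two facts from `report-A`, so any answer built on that context can cite where each statement came from.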
Load-bearing premise
Automated LLM extraction from multimodal aviation sources yields an accurate and complete knowledge graph free of systematic gaps that would break downstream reliability.
What would settle it
A side-by-side test on a held-out set of real aviation safety queries where the framework still generates hallucinations or unverifiable answers at rates comparable to a plain LLM.
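One way such a side-by-side test could be scored is sketched below, assuming each answer has already been labeled as hallucinated or not by a human reviewer. The function names and the choice of a pooled two-proportion z statistic are illustrative assumptions, not a protocol described in the paper.

```python
import math


def hallucination_rate(labels: list[bool]) -> float:
    """Fraction of answers judged hallucinated (True means hallucinated)."""
    if not labels:
        raise ValueError("need at least one labeled answer")
    return sum(labels) / len(labels)


def compare(framework_labels: list[bool], baseline_labels: list[bool]) -> dict:
    """Compare hallucination rates of the framework and a plain-LLM baseline."""
    n_f, n_b = len(framework_labels), len(baseline_labels)
    p_f = hallucination_rate(framework_labels)
    p_b = hallucination_rate(baseline_labels)
    # Pooled two-proportion z statistic; 0.0 when the pooled variance is zero.
    pooled = (sum(framework_labels) + sum(baseline_labels)) / (n_f + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_f + 1 / n_b))
    z = (p_b - p_f) / se if se > 0 else 0.0
    return {
        "framework_rate": p_f,
        "baseline_rate": p_b,
        "absolute_reduction": p_b - p_f,
        "z": z,
    }
```

A reduction near zero, or a small z statistic on a reasonably sized held-out set, would be the outcome that counts against the framework's central claim.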
Original abstract
The integration of Large Language Models (LLMs) into aviation safety decision-making represents a significant technological advancement, yet their standalone application poses critical risks due to inherent limitations such as factual inaccuracies, hallucination, and lack of verifiability. These challenges undermine the reliability required for safety-critical environments where errors can have catastrophic consequences. To address these challenges, this paper proposes a novel, end-to-end framework that synergistically combines LLMs and Knowledge Graphs (KGs) to enhance the trustworthiness of safety analytics. The framework introduces a dual-phase pipeline: it first employs LLMs to automate the construction and dynamic updating of an Aviation Safety Knowledge Graph (ASKG) from multimodal sources. It then leverages this curated KG within a Retrieval-Augmented Generation (RAG) architecture to ground, validate, and explain LLM-generated responses. The implemented system demonstrates improved accuracy and traceability over LLM-only approaches, effectively supporting complex querying and mitigating hallucination. Results confirm the framework's capability to deliver context-aware, verifiable safety insights, addressing the stringent reliability requirements of the aviation industry. Future work will focus on enhancing relationship extraction and integrating hybrid retrieval mechanisms.
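The "dynamic updating" and "validate" steps in the abstract can be pictured with a minimal sketch: new triples are merged into the graph together with the source they came from, and claims extracted from an LLM answer are then checked against the stored triples. The class and method names are hypothetical, and the sketch assumes the answer has already been broken into (subject, relation, object) claims, which is itself a non-trivial step the abstract does not detail.

```python
from collections import defaultdict

Claim = tuple[str, str, str]  # (subject, relation, object)


class ProvenanceGraph:
    """Toy triple store that remembers which sources support each triple."""

    def __init__(self) -> None:
        self._sources: dict[Claim, set[str]] = defaultdict(set)

    def __len__(self) -> int:
        return len(self._sources)

    def upsert(self, triple: Claim, source: str) -> None:
        """Add a triple, or record an extra source for an existing one."""
        self._sources[triple].add(source)

    def check(self, claims: list[Claim]) -> dict[Claim, list[str]]:
        """Map each claim to its supporting sources (empty list if none)."""
        return {c: sorted(self._sources.get(c, set())) for c in claims}
```

Claims that come back with an empty source list are the ones a grounding layer would flag as unverifiable rather than pass through to the user.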
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes a dual-phase framework for aviation safety analytics that first uses LLMs to automate construction and dynamic updating of an Aviation Safety Knowledge Graph (ASKG) from multimodal sources, then integrates the resulting KG into a RAG pipeline to ground, validate, and explain LLM outputs, claiming improved accuracy, traceability, and hallucination mitigation relative to LLM-only baselines.
Significance. If the performance claims were substantiated, the work would address a timely need for verifiable AI in safety-critical domains by combining automated KG construction with retrieval grounding. The approach could serve as a template for reducing hallucinations in high-stakes querying, but the complete absence of quantitative validation limits its immediate contribution.
major comments (2)
- Abstract: the assertion that 'the implemented system demonstrates improved accuracy and traceability over LLM-only approaches' is unsupported by any quantitative metrics, evaluation protocol, baseline comparisons, or error analysis, leaving the central empirical claim unsubstantiated.
- ASKG construction pipeline: the automated LLM-based extraction and dynamic updating from multimodal sources is described at a high level with no reported precision/recall/F1 scores, coverage analysis of safety-critical relations, or expert gold-standard validation, which directly undermines the reliability of all downstream RAG grounding and hallucination-mitigation claims.
minor comments (1)
- Abstract: the acronym ASKG is introduced without an immediate parenthetical expansion on first use.
Simulated Author's Rebuttal
We thank the referee for their thorough review and constructive comments. We agree that the current manuscript lacks sufficient quantitative validation to support the performance claims, and we will revise it accordingly to include empirical evaluations.
Point-by-point responses
- Referee: Abstract: the assertion that 'the implemented system demonstrates improved accuracy and traceability over LLM-only approaches' is unsupported by any quantitative metrics, evaluation protocol, baseline comparisons, or error analysis, leaving the central empirical claim unsubstantiated.
  Authors: We acknowledge this limitation in the current version. The claims in the abstract were based on preliminary internal testing, but we agree they require formal substantiation. In the revised manuscript, we will add a new 'Evaluation' section that details the experimental setup, including quantitative metrics such as accuracy improvements, traceability scores, baseline comparisons against standard LLM approaches, and error analysis. This will directly support the assertions made.
  Revision: yes
- Referee: ASKG construction pipeline: the automated LLM-based extraction and dynamic updating from multimodal sources is described at a high level with no reported precision/recall/F1 scores, coverage analysis of safety-critical relations, or expert gold-standard validation, which directly undermines the reliability of all downstream RAG grounding and hallucination-mitigation claims.
  Authors: We concur that the ASKG construction details are presented at a conceptual level without quantitative backing. To rectify this, we will enhance the 'ASKG Construction' section with specifics on the extraction methodology, including the use of LLMs for entity and relation extraction from multimodal data. We will report precision, recall, and F1 scores based on a validation set, provide coverage analysis for safety-critical relations, and describe the expert validation process used to create the gold standard. For dynamic updating, we will include metrics on the accuracy of incremental updates.
  Revision: yes
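For reference, the precision, recall, and F1 scores requested for the extraction step can be computed at the triple level as sketched below. This is a minimal sketch that assumes exact matching between extracted triples and an expert gold standard; the function name and the exact-match criterion are assumptions, and the paper may adopt a looser matching rule.

```python
Triple = tuple[str, str, str]  # (subject, relation, object)


def prf1(predicted: set[Triple], gold: set[Triple]) -> tuple[float, float, float]:
    """Triple-level precision, recall, and F1 under exact matching."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

Restricting `gold` to the safety-critical relation types before calling `prf1` would give the per-relation coverage figures the referee asks for.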
Circularity Check
No circularity: framework is descriptive with independent implementation claims
full rationale
The paper describes an end-to-end LLM+KG pipeline for aviation safety without equations, fitted parameters, or any derivation that reduces outputs to inputs by construction. Claims of improved accuracy and hallucination mitigation rest on the implemented RAG system rather than self-referential definitions or self-citation chains. No load-bearing steps match the enumerated circularity patterns; the KG construction step is presented as an engineering choice whose accuracy is asserted via results, not presupposed.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: LLMs can reliably extract and structure knowledge from multimodal aviation safety sources into a usable graph
- domain assumption: Retrieval-augmented generation over the constructed KG will consistently reduce hallucinations and increase traceability for safety queries
invented entities (1)
- Aviation Safety Knowledge Graph (ASKG): no independent evidence