arxiv: 2604.13101 · v1 · submitted 2026-04-10 · 💻 cs.SE · cs.AI

Building Trust in the Skies: A Knowledge-Grounded LLM-based Framework for Aviation Safety

Pith reviewed 2026-05-10 17:27 UTC · model grok-4.3

keywords: aviation safety · knowledge graphs · large language models · retrieval-augmented generation · hallucination mitigation · trustworthy AI · safety analytics

The pith

LLMs grounded by an aviation safety knowledge graph deliver more accurate and traceable responses than standalone models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes an end-to-end system that first uses LLMs to build and update an Aviation Safety Knowledge Graph from multimodal sources, then applies that graph inside a retrieval-augmented generation pipeline to ground and explain the model's outputs. A sympathetic reader would care because aviation safety decisions demand verifiable facts and carry catastrophic risk when models invent details. The implemented framework is shown to handle complex queries while cutting hallucinations and adding traceability links back to source data. This combination addresses the core weakness of pure LLM use in regulated, high-stakes domains.

Core claim

The dual-phase pipeline automates construction of an Aviation Safety Knowledge Graph from text, images, and other records using LLMs, then retrieves from that graph to validate, explain, and constrain every generated answer, producing measurable gains in accuracy and traceability over LLM-only baselines.

What carries the argument

The Aviation Safety Knowledge Graph (ASKG), automatically extracted and refreshed by LLMs, then queried inside the RAG layer to anchor and explain every response.
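
To make the grounding step concrete, here is a minimal sketch of graph-anchored prompting. It is not the authors' implementation: the paper's figure captions point to a natural-language-to-Cypher layer over a graph database, whereas this sketch substitutes an in-memory list of triples and naive substring matching. The `Triple` class, the `source_id` field, and every entity and report id are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Triple:
    """One graph edge plus the source record it was extracted from."""
    subject: str
    relation: str
    obj: str
    source_id: str  # hypothetical provenance key, e.g. a report identifier


def retrieve(graph: list[Triple], question: str) -> list[Triple]:
    """Return triples whose subject or object is mentioned in the question."""
    q = question.lower()
    return [t for t in graph if t.subject.lower() in q or t.obj.lower() in q]


def build_grounded_prompt(question: str, facts: list[Triple]) -> str:
    """Compose a prompt that restricts the model to retrieved, cited facts."""
    if not facts:
        return f"Question: {question}\nNo supporting facts found; answer 'unknown'."
    lines = [f"[{t.source_id}] {t.subject} {t.relation} {t.obj}" for t in facts]
    return (
        "Answer using only the facts below and cite their source ids.\n"
        + "\n".join(lines)
        + f"\nQuestion: {question}"
    )


# Toy graph; every entity and report id here is invented for illustration.
toy_graph = [
    Triple("Flight 101", "EXPERIENCED", "bird strike", "report-A"),
    Triple("bird strike", "CAUSED", "engine shutdown", "report-A"),
    Triple("Flight 202", "EXPERIENCED", "runway excursion", "report-B"),
]

question = "What happened after the bird strike on Flight 101?"
facts = retrieve(toy_graph, question)
prompt = build_grounded_prompt(question, facts)
print(prompt)
```

Carrying `source_id` through to the prompt is what would give a reader the traceability link back to source data; the retrieval step itself is deliberately simplistic and would miss paraphrased entity mentions.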

Load-bearing premise

Automated LLM extraction from multimodal aviation sources yields an accurate and complete knowledge graph free of systematic gaps that would break downstream reliability.

What would settle it

A side-by-side test on a held-out set of real aviation safety queries where the framework still generates hallucinations or unverifiable answers at rates comparable to a plain LLM.
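
A rough sketch of how such a side-by-side comparison could be tallied is given below. It assumes a human reviewer has already labeled each answer as hallucinated or not, and that both systems answered the same held-out queries. The function names and the eight labels per system are invented for illustration; nothing here is data from the paper.

```python
def hallucination_rate(labels: list[bool]) -> float:
    """Fraction of answers a reviewer marked as hallucinated (True)."""
    if not labels:
        raise ValueError("need at least one labeled answer")
    return sum(labels) / len(labels)


def compare(baseline: list[bool], grounded: list[bool]) -> dict[str, float]:
    """Rates for both systems on the same queries, plus the absolute reduction."""
    if len(baseline) != len(grounded):
        raise ValueError("both systems must answer the same held-out queries")
    b = hallucination_rate(baseline)
    g = hallucination_rate(grounded)
    return {"baseline": b, "grounded": g, "reduction": b - g}


# Invented labels for eight held-out queries; True = hallucinated answer.
baseline_labels = [True, False, True, True, False, False, True, False]
grounded_labels = [False, False, True, False, False, False, False, False]

result = compare(baseline_labels, grounded_labels)
print(result)
```

A reduction near zero on a realistic query set would be the outcome that counts against the paper's claim. A real study would also need a sample large enough to attach a confidence interval to the difference.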

Figures

Figures reproduced from arXiv: 2604.13101 by Alisa Tiselska, Anirudh Iyengar, Dumindu Samaraweera, Hong Liu.

Figure 1. High-level system architecture of the proposed framework, unifying …
Figure 2. A sub-knowledge graph generated by the pipeline from a sample set …
Figure 3. Architecture of the aviation safety query system. The red line denotes the data-flow path, while the green line indicates the return path of the …
Figure 4. The dashboard demonstrates the semantic mapping of a natural …
Figure 5. Natural language to Cypher translation. The GraphRAG interface …
Original abstract

The integration of Large Language Models (LLMs) into aviation safety decision-making represents a significant technological advancement, yet their standalone application poses critical risks due to inherent limitations such as factual inaccuracies, hallucination, and lack of verifiability. These challenges undermine the reliability required for safety-critical environments where errors can have catastrophic consequences. To address these challenges, this paper proposes a novel, end-to-end framework that synergistically combines LLMs and Knowledge Graphs (KGs) to enhance the trustworthiness of safety analytics. The framework introduces a dual-phase pipeline: it first employs LLMs to automate the construction and dynamic updating of an Aviation Safety Knowledge Graph (ASKG) from multimodal sources. It then leverages this curated KG within a Retrieval-Augmented Generation (RAG) architecture to ground, validate, and explain LLM-generated responses. The implemented system demonstrates improved accuracy and traceability over LLM-only approaches, effectively supporting complex querying and mitigating hallucination. Results confirm the framework's capability to deliver context-aware, verifiable safety insights, addressing the stringent reliability requirements of the aviation industry. Future work will focus on enhancing relationship extraction and integrating hybrid retrieval mechanisms.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes a dual-phase framework for aviation safety analytics that first uses LLMs to automate construction and dynamic updating of an Aviation Safety Knowledge Graph (ASKG) from multimodal sources, then integrates the resulting KG into a RAG pipeline to ground, validate, and explain LLM outputs, claiming improved accuracy, traceability, and hallucination mitigation relative to LLM-only baselines.

Significance. If the performance claims were substantiated, the work would address a timely need for verifiable AI in safety-critical domains by combining automated KG construction with retrieval grounding. The approach could serve as a template for reducing hallucinations in high-stakes querying, but the complete absence of quantitative validation limits its immediate contribution.

major comments (2)
  1. Abstract: the assertion that 'the implemented system demonstrates improved accuracy and traceability over LLM-only approaches' is unsupported by any quantitative metrics, evaluation protocol, baseline comparisons, or error analysis, leaving the central empirical claim unsubstantiated.
  2. ASKG construction pipeline: the automated LLM-based extraction and dynamic updating from multimodal sources is described at a high level with no reported precision/recall/F1 scores, coverage analysis of safety-critical relations, or expert gold-standard validation, which directly undermines the reliability of all downstream RAG grounding and hallucination-mitigation claims.
minor comments (1)
  1. Abstract: the acronym ASKG is introduced without an immediate parenthetical expansion on first use.
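
The extraction metrics requested in major comment 2 could be computed along the following lines. This is a minimal sketch that assumes exact-match scoring of (subject, relation, object) triples against an expert-built gold set; the paper does not specify a matching criterion, and the `triple_prf` helper and all triples shown are hypothetical.

```python
Edge = tuple[str, str, str]  # (subject, relation, object)


def triple_prf(predicted: set[Edge], gold: set[Edge]) -> tuple[float, float, float]:
    """Exact-match precision, recall, and F1 over extracted triples."""
    true_pos = len(predicted & gold)
    precision = true_pos / len(predicted) if predicted else 0.0
    recall = true_pos / len(gold) if gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Invented gold annotations and LLM output for a single incident report.
gold = {
    ("Flight 101", "EXPERIENCED", "bird strike"),
    ("bird strike", "CAUSED", "engine shutdown"),
    ("Flight 101", "DIVERTED_TO", "Airport X"),
    ("engine shutdown", "LED_TO", "emergency landing"),
}
predicted = {
    ("Flight 101", "EXPERIENCED", "bird strike"),
    ("bird strike", "CAUSED", "engine shutdown"),
    ("Flight 101", "OPERATED_BY", "Airline Y"),
}

precision, recall, f1 = triple_prf(predicted, gold)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

Exact matching is the strictest choice and penalizes harmless surface variation in entity names; a fuller evaluation would likely normalize entities first and report recall separately for safety-critical relation types.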

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for their thorough review and constructive comments. We agree that the current manuscript lacks sufficient quantitative validation to support the performance claims, and we will revise it accordingly to include empirical evaluations.

Point-by-point responses
  1. Referee: Abstract: the assertion that 'the implemented system demonstrates improved accuracy and traceability over LLM-only approaches' is unsupported by any quantitative metrics, evaluation protocol, baseline comparisons, or error analysis, leaving the central empirical claim unsubstantiated.

    Authors: We acknowledge this limitation in the current version. The claims in the abstract were based on preliminary internal testing, but we agree they require formal substantiation. In the revised manuscript, we will add a new 'Evaluation' section that details the experimental setup, including quantitative metrics such as accuracy improvements, traceability scores, baseline comparisons against standard LLM approaches, and error analysis. This will directly support the assertions made.
    Revision: yes

  2. Referee: ASKG construction pipeline: the automated LLM-based extraction and dynamic updating from multimodal sources is described at a high level with no reported precision/recall/F1 scores, coverage analysis of safety-critical relations, or expert gold-standard validation, which directly undermines the reliability of all downstream RAG grounding and hallucination-mitigation claims.

    Authors: We concur that the ASKG construction details are presented at a conceptual level without quantitative backing. To rectify this, we will enhance the 'ASKG Construction' section with specifics on the extraction methodology, including the use of LLMs for entity and relation extraction from multimodal data. We will report precision, recall, and F1 scores based on a validation set, provide coverage analysis for safety-critical relations, and describe the expert validation process used to create the gold standard. For dynamic updating, we will include metrics on the accuracy of incremental updates.
    Revision: yes

Circularity Check

0 steps flagged

No circularity: framework is descriptive with independent implementation claims

full rationale

The paper describes an end-to-end LLM+KG pipeline for aviation safety without equations, fitted parameters, or any derivation that reduces outputs to inputs by construction. Claims of improved accuracy and hallucination mitigation rest on the implemented RAG system rather than self-referential definitions or self-citation chains. No load-bearing steps match the enumerated circularity patterns; the KG construction step is presented as an engineering choice whose accuracy is asserted via results, not presupposed.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

The framework depends on two unproven domain assumptions about LLM extraction quality and RAG effectiveness, plus one newly postulated entity (the ASKG) whose independent validation is not provided.

axioms (2)
  • domain assumption: LLMs can reliably extract and structure knowledge from multimodal aviation safety sources into a usable graph
    Invoked in the first phase of the dual-phase pipeline described in the abstract.
  • domain assumption: Retrieval-augmented generation over the constructed KG will consistently reduce hallucinations and increase traceability for safety queries
    Central premise of the second phase and the claimed improvement over LLM-only baselines.
invented entities (1)
  • Aviation Safety Knowledge Graph (ASKG) · no independent evidence
    purpose: Structured knowledge base to ground and validate LLM outputs in aviation safety
    Newly introduced entity constructed by the LLM in phase one; no external validation or independent evidence supplied.

pith-pipeline@v0.9.0 · 5508 in / 1292 out tokens · 50390 ms · 2026-05-10T17:27:48.646965+00:00 · methodology

