KEO: Knowledge Extraction on OMIn via Knowledge Graphs and RAG for Safety-Critical Aviation Maintenance
Pith reviewed 2026-05-18 09:24 UTC · model grok-4.3
The pith
Integrating a knowledge graph into RAG improves system-wide reasoning in aviation maintenance QA compared with retrieval over text chunks alone.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
KEO constructs a structured knowledge graph from the OMIn dataset and integrates it into a RAG pipeline. This enables large language models to perform more coherent, dataset-wide reasoning than traditional text-chunk RAG. Evaluations with models such as Gemma-3, Phi-4, and Mistral-Nemo, judged by stronger models, demonstrate that KEO markedly improves global sensemaking by revealing patterns and system-level insights, while text-chunk RAG remains effective for fine-grained procedural tasks.
What carries the argument
A knowledge-graph-augmented RAG pipeline that structures domain relationships from the OMIn dataset to support broad, coherent reasoning over maintenance knowledge.
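The contrast between text-chunk retrieval and graph-based retrieval can be illustrated with a minimal sketch. This is not the KEO implementation: the records, triples, relation names, and all function names below are hypothetical, and the word-overlap ranking stands in for whatever retriever the paper actually uses.

```python
import re
from collections import defaultdict

# Hypothetical maintenance records; NOT taken from the OMIn dataset.
RECORDS = [
    "Engine 2 oil leak traced to worn gasket; gasket replaced.",
    "Hydraulic pump failure caused by worn gasket on supply line.",
    "Landing gear actuator replaced after hydraulic pump failure.",
]

# Hypothetical triples that an extractor might emit from those records.
TRIPLES = [
    ("oil leak", "caused_by", "worn gasket"),
    ("hydraulic pump failure", "caused_by", "worn gasket"),
    ("actuator replacement", "follows", "hydraulic pump failure"),
]


def tokens(text):
    """Lowercased alphanumeric word set of a string."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def chunk_retrieve(query, records, k=1):
    """Text-chunk baseline: rank records by word overlap with the query."""
    q = tokens(query)
    ranked = sorted(records, key=lambda r: len(q & tokens(r)), reverse=True)
    return ranked[:k]


def build_graph(triples):
    """Undirected adjacency map over the entities in the triples."""
    graph = defaultdict(set)
    for head, _relation, tail in triples:
        graph[head].add(tail)
        graph[tail].add(head)
    return graph


def kg_neighbourhood(entity, graph, hops=2):
    """Entities reachable from `entity` within `hops` edges."""
    seen = {entity}
    frontier = [entity]
    for _ in range(hops):
        frontier = [nb for node in frontier
                    for nb in graph.get(node, ()) if nb not in seen]
        seen.update(frontier)
    return seen


def kg_retrieve(entity, triples, hops=2):
    """Triples whose head and tail both lie in the entity's neighbourhood."""
    nodes = kg_neighbourhood(entity, build_graph(triples), hops)
    return [t for t in triples if t[0] in nodes and t[2] in nodes]


chunk_context = chunk_retrieve("worn gasket", RECORDS, k=1)
kg_context = kg_retrieve("worn gasket", TRIPLES, hops=2)
```

With a budget of one chunk, the baseline sees a single record, whereas the two-hop graph neighbourhood links the gasket to both failures and to the downstream actuator replacement. That is the kind of cross-record pattern the "global sensemaking" questions target.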
Load-bearing premise
The knowledge graph built from the OMIn dataset accurately captures real aviation maintenance relationships without structural errors or major coverage gaps.
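One way to probe this premise is to have annotators label a random sample of extracted triples as correct or incorrect and to report precision with a confidence interval. The sketch below assumes such labels exist; the function names, the sample size, and the choice of a Wilson interval are illustrative assumptions, not details from the paper. Recall is not covered here because it would additionally require a gold set of triples that should have been extracted.

```python
import math
import random


def sample_triples(triples, n, seed=0):
    """Draw a reproducible random sample of triples for human annotation."""
    rng = random.Random(seed)
    return rng.sample(triples, min(n, len(triples)))


def precision_with_wilson(labels, z=1.96):
    """Precision and Wilson score interval from annotator judgements.

    labels: list of bools, True when the annotator judged the triple correct.
    z: normal quantile (1.96 corresponds to roughly 95% coverage).
    """
    n = len(labels)
    if n == 0:
        raise ValueError("need at least one labelled triple")
    p = sum(labels) / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return p, max(0.0, centre - half), min(1.0, centre + half)


# Hypothetical annotation outcome: 8 of 10 sampled triples judged correct.
labels = [True] * 8 + [False] * 2
precision, low, high = precision_with_wilson(labels)
```

A wide interval on a small sample is itself informative: it shows how many triples would need annotation before the premise could be called supported.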
What would settle it
A case where KEO produces system-level insights that contradict established aviation maintenance procedures or misses documented patterns present in the OMIn data.
Original abstract
We present Knowledge Extraction on OMIn (KEO), a domain-specific knowledge extraction and reasoning framework with large language models (LLMs) in safety-critical contexts. Using the Operations and Maintenance Intelligence (OMIn) dataset, we construct a QA benchmark spanning global sensemaking and actionable maintenance tasks. KEO builds a structured Knowledge Graph (KG) and integrates it into a retrieval-augmented generation (RAG) pipeline, enabling more coherent, dataset-wide reasoning than traditional text-chunk RAG. We evaluate locally deployable LLMs (Gemma-3, Phi-4, Mistral-Nemo) and employ stronger models (GPT-4o, Llama-3.3) as judges. Experiments show that KEO markedly improves global sensemaking by revealing patterns and system-level insights, while text-chunk RAG remains effective for fine-grained procedural tasks requiring localized retrieval. These findings underscore the promise of KG-augmented LLMs for secure, domain-specific QA and their potential in high-stakes reasoning. The code is available at https://github.com/JonathanKarr33/keo.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents KEO, a framework that extracts structured knowledge from the OMIn aviation maintenance dataset into a knowledge graph, then integrates the KG into a RAG pipeline for QA. It constructs a benchmark covering global sensemaking and actionable maintenance tasks, evaluates locally deployable LLMs (Gemma-3, Phi-4, Mistral-Nemo) with stronger models (GPT-4o, Llama-3.3) as judges, and claims that KG-augmented RAG markedly improves coherent dataset-wide reasoning and system-level insights over standard text-chunk RAG, while text-chunk RAG remains preferable for fine-grained procedural retrieval. Code is released at the cited GitHub repository.
Significance. If the KG construction faithfully encodes OMIn relationships without systematic errors, the work could meaningfully advance structured knowledge use in safety-critical domains by enabling better global pattern detection than unstructured RAG. The focus on locally deployable models and code release are practical strengths that support reproducibility and deployment considerations.
major comments (3)
- [KG construction / Methods] KG construction section: no precision/recall metrics, human validation, or ablation on triple quality are reported for the LLM-driven entity/relation extraction from OMIn logs and procedures. This directly undermines the central claim that observed global sensemaking gains arise from the KG-RAG design rather than extraction artifacts or coverage gaps.
- [Experiments / Results] Experiments section: the abstract and results provide no quantitative metrics, error bars, statistical significance tests, or per-task breakdowns for the claimed improvements in global sensemaking versus text-chunk RAG. Without these, the headline comparison cannot be rigorously assessed.
- [Evaluation / Judges] Evaluation setup: reliance on external judge models is described without inter-judge agreement statistics or calibration against human ground truth on the OMIn-derived QA benchmark, weakening confidence that the reported patterns reflect true reasoning gains.
minor comments (2)
- [Abstract] Abstract: states performance gains without any numerical values or effect sizes, which reduces immediate clarity for readers.
- [Introduction / Benchmark] Notation: the distinction between 'global sensemaking' and 'fine-grained procedural tasks' is used repeatedly but not formally defined with example questions or criteria.
Simulated Author's Rebuttal
We thank the referee for their detailed and constructive comments on our manuscript describing the KEO framework. We agree that additional validation and quantitative analysis will strengthen the paper and address the concerns raised. We provide point-by-point responses to the major comments below, indicating the revisions we plan to make.
- Referee: [KG construction / Methods] KG construction section: no precision/recall metrics, human validation, or ablation on triple quality are reported for the LLM-driven entity/relation extraction from OMIn logs and procedures. This directly undermines the central claim that observed global sensemaking gains arise from the KG-RAG design rather than extraction artifacts or coverage gaps.
  Authors: We concur that rigorous validation of the KG extraction process is essential to support our claims. Although the current manuscript emphasizes the overall framework and results, we will revise the Methods section to include precision and recall metrics evaluated on a randomly sampled subset of triples with human annotation. We will also add an ablation study that compares KG-RAG performance against a version with perturbed or reduced triples to assess sensitivity to extraction quality. These changes will help confirm that the global sensemaking improvements are attributable to the structured KG rather than artifacts.
  revision: yes
- Referee: [Experiments / Results] Experiments section: the abstract and results provide no quantitative metrics, error bars, statistical significance tests, or per-task breakdowns for the claimed improvements in global sensemaking versus text-chunk RAG. Without these, the headline comparison cannot be rigorously assessed.
  Authors: We recognize that the presentation of results could benefit from greater quantitative rigor. The manuscript currently highlights qualitative differences and overall trends observed across the evaluated models. In the revised version, we will incorporate specific quantitative metrics (such as task accuracy and coherence scores), report standard deviations or error bars from repeated experiments, conduct statistical significance testing (e.g., Wilcoxon signed-rank tests), and provide detailed per-task breakdowns distinguishing global sensemaking questions from localized procedural ones. This will enable a more precise and statistically grounded comparison.
  revision: yes
- Referee: [Evaluation / Judges] Evaluation setup: reliance on external judge models is described without inter-judge agreement statistics or calibration against human ground truth on the OMIn-derived QA benchmark, weakening confidence that the reported patterns reflect true reasoning gains.
  Authors: We agree that validating the use of LLM judges is important for the credibility of our evaluation. We will augment the Evaluation section with inter-judge agreement statistics, such as pairwise agreement rates and Cohen's kappa between GPT-4o and Llama-3.3. Additionally, we will calibrate the judges by comparing their assessments against human annotations on a held-out portion of the QA benchmark. These steps will provide evidence that the judge models reliably capture the intended reasoning improvements.
  revision: yes
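Two of the statistics named in the responses above, Cohen's kappa for inter-judge agreement and the Wilcoxon signed-rank test for paired system comparisons, can be sketched in plain Python. All scores and verdicts below are hypothetical and the function names are illustrative; the exact p-value is computed by enumerating sign assignments, which is only practical for small samples (roughly 20 pairs or fewer). A statistics library would normally be used instead.

```python
from collections import Counter
from itertools import product


def cohen_kappa(judge_a, judge_b):
    """Cohen's kappa for two raters labelling the same items."""
    n = len(judge_a)
    observed = sum(a == b for a, b in zip(judge_a, judge_b)) / n
    freq_a, freq_b = Counter(judge_a), Counter(judge_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return float("nan")  # kappa is undefined when chance agreement is total
    return (observed - expected) / (1 - expected)


def wilcoxon_exact(x, y):
    """Two-sided exact Wilcoxon signed-rank test on paired samples.

    Zero differences are dropped; tied magnitudes receive average ranks.
    Returns (W, p) where W is the smaller of the two signed-rank sums.
    """
    diffs = [a - b for a, b in zip(x, y) if a != b]
    n = len(diffs)
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    total = sum(ranks)
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w = min(w_plus, total - w_plus)
    # Under the null hypothesis every sign pattern is equally likely.
    extreme = 0
    for signs in product((0, 1), repeat=n):
        s = sum(r for r, keep in zip(ranks, signs) if keep)
        if min(s, total - s) <= w:
            extreme += 1
    return w, extreme / 2 ** n


# Hypothetical verdicts from two judge models on four answers.
kappa = cohen_kappa(["A", "A", "B", "B"], ["A", "B", "B", "B"])

# Hypothetical per-question judge scores for the two systems.
kg_scores = [7, 8, 6, 9, 7, 9]
chunk_scores = [6, 6, 3, 5, 2, 3]
w_stat, p_value = wilcoxon_exact(kg_scores, chunk_scores)
```

In this toy data the judges agree on three of four items, which chance-corrects to a kappa of 0.5, and all six paired differences favour one system, giving the smallest two-sided exact p-value attainable with six pairs (2/64).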
Circularity Check
Evaluation uses external judges and public dataset; no load-bearing reduction to self-referential fits or definitions
The paper describes an empirical framework (KEO) that constructs a knowledge graph from the public OMIn dataset, integrates it into a RAG pipeline, and evaluates it on a self-constructed QA benchmark using locally deployable LLMs with GPT-4o and Llama-3.3 as external judges. No equations, fitted parameters renamed as predictions, or self-citation chains appear in the provided text that would make the central claims (improved global sensemaking via KG-RAG) equivalent to the inputs by construction. The evaluation relies on external model judges and a public dataset rather than tautological self-definition or internal fits, keeping the work self-contained against external benchmarks and warranting only a minimal circularity score.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Large language models produce more coherent answers when supplied with a structured knowledge graph spanning the full dataset rather than isolated text chunks.