KEO: Knowledge Extraction on OMIn via Knowledge Graphs and RAG for Safety-Critical Aviation Maintenance

Chaoli Wang; Jonathan A. Karr Jr; Kuangshi Ai; Meng Jiang; Nitesh V. Chawla

arxiv: 2510.05524 · v2 · submitted 2025-10-07 · 💻 cs.CL · cs.IR


Kuangshi Ai, Jonathan A. Karr Jr, Meng Jiang, Nitesh V. Chawla, Chaoli Wang

Pith reviewed 2026-05-18 09:24 UTC · model grok-4.3


keywords: knowledge extraction, knowledge graphs, RAG, aviation maintenance, safety-critical systems, large language models, question answering, OMIn dataset


The pith

Integrating a knowledge graph into RAG improves system-wide reasoning over text chunks in aviation maintenance QA.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces KEO, a framework that builds a knowledge graph from the OMIn aviation maintenance dataset and places it inside a retrieval-augmented generation pipeline with large language models. This structure supports coherent reasoning across the full dataset rather than isolated passages. Experiments with several locally runnable models show the graph version excels at tasks that need patterns and system-level understanding, while plain text-chunk RAG stays stronger on narrow procedural steps. The work targets safer, domain-specific question answering in high-stakes maintenance settings.

Core claim

KEO constructs a structured knowledge graph from the OMIn dataset and integrates it into a RAG pipeline. This enables large language models to perform more coherent, dataset-wide reasoning than traditional text-chunk RAG. Evaluations with models such as Gemma-3, Phi-4, and Mistral-Nemo, judged by stronger models, demonstrate that KEO markedly improves global sensemaking by revealing patterns and system-level insights, while text-chunk RAG remains effective for fine-grained procedural tasks.

What carries the argument

Knowledge Graph augmented RAG pipeline, which structures domain relationships from the OMIn dataset to support broad, coherent reasoning over maintenance knowledge.
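
To make the contrast between graph-based and chunk-based retrieval concrete, here is a minimal sketch in Python. It is not the KEO implementation: the triples, the text chunks, and the function names (build_graph, kg_context, chunk_context) are all hypothetical, and the matching logic is deliberately naive. It only illustrates why a graph neighborhood can surface multi-step relationships that a single retrieved chunk may not contain.

```python
from collections import defaultdict

# Hypothetical toy triples; NOT taken from OMIn or from the KEO code.
TRIPLES = [
    ("engine", "has_component", "fuel pump"),
    ("fuel pump", "exhibits", "pressure loss"),
    ("pressure loss", "leads_to", "engine shutdown"),
    ("landing gear", "exhibits", "hydraulic leak"),
]

# Hypothetical text chunks standing in for maintenance records.
CHUNKS = [
    "Replaced fuel pump after pressure loss was reported.",
    "Hydraulic leak found on landing gear during inspection.",
]


def build_graph(triples):
    """Index each triple under both endpoints so either entity reaches it."""
    graph = defaultdict(list)
    for head, rel, tail in triples:
        graph[head].append((head, rel, tail))
        graph[tail].append((head, rel, tail))
    return graph


def kg_context(question, graph, hops=2):
    """Collect triples within `hops` edges of any entity named in the question."""
    q = question.lower()
    frontier = {entity for entity in graph if entity in q}
    visited = set(frontier)
    edges = []
    for _ in range(hops):
        next_frontier = set()
        for entity in sorted(frontier):
            for edge in graph[entity]:
                if edge not in edges:
                    edges.append(edge)
                for node in (edge[0], edge[2]):
                    if node not in visited:
                        visited.add(node)
                        next_frontier.add(node)
        frontier = next_frontier
    return [f"{h} --{r}--> {t}" for h, r, t in edges]


def chunk_context(question, chunks, k=1):
    """Rank chunks by the number of lowercase tokens shared with the question."""
    q_tokens = set(question.lower().split())
    ranked = sorted(chunks, key=lambda c: -len(q_tokens & set(c.lower().split())))
    return ranked[:k]


question = "What can a fuel pump problem lead to?"
print(kg_context(question, build_graph(TRIPLES)))
print(chunk_context(question, CHUNKS))
```

In this toy case the graph context includes the edge from pressure loss to engine shutdown, which the top-ranked chunk does not mention. A real pipeline would replace the substring and token-overlap matching with entity linking and embedding-based retrieval.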

Load-bearing premise

The knowledge graph built from the OMIn dataset accurately captures real aviation maintenance relationships without structural errors or major coverage gaps.
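
One way this premise could be probed is to compare extracted triples against a small human-annotated sample. The sketch below is an assumption about how such a check might look, not a procedure described in the paper; the triples, the normalize helper, and triple_precision_recall are hypothetical, and exact string matching after lowercasing is a simplification.

```python
def normalize(triple):
    """Lowercase and strip each element so trivial formatting differences match."""
    return tuple(part.strip().lower() for part in triple)


def triple_precision_recall(extracted, gold):
    """Exact-match precision and recall of extracted triples against a gold set."""
    ext = {normalize(t) for t in extracted}
    ref = {normalize(t) for t in gold}
    correct = ext & ref
    precision = len(correct) / len(ext) if ext else 0.0
    recall = len(correct) / len(ref) if ref else 0.0
    return precision, recall


# Hypothetical model output and human annotations for the same records.
extracted = [
    ("Fuel Pump", "exhibits", "pressure loss"),
    ("engine", "has_component", "fuel pump"),
    ("engine", "causes", "hydraulic leak"),  # spurious triple
]
gold = [
    ("fuel pump", "exhibits", "pressure loss"),
    ("engine", "has_component", "fuel pump"),
    ("pressure loss", "leads_to", "engine shutdown"),
    ("landing gear", "exhibits", "hydraulic leak"),
]

precision, recall = triple_precision_recall(extracted, gold)
print(f"precision={precision:.3f} recall={recall:.3f}")
```

Low precision would point to structural errors in the graph, and low recall to coverage gaps, which are the two failure modes named in the premise.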

What would settle it

A case where KEO produces system-level insights that contradict established aviation maintenance procedures or misses documented patterns present in the OMIn data.

Original abstract

We present Knowledge Extraction on OMIn (KEO), a domain-specific knowledge extraction and reasoning framework with large language models (LLMs) in safety-critical contexts. Using the Operations and Maintenance Intelligence (OMIn) dataset, we construct a QA benchmark spanning global sensemaking and actionable maintenance tasks. KEO builds a structured Knowledge Graph (KG) and integrates it into a retrieval-augmented generation (RAG) pipeline, enabling more coherent, dataset-wide reasoning than traditional text-chunk RAG. We evaluate locally deployable LLMs (Gemma-3, Phi-4, Mistral-Nemo) and employ stronger models (GPT-4o, Llama-3.3) as judges. Experiments show that KEO markedly improves global sensemaking by revealing patterns and system-level insights, while text-chunk RAG remains effective for fine-grained procedural tasks requiring localized retrieval. These findings underscore the promise of KG-augmented LLMs for secure, domain-specific QA and their potential in high-stakes reasoning. The code is available at https://github.com/JonathanKarr33/keo.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

KEO applies KG-RAG to the OMIn aviation dataset and finds it helps global pattern detection more than chunk RAG, but the graph extraction step lacks validation.


The main point is that this paper builds a knowledge graph from the OMIn maintenance dataset and plugs it into RAG, showing better results on broad sensemaking questions than plain text chunks, while chunks stay stronger for narrow procedural lookups. That split is the clearest takeaway from the experiments with Gemma, Phi, and Mistral models judged by larger ones. The code release helps anyone who wants to inspect the pipeline.

What is actually new is the concrete QA benchmark on this particular aviation corpus and the direct comparison of structured versus unstructured retrieval for global versus local tasks in a safety-critical setting. They do a straightforward job of testing locally deployable models and pointing out the practical trade-off.

The soft spot is the knowledge graph itself. The abstract gives little on how entities and relations were pulled from the logs or whether anyone checked the triples for accuracy or coverage. Without precision numbers, human review, or an ablation on graph quality, the claimed gains in revealing system-level patterns could partly reflect extraction artifacts rather than the KG design. The lack of specific scores or error bars in the summary also leaves the size of the improvement unclear.

This work is for people building domain-specific QA tools in engineering or maintenance, especially those already experimenting with graphs and retrieval. A reader who needs an example of KG-RAG on technical logs will find usable ideas and runnable code. It deserves peer review because the application is relevant and the open implementation lets referees check the details directly, though the graph construction section will need tightening to support the main claims.

Referee Report

3 major / 2 minor

Summary. The manuscript presents KEO, a framework that extracts structured knowledge from the OMIn aviation maintenance dataset into a knowledge graph, then integrates the KG into a RAG pipeline for QA. It constructs a benchmark covering global sensemaking and actionable maintenance tasks, evaluates locally deployable LLMs (Gemma-3, Phi-4, Mistral-Nemo) with stronger models (GPT-4o, Llama-3.3) as judges, and claims that KG-augmented RAG markedly improves coherent dataset-wide reasoning and system-level insights over standard text-chunk RAG, while text-chunk RAG remains preferable for fine-grained procedural retrieval. Code is released at the cited GitHub repository.

Significance. If the KG construction faithfully encodes OMIn relationships without systematic errors, the work could meaningfully advance structured knowledge use in safety-critical domains by enabling better global pattern detection than unstructured RAG. The focus on locally deployable models and code release are practical strengths that support reproducibility and deployment considerations.

major comments (3)

[KG construction / Methods] KG construction section: no precision/recall metrics, human validation, or ablation on triple quality are reported for the LLM-driven entity/relation extraction from OMIn logs and procedures. This directly undermines the central claim that observed global sensemaking gains arise from the KG-RAG design rather than extraction artifacts or coverage gaps.
[Experiments / Results] Experiments section: the abstract and results provide no quantitative metrics, error bars, statistical significance tests, or per-task breakdowns for the claimed improvements in global sensemaking versus text-chunk RAG. Without these, the headline comparison cannot be rigorously assessed.
[Evaluation / Judges] Evaluation setup: reliance on external judge models is described without inter-judge agreement statistics or calibration against human ground truth on the OMIn-derived QA benchmark, weakening confidence that the reported patterns reflect true reasoning gains.
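
The second major comment asks for significance tests on the KG-RAG versus text-chunk comparison. As an illustration of what a paired test on per-question judge scores could look like, the sketch below computes an exact two-sided Wilcoxon signed-rank p-value by enumerating sign assignments. The scores are hypothetical, the function names are illustrative, and the enumeration is practical only for small samples; in practice a statistics library routine would normally be used.

```python
from itertools import product


def signed_ranks(diffs):
    """Drop zero differences and rank the rest by |d|, averaging tied ranks."""
    nonzero = [d for d in diffs if d != 0]
    order = sorted(range(len(nonzero)), key=lambda i: abs(nonzero[i]))
    ranks = [0.0] * len(nonzero)
    i = 0
    while i < len(order):
        j = i
        while (j + 1 < len(order)
               and abs(nonzero[order[j + 1]]) == abs(nonzero[order[i]])):
            j += 1
        average_rank = (i + j) / 2 + 1  # positions i..j share this rank
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    return nonzero, ranks


def wilcoxon_exact(diffs):
    """Return (W_plus, two-sided exact p-value) for paired differences."""
    nonzero, ranks = signed_ranks(diffs)
    w_plus = sum(r for r, d in zip(ranks, nonzero) if d > 0)
    center = sum(ranks) / 2
    observed = abs(w_plus - center)
    hits = 0
    # Under the null hypothesis every sign pattern is equally likely.
    for signs in product((0, 1), repeat=len(ranks)):
        w = sum(r for r, s in zip(ranks, signs) if s)
        if abs(w - center) >= observed - 1e-12:
            hits += 1
    return w_plus, hits / 2 ** len(ranks)


# Hypothetical judge scores (1 to 5) for the same eight questions.
kg_scores = [4, 5, 3, 5, 4, 4, 5, 3]
chunk_scores = [3, 3, 3, 4, 2, 4, 3, 2]
diffs = [a - b for a, b in zip(kg_scores, chunk_scores)]

w_plus, p_value = wilcoxon_exact(diffs)
print(f"W+={w_plus} p={p_value:.5f}")
```

Reporting the statistic per task category (global sensemaking versus procedural) would also address the request for per-task breakdowns.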

minor comments (2)

[Abstract] Abstract: states performance gains without any numerical values or effect sizes, which reduces immediate clarity for readers.
[Introduction / Benchmark] Notation: the distinction between 'global sensemaking' and 'fine-grained procedural tasks' is used repeatedly but not formally defined with example questions or criteria.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their detailed and constructive comments on our manuscript describing the KEO framework. We agree that additional validation and quantitative analysis will strengthen the paper and address the concerns raised. We provide point-by-point responses to the major comments below, indicating the revisions we plan to make.


Referee: [KG construction / Methods] KG construction section: no precision/recall metrics, human validation, or ablation on triple quality are reported for the LLM-driven entity/relation extraction from OMIn logs and procedures. This directly undermines the central claim that observed global sensemaking gains arise from the KG-RAG design rather than extraction artifacts or coverage gaps.

Authors: We concur that rigorous validation of the KG extraction process is essential to support our claims. Although the current manuscript emphasizes the overall framework and results, we will revise the Methods section to include precision and recall metrics evaluated on a randomly sampled subset of triples with human annotation. We will also add an ablation study that compares KG-RAG performance against a version with perturbed or reduced triples to assess sensitivity to extraction quality. These changes will help confirm that the global sensemaking improvements are attributable to the structured KG rather than artifacts.
revision: yes

Referee: [Experiments / Results] Experiments section: the abstract and results provide no quantitative metrics, error bars, statistical significance tests, or per-task breakdowns for the claimed improvements in global sensemaking versus text-chunk RAG. Without these, the headline comparison cannot be rigorously assessed.

Authors: We recognize that the presentation of results could benefit from greater quantitative rigor. The manuscript currently highlights qualitative differences and overall trends observed across the evaluated models. In the revised version, we will incorporate specific quantitative metrics (such as task accuracy and coherence scores), report standard deviations or error bars from repeated experiments, conduct statistical significance testing (e.g., Wilcoxon signed-rank tests), and provide detailed per-task breakdowns distinguishing global sensemaking questions from localized procedural ones. This will enable a more precise and statistically grounded comparison.
revision: yes

Referee: [Evaluation / Judges] Evaluation setup: reliance on external judge models is described without inter-judge agreement statistics or calibration against human ground truth on the OMIn-derived QA benchmark, weakening confidence that the reported patterns reflect true reasoning gains.

Authors: We agree that validating the use of LLM judges is important for the credibility of our evaluation. We will augment the Evaluation section with inter-judge agreement statistics, such as pairwise agreement rates and Cohen's kappa between GPT-4o and Llama-3.3. Additionally, we will calibrate the judges by comparing their assessments against human annotations on a held-out portion of the QA benchmark. These steps will provide evidence that the judge models reliably capture the intended reasoning improvements.
revision: yes
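
The agreement statistic named in the last response, Cohen's kappa, corrects raw agreement for the agreement expected by chance: kappa = (p_o - p_e) / (1 - p_e). The sketch below is a minimal illustration using only the Python standard library; the judge verdicts are hypothetical (each label records which system's answer a judge preferred) and do not come from the paper.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both raters must label the same non-empty item set")
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical preferences of two judge models on eight questions.
judge_a = ["kg", "kg", "chunk", "kg", "chunk", "kg", "chunk", "kg"]
judge_b = ["kg", "kg", "chunk", "chunk", "chunk", "kg", "kg", "kg"]

kappa = cohens_kappa(judge_a, judge_b)
print(f"kappa={kappa:.3f}")
```

Here the judges agree on 6 of 8 items, but because both prefer "kg" most of the time, chance agreement is already high and kappa is noticeably lower than the raw agreement rate. The same function could be applied to judge-versus-human labels for the calibration step.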

Circularity Check

0 steps flagged

Evaluation uses external judges and public dataset; no load-bearing reduction to self-referential fits or definitions


The paper describes an empirical framework (KEO) that constructs a knowledge graph from the public OMIn dataset, integrates it into a RAG pipeline, and evaluates it on a self-constructed QA benchmark using locally deployable LLMs with GPT-4o and Llama-3.3 as external judges. No equations, fitted parameters renamed as predictions, or self-citation chains appear in the provided text that would make the central claims (improved global sensemaking via KG-RAG) equivalent to the inputs by construction. The evaluation relies on external model judges and a public dataset rather than tautological self-definition or internal fits, keeping the work self-contained against external benchmarks and warranting only a minimal circularity score.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The framework rests on the assumption that LLMs can leverage structured graphs for coherent reasoning and that the OMIn dataset contains extractable relational knowledge suitable for graph construction.

axioms (1)

Domain assumption: Large language models produce more coherent answers when supplied with a structured knowledge graph spanning the full dataset rather than isolated text chunks.
This premise underpins the claim that KG-augmented RAG outperforms text-chunk RAG on global sensemaking tasks.

