pith · machine review for the scientific record

arxiv: 2604.20795 · v1 · submitted 2026-04-22 · 💻 cs.AI

Recognition: unknown

Automatic Ontology Construction Using LLMs as an External Layer of Memory, Verification, and Planning for Hybrid Intelligent Systems

Authors on Pith: no claims yet

Pith reviewed 2026-05-09 23:34 UTC · model grok-4.3

classification 💻 cs.AI
keywords: ontology construction · knowledge graphs · LLM augmentation · hybrid intelligent systems · automated pipeline · multi-step planning · SHACL validation · verification pipeline

The pith

An external ontological memory layer built automatically from data sources improves LLM performance on multi-step planning tasks and enables formal validation of outputs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper sets out to establish that large language models can be extended with a persistent, structured knowledge graph in RDF/OWL form to overcome limitations in memory and reasoning. An automated pipeline extracts entities, relations, and triples from documents, APIs, and logs, then validates and updates the graph continuously. During use the LLM combines this graph with vector retrieval and tools, turning generation into a verifiable process. A reader would care because the setup promises more reliable multi-step decisions in domains that need lasting knowledge and error correction.

Core claim

The paper claims that an automated pipeline for ontology construction from heterogeneous sources, followed by SHACL and OWL validation, creates an external memory layer that augments LLMs. This layer supports combined vector and graph-based reasoning, yields measurable gains on planning benchmarks such as the Tower of Hanoi, and converts the system into a generation-verification-correction workflow.

What carries the argument

The automated ontology construction pipeline that extracts entities and relations, generates triples, applies SHACL and OWL validation, and maintains continuous graph updates as external verifiable memory.
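The stages named above can be sketched in miniature. This is a toy illustration, not the paper's code: the naive string matcher stands in for LLM-driven extraction, and the dict-based schema check stands in for SHACL/OWL validation.

```python
# Toy sketch of the extract-validate stages of an ontology pipeline.
# All names are illustrative stand-ins: a real system would use an LLM
# for extraction and SHACL constraints for validation.

def extract_triples(text, known_relations):
    """Naively pull (subject, relation, object) triples from
    'A <relation> B' sentences; an LLM would do this step in practice."""
    triples = []
    for sentence in text.split("."):
        words = sentence.strip().split()
        for rel in known_relations:
            if rel in words:
                i = words.index(rel)
                if 0 < i < len(words) - 1:
                    triples.append((words[i - 1], rel, words[i + 1]))
    return triples

def validate(triples, schema):
    """Keep triples whose object falls in the relation's allowed range;
    a stand-in for SHACL/OWL constraint checking."""
    valid, rejected = [], []
    for s, r, o in triples:
        rule = schema.get(r)
        if rule and o in rule["range"]:
            valid.append((s, r, o))
        else:
            rejected.append((s, r, o))
    return valid, rejected

text = "RobotA locatedIn Lab. RobotA locatedIn Tuesday"
schema = {"locatedIn": {"range": {"Lab", "Warehouse"}}}
valid, rejected = validate(extract_triples(text, {"locatedIn"}), schema)
# valid keeps (RobotA, locatedIn, Lab); the nonsensical triple is rejected
```

The point the sketch makes is the load-bearing one: the graph stays clean only if the validation stage actually rejects what extraction gets wrong.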

If this is right

  • Multi-step planning tasks exhibit higher success rates when the ontology layer is present.
  • Generated outputs can be checked and corrected against formal graph constraints.
  • Knowledge persists independently of the LLM parameters and remains queryable.
  • Inference merges vector retrieval with graph reasoning and external tool calls.
  • The architecture supports agent and robotics applications that require explainable, reliable decisions.
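The combined-context bullet above can be made concrete with a minimal sketch, assuming toy embeddings and a toy adjacency map; a real deployment would use a vector store and SPARQL over the RDF graph.

```python
# Illustrative sketch of merging vector retrieval with graph lookup
# into one inference context. Embeddings and the graph are toy data.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def hybrid_context(query_vec, doc_vecs, graph, entity, k=1):
    """Return the top-k documents by cosine similarity plus the entity's
    graph neighborhood: the two context sources the architecture combines."""
    ranked = sorted(doc_vecs, key=lambda d: -cosine(query_vec, doc_vecs[d]))
    facts = [(entity, r, o) for r, o in graph.get(entity, [])]
    return ranked[:k], facts

graph = {"RobotA": [("locatedIn", "Lab"), ("partOf", "FleetX")]}
docs = {"manual": [1.0, 0.0], "changelog": [0.0, 1.0]}
top_docs, facts = hybrid_context([0.9, 0.1], docs, graph, "RobotA")
# top_docs -> ["manual"]; facts list RobotA's verified graph relations
```

The design choice the sketch isolates: vector retrieval supplies fuzzy, similarity-ranked text, while the graph supplies exact, validated relations, and the LLM prompt receives both.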

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Continuous interaction logs could feed the same pipeline to evolve the ontology without retraining the underlying model.
  • Robotics planners might adopt the graph as a shared world model to ground actions in verified relations.
  • Enterprise systems could layer the same validation step over existing RAG setups to reduce unverified outputs.

Load-bearing premise

The automated pipeline can extract and normalize entities and relations from varied sources with enough accuracy that validation catches errors without heavy manual correction or added inconsistencies.

What would settle it

A direct comparison on the Tower of Hanoi benchmark showing no gain in success rate or step efficiency for the ontology-augmented system over baseline LLMs would falsify the performance claim.

Figures

Figures reproduced from arXiv:2604.20795 by Iuliia Gorshkova (Partenit.io) and Pavel Salovskii (Partenit.io, San Francisco, USA).

Figure 1. Reconstructed architecture of a hybrid system based on the provided project diagrams. The diagram combines project-specific topology with formally standardized MCP components and context-centric orchestration.
Figure 2. Ontology Builder pipeline in publication form.
original abstract

This paper presents a hybrid architecture for intelligent systems in which large language models (LLMs) are extended with an external ontological memory layer. Instead of relying solely on parametric knowledge and vector-based retrieval (RAG), the proposed approach constructs and maintains a structured knowledge graph using RDF/OWL representations, enabling persistent, verifiable, and semantically grounded reasoning. The core contribution is an automated pipeline for ontology construction from heterogeneous data sources, including documents, APIs, and dialogue logs. The system performs entity recognition, relation extraction, normalization, and triple generation, followed by validation using SHACL and OWL constraints, and continuous graph updates. During inference, LLMs operate over a combined context that integrates vector-based retrieval with graph-based reasoning and external tool interaction. Experimental observations on planning tasks, including the Tower of Hanoi benchmark, indicate that ontology augmentation improves performance in multi-step reasoning scenarios compared to baseline LLM systems. In addition, the ontology layer enables formal validation of generated outputs, transforming the system into a generation-verification-correction pipeline. The proposed architecture addresses key limitations of current LLM-based systems, including lack of long-term memory, weak structural understanding, and limited reasoning capabilities. It provides a foundation for building agent-based systems, robotics applications, and enterprise AI solutions that require persistent knowledge, explainability, and reliable decision-making.
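The generation-verification-correction loop the abstract describes can be sketched on the Tower of Hanoi benchmark it cites. This is a hedged illustration: the recursive solver stands in for a (corrected) LLM generation, and only the rule checker, the role the ontology layer plays, is fully specified.

```python
# Sketch of generation-verification-correction on Tower of Hanoi.
# verify_moves plays the formal-checker role the ontology layer fills;
# solve stands in for a generator that has passed correction.

def verify_moves(moves, n_disks):
    """Replay a move list [(src, dst), ...] over pegs 0-2 and return
    (ok, reason); illegal moves are rejected instead of trusted."""
    pegs = {0: list(range(n_disks, 0, -1)), 1: [], 2: []}
    for src, dst in moves:
        if not pegs[src]:
            return False, f"peg {src} is empty"
        disk = pegs[src][-1]
        if pegs[dst] and pegs[dst][-1] < disk:
            return False, f"disk {disk} onto smaller disk {pegs[dst][-1]}"
        pegs[dst].append(pegs[src].pop())
    return pegs[2] == list(range(n_disks, 0, -1)), "replayed"

def solve(n, src=0, aux=1, dst=2):
    """Classical recursive solver, standing in for a corrected generation."""
    if n == 0:
        return []
    return solve(n - 1, src, dst, aux) + [(src, dst)] + solve(n - 1, aux, src, dst)

bad_ok, why = verify_moves([(0, 2), (0, 2)], 2)   # illegal second move
good_ok, _ = verify_moves(solve(3), 3)            # verified full solution
# bad_ok is False (large disk onto smaller); good_ok is True
```

The loop closes when a rejected generation, plus the checker's reason string, is fed back to the generator for another attempt; success rates over such loops are exactly the quantitative evidence the referee report below asks for.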

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper proposes a hybrid architecture extending LLMs with an external ontological memory layer using RDF/OWL for persistent, verifiable knowledge. It describes an automated pipeline performing entity recognition, relation extraction, normalization, and triple generation from heterogeneous sources (documents, APIs, dialogue logs), followed by SHACL/OWL validation and graph updates. During inference, LLMs combine vector retrieval with graph-based reasoning. The central claim is that this ontology augmentation yields performance gains on multi-step planning tasks such as Tower of Hanoi and enables a generation-verification-correction pipeline.

Significance. If the performance claims and pipeline reliability were substantiated with quantitative evidence, the architecture could meaningfully address LLM limitations in long-term memory, structural reasoning, and explainability, offering a foundation for agentic and robotic systems. The proposal itself is conceptually coherent but currently lacks the empirical grounding needed to assess its practical significance.

major comments (2)
  1. [Abstract] Abstract: The claim that 'ontology augmentation improves performance in multi-step reasoning scenarios compared to baseline LLM systems' on the Tower of Hanoi benchmark supplies no quantitative results, baselines, error metrics, or methodology details, leaving the paper's central, load-bearing empirical claim unverifiable.
  2. [Abstract] Abstract and experimental observations: No precision, recall, F1, or error-rate measurements are reported for any stage of the automated pipeline (entity recognition, relation extraction, normalization, triple generation), so it is impossible to determine whether SHACL/OWL validation actually catches errors or whether observed planning gains can be attributed to the ontology layer rather than prompt engineering or retrieval artifacts.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We agree that the current manuscript lacks the quantitative evidence needed to fully support the central claims regarding performance improvements and pipeline reliability. We will revise the paper to address these points by adding the requested metrics, baselines, and methodological details.

point-by-point responses
  1. Referee: [Abstract] Abstract: The claim that 'ontology augmentation improves performance in multi-step reasoning scenarios compared to baseline LLM systems' on the Tower of Hanoi benchmark supplies no quantitative results, baselines, error metrics, or methodology details, leaving the paper's central, load-bearing empirical claim unverifiable.

    Authors: We acknowledge that the abstract and experimental observations section currently present only high-level indications of improvement without supporting numerical data. The manuscript describes the hybrid architecture and notes qualitative benefits on planning tasks but does not report specific metrics such as success rates, step counts, or error reductions. In the revised version, we will expand the experimental section to include quantitative results from Tower of Hanoi trials, including success rates over repeated runs, comparisons against baseline LLMs (with and without retrieval), average planning steps, and a full description of the evaluation methodology and prompt setups. revision: yes

  2. Referee: [Abstract] Abstract and experimental observations: No precision, recall, F1, or error-rate measurements are reported for any stage of the automated pipeline (entity recognition, relation extraction, normalization, triple generation), so it is impossible to determine whether SHACL/OWL validation actually catches errors or whether observed planning gains can be attributed to the ontology layer rather than prompt engineering or retrieval artifacts.

    Authors: We agree that the lack of these metrics limits the ability to assess the pipeline's effectiveness and to attribute gains specifically to the ontology layer. The current manuscript emphasizes the architectural design and high-level observations rather than a comprehensive empirical study of the construction stages. We will add a new evaluation subsection that reports precision, recall, and F1 scores for entity recognition and relation extraction on annotated test data, along with error rates before and after SHACL/OWL validation. This will also include ablation-style comparisons to help isolate the contribution of the ontology components from prompt engineering or vector retrieval effects. revision: yes

Circularity Check

0 steps flagged

No circularity; architectural proposal and high-level observations are self-contained without self-referential definitions or fitted predictions

full rationale

The paper describes a hybrid LLM-ontology architecture and an automated pipeline for entity/relation extraction followed by SHACL/OWL validation, then reports high-level experimental observations on planning tasks such as Tower of Hanoi. No equations, parameters, or derivations appear in the provided text. No self-citations are invoked as load-bearing premises, no uniqueness theorems are imported, and no fitted inputs are relabeled as predictions. The central claims rest on the proposed system design and external experimental results rather than reducing to the inputs by construction. This matches the default expectation of no significant circularity for descriptive system papers.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

The central claims rest on the assumption that LLM-driven extraction and formal constraints can produce and maintain a sufficiently accurate ontology without introducing uncaught errors or requiring human curation at scale.

axioms (2)
  • domain assumption LLMs can perform reliable entity recognition, relation extraction, and triple generation from heterogeneous unstructured sources
    This is invoked as the foundation of the automated pipeline described in the abstract.
  • domain assumption SHACL and OWL constraints are sufficient to validate and correct the generated ontology for downstream reasoning tasks
    Stated as the mechanism that turns the system into a generation-verification-correction pipeline.
invented entities (1)
  • External ontological memory layer (no independent evidence)
    purpose: Provide persistent, verifiable, semantically grounded knowledge that augments LLM parametric memory and vector retrieval
    New architectural component introduced to address lack of long-term memory and weak structural understanding in current LLM systems.
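The second axiom, that shape constraints suffice to validate the graph, can be illustrated with a toy checker. The dict-based "shapes" below loosely mimic SHACL NodeShapes (sh:minCount, sh:datatype) but are not SHACL; names like `batteryLevel` are hypothetical.

```python
# Toy shape-constraint checker mimicking the role of SHACL validation:
# a graph update is rejected unless the node satisfies its class shape.
# SHAPES and the property names are illustrative, not from the paper.

SHAPES = {
    "Robot": {
        "locatedIn": {"min_count": 1, "value_type": str},
        "batteryLevel": {"min_count": 1, "value_type": int},
    }
}

def check_node(node_class, properties):
    """Return (conforms, messages), mirroring the shape of a SHACL
    validation report for a single node."""
    violations = []
    for prop, rule in SHAPES[node_class].items():
        values = properties.get(prop, [])
        if len(values) < rule["min_count"]:
            violations.append(f"{prop}: expected >= {rule['min_count']} values")
        for v in values:
            if not isinstance(v, rule["value_type"]):
                violations.append(f"{prop}: {v!r} has wrong type")
    return (not violations), violations

conforms, report = check_node("Robot", {"locatedIn": ["Lab"]})
# conforms is False: batteryLevel is missing, so the update is rejected
```

The axiom's weak point is visible even here: the checker catches only what the shapes encode, so an extraction error that happens to satisfy every constraint passes silently.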

pith-pipeline@v0.9.0 · 5560 in / 1525 out tokens · 40892 ms · 2026-05-09T23:34:58.483494+00:00 · methodology


Reference graph

Works this paper leans on

23 extracted references · 14 canonical work pages · 2 internal anchors

  1. [1]

    RDF 1.1 Concepts and Abstract Syntax

    W3C. RDF 1.1 Concepts and Abstract Syntax. W3C Recommendation, 2014

  2. [2]

    OWL 2 Web Ontology Language Document Overview

    W3C. OWL 2 Web Ontology Language Document Overview. W3C Recommendation, 2012

  3. [3]

    SPARQL 1.1 Query Language

    W3C. SPARQL 1.1 Query Language. W3C Recommendation, 2013

  4. [4]

    Shapes Constraint Language (SHACL)

    W3C. Shapes Constraint Language (SHACL). W3C Recommendation, 2017

  5. [5]

    MCP specification, version dated 2025-11-25

    Model Context Protocol Specification. MCP specification, version dated 2025-11-25

  6. [6]

    From Local to Global: A Graph RAG Approach to Query-Focused Summarization

    D. Edge et al. From Local to Global: A Graph RAG Approach to Query-Focused Summarization. arXiv:2404.16130, 2024

  7. [7]
  8. [8]

    Microsoft Research documentation, 2025

    Welcome — GraphRAG. Microsoft Research documentation, 2025

  9. [9]

    H. Bian. LLM-empowered knowledge graph construction: A survey. arXiv:2510.20345, 2025

  10. [10]

    Garijo et al

    D. Garijo et al. LLMs for Ontology Engineering: A landscape of Tasks and Benchmarking challenges. CEUR Workshop Proceedings, 2025

  11. [11]

    V. K. Kommineni, B. König-Ries, S. Samuel. From human experts to machines: An LLM supported approach to ontology and knowledge graph construction. arXiv:2403.08345, 2024

  12. [12]

    A. S. Lippolis et al. Ontology Generation using Large Language Models. arXiv:2503.05388, 2025

  13. [13]

    Retrieval-Augmented Generation of Ontologies from Relational Databases

    M. Nayyeri et al. Retrieval-Augmented Generation of Ontologies from Relational Databases. arXiv:2506.01232, 2025

  14. [14]

    X. Feng, X. Wu, H. Meng. Ontology-grounded Automatic Knowledge Graph Construction by LLM under Wikidata schema. arXiv:2412.20942, 2024

  15. [15]

    Large Language Models for Scholarly Ontology Generation

    T. Aggarwal, A. Salatino, F. Osborne, E. Motta. Large Language Models for Scholarly Ontology Generation: An Extensive Analysis in the Engineering Field. arXiv:2412.08258, 2024/2025

  16. [16]

    H. V. Khurdula, V. Agarwal, Y. D. Khemlani. Interfaze: The Future of AI is built on Task-Specific Small Models. arXiv:2602.04101, 2026

  17. [17]

    Small Language Models are the Future of Agentic AI

    P. Belcak et al. Small Language Models are the Future of Agentic AI. arXiv:2506.02153, 2025

  18. [18]

    LLMs Can't Plan, But Can Help Planning in LLM-Modulo Frameworks

    S. Kambhampati et al. LLMs Can't Plan, But Can Help Planning in LLM-Modulo Frameworks. arXiv:2402.01817, 2024

  19. [19]

    Valmeekam et al

    K. Valmeekam et al. Large Language Models Still Can't Plan (A Benchmark for LLMs on Planning and Reasoning about Change). FMDM@NeurIPS / OpenReview, 2022

  20. [20]

    When LLM Stops Understanding

    Pavel Salovskii (Platonic). When LLM Stops Understanding. Habr, 2026 https://habr.com/ru/articles/1012702/

  21. [21]

    From Text to Knowledge

    Pavel Salovskii (Platonic). From Text to Knowledge. Habr, 2026. https://habr.com/ru/articles/1012714/

  22. [22]

    Memory for AI and Robots

    Pavel Salovskii (Platonic). Memory for AI and Robots. Habr, 2026. https://habr.com/ru/articles/1012726/

  23. [23]

    P. Salovskii. RAG: Examples of Using External Memory and Data Sources to Improve LLM Performance. AGI Seminar / RUTUBE, April 9, 2025. https://rutube.ru/video/f33cf2556b5eed3d58a870a86266276e/