pith. machine review for the scientific record.

arxiv: 2605.11234 · v1 · submitted 2026-05-11 · 💻 cs.AI

Recognition: no theorem link

The Semantic Training Gap: Ontology-Grounded Tool Architectures for Industrial AI Agent Systems


Pith reviewed 2026-05-13 01:59 UTC · model grok-4.3

classification 💻 cs.AI
keywords semantic training gap · ontology-grounded tools · AI agents · manufacturing · tool hallucination · semantic drift · digital twin · AIOps

The pith

Ontology-grounded tool parameters eliminate hallucination of domain identifiers in industrial AI agents.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Large language models handle manufacturing terminology with statistical fluency but lack the relational structure that ties equipment IDs, process parameters, failure codes, and constraints together in actual operations. This structural mismatch, termed the semantic training gap, produces outputs that are linguistically correct yet operationally wrong and leads to compounding errors called semantic drift when multiple agents interact. The paper introduces an architecture that places manufacturing ontology directly into the tool layer as typed relational configurations, enforced through a fixed three-step interface of resolve, contextualize, and annotate operations under an AIOps layer. Controlled tests across six industry setups and 72 tool calls on Qwen3-32B showed domain-identifier hallucinations drop from 43 percent to zero, while a single codebase serves different domains through ontology configuration alone.

Core claim

The semantic training gap is the disconnect between how models acquire domain vocabulary through training and how manufacturing operations define meaning through ontological relationships among equipment, processes, and constraints. Embedding ontology as a typed relational configuration inside the AI tool layer enforces semantic invariants at runtime via the resolve-contextualize-annotate contract, eliminating operationally incorrect tool calls and preventing semantic drift in multi-agent systems.

What carries the argument

The ontology-grounded tool architecture realized as a three-operation interface contract (resolve, contextualize, annotate) whose invariants are maintained by an AIOps orchestration layer.
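The paper does not publish an implementation of this contract, so the following is only a minimal sketch of what a resolve-contextualize-annotate interface could look like; every class, field, and identifier here is a hypothetical illustration, not the authors' code.

```python
from dataclasses import dataclass

@dataclass
class Ontology:
    """Hypothetical typed relational configuration: known identifiers
    plus the relations that tie them to process parameters."""
    entities: dict   # e.g. {"PUMP-104": "equipment"}
    relations: set   # e.g. {("PUMP-104", "has_parameter", "flow_rate")}

    def resolve(self, raw: str) -> str:
        """Map a free-text mention to a canonical identifier, or fail loudly
        instead of letting an invented identifier pass through."""
        matches = [e for e in self.entities if e.lower() == raw.strip().lower()]
        if not matches:
            raise KeyError(f"unknown identifier: {raw!r}")
        return matches[0]

    def contextualize(self, entity: str) -> dict:
        """Return the relational neighborhood the agent may reference."""
        return {
            "type": self.entities[entity],
            "relations": [r for r in self.relations if r[0] == entity],
        }

    def annotate(self, tool_call: dict) -> dict:
        """Stamp a validated tool call with its ontological context."""
        entity = self.resolve(tool_call["target"])
        return {**tool_call, "target": entity,
                "context": self.contextualize(entity)}

ont = Ontology(
    entities={"PUMP-104": "equipment"},
    relations={("PUMP-104", "has_parameter", "flow_rate")},
)
call = ont.annotate({"tool": "read_sensor", "target": "pump-104"})
print(call["target"])  # canonicalized to "PUMP-104"
```

The point the abstract makes is carried by `resolve`: an identifier absent from the configuration raises rather than being emitted, which is one plausible mechanism for the reported 0% hallucination rate.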

If this is right

  • A single codebase can support multiple manufacturing domains through changes only to the ontology configuration files.
  • Tool-call hallucination of domain identifiers is eliminated under the tested conditions.
  • Semantic drift is prevented in multi-agent configurations.
  • Semantic constraints are enforced at runtime instead of relying on further model training.
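The cross-domain claim above rests on ontology-as-configuration. One plausible realization, sketched here with invented identifiers and function names, is to generate the tool-parameter schema from the ontology, so the model's choice of identifier is restricted to a closed vocabulary while the application code stays fixed:

```python
import json

def tool_schema(ontology_entities: dict) -> dict:
    """Hypothetical: build a JSON-Schema fragment whose identifier
    parameter is an enum over the ontology's known equipment IDs."""
    return {
        "type": "object",
        "properties": {
            "equipment_id": {
                "type": "string",
                "enum": sorted(ontology_entities),  # closed vocabulary
            },
        },
        "required": ["equipment_id"],
    }

# Two illustrative domain configurations; the schema-building code is shared.
automotive = {"PRESS-01": "equipment", "ROBOT-7": "equipment"}
pharma = {"REACTOR-A": "equipment", "LYO-2": "equipment"}

print(json.dumps(tool_schema(automotive)["properties"]["equipment_id"]["enum"]))
print(json.dumps(tool_schema(pharma)["properties"]["equipment_id"]["enum"]))
```

Under this reading, adapting the system to a new domain means swapping `automotive` for `pharma`: only the configuration dictionary changes, matching the single-codebase claim.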

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same interface contract could be applied in other domains that maintain formal ontologies, such as clinical decision support or regulatory compliance systems.
  • Deployment effort for industrial agents may decrease because domain adaptation occurs through configuration rather than code or fine-tuning changes.
  • Controlled tests that deliberately vary ontology completeness would isolate the exact contribution of the grounding mechanism.

Load-bearing premise

The measured drop in hallucinations is produced by the ontology grounding rather than by other features of the tool design or model choice, and the ontologies fully and correctly represent all relevant operational relationships without omissions or internal contradictions.

What would settle it

Replicate the 72-call experiment across the same six configurations but substitute ontologies that contain documented gaps or inconsistencies in domain relationships, then measure whether the hallucination rate remains at zero.
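The measurement side of that replication can be sketched as a small harness; this shows only the evaluation scaffolding under assumed data shapes (identifier names and the `degrade` gap model are hypothetical), not the agent behavior being tested:

```python
import random

def hallucination_rate(calls, valid_ids):
    """Fraction of tool calls whose identifier is outside the ontology."""
    bad = sum(1 for c in calls if c["equipment_id"] not in valid_ids)
    return bad / len(calls)

def degrade(ontology_ids, keep_fraction, seed=0):
    """Simulate a documented ontology gap by dropping a share of the
    known identifiers (a deliberate incompleteness, per the proposal)."""
    rng = random.Random(seed)
    ids = sorted(ontology_ids)
    k = max(1, int(len(ids) * keep_fraction))
    return set(rng.sample(ids, k))

full = {"PUMP-104", "VALVE-17", "SENSOR-9", "MOTOR-3"}
calls = [{"equipment_id": i} for i in sorted(full)]  # stand-in tool calls

for keep in (1.0, 0.75, 0.5):
    rate = hallucination_rate(calls, degrade(full, keep))
    print(f"coverage {keep:.0%}: hallucination rate {rate:.2f}")
```

The interesting question the proposed experiment would answer is what the grounded system does in the degraded conditions: whether it refuses the call, or is forced into a wrong-but-valid identifier that this simple out-of-vocabulary metric would not catch.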

read the original abstract

Large language model (LLM)-based AI agents are increasingly deployed in manufacturing environments for analytics, quality management, and decision support. These agents demonstrate statistical fluency with domain terminology but lack grounded understanding of operational semantics -- the relational structure that connects equipment identifiers, process parameters, failure codes, and regulatory constraints within a specific production context. This paper identifies and formalizes the semantic training gap: a structural disconnect between how AI systems acquire domain vocabulary through training and how manufacturing operations define meaning through ontological relationships. We demonstrate that this gap causes operationally incorrect outputs even when model responses are linguistically precise, and that in multi-agent configurations it produces a compounding failure mode we term semantic drift. To close this gap, we present an architecture that embeds manufacturing ontology directly into the AI tool layer as a typed relational configuration, enforcing semantic constraints at runtime rather than relying on model training. The architecture is formalized as a three-operation interface contract -- resolve, contextualize, annotate -- with invariants enforced by an AIOps orchestration layer. In a controlled experiment across six industry configurations (72 tool invocations using Qwen3-32B), unconstrained tool parameters produced a 43% hallucination rate for domain identifiers; ontology-grounded parameters reduced this to 0%. We validate the approach through a digital twin analytics platform demonstrating that a single codebase with domain-specific ontology configurations eliminates tool-call hallucination and achieves cross-domain configurability without application code changes.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper identifies a 'semantic training gap' in LLM-based AI agents for manufacturing, where statistical fluency with domain terms fails to capture operational semantics defined by relational ontologies. It formalizes this gap and 'semantic drift' in multi-agent settings, then proposes an architecture embedding domain ontologies into the tool layer as typed relational configurations. The core interface is a three-operation contract (resolve, contextualize, annotate) with invariants enforced by an AIOps orchestration layer. A controlled experiment across six industry configurations (72 tool invocations with Qwen3-32B) reports 43% domain-identifier hallucination for unconstrained parameters versus 0% for ontology-grounded parameters; the approach is validated in a digital twin analytics platform showing single-codebase cross-domain configurability without code changes.

Significance. If the experimental isolation of ontology grounding holds and the architecture generalizes beyond the reported configurations, the work offers a practical, training-independent mechanism for enforcing semantic correctness in industrial AI agents. This could reduce operationally invalid outputs in analytics and decision-support systems while enabling reusable codebases across domains.

major comments (2)
  1. [Abstract] Abstract and experimental description: the manuscript reports a 43% to 0% reduction in domain-identifier hallucinations but provides no details on how hallucinations were measured or annotated, what exactly distinguishes 'unconstrained' from 'ontology-grounded' parameters (e.g., presence of any JSON schema, type constraints, or runtime validation), the construction and coverage of the ontologies, or any statistical analysis of the 72 invocations. Without these, the causal attribution to ontological relationships rather than generic structuring cannot be verified and is load-bearing for the central claim.
  2. [Architecture] Architecture section: the three-operation interface contract (resolve, contextualize, annotate) and AIOps-enforced invariants are described at a high level, but no formal specification, pseudocode, or example of how relational ontology constraints are checked at runtime is given. This leaves open whether the claimed elimination of tool-call hallucination follows from the ontology embedding or from unstated implementation choices.
minor comments (1)
  1. [Introduction] The terms 'semantic training gap' and 'semantic drift' are introduced as novel; a brief comparison to existing literature on LLM grounding, ontology-driven agents, or hallucination taxonomies would clarify the contribution.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. The comments highlight important areas where additional clarity will strengthen the manuscript. We address each major comment below and have prepared revisions to incorporate the requested details.

read point-by-point responses
  1. Referee: [Abstract] Abstract and experimental description: the manuscript reports a 43% to 0% reduction in domain-identifier hallucinations but provides no details on how hallucinations were measured or annotated, what exactly distinguishes 'unconstrained' from 'ontology-grounded' parameters (e.g., presence of any JSON schema, type constraints, or runtime validation), the construction and coverage of the ontologies, or any statistical analysis of the 72 invocations. Without these, the causal attribution to ontological relationships rather than generic structuring cannot be verified and is load-bearing for the central claim.

    Authors: We agree that the original manuscript provided insufficient methodological detail on these points, which limits the ability to verify the claims. In the revised version we have added a new subsection to the Experiments section that specifies the hallucination annotation protocol (including the rubric and inter-annotator process), the exact implementation differences between the two parameter regimes, the sources and coverage metrics used to construct the six ontologies, and the statistical tests applied to the 72 invocations. These additions make the experimental design fully reproducible and allow readers to evaluate whether the observed effect is attributable to the relational ontology constraints. revision: yes

  2. Referee: [Architecture] Architecture section: the three-operation interface contract (resolve, contextualize, annotate) and AIOps-enforced invariants are described at a high level, but no formal specification, pseudocode, or example of how relational ontology constraints are checked at runtime is given. This leaves open whether the claimed elimination of tool-call hallucination follows from the ontology embedding or from unstated implementation choices.

    Authors: We accept that the architecture description was too high-level to demonstrate the mechanism clearly. The revised manuscript now contains a formal contract specification using predicate logic for the invariants, pseudocode for each of the three operations, and a worked example showing how a relational constraint (e.g., equipment-identifier to process-parameter linkage) is validated at runtime by the AIOps layer. These additions make explicit that hallucination prevention is enforced by the ontology-grounded interface contract rather than by ad-hoc implementation details. revision: yes
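The worked example the rebuttal promises is not reproduced on this page; a hedged sketch of the kind of equipment-to-parameter linkage check it describes, with all identifiers and the relation name invented for illustration:

```python
# Hypothetical relational configuration: which parameters the ontology
# attaches to which piece of equipment.
RELATIONS = {
    ("PUMP-104", "has_parameter", "flow_rate"),
    ("PUMP-104", "has_parameter", "discharge_pressure"),
    ("VALVE-17", "has_parameter", "position"),
}

def check_linkage(equipment: str, parameter: str) -> bool:
    """Runtime invariant: a tool call may only read parameters the
    ontology links to that equipment identifier."""
    return (equipment, "has_parameter", parameter) in RELATIONS

assert check_linkage("PUMP-104", "flow_rate")      # valid linkage
assert not check_linkage("VALVE-17", "flow_rate")  # fluent but operationally wrong
```

The second assertion is the failure mode the paper names: "flow_rate" is a perfectly fluent parameter name, but the ontology does not attach it to VALVE-17, so the call is rejected at runtime rather than executed.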

Circularity Check

0 steps flagged

No circularity: central claim is empirical validation of architecture via controlled experiment

full rationale

The paper's derivation chain consists of conceptual identification of the semantic training gap, formalization of an architecture as a three-operation interface contract (resolve, contextualize, annotate) with AIOps-enforced invariants, and empirical validation through a controlled experiment (six configurations, 72 invocations with Qwen3-32B) reporting hallucination reduction from 43% to 0%. No mathematical equations, fitted parameters renamed as predictions, self-definitional constructs, or load-bearing self-citations appear in the abstract or described content. The result is presented as an independent observation from the experiment rather than reducing to its own inputs by construction. This is a self-contained empirical finding against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 2 invented entities

The paper rests on domain assumptions about LLM limitations and the utility of ontologies for runtime constraint enforcement. No free parameters are described. New concepts are introduced without independent evidence beyond the reported experiment.

axioms (2)
  • domain assumption LLM agents demonstrate statistical fluency with domain terminology but lack grounded understanding of operational semantics defined through ontological relationships.
    Core premise stated in the abstract identifying the semantic training gap.
  • domain assumption Embedding manufacturing ontology as a typed relational configuration in the AI tool layer enforces semantic constraints at runtime.
    Basis for the proposed three-operation interface contract and AIOps orchestration.
invented entities (2)
  • semantic training gap no independent evidence
    purpose: To name the structural disconnect between statistical vocabulary acquisition and ontological relational meaning in AI agents.
    New term introduced to formalize the identified problem.
  • semantic drift no independent evidence
    purpose: To describe the compounding failure mode in multi-agent configurations caused by the semantic training gap.
    New term for the observed phenomenon in multi-agent setups.

pith-pipeline@v0.9.0 · 5549 in / 1586 out tokens · 45256 ms · 2026-05-13T01:59:47.219692+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

40 extracted references · 40 canonical work pages · 3 internal anchors

  1. [1]

    ANSI/ISA-5.1-2022: Instrumentation symbols and identification

    ISA. ANSI/ISA-5.1-2022: Instrumentation symbols and identification. International Society of Automation; 2022

  2. [2]

    IEC 61131-3:2013: Programmable controllers – Part 3: Programming languages

    IEC. IEC 61131-3:2013: Programmable controllers – Part 3: Programming languages. International Electrotechnical Commission; 2013

  3. [3]

    IPC-9850: Surface mount placement equipment characterization

    IPC. IPC-9850: Surface mount placement equipment characterization. Association Connecting Electronics Industries; 2020

  4. [4]

    A survey on large language model based autonomous agents

    Wang L, Ma C, Feng X, Zhang Z, Yang H, Zhang J, et al. A survey on large language model based autonomous agents. Front Comput Sci 2024;18(6):186345. https://doi.org/10.1007/s11704-024-40231-1

  5. [5]

    The Rise and Potential of Large Language Model Based Agents: A Survey

    Xi Z, Chen W, Guo X, He W, Ding Y, Hong B, et al. The rise and potential of large language model based agents: a survey. arXiv preprint arXiv:2309.07864; 2023

  6. [6]

    Toward principles for the design of ontologies used for knowledge sharing

    Gruber TR. Toward principles for the design of ontologies used for knowledge sharing. Int J Hum-Comput Stud 1995;43(5–6):907–28. https://doi.org/10.1006/ijhc.1995.1081

  7. [7]

    What is an ontology? In: Staab S, Studer R, editors

    Guarino N, Oberle D, Staab S. What is an ontology? In: Staab S, Studer R, editors. Handbook on ontologies. Berlin: Springer; 2009. p. 1–17. https://doi.org/10.1007/978-3-540-92673-3_0

  8. [8]

    MASON: a proposal for an ontology of manufacturing domain

    Lemaignan S, Siadat A, Dantan JY, Semenenko A. MASON: a proposal for an ontology of manufacturing domain. In: Proc IEEE Workshop on Distributed Intelligent Systems. Prague: IEEE; 2006. p. 195–200. https://doi.org/10.1109/DIS.2006.48

  9. [9]

    Towards a formal manufacturing reference ontology

    Usman Z, Young RIM, Chungoora N, Palmer C, Case K, Harding JA. Towards a formal manufacturing reference ontology. Int J Prod Res 2013;51(22):6553–72. https://doi.org/10.1080/00207543.2013.801570

  10. [10]

    Multi-disciplinary engineering for cyber-physical production systems

    Biffl S, Lüder A, Gerhard D, editors. Multi-disciplinary engineering for cyber-physical production systems. Cham: Springer; 2017. https://doi.org/10.1007/978-3-319-56345-9

  11. [11]

    ANSI/ISA-95 (IEC 62264): Enterprise-control system integration

    ISA. ANSI/ISA-95 (IEC 62264): Enterprise-control system integration. International Society of Automation; 2010

  12. [12]

    B2MML: Business to Manufacturing Markup Language, Version 7.0

    MESA International. B2MML: Business to Manufacturing Markup Language, Version 7.0. 2018. Available from: https://mesa.org/topics-resources/b2mml/

  13. [13]

    The road to integration: a guide to applying the ISA-95 standard in manufacturing

    Scholten B. The road to integration: a guide to applying the ISA-95 standard in manufacturing. Research Triangle Park, NC: ISA; 2007

  14. [14]

    PRONTO: an ontology for comprehensive and consistent representation of product information

    Vegetti M, Leone HP, Henning GP. PRONTO: an ontology for comprehensive and consistent representation of product information. Eng Appl Artif Intell 2011;24(8):1305–27. https://doi.org/10.1016/j.engappai.2011.02.014

  15. [15]

    IEC 62541: OPC Unified Architecture

    IEC. IEC 62541: OPC Unified Architecture. International Electrotechnical Commission; 2020

  16. [16]

    IEC 62714: AutomationML – Engineering data exchange format

    IEC. IEC 62714: AutomationML – Engineering data exchange format. International Electrotechnical Commission; 2018

  17. [17]

    Big data analysis of the internet of things in the digital twins of smart city based on deep learning

    Li X, Liu H, Wang W, Zheng Y, Lv H, Lv Z. Big data analysis of the internet of things in the digital twins of smart city based on deep learning. Future Gener Comput Syst 2022;128:167–77. https://doi.org/10.1016/j.future.2021.10.006

  18. [18]

    A novel knowledge graph-based optimization approach for resource allocation in discrete manufacturing workshops

    Zhou B, Bao J, Li J, Lu Y, Liu T, Zhang Q. A novel knowledge graph-based optimization approach for resource allocation in discrete manufacturing workshops. Robot Comput-Integr Manuf 2022;71:102160. https://doi.org/10.1016/j.rcim.2021.102160

  19. [19]

    Unifying large language models and knowledge graphs: a roadmap

    Pan S, Luo L, Wang Y, Chen C, Wang J, Wu X. Unifying large language models and knowledge graphs: a roadmap. IEEE Trans Knowl Data Eng 2024;36(7):3580–99. https://doi.org/10.1109/TKDE.2024.3352100

  20. [20]

    Toolformer: language models can teach themselves to use tools

    Schick T, Dwivedi-Yu J, Dessì R, Raileanu R, Lomeli M, Hambro E, et al. Toolformer: language models can teach themselves to use tools. Adv Neural Inf Process Syst 2023;36

  21. [21]

    Gorilla: Large Language Model Connected with Massive APIs

    Patil SG, Zhang T, Wang X, Gonzalez JE. Gorilla: large language model connected with massive APIs. arXiv preprint arXiv:2305.15334; 2023

  22. [22]

    Tool use (function calling) with Claude

    Anthropic. Tool use (function calling) with Claude. Anthropic Documentation; 2024. Available from: https://docs.anthropic.com/en/docs/build-with-claude/tool-use

  23. [23]

    NeMo Guardrails: a toolkit for controllable and safe LLM applications with programmable rails

    Rebedea T, Dinu R, Sreedhar M, Parisien C, Cohen J. NeMo Guardrails: a toolkit for controllable and safe LLM applications with programmable rails. In: Proc 2023 Conf Empirical Methods in Natural Language Processing (EMNLP): System Demonstrations. Singapore: ACL; 2023. p. 431–45

  24. [24]

    Survey of hallucination in natural language generation

    Ji Z, Lee N, Frieske R, Yu T, Su D, Xu Y, et al. Survey of hallucination in natural language generation. ACM Comput Surv 2023;55(12):1–38. https://doi.org/10.1145/3571730

  25. [25]

    A Survey on Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions

    Huang L, Yu W, Ma W, Zhong W, Feng Z, Wang H, et al. A survey on hallucination in large language models: principles, taxonomy, challenges, and open questions. arXiv preprint arXiv:2311.05232; 2023

  26. [26]

    A review of the roles of digital twin in CPS-based production systems

    Negri E, Fumagalli L, Macchi M. A review of the roles of digital twin in CPS-based production systems. Procedia Manuf 2017;11:939–48. https://doi.org/10.1016/j.promfg.2017.07.198

  27. [27]

    Representing layout information in the CMSD specification

    Riddick F, Lee YT. Representing layout information in the CMSD specification. In: Proc Winter Simulation Conference. Phoenix, AZ: IEEE; 2011. p. 2157–68

  28. [28]

    The Synthetic Data Vault

    Patki N, Wedge R, Veeramachaneni K. The Synthetic Data Vault. In: Proc IEEE International Conference on Data Science and Advanced Analytics (DSAA). Montreal: IEEE; 2016. p. 399–410

  29. [29]

    The data layer nobody builds: how template-as-ontology alignment enables cross-domain synthetic data for industrial AI validation

    Chethan G. The data layer nobody builds: how template-as-ontology alignment enables cross-domain synthetic data for industrial AI validation. Manuscript in preparation; 2025

  30. [30]

    Systems of knowledge organization for digital libraries: beyond traditional authority files

    Hodge G. Systems of knowledge organization for digital libraries: beyond traditional authority files. Washington, DC: Council on Library and Information Resources; 2000

  31. [31]

    NADCAP: National Aerospace and Defense Contractors Accreditation Program

    PRI. NADCAP: National Aerospace and Defense Contractors Accreditation Program. Performance Review Institute; 2024

  32. [32]

    Hidden technical debt in machine learning systems

    Sculley D, Holt G, Golovin D, Davydov E, Phillips T, Ebner D, et al. Hidden technical debt in machine learning systems. Adv Neural Inf Process Syst 2015;28:2503–11

  33. [33]

    Ontology development 101: a guide to creating your first ontology

    Noy NF, McGuinness DL. Ontology development 101: a guide to creating your first ontology. Stanford Knowledge Systems Laboratory Technical Report KSL-01-05; 2001

  34. [34]

    21 CFR Part 11: Electronic records; electronic signatures

    FDA. 21 CFR Part 11: Electronic records; electronic signatures. U.S. Food and Drug Administration; 2003

  35. [35]

    IPC-A-610: Acceptability of electronic assemblies

    IPC. IPC-A-610: Acceptability of electronic assemblies. Association Connecting Electronics Industries; 2021

  36. [36]

    IATF 16949:2016: Quality management system requirements for automotive production

    IATF. IATF 16949:2016: Quality management system requirements for automotive production. International Automotive Task Force; 2016

  37. [37]

    AIOps: real-world challenges and research innovations

    Dang Y, Lin Q, Huang P. AIOps: real-world challenges and research innovations. In: Proc 41st Int Conf Softw Eng (ICSE-SEIP). Montreal, QC: IEEE; 2019. p. 4–13. https://doi.org/10.1109/ICSE-SEIP.2019.00009

  38. [38]

    A systematic mapping study in AIOps

    Notaro P, Cardoso J, Gerndt M. A systematic mapping study in AIOps. In: Proc Int Conf on Service-Oriented Computing (ICSOC). Dubai: Springer; 2020. p. 110–23

  39. [39]

    Model Context Protocol specification

    Anthropic. Model Context Protocol specification. 2024. Available from: https://modelcontextprotocol.io/

  40. [40]

    On the dangers of stochastic parrots: can language models be too big? In: Proc ACM Conf Fairness, Accountability, and Transparency (FAccT)

    Bender EM, Gebru T, McMillan-Major A, Shmitchell S. On the dangers of stochastic parrots: can language models be too big? In: Proc ACM Conf Fairness, Accountability, and Transparency (FAccT). New York: ACM; 2021. p. 610–23