From Subsumption to Satisfiability: LLM-Assisted Active Learning for OWL Ontologies
Pith reviewed 2026-05-10 08:15 UTC · model grok-4.3
The pith
LLM-assisted active learning reformulates OWL subsumption checks as satisfiability queries, asks models for counter-concept instances, and ensures that errors are only Type II delays rather than inconsistencies.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
This design property ensures that only Type II errors may occur in ontology modelling; in the worst case, these errors merely delay the construction process without introducing inconsistencies. Experimental results on 13 commercial LLMs show that recall remains stable across several well-established ontologies.
Load-bearing premise
That verbalizing counter-concepts in controlled natural language preserves enough logical meaning for LLMs to generate useful and unbiased example instances, and that the reduction from subsumption to satisfiability remains sound when the oracle is an LLM rather than a perfect reasoner.
Original abstract
In active learning, membership queries (MQs) allow a learner to pose questions to a teacher, such as "Is every apple a fruit?", to which the teacher responds correctly with yes or no. These MQs can be viewed as subsumption tests with respect to the target ontology. Inspired by the standard reduction of subsumption to satisfiability in description logics, we reformulate each candidate axiom into its corresponding counter-concept and verbalise it in controlled natural language before presenting it to Large Language Models (LLMs). We introduce LLMs as a third component that provides real-world examples approximating an instance of the counter-concept. This design property ensures that only Type II errors may occur in ontology modelling; in the worst case, these errors merely delay the construction process without introducing inconsistencies. Experimental results on 13 commercial LLMs show that recall, corresponding to Type II errors in our framework, remains stable across several well-established ontologies.
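For concreteness, a minimal sketch of the query loop the abstract describes, assuming a hypothetical `llm_complete(prompt) -> str` client; the CNL template and the NONE refusal token are illustrative stand-ins, not the paper's actual implementation.

```python
def counter_concept(c: str, d: str) -> str:
    """Reduce the subsumption test C ⊑ D to satisfiability of C ⊓ ¬D."""
    return f"{c} and not {d}"

def verbalise(counter: str) -> str:
    """Toy controlled-natural-language rendering of the counter-concept."""
    c, d = counter.split(" and not ")
    return f"Name a real-world thing that is a {c} but not a {d}, or answer NONE."

def membership_query(c: str, d: str, llm_complete) -> bool:
    """Return True iff the candidate axiom C ⊑ D is accepted.

    The LLM acts as a satisfiability oracle: producing an instance of the
    counter-concept rejects the subsumption; producing none accepts it.
    """
    answer = llm_complete(verbalise(counter_concept(c, d))).strip()
    return answer.upper() == "NONE"  # no counterexample -> accept C ⊑ D

# "Is every apple a fruit?" becomes a search for an "apple and not fruit".
print(membership_query("apple", "fruit", lambda prompt: "NONE"))  # True
```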
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes reformulating membership queries in active learning for OWL ontologies as subsumption tests, reduced to satisfiability via counter-concepts (C ⊓ ¬D). These are verbalized in controlled natural language and presented to LLMs, which act as oracles by attempting to generate real-world instances. The central claim is that this design ensures only Type II errors (delaying axiom addition) can occur, with no risk of introducing inconsistencies; experiments across 13 commercial LLMs report stable recall on several established ontologies.
Significance. If the Type-II-only error guarantee were rigorously established, the approach would offer a low-risk method for LLM-assisted ontology construction, addressing a key barrier in scaling knowledge engineering. The use of the standard DL reduction combined with LLMs as an external oracle is a clear strength, and the reported stability of recall across multiple models and ontologies provides initial evidence of practical utility. However, the significance is limited by the need to substantiate the safety property against LLM incompleteness.
major comments (2)
- [Abstract] The assertion that the design 'ensures that only Type II errors may occur... without introducing inconsistencies' does not follow from the reduction. Absence of an LLM-generated instance for the counter-concept leads to accepting the subsumption C ⊑ D; because LLMs have incomplete coverage and verbalization may obscure edge cases, a satisfiable counter-concept can be misclassified as unsatisfiable, wrongly adding the axiom (a Type I error; see the sketch after these comments). This directly contradicts the safety claim and is load-bearing for the paper's main contribution.
- [Experimental results] The claim of 'stable recall' across 13 LLMs, referenced in the abstract, lacks any description of prompt engineering, the instance generation/verification procedure, statistical tests for stability, or controls for verbalization fidelity. Without these, it is impossible to assess whether the reported recall truly measures only Type II errors or whether the oracle implementation introduces the very false-positive subsumptions the framework claims to avoid.
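To make the first objection concrete, a toy enumeration (not from the paper) of oracle outcomes against ground truth: a missed counter-instance accepts a false axiom (Type I), while a spurious counter-instance only delays a true axiom (Type II).

```python
# Toy classification of oracle outcomes vs. ground truth (illustrative only).
def error_type(axiom_holds: bool, oracle_found_instance: bool) -> str:
    if axiom_holds and oracle_found_instance:
        return "Type II: true axiom rejected (construction delayed)"
    if not axiom_holds and not oracle_found_instance:
        return "Type I: false axiom accepted (risk of inconsistency)"
    return "correct decision"

for holds in (True, False):
    for found in (True, False):
        print(f"axiom holds={holds}, instance found={found} -> "
              f"{error_type(holds, found)}")
```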
minor comments (2)
- [Abstract] The abstract is overly dense; expanding the description of how counter-concepts are verbalized and how LLM responses are interpreted as satisfiability verdicts would improve readability.
- No discussion of related work on LLM reliability for logical reasoning or prior active learning frameworks in description logics is visible in the provided abstract; adding targeted citations would strengthen context.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed comments, which help clarify the presentation of our core safety claim and experimental methodology. We address each major comment below and indicate the planned revisions.
Point-by-point responses
- Referee: [Abstract] The assertion that the design 'ensures that only Type II errors may occur... without introducing inconsistencies' does not follow from the reduction. Absence of an LLM-generated instance for the counter-concept leads to accepting the subsumption C ⊑ D; because LLMs have incomplete coverage and verbalization may obscure edge cases, a satisfiable counter-concept can be misclassified as unsatisfiable, wrongly adding the axiom (a Type I error). This directly contradicts the safety claim and is load-bearing for the paper's main contribution.
Authors: We agree that the original abstract wording overstates the guarantee. The framework queries LLMs for real-world instances of the counter-concept C ⊓ ¬D; generation of a valid instance rejects the subsumption (preventing addition of a false axiom), while absence leads to acceptance. Hallucinated instances are therefore harmless, causing at worst Type II delays, but the Type-II-only property additionally requires oracle completeness: an LLM that misses a genuine counter-instance lets a false axiom be accepted, a Type I error. We will revise the abstract to qualify the claim explicitly, stating that the approach reduces inconsistency risk relative to direct membership queries and that the Type-II-only property holds only under this completeness assumption, and we will add a short discussion of the assumption and its implications for the safety property. revision: yes
- Referee: [Experimental results] The claim of 'stable recall' across 13 LLMs, referenced in the abstract, lacks any description of prompt engineering, the instance generation/verification procedure, statistical tests for stability, or controls for verbalization fidelity. Without these, it is impossible to assess whether the reported recall truly measures only Type II errors or whether the oracle implementation introduces the very false-positive subsumptions the framework claims to avoid.
Authors: We concur that the experimental section requires substantially more methodological detail to allow proper evaluation. We will expand it to document: the exact prompt templates and any prompt-engineering steps employed; the full instance-generation protocol together with verification procedures (including whether generated instances were manually inspected for validity or automatically checked for consistency with the verbalization); the statistical measures or tests applied to assess recall stability across the 13 models; and any controls or fidelity checks performed on the controlled-natural-language verbalizations. These additions will clarify whether the observed recall corresponds to the intended error profile and will be included in the revised manuscript. revision: yes
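As an illustration of the kind of verification step such documentation might cover (hypothetical; `is_instance_of` stands in for whatever manual or reasoner-based membership test the authors actually used):

```python
# Hypothetical check that a generated instance really witnesses the
# counter-concept C ⊓ ¬D; is_instance_of is an illustrative stand-in.
def verify_counterexample(instance: str, c: str, d: str,
                          is_instance_of) -> bool:
    return is_instance_of(instance, c) and not is_instance_of(instance, d)

# Toy membership test over a tiny hand-written knowledge base.
kb = {"lemon": {"fruit"}}
member = lambda x, cls: cls in kb.get(x, set())
# "lemon" witnesses "fruit ⊓ ¬sweet thing", so fruit ⊑ sweet thing is rejected.
print(verify_counterexample("lemon", "fruit", "sweet thing", member))  # True
```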
Circularity Check
No circularity in the derivation chain
Full rationale
The paper's central derivation invokes the standard DL reduction of subsumption to satisfiability (C ⊑ D iff C ⊓ ¬D is unsatisfiable), an established external result, and then treats LLMs as an external oracle for counter-concept instances after verbalization. No equations, parameters, or definitions reduce to themselves by construction; the Type-II-only error claim is presented as a direct consequence of the oracle's placement in the reduction, not as a self-referential fit or an artifact of self-citation. The evaluation rests on external benchmarks, and none of the enumerated circularity patterns is exhibited.
Axiom & Free-Parameter Ledger
axioms (1)
- [standard math] Subsumption can be reduced to satisfiability in description logics.
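Stated formally, the recorded axiom is the textbook DL reduction, for an ontology O and concepts C, D:

```latex
% Reduction of subsumption to concept unsatisfiability.
\mathcal{O} \models C \sqsubseteq D
  \quad\Longleftrightarrow\quad
  C \sqcap \lnot D \ \text{is unsatisfiable w.r.t.}\ \mathcal{O}
```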
Reference graph
Works this paper leans on
- [1] Investigating open source LLMs to retrofit competency questions in ontology engineering. In Proceedings of the AAAI Symposium Series, volume 4, 188–198. 2025.
- [2] Consortium, G. O. 2019. The gene ontology resource: 20 years and still going strong. Nucleic Acids Research 47(D1):D330–D338.
- [3] Learning description logic concepts: when can positive and negative examples be separated? In Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence (IJCAI 2019), 1682–1688.
- [4] Why language models hallucinate. arXiv preprint arXiv:2509.04664. 2025.
- [5] Actively learning EL terminologies from large language models. In Proceedings of the 27th European Conference on Artificial Intelligence (ECAI 2025), 1792–1799.
- [6] Sirin, E.; Parsia, B.; Grau, B. C.; Kalyanpur, A.; and Katz, Y. 2007. Pellet: A practical OWL-DL reasoner. Journal of Web Semantics 5(2):51–53.