Characterising LLM-Generated Competency Questions: a Cross-Domain Empirical Study using Open and Closed Models
Pith reviewed 2026-05-10 08:35 UTC · model grok-4.3
The pith
LLM-generated competency questions exhibit distinct profiles in readability, relevance, and complexity that vary by model type and use case.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Our analysis demonstrates that LLM performance reflects distinct generation profiles shaped by the use case.
Load-bearing premise
That the newly introduced quantitative measures for readability, relevance, and structural complexity validly capture the utility of generated competency questions for downstream ontology engineering tasks.
Original abstract
Competency Questions (CQs) are a cornerstone of requirement elicitation in ontology engineering. CQs represent requirements as a set of natural language questions that an ontology should satisfy; they are traditionally modelled by ontology engineers together with domain experts as part of a human-centred, manual elicitation process. The use of Generative AI automates CQ creation at scale, therefore democratising the process of generation, widening stakeholder engagement, and ultimately broadening access to ontology engineering. However, given the large and heterogeneous landscape of LLMs, varying in dimensions such as parameter scale, task and domain specialisation, and accessibility, it is crucial to characterise and understand the intrinsic, observable properties of the CQs they produce (e.g., readability, structural complexity) through a systematic, cross-domain analysis. This paper introduces a set of quantitative measures for the systematic comparison of CQs across multiple dimensions. Using CQs generated from well-defined use cases and scenarios, we identify their salient properties, including readability, relevance with respect to the input text, and structural complexity of the generated questions. We conduct our experiments over a set of use cases and requirements using a range of LLMs, including both open (KimiK2-1T, Llama3.1-8B, Llama3.2-3B) and closed models (Gemini 2.5 Pro, GPT-4.1). Our analysis demonstrates that LLM performance reflects distinct generation profiles shaped by the use case.
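To make "readability" concrete, here is a minimal sketch of one widely used readability formula, Flesch Reading Ease, applied to a single question. The abstract does not say which formula the paper uses, so the choice of formula is an assumption made only for illustration. The vowel-run syllable counter is a rough heuristic, and the function names and both example questions are hypothetical.

```python
import re


def count_syllables(word: str) -> int:
    """Rough heuristic (an assumption): one syllable per run of vowels, minimum one."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def flesch_reading_ease(text: str) -> float:
    """Flesch Reading Ease: higher scores mean easier text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one word")
    syllables = sum(count_syllables(w) for w in words)
    words_per_sentence = len(words) / len(sentences)
    syllables_per_word = syllables / len(words)
    return 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word


# Hypothetical competency questions, not taken from the paper.
simple_cq = "Who wrote this song?"
dense_cq = (
    "Which organisational entities are responsible for the "
    "administration of interoperability requirements?"
)

simple_score = flesch_reading_ease(simple_cq)
dense_score = flesch_reading_ease(dense_cq)
print(simple_score > dense_score)  # the short question scores as easier to read
```

A score like this depends only on surface features (sentence length and syllable counts), which is why the question of whether it tracks usefulness for ontology engineering has to be settled separately.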
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents a cross-domain empirical study of competency questions (CQs) generated by LLMs for ontology engineering. It defines quantitative measures of readability, relevance to input text, and structural complexity, applies them to outputs from open models (KimiK2-1T, Llama3.1-8B, Llama3.2-3B) and closed models (Gemini 2.5 Pro, GPT-4.1) across multiple use cases, and concludes that LLMs exhibit distinct generation profiles shaped by the use case.
Significance. If the proposed measures can be shown to correlate with downstream ontology-engineering outcomes, the work would offer practical guidance on model selection for automated CQ generation and help broaden access to ontology engineering. The systematic inclusion of both open and closed models together with a multi-domain design is a clear strength.
major comments (2)
- [Section introducing the quantitative measures] The central claim that LLMs display use-case-shaped generation profiles rests on the three newly introduced quantitative measures (readability, relevance, structural complexity). The manuscript defines these via standard NLP metrics but supplies no validation—such as correlation with expert CQ quality ratings, inter-annotator agreement, or performance on a downstream ontology task—demonstrating that differences in the scores predict actual utility. Without this, observed profile differences risk being artifacts of the chosen proxies rather than substantive distinctions (see the skeptic note on untested validity of the measures).
- [Results and analysis] The results and analysis sections lack essential experimental details required to assess robustness: number of CQs generated per use case and model, temperature or sampling settings, number of independent runs, statistical tests used to declare 'distinct profiles,' and any data-exclusion criteria. These omissions make it impossible to determine whether the reported differences are reliable or merely reflect sampling variability.
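The two checks requested in the major comments could look roughly like the sketch below, which uses only the Python standard library. It is not the paper's method: the Spearman rank correlation stands in for "correlation with expert CQ quality ratings", and a two-sided permutation test on the difference of means stands in for a test of whether two models' score distributions differ. All function names are hypothetical, and every number in the usage lines is made up for illustration.

```python
import random
from statistics import mean


def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation; undefined (division by zero) if either input is constant."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def permutation_p_value(a, b, n_perm=10_000, seed=0):
    """Two-sided permutation test on the absolute difference of group means."""
    rng = random.Random(seed)
    observed = abs(mean(a) - mean(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        diff = abs(mean(pooled[: len(a)]) - mean(pooled[len(a):]))
        if diff >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_perm + 1)


# Hypothetical data: metric scores vs. expert ratings for the same five CQs.
metric_scores = [62.0, 48.5, 71.2, 55.0, 80.1]
expert_ratings = [3, 2, 4, 3, 5]
print(spearman(metric_scores, expert_ratings))

# Hypothetical data: one metric measured on CQs from two models.
model_a = [62.0, 48.5, 71.2, 55.0, 80.1]
model_b = [58.3, 50.0, 69.9, 57.4, 77.0]
print(permutation_p_value(model_a, model_b, n_perm=2_000))
```

A permutation test is used here because it makes no distributional assumption about the scores; with several models and metrics, a correction for multiple comparisons would also be needed.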
minor comments (1)
- [Abstract] The abstract refers to 'well defined use cases and scenarios' without enumerating them; a short list or reference to the specific domains would improve readability and allow readers to judge generalizability.
Circularity Check
No circularity: purely empirical analysis with no derivations or self-referential reductions
full rationale
The paper performs a cross-domain empirical study by generating competency questions with various LLMs and applying newly introduced quantitative measures (readability, relevance to input text, structural complexity) defined via standard NLP metrics. No mathematical derivations, equations, fitted parameters, predictions, or self-citation chains are present in the provided text or abstract. The central claim—that LLM performance reflects distinct generation profiles shaped by use case—arises directly from measurement and comparison of outputs against defined scenarios, without any step reducing to its own inputs by construction. This is self-contained empirical observation.