Characterising LLM-Generated Competency Questions: a Cross-Domain Empirical Study using Open and Closed Models

Jacopo de Berardinis; Reham Alharbi; Terry R. Payne; Valentina Tamma

arxiv: 2604.16258 · v1 · submitted 2026-04-17 · 💻 cs.AI

Pith reviewed 2026-05-10 08:35 UTC · model grok-4.3

keywords: ontology, questions, analysis, cases, closed, competency, complexity, cross-domain

The pith

LLM-generated competency questions exhibit distinct profiles in readability, relevance, and complexity that vary by model type and use case.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Competency questions are natural-language questions that capture what an ontology must be able to answer. Instead of having experts write them by hand, the authors use large language models to generate them automatically from given use cases and scenarios. They define quantitative measures that score the generated questions on how easy they are to read, how well they match the input text, and how complex their sentence structure is. Tests of several models (Llama variants, Kimi, Gemini, and GPT) across multiple domains showed that each model tends to produce questions with its own characteristic style and strengths, depending on the use case.
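As an illustration of what a readability score of this kind can look like, the sketch below computes the Flesch Reading Ease of a single question. This summary does not say which formulas the paper uses, so the choice of Flesch Reading Ease, the function names, and the vowel-group syllable heuristic are all assumptions made for the example.

```python
import re


def count_syllables(word: str) -> int:
    """Rough syllable estimate (assumption): count groups of consecutive vowels."""
    lowered = word.lower()
    count = len(re.findall(r"[aeiouy]+", lowered))
    # Treat a trailing 'e' as silent unless the word ends in 'le'.
    if lowered.endswith("e") and not lowered.endswith("le") and count > 1:
        count -= 1
    return max(count, 1)


def flesch_reading_ease(text: str) -> float:
    """Flesch Reading Ease: higher values indicate easier text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one word")
    syllables = sum(count_syllables(w) for w in words)
    words_per_sentence = len(words) / len(sentences)
    syllables_per_word = syllables / len(words)
    return 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word


# Hypothetical competency questions, written for this example only.
short_cq = "Who wrote it?"
long_cq = "Which organisational responsibilities characterise international collaborations?"
print(flesch_reading_ease(short_cq) > flesch_reading_ease(long_cq))
```

A score like this only describes surface form. Whether it says anything about a question's usefulness for ontology work is exactly the open point raised in the load-bearing premise.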

Core claim

Our analysis demonstrates that LLM performance reflects distinct generation profiles shaped by the use case.

Load-bearing premise

That the newly introduced quantitative measures for readability, relevance, and structural complexity validly capture the utility of generated competency questions for downstream ontology engineering tasks.

Original abstract

Competency Questions (CQs) are a cornerstone of requirement elicitation in ontology engineering. CQs represent requirements as a set of natural language questions that an ontology should satisfy; they are traditionally modelled by ontology engineers together with domain experts as part of a human-centred, manual elicitation process. The use of Generative AI automates CQ creation at scale, therefore democratising the process of generation, widening stakeholder engagement, and ultimately broadening access to ontology engineering. However, given the large and heterogeneous landscape of LLMs, varying in dimensions such as parameter scale, task and domain specialisation, and accessibility, it is crucial to characterise and understand the intrinsic, observable properties of the CQs they produce (e.g., readability, structural complexity) through a systematic, cross-domain analysis. This paper introduces a set of quantitative measures for the systematic comparison of CQs across multiple dimensions. Using CQs generated from well defined use cases and scenarios, we identify their salient properties, including readability, relevance with respect to the input text and structural complexity of the generated questions. We conduct our experiments over a set of use cases and requirements using a range of LLMs, including both open (KimiK2-1T, LLama3.1-8B, LLama3.2-3B) and closed models (Gemini 2.5 Pro, GPT 4.1). Our analysis demonstrates that LLM performance reflects distinct generation profiles shaped by the use case.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

This paper runs a cross-model comparison of LLM-generated competency questions using new metrics for readability, relevance, and complexity, but does not test whether those metrics track actual usefulness in ontology work.

The core contribution is an empirical comparison of how several LLMs turn use-case text into competency questions. The authors apply the same set of measures to open models (Llama 3.1, Llama 3.2, and Kimi K2) and closed ones (GPT-4.1 and Gemini 2.5 Pro), then report that the outputs show different profiles depending on the domain and scenario. That gives a concrete picture of variation that had not previously been quantified for this task in ontology engineering.

Referee Report

2 major / 1 minor

Summary. The manuscript presents a cross-domain empirical study of competency questions (CQs) generated by LLMs for ontology engineering. It defines quantitative measures of readability, relevance to input text, and structural complexity, applies them to outputs from open models (KimiK2-1T, Llama3.1-8B, Llama3.2-3B) and closed models (Gemini 2.5 Pro, GPT-4.1) across multiple use cases, and concludes that LLMs exhibit distinct generation profiles shaped by the use case.

Significance. If the proposed measures can be shown to correlate with downstream ontology-engineering outcomes, the work would offer practical guidance on model selection for automated CQ generation and help broaden access to ontology engineering. The systematic inclusion of both open and closed models together with a multi-domain design is a clear strength.

major comments (2)

[Section introducing the quantitative measures] The central claim that LLMs display use-case-shaped generation profiles rests on the three newly introduced quantitative measures (readability, relevance, structural complexity). The manuscript defines these via standard NLP metrics but supplies no validation—such as correlation with expert CQ quality ratings, inter-annotator agreement, or performance on a downstream ontology task—demonstrating that differences in the scores predict actual utility. Without this, observed profile differences risk being artifacts of the chosen proxies rather than substantive distinctions (see the skeptic note on untested validity of the measures).
[Results and analysis] The results and analysis sections lack essential experimental details required to assess robustness: number of CQs generated per use case and model, temperature or sampling settings, number of independent runs, statistical tests used to declare 'distinct profiles,' and any data-exclusion criteria. These omissions make it impossible to determine whether the reported differences are reliable or merely reflect sampling variability.
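The validation requested in the first major comment could be sketched as a rank correlation between a metric's scores and expert quality ratings for the same questions. The sketch below is a minimal pure-Python Spearman correlation, not a method from the paper; the function names are hypothetical, and the numbers in the usage lines are placeholders, not data from the study.

```python
from statistics import mean


def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    return pearson(average_ranks(x), average_ranks(y))


# Placeholder values for illustration only (not results from the paper).
metric_scores = [62.0, 48.5, 71.2, 55.0, 40.1]  # e.g. a readability score per question
expert_ratings = [4, 3, 5, 3, 2]                # e.g. a 1-5 quality rating per question
print(round(spearman(metric_scores, expert_ratings), 3))
```

A coefficient near zero on real annotations would support the referee's concern that profile differences may be artifacts of the chosen proxies. A full validation would also report a significance test and inter-annotator agreement, which this sketch omits.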

minor comments (1)

[Abstract] The abstract refers to 'well defined use cases and scenarios' without enumerating them; a short list or reference to the specific domains would improve readability and allow readers to judge generalizability.

Circularity Check

0 steps flagged

No circularity: purely empirical analysis with no derivations or self-referential reductions

full rationale

The paper performs a cross-domain empirical study by generating competency questions with various LLMs and applying newly introduced quantitative measures (readability, relevance to input text, structural complexity) defined via standard NLP metrics. No mathematical derivations, equations, fitted parameters, predictions, or self-citation chains are present in the provided text or abstract. The central claim—that LLM performance reflects distinct generation profiles shaped by use case—arises directly from measurement and comparison of outputs against defined scenarios, without any step reducing to its own inputs by construction. This is self-contained empirical observation.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The paper is an empirical characterization study; the central claim rests on the assumption that the introduced quantitative measures are appropriate proxies for CQ quality, but no free parameters, axioms, or invented entities are specified in the abstract.
