arxiv: 2604.16258 · v1 · submitted 2026-04-17 · 💻 cs.AI

Characterising LLM-Generated Competency Questions: a Cross-Domain Empirical Study using Open and Closed Models

Pith reviewed 2026-05-10 08:35 UTC · model grok-4.3

classification 💻 cs.AI
keywords ontology · questions · analysis · cases · closed · competency · complexity · cross-domain

The pith

LLM-generated competency questions exhibit distinct profiles in readability, relevance, and complexity that vary by model type and use case.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Competency questions are natural-language questions that capture what an ontology must be able to answer. Instead of relying on experts to write them by hand, the authors use large language models to generate them automatically from given use cases and scenarios. They define quantitative measures that score each generated question on how easy it is to read, how closely it matches the input text, and how complex its sentence structure is. Tests across multiple domains with open models (KimiK2-1T, Llama3.1-8B, Llama3.2-3B) and closed models (Gemini 2.5 Pro, GPT-4.1) showed that each model tends to produce questions with its own characteristic profile, and that this profile varies with the use case.
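To make the idea of scoring a question numerically concrete, here is a minimal sketch of two such scores. It is not the paper's implementation: the text above does not say which formulas the authors use, so the sketch assumes Flesch Reading Ease for readability and a simple token-overlap ratio for relevance. The function names, the syllable heuristic, and the stop-word list are all hypothetical. Structural complexity is left out because it would typically require a syntactic parser.

```python
import re

# Hypothetical stop-word list, for illustration only.
STOP_WORDS = {"which", "what", "who", "the", "a", "an", "did", "do", "does",
              "is", "are", "of", "in", "at", "to", "for", "and"}


def tokenize(text: str) -> list[str]:
    """Lower-case the text and return its alphabetic tokens."""
    return re.findall(r"[a-z']+", text.lower())


def count_syllables(word: str) -> int:
    """Crude heuristic: count vowel groups, discounting a trailing silent 'e'."""
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and not word.endswith("le") and count > 1:
        count -= 1
    return max(count, 1)


def flesch_reading_ease(text: str) -> float:
    """Flesch Reading Ease: higher values indicate easier text."""
    words = tokenize(text)
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    if not words or not sentences:
        return 0.0
    syllables = sum(count_syllables(w) for w in words)
    return (206.835
            - 1.015 * (len(words) / len(sentences))
            - 84.6 * (syllables / len(words)))


def lexical_relevance(question: str, source_text: str) -> float:
    """Share of the question's distinct content tokens that occur in the source."""
    content = set(tokenize(question)) - STOP_WORDS
    if not content:
        return 0.0
    return len(content & set(tokenize(source_text))) / len(content)


if __name__ == "__main__":
    scenario = "The band performs songs at concerts in several venues."
    cq = "Which songs did the band play at concerts?"
    print(round(flesch_reading_ease(cq), 2), lexical_relevance(cq, scenario))
```

Both scores are surface-level proxies. An embedding-based similarity or a different readability formula would be equally defensible, and could rank the same questions differently.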

Core claim

Our analysis demonstrates that LLM performance reflects distinct generation profiles shaped by the use case.

Load-bearing premise

That the newly introduced quantitative measures for readability, relevance, and structural complexity validly capture the utility of generated competency questions for downstream ontology engineering tasks.

Original abstract

Competency Questions (CQs) are a cornerstone of requirement elicitation in ontology engineering. CQs represent requirements as a set of natural language questions that an ontology should satisfy; they are traditionally modelled by ontology engineers together with domain experts as part of a human-centred, manual elicitation process. The use of Generative AI automates CQ creation at scale, therefore democratising the process of generation, widening stakeholder engagement, and ultimately broadening access to ontology engineering. However, given the large and heterogeneous landscape of LLMs, varying in dimensions such as parameter scale, task and domain specialisation, and accessibility, it is crucial to characterise and understand the intrinsic, observable properties of the CQs they produce (e.g., readability, structural complexity) through a systematic, cross-domain analysis. This paper introduces a set of quantitative measures for the systematic comparison of CQs across multiple dimensions. Using CQs generated from well defined use cases and scenarios, we identify their salient properties, including readability, relevance with respect to the input text and structural complexity of the generated questions. We conduct our experiments over a set of use cases and requirements using a range of LLMs, including both open (KimiK2-1T, LLama3.1-8B, LLama3.2-3B) and closed models (Gemini 2.5 Pro, GPT 4.1). Our analysis demonstrates that LLM performance reflects distinct generation profiles shaped by the use case.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript presents a cross-domain empirical study of competency questions (CQs) generated by LLMs for ontology engineering. It defines quantitative measures of readability, relevance to input text, and structural complexity, applies them to outputs from open models (KimiK2-1T, Llama3.1-8B, Llama3.2-3B) and closed models (Gemini 2.5 Pro, GPT-4.1) across multiple use cases, and concludes that LLMs exhibit distinct generation profiles shaped by the use case.

Significance. If the proposed measures can be shown to correlate with downstream ontology-engineering outcomes, the work would offer practical guidance on model selection for automated CQ generation and help broaden access to ontology engineering. The systematic inclusion of both open and closed models together with a multi-domain design is a clear strength.
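One way to test whether a measure correlates with an external judgement is a rank correlation between metric scores and expert ratings of the same questions. The sketch below computes Spearman's rho in plain Python. It illustrates the kind of check the report asks for and is not something the manuscript reports. The function names are hypothetical, and any scores or ratings passed in would be placeholders supplied by the reader.

```python
from statistics import mean


def average_ranks(values: list[float]) -> list[float]:
    """Return 1-based ranks, giving tied values the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x: list[float], y: list[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


def spearman(x: list[float], y: list[float]) -> float:
    """Spearman's rho: the Pearson correlation of the rank-transformed data."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    return pearson(average_ranks(x), average_ranks(y))
```

A rho near zero between, say, readability scores and expert ratings would suggest the measure is not tracking what experts value. A strong correlation would support using it as a proxy, at least for the rated sample.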

major comments (2)
  1. [Section introducing the quantitative measures] The central claim that LLMs display use-case-shaped generation profiles rests on the three newly introduced quantitative measures (readability, relevance, structural complexity). The manuscript defines these via standard NLP metrics but supplies no validation—such as correlation with expert CQ quality ratings, inter-annotator agreement, or performance on a downstream ontology task—demonstrating that differences in the scores predict actual utility. Without this, observed profile differences risk being artifacts of the chosen proxies rather than substantive distinctions (see the skeptic note on untested validity of the measures).
  2. [Results and analysis] The results and analysis sections lack essential experimental details required to assess robustness: number of CQs generated per use case and model, temperature or sampling settings, number of independent runs, statistical tests used to declare 'distinct profiles,' and any data-exclusion criteria. These omissions make it impossible to determine whether the reported differences are reliable or merely reflect sampling variability.
minor comments (1)
  1. [Abstract] The abstract refers to 'well defined use cases and scenarios' without enumerating them; a short list or reference to the specific domains would improve readability and allow readers to judge generalizability.
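Major comment 2 asks which statistical tests support the claim of distinct profiles. As a hedged illustration of one possible option, the sketch below runs a two-sided permutation test on the difference in mean score between two models. The manuscript does not state that it uses this test. The function name, the default number of permutations, and the seed are assumptions made for the example.

```python
import random


def permutation_test(a: list[float], b: list[float],
                     n_permutations: int = 10_000, seed: int = 0) -> float:
    """Two-sided permutation p-value for the difference in means of a and b."""
    if not a or not b:
        raise ValueError("both samples must be non-empty")
    rng = random.Random(seed)  # fixed seed so the result is reproducible
    observed = abs(sum(a) / len(a) - sum(b) / len(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_permutations):
        # Reassign scores to the two groups at random.
        rng.shuffle(pooled)
        left, right = pooled[:len(a)], pooled[len(a):]
        diff = abs(sum(left) / len(left) - sum(right) / len(right))
        # Small tolerance guards against floating-point noise.
        if diff >= observed - 1e-12:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_permutations + 1)
```

Applied to the per-question scores of two models on the same use case, a small p-value would indicate that the gap in means is unlikely under random relabelling. With several models and measures, a correction for multiple comparisons would also be needed.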

Circularity Check

0 steps flagged

No circularity: purely empirical analysis with no derivations or self-referential reductions

full rationale

The paper performs a cross-domain empirical study by generating competency questions with various LLMs and applying newly introduced quantitative measures (readability, relevance to input text, structural complexity) defined via standard NLP metrics. No mathematical derivations, equations, fitted parameters, predictions, or self-citation chains are present in the provided text or abstract. The central claim—that LLM performance reflects distinct generation profiles shaped by use case—arises directly from measurement and comparison of outputs against defined scenarios, without any step reducing to its own inputs by construction. This is self-contained empirical observation.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The paper is an empirical characterization study; the central claim rests on the assumption that the introduced quantitative measures are appropriate proxies for CQ quality, but no free parameters, axioms, or invented entities are specified in the abstract.

pith-pipeline@v0.9.0 · 5581 in / 1041 out tokens · 26148 ms · 2026-05-10T08:35:54.669346+00:00 · methodology

