A Benchmark for Gap and Overlap Analysis as a Test of KG Task Readiness
Pith reviewed 2026-05-10 15:04 UTC · model grok-4.3
The pith
An ontology-based knowledge graph yields more consistent and diagnosable gap and overlap results on insurance contracts than direct text inference.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper presents an executable benchmark that aligns natural-language contract text with a formal ontology and evidence-linked ground truth for gap and overlap analysis. It includes ten simplified yet diverse life-insurance contracts reviewed by a domain expert, a domain ontology with an instantiated knowledge base, and 58 structured scenarios paired with SPARQL queries that return contract-level outcomes and clause-level excerpts. The comparison of a text-only LLM baseline against an ontology-driven pipeline demonstrates that explicit modeling improves consistency and diagnosis for these tasks.
What carries the argument
The benchmark resource that aligns contract text with an ontology, populates an instantiated KG, and supplies SPARQL queries linked to expert-labeled outcomes and justifying clause excerpts.
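To make the mechanism concrete, here is a minimal sketch of how a gap/overlap determination with clause-level evidence could be computed over an instantiated knowledge base. It is not the paper's ontology or its SPARQL queries: the triples, the predicate names `covers` and `excludes`, the contract identifiers, and the clause excerpts are all invented for illustration, and a plain Python set stands in for the KG.

```python
from dataclasses import dataclass

# Hypothetical toy KG as (subject, predicate, object) triples.
# Names are illustrative only; they are not taken from the benchmark.
TRIPLES = {
    ("contractA", "covers", "AccidentalDeath"),
    ("contractA", "excludes", "ExtremeSports"),
    ("contractB", "covers", "AccidentalDeath"),
    ("contractC", "covers", "NaturalDeath"),
}

# Invented clause excerpts keyed by triple, so each label can cite its source.
EVIDENCE = {
    ("contractA", "covers", "AccidentalDeath"): "A 2.1: benefit payable on accidental death",
    ("contractA", "excludes", "ExtremeSports"): "A 5.3: no benefit for extreme sports",
    ("contractB", "covers", "AccidentalDeath"): "B 1.4: accidental death is covered",
    ("contractC", "covers", "NaturalDeath"): "C 1.1: death by natural causes is covered",
}


@dataclass(frozen=True)
class Outcome:
    contract: str
    label: str        # "overlap" or "gap"
    evidence: tuple   # clause excerpts that justify the label


def analyze(event, circumstance, contracts):
    """Label each contract for a scenario: overlap if the event is covered
    and the circumstance is not excluded, otherwise gap."""
    outcomes = []
    for contract in contracts:
        cover = (contract, "covers", event)
        exclusion = (contract, "excludes", circumstance)
        if cover not in TRIPLES:
            # No coverage fact at all, so there is no clause to cite.
            outcomes.append(Outcome(contract, "gap", ()))
        elif exclusion in TRIPLES:
            # Covered in principle, but a restriction applies.
            outcomes.append(
                Outcome(contract, "gap", (EVIDENCE[cover], EVIDENCE[exclusion]))
            )
        else:
            outcomes.append(Outcome(contract, "overlap", (EVIDENCE[cover],)))
    return outcomes


for outcome in analyze(
    "AccidentalDeath", "ExtremeSports", ["contractA", "contractB", "contractC"]
):
    print(outcome.contract, outcome.label, outcome.evidence)
```

The sketch separates two kinds of gap: a contract that never covers the event, and a contract that covers it but restricts it. That is the coverage-versus-restriction distinction the benchmark is described as testing.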
If this is right
- KG construction methods can be compared systematically on competency questions that require distinguishing coverage from restrictions.
- Explicit ontology modeling supplies traceable clause-level justifications that text-only inference lacks.
- The same benchmark template can be applied to ontology learning, KG population, and evidence-grounded question answering in policy domains.
Where Pith is reading between the lines
- Similar resources could be built for other regulated domains such as healthcare policies or financial regulations to test KG readiness.
- Pure LLM approaches may require hybrid augmentation with structured representations to handle precise logical distinctions in contracts.
- Scaling the benchmark to complete contracts would expose whether current ontology population techniques can maintain the observed consistency gains.
Load-bearing premise
The ten simplified contracts and 58 expert-labeled scenarios are representative of real-world policy gap and overlap tasks and the ground-truth labels are stable and unbiased.
What would settle it
Running the same 58 scenarios on a collection of unmodified, full-length insurance contracts and observing either no gain in consistency for the ontology pipeline or frequent changes in the expert labels when a second independent reviewer is used.
Original abstract
Task-oriented evaluation of knowledge graph (KG) quality increasingly asks whether an ontology-based representation can answer the competency questions that users actually care about, in a manner that is reproducible, explainable, and traceable to evidence. This paper adopts that perspective and focuses on gap and overlap analysis for policy-like documents (e.g., insurance contracts), where the task is to determine, for a given scenario, which documents support it (overlap) and which do not (gap), with defensible justifications. The resulting gap/overlap determinations are typically driven by genuine differences in coverage and restrictions rather than missing data, making the task a direct test of KG task readiness rather than a test of missing facts or query expressiveness. We present an executable and auditable benchmark that aligns natural-language contract text with a formal ontology and evidence-linked ground truth, enabling systematic comparison of methods. The benchmark includes: (i) ten simplified yet diverse life-insurance contracts reviewed by a domain expert, (ii) a domain ontology (TBox) with an instantiated knowledge base (ABox) populated from contract facts, and (iii) 58 structured scenarios paired with SPARQL queries, contract-level outcomes, and clause-level excerpts that justify each label. Using this resource, we compare a text-only LLM baseline that infers outcomes directly from contract text against an ontology-driven pipeline that answers the same scenarios over the instantiated KG, demonstrating that explicit modeling improves consistency and diagnosis for gap/overlap analyses. Although demonstrated for gap and overlap analysis, the benchmark is intended as a reusable template for evaluating KG quality and supporting downstream work such as ontology learning, KG population, and evidence-grounded question answering.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims to present an executable benchmark for assessing knowledge graph (KG) task readiness via gap and overlap analysis on policy documents. It constructs ten simplified life-insurance contracts reviewed by a domain expert, a domain ontology (TBox) with instantiated ABox, and 58 structured scenarios paired with SPARQL queries, contract-level outcomes, and clause-level evidence excerpts. The central demonstration is that an ontology-driven pipeline outperforms a text-only LLM baseline in consistency and diagnostic capability for determining overlaps and gaps.
Significance. If the results hold, this provides a reusable template for evaluating KG quality on competency questions that involve genuine coverage differences rather than missing facts. The auditable, evidence-linked design with formal ontology alignment is a clear strength that could support ontology learning, KG population, and explainable QA. The focus on reproducible, traceable determinations addresses a practical need in domains like insurance policy analysis.
major comments (2)
- [Benchmark Construction] Benchmark construction and labeling process: The 58 scenarios receive ground-truth labels from a single domain expert on the ten simplified contracts, with no reported inter-annotator agreement, sensitivity analysis to label perturbations, or validation against additional experts. This is load-bearing for the improvement claim, because any measured consistency or diagnosis gains between the LLM baseline and ontology pipeline could reflect the specific interpretive biases or instabilities in the single-expert labels rather than the benefit of explicit modeling.
- [Evaluation] Evaluation and results: The demonstration of improved consistency and diagnosis lacks accompanying quantitative metrics (e.g., agreement rates, error breakdowns), statistical tests, or ablation on label stability. Without these, it is difficult to determine the magnitude or robustness of the reported gains over the text-only baseline.
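The inter-annotator agreement that the first major comment asks for is commonly reported as Cohen's kappa. The sketch below shows one way to compute it if a second expert labeled the same scenarios. The label lists are invented placeholders, not benchmark data, and the function name is an assumption of this example.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's label frequencies.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    categories = set(freq_a) | set(freq_b)
    expected = sum(freq_a[c] * freq_b[c] for c in categories) / (n * n)
    if expected == 1.0:
        # Both annotators used one identical label throughout.
        return 1.0
    return (observed - expected) / (1 - expected)


# Invented labels for four scenarios, for illustration only.
expert_1 = ["overlap", "overlap", "gap", "gap"]
expert_2 = ["overlap", "gap", "gap", "gap"]
print(cohens_kappa(expert_1, expert_2))
```

Kappa corrects raw agreement for the agreement expected by chance, which matters when one label (for example, gap) dominates the scenario set.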
minor comments (2)
- The abstract states that the benchmark is 'executable and auditable' but would benefit from an explicit statement of how the SPARQL queries and evidence excerpts are made publicly available (e.g., repository link or supplementary files) to enable full reproducibility.
- [Benchmark Construction] Clarify in the methods whether the ten contracts are provided in full or only as excerpts, as this affects the ability of readers to assess how simplification impacts the gap/overlap task.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address each major comment below and indicate where revisions will be made to improve clarity and rigor.
- Referee: [Benchmark Construction] Benchmark construction and labeling process: The 58 scenarios receive ground-truth labels from a single domain expert on the ten simplified contracts, with no reported inter-annotator agreement, sensitivity analysis to label perturbations, or validation against additional experts. This is load-bearing for the improvement claim, because any measured consistency or diagnosis gains between the LLM baseline and ontology pipeline could reflect the specific interpretive biases or instabilities in the single-expert labels rather than the benefit of explicit modeling.
  Authors: We acknowledge that the ground-truth labels were assigned by a single domain expert, which represents a genuine limitation for claims about robustness. The 58 scenarios were intentionally designed around unambiguous coverage distinctions in the simplified contracts, and each label is paired with clause-level evidence excerpts to support traceability and independent verification. The benchmark is released publicly as an executable artifact specifically to invite community review and additional annotations. In the revised manuscript we will add an explicit limitations subsection discussing single-expert labeling and outline plans for future multi-expert validation. We continue to hold that the ontology pipeline's advantages in consistency and diagnosis arise from formal modeling rather than label artifacts, but we accept that additional validation would strengthen this position.
  Revision: partial
- Referee: [Evaluation] Evaluation and results: The demonstration of improved consistency and diagnosis lacks accompanying quantitative metrics (e.g., agreement rates, error breakdowns), statistical tests, or ablation on label stability. Without these, it is difficult to determine the magnitude or robustness of the reported gains over the text-only baseline.
  Authors: The manuscript currently illustrates the differences via concrete scenario examples and qualitative analysis of consistency and diagnostic traceability. We agree that quantitative support is needed to convey the magnitude of improvement. In the revision we will add a dedicated results subsection reporting agreement rates of each method against the ground truth across all 58 scenarios, categorized error breakdowns (e.g., false-positive overlaps or missed gaps), and a limited sensitivity check by perturbing a subset of labels to assess stability. These additions will make the comparison more transparent and allow readers to evaluate robustness directly.
  Revision: yes
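The three quantities promised in this response (agreement rate, categorized errors, and a label-perturbation check) can be sketched as follows. This is an illustration under stated assumptions, not the authors' evaluation code. The function names, the two error category names, the flip-based perturbation scheme, and the sample labels are all hypothetical.

```python
import random
from collections import Counter


def agreement_rate(predicted, truth):
    """Fraction of scenario outcomes where the method matches the ground truth."""
    return sum(p == t for p, t in zip(predicted, truth)) / len(truth)


def error_breakdown(predicted, truth):
    """Count errors by direction (category names are assumptions of this sketch)."""
    errors = Counter()
    for p, t in zip(predicted, truth):
        if p == "overlap" and t == "gap":
            errors["false_positive_overlap"] += 1
        elif p == "gap" and t == "overlap":
            errors["false_negative_overlap"] += 1
    return errors


def perturbed_agreement(predicted, truth, flips, trials, seed=0):
    """Flip `flips` randomly chosen ground-truth labels per trial and report
    the minimum and maximum agreement rate seen across `trials` trials."""
    rng = random.Random(seed)
    flip = {"overlap": "gap", "gap": "overlap"}
    rates = []
    for _ in range(trials):
        noisy = list(truth)
        for i in rng.sample(range(len(noisy)), flips):
            noisy[i] = flip[noisy[i]]
        rates.append(agreement_rate(predicted, noisy))
    return min(rates), max(rates)


# Invented labels for four scenarios, for illustration only.
truth = ["overlap", "gap", "gap", "overlap"]
predicted = ["overlap", "overlap", "gap", "gap"]
print(agreement_rate(predicted, truth))
print(dict(error_breakdown(predicted, truth)))
print(perturbed_agreement(predicted, truth, flips=1, trials=20))
```

A narrow range from `perturbed_agreement` would suggest that a method's measured advantage does not hinge on a handful of contestable labels, which is the concern behind the single-expert objection.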
Circularity Check
No circularity: benchmark construction and empirical comparison are self-contained
The paper introduces a new benchmark resource (10 simplified contracts, domain ontology with ABox, and 58 expert-labeled scenarios with SPARQL queries and clause excerpts) and uses it to empirically compare a text-only LLM baseline against an ontology-driven pipeline on gap/overlap tasks. The central claim of improved consistency and diagnosis is demonstrated directly via performance differences on this independently constructed resource, with no equations, fitted parameters, self-definitional reductions, or load-bearing self-citations that collapse predictions back to inputs by construction. The derivation chain consists of resource creation followed by straightforward method evaluation, remaining fully self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Domain-expert labeling of scenarios produces stable and unbiased ground truth for gap/overlap outcomes.