arxiv: 2604.10853 · v1 · submitted 2026-04-12 · 💻 cs.AI

A Benchmark for Gap and Overlap Analysis as a Test of KG Task Readiness

Pith reviewed 2026-05-10 15:04 UTC · model grok-4.3

classification 💻 cs.AI
keywords gap analysis · overlap analysis · knowledge graph · ontology · benchmark · insurance contracts · SPARQL · policy documents

The pith

An ontology-based knowledge graph yields more consistent and diagnosable gap and overlap results on insurance contracts than direct text inference.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper creates a benchmark to test whether knowledge graphs can answer real competency questions about policy coverage gaps and overlaps in a reproducible, explainable way. It supplies ten simplified life-insurance contracts, a domain ontology with populated facts, and 58 scenarios each tied to SPARQL queries plus clause-level evidence that justifies the labels. Direct comparison shows that routing the same scenarios through the instantiated graph produces steadier outcomes and clearer explanations than an LLM reading the raw text alone. The resource matters because many policy decisions hinge on defensible distinctions between what is covered and what is not, rather than on missing data or query syntax.

Core claim

The paper presents an executable benchmark that aligns natural-language contract text with a formal ontology and evidence-linked ground truth for gap and overlap analysis. It includes ten simplified yet diverse life-insurance contracts reviewed by a domain expert, a domain ontology with an instantiated knowledge base, and 58 structured scenarios paired with SPARQL queries that return contract-level outcomes and clause-level excerpts. The comparison of a text-only LLM baseline against an ontology-driven pipeline demonstrates that explicit modeling improves consistency and diagnosis for these tasks.

What carries the argument

The benchmark resource that aligns contract text with an ontology, populates an instantiated KG, and supplies SPARQL queries linked to expert-labeled outcomes and justifying clause excerpts.
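
As a rough illustration of this mechanism, the sketch below partitions contracts into overlap and gap for one scenario and keeps the clause excerpt behind each label. It is a minimal Python stand-in, not the paper's pipeline: the benchmark runs SPARQL queries over an ontology-backed KG, whereas here the facts are plain tuples. Every contract fact, predicate name (`covers`, `excludes`), event name, and excerpt is hypothetical.

```python
from typing import NamedTuple


class Fact(NamedTuple):
    contract: str   # contract identifier, e.g. "C1"
    predicate: str  # hypothetical predicate: "covers" or "excludes"
    event: str      # hypothetical event name used by the scenario
    excerpt: str    # clause-level text that justifies the fact


# Hypothetical facts standing in for the instantiated KG (ABox).
FACTS = [
    Fact("C1", "covers", "death_during_hazardous_sport",
         "C1 s2.1: benefit is paid for any accidental death."),
    Fact("C2", "excludes", "death_during_hazardous_sport",
         "C2 s5.2: no benefit for death during hazardous sports."),
    Fact("C3", "covers", "terminal_illness",
         "C3 s4.1: early payout on terminal illness."),
]


def analyze(event, facts, contracts):
    """Label each contract as overlap or gap for one scenario event."""
    result = {"overlap": {}, "gap": {}}
    for contract in contracts:
        relevant = [f for f in facts
                    if f.contract == contract and f.event == event]
        covers = [f.excerpt for f in relevant if f.predicate == "covers"]
        excludes = [f.excerpt for f in relevant if f.predicate == "excludes"]
        if covers and not excludes:
            result["overlap"][contract] = covers
        else:
            # A gap comes from a restriction or from absent coverage;
            # any exclusion excerpts are kept as the justification.
            result["gap"][contract] = excludes
    return result


RESULT = analyze("death_during_hazardous_sport", FACTS, ["C1", "C2", "C3"])
for label, by_contract in RESULT.items():
    for contract, evidence in by_contract.items():
        print(label, contract, evidence)
```

The design point is that each label carries its own evidence, so a wrong label can be traced to a specific fact rather than to an opaque model judgment.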

If this is right

  • KG construction methods can be compared systematically on competency questions that require distinguishing coverage from restrictions.
  • Explicit ontology modeling supplies traceable clause-level justifications that text-only inference lacks.
  • The same benchmark template can be applied to ontology learning, KG population, and evidence-grounded question answering in policy domains.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar resources could be built for other regulated domains such as healthcare policies or financial regulations to test KG readiness.
  • Pure LLM approaches may require hybrid augmentation with structured representations to handle precise logical distinctions in contracts.
  • Scaling the benchmark to complete contracts would expose whether current ontology population techniques can maintain the observed consistency gains.

Load-bearing premise

The ten simplified contracts and 58 expert-labeled scenarios are representative of real-world policy gap and overlap tasks and the ground-truth labels are stable and unbiased.

What would settle it

Running the same 58 scenarios on a collection of unmodified, full-length insurance contracts and observing either no gain in consistency for the ontology pipeline or frequent changes in the expert labels when a second independent reviewer is used.
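
One way to quantify "frequent changes in the expert labels" is Cohen's kappa between the original expert and a second reviewer. The paper reports no such statistic; the sketch below is only an illustration, and the two label lists are hypothetical.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two reviewers' label lists."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: share of items on which the reviewers match.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement if each reviewer labeled independently
    # according to their own label frequencies.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both reviewers used one identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical labels for four contract-scenario pairs.
REVIEWER_1 = ["overlap", "overlap", "gap", "gap"]
REVIEWER_2 = ["overlap", "gap", "gap", "gap"]
KAPPA = cohen_kappa(REVIEWER_1, REVIEWER_2)
print(KAPPA)
```

A kappa near 1 would suggest the labels are stable across reviewers; a low value would support the concern stated above.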

Figures

Figures reproduced from arXiv: 2604.10853 by Maruf Ahmed Mridul, Oshani Seneviratne, Rohit Kapa.

Figure 1: Overview of the 10 life insurance contracts categorized by complexity. The contracts are grouped into three levels: Simple, Moderate, and Complex, based on their complexity and range of features. Three symbols represent the Key Features, Focus, and Uniqueness, respectively. C1-C10 are the identifiers of the contracts.
Figure 2: A representative TBox excerpt.
Figure 3: Ontology Coverage Analysis. Extracted keyphrases from the contracts are matched against the ontology TBox artifacts (classes, properties, labels) using three different matching techniques. Example tuples (𝑠, 𝑜) denote the source contract term (𝑠) and its corresponding mapped ontology artifact (𝑜). The semantic content of the keyphrase is fully encoded; it is the surface-form divergence between the phrase a…
Figure 4: Inter-LLM agreement patterns across the 580 contract-scenario pairs. Further insight emerges when examining which contracts produce the largest number of errors. For Claude, the five most error-prone contracts are C8, C9, C6, C10, and C4. For ChatGPT, they are C8, C4, C10, C7, and C9. For Gemini, they are C6, C1, C9, C8, and C7. Notably, C4 (Variable Universal Life), C8 (Joint Survivorship), and C10 (Inde…
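
The 580 pairs in Figure 4 match 58 scenarios evaluated against each of the ten contracts (58 × 10 = 580). A tally of agreement patterns of the kind the caption describes could be computed roughly as follows. This is a sketch only: the model names come from the caption, while the votes and scenario identifiers are hypothetical and the paper's own pattern definitions may differ.

```python
from collections import Counter


def agreement_pattern(votes):
    """votes maps model name -> label for one contract-scenario pair."""
    groups = {}
    for model, label in votes.items():
        groups.setdefault(label, []).append(model)
    if len(groups) == 1:
        return "all agree"
    # Describe the split by listing each group of models that share a label.
    return " vs ".join(sorted("+".join(sorted(g)) for g in groups.values()))


# Hypothetical votes for four contract-scenario pairs.
VOTES = {
    ("C1", "S1"): {"Claude": "overlap", "ChatGPT": "overlap", "Gemini": "overlap"},
    ("C4", "S1"): {"Claude": "gap", "ChatGPT": "overlap", "Gemini": "gap"},
    ("C8", "S2"): {"Claude": "overlap", "ChatGPT": "gap", "Gemini": "gap"},
    ("C8", "S3"): {"Claude": "gap", "ChatGPT": "gap", "Gemini": "gap"},
}

PATTERNS = Counter(agreement_pattern(v) for v in VOTES.values())
for pattern, count in PATTERNS.most_common():
    print(pattern, count)
```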
Original abstract

Task-oriented evaluation of knowledge graph (KG) quality increasingly asks whether an ontology-based representation can answer the competency questions that users actually care about, in a manner that is reproducible, explainable, and traceable to evidence. This paper adopts that perspective and focuses on gap and overlap analysis for policy-like documents (e.g., insurance contracts), where given a scenario, which documents support it (overlap) and which do not (gap), with defensible justifications. The resulting gap/overlap determinations are typically driven by genuine differences in coverage and restrictions rather than missing data, making the task a direct test of KG task readiness rather than a test of missing facts or query expressiveness. We present an executable and auditable benchmark that aligns natural-language contract text with a formal ontology and evidence-linked ground truth, enabling systematic comparison of methods. The benchmark includes: (i) ten simplified yet diverse life-insurance contracts reviewed by a domain expert, (ii) a domain ontology (TBox) with an instantiated knowledge base (ABox) populated from contract facts, and (iii) 58 structured scenarios paired with SPARQL queries with contract-level outcomes and clause-level excerpts that justify each label. Using this resource, we compare a text-only LLM baseline that infers outcomes directly from contract text against an ontology-driven pipeline that answers the same scenarios over the instantiated KG, demonstrating that explicit modeling improves consistency and diagnosis for gap/overlap analyses. Although demonstrated for gap and overlap analysis, the benchmark is intended as a reusable template for evaluating KG quality and supporting downstream work such as ontology learning, KG population, and evidence-grounded question answering.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims to present an executable benchmark for assessing knowledge graph (KG) task readiness via gap and overlap analysis on policy documents. It constructs ten simplified life-insurance contracts reviewed by a domain expert, a domain ontology (TBox) with an instantiated ABox, and 58 structured scenarios paired with SPARQL queries, contract-level outcomes, and clause-level evidence excerpts. The central demonstration is that an ontology-driven pipeline outperforms a text-only LLM baseline in consistency and diagnostic capability for determining overlaps and gaps.

Significance. If the results hold, this provides a reusable template for evaluating KG quality on competency questions that involve genuine coverage differences rather than missing facts. The auditable, evidence-linked design with formal ontology alignment is a clear strength that could support ontology learning, KG population, and explainable QA. The focus on reproducible, traceable determinations addresses a practical need in domains like insurance policy analysis.

major comments (2)
  1. [Benchmark Construction] Benchmark construction and labeling process: The 58 scenarios receive ground-truth labels from a single domain expert on the ten simplified contracts, with no reported inter-annotator agreement, sensitivity analysis to label perturbations, or validation against additional experts. This is load-bearing for the improvement claim, because any measured consistency or diagnosis gains between the LLM baseline and ontology pipeline could reflect the specific interpretive biases or instabilities in the single-expert labels rather than the benefit of explicit modeling.
  2. [Evaluation] Evaluation and results: The demonstration of improved consistency and diagnosis lacks accompanying quantitative metrics (e.g., agreement rates, error breakdowns), statistical tests, or ablation on label stability. Without these, it is difficult to determine the magnitude or robustness of the reported gains over the text-only baseline.
minor comments (2)
  1. The abstract states that the benchmark is 'executable and auditable' but would benefit from an explicit statement of how the SPARQL queries and evidence excerpts are made publicly available (e.g., repository link or supplementary files) to enable full reproducibility.
  2. [Benchmark Construction] Clarify in the methods whether the ten contracts are provided in full or only as excerpts, as this affects the ability of readers to assess how simplification impacts the gap/overlap task.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below and indicate where revisions will be made to improve clarity and rigor.

Point-by-point responses
  1. Referee: [Benchmark Construction] Benchmark construction and labeling process: The 58 scenarios receive ground-truth labels from a single domain expert on the ten simplified contracts, with no reported inter-annotator agreement, sensitivity analysis to label perturbations, or validation against additional experts. This is load-bearing for the improvement claim, because any measured consistency or diagnosis gains between the LLM baseline and ontology pipeline could reflect the specific interpretive biases or instabilities in the single-expert labels rather than the benefit of explicit modeling.

    Authors: We acknowledge that the ground-truth labels were assigned by a single domain expert, which represents a genuine limitation for claims about robustness. The 58 scenarios were intentionally designed around unambiguous coverage distinctions in the simplified contracts, and each label is paired with clause-level evidence excerpts to support traceability and independent verification. The benchmark is released publicly as an executable artifact specifically to invite community review and additional annotations. In the revised manuscript we will add an explicit limitations subsection discussing single-expert labeling and outline plans for future multi-expert validation. We continue to hold that the ontology pipeline's advantages in consistency and diagnosis arise from formal modeling rather than label artifacts, but we accept that additional validation would strengthen this position. revision: partial

  2. Referee: [Evaluation] Evaluation and results: The demonstration of improved consistency and diagnosis lacks accompanying quantitative metrics (e.g., agreement rates, error breakdowns), statistical tests, or ablation on label stability. Without these, it is difficult to determine the magnitude or robustness of the reported gains over the text-only baseline.

    Authors: The manuscript currently illustrates the differences via concrete scenario examples and qualitative analysis of consistency and diagnostic traceability. We agree that quantitative support is needed to convey the magnitude of improvement. In the revision we will add a dedicated results subsection reporting agreement rates of each method against the ground truth across all 58 scenarios, categorized error breakdowns (e.g., false-positive overlaps or missed gaps), and a limited sensitivity check by perturbing a subset of labels to assess stability. These additions will make the comparison more transparent and allow readers to evaluate robustness directly. revision: yes
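
The three additions promised in this response (agreement rate against ground truth, a categorized error breakdown, and a label-perturbation check) could look roughly like the sketch below. It is an illustration under stated assumptions, not the authors' analysis: the labels, the category names `false_overlap` and `false_gap`, the perturbation fraction, and the seed are all hypothetical.

```python
import random


def flip(label):
    """Swap a binary gap/overlap label."""
    return "gap" if label == "overlap" else "overlap"


def evaluate(predicted, truth):
    """Agreement rate plus a breakdown of the two binary error types."""
    counts = {"correct": 0, "false_overlap": 0, "false_gap": 0}
    for key, gold in truth.items():
        guess = predicted[key]
        if guess == gold:
            counts["correct"] += 1
        elif guess == "overlap":
            counts["false_overlap"] += 1  # predicted overlap, truth is gap
        else:
            counts["false_gap"] += 1      # predicted gap, truth is overlap
    counts["agreement_rate"] = counts["correct"] / len(truth)
    return counts


def perturb(truth, fraction, seed):
    """Flip a reproducible random subset of ground-truth labels."""
    rng = random.Random(seed)
    keys = sorted(truth)
    flipped = set(rng.sample(keys, round(fraction * len(keys))))
    return {k: flip(v) if k in flipped else v for k, v in truth.items()}


# Hypothetical ground truth and predictions for four contract-scenario pairs.
TRUTH = {("C1", "S1"): "overlap", ("C2", "S1"): "gap",
         ("C3", "S1"): "gap", ("C4", "S1"): "overlap"}
PREDICTED = {("C1", "S1"): "overlap", ("C2", "S1"): "overlap",
             ("C3", "S1"): "gap", ("C4", "S1"): "overlap"}

REPORT = evaluate(PREDICTED, TRUTH)
PERTURBED = perturb(TRUTH, fraction=0.5, seed=0)
print(REPORT)
print(evaluate(PREDICTED, PERTURBED)["agreement_rate"])
```

Repeating the perturbation over several seeds and fractions would show how far the reported gap between methods depends on individual labels.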

Circularity Check

0 steps flagged

No circularity: benchmark construction and empirical comparison are self-contained

Full rationale

The paper introduces a new benchmark resource (10 simplified contracts, domain ontology with ABox, and 58 expert-labeled scenarios with SPARQL queries and clause excerpts) and uses it to empirically compare a text-only LLM baseline against an ontology-driven pipeline on gap/overlap tasks. The central claim of improved consistency and diagnosis is demonstrated directly via performance differences on this independently constructed resource, with no equations, fitted parameters, self-definitional reductions, or load-bearing self-citations that collapse predictions back to inputs by construction. The derivation chain consists of resource creation followed by straightforward method evaluation, remaining fully self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim depends on the assumption that expert review yields reliable ground truth and that the simplified contracts capture genuine coverage differences rather than artifacts of simplification.

axioms (1)
  • domain assumption: Domain-expert labeling of scenarios produces stable and unbiased ground truth for gap/overlap outcomes
    The paper uses expert review to create the 58 scenario labels and clause excerpts that serve as the evaluation target.


