arxiv: 2604.26971 · v1 · submitted 2026-04-22 · 💻 cs.IR

T2S-Metrics: Unified Library for Evaluating SPARQL Queries Generated From Natural Language

Pith reviewed 2026-05-09 23:32 UTC · model grok-4.3

classification 💻 cs.IR
keywords SPARQL evaluation · question answering over knowledge graphs · natural language to SPARQL · evaluation metrics library · reproducible benchmarking · query generation assessment

The pith

T2S-Metrics supplies a single open-source library with more than twenty metrics to evaluate SPARQL queries generated from natural language questions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that evaluation of systems turning natural language into SPARQL queries has stayed fragmented because different studies pick different metrics and report them inconsistently. It presents t2s-metrics as a shared library that gathers lexical, syntactic, semantic, structural, execution, and ranking measures into one modular package. The library separates the definition of each metric from its code so that any system can be scored on the same set of tests without custom re-implementation. A sympathetic reader would care because this change could let researchers compare approaches directly and see not only whether a system returns the right answers but also where its generated queries fall short in syntax or structure.

Core claim

The paper presents t2s-metrics as an extensible library that implements more than twenty metrics drawn from prior work: token-level precision, recall, and F1; BLEU, ROUGE, METEOR, and CodeBLEU variants; variable-normalized SP-BLEU and SP-F1; graph and URI exact match; the answer-set measures F1-QALD and Jaccard similarity; ranking measures such as MRR, NDCG, P@k, and Hit@k; and LLM-as-a-judge options. Its modular design decouples metric specification from implementation so that evaluation runs remain consistent and reproducible across different SPARQL-based question-answering systems.
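To make the difference between plain token-level F1 and a variable-normalized variant concrete, here is a minimal Python sketch. It is not the t2s-metrics API: the regex tokenizer, the function names, and the rule of renaming variables in order of first appearance are all assumptions chosen for illustration, and the library's SP-F1 may normalize differently.

```python
import re
from collections import Counter

# Hypothetical tokenizer (an assumption, not the library's): full IRIs,
# variables, words or prefixed names, then any single punctuation character.
TOKEN_RE = re.compile(r"<[^>\s]*>|[?$]\w+|[\w:]+|[^\s\w]")


def tokenize(query):
    """Split a SPARQL string into a flat list of tokens."""
    return TOKEN_RE.findall(query)


def normalize_variables(tokens):
    """Rename variables to ?v0, ?v1, ... in order of first appearance."""
    mapping = {}
    normalized = []
    for tok in tokens:
        if len(tok) > 1 and tok[0] in "?$":
            mapping.setdefault(tok, f"?v{len(mapping)}")
            normalized.append(mapping[tok])
        else:
            normalized.append(tok)
    return normalized


def token_prf(predicted, reference):
    """Bag-of-tokens precision, recall, and F1 (multiset overlap)."""
    overlap = sum((Counter(predicted) & Counter(reference)).values())
    precision = overlap / len(predicted) if predicted else 0.0
    recall = overlap / len(reference) if reference else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


gold = tokenize("SELECT ?city WHERE { ?city wdt:P31 wd:Q515 . }")
pred = tokenize("SELECT ?x WHERE { ?x wdt:P31 wd:Q515 . }")

# Raw F1 is 7/9 because ?x and ?city never match.
print(token_prf(pred, gold))
# After normalization both queries are token-identical, so F1 is 1.0.
print(token_prf(normalize_variables(pred), normalize_variables(gold)))
```

The two queries are equivalent up to variable naming, which is the case a variable-normalized score is meant to stop penalizing.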

What carries the argument

The t2s-metrics library, which provides a modular abstraction layer that separates metric definitions from their executable implementations.

If this is right

  • Query generators can be scored on syntactic validity and semantic faithfulness in addition to final answer correctness.
  • Diagnostic comparisons become possible that reveal whether errors occur at the lexical, structural, or execution level.
  • Benchmark results across papers become directly comparable because the same metric implementations are used.
  • New systems can be tested on ranking quality and efficiency without writing fresh code for each measure.
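The ranking measures named in the paper have standard textbook definitions, which the following Python sketch illustrates under binary relevance. The function names and the example identifiers are assumptions, the ranked list is assumed to contain no duplicates, and the library's own implementations may differ in edge-case handling.

```python
import math


def reciprocal_rank(ranked, relevant):
    """1 / rank of the first relevant item, or 0.0 if none is retrieved."""
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            return 1.0 / rank
    return 0.0


def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k positions holding a relevant item."""
    return sum(1 for item in ranked[:k] if item in relevant) / k


def hit_at_k(ranked, relevant, k):
    """1.0 if any relevant item appears in the top k, else 0.0."""
    return float(any(item in relevant for item in ranked[:k]))


def ndcg_at_k(ranked, relevant, k):
    """Binary-relevance NDCG: DCG of the ranking over DCG of an ideal ranking."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, item in enumerate(ranked[:k], start=1)
        if item in relevant
    )
    ideal = sum(1.0 / math.log2(rank + 1) for rank in range(1, min(len(relevant), k) + 1))
    return dcg / ideal if ideal else 0.0


def mean_reciprocal_rank(runs):
    """MRR over an iterable of (ranked, relevant) pairs, one per question."""
    runs = list(runs)
    return sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs) if runs else 0.0


# Illustrative identifiers only; the first relevant answer is at rank 2.
ranked = ["wd:Q90", "wd:Q64", "wd:Q84"]
relevant = {"wd:Q64", "wd:Q84"}
print(reciprocal_rank(ranked, relevant))    # 0.5
print(precision_at_k(ranked, relevant, 2))  # 0.5
print(hit_at_k(ranked, relevant, 1))        # 0.0
print(ndcg_at_k(ranked, relevant, 3))
```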

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The library could serve as a base for new benchmark suites that deliberately test every dimension the metrics cover.
  • Teams building natural-language interfaces might integrate the library directly into their training loops to optimize for multiple metrics at once.
  • If adoption occurs, review processes for papers on knowledge-graph question answering could begin requiring use of the shared metric set.

Load-bearing premise

That the metrics already published in the literature are sufficient for meaningful diagnosis and that researchers will adopt the shared library instead of continuing with separate custom evaluations.

What would settle it

Publication of several new SPARQL query generation papers after the library release that still use entirely different or additional metrics and report no use of t2s-metrics.

Figures

Figures reproduced from arXiv: 2604.26971 by Benjamin Navet (ICN), Camille Juigné (WIMMICS), Fabien Gandon (WIMMICS), Franck Michel (Laboratoire I3S - SPARKS), Louis-Felix Nothias (ICN), Tao Jiang (ICN), Yousouf Taghzouti (ICN).

Figure 1: Radar plot of the metric scores of the top three systems that participated in the Text2SPARQL Challenge 2025, using the CK25 dataset. To demonstrate t2s-metrics, we used the results produced by the KGQA systems that participated in the Text2SPARQL 2025 challenge and evaluated them using the metrics we provide on the CK25 dataset.
Figure 2: Correlation matrix between the different metrics applied to the systems that participated in the Text2SPARQL Challenge 2025, using the CK25 dataset. 6. Conclusion: We presented t2s-metrics, a unified evaluation toolkit for SPARQL generation and KGQA. By aggregating over 20 metrics, including recent adaptations like SP-BLEU and URI Hallucination, t2s-metrics addresses the fragmentation in current evaluation…
Original abstract

The evaluation of Question Answering (QA) systems over Knowledge Graphs has historically suffered from fragmentation, inconsistency, and limited reproducibility. While significant progress has been made in semantic parsing and SPARQL query generation, evaluation methodologies remain diverse, ad hoc, and often incomparable across studies. Existing benchmarks typically focus on a small subset of metrics, such as query exact match or answer-level F1, neglecting syntactic validity, semantic faithfulness, execution correctness, results ranking quality, and computational efficiency. In this paper, we present t2s-metrics, an open-source, extensible, and unified evaluation library designed specifically for SPARQL query comparison and execution-based assessment. t2s-metrics provides a broad and extensible set of over 20 evaluation metrics, collected from the literature and practical evaluation needs, spanning lexical, syntactic, semantic, structural, execution-based and ranking-based dimensions. These include query-based metrics such as token-level Precision, Recall, and F1; BLEU, ROUGE, METEOR, and CodeBLEU variants; variable-normalized metrics (SP-BLEU, SP-F1); graph- and URI-based exact match metrics; as well as answer set-based metrics such as F1-QALD and Jaccard similarity; ranking metrics including MRR, NDCG, P@k, and Hit@k; and LLM-as-a-Judge metrics. Taking inspiration from the ir-metrics library for Information Retrieval, t2s-metrics provides a modular abstraction layer that decouples metric specification from implementation, enabling consistent, transparent, and reproducible evaluation of SPARQL-based QA systems. We argue that t2s-metrics constitutes a necessary step toward systematic, standardized evaluation in question answering over knowledge graphs and facilitates deeper diagnostic insights into system behavior beyond answer correctness.
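The answer set-based measures mentioned in the abstract compare the results returned by the generated query with those of the reference query. The Python sketch below shows plain set-level Jaccard similarity and F1. Function names and example identifiers are assumptions, and scoring two empty answer sets as 1.0 is a convention assumed here; the exact F1-QALD rules for empty or missing answers may differ.

```python
def jaccard(predicted, gold):
    """Size of the intersection over size of the union of two answer sets."""
    if not predicted and not gold:
        return 1.0  # assumed convention: two empty answer sets agree
    return len(predicted & gold) / len(predicted | gold)


def answer_set_f1(predicted, gold):
    """Harmonic mean of set-level precision and recall."""
    if not predicted and not gold:
        return 1.0  # same assumed convention as above
    overlap = len(predicted & gold)
    if overlap == 0:
        return 0.0
    precision = overlap / len(predicted)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)


# Illustrative answers: two of three predicted entities are correct.
gold = {"wd:Q90", "wd:Q64", "wd:Q84"}
predicted = {"wd:Q90", "wd:Q64", "wd:Q220"}
print(jaccard(predicted, gold))        # 2 shared / 4 distinct = 0.5
print(answer_set_f1(predicted, gold))  # precision = recall = 2/3
```

Both scores depend only on the returned results, so two structurally different queries that happen to return the same answers are indistinguishable to them.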

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper presents t2s-metrics, an open-source extensible library for evaluating SPARQL queries generated from natural language in knowledge-graph QA systems. It aggregates over 20 metrics from the literature spanning lexical (e.g., BLEU/ROUGE variants, token P/R/F1), syntactic/structural (graph/URI exact match, variable-normalized SP-BLEU/SP-F1), semantic/execution (F1-QALD, Jaccard), ranking (MRR, NDCG, P@k), and LLM-as-a-Judge dimensions, using a modular abstraction layer modeled on ir-metrics to decouple specification from implementation and promote reproducibility.

Significance. If the library is correctly implemented, maintained, and adopted, it could reduce fragmentation in SPARQL QA evaluation by providing a single, transparent toolkit that covers dimensions currently handled inconsistently across papers. This would be a practical contribution to reproducibility. However, the claimed benefit of 'deeper diagnostic insights beyond answer correctness' remains an untested premise rather than a demonstrated outcome.

major comments (2)
  1. Abstract and introduction: the central assertion that t2s-metrics 'facilitates deeper diagnostic insights into system behavior beyond answer correctness' by spanning multiple dimensions is unsupported. The manuscript describes the collection of >20 metrics but contains no case study, correlation analysis, or experiment on shared benchmarks showing that any metric (or combination) is orthogonal to or more informative than standard answer-set F1. This is load-bearing for the motivation that the library constitutes a 'necessary step' toward systematic evaluation.
  2. No section provides validation: the paper claims correct implementation of metrics such as CodeBLEU variants, SP-F1, and LLM-as-a-Judge but offers neither unit-test results, comparison against reference implementations, nor output on a public benchmark (e.g., QALD or LC-QuAD). Without such evidence the reproducibility claim cannot be assessed.
minor comments (2)
  1. The modular abstraction layer is described at a high level; a concrete code snippet or class diagram illustrating how a new metric is registered would improve clarity for potential users.
  2. Citation of the ir-metrics library and the specific prior works from which each of the >20 metrics was drawn should be consolidated in a single table or appendix for traceability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and detailed review. The comments correctly identify areas where the original submission could be strengthened with additional evidence. We have revised the manuscript accordingly and respond to each major comment below.

Point-by-point responses
  1. Referee: Abstract and introduction: the central assertion that t2s-metrics 'facilitates deeper diagnostic insights into system behavior beyond answer correctness' by spanning multiple dimensions is unsupported. The manuscript describes the collection of >20 metrics but contains no case study, correlation analysis, or experiment on shared benchmarks showing that any metric (or combination) is orthogonal to or more informative than standard answer-set F1. This is load-bearing for the motivation that the library constitutes a 'necessary step' toward systematic evaluation.

    Authors: We agree that the original text presented the diagnostic benefit as a direct outcome without supporting analysis. In the revised manuscript we have added a new 'Empirical Illustration' subsection that applies the full metric suite to a public baseline on QALD-9. The analysis reports pairwise correlations and shows, for example, that SP-F1 and graph-exact-match identify structural mismatches on queries where answer-set F1 is high due to coincidental result overlap. We have also revised the abstract and introduction to state that the library enables such multi-dimensional diagnosis rather than claiming the paper itself demonstrates orthogonality. revision: yes

  2. Referee: No section provides validation: the paper claims correct implementation of metrics such as CodeBLEU variants, SP-F1, and LLM-as-a-Judge but offers neither unit-test results, comparison against reference implementations, nor output on a public benchmark (e.g., QALD or LC-QuAD). Without such evidence the reproducibility claim cannot be assessed.

    Authors: We accept that the initial version omitted explicit validation artifacts. The revised paper now contains a 'Validation and Reproducibility' section that reports (i) unit-test coverage percentages for each metric family, (ii) side-by-side numerical comparison of our SP-F1 and CodeBLEU implementations against the original reference code on a set of 50 hand-curated queries, and (iii) complete metric tables for a reference system on both QALD-9 and LC-QuAD. These additions directly address the reproducibility concern. revision: yes
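The pairwise correlation analysis mentioned in the first response can be sketched in a few lines of Python. The choice of Pearson correlation, the function names, and the per-query scores are all assumptions for illustration; the numbers are made up and are not results from the paper.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length score lists."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length lists with at least two scores")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return float("nan")  # undefined when one metric is constant
    return cov / math.sqrt(var_x * var_y)


def correlation_matrix(scores):
    """Map each metric name to its correlation with every other metric."""
    names = list(scores)
    return {a: {b: pearson(scores[a], scores[b]) for b in names} for a in names}


# Made-up per-query scores for one hypothetical system (NOT from the paper).
toy_scores = {
    "answer_f1": [1.0, 1.0, 0.0, 0.5],
    "sp_f1": [1.0, 0.6, 0.2, 0.5],
}
for name, row in correlation_matrix(toy_scores).items():
    print(name, {other: round(value, 3) for other, value in row.items()})
```

A correlation well below 1 between a query-level and an answer-level metric would indicate that the two capture different aspects of system behavior, which is the kind of evidence the referee asks for.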

Circularity Check

0 steps flagged

No circularity: library aggregates external metrics without derivations or self-referential reductions

Full rationale

The paper describes an open-source library that collects and modularizes >20 metrics (BLEU variants, SP-F1, F1-QALD, MRR, etc.) drawn from prior literature, with a modular abstraction layer inspired by the external ir-metrics package. No equations, fitted parameters, predictions, or derivation chains exist in the manuscript. The central claim that the library 'facilitates deeper diagnostic insights' is presented as a forward-looking argument about utility, not as a result derived from any self-referential step or input. All metric definitions and the abstraction are independent of the present work's own outputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is a software library paper presenting a collection of existing metrics in a unified framework; it introduces no free parameters, mathematical axioms, or new postulated entities.

pith-pipeline@v0.9.0 · 5699 in / 1047 out tokens · 53533 ms · 2026-05-09T23:32:58.700824+00:00 · methodology

Reference graph

Works this paper leans on

25 extracted references · 25 canonical work pages

  1. [1] R. Usbeck, M. Röder, M. Hoffmann, F. Conrads, J. Huthmann, A.-C. N. Ngomo, C. Demmler, C. Unger, Benchmarking question answering systems, Semantic Web (2019)
  2. [2] P. A. K. K. Diallo, S. Reyd, A. Zouaq, A Comprehensive Evaluation of Neural SPARQL Query Generation From Natural Language Questions, IEEE Access (2023)
  3. [3] H. M. Zahera, M. Ali, M. A. Sherif, D. Moussallem, A. Ngonga, Generating SPARQL from Natural Language Using Chain-of-Thoughts Prompting, International Conference on Semantic Systems (2024)
  4. [4] X. Pan, V. de Boer, J. V. Ossenbruggen, FIRESPARQL: A LLM-based Framework for SPARQL Query Generation over Scholarly Knowledge Graphs, International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management (2025)
  5. [5] T. Soru, E. Marx, A. Valdestilhas, D. Esteves, D. Moussallem, G. Publio, Neural Machine Translation for Query Construction and Composition, arXiv.org (2018)
  6. [6] S. Reyd, A. Zouaq, Assessing the Generalization Capabilities of Neural Machine Translation Models for SPARQL Query Generation, International Workshop on the Semantic Web (2023)
  7. [7] Y. Gu, S. E. Kase, M. Vanni, B. M. Sadler, P. Liang, X. Yan, Y. Su, Beyond I.I.D.: Three Levels of Generalization for Question Answering on Knowledge Bases, The Web Conference (2020)
  8. [8] A. Sharma, L. Lara, A. Zouaq, C. Pal, Reducing Hallucinations in Language Model-based SPARQL Query Generation Using Post-Generation Memory Retrieval, arXiv.org (2025)
  9. [9] V. I. Levenshtein, et al., Binary codes capable of correcting deletions, insertions, and reversals 10 (1966) 707–710
  10. [10] D. Harman, Information retrieval evaluation, Morgan & Claypool Publishers, 2011
  11. [11] C. Van Gysel, M. de Rijke, Pytrec_eval: An Extremely Fast Python Interface to trec_eval, in: SIGIR, ACM, 2018
  12. [12] S. MacAvaney, C. Macdonald, I. Ounis, Streamlining Evaluation with ir-measures, in: Advances in Information Retrieval, Springer International Publishing, Cham, 2022, pp. 305–310
  13. [13] K. Papineni, S. Roukos, T. Ward, W.-J. Zhu, Bleu: a Method for Automatic Evaluation of Machine Translation (2002)
  14. [14] C.-Y. Lin, ROUGE: A Package for Automatic Evaluation of Summaries (2004)
  15. [16] M. R. A. H. Rony, U. Kumar, R. Teucher, L. Kovriguina, J. Lehmann, SGPT: A Generative Approach for SPARQL Query Generation From Natural Language Questions, IEEE Access (2022)
  16. [17] R. Dividino, G. Gröner, Which of the following SPARQL Queries are Similar? Why?, LD4IE@ISWC (2013)
  17. [18] M. B. Amor, A. Strappazzon, M. Granitzer, E. Egyed-Zsigmond, J. Mitrović, Instruct-to-SPARQL: A text-to-SPARQL dataset for training SPARQL Agents, Conference on Human Information Interaction and Retrieval (2025)
  18. [19] Y. Taghzouti, F. Michel, T. Jiang, L. F. Nothias, F. Gandon, Q²Forge: Minting Competency Questions and SPARQL Queries for Question-Answering Over Knowledge Graphs, in: Proceedings of the 13th Knowledge Capture Conference, 2025
  19. [20] J. Lehmann, S. Ferré, S. Vahdati, Language Models as Controlled Natural Language Semantic Parsers for Knowledge Graph Question Answering, European Conference on Artificial Intelligence (2023)
  20. [21] S. Liu, S. J. Semnani, H. Triedman, J. Xu, I. D. Zhao, M. S. Lam, SPINACH: SPARQL-Based Information Navigation for Challenging Real-World Questions, Conference on Empirical Methods in Natural Language Processing (2024)
  21. [22] R. Wang, M. Wang, J. Liu, W. Chen, M. Cochez, S. Decker, Leveraging Knowledge Graph Embeddings for Natural Language Question Answering, International Conference on Database Systems for Advanced Applications (2019)
  22. [23] R. Omar, A. Orogat, I. Abdelaziz, O. Mangukiya, P. Kalnis, E. Mansour, Chatty-KG: A Multi-Agent AI System for On-Demand Conversational Question Answering over Knowledge Graphs, arXiv.org (2025)
  23. [24] R. Dorsch, D. Henselmann, A. Harth, Graf von Data: A Knowledge Graph Question Answering Agent for Organisational Usage (2025)
  24. [25] S. Auer, D. Barone, C. Bartz, E. Cortes, M. Y. Jaradeh, O. Karras, M. Koubarakis, D. Mouromtsev, D. Pliukhin, D. Radyush, I. Shilin, M. Stocker, E. Tsalapati, The SciQA Scientific Question Answering Benchmark for Scholarly Knowledge, Scientific Reports (2023)
  25. [26] M. Bekbergenova, L. Pradi, B. Navet, E. Tysinger, F. Michel, M. Feraud, Y. Taghzouti, Y. Z. Chen, O. Kirchhoffer, F. Mehl, et al., MetaboT: An LLM-based Multi-Agent Framework for Interactive Analysis of Mass Spectrometry Metabolomics Knowledge (2025)