T2S-Metrics: Unified Library for Evaluating SPARQL Queries Generated From Natural Language
Pith reviewed 2026-05-09 23:32 UTC · model grok-4.3
The pith
T2S-Metrics supplies a single open-source library with more than twenty metrics to evaluate SPARQL queries generated from natural language questions.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper presents t2s-metrics as an extensible library that implements more than twenty metrics drawn from prior work: token-level precision, recall, and F1; BLEU, ROUGE, METEOR, and CodeBLEU variants; variable-normalized SP-BLEU and SP-F1; graph and URI exact match; answer-set F1-QALD and Jaccard similarity; ranking measures such as MRR, NDCG, P@k, and Hit@k; and LLM-as-a-judge options. Its modular design decouples metric specification from implementation so that evaluation runs remain consistent and reproducible across different SPARQL-based question-answering systems.
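To make the difference between plain token-level scores and variable-normalized scores concrete, here is a minimal Python sketch. It is not the t2s-metrics implementation: the tokenizer regex, the function names, and the choice to rename variables in order of first appearance are all assumptions made for illustration, and the paper's SP-F1 may be defined differently.

```python
import re
from collections import Counter

# Simplified tokenizer (an assumption, not a full SPARQL lexer): IRIs in angle
# brackets, variables, keywords and prefixed names, integers, single punctuation.
TOKEN_RE = re.compile(r"<[^<>\s]+>|[?$]\w+|[A-Za-z_][\w:-]*|\d+|[^\s\w]")


def tokenize(query):
    """Split a SPARQL string into a flat list of tokens."""
    return TOKEN_RE.findall(query)


def normalize_variables(tokens):
    """Rename variables to ?v0, ?v1, ... in order of first appearance."""
    mapping = {}
    out = []
    for tok in tokens:
        if len(tok) > 1 and tok[0] in "?$":
            name = tok[1:]  # ?x and $x denote the same variable in SPARQL
            if name not in mapping:
                mapping[name] = f"?v{len(mapping)}"
            out.append(mapping[name])
        else:
            out.append(tok)
    return out


def token_prf(candidate, reference, normalize=False):
    """Bag-of-tokens precision, recall, and F1 of candidate against reference."""
    cand, ref = tokenize(candidate), tokenize(reference)
    if normalize:
        cand, ref = normalize_variables(cand), normalize_variables(ref)
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(cand) & Counter(ref)).values())
    precision = overlap / len(cand) if cand else 0.0
    recall = overlap / len(ref) if ref else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


reference = "SELECT ?city WHERE { ?city dbo:country dbr:France }"
candidate = "SELECT ?x WHERE { ?x dbo:country dbr:France }"
print(token_prf(candidate, reference))                  # (0.75, 0.75, 0.75)
print(token_prf(candidate, reference, normalize=True))  # (1.0, 1.0, 1.0)
```

The two queries differ only in the variable name, so the raw token score penalizes a difference with no semantic effect, while the normalized score does not. The sketch compares keywords case-sensitively and splits string literals into pieces, both of which a real implementation would need to handle.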
What carries the argument
The t2s-metrics library, which provides a modular abstraction layer that separates metric definitions from their executable implementations.
If this is right
- Query generators can be scored on syntactic validity and semantic faithfulness in addition to final answer correctness.
- Diagnostic comparisons become possible that reveal whether errors occur at the lexical, structural, or execution level.
- Benchmark results across papers become directly comparable because the same metric implementations are used.
- New systems can be tested on ranking quality and efficiency without writing fresh code for each measure.
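The ranking measures mentioned above have standard textbook definitions, sketched here in Python for a ranked list of answers and a set of relevant answers. The function names and the example identifiers are illustrative assumptions, not the library's interface.

```python
def reciprocal_rank(ranked, relevant):
    """1 / rank of the first relevant item, or 0.0 if none is retrieved."""
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs):
    """Average reciprocal rank over (ranked, relevant) pairs."""
    runs = list(runs)
    if not runs:
        return 0.0
    return sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs)


def hit_at_k(ranked, relevant, k):
    """1.0 if any of the top-k items is relevant, else 0.0."""
    return float(any(item in relevant for item in ranked[:k]))


def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k positions filled by relevant items."""
    if k <= 0:
        raise ValueError("k must be positive")
    return sum(item in relevant for item in ranked[:k]) / k


ranked = ["dbr:Lyon", "dbr:Paris", "dbr:Nice"]
relevant = {"dbr:Paris"}
print(reciprocal_rank(ranked, relevant))  # 0.5
print(hit_at_k(ranked, relevant, 1))      # 0.0
print(hit_at_k(ranked, relevant, 3))      # 1.0
```

Dividing by `k` rather than by the number of items actually returned means that a short result list is penalized in `precision_at_k`; that is one common convention, and a given library may choose the other.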
Where Pith is reading between the lines
- The library could serve as a base for new benchmark suites that deliberately test every dimension the metrics cover.
- Teams building natural-language interfaces might integrate the library directly into their training loops to optimize for multiple metrics at once.
- If adoption occurs, review processes for papers on knowledge-graph question answering could begin requiring use of the shared metric set.
Load-bearing premise
That the metrics already published in the literature are sufficient for meaningful diagnosis and that researchers will adopt the shared library instead of continuing with separate custom evaluations.
What would settle it
Publication of several new SPARQL query generation papers after the library release that still use entirely different or additional metrics and report no use of t2s-metrics.
Original abstract
The evaluation of Question Answering (QA) systems over Knowledge Graphs has historically suffered from fragmentation, inconsistency, and limited reproducibility. While significant progress has been made in semantic parsing and SPARQL query generation, evaluation methodologies remain diverse, ad hoc, and often incomparable across studies. Existing benchmarks typically focus on a small subset of metrics, such as query exact match or answer-level F1, neglecting syntactic validity, semantic faithfulness, execution correctness, results ranking quality, and computational efficiency. In this paper, we present t2s-metrics, an open-source, extensible, and unified evaluation library designed specifically for SPARQL query comparison and execution-based assessment. t2s-metrics provides a broad and extensible set of over 20 evaluation metrics, collected from the literature and practical evaluation needs, spanning lexical, syntactic, semantic, structural, execution-based and ranking-based dimensions. These include query-based metrics such as token-level Precision, Recall, and F1; BLEU, ROUGE, METEOR, and CodeBLEU variants; variable-normalized metrics (SP-BLEU, SP-F1); graph- and URI-based exact match metrics; as well as answer set-based metrics such as F1-QALD and Jaccard similarity; ranking metrics including MRR, NDCG, P@k, and Hit@k; and LLM-as-a-Judge metrics. Taking inspiration from the ir-metrics library for Information Retrieval, t2s-metrics provides a modular abstraction layer that decouples metric specification from implementation, enabling consistent, transparent, and reproducible evaluation of SPARQL-based QA systems. We argue that t2s-metrics constitutes a necessary step toward systematic, standardized evaluation in question answering over knowledge graphs and facilitates deeper diagnostic insights into system behavior beyond answer correctness.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents t2s-metrics, an open-source extensible library for evaluating SPARQL queries generated from natural language in knowledge-graph QA systems. It aggregates over 20 metrics from the literature spanning lexical (e.g., BLEU/ROUGE variants, token P/R/F1), syntactic/structural (graph/URI exact match, variable-normalized SP-BLEU/SP-F1), semantic/execution (F1-QALD, Jaccard), ranking (MRR, NDCG, P@k), and LLM-as-a-Judge dimensions, using a modular abstraction layer modeled on ir-metrics to decouple specification from implementation and promote reproducibility.
Significance. If the library is correctly implemented, maintained, and adopted, it could reduce fragmentation in SPARQL QA evaluation by providing a single, transparent toolkit that covers dimensions currently handled inconsistently across papers. This would be a practical contribution to reproducibility. However, the claimed benefit of 'deeper diagnostic insights beyond answer correctness' remains an untested premise rather than a demonstrated outcome.
major comments (2)
- Abstract and introduction: the central assertion that t2s-metrics 'facilitates deeper diagnostic insights into system behavior beyond answer correctness' by spanning multiple dimensions is unsupported. The manuscript describes the collection of >20 metrics but contains no case study, correlation analysis, or experiment on shared benchmarks showing that any metric (or combination) is orthogonal to or more informative than standard answer-set F1. This is load-bearing for the motivation that the library constitutes a 'necessary step' toward systematic evaluation.
- No section provides validation: the paper claims correct implementation of metrics such as CodeBLEU variants, SP-F1, and LLM-as-a-Judge but offers neither unit-test results, comparison against reference implementations, nor output on a public benchmark (e.g., QALD or LC-QuAD). Without such evidence the reproducibility claim cannot be assessed.
minor comments (2)
- The modular abstraction layer is described at a high level; a concrete code snippet or class diagram illustrating how a new metric is registered would improve clarity for potential users.
- Citation of the ir-metrics library and the specific prior works from which each of the >20 metrics was drawn should be consolidated in a single table or appendix for traceability.
Simulated Author's Rebuttal
We thank the referee for the thoughtful and detailed review. The comments correctly identify areas where the original submission could be strengthened with additional evidence. We have revised the manuscript accordingly and respond to each major comment below.
Point-by-point responses
-
Referee: Abstract and introduction: the central assertion that t2s-metrics 'facilitates deeper diagnostic insights into system behavior beyond answer correctness' by spanning multiple dimensions is unsupported. The manuscript describes the collection of >20 metrics but contains no case study, correlation analysis, or experiment on shared benchmarks showing that any metric (or combination) is orthogonal to or more informative than standard answer-set F1. This is load-bearing for the motivation that the library constitutes a 'necessary step' toward systematic evaluation.
Authors: We agree that the original text presented the diagnostic benefit as a direct outcome without supporting analysis. In the revised manuscript we have added a new 'Empirical Illustration' subsection that applies the full metric suite to a public baseline on QALD-9. The analysis reports pairwise correlations and shows, for example, that SP-F1 and graph-exact-match identify structural mismatches on queries where answer-set F1 is high due to coincidental result overlap. We have also revised the abstract and introduction to state that the library enables such multi-dimensional diagnosis rather than claiming the paper itself demonstrates orthogonality. revision: yes
-
Referee: No section provides validation: the paper claims correct implementation of metrics such as CodeBLEU variants, SP-F1, and LLM-as-a-Judge but offers neither unit-test results, comparison against reference implementations, nor output on a public benchmark (e.g., QALD or LC-QuAD). Without such evidence the reproducibility claim cannot be assessed.
Authors: We accept that the initial version omitted explicit validation artifacts. The revised paper now contains a 'Validation and Reproducibility' section that reports (i) unit-test coverage percentages for each metric family, (ii) side-by-side numerical comparison of our SP-F1 and CodeBLEU implementations against the original reference code on a set of 50 hand-curated queries, and (iii) complete metric tables for a reference system on both QALD-9 and LC-QuAD. These additions directly address the reproducibility concern. revision: yes
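The exchange above turns on what answer-set scores can and cannot reveal, so it helps to see how they are computed. The Python sketch below gives plain set-based precision, recall, F1, and Jaccard similarity over returned answers. It is a generic illustration with hypothetical function names; treating two empty answer sets as a perfect match is an assumption of this sketch, and the exact edge-case rules of F1-QALD as implemented in t2s-metrics are not given on this page.

```python
def answer_set_prf(system, gold):
    """Set-based precision, recall, and F1 of system answers against gold answers."""
    system, gold = set(system), set(gold)
    if not system and not gold:
        # Assumption of this sketch: two empty answer sets count as agreement.
        return 1.0, 1.0, 1.0
    overlap = len(system & gold)
    precision = overlap / len(system) if system else 0.0
    recall = overlap / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def jaccard(system, gold):
    """Size of the intersection divided by size of the union."""
    system, gold = set(system), set(gold)
    union = system | gold
    if not union:
        return 1.0  # same empty-set assumption as above
    return len(system & gold) / len(union)


gold = {"dbr:Paris", "dbr:Lyon", "dbr:Nice"}
system = {"dbr:Paris", "dbr:Lyon", "dbr:Marseille"}
print(answer_set_prf(system, gold))
print(jaccard(system, gold))  # 0.5
```

Because both functions see only the returned answers, two structurally different queries that happen to return the same results receive identical scores. That blind spot is why query-level metrics are proposed as a complement.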
Circularity Check
No circularity: library aggregates external metrics without derivations or self-referential reductions
full rationale
The paper describes an open-source library that collects and modularizes >20 metrics (BLEU variants, SP-F1, F1-QALD, MRR, etc.) drawn from prior literature, with a modular abstraction layer inspired by the external ir-metrics package. No equations, fitted parameters, predictions, or derivation chains exist in the manuscript. The central claim that the library 'facilitates deeper diagnostic insights' is presented as a forward-looking argument about utility, not as a result derived from any self-referential step or input. All metric definitions and the abstraction are independent of the present work's own outputs.