The Wikidata Query Logs Dataset

Hannah Bast; Sebastian Walter

arxiv: 2602.14594 · v2 · pith:FBNWVSECnew · submitted 2026-02-16 · 💻 cs.CL


Pith reviewed 2026-05-21 12:23 UTC · model grok-4.3


keywords: Wikidata, query logs, SPARQL, question answering, dataset, knowledge graph, natural language questions, agent-based method


The pith

An agent-based process turns anonymized Wikidata query logs into 335k executable question-query pairs without templates.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces the Wikidata Query Logs dataset of 335k natural-language question and SPARQL query pairs drawn from the Wikidata knowledge graph. Construction starts from actual queries submitted to the public Wikidata Query Service rather than from hand-written templates. Because the logged queries arrive anonymized and frequently fail to return results, an agent is used to restore variables, remove errors, confirm executability, and produce matching questions. The resulting collection is more than eleven times larger than earlier comparable resources and is shown to improve training of question-answering systems over knowledge graphs.

Core claim

What carries the argument

The agent-based method that iteratively de-anonymizes, cleans, and verifies anonymized SPARQL queries from logs against Wikidata while generating corresponding natural-language questions.
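The iterative character of this method can be pictured as a repair loop. The following is a minimal sketch under stated assumptions, not the authors' implementation: `repair_until_executable`, the `execute` and `repair` callables, and the `max_rounds` budget are all hypothetical names introduced here for illustration. `execute` stands in for running a query against Wikidata and returning a row count, and `repair` stands in for one agent step that rewrites the query.

```python
from typing import Callable, Optional


def repair_until_executable(
    query: str,
    execute: Callable[[str], int],
    repair: Callable[[str, int], str],
    max_rounds: int = 3,
) -> Optional[str]:
    """Return the first candidate that yields at least one row, else None.

    Hypothetical helper: `execute` returns the number of result rows and
    `repair` returns a rewritten query given the current one and the round.
    """
    candidate = query
    for round_index in range(max_rounds + 1):
        # Verification step: accept the candidate as soon as it returns rows.
        if execute(candidate) > 0:
            return candidate
        # Repair step: only rewrite while there is budget left.
        if round_index < max_rounds:
            candidate = repair(candidate, round_index)
    # Budget exhausted without a non-empty result: reject the log entry.
    return None
```

Returning `None` for queries that never become executable makes the rejection count easy to tally, which is the kind of statistic the referee report below asks for.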

If this is right

Question-answering models can be trained on substantially more realistic query patterns than template-based sets allow.
The same log-to-dataset pipeline supplies a scalable route for creating training data from any public SPARQL endpoint.
Evaluation of semantic parsers now has access to a much larger pool of real-user queries rather than synthetic ones.
Public release of both the pairs and the agent code lets other groups reproduce or extend the collection.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The method could be reused on logs from other large knowledge graphs to produce comparable real-query collections.
Patterns discovered in the cleaned queries may reveal typical user information needs that template datasets miss.
Models trained on this data might generalize better to live user questions posed against evolving knowledge graphs.

Load-bearing premise

The agent-based method can successfully convert a significant portion of anonymized log queries into meaningful, executable SPARQL queries that pair accurately with generated natural-language questions.

What would settle it

A manual audit finding that most queries in the released dataset fail to execute on current Wikidata, or that most paired questions do not describe the query semantics, would falsify the central claim.
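Such an audit would in practice inspect a random sample rather than all 335k pairs, so the observed pass rate carries sampling uncertainty. A minimal sketch, assuming a simple random sample and a binary pass/fail judgment per pair, is a Wilson score interval on the pass rate; the function name `wilson_interval` and the default `z = 1.96` (roughly 95% confidence) are choices made here, not something the paper specifies.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, sample_size: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion."""
    if sample_size <= 0:
        raise ValueError("sample_size must be positive")
    if not 0 <= successes <= sample_size:
        raise ValueError("successes must lie between 0 and sample_size")
    p_hat = successes / sample_size
    z2 = z * z
    denom = 1.0 + z2 / sample_size
    center = (p_hat + z2 / (2 * sample_size)) / denom
    half_width = (
        z * math.sqrt(p_hat * (1 - p_hat) / sample_size + z2 / (4 * sample_size**2)) / denom
    )
    # Clamp to [0, 1] to guard against floating-point overshoot.
    return max(0.0, center - half_width), min(1.0, center + half_width)
```

Under this reading, the claim would be in trouble if the upper end of the interval for the pass rate fell below one half.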

Original abstract

We present the Wikidata Query Logs (WDQL) dataset, a dataset consisting of 335k question-query pairs over the Wikidata knowledge graph. It is over 11x larger than the largest existing Wikidata datasets of similar format without relying on template-generated queries. Instead, we construct it using real-world SPARQL queries sent to the Wikidata Query Service and generate questions for them. Since these log-based queries are anonymized, and therefore often do not produce results, a significant amount of effort is needed to convert them back into meaningful SPARQL queries. To achieve this, we present an agent-based method that iteratively de-anonymizes, cleans, and verifies queries against Wikidata while also generating corresponding natural-language questions. We demonstrate the benefit of this dataset for training question-answering methods. All WDQL assets, as well as the agent code, are publicly available via https://github.com/ad-freiburg/wikidata-query-logs under a permissive license.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Releases a large real-world Wikidata query dataset but needs more details on agent success and question quality.


This paper releases a dataset of 335k question-query pairs over Wikidata drawn from actual query logs rather than templates. At over 11 times the size of prior work, it stands out as a practical resource for training KGQA models on authentic queries. The authors make the data and their agent code publicly available, which is a clear plus for anyone wanting to use or extend it. The agent-based approach to de-anonymizing, cleaning, and verifying the log queries against the live service addresses a real problem with raw logs. The main gap is in the validation details. The abstract mentions the effort required and the iterative process, but it does not include success rates for the agent, rejection statistics, or any metrics on how accurately the generated questions match the query intent. That leaves some uncertainty about the final dataset quality. This work is for people in the question answering over knowledge graphs area who need larger, more realistic training sets. A reader focused on dataset construction or model training on real queries would get direct value from it. It deserves a serious referee because the contribution is a sizable public dataset with code, even though the construction process could use more transparency on its effectiveness. I would send it to peer review.

Referee Report

2 major / 2 minor

Summary. The paper presents the Wikidata Query Logs (WDQL) dataset consisting of 335k question-query pairs over the Wikidata knowledge graph. Constructed from real-world anonymized SPARQL query logs rather than templates, it uses an agent-based method to iteratively de-anonymize, clean, verify queries against the public Wikidata service, and generate corresponding natural-language questions. The dataset is claimed to be over 11x larger than prior similar resources, with a demonstration of its utility for training question-answering models; all assets and code are released publicly.

Significance. If the dataset construction and quality claims hold, this would represent a substantial resource for knowledge-graph question answering research, offering scale and realism not available in template-based alternatives and potentially enabling stronger empirical results in QA model training.

major comments (2)

[Abstract and Dataset Construction] The central claim of a 335k high-quality question-query pair dataset rests on the agent-based method successfully converting a significant portion of anonymized log queries into executable SPARQL queries paired with faithful NL questions, yet no quantitative metrics are supplied on agent success rate, number of queries rejected during filtering, or human-judged fidelity of the generated questions. This absence leaves the headline size and quality unsupported.
[Experiments / Demonstration] The benefit for training QA methods is asserted, but without reported metrics such as exact match accuracy, F1, or comparison against baselines on held-out test sets derived from the new dataset, it is difficult to assess the practical impact claimed.
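For concreteness, the metrics named in the second comment can be computed per question over answer sets. The sketch below is one common convention, not necessarily the one the authors would adopt: it compares the set of answers returned by a predicted query with the set returned by the gold query. The names `exact_match` and `answer_f1`, and the choice to score two empty answer sets as a perfect match, are assumptions made here.

```python
from typing import Hashable, Iterable


def exact_match(predicted: Iterable[Hashable], gold: Iterable[Hashable]) -> bool:
    """True when predicted and gold answers are identical as sets."""
    return set(predicted) == set(gold)


def answer_f1(predicted: Iterable[Hashable], gold: Iterable[Hashable]) -> float:
    """Set-based F1 between predicted and gold answers."""
    predicted_set, gold_set = set(predicted), set(gold)
    # Assumed convention: two empty answer sets agree perfectly.
    if not predicted_set and not gold_set:
        return 1.0
    overlap = len(predicted_set & gold_set)
    if overlap == 0:
        return 0.0
    precision = overlap / len(predicted_set)
    recall = overlap / len(gold_set)
    return 2 * precision * recall / (precision + recall)
```

Averaging `answer_f1` and `exact_match` over a held-out split gives the two headline numbers the referee requests.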

minor comments (2)

[Method] Clarify the exact criteria used by the agent for 'verification' against Wikidata (e.g., result non-emptiness thresholds or syntactic checks) to improve reproducibility.
[Dataset Construction] Consider adding a table summarizing the filtering pipeline stages with counts at each step (input logs, after de-anonymization, after verification, final pairs).
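One candidate for the verification criterion raised in the first minor comment is result non-emptiness. A minimal sketch, assuming the endpoint returns the W3C SPARQL 1.1 Query Results JSON Format, is shown below. The function name `is_non_empty` is hypothetical, and treating any ASK response as non-empty is an assumption made here; the paper's actual criterion is not stated in this review.

```python
import json
from typing import Any, Mapping


def is_non_empty(response: Mapping[str, Any]) -> bool:
    """Check a parsed SPARQL JSON response for at least one answer.

    SELECT responses carry rows under results.bindings; ASK responses carry
    a top-level "boolean". Assumption: any ASK answer counts as non-empty.
    """
    if "boolean" in response:
        return True
    bindings = response.get("results", {}).get("bindings", [])
    return len(bindings) > 0


# Hand-written sample payloads in the standard result format.
select_payload = json.loads(
    '{"head": {"vars": ["item"]}, "results": {"bindings": ['
    '{"item": {"type": "uri", "value": "http://www.wikidata.org/entity/Q42"}}]}}'
)
empty_payload = json.loads('{"head": {"vars": ["item"]}, "results": {"bindings": []}}')
ask_payload = json.loads('{"head": {}, "boolean": false}')
```

A stricter variant could add a syntactic check before execution or a minimum row threshold; either would be worth stating explicitly for reproducibility.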

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback on our manuscript. We address each major comment in detail below, clarifying the current content and indicating where revisions will be made to strengthen the presentation of the dataset construction and its utility.


Referee: [Abstract and Dataset Construction] The central claim of a 335k high-quality question-query pair dataset rests on the agent-based method successfully converting a significant portion of anonymized log queries into executable SPARQL queries paired with faithful NL questions, yet no quantitative metrics are supplied on agent success rate, number of queries rejected during filtering, or human-judged fidelity of the generated questions. This absence leaves the headline size and quality unsupported.

Authors: We agree that explicit quantitative metrics on the agent-based pipeline would better support the dataset size and quality claims. The manuscript describes the iterative de-anonymization, cleaning, verification, and question generation process but does not report stage-wise success rates or rejection counts in the main text. In the revision we will add a dedicated subsection (or table) detailing the number of queries processed at each filtering and verification step, the overall success rate of the agent, and the results of a human evaluation on a sample of generated question-query pairs assessing fidelity and naturalness.

Revision: yes

Referee: [Experiments / Demonstration] The benefit for training QA methods is asserted, but without reported metrics such as exact match accuracy, F1, or comparison against baselines on held-out test sets derived from the new dataset, it is difficult to assess the practical impact claimed.

Authors: The current demonstration section illustrates the dataset's utility through qualitative examples and initial training runs, but we acknowledge that it lacks the quantitative metrics requested. We will expand this section to include concrete evaluation results: we will report exact match accuracy and F1 scores for question-answering models trained on WDQL, along with comparisons against relevant baselines, all evaluated on held-out test sets sampled from the dataset itself.

Revision: yes

Circularity Check

0 steps flagged

No circularity: dataset built from external logs with public verification


The paper constructs the WDQL dataset directly from real-world anonymized SPARQL query logs sent to the public Wikidata Query Service. The agent-based method iteratively de-anonymizes, cleans, and verifies queries against that external service while generating NL questions. No equations, predictions, fitted parameters, or first-principles derivations are present that could reduce to self-definitions or self-citations. The size claim (335k pairs, 11x larger than prior datasets) rests on external comparison rather than internal fitting. All assets and code are released publicly, making the work self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

No free parameters or invented entities are introduced; the contribution relies on standard domain assumptions about query logs and the Wikidata graph.

axioms (1)

Domain assumption: The Wikidata Query Service logs contain queries that can be de-anonymized and verified to produce results.
The method assumes that iterative cleaning can recover meaningful queries from anonymized logs.


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

GRISP: Guided Recurrent IRI Selection over SPARQL Skeletons
cs.CL · 2026-04 · unverdicted · novelty 6.0

GRISP fine-tunes a small language model on skeleton generation and list-wise re-ranking to iteratively complete SPARQL queries over knowledge graphs, reporting better benchmark results than prior methods.

