The Wikidata Query Logs Dataset
Pith reviewed 2026-05-21 12:23 UTC · model grok-4.3
The pith
An agent-based process turns anonymized Wikidata query logs into 335k executable question-query pairs without templates.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We present the Wikidata Query Logs (WDQL) dataset, a dataset consisting of 335k question-query pairs over the Wikidata knowledge graph. It is over 11x larger than the largest existing Wikidata datasets of similar format without relying on template-generated queries. Instead, we construct it using real-world SPARQL queries sent to the Wikidata Query Service and generate questions for them. Since these log-based queries are anonymized, and therefore often do not produce results, a significant amount of effort is needed to convert them back into meaningful SPARQL queries. To achieve this, we present an agent-based method that iteratively de-anonymizes, cleans, and verifies queries against Wikidata while also generating corresponding natural-language questions.
What carries the argument
The agent-based method that iteratively de-anonymizes, cleans, and verifies anonymized SPARQL queries from logs against Wikidata while generating corresponding natural-language questions.
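The iterate-and-verify idea can be pictured as a small control loop. The following is a minimal sketch, not the authors' agent: the names `repair_query`, `propose`, and `execute` are hypothetical, the acceptance rule (a non-empty result set) is an assumption, and the toy stand-ins replace both the language model and the SPARQL endpoint so that the example runs offline.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class RepairResult:
    query: Optional[str]  # accepted query, or None if no candidate passed
    rounds: int           # number of candidates that were tried


def repair_query(
    anonymized: str,
    propose: Callable[[str, List[str]], str],
    execute: Callable[[str], List[dict]],
    max_rounds: int = 3,
) -> RepairResult:
    """Try up to max_rounds rewrites; accept the first one that returns rows."""
    feedback: List[str] = []
    for round_no in range(1, max_rounds + 1):
        candidate = propose(anonymized, feedback)
        try:
            rows = execute(candidate)
        except Exception as err:  # e.g. a syntax error reported by the endpoint
            feedback.append(f"error: {err}")
            continue
        if rows:  # assumed acceptance rule: non-empty result set
            return RepairResult(candidate, round_no)
        feedback.append("empty result")
    return RepairResult(None, max_rounds)


# Toy stand-ins (hypothetical): a real setup would call a model and an endpoint.
def toy_propose(anonymized: str, feedback: List[str]) -> str:
    # The first attempt keeps the placeholder; later attempts fill in a value.
    if not feedback:
        return anonymized
    return anonymized.replace('"string1"', '"Berlin"@en')


def toy_execute(query: str) -> List[dict]:
    # Pretend that only the de-anonymized query matches anything.
    return [{"item": "Q64"}] if '"Berlin"@en' in query else []


result = repair_query(
    'SELECT ?item WHERE { ?item rdfs:label "string1" }',
    toy_propose,
    toy_execute,
)
print(result.rounds, result.query)
```

Passing the executor in as a function keeps the loop testable without network access; the feedback list is one simple way to let each attempt see why earlier candidates were rejected.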
If this is right
- Question-answering models can be trained on substantially more realistic query patterns than template-based sets allow.
- The same log-to-dataset pipeline supplies a scalable route for creating training data from any public SPARQL endpoint.
- Evaluation of semantic parsers now has access to a much larger pool of real-user queries rather than synthetic ones.
- Public release of both the pairs and the agent code lets other groups reproduce or extend the collection.
Where Pith is reading between the lines
- The method could be reused on logs from other large knowledge graphs to produce comparable real-query collections.
- Patterns discovered in the cleaned queries may reveal typical user information needs that template datasets miss.
- Models trained on this data might generalize better to live user questions posed against evolving knowledge graphs.
Load-bearing premise
The agent-based method can successfully convert a significant portion of anonymized log queries into meaningful, executable SPARQL queries that pair accurately with generated natural-language questions.
What would settle it
A manual audit finding that most queries in the released dataset fail to execute on current Wikidata, or that most paired questions do not describe the query semantics, would falsify the central claim.
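Such an audit reduces to two failure rates over a random sample. The sketch below assumes each sampled pair has already been executed and judged by hand; the `AuditRecord` fields, the `audit` function, and the reading of "most" as a strict majority are illustrative assumptions, and the four records are toy values, not results.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class AuditRecord:
    executes: bool  # query ran and returned at least one row at audit time
    faithful: bool  # a human judged that the question matches the query


def audit(records: Iterable[AuditRecord]) -> dict:
    """Summarize execution and fidelity failures over an audit sample."""
    records = list(records)
    n = len(records)
    if n == 0:
        raise ValueError("audit sample is empty")
    exec_fail = sum(not r.executes for r in records) / n
    unfaithful = sum(not r.faithful for r in records) / n
    return {
        "n": n,
        "execution_failure_rate": exec_fail,
        "unfaithful_rate": unfaithful,
        # "Most" is read here as a strict majority of the sample (assumption).
        "claim_falsified": exec_fail > 0.5 or unfaithful > 0.5,
    }


# Toy sample, for illustration only.
sample = [
    AuditRecord(executes=True, faithful=True),
    AuditRecord(executes=True, faithful=False),
    AuditRecord(executes=False, faithful=True),
    AuditRecord(executes=True, faithful=True),
]
summary = audit(sample)
print(summary)
```

A real audit would also report the sample size and a confidence interval, since a small sample cannot separate a 40% failure rate from a 60% one.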
Original abstract
We present the Wikidata Query Logs (WDQL) dataset, a dataset consisting of 335k question-query pairs over the Wikidata knowledge graph. It is over 11x larger than the largest existing Wikidata datasets of similar format without relying on template-generated queries. Instead, we construct it using real-world SPARQL queries sent to the Wikidata Query Service and generate questions for them. Since these log-based queries are anonymized, and therefore often do not produce results, a significant amount of effort is needed to convert them back into meaningful SPARQL queries. To achieve this, we present an agent-based method that iteratively de-anonymizes, cleans, and verifies queries against Wikidata while also generating corresponding natural-language questions. We demonstrate the benefit of this dataset for training question-answering methods. All WDQL assets, as well as the agent code, are publicly available via https://github.com/ad-freiburg/wikidata-query-logs under a permissive license.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents the Wikidata Query Logs (WDQL) dataset consisting of 335k question-query pairs over the Wikidata knowledge graph. Constructed from real-world anonymized SPARQL query logs rather than templates, it uses an agent-based method that iteratively de-anonymizes, cleans, and verifies queries against the public Wikidata service while generating corresponding natural-language questions. The authors claim the dataset is over 11x larger than prior similar resources and demonstrate its utility for training question-answering models; all assets and code are released publicly.
Significance. If the dataset construction and quality claims hold, this would represent a substantial resource for knowledge-graph question answering research, offering scale and realism not available in template-based alternatives and potentially enabling stronger empirical results in QA model training.
major comments (2)
- [Abstract and Dataset Construction] Abstract and Dataset Construction section: The central claim of a 335k high-quality question-query pair dataset rests on the agent-based method successfully converting a significant portion of anonymized log queries into executable SPARQL queries paired with faithful NL questions, yet no quantitative metrics are supplied on agent success rate, number of queries rejected during filtering, or human-judged fidelity of the generated questions. This absence leaves the headline size and quality unsupported.
- [Experiments] Experiments / Demonstration section: The benefit for training QA methods is asserted, but without reported metrics such as exact match accuracy, F1, or comparison against baselines on held-out test sets derived from the new dataset, it is difficult to assess the practical impact claimed.
minor comments (2)
- [Method] Clarify the exact criteria used by the agent for 'verification' against Wikidata (e.g., result non-emptiness thresholds or syntactic checks) to improve reproducibility.
- [Dataset Construction] Consider adding a table summarizing the filtering pipeline stages with counts at each step (input logs, after de-anonymization, after verification, final pairs).
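The stage table suggested above is straightforward to generate once per-stage counts are logged. This is a minimal sketch with a hypothetical `stage_table` helper; the counts are made-up placeholders chosen for readability and say nothing about the actual WDQL pipeline.

```python
from typing import List, Tuple


def stage_table(stages: List[Tuple[str, int]]) -> List[str]:
    """Format stage counts with retention versus the previous stage and the input."""
    if not stages:
        return []
    first = stages[0][1]
    lines = [f"{'stage':<26}{'count':>10}{'vs prev':>10}{'vs input':>10}"]
    prev = first
    for name, count in stages:
        vs_prev = count / prev if prev else 0.0
        vs_input = count / first if first else 0.0
        lines.append(f"{name:<26}{count:>10}{vs_prev:>10.1%}{vs_input:>10.1%}")
        prev = count
    return lines


# Placeholder counts for illustration only, not figures from the paper.
placeholder_stages = [
    ("input logs", 1000),
    ("after de-anonymization", 800),
    ("after verification", 500),
    ("final pairs", 400),
]
table = stage_table(placeholder_stages)
print("\n".join(table))
```

Reporting retention against both the previous stage and the input makes it visible which step discards the most queries.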
Simulated Author's Rebuttal
We thank the referee for their constructive feedback on our manuscript. We address each major comment in detail below, clarifying the current content and indicating where revisions will be made to strengthen the presentation of the dataset construction and its utility.
Point-by-point responses
- Referee: [Abstract and Dataset Construction] Abstract and Dataset Construction section: The central claim of a 335k high-quality question-query pair dataset rests on the agent-based method successfully converting a significant portion of anonymized log queries into executable SPARQL queries paired with faithful NL questions, yet no quantitative metrics are supplied on agent success rate, number of queries rejected during filtering, or human-judged fidelity of the generated questions. This absence leaves the headline size and quality unsupported.
  Authors: We agree that explicit quantitative metrics on the agent-based pipeline would better support the dataset size and quality claims. The manuscript describes the iterative de-anonymization, cleaning, verification, and question generation process but does not report stage-wise success rates or rejection counts in the main text. In the revision we will add a dedicated subsection (or table) detailing the number of queries processed at each filtering and verification step, the overall success rate of the agent, and the results of a human evaluation on a sample of generated question-query pairs assessing fidelity and naturalness.
  Revision: yes
- Referee: [Experiments] Experiments / Demonstration section: The benefit for training QA methods is asserted, but without reported metrics such as exact match accuracy, F1, or comparison against baselines on held-out test sets derived from the new dataset, it is difficult to assess the practical impact claimed.
  Authors: The current demonstration section illustrates the dataset's utility through qualitative examples and initial training runs, but we acknowledge that it lacks the quantitative metrics requested. We will expand this section to include concrete evaluation results: we will report exact match accuracy and F1 scores for question-answering models trained on WDQL, along with comparisons against relevant baselines, all evaluated on held-out test sets sampled from the dataset itself.
  Revision: yes
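For reference, the two requested metrics are commonly computed in KGQA over answer sets, meaning the results returned by the predicted and the gold query. The sketch below follows that convention as an assumption; the text does not say how the authors define exact match or F1. The names `set_f1` and `evaluate` are hypothetical, and the three example pairs are toy values.

```python
from typing import Iterable, Set, Tuple


def set_f1(predicted: Set[str], gold: Set[str]) -> float:
    """F1 between two answer sets; two empty sets count as a perfect match."""
    if not predicted and not gold:
        return 1.0
    overlap = len(predicted & gold)
    if overlap == 0:
        return 0.0
    precision = overlap / len(predicted)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)


def evaluate(pairs: Iterable[Tuple[Set[str], Set[str]]]) -> dict:
    """Average exact match and F1 over (predicted, gold) answer-set pairs."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("no examples to evaluate")
    exact = sum(pred == gold for pred, gold in pairs) / len(pairs)
    f1 = sum(set_f1(pred, gold) for pred, gold in pairs) / len(pairs)
    return {"exact_match": exact, "macro_f1": f1}


# Toy (predicted, gold) pairs for illustration only.
toy_pairs = [
    ({"Q64"}, {"Q64"}),         # exact match
    ({"Q64", "Q90"}, {"Q64"}),  # one extra answer
    (set(), {"Q1"}),            # nothing predicted
]
scores = evaluate(toy_pairs)
print(scores)
```

Averaging per-question F1 (macro averaging) keeps questions with very large answer sets from dominating the score.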
Circularity Check
No circularity: dataset built from external logs with public verification
Full rationale
The paper constructs the WDQL dataset directly from real-world anonymized SPARQL query logs sent to the public Wikidata Query Service. The agent-based method iteratively de-anonymizes, cleans, and verifies queries against that external service while generating NL questions. No equations, predictions, fitted parameters, or first-principles derivations are present that could reduce to self-definitions or self-citations. The size claim (335k pairs, 11x larger than prior datasets) rests on external comparison rather than internal fitting. All assets and code are released publicly, making the work self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: The Wikidata Query Service logs contain queries that can be de-anonymized and verified to produce results.
Forward citations
Cited by 1 Pith paper
- GRISP: Guided Recurrent IRI Selection over SPARQL Skeletons
  GRISP fine-tunes a small language model on skeleton generation and list-wise re-ranking to iteratively complete SPARQL queries over knowledge graphs, reporting better benchmark results than prior methods.