
arxiv: 2602.14594 · v2 · pith:FBNWVSECnew · submitted 2026-02-16 · 💻 cs.CL

The Wikidata Query Logs Dataset

Pith reviewed 2026-05-21 12:23 UTC · model grok-4.3

classification 💻 cs.CL
keywords Wikidata · query logs · SPARQL · question answering · dataset · knowledge graph · natural language questions · agent-based method

The pith

An agent-based process turns anonymized Wikidata query logs into 335k executable question-query pairs without templates.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces the Wikidata Query Logs dataset of 335k natural-language question and SPARQL query pairs drawn from the Wikidata knowledge graph. Construction starts from actual queries submitted to the public Wikidata Query Service rather than from hand-written templates. Because the logged queries arrive anonymized and frequently fail to return results, an agent is used to restore variables, remove errors, confirm executability, and produce matching questions. The resulting collection is more than eleven times larger than earlier comparable resources and is shown to improve training of question-answering systems over knowledge graphs.

Core claim

We present the Wikidata Query Logs (WDQL) dataset, a dataset consisting of 335k question-query pairs over the Wikidata knowledge graph. It is over 11x larger than the largest existing Wikidata datasets of similar format without relying on template-generated queries. Instead, we construct it using real-world SPARQL queries sent to the Wikidata Query Service and generate questions for them. Since these log-based queries are anonymized, and therefore often do not produce results, a significant amount of effort is needed to convert them back into meaningful SPARQL queries. To achieve this, we present an agent-based method that iteratively de-anonymizes, cleans, and verifies queries against Wikidata while also generating corresponding natural-language questions.

What carries the argument

The agent-based method that iteratively de-anonymizes, cleans, and verifies anonymized SPARQL queries from logs against Wikidata while generating corresponding natural-language questions.
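The iterate-until-verified idea can be sketched as a small control loop. This is a minimal illustration under stated assumptions, not the authors' agent: `propose_fix` and `run_query` are hypothetical callables standing in for an LLM-driven rewrite step and a SPARQL endpoint client, and the acceptance rule (the query executes and returns at least one row) is assumed rather than taken from the paper.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class RepairResult:
    query: Optional[str]  # verified query, or None if the query was rejected
    rounds: int           # number of rewrite rounds that were used


def repair_until_verified(
    raw_query: str,
    propose_fix: Callable[[str, str], str],  # hypothetical: (query, feedback) -> rewritten query
    run_query: Callable[[str], List],        # hypothetical: query -> result rows, raises on error
    max_rounds: int = 3,
) -> RepairResult:
    """Rewrite a query until it executes and returns rows, or give up."""
    query = raw_query
    for fixes_used in range(max_rounds + 1):
        try:
            rows = run_query(query)
            feedback = "" if rows else "query ran but returned no rows"
        except Exception as err:  # e.g. a parse error or a timeout
            feedback = f"execution failed: {err}"
        if not feedback:
            # Assumed acceptance rule: executable and non-empty.
            return RepairResult(query=query, rounds=fixes_used)
        if fixes_used < max_rounds:
            query = propose_fix(query, feedback)
    return RepairResult(query=None, rounds=max_rounds)
```

Passing the two steps in as callables keeps the loop testable offline; a real run would plug in an endpoint client and a model call, and would likely need rate limiting and timeouts that this sketch omits.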

If this is right

  • Question-answering models can be trained on substantially more realistic query patterns than template-based sets allow.
  • The same log-to-dataset pipeline supplies a scalable route for creating training data from any public SPARQL endpoint.
  • Evaluation of semantic parsers now has access to a much larger pool of real-user queries rather than synthetic ones.
  • Public release of both the pairs and the agent code lets other groups reproduce or extend the collection.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The method could be reused on logs from other large knowledge graphs to produce comparable real-query collections.
  • Patterns discovered in the cleaned queries may reveal typical user information needs that template datasets miss.
  • Models trained on this data might generalize better to live user questions posed against evolving knowledge graphs.

Load-bearing premise

The agent-based method can successfully convert a significant portion of anonymized log queries into meaningful, executable SPARQL queries that pair accurately with generated natural-language questions.

What would settle it

A manual audit finding that most queries in the released dataset fail to execute on current Wikidata, or that most paired questions do not describe the query semantics, would falsify the central claim.
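Such an audit would inspect a random sample rather than all 335k pairs, so the verdict depends on how uncertain the sampled pass rate is. The sketch below uses a Wilson score interval for a binomial proportion. The choice of interval, the 95% level (`z = 1.96`), and the helper `majority_fails` are assumptions made for illustration; the source does not prescribe an audit procedure.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (default z gives ~95%)."""
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need n > 0 and 0 <= successes <= n")
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half


def majority_fails(passing: int, sampled: int) -> bool:
    """True if even the upper bound of the pass rate is below one half."""
    _, upper = wilson_interval(passing, sampled)
    return upper < 0.5
```

Here `passing` would count sampled pairs that both execute and have a faithful question; the audit would support the falsification condition only when `majority_fails` returns `True`.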

Original abstract

We present the Wikidata Query Logs (WDQL) dataset, a dataset consisting of 335k question-query pairs over the Wikidata knowledge graph. It is over 11x larger than the largest existing Wikidata datasets of similar format without relying on template-generated queries. Instead, we construct it using real-world SPARQL queries sent to the Wikidata Query Service and generate questions for them. Since these log-based queries are anonymized, and therefore often do not produce results, a significant amount of effort is needed to convert them back into meaningful SPARQL queries. To achieve this, we present an agent-based method that iteratively de-anonymizes, cleans, and verifies queries against Wikidata while also generating corresponding natural-language questions. We demonstrate the benefit of this dataset for training question-answering methods. All WDQL assets, as well as the agent code, are publicly available via https://github.com/ad-freiburg/wikidata-query-logs under a permissive license.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper presents the Wikidata Query Logs (WDQL) dataset consisting of 335k question-query pairs over the Wikidata knowledge graph. Constructed from real-world anonymized SPARQL query logs rather than templates, it uses an agent-based method to iteratively de-anonymize, clean, verify queries against the public Wikidata service, and generate corresponding natural-language questions. The dataset is claimed to be over 11x larger than prior similar resources, with a demonstration of its utility for training question-answering models; all assets and code are released publicly.

Significance. If the dataset construction and quality claims hold, this would represent a substantial resource for knowledge-graph question answering research, offering scale and realism not available in template-based alternatives and potentially enabling stronger empirical results in QA model training.

major comments (2)
  1. [Abstract and Dataset Construction] The central claim of a 335k high-quality question-query pair dataset rests on the agent-based method successfully converting a significant portion of anonymized log queries into executable SPARQL queries paired with faithful NL questions, yet no quantitative metrics are supplied on agent success rate, number of queries rejected during filtering, or human-judged fidelity of the generated questions. This absence leaves the headline size and quality unsupported.
  2. [Experiments] The benefit for training QA methods is asserted, but without reported metrics such as exact match accuracy, F1, or comparison against baselines on held-out test sets derived from the new dataset, it is difficult to assess the practical impact claimed.
minor comments (2)
  1. [Method] Clarify the exact criteria used by the agent for 'verification' against Wikidata (e.g., result non-emptiness thresholds or syntactic checks) to improve reproducibility.
  2. [Dataset Construction] Consider adding a table summarizing the filtering pipeline stages with counts at each step (input logs, after de-anonymization, after verification, final pairs).
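The stage-count table requested in the second minor comment could be produced with a helper like the one below. It is only a sketch: the stage names and counts have to come from the authors' pipeline, and the functions `funnel_rows` and `render` are hypothetical names introduced here.

```python
from typing import List, Tuple

Row = Tuple[str, int, float, float]


def funnel_rows(stages: List[Tuple[str, int]]) -> List[Row]:
    """Return (stage, count, kept_vs_previous, kept_vs_input) for each stage."""
    if not stages:
        return []
    first = stages[0][1]
    prev = first
    rows: List[Row] = []
    for name, count in stages:
        if count < 0 or count > prev:
            raise ValueError(f"stage {name!r}: count must not exceed the previous stage")
        step = count / prev if prev else 0.0
        total = count / first if first else 0.0
        rows.append((name, count, step, total))
        prev = count
    return rows


def render(rows: List[Row]) -> str:
    """Format the funnel as a fixed-width plain-text table."""
    lines = [f"{'stage':<24}{'count':>10}{'vs prev':>10}{'vs input':>10}"]
    for name, count, step, total in rows:
        lines.append(f"{name:<24}{count:>10}{step:>10.1%}{total:>10.1%}")
    return "\n".join(lines)
```

Reporting retention against both the previous stage and the raw input makes it visible which step discards the most queries, which is the information the first major comment asks for.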

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback on our manuscript. We address each major comment in detail below, clarifying the current content and indicating where revisions will be made to strengthen the presentation of the dataset construction and its utility.

Point-by-point responses
  1. Referee: [Abstract and Dataset Construction] The central claim of a 335k high-quality question-query pair dataset rests on the agent-based method successfully converting a significant portion of anonymized log queries into executable SPARQL queries paired with faithful NL questions, yet no quantitative metrics are supplied on agent success rate, number of queries rejected during filtering, or human-judged fidelity of the generated questions. This absence leaves the headline size and quality unsupported.

    Authors: We agree that explicit quantitative metrics on the agent-based pipeline would better support the dataset size and quality claims. The manuscript describes the iterative de-anonymization, cleaning, verification, and question generation process but does not report stage-wise success rates or rejection counts in the main text. In the revision we will add a dedicated subsection (or table) detailing the number of queries processed at each filtering and verification step, the overall success rate of the agent, and the results of a human evaluation on a sample of generated question-query pairs assessing fidelity and naturalness. revision: yes

  2. Referee: [Experiments] The benefit for training QA methods is asserted, but without reported metrics such as exact match accuracy, F1, or comparison against baselines on held-out test sets derived from the new dataset, it is difficult to assess the practical impact claimed.

    Authors: The current demonstration section illustrates the dataset's utility through qualitative examples and initial training runs, but we acknowledge that it lacks the quantitative metrics requested. We will expand this section to include concrete evaluation results: we will report exact match accuracy and F1 scores for question-answering models trained on WDQL, along with comparisons against relevant baselines, all evaluated on held-out test sets sampled from the dataset itself. revision: yes
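For reference, exact match and F1 are often computed in knowledge-graph question answering by comparing the answer set returned by the predicted query with the gold answer set. The sketch below assumes that convention; the paper's own metric definitions are not given in this review, and the function names are illustrative.

```python
from typing import Hashable, Iterable, Tuple


def answer_f1(predicted: Iterable[Hashable], gold: Iterable[Hashable]) -> float:
    """Set-based F1 between predicted and gold answers for one question."""
    pred, ref = set(predicted), set(gold)
    if not pred and not ref:
        return 1.0  # both empty: treated as a perfect match (an assumed convention)
    overlap = len(pred & ref)
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def exact_match(predicted: Iterable[Hashable], gold: Iterable[Hashable]) -> bool:
    """True when the two answer sets are identical, ignoring order and duplicates."""
    return set(predicted) == set(gold)


def evaluate(pairs: Iterable[Tuple[Iterable[Hashable], Iterable[Hashable]]]) -> Tuple[float, float]:
    """Return (exact-match accuracy, macro-averaged F1) over (predicted, gold) pairs."""
    items = [(list(p), list(g)) for p, g in pairs]
    if not items:
        raise ValueError("no examples to evaluate")
    em = sum(exact_match(p, g) for p, g in items) / len(items)
    f1 = sum(answer_f1(p, g) for p, g in items) / len(items)
    return em, f1
```

Comparing answer sets rather than query strings avoids penalizing predicted queries that differ syntactically from the gold query but return the same results.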

Circularity Check

0 steps flagged

No circularity: dataset built from external logs with public verification

full rationale

The paper constructs the WDQL dataset directly from real-world anonymized SPARQL query logs sent to the public Wikidata Query Service. The agent-based method iteratively de-anonymizes, cleans, and verifies queries against that external service while generating NL questions. No equations, predictions, fitted parameters, or first-principles derivations are present that could reduce to self-definitions or self-citations. The size claim (335k pairs, 11x larger than prior datasets) rests on external comparison rather than internal fitting. All assets and code are released publicly, making the work self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

No free parameters or invented entities are introduced; the contribution relies on standard domain assumptions about query logs and the Wikidata graph.

axioms (1)
  • domain assumption: The Wikidata Query Service logs contain queries that can be de-anonymized and verified to produce results.
    The method assumes that iterative cleaning can recover meaningful queries from anonymized logs.

pith-pipeline@v0.9.0 · 5686 in / 1276 out tokens · 95348 ms · 2026-05-21T12:23:30.290434+00:00 · methodology


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. GRISP: Guided Recurrent IRI Selection over SPARQL Skeletons

    cs.CL · 2026-04 · unverdicted · novelty 6.0

    GRISP fine-tunes a small language model on skeleton generation and list-wise re-ranking to iteratively complete SPARQL queries over knowledge graphs, reporting better benchmark results than prior methods.
