Text-to-SPARQL Generation with Reinforcement Learning: A GRPO-based Approach on DBLP
Pith reviewed 2026-05-20 05:26 UTC · model grok-4.3
The pith
GRPO-based RL with execution feedback improves zero-shot Text-to-SPARQL on DBLP-QuAD for a 1.7B model but trails supervised DoRA fine-tuning.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
GRPO substantially improves over the zero-shot baseline and exhibits competitive generalization, while supervised DoRA finetuning achieves higher overall accuracy on the same model scale. Ablation analyses indicate that execution-based rewards account for most gains, with additional shaping yielding limited additional benefit.
Load-bearing premise
The approach assumes that execution feedback from running generated SPARQL queries, combined with structural constraints and answer-level rewards, supplies a sufficiently reliable and informative training signal to optimize the policy in the absence of gold query annotations.
Original abstract
Knowledge graph question answering seeks to translate natural language questions into executable queries over knowledge graphs, but existing approaches often rely on large models or full supervision in the form of gold query annotations. This study examines whether reinforcement learning with outcome-based rewards can train a small instruction-tuned language model to perform zero-shot Text-to-SPARQL generation in the scholarly domain. Group-Relative Policy Optimization (GRPO) is applied to the Qwen3-1.7B model on DBLP-QuAD, using prompts that combine natural language questions with symbolic hints about entities and relations. Training relies on execution feedback, structural constraints, and answer-level rewards, with an additional variant that incorporates gold-query-based shaping. The resulting models are compared to the unmodified zero-shot baseline and to a supervised DoRA-finetuned baseline across answer-level accuracy, execution accuracy, category-wise scores, and generalization to held-out templates. GRPO substantially improves over the zero-shot baseline and exhibits competitive generalization, while supervised DoRA finetuning achieves higher overall accuracy on the same model scale. Ablation analyses indicate that execution-based rewards account for most gains, with additional shaping yielding limited additional benefit, suggesting that outcome-based reinforcement learning is a viable training strategy when gold queries are unavailable for token-level supervision.
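The group-relative step that gives GRPO its name can be sketched as follows. This is a minimal illustration of the commonly described advantage normalization, in which each sampled query's reward is standardized against the other samples drawn for the same prompt. The function name, the `eps` constant, the use of a population standard deviation, and the example rewards are assumptions for illustration and are not taken from the paper.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """Standardize rewards within one group of samples for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the group (an assumption)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hypothetical group of 4 sampled queries with binary rewards.
mixed = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

# A group in which every sampled query fails: all advantages are zero.
all_failed = group_relative_advantages([0.0, 0.0, 0.0, 0.0])
```

Under this formulation, a group whose samples all receive the same reward contributes zero advantage, so how often a prompt yields at least one successful query matters for the strength of the training signal.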
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes applying Group-Relative Policy Optimization (GRPO) to fine-tune the Qwen3-1.7B model for zero-shot Text-to-SPARQL generation on DBLP-QuAD. Prompts incorporate natural language questions plus symbolic entity/relation hints; training uses execution feedback, structural constraints, and answer-level rewards (with an optional gold-query shaping variant). Results are compared against the unmodified zero-shot baseline and a supervised DoRA-finetuned model on answer accuracy, execution accuracy, category-wise scores, and generalization to held-out templates. The central claim is that GRPO yields substantial gains over zero-shot, competitive generalization, and that execution-based rewards drive most of the improvement while additional shaping adds limited benefit.
Significance. If the empirical gains and ablation findings hold under fuller reporting, the work shows that outcome-based RL can train compact instruction-tuned models for KGQA without gold query annotations, reducing reliance on expensive supervision in scholarly domains. The ablation isolating execution rewards and the generalization tests supply concrete evidence on reward design for compositional query generation.
major comments (3)
- [Abstract and §4] Abstract and §4 (Experiments): the claim that 'execution-based rewards account for most gains' is load-bearing for the central argument yet is presented without the exact reward equation, normalization scheme, handling of partial matches, or variance statistics; without these details it is impossible to verify whether the reported improvements over zero-shot arise from a dense enough policy gradient or from prompt-level symbolic hints.
- [§4.3] §4.3 (Generalization): the statement of 'competitive generalization' to held-out templates lacks per-template accuracy tables, absolute deltas, or error bars, so the magnitude and robustness of the generalization claim cannot be assessed against the zero-shot and DoRA baselines.
- [§3.2] §3.2 (Reward formulation): because SPARQL queries are compositional, minor entity or syntax errors produce execution failures and zero rewards; the manuscript does not describe reward shaping for partial credit, multiple valid queries, or GRPO-specific variance reduction, leaving the weakest assumption (reliable training signal without gold annotations) untested.
minor comments (2)
- [§2] Dataset statistics (number of questions, query-length distribution, entity coverage) are missing from the data section and would help readers interpret the scale of the reported gains.
- [Figures 2-4] Figure and table captions should explicitly state whether error bars represent standard deviation across seeds or across query categories.
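The reporting the referee asks for (per-template accuracy, absolute deltas, and error bars with a stated meaning) could be produced along the following lines. This is a sketch under assumptions: the record layout `(seed, template_id, is_correct)`, the function names, and the toy data are hypothetical, and the spread shown is the standard deviation across seeds, which is only one of the options the referee mentions.

```python
from collections import defaultdict
from statistics import mean, stdev

def per_template_accuracy(records):
    """records: iterable of (seed, template_id, is_correct) tuples.

    Returns {template_id: (mean_accuracy, std_across_seeds)}.
    """
    # Collect correctness flags per (template, seed) pair.
    by_template_seed = defaultdict(lambda: defaultdict(list))
    for seed, template_id, is_correct in records:
        by_template_seed[template_id][seed].append(1.0 if is_correct else 0.0)

    summary = {}
    for template_id, by_seed in by_template_seed.items():
        seed_accs = [mean(flags) for flags in by_seed.values()]
        spread = stdev(seed_accs) if len(seed_accs) > 1 else 0.0
        summary[template_id] = (mean(seed_accs), spread)
    return summary

def absolute_deltas(system, baseline):
    """Difference in mean accuracy per template, for templates present in both."""
    return {t: system[t][0] - baseline[t][0] for t in system if t in baseline}

# Hypothetical toy data (not results from the paper): two seeds, two templates.
grpo = per_template_accuracy([
    (0, "T1", True), (0, "T1", False), (1, "T1", True), (1, "T1", True),
    (0, "T2", False), (1, "T2", False),
])
zero_shot = per_template_accuracy([
    (0, "T1", False), (0, "T1", False), (1, "T1", False), (1, "T1", False),
    (0, "T2", False), (1, "T2", False),
])
deltas = absolute_deltas(grpo, zero_shot)
```

Computing accuracy per seed first and then taking the spread across seeds keeps the error bar interpretable as run-to-run variation rather than variation across questions.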
Simulated Author's Rebuttal
We thank the referee for the insightful comments and constructive feedback on our manuscript. We address each of the major comments in detail below and have made revisions to improve the clarity and completeness of the paper.
Point-by-point responses
- Referee: [Abstract and §4] Abstract and §4 (Experiments): the claim that 'execution-based rewards account for most gains' is load-bearing for the central argument yet is presented without the exact reward equation, normalization scheme, handling of partial matches, or variance statistics; without these details it is impossible to verify whether the reported improvements over zero-shot arise from a dense enough policy gradient or from prompt-level symbolic hints.
  Authors: We agree that providing these details is essential for verifying the source of the improvements. In the revised version, we will explicitly present the reward equation used for execution-based rewards, describe the normalization scheme, clarify that rewards are binary (1 for successful execution matching the expected answer, 0 otherwise) with no partial credit for minor errors, and include variance statistics from our experimental runs. This will demonstrate that the gains are attributable to the GRPO training with outcome-based feedback.
  Revision: yes
- Referee: [§4.3] §4.3 (Generalization): the statement of 'competitive generalization' to held-out templates lacks per-template accuracy tables, absolute deltas, or error bars, so the magnitude and robustness of the generalization claim cannot be assessed against the zero-shot and DoRA baselines.
  Authors: We acknowledge the need for more granular reporting. We will include per-template accuracy tables in the revised manuscript, report absolute performance deltas relative to the zero-shot and DoRA baselines, and add error bars to indicate the robustness of the results across evaluation settings.
  Revision: yes
- Referee: [§3.2] §3.2 (Reward formulation): because SPARQL queries are compositional, minor entity or syntax errors produce execution failures and zero rewards; the manuscript does not describe reward shaping for partial credit, multiple valid queries, or GRPO-specific variance reduction, leaving the weakest assumption (reliable training signal without gold annotations) untested.
  Authors: This point highlights an important aspect of our approach. We will revise §3.2 to provide a more detailed description of the reward formulation, including how structural constraints are applied to reduce syntax errors and the GRPO variance reduction mechanisms. For partial credit, we maintain that binary execution rewards are appropriate for ensuring fully correct queries in a compositional setting, as partial rewards could encourage incorrect but partially executable queries. We will add discussion on handling multiple valid queries through execution equivalence and why this design provides a reliable signal without requiring gold annotations.
  Revision: partial
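The binary, execution-equivalence reward described in the responses above can be illustrated with a short sketch. It is not the authors' implementation: the exact reward equation is not given in the source, and the function names, the `run_query` callback, and the stub endpoint are hypothetical. The sketch assumes that two queries count as equivalent when they return the same multiset of answer values, ignoring row order and variable names.

```python
from collections import Counter

def normalize_rows(rows):
    """Treat a result set as an order-insensitive multiset of value tuples.

    Variable names are ignored so that queries binding the same answers
    under different names compare as equal (an assumption of this sketch).
    """
    return Counter(tuple(sorted(str(v) for v in row.values())) for row in rows)

def binary_execution_reward(query, expected_rows, run_query):
    """Return 1.0 if the query executes and matches the expected answer, else 0.0.

    run_query is a caller-supplied (hypothetical) function that sends SPARQL
    to an endpoint and returns a list of {variable: value} dicts.
    """
    try:
        rows = run_query(query)
    except Exception:
        return 0.0  # syntax or execution failure earns no reward
    return 1.0 if normalize_rows(rows) == normalize_rows(expected_rows) else 0.0

# Stub endpoint for illustration only; a real setup would query a SPARQL endpoint.
def fake_endpoint(query):
    if "SELECT" not in query:
        raise ValueError("malformed query")
    return [{"x": "paper/2"}, {"x": "paper/1"}]

expected = [{"answer": "paper/1"}, {"answer": "paper/2"}]
ok = binary_execution_reward("SELECT ?x WHERE { ... }", expected, fake_endpoint)
bad = binary_execution_reward("SELEC ?x", expected, fake_endpoint)
```

Comparing answers rather than query strings is what lets multiple valid queries earn the same reward, while a single syntax error still collapses the reward to zero, which is the sparsity concern the referee raises.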
Circularity Check
No circularity: empirical RL evaluation with external baselines and ablations
Full rationale
The paper reports experimental results from applying GRPO to Qwen3-1.7B on DBLP-QuAD for Text-to-SPARQL, comparing against zero-shot and supervised DoRA baselines while ablating reward components. No derivation chain reduces a claimed result to a fitted parameter or self-defined quantity by construction; performance metrics arise from training runs and held-out evaluation rather than algebraic equivalence to inputs. Execution-based rewards and structural constraints are treated as external signals, not internally fitted quantities renamed as predictions. Self-citations, if present, are not load-bearing for the central empirical claims.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Execution feedback from SPARQL queries, combined with structural constraints, provides a reliable reward signal for training.