Text-to-SPARQL Generation with Reinforcement Learning: A GRPO-based Approach on DBLP

Debayan Banerjee; Jann Pfeifer; Ricardo Usbeck

arxiv: 2605.20066 · v1 · pith:SQ7R35T3new · submitted 2026-05-19 · 💻 cs.CL

Jann Pfeifer, Debayan Banerjee, Ricardo Usbeck

Pith reviewed 2026-05-20 05:26 UTC · model grok-4.3

classification 💻 cs.CL

keywords: accuracy, additional, baseline, language, learning, model, reinforcement, rewards

The pith

GRPO-based RL with execution feedback improves zero-shot Text-to-SPARQL on DBLP-QuAD for a 1.7B model but trails supervised DoRA fine-tuning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Knowledge graphs store facts as linked data, and SPARQL is the query language used to retrieve answers from them. Converting a plain-English question into a correct SPARQL query is difficult for language models without extensive labeled examples. This work tests whether reinforcement learning can help a small model learn the task using only outcome rewards: the model proposes a query, the system runs it, and the model receives a score based on whether the result matches the expected answer. The authors add symbolic hints about entities and relations to the prompts and apply Group-Relative Policy Optimization (GRPO) to the Qwen3-1.7B model. Training also includes structural constraints to keep queries valid. Experiments compare the RL-trained model against an unmodified zero-shot version and a supervised fine-tuned version on answer accuracy, execution success, and generalization to new question templates. Ablations show that the execution rewards drive most of the gains, while adding gold-query shaping gives only a small extra benefit.
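The propose, run, and score loop described above can be sketched roughly as follows. This is a minimal illustration, not the paper's reward function: the binary scoring follows the description in the simulated rebuttal on this page, while the `Executor` type, `execution_reward`, `toy_executor`, and the toy queries are all hypothetical names introduced here. A real setup would send the query to a DBLP SPARQL endpoint instead of a lookup table.

```python
from typing import Callable, FrozenSet, Optional

# Hypothetical executor type: takes a SPARQL string and returns the answer set,
# or None when the query fails to parse or execute.
Executor = Callable[[str], Optional[FrozenSet[str]]]


def execution_reward(query: str, expected: FrozenSet[str], run_query: Executor) -> float:
    """Binary outcome reward: 1.0 only if the query runs and its answers match."""
    answers = run_query(query)
    if answers is None:  # execution failure earns nothing
        return 0.0
    return 1.0 if answers == expected else 0.0


def toy_executor(query: str) -> Optional[FrozenSet[str]]:
    """Toy stand-in for a SPARQL endpoint (assumption, not the DBLP service)."""
    table = {
        "SELECT ?a WHERE { <paper1> <authoredBy> ?a }": frozenset({"alice", "bob"}),
        "SELECT ?a WHERE { <paper2> <authoredBy> ?a }": frozenset({"carol"}),
    }
    return table.get(query)  # unknown queries count as execution failures


gold = frozenset({"alice", "bob"})
r_ok = execution_reward("SELECT ?a WHERE { <paper1> <authoredBy> ?a }", gold, toy_executor)
r_wrong = execution_reward("SELECT ?a WHERE { <paper2> <authoredBy> ?a }", gold, toy_executor)
r_fail = execution_reward("SELEC ?a", gold, toy_executor)
```

Because the reward only inspects the returned answer set, no gold query is needed at training time, which is the property the paper relies on.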

Core claim

GRPO substantially improves over the zero-shot baseline and exhibits competitive generalization, while supervised DoRA finetuning achieves higher overall accuracy on the same model scale. Ablation analyses indicate that execution-based rewards account for most gains, with additional shaping yielding limited additional benefit.

Load-bearing premise

The approach assumes that execution feedback from running generated SPARQL queries, combined with structural constraints and answer-level rewards, supplies a sufficiently reliable and informative training signal to optimize the policy in the absence of gold query annotations.

Original abstract

Knowledge graph question answering seeks to translate natural language questions into executable queries over knowledge graphs, but existing approaches often rely on large models or full supervision in the form of gold query annotations. This study examines whether reinforcement learning with outcome-based rewards can train a small instruction-tuned language model to perform zero-shot Text-to-SPARQL generation in the scholarly domain. Group-Relative Policy Optimization (GRPO) is applied to the Qwen3-1.7B model on DBLP-QuAD, using prompts that combine natural language questions with symbolic hints about entities and relations. Training relies on execution feedback, structural constraints, and answer-level rewards, with an additional variant that incorporates gold-query-based shaping. The resulting models are compared to the unmodified zero-shot baseline and to a supervised DoRA-finetuned baseline across answer-level accuracy, execution accuracy, category-wise scores, and generalization to held-out templates. GRPO substantially improves over the zero-shot baseline and exhibits competitive generalization, while supervised DoRA finetuning achieves higher overall accuracy on the same model scale. Ablation analyses indicate that execution-based rewards account for most gains, with additional shaping yielding limited additional benefit, suggesting that outcome-based reinforcement learning is a viable training strategy when gold queries are unavailable for token-level supervision.
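For readers unfamiliar with GRPO, the core idea is that several completions are sampled per prompt and each completion's reward is normalized against the others in its group, so no learned value function is needed. The sketch below shows that normalization under stated assumptions: the function name, the `eps` constant, and the use of the population standard deviation are choices made here for illustration, and implementations differ on these details. The abstract does not specify the paper's exact scheme.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize each reward against the mean and spread of its sampling group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption, some libraries use sample std
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled queries for one question, two of which returned the right answers.
adv = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

# A group where every sampled query failed: all advantages are zero.
flat = group_relative_advantages([0.0, 0.0, 0.0, 0.0])
```

In the first group the successful samples receive positive advantages and the failures negative ones. In the second group every advantage is zero, so that prompt contributes no policy-gradient signal for the step.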

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

GRPO with execution rewards gives a clear lift over zero-shot on DBLP-QuAD Text-to-SPARQL for the 1.7B model, but supervised DoRA still leads and the scope stays narrow. The paper demonstrates a workable RL setup that avoids gold query annotations for token-level supervision. The authors run GRPO on Qwen3-1.7B, feed prompts that mix the question with symbolic entity and relation hints, and score generations by whether the SPARQL executes and returns the right answer set.

The ablations are the most useful part: execution-based rewards explain most of the gain, while extra shaping from gold queries adds little. Generalization checks on held-out templates also look reasonable. This is a practical data point for anyone trying to train small models on structured generation when full supervision is missing. The comparison to both the unmodified zero-shot baseline and the supervised DoRA run on the same model size keeps the claims grounded. The central argument holds up on its own terms because the results are presented as empirical comparisons rather than derived from fitted parameters inside the paper.

The main limitation is the narrow footprint: one scholarly dataset, one model scale, and no tests on other knowledge graphs or larger models. SPARQL is compositional, so execution success can be brittle and produce sparse signals. Without the exact reward formula or a description of how partial matches are handled, it is hard to tell how much of the reported improvement comes from the RL loop and how much from the symbolic hints already in the prompt. The work does not claim broad theoretical advances, which matches what it actually delivers.

This paper is for people working on domain-specific KGQA or small-model RL for query generation. Readers who need concrete ablation numbers on reward components will get value from it. It deserves a serious referee to check the experimental protocol, variance, and reward details.

Referee Report

3 major / 2 minor

Summary. The paper proposes applying Group-Relative Policy Optimization (GRPO) to fine-tune the Qwen3-1.7B model for zero-shot Text-to-SPARQL generation on DBLP-QuAD. Prompts incorporate natural language questions plus symbolic entity/relation hints; training uses execution feedback, structural constraints, and answer-level rewards (with an optional gold-query shaping variant). Results are compared against the unmodified zero-shot baseline and a supervised DoRA-finetuned model on answer accuracy, execution accuracy, category-wise scores, and generalization to held-out templates. The central claim is that GRPO yields substantial gains over zero-shot, competitive generalization, and that execution-based rewards drive most of the improvement while additional shaping adds limited benefit.

Significance. If the empirical gains and ablation findings hold under fuller reporting, the work shows that outcome-based RL can train compact instruction-tuned models for KGQA without gold query annotations, reducing reliance on expensive supervision in scholarly domains. The ablation isolating execution rewards and the generalization tests supply concrete evidence on reward design for compositional query generation.

major comments (3)

[Abstract and §4 (Experiments)] The claim that 'execution-based rewards account for most gains' is load-bearing for the central argument yet is presented without the exact reward equation, normalization scheme, handling of partial matches, or variance statistics; without these details it is impossible to verify whether the reported improvements over zero-shot arise from a sufficiently dense training signal or from prompt-level symbolic hints.
[§4.3 (Generalization)] The statement of 'competitive generalization' to held-out templates lacks per-template accuracy tables, absolute deltas, or error bars, so the magnitude and robustness of the generalization claim cannot be assessed against the zero-shot and DoRA baselines.
[§3.2 (Reward formulation)] Because SPARQL queries are compositional, minor entity or syntax errors produce execution failures and zero rewards; the manuscript does not describe reward shaping for partial credit, multiple valid queries, or GRPO-specific variance reduction, leaving the weakest assumption (reliable training signal without gold annotations) untested.
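The sparsity concern in the last comment can be made concrete with a small calculation. With group-normalized advantages as used in GRPO, a group whose samples all receive the same reward yields zero advantage for every sample and therefore no gradient. Assuming a binary reward and independent samples with per-sample success probability p (both simplifying assumptions made here, not figures from the paper), the chance that a group of size G is uninformative is p^G + (1 - p)^G. The function name and the group size of 8 below are hypothetical.

```python
def prob_uninformative_group(p_success: float, group_size: int) -> float:
    """Probability that all samples in a group share the same binary reward."""
    if not 0.0 <= p_success <= 1.0:
        raise ValueError("p_success must lie in [0, 1]")
    if group_size < 1:
        raise ValueError("group_size must be at least 1")
    all_succeed = p_success ** group_size
    all_fail = (1.0 - p_success) ** group_size
    return all_succeed + all_fail


# Illustrative success rates only; group size 8 is an arbitrary choice.
for p in (0.01, 0.05, 0.2, 0.5):
    print(f"p={p:.2f}  uninformative fraction={prob_uninformative_group(p, 8):.3f}")
```

When the starting policy rarely produces a correct query, most groups fall into the all-fail case, which is why the referee asks how the structural constraints or other shaping terms keep the signal dense enough early in training.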

minor comments (2)

[§2] Dataset statistics (number of questions, query-length distribution, entity coverage) are missing from the data section and would help readers interpret the scale of the reported gains.
[Figures 2-4] Figure and table captions should explicitly state whether error bars represent standard deviation across seeds or across query categories.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the insightful comments and constructive feedback on our manuscript. We address each of the major comments in detail below and have made revisions to improve the clarity and completeness of the paper.

Referee: [Abstract and §4 (Experiments)] The claim that 'execution-based rewards account for most gains' is load-bearing for the central argument yet is presented without the exact reward equation, normalization scheme, handling of partial matches, or variance statistics; without these details it is impossible to verify whether the reported improvements over zero-shot arise from a sufficiently dense training signal or from prompt-level symbolic hints.

Authors: We agree that providing these details is essential for verifying the source of the improvements. In the revised version, we will explicitly present the reward equation used for execution-based rewards, describe the normalization scheme, clarify that rewards are binary (1 for successful execution matching the expected answer, 0 otherwise) with no partial credit for minor errors, and include variance statistics from our experimental runs. This will demonstrate that the gains are attributable to the GRPO training with outcome-based feedback.

Revision: yes

Referee: [§4.3 (Generalization)] The statement of 'competitive generalization' to held-out templates lacks per-template accuracy tables, absolute deltas, or error bars, so the magnitude and robustness of the generalization claim cannot be assessed against the zero-shot and DoRA baselines.

Authors: We acknowledge the need for more granular reporting. We will include per-template accuracy tables in the revised manuscript, report absolute performance deltas relative to the zero-shot and DoRA baselines, and add error bars to indicate the robustness of the results across evaluation settings.

Revision: yes

Referee: [§3.2 (Reward formulation)] Because SPARQL queries are compositional, minor entity or syntax errors produce execution failures and zero rewards; the manuscript does not describe reward shaping for partial credit, multiple valid queries, or GRPO-specific variance reduction, leaving the weakest assumption (reliable training signal without gold annotations) untested.

Authors: This point highlights an important aspect of our approach. We will revise §3.2 to provide a more detailed description of the reward formulation, including how structural constraints are applied to reduce syntax errors and the GRPO variance reduction mechanisms. For partial credit, we maintain that binary execution rewards are appropriate for ensuring fully correct queries in a compositional setting, as partial rewards could encourage incorrect but partially executable queries. We will add discussion on handling multiple valid queries through execution equivalence and why this design provides a reliable signal without requiring gold annotations.

Revision: partial
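The trade-off debated in the last exchange, binary reward versus partial credit, can be illustrated by comparing two scoring rules on the same answer sets. This is a sketch under stated assumptions: answer-set F1 is used here as one common form of partial credit and is not a reward taken from the paper, and `binary_reward`, `f1_reward`, and the toy answer sets are hypothetical.

```python
from typing import Set


def binary_reward(pred: Set[str], gold: Set[str]) -> float:
    """Exact answer-set match, in the spirit of the binary scheme the rebuttal describes."""
    return 1.0 if pred == gold else 0.0


def f1_reward(pred: Set[str], gold: Set[str]) -> float:
    """Answer-set F1, a partial-credit alternative shown only for contrast."""
    if not pred and not gold:
        return 1.0  # both empty: treated as a perfect match
    overlap = len(pred & gold)
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)


gold = {"alice", "bob", "carol"}
near_miss = {"alice", "bob"}  # a query that misses one co-author

binary_score = binary_reward(near_miss, gold)
f1_score = f1_reward(near_miss, gold)
```

Both functions compare answer sets rather than query strings, so two syntactically different queries that return the same answers receive the same score. That is one way to read the "execution equivalence" the authors mention. The near-miss case shows the disagreement: the binary rule gives no credit, while F1 rewards a query that is still wrong.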

Circularity Check

0 steps flagged

No circularity: empirical RL evaluation with external baselines and ablations

The paper reports experimental results from applying GRPO to Qwen3-1.7B on DBLP-QuAD for Text-to-SPARQL, comparing against zero-shot and supervised DoRA baselines while ablating reward components. No derivation chain reduces a claimed result to a fitted parameter or self-defined quantity by construction; performance metrics arise from training runs and held-out evaluation rather than algebraic equivalence to inputs. Execution-based rewards and structural constraints are treated as external signals, not internally fitted quantities renamed as predictions. Self-citations, if present, are not load-bearing for the central empirical claims.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim depends on the assumption that outcome rewards derived from query execution are adequate to drive policy improvement without token-level supervision, plus standard reinforcement learning assumptions about reward signals and policy optimization.

axioms (1)

Domain assumption: Execution feedback from SPARQL queries, combined with structural constraints, provides a reliable reward signal for training.
The method relies on execution feedback, structural constraints, and answer-level rewards as the primary training mechanism.

