
arxiv: 2605.20066 · v1 · pith:SQ7R35T3new · submitted 2026-05-19 · 💻 cs.CL

Text-to-SPARQL Generation with Reinforcement Learning: A GRPO-based Approach on DBLP

Pith reviewed 2026-05-20 05:26 UTC · model grok-4.3

classification 💻 cs.CL
keywords accuracy · additional · baseline · language · learning · model · reinforcement · rewards

The pith

GRPO-based RL with execution feedback improves zero-shot Text-to-SPARQL on DBLP-QuAD for a 1.7B model but trails supervised DoRA fine-tuning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Knowledge graphs store facts as linked data, and SPARQL is the query language used to retrieve answers from them. Converting a plain-English question into a correct SPARQL query is difficult for language models without extensive labeled examples. This work tests whether reinforcement learning can help a small model learn the task from outcome rewards: the model proposes a query, the system runs it, and the model receives a score based on whether the result matches the expected answer. The authors add symbolic hints about entities and relations to the prompts and apply Group-Relative Policy Optimization (GRPO) to the Qwen3-1.7B model. Training also includes structural constraints to keep queries valid. Experiments compare the RL-trained model against an unmodified zero-shot version and a supervised fine-tuned version on answer accuracy, execution success, and generalization to new question templates. Ablations show that the execution rewards drive most of the gains, while adding gold-query shaping gives only a small extra benefit.
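
The propose, run, score loop described above can be sketched as a small reward function. This is a minimal illustration, not the paper's reward: the binary scoring, the `Executor` type, and the toy `fake_execute` endpoint are all assumptions introduced here, since the exact reward equation is not given on this page.

```python
from typing import Callable, Hashable, Iterable, Optional, Set

# Hypothetical executor type: takes a SPARQL string and returns the answers,
# or None / raises if the query cannot be executed.
Executor = Callable[[str], Optional[Iterable[Hashable]]]


def outcome_reward(query: str, execute: Executor, expected: Set[Hashable]) -> float:
    """Return 1.0 if the query runs and its answer set equals `expected`, else 0.0."""
    try:
        result = execute(query)
    except Exception:
        return 0.0  # execution error: no reward
    if result is None:
        return 0.0  # executor signalled failure without raising
    return 1.0 if set(result) == expected else 0.0


# Toy stand-in for a SPARQL endpoint (illustrative data only).
_FAKE_KG = {
    "SELECT ?a WHERE { <paper1> <authoredBy> ?a }": ["alice", "bob"],
}


def fake_execute(query: str) -> Iterable[Hashable]:
    """Look the query up in the toy table; unknown queries count as failures."""
    if query not in _FAKE_KG:
        raise ValueError("query failed to execute")
    return _FAKE_KG[query]


good = "SELECT ?a WHERE { <paper1> <authoredBy> ?a }"
print(outcome_reward(good, fake_execute, {"alice", "bob"}))
print(outcome_reward("SELECT ?a WHERE { broken", fake_execute, {"alice", "bob"}))
```

Comparing answer sets rather than query strings is what lets a signal of this kind work without gold query annotations: any query that returns the expected answers is rewarded.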

Core claim

GRPO substantially improves over the zero-shot baseline and exhibits competitive generalization, while supervised DoRA finetuning achieves higher overall accuracy on the same model scale. Ablation analyses indicate that execution-based rewards account for most gains, with additional shaping yielding limited additional benefit.

Load-bearing premise

The approach assumes that execution feedback from running generated SPARQL queries, combined with structural constraints and answer-level rewards, supplies a sufficiently reliable and informative training signal to optimize the policy in the absence of gold query annotations.

Original abstract

Knowledge graph question answering seeks to translate natural language questions into executable queries over knowledge graphs, but existing approaches often rely on large models or full supervision in the form of gold query annotations. This study examines whether reinforcement learning with outcome-based rewards can train a small instruction-tuned language model to perform zero-shot Text-to-SPARQL generation in the scholarly domain. Group-Relative Policy Optimization (GRPO) is applied to the Qwen3-1.7B model on DBLP-QuAD, using prompts that combine natural language questions with symbolic hints about entities and relations. Training relies on execution feedback, structural constraints, and answer-level rewards, with an additional variant that incorporates gold-query-based shaping. The resulting models are compared to the unmodified zero-shot baseline and to a supervised DoRA-finetuned baseline across answer-level accuracy, execution accuracy, category-wise scores, and generalization to held-out templates. GRPO substantially improves over the zero-shot baseline and exhibits competitive generalization, while supervised DoRA finetuning achieves higher overall accuracy on the same model scale. Ablation analyses indicate that execution-based rewards account for most gains, with additional shaping yielding limited additional benefit, suggesting that outcome-based reinforcement learning is a viable training strategy when gold queries are unavailable for token-level supervision.
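
For readers unfamiliar with GRPO, its defining step is to score each sampled completion relative to the other completions drawn for the same prompt, instead of using a learned value function. The sketch below shows the commonly used normalization (reward minus group mean, divided by group standard deviation). It is a generic illustration: the normalization scheme the paper actually uses is not stated on this page, and the use of the population standard deviation and the `eps` value are assumptions.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize rewards within one group of samples for the same prompt.

    Each advantage is (reward - group mean) / (group std + eps).
    `eps` avoids division by zero when every reward in the group is equal.
    """
    if not rewards:
        raise ValueError("a group needs at least one reward")
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; some implementations use sample std
    return [(r - mu) / (sigma + eps) for r in rewards]


# One query in a group of four earned the reward: it gets a positive advantage,
# the other three get a negative one.
mixed = group_relative_advantages([1.0, 0.0, 0.0, 0.0])

# Every query failed: all advantages are zero, so this group gives no gradient.
all_failed = group_relative_advantages([0.0, 0.0, 0.0, 0.0])
```

The second case matters for outcome rewards: a group in which every sampled query fails (or every one succeeds) contributes no learning signal under this normalization.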

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes applying Group-Relative Policy Optimization (GRPO) to fine-tune the Qwen3-1.7B model for zero-shot Text-to-SPARQL generation on DBLP-QuAD. Prompts incorporate natural language questions plus symbolic entity/relation hints; training uses execution feedback, structural constraints, and answer-level rewards (with an optional gold-query shaping variant). Results are compared against the unmodified zero-shot baseline and a supervised DoRA-finetuned model on answer accuracy, execution accuracy, category-wise scores, and generalization to held-out templates. The central claim is that GRPO yields substantial gains over zero-shot, competitive generalization, and that execution-based rewards drive most of the improvement while additional shaping adds limited benefit.

Significance. If the empirical gains and ablation findings hold under fuller reporting, the work shows that outcome-based RL can train compact instruction-tuned models for KGQA without gold query annotations, reducing reliance on expensive supervision in scholarly domains. The ablation isolating execution rewards and the generalization tests supply concrete evidence on reward design for compositional query generation.

major comments (3)
  1. [Abstract and §4] Abstract and §4 (Experiments): the claim that 'execution-based rewards account for most gains' is load-bearing for the central argument yet is presented without the exact reward equation, normalization scheme, handling of partial matches, or variance statistics; without these details it is impossible to verify whether the reported improvements over zero-shot arise from a dense enough policy gradient or from prompt-level symbolic hints.
  2. [§4.3] §4.3 (Generalization): the statement of 'competitive generalization' to held-out templates lacks per-template accuracy tables, absolute deltas, or error bars, so the magnitude and robustness of the generalization claim cannot be assessed against the zero-shot and DoRA baselines.
  3. [§3.2] §3.2 (Reward formulation): because SPARQL queries are compositional, minor entity or syntax errors produce execution failures and zero rewards; the manuscript does not describe reward shaping for partial credit, multiple valid queries, or GRPO-specific variance reduction, leaving the weakest assumption (reliable training signal without gold annotations) untested.
minor comments (2)
  1. [§2] Dataset statistics (number of questions, query-length distribution, entity coverage) are missing from the data section and would help readers interpret the scale of the reported gains.
  2. [Figures 2-4] Figure and table captions should explicitly state whether error bars represent standard deviation across seeds or across query categories.
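
Major comment 3 can be made concrete with a short calculation. If rewards are binary and advantages are computed relative to the group, a group only carries signal when it contains at least one rewarded and one unrewarded sample. The sketch below computes that probability under a simplifying assumption that samples succeed independently with the same probability. The success rates and group size used are hypothetical and are not figures from the paper.

```python
def informative_group_probability(p_success: float, group_size: int) -> float:
    """Probability that a group contains both a success and a failure.

    Assumes each of the `group_size` samples succeeds independently with
    probability `p_success` (a simplification, not a claim about the paper).
    """
    if not 0.0 <= p_success <= 1.0:
        raise ValueError("p_success must lie in [0, 1]")
    if group_size < 1:
        raise ValueError("group_size must be at least 1")
    all_fail = (1.0 - p_success) ** group_size
    all_pass = p_success ** group_size
    return 1.0 - all_fail - all_pass


# Hypothetical early-training setting: few queries succeed, groups of 8 samples.
for p in (0.01, 0.05, 0.25):
    print(p, round(informative_group_probability(p, 8), 3))
```

Under these assumptions the share of informative groups falls quickly as the per-sample success rate approaches zero, which is why the referee asks how the training signal behaves when most generated queries fail.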

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the insightful comments and constructive feedback on our manuscript. We address each of the major comments in detail below and have made revisions to improve the clarity and completeness of the paper.

Point-by-point responses
  1. Referee: [Abstract and §4] Abstract and §4 (Experiments): the claim that 'execution-based rewards account for most gains' is load-bearing for the central argument yet is presented without the exact reward equation, normalization scheme, handling of partial matches, or variance statistics; without these details it is impossible to verify whether the reported improvements over zero-shot arise from a dense enough policy gradient or from prompt-level symbolic hints.

    Authors: We agree that providing these details is essential for verifying the source of the improvements. In the revised version, we will explicitly present the reward equation used for execution-based rewards, describe the normalization scheme, clarify that rewards are binary (1 for successful execution matching the expected answer, 0 otherwise) with no partial credit for minor errors, and include variance statistics from our experimental runs. This will demonstrate that the gains are attributable to the GRPO training with outcome-based feedback.

    revision: yes

  2. Referee: [§4.3] §4.3 (Generalization): the statement of 'competitive generalization' to held-out templates lacks per-template accuracy tables, absolute deltas, or error bars, so the magnitude and robustness of the generalization claim cannot be assessed against the zero-shot and DoRA baselines.

    Authors: We acknowledge the need for more granular reporting. We will include per-template accuracy tables in the revised manuscript, report absolute performance deltas relative to the zero-shot and DoRA baselines, and add error bars to indicate the robustness of the results across evaluation settings.

    revision: yes

  3. Referee: [§3.2] §3.2 (Reward formulation): because SPARQL queries are compositional, minor entity or syntax errors produce execution failures and zero rewards; the manuscript does not describe reward shaping for partial credit, multiple valid queries, or GRPO-specific variance reduction, leaving the weakest assumption (reliable training signal without gold annotations) untested.

    Authors: This point highlights an important aspect of our approach. We will revise §3.2 to provide a more detailed description of the reward formulation, including how structural constraints are applied to reduce syntax errors and the GRPO variance reduction mechanisms. For partial credit, we maintain that binary execution rewards are appropriate for ensuring fully correct queries in a compositional setting, as partial rewards could encourage incorrect but partially executable queries. We will add discussion on handling multiple valid queries through execution equivalence and why this design provides a reliable signal without requiring gold annotations.

    revision: partial
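
The "execution equivalence" mentioned in the third response can be illustrated as follows: two queries count as equivalent when they return the same results, regardless of row order or variable names. This is a sketch of one possible definition, not the authors' procedure. The row representation and the decision to compare rows as multisets of sorted values are assumptions made for the example.

```python
from collections import Counter
from typing import Dict, List

Row = Dict[str, str]  # one result binding: variable name -> value (assumed shape)


def canonical_rows(rows: List[Row]) -> Counter:
    """Multiset of rows, each reduced to its sorted values.

    Dropping variable names means ?author and ?a are treated alike;
    using a Counter means row order is ignored but duplicates still count.
    """
    return Counter(tuple(sorted(row.values())) for row in rows)


def execution_equivalent(rows_a: List[Row], rows_b: List[Row]) -> bool:
    """True if both result lists contain the same rows, ignoring order and names."""
    return canonical_rows(rows_a) == canonical_rows(rows_b)


# Same answers, different variable names and row order.
res_a = [{"author": "alice"}, {"author": "bob"}]
res_b = [{"a": "bob"}, {"a": "alice"}]
res_c = [{"a": "alice"}]

print(execution_equivalent(res_a, res_b))
print(execution_equivalent(res_a, res_c))
```

One caveat of this particular choice: sorting values within a row also ignores column order, so two queries that return the same values in swapped columns would be treated as equivalent. A stricter check would align columns before comparing.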

Circularity Check

0 steps flagged

No circularity: empirical RL evaluation with external baselines and ablations

full rationale

The paper reports experimental results from applying GRPO to Qwen3-1.7B on DBLP-QuAD for Text-to-SPARQL, comparing against zero-shot and supervised DoRA baselines while ablating reward components. No derivation chain reduces a claimed result to a fitted parameter or self-defined quantity by construction; performance metrics arise from training runs and held-out evaluation rather than algebraic equivalence to inputs. Execution-based rewards and structural constraints are treated as external signals, not internally fitted quantities renamed as predictions. Self-citations, if present, are not load-bearing for the central empirical claims.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim depends on the assumption that outcome rewards derived from query execution are adequate to drive policy improvement without token-level supervision, plus standard reinforcement learning assumptions about reward signals and policy optimization.

axioms (1)
  • domain assumption: Execution feedback from SPARQL queries combined with structural constraints provides a reliable reward signal for training
    The method relies on execution feedback, structural constraints, and answer-level rewards as the primary training mechanism.

pith-pipeline@v0.9.0 · 5759 in / 1336 out tokens · 54055 ms · 2026-05-20T05:26:54.953907+00:00 · methodology

