GAPD: Gold-Action Policy Distillation for Agentic Reinforcement Learning in Knowledge Base Question Answering

Bowen Song; Jianan Xie; Liang Wang; Qiang Liu; Shu Wu; Weiqiang Wang; Xin Sun; Zhongqi Chen; Zilei Wang

arxiv: 2605.29584 · v2 · pith:LJKOVZP6new · submitted 2026-05-28 · 💻 cs.CL


Xin Sun, Jianan Xie, Zhongqi Chen, Qiang Liu, Shu Wu, Bowen Song, Weiqiang Wang, Zilei Wang, Liang Wang


Pith reviewed 2026-06-29 07:22 UTC · model grok-4.3

classification 💻 cs.CL

keywords Gold-Action Policy Distillation · KBQA · Reinforcement Learning · Policy Distillation · Agentic Reinforcement Learning · Knowledge Base Question Answering · Mid-Anchor Matching


The pith

GAPD distills gold action sequences into RL policies for KBQA by matching intermediate entities as state anchors.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a training method called Gold-Action Policy Distillation to give reinforcement learning agents denser supervision when answering questions over knowledge bases. Current RL approaches rely only on whether the final answer is correct, which leaves many intermediate action mistakes under-supervised even when gold logical forms are available. GAPD converts those gold forms into executable action sequences and aligns them with a student's on-policy rollouts by treating the intermediate entities each path reaches as reliable state anchors. It then uses the aligned gold actions as a stop-gradient teacher to distill token-level distributions back into the student policy. If this works, agents can make better use of existing logical-form annotations during training instead of treating them only as initial data.

Core claim

GAPD converts gold logical forms into executable action sequences and applies mid-anchor matching on the intermediate entities reached during gold execution and student exploration; the resulting aligned gold actions serve as a stop-gradient teacher whose token distribution is distilled into the ordinary student policy over generated action-token spans, producing consistent gains over prior state-of-the-art methods on WebQSP, GrailQA, and GraphQ.
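The first step of that claim, turning a gold logical form into an executable action sequence, can be illustrated with a minimal sketch. It assumes the logical form is an S-expression with `JOIN`, `AND`, and `R` (reverse relation) operators, as in GrailQA-style annotations; the action tuple layout, the `#k` result references, and the entity id `m.0abc` are hypothetical choices made for illustration, not the paper's actual action schema.

```python
def tokenize(sexpr):
    """Split an S-expression string into parentheses and symbols."""
    return sexpr.replace("(", " ( ").replace(")", " ) ").split()


def parse(tokens):
    """Recursively build a nested list from a token list (consumes tokens)."""
    tok = tokens.pop(0)
    if tok == "(":
        node = []
        while tokens[0] != ")":
            node.append(parse(tokens))
        tokens.pop(0)  # drop the closing parenthesis
        return node
    return tok


def emit(node, actions):
    """Post-order walk: an operator becomes an action once its arguments are resolved."""
    if isinstance(node, str):
        return node  # entity, class, or relation literal
    op, *args = node
    if op == "R":
        # Assumption: a reversed relation stays a single argument, not a separate action.
        return "(R " + args[0] + ")"
    resolved = [emit(arg, actions) for arg in args]
    actions.append((op, *resolved))
    return f"#{len(actions) - 1}"  # hypothetical reference to this action's result


def logical_form_to_actions(sexpr):
    """Return the ordered list of actions for one S-expression."""
    actions = []
    emit(parse(tokenize(sexpr)), actions)
    return actions


if __name__ == "__main__":
    form = "(AND music.album (JOIN (R music.artist.album) m.0abc))"
    for step, action in enumerate(logical_form_to_actions(form)):
        print(step, action)
```

Inner operators are emitted before the operators that consume them, so each action only refers to results that already exist when it is executed.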

What carries the argument

Mid-anchor matching, which identifies intermediate entities reached in both gold and student paths as state anchors to align gold actions with on-policy rollouts for distillation.
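A minimal sketch of how such anchor matching could look, under stated assumptions: each gold state is represented as the set of entities reached so far paired with the next gold action, and the student state is matched to the gold state with the highest Jaccard overlap. The Jaccard criterion, the `min_overlap` threshold, and all names below are illustrative assumptions; the source does not specify the exact matching rule.

```python
def jaccard(a, b):
    """Jaccard overlap of two entity sets; defined as 0.0 when both are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def match_anchor(student_entities, gold_states, min_overlap=0.5):
    """Pick the gold state whose reached-entity set best overlaps the student's.

    gold_states: list of (entity_set, next_gold_action) pairs in execution order.
    Returns (gold_index, next_gold_action), or None when no gold state
    reaches the (assumed) overlap threshold.
    """
    best, best_score = None, 0.0
    for index, (gold_entities, next_action) in enumerate(gold_states):
        score = jaccard(student_entities, gold_entities)
        if score > best_score:  # ties keep the earliest gold state
            best, best_score = (index, next_action), score
    return best if best_score >= min_overlap else None


if __name__ == "__main__":
    # Hypothetical entity ids and action strings.
    gold = [
        ({"m.start"}, "JOIN rel.a m.start"),
        ({"m.a", "m.b"}, "AND type.x #0"),
    ]
    print(match_anchor({"m.a", "m.b", "m.c"}, gold))
    print(match_anchor({"m.unrelated"}, gold))
```

Returning `None` for weak overlaps leaves those student states supervised only by the outcome reward, which is one plausible way to avoid distilling from a poorly aligned gold action.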

If this is right

RL agents in KBQA receive token-level supervision in addition to sparse final-answer rewards.
Gold logical forms shift from warm-start data construction to active on-policy updates.
Policy distillation becomes possible even when student and gold trajectories diverge in length or order.
Performance improvements appear on standard logical-form-annotated KBQA benchmarks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same entity-anchor alignment technique could transfer to other agentic RL domains that possess partial gold trajectories.
Mid-anchor matching may reduce the amount of random exploration needed in sparse-reward settings by supplying early corrective signals.
If entity sets prove too coarse, future work could test finer-grained anchors such as relation sequences or partial answer sets.

Load-bearing premise

Gold logical forms can be turned into executable action sequences whose intermediate entities give reliable state anchors that match student rollouts without systematic misalignment or bias during distillation.

What would settle it

Running the full GAPD pipeline on WebQSP, GrailQA, or GraphQ and finding no accuracy gain over prior outcome-only RL baselines, or finding that the entity-based matching step produces systematically misaligned gold actions.

Figures

Figures reproduced from arXiv: 2605.29584 by Bowen Song, Jianan Xie, Liang Wang, Qiang Liu, Shu Wu, Weiqiang Wang, Xin Sun, Zhongqi Chen, Zilei Wang.

**Figure 1.** Overview of GAPD. GRPO supplies supervision mainly through the final answer reward. GAPD uses ENTITY-ANCHOR MATCHING to align an on-policy student state with a gold execution state, conditions the current policy on the aligned gold action to obtain a self-teacher distribution, assigns dense guidance to the selected student tokens, and fuses this guidance with the original GRPO advantage.

**Figure 2.** Training dynamics and interaction efficiency. Left: reward curves comparing w/o GAPD and w/ GAPD …

**Figure 3.** Case studies of GAPD. Each example shows the student action aligned to a gold action and highlights …

**Figure 4.** Sensitivity to the GAPD coefficient λ_gapd. Each panel reports F1 after varying λ_gapd; the dashed gray horizontal line marks λ_gapd = 0, i.e., the KBQA-R1 outcome-only RL baseline without GAPD. Appendix C, Coefficient Sensitivity: The coefficient sweep shows a useful range around λ_gapd = 0.1–1.0; very small weights underuse the guide, while the gains taper at λ_gapd = 1.0 compared with the stronger λ_gapd = 0.5 setting.

**Figure 5.** Additional guide-strength analysis. The figure reports the share of supervised generated tokens with strong negative raw guide scores. Divergent actions consistently have heavier negative tails, indicating that the token-level guide mainly acts as a localized suppressive correction on generated spans that depart from the gold-action-conditioned self-teacher.

**Figure 6.** Full trajectory view for the time-zone case. The figure preserves the complete student–environment …

**Figure 7.** Full trajectory view for the island comparison case. The figure preserves the complete student–environment …

**Figure 8.** Full trajectory view for the distilled-spirits case. The figure preserves the complete student–environment …

Abstract

Reinforcement learning (RL) is a natural fit for agentic knowledge base question answering (KBQA), where a model must issue executable actions, observe knowledge-base feedback, and eventually return an answer. However, current RL-based KBQA systems mainly optimize sparse rewards from the final answer, leaving intermediate action errors weakly supervised. This is especially limiting for logical-form annotated KBQA benchmarks: gold logical forms can be converted into executable action sequences, but existing pipelines use them mainly for warm-start data construction rather than for on-policy RL updates. We propose GAPD, a training-time Gold-Action Policy Distillation framework that adds dense token-level guidance to outcome-based RL. To align gold actions with on-policy student rollouts, GAPD uses MID-ANCHOR MATCHING: it treats the intermediate entities reached during student exploration and gold execution as state anchors, and matches student states to gold states through these explored entity sets. The current policy conditioned on this aligned gold action serves as a stop-gradient teacher, whose token distribution is distilled back to the ordinary student policy over generated action-token spans. GAPD consistently surpasses the current state of the art on WebQSP, GrailQA, and GraphQ.
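The distillation step described in the abstract can be sketched at the level of per-token arithmetic. The sketch below assumes the dense guide for a generated token is the teacher log-probability minus the student log-probability, applied only inside action-token spans, and that it is added to the sequence-level GRPO advantage with a weight `lam` standing in for the GAPD coefficient. Both the guide formula and the additive fusion are assumptions for illustration; the source only states that token-level guidance is fused with the GRPO advantage.

```python
import math


def guide_scores(teacher_logprobs, student_logprobs, action_mask):
    """Per-token guide: teacher minus student log-probability on action tokens.

    The teacher values are plain floats here; in an autodiff framework they
    would be detached so that no gradient flows into the teacher pass.
    Tokens outside action spans (mask == 0) receive no guide.
    """
    return [
        (t - s) if m else 0.0
        for t, s, m in zip(teacher_logprobs, student_logprobs, action_mask)
    ]


def fuse_advantage(grpo_advantage, guides, lam=0.5):
    """Add the weighted guide to the sequence-level advantage, token by token."""
    return [grpo_advantage + lam * g for g in guides]


if __name__ == "__main__":
    # Hypothetical probabilities of the tokens the student actually generated.
    teacher = [math.log(0.5), math.log(0.1), math.log(0.9)]
    student = [math.log(0.5), math.log(0.4), math.log(0.3)]
    mask = [1, 1, 0]  # the third token lies outside the action span

    guides = guide_scores(teacher, student, mask)
    print(guides)
    print(fuse_advantage(1.0, guides, lam=0.5))
```

In this toy case the second token is one the gold-conditioned teacher finds much less likely than the student does, so its guide is negative and its fused advantage drops below the outcome-only value, while the masked token keeps the unmodified GRPO advantage.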

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

GAPD introduces entity-set matching to distill gold actions into on-policy RL for KBQA, but the alignment step risks systematic bias when multiple paths share entities.


The main contribution is a distillation setup that converts gold logical forms into action sequences and uses shared intermediate entities as anchors to pair student states with gold actions for token-level supervision. The teacher is the current policy conditioned on the matched gold action, with gradients stopped, and its token distribution is distilled back to the student over action spans. This is a direct attempt to add dense signals where standard outcome RL only has final-answer rewards.

It does a clean job of framing the problem and describing the matching procedure without overclaiming generality. The method stays within the existing agentic KBQA pipeline and avoids needing perfect trajectory matches.

The soft spot is the matching itself. When student and gold rollouts reach the same entity set through different relations or orders, the procedure picks one gold continuation as the target. Nothing in the description shows how they handle cases where that continuation is not the right one from the student's prefix, or whether they ablate the frequency of such mismatches. If those cases are common, the stop-gradient teacher ends up teaching the wrong token distribution, which would undermine the claimed gains. The abstract reports consistent SOTA on WebQSP, GrailQA, and GraphQ, but without ablations, error analysis, or even the actual numbers, it is impossible to tell whether the improvements survive that issue.

This is for people already working on RL agents for structured KBQA tasks. A reader in that niche could extract the matching trick and test it, but the paper does not yet give enough evidence to treat the results as settled. I would send it to peer review so the authors can supply the missing controls and the community can check the alignment assumption directly.

Referee Report

1 major / 1 minor

Summary. The manuscript proposes GAPD, a Gold-Action Policy Distillation framework for agentic RL in KBQA. Gold logical forms are converted to executable action sequences; mid-anchor matching uses shared intermediate entities from student rollouts and gold executions as state anchors to align on-policy states with gold actions. A stop-gradient teacher policy conditioned on the aligned gold action provides token-level distillation to the student policy. The paper claims this yields consistent SOTA improvements over prior methods on WebQSP, GrailQA, and GraphQ.

Significance. If the alignment procedure supplies reliable supervision without introducing systematic bias, GAPD would offer a practical way to convert logical-form annotations into dense on-policy signals for RL, potentially raising sample efficiency in sparse-reward agentic settings. Demonstrated gains across three standard KBQA benchmarks would be a concrete empirical contribution to the intersection of RL and semantic parsing.

major comments (1)

[Mid-anchor matching procedure] Mid-anchor matching (described in the abstract and method): the procedure equates states via shared intermediate entity sets and selects the corresponding gold action for distillation. In KB graphs, however, distinct action sequences can reach identical entity sets through alternate relations or ordering; nothing in the description shows that entity overlap selects the continuation that would have been optimal from the student's actual prefix. If mismatches occur systematically, the stop-gradient teacher supplies incorrect token distributions, converting the claimed dense signal into a source of bias. This directly threatens the central claim of consistent SOTA gains over outcome-only RL and requires either a formal argument that such collisions are negligible or an empirical audit of matching fidelity.
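The collision concern can be made concrete with a toy example. The sketch below builds a tiny hypothetical knowledge base in which two different relations from the same head entity reach the same tail set, then groups actions by the entity set they reach. All entity and relation names are invented; it illustrates the ambiguity the comment describes and is not an audit of the paper's matcher.

```python
from collections import defaultdict

# Hypothetical toy KB: (head entity, relation) -> set of tail entities.
TOY_KB = {
    ("m.person", "rel.born_in"): {"m.city"},
    ("m.person", "rel.lives_in"): {"m.city"},
    ("m.person", "rel.works_for"): {"m.company"},
}


def actions_by_reached_set(kb):
    """Group (head, relation) actions by the entity set they reach."""
    index = defaultdict(list)
    for (head, relation), tails in kb.items():
        index[frozenset(tails)].append((head, relation))
    return index


def collisions(kb):
    """Return reached-entity sets that more than one distinct action produces."""
    return {
        reached: sorted(actions)
        for reached, actions in actions_by_reached_set(kb).items()
        if len(actions) > 1
    }


if __name__ == "__main__":
    for reached, actions in collisions(TOY_KB).items():
        print(sorted(reached), "<-", actions)
```

Here a student that followed `rel.lives_in` and a gold path that followed `rel.born_in` would share the anchor `{m.city}`, so matching on the entity set alone cannot tell the two prefixes apart.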

minor comments (1)

[Abstract] The abstract states that GAPD 'consistently surpasses' prior work but supplies no quantitative deltas, standard deviations, or statistical tests; these details belong in the results section to allow readers to judge the practical magnitude of the reported improvements.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the careful analysis of the mid-anchor matching procedure. The comment raises a substantive point about potential mismatches that we address directly below.


Referee: [Mid-anchor matching procedure] Mid-anchor matching (described in the abstract and method): the procedure equates states via shared intermediate entity sets and selects the corresponding gold action for distillation. In KB graphs, however, distinct action sequences can reach identical entity sets through alternate relations or ordering; nothing in the description shows that entity overlap selects the continuation that would have been optimal from the student's actual prefix. If mismatches occur systematically, the stop-gradient teacher supplies incorrect token distributions, converting the claimed dense signal into a source of bias. This directly threatens the central claim of consistent SOTA gains over outcome-only RL and requires either a formal argument that such collisions are negligible or an empirical audit of matching fidelity.

Authors: We acknowledge that KB graphs admit multiple paths to the same intermediate entity set and that entity overlap alone does not guarantee the gold continuation is optimal from the student's exact prefix. Our design selects the gold action from the single gold logical-form execution that shares the observed entity set at the anchor point; this supplies a valid (if not necessarily shortest) continuation rather than an arbitrary one. While a general formal guarantee of negligible collisions would require strong assumptions on the KB structure that we do not claim, the consistent empirical gains over outcome-only RL baselines across WebQSP, GrailQA, and GraphQ suggest that any residual bias is not dominant. To strengthen the manuscript we will add (i) a brief discussion of the collision issue and (ii) a quantitative audit of matching fidelity on held-out rollouts.

Revision: partial

Circularity Check

0 steps flagged

No circularity: self-contained training procedure with independent empirical claims


The paper presents GAPD as a novel training-time framework that converts gold logical forms to action sequences and applies mid-anchor matching on intermediate entities to enable token-level distillation from a stop-gradient teacher into an on-policy student. This is described as an additive dense signal on top of standard outcome-based RL, with no equations, fitted parameters, or self-citations shown to reduce the claimed SOTA gains to quantities defined by the inputs by construction. The central premise (entity-set anchors reliably align gold actions) is an empirical modeling choice whose correctness can be tested externally rather than a definitional or self-referential reduction. No load-bearing step matches any of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Based solely on the abstract, no free parameters, axioms, or invented entities are identifiable; the method description does not introduce new postulated quantities.



