Can Compact Language Models Search Like Agents? Distillation-Guided Policy Optimization for Preserving Agentic RAG Capabilities

Jiaxin Ma; Mai Nishimura; Rikuto Kotoge

arxiv: 2508.20324 · v4 · submitted 2025-08-27 · 💻 cs.CL

Pith reviewed 2026-05-18 20:18 UTC · model grok-4.3

classification 💻 cs.CL

keywords Distillation-Guided Policy Optimization · Agentic RAG · Compact language models · Reinforcement learning · Search and planning · Policy optimization · Retrieval-augmented generation

The pith

Distillation-Guided Policy Optimization lets compact models develop agentic search and planning in RAG tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Distillation-Guided Policy Optimization to train small language models on agentic retrieval-augmented generation behaviors such as search and planning. Compact models start with poor performance that leads to sparse rewards and unstable reinforcement learning, so the method begins with demonstrations from a larger teacher and maintains continuous teacher guidance throughout optimization. It also defines the Agentic RAG Capabilities metric to measure reasoning, search coordination, and response synthesis separately. Experiments indicate that the resulting compact models reach sophisticated behaviors and can outperform the teacher model on some tasks, opening agentic RAG to settings with limited compute.

Core claim

Distillation-Guided Policy Optimization combines cold-start initialization from teacher demonstrations with continuous teacher guidance during policy optimization, enabling compact models to achieve sophisticated agentic search behaviors in RAG tasks and in some cases outperform the larger teacher model.

What carries the argument

Distillation-Guided Policy Optimization (DGPO), which stabilizes reinforcement learning for small models by seeding the policy from teacher demonstrations and supplying ongoing teacher signals to maintain search and planning skills.
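
As a rough illustration of the cold-start half of DGPO, the sketch below scores a student policy by the mean negative log-likelihood it assigns to the tokens of a teacher demonstration. This page does not state the exact cold-start loss, so treating it as plain supervised imitation is an assumption, and the names `softmax` and `demo_nll` and the toy logits are hypothetical.

```python
import math


def softmax(logits):
    """Convert a list of logits into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def demo_nll(student_logits_per_step, teacher_demo_tokens):
    """Mean negative log-likelihood of the teacher's demonstrated tokens
    under the student (assumed form of the cold-start imitation loss)."""
    total = 0.0
    for logits, token in zip(student_logits_per_step, teacher_demo_tokens):
        total += -math.log(softmax(logits)[token])
    return total / len(teacher_demo_tokens)


# Toy setup: vocabulary of 3 tokens, a 2-step teacher demonstration.
demo = [0, 1]
uniform_student = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
aligned_student = [[5.0, 0.0, 0.0], [0.0, 5.0, 0.0]]

loss_uniform = demo_nll(uniform_student, demo)  # equals log(3)
loss_aligned = demo_nll(aligned_student, demo)  # lower than the uniform loss
```

A student whose logits already favor the demonstrated tokens gets a lower loss than a uniform one, which is the direction a cold-start phase would push the model before any reward is involved.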

If this is right

Compact models become viable for agentic RAG workloads that previously required much larger models.
The ARC metric supplies a practical way to diagnose failures in reasoning, coordination, and synthesis.
Agentic capabilities become accessible in resource-constrained environments.
Policy optimization for small models can be made reliable by combining imitation and guided RL.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same initialization-plus-guidance pattern may help other reinforcement-learning post-training tasks where base models start weak.
DGPO could be tested on models below 0.5B parameters or on non-RAG agentic domains to check generality.
If the teacher itself is imperfect, the method might still transmit useful search patterns but could also propagate the teacher's limitations.

Load-bearing premise

Cold-start initialization from teacher demonstrations together with continuous guidance will be enough to overcome sparse rewards and training instability in compact models that begin with weak performance.

What would settle it

A controlled run in which a compact model trained with DGPO shows no gain in search coordination or exhibits the same instability as standard RL without teacher guidance.

Original abstract

Reinforcement Learning has emerged as a dominant post-training approach to elicit agentic RAG behaviors such as search and planning from language models. Despite its success with larger models, applying RL to compact models (e.g., 0.5–1B parameters) presents unique challenges. The compact models exhibit poor initial performance, resulting in sparse rewards and unstable training. To overcome these difficulties, we propose Distillation-Guided Policy Optimization (DGPO), which employs cold-start initialization from teacher demonstrations and continuous teacher guidance during policy optimization. To understand how compact models preserve agentic behavior, we introduce Agentic RAG Capabilities (ARC), a fine-grained metric analyzing reasoning, search coordination, and response synthesis. Comprehensive experiments demonstrate that DGPO enables compact models to achieve sophisticated agentic search behaviors, even outperforming the larger teacher model in some cases. DGPO makes agentic RAG feasible in computing resource-constrained environments.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

DGPO adds teacher guidance on top of distillation to stabilize RL for tiny agentic RAG models, but the abstract leaves the independence of the final policy and the actual numbers unclear.

The core move here is using cold-start teacher demonstrations plus ongoing guidance during policy optimization to get 0.5-1B models to do search and planning in RAG settings. They also define an ARC metric that breaks down reasoning, coordination, and synthesis steps. That combination targets a real pain point: standard RL collapses on small models because early performance is so bad that rewards stay sparse and training never gets traction. The practical goal of making agentic behavior runnable on constrained hardware is worth attention if it holds up.

Referee Report

2 major / 2 minor

Summary. The paper proposes Distillation-Guided Policy Optimization (DGPO) to train compact language models (0.5–1B parameters) for agentic RAG behaviors such as search, planning, and synthesis. DGPO combines cold-start initialization from teacher demonstrations with continuous teacher guidance during policy optimization to mitigate sparse rewards and training instability in RL. The authors introduce the Agentic RAG Capabilities (ARC) metric to evaluate reasoning, search coordination, and response synthesis at a fine-grained level. Comprehensive experiments are reported to show that DGPO enables compact models to exhibit sophisticated agentic behaviors, sometimes outperforming the larger teacher model, thereby making agentic RAG feasible under compute constraints.

Significance. If the central empirical claims hold under autonomous inference conditions, the work would be significant for enabling advanced sequential decision-making capabilities in resource-constrained environments. It directly targets a practical barrier in applying RL to small models and provides a new fine-grained evaluation framework (ARC) that could support more precise analysis of agentic behaviors.

major comments (2)

[Method / Experiments] Method and Experiments sections: The description states that DGPO employs 'continuous teacher guidance during policy optimization,' yet the manuscript does not explicitly state whether all teacher signals (logits, action suggestions, or reward shaping) are removed at inference time when computing ARC scores. This separation is load-bearing for the claim that compact models acquire independent 'sophisticated agentic search behaviors' rather than performing guided imitation; without it, outperformance over the teacher on 0.5–1B models could be an artifact of ongoing distillation.
[Abstract] Abstract and Experiments: The abstract asserts 'comprehensive experiments' with positive outcomes and occasional outperformance, but reports no concrete metrics, baselines, statistical significance tests, variance across runs, or exact computation details for ARC scores. This absence prevents verification of the central claim that DGPO overcomes sparse rewards and instability in compact models.
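
The kind of reporting requested here could look like the following sketch, which summarizes one metric across several training seeds. The scores are placeholders chosen for illustration and are not results from the paper; `summarize_runs` is a hypothetical helper.

```python
import math
import statistics


def summarize_runs(scores):
    """Return (mean, sample standard deviation, standard error) over seeds."""
    mean = statistics.mean(scores)
    std = statistics.stdev(scores)        # sample std, divides by n - 1
    sem = std / math.sqrt(len(scores))    # standard error of the mean
    return mean, std, sem


# Placeholder per-seed scores for one metric (NOT taken from the paper).
placeholder_scores = [0.40, 0.42, 0.44]
mean, std, sem = summarize_runs(placeholder_scores)
print(f"mean={mean:.3f} std={std:.3f} sem={sem:.3f} over {len(placeholder_scores)} seeds")
```

Reporting the spread alongside the mean lets a reader judge whether a gap between a compact model and its teacher is larger than seed-to-seed noise.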

minor comments (2)

[§4] Notation for ARC sub-components (reasoning, search coordination, response synthesis) should be defined with explicit formulas or scoring rubrics in the main text rather than deferred to an appendix.
[Evaluation protocol] The paper should include a clear statement on whether the teacher model remains accessible at test time or if all reported results use fully autonomous compact-model rollouts.
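
One way to make the autonomy requirement concrete is an evaluation rollout whose signature has no teacher argument at all. The sketch below is a minimal agentic loop under that assumption: the student either issues a search or returns an answer. The function names, the action format, and the stub policy and search tool are all hypothetical and are not taken from the paper.

```python
def autonomous_rollout(student_policy, search_tool, question, max_turns=4):
    """Run the student alone: no teacher logits, hints, or shaped rewards.

    student_policy(question, context) returns ("search", query)
    or ("answer", text). Returns (answer, retrieved_context).
    """
    context = []
    for _ in range(max_turns):
        kind, payload = student_policy(question, context)
        if kind == "answer":
            return payload, context
        # Any non-answer action is treated as a search request.
        context.append(search_tool(payload))
    return "", context  # turn budget exhausted without an answer


def stub_policy(question, context):
    """Stand-in student: search once, then answer with the last passage."""
    if not context:
        return ("search", question)
    return ("answer", context[-1])


def stub_search(query):
    """Stand-in retriever that echoes the query."""
    return "stub passage for: " + query


answer, evidence = autonomous_rollout(stub_policy, stub_search, "example question")
```

Because the teacher never appears in the call, any score computed from `answer` and `evidence` reflects the compact model's own search and synthesis decisions.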

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive comments. We address each major point below and have revised the manuscript to improve clarity and verifiability of our claims.

Referee: [Method / Experiments] Method and Experiments sections: The description states that DGPO employs 'continuous teacher guidance during policy optimization,' yet the manuscript does not explicitly state whether all teacher signals (logits, action suggestions, or reward shaping) are removed at inference time when computing ARC scores. This separation is load-bearing for the claim that compact models acquire independent 'sophisticated agentic search behaviors' rather than performing guided imitation; without it, outperformance over the teacher on 0.5–1B models could be an artifact of ongoing distillation.

Authors: We agree this distinction is critical. In DGPO, teacher guidance (logits, action suggestions, and reward shaping) is applied exclusively during the RL training phase to address sparse rewards and instability. All ARC evaluations, including those showing outperformance, are performed with the compact model operating fully autonomously at inference time, with no teacher signals present. We will add an explicit statement in the Method section clarifying this separation and confirming that ARC scores reflect independent agentic behavior.

Revision: yes

Referee: [Abstract] Abstract and Experiments: The abstract asserts 'comprehensive experiments' with positive outcomes and occasional outperformance, but reports no concrete metrics, baselines, statistical significance tests, variance across runs, or exact computation details for ARC scores. This absence prevents verification of the central claim that DGPO overcomes sparse rewards and instability in compact models.

Authors: The abstract is intentionally concise and summarizes high-level findings, with all quantitative details, baselines, variance, significance tests, and ARC computation provided in the Experiments section and appendices. To improve accessibility, we will revise the abstract to include a small number of representative ARC scores and key comparisons while preserving brevity. Full experimental details remain unchanged in the main body.

Revision: partial

Circularity Check

0 steps flagged

No circularity: empirical training procedure with independent experimental validation

The paper proposes DGPO as a practical RL training recipe (cold-start from teacher demonstrations plus continuous guidance) and evaluates it via experiments on ARC metrics. No derivation chain, equations, or first-principles claims are present that reduce to fitted parameters or self-citations by construction. Results are reported as empirical outcomes rather than analytical predictions forced by the method definition itself. Any self-citations would be incidental and non-load-bearing for the central claim, which rests on observed performance differences rather than tautological equivalence.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The approach relies on standard RL assumptions plus the untested premise that teacher guidance transfers effectively to small models without introducing new biases or capability ceilings.

axioms (1)

Domain assumption: Teacher demonstrations provide a sufficiently rich cold-start distribution for compact models to escape sparse-reward regimes.
Invoked to justify the initialization step in DGPO.

pith-pipeline@v0.9.0 · 5697 in / 1055 out tokens · 21563 ms · 2026-05-18T20:18:55.131100+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · reality_from_one_distinction · unclear
Paper passage: "DGPO employs cold-start initialization from teacher demonstrations and continuous teacher guidance during policy optimization... ARC... thinking, query rewriting, and source referencing"

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: "We propose Distillation-Guided Policy Optimization (DGPO)... selective KL penalty... $r_\phi(x, y) = 1$ if correct, else $-\beta\, D_{\mathrm{KL}}[\pi_\theta \,\|\, \pi_g]$"
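
The selective KL penalty quoted in the passage above can be read as a reward that is 1 for a correct output and otherwise a negative multiple of the divergence between the student policy and the guiding policy. The sketch below implements that reading for a single discrete distribution. How the paper aggregates the divergence over tokens is not stated on this page, so the per-distribution form, the value of `beta`, and the function names are assumptions.

```python
import math


def kl_divergence(p, q):
    """D_KL[p || q] for two discrete distributions on the same support.
    Assumes q is nonzero wherever p is nonzero."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def selective_kl_reward(is_correct, student_probs, guide_probs, beta=0.1):
    """Reward of 1 when correct; otherwise -beta * D_KL[student || guide].
    beta=0.1 is a placeholder value, not one reported by the paper."""
    if is_correct:
        return 1.0
    return -beta * kl_divergence(student_probs, guide_probs)


student = [0.5, 0.5]
guide = [0.9, 0.1]

r_correct = selective_kl_reward(True, student, guide)     # penalty is skipped
r_wrong = selective_kl_reward(False, student, guide)      # negative, scaled by beta
r_matched = selective_kl_reward(False, guide, guide)      # zero divergence
```

Under this reading the teacher only pulls on the student when the student fails, which would leave room for the student to diverge from the teacher on outputs that are already correct.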

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.