Composing Policy Gradients and Prompt Optimization for Language Model Programs

Chen Qian; Christopher Potts; Dan Klein; Dilara Soylu; Isaac Miller; Kaiqiang Song; Karel D'Oosterlinck; Lakshya A Agrawal; Liheng Lai; Matei Zaharia

arxiv: 2508.04660 · v2 · submitted 2025-08-06 · 💻 cs.CL

Noah Ziems, Dilara Soylu, Lakshya A Agrawal, Isaac Miller, Liheng Lai, Chen Qian, Kaiqiang Song, Meng Jiang, Dan Klein, Matei Zaharia, Karel D'Oosterlinck, Christopher Potts, Omar Khattab

Pith reviewed 2026-05-19 00:04 UTC · model grok-4.3

classification 💻 cs.CL

keywords policy optimization, prompt optimization, language model programs, group relative policy optimization, modular AI systems, reinforcement learning for LMs, DSPy

The pith

GRPO and prompt optimization together improve accuracy by 11% on average for modular language model programs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper investigates extending Group Relative Policy Optimization from single language models to modular programs that combine multiple LM calls with distinct prompts. It shows that this extension, particularly the multi-module variant, works well alongside automatic prompt optimization. The combination delivers measurable gains on tasks involving classification, multi-step search, and private delegation. A reader would care because many practical AI systems are now built as such programs rather than isolated models, and this offers a way to optimize them using established RL techniques.

Core claim

Our main variant of multi-module GRPO constructs groups from module-level invocations, and we also consider trajectory-level grouping as another natural instantiation. We find for the first time that GRPO (and its multi-module counterpart) empirically composes well with automatic prompt optimization, and together they improve accuracy by 11% on average across classification, many-hop search, and privacy-preserving delegation tasks against the post-trained LM - with 5% gains against prompt optimization on its own.

What carries the argument

multi-module GRPO, which groups training signals at the level of individual module invocations within the larger program
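
A minimal sketch of what grouping at the level of module invocations could look like, assuming each rollout of the program is logged as a list of LM calls and that every call inherits its trajectory's final reward. The record type `Invocation`, the helpers `group_by_module` and `group_relative_advantages`, and the choice to key groups on (example id, module name, call index) are hypothetical illustrations, not the paper's or DSPy's implementation. The advantage uses the standard GRPO normalization: reward minus group mean, divided by group standard deviation.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean, pstdev


@dataclass(frozen=True)
class Invocation:
    """One LM call made by a module during a program rollout (hypothetical record)."""
    example_id: str   # which training input the rollout was sampled for
    module: str       # name of the module that made the call
    call_index: int   # position of this call among the module's calls in the rollout
    prompt: str
    completion: str
    reward: float     # assumption: trajectory-level reward copied to every call


def group_by_module(rollouts):
    """Collect invocations from several rollouts into module-level groups."""
    groups = defaultdict(list)
    for trajectory in rollouts:
        for inv in trajectory:
            groups[(inv.example_id, inv.module, inv.call_index)].append(inv)
    return dict(groups)


def group_relative_advantages(group, eps=1e-6):
    """Standard GRPO-style advantage: (reward - group mean) / group std."""
    rewards = [inv.reward for inv in group]
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Two toy rollouts of a two-module program on the same input.
rollouts = [
    [Invocation("q1", "search", 0, "p", "a", 1.0),
     Invocation("q1", "answer", 0, "p", "b", 1.0)],
    [Invocation("q1", "search", 0, "p", "c", 0.0),
     Invocation("q1", "answer", 0, "p", "d", 0.0)],
]
groups = group_by_module(rollouts)
advs = group_relative_advantages(groups[("q1", "search", 0)])
```

Here each module gets its own group of two completions, so the higher-reward rollout receives a positive advantage and the lower-reward one a negative advantage within every module. Trajectory-level grouping would instead compare whole rollouts as single units.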

If this is right

GRPO can be instantiated for arbitrary multi-prompt programs using the same abstractions as prompt optimization.
The multi-module approach remains stable without task-specific redesign of grouping or rewards.
Combining GRPO with prompt optimization yields higher accuracy than using either method separately.
The method applies across classification, search, and delegation tasks.
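
The composition of the two methods can be pictured as a two-stage pipeline: optimize prompts first, then run weight updates on the prompt-optimized program. The sketch below assumes that ordering (the page's quoted passage describes staging MIPROv2 and then GRPO). `Program`, `compose`, and the two toy stage functions are hypothetical stand-ins, not DSPy APIs, and the stages only tag the program rather than perform real optimization.

```python
from dataclasses import dataclass, replace
from typing import Callable, Dict


@dataclass(frozen=True)
class Program:
    """Toy stand-in for an LM program: per-module prompts plus a weights tag."""
    prompts: Dict[str, str]
    weights: str = "base"


Optimizer = Callable[[Program], Program]


def compose(*stages: Optimizer) -> Optimizer:
    """Chain optimizers so each stage starts from the previous stage's output."""
    def run(program: Program) -> Program:
        for stage in stages:
            program = stage(program)
        return program
    return run


def toy_prompt_optimizer(program: Program) -> Program:
    # Placeholder: a real prompt optimizer would search over instructions and demos.
    new_prompts = {name: text + " [optimized]" for name, text in program.prompts.items()}
    return replace(program, prompts=new_prompts)


def toy_weight_optimizer(program: Program) -> Program:
    # Placeholder: a real stage would run GRPO updates using the current prompts.
    return replace(program, weights=program.weights + "+grpo")


better_together = compose(toy_prompt_optimizer, toy_weight_optimizer)
result = better_together(Program(prompts={"search": "Find passages."}))
```

Because the weight stage receives the output of the prompt stage, the rollouts it trains on would already use the optimized prompts, which is the sense in which the two optimizers stack.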

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar RL methods could be adapted to optimize entire modular AI systems rather than single components.
Open-sourcing this in DSPy may encourage broader experimentation with program-level optimization.
Future work might test whether trajectory-level grouping performs differently on programs with longer or more variable structures.

Load-bearing premise

The module-level grouping used in multi-module GRPO will remain effective and stable when applied to any multi-prompt program without needing custom adjustments to the grouping or reward design.

What would settle it

Running multi-module GRPO on a new multi-prompt program with a different module structure and checking whether the accuracy improvements fail to appear or the optimization becomes unstable.

Original abstract

Group Relative Policy Optimization (GRPO) has proven to be an effective tool for post-training language models (LMs). However, AI systems are increasingly expressed as modular programs that mix together multiple LM calls with distinct prompt templates and other tools, and it is not clear how practitioners can best leverage online RL algorithms like GRPO to improve these systems. We begin to address this challenge by investigating whether it is possible to effectively instantiate GRPO for arbitrary multi-prompt programs and whether it can work robustly as an off-the-shelf optimizer for LM programs using the same abstractions and constraints typically involved for prompt optimization. Our main variant of multi-module GRPO constructs groups from module-level invocations, and we also consider trajectory-level grouping as another natural instantiation. We find for the first time that GRPO (and its multi-module counterpart) empirically composes well with automatic prompt optimization, and together they improve accuracy by 11% on average across classification, many-hop search, and privacy-preserving delegation tasks against the post-trained LM - with 5% gains against prompt optimization on its own. We open-source multi-module GRPO in the DSPy library at https://dspy.ai .

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

This paper shows GRPO can be adapted to multi-module LM programs and stacks with prompt optimization for modest gains on three tasks, but the evidence is still preliminary with thin experimental details.

The main thing to know is that the authors adapt GRPO to work on modular LM programs by grouping at the module level and demonstrate that it combines with automatic prompt optimization. They report an 11% average accuracy lift over the base post-trained model and a 5% lift over prompt optimization alone across classification, many-hop search, and privacy-preserving delegation tasks. The code is released in DSPy, which makes the approach immediately usable for people already working in that framework.

Referee Report

2 major / 2 minor

Summary. The manuscript investigates whether Group Relative Policy Optimization (GRPO) and a proposed multi-module variant can be instantiated for arbitrary multi-prompt LM programs and composed with automatic prompt optimization. The main multi-module GRPO variant constructs groups from module-level invocations (with trajectory-level grouping as an alternative). The central empirical claim is that GRPO composes well with prompt optimization, yielding 11% average accuracy gains over the post-trained LM and 5% gains over prompt optimization alone across classification, many-hop search, and privacy-preserving delegation tasks. The implementation is open-sourced in DSPy.

Significance. If the results hold under more rigorous evaluation, the work would have moderate significance for the field of optimizing modular LM systems. It provides the first reported evidence that GRPO can be applied off-the-shelf to multi-prompt programs using the same abstractions as prompt optimization, with the open-sourcing in DSPy as a clear strength for reproducibility. The empirical focus on composition rather than isolated RL or prompt tuning is a useful practical contribution, though the absence of parameter-free derivations or falsifiable predictions limits its theoretical reach.

major comments (2)

[Abstract] The central claim of 11% average accuracy gains (and 5% over prompt optimization) is presented without variance, standard deviations, confidence intervals, or statistical significance tests across the three tasks. This is load-bearing because the soundness of the composition result rests on these gains being reliable rather than task-specific noise.
[Abstract, paragraph on the main variant of multi-module GRPO] The assumption that module-level grouping remains effective and stable for arbitrary programs without task-specific redesign of grouping or reward structure is not validated. In many-hop search, where inter-module interactions are likely to dominate, per-module reward signals may produce biased or noisy relative advantage estimates; an ablation or credit-assignment analysis is needed to support the off-the-shelf claim.
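
The reporting requested in the first comment could be as simple as the following sketch, which summarizes per-run accuracy gains with a mean, a sample standard deviation, and a percentile bootstrap confidence interval. The values in `gains` are made-up placeholders, not results from the paper, and `bootstrap_ci` is a hypothetical helper name.

```python
import random
from statistics import mean, stdev


def bootstrap_ci(values, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of `values`."""
    rng = random.Random(seed)
    means = sorted(
        mean(rng.choices(values, k=len(values))) for _ in range(n_resamples)
    )
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi


# Placeholder per-run accuracy gains in percentage points (NOT from the paper).
gains = [2.0, 4.0, 3.0, 5.0, 1.0]

avg = mean(gains)
spread = stdev(gains)
ci_low, ci_high = bootstrap_ci(gains)
print(f"gain = {avg:.1f} +/- {spread:.1f} (95% CI {ci_low:.1f} to {ci_high:.1f})")
```

With only a handful of runs per task the interval will be wide, which is itself useful information when judging whether an average gain is distinguishable from run-to-run noise.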

minor comments (2)

The abstract refers to 'post-trained LM' without naming the specific base model or version used in the experiments; this should be stated explicitly for reproducibility.
Minor notation: clarify whether 'GRPO group size' is a fixed hyperparameter or tuned per task, as this affects the interpretation of the 'off-the-shelf' claim.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback and the opportunity to clarify and strengthen our manuscript. We address each major comment below and outline the revisions we will make to improve rigor and transparency.

Referee: [Abstract] The central claim of 11% average accuracy gains (and 5% over prompt optimization) is presented without variance, standard deviations, confidence intervals, or statistical significance tests across the three tasks. This is load-bearing because the soundness of the composition result rests on these gains being reliable rather than task-specific noise.

Authors: We agree that the abstract would benefit from explicit reporting of variability to better support the reliability of the composition results. The full manuscript presents per-task results from multiple independent runs, which demonstrate consistent gains. In the revised version, we will update the abstract to include standard deviations for the reported averages and reference the statistical significance of the improvements where appropriate.

Revision: yes

Referee: [Abstract, paragraph on the main variant of multi-module GRPO] The assumption that module-level grouping remains effective and stable for arbitrary programs without task-specific redesign of grouping or reward structure is not validated. In many-hop search, where inter-module interactions are likely to dominate, per-module reward signals may produce biased or noisy relative advantage estimates; an ablation or credit-assignment analysis is needed to support the off-the-shelf claim.

Authors: We recognize the value of additional validation for the grouping strategy in complex settings like many-hop search. Our current experiments show that module-level GRPO yields stable gains across all tasks, including many-hop search, without per-task redesign of grouping or rewards. To directly address potential credit-assignment concerns, we will add an ablation comparing module-level versus trajectory-level grouping and include a brief analysis of per-module reward signals in the revised manuscript.

Revision: yes

Circularity Check

0 steps flagged

No significant circularity in empirical GRPO composition results

The paper is an empirical study reporting measured accuracy gains from applying GRPO and multi-module variants to LM programs on classification, many-hop search, and delegation tasks. No mathematical derivation chain exists that reduces the reported 11% or 5% gains to quantities defined by fitted parameters, self-citations, or ansatzes inside the paper. The central claims rest on experimental outcomes against baselines rather than any self-referential construction or uniqueness theorem. Minor references to the DSPy library for open-sourcing the implementation are not load-bearing for the accuracy results.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on standard reinforcement-learning assumptions about reward signals and on the empirical behavior of GRPO when applied to modular programs; no new theoretical entities are introduced.

free parameters (1)

GRPO group size and learning-rate schedule
Hyperparameters chosen for the multi-module experiments; their specific values are not stated in the abstract.

axioms (1)

domain assumption: GRPO is an effective post-training method for language models
Invoked in the first sentence of the abstract as established prior work.

pith-pipeline@v0.9.0 · 5777 in / 1245 out tokens · 77641 ms · 2026-05-19T00:04:12.102305+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

MMGRPO starts by sampling full program trajectories... constructs GRPO groups at the module level... $J_{\mathrm{mmGRPO}}(\theta_M)$ with module-level $\omega_t$ and $\hat{A}_i$
IndisputableMonolith/Foundation/BranchSelection.lean branch_selection unclear

BetterTogether(PO, MMGRPO) staging of MIPROv2 then GRPO

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Learning, Fast and Slow: Towards LLMs That Adapt Continually
cs.LG 2026-05 unverdicted novelty 7.0

Fast-Slow Training uses context optimization as fast weights alongside parameter updates as slow weights to achieve up to 3x better sample efficiency, higher performance, and less catastrophic forgetting than standard...
Learning, Fast and Slow: Towards LLMs That Adapt Continually
cs.LG 2026-05 unverdicted novelty 6.0

Fast-Slow Training combines slow parameter updates with fast context optimization to achieve up to 3x better sample efficiency, higher performance, less forgetting, and preserved plasticity in continual LLM learning.