P²O: Joint Policy and Prompt Optimization

Boxi Cao; Hongyu Lin; Jinglin Yang; Kaiqi Zhang; Le Sun; Min He; Xianpei Han; Xinyu Lu; Yaojie Lu

arxiv: 2603.21877 · v3 · submitted 2026-03-23 · 💻 cs.LG · cs.AI

Xinyu Lu, Kaiqi Zhang, Jinglin Yang, Boxi Cao, Yaojie Lu, Hongyu Lin, Min He, Xianpei Han, Le Sun

Pith reviewed 2026-05-15 00:45 UTC · model grok-4.3

keywords joint policy and prompt optimization · reinforcement learning with verifiable rewards · advantage collapse · context distillation · GEPA · LLM reasoning · prompt evolution

The pith

P²O restores learning signals on hard reasoning samples by alternating policy updates with prompt evolution and distillation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that standard reinforcement learning with verifiable rewards loses all advantage signals on difficult problems where every rollout fails, and simply running more rollouts brings little benefit. P²O counters this by switching between continuous policy training and discrete prompt evolution that finds better reasoning instructions for those stuck cases. The improved behavior is then distilled directly into the model weights so no special prompts are needed later. The result is stronger performance than baseline methods, better results than doubling the rollout budget, and improved generalization to new problems.
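
The collapse described above can be reproduced with a few lines of arithmetic. The sketch below assumes the group-relative normalization commonly used in GRPO-style training (reward minus group mean, divided by group standard deviation) and binary verifier rewards. The function name `group_advantages` and the `eps` constant are illustrative choices, not taken from the paper.

```python
from statistics import mean, pstdev

def group_advantages(rewards, eps=1e-6):
    """Group-normalized advantages: (r - mean) / (std + eps).

    Assumption: this mirrors the usual GRPO-style normalization; the
    paper's exact formula is not given in the text above.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# One success among four rollouts: non-zero advantages, so there is a gradient signal.
mixed = group_advantages([1.0, 0.0, 0.0, 0.0])

# A hard sample where every rollout fails: every advantage is exactly zero.
hard = group_advantages([0.0] * 4)

# A doubled group that still fails everywhere gives the same zeros.
hard_more = group_advantages([0.0] * 8)

print(mixed)
print(hard)
print(hard_more)
```

A larger group helps only if at least one of the extra rollouts succeeds; when the success probability on a sample is close to zero, that rarely happens, which is consistent with the limited benefit the paper reports for larger rollout budgets.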

Core claim

P²O mitigates advantage collapse in RLVR by alternating continuous policy updates with discrete prompt evolution. For intractable samples, the GEPA algorithm discovers successful reasoning prompts, and context distillation internalizes these gains into model parameters. This restores critical advantage signals, significantly outperforming standard GRPO, surpassing baselines with doubled rollout budgets, and yielding strong out-of-distribution generalization with up to 9.5% performance improvement.

What carries the argument

Alternating policy gradient steps with evolutionary prompt discovery via GEPA, followed by context distillation to embed the gains in parameters.

Load-bearing premise

The GEPA algorithm can reliably discover successful reasoning prompts for intractable instances and context distillation can internalize these gains into model parameters without loss of effectiveness.
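
One way to picture the distillation half of this premise is as a data-construction step: responses are generated with the evolved prompt in context, and the student is trained on the bare question only. The sketch below is a hypothetical illustration under that assumption; the `Rollout` and `DistillExample` types, the reward filter, and the supervised framing are not confirmed by the text above, which does not state the paper's actual distillation objective.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Rollout:
    question: str
    prompt: str      # evolved reasoning prompt present at generation time
    response: str
    reward: float    # verifier output, assumed 1.0 if correct and 0.0 otherwise

@dataclass
class DistillExample:
    model_input: str  # what the student sees: the question without the prompt
    target: str       # response that was produced with the prompt in context

def build_distillation_set(rollouts: List[Rollout]) -> List[DistillExample]:
    """Keep verified-correct prompted responses and drop the prompt from the input."""
    return [
        DistillExample(model_input=r.question, target=r.response)
        for r in rollouts
        if r.reward > 0
    ]

rollouts = [
    Rollout("What is 17 * 23?", "Break the product into parts.", "17*20 + 17*3 = 391", 1.0),
    Rollout("What is 17 * 23?", "Break the product into parts.", "17*23 = 381", 0.0),
]
examples = build_distillation_set(rollouts)
print(examples)
```

Whether any effectiveness is lost in this step is exactly what the premise leaves open: the student never sees the prompt, so it has to reproduce the prompted behavior from the question alone.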

What would settle it

A controlled experiment on a set of hard samples where GEPA fails to produce any prompts that improve success rates and P²O shows no gain over standard GRPO would falsify the claim that the joint process restores useful signals.
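
The first half of that test can be phrased as a single statistic: among samples where every baseline rollout fails, the fraction for which at least one prompted rollout succeeds. The sketch below is a hypothetical measurement helper, assuming binary rewards stored per question; `recovery_rate` and its input layout are illustrative and do not come from the paper.

```python
from typing import Dict, List, Optional

def recovery_rate(
    baseline: Dict[str, List[float]],
    prompted: Dict[str, List[float]],
) -> Optional[float]:
    """Fraction of hard samples recovered by prompted rollouts.

    A sample is hard if no baseline rollout has a positive reward.
    Returns None when there are no hard samples to evaluate.
    """
    hard = [q for q, rs in baseline.items() if not any(r > 0 for r in rs)]
    if not hard:
        return None
    recovered = [q for q in hard if any(r > 0 for r in prompted.get(q, []))]
    return len(recovered) / len(hard)

baseline = {"a": [0.0, 0.0], "b": [0.0, 0.0], "c": [1.0, 0.0]}
prompted = {"a": [1.0, 0.0], "b": [0.0, 0.0]}
rate = recovery_rate(baseline, prompted)
print(rate)
```

A rate near zero on a held-out hard set, combined with no gain over standard GRPO, would correspond to the falsifying outcome described above.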

Original abstract

Reinforcement Learning with Verifiable Rewards (RLVR) enhances Large Language Model (LLM) reasoning but suffers from advantage collapse on "hard samples" where all rollouts fail. This lack of variance eliminates crucial learning signals. For these intractable samples, simply scaling up rollout budgets offers limited gains. We introduce Joint Policy and Prompt Optimization (P²O) to mitigate this collapse by alternating continuous policy updates with discrete prompt evolution. P²O leverages the GEPA algorithm to discover successful reasoning prompts for intractable instances. Via context distillation, the model internalizes these prompt-induced gains directly into its parameters, removing the need for inference-time prompting. Empirically, P²O restores critical advantage signals, significantly outperforming standard GRPO and surpassing baselines with doubled rollout budgets, ultimately yielding strong out-of-distribution generalization and an up to 9.5% performance improvement. Our findings expose the limits of standard exploration in sparse-reward environments, illuminating the potential of unifying evolutionary algorithms with reinforcement learning. This integration of discrete semantic search and continuous parameter updates establishes a self-reinforcing paradigm for autonomous LLM alignment.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

P²O alternates GRPO policy updates with GEPA prompt evolution on hard samples, then distills the gains back into the parameters, but the abstract supplies no experimental details to back the reported gains.

The paper's central move is to treat prompt search as a discrete evolutionary step that alternates with continuous policy optimization. When every GRPO rollout fails on a sample, GEPA searches for a reasoning prompt that succeeds, and context distillation then folds that success into the model weights so the prompt is no longer needed at test time. The stated payoff is restored advantage signals, better performance than standard GRPO, gains even over baselines that double the rollout budget, and strong out-of-distribution generalization, with a performance improvement of up to 9.5%.
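
The alternation in this paragraph can be sketched as control flow. The version below is a minimal illustration, not the authors' implementation: `p2o_round` and its four callables are hypothetical stand-ins for the policy rollout, the GRPO update, the GEPA search, and the distillation step, and the per-sample routing order is an assumption. It also assumes binary rewards, so "all rollouts fail" means no positive reward in the group.

```python
from typing import Callable, List, Sequence

def p2o_round(
    samples: Sequence[str],
    rollout: Callable[[str, str], List[float]],        # (question, prompt) -> rewards
    policy_update: Callable[[str, List[float]], None],  # stand-in for a GRPO step
    evolve_prompt: Callable[[str], str],                # stand-in for GEPA search
    distill: Callable[[str, str], None],                # stand-in for context distillation
) -> List[str]:
    """One alternation; returns the samples routed to prompt search."""
    hard: List[str] = []
    for q in samples:
        rewards = rollout(q, "")
        if any(r > 0 for r in rewards):
            policy_update(q, rewards)   # some rollout succeeded, so a signal exists
        else:
            hard.append(q)              # every rollout failed: no advantage signal
    for q in hard:
        prompt = evolve_prompt(q)
        if any(r > 0 for r in rollout(q, prompt)):
            distill(q, prompt)          # internalize only prompts that actually help
    return hard

log = []

def toy_rollout(q: str, prompt: str) -> List[float]:
    # Toy verifier: "easy" succeeds unaided, "hard" only with a prompt.
    if q == "easy":
        return [1.0, 0.0]
    return [1.0, 0.0] if prompt else [0.0, 0.0]

hard = p2o_round(
    ["easy", "hard"],
    rollout=toy_rollout,
    policy_update=lambda q, r: log.append(("policy_update", q)),
    evolve_prompt=lambda q: "hint: work step by step",
    distill=lambda q, p: log.append(("distill", q)),
)
print(hard, log)
```

In the toy run, the easy sample goes straight to the policy update while the hard one reaches distillation only after the prompted rollout succeeds. This is the routing the letter describes.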

Referee Report

2 major / 0 minor

Summary. The paper claims that P²O mitigates advantage collapse in RLVR for LLMs on hard samples by alternating policy updates with prompt evolution via GEPA and using context distillation to internalize gains, leading to restored advantage signals, outperformance over GRPO and doubled-rollout baselines, up to 9.5% improvement, and strong OOD generalization.

Significance. If validated, this would be significant for introducing a self-reinforcing paradigm that unifies discrete semantic search with continuous parameter updates in LLM alignment, exposing limits of standard exploration in sparse-reward settings.

major comments (2)

Abstract: The abstract reports empirical gains and out-of-distribution benefits but supplies no experimental details, baseline definitions, statistical tests, or ablation results, leaving the central performance claims unsupported by visible evidence.
P²O algorithm description: The load-bearing assumption that the GEPA algorithm can reliably discover successful reasoning prompts for intractable instances (where all GRPO rollouts fail) and that context distillation internalizes these gains without loss of effectiveness is stated but not verified with success-rate statistics or ablations isolating the distillation step.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment point by point below and will revise the paper accordingly to strengthen the presentation of our results and methods.

Referee: Abstract: The abstract reports empirical gains and out-of-distribution benefits but supplies no experimental details, baseline definitions, statistical tests, or ablation results, leaving the central performance claims unsupported by visible evidence.

Authors: We agree that the abstract would benefit from greater specificity to better support the central claims. In the revised version, we will expand the abstract to briefly define the primary baselines (GRPO and the doubled-rollout variant), note the use of statistical significance testing for the reported gains (including the 9.5% improvement), and reference the key ablation findings on advantage restoration and OOD generalization. Full experimental protocols, dataset details, and complete ablation tables will continue to reside in Sections 3 and 4, consistent with typical abstract length constraints.
revision: yes

Referee: P²O algorithm description: The load-bearing assumption that the GEPA algorithm can reliably discover successful reasoning prompts for intractable instances (where all GRPO rollouts fail) and that context distillation internalizes these gains without loss of effectiveness is stated but not verified with success-rate statistics or ablations isolating the distillation step.

Authors: The manuscript reports overall performance improvements that depend on these mechanisms, with supporting evidence appearing in the empirical results. To directly address the request for explicit verification, we will add success-rate statistics quantifying how often GEPA recovers valid reasoning trajectories on samples where all GRPO rollouts fail, and we will include a new ablation that isolates the context-distillation step to measure any performance change after internalization. These additions will be placed in Section 4.2 and the experimental appendix.
revision: yes

Circularity Check

0 steps flagged

No circularity: empirical method with no derivations or self-referential predictions

The paper presents P²O as an empirical algorithm that alternates GRPO-style policy updates with discrete prompt evolution via GEPA followed by context distillation. No equations, first-principles derivations, or quantitative predictions are claimed anywhere in the provided text. All performance claims (outperformance over GRPO, doubled-rollout baselines, 9.5% gains, OOD generalization) rest on experimental comparisons rather than any fitted parameter renamed as a prediction or any result forced by self-citation. The abstract's reference to a 'self-reinforcing paradigm' is rhetorical, not a mathematical reduction. Because the work contains no derivation chain at all, no step can be shown to equal its inputs by construction. This is the normal case for an empirical RL+LLM paper and receives the default non-finding.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The abstract relies on the domain assumption that advantage collapse occurs on hard samples and that prompt evolution plus distillation can mitigate it. No free parameters are quantified, and the only invented entity is the P²O method itself.

axioms (1)

domain assumption: RLVR suffers from advantage collapse on hard samples where all rollouts fail.
Stated directly in the opening of the abstract as the core problem motivating the work.

invented entities (1)

P²O algorithm · no independent evidence
purpose: joint policy and prompt optimization to restore learning signals
Newly introduced method combining continuous policy updates with discrete prompt evolution.

pith-pipeline@v0.9.0 · 5516 in / 1365 out tokens · 26963 ms · 2026-05-15T00:45:41.545068+00:00 · methodology
