Think Outside the Policy: In-Context Steered Policy Optimization
Pith reviewed 2026-05-18 02:25 UTC · model grok-4.3
The pith
In-Context Steered Policy Optimization lets large reasoning models guide their own RLVR training using in-context examples from existing datasets.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
ICPO expands the policy coverage by mixing the current policy with implicit expert forcing via in-context learning, filters unreliable trajectories with expert region reject sampling, and balances guidance with annealed expert-bonus reward shaping, leading to enhanced RLVR for LRMs.
What carries the argument
Mixed-policy GRPO with implicit expert forcing that leverages in-context learning to provide expert guidance without stronger models.
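To make the mixed-policy idea concrete, here is a minimal sketch of GRPO-style group-relative advantages computed over one group that pools on-policy rollouts with in-context-conditioned ("expert") rollouts. The standardization of rewards within a group is the standard GRPO advantage; the function names, the pooling of both rollout types into one group, and the use of the population standard deviation are assumptions made for illustration, since the abstract gives no equations.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantage: standardize each reward within its rollout group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; implementations vary
    return [(r - mu) / (sigma + eps) for r in rewards]


def mixed_group_advantages(on_policy_rewards, expert_rewards):
    """Hypothetical helper: pool both rollout types into one group.

    Returns (on_policy_advantages, expert_advantages) so the caller can
    treat the two kinds of trajectories differently downstream.
    """
    rewards = list(on_policy_rewards) + list(expert_rewards)
    advantages = group_relative_advantages(rewards)
    n = len(on_policy_rewards)
    return advantages[:n], advantages[n:]


if __name__ == "__main__":
    on_adv, ex_adv = mixed_group_advantages([0.0, 0.0, 1.0], [1.0])
    print(on_adv, ex_adv)
```

If every rollout in the group receives the same reward, all advantages come out as zero, which is one reason a group with more diverse outcomes can carry more learning signal.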
Load-bearing premise
In-context learning in current LRMs can reliably supply effective expert guidance from existing datasets.
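As an illustration of what supplying guidance through in-context learning could look like, the sketch below assembles a few-shot prompt from solved problems in an existing dataset. The prompt template, the function name build_few_shot_prompt, and the (problem, solution) data layout are all hypothetical; the abstract does not specify the prompting format or how examples are selected.

```python
def build_few_shot_prompt(demos, question):
    """Build a few-shot prompt from solved examples.

    demos: list of (problem, solution) string pairs drawn from an existing
    dataset (hypothetical layout). The final block leaves the solution open
    so the model continues from there.
    """
    parts = []
    for i, (problem, solution) in enumerate(demos, start=1):
        parts.append(f"Example {i}\nProblem: {problem}\nSolution: {solution}")
    parts.append(f"Problem: {question}\nSolution:")
    return "\n\n".join(parts)


if __name__ == "__main__":
    demos = [("What is 1 + 1?", "1 + 1 = 2. The answer is 2.")]
    print(build_few_shot_prompt(demos, "What is 2 + 2?"))
```

A rollout sampled from the same model under this prompt is conditioned on the demonstrations, so it may differ from what the model produces given the bare question; that difference is what the premise relies on.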
What would settle it
Running ICPO on a mathematical reasoning task where the existing dataset provides poor in-context examples and observing no improvement or decreased stability compared to standard GRPO.
Original abstract
Existing Reinforcement Learning from Verifiable Rewards (RLVR) methods, such as Group Relative Policy Optimization (GRPO), have achieved remarkable progress in improving the reasoning capabilities of Large Reasoning Models (LRMs). However, they exhibit limited exploration due to reliance on on-policy rollouts which are confined to the current policy's distribution, resulting in narrow trajectory diversity. Recent approaches attempt to expand policy coverage by incorporating trajectories generated from stronger expert models, yet this reliance increases computational cost and such advanced models are often inaccessible. To address these issues, we propose In-Context Steered Policy Optimization (ICPO), a unified framework that leverages the inherent in-context learning capability of LRMs to provide expert guidance using existing datasets. ICPO introduces mixed-policy GRPO with implicit expert forcing, which expands exploration beyond the current policy distribution without requiring advanced LRM trajectories. To further stabilize optimization, ICPO integrates expert region reject sampling to filter unreliable off-policy trajectories and annealed expert-bonus reward shaping to balance early expert guidance with later autonomous improvement. Results demonstrate that ICPO consistently enhances RLVR performance and training stability on mathematical reasoning benchmarks, revealing a scalable and effective RLVR paradigm for LRMs. Our code is available at https://github.com/Celine-hxy/ICPO.
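The two stabilizing components named in the abstract can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the rejection rule (drop an expert-conditioned rollout whose verifier reward is not positive), the linear decay schedule, the default bonus of 0.5, and all names are hypothetical, because the abstract defines neither the "expert region" nor the annealing schedule.

```python
def annealed_bonus(step, total_steps, initial_bonus=0.5):
    """Assumed schedule: decay the bonus linearly from initial_bonus to 0."""
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    frac = min(max(step / total_steps, 0.0), 1.0)
    return initial_bonus * (1.0 - frac)


def filter_and_shape(rollouts, step, total_steps, initial_bonus=0.5):
    """Filter expert rollouts, then add the annealed bonus to survivors.

    rollouts: list of dicts with 'reward' (float verifier score) and
    'is_expert' (bool, True for in-context-conditioned rollouts).
    """
    bonus = annealed_bonus(step, total_steps, initial_bonus)
    shaped = []
    for rollout in rollouts:
        # Assumed rejection rule: an expert rollout that fails verification
        # is treated as unreliable and dropped.
        if rollout["is_expert"] and rollout["reward"] <= 0.0:
            continue
        extra = bonus if rollout["is_expert"] else 0.0
        shaped.append({**rollout, "reward": rollout["reward"] + extra})
    return shaped


if __name__ == "__main__":
    batch = [
        {"reward": 1.0, "is_expert": True},
        {"reward": 0.0, "is_expert": True},
        {"reward": 0.0, "is_expert": False},
    ]
    print(filter_and_shape(batch, step=0, total_steps=100))
```

Under this schedule the bonus is largest at the start of training and reaches zero at the final step, matching the stated intent of early expert guidance followed by autonomous improvement.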
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes In-Context Steered Policy Optimization (ICPO), a framework for improving Reinforcement Learning from Verifiable Rewards (RLVR) in Large Reasoning Models (LRMs). It builds on Group Relative Policy Optimization (GRPO) by introducing mixed-policy GRPO with implicit expert forcing to expand exploration using the base model's in-context learning on existing datasets, without trajectories from stronger expert models. Additional components include expert region reject sampling to filter unreliable off-policy trajectories and annealed expert-bonus reward shaping to balance guidance and autonomous improvement. The abstract claims that ICPO yields consistent gains in performance and training stability on mathematical reasoning benchmarks.
Significance. If the empirical results and mechanisms hold under detailed scrutiny, the work could offer a meaningful advance by providing a more accessible and scalable RLVR approach that avoids reliance on advanced external models, potentially broadening the applicability of policy optimization for reasoning tasks in LRMs.
Major comments (2)
- [Abstract] The central claim that 'ICPO consistently enhances RLVR performance and training stability' is presented without any quantitative results, tables, ablation studies, or experimental details, rendering it impossible to evaluate whether the reported improvements stem from the proposed mechanisms or other factors.
- [Abstract] The description of 'mixed-policy GRPO with implicit expert forcing' and 'expert region reject sampling' provides no specifics on example selection from existing datasets, prompting formats for implicit forcing, or the precise definition and implementation of the 'expert region,' which are load-bearing for the claim that in-context learning alone can reliably substitute for stronger expert trajectories.
Simulated Author's Rebuttal
We thank the referee for their constructive comments on our manuscript. We address the major comments point by point below, clarifying the role of the abstract and the availability of details in the full paper.
- Referee: [Abstract] The central claim that 'ICPO consistently enhances RLVR performance and training stability' is presented without any quantitative results, tables, ablation studies, or experimental details, rendering it impossible to evaluate whether the reported improvements stem from the proposed mechanisms or other factors.
  Authors: We acknowledge that the abstract presents the central claim at a high level without quantitative results or tables. This is standard practice to ensure the abstract remains concise and accessible. The full manuscript provides the supporting experimental details, including performance tables on mathematical reasoning benchmarks, ablation studies isolating the contributions of mixed-policy GRPO, reject sampling, and reward shaping, as well as metrics demonstrating improved training stability. These results indicate that the gains arise from the proposed mechanisms rather than extraneous factors.
  Revision: no
- Referee: [Abstract] The description of 'mixed-policy GRPO with implicit expert forcing' and 'expert region reject sampling' provides no specifics on example selection from existing datasets, prompting formats for implicit forcing, or the precise definition and implementation of the 'expert region,' which are load-bearing for the claim that in-context learning alone can reliably substitute for stronger expert trajectories.
  Authors: The abstract introduces the core ideas at a summary level without implementation specifics to preserve brevity. The full manuscript details the example selection process from existing datasets, the prompting formats that leverage the base model's in-context learning for implicit expert forcing, and the precise definition of the expert region together with its use in reject sampling to filter unreliable trajectories. These elements are elaborated in the methodology section and support the claim that in-context guidance from existing data can effectively substitute for trajectories from stronger models.
  Revision: no
Circularity Check
No circularity in the abstract; the method is described as an extension of prior RLVR methods, without equations or self-referential reductions.
The provided abstract introduces ICPO as a framework that builds on existing RLVR methods such as GRPO by adding mixed-policy GRPO with implicit expert forcing, expert region reject sampling, and annealed expert-bonus reward shaping. These are presented as new mechanisms leveraging the base LRM's in-context learning on existing datasets, with claimed performance gains on benchmarks. No equations, derivations, fitted parameters, or self-citations appear in the text that would reduce the central claims to inputs by construction. The derivation chain, to the extent visible in the abstract, remains self-contained and independent of the reported results.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean: washburn_uniqueness_aczel
  Relation between the paper passage and the cited Recognition theorem: unclear
  Paper passage: ICPO introduces Mixed-Policy GRPO with Implicit Expert Forcing (IEF), where expert-conditioned rollouts are generated through few-shot ICL guidance... Expert Region Reject Sampling (ERRS)... annealed expert bonus into the Reward Shaping (RS) design
- IndisputableMonolith/Foundation/RealityFromDistinction.lean: reality_from_one_distinction
  Relation between the paper passage and the cited Recognition theorem: unclear
  Paper passage: Results demonstrate that ICPO consistently enhances RLVR performance and training stability on mathematical reasoning benchmarks
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.