Co-Evolution of Policy and Internal Reward for Language Agents

Bang Liu; Chenglin Wu; Fanqi Kong; Hanwei Wu; Jiayi Zhang; Jingwei Song; Shuyuan Zhang; Tung Sum Thomas Kwok; Xiao-Wen Chang; Xinyu Wang

arxiv: 2604.03098 · v1 · submitted 2026-04-03 · 💻 cs.LG · cs.AI· cs.CL

Co-Evolution of Policy and Internal Reward for Language Agents

Xinyu Wang , Hanwei Wu , Jingwei Song , Shuyuan Zhang , Jiayi Zhang , Fanqi Kong , Tung Sum Thomas Kwok , Xiao-Wen Chang

show 3 more authors

Yuyu Luo Chenglin Wu Bang Liu

This is my paper

Pith reviewed 2026-05-13 19:48 UTC · model grok-4.3

classification 💻 cs.LG cs.AIcs.CL

keywords language agentsinternal rewardself-guidanceco-evolutionpolicy optimizationGRPOsparse rewardsLLM agents

0 comments

The pith

Language agents co-evolve their policy with a self-generated internal reward to handle sparse signals.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Self-Guide, a signal the agent generates itself to steer its next action during inference and then converts into step-level internal rewards for denser training supervision. This creates a loop where an improved policy produces better guidance signals and those signals in turn refine the policy further. On three agent benchmarks, inference-time use of the self-guidance already improves results, and training the policy jointly with the internal rewards via GRPO adds roughly 8 percent gains over baselines that use only the environment reward. The work shows agents can advance not just by gathering more experience but by learning to create and refine their own internal feedback during both acting and learning.

Core claim

Language agents can generate a short self-guidance signal that serves simultaneously as inference-time steering for the next action and as step-level internal reward for policy optimization; this produces a co-evolving loop in which better policies yield better guidance and better guidance further improves the policy, delivering measurable gains on agent benchmarks.

What carries the argument

Self-Guide, the self-generated internal reward signal that steers actions at inference time and supplies step-level supervision for training.

If this is right

Inference-time self-guidance improves agent performance across the tested benchmarks.
Joint evolution of policy and internal reward with GRPO adds about 8 percent improvement over baselines that use only environment rewards.
The policy and the internal reward generator improve each other over time through repeated interaction.
Agents can reduce dependence on external reward models by creating and using their own step-level signals.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the internal signals remain accurate across tasks, agents could train effectively in environments that provide only very delayed or absent rewards.
The co-evolution mechanism might allow agents to develop task-specific reward structures without human redesign.
The same self-guidance loop could be tested in non-language settings such as robotic control where internal signals could substitute for sparse external feedback.

Load-bearing premise

The self-generated guidance signal can be reliably converted into accurate step-level internal rewards that avoid systematic bias or reward hacking.

What would settle it

A direct comparison on the same three benchmarks in which training with the internal rewards produces no gain or worse performance than training with environment rewards alone, or in which inference-time guidance leads to systematically worse action choices.

Figures

Figures reproduced from arXiv: 2604.03098 by Bang Liu, Chenglin Wu, Fanqi Kong, Hanwei Wu, Jiayi Zhang, Jingwei Song, Shuyuan Zhang, Tung Sum Thomas Kwok, Xiao-Wen Chang, Xinyu Wang, Yuyu Luo.

**Figure 2.** Figure 2: Comparison between baseline GRPO and GRPO with Self-Guide. Baseline GRPO optimizes a policy using sparse trajectory-level environment rewards. Our method augments each step with a verbal self-guidance signal zt : the model first generates zt to assess the current trajectory, and then produces action at conditioned on zt . The same selfguidance signals are mapped to step-level internal rewards, aggregated … view at source ↗

**Figure 3.** Figure 3: Self-guidance without training already improves performance in structured [PITH_FULL_IMAGE:figures/full_fig_p008_3.png] view at source ↗

**Figure 4.** Figure 4: Training curves on the three environments with Qwen3-1.7B as base model. [PITH_FULL_IMAGE:figures/full_fig_p009_4.png] view at source ↗

**Figure 5.** Figure 5: Ablation on stage-wise guidance-reward scheduling. [PITH_FULL_IMAGE:figures/full_fig_p009_5.png] view at source ↗

**Figure 6.** Figure 6: Offline self-guidance distillation versus online co-evolution. Although the student improves under offline distillation (a), the distilled model does not transfer reliably to downstream RL: using it as guidance reward (b) or for inference-time guidance only (c) yields unstable or unsustained gains, supporting the need for online co-evolution. method creates a self-reinforcing loop that requires no external… view at source ↗

**Figure 7.** Figure 7: Self-guidance prompt template, applied identically across ALFWorld, SciWorld, [PITH_FULL_IMAGE:figures/full_fig_p014_7.png] view at source ↗

**Figure 8.** Figure 8: Actor prompt template for ALFWorld. Admissible actions and the self-guidance [PITH_FULL_IMAGE:figures/full_fig_p015_8.png] view at source ↗

**Figure 9.** Figure 9: Actor prompt template for SciWorld. Because the number of admissible actions [PITH_FULL_IMAGE:figures/full_fig_p016_9.png] view at source ↗

**Figure 10.** Figure 10: Actor prompt template for WebShop, adapted from GiGPO ( [PITH_FULL_IMAGE:figures/full_fig_p017_10.png] view at source ↗

**Figure 11.** Figure 11: Error distribution among failed episodes across three benchmarks and three base [PITH_FULL_IMAGE:figures/full_fig_p018_11.png] view at source ↗

**Figure 12.** Figure 12: Training curves of DAPO and DAPO w/ SG & GR. Our method achieves a final [PITH_FULL_IMAGE:figures/full_fig_p018_12.png] view at source ↗

**Figure 13.** Figure 13: Prompt template used for offline self-guidance model distillation. [PITH_FULL_IMAGE:figures/full_fig_p019_13.png] view at source ↗

read the original abstract

Large language model (LLM) agents learn by interacting with environments, but long-horizon training remains fundamentally bottlenecked by sparse and delayed rewards. Existing methods typically address this challenge through post-hoc credit assignment or external reward models, which provide limited guidance at inference time and often separate reward improvement from policy improvement. We propose Self-Guide, a self-generated internal reward for language agents that supports both inference-time guidance and training-time supervision. Specifically, the agent uses Self-Guide as a short self-guidance signal to steer the next action during inference, and converts the same signal into step-level internal reward for denser policy optimization during training. This creates a co-evolving loop: better policy produces better guidance, and better guidance further improves policy as internal reward. Across three agent benchmarks, inference-time self-guidance already yields clear gains, while jointly evolving policy and internal reward with GRPO brings further improvements (8\%) over baselines trained solely with environment reward. Overall, our results suggest that language agents can improve not only by collecting more experience, but also by learning to generate and refine their own internal reward during acting and learning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Self-Guide creates a co-evolution loop where agents turn their own inference-time guidance signals into step-level training rewards, yielding an 8% gain over environment-reward baselines on three benchmarks, but the conversion step lacks reported validation.

read the letter

The core idea is straightforward: the agent produces a short self-guidance signal to steer its next action at inference time, then re-uses that same signal as a denser internal reward for GRPO training. This sets up a loop where improvements in policy feed back into better guidance and vice versa. The abstract shows inference-only guidance already helps, and adding the joint evolution pushes performance another 8% above baselines that rely only on sparse environment rewards across three agent benchmarks. That dual-use mechanism is the main novelty relative to standard credit-assignment or external reward-model approaches. It directly targets the long-horizon sparse-reward bottleneck without requiring a separate learned critic. The reported gains are concrete enough to notice, and the setup stays grounded in environment interactions rather than pure fitting. The soft spot is the missing detail on how the guidance signal is turned into step-level rewards and whether that conversion stays aligned with true outcomes. No correlation numbers, no ablations isolating the internal-reward term, and no checks against reward hacking appear in the abstract, so the 8% lift could partly reflect optimistic bias rather than genuine policy improvement. If the full paper supplies those controls and shows the internal rewards track environment signals reasonably, the result strengthens; otherwise the central claim stays under-supported. This is worth sending to peer review. Anyone working on LLM agents or internal reward design would get usable ideas and benchmark numbers from it, even if they want tighter validation on the reward quality. A referee could push for exactly those missing checks without discarding the approach.

Referee Report

3 major / 2 minor

Summary. The manuscript proposes Self-Guide, a self-generated internal reward for LLM agents. The same short self-guidance signal is used at inference time to steer actions and converted into step-level internal rewards for denser supervision during GRPO-based policy optimization. This produces a co-evolution loop in which improved policies generate better guidance and the guidance in turn improves the policy. Experiments on three agent benchmarks show gains from inference-time self-guidance alone and an additional 8% improvement when the internal reward is jointly evolved versus baselines trained only on environment reward.

Significance. If the conversion from self-guidance to internal reward proves robust and free of exploitable bias, the approach offers a scalable route to densify rewards in long-horizon language-agent tasks without external reward models. The co-evolution framing is conceptually attractive and could reduce reliance on post-hoc credit assignment.

major comments (3)

[§4.1] §4.1 (experimental setup): no details are given on how the self-guidance signal is converted into numeric step-level rewards, nor on any correlation or validation against the sparse environment signal; without this, the 8% gain cannot be distinguished from reward hacking.
[§4.3] §4.3 (ablation table): the reported improvement is not isolated from the inference-time guidance component; an ablation that trains with environment reward plus guidance but without the internal-reward conversion is missing, leaving the central co-evolution claim unsupported.
[§5.2] §5.2 (GRPO objective): the internal reward is added to the environment reward without stated weighting, clipping, or normalization; this risks the internal signal dominating and producing optimistic self-reinforcement rather than genuine policy improvement.

minor comments (2)

[Introduction] The acronym GRPO is used without expansion on first appearance in the introduction.
[Figure 1] Figure 1 caption does not clarify whether the depicted loop is the training-time or inference-time flow.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback on Self-Guide. We address each major comment below with clarifications from the manuscript and planned revisions to strengthen the presentation of the conversion process, ablations, and objective details.

read point-by-point responses

Referee: [§4.1] §4.1 (experimental setup): no details are given on how the self-guidance signal is converted into numeric step-level rewards, nor on any correlation or validation against the sparse environment signal; without this, the 8% gain cannot be distinguished from reward hacking.

Authors: The self-guidance signal is a short textual assessment generated by the agent (e.g., a 1-2 sentence evaluation of the current state and proposed action). This is converted to a numeric step-level reward by prompting the same LLM to output a scalar score in [0,1] that reflects alignment with task progress; the score is then used directly as the internal reward. We will expand §4.1 with the exact prompt template, pseudocode for the conversion, and a new table reporting Pearson correlation (typically >0.7 on successful trajectories) between internal and environment rewards to address potential hacking concerns. revision: yes
Referee: [§4.3] §4.3 (ablation table): the reported improvement is not isolated from the inference-time guidance component; an ablation that trains with environment reward plus guidance but without the internal-reward conversion is missing, leaving the central co-evolution claim unsupported.

Authors: The current ablations isolate inference-time guidance from full training with internal rewards, but we agree an explicit row is needed for training under environment reward plus guidance signal without the numeric conversion step. We will add this ablation to Table 3 in the revision, running the GRPO baseline with the textual guidance appended to the prompt but no internal reward term; preliminary runs indicate the full co-evolution still yields the reported 8% gain. revision: yes
Referee: [§5.2] §5.2 (GRPO objective): the internal reward is added to the environment reward without stated weighting, clipping, or normalization; this risks the internal signal dominating and producing optimistic self-reinforcement rather than genuine policy improvement.

Authors: The GRPO objective adds the internal reward after min-max normalization to [0,1] to match the environment reward scale, with equal weighting (λ=1) and no clipping beyond the standard PPO-style ratio clipping already present in GRPO. We will insert the precise combined reward formula r_total = r_env + r_internal (post-normalization) into §5.2 and add a short discussion of why optimistic reinforcement is limited by the requirement that guidance must be generated from the current policy and validated against environment outcomes in the co-evolution loop. revision: partial

Circularity Check

0 steps flagged

No significant circularity; method grounded in environment interactions with empirical claims

full rationale

The paper describes Self-Guide as a self-generated signal converted into internal rewards for both inference steering and GRPO-based training, forming a co-evolution loop. However, no equations, derivations, or self-citations are presented that reduce the claimed 8% gains or the internal-reward conversion to a fitted parameter or input by construction. The improvements are positioned as outcomes of interaction with environment rewards and benchmarks, not as tautological redefinitions. The central premise remains empirically testable against external baselines without load-bearing self-referential definitions or ansatzes smuggled via prior work. This is the common case of a self-contained empirical proposal.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Abstract-only review limits visibility into exact parameters; the method implicitly assumes the internal signal can be produced without external supervision and converted to valid step rewards.

axioms (1)

domain assumption Self-generated guidance signals can be converted into accurate step-level rewards without introducing bias
Central to turning inference guidance into training supervision

pith-pipeline@v0.9.0 · 5532 in / 1134 out tokens · 33714 ms · 2026-05-13T19:48:47.803807+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

4 extracted references · 4 canonical work pages

[1]

Jansen and Marc

Association for Computational Linguistics. doi: 10.18653/v1/2022.emnlp-main.775. URLhttps://aclanthology.org/2022.emnlp-main.775/. Xiaoxuan Wang, Han Zhang, Haixin Wang, Yidan Shi, Ruoyan Li, Kaiqiao Han, Chenyi Tong, Haoran Deng, Renliang Sun, Alexander Taylor, Yanqiao Zhu, Jason Cong, Yizhou Sun, and Wei Wang. Arlarena: A unified framework for stable ag...

work page doi:10.18653/v1/2022.emnlp-main.775 2022
[2]

Decide whether the student’sLABEL is correct in hindsight (use the full trajectory; events after steptare allowed)

work page
[3]

Produce a scalar reward in[−1, 1]: • exact label match→about+0.8 • adjacent mismatch (positive↔neutral or neutral↔negative) → about +0.2 • opposite mismatch (positive↔negative)→about−0.8 • invalid format/label→ −1.0 • if the Reason contradicts facts, subtract about 0.2; if it clearly supports, add about 0.2; generic→0 • clip to[−1, 1]

work page
[4]

Returnonlythe reward number (float)

Donotreward verbosity; if the case is ambiguous, prefer neutral and keep reward near 0. Returnonlythe reward number (float). No JSON. No explanations. FULL trajectory (steps1 . . .T): {full traj} We are judging stept={t}. Student output (two lines): {student two lines} Now output the reward number only. Figure 13: Prompt template used for offline self-gui...

work page

[1] [1]

Jansen and Marc

Association for Computational Linguistics. doi: 10.18653/v1/2022.emnlp-main.775. URLhttps://aclanthology.org/2022.emnlp-main.775/. Xiaoxuan Wang, Han Zhang, Haixin Wang, Yidan Shi, Ruoyan Li, Kaiqiao Han, Chenyi Tong, Haoran Deng, Renliang Sun, Alexander Taylor, Yanqiao Zhu, Jason Cong, Yizhou Sun, and Wei Wang. Arlarena: A unified framework for stable ag...

work page doi:10.18653/v1/2022.emnlp-main.775 2022

[2] [2]

Decide whether the student’sLABEL is correct in hindsight (use the full trajectory; events after steptare allowed)

work page

[3] [3]

Produce a scalar reward in[−1, 1]: • exact label match→about+0.8 • adjacent mismatch (positive↔neutral or neutral↔negative) → about +0.2 • opposite mismatch (positive↔negative)→about−0.8 • invalid format/label→ −1.0 • if the Reason contradicts facts, subtract about 0.2; if it clearly supports, add about 0.2; generic→0 • clip to[−1, 1]

work page

[4] [4]

Returnonlythe reward number (float)

Donotreward verbosity; if the case is ambiguous, prefer neutral and keep reward near 0. Returnonlythe reward number (float). No JSON. No explanations. FULL trajectory (steps1 . . .T): {full traj} We are judging stept={t}. Student output (two lines): {student two lines} Now output the reward number only. Figure 13: Prompt template used for offline self-gui...

work page