Token-Level Policy Optimization: Linking Group-Level Rewards to Token-Level Aggregation via Sequence-Level Likelihood

Chenfu Bao; Du Su; En Wang; Jinchang Hou; Wenbin Liu; Xingyu Lin; Yilin Wen; Zhonghou Lv

arxiv: 2604.12736 · v1 · submitted 2026-04-14 · 💻 cs.CL

Xingyu Lin, Yilin Wen, Du Su, Jinchang Hou, En Wang, Wenbin Liu, Chenfu Bao, Zhonghou Lv

Pith reviewed 2026-05-10 15:18 UTC · model grok-4.3

classification 💻 cs.CL

keywords token-level policy optimization · group relative policy optimization · mathematical reasoning · chain-of-thought · entropy collapse · KL divergence mask · sparse rewards · LLM training stability

The pith

TEPO links group rewards to individual tokens via sequence likelihood and a selective KL mask to stabilize sparse-reward training in chain-of-thought reasoning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that standard group-level methods like GRPO fail on token-level sparse rewards because uniform entropy regularization collapses model behavior during long reasoning chains. TEPO solves this by first aggregating rewards to tokens through sequence-level likelihood and then applying a targeted KL constraint only to tokens that carry positive advantage and are losing entropy. Experiments confirm this produces state-of-the-art scores on math benchmarks while cutting convergence time by half. A reader should care because reliable training of reasoning models removes a major practical barrier to scaling advanced capabilities.

Core claim

TEPO uses sequence-level likelihood to distribute group-level rewards across tokens via token-level aggregation and adds a KL-Divergence mask that restricts updates to tokens with positive advantages and decreasing entropy, thereby preventing abrupt policy shifts and entropy collapse under sparse token rewards.

What carries the argument

Sequence-level likelihood aggregation that distributes group rewards to tokens, paired with a token-level KL-Divergence mask applied selectively to positive-advantage, decreasing-entropy tokens.
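
The selective mask can be illustrated with a minimal sketch. It assumes per-token advantages, per-token entropies before and after an update, and per-token KL estimates are already available. The function names and the exact rule (advantage above zero and strictly decreasing entropy) are assumptions made for illustration; the text above does not give the paper's precise formula.

```python
from typing import List


def selective_kl_mask(advantages: List[float],
                      entropy_old: List[float],
                      entropy_new: List[float]) -> List[int]:
    """Return 1 for tokens with positive advantage AND decreasing entropy, else 0.

    Hypothetical rule for illustration; not the paper's exact definition.
    """
    return [
        1 if (a > 0.0 and h_new < h_old) else 0
        for a, h_old, h_new in zip(advantages, entropy_old, entropy_new)
    ]


def masked_kl_penalty(kl_per_token: List[float], mask: List[int]) -> float:
    """Average the per-token KL over selected tokens only (0.0 if none)."""
    selected = [kl for kl, m in zip(kl_per_token, mask) if m == 1]
    return sum(selected) / len(selected) if selected else 0.0


# Toy values for four tokens of one completion.
advantages = [0.8, 0.8, -0.3, 0.8]
entropy_old = [1.2, 0.5, 0.9, 0.7]
entropy_new = [0.9, 0.6, 0.4, 0.2]
kl_per_token = [0.10, 0.02, 0.30, 0.20]

mask = selective_kl_mask(advantages, entropy_old, entropy_new)
penalty = masked_kl_penalty(kl_per_token, mask)
```

In this toy case only the first and last tokens are constrained: the second token gains entropy and the third has a negative advantage, so neither contributes to the penalty.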

If this is right

Mathematical reasoning benchmarks reach state-of-the-art performance.
Convergence time drops by 50 percent compared with GRPO and DAPO.
Training stability improves because entropy collapse is avoided on critical tokens.
Abrupt policy updates are limited without requiring broad entropy regularization.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same aggregation-plus-mask pattern could be tested on other sparse token-reward tasks such as code generation or multi-step planning.
Lower reliance on global entropy terms may simplify hyperparameter search across reinforcement learning setups for language models.
Credit assignment accuracy could be measured directly on long reasoning traces to verify the likelihood linkage preserves signal over many steps.

Load-bearing premise

That sequence likelihood gives an accurate, unbiased distribution of group rewards down to tokens and that the selective KL mask will reliably block entropy collapse without creating new biases or needing extra tuning.

What would settle it

Apply TEPO to a held-out mathematical reasoning benchmark and check whether entropy still collapses or whether stability and accuracy gains disappear relative to GRPO.
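
One way to run such a check is to log mean token entropy during training and flag when it falls toward zero. The sketch below is a hedged illustration: `token_entropy` is the standard Shannon entropy of a softmax distribution, while `entropy_collapsed` and its `floor` threshold are hypothetical choices, not something the paper specifies.

```python
import math
from typing import List


def token_entropy(logits: List[float]) -> float:
    """Shannon entropy (in nats) of softmax(logits), computed stably."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def entropy_collapsed(mean_entropy_per_step: List[float],
                      floor: float = 0.05) -> bool:
    """Flag collapse when the latest mean entropy drops below `floor`.

    `floor` is a hypothetical threshold chosen for illustration.
    """
    return mean_entropy_per_step[-1] < floor


uniform = token_entropy([0.0, 0.0, 0.0, 0.0])   # maximal entropy over 4 options
peaked = token_entropy([20.0, 0.0, 0.0, 0.0])   # almost deterministic
history = [uniform, 0.6, 0.2, peaked]           # toy training curve
collapsed = entropy_collapsed(history)
```

Comparing such curves for TEPO and GRPO on the same held-out benchmark would show directly whether the mask delays or prevents the drop.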

Figures

Figures reproduced from arXiv: 2604.12736 by Chenfu Bao, Du Su, En Wang, Jinchang Hou, Wenbin Liu, Xingyu Lin, Yilin Wen, Zhonghou Lv.

**Figure 1.** Overview of the TEPO framework. TEPO (1) replaces baselines' noisy, sparse token-level credit assignment with sequence-level likelihood, using soft aggregation to broadcast group rewards to tokens and stabilize training. (2) A selective KL mask curbs abrupt updates exclusively for tokens with positive advantage and decreasing entropy, balancing entropy reduction and stability. …

**Figure 2.** Lower Gradient Bias and Faster Reasoning Efficiency with Markov Likelihood: The left panel …

**Figure 3.** Markov Likelihood enhanced performance (We transform the raw data into percentage-based …

Original abstract

Group Relative Policy Optimization (GRPO) has significantly advanced the reasoning ability of large language models (LLMs), particularly in their mathemat ical reasoning performance. However, GRPO and related entropy regularization methods still struggle with token-level sparse-rewards, which is an inherent chal lenge in chain-of-thought (CoT) reasoning. These approaches often rely on undifferen tiated token-level entropy regularization, which easily leads to entropy collapse or model degradation under sparse token rewards. In this work, we propose TEPO, a novel token-level framework that (1) leverages sequence-level likelihood to link group-level rewards with individual tokens via token-level aggregation, and (2) introduces a token-level KL-Divergence mask constraint that targets tokens with positive advantages and decreasing entropy to mitigate abrupt policy updates. Experiments demonstrate that TEPO not only achieves state-of-the-art performance on mathematical reasoning benchmarks but also markedly enhances training stability, reducing convergence time by 50% compared with GRPO/DAPO.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

TEPO's use of sequence likelihood for token aggregation keeps the reward signal uniform within each completion, so it probably does not fix the token-level sparsity the authors blame on GRPO.

The paper introduces TEPO to address entropy collapse in GRPO-style training for LLM reasoning. It links group rewards to tokens through sequence-level likelihood aggregation and adds a KL mask that only activates on tokens with positive advantage and falling entropy. The reported outcome is SOTA math-reasoning scores plus a 50% cut in convergence time versus GRPO and DAPO baselines. That stability number is the main practical claim worth checking.

The mask itself is the clearest new piece; it is a targeted constraint rather than blanket entropy regularization, and it makes sense as a way to avoid abrupt updates on already-stable tokens. The authors lay out the CoT sparsity problem clearly and show how their changes aim to keep policy updates controlled. The math stays within standard policy-gradient territory with the added mask, and the citations track the recent GRPO line of work without obvious gaps.

The soft spot sits in the aggregation step. Because P(y|x) is a single scalar per full sequence, every token inside that sequence receives the same scaled reward. This does not create differential credit assignment across tokens, which undercuts the claim that the method solves the token-level sparsity issue. The stability gain could therefore come mostly from the mask or from hyper-parameter choices on the math benchmarks rather than from the linking mechanism. Without seeing the full derivations and ablations it is hard to tell whether the gradients remain unbiased or whether the mask introduces its own mode-seeking bias. The experiments are presented as decisive, but the abstract gives no error bars, variance numbers, or detailed baseline comparisons, so the 50% figure needs verification.

This work is aimed at people already running GRPO or DAPO variants on reasoning tasks. A reader who wants concrete tweaks to training stability would find the mask idea useful to test. The paper is coherent enough on its own terms to deserve a serious referee who can examine the equations, the exact mask rule, and whether the gains hold on broader tasks.
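
The uniformity concern in the letter can be shown with a small sketch. It uses GRPO-style group normalization of rewards (population standard deviation here) and then, as a hypothetical reading of the aggregation step, scales each completion's advantage by its sequence likelihood and copies it to every token. The function names and the weighting rule are assumptions for illustration, not the paper's implementation.

```python
import math
from typing import List


def group_relative_advantages(rewards: List[float]) -> List[float]:
    """GRPO-style normalization: (r - mean) / std within one prompt's group."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var)
    return [(r - mean) / (std + 1e-8) for r in rewards]


def sequence_log_likelihood(token_logprobs: List[float]) -> float:
    """log P(y|x) as the sum of per-token log-probabilities."""
    return sum(token_logprobs)


def broadcast_to_tokens(advantage: float,
                        token_logprobs: List[float]) -> List[float]:
    """Hypothetical weighting: scale by P(y|x), then copy to every token."""
    weight = math.exp(sequence_log_likelihood(token_logprobs))
    return [advantage * weight for _ in token_logprobs]


rewards = [1.0, 0.0, 0.0, 1.0]  # binary correctness for a group of 4 completions
logprobs = [
    [-0.1, -0.2, -0.3],
    [-0.5, -0.5],
    [-1.0, -0.1, -0.1, -0.2],
    [-0.2, -0.2],
]
advantages = group_relative_advantages(rewards)
per_token = [broadcast_to_tokens(a, lp) for a, lp in zip(advantages, logprobs)]
# Each inner list holds one repeated value: no token is singled out.
```

Under this reading the signal differs between completions but not between tokens of the same completion, which is exactly the objection raised above.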

Referee Report

3 major / 2 minor

Summary. The paper introduces TEPO (Token-Level Policy Optimization) to improve upon Group Relative Policy Optimization (GRPO) for LLM mathematical reasoning. It proposes linking group-level rewards to tokens via aggregation weighted by sequence-level likelihood P(y|x) and a token-level KL-divergence mask applied selectively to tokens with positive advantages and decreasing entropy. The authors claim this resolves token-level sparse rewards in chain-of-thought reasoning, yielding state-of-the-art benchmark performance and a 50% reduction in convergence time versus GRPO/DAPO.

Significance. If the aggregation and selective mask deliver genuine token-level credit assignment and stable gradients without new biases, TEPO could meaningfully advance group-based RL methods for LLM reasoning by addressing entropy collapse under sparse rewards. The reported efficiency gains would have practical value for large-scale training, and the attempt to move beyond uniform entropy regularization is a constructive direction.

major comments (3)

[Method (reward aggregation)] Aggregation mechanism: weighting group rewards by the scalar sequence-level likelihood P(y|x) applies the same factor to every token in a completion, leaving the per-token signal effectively uniform rather than differentiated. This directly undercuts the central claim of solving token-level sparsity in CoT, as the construction reduces to a re-scaled sequence-level signal.
[KL-Divergence mask definition] Token-level KL mask: the selective mask on tokens with positive advantage and decreasing entropy lacks any derivation showing preservation of unbiased gradients or avoidance of mode-seeking. This is load-bearing for the stability and 50% convergence claims, which may be artifacts of the math-reasoning regime rather than a general property.
[Experiments] Experimental support: the abstract states SOTA results and 50% faster convergence, yet no baselines, datasets, run counts, error bars, or ablations isolating the aggregation versus mask components are referenced, leaving the performance claims without visible grounding.
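
The first major comment can be made concrete with the standard autoregressive factorization. The symbols below are assumptions introduced for illustration: $\hat{A}$ denotes the group-relative advantage of one completion and $w$ a likelihood-derived weight treated as a constant (no gradient flows through it). The paper's exact objective is not given in the text reviewed here.

```latex
% Sequence log-likelihood as a sum of per-token log-probabilities
\log P_\theta(y \mid x) = \sum_{t=1}^{T} \log \pi_\theta(y_t \mid x, y_{<t})

% Gradient of a sequence-weighted objective, with w and \hat{A} held constant
\nabla_\theta \left[ w \, \hat{A} \, \log P_\theta(y \mid x) \right]
  = \sum_{t=1}^{T} w \, \hat{A} \, \nabla_\theta \log \pi_\theta(y_t \mid x, y_{<t})
```

Every term in the sum carries the same coefficient $w\hat{A}$. The gradient is computed token by token, but no token receives a different credit from its neighbors within the same completion.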

minor comments (2)

[Abstract] Abstract contains typographical spacing artifacts (e.g., 'mathemat ical', 'chal lenge', 'undifferen tiated') that should be corrected.
[Notation] Notation for likelihood P(y|x) and advantage should be defined once and used consistently across equations and text.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback, which has helped us identify areas for clarification and improvement in the manuscript. We address each major comment point by point below.

Referee: [Method (reward aggregation)] Aggregation mechanism: weighting group rewards by the scalar sequence-level likelihood P(y|x) applies the same factor to every token in a completion, leaving the per-token signal effectively uniform rather than differentiated. This directly undercuts the central claim of solving token-level sparsity in CoT, as the construction reduces to a re-scaled sequence-level signal.

Authors: We appreciate the referee highlighting this aspect of the aggregation. The sequence-level likelihood P(y|x) serves as a weighting factor for the group reward assigned to the completion, which is then incorporated into the token-level policy gradient updates. This construction links the group-level signal to individual tokens via the per-token log-probabilities in the objective, rather than treating the entire sequence uniformly in the optimization. While the weight is constant within a sequence, the token-level aggregation arises because the gradient is computed and applied at the token granularity, enabling adjustments to specific tokens in the CoT chain based on the aggregated reward. We acknowledge that this does not introduce intra-sequence differentiation beyond the likelihood weighting itself. In the revised manuscript, we will expand the method section with a clearer mathematical formulation and an ablation comparing this approach to purely sequence-level baselines to better substantiate the claim. revision: partial
Referee: [KL-Divergence mask definition] Token-level KL mask: the selective mask on tokens with positive advantage and decreasing entropy lacks any derivation showing preservation of unbiased gradients or avoidance of mode-seeking. This is load-bearing for the stability and 50% convergence claims, which may be artifacts of the math-reasoning regime rather than a general property.

Authors: We agree that a formal derivation would strengthen the presentation of the KL mask. The mask is applied only to tokens exhibiting positive advantage and decreasing entropy to constrain policy shifts on high-reward tokens while preserving exploration on others, thereby mitigating entropy collapse under sparse rewards. In the revision, we will add a derivation in the appendix demonstrating that the selective masking preserves the unbiasedness of the group-relative policy gradient estimator (by showing equivalence to the unmasked case under the expectation over the group). We will also include analysis addressing potential mode-seeking behavior and empirical results from our training curves showing improved stability. We will qualify the convergence claims as observed in the mathematical reasoning setting and discuss broader applicability as a direction for future work. revision: yes
Referee: [Experiments] Experimental support: the abstract states SOTA results and 50% faster convergence, yet no baselines, datasets, run counts, error bars, or ablations isolating the aggregation versus mask components are referenced, leaving the performance claims without visible grounding.

Authors: The full manuscript details the experimental setup in Section 4, including baselines (GRPO, DAPO, and standard PPO variants), datasets (GSM8K, MATH, and others), multiple runs with error bars, and ablations isolating the aggregation and KL mask components. The abstract is intentionally concise and does not enumerate these elements. In the revised version, we will update the abstract to reference the key baselines, datasets, and performance metrics. We will also ensure the ablations are more prominently highlighted in the main text and figures to directly address the isolation of contributions. revision: yes

Circularity Check

0 steps flagged

No significant circularity; new mechanisms introduced without reduction to inputs by construction

full rationale

The paper introduces TEPO as a novel token-level framework that aggregates group-level rewards to tokens using sequence-level likelihood and applies a targeted token-level KL mask on positive-advantage tokens with decreasing entropy. No equations, derivations, or self-citations are exhibited in the abstract or summary that would reduce the claimed stability gains or performance improvements to fitted parameters or prior results by construction. The central proposal consists of explicit new linking and masking rules whose validity is asserted via experiments rather than tautological re-expression of inputs. This matches the default expectation for non-circular papers where the derivation chain remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract supplies no explicit free parameters, axioms, or invented entities; all technical details of the likelihood aggregation and KL mask are omitted, preventing enumeration.

Reference graph

Works this paper leans on

6 extracted references · 6 canonical work pages · 1 internal anchor

[1] One-shot entropy minimization. arXiv preprint arXiv:2505.20282.

[2] Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, and others. DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning. arXiv preprint arXiv:2501.12948.

Tuomas Haarnoja, Haoran Tang, Pieter Abbeel, and Sergey Levine. 2017. Reinforcement learning with deep energy-based policies. In International Conference on Machine Learning, pages 1352–1361. PMLR.

Tuomas Haarnoja, Aurick Zhou, Pieter A...
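[3] Using $\nabla_\theta \pi_\theta(a|s) = \pi_\theta(a|s)\,\nabla_\theta \log \pi_\theta(a|s)$, the first term in Eq. (12) vanishes. Thus:

$$\nabla_\theta H(\pi_\theta) = -\sum_a \pi_\theta(a|s)\,\nabla_\theta \log \pi_\theta(a|s) \tag{13}$$

Log-probability derivative for Softmax:

$$\nabla_\theta \log \pi_\theta(a|s) = \nabla_\theta \phi_\theta(a|s) - \mathbb{E}_{a' \sim \pi_\theta}\left[\nabla_\theta \phi_\theta(a'|s)\right] \tag{14}$$

Substitute into Eq. (13):

$$\nabla_\theta H(\pi_\theta) = -\mathbb{E}_{\pi_\theta}\left[\nabla_\theta \phi_\theta(a|s) - \mathbb{E}_{a' \sim \pi_\theta}\left[\nabla_\theta \phi_\theta(a'|s)\right]\right] \tag{15}$$

C.4 Step 3: Entropy Change with NPG Update

C.4.1 NPG Update Rule

NPG up...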
[4] $\mathrm{Cov}[\log \pi_\theta, A] > 0$: $\Delta H < 0$ (entropy ↓, policy more deterministic)

[5] $\mathrm{Cov}[\log \pi_\theta, A] < 0$: $\Delta H > 0$ (entropy ↑, policy more exploratory)

[6] $\mathrm{Cov}[\log \pi_\theta, A] = 0$: $\Delta H = 0$ (entropy unchanged). This derivation underpins entropy regularization for balanced E-E trade-off in RL.

Algorithm 1 Token-Level Policy Gradient Computation for TEPO
Require: $\pi_\theta$: Current policy network (LLM);
1: $\pi_{\theta_{\mathrm{old}}}$: Pre-update (reference) policy;
2: $\{(x_i, y_i)\}_{i=1}^{G}$: Batch of prompt-response pairs ($x_i$ = prompt, $y_i$ = LLM-generated re...
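
The covariance cases in entries [4] to [6] can be checked numerically on a toy example. The sketch assumes a tabular softmax policy over three actions and a logit update of the form z ← z + ηA, which is the shape a natural policy gradient step takes in that setting. The helper names, logits, and advantage vectors are made up for illustration.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs: List[float]) -> float:
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def cov_logp_adv(probs: List[float], adv: List[float]) -> float:
    """Covariance under pi between log pi(a) and A(a)."""
    e_lp = sum(p * math.log(p) for p in probs)
    e_a = sum(p * a for p, a in zip(probs, adv))
    e_lpa = sum(p * math.log(p) * a for p, a in zip(probs, adv))
    return e_lpa - e_lp * e_a


def entropy_change(logits: List[float], adv: List[float],
                   eta: float = 1e-3) -> float:
    """Entropy after the logit step z <- z + eta * A, minus entropy before."""
    new_logits = [z + eta * a for z, a in zip(logits, adv)]
    return entropy(softmax(new_logits)) - entropy(softmax(logits))


logits = [2.0, 0.5, -1.0]
probs = softmax(logits)
adv_likely = [1.0, 0.0, -1.0]    # rewards the already-likely action
adv_unlikely = [-1.0, 0.0, 1.0]  # rewards the unlikely action

cov_pos = cov_logp_adv(probs, adv_likely)      # positive covariance
cov_neg = cov_logp_adv(probs, adv_unlikely)    # negative covariance
dh_pos = entropy_change(logits, adv_likely)    # entropy falls
dh_neg = entropy_change(logits, adv_unlikely)  # entropy rises
```

For a small step size the entropy change is close to −η times the covariance, so rewarding actions that are already likely pushes entropy down, and rewarding unlikely actions pushes it up.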
