GTPO and GRPO-S: Token and sequence-level reward shaping with policy entropy. arXiv preprint arXiv:2508.04349

GTPO, GRPO-S: Token · 2025 · arXiv 2508.04349

9 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background 2

citation-polarity summary

background 1 · unclear 1

representative citing papers

Resolving Action Bottleneck: Agentic Reinforcement Learning Informed by Token-Level Energy

cs.LG · 2026-05-14 · conditional · novelty 6.0

ActFocus resolves the action bottleneck in agentic RL by reweighting token gradients toward action tokens using observed reward variance and an energy-based uncertainty term, outperforming PPO and GRPO by up to 65 percentage points.

Understanding and Preventing Entropy Collapse in RLVR with On-Policy Entropy Flow Optimization

cs.LG · 2026-05-12 · unverdicted · novelty 6.0

OPEFO prevents entropy collapse in RLVR by rescaling token updates according to their entropy change contributions, yielding more stable optimization and better results on math benchmarks.

Rubrics to Tokens: Bridging Response-level Rubrics and Token-level Rewards in Instruction Following Tasks

cs.CL · 2026-04-03 · unverdicted · novelty 6.0

RTT bridges response-level rubrics to token-level rewards via a relevance discriminator and intra-sample group normalization, yielding higher instruction and rubric accuracy than baselines.

Towards Generalizable Reasoning: Group Causal Counterfactual Policy Optimization for LLM Reasoning

cs.LG · 2026-02-06 · unverdicted · novelty 6.0

Group Causal Counterfactual Policy Optimization trains LLMs on generalizable reasoning by defining episodic rewards for counterfactual robustness and transferability, then optimizing the policy with token-level advantages.

EGM: Efficient Visual Grounding Language Models

cs.CV · 2026-01-20 · unverdicted · novelty 6.0

EGM enables 8B VLMs to reach 91.4 IoU on RefCOCO at 737 ms latency, outperforming a 235B model at 4320 ms, by substituting volume of mid-quality tokens for model scale.

Breaking Failure Cascades: Step-Aware Reinforcement Learning for Medical Multimodal Reasoning

cs.CV · 2026-06-30 · unverdicted · novelty 5.0

MRPO is a step-aware RL method that penalizes early reasoning errors exponentially more when the final answer is incorrect, reducing early-stage failures from 64% to 13% and outperforming baselines including larger models on medical VQA tasks.

Trust Region On-Policy Distillation

cs.LG · 2026-05-31 · unverdicted · novelty 5.0

TrOPD stabilizes on-policy distillation for LLMs with trust-region learning, outlier estimation, and off-policy guidance, outperforming prior OPD methods on reasoning and code benchmarks.

How You Begin is How You Reason: Driving Exploration in RLVR via Prefix-Tuned Priors

cs.AI · 2026-05-09 · unverdicted · novelty 5.0

IMAX trains soft prefixes with an InfoMax reward to drive diverse exploration in RLVR, yielding up to 11.60% gains in Pass@4 over standard RLVR across model scales.

Rethinking Entropy Interventions in RLVR: An Entropy Change Perspective

cs.LG · 2025-10-11 · unverdicted · novelty 5.0

Derives a token-level approximation of entropy change that reveals four factors, identifies limitations in prior entropy interventions, and proposes STEER, which adaptively reweights tokens to mitigate entropy collapse and improve performance on math and coding benchmarks.

citing papers explorer

Showing 8 of 8 citing papers after filters.

Resolving Action Bottleneck: Agentic Reinforcement Learning Informed by Token-Level Energy

cs.LG · 2026-05-14 · conditional · none · ref 26

Understanding and Preventing Entropy Collapse in RLVR with On-Policy Entropy Flow Optimization

cs.LG · 2026-05-12 · unverdicted · none · ref 36

Rubrics to Tokens: Bridging Response-level Rubrics and Token-level Rewards in Instruction Following Tasks

cs.CL · 2026-04-03 · unverdicted · none · ref 16

Towards Generalizable Reasoning: Group Causal Counterfactual Policy Optimization for LLM Reasoning

cs.LG · 2026-02-06 · unverdicted · none · ref 13

EGM: Efficient Visual Grounding Language Models

cs.CV · 2026-01-20 · unverdicted · none · ref 32

Breaking Failure Cascades: Step-Aware Reinforcement Learning for Medical Multimodal Reasoning

cs.CV · 2026-06-30 · unverdicted · none · ref 3

Trust Region On-Policy Distillation

cs.LG · 2026-05-31 · unverdicted · none · ref 134

How You Begin is How You Reason: Driving Exploration in RLVR via Prefix-Tuned Priors

cs.AI · 2026-05-09 · unverdicted · none · ref 34
