ActFocus resolves the action bottleneck in agentic RL by reweighting token gradients toward action tokens using observed reward variance and an energy-based uncertainty term, outperforming PPO and GRPO by up to 65 percentage points.
Gtpo and grpo-s: Token and sequence-level reward shaping with policy entropy.arXiv preprint arXiv:2508.04349
7 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
roles
background 2representative citing papers
OPEFO prevents entropy collapse in RLVR by rescaling token updates according to their entropy change contributions, yielding more stable optimization and better results on math benchmarks.
RTT bridges response-level rubrics to token-level rewards via a relevance discriminator and intra-sample group normalization, yielding higher instruction and rubric accuracy than baselines.
Group Causal Counterfactual Policy Optimization trains LLMs on generalizable reasoning by defining episodic rewards for counterfactual robustness and transferability then optimizing the policy with token-level advantages.
EGM enables 8B VLMs to reach 91.4 IoU on RefCOCO at 737 ms latency, outperforming a 235B model at 4320 ms, by substituting volume of mid-quality tokens for model scale.
IMAX trains soft prefixes with an InfoMax reward to drive diverse exploration in RLVR, yielding up to 11.60% gains in Pass@4 over standard RLVR across model scales.
Derives a token-level entropy change approximation revealing four factors, identifies limitations in prior entropy interventions, and proposes STEER which adaptively reweights tokens to mitigate collapse and improve performance on math and coding benchmarks.
citing papers explorer
-
Resolving Action Bottleneck: Agentic Reinforcement Learning Informed by Token-Level Energy
ActFocus resolves the action bottleneck in agentic RL by reweighting token gradients toward action tokens using observed reward variance and an energy-based uncertainty term, outperforming PPO and GRPO by up to 65 percentage points.
-
Understanding and Preventing Entropy Collapse in RLVR with On-Policy Entropy Flow Optimization
OPEFO prevents entropy collapse in RLVR by rescaling token updates according to their entropy change contributions, yielding more stable optimization and better results on math benchmarks.
-
Rubrics to Tokens: Bridging Response-level Rubrics and Token-level Rewards in Instruction Following Tasks
RTT bridges response-level rubrics to token-level rewards via a relevance discriminator and intra-sample group normalization, yielding higher instruction and rubric accuracy than baselines.
-
Towards Generalizable Reasoning: Group Causal Counterfactual Policy Optimization for LLM Reasoning
Group Causal Counterfactual Policy Optimization trains LLMs on generalizable reasoning by defining episodic rewards for counterfactual robustness and transferability then optimizing the policy with token-level advantages.
-
EGM: Efficient Visual Grounding Language Models
EGM enables 8B VLMs to reach 91.4 IoU on RefCOCO at 737 ms latency, outperforming a 235B model at 4320 ms, by substituting volume of mid-quality tokens for model scale.
-
How You Begin is How You Reason: Driving Exploration in RLVR via Prefix-Tuned Priors
IMAX trains soft prefixes with an InfoMax reward to drive diverse exploration in RLVR, yielding up to 11.60% gains in Pass@4 over standard RLVR across model scales.
-
Rethinking Entropy Interventions in RLVR: An Entropy Change Perspective
Derives a token-level entropy change approximation revealing four factors, identifies limitations in prior entropy interventions, and proposes STEER which adaptively reweights tokens to mitigate collapse and improve performance on math and coding benchmarks.