STARE applies surprisal-guided token-level advantage reweighting plus a target-entropy gate to stabilize entropy in GRPO RL for LLMs, yielding stable training and 4-8% gains on AIME24/25 over baselines.
Title resolution pending
1 Pith paper cite this work. Polarity classification is still indexing.
1
Pith paper citing it
fields
cs.LG 1years
2026 1verdicts
UNVERDICTED 1representative citing papers
citing papers explorer
-
STARE: Surprisal-Guided Token-Level Advantage Reweighting for Policy Entropy Stability
STARE applies surprisal-guided token-level advantage reweighting plus a target-entropy gate to stabilize entropy in GRPO RL for LLMs, yielding stable training and 4-8% gains on AIME24/25 over baselines.