Stabilizing Knowledge, Promoting Reasoning: Dual-Token Constraints for RLVR
Reinforcement Learning with Verifiable Rewards (RLVR) has become an effective post-training method for improving the reasoning abilities of Large Language Models (LLMs). However, existing methods mainly apply uniform optimization constraints across all tokens, ignoring their heterogeneous roles. Prior work shows that high-entropy tokens are closely tied to reasoning, while low-entropy tokens primarily encode factual knowledge, and recent approaches attempt to exploit this distinction by isolating token updates via masking or asynchronous training. We argue that such isolation breaks the sequential dependency structure of autoregressive generation, leading to suboptimal learning. To address this, we propose Archer, an entropy-aware RLVR framework with dual-token constraints that preserves joint optimization while modulating update strength across token types. Our method introduces response-level entropy normalization for stable token classification and applies differentiated clipping ranges and KL regularization to encourage exploration on reasoning tokens while preserving knowledge tokens. Experiments on mathematical reasoning and code generation benchmarks show that Archer consistently outperforms strong baselines across multiple model scales, improving both pass@1 and pass@K performance. These results highlight the importance of respecting sequence-level dependencies when designing fine-grained RL optimization strategies for LLMs.
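The abstract describes the mechanism only at a high level, so the following is a minimal sketch of what dual-token constraints could look like, not the authors' implementation. It assumes a PPO-style clipped surrogate, a within-response quantile as the entropy threshold, and per-token KL estimates supplied by the caller. All function names and every numeric value (`quantile`, `eps_reason`, `eps_know`, `beta_reason`, `beta_know`) are hypothetical and chosen for illustration only.

```python
import math
from typing import List, Sequence


def token_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def classify_tokens(entropies: Sequence[float], quantile: float = 0.8) -> List[bool]:
    """Mark high-entropy tokens, thresholding within this response only.

    Assumption: a per-response quantile stands in for the paper's
    response-level entropy normalization.
    """
    ranked = sorted(entropies)
    idx = min(int(quantile * len(ranked)), len(ranked) - 1)
    threshold = ranked[idx]
    return [h >= threshold for h in entropies]


def dual_constraint_loss(
    ratios: Sequence[float],       # pi_new / pi_old for each token
    advantages: Sequence[float],   # per-token advantage estimates
    kls: Sequence[float],          # per-token KL estimates to a reference policy
    is_reasoning: Sequence[bool],  # output of classify_tokens
    eps_reason: float = 0.5,       # hypothetical: wider clip for reasoning tokens
    eps_know: float = 0.2,         # hypothetical: tighter clip for knowledge tokens
    beta_reason: float = 0.0,      # hypothetical: weak KL on reasoning tokens
    beta_know: float = 0.001,      # hypothetical: stronger KL on knowledge tokens
) -> float:
    """Mean per-token loss; every token contributes, none is masked out."""
    total = 0.0
    for r, a, kl, high in zip(ratios, advantages, kls, is_reasoning):
        eps = eps_reason if high else eps_know
        beta = beta_reason if high else beta_know
        clipped = max(1.0 - eps, min(1.0 + eps, r))
        surrogate = min(r * a, clipped * a)  # PPO-style pessimistic bound
        total += -surrogate + beta * kl
    return total / len(ratios)


if __name__ == "__main__":
    mask = classify_tokens([0.1, 0.2, 2.0, 0.05, 0.3])
    print(mask)
    print(dual_constraint_loss([1.4] * 5, [1.0] * 5, [0.0] * 5, mask))
```

In this sketch both token types stay in the same objective and differ only in how tightly their updates are bounded, which is one way to read the abstract's contrast with masking-based isolation.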
Forward citations
Cited by 2 Pith papers
- Understanding and Preventing Entropy Collapse in RLVR with On-Policy Entropy Flow Optimization
  OPEFO prevents entropy collapse in RLVR by rescaling token updates according to their entropy change contributions, yielding more stable optimization and better results on math benchmarks.
- Flexible Entropy Control in RLVR with a Gradient-Preserving Perspective
  Dynamic clipping strategies based on importance sampling regions enable precise entropy management in RLVR, mitigating collapse and improving benchmark performance.