Step-GRPO internalizes dynamic early exit into reasoning models via step-structured optimization, Dynamic Truncated Rollout, and Step-Aware Relative Reward, delivering 32% token reduction on Qwen3-8B with no accuracy loss.
1 Pith paper cites this work. Polarity classification is still indexing.
Fields: cs.AI (1)
Years: 2026 (1)
Verdicts: UNVERDICTED (1)
Representative citing papers
Step-GRPO: Internalizing Dynamic Early Exit for Efficient Reasoning