Leash: Adaptive length penalty and reward shaping for efficient large reasoning model
2 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary: background 1
citation-polarity summary
years: 2026 (2)
verdicts: UNVERDICTED (2)
roles: background (1)
polarities: background (1)
representative citing papers
LenVM models token-level remaining generation length as a bounded discounted value function derived from constant negative per-token rewards, providing a scalable proxy for generation horizon.
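One way to read the LenVM summary above is as a geometric series: if every remaining token earns the same negative reward and future rewards are discounted, the value depends only on how many tokens are left and never falls below a fixed floor. The sketch below illustrates that reading only. The function names, the default reward of -1 per token, and the discount value are assumptions made for illustration, not details taken from the LenVM paper.

```python
def remaining_length_value(remaining_tokens: int,
                           gamma: float = 0.99,
                           per_token_reward: float = -1.0) -> float:
    """Discounted return of a constant per-token reward over the remaining tokens.

    Hypothetical helper: the reward scale and discount are illustrative assumptions.
    """
    if remaining_tokens < 0:
        raise ValueError("remaining_tokens must be non-negative")
    if not 0.0 < gamma < 1.0:
        raise ValueError("gamma must lie strictly between 0 and 1")
    # Geometric series: r * (1 + gamma + ... + gamma^(n-1)) = r * (1 - gamma^n) / (1 - gamma)
    return per_token_reward * (1.0 - gamma ** remaining_tokens) / (1.0 - gamma)


def value_floor(gamma: float = 0.99, per_token_reward: float = -1.0) -> float:
    """Limit of the value as the remaining length grows without bound."""
    return per_token_reward / (1.0 - gamma)
```

Under these assumptions the value is 0 when nothing remains to be generated and decreases monotonically toward `value_floor(gamma)` as the remaining length grows, which is what makes it a bounded proxy for the generation horizon.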
citing papers explorer
BPPO: Binary Prefix Policy Optimization for Efficient GRPO-Style Reasoning RL with Concise Responses
BPPO selects shortest correct and incorrect completions for GRPO updates with prefix-focused optimization to deliver up to 6.08x speedup and 30-50% shorter responses on math reasoning tasks.
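The selection step named in the BPPO summary can be sketched as follows: from one group of sampled completions, keep the shortest correct one and the shortest incorrect one. This is a minimal illustration of that single step, not the paper's implementation. The `Completion` record and `select_binary_pair` function are hypothetical names, correctness is assumed to be a precomputed boolean, and the prefix-focused optimization is not modeled.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Completion:
    """Hypothetical record for one sampled completion in a group."""
    text: str
    num_tokens: int
    is_correct: bool


def select_binary_pair(
    group: List[Completion],
) -> Tuple[Optional[Completion], Optional[Completion]]:
    """Return (shortest correct, shortest incorrect); None where a side is empty."""
    correct = [c for c in group if c.is_correct]
    incorrect = [c for c in group if not c.is_correct]
    # min(..., default=None) handles groups that are all correct or all incorrect.
    shortest_correct = min(correct, key=lambda c: c.num_tokens, default=None)
    shortest_incorrect = min(incorrect, key=lambda c: c.num_tokens, default=None)
    return shortest_correct, shortest_incorrect
```

How a group with only correct or only incorrect completions should be handled is not stated in the summary; returning `None` for the missing side is simply the choice made in this sketch.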