Leash: Adaptive length penalty and reward shaping for efficient large reasoning model

Solving quantitative reasoning problems with language models · 2024 · arXiv 2512.21540

2 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background 1

citation-polarity summary

background 1

representative citing papers

BPPO: Binary Prefix Policy Optimization for Efficient GRPO-Style Reasoning RL with Concise Responses

cs.LG · 2026-05-27 · unverdicted · novelty 6.0

BPPO selects shortest correct and incorrect completions for GRPO updates with prefix-focused optimization to deliver up to 6.08x speedup and 30-50% shorter responses on math reasoning tasks.
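
The selection step named in this summary can be illustrated with a minimal sketch. It assumes each sampled completion carries a token count and a binary correctness flag; the `Completion` type and `select_shortest_pair` function are hypothetical names introduced here, not taken from the paper, and the prefix-focused optimization itself is not shown.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Completion:
    """One sampled response in a group (hypothetical container)."""
    text: str
    num_tokens: int
    correct: bool


def select_shortest_pair(
    group: List[Completion],
) -> Tuple[Optional[Completion], Optional[Completion]]:
    """Return (shortest correct, shortest incorrect) from a sampled group.

    Either element is None when the group has no completion of that kind.
    """
    correct = [c for c in group if c.correct]
    incorrect = [c for c in group if not c.correct]
    shortest_correct = min(correct, key=lambda c: c.num_tokens) if correct else None
    shortest_incorrect = min(incorrect, key=lambda c: c.num_tokens) if incorrect else None
    return shortest_correct, shortest_incorrect


# Illustrative group with made-up lengths.
group = [
    Completion("a", 120, True),
    Completion("b", 80, True),
    Completion("c", 300, False),
    Completion("d", 95, False),
]
best_correct, best_incorrect = select_shortest_pair(group)
```

Under these assumptions only two completions per group would feed the policy update, which is one plausible reading of where a speedup could come from.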

Length Value Model: Scalable Value Pretraining for Token-Level Length Modeling

cs.CL · 2026-04-29 · unverdicted · novelty 6.0

LenVM models token-level remaining generation length as a bounded discounted value function derived from constant negative per-token rewards, providing a scalable proxy for generation horizon.
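
To see why a constant negative per-token reward yields a bounded value, consider a minimal sketch. It assumes a reward of -1 per remaining token and a discount factor `gamma` below 1; the function names and the default values are illustrative assumptions, not the paper's settings. With n tokens remaining, the discounted return is the geometric sum -(1 - gamma^n) / (1 - gamma), which stays within (-1 / (1 - gamma), 0] however large n grows.

```python
import math


def remaining_length_value(n_remaining: int, gamma: float = 0.99, reward: float = -1.0) -> float:
    """Discounted return when each of n_remaining tokens earns the same reward."""
    if not 0.0 < gamma < 1.0:
        raise ValueError("gamma must lie strictly between 0 and 1")
    if n_remaining < 0:
        raise ValueError("n_remaining must be non-negative")
    # Closed form of reward * (1 + gamma + ... + gamma**(n_remaining - 1)).
    return reward * (1.0 - gamma ** n_remaining) / (1.0 - gamma)


def value_to_length(value: float, gamma: float = 0.99, reward: float = -1.0) -> float:
    """Invert the closed form to recover the implied remaining length."""
    ratio = 1.0 - value * (1.0 - gamma) / reward
    return math.log(ratio) / math.log(gamma)


v10 = remaining_length_value(10)
v1000 = remaining_length_value(1000)
lower_bound = -1.0 / (1.0 - 0.99)
```

Because the mapping between value and remaining length is monotone, a predicted value can be converted back into a length estimate, which is one way such a value could act as a proxy for the generation horizon.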

citing papers explorer

Showing 2 of 2 citing papers.

BPPO: Binary Prefix Policy Optimization for Efficient GRPO-Style Reasoning RL with Concise Responses cs.LG · 2026-05-27 · unverdicted · none · ref 3
Length Value Model: Scalable Value Pretraining for Token-Level Length Modeling cs.CL · 2026-04-29 · unverdicted · none · ref 11
