Let’s verify step by step
3 Pith papers cite this work. Polarity classification is still indexing.
verdicts
UNVERDICTED 3
citing papers explorer
- REC-RL: Referring expression counting via Gaussian and range-based reward optimization
  REC-RL applies Group Relative Policy Optimization with combined range and Gaussian accuracy rewards plus a format reward to improve referring expression counting.
- LLM Reasoning with Process Rewards for Outcome-Guided Steps
  PROGRS uses outcome-conditioned centering on PRM scores to safely integrate process rewards into GRPO for improved Pass@1 on math benchmarks.
- Pruning Long Chain-of-Thought of Large Reasoning Models via Small-Scale Preference Optimization
  LCPO reduces average LRM output length by over 50% across benchmarks via targeted preference optimization while preserving reasoning performance.