Post-Reasoning boosts LLM accuracy by reversing the usual answer-after-reasoning order, delivering mean relative gains of 17.37% across 117 model-benchmark pairs with zero extra cost.
Concise reasoning via reinforcement learning.arXiv preprint arXiv:2504.05185
9 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
roles
background 2polarities
background 2representative citing papers
GPS trains a small model on optimization history to predict prompt difficulty and select intermediate-difficulty diverse batches, yielding better training efficiency, final performance, and test-time allocation than baselines on reasoning benchmarks.
One training example via RLVR boosts LLM math reasoning from 17.6% to 35.7% average across six benchmarks.
GXPO approximates longer local lookahead in GRPO training via gradient extrapolation from two optimizer steps using three backward passes total, improving pass@1 accuracy by 1.65-5.00 points over GRPO and delivering up to 4x step speedup.
A large model generates a compact reasoning signal that a small model uses to solve tasks, reducing the large model's output tokens by up to 60% on benchmarks like AIME and GPQA.
RLVR drives a concept network in LLMs through nucleation and freezing into inverse trees that support slow thinking, and intervening with brief SFT at peak frustration outperforms standard RLVR while post-freeze SFT causes forgetting.
LUFFY mixes off-policy reasoning traces into RLVR training via Mixed-Policy GRPO and regularized importance sampling, delivering over 6-point gains on math benchmarks and enabling training of weak models where on-policy RLVR fails.
A survey organizing techniques to achieve efficient reasoning in LLMs by shortening chain-of-thought outputs.
citing papers explorer
-
Post Reasoning: Improving the Performance of Non-Thinking Models at No Cost
Post-Reasoning boosts LLM accuracy by reversing the usual answer-after-reasoning order, delivering mean relative gains of 17.37% across 117 model-benchmark pairs with zero extra cost.
-
Small Generalizable Prompt Predictive Models Can Steer Efficient RL Post-Training of Large Reasoning Models
GPS trains a small model on optimization history to predict prompt difficulty and select intermediate-difficulty diverse batches, yielding better training efficiency, final performance, and test-time allocation than baselines on reasoning benchmarks.
-
Reinforcement Learning for Reasoning in Large Language Models with One Training Example
One training example via RLVR boosts LLM math reasoning from 17.6% to 35.7% average across six benchmarks.
-
Gradient Extrapolation-Based Policy Optimization
GXPO approximates longer local lookahead in GRPO training via gradient extrapolation from two optimizer steps using three backward passes total, improving pass@1 accuracy by 1.65-5.00 points over GRPO and delivering up to 4x step speedup.
-
When Less is Enough: Efficient Inference via Collaborative Reasoning
A large model generates a compact reasoning signal that a small model uses to solve tasks, reducing the large model's output tokens by up to 60% on benchmarks like AIME and GPQA.
-
Emergent Slow Thinking in LLMs as Inverse Tree Freezing
RLVR drives a concept network in LLMs through nucleation and freezing into inverse trees that support slow thinking, and intervening with brief SFT at peak frustration outperforms standard RLVR while post-freeze SFT causes forgetting.
-
Learning to Reason under Off-Policy Guidance
LUFFY mixes off-policy reasoning traces into RLVR training via Mixed-Policy GRPO and regularized importance sampling, delivering over 6-point gains on math benchmarks and enabling training of weak models where on-policy RLVR fails.
-
Stop Overthinking: A Survey on Efficient Reasoning for Large Language Models
A survey organizing techniques to achieve efficient reasoning in LLMs by shortening chain-of-thought outputs.
- Cost-Aware Learning