Concise reasoning via reinforcement learning.arXiv preprint arXiv:2504.05185

Mehdi Fatemi, Banafsheh Rafiee, Mingjie Tang, Kartik Talamadupula · 2025 · arXiv 2504.05185

9 Pith papers cite this work. Polarity classification is still indexing.

9 Pith papers citing it

read on arXiv browse 9 citing papers

citation-role summary

background 2

citation-polarity summary

background 2

representative citing papers

Post Reasoning: Improving the Performance of Non-Thinking Models at No Cost

cs.AI · 2026-05-07 · conditional · novelty 7.0

Post-Reasoning boosts LLM accuracy by reversing the usual answer-after-reasoning order, delivering mean relative gains of 17.37% across 117 model-benchmark pairs with zero extra cost.

Small Generalizable Prompt Predictive Models Can Steer Efficient RL Post-Training of Large Reasoning Models

cs.AI · 2026-02-02 · unverdicted · novelty 7.0

GPS trains a small model on optimization history to predict prompt difficulty and select intermediate-difficulty diverse batches, yielding better training efficiency, final performance, and test-time allocation than baselines on reasoning benchmarks.

Reinforcement Learning for Reasoning in Large Language Models with One Training Example

cs.LG · 2025-04-29 · accept · novelty 7.0

One training example via RLVR boosts LLM math reasoning from 17.6% to 35.7% average across six benchmarks.

Gradient Extrapolation-Based Policy Optimization

cs.LG · 2026-05-07 · unverdicted · novelty 6.0

GXPO approximates longer local lookahead in GRPO training via gradient extrapolation from two optimizer steps using three backward passes total, improving pass@1 accuracy by 1.65-5.00 points over GRPO and delivering up to 4x step speedup.

When Less is Enough: Efficient Inference via Collaborative Reasoning

cs.LG · 2026-05-01 · conditional · novelty 6.0

A large model generates a compact reasoning signal that a small model uses to solve tasks, reducing the large model's output tokens by up to 60% on benchmarks like AIME and GPQA.

Emergent Slow Thinking in LLMs as Inverse Tree Freezing

cs.AI · 2025-09-28 · unverdicted · novelty 6.0

RLVR drives a concept network in LLMs through nucleation and freezing into inverse trees that support slow thinking, and intervening with brief SFT at peak frustration outperforms standard RLVR while post-freeze SFT causes forgetting.

Learning to Reason under Off-Policy Guidance

cs.LG · 2025-04-21 · unverdicted · novelty 6.0

LUFFY mixes off-policy reasoning traces into RLVR training via Mixed-Policy GRPO and regularized importance sampling, delivering over 6-point gains on math benchmarks and enabling training of weak models where on-policy RLVR fails.

Stop Overthinking: A Survey on Efficient Reasoning for Large Language Models

cs.CL · 2025-03-20 · accept · novelty 5.0

A survey organizing techniques to achieve efficient reasoning in LLMs by shortening chain-of-thought outputs.

Cost-Aware Learning

cs.LG · 2026-04-30

citing papers explorer

Showing 9 of 9 citing papers.

Post Reasoning: Improving the Performance of Non-Thinking Models at No Cost cs.AI · 2026-05-07 · conditional · none · ref 229
Post-Reasoning boosts LLM accuracy by reversing the usual answer-after-reasoning order, delivering mean relative gains of 17.37% across 117 model-benchmark pairs with zero extra cost.
Small Generalizable Prompt Predictive Models Can Steer Efficient RL Post-Training of Large Reasoning Models cs.AI · 2026-02-02 · unverdicted · none · ref 10
GPS trains a small model on optimization history to predict prompt difficulty and select intermediate-difficulty diverse batches, yielding better training efficiency, final performance, and test-time allocation than baselines on reasoning benchmarks.
Reinforcement Learning for Reasoning in Large Language Models with One Training Example cs.LG · 2025-04-29 · accept · none · ref 56
One training example via RLVR boosts LLM math reasoning from 17.6% to 35.7% average across six benchmarks.
Gradient Extrapolation-Based Policy Optimization cs.LG · 2026-05-07 · unverdicted · none · ref 9
GXPO approximates longer local lookahead in GRPO training via gradient extrapolation from two optimizer steps using three backward passes total, improving pass@1 accuracy by 1.65-5.00 points over GRPO and delivering up to 4x step speedup.
When Less is Enough: Efficient Inference via Collaborative Reasoning cs.LG · 2026-05-01 · conditional · none · ref 7
A large model generates a compact reasoning signal that a small model uses to solve tasks, reducing the large model's output tokens by up to 60% on benchmarks like AIME and GPQA.
Emergent Slow Thinking in LLMs as Inverse Tree Freezing cs.AI · 2025-09-28 · unverdicted · none · ref 12
RLVR drives a concept network in LLMs through nucleation and freezing into inverse trees that support slow thinking, and intervening with brief SFT at peak frustration outperforms standard RLVR while post-freeze SFT causes forgetting.
Learning to Reason under Off-Policy Guidance cs.LG · 2025-04-21 · unverdicted · none · ref 44
LUFFY mixes off-policy reasoning traces into RLVR training via Mixed-Policy GRPO and regularized importance sampling, delivering over 6-point gains on math benchmarks and enabling training of weak models where on-policy RLVR fails.
Stop Overthinking: A Survey on Efficient Reasoning for Large Language Models cs.CL · 2025-03-20 · accept · none · ref 47
A survey organizing techniques to achieve efficient reasoning in LLMs by shortening chain-of-thought outputs.
Cost-Aware Learning cs.LG · 2026-04-30 · unreviewed · ref 7

Concise reasoning via reinforcement learning.arXiv preprint arXiv:2504.05185

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer