hub

Sample more to think less: Group filtered policy optimization for concise reasoning.arXiv preprint arXiv:2508.09726

· 2025 · arXiv 2508.09726

15 Pith papers cite this work. Polarity classification is still indexing.

15 Pith papers citing it

read on arXiv browse 15 citing papers

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 3 baseline 1

citation-polarity summary

background 2 baseline 1 unclear 1

representative citing papers

Cross-Epoch Adaptive Rollout Optimization for RL Post-Training

cs.LG · 2026-06-04 · unverdicted · novelty 7.0

CERO uses Beta posteriors and Fenchel-dual online optimization to adaptively allocate a fixed rollout budget across prompts and epochs in LLM RL, outperforming fixed-allocation GRPO on math reasoning benchmarks.

LEAD: Length-Efficient Adaptive and Dynamic Reasoning for Large Language Models

cs.LG · 2026-05-10 · unverdicted · novelty 7.0

LEAD uses online adaptive mechanisms including Potential-Scaled Instability and symmetric efficiency rewards based on correct rollouts to achieve higher accuracy-efficiency scores with substantially shorter reasoning outputs than base models on math benchmarks.

CoDistill-GRPO: A Co-Distillation Recipe for Efficient Group Relative Policy Optimization

cs.LG · 2026-05-09 · unverdicted · novelty 7.0

CoDistill-GRPO lets small and large models mutually improve via co-distillation in GRPO, raising small-model math accuracy by over 11 points while cutting large-model training time by about 18%.

Post Reasoning: Improving the Performance of Non-Thinking Models at No Cost

cs.AI · 2026-05-07 · conditional · novelty 7.0

Post-Reasoning boosts LLM accuracy by reversing the usual answer-after-reasoning order, delivering mean relative gains of 17.37% across 117 model-benchmark pairs with zero extra cost.

Generate, Filter, Control, Replay: A Comprehensive Survey of Rollout Strategies for LLM Reinforcement Learning

cs.LG · 2026-04-08 · unverdicted · novelty 7.0

This survey introduces the Generate-Filter-Control-Replay (GFCR) taxonomy to structure rollout pipelines for RL-based post-training of reasoning LLMs.

Right Makes Might: Aligning Verified Hidden States Empowers RL Reasoning

cs.LG · 2026-06-02 · unverdicted · novelty 6.0

Hidden-Align adds an auxiliary loss to align hidden states of correct reasoning paths at the pre-answer token in RLVR, improving pass@1 by 3.8-6.2 points over DAPO on eight math benchmarks for Qwen3 models of 1.7B-14B scale.

Spend Your Rollouts Where It Counts: Rollout Allocation for Group-Based RL Post-Training

cs.LG · 2026-05-26 · unverdicted · novelty 6.0

Pilot-Commit estimates per-prompt informativeness via a pilot stage and skips low-variance prompts, matching baseline accuracy with up to 4.0x fewer cumulative rollouts than DAPO on math reasoning tasks.

MEMENTO: Teaching LLMs to Manage Their Own Context

cs.AI · 2026-04-10 · unverdicted · novelty 6.0

MEMENTO trains LLMs to segment reasoning into blocks, generate mementos as dense summaries, and reason forward using only mementos and KV states, cutting peak KV cache by ~2.5x while preserving benchmark accuracy.

Rubrics to Tokens: Bridging Response-level Rubrics and Token-level Rewards in Instruction Following Tasks

cs.CL · 2026-04-03 · unverdicted · novelty 6.0

RTT bridges response-level rubrics to token-level rewards via a relevance discriminator and intra-sample group normalization, yielding higher instruction and rubric accuracy than baselines.

GDPO: Group reward-Decoupled Normalization Policy Optimization for Multi-reward RL Optimization

cs.CL · 2026-01-08 · unverdicted · novelty 6.0

GDPO decouples per-reward normalization in multi-reward RL to avoid advantage collapse and improve convergence over GRPO on tool-calling, math, and coding tasks.

Entropy After </Think> for reasoning model early exiting

cs.LG · 2025-09-30 · unverdicted · novelty 6.0

Entropy After </Think> (EAT) enables early exiting in reasoning LLMs by tracking entropy stabilization after a </think> token, cutting token use 12-22% on MATH500 and AIME2025 with no accuracy loss.

Trust Region On-Policy Distillation

cs.LG · 2026-05-31 · unverdicted · novelty 5.0

TrOPD stabilizes on-policy distillation for LLMs with trust-region learning, outlier estimation, and off-policy guidance, outperforming prior OPD methods on reasoning and code benchmarks.

On the Implicit Reward Overfitting and the Low-rank Dynamics in RLVR

cs.LG · 2026-05-07 · unverdicted · novelty 5.0

RLVR exhibits implicit reward overfitting to training data and optimizes heavy-tailed singular spectra with rank-1 focus on reasoning capability.

Stop Overthinking: A Survey on Efficient Reasoning for Large Language Models

cs.CL · 2025-03-20 · accept · novelty 5.0

A survey organizing techniques to achieve efficient reasoning in LLMs by shortening chain-of-thought outputs.

Knowledge Graph-Driven Expert-Level Reasoning for Neuroscience

cs.CL · 2026-05-24 · unverdicted · novelty 4.0

A textbook-derived neuroscience knowledge graph supplies synthetic multi-hop QA supervision and RL rewards to fine-tune a small LM claimed to exceed larger general models on expert reasoning.

citing papers explorer

Showing 2 of 2 citing papers after filters.

Rubrics to Tokens: Bridging Response-level Rubrics and Token-level Rewards in Instruction Following Tasks cs.CL · 2026-04-03 · unverdicted · none · ref 14
RTT bridges response-level rubrics to token-level rewards via a relevance discriminator and intra-sample group normalization, yielding higher instruction and rubric accuracy than baselines.
Stop Overthinking: A Survey on Efficient Reasoning for Large Language Models cs.CL · 2025-03-20 · accept · none · ref 157
A survey organizing techniques to achieve efficient reasoning in LLMs by shortening chain-of-thought outputs.

Sample more to think less: Group filtered policy optimization for concise reasoning.arXiv preprint arXiv:2508.09726

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer