DART is a training-free router that accepts direct answers on draft agreement and allocates thinking budgets via draft entropy on disagreement, reporting accuracy gains and token reductions on math and code benchmarks across model scales.
hub Mixed citations
O lympiad B ench: A Challenging Benchmark for Promoting AGI with Olympiad-Level Bilingual Multimodal Scientific Problems
Mixed citation behavior. Most common role is background (38%).
hub tools
citation-role summary
citation-polarity summary
representative citing papers
CrowdMath is a new dataset of annotated collaborative math proof discussions where frontier LLMs achieve 83-88% on next-post prediction but only 0.42 macro-F1 on identifying contribution roles.
Frontier LLMs achieve 95-100% accuracy on AMC/AIME problems but recover far fewer distinct valid strategies than human references, while collectively generating 50 novel strategies.
OptiVerse is a new benchmark spanning neglected optimization domains that shows LLMs suffer sharp accuracy drops on hard problems due to modeling and logic errors, with a Dual-View Auditor Agent proposed to improve performance.
Frontier LLMs struggle to discriminate data uncertainty from model uncertainty even when accurate, but a new benchmark and lightweight RL strategy improve attribution without sacrificing answer accuracy.
Evo-L2S uses multi-objective evolutionary model merging to produce reasoning models that cut generated chain-of-thought length by over 50% while preserving or improving accuracy on math benchmarks.
Prefix-RFT blends SFT and RFT via prefix sampling from demonstrations to outperform standalone SFT, RFT, and mixed-policy baselines on math reasoning problems.
RA-RFT trains a retriever to rank contexts by expected reasoning benefit and uses the retrieved analogies inside reinforcement fine-tuning, yielding 7.1 and 2.8 point gains on AIME 2025 over GRPO for two Qwen3 models.
CoRP consolidates reward-weighted perturbations into a single model via low-rank structure, improving base LLMs by 8.1 points on average while using one-tenth the budget of prior ensembles and one forward pass.
Dynamic Gradient Gating monitors lm_head gradient norms to safely reuse rollout batches in RLVR, achieving up to 2.93x sample efficiency and 2.14x wall-clock speedup across math, ALFWorld, WebShop, and QA tasks.
PUMA detects reasoning-level semantic redundancy to enable early exit in chains of thought, achieving 26.2% average token reduction across five LRMs and five benchmarks while preserving accuracy and CoT quality.
PopuLoRA shows that co-evolving populations of LoRA adapters through cross-evaluated self-play can outperform compute-matched single-agent baselines on multiple code and math reasoning benchmarks.
GRLO shows RLHF from scratch on 5K open-ended prompts raises average performance from 24.1 to 63.1 across domains on Qwen3-4B-Base using 46x less data and 68x less compute than in-domain RLVR while remaining competitive with heavily post-trained models.
ICRL uses joint RL training of solver and critic with distribution-calibration re-weighting and role-wise advantage estimation to internalize critique into unassisted LLM performance, yielding 6.4-point gains on agentic tasks and 7.0 on math reasoning with Qwen3 models.
HölderPO unifies token-level aggregation in GRPO via the Hölder mean with a tunable p parameter and annealing schedule, delivering 54.9% average accuracy on math benchmarks and 93.8% success on ALFWorld.
EvoTD applies crossover for skill composition and parametric mutation for complexity scaling, filtered by a Zone of Proximal Development, to generate tasks that improve LLM reasoning generalization across models.
SORT turns all-wrong prompts into selective learning signals by weighting tokens more predictable under plan guidance from reference solutions, improving over GRPO on reasoning benchmarks especially for weaker models.
ExpThink reduces average CoT response length by up to 77% while improving accuracy on math benchmarks via experience-guided reward shaping and difficulty-adaptive advantage in RL.
RLCM trains LLMs with a margin-enhanced process reward that widens the gap between correct and incorrect reasoning steps, improving calibration on math, code, logic, and science tasks without hurting accuracy.
PDDL planning problems are used to generate about one million precise reasoning steps for training Process Reward Models, and adding this data to existing datasets improves LLM performance on both mathematical and non-mathematical reasoning benchmarks.
MEDS improves LLM RL performance by up to 4.13 pass@1 and 4.37 pass@128 points by dynamically penalizing rollouts matching prevalent historical error clusters identified via memory-stored representations and density clustering.
NPR trains LLMs to reason in parallel via self-distilled RL, delivering up to 24.5% performance gains and 4.6x speedups with 100% genuine parallel execution on reasoning benchmarks.
Entropy Ratio Clipping introduces a global entropy-ratio constraint that stabilizes RL policy updates in LLM post-training beyond local PPO clipping.
Archer introduces response-level entropy normalization and differentiated clipping/KL regularization in RLVR to encourage exploration on reasoning tokens while stabilizing knowledge tokens, yielding gains in pass@1 and pass@K on reasoning benchmarks.
citing papers explorer
-
DART: Draft-Agreement Routing for Training-Free Adaptive Thinking Budgets in Hybrid Reasoning Models
DART is a training-free router that accepts direct answers on draft agreement and allocates thinking budgets via draft entropy on disagreement, reporting accuracy gains and token reductions on math and code benchmarks across model scales.
-
CrowdMath: A Dataset of Crowdsourced Mathematical Research Discussions
CrowdMath is a new dataset of annotated collaborative math proof discussions where frontier LLMs achieve 83-88% on next-post prediction but only 0.42 macro-F1 on identifying contribution roles.
-
Beyond Accuracy: Evaluating Strategy Diversity in LLM Mathematical Reasoning
Frontier LLMs achieve 95-100% accuracy on AMC/AIME problems but recover far fewer distinct valid strategies than human references, while collectively generating 50 novel strategies.
-
PopuLoRA: Co-Evolving LLM Populations for Reasoning Self-Play
PopuLoRA shows that co-evolving populations of LoRA adapters through cross-evaluated self-play can outperform compute-matched single-agent baselines on multiple code and math reasoning benchmarks.
-
ICRL: Learning to Internalize Self-Critique with Reinforcement Learning
ICRL uses joint RL training of solver and critic with distribution-calibration re-weighting and role-wise advantage estimation to internalize critique into unassisted LLM performance, yielding 6.4-point gains on agentic tasks and 7.0 on math reasoning with Qwen3 models.
-
Selective Off-Policy Reference Tuning with Plan Guidance
SORT turns all-wrong prompts into selective learning signals by weighting tokens more predictable under plan guidance from reference solutions, improving over GRPO on reasoning benchmarks especially for weaker models.
-
Efficient Test-Time Scaling via Temporal Reasoning Aggregation
TRACE aggregates answer consistency and confidence trajectory over multiple reasoning steps to decide when to halt inference, reducing token usage by 25-30% while keeping accuracy within 1-2% of full reasoning.
-
Towards Reasoning Era: A Survey of Long Chain-of-Thought for Reasoning Large Language Models
The paper unifies perspectives on Long CoT in reasoning LLMs by introducing a taxonomy, detailing characteristics of deep reasoning and reflection, and discussing emergence phenomena and future directions.