Learning How Hard to Think: Input-Adaptive Allocation of LM Computation
6 Pith papers cite this work. Polarity classification is still indexing.
verdicts: UNVERDICTED 6
roles: method 1
polarities: use method 1
citing papers explorer
- Small Generalizable Prompt Predictive Models Can Steer Efficient RL Post-Training of Large Reasoning Models
  GPS trains a small model on optimization history to predict prompt difficulty and select intermediate-difficulty diverse batches, yielding better training efficiency, final performance, and test-time allocation than baselines on reasoning benchmarks.
- Do NOT Think That Much for 2+3=? On the Overthinking of o1-Like LLMs
  o1-like models overthink easy tasks; self-training reduces compute use without accuracy loss on GSM8K, MATH500, GPQA, and AIME.
- Self-Supervised On-Policy Distillation for Reasoning Language Models
  SSOPD converts intra-group correct-wrong contrast into process supervision by distilling a teacher distribution from the shortest correct completion into prefixes of the longest wrong completion, improving GRPO on AIME and HMMT benchmarks.
- GlimpRouter: Efficient Collaborative Inference by Glimpsing One Token of Thoughts
  GlimpRouter uses the entropy of the first token in each reasoning step to decide whether to invoke a large model, yielding 10.7% higher accuracy and 25.9% lower latency than a standalone large model on AIME25.
- Differentiable Mixture-of-Agents Incentivizes Swarm Intelligence of Large Language Models
  DMoA is a differentiable multi-agent LLM framework with recurrent context-aware routing and predictive entropy self-supervision that claims SOTA results on 9 benchmarks through elastic agent collaboration.
- Calibrate-Then-Act: Cost-Aware Exploration in LLM Agents
  Calibrate-Then-Act supplies LLM agents with priors on latent environment states to enable explicit cost-uncertainty reasoning, producing more optimal strategies than standard approaches in retrieval QA and file-reading coding tasks.
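
The GlimpRouter entry above describes routing on the entropy of the first token of each reasoning step. The idea can be illustrated with a minimal sketch. It is not the paper's implementation: the function names `token_entropy` and `route_step`, the threshold value, and the choice to escalate to the large model when entropy is high are all assumptions made for illustration, and the probability lists are made-up inputs rather than real model outputs.

```python
import math


def token_entropy(probs):
    """Shannon entropy (in nats) of a next-token probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def route_step(first_token_probs, threshold=1.0):
    """Pick a model for one reasoning step (hypothetical rule).

    Assumption: a high-entropy first token signals an uncertain step,
    so the step is sent to the large model; otherwise the small model
    handles it. The threshold of 1.0 nats is arbitrary.
    """
    if token_entropy(first_token_probs) > threshold:
        return "large"
    return "small"


# Illustrative distributions over a 4-token vocabulary (not real data).
confident = [0.97, 0.01, 0.01, 0.01]   # peaked distribution, low entropy
uncertain = [0.25, 0.25, 0.25, 0.25]   # uniform distribution, entropy = ln 4

print(route_step(confident))
print(route_step(uncertain))
```

In practice the threshold would have to be tuned against an accuracy and latency budget; the summary above does not state how GlimpRouter sets it.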