Can prompt difficulty be online predicted for accelerating RL finetuning of reasoning models? arXiv preprint arXiv:2507.04632, 2025

Yun Qu, Qi Wang, Yixiu Mao, Vincent Tao Hu, Björn Ommer, Xiangyang Ji · 2025 · arXiv 2507.04632

5 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary: background 2

citation-polarity summary: background 2

representative citing papers

Small Generalizable Prompt Predictive Models Can Steer Efficient RL Post-Training of Large Reasoning Models

cs.AI · 2026-02-02 · unverdicted · novelty 7.0

GPS trains a small model on optimization history to predict prompt difficulty and select intermediate-difficulty diverse batches, yielding better training efficiency, final performance, and test-time allocation than baselines on reasoning benchmarks.

Spend Your Rollouts Where It Counts: Rollout Allocation for Group-Based RL Post-Training

cs.LG · 2026-05-26 · unverdicted · novelty 6.0

Pilot-Commit estimates per-prompt informativeness via a pilot stage and skips low-variance prompts, matching baseline accuracy with up to 4.0x fewer cumulative rollouts than DAPO on math reasoning tasks.

DARE: Difficulty-Adaptive Reinforcement Learning with Co-Evolved Difficulty Estimation

cs.LG · 2026-05-09 · unverdicted · novelty 6.0

DARE co-evolves difficulty estimation and policy in RL for LLMs to improve training efficiency, final performance, and inference speed by using tailored strategies for different difficulty levels.

Listwise Policy Optimization: Group-based RLVR as Target-Projection on the LLM Response Simplex

cs.LG · 2026-05-07 · unverdicted · novelty 6.0 · 2 refs

Listwise Policy Optimization explicitly performs target-projection on the LLM response simplex, unifying and improving group-based RLVR methods with monotonic improvement and flexible divergences.

Goldilocks RL: Tuning Task Difficulty to Escape Sparse Rewards for Reasoning

cs.LG · 2026-02-16 · unverdicted · novelty 5.0

A teacher-driven sampling method selects appropriately difficult questions for student models in GRPO-based RL to improve reasoning performance under fixed compute on OpenMathReasoning.

citing papers explorer

Showing 5 of 5 citing papers.

Small Generalizable Prompt Predictive Models Can Steer Efficient RL Post-Training of Large Reasoning Models

cs.AI · 2026-02-02 · unverdicted · none · ref 24

Spend Your Rollouts Where It Counts: Rollout Allocation for Group-Based RL Post-Training

cs.LG · 2026-05-26 · unverdicted · none · ref 10

DARE: Difficulty-Adaptive Reinforcement Learning with Co-Evolved Difficulty Estimation

cs.LG · 2026-05-09 · unverdicted · none · ref 39

Listwise Policy Optimization: Group-based RLVR as Target-Projection on the LLM Response Simplex

cs.LG · 2026-05-07 · unverdicted · none · ref 43 · 2 links

Goldilocks RL: Tuning Task Difficulty to Escape Sparse Rewards for Reasoning

cs.LG · 2026-02-16 · unverdicted · none · ref 20
