POISE trains a lightweight probe on the actor's internal states to predict expected rewards for RLVR, matching DAPO performance on math benchmarks with lower compute by avoiding extra rollouts or critic models.
arXiv preprint arXiv:2411.19943 , year =
10 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
roles
background 2polarities
background 2representative citing papers
Standard top-k routers in MoE language models often select suboptimal routes for difficult tokens, and updating only the final router layer raises pass@K on AIME and HMMT benchmarks across multiple models.
Smaller models provide temporally correlated policy-level diversity that serves as structured exploration for training larger models in GRPO, yielding accuracy gains such as +8.8% on AIME 24 with reduced compute via the S2L-PO framework.
AttentionPO weights tokens in DPO using LLM attention as a pairwise judge, yielding better results on AlpacaEval, MT-Bench, and ArenaHard than prior preference optimization methods.
InsightReplay improves long CoT reasoning by extracting critical insights from the trace and replaying them near the active frontier, delivering +1.65 average accuracy gain across 24 model-benchmark settings.
HTPO introduces hierarchical token-level objective control in RLVR to balance exploration and exploitation by grouping tokens according to difficulty, correctness, and entropy, yielding up to 8.6% gains on AIME benchmarks over DAPO.
AtManRL learns an additive attention mask on CoT traces to produce a saliency reward that, when combined with outcome rewards in GRPO, trains LLMs to generate reasoning that genuinely influences final predictions.
High-entropy minority tokens drive RLVR gains, so restricting gradients to the top 20% maintains or improves performance over full updates on Qwen3 models, especially larger ones.
HAB applies coarse-to-fine budgeting to LLM reasoning, predicting per-problem depth and learning intra-step token budgets via PPL comparisons and adaptive Pareto optimization, yielding higher accuracy and lower token use than standard CoT on GSM8K and MATH500.
Select to Think reframes LLM help as ranking among SLM top-K candidates and distills the ranking ability back into the SLM for improved single-pass reasoning.
citing papers explorer
-
Beyond the 80/20 Rule: High-Entropy Minority Tokens Drive Effective Reinforcement Learning for LLM Reasoning
High-entropy minority tokens drive RLVR gains, so restricting gradients to the top 20% maintains or improves performance over full updates on Qwen3 models, especially larger ones.