Self-hinting language models enhance reinforcement learning

Baohao Liao, Hanze Dong, Xinxing Xu, Christof Monz, Jiang Bian · 2026 · arXiv 2602.03143

7 Pith papers cite this work. Polarity classification is still indexing.

7 Pith papers citing it

read on arXiv browse 7 citing papers

citation-role summary

background 3 baseline 1

citation-polarity summary

background 3 baseline 1

representative citing papers

Maximizing Rollout Informativeness under a Fixed Budget: A Submodular View of Tree Search for Tool-Use Agentic Reinforcement Learning

stat.ML · 2026-05-06 · unverdicted · novelty 7.0

InfoTree casts intermediate state selection in tree search as monotone submodular maximization under fixed rollout budgets, yielding closed-form UUCB terms and lifting mixed-outcome ratios while outperforming flat GRPO and prior tree variants on nine benchmarks.

Nudging Beyond the Comfort Zone: Efficient Strategy-Guided Exploration for RLVR

cs.AI · 2026-05-15 · unverdicted · novelty 6.0

NudgeRL conditions RLVR rollouts on strategy-level contexts to drive diverse trajectories and applies an inter/intra-context reward decomposition plus distillation objective, outperforming GRPO and oracle baselines on math benchmarks.

Seir\^enes: Adversarial Self-Play with Evolving Distractions for LLM Reasoning

cs.AI · 2026-05-12 · unverdicted · novelty 6.0

Seirênes trains LLMs via adversarial self-play to generate and overcome evolving distractions, producing gains of 7-10 points on math reasoning benchmarks and exposing blind spots in larger models.

Selective Off-Policy Reference Tuning with Plan Guidance

cs.AI · 2026-05-12 · unverdicted · novelty 6.0 · 2 refs

SORT turns all-wrong prompts into selective learning signals by weighting tokens more predictable under plan guidance from reference solutions, improving over GRPO on reasoning benchmarks especially for weaker models.

Rollout Pass-Rate Control: Steering Binary-Reward RL Toward Its Most Informative Regime

cs.LG · 2026-05-06 · unverdicted · novelty 6.0 · 3 refs

Prefix Sampling replays self-generated trajectory prefixes to control rollout pass rates near 50% in binary-reward RL, delivering wall-clock speedups and modest performance gains on SWE-bench Verified and AIME tasks.

VISD: Enhancing Video Reasoning via Structured Self-Distillation

cs.CV · 2026-05-07 · unverdicted · novelty 5.0 · 4 refs

VISD proposes structured self-distillation with a multi-dimensional judge model and direction-magnitude decoupling to improve token-level credit assignment and convergence speed in VideoLLM reasoning training.

Large Language Model Post-Training: A Unified View of Off-Policy and On-Policy Learning

cs.CL · 2026-04-09 · accept · novelty 5.0

LLM post-training is unified as off-policy or on-policy interventions that expand support for useful behaviors, reshape policies within reachable states, or consolidate behavior across training stages.

citing papers explorer

Showing 7 of 7 citing papers.

Maximizing Rollout Informativeness under a Fixed Budget: A Submodular View of Tree Search for Tool-Use Agentic Reinforcement Learning stat.ML · 2026-05-06 · unverdicted · none · ref 12
InfoTree casts intermediate state selection in tree search as monotone submodular maximization under fixed rollout budgets, yielding closed-form UUCB terms and lifting mixed-outcome ratios while outperforming flat GRPO and prior tree variants on nine benchmarks.
Nudging Beyond the Comfort Zone: Efficient Strategy-Guided Exploration for RLVR cs.AI · 2026-05-15 · unverdicted · none · ref 8
NudgeRL conditions RLVR rollouts on strategy-level contexts to drive diverse trajectories and applies an inter/intra-context reward decomposition plus distillation objective, outperforming GRPO and oracle baselines on math benchmarks.
Seir\^enes: Adversarial Self-Play with Evolving Distractions for LLM Reasoning cs.AI · 2026-05-12 · unverdicted · none · ref 29
Seirênes trains LLMs via adversarial self-play to generate and overcome evolving distractions, producing gains of 7-10 points on math reasoning benchmarks and exposing blind spots in larger models.
Selective Off-Policy Reference Tuning with Plan Guidance cs.AI · 2026-05-12 · unverdicted · none · ref 17 · 2 links
SORT turns all-wrong prompts into selective learning signals by weighting tokens more predictable under plan guidance from reference solutions, improving over GRPO on reasoning benchmarks especially for weaker models.
Rollout Pass-Rate Control: Steering Binary-Reward RL Toward Its Most Informative Regime cs.LG · 2026-05-06 · unverdicted · none · ref 10 · 3 links
Prefix Sampling replays self-generated trajectory prefixes to control rollout pass rates near 50% in binary-reward RL, delivering wall-clock speedups and modest performance gains on SWE-bench Verified and AIME tasks.
VISD: Enhancing Video Reasoning via Structured Self-Distillation cs.CV · 2026-05-07 · unverdicted · none · ref 25 · 4 links
VISD proposes structured self-distillation with a multi-dimensional judge model and direction-magnitude decoupling to improve token-level credit assignment and convergence speed in VideoLLM reasoning training.
Large Language Model Post-Training: A Unified View of Off-Policy and On-Policy Learning cs.CL · 2026-04-09 · accept · none · ref 88
LLM post-training is unified as off-policy or on-policy interventions that expand support for useful behaviors, reshape policies within reachable states, or consolidate behavior across training stages.

Self-hinting language models enhance reinforcement learning

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer