TAC is a bandit curriculum for multi-domain RLVR that prioritizes domains whose gradient updates align with and benefit other domains, yielding up to 2.8-point macro accuracy gains over learnability-only baselines on Qwen3-1.7B and Llama3.2-3B.
hub
Vcrl: Variance-based curriculum reinforcement learning for large language models
13 Pith papers cite this work. Polarity classification is still indexing.
hub tools
citation-role summary
citation-polarity summary
years
2026 13representative citing papers
ActGuide-RL uses human action data as plan-style guidance in mixed-policy RL to overcome exploration barriers in LLM agents, matching SFT+RL performance on search benchmarks without cold-start training.
This survey introduces the Generate-Filter-Control-Replay (GFCR) taxonomy to structure rollout pipelines for RL-based post-training of reasoning LLMs.
Agent systems lose uncertainty at decision handoffs, causing downstream over-trust; the paper proposes latent uncertainty as a carrier to preserve pre-commitment fragility across interfaces.
Rollout-level advantage-prioritized experience replay for GRPO recycles high-advantage individual rollouts with age eviction and fresh-anchored batches to outperform standard GRPO on math benchmarks, with gains increasing with model size.
RLVR exhibits correct-set turnover where solved problems regress during training, and a periodic review mechanism exploiting a repair-window principle improves retention and performance over baselines.
Pilot-Commit estimates per-prompt informativeness via a pilot stage and skips low-variance prompts, matching baseline accuracy with up to 4.0x fewer cumulative rollouts than DAPO on math reasoning tasks.
Controlled study shows mixed training curricula improve aggregate F1 on memory QA benchmarks while out-of-domain data transfers targeted skills like temporal reasoning, with per-question-type effects exceeding aggregate differences.
ConSPO is a new contrastive sequence-level policy optimization method that addresses GRPO limitations via length-normalized log-probability scores and InfoNCE-style objectives, outperforming baselines on reasoning benchmarks.
METIS internalizes curriculum judgment in LLM reinforcement fine-tuning by predicting within-prompt reward variance via in-context learning and jointly optimizing with a self-judgment reward, yielding superior performance and up to 67% faster convergence across math, code, and agent benchmarks.
Sample difficulty in RLVR shows non-monotonic effects on LLM reasoning, with easy/medium problems strengthening computation and reasoning features while hard problems often yield weak or harmful signals.
DVAO dynamically weights multi-objective advantages by rollout-group reward variance to bound magnitudes, add cross-objective regularization, and outperform static baselines on math and tool-use tasks with Qwen models.
VI-CuRL stabilizes verifier-independent RL for LLM reasoning via confidence-guided curriculum that reduces action and problem variance, with a claimed proof of asymptotic unbiasedness and empirical gains over baselines.
citing papers explorer
-
Transferability for General Reasoning: An Automated Curriculum for Multi-Domain RLVR
TAC is a bandit curriculum for multi-domain RLVR that prioritizes domains whose gradient updates align with and benefit other domains, yielding up to 2.8-point macro accuracy gains over learnability-only baselines on Qwen3-1.7B and Llama3.2-3B.
-
Learning Agentic Policy from Action Guidance
ActGuide-RL uses human action data as plan-style guidance in mixed-policy RL to overcome exploration barriers in LLM agents, matching SFT+RL performance on search benchmarks without cold-start training.
-
Generate, Filter, Control, Replay: A Comprehensive Survey of Rollout Strategies for LLM Reinforcement Learning
This survey introduces the Generate-Filter-Control-Replay (GFCR) taxonomy to structure rollout pipelines for RL-based post-training of reasoning LLMs.
-
Confidence Laundering in Agent Systems: Why Uncertainty Needs a Latent Carrier
Agent systems lose uncertainty at decision handoffs, causing downstream over-trust; the paper proposes latent uncertainty as a carrier to preserve pre-commitment fragility across interfaces.
-
Rollout-Level Advantage-Prioritized Experience Replay for GRPO
Rollout-level advantage-prioritized experience replay for GRPO recycles high-advantage individual rollouts with age eviction and fresh-anchored batches to outperform standard GRPO on math benchmarks, with gains increasing with model size.
-
Learning to Solve, Forgetting to Retain: Correct-Set Turnover in RLVR
RLVR exhibits correct-set turnover where solved problems regress during training, and a periodic review mechanism exploiting a repair-window principle improves retention and performance over baselines.
-
Spend Your Rollouts Where It Counts: Rollout Allocation for Group-Based RL Post-Training
Pilot-Commit estimates per-prompt informativeness via a pilot stage and skips low-variance prompts, matching baseline accuracy with up to 4.0x fewer cumulative rollouts than DAPO on math reasoning tasks.
-
What Training Data Teaches RL Memory Agents: An Empirical Study of Curriculum Effects in Memory-Augmented QA
Controlled study shows mixed training curricula improve aggregate F1 on memory QA benchmarks while out-of-domain data transfers targeted skills like temporal reasoning, with per-question-type effects exceeding aggregate differences.
-
Revisiting Reinforcement Learning with Verifiable Rewards from a Contrastive Perspective
ConSPO is a new contrastive sequence-level policy optimization method that addresses GRPO limitations via length-normalized log-probability scores and InfoNCE-style objectives, outperforming baselines on reasoning benchmarks.
-
Internalizing Curriculum Judgment for LLM Reinforcement Fine-Tuning
METIS internalizes curriculum judgment in LLM reinforcement fine-tuning by predicting within-prompt reward variance via in-context learning and jointly optimizing with a self-judgment reward, yielding superior performance and up to 67% faster convergence across math, code, and agent benchmarks.
-
Mechanistically Interpreting the Role of Sample Difficulty in RLVR for LLMs
Sample difficulty in RLVR shows non-monotonic effects on LLM reasoning, with easy/medium problems strengthening computation and reasoning features while hard problems often yield weak or harmful signals.
-
DVAO: Dynamic Variance-adaptive Advantage Optimization for Multi-reward Reinforcement Learning
DVAO dynamically weights multi-objective advantages by rollout-group reward variance to bound magnitudes, add cross-objective regularization, and outperform static baselines on math and tool-use tasks with Qwen models.
-
VI-CuRL: Stabilizing Verifier-Independent RL Reasoning via Confidence-Guided Variance Reduction
VI-CuRL stabilizes verifier-independent RL for LLM reasoning via confidence-guided curriculum that reduces action and problem variance, with a claimed proof of asymptotic unbiasedness and empirical gains over baselines.