VCRL: Variance-based Curriculum Reinforcement Learning for Large Language Models

Guochao Jiang, Wenfeng Feng, Guofeng Quan, Chuzhan Hao, Yuewei Zhang, Guohua Liu, Hao Wang · 2025 · arXiv 2509.19803

13 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background: 1 · baseline: 1

citation-polarity summary

background: 1 · baseline: 1

representative citing papers

Transferability for General Reasoning: An Automated Curriculum for Multi-Domain RLVR

cs.AI · 2026-06-23 · unverdicted · novelty 7.0

TAC is a bandit curriculum for multi-domain RLVR that prioritizes domains whose gradient updates align with and benefit other domains, yielding up to 2.8-point macro accuracy gains over learnability-only baselines on Qwen3-1.7B and Llama3.2-3B.

Learning Agentic Policy from Action Guidance

cs.CL · 2026-05-12 · unverdicted · novelty 7.0

ActGuide-RL uses human action data as plan-style guidance in mixed-policy RL to overcome exploration barriers in LLM agents, matching SFT+RL performance on search benchmarks without cold-start training.

Generate, Filter, Control, Replay: A Comprehensive Survey of Rollout Strategies for LLM Reinforcement Learning

cs.LG · 2026-04-08 · unverdicted · novelty 7.0

This survey introduces the Generate-Filter-Control-Replay (GFCR) taxonomy to structure rollout pipelines for RL-based post-training of reasoning LLMs.

Confidence Laundering in Agent Systems: Why Uncertainty Needs a Latent Carrier

cs.AI · 2026-06-09 · unverdicted · novelty 6.0

Agent systems lose uncertainty at decision handoffs, causing downstream over-trust; the paper proposes latent uncertainty as a carrier to preserve pre-commitment fragility across interfaces.

Rollout-Level Advantage-Prioritized Experience Replay for GRPO

cs.LG · 2026-06-03 · conditional · novelty 6.0

Rollout-level advantage-prioritized experience replay for GRPO recycles high-advantage individual rollouts with age eviction and fresh-anchored batches to outperform standard GRPO on math benchmarks, with gains increasing with model size.

Learning to Solve, Forgetting to Retain: Correct-Set Turnover in RLVR

cs.LG · 2026-06-02 · unverdicted · novelty 6.0

RLVR exhibits correct-set turnover where solved problems regress during training, and a periodic review mechanism exploiting a repair-window principle improves retention and performance over baselines.

Spend Your Rollouts Where It Counts: Rollout Allocation for Group-Based RL Post-Training

cs.LG · 2026-05-26 · unverdicted · novelty 6.0

Pilot-Commit estimates per-prompt informativeness via a pilot stage and skips low-variance prompts, matching baseline accuracy with up to 4.0x fewer cumulative rollouts than DAPO on math reasoning tasks.

What Training Data Teaches RL Memory Agents: An Empirical Study of Curriculum Effects in Memory-Augmented QA

cs.CL · 2026-05-21 · unverdicted · novelty 6.0

Controlled study shows mixed training curricula improve aggregate F1 on memory QA benchmarks while out-of-domain data transfers targeted skills like temporal reasoning, with per-question-type effects exceeding aggregate differences.

Revisiting Reinforcement Learning with Verifiable Rewards from a Contrastive Perspective

cs.LG · 2026-05-13 · unverdicted · novelty 6.0 · 3 refs

ConSPO is a new contrastive sequence-level policy optimization method that addresses GRPO limitations via length-normalized log-probability scores and InfoNCE-style objectives, outperforming baselines on reasoning benchmarks.

Internalizing Curriculum Judgment for LLM Reinforcement Fine-Tuning

cs.LG · 2026-05-11 · unverdicted · novelty 6.0

METIS internalizes curriculum judgment in LLM reinforcement fine-tuning by predicting within-prompt reward variance via in-context learning and jointly optimizing with a self-judgment reward, yielding superior performance and up to 67% faster convergence across math, code, and agent benchmarks.

Mechanistically Interpreting the Role of Sample Difficulty in RLVR for LLMs

cs.AI · 2026-05-27 · unverdicted · novelty 5.0

Sample difficulty in RLVR shows non-monotonic effects on LLM reasoning, with easy/medium problems strengthening computation and reasoning features while hard problems often yield weak or harmful signals.

DVAO: Dynamic Variance-adaptive Advantage Optimization for Multi-reward Reinforcement Learning

cs.CL · 2026-05-25 · unverdicted · novelty 5.0

DVAO dynamically weights multi-objective advantages by rollout-group reward variance to bound magnitudes, add cross-objective regularization, and outperform static baselines on math and tool-use tasks with Qwen models.

VI-CuRL: Stabilizing Verifier-Independent RL Reasoning via Confidence-Guided Variance Reduction

cs.LG · 2026-02-13 · unverdicted · novelty 5.0

VI-CuRL stabilizes verifier-independent RL for LLM reasoning via confidence-guided curriculum that reduces action and problem variance, with a claimed proof of asymptotic unbiasedness and empirical gains over baselines.

citing papers explorer

Showing 13 of 13 citing papers.

Transferability for General Reasoning: An Automated Curriculum for Multi-Domain RLVR
cs.AI · 2026-06-23 · unverdicted · none · ref 22

Learning Agentic Policy from Action Guidance
cs.CL · 2026-05-12 · unverdicted · none · ref 26

Generate, Filter, Control, Replay: A Comprehensive Survey of Rollout Strategies for LLM Reinforcement Learning
cs.LG · 2026-04-08 · unverdicted · none · ref 51

Confidence Laundering in Agent Systems: Why Uncertainty Needs a Latent Carrier
cs.AI · 2026-06-09 · unverdicted · none · ref 39

Rollout-Level Advantage-Prioritized Experience Replay for GRPO
cs.LG · 2026-06-03 · conditional · none · ref 28

Learning to Solve, Forgetting to Retain: Correct-Set Turnover in RLVR
cs.LG · 2026-06-02 · unverdicted · none · ref 109

Spend Your Rollouts Where It Counts: Rollout Allocation for Group-Based RL Post-Training
cs.LG · 2026-05-26 · unverdicted · none · ref 5

What Training Data Teaches RL Memory Agents: An Empirical Study of Curriculum Effects in Memory-Augmented QA
cs.CL · 2026-05-21 · unverdicted · none · ref 18

Revisiting Reinforcement Learning with Verifiable Rewards from a Contrastive Perspective
cs.LG · 2026-05-13 · unverdicted · none · ref 1 · 3 links

Internalizing Curriculum Judgment for LLM Reinforcement Fine-Tuning
cs.LG · 2026-05-11 · unverdicted · none · ref 8

Mechanistically Interpreting the Role of Sample Difficulty in RLVR for LLMs
cs.AI · 2026-05-27 · unverdicted · none · ref 14

DVAO: Dynamic Variance-adaptive Advantage Optimization for Multi-reward Reinforcement Learning
cs.CL · 2026-05-25 · unverdicted · none · ref 8

VI-CuRL: Stabilizing Verifier-Independent RL Reasoning via Confidence-Guided Variance Reduction
cs.LG · 2026-02-13 · unverdicted · none · ref 10
