{"total":13,"items":[{"citing_arxiv_id":"2606.25178","ref_index":22,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Transferability for General Reasoning: An Automated Curriculum for Multi-Domain RLVR","primary_cat":"cs.AI","submitted_at":"2026-06-23T21:10:29+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"TAC is a bandit curriculum for multi-domain RLVR that prioritizes domains whose gradient updates align with and benefit other domains, yielding up to 2.8-point macro accuracy gains over learnability-only baselines on Qwen3-1.7B and Llama3.2-3B.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.20662","ref_index":39,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Confidence Laundering in Agent Systems: Why Uncertainty Needs a Latent Carrier","primary_cat":"cs.AI","submitted_at":"2026-06-09T23:13:53+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Agent systems lose uncertainty at decision handoffs, causing downstream over-trust; the paper proposes latent uncertainty as a carrier to preserve pre-commitment fragility across interfaces.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.04560","ref_index":28,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Rollout-Level Advantage-Prioritized Experience Replay for GRPO","primary_cat":"cs.LG","submitted_at":"2026-06-03T07:47:47+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Rollout-level advantage-prioritized experience replay for GRPO recycles high-advantage individual rollouts with age eviction and fresh-anchored batches to outperform standard GRPO on math benchmarks, with 
gains increasing with model size.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.03087","ref_index":109,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Learning to Solve, Forgetting to Retain: Correct-Set Turnover in RLVR","primary_cat":"cs.LG","submitted_at":"2026-06-02T03:17:34+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"RLVR exhibits correct-set turnover where solved problems regress during training, and a periodic review mechanism exploiting a repair-window principle improves retention and performance over baselines.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.28388","ref_index":14,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Mechanistically Interpreting the Role of Sample Difficulty in RLVR for LLMs","primary_cat":"cs.AI","submitted_at":"2026-05-27T12:25:57+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Sample difficulty in RLVR shows non-monotonic effects on LLM reasoning, with easy/medium problems strengthening computation and reasoning features while hard problems often yield weak or harmful signals.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.26606","ref_index":5,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Spend Your Rollouts Where It Counts: Rollout Allocation for Group-Based RL Post-Training","primary_cat":"cs.LG","submitted_at":"2026-05-26T06:41:13+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Pilot-Commit estimates per-prompt informativeness via a pilot stage and skips low-variance prompts, matching baseline accuracy 
with up to 4.0x fewer cumulative rollouts than DAPO on math reasoning tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.25604","ref_index":8,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"DVAO: Dynamic Variance-adaptive Advantage Optimization for Multi-reward Reinforcement Learning","primary_cat":"cs.CL","submitted_at":"2026-05-25T08:55:16+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"DVAO dynamically weights multi-objective advantages by rollout-group reward variance to bound magnitudes, add cross-objective regularization, and outperform static baselines on math and tool-use tasks with Qwen models.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.23067","ref_index":18,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"What Training Data Teaches RL Memory Agents: An Empirical Study of Curriculum Effects in Memory-Augmented QA","primary_cat":"cs.CL","submitted_at":"2026-05-21T21:58:10+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Controlled study shows mixed training curricula improve aggregate F1 on memory QA benchmarks while out-of-domain data transfers targeted skills like temporal reasoning, with per-question-type effects exceeding aggregate differences.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.12969","ref_index":1,"ref_count":3,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Revisiting Reinforcement Learning with Verifiable Rewards from a Contrastive 
Perspective","primary_cat":"cs.LG","submitted_at":"2026-05-13T04:02:36+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"ConSPO is a new contrastive sequence-level policy optimization method that addresses GRPO limitations via length-normalized log-probability scores and InfoNCE-style objectives, outperforming baselines on reasoning benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.12004","ref_index":26,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Learning Agentic Policy from Action Guidance","primary_cat":"cs.CL","submitted_at":"2026-05-12T11:54:23+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"ActGuide-RL uses human action data as plan-style guidance in mixed-policy RL to overcome exploration barriers in LLM agents, matching SFT+RL performance on search benchmarks without cold-start training.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Since effective agentic RL strongly depends on the base model to explore valid training signals, existing methods often rely on a cold-start before RL or on alternating SFT and RL to dynamically align the model capabilities with the target tasks [4, 6, 11, 13, 44]. Some works instead adopt dynamic task scheduling [20, 71, 76] or curriculum learning [26, 30] to ensure that the difficulty of training tasks is well matched to the evolving capabilities of the model. 
A line of work most closely related to ours constructs curriculum learning examples from existing SFT data [59, 72], or directly uses this data as hints to guide the model toward obtaining meaningful learning signals on hard tasks [22, 42, 64]."},{"citing_arxiv_id":"2605.11235","ref_index":8,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Internalizing Curriculum Judgment for LLM Reinforcement Fine-Tuning","primary_cat":"cs.LG","submitted_at":"2026-05-11T20:50:29+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"METIS internalizes curriculum judgment in LLM reinforcement fine-tuning by predicting within-prompt reward variance via in-context learning and jointly optimizing with a self-judgment reward, yielding superior performance and up to 67% faster convergence across math, code, and agent benchmarks.","context_count":1,"top_context_role":"baseline","top_context_polarity":"baseline","context_text":"insurmountable prompts may become informative, while once-useful prompts eventually become saturated. Existing curricula address this through various forms of adaptation (see Fig. 1): some rely on predefined coarse prompt groups based on domains or data sources [11, 12], others apply hand-coded heuristics over rollout statistics or offline difficulty labels [8, 13-16], and the rest train a separate model to score such informativeness [9, 10]. Crucially, across all these paradigms, curriculum judgment remains external to the policy. 
But because the competence frontier moves with the policy as it learns, accurate judgment requires tracking the policy's current state; thus, we argue that the policy itself is the entity most directly aware of that state."},{"citing_arxiv_id":"2605.02913","ref_index":51,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Generate, Filter, Control, Replay: A Comprehensive Survey of Rollout Strategies for LLM Reinforcement Learning","primary_cat":"cs.LG","submitted_at":"2026-04-08T00:53:29+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"This survey introduces the Generate-Filter-Control-Replay (GFCR) taxonomy to structure rollout pipelines for RL-based post-training of reasoning LLMs.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2602.12579","ref_index":10,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"VI-CuRL: Stabilizing Verifier-Independent RL Reasoning via Confidence-Guided Variance Reduction","primary_cat":"cs.LG","submitted_at":"2026-02-13T03:40:52+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"VI-CuRL stabilizes verifier-independent RL for LLM reasoning via confidence-guided curriculum that reduces action and problem variance, with a claimed proof of asymptotic unbiasedness and empirical gains over baselines.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}