Title resolution pending

Le, T · 2025 · arXiv 2509.21880

4 Pith papers cite this work. Polarity classification is still indexing.

Title metadata for this work has not finished resolving. The hub is built from the citation graph; the title resolver retries DOI and OpenAlex on its next pass.

citation-role summary

background 2

citation-polarity summary

background 2

representative citing papers

CuSearch: Curriculum Rollout Sampling via Search Depth for Agentic RAG

cs.AI · 2026-05-12 · unverdicted · novelty 7.0 · 2 refs

CuSearch reallocates rollout budget in RLVR toward deeper-search trajectories as a proxy for retrieval supervision density, yielding up to 11.8 exact-match gains over uniform GRPO sampling on ZeroSearch.

Selective Off-Policy Reference Tuning with Plan Guidance

cs.AI · 2026-05-12 · unverdicted · novelty 6.0 · 2 refs

SORT turns all-wrong prompts into selective learning signals by weighting tokens more predictable under plan guidance from reference solutions, improving over GRPO on reasoning benchmarks especially for weaker models.

MCPO: Mastery-Consolidated Policy Optimization for Large Reasoning Models

cs.AI · 2026-04-18 · unverdicted · novelty 5.0

MCPO fixes vanishing training signals and shrinking weights in GRPO by using a hinge-KL regularizer on mastered prompts and prioritizing majority-correct prompts, yielding higher pass@1 and pass@k on math tasks.

Advantage Collapse in Group Relative Policy Optimization: Diagnosis and Mitigation

cs.LG · 2026-05-20

citing papers explorer

Showing 4 of 4 citing papers.

CuSearch: Curriculum Rollout Sampling via Search Depth for Agentic RAG · cs.AI · 2026-05-12 · unverdicted · none · ref 5 · 2 links
Selective Off-Policy Reference Tuning with Plan Guidance · cs.AI · 2026-05-12 · unverdicted · none · ref 15 · 2 links
MCPO: Mastery-Consolidated Policy Optimization for Large Reasoning Models · cs.AI · 2026-04-18 · unverdicted · none · ref 28
Advantage Collapse in Group Relative Policy Optimization: Diagnosis and Mitigation · cs.LG · 2026-05-20 · unreviewed · ref 21
