Your group-relative advantage is biased
3 Pith papers cite this work. Polarity classification is still indexing.
years: 2026 (3)
verdicts: UNVERDICTED (3)
roles: background (1)
polarities: background (1)
citing papers explorer
- AutoSearch: Adaptive Search Depth for Efficient Agentic RAG via Reinforcement Learning
  AutoSearch applies RL with a self-answering reward to adaptively determine the minimal sufficient search depth in agentic RAG, reducing over-searching while maintaining answer quality on complex questions.
- Policy Improvement Reinforcement Learning
  PIRL maximizes cumulative policy improvement across iterations instead of surrogate rewards and is proven to be aligned with final performance; PIPO implements it via retrospective verification for stable closed-loop optimization.
- Advantage Collapse in Group Relative Policy Optimization: Diagnosis and Mitigation
  The paper shows that advantage collapse in GRPO causes training stagnation on math reasoning benchmarks and proposes AVSPO, which uses real-time monitoring to inject virtual reward samples, reducing collapse while improving accuracy by 4-6 points.