pith. machine review for the scientific record.

arxiv: 2605.11611 · v2 · submitted 2026-05-12 · 💻 cs.AI

Recognition: 2 theorem links · Lean Theorem

CuSearch: Curriculum Rollout Sampling via Search Depth for Agentic RAG

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 06:05 UTC · model grok-4.3

classification 💻 cs.AI
keywords agentic RAG · curriculum rollout sampling · search depth · RLVR · retrieval supervision · GRPO · ZeroSearch

The pith

Search depth serves as an annotation-free proxy for supervision density, enabling curriculum rollout sampling that improves RLVR training for agentic RAG.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

In RLVR-based agentic RAG training, rollout trajectories vary in search depth and are not equally informative. Deeper trajectories contain more retrieval decision points, supplying denser supervision signals to the retrieval sub-policy. Uniform sampling overlooks this variation even as average depth rises during training. CuSearch reallocates a fixed update budget within each batch toward the deepest trajectories using Search-Depth Greedy Allocation, creating an implicit or explicit curriculum aligned with policy improvement.
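
A minimal sketch of that reallocation step, assuming each rollout carries an observed search-depth count; the function and field names are illustrative, not taken from the paper's code.

    from typing import Dict, List

    def sdga_select(rollouts: List[Dict], budget_k: int) -> List[Dict]:
        """Spend a fixed update budget K on the deepest-search rollouts in the batch.

        Each rollout dict is assumed to carry a 'search_depth' field: the number of
        retrieval calls (decision points) observed in that trajectory.
        """
        # Rank the N * G rollouts by observed search depth, deepest first,
        # and keep only the top K for the gradient update.
        ranked = sorted(rollouts, key=lambda r: r["search_depth"], reverse=True)
        return ranked[:budget_k]

    # Illustrative batch of 8 rollouts and a budget of 4 updates.
    batch = [{"id": i, "search_depth": d} for i, d in enumerate([0, 1, 3, 2, 5, 1, 4, 2])]
    selected = sdga_select(batch, budget_k=4)  # depths 5, 4, 3, 2 receive the updates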

Core claim

The central claim is that per-trajectory search depth functions as a reliable proxy for retrieval supervision density because deeper trajectories embed more retrieval decision points. CuSearch implements this via SDGA, a batch operator that greedily assigns updates to the deepest available rollouts. The auto variant always favors the currently deepest trajectories, while the phase variant raises the depth threshold once sufficiently many deep trajectories appear. This produces consistent gains of up to 11.8 exact-match points over standard GRPO on ZeroSearch, across model types and retrieval frameworks.
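
The difference between the two variants can be stated as a target-depth rule. A hedged sketch follows, with the abundance criterion min_fraction as an assumed illustrative parameter rather than a value reported in the paper.

    from typing import List

    def sdga_auto_target(depths: List[int]) -> int:
        # SDGA-Auto: always target the deepest bucket present in the current batch.
        return max(depths)

    def sdga_phase_target(depths: List[int], threshold: int, min_fraction: float = 0.5) -> int:
        # SDGA-Phase: raise the curriculum threshold once trajectories deeper than the
        # current threshold are sufficiently abundant; min_fraction is illustrative.
        deeper = sum(1 for d in depths if d > threshold)
        if deeper / len(depths) >= min_fraction:
            return threshold + 1
        return threshold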

What carries the argument

Search-Depth Greedy Allocation (SDGA), a batch-level operator that reallocates a fixed update budget toward deeper-search trajectories to create a training-aligned curriculum.

Load-bearing premise

Trajectories differ substantially in search depth, with deeper ones containing more retrieval decision points and therefore providing denser direct supervision for the retrieval sub-policy.
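
Under this premise, the supervision-density proxy is nothing more than a count of retrieval calls in the emitted trace. A minimal sketch, assuming the <search> … </search> tag format visible in the paper's reasoning-trace example; the parsing rule is an assumption, not the authors' parser.

    import re

    def search_depth(trajectory_text: str) -> int:
        # One retrieval decision point per emitted <search> query.
        return len(re.findall(r"<search>.*?</search>", trajectory_text, flags=re.DOTALL))

    trace = ("<think> need the stadium's operator </think> "
             "<search> what league includes the operating group of Al Janoub Stadium? </search> "
             "<information> Doc 1: Al Janoub Stadium, Al Wakrah, Qatar. </information>")
    assert search_depth(trace) == 1  # one retrieval decision point in this snippet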

What would settle it

An ablation in which deeper-search trajectories are deliberately undersampled while overall performance stays the same or improves would falsify the claim that search depth reliably indicates higher supervision density.
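
Operationally, that ablation is the mirror image of SDGA: spend the same update budget on the shallowest rollouts and check whether end performance degrades. A sketch under the same assumed rollout layout as above:

    def undersample_deep(rollouts, budget_k):
        # Control condition: deliberately spend the update budget on the
        # shallowest-search rollouts instead of the deepest ones.
        ranked = sorted(rollouts, key=lambda r: r["search_depth"])  # shallowest first
        return ranked[:budget_k]

    # If training with this selector matches or beats SDGA selection, search depth
    # is not a reliable proxy for retrieval supervision density.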

Figures

Figures reproduced from arXiv: 2605.11611 by Jianghan Shen, Jiashi Lin, Jing Xiong, Jiyao Liu, Junjun He, Siqi Luo, Xinyu Cheng, Yirong Chen, Yue Li.

Figure 1. (a) Average search count per trajectory increases over training under answer …
Figure 2. Overview of CuSearch. Top: a query generates N · G rollouts; SDGA then selects K by search depth for gradient updates. Bottom left: SDGA greedily allocates budget K across depth buckets by priority and capacity. Bottom right: SDGA-Auto always targets the deepest available bucket, with selections advancing deeper as the depth distribution shifts upward during training; SDGA-Phase explicitly advances the tar…
Figure 3. Training dynamics on ZeroSearch with Qwen2.5-3B. Panels (a–c) show EM on …
Figure 4. Training template. The question is appended to the end during training and …
Figure 5. Reasoning trace example for CuSearch.
original abstract

Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a promising paradigm for training agentic retrieval-augmented generation (RAG) systems from outcome-only supervision. Most existing methods optimize policies from uniformly sampled rollouts, implicitly treating all trajectories as equally informative. However, trajectories differ substantially in search depth and are therefore not equally informative: deeper-search trajectories contain more retrieval decision points and provide denser direct supervision for the retrieval sub-policy. Moreover, this heterogeneity grows over training as the within-batch depth distribution shifts toward higher values, yet uniform rollout sampling remains blind to this shift. To address this, we propose CuSearch, a curriculum rollout sampling framework built on Search-Depth Greedy Allocation (SDGA), a batch-level operator that reallocates a fixed update budget toward deeper-search trajectories. SDGA-Auto always targets the deepest available trajectories in the current batch, yielding an implicit training-aligned curriculum as the depth distribution shifts upward. SDGA-Phase explicitly advances the curriculum threshold as deeper trajectories become sufficiently abundant. Experiments across model types and retrieval frameworks show that CuSearch consistently improves performance, achieving up to 11.8 exact-match points over standard GRPO on ZeroSearch. These results establish per-trajectory search depth as a reliable, annotation-free proxy for retrieval supervision density in RLVR-based agentic RAG training.

Editorial analysis

A structured set of objections, weighed in public.

Referee report, simulated author's rebuttal, circularity check, and an axiom & free-parameter ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that in RLVR training for agentic RAG, trajectories vary in search depth and deeper ones supply denser retrieval supervision; it introduces CuSearch with Search-Depth Greedy Allocation (SDGA) to reallocate a fixed update budget toward deeper-search rollouts, either via SDGA-Auto (always deepest) or SDGA-Phase (curriculum threshold). Experiments report consistent gains, reaching +11.8 exact-match points over GRPO on ZeroSearch across model types and retrieval frameworks, positioning per-trajectory search depth as an annotation-free proxy for supervision density.

Significance. If the empirical gains are robust, the work supplies a lightweight, training-aligned curriculum operator that exploits an observable trajectory property without extra annotations or fitted parameters. This could improve sample efficiency in outcome-supervised agentic RL settings where exploration depth naturally increases during training.

major comments (2)
  1. [Experiments] Experiments section: the abstract states 'consistent gains across model types and frameworks' and 'up to 11.8 exact-match points,' yet no statistical significance tests, number of random seeds, variance across runs, or exact rollout-sampling protocol (temperature, max depth, batch size) are described, leaving the central performance claim only partially supported.
  2. [Method] Method / SDGA definition: the allocation rule is defined directly from observed search depth, but the manuscript provides no correlation analysis, ablation, or gradient-variance study separating depth from total trajectory length. If depth is collinear with length, SDGA primarily up-weights longer sequences rather than enriching retrieval-specific supervision, undermining the proxy claim.
minor comments (2)
  1. [Method] Notation for SDGA-Auto and SDGA-Phase should be introduced with explicit pseudocode or equations rather than prose descriptions only.
  2. [Abstract] The abstract and introduction both state the same performance numbers; consolidate to avoid repetition.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback. We address each major comment point by point below, agreeing where revisions are needed to strengthen the empirical support and methodological clarity. We will incorporate the suggested additions in the revised manuscript.

point-by-point responses
  1. Referee: [Experiments] Experiments section: the abstract states 'consistent gains across model types and frameworks' and 'up to 11.8 exact-match points,' yet no statistical significance tests, number of random seeds, variance across runs, or exact rollout-sampling protocol (temperature, max depth, batch size) are described, leaving the central performance claim only partially supported.

    Authors: We agree that the current manuscript lacks sufficient details on reproducibility and statistical rigor, which weakens the central performance claims. In the revised version, we will add the following: results averaged over three independent random seeds with reported standard deviations; paired t-tests or Wilcoxon tests for statistical significance between CuSearch variants and GRPO baselines; and a complete description of the rollout-sampling protocol, including temperature (0.7), maximum search depth (5), batch size (32), and other hyperparameters used across all experiments and frameworks. These changes will make the reported gains (including the +11.8 exact-match improvement) fully verifiable. revision: yes

  2. Referee: [Method] Method / SDGA definition: the allocation rule is defined directly from observed search depth, but the manuscript provides no correlation analysis, ablation, or gradient-variance study separating depth from total trajectory length. If depth is collinear with length, SDGA primarily up-weights longer sequences rather than enriching retrieval-specific supervision, undermining the proxy claim.

    Authors: We acknowledge this is a valid concern: without explicit analysis, it is possible that search depth correlates with trajectory length, which could mean SDGA is partly rewarding longer sequences rather than retrieval decision density. The manuscript currently relies on the conceptual argument that deeper trajectories contain more retrieval decision points, but does not quantify independence from length. We will therefore add (i) a Pearson correlation analysis between per-trajectory search depth and total length across training batches, and (ii) a controlled ablation that samples equal-length trajectories while varying depth. These additions will either confirm the proxy or clarify its limits; we do not claim the current evidence already separates the two factors. revision: yes
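
A hedged sketch of the two analyses promised above, seed-level significance testing and the depth-versus-length correlation, with placeholder numbers standing in for results that would come from the promised reruns:

    import numpy as np
    from scipy import stats

    # Hypothetical per-seed exact-match scores (three seeds each).
    grpo_em = np.array([38.2, 37.5, 39.0])
    cusearch_em = np.array([44.1, 43.0, 45.6])

    t_stat, p_t = stats.ttest_rel(cusearch_em, grpo_em)   # paired t-test across seeds
    w_stat, p_w = stats.wilcoxon(cusearch_em, grpo_em)    # Wilcoxon signed-rank (small n)

    # Placeholder per-rollout statistics for the depth-vs-length check.
    search_depths = np.array([0, 1, 3, 2, 5, 1, 4, 2])                  # retrieval calls
    token_lengths = np.array([180, 260, 510, 390, 820, 240, 700, 410])  # total tokens

    r, p_r = stats.pearsonr(search_depths, token_lengths)
    print(f"paired t p={p_t:.3f}, Wilcoxon p={p_w:.3f}, depth-length Pearson r={r:.2f}")
    # A Pearson r close to 1 would mean SDGA largely up-weights longer sequences,
    # so the promised equal-length ablation would have to carry the proxy claim.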

Circularity Check

0 steps flagged

No circularity: search depth is used as a directly observed proxy and is not derived from the method's own outputs

full rationale

The paper's core operator, SDGA, reallocates the sampling budget based explicitly on per-trajectory search depth observed in the current batch. The premise that deeper trajectories contain more retrieval decision points is stated as an empirical observation rather than derived from any equation, fitted parameter, or self-citation chain. No load-bearing step reduces by construction to its own outputs; the curriculum is an explicit design choice grounded in the stated heterogeneity of trajectories. The derivation is self-contained, and performance is assessed against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on one domain assumption about the relationship between search depth and supervision density; no free parameters or new entities are introduced.

axioms (1)
  • domain assumption Deeper-search trajectories contain more retrieval decision points and provide denser direct supervision for the retrieval sub-policy
    This premise directly justifies reallocating the update budget away from shallow trajectories.

pith-pipeline@v0.9.0 · 5564 in / 1149 out tokens · 35988 ms · 2026-05-15T06:05:13.458153+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

24 extracted references · 24 canonical work pages · 11 internal anchors

  1. [1]

    Learning to reason with search for llms via reinforcement learning

    Mingyang Chen, Linzhuang Sun, Tianpeng Li, Haoze Sun, Yijie Zhou, Chenzheng Zhu, Haofen Wang, Jeff Z Pan, Wen Zhang, Huajun Chen, et al. Learning to reason with search for llms via reinforcement learning. 2025a. Xuanzhong Chen, Zile Qiao, Guoxin Chen, Liangcai Su, Zhen Zhang, Xinyu Wang, Pengjun Xie, Fei Huang, Jingren Zhou, and Yong Jiang. Agentfrontier:...

  2. [2]

    Actor-curator: Co-adaptive curriculum learning via policy-improvement bandits for rl post-training

    Zhengyao Gu, Jonathan Light, Raul Astudillo, Ziyu Ye, Langzhou He, Henry Peng Zou, Wei Cheng, Santiago Paternain, Philip S Yu, and Yisong Yue. Actor-curator: Co-adaptive curriculum learning via policy-improvement bandits for rl post-training. arXiv preprint arXiv:2602.20532,

  3. [3]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948,

  4. [4]

    Search-R1: Training LLMs to Reason and Leverage Search Engines with Reinforcement Learning

    Bowen Jin, Hansi Zeng, Zhenrui Yue, Jinsung Yoon, Sercan Arik, Dong Wang, Hamed Zamani, and Jiawei Han. Search-r1: Training llms to reason and leverage search engines with reinforcement learning. arXiv preprint arXiv:2503.09516,

  5. [5]

    No prompt left behind: Exploiting zero-variance prompts in llm reinforcement learning via entropy-guided advantage shaping

    Thanh-Long V Le, Myeongho Jeon, Kim Vu, Viet Lai, and Eunho Yang. No prompt left behind: Exploiting zero-variance prompts in llm reinforcement learning via entropy-guided advantage shaping. arXiv preprint arXiv:2509.21880,

  6. [6]

    Explore data left behind in reinforcement learning for reasoning language models

    Chenxi Liu, Junjie Liang, Yuqi Jia, Bochuan Cao, Yang Bai, Heng Huang, and Xun Chen. Explore data left behind in reinforcement learning for reasoning language models. arXiv preprint arXiv:2511.04800,

  7. [7]

    The Llama 3 Herd of Models

    Llama Team. The Llama 3 Herd of Models. arXiv preprint arXiv:2407.21783,

  8. [8]

    WebGPT: Browser-assisted question-answering with human feedback

    Reiichiro Nakano, Jacob Hilton, Suchir Balaji, Jeff Wu, Long Ouyang, Christina Kim, Christopher Hesse, Shantanu Jain, Vineet Kosaraju, William Saunders, et al. Webgpt: Browser-assisted question-answering with human feedback. arXiv preprint arXiv:2112.09332,

  9. [9]

    Measuring and narrowing the compositionality gap in language models

    Ofir Press, Muru Zhang, Sewon Min, Ludwig Schmidt, Noah A Smith, and Mike Lewis. Measuring and narrowing the compositionality gap in language models. In Findings of the Association for Computational Linguistics: EMNLP 2023, pp. 5687–5711,

  10. [10]

    Qwen2.5 Technical Report

    Qwen Team. Qwen2.5 Technical Report. arXiv preprint arXiv:2412.15115,

  11. [11]

    Proximal Policy Optimization Algorithms

    John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347,

  12. [12]

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300,

  13. [13]

    Agentic Retrieval-Augmented Generation: A Survey on Agentic RAG

    Aditi Singh, Abul Ehtesham, Saket Kumar, and Tala Talaei Khoei. Agentic retrieval-augmented generation: A survey on agentic rag. arXiv preprint arXiv:2501.09136,

  14. [14]

    R1-Searcher: Incentivizing the Search Capability in LLMs via Reinforcement Learning

    Huatong Song, Jinhao Jiang, Yingqian Min, Jie Chen, Zhipeng Chen, Wayne Xin Zhao, Lei Fang, and Ji-Rong Wen. R1-searcher: Incentivizing the search capability in llms via reinforcement learning. arXiv preprint arXiv:2503.05592,

  15. [15]

    Zerosearch: Incentivize the search capability of llms without searching, 2025

    Hao Sun, Zile Qiao, Jiayan Guo, Xuanbo Fan, Yingyan Hou, Yong Jiang, Pengjun Xie, Yan Zhang, Fei Huang, and Jingren Zhou. Zerosearch: Incentivize the search capability of llms without searching. arXiv preprint arXiv:2505.04588,

  16. [16]

    Tongyi deepresearch technical report

    Tongyi DeepResearch Team, Baixuan Li, Bo Zhang, Dingchu Zhang, Fei Huang, Guangyu Li, Guoxin Chen, Huifeng Yin, Jialong Wu, Jingren Zhou, et al. Tongyi deepresearch technical report. arXiv preprint arXiv:2510.24701,

  17. [17]

    Heapa: Difficulty-aware heap sampling and on-policy query augmentation for llm reinforcement learning

    Weiqi Wang, Xin Liu, Binxuan Huang, Hejie Cui, Rongzhi Zhang, Changlong Yu, Shuowei Jin, Jingfeng Yang, Qingyu Yin, Zhengyang Wang, et al. Heapa: Difficulty-aware heap sampling and on-policy query augmentation for llm reinforcement learning. arXiv preprint arXiv:2601.22448,

  18. [18]

    Depth-Breadth Synergy in RLVR: Unlocking LLM Reasoning Gains with Adaptive Exploration

    Zhicheng Yang, Zhijiang Guo, Yinya Huang, Yongxin Wang, Dongchun Xie, Hanhui Li, Yiwei Wang, Xiaodan Liang, and Jing Tang. Depth-breadth synergy in rlvr: Unlocking llm reasoning gains with adaptive exploration. arXiv preprint arXiv:2508.13755,

  19. [19]

    Hotpotqa: A dataset for diverse, explainable multi-hop question answering

    Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William Cohen, Ruslan Salakhutdinov, and Christopher D Manning. Hotpotqa: A dataset for diverse, explainable multi-hop question answering. In Proceedings of the 2018 conference on empirical methods in natural language processing, pp. 2369–2380,

  20. [20]

    DAPO: An Open-Source LLM Reinforcement Learning System at Scale

    Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Weinan Dai, Tiantian Fan, Gaohong Liu, Lingjun Liu, et al. Dapo: An open-source llm reinforcement learning system at scale. arXiv preprint arXiv:2503.14476,

  21. [21]

    Process vs. outcome reward: Which is better for agentic RAG reinforcement learning

    Wenlin Zhang, Xiangyang Li, Kuicai Dong, Yichao Wang, Pengyue Jia, Xiaopeng Li, Yingyi Zhang, Derong Xu, Zhaocheng Du, Huifeng Guo, et al. Process vs. outcome reward: Which is better for agentic rag reinforcement learning. arXiv preprint arXiv:2505.14069, 2025a. Yuheng Zhang, Wenlin Yao, Changlong Yu, Yao Liu, Qingyu Yin, Bing Yin, Hyokun Yun, and Lihong L...
