HiPRAG: Hierarchical Process Rewards for Efficient Agentic Retrieval Augmented Generation

Kaiyu He; Kun Wan; Mian Zhang; Peilin Wu; Wentian Zhao; Xinya Du; Zhiyu Chen

arXiv: 2510.07794 · v2 · submitted 2025-10-09 · 💻 cs.CL · cs.AI · cs.LG

Peilin Wu, Mian Zhang, Kun Wan, Wentian Zhao, Kaiyu He, Xinya Du, Zhiyu Chen

Pith reviewed 2026-05-18 09:23 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.LG

keywords agentic RAG · hierarchical process rewards · reinforcement learning · search efficiency · question answering · LLM agents · over-search reduction · retrieval augmented generation

The pith

Hierarchical process rewards train agentic RAG models to evaluate each search decision on the fly and reduce over-search while raising accuracy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to fix widespread suboptimal search in agentic RAG systems, where models retrieve information they already know or skip needed searches, causing extra cost and shaky answers. It replaces coarse outcome-based rewards in reinforcement learning with a hierarchical process reward that breaks the reasoning path into clear steps and scores how often each step makes the right search-or-not choice. If the method works, agents would search only when necessary, producing more reliable answers at lower overhead. The authors test this on Qwen2.5 and Llama-3.2 models across seven QA benchmarks and report gains in both accuracy and search efficiency.

Core claim

HiPRAG adds a knowledge-grounded hierarchical process reward to RL training for agentic RAG. The reward first decomposes the agent's reasoning trajectory into discrete parsable steps, then adds a bonus proportional to the fraction of optimal search and non-search steps on top of standard outcome and format rewards. This fine-grained signal guides the model to make better search decisions during generation. Experiments show the approach yields 65.4 percent average accuracy on 3B models and 67.2 percent on 7B models while cutting the over-search rate to 2.3 percent and also lowering under-search.
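
To make the shape of this reward concrete, here is a minimal sketch of how a bonus proportional to the fraction of optimal steps could sit on top of outcome and format rewards. The weights, the `Step` type, and the gating rule (the bonus is paid only when the answer is correct and the format parses) are illustrative assumptions, not values or rules taken from the paper.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Step:
    used_search: bool  # True if the step issued a retrieval call
    optimal: bool      # True if the search / no-search choice was judged correct


def hierarchical_reward(
    answer_correct: bool,
    format_ok: bool,
    steps: List[Step],
    format_weight: float = 0.2,   # hypothetical weight
    process_weight: float = 0.4,  # hypothetical weight
) -> float:
    """Outcome + format reward, plus a bonus scaled by the share of optimal steps."""
    outcome = 1.0 if answer_correct else 0.0
    fmt = format_weight if format_ok else 0.0
    base = outcome + fmt
    # Assumed gating: the process bonus is only added when the answer is
    # correct, the format is parsable, and at least one step exists.
    if not (answer_correct and format_ok and steps):
        return base
    optimal_fraction = sum(s.optimal for s in steps) / len(steps)
    return base + process_weight * optimal_fraction
```

With three optimal steps out of four, a correct answer, and a valid format, this sketch yields the base reward plus three quarters of the process weight. Other gating choices are possible and would change how strongly the bonus competes with the outcome term.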

What carries the argument

A hierarchical process reward function that decomposes reasoning trajectories into steps and awards a bonus proportional to the share of optimal search decisions.

If this is right

Average accuracy reaches 65.4 percent for 3B models and 67.2 percent for 7B models on seven diverse QA benchmarks.
Over-search rate falls to 2.3 percent while under-search rate also declines.
Efficiency gains appear alongside accuracy gains because unnecessary retrievals are avoided.
The same improvements hold when the method is applied to different RL algorithms, model families, sizes, and types.
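
The over-search and under-search rates quoted above can be read as simple ratios over labeled steps. The sketch below assumes one plausible definition: over-search rate is the share of search steps judged unnecessary, and under-search rate is the share of non-search steps that should have searched. The paper's exact denominators are not given on this page, so treat both definitions and the function name as assumptions.

```python
from typing import List, Tuple


def search_rates(steps: List[Tuple[bool, bool]]) -> Tuple[float, float]:
    """Return (over_search_rate, under_search_rate).

    Each step is a (used_search, optimal) pair. A non-optimal search step
    counts as over-search; a non-optimal non-search step counts as under-search.
    """
    search = [optimal for used, optimal in steps if used]
    non_search = [optimal for used, optimal in steps if not used]
    over = sum(not o for o in search) / len(search) if search else 0.0
    under = sum(not o for o in non_search) / len(non_search) if non_search else 0.0
    return over, under
```

Returning 0.0 when a category is empty is a convenience choice for this sketch; an evaluation script might prefer to skip such trajectories instead.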

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same step-wise reward structure could guide other sequential decision agents such as tool-use or multi-hop planners.
Lower search overhead may translate directly into reduced inference cost when these agents are deployed at scale.
Process-level bonuses might complement outcome rewards in any domain where intermediate actions have measurable optimality.

Load-bearing premise

The necessity of each search decision can be reliably evaluated on-the-fly by decomposing the agent's reasoning trajectory into discrete, parsable steps and applying a knowledge-grounded process reward to judge optimality.
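
As an illustration of what "discrete, parsable steps" might look like in practice, the following sketch splits a tagged trajectory into steps and records whether each one issued a search. The `<step>` and `<search>` tag names and the returned dictionary layout are hypothetical; the paper's actual tagging scheme may differ.

```python
import re
from typing import Dict, List, Optional

# Hypothetical tag names, chosen only for illustration.
STEP_RE = re.compile(r"<step>(.*?)</step>", re.DOTALL)
SEARCH_RE = re.compile(r"<search>(.*?)</search>", re.DOTALL)


def parse_steps(trajectory: str) -> List[Dict[str, Optional[object]]]:
    """Split a tagged trajectory into steps and flag the ones that searched."""
    steps = []
    for body in STEP_RE.findall(trajectory):
        match = SEARCH_RE.search(body)
        steps.append(
            {
                "used_search": match is not None,
                "query": match.group(1).strip() if match else None,
                "text": body.strip(),
            }
        )
    return steps
```

A trajectory that does not follow the tag format yields an empty list here, which is one simple way a format reward could detect unparsable output.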

What would settle it

If models trained with HiPRAG still show over-search rates above 10 percent or no accuracy improvement over plain outcome-based RL on the same seven benchmarks, the claim that the hierarchical reward improves search optimality would be falsified.
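
The falsification test above can be written as a small check. This is only a restatement of the criterion in code; the function name is hypothetical, and any baseline accuracy passed in would have to come from an outcome-only RL run on the same benchmarks.

```python
def claim_survives(
    over_search_rate: float,
    hiprag_accuracy: float,
    baseline_accuracy: float,
    max_over_search: float = 0.10,  # the 10 percent threshold stated above
) -> bool:
    """True unless over-search exceeds the threshold or accuracy fails to improve."""
    falsified = (
        over_search_rate > max_over_search
        or hiprag_accuracy <= baseline_accuracy
    )
    return not falsified
```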

Figures

Figures reproduced from arXiv: 2510.07794 by Kaiyu He, Kun Wan, Mian Zhang, Peilin Wu, Wentian Zhao, Xinya Du, Zhiyu Chen.

**Figure 2.** Reward curves for different RL algorithms and curves of the ratio of searches among all …

**Figure 3.** Comparison of reasoning trajectory formats for the same multi-hop question. Each logical …

**Figure 4.** Input prompt for generating HiPRAG’s parsable output format with the new XML tagging …

**Figure 5.** Prompt for Over-search Detection.

**Figure 6.** Prompt for Under-search Detection. Excerpt: "You are an expert Fact-Checker and Logic Verifier. Your task is to evaluate a single, isolated reasoning step from an AI agent. This step was generated without using a search tool. Your goal is to determine if the agent made a mistake by not searching, based only on the information within this single step and your own general knowledge. Ana…"

**Figure 7.** Case study: Baseline reasoning trajectory. The model has five unnecessary search steps …

**Figure 8.** Case study: HiPRAG-trained reasoning trajectory. The model correctly identifies the …

Original abstract

Agentic RAG is a powerful technique for incorporating external information that LLMs lack, enabling better problem solving and question answering. However, suboptimal search behaviors exist widely, such as over-search (retrieving information already known) and under-search (failing to search when necessary), which leads to unnecessary overhead and unreliable outputs. Current training methods, which typically rely on outcome-based rewards in a RL framework, lack the fine-grained control needed to address these inefficiencies. To overcome this, we introduce Hierarchical Process Rewards for Efficient agentic RAG (HiPRAG), a training methodology that incorporates a fine-grained, knowledge-grounded process reward into the RL training. Our approach evaluates the necessity of each search decision on-the-fly by decomposing the agent's reasoning trajectory into discrete, parsable steps. We then apply a hierarchical reward function that provides an additional bonus based on the proportion of optimal search and non-search steps, on top of commonly used outcome and format rewards. Experiments on the Qwen2.5 and Llama-3.2 models across seven diverse QA benchmarks show that our method achieves average accuracies of 65.4% (3B) and 67.2% (7B). This is accomplished while improving search efficiency, reducing the over-search rate to just 2.3% and concurrently lowering the under-search rate. These results demonstrate the efficacy of optimizing the reasoning process itself, not just the final outcome. Further experiments and analysis demonstrate that HiPRAG shows good generalizability across a wide range of RL algorithms, model families, sizes, and types. This work demonstrates the importance and potential of fine-grained control through RL, for improving the efficiency and optimality of reasoning for search agents.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

HiPRAG layers a hierarchical process reward onto RL for agentic RAG and reports lower over-search with steady accuracy, but the reliability of its automatic step decomposition and optimality labels is not shown.

The main thing to know is that HiPRAG adds a process-level bonus during RL training for search agents. It decomposes reasoning trajectories into steps, labels each search or non-search decision as optimal using a knowledge-grounded check, and rewards the overall proportion of good decisions on top of standard outcome and format rewards. Experiments on Qwen2.5 and Llama-3.2 models across seven QA benchmarks show average accuracies of 65.4% at 3B and 67.2% at 7B, with over-search dropping to 2.3% and under-search also reduced. They further claim the gains hold across RL algorithms and model families.

This is a practical attempt to move beyond pure outcome rewards for tool-use efficiency. The empirical scope is decent: multiple model sizes, diverse benchmarks, and some cross-algorithm checks give a reader concrete numbers to look at. The core idea of scoring individual decisions rather than just final answers is a clear extension of prior outcome-only RL setups for RAG agents.

The soft spot is exactly where the stress-test note points. The whole method rests on two unvalidated pieces: whether free-form trajectories can be parsed into discrete steps reliably, and whether the knowledge-grounded optimality labels match what a human would call correct. No parsing accuracy, inter-annotator numbers, or correlation with human judgments appear in the abstract, and the provided details do not fill that gap. If those labels are noisy, the extra bonus term may not deliver a meaningfully better training signal than outcome rewards already do. The reported accuracy and efficiency numbers are still useful to see, but they do not yet prove the hierarchical reward is the cause.

This paper is for people working on training efficient retrieval agents or tool-calling systems. A reader who cares about practical RL fixes for over-search would find the benchmarks and generalizability claims worth examining.

It has enough concrete results and a focused problem to deserve a serious referee, even with the labeling validation gap. I would send it to peer review so the authors can supply the missing checks on the decomposition and labeling steps.

Referee Report

1 major / 0 minor

Summary. The paper introduces HiPRAG, a method for training agentic RAG systems via RL that augments standard outcome and format rewards with a hierarchical process reward. The process reward is obtained by automatically decomposing the agent's free-form reasoning trajectory into discrete parsable steps and labeling each search or non-search decision as optimal according to a knowledge-grounded criterion; an additional bonus is then awarded proportional to the fraction of optimal decisions. Experiments on Qwen2.5 and Llama-3.2 models across seven QA benchmarks report average accuracies of 65.4% (3B) and 67.2% (7B) together with a reduction of the over-search rate to 2.3% and a concurrent drop in under-search rate. The authors further claim good generalizability across RL algorithms, model families, and sizes.

Significance. If the decomposition and optimality labeling can be shown to be reliable, the work would demonstrate that fine-grained process-level supervision can measurably improve search efficiency in retrieval-augmented agents beyond what outcome rewards alone achieve. The reported gains on multiple model scales and the claim of broad applicability across RL algorithms would be of interest to the agentic-RAG and RL-for-reasoning communities.

major comments (1)

Abstract and Method description: the central claim that HiPRAG reduces over-search to 2.3% while raising accuracy rests on the reliability of the on-the-fly decomposition into parsable steps and the knowledge-grounded optimality labels used to compute the hierarchical bonus. No quantitative validation of either step (parsing accuracy, inter-annotator agreement, or correlation with human optimality judgments) is reported across the seven benchmarks or two model families. Without such evidence the additional reward term cannot be shown to supply a training signal that is meaningfully finer-grained than the outcome reward already in use.
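
One way the requested validation could be quantified is chance-corrected agreement between the automatic optimality labels and a human annotator's labels on the same sampled steps. The sketch below computes Cohen's kappa for two binary label sequences; it is offered as an example of such a check, not as the analysis the authors ran or plan to run.

```python
from typing import Sequence


def cohens_kappa(auto: Sequence[bool], human: Sequence[bool]) -> float:
    """Cohen's kappa between automatic and human binary optimality labels."""
    if len(auto) != len(human) or len(auto) == 0:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(auto)
    # Observed agreement: share of steps where both labels match.
    observed = sum(a == h for a, h in zip(auto, human)) / n
    # Expected agreement by chance, from each rater's marginal label rates.
    p_auto = sum(auto) / n
    p_human = sum(human) / n
    expected = p_auto * p_human + (1 - p_auto) * (1 - p_human)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)
```

A kappa near 1 would indicate that the automatic labels track human judgment closely, while a value near 0 would suggest agreement no better than chance.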

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on our submission. We address the major comment below and commit to revisions that strengthen the presentation of the hierarchical process reward.

Referee: Abstract and Method description: the central claim that HiPRAG reduces over-search to 2.3% while raising accuracy rests on the reliability of the on-the-fly decomposition into parsable steps and the knowledge-grounded optimality labels used to compute the hierarchical bonus. No quantitative validation of either step (parsing accuracy, inter-annotator agreement, or correlation with human optimality judgments) is reported across the seven benchmarks or two model families. Without such evidence the additional reward term cannot be shown to supply a training signal that is meaningfully finer-grained than the outcome reward already in use.

Authors: We agree that explicit validation of the decomposition reliability and optimality labeling is necessary to substantiate that the hierarchical bonus supplies a finer-grained signal. The original manuscript describes the automatic decomposition and knowledge-grounded criterion in the Method section but does not report quantitative metrics such as parsing accuracy or human correlation. In the revised version we will add a dedicated analysis subsection that evaluates these components on sampled trajectories across benchmarks and models, including agreement with manual annotations, to directly address this concern. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical benchmark results independent of internal reward definitions

The paper presents HiPRAG as an RL training method that augments outcome rewards with a hierarchical process reward derived from on-the-fly decomposition of reasoning trajectories into steps, each labeled optimal or not via a knowledge-grounded criterion. Reported results consist of measured accuracies (65.4% for 3B, 67.2% for 7B) and search rates (over-search 2.3%) on seven external QA benchmarks across Qwen2.5 and Llama-3.2 models. No equations, fitted parameters, or self-citations are shown that would render these performance figures equivalent to quantities defined inside the same training loop or prior author work. The central claims rest on external experimental outcomes rather than any self-referential reduction, rendering the reported derivation self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the ability to define and compute optimal search decisions from decomposed reasoning steps without additional fitted parameters beyond standard RL reward weights.

axioms (1)

Domain assumption: Reasoning trajectories can be decomposed into discrete, parsable steps whose search necessity can be judged against grounded knowledge.
Invoked to enable on-the-fly process reward evaluation as stated in the abstract.

pith-pipeline@v0.9.0 · 5860 in / 1339 out tokens · 54267 ms · 2026-05-18T09:23:57.684497+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: "We then apply a hierarchical reward function that provides an additional bonus based on the proportion of optimal search and non-search steps, on top of commonly used outcome and format rewards."

IndisputableMonolith/Foundation/ArithmeticFromLogic.lean · LogicNat recovery · unclear
Paper passage: "decomposing the agent's reasoning trajectory into discrete, parsable steps"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Superintelligent Retrieval Agent: The Next Frontier of Information Retrieval
cs.IR · 2026-05 · unverdicted · novelty 5.0

SIRA compresses multi-round exploratory retrieval into one LLM-guided, corpus-statistic-validated weighted BM25 query and reports superior results over dense retrievers and agentic baselines on BEIR benchmarks.

Reference graph

Works this paper leans on

31 extracted references · 31 canonical work pages · cited by 1 Pith paper · 11 internal anchors


[1] [1]

ReSearch: Learning to Reason with Search for LLMs via Reinforcement Learning

URLhttps://arxiv.org/abs/ 2503.19470. Kaustubh D. Dhole. To retrieve or not to retrieve? uncertainty detection for dynamic retrieval augmented generation,

work page internal anchor Pith review Pith/arXiv arXiv

[2] [2]

URLhttps://arxiv.org/abs/2501.09292. Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, Amy Yang, Angela Fan, Anirudh Goyal, Anthony Hartshorn, Aobo Yang, Archi Mitra, Archie Sravankumar, Artem Ko- renev, Arthur Hinsvark, Arun Rao, Aston Zhang, Aure...

work page arXiv

[3] [3]

The Llama 3 Herd of Models

URL https://arxiv.org/abs/2407.21783. Xinyan Guan, Jiali Zeng, Fandong Meng, Chunlei Xin, Yaojie Lu, Hongyu Lin, Xianpei Han, Le Sun, and Jie Zhou. Deeprag: Thinking to retrieval step by step for large language models,

work page internal anchor Pith review Pith/arXiv arXiv

[4] [4]

Deeprag: Thinking to retrieval step by step for large language models,

URLhttps://arxiv.org/abs/2502.01142. Xanh Ho, Anh-Khoa Duong Nguyen, Saku Sugawara, and Akiko Aizawa. Constructing a multi- hop QA dataset for comprehensive evaluation of reasoning steps. In Donia Scott, Nuria Bel, and Chengqing Zong (eds.),Proceedings of the 28th International Conference on Computational Linguistics, pp. 6609–6625, Barcelona, Spain (Onli...

work page arXiv

[5] [5]

RAG-RL: Advancing Retrieval-Augmented Generation via RL and Curriculum Learning

International Com- mittee on Computational Linguistics. doi: 10.18653/v1/2020.coling-main.580. URLhttps: //aclanthology.org/2020.coling-main.580/. Jerry Huang, Siddarth Madala, Risham Sidhu, Cheng Niu, Hao Peng, Julia Hockenmaier, and Tong Zhang. Rag-rl: Advancing retrieval-augmented generation via rl and curriculum learning, 2025a. URLhttps://arxiv.org/a...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.18653/v1/2020.coling-main.580 2020

[6] [6]

ISBN 979-8-89176- 256-5

Association for Computational Linguistics. ISBN 979-8-89176- 256-5. doi: 10.18653/v1/2025.findings-acl.652. URLhttps://aclanthology.org/2025. findings-acl.652/. Bowen Jin, Jinsung Yoon, Priyanka Kargupta, Sercan O. Arik, and Jiawei Han. An empirical study on reinforcement learning for reasoning-search interleaved llm agents, 2025a. URLhttps: //arxiv.org/a...

work page doi:10.18653/v1/2025.findings-acl.652 2025

[7] [7]

TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension

Association for Computational Linguistics. doi: 10.18653/v1/P17-1147. URL https://aclanthology.org/P17-1147/. Vladimir Karpukhin, Barlas Oguz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih. Dense passage retrieval for open-domain question answering. In Bonnie Webber, Trevor Cohn, Yulan He, and Yang Liu (eds.), Proceedings ...

[8] [8]

Dense Passage Retrieval for Open-Domain Question Answering

Association for Computational Linguistics. doi: 10.18653/v1/2020.emnlp-main.550. URL https://aclanthology.org/2020.emnlp-main.550/. Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Jacob Devlin, Kenton Lee, Kristina Toutanova, Llion Jones, Matthew Kelcey, Ming-Wei Chang...

[9] [10]

ISBN 9781713829546

Curran Associates Inc. ISBN 9781713829546. Jian Li, Xiaoxi Li, Yan Zheng, Yizhang Jin, Shuo Wang, Jiafu Wu, Yabiao Wang, Chengjie Wang, and Xiaotong Yuan. A survey on ai search with large language models. Preprints, July 2025a. doi: 10.20944/preprints202507.2024.v1. URL https://doi.org/10.20944/preprints202507.2024.v1. Xiaoxi Li, Guanting Dong, Jiajie Jin,...

[10] [11]

URL https://aclanthology.org/2023.acl-long.546/

18653/v1/2023.acl-long.546. URL https://aclanthology.org/2023.acl-long.546/. OpenAI. Introducing gpt-4.1 in the api. OpenAI Blog, April 2025a. URL https://openai.com/index/gpt-4-1/. OpenAI. Introducing gpt-5. OpenAI Blog, August 2025b. URL https://openai.com/index/introducing-gpt-5/. Ofir Press, Muru Zhang, Sewon Min, Ludwig Schmidt, Noah Smith, and Mike L...

[11] [12]

ToolRL: Reward is All Tool Learning Needs

Association for Computational Linguistics. doi: 10.18653/v1/2023.findings-emnlp.378. URL https://aclanthology.org/2023.findings-emnlp.378/. Cheng Qian, Emre Can Acikgoz, Qi He, Hongru Wang, Xiusi Chen, Dilek Hakkani-Tür, Gokhan Tur, and Heng Ji. Toolrl: Reward is all tool learning needs, 2025a. URL https://arxiv.org/abs/2504.13958. Cheng Qian, Emre Ca...

[12] [13]

Qwen2.5 Technical Report

URL https://arxiv.org/abs/2412.15115. John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms,

[13] [14]

Proximal Policy Optimization Algorithms

URL https://arxiv.org/abs/1707.06347. Zeyang Sha, Shiwen Cui, and Weiqiang Wang. Sem: Reinforcement learning for search-efficient large language models,

[14] [15]

Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y

URL https://arxiv.org/abs/2505.07903. Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y. K. Li, Y. Wu, and Daya Guo. Deepseekmath: Pushing the limits of mathematical reasoning in open language models,

[15] [16]

URL https://arxiv.org/abs/2402.03300. Yuanhao Shen, Xiaodan Zhu, and Lei Chen. SMARTCAL: An approach to self-aware tool-use evaluation and calibration. In Franck Dernoncourt, Daniel Preoţiuc-Pietro, and Anastasia Shimorina (eds.), Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: Industry Track, pp. 774–789, Miami...

[16] [17]

doi: 10.18653/v1/2024.emnlp-industry.59

Association for Computational Linguistics. doi: 10.18653/v1/2024.emnlp-industry.59. URL https://aclanthology.org/2024.emnlp-industry.59/. Aditi Singh, Abul Ehtesham, Saket Kumar, and Tala Talaei Khoei. Agentic retrieval-augmented generation: A survey on agentic rag,

[17] [18]

Agentic Retrieval-Augmented Generation: A Survey on Agentic RAG

URL https://arxiv.org/abs/2501.09136. Huatong Song, Jinhao Jiang, Yingqian Min, Jie Chen, Zhipeng Chen, Wayne Xin Zhao, Lei Fang, and Ji-Rong Wen. R1-searcher: Incentivizing the search capability in llms via reinforcement learning, 2025a. URL https://arxiv.org/abs/2503.05592. Huatong Song, Jinhao Jiang, Wenqing Tian, Zhipeng Chen, Yuhuan Wu, Jiahao Zhao, Yi...

[18] [19]

In: Zong, C., Xia, F., Li, W., Navigli, R

Association for Computational Linguistics. doi: 10.18653/v1/2024.acl-long.702. URL https://aclanthology.org/2024.acl-long.702/. Zhongxiang Sun, Qipeng Wang, Weijie Yu, Xiaoxue Zang, Kai Zheng, Jun Xu, Xiao Zhang, Yang Song, and Han Li. Rearter: Retrieval-augmented reasoning with trustworthy process rewarding. In Proceedings of the 48th International ACM SI...

[19] [20]

ISBN 9798400715921

Association for Computing Machinery. ISBN 9798400715921. doi: 10.1145/3726302.3730102. URL https://doi.org/10.1145/3726302.3730102. Harsh Trivedi, Niranjan Balasubramanian, Tushar Khot, and Ashish Sabharwal. MuSiQue: Multihop questions via single-hop question composition. Transactions of the Association for Computational Linguistics, 10:539–554,

[20] [21]

Lost in the Middle: How Language Models Use Long Contexts

doi: 10.1162/tacl_a_00475. URL https://aclanthology.org/2022.tacl-1.31/. Harsh Trivedi, Niranjan Balasubramanian, Tushar Khot, and Ashish Sabharwal. Interleaving retrieval with chain-of-thought reasoning for knowledge-intensive multi-step questions. In Anna Rogers, Jordan Boyd-Graber, and Naoaki Okazaki (eds.), Proceedings of the 61st Annual Meeting of...

[21] [22]

Yifan Li, Yifan Du, Kun Zhou, Jinpeng Wang, Xin Zhao, and Ji-Rong Wen

Association for Computational Linguistics. doi: 10.18653/v1/2023.acl-long.557. URL https://aclanthology.org/2023.acl-long.557/. Hongru Wang, Cheng Qian, Wanjun Zhong, Xiusi Chen, Jiahao Qiu, Shijue Huang, Bowen Jin, Mengdi Wang, Kam-Fai Wong, and Heng Ji. Acting less is reasoning more! teaching model to act efficiently, 2025a. URL https://arxiv.org/abs/250...

[22] [23]

Text Embeddings by Weakly-Supervised Contrastive Pre-training

URL https://arxiv.org/abs/2212.03533. Liang Wang, Haonan Chen, Nan Yang, Xiaolong Huang, Zhicheng Dou, and Furu Wei. Chain-of-retrieval augmented generation, 2025b. URL https://arxiv.org/abs/2501.14342. Peilin Wu, Mian Zhang, Xinlu Zhang, Xinya Du, and Zhiyu Zoey Chen. Search wisely: Mitigating sub-optimal agentic searches by reducing uncertainty,

[23] [24]

Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William Cohen, Ruslan Salakhutdinov, and Christopher D

URL https://arxiv.org/abs/2505.17281. Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William Cohen, Ruslan Salakhutdinov, and Christopher D. Manning. HotpotQA: A dataset for diverse, explainable multi-hop question answering. In Ellen Riloff, David Chiang, Julia Hockenmaier, and Jun’ichi Tsujii (eds.), Proceedings of the 2018 Conference on Empirical...

[24] [25]

Cohen and Ruslan Salakhutdinov and Christopher D

Association for Computational Linguistics. doi: 10.18653/v1/D18-1259. URL https://aclanthology.org/D18-1259/. Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. ReAct: Synergizing reasoning and acting in language models. In International Conference on Learning Representations (ICLR),

[25] [26]

ISBN 979-8-89176-251-0

Association for Computational Linguistics. ISBN 979-8-89176-251-0. doi: 10.18653/v1/2025.acl-long.1312. URL https://aclanthology.org/2025.acl-long.1312/. Chenlu Ye, Zhou Yu, Ziji Zhang, Hao Chen, Narayanan Sadagopan, Jing Huang, Tong Zhang, and Anurag Beniwal. Beyond correctness: Harmonizing process and outcome rewards through rl training,

[26] [27]

URL https://arxiv.org/abs/2509.03403. Chuhuai Yue, Chengqi Dong, Yinan Gao, Hang He, Jiajun Chai, Guojun Yin, and Wei Lin. Promoting efficient reasoning with verifiable stepwise reward,

[27] [28]

URL https://arxiv.org/abs/2508.10293. Guibin Zhang, Hejia Geng, Xiaohang Yu, Zhenfei Yin, Zaibin Zhang, Zelin Tan, Heng Zhou, Zhongzhi Li, Xiangyuan Xue, Yijiang Li, Yifan Zhou, Yang Chen, Chen Zhang, Yutao Fan, Zihu Wang, Songtao Huang, Yue Liao, Hongru Wang, Mengyue Yang, Heng Ji, Michael Littman, Jun Wang, Shuicheng Yan, Philip Torr, and Lei Bai. The l...

[28] [29]

The Landscape of Agentic Reinforcement Learning for LLMs: A Survey

URL https://arxiv.org/abs/2509.02547. Hanna Zubkova, Ji-Hoon Park, and Seong-Whan Lee. Sugar: Leveraging contextual confidence for smarter retrieval,

[29] [30]

URL https://arxiv.org/abs/2501.04899.

APPENDIX

A EXAMPLE OF SEARCH-R1 FORMAT VS. HIPRAG FORMAT

In this section we present an example of transforming from the original Search-R1 output format to our HiPRAG output format shown in Figure

[30] [31]

</information> <think> Great, the CEO of AMD is Dr. Lisa Su. The final piece of information I need is the closing stock price from the last trading day. Since today is Saturday, July 19, 2025, I need the price from Friday, July 18,

[31] [32]

Lacy J. Dalton

</think> <search>AMD stock price closing July 18 2025</search> <information> On Friday, July 18, 2025, the stock for Advanced Micro Devices, Inc. (ticker: AMD) closed at a price of $175.40 per share. </information> <think> I have all the necessary information. The company is AMD, the CEO is Dr. Lisa Su, and the closing stock price was $175.40. I can now p...
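
The tagged trajectory excerpted above can be split into typed segments with a small parser. The following is only a minimal sketch, not the authors' implementation: the helper names `split_segments` and `count_searches` are hypothetical, the tag set is limited to the three tags visible in the excerpt (`<think>`, `<search>`, `<information>`), and the example string is an abridged version of the excerpt.

```python
import re

# Matches one well-formed <tag>...</tag> block for the tags seen in the excerpt.
# Assumption: tags are not nested and always closed.
TAG_PATTERN = re.compile(r"<(think|search|information)>(.*?)</\1>", re.DOTALL)


def split_segments(trajectory: str) -> list[tuple[str, str]]:
    """Return (tag, text) pairs in the order they appear (hypothetical helper)."""
    return [(m.group(1), m.group(2).strip()) for m in TAG_PATTERN.finditer(trajectory)]


def count_searches(trajectory: str) -> int:
    """Count how many search calls the trajectory issues (hypothetical helper)."""
    return sum(1 for tag, _ in split_segments(trajectory) if tag == "search")


# Abridged from the excerpt above, for illustration only.
example = (
    "<think> I need the price from Friday, July 18, 2025. </think> "
    "<search>AMD stock price closing July 18 2025</search> "
    "<information> AMD closed at a price of $175.40 per share. </information> "
    "<think> I have all the necessary information. </think>"
)

segments = split_segments(example)
for tag, text in segments:
    print(f"{tag}: {text}")
```

A non-greedy match with a backreference keeps each segment bounded by its own closing tag. A trajectory with malformed or nested tags would need stricter validation than this sketch provides.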
