HiPRAG: Hierarchical Process Rewards for Efficient Agentic Retrieval Augmented Generation
Pith reviewed 2026-05-18 09:23 UTC · model grok-4.3
The pith
Hierarchical process rewards train agentic RAG models to evaluate each search decision on the fly and reduce over-search while raising accuracy.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
HiPRAG adds a knowledge-grounded hierarchical process reward to RL training for agentic RAG. The reward first decomposes the agent's reasoning trajectory into discrete parsable steps, then adds a bonus proportional to the fraction of optimal search and non-search steps on top of standard outcome and format rewards. This fine-grained signal guides the model to make better search decisions during generation. Experiments show the approach yields 65.4 percent average accuracy on 3B models and 67.2 percent on 7B models while cutting the over-search rate to 2.3 percent and also lowering under-search.
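The reward structure described above can be sketched in a few lines. This is a minimal illustration, not the paper's formula: the `Step` structure, the weight values, and the rule that the process bonus is paid only when the answer is correct and well formatted are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Step:
    """One parsed step of a trajectory (hypothetical structure)."""
    used_search: bool  # did the agent issue a retrieval call in this step?
    optimal: bool      # was that search / non-search decision judged optimal?


def hierarchical_reward(
    answer_correct: bool,
    format_ok: bool,
    steps: List[Step],
    format_weight: float = 0.25,   # assumed value, not from the paper
    process_weight: float = 0.5,   # assumed value, not from the paper
) -> float:
    """Outcome reward + format reward + bonus for the share of optimal steps."""
    outcome = 1.0 if answer_correct else 0.0
    fmt = format_weight if format_ok else 0.0

    # Assumed gating: the process bonus is added only on top of a correct,
    # well-formatted trajectory that contains at least one parsed step.
    if not (answer_correct and format_ok) or not steps:
        return outcome + fmt

    optimal_fraction = sum(s.optimal for s in steps) / len(steps)
    return outcome + fmt + process_weight * optimal_fraction
```

With three optimal steps out of four, a correct and well-formatted trajectory earns the outcome reward, the format reward, and three quarters of the process weight; an incorrect one earns only the format reward under this gating assumption.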
What carries the argument
A hierarchical process reward function that decomposes reasoning trajectories into steps and awards a bonus proportional to the share of optimal search decisions.
If this is right
- Average accuracy reaches 65.4 percent for 3B models and 67.2 percent for 7B models on seven diverse QA benchmarks.
- Over-search rate falls to 2.3 percent while under-search rate also declines.
- Efficiency gains appear alongside accuracy gains because unnecessary retrievals are avoided.
- The same improvements hold when the method is applied to different RL algorithms, model families, sizes, and types.
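The over-search and under-search rates quoted above can be made concrete with a small computation. The sketch below assumes one plausible definition: over-search is the share of search steps where retrieval was not needed, and under-search is the share of non-search steps where retrieval was needed. The paper's exact definitions may differ, and the `(used_search, search_needed)` label pairs are a hypothetical input format.

```python
from typing import Iterable, Tuple


def search_rates(decisions: Iterable[Tuple[bool, bool]]) -> Tuple[float, float]:
    """Return (over_search_rate, under_search_rate).

    Each decision is a pair (used_search, search_needed). Both the pair
    format and the rate definitions are assumptions for illustration.
    """
    search_steps = over = 0
    non_search_steps = under = 0
    for used_search, search_needed in decisions:
        if used_search:
            search_steps += 1
            over += not search_needed   # searched although it was unnecessary
        else:
            non_search_steps += 1
            under += search_needed      # skipped a search that was needed

    over_rate = over / search_steps if search_steps else 0.0
    under_rate = under / non_search_steps if non_search_steps else 0.0
    return over_rate, under_rate
```

Normalizing each rate by its own step type keeps the two numbers independent: a model that rarely searches cannot lower its over-search rate simply by having few search steps in total.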
Where Pith is reading between the lines
- The same step-wise reward structure could guide other sequential decision agents such as tool-use or multi-hop planners.
- Lower search overhead may translate directly into reduced inference cost when these agents are deployed at scale.
- Process-level bonuses might complement outcome rewards in any domain where intermediate actions have measurable optimality.
Load-bearing premise
The necessity of each search decision can be reliably evaluated on-the-fly by decomposing the agent's reasoning trajectory into discrete, parsable steps and applying a knowledge-grounded process reward to judge optimality.
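To show what "discrete, parsable steps" could mean in practice, here is a minimal parsing sketch. It assumes a Search-R1-style tag layout with `<think>`, `<search>`, and `<information>` blocks; the step dictionary keys and the rule that each `<think>` block opens a new step are hypothetical choices, not the paper's parser.

```python
import re
from typing import Dict, List, Optional

# Matches <think>...</think>, <search>...</search>, <information>...</information>.
TAG_PATTERN = re.compile(r"<(think|search|information)>(.*?)</\1>", re.DOTALL)


def decompose(trajectory: str) -> List[Dict[str, Optional[str]]]:
    """Split a tagged trajectory into steps (assumed tag layout)."""
    steps: List[Dict[str, Optional[str]]] = []
    current: Optional[Dict[str, Optional[str]]] = None
    for tag, body in TAG_PATTERN.findall(trajectory):
        text = body.strip()
        if tag == "think":
            # Assumption: every reasoning block starts a new step.
            current = {"reasoning": text, "query": None, "evidence": None}
            steps.append(current)
        elif current is None:
            continue  # skip a search or information block with no reasoning before it
        elif tag == "search":
            current["query"] = text
        else:
            current["evidence"] = text
    return steps
```

A step whose `query` is `None` is a non-search step; a step with a query is a search step. The premise above is that each kind can then be judged for optimality, which this sketch does not attempt.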
What would settle it
If models trained with HiPRAG still show over-search rates above 10 percent or no accuracy improvement over plain outcome-based RL on the same seven benchmarks, the claim that the hierarchical reward improves search optimality would be falsified.
Original abstract
Agentic RAG is a powerful technique for incorporating external information that LLMs lack, enabling better problem solving and question answering. However, suboptimal search behaviors exist widely, such as over-search (retrieving information already known) and under-search (failing to search when necessary), which leads to unnecessary overhead and unreliable outputs. Current training methods, which typically rely on outcome-based rewards in a RL framework, lack the fine-grained control needed to address these inefficiencies. To overcome this, we introduce Hierarchical Process Rewards for Efficient agentic RAG (HiPRAG), a training methodology that incorporates a fine-grained, knowledge-grounded process reward into the RL training. Our approach evaluates the necessity of each search decision on-the-fly by decomposing the agent's reasoning trajectory into discrete, parsable steps. We then apply a hierarchical reward function that provides an additional bonus based on the proportion of optimal search and non-search steps, on top of commonly used outcome and format rewards. Experiments on the Qwen2.5 and Llama-3.2 models across seven diverse QA benchmarks show that our method achieves average accuracies of 65.4% (3B) and 67.2% (7B). This is accomplished while improving search efficiency, reducing the over-search rate to just 2.3% and concurrently lowering the under-search rate. These results demonstrate the efficacy of optimizing the reasoning process itself, not just the final outcome. Further experiments and analysis demonstrate that HiPRAG shows good generalizability across a wide range of RL algorithms, model families, sizes, and types. This work demonstrates the importance and potential of fine-grained control through RL, for improving the efficiency and optimality of reasoning for search agents.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces HiPRAG, a method for training agentic RAG systems via RL that augments standard outcome and format rewards with a hierarchical process reward. The process reward is obtained by automatically decomposing the agent's free-form reasoning trajectory into discrete parsable steps and labeling each search or non-search decision as optimal according to a knowledge-grounded criterion; an additional bonus is then awarded proportional to the fraction of optimal decisions. Experiments on Qwen2.5 and Llama-3.2 models across seven QA benchmarks report average accuracies of 65.4% (3B) and 67.2% (7B) together with a reduction of the over-search rate to 2.3% and a concurrent drop in under-search rate. The authors further claim good generalizability across RL algorithms, model families, and sizes.
Significance. If the decomposition and optimality labeling can be shown to be reliable, the work would demonstrate that fine-grained process-level supervision can measurably improve search efficiency in retrieval-augmented agents beyond what outcome rewards alone achieve. The reported gains on multiple model scales and the claim of broad applicability across RL algorithms would be of interest to the agentic-RAG and RL-for-reasoning communities.
major comments (1)
- Abstract and Method description: the central claim that HiPRAG reduces over-search to 2.3% while raising accuracy rests on the reliability of the on-the-fly decomposition into parsable steps and the knowledge-grounded optimality labels used to compute the hierarchical bonus. No quantitative validation of either step (parsing accuracy, inter-annotator agreement, or correlation with human optimality judgments) is reported across the seven benchmarks or two model families. Without such evidence the additional reward term cannot be shown to supply a training signal that is meaningfully finer-grained than the outcome reward already in use.
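One way the validation requested above could be quantified is chance-corrected agreement between the automatic optimality labels and human labels on the same steps. The sketch below computes Cohen's kappa with the standard formula; the label strings and the idea of sampling steps for manual annotation are assumptions, and nothing here reflects an analysis the authors have reported.

```python
from collections import Counter
from typing import Sequence


def cohen_kappa(auto_labels: Sequence[str], human_labels: Sequence[str]) -> float:
    """Cohen's kappa between automatic and human labels for the same steps."""
    if len(auto_labels) != len(human_labels) or not auto_labels:
        raise ValueError("label sequences must be non-empty and equally long")

    n = len(auto_labels)
    # Observed agreement: share of steps where both labelers agree.
    observed = sum(a == h for a, h in zip(auto_labels, human_labels)) / n

    # Expected agreement under independence, from each labeler's marginals.
    auto_counts = Counter(auto_labels)
    human_counts = Counter(human_labels)
    expected = sum(auto_counts[c] * human_counts[c] for c in auto_counts) / (n * n)

    if expected == 1.0:
        return 1.0  # both labelers used a single identical label throughout
    return (observed - expected) / (1.0 - expected)
```

Raw agreement alone can look high when most steps share one label, which is why a chance-corrected statistic is the more informative check for a reward signal of this kind.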
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our submission. We address the major comment below and commit to revisions that strengthen the presentation of the hierarchical process reward.
-
Referee: Abstract and Method description: the central claim that HiPRAG reduces over-search to 2.3% while raising accuracy rests on the reliability of the on-the-fly decomposition into parsable steps and the knowledge-grounded optimality labels used to compute the hierarchical bonus. No quantitative validation of either step (parsing accuracy, inter-annotator agreement, or correlation with human optimality judgments) is reported across the seven benchmarks or two model families. Without such evidence the additional reward term cannot be shown to supply a training signal that is meaningfully finer-grained than the outcome reward already in use.
Authors: We agree that explicit validation of the decomposition reliability and optimality labeling is necessary to substantiate that the hierarchical bonus supplies a finer-grained signal. The original manuscript describes the automatic decomposition and knowledge-grounded criterion in the Method section but does not report quantitative metrics such as parsing accuracy or human correlation. In the revised version we will add a dedicated analysis subsection that evaluates these components on sampled trajectories across benchmarks and models, including agreement with manual annotations, to directly address this concern.
Revision: yes
Circularity Check
No circularity: empirical benchmark results independent of internal reward definitions
The paper presents HiPRAG as an RL training method that augments outcome rewards with a hierarchical process reward derived from on-the-fly decomposition of reasoning trajectories into steps, each labeled optimal or not via a knowledge-grounded criterion. Reported results consist of measured accuracies (65.4% for 3B, 67.2% for 7B) and search rates (over-search 2.3%) on seven external QA benchmarks across Qwen2.5 and Llama-3.2 models. No equations, fitted parameters, or self-citations are shown that would render these performance figures equivalent to quantities defined inside the same training loop or prior author work. The central claims rest on external experimental outcomes rather than any self-referential reduction, rendering the reported derivation self-contained.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Reasoning trajectories can be decomposed into discrete, parsable steps, and the necessity of searching at each step can be judged against grounded knowledge.
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel (tag: unclear)
Relation between the paper passage and the cited Recognition theorem: unclear.
Passage: "We then apply a hierarchical reward function that provides an additional bonus based on the proportion of optimal search and non-search steps, on top of commonly used outcome and format rewards."
-
IndisputableMonolith/Foundation/ArithmeticFromLogic.lean, LogicNat recovery (tag: unclear)
Relation between the paper passage and the cited Recognition theorem: unclear.
Passage: "decomposing the agent's reasoning trajectory into discrete, parsable steps"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 1 Pith paper
-
Superintelligent Retrieval Agent: The Next Frontier of Information Retrieval
SIRA compresses multi-round exploratory retrieval into one LLM-guided, corpus-statistic-validated weighted BM25 query and reports superior results over dense retrievers and agentic baselines on BEIR benchmarks.