arxiv: 2510.07794 · v2 · submitted 2025-10-09 · cs.CL · cs.AI · cs.LG

HiPRAG: Hierarchical Process Rewards for Efficient Agentic Retrieval Augmented Generation

Pith reviewed 2026-05-18 09:23 UTC · model grok-4.3

keywords: agentic RAG · hierarchical process rewards · reinforcement learning · search efficiency · question answering · LLM agents · over-search reduction · retrieval augmented generation

The pith

Hierarchical process rewards train agentic RAG models to evaluate each search decision on the fly and reduce over-search while raising accuracy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to fix widespread suboptimal search in agentic RAG systems, where models retrieve information they already know or skip needed searches, causing extra cost and shaky answers. It replaces coarse outcome-based rewards in reinforcement learning with a hierarchical process reward that breaks the reasoning path into clear steps and scores how often each step makes the right search-or-not choice. If the method works, agents would search only when necessary, producing more reliable answers at lower overhead. The authors test this on Qwen2.5 and Llama-3.2 models across seven QA benchmarks and report gains in both accuracy and search efficiency.

Core claim

HiPRAG adds a knowledge-grounded hierarchical process reward to RL training for agentic RAG. The reward first decomposes the agent's reasoning trajectory into discrete parsable steps, then adds a bonus proportional to the fraction of optimal search and non-search steps on top of standard outcome and format rewards. This fine-grained signal guides the model to make better search decisions during generation. Experiments show the approach yields 65.4 percent average accuracy on 3B models and 67.2 percent on 7B models while cutting the over-search rate to 2.3 percent and also lowering under-search.
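The reward composition described above can be sketched in a few lines. This is a minimal illustration, not the paper's formula: the `Step` type, the weights `format_weight` and `process_weight`, and the choice to grant the process bonus only when the answer is correct and the format is valid are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Step:
    is_search: bool   # True if the step issued a retrieval call
    is_optimal: bool  # verdict of an external over-/under-search check


def hierarchical_reward(
    answer_correct: bool,
    format_ok: bool,
    steps: List[Step],
    format_weight: float = 0.25,  # hypothetical weight, not from the paper
    process_weight: float = 0.5,  # hypothetical weight, not from the paper
) -> float:
    """Outcome reward plus format reward plus a process bonus."""
    outcome = 1.0 if answer_correct else 0.0
    fmt = format_weight if format_ok else 0.0
    reward = outcome + fmt
    # Assumed gating: the bonus is only added on top of a correct,
    # well-formatted trajectory, so it cannot replace the outcome reward.
    if answer_correct and format_ok and steps:
        optimal_fraction = sum(s.is_optimal for s in steps) / len(steps)
        reward += process_weight * optimal_fraction
    return reward
```

With these placeholder weights, a correct, well-formatted trajectory in which three of four steps are optimal scores 1.0 + 0.25 + 0.5 × 0.75 = 1.625, while the same trajectory with a wrong answer scores only the format term.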

What carries the argument

A hierarchical process reward function that decomposes reasoning trajectories into steps and awards a bonus proportional to the share of optimal search decisions.

If this is right

  • Average accuracy reaches 65.4 percent for 3B models and 67.2 percent for 7B models on seven diverse QA benchmarks.
  • Over-search rate falls to 2.3 percent while under-search rate also declines.
  • Efficiency gains appear alongside accuracy gains because unnecessary retrievals are avoided.
  • The same improvements hold when the method is applied to different RL algorithms, model families, sizes, and types.
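The over-search and under-search rates quoted in the list above can be made concrete with a short calculation. The sketch below assumes one plausible definition: over-search rate is the share of search steps judged unnecessary, and under-search rate is the share of non-search steps judged to have needed a search. The summary does not state the denominators, so treat both as assumptions; the function name `search_rates` is hypothetical.

```python
from typing import Iterable, Tuple


def search_rates(steps: Iterable[Tuple[bool, bool]]) -> Tuple[float, float]:
    """Return (over_search_rate, under_search_rate).

    Each step is a pair (is_search, is_optimal). A non-optimal search step
    counts as over-search; a non-optimal non-search step as under-search.
    """
    search_total = search_bad = 0
    skip_total = skip_bad = 0
    for is_search, is_optimal in steps:
        if is_search:
            search_total += 1
            search_bad += not is_optimal
        else:
            skip_total += 1
            skip_bad += not is_optimal
    # Guard against division by zero when one kind of step never occurs.
    over = search_bad / search_total if search_total else 0.0
    under = skip_bad / skip_total if skip_total else 0.0
    return over, under
```

For example, four search steps with one unnecessary and two non-search steps with one that should have searched give rates of 0.25 and 0.5.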

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same step-wise reward structure could guide other sequential decision agents such as tool-use or multi-hop planners.
  • Lower search overhead may translate directly into reduced inference cost when these agents are deployed at scale.
  • Process-level bonuses might complement outcome rewards in any domain where intermediate actions have measurable optimality.

Load-bearing premise

The necessity of each search decision can be reliably evaluated on-the-fly by decomposing the agent's reasoning trajectory into discrete, parsable steps and applying a knowledge-grounded process reward to judge optimality.
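To illustrate what "discrete, parsable steps" could mean in practice, here is a minimal parser sketch. The tag names (`<step>`, `<reasoning>`, `<search>`, `<conclusion>`) are hypothetical stand-ins for an XML-style format; the paper's actual tag set is not reproduced in this summary.

```python
import re
from typing import List, NamedTuple, Optional


class ParsedStep(NamedTuple):
    reasoning: str
    query: Optional[str]  # None when the step did not search
    conclusion: str


STEP_RE = re.compile(r"<step>(.*?)</step>", re.DOTALL)


def _tag(block: str, name: str) -> Optional[str]:
    """Return the stripped content of the first <name>...</name>, if any."""
    match = re.search(rf"<{name}>(.*?)</{name}>", block, re.DOTALL)
    return match.group(1).strip() if match else None


def parse_trajectory(text: str) -> List[ParsedStep]:
    """Split a trajectory into steps; raise ValueError on malformed steps."""
    steps = []
    for block in STEP_RE.findall(text):
        reasoning = _tag(block, "reasoning")
        conclusion = _tag(block, "conclusion")
        if reasoning is None or conclusion is None:
            raise ValueError("step is missing <reasoning> or <conclusion>")
        steps.append(ParsedStep(reasoning, _tag(block, "search"), conclusion))
    return steps


example = (
    "<step><reasoning>Need the author.</reasoning>"
    "<search>author of Dune</search>"
    "<conclusion>Frank Herbert</conclusion></step>"
    "<step><reasoning>Dune is a novel.</reasoning>"
    "<conclusion>It is science fiction.</conclusion></step>"
)
parsed = parse_trajectory(example)
```

A strict parser of this kind also doubles as a format check: a trajectory that fails to parse can be denied the format reward, which is one reason the premise depends on the model reliably emitting the expected structure.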

What would settle it

If models trained with HiPRAG still show over-search rates above 10 percent or no accuracy improvement over plain outcome-based RL on the same seven benchmarks, the claim that the hierarchical reward improves search optimality would be falsified.
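The falsification criterion above reduces to a two-part check, sketched here under stated assumptions: accuracies are averaged over benchmarks with equal weight, and the 10 percent threshold is taken directly from the text. The function name and arguments are hypothetical, and no real results are embedded.

```python
from statistics import mean
from typing import Sequence


def claim_survives(
    hiprag_acc: Sequence[float],    # per-benchmark accuracy, HiPRAG-trained
    baseline_acc: Sequence[float],  # per-benchmark accuracy, outcome-only RL
    over_search_rate: float,        # fraction in [0, 1]
    max_over_search: float = 0.10,  # threshold from the criterion above
) -> bool:
    """True unless over-search exceeds the threshold or accuracy fails to improve."""
    if len(hiprag_acc) != len(baseline_acc):
        raise ValueError("both runs must cover the same benchmarks")
    improved = mean(hiprag_acc) > mean(baseline_acc)
    return improved and over_search_rate <= max_over_search
```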

Figures

Figures reproduced from arXiv: 2510.07794 by Kaiyu He, Kun Wan, Mian Zhang, Peilin Wu, Wentian Zhao, Xinya Du, Zhiyu Chen.

Figure 1: A general overview of the HiPRAG training workflow. The policy model generates a …
Figure 2: Reward curves for different RL algorithm and curves of the ratio of searches among all …
Figure 3: Comparison of reasoning trajectory formats for the same multi-hop question. Each logical …
Figure 4: Input prompt for generating HiPRAG’s parsable output format with the new XML tagging …
Figure 5: Prompt for Over-search Detection
Figure 6: Prompt for Under-search Detection. The prompt begins: "You are an expert Fact-Checker and Logic Verifier. Your task is to evaluate a single, isolated reasoning step from an AI agent. This step was generated without using a search tool. Your goal is to determine if the agent made a mistake by not searching, based only on the information within this single step and your own general knowledge. Ana…"
Figure 7: Case study: Baseline reasoning trajectory. The model has five unnecessary search steps …
Figure 8: Case study: HiPRAG-trained reasoning trajectory. The model correctly identifies the …
Abstract

Agentic RAG is a powerful technique for incorporating external information that LLMs lack, enabling better problem solving and question answering. However, suboptimal search behaviors exist widely, such as over-search (retrieving information already known) and under-search (failing to search when necessary), which leads to unnecessary overhead and unreliable outputs. Current training methods, which typically rely on outcome-based rewards in an RL framework, lack the fine-grained control needed to address these inefficiencies. To overcome this, we introduce Hierarchical Process Rewards for Efficient agentic RAG (HiPRAG), a training methodology that incorporates a fine-grained, knowledge-grounded process reward into the RL training. Our approach evaluates the necessity of each search decision on-the-fly by decomposing the agent's reasoning trajectory into discrete, parsable steps. We then apply a hierarchical reward function that provides an additional bonus based on the proportion of optimal search and non-search steps, on top of commonly used outcome and format rewards. Experiments on the Qwen2.5 and Llama-3.2 models across seven diverse QA benchmarks show that our method achieves average accuracies of 65.4% (3B) and 67.2% (7B). This is accomplished while improving search efficiency, reducing the over-search rate to just 2.3% and concurrently lowering the under-search rate. These results demonstrate the efficacy of optimizing the reasoning process itself, not just the final outcome. Further experiments and analysis demonstrate that HiPRAG shows good generalizability across a wide range of RL algorithms, model families, sizes, and types. This work demonstrates the importance and potential of fine-grained control through RL for improving the efficiency and optimality of reasoning for search agents.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The paper introduces HiPRAG, a method for training agentic RAG systems via RL that augments standard outcome and format rewards with a hierarchical process reward. The process reward is obtained by automatically decomposing the agent's free-form reasoning trajectory into discrete parsable steps and labeling each search or non-search decision as optimal according to a knowledge-grounded criterion; an additional bonus is then awarded proportional to the fraction of optimal decisions. Experiments on Qwen2.5 and Llama-3.2 models across seven QA benchmarks report average accuracies of 65.4% (3B) and 67.2% (7B) together with a reduction of the over-search rate to 2.3% and a concurrent drop in under-search rate. The authors further claim good generalizability across RL algorithms, model families, and sizes.

Significance. If the decomposition and optimality labeling can be shown to be reliable, the work would demonstrate that fine-grained process-level supervision can measurably improve search efficiency in retrieval-augmented agents beyond what outcome rewards alone achieve. The reported gains on multiple model scales and the claim of broad applicability across RL algorithms would be of interest to the agentic-RAG and RL-for-reasoning communities.

major comments (1)
  1. Abstract and Method description: the central claim that HiPRAG reduces over-search to 2.3% while raising accuracy rests on the reliability of the on-the-fly decomposition into parsable steps and the knowledge-grounded optimality labels used to compute the hierarchical bonus. No quantitative validation of either step (parsing accuracy, inter-annotator agreement, or correlation with human optimality judgments) is reported across the seven benchmarks or two model families. Without such evidence the additional reward term cannot be shown to supply a training signal that is meaningfully finer-grained than the outcome reward already in use.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on our submission. We address the major comment below and commit to revisions that strengthen the presentation of the hierarchical process reward.

Point-by-point responses
  1. Referee: Abstract and Method description: the central claim that HiPRAG reduces over-search to 2.3% while raising accuracy rests on the reliability of the on-the-fly decomposition into parsable steps and the knowledge-grounded optimality labels used to compute the hierarchical bonus. No quantitative validation of either step (parsing accuracy, inter-annotator agreement, or correlation with human optimality judgments) is reported across the seven benchmarks or two model families. Without such evidence the additional reward term cannot be shown to supply a training signal that is meaningfully finer-grained than the outcome reward already in use.

    Authors: We agree that explicit validation of the decomposition reliability and optimality labeling is necessary to substantiate that the hierarchical bonus supplies a finer-grained signal. The original manuscript describes the automatic decomposition and knowledge-grounded criterion in the Method section but does not report quantitative metrics such as parsing accuracy or human correlation. In the revised version we will add a dedicated analysis subsection that evaluates these components on sampled trajectories across benchmarks and models, including agreement with manual annotations, to directly address this concern.

    Revision: yes
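One way the promised agreement analysis could be quantified is Cohen's kappa between the automatic optimality labels and manual annotations on the same sampled steps. The sketch below is an illustration of that statistic only; the exchange above does not say which agreement measure the authors would use, and the function name is hypothetical.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(auto: Sequence[Hashable], human: Sequence[Hashable]) -> float:
    """Chance-corrected agreement between two label sequences."""
    if len(auto) != len(human) or not auto:
        raise ValueError("label sequences must be non-empty and equal length")
    n = len(auto)
    # Observed agreement: share of items where both raters give the same label.
    observed = sum(a == h for a, h in zip(auto, human)) / n
    # Expected agreement if the raters labeled independently at their own rates.
    auto_counts, human_counts = Counter(auto), Counter(human)
    expected = sum(
        (auto_counts[label] / n) * (human_counts[label] / n)
        for label in set(auto_counts) | set(human_counts)
    )
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1.0 - expected)
```

With automatic labels `[1, 1, 0, 0]` and human labels `[1, 0, 0, 0]`, observed agreement is 0.75, expected agreement is 0.5, and kappa is 0.5. Raw agreement alone would overstate reliability when one label dominates, which is why a chance-corrected measure is the more informative choice here.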

Circularity Check

0 steps flagged

No circularity: empirical benchmark results independent of internal reward definitions

The paper presents HiPRAG as an RL training method that augments outcome rewards with a hierarchical process reward derived from on-the-fly decomposition of reasoning trajectories into steps, each labeled optimal or not via a knowledge-grounded criterion. Reported results consist of measured accuracies (65.4% for 3B, 67.2% for 7B) and search rates (over-search 2.3%) on seven external QA benchmarks across Qwen2.5 and Llama-3.2 models. No equations, fitted parameters, or self-citations are shown that would render these performance figures equivalent to quantities defined inside the same training loop or prior author work. The central claims rest on external experimental outcomes rather than any self-referential reduction, rendering the reported derivation self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the ability to define and compute optimal search decisions from decomposed reasoning steps without additional fitted parameters beyond standard RL reward weights.

axioms (1)
  • Domain assumption: reasoning trajectories can be decomposed into discrete, parsable steps whose search necessity can be judged against grounded knowledge.
    Invoked to enable on-the-fly process reward evaluation as stated in the abstract.

pith-pipeline@v0.9.0 · 5860 in / 1339 out tokens · 54267 ms · 2026-05-18T09:23:57.684497+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Superintelligent Retrieval Agent: The Next Frontier of Information Retrieval

    cs.IR 2026-05 unverdicted novelty 5.0

    SIRA compresses multi-round exploratory retrieval into one LLM-guided, corpus-statistic-validated weighted BM25 query and reports superior results over dense retrievers and agentic baselines on BEIR benchmarks.
