Recognition: unknown
AgenticRAG: Agentic Retrieval for Enterprise Knowledge Bases
Pith reviewed 2026-05-08 12:06 UTC · model grok-4.3
The pith
Equipping language models with search, navigation, and summarization tools improves retrieval recall and answer correctness over standard fixed-retrieval RAG on enterprise benchmarks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By supplying a reasoning LLM with four tools—search, find, open, and summarize—the harness lets the model iteratively retrieve information, navigate within documents, and analyze evidence on its own, reducing overdependence on the initial search stack and delivering higher recall, factuality, and correctness than conventional RAG pipelines.
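No code accompanies this review, so the following is only a minimal sketch of how a harness with these four tools might be wired: a bounded loop in which the reasoning model either requests a tool call or emits a final answer. Only the tool names (search, find, open, summarize) come from the paper; the signatures, the call_llm stub, and the step budget are illustrative assumptions.

```python
# Minimal sketch of an agentic retrieval loop built around the four tools
# named in the paper (search, find, open, summarize). Signatures, the
# call_llm stub, and the step budget are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class ToolCall:
    name: str
    args: dict

def search(query: str, k: int = 10) -> list[dict]:
    """Query the existing enterprise search stack for candidate documents."""
    raise NotImplementedError  # delegate to the deployed search engine

def find(doc_id: str, pattern: str) -> list[str]:
    """Locate matching passages inside an already-retrieved document."""
    raise NotImplementedError

def open_doc(doc_id: str, start: int = 0, length: int = 4000) -> str:
    """Read a window of a document for close inspection."""
    raise NotImplementedError

def summarize(text: str) -> str:
    """Condense evidence before it enters the model's working context."""
    raise NotImplementedError

TOOLS = {"search": search, "find": find, "open": open_doc, "summarize": summarize}

def call_llm(history: list[dict]) -> ToolCall | str:
    """Placeholder for a reasoning LLM with tool calling: returns either a
    ToolCall to execute next or a final answer string."""
    raise NotImplementedError

def answer(question: str, max_steps: int = 12) -> str:
    history = [{"role": "user", "content": question}]
    for _ in range(max_steps):                  # bounded iteration budget
        step = call_llm(history)
        if isinstance(step, str):               # the model chose to answer
            return step
        result = TOOLS[step.name](**step.args)  # execute the requested tool
        history.append({"role": "tool", "name": step.name, "content": str(result)})
    return "No answer produced within the step budget."
```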
What carries the argument
The lightweight agentic harness that equips the reasoning LLM with search, find, open, and summarize tools for iterative retrieval and evidence analysis.
Load-bearing premise
The three open benchmarks sufficiently represent the query distributions and document quality found in real enterprise knowledge bases.
What would settle it
A controlled test on enterprise data where the base search engine returns mostly irrelevant documents would show whether the iterative tool-use gains persist or disappear compared with single-shot retrieval.
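One way to run such a test, sketched below under stated assumptions: dilute every candidate set returned by the base engine with random distractors until precision@k falls to a chosen target, then score the single-shot and agentic systems behind the same degraded engine. The callables (base_search, run_single_shot, run_agentic, score) and the default precision target are hypothetical interfaces, not anything specified by the paper.

```python
# Hypothetical protocol for the controlled degraded-retrieval test described
# above. All callables passed in (base_search, run_single_shot, run_agentic,
# score) are assumed interfaces, not part of the paper.
import random

def degrade(hits: list[dict], corpus: list[dict], target_precision: float, k: int = 10) -> list[dict]:
    """Keep only enough original hits to reach the target precision@k and pad
    the rest of the candidate set with randomly drawn distractor documents.
    (Assumes the original top hits are relevant; a real setup would use
    relevance labels instead.)"""
    keep = max(1, round(target_precision * k))
    return hits[:keep] + random.sample(corpus, k - keep)

def run_trial(queries, corpus, base_search, run_single_shot, run_agentic, score,
              target_precision: float = 0.2):
    """Score both systems behind the same artificially degraded search engine."""
    def noisy_search(text: str, k: int = 10) -> list[dict]:
        return degrade(base_search(text, k), corpus, target_precision, k)
    single = [score(q, run_single_shot(q, noisy_search)) for q in queries]
    agentic = [score(q, run_agentic(q, noisy_search)) for q in queries]
    return sum(single) / len(single), sum(agentic) / len(agentic)
```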
Original abstract
We present AgenticRAG, a practical agentic harness for retrieval and analysis over enterprise knowledge bases. Standard RAG pipelines place a significant burden of grounding on the search stack, constraining the language model to a fixed candidate set chosen deep in the retrieval process. Our approach reduces this overdependence by layering a lightweight harness on top of existing enterprise search infrastructure, equipping a reasoning LLM with search, find, open, and summarize tools that enable the model to iteratively retrieve information, navigate within documents, and analyze evidence autonomously. On three open benchmarks we observe substantial gains: 49.6% recall@1 on BRIGHT (+21.8 pp over the best embedding baseline), 0.96 factuality on WixQA (+13% relative improvement), and 92% answer correctness on FinanceBench, within 2 pp of oracle access to true evidence. Ablation studies show that the most significant factor is the shift from single-shot retrieval to agentic tool use (5.9× improvement), while multi-query search and in-document navigation contribute to both quality and efficiency. We present various design choices in our agentic harness that were informed by pre-production deployments. Our results demonstrate its suitability for real-world enterprise production environments.
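The paper's evaluation scripts are not reproduced here; the sketch below shows the conventional reading of recall@k that the abstract's recall@1 figure presumably follows: per query, the fraction of gold documents appearing in the top k results, averaged over queries (with a single gold document per query this reduces to top-1 accuracy).

```python
# Assumed reading of "recall@1": per query, fraction of gold documents found
# in the top-k results, averaged over all queries with at least one gold doc.
def recall_at_k(ranked_ids: list[list[str]], gold_ids: list[set[str]], k: int = 1) -> float:
    per_query = [
        len(set(ranked[:k]) & gold) / len(gold)
        for ranked, gold in zip(ranked_ids, gold_ids)
        if gold
    ]
    return sum(per_query) / len(per_query)

# Example: two queries, one gold document each; only the first is hit at rank 1.
print(recall_at_k([["d3", "d1"], ["d9", "d2"]], [{"d3"}, {"d2"}], k=1))  # 0.5
```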
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents AgenticRAG, a lightweight agentic harness layered atop existing enterprise search infrastructure. An LLM is equipped with search, find, open, and summarize tools to perform iterative retrieval, intra-document navigation, and evidence analysis. On BRIGHT, WixQA, and FinanceBench the system reports 49.6% recall@1 (+21.8 pp), 0.96 factuality (+13% relative), and 92% answer correctness (within 2 pp of oracle), respectively. Ablations attribute the largest gains (5.9×) to the shift from single-shot to agentic tool use, with additional contributions from multi-query search and navigation; design choices are stated to be informed by pre-production deployments, supporting the claim of suitability for real-world enterprise knowledge bases.
Significance. If the reported gains prove robust and the agentic patterns transfer beyond the chosen benchmarks, the work could meaningfully advance practical RAG systems by loosening the dependence on a single early retrieval step and enabling autonomous navigation and summarization. The concrete numeric improvements, the ablation isolating tool-use effects, and the production-informed design choices are positive features that could guide deployment of agentic retrieval in constrained enterprise settings.
Major comments (2)
- Abstract and Evaluation: the claim that AgenticRAG is suitable for real-world enterprise production environments is load-bearing for the paper's contribution, yet the evaluation is confined to three open benchmarks whose query distributions and document quality may not reflect noisier enterprise retrieval (lower precision@10, duplicates, domain-specific noise). No quantitative results or ablations under degraded retrieval conditions are provided, leaving open the possibility that iterative tool calls compound rather than mitigate errors.
- Ablation studies: the 5.9× improvement attributed to agentic tool use versus single-shot retrieval is a central empirical result. Without explicit details on baseline implementation (identical search engine, query formulation, and candidate-set construction), it is difficult to confirm that the delta isolates the effect of agentic iteration rather than differences in retrieval configuration.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment below with clarifications and proposed revisions to improve the manuscript.
Point-by-point responses
- Referee: Abstract and Evaluation: the claim that AgenticRAG is suitable for real-world enterprise production environments is load-bearing for the paper's contribution, yet the evaluation is confined to three open benchmarks whose query distributions and document quality may not reflect noisier enterprise retrieval (lower precision@10, duplicates, domain-specific noise). No quantitative results or ablations under degraded retrieval conditions are provided, leaving open the possibility that iterative tool calls compound rather than mitigate errors.
Authors: We acknowledge that the three benchmarks cannot capture every characteristic of enterprise data, including variable noise levels or duplicates. That said, BRIGHT, WixQA, and FinanceBench were selected precisely because they feature complex, multi-hop queries and domain-specific content that mirror challenges observed in our pre-production enterprise deployments. The agentic harness design (tool set, iteration budget, and summarization step) was shaped by those deployments, where single-shot retrieval frequently failed on precisely the kinds of noisy or fragmented documents the referee mentions. The high factuality and correctness scores already reflect the harness's ability to mitigate compounding errors through targeted navigation and evidence condensation. In the revised manuscript we will (1) add an explicit limitations paragraph discussing benchmark-to-enterprise gaps and (2) expand the failure-mode analysis to illustrate how the summarize and find tools reduce rather than amplify retrieval noise. We do not have new quantitative results on artificially degraded retrieval at this time.
Revision: partial
- Referee: Ablation studies: the 5.9× improvement attributed to agentic tool use versus single-shot retrieval is a central empirical result. Without explicit details on baseline implementation (identical search engine, query formulation, and candidate-set construction), it is difficult to confirm that the delta isolates the effect of agentic iteration rather than differences in retrieval configuration.
Authors: We agree that the ablation requires fuller specification to isolate the agentic contribution. The single-shot baseline employs exactly the same search engine, embedding model, and index as the agentic system. It issues one query derived directly from the user input, retrieves a fixed top-k candidate set, and stops; no further tool calls or iteration occur. In the revised manuscript we will add a dedicated paragraph (and accompanying table) that states the precise query formulation, k value, ranking parameters, and candidate-set construction used for the baseline, thereby confirming that the 5.9× gain is attributable to the introduction of iterative tool use rather than any change in the underlying retriever.
Revision: yes
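Taking that description at face value, the baseline's behavior can be pinned down in a few lines. The sketch below is an assumed reading of the response, not the authors' code; the search and generate interfaces and the k value are illustrative.

```python
# Sketch of the single-shot baseline as characterized in the response: the
# same engine and index as the agentic system, one query taken verbatim from
# the user input, a fixed top-k candidate set, and no further tool calls.
# The search/generate interfaces and the k value are illustrative assumptions.
def single_shot_rag(question: str, search, generate, k: int = 10) -> str:
    candidates = search(question, k)          # one retrieval call, then stop
    context = "\n\n".join(doc["text"] for doc in candidates)
    return generate(question, context)        # answer grounded only in this fixed set
```

Holding the retriever, index, and k fixed while varying only whether iteration is allowed is what would let the reported 5.9× delta be read as the effect of agentic tool use alone.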
Circularity Check
No circularity: purely empirical evaluation on external benchmarks
full rationale
The paper introduces an agentic harness for RAG and reports direct empirical measurements (recall@1, factuality, answer correctness) on three fixed external benchmarks plus ablations attributing gains to tool-use patterns. No mathematical derivation, equations, fitted parameters renamed as predictions, or self-citation chains are present; design choices are stated as informed by deployments but not used to derive the benchmark numbers. All claims reduce to observable performance deltas against independent test sets, satisfying the self-contained criterion with no reduction by construction.