EnterpriseRAG-Bench: A RAG Benchmark for Company Internal Knowledge
Pith reviewed 2026-05-08 17:11 UTC · model grok-4.3
The pith
A new benchmark supplies 500,000 synthetic enterprise documents and 500 questions to evaluate retrieval-augmented generation on company-internal knowledge.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We present EnterpriseRAG-Bench, a dataset consisting of approximately 500,000 documents spanning nine enterprise source types (Slack, Gmail, Linear, Google Drive, HubSpot, Fireflies, GitHub, Jira, and Confluence) and 500 questions across ten categories that test distinct retrieval and reasoning capabilities. The corpus is generated with cross-document coherence (grounded in shared projects, people, and initiatives) and augmented with realistic noise such as misfiled documents, near-duplicates, and conflicting information.
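To make the described corpus structure and noise types concrete, here is a minimal sketch of one plausible document record and the three noise injections. The Document fields, function names, and perturbation choices are illustrative assumptions, not the paper's actual schema or generation code.

```python
import random
from dataclasses import dataclass, replace

SOURCES = ["slack", "gmail", "linear", "gdrive", "hubspot",
           "fireflies", "github", "jira", "confluence"]

@dataclass(frozen=True)
class Document:
    doc_id: str
    source: str   # one of SOURCES
    project: str  # shared project entity that links related documents
    author: str   # shared person entity
    body: str

def misfile(doc: Document, rng: random.Random) -> Document:
    """Noise type 1: reassign the document to the wrong source type."""
    wrong = rng.choice([s for s in SOURCES if s != doc.source])
    return replace(doc, doc_id=doc.doc_id + "-misfiled", source=wrong)

def near_duplicate(doc: Document) -> Document:
    """Noise type 2: a lightly perturbed copy of an existing document."""
    return replace(doc, doc_id=doc.doc_id + "-dup",
                   body=doc.body.replace("Q3", "the third quarter"))

def conflicting(doc: Document, contradicting_body: str) -> Document:
    """Noise type 3: a second document stating an incompatible fact."""
    return replace(doc, doc_id=doc.doc_id + "-conflict",
                   body=contradicting_body)
```

A corpus builder would presumably apply such transforms to a small, controlled fraction of documents so that noise rates stay measurable.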
What carries the argument
The EnterpriseRAG-Bench dataset together with its generation framework, which produces a coherent synthetic enterprise corpus containing controlled noise to measure how well RAG systems retrieve and reason over internal company records.
If this is right
- RAG developers can measure performance on tasks such as constrained retrieval and conflict resolution that arise in company settings.
- The accompanying framework lets organizations generate custom versions matched to their own document mix and scale (see the config sketch after this list).
- Standardized evaluation on the leaderboard enables direct comparison of retrieval methods for internal knowledge tasks.
- The question categories reveal where current systems fail when information is absent or spread across multiple sources.
- Teams gain a reusable testbed for improving AI agents that operate over proprietary data.
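As a sketch of the second point above: since the abstract says variants can be tailored to industry, scale, and source mix, a generation config might plausibly look like the following. Every key name and value here is a guess for illustration; the framework's real options live in the linked repository.

```python
# Hypothetical configuration for a custom corpus variant; none of these
# keys are confirmed against the real EnterpriseRAG-Bench framework.
config = {
    "industry": "healthcare",
    "num_documents": 50_000,           # scaled down from the ~500k default
    "source_mix": {                    # fractions over the nine source types
        "slack": 0.30, "gmail": 0.20, "confluence": 0.15,
        "jira": 0.15, "gdrive": 0.10, "github": 0.05,
        "linear": 0.02, "hubspot": 0.02, "fireflies": 0.01,
    },
    "noise": {                         # controlled-noise rates
        "misfiled_rate": 0.02,
        "near_duplicate_rate": 0.05,
        "conflict_rate": 0.01,
    },
}
# The mix must cover all nine sources and sum to 1.
assert abs(sum(config["source_mix"].values()) - 1.0) < 1e-9
```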
Where Pith is reading between the lines
- Wider use of this benchmark could encourage RAG research to prioritize handling of noisy, interconnected internal records over clean public sources.
- The approach of injecting realistic enterprise noise might be extended to other domains such as legal or medical document collections.
- If the synthetic data proves predictive, it could reduce the need for costly access to real proprietary data during early model development.
- Future versions might add temporal dynamics or access-control constraints to simulate live company environments more closely.
Load-bearing premise
The synthetic corpus, with its cross-document coherence and injected noise (misfiled documents, near-duplicates, conflicting information), realistically reflects real company-internal knowledge.
What would settle it
A study showing that RAG systems achieve different accuracy rankings on actual proprietary enterprise collections than on this synthetic set would demonstrate that the benchmark fails to capture real conditions; matching rankings would support the proxy claim.
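Concretely, that settling study reduces to a rank-correlation test: score the same set of RAG systems on both corpora and check whether the rankings agree. A minimal sketch with SciPy, using made-up scores purely to show the computation:

```python
from scipy.stats import kendalltau

# Hypothetical accuracy scores for the same four RAG systems on the
# synthetic benchmark and on a real (proprietary) enterprise corpus.
synthetic = {"bm25": 0.41, "dense": 0.55, "hybrid": 0.61, "agentic": 0.66}
real = {"bm25": 0.48, "dense": 0.50, "hybrid": 0.63, "agentic": 0.60}

systems = sorted(synthetic)
tau, p_value = kendalltau([synthetic[s] for s in systems],
                          [real[s] for s in systems])
print(f"Kendall tau = {tau:.2f} (p = {p_value:.2f})")
# Tau near 1 supports the proxy claim; tau near 0 or negative means
# rankings on the synthetic set do not transfer to real conditions.
```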
original abstract
Retrieval-Augmented Generation (RAG) has become the standard approach for grounding large language models in information that was not available during training. While existing datasets and benchmarks focus on web or other public sources, there is still no widely adopted dataset that realistically reflects the nature of company-internal knowledge. Meanwhile, startups, enterprises, and researchers are increasingly developing AI Agents designed to operate over exactly this kind of proprietary data. To close this gap, we release a synthetic enterprise corpus, its generation framework, and a leaderboard. We present EnterpriseRAG-Bench, a dataset consisting of approximately 500,000 documents spanning nine enterprise source types (Slack, Gmail, Linear, Google Drive, HubSpot, Fireflies, GitHub, Jira, and Confluence) and 500 questions across ten categories that test distinct retrieval and reasoning capabilities. The corpus is generated with cross-document coherence (grounded in shared projects, people, and initiatives) and augmented with realistic noise such as misfiled documents, near-duplicates, and conflicting information. The question set ranges from simple single-document lookups to multi-document reasoning, constrained retrieval, conflict resolution, and recognizing when information is absent. The generation framework lets teams generate variants tailored to their own industry, scale, and source mix. The dataset, code, evaluation harness, and leaderboard are available at https://github.com/onyx-dot-app/EnterpriseRAG-Bench.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces EnterpriseRAG-Bench, a synthetic dataset consisting of approximately 500,000 documents spanning nine enterprise source types (Slack, Gmail, Linear, Google Drive, HubSpot, Fireflies, GitHub, Jira, and Confluence) together with 500 questions across ten categories that test distinct retrieval and reasoning capabilities. The corpus is generated with cross-document coherence grounded in shared projects, people, and initiatives and is augmented with realistic noise such as misfiled documents, near-duplicates, and conflicting information; the authors also release the generation framework, evaluation harness, and a public leaderboard.
Significance. If the realism claim holds, the benchmark would fill a clear gap in RAG evaluation by supplying a proxy for proprietary enterprise data, which is increasingly relevant for internal AI agents. The open release of the generation framework, code, and leaderboard is a concrete strength that supports reproducibility and customization to different industries or source mixes.
major comments (1)
- [Abstract] The central claim that the synthetic corpus 'realistically reflects the nature of company-internal knowledge' because it incorporates cross-document coherence and noise (misfiled documents, conflicts) is presented without any quantitative validation. No comparisons are reported of document-type frequencies, cross-reference density, conflict rates, or retrieval-failure modes against real (even anonymized) enterprise corpora from the listed sources. This validation gap is load-bearing for the benchmark's stated purpose.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed review. We address the major comment below and have revised the manuscript to acknowledge the validation limitations while preserving the benchmark's value as a synthetic proxy.
point-by-point responses
- Referee: [Abstract] The central claim that the synthetic corpus 'realistically reflects the nature of company-internal knowledge' because it incorporates cross-document coherence and noise (misfiled documents, conflicts) is presented without any quantitative validation. No comparisons are reported of document-type frequencies, cross-reference density, conflict rates, or retrieval-failure modes against real (even anonymized) enterprise corpora from the listed sources. This validation gap is load-bearing for the benchmark's stated purpose.
Authors: We agree that a direct quantitative comparison to real (even anonymized) enterprise corpora would strengthen the realism claim. However, such data is unavailable to us and to the community due to privacy, legal, and competitive sensitivities, which is the exact reason a synthetic benchmark is needed. The generation framework was designed with input from enterprise practitioners to reflect observed patterns in cross-document coherence, noise, and source distributions. In the revised manuscript we have added a 'Limitations' section that explicitly discusses the absence of empirical validation against real corpora, details the rationale and sources for our coherence/noise parameters, and softens the abstract language from 'realistically reflects' to 'aims to approximate'. We also emphasize that the open framework and code allow users to calibrate and validate against their own proprietary data. These changes address the load-bearing concern without overstating the benchmark's fidelity.
Revision: yes
requested experiments (1)
- Quantitative comparisons of document-type frequencies, cross-reference density, conflict rates, or retrieval-failure modes against real (anonymized) enterprise corpora from the listed sources; the authors state these cannot be performed due to data access restrictions (see the sketch below for what a holder of real data could compute).
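For a team that does hold real (anonymized) documents, two of the requested statistics are straightforward to compute and compare against the synthetic corpus. A minimal sketch, assuming each document is a plain dict with a 'source' field and a 'references' list of linked doc ids; this is not the benchmark's actual validation code.

```python
from collections import Counter

def doc_type_frequencies(docs: list[dict]) -> dict[str, float]:
    """Fraction of documents per source type (e.g. slack, jira, ...)."""
    counts = Counter(d["source"] for d in docs)
    total = sum(counts.values())
    return {src: n / total for src, n in counts.items()}

def cross_reference_density(docs: list[dict]) -> float:
    """Mean number of links to other documents per document."""
    return sum(len(d.get("references", [])) for d in docs) / len(docs)

def total_variation(p: dict[str, float], q: dict[str, float]) -> float:
    """Distance between two source-mix distributions, in [0, 1]."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)

# Usage: total_variation(doc_type_frequencies(real_docs),
#                        doc_type_frequencies(synthetic_docs))
# quantifies how closely the generator's source mix matches reality.
```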
Circularity Check
No circularity: new benchmark artifact release with no derivations or self-referential reductions
full rationale
The paper presents EnterpriseRAG-Bench as a released synthetic corpus (~500k documents across nine enterprise sources), generation framework, 500 questions in ten categories, and leaderboard. No equations, fitted parameters, predictions, or first-principles derivations appear in the provided text. The central claims consist of describing the dataset construction (cross-document coherence, added noise) and its intended use for RAG evaluation. These are direct artifact contributions rather than any chain that reduces a result to its own inputs by definition, renaming, or self-citation. The realism of the synthetic data is asserted but not derived from prior results within the paper, leaving no load-bearing circular steps.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Synthetic documents grounded in shared projects and people can exhibit the realistic cross-document coherence and noise patterns of actual company data.
Reference graph
Works this paper leans on
- [1] Payal Bajaj, Daniel Campos, Nick Craswell, Li Deng, Jianfeng Gao, Xiaodong Liu, Rangan Majumder, Andrew McNamara, Bhaskar Mitra, Tri Nguyen, et al. MS MARCO: A human generated machine reading comprehension dataset. arXiv preprint arXiv:1611.09268, 2018.
- [2] Zhiyu Chen, Wenhu Chen, Charese Smiley, Sameena Shah, et al. FinQA: A dataset of numerical reasoning over financial data. In Proceedings of EMNLP, 2021.
- [3] Zhiyu Chen et al. BrowseComp-Plus: A controlled evaluation framework for browsing agents. arXiv preprint arXiv:2508.06600, 2025.
- [4] Databricks. Meet KARL: A faster agent for enterprise knowledge, powered by custom RL. Technical report, Databricks, 2025.
- [5] Qiao Jin, Bhuwan Dhingra, Zhengping Liu, William Cohen, and Xinghua Lu. PubMedQA: A dataset for biomedical research question answering. In Proceedings of EMNLP, 2019.
- [6] Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Jacob Devlin, Kenton Lee, et al. Natural questions: A benchmark for question answering research. Transactions of the Association for Computational Linguistics, 7:453–466, 2019.
- [7] Stuart P. Lloyd. Least squares quantization in PCM. IEEE Transactions on Information Theory, 28(2):129–137, 1982.
- [8] Yu A. Malkov and D. A. Yashunin. Efficient and robust approximate nearest neighbor search using hierarchical navigable small world graphs. IEEE Transactions on Pattern Analysis and Machine Intelligence, 42(4):824–836, 2020.
- [9] Niklas Muennighoff, Nouamane Tazi, Loïc Magne, and Nils Reimers. MTEB: Massive text embedding benchmark. In Proceedings of EACL, 2023.
- [10] OpenAI. New embedding models and API updates. OpenAI Blog, 2024.
- [11] OpenAI. GPT-5.4. OpenAI, 2026.
- [12] Fabio Petroni, Aleksandra Piktus, Angela Fan, Patrick Lewis, et al. KILT: A benchmark for knowledge intensive language tasks. In Proceedings of NAACL, 2021.
- [13] Stephen Robertson and Hugo Zaragoza. The probabilistic relevance framework: BM25 and beyond. Foundations and Trends in Information Retrieval, 3(4):333–389, 2009.
- [14] Sathya Subramanian et al. Keyword search is all you need. arXiv preprint arXiv:2602.23368, 2025.
- [15] Nandan Thakur, Nils Reimers, Andreas Rücklé, Abhishek Srivastava, and Iryna Gurevych. BEIR: A heterogeneous benchmark for zero-shot evaluation of information retrieval models. In Proceedings of NeurIPS, 2021.
- [16] Harsh Trivedi, Niranjan Balasubramanian, Tushar Khot, and Ashish Sabharwal. MuSiQue: Multihop questions via single-hop question composition. Transactions of the Association for Computational Linguistics, 10:539–554, 2022.
- [17] Laurens van der Maaten and Geoffrey Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 9:2579–2605, 2008.
- [18] Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William Cohen, Ruslan Salakhutdinov, and Christopher D. Manning. HotpotQA: A dataset for diverse, explainable multi-hop question answering. In Proceedings of EMNLP, 2018.