BEAVER: An Enterprise Benchmark for Text-to-SQL
Pith reviewed 2026-05-23 20:45 UTC · model grok-4.3
The pith
Current text-to-SQL systems reach only 10.8 percent accuracy on complex enterprise queries from private data warehouses.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
BEAVER is built by synthesizing high-fidelity, expert-verified queries that isolate or combine domain knowledge and query complexity, plus human annotations for five subtasks that enable fine-grained analysis. SOTA agentic frameworks using GPT-5.2 score 10.8 percent accuracy, which rises to 30.1 percent when all subtask annotations are provided as oracle hints. This indicates that subtask resolution is a major bottleneck, and the paper supplies a taxonomy of the residual errors that persist even with hints.
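The size of the gap between the two headline figures can be worked out directly. The short sketch below uses only the two reported percentages; the variable names are illustrative and nothing here comes from the paper's own code.

```python
# Reported headline figures (percent accuracy).
baseline = 10.8  # agentic framework without hints
oracle = 30.1    # same setting with all subtask annotations as oracle hints

absolute_gain = oracle - baseline      # gain in percentage points
relative_gain = oracle / baseline      # multiplicative improvement
residual_failure = 100.0 - oracle      # share of queries still wrong with hints

print(f"absolute gain: {absolute_gain:.1f} points")
print(f"relative gain: {relative_gain:.2f}x")
print(f"still failing with hints: {residual_failure:.1f}%")
```

Oracle hints roughly triple accuracy, yet about seven in ten queries remain unsolved, which is why the residual error taxonomy matters.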
What carries the argument
The five subtask annotations, paired with synthesized query sets focused on domain knowledge, query complexity, or both, support isolation of failure modes beyond all-or-nothing accuracy.
If this is right
- Current systems require targeted gains in handling domain knowledge and complex structures separately rather than end-to-end generation.
- Fine-grained subtask metrics can diagnose errors more precisely than standard accuracy alone.
- Advanced SQL functions remain difficult even when subtask hints are supplied.
- Synthesis methods can expand benchmarks despite privacy limits on original logs.
- Residual error patterns point to specific needs such as better support for sophisticated query functions.
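To make the idea of a fine-grained subtask metric concrete, here is a minimal sketch. The text above does not name the five subtasks, so "table selection" is a hypothetical stand-in, and the table names and the `set_prf` helper are invented for illustration only.

```python
def set_prf(predicted, gold):
    """Precision, recall and F1 between two collections treated as sets."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical subtask: which tables does the query need?
predicted_tables = {"orders", "customers", "regions"}
annotated_tables = {"orders", "customers", "payments", "invoices"}

precision, recall, f1 = set_prf(predicted_tables, annotated_tables)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

A query scored this way can earn partial credit for a subtask even when the final SQL is wrong, which is the diagnostic value that an all-or-nothing accuracy figure cannot provide.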
Where Pith is reading between the lines
- Modular systems that explicitly plan and solve subtasks in sequence might close more of the gap between 10.8 and 30.1 percent.
- The synthesis approach could extend to creating similar private-data benchmarks for related tasks like text-to-visualization.
- The error taxonomy suggests value in augmenting models with domain-specific function retrieval or libraries.
- Performance on BEAVER could serve as a practical filter for selecting models before enterprise deployment.
Load-bearing premise
The synthesized high-fidelity expert-verified queries accurately represent the compounded challenges present in scarce real-world enterprise query logs.
What would settle it
If model performance on a sample of unaltered real enterprise query logs differs substantially from results on the synthesized BEAVER set, the benchmark's ability to isolate representative challenges would be undermined.
Original abstract
Existing text-to-SQL benchmarks have largely been constructed from public databases with well-structured schemas and simplistic question-SQL pairs. While large language models (LLMs) excel on these settings, their efficacy in complex private enterprise environments, characterized by intricate schemas, domain knowledge, and analytical user queries involving sophisticated structures and functions, remains unproven. To bridge this gap, we introduce BEAVER, the first text-to-SQL benchmark derived from private data warehouses. It comprises 9128 question-SQL pairs sourced from real-world query logs and 812 tables across 19 diverse domains. Building this benchmark is challenging because (1) enterprise query logs are scarce due to privacy constraints, and (2) existing all-or-nothing evaluation metrics based on accuracy make error diagnosis difficult -- especially when producing a correct query involves solving multiple compounded challenges, such as domain knowledge and query complexity. We address these issues at two levels. At the dataset level, we synthesize high-fidelity, expert-verified queries that increase dataset size and isolate individual challenges or combine them, producing queries focused on domain knowledge, query complexity, and both. At the evaluation level, we provide human annotations and evaluation metrics for five critical subtasks to enable fine-grained analysis. Our evaluation reveals a significant performance gap compared to existing benchmarks: SOTA agentic frameworks using the advanced model GPT-5.2 achieve only 10.8% accuracy. When provided with all subtask annotations as oracle hints, accuracy increases to 30.1%, confirming that a major bottleneck lies in correctly resolving these subtasks. Finally, we provide a taxonomy of the residual errors that persist even with subtask hints, identifying specific challenges such as the use of advanced functions.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces BEAVER, a text-to-SQL benchmark with 9128 question-SQL pairs from private enterprise query logs across 812 tables in 19 domains. It addresses log scarcity by synthesizing high-fidelity, expert-verified queries that isolate or combine domain knowledge and query complexity challenges, and supplies human annotations plus metrics for five subtasks to enable fine-grained diagnosis. Evaluations show that SOTA agentic frameworks with GPT-5.2 reach only 10.8% accuracy, rising to 30.1% with oracle subtask hints. The paper also provides a taxonomy of the residual errors that persist with hints, such as the use of advanced functions.
Significance. If the synthesized queries faithfully capture real enterprise distributions, the benchmark and subtask annotations would provide a valuable resource for diagnosing LLM limitations in complex private settings beyond public benchmarks, with the oracle-hint experiment offering direct evidence on bottleneck location and the error taxonomy aiding targeted improvements.
major comments (2)
- [Dataset-level synthesis] Dataset construction (synthesis paragraph): the claim that synthesized queries 'isolate individual challenges or combine them' and represent compounded enterprise difficulties lacks any quantitative validation (e.g., Kolmogorov-Smirnov tests or distribution comparisons on schema depth, function usage, or domain-term frequency) against the original scarce logs; without this, the 10.8% vs. 30.1% gap and subtask-bottleneck conclusion rest on an unverified assumption.
- [Evaluation results] Evaluation section: the headline accuracies (10.8% SOTA, 30.1% with hints) are reported without specifying the exact agentic frameworks, number of independent runs, or confidence intervals, making it impossible to assess whether the 'significant performance gap' is robust or sensitive to prompting variance.
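The distribution comparison requested in the first major comment could look like the sketch below. It computes the two-sample Kolmogorov-Smirnov statistic in plain Python. The feature (joins per query) and both samples are made-up placeholders, not data from the paper.

```python
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Two-sample KS statistic: max gap between the two empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    largest_gap = 0.0
    for x in sorted(set(a) | set(b)):
        cdf_a = bisect_right(a, x) / len(a)  # fraction of a that is <= x
        cdf_b = bisect_right(b, x) / len(b)  # fraction of b that is <= x
        largest_gap = max(largest_gap, abs(cdf_a - cdf_b))
    return largest_gap


# Hypothetical feature: number of joins per query.
log_queries = [1, 2, 2, 3, 3, 4, 5, 6]          # placeholder for real logs
synthesized_queries = [1, 1, 2, 2, 3, 3, 3, 4]  # placeholder for synthesized set

print(ks_statistic(log_queries, synthesized_queries))
```

The statistic alone does not give a significance level. In practice a library routine such as `scipy.stats.ks_2samp` would also return a p-value, and for discrete features like join counts the test is conservative, so a chi-squared comparison of histograms may be a better fit.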
minor comments (2)
- [Abstract] The abstract refers to 'GPT-5.2' without a citation or model card; if this is a hypothetical or internal model, the manuscript should clarify its capabilities relative to publicly available models.
- [Error taxonomy] Table or figure captions for the error taxonomy should explicitly state the sample size on which the taxonomy percentages are computed.
Simulated Author's Rebuttal
We thank the referee for the thoughtful and constructive feedback. We address each major comment below and outline planned revisions to improve the manuscript.
-
Referee: [Dataset-level synthesis] Dataset construction (synthesis paragraph): the claim that synthesized queries 'isolate individual challenges or combine them' and represent compounded enterprise difficulties lacks any quantitative validation (e.g., Kolmogorov-Smirnov tests or distribution comparisons on schema depth, function usage, or domain-term frequency) against the original scarce logs; without this, the 10.8% vs. 30.1% gap and subtask-bottleneck conclusion rest on an unverified assumption.
Authors: We agree that quantitative validation against the original logs would provide stronger support for the fidelity of the synthesized queries. Privacy constraints prevent release of the original enterprise logs, precluding public statistical tests such as Kolmogorov-Smirnov comparisons. In the revised version we will expand the synthesis section with additional internal validation details (e.g., expert agreement rates, schema-depth histograms, and function-usage frequencies computed on the source logs) that can be reported without violating confidentiality. These additions will clarify the basis for the isolation/combination claims while preserving the expert-verification process already described.
revision: partial
-
Referee: [Evaluation results] Evaluation section: the headline accuracies (10.8% SOTA, 30.1% with hints) are reported without specifying the exact agentic frameworks, number of independent runs, or confidence intervals, making it impossible to assess whether the 'significant performance gap' is robust or sensitive to prompting variance.
Authors: We accept that the current presentation omits these experimental details. The revised manuscript will name the specific agentic frameworks evaluated, state the number of independent runs, and report confidence intervals (or standard deviations) for the headline accuracy figures. This will allow readers to evaluate robustness directly.
revision: yes
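One common way to attach a confidence interval to an accuracy figure is a percentile bootstrap over per-question outcomes, sketched below. The evaluation-set size of 1000 and the outcome vector are hypothetical, chosen only so that the point estimate matches the reported 10.8%. The function name and defaults are likewise illustrative.

```python
import random


def bootstrap_accuracy_ci(outcomes, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for mean accuracy over 0/1 outcomes."""
    rng = random.Random(seed)
    n = len(outcomes)
    resampled_means = []
    for _ in range(n_resamples):
        sample = rng.choices(outcomes, k=n)  # resample questions with replacement
        resampled_means.append(sum(sample) / n)
    resampled_means.sort()
    lower = resampled_means[int((alpha / 2) * n_resamples)]
    upper = resampled_means[int((1 - alpha / 2) * n_resamples) - 1]
    return sum(outcomes) / n, lower, upper


# Hypothetical run: 108 correct answers out of 1000 questions.
outcomes = [1] * 108 + [0] * 892
accuracy, lower, upper = bootstrap_accuracy_ci(outcomes)
print(f"accuracy={accuracy:.3f}, 95% CI=({lower:.3f}, {upper:.3f})")
```

This captures uncertainty from the choice of questions only. Variance from prompting or sampling temperature, which the referee also raises, would need several independent runs with the spread across runs reported alongside.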
Circularity Check
No circularity: empirical benchmark construction with direct evaluation
The paper is an empirical benchmark paper that sources queries from private logs, synthesizes additional high-fidelity examples to address scarcity and isolate challenges, provides subtask annotations, and reports model accuracies via direct execution-based evaluation. No equations, fitted parameters, derivations, or self-citation chains appear in the provided text. Performance numbers (10.8% and 30.1%) are measured outcomes on the constructed dataset rather than predictions that reduce to the inputs by construction. The synthesis step and error taxonomy are methodological choices whose validity can be assessed externally; they do not create a self-referential loop. This is the normal case of a self-contained evaluation study.
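For readers unfamiliar with execution-based evaluation, the following is a minimal sketch of the general idea using Python's built-in `sqlite3` module. The schema, the queries and the order-insensitive comparison rule are all assumptions made for illustration. The text above does not describe the paper's exact matching procedure or database engine.

```python
import sqlite3
from collections import Counter


def execution_match(conn, predicted_sql, gold_sql):
    """True if both queries return the same multiset of rows."""
    try:
        predicted_rows = conn.execute(predicted_sql).fetchall()
    except sqlite3.Error:
        return False  # a query that fails to run counts as incorrect
    gold_rows = conn.execute(gold_sql).fetchall()
    # Counter comparison ignores row order but keeps duplicates (an assumption).
    return Counter(predicted_rows) == Counter(gold_rows)


# Toy in-memory database standing in for an enterprise warehouse.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE orders (id INTEGER, region TEXT, amount REAL);
    INSERT INTO orders VALUES (1, 'east', 10.0), (2, 'east', 5.0), (3, 'west', 7.5);
    """
)

gold = "SELECT region, SUM(amount) FROM orders GROUP BY region"
reordered = "SELECT region, SUM(amount) AS total FROM orders GROUP BY region ORDER BY region DESC"
wrong = "SELECT region, COUNT(*) FROM orders GROUP BY region"
broken = "SELECT region FROM missing_table"

results = [execution_match(conn, sql, gold) for sql in (reordered, wrong, broken)]
print(results)
```

Accuracy under this scheme is the fraction of questions whose predicted query matches. Because a single wrong table, join or function yields a mismatch, the score is all-or-nothing per question, which is the limitation the subtask annotations are meant to address.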
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Enterprise query logs can be synthesized into high-fidelity, expert-verified queries that isolate or combine challenges such as domain knowledge and query complexity.
Forward citations
Cited by 8 Pith papers
-
Large Language Model-Enhanced Relational Operators: Taxonomy, Benchmark, and Analysis
The authors define a taxonomy for LLM-enhanced relational operators categorized into Select, Match, Impute, Cluster and Order, and release LROBench to evaluate single and multi-operator queries on semantic database pr...
-
EGREFINE: An Execution-Grounded Optimization Framework for Text-to-SQL Schema Refinement
EGRefine optimizes column renamings via execution-grounded verification and view materialization to recover Text-to-SQL accuracy lost to schema naming issues while guaranteeing query equivalence.
-
SPENCE: A Syntactic Probe for Detecting Contamination in NL2SQL Benchmarks
SPENCE shows older NL2SQL benchmarks like Spider have high performance sensitivity to syntactic changes, indicating likely training contamination, while newer ones like BIRD show little sensitivity and appear largely clean.
-
An Alternate Agentic AI Architecture (It's About the Data)
RUBICON replaces opaque LLM-based tool orchestration in agentic AI with an explicit query algebra (AQL: Find, From, Where) executed via wrappers to deliver traceable, deterministic access to heterogeneous enterprise d...
-
A Demonstration of SQLyzr: A Platform for Fine-Grained Text-to-SQL Evaluation and Analysis
SQLyzr is a new evaluation platform that adds diverse metrics, realistic settings, query classification, and analysis features to overcome the single-score limitations of existing text-to-SQL benchmarks.
-
Adapt to Thrive! Adaptive Power-Mean Policy Optimization for Improved LLM Reasoning
APMPO boosts average Pass@1 scores on math reasoning benchmarks by 3 points over GRPO by using an adaptive power-mean policy objective and feedback-driven clipping bounds in RLVR training.
-
Free Energy-Driven Reinforcement Learning with Adaptive Advantage Shaping for Unsupervised Reasoning in LLMs
FREIA applies free energy principles and adaptive advantage shaping to unsupervised RL, outperforming baselines by 0.5-3.5 Pass@1 points on math reasoning with a 1.5B model.
-
Retrieve Only Relevant Tables Whether Few or Many: Adaptive Table Retrieval Method
An adaptive thresholding mechanism combined with sliding-window reranking retrieves a query-dependent number of tables from large corpora, improving retrieval and downstream text-to-SQL performance on Spider, BIRD, an...