pith. sign in

arxiv: 2409.02038 · v3 · pith:IJYMPJBOnew · submitted 2024-09-03 · 💻 cs.CL · cs.AI· cs.DB

BEAVER: An Enterprise Benchmark for Text-to-SQL

Pith reviewed 2026-05-23 20:45 UTC · model grok-4.3

classification 💻 cs.CL cs.AIcs.DB
keywords text-to-SQLbenchmarkenterprise dataLLM evaluationquery complexitydomain knowledgesubtask annotationsagentic frameworks
0
0 comments X

The pith

Current text-to-SQL systems reach only 10.8 percent accuracy on complex enterprise queries from private data warehouses.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces BEAVER, a benchmark of 9128 question-SQL pairs drawn from real enterprise query logs across 19 domains and 812 tables. It demonstrates that state-of-the-art agentic frameworks using advanced models achieve just 10.8 percent accuracy on these queries. Supplying oracle hints for five critical subtasks raises accuracy to 30.1 percent, showing that resolving subtasks involving domain knowledge and query complexity forms a major bottleneck. Existing public benchmarks rely on simple schemas where models perform well, so this new resource isolates compounded challenges that matter for practical deployment. The work also supplies a taxonomy of errors that remain even with hints, such as difficulties with advanced functions.

Core claim

BEAVER is built by synthesizing high-fidelity expert-verified queries that isolate or combine domain knowledge and query complexity, plus human annotations for five subtasks that enable fine-grained analysis. SOTA agentic frameworks score 10.8 percent accuracy, which rises to 30.1 percent when all subtask annotations are provided as oracle hints. This confirms subtask resolution as the primary bottleneck and supplies a taxonomy of residual errors that persist with hints.

What carries the argument

The five subtask annotations paired with synthesized query sets focused on domain knowledge, query complexity, or both, which support isolation of failure modes beyond all-or-nothing accuracy.

If this is right

  • Current systems require targeted gains in handling domain knowledge and complex structures separately rather than end-to-end generation.
  • Fine-grained subtask metrics can diagnose errors more precisely than standard accuracy alone.
  • Advanced SQL functions remain difficult even when subtask hints are supplied.
  • Synthesis methods can expand benchmarks despite privacy limits on original logs.
  • Residual error patterns point to specific needs such as better support for sophisticated query functions.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Modular systems that explicitly plan and solve subtasks in sequence might close more of the gap between 10.8 and 30.1 percent.
  • The synthesis approach could extend to creating similar private-data benchmarks for related tasks like text-to-visualization.
  • The error taxonomy suggests value in augmenting models with domain-specific function retrieval or libraries.
  • Performance on BEAVER could serve as a practical filter for selecting models before enterprise deployment.

Load-bearing premise

The synthesized high-fidelity expert-verified queries accurately represent the compounded challenges present in scarce real-world enterprise query logs.

What would settle it

If model performance on a sample of unaltered real enterprise query logs differs substantially from results on the synthesized BEAVER set, the benchmark's ability to isolate representative challenges would be undermined.

Figures

Figures reproduced from arXiv: 2409.02038 by \c{C}a\u{g}atay Demiralp, Devin Yang, Fabian Wenz, Michael Cafarella, Michael Stonebraker, Nesime Tatbul, Peter Baile Chen, Weiyue Li, Yi Zhang.

Figure 1
Figure 1. Figure 1: The mean values for the number of joins, [PITH_FULL_IMAGE:figures/full_fig_p003_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Schema of tables to illustrate retriever models did not retrieve sufficient information. A green tick [PITH_FULL_IMAGE:figures/full_fig_p008_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Schema of tables to illustrate retriever model did not retrieve connecting tables. [PITH_FULL_IMAGE:figures/full_fig_p008_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Interface for annotating natural language questions. [PITH_FULL_IMAGE:figures/full_fig_p014_4.png] view at source ↗
read the original abstract

Existing text-to-SQL benchmarks have largely been constructed from public databases with well-structured schemas and simplistic question-SQL pairs. While large language models (LLMs) excel on these settings, their efficacy in complex private enterprise environments, characterized by intricate schemas, domain knowledge, and analytical user queries involving sophisticated structures and functions, remains unproven. To bridge this gap, we introduce BEAVER, the first text-to-SQL benchmark derived from private data warehouses. It comprises 9128 question-SQL pairs sourced from real-world query logs and 812 tables across 19 diverse domains. Building this benchmark is challenging because (1) enterprise query logs are scarce due to privacy constraints, and (2) existing all-or-nothing evaluation metrics based on accuracy make error diagnosis difficult -- especially when producing a correct query involves solving multiple compounded challenges, such as domain knowledge and query complexity. We address these issues at two levels. At the dataset level, we synthesize high-fidelity, expert-verified queries that increase dataset size and isolate individual challenges or combine them, producing queries focused on domain knowledge, query complexity, and both. At the evaluation level, we provide human annotations and evaluation metrics for five critical subtasks to enable fine-grained analysis. Our evaluation reveals a significant performance gap compared to existing benchmarks: SOTA agentic frameworks using the advanced model GPT-5.2 achieve only 10.8% accuracy. When provided with all subtask annotations as oracle hints, accuracy increases to 30.1%, confirming that a major bottleneck lies in correctly resolving these subtasks. Finally, we provide a taxonomy of the residual errors that persist even with subtask hints, identifying specific challenges such as the use of advanced functions.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces BEAVER, a text-to-SQL benchmark with 9128 question-SQL pairs from private enterprise query logs across 812 tables in 19 domains. It addresses scarcity via synthesis of high-fidelity expert-verified queries that isolate or combine domain knowledge and query complexity challenges, and supplies human annotations plus metrics for five subtasks to enable fine-grained diagnosis. Evaluations show SOTA agentic frameworks with GPT-5.2 reach only 10.8% accuracy, rising to 30.1% with oracle subtask hints, and include a residual error taxonomy focused on advanced functions.

Significance. If the synthesized queries faithfully capture real enterprise distributions, the benchmark and subtask annotations would provide a valuable resource for diagnosing LLM limitations in complex private settings beyond public benchmarks, with the oracle-hint experiment offering direct evidence on bottleneck location and the error taxonomy aiding targeted improvements.

major comments (2)
  1. [Dataset-level synthesis] Dataset construction (synthesis paragraph): the claim that synthesized queries 'isolate individual challenges or combine them' and represent compounded enterprise difficulties lacks any quantitative validation (e.g., Kolmogorov-Smirnov tests or distribution comparisons on schema depth, function usage, or domain-term frequency) against the original scarce logs; without this, the 10.8% vs. 30.1% gap and subtask-bottleneck conclusion rest on an unverified assumption.
  2. [Evaluation results] Evaluation section: the headline accuracies (10.8% SOTA, 30.1% with hints) are reported without specifying the exact agentic frameworks, number of independent runs, or confidence intervals, making it impossible to assess whether the 'significant performance gap' is robust or sensitive to prompting variance.
minor comments (2)
  1. [Abstract] The abstract refers to 'GPT-5.2' without a citation or model card; if this is a hypothetical or internal model, the manuscript should clarify its capabilities relative to publicly available models.
  2. [Error taxonomy] Table or figure captions for the error taxonomy should explicitly state the sample size on which the taxonomy percentages are computed.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive feedback. We address each major comment below and outline planned revisions to improve the manuscript.

read point-by-point responses
  1. Referee: [Dataset-level synthesis] Dataset construction (synthesis paragraph): the claim that synthesized queries 'isolate individual challenges or combine them' and represent compounded enterprise difficulties lacks any quantitative validation (e.g., Kolmogorov-Smirnov tests or distribution comparisons on schema depth, function usage, or domain-term frequency) against the original scarce logs; without this, the 10.8% vs. 30.1% gap and subtask-bottleneck conclusion rest on an unverified assumption.

    Authors: We agree that quantitative validation against the original logs would provide stronger support for the fidelity of the synthesized queries. Privacy constraints prevent release of the original enterprise logs, precluding public statistical tests such as Kolmogorov-Smirnov comparisons. In the revised version we will expand the synthesis section with additional internal validation details (e.g., expert agreement rates, schema-depth histograms, and function-usage frequencies computed on the source logs) that can be reported without violating confidentiality. These additions will clarify the basis for the isolation/combination claims while preserving the expert-verification process already described. revision: partial

  2. Referee: [Evaluation results] Evaluation section: the headline accuracies (10.8% SOTA, 30.1% with hints) are reported without specifying the exact agentic frameworks, number of independent runs, or confidence intervals, making it impossible to assess whether the 'significant performance gap' is robust or sensitive to prompting variance.

    Authors: We accept that the current presentation omits these experimental details. The revised manuscript will name the specific agentic frameworks evaluated, state the number of independent runs, and report confidence intervals (or standard deviations) for the headline accuracy figures. This will allow readers to evaluate robustness directly. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical benchmark construction with direct evaluation

full rationale

The paper is an empirical benchmark paper that sources queries from private logs, synthesizes additional high-fidelity examples to address scarcity and isolate challenges, provides subtask annotations, and reports model accuracies via direct execution-based evaluation. No equations, fitted parameters, derivations, or self-citation chains appear in the provided text. Performance numbers (10.8% and 30.1%) are measured outcomes on the constructed dataset rather than predictions that reduce to the inputs by construction. The synthesis step and error taxonomy are methodological choices whose validity can be assessed externally; they do not create a self-referential loop. This is the normal case of a self-contained evaluation study.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central contribution rests on the validity of query synthesis from scarce private logs and expert verification to isolate challenges; no free parameters, new entities, or formal axioms are introduced.

axioms (1)
  • domain assumption Enterprise query logs can be synthesized into high-fidelity, expert-verified queries that isolate or combine challenges such as domain knowledge and query complexity.
    Invoked at the dataset level to address scarcity of enterprise logs and enable fine-grained evaluation.

pith-pipeline@v0.9.0 · 5874 in / 1292 out tokens · 33868 ms · 2026-05-23T20:45:32.056688+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Forward citations

Cited by 8 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Large Language Model-Enhanced Relational Operators: Taxonomy, Benchmark, and Analysis

    cs.DB 2026-03 unverdicted novelty 7.0

    The authors define a taxonomy for LLM-enhanced relational operators categorized into Select, Match, Impute, Cluster and Order, and release LROBench to evaluate single and multi-operator queries on semantic database pr...

  2. EGREFINE: An Execution-Grounded Optimization Framework for Text-to-SQL Schema Refinement

    cs.DB 2026-05 unverdicted novelty 6.0

    EGRefine optimizes column renamings via execution-grounded verification and view materialization to recover Text-to-SQL accuracy lost to schema naming issues while guaranteeing query equivalence.

  3. SPENCE: A Syntactic Probe for Detecting Contamination in NL2SQL Benchmarks

    cs.CL 2026-04 unverdicted novelty 6.0

    SPENCE shows older NL2SQL benchmarks like Spider have high performance sensitivity to syntactic changes, indicating likely training contamination, while newer ones like BIRD show little sensitivity and appear largely clean.

  4. An Alternate Agentic AI Architecture (It's About the Data)

    cs.DB 2026-04 unverdicted novelty 5.0

    RUBICON replaces opaque LLM-based tool orchestration in agentic AI with an explicit query algebra (AQL: Find, From, Where) executed via wrappers to deliver traceable, deterministic access to heterogeneous enterprise d...

  5. A Demonstration of SQLyzr: A Platform for Fine-Grained Text-to-SQL Evaluation and Analysis

    cs.DB 2026-04 unverdicted novelty 5.0

    SQLyzr is a new evaluation platform that adds diverse metrics, realistic settings, query classification, and analysis features to overcome the single-score limitations of existing text-to-SQL benchmarks.

  6. Adapt to Thrive! Adaptive Power-Mean Policy Optimization for Improved LLM Reasoning

    cs.CL 2026-04 unverdicted novelty 5.0

    APMPO boosts average Pass@1 scores on math reasoning benchmarks by 3 points over GRPO by using an adaptive power-mean policy objective and feedback-driven clipping bounds in RLVR training.

  7. Free Energy-Driven Reinforcement Learning with Adaptive Advantage Shaping for Unsupervised Reasoning in LLMs

    cs.CL 2026-04 unverdicted novelty 5.0

    FREIA applies free energy principles and adaptive advantage shaping to unsupervised RL, outperforming baselines by 0.5-3.5 Pass@1 points on math reasoning with a 1.5B model.

  8. Retrieve Only Relevant Tables Whether Few or Many: Adaptive Table Retrieval Method

    cs.IR 2026-04 unverdicted novelty 4.0

    An adaptive thresholding mechanism combined with sliding-window reranking retrieves a query-dependent number of tables from large corpora, improving retrieval and downstream text-to-SQL performance on Spider, BIRD, an...

Reference graph

Works this paper leans on

17 extracted references · 17 canonical work pages · cited by 8 Pith papers · 4 internal anchors

  1. [1]

    Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. 2023. Gpt-4 technical report. arXiv preprint arXiv:2303.08774

  2. [2]

    Peter Baile Chen, Yi Zhang, and Dan Roth. 2024. https://aclanthology.org/2024.acl-long.148 Is table retrieval a solved problem? exploring join-aware multi-table retrieval . In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2687--2699, Bangkok, Thailand. Association for Computational L...

  3. [3]

    Dawei Gao, Haibin Wang, Yaliang Li, Xiuyu Sun, Yichen Qian, Bolin Ding, and Jingren Zhou. 2024. https://doi.org/10.14778/3641204.3641221 Text-to-sql empowered by large language models: A benchmark evaluation . Proc. VLDB Endow., 17(5):1132–1145

  4. [4]

    Nikhil Kandpal, Haikang Deng, Adam Roberts, Eric Wallace, and Colin Raffel. 2023. Large language models struggle to learn long-tail knowledge. In International Conference on Machine Learning, pages 15696--15707. PMLR

  5. [5]

    Wuwei Lan, Zhiguo Wang, Anuj Chauhan, Henghui Zhu, Alexander Li, Jiang Guo, Sheng Zhang, Chung-Wei Hang, Joseph Lilien, Yiqun Hu, Lin Pan, Mingwen Dong, Jun Wang, Jiarong Jiang, Stephen Ash, Vittorio Castelli, Patrick Ng, and Bing Xiang. 2023. https://arxiv.org/abs/2305.16265 Unite: A unified benchmark for text-to-sql evaluation . Preprint, arXiv:2305.16265

  6. [6]

    Chia-Hsuan Lee, Oleksandr Polozov, and Matthew Richardson. 2021. https://doi.org/10.18653/v1/2021.acl-long.176 K aggle DBQA : Realistic evaluation of text-to- SQL parsers . In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Pa...

  7. [7]

    u ttler, Mike Lewis, Wen-tau Yih, Tim Rockt \

    Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich K \"u ttler, Mike Lewis, Wen-tau Yih, Tim Rockt \"a schel, et al. 2020. Retrieval-augmented generation for knowledge-intensive nlp tasks. Advances in Neural Information Processing Systems, 33:9459--9474

  8. [8]

    Jinyang Li, Binyuan Hui, Ge Qu, Jiaxi Yang, Binhua Li, Bowen Li, Bailin Wang, Bowen Qin, Ruiying Geng, Nan Huo, et al. 2024. Can llm already serve as a database interface? a big bench for large-scale database grounded text-to-sqls. Advances in Neural Information Processing Systems, 36

  9. [9]

    Xianming Li and Jing Li. 2023. Angle-optimized text embeddings. arXiv preprint arXiv:2309.12871

  10. [10]

    Zehan Li, Xin Zhang, Yanzhao Zhang, Dingkun Long, Pengjun Xie, and Meishan Zhang. 2023. Towards general text embeddings with multi-stage contrastive learning. arXiv preprint arXiv:2308.03281

  11. [11]

    Nelson F Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang. 2024. Lost in the middle: How language models use long contexts. Transactions of the Association for Computational Linguistics, 12:157--173

  12. [12]

    Jaydeep Sen, Fatma Ozcan, Abdul Quamar, Greg Stager, Ashish Mittal, Manasa Jammi, Chuan Lei, Diptikalyan Saha, and Karthik Sankaranarayanan. 2019. Natural language querying of complex business intelligence queries. In Proceedings of the 2019 International Conference on Management of Data, pages 1997--2000

  13. [13]

    Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timoth \'e e Lacroix, Baptiste Rozi \`e re, Naman Goyal, Eric Hambro, Faisal Azhar, et al. 2023. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971

  14. [14]

    Tao Yu, Rui Zhang, Kai Yang, Michihiro Yasunaga, Dongxu Wang, Zifan Li, James Ma, Irene Li, Qingning Yao, Shanelle Roman, et al. 2018. Spider: A large-scale human-labeled dataset for complex and cross-domain semantic parsing and text-to-sql task. arXiv preprint arXiv:1809.08887

  15. [15]

    online" 'onlinestring :=

    ENTRY address archivePrefix author booktitle chapter edition editor eid eprint eprinttype howpublished institution journal key month note number organization pages publisher school series title type volume year doi pubmed url lastchecked label extra.label sort.label short.list INTEGERS output.state before.all mid.sentence after.sentence after.block STRING...

  16. [16]

    write newline

    " write newline "" before.all 'output.state := FUNCTION n.dashify 't := "" t empty not t #1 #1 substring "-" = t #1 #2 substring "--" = not "--" * t #2 global.max substring 't := t #1 #1 substring "-" = "-" * t #2 global.max substring 't := while if t #1 #1 substring * t #2 global.max substring 't := if while FUNCTION word.in bbl.in capitalize " " * FUNCT...

  17. [17]

    write newline

    " write newline "" before.all 'output.state := FUNCTION n.dashify 't := "" t empty not t #1 #1 substring "-" = t #1 #2 substring "--" = not "--" * t #2 global.max substring 't := t #1 #1 substring "-" = "-" * t #2 global.max substring 't := while if t #1 #1 substring * t #2 global.max substring 't := if while FUNCTION format.date year duplicate empty "emp...