Agent-Agnostic Evaluation of SQL Accuracy in Production Text-to-SQL Systems
Pith reviewed 2026-05-07 07:00 UTC · model grok-4.3
The pith
STEF scores Text-to-SQL accuracy in production using only the user question, an enriched reformulation, and the generated SQL, without any database schema or reference queries.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
STEF is a schema-agnostic evaluation system that works solely with the user question, an enriched reformulation, and the generated SQL. It extracts semantic specifications from the natural language and SQL sides, performs normalized feature alignment, and computes a composite 0-100 accuracy score that accounts for filter alignment, semantic verdict, and evaluator confidence. The system adds enriched question quality validation, configurable rule injection through prompt templating, and production-tolerant normalization for GROUP BY, ORDER BY, and LIMIT variations.
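The paper describes this composite but does not publish its formula. A minimal sketch of how a 0-100 score could be assembled from the three named components; the linear blend and the specific weights are assumptions, not STEF's actual arithmetic.

```python
from dataclasses import dataclass

@dataclass
class EvalComponents:
    filter_alignment: float  # 0..1 overlap between NL-derived and SQL-derived filters
    semantic_verdict: float  # 0..1 judge verdict: does the SQL answer the question?
    confidence: float        # 0..1 evaluator's self-reported confidence

# Hypothetical weights; the paper does not disclose how the three parts are combined.
WEIGHTS = (0.4, 0.4, 0.2)

def composite_score(c: EvalComponents) -> float:
    """Blend the three components into the 0-100 scale the paper describes."""
    w_f, w_v, w_c = WEIGHTS
    raw = w_f * c.filter_alignment + w_v * c.semantic_verdict + w_c * c.confidence
    return round(100 * raw, 1)

# Example: strong filter overlap, positive verdict, moderately confident evaluator.
print(composite_score(EvalComponents(0.9, 1.0, 0.8)))  # -> 92.0
```

The point of the sketch is only that each input is observable without schema access; how the components should be weighted is exactly what remains uncalibrated in the paper.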
What carries the argument
Semantic specification extraction from natural language and SQL followed by normalized feature alignment to produce a composite accuracy score.
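What normalized feature alignment could look like in practice, as a hedged sketch: both sides are reduced to comparable feature strings and scored by set overlap. The normalization rule and the Jaccard measure are illustrative choices, not the paper's published procedure.

```python
def normalize(feature: str) -> str:
    # Collapse trivially different surface forms: "Status = 'Active'" vs "status='active'".
    return "".join(feature.lower().split())

def alignment_score(nl_features: set[str], sql_features: set[str]) -> float:
    """Jaccard overlap between normalized feature sets (filters, projections, aggregations)."""
    a = {normalize(f) for f in nl_features}
    b = {normalize(f) for f in sql_features}
    if not a and not b:
        return 1.0  # nothing required and nothing produced counts as aligned
    return len(a & b) / len(a | b)

# Features as an LLM extractor might emit them from the enriched question vs. the SQL.
nl = {"filter: region = 'EMEA'", "agg: count(orders)", "group: month"}
sql = {"filter: region = 'emea'", "agg: COUNT(orders)", "group: month"}
print(alignment_score(nl, sql))  # -> 1.0 once casing and spacing are normalized
```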
If this is right
- Production Text-to-SQL agents can receive continuous quality monitoring instead of remaining unevaluated after deployment.
- Feedback loops for agent improvement become feasible without requiring schema access or reference queries.
- Structured query evaluation scales to real-world deployments where traditional benchmarks cannot be applied.
- Enriched question reformulations serve as a first-class signal for detecting input quality issues that affect downstream SQL accuracy.
Where Pith is reading between the lines
- The same semantic-alignment approach could extend to evaluating other structured generation tasks such as API call construction or data transformation scripts where schemas are unavailable at evaluation time.
- Over repeated deployments the framework might accumulate enough scored examples to train lighter-weight evaluators that reduce reliance on the initial prompt-based rules.
- Teams could combine STEF outputs with user click-through or correction data to create closed-loop retraining signals that were previously blocked by the lack of production labels.
Load-bearing premise
Semantic specifications extracted from natural language and SQL can be accurately aligned and scored without any database schema or reference queries, relying instead on enriched reformulations and prompt-based rules.
What would settle it
A manual audit of production queries where STEF scores are compared against independent human judgments of whether each generated SQL actually answers the original question and produces the expected result.
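Once scores and human verdicts sit side by side, such an audit reduces to a simple measurement. A sketch assuming STEF's 0-100 outputs and binary human labels have been collected; the 50-point threshold is an arbitrary illustration.

```python
from statistics import mean

def audit(stef_scores: list[float], human_correct: list[bool], threshold: float = 50.0):
    """Compare thresholded STEF scores against independent human correctness judgments."""
    preds = [s >= threshold for s in stef_scores]
    agreement = mean(p == h for p, h in zip(preds, human_correct))
    # Crude separability check: mean score gap between human-correct and human-incorrect SQL.
    good = [s for s, h in zip(stef_scores, human_correct) if h]
    bad = [s for s, h in zip(stef_scores, human_correct) if not h]
    gap = mean(good) - mean(bad) if good and bad else float("nan")
    return agreement, gap

agreement, gap = audit([92, 88, 35, 71, 20], [True, True, False, True, False])
print(agreement, gap)  # -> 1.0 and a large positive gap if STEF separates the classes
```

High agreement and a wide gap would support the composite as a monitoring signal; low values would settle the question the other way.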
Original abstract
Text-to-SQL (T2SQL) evaluation in production environments poses fundamental challenges that existing benchmarks do not address. Current evaluation methodologies, whether rule-based SQL matching or schema-dependent semantic parsers, assume access to ground-truth queries and structured database schema, constraints that are rarely satisfied in real-world deployments. This disconnect leaves production T2SQL agents largely unevaluated beyond developer-time testing, creating silent quality degradation with no feedback mechanism for continuous improvement. We present STEF (Schema-agnostic Text-to-SQL Evaluation Framework), a production-native evaluation system that operates exclusively on natural language inputs (the user question, an enriched reformulation, and the generated SQL) without requiring database schema or reference queries. STEF extracts semantic specifications from both natural language and SQL representations, performs normalized feature alignment, and produces an interpretable 0 to 100 accuracy score via a composite metric that encompasses filter alignment, semantic verdict, and confidence of the evaluator. Key contributions include: enriched question quality validation as a first-class evaluation signal, configurable application-specific rule injection via prompt templating, and production-robust normalization handling GROUP BY tolerance, ORDER BY defaults, and LIMIT heuristics. Empirical results demonstrate that STEF enables continuous production monitoring and agent improvement feedback loops without schema dependency, making structured query evaluation viable at scale for the first time.
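The abstract names three tolerance heuristics (GROUP BY, ORDER BY, LIMIT) but not their logic. A minimal sketch of what production-robust normalization might look like, assuming keyword triggers in the question decide whether ORDER BY and LIMIT carry meaning; the paper's actual rules may differ.

```python
import re

def normalize_sql_for_eval(sql: str, question: str) -> str:
    """Strip clauses the question does not ask for before feature comparison (illustrative)."""
    s = sql.strip().rstrip(";")
    # ORDER BY default: if the question never requests an ordering, ignore a trailing ORDER BY.
    if not re.search(r"\b(top|highest|lowest|sort(ed)?|rank(ed)?)\b", question, re.I):
        s = re.sub(r"\s+ORDER\s+BY\s+.*?(?=\s+LIMIT\b|$)", "", s, flags=re.I)
    # LIMIT heuristic: a bare LIMIT with no "top N" style request is treated as a safety cap.
    if not re.search(r"\b(top|first|last)\s+\d+\b", question, re.I):
        s = re.sub(r"\s+LIMIT\s+\d+$", "", s, flags=re.I)
    # GROUP BY tolerance (grouping implied by the selected aggregates) is omitted for brevity.
    return s

q = "How many orders per region?"
print(normalize_sql_for_eval(
    "SELECT region, COUNT(*) FROM orders GROUP BY region ORDER BY region LIMIT 1000;", q))
# -> SELECT region, COUNT(*) FROM orders GROUP BY region
```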
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes STEF, a Schema-agnostic Text-to-SQL Evaluation Framework for production environments. It extracts semantic specifications from the user question, an enriched reformulation, and the generated SQL, performs normalized feature alignment, and computes an interpretable 0-100 accuracy score using a composite metric based on filter alignment, semantic verdict, and evaluator confidence. The framework claims to operate without database schema or reference queries, incorporating enriched question quality validation, configurable rule injection via prompt templating, and production-robust normalization for GROUP BY tolerance, ORDER BY defaults, and LIMIT heuristics. It asserts that empirical results demonstrate STEF enables continuous production monitoring and agent improvement feedback loops, making structured query evaluation viable at scale for the first time.
Significance. If the framework's reliability were demonstrated, STEF would address a genuine gap in production Text-to-SQL evaluation where ground-truth queries and schemas are unavailable. The emphasis on enriched reformulations, configurable rule injection, and normalization heuristics represents a practical direction for agent-agnostic monitoring. However, the manuscript provides no empirical data, validation methodology, or correlation analysis to support the central claims, so the significance remains potential rather than realized. The approach could enable feedback loops if the composite metric correlates with actual correctness, but this is not shown.
major comments (3)
- [Abstract] The claim that 'Empirical results demonstrate that STEF enables continuous production monitoring and agent improvement feedback loops' is unsupported by any data, tables, figures, datasets, human judgments, execution outcomes, or validation methodology in the manuscript. This is load-bearing for the central claim of viability at scale.
- [Abstract / Framework Description] The 0-100 accuracy score is produced by a composite metric (filter alignment + semantic verdict + confidence) that is itself defined via LLM prompts and internal rules without reference to external ground truth, execution results, or known-correct cases. This creates a self-referential loop where misalignment in prompt-based extraction (e.g., filters, GROUP BY tolerance) would propagate undetected, contradicting the production-monitoring goal.
- [Abstract] The core assumption that semantic specifications extracted from natural language and SQL can be accurately aligned and scored without schema, reference queries, or execution results is stated but not tested; no details are provided on how the enriched reformulation is generated, how alignment is validated, or how the metric was calibrated against any external anchor.
minor comments (2)
- [Abstract] The abstract is overly dense; separating the problem statement, method contributions, and claimed results into distinct sentences would improve readability.
- [Abstract] The manuscript would benefit from explicit definitions or pseudocode for the 'normalized feature alignment' and 'composite metric' steps, even at a high level.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive review. The comments accurately note that the current manuscript presents the STEF framework primarily through description and design rationale without accompanying empirical data or validation studies. We address each major comment below and will undertake a major revision to strengthen the evidential basis for the claims.
Point-by-point responses
- Referee: [Abstract] The claim that 'Empirical results demonstrate that STEF enables continuous production monitoring and agent improvement feedback loops' is unsupported by any data, tables, figures, datasets, human judgments, execution outcomes, or validation methodology in the manuscript. This is load-bearing for the central claim of viability at scale.
Authors: We acknowledge that the manuscript does not contain empirical data, tables, figures, or validation methodology to support the stated claim. The abstract phrasing overstated the current content, which focuses on the framework's architecture and production-oriented features. In the revised manuscript we will remove the unsupported claim from the abstract and replace it with a statement that STEF is designed to enable such monitoring. We will add a new 'Preliminary Validation' section that applies STEF to a sample of queries drawn from public Text-to-SQL benchmarks, reports the resulting score distributions, and discusses how the scores could be used to drive agent improvement loops. This section will also note the absence of direct human judgment correlation in the present work and outline how such correlation could be obtained in future studies. revision: yes
- Referee: [Abstract / Framework Description] The 0-100 accuracy score is produced by a composite metric (filter alignment + semantic verdict + confidence) that is itself defined via LLM prompts and internal rules without reference to external ground truth, execution results, or known-correct cases. This creates a self-referential loop where misalignment in prompt-based extraction (e.g., filters, GROUP BY tolerance) would propagate undetected, contradicting the production-monitoring goal.
Authors: The referee correctly identifies the self-referential character of the metric. This design choice was made precisely because production deployments frequently lack ground-truth queries and execution access. We agree that the approach carries the risk of undetected propagation of extraction errors. In the revision we will add an explicit 'Limitations' subsection that describes this risk, explains the role of the configurable rule-injection mechanism in constraining common failure modes (GROUP BY tolerance, ORDER BY defaults, LIMIT heuristics), and proposes a practical mitigation: periodic offline calibration against any available execution results or human labels. The section will also clarify that the composite metric is intended as a monitoring signal rather than a definitive correctness oracle. revision: yes
- Referee: [Abstract] The core assumption that semantic specifications extracted from natural language and SQL can be accurately aligned and scored without schema, reference queries, or execution results is stated but not tested; no details are provided on how the enriched reformulation is generated, how alignment is validated, or how the metric was calibrated against any external anchor.
Authors: We agree that the manuscript provides insufficient implementation detail on the enriched reformulation step, the alignment procedure, and any calibration process. The enriched reformulation is produced by a separate LLM prompt that expands the original question with inferred semantic context; feature alignment then compares normalized representations of filters, projections, and aggregations. In the revised version we will expand the 'Framework' section with the exact prompt templates used for reformulation and extraction, concrete examples of feature alignment (including edge cases), and a description of a lightweight calibration procedure that uses a small set of manually annotated query pairs to set component weights in the composite metric. These additions will make the underlying assumptions explicit and testable. revision: yes
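One concrete shape the proposed calibration could take is a grid search over component weights against the annotated pairs. The following sketch assumes each example is a hypothetical (filter, verdict, confidence) triple with a human correctness label; the paper does not specify the procedure.

```python
from itertools import product

def calibrate_weights(components: list[tuple[float, float, float]], labels: list[bool]):
    """Grid-search (filter, verdict, confidence) weights that best separate correct SQL."""
    best, best_acc = None, -1.0
    grid = [i / 10 for i in range(11)]
    for w_f, w_v in product(grid, grid):
        w_c = round(1.0 - w_f - w_v, 1)
        if w_c < 0:
            continue  # stay on the weight simplex: non-negative weights summing to 1
        scores = [w_f * f + w_v * v + w_c * c for f, v, c in components]
        acc = sum((s >= 0.5) == y for s, y in zip(scores, labels)) / len(labels)
        if acc > best_acc:
            best, best_acc = (w_f, w_v, w_c), acc
    return best, best_acc

# Tiny annotated set: component triples paired with human correctness labels.
data = [(0.9, 1.0, 0.8), (0.2, 0.0, 0.9), (0.7, 1.0, 0.4), (0.3, 0.0, 0.3)]
weights, acc = calibrate_weights(data, [True, False, True, False])
print(weights, acc)  # -> a weighting that classifies all four examples correctly
```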
Circularity Check
No circularity: STEF is an operational definition of an evaluation metric
Full rationale
The paper defines STEF explicitly as a composite procedure (semantic specification extraction from NL/SQL, normalized feature alignment, filter alignment, semantic verdict, and confidence scoring) that produces the 0-100 accuracy value by construction. This is a system specification rather than a derivation in which an independent quantity is shown to equal a fitted or self-referential input. No equations, parameter-fitting steps, or load-bearing self-citations are described that would reduce the claimed accuracy or monitoring capability to the inputs by definition. The framework is therefore self-contained as a proposed production tool; any external validation of its correlation to execution correctness lies outside the circularity analysis.
Axiom & Free-Parameter Ledger
free parameters (2)
- weights in composite metric
- rule injection parameters
axioms (2)
- Domain assumption: Enriched reformulation of the user question provides sufficient semantic information for evaluation.
- Domain assumption: Normalized feature alignment between NL and SQL accurately reflects query correctness.
invented entities (1)
- STEF composite metric (no independent evidence)
Reference graph
Works this paper leans on
- [1] T. Yu, et al. Spider: A Large-Scale Human-Labeled Dataset for Complex and Cross-Domain Semantic Parsing and Text-to-SQL Task. In EMNLP, 2018.
- [2] WikiSQL dataset.
- [3] BIRD-SQL: A Big Bench for Large-Scale Database Grounded Text-to-SQLs.
- [4] L. Nan, et al. FeTaQA: Free-form Table Question Answering. TACL, 2022.
- [5]
- [6] Snowflake. Cortex Analyst: Production Text-to-SQL Evaluation. Technical report, 2024.
- [7] IBM Research. Production SQL Evaluation Framework. Technical report, 2025.
- [8] L. Zheng, W.-L. Chiang, Y. Sheng, S. Zhuang, Z. Wu, Y. Zhuang, Z. Lin, Z. Li, D. Li, E. P. Xing, H. Zhang, J. E. Gonzalez, and I. Stoica. Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. In Advances in Neural Information Processing Systems (NeurIPS) Datasets and Benchmarks Track, 2023. https://arxiv.org/abs/2306.05685
- [9] Snowflake Engineering Blog. Cortex Analyst: Evaluating Text-to-SQL Accuracy for Real-World BI. Technical report, Snowflake Inc., 2024. https://www.snowflake.com/en/engineering-blog/cortex-analyst-text-to-sql-accuracy-bi/
- [10]
- [11] J. Guo, Z. Zhan, Y. Gao, Y. Xiao, J. Lou, T. Liu, and D. Zhang. Towards Complex Text-to-SQL in Cross-Domain Database with Intermediate Representation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (ACL), pages 4448–4459, 2019. https://arxiv.org/abs/1905.08205
- [12] B. Wang, R. Shin, X. Liu, O. Polozov, and M. Richardson. RAT-SQL: Relation-Aware Schema Encoding and Linking for Text-to-SQL Parsers. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (ACL), pages 7567–7578, 2020. https://arxiv.org/abs/1911.04942
- [13] T. Yu, R. Zhang, M. Yasunaga, Y. C. Tan, X. V. Lin, S. Li, H. Er, I. Li, B. Pang, T. Chen, E. Ji, S. Dixit, D. Proctor, S. Shim, J. Kraft, V. Zhang, C. Xiong, R. Socher, and D. Radev. SParC: Cross-Domain Semantic Parsing in Context. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (ACL), pages 4511–4523, 2019.
- [14] T. Yu, R. Zhang, H. Er, S. Li, E. Xue, B. Pang, X. V. Lin, Y. C. Tan, T. Shi, Z. Shim, C. Xiong, R. Socher, and D. Radev. CoSQL: A Conversational Text-to-SQL Challenge Towards Cross-Domain Natural Language Interfaces to Databases. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing (EMNLP-IJCNLP), pages 1962–1979, 2019.
- [15] C.-H. Chiang and H.-Y. Lee. Can Large Language Models Be an Alternative to Human Evaluations? In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (ACL), pages 15607–15631, 2023. https://arxiv.org/abs/2305.01937
- [16] A. Elgohary, D. Peskov, and J. Boyd-Graber. Can You Unpack That? Learning to Rewrite Questions-in-Context. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing (EMNLP-IJCNLP), pages 5920–5926, 2019. https://aclanthology.org/D19-1605/