Agent-Agnostic Evaluation of SQL Accuracy in Production Text-to-SQL Systems
Pith reviewed 2026-05-07 07:00 UTC · model grok-4.3
The pith
STEF scores Text-to-SQL accuracy in production using only the user question, an enriched reformulation, and the generated SQL, without any database schema or reference queries.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
STEF is a schema-agnostic evaluation system that works solely with the user question, an enriched reformulation, and the generated SQL. It extracts semantic specifications from the natural language and SQL sides, performs normalized feature alignment, and computes a composite 0-100 accuracy score that accounts for filter alignment, semantic verdict, and evaluator confidence. The system adds enriched question quality validation, configurable rule injection through prompt templating, and production-tolerant normalization for GROUP BY, ORDER BY, and LIMIT variations.
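The paper describes this composite but does not publish its formula. A minimal sketch of how a 0-100 score could be assembled from the three named components; the linear blend and the specific weights are assumptions, not STEF's actual arithmetic.

```python
from dataclasses import dataclass

@dataclass
class EvalComponents:
    filter_alignment: float  # 0..1 overlap between NL-derived and SQL-derived filters
    semantic_verdict: float  # 0..1 judge verdict: does the SQL answer the question?
    confidence: float        # 0..1 evaluator's self-reported confidence

# Hypothetical weights; the paper does not disclose how the three parts are combined.
WEIGHTS = (0.4, 0.4, 0.2)

def composite_score(c: EvalComponents) -> float:
    """Blend the three components into the 0-100 scale the paper describes."""
    w_f, w_v, w_c = WEIGHTS
    raw = w_f * c.filter_alignment + w_v * c.semantic_verdict + w_c * c.confidence
    return round(100 * raw, 1)

# Example: strong filter overlap, positive verdict, moderately confident evaluator.
print(composite_score(EvalComponents(0.9, 1.0, 0.8)))  # -> 92.0
```

The point of the sketch is only that each input is observable without schema access; how the components should be weighted is exactly what remains uncalibrated in the paper.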
What carries the argument
Semantic specification extraction from natural language and SQL followed by normalized feature alignment to produce a composite accuracy score.
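What normalized feature alignment could look like in practice, as a hedged sketch: both sides are reduced to comparable feature strings and scored by set overlap. The normalization rule and the Jaccard measure are illustrative choices, not the paper's published procedure.

```python
def normalize(feature: str) -> str:
    # Collapse trivially different surface forms: "Status = 'Active'" vs "status='active'".
    return "".join(feature.lower().split())

def alignment_score(nl_features: set[str], sql_features: set[str]) -> float:
    """Jaccard overlap between normalized feature sets (filters, projections, aggregations)."""
    a = {normalize(f) for f in nl_features}
    b = {normalize(f) for f in sql_features}
    if not a and not b:
        return 1.0  # nothing required and nothing produced counts as aligned
    return len(a & b) / len(a | b)

# Features as an LLM extractor might emit them from the enriched question vs. the SQL.
nl = {"filter: region = 'EMEA'", "agg: count(orders)", "group: month"}
sql = {"filter: region = 'emea'", "agg: COUNT(orders)", "group: month"}
print(alignment_score(nl, sql))  # -> 1.0 once casing and spacing are normalized
```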
If this is right
- Production Text-to-SQL agents can receive continuous quality monitoring instead of remaining unevaluated after deployment.
- Feedback loops for agent improvement become feasible without requiring schema access or reference queries.
- Structured query evaluation scales to real-world deployments where traditional benchmarks cannot be applied.
- Enriched question reformulations serve as a first-class signal for detecting input quality issues that affect downstream SQL accuracy.
Where Pith is reading between the lines
- The same semantic-alignment approach could extend to evaluating other structured generation tasks such as API call construction or data transformation scripts where schemas are unavailable at evaluation time.
- Over repeated deployments the framework might accumulate enough scored examples to train lighter-weight evaluators that reduce reliance on the initial prompt-based rules.
- Teams could combine STEF outputs with user click-through or correction data to create closed-loop retraining signals that were previously blocked by the lack of production labels.
Load-bearing premise
Semantic specifications extracted from natural language and SQL can be accurately aligned and scored without any database schema or reference queries, relying instead on enriched reformulations and prompt-based rules.
What would settle it
A manual audit of production queries where STEF scores are compared against independent human judgments of whether each generated SQL actually answers the original question and produces the expected result.
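Once scores and human verdicts sit side by side, such an audit reduces to a simple measurement. A sketch assuming STEF's 0-100 outputs and binary human labels have been collected; the 50-point threshold is an arbitrary illustration.

```python
from statistics import mean

def audit(stef_scores: list[float], human_correct: list[bool], threshold: float = 50.0):
    """Compare thresholded STEF scores against independent human correctness judgments."""
    preds = [s >= threshold for s in stef_scores]
    agreement = mean(p == h for p, h in zip(preds, human_correct))
    # Crude separability check: mean score gap between human-correct and human-incorrect SQL.
    good = [s for s, h in zip(stef_scores, human_correct) if h]
    bad = [s for s, h in zip(stef_scores, human_correct) if not h]
    gap = mean(good) - mean(bad) if good and bad else float("nan")
    return agreement, gap

agreement, gap = audit([92, 88, 35, 71, 20], [True, True, False, True, False])
print(agreement, gap)  # -> 1.0 and a large positive gap if STEF separates the classes
```

High agreement and a wide gap would support the composite as a monitoring signal; low values would settle the question the other way.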
Original abstract
Text-to-SQL (T2SQL) evaluation in production environments poses fundamental challenges that existing benchmarks do not address. Current evaluation methodologies, whether rule-based SQL matching or schema-dependent semantic parsers, assume access to ground-truth queries and structured database schema, constraints that are rarely satisfied in real-world deployments. This disconnect leaves production T2SQL agents largely unevaluated beyond developer-time testing, creating silent quality degradation with no feedback mechanism for continuous improvement. We present STEF (Schema-agnostic Text-to-SQL Evaluation Framework), a production-native evaluation system that operates exclusively on natural language inputs (the user question, an enriched reformulation, and the generated SQL) without requiring database schema or reference queries. STEF extracts semantic specifications from both natural language and SQL representations, performs normalized feature alignment, and produces an interpretable 0 to 100 accuracy score via a composite metric that encompasses filter alignment, semantic verdict, and confidence of the evaluator. Key contributions include: enriched question quality validation as a first-class evaluation signal, configurable application-specific rule injection via prompt templating, and production-robust normalization handling GROUP BY tolerance, ORDER BY defaults, and LIMIT heuristics. Empirical results demonstrate that STEF enables continuous production monitoring and agent improvement feedback loops without schema dependency, making structured query evaluation viable at scale for the first time.
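The abstract names three tolerance heuristics (GROUP BY, ORDER BY, LIMIT) but not their logic. A minimal sketch of what production-robust normalization might look like, assuming keyword triggers in the question decide whether ORDER BY and LIMIT carry meaning; the paper's actual rules may differ.

```python
import re

def normalize_sql_for_eval(sql: str, question: str) -> str:
    """Strip clauses the question does not ask for before feature comparison (illustrative)."""
    s = sql.strip().rstrip(";")
    # ORDER BY default: if the question never requests an ordering, ignore a trailing ORDER BY.
    if not re.search(r"\b(top|highest|lowest|sort(ed)?|rank(ed)?)\b", question, re.I):
        s = re.sub(r"\s+ORDER\s+BY\s+.*?(?=\s+LIMIT\b|$)", "", s, flags=re.I)
    # LIMIT heuristic: a bare LIMIT with no "top N" style request is treated as a safety cap.
    if not re.search(r"\b(top|first|last)\s+\d+\b", question, re.I):
        s = re.sub(r"\s+LIMIT\s+\d+$", "", s, flags=re.I)
    # GROUP BY tolerance (grouping implied by the selected aggregates) is omitted for brevity.
    return s

q = "How many orders per region?"
print(normalize_sql_for_eval(
    "SELECT region, COUNT(*) FROM orders GROUP BY region ORDER BY region LIMIT 1000;", q))
# -> SELECT region, COUNT(*) FROM orders GROUP BY region
```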
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes STEF, a Schema-agnostic Text-to-SQL Evaluation Framework for production environments. It extracts semantic specifications from the user question, an enriched reformulation, and the generated SQL, performs normalized feature alignment, and computes an interpretable 0-100 accuracy score using a composite metric based on filter alignment, semantic verdict, and evaluator confidence. The framework claims to operate without database schema or reference queries, incorporating enriched question quality validation, configurable rule injection via prompt templating, and production-robust normalization for GROUP BY tolerance, ORDER BY defaults, and LIMIT heuristics. It asserts that empirical results demonstrate STEF enables continuous production monitoring and agent improvement feedback loops, making structured query evaluation viable at scale for the first time.
Significance. If the framework's reliability were demonstrated, STEF would address a genuine gap in production Text-to-SQL evaluation where ground-truth queries and schemas are unavailable. The emphasis on enriched reformulations, configurable rule injection, and normalization heuristics represents a practical direction for agent-agnostic monitoring. However, the manuscript provides no empirical data, validation methodology, or correlation analysis to support the central claims, so the significance remains potential rather than realized. The approach could enable feedback loops if the composite metric correlates with actual correctness, but this is not shown.
major comments (3)
- [Abstract] The claim that 'Empirical results demonstrate that STEF enables continuous production monitoring and agent improvement feedback loops' is unsupported by any data, tables, figures, datasets, human judgments, execution outcomes, or validation methodology in the manuscript. This is load-bearing for the central claim of viability at scale.
- [Abstract / Framework Description] The 0-100 accuracy score is produced by a composite metric (filter alignment + semantic verdict + confidence) that is itself defined via LLM prompts and internal rules without reference to external ground truth, execution results, or known-correct cases. This creates a self-referential loop where misalignment in prompt-based extraction (e.g., filters, GROUP BY tolerance) would propagate undetected, contradicting the production-monitoring goal.
- [Abstract] The core assumption that semantic specifications extracted from natural language and SQL can be accurately aligned and scored without schema, reference queries, or execution results is stated but not tested; no details are provided on how the enriched reformulation is generated, how alignment is validated, or how the metric was calibrated against any external anchor.
minor comments (2)
- [Abstract] The abstract is overly dense; separating the problem statement, method contributions, and claimed results into distinct sentences would improve readability.
- [Abstract] The manuscript would benefit from explicit definitions or pseudocode for the 'normalized feature alignment' and 'composite metric' steps, even at a high level.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive review. The comments accurately note that the current manuscript presents the STEF framework primarily through description and design rationale without accompanying empirical data or validation studies. We address each major comment below and will undertake a major revision to strengthen the evidential basis for the claims.
Point-by-point responses
- Referee: [Abstract] The claim that 'Empirical results demonstrate that STEF enables continuous production monitoring and agent improvement feedback loops' is unsupported by any data, tables, figures, datasets, human judgments, execution outcomes, or validation methodology in the manuscript. This is load-bearing for the central claim of viability at scale.
Authors: We acknowledge that the manuscript does not contain empirical data, tables, figures, or validation methodology to support the stated claim. The abstract phrasing overstated the current content, which focuses on the framework's architecture and production-oriented features. In the revised manuscript we will remove the unsupported claim from the abstract and replace it with a statement that STEF is designed to enable such monitoring. We will add a new 'Preliminary Validation' section that applies STEF to a sample of queries drawn from public Text-to-SQL benchmarks, reports the resulting score distributions, and discusses how the scores could be used to drive agent improvement loops. This section will also note the absence of direct human judgment correlation in the present work and outline how such correlation could be obtained in future studies. revision: yes
- Referee: [Abstract / Framework Description] The 0-100 accuracy score is produced by a composite metric (filter alignment + semantic verdict + confidence) that is itself defined via LLM prompts and internal rules without reference to external ground truth, execution results, or known-correct cases. This creates a self-referential loop where misalignment in prompt-based extraction (e.g., filters, GROUP BY tolerance) would propagate undetected, contradicting the production-monitoring goal.
Authors: The referee correctly identifies the self-referential character of the metric. This design choice was made precisely because production deployments frequently lack ground-truth queries and execution access. We agree that the approach carries the risk of undetected propagation of extraction errors. In the revision we will add an explicit 'Limitations' subsection that describes this risk, explains the role of the configurable rule-injection mechanism in constraining common failure modes (GROUP BY tolerance, ORDER BY defaults, LIMIT heuristics), and proposes a practical mitigation: periodic offline calibration against any available execution results or human labels. The section will also clarify that the composite metric is intended as a monitoring signal rather than a definitive correctness oracle. revision: yes
- Referee: [Abstract] The core assumption that semantic specifications extracted from natural language and SQL can be accurately aligned and scored without schema, reference queries, or execution results is stated but not tested; no details are provided on how the enriched reformulation is generated, how alignment is validated, or how the metric was calibrated against any external anchor.
Authors: We agree that the manuscript provides insufficient implementation detail on the enriched reformulation step, the alignment procedure, and any calibration process. The enriched reformulation is produced by a separate LLM prompt that expands the original question with inferred semantic context; feature alignment then compares normalized representations of filters, projections, and aggregations. In the revised version we will expand the 'Framework' section with the exact prompt templates used for reformulation and extraction, concrete examples of feature alignment (including edge cases), and a description of a lightweight calibration procedure that uses a small set of manually annotated query pairs to set component weights in the composite metric. These additions will make the underlying assumptions explicit and testable. revision: yes
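One concrete shape the proposed calibration could take is a grid search over component weights against the annotated pairs. The following sketch assumes each example is a hypothetical (filter, verdict, confidence) triple with a human correctness label; the paper does not specify the procedure.

```python
from itertools import product

def calibrate_weights(components: list[tuple[float, float, float]], labels: list[bool]):
    """Grid-search (filter, verdict, confidence) weights that best separate correct SQL."""
    best, best_acc = None, -1.0
    grid = [i / 10 for i in range(11)]
    for w_f, w_v in product(grid, grid):
        w_c = round(1.0 - w_f - w_v, 1)
        if w_c < 0:
            continue  # stay on the weight simplex: non-negative weights summing to 1
        scores = [w_f * f + w_v * v + w_c * c for f, v, c in components]
        acc = sum((s >= 0.5) == y for s, y in zip(scores, labels)) / len(labels)
        if acc > best_acc:
            best, best_acc = (w_f, w_v, w_c), acc
    return best, best_acc

# Tiny annotated set: component triples paired with human correctness labels.
data = [(0.9, 1.0, 0.8), (0.2, 0.0, 0.9), (0.7, 1.0, 0.4), (0.3, 0.0, 0.3)]
weights, acc = calibrate_weights(data, [True, False, True, False])
print(weights, acc)  # -> a weighting that classifies all four examples correctly
```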
Circularity Check
No circularity: STEF is an operational definition of an evaluation metric
Full rationale
The paper defines STEF explicitly as a composite procedure (semantic specification extraction from NL/SQL, normalized feature alignment, filter alignment, semantic verdict, and confidence scoring) that produces the 0-100 accuracy value by construction. This is a system specification rather than a derivation in which an independent quantity is shown to equal a fitted or self-referential input. No equations, parameter-fitting steps, or load-bearing self-citations are described that would reduce the claimed accuracy or monitoring capability to the inputs by definition. The framework is therefore self-contained as a proposed production tool; any external validation of its correlation to execution correctness lies outside the circularity analysis.
Axiom & Free-Parameter Ledger
free parameters (2)
- weights in composite metric
- rule injection parameters
axioms (2)
- Domain assumption: Enriched reformulation of the user question provides sufficient semantic information for evaluation.
- Domain assumption: Normalized feature alignment between NL and SQL accurately reflects query correctness.
invented entities (1)
- STEF composite metric (no independent evidence)
Reference graph
Works this paper leans on
- [1] T. Yu, et al. Spider: A Large-Scale Human-Labeled Dataset for Complex and Cross-Domain Semantic Parsing and Text-to-SQL Task. In EMNLP, 2018.
- [2] WikiSQL dataset.
- [3] BIRD-SQL: A Big Bench for Large-Scale Database Grounded Text-to-SQLs.
- [4] L. Nan, et al. FeTaQA: Free-form Table Question Answering. TACL, 2022.
- [5]
- [6] Snowflake. Cortex Analyst: Production Text-to-SQL Evaluation. Technical report, 2024.
- [7] IBM Research. Production SQL Evaluation Framework. Technical report, 2025.
- [8] L. Zheng, W.-L. Chiang, Y. Sheng, S. Zhuang, Z. Wu, Y. Zhuang, Z. Lin, Z. Li, D. Li, E. P. Xing, H. Zhang, J. E. Gonzalez, and I. Stoica. Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. In Advances in Neural Information Processing Systems (NeurIPS) Datasets and Benchmarks Track, 2023. https://arxiv.org/abs/2306.05685
- [9] Snowflake Engineering Blog. Cortex Analyst: Evaluating Text-to-SQL Accuracy for Real-World BI. Technical report, Snowflake Inc., 2024. https://www.snowflake.com/en/engineering-blog/cortex-analyst-text-to-sql-accuracy-bi/
- [10]
- [11] J. Guo, Z. Zhan, Y. Gao, Y. Xiao, J. Lou, T. Liu, and D. Zhang. Towards Complex Text-to-SQL in Cross-Domain Database with Intermediate Representation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (ACL), pages 4448–4459, 2019. https://arxiv.org/abs/1905.08205
- [12] B. Wang, R. Shin, X. Liu, O. Polozov, and M. Richardson. RAT-SQL: Relation-Aware Schema Encoding and Linking for Text-to-SQL Parsers. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (ACL), pages 7567–7578, 2020. https://arxiv.org/abs/1911.04942
- [13] T. Yu, R. Zhang, M. Yasunaga, Y. C. Tan, X. V. Lin, S. Li, H. Er, I. Li, B. Pang, T. Chen, E. Ji, S. Dixit, D. Proctor, S. Shim, J. Kraft, V. Zhang, C. Xiong, R. Socher, and D. Radev. SParC: Cross-Domain Semantic Parsing in Context. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (ACL), pages 4511–4523, 2019.
- [14] T. Yu, R. Zhang, H. Er, S. Li, E. Xue, B. Pang, X. V. Lin, Y. C. Tan, T. Shi, Z. Shim, C. Xiong, R. Socher, and D. Radev. CoSQL: A Conversational Text-to-SQL Challenge Towards Cross-Domain Natural Language Interfaces to Databases. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing (EMNLP-IJCNLP), pages 1962–1979, 2019.
- [15] C.-H. Chiang and H.-Y. Lee. Can Large Language Models Be an Alternative to Human Evaluations? In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (ACL), pages 15607–15631, 2023. https://arxiv.org/abs/2305.01937
- [16] A. Elgohary, D. Peskov, and J. Boyd-Graber. Can You Unpack That? Learning to Rewrite Questions-in-Context. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing (EMNLP-IJCNLP), pages 5920–5926, 2019. https://aclanthology.org/D19-1605/