PolySQL: Scaling Text-to-SQL Evaluation Across SQL Dialects via Automated Backend Isomorphism
Pith reviewed 2026-05-11 02:42 UTC · model grok-4.3
The pith
PolySQL evaluates text-to-SQL models across SQL dialects by comparing normalized execution results from different backends instead of translating queries.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By executing generated SQL statements on dialect-specific backends and aligning their result sets through normalization, semantic correctness can be assessed directly. This approach reveals a 10.1 percent average accuracy drop from SQLite to other dialects, attributes 61 percent of errors to logical rather than syntactic causes, and identifies a stable difficulty hierarchy among dialects while attaining full query coverage and superior evaluation fidelity compared with transpilation.
What carries the argument
Dual-execution with result normalization, which runs each candidate query on multiple SQL engines and treats matching normalized outputs as evidence of semantic equivalence.
If this is right
- Model accuracy declines 10.1 percent on average when moving from SQLite to other dialects.
- Logical errors account for 61 percent of failures while syntactic errors account for only 8 percent.
- A consistent difficulty ordering exists among SQL dialects.
- Three new datasets enable the first large-scale cross-dialect text-to-SQL studies.
- Evaluation fidelity exceeds that of query transpilation at 100 percent coverage.
Where Pith is reading between the lines
- Training pipelines could incorporate multi-dialect data to reduce the observed performance gap.
- Real deployments that span several database systems would gain from this style of validation.
- The normalization technique might generalize to other query languages that share core semantics but differ in surface syntax.
- Leaderboard rankings based solely on SQLite may systematically overestimate practical robustness.
Load-bearing premise
Normalizing execution results across dialects reliably detects semantic equivalence and any observed differences arise from the model rather than from normalization artifacts or backend quirks.
What would settle it
A manual review of queries flagged as inequivalent by PolySQL but confirmed as semantically identical by human inspection of the raw outputs.
Figures
read the original abstract
SQL dialects vary in syntax, types, and functions across database engines. Text-to-SQL benchmarks, however, predominantly support only SQLite. This creates a critical evaluation gap: cross-dialect evaluation reveals weak per-query agreement (Cohen's ), showing that SQLite performance is an unreliable proxy for other dialects. Yet such evaluation remains prohibitively difficult: existing approaches either require expensive manual query transpilation or rely on tools that often fail on complex SQL. To close this gap, we introduce PolySQL, a novel dual-execution method that eliminates the need for query transpilation by comparing normalized execution results. Notably, our approach achieves higher evaluation fidelity than query transpilation with 100% query coverage. PolySQL comprises three datasets, enabling the first large-scale cross-dialect study. Our study reveals a 10.1% average accuracy drop from SQLite to other dialects and identifies a significant dialect difficulty hierarchy. We find this degradation stems from logical rather than syntactic errors (61% vs. 8%). We release our framework code and leaderboard to enable rigorous dialect-robust evaluation.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces PolySQL, a dual-execution framework for cross-dialect Text-to-SQL evaluation that compares normalized execution results across SQL backends to determine semantic equivalence without query transpilation. It constructs three new datasets, reports a 10.1% average accuracy drop from SQLite to other dialects, identifies a dialect difficulty hierarchy, and attributes the drop primarily to logical errors (61%) rather than syntactic ones (8%), while claiming 100% query coverage and higher fidelity than transpilation baselines. The work also releases code and a leaderboard.
Significance. If the normalization of execution results can be shown to reliably distinguish semantic equivalence from backend artifacts, the approach would enable scalable, dialect-robust benchmarking that current SQLite-centric evaluations lack. The release of code, datasets, and leaderboard constitutes a concrete contribution that could support reproducible follow-up work.
major comments (2)
- The description of the normalization procedure for comparing execution results across dialects is insufficiently detailed to support the central claims. The reported 10.1% accuracy drop, the 61% vs. 8% logical-vs-syntactic error split, and the fidelity advantage over transpilation all rest on the assumption that normalized result comparison correctly identifies semantic equivalence; without explicit handling of type coercion, NULL semantics, floating-point precision, collation, and unordered result sets, these quantities cannot be verified as model-induced rather than normalization-induced.
- Dataset construction and error classification criteria are not described at a level that allows reproduction or independent validation of the error breakdown and dialect hierarchy. The abstract states concrete percentages and the existence of three datasets, yet provides no information on how queries were selected, how logical versus syntactic errors were distinguished, or how the 100% coverage was achieved.
minor comments (1)
- The abstract contains an incomplete parenthetical reference to Cohen's kappa (written as 'Cohen's )' with no value supplied), which should be completed or removed.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address each major comment below and will revise the manuscript to provide greater detail and reproducibility.
read point-by-point responses
-
Referee: The description of the normalization procedure for comparing execution results across dialects is insufficiently detailed to support the central claims. The reported 10.1% accuracy drop, the 61% vs. 8% logical-vs-syntactic error split, and the fidelity advantage over transpilation all rest on the assumption that normalized result comparison correctly identifies semantic equivalence; without explicit handling of type coercion, NULL semantics, floating-point precision, collation, and unordered result sets, these quantities cannot be verified as model-induced rather than normalization-induced.
Authors: We agree that the current description of the normalization procedure lacks sufficient explicit detail on the listed aspects. In the revised manuscript, we will expand the relevant section (currently 3.2) with a step-by-step account of result normalization, including dedicated handling for type coercion rules, NULL value semantics, floating-point precision tolerances, collation and string comparison, and canonical ordering of result sets. We will add pseudocode, concrete examples from each backend, and a discussion of how these steps isolate model-induced semantic differences from backend artifacts. This revision will directly support verification of the accuracy drop, error breakdown, and fidelity claims. revision: yes
-
Referee: Dataset construction and error classification criteria are not described at a level that allows reproduction or independent validation of the error breakdown and dialect hierarchy. The abstract states concrete percentages and the existence of three datasets, yet provides no information on how queries were selected, how logical versus syntactic errors were distinguished, or how the 100% coverage was achieved.
Authors: We acknowledge that the manuscript does not currently provide enough detail on dataset construction and error classification for full reproducibility. We will add a new subsection detailing the sources and selection criteria for the three datasets, the exact protocol used to distinguish logical from syntactic errors (including annotation guidelines and inter-annotator agreement), and the mechanisms that enable 100% coverage without query transpilation. These additions will allow independent validation of the reported percentages and dialect hierarchy. revision: yes
Circularity Check
No significant circularity in derivation chain
full rationale
The paper introduces PolySQL as a dual-execution method that normalizes and compares execution results across SQL backends to evaluate Text-to-SQL models without transpilation. All reported quantities (accuracy drops, error type breakdowns, dialect hierarchies, fidelity comparisons) are presented as direct empirical outputs from running the method on newly constructed datasets. No equations, fitted parameters, or self-citations appear in the provided text as load-bearing steps that reduce the central claims to inputs by construction. The approach is self-contained as an empirical measurement framework rather than a derived prediction from prior fitted values.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Normalized execution results from different SQL dialects can be directly compared to determine semantic equivalence of queries.
Reference graph
Works this paper leans on
-
[1]
Sql-gen: Bridging the dialect gap for text-to- sql via synthetic data and model merging.ArXiv, abs/2408.12733. Jie Shi, Xi Cao, Bo Xu, Jiaqing Liang, Yanghua Xiao, Jia Chen, Peng Wang, and Wei Wang. 2025. Dialect- SQL: An adaptive framework for bridging the dialect gap in text-to-SQL. InProceedings of the 2025 Con- ference on Empirical Methods in Natural ...
-
[2]
The model's query SUCCEEDED in SQLite (returned correct results)
-
[3]
The exact same query (conceptually) FAILED in the Target Dialect (e. g., PostgreSQL)
-
[4]
Your job is to explain WHY the transition caused a failure. Key fields you will receive: - `gen_type`: The TARGET SQL dialect (e.g., postgres, bigquery, snowflake). - `schema`: The database schema in the TARGET dialect. - `predicted_sql`: The model's generated SQL. - `gold_sql`: A reference query in SQLite (provided ONLY for intent understanding). - `ques...
-
[5]
If `results_equal` is True -> output `null` (no error to classify)
-
[6]
Otherwise, determine the PRIMARY root cause
-
[7]
Choose the MOST SPECIFIC category that applies
-
[8]
Follow the DECISION PROCEDURE strictly (it prioritizes Logic errors over Syntax errors). ================================================================= ERROR CATEGORIES (use these exact names) ================================================================= ------------------------------------------------------------------------------- CATEGORY 1: sch...
work page 2023
-
[9]
CHECK FOR SCHEMA REFERENCE ERRORS - Did it invent a table or column? -> If yes: `schema_linking_error`
-
[10]
CHECK FOR ROW SELECTION/LOGIC ERRORS - Did it filter on the wrong logic? - Did it try to compare a String to an Int (logic flaw)? - Did it mess up the JOINs? -> If yes: `filtering_error`
-
[11]
CHECK FOR AGGREGATION/OUTPUT ERRORS - Did it group by the wrong thing? - Did it fail a "Strict Group By" check? -> If yes: `aggregation_error`
-
[12]
CHECK FOR DIALECT SYNTAX ERRORS - Everything else makes sense, but it used `strftime` instead of ` EXTRACT`? - Everything else makes sense, but it used `"`instead of`` ` ``? -> If yes:`dialect _error`
-
[13]
CHECK FOR EVALUATION/PROCESS ISSUES (Last Resort) - The SQLisflawless, but execution failed? (Timeout/Crash) - The SQLisflawless, but`results _equal`isFalse? (Float/Sort mismatch) - The Gold Queryiswrong? -> If yes:`invalid _evaluation` ================================================================= OUTPUT FORMAT ========================================...
-
[14]
List Seman- tics).We explicitly handle the distinction be- tween ordered and unordered result sets
Structural Alignment (Bag vs. List Seman- tics).We explicitly handle the distinction be- tween ordered and unordered result sets. Let Qgold be the ground truth query and Rpred, Rgold be the execution result sets (rows). • Ordered Context:If Qgold contains an ex- plicit ORDER BY clause, we enforce strict list equality. The comparator validates that Rpred[i...
-
[15]
Schema Homogenization.To account for dialect-specific column naming conventions (e.g., case sensitivity), we normalize column headers to lowercase and structurally align the dataframes. If the model predicts a superset of the required columns (e.g., returning an extra ID column), we align Rpred to the schema of Rgold, discarding ex- traneous columns while...
-
[16]
Type-Aware Value Comparison.Finally, we perform cell-wise comparison with type tolerance: • Numeric Tolerance:Direct equality checks fail across dialects due to floating-point preci- sion differences (e.g., 64-bit vs. 32-bit floats). We employ a tolerance threshold (ϵ= 10 −5) using numpy.allclose, allowing for minor driver-level variations while rejecting...
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.