PolySQL: Scaling Text-to-SQL Evaluation Across SQL Dialects via Automated Backend Isomorphism

Andrea Giovannini; Corentin Royer; Elad Venezian; Francesco Fusco; Yotam Perlitz

arxiv: 2605.07796 · v1 · submitted 2026-05-08 · 💻 cs.CL

PolySQL: Scaling Text-to-SQL Evaluation Across SQL Dialects via Automated Backend Isomorphism

Yotam Perlitz , Elad Venezian , Corentin Royer , Francesco Fusco , Andrea Giovannini This is my paper

Pith reviewed 2026-05-11 02:42 UTC · model grok-4.3

classification 💻 cs.CL

keywords text-to-sqlsql dialectscross-dialect evaluationquery normalizationsemantic equivalencelogical errorsevaluation benchmarks

0 comments

The pith

PolySQL evaluates text-to-SQL models across SQL dialects by comparing normalized execution results from different backends instead of translating queries.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Current text-to-SQL benchmarks rely almost exclusively on SQLite, yet model outputs show weak agreement when the same queries run on other dialects such as PostgreSQL or MySQL. PolySQL closes this gap with a dual-execution technique that runs each generated query on multiple engines and checks whether their normalized results match, thereby determining semantic equivalence without any query rewriting. The method delivers 100 percent coverage and higher fidelity than manual or automated transpilation. Large-scale tests using three new datasets show an average 10.1 percent accuracy decline outside SQLite, driven mainly by logical mistakes rather than syntax problems, and expose a consistent ordering of dialect difficulty.

Core claim

By executing generated SQL statements on dialect-specific backends and aligning their result sets through normalization, semantic correctness can be assessed directly. This approach reveals a 10.1 percent average accuracy drop from SQLite to other dialects, attributes 61 percent of errors to logical rather than syntactic causes, and identifies a stable difficulty hierarchy among dialects while attaining full query coverage and superior evaluation fidelity compared with transpilation.

What carries the argument

Dual-execution with result normalization, which runs each candidate query on multiple SQL engines and treats matching normalized outputs as evidence of semantic equivalence.

If this is right

Model accuracy declines 10.1 percent on average when moving from SQLite to other dialects.
Logical errors account for 61 percent of failures while syntactic errors account for only 8 percent.
A consistent difficulty ordering exists among SQL dialects.
Three new datasets enable the first large-scale cross-dialect text-to-SQL studies.
Evaluation fidelity exceeds that of query transpilation at 100 percent coverage.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Training pipelines could incorporate multi-dialect data to reduce the observed performance gap.
Real deployments that span several database systems would gain from this style of validation.
The normalization technique might generalize to other query languages that share core semantics but differ in surface syntax.
Leaderboard rankings based solely on SQLite may systematically overestimate practical robustness.

Load-bearing premise

Normalizing execution results across dialects reliably detects semantic equivalence and any observed differences arise from the model rather than from normalization artifacts or backend quirks.

What would settle it

A manual review of queries flagged as inequivalent by PolySQL but confirmed as semantically identical by human inspection of the raw outputs.

Figures

Figures reproduced from arXiv: 2605.07796 by Andrea Giovannini, Corentin Royer, Elad Venezian, Francesco Fusco, Yotam Perlitz.

**Figure 1.** Figure 1: Dual Execution vs. Query Transpilation. (a) Single-dialect evaluation. The model maps an NL question q to SQL, the SQL is executed on the source DB to obtain Rˆsrc, that is compared against a gold query result Rsrc . (b) Transpilation for cross-dialect evaluation. The model generates an SQL query yˆ tgt in the target dialect, which is executed on a migrated (offline) DB . To obtain the reference result Rtg… view at source ↗

**Figure 2.** Figure 2: Dialect Difficulty Hierarchy via McNemar’s Test. Heatmap of pairwise McNemar’s test p-values comparing model performance across dialects. Dark green indicates significant differences (p < 0.05), while light colors indicate statistical indistinguishability. PostgreSQL and MySQL form a statistically indistinguishable intermediate tier (p ≈ 0.40), while BigQuery and Snowflake constitute the most challenging… view at source ↗

**Figure 3.** Figure 3: Ranking Instability Across Dialects. The y-axis reports model rank averaged over three datasets, as measured by PolySQLon each SQL dialect (x-axis). Rankings on SQLite (leftmost) differ markedly from those on enterprise dialects, revealing substantial rank reordering and limiting the predictive value of SQLitebased leaderboards (detailed analysis in [PITH_FULL_IMAGE:figures/full_fig_p008_3.png] view at source ↗

read the original abstract

SQL dialects vary in syntax, types, and functions across database engines. Text-to-SQL benchmarks, however, predominantly support only SQLite. This creates a critical evaluation gap: cross-dialect evaluation reveals weak per-query agreement (Cohen's ), showing that SQLite performance is an unreliable proxy for other dialects. Yet such evaluation remains prohibitively difficult: existing approaches either require expensive manual query transpilation or rely on tools that often fail on complex SQL. To close this gap, we introduce PolySQL, a novel dual-execution method that eliminates the need for query transpilation by comparing normalized execution results. Notably, our approach achieves higher evaluation fidelity than query transpilation with 100% query coverage. PolySQL comprises three datasets, enabling the first large-scale cross-dialect study. Our study reveals a 10.1% average accuracy drop from SQLite to other dialects and identifies a significant dialect difficulty hierarchy. We find this degradation stems from logical rather than syntactic errors (61% vs. 8%). We release our framework code and leaderboard to enable rigorous dialect-robust evaluation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

PolySQL offers a dual-execution normalization method to evaluate text-to-SQL across dialects without transpilation, plus new datasets and a 10% accuracy drop finding, but the reliability of that normalization is the part that needs verification.

read the letter

The paper's main contribution is a dual-execution method that evaluates text-to-SQL queries across SQL dialects by running them on different backends and normalizing the results for comparison. This avoids the usual transpilation headaches and achieves full coverage. They introduce three new datasets and run a large-scale study. The results show models drop 10.1% in accuracy when moving from SQLite to other dialects. They also map out a difficulty hierarchy among dialects and break down the errors, finding logical mistakes account for 61% of the degradation while syntactic ones are only 8%. The release of the framework code and leaderboard is a practical plus. It lets the community test models on multiple dialects without starting from scratch. The soft spot sits in the normalization step. The claims about superior fidelity and the error split rest on normalized outputs correctly flagging semantic equivalence. Differences in NULL semantics, type coercion, floating-point handling, or unordered sets could create mismatches that aren't really model faults. If the paper spells out the normalization rules with enough detail and tests for these edge cases, that concern shrinks. Otherwise the central numbers need more scrutiny. This paper targets researchers working on natural language to database interfaces who care about production dialects rather than just benchmarks. Anyone running evaluations or building models for real SQL engines will find the datasets and method useful. It deserves peer review. The idea is straightforward, the gap it fills is clear, and the open resources make it worth the time. Referees can check the normalization procedure and error classification to confirm the findings hold.

Referee Report

2 major / 1 minor

Summary. The paper introduces PolySQL, a dual-execution framework for cross-dialect Text-to-SQL evaluation that compares normalized execution results across SQL backends to determine semantic equivalence without query transpilation. It constructs three new datasets, reports a 10.1% average accuracy drop from SQLite to other dialects, identifies a dialect difficulty hierarchy, and attributes the drop primarily to logical errors (61%) rather than syntactic ones (8%), while claiming 100% query coverage and higher fidelity than transpilation baselines. The work also releases code and a leaderboard.

Significance. If the normalization of execution results can be shown to reliably distinguish semantic equivalence from backend artifacts, the approach would enable scalable, dialect-robust benchmarking that current SQLite-centric evaluations lack. The release of code, datasets, and leaderboard constitutes a concrete contribution that could support reproducible follow-up work.

major comments (2)

The description of the normalization procedure for comparing execution results across dialects is insufficiently detailed to support the central claims. The reported 10.1% accuracy drop, the 61% vs. 8% logical-vs-syntactic error split, and the fidelity advantage over transpilation all rest on the assumption that normalized result comparison correctly identifies semantic equivalence; without explicit handling of type coercion, NULL semantics, floating-point precision, collation, and unordered result sets, these quantities cannot be verified as model-induced rather than normalization-induced.
Dataset construction and error classification criteria are not described at a level that allows reproduction or independent validation of the error breakdown and dialect hierarchy. The abstract states concrete percentages and the existence of three datasets, yet provides no information on how queries were selected, how logical versus syntactic errors were distinguished, or how the 100% coverage was achieved.

minor comments (1)

The abstract contains an incomplete parenthetical reference to Cohen's kappa (written as 'Cohen's )' with no value supplied), which should be completed or removed.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and will revise the manuscript to provide greater detail and reproducibility.

read point-by-point responses

Referee: The description of the normalization procedure for comparing execution results across dialects is insufficiently detailed to support the central claims. The reported 10.1% accuracy drop, the 61% vs. 8% logical-vs-syntactic error split, and the fidelity advantage over transpilation all rest on the assumption that normalized result comparison correctly identifies semantic equivalence; without explicit handling of type coercion, NULL semantics, floating-point precision, collation, and unordered result sets, these quantities cannot be verified as model-induced rather than normalization-induced.

Authors: We agree that the current description of the normalization procedure lacks sufficient explicit detail on the listed aspects. In the revised manuscript, we will expand the relevant section (currently 3.2) with a step-by-step account of result normalization, including dedicated handling for type coercion rules, NULL value semantics, floating-point precision tolerances, collation and string comparison, and canonical ordering of result sets. We will add pseudocode, concrete examples from each backend, and a discussion of how these steps isolate model-induced semantic differences from backend artifacts. This revision will directly support verification of the accuracy drop, error breakdown, and fidelity claims. revision: yes
Referee: Dataset construction and error classification criteria are not described at a level that allows reproduction or independent validation of the error breakdown and dialect hierarchy. The abstract states concrete percentages and the existence of three datasets, yet provides no information on how queries were selected, how logical versus syntactic errors were distinguished, or how the 100% coverage was achieved.

Authors: We acknowledge that the manuscript does not currently provide enough detail on dataset construction and error classification for full reproducibility. We will add a new subsection detailing the sources and selection criteria for the three datasets, the exact protocol used to distinguish logical from syntactic errors (including annotation guidelines and inter-annotator agreement), and the mechanisms that enable 100% coverage without query transpilation. These additions will allow independent validation of the reported percentages and dialect hierarchy. revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper introduces PolySQL as a dual-execution method that normalizes and compares execution results across SQL backends to evaluate Text-to-SQL models without transpilation. All reported quantities (accuracy drops, error type breakdowns, dialect hierarchies, fidelity comparisons) are presented as direct empirical outputs from running the method on newly constructed datasets. No equations, fitted parameters, or self-citations appear in the provided text as load-bearing steps that reduce the central claims to inputs by construction. The approach is self-contained as an empirical measurement framework rather than a derived prediction from prior fitted values.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central method rests on the assumption that normalized result sets from different SQL engines can be compared to determine query equivalence; no free parameters or new entities are introduced in the abstract.

axioms (1)

domain assumption Normalized execution results from different SQL dialects can be directly compared to determine semantic equivalence of queries.
This assumption underpins the entire dual-execution approach described in the abstract.

pith-pipeline@v0.9.0 · 5499 in / 1290 out tokens · 47460 ms · 2026-05-11T02:42:52.652335+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

16 extracted references · 16 canonical work pages

[1]

Mohammadreza Pourreza, Shayan Talaei, Ruoxi Sun, Xingchen Wan, Hailong Li, Azalia Mirhoseini, Amin Saberi, Sercan Arik, and 1 others

Sql-gen: Bridging the dialect gap for text-to- sql via synthetic data and model merging.ArXiv, abs/2408.12733. Jie Shi, Xi Cao, Bo Xu, Jiaqing Liang, Yanghua Xiao, Jia Chen, Peng Wang, and Wei Wang. 2025. Dialect- SQL: An adaptive framework for bridging the dialect gap in text-to-SQL. InProceedings of the 2025 Con- ference on Empirical Methods in Natural ...

work page arXiv 2025
[2]

The model's query SUCCEEDED in SQLite (returned correct results)

work page
[3]

g., PostgreSQL)

The exact same query (conceptually) FAILED in the Target Dialect (e. g., PostgreSQL)

work page
[4]

Key fields you will receive: - `gen_type`: The TARGET SQL dialect (e.g., postgres, bigquery, snowflake)

Your job is to explain WHY the transition caused a failure. Key fields you will receive: - `gen_type`: The TARGET SQL dialect (e.g., postgres, bigquery, snowflake). - `schema`: The database schema in the TARGET dialect. - `predicted_sql`: The model's generated SQL. - `gold_sql`: A reference query in SQLite (provided ONLY for intent understanding). - `ques...

work page
[5]

If `results_equal` is True -> output `null` (no error to classify)

work page
[6]

Otherwise, determine the PRIMARY root cause

work page
[7]

Choose the MOST SPECIFIC category that applies

work page
[8]

Follow the DECISION PROCEDURE strictly (it prioritizes Logic errors over Syntax errors). ================================================================= ERROR CATEGORIES (use these exact names) ================================================================= ------------------------------------------------------------------------------- CATEGORY 1: sch...

work page 2023
[9]

CHECK FOR SCHEMA REFERENCE ERRORS - Did it invent a table or column? -> If yes: `schema_linking_error`

work page
[10]

CHECK FOR ROW SELECTION/LOGIC ERRORS - Did it filter on the wrong logic? - Did it try to compare a String to an Int (logic flaw)? - Did it mess up the JOINs? -> If yes: `filtering_error`

work page
[11]

Strict Group By

CHECK FOR AGGREGATION/OUTPUT ERRORS - Did it group by the wrong thing? - Did it fail a "Strict Group By" check? -> If yes: `aggregation_error`

work page
[12]

CHECK FOR DIALECT SYNTAX ERRORS - Everything else makes sense, but it used `strftime` instead of ` EXTRACT`? - Everything else makes sense, but it used `"`instead of`` ` ``? -> If yes:`dialect _error`

work page
[13]

question_id

CHECK FOR EVALUATION/PROCESS ISSUES (Last Resort) - The SQLisflawless, but execution failed? (Timeout/Crash) - The SQLisflawless, but`results _equal`isFalse? (Float/Sort mismatch) - The Gold Queryiswrong? -> If yes:`invalid _evaluation` ================================================================= OUTPUT FORMAT ========================================...

work page
[14]

List Seman- tics).We explicitly handle the distinction be- tween ordered and unordered result sets

Structural Alignment (Bag vs. List Seman- tics).We explicitly handle the distinction be- tween ordered and unordered result sets. Let Qgold be the ground truth query and Rpred, Rgold be the execution result sets (rows). • Ordered Context:If Qgold contains an ex- plicit ORDER BY clause, we enforce strict list equality. The comparator validates that Rpred[i...

work page
[15]

Schema Homogenization.To account for dialect-specific column naming conventions (e.g., case sensitivity), we normalize column headers to lowercase and structurally align the dataframes. If the model predicts a superset of the required columns (e.g., returning an extra ID column), we align Rpred to the schema of Rgold, discarding ex- traneous columns while...

work page
[16]

32-bit floats)

Type-Aware Value Comparison.Finally, we perform cell-wise comparison with type tolerance: • Numeric Tolerance:Direct equality checks fail across dialects due to floating-point preci- sion differences (e.g., 64-bit vs. 32-bit floats). We employ a tolerance threshold (ϵ= 10 −5) using numpy.allclose, allowing for minor driver-level variations while rejecting...

work page

[1] [1]

Mohammadreza Pourreza, Shayan Talaei, Ruoxi Sun, Xingchen Wan, Hailong Li, Azalia Mirhoseini, Amin Saberi, Sercan Arik, and 1 others

Sql-gen: Bridging the dialect gap for text-to- sql via synthetic data and model merging.ArXiv, abs/2408.12733. Jie Shi, Xi Cao, Bo Xu, Jiaqing Liang, Yanghua Xiao, Jia Chen, Peng Wang, and Wei Wang. 2025. Dialect- SQL: An adaptive framework for bridging the dialect gap in text-to-SQL. InProceedings of the 2025 Con- ference on Empirical Methods in Natural ...

work page arXiv 2025

[2] [2]

The model's query SUCCEEDED in SQLite (returned correct results)

work page

[3] [3]

g., PostgreSQL)

The exact same query (conceptually) FAILED in the Target Dialect (e. g., PostgreSQL)

work page

[4] [4]

Key fields you will receive: - `gen_type`: The TARGET SQL dialect (e.g., postgres, bigquery, snowflake)

Your job is to explain WHY the transition caused a failure. Key fields you will receive: - `gen_type`: The TARGET SQL dialect (e.g., postgres, bigquery, snowflake). - `schema`: The database schema in the TARGET dialect. - `predicted_sql`: The model's generated SQL. - `gold_sql`: A reference query in SQLite (provided ONLY for intent understanding). - `ques...

work page

[5] [5]

If `results_equal` is True -> output `null` (no error to classify)

work page

[6] [6]

Otherwise, determine the PRIMARY root cause

work page

[7] [7]

Choose the MOST SPECIFIC category that applies

work page

[8] [8]

Follow the DECISION PROCEDURE strictly (it prioritizes Logic errors over Syntax errors). ================================================================= ERROR CATEGORIES (use these exact names) ================================================================= ------------------------------------------------------------------------------- CATEGORY 1: sch...

work page 2023

[9] [9]

CHECK FOR SCHEMA REFERENCE ERRORS - Did it invent a table or column? -> If yes: `schema_linking_error`

work page

[10] [10]

CHECK FOR ROW SELECTION/LOGIC ERRORS - Did it filter on the wrong logic? - Did it try to compare a String to an Int (logic flaw)? - Did it mess up the JOINs? -> If yes: `filtering_error`

work page

[11] [11]

Strict Group By

CHECK FOR AGGREGATION/OUTPUT ERRORS - Did it group by the wrong thing? - Did it fail a "Strict Group By" check? -> If yes: `aggregation_error`

work page

[12] [12]

CHECK FOR DIALECT SYNTAX ERRORS - Everything else makes sense, but it used `strftime` instead of ` EXTRACT`? - Everything else makes sense, but it used `"`instead of`` ` ``? -> If yes:`dialect _error`

work page

[13] [13]

question_id

CHECK FOR EVALUATION/PROCESS ISSUES (Last Resort) - The SQLisflawless, but execution failed? (Timeout/Crash) - The SQLisflawless, but`results _equal`isFalse? (Float/Sort mismatch) - The Gold Queryiswrong? -> If yes:`invalid _evaluation` ================================================================= OUTPUT FORMAT ========================================...

work page

[14] [14]

List Seman- tics).We explicitly handle the distinction be- tween ordered and unordered result sets

Structural Alignment (Bag vs. List Seman- tics).We explicitly handle the distinction be- tween ordered and unordered result sets. Let Qgold be the ground truth query and Rpred, Rgold be the execution result sets (rows). • Ordered Context:If Qgold contains an ex- plicit ORDER BY clause, we enforce strict list equality. The comparator validates that Rpred[i...

work page

[15] [15]

Schema Homogenization.To account for dialect-specific column naming conventions (e.g., case sensitivity), we normalize column headers to lowercase and structurally align the dataframes. If the model predicts a superset of the required columns (e.g., returning an extra ID column), we align Rpred to the schema of Rgold, discarding ex- traneous columns while...

work page

[16] [16]

32-bit floats)

Type-Aware Value Comparison.Finally, we perform cell-wise comparison with type tolerance: • Numeric Tolerance:Direct equality checks fail across dialects due to floating-point preci- sion differences (e.g., 64-bit vs. 32-bit floats). We employ a tolerance threshold (ϵ= 10 −5) using numpy.allclose, allowing for minor driver-level variations while rejecting...

work page