
arxiv: 2604.06736 · v1 · submitted 2026-04-08 · 💻 cs.CL · cs.DB

SQLStructEval: Structural Evaluation of LLM Text-to-SQL Generation

Pith reviewed 2026-05-10 18:42 UTC · model grok-4.3

keywords text-to-sql · llm evaluation · sql generation · abstract syntax trees · structural consistency · program synthesis · spider benchmark · query diversity

The pith

LLM Text-to-SQL systems generate structurally diverse queries even for correct executions, and a structured compile pipeline can improve both consistency and accuracy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that despite high execution accuracy on benchmarks like Spider, LLM-generated SQL queries frequently differ in their underlying structure for the same task. These structural differences are often triggered by small changes in the input phrasing or how the database schema is presented. SQLStructEval provides a way to measure this by converting queries to canonical abstract syntax trees. The authors demonstrate that forcing generation into a structured space through a compile-style process not only increases structural consistency but also raises execution accuracy. This work argues that structural reliability deserves attention as a separate evaluation dimension for LLM-based code generation.

Core claim

Modern LLMs often produce structurally diverse queries for the same input, even when execution results are correct, and such variance is frequently triggered by surface-level input changes such as paraphrases or schema presentation. Generating queries in a structured space via a compile-style pipeline can improve both execution accuracy and structural consistency.

What carries the argument

SQLStructEval framework, which analyzes SQL program structures through canonical abstract syntax tree (AST) representations to quantify structural diversity and reliability.

If this is right

  • Structural diversity in LLM SQL outputs can be systematically measured using canonical ASTs.
  • Minor input variations like paraphrases lead to inconsistent query structures.
  • A compile-style pipeline for structured generation enhances both accuracy and consistency.
  • Structural reliability is an overlooked but important aspect of evaluating LLM program generation.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Extending this AST-based evaluation to other programming languages could reveal similar structural issues in general code generation tasks.
  • If structural consistency reduces debugging effort in deployed systems, then prioritizing it during model training could produce more maintainable outputs.
  • Future benchmarks might benefit from adding structural variance metrics alongside execution accuracy to better reflect real-world reliability.

Load-bearing premise

That canonical AST representations capture structural differences that have practical implications for reliability in real-world Text-to-SQL applications.

What would settle it

Running the compile-style pipeline on a new set of Spider queries and finding no gain in execution accuracy or no reduction in AST variance would challenge the reported improvements.

Figures

Figures reproduced from arXiv: 2604.06736 by Fan Zhang, Haipeng Zhang, Preslav Nakov, Yixi Zhou, Yu Chen, Zhiqiao Guo, Zhuohan Xie.

Figure 1: Comparison between traditional LLM-based Text-to-SQL generation and our proposed SQLSTRUCTEVAL framework. Traditional methods directly generate SQL queries from text prompts, which often leads to structurally inconsistent outputs across repeated generations. In contrast, SQLSTRUCTEVAL introduces an AST-based representation that enables explicit structural analysis and improves the stability and comparab…
Figure 2: AST structures of two execution-equivalent SQL queries in Example 1 of Appendix

Original abstract

Despite strong performance on Text-to-SQL benchmarks, it remains unclear whether LLM-generated SQL programs are structurally reliable. In this work, we investigate the structural behavior of LLM-generated SQL queries and introduce SQLStructEval, a framework for analyzing program structures through canonical abstract syntax tree (AST) representations. Our experiments on the Spider benchmark show that modern LLMs often produce structurally diverse queries for the same input, even when execution results are correct, and that such variance is frequently triggered by surface-level input changes such as paraphrases or schema presentation. We further show that generating queries in a structured space via a compile-style pipeline can improve both execution accuracy and structural consistency. These findings suggest that structural reliability is a critical yet overlooked dimension for evaluating LLM-based program generation systems. Our code is available at https://anonymous.4open.science/r/StructEval-2435.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces SQLStructEval, a framework that evaluates the structural properties of LLM-generated SQL queries via canonical AST representations. On the Spider benchmark, it reports that modern LLMs frequently produce structurally diverse queries for identical inputs even when execution results are correct, with this variance often triggered by surface-level input perturbations such as paraphrases or schema presentation order. The authors further propose a compile-style pipeline that generates queries within a constrained structural space and demonstrate gains in both execution accuracy and structural consistency, arguing that structural reliability constitutes an overlooked evaluation dimension for LLM-based program generation.

Significance. If the AST-based structural metric proves predictive of downstream reliability (e.g., under schema evolution or maintenance), the work usefully expands Text-to-SQL evaluation beyond execution match. The public release of code is a clear strength that supports reproducibility and follow-up experiments. The empirical observation of input-triggered structural variance is noteworthy and could inform prompt engineering and decoding strategies.

major comments (3)
  1. [Pipeline description and results (likely §4)] The central claim that the compile-style pipeline improves structural consistency (and that this is not merely richer prompting) requires an explicit control experiment in which the baseline receives an equivalently detailed prompt without the compile constraint; without it, attribution of gains to structural enforcement remains ambiguous.
  2. [Discussion and conclusions] No evidence is presented that AST distance or canonical-form consistency correlates with practical failure modes such as query fragility under schema drift or evolution; the Spider execution-match results alone do not establish that the reported structural variance is load-bearing for real-world reliability.
  3. [Experimental setup] Details on variance measurement (e.g., exact AST canonicalization rules, distance metric, and statistical significance testing across multiple runs or seeds) are insufficient to allow replication or assessment of whether the reported diversity exceeds what would be expected from sampling noise.
minor comments (2)
  1. [Methods] Clarify the precise definition of 'canonical AST' early in the methods section, including how aliases, join order, and subquery placement are normalized.
  2. [Figures] Figure captions should explicitly state the number of LLMs, prompts, and queries underlying each bar or distribution.
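
Major comment 3 asks whether the reported diversity exceeds sampling noise. One simple way to frame such a check is a paired sign test over per-question counts of distinct canonical structures. The sketch below is hypothetical and is not a procedure the paper describes: the function name `sign_test_p_value`, the choice of an exact two-sided sign test, and the example counts are all assumptions made for illustration.

```python
from math import comb


def sign_test_p_value(baseline, treated):
    """Exact two-sided sign test on paired per-question measurements.

    baseline, treated: equal-length sequences, e.g. the number of distinct
    canonical structures observed per question under each condition
    (hypothetical inputs, not data from the paper).
    """
    if len(baseline) != len(treated):
        raise ValueError("paired inputs must have the same length")
    wins = sum(b > t for b, t in zip(baseline, treated))    # treated is more consistent
    losses = sum(b < t for b, t in zip(baseline, treated))  # treated is less consistent
    n = wins + losses                                       # ties are discarded
    if n == 0:
        return 1.0
    # Probability of a split at least this lopsided under a fair coin, doubled.
    tail = sum(comb(n, i) for i in range(min(wins, losses) + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Made-up counts for eight questions (illustrative only).
baseline_counts = [3, 2, 4, 1, 3, 2, 5, 3]
pipeline_counts = [1, 1, 2, 1, 2, 1, 2, 1]
p_value = sign_test_p_value(baseline_counts, pipeline_counts)
```

The sign test ignores the size of each difference, so it is conservative; a bootstrap over questions or seeds would be a reasonable alternative if effect sizes matter.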

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback. We address each major comment point by point below, indicating where revisions will be made to strengthen the manuscript.

  1. Referee: [Pipeline description and results (likely §4)] The central claim that the compile-style pipeline improves structural consistency (and that this is not merely richer prompting) requires an explicit control experiment in which the baseline receives an equivalently detailed prompt without the compile constraint; without it, attribution of gains to structural enforcement remains ambiguous.

    Authors: We agree that the current experiments leave open the possibility that gains arise from prompt detail rather than the compile constraint itself. In the revised manuscript, we will add an explicit control baseline that receives a comparably detailed prompt describing desired structural properties but without the actual compilation and enforcement steps. Results from this control will be reported alongside the existing conditions to better isolate the contribution of structural enforcement. revision: yes

  2. Referee: [Discussion and conclusions] No evidence is presented that AST distance or canonical-form consistency correlates with practical failure modes such as query fragility under schema drift or evolution; the Spider execution-match results alone do not establish that the reported structural variance is load-bearing for real-world reliability.

    Authors: We acknowledge this limitation. Our evaluation is confined to the Spider benchmark and does not directly test correlations with downstream issues such as schema evolution. While we maintain that structural consistency is a valuable and previously overlooked dimension, we will revise the discussion and conclusions to explicitly note the absence of such evidence as a limitation and to outline suggested future experiments that could establish links to real-world reliability metrics. revision: partial

  3. Referee: [Experimental setup] Details on variance measurement (e.g., exact AST canonicalization rules, distance metric, and statistical significance testing across multiple runs or seeds) are insufficient to allow replication or assessment of whether the reported diversity exceeds what would be expected from sampling noise.

    Authors: We thank the referee for highlighting this gap in reproducibility details. The revised manuscript will provide precise specifications of the AST canonicalization rules, the distance metric (normalized tree-edit distance on canonical forms), and the statistical procedures, including results aggregated over multiple independent runs with varied random seeds. We will also update the public code repository with explicit documentation and replication scripts for these measurements. revision: yes
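
The rebuttal names normalized tree-edit distance on canonical forms as the distance metric but does not spell it out. The following is a minimal sketch of one way such a metric could be computed, under several assumptions that do not come from the paper: trees are ordered, every insert, delete, or relabel costs 1, and the distance is normalized by the size of the larger tree. The tuple encoding and node labels are illustrative stand-ins for canonical SQL ASTs.

```python
from functools import lru_cache

# A tree is (label, children), where children is a tuple of trees.
# Tuples are hashable, which lets lru_cache memoize forest pairs.


def size(tree):
    """Number of nodes in a tree."""
    _, children = tree
    return 1 + sum(size(child) for child in children)


@lru_cache(maxsize=None)
def forest_distance(f1, f2):
    """Unit-cost edit distance between two ordered forests (tuples of trees)."""
    if not f1:
        return sum(size(t) for t in f2)  # insert everything that remains
    if not f2:
        return sum(size(t) for t in f1)  # delete everything that remains
    (label1, kids1), (label2, kids2) = f1[-1], f2[-1]
    # Delete the rightmost root of f1; its children join the forest.
    delete = forest_distance(f1[:-1] + kids1, f2) + 1
    # Insert the rightmost root of f2.
    insert = forest_distance(f1, f2[:-1] + kids2) + 1
    # Match the two rightmost roots, relabeling if the labels differ.
    match = (forest_distance(kids1, kids2)
             + forest_distance(f1[:-1], f2[:-1])
             + (label1 != label2))
    return min(delete, insert, match)


def normalized_ted(t1, t2):
    """Tree-edit distance divided by the larger tree's node count (assumed normalization)."""
    return forest_distance((t1,), (t2,)) / max(size(t1), size(t2))


# Toy ASTs: the same query with and without a WHERE clause.
with_where = ("select", (
    ("column:name", ()),
    ("from", (("table:singer", ()),)),
    ("where", (("gt", (("column:age", ()), ("literal:30", ()))),)),
))
without_where = ("select", (
    ("column:name", ()),
    ("from", (("table:singer", ()),)),
))
distance = normalized_ted(with_where, without_where)
```

Treating children as ordered means the result depends on canonicalization having already fixed the order of commutative operands; an unordered variant would need a different algorithm.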

Circularity Check

0 steps flagged

No circularity: purely empirical benchmark study with no derivations or self-referential fitting


The paper presents an empirical framework (SQLStructEval) for analyzing LLM-generated SQL via canonical ASTs on the Spider benchmark. All claims rest on direct experimental observations of structural variance and pipeline improvements, with no equations, fitted parameters renamed as predictions, or load-bearing self-citations. The derivation chain is absent; results are falsifiable via replication on the public benchmark and do not reduce to inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The framework assumes standard SQL AST parsing can identify equivalent structures; no free parameters, invented entities, or ad-hoc axioms are evident from the abstract.

axioms (1)
  • domain assumption: Canonical AST representations accurately reflect structural equivalence of SQL queries independent of surface syntax.
    Central to analyzing structural diversity and consistency.



Reference graph

Works this paper leans on

5 extracted references · 5 canonical work pages

  1. [1]

    Evaluating the

    Instance-level randomization: Toward more stable LLM evaluations. In Findings of the Association for Computational Linguistics: EMNLP 2025, pages 3411–3425, Suzhou, China. Association for Computational Linguistics. Jiawei Liu, Chunqiu Steven Xia, Yuyao Wang, and Lingming Zhang. 2023. Is your code generated by ChatGPT really correct? rigorous evaluati...

  2. [2]

    Parse the SQL query into an AST using sqlglot;

  3. [3]

    Apply AST-level canonicalization (alias normalization and logical operator normalization)

  4. [4]

    Render the canonical AST back into SQL

  5. [5]

    Apply text-level normalization. The resulting canonical SQL string uniquely corresponds to a canonical AST structure and is used as the basis for all structural metrics in our analysis. For each question with k generated SQL queries, we compute the frequency distribution over canonical structures. Structural statistics are then derived from this distr...
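
The excerpt above describes a frequency distribution over canonical structures for the k queries generated per question, but it is cut off before naming the statistics derived from it. As a rough illustration only, the sketch below computes three plausible summaries. The function name `structural_stats`, the choice of statistics (distinct-structure count, dominant-structure share, Shannon entropy), and the sample strings are assumptions, not the paper's definitions. The inputs are assumed to be canonical SQL strings that have already passed through the steps listed above.

```python
from collections import Counter
from math import log2


def structural_stats(canonical_sqls):
    """Summarize the distribution of canonical structures for one question.

    canonical_sqls: list of k canonical SQL strings (assumed already
    canonicalized, so equal strings mean equal structures).
    """
    if not canonical_sqls:
        raise ValueError("need at least one generated query")
    counts = Counter(canonical_sqls)
    k = len(canonical_sqls)
    probs = [count / k for count in counts.values()]
    return {
        "num_structures": len(counts),                       # distinct canonical forms
        "dominant_share": max(probs),                        # frequency of the most common form
        "entropy_bits": sum(p * log2(1 / p) for p in probs), # 0.0 when all k agree
    }


# Four hypothetical generations for one question: three share a structure,
# one wraps the same filter in a subquery.
samples = [
    "SELECT name FROM singer WHERE age > 30",
    "SELECT name FROM singer WHERE age > 30",
    "SELECT name FROM singer WHERE age > 30",
    "SELECT t.name FROM (SELECT * FROM singer WHERE age > 30) AS t",
]
stats = structural_stats(samples)
```

A perfectly consistent model would yield one structure, a dominant share of 1.0, and zero entropy, so any of the three values can serve as a per-question consistency signal.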