pith. machine review for the scientific record.

arxiv: 2605.09295 · v1 · submitted 2026-05-10 · 💻 cs.CL

Recognition: 2 Lean theorem links

LEAF-SQL: Level-wise Exploration with Adaptive Fine-graining for Text-to-SQL Skeleton Prediction

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 04:27 UTC · model grok-4.3

classification 💻 cs.CL
keywords Text-to-SQL · SQL skeleton prediction · tree search · large language models · BIRD benchmark · coarse-to-fine exploration · adaptive refinement

The pith

Reframing SQL skeleton prediction as a coarse-to-fine tree search enables more accurate generation of complex database queries from natural language.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces LEAF-SQL to address limitations in existing Text-to-SQL methods that rely on single structural hypotheses for SQL skeletons. By using a three-level hierarchy to guide progressive exploration, it generates diverse candidate skeletons and prunes them efficiently with dedicated agents. This approach leads to better performance on challenging benchmarks like BIRD. A sympathetic reader would care because accurate Text-to-SQL systems allow non-experts to interact with databases without writing code, especially for intricate queries involving nesting and multiple conditions. If the central claim holds, it supplies a more reliable way to build query structures before final generation.

Core claim

LEAF-SQL reframes skeleton prediction as a coarse-to-fine tree search process. It employs a three-level skeleton hierarchy to guide the search, a Skeleton Formulation Agent to generate diverse candidates, and a Skeleton Evaluation Agent to efficiently prune the search space. This integrated design yields skeleton candidates that are both structurally diverse and granularity-adaptive, providing a stronger foundation for SQL generation.
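The three-level hierarchy can be illustrated with a hypothetical example. The level names (Base, Expanded, Detailed) follow the paper's figure captions, but the skeleton strings below are invented for exposition, not taken from the paper:

```python
# Hypothetical illustration: one natural-language question represented at
# three skeleton granularity levels, coarse to fine.
question = "List names of schools whose average score exceeds the state average."

skeleton_hierarchy = {
    # Base: only the top-level clause layout
    "base": "SELECT _ FROM _ WHERE _",
    # Expanded: clause slots opened up, nesting made explicit
    "expanded": "SELECT col FROM tbl WHERE col > ( SELECT _ FROM _ )",
    # Detailed: columns and aggregates fixed, literal values still abstract
    "detailed": "SELECT name FROM schools WHERE avg_score > ( SELECT AVG(avg_score) FROM schools )",
}

# A coarse-to-fine search refines each level into candidates at the next,
# so every level constrains the set of plausible children.
for level in ("base", "expanded", "detailed"):
    print(f"{level:>8}: {skeleton_hierarchy[level]}")
```

Each coarse skeleton admits many finer-grained children; the point of the hierarchy is that a wrong structural commitment can be caught at the cheap, coarse level before it multiplies.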

What carries the argument

The three-level skeleton hierarchy, the Skeleton Formulation Agent, and the Skeleton Evaluation Agent, which together turn skeleton prediction into a level-wise tree search with adaptive refinement.
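That search loop can be sketched minimally, with the two agents reduced to toy stand-in functions. The real system prompts an LLM at each step; the function names, the beam width, and the scoring rule here are assumptions for illustration only:

```python
from typing import Callable, List

def level_wise_search(
    question: str,
    formulate: Callable[[str, str, int], List[str]],  # stand-in for the Skeleton Formulation Agent
    evaluate: Callable[[str, str], float],            # stand-in for the Skeleton Evaluation Agent
    levels: int = 3,
    beam: int = 2,
) -> List[str]:
    """Level-wise exploration: at each level, every surviving skeleton is
    expanded into finer-grained candidates, then the evaluation scores
    prune the pool back to a fixed beam."""
    frontier = [""]  # root: an empty (maximally coarse) skeleton
    for _ in range(levels):
        candidates = [c for s in frontier for c in formulate(question, s, beam)]
        candidates.sort(key=lambda c: evaluate(question, c), reverse=True)
        frontier = candidates[:beam]  # pruning step
    return frontier

# Toy stand-ins, purely to make the loop runnable.
def toy_formulate(q, skeleton, k):
    return [skeleton + f"|{i}" for i in range(k + 1)]

def toy_evaluate(q, skeleton):
    return -len(skeleton)  # prefer simpler structures, for illustration only

best = level_wise_search("toy question", toy_formulate, toy_evaluate)
print(best)
```

The pruning at each level is what keeps the tree tractable: without it, the candidate pool grows geometrically with depth.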

If this is right

  • LEAF-SQL consistently improves performance when used with various LLM backbones for Text-to-SQL tasks.
  • On the BIRD benchmark hidden test set, the method reaches 71.6% execution accuracy and exceeds leading search-based and skeleton-based approaches.
  • Complex queries that contain deeply nested logic or multiple clauses are handled more effectively than methods limited to one structural hypothesis.
  • The combination of progressive refinement and pruning balances structural diversity with computational efficiency.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same level-wise exploration pattern could be tested on other structured-generation problems such as program synthesis or formal proof construction where multiple valid structures exist.
  • Replacing the fixed evaluation rules with a learned reward model might allow the pruning step to improve automatically from execution feedback.
  • The results indicate that explicit multi-level decomposition helps large language models more than flat, single-pass generation when the output must satisfy strict syntactic constraints.

Load-bearing premise

That the Skeleton Evaluation Agent can reliably prune away bad structural hypotheses while keeping the correct ones, and that the three-level hierarchy plus adaptive fine-graining covers the needed query variety without excessive cost.

What would settle it

Compare full LEAF-SQL accuracy against versions that disable the Skeleton Evaluation Agent or the adaptive fine-graining step on the official BIRD hidden test set; absence of a clear accuracy drop would undermine the necessity of those components.

Figures

Figures reproduced from arXiv: 2605.09295 by Changxuan Wan, Dexi Liu, Qing Shu, Qizhi Wan, Xiping Liu, Zhao Tan.

Figure 1: Comparison of LEAF-SQL with prior works. LEAF …
Figure 2: The proposed three-level skeleton hierarchy (Base, …
Figure 3: The prevailing multi-stage pipeline for Text-to-SQL in …
Figure 4: Overview of LEAF-SQL. As illustrated in …
Figure 5: An illustration of the Level-wise Skeleton Search. The search has three phases (Base, Expanded, Detailed), each with …
Figure 6: An illustration of how a three-level skeleton hierarchy …
Figure 7: Overall architecture of Skeleton Evaluation Agent.
Figure 8: An example of the prompt used for SQL Generation.
Figure 9: Performance comparison between using Oracle Skele…
Figure 10: Granularity distribution of skeletons generated by …
Figure 11: Average consumption of time and tokens on BIRD.
Figure 12: Case studies from the BIRD benchmark illustrating the outputs of LEAF-SQL. For each question of varying difficulty …
read the original abstract

Text-to-SQL translates natural language questions into executable SQL queries, enabling intuitive database access for non-experts. While large language models achieve strong performance on Text-to-SQL with prompting, they still struggle with complex queries that involve deeply nested logic or multiple clauses. A widely used approach employs SQL skeletons--intermediate representations of query logic--to streamline generation, but existing methods are limited by their reliance on a single structural hypothesis and lack of progressive reasoning. To overcome these limitations, we propose LEAF-SQL, a novel framework that reframes skeleton prediction as a coarse-to-fine tree search process. LEAF-SQL enables systematic exploration of diverse structural hypotheses with adaptive refinement. Several key techniques are employed in LEAF-SQL: (1) a three-level skeleton hierarchy to guide the search, (2) a Skeleton Formulation Agent to generate diverse candidates, and (3) a Skeleton Evaluation Agent to efficiently prune the search space. This integrated design yields skeleton candidates that are both structurally diverse and granularity-adaptive, providing a stronger foundation for the SQL generation. Extensive experiments show that LEAF-SQL consistently improves the performance of various LLM backbones. On the official hidden test set of the challenging BIRD benchmark, our method achieves 71.6 execution accuracy, which outperforms leading search-based and skeleton-based methods, affirming its effectiveness for complex queries.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it: the pith above is the substance; this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes LEAF-SQL, a framework that reframes Text-to-SQL skeleton prediction as a coarse-to-fine tree search process. It employs a three-level skeleton hierarchy to guide exploration, a Skeleton Formulation Agent to generate diverse structural candidates, and a Skeleton Evaluation Agent to prune the search space via LLM prompting. The approach is evaluated on standard benchmarks with various LLM backbones, claiming consistent improvements and a new state-of-the-art of 71.6% execution accuracy on the official hidden test set of the BIRD benchmark, outperforming prior search-based and skeleton-based methods.

Significance. If the empirical gains hold under scrutiny, the work offers a structured way to increase diversity in skeleton hypotheses for complex, nested queries, which could strengthen LLM-based Text-to-SQL systems on schema-diverse datasets like BIRD. The integration of level-wise search with adaptive agents represents a practical advance over single-hypothesis skeleton methods, provided the pruning and coverage mechanisms are shown to be reliable.

major comments (2)
  1. [§3] §3 (Method), Skeleton Evaluation Agent description: No quantitative evaluation of the agent's pruning reliability (e.g., precision, recall, or false-negative rate on gold skeletons) is reported. This is load-bearing for the 71.6% BIRD claim, as unverified false negatives on valid skeletons would directly undermine the outperformance over baselines.
  2. [§4] §4 (Experiments), BIRD results and hierarchy discussion: The manuscript provides no coverage statistics or ablation on how the three-level hierarchy plus adaptive fine-graining enumerates structural variants for BIRD's complex queries (e.g., nested clauses across diverse schemas). Without this, the assumption that the search sufficiently covers query diversity remains unverified and central to the headline result.
minor comments (2)
  1. [Abstract] Abstract and §1: The claim of 'consistent improvements' across 'various LLM backbones' would benefit from explicit listing of the backbones and exact baseline comparisons in the abstract for immediate clarity.
  2. [Tables] Figure captions and tables: Ensure all tables reporting execution accuracy include standard deviations or multiple runs to support the reported gains.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on our manuscript. We address each major comment below and commit to incorporating the suggested analyses in the revised version.

read point-by-point responses
  1. Referee: [§3] §3 (Method), Skeleton Evaluation Agent description: No quantitative evaluation of the agent's pruning reliability (e.g., precision, recall, or false-negative rate on gold skeletons) is reported. This is load-bearing for the 71.6% BIRD claim, as unverified false negatives on valid skeletons would directly undermine the outperformance over baselines.

    Authors: We thank the referee for pointing this out. The Skeleton Evaluation Agent is intended to prune invalid or low-quality skeletons to focus the search. While our end-to-end results on BIRD support the overall approach, we agree that reporting quantitative metrics on the pruning step, such as precision, recall, and false-negative rates relative to gold skeletons, would provide important validation. In the revision, we will add a dedicated subsection with these metrics computed on a sample of BIRD queries to demonstrate the agent's reliability and address concerns about potential false negatives. revision: yes

  2. Referee: [§4] §4 (Experiments), BIRD results and hierarchy discussion: The manuscript provides no coverage statistics or ablation on how the three-level hierarchy plus adaptive fine-graining enumerates structural variants for BIRD's complex queries (e.g., nested clauses across diverse schemas). Without this, the assumption that the search sufficiently covers query diversity remains unverified and central to the headline result.

    Authors: We appreciate this feedback. The level-wise exploration with adaptive fine-graining is designed to handle the diversity in BIRD's complex queries by starting with coarse skeletons and refining them. However, we acknowledge the absence of explicit coverage statistics and ablations in the current manuscript. We will include additional experiments in the revised paper, such as statistics on the number of skeletons explored at each level and an ablation showing performance with and without the hierarchy, to verify sufficient coverage of structural variants for nested and schema-diverse queries. revision: yes
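The pruning-reliability metrics the rebuttal commits to (precision, recall, and false-negative rate against gold skeletons) amount to simple set bookkeeping over keep/prune decisions. A minimal sketch, with hypothetical skeleton IDs in place of real BIRD candidates:

```python
def pruning_metrics(kept: set, pruned: set, gold_valid: set) -> dict:
    """Score 'keep' as the positive decision against the set of skeletons
    known (via gold SQL) to be structurally valid."""
    tp = len(kept & gold_valid)    # valid skeletons correctly kept
    fp = len(kept - gold_valid)    # invalid skeletons kept anyway
    fn = len(pruned & gold_valid)  # valid skeletons wrongly pruned
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    fnr = fn / (fn + tp) if fn + tp else 0.0  # false-negative rate on valid skeletons
    return {"precision": precision, "recall": recall, "false_negative_rate": fnr}

# Hypothetical decisions, purely to show the bookkeeping.
m = pruning_metrics(
    kept={"s1", "s2", "s3"},
    pruned={"s4", "s5"},
    gold_valid={"s1", "s2", "s4"},
)
print(m)
```

The false-negative rate is the quantity the referee flags as load-bearing: every valid skeleton pruned early is a query the downstream SQL generator can never recover.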

Circularity Check

0 steps flagged

No circularity: empirical framework evaluated on external benchmarks

full rationale

The paper proposes LEAF-SQL as a practical coarse-to-fine skeleton search framework using LLM agents and a three-level hierarchy. Performance claims rest on execution accuracy measured on the external BIRD hidden test set, not on any internal equations, fitted parameters, or self-referential definitions. No derivation chain reduces a result to its own inputs by construction, and the method is presented as an engineering proposal rather than a mathematical derivation.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The method builds on standard LLM prompting and benchmark evaluation practices; the abstract introduces no new free parameters, mathematical axioms, or invented entities beyond the proposed search framework itself.

pith-pipeline@v0.9.0 · 5551 in / 1077 out tokens · 75908 ms · 2026-05-12T04:27:14.737646+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

38 extracted references · 38 canonical work pages · 7 internal anchors

  1. [1]

    Spider: A large-scale human-labeled dataset for complex and cross-domain semantic parsing and text-to-sql task

    T. Yu, R. Zhang, K. Yang, M. Yasunaga, D. Wang, Z. Li, J. Ma, I. Li, Q. Yao, S. Roman, Z. Zhang, and D. Radev, “Spider: A large-scale human-labeled dataset for complex and cross-domain semantic parsing and text-to-sql task,” in Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. Brussels, Belgium: Association for Comput...

  2. [2]

    Can llm already serve as a database interface? A big bench for large-scale database grounded text-to-sqls

    J. Li, B. Hui, G. Qu, J. Yang, B. Li, B. Li, B. Wang, B. Qin, R. Geng, N. Huo, X. Zhou, C. Ma, G. Li, K. C. Chang, F. Huang, R. Cheng, and Y. Li, “Can llm already serve as a database interface? a big bench for large-scale database grounded text-to-sqls,” in Proceedings of the 37th International Conference on Neural Information Processing Systems (NeurIPS)...

  3. [3]

    Sql-o1: A self-reward heuristic dynamic search method for text-to-sql

    S. Lyu, H. Luo, R. Li, Z. Ou, J. Sun, Y. Qin, X. Shang, M. Song, and Y. Zhu, “Sql-o1: A self-reward heuristic dynamic search method for text-to-sql,” arXiv preprint arXiv:2502.11741, 2025

  4. [4]

    Learnat: Learning nl2sql with ast-guided task decomposition for large language models

    W. Liao, X. Gao, T. Jia, R. Qiu, Y. Zhu, Y. Lin, X. Chu, J. Zhao, and Y. Wang, “Learnat: Learning nl2sql with ast-guided task decomposition for large language models,” arXiv preprint arXiv:2504.02327, 2025

  5. [5]

    Few-shot text-to-sql translation using structure and content prompt learning

    Z. Gu, J. Fan, N. Tang, L. Cao, B. Jia, S. Madden, and X. Du, “Few-shot text-to-sql translation using structure and content prompt learning,” Proc. ACM Manag. Data, vol. 1, no. 2, pp. 138–166, 2023

  6. [6]

    Resdsql: Decoupling schema linking and skeleton parsing for text-to-sql

    H. Li, J. Zhang, C. Li, and H. Chen, “Resdsql: Decoupling schema linking and skeleton parsing for text-to-sql,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 37, no. 11, 2023, pp. 13067–13075

  7. [7]

    Combining small language models and large language models for zero-shot nl2sql

    J. Fan, Z. Gu, S. Zhang, Y. Zhang, Z. Chen, L. Cao, G. Li, S. Madden, X. Du, and N. Tang, “Combining small language models and large language models for zero-shot nl2sql,” Proc. VLDB Endow., vol. 17, no. 11, pp. 2750–2763, 2024

  8. [8]

    Dac: Decomposed automation correction for text-to-sql

    D. Wang, L. Dou, X. Zhang, Q. Zhu, and W. Che, “Dac: Decomposed automation correction for text-to-sql,” arXiv preprint arXiv:2408.08779, 2024

  9. [9]

    Ucs-sql: Uniting content and structure for enhanced semantic bridging in text-to-sql

    Z. Wu, Z. Li, J. JieZhangChinaTele, Z. He, J. Yang, Y. Zhao, R. Fang, B. Wang, H. Xie, S. Song, and Z. Li, “Ucs-sql: Uniting content and structure for enhanced semantic bridging in text-to-sql,” in Findings of the Association for Computational Linguistics: ACL 2025. Vienna, Austria: Association for Computational Linguistics, 2025, pp. 8156–8168

  10. [10]

    Purple: Making a large language model a better sql writer

    T. Ren, Y. Fan, Z. He, R. Huang, J. Dai, C. Huang, Y. Jing, K. Zhang, Y. Yang, and X. S. Wang, “Purple: Making a large language model a better sql writer,” arXiv preprint arXiv:2403.20014, 2024

  11. [11]

    Chain of thought prompting elicits knowledge augmentation

    D. Wu, J. Zhang, and X. Huang, “Chain of thought prompting elicits knowledge augmentation,” in Findings of the Association for Computational Linguistics: ACL 2023. Toronto, Canada: Association for Computational Linguistics, 2023, pp. 6519–6534

  12. [12]

    Tree of thoughts: deliberate problem solving with large language models

    S. Yao, D. Yu, J. Zhao, I. Shafran, T. L. Griffiths, Y. Cao, and K. Narasimhan, “Tree of thoughts: deliberate problem solving with large language models,” in Proceedings of the 37th International Conference on Neural Information Processing Systems, Red Hook, NY, USA, 2023, pp. 8812–8825

  13. [13]

    Seq2SQL: Generating Structured Queries from Natural Language using Reinforcement Learning

    V. Zhong, C. Xiong, and R. Socher, “Seq2sql: Generating structured queries from natural language using reinforcement learning,” arXiv preprint arXiv:1709.00103, 2017

  14. [14]

    Spider 2.0: Evaluating language models on real-world enterprise text-to-sql workflows

    F. Lei, J. Chen, Y. Ye, R. Cao, D. Shin, H. Su, Z. Suo, H. Gao, W. Hu, P. Yin, V. Zhong, C. Xiong, R. Sun, Q. Liu, S. Wang, and T. Yu, “Spider 2.0: Evaluating language models on real-world enterprise text-to-sql workflows,” arXiv preprint arXiv:2411.07763, 2025

  15. [15]

    A survey of text-to-sql in the era of llms: Where are we, and where are we going?

    X. Liu, S. Shen, B. Li, P. Ma, R. Jiang, Y. Zhang, J. Fan, G. Li, N. Tang, and Y. Luo, “A survey of text-to-sql in the era of llms: Where are we, and where are we going?” IEEE Transactions on Knowledge and Data Engineering, vol. 37, no. 10, pp. 5735–5754, 2025

  16. [16]

    The dawn of natural language to sql: Are we fully ready?

    B. Li, Y. Luo, C. Chai, G. Li, and N. Tang, “The dawn of natural language to sql: Are we fully ready?” Proc. VLDB Endow., vol. 17, no. 11, pp. 3318–3331, 2024

  17. [17]

    Rsl-sql: Robust schema linking in text-to-sql generation

    Z. Cao, Y. Zheng, Z. Fan, X. Zhang, W. Chen, and X. Bai, “Rsl-sql: Robust schema linking in text-to-sql generation,” arXiv preprint arXiv:2411.00073, 2024

  18. [18]

    Enhancing text-to-sql parsing through question rewriting and execution-guided refinement

    W. Mao, R. Wang, J. Guo, J. Zeng, C. Gao, P. Han, and C. Liu, “Enhancing text-to-sql parsing through question rewriting and execution-guided refinement,” in Findings of the Association for Computational Linguistics: ACL 2024. Bangkok, Thailand: Association for Computational Linguistics, 2024, pp. 2009–2024

  19. [19]

    Share: An slm-based hierarchical action correction assistant for text-to-sql

    G. Qu, J. Li, B. Qin, X. Li, N. Huo, C. Ma, and R. Cheng, “Share: An slm-based hierarchical action correction assistant for text-to-sql,” in Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics. Vienna, Austria: Association for Computational Linguistics, 2025, pp. 11268–11292

  20. [20]

    Teaching Large Language Models to Self-Debug

    X. Chen, M. Lin, N. Schärli, and D. Zhou, “Teaching large language models to self-debug,” arXiv preprint arXiv:2304.05128, 2023

  21. [21]

    Alpha-sql: Zero-shot text-to-sql using monte carlo tree search

    B. Li, J. Zhang, J. Fan, Y. Xu, C. Chen, N. Tang, and Y. Luo, “Alpha-sql: Zero-shot text-to-sql using monte carlo tree search,” arXiv preprint arXiv:2502.17248, 2025

  22. [22]

    Large language model instruction following: A survey of progresses and challenges

    R. Lou, K. Zhang, and W. Yin, “Large language model instruction following: A survey of progresses and challenges,” Computational Linguistics, vol. 50, no. 3, pp. 1053–1095, 2024

  23. [23]

    Exploring chain of thought style prompting for text-to-sql

    C.-Y. Tai, Z. Chen, T. Zhang, X. Deng, and H. Sun, “Exploring chain of thought style prompting for text-to-sql,” in Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. Singapore: Association for Computational Linguistics, 2023, pp. 5376–5393

  24. [24]

    Text-to-sql empowered by large language models: A benchmark evaluation

    D. Gao, H. Wang, Y. Li, X. Sun, Y. Qian, B. Ding, and J. Zhou, “Text-to-sql empowered by large language models: A benchmark evaluation,” Proc. VLDB Endow., vol. 17, no. 5, pp. 1132–1145, 2024

  25. [25]

    Mac-sql: A multi-agent collaborative framework for text-to-sql

    B. Wang, C. Ren, J. Yang, X. Liang, J. Bai, L. Chai, Z. Yan, Q.-W. Zhang, D. Yin, X. Sun, and Z. Li, “Mac-sql: A multi-agent collaborative framework for text-to-sql,” in Proceedings of the 31st International Conference on Computational Linguistics, Abu Dhabi, UAE, 2025, pp. 540–557

  26. [26]

    Csc-sql: Corrective self-consistency in text-to-sql via reinforcement learning

    L. Sheng and S.-S. Xu, “Csc-sql: Corrective self-consistency in text-to-sql via reinforcement learning,” arXiv preprint arXiv:2505.13271, 2025

  27. [27]

    Mcts-sql: Light-weight llms can master the text-to-sql through monte carlo tree search

    S. Yuan, L. Chen, M. Yuan, and J. Zhao, “Mcts-sql: Light-weight llms can master the text-to-sql through monte carlo tree search,” arXiv preprint arXiv:2501.16607, 2025

  28. [28]

    XiYan-SQL: A Novel Multi-Generator Framework For Text-to-SQL

    Y. Liu, Y. Zhu, Y. Gao, Z. Luo, X. Li, X. Shi, Y. Hong, J. Gao, Y. Li, B. Ding, and J. Zhou, “Xiyan-sql: A novel multi-generator framework for text-to-sql,” arXiv preprint arXiv:2507.04701, 2025

  29. [29]

    Dcg-sql: Enhancing in-context learning for text-to-sql with deep contextual schema link graph

    J. Lee, J.-S. Lee, J. Lee, Y. Choi, and J.-H. Lee, “Dcg-sql: Enhancing in-context learning for text-to-sql with deep contextual schema link graph,” in Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics, Vienna, Austria, 2025, pp. 15397–15412

  30. [30]

    Parsql: Enhancing text-to-sql through sql parsing and reasoning

    Y. Dai, H. Yang, M. Hao, and P. Chao, “Parsql: Enhancing text-to-sql through sql parsing and reasoning,” in Findings of the Association for Computational Linguistics: ACL 2025, Vienna, Austria, 2025, pp. 661–681

  31. [31]

    Synthesizing text-to-sql data from weak and strong llms

    J. Yang, B. Hui, M. Yang, J. Yang, J. Lin, and C. Zhou, “Synthesizing text-to-sql data from weak and strong llms,” in Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics, Bangkok, Thailand, 2024, pp. 7864–7875

  32. [32]

    Towards robustness of text-to-sql models against synonym substitution

    Y. Gan, X. Chen, Q. Huang, M. Purver, J. R. Woodward, J. Xie, and P. Huang, “Towards robustness of text-to-sql models against synonym substitution,” in Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, Online, 2021, pp. 2505–2515

  33. [33]

    Structure-grounded pretraining for text-to-sql

    X. Deng, A. H. Awadallah, C. Meek, O. Polozov, H. Sun, and M. Richardson, “Structure-grounded pretraining for text-to-sql,” in Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Online, 2021, pp. 1337–1350

  34. [34]

    Exploring underexplored limitations of cross-domain text-to-sql generalization

    Y. Gan, X. Chen, and M. Purver, “Exploring underexplored limitations of cross-domain text-to-sql generalization,” in Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, Online and Punta Cana, Dominican Republic, 2021, pp. 8926–8931

  35. [35]

    Qwen3 Technical Report

    A. Yang, A. Li, B. Yang, B. Zhang, et al., “Qwen3 technical report,” arXiv preprint arXiv:2505.09388, 2025

  36. [36]

    GPT-4 Technical Report

    OpenAI, J. Achiam, S. Adler, S. Agarwal, L. Ahmad, et al., “Gpt-4 technical report,” arXiv preprint arXiv:2303.08774, 2024

  37. [37]

    Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

    G. Comanici, E. Bieber, M. Schaekermann, I. Pasupat, N. Sachdeva, et al., “Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities,” arXiv preprint arXiv:2507.06261, 2025

  38. [38]

    Qwen2.5 Technical Report

    A. Yang, B. Yang, B. Zhang, B. Hui, B. Zheng, et al., “Qwen2.5 technical report,” arXiv preprint arXiv:2412.15115, 2024