
arxiv: 2507.04701 · v2 · submitted 2025-07-07 · 💻 cs.CL

XiYan-SQL: A Novel Multi-Generator Framework For Text-to-SQL

Pith reviewed 2026-05-19 06:49 UTC · model grok-4.3

classification 💻 cs.CL
keywords Text-to-SQL · Multi-generator ensemble · LLM fine-tuning · Candidate selection · Schema filtering · BIRD benchmark · Spider dataset

The pith

XiYan-SQL achieves state-of-the-art text-to-SQL results by generating multiple diverse SQL candidates and selecting the best one.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper describes XiYan-SQL, a framework for improving text-to-SQL translation with large language models. It filters the database schema down to the relevant parts, generates multiple SQL queries with an ensemble of models fine-tuned on different SQL formats, and selects the final query with a dedicated selection model. The aim is to overcome the limits of single-generation methods by increasing the variety and quality of candidate queries. The paper reports state-of-the-art accuracy on the BIRD and Spider benchmarks. A sympathetic reader would see this as a practical way to make LLM-based SQL generation more reliable.

Core claim

XiYan-SQL is a novel framework consisting of a schema filter, a multi-generator ensemble, and a selection model that together produce and identify the optimal SQL query from text input, resulting in 75.63% accuracy on the BIRD benchmark and 89.65% on the Spider test set.
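
The three-stage flow named in the core claim can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: `filter_schema`, `run_pipeline`, the keyword-matching filter, the toy generators, and the `shortest` selector are all hypothetical stand-ins for components the paper describes as fine-tuned models.

```python
from typing import Callable, Dict, List

# Hypothetical types for illustration only; none of these names come from the paper.
Schema = Dict[str, List[str]]                      # table name -> column names
Generator = Callable[[str, Schema], str]           # (question, schema) -> SQL
Selector = Callable[[str, Schema, List[str]], str]  # picks one candidate


def filter_schema(question: str, schema: Schema) -> Schema:
    """Toy schema filter: keep tables whose name or columns occur in the question."""
    words = set(question.lower().replace("?", " ").split())
    kept = {
        table: cols
        for table, cols in schema.items()
        if table.lower() in words or any(c.lower() in words for c in cols)
    }
    return kept or schema  # fall back to the full schema if nothing matches


def run_pipeline(question: str, schema: Schema,
                 generators: List[Generator], selector: Selector) -> str:
    """Schema filter -> multiple generators -> selection over unique candidates."""
    relevant = filter_schema(question, schema)
    candidates = [generate(question, relevant) for generate in generators]
    unique = list(dict.fromkeys(candidates))  # de-duplicate, keep first-seen order
    return selector(question, relevant, unique)


# Toy stand-ins: each "generator" returns a fixed query in a different style.
schema = {"singer": ["name", "age"], "concert": ["venue", "year"]}
question = "What is the name of each singer?"
generators = [
    lambda q, s: "SELECT name FROM singer",
    lambda q, s: "SELECT s.name FROM singer AS s",
    lambda q, s: "SELECT name FROM singer",
]


def shortest(question: str, schema: Schema, candidates: List[str]) -> str:
    """Placeholder selector; the paper uses a trained selection model instead."""
    return min(candidates, key=len)


best = run_pipeline(question, schema, generators, shortest)
print(best)
```

The point of the sketch is the data flow: the selector only ever sees the de-duplicated candidate pool, so its accuracy is bounded by whether any generator produced a correct query.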

What carries the argument

The multi-generator ensemble: a multi-task fine-tuning strategy builds multiple generation models with distinct generation styles by fine-tuning across different SQL formats.

If this is right

  • The use of distinct SQL formats for fine-tuning creates greater diversity in generated SQL queries.
  • The selection model with candidate reorganization can reliably choose the correct query from the pool of candidates.
  • Schema filtering reduces noise by focusing on relevant database schemas for generation.
  • Combining these components leads to consistent outperformance on standard text-to-SQL benchmarks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar multi-generator strategies could improve performance in other structured prediction tasks like code generation or query optimization.
  • Automating the selection of SQL formats for fine-tuning might further enhance the framework without additional manual effort.
  • Testing the approach on real-world database schemas with high complexity could reveal its practical utility beyond benchmarks.

Load-bearing premise

The assumption that fine-tuning separate generators on distinct SQL formats will reliably produce sufficiently diverse and high-quality candidates that the downstream selection model can consistently identify the correct query.

What would settle it

An experiment showing that on a new text-to-SQL dataset, the accuracy of the selected query falls significantly below the reported SOTA levels despite generating multiple candidates.
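
Such an experiment comes down to measuring execution accuracy (EX) on the new dataset. The following is a minimal sketch of that measurement, assuming a SQLite database and an order-insensitive multiset comparison of result rows; official benchmark scripts define their own comparison rules, and the table, rows, and query pairs here are invented for illustration.

```python
import sqlite3
from collections import Counter


def execute(conn, sql):
    """Return the result rows as a multiset, or None if the query fails."""
    try:
        return Counter(conn.execute(sql).fetchall())
    except sqlite3.Error:
        return None


def execution_accuracy(conn, pairs):
    """Fraction of (predicted, gold) pairs whose results match as multisets."""
    hits = 0
    for predicted, gold in pairs:
        predicted_rows = execute(conn, predicted)
        gold_rows = execute(conn, gold)
        if predicted_rows is not None and predicted_rows == gold_rows:
            hits += 1
    return hits / len(pairs)


# Invented toy database and query pairs (assumptions, not benchmark data).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE singer (name TEXT, age INTEGER)")
conn.executemany("INSERT INTO singer VALUES (?, ?)",
                 [("Ann", 30), ("Bob", 45), ("Cy", 45)])

pairs = [
    ("SELECT name FROM singer WHERE age > 40",
     "SELECT name FROM singer WHERE age >= 45"),   # same rows: hit
    ("SELECT name FROM singer",
     "SELECT name FROM singer WHERE age < 40"),    # different rows: miss
    ("SELEC name FROM singer",
     "SELECT name FROM singer"),                   # syntax error: miss
    ("SELECT COUNT(*) FROM singer",
     "SELECT COUNT(name) FROM singer"),            # same count: hit
]
ex = execution_accuracy(conn, pairs)
print(ex)
```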

Figures

Figures reproduced from arXiv: 2507.04701 by Bolin Ding, Jingren Zhou, Jinyang Gao, Xiaorong Shi, Xiaoxia Li, Yifu Liu, Yingqi Gao, Yin Zhu, Yu Li, Yuntao Hong, Zhiling Luo.

Figure 1: Overview of the proposed XiYan-SQL framework, including three steps: Schema Filter, Multiple SQL Generation, and SQL Selection.
Figure 2: The illustration of the process for multiple related tasks.
Figure 3: Examples of multiple SQL queries with different format corresponding
Figure 4: (a) Comparison of EX among different multiple candidate methods
Original abstract

To leverage the advantages of LLM in addressing challenges in the Text-to-SQL task, we present XiYan-SQL, an innovative framework effectively generating and utilizing multiple SQL candidates. It consists of three components: 1) a Schema Filter module filtering and obtaining multiple relevant schemas; 2) a multi-generator ensemble approach generating multiple high-quality and diverse SQL queries; 3) a selection model with a candidate reorganization strategy implemented to obtain the optimal SQL query. Specifically, for the multi-generator ensemble, we employ a multi-task fine-tuning strategy to enhance the capabilities of SQL generation models for the intrinsic alignment between SQL and text, and construct multiple generation models with distinct generation styles by fine-tuning across different SQL formats. The experimental results and comprehensive analysis demonstrate the effectiveness and robustness of our framework. Overall, XiYan-SQL achieves a new SOTA performance of 75.63% on the notable BIRD benchmark, surpassing all previous methods. It also attains SOTA performance on the Spider test set with an accuracy of 89.65%.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes XiYan-SQL, a framework for Text-to-SQL consisting of a Schema Filter module to obtain relevant schemas, a multi-generator ensemble that applies multi-task fine-tuning and trains separate generators on distinct SQL formats to produce diverse candidates, and a selection model using candidate reorganization to identify the best query. It reports new state-of-the-art results of 75.63% on the BIRD benchmark and 89.65% on the Spider test set.

Significance. If the reported gains hold after controlling for model scale and training compute, and if the multi-generator approach demonstrably increases the coverage of correct candidates beyond a single fine-tuned model, the work would offer a practical advance in LLM-based Text-to-SQL by addressing candidate diversity and selection. The framework is benchmark-driven and does not rely on parameter-free derivations, but the emphasis on format-induced stylistic variation is a concrete, testable hypothesis that could influence future ensemble designs if supported by quantitative evidence.

major comments (2)
  1. [§3.2] §3.2 (Multi-Generator Ensemble): The central claim that fine-tuning across different SQL formats produces 'distinct generation styles' sufficient for the selection model to recover the gold query is load-bearing for the SOTA results, yet no quantitative diversity metric (e.g., parse-tree edit distance, execution-result variance, or error-type coverage across the candidate pool) is reported. Without such evidence, it remains possible that format variation induces only superficial differences while core semantic errors remain correlated, undermining the advantage over a single generator.
  2. [§4] §4 (Experiments): The performance claims of 75.63% on BIRD and 89.65% on Spider are presented without details on statistical significance testing, ablation isolating the contribution of the multi-generator plus selection versus a single strong baseline of comparable size and training compute, or error analysis showing where the reorganization strategy corrects failures of individual generators.
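
The kind of evidence requested in major comment 1 can be sketched with three simple statistics over a candidate pool. This is a hypothetical illustration, not an analysis from the paper: it assumes each candidate has already been reduced to a hashable execution-result signature, that position i in every pool always comes from the same generator, and the signatures `r1`..`r9` are invented.

```python
from itertools import combinations


def pairwise_disagreement(signatures):
    """Share of candidate pairs whose execution-result signatures differ."""
    pairs = list(combinations(signatures, 2))
    if not pairs:
        return 0.0
    return sum(a != b for a, b in pairs) / len(pairs)


def oracle_coverage(pools, golds):
    """Share of questions where at least one candidate matches the gold signature."""
    return sum(gold in pool for pool, gold in zip(pools, golds)) / len(golds)


def best_single_accuracy(pools, golds):
    """Accuracy of the best individual generator (position i in every pool)."""
    n_generators = len(pools[0])
    return max(
        sum(pool[i] == gold for pool, gold in zip(pools, golds)) / len(golds)
        for i in range(n_generators)
    )


# Invented signatures for four questions and three generators.
pools = [
    ["r1", "r1", "r2"],
    ["r3", "r4", "r4"],
    ["r5", "r5", "r5"],
    ["r6", "r7", "r8"],
]
golds = ["r2", "r4", "r9", "r6"]

coverage = oracle_coverage(pools, golds)
single = best_single_accuracy(pools, golds)
print(coverage, single, [pairwise_disagreement(p) for p in pools])
```

The gap between `coverage` and `single` is the headroom a selection model can exploit. If format variation only changed surface syntax, the pools would show low disagreement and the gap would be close to zero.
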
minor comments (2)
  1. [Abstract and §4] The abstract and §4 would benefit from explicit listing of all baselines with their model sizes and training regimes to allow direct comparison.
  2. [§3.3] Notation for the candidate reorganization strategy in §3.3 could be clarified with a small illustrative example of how candidates are re-ranked.
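
An illustrative example of the sort minor comment 2 asks for might look like the sketch below. It is an assumption, not the paper's strategy: this page does not detail how XiYan-SQL reorganizes candidates, so the sketch shows one plausible scheme that groups candidates by execution result and presents the largest agreement group first. The function names, table, and queries are hypothetical.

```python
import sqlite3
from collections import defaultdict


def result_key(conn, sql):
    """Hashable, order-insensitive signature of a query result; None on error."""
    try:
        rows = conn.execute(sql).fetchall()
    except sqlite3.Error:
        return None
    return tuple(sorted(rows, key=repr))


def reorganize(conn, candidates):
    """Group candidates by execution result and order groups by size."""
    groups = defaultdict(list)
    for sql in candidates:
        key = result_key(conn, sql)
        if key is not None:  # drop candidates that fail to execute
            groups[key].append(sql)
    # Largest group first; sorted() is stable, so ties keep first-seen order.
    ordered = sorted(groups.values(), key=len, reverse=True)
    return [(group[0], len(group)) for group in ordered]


# Invented toy database and candidate pool.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE singer (name TEXT, age INTEGER)")
conn.executemany("INSERT INTO singer VALUES (?, ?)",
                 [("Ann", 30), ("Bob", 45), ("Cy", 45)])

candidates = [
    "SELECT name FROM singer WHERE age > 40",
    "SELECT name FROM singer WHERE age >= 45",
    "SELECT name FROM singer WHERE age > 45",
    "SELECT nam FROM singer",  # unknown column, fails to execute
]
ranked = reorganize(conn, candidates)
print(ranked)
```

Each returned pair is one representative query and the number of candidates that agree with it. A selection model would then choose among these representatives instead of the raw pool.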

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback. We address each major comment below and indicate the revisions we will make to strengthen the manuscript.

Point-by-point responses
  1. Referee: [§3.2] §3.2 (Multi-Generator Ensemble): The central claim that fine-tuning across different SQL formats produces 'distinct generation styles' sufficient for the selection model to recover the gold query is load-bearing for the SOTA results, yet no quantitative diversity metric (e.g., parse-tree edit distance, execution-result variance, or error-type coverage across the candidate pool) is reported. Without such evidence, it remains possible that format variation induces only superficial differences while core semantic errors remain correlated, undermining the advantage over a single generator.

    Authors: We agree that explicit quantitative metrics would provide stronger support for the claim that format-induced variation yields meaningfully distinct candidates. The current manuscript relies on the design rationale and overall performance gains but does not report diversity statistics. In the revised version we will add a dedicated analysis subsection that measures syntactic diversity via parse-tree edit distance, variance in execution results, and coverage of different error categories across the candidate pool produced by the separate generators. This will allow readers to assess whether the multi-generator approach increases coverage beyond what a single fine-tuned model achieves.

    Revision: yes

  2. Referee: [§4] §4 (Experiments): The performance claims of 75.63% on BIRD and 89.65% on Spider are presented without details on statistical significance testing, ablation isolating the contribution of the multi-generator plus selection versus a single strong baseline of comparable size and training compute, or error analysis showing where the reorganization strategy corrects failures of individual generators.

    Authors: We acknowledge that the experimental section would benefit from greater rigor. In the revision we will add (1) statistical significance testing (e.g., bootstrap resampling or McNemar’s test) for the reported improvements over prior SOTA methods, (2) an ablation that directly compares the full framework against a single-generator baseline trained with matched model size and compute budget, and (3) a qualitative error analysis that illustrates concrete cases in which the candidate reorganization and selection model recovers the correct query when individual generators fail. These additions will clarify the incremental contribution of the multi-generator plus selection components.

    Revision: yes
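
One of the significance tests named in the second response, McNemar's test, can be sketched with the standard library alone. This is a minimal exact (binomial) version, assuming per-question correctness flags are available for two systems evaluated on the same questions; the flags below are invented and do not correspond to any reported result.

```python
from math import comb


def mcnemar_exact(system_a, system_b):
    """Two-sided exact McNemar test on paired per-question correctness flags."""
    # Only discordant pairs carry information about a difference between systems.
    b = sum(a and not o for a, o in zip(system_a, system_b))  # A right, B wrong
    c = sum(o and not a for a, o in zip(system_a, system_b))  # B right, A wrong
    n = b + c
    if n == 0:
        return 1.0
    # Under the null hypothesis, discordant pairs split as Binomial(n, 0.5).
    tail = sum(comb(n, i) for i in range(min(b, c) + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Invented flags: A is right on 8 questions where B is wrong, never the reverse.
system_a = [True] * 13 + [False] * 3
system_b = [False] * 8 + [True] * 5 + [False] * 3
p_value = mcnemar_exact(system_a, system_b)
print(p_value)
```

Because the test uses only questions on which the two systems disagree, it suits comparisons on a shared benchmark split, where most questions are answered identically by both systems.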

Circularity Check

0 steps flagged

No circularity: empirical framework evaluated on external benchmarks

Full rationale

The XiYan-SQL paper describes an empirical multi-generator framework consisting of schema filtering, multi-task fine-tuned generators on distinct SQL formats, and a downstream selection model. All performance claims (75.63% on BIRD, 89.65% on Spider) are obtained by direct evaluation against fixed, externally defined benchmark datasets and gold queries. No equations, uniqueness theorems, or predictions are presented that reduce by construction to fitted parameters or self-referential definitions. The central assumption about candidate diversity is tested via reported results rather than assumed via internal renaming or self-citation chains. The derivation chain is therefore self-contained against independent external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The approach rests on standard assumptions about LLM fine-tuning for structured output and the value of candidate diversity; no new physical entities or ad-hoc constants are introduced.

axioms (1)
  • Domain assumption: Fine-tuning LLMs on different SQL output formats produces meaningfully diverse yet correct query candidates.
    Invoked in the description of the multi-generator ensemble component.




Forward citations

Cited by 6 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. LEAF-SQL: Level-wise Exploration with Adaptive Fine-graining for Text-to-SQL Skeleton Prediction

    cs.CL · 2026-05 · unverdicted · novelty 7.0

    LEAF-SQL uses level-wise exploration with adaptive fine-graining and dual agents to generate diverse SQL skeletons, reaching 71.6% execution accuracy on the BIRD benchmark and outperforming prior search- and skeleton-...

  2. DeepEye-SQL: A Software-Engineering-Inspired Text-to-SQL Framework

    cs.DB · 2025-10 · unverdicted · novelty 7.0

    DeepEye-SQL applies SDLC-inspired orchestration to Text-to-SQL, achieving 73.5% on BIRD-Dev, 75.07% on BIRD-Test, and 89.8% on Spider-Test with ~30B MoE models.

  3. Data-aware candidate selection in NL2SQL translation via small separating instances

    cs.DB · 2026-05 · unverdicted · novelty 6.0

    A selection technique based on separating instances and provenance outperforms baselines for choosing among 2-3 NL2SQL candidates on a BIRD-DEV subset without consistency scores.

  4. Reliable Answers for Recurring Questions: Boosting Text-to-SQL Accuracy with Template Constrained Decoding

    cs.CL · 2026-04 · unverdicted · novelty 6.0

    TeCoD improves Text-to-SQL execution accuracy by up to 36% over in-context learning and cuts latency 2.2x on matched queries by extracting templates from historical pairs and enforcing them with constrained decoding.

  5. SemanticAgent: A Semantics-Aware Framework for Text-to-SQL Data Synthesis

    cs.AI · 2026-04 · unverdicted · novelty 6.0

    SemanticAgent introduces a three-stage semantic analysis, synthesis, and verification process that produces higher-quality text-to-SQL training data than prior execution-only methods.

  6. AgentNLQ: A General-Purpose Agent for Natural Language to SQL

    cs.AI · 2026-05 · unverdicted · novelty 5.0

    A multi-agent LLM framework with schema enrichment and business rules achieves 78.1% semantic accuracy on the BIRD NL2SQL benchmark.
