LLM-Based SQL Generation: Prompting, Self-Refinement, and Adaptive Weighted Majority Voting

Hung-Fu Chang; Po-An Chen; Yu-Jie Yang

arxiv: 2601.17942 · v1 · submitted 2026-01-25 · 💻 cs.AI · cs.DB

Pith reviewed 2026-05-16 10:56 UTC · model grok-4.3

keywords text-to-sql · self-refinement · weighted majority voting · multi-agent systems · sql generation · natural language to database · ensemble methods

The pith

A self-refinement and voting pipeline on an existing text-to-SQL model reaches over 85 percent execution accuracy on Spider benchmarks without ground-truth data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces the SSEV pipeline that adds iterative self-correction and weighted majority voting to a base text-to-SQL generator. Experiments show this combination produces execution accuracies of 85.5 percent on Spider development data, 86.4 percent on the test set, and 66.3 percent on the BIRD benchmark. Building from these results, the authors present ReCAPAgent-SQL, a multi-agent system with separate planning, critique, and validation steps that achieves 31 percent accuracy on the first 100 queries of a newer enterprise-oriented benchmark. If the mechanisms work as described, natural-language database querying becomes more reliable across ambiguous queries and complex schemas while removing the need for labeled correction examples.

Core claim

The paper claims that layering single-agent self-refinement with weighted majority voting and its randomized variant onto a PET-SQL base produces the reported benchmark accuracies without requiring ground-truth labels, and that extending the idea into a collaborative ReCAPAgent-SQL framework with specialized agents for planning, knowledge retrieval, critique, action, refinement, schema linking, and validation improves handling of real-world enterprise database tasks as measured on Spider 2.0-Lite.

What carries the argument

The Single-Agent Self-Refinement with Ensemble Voting (SSEV) pipeline, which generates multiple SQL candidates, applies iterative self-correction to each, then selects the final output through weighted majority voting.
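The selection step can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes candidates are grouped by their execution result on the target database and that each candidate carries a weight from its source model. The function names, the toy `singer` table, and the weights are all hypothetical.

```python
import sqlite3
from collections import defaultdict

def execute(conn, sql):
    """Run a query; return a hashable, order-insensitive result, or None on error."""
    try:
        rows = conn.execute(sql).fetchall()
    except sqlite3.Error:
        return None
    return tuple(sorted(rows, key=repr))

def weighted_vote(conn, candidates, weights):
    """candidates: {model_name: sql}; weights: {model_name: float} (hypothetical)."""
    score = defaultdict(float)
    example = {}
    for name, sql in candidates.items():
        result = execute(conn, sql)
        if result is None:              # candidates that fail to execute get no vote
            continue
        score[result] += weights[name]  # queries with equal results pool their weight
        example.setdefault(result, sql)
    if not score:
        return None
    best = max(score, key=score.get)
    return example[best]

# Toy database and candidates, invented for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE singer (id INTEGER, name TEXT, age INTEGER);
    INSERT INTO singer VALUES (1, 'Ann', 30), (2, 'Bo', 41), (3, 'Cy', 25);
""")
candidates = {
    "model_a": "SELECT name FROM singer WHERE age > 28",
    "model_b": "SELECT name FROM singer WHERE age >= 29 ORDER BY name",
    "model_c": "SELECT name FROM singer",
    "model_d": "SELECT nam FROM singer",  # invalid column, fails to execute
}
weights = {"model_a": 1.0, "model_b": 1.0, "model_c": 1.5, "model_d": 1.0}
chosen = weighted_vote(conn, candidates, weights)
print(chosen)
```

Grouping by execution result rather than by SQL text lets syntactically different but equivalent queries vote together, which is one plausible reading of voting "without ground-truth data".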

Load-bearing premise

The accuracy gains come mainly from adding self-refinement and voting steps rather than from the underlying base model or the particular large language model used.

What would settle it

Running the identical base generator on the Spider 1.0 development set both with and without the self-refinement and voting layers and checking whether the accuracy difference remains large.
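Such a paired comparison could be scored along these lines. The sketch assumes per-question correctness flags are available for both configurations on the same questions; the outcome lists below are made-up toy data, and the exact McNemar test is one reasonable choice rather than anything the paper specifies.

```python
from math import comb

def accuracy(outcomes):
    """Fraction of questions answered correctly (1 = correct, 0 = wrong)."""
    return sum(outcomes) / len(outcomes)

def mcnemar_exact(base, full):
    """Two-sided exact McNemar p-value on paired per-question outcomes."""
    b = sum(1 for x, y in zip(base, full) if x and not y)  # only the baseline is right
    c = sum(1 for x, y in zip(base, full) if y and not x)  # only the full pipeline is right
    n = b + c
    if n == 0:
        return 1.0
    tail = sum(comb(n, k) for k in range(min(b, c) + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# Toy outcomes for ten questions, invented for illustration.
base = [1, 1, 0, 0, 1, 0, 1, 0, 0, 1]   # base generator alone
full = [1, 1, 1, 1, 1, 0, 1, 1, 0, 1]   # with self-refinement and voting
delta = accuracy(full) - accuracy(base)
p_value = mcnemar_exact(base, full)
print(f"delta={delta:.2f}, p={p_value:.3f}")
```

Only the discordant questions (those where exactly one configuration is correct) drive the test, which is why a matched-question design is needed for the comparison to be informative.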

Figures

Figures reproduced from arXiv: 2601.17942 by Hung-Fu Chang, Po-An Chen, Yu-Jie Yang.

**Figure 2.** Full Database DDL & Optimization Rule Prompt Example.

**Figure 3.** Cell Value References Prompt Example. Few-Shot Demonstrations: Based on embedding similarity, the top-k question–SQL pairs most similar to the input query are retrieved. These examples guide the model toward generating SQL that is stylistically and structurally consistent with the target dataset.

**Figure 4.** Few-Shot Demonstrations Prompt Example. After assembling the complete prompt with optimization rule, full schema DDL, sampled cell value references, and k few-shot examples, a prompt is submitted to an LLM to generate the first-round SQL, referred to as PreSQL. PreSQL-Based Schema Linking: Schema linking is considered a long-standing and fundamentally crucial component for Text-to-SQL since it bridges na…

**Figure 5.** Architecture of the two-stage Text-to-SQL system: PreSQL (blue) and PostSQL (yellow) pipelines. Both …

**Figure 6.** Comparison of PreSQL and PostSQL prompt structure. The left side illustrates the PreSQL prompt, which …

**Figure 7.** Refinement prompt example. 3.3 ReCAPAgent-SQL: Although ReCAPAgent-SQL builds upon insights from our SSEV pipeline, it represents a substantially more comprehensive and structurally distinct framework. ReCAPAgent-SQL is not only a multi-agent system, but also systematically incorporates the core ideas of PreSQL generation, schema linking, PostSQL refinement, execution-guided self-refinement, and expert vot…

**Figure 8.** Architecture of ReCAPAgent-SQL with core agent modules.

**Figure 9.** Generate reference plan prompt. Task Instruction: The agent explicitly instructs the LLM to produce only a step-by-step plan in strict JSON format. Expected CSV Format: The agent also asks the LLM to define the expected output format (column names and types). Response Parsing: The LLM's response is parsed into a plan (a list of steps) and an expected_csv_format. To ensure robustness, the agent is enfor…

**Figure 10.** Reference plan example. Syntax RetrieverAgent for BigQuery: SQL dialect-specific syntax (e.g., BigQuery) often includes specialized function parameters or advanced usage patterns that may not be fully represented in an LLM's training corpus. The Syntax RetrieverAgent module addresses this limitation by retrieving relevant syntax documentation from a dedicated syntax repository. Whether this module is invo…

**Figure 11.** Syntax RetrieverAgent Prompt Example. 3.3.3 Critique Agent: Plan/SQL Evaluation Strategy. The Critique Agent provides automated evaluation and refinement for SQL generation, improving robustness via iterative critique. Plan Critique: Given a step-by-step reasoning plan produced by the PlannerAgent, the Critique Agent conducts a structured evaluation of the plan by prompting an LLM to assess (1) whether th…

**Figure 12.** Predict Action Prompt: Given the user question and environment, the LLM is instructed to generate an …

**Figure 13.** Self-Refinement Prompt: Guides the LLM in analyzing the SQL failure reason and refining the query using …

**Figure 14.** Validate Result Prompt: Checks whether the query result is correct, empty, or contains redundant columns.

**Figure 15.** Voting strategy model weights (Stage 6) on Spider 1.0 Dev. WMA shows rapid convergence to stronger …

**Figure 16.** Voting error bound analysis (Stage 6) on Spider 1.0 Dev. WMA achieves the lowest final error rate and …

**Figure 17.** Voting strategy model weights (Stage 6) on Spider 1.0 Test. WMA rapidly converges to stronger models.

**Figure 18.** Voting error bound analysis (Stage 6) on Spider 1.0 Test. WMA achieves the lowest final error rate and …

**Figure 19.** Voting strategy model weights (Stage 6) on BIRD Dev. WMA converges quickly to stronger models.

**Figure 20.** Voting error bound analysis (Stage 6) on BIRD Dev. WMA achieves the lowest final error rate and regret.

**Figure 21.** Performance comparison of baseline and ReCAPAgent-SQL on 100 Spider 2.0-lite queries. GPT-4.1 shows …

**Figure 22.** Ensemble Voting strategy analysis: WMA vs RWMA on Spider 2.0-lite. WMA achieves slightly higher …

Original abstract

Text-to-SQL has emerged as a prominent research area, particularly with the rapid advancement of large language models (LLMs). By enabling users to query databases through natural language rather than SQL, this technology significantly lowers the barrier to data analysis. However, generating accurate SQL from natural language remains challenging due to ambiguity in user queries, the complexity of schema linking, limited generalization across SQL dialects, and the need for domain-specific understanding. In this study, we propose a Single-Agent Self-Refinement with Ensemble Voting (SSEV) pipeline built on PET-SQL that operates without ground-truth data, integrating self-refinement with Weighted Majority Voting (WMV) and its randomized variant (RWMA). Experimental results show that the SSEV achieves competitive performance across multiple benchmarks, attaining execution accuracies of 85.5% on Spider 1.0-Dev, 86.4% on Spider 1.0-Test, and 66.3% on BIRD-Dev. Building on insights from the SSEV pipeline, we further propose ReCAPAgent-SQL (Refinement-Critique-Act-Plan agent-based SQL framework) to address the growing complexity of enterprise databases and real-world Text-to-SQL tasks. The framework integrates multiple specialized agents for planning, external knowledge retrieval, critique, action generation, self-refinement, schema linking, and result validation, enabling iterative refinement of SQL predictions through agent collaboration. ReCAPAgent-SQL's WMA results achieve 31% execution accuracy on the first 100 queries of Spider 2.0-Lite, demonstrating significant improvements in handling real-world enterprise scenarios. Overall, our work facilitates the deployment of scalable Text-to-SQL systems in practical settings, supporting better data-driven decision-making at lower cost and with greater efficiency.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Extends PET-SQL with self-refinement and voting for competitive Text-to-SQL results but misses key ablations to prove the additions matter.

This paper layers self-refinement and weighted majority voting on PET-SQL to reach 85.5% execution accuracy on Spider dev, 86.4% on Spider test, and 66.3% on BIRD. It then scales the idea into a multi-agent framework, ReCAPAgent-SQL, that adds planning, critique, and validation agents for enterprise queries and reaches 31% on the first 100 queries of Spider 2.0-Lite. The SSEV pipeline and the agent roles are the concrete new pieces, and both operate without ground-truth data. The work targets real pain points such as schema ambiguity and iterative error fixing, and the benchmark numbers are competitive on public sets, which gives practitioners something to try. The main weakness is the missing direct comparison to plain PET-SQL under the same LLM and prompting budget: without it, the performance lift cannot be cleanly attributed to the self-refinement or voting steps rather than to the base framework or model choice. Details on experimental protocol, variance, and statistical checks are light, which leaves the generalization claim thinner than it could be. The stress-test concern, that the gains may come from the base model rather than the added layers, holds up. The intended audience is applied Text-to-SQL researchers and engineers who want engineering patterns for LLM database interfaces; a reader focused on multi-agent refinement would also pick up usable ideas. The paper deserves peer review because the benchmarks are standard and the pipeline is described in enough detail to replicate and test further.

Referee Report

3 major / 2 minor

Summary. The manuscript proposes a Single-Agent Self-Refinement with Ensemble Voting (SSEV) pipeline built on PET-SQL that combines self-refinement with Weighted Majority Voting (WMV) and its randomized variant (RWMA) for Text-to-SQL generation without ground-truth data. It reports execution accuracies of 85.5% on Spider 1.0-Dev, 86.4% on Spider 1.0-Test, and 66.3% on BIRD-Dev. It further introduces ReCAPAgent-SQL, a multi-agent framework with specialized agents for planning, critique, action generation, self-refinement, schema linking, and validation, achieving 31% execution accuracy on the first 100 queries of Spider 2.0-Lite for enterprise scenarios.

Significance. If the performance claims hold after proper validation, the work could advance practical LLM-based Text-to-SQL systems by demonstrating how self-refinement and adaptive voting improve accuracy on established benchmarks and complex enterprise queries. The multi-agent ReCAPAgent-SQL framework addresses real-world challenges such as schema complexity and domain knowledge, potentially supporting more scalable and cost-effective data analysis tools.

major comments (3)

[Abstract] The headline execution accuracies (85.5% Spider-Dev, 86.4% Test, 66.3% BIRD) are presented as evidence that SSEV improves over prior work, yet the manuscript supplies no direct comparison of the base PET-SQL model (same LLM, same prompting budget, same schema linking) versus the full SSEV stack; without this delta the contribution of self-refinement and WMV/RWMA cannot be isolated.
[Abstract] No experimental protocol is described, including the specific LLM, number of self-refinement iterations, voting hyperparameters, number of runs, or statistical significance tests; this absence prevents verification that the reported numbers support the central performance claims.
[ReCAPAgent-SQL section] The 31% WMA result on the first 100 queries of Spider 2.0-Lite is claimed to demonstrate significant improvements for enterprise scenarios, but the text provides neither comparisons to prior methods nor ablations of the individual agent components (planning, critique, validation), leaving the load-bearing role of the framework unassessed.

minor comments (2)

Define all acronyms (SSEV, WMV, RWMA, ReCAPAgent-SQL) at first use and ensure consistent terminology throughout.
Add an algorithm box or pseudocode for the adaptive weighted majority voting procedure to clarify implementation details.
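As a rough idea of what such a box might contain, here is a sketch of the textbook weighted majority algorithm and its randomized variant. It is not taken from the paper. In particular, this page does not say what counts as a mistake when no gold SQL is available, so the demo below assumes, purely for illustration, that the weighted-majority answer itself serves as the reference; the penalty `eta` and the model names are likewise hypothetical.

```python
import random
from collections import defaultdict

def wma_predict(weights, votes):
    """Deterministic variant: return the answer with the largest total weight."""
    totals = defaultdict(float)
    for expert, answer in votes.items():
        totals[answer] += weights[expert]
    return max(totals, key=totals.get)

def rwma_predict(weights, votes, rng):
    """Randomized variant: follow one expert sampled in proportion to its weight."""
    experts = list(votes)
    pick = rng.choices(experts, weights=[weights[e] for e in experts], k=1)[0]
    return votes[pick]

def update(weights, votes, reference, eta=0.5):
    """Shrink the weight of every expert whose answer differs from `reference`."""
    return {e: (w * (1 - eta) if votes[e] != reference else w)
            for e, w in weights.items()}

# Toy rounds: each dict maps a model to the (abstract) answer it produced.
weights = {"m1": 1.0, "m2": 1.0, "m3": 1.0}
rounds = [
    {"m1": "A", "m2": "A", "m3": "B"},
    {"m1": "C", "m2": "D", "m3": "D"},
    {"m1": "E", "m2": "F", "m3": "F"},
]
predictions = []
for votes in rounds:
    answer = wma_predict(weights, votes)
    predictions.append(answer)
    # Assumption: the majority answer stands in for the missing ground truth.
    weights = update(weights, votes, reference=answer)
print(predictions, weights)
```

In a Text-to-SQL setting the "answers" would more plausibly be execution results than raw SQL strings, and a different feedback signal would change which models gain weight.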

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback. We agree that the current manuscript lacks sufficient detail on experimental controls, protocols, and ablations, which limits the ability to assess the contributions of SSEV and ReCAPAgent-SQL. We will revise the manuscript to address these points directly, adding the requested comparisons, protocol descriptions, and component analyses while preserving the core claims supported by our existing results.

Referee: [Abstract] The headline execution accuracies (85.5% Spider-Dev, 86.4% Test, 66.3% BIRD) are presented as evidence that SSEV improves over prior work, yet the manuscript supplies no direct comparison of the base PET-SQL model (same LLM, same prompting budget, same schema linking) versus the full SSEV stack; without this delta the contribution of self-refinement and WMV/RWMA cannot be isolated.

Authors: We agree that isolating the contribution of self-refinement and WMV/RWMA requires a direct head-to-head comparison against the unmodified PET-SQL baseline under matched conditions. In the revised version we will add a dedicated ablation table (and corresponding text in the abstract and experimental sections) reporting execution accuracy for base PET-SQL using the identical LLM, prompting budget, and schema linking procedure. This will allow readers to compute the precise delta attributable to the SSEV pipeline.
Revision: yes

Referee: [Abstract] No experimental protocol is described, including the specific LLM, number of self-refinement iterations, voting hyperparameters, number of runs, or statistical significance tests; this absence prevents verification that the reported numbers support the central performance claims.

Authors: We acknowledge the omission. The revised manuscript will include a new “Experimental Setup” subsection that specifies the LLM (including version and temperature), exact number of self-refinement iterations, WMV/RWMA hyperparameters (weights, randomization seed, ensemble size), number of independent runs, and the statistical tests (e.g., McNemar or bootstrap confidence intervals) used to assess significance. These details will also be summarized concisely in the abstract.
Revision: yes

Referee: [ReCAPAgent-SQL section] The 31% WMA result on the first 100 queries of Spider 2.0-Lite is claimed to demonstrate significant improvements for enterprise scenarios, but the text provides neither comparisons to prior methods nor ablations of the individual agent components (planning, critique, validation), leaving the load-bearing role of the framework unassessed.

Authors: We agree that the current presentation does not sufficiently demonstrate the contribution of the multi-agent design. In revision we will (1) add baseline results from prior Text-to-SQL methods evaluated on the same first 100 Spider 2.0-Lite queries and (2) include an ablation study that systematically disables or replaces each agent (planning, critique, validation, etc.) while keeping the rest of the pipeline fixed. These additions will be placed in the ReCAPAgent-SQL section and summarized in the abstract.
Revision: yes
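To show how much uncertainty a 100-query evaluation carries, the sketch below computes a percentile bootstrap confidence interval for execution accuracy. It assumes only the reported aggregate (31 correct out of 100); which individual queries were correct is not given on this page, and the resample count, seed, and function name are arbitrary illustrative choices.

```python
import random

def bootstrap_ci(outcomes, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the mean of 0/1 correctness flags."""
    rng = random.Random(seed)
    n = len(outcomes)
    means = sorted(
        sum(rng.choices(outcomes, k=n)) / n   # resample with replacement
        for _ in range(n_resamples)
    )
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi

# 31 correct out of 100, matching the reported Spider 2.0-Lite aggregate;
# the order of the flags is irrelevant to the bootstrap.
outcomes = [1] * 31 + [0] * 69
lo, hi = bootstrap_ci(outcomes)
print(f"accuracy=0.31, 95% CI approx [{lo:.2f}, {hi:.2f}]")
```

With only 100 queries the interval is wide, which is why ablation differences of a few points would need a paired test or more queries to be convincing.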

Circularity Check

0 steps flagged

No circularity: empirical results on external benchmarks

The paper reports execution accuracies measured directly on public external benchmarks (Spider 1.0-Dev/Test, BIRD-Dev, Spider 2.0-Lite) without any equations, fitted parameters, or self-referential definitions that reduce the claimed performance to quantities defined inside the same pipeline. The SSEV and ReCAPAgent-SQL pipelines are described as built on PET-SQL, but the results are presented as measured outcomes rather than predictions derived by construction from internal fits or self-citations. No load-bearing self-citation chains, ansatzes smuggled via prior work, or renaming of known results appear in the provided text. This is a standard empirical evaluation setup with no detectable circularity in the derivation chain.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

The claims rest on the domain assumption that LLMs can iteratively correct their own SQL errors and that agent collaboration improves handling of complex schemas; no numerical free parameters are stated in the abstract, and the two named frameworks are the primary invented contributions.

axioms (1)

domain assumption: LLMs can effectively self-refine SQL queries through iterative prompting without external ground truth
This assumption directly supports the Single-Agent Self-Refinement component of the SSEV pipeline.
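What this assumption amounts to in practice can be sketched as an execution-guided loop: run the query, and if the database raises an error, hand the query and the error message back to a refiner. The `refine` callable stands in for an LLM call and is hypothetical, as are the function names and the round limit; the paper's actual refinement prompt (Figure 13) is not reproduced here.

```python
import sqlite3

def self_refine(conn, sql, refine, max_rounds=3):
    """Execute `sql`; on a database error, ask `refine(sql, error)` for a new query."""
    for _ in range(max_rounds):
        try:
            conn.execute(sql).fetchall()
            return sql                       # executes cleanly, stop refining
        except sqlite3.Error as exc:
            sql = refine(sql, str(exc))      # no gold SQL involved, only the error text
    return sql                               # last attempt, not re-verified

def toy_refine(sql, error):
    """Stand-in for an LLM: fixes one known typo based on the error message."""
    if "no such column: nam" in error:
        return sql.replace("nam ", "name ")
    return sql

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE singer (id INTEGER, name TEXT)")
fixed = self_refine(conn, "SELECT nam FROM singer", toy_refine)
print(fixed)
```

A query that executes without error can still be semantically wrong, so a loop like this only addresses failures the database itself reports; that gap is presumably what the critique and validation agents are meant to cover.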

invented entities (2)

SSEV pipeline (no independent evidence)
purpose: Combine self-refinement with weighted majority voting for Text-to-SQL without ground-truth supervision
New pipeline built on PET-SQL and presented as the core contribution.
ReCAPAgent-SQL framework (no independent evidence)
purpose: Multi-agent system integrating planning, critique, schema linking, and validation for enterprise-scale Text-to-SQL
Introduced to address growing complexity of real-world databases.

Reference graph

Works this paper leans on

35 extracted references · 35 canonical work pages · 2 internal anchors

[1] Wren AI. Revolutionize business intelligence with genbi, 2024. Accessed: 2025-05-21.
[2] Salesforce. How we built a text-to-sql ai agent to get instant answers from our data. https://www.salesforce.com/blog/text-to-sql-agent/, 2025. Accessed: 2025-06.
[3] Snowflake. Snowflake introduces cortex aisql and snowconvert ai: Analytics rebuilt for the ai era. https://www.snowflake.com/en/news/press-releases/snowflake-introduces-cortex-aisql-and-snowconvert-ai-analytics-rebuilt-for-the-ai-era/
[4] Fangyu Lei, Jixuan Chen, Yuxiao Ye, Ruisheng Cao, Dongchan Shin, Hongjin Su, Zhaoqing Suo, Hongcheng Gao, Wenjing Hu, Pengcheng Yin, Victor Zhong, Caiming Xiong, Ruoxi Sun, Qian Liu, Sida Wang, and Tao Yu. Spider 2.0: Evaluating language models on real-world enterprise text-to-sql workflows. In Proceedings of the 12th International Conference on Learning R...
[5] Tao Yu, Rui Zhang, Kai Yang, Michihiro Yasunaga, Dongxu Wang, Zifan Li, James Ma, Quan Yao, and Dragomir Radev Roman. Spider: A large-scale human-labeled dataset for complex and cross-domain semantic parsing and text-to-sql task. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 3911–3921. Association ...
[6] Jinyang Li, Binyuan Hui, Ge Qu, Jiaxi Yang, Binhua Li, Bowen Li, Bailin Wang, Bowen Qin, Rongyu Cao, Ruiying Geng, Nan Huo, Xuanhe Zhou, Chenhao Ma, Guoliang Li, Kevin C.C. Chang, Fei Huang, Reynold Cheng, and Yongbin Li. Can llm already serve as a database interface? a big bench for large-scale database grounded text-to-sqls. arXiv preprint arXiv:2305.03111, 2023.
[7] Minghang Deng, Ashwin Ramachandran, Canwen Xu, Lanxiang Hu, Zhewei Yao, Anupam Datta, and Hao Zhang. ReFoRCE: A Text-to-SQL Agent with Self-Refinement, Consensus Enforcement, and Column Exploration. arXiv preprint arXiv:2502.00675, 2025.
[8] Mohammadreza Pourreza and Davood Rafiei. Din-sql: Decomposed in-context learning of text-to-sql with self-correction. arXiv preprint arXiv:2304.11015, 2023.
[9] Dawei Gao, Haibin Wang, Yaliang Li, Xiuyu Sun, Yichen Qian, Bolin Ding, and Jingren Zhou. Text-to-sql empowered by large language models: A benchmark evaluation. arXiv preprint arXiv:2308.15363, 2023.
[10] Zhishuai Li, Xiang Wang, Jingjing Zhao, Sun Yang, Guoqing Du, Xiaoru Hu, Bin Zhang, Yuxiao Ye, Ziyue Li, Rui Zhao, and Hangyu Mao. PET-SQL: A Prompt-Enhanced Two-Round Refinement of Text-to-SQL with Cross-consistency. arXiv preprint arXiv:2403.09732, 2024.
[11] Zijin Hong, Zheng Yuan, Qinggang Zhang, Hao Chen, Junnan Dong, Feiran Huang, and Xiao Huang. Next-generation database interfaces: A survey of llm-based text-to-sql. arXiv, 2024.
[12] Siqiao Xue, Caigao Jiang, Wenhui Shi, Fangyin Cheng, Keting Chen, Hongjun Yang, Zhiping Zhang, Jianshan He, Hongyang Zhang, Ganglin Wei, Wang Zhao, Fan Zhou, Danrui Qi, Hong Yi, Shaodong Liu, and Faqiang Chen. Db-gpt: Empowering database interactions with private large language models. arXiv, 2023.
[13] Shai Shalev-Shwartz. Lecture 15: The weighted majority algorithm. https://www.cs.cmu.edu/~15850/notes/lec15.pdf, 2022. Accessed: 2025-05-22.
[15] Available at https://arxiv.org/abs/1909.05207
[16] Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, et al. The llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024. Accessed: 2025-06.
[17] Qwen Team. Qwen2.5: A party of foundation models, September 2024.
[18] Binyuan Hui, Jian Yang, Zeyu Cui, et al. Qwen2.5-coder technical report. arXiv preprint arXiv:2409.12186, 2024. Accessed: 2025-06.
[19] OpenAI. Gpt-4o: Fast, intelligent, flexible gpt model. https://platform.openai.com/docs/models/gpt-4o, 2024. Accessed: 2025-06.
[20] OpenAI. Gpt-4.1 - flagship gpt model for complex tasks. https://platform.openai.com/docs/models/gpt-4.1, 2025. Accessed: 2025-04-23.
[21] OpenAI. o3-mini - a small model alternative to o3. https://platform.openai.com/docs/models/o3-mini
[22] Google DeepMind. Gemini 2.5: Our most intelligent ai model. https://deepmind.google/technologies/gemini, 2025. Accessed: 2025-06.
[23] xAI. Grok 3 beta — the age of reasoning agents. https://x.ai/news/grok-3, 2025. Accessed: 2025-06.

[1] [1]

Revolutionize business intelligence with genbi, 2024

Wren AI. Revolutionize business intelligence with genbi, 2024. Accessed: 2025-05-21

work page 2024

[2] [2]

How we built a text-to-sql ai agent to get instant answers from our data

Salesforce. How we built a text-to-sql ai agent to get instant answers from our data. https://www.salesforce. com/blog/text-to-sql-agent/, 2025. Accessed: 2025-06

work page 2025

[3] [3]

Snowflake introduces cortex aisql and snowconvert ai: Analytics re- built for the ai era

Snowflake. Snowflake introduces cortex aisql and snowconvert ai: Analytics re- built for the ai era. https://www.snowflake.com/en/news/press-releases/ snowflake-introduces-cortex-aisql-and-snowconvert-ai-analytics-rebuilt-for-the-ai-era/ ,

work page

[4] [4]

Spider 2.0: Evaluating language models on real-world enterprise text-to-sql workflows

Fangyu Lei, Jixuan Chen, Yuxiao Ye, Ruisheng Cao, Dongchan Shin, Hongjin Su, Zhaoqing Suo, Hongcheng Gao, Wenjing Hu, Pengcheng Yin, Victor Zhong, Caiming Xiong, Ruoxi Sun, Qian Liu, Sida Wang, and Tao Yu. Spider 2.0: Evaluating language models on real-world enterprise text-to-sql workflows. InProceedings of the 12th International Conference on Learning R...

work page 2025

[5] [5]

Spider: A large-scale human-labeled dataset for complex and cross-domain semantic parsing and text-to-sql task

Tao Yu, Rui Zhang, Kai Yang, Michihiro Yasunaga, Dongxu Wang, Zifan Li, James Ma, Quan Yao, and Dragomir Radev Roman. Spider: A large-scale human-labeled dataset for complex and cross-domain semantic parsing and text-to-sql task. InProceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 3911–3921. Association ...

work page 2018

[6] [6]

Chang, Fei Huang, Reynold Cheng, and Yongbin Li

Jinyang Li, Binyuan Hui, Ge Qu, Jiaxi Yang, Binhua Li, Bowen Li, Bailin Wang, Bowen Qin, Rongyu Cao, Ruiying Geng, Nan Huo, Xuanhe Zhou, Chenhao Ma, Guoliang Li, Kevin C.C. Chang, Fei Huang, Reynold Cheng, and Yongbin Li. Can llm already serve as a database interface? a big bench for large-scale database grounded text-to-sqls.arXiv preprint arXiv:2305.03111, 2023

work page arXiv 2023

[7] [7]

doi:10.48550/arXiv.2502.00675 , abstract =

Minghang Deng, Ashwin Ramachandran, Canwen Xu, Lanxiang Hu, Zhewei Yao, Anupam Datta, and Hao Zhang. ReFoRCE: A Text-to-SQL Agent with Self-Refinement, Consensus Enforcement, and Column Exploration.arXiv preprint arXiv:2502.00675, 2025

work page arXiv 2025

[8] [8]

Mohammadreza Pourreza, Ruoxi Sun, Hailong Li, Lesly Miculicich, Tomas Pfister, and Sercan O Arik

Davood Rafiei Mohammadreza Pourreza. Din-sql: Decomposed in-context learning of text-to-sql with self- correction.arXiv preprint arXiv:2304.11015, 2023

work page arXiv 2023

[9] Dawei Gao, Haibin Wang, Yaliang Li, Xiuyu Sun, Yichen Qian, Bolin Ding, and Jingren Zhou. Text-to-SQL empowered by large language models: A benchmark evaluation. arXiv preprint arXiv:2308.15363, 2023.

[10] Zhishuai Li, Xiang Wang, Jingjing Zhao, Sun Yang, Guoqing Du, Xiaoru Hu, Bin Zhang, Yuxiao Ye, Ziyue Li, Rui Zhao, and Hangyu Mao. PET-SQL: A Prompt-Enhanced Two-Round Refinement of Text-to-SQL with Cross-consistency. arXiv preprint arXiv:2403.09732, 2024.

[11] Zijin Hong, Zheng Yuan, Qinggang Zhang, Hao Chen, Junnan Dong, Feiran Huang, and Xiao Huang. Next-generation database interfaces: A survey of LLM-based text-to-SQL. arXiv, 2024.

[12] Siqiao Xue, Caigao Jiang, Wenhui Shi, Fangyin Cheng, Keting Chen, Hongjun Yang, Zhiping Zhang, Jianshan He, Hongyang Zhang, Ganglin Wei, Wang Zhao, Fan Zhou, Danrui Qi, Hong Yi, Shaodong Liu, and Faqiang Chen. DB-GPT: Empowering database interactions with private large language models. arXiv, 2023.

[13] Shai Shalev-Shwartz. Lecture 15: The weighted majority algorithm. https://www.cs.cmu.edu/~15850/notes/lec15.pdf, 2022. Accessed: 2025-05-22.

[15] Available at https://arxiv.org/abs/1909.05207

[16] Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, et al. The Llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024. Accessed: 2025-06.

[17] Qwen Team. Qwen2.5: A party of foundation models, September 2024.

[18] Binyuan Hui, Jian Yang, Zeyu Cui, et al. Qwen2.5-Coder technical report. arXiv preprint arXiv:2409.12186, 2024. Accessed: 2025-06.

[19] OpenAI. GPT-4o: Fast, intelligent, flexible GPT model. https://platform.openai.com/docs/models/gpt-4o, 2024. Accessed: 2025-06.

[20] OpenAI. GPT-4.1 - flagship GPT model for complex tasks. https://platform.openai.com/docs/models/gpt-4.1, 2025. Accessed: 2025-04-23.

[21] OpenAI. o3-mini - a small model alternative to o3. https://platform.openai.com/docs/models/o3-mini,

[22] Google DeepMind. Gemini 2.5: Our most intelligent AI model. https://deepmind.google/technologies/gemini, 2025. Accessed: 2025-06.

[23] xAI. Grok 3 beta — the age of reasoning agents. https://x.ai/news/grok-3, 2025. Accessed: 2025-06.

Appendix A. Self-Refinement Prompt

-- Target SQL Dialect: BIGQUERY
[Project Information]
Current Project ID: bigquery-public-data
Important: Use these IDs in your ...

First, check the database structure:
- Use 'ls' to see all files and directories
- Look for files ending in .sql, .ddl, or .schema
- Check for any README.md or documentation files
...
[Original SQL]
WITH EngagementData AS (
  SELECT
    user_pseudo_id,
    event_timestamp,
    (SELECT value.int_value FROM UNNEST(event_params) WHERE key = 'engagement_time_msec') AS enga...

- Try simplifying query structure
- Focus on filtering conditions
- Try SELECT with minimal columns first
- Double-check all referenced schema components. Explain your strategy choice and apply it to refine the SQL.

[Expected Output Format]
CSV Format: distinct_pseudo_users_count (integer)
Ensure the output matches this format exactly.

[Next Steps]

- First verify the database structure and table names
- If directory access fails, try alternative paths
- Document any access issues encountered
- Then refine the SQL query to resolve the issue
- Ensure the output format matches the requirements

[Required Action Format]
You must output exactly one of the following actions (no other text):
- Action: BIGQUERY_EXEC_SQL(sql_query="...", is_save=..., save_path=".../result.csv")
- Action: Terminate(output=".../result.csv")

Important: When dealing with numerical results, DO NOT round the numbers. Keep th...
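The required action format above lends itself to strict parsing on the consumer side. The following is a minimal sketch of how a harness might validate a single action line, not the authors' implementation: the paper shows only the format itself, so `parse_action`, `Action`, `ALLOWED_ARGS`, and the example path `out/result.csv` are all hypothetical names introduced for illustration.

```python
import re
from dataclasses import dataclass, field

# Hypothetical sketch: the source specifies only the action *format*,
# not how the surrounding system parses it.
ACTION_RE = re.compile(
    r"^Action:\s*(?P<name>[A-Za-z_]+)\((?P<args>.*)\)\s*$", re.DOTALL
)
# key="quoted value" or key=bare_value (e.g. is_save=True)
ARG_RE = re.compile(
    r'(?P<key>\w+)\s*=\s*(?:"(?P<quoted>(?:[^"\\]|\\.)*)"|(?P<bare>[^,\s)]+))'
)

# The two actions named in the prompt and the arguments each one lists.
ALLOWED_ARGS = {
    "BIGQUERY_EXEC_SQL": {"sql_query", "is_save", "save_path"},
    "Terminate": {"output"},
}


@dataclass
class Action:
    name: str
    args: dict = field(default_factory=dict)


def parse_action(text: str) -> Action:
    """Parse one 'Action: NAME(key=value, ...)' line; reject anything else."""
    match = ACTION_RE.match(text.strip())
    if match is None:
        raise ValueError("expected exactly one 'Action: NAME(...)' line")
    name = match.group("name")
    if name not in ALLOWED_ARGS:
        raise ValueError(f"unknown action: {name}")
    args = {}
    for m in ARG_RE.finditer(match.group("args")):
        quoted = m.group("quoted")
        # Values are kept as raw strings; escape sequences are not decoded.
        args[m.group("key")] = quoted if quoted is not None else m.group("bare")
    unexpected = set(args) - ALLOWED_ARGS[name]
    if unexpected:
        raise ValueError(f"unexpected arguments for {name}: {sorted(unexpected)}")
    return Action(name=name, args=args)


if __name__ == "__main__":
    line = 'Action: Terminate(output="out/result.csv")'
    print(parse_action(line))
```

Rejecting any output that is not a single well-formed action mirrors the prompt's "no other text" requirement; a harness could feed the resulting `ValueError` message back to the model as a refinement signal, although the source does not state that this is done.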

First, check the database structure:
- Use 'ls' to see all files and directories
- Look for files ending in '.sql', '.ddl', or '.schema'
- Check for any README.md or documentation files

Verify table names and schema:
- Open any `.sql` or `.ddl` files to see table definitions
- Look for `CREATE TABLE` statements
- Note the exact database and schema names
- Check for any table aliases or views
- If schema file not found, try other directories
......
[Current Plan]
['Identify the relevant table in the `ga4` dataset within the bigquery-publi...

**TIMESTAMP Functions and Timezone Handling**: I need documentation or examples on how to handle timestamps and timezones in BigQuery
"description": "GoogleSQL for BigQuery supports the following timestamp functions.\n\n IMPORTANT: Before working with these functions", ......, ‘‘‘ ## specifically for filtering data within defined date ranges (e.g. DATE | ...