ZAS-SQL: Distilling Rules from Failures for Zero-Shot Text-to-SQL
Pith reviewed 2026-06-27 19:42 UTC · model grok-4.3
The pith
Distilling recurring LLM failure patterns into a compact set of generation rules enables a fully zero-shot Text-to-SQL system to reach 88.6% execution accuracy on Spider test.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
LLM failures in zero-shot Text-to-SQL follow systematic patterns that can be distilled into a small set of core generation rules; when these rules are paired with knowledge-augmented schema representation and rule-driven structured reasoning plus execution-guided early stopping, the framework produces executable SQL at 87.2% accuracy on Spider development and 88.6% on test without any in-context examples.
What carries the argument
The Map-Reduce-based rule distillation pipeline that scans failure cases, maps individual error modes, and reduces them to a compact rule set that then constrains generation in the three downstream modules.
If this is right
- The same distilled rules allow an 81.3% result on the domain-specific UrbanPlan dataset.
- A 4B-parameter model equipped with the rules surpasses zero-shot baselines of larger closed-source models.
- Rule-driven structured reasoning reduces structural deviations that otherwise appear in free-form generation.
- Execution-guided early stopping supplies low-cost self-correction without extra model calls.
Where Pith is reading between the lines
- The distillation pipeline could be reused on other structured output tasks where errors also cluster, such as semantic parsing or code generation.
- Because the rules replace in-context examples, the method may scale better when schemas become very large and context windows are limited.
- Iterative application of the same rules across multiple self-correction rounds might further close the remaining gap to few-shot performance.
Load-bearing premise
The assumption that LLM failures in zero-shot Text-to-SQL are systematic and recurring enough to be distilled into a small set of rules that generalize across domains and model sizes.
What would settle it
Extract the rules on Spider, then test them on a fresh cross-domain benchmark where the rule-augmented zero-shot accuracy falls back to the level of an unaugmented zero-shot baseline.
Figures
read the original abstract
Text-to-SQL translates natural language into executable SQL queries. Few-shot in-context learning methods built upon large language models (LLMs) achieve strong performance, yet their reliance on demonstrations limits cross-domain generalization and consumes substantial context window space. Existing zero-shot methods, lacking effective generation constraints, still fall short of few-shot approaches. We observe that LLM failures in zero-shot Text-to-SQL are not random but exhibit systematic, recurring patterns. Building on this observation, we propose a fully zero-shot Text-to-SQL framework that distills core generation rules from failure cases through a Map-Reduce-based rule distillation pipeline and improves generation quality via three complementary modules: knowledge-augmented schema representation, which supplements missing semantics in Data Definition Language; a rule-driven structured reasoning framework that suppresses structural deviations; and Execution-Guided Early Stopping, which enables low-cost self-correction. On Spider, the proposed framework achieves up to 87.2% and 88.6% execution accuracy on the Dev and Test sets, respectively, establishing a new zero-shot state-of-the-art and surpassing multiple few-shot and fine-tuning methods built upon GPT-4/4o. On the domain-specific dataset UrbanPlan, it achieves 81.3%, confirming that the rule distillation approach generalizes across domains. Moreover, when equipped with a 4B-parameter model, the framework surpasses zero-shot baselines of leading closed-source models, demonstrating strong model generality.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes ZAS-SQL, a fully zero-shot Text-to-SQL framework that identifies systematic failure patterns in LLM-generated SQL, distills a small set of core generation rules via a Map-Reduce pipeline, and augments generation with knowledge-augmented schema representation, rule-driven structured reasoning, and Execution-Guided Early Stopping. It reports execution accuracies of 87.2% (Spider Dev) and 88.6% (Spider Test), surpassing several few-shot and fine-tuned GPT-4/4o baselines, with additional results on UrbanPlan (81.3%) and a 4B-parameter model.
Significance. If the zero-shot guarantee can be verified without label leakage, the result would be significant: it would demonstrate that a compact set of distilled rules can substitute for in-context demonstrations while improving cross-domain generalization and model efficiency. The reported outperformance of closed-source zero-shot baselines by an open 4B model would also be noteworthy if reproducible.
major comments (1)
- [Abstract and Methods (Map-Reduce rule distillation)] Abstract and rule-distillation pipeline description: the central claim of a 'fully zero-shot' framework that sets new SOTA while surpassing few-shot GPT-4 methods rests on the distillation step. Identifying failures for rule extraction requires executing generated SQL against ground-truth labels and comparing results; the manuscript provides no explicit statement that this step was performed exclusively on a held-out synthetic corpus rather than the Spider dev/test splits whose accuracies are later reported. If evaluation labels were used, the zero-shot property and the comparison to few-shot baselines are invalidated.
minor comments (1)
- [Abstract] The abstract mentions results on 'UrbanPlan' but supplies no dataset statistics, domain description, or baseline numbers for that corpus.
Simulated Author's Rebuttal
We thank the referee for highlighting this critical point regarding the zero-shot guarantee. We address the concern directly below and will revise the manuscript accordingly.
read point-by-point responses
-
Referee: Abstract and rule-distillation pipeline description: the central claim of a 'fully zero-shot' framework that sets new SOTA while surpassing few-shot GPT-4 methods rests on the distillation step. Identifying failures for rule extraction requires executing generated SQL against ground-truth labels and comparing results; the manuscript provides no explicit statement that this step was performed exclusively on a held-out synthetic corpus rather than the Spider dev/test splits whose accuracies are later reported. If evaluation labels were used, the zero-shot property and the comparison to few-shot baselines are invalidated.
Authors: We agree that the current manuscript lacks an explicit statement on the data source for rule distillation, which is necessary to substantiate the zero-shot claim. The rule distillation was performed exclusively on a held-out synthetic corpus constructed independently of the Spider dev/test splits (with no overlap in queries or schemas), ensuring no label leakage into the reported evaluation results. We will revise the Methods section and add a dedicated subsection detailing the synthetic corpus generation process, the Map-Reduce pipeline execution, and explicit confirmation of separation from evaluation data. This clarification will also be reflected in the abstract. revision: yes
Circularity Check
No circularity; empirical claims rest on benchmark evaluation without self-referential reduction
full rationale
The paper describes an empirical pipeline that observes LLM failure patterns in zero-shot Text-to-SQL, distills rules via Map-Reduce, and augments generation with schema and reasoning modules. No equations, parameter fits, or self-citations are invoked that would make any reported accuracy (e.g., 87.2% on Spider dev) equivalent to its own inputs by construction. The central claims are externally falsifiable via standard benchmark execution accuracy and do not reduce to renaming, self-definition, or load-bearing self-citation chains. The derivation chain is therefore self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption LLM failures in zero-shot Text-to-SQL are systematic and recurring rather than random
Reference graph
Works this paper leans on
-
[1]
A survey on deep learning approaches for text-to-. The VLDB Journal , author =. 2023 , pages =. doi:10.1007/s00778-022-00776-8 , language =
-
[2]
Recent Advances in Text-to- SQL : A Survey of What We Have and What We Expect
Deng, Naihao and Chen, Yulong and Zhang, Yue. Recent Advances in Text-to- SQL : A Survey of What We Have and What We Expect. Proceedings of the 29th International Conference on Computational Linguistics. 2022
2022
-
[3]
Next-Generation Database Interfaces: A Survey of LLM-Based Text-to-SQL , year=
Hong, Zijin and Yuan, Zheng and Zhang, Qinggang and Chen, Hao and Dong, Junnan and Huang, Feiran and Huang, Xiao , journal=. Next-Generation Database Interfaces: A Survey of LLM-Based Text-to-SQL , year=
-
[4]
2025 , eprint=
Exploring the Landscape of Text-to-SQL with Large Language Models: Progresses, Challenges and Opportunities , author=. 2025 , eprint=
2025
-
[5]
Yu, Tao and Zhang, Rui and Yang, Kai and Yasunaga, Michihiro and Wang, Dongxu and Li, Zifan and Ma, James and Li, Irene and Yao, Qingning and Roman, Shanelle and Zhang, Zilin and Radev, Dragomir. S pider: A Large-Scale Human-Labeled Dataset for Complex and Cross-Domain Semantic Parsing and Text-to- SQL Task. Proceedings of the 2018 Conference on Empirical...
-
[6]
2024 , eprint=
GPT-4 Technical Report , author=. 2024 , eprint=
2024
-
[7]
Shi, Liang and Tang, Zhengju and Zhang, Nan and Zhang, Xiaotong and Yang, Zhi , title =. ACM Comput. Surv. , month = sep, articleno =. 2025 , issue_date =. doi:10.1145/3737873 , abstract =
-
[8]
MS c- SQL : Multi-Sample Critiquing Small Language Models For Text-To- SQL Translation
Gorti, Satya Krishna and Gofman, Ilan and Liu, Zhaoyan and Wu, Jiapeng and Vouitsis, No. MS c- SQL : Multi-Sample Critiquing Small Language Models For Text-To- SQL Translation. Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers). 202...
-
[9]
Zhong, Qihuang and Chen, Kunfeng and Ding, Liang and Liu, Juhua and Du, Bo and Tao, Dacheng. Learning from Imperfect Data: Towards Efficient Knowledge Distillation of Autoregressive Language Models for Text-to- SQL. Findings of the Association for Computational Linguistics: EMNLP 2024. 2024. doi:10.18653/v1/2024.findings-emnlp.403
-
[10]
Companion of the 2024 International Conference on Management of Data , pages =
Zhang, Chao and Mao, Yuren and Fan, Yijiang and Mi, Yu and Gao, Yunjun and Chen, Lu and Lou, Dongfang and Lin, Jinshu , title =. Companion of the 2024 International Conference on Management of Data , pages =. 2024 , isbn =. doi:10.1145/3626246.3653375 , abstract =
-
[11]
Synthesizing Text-to- SQL Data from Weak and Strong LLM s
Yang, Jiaxi and Hui, Binyuan and Yang, Min and Yang, Jian and Lin, Junyang and Zhou, Chang. Synthesizing Text-to- SQL Data from Weak and Strong LLM s. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2024. doi:10.18653/v1/2024.acl-long.425
-
[12]
Findings of the Association for Computational Linguistics:
Pourreza, Mohammadreza and Rafiei, Davood. DTS - SQL : Decomposed Text-to- SQL with Small Large Language Models. Findings of the Association for Computational Linguistics: EMNLP 2024. 2024. doi:10.18653/v1/2024.findings-emnlp.481
-
[13]
ACT - SQL : In-Context Learning for Text-to- SQL with Automatically-Generated Chain-of-Thought
Zhang, Hanchong and Cao, Ruisheng and Chen, Lu and Xu, Hongshen and Yu, Kai. ACT - SQL : In-Context Learning for Text-to- SQL with Automatically-Generated Chain-of-Thought. Findings of the Association for Computational Linguistics: EMNLP 2023. 2023. doi:10.18653/v1/2023.findings-emnlp.227
-
[14]
MAC - SQL : A Multi-Agent Collaborative Framework for Text-to- SQL
Wang, Bing and Ren, Changyu and Yang, Jian and Liang, Xinnian and Bai, Jiaqi and Chai, LinZheng and Yan, Zhao and Zhang, Qian-Wen and Yin, Di and Sun, Xing and Li, Zhoujun. MAC - SQL : A Multi-Agent Collaborative Framework for Text-to- SQL. Proceedings of the 31st International Conference on Computational Linguistics. 2025
2025
-
[15]
SQLP rompt: In-Context Text-to- SQL with Minimal Labeled Data
Sun, Ruoxi and Arik, Sercan and Sinha, Rajarishi and Nakhost, Hootan and Dai, Hanjun and Yin, Pengcheng and Pfister, Tomas. SQLP rompt: In-Context Text-to- SQL with Minimal Labeled Data. Findings of the Association for Computational Linguistics: EMNLP 2023. 2023. doi:10.18653/v1/2023.findings-emnlp.39
-
[16]
DIN-SQL: Decomposed In-Context Learning of Text-to-SQL with Self-Correction , url =
Pourreza, Mohammadreza and Rafiei, Davood , booktitle =. DIN-SQL: Decomposed In-Context Learning of Text-to-SQL with Self-Correction , url =
-
[17]
2023 , eprint=
Text-to-SQL Empowered by Large Language Models: A Benchmark Evaluation , author=. 2023 , eprint=
2023
-
[18]
MCS - SQL : Leveraging Multiple Prompts and Multiple-Choice Selection For Text-to- SQL Generation
Lee, Dongjun and Park, Choongwon and Kim, Jaehyuk and Park, Heesoo. MCS - SQL : Leveraging Multiple Prompts and Multiple-Choice Selection For Text-to- SQL Generation. Proceedings of the 31st International Conference on Computational Linguistics. 2025
2025
-
[19]
2023 , eprint=
Self-Consistency Improves Chain of Thought Reasoning in Language Models , author=. 2023 , eprint=
2023
-
[20]
2022 , eprint=
CodeT: Code Generation with Generated Tests , author=. 2022 , eprint=
2022
-
[21]
CHASE-SQL: Multi-Path Reasoning and Preference Optimized Candidate Selection in Text-to-SQL , url =
Pourreza, Mohammadreza and Li, Hailong and Sun, Ruoxi and Chung, Yeounoh and Talaei, Shayan and Kakkar, Gaurav Tarlok and Gan, Yu and Saberi, Amin and Ozcan, Fatma and Arik, Sercan , booktitle =. CHASE-SQL: Multi-Path Reasoning and Preference Optimized Candidate Selection in Text-to-SQL , url =
-
[22]
Enhancing Text-to- SQL Parsing through Question Rewriting and Execution-Guided Refinement
Mao, Wenxin and Wang, Ruiqi and Guo, Jiyu and Zeng, Jichuan and Gao, Cuiyun and Han, Peiyi and Liu, Chuanyi. Enhancing Text-to- SQL Parsing through Question Rewriting and Execution-Guided Refinement. Findings of the Association for Computational Linguistics: ACL 2024. 2024. doi:10.18653/v1/2024.findings-acl.120
-
[23]
Proceedings of the AAAI Conference on Artificial Intelligence , author=
MAGIC: Generating Self-Correction Guideline for In-Context Text-to-SQL , volume=. Proceedings of the AAAI Conference on Artificial Intelligence , author=. 2025 , month=. doi:10.1609/aaai.v39i22.34511 , number=
-
[24]
Let ' s Sample Step by Step: Adaptive-Consistency for Efficient Reasoning and Coding with LLM s
Aggarwal, Pranjal and Madaan, Aman and Yang, Yiming and Mausam. Let ' s Sample Step by Step: Adaptive-Consistency for Efficient Reasoning and Coding with LLM s. Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. 2023. doi:10.18653/v1/2023.emnlp-main.761
-
[25]
Dong, Xuemei and Zhang, Chao and Ge, Yuhang and Mao, Yuren and Gao, Yunjun and Chen, lu and Lin, Jinshu and Lou, Dongfang , month = jul, year =. C3:. doi:10.48550/arXiv.2307.07306 , language =
-
[26]
doi: 10.18653/v1/2023.findings-acl.53
Gan, Yujian and Chen, Xinyun and Purver, Matthew. Re-appraising the Schema Linking for Text-to- SQL. Findings of the Association for Computational Linguistics: ACL 2023. 2023. doi:10.18653/v1/2023.findings-acl.53
-
[27]
Improving Retrieval-augmented Text-to- SQL with AST -based Ranking and Schema Pruning
Shen, Zhili and Vougiouklis, Pavlos and Diao, Chenxin and Vyas, Kaustubh and Ji, Yuanyi and Pan, Jeff Z. Improving Retrieval-augmented Text-to- SQL with AST -based Ranking and Schema Pruning. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing. 2024. doi:10.18653/v1/2024.emnlp-main.449
-
[28]
K aggle DBQA : Realistic Evaluation of Text-to- SQL Parsers
Lee, Chia-Hsuan and Polozov, Oleksandr and Richardson, Matthew. K aggle DBQA : Realistic Evaluation of Text-to- SQL Parsers. Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). 2021. doi:10.18653/v1/2021.acl-long.176
-
[29]
Can LLM Already Serve as A Database Interface? A BIg Bench for Large-Scale Database Grounded Text-to-SQLs , url =
Li, Jinyang and Hui, Binyuan and Qu, Ge and Yang, Jiaxi and Li, Binhua and Li, Bowen and Wang, Bailin and Qin, Bowen and Geng, Ruiying and Huo, Nan and Zhou, Xuanhe and Chenhao, Ma and Li, Guoliang and Chang, Kevin and Huang, Fei and Cheng, Reynold and Li, Yongbin , booktitle =. Can LLM Already Serve as A Database Interface? A BIg Bench for Large-Scale Da...
-
[30]
2025 , eprint=
DeepSeek-V3 Technical Report , author=. 2025 , eprint=
2025
-
[31]
Enhancing Text-to- SQL Capabilities of Large Language Models: A Study on Prompt Design Strategies
Nan, Linyong and Zhao, Yilun and Zou, Weijin and Ri, Narutatsu and Tae, Jaesung and Zhang, Ellen and Cohan, Arman and Radev, Dragomir. Enhancing Text-to- SQL Capabilities of Large Language Models: A Study on Prompt Design Strategies. Findings of the Association for Computational Linguistics: EMNLP 2023. 2023. doi:10.18653/v1/2023.findings-emnlp.996
-
[32]
2023 , eprint=
A comprehensive evaluation of ChatGPT's zero-shot Text-to-SQL capability , author=. 2023 , eprint=
2023
-
[33]
Decomposition for Enhancing Attention: Improving LLM -based Text-to- SQL through Workflow Paradigm
Xie, Yuanzhen and Jin, Xinzhou and Xie, Tao and Lin, Mingxiong and Chen, Liang and Yu, Chenyun and Cheng, Lei and Zhuo, Chengxiang and Hu, Bo and Li, Zang. Decomposition for Enhancing Attention: Improving LLM -based Text-to- SQL through Workflow Paradigm. Findings of the Association for Computational Linguistics: ACL 2024. 2024. doi:10.18653/v1/2024.findi...
-
[34]
Xie, Xiangjin and Xu, Guangwei and Zhao, Lingyan and Guo, Ruijie , title =. 2025 , issue_date =. doi:10.1145/3725331 , journal =
-
[35]
Chain-of-Thought Prompting Elicits Reasoning in Large Language Models , url =
Wei, Jason and Wang, Xuezhi and Schuurmans, Dale and Bosma, Maarten and ichter, brian and Xia, Fei and Chi, Ed and Le, Quoc V and Zhou, Denny , booktitle =. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models , url =
-
[36]
Liu , title =
Colin Raffel and Noam Shazeer and Adam Roberts and Katherine Lee and Sharan Narang and Michael Matena and Yanqi Zhou and Wei Li and Peter J. Liu , title =. Journal of Machine Learning Research , year =
-
[37]
Lewis, Mike and Liu, Yinhan and Goyal, Naman and Ghazvininejad, Marjan and Mohamed, Abdelrahman and Levy, Omer and Stoyanov, Veselin and Zettlemoyer, Luke. BART : Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension. Proceedings of the 58th Annual Meeting of the Association for Computational Linguisti...
-
[38]
2021 , eprint=
Evaluating Large Language Models Trained on Code , author=. 2021 , eprint=
2021
-
[39]
2026 , howpublished =
GPT-4o API Documentation , author =. 2026 , howpublished =
2026
-
[40]
2022 , url =
OpenAI , title =. 2022 , url =
2022
-
[41]
ROUTE: Robust Multitask Tuning and Collaboration for Text-to-SQL , url =
Qin, Yang and Chen, Chao and Fu, Zhihang and Chen, Ze and Peng, Dezhong and Hu, Peng and Ye, Jieping , booktitle =. ROUTE: Robust Multitask Tuning and Collaboration for Text-to-SQL , url =
-
[42]
2025 , eprint=
Alpha-SQL: Zero-Shot Text-to-SQL using Monte Carlo Tree Search , author=. 2025 , eprint=
2025
-
[43]
SAFE - SQL : Self-Augmented In-Context Learning with Fine-grained Example Selection for Text-to- SQL
Lee, Jimin and Baek, Ingeol and Kim, Byeongjeong and Bae, Hyunkyung and Lee, Hwanhee. SAFE - SQL : Self-Augmented In-Context Learning with Fine-grained Example Selection for Text-to- SQL. Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing. 2025. doi:10.18653/v1/2025.emnlp-main.962
-
[44]
2025 , eprint=
Qwen3 Technical Report , author=. 2025 , eprint=
2025
-
[45]
2024 , eprint=
Qwen2.5-Coder Technical Report , author=. 2024 , eprint=
2024
-
[46]
2025 , eprint=
Qwen2.5 Technical Report , author=. 2025 , eprint=
2025
-
[47]
2026 , eprint=
GLM-5: from Vibe Coding to Agentic Engineering , author=. 2026 , eprint=
2026
-
[48]
2026 , eprint=
Kimi K2: Open Agentic Intelligence , author=. 2026 , eprint=
2026
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.