arXiv: 2605.21792 · v1 · pith:ETYAOLTQnew · submitted 2026-05-20 · 💻 cs.CL · cs.AI · cs.DB · cs.LG

Residual Skill Optimization for Text-to-SQL Ensembles

Pith reviewed 2026-05-22 08:35 UTC · model grok-4.3

classification 💻 cs.CL cs.AI cs.DB cs.LG
keywords Text-to-SQL, ensemble methods, residual optimization, Pass@K, complementary skills, SQL generation, agentic AI

The pith

DivSkill-SQL builds complementary Text-to-SQL ensembles by optimizing each new skill on the failures of the current ensemble, improving accuracy without fine-tuning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Text-to-SQL ensembles generate multiple query candidates and pick one, yet their performance is capped by Pass@K, the probability that at least one of K candidates is correct, and correlated errors keep that ceiling low. DivSkill-SQL introduces residual skill optimization: each new skill is optimized on exactly the cases the current ensemble fails on, so that every addition is aimed at raising Pass@K. On the Spider2-Lite benchmark the approach lifts selected accuracy by as much as 11.1 points on Snowflake and 8.3 points on BigQuery over the strongest ensemble baseline. The skills also transfer across SQL dialects and to a related task, BIRD-Critic, while reducing specific error types such as hallucinated schema references and function calls.

Core claim

By framing skill addition as residual optimization on ensemble failures, DivSkill-SQL produces agentic skills whose collective coverage provably raises the upper bound on correct selection in Text-to-SQL generation, achieving measurable gains across models and dialects without any parameter updates to the underlying language models.

What carries the argument

Residual skill optimization, which optimizes each successive skill exclusively on the examples the preceding ensemble fails on, so as to maximize that skill's marginal contribution to Pass@K. No model parameters are updated at any step.
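The residual loop can be illustrated with a minimal Python sketch. It is not the authors' implementation: `build_residual_ensemble`, `optimize_skill`, and `toy_optimizer` are hypothetical names, and each skill is reduced to the set of example ids it answers correctly, whereas the real system would obtain that set by executing generated SQL.

```python
from typing import Callable, List, Set


def build_residual_ensemble(
    example_ids: Set[int],
    optimize_skill: Callable[[Set[int]], Set[int]],
    max_skills: int,
) -> List[Set[int]]:
    """Greedy residual loop: every new skill is optimized on current failures.

    `optimize_skill(residual)` is a stand-in for the skill optimizer. It
    returns the ids of the examples that the newly optimized skill solves.
    """
    solved_by_skill: List[Set[int]] = []
    residual = set(example_ids)  # examples no skill has solved yet
    for _ in range(max_skills):
        if not residual:
            break  # nothing left to target
        solved = optimize_skill(residual) & example_ids
        solved_by_skill.append(solved)
        residual -= solved  # the next skill only sees what is still unsolved
    return solved_by_skill


def toy_optimizer(residual: Set[int]) -> Set[int]:
    """Hypothetical optimizer: solves the lower half of the residual ids."""
    ordered = sorted(residual)
    return set(ordered[: max(1, len(ordered) // 2)])


skills = build_residual_ensemble(set(range(8)), toy_optimizer, max_skills=3)
covered = set().union(*skills)
print(skills, len(covered) / 8)
```

In this toy run each skill halves the residual, so coverage grows with every addition. Whether a real optimizer behaves this way on held-out questions is exactly the premise the review goes on to question.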

If this is right

  • Accuracy gains hold across different base models such as Opus-4.6 and GPT-5.4.
  • Skills optimized on a single dialect transfer without retraining across Snowflake, BigQuery, and SQLite.
  • Performance improvements extend to BIRD-Critic with a 2.6 point gain.
  • Diagnostics reveal up to three times fewer hallucinated schema references and function calls.
  • The method requires no fine-tuning of the base models.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar residual optimization could apply to other ensemble-based generation tasks facing diversity issues.
  • The focus on failure cases may yield more reliable diversity than stochastic sampling or prompt variations.
  • Automatic selection of skill count could be based on diminishing marginal returns in Pass@K.
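The last extension, choosing the skill count from diminishing returns, could look like the following sketch. It is an editorial illustration only: `choose_skill_count` and the `min_gain` threshold are hypothetical, and the per-skill solved sets are toy values, not results from the paper.

```python
from typing import List, Set


def choose_skill_count(
    solved_by_skill: List[Set[int]], n_examples: int, min_gain: float
) -> int:
    """Keep adding skills, in order, until the marginal coverage gain of the
    next skill drops below `min_gain` (a hypothetical threshold)."""
    covered: Set[int] = set()
    for k, solved in enumerate(solved_by_skill):
        gain = len(solved - covered) / n_examples  # newly covered fraction
        if gain < min_gain:
            return k  # stop before the skill that adds too little
        covered |= solved
    return len(solved_by_skill)


# Toy solved sets on a 10-example validation split (illustrative only).
toy_skills = [{0, 1, 2, 3}, {3, 4, 5}, {5, 6}, {6}]
print(choose_skill_count(toy_skills, n_examples=10, min_gain=0.15))
```

Here the marginal gains are 0.4, 0.2, and 0.1, so the rule keeps the first two skills. The threshold would need to be set on validation data and weighed against the cost of running each extra agent.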

Load-bearing premise

That skills trained only on the ensemble's current failure cases will generate genuinely complementary candidates rather than new sets of correlated mistakes.

What would settle it

A direct check is whether the set of questions answered correctly by at least one skill in the full ensemble exceeds the baseline ensemble's set by the claimed margin on a new test split. If that coverage, or the selected accuracy, plateaus as skills are added, the added skills are repeating correlated mistakes rather than complementing one another.
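This check can be sketched in a few lines of Python, assuming per-member correctness has already been recorded as sets of question ids. The function name `coverage_curve` and all values below are hypothetical placeholders, not numbers from the paper.

```python
from typing import List, Set


def coverage_curve(solved_by_member: List[Set[int]], n_examples: int) -> List[float]:
    """Fraction of examples solved by at least one of the first k members,
    for k = 1..len(solved_by_member). The last value is the oracle Pass@K."""
    covered: Set[int] = set()
    curve: List[float] = []
    for solved in solved_by_member:
        covered |= solved
        curve.append(len(covered) / n_examples)
    return curve


# Placeholder correctness sets on a 10-question split (illustrative only).
baseline = [{0, 1, 2}, {1, 2, 3}, {0, 2, 3}]  # heavily overlapping members
residual = [{0, 1, 2}, {3, 4, 5}, {6, 7}]     # complementary members

base_curve = coverage_curve(baseline, 10)
res_curve = coverage_curve(residual, 10)
print(base_curve, res_curve, res_curve[-1] - base_curve[-1])
```

A flat stretch in the curve, as in the baseline after its second member, is the plateau signature described above. Coverage is an upper bound on selected accuracy, so the selector still has to pick the right candidate.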

Figures

Figures reproduced from arXiv: 2605.21792 by Babak Salimi, Canwen Xu, Haoquan Guan, Jiongli Zhu, Nikki Lijing Kuang, Parjanya Prajakta Prashant, Seyedeh Baharan Khatami, Xiaodong Yu, Yingyu Lin, Yuxiong He, Zhewei Yao.

Figure 1: System diagram of DIVSKILL-SQL. The left panel shows skill construction: starting from diverse strategy prompts, the system repeatedly identifies unsolved questions and refines the next skill toward those remaining cases. The right panel shows test-time execution: multiple skill-guided agents solve the same Text-to-SQL problem through different interaction patterns, producing SQL candidates that are then c…

Figure 2: Pass@k comparison between DIVSKILL-SQL and its variants on 100-instance subsets of a) Spider2-lite and b) Bird-Critic.

… structure. These results indicate that residual skills improve diversity in a more controlled way. Rather than relying on high-temperature perturbations that can introduce hallucinated references or unstable SQL structures, DIVSKILL-SQL induces different agent behaviors while preserving ca…
Original abstract

Text-to-SQL ensembles improve over single-candidate generation by drawing multiple SQL candidates and selecting one, but their effectiveness is bounded by Pass@K, the probability that at least one of K candidates is correct. Existing methods source diversity heuristically through stochastic decoding or prompt variants, leaving candidate sets dominated by correlated failures. We present DivSkill-SQL, a residual skill optimization framework that builds complementary agentic Text-to-SQL ensembles without model fine-tuning: each new skill is optimized on examples the current skill ensemble fails on, provably targeting its marginal contribution to Pass@K. On Spider2-Lite, DivSkill-SQL improves selected accuracy by up to +11.1 points on Snowflake and +8.3 on BigQuery over the strongest ensemble baseline, with consistent gains across two base models (Opus-4.6 and GPT-5.4). Skills optimized on a single dialect transfer without retraining across dialects (Snowflake, BigQuery, SQLite) and to a different task formulation, such as BIRD-Critic (+2.6 pts). Error diagnostics show up to 3x fewer hallucinated schema references and function calls, indicating that gains come from genuinely reliable complementary skills rather than surface-form variation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper introduces DivSkill-SQL, a residual skill optimization framework for Text-to-SQL ensembles that constructs complementary agentic skills by optimizing each new skill exclusively on the failure cases of the current ensemble. It claims this approach provably targets the marginal contribution to Pass@K without model fine-tuning. On Spider2-Lite, it reports selected accuracy gains of up to +11.1 points on Snowflake and +8.3 on BigQuery over strong baselines, with consistent results across Opus-4.6 and GPT-5.4, transfer to other dialects and BIRD-Critic (+2.6 pts), and up to 3x fewer hallucinated schema references and function calls.

Significance. If the empirical gains are reproducible and the complementary coverage is verified beyond final accuracy, the work offers a practical method for improving ensemble diversity in semantic parsing tasks. The no-fine-tuning constraint and cross-dialect transfer are notable strengths for real-world Text-to-SQL deployment.

major comments (3)
  1. [Abstract, §3] Abstract and §3: The claim that each skill 'provably targets its marginal contribution to Pass@K' lacks any derivation, proof sketch, or formal argument in the manuscript. The optimization is described as training on current ensemble failures, which risks making the targeting tautological rather than provable; a concrete mathematical justification or counterexample analysis is needed to support this central assertion.
  2. [§4] §4 (Experiments on Spider2-Lite): The reported accuracy lifts (+11.1 on Snowflake, +8.3 on BigQuery) are presented without error bars, standard deviations, or details on the number of independent runs or random seeds. This undermines assessment of whether the gains are statistically reliable or sensitive to optimization stochasticity.
  3. [§5] §5 (Error diagnostics): The manuscript reports up to 3x fewer hallucinations but provides no explicit inter-skill error-correlation metric, pairwise failure overlap, or diversity statistic (e.g., Jaccard index on error sets across skills). Without this, it is unclear whether gains arise from reduced correlated errors or simply from higher per-skill reliability on the same base models.
minor comments (1)
  1. [Abstract] The abstract introduces 'DivSkill-SQL' and 'residual skill optimization' without a one-sentence definition; adding a brief parenthetical expansion would improve immediate readability for readers outside the subfield.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive comments. We address each major point below and will revise the manuscript to incorporate the suggested improvements.

  1. Referee: [Abstract, §3] Abstract and §3: The claim that each skill 'provably targets its marginal contribution to Pass@K' lacks any derivation, proof sketch, or formal argument in the manuscript. The optimization is described as training on current ensemble failures, which risks making the targeting tautological rather than provable; a concrete mathematical justification or counterexample analysis is needed to support this central assertion.

    Authors: We agree that an explicit formal argument strengthens the central claim. The residual optimization targets marginal contribution because any correct SQL produced on previously uncovered failures directly raises Pass@K for those instances. In the revised manuscript we will add a short proof sketch in §3: let E_k be the set of examples failed by the first k skills; the marginal gain from skill k+1 is P(skill k+1 succeeds on E_k). Optimizing exclusively on E_k maximizes the expected size of newly covered examples under a fixed per-skill success rate on the residual distribution. We will contrast this with non-residual addition (which can redundantly cover already-solved examples) and include a small illustrative counterexample showing the coverage difference. revision: yes

  2. Referee: [§4] §4 (Experiments on Spider2-Lite): The reported accuracy lifts (+11.1 on Snowflake, +8.3 on BigQuery) are presented without error bars, standard deviations, or details on the number of independent runs or random seeds. This undermines assessment of whether the gains are statistically reliable or sensitive to optimization stochasticity.

    Authors: We accept this criticism. In the revised §4 we will rerun the full DivSkill-SQL pipeline with three distinct random seeds for both the skill optimization and the final selection step, reporting mean selected accuracy and standard deviation for the Snowflake and BigQuery gains. This will quantify variability due to stochasticity in the residual training process. revision: yes

  3. Referee: [§5] §5 (Error diagnostics): The manuscript reports up to 3x fewer hallucinations but provides no explicit inter-skill error-correlation metric, pairwise failure overlap, or diversity statistic (e.g., Jaccard index on error sets across skills). Without this, it is unclear whether gains arise from reduced correlated errors or simply from higher per-skill reliability on the same base models.

    Authors: We will strengthen the diagnostics. In the revised §5 we will add pairwise Jaccard indices computed on the per-skill failure sets (instances where each skill produces an incorrect SQL) and report the average inter-skill error overlap. We will also compute the correlation of hallucination types (schema references and function calls) across skills. These metrics will be compared against the baseline ensemble to show that residual optimization yields measurably lower error correlation, supporting the claim of complementary rather than merely stronger individual skills. revision: yes
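The reporting promised in responses 2 and 3 can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' evaluation code: `summarize_runs` and `mean_pairwise_jaccard` are hypothetical helpers, and every number is a placeholder rather than a result from the paper.

```python
import statistics
from itertools import combinations
from typing import Dict, List, Set, Tuple


def summarize_runs(acc_by_seed: Dict[int, float]) -> Tuple[float, float]:
    """Mean and sample standard deviation of accuracy across random seeds."""
    values = list(acc_by_seed.values())
    mean = statistics.mean(values)
    # A sample standard deviation needs at least two runs.
    std = statistics.stdev(values) if len(values) > 1 else 0.0
    return mean, std


def jaccard(a: Set[int], b: Set[int]) -> float:
    """Jaccard index |a & b| / |a | b|. Two empty sets score 0.0 here,
    which is a convention chosen for this sketch."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def mean_pairwise_jaccard(failure_sets: List[Set[int]]) -> float:
    """Average Jaccard overlap of failure sets over all skill pairs.
    Lower values mean the skills fail on different questions."""
    pairs = list(combinations(failure_sets, 2))
    if not pairs:
        return 0.0
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


# Placeholder inputs (illustrative only).
mean_acc, std_acc = summarize_runs({0: 0.5, 1: 0.6, 2: 0.7})
overlap = mean_pairwise_jaccard([{1, 2, 3}, {2, 3, 4}, {5}])
print(round(mean_acc, 3), round(std_acc, 3), round(overlap, 3))
```

Computing the same overlap statistic for the baseline ensemble gives the comparison the referee asks for: a lower mean overlap for the residual ensemble would point to less correlated errors, not just stronger individual skills.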

Circularity Check

1 step flagged

Residual optimization on failures makes marginal Pass@K targeting definitional by construction

specific steps
  1. self-definitional [Abstract]
    "each new skill is optimized on examples the current skill ensemble fails on, provably targeting its marginal contribution to Pass@K"

    Optimizing exclusively on the ensemble's failure cases means, by construction, that whatever the new skill solves there lies where prior skills fail, so it raises Pass@K by exactly the fraction of those cases it solves. The 'provably targeting marginal contribution' assertion is therefore equivalent to the residual selection rule itself rather than a derived or independent result.

full rationale

The paper's central methodological claim is that residual skill optimization on current-ensemble failures 'provably targets its marginal contribution to Pass@K'. This reduces directly to the definition of the optimization target: by selecting training examples where all prior skills fail, any skill that succeeds on them necessarily covers exactly the marginal cases that increase Pass@K. No independent derivation, uniqueness theorem, or external benchmark is shown to establish the 'provable' aspect beyond this construction. Empirical gains and error diagnostics are reported separately and do not rescue the load-bearing justification step. This matches partial circularity (score 6) while leaving the reported accuracy numbers as non-circular observations.
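The reduction described above can be written out as a short identity. This is an editorial sketch, not a proof taken from the paper. It assumes each skill $s_i$ is deterministic, with $c_i(x) \in \{0, 1\}$ indicating whether skill $i$ produces a correct query for question $x$ drawn from a distribution $\mathcal{D}$. The symbols $c_i$ and $\mathcal{D}$ are introduced here for illustration, while $E_k$ follows the notation of the simulated rebuttal.

```latex
% Assumption: deterministic skills, one candidate per skill.
\begin{align*}
\mathrm{Pass@}k
  &= \Pr_{x \sim \mathcal{D}}\bigl[\exists\, i \le k : c_i(x) = 1\bigr], \\
E_k
  &= \{\, x : c_i(x) = 0 \ \text{for all } i \le k \,\}, \\
\mathrm{Pass@}(k{+}1) - \mathrm{Pass@}k
  &= \Pr_{x \sim \mathcal{D}}\bigl[x \in E_k,\ c_{k+1}(x) = 1\bigr] \\
  &= \underbrace{\Pr[x \in E_k]}_{=\, 1 - \mathrm{Pass@}k}
     \;\Pr\bigl[c_{k+1}(x) = 1 \mid x \in E_k\bigr].
\end{align*}
```

Once the first $k$ skills are fixed, the first factor is a constant, so maximizing the marginal gain is the same as maximizing accuracy on the residual set $E_k$. The identity holds for any choice of skill $k+1$. It does not show that the optimizer finds a skill with high residual accuracy, nor that accuracy on training failures carries over to held-out failures.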

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

Framework rests on the assumption that failure-targeted optimization yields independent skills; no explicit free parameters or invented physical entities are described.

axioms (1)
  • [domain assumption] Ensemble effectiveness is bounded by Pass@K
    Stated directly as the limiting factor for current ensembles.
invented entities (1)
  • DivSkill-SQL residual skill [no independent evidence]
    purpose: Complementary agentic skill added to ensemble
    Newly introduced optimization target; no independent evidence supplied in abstract.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

    Relation between the paper passage and the cited Recognition theorem: unclear.

    Paper passage: DIVSKILL-SQL, a residual skill optimization framework that builds complementary agentic Text-to-SQL ensembles without model fine-tuning: each new skill is optimized on examples the current skill ensemble fails on, provably targeting its marginal contribution to Pass@K

What do these tags mean?

  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

119 extracted references · 119 canonical work pages · 11 internal anchors

  1. [1]

    GEPA: Reflective Prompt Evolution Can Outperform Reinforcement Learning

    Lakshya A Agrawal, Shangyin Tan, Dilara Soylu, Noah Ziems, Rishi Khare, Krista Opsahl-Ong, Arnav Singhvi, Herumb Shandilya, Michael J Ryan, Meng Jiang, et al. Gepa: Reflective prompt evolution can outperform reinforcement learning.arXiv preprint arXiv:2507.19457, 2025

  2. [2]

    Introducing Claude Opus 4.6

    Anthropic. Introducing Claude Opus 4.6. https://www.anthropic.com/news/claude-opus-4-6 , February 2026

  3. [3]

    Language models are few-shot learners.Advances in neural information processing systems, 33:1877–1901, 2020

    Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Nee- lakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners.Advances in neural information processing systems, 33:1877–1901, 2020

  4. [4]

    Reliable text-to-sql with adaptive abstention.Proc

    Kaiwen Chen, Yueting Chen, Nick Koudas, and Xiaohui Yu. Reliable text-to-sql with adaptive abstention.Proc. ACM Manag. Data, 3(1), February 2025. doi: 10.1145/3709719. URLhttps://doi.org/10.1145/3709719

  5. [5]

    Evaluating Large Language Models Trained on Code

    Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde De Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374, 2021

  6. [6]

    Pass@ k training for adaptively balancing exploration and exploitation of large reasoning models.arXiv preprint arXiv:2508.10751, 2025

    Zhipeng Chen, Xiaobo Qin, Youbin Wu, Yue Ling, Qinghao Ye, Wayne Xin Zhao, and Guang Shi. Pass@ k training for adaptively balancing exploration and exploitation of large reasoning models.arXiv preprint arXiv:2508.10751, 2025

  7. [7]

    Training Verifiers to Solve Math Word Problems

    Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training verifiers to solve math word problems.arXiv preprint arXiv:2110.14168, 2021. 10 Residual Skill Optimization for Text-to-SQL EnsemblesA PREPRINT

  8. [8]

    PARSQL: Enhancing text-to-SQL through SQL parsing and reasoning

    Yaxun Dai, Haiqin Yang, Mou Hao, and Pingfu Chao. PARSQL: Enhancing text-to-SQL through SQL parsing and reasoning. In Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pilehvar, editors, Findings of the Association for Computational Linguistics: ACL 2025, pages 661–681, Vienna, Austria, July 2025. Association for Computational Linguistics...

  9. [9]

    Reforce: A text-to-sql agent with self-refinement, format restriction, and column exploration

    Minghang Deng, Ashwin Ramachandran, Canwen Xu, Lanxiang Hu, Zhewei Yao, Anupam Datta, and Hao Zhang. Reforce: A text-to-sql agent with self-refinement, format restriction, and column exploration. InICLR 2025 Workshop: VerifAI: AI Verification in the Wild, 2025

  10. [10]

    C3: Zero-shot text-to-sql with chatgpt,

    Xuemei Dong, Chao Zhang, Yuhang Ge, Yuren Mao, Yunjun Gao, Jinshu Lin, Dongfang Lou, et al. C3: Zero-shot text-to-sql with chatgpt.arXiv preprint arXiv:2307.07306, 2023

  11. [11]

    Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution

    Chrisantha Fernando, Dylan Banarse, Henryk Michalewski, Simon Osindero, and Tim Rocktäschel. Promptbreeder: Self-referential self-improvement via prompt evolution.arXiv preprint arXiv:2309.16797, 2023

  12. [12]

    Yingqi Gao, Yifu Liu, Xiaoxia Li, Xiaorong Shi, Yin Zhu, Yiming Wang, Shiqi Li, Wei Li, Yun- tao Hong, Zhiling Luo, Jinyang Gao, Liyu Mou, and Yu Li

    Dawei Gao, Haibin Wang, Yaliang Li, Xiuyu Sun, Yichen Qian, Bolin Ding, and Jingren Zhou. Text-to-sql empowered by large language models: A benchmark evaluation.arXiv preprint arXiv:2308.15363, 2023

  13. [13]

    SQLForge: Synthesizing reliable and diverse data to enhance text-to-SQL reasoning in LLMs

    Yu Guo, Dong Jin, Shenghao Ye, Shuangwu Chen, Jian Yang, and Xiaobin Tan. SQLForge: Synthesizing reliable and diverse data to enhance text-to-SQL reasoning in LLMs. In Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pilehvar, editors,Findings of the Association for Computational Linguistics: ACL 2025, pages 8441–8452, Vienna, Austria, J...

  14. [14]

    URLhttps://aclanthology.org/2025.findings-acl.443/

    doi: 10.18653/v1/2025.findings-acl.443. URLhttps://aclanthology.org/2025.findings-acl.443/

  15. [15]

    V-star: Training verifiers for self-taught reasoners, 2024

    Arian Hosseini, Xingdi Yuan, Nikolay Malkin, Aaron Courville, Alessandro Sordoni, and Rishabh Agarwal. V-star: Training verifiers for self-taught reasoners.arXiv preprint arXiv:2402.06457, 2024

  16. [16]

    MCS-SQL: Leveraging multiple prompts and multiple-choice selection for text-to-SQL generation

    Dongjun Lee, Choongwon Park, Jaehyuk Kim, and Heesoo Park. MCS-SQL: Leveraging multiple prompts and multiple-choice selection for text-to-SQL generation. In Owen Rambow, Leo Wanner, Marianna Apidianaki, Hend Al-Khalifa, Barbara Di Eugenio, and Steven Schockaert, editors,Proceedings of the 31st International Conference on Computational Linguistics, pages 3...

  17. [17]

    Spider 2.0: Evaluating language models on real-world enterprise text-to-sql workflows.arXiv preprint arXiv:2411.07763, 2024

    Fangyu Lei, Jixuan Chen, Yuxiao Ye, Ruisheng Cao, Dongchan Shin, Hongjin Su, Zhaoqing Suo, Hongcheng Gao, Wenjing Hu, Pengcheng Yin, et al. Spider 2.0: Evaluating language models on real-world enterprise text-to-sql workflows.arXiv preprint arXiv:2411.07763, 2024

  18. [18]

    The dawn of natural language to sql: Are we fully ready?Proc

    Boyan Li, Yuyu Luo, Chengliang Chai, Guoliang Li, and Nan Tang. The dawn of natural language to sql: Are we fully ready?Proc. VLDB Endow., 17(11):3318–3331, July 2024. ISSN 2150-8097. doi: 10.14778/3681954. 3682003. URLhttps://doi.org/10.14778/3681954.3682003

  19. [19]

    Codes: Towards building open-source language models for text-to-sql.Proc

    Haoyang Li, Jing Zhang, Hanbing Liu, Ju Fan, Xiaokang Zhang, Jun Zhu, Renjie Wei, Hongyan Pan, Cuiping Li, and Hong Chen. Codes: Towards building open-source language models for text-to-sql.Proc. ACM Manag. Data, 2(3), May 2024. doi: 10.1145/3654930. URLhttps://doi.org/10.1145/3654930

  20. [20]

    Can llm already serve as a database interface? a big bench for large-scale database grounded text-to-sqls.Advances in Neural Information Processing Systems, 36, 2024

    Jinyang Li, Binyuan Hui, Ge Qu, Jiaxi Yang, Binhua Li, Bowen Li, Bailin Wang, Bowen Qin, Ruiying Geng, Nan Huo, et al. Can llm already serve as a database interface? a big bench for large-scale database grounded text-to-sqls.Advances in Neural Information Processing Systems, 36, 2024

  21. [21]

    Swe-sql: Illuminating llm pathways to solve user sql issues in real-world applications.arXiv preprint arXiv:2506.18951, 2025

    Jinyang Li, Xiaolong Li, Ge Qu, Per Jacobsson, Bowen Qin, Binyuan Hui, Shuzheng Si, Nan Huo, Xiaohan Xu, Yue Zhang, et al. Swe-sql: Illuminating llm pathways to solve user sql issues in real-world applications.arXiv preprint arXiv:2506.18951, 2025

  22. [22]

    Let’s verify step by step

    Hunter Lightman, Vineet Kosaraju, Yuri Burda, Harrison Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let’s verify step by step. InThe twelfth international conference on learning representations, 2023

  23. [23]

    Xiyan-sql: A novel multi-generator framework for text-to-sql.IEEE Transactions on Knowledge and Data Engineering, 2026

    Yifu Liu, Yin Zhu, Yingqi Gao, Zhiling Luo, Xiaoxia Li, Xiaorong Shi, Yuntao Hong, Jinyang Gao, Yu Li, Bolin Ding, et al. Xiyan-sql: A novel multi-generator framework for text-to-sql.IEEE Transactions on Knowledge and Data Engineering, 2026. 11 Residual Skill Optimization for Text-to-SQL EnsemblesA PREPRINT

  24. [24]

    SQL-r1: Training natural language to SQL reasoning model by reinforcement learning

    Peixian MA, Xialie Zhuang, Chengjin Xu, Xuhui Jiang, Ran Chen, and Jian Guo. SQL-r1: Training natural language to SQL reasoning model by reinforcement learning. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2026. URLhttps://openreview.net/forum?id=hgJQcuDwm1

  25. [25]

    SkillClaw: Let Skills Evolve Collectively with Agentic Evolver

    Ziyu Ma, Shidong Yang, Yuxiang Ji, Xucong Wang, Yong Wang, Yiming Hu, Tongwen Huang, and Xiangxiang Chu. Skillclaw: Let skills evolve collectively with agentic evolver.arXiv preprint arXiv:2604.08377, 2026

  26. [26]

    An analysis of approximations for maximizing submodular set functions—i.Mathematical programming, 14(1):265–294, 1978

    George L Nemhauser, Laurence A Wolsey, and Marshall L Fisher. An analysis of approximations for maximizing submodular set functions—i.Mathematical programming, 14(1):265–294, 1978

  27. [27]

    Lever: Learning to verify language-to-code generation with execution

    Ansong Ni, Srini Iyer, Dragomir Radev, Veselin Stoyanov, Wen-tau Yih, Sida Wang, and Xi Victoria Lin. Lever: Learning to verify language-to-code generation with execution. InInternational Conference on Machine Learning, pages 26106–26128. PMLR, 2023

  28. [28]

    Introducing GPT-5.4.https://openai.com/index/introducing-gpt-5-4/, March 2026

    OpenAI. Introducing GPT-5.4.https://openai.com/index/introducing-gpt-5-4/, March 2026

  29. [29]

    Boosted prompt ensembles for large language models.arXiv preprint arXiv:2304.05970, 2023

    Silviu Pitis, Michael R Zhang, Andrew Wang, and Jimmy Ba. Boosted prompt ensembles for large language models.arXiv preprint arXiv:2304.05970, 2023

  30. [30]

    Din-sql: Decomposed in-context learning of text-to-sql with self-correction.Advances in neural information processing systems, 36:36339–36348, 2023

    Mohammadreza Pourreza and Davood Rafiei. Din-sql: Decomposed in-context learning of text-to-sql with self-correction.Advances in neural information processing systems, 36:36339–36348, 2023

  31. [31]

    Evaluating cross-domain text-to-sql models and benchmarks

    Mohammadreza Pourreza and Davood Rafiei. Evaluating cross-domain text-to-sql models and benchmarks. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 1601–1611, 2023

  32. [32]

    Chase-sql: Multi-path reasoning and preference optimized candidate selection in text-to-sql.arXiv preprint arXiv:2410.01943, 2024

    Mohammadreza Pourreza, Hailong Li, Ruoxi Sun, Yeounoh Chung, Shayan Talaei, Gaurav Tarlok Kakkar, Yu Gan, Amin Saberi, Fatma Ozcan, and Sercan O Arik. Chase-sql: Multi-path reasoning and preference optimized candidate selection in text-to-sql.arXiv preprint arXiv:2410.01943, 2024

  33. [33]

    InFindings of the Associ- ation for Computational Linguistics: EMNLP 2024, pages 8212–8220, Miami, Florida, USA

    Mohammadreza Pourreza, Shayan Talaei, Ruoxi Sun, Xingchen Wan, Hailong Li, Azalia Mirhoseini, Amin Saberi, Sercan Arik, et al. Reasoning-sql: Reinforcement learning with sql tailored partial rewards for reasoning-enhanced text-to-sql.arXiv preprint arXiv:2503.23157, 2025

  34. [34]

    Autoprompt: Eliciting knowledge from language models with automatically generated prompts

    Taylor Shin, Yasaman Razeghi, Robert L Logan IV , Eric Wallace, and Sameer Singh. Autoprompt: Eliciting knowledge from language models with automatically generated prompts. InProceedings of the 2020 conference on empirical methods in natural language processing (EMNLP), pages 4222–4235, 2020

  35. [35]

    Exploring chain of thought style prompting for text-to-sql

    Chang-Yu Tai, Ziru Chen, Tianshu Zhang, Xiang Deng, and Huan Sun. Exploring chain of thought style prompting for text-to-sql. InProceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 5376–5393, 2023

  36. [36]

    CHESS: Contextual Harnessing for Efficient SQL Synthesis

    Shayan Talaei, Mohammadreza Pourreza, Yu-Chen Chang, Azalia Mirhoseini, and Amin Saberi. Chess: Contextual harnessing for efficient sql synthesis.arXiv preprint arXiv:2405.16755, 2024

  37. [37]

    Pass@ k policy optimization: Solving harder reinforcement learning problems.arXiv preprint arXiv:2505.15201, 2025

    Christian Walder and Deep Karkhanis. Pass@ k policy optimization: Solving harder reinforcement learning problems.arXiv preprint arXiv:2505.15201, 2025

  38. [38]

    Mac-sql: A multi-agent collaborative framework for text-to-sql

    Bing Wang, Changyu Ren, Jian Yang, Xinnian Liang, Jiaqi Bai, Linzheng Chai, Zhao Yan, Qian-Wen Zhang, Di Yin, Xing Sun, et al. Mac-sql: A multi-agent collaborative framework for text-to-sql. InProceedings of the 31st International Conference on Computational Linguistics, pages 540–557, 2025

  39. [39]

    SkillRL: Evolving Agents via Recursive Skill-Augmented Reinforcement Learning

    Peng Xia, Jianwen Chen, Hanyang Wang, Jiaqi Liu, Kaide Zeng, Yu Wang, Siwei Han, Yiyang Zhou, Xujiang Zhao, Haifeng Chen, et al. Skillrl: Evolving agents via recursive skill-augmented reinforcement learning.arXiv preprint arXiv:2602.08234, 2026

  40. [40]

    Decomposition for enhancing attention: Improving LLM-based text-to-SQL through workflow paradigm

    Yuanzhen Xie, Xinzhou Jin, Tao Xie, Mingxiong Lin, Liang Chen, Chenyun Yu, Lei Cheng, Chengxiang Zhuo, Bo Hu, and Zang Li. Decomposition for enhancing attention: Improving LLM-based text-to-SQL through workflow paradigm. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar, editors,Findings of the Association for Computational Linguistics: ACL 2024, pages 107...

  41. [41]

    Large language models as optimizers

    Chengrun Yang, Xuezhi Wang, Yifeng Lu, Hanxiao Liu, Quoc V Le, Denny Zhou, and Xinyun Chen. Large language models as optimizers. InThe Twelfth International Conference on Learning Representations, 2023. 12 Residual Skill Optimization for Text-to-SQL EnsemblesA PREPRINT

  42. [42]

    MARS-SQL: A multi-agent reinforcement learning framework for Text-to-SQL

    Haolin Yang, Jipeng Zhang, Zhitao He, and Yi R Fung. Mars-sql: A multi-agent reinforcement learning framework for text-to-sql.arXiv preprint arXiv:2511.01008, 2025

  43. [43]

    Synthesizing text-to-SQL data from weak and strong LLMs

    Jiaxi Yang, Binyuan Hui, Min Yang, Jian Yang, Junyang Lin, and Chang Zhou. Synthesizing text-to-SQL data from weak and strong LLMs. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar, editors,Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 7864–7875, Bangkok, Thailand, August 2024. Assoc...

  44. [44]

    Diversity-aware policy optimization for large language model reasoning.arXiv preprint arXiv:2505.23433, 2025

    Jian Yao, Ran Cheng, Xingyu Wu, Jibin Wu, and Kay Chen Tan. Diversity-aware policy optimization for large language model reasoning.arXiv preprint arXiv:2505.23433, 2025

  45. [45]

    Spider: A large-scale human-labeled dataset for complex and cross-domain semantic parsing and text-to-sql task

    Tao Yu, Rui Zhang, Kai Yang, Michihiro Yasunaga, Dongxu Wang, Zifan Li, James Ma, Irene Li, Qingning Yao, Shanelle Roman, et al. Spider: A large-scale human-labeled dataset for complex and cross-domain semantic parsing and text-to-sql task. InProceedings of the 2018 conference on empirical methods in natural language processing, pages 3911–3921, 2018

  46. [46]

    Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?

    Yang Yue, Zhiqi Chen, Rui Lu, Andrew Zhao, Zhaokai Wang, Shiji Song, and Gao Huang. Does reinforcement learning really incentivize reasoning capacity in llms beyond the base model?arXiv preprint arXiv:2504.13837, 2025

  47. [47]

    Optimizing reasoning for text-to-SQL with execution feedback

    Bohan Zhai, Canwen Xu, Yuxiong He, and Zhewei Yao. Optimizing reasoning for text-to-SQL with execution feedback. In Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pilehvar, editors,Findings of the Association for Computational Linguistics: ACL 2025, pages 19206–19218, Vienna, Austria, July 2025. Association for Computational Linguistic...

  48. [48]

    Equipping agents for the real world with agent skills, october 2025

    Barry Zhang, Keith Lazuka, and Mahesh Murag. Equipping agents for the real world with agent skills, october 2025. URL https://www. anthropic. com/engineering/equipping-agents-for-the-real-world-with-agent-skills. Accessed, pages 01–28, 2026

  49. [49]

    CoEvoSkills: Self-Evolving Agent Skills via Co-Evolutionary Verification

    Hanrong Zhang, Shicheng Fan, Henry Peng Zou, Yankai Chen, Zhenting Wang, Jiayu Zhou, Chengze Li, Wei-Chieh Huang, Yifei Yao, Kening Zheng, et al. Evoskills: Self-evolving agent skills via co-evolutionary verification. arXiv preprint arXiv:2604.01687, 2026

  50. [50]

    MemSkill: Learning and Evolving Memory Skills for Self-Evolving Agents

    Haozhen Zhang, Quanyu Long, Jianzhu Bao, Tao Feng, Weizhi Zhang, Haodong Yue, and Wenya Wang. Memskill: Learning and evolving memory skills for self-evolving agents. arXiv preprint arXiv:2602.02474, 2026

  51. [51]

    Structure-guided large language models for text-to-SQL generation

    Qinggang Zhang, Hao Chen, Junnan Dong, Shengyuan Chen, Feiran Huang, and Xiao Huang. Structure-guided large language models for text-to-SQL generation. In Forty-second International Conference on Machine Learning. URL https://openreview.net/forum?id=gT8JSEFqaS

  53. [53]

    Memento-skills: Let agents design agents

    Huichi Zhou, Siyuan Guo, Anjie Liu, Zhongwei Yu, Ziqin Gong, Bowen Zhao, Zhixun Chen, Menglong Zhang, Yihang Chen, Jinsong Li, et al. Memento-skills: Let agents design agents. arXiv preprint arXiv:2603.18743, 2026

  54. [54]

    Large language models are human-level prompt engineers

    Yongchao Zhou, Andrei Ioan Muresanu, Ziwen Han, Keiran Paster, Silviu Pitis, Harris Chan, and Jimmy Ba. Large language models are human-level prompt engineers. In The eleventh international conference on learning representations, 2022

A Proofs

A.1 Proof of Proposition A.1

We analyze the po...

  1. EXPLORE first: run queries to understand the data; check table structures, column values, data types, join keys, actual string values in the data

  2. If unsure about any SQL function’s syntax or behavior, call lookup_docs BEFORE writing the query

  3. For common patterns (top-N, running totals, pivots), call get_sql_pattern for a template

  4. PLAN your approach based on what you discovered

  5. WRITE and TEST your SQL incrementally: run it via execute_sql to check results

  6. VERIFY results look reasonable (right number of rows, right columns, sensible values)

  7. Call review_sql to get a second opinion before submitting

  8. SUBMIT only when confident

Optimized prompt: default

  1. EXPLORE the data first (as in seed)

  2. CLARIFY ambiguities: identify potential traps such as NULLs in key columns, case sensitivity, duplicate rows, date formats, and whether counts should be DISTINCT

  3. MAP the question to SQL primitives: explicitly decide join type (INNER vs LEFT), filter placement (WHERE vs HAVING), aggregation scope, and NULL handling before coding

  4. Check templates: call get_sql_pattern and lookup_docs (as in seed)

  5. BUILD incrementally: write and execute_sql each CTE or subquery alone (as in seed)

  6. VALIDATE against the question: re-read the question, then check: correct columns returned? correct filter conditions? DISTINCT where needed? NULL-safe denominators? ordering and limits match?

  7. CROSS-CHECK edge cases: run a quick sanity query (e.g., total counts, min/max values, a spot-check join) to confirm the final result is not inflated by fanout or deflated by over-filtering

  8. REVIEW: call review_sql and address any flagged issues

  9. SUBMIT only after incremental checks and review pass.

direct_coder. Strategy: drafts SQL immediately, refines through execution feedback.

Seed prompt: direct_coder

## Strategy: DIRECT CODING

You are an EFFICIENT SQL writer. Write SQL quickly, test, iterate

  1. Read the question carefully. Identify the core tables, joins, and aggregations needed

  2. Write your best SQL attempt IMMEDIATELY based on the schema

  3. Execute it. If errors occur, read the error message carefully and fix

  4. If the query runs but results look wrong, investigate specific columns/values

  5. Iterate rapidly: each revision should fix one specific issue

  6. Do NOT over-explore. Only investigate columns/values that are directly relevant to errors

  7. SUBMIT as soon as the query produces reasonable results.

GEPA added lookup-table awareness and structured error-repair guidance. Key additions over the seed (new or substantially expanded material in bold):

Optimized prompt: direct_coder

## Strategy: DIRECT CODING

  1. Read the schema first. Before writing any SQL, identify ALL tables mentioned or implied by the question. Pay special attention to lookup/reference/static tables (e.g., category tables, node tables, type tables) that provide human-readable names or filter criteria; these almost always require a JOIN.

  2. Map question terms to schema columns. If the question references a name, label, or category, find which table owns that column. Never filter or select on a column that doesn’t exist in the target table; use the correct table via JOIN instead
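
The lookup-table guidance in the two steps above can be illustrated with a minimal sketch. It uses Python's standard sqlite3 module and a hypothetical two-table schema (`product`, `category`) that does not come from the paper; the helper `columns` is likewise an assumption for illustration. The sketch finds which table owns the column named in the question, shows that filtering the fact table on that column fails, and then answers the question through a JOIN.

```python
import sqlite3

# Hypothetical schema for illustration only: a fact table (product)
# and a lookup table (category) that owns the human-readable name.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE category (category_id INTEGER PRIMARY KEY, category_name TEXT);
CREATE TABLE product (product_id INTEGER PRIMARY KEY, category_id INTEGER, price REAL);
INSERT INTO category VALUES (1, 'Books'), (2, 'Toys');
INSERT INTO product VALUES (10, 1, 12.5), (11, 1, 7.5), (12, 2, 30.0);
""")

def columns(table):
    # PRAGMA table_info yields one row per column; index 1 is the column name.
    return [row[1] for row in conn.execute(f"PRAGMA table_info({table})")]

# Map the question term "category name" to the table that owns the column.
owner = [t for t in ("product", "category") if "category_name" in columns(t)]

# Filtering the fact table on a column it does not have raises an error.
try:
    conn.execute("SELECT SUM(price) FROM product WHERE category_name = 'Books'")
    direct_filter_ok = True
except sqlite3.OperationalError:
    direct_filter_ok = False

# Reach the column through a JOIN on the lookup table instead.
total_books = conn.execute("""
    SELECT SUM(p.price)
    FROM product AS p
    JOIN category AS c ON c.category_id = p.category_id
    WHERE c.category_name = 'Books'
""").fetchone()[0]

print(owner, direct_filter_ok, total_books)
```

Checking column ownership before drafting the query is cheap, and it targets the schema-hallucination errors that the optimized prompt is meant to reduce.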
