Residual Skill Optimization for Text-to-SQL Ensembles

Babak Salimi; Canwen Xu; Haoquan Guan; Jiongli Zhu; Nikki Lijing Kuang; Parjanya Prajakta Prashant; Seyedeh Baharan Khatami; Xiaodong Yu; Yingyu Lin; Yuxiong He; Zhewei Yao

arXiv: 2605.21792 · v1 · pith:ETYAOLTQ · submitted 2026-05-20 · cs.CL · cs.AI · cs.DB · cs.LG

Pith reviewed 2026-05-22 08:35 UTC · model grok-4.3

classification cs.CL · cs.AI · cs.DB · cs.LG

keywords Text-to-SQL · ensemble methods · residual optimization · Pass@K · complementary skills · SQL generation · agentic AI

The pith

DivSkill-SQL builds complementary Text-to-SQL ensembles by optimizing each new skill on the failures of the current ensemble, improving accuracy without fine-tuning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Text-to-SQL ensembles generate multiple query candidates and pick one, yet their performance is limited by correlated errors that keep Pass@K low. DivSkill-SQL introduces residual skill optimization to add new skills targeted at exactly those failure cases. This method ensures each addition contributes to the probability of having at least one correct candidate. On the Spider2-Lite benchmark the approach lifts accuracy by as much as 11.1 points on Snowflake and 8.3 points on BigQuery over prior ensembles. The skills also transfer across SQL dialects and to related tasks like BIRD-Critic while reducing specific error types such as schema hallucinations.
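To make the Pass@K ceiling concrete, here is a minimal Python sketch that treats Pass@K as the fraction of questions with at least one correct candidate. The function name `pass_at_k` and the toy correctness matrix are illustrative assumptions, not taken from the paper.

```python
from typing import Sequence


def pass_at_k(correct: Sequence[Sequence[bool]]) -> float:
    """Fraction of questions where at least one candidate is correct.

    correct[i][j] is True when candidate j for question i is correct.
    """
    if not correct:
        raise ValueError("need at least one question")
    solved = sum(1 for row in correct if any(row))
    return solved / len(correct)


# Hypothetical toy data: 4 questions, 3 candidates each.
toy = [
    [True, True, True],     # every candidate is correct
    [False, False, False],  # correlated failure: all candidates miss
    [False, True, False],   # a single complementary candidate rescues this one
    [False, False, False],  # another correlated failure
]
print(pass_at_k(toy))  # 0.5
```

No selector can exceed this value, which is why candidates that fail on the same questions add little even when K grows.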

Core claim

By framing skill addition as residual optimization on ensemble failures, DivSkill-SQL produces agentic skills whose collective coverage provably raises the upper bound on correct selection in Text-to-SQL generation, achieving measurable gains across models and dialects without any parameter updates to the underlying language models.

What carries the argument

Residual skill optimization that trains each successive skill exclusively on the failure examples of the preceding ensemble to maximize marginal contribution to Pass@K.
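The loop described above can be sketched as follows. This is a toy under stated assumptions, not the authors' implementation: a skill is modeled only as the set of question ids it solves, and `build_residual_ensemble`, `optimize_on`, and `toy_optimizer` are hypothetical names standing in for the paper's prompt-level skill optimization.

```python
from typing import Callable, List, Set

Skill = Set[str]  # hypothetical stand-in: the question ids a skill solves


def build_residual_ensemble(
    questions: List[str],
    seed_skill: Skill,
    optimize_on: Callable[[List[str]], Skill],
    max_skills: int,
) -> List[Skill]:
    """Grow an ensemble by fitting each new skill to the current failures."""
    ensemble: List[Skill] = [set(seed_skill)]
    covered: Set[str] = set(seed_skill)
    while len(ensemble) < max_skills:
        # Residual = questions that no skill in the ensemble solves yet.
        residual = [q for q in questions if q not in covered]
        if not residual:  # nothing left to target
            break
        new_skill = optimize_on(residual)
        ensemble.append(new_skill)
        covered |= new_skill
    return ensemble


def toy_optimizer(residual: List[str]) -> Skill:
    # Assumption for illustration: the optimizer solves half of what it sees.
    return set(residual[: max(1, len(residual) // 2)])


questions = [f"q{i}" for i in range(8)]
ensemble = build_residual_ensemble(
    questions, {"q0", "q1", "q2", "q3"}, toy_optimizer, max_skills=3
)
print([sorted(s) for s in ensemble])
```

In this toy run the second skill covers q4 and q5 and the third covers q6, so every added skill raises coverage. Whether a real optimizer behaves this way on held-out questions is exactly the premise the page flags as load-bearing.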

If this is right

Accuracy gains hold across different base models such as Opus-4.6 and GPT-5.4.
Skills trained on one dialect transfer directly to others like Snowflake, BigQuery, and SQLite.
Performance improvements extend to BIRD-Critic with a 2.6 point gain.
Diagnostics reveal up to three times fewer hallucinated schema references and function calls.
The method requires no fine-tuning of the base models.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar residual optimization could apply to other ensemble-based generation tasks facing diversity issues.
The focus on failure cases may yield more reliable diversity than stochastic sampling or prompt variations.
Automatic selection of skill count could be based on diminishing marginal returns in Pass@K.
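The last extension could be prototyped with a simple stopping rule. The sketch below is an assumption of this page rather than anything the paper describes: `choose_skill_count` and the `min_gain` threshold are hypothetical, and each skill is again reduced to the set of validation questions it solves.

```python
from typing import List, Set


def choose_skill_count(
    solved_sets: List[Set[str]], n_questions: int, min_gain: float
) -> int:
    """Keep skills in order until the marginal Pass@K gain drops below min_gain."""
    covered: Set[str] = set()
    kept = 0
    for solved in solved_sets:
        # Marginal gain = share of questions this skill newly covers.
        gain = len(solved - covered) / n_questions
        if kept > 0 and gain < min_gain:
            break
        covered |= solved
        kept += 1
    return kept


# Hypothetical validation results for four skills on 10 questions.
skills = [{"a", "b", "c", "d"}, {"e", "f"}, {"f", "g"}, {"a"}]
print(choose_skill_count(skills, n_questions=10, min_gain=0.15))  # 2
```

The first skill is always kept; the third adds only one new question (a 0.1 gain), so the rule stops at two skills under this threshold.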

Load-bearing premise

That skills trained only on the ensemble's current failure cases will generate genuinely complementary candidates rather than new sets of correlated mistakes.

What would settle it

A direct check would be whether the union of correct queries covered by the full skill ensemble is larger than that of the baseline ensemble by the claimed margin, or if selected accuracy plateaus despite added skills on a new test split.
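A sketch of that check, assuming per-skill sets of correctly answered question ids are available for both ensembles. The function `coverage_gain` and the toy ids are hypothetical.

```python
from typing import Dict, List, Set


def coverage_gain(
    baseline: List[Set[int]], candidate: List[Set[int]], n_questions: int
) -> Dict[str, object]:
    """Compare the union of solved questions between two ensembles."""
    base_union: Set[int] = set().union(*baseline) if baseline else set()
    cand_union: Set[int] = set().union(*candidate) if candidate else set()
    return {
        # Difference in coverage, expressed in percentage points.
        "gain_points": 100.0 * (len(cand_union) - len(base_union)) / n_questions,
        "newly_covered": cand_union - base_union,
        "lost": base_union - cand_union,
    }


# Hypothetical ensembles evaluated on 10 questions.
baseline = [{1, 2, 3}, {2, 3, 4}]     # overlapping skills: 4 questions covered
candidate = [{1, 2, 3}, {5, 6}, {7}]  # complementary skills: 6 questions covered
report = coverage_gain(baseline, candidate, n_questions=10)
print(report["gain_points"], sorted(report["newly_covered"]), sorted(report["lost"]))
```

Reporting the `lost` set alongside the gain matters: an ensemble can raise net coverage while dropping questions the baseline solved, which a single accuracy number would hide.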

Figures

Figures reproduced from arXiv: 2605.21792 by Babak Salimi, Canwen Xu, Haoquan Guan, Jiongli Zhu, Nikki Lijing Kuang, Parjanya Prajakta Prashant, Seyedeh Baharan Khatami, Xiaodong Yu, Yingyu Lin, Yuxiong He, Zhewei Yao.

**Figure 1.** System diagram of DivSkill-SQL. The left panel shows skill construction: starting from diverse strategy prompts, the system repeatedly identifies unsolved questions and refines the next skill toward those remaining cases. The right panel shows test-time execution: multiple skill-guided agents solve the same Text-to-SQL problem through different interaction patterns, producing SQL candidates that are then c…

**Figure 2.** Pass@k comparison between DivSkill-SQL and its variants on 100-instance subsets of a) Spider2-Lite and b) BIRD-Critic. These results indicate that residual skills improve diversity in a more controlled way. Rather than relying on high-temperature perturbations that can introduce hallucinated references or unstable SQL structures, DivSkill-SQL induces different agent behaviors while preserving ca…

Original abstract

Text-to-SQL ensembles improve over single-candidate generation by drawing multiple SQL candidates and selecting one, but their effectiveness is bounded by Pass@K, the probability that at least one of K candidates is correct. Existing methods source diversity heuristically through stochastic decoding or prompt variants, leaving candidate sets dominated by correlated failures. We present DivSkill-SQL, a residual skill optimization framework that builds complementary agentic Text-to-SQL ensembles without model fine-tuning: each new skill is optimized on examples the current skill ensemble fails on, provably targeting its marginal contribution to Pass@K. On Spider2-Lite, DivSkill-SQL improves selected accuracy by up to +11.1 points on Snowflake and +8.3 on BigQuery over the strongest ensemble baseline, with consistent gains across two base models (Opus-4.6 and GPT-5.4). Skills optimized on a single dialect transfer without retraining across dialects (Snowflake, BigQuery, SQLite) and to a different task formulation, such as BIRD-Critic (+2.6 pts). Error diagnostics show up to 3x fewer hallucinated schema references and function calls, indicating that gains come from genuinely reliable complementary skills rather than surface-form variation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

DivSkill-SQL shows practical gains on Text-to-SQL ensembles via residual skill optimization on failures, but the complementarity claim rests on limited diversity metrics.

Colleague, the main point is that this work introduces DivSkill-SQL to iteratively add skills to Text-to-SQL ensembles by optimizing each new one exclusively on the cases where the current ensemble fails, with the goal of lifting Pass@K without any model fine-tuning. They report up to +11.1 points on Snowflake and +8.3 on BigQuery for Spider2-Lite, plus a smaller gain on BIRD-Critic and transfer across dialects, including SQLite. The error diagnostics, which show up to 3x fewer hallucinated schema references and function calls, are a concrete plus and suggest the method improves reliability in a way that is usable for real database interfaces.

What the paper does well is keep the approach simple and agentic, building directly on prior ensemble ideas while giving a clear recipe for targeting marginal contributions. The results hold across two base models, which adds some robustness.

On the soft spots, the abstract labels the targeting as 'provable' yet the text supplies no derivation or formal argument, so that framing feels more aspirational than demonstrated. The stress-test concern about correlated errors is reasonable here: since all skills come from the same base models via prompting, optimizing on disjoint failure subsets could still reinforce shared biases rather than truly diversify coverage. They provide some hallucination counts but no explicit inter-skill error correlation or diversity statistics, leaving open whether the lifts come from reduced overlap or just stronger individual skills.

This is for researchers and engineers building practical Text-to-SQL systems who care about ensemble methods that avoid retraining. A reader focused on applied LLM reliability would find the method and numbers useful. It has enough empirical grounding and a clear practical angle to deserve serious referee time rather than a desk reject.

Referee Report

3 major / 1 minor

Summary. The paper introduces DivSkill-SQL, a residual skill optimization framework for Text-to-SQL ensembles that constructs complementary agentic skills by optimizing each new skill exclusively on the failure cases of the current ensemble. It claims this approach provably targets the marginal contribution to Pass@K without model fine-tuning. On Spider2-Lite, it reports selected accuracy gains of up to +11.1 points on Snowflake and +8.3 on BigQuery over strong baselines, with consistent results across Opus-4.6 and GPT-5.4, transfer to other dialects and BIRD-Critic (+2.6 pts), and up to 3x fewer hallucinated schema references and function calls.

Significance. If the empirical gains are reproducible and the complementary coverage is verified beyond final accuracy, the work offers a practical method for improving ensemble diversity in semantic parsing tasks. The no-fine-tuning constraint and cross-dialect transfer are notable strengths for real-world Text-to-SQL deployment.

major comments (3)

[Abstract, §3] Abstract and §3: The claim that each skill 'provably targets its marginal contribution to Pass@K' lacks any derivation, proof sketch, or formal argument in the manuscript. The optimization is described as training on current ensemble failures, which risks making the targeting tautological rather than provable; a concrete mathematical justification or counterexample analysis is needed to support this central assertion.
[§4] §4 (Experiments on Spider2-Lite): The reported accuracy lifts (+11.1 on Snowflake, +8.3 on BigQuery) are presented without error bars, standard deviations, or details on the number of independent runs or random seeds. This undermines assessment of whether the gains are statistically reliable or sensitive to optimization stochasticity.
[§5] §5 (Error diagnostics): The manuscript reports up to 3x fewer hallucinations but provides no explicit inter-skill error-correlation metric, pairwise failure overlap, or diversity statistic (e.g., Jaccard index on error sets across skills). Without this, it is unclear whether gains arise from reduced correlated errors or simply from higher per-skill reliability on the same base models.
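The diversity statistic requested in the third major comment could be computed along these lines. This is a minimal sketch with hypothetical function names and toy failure sets; the convention that two empty failure sets have overlap 0.0 is an assumption made here, not something the report specifies.

```python
from itertools import combinations
from typing import List, Set


def jaccard(a: Set[int], b: Set[int]) -> float:
    """Jaccard index of two failure sets; defined as 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def mean_pairwise_failure_overlap(failures: List[Set[int]]) -> float:
    """Average Jaccard index over all pairs of per-skill failure sets."""
    pairs = list(combinations(failures, 2))
    if not pairs:
        raise ValueError("need at least two skills")
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


# Hypothetical failure sets (question ids each skill gets wrong).
correlated = [{1, 2, 3}, {1, 2, 3}, {1, 2, 4}]
complementary = [{1, 2}, {3, 4}, {5, 6}]
print(round(mean_pairwise_failure_overlap(correlated), 3))  # 0.667
print(mean_pairwise_failure_overlap(complementary))         # 0.0
```

A lower average overlap for the residual ensemble than for the baseline, at comparable per-skill accuracy, would support the complementarity reading; similar overlap with higher per-skill accuracy would point to stronger individual skills instead.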

minor comments (1)

[Abstract] The abstract introduces 'DivSkill-SQL' and 'residual skill optimization' without a one-sentence definition; adding a brief parenthetical expansion would improve immediate readability for readers outside the subfield.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive comments. We address each major point below and will revise the manuscript to incorporate the suggested improvements.

Referee: [Abstract, §3] Abstract and §3: The claim that each skill 'provably targets its marginal contribution to Pass@K' lacks any derivation, proof sketch, or formal argument in the manuscript. The optimization is described as training on current ensemble failures, which risks making the targeting tautological rather than provable; a concrete mathematical justification or counterexample analysis is needed to support this central assertion.

Authors: We agree that an explicit formal argument strengthens the central claim. The residual optimization targets marginal contribution because any correct SQL produced on previously uncovered failures directly raises Pass@K for those instances. In the revised manuscript we will add a short proof sketch in §3: let E_k be the set of examples failed by the first k skills; the marginal gain from skill k+1 is P(skill k+1 succeeds on E_k). Optimizing exclusively on E_k maximizes the expected size of newly covered examples under a fixed per-skill success rate on the residual distribution. We will contrast this with non-residual addition (which can redundantly cover already-solved examples) and include a small illustrative counterexample showing the coverage difference. revision: yes
Referee: [§4] §4 (Experiments on Spider2-Lite): The reported accuracy lifts (+11.1 on Snowflake, +8.3 on BigQuery) are presented without error bars, standard deviations, or details on the number of independent runs or random seeds. This undermines assessment of whether the gains are statistically reliable or sensitive to optimization stochasticity.

Authors: We accept this criticism. In the revised §4 we will rerun the full DivSkill-SQL pipeline with three distinct random seeds for both the skill optimization and the final selection step, reporting mean selected accuracy and standard deviation for the Snowflake and BigQuery gains. This will quantify variability due to stochasticity in the residual training process. revision: yes
Referee: [§5] §5 (Error diagnostics): The manuscript reports up to 3x fewer hallucinations but provides no explicit inter-skill error-correlation metric, pairwise failure overlap, or diversity statistic (e.g., Jaccard index on error sets across skills). Without this, it is unclear whether gains arise from reduced correlated errors or simply from higher per-skill reliability on the same base models.

Authors: We will strengthen the diagnostics. In the revised §5 we will add pairwise Jaccard indices computed on the per-skill failure sets (instances where each skill produces an incorrect SQL) and report the average inter-skill error overlap. We will also compute the correlation of hallucination types (schema references and function calls) across skills. These metrics will be compared against the baseline ensemble to show that residual optimization yields measurably lower error correlation, supporting the claim of complementary rather than merely stronger individual skills. revision: yes
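
The marginal-gain statement in the first response can be written out as a short identity. The notation below is an assumption of this page (the paper's own symbols are not shown here): $c_j(x) \in \{0,1\}$ marks whether skill $j$ produces a correct SQL query for example $x$, skills are treated as deterministic per example, and $E_k$ is the set of examples failed by all of the first $k$ skills.

```latex
\begin{align*}
E_k &= \{\, x : c_j(x) = 0 \ \text{for all } j \le k \,\}, \\
\mathrm{Pass@}k &= \Pr\bigl[\exists\, j \le k : c_j(x) = 1\bigr] = 1 - \Pr[x \in E_k], \\
\mathrm{Pass@}(k{+}1) - \mathrm{Pass@}k
  &= \Pr[x \in E_k] - \Pr[x \in E_{k+1}] \\
  &= \Pr\bigl[x \in E_k,\ c_{k+1}(x) = 1\bigr] \\
  &= \Pr[x \in E_k] \cdot \Pr\bigl[c_{k+1}(x) = 1 \mid x \in E_k\bigr].
\end{align*}
```

Under these assumptions the gain depends only on how often skill $k+1$ succeeds on the residual set, which is what training on $E_k$ targets. The identity does not by itself show that the success rate measured on training failures carries over to unseen test failures.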

Circularity Check

1 step flagged

Residual optimization on failures makes marginal Pass@K targeting definitional by construction

specific steps

self-definitional [Abstract]
"each new skill is optimized on examples the current skill ensemble fails on, provably targeting its marginal contribution to Pass@K"

Optimizing exclusively on the ensemble's failure cases ensures by construction that the new skill succeeds where prior ones fail, directly raising Pass@K by the fraction of those cases. The 'provably targeting marginal contribution' assertion is therefore equivalent to the residual selection rule itself rather than a derived or independent result.

full rationale

The paper's central methodological claim is that residual skill optimization on current-ensemble failures 'provably targets its marginal contribution to Pass@K'. This reduces directly to the definition of the optimization target: by selecting training examples where all prior skills fail, any skill that succeeds on them necessarily covers exactly the marginal cases that increase Pass@K. No independent derivation, uniqueness theorem, or external benchmark is shown to establish the 'provable' aspect beyond this construction. Empirical gains and error diagnostics are reported separately and do not rescue the load-bearing justification step. This matches partial circularity (score 6) while leaving the reported accuracy numbers as non-circular observations.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

Framework rests on the assumption that failure-targeted optimization yields independent skills; no explicit free parameters or invented physical entities are described.

axioms (1)

domain assumption: Ensemble effectiveness is bounded by Pass@K
Stated directly as the limiting factor for current ensembles.

invented entities (1)

DivSkill-SQL residual skill (no independent evidence)
purpose: Complementary agentic skill added to ensemble
Newly introduced optimization target; no independent evidence supplied in abstract.

pith-pipeline@v0.9.0 · 5795 in / 1194 out tokens · 34386 ms · 2026-05-22T08:35:02.771450+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

DIVSKILL-SQL, a residual skill optimization framework that builds complementary agentic Text-to-SQL ensembles without model fine-tuning: each new skill is optimized on examples the current skill ensemble fails on, provably targeting its marginal contribution to Pass@K

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

119 extracted references · 119 canonical work pages · 11 internal anchors

[1]

GEPA: Reflective Prompt Evolution Can Outperform Reinforcement Learning

Lakshya A Agrawal, Shangyin Tan, Dilara Soylu, Noah Ziems, Rishi Khare, Krista Opsahl-Ong, Arnav Singhvi, Herumb Shandilya, Michael J Ryan, Meng Jiang, et al. Gepa: Reflective prompt evolution can outperform reinforcement learning.arXiv preprint arXiv:2507.19457, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[2]

Introducing Claude Opus 4.6

Anthropic. Introducing Claude Opus 4.6. https://www.anthropic.com/news/claude-opus-4-6 , February 2026

work page 2026
[3]

Language models are few-shot learners.Advances in neural information processing systems, 33:1877–1901, 2020

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Nee- lakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners.Advances in neural information processing systems, 33:1877–1901, 2020

work page 1901
[4]

Reliable text-to-sql with adaptive abstention.Proc

Kaiwen Chen, Yueting Chen, Nick Koudas, and Xiaohui Yu. Reliable text-to-sql with adaptive abstention.Proc. ACM Manag. Data, 3(1), February 2025. doi: 10.1145/3709719. URLhttps://doi.org/10.1145/3709719

work page doi:10.1145/3709719 2025
[5]

Evaluating Large Language Models Trained on Code

Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde De Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021
[6]

Pass@ k training for adaptively balancing exploration and exploitation of large reasoning models.arXiv preprint arXiv:2508.10751, 2025

Zhipeng Chen, Xiaobo Qin, Youbin Wu, Yue Ling, Qinghao Ye, Wayne Xin Zhao, and Guang Shi. Pass@ k training for adaptively balancing exploration and exploitation of large reasoning models.arXiv preprint arXiv:2508.10751, 2025

work page arXiv 2025
[7]

Training Verifiers to Solve Math Word Problems

Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training verifiers to solve math word problems.arXiv preprint arXiv:2110.14168, 2021. 10 Residual Skill Optimization for Text-to-SQL EnsemblesA PREPRINT

work page internal anchor Pith review Pith/arXiv arXiv 2021
[8]

PARSQL: Enhancing text-to-SQL through SQL parsing and reasoning

Yaxun Dai, Haiqin Yang, Mou Hao, and Pingfu Chao. PARSQL: Enhancing text-to-SQL through SQL parsing and reasoning. In Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pilehvar, editors, Findings of the Association for Computational Linguistics: ACL 2025, pages 661–681, Vienna, Austria, July 2025. Association for Computational Linguistics...

work page doi:10.18653/v1/2025.findings-acl.37 2025
[9]

Reforce: A text-to-sql agent with self-refinement, format restriction, and column exploration

Minghang Deng, Ashwin Ramachandran, Canwen Xu, Lanxiang Hu, Zhewei Yao, Anupam Datta, and Hao Zhang. Reforce: A text-to-sql agent with self-refinement, format restriction, and column exploration. InICLR 2025 Workshop: VerifAI: AI Verification in the Wild, 2025

work page 2025
[10]

C3: Zero-shot text-to-sql with chatgpt,

Xuemei Dong, Chao Zhang, Yuhang Ge, Yuren Mao, Yunjun Gao, Jinshu Lin, Dongfang Lou, et al. C3: Zero-shot text-to-sql with chatgpt.arXiv preprint arXiv:2307.07306, 2023

work page arXiv 2023
[11]

Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution

Chrisantha Fernando, Dylan Banarse, Henryk Michalewski, Simon Osindero, and Tim Rocktäschel. Promptbreeder: Self-referential self-improvement via prompt evolution.arXiv preprint arXiv:2309.16797, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[12]

Yingqi Gao, Yifu Liu, Xiaoxia Li, Xiaorong Shi, Yin Zhu, Yiming Wang, Shiqi Li, Wei Li, Yun- tao Hong, Zhiling Luo, Jinyang Gao, Liyu Mou, and Yu Li

Dawei Gao, Haibin Wang, Yaliang Li, Xiuyu Sun, Yichen Qian, Bolin Ding, and Jingren Zhou. Text-to-sql empowered by large language models: A benchmark evaluation.arXiv preprint arXiv:2308.15363, 2023

work page arXiv 2023
[13]

SQLForge: Synthesizing reliable and diverse data to enhance text-to-SQL reasoning in LLMs

Yu Guo, Dong Jin, Shenghao Ye, Shuangwu Chen, Jian Yang, and Xiaobin Tan. SQLForge: Synthesizing reliable and diverse data to enhance text-to-SQL reasoning in LLMs. In Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pilehvar, editors,Findings of the Association for Computational Linguistics: ACL 2025, pages 8441–8452, Vienna, Austria, J...

work page 2025
[14]

URLhttps://aclanthology.org/2025.findings-acl.443/

doi: 10.18653/v1/2025.findings-acl.443. URLhttps://aclanthology.org/2025.findings-acl.443/

work page doi:10.18653/v1/2025.findings-acl.443 2025
[15]

V-star: Training verifiers for self-taught reasoners, 2024

Arian Hosseini, Xingdi Yuan, Nikolay Malkin, Aaron Courville, Alessandro Sordoni, and Rishabh Agarwal. V-star: Training verifiers for self-taught reasoners.arXiv preprint arXiv:2402.06457, 2024

work page arXiv 2024
[16]

MCS-SQL: Leveraging multiple prompts and multiple-choice selection for text-to-SQL generation

Dongjun Lee, Choongwon Park, Jaehyuk Kim, and Heesoo Park. MCS-SQL: Leveraging multiple prompts and multiple-choice selection for text-to-SQL generation. In Owen Rambow, Leo Wanner, Marianna Apidianaki, Hend Al-Khalifa, Barbara Di Eugenio, and Steven Schockaert, editors,Proceedings of the 31st International Conference on Computational Linguistics, pages 3...

work page 2025
[17]

Spider 2.0: Evaluating language models on real-world enterprise text-to-sql workflows.arXiv preprint arXiv:2411.07763, 2024

Fangyu Lei, Jixuan Chen, Yuxiao Ye, Ruisheng Cao, Dongchan Shin, Hongjin Su, Zhaoqing Suo, Hongcheng Gao, Wenjing Hu, Pengcheng Yin, et al. Spider 2.0: Evaluating language models on real-world enterprise text-to-sql workflows.arXiv preprint arXiv:2411.07763, 2024

work page arXiv 2024
[18]

The dawn of natural language to sql: Are we fully ready?Proc

Boyan Li, Yuyu Luo, Chengliang Chai, Guoliang Li, and Nan Tang. The dawn of natural language to sql: Are we fully ready?Proc. VLDB Endow., 17(11):3318–3331, July 2024. ISSN 2150-8097. doi: 10.14778/3681954. 3682003. URLhttps://doi.org/10.14778/3681954.3682003

work page doi:10.14778/3681954 2024
[19]

Codes: Towards building open-source language models for text-to-sql.Proc

Haoyang Li, Jing Zhang, Hanbing Liu, Ju Fan, Xiaokang Zhang, Jun Zhu, Renjie Wei, Hongyan Pan, Cuiping Li, and Hong Chen. Codes: Towards building open-source language models for text-to-sql.Proc. ACM Manag. Data, 2(3), May 2024. doi: 10.1145/3654930. URLhttps://doi.org/10.1145/3654930

work page doi:10.1145/3654930 2024
[20]

Can llm already serve as a database interface? a big bench for large-scale database grounded text-to-sqls.Advances in Neural Information Processing Systems, 36, 2024

Jinyang Li, Binyuan Hui, Ge Qu, Jiaxi Yang, Binhua Li, Bowen Li, Bailin Wang, Bowen Qin, Ruiying Geng, Nan Huo, et al. Can llm already serve as a database interface? a big bench for large-scale database grounded text-to-sqls.Advances in Neural Information Processing Systems, 36, 2024

work page 2024
[21]

Swe-sql: Illuminating llm pathways to solve user sql issues in real-world applications.arXiv preprint arXiv:2506.18951, 2025

Jinyang Li, Xiaolong Li, Ge Qu, Per Jacobsson, Bowen Qin, Binyuan Hui, Shuzheng Si, Nan Huo, Xiaohan Xu, Yue Zhang, et al. Swe-sql: Illuminating llm pathways to solve user sql issues in real-world applications.arXiv preprint arXiv:2506.18951, 2025

work page arXiv 2025
[22]

Let’s verify step by step

Hunter Lightman, Vineet Kosaraju, Yuri Burda, Harrison Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let’s verify step by step. InThe twelfth international conference on learning representations, 2023

work page 2023
[23]

Xiyan-sql: A novel multi-generator framework for text-to-sql.IEEE Transactions on Knowledge and Data Engineering, 2026

Yifu Liu, Yin Zhu, Yingqi Gao, Zhiling Luo, Xiaoxia Li, Xiaorong Shi, Yuntao Hong, Jinyang Gao, Yu Li, Bolin Ding, et al. Xiyan-sql: A novel multi-generator framework for text-to-sql.IEEE Transactions on Knowledge and Data Engineering, 2026. 11 Residual Skill Optimization for Text-to-SQL EnsemblesA PREPRINT

work page 2026
[24]

SQL-r1: Training natural language to SQL reasoning model by reinforcement learning

Peixian MA, Xialie Zhuang, Chengjin Xu, Xuhui Jiang, Ran Chen, and Jian Guo. SQL-r1: Training natural language to SQL reasoning model by reinforcement learning. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2026. URLhttps://openreview.net/forum?id=hgJQcuDwm1

work page 2026
[25]

SkillClaw: Let Skills Evolve Collectively with Agentic Evolver

Ziyu Ma, Shidong Yang, Yuxiang Ji, Xucong Wang, Yong Wang, Yiming Hu, Tongwen Huang, and Xiangxiang Chu. Skillclaw: Let skills evolve collectively with agentic evolver.arXiv preprint arXiv:2604.08377, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[26]

An analysis of approximations for maximizing submodular set functions—i.Mathematical programming, 14(1):265–294, 1978

George L Nemhauser, Laurence A Wolsey, and Marshall L Fisher. An analysis of approximations for maximizing submodular set functions—i.Mathematical programming, 14(1):265–294, 1978

work page 1978
[27]

Lever: Learning to verify language-to-code generation with execution

Ansong Ni, Srini Iyer, Dragomir Radev, Veselin Stoyanov, Wen-tau Yih, Sida Wang, and Xi Victoria Lin. Lever: Learning to verify language-to-code generation with execution. InInternational Conference on Machine Learning, pages 26106–26128. PMLR, 2023

work page 2023
[28]

Introducing GPT-5.4.https://openai.com/index/introducing-gpt-5-4/, March 2026

OpenAI. Introducing GPT-5.4.https://openai.com/index/introducing-gpt-5-4/, March 2026

work page 2026
[29]

Boosted prompt ensembles for large language models.arXiv preprint arXiv:2304.05970, 2023

Silviu Pitis, Michael R Zhang, Andrew Wang, and Jimmy Ba. Boosted prompt ensembles for large language models.arXiv preprint arXiv:2304.05970, 2023

work page arXiv 2023
[30]

Din-sql: Decomposed in-context learning of text-to-sql with self-correction.Advances in neural information processing systems, 36:36339–36348, 2023

Mohammadreza Pourreza and Davood Rafiei. Din-sql: Decomposed in-context learning of text-to-sql with self-correction.Advances in neural information processing systems, 36:36339–36348, 2023

work page 2023
[31]

Evaluating cross-domain text-to-sql models and benchmarks

Mohammadreza Pourreza and Davood Rafiei. Evaluating cross-domain text-to-sql models and benchmarks. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 1601–1611, 2023

work page 2023
[32]

Chase-sql: Multi-path reasoning and preference optimized candidate selection in text-to-sql.arXiv preprint arXiv:2410.01943, 2024

Mohammadreza Pourreza, Hailong Li, Ruoxi Sun, Yeounoh Chung, Shayan Talaei, Gaurav Tarlok Kakkar, Yu Gan, Amin Saberi, Fatma Ozcan, and Sercan O Arik. Chase-sql: Multi-path reasoning and preference optimized candidate selection in text-to-sql.arXiv preprint arXiv:2410.01943, 2024

work page arXiv 2024
[33]

InFindings of the Associ- ation for Computational Linguistics: EMNLP 2024, pages 8212–8220, Miami, Florida, USA

Mohammadreza Pourreza, Shayan Talaei, Ruoxi Sun, Xingchen Wan, Hailong Li, Azalia Mirhoseini, Amin Saberi, Sercan Arik, et al. Reasoning-sql: Reinforcement learning with sql tailored partial rewards for reasoning-enhanced text-to-sql.arXiv preprint arXiv:2503.23157, 2025

work page arXiv 2025
[34]

Autoprompt: Eliciting knowledge from language models with automatically generated prompts

Taylor Shin, Yasaman Razeghi, Robert L Logan IV , Eric Wallace, and Sameer Singh. Autoprompt: Eliciting knowledge from language models with automatically generated prompts. InProceedings of the 2020 conference on empirical methods in natural language processing (EMNLP), pages 4222–4235, 2020

work page 2020
[35]

Exploring chain of thought style prompting for text-to-sql

Chang-Yu Tai, Ziru Chen, Tianshu Zhang, Xiang Deng, and Huan Sun. Exploring chain of thought style prompting for text-to-sql. InProceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 5376–5393, 2023

work page 2023
[36] Shayan Talaei, Mohammadreza Pourreza, Yu-Chen Chang, Azalia Mirhoseini, and Amin Saberi. CHESS: Contextual harnessing for efficient SQL synthesis. arXiv preprint arXiv:2405.16755, 2024.

[37] Christian Walder and Deep Karkhanis. Pass@k policy optimization: Solving harder reinforcement learning problems. arXiv preprint arXiv:2505.15201, 2025.

[38] Bing Wang, Changyu Ren, Jian Yang, Xinnian Liang, Jiaqi Bai, Linzheng Chai, Zhao Yan, Qian-Wen Zhang, Di Yin, Xing Sun, et al. MAC-SQL: A multi-agent collaborative framework for text-to-SQL. In Proceedings of the 31st International Conference on Computational Linguistics, pages 540–557, 2025.

[39] Peng Xia, Jianwen Chen, Hanyang Wang, Jiaqi Liu, Kaide Zeng, Yu Wang, Siwei Han, Yiyang Zhou, Xujiang Zhao, Haifeng Chen, et al. SkillRL: Evolving agents via recursive skill-augmented reinforcement learning. arXiv preprint arXiv:2602.08234, 2026.

[40] Yuanzhen Xie, Xinzhou Jin, Tao Xie, Mingxiong Lin, Liang Chen, Chenyun Yu, Lei Cheng, Chengxiang Zhuo, Bo Hu, and Zang Li. Decomposition for enhancing attention: Improving LLM-based text-to-SQL through workflow paradigm. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar, editors, Findings of the Association for Computational Linguistics: ACL 2024, pages 107... doi: 10.18653/v1/2024.findings-acl.641.

[41] Chengrun Yang, Xuezhi Wang, Yifeng Lu, Hanxiao Liu, Quoc V Le, Denny Zhou, and Xinyun Chen. Large language models as optimizers. In The Twelfth International Conference on Learning Representations, 2023.

[42] Haolin Yang, Jipeng Zhang, Zhitao He, and Yi R Fung. MARS-SQL: A multi-agent reinforcement learning framework for text-to-SQL. arXiv preprint arXiv:2511.01008, 2025.
[43] Jiaxi Yang, Binyuan Hui, Min Yang, Jian Yang, Junyang Lin, and Chang Zhou. Synthesizing text-to-SQL data from weak and strong LLMs. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar, editors, Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 7864–7875, Bangkok, Thailand, August 2024. Assoc... doi: 10.18653/v1/2024.acl-long.425.

[44] Jian Yao, Ran Cheng, Xingyu Wu, Jibin Wu, and Kay Chen Tan. Diversity-aware policy optimization for large language model reasoning. arXiv preprint arXiv:2505.23433, 2025.

[45] Tao Yu, Rui Zhang, Kai Yang, Michihiro Yasunaga, Dongxu Wang, Zifan Li, James Ma, Irene Li, Qingning Yao, Shanelle Roman, et al. Spider: A large-scale human-labeled dataset for complex and cross-domain semantic parsing and text-to-SQL task. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 3911–3921, 2018.

[46] Yang Yue, Zhiqi Chen, Rui Lu, Andrew Zhao, Zhaokai Wang, Shiji Song, and Gao Huang. Does reinforcement learning really incentivize reasoning capacity in LLMs beyond the base model? arXiv preprint arXiv:2504.13837, 2025.

[47] Bohan Zhai, Canwen Xu, Yuxiong He, and Zhewei Yao. Optimizing reasoning for text-to-SQL with execution feedback. In Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pilehvar, editors, Findings of the Association for Computational Linguistics: ACL 2025, pages 19206–19218, Vienna, Austria, July 2025. Association for Computational Linguistic... doi: 10.18653/v1/2025.findings-acl.982.

[48] Barry Zhang, Keith Lazuka, and Mahesh Murag. Equipping agents for the real world with agent skills, October 2025. URL https://www.anthropic.com/engineering/equipping-agents-for-the-real-world-with-agent-skills. Accessed 2026-01-28.

[49] Hanrong Zhang, Shicheng Fan, Henry Peng Zou, Yankai Chen, Zhenting Wang, Jiayu Zhou, Chengze Li, Wei-Chieh Huang, Yifei Yao, Kening Zheng, et al. Evoskills: Self-evolving agent skills via co-evolutionary verification. arXiv preprint arXiv:2604.01687, 2026.
[50] Haozhen Zhang, Quanyu Long, Jianzhu Bao, Tao Feng, Weizhi Zhang, Haodong Yue, and Wenya Wang. MemSkill: Learning and evolving memory skills for self-evolving agents. arXiv preprint arXiv:2602.02474, 2026.

[51] Qinggang Zhang, Hao Chen, Junnan Dong, Shengyuan Chen, Feiran Huang, and Xiao Huang. Structure-guided large language models for text-to-SQL generation. In Forty-second International Conference on Machine Learning. URL https://openreview.net/forum?id=gT8JSEFqaS.

[53] Huichi Zhou, Siyuan Guo, Anjie Liu, Zhongwei Yu, Ziqin Gong, Bowen Zhao, Zhixun Chen, Menglong Zhang, Yihang Chen, Jinsong Li, et al. Memento-skills: Let agents design agents. arXiv preprint arXiv:2603.18743, 2026.

[54] Yongchao Zhou, Andrei Ioan Muresanu, Ziwen Han, Keiran Paster, Silviu Pitis, Harris Chan, and Jimmy Ba. Large language models are human-level prompt engineers. In The Eleventh International Conference on Learning Representations, 2022.

A Proofs

A.1 Proof of Proposition A.1

We analyze the po...
1. EXPLORE first: run queries to understand the data -- check table structures, column values, data types, join keys, actual string values in the data
2. If unsure about any SQL function’s syntax or behavior, call lookup_docs BEFORE writing the query
3. For common patterns (top-N, running totals, pivots), call get_sql_pattern for a template
4. PLAN your approach based on what you discovered
5. WRITE and TEST your SQL incrementally -- run it via execute_sql to check results
6. VERIFY results look reasonable (right number of rows, right columns, sensible values)
7. Call review_sql to get a second opinion before submitting
8. SUBMIT only when confident

Optimized prompt: default
1. EXPLORE the data first (as in seed)
2. CLARIFY ambiguities -- identify potential traps: NULLs in key columns, case sensitivity, duplicate rows, date formats, and whether counts should be DISTINCT
3. MAP the question to SQL primitives -- explicitly decide join type (INNER vs LEFT), filter placement (WHERE vs HAVING), aggregation scope, and NULL handling before coding
4. Check templates -- call get_sql_pattern and lookup_docs (as in seed)
5. BUILD incrementally -- write and execute_sql each CTE or subquery alone (as in seed)
6. VALIDATE against the question -- re-read the question, then check: correct columns returned? correct filter conditions? DISTINCT where needed? NULL-safe denominators? ordering and limits match?
7. CROSS-CHECK edge cases -- run a quick sanity query (e.g., total counts, min/max values, a spot-check join) to confirm the final result is not inflated by fanout or deflated by over-filtering
8. REVIEW -- call review_sql and address any flagged issues
9. SUBMIT only after incremental checks and review pass.

direct_coder. Strategy: drafts SQL immediately, refines through execution feedback.

Seed prompt: direct_coder

## Strategy: DIRECT CODING
You are an EFFICIENT SQL writer. Write SQL quickly, test, iterate
1. Read the question carefully. Identify the core tables, joins, and aggregations needed
2. Write your best SQL attempt IMMEDIATELY based on the schema
3. Execute it. If errors occur, read the error message carefully and fix
4. If the query runs but results look wrong, investigate specific columns/values
5. Iterate rapidly -- each revision should fix one specific issue
6. Do NOT over-explore. Only investigate columns/values that are directly relevant to errors
7. SUBMIT as soon as the query produces reasonable results.

GEPA added lookup-table awareness and structured error-repair guidance. Key additions over the seed (new or substantially expanded material in bold):

Optimized prompt: direct_coder

## Strategy: DIRECT CODING
1. Read the schema first. Before writing any SQL, identify ALL tables mentioned or implied by the question. Pay special attention to lookup/reference/static tables (e.g., category tables, node tables, type tables) that provide human-readable names or filter criteria -- these almost always require a JOIN. ...
2. Map question terms to schema columns. If the question references a name, label, or category, find which table owns that column. Never filter or select on a column that doesn’t exist in the target table -- use the correct table via JOIN instead
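
The column-ownership rule in the last step above can be illustrated with a minimal sketch. It uses Python's standard-library sqlite3 module on an in-memory database. The two-table schema (category, orders), the helper column_owners, and the example question are hypothetical illustrations, not taken from the paper or its benchmarks. The agent tools named in the prompts (execute_sql, lookup_docs, and so on) are not modeled here.

```python
import sqlite3


def column_owners(conn, column):
    """Return the tables that define `column` (hypothetical helper, case-insensitive)."""
    tables = [row[0] for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]
    owners = []
    for table in tables:
        # PRAGMA table_info rows are (cid, name, type, notnull, dflt_value, pk).
        cols = [info[1].lower()
                for info in conn.execute(f'PRAGMA table_info("{table}")')]
        if column.lower() in cols:
            owners.append(table)
    return owners


# Hypothetical schema: a fact table plus a lookup table holding readable names.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE category (category_id INTEGER PRIMARY KEY, category_name TEXT);
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, category_id INTEGER, amount REAL);
    INSERT INTO category VALUES (1, 'Books'), (2, 'Games');
    INSERT INTO orders VALUES (10, 1, 12.5), (11, 1, 7.5), (12, 2, 30.0);
""")

# Question: "What is the total order amount for the Books category?"
# First, find which table owns the label column before filtering on it.
owners = column_owners(conn, "category_name")

# The label lives in the lookup table, so reach it via JOIN rather than
# filtering on a non-existent orders.category_name column.
total_books = conn.execute("""
    SELECT SUM(o.amount)
    FROM orders AS o
    JOIN category AS c ON c.category_id = o.category_id
    WHERE c.category_name = 'Books'
""").fetchone()[0]

print(owners, total_books)  # ['category'] 20.0
```

In this toy schema, filtering orders directly on category_name would fail with "no such column", which is the kind of schema error the step is meant to prevent.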
