Residual Skill Optimization for Text-to-SQL Ensembles
Pith reviewed 2026-05-22 08:35 UTC · model grok-4.3
The pith
DivSkill-SQL builds complementary Text-to-SQL ensembles by optimizing each new skill on the failures of the current ensemble, improving accuracy without fine-tuning.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By framing skill addition as residual optimization on ensemble failures, DivSkill-SQL produces agentic skills whose collective coverage provably raises the upper bound on correct selection in Text-to-SQL generation, achieving measurable gains across models and dialects without any parameter updates to the underlying language models.
What carries the argument
Residual skill optimization: each successive skill is optimized exclusively on the examples the preceding ensemble fails on, so that it maximizes its marginal contribution to Pass@K.
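A minimal sketch of the residual loop, assuming each skill can be summarized by the set of example IDs it solves. `pick_best_on` is a hypothetical stand-in for the paper's skill optimizer, which this page does not specify; here it only picks, from a fixed pool, the candidate that solves the most residual examples.

```python
from typing import Callable, List, Set

Skill = Set[int]  # toy assumption: a skill is the set of example IDs it solves


def pick_best_on(residual: Set[int], pool: List[Skill]) -> Skill:
    """Hypothetical optimizer: return the pool skill that solves the most residual examples."""
    return max(pool, key=lambda skill: len(skill & residual))


def build_residual_ensemble(
    examples: Set[int],
    pool: List[Skill],
    num_skills: int,
    optimize_on: Callable[[Set[int], List[Skill]], Skill] = pick_best_on,
) -> List[Skill]:
    ensemble: List[Skill] = []
    covered: Set[int] = set()
    for _ in range(num_skills):
        residual = examples - covered  # examples every current skill fails on
        if not residual:
            break  # nothing left to target
        skill = optimize_on(residual, pool)
        ensemble.append(skill)
        covered |= skill & examples
    return ensemble


examples = set(range(10))
pool = [{0, 1, 2, 3, 4}, {0, 1, 2, 3, 5}, {6, 7}, {8}]
ensemble = build_residual_ensemble(examples, pool, num_skills=3)
coverage = len(set().union(*ensemble)) / len(examples)
```

In this toy pool the second pick is `{6, 7}` rather than the near-duplicate `{0, 1, 2, 3, 5}`, because only residual examples count toward the score.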
If this is right
- Accuracy gains hold across different base models such as Opus-4.6 and GPT-5.4.
- Skills trained on one dialect transfer directly to others like Snowflake, BigQuery, and SQLite.
- Performance improvements extend to BIRD-Critic, with a 2.6-point gain.
- Diagnostics reveal up to three times fewer hallucinated schema references and function calls.
- The method requires no fine-tuning of the base models.
Where Pith is reading between the lines
- Similar residual optimization could apply to other ensemble-based generation tasks facing diversity issues.
- The focus on failure cases may yield more reliable diversity than stochastic sampling or prompt variations.
- Automatic selection of skill count could be based on diminishing marginal returns in Pass@K.
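The last point above can be sketched as a stopping rule. This is an illustration only: `min_gain` is a hypothetical threshold, and representing each skill by the set of example IDs it solves is an assumption, not something the paper describes.

```python
from typing import List, Set


def choose_skill_count(skills: List[Set[int]], num_examples: int, min_gain: float) -> int:
    """Return how many skills to keep, stopping once the marginal coverage gain drops below min_gain.

    `skills` is ordered as the skills were added; each entry is the set of example IDs
    that skill solves. `min_gain` is a hypothetical threshold, not a value from the paper.
    """
    covered: Set[int] = set()
    for count, solved in enumerate(skills, start=1):
        gain = len(solved - covered) / num_examples  # marginal increase in coverage
        if gain < min_gain:
            return count - 1
        covered |= solved
    return len(skills)


skills_in_order = [{0, 1, 2, 3, 4}, {5, 6, 7}, {7, 8}, {0, 8}]
kept = choose_skill_count(skills_in_order, num_examples=20, min_gain=0.1)
```

Here the third skill adds only one new example out of twenty, so the rule keeps the first two skills.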
Load-bearing premise
That skills trained only on the ensemble's current failure cases will generate genuinely complementary candidates rather than new sets of correlated mistakes.
What would settle it
A direct check is whether the union of correct queries covered by the full skill ensemble exceeds that of the baseline ensemble by the claimed margin. The claim would fail if selected accuracy plateaus on a new test split despite added skills.
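A minimal sketch of that check, assuming per-member correctness is available as sets of solved example IDs. The member names and data below are illustrative placeholders, not results from the paper.

```python
from typing import Dict, Set


def union_coverage(correct_by_member: Dict[str, Set[int]], num_examples: int) -> float:
    """Fraction of examples solved by at least one ensemble member."""
    covered: Set[int] = set().union(*correct_by_member.values())
    return len(covered) / num_examples


# Toy data: example IDs each member answers correctly (illustrative only).
baseline = {"sample_a": {0, 1, 2}, "sample_b": {0, 1, 3}, "sample_c": {1, 2, 3}}
residual = {"skill_1": {0, 1, 2}, "skill_2": {3, 4, 5}, "skill_3": {6, 7}}

num_examples = 10
coverage_gap = union_coverage(residual, num_examples) - union_coverage(baseline, num_examples)
```

The baseline members overlap heavily, so three candidates cover only four examples; the complementary members cover eight.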
Original abstract
Text-to-SQL ensembles improve over single-candidate generation by drawing multiple SQL candidates and selecting one, but their effectiveness is bounded by Pass@K, the probability that at least one of K candidates is correct. Existing methods source diversity heuristically through stochastic decoding or prompt variants, leaving candidate sets dominated by correlated failures. We present DivSkill-SQL, a residual skill optimization framework that builds complementary agentic Text-to-SQL ensembles without model fine-tuning: each new skill is optimized on examples the current skill ensemble fails on, provably targeting its marginal contribution to Pass@K. On Spider2-Lite, DivSkill-SQL improves selected accuracy by up to +11.1 points on Snowflake and +8.3 on BigQuery over the strongest ensemble baseline, with consistent gains across two base models (Opus-4.6 and GPT-5.4). Skills optimized on a single dialect transfer without retraining across dialects (Snowflake, BigQuery, SQLite) and to a different task formulation, such as BIRD-Critic (+2.6 pts). Error diagnostics show up to 3x fewer hallucinated schema references and function calls, indicating that gains come from genuinely reliable complementary skills rather than surface-form variation.
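To make the bound in the abstract concrete: selected accuracy can never exceed Pass@K, because a selector can only pick among the K candidates it is given. A small sketch with toy data follows; the correctness matrix and the selector's choices are invented for illustration.

```python
from typing import List

# correct[i][j] is True when candidate j for example i is a correct query (toy data).
correct: List[List[bool]] = [
    [True, False, False],
    [False, False, False],
    [False, True, True],
    [False, False, True],
]
# Index of the candidate a (hypothetical) selector picked for each example.
selected: List[int] = [0, 2, 0, 2]

# Pass@K: share of examples where at least one of the K candidates is correct.
pass_at_k = sum(any(row) for row in correct) / len(correct)
# Selected accuracy: share of examples where the picked candidate is correct.
selected_accuracy = sum(row[j] for row, j in zip(correct, selected)) / len(correct)
```

The second example has no correct candidate, so no selector can recover it; the third has correct candidates that this selector missed.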
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces DivSkill-SQL, a residual skill optimization framework for Text-to-SQL ensembles that constructs complementary agentic skills by optimizing each new skill exclusively on the failure cases of the current ensemble. It claims this approach provably targets the marginal contribution to Pass@K without model fine-tuning. On Spider2-Lite, it reports selected accuracy gains of up to +11.1 points on Snowflake and +8.3 on BigQuery over strong baselines, with consistent results across Opus-4.6 and GPT-5.4, transfer to other dialects and BIRD-Critic (+2.6 pts), and up to 3x fewer hallucinated schema references and function calls.
Significance. If the empirical gains are reproducible and the complementary coverage is verified beyond final accuracy, the work offers a practical method for improving ensemble diversity in semantic parsing tasks. The no-fine-tuning constraint and cross-dialect transfer are notable strengths for real-world Text-to-SQL deployment.
major comments (3)
- [Abstract, §3] Abstract and §3: The claim that each skill 'provably targets its marginal contribution to Pass@K' lacks any derivation, proof sketch, or formal argument in the manuscript. The optimization is described as training on current ensemble failures, which risks making the targeting tautological rather than provable; a concrete mathematical justification or counterexample analysis is needed to support this central assertion.
- [§4] §4 (Experiments on Spider2-Lite): The reported accuracy lifts (+11.1 on Snowflake, +8.3 on BigQuery) are presented without error bars, standard deviations, or details on the number of independent runs or random seeds. This undermines assessment of whether the gains are statistically reliable or sensitive to optimization stochasticity.
- [§5] §5 (Error diagnostics): The manuscript reports up to 3x fewer hallucinations but provides no explicit inter-skill error-correlation metric, pairwise failure overlap, or diversity statistic (e.g., Jaccard index on error sets across skills). Without this, it is unclear whether gains arise from reduced correlated errors or simply from higher per-skill reliability on the same base models.
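The §4 comment above asks for run-to-run variability. A minimal sketch of the requested summary, assuming three seeds per configuration; the accuracy values are placeholders chosen for illustration and are not results from the paper.

```python
import statistics
from typing import Dict, List, Tuple

# Placeholder per-seed selected accuracies (percent); NOT results from the paper.
runs: Dict[str, List[float]] = {
    "baseline": [50.0, 52.0, 51.0],
    "divskill": [60.0, 63.0, 60.0],
}

# Mean and sample standard deviation (n - 1 denominator) per configuration.
summary: Dict[str, Tuple[float, float]] = {
    name: (statistics.mean(values), statistics.stdev(values))
    for name, values in runs.items()
}
gain = summary["divskill"][0] - summary["baseline"][0]
```

With only three seeds the standard deviation is itself noisy, so a reported gain is most convincing when it is several standard deviations wide.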
minor comments (1)
- [Abstract] The abstract introduces 'DivSkill-SQL' and 'residual skill optimization' without a one-sentence definition; adding a brief parenthetical expansion would improve immediate readability for readers outside the subfield.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive comments. We address each major point below and will revise the manuscript to incorporate the suggested improvements.
-
Referee: [Abstract, §3] Abstract and §3: The claim that each skill 'provably targets its marginal contribution to Pass@K' lacks any derivation, proof sketch, or formal argument in the manuscript. The optimization is described as training on current ensemble failures, which risks making the targeting tautological rather than provable; a concrete mathematical justification or counterexample analysis is needed to support this central assertion.
Authors: We agree that an explicit formal argument strengthens the central claim. The residual optimization targets marginal contribution because any correct SQL produced on previously uncovered failures directly raises Pass@K for those instances. In the revised manuscript we will add a short proof sketch in §3: let E_k be the set of examples failed by the first k skills; the marginal gain from skill k+1 is P(skill k+1 succeeds on E_k). Optimizing exclusively on E_k maximizes the expected size of newly covered examples under a fixed per-skill success rate on the residual distribution. We will contrast this with non-residual addition (which can redundantly cover already-solved examples) and include a small illustrative counterexample showing the coverage difference. revision: yes
-
Referee: [§4] §4 (Experiments on Spider2-Lite): The reported accuracy lifts (+11.1 on Snowflake, +8.3 on BigQuery) are presented without error bars, standard deviations, or details on the number of independent runs or random seeds. This undermines assessment of whether the gains are statistically reliable or sensitive to optimization stochasticity.
Authors: We accept this criticism. In the revised §4 we will rerun the full DivSkill-SQL pipeline with three distinct random seeds for both the skill optimization and the final selection step, reporting mean selected accuracy and standard deviation for the Snowflake and BigQuery gains. This will quantify variability due to stochasticity in the residual training process. revision: yes
-
Referee: [§5] §5 (Error diagnostics): The manuscript reports up to 3x fewer hallucinations but provides no explicit inter-skill error-correlation metric, pairwise failure overlap, or diversity statistic (e.g., Jaccard index on error sets across skills). Without this, it is unclear whether gains arise from reduced correlated errors or simply from higher per-skill reliability on the same base models.
Authors: We will strengthen the diagnostics. In the revised §5 we will add pairwise Jaccard indices computed on the per-skill failure sets (instances where each skill produces an incorrect SQL) and report the average inter-skill error overlap. We will also compute the correlation of hallucination types (schema references and function calls) across skills. These metrics will be compared against the baseline ensemble to show that residual optimization yields measurably lower error correlation, supporting the claim of complementary rather than merely stronger individual skills. revision: yes
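The pairwise overlap metric promised in the last response can be sketched as follows, assuming each skill's failures are recorded as a set of example IDs. The skill names and failure sets are illustrative placeholders.

```python
from itertools import combinations
from typing import Dict, Set, Tuple


def jaccard(a: Set[int], b: Set[int]) -> float:
    """Jaccard index |a & b| / |a | b|, defined here as 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def pairwise_failure_overlap(failures: Dict[str, Set[int]]) -> Dict[Tuple[str, str], float]:
    """Jaccard index of the failure sets for every unordered pair of skills."""
    return {
        (first, second): jaccard(failures[first], failures[second])
        for first, second in combinations(sorted(failures), 2)
    }


# Toy failure sets (illustrative only).
failures = {"skill_1": {1, 2, 3, 4}, "skill_2": {3, 4, 5}, "skill_3": {6}}
overlap = pairwise_failure_overlap(failures)
mean_overlap = sum(overlap.values()) / len(overlap)
```

A lower mean overlap for the residual ensemble than for the baseline would support the complementarity claim; a similar overlap with higher per-skill accuracy would point to stronger individual skills instead.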
Circularity Check
Residual optimization on failures makes marginal Pass@K targeting definitional by construction
specific steps
-
self definitional
[Abstract]
"each new skill is optimized on examples the current skill ensemble fails on, provably targeting its marginal contribution to Pass@K"
Optimizing exclusively on the ensemble's failure cases ensures by construction that any example the new skill solves there is one every prior skill failed on, so each such success raises Pass@K directly. The 'provably targeting marginal contribution' assertion is therefore equivalent to the residual selection rule itself rather than a derived or independent result.
full rationale
The paper's central methodological claim is that residual skill optimization on current-ensemble failures 'provably targets its marginal contribution to Pass@K'. This reduces directly to the definition of the optimization target: by selecting training examples where all prior skills fail, any skill that succeeds on them necessarily covers exactly the marginal cases that increase Pass@K. No independent derivation, uniqueness theorem, or external benchmark is shown to establish the 'provable' aspect beyond this construction. Empirical gains and error diagnostics are reported separately and do not rescue the load-bearing justification step. This matches partial circularity (score 6) while leaving the reported accuracy numbers as non-circular observations.
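The definitional step described above can be written out explicitly. This is a sketch under assumed notation, none of which is confirmed by the manuscript: $D$ is the evaluation set, $C_k \subseteq D$ the examples solved by at least one of the first $k$ skills, $E_k$ the residual set (assumed non-empty), and $s_{k+1}(x) = 1$ when skill $k+1$ solves $x$.

```latex
\begin{aligned}
\mathrm{Cov}_k &= \frac{|C_k|}{|D|}, \qquad E_k = D \setminus C_k, \\
\Delta_{k+1} &= \mathrm{Cov}_{k+1} - \mathrm{Cov}_k
  = \frac{\bigl|\{x \in E_k : s_{k+1}(x) = 1\}\bigr|}{|D|} \\
  &= \frac{|E_k|}{|D|} \cdot \Pr_{x \sim \mathrm{Unif}(E_k)}\bigl[s_{k+1}(x) = 1\bigr].
\end{aligned}
```

The identity shows that only successes on $E_k$ change coverage, which is why the targeting claim reads as close to definitional. It says nothing about whether a skill optimized on training failures also succeeds on held-out failures, which is the part that would need a separate argument or experiment.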
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Ensemble effectiveness is bounded by Pass@K
invented entities (1)
- DivSkill-SQL residual skill: no independent evidence
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
unclear: Relation between the paper passage and the cited Recognition theorem.
DIVSKILL-SQL, a residual skill optimization framework that builds complementary agentic Text-to-SQL ensembles without model fine-tuning: each new skill is optimized on examples the current skill ensemble fails on, provably targeting its marginal contribution to Pass@K
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
GEPA: Reflective Prompt Evolution Can Outperform Reinforcement Learning
Lakshya A Agrawal, Shangyin Tan, Dilara Soylu, Noah Ziems, Rishi Khare, Krista Opsahl-Ong, Arnav Singhvi, Herumb Shandilya, Michael J Ryan, Meng Jiang, et al. Gepa: Reflective prompt evolution can outperform reinforcement learning. arXiv preprint arXiv:2507.19457, 2025
-
[2]
Anthropic. Introducing Claude Opus 4.6. https://www.anthropic.com/news/claude-opus-4-6, February 2026
-
[3]
Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020
-
[4]
Reliable text-to-sql with adaptive abstention
Kaiwen Chen, Yueting Chen, Nick Koudas, and Xiaohui Yu. Reliable text-to-sql with adaptive abstention. Proc. ACM Manag. Data, 3(1), February 2025. doi: 10.1145/3709719. URL https://doi.org/10.1145/3709719
-
[5]
Evaluating Large Language Models Trained on Code
Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde De Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374, 2021
-
[6]
Zhipeng Chen, Xiaobo Qin, Youbin Wu, Yue Ling, Qinghao Ye, Wayne Xin Zhao, and Guang Shi. Pass@k training for adaptively balancing exploration and exploitation of large reasoning models. arXiv preprint arXiv:2508.10751, 2025
-
[7]
Training Verifiers to Solve Math Word Problems
Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021.
-
[8]
PARSQL: Enhancing text-to-SQL through SQL parsing and reasoning
Yaxun Dai, Haiqin Yang, Mou Hao, and Pingfu Chao. PARSQL: Enhancing text-to-SQL through SQL parsing and reasoning. In Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pilehvar, editors, Findings of the Association for Computational Linguistics: ACL 2025, pages 661–681, Vienna, Austria, July 2025. Association for Computational Linguistics...
-
[9]
Reforce: A text-to-sql agent with self-refinement, format restriction, and column exploration
Minghang Deng, Ashwin Ramachandran, Canwen Xu, Lanxiang Hu, Zhewei Yao, Anupam Datta, and Hao Zhang. Reforce: A text-to-sql agent with self-refinement, format restriction, and column exploration. In ICLR 2025 Workshop: VerifAI: AI Verification in the Wild, 2025
-
[10]
C3: Zero-shot text-to-sql with chatgpt
Xuemei Dong, Chao Zhang, Yuhang Ge, Yuren Mao, Yunjun Gao, Jinshu Lin, Dongfang Lou, et al. C3: Zero-shot text-to-sql with chatgpt. arXiv preprint arXiv:2307.07306, 2023
-
[11]
Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution
Chrisantha Fernando, Dylan Banarse, Henryk Michalewski, Simon Osindero, and Tim Rocktäschel. Promptbreeder: Self-referential self-improvement via prompt evolution. arXiv preprint arXiv:2309.16797, 2023
-
[12]
Dawei Gao, Haibin Wang, Yaliang Li, Xiuyu Sun, Yichen Qian, Bolin Ding, and Jingren Zhou. Text-to-sql empowered by large language models: A benchmark evaluation. arXiv preprint arXiv:2308.15363, 2023
-
[13]
SQLForge: Synthesizing reliable and diverse data to enhance text-to-SQL reasoning in LLMs
Yu Guo, Dong Jin, Shenghao Ye, Shuangwu Chen, Jian Yang, and Xiaobin Tan. SQLForge: Synthesizing reliable and diverse data to enhance text-to-SQL reasoning in LLMs. In Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pilehvar, editors, Findings of the Association for Computational Linguistics: ACL 2025, pages 8441–8452, Vienna, Austria, J...
-
[14]
doi: 10.18653/v1/2025.findings-acl.443. URL https://aclanthology.org/2025.findings-acl.443/
-
[15]
V-star: Training verifiers for self-taught reasoners
Arian Hosseini, Xingdi Yuan, Nikolay Malkin, Aaron Courville, Alessandro Sordoni, and Rishabh Agarwal. V-star: Training verifiers for self-taught reasoners. arXiv preprint arXiv:2402.06457, 2024
-
[16]
MCS-SQL: Leveraging multiple prompts and multiple-choice selection for text-to-SQL generation
Dongjun Lee, Choongwon Park, Jaehyuk Kim, and Heesoo Park. MCS-SQL: Leveraging multiple prompts and multiple-choice selection for text-to-SQL generation. In Owen Rambow, Leo Wanner, Marianna Apidianaki, Hend Al-Khalifa, Barbara Di Eugenio, and Steven Schockaert, editors, Proceedings of the 31st International Conference on Computational Linguistics, pages 3...
-
[17]
Fangyu Lei, Jixuan Chen, Yuxiao Ye, Ruisheng Cao, Dongchan Shin, Hongjin Su, Zhaoqing Suo, Hongcheng Gao, Wenjing Hu, Pengcheng Yin, et al. Spider 2.0: Evaluating language models on real-world enterprise text-to-sql workflows. arXiv preprint arXiv:2411.07763, 2024
-
[18]
The dawn of natural language to sql: Are we fully ready?
Boyan Li, Yuyu Luo, Chengliang Chai, Guoliang Li, and Nan Tang. The dawn of natural language to sql: Are we fully ready? Proc. VLDB Endow., 17(11):3318–3331, July 2024. ISSN 2150-8097. doi: 10.14778/3681954.3682003. URL https://doi.org/10.14778/3681954.3682003
-
[19]
Codes: Towards building open-source language models for text-to-sql
Haoyang Li, Jing Zhang, Hanbing Liu, Ju Fan, Xiaokang Zhang, Jun Zhu, Renjie Wei, Hongyan Pan, Cuiping Li, and Hong Chen. Codes: Towards building open-source language models for text-to-sql. Proc. ACM Manag. Data, 2(3), May 2024. doi: 10.1145/3654930. URL https://doi.org/10.1145/3654930
-
[20]
Jinyang Li, Binyuan Hui, Ge Qu, Jiaxi Yang, Binhua Li, Bowen Li, Bailin Wang, Bowen Qin, Ruiying Geng, Nan Huo, et al. Can llm already serve as a database interface? a big bench for large-scale database grounded text-to-sqls. Advances in Neural Information Processing Systems, 36, 2024
-
[21]
Jinyang Li, Xiaolong Li, Ge Qu, Per Jacobsson, Bowen Qin, Binyuan Hui, Shuzheng Si, Nan Huo, Xiaohan Xu, Yue Zhang, et al. Swe-sql: Illuminating llm pathways to solve user sql issues in real-world applications. arXiv preprint arXiv:2506.18951, 2025
-
[22]
Hunter Lightman, Vineet Kosaraju, Yuri Burda, Harrison Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let’s verify step by step. In The twelfth international conference on learning representations, 2023
-
[23]
Yifu Liu, Yin Zhu, Yingqi Gao, Zhiling Luo, Xiaoxia Li, Xiaorong Shi, Yuntao Hong, Jinyang Gao, Yu Li, Bolin Ding, et al. Xiyan-sql: A novel multi-generator framework for text-to-sql. IEEE Transactions on Knowledge and Data Engineering, 2026.
-
[24]
SQL-r1: Training natural language to SQL reasoning model by reinforcement learning
Peixian MA, Xialie Zhuang, Chengjin Xu, Xuhui Jiang, Ran Chen, and Jian Guo. SQL-r1: Training natural language to SQL reasoning model by reinforcement learning. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2026. URL https://openreview.net/forum?id=hgJQcuDwm1
-
[25]
SkillClaw: Let Skills Evolve Collectively with Agentic Evolver
Ziyu Ma, Shidong Yang, Yuxiang Ji, Xucong Wang, Yong Wang, Yiming Hu, Tongwen Huang, and Xiangxiang Chu. Skillclaw: Let skills evolve collectively with agentic evolver. arXiv preprint arXiv:2604.08377, 2026
-
[26]
George L Nemhauser, Laurence A Wolsey, and Marshall L Fisher. An analysis of approximations for maximizing submodular set functions-I. Mathematical programming, 14(1):265–294, 1978
-
[27]
Lever: Learning to verify language-to-code generation with execution
Ansong Ni, Srini Iyer, Dragomir Radev, Veselin Stoyanov, Wen-tau Yih, Sida Wang, and Xi Victoria Lin. Lever: Learning to verify language-to-code generation with execution. In International Conference on Machine Learning, pages 26106–26128. PMLR, 2023
-
[28]
Introducing GPT-5.4
OpenAI. Introducing GPT-5.4. https://openai.com/index/introducing-gpt-5-4/, March 2026
-
[29]
Boosted prompt ensembles for large language models
Silviu Pitis, Michael R Zhang, Andrew Wang, and Jimmy Ba. Boosted prompt ensembles for large language models. arXiv preprint arXiv:2304.05970, 2023
-
[30]
Mohammadreza Pourreza and Davood Rafiei. Din-sql: Decomposed in-context learning of text-to-sql with self-correction. Advances in neural information processing systems, 36:36339–36348, 2023
-
[31]
Evaluating cross-domain text-to-sql models and benchmarks
Mohammadreza Pourreza and Davood Rafiei. Evaluating cross-domain text-to-sql models and benchmarks. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 1601–1611, 2023
-
[32]
Mohammadreza Pourreza, Hailong Li, Ruoxi Sun, Yeounoh Chung, Shayan Talaei, Gaurav Tarlok Kakkar, Yu Gan, Amin Saberi, Fatma Ozcan, and Sercan O Arik. Chase-sql: Multi-path reasoning and preference optimized candidate selection in text-to-sql. arXiv preprint arXiv:2410.01943, 2024
-
[33]
Mohammadreza Pourreza, Shayan Talaei, Ruoxi Sun, Xingchen Wan, Hailong Li, Azalia Mirhoseini, Amin Saberi, Sercan Arik, et al. Reasoning-sql: Reinforcement learning with sql tailored partial rewards for reasoning-enhanced text-to-sql. arXiv preprint arXiv:2503.23157, 2025
-
[34]
Autoprompt: Eliciting knowledge from language models with automatically generated prompts
Taylor Shin, Yasaman Razeghi, Robert L Logan IV, Eric Wallace, and Sameer Singh. Autoprompt: Eliciting knowledge from language models with automatically generated prompts. In Proceedings of the 2020 conference on empirical methods in natural language processing (EMNLP), pages 4222–4235, 2020
-
[35]
Exploring chain of thought style prompting for text-to-sql
Chang-Yu Tai, Ziru Chen, Tianshu Zhang, Xiang Deng, and Huan Sun. Exploring chain of thought style prompting for text-to-sql. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 5376–5393, 2023
-
[36]
CHESS: Contextual Harnessing for Efficient SQL Synthesis
Shayan Talaei, Mohammadreza Pourreza, Yu-Chen Chang, Azalia Mirhoseini, and Amin Saberi. Chess: Contextual harnessing for efficient sql synthesis. arXiv preprint arXiv:2405.16755, 2024
-
[37]
Christian Walder and Deep Karkhanis. Pass@k policy optimization: Solving harder reinforcement learning problems. arXiv preprint arXiv:2505.15201, 2025
-
[38]
Mac-sql: A multi-agent collaborative framework for text-to-sql
Bing Wang, Changyu Ren, Jian Yang, Xinnian Liang, Jiaqi Bai, Linzheng Chai, Zhao Yan, Qian-Wen Zhang, Di Yin, Xing Sun, et al. Mac-sql: A multi-agent collaborative framework for text-to-sql. In Proceedings of the 31st International Conference on Computational Linguistics, pages 540–557, 2025
-
[39]
SkillRL: Evolving Agents via Recursive Skill-Augmented Reinforcement Learning
Peng Xia, Jianwen Chen, Hanyang Wang, Jiaqi Liu, Kaide Zeng, Yu Wang, Siwei Han, Yiyang Zhou, Xujiang Zhao, Haifeng Chen, et al. Skillrl: Evolving agents via recursive skill-augmented reinforcement learning. arXiv preprint arXiv:2602.08234, 2026
-
[40]
Decomposition for enhancing attention: Improving LLM-based text-to-SQL through workflow paradigm
Yuanzhen Xie, Xinzhou Jin, Tao Xie, Mingxiong Lin, Liang Chen, Chenyun Yu, Lei Cheng, Chengxiang Zhuo, Bo Hu, and Zang Li. Decomposition for enhancing attention: Improving LLM-based text-to-SQL through workflow paradigm. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar, editors, Findings of the Association for Computational Linguistics: ACL 2024, pages 107...
-
[41]
Large language models as optimizers
Chengrun Yang, Xuezhi Wang, Yifeng Lu, Hanxiao Liu, Quoc V Le, Denny Zhou, and Xinyun Chen. Large language models as optimizers. In The Twelfth International Conference on Learning Representations, 2023.
-
[42]
MARS-SQL: A multi-agent reinforcement learning framework for Text-to-SQL
Haolin Yang, Jipeng Zhang, Zhitao He, and Yi R Fung. Mars-sql: A multi-agent reinforcement learning framework for text-to-sql. arXiv preprint arXiv:2511.01008, 2025
-
[43]
Synthesizing text-to-SQL data from weak and strong LLMs
Jiaxi Yang, Binyuan Hui, Min Yang, Jian Yang, Junyang Lin, and Chang Zhou. Synthesizing text-to-SQL data from weak and strong LLMs. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar, editors, Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 7864–7875, Bangkok, Thailand, August 2024. Assoc...
-
[44]
Jian Yao, Ran Cheng, Xingyu Wu, Jibin Wu, and Kay Chen Tan. Diversity-aware policy optimization for large language model reasoning. arXiv preprint arXiv:2505.23433, 2025
-
[45]
Tao Yu, Rui Zhang, Kai Yang, Michihiro Yasunaga, Dongxu Wang, Zifan Li, James Ma, Irene Li, Qingning Yao, Shanelle Roman, et al. Spider: A large-scale human-labeled dataset for complex and cross-domain semantic parsing and text-to-sql task. In Proceedings of the 2018 conference on empirical methods in natural language processing, pages 3911–3921, 2018
-
[46]
Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?
Yang Yue, Zhiqi Chen, Rui Lu, Andrew Zhao, Zhaokai Wang, Shiji Song, and Gao Huang. Does reinforcement learning really incentivize reasoning capacity in llms beyond the base model? arXiv preprint arXiv:2504.13837, 2025
-
[47]
Optimizing reasoning for text-to-SQL with execution feedback
Bohan Zhai, Canwen Xu, Yuxiong He, and Zhewei Yao. Optimizing reasoning for text-to-SQL with execution feedback. In Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pilehvar, editors, Findings of the Association for Computational Linguistics: ACL 2025, pages 19206–19218, Vienna, Austria, July 2025. Association for Computational Linguistic...
-
[48]
Equipping agents for the real world with agent skills
Barry Zhang, Keith Lazuka, and Mahesh Murag. Equipping agents for the real world with agent skills, October 2025. URL https://www.anthropic.com/engineering/equipping-agents-for-the-real-world-with-agent-skills. Accessed: 2026-01-28
-
[49]
CoEvoSkills: Self-Evolving Agent Skills via Co-Evolutionary Verification
Hanrong Zhang, Shicheng Fan, Henry Peng Zou, Yankai Chen, Zhenting Wang, Jiayu Zhou, Chengze Li, Wei-Chieh Huang, Yifei Yao, Kening Zheng, et al. Evoskills: Self-evolving agent skills via co-evolutionary verification. arXiv preprint arXiv:2604.01687, 2026
-
[50]
MemSkill: Learning and Evolving Memory Skills for Self-Evolving Agents
Haozhen Zhang, Quanyu Long, Jianzhu Bao, Tao Feng, Weizhi Zhang, Haodong Yue, and Wenya Wang. Memskill: Learning and evolving memory skills for self-evolving agents. arXiv preprint arXiv:2602.02474, 2026
-
[51]
Structure-guided large language models for text-to-SQL generation
Qinggang Zhang, Hao Chen, Junnan Dong, Shengyuan Chen, Feiran Huang, and Xiao Huang. Structure-guided large language models for text-to-SQL generation. In Forty-second International Conference on Machine Learning,
-
[52]
URL https://openreview.net/forum?id=gT8JSEFqaS
-
[53]
Memento-skills: Let agents design agents
Huichi Zhou, Siyuan Guo, Anjie Liu, Zhongwei Yu, Ziqin Gong, Bowen Zhao, Zhixun Chen, Menglong Zhang, Yihang Chen, Jinsong Li, et al. Memento-skills: Let agents design agents. arXiv preprint arXiv:2603.18743, 2026
-
[54]
Large language models are human-level prompt engineers
Yongchao Zhou, Andrei Ioan Muresanu, Ziwen Han, Keiran Paster, Silviu Pitis, Harris Chan, and Jimmy Ba. Large language models are human-level prompt engineers. In The eleventh international conference on learning representations, 2022.
A Proofs
A.1 Proof of Proposition A.1
We analyze the po...
1. EXPLORE first -- run queries to understand the data: check table structures, column values, data types, join keys, actual string values in the data
2. If unsure about any SQL function’s syntax or behavior, call lookup_docs BEFORE writing the query
3. For common patterns (top-N, running totals, pivots), call get_sql_pattern for a template
4. PLAN your approach based on what you discovered
5. WRITE and TEST your SQL incrementally -- run it via execute_sql to check results
6. VERIFY results look reasonable (right number of rows, right columns, sensible values)
7. Call review_sql to get a second opinion before submitting
8. SUBMIT only when confident
Optimized prompt: default
1. EXPLORE the data first (as in seed)
2. CLARIFY ambiguities -- identify potential traps: NULLs in key columns, case sensitivity, duplicate rows, date formats, and whether counts should be DISTINCT
3. MAP the question to SQL primitives -- explicitly decide join type (INNER vs LEFT), filter placement (WHERE vs HAVING), aggregation scope, and NULL handling before coding
4. Check templates -- call get_sql_pattern and lookup_docs (as in seed)
5. BUILD incrementally -- write and execute_sql each CTE or subquery alone (as in seed)
6. VALIDATE against the question -- re-read the question, then check: correct columns returned? correct filter conditions? DISTINCT where needed? NULL-safe denominators? ordering and limits match?
7. CROSS-CHECK edge cases -- run a quick sanity query (e.g., total counts, min/max values, a spot-check join) to confirm the final result is not inflated by fanout or deflated by over-filtering
8. REVIEW -- call review_sql and address any flagged issues
9. SUBMIT only after incremental checks and review pass.
direct_coder. Strategy: drafts SQL immediately, refines through execution feedback.
Seed prompt: direct_coder
## Strategy: DIRECT CODING
You are an EFFICIENT SQL writer. Write SQL quickly, test, iterate
1. Read the question carefully. Identify the core tables, joins, and aggregations needed
2. Write your best SQL attempt IMMEDIATELY based on the schema
3. Execute it. If errors occur, read the error message carefully and fix
4. If the query runs but results look wrong, investigate specific columns/values
5. Iterate rapidly -- each revision should fix one specific issue
6. Do NOT over-explore. Only investigate columns/values that are directly relevant to errors
7. SUBMIT as soon as the query produces reasonable results.
GEPA added lookup-table awareness and structured error-repair guidance. Key additions over the seed (new or substantially expanded material in bold):
Optimized prompt: direct_coder
## Strategy: DIRECT CODING
1. Read the schema first. Before writing any SQL, identify ALL tables mentioned or implied by the question. Pay special attention to lookup/reference/static tables (e.g., category tables, node tables, type tables) that provide human-readable names or filter criteria -- these almost always require a JOIN. ...
2. Map question terms to schema columns. If the question references a name, label, or category, find which table owns that column. Never filter or select on a column that doesn’t exist in the target table -- use the correct table via JOIN instead