MARS-SQL: A multi-agent reinforcement learning framework for Text-to-SQL

arXiv: 2511.01008 · v2 · submitted 2025-11-02 · 💻 cs.CL

Haolin Yang, Jipeng Zhang, Zhitao He, Alexander Zhou, Yi R. Fung

Pith reviewed 2026-05-18 01:27 UTC · model grok-4.3

classification 💻 cs.CL

keywords Text-to-SQL · Multi-agent reinforcement learning · ReAct framework · SQL query generation · Large language models · Interactive agents · Trajectory ranking

The pith

MARS-SQL trains a multi-agent system with reinforcement learning so it can execute SQL on a live database and refine queries from feedback.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to show that static prompting limits large language models on precise Text-to-SQL tasks and that a trainable multi-agent workflow can close the gap. It decomposes the work into schema grounding, query generation, and validation, then trains the generation agent inside a ReAct-style loop so the agent learns to issue intermediate SQL commands, receive execution results, and adjust its plan. A separate validation step ranks candidate trajectories by next-token prediction probabilities rather than external scoring. If the approach holds, models move from one-shot generation to interactive, self-correcting behavior that improves accuracy on complex schema and logic problems.

Core claim

MARS-SQL decomposes Text-to-SQL into three specialized roles and trains the query-generation agent with a multi-turn RL policy inside a ReAct loop. The agent reasons, executes intermediate SQL statements against a live database, and updates its strategy from execution feedback. Solution selection is cast as a generative modeling task that picks the best trajectory by next-token prediction probabilities. This coupling of interactive learning and trajectory ranking produces execution accuracies of 77.84 percent on the BIRD development set and 89.75 percent on the Spider test set while transferring to out-of-domain benchmarks.
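The interaction pattern described above can be illustrated with a minimal sketch. It is not the authors' implementation: `scripted_policy` is a hypothetical stand-in for the trained generation agent, the `player` table is invented for the example, and SQLite in memory plays the role of the live database. The sketch only shows the shape of the loop: reason, execute an intermediate SQL action, read the execution feedback, and revise.

```python
import sqlite3

def run_sql(conn, sql, max_rows=5):
    """Execute one SQL action and return a short textual observation."""
    try:
        rows = conn.execute(sql).fetchmany(max_rows)
        return f"OK: {rows}"
    except sqlite3.Error as exc:
        return f"ERROR: {exc}"

def scripted_policy(question, history):
    """Hypothetical stand-in for the trained generation agent.

    It replays a fixed script (a wrong first attempt, then a fix) and
    ignores the question. A real policy would condition on both the
    question and the observations in the history.
    """
    script = [
        ("Try the obvious column name.",
         "SELECT name FROM player WHERE goals > 10", False),
        ("The column is player_name; retry.",
         "SELECT player_name FROM player WHERE goals > 10", True),
    ]
    return script[min(len(history), len(script) - 1)]

def react_loop(conn, question, policy, max_turns=4):
    """Alternate thought -> SQL action -> observation for up to max_turns."""
    history = []
    for _ in range(max_turns):
        thought, sql, is_final = policy(question, history)
        observation = run_sql(conn, sql)
        history.append((thought, sql, observation))
        # Stop only when the policy marks the query final and it executed.
        if is_final and observation.startswith("OK"):
            break
    return history

# Toy database, invented for this example.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE player (player_name TEXT, goals INTEGER)")
conn.executemany("INSERT INTO player VALUES (?, ?)", [("Ana", 12), ("Bo", 3)])

trajectory = react_loop(
    conn, "Which players scored more than 10 goals?", scripted_policy
)
for thought, sql, observation in trajectory:
    print(thought, "|", sql, "|", observation)
```

In this toy run the first action fails with a column error and the second succeeds, which is the kind of execution feedback the paper says the agent learns from during RL training.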

What carries the argument

The multi-turn RL policy inside a ReAct-style loop that lets the generation agent issue SQL actions, observe execution results, and refine its plan, together with next-token probability ranking for selecting the final trajectory.
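One plausible reading of the ranking step, sketched under assumptions: a verifier model is asked whether a candidate trajectory is correct, and the probability it assigns to a "Yes" token is used as the score. The field names `logit_yes` and `logit_no`, the two-way softmax, and the example logits are all illustrative; the paper's actual prompt and scoring details are not reproduced here.

```python
import math

def p_yes(logit_yes, logit_no):
    """Probability of the 'Yes' token under a two-way softmax (assumed scoring rule)."""
    m = max(logit_yes, logit_no)  # subtract the max for numerical stability
    e_yes = math.exp(logit_yes - m)
    e_no = math.exp(logit_no - m)
    return e_yes / (e_yes + e_no)

def select_trajectory(candidates):
    """Return the candidate whose verifier score P(Yes) is highest."""
    return max(candidates, key=lambda c: p_yes(c["logit_yes"], c["logit_no"]))

# Hypothetical candidates with made-up verifier logits.
candidates = [
    {"sql": "SELECT COUNT(*) FROM player",
     "logit_yes": 0.2, "logit_no": 1.1},
    {"sql": "SELECT COUNT(*) FROM player WHERE goals > 10",
     "logit_yes": 2.3, "logit_no": -0.4},
    {"sql": "SELECT COUNT(player_name) FROM player WHERE goals >= 10",
     "logit_yes": 1.0, "logit_no": 0.9},
]

best = select_trajectory(candidates)
print(best["sql"])
```

The selection needs no execution or gold label at inference time, which is what makes the premise below load-bearing: the score is only as good as the verifier's calibration.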

If this is right

Execution accuracy reaches state-of-the-art levels on both the BIRD development set and the Spider test set.
Performance transfers strongly to out-of-domain Text-to-SQL benchmarks.
The agentic workflow becomes trainable through reinforcement learning instead of relying on fixed prompts.
Interactive execution feedback replaces purely static generation for complex schema alignment.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same RL loop could be applied to other structured generation tasks such as API call composition or data analysis scripts.
Probability-based trajectory ranking offers a label-free way to select among multiple agent paths in other domains.
Scaling the approach to larger models or longer interaction horizons may further reduce errors on very intricate queries.

Load-bearing premise

Next-token prediction probabilities alone can reliably identify the best interaction trajectory without human labels or external verifiers.

What would settle it

Run the trained agent on a fresh set of queries and check whether the trajectory assigned the highest next-token probability is also the one that produces correct execution results; if the two rankings diverge often, the validation step fails.
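The check described above can be sketched as a small report, assuming each question comes with a list of candidate trajectories, each carrying a ranking score and a flag saying whether its final SQL executed correctly. The data layout and the numbers are invented for illustration; the point is the gap between what the ranker picks and what an oracle could have picked.

```python
def selection_report(questions):
    """Compare ranker choices with an oracle over candidate sets.

    questions: list of candidate lists; each candidate is a
    (score, is_correct) pair. This layout is an assumption of the sketch.
    """
    n = len(questions)
    # Questions where the top-scored candidate is correct.
    picked = sum(1 for cands in questions if max(cands, key=lambda c: c[0])[1])
    # Questions where at least one candidate is correct (oracle upper bound).
    oracle = sum(1 for cands in questions if any(ok for _, ok in cands))
    return {
        "selected_acc": picked / n,
        "oracle_acc": oracle / n,
        "missed": oracle - picked,  # solvable questions the ranker got wrong
    }

# Made-up example: three questions, two candidates each.
questions = [
    [(0.9, True), (0.4, False)],   # ranker picks the correct one
    [(0.8, False), (0.6, True)],   # a correct candidate exists but is outranked
    [(0.7, False), (0.2, False)],  # no correct candidate at all
]

report = selection_report(questions)
print(report)
```

A large `missed` count relative to `oracle_acc` would indicate that the probability ranking, rather than the generator, is the bottleneck.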

Figures

Figures reproduced from arXiv: 2511.01008 by Alexander Zhou, Haolin Yang, Jipeng Zhang, Yi R. Fung, Zhitao He.

**Figure 3.** Execution accuracy on Bird-dev of models fine-tuned with different maximum interaction turns (T), evaluated …

**Figure 4.** Comparison of different selection strategies.

Original abstract

Large Language Models (LLMs) often struggle with the precise logic and schema alignment required for complex Text-to-SQL tasks. While current methods rely heavily on static prompting, they lack the ability to dynamically adapt and self-correct through environmental interaction. To bridge this gap, we propose MARS-SQL, a trainable multi-agent framework for Text-to-SQL. Rather than introducing a new standalone SQL primitive, MARS-SQL makes an agentic workflow trainable by decomposing the problem into three specialized roles: schema grounding, query generation, and solution validation. Central to our approach is a generation agent trained via a multi-turn RL policy within a ReAct-style loop. The agent learns to iteratively reason, execute intermediate SQL actions on a live database, and refine its strategy based on execution feedback. To improve robustness, we further introduce a validation mechanism that treats solution selection as a generative modeling task, identifying the optimal interaction trajectory through next-token prediction probabilities. Empirical evaluations demonstrate the effectiveness of coupling interactive learning with trajectory ranking. MARS-SQL achieves state-of-the-art performance, recording an execution accuracy of 77.84% on the BIRD development dataset and 89.75% on the Spider test dataset, while also transferring strongly to out-of-domain benchmarks. Code is available at https://github.com/YangHaolin0526/MARS-SQL.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

MARS-SQL packages multi-turn RL on live DB feedback with next-token probability ranking for trajectory selection, but that ranking step looks like the weakest link in the SOTA claims.

MARS-SQL decomposes Text-to-SQL into schema grounding, query generation, and validation agents, then trains the generator with multi-turn RL inside a ReAct loop that executes SQL on the actual database and refines based on feedback. The final selection step ranks trajectories by the model's own next-token probabilities rather than running extra verification. That specific combination of interactive RL plus generative ranking is presented as the new trainable workflow, and the paper reports 77.84% execution accuracy on BIRD dev and 89.75% on Spider test plus some out-of-domain transfer. Code is released, which is straightforward to check.

Referee Report

2 major / 1 minor

Summary. The paper proposes MARS-SQL, a multi-agent reinforcement learning framework for Text-to-SQL that decomposes the task into three specialized agents for schema grounding, query generation, and solution validation. A generation agent is trained via multi-turn RL within a ReAct-style loop that incorporates execution feedback from a live database. Solution selection is performed by treating trajectory ranking as a generative modeling task that uses next-token prediction probabilities. The manuscript reports state-of-the-art execution accuracies of 77.84% on the BIRD development set and 89.75% on the Spider test set, along with strong transfer to out-of-domain benchmarks.

Significance. If the reported gains are shown to stem from the trainable multi-agent RL workflow and the proposed trajectory-ranking mechanism rather than from unstated implementation details or baseline choices, the work would constitute a meaningful contribution to agentic approaches for semantic parsing. The emphasis on interactive learning with execution feedback and the public release of code are positive elements that could support further research in making LLM-based Text-to-SQL systems more robust and adaptive.

major comments (2)

[Abstract] The central performance claims (77.84% execution accuracy on BIRD dev, 89.75% on Spider test) are presented without any reported details on reward design, training stability, baseline comparisons, statistical significance testing, or error analysis. These omissions make it impossible to determine whether the gains are attributable to the multi-agent RL policy or the trajectory-ranking procedure.
[Abstract] Validation mechanism (as described in the abstract): The claim that next-token prediction probabilities can reliably identify the optimal interaction trajectory rests on an unverified assumption that these probabilities correlate strongly with execution accuracy and schema correctness. In Text-to-SQL, high-probability outputs can still be syntactically plausible yet semantically incorrect; without an external verifier or explicit correlation analysis, this step risks selecting suboptimal trajectories and weakening both the RL training signal and the final reported results.

minor comments (1)

The manuscript should clarify the precise formulation of the multi-turn RL objective and the exact role of each agent in the ReAct loop to improve reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments. We address each point below and have revised the manuscript accordingly to provide greater clarity on the methodological details and the validation mechanism.

Referee: [Abstract] The central performance claims (77.84% execution accuracy on BIRD dev, 89.75% on Spider test) are presented without any reported details on reward design, training stability, baseline comparisons, statistical significance testing, or error analysis. These omissions make it impossible to determine whether the gains are attributable to the multi-agent RL policy or the trajectory-ranking procedure.

Authors: We agree that the abstract is highly condensed and omits key details that are elaborated in the main body of the paper. To address this, we have revised the abstract to briefly mention the reward design (based on execution accuracy and schema alignment), note that results are averaged over multiple runs with reported standard deviations for stability, and reference the baseline comparisons and error analysis presented in Sections 5 and 6. We believe these additions will help readers better attribute the performance gains to the proposed multi-agent RL workflow and trajectory-ranking procedure.
revision: yes

Referee: [Abstract] Validation mechanism (as described in the abstract): The claim that next-token prediction probabilities can reliably identify the optimal interaction trajectory rests on an unverified assumption that these probabilities correlate strongly with execution accuracy and schema correctness. In Text-to-SQL, high-probability outputs can still be syntactically plausible yet semantically incorrect; without an external verifier or explicit correlation analysis, this step risks selecting suboptimal trajectories and weakening both the RL training signal and the final reported results.

Authors: We appreciate this insightful observation regarding the potential limitations of using next-token prediction probabilities for trajectory ranking. In our framework, the ranking is performed on trajectories that have already undergone execution feedback within the ReAct-style loop, providing an additional layer of validation through actual database interactions. Nevertheless, to strengthen the manuscript, we have added an explicit correlation analysis between the model's log-probabilities and execution accuracy on a validation subset, demonstrating a positive correlation. We also discuss cases where high-probability trajectories may still fail and how the multi-agent setup mitigates this. This revision clarifies the role of the generative ranking without relying solely on an unverified assumption.
revision: yes

Circularity Check

0 steps flagged

No circularity: performance derived from external benchmarks via RL interaction

The paper's central claims rest on empirical execution accuracy measured on independent public benchmarks (BIRD dev at 77.84%, Spider test at 89.75%). These results are obtained by running the trained multi-agent system on live databases and counting correct executions, not by fitting parameters to the target metric or re-deriving the metric from itself. The validation step (next-token probabilities for trajectory selection) is an internal component of the proposed ReAct-style RL loop and does not reduce the reported accuracies to a tautology or self-citation. No equations, uniqueness theorems, or prior self-citations are invoked that would make the SOTA numbers equivalent to the training inputs by construction. The derivation chain is therefore self-contained against external evaluation.
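To make "counting correct executions" concrete, here is a minimal sketch of an execution-accuracy metric. It assumes a prediction counts as correct when it runs without error and returns the same rows as the gold query, compared as an order-insensitive multiset. The official BIRD and Spider evaluation scripts may differ in detail, and the table and queries are invented for the example.

```python
import sqlite3
from collections import Counter

def execute(conn, sql):
    """Return all result rows, or None if the query fails to run."""
    try:
        return conn.execute(sql).fetchall()
    except sqlite3.Error:
        return None

def execution_match(conn, predicted_sql, gold_sql):
    """True when both queries run and return the same rows, ignoring order."""
    pred = execute(conn, predicted_sql)
    gold = execute(conn, gold_sql)
    return pred is not None and gold is not None and Counter(pred) == Counter(gold)

def execution_accuracy(conn, pairs):
    """Fraction of (predicted, gold) pairs whose results match."""
    return sum(execution_match(conn, p, g) for p, g in pairs) / len(pairs)

# Toy database, invented for this example.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE player (player_name TEXT, goals INTEGER)")
conn.executemany(
    "INSERT INTO player VALUES (?, ?)", [("Ana", 12), ("Bo", 3), ("Cy", 15)]
)

gold = "SELECT player_name FROM player WHERE goals > 10"
pairs = [
    # Same rows in a different order: counted as correct.
    ("SELECT player_name FROM player WHERE goals > 10 ORDER BY player_name DESC", gold),
    # Runs, but the filter is wrong: counted as incorrect.
    ("SELECT player_name FROM player WHERE goals >= 3", gold),
    # Fails to execute (misspelled column): counted as incorrect.
    ("SELECT nme FROM player", "SELECT player_name FROM player"),
]

accuracy = execution_accuracy(conn, pairs)
print(f"execution accuracy: {accuracy:.2%}")
```

Because the metric depends only on database results and gold queries, it is independent of the model's own probabilities, which is the basis for the no-circularity finding.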

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 0 invented entities

The framework rests on standard assumptions from reinforcement learning and language model decoding; no new physical or mathematical entities are introduced.

free parameters (2)

RL policy hyperparameters
Learning rate, discount factor, and reward scaling for the multi-turn policy are chosen or tuned but not enumerated in the abstract.
Trajectory ranking threshold
The cutoff or weighting used when selecting the best trajectory from next-token probabilities is a modeling choice.

axioms (2)

domain assumption: Execution feedback from a live database provides a reliable reward signal for refining SQL generation.
Invoked when describing the ReAct-style loop that refines strategy based on execution feedback.
domain assumption: Next-token prediction probabilities from the same model can serve as an effective ranking mechanism for interaction trajectories.
Central to the validation mechanism described in the abstract.

pith-pipeline@v0.9.0 · 5776 in / 1472 out tokens · 42187 ms · 2026-05-18T01:27:56.161920+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

unclear
Relation between the paper passage and the cited Recognition theorem.

The Validation agent... selects the optimal trajectory by modeling verification as a next-token prediction task and choosing the solution with the highest generation probability.
IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean reality_from_one_distinction unclear

unclear
Relation between the paper passage and the cited Recognition theorem.

We train the policy πGen using Group Relative Policy Optimization (GRPO)... reward signal Rgen(τ) used to compute Ai is derived solely from execution outcomes
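The quoted passage mentions advantages computed from execution-based rewards under GRPO. As a hedged illustration, the sketch below follows the group-relative normalization commonly associated with GRPO: each rollout's reward is standardized against the other rollouts sampled for the same question. The binary reward (1.0 for a correct execution, 0.0 otherwise), the use of the population standard deviation, and the `eps` term are assumptions of this sketch, not details confirmed by the source.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Standardize each reward against its own sampling group.

    rewards: one scalar reward per rollout for a single question.
    eps avoids division by zero when every rollout has the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Assumed binary execution rewards for four rollouts of one question.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
print(advantages)

# When all rollouts agree, the group carries no learning signal.
flat = group_relative_advantages([1.0, 1.0, 1.0, 1.0])
print(flat)
```

Correct rollouts receive positive advantages and incorrect ones negative, so the policy update needs no separate value network; a group with identical rewards contributes zero advantage.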

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.


[1] [1]

arXiv:2501.00332 [cs]

URL http://arxiv.org/ abs/2501.00332. arXiv:2501.00332 [cs]. S. Chaturvedi, A. Chadha, and L. Bindschaedler. SQL-of-Thought: Multi-agentic Text-to-SQL with Guided Error Correction, Aug

work page arXiv

[2] [2]

arXiv:2509.00581 [cs]

URLhttp://arxiv.org/abs/2509.00581. arXiv:2509.00581 [cs]. G. Chen, S. Dong, Y . Shu, G. Zhang, J. Sesay, B. F. Karlsson, J. Fu, and Y . Shi. AutoAgents: A Framework for Automatic Agent Generation, Apr

work page arXiv

[3] [3]

Autoagents: A framework for automatic agent generation

URL http://arxiv.org/abs/2309.17288. arXiv:2309.17288 [cs]. DeepSeek-AI, D. Guo, D. Yang, H. Zhang, J. Song, and R. Zhang. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning,

work page arXiv

[4] [4]

URLhttps://arxiv.org/abs/2501.12948. M. Deng, A. Ramachandran, C. Xu, L. Hu, Z. Yao, A. Datta, and H. Zhang. RefoRCE: A text-to-SQL agent with self-refinement, format restriction, and column exploration. InICLR 2025 Workshop: VerifAI: AI Verification in the Wild,

work page internal anchor Pith review Pith/arXiv arXiv 2025

[5] [5]

URLhttps://arxiv.org/abs/2307.07306. Y . Du, S. Li, A. Torralba, J. B. Tenenbaum, and I. Mordatch. Improving factuality and reasoning in language models through multiagent debate,

work page arXiv

[6] [6]

URLhttps://arxiv.org/abs/2305.14325. Y . Gan, X. Chen, and M. Purver. Exploring underexplored limitations of cross-domain text-to-sql generalization,

work page internal anchor Pith review Pith/arXiv arXiv

[7] [7]

URLhttps://arxiv.org/abs/2109.05157. D. Gao, H. Wang, Y . Li, X. Sun, Y . Qian, B. Ding, and J. Zhou. Text-to-sql empowered by large language models: A benchmark evaluation,

work page arXiv

[8] [8]

URLhttps://arxiv.org/abs/2308.15363. Y . Gao, Y . Liu, X. Li, X. Shi, Y . Zhu, Y . Wang, S. Li, W. Li, Y . Hong, Z. Luo, J. Gao, L. Mou, and Y . Li. A preview of xiyan-sql: A multi-generator ensemble framework for text-to-sql,

work page arXiv

[9] [9]

URL https://arxiv.org/abs/ 2411.08599. J. Gu, X. Jiang, Z. Shi, H. Tan, X. Zhai, C. Xu, W. Li, Y . Shen, S. Ma, H. Liu, S. Wang, K. Zhang, Y . Wang, W. Gao, L. Ni, and J. Guo. A Survey on LLM-as-a-Judge, Mar

work page arXiv

[10] [10]

A Survey on LLM-as-a-Judge

URL http://arxiv.org/abs/2411.15594. arXiv:2411.15594 [cs]. L. Gui, C. Gârbacea, and V . Veitch. BoNBoN Alignment for Large Language Models and the Sweetness of Best-of-n Sampling, Nov


URL http://arxiv.org/abs/2406.00832. arXiv:2406.00832 [cs].

Z. He, Z. Liu, P. Li, Y. R. Fung, M. Yan, J. Zhang, F. Huang, and Y. Liu. Advancing Language Multi-Agent Learning with Credit Re-Assignment for Interactive Environment Generalization, Aug


URL http://arxiv.org/abs/2502.14496. arXiv:2502.14496 [cs].

S. Hong, M. Zhuge, J. Chen, X. Zheng, Y. Cheng, C. Zhang, J. Wang, Z. Wang, S. K. S. Yau, Z. Lin, L. Zhou, C. Ran, L. Xiao, C. Wu, and J. Schmidhuber. MetaGPT: Meta Programming for A Multi-Agent Collaborative Framework, Nov


URL http://arxiv.org/abs/2308.00352. arXiv:2308.00352 [cs].

Z. Hong, Z. Yuan, Q. Zhang, H. Chen, J. Dong, F. Huang, and X. Huang. Next-generation database interfaces: A survey of llm-based text-to-sql,


URL https://arxiv.org/abs/2406.08426.

W. Hua, L. Fan, L. Li, K. Mei, J. Ji, Y. Ge, L. Hemphill, and Y. Zhang. War and Peace (WarAgent): Large Language Model-based Multi-Agent Simulation of World Wars, Jan


URL http://arxiv.org/abs/2311.17227. arXiv:2311.17227 [cs].

J.-t. Huang, J. Zhou, T. Jin, X. Zhou, Z. Chen, W. Wang, Y. Yuan, M. R. Lyu, and M. Sap. On the Resilience of LLM-Based Multi-Agent Collaboration with Faulty Agents, May 2025a. URL http://arxiv.org/abs/2408.00989. arXiv:2408.00989 [cs].

Y. Huang, S. Li, Z. Fan, M. LIU, W. Liu, and Y. R. Fung. S...


Qwen2.5-Coder Technical Report

URL http://arxiv.org/abs/2409.12186. arXiv:2409.12186 [cs].

F. Lei, J. Chen, Y. Ye, R. Cao, D. Shin, H. Su, Z. Suo, H. Gao, W. Hu, P. Yin, V. Zhong, C. Xiong, R. Sun, Q. Liu, S. Wang, and T. Yu. Spider 2.0: Evaluating language models on real-world enterprise text-to-sql workflows,


URL https://arxiv.org/abs/2411.07763.

B. Li, Y. Luo, C. Chai, G. Li, and N. Tang. The Dawn of Natural Language to SQL: Are We Fully Ready? Proceedings of the VLDB Endowment, 17(11):3318–3331, July 2024a. ISSN 2150-8097. doi:10.14778/3681954.3682003. URL http://arxiv.org/abs/2406.01265. arXiv:2406.01265 [cs].

H. Li, J. Zhang, H. Liu, J. Fan, X. Zhang, J. Zh...


URL https://arxiv.org/abs/2305.03111.

J. Li, X. Li, G. Qu, P. Jacobsson, B. Qin, B. Hui, S. Si, N. Huo, X. Xu, Y. Zhang, Z. Tang, Y. Li, F. Widjaja, X. Zhu, F. Zhou, Y. Huang, Y. Papakonstantinou, F. Ozcan, C. Ma, and R. Cheng. SWE-SQL: Illuminating LLM Pathways to Solve User SQL Issues in Real-World Applications, July 2025b. URL http://arxiv.org/abs/25...


ISSN 0036-8075, 1095-9203. doi:10.1126/science.abq1158. URL http://arxiv.org/abs/2203.07814. arXiv:2203.07814 [cs].

S. Liu, S. Hegde, S. Cao, A. Zhu, D. Li, T. Griggs, E. Tang, A. Malik, K. Hakhamaneshi, R. Liaw, P. Moritz, M. Zaharia, J. E. Gonzalez, and I. Stoica. Skyrl-sql: Matching gpt-4o and o4-mini on text2sql with multi-turn rl, 2025a.

Y. Liu, Y. ...


Accessed: 2024-06-09.

S. Min, X. Lyu, A. Holtzman, M. Artetxe, M. Lewis, H. Hajishirzi, and L. Zettlemoyer. Rethinking the Role of Demonstrations: What Makes In-Context Learning Work?, Oct


URL http://arxiv.org/abs/2202.12837. arXiv:2202.12837 [cs].

A. Ni, S. Iyer, D. Radev, V. Stoyanov, W.-t. Yih, S. I. Wang, and X. V. Lin. LEVER: Learning to Verify Language-to-Code Generation with Execution, Sept


URL http://arxiv.org/abs/2302.08468. arXiv:2302.08468 [cs].

OpenAI. Gpt-4 technical report,


URL https://arxiv.org/abs/2412.16720.

M. Pourreza and D. Rafiei. Din-sql: Decomposed in-context learning of text-to-sql with self-correction,


URL https://arxiv.org/abs/2304.11015.

M. Pourreza, H. Li, R. Sun, Y. Chung, S. Talaei, G. T. Kakkar, Y. Gan, A. Saberi, F. Ozcan, and S. O. Arik. Chase-sql: Multi-path reasoning and preference optimized candidate selection in text-to-sql,


URL https://arxiv.org/abs/2410.01943.

M. Pourreza, S. Talaei, R. Sun, X. Wan, H. Li, A. Mirhoseini, A. Saberi, and S. O. Arik. Reasoning-SQL: Reinforcement Learning with SQL Tailored Partial Rewards for Reasoning-Enhanced Text-to-SQL, Apr


URL http://arxiv.org/abs/2503.23157. arXiv:2503.23157 [cs].

C. Qian, W. Liu, H. Liu, N. Chen, Y. Dang, J. Li, C. Yang, W. Chen, Y. Su, X. Cong, J. Xu, D. Li, Z. Liu, and M. Sun. ChatDev: Communicative Agents for Software Development, June


URL http://arxiv.org/abs/2307.07924. arXiv:2307.07924 [cs].

Y. Qin, S. Liang, Y. Ye, K. Zhu, L. Yan, Y. Lu, Y. Lin, X. Cong, X. Tang, B. Qian, S. Zhao, L. Hong, R. Tian, R. Xie, J. Zhou, M. Gerstein, D. Li, Z. Liu, and M. Sun. ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs, Oct


URL http://arxiv.org/abs/2307.16789. arXiv:2307.16789 [cs].

G. Qu, J. Li, B. Qin, X. Li, N. Huo, C. Ma, and R. Cheng. Share: An slm-based hierarchical action correction assistant for text-to-sql,


URL https://arxiv.org/abs/2506.00391.

Z. Shao, P. Wang, Q. Zhu, R. Xu, J. Song, X. Bi, H. Zhang, M. Zhang, Y. K. Li, Y. Wu, and D. Guo. DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models, Apr


URL http://arxiv.org/abs/2402.03300. arXiv:2402.03300 [cs].

G. Sheng, C. Zhang, Z. Ye, X. Wu, W. Zhang, R. Zhang, Y. Peng, H. Lin, and C. Wu. Hybridflow: A flexible and efficient rlhf framework. arXiv preprint arXiv:2409.19256,


URL https://arxiv.org/abs/2305.14215.

S. Talaei, M. Pourreza, Y.-C. Chang, A. Mirhoseini, and A. Saberi. CHESS: Contextual Harnessing for Efficient SQL Synthesis, Nov


URL http://arxiv.org/abs/2405.16755. arXiv:2405.16755 [cs].

B. Wang, C. Ren, J. Yang, X. Liang, J. Bai, L. Chai, Z. Yan, Q.-W. Zhang, D. Yin, X. Sun, and Z. Li. Mac-sql: A multi-agent collaborative framework for text-to-sql, 2025a. URL https://arxiv.org/abs/2312.11242.

B. Wang, C. Ren, J. Yang, X. Liang, J. Bai, L. Chai, Z. Yan, Q.-W. Zhang, D. Yin, X. Sun,...


Self-Consistency Improves Chain of Thought Reasoning in Language Models

URL http://arxiv.org/abs/2203.11171. arXiv:2203.11171 [cs].

X. Wang, Y. Xiao, J.-t. Huang, S. Yuan, R. Xu, H. Guo, Q. Tu, Y. Fei, Z. Leng, W. Wang, J. Chen, C. Li, and Y. Xiao. InCharacter: Evaluating Personality Fidelity in Role-Playing Agents through Psychological Interviews, June


URL http://arxiv.org/abs/2310.17976. arXiv:2310.17976 [cs].

J. Wei, X. Wang, D. Schuurmans, M. Bosma, B. Ichter, F. Xia, E. Chi, Q. Le, and D. Zhou. Chain-of-thought prompting elicits reasoning in large language models,


URL https://arxiv.org/abs/2201.11903.

Q. Wu, G. Bansal, J. Zhang, Y. Wu, B. Li, E. Zhu, L. Jiang, X. Zhang, S. Zhang, J. Liu, A. H. Awadallah, R. W. White, D. Burger, and C. Wang. Autogen: Enabling next-gen llm applications via multi-agent conversation,


URL https://arxiv.org/abs/2308.08155.

W. Xie, Y. Dai, and W. Jiang. Sde-sql: Enhancing text-to-sql generation in large language models via self-driven exploration with sql probes, 2025a. URL https://arxiv.org/abs/2506.07245.

X. Xie, G. Xu, L. Zhao, and R. Guo. Opensearch-sql: Enhancing text-to-sql with dynamic few-shot and consistency alignmen...


URL https://arxiv.org/abs/2210.03629.

Z. Yao, G. Sun, L. Borchmann, Z. Shen, M. Deng, B. Zhai, H. Zhang, A. Li, and Y. He. Arctic-Text2SQL-R1: Simple Rewards, Strong Reasoning in Text-to-SQL, May


URL http://arxiv.org/abs/2505.20315. arXiv:2505.20315 [cs].

T. Yu, R. Zhang, K. Yang, M. Yasunaga, D. Wang, Z. Li, J. Ma, I. Li, Q. Yao, S. Roman, Z. Zhang, and D. Radev. Spider: A large-scale human-labeled dataset for complex and cross-domain semantic parsing and text-to-sql task,


URL https://arxiv.org/abs/1809.08887.

J. Zhang, H. Yang, K. Miao, R. Zhang, R. Pi, J. Gao, and X. Zhou. Exesql: Self-taught text-to-sql models with execution-driven bootstrapping for sql dialects, 2025a. URL https://arxiv.org/abs/2505.17231.

L. Zhang, A. Hosseini, H. Bansal, M. Kazemi, A. Kumar, and R. Agarwal. Generative Verifiers: Reward Modeling as Next-...


URL https://arxiv.org/abs/2406.02818.

Y. Zhao, H. Yin, B. Zeng, H. Wang, T. Shi, C. Lyu, L. Wang, W. Luo, and K. Zhang. Marco-o1: Towards Open Reasoning Models for Open-Ended Solutions, Nov


Marco-o1: Towards open reasoning models for open-ended solutions, 2024

URL http://arxiv.org/abs/2411.14405. arXiv:2411.14405 [cs].

L. Zheng, W.-L. Chiang, Y. Sheng, S. Zhuang, Z. Wu, Y. Zhuang, Z. Lin, Z. Li, D. Li, E. P. Xing, H. Zhang, J. E. Gonzalez, and I. Stoica. Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena, Dec


URL http://arxiv.org/abs/2306.05685. arXiv:2306.05685 [cs].

A The Use of Large Language Models

Large Language Models (LLMs) were utilized in a limited, assistive capacity for specific tasks in this project. For manuscript preparation, the authors supplied their own draft to an LLM, which then provided suggestions to improve grammar, enhance clarity, and e...


Table 4: Hyperparameters for Grounding and Generation Agent RL Training.

Parameter | Value
Training Parameters
Learning Rate | 1×10⁻⁶
Batch Size | 128
Trajectory Rollout Parameters
Temperature | 0.6
Top-p | 0.95

Table 5: Hyperparameters for Validation Agent Dataset Generation.

Parameter | Value
Candidates per Question | 16
Temperature | 0.7
Top-p | 0.9
Top-k | 50

B.3 Validat...
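The Table 5 sampling settings (temperature 0.7, top-p 0.9, top-k 50) can be illustrated with a small stand-alone sampler. This is a minimal sketch, not the authors' inference code: the function names, the toy logits, and the order of the filters (top-k first, then top-p) are assumptions chosen to match common practice.

```python
import math
import random


def truncated_distribution(logits, temperature=0.7, top_k=50, top_p=0.9):
    """Return {token_index: probability} after temperature, top-k, then top-p filtering.

    Hypothetical helper; defaults mirror the values listed in Table 5.
    """
    # Temperature scaling followed by a numerically stable softmax.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    weights = [math.exp(x - peak) for x in scaled]
    total = sum(weights)
    probs = [w / total for w in weights]

    # Top-k: keep only the k most probable tokens.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]

    # Top-p (nucleus): keep the smallest prefix whose cumulative mass reaches top_p.
    kept, mass = [], 0.0
    for i in ranked:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break

    # Renormalise over the surviving tokens.
    return {i: probs[i] / mass for i in kept}


def sample_token(logits, rng, **kwargs):
    """Draw one token index from the truncated distribution."""
    dist = truncated_distribution(logits, **kwargs)
    indices = list(dist)
    return rng.choices(indices, weights=[dist[i] for i in indices], k=1)[0]


# Toy usage with made-up logits for a five-token vocabulary.
toy_logits = [2.0, 1.0, 0.5, -1.0, -3.0]
dist = truncated_distribution(toy_logits)
token = sample_token(toy_logits, random.Random(0))
```

Drawing 16 such samples per question would correspond to the "Candidates per Question" row, although how candidates are actually generated is not shown in this excerpt.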



Table 6: Hyperparameters for Validation Agent SFT.

Parameter | Value
Base Model | Qwen2.5-Coder-7B-Instruct
Epochs | 3
Learning Rate Scheduler | Cosine
Initial Learning Rate | 1×10⁻⁵
Effective Batch Size | 4
Per-device Batch Size | 1
Gradient Accumulation | 2 steps
Precision | bf16
Optimization | DeepSpeed ZeRO Stage 3

C Dataset

C.1 Training Dataset

Our training data is derive...
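The three batch-size rows of Table 6 can be cross-checked with a short calculation. The sketch below assumes the usual data-parallel relation (effective batch size = per-device batch size × gradient accumulation steps × number of devices); the class name and the device count it derives are illustrative, since the source does not state how many devices were used.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SFTBatchConfig:
    """Hypothetical container for the batch-size rows of Table 6."""

    per_device_batch_size: int = 1
    gradient_accumulation_steps: int = 2
    effective_batch_size: int = 4

    def implied_device_count(self) -> int:
        # Samples contributed by one device per optimizer step.
        per_device_per_step = self.per_device_batch_size * self.gradient_accumulation_steps
        if self.effective_batch_size % per_device_per_step != 0:
            raise ValueError("effective batch size is not a multiple of per-device work")
        return self.effective_batch_size // per_device_per_step


config = SFTBatchConfig()
devices = config.implied_device_count()
```

Under that assumption the listed values are consistent with two data-parallel devices; a different parallelism layout would change this reading.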
