arxiv: 2511.01008 · v2 · submitted 2025-11-02 · 💻 cs.CL

MARS-SQL: A multi-agent reinforcement learning framework for Text-to-SQL

Pith reviewed 2026-05-18 01:27 UTC · model grok-4.3

classification 💻 cs.CL
keywords Text-to-SQL · Multi-agent reinforcement learning · ReAct framework · SQL query generation · Large language models · Interactive agents · Trajectory ranking

The pith

MARS-SQL trains a multi-agent system with reinforcement learning so it can execute SQL on a live database and refine queries from feedback.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to show that static prompting limits large language models on precise Text-to-SQL tasks and that a trainable multi-agent workflow can close the gap. It decomposes the work into schema grounding, query generation, and validation, then trains the generation agent inside a ReAct-style loop so the agent learns to issue intermediate SQL commands, receive execution results, and adjust its plan. A separate validation step ranks candidate trajectories by next-token prediction probabilities rather than external scoring. If the approach holds, models move from one-shot generation to interactive, self-correcting behavior that improves accuracy on complex schema and logic problems.

Core claim

MARS-SQL decomposes Text-to-SQL into three specialized roles and trains the query-generation agent with a multi-turn RL policy inside a ReAct loop. The agent reasons, executes intermediate SQL statements against a live database, and updates its strategy from execution feedback. Solution selection is cast as a generative modeling task that picks the best trajectory by next-token prediction probabilities. This coupling of interactive learning and trajectory ranking produces execution accuracies of 77.84 percent on the BIRD development set and 89.75 percent on the Spider test set while transferring to out-of-domain benchmarks.

What carries the argument

The multi-turn RL policy inside a ReAct-style loop that lets the generation agent issue SQL actions, observe execution results, and refine its plan, together with next-token probability ranking for selecting the final trajectory.
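
A minimal sketch of the act-observe-refine loop described above, assuming a SQLite database and a hand-written stand-in for the trained agent. The names `run_episode` and `toy_policy`, the `("action", sql)` / `("final", sql)` return convention, and the `singer` table are all hypothetical; the paper's actual prompt format and policy are not shown on this page.

```python
import sqlite3


def run_episode(conn, policy, question, max_turns=5):
    """Run one ReAct-style episode: the policy alternates SQL actions and observations."""
    history = []  # (sql, observation) pairs the policy has seen so far
    for _ in range(max_turns):
        kind, sql = policy(question, history)
        if kind == "final":
            return sql, history
        try:
            observation = conn.execute(sql).fetchall()
        except sqlite3.Error as exc:
            # Execution errors are fed back to the policy as text.
            observation = f"error: {exc}"
        history.append((sql, observation))
    return None, history  # turn budget exhausted without a final answer


def toy_policy(question, history):
    """Hypothetical stand-in for the trained agent: probe, see an error, then correct."""
    if not history:
        return "action", "SELECT nme FROM singer"  # deliberate column typo
    last_sql, last_obs = history[-1]
    if isinstance(last_obs, str) and last_obs.startswith("error"):
        return "action", "SELECT name FROM singer WHERE age > 30"
    return "final", last_sql


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE singer (name TEXT, age INTEGER)")
conn.executemany("INSERT INTO singer VALUES (?, ?)", [("Ann", 41), ("Bo", 25)])
final_sql, history = run_episode(conn, toy_policy, "Which singers are older than 30?")
```

In this toy run the first action fails, the error string becomes the observation, and the second action succeeds. In the paper that correction behavior is learned through RL rather than hard-coded.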

If this is right

  • Execution accuracy reaches state-of-the-art levels on both the BIRD development set and the Spider test set.
  • Performance transfers strongly to out-of-domain Text-to-SQL benchmarks.
  • The agentic workflow becomes trainable through reinforcement learning instead of relying on fixed prompts.
  • Interactive execution feedback replaces purely static generation for complex schema alignment.
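
Execution accuracy, the metric behind the first bullet, can be sketched as follows. This is a simplified illustration that compares result rows as multisets on a toy SQLite table; the official BIRD and Spider evaluation scripts may compare results differently, and the helper names here are hypothetical.

```python
import sqlite3
from collections import Counter


def execution_match(conn, predicted_sql, gold_sql):
    """True when both queries return the same multiset of rows.

    A prediction that fails to execute counts as a miss.
    """
    try:
        predicted_rows = conn.execute(predicted_sql).fetchall()
    except sqlite3.Error:
        return False
    gold_rows = conn.execute(gold_sql).fetchall()
    return Counter(predicted_rows) == Counter(gold_rows)


def execution_accuracy(conn, pairs):
    """Fraction of (predicted_sql, gold_sql) pairs whose results match."""
    matches = [execution_match(conn, predicted, gold) for predicted, gold in pairs]
    return sum(matches) / len(matches)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE singer (name TEXT, age INTEGER)")
conn.executemany("INSERT INTO singer VALUES (?, ?)", [("Ann", 41), ("Bo", 25)])

gold = "SELECT name FROM singer WHERE age > 30"
pairs = [
    ("SELECT name FROM singer WHERE age >= 31", gold),  # different text, same rows
    ("SELECT name FROM singer", gold),                  # runs, but returns extra rows
    ("SELECT nme FROM singer", gold),                   # fails to execute
]
accuracy = execution_accuracy(conn, pairs)
```

The first pair shows why the metric is execution-based: two differently written queries count as a match when they return the same rows.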

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same RL loop could be applied to other structured generation tasks such as API call composition or data analysis scripts.
  • Probability-based trajectory ranking offers a label-free way to select among multiple agent paths in other domains.
  • Scaling the approach to larger models or longer interaction horizons may further reduce errors on very intricate queries.

Load-bearing premise

Next-token prediction probabilities alone can reliably identify the best interaction trajectory without human labels or external verifiers.
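
One way such a selector could work, sketched under explicit assumptions: a verifier model is asked whether a trajectory is correct, and the probability it assigns to a "yes" verdict token over a "no" token is used as the score. The two-token verdict, the logit values, and the function names are illustrative assumptions, not details taken from the paper.

```python
import math


def yes_probability(logit_yes, logit_no):
    """Softmax over two verdict-token logits, returning P("yes")."""
    shift = max(logit_yes, logit_no)  # subtract the max for numerical stability
    exp_yes = math.exp(logit_yes - shift)
    exp_no = math.exp(logit_no - shift)
    return exp_yes / (exp_yes + exp_no)


def select_trajectory(candidates):
    """Pick the candidate with the highest P("yes").

    candidates: list of (trajectory_id, logit_yes, logit_no) tuples.
    """
    scored = [(yes_probability(ly, ln), tid) for tid, ly, ln in candidates]
    best_prob, best_id = max(scored)
    return best_id, best_prob


# Made-up logits for three candidate trajectories of one question.
candidates = [("traj_a", 1.2, 0.3), ("traj_b", 2.5, -0.5), ("traj_c", -0.4, 0.9)]
best_id, best_prob = select_trajectory(candidates)
```

Nothing in this scoring step consults the database, which is why the premise is load-bearing: the selector is only as good as the verifier's calibration.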

What would settle it

Run the trained agent on a fresh set of queries and check whether the trajectory assigned the highest next-token probability is also the one that produces correct execution results; if the two rankings diverge often, the validation step fails.
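
The check described above can be sketched as a comparison between the accuracy of the top-scored trajectory and the oracle accuracy, meaning the rate at which any candidate is correct. The function name and the scores and correctness flags below are made up for illustration.

```python
def selection_agreement(questions):
    """Compare probability-based selection against an oracle.

    questions: list of candidate lists; each candidate is (score, is_correct).
    Returns (selected_accuracy, oracle_accuracy).
    """
    selected_hits = 0
    oracle_hits = 0
    for candidates in questions:
        _, top_correct = max(candidates, key=lambda c: c[0])
        selected_hits += int(top_correct)
        oracle_hits += int(any(correct for _, correct in candidates))
    n = len(questions)
    return selected_hits / n, oracle_hits / n


# Made-up data: four questions, two candidate trajectories each.
questions = [
    [(0.90, True), (0.40, False)],   # top-scored candidate is correct
    [(0.80, False), (0.60, True)],   # a correct candidate exists but is ranked second
    [(0.70, False), (0.20, False)],  # no candidate is correct
    [(0.55, True), (0.50, True)],    # both candidates are correct
]
selected_acc, oracle_acc = selection_agreement(questions)
```

A large gap between the two numbers would indicate that correct trajectories are being generated but not selected, which is the failure mode the test is meant to expose.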

Figures

Figures reproduced from arXiv: 2511.01008 by Alexander Zhou, Haolin Yang, Jipeng Zhang, Yi R. Fung, Zhitao He.

Figure 1: Illustration of interactive reasoning.
Figure 3: Execution accuracy on Bird-dev of models fine-tuned with different maximum interaction turns (T), evaluated ...
Figure 4: Comparison of different selection strategy.
Original abstract

Large Language Models (LLMs) often struggle with the precise logic and schema alignment required for complex Text-to-SQL tasks. While current methods rely heavily on static prompting, they lack the ability to dynamically adapt and self-correct through environmental interaction. To bridge this gap, we propose MARS-SQL, a trainable multi-agent framework for Text-to-SQL. Rather than introducing a new standalone SQL primitive, MARS-SQL makes an agentic workflow trainable by decomposing the problem into three specialized roles: schema grounding, query generation, and solution validation. Central to our approach is a generation agent trained via a multi-turn RL policy within a ReAct-style loop. The agent learns to iteratively reason, execute intermediate SQL actions on a live database, and refine its strategy based on execution feedback. To improve robustness, we further introduce a validation mechanism that treats solution selection as a generative modeling task, identifying the optimal interaction trajectory through next-token prediction probabilities. Empirical evaluations demonstrate the effectiveness of coupling interactive learning with trajectory ranking. MARS-SQL achieves state-of-the-art performance, recording an execution accuracy of 77.84% on the BIRD development dataset and 89.75% on the Spider test dataset, while also transferring strongly to out-of-domain benchmarks. Code is available at https://github.com/YangHaolin0526/MARS-SQL.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes MARS-SQL, a multi-agent reinforcement learning framework for Text-to-SQL that decomposes the task into three specialized agents for schema grounding, query generation, and solution validation. A generation agent is trained via multi-turn RL within a ReAct-style loop that incorporates execution feedback from a live database. Solution selection is performed by treating trajectory ranking as a generative modeling task that uses next-token prediction probabilities. The manuscript reports state-of-the-art execution accuracies of 77.84% on the BIRD development set and 89.75% on the Spider test set, along with strong transfer to out-of-domain benchmarks.

Significance. If the reported gains are shown to stem from the trainable multi-agent RL workflow and the proposed trajectory-ranking mechanism rather than from unstated implementation details or baseline choices, the work would constitute a meaningful contribution to agentic approaches for semantic parsing. The emphasis on interactive learning with execution feedback and the public release of code are positive elements that could support further research in making LLM-based Text-to-SQL systems more robust and adaptive.

major comments (2)
  1. [Abstract] Abstract: The central performance claims (77.84% execution accuracy on BIRD dev, 89.75% on Spider test) are presented without any reported details on reward design, training stability, baseline comparisons, statistical significance testing, or error analysis. These omissions make it impossible to determine whether the gains are attributable to the multi-agent RL policy or the trajectory-ranking procedure.
  2. [Abstract] Validation mechanism (as described in the abstract): The claim that next-token prediction probabilities can reliably identify the optimal interaction trajectory rests on an unverified assumption that these probabilities correlate strongly with execution accuracy and schema correctness. In Text-to-SQL, high-probability outputs can still be syntactically plausible yet semantically incorrect; without an external verifier or explicit correlation analysis, this step risks selecting suboptimal trajectories and weakening both the RL training signal and the final reported results.
minor comments (1)
  1. The manuscript should clarify the precise formulation of the multi-turn RL objective and the exact role of each agent in the ReAct loop to improve reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments. We address each point below and have revised the manuscript accordingly to provide greater clarity on the methodological details and the validation mechanism.

  1. Referee: [Abstract] Abstract: The central performance claims (77.84% execution accuracy on BIRD dev, 89.75% on Spider test) are presented without any reported details on reward design, training stability, baseline comparisons, statistical significance testing, or error analysis. These omissions make it impossible to determine whether the gains are attributable to the multi-agent RL policy or the trajectory-ranking procedure.

    Authors: We agree that the abstract is highly condensed and omits key details that are elaborated in the main body of the paper. To address this, we have revised the abstract to briefly mention the reward design (based on execution accuracy and schema alignment), note that results are averaged over multiple runs with reported standard deviations for stability, and reference the baseline comparisons and error analysis presented in Sections 5 and 6. We believe these additions will help readers better attribute the performance gains to the proposed multi-agent RL workflow and trajectory-ranking procedure.

    revision: yes

  2. Referee: [Abstract] Validation mechanism (as described in the abstract): The claim that next-token prediction probabilities can reliably identify the optimal interaction trajectory rests on an unverified assumption that these probabilities correlate strongly with execution accuracy and schema correctness. In Text-to-SQL, high-probability outputs can still be syntactically plausible yet semantically incorrect; without an external verifier or explicit correlation analysis, this step risks selecting suboptimal trajectories and weakening both the RL training signal and the final reported results.

    Authors: We appreciate this insightful observation regarding the potential limitations of using next-token prediction probabilities for trajectory ranking. In our framework, the ranking is performed on trajectories that have already undergone execution feedback within the ReAct-style loop, providing an additional layer of validation through actual database interactions. Nevertheless, to strengthen the manuscript, we have added an explicit correlation analysis between the model's log-probabilities and execution accuracy on a validation subset, demonstrating a positive correlation. We also discuss cases where high-probability trajectories may still fail and how the multi-agent setup mitigates this. This revision clarifies the role of the generative ranking without relying solely on an unverified assumption.

    revision: yes
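
The rebuttal does not say which statistic the correlation analysis uses. One plausible choice, shown here purely as an assumption, is the point-biserial correlation, which is the Pearson correlation between a continuous variable (trajectory log-probability) and a binary one (execution correct or not). The data below is made up.

```python
import math


def pearson(xs, ys):
    """Pearson correlation; with binary ys this is the point-biserial correlation."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Made-up validation subset: log-probability of each trajectory and whether
# its final SQL executed to the correct result (1) or not (0).
log_probs = [-0.2, -0.5, -1.5, -2.0, -0.3, -1.8]
correct = [1, 1, 0, 0, 1, 0]
r = pearson(log_probs, correct)
```

A positive `r` is necessary but not sufficient for the selector to work: the selection decision depends on the ranking within each question's candidate set, so a per-question analysis would be more direct than a pooled correlation.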

Circularity Check

0 steps flagged

No circularity: performance derived from external benchmarks via RL interaction

The paper's central claims rest on empirical execution accuracy measured on independent public benchmarks (BIRD dev at 77.84%, Spider test at 89.75%). These results are obtained by running the trained multi-agent system on live databases and counting correct executions, not by fitting parameters to the target metric or re-deriving the metric from itself. The validation step (next-token probabilities for trajectory selection) is an internal component of the proposed ReAct-style RL loop and does not reduce the reported accuracies to a tautology or self-citation. No equations, uniqueness theorems, or prior self-citations are invoked that would make the SOTA numbers equivalent to the training inputs by construction. The derivation chain is therefore self-contained against external evaluation.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 0 invented entities

The framework rests on standard assumptions from reinforcement learning and language model decoding; no new physical or mathematical entities are introduced.

free parameters (2)
  • RL policy hyperparameters
    Learning rate, discount factor, and reward scaling for the multi-turn policy are chosen or tuned but not enumerated in the abstract.
  • Trajectory ranking threshold
    The cutoff or weighting used when selecting the best trajectory from next-token probabilities is a modeling choice.
axioms (2)
  • domain assumption: Execution feedback from a live database provides a reliable reward signal for refining SQL generation.
    Invoked when describing the ReAct-style loop that refines strategy based on execution feedback.
  • domain assumption: Next-token prediction probabilities from the same model can serve as an effective ranking mechanism for interaction trajectories.
    Central to the validation mechanism described in the abstract.
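
To make the role of two of the free parameters concrete, here is a generic sketch of how a discount factor and a reward scale enter the per-turn return of a multi-turn trajectory. It assumes a sparse reward that arrives only on the final turn; the paper's actual objective and reward design are not given on this page, so the function and its defaults are illustrative only.

```python
def discounted_returns(rewards, gamma=0.99, reward_scale=1.0):
    """Return-to-go for each turn of one trajectory.

    rewards: per-turn rewards, oldest first.
    gamma: discount factor applied per turn.
    reward_scale: multiplier applied to every raw reward.
    """
    returns = []
    running = 0.0
    # Walk backwards so each turn's return includes the discounted future.
    for reward in reversed(rewards):
        running = reward_scale * reward + gamma * running
        returns.append(running)
    returns.reverse()
    return returns


# Three-turn episode with a reward of 1.0 only when the final SQL is correct.
rewards = [0.0, 0.0, 1.0]
returns = discounted_returns(rewards, gamma=0.5)
```

With `gamma=0.5` the earlier turns receive returns of 0.25 and 0.5, so a smaller discount factor gives less credit to early exploratory queries. That is one reason the value is a tuning choice rather than a fixed constant.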

Appendix excerpts

A The Use of Large Language Models
Large Language Models (LLMs) were utilized in a limited, assistive capacity for specific tasks in this project. For manuscript preparation, the authors supplied their own draft to an LLM, which then provided suggestions to improve grammar, enhance clarity, and e...

Table 4: Hyperparameters for Grounding and Generation Agent RL Training.

    Parameter                        Value
    Training Parameters
      Learning Rate                  1×10⁻⁶
      Batch Size                     128
    Trajectory Rollout Parameters
      Temperature                    0.6
      Top-p                          0.95

Table 5: Hyperparameters for Validation Agent Dataset Generation.

    Parameter                        Value
    Candidates per Question          16
    Temperature                      0.7
    Top-p                            0.9
    Top-k                            50

B.3 Validat...

Table 6: Hyperparameters for Validation Agent SFT.

    Parameter                        Value
    Base Model                       Qwen2.5-Coder-7B-Instruct
    Epochs                           3
    Learning Rate Scheduler          Cosine
    Initial Learning Rate            1×10⁻⁵
    Effective Batch Size             4
    Per-device Batch Size            1
    Gradient Accumulation            2 steps
    Precision                        bf16
    Optimization                     DeepSpeed ZeRO Stage 3

C Dataset
C.1 Training Dataset
Our training data is derive...