MARS-SQL: A multi-agent reinforcement learning framework for Text-to-SQL
Pith reviewed 2026-05-18 01:27 UTC · model grok-4.3
The pith
MARS-SQL trains a multi-agent system with reinforcement learning so it can execute SQL on a live database and refine queries from feedback.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
MARS-SQL decomposes Text-to-SQL into three specialized roles and trains the query-generation agent with a multi-turn RL policy inside a ReAct loop. The agent reasons, executes intermediate SQL statements against a live database, and updates its strategy from execution feedback. Solution selection is cast as a generative modeling task that picks the best trajectory by next-token prediction probabilities. This coupling of interactive learning and trajectory ranking produces execution accuracies of 77.84 percent on the BIRD development set and 89.75 percent on the Spider test set while transferring to out-of-domain benchmarks.
What carries the argument
The argument rests on two pieces: a multi-turn RL policy inside a ReAct-style loop, which lets the generation agent issue SQL actions, observe execution results, and refine its plan, and next-token probability ranking, which selects the final trajectory.
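The loop described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: `run_episode`, `toy_policy`, the `singer` table, and the 5-turn budget are all hypothetical, and the scripted policy merely stands in for the trained generation agent so the example runs without a model.

```python
import sqlite3
from typing import Callable, List, Tuple

# One step of history: (sql_action, observation_text)
Step = Tuple[str, str]


def run_episode(conn: sqlite3.Connection,
                question: str,
                policy: Callable[[str, List[Step]], Tuple[str, str]],
                max_turns: int = 5) -> Tuple[str, List[Step]]:
    """Roll out one interaction trajectory.

    `policy` is a hypothetical stand-in for the generation agent. It returns
    (kind, sql), where kind is "execute" for an intermediate query or
    "final" for the answer query.
    """
    history: List[Step] = []
    for _ in range(max_turns):
        kind, sql = policy(question, history)
        if kind == "final":
            return sql, history
        try:
            rows = conn.execute(sql).fetchmany(5)  # keep observations short
            observation = repr(rows)
        except sqlite3.Error as exc:
            observation = f"ERROR: {exc}"  # execution errors are feedback too
        history.append((sql, observation))
    # Turn budget exhausted: fall back to the last attempted statement.
    return (history[-1][0] if history else ""), history


def toy_policy(question: str, history: List[Step]) -> Tuple[str, str]:
    """Scripted policy: guesses a wrong column first, then repairs it."""
    if not history:
        return "execute", "SELECT COUNT(*) FROM singer WHERE nation = 'France'"
    if history[-1][1].startswith("ERROR"):
        return "execute", "SELECT COUNT(*) FROM singer WHERE country = 'France'"
    return "final", history[-1][0]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE singer (name TEXT, country TEXT)")
conn.executemany("INSERT INTO singer VALUES (?, ?)",
                 [("Ana", "France"), ("Bo", "Japan"), ("Cy", "France")])
final_sql, trace = run_episode(conn, "How many singers are from France?", toy_policy)
```

The first action fails on the database, the error text becomes the observation, and the second action repairs the column name. In a trained system the policy would be a language model conditioned on the question and the history rather than a fixed script.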
If this is right
- Execution accuracy reaches state-of-the-art levels on both the BIRD development set and the Spider test set.
- Performance transfers strongly to out-of-domain Text-to-SQL benchmarks.
- The agentic workflow becomes trainable through reinforcement learning instead of relying on fixed prompts.
- Interactive execution feedback replaces purely static generation for complex schema alignment.
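For readers unfamiliar with the metric behind these claims, execution accuracy can be approximated as follows. This is a sketch under stated assumptions: the order-insensitive multiset comparison, the `emp` table, and the function names are illustrative, and the official BIRD and Spider evaluation scripts differ in detail.

```python
import sqlite3
from collections import Counter
from typing import Iterable, Tuple


def same_result(conn: sqlite3.Connection, predicted_sql: str, gold_sql: str) -> bool:
    """True when both queries return the same multiset of rows."""
    try:
        predicted = Counter(conn.execute(predicted_sql).fetchall())
    except sqlite3.Error:
        return False  # a query that fails to run counts as wrong
    gold = Counter(conn.execute(gold_sql).fetchall())
    return predicted == gold


def execution_accuracy(conn: sqlite3.Connection,
                       pairs: Iterable[Tuple[str, str]]) -> float:
    """Fraction of (predicted, gold) pairs whose results match."""
    pairs = list(pairs)
    hits = sum(same_result(conn, pred, gold) for pred, gold in pairs)
    return hits / len(pairs)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE emp (name TEXT, dept TEXT)")
conn.executemany("INSERT INTO emp VALUES (?, ?)",
                 [("x", "a"), ("y", "a"), ("z", "b")])

pairs = [
    # Same rows in a different order: counted as a match.
    ("SELECT name FROM emp WHERE dept = 'a' ORDER BY name DESC",
     "SELECT name FROM emp WHERE dept = 'a'"),
    # Missing filter: extra row, no match.
    ("SELECT name FROM emp", "SELECT name FROM emp WHERE dept = 'a'"),
    # Misspelled column: execution error, no match.
    ("SELECT nme FROM emp", "SELECT name FROM emp"),
    # Different SQL text, same result: counted as a match.
    ("SELECT COUNT(*) FROM emp", "SELECT COUNT(name) FROM emp"),
]
accuracy = execution_accuracy(conn, pairs)
```

Because the comparison is on results rather than SQL text, two syntactically different queries can both count as correct, which is why the metric suits an agent that rewrites its query across turns.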
Where Pith is reading between the lines
- The same RL loop could be applied to other structured generation tasks such as API call composition or data analysis scripts.
- Probability-based trajectory ranking offers a label-free way to select among multiple agent paths in other domains.
- Scaling the approach to larger models or longer interaction horizons may further reduce errors on very intricate queries.
Load-bearing premise
Next-token prediction probabilities alone can reliably identify the best interaction trajectory without human labels or external verifiers.
What would settle it
Run the trained agent on a fresh set of queries and check whether the trajectory assigned the highest next-token probability is also the one that produces correct execution results; if the two rankings diverge often, the validation step fails.
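The check proposed above could look like the sketch below. It assumes each candidate trajectory already carries a scalar validator score and an execution-correctness flag; the `Candidate` layout, the function name, and the demo numbers are hypothetical and are not taken from the paper.

```python
from typing import Dict, List, Tuple

# (validator_score, executes_correctly) for one candidate trajectory
Candidate = Tuple[float, bool]


def selection_report(questions: List[List[Candidate]]) -> Dict[str, float]:
    """Compare highest-score selection with the best achievable choice."""
    picked_correct = 0
    solvable = 0
    for candidates in questions:
        best = max(candidates, key=lambda c: c[0])  # trajectory the ranker picks
        picked_correct += best[1]
        solvable += any(ok for _, ok in candidates)  # an oracle could succeed
    n = len(questions)
    return {
        "selected_accuracy": picked_correct / n,
        "oracle_accuracy": solvable / n,
    }


# Synthetic data: four questions, two candidate trajectories each.
demo = [
    [(0.91, True), (0.40, False)],   # ranker picks the correct one
    [(0.85, False), (0.60, True)],   # ranker prefers a wrong trajectory
    [(0.30, False), (0.20, False)],  # no candidate is correct
    [(0.75, True), (0.70, True)],    # either choice is correct
]
report = selection_report(demo)
```

The gap between the two numbers is the quantity of interest: a large gap on fresh queries would mean the score ranking often passes over a correct trajectory that was available.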
Original abstract
Large Language Models (LLMs) often struggle with the precise logic and schema alignment required for complex Text-to-SQL tasks. While current methods rely heavily on static prompting, they lack the ability to dynamically adapt and self-correct through environmental interaction. To bridge this gap, we propose MARS-SQL, a trainable multi-agent framework for Text-to-SQL. Rather than introducing a new standalone SQL primitive, MARS-SQL makes an agentic workflow trainable by decomposing the problem into three specialized roles: schema grounding, query generation, and solution validation. Central to our approach is a generation agent trained via a multi-turn RL policy within a ReAct-style loop. The agent learns to iteratively reason, execute intermediate SQL actions on a live database, and refine its strategy based on execution feedback. To improve robustness, we further introduce a validation mechanism that treats solution selection as a generative modeling task, identifying the optimal interaction trajectory through next-token prediction probabilities. Empirical evaluations demonstrate the effectiveness of coupling interactive learning with trajectory ranking. MARS-SQL achieves state-of-the-art performance, recording an execution accuracy of 77.84% on the BIRD development dataset and 89.75% on the Spider test dataset, while also transferring strongly to out-of-domain benchmarks. Code is available at https://github.com/YangHaolin0526/MARS-SQL.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes MARS-SQL, a multi-agent reinforcement learning framework for Text-to-SQL that decomposes the task into three specialized agents for schema grounding, query generation, and solution validation. A generation agent is trained via multi-turn RL within a ReAct-style loop that incorporates execution feedback from a live database. Solution selection is performed by treating trajectory ranking as a generative modeling task that uses next-token prediction probabilities. The manuscript reports state-of-the-art execution accuracies of 77.84% on the BIRD development set and 89.75% on the Spider test set, along with strong transfer to out-of-domain benchmarks.
Significance. If the reported gains are shown to stem from the trainable multi-agent RL workflow and the proposed trajectory-ranking mechanism rather than from unstated implementation details or baseline choices, the work would constitute a meaningful contribution to agentic approaches for semantic parsing. The emphasis on interactive learning with execution feedback and the public release of code are positive elements that could support further research in making LLM-based Text-to-SQL systems more robust and adaptive.
major comments (2)
- [Abstract] Abstract: The central performance claims (77.84% execution accuracy on BIRD dev, 89.75% on Spider test) are presented without any reported details on reward design, training stability, baseline comparisons, statistical significance testing, or error analysis. These omissions make it impossible to determine whether the gains are attributable to the multi-agent RL policy or the trajectory-ranking procedure.
- [Abstract] Validation mechanism (as described in the abstract): The claim that next-token prediction probabilities can reliably identify the optimal interaction trajectory rests on an unverified assumption that these probabilities correlate strongly with execution accuracy and schema correctness. In Text-to-SQL, high-probability outputs can still be syntactically plausible yet semantically incorrect; without an external verifier or explicit correlation analysis, this step risks selecting suboptimal trajectories and weakening both the RL training signal and the final reported results.
minor comments (1)
- The manuscript should clarify the precise formulation of the multi-turn RL objective and the exact role of each agent in the ReAct loop to improve reproducibility.
Simulated Author's Rebuttal
We thank the referee for their constructive comments. We address each point below and have revised the manuscript accordingly to provide greater clarity on the methodological details and the validation mechanism.
- Referee: [Abstract] Abstract: The central performance claims (77.84% execution accuracy on BIRD dev, 89.75% on Spider test) are presented without any reported details on reward design, training stability, baseline comparisons, statistical significance testing, or error analysis. These omissions make it impossible to determine whether the gains are attributable to the multi-agent RL policy or the trajectory-ranking procedure.
  Authors: We agree that the abstract is highly condensed and omits key details that are elaborated in the main body of the paper. To address this, we have revised the abstract to briefly mention the reward design (based on execution accuracy and schema alignment), note that results are averaged over multiple runs with reported standard deviations for stability, and reference the baseline comparisons and error analysis presented in Sections 5 and 6. We believe these additions will help readers better attribute the performance gains to the proposed multi-agent RL workflow and trajectory-ranking procedure.
  Revision: yes
- Referee: [Abstract] Validation mechanism (as described in the abstract): The claim that next-token prediction probabilities can reliably identify the optimal interaction trajectory rests on an unverified assumption that these probabilities correlate strongly with execution accuracy and schema correctness. In Text-to-SQL, high-probability outputs can still be syntactically plausible yet semantically incorrect; without an external verifier or explicit correlation analysis, this step risks selecting suboptimal trajectories and weakening both the RL training signal and the final reported results.
  Authors: We appreciate this insightful observation regarding the potential limitations of using next-token prediction probabilities for trajectory ranking. In our framework, the ranking is performed on trajectories that have already undergone execution feedback within the ReAct-style loop, providing an additional layer of validation through actual database interactions. Nevertheless, to strengthen the manuscript, we have added an explicit correlation analysis between the model's log-probabilities and execution accuracy on a validation subset, demonstrating a positive correlation. We also discuss cases where high-probability trajectories may still fail and how the multi-agent setup mitigates this. This revision clarifies the role of the generative ranking without relying solely on an unverified assumption.
  Revision: yes
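A correlation analysis of the kind discussed in this exchange can be sketched as a Pearson correlation between per-trajectory log-probabilities and a 0/1 correctness flag (equivalently, a point-biserial correlation). The numbers below are synthetic and purely illustrative; they are not results from the paper, and `pearson` is a hand-written helper rather than any library's API.

```python
import math
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Synthetic example: one log-probability and one correctness flag per trajectory.
log_probs = [-0.1, -0.3, -0.9, -1.5, -2.0, -0.2]
correct = [1, 1, 0, 0, 0, 1]  # 1 = execution result matched the gold answer
r = pearson(log_probs, correct)
```

A positive `r` alone would not settle the referee's concern: selection quality depends on how candidates rank within each question, so a per-question top-1 agreement rate is a useful companion statistic.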
Circularity Check
No circularity: performance derived from external benchmarks via RL interaction
The paper's central claims rest on empirical execution accuracy measured on independent public benchmarks (BIRD dev at 77.84%, Spider test at 89.75%). These results are obtained by running the trained multi-agent system on live databases and counting correct executions, not by fitting parameters to the target metric or re-deriving the metric from itself. The validation step (next-token probabilities for trajectory selection) is an internal component of the proposed ReAct-style RL loop and does not reduce the reported accuracies to a tautology or self-citation. No equations, uniqueness theorems, or prior self-citations are invoked that would make the SOTA numbers equivalent to the training inputs by construction. The derivation chain is therefore self-contained against external evaluation.
Axiom & Free-Parameter Ledger
free parameters (2)
- RL policy hyperparameters
- Trajectory ranking threshold
axioms (2)
- domain assumption: Execution feedback from a live database provides a reliable reward signal for refining SQL generation.
- domain assumption: Next-token prediction probabilities from the same model can serve as an effective ranking mechanism for interaction trajectories.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean: washburn_uniqueness_aczel
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: The Validation agent... selects the optimal trajectory by modeling verification as a next-token prediction task and choosing the solution with the highest generation probability.
- IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean: reality_from_one_distinction
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: We train the policy π_Gen using Group Relative Policy Optimization (GRPO)... reward signal R_gen(τ) used to compute A_i is derived solely from execution outcomes
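The quoted passage mentions advantages computed from execution-based rewards under GRPO. In the usual GRPO formulation, each rollout's reward is normalized against the other rollouts sampled for the same question. The sketch below assumes a binary reward (1 when the final query executes to the correct result, 0 otherwise), the population standard deviation, and a small `eps`; these are illustrative choices, not the paper's exact settings.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize each reward against its own sampling group.

    A_i = (R_i - mean(R)) / (std(R) + eps), where R holds the rewards of
    all trajectories sampled for the same question.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four rollouts for one question: two executed correctly, two did not.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)

# When every rollout earns the same reward, no rollout is preferred.
flat = group_relative_advantages([1.0, 1.0, 1.0])
```

The second call illustrates a known property of group normalization: questions that every rollout solves, or that none solves, yield zero advantage and therefore contribute no gradient signal.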
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Appendix excerpts
A The Use of Large Language Models
Large Language Models (LLMs) were utilized in a limited, assistive capacity for specific tasks in this project. For manuscript preparation, the authors supplied their own draft to an LLM, which then provided suggestions to improve grammar, enhance clarity, and e...
Table 4: Hyperparameters for Grounding and Generation Agent RL Training.
- Training Parameters
  - Learning Rate: 1×10⁻⁶
  - Batch Size: 128
- Trajectory Rollout Parameters
  - Temperature: 0.6
  - Top-p: 0.95
Table 5: Hyperparameters for Validation Agent Dataset Generation.
- Candidates per Question: 16
- Temperature: 0.7
- Top-p: 0.9
- Top-k: 50
B.3 Validat...
Table 6: Hyperparameters for Validation Agent SFT.
- Base Model: Qwen2.5-Coder-7B-Instruct
- Epochs: 3
- Learning Rate Scheduler: Cosine
- Initial Learning Rate: 1×10⁻⁵
- Effective Batch Size: 4
- Per-device Batch Size: 1
- Gradient Accumulation: 2 steps
- Precision: bf16
- Optimization: DeepSpeed ZeRO Stage 3
C Dataset
C.1 Training Dataset
Our training data is derive...