
arxiv: 2605.19010 · v1 · pith:DMKUC6PMnew · submitted 2026-05-18 · 💻 cs.AI

AgentNLQ: A General-Purpose Agent for Natural Language to SQL

Pith reviewed 2026-05-20 10:37 UTC · model grok-4.3

classification 💻 cs.AI
keywords NL2SQL · multi-agent systems · large language models · BIRD benchmark · schema enrichment · SQL generation · orchestrator

The pith

A multi-agent LLM orchestrator with enriched schemas converts natural language to SQL at 78.1 percent semantic accuracy on the BIRD benchmark.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper introduces a multi-agent method for converting natural language questions into SQL queries. It relies on an LLM-based orchestrator that plans, reflects, and self-corrects while using semantically enriched schema metadata and user-provided business rules. The system is evaluated on the BIRD benchmark for large-scale databases and reports 78.1 percent semantic accuracy. A sympathetic reader would care because current LLM approaches still fall short of human experts on complex queries, and this method aims to close the gap through orchestration and context enrichment.

Core claim

The central claim is that an optimized multi-agent orchestrator using LLMs for planning, orchestration, reflection, and self-correction, combined with an advanced schema enrichment method that adds context-aware metadata and business rules, produces accurate SQL queries from natural language inputs.

What carries the argument

The multi-agent orchestrator that directs LLMs to plan, orchestrate, reflect, and self-correct, together with the schema enrichment process that generates context-aware metadata.
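As a rough illustration of what "context-aware metadata" could look like, the sketch below probes a SQLite database and attaches declared types and a few sample values to each column. It is a minimal sketch under stated assumptions, not the paper's pipeline: the function names `enrich_schema` and `demo_connection`, the choice of SQLite, and the example table are all hypothetical.

```python
import sqlite3


def enrich_schema(conn, sample_size=3):
    """Return {table: [{"column", "type", "samples"}]} for every user table.

    Assumption: "enrichment" here only means declared types plus a few
    distinct example values; the paper's actual metadata is richer.
    """
    metadata = {}
    tables = [
        row[0]
        for row in conn.execute(
            "SELECT name FROM sqlite_master "
            "WHERE type = 'table' AND name NOT LIKE 'sqlite_%' ORDER BY name"
        )
    ]
    for table in tables:
        columns = []
        # PRAGMA table_info yields (cid, name, type, notnull, dflt_value, pk).
        for _, name, col_type, *_ in conn.execute(f'PRAGMA table_info("{table}")'):
            samples = [
                row[0]
                for row in conn.execute(
                    f'SELECT DISTINCT "{name}" FROM "{table}" '
                    f'WHERE "{name}" IS NOT NULL ORDER BY 1 LIMIT ?',
                    (sample_size,),
                )
            ]
            columns.append({"column": name, "type": col_type, "samples": samples})
        metadata[table] = columns
    return metadata


def demo_connection():
    """Build a tiny in-memory database (hypothetical example data)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE account (id INTEGER PRIMARY KEY, frequency TEXT)")
    conn.executemany(
        "INSERT INTO account VALUES (?, ?)",
        [(1, "monthly"), (2, "weekly"), (3, "monthly")],
    )
    return conn
```

Sample values give a generator concrete hints (for example, that `frequency` holds strings such as `monthly`), which is one plausible way enriched metadata could reduce wrong-literal errors.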

Load-bearing premise

The LLM orchestrator can reliably plan, orchestrate, reflect, and self-correct to produce accurate SQL when supplied with semantically enriched schema metadata and business rules.
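To make the premise concrete, here is a minimal sketch of a generate, execute, reflect, retry loop. It assumes a hypothetical `generate_sql(question, feedback)` callable standing in for an LLM call and uses SQLite for execution; the paper's orchestrator is described as a state machine with more steps, so this shows only the self-correction idea.

```python
import sqlite3


def run_with_self_correction(conn, question, generate_sql, max_turns=3):
    """Generate SQL, execute it, and feed any error back to the generator.

    `generate_sql(question, feedback)` is a hypothetical stand-in for an
    LLM call; `feedback` is None on the first turn, then the last error text.
    Returns (sql, rows, turns) on success or (None, None, max_turns) on failure.
    """
    feedback = None
    for turn in range(1, max_turns + 1):
        sql = generate_sql(question, feedback)
        try:
            rows = conn.execute(sql).fetchall()
        except sqlite3.Error as exc:
            # Reflection step: keep the failure so the next attempt can use it.
            feedback = f"Query failed: {exc}. Previous SQL: {sql}"
            continue
        return sql, rows, turn
    return None, None, max_turns


def scripted_generator(question, feedback):
    """Toy generator: wrong column name first, corrected after an error."""
    if feedback is None:
        return "SELECT COUNT(*) FROM account WHERE freq = 'monthly'"
    return "SELECT COUNT(*) FROM account WHERE frequency = 'monthly'"


def demo():
    """Run the loop against a small in-memory table (hypothetical data)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE account (id INTEGER PRIMARY KEY, frequency TEXT)")
    conn.executemany(
        "INSERT INTO account VALUES (?, ?)",
        [(1, "monthly"), (2, "weekly"), (3, "monthly")],
    )
    return run_with_self_correction(
        conn, "How many monthly accounts?", scripted_generator
    )
```

Note that this loop only catches queries that fail to execute. A query that runs but answers the wrong question passes straight through, which is exactly where the premise is most exposed.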

What would settle it

Running the method on the BIRD benchmark after disabling the self-correction step and observing whether semantic accuracy falls substantially below 78.1 percent.
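The comparison could be scored along the following lines. This sketch uses execution-result matching as a stand-in metric, which is an assumption: the paper reports "semantic accuracy", and its scoring procedure may differ. The queries, table, and resulting numbers are toy values for illustration only and are not results from the paper.

```python
import sqlite3
from collections import Counter


def results_match(conn, predicted_sql, gold_sql):
    """True when both queries return the same multiset of rows."""
    try:
        predicted = conn.execute(predicted_sql).fetchall()
    except sqlite3.Error:
        return False  # a query that fails to run counts as wrong
    gold = conn.execute(gold_sql).fetchall()
    return Counter(predicted) == Counter(gold)


def accuracy(conn, cases):
    """`cases` is a list of (predicted_sql, gold_sql) pairs."""
    if not cases:
        return 0.0
    return sum(results_match(conn, p, g) for p, g in cases) / len(cases)


def demo():
    """Score two hypothetical configurations of the same agent."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE loan (id INTEGER PRIMARY KEY, amount INTEGER)")
    conn.executemany("INSERT INTO loan VALUES (?, ?)", [(1, 100), (2, 250), (3, 50)])
    gold = ["SELECT COUNT(*) FROM loan", "SELECT MAX(amount) FROM loan"]
    # Hypothetical outputs with and without the self-correction step.
    with_correction = [
        "SELECT COUNT(id) FROM loan",
        "SELECT amount FROM loan ORDER BY amount DESC LIMIT 1",
    ]
    without_correction = [
        "SELECT COUNT(id) FROM loan",
        "SELECT MAX(amt) FROM loan",  # references a column that does not exist
    ]
    return (
        accuracy(conn, list(zip(with_correction, gold))),
        accuracy(conn, list(zip(without_correction, gold))),
    )
```

Comparing rows as a multiset ignores row order, so a prediction that sorts differently from the gold query still counts as correct unless the question itself requires an ordering.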

Figures

Figures reproduced from arXiv: 2605.19010 by Anup Shirgaonkar, Chandra Dhir, Lakshmi Tumati, Olena Bogdanov, Pareekshitreddy Gaddam, Saurabh Jain, Vijay Parthasarathy, Yeunji Jung.

Figure 1: NL2SQL agent architecture. An offline metadata generation pipeline probes the SQL database and schema ...
Figure 2: The NL2SQL orchestrator is a state machine that tracks the overall state of task progress and dynamically ...
Figure 3: MCP Server and agentic tools setup: Agentic tools perform higher level functions, and MCP tools perform ...
Figure 4: Accuracy on BIRD financial dataset with different SQL generator LLMs. Orchestrator in all cases used ...
Figure 5: Correlation between latency and number of turns of thematic NL2SQL agent. Number of turns is defined ...

Original abstract

Natural language to SQL (NL2SQL) conversion is an important problem for researchers and enterprises due to the ubiquitous importance of relational databases in broad-ranging practical problems. Despite the rapid advancements in the capabilities of LLMs, NL2SQL has not reached parity in accuracy with human expert SQL writers, hence needing additional improvements in NL2SQL algorithms. This study presents a new multi-agent method for NL2SQL that achieves 78.1% semantic accuracy on the BIg Bench for LaRge-scale Database (BIRD) benchmark. Our method leverages a semantically enriched representation of user-provided schema, adds user-provided business rules, and produces accurate SQL queries. The main contributions of this study are (a) We designed an optimized new orchestrator in a multi-agent solution that uses LLMs to plan, orchestrate, reflect, and self-correct to generate accurate SQL queries, (b) We developed an advanced schema enrichment method that creates context-aware metadata to improve accuracy, and (c) We demonstrated the accuracy and generalizability of the method across different domains and datasets by evaluating it on the BIRD-SQL benchmark.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper introduces AgentNLQ, a multi-agent NL2SQL system that uses an LLM-based orchestrator to plan, orchestrate, reflect, and self-correct, combined with semantically enriched schema metadata and user-provided business rules. It reports 78.1% semantic accuracy on the BIRD benchmark and claims contributions in orchestrator design, schema enrichment, and cross-domain generalizability.

Significance. If the accuracy holds under evaluation conditions directly comparable to prior work, the result would represent a meaningful advance in practical NL2SQL, narrowing the gap to human-level performance and providing a reusable agentic template for database interfaces. The emphasis on reflection/self-correction and context-aware schema metadata is a constructive direction for agent-based methods.

major comments (1)
  1. [Abstract] Abstract: the headline claim of 78.1% semantic accuracy on BIRD rests on the use of 'user-provided business rules' in addition to schema and NL question. Standard BIRD evaluation supplies only the question plus schema; if the reported score incorporates rules unavailable to baselines, the number is not directly comparable and the isolated contribution of the orchestrator or schema enrichment cannot be assessed. The manuscript must specify the exact inputs supplied during evaluation, whether rules were available for every test case, and include an ablation removing the rules.
minor comments (1)
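  1. [Abstract] Abstract: the BIRD acronym expansion is written as 'BIg Bench for LaRge-scale Database', which stops short of the benchmark's full name, 'BIg Bench for LaRge-scale Database Grounded Text-to-SQL Evaluation'; give the full expansion on first use.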

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address the major comment regarding evaluation inputs and comparability below.

Point-by-point responses
  1. Referee: [Abstract] Abstract: the headline claim of 78.1% semantic accuracy on BIRD rests on the use of 'user-provided business rules' in addition to schema and NL question. Standard BIRD evaluation supplies only the question plus schema; if the reported score incorporates rules unavailable to baselines, the number is not directly comparable and the isolated contribution of the orchestrator or schema enrichment cannot be assessed. The manuscript must specify the exact inputs supplied during evaluation, whether rules were available for every test case, and include an ablation removing the rules.

    Authors: We agree that clarifying the evaluation protocol is essential for fair comparison with prior NL2SQL work. In our experiments, the 78.1% semantic accuracy was obtained using the natural language question, the database schema, and user-provided business rules for every test case in the BIRD benchmark, as these rules form an integral part of the proposed general-purpose agent for realistic enterprise use cases. To address the referee's concern directly, we will revise the abstract, methods, and results sections to explicitly document the inputs supplied at evaluation time. We will also add an ablation study that reports performance when business rules are removed, thereby isolating the contributions of the orchestrator and schema enrichment. This revision will make the results more directly comparable to baselines that use only question and schema.

    revision: yes
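The ablation discussed in this exchange amounts to toggling one input. A minimal sketch of how the generator input could be assembled with and without business rules follows; the function name `build_prompt`, the section labels, and the example rule are hypothetical and not taken from the paper.

```python
def build_prompt(question, schema_text, business_rules=None):
    """Assemble generator input; omit the rules section when none are given."""
    sections = [f"Schema:\n{schema_text}"]
    if business_rules:
        rules = "\n".join(f"- {rule}" for rule in business_rules)
        sections.append(f"Business rules:\n{rules}")
    sections.append(f"Question: {question}")
    return "\n\n".join(sections)


# Hypothetical inputs for the two arms of the ablation.
SCHEMA = "account(id INTEGER, frequency TEXT)"
QUESTION = "How many accounts are billed monthly?"
RULES = ["'Billed monthly' means frequency = 'monthly'."]

full_prompt = build_prompt(QUESTION, SCHEMA, RULES)
ablated_prompt = build_prompt(QUESTION, SCHEMA)
```

Holding the schema and question fixed while dropping only the rules section keeps the two arms comparable, so any accuracy difference can be attributed to the rules.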

Circularity Check

0 steps flagged

No circularity: empirical benchmark result grounded in external evaluation

full rationale

The paper presents an engineering contribution—a multi-agent NL2SQL system with an LLM orchestrator, schema enrichment, and user business rules—whose central claim is an empirical 78.1% semantic accuracy on the external BIRD benchmark. No derivation chain, equations, or first-principles result exists that reduces to its own inputs by construction. The method is described as a design choice evaluated against a public benchmark rather than a fitted parameter or self-referential definition. No load-bearing self-citations, uniqueness theorems, or ansatz smuggling are referenced in the provided text. The result is therefore self-contained against external data and receives the default low-circularity finding for empirical papers.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The paper relies on standard assumptions about LLM planning and self-correction capabilities plus the representativeness of the BIRD benchmark; no free parameters, axioms, or invented entities are explicitly introduced in the abstract.

pith-pipeline@v0.9.0 · 5756 in / 1155 out tokens · 51682 ms · 2026-05-20T10:37:33.518746+00:00 · methodology


