Text-to-SQL empowered by large language models: A benchmark evaluation. arXiv preprint arXiv:2308.15363, 2023

Dawei Gao, Haibin Wang, Yaliang Li, Xiuyu Sun, Yichen Qian, Bolin Ding, Jingren Zhou · 2023 · arXiv 2308.15363

25 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background 3

citation-polarity summary

background 3

representative citing papers

ExCyTIn-Bench: Evaluating LLM agents on Cyber Threat Investigation

cs.CR · 2025-07-14 · unverdicted · novelty 8.0

ExCyTIn-Bench is the first benchmark of 7542 questions from Microsoft Sentinel threat investigation graphs, where the best LLM agent achieves a reward of 0.606.

ACE-SQL: Adaptive Co-Optimization via Empirical Credit Assignment for Text-to-SQL

cs.CL · 2026-06-04 · unverdicted · novelty 7.0

ACE-SQL jointly optimizes schema linking and SQL generation via RL with empirical credit assignment from execution-correct rollouts, achieving 65.3% greedy execution accuracy on BIRD Dev using 0.93k output tokens.

Data Flow Control: Data Safety Policies for AI Agents

cs.DB · 2026-06-04 · unverdicted · novelty 7.0

Data Flow Control formalizes data safety as aggregate predicates over provenance monomials and implements enforcement via the Passant query rewriting layer achieving near-zero overhead across five DBMS engines.

EntSQL: A Benchmark for Grounding Text-to-SQL in Long-Context Enterprise Knowledge

cs.CL · 2026-06-02 · unverdicted · novelty 7.0

EntSQL is a new benchmark with 1,066 examples across five domains where top systems reach only 15.9% accuracy on English inputs when long-form enterprise documents are provided.

Residual Skill Optimization for Text-to-SQL Ensembles

cs.CL · 2026-05-20 · unverdicted · novelty 7.0

Residual skill optimization creates complementary Text-to-SQL agents by training each new skill on prior ensemble failures, yielding accuracy gains on Spider2-Lite and transfer to other dialects and tasks.

CA-SQL: Complexity-Aware Inference Time Reasoning for Text-to-SQL via Exploration and Compute Budget Allocation

cs.CL · 2026-05-08 · unverdicted · novelty 7.0

CA-SQL achieves 51.72% execution accuracy on the challenging tier of the BIRD benchmark using GPT-4o-mini by scaling exploration breadth according to estimated task difficulty, evolutionary prompt seeding, and candidate voting.

SOMA-SQL: Resolving Multi-Source Ambiguity in NL-to-SQL via Synthetic Log and Execution Probing

cs.CL · 2026-06-09 · unverdicted · novelty 6.0

SOMA-SQL resolves multi-source ambiguity in NL-to-SQL using synthetic query logs and ambiguity-driven execution probing, reporting 13% average execution accuracy gains over baselines on six benchmarks.

SANE: Schema-aware Natural-language Evaluation of Biological Data

cs.CL · 2026-06-03 · unverdicted · novelty 6.0

SANE is a new schema-aware benchmark paradigm for text-to-SQL evaluation that demonstrates few-shot LLMs with structured prompting can generate accurate queries on constrained biological data schemas without fine-tuning.

FINER-SQL: Boosting Small Language Models for Text-to-SQL

cs.DB · 2026-05-05 · unverdicted · novelty 6.0

FINER-SQL boosts 3B-parameter small language models to 67.73% and 85% execution accuracy on BIRD and Spider benchmarks via dense memory and atomic rewards in group relative policy optimization, matching larger LLMs at lower latency.

EGREFINE: An Execution-Grounded Optimization Framework for Text-to-SQL Schema Refinement

cs.DB · 2026-05-01 · unverdicted · novelty 6.0

EGRefine optimizes column renamings via execution-grounded verification and view materialization to recover Text-to-SQL accuracy lost to schema naming issues while guaranteeing query equivalence.

Reliable Answers for Recurring Questions: Boosting Text-to-SQL Accuracy with Template Constrained Decoding

cs.CL · 2026-04-30 · unverdicted · novelty 6.0

TeCoD improves Text-to-SQL execution accuracy by up to 36% over in-context learning and cuts latency 2.2x on matched queries by extracting templates from historical pairs and enforcing them with constrained decoding.

Querying Structured Data Through Natural Language Using Language Models

cs.CL · 2026-04-03 · conditional · novelty 6.0

Fine-tuning an 8B LLM with synthetic data enables accurate natural language querying of structured datasets like accessibility services in Spain, generalizing to new locations.

Access Paths for Efficient Ordering with Large Language Models

cs.DB · 2025-08-30 · unverdicted · novelty 6.0

Introduces the LLM ORDER BY semantic operator with algorithmic improvements, a semantic-aware external merge sort, and a budget-aware optimizer that selects near-optimal access paths for LLM-based ordering.

Cheaper, Better, Faster, Stronger: Robust Text-to-SQL without Chain-of-Thought or Fine-Tuning

cs.CL · 2025-05-20 · unverdicted · novelty 6.0

N-rep consistency achieves comparable BIRD benchmark scores for text-to-SQL at $0.039 per query by combining multiple schema representations, without chain-of-thought reasoning or fine-tuning.

Schema-First Retrieval: Embedding Catalogs for Natural Language Analytics

cs.IR · 2026-06-23 · unverdicted · novelty 5.0

Schema-First Retrieval embeds catalog metadata rather than rows and uses parallel retrieval plus reranking to raise table and column recall and cut SQL execution errors on three benchmarks.

SingGuard: A Policy-Adaptive Multimodal LLM Guardrail with Dynamic Reasoning

cs.CV · 2026-06-22 · unverdicted · novelty 5.0

SingGuard introduces a policy-adaptive multimodal LLM guardrail with dynamic reasoning regimes and SingGuard-Bench, reporting SOTA F1 scores across 35 datasets and improved policy-following accuracy under runtime shifts.

Intelligent Drill-Down: Large Language Model-Driven Drill-Down Technique for Human-AI Collaborative Visual Exploration

cs.HC · 2026-04-18 · unverdicted · novelty 5.0

An LLM-based framework recommends drill-down paths in visual analytics by approximating a greedy algorithm, interpreting user intent, and managing exploration branches to reduce cognitive load.

From Business Events to Auditable Decisions: Ontology-Governed Graph Simulation for Enterprise AI

cs.AI · 2026-04-08 · unverdicted · novelty 5.0

LOM-action uses business events to drive ontology-governed graph simulations that generate auditable decisions, reporting 93.82% accuracy and 98.74% tool-chain F1 versus 24-36% F1 for frontier LLMs.

MARS-SQL: A multi-agent reinforcement learning framework for Text-to-SQL

cs.CL · 2025-11-02 · unverdicted · novelty 5.0

MARS-SQL trains a multi-agent RL system with ReAct-style interaction and generative validation to produce SQL queries, reaching 77.84% execution accuracy on BIRD dev and 89.75% on Spider test.

XiYan-SQL: A Novel Multi-Generator Framework For Text-to-SQL

cs.CL · 2025-07-07 · unverdicted · novelty 5.0

XiYan-SQL achieves SOTA Text-to-SQL accuracy by combining schema filtering, a multi-generator ensemble fine-tuned on varied SQL formats, and a selection model.

CHESS: Contextual Harnessing for Efficient SQL Synthesis

cs.LG · 2024-05-27 · conditional · novelty 5.0

CHESS deploys four LLM agents to retrieve information, prune schemas, generate refined SQL candidates, and validate via unit tests, reporting up to 71.10% accuracy on BIRD with 83% fewer calls than leading proprietary baselines.

BADGER: Bridging Agentic and Deterministic Evaluation for Generative Enterprise Reasoning

cs.AI · 2026-06-01 · unverdicted · novelty 4.0

BADGER is a new enterprise evaluation framework that adds LLM-assisted SQL component extraction and a Hybrid-EX metric validated on 150 human-annotated queries to existing text-to-SQL and agentic assessment methods.

Retrieve Only Relevant Tables Whether Few or Many: Adaptive Table Retrieval Method

cs.IR · 2026-04-12 · unverdicted · novelty 4.0

An adaptive thresholding mechanism combined with sliding-window reranking retrieves a query-dependent number of tables from large corpora, improving retrieval and downstream text-to-SQL performance on Spider, BIRD, and Spider 2.0.

LLM-Based SQL Generation: Prompting, Self-Refinement, and Adaptive Weighted Majority Voting

cs.AI · 2026-01-25 · unverdicted · novelty 4.0

SSEV reaches 85.5-86.4% execution accuracy on Spider benchmarks and 66.3% on BIRD-Dev through self-refinement and voting; ReCAPAgent-SQL achieves 31% on initial Spider 2.0-Lite queries via agent collaboration.
