Spider 2.0: Evaluating language models on real-world enterprise text-to-SQL workflows

Fangyu Lei, Jixuan Chen, Yuxiao Ye, Ruisheng Cao, Dongchan Shin, Hongjin Su, Zhaoqing Suo, Hongcheng Gao, Wenjing Hu, Pengcheng Yin, et al · 2024 · arXiv 2411.07763

17 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background: 2 · dataset: 2

citation-polarity summary

use dataset: 2 · background: 1 · support: 1

representative citing papers

Residual Skill Optimization for Text-to-SQL Ensembles

cs.CL · 2026-05-20 · unverdicted · novelty 7.0

Residual skill optimization creates complementary Text-to-SQL agents by training each new skill on prior ensemble failures, yielding accuracy gains on Spider2-Lite and transfer to other dialects and tasks.

LEAF-SQL: Level-wise Exploration with Adaptive Fine-graining for Text-to-SQL Skeleton Prediction

cs.CL · 2026-05-10 · unverdicted · novelty 7.0

LEAF-SQL uses level-wise exploration with adaptive fine-graining and dual agents to generate diverse SQL skeletons, reaching 71.6% execution accuracy on the BIRD benchmark and outperforming prior search- and skeleton-based methods.

SynQL: A Controllable and Scalable Rule-Based Framework for SQL Workload Synthesis for Performance Benchmarking

cs.DB · 2026-04-09 · unverdicted · novelty 7.0

SynQL synthesizes diverse, execution-ready SQL workloads by deterministically traversing foreign-key graphs to populate ASTs, yielding high topological entropy and cost-model training data with R² ≥ 0.79 on held-out sets.

SpotIt+: Verification-based Text-to-SQL Evaluation with Database Constraints

cs.DB · 2026-03-04 · unverdicted · novelty 7.0

SpotIt+ uses verification to find realistic counterexample databases that expose discrepancies between generated and gold SQL queries missed by standard test-based evaluation on the BIRD dataset.

Both Ends Count! Just How Good are LLM Agents at "Text-to-Big SQL"?

cs.DB · 2026-02-25 · unverdicted · novelty 7.0

New Text-to-Big SQL metrics show that LLM agents must balance accuracy with cost and speed at scale, where GPT-4o trades some accuracy for up to 12x speedup and GPT-5.2 proves more cost-effective than Gemini 3 Pro on large inputs.

Towards Direct Evaluation of Harness Optimizers via Priority Ranking

cs.AI · 2026-05-21 · unverdicted · novelty 6.0

Priority ranking offers a low-cost direct evaluation for harness optimizers that correlates with their real multi-step optimization performance, supported by the Shor dataset of 182 scenarios.

Hypergraph Enterprise Agentic Reasoner over Heterogeneous Business Systems

cs.AI · 2026-05-14 · unverdicted · novelty 6.0

HEAR uses a stratified hypergraph ontology to orchestrate evidence-driven multi-hop reasoning over heterogeneous business systems, reaching 94.7% accuracy on supply-chain root-cause tasks with open-weight models.

Anatomy of a Query: W5H Dimensions and FAR Patterns for Text-to-SQL Evaluation

cs.DB · 2026-05-07 · unverdicted · novelty 6.0

Text-to-SQL queries universally reduce to Filter-Aggregate-Return operations with domain-varying W5H semantic profiles, showing near-zero causal and mechanistic reasoning everywhere.

DataClawBench: An Agent Benchmark for Exploratory Real-World Financial Data Analysis

cs.AI · 2026-05-04 · unverdicted · novelty 6.0

DataClawBench is a new benchmark for exploratory real-world financial data analysis that shows increased exploration by LLM agents does not reliably produce task-relevant progress or correct answers.

From Unstructured Recall to Schema-Grounded Memory: Reliable AI Memory via Iterative, Schema-Aware Extraction

cs.AI · 2026-04-30 · unverdicted · novelty 6.0

Schema-aware iterative extraction turns AI memory into a verified system of record, reaching 90-97% accuracy on extraction and end-to-end memory benchmarks where retrieval baselines score 80-87%.

SemanticAgent: A Semantics-Aware Framework for Text-to-SQL Data Synthesis

cs.AI · 2026-04-23 · unverdicted · novelty 6.0

SemanticAgent introduces a three-stage semantic analysis, synthesis, and verification process that produces higher-quality text-to-SQL training data than prior execution-only methods.

AV-SQL: Decomposing Complex Text-to-SQL Queries with Agentic Views

cs.DB · 2026-04-08 · unverdicted · novelty 6.0

AV-SQL uses a pipeline of LLM agents to generate intermediate CTE views that decompose complex Text-to-SQL queries, reaching 70.38% execution accuracy on Spider 2.0.

LLMs Get Lost In Multi-Turn Conversation

cs.CL · 2025-05-09 · unverdicted · novelty 6.0

LLMs drop 39% in performance during multi-turn conversations due to premature assumptions and inability to recover from early errors.

AgentNLQ: A General-Purpose Agent for Natural Language to SQL

cs.AI · 2026-05-18 · unverdicted · novelty 5.0

A multi-agent LLM framework with schema enrichment and business rules achieves 78.1% semantic accuracy on the BIRD NL2SQL benchmark.

Consistency as a Testable Property: Statistical Methods to Evaluate AI Agent Reliability

cs.AI · 2026-05-11 · unverdicted · novelty 5.0

A framework with U-statistics and kernel-based metrics quantifies AI agent consistency and robustness, showing trajectory metrics outperform pass@1 rates in diagnosing failures.

From Natural Language to PromQL: A Catalog-Driven Framework with Dynamic Temporal Resolution for Cloud-Native Observability

cs.DB · 2026-03-15 · unverdicted · novelty 5.0

A catalog-driven framework translates natural language into PromQL queries with dynamic temporal resolution for cloud-native observability.

ClinQueryAgent: A Conversational Agent for Population Health Management

cs.IR · 2026-04-13 · unverdicted · novelty 4.0

The paper introduces ClinQueryAgent, a conversational agent that converts natural language queries into database queries for population health management while keeping patient data secure, and reports its use by 128 staff across 15 NHS practices covering 148,319 patients.

citing papers explorer


Residual Skill Optimization for Text-to-SQL Ensembles · cs.CL · 2026-05-20 · unverdicted · none · ref 17
LEAF-SQL: Level-wise Exploration with Adaptive Fine-graining for Text-to-SQL Skeleton Prediction · cs.CL · 2026-05-10 · unverdicted · none · ref 14
SynQL: A Controllable and Scalable Rule-Based Framework for SQL Workload Synthesis for Performance Benchmarking · cs.DB · 2026-04-09 · unverdicted · none · ref 9
SpotIt+: Verification-based Text-to-SQL Evaluation with Database Constraints · cs.DB · 2026-03-04 · unverdicted · none · ref 12
Both Ends Count! Just How Good are LLM Agents at "Text-to-Big SQL"? · cs.DB · 2026-02-25 · unverdicted · none · ref 31
Towards Direct Evaluation of Harness Optimizers via Priority Ranking · cs.AI · 2026-05-21 · unverdicted · none · ref 39
Hypergraph Enterprise Agentic Reasoner over Heterogeneous Business Systems · cs.AI · 2026-05-14 · unverdicted · none · ref 36
Anatomy of a Query: W5H Dimensions and FAR Patterns for Text-to-SQL Evaluation · cs.DB · 2026-05-07 · unverdicted · none · ref 5
DataClawBench: An Agent Benchmark for Exploratory Real-World Financial Data Analysis · cs.AI · 2026-05-04 · unverdicted · none · ref 3
From Unstructured Recall to Schema-Grounded Memory: Reliable AI Memory via Iterative, Schema-Aware Extraction · cs.AI · 2026-04-30 · unverdicted · none · ref 21
SemanticAgent: A Semantics-Aware Framework for Text-to-SQL Data Synthesis · cs.AI · 2026-04-23 · unverdicted · none · ref 38
AV-SQL: Decomposing Complex Text-to-SQL Queries with Agentic Views · cs.DB · 2026-04-08 · unverdicted · none · ref 19
LLMs Get Lost In Multi-Turn Conversation · cs.CL · 2025-05-09 · unverdicted · none · ref 46
AgentNLQ: A General-Purpose Agent for Natural Language to SQL · cs.AI · 2026-05-18 · unverdicted · none · ref 6
Consistency as a Testable Property: Statistical Methods to Evaluate AI Agent Reliability · cs.AI · 2026-05-11 · unverdicted · none · ref 1
From Natural Language to PromQL: A Catalog-Driven Framework with Dynamic Temporal Resolution for Cloud-Native Observability · cs.DB · 2026-03-15 · unverdicted · none · ref 5
ClinQueryAgent: A Conversational Agent for Population Health Management · cs.IR · 2026-04-13 · unverdicted · none · ref 181
