Spider 2.0: Evaluating language models on real-world enterprise text-to-SQL workflows
17 Pith papers cite this work.
citing papers explorer
- Residual Skill Optimization for Text-to-SQL Ensembles
Residual skill optimization creates complementary Text-to-SQL agents by training each new skill on prior ensemble failures, yielding accuracy gains on Spider2-Lite and transfer to other dialects and tasks.
- LEAF-SQL: Level-wise Exploration with Adaptive Fine-graining for Text-to-SQL Skeleton Prediction
LEAF-SQL uses level-wise exploration with adaptive fine-graining and dual agents to generate diverse SQL skeletons, reaching 71.6% execution accuracy on the BIRD benchmark and outperforming prior search- and skeleton-based methods.
- SynQL: A Controllable and Scalable Rule-Based Framework for SQL Workload Synthesis for Performance Benchmarking
SynQL synthesizes diverse, execution-ready SQL workloads by deterministically traversing foreign-key graphs to populate ASTs, yielding high topological entropy and cost-model training data with R² ≥ 0.79 on held-out sets.
- SpotIt+: Verification-based Text-to-SQL Evaluation with Database Constraints
SpotIt+ uses verification to find realistic counterexample databases that expose discrepancies between generated and gold SQL queries missed by standard test-based evaluation on the BIRD dataset.
- Both Ends Count! Just How Good are LLM Agents at "Text-to-Big SQL"?
New Text-to-Big SQL metrics show that LLM agents must balance accuracy with cost and speed at scale, where GPT-4o trades some accuracy for up to 12x speedup and GPT-5.2 proves more cost-effective than Gemini 3 Pro on large inputs.
- Towards Direct Evaluation of Harness Optimizers via Priority Ranking
Priority ranking offers a low-cost direct evaluation for harness optimizers that correlates with their real multi-step optimization performance, supported by the Shor dataset of 182 scenarios.
- Hypergraph Enterprise Agentic Reasoner over Heterogeneous Business Systems
HEAR uses a stratified hypergraph ontology to orchestrate evidence-driven multi-hop reasoning over heterogeneous business systems, reaching 94.7% accuracy on supply-chain root-cause tasks with open-weight models.
- Anatomy of a Query: W5H Dimensions and FAR Patterns for Text-to-SQL Evaluation
Text-to-SQL queries universally reduce to Filter-Aggregate-Return operations with domain-varying W5H semantic profiles, showing near-zero causal and mechanistic reasoning everywhere.
- DataClawBench: An Agent Benchmark for Exploratory Real-World Financial Data Analysis
DataClawBench is a new benchmark for exploratory real-world financial data analysis that shows increased exploration by LLM agents does not reliably produce task-relevant progress or correct answers.
- From Unstructured Recall to Schema-Grounded Memory: Reliable AI Memory via Iterative, Schema-Aware Extraction
Schema-aware iterative extraction turns AI memory into a verified system of record, reaching 90-97% accuracy on extraction and end-to-end memory benchmarks where retrieval baselines score 80-87%.
- SemanticAgent: A Semantics-Aware Framework for Text-to-SQL Data Synthesis
SemanticAgent introduces a three-stage semantic analysis, synthesis, and verification process that produces higher-quality text-to-SQL training data than prior execution-only methods.
- AV-SQL: Decomposing Complex Text-to-SQL Queries with Agentic Views
AV-SQL uses a pipeline of LLM agents to generate intermediate CTE views that decompose complex Text-to-SQL queries, reaching 70.38% execution accuracy on Spider 2.0.
- LLMs Get Lost In Multi-Turn Conversation
LLM performance drops by 39% in multi-turn conversations because models make premature assumptions and fail to recover from early errors.
- AgentNLQ: A General-Purpose Agent for Natural Language to SQL
AgentNLQ is a multi-agent LLM framework with schema enrichment and business rules that achieves 78.1% semantic accuracy on the BIRD NL2SQL benchmark.
- Consistency as a Testable Property: Statistical Methods to Evaluate AI Agent Reliability
A framework with U-statistics and kernel-based metrics quantifies AI agent consistency and robustness, showing trajectory metrics outperform pass@1 rates in diagnosing failures.
- From Natural Language to PromQL: A Catalog-Driven Framework with Dynamic Temporal Resolution for Cloud-Native Observability
A catalog-driven framework translates natural language into PromQL queries with dynamic temporal resolution for cloud-native observability.
- ClinQueryAgent: A Conversational Agent for Population Health Management
ClinQueryAgent is a conversational agent that converts natural-language queries into database queries for population health management while keeping patient data secure. The paper reports its use by 128 staff across 15 NHS practices covering 148,319 patients.