ExCyTIn-Bench is the first benchmark of 7542 questions from Microsoft Sentinel threat investigation graphs, where the best LLM agent achieves a reward of 0.606.
Text-to-sql empowered by large language models: A benchmark evaluation.arXiv preprint arXiv:2308.15363, 2023
25 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
roles
background 3polarities
background 3representative citing papers
ACE-SQL jointly optimizes schema linking and SQL generation via RL with empirical credit assignment from execution-correct rollouts, achieving 65.3% greedy execution accuracy on BIRD Dev using 0.93k output tokens.
Data Flow Control formalizes data safety as aggregate predicates over provenance monomials and implements enforcement via the Passant query rewriting layer achieving near-zero overhead across five DBMS engines.
EntSQL is a new benchmark with 1,066 examples across five domains where top systems reach only 15.9% accuracy on English inputs when long-form enterprise documents are provided.
Residual skill optimization creates complementary Text-to-SQL agents by training each new skill on prior ensemble failures, yielding accuracy gains on Spider2-Lite and transfer to other dialects and tasks.
CA-SQL achieves 51.72% execution accuracy on the challenging tier of the BIRD benchmark using GPT-4o-mini by scaling exploration breadth according to estimated task difficulty, evolutionary prompt seeding, and candidate voting.
SOMA-SQL resolves multi-source ambiguity in NL-to-SQL using synthetic query logs and ambiguity-driven execution probing, reporting 13% average execution accuracy gains over baselines on six benchmarks.
SANE is a new schema-aware benchmark paradigm for text-to-SQL evaluation that demonstrates few-shot LLMs with structured prompting can generate accurate queries on constrained biological data schemas without fine-tuning.
FINER-SQL boosts 3B-parameter small language models to 67.73% and 85% execution accuracy on BIRD and Spider benchmarks via dense memory and atomic rewards in group relative policy optimization, matching larger LLMs at lower latency.
EGRefine optimizes column renamings via execution-grounded verification and view materialization to recover Text-to-SQL accuracy lost to schema naming issues while guaranteeing query equivalence.
TeCoD improves Text-to-SQL execution accuracy by up to 36% over in-context learning and cuts latency 2.2x on matched queries by extracting templates from historical pairs and enforcing them with constrained decoding.
Fine-tuning an 8B LLM with synthetic data enables accurate natural language querying of structured datasets like accessibility services in Spain, generalizing to new locations.
Introduces the LLM ORDER BY semantic operator with algorithmic improvements, a semantic-aware external merge sort, and a budget-aware optimizer that selects near-optimal access paths for LLM-based ordering.
N-rep consistency achieves comparable BIRD benchmark scores for text-to-SQL at $0.039 per query by combining multiple schema representations, without chain-of-thought reasoning or fine-tuning.
Schema-First Retrieval embeds catalog metadata rather than rows and uses parallel retrieval plus reranking to raise table and column recall and cut SQL execution errors on three benchmarks.
SingGuard introduces a policy-adaptive multimodal LLM guardrail with dynamic reasoning regimes and SingGuard-Bench, reporting SOTA F1 scores across 35 datasets and improved policy-following accuracy under runtime shifts.
An LLM-based framework recommends drill-down paths in visual analytics by approximating a greedy algorithm, interpreting user intent, and managing exploration branches to reduce cognitive load.
LOM-action uses business events to drive ontology-governed graph simulations that generate auditable decisions, reporting 93.82% accuracy and 98.74% tool-chain F1 versus 24-36% F1 for frontier LLMs.
MARS-SQL trains a multi-agent RL system with ReAct-style interaction and generative validation to produce SQL queries, reaching 77.84% execution accuracy on BIRD dev and 89.75% on Spider test.
XiYan-SQL achieves SOTA Text-to-SQL accuracy by combining schema filtering, a multi-generator ensemble fine-tuned on varied SQL formats, and a selection model.
CHESS deploys four LLM agents to retrieve information, prune schemas, generate refined SQL candidates, and validate via unit tests, reporting up to 71.10% accuracy on BIRD with 83% fewer calls than leading proprietary baselines.
BADGER is a new enterprise evaluation framework that adds LLM-assisted SQL component extraction and a Hybrid-EX metric validated on 150 human-annotated queries to existing text-to-SQL and agentic assessment methods.
An adaptive thresholding mechanism combined with sliding-window reranking retrieves a query-dependent number of tables from large corpora, improving retrieval and downstream text-to-SQL performance on Spider, BIRD, and Spider 2.0.
SSEV reaches 85.5-86.4% execution accuracy on Spider benchmarks and 66.3% on BIRD-Dev through self-refinement and voting; ReCAPAgent-SQL achieves 31% on initial Spider 2.0-Lite queries via agent collaboration.
citing papers explorer
-
Reliable Answers for Recurring Questions: Boosting Text-to-SQL Accuracy with Template Constrained Decoding
TeCoD improves Text-to-SQL execution accuracy by up to 36% over in-context learning and cuts latency 2.2x on matched queries by extracting templates from historical pairs and enforcing them with constrained decoding.