
Spider: A Large-Scale Human-Labeled Dataset for Complex and Cross-Domain Semantic Parsing and Text-to-SQL Task

Tao Yu, Rui Zhang, Kai Yang, Michihiro Yasunaga, Dongxu Wang, Zifan Li, James Ma, Irene Li, Qingning Yao, Shanelle Roman, Zilin Zhang, Dragomir Radev · 2018 · DOI 10.18653/v1/d18-1425

15 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background: 3 · dataset: 1

citation-polarity summary

background: 3 · use dataset: 1

representative citing papers

SWE-bench: Can Language Models Resolve Real-World GitHub Issues?

cs.CL · 2023-10-10 · unverdicted · novelty 8.0

SWE-bench reveals that even top language models like Claude 2 resolve only 1.96% of 2,294 real-world GitHub issues, highlighting a gap in practical coding capabilities.

Mind2Web: Towards a Generalist Agent for the Web

cs.CL · 2023-06-09 · accept · novelty 8.0

Mind2Web is the first large-scale dataset of real-world web tasks for developing generalist language-guided agents that complete complex actions on diverse websites.

PrepBench: How Far Are We from Natural-Language-Driven Data Preparation?

cs.DB · 2026-05-09 · unverdicted · novelty 7.0

PrepBench is a benchmark showing that state-of-the-art LLMs still struggle with natural-language-driven data preparation involving disambiguation, code generation, and workflow translation.

NL2SQLBench: A Modular Benchmarking Framework for LLM-Enabled NL2SQL Solutions

cs.DB · 2026-04-13 · conditional · novelty 7.0

NL2SQLBench is a new modular benchmarking framework that evaluates LLM NL2SQL methods across three core modules on existing datasets, exposing large accuracy gaps and computational inefficiency.

Hunt Instead of Wait: Evaluating Deep Data Research on Large Language Models

cs.AI · 2026-02-02 · unverdicted · novelty 7.0

Frontier LLMs display emerging investigatory agency in autonomous database analysis but struggle with long-horizon exploration on the new DDR-Bench.

Anatomy of a Query: W5H Dimensions and FAR Patterns for Text-to-SQL Evaluation

cs.DB · 2026-05-07 · unverdicted · novelty 6.0

Text-to-SQL queries universally reduce to Filter-Aggregate-Return operations with domain-varying W5H semantic profiles, showing near-zero causal and mechanistic reasoning everywhere.

Less Is More: Measuring How LLM Involvement affects Chatbot Accuracy in Static Analysis

cs.SE · 2026-04-23 · unverdicted · novelty 6.0

A structured JSON intermediate representation for LLM-generated static analysis queries outperforms both direct generation and agentic tool use, with gains of 15-25 percentage points on large models.

SPENCE: A Syntactic Probe for Detecting Contamination in NL2SQL Benchmarks

cs.CL · 2026-04-20 · unverdicted · novelty 6.0

SPENCE shows older NL2SQL benchmarks like Spider have high performance sensitivity to syntactic changes, indicating likely training contamination, while newer ones like BIRD show little sensitivity and appear largely clean.

SQL Query Engine: A Self-Healing LLM Pipeline for Natural Language to PostgreSQL Translation

cs.DB · 2026-04-15 · unverdicted · novelty 6.0

A self-healing LLM pipeline for natural language to PostgreSQL translation achieves up to 9.3 percentage point accuracy gains on benchmarks through error diagnosis and anti-regression mechanisms.

AU-Harness: An Open-Source Toolkit for Holistic Evaluation of Audio LLMs

cs.SD · 2025-09-09 · unverdicted · novelty 6.0

AU-Harness introduces an efficient unified evaluation framework for audio LLMs featuring batch optimizations, multi-turn dialogue support, and standardized protocols for fair comparisons.

Enhancing Chat Language Models by Scaling High-quality Instructional Conversations

cs.CL · 2023-05-23 · conditional · novelty 6.0

UltraChat supplies 1.5 million high-quality multi-turn dialogues that, when used to fine-tune LLaMA, produce UltraLLaMA, which outperforms prior open-source chat models including Vicuna.

StarCoder: may the source be with you!

cs.CL · 2023-05-09 · accept · novelty 5.0

StarCoderBase matches or beats OpenAI's code-cushman-001 on multi-language code benchmarks; the Python-fine-tuned StarCoder reaches 40% pass@1 on HumanEval while retaining other-language performance.

ClinQueryAgent: A Conversational Agent for Population Health Management

cs.IR · 2026-04-13 · unverdicted · novelty 4.0

The paper introduces ClinQueryAgent, a conversational agent that converts natural language queries into database queries for population health management while keeping patient data secure, and reports its use by 128 staff across 15 NHS practices covering 148,319 patients.

Retrieve Only Relevant Tables Whether Few or Many: Adaptive Table Retrieval Method

cs.IR · 2026-04-12 · unverdicted · novelty 4.0

An adaptive thresholding mechanism combined with sliding-window reranking retrieves a query-dependent number of tables from large corpora, improving retrieval and downstream text-to-SQL performance on Spider, BIRD, and Spider 2.0.

Benchmark Data Contamination of Large Language Models: A Survey

cs.CL · 2024-06-06 · unverdicted · novelty 3.0

A survey reviewing benchmark data contamination in LLMs, its impact on evaluation, and alternative assessment approaches.

citing papers explorer

Showing 15 of 15 citing papers.

SWE-bench: Can Language Models Resolve Real-World GitHub Issues? · cs.CL · 2023-10-10 · unverdicted · none · ref 141
Mind2Web: Towards a Generalist Agent for the Web · cs.CL · 2023-06-09 · accept · none · ref 45
PrepBench: How Far Are We from Natural-Language-Driven Data Preparation? · cs.DB · 2026-05-09 · unverdicted · none · ref 55
NL2SQLBench: A Modular Benchmarking Framework for LLM-Enabled NL2SQL Solutions · cs.DB · 2026-04-13 · conditional · none · ref 74
Hunt Instead of Wait: Evaluating Deep Data Research on Large Language Models · cs.AI · 2026-02-02 · unverdicted · none · ref 8
Anatomy of a Query: W5H Dimensions and FAR Patterns for Text-to-SQL Evaluation · cs.DB · 2026-05-07 · unverdicted · none · ref 12
Less Is More: Measuring How LLM Involvement affects Chatbot Accuracy in Static Analysis · cs.SE · 2026-04-23 · unverdicted · none · ref 30
SPENCE: A Syntactic Probe for Detecting Contamination in NL2SQL Benchmarks · cs.CL · 2026-04-20 · unverdicted · none · ref 38
SQL Query Engine: A Self-Healing LLM Pipeline for Natural Language to PostgreSQL Translation · cs.DB · 2026-04-15 · unverdicted · none · ref 3
AU-Harness: An Open-Source Toolkit for Holistic Evaluation of Audio LLMs · cs.SD · 2025-09-09 · unverdicted · none · ref 19
Enhancing Chat Language Models by Scaling High-quality Instructional Conversations · cs.CL · 2023-05-23 · conditional · none · ref 139
StarCoder: may the source be with you! · cs.CL · 2023-05-09 · accept · none · ref 178
ClinQueryAgent: A Conversational Agent for Population Health Management · cs.IR · 2026-04-13 · unverdicted · none · ref 182
Retrieve Only Relevant Tables Whether Few or Many: Adaptive Table Retrieval Method · cs.IR · 2026-04-12 · unverdicted · none · ref 74
Benchmark Data Contamination of Large Language Models: A Survey · cs.CL · 2024-06-06 · unverdicted · none · ref 174
