Spider: A Large-Scale Human-Labeled Dataset for Complex and Cross-Domain Semantic Parsing and Text-to-SQL Task
15 Pith papers cite this work.
citing papers explorer
- SWE-bench: Can Language Models Resolve Real-World GitHub Issues?
  SWE-bench reveals that even top language models like Claude 2 resolve only 1.96% of 2,294 real-world GitHub issues, highlighting a gap in practical coding capabilities.
- Mind2Web: Towards a Generalist Agent for the Web
  Mind2Web is the first large-scale dataset of real-world web tasks for developing generalist language-guided agents that complete complex actions on diverse websites.
- PrepBench: How Far Are We from Natural-Language-Driven Data Preparation?
  PrepBench is a benchmark showing that state-of-the-art LLMs still struggle with natural-language-driven data preparation involving disambiguation, code generation, and workflow translation.
- NL2SQLBench: A Modular Benchmarking Framework for LLM-Enabled NL2SQL Solutions
  NL2SQLBench is a new modular benchmarking framework that evaluates LLM NL2SQL methods across three core modules on existing datasets, exposing large accuracy gaps and computational inefficiency.
- Hunt Instead of Wait: Evaluating Deep Data Research on Large Language Models
  Frontier LLMs display emerging investigatory agency in autonomous database analysis but struggle with long-horizon exploration on the new DDR-Bench.
- Anatomy of a Query: W5H Dimensions and FAR Patterns for Text-to-SQL Evaluation
  Text-to-SQL queries universally reduce to Filter-Aggregate-Return operations with domain-varying W5H semantic profiles, showing near-zero causal and mechanistic reasoning everywhere.
- Less Is More: Measuring How LLM Involvement affects Chatbot Accuracy in Static Analysis
  A structured JSON intermediate representation for LLM-generated static analysis queries outperforms both direct generation and agentic tool use, with gains of 15-25 percentage points on large models.
- SPENCE: A Syntactic Probe for Detecting Contamination in NL2SQL Benchmarks
  SPENCE shows older NL2SQL benchmarks like Spider have high performance sensitivity to syntactic changes, indicating likely training contamination, while newer ones like BIRD show little sensitivity and appear largely clean.
- SQL Query Engine: A Self-Healing LLM Pipeline for Natural Language to PostgreSQL Translation
  A self-healing LLM pipeline for natural language to PostgreSQL translation achieves up to 9.3 percentage point accuracy gains on benchmarks through error diagnosis and anti-regression mechanisms.
- AU-Harness: An Open-Source Toolkit for Holistic Evaluation of Audio LLMs
  AU-Harness introduces an efficient unified evaluation framework for audio LLMs featuring batch optimizations, multi-turn dialogue support, and standardized protocols for fair comparisons.
- Enhancing Chat Language Models by Scaling High-quality Instructional Conversations
  UltraChat supplies 1.5 million high-quality multi-turn dialogues that, when used to fine-tune LLaMA, produce UltraLLaMA, which outperforms prior open-source chat models including Vicuna.
- StarCoder: may the source be with you!
  StarCoderBase matches or beats OpenAI's code-cushman-001 on multi-language code benchmarks; the Python-fine-tuned StarCoder reaches 40% pass@1 on HumanEval while retaining other-language performance.
- ClinQueryAgent: A Conversational Agent for Population Health Management
  The paper introduces ClinQueryAgent, a conversational agent that converts natural language queries into database queries for population health management while keeping patient data secure, and reports its use by 128 staff across 15 NHS practices covering 148,319 patients.
- Retrieve Only Relevant Tables Whether Few or Many: Adaptive Table Retrieval Method
  An adaptive thresholding mechanism combined with sliding-window reranking retrieves a query-dependent number of tables from large corpora, improving retrieval and downstream text-to-SQL performance on Spider, BIRD, and Spider 2.0.
- Benchmark Data Contamination of Large Language Models: A Survey
  A survey reviewing benchmark data contamination in LLMs, its impact on evaluation, and alternative assessment approaches.