
Spider: A large-scale human-labeled dataset for complex and cross-domain semantic parsing and text-to-SQL task. arXiv preprint arXiv:1809.08887

2018 · cs.CL · arXiv 1809.08887

Mixed citation behavior. Most common role is background (60%).

21 Pith papers citing it

Background 60% of classified citations


abstract

We present Spider, a large-scale, complex and cross-domain semantic parsing and text-to-SQL dataset annotated by 11 college students. It consists of 10,181 questions and 5,693 unique complex SQL queries on 200 databases with multiple tables, covering 138 different domains. We define a new complex and cross-domain semantic parsing and text-to-SQL task where different complex SQL queries and databases appear in train and test sets. In this way, the task requires the model to generalize well to both new SQL queries and new database schemas. Spider is distinct from most of the previous semantic parsing tasks because they all use a single database and the exact same programs in the train set and the test set. We experiment with various state-of-the-art models and the best model achieves only 12.4% exact matching accuracy on a database split setting. This shows that Spider presents a strong challenge for future research. Our dataset and task are publicly available at https://yale-lily.github.io/spider
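The database split described in the abstract can be illustrated with a minimal sketch: examples are grouped by database, and whole databases are assigned to either train or test, so no schema seen in training reappears at test time. The field names (`db_id`, `question`, `query`), the toy examples, the `split_by_database` helper, and the 20% test fraction are assumptions made for illustration; this is not the authors' actual split procedure.

```python
import random
from collections import defaultdict


def split_by_database(examples, test_fraction=0.2, seed=0):
    """Split examples so that no database id appears in both train and test.

    Hypothetical helper: the record layout and the test fraction are assumed.
    """
    # Group examples by the database they are asked against.
    by_db = defaultdict(list)
    for ex in examples:
        by_db[ex["db_id"]].append(ex)

    # Shuffle database ids (not individual examples) deterministically.
    db_ids = sorted(by_db)
    random.Random(seed).shuffle(db_ids)

    # Reserve a fraction of whole databases for the test set.
    n_test = max(1, round(len(db_ids) * test_fraction))
    test_dbs = set(db_ids[:n_test])

    train = [ex for db in db_ids if db not in test_dbs for ex in by_db[db]]
    test = [ex for db in db_ids if db in test_dbs for ex in by_db[db]]
    return train, test


# Toy data, invented for this sketch.
examples = [
    {"db_id": "concerts", "question": "How many singers are there?",
     "query": "SELECT count(*) FROM singer"},
    {"db_id": "concerts", "question": "List all stadium names.",
     "query": "SELECT name FROM stadium"},
    {"db_id": "pets", "question": "How many pets are dogs?",
     "query": "SELECT count(*) FROM pets WHERE type = 'dog'"},
    {"db_id": "flights", "question": "List all airline names.",
     "query": "SELECT name FROM airlines"},
    {"db_id": "cars", "question": "How many car makers are there?",
     "query": "SELECT count(*) FROM car_makers"},
    {"db_id": "courses", "question": "List all course titles.",
     "query": "SELECT title FROM course"},
]

train, test = split_by_database(examples)
print(sorted({ex["db_id"] for ex in train}))
print(sorted({ex["db_id"] for ex in test}))
```

Splitting on database ids rather than on individual questions is what forces a model to generalize to unseen schemas, which is the setting the reported 12.4% exact matching accuracy refers to.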


citation-role summary

background 3 · dataset 2

citation-polarity summary

background 3 · use dataset 2

representative citing papers

ExCyTIn-Bench: Evaluating LLM agents on Cyber Threat Investigation

cs.CR · 2025-07-14 · unverdicted · novelty 8.0

ExCyTIn-Bench is the first benchmark of 7542 questions from Microsoft Sentinel threat investigation graphs, where the best LLM agent achieves a reward of 0.606.

Declarative Data Services: Structured Agentic Discovery for Composing Data Systems

cs.AI · 2026-05-20 · unverdicted · novelty 7.0

DDS decomposes agentic data-system composition into bounded sub-searches via intent, operator DAG, per-system skills, and runtime attribution contracts, turning runtime failures into cited skill patches.

Both Ends Count! Just How Good are LLM Agents at "Text-to-Big SQL"?

cs.DB · 2026-02-25 · unverdicted · novelty 7.0

New Text-to-Big SQL metrics show that LLM agents must balance accuracy with cost and speed at scale, where GPT-4o trades some accuracy for up to 12x speedup and GPT-5.2 proves more cost-effective than Gemini 3 Pro on large inputs.

PragLocker: Protecting Agent Intellectual Property in Untrusted Deployments via Non-Portable Prompts

cs.CR · 2026-05-07 · unverdicted · novelty 6.0 · 2 refs

PragLocker generates function-preserving but non-portable prompts for LLM agents via code-symbol semantic anchoring followed by target-model feedback noise injection.

Natural Language Interfaces for Spatial and Temporal Databases: A Comprehensive Overview of Methods, Taxonomy, and Future Directions

cs.DB · 2026-03-24 · unverdicted · novelty 6.0

A literature survey that taxonomizes methods, datasets, and evaluation practices for natural language interfaces to geospatial and temporal databases while identifying recurring trends and future directions.

TreeCoder: Systematic Exploration and Optimisation of Decoding and Constraints for LLM Code Generation

cs.LG · 2025-11-27 · unverdicted · novelty 6.0

TreeCoder improves LLM code generation accuracy by representing decoding as an optimizable tree search over programs with first-class constraints for syntax, style, and execution, outperforming baselines on MBPP and SQL-Spider.

LIMO: Less is More for Reasoning

cs.CL · 2025-02-05 · unverdicted · novelty 6.0

LIMO achieves 63.3% on AIME24 and 95.6% on MATH500 via supervised fine-tuning on roughly 1% of the data used by prior models, supporting the claim that minimal strategic examples suffice when pre-training has already encoded domain knowledge.

Reliable Answers for Recurring Questions: Boosting Text-to-SQL Accuracy with Template Constrained Decoding

cs.CL · 2026-04-30 · unverdicted · novelty 6.0

TeCoD improves Text-to-SQL execution accuracy by up to 36% over in-context learning and cuts latency 2.2x on matched queries by extracting templates from historical pairs and enforcing them with constrained decoding.

LeGo-Code: Can Modular Curriculum Learning Advance Complex Code Generation? Insights from Text-to-SQL

cs.AI · 2026-04-20 · unverdicted · novelty 6.0

Modular curriculum learning with tier-specific adapters outperforms standard fine-tuning on complex Text-to-SQL queries in Spider and BIRD benchmarks by avoiding catastrophic forgetting.

InCoder-32B-Thinking: Industrial Code World Model for Thinking

cs.AR · 2026-04-03 · unverdicted · novelty 6.0

InCoder-32B-Thinking uses error-feedback synthesized thinking traces and a code world model to reach top open-source scores on general and industrial code benchmarks including 81.3% on LiveCodeBench and 84.0% on CAD-Coder.

LLMs Get Lost In Multi-Turn Conversation

cs.CL · 2025-05-09 · unverdicted · novelty 6.0

LLMs drop 39% in performance during multi-turn conversations due to premature assumptions and inability to recover from early errors.

Measuring Coding Challenge Competence With APPS

cs.SE · 2021-05-20 · unverdicted · novelty 6.0

APPS benchmark shows models like GPT-Neo pass roughly 20% of test cases on introductory problems, indicating machine learning is beginning to learn basic coding.

MARS-SQL: A multi-agent reinforcement learning framework for Text-to-SQL

cs.CL · 2025-11-02 · unverdicted · novelty 5.0

MARS-SQL trains a multi-agent RL system with ReAct-style interaction and generative validation to produce SQL queries, reaching 77.84% execution accuracy on BIRD dev and 89.75% on Spider test.

XiYan-SQL: A Novel Multi-Generator Framework For Text-to-SQL

cs.CL · 2025-07-07 · unverdicted · novelty 5.0

XiYan-SQL achieves SOTA Text-to-SQL accuracy by combining schema filtering, a multi-generator ensemble fine-tuned on varied SQL formats, and a selection model.

CHESS: Contextual Harnessing for Efficient SQL Synthesis

cs.LG · 2024-05-27 · conditional · novelty 5.0

CHESS deploys four LLM agents to retrieve information, prune schemas, generate refined SQL candidates, and validate via unit tests, reporting up to 71.10% accuracy on BIRD with 83% fewer calls than leading proprietary baselines.

An Alternate Agentic AI Architecture (It's About the Data)

cs.DB · 2026-04-23 · unverdicted · novelty 5.0

RUBICON replaces opaque LLM-based tool orchestration in agentic AI with an explicit query algebra (AQL: Find, From, Where) executed via wrappers to deliver traceable, deterministic access to heterogeneous enterprise data systems.

Blue Data Intelligence Layer: Streaming Data and Agents for Multi-source Multi-modal Data-Centric Applications

cs.AI · 2026-04-16 · unverdicted · novelty 5.0

Blue DIL is a new architecture that unifies structured enterprise data, LLM world knowledge, and personal context through declarative query plans and agents for multi-source multi-modal applications.

Phi-4-Mini Technical Report: Compact yet Powerful Multimodal Language Models via Mixture-of-LoRAs

cs.CL · 2025-03-03 · unverdicted · novelty 5.0

Phi-4-Mini achieves strong math and coding performance with only 3.8B parameters via high-quality synthetic data, while Phi-4-Multimodal uses Mixture-of-LoRAs to integrate modalities and top speech recognition leaderboards.

StarCoder: may the source be with you!

cs.CL · 2023-05-09 · accept · novelty 5.0

StarCoderBase matches or beats OpenAI's code-cushman-001 on multi-language code benchmarks; the Python-fine-tuned StarCoder reaches 40% pass@1 on HumanEval while retaining other-language performance.

Supplement Generation Training for Enhancing Agentic Task Performance

cs.LG · 2026-04-22 · unverdicted · novelty 4.0

SGT trains a lightweight model to generate task-specific supplemental text that improves performance of a larger frozen LLM on agentic tasks without modifying the large model.

Qwen2.5-Coder Technical Report

cs.CL · 2024-09-18 · unverdicted · novelty 4.0

Qwen2.5-Coder models claim state-of-the-art results on over 10 code benchmarks, outperforming larger models of similar size.

citing papers explorer

Showing 21 of 21 citing papers.

ExCyTIn-Bench: Evaluating LLM agents on Cyber Threat Investigation cs.CR · 2025-07-14 · unverdicted · none · ref 57 · internal anchor
ExCyTIn-Bench is the first benchmark of 7542 questions from Microsoft Sentinel threat investigation graphs, where the best LLM agent achieves a reward of 0.606.
Declarative Data Services: Structured Agentic Discovery for Composing Data Systems cs.AI · 2026-05-20 · unverdicted · none · ref 49 · internal anchor
DDS decomposes agentic data-system composition into bounded sub-searches via intent, operator DAG, per-system skills, and runtime attribution contracts, turning runtime failures into cited skill patches.
Both Ends Count! Just How Good are LLM Agents at "Text-to-Big SQL"? cs.DB · 2026-02-25 · unverdicted · none · ref 56 · internal anchor
New Text-to-Big SQL metrics show that LLM agents must balance accuracy with cost and speed at scale, where GPT-4o trades some accuracy for up to 12x speedup and GPT-5.2 proves more cost-effective than Gemini 3 Pro on large inputs.
PragLocker: Protecting Agent Intellectual Property in Untrusted Deployments via Non-Portable Prompts cs.CR · 2026-05-07 · unverdicted · none · ref 45 · 2 links · internal anchor
PragLocker generates function-preserving but non-portable prompts for LLM agents via code-symbol semantic anchoring followed by target-model feedback noise injection.
Natural Language Interfaces for Spatial and Temporal Databases: A Comprehensive Overview of Methods, Taxonomy, and Future Directions cs.DB · 2026-03-24 · unverdicted · none · ref 129 · internal anchor
A literature survey that taxonomizes methods, datasets, and evaluation practices for natural language interfaces to geospatial and temporal databases while identifying recurring trends and future directions.
TreeCoder: Systematic Exploration and Optimisation of Decoding and Constraints for LLM Code Generation cs.LG · 2025-11-27 · unverdicted · none · ref 49 · internal anchor
TreeCoder improves LLM code generation accuracy by representing decoding as an optimizable tree search over programs with first-class constraints for syntax, style, and execution, outperforming baselines on MBPP and SQL-Spider.
LIMO: Less is More for Reasoning cs.CL · 2025-02-05 · unverdicted · none · ref 220 · internal anchor
LIMO achieves 63.3% on AIME24 and 95.6% on MATH500 via supervised fine-tuning on roughly 1% of the data used by prior models, supporting the claim that minimal strategic examples suffice when pre-training has already encoded domain knowledge.
Reliable Answers for Recurring Questions: Boosting Text-to-SQL Accuracy with Template Constrained Decoding cs.CL · 2026-04-30 · unverdicted · none · ref 36
TeCoD improves Text-to-SQL execution accuracy by up to 36% over in-context learning and cuts latency 2.2x on matched queries by extracting templates from historical pairs and enforcing them with constrained decoding.
LeGo-Code: Can Modular Curriculum Learning Advance Complex Code Generation? Insights from Text-to-SQL cs.AI · 2026-04-20 · unverdicted · none · ref 25
Modular curriculum learning with tier-specific adapters outperforms standard fine-tuning on complex Text-to-SQL queries in Spider and BIRD benchmarks by avoiding catastrophic forgetting.
InCoder-32B-Thinking: Industrial Code World Model for Thinking cs.AR · 2026-04-03 · unverdicted · none · ref 45
InCoder-32B-Thinking uses error-feedback synthesized thinking traces and a code world model to reach top open-source scores on general and industrial code benchmarks including 81.3% on LiveCodeBench and 84.0% on CAD-Coder.
LLMs Get Lost In Multi-Turn Conversation cs.CL · 2025-05-09 · unverdicted · none · ref 90
LLMs drop 39% in performance during multi-turn conversations due to premature assumptions and inability to recover from early errors.
Measuring Coding Challenge Competence With APPS cs.SE · 2021-05-20 · unverdicted · none · ref 16
APPS benchmark shows models like GPT-Neo pass roughly 20% of test cases on introductory problems, indicating machine learning is beginning to learn basic coding.
MARS-SQL: A multi-agent reinforcement learning framework for Text-to-SQL cs.CL · 2025-11-02 · unverdicted · none · ref 39 · internal anchor
MARS-SQL trains a multi-agent RL system with ReAct-style interaction and generative validation to produce SQL queries, reaching 77.84% execution accuracy on BIRD dev and 89.75% on Spider test.
XiYan-SQL: A Novel Multi-Generator Framework For Text-to-SQL cs.CL · 2025-07-07 · unverdicted · none · ref 7 · internal anchor
XiYan-SQL achieves SOTA Text-to-SQL accuracy by combining schema filtering, a multi-generator ensemble fine-tuned on varied SQL formats, and a selection model.
CHESS: Contextual Harnessing for Efficient SQL Synthesis cs.LG · 2024-05-27 · conditional · none · ref 62 · internal anchor
CHESS deploys four LLM agents to retrieve information, prune schemas, generate refined SQL candidates, and validate via unit tests, reporting up to 71.10% accuracy on BIRD with 83% fewer calls than leading proprietary baselines.
An Alternate Agentic AI Architecture (It's About the Data) cs.DB · 2026-04-23 · unverdicted · none · ref 1
RUBICON replaces opaque LLM-based tool orchestration in agentic AI with an explicit query algebra (AQL: Find, From, Where) executed via wrappers to deliver traceable, deterministic access to heterogeneous enterprise data systems.
Blue Data Intelligence Layer: Streaming Data and Agents for Multi-source Multi-modal Data-Centric Applications cs.AI · 2026-04-16 · unverdicted · none · ref 18
Blue DIL is a new architecture that unifies structured enterprise data, LLM world knowledge, and personal context through declarative query plans and agents for multi-source multi-modal applications.
Phi-4-Mini Technical Report: Compact yet Powerful Multimodal Language Models via Mixture-of-LoRAs cs.CL · 2025-03-03 · unverdicted · none · ref 53
Phi-4-Mini achieves strong math and coding performance with only 3.8B parameters via high-quality synthetic data, while Phi-4-Multimodal uses Mixture-of-LoRAs to integrate modalities and top speech recognition leaderboards.
StarCoder: may the source be with you! cs.CL · 2023-05-09 · accept · none · ref 182
StarCoderBase matches or beats OpenAI's code-cushman-001 on multi-language code benchmarks; the Python-fine-tuned StarCoder reaches 40% pass@1 on HumanEval while retaining other-language performance.
Supplement Generation Training for Enhancing Agentic Task Performance cs.LG · 2026-04-22 · unverdicted · none · ref 44
SGT trains a lightweight model to generate task-specific supplemental text that improves performance of a larger frozen LLM on agentic tasks without modifying the large model.
Qwen2.5-Coder Technical Report cs.CL · 2024-09-18 · unverdicted · none · ref 38
Qwen2.5-Coder models claim state-of-the-art results on over 10 code benchmarks, outperforming larger models of similar size.
