Spider 2.0: Evaluating Language Models on Real-World Enterprise Text-to-
2 Pith papers cite this work. Polarity classification is still indexing.
Years: 2026 (2)
Verdicts: UNVERDICTED (2)
citing papers explorer
- SPENCE: A Syntactic Probe for Detecting Contamination in NL2SQL Benchmarks
  SPENCE shows older NL2SQL benchmarks like Spider have high performance sensitivity to syntactic changes, indicating likely training contamination, while newer ones like BIRD show little sensitivity and appear largely clean.
- Retrieve Only Relevant Tables Whether Few or Many: Adaptive Table Retrieval Method
  An adaptive thresholding mechanism combined with sliding-window reranking retrieves a query-dependent number of tables from large corpora, improving retrieval and downstream text-to-SQL performance on Spider, BIRD, and Spider 2.0.