SPENCE shows older NL2SQL benchmarks like Spider have high performance sensitivity to syntactic changes, indicating likely training contamination, while newer ones like BIRD show little sensitivity and appear largely clean.
Advances in Neural Information Processing Systems , volume=
3 Pith papers cite this work. Polarity classification is still indexing.
representative citing papers
CHESS deploys four LLM agents to retrieve information, prune schemas, generate refined SQL candidates, and validate via unit tests, reporting up to 71.10% accuracy on BIRD with 83% fewer calls than leading proprietary baselines.
SGT trains a lightweight model to generate task-specific supplemental text that improves performance of a larger frozen LLM on agentic tasks without modifying the large model.
citing papers explorer
-
SPENCE: A Syntactic Probe for Detecting Contamination in NL2SQL Benchmarks
SPENCE shows older NL2SQL benchmarks like Spider have high performance sensitivity to syntactic changes, indicating likely training contamination, while newer ones like BIRD show little sensitivity and appear largely clean.
-
CHESS: Contextual Harnessing for Efficient SQL Synthesis
CHESS deploys four LLM agents to retrieve information, prune schemas, generate refined SQL candidates, and validate via unit tests, reporting up to 71.10% accuracy on BIRD with 83% fewer calls than leading proprietary baselines.
-
Supplement Generation Training for Enhancing Agentic Task Performance
SGT trains a lightweight model to generate task-specific supplemental text that improves performance of a larger frozen LLM on agentic tasks without modifying the large model.