pith. sign in

arxiv: 2310.10628 · v1 · pith:B2W5U7HKnew · submitted 2023-10-16 · 💻 cs.CL

Data Contamination Through the Lens of Time

classification 💻 cs.CL
keywords datacontaminationllmsbenchmarksmodelstrainingevaluatingpublicly
0
0 comments X
read the original abstract

Recent claims about the impressive abilities of large language models (LLMs) are often supported by evaluating publicly available benchmarks. Since LLMs train on wide swaths of the internet, this practice raises concerns of data contamination, i.e., evaluating on examples that are explicitly or implicitly included in the training data. Data contamination remains notoriously challenging to measure and mitigate, even with partial attempts like controlled experimentation of training data, canary strings, or embedding similarities. In this work, we conduct the first thorough longitudinal analysis of data contamination in LLMs by using the natural experiment of training cutoffs in GPT models to look at benchmarks released over time. Specifically, we consider two code/mathematical problem-solving datasets, Codeforces and Project Euler, and find statistically significant trends among LLM pass rate vs. GitHub popularity and release date that provide strong evidence of contamination. By open-sourcing our dataset, raw results, and evaluation framework, our work paves the way for rigorous analyses of data contamination in modern models. We conclude with a discussion of best practices and future steps for publicly releasing benchmarks in the age of LLMs that train on webscale data.

This paper has not been read by Pith yet.

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. How Hard is it to Rig a Benchmark? A Social Choice Analysis of Leaderboard Robustness

    cs.LG 2026-05 unverdicted novelty 7.0

    Benchmark-specific training maps to shift bribery and is NP-hard under Borda and mean win rate; mean win rate has the highest instance-level robustness (median 22 tasks on BBH) among tested aggregation rules.

  2. LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code

    cs.SE 2024-03 unverdicted novelty 6.0

    LiveCodeBench collects 400 recent contest problems to create a contamination-free benchmark evaluating LLMs on code generation and related capabilities like self-repair and execution.

  3. Benchmark Data Contamination of Large Language Models: A Survey

    cs.CL 2024-06 unverdicted novelty 3.0

    A survey reviewing benchmark data contamination in LLMs, its impact on evaluation, and alternative assessment approaches.