arxiv: 2406.04244 · v1 · pith:UHXBMDQAnew · submitted 2024-06-06 · 💻 cs.CL

Benchmark Data Contamination of Large Language Models: A Survey

Pith reviewed 2026-05-22 23:05 UTC · model grok-4.3

classification 💻 cs.CL
keywords benchmark data contamination · large language models · LLM evaluation · data leakage · benchmark reliability · alternative evaluation methods · training data overlap

The pith

Benchmark data that leaks into training sets renders standard LLM evaluations unreliable.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines how large language models can absorb information from evaluation benchmarks during training, a process that produces inflated or misleading performance scores. This issue matters because benchmarks are the main way researchers and users judge whether models are improving at real tasks. The authors review documented cases of the problem, survey methods proposed to detect or avoid it, and discuss remaining challenges plus possible future approaches. If the survey is correct, then many published results on models like GPT-4 rest on contaminated data and cannot be taken at face value without additional checks.

Core claim

The paper establishes that benchmark data contamination occurs when language models inadvertently incorporate evaluation benchmark information from their training data, leading to inaccurate or unreliable performance during the evaluation phase, and that alternative assessment methods must be explored to reduce the associated risks in real-world applications.

What carries the argument

Benchmark Data Contamination (BDC), the process by which evaluation-benchmark text enters a model's training corpus and thereby inflates measured performance on those same benchmarks.
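One simple way to picture "benchmark text entering a training corpus" is a word n-gram overlap check between a benchmark item and a training document. The sketch below is an illustration only, not a method attributed to the survey: the function names, the tokenization rule, the default window size `n`, and the example strings are all assumptions made for this example.

```python
import re


def ngrams(text: str, n: int) -> set[tuple[str, ...]]:
    """Return the set of word n-grams in text (lowercased, punctuation dropped)."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_ratio(benchmark_item: str, training_doc: str, n: int = 8) -> float:
    """Fraction of the benchmark item's n-grams that also occur in the training document."""
    item_grams = ngrams(benchmark_item, n)
    if not item_grams:
        # The item is shorter than n tokens, so there is nothing to compare.
        return 0.0
    return len(item_grams & ngrams(training_doc, n)) / len(item_grams)


# Hypothetical strings, invented for illustration only.
item = "What is the capital of France? The capital of France is Paris."
leaked_doc = (
    "Quiz dump: what is the capital of France? "
    "The capital of France is Paris. Next question."
)
clean_doc = "Paris hosts many museums, and France is known for its cuisine."

print(overlap_ratio(item, leaked_doc, n=5))  # verbatim copy of the item
print(overlap_ratio(item, clean_doc, n=5))   # same topic, no shared 5-gram
```

A check of this kind only catches near-verbatim copies. Paraphrased or translated benchmark items would pass it unnoticed, which is one reason exact-match audits alone are unlikely to be sufficient.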

If this is right

  • Standard public benchmarks lose their value as trustworthy measures of progress.
  • Model developers must adopt data-curation practices that explicitly exclude known evaluation sets.
  • New evaluation protocols such as private held-out tests or dynamic benchmarks become necessary for credible claims.
  • Reported performance numbers on contaminated benchmarks cannot be compared directly across models trained on different data mixtures.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Many existing leaderboards may systematically overstate model capabilities until contamination is measured and corrected.
  • The same leakage risk applies to any fixed test set used repeatedly in machine learning, not just language models.
  • Detection techniques for contamination could be turned into routine pre-release audits for new models.

Load-bearing premise

The papers reviewed in the survey correctly describe how widespread and severe the contamination problem is, and the alternative evaluation methods they describe can be used without creating equally serious new flaws.

What would settle it

A controlled experiment that retrains several current LLMs from scratch on data guaranteed to exclude all benchmark test sets and then shows those models achieve the same scores on the benchmarks as the original contaminated versions.
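The comparison step of such an experiment could be sketched as below, assuming per-benchmark scores are already available for both the original models and their decontaminated retrains. The function names, the benchmark names, the score values, and the tolerance are hypothetical, chosen only to show the shape of the analysis.

```python
def score_gaps(original: dict[str, float], retrained: dict[str, float]) -> dict[str, float]:
    """Score drop per benchmark (original minus retrained), for benchmarks scored in both runs."""
    return {name: original[name] - retrained[name] for name in original if name in retrained}


def flag_inflated(gaps: dict[str, float], tolerance: float = 1.0) -> list[str]:
    """Names of benchmarks whose score dropped by more than the tolerance after retraining."""
    return sorted(name for name, gap in gaps.items() if gap > tolerance)


# Hypothetical accuracy scores in percentage points, invented for illustration only.
original_scores = {"bench_a": 82.0, "bench_b": 67.0, "bench_c": 74.0}
retrained_scores = {"bench_a": 81.5, "bench_b": 52.0}

gaps = score_gaps(original_scores, retrained_scores)
print(gaps)
print(flag_inflated(gaps))
```

If every gap stayed within tolerance, the contamination concern would be weakened for those benchmarks. The fixed tolerance here is arbitrary; a real analysis would need to account for run-to-run variance between training runs.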

original abstract

The rapid development of Large Language Models (LLMs) like GPT-4, Claude-3, and Gemini has transformed the field of natural language processing. However, it has also resulted in a significant issue known as Benchmark Data Contamination (BDC). This occurs when language models inadvertently incorporate evaluation benchmark information from their training data, leading to inaccurate or unreliable performance during the evaluation phase of the process. This paper reviews the complex challenge of BDC in LLM evaluation and explores alternative assessment methods to mitigate the risks associated with traditional benchmarks. The paper also examines challenges and future directions in mitigating BDC risks, highlighting the complexity of the issue and the need for innovative solutions to ensure the reliability of LLM evaluation in real-world applications.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. This survey paper defines Benchmark Data Contamination (BDC) as the inadvertent inclusion of evaluation benchmark data in LLM training corpora, which distorts reported performance on those benchmarks. It reviews the phenomenon across models such as GPT-4, Claude-3, and Gemini, surveys mitigation strategies and alternative evaluation approaches, and outlines open challenges and future research directions for reliable LLM assessment.

Significance. If the coverage of the literature is representative, the survey would usefully consolidate a growing body of work on an issue that directly threatens the validity of standard LLM benchmarks. Its value would lie in mapping the problem space and cataloguing proposed remedies rather than in any novel empirical or theoretical contribution.

major comments (2)
  1. [Abstract / Introduction] Abstract and introduction: no search protocol, inclusion/exclusion criteria, or database sources are stated, making it impossible to judge whether the surveyed literature is comprehensive or systematically selected; this directly affects the reliability of the central descriptive claim.
  2. [Section on alternative methods (inferred from abstract)] The manuscript asserts that alternative assessment methods can mitigate BDC risks, yet provides no concrete comparison of their computational cost, scalability, or susceptibility to new forms of contamination; without such analysis the recommendation of alternatives remains ungrounded.
minor comments (2)
  1. [Abstract] The abstract repeats the definition of BDC without adding new information; a single concise definition would suffice.
  2. [Throughout] No quantitative summary (e.g., number of papers reviewed, distribution across years or venues) is supplied to give readers a sense of the evidence base.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed feedback. We address the two major comments below. Both points identify areas where additional transparency and analysis can strengthen the survey, and we will revise the manuscript accordingly.

point-by-point responses
  1. Referee: [Abstract / Introduction] Abstract and introduction: no search protocol, inclusion/exclusion criteria, or database sources are stated, making it impossible to judge whether the surveyed literature is comprehensive or systematically selected; this directly affects the reliability of the central descriptive claim.

    Authors: We agree that the absence of an explicit literature search protocol limits the ability to assess coverage. Although the paper is a narrative survey rather than a formal systematic review, we will add a new subsection (likely in Section 2 or a dedicated Methods section) that describes the search strategy: databases consulted (arXiv, ACL Anthology, Google Scholar), time window (primarily 2020–2024), keywords used (e.g., “benchmark contamination”, “data leakage LLM”, “evaluation contamination”), and inclusion criteria (peer-reviewed or pre-print works that empirically study or propose mitigations for BDC in LLMs). Exclusion criteria (e.g., non-English papers, purely theoretical works without empirical component) will also be stated. This addition will directly address the concern about transparency.
    revision: yes

  2. Referee: [Section on alternative methods (inferred from abstract)] The manuscript asserts that alternative assessment methods can mitigate BDC risks, yet provides no concrete comparison of their computational cost, scalability, or susceptibility to new forms of contamination; without such analysis the recommendation of alternatives remains ungrounded.

    Authors: The current version surveys the range of proposed alternatives (dynamic benchmarks, private test sets, contamination detection methods, etc.) but does not synthesize quantitative comparisons. We will expand the relevant section to include a comparative table or structured discussion that extracts and contrasts reported computational overhead, scalability limits, and known contamination vulnerabilities from the cited papers. Where primary sources lack such metrics we will explicitly note the gap and flag it as an open research direction rather than claiming superiority. This revision will ground the discussion without introducing new unsubstantiated claims.
    revision: yes

Circularity Check

0 steps flagged

No significant circularity; survey summarizes external literature

full rationale

The paper is a survey whose content consists entirely of descriptions and summaries of prior external work on benchmark data contamination. No derivations, equations, fitted parameters, predictions, or uniqueness theorems are asserted within the manuscript itself. All load-bearing statements are attributed to cited literature rather than derived internally, and no self-citation chains, ansatzes, or renamings of results are used to support any novel claim. The structure therefore contains no steps that reduce by construction to the paper's own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is a literature survey paper. No free parameters, mathematical axioms, or invented entities are introduced or required by any central claim.

pith-pipeline@v0.9.0 · 5650 in / 964 out tokens · 84808 ms · 2026-05-22T23:05:13.024470+00:00 · methodology

Forward citations

Cited by 24 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Unsteady Metrics and Benchmarking Cultures of AI Model Builders

    cs.AI 2026-05 accept novelty 8.0

    AI model builders mostly highlight unique benchmarks that act as flexible narrative tools for market positioning rather than standardized scientific measurements.

  2. Provable Joint Decontamination for Benchmarking Multiple Large Language Models

    cs.LG 2026-05 unverdicted novelty 7.0

    JECS aggregates per-model conformal p-values via their maximum and reconstructs a conservative envelope of the max-p null distribution to select benchmarks with global contamination rate control.

  3. Can Large Language Models Reliably Correct Errors in Low-Resource ASR? A Contamination-Aware Case Study on West Frisian

    cs.CL 2026-05 conditional novelty 7.0

    LLM generative error correction improves low-resource Frisian ASR performance, with comparable gains on a contamination-controlled offline dataset confirming true correction ability.

  4. Dataset Watermarking for Closed LLMs with Provable Detection

    cs.LG 2026-05 unverdicted novelty 7.0

    A new watermarking method for closed LLMs boosts random word-pair co-occurrences via rephrasing and detects the signal statistically in outputs, working reliably even when the watermarked data is only 1% of fine-tunin...

  5. NeuroState-Bench: A Human-Calibrated Benchmark for Commitment Integrity in LLM Agent Profiles

    cs.AI 2026-05 accept novelty 7.0

    NeuroState-Bench is a human-calibrated benchmark with 144 tasks and 306 side-query probes showing that commitment integrity in LLM agent profiles diverges from task success, with 31 of 32 profiles changing rank under ...

  6. BioGraphletQA: Knowledge-Anchored Generation of Complex QA Datasets

    cs.CL 2026-04 conditional novelty 7.0

    A graphlet-anchored framework generates 119,856 factually grounded biomedical QA pairs that improve accuracy on PubMedQA and MedQA benchmarks.

  7. Fine-tuning vs. In-context Learning in Large Language Models: A Formal Language Learning Perspective

    cs.CL 2026-04 conditional novelty 7.0

    A controlled formal language task reveals fine-tuning outperforms in-context learning on in-distribution generalization but equals it on out-of-distribution, with ICL showing greater sensitivity to model size and toke...

  8. Fine-tuning vs. In-context Learning in Large Language Models: A Formal Language Learning Perspective

    cs.CL 2026-04 unverdicted novelty 7.0

    Fine-tuning shows higher proficiency than in-context learning on in-distribution generalization in formal languages, with equal out-of-distribution performance and diverging inductive biases at high proficiency.

  9. How Independent are Large Language Models? A Statistical Framework for Auditing Behavioral Entanglement and Reweighting Verifier Ensembles

    cs.AI 2026-04 unverdicted novelty 7.0

    A new auditing framework reveals widespread behavioral entanglement among LLMs and shows that reweighting ensembles based on measured independence improves verification accuracy by up to 4.5%.

  10. LiveFact: A Dynamic, Time-Aware Benchmark for LLM-Driven Fake News Detection

    cs.CL 2026-04 unverdicted novelty 7.0

    LiveFact is a new time-aware benchmark that evaluates LLMs on reasoning with dynamic and incomplete information for fake news detection, identifying a significant reasoning gap in model behavior.

  11. NeuroState-Bench: A Human-Calibrated Benchmark for Commitment Integrity in LLM Agent Profiles

    cs.AI 2026-05 unverdicted novelty 6.0

    NeuroState-Bench supplies human-calibrated tasks and probes that measure commitment integrity in LLM agents and shows this measure diverges from ordinary task success.

  12. ActuBench: A Multi-Agent LLM Pipeline for Generation and Evaluation of Actuarial Reasoning Tasks

    cs.AI 2026-04 unverdicted novelty 6.0

    ActuBench is a multi-agent LLM pipeline for generating and evaluating actuarial reasoning tasks, with evaluations of 50 models showing effective verification, competitive local open-weights models, and differing ranki...

  13. Micro Language Models Enable Instant Responses

    cs.CL 2026-04 conditional novelty 6.0

    Ultra-compact 8-30M parameter models start contextually grounded responses on-device while cloud models seamlessly continue them, enabling responsive AI on power-constrained hardware.

  14. ClawEnvKit: Automatic Environment Generation for Claw-Like Agents

    cs.AI 2026-04 unverdicted novelty 6.0

    ClawEnvKit automates generation of diverse verified environments for claw-like agents from natural language, producing the Auto-ClawEval benchmark of 1,040 environments that matches human-curated quality at 13,800x lo...

  15. SWE-Bench Pro: Can AI Agents Solve Long-Horizon Software Engineering Tasks?

    cs.SE 2025-09 conditional novelty 6.0

    SWE-Bench Pro is a new benchmark with 1,865 long-horizon tasks from 41 repositories designed to evaluate AI agents on realistic enterprise-level software engineering problems beyond prior benchmarks.

  16. Guidelines for Empirical Studies in Software Engineering involving Large Language Models

    cs.SE 2025-08 accept novelty 6.0

    A group of 22 researchers proposes seven study types and eight guidelines for empirical software engineering studies involving LLMs to enhance reproducibility and replicability.

  17. League of LLMs: A Benchmark-Free Paradigm for Mutual Evaluation of Large Language Models

    cs.AI 2025-07 unverdicted novelty 6.0

    League of LLMs organizes LLMs into a self-governed mutual evaluation league using dynamic, transparent, objective, and professional criteria to distinguish model capabilities with 70.7% top-k ranking stability.

  18. A Study of LLMs' Preferences for Libraries and Programming Languages

    cs.SE 2025-03 unverdicted novelty 6.0

    Empirical study of eight LLMs finds overuse of popular libraries like NumPy in up to 45% of unnecessary cases and strong default preference for Python even when suboptimal.

  19. The Illusion of Reasoning: Exposing Evasive Data Contamination in LLMs via Zero-CoT Truncation

    cs.LG 2026-05 unverdicted novelty 5.0

    ZCP detects direct and evasive data contamination in LLMs by truncating CoT reasoning and contrasting zero-CoT accuracy on original versus perturbed isomorphic datasets, plus a Contamination Confidence metric.

  20. Riemann-Bench: A Benchmark for Moonshot Mathematics

    cs.AI 2026-04 conditional novelty 5.0

    Riemann-Bench is a private benchmark of 25 research-level math problems on which all tested frontier AI models score below 10%.

  21. LLM Benchmark Datasets Should Be Contamination-Resistant

    cs.LG 2026-05 unverdicted novelty 4.0

    Authors call for contamination-resistant LLM benchmarks that exploit Transformer training-inference asymmetry and require new mathematical methods for cross-architecture interoperability.

  22. Compiled AI: Deterministic Code Generation for LLM-Based Workflow Automation

    cs.SE 2026-04 unverdicted novelty 4.0

    Compiled AI generates deterministic code artifacts from LLMs in a one-time compilation step, enabling reliable workflow execution with zero runtime tokens after break-even.

  23. Position: Stop Evaluating AI with Human Tests, Develop Principled, AI-specific Tests instead

    cs.LG 2025-07 unverdicted novelty 4.0

    Human tests should not be applied to AI to measure traits like intelligence due to calibration, validity, contamination, and prompt sensitivity issues; develop AI-specific evaluation frameworks instead.

  24. LLM Harms: A Taxonomy and Discussion

    cs.CY 2025-12 unverdicted novelty 3.0

    This paper proposes a taxonomy of LLM harms in five categories and suggests mitigation strategies plus a dynamic auditing system for responsible development.

Reference graph

Works this paper leans on

189 extracted references · 189 canonical work pages · cited by 22 Pith papers · 19 internal anchors

  1. [1]

    Aggarwal

    Charu C. Aggarwal. 2018. Opinion Mining and Sentiment Analysis . Springer International Publishing, Cham, 413–434. https://doi.org/10.1007/978-3-319-73531-3_13

  2. [2]

    Abdulmohsen Al-Thubaity, Sakhar Alkhereyf, Hanan Murayshid, Nouf Alshalawi, Maha Omirah, Raghad Alateeq, Rawabi Almutairi, Razan Alsuwailem, Manal Alhassoun, and Imaan Alkhanen. 2023. Evaluating ChatGPT and Bard AI on Arabic Sentiment Analysis. In Proceedings of ArabicNLP 2023, Hassan Sawaf, Samhaa El-Beltagy, Wajdi Zaghouani, Walid Magdy, Ahmed Abdelali,...

  3. [3]

    Mussa Aman. 2024. Large Language Model Based Fake News Detection. Procedia Computer Science 231 (2024), 740–

  4. [4]

    https://doi.org/10.1016/j.procs.2023.12.144 14th International Conference on Emerging Ubiquitous Systems and Pervasive Networks / 13th International Conference on Current and Future Trends of Information and Communication Technologies in Healthcare (EUSPN/ICTH 2023)

  5. [5]

    Anthropic. 2024. Introducing the next generation of Claude. https://www.anthropic.com/news/claude-3-family

  6. [6]

    Amanda Askell, Yuntao Bai, Anna Chen, Dawn Drain, Deep Ganguli, Tom Henighan, Andy Jones, Nicholas Joseph, Ben Mann, Nova DasSarma, Nelson Elhage, Zac Hatfield-Dodds, Danny Hernandez, Jackson Kernion, Kamal Ndousse, Catherine Olsson, Dario Amodei, Tom Brown, Jack Clark, Sam McCandlish, Chris Olah, and Jared Kaplan. 2021. A General Language Assistant as a ...

  7. [7]

    Yushi Bai, Jiahao Ying, Yixin Cao, Xin Lv, Yuze He, Xiaozhi Wang, Jifan Yu, Kaisheng Zeng, Yijia Xiao, Haozhe Lyu, Jiayin Zhang, Juanzi Li, and Lei Hou. 2023. Benchmarking Foundation Models with Language-Model-as-an- Examiner. In Thirty-seventh Conference on Neural Information Processing Systems Datasets and Benchmarks Track . https://openreview.net/forum...

  8. [8]

    Simone Balloccu, Patrícia Schmidtová, Mateusz Lango, and Ondrej Dusek. 2024. Leak, Cheat, Repeat: Data Contami- nation and Evaluation Malpractices in Closed-Source LLMs. In Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers) , Yvette Graham and Matthew Purver (Eds.). Associati...

  9. [9]

    Do, Yan Xu, and Pascale Fung

    Yejin Bang, Samuel Cahyawijaya, Nayeon Lee, Wenliang Dai, Dan Su, Bryan Wilie, Holy Lovenia, Ziwei Ji, Tiezheng Yu, Willy Chung, Quyet V. Do, Yan Xu, and Pascale Fung. 2023. A Multitask, Multilingual, Multimodal Evaluation of ChatGPT on Reasoning, Hallucination, and Interactivity. In Proceedings of the 13th International Joint Conference on Natural Langua...

  10. [10]

    Rachel Bawden and François Yvon. 2023. Investigating the Translation Performance of a Large Multilingual Language Model: the Case of BLOOM. In Proceedings of the 24th Annual Conference of the European Association for Machine Translation, Mary Nurminen, Judith Brenner, Maarit Koponen, Sirkku Latomaa, Mikhail Mikhailov, Frederike Schierl, Tharindu Ranasingh...

  11. [11]

    Yoshua Bengio, Réjean Ducharme, and Pascal Vincent. 2000. A Neural Probabilistic Language Model. In Advances in Neural Information Processing Systems , T. Leen, T. Dietterich, and V. Tresp (Eds.), Vol. 13. MIT Press. https: //proceedings.neurips.cc/paper_files/paper/2000/file/728f206c2a01bf572b5940d7d9a8fa4c-Paper.pdf

  12. [12]

    James Betker, Gabriel Goh, Li Jing, Tim Brooks, Jianfeng Wang, Linjie Li, Long Ouyang, Juntang Zhuang, Joyce Lee, Yufei Guo, et al. 2023. Improving image generation with better captions. https://cdn.openai.com/papers/dall-e-3.pdf

  13. [13]

    Terra Blevins and Luke Zettlemoyer. 2022. Language Contamination Helps Explains the Cross-lingual Capabilities of English Pretrained Models. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing , Yoav Goldberg, Zornitsa Kozareva, and Yue Zhang (Eds.). Association for Computational Linguistics, Abu Dhabi, United Arab Em...

  14. [14]

    Su Lin Blodgett, Solon Barocas, Hal Daumé III, and Hanna Wallach. 2020. Language (Technology) is Power: A Critical Survey of “Bias” in NLP. InProceedings of the 58th Annual Meeting of the Association for Computational Linguistics , Dan Jurafsky, Joyce Chai, Natalie Schluter, and Joel Tetreault (Eds.). Association for Computational Linguistics, Online, 545...

  15. [15]

    Sebastian Bordt, Harsha Nori, Vanessa Rodrigues, Besmira Nushi, and Rich Caruana. 2024. Elephants Never Forget: Memorization and Learning of Tabular Data in Large Language Models. arXiv:2404.06209 [cs.LG]

  16. [16]

    Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gr...

  17. [17]

    Jialun Cao, Wuqi Zhang, and Shing-Chi Cheung. 2024. Concerned with Data Contamination? Assessing Countermea- sures in Code Language Model. arXiv:2403.16898 [cs.SE]

  18. [18]

    Nicolas Carlini, Jamie Hayes, Milad Nasr, Matthew Jagielski, Vikash Sehwag, Florian Tramèr, Borja Balle, Daphne Ippolito, and Eric Wallace. 2023. Extracting Training Data from Diffusion Models. In 32nd USENIX Security Sympo- sium (USENIX Security 23) . USENIX Association, Anaheim, CA, 5253–5270. https://www.usenix.org/conference/ , Vol. 1, No. 1, Article ...

  19. [19]

    Nicholas Carlini, Florian Tramèr, Eric Wallace, Matthew Jagielski, Ariel Herbert-Voss, Katherine Lee, Adam Roberts, Tom Brown, Dawn Song, Úlfar Erlingsson, Alina Oprea, and Colin Raffel. 2021. Extracting Training Data from Large Language Models. In 30th USENIX Security Symposium (USENIX Security 21) . USENIX Association, 2633–2650. https://www.usenix.org/...

  20. [20]

    Nishanth Chandran, Sunayana Sitaram, Divya Gupta, Rahul Sharma, Kashish Mittal, and Manohar Swami- nathan. 2024. Private Benchmarking to Prevent Contamination and Improve Comparative Evaluation of LLMs. arXiv:2403.00393 [cs.CR]

  21. [21]

    Kent Chang, Mackenzie Cramer, Sandeep Soni, and David Bamman. 2023. Speak, Memory: An Archaeology of Books Known to ChatGPT/GPT-4. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing , Houda Bouamor, Juan Pino, and Kalika Bali (Eds.). Association for Computational Linguistics, Singapore, 7312–7327. https://doi.org/10....

  22. [22]

    Yu, Qiang Yang, and Xing Xie

    Yupeng Chang, Xu Wang, Jindong Wang, Yuan Wu, Linyi Yang, Kaijie Zhu, Hao Chen, Xiaoyuan Yi, Cunxiang Wang, Yidong Wang, Wei Ye, Yue Zhang, Yi Chang, Philip S. Yu, Qiang Yang, and Xing Xie. 2024. A Survey on Evaluation of Large Language Models. ACM Trans. Intell. Syst. Technol. 15, 3, Article 39 (mar 2024), 45 pages. https://doi.org/10.1145/3641289

  23. [23]

    Long Chen, Oleg Sinavski, Jan Hünermann, Alice Karnsund, Andrew James Willmott, Danny Birch, Daniel Maund, and Jamie Shotton. 2023. Driving with LLMs: Fusing Object-Level Vector Modality for Explainable Autonomous Driving. arXiv:2310.01957 [cs.RO]

  24. [24]

    Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bavarian...

  25. [25]

    Evaluating Large Language Models Trained on Code

    Evaluating Large Language Models Trained on Code. arXiv:2107.03374 [cs.LG]

  26. [26]

    Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference

    Wei-Lin Chiang, Lianmin Zheng, Ying Sheng, Anastasios Nikolas Angelopoulos, Tianle Li, Dacheng Li, Hao Zhang, Banghua Zhu, Michael Jordan, Joseph E. Gonzalez, and Ion Stoica. 2024. Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference. arXiv:2403.04132 [cs.AI]

  27. [27]

    Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, Parker Schuh, Kensen Shi, Sasha Tsvyashchenko, Joshua Maynez, Abhishek Rao, Parker Barnes, Yi Tay, Noam Shazeer, Vinodkumar Prabhakaran, Emily Reif, Nan Du, Ben Hutchinson, Reiner Pope, James Bradb...

  28. [28]

    Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord

  29. [29]

    Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge

    Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge. arXiv:1803.05457 [cs.AI]

  30. [30]

    Junqi Dai, Hang Yan, Tianxiang Sun, Pengfei Liu, and Xipeng Qiu. 2021. Does syntax matter? A strong baseline for Aspect-based Sentiment Analysis with RoBERTa. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies , Kristina Toutanova, Anna Rumshisky, Luke Zettlemoy...

  31. [31]

    Daniel de Vassimon Manela, David Errington, Thomas Fisher, Boris van Breugel, and Pasquale Minervini. 2021. Stereotype and Skew: Quantifying Gender Bias in Pre-trained and Fine-tuned Language Models. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume , Paola Merlo, Jorg Tiedemann, an...

  32. [32]

    Jasper Dekoninck, Mark Niklas Müller, Maximilian Baader, Marc Fischer, and Martin Vechev. 2024. Evading Data Contamination Detection for Language Models is (too) Easy. arXiv:2402.02823 [cs.LG] , Vol. 1, No. 1, Article . Publication date: June 2024. Benchmark Data Contamination of Large Language Models: A Survey 23

  33. [33]

    Chunyuan Deng, Yilun Zhao, Xiangru Tang, Mark Gerstein, and Arman Cohan. 2023. Investigating Data Contamina- tion in Modern Benchmarks for Large Language Models. arXiv:2311.09783 [cs.CL]

  34. [34]

    Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers) , Jill Burstein, Christ...

  35. [35]

    Jesse Dodge, Maarten Sap, Ana Marasović, William Agnew, Gabriel Ilharco, Dirk Groeneveld, Margaret Mitchell, and Matt Gardner. 2021. Documenting Large Webtext Corpora: A Case Study on the Colossal Clean Crawled Corpus. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing , Marie-Francine Moens, Xuanjing Huang, Lucia Sp...

  36. [36]

    Qingxiu Dong, Lei Li, Damai Dai, Ce Zheng, Zhiyong Wu, Baobao Chang, Xu Sun, Jingjing Xu, Lei Li, and Zhifang Sui. 2023. A Survey on In-context Learning. arXiv:2301.00234 [cs.CL]

  37. [37]

    Yihong Dong, Xue Jiang, Huanyu Liu, Zhi Jin, and Ge Li. 2024. Generalization or Memorization: Data Contamination and Trustworthy Evaluation for Large Language Models. arXiv:2402.15938 [cs.CL]

  38. [38]

    Duarte, Xuandong Zhao, Arlindo L

    André V. Duarte, Xuandong Zhao, Arlindo L. Oliveira, and Lei Li. 2024. DE-COP: Detecting Copyrighted Content in Language Models Training Data. arXiv:2402.09910 [cs.CL]

  39. [39]

    Hashimoto

    Yann Dubois, Balazs Galambosi, Percy Liang, and Tatsunori B. Hashimoto. 2024. Length-Corrected AlpacaEval: A Simple Debiasing of Automatic Evaluators. https://github.com/tatsu-lab/alpaca_eval

  40. [40]

    Hashimoto

    Yann Dubois, Xuechen Li, Rohan Taori, Tianyi Zhang, Ishaan Gulrajani, Jimmy Ba, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. 2023. AlpacaFarm: A Simulation Framework for Methods that Learn from Human Feedback. arXiv:2305.14387 [cs.LG]

  41. [41]

    SearchQA: A New Q&A Dataset Augmented with Context from a Search Engine

    Matthew Dunn, Levent Sagun, Mike Higgins, V. Ugur Guney, Volkan Cirik, and Kyunghyun Cho. 2017. SearchQA: A New Q&A Dataset Augmented with Context from a Search Engine. arXiv:1704.05179 [cs.CL]

  42. [42]

    Aparna Elangovan, Jiayuan He, and Karin Verspoor. 2021. Memorization vs. Generalization : Quantifying Data Leakage in NLP Performance Evaluation. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume , Paola Merlo, Jorg Tiedemann, and Reut Tsarfaty (Eds.). Association for Computational ...

  43. [43]

    Yanai Elazar, Nora Kassner, Shauli Ravfogel, Amir Feder, Abhilasha Ravichander, Marius Mosbach, Yonatan Belinkov, Hinrich Schütze, and Yoav Goldberg. 2023. Measuring Causal Effects of Data Statistics on Language Model’s ‘Factual’ Predictions. arXiv:2207.14251 [cs.CL]

  44. [44]

    Lizhou Fan, Wenyue Hua, Lingyao Li, Haoyang Ling, and Yongfeng Zhang. 2024. NPHardEval: Dynamic Benchmark on Reasoning Ability of Large Language Models via Complexity Classes. arXiv:2312.14890 [cs.AI]

  45. [45]

    James Ferguson, Matt Gardner, Hannaneh Hajishirzi, Tushar Khot, and Pradeep Dasigi. 2020. IIRC: A Dataset of Incomplete Information Reading Comprehension Questions. InProceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP) , Bonnie Webber, Trevor Cohn, Yulan He, and Yang Liu (Eds.). Association for Computational Ling...

  46. [46]

    Luyu Gao, Aman Madaan, Shuyan Zhou, Uri Alon, Pengfei Liu, Yiming Yang, Jamie Callan, and Graham Neubig

  47. [47]

    In Proceedings of the 40th International Conference on Machine Learning (Proceedings of Machine Learning Research, Vol

    PAL: Program-aided Language Models. In Proceedings of the 40th International Conference on Machine Learning (Proceedings of Machine Learning Research, Vol. 202) , Andreas Krause, Emma Brunskill, Kyunghyun Cho, Barbara Engelhardt, Sivan Sabato, and Jonathan Scarlett (Eds.). PMLR, 10764–10799. https://proceedings.mlr.press/v202/ gao23f.html

  48. [48]

    Yonatan Geifman and Ran El-Yaniv. 2017. Selective Classification for Deep Neural Networks. In Advances in Neural Information Processing Systems, I. Guyon, U. Von Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (Eds.), Vol. 30. Curran Associates, Inc. https://proceedings.neurips.cc/paper_files/paper/2017/file/4a8423d5e91fda00bb...

  49. [49]

    Omid Ghahroodi, Marzia Nouri, Mohammad Vali Sanian, Alireza Sahebi, Doratossadat Dastgheib, Ehsaneddin Asgari, Mahdieh Soleymani Baghshah, and Mohammad Hossein Rohban. 2024. Khayyam Challenge (PersianMMLU): Is Your LLM Truly Wise to The Persian Language? arXiv:2404.06644 [cs.CL]

  50. [50]

    Shahriar Golchin and Mihai Surdeanu. 2024. Data Contamination Quiz: A Tool to Detect and Estimate Contamination in Large Language Models. arXiv:2311.06233 [cs.CL]

  51. [51]

    Shahriar Golchin and Mihai Surdeanu. 2024. Time Travel in LLMs: Tracing Data Contamination in Large Language Models. In The Twelfth International Conference on Learning Representations. https://openreview.net/forum?id=2Rwq6c3tvr

  52. [52]

    Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. 2014. Generative Adversarial Nets. In Advances in Neural Information Processing Systems, Z. Ghahramani, M. Welling, C. Cortes, N. Lawrence, and K.Q. Weinberge...

  53. [53]

    Albert Gu and Tri Dao. 2023. Mamba: Linear-Time Sequence Modeling with Selective State Spaces. arXiv:2312.00752 [cs.LG]

  54. [54]

    Sumit Gulwani, Oleksandr Polozov, and Rishabh Singh. 2017. Program Synthesis. https://doi.org/10.1561/2500000010

  55. [55]

    Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q. Weinberger. 2017. On Calibration of Modern Neural Networks. In Proceedings of the 34th International Conference on Machine Learning (Proceedings of Machine Learning Research, Vol. 70), Doina Precup and Yee Whye Teh (Eds.). PMLR, 1321–1330. https://proceedings.mlr.press/v70/guo17a.html

  56. [56]

    Daya Guo, Qihao Zhu, Dejian Yang, Zhenda Xie, Kai Dong, Wentao Zhang, Guanting Chen, Xiao Bi, Y. Wu, Y. K. Li, Fuli Luo, Yingfei Xiong, and Wenfeng Liang. 2024. DeepSeek-Coder: When the Large Language Model Meets Programming – The Rise of Code Intelligence. arXiv:2401.14196 [cs.SE]

  57. [57]

    Maanak Gupta, Charankumar Akiri, Kshitiz Aryal, Eli Parker, and Lopamudra Praharaj. 2023. From ChatGPT to ThreatGPT: Impact of Generative AI in Cybersecurity and Privacy. IEEE Access 11 (2023), 80218–80245. https://doi.org/10.1109/ACCESS.2023.3300381

  58. [58]

    Moritz Hardt, Eric Price, and Nati Srebro. 2016. Equality of Opportunity in Supervised Learning. In Advances in Neural Information Processing Systems, D. Lee, M. Sugiyama, U. Luxburg, I. Guyon, and R. Garnett (Eds.), Vol. 29. Curran Associates, Inc. https://proceedings.neurips.cc/paper_files/paper/2016/file/9d2682367c3935defcb1f9e247a97c0d-Paper.pdf

  59. [59]

    Amr Hendy, Mohamed Abdelrehim, Amr Sharaf, Vikas Raunak, Mohamed Gabr, Hitokazu Matsushita, Young Jin Kim, Mohamed Afify, and Hany Hassan Awadalla. 2023. How Good Are GPT Models at Machine Translation? A Comprehensive Evaluation. arXiv:2302.09210 [cs.CL]

  60. [60]

    Training Compute-Optimal Large Language Models

    Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, Tom Hennigan, Eric Noland, Katie Millican, George van den Driessche, Bogdan Damoc, Aurelia Guy, Simon Osindero, Karen Simonyan, Erich Elsen, Jack W. Rae, Oriol Vinyals, and Laurent Sifre...

  61. [61]

    Beizhe Hu, Qiang Sheng, Juan Cao, Yuhui Shi, Yang Li, Danding Wang, and Peng Qi. 2024. Bad Actor, Good Advisor: Exploring the Role of Large Language Models in Fake News Detection. Proceedings of the AAAI Conference on Artificial Intelligence 38, 20 (Mar. 2024), 22105–22113. https://doi.org/10.1609/aaai.v38i20.30214

  62. [62]

    Hui Huang, Yingqi Qu, Jing Liu, Muyun Yang, and Tiejun Zhao. 2024. An Empirical Study of LLM-as-a-Judge for LLM Evaluation: Fine-tuned Judge Models are Task-specific Classifiers. arXiv:2403.02839 [cs.CL]

  63. [64]

    Jie Huang, Hanyin Shao, and Kevin Chen-Chuan Chang. 2022. Are Large Pre-Trained Language Models Leaking Your Personal Information? In Findings of the Association for Computational Linguistics: EMNLP 2022, Yoav Goldberg, Zornitsa Kozareva, and Yue Zhang (Eds.). Association for Computational Linguistics, Abu Dhabi, United Arab Emirates, 2038–2047. https:/...

  64. [65]

    Yiming Huang, Zhenghao Lin, Xiao Liu, Yeyun Gong, Shuai Lu, Fangyu Lei, Yaobo Liang, Yelong Shen, Chen Lin, Nan Duan, and Weizhu Chen. 2023. Competition-Level Problems are Effective LLM Evaluators. arXiv:2312.02143 [cs.CL]

  65. [66]

    Daphne Ippolito, Florian Tramer, Milad Nasr, Chiyuan Zhang, Matthew Jagielski, Katherine Lee, Christopher Choquette Choo, and Nicholas Carlini. 2023. Preventing Generation of Verbatim Memorization in Language Models Gives a False Sense of Privacy. In Proceedings of the 16th International Natural Language Generation Conference, C. Maria Keet, Hung-Yi Le...

  66. [67]

    Nicos Isaak. 2023. PronounFlow: A Hybrid Approach for Calibrating Pronouns in Sentences. arXiv:2308.15235 [cs.CL]

  67. [68]

    Shotaro Ishihara. 2023. Training Data Extraction From Pre-trained Language Models: A Survey. In Proceedings of the 3rd Workshop on Trustworthy Natural Language Processing (TrustNLP 2023), Anaelia Ovalle, Kai-Wei Chang, Ninareh Mehrabi, Yada Pruksachatkun, Aram Galystan, Jwala Dhamala, Apurv Verma, Trista Cao, Anoop Kumar, and Rahul Gupta (Eds.). Associat...

  68. [69]

    Alon Jacovi, Avi Caciularu, Omer Goldman, and Yoav Goldberg. 2023. Stop Uploading Test Data in Plain Text: Practical Strategies for Mitigating Data Contamination by Evaluation Benchmarks. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, Houda Bouamor, Juan Pino, and Kalika Bali (Eds.). Association for Computation...

  69. [70]

    Naman Jain, King Han, Alex Gu, Wen-Ding Li, Fanjia Yan, Tianjun Zhang, Sida Wang, Armando Solar-Lezama, Koushik Sen, and Ion Stoica. 2024. LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code. ar...

  70. [71]

    Neel Jain, Khalid Saifullah, Yuxin Wen, John Kirchenbauer, Manli Shu, Aniruddha Saha, Micah Goldblum, Jonas Geiping, and Tom Goldstein. 2023. Bring Your Own Data! Self-Supervised Evaluation for Large Language Models. arXiv:2306.13651 [cs.CL]

  71. [72]

    Jiaming Ji, Mickel Liu, Josef Dai, Xuehai Pan, Chi Zhang, Ce Bian, Boyuan Chen, Ruiyang Sun, Yizhou Wang, and Yaodong Yang. 2023. BeaverTails: Towards Improved Safety Alignment of LLM via a Human-Preference Dataset. In Advances in Neural Information Processing Systems, A. Oh, T. Neumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine (Eds.), Vol. 36. C...

  72. [73]

    Minhao Jiang, Ken Liu, Ming Zhong, Rylan Schaeffer, Siru Ouyang, Jiawei Han, and Sanmi Koyejo. 2024. Does Data Contamination Make a Difference? Insights from Intentionally Contaminating Pre-training Data For Language Models. In ICLR 2024 Workshop on Mathematical and Empirical Understanding of Foundation Models. https://openreview.net/forum?id=nLtl8JNOxg

  73. [74]

    Wenxiang Jiao, Wenxuan Wang, Jen-tse Huang, Xing Wang, Shuming Shi, and Zhaopeng Tu. 2023. Is ChatGPT A Good Translator? Yes With GPT-4 As The Engine. arXiv:2301.08745 [cs.CL]

  74. [75]

    Saurav Kadavath, Tom Conerly, Amanda Askell, Tom Henighan, Dawn Drain, Ethan Perez, Nicholas Schiefer, Zac Hatfield-Dodds, Nova DasSarma, Eli Tran-Johnson, Scott Johnston, Sheer El-Showk, Andy Jones, Nelson Elhage, Tristan Hume, Anna Chen, Yuntao Bai, Sam Bowman, Stanislav Fort, Deep Ganguli, Danny Hernandez, Josh Jacobson, Jackson Kernion, Shauna Kravec,...

  75. [76]

    Leslie Pack Kaelbling, Michael L Littman, and Andrew W Moore. 1996. Reinforcement Learning: A Survey. Journal of Artificial Intelligence Research 4 (1996), 237–285. https://doi.org/10.1613/jair.301

  76. [77]

    Nikhil Kandpal, Eric Wallace, and Colin Raffel. 2022. Deduplicating Training Data Mitigates Privacy Risks in Language Models. In Proceedings of the 39th International Conference on Machine Learning (Proceedings of Machine Learning Research, Vol. 162), Kamalika Chaudhuri, Stefanie Jegelka, Le Song, Csaba Szepesvari, Gang Niu, and Sivan Sabato (Eds.). PMLR,...

  77. [78]

    Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. 2020. Scaling Laws for Neural Language Models. arXiv:2001.08361 [cs.LG]

  78. [79]

    Folgert Karsdorp and Lauren Fonteyn. 2019. Cultural entrenchment of folktales is encoded in language. Palgrave Communications 5, 1 (2019). https://doi.org/10.1057/s41599-019-0234-9

  79. [80]

    Ronald Kemker, Marc McClure, Angelina Abitino, Tyler Hayes, and Christopher Kanan. 2018. Measuring Catastrophic Forgetting in Neural Networks. Proceedings of the AAAI Conference on Artificial Intelligence 32, 1 (Apr. 2018). https://doi.org/10.1609/aaai.v32i1.11651

  80. [81]

    Daniel Khashabi, Sewon Min, Tushar Khot, Ashish Sabharwal, Oyvind Tafjord, Peter Clark, and Hannaneh Hajishirzi
