Benchmark Data Contamination of Large Language Models: A Survey
Pith reviewed 2026-05-22 23:05 UTC · model grok-4.3
The pith
Benchmark data that leaks into training sets renders standard LLM evaluations unreliable.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper establishes that benchmark data contamination occurs when language models inadvertently incorporate evaluation benchmark information from their training data, leading to inaccurate or unreliable performance during the evaluation phase, and that alternative assessment methods must be explored to reduce the associated risks in real-world applications.
What carries the argument
Benchmark Data Contamination (BDC), the process by which evaluation-benchmark text enters a model's training corpus and thereby inflates measured performance on those same benchmarks.
If this is right
- Standard public benchmarks lose their value as trustworthy measures of progress.
- Model developers must adopt data-curation practices that explicitly exclude known evaluation sets.
- New evaluation protocols such as private held-out tests or dynamic benchmarks become necessary for credible claims.
- Reported performance numbers on contaminated benchmarks cannot be compared directly across models trained on different data mixtures.
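The data-curation point above can be made concrete with a minimal sketch of an n-gram overlap filter, a common heuristic for excluding known evaluation sets from a training corpus. This is an illustration under stated assumptions, not the survey's method: the function names (`word_ngrams`, `build_benchmark_index`, `filter_contaminated`), the whitespace tokenization, and the default window size `n=8` are all hypothetical choices made for this example.

```python
from typing import Iterable, List, Set, Tuple

NGram = Tuple[str, ...]


def word_ngrams(text: str, n: int) -> Set[NGram]:
    """Return the set of lowercase, whitespace-tokenized n-grams in text."""
    tokens = text.lower().split()
    # Texts shorter than n tokens yield an empty set.
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def build_benchmark_index(benchmark_texts: Iterable[str], n: int) -> Set[NGram]:
    """Collect every n-gram that appears in any benchmark item."""
    index: Set[NGram] = set()
    for text in benchmark_texts:
        index |= word_ngrams(text, n)
    return index


def filter_contaminated(
    docs: Iterable[str],
    benchmark_texts: Iterable[str],
    n: int = 8,  # assumed window size; real pipelines tune this
) -> Tuple[List[str], List[str]]:
    """Split docs into (clean, flagged) by exact n-gram overlap with the benchmark."""
    index = build_benchmark_index(benchmark_texts, n)
    clean: List[str] = []
    flagged: List[str] = []
    for doc in docs:
        # A single shared n-gram is enough to flag the document.
        if word_ngrams(doc, n) & index:
            flagged.append(doc)
        else:
            clean.append(doc)
    return clean, flagged


if __name__ == "__main__":
    benchmark = ["What is the capital of France? The answer is Paris."]
    corpus = [
        "Trivia dump: what is the capital of France? the answer is paris.",
        "A recipe for sourdough bread with rye flour and water.",
    ]
    clean, flagged = filter_contaminated(corpus, benchmark, n=4)
    print(len(clean), len(flagged))
```

Exact matching of this kind only catches verbatim or near-verbatim leakage. Paraphrased or translated benchmark items pass through it, which is one reason the alternative evaluation protocols listed above remain relevant.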
Where Pith is reading between the lines
- Many existing leaderboards may systematically overstate model capabilities until contamination is measured and corrected.
- The same leakage risk applies to any fixed test set used repeatedly in machine learning, not just language models.
- Detection techniques for contamination could be turned into routine pre-release audits for new models.
Load-bearing premise
The papers reviewed in the survey correctly describe how widespread and severe the contamination problem is, and the alternative evaluation methods they describe can be used without creating equally serious new flaws.
What would settle it
A controlled experiment that retrains several current LLMs from scratch on data guaranteed to exclude all benchmark test sets and then shows those models achieve the same scores on the benchmarks as the original contaminated versions.
Original abstract
The rapid development of Large Language Models (LLMs) like GPT-4, Claude-3, and Gemini has transformed the field of natural language processing. However, it has also resulted in a significant issue known as Benchmark Data Contamination (BDC). This occurs when language models inadvertently incorporate evaluation benchmark information from their training data, leading to inaccurate or unreliable performance during the evaluation phase of the process. This paper reviews the complex challenge of BDC in LLM evaluation and explores alternative assessment methods to mitigate the risks associated with traditional benchmarks. The paper also examines challenges and future directions in mitigating BDC risks, highlighting the complexity of the issue and the need for innovative solutions to ensure the reliability of LLM evaluation in real-world applications.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. This survey paper defines Benchmark Data Contamination (BDC) as the inadvertent inclusion of evaluation benchmark data in LLM training corpora, which distorts reported performance on those benchmarks. It reviews the phenomenon across models such as GPT-4, Claude-3, and Gemini, surveys mitigation strategies and alternative evaluation approaches, and outlines open challenges and future research directions for reliable LLM assessment.
Significance. If the coverage of the literature is representative, the survey would usefully consolidate a growing body of work on an issue that directly threatens the validity of standard LLM benchmarks. Its value would lie in mapping the problem space and cataloguing proposed remedies rather than in any novel empirical or theoretical contribution.
major comments (2)
- [Abstract / Introduction] Abstract and introduction: no search protocol, inclusion/exclusion criteria, or database sources are stated, making it impossible to judge whether the surveyed literature is comprehensive or systematically selected; this directly affects the reliability of the central descriptive claim.
- [Section on alternative methods (inferred from abstract)] The manuscript asserts that alternative assessment methods can mitigate BDC risks, yet provides no concrete comparison of their computational cost, scalability, or susceptibility to new forms of contamination; without such analysis the recommendation of alternatives remains ungrounded.
minor comments (2)
- [Abstract] The abstract repeats the definition of BDC without adding new information; a single concise definition would suffice.
- [Throughout] No quantitative summary (e.g., number of papers reviewed, distribution across years or venues) is supplied to give readers a sense of the evidence base.
Simulated Author's Rebuttal
We thank the referee for the detailed feedback. We address the two major comments below. Both points identify areas where additional transparency and analysis can strengthen the survey, and we will revise the manuscript accordingly.
Point-by-point responses
-
Referee: [Abstract / Introduction] Abstract and introduction: no search protocol, inclusion/exclusion criteria, or database sources are stated, making it impossible to judge whether the surveyed literature is comprehensive or systematically selected; this directly affects the reliability of the central descriptive claim.
Authors: We agree that the absence of an explicit literature search protocol limits the ability to assess coverage. Although the paper is a narrative survey rather than a formal systematic review, we will add a new subsection (likely in Section 2 or a dedicated Methods section) that describes the search strategy: databases consulted (arXiv, ACL Anthology, Google Scholar), time window (primarily 2020–2024), keywords used (e.g., “benchmark contamination”, “data leakage LLM”, “evaluation contamination”), and inclusion criteria (peer-reviewed or preprint works that empirically study or propose mitigations for BDC in LLMs). Exclusion criteria (e.g., non-English papers, purely theoretical works without an empirical component) will also be stated. This addition will directly address the concern about transparency.
Revision: yes
-
Referee: [Section on alternative methods (inferred from abstract)] The manuscript asserts that alternative assessment methods can mitigate BDC risks, yet provides no concrete comparison of their computational cost, scalability, or susceptibility to new forms of contamination; without such analysis the recommendation of alternatives remains ungrounded.
Authors: The current version surveys the range of proposed alternatives (dynamic benchmarks, private test sets, contamination detection methods, etc.) but does not synthesize quantitative comparisons. We will expand the relevant section to include a comparative table or structured discussion that extracts and contrasts reported computational overhead, scalability limits, and known contamination vulnerabilities from the cited papers. Where primary sources lack such metrics, we will explicitly note the gap and flag it as an open research direction rather than claiming superiority. This revision will ground the discussion without introducing new unsubstantiated claims.
Revision: yes
Circularity Check
No significant circularity; survey summarizes external literature
Full rationale
The paper is a survey whose content consists entirely of descriptions and summaries of prior external work on benchmark data contamination. No derivations, equations, fitted parameters, predictions, or uniqueness theorems are asserted within the manuscript itself. All load-bearing statements are attributed to cited literature rather than derived internally, and no self-citation chains, ansatzes, or renamings of results are used to support any novel claim. The structure therefore contains no steps that reduce by construction to the paper's own inputs.
Axiom & Free-Parameter Ledger
Forward citations
Cited by 24 Pith papers
-
Unsteady Metrics and Benchmarking Cultures of AI Model Builders
AI model builders mostly highlight unique benchmarks that act as flexible narrative tools for market positioning rather than standardized scientific measurements.
-
Provable Joint Decontamination for Benchmarking Multiple Large Language Models
JECS aggregates per-model conformal p-values via their maximum and reconstructs a conservative envelope of the max-p null distribution to select benchmarks with global contamination rate control.
-
Can Large Language Models Reliably Correct Errors in Low-Resource ASR? A Contamination-Aware Case Study on West Frisian
LLM generative error correction improves low-resource Frisian ASR performance, with comparable gains on a contamination-controlled offline dataset confirming true correction ability.
-
Dataset Watermarking for Closed LLMs with Provable Detection
A new watermarking method for closed LLMs boosts random word-pair co-occurrences via rephrasing and detects the signal statistically in outputs, working reliably even when the watermarked data is only 1% of fine-tunin...
-
NeuroState-Bench: A Human-Calibrated Benchmark for Commitment Integrity in LLM Agent Profiles
NeuroState-Bench is a human-calibrated benchmark with 144 tasks and 306 side-query probes showing that commitment integrity in LLM agent profiles diverges from task success, with 31 of 32 profiles changing rank under ...
-
BioGraphletQA: Knowledge-Anchored Generation of Complex QA Datasets
A graphlet-anchored framework generates 119,856 factually grounded biomedical QA pairs that improve accuracy on PubMedQA and MedQA benchmarks.
-
Fine-tuning vs. In-context Learning in Large Language Models: A Formal Language Learning Perspective
A controlled formal language task reveals fine-tuning outperforms in-context learning on in-distribution generalization but equals it on out-of-distribution, with ICL showing greater sensitivity to model size and toke...
-
Fine-tuning vs. In-context Learning in Large Language Models: A Formal Language Learning Perspective
Fine-tuning shows higher proficiency than in-context learning on in-distribution generalization in formal languages, with equal out-of-distribution performance and diverging inductive biases at high proficiency.
-
How Independent are Large Language Models? A Statistical Framework for Auditing Behavioral Entanglement and Reweighting Verifier Ensembles
A new auditing framework reveals widespread behavioral entanglement among LLMs and shows that reweighting ensembles based on measured independence improves verification accuracy by up to 4.5%.
-
LiveFact: A Dynamic, Time-Aware Benchmark for LLM-Driven Fake News Detection
LiveFact is a new time-aware benchmark that evaluates LLMs on reasoning with dynamic and incomplete information for fake news detection, identifying a significant reasoning gap in model behavior.
-
NeuroState-Bench: A Human-Calibrated Benchmark for Commitment Integrity in LLM Agent Profiles
NeuroState-Bench supplies human-calibrated tasks and probes that measure commitment integrity in LLM agents and shows this measure diverges from ordinary task success.
-
ActuBench: A Multi-Agent LLM Pipeline for Generation and Evaluation of Actuarial Reasoning Tasks
ActuBench is a multi-agent LLM pipeline for generating and evaluating actuarial reasoning tasks, with evaluations of 50 models showing effective verification, competitive local open-weights models, and differing ranki...
-
Micro Language Models Enable Instant Responses
Ultra-compact 8-30M parameter models start contextually grounded responses on-device while cloud models seamlessly continue them, enabling responsive AI on power-constrained hardware.
-
ClawEnvKit: Automatic Environment Generation for Claw-Like Agents
ClawEnvKit automates generation of diverse verified environments for claw-like agents from natural language, producing the Auto-ClawEval benchmark of 1,040 environments that matches human-curated quality at 13,800x lo...
-
SWE-Bench Pro: Can AI Agents Solve Long-Horizon Software Engineering Tasks?
SWE-Bench Pro is a new benchmark with 1,865 long-horizon tasks from 41 repositories designed to evaluate AI agents on realistic enterprise-level software engineering problems beyond prior benchmarks.
-
Guidelines for Empirical Studies in Software Engineering involving Large Language Models
A group of 22 researchers proposes seven study types and eight guidelines for empirical software engineering studies involving LLMs to enhance reproducibility and replicability.
-
League of LLMs: A Benchmark-Free Paradigm for Mutual Evaluation of Large Language Models
League of LLMs organizes LLMs into a self-governed mutual evaluation league using dynamic, transparent, objective, and professional criteria to distinguish model capabilities with 70.7% top-k ranking stability.
-
A Study of LLMs' Preferences for Libraries and Programming Languages
Empirical study of eight LLMs finds overuse of popular libraries like NumPy in up to 45% of unnecessary cases and strong default preference for Python even when suboptimal.
-
The Illusion of Reasoning: Exposing Evasive Data Contamination in LLMs via Zero-CoT Truncation
ZCP detects direct and evasive data contamination in LLMs by truncating CoT reasoning and contrasting zero-CoT accuracy on original versus perturbed isomorphic datasets, plus a Contamination Confidence metric.
-
Riemann-Bench: A Benchmark for Moonshot Mathematics
Riemann-Bench is a private benchmark of 25 research-level math problems on which all tested frontier AI models score below 10%.
-
LLM Benchmark Datasets Should Be Contamination-Resistant
Authors call for contamination-resistant LLM benchmarks that exploit Transformer training-inference asymmetry and require new mathematical methods for cross-architecture interoperability.
-
Compiled AI: Deterministic Code Generation for LLM-Based Workflow Automation
Compiled AI generates deterministic code artifacts from LLMs in a one-time compilation step, enabling reliable workflow execution with zero runtime tokens after break-even.
-
Position: Stop Evaluating AI with Human Tests, Develop Principled, AI-specific Tests instead
Human tests should not be applied to AI to measure traits like intelligence due to calibration, validity, contamination, and prompt sensitivity issues; develop AI-specific evaluation frameworks instead.
-
LLM Harms: A Taxonomy and Discussion
This paper proposes a taxonomy of LLM harms in five categories and suggests mitigation strategies plus a dynamic auditing system for responsible development.