Benchmark Data Contamination of Large Language Models: A Survey

Cheng Xu; Derek Greene; M-Tahar Kechadi; Shuhao Guan

arxiv: 2406.04244 · v1 · pith:UHXBMDQAnew · submitted 2024-06-06 · 💻 cs.CL

Pith reviewed 2026-05-22 23:05 UTC · model grok-4.3

classification 💻 cs.CL

keywords benchmark data contamination · large language models · LLM evaluation · data leakage · benchmark reliability · alternative evaluation methods · training data overlap

The pith

Benchmark data contamination from training sets renders standard LLM evaluations unreliable.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines how large language models can absorb information from evaluation benchmarks during training, a process that produces inflated or misleading performance scores. This issue matters because benchmarks are the main way researchers and users judge whether models are improving at real tasks. The authors review documented cases of the problem, survey methods proposed to detect or avoid it, and discuss remaining challenges plus possible future approaches. If the survey is correct, then many published results on models like GPT-4 rest on contaminated data and cannot be taken at face value without additional checks.

Core claim

The paper establishes that benchmark data contamination occurs when language models inadvertently incorporate evaluation benchmark information from their training data, leading to inaccurate or unreliable performance during the evaluation phase, and that alternative assessment methods must be explored to reduce the associated risks in real-world applications.

What carries the argument

Benchmark Data Contamination (BDC), the process by which evaluation-benchmark text enters a model's training corpus and thereby inflates measured performance on those same benchmarks.
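To make the overlap mechanism concrete, here is a minimal sketch of one common detection heuristic: word-level n-gram overlap between a benchmark item and a training corpus. This is an illustration under stated assumptions, not a method taken from the survey itself. The function names `ngrams` and `overlap_ratio`, the default window size, and the toy corpus are all hypothetical.

```python
from typing import Iterable, Set, Tuple


def ngrams(text: str, n: int) -> Set[Tuple[str, ...]]:
    """Return the set of word-level n-grams in `text` (lowercased, whitespace-split)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_ratio(benchmark_item: str, corpus_docs: Iterable[str], n: int = 8) -> float:
    """Fraction of the item's n-grams that also occur somewhere in the corpus."""
    item_grams = ngrams(benchmark_item, n)
    if not item_grams:
        return 0.0  # item has fewer than n tokens, so there is nothing to compare
    corpus_grams: Set[Tuple[str, ...]] = set()
    for doc in corpus_docs:
        corpus_grams |= ngrams(doc, n)
    return len(item_grams & corpus_grams) / len(item_grams)


# Toy data, invented purely for illustration.
corpus = [
    "forum post: the quick brown fox jumps over the lazy dog near the river bank today",
    "an unrelated document about cooking pasta with garlic and olive oil",
]
leaked = "the quick brown fox jumps over the lazy dog near the river bank"
fresh = "a completely new question that nobody has ever written down before now"

print(overlap_ratio(leaked, corpus, n=5))  # 1.0: every 5-gram appears in the corpus
print(overlap_ratio(fresh, corpus, n=5))   # 0.0: no 5-gram appears in the corpus
```

A check like this only catches verbatim or near-verbatim leakage. Paraphrased or translated benchmark text would pass it, and the check requires access to the training corpus, which is often unavailable for closed models.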

If this is right

Standard public benchmarks lose their value as trustworthy measures of progress.
Model developers must adopt data-curation practices that explicitly exclude known evaluation sets.
New evaluation protocols such as private held-out tests or dynamic benchmarks become necessary for credible claims.
Reported performance numbers on contaminated benchmarks cannot be compared directly across models trained on different data mixtures.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Many existing leaderboards may systematically overstate model capabilities until contamination is measured and corrected.
The same leakage risk applies to any fixed test set used repeatedly in machine learning, not just language models.
Detection techniques for contamination could be turned into routine pre-release audits for new models.

Load-bearing premise

The papers reviewed in the survey correctly describe how widespread and severe the contamination problem is, and the alternative evaluation methods they describe can be used without creating equally serious new flaws.

What would settle it

A controlled experiment that retrains several current LLMs from scratch on data guaranteed to exclude all benchmark test sets and then shows those models achieve the same scores on the benchmarks as the original contaminated versions.
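The comparison step of such an experiment could be sketched as follows. This is a rough illustration only: the helper names `score_gaps` and `flag_inflated`, the benchmark labels, the scores, and the tolerance are all hypothetical placeholders, not results from any real model. A real study would also need repeated runs and a significance test.

```python
from typing import Dict, List


def score_gaps(original: Dict[str, float], clean: Dict[str, float]) -> Dict[str, float]:
    """Per-benchmark gap: original model score minus clean-retrained model score."""
    shared = original.keys() & clean.keys()  # only compare benchmarks scored for both
    return {name: original[name] - clean[name] for name in sorted(shared)}


def flag_inflated(gaps: Dict[str, float], tolerance: float) -> List[str]:
    """Benchmarks where the original model beats the clean one by more than `tolerance`."""
    return [name for name, gap in gaps.items() if gap > tolerance]


# Placeholder accuracies, invented for illustration.
original_scores = {"bench_A": 0.75, "bench_B": 0.5}
clean_scores = {"bench_A": 0.5, "bench_B": 0.5}

gaps = score_gaps(original_scores, clean_scores)
print(gaps)                       # {'bench_A': 0.25, 'bench_B': 0.0}
print(flag_inflated(gaps, 0.05))  # ['bench_A']
```

If every gap stayed within tolerance, that would count against the contamination concern for those benchmarks. Large positive gaps would support it.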

Original abstract

The rapid development of Large Language Models (LLMs) like GPT-4, Claude-3, and Gemini has transformed the field of natural language processing. However, it has also resulted in a significant issue known as Benchmark Data Contamination (BDC). This occurs when language models inadvertently incorporate evaluation benchmark information from their training data, leading to inaccurate or unreliable performance during the evaluation phase of the process. This paper reviews the complex challenge of BDC in LLM evaluation and explores alternative assessment methods to mitigate the risks associated with traditional benchmarks. The paper also examines challenges and future directions in mitigating BDC risks, highlighting the complexity of the issue and the need for innovative solutions to ensure the reliability of LLM evaluation in real-world applications.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

This is a survey that organizes existing literature on benchmark data contamination without adding new data or methods.

This survey paper reviews benchmark data contamination in LLMs. The main takeaway is that training data can overlap with evaluation benchmarks and distort reported performance, and the authors collect prior work on the problem plus some alternative assessment ideas. It does a reasonable job laying out why this matters for reliable testing and pointing to mitigation approaches discussed in the literature. The organization of the topic into challenges and future directions is clear enough for a reader who wants a quick map of the area. What it does well is flag a practical issue that affects how we trust LLM numbers, and it gives credit to the papers that first raised the contamination concern.

The soft spots are in the review mechanics. The abstract gives no details on search terms, databases, or inclusion rules, so it is hard to tell whether the coverage is balanced or if key papers were missed. As a survey it also does not run any checks on the alternatives it mentions, which leaves their practicality untested here.

This paper is for people who build or use LLM benchmarks and want a single reference point on contamination risks. A reader new to evaluation methodology would get value from the summary of existing findings and suggested directions. It deserves a serious referee because the topic is timely and a careful synthesis could shape how future benchmarks are built, even if the paper itself is descriptive rather than empirical. I would send it to peer review so that reviewers can verify the literature selection and balance.

Referee Report

2 major / 2 minor

Summary. This survey paper defines Benchmark Data Contamination (BDC) as the inadvertent inclusion of evaluation benchmark data in LLM training corpora, which distorts reported performance on those benchmarks. It reviews the phenomenon across models such as GPT-4, Claude-3, and Gemini, surveys mitigation strategies and alternative evaluation approaches, and outlines open challenges and future research directions for reliable LLM assessment.

Significance. If the coverage of the literature is representative, the survey would usefully consolidate a growing body of work on an issue that directly threatens the validity of standard LLM benchmarks. Its value would lie in mapping the problem space and cataloguing proposed remedies rather than in any novel empirical or theoretical contribution.

major comments (2)

[Abstract / Introduction] Abstract and introduction: no search protocol, inclusion/exclusion criteria, or database sources are stated, making it impossible to judge whether the surveyed literature is comprehensive or systematically selected; this directly affects the reliability of the central descriptive claim.
[Section on alternative methods (inferred from abstract)] The manuscript asserts that alternative assessment methods can mitigate BDC risks, yet provides no concrete comparison of their computational cost, scalability, or susceptibility to new forms of contamination; without such analysis the recommendation of alternatives remains ungrounded.

minor comments (2)

[Abstract] The abstract repeats the definition of BDC without adding new information; a single concise definition would suffice.
[Throughout] No quantitative summary (e.g., number of papers reviewed, distribution across years or venues) is supplied to give readers a sense of the evidence base.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed feedback. We address the two major comments below. Both points identify areas where additional transparency and analysis can strengthen the survey, and we will revise the manuscript accordingly.

Referee: [Abstract / Introduction] Abstract and introduction: no search protocol, inclusion/exclusion criteria, or database sources are stated, making it impossible to judge whether the surveyed literature is comprehensive or systematically selected; this directly affects the reliability of the central descriptive claim.

Authors: We agree that the absence of an explicit literature search protocol limits the ability to assess coverage. Although the paper is a narrative survey rather than a formal systematic review, we will add a new subsection (likely in Section 2 or a dedicated Methods section) that describes the search strategy: databases consulted (arXiv, ACL Anthology, Google Scholar), time window (primarily 2020–2024), keywords used (e.g., “benchmark contamination”, “data leakage LLM”, “evaluation contamination”), and inclusion criteria (peer-reviewed or pre-print works that empirically study or propose mitigations for BDC in LLMs). Exclusion criteria (e.g., non-English papers, purely theoretical works without empirical component) will also be stated. This addition will directly address the concern about transparency. revision: yes
Referee: [Section on alternative methods (inferred from abstract)] The manuscript asserts that alternative assessment methods can mitigate BDC risks, yet provides no concrete comparison of their computational cost, scalability, or susceptibility to new forms of contamination; without such analysis the recommendation of alternatives remains ungrounded.

Authors: The current version surveys the range of proposed alternatives (dynamic benchmarks, private test sets, contamination detection methods, etc.) but does not synthesize quantitative comparisons. We will expand the relevant section to include a comparative table or structured discussion that extracts and contrasts reported computational overhead, scalability limits, and known contamination vulnerabilities from the cited papers. Where primary sources lack such metrics we will explicitly note the gap and flag it as an open research direction rather than claiming superiority. This revision will ground the discussion without introducing new unsubstantiated claims. revision: yes

Circularity Check

0 steps flagged

No significant circularity; survey summarizes external literature

The paper is a survey whose content consists entirely of descriptions and summaries of prior external work on benchmark data contamination. No derivations, equations, fitted parameters, predictions, or uniqueness theorems are asserted within the manuscript itself. All load-bearing statements are attributed to cited literature rather than derived internally, and no self-citation chains, ansatzes, or renamings of results are used to support any novel claim. The structure therefore contains no steps that reduce by construction to the paper's own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is a literature survey paper. No free parameters, mathematical axioms, or invented entities are introduced or required by any central claim.

Forward citations

Cited by 24 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Unsteady Metrics and Benchmarking Cultures of AI Model Builders
cs.AI 2026-05 accept novelty 8.0

AI model builders mostly highlight unique benchmarks that act as flexible narrative tools for market positioning rather than standardized scientific measurements.
Provable Joint Decontamination for Benchmarking Multiple Large Language Models
cs.LG 2026-05 unverdicted novelty 7.0

JECS aggregates per-model conformal p-values via their maximum and reconstructs a conservative envelope of the max-p null distribution to select benchmarks with global contamination rate control.
Can Large Language Models Reliably Correct Errors in Low-Resource ASR? A Contamination-Aware Case Study on West Frisian
cs.CL 2026-05 conditional novelty 7.0

LLM generative error correction improves low-resource Frisian ASR performance, with comparable gains on a contamination-controlled offline dataset confirming true correction ability.
Dataset Watermarking for Closed LLMs with Provable Detection
cs.LG 2026-05 unverdicted novelty 7.0

A new watermarking method for closed LLMs boosts random word-pair co-occurrences via rephrasing and detects the signal statistically in outputs, working reliably even when the watermarked data is only 1% of fine-tunin...
NeuroState-Bench: A Human-Calibrated Benchmark for Commitment Integrity in LLM Agent Profiles
cs.AI 2026-05 accept novelty 7.0

NeuroState-Bench is a human-calibrated benchmark with 144 tasks and 306 side-query probes showing that commitment integrity in LLM agent profiles diverges from task success, with 31 of 32 profiles changing rank under ...
BioGraphletQA: Knowledge-Anchored Generation of Complex QA Datasets
cs.CL 2026-04 conditional novelty 7.0

A graphlet-anchored framework generates 119,856 factually grounded biomedical QA pairs that improve accuracy on PubMedQA and MedQA benchmarks.
Fine-tuning vs. In-context Learning in Large Language Models: A Formal Language Learning Perspective
cs.CL 2026-04 conditional novelty 7.0

A controlled formal language task reveals fine-tuning outperforms in-context learning on in-distribution generalization but equals it on out-of-distribution, with ICL showing greater sensitivity to model size and toke...
Fine-tuning vs. In-context Learning in Large Language Models: A Formal Language Learning Perspective
cs.CL 2026-04 unverdicted novelty 7.0

Fine-tuning shows higher proficiency than in-context learning on in-distribution generalization in formal languages, with equal out-of-distribution performance and diverging inductive biases at high proficiency.
How Independent are Large Language Models? A Statistical Framework for Auditing Behavioral Entanglement and Reweighting Verifier Ensembles
cs.AI 2026-04 unverdicted novelty 7.0

A new auditing framework reveals widespread behavioral entanglement among LLMs and shows that reweighting ensembles based on measured independence improves verification accuracy by up to 4.5%.
LiveFact: A Dynamic, Time-Aware Benchmark for LLM-Driven Fake News Detection
cs.CL 2026-04 unverdicted novelty 7.0

LiveFact is a new time-aware benchmark that evaluates LLMs on reasoning with dynamic and incomplete information for fake news detection, identifying a significant reasoning gap in model behavior.
NeuroState-Bench: A Human-Calibrated Benchmark for Commitment Integrity in LLM Agent Profiles
cs.AI 2026-05 unverdicted novelty 6.0

NeuroState-Bench supplies human-calibrated tasks and probes that measure commitment integrity in LLM agents and shows this measure diverges from ordinary task success.
ActuBench: A Multi-Agent LLM Pipeline for Generation and Evaluation of Actuarial Reasoning Tasks
cs.AI 2026-04 unverdicted novelty 6.0

ActuBench is a multi-agent LLM pipeline for generating and evaluating actuarial reasoning tasks, with evaluations of 50 models showing effective verification, competitive local open-weights models, and differing ranki...
Micro Language Models Enable Instant Responses
cs.CL 2026-04 conditional novelty 6.0

Ultra-compact 8-30M parameter models start contextually grounded responses on-device while cloud models seamlessly continue them, enabling responsive AI on power-constrained hardware.
ClawEnvKit: Automatic Environment Generation for Claw-Like Agents
cs.AI 2026-04 unverdicted novelty 6.0

ClawEnvKit automates generation of diverse verified environments for claw-like agents from natural language, producing the Auto-ClawEval benchmark of 1,040 environments that matches human-curated quality at 13,800x lo...
SWE-Bench Pro: Can AI Agents Solve Long-Horizon Software Engineering Tasks?
cs.SE 2025-09 conditional novelty 6.0

SWE-Bench Pro is a new benchmark with 1,865 long-horizon tasks from 41 repositories designed to evaluate AI agents on realistic enterprise-level software engineering problems beyond prior benchmarks.
Guidelines for Empirical Studies in Software Engineering involving Large Language Models
cs.SE 2025-08 accept novelty 6.0

A group of 22 researchers proposes seven study types and eight guidelines for empirical software engineering studies involving LLMs to enhance reproducibility and replicability.
League of LLMs: A Benchmark-Free Paradigm for Mutual Evaluation of Large Language Models
cs.AI 2025-07 unverdicted novelty 6.0

League of LLMs organizes LLMs into a self-governed mutual evaluation league using dynamic, transparent, objective, and professional criteria to distinguish model capabilities with 70.7% top-k ranking stability.
A Study of LLMs' Preferences for Libraries and Programming Languages
cs.SE 2025-03 unverdicted novelty 6.0

Empirical study of eight LLMs finds overuse of popular libraries like NumPy in up to 45% of unnecessary cases and strong default preference for Python even when suboptimal.
The Illusion of Reasoning: Exposing Evasive Data Contamination in LLMs via Zero-CoT Truncation
cs.LG 2026-05 unverdicted novelty 5.0

ZCP detects direct and evasive data contamination in LLMs by truncating CoT reasoning and contrasting zero-CoT accuracy on original versus perturbed isomorphic datasets, plus a Contamination Confidence metric.
Riemann-Bench: A Benchmark for Moonshot Mathematics
cs.AI 2026-04 conditional novelty 5.0

Riemann-Bench is a private benchmark of 25 research-level math problems on which all tested frontier AI models score below 10%.
LLM Benchmark Datasets Should Be Contamination-Resistant
cs.LG 2026-05 unverdicted novelty 4.0

Authors call for contamination-resistant LLM benchmarks that exploit Transformer training-inference asymmetry and require new mathematical methods for cross-architecture interoperability.
Compiled AI: Deterministic Code Generation for LLM-Based Workflow Automation
cs.SE 2026-04 unverdicted novelty 4.0

Compiled AI generates deterministic code artifacts from LLMs in a one-time compilation step, enabling reliable workflow execution with zero runtime tokens after break-even.
Position: Stop Evaluating AI with Human Tests, Develop Principled, AI-specific Tests instead
cs.LG 2025-07 unverdicted novelty 4.0

Human tests should not be applied to AI to measure traits like intelligence due to calibration, validity, contamination, and prompt sensitivity issues; develop AI-specific evaluation frameworks instead.
LLM Harms: A Taxonomy and Discussion
cs.CY 2025-12 unverdicted novelty 3.0

This paper proposes a taxonomy of LLM harms in five categories and suggests mitigation strategies plus a dynamic auditing system for responsible development.

Reference graph

Works this paper leans on

189 extracted references · 189 canonical work pages · cited by 22 Pith papers · 19 internal anchors

[1]

Aggarwal

Charu C. Aggarwal. 2018. Opinion Mining and Sentiment Analysis . Springer International Publishing, Cham, 413–434. https://doi.org/10.1007/978-3-319-73531-3_13

work page doi:10.1007/978-3-319-73531-3_13 2018
[2]

Abdulmohsen Al-Thubaity, Sakhar Alkhereyf, Hanan Murayshid, Nouf Alshalawi, Maha Omirah, Raghad Alateeq, Rawabi Almutairi, Razan Alsuwailem, Manal Alhassoun, and Imaan Alkhanen. 2023. Evaluating ChatGPT and Bard AI on Arabic Sentiment Analysis. In Proceedings of ArabicNLP 2023, Hassan Sawaf, Samhaa El-Beltagy, Wajdi Zaghouani, Walid Magdy, Ahmed Abdelali,...

work page doi:10.18653/v1/2023.arabicnlp-1.27 2023
[3]

Mussa Aman. 2024. Large Language Model Based Fake News Detection. Procedia Computer Science 231 (2024), 740–

work page 2024
[4]

https://doi.org/10.1016/j.procs.2023.12.144 14th International Conference on Emerging Ubiquitous Systems and Pervasive Networks / 13th International Conference on Current and Future Trends of Information and Communication Technologies in Healthcare (EUSPN/ICTH 2023)

work page doi:10.1016/j.procs.2023.12.144 2023
[5]

Anthropic. 2024. Introducing the next generation of Claude. https://www.anthropic.com/news/claude-3-family

work page 2024
[6]

Amanda Askell, Yuntao Bai, Anna Chen, Dawn Drain, Deep Ganguli, Tom Henighan, Andy Jones, Nicholas Joseph, Ben Mann, Nova DasSarma, Nelson Elhage, Zac Hatfield-Dodds, Danny Hernandez, Jackson Kernion, Kamal Ndousse, Catherine Olsson, Dario Amodei, Tom Brown, Jack Clark, Sam McCandlish, Chris Olah, and Jared Kaplan. 2021. A General Language Assistant as a ...

work page internal anchor Pith review Pith/arXiv arXiv 2021
[7]

Yushi Bai, Jiahao Ying, Yixin Cao, Xin Lv, Yuze He, Xiaozhi Wang, Jifan Yu, Kaisheng Zeng, Yijia Xiao, Haozhe Lyu, Jiayin Zhang, Juanzi Li, and Lei Hou. 2023. Benchmarking Foundation Models with Language-Model-as-an- Examiner. In Thirty-seventh Conference on Neural Information Processing Systems Datasets and Benchmarks Track . https://openreview.net/forum...

work page 2023
[8]

Simone Balloccu, Patrícia Schmidtová, Mateusz Lango, and Ondrej Dusek. 2024. Leak, Cheat, Repeat: Data Contami- nation and Evaluation Malpractices in Closed-Source LLMs. In Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers) , Yvette Graham and Matthew Purver (Eds.). Associati...

work page 2024
[9]

Do, Yan Xu, and Pascale Fung

Yejin Bang, Samuel Cahyawijaya, Nayeon Lee, Wenliang Dai, Dan Su, Bryan Wilie, Holy Lovenia, Ziwei Ji, Tiezheng Yu, Willy Chung, Quyet V. Do, Yan Xu, and Pascale Fung. 2023. A Multitask, Multilingual, Multimodal Evaluation of ChatGPT on Reasoning, Hallucination, and Interactivity. In Proceedings of the 13th International Joint Conference on Natural Langua...

work page doi:10.18653/v1/2023.ijcnlp-main.45 2023
[10]

Rachel Bawden and François Yvon. 2023. Investigating the Translation Performance of a Large Multilingual Language Model: the Case of BLOOM. In Proceedings of the 24th Annual Conference of the European Association for Machine Translation, Mary Nurminen, Judith Brenner, Maarit Koponen, Sirkku Latomaa, Mikhail Mikhailov, Frederike Schierl, Tharindu Ranasingh...

work page 2023
[11]

Yoshua Bengio, Réjean Ducharme, and Pascal Vincent. 2000. A Neural Probabilistic Language Model. In Advances in Neural Information Processing Systems , T. Leen, T. Dietterich, and V. Tresp (Eds.), Vol. 13. MIT Press. https: //proceedings.neurips.cc/paper_files/paper/2000/file/728f206c2a01bf572b5940d7d9a8fa4c-Paper.pdf

work page 2000
[12]

James Betker, Gabriel Goh, Li Jing, Tim Brooks, Jianfeng Wang, Linjie Li, Long Ouyang, Juntang Zhuang, Joyce Lee, Yufei Guo, et al. 2023. Improving image generation with better captions. https://cdn.openai.com/papers/dall-e-3.pdf

work page 2023
[13]

Terra Blevins and Luke Zettlemoyer. 2022. Language Contamination Helps Explains the Cross-lingual Capabilities of English Pretrained Models. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing , Yoav Goldberg, Zornitsa Kozareva, and Yue Zhang (Eds.). Association for Computational Linguistics, Abu Dhabi, United Arab Em...

work page doi:10.18653/v1/2022.emnlp-main.233 2022
[14]

Su Lin Blodgett, Solon Barocas, Hal Daumé III, and Hanna Wallach. 2020. Language (Technology) is Power: A Critical Survey of “Bias” in NLP. InProceedings of the 58th Annual Meeting of the Association for Computational Linguistics , Dan Jurafsky, Joyce Chai, Natalie Schluter, and Joel Tetreault (Eds.). Association for Computational Linguistics, Online, 545...

work page doi:10.18653/v1/2020.acl-main.485 2020
[15]

Sebastian Bordt, Harsha Nori, Vanessa Rodrigues, Besmira Nushi, and Rich Caruana. 2024. Elephants Never Forget: Memorization and Learning of Tabular Data in Large Language Models. arXiv:2404.06209 [cs.LG]

work page arXiv 2024
[16]

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gr...

work page 2020
[17]

Jialun Cao, Wuqi Zhang, and Shing-Chi Cheung. 2024. Concerned with Data Contamination? Assessing Countermea- sures in Code Language Model. arXiv:2403.16898 [cs.SE]

work page arXiv 2024
[18]

Nicolas Carlini, Jamie Hayes, Milad Nasr, Matthew Jagielski, Vikash Sehwag, Florian Tramèr, Borja Balle, Daphne Ippolito, and Eric Wallace. 2023. Extracting Training Data from Diffusion Models. In 32nd USENIX Security Sympo- sium (USENIX Security 23) . USENIX Association, Anaheim, CA, 5253–5270. https://www.usenix.org/conference/ , Vol. 1, No. 1, Article ...

work page 2023
[19]

Nicholas Carlini, Florian Tramèr, Eric Wallace, Matthew Jagielski, Ariel Herbert-Voss, Katherine Lee, Adam Roberts, Tom Brown, Dawn Song, Úlfar Erlingsson, Alina Oprea, and Colin Raffel. 2021. Extracting Training Data from Large Language Models. In 30th USENIX Security Symposium (USENIX Security 21) . USENIX Association, 2633–2650. https://www.usenix.org/...

work page 2021
[20]

Nishanth Chandran, Sunayana Sitaram, Divya Gupta, Rahul Sharma, Kashish Mittal, and Manohar Swami- nathan. 2024. Private Benchmarking to Prevent Contamination and Improve Comparative Evaluation of LLMs. arXiv:2403.00393 [cs.CR]

work page arXiv 2024
[21]

Kent Chang, Mackenzie Cramer, Sandeep Soni, and David Bamman. 2023. Speak, Memory: An Archaeology of Books Known to ChatGPT/GPT-4. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing , Houda Bouamor, Juan Pino, and Kalika Bali (Eds.). Association for Computational Linguistics, Singapore, 7312–7327. https://doi.org/10....

work page doi:10.18653/v1/2023.emnlp-main.453 2023
[22]

Yu, Qiang Yang, and Xing Xie

Yupeng Chang, Xu Wang, Jindong Wang, Yuan Wu, Linyi Yang, Kaijie Zhu, Hao Chen, Xiaoyuan Yi, Cunxiang Wang, Yidong Wang, Wei Ye, Yue Zhang, Yi Chang, Philip S. Yu, Qiang Yang, and Xing Xie. 2024. A Survey on Evaluation of Large Language Models. ACM Trans. Intell. Syst. Technol. 15, 3, Article 39 (mar 2024), 45 pages. https://doi.org/10.1145/3641289

work page doi:10.1145/3641289 2024
[23]

Long Chen, Oleg Sinavski, Jan Hünermann, Alice Karnsund, Andrew James Willmott, Danny Birch, Daniel Maund, and Jamie Shotton. 2023. Driving with LLMs: Fusing Object-Level Vector Modality for Explainable Autonomous Driving. arXiv:2310.01957 [cs.RO]

work page arXiv 2023
[24]

Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bavarian...

work page
[25]

Evaluating Large Language Models Trained on Code

Evaluating Large Language Models Trained on Code. arXiv:2107.03374 [cs.LG]

work page internal anchor Pith review Pith/arXiv arXiv
[26]

Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference

Wei-Lin Chiang, Lianmin Zheng, Ying Sheng, Anastasios Nikolas Angelopoulos, Tianle Li, Dacheng Li, Hao Zhang, Banghua Zhu, Michael Jordan, Joseph E. Gonzalez, and Ion Stoica. 2024. Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference. arXiv:2403.04132 [cs.AI]

work page internal anchor Pith review Pith/arXiv arXiv 2024
[27]

Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, Parker Schuh, Kensen Shi, Sasha Tsvyashchenko, Joshua Maynez, Abhishek Rao, Parker Barnes, Yi Tay, Noam Shazeer, Vinodkumar Prabhakaran, Emily Reif, Nan Du, Ben Hutchinson, Reiner Pope, James Bradb...

work page 2023
[28]

Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord

work page
[29]

Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge

Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge. arXiv:1803.05457 [cs.AI]

work page internal anchor Pith review Pith/arXiv arXiv
[30]

Junqi Dai, Hang Yan, Tianxiang Sun, Pengfei Liu, and Xipeng Qiu. 2021. Does syntax matter? A strong baseline for Aspect-based Sentiment Analysis with RoBERTa. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies , Kristina Toutanova, Anna Rumshisky, Luke Zettlemoy...

work page doi:10.18653/v1/2021.naacl- 2021
[31]

Daniel de Vassimon Manela, David Errington, Thomas Fisher, Boris van Breugel, and Pasquale Minervini. 2021. Stereotype and Skew: Quantifying Gender Bias in Pre-trained and Fine-tuned Language Models. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume , Paola Merlo, Jorg Tiedemann, an...

work page doi:10.18653/v1/2021.eacl-main.190 2021
[32]

Jasper Dekoninck, Mark Niklas Müller, Maximilian Baader, Marc Fischer, and Martin Vechev. 2024. Evading Data Contamination Detection for Language Models is (too) Easy. arXiv:2402.02823 [cs.LG] , Vol. 1, No. 1, Article . Publication date: June 2024. Benchmark Data Contamination of Large Language Models: A Survey 23

work page arXiv 2024
[33]

Chunyuan Deng, Yilun Zhao, Xiangru Tang, Mark Gerstein, and Arman Cohan. 2023. Investigating Data Contamina- tion in Modern Benchmarks for Large Language Models. arXiv:2311.09783 [cs.CL]

work page arXiv 2023
[34]

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers) , Jill Burstein, Christ...

work page doi:10.18653/v1/n19-1423 2019
[35]

Jesse Dodge, Maarten Sap, Ana Marasović, William Agnew, Gabriel Ilharco, Dirk Groeneveld, Margaret Mitchell, and Matt Gardner. 2021. Documenting Large Webtext Corpora: A Case Study on the Colossal Clean Crawled Corpus. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing , Marie-Francine Moens, Xuanjing Huang, Lucia Sp...

work page doi:10.18653/v1/2021.emnlp-main.98 2021
[36]

Qingxiu Dong, Lei Li, Damai Dai, Ce Zheng, Zhiyong Wu, Baobao Chang, Xu Sun, Jingjing Xu, Lei Li, and Zhifang Sui. 2023. A Survey on In-context Learning. arXiv:2301.00234 [cs.CL]

work page internal anchor Pith review Pith/arXiv arXiv 2023
[37]

Yihong Dong, Xue Jiang, Huanyu Liu, Zhi Jin, and Ge Li. 2024. Generalization or Memorization: Data Contamination and Trustworthy Evaluation for Large Language Models. arXiv:2402.15938 [cs.CL]

work page arXiv 2024
[38]

Duarte, Xuandong Zhao, Arlindo L

André V. Duarte, Xuandong Zhao, Arlindo L. Oliveira, and Lei Li. 2024. DE-COP: Detecting Copyrighted Content in Language Models Training Data. arXiv:2402.09910 [cs.CL]

work page arXiv 2024
[39]

Hashimoto

Yann Dubois, Balazs Galambosi, Percy Liang, and Tatsunori B. Hashimoto. 2024. Length-Corrected AlpacaEval: A Simple Debiasing of Automatic Evaluators. https://github.com/tatsu-lab/alpaca_eval

work page 2024
[40]

Hashimoto

Yann Dubois, Xuechen Li, Rohan Taori, Tianyi Zhang, Ishaan Gulrajani, Jimmy Ba, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. 2023. AlpacaFarm: A Simulation Framework for Methods that Learn from Human Feedback. arXiv:2305.14387 [cs.LG]

[41]

Matthew Dunn, Levent Sagun, Mike Higgins, V. Ugur Guney, Volkan Cirik, and Kyunghyun Cho. 2017. SearchQA: A New Q&A Dataset Augmented with Context from a Search Engine. arXiv:1704.05179 [cs.CL]

[42]

Aparna Elangovan, Jiayuan He, and Karin Verspoor. 2021. Memorization vs. Generalization : Quantifying Data Leakage in NLP Performance Evaluation. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, Paola Merlo, Jorg Tiedemann, and Reut Tsarfaty (Eds.). Association for Computational ...

https://doi.org/10.18653/v1/2021.eacl-main.113
[43]

Yanai Elazar, Nora Kassner, Shauli Ravfogel, Amir Feder, Abhilasha Ravichander, Marius Mosbach, Yonatan Belinkov, Hinrich Schütze, and Yoav Goldberg. 2023. Measuring Causal Effects of Data Statistics on Language Model’s ‘Factual’ Predictions. arXiv:2207.14251 [cs.CL]

[44]

Lizhou Fan, Wenyue Hua, Lingyao Li, Haoyang Ling, and Yongfeng Zhang. 2024. NPHardEval: Dynamic Benchmark on Reasoning Ability of Large Language Models via Complexity Classes. arXiv:2312.14890 [cs.AI]

[45]

James Ferguson, Matt Gardner, Hannaneh Hajishirzi, Tushar Khot, and Pradeep Dasigi. 2020. IIRC: A Dataset of Incomplete Information Reading Comprehension Questions. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Bonnie Webber, Trevor Cohn, Yulan He, and Yang Liu (Eds.). Association for Computational Ling...

https://doi.org/10.18653/v1/2020.emnlp-main.86
[46]

Luyu Gao, Aman Madaan, Shuyan Zhou, Uri Alon, Pengfei Liu, Yiming Yang, Jamie Callan, and Graham Neubig. 2023. PAL: Program-aided Language Models. In Proceedings of the 40th International Conference on Machine Learning (Proceedings of Machine Learning Research, Vol. 202), Andreas Krause, Emma Brunskill, Kyunghyun Cho, Barbara Engelhardt, Sivan Sabato, and Jonathan Scarlett (Eds.). PMLR, 10764–10799. https://proceedings.mlr.press/v202/gao23f.html

[48]

Yonatan Geifman and Ran El-Yaniv. 2017. Selective Classification for Deep Neural Networks. In Advances in Neural Information Processing Systems, I. Guyon, U. Von Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (Eds.), Vol. 30. Curran Associates, Inc. https://proceedings.neurips.cc/paper_files/paper/2017/file/4a8423d5e91fda00bb...

[49]

Omid Ghahroodi, Marzia Nouri, Mohammad Vali Sanian, Alireza Sahebi, Doratossadat Dastgheib, Ehsaneddin Asgari, Mahdieh Soleymani Baghshah, and Mohammad Hossein Rohban. 2024. Khayyam Challenge (PersianMMLU): Is Your LLM Truly Wise to The Persian Language? arXiv:2404.06644 [cs.CL]

[50]

Shahriar Golchin and Mihai Surdeanu. 2024. Data Contamination Quiz: A Tool to Detect and Estimate Contamination in Large Language Models. arXiv:2311.06233 [cs.CL]

[51]

Shahriar Golchin and Mihai Surdeanu. 2024. Time Travel in LLMs: Tracing Data Contamination in Large Language Models. In The Twelfth International Conference on Learning Representations. https://openreview.net/forum?id=2Rwq6c3tvr

[52]

Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. 2014. Generative Adversarial Nets. In Advances in Neural Information Processing Systems, Z. Ghahramani, M. Welling, C. Cortes, N. Lawrence, and K.Q. Weinberge...

[53]

Albert Gu and Tri Dao. 2023. Mamba: Linear-Time Sequence Modeling with Selective State Spaces. arXiv:2312.00752 [cs.LG]

[54]

Sumit Gulwani, Oleksandr Polozov, and Rishabh Singh. 2017. Program Synthesis. https://doi.org/10.1561/2500000010

[55]

Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q. Weinberger. 2017. On Calibration of Modern Neural Networks. In Proceedings of the 34th International Conference on Machine Learning (Proceedings of Machine Learning Research, Vol. 70), Doina Precup and Yee Whye Teh (Eds.). PMLR, 1321–1330. https://proceedings.mlr.press/v70/guo17a.html

[56]

Daya Guo, Qihao Zhu, Dejian Yang, Zhenda Xie, Kai Dong, Wentao Zhang, Guanting Chen, Xiao Bi, Y. Wu, Y. K. Li, Fuli Luo, Yingfei Xiong, and Wenfeng Liang. 2024. DeepSeek-Coder: When the Large Language Model Meets Programming – The Rise of Code Intelligence. arXiv:2401.14196 [cs.SE]

[57]

Maanak Gupta, Charankumar Akiri, Kshitiz Aryal, Eli Parker, and Lopamudra Praharaj. 2023. From ChatGPT to ThreatGPT: Impact of Generative AI in Cybersecurity and Privacy. IEEE Access 11 (2023), 80218–80245. https://doi.org/10.1109/ACCESS.2023.3300381

[58]

Moritz Hardt, Eric Price, and Nati Srebro. 2016. Equality of Opportunity in Supervised Learning. In Advances in Neural Information Processing Systems, D. Lee, M. Sugiyama, U. Luxburg, I. Guyon, and R. Garnett (Eds.), Vol. 29. Curran Associates, Inc. https://proceedings.neurips.cc/paper_files/paper/2016/file/9d2682367c3935defcb1f9e247a97c0d-Paper.pdf

[59]

Amr Hendy, Mohamed Abdelrehim, Amr Sharaf, Vikas Raunak, Mohamed Gabr, Hitokazu Matsushita, Young Jin Kim, Mohamed Afify, and Hany Hassan Awadalla. 2023. How Good Are GPT Models at Machine Translation? A Comprehensive Evaluation. arXiv:2302.09210 [cs.CL]

[60]

Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, Tom Hennigan, Eric Noland, Katie Millican, George van den Driessche, Bogdan Damoc, Aurelia Guy, Simon Osindero, Karen Simonyan, Erich Elsen, Jack W. Rae, Oriol Vinyals, and Laurent Sifre...

[61]

Beizhe Hu, Qiang Sheng, Juan Cao, Yuhui Shi, Yang Li, Danding Wang, and Peng Qi. 2024. Bad Actor, Good Advisor: Exploring the Role of Large Language Models in Fake News Detection. Proceedings of the AAAI Conference on Artificial Intelligence 38, 20 (Mar. 2024), 22105–22113. https://doi.org/10.1609/aaai.v38i20.30214

[62]

Hui Huang, Yingqi Qu, Jing Liu, Muyun Yang, and Tiejun Zhao. 2024. An Empirical Study of LLM-as-a-Judge for LLM Evaluation: Fine-tuned Judge Models are Task-specific Classifiers. arXiv:2403.02839 [cs.CL]

[64]

Jie Huang, Hanyin Shao, and Kevin Chen-Chuan Chang. 2022. Are Large Pre-Trained Language Models Leaking Your Personal Information?. In Findings of the Association for Computational Linguistics: EMNLP 2022, Yoav Goldberg, Zornitsa Kozareva, and Yue Zhang (Eds.). Association for Computational Linguistics, Abu Dhabi, United Arab Emirates, 2038–2047. https:/...

https://doi.org/10.18653/v1/2022.findings-emnlp.148
[65]

Yiming Huang, Zhenghao Lin, Xiao Liu, Yeyun Gong, Shuai Lu, Fangyu Lei, Yaobo Liang, Yelong Shen, Chen Lin, Nan Duan, and Weizhu Chen. 2023. Competition-Level Problems are Effective LLM Evaluators. arXiv:2312.02143 [cs.CL]

[66]

Daphne Ippolito, Florian Tramer, Milad Nasr, Chiyuan Zhang, Matthew Jagielski, Katherine Lee, Christopher Choquette-Choo, and Nicholas Carlini. 2023. Preventing Generation of Verbatim Memorization in Language Models Gives a False Sense of Privacy. In Proceedings of the 16th International Natural Language Generation Conference, C. Maria Keet, Hung-Yi Le...

https://doi.org/10.18653/v1/2023.inlg-main.3
[67]

Nicos Isaak. 2023. PronounFlow: A Hybrid Approach for Calibrating Pronouns in Sentences. arXiv:2308.15235 [cs.CL]

[68]

Shotaro Ishihara. 2023. Training Data Extraction From Pre-trained Language Models: A Survey. In Proceedings of the 3rd Workshop on Trustworthy Natural Language Processing (TrustNLP 2023), Anaelia Ovalle, Kai-Wei Chang, Ninareh Mehrabi, Yada Pruksachatkun, Aram Galystan, Jwala Dhamala, Apurv Verma, Trista Cao, Anoop Kumar, and Rahul Gupta (Eds.). Associat...

[69]

Alon Jacovi, Avi Caciularu, Omer Goldman, and Yoav Goldberg. 2023. Stop Uploading Test Data in Plain Text: Practical Strategies for Mitigating Data Contamination by Evaluation Benchmarks. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, Houda Bouamor, Juan Pino, and Kalika Bali (Eds.). Association for Computation...

https://doi.org/10.18653/v1/2023.emnlp-main.308
[70]

Naman Jain, King Han, Alex Gu, Wen-Ding Li, Fanjia Yan, Tianjun Zhang, Sida Wang, Armando Solar-Lezama, Koushik Sen, and Ion Stoica. 2024. LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code. ar...

[71]

Neel Jain, Khalid Saifullah, Yuxin Wen, John Kirchenbauer, Manli Shu, Aniruddha Saha, Micah Goldblum, Jonas Geiping, and Tom Goldstein. 2023. Bring Your Own Data! Self-Supervised Evaluation for Large Language Models. arXiv:2306.13651 [cs.CL]

[72]

Jiaming Ji, Mickel Liu, Josef Dai, Xuehai Pan, Chi Zhang, Ce Bian, Boyuan Chen, Ruiyang Sun, Yizhou Wang, and Yaodong Yang. 2023. BeaverTails: Towards Improved Safety Alignment of LLM via a Human-Preference Dataset. In Advances in Neural Information Processing Systems, A. Oh, T. Neumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine (Eds.), Vol. 36. C...

[73]

Minhao Jiang, Ken Liu, Ming Zhong, Rylan Schaeffer, Siru Ouyang, Jiawei Han, and Sanmi Koyejo. 2024. Does Data Contamination Make a Difference? Insights from Intentionally Contaminating Pre-training Data For Language Models. In ICLR 2024 Workshop on Mathematical and Empirical Understanding of Foundation Models. https://openreview.net/forum?id=nLtl8JNOxg

[74]

Wenxiang Jiao, Wenxuan Wang, Jen-tse Huang, Xing Wang, Shuming Shi, and Zhaopeng Tu. 2023. Is ChatGPT A Good Translator? Yes With GPT-4 As The Engine. arXiv:2301.08745 [cs.CL]

[75]

Saurav Kadavath, Tom Conerly, Amanda Askell, Tom Henighan, Dawn Drain, Ethan Perez, Nicholas Schiefer, Zac Hatfield-Dodds, Nova DasSarma, Eli Tran-Johnson, Scott Johnston, Sheer El-Showk, Andy Jones, Nelson Elhage, Tristan Hume, Anna Chen, Yuntao Bai, Sam Bowman, Stanislav Fort, Deep Ganguli, Danny Hernandez, Josh Jacobson, Jackson Kernion, Shauna Kravec,...

[76]

Leslie Pack Kaelbling, Michael L Littman, and Andrew W Moore. 1996. Reinforcement learning: A survey. Journal of artificial intelligence research 4 (1996), 237–285. https://doi.org/10.1613/jair.301

[77]

Nikhil Kandpal, Eric Wallace, and Colin Raffel. 2022. Deduplicating Training Data Mitigates Privacy Risks in Language Models. In Proceedings of the 39th International Conference on Machine Learning (Proceedings of Machine Learning Research, Vol. 162), Kamalika Chaudhuri, Stefanie Jegelka, Le Song, Csaba Szepesvari, Gang Niu, and Sivan Sabato (Eds.). PMLR,...

[78]

Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. 2020. Scaling Laws for Neural Language Models. arXiv:2001.08361 [cs.LG]

[79]

Folgert Karsdorp and Lauren Fonteyn. 2019. Cultural entrenchment of folktales is encoded in language. Palgrave Communications 5, 1 (2019). https://doi.org/10.1057/s41599-019-0234-9

[80]

Ronald Kemker, Marc McClure, Angelina Abitino, Tyler Hayes, and Christopher Kanan. 2018. Measuring Catastrophic Forgetting in Neural Networks. Proceedings of the AAAI Conference on Artificial Intelligence 32, 1 (Apr. 2018). https://doi.org/10.1609/aaai.v32i1.11651

[81]

Daniel Khashabi, Sewon Min, Tushar Khot, Ashish Sabharwal, Oyvind Tafjord, Peter Clark, and Hannaneh Hajishirzi

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers) , Jill Burstein, Christ...

work page doi:10.18653/v1/n19-1423 2019

[35] [35]

Jesse Dodge, Maarten Sap, Ana Marasović, William Agnew, Gabriel Ilharco, Dirk Groeneveld, Margaret Mitchell, and Matt Gardner. 2021. Documenting Large Webtext Corpora: A Case Study on the Colossal Clean Crawled Corpus. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing , Marie-Francine Moens, Xuanjing Huang, Lucia Sp...

work page doi:10.18653/v1/2021.emnlp-main.98 2021

[36] [36]

Qingxiu Dong, Lei Li, Damai Dai, Ce Zheng, Zhiyong Wu, Baobao Chang, Xu Sun, Jingjing Xu, Lei Li, and Zhifang Sui. 2023. A Survey on In-context Learning. arXiv:2301.00234 [cs.CL]

work page internal anchor Pith review Pith/arXiv arXiv 2023

[37] [37]

Yihong Dong, Xue Jiang, Huanyu Liu, Zhi Jin, and Ge Li. 2024. Generalization or Memorization: Data Contamination and Trustworthy Evaluation for Large Language Models. arXiv:2402.15938 [cs.CL]

work page arXiv 2024

[38] [38]

Duarte, Xuandong Zhao, Arlindo L

André V. Duarte, Xuandong Zhao, Arlindo L. Oliveira, and Lei Li. 2024. DE-COP: Detecting Copyrighted Content in Language Models Training Data. arXiv:2402.09910 [cs.CL]

work page arXiv 2024

[39] [39]

Hashimoto

Yann Dubois, Balazs Galambosi, Percy Liang, and Tatsunori B. Hashimoto. 2024. Length-Corrected AlpacaEval: A Simple Debiasing of Automatic Evaluators. https://github.com/tatsu-lab/alpaca_eval

work page 2024

[40] [40]

Hashimoto

Yann Dubois, Xuechen Li, Rohan Taori, Tianyi Zhang, Ishaan Gulrajani, Jimmy Ba, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. 2023. AlpacaFarm: A Simulation Framework for Methods that Learn from Human Feedback. arXiv:2305.14387 [cs.LG]

work page arXiv 2023

[41] [41]

SearchQA: A New Q&A Dataset Augmented with Context from a Search Engine

Matthew Dunn, Levent Sagun, Mike Higgins, V. Ugur Guney, Volkan Cirik, and Kyunghyun Cho. 2017. SearchQA: A New Q&A Dataset Augmented with Context from a Search Engine. arXiv:1704.05179 [cs.CL]

work page internal anchor Pith review Pith/arXiv arXiv 2017

[42] [42]

Aparna Elangovan, Jiayuan He, and Karin Verspoor. 2021. Memorization vs. Generalization : Quantifying Data Leakage in NLP Performance Evaluation. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume , Paola Merlo, Jorg Tiedemann, and Reut Tsarfaty (Eds.). Association for Computational ...

work page doi:10.18653/v1/2021.eacl-main.113 2021

[43] [43]

Yanai Elazar, Nora Kassner, Shauli Ravfogel, Amir Feder, Abhilasha Ravichander, Marius Mosbach, Yonatan Belinkov, Hinrich Schütze, and Yoav Goldberg. 2023. Measuring Causal Effects of Data Statistics on Language Model’s ‘Factual’ Predictions. arXiv:2207.14251 [cs.CL]

work page arXiv 2023

[44] [44]

Lizhou Fan, Wenyue Hua, Lingyao Li, Haoyang Ling, and Yongfeng Zhang. 2024. NPHardEval: Dynamic Benchmark on Reasoning Ability of Large Language Models via Complexity Classes. arXiv:2312.14890 [cs.AI]

work page arXiv 2024

[45] [45]

James Ferguson, Matt Gardner, Hannaneh Hajishirzi, Tushar Khot, and Pradeep Dasigi. 2020. IIRC: A Dataset of Incomplete Information Reading Comprehension Questions. InProceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP) , Bonnie Webber, Trevor Cohn, Yulan He, and Yang Liu (Eds.). Association for Computational Ling...

work page doi:10.18653/v1/2020.emnlp-main.86 2020

[46] [46]

Luyu Gao, Aman Madaan, Shuyan Zhou, Uri Alon, Pengfei Liu, Yiming Yang, Jamie Callan, and Graham Neubig

work page

[47] [47]

In Proceedings of the 40th International Conference on Machine Learning (Proceedings of Machine Learning Research, Vol

PAL: Program-aided Language Models. In Proceedings of the 40th International Conference on Machine Learning (Proceedings of Machine Learning Research, Vol. 202) , Andreas Krause, Emma Brunskill, Kyunghyun Cho, Barbara Engelhardt, Sivan Sabato, and Jonathan Scarlett (Eds.). PMLR, 10764–10799. https://proceedings.mlr.press/v202/ gao23f.html

work page

[48] [48]

Yonatan Geifman and Ran El-Yaniv. 2017. Selective Classification for Deep Neural Networks. In Advances in Neural Information Processing Systems , I. Guyon, U. Von Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (Eds.), Vol. 30. Curran Associates, Inc. https://proceedings.neurips.cc/paper_files/paper/2017/file/ 4a8423d5e91fda00bb...

work page 2017

[49] [49]

Omid Ghahroodi, Marzia Nouri, Mohammad Vali Sanian, Alireza Sahebi, Doratossadat Dastgheib, Ehsaneddin Asgari, Mahdieh Soleymani Baghshah, and Mohammad Hossein Rohban. 2024. Khayyam Challenge (PersianMMLU): Is Your LLM Truly Wise to The Persian Language? arXiv:2404.06644 [cs.CL]

work page arXiv 2024

[50] [50]

Shahriar Golchin and Mihai Surdeanu. 2024. Data Contamination Quiz: A Tool to Detect and Estimate Contamination in Large Language Models. arXiv:2311.06233 [cs.CL]

work page arXiv 2024

[51] [51]

Shahriar Golchin and Mihai Surdeanu. 2024. Time Travel in LLMs: Tracing Data Contamination in Large Language Models. In The Twelfth International Conference on Learning Representations . https://openreview.net/forum?id= 2Rwq6c3tvr

work page 2024

[52] [52]

Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. 2014. Generative Adversarial Nets. In Advances in Neural Information Processing Systems , , Vol. 1, No. 1, Article . Publication date: June 2024. 24 Xu et al. Z. Ghahramani, M. Welling, C. Cortes, N. Lawrence, and K.Q. Weinberge...

work page 2014

[53] [53]

Albert Gu and Tri Dao. 2023. Mamba: Linear-Time Sequence Modeling with Selective State Spaces. arXiv:2312.00752 [cs.LG]

work page internal anchor Pith review Pith/arXiv arXiv 2023

[54] [54]

Sumit Gulwani, Oleksandr Polozov, and Rishabh Singh. 2017. . https://doi.org/10.1561/2500000010

work page doi:10.1561/2500000010 2017

[55] [55]

Weinberger

Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q. Weinberger. 2017. On Calibration of Modern Neural Networks. In Proceedings of the 34th International Conference on Machine Learning (Proceedings of Machine Learning Research, Vol. 70), Doina Precup and Yee Whye Teh (Eds.). PMLR, 1321–1330. https://proceedings.mlr.press/v70/guo17a.html

work page 2017

[56] [56]

Daya Guo, Qihao Zhu, Dejian Yang, Zhenda Xie, Kai Dong, Wentao Zhang, Guanting Chen, Xiao Bi, Y. Wu, Y. K. Li, Fuli Luo, Yingfei Xiong, and Wenfeng Liang. 2024. DeepSeek-Coder: When the Large Language Model Meets Programming – The Rise of Code Intelligence. arXiv:2401.14196 [cs.SE]

work page internal anchor Pith review Pith/arXiv arXiv 2024

[57] [57]

Maanak Gupta, Charankumar Akiri, Kshitiz Aryal, Eli Parker, and Lopamudra Praharaj. 2023. From ChatGPT to ThreatGPT: Impact of Generative AI in Cybersecurity and Privacy. IEEE Access 11 (2023), 80218–80245. https: //doi.org/10.1109/ACCESS.2023.3300381

work page doi:10.1109/access.2023.3300381 2023

[58] [58]

Moritz Hardt, Eric Price, Eric Price, and Nati Srebro. 2016. Equality of Opportunity in Supervised Learning. InAdvances in Neural Information Processing Systems, D. Lee, M. Sugiyama, U. Luxburg, I. Guyon, and R. Garnett (Eds.), Vol. 29. Cur- ran Associates, Inc. https://proceedings.neurips.cc/paper_files/paper/2016/file/9d2682367c3935defcb1f9e247a97c0d- Paper.pdf

work page 2016

[59] [59]

Amr Hendy, Mohamed Abdelrehim, Amr Sharaf, Vikas Raunak, Mohamed Gabr, Hitokazu Matsushita, Young Jin Kim, Mohamed Afify, and Hany Hassan Awadalla. 2023. How Good Are GPT Models at Machine Translation? A Comprehensive Evaluation. arXiv:2302.09210 [cs.CL]

work page arXiv 2023

[60] [60]

Training Compute-Optimal Large Language Models

Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, Tom Hennigan, Eric Noland, Katie Millican, George van den Driessche, Bogdan Damoc, Aurelia Guy, Simon Osindero, Karen Simonyan, Erich Elsen, Jack W. Rae, Oriol Vinyals, and Laurent Sifre...

work page internal anchor Pith review Pith/arXiv arXiv 2022

[61]

Beizhe Hu, Qiang Sheng, Juan Cao, Yuhui Shi, Yang Li, Danding Wang, and Peng Qi. 2024. Bad Actor, Good Advisor: Exploring the Role of Large Language Models in Fake News Detection. Proceedings of the AAAI Conference on Artificial Intelligence 38, 20 (Mar. 2024), 22105–22113. https://doi.org/10.1609/aaai.v38i20.30214

[62]

Hui Huang, Yingqi Qu, Jing Liu, Muyun Yang, and Tiejun Zhao. 2024. An Empirical Study of LLM-as-a-Judge for LLM Evaluation: Fine-tuned Judge Models are Task-specific Classifiers. arXiv:2403.02839 [cs.CL]


[63] [64]

Jie Huang, Hanyin Shao, and Kevin Chen-Chuan Chang. 2022. Are Large Pre-Trained Language Models Leaking Your Personal Information?. In Findings of the Association for Computational Linguistics: EMNLP 2022, Yoav Goldberg, Zornitsa Kozareva, and Yue Zhang (Eds.). Association for Computational Linguistics, Abu Dhabi, United Arab Emirates, 2038–2047. https:/...

work page doi:10.18653/v1/2022.findings-emnlp.148 2022

[64] [65]

Yiming Huang, Zhenghao Lin, Xiao Liu, Yeyun Gong, Shuai Lu, Fangyu Lei, Yaobo Liang, Yelong Shen, Chen Lin, Nan Duan, and Weizhu Chen. 2023. Competition-Level Problems are Effective LLM Evaluators. arXiv:2312.02143 [cs.CL]


[65] [66]

Daphne Ippolito, Florian Tramer, Milad Nasr, Chiyuan Zhang, Matthew Jagielski, Katherine Lee, Christopher Choquette Choo, and Nicholas Carlini. 2023. Preventing Generation of Verbatim Memorization in Language Models Gives a False Sense of Privacy. In Proceedings of the 16th International Natural Language Generation Conference, C. Maria Keet, Hung-Yi Le...

work page doi:10.18653/v1/2023.inlg-main.3 2023

[66] [67]

Nicos Isaak. 2023. PronounFlow: A Hybrid Approach for Calibrating Pronouns in Sentences. arXiv:2308.15235 [cs.CL]


[67] [68]

Shotaro Ishihara. 2023. Training Data Extraction From Pre-trained Language Models: A Survey. In Proceedings of the 3rd Workshop on Trustworthy Natural Language Processing (TrustNLP 2023), Anaelia Ovalle, Kai-Wei Chang, Ninareh Mehrabi, Yada Pruksachatkun, Aram Galstyan, Jwala Dhamala, Apurv Verma, Trista Cao, Anoop Kumar, and Rahul Gupta (Eds.). Associat...

work page doi:10.18653/v1/2023 2023

[68] [69]

Alon Jacovi, Avi Caciularu, Omer Goldman, and Yoav Goldberg. 2023. Stop Uploading Test Data in Plain Text: Practical Strategies for Mitigating Data Contamination by Evaluation Benchmarks. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, Houda Bouamor, Juan Pino, and Kalika Bali (Eds.). Association for Computation...

work page doi:10.18653/v1/2023.emnlp-main.308 2023

[69] [70]

Naman Jain, King Han, Alex Gu, Wen-Ding Li, Fanjia Yan, Tianjun Zhang, Sida Wang, Armando Solar-Lezama, Koushik Sen, and Ion Stoica. 2024. LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code. ar...


[70] [71]

Neel Jain, Khalid Saifullah, Yuxin Wen, John Kirchenbauer, Manli Shu, Aniruddha Saha, Micah Goldblum, Jonas Geiping, and Tom Goldstein. 2023. Bring Your Own Data! Self-Supervised Evaluation for Large Language Models. arXiv:2306.13651 [cs.CL]


[71] [72]

Jiaming Ji, Mickel Liu, Josef Dai, Xuehai Pan, Chi Zhang, Ce Bian, Boyuan Chen, Ruiyang Sun, Yizhou Wang, and Yaodong Yang. 2023. BeaverTails: Towards Improved Safety Alignment of LLM via a Human-Preference Dataset. In Advances in Neural Information Processing Systems, A. Oh, T. Neumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine (Eds.), Vol. 36. C...


[72] [73]

Minhao Jiang, Ken Liu, Ming Zhong, Rylan Schaeffer, Siru Ouyang, Jiawei Han, and Sanmi Koyejo. 2024. Does Data Contamination Make a Difference? Insights from Intentionally Contaminating Pre-training Data For Language Models. In ICLR 2024 Workshop on Mathematical and Empirical Understanding of Foundation Models. https://openreview.net/forum?id=nLtl8JNOxg


[73] [74]

Wenxiang Jiao, Wenxuan Wang, Jen-tse Huang, Xing Wang, Shuming Shi, and Zhaopeng Tu. 2023. Is ChatGPT A Good Translator? Yes With GPT-4 As The Engine. arXiv:2301.08745 [cs.CL]


[74] [75]

Saurav Kadavath, Tom Conerly, Amanda Askell, Tom Henighan, Dawn Drain, Ethan Perez, Nicholas Schiefer, Zac Hatfield-Dodds, Nova DasSarma, Eli Tran-Johnson, Scott Johnston, Sheer El-Showk, Andy Jones, Nelson Elhage, Tristan Hume, Anna Chen, Yuntao Bai, Sam Bowman, Stanislav Fort, Deep Ganguli, Danny Hernandez, Josh Jacobson, Jackson Kernion, Shauna Kravec,...

work page internal anchor Pith review Pith/arXiv arXiv 2022

[75] [76]

Leslie Pack Kaelbling, Michael L. Littman, and Andrew W. Moore. 1996. Reinforcement Learning: A Survey. Journal of Artificial Intelligence Research 4 (1996), 237–285. https://doi.org/10.1613/jair.301


[76] [77]

Nikhil Kandpal, Eric Wallace, and Colin Raffel. 2022. Deduplicating Training Data Mitigates Privacy Risks in Language Models. In Proceedings of the 39th International Conference on Machine Learning (Proceedings of Machine Learning Research, Vol. 162), Kamalika Chaudhuri, Stefanie Jegelka, Le Song, Csaba Szepesvari, Gang Niu, and Sivan Sabato (Eds.). PMLR,...


[77] [78]


Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. 2020. Scaling Laws for Neural Language Models. arXiv:2001.08361 [cs.LG]


[78] [79]

Folgert Karsdorp and Lauren Fonteyn. 2019. Cultural entrenchment of folktales is encoded in language. Palgrave Communications 5, 1 (2019). https://doi.org/10.1057/s41599-019-0234-9


[79] [80]

Ronald Kemker, Marc McClure, Angelina Abitino, Tyler Hayes, and Christopher Kanan. 2018. Measuring Catastrophic Forgetting in Neural Networks. Proceedings of the AAAI Conference on Artificial Intelligence 32, 1 (Apr. 2018). https://doi.org/10.1609/aaai.v32i1.11651


[80] [81]

Daniel Khashabi, Sewon Min, Tushar Khot, Ashish Sabharwal, Oyvind Tafjord, Peter Clark, and Hannaneh Hajishirzi
