Stop uploading test data in plain text: Practical strategies for mitigating data contamination by evaluation benchmarks

Alon Jacovi, Avi Caciularu, Omer Goldman, Yoav Goldberg · 2023 · DOI 10.18653/v1/2023.emnlp-main.308

3 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background 2

citation-polarity summary

background 2

representative citing papers

EnactToM: An Evolving Benchmark for Functional Theory of Mind in Embodied Agents

cs.AI · 2026-05-11 · conditional · novelty 6.0 · 2 refs

EnactToM is an evolving benchmark of embodied multi-agent tasks that tests functional Theory of Mind by requiring agents to act optimally on implicit beliefs in partially observable 3D environments.

Benchmarked Yet Not Measured -- Generative AI Should be Evaluated Against Real-World Utility

cs.LG · 2026-05-07 · unverdicted · novelty 4.0 · 2 refs

Generative AI evaluation must shift from static benchmark scores to measuring sustained improvements in human capabilities within specific deployment contexts.

Benchmark Data Contamination of Large Language Models: A Survey

cs.CL · 2024-06-06 · unverdicted · novelty 3.0

A survey reviewing benchmark data contamination in LLMs, its impact on evaluation, and alternative assessment approaches.

citing papers explorer

Showing 3 of 3 citing papers.

EnactToM: An Evolving Benchmark for Functional Theory of Mind in Embodied Agents

cs.AI · 2026-05-11 · conditional · none · ref 23 · 2 links

EnactToM is an evolving benchmark of embodied multi-agent tasks that tests functional Theory of Mind by requiring agents to act optimally on implicit beliefs in partially observable 3D environments.

Benchmarked Yet Not Measured -- Generative AI Should be Evaluated Against Real-World Utility

cs.LG · 2026-05-07 · unverdicted · none · ref 69 · 2 links

Generative AI evaluation must shift from static benchmark scores to measuring sustained improvements in human capabilities within specific deployment contexts.

Benchmark Data Contamination of Large Language Models: A Survey

cs.CL · 2024-06-06 · unverdicted · none · ref 69

A survey reviewing benchmark data contamination in LLMs, its impact on evaluation, and alternative assessment approaches.
