EnactToM is an evolving benchmark of embodied multi-agent tasks that tests functional Theory of Mind by requiring agents to act optimally on implicit beliefs in partially observable 3D environments.
Stop uploading test data in plain text: Practical strategies for mitigating data contamination by evaluation benchmarks
3 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
roles
background 2polarities
background 2representative citing papers
Generative AI evaluation must shift from static benchmark scores to measuring sustained improvements in human capabilities within specific deployment contexts.
A survey reviewing benchmark data contamination in LLMs, its impact on evaluation, and alternative assessment approaches.
citing papers explorer
-
EnactToM: An Evolving Benchmark for Functional Theory of Mind in Embodied Agents
EnactToM is an evolving benchmark of embodied multi-agent tasks that tests functional Theory of Mind by requiring agents to act optimally on implicit beliefs in partially observable 3D environments.
-
Benchmarked Yet Not Measured -- Generative AI Should be Evaluated Against Real-World Utility
Generative AI evaluation must shift from static benchmark scores to measuring sustained improvements in human capabilities within specific deployment contexts.
-
Benchmark Data Contamination of Large Language Models: A Survey
A survey reviewing benchmark data contamination in LLMs, its impact on evaluation, and alternative assessment approaches.