Investigating data contamination in modern benchmarks for large language models

Chunyuan Deng, Yilun Zhao, Xiangru Tang, Mark Gerstein, Arman Cohan · 2023 · arXiv 2311.09783

7 Pith papers cite this work. Polarity classification is still indexing.

7 Pith papers citing it

read on arXiv browse 7 citing papers

citation-role summary

background 2 method 1

citation-polarity summary

background 2 use method 1

representative citing papers

LiveBench: A Challenging, Contamination-Limited LLM Benchmark

cs.CL · 2024-06-27 · unverdicted · novelty 8.0

LiveBench is a contamination-limited LLM benchmark with auto-scored challenging tasks from recent sources across math, coding, reasoning and more, where top models score below 70%.

PRIMETIME : Limits of LLMs in Temporal Primitives

cs.NE · 2025-04-22 · unverdicted · novelty 7.0

PRIMETIME generator reveals that LLM datetime parsing and arithmetic primitives are individually unreliable but fully learnable via fine-tuning, enabling frontier-level accuracy on event planning with small LoRA models.

Coordinates of Capability: A Unified MTMM-Geometric Framework for LLM Evaluation

cs.CL · 2026-05-08 · unverdicted · novelty 6.0

A new MTMM-geometric framework unifies LLM evaluation metrics into three latent dimensions to separate method variance from true capabilities.

ActuBench: A Multi-Agent LLM Pipeline for Generation and Evaluation of Actuarial Reasoning Tasks

cs.AI · 2026-04-22 · unverdicted · novelty 6.0

ActuBench is a multi-agent LLM pipeline for generating and evaluating actuarial reasoning tasks, with evaluations of 50 models showing effective verification, competitive local open-weights models, and differing rankings between MCQ and LLM-judge scoring.

League of LLMs: A Benchmark-Free Paradigm for Mutual Evaluation of Large Language Models

cs.AI · 2025-07-30 · unverdicted · novelty 6.0

League of LLMs organizes LLMs into a self-governed mutual evaluation league using dynamic, transparent, objective, and professional criteria to distinguish model capabilities with 70.7% top-k ranking stability.

ZoFia: Zero-Shot Fake News Detection with Entity-Guided Retrieval and Multi-LLM Interaction

cs.CL · 2025-11-03 · unverdicted · novelty 5.0

ZoFia is a zero-shot fake news detection framework that uses hierarchical entity salience retrieval followed by multi-LLM adversarial debate to improve robustness over single-model approaches.

Benchmark Data Contamination of Large Language Models: A Survey

cs.CL · 2024-06-06 · unverdicted · novelty 3.0

A survey reviewing benchmark data contamination in LLMs, its impact on evaluation, and alternative assessment approaches.

citing papers explorer

Showing 1 of 1 citing paper after filters.

Coordinates of Capability: A Unified MTMM-Geometric Framework for LLM Evaluation cs.CL · 2026-05-08 · unverdicted · none · ref 6
A new MTMM-geometric framework unifies LLM evaluation metrics into three latent dimensions to separate method variance from true capabilities.

Investigating data contamination in modern benchmarks for large language models

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer