Leak, Cheat, Repeat: Data Contamination and Evaluation Malpractices in Closed-Source LLMs
3 Pith papers cite this work. Polarity classification is still indexing.
citation-role and citation-polarity summary
years: 2026 (3)
verdicts: UNVERDICTED (3)
roles: background (2)
polarities: background (2)
representative citing papers
JECS aggregates per-model conformal p-values via their maximum and reconstructs a conservative envelope of the max-p null distribution to select benchmarks with global contamination rate control.
Introduces a feature-level annotated patent dataset and LLM retrieval-reasoning workflows that outperform embedding baselines on passage retrieval and novel feature identification while avoiding spurious correlations in novelty prediction.
citing papers explorer
- Pretraining Exposure Explains Popularity Judgments in Large Language Models
  LLM popularity judgments align more closely with pretraining data exposure counts than with Wikipedia popularity, with stronger effects in pairwise comparisons and larger models.
- Provable Joint Decontamination for Benchmarking Multiple Large Language Models
  JECS aggregates per-model conformal p-values via their maximum and reconstructs a conservative envelope of the max-p null distribution to select benchmarks with global contamination rate control.
- Is It Novel and Why? Fine-Grained Patent Novelty Prediction Based on Passage Retrieval
  Introduces a feature-level annotated patent dataset and LLM retrieval-reasoning workflows that outperform embedding baselines on passage retrieval and novel feature identification while avoiding spurious correlations in novelty prediction.