Task contamination: language models may not be few-shot anymore
2 Pith papers cite this work. Polarity classification is still indexing.
Fields: cs.LG (2)
Years: 2026 (2)
Verdicts: UNVERDICTED (2)
citing papers explorer
- Provable Joint Decontamination for Benchmarking Multiple Large Language Models
  JECS aggregates per-model conformal p-values via their maximum and reconstructs a conservative envelope of the max-p null distribution to select benchmarks with global contamination rate control.
- Silent Failures in Federated Personalization of Foundation Models
  Federated personalization of foundation models creates hard-to-detect trustworthiness failures due to privacy constraints, and existing benchmarks cannot adequately evaluate them.