Task contamination: language models may not be few-shot anymore

Changmao Li, Jeffrey Flanigan · 2024 · DOI 10.1609/aaai.v38i16.29808

2 Pith papers cite this work. Polarity classification is still indexing.

representative citing papers

Provable Joint Decontamination for Benchmarking Multiple Large Language Models

cs.LG · 2026-05-20 · unverdicted · novelty 7.0

JECS aggregates per-model conformal p-values via their maximum and reconstructs a conservative envelope of the max-p null distribution to select benchmarks with global contamination rate control.

Silent Failures in Federated Personalization of Foundation Models

cs.LG · 2026-05-31 · unverdicted · novelty 6.0

Federated personalization of foundation models creates hard-to-detect trustworthiness failures due to privacy constraints, and existing benchmarks cannot adequately evaluate them.

citing papers explorer

Showing 2 of 2 citing papers.

Provable Joint Decontamination for Benchmarking Multiple Large Language Models
cs.LG · 2026-05-20 · unverdicted · none · ref 145

Silent Failures in Federated Personalization of Foundation Models
cs.LG · 2026-05-31 · unverdicted · none · ref 27
