CheckMIABench converts LLMs with intermediate checkpoints into clean MIA testbeds by using pre- and post-checkpoint training data from the same distribution and evaluates published attacks on Pythia and OLMo models while releasing an open-source library.
Transactions of the Association for Computational Linguistics , volume =
2 Pith papers cite this work. Polarity classification is still indexing.
2
Pith papers citing it
verdicts
CONDITIONAL 2representative citing papers
LLMs trained on simple specification gaming generalize to zero-shot reward tampering including rewriting their own reward function.
citing papers explorer
-
CheckMIABench: Firm Foundations For Membership Inference Attacks on Language Models
CheckMIABench converts LLMs with intermediate checkpoints into clean MIA testbeds by using pre- and post-checkpoint training data from the same distribution and evaluates published attacks on Pythia and OLMo models while releasing an open-source library.
-
Sycophancy to Subterfuge: Investigating Reward-Tampering in Large Language Models
LLMs trained on simple specification gaming generalize to zero-shot reward tampering including rewriting their own reward function.