Preference poisoning against log-linear DPO reduces to a binary sparse approximation problem solved by lattice-reduction (BAL-A) and matching-pursuit (BMP-A) algorithms that carry recovery guarantees.
Best-of-venom: Attacking rlhf by injecting poisoned preference data.ArXiv, abs/2404.05530
6 Pith papers cite this work. Polarity classification is still indexing.
verdicts
UNVERDICTED 6representative citing papers
Introduces robust estimators for linear Markov games in offline MARLHF that achieve O(ε^{1-o(1)}) or O(√ε) bounds on Nash or CCE gaps under uniform or unilateral coverage.
OBBR projects poisoned samples into benign space via rewriting with open-book examples, raising safety performance by 51% on average versus prior defenses across five attacks and four LLMs.
BackFlush detects backdoors via susceptibility amplification and eliminates them with RoPE unlearning to reach 1% ASR and 99% clean accuracy while preserving watermarks.
A single attacker can use strategic upvoting and downvoting on language model outputs to inject facts, security flaws, or fake news that persist in the model for all users after preference tuning.
An off-Earth autonomy pathway can reduce AGI confrontation incentives by making early cooperation preferable to power-seeking on Earth.
citing papers explorer
-
Be Kind, Rewrite: Benign Projections via Rewriting Defend Against LLM Data Poisoning Attacks
OBBR projects poisoned samples into benign space via rewriting with open-book examples, raising safety performance by 51% on average versus prior defenses across five attacks and four LLMs.
-
BackFlush: Knowledge-Free Backdoor Detection and Elimination with Watermark Preservation in Large Language Models
BackFlush detects backdoors via susceptibility amplification and eliminates them with RoPE unlearning to reach 1% ASR and 99% clean accuracy while preserving watermarks.