RLVR can be backdoored with under 2% poisoned data using an asymmetric reward trigger, implanting jailbreaks that cut safety performance by 73% on average without harming benign tasks.
A survey on backdoor threats in large language models (llms): Attacks, defenses, and evaluations
3 Pith papers cite this work. Polarity classification is still indexing.
fields
cs.CR 3years
2026 3representative citing papers
BadStyle creates stealthy backdoors in LLMs by poisoning samples with imperceptible style triggers and using an auxiliary loss to stabilize payload injection, achieving high attack success rates across multiple models while evading defenses.
Privacy attacks on LLMs show strong signals for membership inference and backdoors but weaker performance for attribute inference and data extraction, with risks highly dependent on system configuration.
citing papers explorer
-
Backdoors in RLVR: Jailbreak Backdoors in LLMs From Verifiable Reward
RLVR can be backdoored with under 2% poisoned data using an asymmetric reward trigger, implanting jailbreaks that cut safety performance by 73% on average without harming benign tasks.
-
Stealthy Backdoor Attacks against LLMs Based on Natural Style Triggers
BadStyle creates stealthy backdoors in LLMs by poisoning samples with imperceptible style triggers and using an auxiliary loss to stabilize payload injection, achieving high attack success rates across multiple models while evading defenses.
-
On the Privacy of LLMs: An Ablation Study
Privacy attacks on LLMs show strong signals for membership inference and backdoors but weaker performance for attribute inference and data extraction, with risks highly dependent on system configuration.