pith. machine review for the scientific record. sign in

A survey on backdoor threats in large language models (llms): Attacks, defenses, and evaluations

3 Pith papers cite this work. Polarity classification is still indexing.

3 Pith papers citing it

fields

cs.CR 3

years

2026 3

representative citing papers

Stealthy Backdoor Attacks against LLMs Based on Natural Style Triggers

cs.CR · 2026-04-23 · unverdicted · novelty 6.0

BadStyle creates stealthy backdoors in LLMs by poisoning samples with imperceptible style triggers and using an auxiliary loss to stabilize payload injection, achieving high attack success rates across multiple models while evading defenses.

On the Privacy of LLMs: An Ablation Study

cs.CR · 2026-05-04 · unverdicted · novelty 4.0

Privacy attacks on LLMs show strong signals for membership inference and backdoors but weaker performance for attribute inference and data extraction, with risks highly dependent on system configuration.

citing papers explorer

Showing 3 of 3 citing papers.

  • Backdoors in RLVR: Jailbreak Backdoors in LLMs From Verifiable Reward cs.CR · 2026-04-10 · accept · none · ref 7

    RLVR can be backdoored with under 2% poisoned data using an asymmetric reward trigger, implanting jailbreaks that cut safety performance by 73% on average without harming benign tasks.

  • Stealthy Backdoor Attacks against LLMs Based on Natural Style Triggers cs.CR · 2026-04-23 · unverdicted · none · ref 21

    BadStyle creates stealthy backdoors in LLMs by poisoning samples with imperceptible style triggers and using an auxiliary loss to stabilize payload injection, achieving high attack success rates across multiple models while evading defenses.

  • On the Privacy of LLMs: An Ablation Study cs.CR · 2026-05-04 · unverdicted · none · ref 21

    Privacy attacks on LLMs show strong signals for membership inference and backdoors but weaker performance for attribute inference and data extraction, with risks highly dependent on system configuration.