A survey on backdoor threats in large language models (LLMs): Attacks, defenses, and evaluations

Yihe Zhou, Tao Ni, Wei-Bin Lee, Qingchuan Zhao · 2025 · arXiv 2502.05224

5 Pith papers cite this work. Polarity classification is still indexing.


representative citing papers

Backdoors in RLVR: Jailbreak Backdoors in LLMs From Verifiable Reward

cs.CR · 2026-04-10 · accept · novelty 7.0

RLVR can be backdoored with under 2% poisoned data using an asymmetric reward trigger, implanting jailbreaks that cut safety performance by 73% on average without harming benign tasks.

Breaking the Rounding Trap: Securing LLMs against Quantization-Conditioned Backdoors

cs.CR · 2026-06-28 · unverdicted · novelty 6.0

QuantGuard is a pre-quantization method using differentiable rounding controls, error-guided reversal constraints, output consistency, and weight regularization on a small calibration set to suppress quantization-conditioned backdoors while preserving performance.

Stealthy Backdoor Attacks against LLMs Based on Natural Style Triggers

cs.CR · 2026-04-23 · unverdicted · novelty 6.0

BadStyle creates stealthy backdoors in LLMs by poisoning samples with imperceptible style triggers and using an auxiliary loss to stabilize payload injection, achieving high attack success rates across multiple models while evading defenses.

On the Privacy of LLMs: An Ablation Study

cs.CR · 2026-05-04 · unverdicted · novelty 4.0

Privacy attacks on LLMs show strong signals for membership inference and backdoors but weaker performance for attribute inference and data extraction, with risks highly dependent on system configuration.

Sentra-Guard: A Real-Time Multilingual Defense Against Adversarial LLM Prompts

cs.CR · 2025-10-26 · unverdicted · novelty 4.0

Sentra-Guard reports 99.96% detection of adversarial LLM prompts with AUC 1.00 and ASR of 0.004% using a hybrid SBERT-FAISS and transformer classifier architecture with multilingual translation and human feedback.

