Reinforcement learning with verifiable rewards (RLVR) can be backdoored with less than 2% poisoned data using an asymmetric reward trigger, implanting jailbreaks that cut safety performance by 73% on average without harming benign tasks.
A survey on backdoor threats in large language models (LLMs): Attacks, defenses, and evaluations
7 Pith papers cite this work.
representative citing papers
QuantGuard is a pre-quantization method using differentiable rounding controls, error-guided reversal constraints, output consistency, and weight regularization on a small calibration set to suppress quantization-conditioned backdoors while preserving performance.
Sparse autoencoders identify shared latent features across diverse backdoor attacks in LLMs that enable unified detection via classifiers, causal control via steering, and mitigation via ablation fine-tuning.
BadStyle creates stealthy backdoors in LLMs by poisoning samples with imperceptible style triggers and using an auxiliary loss to stabilize payload injection, achieving high attack success rates across multiple models while evading defenses.
Privacy attacks on LLMs show strong signals for membership inference and backdoors but weaker performance for attribute inference and data extraction, with risks highly dependent on system configuration.
Sentra-Guard reports 99.96% detection of adversarial LLM prompts with AUC 1.00 and ASR of 0.004% using a hybrid SBERT-FAISS and transformer classifier architecture with multilingual translation and human feedback.
A survey that examines fragmentation in existing 6G security approaches, develops a cross-layer threat taxonomy, maps countermeasures, and identifies research gaps for trustworthy AI-native 6G ecosystems.
citing papers explorer
- Sentra-Guard: A Real-Time Multilingual Defense Against Adversarial LLM Prompts