In the proportional high-dimensional regime, stronger backdoor training triggers improve clean accuracy and make attack success non-monotonic for regularized GLMs on Gaussian mixtures, with closed-form proofs for squared loss and fixed-point extensions to convex losses.
Poisoning attacks on llms require a near-constant number of poison samples
7 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
years
2026 7roles
background 3representative citing papers
Subliminal learning occurs via compatible auxiliary and class output heads on task-unrelated inputs, even with random hidden layers or architecture changes, with theory and upper bounds on failure.
RUBEN discovers minimal rule sets explaining RAG LLM outputs via novel pruning and applies them to evaluate LLM safety against adversarial injections.
Proposes a two-gradient-field model with candidate order parameters alpha_dagger and kappa_c to unify phase transitions across learning theory and non-equilibrium chemistry.
Dual-space semantic-character mutations on prompts achieve higher misuse success rates against DeepSeek than single-space attacks alone.
The paper analyzes evolving security and safety threats in generative AI from content generation to agentic actions, noting that attack surfaces expand faster than defenses and that many safeguards require institutional coordination not yet in place.
citing papers explorer
-
When Stronger Triggers Backfire: A High-Dimensional Theory of Backdoor Attacks
In the proportional high-dimensional regime, stronger backdoor training triggers improve clean accuracy and make attack success non-monotonic for regularized GLMs on Gaussian mixtures, with closed-form proofs for squared loss and fixed-point extensions to convex losses.
-
Learning Through Noise: Why Subliminal Learning Works and When It Fails
Subliminal learning occurs via compatible auxiliary and class output heads on task-unrelated inputs, even with random hidden layers or architecture changes, with theory and upper bounds on failure.
-
RUBEN: Rule-Based Explanations for Retrieval-Augmented LLM Systems
RUBEN discovers minimal rule sets explaining RAG LLM outputs via novel pruning and applies them to evaluate LLM safety against adversarial injections.
-
Phase Transitions in Driven Informational Systems: A Two-Field Perspective on Learning Theory and Non-Equilibrium Chemistry
Proposes a two-gradient-field model with candidate order parameters alpha_dagger and kappa_c to unify phase transitions across learning theory and non-equilibrium chemistry.
-
DeepSeek Robustness Against Semantic-Character Dual-Space Mutated Prompt Injection
Dual-space semantic-character mutations on prompts achieve higher misuse success rates against DeepSeek than single-space attacks alone.
-
From AI-Generated Content to Agentic Action: Security and Safety Threats in Generative AI
The paper analyzes evolving security and safety threats in generative AI from content generation to agentic actions, noting that attack surfaces expand faster than defenses and that many safeguards require institutional coordination not yet in place.
- Narrow Secret Loyalty Dodges Black-Box Audits