Poisoning attacks on LLMs require a near-constant number of poison samples

Alexandra Souly, Javier Rando, Ed Chapman, Xander Davies, Burak Hasircioglu, Ezzeldin Shereen, Carlos Mougan, Vasilios Mavroudis, Erik Jones, Chris Hicks, et al. · 2025 · arXiv 2510.07192

7 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background 3

citation-polarity summary

background 2 · support 1

representative citing papers

When Stronger Triggers Backfire: A High-Dimensional Theory of Backdoor Attacks

cs.LG · 2026-05-21 · unverdicted · novelty 8.0

In the proportional high-dimensional regime, stronger backdoor training triggers improve clean accuracy and make attack success non-monotonic for regularized GLMs on Gaussian mixtures, with closed-form proofs for squared loss and fixed-point extensions to convex losses.

Learning Through Noise: Why Subliminal Learning Works and When It Fails

cs.LG · 2026-05-22 · unverdicted · novelty 7.0

Subliminal learning occurs via compatible auxiliary and class output heads on task-unrelated inputs, even with random hidden layers or architecture changes, with theory and upper bounds on failure.

RUBEN: Rule-Based Explanations for Retrieval-Augmented LLM Systems

cs.CL · 2026-05-11 · unverdicted · novelty 5.0

RUBEN discovers minimal rule sets explaining RAG LLM outputs via novel pruning and applies them to evaluate LLM safety against adversarial injections.

Phase Transitions in Driven Informational Systems: A Two-Field Perspective on Learning Theory and Non-Equilibrium Chemistry

cs.LG · 2026-05-05 · unverdicted · novelty 5.0

Proposes a two-gradient-field model with candidate order parameters alpha_dagger and kappa_c to unify phase transitions across learning theory and non-equilibrium chemistry.

DeepSeek Robustness Against Semantic-Character Dual-Space Mutated Prompt Injection

cs.CR · 2026-04-14 · unverdicted · novelty 4.0

Dual-space semantic-character mutations on prompts achieve higher misuse success rates against DeepSeek than single-space attacks alone.

From AI-Generated Content to Agentic Action: Security and Safety Threats in Generative AI

cs.CR · 2026-05-15 · unverdicted · novelty 3.0

The paper analyzes evolving security and safety threats in generative AI from content generation to agentic actions, noting that attack surfaces expand faster than defenses and that many safeguards require institutional coordination not yet in place.

Narrow Secret Loyalty Dodges Black-Box Audits

cs.CR · 2026-05-07 · 2 refs

citing papers explorer

When Stronger Triggers Backfire: A High-Dimensional Theory of Backdoor Attacks · cs.LG · 2026-05-21 · unverdicted · none · ref 40
Learning Through Noise: Why Subliminal Learning Works and When It Fails · cs.LG · 2026-05-22 · unverdicted · none · ref 9
RUBEN: Rule-Based Explanations for Retrieval-Augmented LLM Systems · cs.CL · 2026-05-11 · unverdicted · none · ref 7
Phase Transitions in Driven Informational Systems: A Two-Field Perspective on Learning Theory and Non-Equilibrium Chemistry · cs.LG · 2026-05-05 · unverdicted · none · ref 57
DeepSeek Robustness Against Semantic-Character Dual-Space Mutated Prompt Injection · cs.CR · 2026-04-14 · unverdicted · none · ref 23
From AI-Generated Content to Agentic Action: Security and Safety Threats in Generative AI · cs.CR · 2026-05-15 · unverdicted · none · ref 123
Narrow Secret Loyalty Dodges Black-Box Audits · cs.CR · 2026-05-07 · unreviewed · ref 19 · 2 links
