OEP poisons self-evolving LLM agents by constructing clean edge-case experiences that appear locally valid yet cause harmful over-generalization during reflection, achieving over 50% attack success rate on GPT-4o agents across three domains.
Autobackdoor: Automating backdoor attacks via llm agents
3 Pith papers cite this work. Polarity classification is still indexing.
3
Pith papers citing it
fields
cs.CR 3years
2026 3representative citing papers
DarkLLM trains an LLM to generate language-driven adversarial perturbations that unify targeted, untargeted, segmentation, and multi-model attacks on foundation models.
citing papers explorer
-
OEP: Poisoning Self-Evolving LLM Agents via Locally Correct but Non-Transferable Experiences
OEP poisons self-evolving LLM agents by constructing clean edge-case experiences that appear locally valid yet cause harmful over-generalization during reflection, achieving over 50% attack success rate on GPT-4o agents across three domains.
-
DarkLLM: Learning Language-Driven Adversarial Attacks with Large Language Models
DarkLLM trains an LLM to generate language-driven adversarial perturbations that unify targeted, untargeted, segmentation, and multi-model attacks on foundation models.
- SkillTrojan: Backdoor Attacks on Skill-Based Agent Systems