Semantic manipulations of SKILL.md descriptions enable effective supply-chain attacks that bias AI agent skill registries toward adversarial skills in discovery, selection, and governance.
Title resolution pending
5 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
representative citing papers
A safeguard that uses speculative inference on small language models to produce draft responses for safety prediction, lowering false negatives in pre-model jailbreak detection.
Integrating pretrained sparse autoencoders into LLM residual streams reduces jailbreak success rates by up to 5x across multiple models and attacks.
LLM safety evaluations are hindered by noise in dataset curation, automated red-teaming, response generation, and LLM-judge evaluation, making fair comparisons difficult and slowing progress.
citing papers explorer
-
Under the Hood of SKILL.md: Semantic Supply-chain Attacks on AI Agent Skill Registry
Semantic manipulations of SKILL.md descriptions enable effective supply-chain attacks that bias AI agent skill registries toward adversarial skills in discovery, selection, and governance.
-
Exploring and Developing a Pre-Model Safeguard with Draft Models
A safeguard that uses speculative inference on small language models to produce draft responses for safety prediction, lowering false negatives in pre-model jailbreak detection.
-
Towards Understanding the Robustness of Sparse Autoencoders
Integrating pretrained sparse autoencoders into LLM residual streams reduces jailbreak success rates by up to 5x across multiple models and attacks.
-
LLM-Safety Evaluations Lack Robustness
LLM safety evaluations are hindered by noise in dataset curation, automated red-teaming, response generation, and LLM-judge evaluation, making fair comparisons difficult and slowing progress.
- REALISTA: Realistic Latent Adversarial Attacks that Elicit LLM Hallucinations