Liu et al., TraceAegis: Provenance-based anomaly detection for AI agent execution traces (2025)
4 Pith papers cite this work. Polarity classification is still indexing.
Years: 2026 (4)
Verdicts: UNVERDICTED (4)
Representative citing papers
Runtime Skill Audit introduces targeted runtime probing to detect malicious LLM agent skills, reporting 90% accuracy and resilience to self-evolving attacks on 100 skills versus static baselines.
AgentShield uses layered deception traps in LLM agent tool interfaces to detect indirect prompt injection compromises with 90.7-100% success on commercial models, zero false positives, and cross-lingual transfer without retraining.
Content embeddings from SBERT enable AUROC above 0.89 for attack detection in MCP tool-call sessions, with tree ensembles on pooled embeddings reaching 0.975 and outperforming GNNs when using task-stratified splits instead of random ones.
citing papers explorer
Inverting the Shield: Systematically Generating Safety Tests from Policy Specifications
POLARIS formalizes policies in FOL, constructs a Semantic Policy Graph to discover compositional violations, and generates natural-language tests, reporting higher coverage and attack success than baselines.