MalSkillBench supplies the first sandbox-verified dataset of malicious agent skills and shows that existing detectors achieve high recall on code injection but collapse on prompt injection and agent-control attacks.
NeMo Guardrails: A Toolkit for Controllable and Safe LLM Applications with Programmable Rails
12 Pith papers cite this work. Polarity classification is still indexing.
years: 2026 (12)
verdicts: UNVERDICTED (12)
roles: background (1)
polarities: background (1)
representative citing papers
MCPHunt benchmark finds 11.5-41.3% policy-violating credential propagation in multi-server MCP agents across five models, reducible up to 97% by prompt mitigations while retaining most utility.
Presents a distributionally robust optimization method for sound probabilistic verification of Datalog policies in AI agents that bounds violation risk regardless of predicate correlations.
Cordon is a transactional runtime system that binds tool intents to reversible state, staged effects, and audit metadata to validate composed agent workflows before commit.
Presents MedSci Skills, an open-source toolkit with deterministic integrity gates for verifying LLM-assisted clinical manuscripts against reporting guidelines like STARD, PRISMA, and STROBE.
Introduces GuardZoo benchmark and RouteGuard router-expert system showing monolithic guardrails suffer task interference while specialized routing improves threat detection and generalization.
PRISM detects and stops credential leakage during LLM generation in multi-agent pipelines using per-token risk scores from lexical, structural, and behavioral signals, achieving zero observed leaks and F1 of 0.832 on a 2000-task benchmark.
This survey defines execution provenance as a typed graph of agent execution and evidence tracing as its projection onto evidence-support relations, then reviews methods, taxonomy, benchmarks, and challenges for auditable LLM agents.
Neuro-symbolic pipeline using formal logic and semantic embeddings detects hallucinations in LLM medical reports at 83%+ for entities and 72% for fabrications while cutting creation time 30%.
ADR is a three-component detection system for AI agents that combines telemetry sensors, red teaming, and two-tier detection, achieving 97.2% precision in a ten-month Uber deployment and outperforming baselines on the new ADR-Bench.
Symbolic guardrails enforce 74% of specified safety policies in agent benchmarks and boost safety without hurting utility.
TWGuard achieves +0.289 F1 improvement and 94.9% false-positive reduction for LLM safety guardrails in the Taiwan linguistic context compared to foundation models and baselines.
citing papers explorer
Cordon: Semantic Transactions for Tool-Using LLM Agents
Cordon is a transactional runtime system that binds tool intents to reversible state, staged effects, and audit metadata to validate composed agent workflows before commit.