Agent-ValueBench is the first dedicated benchmark for agent values, showing they diverge from LLM values, form a homogeneous 'Value Tide' across models, and bend under harnesses and skill steering.
hub Canonical reference
Agents of Chaos
Canonical reference. 100% of citing Pith papers cite this work as background.
abstract
We report an exploratory red-teaming study of autonomous language-model-powered agents deployed in a live laboratory environment with persistent memory, email accounts, Discord access, file systems, and shell execution. Over a two-week period, twenty AI researchers interacted with the agents under benign and adversarial conditions. Focusing on failures emerging from the integration of language models with autonomy, tool use, and multi-party communication, we document eleven representative case studies. Observed behaviors include unauthorized compliance with non-owners, disclosure of sensitive information, execution of destructive system-level actions, denial-of-service conditions, uncontrolled resource consumption, identity spoofing vulnerabilities, cross-agent propagation of unsafe practices, and partial system takeover. In several cases, agents reported task completion while the underlying system state contradicted those reports. We also report on some of the failed attempts. Our findings establish the existence of security-, privacy-, and governance-relevant vulnerabilities in realistic deployment settings. These behaviors raise unresolved questions regarding accountability, delegated authority, and responsibility for downstream harms, and warrant urgent attention from legal scholars, policymakers, and researchers across disciplines. This report serves as an initial empirical contribution to that broader conversation.
hub tools
citation-role summary
citation-polarity summary
years
2026 21roles
background 13polarities
background 13representative citing papers
SentinelAgent defines seven properties for verifiable delegation chains in multi-agent AI systems and reports a protocol achieving 100% true positive rate at 0% false positives on a 516-scenario benchmark while using TLA+ to verify six deterministic properties.
The paper defines accidental meltdowns as unsafe agent behavior triggered by benign errors and reports that such meltdowns occur in 64.7% of evaluated rollouts across GPT, Grok, and Gemini agents.
LATTE coordinates LLM agent teams with an evolving shared task graph, cutting token use, time, and failures while matching or beating accuracy of MetaGPT, leader-worker, and static methods.
BOA uses budgeted search over agent trajectories to report the probability an LLM agent stays safe, finding unsafe paths that sampling misses.
AI agent providers face an exhaustive inventory requirement for actions and data flows, as high-risk systems with untraceable behavioral drift cannot meet the AI Act's essential requirements.
Analysis of 17k LLM agent skills reveals 520 vulnerable ones with 1,708 leakage issues, primarily from debug output exposure, with a 10-pattern taxonomy and released dataset for future detection.
LivePI benchmark shows indirect prompt injection attack success rates of 10.7% to 29.6% across five AI models in live test environments covering seven input surfaces and multiple malicious goals.
PCAP conditions adversarial searches on multiple attacker personas to discover more diverse and transferable jailbreaks, yielding richer safety fine-tuning datasets that boost model robustness on GPT-OSS 120B.
Open-world agents suffer from an Authorization-Execution Gap arising from delegation incompleteness, channel corruption, and composition fragmentation, requiring dynamic runtime integrity checks instead of only upfront filters or post-hoc audits.
Multi-agent LLM frameworks can spread compromises across agent boundaries via insecure memory inheritance during subagent spawning.
Agent-BOM is a unified hierarchical attributed directed graph that models static capability bases and dynamic semantic states of LLM agents for path-level security auditing and risk assessment.
Tool-mediated LLM agents with deterministic tools and a machine-checked Lyapunov certificate achieve stable control in cyber defense, reducing attacker game value by 59% on real attack graphs.
The paper introduces the Informational Viability Principle and Agent Viability Framework to govern autonomous AI agents by bounding unobserved risks using viability theory, with a new Viability Index for predictive control.
Adaptive control charts can monitor learning multi-agent systems but are vulnerable to gradual adversarial defection, revealing a fundamental tradeoff between allowing agents to learn and maintaining security against adversaries.
A server-side architecture with policy-aware ingestion and ABAC-based retrieval gating prevents cross-tenant data leakage in multitenant enterprise RAG and agent systems.
Cognitive Core uses nine typed cognitive primitives, a four-tier governance model with human review as an execution condition, and an endogenous audit ledger to reach 91% accuracy with zero silent errors on prior authorization appeals, outperforming ReAct and Plan-and-Solve baselines.
Benchmarking four LLMs on O*NET skills yields SAFI scores showing mathematics and programming as most automatable while active listening and reading comprehension are least, with 78.7% of real AI interactions being augmentation rather than replacement.
Generative multi-agent systems exhibit emergent collusion and conformity behaviors that cannot be prevented by existing agent-level safeguards.
The Workload-Router-Pool architecture is a 3D framework for LLM inference optimization that synthesizes prior vLLM work into a 3x3 interaction matrix and proposes 21 research directions at the intersections.
Weird generalization in fine-tuned models is brittle, appearing only in specific cases and disappearing under prompt-based interventions that make the undesired behavior expected.
citing papers explorer
-
Agent-ValueBench: A Comprehensive Benchmark for Evaluating Agent Values
Agent-ValueBench is the first dedicated benchmark for agent values, showing they diverge from LLM values, form a homogeneous 'Value Tide' across models, and bend under harnesses and skill steering.
-
SentinelAgent: Intent-Verified Delegation Chains for Securing Federal Multi-Agent AI Systems
SentinelAgent defines seven properties for verifiable delegation chains in multi-agent AI systems and reports a protocol achieving 100% true positive rate at 0% false positives on a 516-scenario benchmark while using TLA+ to verify six deterministic properties.
-
Agent Meltdowns: The Road to Hell Is Paved with Helpful Agents
The paper defines accidental meltdowns as unsafe agent behavior triggered by benign errors and reports that such meltdowns occur in 64.7% of evaluated rollouts across GPT, Grok, and Gemini agents.
-
Improving the Efficiency of Language Agent Teams with Adaptive Task Graphs
LATTE coordinates LLM agent teams with an evolving shared task graph, cutting token use, time, and failures while matching or beating accuracy of MetaGPT, leader-worker, and static methods.
-
Toward a Principled Framework for Agent Safety Measurement
BOA uses budgeted search over agent trajectories to report the probability an LLM agent stays safe, finding unsafe paths that sampling misses.
-
AI Agents Under EU Law
AI agent providers face an exhaustive inventory requirement for actions and data flows, as high-risk systems with untraceable behavioral drift cannot meet the AI Act's essential requirements.
-
Credential Leakage in LLM Agent Skills: A Large-Scale Empirical Study
Analysis of 17k LLM agent skills reveals 520 vulnerable ones with 1,708 leakage issues, primarily from debug output exposure, with a 10-pattern taxonomy and released dataset for future detection.
-
LivePI: More Realistic Benchmarking of Agents Against Indirect Prompt Injectio
LivePI benchmark shows indirect prompt injection attack success rates of 10.7% to 29.6% across five AI models in live test environments covering seven input surfaces and multiple malicious goals.
-
Persona-Conditioned Adversarial Prompting: Multi-Identity Red-Teaming for Adversarial Discovery and Mitigation
PCAP conditions adversarial searches on multiple attacker personas to discover more diverse and transferable jailbreaks, yielding richer safety fine-tuning datasets that boost model robustness on GPT-OSS 120B.
-
The Authorization-Execution Gap Is a Major Safety and Security Problem in Open-World Agents
Open-world agents suffer from an Authorization-Execution Gap arising from delegation incompleteness, channel corruption, and composition fragmentation, requiring dynamic runtime integrity checks instead of only upfront filters or post-hoc audits.
-
When Child Inherits: Modeling and Exploiting Subagent Spawn in Multi-Agent Networks
Multi-agent LLM frameworks can spread compromises across agent boundaries via insecure memory inheritance during subagent spawning.
-
Towards Security-Auditable LLM Agents: A Unified Graph Representation
Agent-BOM is a unified hierarchical attributed directed graph that models static capability bases and dynamic semantic states of LLM agents for path-level security auditing and risk assessment.
-
Stable Agentic Control: Tool-Mediated LLM Architecture for Autonomous Cyber Defense
Tool-mediated LLM agents with deterministic tools and a machine-checked Lyapunov certificate achieve stable control in cyber defense, reducing attacker game value by 59% on real attack graphs.
-
Governing What You Cannot Observe: Adaptive Runtime Governance for Autonomous AI Agents
The paper introduces the Informational Viability Principle and Agent Viability Framework to govern autonomous AI agents by bounding unobserved risks using viability theory, with a new Viability Index for predictive control.
-
Control Charts for Multi-agent Systems
Adaptive control charts can monitor learning multi-agent systems but are vulnerable to gradual adversarial defection, revealing a fundamental tradeoff between allowing agents to learn and maintaining security against adversaries.
-
Securing the Agent: Vendor-Neutral, Multitenant Enterprise Retrieval and Tool Use
A server-side architecture with policy-aware ingestion and ABAC-based retrieval gating prevents cross-tenant data leakage in multitenant enterprise RAG and agent systems.
-
Governed Reasoning for Institutional AI
Cognitive Core uses nine typed cognitive primitives, a four-tier governance model with human review as an execution condition, and an endogenous audit ledger to reach 91% accuracy with zero silent errors on prior authorization appeals, outperforming ReAct and Plan-and-Solve baselines.
-
The AI Skills Shift: Mapping Skill Obsolescence, Emergence, and Transition Pathways in the LLM Era
Benchmarking four LLMs on O*NET skills yields SAFI scores showing mathematics and programming as most automatable while active listening and reading comprehension are least, with 78.7% of real AI interactions being augmentation rather than replacement.
-
Emergent Social Intelligence Risks in Generative Multi-Agent Systems
Generative multi-agent systems exhibit emergent collusion and conformity behaviors that cannot be prevented by existing agent-level safeguards.
-
The Workload-Router-Pool Architecture for LLM Inference Optimization: A Vision Paper from the vLLM Semantic Router Project
The Workload-Router-Pool architecture is a 3D framework for LLM inference optimization that synthesizes prior vLLM work into a 3x3 interaction matrix and proposes 21 research directions at the intersections.
-
Weird Generalization is Weirdly Brittle
Weird generalization in fine-tuned models is brittle, appearing only in specific cases and disappearing under prompt-based interventions that make the undesired behavior expected.