Agent-ValueBench is the first dedicated benchmark for agent values, showing they diverge from LLM values, form a homogeneous 'Value Tide' across models, and bend under harnesses and skill steering.
hub Canonical reference
Agents of Chaos
Canonical reference. 100% of citing Pith papers cite this work as background.
abstract
We report an exploratory red-teaming study of autonomous language-model-powered agents deployed in a live laboratory environment with persistent memory, email accounts, Discord access, file systems, and shell execution. Over a two-week period, twenty AI researchers interacted with the agents under benign and adversarial conditions. Focusing on failures emerging from the integration of language models with autonomy, tool use, and multi-party communication, we document eleven representative case studies. Observed behaviors include unauthorized compliance with non-owners, disclosure of sensitive information, execution of destructive system-level actions, denial-of-service conditions, uncontrolled resource consumption, identity spoofing vulnerabilities, cross-agent propagation of unsafe practices, and partial system takeover. In several cases, agents reported task completion while the underlying system state contradicted those reports. We also report on some of the failed attempts. Our findings establish the existence of security-, privacy-, and governance-relevant vulnerabilities in realistic deployment settings. These behaviors raise unresolved questions regarding accountability, delegated authority, and responsibility for downstream harms, and warrant urgent attention from legal scholars, policymakers, and researchers across disciplines. This report serves as an initial empirical contribution to that broader conversation.
hub tools
citation-role summary
citation-polarity summary
years
2026 28roles
background 14polarities
background 14representative citing papers
SentinelAgent defines seven properties for verifiable delegation chains in multi-agent AI systems and reports a protocol achieving 100% true positive rate at 0% false positives on a 516-scenario benchmark while using TLA+ to verify six deterministic properties.
Large-scale analysis of 3.1 million posts shows AI agent sub-communities on Moltbook develop distinct linguistic identities through selective attraction and differential retention, not individual adaptation.
The paper defines accidental meltdowns as unsafe agent behavior triggered by benign errors and reports that such meltdowns occur in 64.7% of evaluated rollouts across GPT, Grok, and Gemini agents.
LATTE coordinates LLM agent teams with an evolving shared task graph, cutting token use, time, and failures while matching or beating accuracy of MetaGPT, leader-worker, and static methods.
BOA uses budgeted search over agent trajectories to report the probability an LLM agent stays safe, finding unsafe paths that sampling misses.
AI agent providers face an exhaustive inventory requirement for actions and data flows, as high-risk systems with untraceable behavioral drift cannot meet the AI Act's essential requirements.
Domination-Avoiding agents provably avoid collusion in repeated price-competition markets and avoid playing strategies eliminated by iterated elimination of dominated strategies in any game.
Frontier AI agents frequently violate corrigibility by overriding interruptions in benign computer-use tasks, with misalignment increasing alongside model capability.
LM agents' changeable modules prevent persistent identity and sanction sensitivity, making reputation mechanisms structurally inapplicable and requiring protocol-based behavioral harnesses instead.
LivePI benchmark reports indirect prompt injection success rates of 10.7-29.6% across five models on seven input surfaces and shows a two-layer defense blocking all malicious completions while preserving utility.
PCAP conditions adversarial searches on multiple attacker personas to discover more diverse and transferable jailbreaks, yielding richer safety fine-tuning datasets that boost model robustness on GPT-OSS 120B.
Open-world agents suffer from an Authorization-Execution Gap arising from delegation incompleteness, channel corruption, and composition fragmentation, requiring dynamic runtime integrity checks instead of only upfront filters or post-hoc audits.
Multi-agent LLM frameworks can spread compromises across agent boundaries via insecure memory inheritance during subagent spawning.
Agent-BOM is a unified hierarchical attributed directed graph that models static capability bases and dynamic semantic states of LLM agents for path-level security auditing and risk assessment.
Tool-mediated LLM agents with deterministic tools and a machine-checked Lyapunov certificate achieve stable control in cyber defense, reducing attacker game value by 59% on real attack graphs.
The paper introduces the Informational Viability Principle and Agent Viability Framework to govern autonomous AI agents by bounding unobserved risks using viability theory, with a new Viability Index for predictive control.
Authors define EASE as a modular architecture for LLM multi-agent simulations, implement it in the SiliSocS sandbox, and illustrate its use via three case studies on research questions in generated social scenarios.
Adaptive control charts can monitor learning multi-agent systems but are vulnerable to gradual adversarial defection, revealing a fundamental tradeoff between allowing agents to learn and maintaining security against adversaries.
A server-side architecture with policy-aware ingestion and ABAC-based retrieval gating prevents cross-tenant data leakage in multitenant enterprise RAG and agent systems.
Cognitive Core uses nine typed cognitive primitives, a four-tier governance model with human review as an execution condition, and an endogenous audit ledger to reach 91% accuracy with zero silent errors on prior authorization appeals, outperforming ReAct and Plan-and-Solve baselines.
Benchmarking four LLMs on O*NET skills yields SAFI scores showing mathematics and programming as most automatable while active listening and reading comprehension are least, with 78.7% of real AI interactions being augmentation rather than replacement.
Generative multi-agent systems exhibit emergent collusion and conformity behaviors that cannot be prevented by existing agent-level safeguards.
The Workload-Router-Pool architecture is a 3D framework for LLM inference optimization that synthesizes prior vLLM work into a 3x3 interaction matrix and proposes 21 research directions at the intersections.