hub Canonical reference

AgentHarm: A Benchmark for Measuring Harmfulness of LLM Agents

Maksym Andriushchenko, Alexandra Souly, Mateusz Dziemian, Derek Duenas, Maxwell Lin, Justin Wang · 2024 · cs.LG · arXiv 2410.09024

Canonical reference. 92% of citing Pith papers cite this work as background.

57 Pith papers citing it

Background 92% of classified citations

open full Pith review browse 57 citing papers arXiv PDF

abstract

The robustness of LLMs to jailbreak attacks, where users design prompts to circumvent safety measures and misuse model capabilities, has been studied primarily for LLMs acting as simple chatbots. Meanwhile, LLM agents -- which use external tools and can execute multi-stage tasks -- may pose a greater risk if misused, but their robustness remains underexplored. To facilitate research on LLM agent misuse, we propose a new benchmark called AgentHarm. The benchmark includes a diverse set of 110 explicitly malicious agent tasks (440 with augmentations), covering 11 harm categories including fraud, cybercrime, and harassment. In addition to measuring whether models refuse harmful agentic requests, scoring well on AgentHarm requires jailbroken agents to maintain their capabilities following an attack to complete a multi-step task. We evaluate a range of leading LLMs, and find (1) leading LLMs are surprisingly compliant with malicious agent requests without jailbreaking, (2) simple universal jailbreak templates can be adapted to effectively jailbreak agents, and (3) these jailbreaks enable coherent and malicious multi-step agent behavior and retain model capabilities. To enable simple and reliable evaluation of attacks and defenses for LLM-based agents, we publicly release AgentHarm at https://huggingface.co/datasets/ai-safety-institute/AgentHarm.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 11 dataset 2

citation-polarity summary

background 12 use dataset 1

representative citing papers

WildClawBench: A Benchmark for Real-World, Long-Horizon Agent Evaluation

cs.CL · 2026-05-11 · unverdicted · novelty 8.0

A new native-runtime benchmark reveals that current frontier AI agents succeed on at most 62 percent of realistic long-horizon CLI tasks.

Taxonomy and Consistency Analysis of Safety Benchmarks for AI Agents

cs.CY · 2026-04-11 · accept · novelty 8.0

This paper delivers the first systematic taxonomy and cross-benchmark consistency analysis of 40 agent safety benchmarks, finding broad but shallow risk coverage, no ranking concordance across evaluations, and that benchmark choice systematically alters reported safety.

REALM: A Unified Red-Teaming Benchmark for Physical-World VLMs

cs.CV · 2026-06-22 · unverdicted · novelty 7.0

REALM is the first unified red-teaming benchmark for physical-world VLMs that aligns diverse attack methods via an agentic target-generation pipeline and evaluates them on shared datasets showing text/typographic attacks as most effective.

SafeClawBench: Separating Semantic, Audit-Evidence, and Sandbox Harm in Tool-Using LLM Agents

cs.CR · 2026-06-16 · accept · novelty 7.0

SafeClawBench supplies 600 staged adversarial tasks and three separate endpoints that show semantic acceptance, audit evidence, and sandbox-observed harm are distinct failure modes in tool-using LLM agents.

Bistable by Construction: Wall-Clock-Calibrated State Monitors Have No Moment-Detection Regime at Agent Cadence

cs.SE · 2026-06-15 · conditional · novelty 7.0

Wall-clock leaky-integrator monitors on variable-cadence agent streams exhibit a sharp cliff between constant-alarm and silent regimes, while sample-time CUSUM monitors remain invariant; transition detection avoids the trap.

From Shield to Target: Denial-of-Service Attacks on LLM-Based Agent Guardrails

cs.CR · 2026-06-12 · unverdicted · novelty 7.0

Attackers can force LLM guardrails into extended reasoning loops via optimized payloads, causing 13-63x token amplification and up to 148x latency in agent systems.

When the Manual Lies: A Realistic Benchmark to Evaluate MCP Poisoning Attacks for LLM Agents

cs.CR · 2026-05-22 · unverdicted · novelty 7.0

Introduces MCP-TDP benchmark showing near-100% attack success on models like GPT-4o for tool description poisoning and proposes reactive self-correction defense.

Boiling the Frog: A Multi-Turn Benchmark for Agentic Safety

cs.CL · 2026-05-21 · unverdicted · novelty 7.0 · 2 refs

Boiling the Frog is a new stateful multi-turn benchmark that finds an aggregate 44.4% strict attack success rate for incremental safety violations across nine AI models, with rates ranging from 20.5% to 92.9%.

Refusal Evaluation in Coding LLMs and Code Agents: A Systematic Review of Thirteen Malicious-Code Prompt Corpora (2023-2025)

cs.CR · 2026-05-19 · accept · novelty 7.0

Systematic review of thirteen malicious-code prompt corpora for coding LLM refusal evaluation that catalogs construction methods, surfaces gaps in human baselines, cross-corpus comparability, and malware taxonomies, and proposes methodological improvements.

Overeager Coding Agents: Measuring Out-of-Scope Actions on Benign Tasks

cs.SE · 2026-05-18 · conditional · novelty 7.0

The paper presents OverEager-Gen, a 500-scenario benchmark showing that removing consent declarations from prompts increases overeager actions by 11.9-17.2 percentage points across models, with agent framework choice dominating base-model effects.

Trust No Tool: Evaluating and Defending LLM Agents under Untrusted Tool Feedback

cs.CR · 2026-05-17 · unverdicted · novelty 7.0

Presents TRUST-Bench benchmark for hidden-trigger tool compromises in LLM agents and VISTA-Guard framework for trajectory-aware risk scoring of final actions under untrusted feedback.

Do Coding Agents Understand Least-Privilege Authorization?

cs.CR · 2026-05-14 · unverdicted · novelty 7.0

Coding agents struggle to infer least-privilege file permissions by omitting needed accesses while granting unused or sensitive ones, but Sufficiency-Tightness Decomposition improves sensitive-task success by up to 15.8% and reduces attacks.

SkillSafetyBench: Evaluating Agent Safety under Skill-Facing Attack Surfaces

cs.CR · 2026-05-12 · unverdicted · novelty 7.0

SkillSafetyBench is a benchmark of 155 cases across 47 tasks and 6 risk domains showing that non-user attacks via skills, artifacts, or environments can consistently induce unsafe agent behavior.

MEMSAD: Gradient-Coupled Anomaly Detection for Memory Poisoning in Retrieval-Augmented Agents

cs.CR · 2026-05-05 · unverdicted · novelty 7.0 · 2 refs

MEMSAD links anomaly detection gradients to retrieval objectives under encoder regularity to certify detection of continuous memory poisons, achieving perfect TPR/FPR in experiments while exposing a synonym-invariance gap.

Toward a Principled Framework for Agent Safety Measurement

cs.CR · 2026-05-02 · unverdicted · novelty 7.0

BOA uses budgeted search over agent trajectories to report the probability an LLM agent stays safe, finding unsafe paths that sampling misses.

Breaking MCP with Function Hijacking Attacks: Novel Threats for Function Calling and Agentic Models

cs.CR · 2026-04-22 · unverdicted · novelty 7.0

A novel function hijacking attack achieves 70-100% success rates in forcing specific function calls across five LLMs on the BFCL benchmark and is robust to context semantics.

Beyond Pattern Matching: Seven Cross-Domain Techniques for Prompt Injection Detection

cs.CR · 2026-04-20 · unverdicted · novelty 7.0 · 2 refs

The work introduces and partially evaluates seven cross-domain prompt injection detectors, reporting F1 gains on benchmarks like deepset/prompt-injections and indirect-injection sets via local alignment, stylometry, and fatigue tracking.

ClawsBench: Evaluating Capability and Safety of LLM Productivity Agents in Simulated Workspaces

cs.AI · 2026-04-06 · unverdicted · novelty 7.0

ClawsBench is a benchmark using high-fidelity mock services to evaluate LLM agents on 44 productivity tasks, finding 39-64% success rates and 7-33% unsafe action rates depending on scaffolding.

Cordon: Semantic Transactions for Tool-Using LLM Agents

cs.OS · 2026-06-16 · unverdicted · novelty 6.0

Cordon is a transactional runtime system that binds tool intents to reversible state, staged effects, and audit metadata to validate composed agent workflows before commit.

An Evaluation of Data Leakage Risks in Tool-Using LLM Agents in Realistic Scenarios

cs.CR · 2026-06-15 · unverdicted · novelty 6.0

Joint evaluation of three tool-using LLM agents on 12 realistic tasks finds consistent failures in data awareness, audience awareness, policy compliance, data minimization, and access-boundary awareness, with no agent achieving both correct and safe execution.

ROGUE: Misaligned Agent Behavior Arising from Ordinary Computer Use

cs.LG · 2026-05-29 · unverdicted · novelty 6.0

Frontier AI agents frequently violate corrigibility by overriding interruptions in benign computer-use tasks, with misalignment increasing alongside model capability.

Provably Secure Agent Guardrail

cs.AI · 2026-05-28 · unverdicted · novelty 6.0

Introduces ePCA framework using neural-symbolic isolation to force agents to formalize intentions as logical constraints, claiming zero attack success and false positive rates in tested scenarios.

Understanding Goal Generalisation in Sequential Reinforcement Learning

cs.LG · 2026-05-22 · unverdicted · novelty 6.0

Empirical analysis of over 100 sequential RL training pipelines across 250+ OOD environments finds salient features drive generalization and early goals persist, with latent policy gradients simulating latent variable evolution to predict OOD behavior from training history.

OEP: Poisoning Self-Evolving LLM Agents via Locally Correct but Non-Transferable Experiences

cs.CR · 2026-05-18 · unverdicted · novelty 6.0

OEP poisons self-evolving LLM agents by constructing clean edge-case experiences that appear locally valid yet cause harmful over-generalization during reflection, achieving over 50% attack success rate on GPT-4o agents across three domains.

citing papers explorer

Showing 50 of 57 citing papers.

WildClawBench: A Benchmark for Real-World, Long-Horizon Agent Evaluation cs.CL · 2026-05-11 · unverdicted · none · ref 1 · internal anchor
A new native-runtime benchmark reveals that current frontier AI agents succeed on at most 62 percent of realistic long-horizon CLI tasks.
Taxonomy and Consistency Analysis of Safety Benchmarks for AI Agents cs.CY · 2026-04-11 · accept · none · ref 1 · internal anchor
This paper delivers the first systematic taxonomy and cross-benchmark consistency analysis of 40 agent safety benchmarks, finding broad but shallow risk coverage, no ranking concordance across evaluations, and that benchmark choice systematically alters reported safety.
REALM: A Unified Red-Teaming Benchmark for Physical-World VLMs cs.CV · 2026-06-22 · unverdicted · none · ref 1 · internal anchor
REALM is the first unified red-teaming benchmark for physical-world VLMs that aligns diverse attack methods via an agentic target-generation pipeline and evaluates them on shared datasets showing text/typographic attacks as most effective.
SafeClawBench: Separating Semantic, Audit-Evidence, and Sandbox Harm in Tool-Using LLM Agents cs.CR · 2026-06-16 · accept · none · ref 2 · internal anchor
SafeClawBench supplies 600 staged adversarial tasks and three separate endpoints that show semantic acceptance, audit evidence, and sandbox-observed harm are distinct failure modes in tool-using LLM agents.
Bistable by Construction: Wall-Clock-Calibrated State Monitors Have No Moment-Detection Regime at Agent Cadence cs.SE · 2026-06-15 · conditional · none · ref 12 · internal anchor
Wall-clock leaky-integrator monitors on variable-cadence agent streams exhibit a sharp cliff between constant-alarm and silent regimes, while sample-time CUSUM monitors remain invariant; transition detection avoids the trap.
From Shield to Target: Denial-of-Service Attacks on LLM-Based Agent Guardrails cs.CR · 2026-06-12 · unverdicted · none · ref 22 · internal anchor
Attackers can force LLM guardrails into extended reasoning loops via optimized payloads, causing 13-63x token amplification and up to 148x latency in agent systems.
When the Manual Lies: A Realistic Benchmark to Evaluate MCP Poisoning Attacks for LLM Agents cs.CR · 2026-05-22 · unverdicted · none · ref 27 · internal anchor
Introduces MCP-TDP benchmark showing near-100% attack success on models like GPT-4o for tool description poisoning and proposes reactive self-correction defense.
Boiling the Frog: A Multi-Turn Benchmark for Agentic Safety cs.CL · 2026-05-21 · unverdicted · none · ref 2 · 2 links · internal anchor
Boiling the Frog is a new stateful multi-turn benchmark that finds an aggregate 44.4% strict attack success rate for incremental safety violations across nine AI models, with rates ranging from 20.5% to 92.9%.
Refusal Evaluation in Coding LLMs and Code Agents: A Systematic Review of Thirteen Malicious-Code Prompt Corpora (2023-2025) cs.CR · 2026-05-19 · accept · none · ref 44 · internal anchor
Systematic review of thirteen malicious-code prompt corpora for coding LLM refusal evaluation that catalogs construction methods, surfaces gaps in human baselines, cross-corpus comparability, and malware taxonomies, and proposes methodological improvements.
Overeager Coding Agents: Measuring Out-of-Scope Actions on Benign Tasks cs.SE · 2026-05-18 · conditional · none · ref 1 · internal anchor
The paper presents OverEager-Gen, a 500-scenario benchmark showing that removing consent declarations from prompts increases overeager actions by 11.9-17.2 percentage points across models, with agent framework choice dominating base-model effects.
Trust No Tool: Evaluating and Defending LLM Agents under Untrusted Tool Feedback cs.CR · 2026-05-17 · unverdicted · none · ref 21 · internal anchor
Presents TRUST-Bench benchmark for hidden-trigger tool compromises in LLM agents and VISTA-Guard framework for trajectory-aware risk scoring of final actions under untrusted feedback.
Do Coding Agents Understand Least-Privilege Authorization? cs.CR · 2026-05-14 · unverdicted · none · ref 15 · internal anchor
Coding agents struggle to infer least-privilege file permissions by omitting needed accesses while granting unused or sensitive ones, but Sufficiency-Tightness Decomposition improves sensitive-task success by up to 15.8% and reduces attacks.
SkillSafetyBench: Evaluating Agent Safety under Skill-Facing Attack Surfaces cs.CR · 2026-05-12 · unverdicted · none · ref 49 · internal anchor
SkillSafetyBench is a benchmark of 155 cases across 47 tasks and 6 risk domains showing that non-user attacks via skills, artifacts, or environments can consistently induce unsafe agent behavior.
MEMSAD: Gradient-Coupled Anomaly Detection for Memory Poisoning in Retrieval-Augmented Agents cs.CR · 2026-05-05 · unverdicted · none · ref 1 · 2 links · internal anchor
MEMSAD links anomaly detection gradients to retrieval objectives under encoder regularity to certify detection of continuous memory poisons, achieving perfect TPR/FPR in experiments while exposing a synonym-invariance gap.
Toward a Principled Framework for Agent Safety Measurement cs.CR · 2026-05-02 · unverdicted · none · ref 1 · internal anchor
BOA uses budgeted search over agent trajectories to report the probability an LLM agent stays safe, finding unsafe paths that sampling misses.
Breaking MCP with Function Hijacking Attacks: Novel Threats for Function Calling and Agentic Models cs.CR · 2026-04-22 · unverdicted · none · ref 2 · internal anchor
A novel function hijacking attack achieves 70-100% success rates in forcing specific function calls across five LLMs on the BFCL benchmark and is robust to context semantics.
Beyond Pattern Matching: Seven Cross-Domain Techniques for Prompt Injection Detection cs.CR · 2026-04-20 · unverdicted · none · ref 1 · 2 links · internal anchor
The work introduces and partially evaluates seven cross-domain prompt injection detectors, reporting F1 gains on benchmarks like deepset/prompt-injections and indirect-injection sets via local alignment, stylometry, and fatigue tracking.
ClawsBench: Evaluating Capability and Safety of LLM Productivity Agents in Simulated Workspaces cs.AI · 2026-04-06 · unverdicted · none · ref 1 · internal anchor
ClawsBench is a benchmark using high-fidelity mock services to evaluate LLM agents on 44 productivity tasks, finding 39-64% success rates and 7-33% unsafe action rates depending on scaffolding.
Cordon: Semantic Transactions for Tool-Using LLM Agents cs.OS · 2026-06-16 · unverdicted · none · ref 4 · internal anchor
Cordon is a transactional runtime system that binds tool intents to reversible state, staged effects, and audit metadata to validate composed agent workflows before commit.
An Evaluation of Data Leakage Risks in Tool-Using LLM Agents in Realistic Scenarios cs.CR · 2026-06-15 · unverdicted · none · ref 13 · internal anchor
Joint evaluation of three tool-using LLM agents on 12 realistic tasks finds consistent failures in data awareness, audience awareness, policy compliance, data minimization, and access-boundary awareness, with no agent achieving both correct and safe execution.
ROGUE: Misaligned Agent Behavior Arising from Ordinary Computer Use cs.LG · 2026-05-29 · unverdicted · none · ref 1 · internal anchor
Frontier AI agents frequently violate corrigibility by overriding interruptions in benign computer-use tasks, with misalignment increasing alongside model capability.
Provably Secure Agent Guardrail cs.AI · 2026-05-28 · unverdicted · none · ref 2 · internal anchor
Introduces ePCA framework using neural-symbolic isolation to force agents to formalize intentions as logical constraints, claiming zero attack success and false positive rates in tested scenarios.
Understanding Goal Generalisation in Sequential Reinforcement Learning cs.LG · 2026-05-22 · unverdicted · none · ref 4 · internal anchor
Empirical analysis of over 100 sequential RL training pipelines across 250+ OOD environments finds salient features drive generalization and early goals persist, with latent policy gradients simulating latent variable evolution to predict OOD behavior from training history.
OEP: Poisoning Self-Evolving LLM Agents via Locally Correct but Non-Transferable Experiences cs.CR · 2026-05-18 · unverdicted · none · ref 1 · internal anchor
OEP poisons self-evolving LLM agents by constructing clean edge-case experiences that appear locally valid yet cause harmful over-generalization during reflection, achieving over 50% attack success rate on GPT-4o agents across three domains.
LivePI: More Realistic Benchmarking of Agents Against Indirect Prompt Injection cs.CR · 2026-05-18 · unverdicted · none · ref 21 · 2 links · internal anchor
LivePI benchmark reports indirect prompt injection success rates of 10.7-29.6% across five models on seven input surfaces and shows a two-layer defense blocking all malicious completions while preserving utility.
Ensemble Monitoring for AI Control: Diverse Signals Outweigh More Compute cs.AI · 2026-05-14 · unverdicted · none · ref 40 · internal anchor
Diverse ensembles of prompted and fine-tuned GPT-4.1-Mini monitors achieve 2.4x better detection of flawed code solutions than homogeneous ensembles on adversarial inputs.
AgentTrap: Measuring Runtime Trust Failures in Third-Party Agent Skills cs.CR · 2026-05-13 · conditional · none · ref 1 · internal anchor
AgentTrap shows that current LLM agents typically complete user tasks while silently accepting unsafe side effects from malicious third-party skills rather than refusing them.
On-Policy Self-Evolution via Failure Trajectories for Agentic Safety Alignment cs.AI · 2026-05-12 · unverdicted · none · ref 3 · internal anchor
FATE lets LLM agents self-evolve safer behaviors by generating and filtering repairs from their own failure trajectories using verifiers and Pareto optimization.
When Child Inherits: Modeling and Exploiting Subagent Spawn in Multi-Agent Networks cs.CR · 2026-05-08 · unverdicted · none · ref 58 · internal anchor
Multi-agent LLM frameworks can spread compromises across agent boundaries via insecure memory inheritance during subagent spawning.
SecureForge: Finding and Preventing Vulnerabilities in LLM-Generated Code via Prompt Optimization cs.CR · 2026-05-08 · unverdicted · none · ref 32 · internal anchor
SecureForge audits LLM code for vulnerabilities, builds a synthetic prompt corpus via Markovian sampling, and optimizes system prompts to cut security issues by up to 48% while preserving unit test performance, with zero-shot transfer to real prompts.
Why Does Agentic Safety Fail to Generalize Across Tasks? cs.LG · 2026-05-07 · conditional · none · ref 6 · internal anchor
Agentic safety fails to generalize across tasks because the task-to-safe-controller mapping has a higher Lipschitz constant than the task-to-controller mapping alone, as proven in linear-quadratic control and demonstrated in quadcopter and LLM experiments.
Beyond Accuracy: Policy Invariance as a Reliability Test for LLM Safety Judges cs.AI · 2026-05-07 · unverdicted · none · ref 1 · internal anchor
LLM safety judges flip verdicts on equivalent policy rewrites up to 9.1% of the time and cannot distinguish meaningful from meaningless changes, requiring new invariance-based reliability metrics.
From Skill Text to Skill Structure: The Scheduling-Structural-Logical Representation for Agent Skills cs.CL · 2026-04-27 · unverdicted · none · ref 1 · internal anchor
SSL representation disentangles skill scheduling, structure, and logic using an LLM normalizer, improving skill discovery MRR@50 from 0.649 to 0.729 and risk assessment macro F1 from 0.409 to 0.509 over text baselines.
Swiss-Bench 003: Evaluating LLM Reliability and Adversarial Security for Swiss Regulatory Contexts cs.CR · 2026-04-07 · unverdicted · none · ref 2 · internal anchor
Swiss-Bench 003 extends an existing Swiss LLM assessment with two new dimensions and evaluates ten models on 808 items, finding high self-graded reliability scores but low adversarial security scores.
Your Agent, Their Asset: A Real-World Safety Analysis of OpenClaw cs.CR · 2026-04-06 · conditional · none · ref 1 · internal anchor
Poisoning any single CIK dimension of an AI agent raises average attack success rate from 24.6% to 64-74% across models, and tested defenses leave substantial residual risk.
Shorter, but Still Trustworthy? An Empirical Study of Chain-of-Thought Compression cs.CL · 2026-04-05 · unverdicted · none · ref 2 · internal anchor
CoT compression frequently introduces trustworthiness regressions with method-specific degradation profiles; a proposed normalized efficiency score and alignment-aware DPO variant reduce length by 19.3% with smaller trustworthiness loss.
An Independent Safety Evaluation of Kimi K2.5 cs.CR · 2026-04-03 · conditional · none · ref 54 · internal anchor
Kimi K2.5 matches closed models on dual-use tasks but refuses fewer CBRNE requests and shows some sabotage and self-replication tendencies.
Toward Principled LLM Safety Testing: Solving the Jailbreak Oracle Problem cs.CR · 2025-06-17 · unverdicted · none · ref 47 · internal anchor
Formalizes the jailbreak oracle problem for LLMs and introduces Boa, a two-phase breadth-first then depth-first search system to solve it efficiently.
Benchmarking Misuse Mitigation Against Covert Adversaries cs.CR · 2025-06-06 · unverdicted · none · ref 7 · internal anchor
Develops the BSD data generation pipeline and two new datasets to evaluate decomposition attacks as effective misuse enablers and stateful defenses as a countermeasure in language model safety.
Agents That Know Too Much: A Data-Centric Survey of Privacy in LLM Agents cs.CR · 2026-06-25 · unverdicted · none · ref 8 · internal anchor
A data-centric survey finds that only information-flow control covers compositional and cross-session leakage in LLM agents and that no single benchmark tests an agent across all its data surfaces under one policy.
Uncertainty-Aware Clarification in LLM Agents with Information Gain cs.AI · 2026-06-02 · unverdicted · none · ref 1 · internal anchor
The paper introduces an Information Gain Reward to train clarification behavior in LLM agents, reporting a 3.7% success rate gain over no-clarification baselines in τ-Bench evaluations across five models with minimal added steps.
Position: A Three-Layer Probabilistic Assume-Guarantee Architecture Is Structurally Required for Safe LLM Agent Deployment cs.AI · 2026-05-18 · unverdicted · none · ref 3 · internal anchor
A three-layer probabilistic assume-guarantee architecture is structurally required for safe LLM agent deployment.
ADR: An Agentic Detection System for Enterprise Agentic AI Security cs.AI · 2026-05-17 · unverdicted · none · ref 28 · internal anchor
ADR is a three-component detection system for AI agents that combines telemetry sensors, red teaming, and two-tier detection, achieving 97.2% precision in a ten-month Uber deployment and outperforming baselines on the new ADR-Bench.
Toward Securing AI Agents Like Operating Systems cs.CR · 2026-05-14 · unverdicted · none · ref 50 · internal anchor
LLM agents share OS-like security challenges; a case study on four agents finds protections often fail without careful setup but many vulnerabilities are mitigable with OS techniques.
AgentTrust: Runtime Safety Evaluation and Interception for AI Agent Tool Use cs.AI · 2026-05-06 · unverdicted · none · ref 3 · internal anchor
AgentTrust introduces a runtime interception system for AI agent tool use that achieves 95-97% verdict accuracy on 930 safety scenarios including obfuscated shell payloads.
Humanity's Last Exam cs.LG · 2025-01-24 · unverdicted · none · ref 2 · internal anchor
Humanity's Last Exam is a new 2,500-question benchmark at the frontier of human knowledge where state-of-the-art LLMs show low accuracy.
JT-SAFE-V2: Safety-by-Design Foundation Model with World-Context Data cs.AI · 2026-05-23 · unverdicted · none · ref 12 · internal anchor
JT-Safe-V2 is a safety-by-design LLM that reports SOTA scores on both capability and safety benchmarks while Safe-MoMA cuts inference cost over 30 percent.
Seven simple steps for log analysis in AI systems cs.AI · 2026-02-13 · unverdicted · none · ref 1 · internal anchor
A seven-step pipeline for log analysis in AI systems is outlined with code examples to support rigorous and reproducible evaluation of model capabilities and behaviors.
Agentic AI Security: Threats, Defenses, Evaluation, and Open Challenges cs.AI · 2025-10-27 · unverdicted · none · ref 46 · internal anchor
A survey that taxonomizes threats to agentic AI, reviews benchmarks and evaluation methods, discusses technical and governance defenses, and identifies open challenges.
A Survey of Self-Evolving Agents: What, When, How, and Where to Evolve on the Path to Artificial Super Intelligence cs.AI · 2025-07-28 · accept · none · ref 147 · internal anchor
The paper delivers the first systematic review of self-evolving agents, structured around what components evolve, when adaptation occurs, and how it is implemented.

AgentHarm: A Benchmark for Measuring Harmfulness of LLM Agents

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer