and Sekar, V
5 Pith papers cite this work.
fields
cs.CR 5
citing papers explorer
- ExploitBench: A Capability Ladder Benchmark for LLM Cybersecurity Agents
  ExploitBench decomposes LLM exploitation into 16 oracle-verified capability flags and finds public frontier models trigger crashes but rarely reach arbitrary code execution on 41 V8 bugs.
- Beyond Collection: Measuring the Detection Efficacy of Modern Security Logging Standards
  SETC framework provides the first systematic comparison of CIM, OCSF, and ECS logging standards by running 50 RCE exploits and measuring how well each captures attack indicators.
- PocketAgents: A Manifest-Driven Library of Autonomous Defense Agents
  PocketAgents introduces a manifest-driven library for LLM-based autonomous defense agents, evaluated in 18 closed-loop trials against a DarkSide-inspired attack where 13 trials produced validated blocking actions.
- The Procedural Semantics Gap in Structured CTI: A Measurement-Driven STIX Analysis for APT Emulation
  Structured CTI standards like ATT&CK describe adversary actions but lack the ordering, preconditions, and environmental details needed for direct multi-stage emulation, and a translation method can bridge this gap when assumptions are recorded.
- Autonomous Adversary: Red-Teaming in the age of LLM
  Expert-defined action plans for LLM agents achieve higher task completion in lateral-movement scenarios than fully autonomous or self-scaffolded modes, but failures remain common due to brittle commands and state handling.