CyberGym: Evaluating AI Agents’ Real-World Cybersecurity Capabilities at Scale, 2025
1 Pith paper cites this work. Polarity classification is still indexing.
Fields: cs.CR (1)
Years: 2026 (1)
Verdicts: CONDITIONAL (1)
Representative citing papers
ExploitBench: A Capability Ladder Benchmark for LLM Cybersecurity Agents
ExploitBench decomposes LLM exploitation into 16 oracle-verified capability flags and finds that public frontier models trigger crashes but rarely reach arbitrary code execution on 41 V8 bugs.