PentestGPT: Evaluating and harnessing large language models for automated penetration testing
1 Pith paper cites this work. Polarity classification is still indexing.
Fields: cs.CR (1)
Years: 2026 (1)
Verdicts: UNVERDICTED (1)
Representative citing papers
Systematic Capability Benchmarking of Frontier Large Language Models for Offensive Cyber Tasks
Claude 4.5 Opus reaches 59% solve rate on offensive cyber CTF tasks, with a Kali Linux environment adding 9.5 percentage points over Ubuntu while prompt engineering often hurts performance in equipped setups.