CrackMeBench introduces 20 deterministic binary validation tasks and reports GPT-5.5 solving 11/12 generated ones at pass@3 while Claude and Kimi lag, especially on harder tasks.
arXiv preprint arXiv:2505.17107 , url=
7 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
years
2026 7roles
background 1polarities
support 1representative citing papers
Dynamic Cyber Ranges with LLM defender agents reduce attacker success to 0-55% and preserve evaluation headroom as models advance by using comparable capabilities on both sides.
uGen is the first retrieval-augmented multi-agent LLM framework for generating functionally correct microarchitectural attack PoCs, reporting up to 100% success on Spectre-v1 and 80% on Prime+Probe at low cost.
Veritas detects memory corruption vulnerabilities in stripped binaries by combining static value-flow slicing, dual-view LLM reasoning, and multi-agent runtime validation, reporting 90% recall, zero false positives on 623 exhaustive cases, and discovery of a real Apple CVE.
CTFusion is a live-CTF streaming benchmark that prevents data contamination by forwarding only the first correct flag per challenge under a shared team account.
Claude 4.5 Opus reaches 59% solve rate on offensive cyber CTF tasks, with a Kali Linux environment adding 9.5 percentage points over Ubuntu while prompt engineering often hurts performance in equipped setups.
RAVEN combines LLM agents and RAG to generate Project Zero-style vulnerability reports, achieving 54.21% average quality on 105 NIST-SARD samples across 15 CWE types.
citing papers explorer
-
CrackMeBench: Binary Reverse Engineering for Agents
CrackMeBench introduces 20 deterministic binary validation tasks and reports GPT-5.5 solving 11/12 generated ones at pass@3 while Claude and Kimi lag, especially on harder tasks.
-
Dynamic Cyber Ranges
Dynamic Cyber Ranges with LLM defender agents reduce attacker success to 0-55% and preserve evaluation headroom as models advance by using comparable capabilities on both sides.
-
uGen: An Agentic Framework for Generating Microarchitectural Attack PoCs
uGen is the first retrieval-augmented multi-agent LLM framework for generating functionally correct microarchitectural attack PoCs, reporting up to 100% success on Spectre-v1 and 80% on Prime+Probe at low cost.
-
Veritas: A Semantically Grounded Agentic Framework for Memory Corruption Vulnerability Detection in Binaries
Veritas detects memory corruption vulnerabilities in stripped binaries by combining static value-flow slicing, dual-view LLM reasoning, and multi-agent runtime validation, reporting 90% recall, zero false positives on 623 exhaustive cases, and discovery of a real Apple CVE.
-
CTFusion: A CTF-based Benchmark for LLM Agent Evaluation
CTFusion is a live-CTF streaming benchmark that prevents data contamination by forwarding only the first correct flag per challenge under a shared team account.
-
Systematic Capability Benchmarking of Frontier Large Language Models for Offensive Cyber Tasks
Claude 4.5 Opus reaches 59% solve rate on offensive cyber CTF tasks, with a Kali Linux environment adding 9.5 percentage points over Ubuntu while prompt engineering often hurts performance in equipped setups.
-
RAVEN: Retrieval-Augmented Vulnerability Exploration Network for Memory Corruption Analysis in User Code and Binary Programs
RAVEN combines LLM agents and RAG to generate Project Zero-style vulnerability reports, achieving 54.21% average quality on 105 NIST-SARD samples across 15 CWE types.