arXiv preprint arXiv:2505.17107

Minghao Shao, Haoran Xi, Nanda Rani, Meet Udeshi, Venkata Sai Charan Putrevu, Kimberly Milner, Brendan Dolan-Gavitt, Sandeep Kumar Shukla, Prashanth Krishnamurthy, Farshad Khorrami, Ramesh Karri, Muhammad Shafique · 2025 · arXiv 2505.17107

7 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background 1

citation-polarity summary

support 1

representative citing papers

CrackMeBench: Binary Reverse Engineering for Agents

cs.SE · 2026-05-11 · accept · novelty 7.0

CrackMeBench introduces 20 deterministic binary validation tasks and reports GPT-5.5 solving 11/12 generated ones at pass@3 while Claude and Kimi lag, especially on harder tasks.

Dynamic Cyber Ranges

cs.CR · 2026-04-27 · unverdicted · novelty 7.0

Dynamic Cyber Ranges with LLM defender agents reduce attacker success to 0-55% and preserve evaluation headroom as models advance by using comparable capabilities on both sides.

uGen: An Agentic Framework for Generating Microarchitectural Attack PoCs

cs.CR · 2026-05-15 · unverdicted · novelty 6.0

uGen is the first retrieval-augmented multi-agent LLM framework for generating functionally correct microarchitectural attack PoCs, reporting up to 100% success on Spectre-v1 and 80% on Prime+Probe at low cost.

Veritas: A Semantically Grounded Agentic Framework for Memory Corruption Vulnerability Detection in Binaries

cs.SE · 2026-05-14 · unverdicted · novelty 6.0

Veritas detects memory corruption vulnerabilities in stripped binaries by combining static value-flow slicing, dual-view LLM reasoning, and multi-agent runtime validation, reporting 90% recall, zero false positives on 623 exhaustive cases, and discovery of a real Apple CVE.

CTFusion: A CTF-based Benchmark for LLM Agent Evaluation

cs.LG · 2026-05-12 · unverdicted · novelty 6.0

CTFusion is a live-CTF streaming benchmark that prevents data contamination by forwarding only the first correct flag per challenge under a shared team account.

Systematic Capability Benchmarking of Frontier Large Language Models for Offensive Cyber Tasks

cs.CR · 2026-04-18 · unverdicted · novelty 5.0

Claude 4.5 Opus reaches 59% solve rate on offensive cyber CTF tasks, with a Kali Linux environment adding 9.5 percentage points over Ubuntu while prompt engineering often hurts performance in equipped setups.

RAVEN: Retrieval-Augmented Vulnerability Exploration Network for Memory Corruption Analysis in User Code and Binary Programs

cs.CR · 2026-04-20 · unverdicted · novelty 4.0

RAVEN combines LLM agents and RAG to generate Project Zero-style vulnerability reports, achieving 54.21% average quality on 105 NIST-SARD samples across 15 CWE types.

citing papers explorer

Showing 7 of 7 citing papers.

CrackMeBench: Binary Reverse Engineering for Agents · cs.SE · 2026-05-11 · accept · none · ref 20
Dynamic Cyber Ranges · cs.CR · 2026-04-27 · unverdicted · none · ref 57
uGen: An Agentic Framework for Generating Microarchitectural Attack PoCs · cs.CR · 2026-05-15 · unverdicted · none · ref 43
Veritas: A Semantically Grounded Agentic Framework for Memory Corruption Vulnerability Detection in Binaries · cs.SE · 2026-05-14 · unverdicted · none · ref 44
CTFusion: A CTF-based Benchmark for LLM Agent Evaluation · cs.LG · 2026-05-12 · unverdicted · none · ref 22
Systematic Capability Benchmarking of Frontier Large Language Models for Offensive Cyber Tasks · cs.CR · 2026-04-18 · unverdicted · none · ref 8
RAVEN: Retrieval-Augmented Vulnerability Exploration Network for Memory Corruption Analysis in User Code and Binary Programs · cs.CR · 2026-04-20 · unverdicted · none · ref 22
