CyberGym: Evaluating AI agents’ real-world cybersecurity capabilities at scale

Zhun Wang, Tianneng Shi, Jingxuan He, Matthew Cai, Jialin Zhang, Dawn Song · 2025 · arXiv 2506.02548

8 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

dataset: 2 · background: 1

citation-polarity summary

use dataset: 2 · support: 1

representative citing papers

No More, No Less: Task Alignment in Terminal Agents

cs.LG · 2026-05-12 · unverdicted · novelty 7.0

The TAB benchmark reveals that frontier terminal agents achieve high task completion but low selective alignment with relevant environmental cues over distractors, and prompt-injection defenses block both.

CrackMeBench: Binary Reverse Engineering for Agents

cs.SE · 2026-05-11 · accept · novelty 7.0

CrackMeBench introduces 20 deterministic binary validation tasks and reports GPT-5.5 solving 11/12 generated ones at pass@3 while Claude and Kimi lag, especially on harder tasks.

Beyond Collection: Measuring the Detection Efficacy of Modern Security Logging Standards

cs.CR · 2026-05-07 · unverdicted · novelty 7.0

The SETC framework provides the first systematic comparison of the CIM, OCSF, and ECS logging standards by running 50 RCE exploits and measuring how well each standard captures attack indicators.

AnyPoC: Universal Proof-of-Concept Test Generation for Scalable LLM-Based Bug Detection

cs.SE · 2026-04-13 · conditional · novelty 6.0

AnyPoC introduces a multi-agent system for generating and validating PoC tests from LLM bug reports, producing 1.3x more valid PoCs, rejecting 9.8x more false positives, and discovering 122 new bugs across 12 major projects.

Program Analysis Guided LLM Agent for Proof-of-Concept Generation

cs.SE · 2026-04-08 · unverdicted · novelty 6.0

PAGENT integrates static and dynamic program analysis guidance with an LLM agent to improve automated proof-of-concept generation success by 132% over prior agentic methods.

GLM-5: from Vibe Coding to Agentic Engineering

cs.LG · 2026-02-17 · unverdicted · novelty 5.0

GLM-5 is a foundation model that claims state-of-the-art results on coding benchmarks and superior performance on end-to-end software engineering tasks via new asynchronous RL methods and cost-saving DSA.

Kimi K2.5: Visual Agentic Intelligence

cs.CL · 2026-02-02 · unverdicted · novelty 5.0

Kimi K2.5 combines joint text-vision training with an Agent Swarm parallel orchestration framework to reach claimed state-of-the-art results on coding, vision, reasoning, and agent tasks while cutting latency up to 4.5 times.

Agentic AI and the Industrialization of Cyber Offense: Forecast, Consequences, and Defensive Priorities for Enterprises and the Mittelstand

cs.CR · 2026-05-06 · unverdicted · novelty 4.0

Agentic AI lowers the cost and raises the speed of cyber attacks, requiring immediate improvements in identity management, phishing-resistant authentication, patching, and agent governance for large enterprises and the Mittelstand.

citing papers explorer

No More, No Less: Task Alignment in Terminal Agents · cs.LG · 2026-05-12 · unverdicted · none · ref 29
CrackMeBench: Binary Reverse Engineering for Agents · cs.SE · 2026-05-11 · accept · none · ref 16
Beyond Collection: Measuring the Detection Efficacy of Modern Security Logging Standards · cs.CR · 2026-05-07 · unverdicted · none · ref 34
AnyPoC: Universal Proof-of-Concept Test Generation for Scalable LLM-Based Bug Detection · cs.SE · 2026-04-13 · conditional · none · ref 59
Program Analysis Guided LLM Agent for Proof-of-Concept Generation · cs.SE · 2026-04-08 · unverdicted · none · ref 40
GLM-5: from Vibe Coding to Agentic Engineering · cs.LG · 2026-02-17 · unverdicted · none · ref 48
Kimi K2.5: Visual Agentic Intelligence · cs.CL · 2026-02-02 · unverdicted · none · ref 69
Agentic AI and the Industrialization of Cyber Offense: Forecast, Consequences, and Defensive Priorities for Enterprises and the Mittelstand · cs.CR · 2026-05-06 · unverdicted · none · ref 15
