CyberGym: Evaluating AI agents’ real-world cybersecurity capabilities at scale
8 Pith papers cite this work.
Citing papers by year: 2026 (8)
citing papers explorer
- No More, No Less: Task Alignment in Terminal Agents
  The TAB benchmark reveals that frontier terminal agents achieve high task completion but low selective alignment with relevant environmental cues over distractors, and prompt-injection defenses block both.
- CrackMeBench: Binary Reverse Engineering for Agents
  CrackMeBench introduces 20 deterministic binary validation tasks and reports GPT-5.5 solving 11/12 generated ones at pass@3 while Claude and Kimi lag, especially on harder tasks.
- Beyond Collection: Measuring the Detection Efficacy of Modern Security Logging Standards
  The SETC framework provides the first systematic comparison of the CIM, OCSF, and ECS logging standards by running 50 RCE exploits and measuring how well each captures attack indicators.
- AnyPoC: Universal Proof-of-Concept Test Generation for Scalable LLM-Based Bug Detection
  AnyPoC introduces a multi-agent system for generating and validating PoC tests from LLM bug reports, producing 1.3x more valid PoCs, rejecting 9.8x more false positives, and discovering 122 new bugs across 12 major projects.
- Program Analysis Guided LLM Agent for Proof-of-Concept Generation
  PAGENT integrates static and dynamic program analysis guidance with an LLM agent to improve automated proof-of-concept generation success by 132% over prior agentic methods.
- GLM-5: from Vibe Coding to Agentic Engineering
  GLM-5 is a foundation model that claims state-of-the-art results on coding benchmarks and superior performance on end-to-end software engineering tasks via new asynchronous RL methods and cost-saving DSA.
- Kimi K2.5: Visual Agentic Intelligence
  Kimi K2.5 combines joint text-vision training with an Agent Swarm parallel orchestration framework to reach claimed state-of-the-art results on coding, vision, reasoning, and agent tasks while cutting latency up to 4.5 times.
- Agentic AI and the Industrialization of Cyber Offense: Forecast, Consequences, and Defensive Priorities for Enterprises and the Mittelstand
  Agentic AI lowers the cost and increases the speed of cyber attacks, requiring immediate improvements in identity management, phishing-resistant authentication, patching, and agent governance for large enterprises and the Mittelstand.