pith. sign in

Cy- berGym: Evaluating AI agents’ real-world cybersecurity capabilities at scale

8 Pith papers cite this work. Polarity classification is still indexing.

8 Pith papers citing it

citation-role summary

dataset 2 background 1

citation-polarity summary

years

2026 8

representative citing papers

No More, No Less: Task Alignment in Terminal Agents

cs.LG · 2026-05-12 · unverdicted · novelty 7.0

The TAB benchmark reveals that frontier terminal agents achieve high task completion but low selective alignment with relevant environmental cues over distractors, and prompt-injection defenses block both.

CrackMeBench: Binary Reverse Engineering for Agents

cs.SE · 2026-05-11 · accept · novelty 7.0

CrackMeBench introduces 20 deterministic binary validation tasks and reports GPT-5.5 solving 11/12 generated ones at pass@3 while Claude and Kimi lag, especially on harder tasks.

GLM-5: from Vibe Coding to Agentic Engineering

cs.LG · 2026-02-17 · unverdicted · novelty 5.0

GLM-5 is a foundation model that claims state-of-the-art results on coding benchmarks and superior performance on end-to-end software engineering tasks via new asynchronous RL methods and cost-saving DSA.

Kimi K2.5: Visual Agentic Intelligence

cs.CL · 2026-02-02 · unverdicted · novelty 5.0

Kimi K2.5 combines joint text-vision training with an Agent Swarm parallel orchestration framework to reach claimed state-of-the-art results on coding, vision, reasoning, and agent tasks while cutting latency up to 4.5 times.

citing papers explorer

Showing 8 of 8 citing papers.