LLM agents reach only 35% average checkpoint completion on ten realistic CTF challenges in a new open benchmark with automated partial-credit scoring.
Title resolution pending
3 Pith papers cite this work. Polarity classification is still indexing.
years
2026 3verdicts
UNVERDICTED 3representative citing papers
ContraFix couples differential runtime evidence from execution variants with reusable repair skills to achieve 84.0% resolution on SEC-Bench and 73.8% on PatchEval using GPT-5-mini, outperforming baselines at lower cost.
AgentStop uses execution signals to early-terminate failing local LLM agent trajectories, cutting energy use 15-20% with minimal utility loss.
citing papers explorer
-
Do Agents Dream of Root Shells? Partial-Credit Evaluation of LLM Agents in Capture the Flag Challenges
LLM agents reach only 35% average checkpoint completion on ten realistic CTF challenges in a new open benchmark with automated partial-credit scoring.
-
ContraFix: Agentic Vulnerability Repair via Differential Runtime Evidence and Skill Reuse
ContraFix couples differential runtime evidence from execution variants with reusable repair skills to achieve 84.0% resolution on SEC-Bench and 73.8% on PatchEval using GPT-5-mini, outperforming baselines at lower cost.
-
AgentStop: Terminating Local AI Agents Early to Save Energy in Consumer Devices
AgentStop uses execution signals to early-terminate failing local LLM agent trajectories, cutting energy use 15-20% with minimal utility loss.