4 Pith papers cite this work.
Years: 2026 (4)
Citing papers
- Do Agents Dream of Root Shells? Partial-Credit Evaluation of LLM Agents in Capture the Flag Challenges
  LLM agents reach only 35% average checkpoint completion on ten realistic CTF challenges in a new open benchmark with automated partial-credit scoring.
- Agentic Much? Adoption of Coding Agents on GitHub
  Coding agents reached 22-29% adoption in GitHub projects within months of release, with agent-assisted commits being larger and focused on features and bug fixes.
- Bridging Generation and Training: A Systematic Review of Quality Issues in LLMs for Code
  A review of 114 studies creates taxonomies for code and data quality issues, formalizes 18 propagation mechanisms from training data defects to LLM-generated code defects, and synthesizes detection and mitigation techniques.
- RealBench: A Repo-Level Code Generation Benchmark Aligned with Real-World Software Development Practices
  RealBench is a new repo-level code generation benchmark that adds UML diagrams to natural language specs, showing that LLMs struggle more with full repositories, produce modules with errors, and perform best with whole-repo generation on small projects but module-by-module generation on complex ones.