ProgramBench: Can Language Models Rebuild Programs From Scratch?

2026 · cs.SE · arXiv 2605.03546

2 Pith papers cite this work. Polarity classification is still indexing.

abstract

Turning ideas into full software projects from scratch has become a popular use case for language models. Agents are being deployed to seed, maintain, and grow codebases over extended periods with minimal human oversight. Such settings require models to make high-level software architecture decisions. However, existing benchmarks measure focused, limited tasks such as fixing a single bug or developing a single, specified feature. We therefore introduce ProgramBench to measure the ability of software engineering agents to develop software holistically. In ProgramBench, given only a program and its documentation, agents must architect and implement a codebase that matches the reference executable's behavior. End-to-end behavioral tests are generated via agent-driven fuzzing, enabling evaluation without prescribing implementation structure. Our 200 tasks range from compact CLI tools to widely used software such as FFmpeg, SQLite, and the PHP interpreter. We evaluate 9 LMs and find that none fully resolve any task, with the best model passing 95% of tests on only 3% of tasks. Models favor monolithic, single-file implementations that diverge sharply from human-written code.
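
The idea of behavioral testing against a reference executable can be illustrated with a minimal sketch. This is not the ProgramBench harness, whose details are not given here; the function names `run` and `pass_rate`, the two stand-in commands, and the choice to compare only exit code and standard output are all assumptions made for illustration.

```python
import subprocess
import sys

def run(cmd, stdin_text):
    """Run a command on one input and return its observable behavior."""
    proc = subprocess.run(
        cmd, input=stdin_text, capture_output=True, text=True, timeout=10
    )
    # Assumption: behavior means (exit code, stdout); a real harness may compare more.
    return proc.returncode, proc.stdout

def pass_rate(reference_cmd, candidate_cmd, inputs):
    """Fraction of inputs on which the candidate behaves like the reference."""
    matches = sum(run(reference_cmd, x) == run(candidate_cmd, x) for x in inputs)
    return matches / len(inputs)

# Hypothetical stand-ins for a reference executable and a rebuilt candidate.
REFERENCE = [sys.executable, "-c",
             "import sys; print(sys.stdin.read().upper(), end='')"]
CANDIDATE = [sys.executable, "-c",
             "import sys; print(sys.stdin.read().swapcase(), end='')"]

# Hand-picked inputs; the abstract describes generating tests via agent-driven fuzzing.
INPUTS = ["abc", "hello world", "Mixed", "123"]

rate = pass_rate(REFERENCE, CANDIDATE, INPUTS)
print(f"candidate matches reference on {rate:.0%} of inputs")
```

Because the comparison only looks at what each program emits, the candidate is free to use any internal structure, which is the property the abstract highlights.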

representative citing papers

SEC-bench Pro: Can Language Models Solve Long-Horizon Software Security Tasks?

cs.CR · 2026-05-26 · unverdicted · novelty 7.0

The SEC-bench Pro benchmark, built from 183 real vulnerabilities, shows that frontier LLM coding agents achieve at most 38.8% success on SpiderMonkey and 32% on V8.

Benchmarks are Not Enough: RAMP for Runtime Assessing of Agentic Models in Production Systems

cs.SE · 2026-05-26 · unverdicted · novelty 6.0

RAMP evaluates 15 models on production-like serial workflows and reports completion rates collapsing from 100% to 20%, with no model finishing the full pipeline and costs varying by three orders of magnitude.

citing papers explorer

Showing 2 of 2 citing papers.

SEC-bench Pro: Can Language Models Solve Long-Horizon Software Security Tasks?

cs.CR · 2026-05-26 · unverdicted · none · ref 20 · internal anchor

The SEC-bench Pro benchmark, built from 183 real vulnerabilities, shows that frontier LLM coding agents achieve at most 38.8% success on SpiderMonkey and 32% on V8.

Benchmarks are Not Enough: RAMP for Runtime Assessing of Agentic Models in Production Systems

cs.SE · 2026-05-26 · unverdicted · none · ref 30 · internal anchor

RAMP evaluates 15 models on production-like serial workflows and reports completion rates collapsing from 100% to 20%, with no model finishing the full pipeline and costs varying by three orders of magnitude.
