ProgramBench: Can Language Models Rebuild Programs From Scratch?

2026 · cs.SE · arXiv 2605.03546

6 Pith papers cite this work. Polarity classification is still indexing.

abstract

Turning ideas into full software projects from scratch has become a popular use case for language models. Agents are being deployed to seed, maintain, and grow codebases over extended periods with minimal human oversight. Such settings require models to make high-level software architecture decisions. However, existing benchmarks measure focused, limited tasks such as fixing a single bug or developing a single, specified feature. We therefore introduce ProgramBench to measure the ability of software engineering agents to develop software holistically. In ProgramBench, given only a program and its documentation, agents must architect and implement a codebase that matches the reference executable's behavior. End-to-end behavioral tests are generated via agent-driven fuzzing, enabling evaluation without prescribing implementation structure. Our 200 tasks range from compact CLI tools to widely used software such as FFmpeg, SQLite, and the PHP interpreter. We evaluate 9 LMs and find that none fully resolve any task, with the best model passing 95% of tests on only 3% of tasks. Models favor monolithic, single-file implementations that diverge sharply from human-written code.
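The headline numbers in the abstract combine two levels of aggregation: a pass rate over the tests within each task, and a fraction over tasks. The sketch below shows one way such figures could be computed. It is not the paper's evaluation code. The function names and the toy data are hypothetical, and reading "passing 95% of tests" as "at least 95%" is an assumption.

```python
from typing import Dict, List


def pass_rate(results: List[bool]) -> float:
    """Fraction of behavioral tests passed for one task (0.0 if there are no tests)."""
    return sum(results) / len(results) if results else 0.0


def fraction_of_tasks_at_or_above(per_task: Dict[str, List[bool]], threshold: float) -> float:
    """Fraction of tasks whose per-task pass rate is at least `threshold`."""
    if not per_task:
        return 0.0
    hits = sum(1 for results in per_task.values() if pass_rate(results) >= threshold)
    return hits / len(per_task)


def fully_resolved(per_task: Dict[str, List[bool]]) -> List[str]:
    """Names of tasks where every test passed (assumed meaning of 'fully resolve')."""
    return [name for name, results in per_task.items() if results and all(results)]


# Hypothetical toy data: task name -> outcome of each behavioral test.
toy_results = {
    "tool_a": [True] * 19 + [False],      # 19 of 20 tests pass
    "tool_b": [True] * 5 + [False] * 5,   # 5 of 10 tests pass
    "tool_c": [True] * 4,                 # all 4 tests pass
}

share_at_95 = fraction_of_tasks_at_or_above(toy_results, 0.95)
resolved = fully_resolved(toy_results)
```

On this toy data two of the three tasks clear the 95% threshold, but only one is fully resolved, which is the distinction the abstract draws between near-complete and complete behavioral matches.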

representative citing papers

MirrorCode: AI can rebuild entire programs from behavior alone

cs.AI · 2026-06-29 · unverdicted · novelty 7.0

MirrorCode benchmark shows current AI models achieving up to 56% success reimplementing 25 diverse full programs from behavior alone, including a 16,000-line bioinformatics toolkit.

SEC-bench Pro: Can Language Models Solve Long-Horizon Software Security Tasks?

cs.CR · 2026-05-26 · unverdicted · novelty 7.0

SEC-bench Pro benchmark with 183 real vulnerabilities shows frontier LLM coding agents achieve at most 38.8% success on SpiderMonkey and 32% on V8.

OSWorld2.0: Benchmarking Computer Use Agents on Long-Horizon Real-World Tasks

cs.AI · 2026-06-28 · unverdicted · novelty 6.0

OSWorld 2.0 is a benchmark of 108 realistic long-horizon computer-use tasks where current agents achieve only 20.6% binary completion, struggling with state inference and constraint tracking.

An Enigma of Artificial Reason: Investigating the Production-Evaluation Gap in Large Reasoning Models

cs.AI · 2026-05-31 · conditional · novelty 6.0

LRMs show a large production-evaluation gap on the VAIR dataset with valid answers but invalid reasoning, driven by answer confirmation bias as evidenced by CoT analysis, linear probes, and causal patching.

Benchmarks are Not Enough: RAMP for Runtime Assessing of Agentic Models in Production Systems

cs.SE · 2026-05-26 · unverdicted · novelty 6.0

RAMP evaluates 15 models on production-like serial workflows and reports completion rates collapsing from 100% to 20% with none finishing the full pipeline and costs varying by three orders of magnitude.

Cheap Code, Costly Judgment: A Case Study on Governable Agentic Software Engineering

cs.SE · 2026-07-01 · unverdicted · novelty 5.0

A case study of AI-agentic software development yields a process model explaining how engineering judgment converts recurring structural failures into durable governance mechanisms.
