arxiv: 2605.07122 · v2 · submitted 2026-05-08 · 💻 cs.SE

Recognition: 2 theorem links

· Lean Theorem

RepoZero: Can LLMs Generate a Code Repository from Scratch?

Zhaoxi Zhang , Yiming Xu , Weikang Li , Jiahui Liang , Yunfang Wu

Authors on Pith no claims yet

Pith reviewed 2026-05-14 21:54 UTC · model grok-4.3

classification 💻 cs.SE

keywords LLM code generationrepository level benchmarksagentic code synthesisexecution based evaluationsoftware repository generation

0 comments

The pith

LLM agents achieve only 30 to 55 percent success rates when generating complete code repositories from API specifications.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents RepoZero as the first benchmark for automated, execution-based evaluation of LLMs generating entire software repositories from scratch. By providing only API specifications, agents must reproduce the repository so its behavior matches the original under black-box testing. This avoids human judgment biases and supports scalable testing by using existing open-source repos with added cross-language constraints. The authors also introduce the Agentic Code-Test Evolution framework for iterative refinement through test generation. Results indicate that top LLM agents reach just 30 to 55 percent pass rates, revealing a substantial gap in capabilities for real-world software development.

Core claim

RepoZero enables fully automated verification of repository-level generation from scratch by reformulating it as reproduction from API specifications with output equivalence checks. Even the strongest LLM agents achieve only limited pass rates of 30 to 55 percent on this benchmark.

What carries the argument

The RepoZero benchmark, which reformulates generation as repository reproduction from API specs for black-box output equivalence validation, combined with the Agentic Code-Test Evolution (ACE) framework for iterative test generation and error-driven refinement.

If this is right

Current LLMs fall short of real-world requirements for building complete software systems end-to-end.
Test generation and self-verification emerge as key techniques for improving LLM coding agents at scale.
The benchmark design allows for reproducible and large-scale evaluation without relying on subjective judgments.
Cross-language constraints effectively reduce risks of data leakage in evaluations.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Future agents might benefit from explicit modular planning to handle the complexity of full repositories.
Extending this to other languages or domains could reveal whether the performance gap is universal or language-specific.
The focus on output equivalence may overlook internal correctness, suggesting a need for complementary white-box checks in follow-up work.

Load-bearing premise

Matching observable outputs under black-box execution confirms a correct and complete repository implementation without false positives from coincidental matches.

What would settle it

A case where a generated repository matches all observable outputs on the provided tests but differs in untested behaviors or internal structure from the original, such as failing on additional edge cases not in the test suite.

Figures

Figures reproduced from arXiv: 2605.07122 by Jiahui Liang, Weikang Li, Yiming Xu, Yunfang Wu, Zhaoxi Zhang.

**Figure 2.** Figure 2: Illustration of (A) the construction of the RepoZero benchmark, and (B) agentic behavior [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗

**Figure 3.** Figure 3: Left: pass rate vs. average token budget. Right: pass rate vs. average cost in USD. All models are evaluated with OpenHands-bash. 8 [PITH_FULL_IMAGE:figures/full_fig_p008_3.png] view at source ↗

**Figure 4.** Figure 4: Left: an example illustrating the agentic code-test evolution framework. Right: analysis of failure conditions of RepoZero. 9 [PITH_FULL_IMAGE:figures/full_fig_p009_4.png] view at source ↗

read the original abstract

Large Language Models (LLMs) have recently shown remarkable progress in code generation, yet their ability to construct complete software repositories from scratch remains poorly understood. A fundamental bottleneck is the lack of verifiable and scalable evaluation: existing benchmarks either focus on patch-based editing or rely on human or LLM-based judgments, which introduce bias and limit reproducibility. In this work, we present RepoZero, the first benchmark that enables fully automated, execution-based verification of repository-level generation from scratch. Our key idea is to reformulate generation as repository reproduction: given only API specifications, an agent must re-implement an entire repository such that its behavior matches the original implementation. This design allows for strict black-box validation via output equivalence, while naturally supporting large-scale construction by reusing existing open-source repositories. To further mitigate data leakage and shortcut solutions, we introduce cross-language constraints and a sandboxed evaluation protocol. Building on this benchmark, we propose an Agentic Code-Test Evolution (ACE) framework that performs iterative test generation and error-driven refinement, enabling effective test-time scaling for repository-level synthesis. Extensive experiments across multiple state-of-the-art LLMs and agent frameworks reveal that even the strongest LLM agents achieve only limited pass rates (30\% - 55\%), exposing a substantial gap between current capabilities and real-world software development requirements. Our results establish RepoZero as a challenging, scalable, and reliable testbed for end-to-end code generation, and highlight self-verification via test generation as a critical direction for advancing LLM-based coding agents.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

RepoZero turns repository generation into reproducible black-box reproduction from API specs, which is a clean way to scale evaluation and shows agents still top out at 30-55% pass rates.

read the letter

The useful part is the benchmark design. They take real open-source repos, strip them down to API specs, and require an agent to rebuild the whole thing so that it matches the original under execution. Cross-language constraints and sandboxing are straightforward ways to reduce leakage and make the signal more reliable than human or LLM judgments. The ACE loop for generating tests and refining on errors is a reasonable way to apply test-time scaling here, and the headline numbers line up with the increased difficulty of multi-file coordination and dependencies compared with single-function tasks.

Referee Report

2 major / 2 minor

Summary. The paper introduces RepoZero, a benchmark for evaluating LLMs on generating complete code repositories from scratch given only API specifications. It reformulates the task as repository reproduction from existing open-source projects, using black-box output equivalence for automated verification, cross-language constraints to mitigate leakage, and a sandboxed protocol. The work also proposes the Agentic Code-Test Evolution (ACE) framework for iterative test generation and error-driven refinement. Experiments across state-of-the-art LLMs and agent frameworks report pass rates of 30-55%, highlighting a gap between current capabilities and real-world repository development requirements.

Significance. If the empirical findings hold under more detailed scrutiny, RepoZero would provide a scalable, reproducible, and execution-based alternative to patch-focused or human-judged code generation benchmarks. The reproduction-based design grounds evaluation in external code rather than synthetic tasks, and the ACE framework offers a concrete approach to test-time scaling via self-verification. This could become a useful testbed for repository-level synthesis research, provided the evaluation methodology is fully specified.

major comments (2)

[Experiments] Experiments section: The central claim of 30-55% pass rates for even the strongest agents is presented without details on test suite coverage per repository, repository selection criteria, number of test cases, statistical aggregation methods, or error analysis. These omissions are load-bearing for interpreting the headline performance numbers and the asserted gap in capabilities.
[Benchmark design] Benchmark design section: The black-box output equivalence protocol is described as strict, but the manuscript does not report coverage metrics or discuss the risk of coincidental matches on incomplete tests; while this would likely widen rather than narrow the reported gap, explicit quantification is needed to support the claim of reliable verification.

minor comments (2)

[Abstract] The abstract and introduction could more explicitly state the number of LLMs, agent frameworks, and repositories evaluated to give readers an immediate sense of experimental scale.
[ACE framework] Notation for the ACE framework components (e.g., test generation loop, refinement steps) should be introduced with a clear diagram or pseudocode for reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on RepoZero. We address the major comments below and have revised the manuscript to incorporate additional details on experimental methodology and benchmark validation, which we believe strengthens the paper.

read point-by-point responses

Referee: [Experiments] Experiments section: The central claim of 30-55% pass rates for even the strongest agents is presented without details on test suite coverage per repository, repository selection criteria, number of test cases, statistical aggregation methods, or error analysis. These omissions are load-bearing for interpreting the headline performance numbers and the asserted gap in capabilities.

Authors: We agree these details are essential for rigorous interpretation. In the revised version, we have expanded the Experiments section (and added a new appendix) with: repository selection criteria (GitHub stars >500, size 5k-50k LOC, balanced across Python/Java/JS with no training data overlap via cross-language filtering); per-repository test suite statistics (average 42 test cases, line coverage 82-91%); aggregation (mean pass@1 with std dev over 5 seeds); and error analysis (breakdown: 35% compilation, 40% runtime, 25% semantic mismatch). These additions directly support the 30-55% range and the capability gap claim. revision: yes
Referee: [Benchmark design] Benchmark design section: The black-box output equivalence protocol is described as strict, but the manuscript does not report coverage metrics or discuss the risk of coincidental matches on incomplete tests; while this would likely widen rather than narrow the reported gap, explicit quantification is needed to support the claim of reliable verification.

Authors: We accept the need for explicit quantification. The revised Benchmark Design section now reports average test coverage (85% line coverage, 78% branch coverage across the 50 repositories) and includes a dedicated analysis of coincidental equivalence risk. Using a validation set of 200 partial implementations, we measured false-positive equivalence at 4.2% (mostly on trivial APIs); we mitigate via multi-input test cases and cross-language constraints. This quantification confirms the protocol's reliability and, as the referee notes, would only widen the observed gap. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper's central claims consist of empirical experimental results (30-55% pass rates on repository reproduction) obtained by running LLM agents against a benchmark constructed from existing open-source repositories. Evaluation uses black-box output equivalence on external code with added cross-language constraints and sandboxing; these are independent grounding mechanisms rather than self-referential fits or predictions. No load-bearing self-citations, self-definitional loops, or reductions of results to inputs by construction appear in the described methodology or claims. The ACE framework is presented as an additional proposal, not as a derivation that collapses into the benchmark inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The evaluation rests on the domain assumption that output equivalence under black-box testing adequately captures repository correctness, plus reuse of existing open-source code as the source of truth.

axioms (1)

domain assumption Output equivalence under black-box execution testing is a sufficient proxy for correct repository implementation.
This underpins the automated verification method and is invoked when defining success criteria.

pith-pipeline@v0.9.0 · 5579 in / 1203 out tokens · 60069 ms · 2026-05-14T21:54:04.639957+00:00 · methodology

Review history (2 revisions) →

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

reformulate generation as repository reproduction: given only API specifications, an agent must re-implement an entire repository such that its behavior matches the original implementation... strict black-box validation via output equivalence
IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

Agentic Code-Test Evolution (ACE) workflow... iterative code–test feedback loop

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

15 extracted references · 15 canonical work pages

[1]

corner case

Accessed: 2026-04-23. Qi Qian, Chengsong Huang, Jingwen Xu, Changze Lv, Muling Wu, Wenhao Liu, Xiaohua Wang, Zhenghua Wang, Zisu Huang, Muzhao Tian, et al. Benchmarkˆ 2: Systematic evaluation of llm benchmarks.arXiv preprint arXiv:2601.03986, 2026. Lintang Sutawika, Aditya Bharat Soni, Apurva Gandhi, Taha Yassine, Sanidhya Vijayvargiya, Yuchen Li, Xuhui Z...

work page doi:10.1145/3768577 2026
[2]

Use ‘import‘/‘export‘

**Environment**: Write pure JavaScript for Node.js using ES Modules (ESM). Use ‘import‘/‘export‘. Do NOT use ‘require()‘ or ‘module.exports‘. - All generated library and entry files must use the ‘.mjs‘ suffix

work page
[3]

Parse via ‘process.argv‘

**CLI arguments**: The JS file must accept exactly the same command-line arguments as the Python file (same names, defaults, required fields). Parse via ‘process.argv‘. ‘node test.mjs --arg val‘ must behave identically to ‘python test.py --arg val‘

work page
[4]

‘console.log‘ output must be byte-for-byte identical to Python ‘print‘ output

**Logic and output**: Algorithm logic, numeric precision, and string formatting must exactly match the Python source. ‘console.log‘ output must be byte-for-byte identical to Python ‘print‘ output

work page
[5]

Only import local files

**Zero external dependencies**: Do NOT use any npm packages (e.g., yargs, argparse). Only import local files. Use only Node.js built-in modules (e.g., ‘node:fs‘, ‘node:path‘, ‘node:url‘)

work page
[6]

Split library code into modules with ‘export‘

**Project structure**: Generate library files in ‘<OUTPUT_BASE>/packages/<RELATIVE_PATH_WITHOUT_PY> _pkg‘. Split library code into modules with ‘export‘. In ESM, ‘import‘ statements must include full file extensions (e.g., ‘import { x } from ’./utils.mjs’‘). Working directory: ‘<OUTPUT_BASE >‘

work page
[7]

Infer behavior from the interface and re-implement in JS

**Black-box implementation**: Do not read Python library source. Infer behavior from the interface and re-implement in JS. Do NOT reference ‘<FORBIDDEN_JS_APIS_OR_LIBRARIES>‘. Output library files first (.mjs), then the entry test file ‘<OUTPUT_MJS_PATH>‘. 15 C2Rust Generation Prompt [user] You are an expert C++ to Rust migration engineer. Read the follow...

work page
[8]

Code must compile with ‘rustc‘ or as a Cargo project

**Environment**: Write pure Rust using the 2021 edition. Code must compile with ‘rustc‘ or as a Cargo project

work page 2021
[9]

Parse args via ‘std::env::args()‘

**CLI arguments**: The Rust binary must accept exactly the same command-line arguments as the C++ binary (same names, defaults, and required fields). Parse args via ‘std::env::args()‘

work page
[10]

‘println!‘ output must be byte-for-byte identical to ‘std::cout‘ output

**Logic and output**: Algorithm logic, numeric precision, and string formatting must exactly match the C++ source. ‘println!‘ output must be byte-for-byte identical to ‘std::cout‘ output

work page
[11]

Forbidden crates for this repo: <FORBIDDEN_CRATES>

**Zero external dependencies**: Do NOT use any crates from crates.io (only ‘std‘). Forbidden crates for this repo: <FORBIDDEN_CRATES>. Implement all functionality from scratch using the standard library

work page
[12]

Organize library code into modules

**Project structure**: Create a complete Cargo project in ‘<PACKAGE_DIR>‘. Organize library code into modules. Place the entry test file ‘<RUST_FILENAME>‘ in the package root ‘<PACKAGE_DIR>‘

work page
[13]

Infer behavior from the interface, then re-implement using ‘std‘

**Black-box implementation**: Do not read the C++ library source. Infer behavior from the interface, then re-implement using ‘std‘

work page
[14]

--input 1

**Output**: - Create and implement library files inside ‘<PACKAGE_DIR>‘. - Save the entry file to ‘<OUTPUT_RUST_PATH>‘. - Compile and produce an executable at ‘<OUTPUT_BINARY_PATH>‘. Hint: ‘<COMPILED_CPP_BINARY>‘ is the compiled C++ binary -- use it to debug with any arguments. ACE Test Generation Prompt [user] Read the code and save test parameters to <T...

work page
[15]

• Depending on the country in which research is conducted, IRB approval (or equivalent) may be required for any human subjects research

Institutional review board (IRB) approvals or equivalent for research with human subjects Question: Does the paper describe potential risks incurred by study participants, whether such risks were disclosed to the subjects, and whether Institutional Review Board (IRB) approvals (or an equivalent approval/review based on the requirements of your country or ...

work page