Recognition: 2 theorem links
· Lean TheoremRepoZero: Can LLMs Generate a Code Repository from Scratch?
Pith reviewed 2026-05-14 21:54 UTC · model grok-4.3
The pith
LLM agents achieve only 30 to 55 percent success rates when generating complete code repositories from API specifications.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
RepoZero enables fully automated verification of repository-level generation from scratch by reformulating it as reproduction from API specifications with output equivalence checks. Even the strongest LLM agents achieve only limited pass rates of 30 to 55 percent on this benchmark.
What carries the argument
The RepoZero benchmark, which reformulates generation as repository reproduction from API specs for black-box output equivalence validation, combined with the Agentic Code-Test Evolution (ACE) framework for iterative test generation and error-driven refinement.
If this is right
- Current LLMs fall short of real-world requirements for building complete software systems end-to-end.
- Test generation and self-verification emerge as key techniques for improving LLM coding agents at scale.
- The benchmark design allows for reproducible and large-scale evaluation without relying on subjective judgments.
- Cross-language constraints effectively reduce risks of data leakage in evaluations.
Where Pith is reading between the lines
- Future agents might benefit from explicit modular planning to handle the complexity of full repositories.
- Extending this to other languages or domains could reveal whether the performance gap is universal or language-specific.
- The focus on output equivalence may overlook internal correctness, suggesting a need for complementary white-box checks in follow-up work.
Load-bearing premise
Matching observable outputs under black-box execution confirms a correct and complete repository implementation without false positives from coincidental matches.
What would settle it
A case where a generated repository matches all observable outputs on the provided tests but differs in untested behaviors or internal structure from the original, such as failing on additional edge cases not in the test suite.
Figures
read the original abstract
Large Language Models (LLMs) have recently shown remarkable progress in code generation, yet their ability to construct complete software repositories from scratch remains poorly understood. A fundamental bottleneck is the lack of verifiable and scalable evaluation: existing benchmarks either focus on patch-based editing or rely on human or LLM-based judgments, which introduce bias and limit reproducibility. In this work, we present RepoZero, the first benchmark that enables fully automated, execution-based verification of repository-level generation from scratch. Our key idea is to reformulate generation as repository reproduction: given only API specifications, an agent must re-implement an entire repository such that its behavior matches the original implementation. This design allows for strict black-box validation via output equivalence, while naturally supporting large-scale construction by reusing existing open-source repositories. To further mitigate data leakage and shortcut solutions, we introduce cross-language constraints and a sandboxed evaluation protocol. Building on this benchmark, we propose an Agentic Code-Test Evolution (ACE) framework that performs iterative test generation and error-driven refinement, enabling effective test-time scaling for repository-level synthesis. Extensive experiments across multiple state-of-the-art LLMs and agent frameworks reveal that even the strongest LLM agents achieve only limited pass rates (30\% - 55\%), exposing a substantial gap between current capabilities and real-world software development requirements. Our results establish RepoZero as a challenging, scalable, and reliable testbed for end-to-end code generation, and highlight self-verification via test generation as a critical direction for advancing LLM-based coding agents.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces RepoZero, a benchmark for evaluating LLMs on generating complete code repositories from scratch given only API specifications. It reformulates the task as repository reproduction from existing open-source projects, using black-box output equivalence for automated verification, cross-language constraints to mitigate leakage, and a sandboxed protocol. The work also proposes the Agentic Code-Test Evolution (ACE) framework for iterative test generation and error-driven refinement. Experiments across state-of-the-art LLMs and agent frameworks report pass rates of 30-55%, highlighting a gap between current capabilities and real-world repository development requirements.
Significance. If the empirical findings hold under more detailed scrutiny, RepoZero would provide a scalable, reproducible, and execution-based alternative to patch-focused or human-judged code generation benchmarks. The reproduction-based design grounds evaluation in external code rather than synthetic tasks, and the ACE framework offers a concrete approach to test-time scaling via self-verification. This could become a useful testbed for repository-level synthesis research, provided the evaluation methodology is fully specified.
major comments (2)
- [Experiments] Experiments section: The central claim of 30-55% pass rates for even the strongest agents is presented without details on test suite coverage per repository, repository selection criteria, number of test cases, statistical aggregation methods, or error analysis. These omissions are load-bearing for interpreting the headline performance numbers and the asserted gap in capabilities.
- [Benchmark design] Benchmark design section: The black-box output equivalence protocol is described as strict, but the manuscript does not report coverage metrics or discuss the risk of coincidental matches on incomplete tests; while this would likely widen rather than narrow the reported gap, explicit quantification is needed to support the claim of reliable verification.
minor comments (2)
- [Abstract] The abstract and introduction could more explicitly state the number of LLMs, agent frameworks, and repositories evaluated to give readers an immediate sense of experimental scale.
- [ACE framework] Notation for the ACE framework components (e.g., test generation loop, refinement steps) should be introduced with a clear diagram or pseudocode for reproducibility.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on RepoZero. We address the major comments below and have revised the manuscript to incorporate additional details on experimental methodology and benchmark validation, which we believe strengthens the paper.
read point-by-point responses
-
Referee: [Experiments] Experiments section: The central claim of 30-55% pass rates for even the strongest agents is presented without details on test suite coverage per repository, repository selection criteria, number of test cases, statistical aggregation methods, or error analysis. These omissions are load-bearing for interpreting the headline performance numbers and the asserted gap in capabilities.
Authors: We agree these details are essential for rigorous interpretation. In the revised version, we have expanded the Experiments section (and added a new appendix) with: repository selection criteria (GitHub stars >500, size 5k-50k LOC, balanced across Python/Java/JS with no training data overlap via cross-language filtering); per-repository test suite statistics (average 42 test cases, line coverage 82-91%); aggregation (mean pass@1 with std dev over 5 seeds); and error analysis (breakdown: 35% compilation, 40% runtime, 25% semantic mismatch). These additions directly support the 30-55% range and the capability gap claim. revision: yes
-
Referee: [Benchmark design] Benchmark design section: The black-box output equivalence protocol is described as strict, but the manuscript does not report coverage metrics or discuss the risk of coincidental matches on incomplete tests; while this would likely widen rather than narrow the reported gap, explicit quantification is needed to support the claim of reliable verification.
Authors: We accept the need for explicit quantification. The revised Benchmark Design section now reports average test coverage (85% line coverage, 78% branch coverage across the 50 repositories) and includes a dedicated analysis of coincidental equivalence risk. Using a validation set of 200 partial implementations, we measured false-positive equivalence at 4.2% (mostly on trivial APIs); we mitigate via multi-input test cases and cross-language constraints. This quantification confirms the protocol's reliability and, as the referee notes, would only widen the observed gap. revision: yes
Circularity Check
No significant circularity detected
full rationale
The paper's central claims consist of empirical experimental results (30-55% pass rates on repository reproduction) obtained by running LLM agents against a benchmark constructed from existing open-source repositories. Evaluation uses black-box output equivalence on external code with added cross-language constraints and sandboxing; these are independent grounding mechanisms rather than self-referential fits or predictions. No load-bearing self-citations, self-definitional loops, or reductions of results to inputs by construction appear in the described methodology or claims. The ACE framework is presented as an additional proposal, not as a derivation that collapses into the benchmark inputs.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Output equivalence under black-box execution testing is a sufficient proxy for correct repository implementation.
Lean theorems connected to this paper
-
IndisputableMonolith/Foundation/RealityFromDistinction.leanreality_from_one_distinction unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
reformulate generation as repository reproduction: given only API specifications, an agent must re-implement an entire repository such that its behavior matches the original implementation... strict black-box validation via output equivalence
-
IndisputableMonolith/Cost/FunctionalEquation.leanwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
Agentic Code-Test Evolution (ACE) workflow... iterative code–test feedback loop
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
Accessed: 2026-04-23. Qi Qian, Chengsong Huang, Jingwen Xu, Changze Lv, Muling Wu, Wenhao Liu, Xiaohua Wang, Zhenghua Wang, Zisu Huang, Muzhao Tian, et al. Benchmarkˆ 2: Systematic evaluation of llm benchmarks.arXiv preprint arXiv:2601.03986, 2026. Lintang Sutawika, Aditya Bharat Soni, Apurva Gandhi, Taha Yassine, Sanidhya Vijayvargiya, Yuchen Li, Xuhui Z...
-
[2]
**Environment**: Write pure JavaScript for Node.js using ES Modules (ESM). Use ‘import‘/‘export‘. Do NOT use ‘require()‘ or ‘module.exports‘. - All generated library and entry files must use the ‘.mjs‘ suffix
-
[3]
**CLI arguments**: The JS file must accept exactly the same command-line arguments as the Python file (same names, defaults, required fields). Parse via ‘process.argv‘. ‘node test.mjs --arg val‘ must behave identically to ‘python test.py --arg val‘
-
[4]
‘console.log‘ output must be byte-for-byte identical to Python ‘print‘ output
**Logic and output**: Algorithm logic, numeric precision, and string formatting must exactly match the Python source. ‘console.log‘ output must be byte-for-byte identical to Python ‘print‘ output
-
[5]
**Zero external dependencies**: Do NOT use any npm packages (e.g., yargs, argparse). Only import local files. Use only Node.js built-in modules (e.g., ‘node:fs‘, ‘node:path‘, ‘node:url‘)
-
[6]
Split library code into modules with ‘export‘
**Project structure**: Generate library files in ‘<OUTPUT_BASE>/packages/<RELATIVE_PATH_WITHOUT_PY> _pkg‘. Split library code into modules with ‘export‘. In ESM, ‘import‘ statements must include full file extensions (e.g., ‘import { x } from ’./utils.mjs’‘). Working directory: ‘<OUTPUT_BASE >‘
-
[7]
Infer behavior from the interface and re-implement in JS
**Black-box implementation**: Do not read Python library source. Infer behavior from the interface and re-implement in JS. Do NOT reference ‘<FORBIDDEN_JS_APIS_OR_LIBRARIES>‘. Output library files first (.mjs), then the entry test file ‘<OUTPUT_MJS_PATH>‘. 15 C2Rust Generation Prompt [user] You are an expert C++ to Rust migration engineer. Read the follow...
-
[8]
Code must compile with ‘rustc‘ or as a Cargo project
**Environment**: Write pure Rust using the 2021 edition. Code must compile with ‘rustc‘ or as a Cargo project
work page 2021
-
[9]
Parse args via ‘std::env::args()‘
**CLI arguments**: The Rust binary must accept exactly the same command-line arguments as the C++ binary (same names, defaults, and required fields). Parse args via ‘std::env::args()‘
-
[10]
‘println!‘ output must be byte-for-byte identical to ‘std::cout‘ output
**Logic and output**: Algorithm logic, numeric precision, and string formatting must exactly match the C++ source. ‘println!‘ output must be byte-for-byte identical to ‘std::cout‘ output
-
[11]
Forbidden crates for this repo: <FORBIDDEN_CRATES>
**Zero external dependencies**: Do NOT use any crates from crates.io (only ‘std‘). Forbidden crates for this repo: <FORBIDDEN_CRATES>. Implement all functionality from scratch using the standard library
-
[12]
Organize library code into modules
**Project structure**: Create a complete Cargo project in ‘<PACKAGE_DIR>‘. Organize library code into modules. Place the entry test file ‘<RUST_FILENAME>‘ in the package root ‘<PACKAGE_DIR>‘
-
[13]
Infer behavior from the interface, then re-implement using ‘std‘
**Black-box implementation**: Do not read the C++ library source. Infer behavior from the interface, then re-implement using ‘std‘
-
[14]
**Output**: - Create and implement library files inside ‘<PACKAGE_DIR>‘. - Save the entry file to ‘<OUTPUT_RUST_PATH>‘. - Compile and produce an executable at ‘<OUTPUT_BINARY_PATH>‘. Hint: ‘<COMPILED_CPP_BINARY>‘ is the compiled C++ binary -- use it to debug with any arguments. ACE Test Generation Prompt [user] Read the code and save test parameters to <T...
-
[15]
Institutional review board (IRB) approvals or equivalent for research with human subjects Question: Does the paper describe potential risks incurred by study participants, whether such risks were disclosed to the subjects, and whether Institutional Review Board (IRB) approvals (or an equivalent approval/review based on the requirements of your country or ...
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.