pith. machine review for the scientific record.

arxiv: 2601.13943 · v3 · submitted 2026-01-20 · 💻 cs.SE

Recognition: no theorem link

RepoGenesis: Benchmarking End-to-End Microservice Generation from Readme to Repository


Pith reviewed 2026-05-16 12:35 UTC · model grok-4.3

classification 💻 cs.SE
keywords microservice generation · repository-level code generation · benchmark · large language models · code agents · API coverage · Pass@1 · software engineering

The pith

The RepoGenesis benchmark shows that top AI systems achieve high API coverage and deployment rates but no more than 23.67 percent Pass@1 on full microservice repositories.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents RepoGenesis as the first benchmark specifically designed for end-to-end generation of complete microservice repositories from a README description. It includes 106 repositories spanning Python and Java across 18 domains and 11 frameworks, with verified API endpoints and test cases. Evaluations of open-source agents and commercial tools reveal that even the best performers reach only 23.67 percent Pass@1 on Python and 21.45 percent on Java. These low correctness rates persist despite strong API coverage up to 73.91 percent and deployment success up to 100 percent. The benchmark also demonstrates that fine-tuning a small model on its training split can reach performance levels comparable to much larger systems.

Core claim

RepoGenesis is a multilingual benchmark of 106 complete microservice repositories that tests whether agents can produce functional, deployable codebases from a high-level README. When evaluated with Pass@1, API Coverage, and Deployment Success Rate, current systems expose clear gaps in architectural coherence, dependency management, and cross-file consistency even when partial success metrics appear strong.

What carries the argument

The RepoGenesis dataset of 106 verified repositories containing 1,258 API endpoints and 2,335 test cases, used to measure Pass@1 functional correctness alongside API coverage and deployment success.
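To make the relationship between the three metrics concrete, the sketch below computes them over a handful of made-up results. It is an illustration only, not the paper's evaluation harness: the `RepoResult` record, its field names, and the toy numbers are hypothetical, and the definitions of API Coverage (required endpoints found, pooled across repositories) and Deployment Success Rate (share of repositories that start) are assumptions. Pass@1 is taken here in the strict sense that a repository counts only if it deploys and passes all of its tests.

```python
from dataclasses import dataclass


@dataclass
class RepoResult:
    """Outcome of evaluating one generated repository (hypothetical record)."""
    deployed: bool            # service started and answered a probe
    endpoints_required: int   # endpoints listed in the README
    endpoints_found: int      # required endpoints the service exposes
    tests_total: int
    tests_passed: int


def deployment_success_rate(results):
    """Share of repositories that deployed at all (assumed DSR definition)."""
    return sum(r.deployed for r in results) / len(results)


def api_coverage(results):
    """Share of required endpoints found, pooled over repositories (assumed)."""
    required = sum(r.endpoints_required for r in results)
    found = sum(r.endpoints_found for r in results)
    return found / required


def pass_at_1(results):
    """Share of repositories whose single generation passes every test."""
    passed = sum(r.deployed and r.tests_passed == r.tests_total for r in results)
    return passed / len(results)


# Toy data, invented for illustration; not taken from the benchmark.
toy = [
    RepoResult(True, 10, 8, 20, 20),
    RepoResult(True, 10, 7, 20, 18),
    RepoResult(True, 10, 9, 20, 15),
    RepoResult(True, 10, 6, 20, 19),
]

print(deployment_success_rate(toy), api_coverage(toy), pass_at_1(toy))
```

In this toy set every repository deploys and three quarters of the endpoints exist, yet only one repository in four passes all of its tests. A single failing test anywhere in a repository is enough to sink the strict metric, which is the same shape of gap the paper reports.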

If this is right

  • Architectural planning and cross-file consistency must be addressed beyond current generation capabilities.
  • Fine-tuning smaller models on repository-level data can close much of the gap to larger commercial systems.
  • Dependency management and global coherence remain the primary bottlenecks in 0-to-1 microservice generation.
  • Future agent designs should prioritize full-repository validation rather than isolated API or file-level success.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Training data focused on complete repositories may prove more valuable than further scaling of general code models.
  • Practical use of current agents for microservice projects will still require substantial human intervention to reach production quality.
  • Extending the benchmark to additional languages and more complex frameworks would test whether the observed deficiencies generalize.

Load-bearing premise

The 106 repositories and test cases selected through review-rebuttal accurately capture real-world 0-to-1 microservice development workflows.

What would settle it

Demonstration of a system achieving at least 70 percent Pass@1 on the full RepoGenesis test set while preserving the reported API coverage and deployment success rates.

Figures

Figures reproduced from arXiv: 2601.13943 by Dongmei Zhang, Fangkai Yang, Lu Wang, Pu Zhao, Qingwei Lin, Ran Jia, Saravan Rajmohan, Xin Yin, Xu Chen, Zhiyuan Peng.

Figure 1
Figure 1: Capability map of agents and IDEs. We establish a multi-dimensional evaluation methodology comprising Pass@1 for functional correctness, API Coverage (AC) for implementation completeness, and Deployment Success Rate (DSR) for deployability. Evaluations on four open-source agents (i.e., DeepCode (Lab, 2025), MetaGPT (Hong et al., 2023), MS-Agent (Li et al., 2024a), Qwen-Agent (Team, 2024)) and three commer…
Figure 2
Figure 2: Overview of benchmark construction and evaluation pipeline, which undergo rigorous quality assurance through a “Review-Rebuttal QA”. Given a requirement document (README.md), various agents and IDEs generate repositories. Repositories are executed in sandboxed environments and evaluated across DSR, AC, and Pass@1. across three dimensions: (1) Pass@k, (2) API Coverage, and (3) Deployment Success Rate. Deta…
Figure 3
Figure 3: Trajectory showing Cursor (GPT-4.1) stuck in a non-execution loop. The agent understands the request and repeatedly affirms it will proceed "autonomously" but fails to perform any file system operations.
Original abstract

Large language models and agents have achieved remarkable progress in code generation. However, existing benchmarks focus on isolated function/class-level generation (e.g., ClassEval) or modifications to existing codebases (e.g., SWE-Bench), neglecting complete microservice repository generation that reflects real-world 0-to-1 development workflows. To bridge this gap, we introduce RepoGenesis, the first multilingual benchmark for repository-level end-to-end web microservice generation, comprising 106 repositories (60 Python, 46 Java) across 18 domains and 11 frameworks, with 1,258 API endpoints and 2,335 test cases verified through a "review-rebuttal" quality assurance process. We evaluate open-source agents (e.g., DeepCode) and commercial IDEs (e.g., Cursor) using Pass@1, API Coverage (AC), and Deployment Success Rate (DSR). Results reveal that despite high AC (up to 73.91%) and DSR (up to 100%), the best-performing system achieves only 23.67% Pass@1 on Python and 21.45% on Java, exposing deficiencies in architectural coherence, dependency management, and cross-file consistency. Notably, GenesisAgent-8B, fine-tuned on RepoGenesis (train), achieves performance comparable to GPT-5 mini, demonstrating the quality of RepoGenesis for advancing microservice generation. We release our benchmark at https://github.com/pzy2000/RepoGenesis.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces RepoGenesis, the first multilingual benchmark for repository-level end-to-end web microservice generation from README to full repository. It comprises 106 repositories (60 Python, 46 Java) across 18 domains and 11 frameworks, containing 1,258 API endpoints and 2,335 test cases verified via a review-rebuttal process. Evaluations of open-source agents and commercial IDEs on Pass@1, API Coverage (AC), and Deployment Success Rate (DSR) show high AC (up to 73.91%) and DSR (up to 100%) but low Pass@1 (23.67% Python, 21.45% Java), which the authors attribute to deficiencies in architectural coherence, dependency management, and cross-file consistency. Fine-tuning GenesisAgent-8B on the benchmark data yields performance comparable to GPT-5 mini.

Significance. If the benchmark construction is sound, this work addresses a clear gap between existing function-level or edit-based code generation benchmarks and realistic 0-to-1 microservice development. The reported performance gap, combined with the public release of the dataset and the demonstration that fine-tuning on RepoGenesis matches commercial model performance, provides a concrete, falsifiable resource for advancing multi-file, dependency-aware generation research.

major comments (2)
  1. [Abstract and §3] Abstract and §3 (Dataset Construction): The repository selection criteria are summarized at a high level (18 domains, 11 frameworks, review-rebuttal verification) but provide no quantitative details on complexity metrics distributions, inter-rater reliability scores for the verification process, or external anchoring against production microservice corpora. This directly affects the strength of the claim that low Pass@1 exposes general deficiencies rather than benchmark-specific artifacts.
  2. [§4 and §5] §4 (Evaluation) and §5 (Results): The precise definition and computation of Pass@1 (including scoring of partial generations, cross-file consistency checks, and handling of dependency resolution failures) is not fully specified, and the error analysis lacks a quantitative breakdown of failure modes to substantiate the specific attributions to architectural coherence and dependency management.
minor comments (1)
  1. [Abstract] The abstract mentions fine-tuning 'GenesisAgent-8B' on 'RepoGenesis (train)' but the main text should explicitly state the train/test split sizes, training hyperparameters, and exact base model to support reproducibility claims.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and positive recommendation for minor revision. We have addressed both major comments by adding the requested quantitative details and clarifications to strengthen the manuscript.

Point-by-point responses
  1. Referee: [Abstract and §3] Abstract and §3 (Dataset Construction): The repository selection criteria are summarized at a high level (18 domains, 11 frameworks, review-rebuttal verification) but provide no quantitative details on complexity metrics distributions, inter-rater reliability scores for the verification process, or external anchoring against production microservice corpora. This directly affects the strength of the claim that low Pass@1 exposes general deficiencies rather than benchmark-specific artifacts.

    Authors: We agree that additional quantitative details improve the paper. In the revised §3 we now include: (1) a table reporting complexity metric distributions (mean/median files per repository: 14.2/12, LOC: 2,850/2,410, API endpoints: 11.9/10, dependencies: 7.8/6); (2) inter-rater reliability for the review-rebuttal process (Cohen’s κ = 0.84 across two independent reviewers); and (3) external anchoring via comparison to a random sample of 50 production microservice repositories drawn from GitHub (similar domain and framework distributions, with RepoGenesis showing modestly higher test density). These additions support that the observed Pass@1 gap reflects general challenges rather than benchmark artifacts. revision: yes

  2. Referee: [§4 and §5] §4 (Evaluation) and §5 (Results): The precise definition and computation of Pass@1 (including scoring of partial generations, cross-file consistency checks, and handling of dependency resolution failures) is not fully specified, and the error analysis lacks a quantitative breakdown of failure modes to substantiate the specific attributions to architectural coherence and dependency management.

    Authors: We accept this point. The revised §4 now gives an explicit Pass@1 definition: a generation succeeds only if the complete repository (all files, after automated pip/gradle install with no manual edits) passes every test case in the 2,335-case suite; any missing file, compilation error, or test failure counts as failure. Cross-file consistency is verified via static import analysis plus runtime integration tests. In the updated §5 error analysis we report a quantitative breakdown of 200 randomly sampled failures: 41% architectural coherence (inconsistent service boundaries or data models), 34% dependency management (version conflicts or missing transitive deps), and 25% cross-file consistency (import or interface mismatches). This breakdown directly substantiates the attributions. revision: yes
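The inter-rater figure quoted in the first response is Cohen's κ, which corrects raw agreement between two reviewers for the agreement expected by chance: κ = (p_o − p_e) / (1 − p_e). The sketch below shows that calculation on invented labels. The function name, the "accept"/"refine" label set, and the ten example verdicts are assumptions for illustration and are not the authors' data or tooling.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both raters must label the same non-empty item list")
    n = len(labels_a)

    # Observed agreement: fraction of items given the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: product of each rater's marginal label frequencies.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    categories = freq_a.keys() | freq_b.keys()
    expected = sum(freq_a[c] * freq_b[c] for c in categories) / (n * n)

    if expected == 1.0:
        # Both raters used one identical label throughout.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Invented verdicts from two hypothetical reviewers on ten test cases.
reviewer_1 = ["accept"] * 5 + ["refine"] * 5
reviewer_2 = ["accept"] * 4 + ["refine"] * 6

print(round(cohens_kappa(reviewer_1, reviewer_2), 2))
```

Here the reviewers agree on nine of ten items (p_o = 0.9) while chance agreement is 0.5, so κ comes out at 0.8. Raw agreement alone would overstate reliability whenever one label dominates.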

Circularity Check

0 steps flagged

No significant circularity; empirical benchmark with independent evaluation

Full rationale

The paper introduces a benchmark dataset and evaluates agents on it using standard metrics (Pass@1, API Coverage, Deployment Success Rate) without any mathematical derivations, equations, parameter fitting, or self-referential claims. The 106 repositories and 2,335 test cases are presented as externally verified inputs via a described review-rebuttal process; the reported performance gaps are direct measurements against those fixed test cases rather than quantities derived from the results themselves. No load-bearing step reduces to its own inputs by construction, and the work is self-contained as a data release plus evaluation.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the domain assumption that the curated repositories and review-rebuttal QA process produce a faithful proxy for real 0-to-1 microservice development.

axioms (1)
  • domain assumption The 106 selected repositories across 18 domains and 11 frameworks, together with their 2,335 test cases, represent typical real-world microservice development workflows.
    Invoked to support the claim that observed deficiencies reflect genuine limitations rather than benchmark artifacts.



Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. PlayCoder: Making LLM-Generated GUI Code Playable

    cs.SE 2026-04 conditional novelty 7.0

    PlayCoder raises the rate of LLM-generated GUI apps that can be played end-to-end without logic errors from near zero to 20.3% Play@3 by adding repository-aware generation, agent-driven testing, and iterative repair.

  2. Mono2Sls: Automated Monolith-to-Serverless Migration via Multi-Stage Pipeline with Static Analysis

    cs.SE 2026-04 unverdicted novelty 5.0

    Mono2Sls automates monolith-to-serverless migration with static analysis and multi-stage LLM agents, achieving 100% deployment success and 66.1% end-to-end correctness on six benchmarks.

Reference graph

Works this paper leans on

44 extracted references · 44 canonical work pages · cited by 2 Pith papers · 1 internal anchor

  1. [1]

    Yiyang Hao, Ge Li, Yongqiang Liu, Xiaowei Miao, He Zong, Siyuan Jiang, Yang Liu, and He Wei

    Repotransagent: Multi-agent llm framework for repository-aware code translation. arXiv preprint arXiv:2508.17720. Yiyang Hao, Ge Li, Yongqiang Liu, Xiaowei Miao, He Zong, Siyuan Jiang, Yang Liu, and He Wei. 2022. Aixbench: A code generation benchmark dataset. arXiv preprint arXiv:2206.13179. Sirui Hong, Mingchen Zhuge, Jonathan Chen, Xiawu Zhang, Yuheng Wa...

  2. [2]

    Data Intelligence Lab

    Spoc: Search-based pseudocode to code. Advances in Neural Information Processing Systems, 32. Data Intelligence Lab. 2025. Deepcode: Open agentic coding. Yuhang Lai, Chengxi Li, Yiming Wang, Tianyi Zhang, Ruiqi Zhong, Luke Zettlemoyer, Wen-tau Yih, Daniel Fried, Sida Wang, and Tao Yu. 2023. Ds-1000: A natural and reliable benchmark for data science code ...

  3. [3]

    RepoBench: Benchmarking Repository-Level Code Auto-Completion Systems

    Repobench: Benchmarking repository-level code auto-completion systems. arXiv preprint arXiv:2306.03091. Jane Luo, Xin Zhang, Steven Liu, Jie Wu, Jianfeng Liu, Yiming Huang, Yangyu Huang, Chengyu Yin, Ying Xin, Yuefeng Zhan, and 1 others. 2025. Rpg: A repository planning graph for unified and scalable codebase generation. arXiv preprint arXiv:2509.16198. M...

  4. [4]

    O’Reilly Media, Inc

    An analysis of public rest web service apis. IEEE Transactions on Services Computing, 14(4):957–970. Sam Newman. 2021. Building microservices: designing fine-grained systems. O’Reilly Media, Inc. Ziyi Ni, Huacan Wang, Shuo Zhang, Shuo Lu, Ziyang He, Wang You, Zhenheng Tang, Yuntao Du, Bill Sun, Hongzhang Liu, and 1 others. 2025. Gittaskbench: A benchmar...

  5. [5]

    Correctness (0-3): Syntax, runnability, and assertion accuracy

  6. [6]

    Coverage & Completeness (0-3): Happy paths, edge cases, and error conditions

  7. [7]

    Readability & Maintainability (0-2): Naming, formatting, and documentation

  8. [8]

    pure code generation

    Relevance (0-2): Alignment with repository purpose and README specifications. D Additional Experimental Details D.1 Evaluated Agents In this paper, we evaluate multiple coding agent frameworks and commercial IDEs on RepoGenesis. The details are described as follows. • DeepCode: DeepCode is a code generation platform based on a multi-agent system, specif...

  9. [9]

    Analyze the README and referenced POM to understand the project requirements

  10. [11]

    Implement ALL necessary files including: • Main application files • Configuration files • Dependencies/requirements files (pom.xml is required) • Documentation files • Any additional files needed for the project to run

  11. [13]

    The generated start.sh MUST: • Listen on 0.0.0.0 • Use port specified in the README • Use ONLY the correct command for the detected framework

  12. [14]

    If a web service is expected, bind to 0.0.0.0 and use the port specified in the README

  13. [16]

    Important: Generate ALL files in the current working directory

    Use Maven. Important: Generate ALL files in the current working directory. Do not reference or peek at any tests directory. System Prompt (Python) You are a senior software engineer tasked with implementing a complete software project. Project Requirements: README: {readme_text} requirements.txt (reference): {requirements_text} Your task:

  14. [17]

    Analyze the README and referenced requirements to understand the project requirements

  15. [19]

    Implement ALL necessary files including: • Main application files • Configuration files • Dependencies/requirements files (requirements.txt is required) • Documentation files • Any additional files needed for the project to run

  16. [21]

    The generated start.sh MUST: • Listen on 0.0.0.0 • Use a common port (e.g., 8000) or the one specified in the README • Use ONLY the correct command for the detected framework

  17. [22]

    If a web service is expected, bind to 0.0.0.0 and use the port

  18. [23]

    Important: Generate ALL files in the current working directory

    Write production-ready, well-documented code. Important: Generate ALL files in the current working directory. Do not reference or peek at any tests directory. D.2.2 Agent-Specific Configurations MetaGPT. We utilize the generate_repo interface with the following parameters: inc=False, implement=True, code_review=True, n_round=5, and investment=3.0. The ...

  19. [28]

    If it’s a web service, bind to 0.0.0.0 with the port specified in the README

  20. [29]

    Generate a run_tests.sh file to run the tests

  21. [30]

    Write production-ready, well-documented code

  22. [31]

    Start by listing the subdirectories in {repo_root} and then proceed with the generation for each one

    Use Maven. Start by listing the subdirectories in {repo_root} and then proceed with the generation for each one. IDE Prompt (Python) You are a senior software engineer tasked with implementing a complete software project. Your task is to sequentially generate the complete source code for all Python repositories located in the directory: {repo_root} Proj...

  23. [32]

    Analyze the README to understand the project requirements

  24. [33]

    Design the complete project structure and architecture

  25. [34]

    Implement ALL necessary files including: • Main application files • Configuration files • Dependencies/requirements files • Documentation files • Any additional files needed for the project to run

  26. [35]

    Ensure the project can be started via a single shell command written in a file named start.sh

  27. [36]

    If it’s a web service, bind to 0.0.0.0 with a common port

  28. [37]

    implemented

    Write production-ready, well-documented code. Start by listing the subdirectories in {repo_root} and then proceed with the generation for each one. D.4 Evaluation Metrics D.4.1 Pass@k Calculation and Its Necessity for Estimation In this study, we adopt the Pass@k metric to evaluate the functional correctness of repository generation. The Pass@k metric...

  29. [38]

    Dependency-First

    and Executability in modern code generation benchmarks (Nam et al., 2024). E Details of GenesisAgent GenesisAgent is constructed by extending the MS-Agent framework (Li et al., 2024a) with specialized strategies for handling the complexities of repository-level microservice generation. Unlike the generic code generation agent used in our baselines, G...

  30. [39]

    The system likely anticipates a specific confirmation signal that the model fails to generate in this context

    Workflow Misalignment in Zero2One Generation: The IDE’s agentic workflow, designed primarily for incremental edits, may lack the robust state management required to handle the open-ended nature of creating a repository from scratch based solely on natural language. The system likely anticipates a specific confirmation signal that the model fails to g...

  31. [40]

    decide on your own

    Deficiency in Autonomous Instruction Following: The underlying GPT-4.1 model appears to misinterpret the autonomy-granting instruction. Instead of mapping “decide on your own” to executable actions (i.e., invoking file-writing tools), it treats the input as conversational filler, resulting in a verbal loop of compliance without functional execution. ...

  32. [41]

    Functional Coverage (Weight: 30%): Does the test suite comprehensively cover all API endpoints specified in the README? Are edge cases, error handling paths, and boundary conditions tested?

  33. [42]

    Correctness (Weight: 25%): Do the tests accurately validate the expected behavior as described in requirements? Are assertions precise and meaningful?

  34. [43]

    Code Quality (Weight: 20%): Is the test code well-structured, readable, and maintainable? Are test names descriptive? Is there proper setup/teardown logic?

  35. [44]

    Independence (Weight: 15%): Can tests run in isolation without dependencies on execution order or external state? Are fixtures properly managed?

  36. [45]

    Robustness (Weight: 10%): Do tests handle timeout scenarios, validate HTTP status codes, and check response schemas comprehensively? Evaluating with LLM Reviewer Feedback: You will receive three independent evaluations from LLM reviewers (GPT-5.1, claude-sonnet-4.5, Gemini 3 Pro), each providing: • A numerical score (0-10) for each dimension above • Textu...

  37. [46]

    If all pairwise differences are ∆≤3, the committee has reached consensus

    Check Consensus: Calculate pairwise score differences across the three reviewers. If all pairwise differences are ∆≤3, the committee has reached consensus. 2. When Consensus Exists (∆≤3): • If max(Score)≥7: Accept the test case as satisfactory. No manual intervention needed. • If max(Score)<7: Flag for refinement. The test case requires improvement base...

  38. [47]

    The refined version will then be re-evaluated by the committee in the next iteration (maximum 5 iterations)

    Making the Refinement Decision: If you determine refinement is necessary, the aggregated feedback from all reviewers (including your observations) will be sent to an LLM refiner to improve the test case. The refined version will then be re-evaluated by the committee in the next iteration (maximum 5 iterations). Example Consensus Scenarios Scenario 1 (...

  39. [48]

    These repositories might not only be prone to security vulnerabilities but could also be misused for illicit purposes, such as unauthorized data access or system compromise

    Malicious or Unintended Harmful Effects: The generation of web microservice repositories through LLMs may inadvertently lead to the creation of faulty or insecure code that, if deployed in production environments, could be exploited by malicious actors. These repositories might not only be prone to security vulnerabilities but could also be misused ...

  40. [49]

    Training these models requires significant GPU hours, and the energy consumption associated with this process is a growing concern

    Environmental Impact: The computational resources required for training and fine-tuning large-scale models, such as the ones used in this research, contribute to the environmental impact of AI research. Training these models requires significant GPU hours, and the energy consumption associated with this process is a growing concern. Future work should ex...

  41. [50]

    Fairness Considerations: One potential risk of deploying these technologies is the possibility of exacerbating existing biases or inequalities in software development practice. If the models are trained on a narrow set of data sources, there is a risk that they could generate code that is biased or not applicable to the needs of diverse or marginalized dev...

  42. [51]

    Privacy and Security Considerations: Since the data used in this research comes from publicly available web microservice repositories, there are minimal privacy concerns. However, security risks are inherent in the generation of repository-level code, particularly when models are not fully vetted for safety or are used to create services that handle se...

  43. [52]

    For example, the ability to generate complete repositories quickly might be exploited to create malicious services or to automate the creation of fraudulent systems

    Dual Use: The technology presented in this research, although intended for advancing repository-level code generation for legitimate use cases, could be misused. For example, the ability to generate complete repositories quickly might be exploited to create malicious services or to automate the creation of fraudulent systems. Moreover, incorrect or insec...

  44. [53]

    Exclusion of Certain Groups: While our research focuses on web microservices, this technology stack is not equally accessible or relevant across all development communities. There is a risk that focusing on specific frameworks (e.g., Flask, Spring Boot) could inadvertently exclude developers working with other technology stacks or programming parad...
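
The review-rebuttal excerpts above (entries 32 to 37) describe five weighted rubric dimensions scored 0-10 by three LLM reviewers, followed by a consensus rule on pairwise score differences. A minimal sketch of that decision logic follows, under stated assumptions: the function names and dictionary keys are hypothetical, and collapsing each reviewer's dimension scores into a single weighted sum before comparing reviewers is a guess at how "Score" is formed. The extracted text is truncated before it says what happens when reviewers disagree, so that branch only reports the disagreement.

```python
from itertools import combinations

# Rubric weights as listed in the extracted prompt (they sum to 1.0).
WEIGHTS = {
    "functional_coverage": 0.30,
    "correctness": 0.25,
    "code_quality": 0.20,
    "independence": 0.15,
    "robustness": 0.10,
}


def weighted_score(dimension_scores):
    """Collapse one reviewer's 0-10 dimension scores into one 0-10 score.

    Assumption: the source does not show how the overall score is computed.
    """
    return sum(WEIGHTS[name] * dimension_scores[name] for name in WEIGHTS)


def committee_decision(reviewer_scores, max_gap=3.0, accept_at=7.0):
    """Apply the consensus rule to the reviewers' overall scores."""
    gaps = [abs(a - b) for a, b in combinations(reviewer_scores, 2)]
    if any(gap > max_gap for gap in gaps):
        # The handling of this case is cut off in the source text.
        return "no_consensus"
    # Consensus reached: accept if the highest score clears the bar.
    return "accept" if max(reviewer_scores) >= accept_at else "refine"


print(committee_decision([8.0, 7.0, 6.0]))
print(committee_decision([5.0, 6.0, 4.0]))
print(committee_decision([9.0, 5.0, 7.0]))
```

Note that the acceptance test uses the maximum score rather than the mean, as the excerpt states, so one favourable reviewer is enough to accept a test case once the three scores are within 3 points of each other.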