
arxiv: 2605.15815 · v1 · pith:C6QOUXW2new · submitted 2026-05-15 · 💻 cs.SE · cs.CL · cs.MA

BootstrapAgent: Distilling Repository Setup into Reusable Agent Knowledge

Pith reviewed 2026-05-20 17:00 UTC · model grok-4.3

classification 💻 cs.SE · cs.CL · cs.MA
keywords repository bootstrapping · agent knowledge distillation · software setup automation · multi-agent framework · Docker verification · repair strategies · reproducible environment setup · code agent efficiency

The pith

BootstrapAgent distills the trial-and-error knowledge of setting up software repositories into a reusable .bootstrap contract that future agents can follow directly.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper treats the initial setup of unfamiliar code repositories as a knowledge-capture problem rather than a one-off task. It presents BootstrapAgent, a multi-agent system that runs exploratory bootstrapping, then packages the resulting heuristics into a persistent, machine-readable .bootstrap contract covering environment setup, checks, verification steps, and repair tactics. The contract is produced through evidence extraction, structured planning, Docker-based verification, and trace-driven repair, plus techniques for warm repair and delta repair to keep the process reproducible. Experiments across three benchmarks report a 92.9 percent success rate, more than 10 points above the baseline, together with 25.9 percent lower token consumption and 22.3 percent shorter build times for downstream agents. A sympathetic reader sees the work as converting per-instance exploration cost into shareable, verifiable startup knowledge that agents can reuse without repeating the same search.

Core claim

BootstrapAgent formulates repository bootstrapping as a reusable startup-knowledge problem and solves it by distilling exploration traces into a .bootstrap contract that encodes environment setup, diagnostic checks, minimal verification, and accumulated repair knowledge. The contract is generated through evidence extraction, structured planning, deterministic Docker-based verification, and trace-driven repair, augmented by warm repair with clean replay for faster iteration and delta repair with sanity checks to avoid reward hacking.
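
The interplay between warm repair and clean replay can be illustrated with a minimal sketch. It is not the authors' implementation: the `Plan` type, the function `repair_loop`, and the three callables passed into it are hypothetical names chosen for illustration, and the source does not specify how each step is realized. The sketch only shows the control flow suggested by the description: iterate quickly against reused state, and accept a plan only after it also passes a replay from a fresh state.

```python
from typing import Callable, List, Optional

# Hypothetical representation: a plan is an ordered list of shell commands.
Plan = List[str]


def repair_loop(
    plan: Plan,
    verify_warm: Callable[[Plan], Optional[str]],
    verify_clean: Callable[[Plan], Optional[str]],
    propose_fix: Callable[[Plan, str], Plan],
    max_rounds: int = 5,
) -> Optional[Plan]:
    """Return a plan that passes a clean replay, or None if the budget runs out.

    Both verifiers return None on success or an error trace on failure.
    All three callables are assumptions of this sketch.
    """
    for _ in range(max_rounds):
        # Warm check: fast, reuses state left over from earlier rounds.
        error = verify_warm(plan)
        if error is None:
            # Clean replay: rerun from a fresh state to confirm cold-start success.
            error = verify_clean(plan)
            if error is None:
                return plan
        # Trace-driven repair: derive the next plan from the failure trace.
        plan = propose_fix(plan, error)
    return None
```

A plan that passes only the warm check is never returned, which is the property the clean replay is meant to protect.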

What carries the argument

The .bootstrap contract: a persistent, verifiable, agent-consumable artifact that captures setup heuristics, checks, verification steps, and repair strategies so downstream agents can apply them without re-exploring the repository.
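
To make the four components of the contract concrete, here is a hedged sketch of how such an artifact might be held in memory and serialized. The class name `BootstrapContract`, its field names, the JSON encoding, and the example commands are all assumptions for illustration; the summary does not give the paper's actual schema or file format.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Dict, List


@dataclass
class BootstrapContract:
    """Hypothetical view of a .bootstrap contract; field names are assumed."""

    setup: List[str]                 # environment setup commands
    doctor: List[str]                # diagnostic checks
    minimal_verification: List[str]  # smallest command showing the repo is usable
    repair_notes: Dict[str, str] = field(default_factory=dict)  # symptom -> fix

    def to_json(self) -> str:
        # Sorted keys keep the serialized form stable across runs.
        return json.dumps(asdict(self), indent=2, sort_keys=True)

    @classmethod
    def from_json(cls, text: str) -> "BootstrapContract":
        return cls(**json.loads(text))


# Made-up example values, not taken from the paper.
contract = BootstrapContract(
    setup=["python -m venv .venv", ".venv/bin/pip install -e ."],
    doctor=[".venv/bin/python --version"],
    minimal_verification=[".venv/bin/python -m pytest -q -x"],
    repair_notes={"ModuleNotFoundError: yaml": ".venv/bin/pip install pyyaml"},
)
```

A downstream agent would load the serialized form with `BootstrapContract.from_json(...)` and run the stages in order instead of rediscovering them.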

If this is right

  • Downstream code agents reach usable repository states more reliably and with lower token budgets.
  • Build and setup times decrease because the contract supplies pre-validated steps instead of repeated trial-and-error.
  • Knowledge accumulated during one bootstrap run becomes available to all future agents working on similar repositories.
  • Verification remains reproducible because the contract is executed inside deterministic Docker environments.
  • Repair strategies become shareable artifacts rather than conversation-specific memories.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same distillation approach could be applied to other recurring agent prerequisites such as data-preparation pipelines or test-environment configuration.
  • If contracts are stored in a shared repository they could form a growing library of startup knowledge that improves over time as more agents contribute.
  • The method might reduce the need for human-written setup scripts in open-source projects by letting agents produce and maintain their own .bootstrap contracts.
  • Integration with larger agent workflows could let a single contract serve as the entry point for multiple downstream tasks on the same repository.

Load-bearing premise

The distilled .bootstrap contract will remain effective and generalizable when applied by downstream agents to repositories outside the three evaluated benchmarks without requiring substantial extra adaptation.

What would settle it

Run the generated .bootstrap contracts on a fresh collection of repositories never seen during the original experiments and measure whether success rate falls below the reported 92.9 percent or whether agents require major additional human-written fixes to reach usable states.
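
The comparison proposed above reduces to simple arithmetic, sketched below. The function names and the sample outcomes are hypothetical; only the 92.9 percent figure comes from the paper. The sketch computes the held-out success rate and how many percentage points it falls short of the reported rate.

```python
from typing import Sequence

REPORTED_SUCCESS_RATE = 0.929  # success rate reported in the paper


def success_rate(outcomes: Sequence[bool]) -> float:
    """Fraction of held-out repositories that reached a usable state."""
    if not outcomes:
        raise ValueError("need at least one repository outcome")
    return sum(outcomes) / len(outcomes)


def shortfall_points(
    outcomes: Sequence[bool], reported: float = REPORTED_SUCCESS_RATE
) -> float:
    """Percentage points below the reported rate (negative if above it)."""
    return (reported - success_rate(outcomes)) * 100


# Made-up outcomes for illustration: 17 successes out of 20 fresh repositories.
held_out = [True] * 17 + [False] * 3
gap = shortfall_points(held_out)
```

With a small held-out set the gap is noisy, so a real study would report an interval around the rate as well as the point estimate.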

Figures

Figures reproduced from arXiv: 2605.15815 by Chengkun Wei, Jin Shi, Oucheng Liu, Shiyuan Wang, Sihan Fu.

Figure 1: The process of deploying a repo by Code Agent.
Figure 2: Overview of BootstrapAgent.
Text carried over with the figure (§3.6 Containerized Verification): This stage executes the generated contract inside a Docker environment. It is necessary because repository bootstrap correctness cannot be reliably inferred from static evidence alone. The repository is mounted at a fixed path, and the verifier runs setup, doctor, minimal verification, strongest verification, and optional run probes with stage-spec…
Figure 5: Relationship between token usage and time cost
Figure 6: Warm-starting results across benchmarks. The first row reports wall-clock time, and the
Figure 7: Estimated cost without warm repair. [Plot: estimated wall-clock reduction (%) for Repo2Run, ExecutionAgent, and Installamatic; medians listed as 42.9%, 30.2%, and 32.2%.]
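
The containerized verification stage described alongside Figure 2 can be sketched as follows. This is an illustration under stated assumptions, not the paper's verifier: the stage keys, the `::stage` marker convention, the `/workspace` mount path, and the helper names `docker_argv`, `build_script`, and `verify` are all hypothetical. Only standard `docker run` flags (`--rm`, `-v`, `-w`) are used. All stages run in one container so that state created by setup is still present for the later stages.

```python
import subprocess
from typing import Callable, Dict, List, Tuple

# Stage order follows the text; the dictionary keys are assumed names.
STAGES = ["setup", "doctor", "minimal_verification", "strongest_verification"]


def docker_argv(image: str, repo_dir: str, script: str, mount: str = "/workspace") -> List[str]:
    """Build a `docker run` command that mounts the repo at a fixed path.

    repo_dir should be an absolute host path for the bind mount.
    """
    return ["docker", "run", "--rm", "-v", f"{repo_dir}:{mount}",
            "-w", mount, image, "sh", "-c", script]


def build_script(contract: Dict[str, List[str]]) -> str:
    """Join all stage commands into one shell script that stops at the first failure."""
    lines = ["set -e"]
    for stage in STAGES:
        lines.append(f"echo '::stage {stage}'")  # marker used to locate failures
        lines.extend(contract.get(stage, []))
    return "\n".join(lines)


def verify(contract: Dict[str, List[str]], image: str, repo_dir: str,
           runner: Callable = subprocess.run) -> Tuple[bool, str]:
    """Run the contract in a container; return (passed, last stage started)."""
    proc = runner(docker_argv(image, repo_dir, build_script(contract)),
                  capture_output=True, text=True)
    markers = [line.split(" ", 1)[1] for line in proc.stdout.splitlines()
               if line.startswith("::stage ")]
    last_stage = markers[-1] if markers else "startup"
    return proc.returncode == 0, last_stage
```

The `runner` parameter exists so the logic can be exercised with a stub when Docker is unavailable; on failure, the second element of the result names the stage whose trace a repair step would inspect.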
Original abstract

Code agents increasingly help developers work with unfamiliar repositories, but every such task depends on a costly prerequisite: bootstrapping the repository into a usable development state. This process requires substantial trial-and-error exploration, yet the resulting knowledge--resolved dependencies, repair strategies--stays trapped in a single conversation, unavailable to future agents. We therefore formulate repository bootstrapping as a reusable startup knowledge problem and introduce BootstrapAgent, a multi-agent framework that distills the heuristics discovered during bootstrap exploration into a persistent, verifiable, agent-consumable .bootstrap contract. Through evidence extraction, structured planning, deterministic Docker-based verification, and trace-driven repair, BootstrapAgent generates a contract covering environment setup, diagnostic checks, minimal verification, and accumulated repair knowledge. We further propose warm repair with clean replay to accelerate iterative debugging without sacrificing cold-start reproducibility, and a delta repair with sanity check to prevent reward hacking. Experiments on three benchmarks show that BootstrapAgent achieves a 92.9% success rate, outperforming the baseline by over 10% while reducing downstream agent token usage by 25.9% and build time by 22.3%. Our code is available at https://github.com/Vossera/BootstrapAgent.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper introduces BootstrapAgent, a multi-agent framework for distilling repository bootstrapping knowledge—such as resolved dependencies and repair strategies—into a reusable, verifiable .bootstrap contract. The approach uses evidence extraction, structured planning, deterministic Docker verification, warm repair with clean replay, and delta repair with sanity checks. Experiments on three benchmarks report a 92.9% success rate (over 10% above baseline), 25.9% lower downstream token usage, and 22.3% shorter build times, with code released at https://github.com/Vossera/BootstrapAgent.

Significance. If the central empirical claims hold under broader testing, the work addresses a practical bottleneck in code agents by converting per-repository trial-and-error into portable startup knowledge. The open code release and emphasis on verifiable contracts are positive for reproducibility. The reported efficiency gains could meaningfully reduce agent costs in repository interaction tasks.

major comments (1)
  1. [§5] §5 (Experiments): The evaluation is restricted to three benchmarks with no held-out repository set, cross-benchmark transfer test, or zero-shot application to arbitrary external repositories. This directly undermines the reusability claim for the .bootstrap contract, as the headline metrics (92.9% success, 10%+ lift, 25.9% token and 22.3% time reductions) rest on the untested assumption that the contract encodes portable invariants rather than benchmark-specific heuristics.
minor comments (2)
  1. [Abstract] Abstract: The description of benchmark selection, baseline implementations, and statistical testing is absent; adding one sentence on these would improve clarity without altering the narrative.
  2. [§3] Notation: The term '.bootstrap contract' is introduced without an early formal definition or example schema; a small table or listing in §3 would help readers parse the contract components.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The framework rests on domain assumptions about containerized environments and introduces a new contract artifact whose generality is asserted rather than independently validated outside the reported experiments.

axioms (1)
  • domain assumption: Docker containers provide a sufficiently deterministic and reproducible environment for verifying repository setup steps.
    Central to the verification and repair pipeline described in the abstract.
invented entities (1)
  • .bootstrap contract (no independent evidence)
    purpose: Structured, persistent, agent-consumable artifact that stores distilled setup knowledge, diagnostics, and repairs.
    New file format and content schema introduced by the framework.

pith-pipeline@v0.9.0 · 5753 in / 1307 out tokens · 120319 ms · 2026-05-20T17:00:39.811921+00:00 · methodology

