BootstrapAgent: Distilling Repository Setup into Reusable Agent Knowledge

Chengkun Wei; Jin Shi; Oucheng Liu; Shiyuan Wang; Sihan Fu


BootstrapAgent distills the trial-and-error knowledge of setting up software repositories into a reusable .bootstrap contract that future agents can follow directly.

Reviewed by Pith at T0; open to challenge. T0 means a machine referee read the full paper against a public rubric.


T0 review · grok-4.3

2026-05-20 17:00 UTC pith:C6QOUXW2

arxiv 2605.15815 v1 pith:C6QOUXW2 submitted 2026-05-15 cs.SE cs.CL cs.MA


Sihan Fu, Oucheng Liu, Shiyuan Wang, Jin Shi, Chengkun Wei

classification cs.SE cs.CL cs.MA

keywords repository bootstrapping, agent knowledge distillation, software setup automation, multi-agent framework, Docker verification, repair strategies, reproducible environment setup, code agent efficiency

verification ladder T0 review · T1 audit · T2 compute · T3 formal · T4 reserved

The pith

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper treats the initial setup of unfamiliar code repositories as a knowledge-capture problem rather than a one-off task. It presents BootstrapAgent, a multi-agent system that runs exploratory bootstrapping, then packages the resulting heuristics into a persistent, machine-readable .bootstrap contract covering environment setup, diagnostic checks, minimal verification, and repair tactics. The contract is produced through evidence extraction, structured planning, Docker-based verification, and trace-driven repair. Two further techniques support the loop: warm repair with clean replay, which speeds up iterative debugging without giving up cold-start reproducibility, and delta repair with a sanity check, which guards against reward hacking. Experiments across three benchmarks report a 92.9 percent success rate, which the abstract describes as over 10% above the baseline, together with 25.9 percent lower token consumption and 22.3 percent shorter build times for downstream agents. A sympathetic reader sees the work as converting per-instance exploration cost into shareable, verifiable startup knowledge that agents can reuse without repeating the same search.

Core claim

BootstrapAgent formulates repository bootstrapping as a reusable startup-knowledge problem and solves it by distilling exploration traces into a .bootstrap contract that encodes environment setup, diagnostic checks, minimal verification, and accumulated repair knowledge. The contract is generated through evidence extraction, structured planning, deterministic Docker-based verification, and trace-driven repair, augmented by warm repair with clean replay for faster iteration and delta repair with sanity checks to avoid reward hacking.
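One way to read "warm repair with clean replay" is as a two-phase loop: iterate on fixes in an already-provisioned environment, then accept the result only if the same steps also pass from a clean state. The sketch below is an illustration of that reading, not the paper's algorithm. The function name and the three callables (`verify_warm`, `verify_clean`, `propose_fix`) are hypothetical stand-ins for whatever the framework actually uses.

```python
from typing import Callable, Optional

Steps = list[str]


def warm_repair_then_replay(
    steps: Steps,
    verify_warm: Callable[[Steps], bool],    # hypothetical: check in the reused environment
    verify_clean: Callable[[Steps], bool],   # hypothetical: check after a cold start
    propose_fix: Callable[[Steps], Optional[Steps]],  # hypothetical repair proposer
    max_rounds: int = 3,
) -> Optional[Steps]:
    """Repair in a warm environment, then require a clean replay to pass."""
    candidate = list(steps)
    for _ in range(max_rounds):
        if verify_warm(candidate):
            break
        fixed = propose_fix(candidate)
        if fixed is None:
            return None  # no further repair available
        candidate = fixed
    else:
        # Rounds exhausted: give the last proposed fix one final warm check.
        if not verify_warm(candidate):
            return None
    # A warm pass may rely on leftover state, so the clean replay decides.
    return candidate if verify_clean(candidate) else None


# Simulated environment: verification passes once the missing step is present.
needed = "pip install requests"
has_needed = lambda s: needed in s
add_needed = lambda s: s + [needed]

repaired = warm_repair_then_replay(["pip install -e ."], has_needed, has_needed, add_needed)
leaky = warm_repair_then_replay(["pip install -e ."], lambda s: True, has_needed, add_needed)
```

In the second call the warm environment always passes, as it might if an earlier interactive command had left a dependency installed. The clean replay rejects that result, which is the failure mode the cold-start check is meant to catch.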

What carries the argument

The .bootstrap contract: a persistent, verifiable, agent-consumable artifact that captures setup heuristics, checks, verification steps, and repair strategies so downstream agents can apply them without re-exploring the repository.
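To make the four components concrete, here is a minimal sketch of how such a contract might be held in memory. The paper's actual schema is not shown on this page, so every class name, field name, and command below is an assumption chosen for illustration.

```python
from dataclasses import dataclass, field, asdict
import json


@dataclass
class RepairNote:
    """One piece of accumulated repair knowledge (hypothetical shape)."""
    symptom: str  # text expected to appear in a failing log
    fix: str      # command that resolved the failure


@dataclass
class BootstrapContract:
    """Hypothetical in-memory view of a .bootstrap contract."""
    setup: list[str]            # environment setup commands
    doctor: list[str]           # diagnostic checks
    verify_minimal: list[str]   # cheapest check that the repository is usable
    repairs: list[RepairNote] = field(default_factory=list)

    def to_json(self) -> str:
        """Serialize the contract so another agent could load it."""
        return json.dumps(asdict(self), indent=2)

    def suggest_fixes(self, log: str) -> list[str]:
        """Return fixes whose recorded symptom appears in a failure log."""
        return [note.fix for note in self.repairs if note.symptom in log]


# Placeholder commands, not taken from the paper.
contract = BootstrapContract(
    setup=["python -m venv .venv", ".venv/bin/pip install -e ."],
    doctor=[".venv/bin/python --version"],
    verify_minimal=[".venv/bin/python -c 'import sys'"],
    repairs=[RepairNote("No module named 'setuptools'",
                        ".venv/bin/pip install setuptools")],
)
```

A downstream agent could call `contract.suggest_fixes(log)` on a failing build log before starting its own exploration, which is the reuse pattern the review describes.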

Load-bearing premise

The distilled .bootstrap contract will remain effective and generalizable when applied by downstream agents to repositories outside the three evaluated benchmarks without requiring substantial extra adaptation.

What would settle it

Run the generated .bootstrap contracts on a fresh collection of repositories never seen during the original experiments and measure whether success rate falls below the reported 92.9 percent or whether agents require major additional human-written fixes to reach usable states.
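The comparison proposed above reduces to a small calculation. The sketch below assumes each held-out repository yields a single pass/fail outcome; the helper names are hypothetical and the example outcomes are placeholders, not measured results.

```python
REPORTED_SUCCESS_RATE = 0.929  # success rate reported in the paper


def success_rate(outcomes: list[bool]) -> float:
    """Fraction of repositories that reached a usable state."""
    if not outcomes:
        raise ValueError("need at least one repository outcome")
    return sum(outcomes) / len(outcomes)


def falls_short(outcomes: list[bool],
                reported: float = REPORTED_SUCCESS_RATE) -> bool:
    """True when the held-out success rate is below the reported one."""
    return success_rate(outcomes) < reported


# Placeholder outcomes for illustration only: 17 passes, 3 failures.
example_outcomes = [True] * 17 + [False] * 3
held_out_rate = success_rate(example_outcomes)
```

With a sample this small, a point comparison is noisy, so a real study would likely also report a confidence interval and the amount of manual fixing each failure needed.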


If this is right

Downstream code agents reach usable repository states more reliably and with lower token budgets.
Build and setup times decrease because the contract supplies pre-validated steps instead of repeated trial-and-error.
Knowledge accumulated during one bootstrap run becomes available to all future agents working on similar repositories.
Verification remains reproducible because the contract is executed inside deterministic Docker environments.
Repair strategies become shareable artifacts rather than conversation-specific memories.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same distillation approach could be applied to other recurring agent prerequisites such as data-preparation pipelines or test-environment configuration.
If contracts are stored in a shared repository they could form a growing library of startup knowledge that improves over time as more agents contribute.
The method might reduce the need for human-written setup scripts in open-source projects by letting agents produce and maintain their own .bootstrap contracts.
Integration with larger agent workflows could let a single contract serve as the entry point for multiple downstream tasks on the same repository.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit.

Referee Report

1 major / 2 minor

Summary. The paper introduces BootstrapAgent, a multi-agent framework for distilling repository bootstrapping knowledge—such as resolved dependencies and repair strategies—into a reusable, verifiable .bootstrap contract. The approach uses evidence extraction, structured planning, deterministic Docker verification, warm repair with clean replay, and delta repair with sanity checks. Experiments on three benchmarks report a 92.9% success rate (over 10% above baseline), 25.9% lower downstream token usage, and 22.3% shorter build times, with code released at https://github.com/Vossera/BootstrapAgent.

Significance. If the central empirical claims hold under broader testing, the work addresses a practical bottleneck in code agents by converting per-repository trial-and-error into portable startup knowledge. The open code release and emphasis on verifiable contracts are positive for reproducibility. The reported efficiency gains could meaningfully reduce agent costs in repository interaction tasks.

major comments (1)

§5 (Experiments): The evaluation is restricted to three benchmarks with no held-out repository set, cross-benchmark transfer test, or zero-shot application to arbitrary external repositories. This directly undermines the reusability claim for the .bootstrap contract, as the headline metrics (92.9% success, 10%+ lift, 25.9% token and 22.3% time reductions) rest on the untested assumption that the contract encodes portable invariants rather than benchmark-specific heuristics.

minor comments (2)

Abstract: The abstract does not describe benchmark selection, baseline implementations, or statistical testing; adding one sentence on these would improve clarity without altering the narrative.
§3 (Notation): The term '.bootstrap contract' is introduced without an early formal definition or example schema; a small table or listing in §3 would help readers parse the contract components.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The framework rests on domain assumptions about containerized environments and introduces a new contract artifact whose generality is asserted rather than independently validated outside the reported experiments.

axioms (1)

Domain assumption: Docker containers provide a sufficiently deterministic and reproducible environment for verifying repository setup steps.
Central to the verification and repair pipeline described in the abstract.
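As an illustration of what this assumption asks of Docker, the sketch below builds a `docker run` invocation that mounts a repository at a fixed path and runs one verification command in a throwaway container. The helper name, the image, the host path, and the command are placeholders; only the standard `docker run` flags `--rm`, `-v`, and `-w` are used.

```python
import shlex


def docker_verify_command(image: str, repo_dir: str, command: str,
                          mount_path: str = "/workspace") -> list[str]:
    """Build (but do not execute) a docker run command for one check."""
    return [
        "docker", "run",
        "--rm",                            # discard the container afterwards
        "-v", f"{repo_dir}:{mount_path}",  # mount the repository at a fixed path
        "-w", mount_path,                  # run the command from that path
        image,
        "sh", "-c", command,
    ]


# Placeholder values for illustration.
cmd = docker_verify_command("python:3.12-slim", "/tmp/repo", "python --version")
printable = shlex.join(cmd)
```

The list can be passed to `subprocess.run(cmd)` on a machine with Docker installed. Tags such as `python:3.12-slim` can move over time, so pinning the image by digest would bring the run closer to the determinism the axiom assumes.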

invented entities (1)

.bootstrap contract (no independent evidence)
Purpose: structured, persistent, agent-consumable artifact that stores distilled setup knowledge, diagnostics, and repairs.
New file format and content schema introduced by the framework.

pith-pipeline@v0.9.0 · 5753 in / 1307 out tokens · 120319 ms · 2026-05-20T17:00:39.811921+00:00 · methodology


Original abstract

Code agents increasingly help developers work with unfamiliar repositories, but every such task depends on a costly prerequisite: bootstrapping the repository into a usable development state. This process requires substantial trial-and-error exploration, yet the resulting knowledge (resolved dependencies, repair strategies) stays trapped in a single conversation, unavailable to future agents. We therefore formulate repository bootstrapping as a reusable startup knowledge problem and introduce BootstrapAgent, a multi-agent framework that distills the heuristics discovered during bootstrap exploration into a persistent, verifiable, agent-consumable .bootstrap contract. Through evidence extraction, structured planning, deterministic Docker-based verification, and trace-driven repair, BootstrapAgent generates a contract covering environment setup, diagnostic checks, minimal verification, and accumulated repair knowledge. We further propose warm repair with clean replay to accelerate iterative debugging without sacrificing cold-start reproducibility, and a delta repair with sanity check to prevent reward hacking. Experiments on three benchmarks show that BootstrapAgent achieves a 92.9% success rate, outperforming the baseline by over 10% while reducing downstream agent token usage by 25.9% and build time by 22.3%. Our code is available at https://github.com/Vossera/BootstrapAgent.
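The phrase "over 10%" can be read as a gap in percentage points or as a relative improvement, and the two readings give different numbers. The sketch below shows the arithmetic. The baseline of 80.0 and the token and time budgets in the usage note are placeholders, since this page does not state the baseline's actual values.

```python
def point_lift(new: float, old: float) -> float:
    """Absolute difference between two rates, in percentage points."""
    return new - old


def relative_lift(new: float, old: float) -> float:
    """Improvement expressed as a percentage of the old rate."""
    return (new - old) / old * 100.0


def after_reduction(baseline: float, percent: float) -> float:
    """Value that remains after reducing a baseline by the given percent."""
    return baseline * (1.0 - percent / 100.0)


HYPOTHETICAL_BASELINE = 80.0  # assumption: not a figure from the paper

points = point_lift(92.9, HYPOTHETICAL_BASELINE)
relative = relative_lift(92.9, HYPOTHETICAL_BASELINE)
tokens_left = after_reduction(100_000, 25.9)  # placeholder budget of 100,000 tokens
```

Under this placeholder baseline the gap is about 12.9 points, or roughly 16.1 percent in relative terms, so both readings would satisfy "over 10%". A 25.9% reduction on a 100,000-token budget leaves about 74,100 tokens.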

Figures

Figures reproduced from arXiv: 2605.15815 by Chengkun Wei, Jin Shi, Oucheng Liu, Shiyuan Wang, Sihan Fu.

**Figure 2.** Overview of BootstrapAgent. Accompanying text from §3.6, Containerized Verification: This stage executes the generated contract inside a Docker environment. It is necessary because repository bootstrap correctness cannot be reliably inferred from static evidence alone. The repository is mounted at a fixed path, and the verifier runs setup, doctor, minimal verification, strongest verification, and optional run probes with stage-spec…

**Figure 5.** Relationship between token usage and time cost.

**Figure 6.** Warm-starting results across benchmarks. The first row reports wall-clock time, and the …

**Figure 7.** Estimated cost without warm repair. Panels: Repo2Run, ExecutionAgent, Installamatic. Axis: estimated wall-clock reduction (%). Medians, in the order extracted: 42.9%, 30.2%, 32.2%.
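The verification stage described in the Figure 2 excerpt names an ordered list of stages. A minimal sketch of running them in sequence follows. The stage keys, the `run_stages` helper, and the rule that later stages are skipped after a failure are assumptions, since the excerpt is truncated before it explains stage-specific behavior.

```python
from typing import Callable

# Stage order taken from the excerpt; the key spellings are hypothetical.
STAGES = ["setup", "doctor", "verify_minimal", "verify_strongest", "run_probe"]
OPTIONAL = {"run_probe"}

Runner = Callable[[str], int]  # takes a shell command, returns an exit code


def run_stages(commands: dict[str, str], runner: Runner) -> dict[str, str]:
    """Run stages in order and record passed, failed, missing, or skipped."""
    report: dict[str, str] = {}
    stopped = False
    for stage in STAGES:
        command = commands.get(stage)
        if stopped:
            report[stage] = "skipped"
        elif command is None:
            if stage in OPTIONAL:
                report[stage] = "skipped"
            else:
                report[stage] = "missing"
                stopped = True
        elif runner(command) == 0:
            report[stage] = "passed"
        else:
            report[stage] = "failed"
            stopped = True
    return report


# A fake runner keeps the example self-contained; a real one might call Docker.
fake_runner = lambda command: 1 if command == "false" else 0
report = run_stages(
    {"setup": "true", "doctor": "true",
     "verify_minimal": "false", "verify_strongest": "true"},
    fake_runner,
)
```

Passing the runner in as a parameter keeps the stage logic testable without a container runtime, and it lets the same loop serve both warm and clean runs.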



[1] [1]

Claude Code: AI-Powered Coding Assistant for Developers

Anthropic. Claude Code: AI-Powered Coding Assistant for Developers. https://claude. ai/code, 2026. Accessed: 2026-04-23

work page 2026

[2] [2]

How Claude Remembers Your Project

Anthropic. How Claude Remembers Your Project. https://code.claude.com/docs/en/ memory, 2026

work page 2026

[3] [3]

SetupBench: Assessing software engineering agents’ ability to bootstrap development environments.arXiv preprint arXiv:2507.09063, 2025

Avi Arora, Jinu Jang, and Roshanak Zilouchian Moghaddam. Setupbench: Assessing software engineering agents’ ability to bootstrap development environments, 2025. URL https:// arxiv.org/abs/2507.09063

work page arXiv 2025

[4] [4]

You name it, i run it: An llm agent to execute tests of arbitrary projects,

Islem Bouzenia and Michael Pradel. You name it, i run it: An llm agent to execute tests of arbitrary projects.Proc. ACM Softw. Eng., 2(ISSTA), June 2025. doi: 10.1145/3728922. URL https://doi.org/10.1145/3728922

work page doi:10.1145/3728922 2025

[5] [5]

arXiv preprint arXiv:2208.08227 , year=

Federico Cassano, John Gouwar, Daniel Nguyen, Sydney Nguyen, Luna Phipps-Costin, Donald Pinckney, Ming-Ho Yee, Yangtian Zi, Carolyn Jane Anderson, Molly Q Feldman, Arjun Guha, Michael Greenberg, and Abhinav Jangda. Multipl-e: A scalable and extensible approach to benchmarking neural code generation, 2022. URLhttps://arxiv.org/abs/2208.08227

work page arXiv 2022

[6] [6]

Hassan, and Hajimu Iida

Worawalan Chatlatanagulchai, Hao Li, Yutaro Kashiwa, Brittany Reid, Kundjanasith Thonglek, Pattara Leelaprute, Arnon Rungsawang, Bundit Manaskasemsak, Bram Adams, Ahmed E. Hassan, and Hajimu Iida. Agent readmes: An empirical study of context files for agentic coding,

work page

[7] [7]

URLhttps://arxiv.org/abs/2511.12884

work page arXiv

[8] [8]

SecureVibeBench: Benchmarking Secure Vibe Coding of AI Agents via Reconstructing Vulnerability-Introducing Scenarios

Junkai Chen, Huihui Huang, Yunbo Lyu, Junwen An, Jieke Shi, Chengran Yang, Ting Zhang, Haoye Tian, Yikun Li, Zhenhao Li, Xin Zhou, Xing Hu, and David Lo. Securevibebench: Evaluating secure coding capabilities of code agents with realistic vulnerability scenarios, 2026. URLhttps://arxiv.org/abs/2509.22097

work page internal anchor Pith review Pith/arXiv arXiv 2026

[9] [9]

Swe-dev: Evaluating and training autonomous feature-driven software development, 2026

Yaxin Du, Yuzhu Cai, Yifan Zhou, Cheng Wang, Yu Qian, Xianghe Pang, Qian Liu, Yue Hu, and Siheng Chen. Swe-dev: Evaluating and training autonomous feature-driven software development, 2026. URLhttps://arxiv.org/abs/2505.16975

work page arXiv 2026

[10] [10]

EnvBench: A benchmark for automated environment setup.ArXiv, abs/2503.14443, 2025

Aleksandra Eliseeva, Alexander Kovrigin, Ilia Kholkin, Egor Bogomolov, and Yaroslav Zharov. Envbench: A benchmark for automated environment setup, 2025. URL https://arxiv.org/ abs/2503.14443

work page arXiv 2025

[11] [11]

Evaluating AGENTS.md: Are Repository-Level Context Files Helpful for Coding Agents?

Thibaud Gloaguen, Niels Mündler, Mark Müller, Veselin Raychev, and Martin Vechev. Eval- uating agents.md: Are repository-level context files helpful for coding agents?, 2026. URL https://arxiv.org/abs/2602.11988

work page internal anchor Pith review arXiv 2026

[12] [12]

Gemini CLI: Build, Debug, and Deploy with AI

Google. Gemini CLI: Build, Debug, and Deploy with AI. https://geminicli.com/, 2026. Accessed: 2026-04-23

work page 2026

[13] [13]

Swe-factory: Your automated factory for issue resolution training data and evaluation benchmarks

Lianghong Guo, Yanlin Wang, Caihua Li, Wei Tao, Pengyu Yang, Jiachi Chen, Haoyu Song, Duyu Tang, and Zibin Zheng. Swe-factory: Your automated factory for issue resolution training data and evaluation benchmarks, 2026. URLhttps://arxiv.org/abs/2506.10954

work page arXiv 2026

[14] [14]

Nam Le Hai, Dung Manh Nguyen, and Nghi D. Q. Bui. On the impacts of contexts on repository-level code generation, 2025. URLhttps://arxiv.org/abs/2406.11927

work page Pith review arXiv 2025

[15] [15]

Swe-perf: Can language models optimize code performance on real-world repositories?,

Xinyi He, Qian Liu, Mingzhe Du, Lin Yan, Zhijie Fan, Yiming Huang, Zejian Yuan, and Zejun Ma. Swe-perf: Can language models optimize code performance on real-world repositories?,

work page

[16] [16]

URLhttps://arxiv.org/abs/2507.12415

work page internal anchor Pith review arXiv

[17] [17]

DockerizeMe: Automatic Inference of Environment Dependencies for Python Code Snippets

Eric Horton and Chris Parnin. Dockerizeme: Automatic inference of environment dependencies for python code snippets, 2019. URLhttps://arxiv.org/abs/1905.11127

work page internal anchor Pith review Pith/arXiv arXiv 2019

[18] [18]

Repo2run: Automated building executable environment for code repository at scale, 2025

Ruida Hu, Chao Peng, Xinchen Wang, Junjielong Xu, and Cuiyun Gao. Repo2run: Automated building executable environment for code repository at scale, 2025. URL https://arxiv. org/abs/2502.13681. 10

work page arXiv 2025

[19] [19]

R2E: Turning any github repository into a programming agent environment

Naman Jain, Manish Shetty, Tianjun Zhang, King Han, Koushik Sen, and Ion Stoica. R2E: Turning any github repository into a programming agent environment. In Ruslan Salakhutdi- nov, Zico Kolter, Katherine Heller, Adrian Weller, Nuria Oliver, Jonathan Scarlett, and Felix Berkenkamp, editors,Proceedings of the 41st International Conference on Machine Learnin...

[20]

Naman Jain, Jaskirat Singh, Manish Shetty, Liang Zheng, Koushik Sen, and Ion Stoica. R2E-Gym: Procedural environments and hybrid verifiers for scaling open-weights SWE agents, 2025. URL https://arxiv.org/abs/2504.07164

[21]

Carlos E Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik R Narasimhan. SWE-bench: Can language models resolve real-world github issues? In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=VTF8yNQM66

[22]

Alexander Kovrigin, Aleksandra Eliseeva, Konstantin Grotov, Egor Bogomolov, and Yaroslav Zharov. Piper: On-device environment setup via online reinforcement learning, 2025. URL https://arxiv.org/abs/2509.25455

[23]

Hao Li, Hicham Masri, Filipe R. Cogo, Abdul Ali Bangash, Bram Adams, and Ahmed E. Hassan. Understanding prompt management in github repositories: A call for best practices. IEEE Software, 43(2):85–93, March 2026. ISSN 1937-4194. doi: 10.1109/ms.2025.3644251. URL http://dx.doi.org/10.1109/MS.2025.3644251

[24]

Xiang Li, Siyu Lu, Federica Sarro, Claire Le Goues, and He Ye. HerAgent: Rethinking the automated environment deployment via hierarchical test pyramid, 2026. URL https://arxiv.org/abs/2602.07871

[25]

Tianyang Liu, Canwen Xu, and Julian J. McAuley. Repobench: Benchmarking repository-level code auto-completion systems. In The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview.net, 2024. URL https://openreview.net/forum?id=pPjZIOuQuF

[26]

Oorja Majgaonkar, Zhiwei Fei, Xiang Li, Federica Sarro, and He Ye. Understanding code agent behaviour: An empirical study of success and failure trajectories, 2025. URL https://arxiv.org/abs/2511.00197

[27]

Louis Milliken, Sungmin Kang, and Shin Yoo. Beyond pip install: Evaluating llm agents for the automated installation of python projects, 2024. URL https://arxiv.org/abs/2412.06294

[28]

Seyedmoein Mohsenimofidi, Matthias Galster, Christoph Treude, and Sebastian Baltes. Context engineering for ai agents in open-source software, 2026. URL https://arxiv.org/abs/2510.21413

[29]

Niels Mündler, Mark Niklas Müller, Jingxuan He, and Martin Vechev. Swt-bench: Testing and validating real-world bug-fixes with code agents. In A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczak, and C. Zhang, editors, Advances in Neural Information Processing Systems, volume 37, pages 81857–81887. Curran Associates, Inc., 2024. doi: 10.5...

[30]

OpenAI. Codex: AI Coding Partner from OpenAI. https://openai.com/codex/, 2026. Accessed: 2026-04-23

[31]

Alfin Wijaya Rahardja, Junwei Liu, Weitong Chen, Zhenpeng Chen, and Yiling Lou. Can agents fix agent issues?, 2025. URL https://arxiv.org/abs/2505.20749

[32]

Giovanni Rosa, Antonio Mastropaolo, Simone Scalabrino, Gabriele Bavota, and Rocco Oliveto. Automatically generating dockerfiles via deep learning: Challenges and promises, 2023. URL https://arxiv.org/abs/2303.15990

[33]

Disha Shrivastava, Hugo Larochelle, and Daniel Tarlow. Repository-level prompt generation for large language models of code, 2023. URL https://arxiv.org/abs/2206.12839

[34]

Beatriz Souza and Michael Pradel. Treefix: Enabling execution with a tree of prefixes, 2025. URL https://arxiv.org/abs/2501.12339

[35]

Spring Boot. Hot Swapping. https://docs.spring.io/spring-boot/how-to/hotswapping.html. Spring Boot Reference Documentation. Accessed: 2026-05-10

[36]

Matúš Sulír, Jaroslav Porubän, and Sergej Chodarev. Local software buildability across java versions (registered report), 2024. URL https://arxiv.org/abs/2408.11544

[37]

Sander Valstar, William G. Griswold, and Leo Porter. Using devcontainers to standardize student development environments: An experience report. In Proceedings of the 2020 ACM Conference on Innovation and Technology in Computer Science Education, ITiCSE ’20, page 377–383, New York, NY, USA, 2020. Association for Computing Machinery. ISBN 9781450368742. doi...

[38]

Lilin Wang, Lucas Ramalho, Alan Celestino, Phuc Anthony Pham, Yu Liu, Umang Kumar Sinha, Andres Portillo, Onassis Osunwa, and Gabriel Maduekwe. Swe-bench++: A framework for the scalable generation of software engineering benchmarks from open-source repositories. URL https://arxiv.org/abs/2512.17419

[40]

Jiayi Weng. Learning beyond gradients. https://trinkle23897.github.io/learning-beyond-gradients/, May 2026. Blog post

[41]

Yijia Xiao, Runhui Wang, Luyang Kong, Davor Golac, and Wei Wang. CSR-bench: Benchmarking LLM agents in deployment of computer science research repositories. In Luis Chiruzzo, Alan Ritter, and Lu Wang, editors, Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technol... Association for Computational Linguistics. ISBN 979-8-89176-189-6. doi: 10.18653/v1/2025.naacl-long.633. URL https://aclanthology.org/2025.naacl-long.633/

[43]

Yiqing Xie, Alex Xie, Divyanshu Sheth, Pengfei Liu, Daniel Fried, and Carolyn Rose. Repost: Scalable repository-level coding environment construction with sandbox testing, 2025. URL https://arxiv.org/abs/2503.07358

[44]

Yiqing Xie, Alex Xie, Divyanshu Sheth, Pengfei Liu, Daniel Fried, and Carolyn Rose. RepoST: Scalable repository-level coding environment construction with sandbox testing. In Second Conference on Language Modeling, 2025. URL https://openreview.net/forum?id=2txrMBpw3q

[45]

John Yang, Carlos E. Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik Narasimhan, and Ofir Press. SWE-agent: Agent-computer interfaces enable automated software engineering, 2024. URL https://arxiv.org/abs/2405.15793

[46]

John Yang, Carlos E Jimenez, Alex L Zhang, Kilian Lieret, Joyce Yang, Xindi Wu, Ori Press, Niklas Muennighoff, Gabriel Synnaeve, Karthik R Narasimhan, Diyi Yang, Sida Wang, and Ofir Press. SWE-bench multimodal: Do AI systems generalize to visual software domains? In The Thirteenth International Conference on Learning Representations, 2025. URL https://ope...

[47]

He Ye, Matias Martinez, and Martin Monperrus. Neural program repair with execution-based backpropagation. In Proceedings of the 44th International Conference on Software Engineering, ICSE ’22, page 1506–1518, New York, NY, USA, 2022. Association for Computing Machinery. ISBN 9781450392211. doi: 10.1145/3510003.3510222. URL https://doi.org/10.1145/3510003.3510222

[48]

Hongjie Ye, Jiahong Zhou, Wei Chen, Jiaxin Zhu, Guoquan Wu, and Jun Wei. Dockergen: A knowledge graph based approach for software containerization. In 2021 IEEE 45th Annual Computers, Software, and Applications Conference (COMPSAC), pages 986–991, 2021. doi: 10.1109/COMPSAC51774.2021.00133.

[49]

Zhengmin Yu, Yuan Zhang, Ming Wen, Yinan Nie, Wenhui Zhang, and Min Yang. Cxxcrafter: An llm-based agent for automated c/c++ open source software building. Proc. ACM Softw. Eng., 2(FSE), June 2025. doi: 10.1145/3729386. URL https://doi.org/10.1145/3729386

[50]

Linghao Zhang, Junhao Wang, Shilin He, Chaoyun Zhang, Yu Kang, Bowen Li, Jiaheng Wen, Chengxing Xie, Maoquan Wang, Yufan Huang, Elsie Nallipogu, Qingwei Lin, Yingnong Dang, Saravan Rajmohan, Dongmei Zhang, and Qi Zhang. Di-bench: Benchmarking large language models on dependency inference with testable repositories at scale, 2025. URL https://arxiv.org/abs...
