Building an Internal Coding Agent at Zup: Lessons and Open Questions
Pith reviewed 2026-05-10 16:38 UTC · model grok-4.3
The pith
At Zup, targeted tool design and layered safety guardrails improved coding-agent reliability more than prompt engineering did.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors state that technical model quality alone is insufficient for production-ready coding agents. What proves decisive instead is targeted tool design (such as string-replacement edits in place of full-file rewrites), layered safety guardrails, state management, and progressive human oversight modes. These elements improved reliability more than prompt engineering and enabled organic adoption without mandating trust, demonstrating that the engineering decisions around the model determine whether a coding agent delivers value in practice.
What carries the argument
String-replacement edit tools paired with layered safety guardrails and progressive human oversight modes that allow users to select review intensity.
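Neither the paper nor this review reproduces CodeGen's edit tool, so as an illustration of the mechanism being credited, here is a minimal sketch of a string-replacement edit helper. The name `apply_edit` and its exact-match policy are assumptions for illustration, not the paper's API:

```python
def apply_edit(source: str, old: str, new: str) -> str:
    """Apply a string-replacement edit: `old` must match exactly once.

    Failing loudly on zero or multiple matches is what makes this safer
    than a full-file rewrite: the model cannot silently drop or duplicate
    code it never mentioned.
    """
    count = source.count(old)
    if count == 0:
        raise ValueError("edit rejected: target text not found")
    if count > 1:
        raise ValueError(f"edit rejected: target text matches {count} locations")
    return source.replace(old, new)

# Usage: only the named span changes; the rest of the file is untouched.
code = "def greet():\n    print('hi')\n"
patched = apply_edit(code, "print('hi')", "print('hello')")
```

The uniqueness check is the design point: a full-file rewrite can introduce errors anywhere in the file, while a rejected-by-default replacement confines the blast radius to the span the model explicitly quoted.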
If this is right
- Production coding agents can reach usable reliability through system-level choices rather than model-scale increases alone.
- Flexible oversight levels allow adoption to grow as users observe consistent behavior without forced trust.
- Development resources shift toward edit mechanics, constraint enforcement, and state handling over isolated prompt tuning.
- Similar system designs may reduce the prototype-to-production gap for other internal AI tools.
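To make the "constraint enforcement" point concrete, here is a hedged sketch of what layered guardrails can mean in practice: independent checks that each veto an agent action, composed so a gap in one layer is covered by another. All names (`Action`, `path_allowlist`, `tool_allowlist`) are invented for illustration and are not taken from CodeGen:

```python
from dataclasses import dataclass

@dataclass
class Action:
    tool: str
    path: str

def path_allowlist(action: Action) -> bool:
    # Layer 1: the agent may only touch files inside the project source tree.
    return action.path.startswith("src/")

def tool_allowlist(action: Action) -> bool:
    # Layer 2: destructive tools are never exposed, regardless of path.
    return action.tool in {"read_file", "str_replace", "run_tests"}

LAYERS = [path_allowlist, tool_allowlist]

def is_permitted(action: Action) -> bool:
    # An action must pass every layer; any single veto blocks it.
    return all(layer(action) for layer in LAYERS)
```

Because each layer is a standalone predicate, a prompt-level failure (the model asking for something it shouldn't) is caught by enforcement code rather than by instructions the model might ignore, which is the contrast with prompt engineering the paper draws.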
Where Pith is reading between the lines
- The same emphasis on constrained edit tools and adjustable oversight could extend to non-coding AI agents that modify documents or data.
- Longer-term observation might show whether these agents change how developers review or write code over months of use.
- Replication across organizations of different sizes would test whether the observed benefits hold beyond Zup's specific context.
Load-bearing premise
The reported gains in reliability and adoption stem primarily from the described tool design, safety layers, and oversight modes rather than from unmentioned factors such as team composition or project types at Zup.
What would settle it
A side-by-side comparison in the same environment of one agent version using string-replacement edits and safety layers against another using full-file rewrites with no safety layers, tracking reliability metrics and user adoption rates.
Original abstract
Enterprise teams building internal coding agents face a gap between prototype performance and production readiness. The root cause is that technical model quality alone is insufficient -- tool design, safety enforcement, state management, and human trust calibration are equally decisive, yet underreported in the literature. We present CodeGen, an internal coding agent at Zup, and show that targeted tool design (e.g., string-replacement edits over full-file rewrites) and layered safety guardrails improved agent reliability more than prompt engineering, while progressive human oversight modes drove organic adoption without mandating trust. These findings suggest that the engineering decisions surrounding the model -- not the model itself -- determine whether a coding agent delivers real value in practice.
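The abstract names "progressive human oversight modes" without specifying them. One plausible reading, sketched under the assumption of three review tiers (the tier names and the `requires_approval` helper are invented for illustration, not CodeGen's design):

```python
from enum import Enum

class Oversight(Enum):
    REVIEW_EVERY_EDIT = 1  # user approves each edit before it is applied
    REVIEW_ON_COMMIT = 2   # edits apply freely; user reviews the final diff
    AUTONOMOUS = 3         # agent commits directly; user audits after the fact

def requires_approval(mode: Oversight, event: str) -> bool:
    # Stricter modes gate more event types; looser modes gate fewer.
    if mode is Oversight.REVIEW_EVERY_EDIT:
        return event in {"edit", "commit"}
    if mode is Oversight.REVIEW_ON_COMMIT:
        return event == "commit"
    return False
```

The "progressive" part is that users pick their own tier and relax it as observed behavior earns trust, rather than the tool mandating a trust level up front.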
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript is a single-organization case study of CodeGen, an internal coding agent built at Zup. It claims that engineering decisions—specifically targeted tool design (string-replacement edits rather than full-file rewrites), layered safety guardrails, and progressive human-oversight modes—were more decisive for reliability and organic adoption than prompt engineering or underlying model quality, and it shares lessons and open questions from the deployment experience.
Significance. If the observations can be substantiated, the paper would usefully document practical, non-model factors that determine whether coding agents move from prototype to production use in enterprise settings. Such reports remain scarce in the SE literature, which tends to emphasize model capabilities over system-level design and trust calibration.
Major comments (2)
- [Abstract and §4] Abstract and §4 (Tool Design): the assertion that string-replacement edits improved reliability more than prompt engineering is presented without any success rates, error distributions, before/after comparisons, or ablation results that would allow the reader to evaluate the relative contribution.
- [§5 and §6] §5 (Safety Guardrails) and §6 (Adoption): the claims that layered guardrails outperformed prompt engineering and that progressive oversight produced organic adoption rest on internal observations only; no quantitative adoption metrics, user feedback scores, or controlled variants are reported, leaving the causal attribution unsupported.
Minor comments (2)
- [Discussion] The manuscript would benefit from an explicit limitations subsection that acknowledges the single-site, non-controlled nature of the observations.
- [Throughout] A short table summarizing the key design choices and the qualitative outcomes attributed to each would improve readability and make the lessons easier to extract.
Simulated Author's Rebuttal
We thank the referee for the thoughtful and constructive review of our case study on CodeGen. We appreciate the acknowledgment of the paper's potential contribution to documenting practical factors in deploying coding agents. We agree with the major comments that our assertions require better qualification due to the lack of quantitative data. We will make revisions to address this. Our point-by-point responses follow.
Point-by-point responses
- Referee: [Abstract and §4] Abstract and §4 (Tool Design): the assertion that string-replacement edits improved reliability more than prompt engineering is presented without any success rates, error distributions, before/after comparisons, or ablation results that would allow the reader to evaluate the relative contribution.
  Authors: We agree that the manuscript lacks quantitative metrics or comparative analyses to substantiate that string-replacement edits improved reliability more than prompt engineering. Our observations stem from the development and deployment process at Zup. In the revised version, we will modify the abstract and Section 4 to describe the rationale and observed benefits of the string-replacement approach as part of our engineering lessons, while adding a clear statement that these are not the result of controlled experiments or ablations and that direct comparisons to prompt engineering alone are not available. Revision: yes.
- Referee: [§5 and §6] §5 (Safety Guardrails) and §6 (Adoption): the claims that layered guardrails outperformed prompt engineering and that progressive oversight produced organic adoption rest on internal observations only; no quantitative adoption metrics, user feedback scores, or controlled variants are reported, leaving the causal attribution unsupported.
  Authors: We acknowledge the validity of this observation: the paper does not report quantitative adoption metrics, user feedback scores, or controlled variants to support the claims about layered guardrails and progressive oversight. We will revise Sections 5 and 6 to focus on detailing the design of the guardrails and oversight modes, and to report our internal experiences and perceived impacts on safety and adoption, accompanied by a note on the qualitative and observational basis of these insights without formal evaluation or causal proof. Revision: yes.
Circularity Check
No circularity: descriptive case study without derivations or fitted claims
Full rationale
The paper is a single-site case study reporting practical lessons from building an internal coding agent. It contains no equations, no parameter fitting, no derivations, and no self-citation chains that bear load on any central result. Claims about tool design, safety guardrails, and oversight modes are presented as observations from deployment experience rather than as predictions derived from prior inputs within the paper. No step reduces by construction to a self-definition or renamed fit; the work is self-contained as an engineering report.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
- [1] Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374, 2021. doi: 10.48550/arXiv.2107.03374
- [2] Angela Fan, Beliz Gokkaya, Mark Harman, Mitya Lyubarskiy, Shubho Sengupta, Shin Yoo, and Jie M Zhang. Large language models for software engineering: Survey and open problems. In Proceedings of the International Conference on Software Engineering: Future of Software Engineering (ICSE-FoSE), 2023. doi: 10.1109/ICSE-FoSE59343.2023.00008
- [3] Carlos E Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. SWE-bench: Can language models resolve real-world GitHub issues? In Proceedings of the International Conference on Learning Representations (ICLR), 2024. doi: 10.48550/arXiv.2310.06770
- [4] John D Lee and Katrina A See. Trust in automation: Designing for appropriate reliance. Human Factors, 46(1):50–80, 2004. doi: 10.1518/hfes.46.1.50.30392
- [5] Raja Parasuraman, Thomas B Sheridan, and Christopher D Wickens. A model for types and levels of human interaction with automation. IEEE Transactions on Systems, Man, and Cybernetics—Part A: Systems and Humans, 30(3):286–297, 2000. doi: 10.1109/3468.844354
- [6] Sida Peng, Eirini Kalliamvakou, Peter Cihon, and Mert Demirer. The impact of AI on developer productivity: Evidence from GitHub Copilot. arXiv preprint arXiv:2302.06590, 2023. doi: 10.48550/arXiv.2302.06590
- [7] Gustavo Pinto, Cleidson R. B. de Souza, João Batista Neto, Alberto de Souza, Tarcísio Gotto, and Edward Monteiro. Lessons from building StackSpot AI: A contextualized AI coding assistant. In Proceedings of the 46th International Conference on Software Engineering: Software Engineering in Practice, ICSE-SEIP 2024, Lisbon, Portugal, April 14-20, 2024, pages ...
- [8] Yujia Qin, Shihao Liang, Yining Ye, Kunlun Zhu, Lan Yan, Yaxi Lu, Yankai Lin, Xin Cong, Xiangru Tang, Bill Qian, et al. ToolLLM: Facilitating large language models to master 16000+ real-world APIs. In Proceedings of the International Conference on Learning Representations (ICLR), 2024. doi: 10.48550/arXiv.2307.16789
- [9] Zeeshan Rasheed, Muhammad Sami, Muhammad Waseem, Kai-Kristian Kemell, Xiaofeng Wang, Anh Nguyen Duc, and Pekka Abrahamsson. CodePori: Large-scale system for autonomous software development using multi-agent technology. SSRN Electronic Journal, 2024. doi: 10.2139/ssrn.4979510
- [10] Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Eric Hambro, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. Toolformer: Language models can teach themselves to use tools. In Advances in Neural Information Processing Systems (NeurIPS), 2023. doi: 10.48550/arXiv.2302.04761
- [11] Lei Wang, Chen Ma, Xueyang Feng, Zeyu Zhang, Hao Yang, Jingsen Zhang, Zhiyuan Chen, Jiakai Tang, Xu Chen, Yankai Lin, et al. A survey on large language model based autonomous agents. Frontiers of Computer Science, 18(6):186345, 2024. doi: 10.1007/s11704-024-40231-1
- [12] Frank F Xu, Uri Alon, Graham Neubig, and Vincent Josua Hellendoorn. A systematic evaluation of large language models of code. In Proceedings of the International Symposium on Machine Programming (MAPS), 2022. doi: 10.1145/3520312.3534862
- [13] John Yang, Carlos E Jimenez, Alexander Wettig, Kilian Liber, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. SWE-agent: Agent-computer interfaces enable automated software engineering. In Advances in Neural Information Processing Systems (NeurIPS), 2024. doi: 10.48550/arXiv.2405.15793
- [14] Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Tom Griffiths, Yuan Cao, and Karthik Narasimhan. Tree of thoughts: Deliberate problem solving with large language models. In Advances in Neural Information Processing Systems (NeurIPS), 2023. doi: 10.52202/075280-0517
- [15] Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. ReAct: Synergizing reasoning and acting in language models. In Proceedings of the International Conference on Learning Representations (ICLR), 2023. doi: 10.48550/arXiv.2210.03629
- [16] Yuntong Zhang, Haifeng Ruan, Zhiyu Fan, and Abhik Roychoudhury. AutoCodeRover: Autonomous program improvement. In Proceedings of the International Symposium on Software Testing and Analysis (ISSTA), 2024. doi: 10.1145/3650212.3680384