pith. machine review for the scientific record.

arxiv: 2604.09805 · v1 · submitted 2026-04-10 · 💻 cs.SE

Recognition: unknown

Building an Internal Coding Agent at Zup: Lessons and Open Questions

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 16:38 UTC · model grok-4.3

classification 💻 cs.SE
keywords coding agents · tool design · safety guardrails · human oversight · enterprise AI · software engineering · agent reliability · adoption

The pith

At Zup, targeted tool design and layered safety guardrails improved coding-agent reliability more than prompt engineering.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Enterprise teams building internal coding agents face a persistent gap between prototype performance and production use because model quality by itself does not address how code gets edited, errors get caught, or users decide to trust the system. The paper presents CodeGen and shows that concrete engineering choices, such as string-replacement edits instead of full-file rewrites and multiple layers of safety checks, produced larger reliability gains than prompt refinements. Progressive oversight modes, where humans can choose how closely to review changes, also led to voluntary adoption without requiring blanket trust. These findings shift attention from the underlying model to the surrounding system design that makes the agent usable in real enterprise codebases.

Core claim

The authors state that technical model quality alone is insufficient for production-ready coding agents. Targeted tool design such as string-replacement edits over full-file rewrites, layered safety guardrails, state management, and progressive human oversight modes are decisive. These elements improved reliability more than prompt engineering and enabled organic adoption without mandating trust, demonstrating that engineering decisions around the model determine whether a coding agent delivers value in practice.
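
The paper does not publish CodeGen's implementation, but the edit mechanism it credits is simple to picture. Below is a minimal Python sketch of a string-replacement edit tool, assuming, as such tools typically do, that an edit must match exactly one location or be rejected; the function name and error messages are illustrative, not CodeGen's:

```python
from pathlib import Path

def apply_string_replacement(path: Path, old: str, new: str) -> None:
    """Apply an edit as a unique old -> new substring replacement.

    Refusing missing or ambiguous matches is what makes this safer than
    a full-file rewrite: the agent cannot silently clobber code it
    never quoted.
    """
    text = path.read_text()
    count = text.count(old)
    if count == 0:
        raise ValueError(f"edit rejected: target string not found in {path}")
    if count > 1:
        raise ValueError(f"edit rejected: {count} matches, target is ambiguous")
    path.write_text(text.replace(old, new, 1))
```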

What carries the argument

String-replacement edit tools paired with layered safety guardrails and progressive human oversight modes that allow users to select review intensity.
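
The paper describes the guardrail layers only abstractly. One way they might compose, sketched below, is as independent checks that can each veto a proposed action before it reaches the repository; the specific rules here (a path allowlist and a diff-size cap) are hypothetical stand-ins, not rules the paper reports:

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class ProposedEdit:
    path: str        # file the agent wants to touch
    diff_lines: int  # size of the proposed change

# A guardrail returns a reason to block the edit, or None to let it pass.
Guardrail = Callable[[ProposedEdit], Optional[str]]

def path_allowlist(edit: ProposedEdit) -> Optional[str]:
    if edit.path.startswith(("src/", "tests/")):
        return None
    return f"path outside allowed roots: {edit.path}"

def diff_size_limit(edit: ProposedEdit) -> Optional[str]:
    return None if edit.diff_lines <= 200 else "diff too large to apply unreviewed"

LAYERS: list[Guardrail] = [path_allowlist, diff_size_limit]

def blocked_reasons(edit: ProposedEdit) -> list[str]:
    """Run every layer; the edit proceeds only if no layer objects."""
    return [r for check in LAYERS if (r := check(edit)) is not None]
```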

If this is right

  • Production coding agents can reach usable reliability through system-level choices rather than model-scale increases alone.
  • Flexible oversight levels allow adoption to grow as users observe consistent behavior without forced trust (a sketch of such modes follows this list).
  • Development resources shift toward edit mechanics, constraint enforcement, and state handling over isolated prompt tuning.
  • Similar system designs may reduce the prototype-to-production gap for other internal AI tools.
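
One concrete reading of "progressive oversight" is that each user picks a review tier and relaxes it as the agent earns trust. A sketch under that assumption; the tier names and gating logic are invented for illustration, not taken from the paper:

```python
from enum import Enum, auto

class OversightMode(Enum):
    # Hypothetical tiers; the paper describes progressive oversight
    # modes without naming or specifying them.
    REVIEW_EACH_EDIT = auto()  # human approves every change
    REVIEW_BATCHES = auto()    # human approves grouped changes
    AUTO_APPLY = auto()        # agent applies; human audits afterwards

def requires_approval(mode: OversightMode, batch_complete: bool) -> bool:
    """Decide whether to pause for a human before applying changes."""
    if mode is OversightMode.REVIEW_EACH_EDIT:
        return True
    if mode is OversightMode.REVIEW_BATCHES:
        return batch_complete
    return False  # AUTO_APPLY: apply now, audit asynchronously
```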

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same emphasis on constrained edit tools and adjustable oversight could extend to non-coding AI agents that modify documents or data.
  • Longer-term observation might show whether these agents change how developers review or write code over months of use.
  • Replication across organizations of different sizes would test whether the observed benefits hold beyond Zup's specific context.

Load-bearing premise

The reported gains in reliability and adoption stem primarily from the described tool design, safety layers, and oversight modes rather than from unmentioned factors such as team composition or project types at Zup.

What would settle it

A side-by-side comparison in the same environment of one agent version using string-replacement edits and safety layers against another using full-file rewrites with no safety layers, tracking reliability metrics and user adoption rates.
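
Such a comparison needs little more than paired task logs and a shared reliability definition. A minimal sketch of the bookkeeping, with a hypothetical per-task record and an illustrative metric (edit applied cleanly and kept by the user); neither is defined in the paper:

```python
from dataclasses import dataclass

@dataclass
class TaskOutcome:
    variant: str            # e.g. "str_replace_guarded" vs "full_rewrite_bare"
    edit_applied: bool      # landed without manual repair
    user_kept_result: bool  # survived human review

def reliability(outcomes: list[TaskOutcome], variant: str) -> float:
    """Fraction of a variant's tasks whose edit applied cleanly and was kept."""
    rows = [o for o in outcomes if o.variant == variant]
    if not rows:
        return float("nan")
    return sum(o.edit_applied and o.user_kept_result for o in rows) / len(rows)
```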

Original abstract

Enterprise teams building internal coding agents face a gap between prototype performance and production readiness. The root cause is that technical model quality alone is insufficient -- tool design, safety enforcement, state management, and human trust calibration are equally decisive, yet underreported in the literature. We present CodeGen, an internal coding agent at Zup, and show that targeted tool design (e.g., string-replacement edits over full-file rewrites) and layered safety guardrails improved agent reliability more than prompt engineering, while progressive human oversight modes drove organic adoption without mandating trust. These findings suggest that the engineering decisions surrounding the model -- not the model itself -- determine whether a coding agent delivers real value in practice.

Editorial analysis

A structured set of objections, weighed in public.

Referee report, simulated author's rebuttal, circularity audit, and an axiom and free-parameter ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript is a single-organization case study of CodeGen, an internal coding agent built at Zup. It claims that engineering decisions—specifically targeted tool design (string-replacement edits rather than full-file rewrites), layered safety guardrails, and progressive human-oversight modes—were more decisive for reliability and organic adoption than prompt engineering or underlying model quality, and it shares lessons and open questions from the deployment experience.

Significance. If the observations can be substantiated, the paper would usefully document practical, non-model factors that determine whether coding agents move from prototype to production use in enterprise settings. Such reports remain scarce in the SE literature, which tends to emphasize model capabilities over system-level design and trust calibration.

major comments (2)
  1. [Abstract and §4 (Tool Design)] The assertion that string-replacement edits improved reliability more than prompt engineering is presented without any success rates, error distributions, before/after comparisons, or ablation results that would allow the reader to evaluate the relative contribution.
  2. [§5 (Safety Guardrails) and §6 (Adoption)] The claims that layered guardrails outperformed prompt engineering and that progressive oversight produced organic adoption rest on internal observations only; no quantitative adoption metrics, user feedback scores, or controlled variants are reported, leaving the causal attribution unsupported.
minor comments (2)
  1. [Discussion] The manuscript would benefit from an explicit limitations subsection that acknowledges the single-site, non-controlled nature of the observations.
  2. [Throughout] A short table summarizing the key design choices and the qualitative outcomes attributed to each would improve readability and make the lessons easier to extract.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive review of our case study on CodeGen. We appreciate the acknowledgment of the paper's potential contribution to documenting practical factors in deploying coding agents. We agree with both major comments that, in the absence of quantitative data, our assertions require clearer qualification, and we will revise accordingly. Our point-by-point responses follow.

Point-by-point responses
  1. Referee: [Abstract and §4 (Tool Design)] The assertion that string-replacement edits improved reliability more than prompt engineering is presented without any success rates, error distributions, before/after comparisons, or ablation results that would allow the reader to evaluate the relative contribution.

    Authors: We agree that the manuscript lacks quantitative metrics or comparative analyses to substantiate that string-replacement edits improved reliability more than prompt engineering. Our observations stem from the development and deployment process at Zup. In the revised version, we will modify the abstract and Section 4 to describe the rationale and observed benefits of the string-replacement approach as part of our engineering lessons, while adding a clear statement that these are not the result of controlled experiments or ablations and that direct comparisons to prompt engineering alone are not available. revision: yes

  2. Referee: [§5 (Safety Guardrails) and §6 (Adoption)] The claims that layered guardrails outperformed prompt engineering and that progressive oversight produced organic adoption rest on internal observations only; no quantitative adoption metrics, user feedback scores, or controlled variants are reported, leaving the causal attribution unsupported.

    Authors: We acknowledge the validity of this observation: the paper does not report quantitative adoption metrics, user feedback scores, or controlled variants to support the claims about layered guardrails and progressive oversight. We will revise Sections 5 and 6 to focus on detailing the design of the guardrails and oversight modes, and to report our internal experiences and perceived impacts on safety and adoption, accompanied by a note on the qualitative and observational basis of these insights without formal evaluation or causal proof. revision: yes

Circularity Check

0 steps flagged

No circularity: descriptive case study without derivations or fitted claims

Full rationale

The paper is a single-site case study reporting practical lessons from building an internal coding agent. It contains no equations, no parameter fitting, no derivations, and no self-citation chains that bear load on any central result. Claims about tool design, safety guardrails, and oversight modes are presented as observations from deployment experience rather than as predictions derived from prior inputs within the paper. No step reduces by construction to a self-definition or renamed fit; the work is self-contained as an engineering report.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an experience report on software engineering practices. It introduces no mathematical free parameters, axioms, or invented entities. Claims rest on observational lessons from a single company deployment.

pith-pipeline@v0.9.0 · 5417 in / 1140 out tokens · 60136 ms · 2026-05-10T16:38:35.351961+00:00 · methodology

