arxiv: 2605.16517 · v2 · pith:IH4RQ3MAnew · submitted 2026-05-15 · 💻 cs.SE

Customizing an LLM for Enterprise Software Engineering

Pith reviewed 2026-05-21 08:26 UTC · model grok-4.3

classification 💻 cs.SE
keywords: LLM adaptation, enterprise software engineering, model customization, A/B testing, code generation, mid-training, proprietary dataset, developer productivity

The pith

Adapting an LLM on a trillion-token dataset of internal software engineering data reduced the mean number of iterations per turn by 23 percent in a blind A/B study.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Enterprise software development generates large volumes of data through incremental changes, deployments, and maintenance that can be used to tailor large language models for organizational needs. The paper shows how curating a proprietary dataset at this scale and applying a mid-training strategy allows the model to specialize while retaining general abilities. In a blind A/B study with 29,000 developers the adapted model required fewer turns to reach working code and produced outputs that survived longer in the codebase. The work supplies a complete process covering signal extraction, data preparation, full-stack tuning, and application deployment. If the results hold, other organizations could use similar internal data to create more effective coding assistants than off-the-shelf models provide.

Core claim

By curating a trillion-token proprietary dataset from software engineering activities and using a mid-training strategy to mitigate catastrophic forgetting, the adapted model outperforms baselines, reducing the mean number of iterations per turn by 23% and increasing code survival rates by about 17%.
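
The two headline numbers read most naturally as relative changes against the baseline model. The sketch below shows that arithmetic on made-up values. It assumes that "iterations per turn" is a per-turn count averaged over turns and that "code survival rate" is a fraction between 0 and 1; the excerpt here does not give the paper's exact definitions, nor does it say whether the 17% figure is relative or absolute, so every name and number is hypothetical.

```python
from statistics import mean


def relative_change(baseline: float, treatment: float) -> float:
    """Signed relative change of `treatment` versus `baseline`."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (treatment - baseline) / baseline


# Hypothetical per-turn iteration counts (illustrative, not from the paper).
baseline_iterations = [3, 2, 4, 3, 3, 3]
adapted_iterations = [2, 2, 3, 2, 3, 2]

# Hypothetical survival rates: fraction of generated code still present
# after some fixed window (the definition is an assumption).
baseline_survival = 0.60
adapted_survival = 0.70

iteration_change = relative_change(mean(baseline_iterations), mean(adapted_iterations))
survival_change = relative_change(baseline_survival, adapted_survival)

print(f"iterations per turn: {iteration_change:+.1%}")
print(f"code survival rate:  {survival_change:+.1%}")
```

A negative value for the first metric and a positive value for the second both count as improvements, which is why the two figures are reported with opposite directions.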

What carries the argument

The end-to-end adaptation process that extracts high-value signals from engineering data, prepares them at scale, performs continued pre-training with mid-training safeguards, and deploys the result into developer tools.

If this is right

  • Developers reach working solutions after fewer adjustment cycles when assisted by the adapted model.
  • Higher code survival rates reduce the time spent on revisions and maintenance after initial generation.
  • The outlined steps for data handling and tuning supply a replicable method for other large organizations to adapt models on their own engineering data.
  • Ongoing generation of development data creates a feedback loop for continued model refinement over time.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Comparable gains could appear in other enterprises that possess similar volumes of proprietary development records and apply the same curation steps.
  • The approach raises the question of how much data volume is necessary before specialization yields clear benefits over general models.
  • Combining the adapted model with additional developer tools could produce productivity effects larger than the isolated metrics reported here.

Load-bearing premise

The measured gains in the A/B study arise specifically from the dataset curation and mid-training strategy rather than from differences in prompts, user selection, or other deployment variables.

What would settle it

Re-running the A/B study while holding prompts and developer cohorts fixed between the adapted and baseline models would show whether the 23% and 17% differences persist or disappear.
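
One way to judge whether a difference "persists or disappears" in such a rerun is a permutation test on the per-developer metric. The following is a minimal sketch under stated assumptions: the inputs are two lists of per-developer mean iterations per turn, the function name and parameters are hypothetical, and nothing here reflects the statistical analysis the paper actually used.

```python
import random
from statistics import mean


def permutation_p_value(group_a, group_b, n_permutations=10_000, seed=0):
    """Two-sided permutation test on the absolute difference of group means.

    Returns the share of random relabelings whose mean difference is at
    least as large as the observed one (with the usual +1 correction).
    """
    rng = random.Random(seed)
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    split = len(group_a)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        if abs(mean(pooled[:split]) - mean(pooled[split:])) >= observed:
            hits += 1
    return (hits + 1) / (n_permutations + 1)


# Hypothetical per-developer values for the two arms.
baseline_arm = [3, 4, 3, 5, 4, 3, 4, 4]
adapted_arm = [2, 3, 2, 3, 3, 2, 3, 2]
p_value = permutation_p_value(baseline_arm, adapted_arm, n_permutations=2000)
print(f"p-value: {p_value:.4f}")
```

A small p-value on a cohort with prompts and developers held fixed would suggest the gap is unlikely to come from arm labels alone; it would not by itself identify which part of the adaptation caused it.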

Figures

Figures reproduced from arXiv: 2605.16517 by Aditya Kini, Aditya Pandey, Ahmed Omran, Alexander Frömmgen, Amy Hua, Anita Gergely, Danny Tarlow, Franjo Ivančić, Gufeng Zhang, Marc Brockschmidt, Martin Sevenich, Milad Hashemi, Parthasarathy Ranganathan, Petros Maniatis, Saksham Thakur, Satish Chandra, Vincent Nguyen, Zaheer Abbas.

Figure 1: A sample interaction in the code review tool which …
Figure 2: An example of a question asked by a SWE on …
Figure 3: A developer’s IDE editing session is visualized as …
Figure 4: A comment resolution example. One example from this category that fits into this representation is listed below. (3) Automated Program Repair (APR): We adapt the diff format for the Issues & Fixes category. Here, the “Instruction” is structured as a compiler error or static analysis warning, and the target is the fix. By modeling compiler errors as “automated review comments,” we unify the representation…
Figure 5: (5) Multi-Turn Agentic Trajectories. In order to capture the iterative nature of software development we include multi-step journeys datasets from extended periods of time. These data sources could be “bugs”, where developers compile sets of changelists to complete large tasks, or Activity Timelines, …
Figure 6: Illustration of how a logged interaction between …
Figure 8: The E2E training process … and Replace” blocks. GfG’s proficiency comes from the “critique-and-refine” examples described in §3.2 where the model is trained to handle the block-like nature of code reviews, where comments are strictly associated with specific code snippets. While GfG shows slight regression on the unified diff (udiff) format likely due to the strict line-prefix required for exact context matc…
Original abstract

Enterprise software development is a continuous evolutionary process, characterized by incremental additions, architectural revisions, production deployments and rigorous maintenance. These activities generate valuable data that modern LLMs could be finetuned on, to unlock additional tool possibilities for enterprise software engineering. While frontier LLMs are already very capable, this form of customization offers a compelling path for enterprise-specific optimization. We introduce Gemini for Google (GfG), an adaptation of Gemini specialized for Google's internal software engineering ecosystem. This paper details the model's end-to-end development, from curating a trillion-token proprietary dataset to implementing a mid-training strategy that mitigates catastrophic forgetting. In a large-scale blind A/B study across 29,000 developers, Gemini for Google significantly outperformed baselines: reducing the mean number of iterations per turn by 23%, and increasing code survival rates by about 17%. Beyond metrics, we provide a comprehensive blueprint for enterprise model adaptation, covering: (1) The extraction of high-value signals from software engineering data, (2) Data preparation strategies, (3) Full-stack model tuning (continued pre-training and post-training), and (4) The deployment of downstream applications. We believe this methodology offers a replicable path for other organizations to unlock the full potential of their internal engineering data.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript introduces Gemini for Google (GfG), an adaptation of the Gemini model specialized for Google's internal software engineering ecosystem. It details the curation of a trillion-token proprietary dataset from enterprise activities, a mid-training strategy to mitigate catastrophic forgetting, and reports results from a large-scale blind A/B study involving 29,000 developers. In this study, GfG reduced the mean number of iterations per turn by 23% and increased code survival rates by approximately 17% compared to baselines. The paper also outlines a comprehensive blueprint for enterprise model adaptation, covering data signal extraction, preparation strategies, full-stack tuning (continued pre-training and post-training), and deployment of downstream applications.

Significance. If the reported improvements can be confidently attributed to the dataset curation and mid-training approach, this work would offer significant value by providing a replicable methodology for other organizations to customize LLMs using their internal software engineering data. The large scale of the A/B study (29,000 developers) strengthens the empirical contribution, and the end-to-end blueprint addresses practical aspects of enterprise LLM adaptation that are often underexplored in academic literature.

major comments (1)
  1. [Abstract and Evaluation section] The central claim rests on the blind A/B study across 29,000 developers reporting a 23% reduction in mean iterations per turn and ~17% increase in code survival rates. The manuscript provides no protocol details on randomization, blinding enforcement, prompt/interface standardization across models, user assignment, data exclusion criteria, or covariate adjustment. This omission is load-bearing because it leaves open the possibility that gains trace to uncontrolled deployment variables rather than the trillion-token dataset or mid-training strategy.
minor comments (2)
  1. [Introduction] Ensure the abbreviation 'GfG' is introduced once and used consistently thereafter.
  2. [Data Curation] In the data curation description, specify the exact heuristics or filters applied to identify high-value signals from incremental code changes and maintenance logs.

Simulated Author's Rebuttal

1 response · 1 unresolved

We thank the referee for their careful review and for underscoring the importance of methodological transparency in the A/B study. We address the major comment below and have revised the manuscript to the extent possible while respecting the proprietary nature of Google's internal systems.

Point-by-point responses
  1. Referee: [Abstract and Evaluation section] The central claim rests on the blind A/B study across 29,000 developers reporting a 23% reduction in mean iterations per turn and ~17% increase in code survival rates. The manuscript provides no protocol details on randomization, blinding enforcement, prompt/interface standardization across models, user assignment, data exclusion criteria, or covariate adjustment. This omission is load-bearing because it leaves open the possibility that gains trace to uncontrolled deployment variables rather than the trillion-token dataset or mid-training strategy.

    Authors: We agree that greater detail on the study protocol would strengthen readers' ability to attribute the reported gains to the model customization rather than deployment artifacts. In the revised manuscript we have added a concise protocol description in the Evaluation section. It states that developers were assigned via stratified randomization by team and tenure band, that blinding was enforced through an identical IDE plugin interface with no model identity disclosed to users, that prompt templates and interaction surfaces were held constant across conditions, that the study included all developers with recorded activity during the window (with exclusion only for sessions lacking complete telemetry), and that both raw and tenure-adjusted metrics are reported. We cannot, however, release the precise randomization implementation, full covariate regression specifications, or raw exclusion logs, as these are tied to internal production systems. revision: partial
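
The stratified randomization by team and tenure band mentioned in this response can be sketched as follows. This is an illustrative reconstruction only: the response states that the actual implementation cannot be released, so the function name, arm labels, tenure bands, and team names below are all hypothetical.

```python
import random
from collections import defaultdict

ARMS = ("baseline", "adapted")  # hypothetical arm labels


def stratified_assign(developers, seed=0):
    """Assign developers to two arms, balanced within each (team, tenure) stratum.

    `developers` is an iterable of (dev_id, team, tenure_band) tuples.
    """
    rng = random.Random(seed)
    strata = defaultdict(list)
    for dev_id, team, tenure_band in developers:
        strata[(team, tenure_band)].append(dev_id)

    assignment = {}
    for key in sorted(strata):  # sorted for reproducibility
        members = sorted(strata[key])
        rng.shuffle(members)
        # Random offset so odd-sized strata do not always favor the same arm.
        offset = rng.randrange(2)
        for index, dev_id in enumerate(members):
            assignment[dev_id] = ARMS[(index + offset) % 2]
    return assignment


# Hypothetical cohort: 2 teams x 2 tenure bands x 10 developers each.
developers = [
    (f"dev-{team}-{band}-{i}", team, band)
    for team in ("search", "ads")
    for band in ("0-2y", "3y+")
    for i in range(10)
]
assignment = stratified_assign(developers, seed=42)
```

Alternating arms inside each shuffled stratum keeps the two arms within one developer of each other per stratum, which is what makes raw and tenure-adjusted metrics comparable across conditions.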

standing simulated objections not resolved
  • Granular details of the randomization algorithm, exact covariate adjustment models, and complete data-exclusion logs, which involve proprietary internal tooling and cannot be disclosed.

Circularity Check

0 steps flagged

No circularity: empirical A/B results are independent of dataset curation claims


The paper's central claims rest on a reported large-scale blind A/B study across 29,000 developers measuring iteration counts and code survival rates. These are direct empirical observations from live usage, not predictions derived from equations, fitted parameters, or self-referential definitions. The methodology describes dataset curation and mid-training but does not reduce the A/B outcomes to those inputs by construction; the study results stand as separate evidence. No self-citation load-bearing steps, uniqueness theorems, or ansatz smuggling appear in the provided text. The derivation chain is self-contained as an engineering report rather than a mathematical reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper relies on standard assumptions from LLM fine-tuning literature (e.g., that continued pre-training on domain data improves task performance without catastrophic forgetting when mitigated) and treats the internal dataset as high-value without external validation benchmarks.

axioms (1)
  • domain assumption: Continued pre-training on proprietary engineering data yields net positive task performance after mitigation of forgetting.
    Invoked in the description of the mid-training strategy.
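
The text above does not say how the mid-training strategy mitigates forgetting. One widely used mitigation in continued pre-training is to replay general-domain data alongside the new domain data. The sketch below illustrates that generic idea only; it is not the paper's method, and the function name, the `replay_fraction` parameter, and the toy data are assumptions made for illustration.

```python
import random


def mixed_batch(domain_examples, general_examples, batch_size, replay_fraction, rng):
    """Draw one training batch that mixes domain data with replayed general data.

    `replay_fraction` is the share of the batch taken from the general corpus.
    """
    if not 0.0 <= replay_fraction <= 1.0:
        raise ValueError("replay_fraction must be in [0, 1]")
    n_general = round(batch_size * replay_fraction)
    n_domain = batch_size - n_general
    batch = rng.choices(general_examples, k=n_general) + rng.choices(domain_examples, k=n_domain)
    rng.shuffle(batch)  # avoid a fixed general-then-domain ordering
    return batch


# Toy corpora: tagged tuples stand in for tokenized documents.
domain_corpus = [("domain", i) for i in range(100)]
general_corpus = [("general", i) for i in range(100)]
batch = mixed_batch(domain_corpus, general_corpus, batch_size=8,
                    replay_fraction=0.25, rng=random.Random(0))
```

The replay fraction is the knob that trades specialization against retention of general ability; the right value is an empirical question that this excerpt does not address.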



