Customizing an LLM for Enterprise Software Engineering
Pith reviewed 2026-05-21 08:26 UTC · model grok-4.3
The pith
Customizing an LLM on a trillion tokens of internal enterprise code reduces developer iterations by 23 percent.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
An LLM adapted on a curated trillion-token proprietary dataset from software engineering activities, with a mid-training strategy to mitigate catastrophic forgetting, outperforms baselines in a blind A/B study: it reduces the mean number of iterations per turn by 23% and increases code survival rates by about 17%.
What carries the argument
The end-to-end adaptation process that extracts high-value signals from engineering data, prepares them at scale, performs continued pre-training with mid-training safeguards, and deploys the result into developer tools.
If this is right
- Developers reach working solutions after fewer adjustment cycles when assisted by the adapted model.
- Higher code survival rates reduce the time spent on revisions and maintenance after initial generation.
- The outlined steps for data handling and tuning supply a replicable method for other large organizations to adapt models on their own engineering data.
- Ongoing generation of development data creates a feedback loop for continued model refinement over time.
Where Pith is reading between the lines
- Comparable gains could appear in other enterprises that possess similar volumes of proprietary development records and apply the same curation steps.
- The approach raises the question of how much data volume is necessary before specialization yields clear benefits over general models.
- Combining the adapted model with additional developer tools could produce productivity effects larger than the isolated metrics reported here.
Load-bearing premise
The measured gains in the A/B study arise specifically from the dataset curation and mid-training strategy rather than from differences in prompts, user selection, or other deployment variables.
What would settle it
Re-running the A/B study while holding prompts and developer cohorts fixed between the adapted and baseline models would show whether the 23% and 17% differences persist or disappear.
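As a rough illustration of what such a re-run would compare, the sketch below computes the relative change in mean iterations per turn between two arms and a permutation-test p-value for the difference in means. The function names and the iteration counts are hypothetical and synthetic; none of this is data or tooling from the paper, and the paper does not state which statistical test was used.

```python
import random
from statistics import mean


def relative_change(baseline, adapted):
    """Relative change in the mean; -0.23 would correspond to a 23% reduction."""
    return (mean(adapted) - mean(baseline)) / mean(baseline)


def permutation_p_value(baseline, adapted, n_resamples=2000, seed=0):
    """Two-sided permutation test on the absolute difference in means."""
    rng = random.Random(seed)
    observed = abs(mean(adapted) - mean(baseline))
    pooled = list(baseline) + list(adapted)
    n_base = len(baseline)
    hits = 0
    for _ in range(n_resamples):
        # Shuffle arm labels by shuffling the pooled sample and re-splitting.
        rng.shuffle(pooled)
        diff = abs(mean(pooled[n_base:]) - mean(pooled[:n_base]))
        if diff >= observed:
            hits += 1
    # Add-one smoothing keeps the estimate strictly positive.
    return (hits + 1) / (n_resamples + 1)


# Synthetic iteration counts per turn (assumption: illustrative values only).
baseline_iters = [4, 5, 3, 6, 4, 5, 4, 5]
adapted_iters = [3, 4, 3, 4, 3, 4, 3, 4]

change = relative_change(baseline_iters, adapted_iters)
p_value = permutation_p_value(baseline_iters, adapted_iters)
print(f"relative change: {change:.3f}, permutation p-value: {p_value:.4f}")
```

If prompts and cohorts are held fixed and the relative change stays near the reported value with a small p-value, the attribution to the adapted model becomes more credible; if it shrinks toward zero, deployment variables are the more likely explanation.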
Original abstract
Enterprise software development is a continuous evolutionary process, characterized by incremental additions, architectural revisions, production deployments and rigorous maintenance. These activities generate valuable data that modern LLMs could be finetuned on to unlock additional tool possibilities for enterprise software engineering. While frontier LLMs are already very capable, this form of customization offers a compelling path for enterprise-specific optimization. We introduce Gemini for Google (GfG), an adaptation of Gemini specialized for Google's internal software engineering ecosystem. This paper details the model's end-to-end development, from curating a trillion-token proprietary dataset to implementing a mid-training strategy that mitigates catastrophic forgetting. In a large-scale blind A/B study across 29,000 developers, Gemini for Google significantly outperformed baselines: reducing the mean number of iterations per turn by 23%, and increasing code survival rates by about 17%. Beyond metrics, we provide a comprehensive blueprint for enterprise model adaptation, covering: (1) The extraction of high-value signals from software engineering data, (2) Data preparation strategies, (3) Full-stack model tuning (continued pre-training and post-training), and (4) The deployment of downstream applications. We believe this methodology offers a replicable path for other organizations to unlock the full potential of their internal engineering data.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces Gemini for Google (GfG), an adaptation of the Gemini model specialized for Google's internal software engineering ecosystem. It details the curation of a trillion-token proprietary dataset from enterprise activities, a mid-training strategy to mitigate catastrophic forgetting, and reports results from a large-scale blind A/B study involving 29,000 developers. In this study, GfG reduced the mean number of iterations per turn by 23% and increased code survival rates by approximately 17% compared to baselines. The paper also outlines a comprehensive blueprint for enterprise model adaptation, covering data signal extraction, preparation strategies, full-stack tuning (continued pre-training and post-training), and deployment of downstream applications.
Significance. If the reported improvements can be confidently attributed to the dataset curation and mid-training approach, this work would offer significant value by providing a replicable methodology for other organizations to customize LLMs using their internal software engineering data. The large scale of the A/B study (29,000 developers) strengthens the empirical contribution, and the end-to-end blueprint addresses practical aspects of enterprise LLM adaptation that are often underexplored in academic literature.
major comments (1)
- [Abstract and Evaluation section] The central claim rests on the blind A/B study across 29,000 developers reporting a 23% reduction in mean iterations per turn and ~17% increase in code survival rates. The manuscript provides no protocol details on randomization, blinding enforcement, prompt/interface standardization across models, user assignment, data exclusion criteria, or covariate adjustment. This omission is load-bearing because it leaves open the possibility that gains trace to uncontrolled deployment variables rather than the trillion-token dataset or mid-training strategy.
minor comments (2)
- [Introduction] Ensure the abbreviation 'GfG' is introduced once and used consistently thereafter.
- [Data Curation] In the data curation description, specify the exact heuristics or filters applied to identify high-value signals from incremental code changes and maintenance logs.
Simulated Author's Rebuttal
We thank the referee for their careful review and for underscoring the importance of methodological transparency in the A/B study. We address the major comment below and have revised the manuscript to the extent possible while respecting the proprietary nature of Google's internal systems.
- Referee: [Abstract and Evaluation section] The central claim rests on the blind A/B study across 29,000 developers reporting a 23% reduction in mean iterations per turn and ~17% increase in code survival rates. The manuscript provides no protocol details on randomization, blinding enforcement, prompt/interface standardization across models, user assignment, data exclusion criteria, or covariate adjustment. This omission is load-bearing because it leaves open the possibility that gains trace to uncontrolled deployment variables rather than the trillion-token dataset or mid-training strategy.
Authors: We agree that greater detail on the study protocol would strengthen readers' ability to attribute the reported gains to the model customization rather than deployment artifacts. In the revised manuscript we have added a concise protocol description in the Evaluation section. It states that developers were assigned via stratified randomization by team and tenure band, that blinding was enforced through an identical IDE plugin interface with no model identity disclosed to users, that prompt templates and interaction surfaces were held constant across conditions, that the study included all developers with recorded activity during the window (with exclusion only for sessions lacking complete telemetry), and that both raw and tenure-adjusted metrics are reported. We cannot, however, release the precise randomization implementation, full covariate regression specifications, or raw exclusion logs, as these are tied to internal production systems.
Revision: partial
- Granular details of the randomization algorithm, exact covariate adjustment models, and complete data-exclusion logs, which involve proprietary internal tooling and cannot be disclosed.
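The stratified randomization named in the authors' response can be sketched as follows. This is a minimal illustration under stated assumptions, not the undisclosed production implementation: the record fields (`id`, `team`, `tenure_band`), the arm names, and the function `stratified_assign` are all hypothetical. Developers are grouped by (team, tenure band), shuffled within each group, and alternated between arms so that each stratum is split as evenly as possible.

```python
import random
from collections import defaultdict

# Hypothetical arm labels; the paper does not name its experimental arms.
ARMS = ("baseline", "adapted")


def stratified_assign(developers, seed=0):
    """Assign developers to arms, balanced within each (team, tenure_band) stratum.

    developers: iterable of dicts with keys 'id', 'team', 'tenure_band'.
    Returns a dict mapping developer id to arm name.
    """
    rng = random.Random(seed)
    strata = defaultdict(list)
    for dev in developers:
        strata[(dev["team"], dev["tenure_band"])].append(dev["id"])

    assignment = {}
    for key in sorted(strata):
        ids = sorted(strata[key])  # fixed starting order makes the seed reproducible
        rng.shuffle(ids)
        # A random offset decides which arm receives the extra member
        # when a stratum has an odd number of developers.
        offset = rng.randrange(2)
        for i, dev_id in enumerate(ids):
            assignment[dev_id] = ARMS[(i + offset) % 2]
    return assignment


# Synthetic roster (assumption: illustrative values only).
roster = [
    {"id": f"dev-{team}-{band}-{i}", "team": team, "tenure_band": band}
    for team in ("search", "ads")
    for band in ("0-2y", "3y+")
    for i in range(4)
]
arms = stratified_assign(roster, seed=1)
```

Balancing within strata means that team and tenure cannot differ systematically between arms, which is the property the referee's concern about user assignment turns on.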
Circularity Check
No circularity: empirical A/B results are independent of dataset curation claims
The paper's central claims rest on a reported large-scale blind A/B study across 29,000 developers measuring iteration counts and code survival rates. These are direct empirical observations from live usage, not predictions derived from equations, fitted parameters, or self-referential definitions. The methodology describes dataset curation and mid-training but does not reduce the A/B outcomes to those inputs by construction; the study results stand as separate evidence. No self-citation load-bearing steps, uniqueness theorems, or ansatz smuggling appear in the provided text. The derivation chain is self-contained as an engineering report rather than a mathematical reduction.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Continued pre-training on proprietary engineering data yields net positive task performance after mitigation of forgetting.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "curating a trillion-token proprietary dataset to implementing a mid-training strategy that mitigates catastrophic forgetting"
- IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "large-scale blind A/B study across 29,000 developers... reducing the mean number of iterations per turn by 23%"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.