Customizing an LLM for Enterprise Software Engineering

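Aditya Kini; Satish Chandra; Milad Hashemi; Saksham Thakur; Aditya Pandey; Vincent Nguyen; Marc Brockschmidt; Franjo Ivančić; Danny Tarlow; Parthasarathy Ranganathan; Petros Maniatis; Ahmed Omran; Zaheer Abbas; Anita Gergely; Martin Sevenich; Gufeng Zhang; Amy Hua; Alexander Frömmgen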

arxiv: 2605.16517 · v2 · pith:IH4RQ3MAnew · submitted 2026-05-15 · 💻 cs.SE

Pith reviewed 2026-05-21 08:26 UTC · model grok-4.3

classification 💻 cs.SE

keywords LLM adaptation · enterprise software engineering · model customization · A/B testing · code generation · mid-training · proprietary dataset · developer productivity

The pith

Customizing an LLM on a trillion tokens of internal enterprise data reduces the mean number of iterations per turn by 23 percent.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Enterprise software development generates large volumes of data through incremental changes, deployments, and maintenance, and that data can be used to tailor large language models to organizational needs. The paper shows how curating a proprietary dataset at this scale and applying a mid-training strategy allows the model to specialize while retaining general abilities. In a blind A/B study with 29,000 developers, the adapted model needed fewer iterations per turn and produced code that survived longer in the codebase. The work supplies a complete process covering signal extraction, data preparation, full-stack tuning, and application deployment. If the results hold, other organizations could use similar internal data to build more effective coding assistants than off-the-shelf models provide.

Core claim

By curating a trillion-token proprietary dataset from software engineering activities and using a mid-training strategy to mitigate catastrophic forgetting, the adapted model outperforms baselines, reducing the mean number of iterations per turn by 23% and increasing code survival rates by about 17%.

What carries the argument

The end-to-end adaptation process that extracts high-value signals from engineering data, prepares them at scale, performs continued pre-training with mid-training safeguards, and deploys the result into developer tools.
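
The summary does not say how the mid-training stage guards against forgetting. One common safeguard in continued pre-training is to interleave general-purpose "replay" data with the new domain data. The sketch below illustrates that idea only; the function name `mix_batches`, the `replay_fraction` parameter, and the sample documents are assumptions made for illustration and are not taken from the paper.

```python
import random
from typing import Iterator, Sequence


def mix_batches(
    domain_docs: Sequence[str],
    general_docs: Sequence[str],
    replay_fraction: float,
    num_samples: int,
    seed: int = 0,
) -> Iterator[tuple[str, str]]:
    """Yield (source, document) pairs for a hypothetical training stream.

    Each sample is drawn from the general-purpose pool with probability
    `replay_fraction` and from the domain pool otherwise.
    """
    if not 0.0 <= replay_fraction <= 1.0:
        raise ValueError("replay_fraction must be in [0, 1]")
    rng = random.Random(seed)  # seeded so the stream is reproducible
    for _ in range(num_samples):
        if rng.random() < replay_fraction:
            yield "general", rng.choice(general_docs)
        else:
            yield "domain", rng.choice(domain_docs)


if __name__ == "__main__":
    # Hypothetical placeholder documents, not real training data.
    stream = mix_batches(["internal change A", "internal change B"],
                         ["public text"], replay_fraction=0.3, num_samples=10)
    for source, doc in stream:
        print(source, "|", doc)
```

With `replay_fraction=0.3`, roughly three in ten samples come from the general pool. Which ratio preserves general ability while still specializing is an empirical question that the summary does not address.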

If this is right

Developers reach working solutions after fewer adjustment cycles when assisted by the adapted model.
Higher code survival rates reduce the time spent on revisions and maintenance after initial generation.
The outlined steps for data handling and tuning supply a replicable method for other large organizations to adapt models on their own engineering data.
Ongoing generation of development data creates a feedback loop for continued model refinement over time.
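
To make the survival metric concrete, here is a minimal sketch of one way a code survival rate could be computed. The definition used (the share of non-blank generated lines still present verbatim in a later snapshot) is an assumption for illustration; the summary does not give the paper's exact definition, and the names `survival_rate` and `relative_change` are hypothetical.

```python
from typing import Iterable


def survival_rate(generated_lines: Iterable[str], later_snapshot: Iterable[str]) -> float:
    """Share of non-blank generated lines found verbatim in a later snapshot.

    Lines are compared after stripping surrounding whitespace, so pure
    re-indentation does not count as a change.
    """
    generated = [line.strip() for line in generated_lines if line.strip()]
    if not generated:
        return 0.0
    remaining = {line.strip() for line in later_snapshot}
    survived = sum(1 for line in generated if line in remaining)
    return survived / len(generated)


def relative_change(baseline: float, treated: float) -> float:
    """Relative change of `treated` versus `baseline` (0.2 means +20%)."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (treated - baseline) / baseline


if __name__ == "__main__":
    # Made-up snippet: one of four generated lines was later rewritten.
    generated = ["total = 0", "for x in xs:", "    total += x", "", "return total"]
    later = ["total = 0", "for x in xs:", "    total += 2 * x", "return total"]
    print(survival_rate(generated, later))
```

A line-matching definition like this is crude, since it ignores moved or lightly edited code, but it shows the kind of quantity a relative improvement in survival would be measured on.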

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Comparable gains could appear in other enterprises that possess similar volumes of proprietary development records and apply the same curation steps.
The approach raises the question of how much data volume is necessary before specialization yields clear benefits over general models.
Combining the adapted model with additional developer tools could produce productivity effects larger than the isolated metrics reported here.

Load-bearing premise

The measured gains in the A/B study arise specifically from the dataset curation and mid-training strategy rather than from differences in prompts, user selection, or other deployment variables.

What would settle it

Re-running the A/B study while holding prompts and developer cohorts fixed between the adapted and baseline models would show whether the 23% and 17% differences persist or disappear.
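
Whether a difference between arms persists can be checked with a two-sample permutation test on a per-developer metric such as mean iterations per turn. The following is a minimal sketch under stated assumptions: the function name, the input values, and the choice of a permutation test are illustrative, and the paper's actual statistical analysis is not described in the material above.

```python
import random
from typing import Sequence


def _mean(values: Sequence[float]) -> float:
    return sum(values) / len(values)


def permutation_test(
    control: Sequence[float],
    treatment: Sequence[float],
    num_permutations: int = 10_000,
    seed: int = 0,
) -> tuple[float, float]:
    """Return (observed mean difference, two-sided p-value).

    The difference is mean(treatment) - mean(control). The p-value is the
    share of random relabelings whose absolute difference is at least as
    large as the observed one, with the usual +1 correction.
    """
    rng = random.Random(seed)
    observed = _mean(treatment) - _mean(control)
    pooled = list(control) + list(treatment)
    n_control = len(control)
    extreme = 0
    for _ in range(num_permutations):
        rng.shuffle(pooled)  # random relabeling of the pooled observations
        diff = _mean(pooled[n_control:]) - _mean(pooled[:n_control])
        if abs(diff) >= abs(observed):
            extreme += 1
    return observed, (extreme + 1) / (num_permutations + 1)


if __name__ == "__main__":
    # Made-up per-developer values, not data from the study.
    baseline_arm = [4, 5, 6, 5, 4, 6, 5, 5]
    adapted_arm = [3, 4, 4, 3, 4, 3, 4, 4]
    diff, p_value = permutation_test(baseline_arm, adapted_arm)
    print(f"difference={diff:.3f}, p={p_value:.4f}")
```

A permutation test makes no distributional assumption about the metric, which suits count-like quantities such as iterations. It does not by itself address confounding from prompts or cohort selection; that requires the controlled design described above.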

Figures

Figures reproduced from arXiv: 2605.16517 by Aditya Kini, Aditya Pandey, Ahmed Omran, Alexander Fr\"ommgen, Amy Hua, Anita Gergely, Danny Tarlow, Franjo Ivan\v{c}i\'c, Gufeng Zhang, Marc Brockschmidt, Martin Sevenich, Milad Hashemi, Parthasarathy Ranganathan, Petros Maniatis, Saksham Thakur, Satish Chandra, Vincent Nguyen, Zaheer Abbas.

**Figure 2.** An example of a question asked by a SWE on …

**Figure 3.** A developer’s IDE editing session is visualized as …

**Figure 4.** A comment resolution example. One example from this category that fits into this representation is listed below. (3) Automated Program Repair (APR): We adapt the diff format for the Issues & Fixes category. Here, the “Instruction” is structured as a compiler error or static analysis warning, and the target is the fix. By modeling compiler errors as “automated review comments,” we unify the representation…

**Figure 5.** (5) Multi-Turn Agentic Trajectories. In order to capture the iterative nature of software development we include multi-step journey datasets from extended periods of time. These data sources could be “bugs”, where developers compile sets of changelists to complete large tasks, or Activity Timelines, …

**Figure 6.** Illustration of how a logged interaction between …

**Figure 8.** The E2E training process … and Replace” blocks. GfG’s proficiency comes from the “critique-and-refine” examples described in §3.2 where the model is trained to handle the block-like nature of code reviews, where comments are strictly associated with specific code snippets. While GfG shows slight regression on the unified diff (udiff) format likely due to the strict line-prefix required for exact context matc…
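
The Figure 4 caption describes an Automated Program Repair example as an "Instruction" (a compiler error or static analysis warning) paired with a target fix in diff format. The sketch below shows one way such a record could be assembled with Python's standard `difflib` module. The `EditExample` fields, the `make_repair_example` helper, the file path, and the sample diagnostic are all hypothetical; the paper's actual schema is not given in the material above.

```python
import difflib
from dataclasses import dataclass


@dataclass(frozen=True)
class EditExample:
    """Hypothetical training record: an instruction plus a diff target."""
    instruction: str   # e.g. a compiler error, treated like a review comment
    source: str        # file content before the fix
    target_diff: str   # unified diff that applies the fix


def make_repair_example(diagnostic: str, before: str, after: str,
                        path: str = "example.py") -> EditExample:
    """Build an example whose target is the unified diff from before to after."""
    diff_lines = difflib.unified_diff(
        before.splitlines(keepends=True),
        after.splitlines(keepends=True),
        fromfile=f"a/{path}",
        tofile=f"b/{path}",
    )
    return EditExample(instruction=diagnostic, source=before,
                       target_diff="".join(diff_lines))


if __name__ == "__main__":
    # Made-up broken snippet and diagnostic for illustration.
    broken = "x = 1\nprint(x\n"
    fixed = "x = 1\nprint(x)\n"
    example = make_repair_example("SyntaxError: '(' was never closed", broken, fixed)
    print(example.instruction)
    print(example.target_diff)
```

Because a human review comment and a compiler diagnostic would occupy the same `instruction` field, one record shape could serve both the comment-resolution and the repair categories, which is the unification the caption alludes to.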

Original abstract

Enterprise software development is a continuous evolutionary process, characterized by incremental additions, architectural revisions, production deployments and rigorous maintenance. These activities generate valuable data that modern LLMs could be finetuned on, to unlock additional tool possibilities for enterprise software engineering. While frontier LLMs are already very capable, this form of customization offers a compelling path for enterprise-specific optimization. We introduce Gemini for Google (GfG), an adaptation of Gemini specialized for Google's internal software engineering ecosystem. This paper details the model's end-to-end development, from curating a trillion-token proprietary dataset to implementing a mid-training strategy that mitigates catastrophic forgetting. In a large-scale blind A/B study across 29,000 developers, Gemini for Google significantly outperformed baselines: reducing the mean number of iterations per turn by 23%, and increasing code survival rates by about 17%. Beyond metrics, we provide a comprehensive blueprint for enterprise model adaptation, covering: (1) The extraction of high-value signals from software engineering data, (2) Data preparation strategies, (3) Full-stack model tuning (continued pre-training and post-training), and (4) The deployment of downstream applications. We believe this methodology offers a replicable path for other organizations to unlock the full potential of their internal engineering data.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper shows measurable gains from training on a trillion tokens of internal code with mid-training to limit forgetting, backed by a 29k-developer A/B test, but the study lacks enough protocol details to tie the results cleanly to the customization steps.

The main thing to know is that the authors adapted Gemini on a trillion tokens of internal Google engineering data with mid-training to avoid forgetting, and measured gains in a 29k-developer A/B study. The gains look real at face value, but the study details are thin on how the conditions were isolated.

What the paper does well is describe the end-to-end process in enough detail to be useful as a blueprint. Extracting high-value signals from software data, preparing the trillion-token set, the continued pre-training plus post-training, and then the deployment story all get covered. The large scale of the dataset and the live study set this apart from smaller experiments in the literature. Reporting specific metrics like reduced iterations and higher code survival from actual developer use is a strength.

The soft spot is the A/B study. The abstract says it was blind and large, but gives no information on randomization procedure, whether prompts and UIs were identical, or how they handled potential biases in user assignment or task selection. That makes it hard to rule out that the 23% and 17% improvements came from something other than the model customization itself. The concern in the stress-test note holds up based on what's in the abstract.

This kind of paper is for teams at large organizations looking to customize models on their own data, and for academics interested in industrial-scale adaptation. A reader working on similar projects would find the practical steps valuable. It deserves a serious referee because the scale and real-world results are worth examining closely, even if the methods section needs strengthening. I would send it to peer review and ask for expanded details on the experiment design in revisions.

Referee Report

1 major / 2 minor

Summary. The manuscript introduces Gemini for Google (GfG), an adaptation of the Gemini model specialized for Google's internal software engineering ecosystem. It details the curation of a trillion-token proprietary dataset from enterprise activities, a mid-training strategy to mitigate catastrophic forgetting, and reports results from a large-scale blind A/B study involving 29,000 developers. In this study, GfG reduced the mean number of iterations per turn by 23% and increased code survival rates by approximately 17% compared to baselines. The paper also outlines a comprehensive blueprint for enterprise model adaptation, covering data signal extraction, preparation strategies, full-stack tuning (continued pre-training and post-training), and deployment of downstream applications.

Significance. If the reported improvements can be confidently attributed to the dataset curation and mid-training approach, this work would offer significant value by providing a replicable methodology for other organizations to customize LLMs using their internal software engineering data. The large scale of the A/B study (29,000 developers) strengthens the empirical contribution, and the end-to-end blueprint addresses practical aspects of enterprise LLM adaptation that are often underexplored in academic literature.

major comments (1)

[Abstract and Evaluation section] The central claim rests on the blind A/B study across 29,000 developers reporting a 23% reduction in mean iterations per turn and ~17% increase in code survival rates. The manuscript provides no protocol details on randomization, blinding enforcement, prompt/interface standardization across models, user assignment, data exclusion criteria, or covariate adjustment. This omission is load-bearing because it leaves open the possibility that gains trace to uncontrolled deployment variables rather than the trillion-token dataset or mid-training strategy.

minor comments (2)

[Introduction] Ensure the abbreviation 'GfG' is introduced once and used consistently thereafter.
[Data Curation] In the data curation description, specify the exact heuristics or filters applied to identify high-value signals from incremental code changes and maintenance logs.

Simulated Author's Rebuttal

1 response · 1 unresolved

We thank the referee for their careful review and for underscoring the importance of methodological transparency in the A/B study. We address the major comment below and have revised the manuscript to the extent possible while respecting the proprietary nature of Google's internal systems.

Referee: [Abstract and Evaluation section] The central claim rests on the blind A/B study across 29,000 developers reporting a 23% reduction in mean iterations per turn and ~17% increase in code survival rates. The manuscript provides no protocol details on randomization, blinding enforcement, prompt/interface standardization across models, user assignment, data exclusion criteria, or covariate adjustment. This omission is load-bearing because it leaves open the possibility that gains trace to uncontrolled deployment variables rather than the trillion-token dataset or mid-training strategy.

Authors: We agree that greater detail on the study protocol would strengthen readers' ability to attribute the reported gains to the model customization rather than deployment artifacts. In the revised manuscript we have added a concise protocol description in the Evaluation section. It states that developers were assigned via stratified randomization by team and tenure band, that blinding was enforced through an identical IDE plugin interface with no model identity disclosed to users, that prompt templates and interaction surfaces were held constant across conditions, that the study included all developers with recorded activity during the window (with exclusion only for sessions lacking complete telemetry), and that both raw and tenure-adjusted metrics are reported. We cannot, however, release the precise randomization implementation, full covariate regression specifications, or raw exclusion logs, as these are tied to internal production systems. revision: partial
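
The simulated response mentions stratified randomization by team and tenure band. A minimal sketch of that kind of assignment follows. The function name, the arm labels, and the record layout `(developer_id, team, tenure_band)` are assumptions for illustration; the actual implementation is stated above to be undisclosed.

```python
import random
from collections import defaultdict
from typing import Iterable

ARMS = ("control", "treatment")  # hypothetical arm labels


def stratified_assign(
    developers: Iterable[tuple[str, str, str]], seed: int = 0
) -> dict[str, str]:
    """Assign each developer to an arm, balancing within each stratum.

    `developers` yields (developer_id, team, tenure_band). Developers are
    grouped by (team, tenure_band), shuffled within the group, and then
    assigned to arms alternately so group sizes differ by at most one.
    """
    rng = random.Random(seed)
    strata: dict[tuple[str, str], list[str]] = defaultdict(list)
    for dev_id, team, tenure_band in developers:
        strata[(team, tenure_band)].append(dev_id)

    assignment: dict[str, str] = {}
    for key in sorted(strata):           # fixed order keeps runs reproducible
        members = sorted(strata[key])
        rng.shuffle(members)
        offset = rng.randrange(2)        # random arm receives any odd one out
        for index, dev_id in enumerate(members):
            assignment[dev_id] = ARMS[(index + offset) % 2]
    return assignment


if __name__ == "__main__":
    # Made-up roster for illustration.
    roster = [("dev1", "search", "0-2y"), ("dev2", "search", "0-2y"),
              ("dev3", "search", "3y+"), ("dev4", "ads", "0-2y"),
              ("dev5", "ads", "0-2y"), ("dev6", "ads", "0-2y")]
    for dev_id, arm in sorted(stratified_assign(roster, seed=42).items()):
        print(dev_id, arm)
```

Balancing within each stratum keeps team and tenure from being unevenly distributed across arms by chance, which is the property the referee's comment asks the authors to document.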

standing simulated objections not resolved

Granular details of the randomization algorithm, exact covariate adjustment models, and complete data-exclusion logs, which involve proprietary internal tooling and cannot be disclosed.

Circularity Check

0 steps flagged

No circularity: empirical A/B results are independent of dataset curation claims

full rationale

The paper's central claims rest on a reported large-scale blind A/B study across 29,000 developers measuring iteration counts and code survival rates. These are direct empirical observations from live usage, not predictions derived from equations, fitted parameters, or self-referential definitions. The methodology describes dataset curation and mid-training but does not reduce the A/B outcomes to those inputs by construction; the study results stand as separate evidence. No self-citation load-bearing steps, uniqueness theorems, or ansatz smuggling appear in the provided text. The derivation chain is self-contained as an engineering report rather than a mathematical reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper relies on standard assumptions from LLM fine-tuning literature (e.g., that continued pre-training on domain data improves task performance without catastrophic forgetting when mitigated) and treats the internal dataset as high-value without external validation benchmarks.

axioms (1)

domain assumption: Continued pre-training on proprietary engineering data yields net positive task performance after mitigation of forgetting. Invoked in the description of the mid-training strategy.

pith-pipeline@v0.9.0 · 5825 in / 1274 out tokens · 31659 ms · 2026-05-21T08:26:16.308183+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: curating a trillion-token proprietary dataset to implementing a mid-training strategy that mitigates catastrophic forgetting

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: large-scale blind A/B study across 29,000 developers... reducing the mean number of iterations per turn by 23%

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
