Byam: Fixing Breaking Dependency Updates with Large Language Models

Benoit Baudry; Federico Bono; Frank Reyes; Martin Monperrus; May Mahmoud; Sarah Nadi

arxiv: 2505.07522 · v3 · submitted 2025-05-12 · 💻 cs.SE

Frank Reyes, May Mahmoud, Federico Bono, Sarah Nadi, Benoit Baudry, Martin Monperrus

Pith reviewed 2026-05-22 16:41 UTC · model grok-4.3

classification 💻 cs.SE

keywords large language models · breaking dependency updates · code repair · Java · LLM prompting · API changes · compilation errors · BUMP dataset

The pith

OpenAI's o3-mini completely fixes 27% of Java builds broken by dependency updates when given error context and API details.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether large language models can automatically repair client code that breaks after third-party libraries receive updates. The authors run experiments on the BUMP dataset of real Java projects that contain breaking dependency changes. They supply the models with build logs, the exact line that fails, descriptions of API differences, and instructions to reason step by step. The best model, OpenAI's o3-mini, resolves entire broken builds in 27% of cases and individual compilation errors in 78% of cases. This result indicates that LLMs can reduce the manual work developers currently perform when they must update to newer versions of their dependencies.
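To make the "build logs" and "exact line that fails" inputs concrete, here is a minimal sketch of pulling compilation errors out of a Maven build log. It assumes the usual maven-compiler-plugin line shape (`[ERROR] <file>.java:[line,col] message`); the names `CompilationError` and `parse_build_log` are hypothetical and are not taken from the paper's tooling.

```python
import re
from dataclasses import dataclass

# Assumed shape of a maven-compiler-plugin error line, for example:
# [ERROR] /app/src/main/java/App.java:[42,17] cannot find symbol
ERROR_LINE = re.compile(
    r"^\[ERROR\] (?P<path>\S+\.java):\[(?P<line>\d+),(?P<col>\d+)\] (?P<message>.+)$"
)


@dataclass(frozen=True)
class CompilationError:
    """One compiler error located in a client source file (hypothetical type)."""
    path: str
    line: int
    column: int
    message: str


def parse_build_log(log: str) -> list[CompilationError]:
    """Return the distinct compilation errors found in a build log, in order."""
    errors: list[CompilationError] = []
    seen: set[CompilationError] = set()
    for raw in log.splitlines():
        match = ERROR_LINE.match(raw.strip())
        if match is None:
            continue  # not a located compiler error (info line, summary, etc.)
        error = CompilationError(
            match["path"], int(match["line"]), int(match["col"]), match["message"]
        )
        if error not in seen:  # a log may repeat the same error in its summary
            seen.add(error)
            errors.append(error)
    return errors
```

Each parsed error gives the file and line number needed to look up the failing source line before it is placed in a prompt.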

Core claim

The central claim is that LLMs can automatically repair client code after breaking dependency updates. On the BUMP Java benchmark, prompts that include the buggy line, API differences, error messages, and step-by-step reasoning allow OpenAI's o3-mini to fully fix 27% of the builds and 78% of the individual compilation errors. Experiments across five models show that richer contextual information from the build process improves repair success at build, file, and error levels.

What carries the argument

Advanced prompting that supplies the LLM with build-process context including the exact buggy line, API differences, error messages, and step-by-step reasoning instructions.
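As an illustration only, the four context elements named above could be assembled into a single prompt roughly as follows. The wording of every instruction, the `RepairContext` type, and the `build_prompt` function are assumptions made for this sketch; the paper's actual prompt templates are not reproduced here.

```python
from dataclasses import dataclass


@dataclass
class RepairContext:
    """Inputs for one repair request (hypothetical structure)."""
    source_code: str     # client file that no longer compiles
    buggy_line: str      # the exact line the compiler points at
    error_message: str   # compiler message taken from the build log
    api_diff: str        # description of what changed in the dependency


def build_prompt(ctx: RepairContext, step_by_step: bool = True) -> str:
    """Join the context elements into one prompt string."""
    sections = [
        "The following Java client code fails to compile after a dependency update.",
        f"Compilation error:\n{ctx.error_message}",
        f"Failing line:\n{ctx.buggy_line}",
        f"API changes in the updated dependency:\n{ctx.api_diff}",
        f"Client code:\n{ctx.source_code}",
    ]
    if step_by_step:
        # Illustrative wording for the reasoning instruction, not the paper's.
        sections.append(
            "Reason step by step about which API change causes the error, "
            "then return the complete corrected file."
        )
    return "\n\n".join(sections)
```

Keeping each element as a separate section makes it straightforward to drop one (for example the API diff) and compare repair rates with and without it.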

If this is right

Developers can receive automated suggestions for updating code after dependency changes.
Including build error messages and API change details raises the chance that an LLM will produce a working repair.
The repair approach works at three levels: whole builds, individual files, and single compilation errors.
Among tested models, o3-mini outperforms GPT-4o-mini, Gemini-2.0 Flash, Qwen2.5-32b, and DeepSeek V3.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same prompting pattern could be adapted to other languages if their build tools provide comparable error and diff information.
An IDE plugin could run these repairs in the background whenever a developer upgrades a dependency.
The technique might extend beyond compilation errors to certain runtime or test failures caused by API changes.

Load-bearing premise

The BUMP dataset of Java projects and the chosen prompting strategies will produce similar success rates on other languages, larger codebases, or different kinds of breaking changes.

What would settle it

Applying the same prompts and models to a new collection of breaking dependency updates from Python projects and checking whether full-build fix rates stay near 27%.
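One way to judge whether a new rate "stays near 27%" is to put an interval around each observed proportion. The sketch below uses the standard Wilson score interval; the counts in the usage example are hypothetical placeholders, not figures from the paper or from BUMP.

```python
import math


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 for about 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half_width = (
        z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    )
    return center - half_width, center + half_width


if __name__ == "__main__":
    # Hypothetical counts: 27 fully fixed builds out of 100 attempted.
    low, high = wilson_interval(27, 100)
    print(f"{low:.3f} to {high:.3f}")
```

If the interval for a Python benchmark overlapped the interval for the Java results, the two rates would be hard to distinguish at that sample size; a clear gap would suggest the result does not carry over.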

Original abstract

Application Programming Interfaces (APIs) facilitate the integration of third-party dependencies within the code of client applications. However, changes to an API, such as deprecation, modification of parameter names or types, or complete replacement with a new API, can break existing client code. These changes are called breaking dependency updates. It is often tedious for API users to identify the cause of these breaks and update their code accordingly. In this paper, we explore the use of Large Language Models (LLMs) to automate client code updates in response to breaking dependency updates. We evaluate our approach on the BUMP dataset, a benchmark for breaking dependency updates in Java projects. Our approach leverages LLMs with advanced prompts, including information from the build process and from the breaking dependency analysis. We assess effectiveness at three granularity levels: at the build level, the file level, and the individual compilation error level. We experiment with five LLMs: Google Gemini-2.0 Flash, OpenAI GPT4o-mini, OpenAI o3-mini, Alibaba Qwen2.5-32b-instruct, and DeepSeek V3. Our results show that LLMs can automatically repair breaking updates. Among the considered models, OpenAI's o3-mini is the best, able to completely fix 27% of the builds when using prompts that include contextual information such as the buggy line, API differences, error messages, and step-by-step reasoning instructions. Also, it fixes 78% of the individual compilation errors. Overall, our findings demonstrate the potential for LLMs to fix compilation errors due to breaking dependency updates, supporting developers in their efforts to stay up-to-date with changes in their dependencies.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

LLMs like o3-mini fix 27% of Java breaking dependency builds on BUMP with contextual prompts, but generalization is untested.

Hey,

The main thing here is that LLMs can handle a useful number of breaking dependency updates in Java code. With good prompts that include the buggy line, API differences, error messages, and reasoning steps, OpenAI's o3-mini fully fixes 27% of the builds and 78% of the individual compilation errors on the BUMP benchmark.

What the paper does is take existing prompting methods and apply them to this specific problem. They test five different LLMs and break down the results by build, file, and error level. This shows that context helps and gives a sense of current capabilities for this task. The evaluation on an external benchmark is a plus for reproducibility.

The soft spots are the limited scope and missing details. Everything is on Java projects from BUMP, with no tests on other languages or more varied breaking changes. The abstract does not report baselines or how the numbers were averaged, so it's hard to know the true strength of the results. If prompt choices were optimized after seeing the data, that could affect how general the findings are.

This work is aimed at people studying automated software updates or LLM applications in engineering. Readers who want concrete numbers on fixing dependency breaks will get something from it.

I would send this to peer review. The empirical results address a practical issue and merit closer examination by referees.

Referee Report

3 major / 1 minor

Summary. The paper explores using LLMs to automate fixes for breaking dependency updates in Java client code. Evaluated on the BUMP benchmark, it compares five LLMs with contextual prompts (including buggy lines, API differences, error messages, and reasoning steps). OpenAI o3-mini performs best, fully repairing 27% of builds and 78% of individual compilation errors.

Significance. If the results hold under more rigorous controls, the work provides concrete evidence that LLMs can assist with a frequent maintenance task in software engineering. The multi-model comparison and multi-granularity evaluation (build, file, error) on an external benchmark are strengths. However, the lack of baselines, statistical details, and cross-domain tests limits the assessed impact on general automation claims.

major comments (3)

[Evaluation section and abstract] The headline figures of 27% full build fixes and 78% individual error fixes for o3-mini are reported without specifying the number of independent runs, variance, statistical significance tests, or averaging procedure. This directly affects the soundness of the performance claims.
[Results and Discussion sections] All experiments are confined to the BUMP Java benchmark. No cross-language, cross-domain, or alternative breaking-change experiments (e.g., semantic rather than signature changes) are reported, so the observed rates may not generalize beyond this specific dataset and error distribution.
[Prompting and methodology description] The paper relies on hand-crafted contextual prompts whose construction details, potential post-hoc tuning, and comparison to simpler baselines (e.g., rule-based or non-contextual LLM calls) are not provided, leaving open whether the gains are attributable to the LLM or the supplied context.

minor comments (1)

[Abstract] The abstract states effectiveness is assessed at three granularity levels but does not name them explicitly; a brief enumeration would improve readability.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. We address each major comment point by point below, indicating the revisions we will make to improve clarity, transparency, and completeness.

Referee: [Evaluation section and abstract] The headline figures of 27% full build fixes and 78% individual error fixes for o3-mini are reported without specifying the number of independent runs, variance, statistical significance tests, or averaging procedure. This directly affects the soundness of the performance claims.

Authors: We thank the referee for highlighting this issue. Our experiments consisted of a single run per model and prompt configuration, with the temperature parameter fixed at 0 to ensure deterministic and reproducible outputs from the LLMs. Multiple independent runs were not performed due to the substantial API costs and time required to process the full BUMP dataset across five models. We will revise the Evaluation section (and update the abstract if needed) to explicitly document the single-run procedure, temperature setting, and absence of variance or statistical tests, while adding a limitations paragraph noting that future work could incorporate repeated trials for robustness.
Revision: yes

Referee: [Results and Discussion sections] All experiments are confined to the BUMP Java benchmark. No cross-language, cross-domain, or alternative breaking-change experiments (e.g., semantic rather than signature changes) are reported, so the observed rates may not generalize beyond this specific dataset and error distribution.

Authors: Our study is deliberately scoped to the Java language and the BUMP benchmark, which provides a well-defined set of breaking dependency updates based on compilation errors from signature changes. We agree that the results may not directly generalize to other languages, domains, or breaking change types such as semantic modifications. We will expand the Discussion section with a dedicated limitations subsection that acknowledges this scope and outlines future directions for cross-language and cross-domain validation, without overstating broader applicability in the current work.
Revision: partial

Referee: [Prompting and methodology description] The paper relies on hand-crafted contextual prompts whose construction details, potential post-hoc tuning, and comparison to simpler baselines (e.g., rule-based or non-contextual LLM calls) are not provided, leaving open whether the gains are attributable to the LLM or the supplied context.

Authors: The contextual prompt components (buggy lines, API differences, error messages, and step-by-step reasoning) are outlined in the methodology, and were constructed following established practices for LLM-based code repair while using a small held-out validation set for refinement to avoid test-set leakage. We will add the complete prompt templates to an appendix for full transparency and include a new results comparison against a non-contextual baseline (using only error messages) to better isolate the contribution of the provided context. This revision will clarify the role of the LLM versus the supplied information.
Revision: yes

Circularity Check

0 steps flagged

No circularity: empirical evaluation on external BUMP benchmark

The paper is an empirical study that applies off-the-shelf LLMs to the external BUMP dataset of Java breaking dependency updates and directly measures repair success rates at build, file, and error levels. No equations, fitted parameters, self-definitional constructs, or load-bearing self-citations reduce the reported percentages (27% builds, 78% errors) to quantities defined inside the paper. The central results are independent measurements against an external benchmark and do not rely on any internal derivation chain.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the assumption that the BUMP benchmark is representative of real-world breaking updates and that LLM outputs can be directly applied to source files without additional validation steps.

free parameters (1)

Prompt engineering choices
The exact wording and structure of the advanced prompts are selected by the authors to improve performance.

axioms (1)

Domain assumption: The BUMP dataset contains a representative sample of breaking dependency updates in Java.
All quantitative results are measured against this fixed benchmark.
