arxiv: 2501.16044 · v2 · submitted 2025-01-27 · 💻 cs.SE

MultiMend: Multilingual Program Repair with Context Augmentation and Multi-Hunk Patch Generation

Pith reviewed 2026-05-23 05:09 UTC · model grok-4.3

classification 💻 cs.SE
keywords automated program repair · multilingual APR · context augmentation · multi-hunk bugs · code language models · retrieval-augmented generation · bug fixing

The pith

MultiMend fixes 2,227 bugs across four languages by augmenting function context and generating multi-hunk patches.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Existing automated program repair methods often rely on language-specific tactics, use only limited function context, and struggle with bugs that span multiple code locations. MultiMend fine-tunes a pre-trained code language model, retrieves extra relevant lines through embeddings to expand the buggy function context, and builds patches that address multiple hunks at once. Evaluated on six benchmarks containing 5,501 bugs in four languages, the system fixes 2,227 bugs, matches the developer patch exactly in 1,545 cases, and resolves 121 multi-hunk bugs. Both the context augmentation and the multi-hunk mechanism improve outcomes relative to baselines that lack them. The results indicate a practical route to broader automated repair for everyday software maintenance.

Core claim

MultiMend is a multilingual learning-based APR approach that fine-tunes a pre-trained code language model, augments the usual function-based buggy context with relevant lines retrieved via embeddings, and systematically constructs patches for multi-hunk bugs. On six benchmarks with 5,501 bugs across four programming languages it fixes 2,227 bugs, of which 1,545 are identical to the developer's patch and 121 address multi-hunk bugs; both context augmentation and multi-hunk patch generation contribute positively to these results.
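The headline counts above can be turned into rates with simple arithmetic. The short Python sketch below does only that. The variable names are illustrative, and rounding to one decimal place is a presentation choice made here, not something the paper specifies.

```python
# Headline counts as reported in the core claim on this page.
TOTAL_BUGS = 5501        # bugs across the six benchmarks
FIXED = 2227             # bugs fixed by MultiMend
EXACT_MATCH = 1545       # fixes identical to the developer's patch
MULTI_HUNK_FIXED = 121   # fixes that address multi-hunk bugs


def share(part: int, whole: int) -> float:
    """Return part/whole as a percentage, rounded to one decimal place."""
    return round(100.0 * part / whole, 1)


fix_rate = share(FIXED, TOTAL_BUGS)              # fixed bugs among all bugs
exact_rate = share(EXACT_MATCH, FIXED)           # exact matches among fixes
multi_hunk_share = share(MULTI_HUNK_FIXED, FIXED)  # multi-hunk fixes among fixes

print(f"fixed: {fix_rate}% of all benchmark bugs")
print(f"exact developer match: {exact_rate}% of fixes")
print(f"multi-hunk: {multi_hunk_share}% of fixes")
```

Read this way, roughly two in five benchmark bugs are fixed, and about seven in ten of those fixes match the developer's patch exactly.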

What carries the argument

Embedding-based retrieval augmentation of function-level context together with systematic multi-hunk patch construction inside a fine-tuned code language model.
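To make the retrieval idea concrete, here is a minimal Python sketch of selecting extra context lines by their similarity to the buggy line. It is an illustration under stated assumptions, not the paper's implementation: the bag-of-tokens `embed` function is a toy stand-in for whatever learned embedding model MultiMend uses, and the names `retrieve_context` and `top_k` are hypothetical.

```python
import math
import re
from collections import Counter


def embed(line: str) -> Counter:
    """Toy embedding (assumption): counts of identifier-like tokens in the line."""
    return Counter(re.findall(r"[A-Za-z_]\w*", line.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse token-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve_context(buggy_line: str, candidates: list[str], top_k: int = 2) -> list[str]:
    """Return up to top_k candidate lines most similar to the buggy line.

    Ties are broken by original position; lines with zero similarity are dropped.
    """
    query = embed(buggy_line)
    scored = [(cosine(query, embed(c)), i, c) for i, c in enumerate(candidates)]
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [line for score, _, line in scored[:top_k] if score > 0.0]


# Hypothetical lines from outside the buggy function.
outside_lines = [
    "import os",
    "total = sum(values)",
    "count = len(values)",
    "print('done')",
]
picked = retrieve_context("return total / count", outside_lines)
print(picked)
```

The selected lines would then be appended to the function-level context before patch generation. How the paper encodes that combined input is not described on this page, so that step is left out.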

If this is right

  • The single model handles bugs in four languages without language-dependent strategies.
  • Context augmentation measurably raises the number of correct patches over function context alone.
  • Multi-hunk generation lets the model address bugs that single-hunk approaches miss.
  • Fewer patch validations are required because multi-hunk cases are handled directly.
  • A high fraction of generated patches match developer intent exactly.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same retrieval technique could be tested on other software engineering tasks such as code completion or test generation.
  • Integration into development environments might lower the manual effort spent on multi-location bugs.
  • Performance on much larger industrial codebases remains untested and could differ from the benchmark results.
  • Pairing the model with static analysis or test-generation tools might further raise the rate of correct fixes.

Load-bearing premise

Retrieval-augmented lines selected via embeddings supply relevant, non-noisy context that improves patch quality over the standard function-level context alone.

What would settle it

An ablation that removes the embedding-based retrieval step and produces no drop in the total bugs fixed or in the number of exact developer-patch matches would show the augmentation adds no value.
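The Figure 10 excerpt further down this page quotes aggregate counts with and without context augmentation, which is enough to run the coarsest version of this check. The Python sketch below computes the drop. The dictionary layout and function name are illustrative. Exact-match counts without augmentation are not given on this page, so that half of the test cannot be evaluated here.

```python
# Aggregate counts quoted in the Figure 10 excerpt on this page.
WITH_AUGMENTATION = {"correct": 2227, "plausible": 3121}
WITHOUT_AUGMENTATION = {"correct": 2204, "plausible": 3112}


def ablation_drop(with_aug: dict, without_aug: dict) -> dict:
    """Per-metric drop in patch counts when augmentation is removed."""
    return {metric: with_aug[metric] - without_aug[metric] for metric in with_aug}


drop = ablation_drop(WITH_AUGMENTATION, WITHOUT_AUGMENTATION)
any_drop = any(delta > 0 for delta in drop.values())

print(drop)
print("removing augmentation lowers at least one count:", any_drop)
```

An aggregate difference of this size says nothing about which bugs changed status, so a per-bug comparison would still be needed to judge whether the retrieved lines themselves are responsible.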

Figures

Figures reproduced from arXiv: 2501.16044 by Mohammad Hadi Sadreddini, Reza Gharibi, Seyed Mostafa Fakhrahmad.

Figure 1: Overview of MultiMend.
using the multilingual dataset of buggy hunks, their contexts, and fixed counterparts. Throughout the fine-tuning process, we select several checkpoints or snapshots of the model at different steps for the inference stage. In the inference stage, the buggy project is analyzed, and the location of the bug is identified to construct an input for patch generation. The input consists of …

Figure 2: Demonstration of buggy source file parts and input encoding with context augmentation for inference.
In contrast, we use the checkpoint ensemble strategy since it has been shown to be a cost-effective method that also yields good results (Chen et al, 2017; Gharibi et al, 2024; Huang et al, 2017). This approach involves assembling multiple checkpoints captured at distinct intervals during the fine-tuning process…

Figure 3: Multi-hunk patch validation process.
This input is then tokenized, truncated if necessary, and fed to the checkpoints to generate a ranked list of candidate patches using beam search. We use beam search with a specific beam size t on each checkpoint, generating t best patches from each checkpoint based on the maximum likelihood estimation score of each output sequence. In total, we obtain k × t patches via…

Figure 4: Number of unique and overlapped bug fixes of MultiMend and other tools.
On the Codeflaws dataset, MultiMend produces 1,864 correct fixes, 1,309 of which are identical to the developer’s fixes. This performance establishes MultiMend as the most effective tool for this large-scale dataset, significantly outperforming other tools. Similarly, on the BugAID benchmark, MultiMend successfully repairs six bugs cor…

Figure 5: Fix generated for the Closure 71 bug from Defects4J v1.2.
      if (options.hasOption(token)) {
          currentOption = options.getOption(token);
          tokens.add(token);
      } else if (stopAtNonOption) {
          eatTheRest = true;
          tokens.add(token);
      }
    + else
    + {
    +     tokens.add(token);
    + }
    (a) MultiMend’s patch.
      if (options.hasOption(token)) {
          currentOption = options.getOption(token);
    -     tokens.add(token);
      } else if (stopAtNonOption) {
          eatT…

Figure 6: Fix for Cli 19 bug from Defects4J v2.0.
Figures 5 to 7 showcase some of the unique bugs fixed by MultiMend

Figure 7
Figure 7 demonstrates unique multi-hunk patches generated for two bugs in the Codeflaws benchmark. Figure 7a shows the multi-hunk fix generated for the 382-a-bug-6714564-6714619 bug. The fix is nearly identical to the developer’s
      else if(q>0) {
          if(q%2==0) {
    -         for(i=0;i>(p+(q/2));i++) printf("%c",t[i]);
    +         for(i=0;i<(p+(q/2));i++) printf("%c",t[i]);
              printf("%s",s);
              for(i=(p+(q/2));t[i]!='\0';i++) printf("%c",t[i]);
          }…

Figure 8: Patch ranking information of correctly fixed bugs.
patch, except that the developer’s fix uses printf instead of puts in the second hunk. This patch highlights MultiMend’s capability to combine different partial patches into a full patch to fix a bug. Similarly, Figure 7b shows the multi-hunk fix generated for the 544-b-bug-11371114-11371135 bug, which matches the developer’s patch. In this bug, the hu…

Figure 9: Number of candidate patches before correct or plausible.
Number of candidate patches. To further analyze the efficiency of MultiMend, …

Figure 10: Examples of fixes generated only with augmented context.
benchmarks, the use of context augmentation yields 2,227 correct patches and 3,121 plausible patches, compared to 2,204 correct and 3,112 plausible patches without it. Other than QuixBugs (Java & Python), BugAID, and RunBugRun, where the bugs fixed with context augmentation form a superset of those fixed without it, there are differences in the bugs f…

Figure 11: Patches generated for Chart 24 from Defects4J v1.2

Original abstract

Debugging software remains a labor-intensive and time-consuming process despite advances in testing and verification. Learning-based automated program repair (APR) has shown promise in reducing the effort of manually fixing bugs. However, existing techniques face several challenges, including language-dependent strategies, limited bug context utilization, and difficulties in handling bugs that span multiple locations in the code. This paper presents MultiMend, a multilingual learning-based APR approach designed to improve repair performance through language-independent context augmentation and multi-hunk patch generation. MultiMend fine-tunes a pre-trained code language model to generate bug-fixing patches. It embeds source code lines and applies retrieval-augmented generation to augment the usual function-based buggy context with relevant lines during patch generation. The approach also systematically constructs patches for multi-hunk bugs to extend the capabilities of single-hunk models and reduce the needed patch validations. We evaluate MultiMend on six benchmarks with 5,501 bugs covering four programming languages and compare it with state-of-the-art methods. Results show that MultiMend achieves competitive effectiveness and efficiency, fixing 2,227 bugs, of which 1,545 are identical to the developer's patch, and 121 are for multi-hunk bugs. Both context augmentation and multi-hunk patch generation contribute positively to these results. Overall, MultiMend's contributions are promising and offer practical and effective techniques to enhance APR performance for real-world software maintenance.
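The abstract says multi-hunk patches are constructed systematically so that fewer validations are needed, but it does not spell out the procedure. The Python sketch below shows one way such a reduction could arise: pool ranked candidates from several checkpoints for each hunk, discard candidates that fail some per-hunk check, and only then combine hunks into full patches. Everything in it is hypothetical and chosen for illustration, including `pool_candidates`, `combine_hunks`, the `keep` predicate, and the candidate strings. It should not be read as MultiMend's actual algorithm.

```python
from itertools import product
from typing import Callable


def pool_candidates(per_checkpoint: list[list[str]]) -> list[str]:
    """Merge ranked candidate lists from several checkpoints into one list.

    Candidates are interleaved rank by rank and duplicates are dropped,
    so a patch proposed by two checkpoints appears only once.
    """
    pooled: list[str] = []
    seen: set[str] = set()
    depth = max((len(c) for c in per_checkpoint), default=0)
    for rank in range(depth):
        for candidates in per_checkpoint:
            if rank < len(candidates) and candidates[rank] not in seen:
                seen.add(candidates[rank])
                pooled.append(candidates[rank])
    return pooled


def combine_hunks(per_hunk: list[list[str]],
                  keep: Callable[[int, str], bool]) -> list[tuple[str, ...]]:
    """Filter each hunk's candidates, then form full patches by Cartesian product."""
    filtered = [[c for c in cands if keep(i, c)] for i, cands in enumerate(per_hunk)]
    return [tuple(combo) for combo in product(*filtered)]


# Two hunks, two checkpoints each, beam size 2 (all values are made up).
hunk0 = pool_candidates([["i<n", "i<=n"], ["i<n", "i!=n"]])
hunk1 = pool_candidates([["puts(s)", "printf(s)"], ["printf(s)", "puts(s)"]])

# Hypothetical per-hunk check, e.g. "this partial patch still compiles".
accepted = {"i<n", "puts(s)", "printf(s)"}
full_patches = combine_hunks([hunk0, hunk1], lambda _, cand: cand in accepted)

naive_count = len(hunk0) * len(hunk1)
print(naive_count, len(full_patches))
```

In this toy case the naive product would require six full-patch validations, while filtering per hunk first leaves two. The gap grows quickly with the number of hunks, which is why some form of pruning matters for multi-hunk repair.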

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper introduces MultiMend, a multilingual learning-based APR approach that fine-tunes a code LM, augments function-level buggy context via embedding-based retrieval-augmented generation, and systematically constructs multi-hunk patches. Evaluated on 5,501 bugs across six benchmarks in four languages, it reports fixing 2,227 bugs (1,545 exact matches to developer patches, including 121 multi-hunk fixes) and claims competitive effectiveness/efficiency with positive contributions from both proposed components.

Significance. If the results hold under rigorous validation, the work would be significant for advancing practical APR by unifying multilingual support, context augmentation, and multi-hunk handling in one framework. The scale of the evaluation (5,501 bugs, four languages) is a clear strength, as is the focus on reducing validation effort via multi-hunk construction.

major comments (2)
  1. [Abstract] The claim that 'both context augmentation and multi-hunk patch generation contribute positively' to the 2,227 fixes (and specifically the 121 multi-hunk fixes) is load-bearing for the central contribution, yet the abstract (and available description) provides no retrieval-precision metric, no ablation replacing the embedding selector with random lines or simple heuristics, and no manual inspection of selected lines. Without these controls, gains cannot be attributed to relevance of retrieved content rather than the base model or added tokens.
  2. [Abstract] The reported counts (2,227 fixes, 1,545 exact matches) are presented without details on the exact SOTA baselines compared, any statistical significance tests, or the validation procedure used for multi-hunk patches. This directly affects the soundness of the 'competitive effectiveness' claim.
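For the significance question raised in the second comment, a paired test is a natural fit when two tools are run on the same bugs. The Python sketch below implements the exact two-sided McNemar test, which looks only at the bugs that one tool fixes and the other does not. McNemar's test is offered here as a common choice, not as the test the paper uses, and the discordant counts in the example are invented for illustration because this page does not report them.

```python
from math import comb


def mcnemar_exact(only_a: int, only_b: int) -> float:
    """Two-sided exact McNemar p-value.

    only_a: bugs fixed by tool A but not by tool B
    only_b: bugs fixed by tool B but not by tool A
    Under the null hypothesis each discordant bug is equally likely
    to fall on either side, so the smaller count is Binomial(n, 0.5).
    """
    n = only_a + only_b
    if n == 0:
        return 1.0
    k = min(only_a, only_b)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)


# Hypothetical discordant counts, not taken from the paper.
p_value = mcnemar_exact(only_a=8, only_b=2)
print(f"p = {p_value:.6f}")
```

With these made-up counts the difference would not reach the conventional 0.05 threshold, which illustrates why a gap in totals is not sufficient evidence without the per-bug breakdown.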

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on the abstract. We address each major comment below and will revise the manuscript accordingly.

Point-by-point responses
  1. Referee: [Abstract] The claim that 'both context augmentation and multi-hunk patch generation contribute positively' to the 2,227 fixes (and specifically the 121 multi-hunk fixes) is load-bearing for the central contribution, yet the abstract (and available description) provides no retrieval-precision metric, no ablation replacing the embedding selector with random lines or simple heuristics, and no manual inspection of selected lines. Without these controls, gains cannot be attributed to relevance of retrieved content rather than the base model or added tokens.

    Authors: We agree that the abstract does not currently provide retrieval-precision metrics, ablations against random selection, or manual inspection results. The manuscript reports overall performance gains from the components but does not include the specific controls mentioned. We will add an ablation study comparing embedding-based retrieval to random line selection, report retrieval precision on the benchmarks, and include a summary of manual inspection of retrieved lines (e.g., relevance rate on a sample). These additions will be reflected in a revised abstract.

    revision: yes

  2. Referee: [Abstract] The reported counts (2,227 fixes, 1,545 exact matches) are presented without details on the exact SOTA baselines compared, any statistical significance tests, or the validation procedure used for multi-hunk patches. This directly affects the soundness of the 'competitive effectiveness' claim.

    Authors: The abstract provides a high-level summary of results. The full manuscript details the exact SOTA baselines in the evaluation section and tables, describes the multi-hunk patch construction and validation procedure (exact match to developer patches after systematic generation), and includes statistical tests. Due to length constraints, we will partially revise the abstract to name the primary baselines compared and note that statistical validation and multi-hunk procedures are detailed in the paper, while keeping full tables and procedures in the body.

    revision: partial

Circularity Check

0 steps flagged

No circularity; empirical benchmark results with no self-referential derivations

full rationale

The paper reports empirical outcomes from applying a fine-tuned model to 5,501 bugs across benchmarks, with no equations, fitted parameters renamed as predictions, or load-bearing self-citations. Claims of positive contribution from context augmentation and multi-hunk generation are presented as direct observations from the evaluation rather than reductions to inputs by construction. The derivation chain is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no free parameters, axioms, or invented entities are stated. The approach relies on standard fine-tuning and retrieval techniques whose details are not provided.

pith-pipeline@v0.9.0 · 5799 in / 1092 out tokens · 33725 ms · 2026-05-23T05:09:10.544269+00:00 · methodology

Reference graph

Works this paper leans on

19 extracted references · 19 canonical work pages

  1. [1]

    Allamanis M, Jackson-Flux H, Brockschmidt M (2021) Self-Supervised Bug Detection and Repair. In: Advances in Neural Information Processing Systems, Curran Associates, Inc., vol 34, pp 27,865–27,876, https://proceedings.neurips.cc/paper_files/paper/2021/hash/ea96efc03b9a050d895110db8c4af057-Abstract.html An G, Kwon M, Choi K, Yi J, Yoo S (2023) BugsC++:...

  2. [2]

    Machinery, New York, NY, USA, FSE 2014, pp 306–317, doi:10.1145/2635868.2635898 Berabi B, He J, Raychev V, Vechev M (2021) TFix: Learning to Fix Coding Errors with a Text-to-Text Transformer. In: Proceedings of the 38th International Conference on Machine Learning, PMLR, pp 780–791, https://proceedings.mlr.press/v139/berabi21a.html Bouzenia I, Devanbu...

  3. [3]

    Machinery, New York, NY, USA, ASE ’20, pp 275–286, doi:10.1145/3324884.3416587 Drain D, Clement CB, Serrato G, Sundaresan N (2021) DeepDebug: Fixing Python Bugs Using Stack Traces, Backtranslation, and Code Skeletons. doi:10.48550/arXiv.2105.09352, 2105.09352 Durieux T, Madeiral F, Martinez M, Abreu R (2019) Empirical review of Java program repair tools: ...

  4. [4]

    Dublin, Ireland, pp 7212–7225, doi:10.18653/v1/2022.acl-long.499 Gyimesi P, Vancsics B, Stocco A, Mazinanian D, Beszédes Á, Ferenc R, Mesbah A (2019) BugsJS: A Benchmark of JavaScript Bugs. In: 2019 12th IEEE Conference on Software Testing, Validation and Verification (ICST), pp 90–101, doi:10.1109/ICST.2019.00019 Hamill M, Goseva-Popstojanova K (2017)...

  5. [5]

    Machinery, New York, NY, USA, FSE 2016, pp 144–156, doi:10.1145/2950290.2950308 Huang G, Li Y, Pleiss G, Liu Z, Hopcroft JE, Weinberger KQ (2017) Snapshot Ensembles: Train 1, Get M for Free. In: International Conference on Learning Representations, https://openreview.net/forum?id=BJYwwY9ll Huang K, Meng X, Zhang J, Liu Y, Wang W, Li S, Zhang Y (2023a) A...

  6. [6]

    Press, Melbourne, Victoria, Australia, ICSE ’23, pp 1251–1263, doi:10.1109/ICSE48619.2023.00111 Jin M, Shahriar S, Tufano M, Shi X, Lu S, Sundaresan N, Svyatkovskiy A (2023) InferFix: End-to-End Program Repair with LLMs. In: Proceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering,...

  7. [7]

    Machinery, New York, NY, USA, ICSE ’20, pp 602–614, doi:10.1145/3377811.3380345 Li Y, Wang S, Nguyen TN (2022) DEAR: A novel deep learning-based approach for automated program repair. In: Proceedings of the 44th International Conference on Software Engineering, Association for Computing Machinery, New York, NY, USA, ICSE ’22, pp 511–523, doi:10.1145/35...

  8. [8]

    York, NY, USA, SPLASH Companion 2017, pp 55–56, doi:10.1145/3135932.3135941 Liu K, Koyuncu A, Bissyandé TF, Kim D, Klein J, Le Traon Y (2019a) You Cannot Fix What You Cannot Find! An Investigation of Fault Localization Bias in Benchmarking Automated Program Repair Systems. In: 2019 12th IEEE Conference on Software Testing, Validation and Verification (IC...

  9. [9]

    Machinery, New York, NY, USA, ICSE ’20, pp 615–627, doi:10.1145/3377811.3380338 Loshchilov I, Hutter F (2018) Decoupled Weight Decay Regularization. In: International Conference on Learning Representations, https://openreview.net/forum?id=Bkg6RiCqY7 Lutellier T, Pham HV, Pang L, Li Y, Wei M, Tan L (2020) CoCoNuT: Combining context-aware neural translati...

  10. [10]

    Mockus, Votta (2000) Identifying reasons for software changes using historic databases. In: Proceedings 2000 International Conference on Software Maintenance, pp 120–130, doi:10.1109/ICSM.2000.883028 Nashid N, Sintaha M, Mesbah A (2023a) Embedding Context as Code Dependencies for Neural Program Repair. In: 2023 IEEE Conference on Software Testing, Verifi...

  11. [11]

    Australia, ICSE ’23, pp 2450–2462, doi:10.1109/ICSE48619.2023.00205 Nguyen HDT, Qi D, Roychoudhury A, Chandra S (2013) SemFix: Program repair via semantic analysis. In: 2013 35th International Conference on Software Engineering (ICSE), pp 772–781, doi:10.1109/ICSE.2013.6606623 Nguyen TT, Ta QT, Chin WN (2019) Automatic Program Repair Using Formal Verifica...

  12. [12]

    Publishing, Cham, Lecture Notes in Computer Science, pp 70–91, doi:10.1007/978-3-030-11245-5_4 Parasaram N, Barr ET, Mechtaev S (2023) Rete: Learning Namespace Representation for Program Repair. In: 2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE), pp 1264–1276, doi:10.1109/ICSE48619.2023.00112 Parasaram N, Yan H, Yang B, Fl...

  13. [13]

    Systems, Curran Associates, Inc., vol 32, https://proceedings.neurips.cc/paper_files/paper/2019/hash/bdbca288fee7f92f2bfa9f7012727740-Abstract.html Prenner JA, Robbes R (2023) RunBugRun – An Executable Dataset for Automated Program Repair. doi:10.48550/arXiv.2304.01102, 2304.01102 Prenner JA, Robbes R (2024) Out of Context: How important is Local Context ...

  14. [14]

    Machinery, New York, NY, USA, ICSE ’24, pp 1–13, doi:10.1145/3597503.3639086 Prenner JA, Babii H, Robbes R (2022) Can OpenAI’s codex fix bugs? an evaluation on QuixBugs. In: Proceedings of the Third International Workshop on Automated Program Repair, Association for Computing Machinery, New York, NY, USA, APR ’22, pp 69–75, doi:10.1145/3524459.3527351 Q...

  15. [15]

    Montreal, Quebec, Canada, ICSE ’19, pp 25–36, doi:10.1109/ICSE.2019.00021 Tufano M, Watson C, Bavota G, Penta MD, White M, Poshyvanyk D (2019b) An Empirical Study on Learning Bug-Fixing Patches in the Wild via Neural Machine Translation. ACM Transactions on Software Engineering and Methodology 28(4):19:1–19:29, doi:10.1145/3340544 Vacheret R, Pérez F, Zia...

  16. [16]

    Copiloting the copilots: Fusing large language models with completion engines for automated program repair

    USA, ESEC/FSE 2023, pp 172–184, doi:10.1145/3611643.3616271 Wes M (2010) Data Structures for Statistical Computing in Python. In: Proceedings of the 9th Python in Science

  17. [17]

    Conference, SciPy, vol 445, pp 56–61, doi:10.25080/majora-92bf1922-00a Widyasari R, Sim SQ, Lok C, Qi H, Phan J, Tay Q, Tan C, Wee F, Tan JE, Yieh Y, Goh B, Thung F, Kang HJ, Hoang T, Lo D, Ouh EL (2020) BugsInPy: A database of existing bugs in Python programs to enable controlled testing and debugging studies. In: Proceedings of the 28th ACM Joint Meeti...

  18. [18]

    USA, ESEC/FSE 2020, pp 1556–1560, doi:10.1145/3368089.3417943 Wolf T, Debut L, Sanh V, Chaumond J, Delangue C, Moi A, Cistac P, Rault T, Louf R, Funtowicz M, Davison J, Shleifer S, von Platen P, Ma C, Jernite Y, Plu J, Xu C, Le Scao T, Gugger S, Drame M, Lhoest Q, Rush A (2020) Transformers: State-of-the-Art Natural Language Processing. In: Proceedings ...

  19. [19]

    USA, ESEC/FSE 2021, pp 341–353, doi:10.1145/3468264.3468544 Zhu Q, Sun Z, Zhang W, Xiong Y, Zhang L (2023) Tare: Type-Aware Neural Program Repair. In: 2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE), pp 1443–1455, doi:10.1109/ICSE48619.2023.00126