arxiv: 2605.16646 · v1 · pith:DQ33I6A2new · submitted 2026-05-15 · 💻 cs.SE

LLM-based vs. Search-based Merge Conflict Resolution: An Empirical Study of Competing Paradigms

Pith reviewed 2026-05-20 16:00 UTC · model grok-4.3

classification 💻 cs.SE
keywords: merge conflict resolution · large language models · search-based software engineering · empirical study · software evolution · hybrid resolution systems · Random Restart Hill Climbing

The pith

LLM-based tools resolve imbalanced merge conflicts better than search-based ones, but search-based tools generalize more reliably across datasets.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper compares MergeGen, an LLM-based merge conflict resolver, with SBCR, a search-based resolver built on a Random Restart Hill Climbing algorithm. It tests both on thousands of real-world conflicts collected from open-source projects in Java, C#, JavaScript, and TypeScript. The results show that the LLM method draws on learned patterns to handle conflicts in which one version carries far more content than the other, but it can fail on non-English text or oversized inputs, producing truncated or empty outputs. The search-based method, by contrast, performs more consistently on balanced conflicts and across different project datasets, without relying on training data. The authors conclude that a hybrid approach combining both paradigms would likely yield more dependable resolution tools for everyday software development.

Core claim

The LLM paradigm excels at resolving conflicts with imbalanced content by leveraging patterns learned from training data, but it struggles with non-English content and large inputs, which can lead to truncated or empty resolutions. The SBSE paradigm generalizes better across datasets and performs best on balanced conflicts. Neither approach is universally superior; each has context-dependent strengths.
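
The notion of "imbalanced content" can be made concrete with a small sketch. The paper's own definition is not reproduced on this page, so the line-count ratio, the 0.5 cutoff, and the function names below are illustrative assumptions rather than the authors' measure.

```python
from typing import List, Sequence, Tuple


def split_conflict(chunk: str) -> Tuple[List[str], List[str]]:
    """Split one Git conflict chunk into the lines of each side.

    Handles the default two-way markers and skips the optional
    diff3 base section introduced by '|||||||'.
    """
    ours: List[str] = []
    theirs: List[str] = []
    base: List[str] = []  # collected but not returned
    target = None
    for line in chunk.splitlines():
        if line.startswith("<<<<<<<"):
            target = ours
        elif line.startswith("|||||||"):
            target = base
        elif line.startswith("=======") and target is not None:
            target = theirs
        elif line.startswith(">>>>>>>"):
            target = None
        elif target is not None:
            target.append(line)
    return ours, theirs


def imbalance(ours: Sequence[str], theirs: Sequence[str]) -> float:
    """Hypothetical measure: 0.0 for equal line counts, near 1.0 when one side dominates."""
    larger = max(len(ours), len(theirs))
    if larger == 0:
        return 0.0
    return abs(len(ours) - len(theirs)) / larger


def is_imbalanced(ours: Sequence[str], theirs: Sequence[str], threshold: float = 0.5) -> bool:
    """Assumed cutoff; the study may draw the line differently."""
    return imbalance(ours, theirs) >= threshold
```

A size-based measure like this ignores how different the two sides are in content; it only captures the kind of asymmetry the claim refers to.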

What carries the argument

Direct empirical comparison of MergeGen, a state-of-the-art LLM-based tool, against SBCR, a novel SBSE approach using Random Restart Hill Climbing, on thousands of real-world merge conflicts from multiple programming languages.
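
For readers unfamiliar with Random Restart Hill Climbing, the following is a minimal sketch of the general technique applied to a conflict. It is not SBCR's implementation: the candidate encoding (a keep/drop flag per line of the two sides, in their original order) and the fitness function (mean `difflib` similarity to both sides) are assumptions made here for illustration, and all names are hypothetical.

```python
import random
from difflib import SequenceMatcher
from typing import Callable, Iterator, Sequence, Tuple

Candidate = Tuple[bool, ...]  # one keep/drop flag per pooled line


def render(pool: Sequence[str], cand: Candidate) -> str:
    """Build candidate text from the lines whose flag is set."""
    return "\n".join(line for line, keep in zip(pool, cand) if keep)


def parent_similarity(ours: Sequence[str], theirs: Sequence[str]) -> Callable[[str], float]:
    """Assumed fitness: average similarity of a candidate to both sides."""
    a, b = "\n".join(ours), "\n".join(theirs)

    def fitness(text: str) -> float:
        return (SequenceMatcher(None, text, a).ratio()
                + SequenceMatcher(None, text, b).ratio()) / 2

    return fitness


def neighbours(cand: Candidate) -> Iterator[Candidate]:
    """All candidates that differ from cand in exactly one flag."""
    for i in range(len(cand)):
        yield cand[:i] + (not cand[i],) + cand[i + 1:]


def hill_climb(start: Candidate, score: Callable[[Candidate], float]) -> Tuple[Candidate, float]:
    """Move to the best neighbour until no neighbour strictly improves."""
    current, current_score = start, score(start)
    while True:
        best = max(neighbours(current), key=score, default=current)
        best_score = score(best)
        if best_score <= current_score:
            return current, current_score
        current, current_score = best, best_score


def rrhc(ours: Sequence[str], theirs: Sequence[str],
         restarts: int = 10, seed: int = 0) -> Tuple[str, float]:
    """Run hill climbing from several random starts and keep the best local optimum."""
    pool = list(ours) + list(theirs)
    fitness = parent_similarity(ours, theirs)

    def score(cand: Candidate) -> float:
        return fitness(render(pool, cand))

    rng = random.Random(seed)
    best_cand: Candidate = tuple(False for _ in pool)
    best_score = float("-inf")
    for _ in range(restarts):
        start = tuple(rng.random() < 0.5 for _ in pool)
        cand, cand_score = hill_climb(start, score)
        if cand_score > best_score:
            best_cand, best_score = cand, cand_score
    return render(pool, best_cand), best_score
```

The restarts are what distinguish this from plain hill climbing: each climb stops at a local optimum, and sampling several starting points reduces the chance of reporting a poor one. Note that no training data enters the search, which is the property the page describes as data independence.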

Load-bearing premise

The chosen tools MergeGen and SBCR adequately represent the LLM-based and SBSE paradigms respectively, and the conflicts drawn from the selected open-source projects form a representative sample without significant bias toward particular conflict types or languages.

What would settle it

Finding that MergeGen resolves balanced conflicts as effectively as imbalanced ones or succeeds consistently with non-English and large inputs, or that SBCR shows poor performance on new datasets, would challenge the identified trade-offs.
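
A check of this kind could be run per conflict by scoring each tool's candidate against the resolution the developers adopted and tallying the paired difference, in the spirit of the SBCR minus MergeGen difference plotted in Figure 8. The sketch below uses `difflib`'s ratio as a stand-in similarity; the metric actually used in the paper is not stated on this page, and the `Case` type and function names are hypothetical.

```python
from difflib import SequenceMatcher
from typing import Dict, Iterable, NamedTuple


class Case(NamedTuple):
    developer: str  # resolution the developers committed
    sbcr: str       # candidate produced by the search-based tool
    mergegen: str   # candidate produced by the LLM-based tool


def similarity(candidate: str, reference: str) -> float:
    """Stand-in similarity in [0, 1]; an assumption, not the paper's metric."""
    return SequenceMatcher(None, candidate, reference).ratio()


def diff_sbcr_mergegen(case: Case) -> float:
    """Positive when SBCR is closer to the developer resolution, negative when MergeGen is."""
    return (similarity(case.sbcr, case.developer)
            - similarity(case.mergegen, case.developer))


def tally(cases: Iterable[Case], eps: float = 1e-9) -> Dict[str, int]:
    """Count wins and ties over a set of conflicts."""
    counts = {"sbcr_better": 0, "tie": 0, "mergegen_better": 0}
    for case in cases:
        d = diff_sbcr_mergegen(case)
        if d > eps:
            counts["sbcr_better"] += 1
        elif d < -eps:
            counts["mergegen_better"] += 1
        else:
            counts["tie"] += 1
    return counts
```

Running `tally` separately on balanced and imbalanced conflicts, or on non-English and oversized inputs, would show whether the reported trade-offs hold within each stratum.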

Figures

Figures reproduced from arXiv: 2605.16646 by Heleno de Souza Campos Junior, Leonardo Gresta Paulino Murta.

Figure 1: Boxplots for the similarities between the resolution adopted by the developers and ...
Figure 2: Boxplots for the time in seconds each approach takes to generate a candidate ...
Figure 3: Boxplots for the similarities between the resolution adopted by the developers and ...
Figure 4: Box plots comparing the similarity of candidates generated by SBCR and Merge...
Figure 5: Box plot comparing the similarity of candidates generated by SBCR and MergeGen, ...
Figure 6: Box plots comparing the similarity of candidates generated by SBCR and Merge...
Figure 7: Box plot comparing the similarity of candidates generated by SBCR and MergeGen, ...
Figure 8: Distribution of Diff_SBCR−MergeGen values across all datasets. Negative values (gray) indicate better performance by MergeGen, positive values (black) indicate better performance by SBCR, and zero (white bars) indicate ties. The histograms in ...
Figure 9: Distribution of the size differences between ...
Original abstract

Context: The resolution of software merge conflicts is being reshaped by two competing paradigms: generative approaches based on Large Language Models (LLMs) and optimization approaches from Search-Based Software Engineering (SBSE). While tools from both paradigms have shown promise, their relative strengths, weaknesses, and trade-offs are not yet well understood. Objective: This paper presents the first in-depth empirical study directly comparing these paradigms to identify their capabilities and limitations in real-world scenarios. Method: We evaluated MergeGen, a state-of-the-art LLM-based tool, against SBCR, a novel SBSE approach employing a Random Restart Hill Climbing (RRHC) algorithm. The comparison used thousands of real-world conflicts from open-source projects written in Java, C#, JavaScript, and TypeScript. Results: Our findings reveal fundamental trade-offs. The LLM paradigm excels at resolving conflicts with imbalanced content by leveraging learned patterns. However, it struggles with non-English content and large inputs, which can lead to truncated or empty resolutions. Conversely, the SBSE paradigm demonstrates superior generalization across datasets and performs best on balanced conflicts, highlighting its potential as a robust, data-independent alternative. Conclusions: Neither paradigm is a silver bullet. Our findings highlight context-dependent strengths and motivate the development of hybrid systems that combine the complementary capabilities of LLM and SBSE approaches to create more robust and reliable merge conflict resolution tools.
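
The abstract does not describe how such a hybrid would work. As one possible reading, a dispatcher could route a conflict to the search-based resolver whenever the reported LLM failure conditions apply. Everything in the sketch below is an assumption made for illustration: the `Resolver` signature, the character budget, and the use of a non-ASCII check as a crude proxy for non-English content.

```python
from typing import Callable, Sequence

# Hypothetical interface: a resolver maps the two conflicting sides to a candidate text.
Resolver = Callable[[Sequence[str], Sequence[str]], str]


def risky_input(ours: Sequence[str], theirs: Sequence[str], max_input_chars: int) -> bool:
    """Flag inputs resembling the reported LLM failure cases (assumed heuristics)."""
    text = "\n".join(list(ours) + list(theirs))
    too_large = len(text) > max_input_chars
    non_ascii = not text.isascii()  # crude proxy for non-English content
    return too_large or non_ascii


def hybrid_resolve(ours: Sequence[str], theirs: Sequence[str],
                   llm: Resolver, search: Resolver,
                   max_input_chars: int = 2000) -> str:
    """Prefer the LLM resolver, falling back to search on risky input or empty output."""
    if risky_input(ours, theirs, max_input_chars):
        return search(ours, theirs)
    candidate = llm(ours, theirs)
    if not candidate.strip():
        return search(ours, theirs)
    return candidate
```

Checking the input before calling the LLM avoids spending a model call on cases likely to fail; detecting a truncated (rather than empty) output would need an additional signal that this sketch does not attempt.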

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper presents the first in-depth empirical comparison of LLM-based and SBSE-based paradigms for software merge conflict resolution. It evaluates the state-of-the-art LLM tool MergeGen against the novel SBSE tool SBCR (using Random Restart Hill Climbing) on thousands of real-world merge conflicts drawn from open-source Java, C#, JavaScript, and TypeScript projects. The central findings are that the LLM approach excels on imbalanced-content conflicts by leveraging learned patterns but struggles with non-English content and large inputs (leading to truncation or empty outputs), while the SBSE approach shows superior generalization across datasets and performs best on balanced conflicts. The paper concludes that neither paradigm is a silver bullet and motivates hybrid systems.

Significance. If the results hold, this study is significant for software engineering because it supplies the first direct, large-scale empirical evidence on the complementary strengths and limitations of two competing paradigms for a practical problem. The use of thousands of real-world conflicts across four languages provides a stronger foundation than prior tool-specific evaluations and could guide development of more robust resolution tools. The identification of context-dependent trade-offs (imbalanced vs. balanced conflicts, language and size sensitivity) is a concrete contribution that could be built upon.

major comments (2)
  1. [Abstract and Results] Abstract and Results section: The central claim of paradigm-level trade-offs (LLM excels on imbalanced content; SBSE generalizes better and wins on balanced conflicts) rests on the untested assumption that MergeGen and SBCR are representative proxies for the broader LLM-based and SBSE paradigms. Without experiments using additional LLMs (different models, fine-tunes, or prompting) or other SBSE algorithms (e.g., genetic algorithms), the reported trade-offs may be tool-specific rather than paradigm-level.
  2. [Method] Method section: The paper selects thousands of conflicts from specific open-source projects but does not discuss or mitigate potential selection biases (language distribution, project size, or commit patterns) that could affect the observed performance on non-English content and cross-dataset generalization. This weakens support for the generalization claim.
minor comments (2)
  1. [Abstract] Abstract: The exact total number of conflicts, their distribution across the four languages, and the precise success metrics (e.g., exact-match rate, semantic equivalence) should be stated to allow readers to assess the scale and balance of the evaluation.
  2. [Introduction] The distinction between the SBSE paradigm and the specific RRHC implementation in SBCR could be clarified earlier to avoid conflating the two.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and insightful comments. These points help clarify the scope of our claims regarding paradigm-level trade-offs and the robustness of our methodological choices. We respond to each major comment below and indicate the revisions made to the manuscript.

  1. Referee: [Abstract and Results] Abstract and Results section: The central claim of paradigm-level trade-offs (LLM excels on imbalanced content; SBSE generalizes better and wins on balanced conflicts) rests on the untested assumption that MergeGen and SBCR are representative proxies for the broader LLM-based and SBSE paradigms. Without experiments using additional LLMs (different models, fine-tunes, or prompting) or other SBSE algorithms (e.g., genetic algorithms), the reported trade-offs may be tool-specific rather than paradigm-level.

    Authors: MergeGen was chosen as the leading publicly available LLM-based tool for merge conflict resolution at the time of the study, while SBCR implements a standard Random Restart Hill Climbing algorithm that is representative of optimization-based SBSE approaches for this problem. We positioned the comparison as one between prominent instances of each paradigm to surface potential differences in behavior. We agree that broader sampling across models and algorithms would provide stronger evidence for paradigm-wide conclusions. In the revised manuscript we have qualified the language in the abstract and results sections to frame the findings as demonstrated for these representative tools, added explicit discussion of tool representativeness, and expanded the Threats to Validity section to note that the observed trade-offs may not generalize to all possible LLM configurations or SBSE algorithms. We have also added a forward-looking statement motivating future multi-tool studies. These textual changes address the concern directly; new experiments with additional tools lie outside the scope of the current revision.
    revision: partial

  2. Referee: [Method] Method section: The paper selects thousands of conflicts from specific open-source projects but does not discuss or mitigate potential selection biases (language distribution, project size, or commit patterns) that could affect the observed performance on non-English content and cross-dataset generalization. This weakens support for the generalization claim.

    Authors: The conflicts were sampled from active, publicly available open-source repositories in Java, C#, JavaScript, and TypeScript to reflect realistic usage. Selection prioritized projects with sufficient merge history and language coverage. We acknowledge that the original submission did not explicitly analyze or mitigate selection biases. In the revision we have added a detailed account of the data collection procedure in the Method section, specifying inclusion criteria such as repository activity, language prevalence, and commit volume. We have also inserted a new Threats to Validity subsection that discusses potential biases arising from language distribution, project size, and commit patterns, and how these might influence results on non-English inputs and cross-dataset performance. This addition improves transparency and allows readers to evaluate the generalizability claims more accurately.
    revision: yes

Circularity Check

0 steps flagged

No circularity: empirical comparison grounded in external data

This is a direct empirical study that evaluates two concrete tools (MergeGen and SBCR) on thousands of real-world merge conflicts drawn from open-source repositories. The central claims consist of observed performance differences (e.g., LLM handling of imbalanced content vs. SBSE generalization) rather than any mathematical derivation, fitted parameter, or self-referential prediction. No equations, ansatzes, or uniqueness theorems are invoked; results are reported from running the tools on external data. Self-citations, if present, are not load-bearing for the reported findings. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claims rest on the representativeness of the two chosen tools and the selected conflict datasets; no free parameters are fitted, no new entities are postulated, and the main supporting assumptions are domain-level choices about tool selection and data sampling.

axioms (2)
  • domain assumption: MergeGen and SBCR are representative implementations of the LLM-based and SBSE paradigms respectively.
    The comparison draws paradigm-level conclusions from performance of these two specific tools.
  • domain assumption: The collected real-world conflicts from open-source Java, C#, JavaScript, and TypeScript projects form a representative sample for evaluating generalization.
    Generalization claims depend on this sampling assumption.

pith-pipeline@v0.9.0 · 5787 in / 1469 out tokens · 61428 ms · 2026-05-20T16:00:25.809771+00:00 · methodology
