LLM-based vs. Search-based Merge Conflict Resolution: An Empirical Study of Competing Paradigms
Pith reviewed 2026-05-20 16:00 UTC · model grok-4.3
The pith
LLM-based tools resolve imbalanced merge conflicts better than search-based ones, but search-based tools generalize more reliably across languages and datasets.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim has two parts. First, the LLM paradigm excels at resolving conflicts with imbalanced content by leveraging patterns learned from training data, but it struggles with non-English content and with large inputs, both of which can lead to truncated or empty resolutions. Second, the SBSE paradigm generalizes better across datasets and performs best on balanced conflicts. Neither approach is therefore universally superior; each has context-dependent strengths.
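The balanced versus imbalanced distinction can be made concrete with a small sketch. The page does not state how the paper defines imbalance, so the line-count ratio below is a hypothetical proxy, and the function names `parse_conflict` and `balance` are illustrative only. The parser assumes Git's default two-way conflict markers and does not handle the diff3 style, which adds a base section.

```python
def parse_conflict(text):
    """Split one two-way conflict block into (ours, theirs) line lists.

    Assumes Git's default marker layout: <<<<<<<, =======, >>>>>>>.
    """
    ours, theirs, side = [], [], None
    for line in text.splitlines():
        if line.startswith("<<<<<<<"):
            side = ours
        elif line.startswith("=======") and side is ours:
            side = theirs
        elif line.startswith(">>>>>>>"):
            side = None
        elif side is not None:
            side.append(line)
    return ours, theirs


def balance(ours, theirs):
    """Hypothetical balance score in [0, 1]; 1.0 means equal line counts."""
    longer = max(len(ours), len(theirs))
    if longer == 0:
        return 1.0
    return min(len(ours), len(theirs)) / longer
```

Under this assumed measure, a conflict with three lines on one side and one on the other scores about 0.33, while two sides of equal length score 1.0. A study could bucket conflicts by such a score, though the thresholds would need to come from the paper itself.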
What carries the argument
Direct empirical comparison of MergeGen, a state-of-the-art LLM-based tool, against SBCR, a novel SBSE approach using Random Restart Hill Climbing, on thousands of real-world merge conflicts from multiple programming languages.
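To illustrate the search strategy named above, here is a minimal sketch of Random Restart Hill Climbing applied to a toy version of the problem. It is not SBCR's implementation. Two things are assumptions made for illustration: the candidate encoding (a keep/drop mask over the lines of both sides) and the fitness function (mean `difflib` similarity to both sides).

```python
import difflib
import random


def similarity(a, b):
    """Line-level similarity in [0, 1] from the standard library's difflib."""
    return difflib.SequenceMatcher(None, a, b).ratio()


def fitness(candidate, left, right):
    """Assumed objective: mean similarity of the candidate to both sides."""
    return (similarity(candidate, left) + similarity(candidate, right)) / 2


def select(pool, mask):
    """Keep the pool lines whose mask bit is True, preserving order."""
    return [line for line, keep in zip(pool, mask) if keep]


def climb(pool, mask, left, right):
    """First-improvement hill climbing over single-bit flips of the mask."""
    score = fitness(select(pool, mask), left, right)
    improved = True
    while improved:
        improved = False
        for i in range(len(mask)):
            neighbour = mask[:i] + (not mask[i],) + mask[i + 1:]
            s = fitness(select(pool, neighbour), left, right)
            if s > score:
                mask, score, improved = neighbour, s, True
                break
    return mask, score


def rrhc(left, right, restarts=10, seed=0):
    """Run hill climbing from several random masks and keep the best result."""
    rng = random.Random(seed)
    pool = left + right
    # Baseline candidate: keep every line from both sides.
    best_mask = tuple(True for _ in pool)
    best_score = fitness(select(pool, best_mask), left, right)
    for _ in range(restarts):
        start = tuple(rng.random() < 0.5 for _ in pool)
        mask, score = climb(pool, start, left, right)
        if score > best_score:
            best_mask, best_score = mask, score
    return select(pool, best_mask), best_score
```

The restarts are what separate this from plain hill climbing: each climb stops at a local optimum, and restarting from a fresh random mask gives the search further chances to land in a better basin. No training data is involved, which is consistent with the abstract's description of the SBSE approach as data-independent.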
Load-bearing premise
The chosen tools MergeGen and SBCR adequately represent the LLM-based and SBSE paradigms respectively, and the conflicts drawn from the selected open-source projects form a representative sample without significant bias toward particular conflict types or languages.
What would settle it
Finding that MergeGen resolves balanced conflicts as effectively as imbalanced ones or succeeds consistently with non-English and large inputs, or that SBCR shows poor performance on new datasets, would challenge the identified trade-offs.
Original abstract
Context: The resolution of software merge conflicts is being reshaped by two competing paradigms: generative approaches based on Large Language Models (LLMs) and optimization approaches from Search-Based Software Engineering (SBSE). While tools from both paradigms have shown promise, their relative strengths, weaknesses, and trade-offs are not yet well understood.
Objective: This paper presents the first in-depth empirical study directly comparing these paradigms to identify their capabilities and limitations in real-world scenarios.
Method: We evaluated MergeGen, a state-of-the-art LLM-based tool, against SBCR, a novel SBSE approach employing a Random Restart Hill Climbing (RRHC) algorithm. The comparison used thousands of real-world conflicts from open-source projects written in Java, C#, JavaScript, and TypeScript.
Results: Our findings reveal fundamental trade-offs. The LLM paradigm excels at resolving conflicts with imbalanced content by leveraging learned patterns. However, it struggles with non-English content and large inputs, which can lead to truncated or empty resolutions. Conversely, the SBSE paradigm demonstrates superior generalization across datasets and performs best on balanced conflicts, highlighting its potential as a robust, data-independent alternative.
Conclusions: Neither paradigm is a silver bullet. Our findings highlight context-dependent strengths and motivate the development of hybrid systems that combine the complementary capabilities of LLM and SBSE approaches to create more robust and reliable merge conflict resolution tools.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents the first in-depth empirical comparison of LLM-based and SBSE-based paradigms for software merge conflict resolution. It evaluates the state-of-the-art LLM tool MergeGen against the novel SBSE tool SBCR (using Random Restart Hill Climbing) on thousands of real-world merge conflicts drawn from open-source Java, C#, JavaScript, and TypeScript projects. The central findings are that the LLM approach excels on imbalanced-content conflicts by leveraging learned patterns but struggles with non-English content and large inputs (leading to truncation or empty outputs), while the SBSE approach shows superior generalization across datasets and performs best on balanced conflicts. The paper concludes that neither paradigm is a silver bullet and motivates hybrid systems.
Significance. If the results hold, this study is significant for software engineering because it supplies the first direct, large-scale empirical evidence on the complementary strengths and limitations of two competing paradigms for a practical problem. The use of thousands of real-world conflicts across four languages provides a stronger foundation than prior tool-specific evaluations and could guide development of more robust resolution tools. The identification of context-dependent trade-offs (imbalanced vs. balanced conflicts, language and size sensitivity) is a concrete contribution that could be built upon.
major comments (2)
- [Abstract and Results] Abstract and Results section: The central claim of paradigm-level trade-offs (LLM excels on imbalanced content; SBSE generalizes better and wins on balanced conflicts) rests on the untested assumption that MergeGen and SBCR are representative proxies for the broader LLM-based and SBSE paradigms. Without experiments using additional LLMs (different models, fine-tunes, or prompting) or other SBSE algorithms (e.g., genetic algorithms), the reported trade-offs may be tool-specific rather than paradigm-level.
- [Method] Method section: The paper selects thousands of conflicts from specific open-source projects but does not discuss or mitigate potential selection biases (language distribution, project size, or commit patterns) that could affect the observed performance on non-English content and cross-dataset generalization. This weakens support for the generalization claim.
minor comments (2)
- [Abstract] Abstract: The exact total number of conflicts, their distribution across the four languages, and the precise success metrics (e.g., exact-match rate, semantic equivalence) should be stated to allow readers to assess the scale and balance of the evaluation.
- [Introduction] The distinction between the SBSE paradigm and the specific RRHC implementation in SBCR could be clarified earlier to avoid conflating the two.
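The reporting requested in the first minor comment could look roughly like the sketch below, which tabulates conflict counts, exact-match rate, and empty-output rate per language. The record layout, the whitespace normalization, and the names `normalize` and `summarize` are assumptions for illustration; the page does not say which success metric the paper uses.

```python
from collections import defaultdict


def normalize(text):
    """Assumed normalization: strip trailing whitespace and drop blank lines."""
    return [line.rstrip() for line in text.splitlines() if line.strip()]


def summarize(records):
    """Aggregate (language, predicted, expected) string triples per language."""
    stats = defaultdict(lambda: {"total": 0, "exact": 0, "empty": 0})
    for language, predicted, expected in records:
        row = stats[language]
        row["total"] += 1
        # Booleans add as 0 or 1, so these act as counters.
        row["exact"] += normalize(predicted) == normalize(expected)
        row["empty"] += not predicted.strip()
    return {
        language: {
            "total": row["total"],
            "exact_match_rate": row["exact"] / row["total"],
            "empty_rate": row["empty"] / row["total"],
        }
        for language, row in stats.items()
    }
```

Tracking the empty-output rate separately from the match rate matters here because the reported LLM failure mode (truncated or empty resolutions) would otherwise be indistinguishable from an ordinary wrong answer.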
Simulated Author's Rebuttal
We thank the referee for the constructive and insightful comments. These points help clarify the scope of our claims regarding paradigm-level trade-offs and the robustness of our methodological choices. We respond to each major comment below and indicate the revisions made to the manuscript.
- Referee: [Abstract and Results] Abstract and Results section: The central claim of paradigm-level trade-offs (LLM excels on imbalanced content; SBSE generalizes better and wins on balanced conflicts) rests on the untested assumption that MergeGen and SBCR are representative proxies for the broader LLM-based and SBSE paradigms. Without experiments using additional LLMs (different models, fine-tunes, or prompting) or other SBSE algorithms (e.g., genetic algorithms), the reported trade-offs may be tool-specific rather than paradigm-level.
Authors: MergeGen was chosen as the leading publicly available LLM-based tool for merge conflict resolution at the time of the study, while SBCR implements a standard Random Restart Hill Climbing algorithm that is representative of optimization-based SBSE approaches for this problem. We positioned the comparison as one between prominent instances of each paradigm to surface potential differences in behavior. We agree that broader sampling across models and algorithms would provide stronger evidence for paradigm-wide conclusions. In the revised manuscript we have qualified the language in the abstract and results sections to frame the findings as demonstrated for these representative tools, added explicit discussion of tool representativeness, and expanded the Threats to Validity section to note that the observed trade-offs may not generalize to all possible LLM configurations or SBSE algorithms. We have also added a forward-looking statement motivating future multi-tool studies. These textual changes address the concern directly; new experiments with additional tools lie outside the scope of the current revision.
revision: partial
- Referee: [Method] Method section: The paper selects thousands of conflicts from specific open-source projects but does not discuss or mitigate potential selection biases (language distribution, project size, or commit patterns) that could affect the observed performance on non-English content and cross-dataset generalization. This weakens support for the generalization claim.
Authors: The conflicts were sampled from active, publicly available open-source repositories in Java, C#, JavaScript, and TypeScript to reflect realistic usage. Selection prioritized projects with sufficient merge history and language coverage. We acknowledge that the original submission did not explicitly analyze or mitigate selection biases. In the revision we have added a detailed account of the data collection procedure in the Method section, specifying inclusion criteria such as repository activity, language prevalence, and commit volume. We have also inserted a new Threats to Validity subsection that discusses potential biases arising from language distribution, project size, and commit patterns, and how these might influence results on non-English inputs and cross-dataset performance. This addition improves transparency and allows readers to evaluate the generalizability claims more accurately.
revision: yes
Circularity Check
No circularity: empirical comparison grounded in external data
This is a direct empirical study that evaluates two concrete tools (MergeGen and SBCR) on thousands of real-world merge conflicts drawn from open-source repositories. The central claims consist of observed performance differences (e.g., LLM handling of imbalanced content vs. SBSE generalization) rather than any mathematical derivation, fitted parameter, or self-referential prediction. No equations, ansatzes, or uniqueness theorems are invoked; results are reported from running the tools on external data. Self-citations, if present, are not load-bearing for the reported findings. The derivation chain is therefore self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: MergeGen and SBCR are representative implementations of the LLM-based and SBSE paradigms respectively.
- domain assumption: The collected real-world conflicts from open-source Java, C#, JavaScript, and TypeScript projects form a representative sample for evaluating generalization.