
arxiv: 2605.29295 · v1 · pith:5MSAVHF4new · submitted 2026-05-28 · 💻 cs.NE

EvoGM: Learning to Merge LLMs via Evolutionary Generative Optimization

Pith reviewed 2026-06-29 00:19 UTC · model grok-4.3

classification 💻 cs.NE
keywords model merging · evolutionary algorithms · generative models · large language models · cycle-consistent learning · parameter optimization · LLM composition

The pith

EvoGM replaces hand-crafted operators in evolutionary LLM merging with a learnable dual-generator architecture trained on winner-loser pairs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents EvoGM as a way to automate the composition of large language models by searching in parameter space. Instead of relying on fixed stochastic rules to choose how to combine model weights, it trains two generators with cycle-consistent losses on pairs of strong and weak merges collected from earlier search steps. These generators then propose new coefficient sets that feed into repeated rounds of evolution, where the best resulting models become the base experts for the next round. A reader would care if this removes the need for manual design of merging heuristics while still delivering gains on both familiar and new tasks.

Core claim

EvoGM features a dual-generator architecture with cycle-consistent learning to adaptively sample and refine promising merging candidates. By constructing winner-loser pairs from historical search trajectories, the framework captures high-performance parameter distributions and maximizes data efficiency. This generative process is integrated into a multi-round evolutionary pipeline where elite merged models iteratively serve as new expert foundations.
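The winner-loser construction can be illustrated with a minimal sketch. It assumes the history is a list of (coefficients, fitness) tuples and that higher fitness is better. The `top_frac` cutoff and the random pairing rule are hypothetical, since the summary here does not state how the paper splits or pairs candidates.

```python
import random

def build_winner_loser_pairs(history, top_frac=0.5, seed=0):
    """Split an evaluated history into winners/losers and pair them.

    history: list of (coeffs, fitness) tuples; higher fitness is better.
    top_frac: hypothetical cutoff, not taken from the paper.
    """
    if len(history) < 2:
        raise ValueError("need at least two evaluated candidates")
    ranked = sorted(history, key=lambda item: item[1], reverse=True)
    # Keep at least one winner and at least one loser.
    n_win = min(max(1, int(len(ranked) * top_frac)), len(ranked) - 1)
    winners, losers = ranked[:n_win], ranked[n_win:]
    rng = random.Random(seed)
    # Assumed pairing rule: each loser is matched with a random winner.
    pairs = [(rng.choice(winners)[0], loser[0]) for loser in losers]
    return winners, losers, pairs

history = [
    ([0.5, 0.5], 0.61),
    ([1.0, 0.0], 0.48),
    ([0.0, 1.0], 0.52),
    ([0.7, 0.3], 0.66),
]
winners, losers, pairs = build_winner_loser_pairs(history)
```

Each pair holds a (winner coefficients, loser coefficients) tuple, which is the kind of training example a pair of generators mapping between the two sets could consume.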

What carries the argument

Dual-generator architecture with cycle-consistent learning on winner-loser pairs drawn from search trajectories
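One plausible form of the cycle-consistency term is sketched below. It follows the generator names in the paper's Figure 2 caption (G−→+ maps losers toward winners, G+→− maps winners toward losers) and writes H+ and H− for the winner and loser sets. The L1 norm and the equal weighting of the two terms are assumptions borrowed from standard cycle-consistent training, not details confirmed by the source.

```latex
\mathcal{L}_{\mathrm{cyc}}
  = \mathbb{E}_{\lambda^{-} \sim H^{-}}
      \bigl\| G_{+\to-}\bigl(G_{-\to+}(\lambda^{-})\bigr) - \lambda^{-} \bigr\|_{1}
  + \mathbb{E}_{\lambda^{+} \sim H^{+}}
      \bigl\| G_{-\to+}\bigl(G_{+\to-}(\lambda^{+})\bigr) - \lambda^{+} \bigr\|_{1}
```

Under this reading, a loser mapped to the winner side and back should return to where it started, which discourages either generator from collapsing onto a single output.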

If this is right

  • Merging coefficients can be proposed by learned generators rather than hand-crafted stochastic operators.
  • Historical search data alone suffices to train the system without extra labeled validation merges.
  • Elite merged models can be reused as expert bases in subsequent evolutionary rounds.
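The elite-reuse idea in the list above can be sketched as a toy loop. This is an illustration under stated assumptions: models are short lists of floats, `fitness` is a hypothetical negative squared distance to a made-up target, and uniform random proposals stand in for the learned generators. None of these choices come from the paper.

```python
import random

def merge(experts, coeffs):
    """Weighted sum of expert parameter vectors (toy stand-in for merging)."""
    dim = len(experts[0])
    return [sum(c * e[i] for c, e in zip(coeffs, experts)) for i in range(dim)]

def fitness(params, target):
    """Toy fitness: negative squared distance to a hypothetical optimum."""
    return -sum((p - t) ** 2 for p, t in zip(params, target))

def multi_round_search(experts, target, rounds=3, samples=50, n_elite=2, seed=0):
    rng = random.Random(seed)
    best_per_round = []
    for _ in range(rounds):
        n = len(experts)
        # One-hot coefficients reproduce each current expert, so a round
        # never scores worse than its own starting experts.
        proposals = [[1.0 if i == j else 0.0 for i in range(n)] for j in range(n)]
        # Random proposals stand in for the learned generators.
        proposals += [[rng.uniform(0.0, 1.0) for _ in range(n)] for _ in range(samples)]
        scored = []
        for coeffs in proposals:
            params = merge(experts, coeffs)
            scored.append((fitness(params, target), params))
        scored.sort(key=lambda item: item[0], reverse=True)
        best_per_round.append(scored[0][0])
        # Elite merged models become the expert pool for the next round.
        experts = [params for _, params in scored[:n_elite]]
    return best_per_round, experts

experts = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
target = [0.4, 0.4, 0.2]
best, final_experts = multi_round_search(experts, target)
```

Because the one-hot proposals always include the previous round's best model, the best fitness in this toy setup cannot decrease from one round to the next. Whether the real pipeline offers a comparable guarantee is not stated in the source.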
  • The approach yields higher performance than prior evolutionary merging methods on both seen and unseen benchmarks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same winner-loser training pattern could be tested on other evolutionary search problems that optimize continuous coefficients.
  • Cycle-consistent generators might reduce data requirements in other generative modeling settings where only relative rankings are available.
  • Repeated application of the pipeline could compound improvements when merging models drawn from increasingly diverse sources.

Load-bearing premise

Winner-loser pairs from past trajectories supply enough information for the generators to learn useful coefficient distributions without overfitting or separate validation sets.

What would settle it

A direct comparison on an unseen task suite where EvoGM no longer beats the strongest baseline, or an ablation removing cycle consistency that eliminates the reported gains, would falsify the claim.

Figures

Figures reproduced from arXiv: 2605.29295 by Chenhao Yi, Dongmei Jiang, Jianguo Zhang, Ran Cheng, Tao Jiang, Xinmeng Yu, Yan Li, Yiling Wu.

Figure 1
Figure 1: Single-task performance comparison of 10 fine-tuned Qwen2.5-1.5B models. The proposed method outperforms baseline models on almost all targeted tasks. 1. Introduction The prevailing paradigm of large language models (LLMs) relies on large-scale pretraining followed by task-specific adaptation, yet the rapid growth in model size makes full-parameter fine-tuning increasingly impractical under realistic com…
Figure 2
Figure 2: Overview of EvoGM. We optimize merging coefficients λ ∈ R^N for task-vector merging. (1) Population Initialization: initialize a diverse population of λ (average merge, one-hot, random) and evaluate on D_val to build the history set H = {(λ, f(λ))}. (2) Winner–Loser Pairing: split H into winners H+ and losers H−. (3) Dual-Generator Training: train (G−→+, G+→−) with cycle-consistency and optimization-guided …
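Step (1) of the caption can be sketched as follows. The merge rule used here, θ = θ_base + Σ λ_i (θ_i − θ_base), is the standard task-arithmetic form and is an assumption about what "task-vector merging" means in the paper. The function names, the toy two-parameter "models", and the `evaluate` callback are all hypothetical.

```python
import random

def init_population(n_experts, n_random=4, seed=0):
    """Average-merge, one-hot, and random coefficient vectors."""
    rng = random.Random(seed)
    population = [[1.0 / n_experts] * n_experts]            # average merge
    population += [[1.0 if i == j else 0.0 for i in range(n_experts)]
                   for j in range(n_experts)]               # one-hot
    population += [[rng.random() for _ in range(n_experts)]
                   for _ in range(n_random)]                # random
    return population

def task_vector_merge(base, experts, coeffs):
    """theta = base + sum_i lambda_i * (expert_i - base)."""
    return [
        b + sum(c * (e[k] - b) for c, e in zip(coeffs, experts))
        for k, b in enumerate(base)
    ]

def build_history(population, evaluate):
    """History set H = {(lambda, f(lambda))} for the initial population."""
    return [(coeffs, evaluate(coeffs)) for coeffs in population]

base = [0.0, 0.0]
experts = [[1.0, 0.0], [0.0, 1.0]]
population = init_population(len(experts))

def evaluate(coeffs):
    # Hypothetical validation score: closeness of the merged model to (0.5, 0.5).
    merged = task_vector_merge(base, experts, coeffs)
    return -sum((p - 0.5) ** 2 for p in merged)

history = build_history(population, evaluate)
```

In a real setting, `evaluate` would merge full model weights and score them on the validation set, which is the expensive step that makes reuse of the history attractive.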
Figure 4
Figure 4: Ablation study on different components. (1) Single-Generator represents the single generative model variant; (2) w/o Rounds represents the replacement of multi-round updates with five continuous iterations; and (3) w/o Cycle Loss represents the removal of the cycle-consistency constraint.
Figure 5
Figure 5: Impact of the number of merged models on performance. We evaluate how the merging quality scales when integrating different quantities of models. … indicates that EvoGM is highly robust to the initial configuration and the complexity of the weight space. Even as the number of experts increases, the generative evolutionary process successfully identifies superior merging coefficients that mitigate task inte…
Figure 6
Figure 6: Fitness evolution of EvoGM and SOTA methods in multi-task scenarios. The curves represent the mean performance of the top five individuals in the population, with shaded areas indicating the confidence intervals computed from these individuals.
Figure 7
Figure 7: Ablation study on different components. (1) Single-Generator represents the single generative model variant; (2) w/o Rounds represents the replacement of multi-round updates with five continuous iterations; and (3) w/o Cycle Loss represents the removal of the cycle-consistency constraint.
Figure 8
Figure 8: Performance Comparison across Multiple Benchmarks. The radar plots illustrate the performance of different model merging methods on Validation (top row) and Test (bottom row) sets. The benchmarks include HellaSwag, Knowledge Crosswords, MMLU Pro, and MMLU, with the rightmost column showing the overall average performance across all tasks.
Original abstract

Evolutionary model merging provides a powerful framework for the automated, training-free composition of LLMs through parameter-space search. However, existing methods predominantly rely on stochastic, hand-crafted operators that overlook the underlying performance landscape of the coefficient space. We propose Evolutionary Generative Merging (EvoGM), a framework that transcends manual heuristics by employing learnable generative modeling to optimize merging coefficients. Specifically, EvoGM features a dual-generator architecture with cycle-consistent learning to adaptively sample and refine promising merging candidates. By constructing winner-loser pairs from historical search trajectories, our framework effectively captures high-performance parameter distributions and maximizes data efficiency. This generative process is seamlessly integrated into a multi-round evolutionary pipeline, where elite merged models iteratively serve as new expert foundations. Extensive experiments across diverse benchmarks demonstrate that EvoGM significantly outperforms state-of-the-art baselines, exhibiting robust performance on both seen and unseen tasks. Code and data are available at https://github.com/JiangTao97/evogm.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes EvoGM, a framework for evolutionary model merging of LLMs that replaces hand-crafted stochastic operators with a dual-generator architecture trained via cycle-consistent learning on winner-loser pairs extracted from historical search trajectories. These generators are embedded in a multi-round evolutionary pipeline that iteratively uses elite merged models as new foundations, with the central claim being significant outperformance over state-of-the-art baselines together with robust generalization to both seen and unseen tasks.

Significance. If the empirical claims are substantiated with proper controls, the work would offer a data-efficient, learnable alternative to manual heuristics in parameter-space model merging, potentially improving automation and performance in LLM composition without additional fine-tuning.

major comments (3)
  1. [Abstract] Abstract: the assertion of significant outperformance and robust performance on unseen tasks supplies no quantitative metrics, error bars, ablation studies, or experimental details, so the central empirical claim cannot be evaluated from the provided information.
  2. [Method] Method description (dual-generator with cycle-consistent learning): training the generators exclusively on winner-loser pairs drawn from the same evolutionary trajectories they subsequently guide creates an unaddressed self-referential dependency; without explicit regularization, held-out validation splits for the generators, or documented separation between trajectory-collection tasks and evaluation tasks, the reported generalization to unseen tasks risks being an artifact of memorization rather than distribution learning.
  3. [Experiments] Experiments section: the manuscript must demonstrate that the multi-round pipeline does not leak information from evaluation tasks into the historical trajectories used for generator training; absent such controls, the robustness claim on unseen tasks remains unverified.
minor comments (2)
  1. The GitHub link is provided, which supports reproducibility; however, the repository should include the exact scripts and seeds used for the reported runs.
  2. [Method] Notation for the cycle-consistency loss and the dual-generator sampling procedure should be defined more explicitly with equations to aid readability.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and will make revisions to improve clarity and substantiation of our claims.

Point-by-point responses
  1. Referee: [Abstract] Abstract: the assertion of significant outperformance and robust performance on unseen tasks supplies no quantitative metrics, error bars, ablation studies, or experimental details, so the central empirical claim cannot be evaluated from the provided information.

    Authors: We agree that the abstract would benefit from more quantitative detail. In the revised manuscript we will update the abstract to include specific metrics such as average improvements over baselines on seen and unseen tasks, along with references to error bars from repeated runs and key ablation results. revision: yes

  2. Referee: [Method] Method description (dual-generator with cycle-consistent learning): training the generators exclusively on winner-loser pairs drawn from the same evolutionary trajectories they subsequently guide creates an unaddressed self-referential dependency; without explicit regularization, held-out validation splits for the generators, or documented separation between trajectory-collection tasks and evaluation tasks, the reported generalization to unseen tasks risks being an artifact of memorization rather than distribution learning.

    Authors: This concern is valid. Our framework collects trajectories from initial rounds on base models before generator training begins, and unseen tasks are excluded from all trajectory collection. We will add a dedicated subsection clarifying task separation, held-out validation for the generators, and regularization in the cycle-consistent objective, plus ablations confirming generalization is not due to memorization. revision: yes

  3. Referee: [Experiments] Experiments section: the manuscript must demonstrate that the multi-round pipeline does not leak information from evaluation tasks into the historical trajectories used for generator training; absent such controls, the robustness claim on unseen tasks remains unverified.

    Authors: We acknowledge the need for explicit verification. The revised Experiments section will include a new subsection documenting that historical trajectories are collected exclusively from seen tasks with no access to unseen tasks. We will add controlled experiments under strict separation and report results confirming the unseen-task robustness holds without leakage. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation is self-contained empirical method

full rationale

The paper describes an iterative evolutionary pipeline that trains a dual-generator on winner-loser pairs extracted from prior search trajectories and then uses the generator to propose new merges. This is a standard data-driven enhancement to evolutionary search rather than a closed self-definition or fitted-input-renamed-as-prediction. No equations are presented that reduce a claimed result to its own inputs by construction, no uniqueness theorem is invoked via self-citation, and the central performance claims rest on external benchmark comparisons rather than internal re-labeling of the training data. The method therefore remains falsifiable against held-out tasks and does not meet the criteria for any enumerated circularity pattern.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The review is abstract-only, so no explicit free parameters, axioms, or invented entities can be extracted from the text; the contribution is framed as an algorithmic framework rather than a set of new physical postulates.

pith-pipeline@v0.9.1-grok · 5715 in / 1188 out tokens · 39912 ms · 2026-06-29T00:19:39.434532+00:00 · methodology

