arxiv: 2604.06465 · v1 · submitted 2026-04-07 · 💻 cs.CL · cs.AI

Multi-objective Evolutionary Merging Enables Efficient Reasoning Models

Mario Iacobelli, Adrian Robert Minut, Tommaso Mencattini, Donato Crisostomi, Andrea Santilli, Iacopo Masi, Emanuele Rodolà

Pith reviewed 2026-05-10 18:53 UTC · model grok-4.3

keywords model merging · reasoning models · evolutionary optimization · long-to-short reasoning · multi-objective optimization · inference efficiency · large language models · mathematical reasoning

The pith

Evolutionary model merging finds merged reasoning models that cut output length by over half while keeping accuracy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that current ways of merging models to shorten reasoning traces rely on fixed weights that force poor compromises between accuracy and brevity. Instead, Evo-L2S treats merging as a search problem where an evolutionary algorithm explores many possible combinations of model weights to optimize both goals at once. An entropy-based sampling trick makes the search fast enough to run on models with billions of parameters. Experiments across three model sizes and six math benchmarks show the resulting models generate reasoning chains more than 50 percent shorter, often with equal or better accuracy. This matters because long chains of thought in reasoning models drive up inference cost, so better trade-offs could make them more usable.

Core claim

Evo-L2S recasts the long-to-short reasoning task as multi-objective optimization and applies evolutionary model merging to search for parameter combinations that produce a Pareto front of accuracy versus output length. Entropy-based subset sampling reduces the cost of evaluating each candidate during search. On 1.5B, 7B, and 14B models tested on six mathematical reasoning benchmarks, the method yields merged models whose generated reasoning traces are over 50 percent shorter while preserving or improving problem-solving accuracy relative to the original models.
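
To make the notion of a Pareto front of accuracy versus output length concrete, here is a minimal sketch of non-dominated filtering. It is an illustration only, not the paper's implementation: the candidate tuples are invented numbers, and the names `dominates` and `pareto_front` are hypothetical.

```python
from typing import List, Tuple

# Each candidate is (accuracy, mean_output_tokens). The values are made up
# for illustration and are not results from the paper.
Candidate = Tuple[float, float]


def dominates(a: Candidate, b: Candidate) -> bool:
    """True if a is no worse than b on both objectives and strictly better on one.

    Accuracy is maximized; output length is minimized.
    """
    acc_a, len_a = a
    acc_b, len_b = b
    no_worse = acc_a >= acc_b and len_a <= len_b
    strictly_better = acc_a > acc_b or len_a < len_b
    return no_worse and strictly_better


def pareto_front(cands: List[Candidate]) -> List[Candidate]:
    """Keep every candidate that no other candidate dominates."""
    return [c for c in cands if not any(dominates(o, c) for o in cands)]


candidates = [
    (0.80, 4000.0),
    (0.79, 1800.0),
    (0.70, 1500.0),
    (0.75, 2500.0),  # dominated by (0.79, 1800.0)
    (0.60, 1600.0),  # dominated by (0.70, 1500.0)
]
front = pareto_front(candidates)
print(front)
```

The surviving points are the trade-offs a user could choose among: none of them can be improved on one objective without losing on the other.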

What carries the argument

Evolutionary multi-objective merging that searches the space of model weight combinations to optimize the accuracy-length Pareto front, supported by entropy-based subset sampling to lower the cost of fitness evaluation.
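
As a rough picture of what "searching the space of model weight combinations" could mean, the sketch below interpolates two toy checkpoints with one coefficient per tensor. This page does not describe the paper's actual merge operator or search space, so per-tensor linear interpolation is an assumption, and `merge`, `model_a`, `model_b`, and `genome` are hypothetical names. An evolutionary algorithm would mutate and recombine `genome` values and score each merged model on accuracy and output length.

```python
from typing import Dict, List

# Toy stand-in for a checkpoint: tensor name -> flattened values.
Weights = Dict[str, List[float]]


def merge(model_a: Weights, model_b: Weights, lam: Dict[str, float]) -> Weights:
    """Interpolate each tensor as (1 - lam) * a + lam * b.

    Assumption: one coefficient per tensor; the paper may use another scheme.
    """
    merged: Weights = {}
    for name, w_a in model_a.items():
        w_b = model_b[name]
        coeff = lam[name]
        merged[name] = [(1.0 - coeff) * a + coeff * b for a, b in zip(w_a, w_b)]
    return merged


model_a = {"layer0": [1.0, 2.0], "layer1": [0.0, 4.0]}
model_b = {"layer0": [3.0, 2.0], "layer1": [2.0, 0.0]}
genome = {"layer0": 0.5, "layer1": 0.25}  # one candidate in the search space
merged = merge(model_a, model_b, genome)
print(merged)
```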

If this is right

Merged models can be used directly at inference time to solve math problems with substantially fewer tokens.
The same search process scales from 1.5B to 14B parameter models without requiring retraining.
A range of models with different accuracy-length balances becomes available from a single evolutionary run.
The approach applies across multiple distinct mathematical reasoning benchmarks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Treating model merging as population-based search rather than arithmetic may apply to other efficiency goals such as energy use or latency.
The method could reduce reliance on distillation pipelines when the goal is to compress reasoning behavior.
Extending the search to include additional objectives like safety alignment would be a direct next step.

Load-bearing premise

That evolutionary search over merges of existing model weights can consistently locate combinations that improve the accuracy-length trade-off beyond what single-objective or fixed-weight merging can achieve.

What would settle it

The claim would be undermined if, on a new collection of reasoning problems, the evolved models produced either longer average outputs or lower accuracy than the source models.

Figures

Figures reproduced from arXiv: 2604.06465 by Adrian Robert Minut, Andrea Santilli, Donato Crisostomi, Emanuele Rodolà, Iacopo Masi, Mario Iacobelli, Tommaso Mencattini.

**Figure 2.** Accuracy (%) vs. length reduction (%) relative to …

**Figure 3.** Entropy-based subset sampling for efficient evolutionary fitness estimation. (a) Calibration correctness matrix: each row is one of $K = 10$ merged checkpoints (spaced uniformly in $\lambda$), each column a problem sorted by empirical solve rate $\hat{p}_i$; the colour strip below shows the corresponding Bernoulli entropy $H_i$. We generate this matrix simulating common patterns observed in correctness matrices, in order to …
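
The quantities named in the Figure 3 caption can be sketched in a few lines. The code computes the empirical solve rate of each problem from a binary correctness matrix and its Bernoulli entropy, then keeps the highest-entropy problems. Two things are assumptions rather than details from the paper: the use of base-2 logarithms, and the selection rule (top `budget` problems by entropy), since the truncated caption does not say how the subset is drawn. The names `bernoulli_entropy` and `select_by_entropy` are hypothetical.

```python
import math
from typing import List


def bernoulli_entropy(p: float) -> float:
    """H(p) = -p log2(p) - (1 - p) log2(1 - p), with H(0) = H(1) = 0."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)


def select_by_entropy(correct: List[List[int]], budget: int) -> List[int]:
    """Return indices of the `budget` problems with the highest entropy.

    correct[k][i] is 1 if calibration checkpoint k solved problem i, else 0.
    Assumption: a deterministic top-k rule; the paper may sample instead.
    """
    n_ckpt = len(correct)
    n_prob = len(correct[0])
    solve_rate = [sum(row[i] for row in correct) / n_ckpt for i in range(n_prob)]
    entropy = [bernoulli_entropy(p) for p in solve_rate]
    ranked = sorted(range(n_prob), key=lambda i: entropy[i], reverse=True)
    return sorted(ranked[:budget])


# Toy matrix: 4 checkpoints (rows) by 4 problems (columns).
correct = [
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
]
chosen = select_by_entropy(correct, budget=2)
print(chosen)
```

Problems that every checkpoint solves, or that none solves, have zero entropy and carry no information for ranking candidates, which is the intuition for spending the evaluation budget elsewhere.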

**Figure 4.** Benchmark-wise accuracy and length reduction of Evo-L2S (best trade-off) vs. …

Original abstract

Reasoning models have demonstrated remarkable capabilities in solving complex problems by leveraging long chains of thought. However, this more deliberate reasoning comes with substantial computational overhead at inference time. The Long-to-Short (L2S) reasoning problem seeks to maintain high accuracy using fewer tokens, but current training-free model merging approaches rely on scalarized, fixed-hyperparameter arithmetic methods that are highly brittle and force suboptimal compromises. To address this gap, we introduce Evo-L2S, a novel framework that formulates L2S reasoning as a multi-objective optimization challenge. By leveraging evolutionary model merging, Evo-L2S explicitly optimizes the trade-off between accuracy and output length to produce a robust Pareto front of merged models. To make this search computationally tractable for large language models, we propose an entropy-based subset sampling technique that drastically reduces the overhead of fitness estimation. Comprehensive experiments across 1.5B, 7B, and 14B parameter scales on six mathematical reasoning benchmarks demonstrate that Evo-L2S can reduce the length of generated reasoning traces by over 50% while preserving, or even improving, the problem-solving accuracy of the original reasoning models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Evo-L2S reframes Long-to-Short reasoning as multi-objective evolutionary merging with entropy sampling to cut trace length, but the sampling's reliability is the unaddressed weak point.

The main thing to know is that this paper treats accuracy and output length as separate objectives in an evolutionary search over model merges, instead of picking one fixed weight for arithmetic merging. They add entropy-based subset sampling to keep fitness checks cheap enough for 7B and 14B models. If the results hold, the reported 50%+ shorter traces at equal or better accuracy on math benchmarks would be useful for anyone running reasoning models under tight compute limits.

Referee Report

2 major / 2 minor

Summary. The paper introduces Evo-L2S, a framework that formulates the Long-to-Short (L2S) reasoning problem as a multi-objective optimization task solved via evolutionary model merging. It explicitly searches for merged models that optimize the accuracy-length trade-off on reasoning traces, using an entropy-based subset sampling method to make fitness evaluation tractable for large models. Experiments at 1.5B, 7B, and 14B scales across six mathematical reasoning benchmarks are reported to yield merged models with over 50% shorter generated traces while preserving or improving accuracy relative to the original models.

Significance. If the empirical results hold under rigorous validation, the work would be significant for providing a training-free, multi-objective alternative to brittle scalarized merging methods in efficient LLM reasoning. The approach of evolving Pareto fronts for accuracy vs. length, combined with the sampling heuristic for scalability, addresses a practical inference bottleneck. Credit is due for the multi-scale evaluation (1.5B–14B) and the explicit framing as multi-objective search rather than single-objective compromise.

major comments (2)

[Experiments] Experiments section (results on 7B/14B scales): The central claim of >50% length reduction at equal or better accuracy requires full reporting of baselines (including specific scalarized merging methods), statistical tests, error bars, data splits, and controls for confounds. The abstract and summary provide no such details, making it impossible to assess whether the Pareto front improvements are robust or statistically meaningful.
[Method] Method section on entropy-based subset sampling: The sampling technique is load-bearing for tractability and reliability of the evolutionary search. Without reported correlation coefficients between subset-based fitness and full-dataset evaluation, or ablations comparing it to uniform/random sampling, the risk of bias (e.g., preferential selection of easy/short-trace examples) remains unaddressed and could invalidate the claimed accuracy-length trade-offs.

minor comments (2)

[Abstract] Abstract: The acronyms 'Evo-L2S' and 'L2S' are used without an initial expansion, which reduces immediate clarity for readers unfamiliar with the subfield.
[Experiments] The manuscript would benefit from a table summarizing the six benchmarks, their sizes, and the exact metrics used for accuracy and length.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback and positive evaluation of the significance of Evo-L2S. We address each major comment below and will revise the manuscript to incorporate the requested details and validations.

Referee: [Experiments] Experiments section (results on 7B/14B scales): The central claim of >50% length reduction at equal or better accuracy requires full reporting of baselines (including specific scalarized merging methods), statistical tests, error bars, data splits, and controls for confounds. The abstract and summary provide no such details, making it impossible to assess whether the Pareto front improvements are robust or statistically meaningful.

Authors: We agree that the Experiments section would benefit from expanded reporting to enable rigorous evaluation of the central claims. In the revised manuscript, we will augment the results for the 7B and 14B scales with: explicit baselines including specific scalarized merging methods (e.g., TIES, DARE, and linear interpolation with their tuned hyperparameters); statistical significance tests (such as paired t-tests or Wilcoxon signed-rank tests with p-values) comparing accuracy and length; error bars derived from multiple independent runs with different random seeds; explicit descriptions of the data splits and evaluation protocols; and additional controls for potential confounds including variations in prompting and decoding strategies. These additions will allow direct assessment of the robustness of the reported Pareto front improvements.
Revision: yes
Referee: [Method] Method section on entropy-based subset sampling: The sampling technique is load-bearing for tractability and reliability of the evolutionary search. Without reported correlation coefficients between subset-based fitness and full-dataset evaluation, or ablations comparing it to uniform/random sampling, the risk of bias (e.g., preferential selection of easy/short-trace examples) remains unaddressed and could invalidate the claimed accuracy-length trade-offs.

Authors: We concur that the entropy-based subset sampling requires explicit validation to confirm its reliability. In the revised Method section, we will report Pearson and Spearman correlation coefficients between fitness values computed on the entropy-based subsets and those on the full evaluation sets, across the different model scales. We will also add ablation experiments that replace the entropy-based sampler with uniform sampling and random sampling, presenting the resulting Pareto fronts, accuracy-length metrics, and analysis of any systematic biases in the selected examples (e.g., difficulty or trace length distributions). These results will directly address concerns about potential bias and substantiate the sampling method's suitability for the evolutionary search.
Revision: yes
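
The correlation check discussed in this exchange could look roughly like the sketch below, which computes a Spearman rank correlation between subset-based and full-set accuracy for a handful of candidate merges. It uses only the standard library. The accuracy values are invented for illustration, and the function names are hypothetical; nothing here reflects numbers or code from the paper.

```python
import math
from typing import List


def ranks(values: List[float]) -> List[float]:
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        avg = (pos + end) / 2.0 + 1.0
        for j in range(pos, end + 1):
            out[order[j]] = avg
        pos = end + 1
    return out


def pearson(x: List[float], y: List[float]) -> float:
    """Pearson correlation; assumes neither input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


def spearman(x: List[float], y: List[float]) -> float:
    """Spearman correlation is the Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical accuracies of four candidate merges (not from the paper).
subset_acc = [0.40, 0.55, 0.50, 0.70]
full_acc = [0.52, 0.61, 0.63, 0.74]
rho = spearman(subset_acc, full_acc)
print(round(rho, 3))
```

A rank correlation is the natural statistic here because evolutionary selection depends on the ordering of candidates rather than on their absolute fitness values.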

Circularity Check

0 steps flagged

No circularity: empirical validation of evolutionary merging on benchmarks

The paper advances an empirical framework (Evo-L2S) that applies evolutionary multi-objective optimization and entropy-based subset sampling to merge reasoning models, then evaluates the resulting Pareto fronts on six fixed mathematical benchmarks at multiple scales. No derivation, equation, or first-principles claim is offered that reduces a reported accuracy-length trade-off to a fitted parameter, self-citation, or input by construction; the central results are obtained from direct measurement on held-out evaluation sets rather than from any self-referential loop.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides no explicit free parameters, axioms, or invented entities; ledger left empty pending full text.
