arxiv: 2604.19440 · v1 · submitted 2026-04-21 · 💻 cs.CL · cs.NE

What Makes an LLM a Good Optimizer? A Trajectory Analysis of LLM-Guided Evolutionary Search

Pith reviewed 2026-05-10 02:16 UTC · model grok-4.3

classification 💻 cs.CL cs.NE
keywords LLM · evolutionary optimization · search trajectories · local refinement · semantic drift · solution novelty · incremental improvements

The pith

Strong LLM optimizers succeed by making incremental improvements and localizing searches in semantic space rather than drifting.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper investigates what distinguishes effective LLMs when used to guide evolutionary optimization by examining the trajectories they produce across many models and tasks. It reveals that initial problem-solving skill explains only part of the performance differences, with the pattern of the search being crucial. Effective models generate frequent small refinements and gradually concentrate the search around good solutions. Less effective models instead make larger semantic shifts, leading to occasional advances but eventual plateaus. Solution novelty by itself does not forecast success, as it helps only when it occurs within localized high-performing areas.

Core claim

Strong LLM optimizers behave as local refiners, producing frequent incremental improvements while progressively localizing the search in semantic space. Conversely, weaker optimizers exhibit large semantic drift, with sporadic breakthroughs followed by stagnation. Notably, various measures of solution novelty do not predict final performance; novelty is beneficial only when the search remains sufficiently localized around high-performing regions of the solution space.

What carries the argument

LLM-guided evolutionary search trajectories, specifically the balance between incremental refinement and semantic drift or localization.

Load-bearing premise

The differences in observed search trajectories are the main reason for varying optimization results rather than other unmeasured factors like model size or specific prompts.

What would settle it

Observing that performance differences disappear when trajectory characteristics are matched across different LLMs, or that altering a model's trajectory changes its optimization results without changing the model itself.

Figures

Figures reproduced from arXiv: 2604.19440 by François Portet, Maxime Peyrard, Xi Chen, Xinhao Zhang.

Figure 1
Figure 1: Different optimization trajectories for two LLMs with similar zero-shot performance on TSP-60. Each point represents a candidate solution, colored by generation. Gemini-1.5-Pro (left) displays sustained fitness improvement and progressive localization. Mistral-7B-Instruct (right) maintains high novelty but fails to exploit it into fitness gains.
Figure 2
Figure 2: Overview of the LLM-driven evolutionary search framework and tasks. Left: the evolutionary process across generations. Right: the within-generation loop (population initialization, LLM-guided mutation, fitness evaluation, and selection). Bottom: the four tasks and their corresponding genome representations.
Figure 3
Figure 3: Scatter plot between zero-shot performance
Figure 4
Figure 4: OLS regression results across different trajectory descriptors. (Left) Standardized coefficients. (Right) Explanatory power. ∗∗∗p < 0.001, ∗p < 0.05; ns denotes non-significant p-values. Novelty-based predictors are not significant, whereas breakthrough rate (BR) strongly predicts performance and improves fit beyond zero-shot capability (ZS).
Figure 5
Figure 5: A qualitative contrast of evolutionary search geometry analysis. (a) Visualization of the search space topology using MDS. Gemini-1.5-pro forms a convergent solution cluster (yellow). All points are projected using a shared MDS space learned from all task-specific candidates. (b) The Mean Best Fitness curve shows the convergence speed and quality over seeds. (c) Spatial Entropy quantifies the candidates’ o…
Figure 6
Figure 6: Generation-level mixed-effects regression of breakthrough probabilities. Standardized coefficients are shown for concurrent (left) and lagged (right) models, with predictors at generation t explaining breakthroughs at t or t+1. ∗∗∗p < 0.001, ∗∗p < 0.01; numeric labels report non-significant p-values.
Figure 7
Figure 7: Effect of model mixing on optimization performance and refinement rate. A fraction of offspring is generated by a weaker refiner. Solid lines denote fitness; dashed lines denote refinement rate.
Figure 8
Figure 8: Optimization gain versus cost across LLMs. Each point represents a model, plotting average fitness improvement achieved through evolution against estimated monetary cost.
Figure 9
Figure 9: … that our main finding (local refinement ability is a key driver of optimization success) is robust to substantial variations in decoding temperature. Rather than being tied to a narrow hyperparameter regime, refinement behavior turns out to be a stable property of the combined system (model, prompt, and decoding configuration). Footnote 4: https://github.com/hageldave/PyGlimmerMDS
Figure 10
Figure 10: Average local refinement rate across models
Figure 12
Figure 12: Interaction between novelty and spatial entropy in breakthrough dynamics. Each cell reports the empirical breakthrough probability aggregated over generations falling into the corresponding bins of mean novelty and spatial entropy (z-scored). Color intensity indicates higher likelihood of breakthroughs.
Figure 13
Figure 13: Zero-shot performance versus post-optimization performance for each task
Figure 14
Figure 14: Interaction between breakthroughs and novelty
Figure 15
Figure 15: Cost-efficiency plots for four task families
Figure 16
Figure 16: Novelty and fitness coevolution line plots aggregated over tasks (exploration–exploitation tension)
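
The comparison summarized in the Figure 4 caption (standardized OLS coefficients, and the gain in explanatory power when breakthrough rate is added to zero-shot capability) can be illustrated with a minimal sketch. Everything below is an assumption for illustration: the data are synthetic, the variable names `zs`, `br`, and `final` are hypothetical, and the paper's actual regression specification is not reproduced on this page.

```python
import numpy as np

def standardized_ols(X, y):
    """Fit OLS on z-scored predictors and outcome; return (betas, r_squared)."""
    Xz = (X - X.mean(axis=0)) / X.std(axis=0)
    yz = (y - y.mean()) / y.std()
    # No intercept is needed: z-scored variables have zero mean.
    beta, *_ = np.linalg.lstsq(Xz, yz, rcond=None)
    resid = yz - Xz @ beta
    r2 = 1.0 - (resid @ resid) / (yz @ yz)
    return beta, r2

# Synthetic, purely illustrative data: one row per hypothetical (model, task) run.
rng = np.random.default_rng(0)
n = 120
zs = rng.normal(size=n)   # stand-in for zero-shot score (ZS)
br = rng.normal(size=n)   # stand-in for breakthrough rate (BR)
final = 0.4 * zs + 0.7 * br + rng.normal(scale=0.5, size=n)

beta_zs, r2_zs = standardized_ols(zs[:, None], final)
beta_full, r2_full = standardized_ols(np.column_stack([zs, br]), final)
print(f"ZS only: R^2 = {r2_zs:.3f}")
print(f"ZS + BR: R^2 = {r2_full:.3f}, betas = {beta_full.round(3)}")
```

Because the two models are nested, the in-sample R² of the larger model can never be lower; the question a figure like this answers is how large the increase is and whether the added coefficient is significant.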
Original abstract

Recent work has demonstrated the promise of orchestrating large language models (LLMs) within evolutionary and agentic optimization systems. However, the mechanisms driving these optimization gains remain poorly understood. In this work, we present a large-scale study of LLM-guided evolutionary search, collecting optimization trajectories for 15 LLMs across 8 tasks. Although zero-shot problem-solving ability correlates with final optimization outcomes, it explains only part of the variance: models with similar initial capability often induce dramatically different search trajectories and outcomes. By analyzing these trajectories, we find that strong LLM optimizers behave as local refiners, producing frequent incremental improvements while progressively localizing the search in semantic space. Conversely, weaker optimizers exhibit large semantic drift, with sporadic breakthroughs followed by stagnation. Notably, various measures of solution novelty do not predict final performance; novelty is beneficial only when the search remains sufficiently localized around high-performing regions of the solution space. Our results highlight the importance of trajectory analysis for understanding and improving LLM-based optimization systems and provide actionable insights for their design and training.
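
To make the contrast between "frequent incremental improvements" and "sporadic breakthroughs followed by stagnation" concrete, here is a minimal sketch that summarizes a run from its per-generation best fitness. The function name, the `big_jump` threshold, and the three summary statistics are hypothetical choices made for illustration; the paper's own definitions of refinement and breakthrough rates are not given on this page.

```python
from typing import Sequence

def improvement_profile(best_fitness: Sequence[float], big_jump: float = 0.05):
    """Summarize a run from its per-generation best fitness (higher is better).

    Returns (refinement_rate, jump_rate, longest_stall). The two rates are
    fractions of generation-to-generation steps. `big_jump` is a hypothetical
    threshold separating small refinements from large jumps.
    """
    steps = len(best_fitness) - 1
    if steps < 1:
        raise ValueError("need at least two generations")
    running_best = best_fitness[0]
    small = large = stall = longest_stall = 0
    for value in best_fitness[1:]:
        gain = value - running_best
        if gain > 0:
            # New best-so-far: classify the step by the size of the gain.
            running_best = value
            stall = 0
            if gain < big_jump:
                small += 1
            else:
                large += 1
        else:
            # No improvement: extend the current stagnation streak.
            stall += 1
            longest_stall = max(longest_stall, stall)
    return small / steps, large / steps, longest_stall

# Two made-up trajectories with the same starting fitness.
refiner = [0.50, 0.52, 0.53, 0.55, 0.56, 0.58, 0.59, 0.61]
drifter = [0.50, 0.50, 0.62, 0.62, 0.61, 0.62, 0.62, 0.62]
print(improvement_profile(refiner))
print(improvement_profile(drifter))
```

In this toy input the first run improves a little at every step, while the second makes one large jump and then stalls for five generations, which is the qualitative pattern the abstract attributes to strong and weak optimizers respectively.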

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper conducts a large-scale empirical analysis of LLM-guided evolutionary search, collecting optimization trajectories from 15 LLMs across 8 tasks. It reports that zero-shot capability explains only part of the variance in outcomes, and that stronger-performing LLMs act as local refiners (frequent incremental improvements, progressive semantic localization) while weaker ones exhibit semantic drift, sporadic breakthroughs, and stagnation. Novelty measures do not predict final performance unless the search remains localized around high-performing regions.

Significance. If the trajectory patterns are robustly linked to outcomes, the work offers a useful shift from static capability metrics to dynamic behavioral analysis in LLM optimization systems. This could inform prompt design, model selection, and training objectives for agentic or evolutionary setups, and the scale (15 models, 8 tasks) provides a broad empirical foundation.

major comments (3)
  1. [§4, §5] §4 (Trajectory Analysis) and §5 (Results): Models are classified as strong/weak post-hoc based on final performance, after which trajectory statistics (incremental improvements, semantic drift) are compared. This observational design across heterogeneous models does not isolate whether localization/refinement causally drives better outcomes or is merely correlated with unmeasured factors such as instruction-following ability or training data. An intervention (e.g., prompt modifications that enforce local vs. exploratory proposals on the same model) would be needed to support the mechanistic claim.
  2. [§3] §3 (Experimental Setup) and associated tables/figures: The manuscript provides no quantitative details on statistical controls, error bars, multiple-comparison corrections, or exclusion criteria for trajectories. Without these, it is difficult to assess the reliability of the reported behavioral patterns (e.g., frequency of incremental improvements or semantic localization metrics) across the 15×8 design.
  3. [§5.3] §5.3 (Novelty Analysis): The claim that 'novelty is beneficial only when the search remains sufficiently localized' is supported by correlational splits but lacks a controlled test showing that increasing localization (while holding other factors fixed) improves outcomes. This is load-bearing for the actionable design insight.
minor comments (2)
  1. [§3] Clarify the exact definition and computation of 'semantic localization' and 'semantic drift' metrics (e.g., embedding model, distance function, window size) in the methods section.
  2. [§6] Add a limitations paragraph discussing the scope of the 8 tasks and whether results generalize beyond the chosen evolutionary framework.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments on our work. We respond to each major comment below, providing clarifications on our observational methodology and outlining revisions to improve statistical transparency and the precision of our claims.

  1. Referee: [§4, §5] §4 (Trajectory Analysis) and §5 (Results): Models are classified as strong/weak post-hoc based on final performance, after which trajectory statistics (incremental improvements, semantic drift) are compared. This observational design across heterogeneous models does not isolate whether localization/refinement causally drives better outcomes or is merely correlated with unmeasured factors such as instruction-following ability or training data. An intervention (e.g., prompt modifications that enforce local vs. exploratory proposals on the same model) would be needed to support the mechanistic claim.

    Authors: Our study is explicitly observational, analyzing trajectories from different LLMs to identify patterns associated with better optimization outcomes. We do not claim that the local refinement behavior causally drives success, but rather that it is a characteristic of models that achieve strong results in this setup. Post-hoc classification allows us to contrast the trajectory behaviors of high- and low-performing models. We will revise the manuscript to include stronger language emphasizing the correlational nature of the findings and the possibility of confounding variables. An interventional study on a single model would be a valuable follow-up but is outside the scope of this comparative analysis across 15 models and 8 tasks. revision: partial

  2. Referee: [§3] §3 (Experimental Setup) and associated tables/figures: The manuscript provides no quantitative details on statistical controls, error bars, multiple-comparison corrections, or exclusion criteria for trajectories. Without these, it is difficult to assess the reliability of the reported behavioral patterns (e.g., frequency of incremental improvements or semantic localization metrics) across the 15×8 design.

    Authors: We agree that additional statistical details are necessary for assessing reliability. In the revised version, we will add error bars to figures showing trajectory metrics, specify the statistical tests employed for group comparisons (including p-values and effect sizes), apply appropriate multiple-comparison corrections, and clearly state any exclusion criteria for incomplete trajectories. These additions will be incorporated into §3 and the figure captions. revision: yes

  3. Referee: [§5.3] §5.3 (Novelty Analysis): The claim that 'novelty is beneficial only when the search remains sufficiently localized' is supported by correlational splits but lacks a controlled test showing that increasing localization (while holding other factors fixed) improves outcomes. This is load-bearing for the actionable design insight.

    Authors: The novelty analysis relies on observational splits of the data by localization level, showing that the novelty-performance correlation holds only in localized searches. We will update the text in §5.3 to frame this as an observational insight rather than a causal recommendation, noting that it suggests a design principle but would benefit from future controlled experiments. This maintains the actionable aspect while accurately reflecting the evidence. revision: partial
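
The "observational split by localization level" described in the third response can be sketched as follows. This is an assumed illustration, not the authors' analysis: the median split, the use of Pearson correlation, and the synthetic data (constructed so that novelty pays off only at low entropy) are all hypothetical.

```python
import numpy as np

def stratified_correlation(novelty, gain, entropy):
    """Pearson r between novelty and fitness gain, split at the median entropy."""
    novelty, gain, entropy = map(np.asarray, (novelty, gain, entropy))
    localized = entropy <= np.median(entropy)
    r_local = np.corrcoef(novelty[localized], gain[localized])[0, 1]
    r_spread = np.corrcoef(novelty[~localized], gain[~localized])[0, 1]
    return r_local, r_spread

rng = np.random.default_rng(1)
n = 400
entropy = rng.uniform(size=n)   # stand-in for a per-generation spatial entropy
novelty = rng.uniform(size=n)   # stand-in for a per-generation mean novelty
# Synthetic assumption: novelty helps only when the search is localized.
gain = np.where(entropy < 0.5, novelty, 0.0) + rng.normal(scale=0.1, size=n)

r_local, r_spread = stratified_correlation(novelty, gain, entropy)
print(f"localized: r = {r_local:.2f}, dispersed: r = {r_spread:.2f}")
```

A split like this shows an association within strata only. As the referee notes, it cannot by itself show that increasing localization would improve outcomes.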

Circularity Check

0 steps flagged

No circularity: empirical trajectory analysis with independent measurements

The paper performs a large-scale empirical study: it runs LLM-guided evolutionary search on 15 models across 8 tasks, records full trajectories (proposals, fitness, semantic embeddings), classifies models post-hoc by final performance, and reports observed statistical patterns (incremental vs. drift trajectories, localization, novelty correlations). These steps use direct measurements from the runs; no claimed result is obtained by fitting a parameter to a subset and relabeling it a prediction, nor by defining a quantity in terms of itself. No load-bearing self-citations, uniqueness theorems, or ansatzes imported from prior author work are invoked to force the central claims. The analysis remains self-contained against external benchmarks (the collected trajectories) and does not reduce any finding to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The study is observational and relies on standard assumptions about semantic embeddings for measuring localization and drift; no new free parameters, invented entities, or non-standard axioms are introduced in the abstract.

axioms (1)
  • Domain assumption: semantic embeddings provide a meaningful metric for measuring localization and drift of candidate solutions during search.
    Invoked to interpret trajectory differences but not justified or tested within the abstract.
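
One way to see what this assumption buys is a minimal sketch of how localization and drift could be read off candidate embeddings. The definitions below (mean distance to the generation centroid for dispersion, centroid displacement between consecutive generations for drift, Euclidean distance throughout) are assumptions chosen for illustration; the paper's own metrics, embedding model, and distance function are not specified on this page.

```python
import numpy as np

def localization_and_drift(generations):
    """generations: list of (n_i, d) arrays of candidate embeddings, one per generation.

    Returns (dispersion, drift): dispersion[g] is the mean Euclidean distance of
    generation g's candidates to their centroid; drift[g] is the distance between
    the centroids of generations g and g + 1.
    """
    centroids = np.stack([g.mean(axis=0) for g in generations])
    dispersion = np.array([
        np.linalg.norm(g - c, axis=1).mean() for g, c in zip(generations, centroids)
    ])
    drift = np.linalg.norm(np.diff(centroids, axis=0), axis=1)
    return dispersion, drift

rng = np.random.default_rng(0)
# Toy run: the population contracts around a fixed point as generations pass.
toy = [rng.normal(scale=1.0 / (g + 1), size=(200, 8)) for g in range(5)]
dispersion, drift = localization_and_drift(toy)
print(dispersion.round(3), drift.round(3))
```

If the embedding space does not reflect meaningful similarity between solutions, both quantities lose their interpretation, which is why the ledger lists the assumption as untested.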

pith-pipeline@v0.9.0 · 5492 in / 1311 out tokens · 46067 ms · 2026-05-10T02:16:04.592673+00:00 · methodology

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. What Do Evolutionary Coding Agents Evolve?

    cs.NE 2026-05 unverdicted novelty 7.0

    Evolutionary coding agents achieve most benchmark gains through a small subset of edit types and by cycling previously deleted code lines rather than developing new algorithmic structures.

Reference graph

Works this paper leans on

15 extracted references · 9 canonical work pages · cited by 1 Pith paper

  1. [1]

    arXiv:2503.11061

    A knapsack by any other name: Presentation impacts LLM performance on NP-hard problems. In Findings of the Association for Computational Linguistics: EMNLP 2025, pages 6628–6651, Suzhou, China. Association for Computational Linguistics. Jordan S. Ellenberg, Cristofero S. Fraser-Taliente, Thomas R. Harvey, Karan Srivastava, and Andrew V. Sutherland. 202...

  2. [2]

    Flows: Building blocks of reasoning and collaborating AI.CoRR, abs/2308.01285, 2023

    Flows: Building blocks of reasoning and collaborating ai. arXiv preprint arXiv:2308.01285. Hyomin Lee, Sangwoo Park, Yumin Choi, Sohyun An, Seanie Lee, and Sung Ju Hwang. 2026. T-map: Red-teaming llm agents with trajectory-aware evolutionary search. Preprint, arXiv:2603.22341. Joel Lehman, Jonathan Gordon, Shawn Jain, Kamal Ndousse, Cathy Yeh, and Ke...

  3. [3]

    Stanley , title =

    Evolution through large models. Preprint, arXiv:2206.08896. Xiaozhe Li, Jixuan Chen, Xinyu Fang, Shengyuan Ding, Haodong Duan, Qingwen Liu, and Kai Chen

  4. [4]

    Opt-bench: Evaluating llm agent on large-scale search spaces optimization problems.arXiv preprint arXiv:2506.10764, 2025

    Opt-bench: Evaluating llm agent on large-scale search spaces optimization problems. Preprint, arXiv:2506.10764. Jiaye Lin, Yifu Guo, Yuzhen Han, Sen Hu, Ziyi Ni, Licheng Wang, Mingguang Chen, Hongzhang Liu, Ronghao Chen, Yangfan He, Daxin Jiang, Binxing Jiao, Chen Hu, and Huacan Wang. 2025. Se-agent: Self-evolution trajectory optimization in multi-ste...

  5. [5]

    Fitness Landscape of Large Language Model-Assisted Automated Algorithm Search, August 2025

    Evolution of heuristics: Towards efficient automatic algorithm design using large language model. In International Conference on Machine Learning (ICML). Fei Liu, Qingfu Zhang, Jialong Shi, Xialiang Tong, Kun Mao, and Mingxuan Yuan. 2025. Fitness landscape of large language model-assisted automated algorithm search. Preprint, arXiv:2504.19636. Chris Lu,...

  6. [6]

    Jiang, Andy Lo, Gabrielle Berrada, Guillaume Lample, et al

    Magistral. Preprint, arXiv:2506.10910. Melanie Mitchell and Charles E Taylor. 1999. Evolutionary computation: an overview. Annual Review of Ecology and Systematics, pages 593–616. Alexander Novikov, Ngân Vũ, Marvin Eisenberger, Emilien Dupont, Po-Sen Huang, Adam Zsolt Wagner, Sergey Shirobokov, Borislav Kozlovskii, Francisco J. R. Ruiz, Abbas Mehrabian...

  7. [7]

    Peyrard, M

    Agentic ai: The era of semantic decoding. Preprint, arXiv:2403.14562. Ao Qu, Han Zheng, Zijian Zhou, Yihao Yan, Yihong Tang, Shao Yong Ong, Fenglu Hong, Kaichen Zhou, Chonghe Jiang, Minwei Kong, Jiacheng Zhu, Xuan Jiang, Sirui Li, Cathy Wu, Bryan Kian Hsiang Low, Jinhua Zhao, and Paul Pu Liang. 2026. Coral: Towards autonomous multi-agent evolution for o...

  8. [8]

    Trajevo: Designing trajectory prediction heuristics via llm-driven evolution

    Trajevo: Designing trajectory prediction heuristics via llm-driven evolution. Preprint, arXiv:2505.04480. Yangqiaoyu Zhou, Haokun Liu, Tejes Srivastava, Hongyuan Mei, and Chenhao Tan. 2024. Hypothesis generation with large language models. In Proceedings of the 1st Workshop on NLP for Science (NLP4Science), pages 117–139. Association for Computational ...

  9. [9]

    genome": [parent_1],

    or a = f(t, x, v) (Oscillator-2). Genomes are validated by attempting to parse and execute them; non-executable or divergent outputs receive fitness f = 1×10^6 (high loss). Fitness Evaluation: Given training data (X, y_true), fitness is computed as f_SymReg = 1 − norm(MSE(y_pred, y_true)), where normalization is per task instance. Novelty Distance: We use functi...

  10. [10]

    code": "def equation(x, v): return x + v

    If total genomes N ≤ 4000: fit MDS on the Symbolic Regression Evolution Prompt System: You are a scientific equation discovery expert. Your goal is to propose a new, better mathematical expression that fits the data (lower MSE is better). User: TASK DESC: Symbolic regression for damped nonlinear oscillator. Find a = f(x, v). Here are previous candidate expressi...

  11. [11]

    precomputed

    If N > 4000: use stratified sampling (max 60 genomes per (model, generation) bucket) to obtain m ≤ 4000 base points, then use out-of-sample (OOS) placement for remaining points. MDS Solver Parameters: All experiments use sklearn.manifold.MDS with: • n_components=2: Project to 2D for visualization. • dissimilarity="precomputed": Input is precomputed distan...

  12. [12]

    Compute distance (using task-specific metric) from each OOS point to all m base points

  13. [13]

    Find k = 8 nearest neighbors (smallest distances)

  14. [14]

    Assign weights w_i = 1/(d_i + 10^-8)^p, where p = 2.0 and d_i is the distance to neighbor i

  15. [15]

    code": "def priority(item, bins): return -(bins - item)

    Place OOS point at weighted average of neighbor 2D coordinates. This approach is fast (vectorized per block of 4000 points) and preserves neighborhood structure in the high-dimensional space. Bin Packing Evolution Prompt System: You are an expert in online bin packing algorithms. Your goal is to design a new, better priority function that minimizes bins ...
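
The out-of-sample placement steps extracted in items 12 to 15 (distances to base points, k = 8 nearest neighbors, inverse-distance weights with p = 2.0, weighted average of 2D coordinates) can be sketched as below. The function name and array layout are hypothetical, and the toy check uses plain 2D Euclidean distance as a stand-in for the task-specific metric; this is not the authors' code.

```python
import numpy as np

def place_out_of_sample(dist_to_base, base_coords, k=8, p=2.0, eps=1e-8):
    """Place OOS points at the inverse-distance-weighted mean of their k nearest base points.

    dist_to_base: (n_oos, m) distances from each OOS point to the m base points.
    base_coords:  (m, 2) MDS coordinates of the base points.
    """
    # Indices of the k smallest distances per row (order within the k is irrelevant).
    nearest = np.argpartition(dist_to_base, k - 1, axis=1)[:, :k]
    d = np.take_along_axis(dist_to_base, nearest, axis=1)
    # w_i = 1 / (d_i + eps)^p, normalized so each row of weights sums to 1.
    w = 1.0 / (d + eps) ** p
    w /= w.sum(axis=1, keepdims=True)
    # Weighted average of neighbor coordinates: (n_oos, k, 1) * (n_oos, k, 2).
    return (w[:, :, None] * base_coords[nearest]).sum(axis=1)

rng = np.random.default_rng(0)
base_coords = rng.normal(size=(50, 2))
# Toy check: the first query coincides with base point 3, the second is arbitrary.
queries = np.vstack([base_coords[3], [0.1, -0.2]])
dist = np.linalg.norm(queries[:, None, :] - base_coords[None, :, :], axis=2)
placed = place_out_of_sample(dist, base_coords)
print(placed)
```

A query that coincides with a base point receives an overwhelming weight on that point (distance 0 gives weight 1/eps^p), so it lands essentially on top of it. Every other query lands inside the convex hull of its k neighbors.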