What Makes an LLM a Good Optimizer? A Trajectory Analysis of LLM-Guided Evolutionary Search
Pith reviewed 2026-05-10 02:16 UTC · model grok-4.3
The pith
Strong LLM optimizers succeed by making incremental improvements and localizing searches in semantic space rather than drifting.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Strong LLM optimizers behave as local refiners, producing frequent incremental improvements while progressively localizing the search in semantic space. Conversely, weaker optimizers exhibit large semantic drift, with sporadic breakthroughs followed by stagnation. Notably, various measures of solution novelty do not predict final performance; novelty is beneficial only when the search remains sufficiently localized around high-performing regions of the solution space.
What carries the argument
LLM-guided evolutionary search trajectories, specifically the balance between incremental refinement and semantic drift or localization.
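As a rough illustration of what "incremental refinement" versus "sporadic breakthroughs followed by stagnation" could look like in numbers, the sketch below summarizes a single fitness trajectory. The function name `improvement_stats`, the choice of statistics, and the toy trajectories are assumptions made for illustration; they are not the paper's metric definitions, and the sketch assumes higher fitness is better with one candidate per step.

```python
from typing import Dict, List


def improvement_stats(fitness_per_step: List[float]) -> Dict[str, float]:
    """Summarize how often and how much a trajectory improves its running best.

    Hypothetical helper: assumes higher fitness is better, one candidate per step.
    """
    if not fitness_per_step:
        raise ValueError("trajectory must contain at least one step")

    best = fitness_per_step[0]
    gains: List[float] = []  # size of each improvement over the running best
    stall = 0                # steps since the last improvement
    longest_stall = 0

    for fitness in fitness_per_step[1:]:
        if fitness > best:
            gains.append(fitness - best)
            best = fitness
            stall = 0
        else:
            stall += 1
            longest_stall = max(longest_stall, stall)

    steps = len(fitness_per_step) - 1
    return {
        "improvement_rate": len(gains) / steps if steps else 0.0,
        "mean_gain": sum(gains) / len(gains) if gains else 0.0,
        "longest_stall": float(longest_stall),
        "final_best": best,
    }


# Toy trajectories (made up for illustration, not data from the paper).
refiner = improvement_stats([0.10, 0.15, 0.18, 0.18, 0.22, 0.25])
jumper = improvement_stats([0.10, 0.05, 0.20, 0.12, 0.08, 0.15])
```

Under these assumed definitions, the first toy trajectory improves on most steps with small gains, while the second improves once and then stalls.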
Load-bearing premise
The differences in observed search trajectories are the main reason for varying optimization results rather than other unmeasured factors like model size or specific prompts.
What would settle it
Observing that performance differences disappear when trajectory characteristics are matched across different LLMs, or that altering a model's trajectory changes its optimization results without changing the model itself.
Original abstract
Recent work has demonstrated the promise of orchestrating large language models (LLMs) within evolutionary and agentic optimization systems. However, the mechanisms driving these optimization gains remain poorly understood. In this work, we present a large-scale study of LLM-guided evolutionary search, collecting optimization trajectories for 15 LLMs across 8 tasks. Although zero-shot problem-solving ability correlates with final optimization outcomes, it explains only part of the variance: models with similar initial capability often induce dramatically different search trajectories and outcomes. By analyzing these trajectories, we find that strong LLM optimizers behave as local refiners, producing frequent incremental improvements while progressively localizing the search in semantic space. Conversely, weaker optimizers exhibit large semantic drift, with sporadic breakthroughs followed by stagnation. Notably, various measures of solution novelty do not predict final performance; novelty is beneficial only when the search remains sufficiently localized around high-performing regions of the solution space. Our results highlight the importance of trajectory analysis for understanding and improving LLM-based optimization systems and provide actionable insights for their design and training.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper conducts a large-scale empirical analysis of LLM-guided evolutionary search, collecting optimization trajectories from 15 LLMs across 8 tasks. It reports that zero-shot capability explains only part of the variance in outcomes, and that stronger-performing LLMs act as local refiners (frequent incremental improvements, progressive semantic localization) while weaker ones exhibit semantic drift, sporadic breakthroughs, and stagnation. Novelty measures do not predict final performance unless the search remains localized around high-performing regions.
Significance. If the trajectory patterns are robustly linked to outcomes, the work offers a useful shift from static capability metrics to dynamic behavioral analysis in LLM optimization systems. This could inform prompt design, model selection, and training objectives for agentic or evolutionary setups, and the scale (15 models, 8 tasks) provides a broad empirical foundation.
major comments (3)
- [§4, §5] §4 (Trajectory Analysis) and §5 (Results): Models are classified as strong/weak post-hoc based on final performance, after which trajectory statistics (incremental improvements, semantic drift) are compared. This observational design across heterogeneous models does not isolate whether localization/refinement causally drives better outcomes or is merely correlated with unmeasured factors such as instruction-following ability or training data. An intervention (e.g., prompt modifications that enforce local vs. exploratory proposals on the same model) would be needed to support the mechanistic claim.
- [§3] §3 (Experimental Setup) and associated tables/figures: The manuscript provides no quantitative details on statistical controls, error bars, multiple-comparison corrections, or exclusion criteria for trajectories. Without these, it is difficult to assess the reliability of the reported behavioral patterns (e.g., frequency of incremental improvements or semantic localization metrics) across the 15×8 design.
- [§5.3] §5.3 (Novelty Analysis): The claim that 'novelty is beneficial only when the search remains sufficiently localized' is supported by correlational splits but lacks a controlled test showing that increasing localization (while holding other factors fixed) improves outcomes. This is load-bearing for the actionable design insight.
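To make the idea of a "correlational split" concrete, here is a minimal sketch that partitions runs by a localization measure and computes the novelty-performance correlation within each group. The names `pearson` and `correlation_by_localization`, the use of a single spread threshold, and the toy numbers are all hypothetical; the paper's actual splitting procedure is not described in this summary.

```python
from math import sqrt
from typing import Dict, List, Sequence, Tuple

# One run: (novelty, semantic_spread, final_score). Field layout is an assumption.
Run = Tuple[float, float, float]


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equally long sequences with at least 2 items")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


def correlation_by_localization(runs: List[Run], spread_threshold: float) -> Dict[str, float]:
    """Correlate novelty with final score separately for localized and diffuse runs."""
    localized = [(nov, score) for nov, spread, score in runs if spread <= spread_threshold]
    diffuse = [(nov, score) for nov, spread, score in runs if spread > spread_threshold]

    def corr(group: List[Tuple[float, float]]) -> float:
        return pearson([nov for nov, _ in group], [score for _, score in group])

    return {"localized": corr(localized), "diffuse": corr(diffuse)}


# Synthetic runs, invented purely to exercise the code.
toy_runs: List[Run] = [
    (0.1, 0.1, 0.5), (0.2, 0.1, 0.6), (0.3, 0.1, 0.7), (0.4, 0.1, 0.8),  # localized
    (0.1, 0.8, 0.4), (0.2, 0.8, 0.2), (0.3, 0.8, 0.4), (0.4, 0.8, 0.2),  # diffuse
]
split = correlation_by_localization(toy_runs, spread_threshold=0.3)
```

A split like this can show that a correlation differs between groups, but, as the comment above notes, it cannot show that changing localization would change outcomes.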
minor comments (2)
- [§3] Clarify the exact definition and computation of 'semantic localization' and 'semantic drift' metrics (e.g., embedding model, distance function, window size) in the methods section.
- [§6] Add a limitations paragraph discussing the scope of the 8 tasks and whether results generalize beyond the chosen evolutionary framework.
Simulated Author's Rebuttal
We thank the referee for the constructive comments on our work. We respond to each major comment below, providing clarifications on our observational methodology and outlining revisions to improve statistical transparency and the precision of our claims.
-
Referee: [§4, §5] §4 (Trajectory Analysis) and §5 (Results): Models are classified as strong/weak post-hoc based on final performance, after which trajectory statistics (incremental improvements, semantic drift) are compared. This observational design across heterogeneous models does not isolate whether localization/refinement causally drives better outcomes or is merely correlated with unmeasured factors such as instruction-following ability or training data. An intervention (e.g., prompt modifications that enforce local vs. exploratory proposals on the same model) would be needed to support the mechanistic claim.
Authors: Our study is explicitly observational, analyzing trajectories from different LLMs to identify patterns associated with better optimization outcomes. We do not claim that the local refinement behavior causally drives success, but rather that it is a characteristic of models that achieve strong results in this setup. Post-hoc classification allows us to contrast the trajectory behaviors of high- and low-performing models. We will revise the manuscript to include stronger language emphasizing the correlational nature of the findings and the possibility of confounding variables. An interventional study on a single model would be a valuable follow-up but is outside the scope of this comparative analysis across 15 models and 8 tasks.
revision: partial
-
Referee: [§3] §3 (Experimental Setup) and associated tables/figures: The manuscript provides no quantitative details on statistical controls, error bars, multiple-comparison corrections, or exclusion criteria for trajectories. Without these, it is difficult to assess the reliability of the reported behavioral patterns (e.g., frequency of incremental improvements or semantic localization metrics) across the 15×8 design.
Authors: We agree that additional statistical details are necessary for assessing reliability. In the revised version, we will add error bars to figures showing trajectory metrics, specify the statistical tests employed for group comparisons (including p-values and effect sizes), apply appropriate multiple-comparison corrections, and clearly state any exclusion criteria for incomplete trajectories. These additions will be incorporated into §3 and the figure captions.
revision: yes
-
Referee: [§5.3] §5.3 (Novelty Analysis): The claim that 'novelty is beneficial only when the search remains sufficiently localized' is supported by correlational splits but lacks a controlled test showing that increasing localization (while holding other factors fixed) improves outcomes. This is load-bearing for the actionable design insight.
Authors: The novelty analysis relies on observational splits of the data by localization level, showing that the novelty-performance correlation holds only in localized searches. We will update the text in §5.3 to frame this as an observational insight rather than a causal recommendation, noting that it suggests a design principle but would benefit from future controlled experiments. This maintains the actionable aspect while accurately reflecting the evidence.
revision: partial
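The response to the §3 comment promises "appropriate multiple-comparison corrections" without naming a procedure. As one possible choice, the sketch below implements the Holm-Bonferroni step-down correction; picking Holm here is an assumption, as are the function name `holm_bonferroni` and the example p-values.

```python
from typing import List, Sequence


def holm_bonferroni(p_values: Sequence[float], alpha: float = 0.05) -> List[bool]:
    """Return, for each hypothesis, whether it is rejected under Holm's procedure.

    The p-values are visited from smallest to largest; the i-th smallest
    (0-based rank i) is compared against alpha / (m - i). Testing stops at
    the first p-value that fails its threshold.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, idx in enumerate(order):
        if p_values[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break  # all larger p-values are also retained
    return reject


# Illustrative p-values only; not results from the paper.
decisions = holm_bonferroni([0.01, 0.04, 0.03, 0.005], alpha=0.05)
```

With many trajectory metrics compared across a 15×8 design, a step-down procedure of this kind controls the family-wise error rate while being less conservative than a plain Bonferroni correction.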
Circularity Check
No circularity: empirical trajectory analysis with independent measurements
The paper performs a large-scale empirical study: it runs LLM-guided evolutionary search on 15 models across 8 tasks, records full trajectories (proposals, fitness, semantic embeddings), classifies models post-hoc by final performance, and reports observed statistical patterns (incremental vs. drift trajectories, localization, novelty correlations). These steps use direct measurements from the runs; no claimed result is obtained by fitting a parameter to a subset and relabeling it a prediction, nor by defining a quantity in terms of itself. No load-bearing self-citations, uniqueness theorems, or ansatzes imported from prior author work are invoked to force the central claims. The analysis remains self-contained against external benchmarks (the collected trajectories) and does not reduce any finding to its own inputs by construction.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Semantic embeddings provide a meaningful metric for measuring localization and drift of candidate solutions during search.
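To show what this assumption buys in practice, here is a minimal sketch of two embedding-based quantities: step-to-step drift and spread around a centroid. Cosine distance, the function names, and the toy 2D vectors are assumptions for illustration; the embedding model and distance function actually used by the paper are not stated in this summary.

```python
from math import sqrt
from typing import List, Sequence

Vector = Sequence[float]


def cosine_distance(a: Vector, b: Vector) -> float:
    """1 - cosine similarity; undefined for zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine distance is undefined for zero vectors")
    return 1.0 - dot / (norm_a * norm_b)


def step_drift(embeddings: List[Vector]) -> List[float]:
    """Distance between each candidate and the one proposed just before it."""
    return [cosine_distance(prev, cur) for prev, cur in zip(embeddings, embeddings[1:])]


def spread(embeddings: List[Vector]) -> float:
    """Mean distance of candidates to their centroid (smaller = more localized)."""
    if not embeddings:
        raise ValueError("need at least one embedding")
    dim = len(embeddings[0])
    centroid = [sum(e[d] for e in embeddings) / len(embeddings) for d in range(dim)]
    return sum(cosine_distance(e, centroid) for e in embeddings) / len(embeddings)


# Toy embeddings, invented for illustration.
early_window = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
late_window = [[1.0, 0.9], [1.0, 1.0], [0.9, 1.0]]
is_localizing = spread(late_window) < spread(early_window)
```

If embeddings did not track meaningful differences between candidates, quantities like these would say little about the search, which is why the assumption is listed in the ledger.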
Forward citations
Cited by 1 Pith paper
-
What Do Evolutionary Coding Agents Evolve?
Evolutionary coding agents achieve most benchmark gains through a small subset of edit types and by cycling previously deleted code lines rather than developing new algorithmic structures.
Reference graph
Works this paper leans on
-
[9]
genome": [parent_1],
or $a = f(t, x, v)$ (Oscillator-2). Genomes are validated by attempting to parse and execute them; non-executable or divergent outputs receive fitness $f = 1 \times 10^{6}$ (high loss).
Fitness Evaluation: Given training data $(X, y_{\mathrm{true}})$, fitness is computed as $f_{\mathrm{SymReg}} = 1 - \mathrm{norm}(\mathrm{MSE}(y_{\mathrm{pred}}, y_{\mathrm{true}}))$, where normalization is per task instance.
Novelty Distance: We use functi...
-
[10]
code": "def equation(x, v): return x + v
If total genomes $N \le 4000$: fit MDS on the
Symbolic Regression Evolution Prompt
System: You are a scientific equation discovery expert. Your goal is to propose a new, better mathematical expression that fits the data (lower MSE is better).
User: TASK DESC: Symbolic regression for damped nonlinear oscillator. Find $a = f(x, v)$. Here are previous candidate expressi...
-
[11]
precomputed
If $N > 4000$: use stratified sampling (max 60 genomes per (model, generation) bucket) to obtain $m \le 4000$ base points, then use out-of-sample (OOS) placement for remaining points.
MDS Solver Parameters: All experiments use sklearn.manifold.MDS with:
• n_components=2: Project to 2D for visualization.
• dissimilarity="precomputed": Input is precomputed distan...
-
[12]
Compute distance (using task-specific metric) from each OOS point to all $m$ base points
-
[13]
Find $k = 8$ nearest neighbors (smallest distances)
-
[14]
Assign weights $w_i = 1/(d_i + 10^{-8})^{p}$, where $p = 2.0$ and $d_i$ is distance to neighbor $i$
-
[15]
code": "def priority(item, bins): return -(bins - item)
Place OOS point at weighted average of neighbor 2D coordinates. This approach is fast (vectorized per block of 4000 points) and preserves neighborhood structure in the high-dimensional space.
Bin Packing Evolution Prompt
System: You are an expert in online bin packing algorithms. Your goal is to design a new, better priority function that minimizes bins ...
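The out-of-sample placement steps listed above (distances to base points, k nearest neighbors, inverse-distance weights, weighted average) can be sketched for a single point as follows. The function name `place_out_of_sample` and its argument layout are hypothetical; the defaults k = 8, p = 2.0, and the 10^-8 offset are taken from the text. This is a plain-Python sketch for one point at a time, not the vectorized block-wise implementation the text describes.

```python
from typing import Sequence, Tuple

Point2D = Tuple[float, float]


def place_out_of_sample(
    distances_to_base: Sequence[float],
    base_coords: Sequence[Point2D],
    k: int = 8,
    p: float = 2.0,
    eps: float = 1e-8,
) -> Point2D:
    """Place one OOS point in 2D from its distances to already-embedded base points.

    distances_to_base[i] is the (task-specific) distance from the OOS point
    to base point i, whose 2D coordinates are base_coords[i].
    """
    if len(distances_to_base) != len(base_coords):
        raise ValueError("need exactly one distance per base point")
    if not base_coords:
        raise ValueError("need at least one base point")

    # Steps 1-2: pick the k base points with the smallest distances.
    nearest = sorted(range(len(base_coords)), key=lambda i: distances_to_base[i])[:k]

    # Step 3: inverse-distance weights w_i = 1 / (d_i + eps)^p.
    weights = [1.0 / (distances_to_base[i] + eps) ** p for i in nearest]
    total = sum(weights)

    # Step 4: weighted average of the neighbors' 2D coordinates.
    x = sum(w * base_coords[i][0] for w, i in zip(weights, nearest)) / total
    y = sum(w * base_coords[i][1] for w, i in zip(weights, nearest)) / total
    return (x, y)


# Toy base layout, invented for illustration.
base = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0)]
equidistant = place_out_of_sample([1.0, 1.0, 1.0], base)
on_first_point = place_out_of_sample([0.0, 1.0, 1.0], base)
```

With equal distances the point lands at the neighbors' centroid; with a zero distance the small offset keeps the weight finite while pulling the point almost exactly onto that neighbor.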