
arXiv: 2605.23007 · v1 · pith:GZDG3VSGnew · submitted 2026-05-21 · 💱 q-fin.TR · cs.AI · cs.LG · q-fin.PM

MadEvolve: Evolutionary Optimization of Trading Systems with Large Language Models

Pith reviewed 2026-05-25 05:21 UTC · model grok-4.3

classification 💱 q-fin.TR · cs.AI · cs.LG · q-fin.PM
keywords evolutionary optimization · large language models · algorithmic trading · Bitcoin · quantitative finance · feature evolution · backtesting · trading strategies

The pith

An LLM-based evolutionary framework improves Bitcoin trading strategies in simulation and backtesting through automated optimization of features and strategy components.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper applies an evolutionary optimization method driven by large language models to several tasks in quantitative finance using Bitcoin trading as the example. It reports gains from evolving feature sets used for signal generation, from tuning separate parts of a trading strategy, and from evolving the full feature pipeline along with the execution rules at once. If the results hold, this would indicate that such methods can discover improved trading algorithms automatically in a simulation environment. Readers would care because it suggests a route to scaling strategy improvement beyond manual design in quantitative finance.

Core claim

On simulation and backtesting setups for Bitcoin, the method achieves significant improvements on all tasks considered, such as evolving feature sets for signal generation, optimizing separate components of the trading strategy, and jointly evolving the feature pipeline together with the execution strategy.

What carries the argument

MadEvolve, a general-purpose algorithm optimization framework that uses large language models to iteratively evolve and improve algorithms through search.
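
As a rough illustration of the kind of loop the paper's Figure 1 caption describes (sample a parent from a population, propose a candidate, evaluate it, keep only the best occupant per cell), here is a minimal sketch. Everything in it is an assumption made for illustration: the LLM call is replaced by a hypothetical random `mutate` function, `evaluate` is a toy fitness standing in for a backtester, and `cell_of` is an invented behaviour descriptor. None of these names come from MadEvolve.

```python
import random

def evaluate(params):
    """Toy fitness standing in for a backtester score (higher is better)."""
    x, y = params
    return -(x - 1.0) ** 2 - (y + 2.0) ** 2

def mutate(params, rng, scale=0.3):
    """Hypothetical stand-in for an LLM-proposed edit: a random perturbation."""
    return tuple(p + rng.gauss(0.0, scale) for p in params)

def cell_of(params, width=1.0):
    """Invented behaviour descriptor: bucket the first parameter into a grid cell."""
    return int(params[0] // width)

def evolve(n_iters=2000, seed=0):
    rng = random.Random(seed)
    start = (0.0, 0.0)
    archive = {cell_of(start): (evaluate(start), start)}  # cell -> (score, params)
    for _ in range(n_iters):
        _, parent = rng.choice(list(archive.values()))  # sample a parent
        child = mutate(parent, rng)                      # propose a candidate
        score = evaluate(child)                          # score the candidate
        cell = cell_of(child)
        # Each cell retains only its best-performing occupant.
        if cell not in archive or score > archive[cell][0]:
            archive[cell] = (score, child)
    return archive

archive = evolve()
best_score, best_params = max(archive.values())
```

Keeping one elite per cell, rather than a single global best, is what lets a search of this kind hold on to diverse candidates while still improving the top score.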

If this is right

  • Significant improvements when evolving feature sets for signal generation.
  • Significant improvements when optimizing separate components of the trading strategy.
  • Significant improvements when jointly evolving the feature pipeline together with the execution strategy.
  • Support for the utility of AI-driven agentic and evolutionary algorithms for algorithmic trading.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same optimization approach could be tested on other assets or time periods to check generalization beyond the Bitcoin setup.
  • Live deployment would reveal whether backtest gains survive real-market execution frictions not captured in simulation.
  • The method might reduce the amount of manual iteration needed to refine quantitative strategies.
  • Over-reliance on a single simulation environment could produce strategies that degrade when market conditions shift.

Load-bearing premise

The simulation and backtesting setup accurately reflects real trading conditions without hidden overfitting or unrealistic assumptions that would invalidate the reported performance gains.

What would settle it

Applying the evolved strategies to live trading data with realistic transaction costs and slippage and finding no outperformance relative to baselines.

Figures

Figures reproduced from arXiv: 2605.23007 by Moritz Münchmeyer, Owen Colegrove, Tianyi Li, Yurii Kvasiuk.

Figure 1: Overview of the MadEvolve evolution loop. The prompt sampler retrieves parent and inspiration programs from the population database, queries the LLM ensemble, evaluates the resulting candidate against the backtester, and updates the population. See Li et al. [2026] for the full architecture. … and performance score. Each cell retains only its best-performing occupant, ensuring that the population covers a ra…
Figure 2: Three complementary tests of the “is the gain just sizing?” question.
Figure 3: Cumulative impact-adjusted PnL for the baseline and best evolved strategy in Run 1 (target position…
Figure 4: Evolution progress for Run 1. Highest impact-adjusted PnL achieved by any candidate program in…
Figure 5: Cumulative impact-adjusted PnL for the baseline and best evolved strategy in Run 2 (order…
Figure 6: Evolution progress for Run 2. Highest impact-adjusted PnL achieved by any candidate program in…
Figure 7: Cumulative impact-adjusted PnL for the baseline and best evolved strategy in Run 3 (joint evolution).
Figure 8: Evolution progress for Run 3. Highest impact-adjusted PnL achieved by any candidate program in…
Figure 9: Cumulative impact-adjusted PnL for the baseline and best evolved pipeline in Run 5 (joint evolution…
Figure 10: Evolution progress for Run 5. Highest impact-adjusted PnL achieved by any candidate program in…
Figure 11: IS–OOS degradation for Run 5. Solid line: best in-sample (validation, 2024) impact-adjusted…
Figure 12: Optuna convergence diagnostics for the two hyperparameter sweeps. Left: baseline forecaster.
Figure 13: Per-model improvement rates (fraction of mutations that exceed the parent’s fitness) across the…
Figure 14: Null test against a baseline-shifted p-hacking procedure. Blue: validation impact-adjusted PnL at the evolution-selected best. Green: held-out test PnL of the same evo-selected strategy. Red: expected best-of-K if each iteration’s PnL were drawn from N(PnL₀, σ₀) on validation (the hypothetical p-hacking ceiling). Dashed lines mark the baseline PnL₀ for each split; the test σ₀ is the validation σ₀ resca…
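
The null ceiling named in the Figure 14 caption, the expected best of K draws from a normal distribution centred on the baseline, can be estimated by simulation. The sketch below is only an illustration of that idea: the values of `pnl0`, `sigma0`, and `k` are made-up placeholders, not numbers from the paper, and the paper's actual procedure may differ.

```python
import random
import statistics

def expected_best_of_k(pnl0, sigma0, k, n_trials=2000, seed=0):
    """Monte Carlo estimate of E[max of k i.i.d. draws from N(pnl0, sigma0)]."""
    rng = random.Random(seed)
    maxima = [
        max(rng.gauss(pnl0, sigma0) for _ in range(k))
        for _ in range(n_trials)
    ]
    return statistics.fmean(maxima)

# Placeholder inputs (assumptions, chosen only to show the effect of k).
ceiling_1 = expected_best_of_k(pnl0=100.0, sigma0=10.0, k=1)
ceiling_200 = expected_best_of_k(pnl0=100.0, sigma0=10.0, k=200)
```

With a single draw the estimate sits near the baseline, while the best of many draws drifts well above it even though no draw carries real skill. A search result is therefore only informative if it clears this ceiling on validation and the gain persists on held-out data.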
Original abstract

We explore the application of LLM-driven algorithm optimization to several common tasks in quantitative finance. MadEvolve, a general-purpose algorithm optimization framework inspired by DeepMind's Alpha-Evolve, was recently developed to optimize algorithms in computational cosmology. Here we demonstrate the utility of MadEvolve to optimize algorithmic trading strategies and alpha generation at the example of Bitcoin trading. On our simulation and backtesting setup, we achieve significant improvements on all tasks we considered, such as evolving feature sets for signal generation, optimizing separate components of the trading strategy, and jointly evolving the feature pipeline together with the execution strategy. Additionally, we compare our method to other agentic search approaches, specifically Claude Code, and carefully evaluate p-hacking probabilities on our simulation setup. Our findings strongly support the utility of AI-driven agentic and evolutionary algorithms for algorithmic trading and quantitative finance.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript applies MadEvolve, an LLM-driven evolutionary optimization framework inspired by Alpha-Evolve, to Bitcoin trading tasks: evolving feature sets for signal generation, optimizing individual strategy components, and jointly evolving the feature pipeline with the execution strategy. It reports significant improvements on a simulation and backtesting setup, compares performance to Claude Code, and evaluates p-hacking probabilities.

Significance. If the simulation and backtesting controls are shown to be rigorous, the work would provide evidence that evolutionary LLM methods can improve quantitative trading pipelines; the explicit mention of p-hacking evaluation is a positive step toward falsifiability.

major comments (2)
  1. [Abstract and §3] Abstract and §3 (Simulation Setup): the central claim of 'significant improvements on all tasks' rests entirely on the validity of the backtesting simulator, yet no description is given of how slippage, fees, latency, partial fills, or strictly causal feature construction are implemented, nor of the out-of-sample protocol that prevents the evolutionary search from fitting the test window.
  2. [§4 and Table 2] §4 (Results) and Table 2: without reported statistical tests, confidence intervals, or explicit out-of-sample Sharpe ratios before versus after evolution, the magnitude of the claimed gains cannot be evaluated against the risk of in-sample optimization.
minor comments (2)
  1. [Abstract] The abstract states that p-hacking probabilities were evaluated but does not report the numerical values or the exact procedure used.
  2. [§2] Notation for the evolutionary operators and LLM prompt templates should be defined once in a dedicated subsection rather than inline.
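
Two of the controls raised in the major comments, strictly causal features and a chronological out-of-sample split, can be pictured with a small sketch. It assumes that "strictly causal" means the feature at step `t` uses only prices before `t`; the function names, window length, split sizes, and price series are all hypothetical and are not taken from the paper.

```python
def causal_moving_average(prices, window):
    """Feature at index t uses only prices[t-window:t], i.e. data strictly before t."""
    feats = []
    for t in range(len(prices)):
        past = prices[max(0, t - window):t]
        feats.append(sum(past) / len(past) if past else None)
    return feats

def chronological_split(n, n_train, n_val):
    """Index ranges for train / validation / test in time order, with no shuffling."""
    if n_train + n_val >= n:
        raise ValueError("split leaves no test data")
    return (
        range(0, n_train),
        range(n_train, n_train + n_val),
        range(n_train + n_val, n),
    )

# Made-up price series for illustration only.
prices = [100.0, 101.0, 103.0, 102.0, 105.0, 107.0, 106.0, 108.0, 110.0, 109.0]
features = causal_moving_average(prices, window=3)
train, val, test = chronological_split(len(prices), n_train=6, n_val=2)
```

A quick leakage check under these assumptions is to perturb prices after some index and confirm that features up to and including that index do not change.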

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive comments, which highlight important aspects of reproducibility and statistical rigor in quantitative finance research. We address each major comment below and will revise the manuscript accordingly to strengthen the presentation of the simulation setup and results.

Point-by-point responses
  1. Referee: [Abstract and §3] Abstract and §3 (Simulation Setup): the central claim of 'significant improvements on all tasks' rests entirely on the validity of the backtesting simulator, yet no description is given of how slippage, fees, latency, partial fills, or strictly causal feature construction are implemented, nor of the out-of-sample protocol that prevents the evolutionary search from fitting the test window.

    Authors: We agree that a transparent and detailed description of the backtesting simulator is necessary to substantiate the performance claims. In the revised manuscript, we will substantially expand Section 3 (Simulation Setup) with a dedicated subsection on the simulator implementation. This will explicitly cover: slippage modeled proportionally to trade size and recent volatility; transaction fees at a fixed rate per trade; latency as a configurable delay parameter; partial fills simulated via order-book depth assumptions; strictly causal feature construction enforced by restricting all computations to data available at or before the prior timestep; and the out-of-sample protocol, in which evolutionary search and hyperparameter tuning occur only within a designated training window while final performance is measured on a completely held-out test period. These additions will allow readers to evaluate the controls against overfitting risks. revision: yes

  2. Referee: [§4 and Table 2] §4 (Results) and Table 2: without reported statistical tests, confidence intervals, or explicit out-of-sample Sharpe ratios before versus after evolution, the magnitude of the claimed gains cannot be evaluated against the risk of in-sample optimization.

    Authors: We concur that statistical quantification is required to assess the reliability of the reported improvements. In the revised Section 4 and updated Table 2, we will add: bootstrap-derived 95% confidence intervals around all Sharpe ratios; paired statistical tests (e.g., t-tests or Wilcoxon tests) comparing pre- and post-evolution performance metrics; and explicit side-by-side reporting of out-of-sample Sharpe ratios before versus after MadEvolve optimization. We will also expand the existing p-hacking probability analysis to include sensitivity checks under different random seeds and data splits. These revisions will provide a clearer basis for evaluating whether the observed gains exceed what could arise from in-sample optimization alone. revision: yes
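
The bootstrap confidence intervals mentioned in the second response could look roughly like the following. This is a sketch under stated assumptions, not the authors' procedure: it uses a plain i.i.d. percentile bootstrap (autocorrelated returns would call for a block bootstrap instead), a per-period Sharpe ratio without annualization, and synthetic returns in place of real data.

```python
import random
import statistics

def sharpe(returns):
    """Per-period Sharpe ratio (mean / sample standard deviation), not annualized."""
    return statistics.fmean(returns) / statistics.stdev(returns)

def bootstrap_sharpe_ci(returns, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the Sharpe ratio (i.i.d. resampling)."""
    rng = random.Random(seed)
    n = len(returns)
    stats = sorted(sharpe(rng.choices(returns, k=n)) for _ in range(n_boot))
    lo = stats[int((alpha / 2) * n_boot)]
    hi = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

# Synthetic returns, for illustration only.
data_rng = random.Random(1)
returns = [data_rng.gauss(0.001, 0.01) for _ in range(500)]
point = sharpe(returns)
lo, hi = bootstrap_sharpe_ci(returns)
```

Computing such an interval for both the baseline and the evolved strategy on the same held-out window gives a first sense of whether the reported gap is larger than resampling noise.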

Circularity Check

0 steps flagged

No derivation chain or first-principles claim present; empirical application only

Full rationale

The paper is an empirical demonstration of applying an existing evolutionary optimization framework (MadEvolve, inspired by Alpha-Evolve) to trading strategy components on a Bitcoin backtesting setup. It reports observed improvements across tasks and comparisons to other agents, without any mathematical derivation, uniqueness theorem, ansatz, or first-principles result that reduces to its own inputs. No equations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the provided text. Standard backtesting validity concerns exist but fall under correctness rather than the enumerated circularity patterns. The work is self-contained as an application study.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available; no free parameters, axioms, or invented entities are described.

pith-pipeline@v0.9.0 · 5689 in / 939 out tokens · 23101 ms · 2026-05-25T05:21:53.591774+00:00 · methodology


Reference graph

Works this paper leans on

17 extracted references · 17 canonical work pages · 7 internal anchors

  1. [1]

    Anthropic

    arXiv preprint arXiv:2511.07678. Anthropic. The Claude 3 model family: Opus, Sonnet, Haiku. Anthropic Technical Report,

  2. [2]

    Price Impact

    arXiv preprint arXiv:0903.2428. Elliot Glazer, Ege Erdil, Tamay Besiroglu, et al. FrontierMath: A benchmark for evaluating advanced mathematical reasoning in AI,

  3. [3]

    FrontierMath: A Benchmark for Evaluating Advanced Mathematical Reasoning in AI

    arXiv preprint arXiv:2411.04872. Google DeepMind. Gemini: A family of highly capable multimodal models. arXiv preprint arXiv:2312.11805,

  4. [4]

    Robert Tjarko Lange, Yuki Imajuku, and Edoardo Cetin

    arXiv preprint arXiv:2503.14499. Robert Tjarko Lange, Yuki Imajuku, and Edoardo Cetin. ShinkaEvolve: Towards open-ended and sample-efficient program evolution. arXiv preprint arXiv:2509.19349,

  5. [5]

    A survey of large language models in finance (FinLLMs). arXiv preprint arXiv:2402.02315,

    Jean Lee, Nicholas Stevens, Soyeon Caren Han, and Minseok Song. A survey of large language models in finance (FinLLMs). arXiv preprint arXiv:2402.02315,

  6. [6]

    MadEvolve: Evolutionary optimization of cosmological algorithms with large language models. arXiv preprint arXiv:2602.15951,

    Tianyi Li, Shihui Zang, and Moritz Münchmeyer. MadEvolve: Evolutionary optimization of cosmological algorithms with large language models. arXiv preprint arXiv:2602.15951,

  7. [7]

    TradingGPT: Multi-agent system with layered memory and distinct characters for enhanced financial trading performance. arXiv preprint arXiv:2309.03736,

    Yang Li et al. TradingGPT: Multi-agent system with layered memory and distinct characters for enhanced financial trading performance. arXiv preprint arXiv:2309.03736,

  8. [8]

    Illuminating search spaces by mapping elites

    SSRN: 4412788. Jean-Baptiste Mouret and Jeff Clune. Illuminating search spaces by mapping elites.arXiv preprint arXiv:1504.04909,

  9. [9]

    Alexander Novikov, Ngân Vũ, Marvin Eisenberger, Emilien Dupont, Po-Sen Huang, Adam Zsolt Wagner, Sergey Shirobokov, Borislav Kozlovskii, Francisco J. R. Ruiz, Abbas Mehrabian, et al. AlphaEvolve: A coding agent for scientific and algorithmic discovery. arXiv preprint arXiv:2506.13131,

  10. [10]

    GPT-4 Technical Report

    OpenAI. GPT-4 technical report. arXiv preprint arXiv:2303.08774,

  11. [11]

    Humanity's Last Exam

    arXiv preprint arXiv:2501.14249. Bernardino Romera-Paredes, Mohammadamin Barekatain, Alexander Novikov, et al. Mathematical discoveries from program search with large language models. Nature, 625:468–475,

  12. [12]

    The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity

    Apache 2.0 License. Parshin Shojaee, Iman Mirzadeh, Keivan Alizadeh, Maxwell Horton, Samy Bengio, and Mehrdad Farajtabar. The illusion of thinking: Understanding the strengths and limitations of reasoning models via the lens of problem complexity. arXiv preprint arXiv:2506.06941,

  13. [13]

    Hongyang Yang, Xiao-Yang Liu, and Christina Dan Wang

    arXiv preprint arXiv:2412.20138. Hongyang Yang, Xiao-Yang Liu, and Christina Dan Wang. FinGPT: Open-source financial large language models. arXiv preprint arXiv:2306.06031,

  14. [14]

    QuantEvolve: Automating quantitative strategy discovery through multi-agent evolutionary framework. arXiv preprint arXiv:2510.18569,

    Junhyeog Yun, Hyoun Jun Lee, and Insu Jeon. QuantEvolve: Automating quantitative strategy discovery through multi-agent evolutionary framework. arXiv preprint arXiv:2510.18569,

  15. [15]

    Instruct-FinGPT: Financial sentiment analysis by instruction tuning of general-purpose large language models. arXiv preprint arXiv:2306.12659,

    Boyu Zhang, Hongyang Yang, and Xiao-Yang Liu. Instruct-FinGPT: Financial sentiment analysis by instruction tuning of general-purpose large language models. arXiv preprint arXiv:2306.12659,

  16. [16]

    A Detailed Trading Simulation Setup. In this appendix we explain our trading simulation setup in enough detail to make our results reproducible

    arXiv preprint arXiv:2508.11152. A Detailed Trading Simulation Setup. In this appendix we explain our trading simulation setup in enough detail to make our results reproducible. Additional explanations are provided in Sec. 4.1. A.1 Data and Splits. For all experiments, we ...

  17. [17]

    3. cancel_open_orders() removes resting orders

    Portfolio position is updated: position ← position + Δq_t^fill.
    3. cancel_open_orders() removes resting orders.
    4. set_passive_order_data() computes new trade quantity and limit price.
    5. submit_order() submits if the limit price is valid (not NaN).
    6. log() records state.
    Only one order is active at any time; each interval replaces the previous order.
    3 https://mass...
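
The per-interval steps in the extracted appendix fragment above can be sketched as a small loop. Only the four method names (`cancel_open_orders`, `set_passive_order_data`, `submit_order`, `log`) come from that fragment; the class, the fill model (a resting order fills completely), and the quoting rule (quote at the mid price toward a fixed target) are hypothetical simplifications, and the earlier steps that the fragment cuts off are not reconstructed.

```python
import math

class IntervalSimulator:
    """Toy per-interval loop; fill and quoting rules are illustrative assumptions."""

    def __init__(self, target_position):
        self.target_position = target_position
        self.position = 0.0
        self.open_order = None  # at most one active order at any time
        self.history = []

    def apply_fill(self, fill_qty):
        # position <- position + filled quantity for this interval
        self.position += fill_qty

    def cancel_open_orders(self):
        self.open_order = None

    def set_passive_order_data(self, mid_price):
        # Hypothetical rule: trade toward the target, quoting at the mid price.
        qty = self.target_position - self.position
        limit = mid_price if qty != 0 else math.nan
        return qty, limit

    def submit_order(self, qty, limit):
        # Submit only if the limit price is valid (not NaN).
        if not math.isnan(limit):
            self.open_order = (qty, limit)

    def log(self):
        self.history.append((self.position, self.open_order))

    def step(self, mid_price):
        # Hypothetical fill model: any resting order fills completely.
        fill = self.open_order[0] if self.open_order else 0.0
        self.apply_fill(fill)
        self.cancel_open_orders()
        qty, limit = self.set_passive_order_data(mid_price)
        self.submit_order(qty, limit)
        self.log()

sim = IntervalSimulator(target_position=2.0)
for price in [100.0, 101.0, 99.5]:
    sim.step(price)
```

Because every step cancels before it quotes, each interval replaces the previous order, which mirrors the one-active-order rule stated in the fragment.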