MadEvolve: Evolutionary Optimization of Trading Systems with Large Language Models
Pith reviewed 2026-05-25 05:21 UTC · model grok-4.3
The pith
An LLM-based evolutionary framework improves Bitcoin trading strategies through automated optimization of features and components.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
On simulation and backtesting setups for Bitcoin, the method achieves significant improvements on all tasks considered, such as evolving feature sets for signal generation, optimizing separate components of the trading strategy, and jointly evolving the feature pipeline together with the execution strategy.
What carries the argument
MadEvolve, a general-purpose algorithm optimization framework that uses large language models to iteratively evolve and improve algorithms through search.
If this is right
- Significant improvements when evolving feature sets for signal generation.
- Significant improvements when optimizing separate components of the trading strategy.
- Significant improvements when jointly evolving the feature pipeline together with the execution strategy.
- Support for the utility of AI-driven agentic and evolutionary algorithms for algorithmic trading.
Where Pith is reading between the lines
- The same optimization approach could be tested on other assets or time periods to check generalization beyond the Bitcoin setup.
- Live deployment would reveal whether backtest gains survive real-market execution frictions not captured in simulation.
- The method might reduce the amount of manual iteration needed to refine quantitative strategies.
- Over-reliance on a single simulation environment could produce strategies that degrade when market conditions shift.
Load-bearing premise
The simulation and backtesting setup accurately reflects real trading conditions without hidden overfitting or unrealistic assumptions that would invalidate the reported performance gains.
What would settle it
Applying the evolved strategies to live trading data with realistic transaction costs and slippage and finding no outperformance relative to baselines.
Original abstract
We explore the application of LLM-driven algorithm optimization to several common tasks in quantitative finance. MadEvolve, a general-purpose algorithm optimization framework inspired by DeepMind's Alpha-Evolve, was recently developed to optimize algorithms in computational cosmology. Here we demonstrate the utility of MadEvolve to optimize algorithmic trading strategies and alpha generation at the example of Bitcoin trading. On our simulation and backtesting setup, we achieve significant improvements on all tasks we considered, such as evolving feature sets for signal generation, optimizing separate components of the trading strategy, and jointly evolving the feature pipeline together with the execution strategy. Additionally, we compare our method to other agentic search approaches, specifically Claude Code, and carefully evaluate p-hacking probabilities on our simulation setup. Our findings strongly support the utility of AI-driven agentic and evolutionary algorithms for algorithmic trading and quantitative finance.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces MadEvolve, an LLM-driven evolutionary optimization framework adapted from Alpha-Evolve, and applies it to Bitcoin trading tasks including evolving feature sets for signal generation, optimizing individual strategy components, and jointly evolving the feature pipeline with the execution strategy. It reports significant improvements on a simulation and backtesting setup, compares performance to Claude Code, and evaluates p-hacking probabilities.
Significance. If the simulation and backtesting controls are shown to be rigorous, the work would provide evidence that evolutionary LLM methods can improve quantitative trading pipelines; the explicit mention of p-hacking evaluation is a positive step toward falsifiability.
major comments (2)
- [Abstract and §3] Abstract and §3 (Simulation Setup): the central claim of 'significant improvements on all tasks' rests entirely on the validity of the backtesting simulator, yet no description is given of how slippage, fees, latency, partial fills, or strictly causal feature construction are implemented, nor of the out-of-sample protocol that prevents the evolutionary search from fitting the test window.
- [§4 and Table 2] §4 (Results) and Table 2: without reported statistical tests, confidence intervals, or explicit out-of-sample Sharpe ratios before versus after evolution, the magnitude of the claimed gains cannot be evaluated against the risk of in-sample optimization.
minor comments (2)
- [Abstract] The abstract states that p-hacking probabilities were evaluated but does not report the numerical values or the exact procedure used.
- [§2] Notation for the evolutionary operators and LLM prompt templates should be defined once in a dedicated subsection rather than inline.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive comments, which highlight important aspects of reproducibility and statistical rigor in quantitative finance research. We address each major comment below and will revise the manuscript accordingly to strengthen the presentation of the simulation setup and results.
Point-by-point responses
- Referee: [Abstract and §3] Abstract and §3 (Simulation Setup): the central claim of 'significant improvements on all tasks' rests entirely on the validity of the backtesting simulator, yet no description is given of how slippage, fees, latency, partial fills, or strictly causal feature construction are implemented, nor of the out-of-sample protocol that prevents the evolutionary search from fitting the test window.
  Authors: We agree that a transparent and detailed description of the backtesting simulator is necessary to substantiate the performance claims. In the revised manuscript, we will substantially expand Section 3 (Simulation Setup) with a dedicated subsection on the simulator implementation. This will explicitly cover: slippage modeled proportionally to trade size and recent volatility; transaction fees at a fixed rate per trade; latency as a configurable delay parameter; partial fills simulated via order-book depth assumptions; strictly causal feature construction enforced by restricting all computations to data available at or before the prior timestep; and the out-of-sample protocol, in which evolutionary search and hyperparameter tuning occur only within a designated training window while final performance is measured on a completely held-out test period. These additions will allow readers to evaluate the controls against overfitting risks.
  Revision: yes
- Referee: [§4 and Table 2] §4 (Results) and Table 2: without reported statistical tests, confidence intervals, or explicit out-of-sample Sharpe ratios before versus after evolution, the magnitude of the claimed gains cannot be evaluated against the risk of in-sample optimization.
  Authors: We concur that statistical quantification is required to assess the reliability of the reported improvements. In the revised Section 4 and updated Table 2, we will add: bootstrap-derived 95% confidence intervals around all Sharpe ratios; paired statistical tests (e.g., t-tests or Wilcoxon tests) comparing pre- and post-evolution performance metrics; and explicit side-by-side reporting of out-of-sample Sharpe ratios before versus after MadEvolve optimization. We will also expand the existing p-hacking probability analysis to include sensitivity checks under different random seeds and data splits. These revisions will provide a clearer basis for evaluating whether the observed gains exceed what could arise from in-sample optimization alone.
  Revision: yes
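To make the promised controls concrete, the following is a minimal Python sketch of three of them: a strictly causal feature, a cost model with a fixed-rate fee plus slippage proportional to trade size and volatility, and a percentile bootstrap confidence interval for the Sharpe ratio. It is an illustration under stated assumptions, not the authors' implementation: the function names, the rolling-mean feature, the default values of `fee_rate` and `slippage_coeff`, and the use of i.i.d. resampling are all hypothetical choices made here.

```python
import math
import random
import statistics


def causal_rolling_mean(prices, window):
    """Feature at step t uses only prices[t-window:t], i.e. data up to t-1."""
    features = []
    for t in range(len(prices)):
        if t < window:
            features.append(float("nan"))  # not enough history yet
        else:
            features.append(statistics.fmean(prices[t - window:t]))
    return features


def trade_cost(notional, volatility, fee_rate=0.001, slippage_coeff=0.1):
    """Fixed-rate fee plus slippage proportional to trade size and volatility.

    fee_rate and slippage_coeff are placeholder values, not taken from the paper.
    """
    size = abs(notional)
    return size * fee_rate + size * slippage_coeff * volatility


def sharpe(returns):
    """Per-period Sharpe ratio (zero risk-free rate, no annualization)."""
    sd = statistics.stdev(returns)
    return statistics.fmean(returns) / sd if sd > 0 else float("nan")


def bootstrap_sharpe_ci(returns, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the Sharpe ratio (i.i.d. resampling)."""
    rng = random.Random(seed)
    n = len(returns)
    stats = []
    for _ in range(n_boot):
        sample = [returns[rng.randrange(n)] for _ in range(n)]
        s = sharpe(sample)
        if not math.isnan(s):  # skip degenerate resamples with zero variance
            stats.append(s)
    stats.sort()
    lo = stats[int((alpha / 2) * len(stats))]
    hi = stats[min(len(stats) - 1, int((1 - alpha / 2) * len(stats)))]
    return lo, hi
```

Resampling individual returns ignores autocorrelation, so for real return series a block bootstrap would likely be the more defensible choice. The interval is only meaningful when `returns` come from the held-out test period rather than the window used for the evolutionary search.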
Circularity Check
No derivation chain or first-principles claim present; empirical application only
Full rationale
The paper is an empirical demonstration of applying an existing evolutionary optimization framework (MadEvolve, inspired by Alpha-Evolve) to trading strategy components on a Bitcoin backtesting setup. It reports observed improvements across tasks and comparisons to other agents, without any mathematical derivation, uniqueness theorem, ansatz, or first-principles result that reduces to its own inputs. No equations, fitted parameters renamed as predictions, or self-citation load-bearing steps appear in the provided text. Standard backtesting validity concerns exist but fall under correctness rather than the enumerated circularity patterns. The work is self-contained as an application study.
A Detailed Trading Simulation Setup
In this appendix we explain our trading simulation setup in enough detail to make our results reproducible. Additional explanations are provided in Sec. 4.1.
A.1 Data and Splits
For all experiments, we ...
Portfolio position is updated: $\mathrm{position} \leftarrow \mathrm{position} + \Delta q^{\mathrm{fill}}_t$.
3. cancel_open_orders() removes resting orders.
4. set_passive_order_data() computes new trade quantity and limit price.
5. submit_order() submits if the limit price is valid (not NaN).
6. log() records state.
Only one order is active at any time; each interval replaces the previous order.
3 https://mass...
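The per-interval sequence in this excerpt (apply the fill, cancel resting orders, compute a new passive order, submit it if the limit price is not NaN, log) can be sketched as a small Python class. Only the four method names come from the excerpt. The `ToySimulator` and `Order` classes, the `apply_fill` and `step` helpers, the method arguments, and the 0.1% quoting rule are hypothetical, and the earlier steps of the original list, which are not visible in this excerpt, are not reproduced.

```python
import math
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Order:
    quantity: float      # signed: positive buys, negative sells
    limit_price: float


@dataclass
class ToySimulator:
    position: float = 0.0
    open_order: Optional[Order] = None
    history: list = field(default_factory=list)
    _pending: Optional[Order] = None

    def apply_fill(self, fill_qty: float) -> None:
        # position <- position + fill quantity for this interval
        self.position += fill_qty

    def cancel_open_orders(self) -> None:
        self.open_order = None

    def set_passive_order_data(self, target_position: float, mid_price: float) -> None:
        # Hypothetical rule: trade toward the target, quoting 0.1% away
        # from the mid on the passive side (below for buys, above for sells).
        quantity = target_position - self.position
        if quantity == 0 or math.isnan(mid_price):
            limit = float("nan")  # signals "nothing valid to submit"
        else:
            limit = mid_price * (0.999 if quantity > 0 else 1.001)
        self._pending = Order(quantity, limit)

    def submit_order(self) -> None:
        # Submit only if the limit price is valid (not NaN).
        if self._pending is not None and not math.isnan(self._pending.limit_price):
            self.open_order = self._pending
        self._pending = None

    def log(self) -> None:
        self.history.append((self.position, self.open_order))

    def step(self, fill_qty: float, target_position: float, mid_price: float) -> None:
        self.apply_fill(fill_qty)
        self.cancel_open_orders()
        self.set_passive_order_data(target_position, mid_price)
        self.submit_order()
        self.log()
```

Because `cancel_open_orders()` runs before every submission, at most one order is ever resting, which matches the statement that each interval replaces the previous order.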