EVIL: Evolving Interpretable Algorithms for Zero-Shot Inference on Event Sequences and Time Series with LLMs
Pith reviewed 2026-05-10 08:54 UTC · model grok-4.3
The pith
LLM-guided evolution discovers one compact Python program that performs zero-shot inference across event sequences and time series tasks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
EVIL applies LLM-guided evolutionary search to discover pure Python/NumPy programs that carry out zero-shot, in-context inference for dynamical systems. For next-event prediction in temporal point processes, rate matrix estimation in Markov jump processes, and time series imputation, one evolved algorithm generalizes across all datasets in the domain without per-dataset training. The programs are competitive with or better than deep learning baselines, execute far faster, and consist of fully interpretable code.
What carries the argument
LLM-guided evolutionary search that proposes, mutates, and selects candidate Python programs until a single compact function emerges that generalizes zero-shot to the full set of datasets for a given inference task.
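The propose, mutate, and select loop described above can be sketched roughly as follows. This is a toy illustration, not the paper's implementation: `propose_mutation` is a hypothetical stand-in for the LLM call, candidates are coefficient pairs rather than Python programs, and `fitness`, `evolve`, and every parameter value are assumptions made for the example.

```python
import random

def fitness(candidate, datasets):
    """Negative mean squared error of the rule y = a*x + b, averaged over datasets."""
    a, b = candidate
    total = 0.0
    for xs, ys in datasets:
        total += sum((a * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)
    return -total / len(datasets)

def propose_mutation(parent, rng):
    """Hypothetical stand-in for the LLM proposal step: perturb the parent."""
    return [p + rng.gauss(0.0, 0.1) for p in parent]

def evolve(datasets, generations=200, population_size=8, seed=0):
    rng = random.Random(seed)
    population = [[0.0, 0.0] for _ in range(population_size)]
    for _ in range(generations):
        # Propose children from randomly chosen parents.
        children = [propose_mutation(rng.choice(population), rng)
                    for _ in range(population_size)]
        # Select: keep the best candidates across parents and children.
        pool = population + children
        pool.sort(key=lambda c: fitness(c, datasets), reverse=True)
        population = pool[:population_size]
    return population[0]
```

Because a single fitness value is computed over all datasets at once, the surviving candidate is one rule shared across them, which mirrors the "one algorithm per task" framing. Parents compete with their children, so the best fitness never decreases between generations.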
If this is right
- A single algorithm suffices for inference across multiple datasets within each dynamical systems domain.
- The need for large-scale per-dataset training or fine-tuning is eliminated for these tasks.
- Inference runs orders of magnitude faster than neural network models while remaining inspectable.
- The same evolutionary process can in principle be applied to discover algorithms for additional inference problems in sequential data.
Where Pith is reading between the lines
- If the method scales, many inference tasks currently handled by trained models might instead be solved by automatically discovered symbolic procedures.
- The structure of the evolved programs could be examined to see whether they encode previously unrecognized mathematical regularities in the underlying processes.
- The approach suggests a route to hybrid systems that combine LLM search with traditional scientific computing for interpretable dynamical modeling.
Load-bearing premise
LLM-guided evolutionary search can reliably produce algorithms that generalize zero-shot across diverse datasets in each task without any per-dataset training or adaptation.
What would settle it
A new dataset in one of the three tasks where every program found by the evolutionary process fails to match deep learning accuracy or requires per-dataset retraining to work.
Original abstract
We introduce EVIL (EVolving Interpretable algorithms with LLMs), an approach that uses LLM-guided evolutionary search to discover simple, interpretable algorithms for dynamical systems inference. Rather than training neural networks on large datasets, EVIL evolves pure Python/NumPy programs that perform zero-shot, in-context inference across datasets. We apply EVIL to three distinct tasks: next-event prediction in temporal point processes, rate matrix estimation for Markov jump processes, and time series imputation. In each case, a single evolved algorithm generalizes across all evaluation datasets without per-dataset training (analogous to an amortized inference model). To the best of our knowledge, this is the first work to show that LLM-guided program evolution can discover a single compact inference function for these dynamical-systems problems. Across the three domains, the discovered algorithms are often competitive with, and even outperform, state-of-the-art deep learning models while being orders of magnitude faster and remaining fully interpretable.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces EVIL, an LLM-guided evolutionary search method to discover compact, interpretable pure Python/NumPy programs that perform zero-shot inference on dynamical systems tasks. It applies the approach to next-event prediction in temporal point processes, rate matrix estimation for Markov jump processes, and time series imputation, claiming that a single evolved algorithm per task generalizes across all evaluation datasets without per-dataset training or adaptation, often matching or outperforming state-of-the-art deep learning models while being orders of magnitude faster and fully interpretable.
Significance. If the empirical claims hold with proper validation, the work would be significant as the first demonstration of LLM-guided program evolution yielding amortized, zero-shot inference functions for these stochastic process and time series problems. It could provide a transparent, efficient alternative to neural amortized inference, with potential for broader algorithm discovery in dynamical systems modeling.
major comments (2)
- Abstract: the central claim that the discovered algorithms are 'often competitive with, and even outperform' state-of-the-art deep learning models lacks any quantitative metrics, error bars, dataset details, or validation procedures, leaving the performance and generalization assertions unsupported by visible evidence.
- Abstract: the zero-shot generalization claim (a single evolved program works across all evaluation datasets without per-dataset training) is load-bearing but requires explicit confirmation that evolutionary fitness evaluations used datasets strictly disjoint from (and distributionally similar to) the final test sets; otherwise the programs risk being tuned heuristics rather than true amortized inference functions.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: LLMs can be used to guide evolutionary search toward functional, generalizable Python programs for inference tasks.
Reference graph
Works this paper leans on
-
[1]
doi: 10.1038/s41586-023-06924-6
URL https://arxiv.org/abs/2506.13131. Aristeidis Panos. Decomposable transformer point processes. Advances in Neural Information Processing Systems, 37:88932–88955, 2024. Bernardino Romera-Paredes, Mohammadamin Barekatain, Alexander Novikov, Matej Balog, M. Pawan Kumar, Emilien Dupont, Francisco J. R. Ruiz, Jordan Ellenberg, Pengming Wang, Omar Fawzi, Pushm...
-
[2]
and Berghaus et al. [2026]. Best results are bold.

Dataset  Method    OTD           RMSE_e       RMSE_∆t      sMAPE_∆t
TAXI     HYPRO     21.653±0.163  1.231±0.015  0.372±0.004  93.803±0.454
         Dual-TPP  24.483±0.383  1.353±0.037  0.402±0.006  95.211±0.187
         A-NHP     24.762±0.217  1.276±0.015  0.430±0.003  97.388±0.381
         NHP       25.114±0.268  1.297±0.019  0.399±0.040  96.459±0.521
         IFTPP     24.053±0.609  1.364±0.032  0.384±0.005  95.719±0.779
...
-
[3]
on” configuration, the asymmetry induces a directional bias; in the “off
rather than generating a new synthetic benchmark from scratch. Their data-generation pipeline samples MJPs on state spaces of size 2–6 by first drawing a connected adjacency structure and then sampling the allowed off-diagonal rates from a small family of Beta priors. The initial distribution is chosen either as the stationary distribution of the sampled ...
-
[4]
Aggregate all context inter-event gaps and all mark transitions a→b
-
[5]
Estimate a robust global gap statistic from all observed gaps
-
[6]
For each previous mark a, estimate the average next-event gap after observing a
-
[7]
Build a smoothed transition table P(b|a) from the context sequences
-
[8]
Estimate the global majority mark as a fallback.
Prediction for one target prefix
-
[9]
If the prefix is empty, predict the global gap and the global majority mark
-
[10]
Otherwise, let a be the last observed mark and compute a recency-weighted average of the most recent gaps in the prefix
-
[11]
Predict the next gap by mixing this local recent-gap estimate with the context-level average gap associated with mark a; clamp extreme values
-
[12]
Predict the next time as the last observed time plus the predicted gap
-
[13]
Predict the next mark by combining local counts of transitions out of a in the prefix, mark frequencies within the prefix, and the smoothed context transition row P(·|a)
-
[14]
Return the predicted next time and the highest-scoring next mark.
Algorithm 1: EVIL algorithm for Point Processes
"""
Zero-shot MTPP next-event prediction heuristic (evolvable block).

The function predict_next_events is called by the evaluator with batched
target histories and a context pool; it must return predicted next times and marks.
"""

...
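The context-summary and prediction steps listed for Algorithm 1 can be sketched as below. This is a hedged reconstruction from the prose steps only, not the evolved program, whose listing is truncated here. The names `fit_context` and `predict_next`, the median as the "robust" gap statistic, the window size, the mixing weights, and the clamp range are all assumptions made for illustration.

```python
import numpy as np

def fit_context(context, num_marks, alpha=1.0):
    """Summarize context sequences given as (times, marks) pairs."""
    gaps = []
    gap_sum, gap_cnt = np.zeros(num_marks), np.zeros(num_marks)
    trans = np.full((num_marks, num_marks), alpha)  # additive smoothing (assumed)
    mark_counts = np.zeros(num_marks)
    for times, marks in context:
        times, marks = np.asarray(times, float), np.asarray(marks, int)
        np.add.at(mark_counts, marks, 1)
        d = np.diff(times)
        gaps.extend(d)
        np.add.at(gap_sum, marks[:-1], d)            # gap observed after each mark
        np.add.at(gap_cnt, marks[:-1], 1)
        np.add.at(trans, (marks[:-1], marks[1:]), 1)  # a -> b transition counts
    global_gap = float(np.median(gaps)) if gaps else 1.0
    gap_after = np.where(gap_cnt > 0, gap_sum / np.maximum(gap_cnt, 1), global_gap)
    return {
        "global_gap": global_gap,
        "gap_after": gap_after,
        "P": trans / trans.sum(axis=1, keepdims=True),
        "majority": int(np.argmax(mark_counts)),
    }

def predict_next(model, times, marks, window=5, mix=0.5):
    """Predict (next_time, next_mark) for one target prefix."""
    if len(times) == 0:
        return model["global_gap"], model["majority"]
    times, marks = np.asarray(times, float), np.asarray(marks, int)
    a = marks[-1]
    recent = np.diff(times)[-window:]
    if recent.size:
        w = np.arange(1, recent.size + 1)  # newer gaps get larger weights
        local_gap = float(np.sum(w * recent) / np.sum(w))
    else:
        local_gap = float(model["gap_after"][a])
    gap = mix * local_gap + (1 - mix) * float(model["gap_after"][a])
    g = model["global_gap"]
    gap = float(np.clip(gap, 0.1 * g, 10.0 * g))  # clamp range is an assumption
    num_marks = model["P"].shape[0]
    local_trans = np.zeros(num_marks)
    np.add.at(local_trans, marks[1:][marks[:-1] == a], 1)
    freq = np.bincount(marks, minlength=num_marks)
    # Combination weights 0.5 and 0.25 are illustrative, not from the source.
    score = (model["P"][a]
             + 0.5 * local_trans / max(local_trans.sum(), 1)
             + 0.25 * freq / freq.sum())
    return float(times[-1] + gap), int(np.argmax(score))
```

All statistics are computed from the context pool at call time, so nothing is trained per dataset; swapping the context swaps the estimates.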
-
[15]
Aggregate all inter-event gaps from the context sequences
-
[16]
Build first-order transition counts a→b between consecutive marks
-
[17]
Build second-order transition counts (a, b)→c for consecutive mark pairs
-
[18]
For each previous mark a, estimate a typical outgoing gap; for each next mark b, estimate a typical incoming gap
-
[19]
For each directed edge a→b, estimate an edge-specific typical gap whenever enough examples exist
-
[20]
Estimate a global fallback gap and a global majority mark.
Prediction for one target prefix
-
[21]
Predict the next mark using a reliability hierarchy: (a) if the last two marks form a previously observed pair, use the second-order transition table; (b) otherwise, if the last mark has outgoing transitions in the context data, use the first-order transition table; (c) otherwise, fall back to the global majority mark
-
[22]
Predict the next gap using an edge-specific timing estimate for the predicted transition whenever available
-
[23]
If no edge-specific estimate is available, combine the typical outgoing gap of the last mark with the typical incoming gap of the predicted mark
-
[24]
Mix this context-level timing estimate with the recent local gaps from the target prefix
-
[25]
Predict the next time as the last observed time plus the resulting gap
-
[26]
Return the predicted next time and mark.
Algorithm 2: EVIL (synthetic prior) algorithm for Point Processes
"""
Zero-shot MTPP next-event prediction heuristic (evolvable block).

The function predict_next_events is called by the evaluator with batched
target histories and a context pool; it must return predicted next times and marks.
"""

import ...
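The reliability hierarchy and edge-specific timing described for Algorithm 2 can be illustrated with the sketch below. It is reconstructed from the prose steps, not from the truncated listing. `fit_tables` and `predict` are hypothetical names, the median is assumed as the "typical" gap, the simple average of outgoing and incoming gaps is an assumed combination rule, and inputs are assumed to be plain Python lists.

```python
from collections import Counter, defaultdict
from statistics import median

def fit_tables(context, min_edge_examples=3):
    """Build transition and timing tables from (times, marks) context pairs."""
    first, second = defaultdict(Counter), defaultdict(Counter)
    edge_gaps, out_gaps, in_gaps = defaultdict(list), defaultdict(list), defaultdict(list)
    all_gaps, mark_counts = [], Counter()
    for times, marks in context:
        mark_counts.update(marks)
        for i in range(1, len(marks)):
            a, b, gap = marks[i - 1], marks[i], times[i] - times[i - 1]
            first[a][b] += 1
            edge_gaps[(a, b)].append(gap)
            out_gaps[a].append(gap)
            in_gaps[b].append(gap)
            all_gaps.append(gap)
            if i >= 2:
                second[(marks[i - 2], a)][b] += 1
    return {
        "first": first,
        "second": second,
        # Edge-specific gaps only where enough examples exist.
        "edge": {e: median(g) for e, g in edge_gaps.items() if len(g) >= min_edge_examples},
        "out": {a: median(g) for a, g in out_gaps.items()},
        "in": {b: median(g) for b, g in in_gaps.items()},
        "gap": median(all_gaps) if all_gaps else 1.0,
        "majority": mark_counts.most_common(1)[0][0] if mark_counts else 0,
    }

def predict(model, times, marks, mix=0.5, window=3):
    """Predict (next_time, next_mark) for one prefix given as Python lists."""
    # Mark: second-order table, then first-order table, then global majority.
    pair = tuple(marks[-2:])
    if len(marks) >= 2 and pair in model["second"]:
        mark = model["second"][pair].most_common(1)[0][0]
    elif marks and marks[-1] in model["first"]:
        mark = model["first"][marks[-1]].most_common(1)[0][0]
    else:
        mark = model["majority"]
    # Gap: edge-specific estimate, else outgoing/incoming average, else global.
    g = model["gap"]
    if marks and (marks[-1], mark) in model["edge"]:
        ctx_gap = model["edge"][(marks[-1], mark)]
    elif marks:
        ctx_gap = 0.5 * (model["out"].get(marks[-1], g) + model["in"].get(mark, g))
    else:
        ctx_gap = g
    # Mix the context-level estimate with recent local gaps from the prefix.
    recent = [t1 - t0 for t0, t1 in zip(times[:-1], times[1:])][-window:]
    if recent:
        gap = mix * (sum(recent) / len(recent)) + (1 - mix) * ctx_gap
    else:
        gap = ctx_gap
    last_time = times[-1] if times else 0.0
    return last_time + gap, mark
```

The fallbacks are ordered from most specific to least specific, so sparse context data degrades the prediction gradually instead of failing.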
-
[27]
Count the first observed state of each trajectory
-
[28]
Add smoothing, and optionally add a smaller contribution from second observations to reduce sensitivity to noise
-
[29]
Normalize the counts to obtain the initial-state distribution.
Characteristic time scale
-
[30]
Collect valid time differences between consecutive observations
-
[31]
Use a robust typical interval, such as the median positive difference.
Rate matrix estimation
-
[32]
For each observed interval spent in state i, add its duration to the exposure time of state i, while capping unusually long intervals
-
[33]
Count observed exits from each state, ignoring implausibly fast changes that are likely due to noise
-
[34]
For each valid transition i→j, accumulate a destination count, giving larger weight to transitions observed over shorter intervals
-
[35]
Estimate each state’s exit hazard as a smoothed ratio of exits to exposure time, then clip it to a reasonable range
-
[36]
Normalize destination counts from each state to obtain off-diagonal transition probabilities
-
[37]
Form the rate matrix by multiplying each state’s exit hazard with its destination distribution, then set the diagonal so that each row sums to zero.
Algorithm 3: EVIL (synthetic prior) algorithm for Markov Jump Processes
"""
Zero-shot MJP parameter estimation heuristic (evolvable block).
"""

import numpy as np

def estimate_mjp_parameters(
    observat...
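The estimation steps listed for Algorithm 3 can be sketched as follows. This is an illustrative reconstruction from the prose, not the evolved `estimate_mjp_parameters` function, whose listing is truncated. The function names, the cap at 5 time scales, the 0.05 threshold for "implausibly fast" changes, the weighting `tau / (tau + dt)`, the smoothing constants, and the clip range are all assumptions. At least two states are assumed.

```python
import numpy as np

def estimate_initial_distribution(states, num_states, smooth=1.0):
    """Smoothed, normalized counts of the first observed state per trajectory."""
    counts = np.full(num_states, smooth)
    np.add.at(counts, np.asarray(states, int)[:, 0], 1.0)
    return counts / counts.sum()

def estimate_rate_matrix(times, states, num_states, smooth=0.5):
    """times: (N, T) observation grid; states: (N, T) integer states."""
    times, states = np.asarray(times, float), np.asarray(states, int)
    dt = np.diff(times, axis=1)
    pos = dt[dt > 0]
    tau = float(np.median(pos)) if pos.size else 1.0  # characteristic time scale
    src, dst = states[:, :-1], states[:, 1:]
    valid = dt > 0
    # Exposure time per state, capping unusually long intervals.
    exposure = np.zeros(num_states)
    np.add.at(exposure, src[valid], np.minimum(dt[valid], 5.0 * tau))
    # Exits per state, ignoring implausibly fast changes.
    jump = valid & (src != dst) & (dt >= 0.05 * tau)
    exits = np.zeros(num_states)
    np.add.at(exits, src[jump], 1.0)
    # Destination counts; transitions over shorter intervals weigh more.
    dest = np.full((num_states, num_states), 1e-3)
    np.add.at(dest, (src[jump], dst[jump]), tau / (tau + dt[jump]))
    np.fill_diagonal(dest, 0.0)
    # Smoothed, clipped exit hazard per state.
    hazard = (exits + smooth) / (exposure + smooth * tau)
    hazard = np.clip(hazard, 1e-3 / tau, 10.0 / tau)
    # Off-diagonals: hazard times destination distribution; rows sum to zero.
    Q = hazard[:, None] * dest / dest.sum(axis=1, keepdims=True)
    np.fill_diagonal(Q, -Q.sum(axis=1))
    return Q
```

Factoring each row into an exit hazard and a destination distribution keeps the two quantities separately inspectable, which matches the interpretability emphasis of the text.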
-
[38]
Detect contiguous blocks of missing values
-
[39]
If the series is entirely missing, fill it with a simple constant fallback
-
[40]
For each missing block: (a) If the gap is short, leave it for the interpolation fallback. (b) If the gap is long enough and sufficient observed context exists before it, extract the observed window immediately preceding the gap. (c) Search earlier in the same series for candidate windows whose preceding context is fully observed and has the same length. (...
-
[41]
After motif retrieval has been attempted for all long gaps, fill any remaining missing positions by time-aware linear interpolation.
Algorithm 4: EVIL algorithm for time series imputation
"""
Time series imputation heuristic (evolvable block).

Data contains randomly masked point-wise missing patterns.
The function receives `observation_values` (T, D) - ...
-
[42]
-> np.ndarray:
    """
    Impute values at positions where prediction_mask is True.
    observation_values: (T, D) array, NaN at unobserved and holdout.
    observation_times: (T,) timestamps for each time step.
    prediction_mask: (T, D) True at positions we must predict.

    Returns:
    Full array same shape (T, D), with all NaNs filled (imputed).
    """
    out =...
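Two of the building blocks named in the Algorithm 4 description, detecting contiguous missing blocks and the time-aware linear interpolation fallback, can be sketched as below. The motif-retrieval step is deliberately left out because its description is truncated in this text. `missing_blocks` and `interpolate_series` are hypothetical helper names, the constant fallback of 0.0 is an assumption, and timestamps are assumed to be sorted in increasing order.

```python
import numpy as np

def missing_blocks(mask):
    """Return (start, end) pairs, end exclusive, for runs of True in a 1-D mask."""
    padded = np.concatenate(([False], np.asarray(mask, bool), [False]))
    edges = np.flatnonzero(padded[1:] != padded[:-1])
    return list(zip(edges[::2].tolist(), edges[1::2].tolist()))

def interpolate_series(values, times, fallback=0.0):
    """Fill NaNs in one series by linear interpolation over the timestamps."""
    values = np.asarray(values, float).copy()
    times = np.asarray(times, float)
    missing = np.isnan(values)
    if missing.all():
        values[:] = fallback  # entirely missing series: constant fallback
    elif missing.any():
        # np.interp uses the actual timestamps, so uneven spacing is respected;
        # positions outside the observed range take the nearest observed value.
        values[missing] = np.interp(times[missing], times[~missing], values[~missing])
    return values
```

A (T, D) array could be handled by applying `interpolate_series` to each column in turn.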
-
[43]
The data consists of *discrete snapshots* (recordings at specific times), NOT exact jump times. You do not see every state transition. Between `observation_grid[i]` and `observation_grid[i+1]`, zero, one, or multiple hidden jumps could have occurred.
-
[44]
The data is noisy. The `observation_values` might contain measurement errors. Return NO text outside the python code block. Do NOT use torch. Use only numpy.
Imputation. For imputation, the evolved function filled held-out entries in reduced time series, and the score was the negative mean MAE across datasets, score = −mean MAE. The config used the datasets BE...
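The imputation fitness described here, the negative mean MAE across datasets, can be written out as a short sketch. The function name `imputation_score` and the argument layout (one prediction, target, and held-out mask per dataset) are assumptions made for illustration and are not the paper's evaluator.

```python
import numpy as np

def imputation_score(predictions, targets, masks):
    """Return -mean(MAE) where each MAE is taken over one dataset's held-out entries."""
    maes = []
    for pred, true, mask in zip(predictions, targets, masks):
        mask = np.asarray(mask, bool)
        err = np.abs(np.asarray(pred, float)[mask] - np.asarray(true, float)[mask])
        maes.append(err.mean())
    return -float(np.mean(maes))
```

Averaging per-dataset MAEs, instead of pooling all entries, gives every dataset equal weight regardless of its size. The negation turns the error into a quantity the search can maximize.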
-
[45]
Institutional review board (IRB) approvals or equivalent for research with human subjects Question: Does the paper describe potential risks incurred by study participants, whether such risks were disclosed to the subjects, and whether Institutional Review Board (IRB) approvals (or an equivalent approval/review based on the requirements of your country or ...