
arxiv: 2604.15787 · v1 · submitted 2026-04-17 · 💻 cs.LG · cs.AI

EVIL: Evolving Interpretable Algorithms for Zero-Shot Inference on Event Sequences and Time Series with LLMs

Pith reviewed 2026-05-10 08:54 UTC · model grok-4.3

keywords program evolution · large language models · zero-shot inference · temporal point processes · Markov jump processes · time series imputation · interpretable algorithms · dynamical systems

The pith

LLM-guided evolution discovers one compact Python program per task that performs zero-shot inference on event sequences and time series.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents EVIL, a method that uses large language models to guide an evolutionary search for simple, interpretable algorithms written in pure Python and NumPy. The algorithms address next-event prediction in temporal point processes, rate matrix estimation for Markov jump processes, and time series imputation; in each domain, a single program generalizes zero-shot across all evaluation datasets, with no per-dataset training or adaptation. According to the abstract, the evolved programs are often competitive with state-of-the-art deep learning models and sometimes outperform them, while running orders of magnitude faster and remaining fully readable as code. If the results hold, automated program evolution can yield a single inference function per task for several dynamical systems problems.

Core claim

EVIL applies LLM-guided evolutionary search to discover pure Python/NumPy programs that carry out zero-shot, in-context inference for dynamical systems. For each of three tasks (next-event prediction in temporal point processes, rate matrix estimation in Markov jump processes, and time series imputation), one evolved algorithm generalizes across all datasets in the domain without per-dataset training. The programs are often competitive with or better than deep learning baselines, execute far faster, and consist of fully interpretable code.
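To make "a pure Python/NumPy inference program" concrete, here is a minimal sketch of a snapshot-based rate matrix estimator. It loosely follows the Markov jump process steps quoted in the appendix fragments further down this page (exposure time per state, exit hazard, destination distribution, diagonal set so that each row sums to zero). The function name `estimate_rate_matrix`, its arguments, and the `smoothing` constant are assumptions made for illustration; this is not the evolved program from the paper, and it ignores hidden jumps between snapshots.

```python
import numpy as np

def estimate_rate_matrix(states, times, n_states, smoothing=1e-3):
    """Crude rate matrix estimate from discrete snapshots (illustrative only).

    states: (T,) integer state observed at each snapshot.
    times:  (T,) increasing observation times.
    Assumes n_states >= 2 and treats each observed change as a single jump.
    """
    exposure = np.full(n_states, smoothing)             # time attributed to each state
    counts = np.full((n_states, n_states), smoothing)   # smoothed i -> j counts
    np.fill_diagonal(counts, 0.0)
    exits = np.zeros(n_states)

    for k in range(len(states) - 1):
        i, j = states[k], states[k + 1]
        exposure[i] += times[k + 1] - times[k]
        if i != j:
            exits[i] += 1.0
            counts[i, j] += 1.0

    hazard = exits / exposure                            # exit rate of each state
    dest = counts / counts.sum(axis=1, keepdims=True)    # destination probabilities
    Q = hazard[:, None] * dest                           # off-diagonal rates
    np.fill_diagonal(Q, -Q.sum(axis=1))                  # each row sums to zero
    return Q

# Toy two-state trajectory observed on a unit grid.
states = np.array([0, 0, 1, 1, 0, 1])
times = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
Q = estimate_rate_matrix(states, times, n_states=2)
```

The point of the sketch is only that such an estimator is a few lines of array arithmetic with no learned weights, which is what makes the evolved programs inspectable.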

What carries the argument

LLM-guided evolutionary search that proposes, mutates, and selects candidate Python programs until a single compact function emerges that generalizes zero-shot to the full set of datasets for a given inference task.
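The propose, mutate, select loop can be sketched as below. This is a toy illustration under loud assumptions: the LLM call is replaced by a hypothetical `propose_mutation` stub that draws from a fixed `CANDIDATE_POOL`, the task is a made-up next-gap prediction problem, and the fitness is a negative mean absolute error. None of these names or choices come from the paper.

```python
import random
import numpy as np

# Hypothetical candidate programs as source strings. In EVIL an LLM would
# write these; a fixed pool stands in for that step here.
CANDIDATE_POOL = [
    "def predict(gaps):\n    return float(gaps[-1])\n",
    "def predict(gaps):\n    return float(np.mean(gaps))\n",
    "def predict(gaps):\n    return float(np.median(gaps))\n",
]

def compile_program(source):
    """Execute candidate source and return its `predict` function."""
    namespace = {"np": np}
    exec(source, namespace)
    return namespace["predict"]

def fitness(source, datasets):
    """Negative mean absolute error of the next-gap prediction across datasets."""
    predict = compile_program(source)
    errors = [abs(predict(gaps[:-1]) - gaps[-1]) for gaps in datasets]
    return -float(np.mean(errors))

def propose_mutation(parent, rng):
    """Stand-in for the LLM call: ignores the parent and samples from the pool."""
    return rng.choice(CANDIDATE_POOL)

def evolve(datasets, iterations=20, seed=0):
    """Keep the best-scoring program seen so far (simple elitist selection)."""
    rng = random.Random(seed)
    best = CANDIDATE_POOL[0]
    best_score = fitness(best, datasets)
    for _ in range(iterations):
        child = propose_mutation(best, rng)
        score = fitness(child, datasets)
        if score > best_score:
            best, best_score = child, score
    return best, best_score

# Two toy inter-event gap sequences; the last gap is the prediction target.
toy_datasets = [
    np.array([1.0, 1.0, 9.0, 1.0, 1.0]),
    np.array([2.0, 2.0, 2.0, 8.0, 2.0]),
]
best_program, best_score = evolve(toy_datasets)
```

Because one score is averaged over all datasets, selection favors a program that works across them rather than one tuned to a single dataset, which is the property the zero-shot claim depends on.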

If this is right

  • A single algorithm suffices for inference across multiple datasets within each dynamical systems domain.
  • The need for large-scale per-dataset training or fine-tuning is eliminated for these tasks.
  • Inference runs orders of magnitude faster than neural network models while remaining inspectable.
  • The same evolutionary process can in principle be applied to discover algorithms for additional inference problems in sequential data.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the method scales, many inference tasks currently handled by trained models might instead be solved by automatically discovered symbolic procedures.
  • The structure of the evolved programs could be examined to see whether they encode previously unrecognized mathematical regularities in the underlying processes.
  • The approach suggests a route to hybrid systems that combine LLM search with traditional scientific computing for interpretable dynamical modeling.

Load-bearing premise

LLM-guided evolutionary search can reliably produce algorithms that generalize zero-shot across diverse datasets in each task without any per-dataset training or adaptation.

What would settle it

A new dataset in one of the three tasks where every program found by the evolutionary process fails to match deep learning accuracy or requires per-dataset retraining to work.

Figures

Figures reproduced from arXiv: 2604.15787 by David Berghaus.

Figure 1: Overview of the EVIL approach. The same EVIL evolutionary procedure is applied separately to temporal point processes, Markov jump processes, and time-series imputation, yielding one interpretable Python/NumPy inference function per task that generalizes across datasets in a zero-shot, in-context manner.

Figure 2: Entropy production on the DFR. This serves as a stress test for when inference becomes ...

Figure 3: Imputation on Motion Capture. The motif retrieval strategy enables EVIL to predict non-trivial patterns in missing windows.

Figure 4: Illustration of mark alternation in TAXI and how the evolved next-event heuristic handles it. The dataset often alternates between marks (e.g., pick-up vs. drop-off); this pattern is not covered by FIM-PP’s synthetic training data, whereas EVIL (synthetic prior) generalizes to it from synthetic evolution.

Figure 5: New best validation-set programs discovered during the 800-iteration long run. Across ...
Original abstract

We introduce EVIL (EVolving Interpretable algorithms with LLMs), an approach that uses LLM-guided evolutionary search to discover simple, interpretable algorithms for dynamical systems inference. Rather than training neural networks on large datasets, EVIL evolves pure Python/NumPy programs that perform zero-shot, in-context inference across datasets. We apply EVIL to three distinct tasks: next-event prediction in temporal point processes, rate matrix estimation for Markov jump processes, and time series imputation. In each case, a single evolved algorithm generalizes across all evaluation datasets without per-dataset training (analogous to an amortized inference model). To the best of our knowledge, this is the first work to show that LLM-guided program evolution can discover a single compact inference function for these dynamical-systems problems. Across the three domains, the discovered algorithms are often competitive with, and even outperform, state-of-the-art deep learning models while being orders of magnitude faster and remaining fully interpretable.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper introduces EVIL, an LLM-guided evolutionary search method to discover compact, interpretable pure Python/NumPy programs that perform zero-shot inference on dynamical systems tasks. It applies the approach to next-event prediction in temporal point processes, rate matrix estimation for Markov jump processes, and time series imputation, claiming that a single evolved algorithm per task generalizes across all evaluation datasets without per-dataset training or adaptation, often matching or outperforming state-of-the-art deep learning models while being orders of magnitude faster and fully interpretable.

Significance. If the empirical claims hold with proper validation, the work would be significant as the first demonstration of LLM-guided program evolution yielding amortized, zero-shot inference functions for these stochastic process and time series problems. It could provide a transparent, efficient alternative to neural amortized inference, with potential for broader algorithm discovery in dynamical systems modeling.

major comments (2)
  1. Abstract: the central claim that the discovered algorithms are 'often competitive with, and even outperform' state-of-the-art deep learning models lacks any quantitative metrics, error bars, dataset details, or validation procedures, leaving the performance and generalization assertions unsupported by visible evidence.
  2. Abstract: the zero-shot generalization claim (a single evolved program works across all evaluation datasets without per-dataset training) is load-bearing but requires explicit confirmation that evolutionary fitness evaluations used datasets strictly disjoint from (and distributionally similar to) the final test sets; otherwise the programs risk being tuned heuristics rather than true amortized inference functions.
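One part of the confirmation requested in comment 2 is mechanical: checking that no test sequence also appears in the pool used for fitness evaluation. A minimal sketch follows, assuming sequences are numeric arrays and that exact duplication is the leak of interest. The helper names `fingerprint` and `find_leaks` are hypothetical, and the check says nothing about distributional similarity between the two pools.

```python
import hashlib
import numpy as np

def fingerprint(sequence):
    """Hash the raw float64 bytes of a sequence so exact duplicates can be detected."""
    arr = np.ascontiguousarray(np.asarray(sequence, dtype=np.float64))
    return hashlib.sha256(arr.tobytes()).hexdigest()

def find_leaks(fitness_sequences, test_sequences):
    """Return indices of test sequences that also occur in the fitness pool."""
    seen = {fingerprint(s) for s in fitness_sequences}
    return [i for i, s in enumerate(test_sequences) if fingerprint(s) in seen]

# Toy pools: the second test sequence duplicates a fitness sequence.
fitness_pool = [[0.1, 0.5, 0.9], [0.2, 0.4]]
test_pool = [[0.3, 0.7], [0.2, 0.4], [1.0, 2.0, 3.0]]
leaks = find_leaks(fitness_pool, test_pool)
```

An empty result from a check like this would rule out verbatim leakage only; near-duplicates and shared generating processes would need a separate analysis.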

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Assessment limited to abstract; the approach assumes LLMs can guide code evolution effectively and that discovered programs will generalize without training. No free parameters or invented entities are described.

axioms (1)
  • domain assumption: LLMs can be used to guide evolutionary search toward functional, generalizable Python programs for inference tasks.
    Core to the method but not substantiated in the abstract.

pith-pipeline@v0.9.0 · 5475 in / 1322 out tokens · 26470 ms · 2026-05-10T08:54:10.546776+00:00 · methodology


Reference graph

Works this paper leans on

45 extracted references · 45 canonical work pages

  1. [1]

    doi: 10.1038/s41586-023-06924-6

    URL https://arxiv.org/abs/2506.13131. Aristeidis Panos. Decomposable transformer point processes. Advances in Neural Information Processing Systems, 37:88932–88955, 2024. Bernardino Romera-Paredes, Mohammadamin Barekatain, Alexander Novikov, Matej Balog, M. Pawan Kumar, Emilien Dupont, Francisco J. R. Ruiz, Jordan Ellenberg, Pengming Wang, Omar Fawzi, Pushm...

  2. [2]

    and Berghaus et al. [2026]. Best results are bold.

    Dataset  Method    OTD            RMSE_e        RMSE_∆t       sMAPE_∆t
    TAXI     HYPRO     21.653±0.163   1.231±0.015   0.372±0.004   93.803±0.454
    TAXI     Dual-TPP  24.483±0.383   1.353±0.037   0.402±0.006   95.211±0.187
    TAXI     A-NHP     24.762±0.217   1.276±0.015   0.430±0.003   97.388±0.381
    TAXI     NHP       25.114±0.268   1.297±0.019   0.399±0.040   96.459±0.521
    TAXI     IFTPP     24.053±0.609   1.364±0.032   0.384±0.005   95.719±0.779
    ...

  3. [3]

    on” configuration, the asymmetry induces a directional bias; in the “off

    rather than generating a new synthetic benchmark from scratch. Their data-generation pipeline samples MJPs on state spaces of size 2–6 by first drawing a connected adjacency structure and then sampling the allowed off-diagonal rates from a small family of Beta priors. The initial distribution is chosen either as the stationary distribution of the sampled ...

  4. [4]

    Aggregate all context inter-event gaps and all mark transitions a→b

  5. [5]

    Estimate a robust global gap statistic from all observed gaps

  6. [6]

    For each previous mark a, estimate the average next-event gap after observing a

  7. [7]

    Build a smoothed transition table P(b|a) from the context sequences

  8. [8]

    Prediction for one target prefix

    Estimate the global majority mark as a fallback. Prediction for one target prefix

  9. [9]

    If the prefix is empty, predict the global gap and the global majority mark

  10. [10]

    Otherwise, let a be the last observed mark and compute a recency-weighted average of the most recent gaps in the prefix

  11. [11]

    Predict the next gap by mixing this local recent-gap estimate with the context-level average gap associated with mark a; clamp extreme values

  12. [12]

    Predict the next time as the last observed time plus the predicted gap

  13. [13]

    Predict the next mark by combining local counts of transitions out of a in the prefix, mark frequencies within the prefix, and the smoothed context transition row P(· |a)

  14. [14]

    Algorithm 1: EVIL algorithm for Point Processes

    Return the predicted next time and the highest-scoring next mark.

    """
    Zero-shot MTPP next-event prediction heuristic (evolvable block).

    The function predict_next_events is called by the evaluator with batched
    target histories and a context pool; it must return predicted next times and marks.
    """

    ...

  15. [15]

    Aggregate all inter-event gaps from the context sequences

  16. [16]

    Build first-order transition counts a→b between consecutive marks

  17. [17]

    Build second-order transition counts (a, b)→c for consecutive mark pairs

  18. [18]

    For each previous mark a, estimate a typical outgoing gap; for each next mark b, estimate a typical incoming gap

  19. [19]

    For each directed edge a→b, estimate an edge-specific typical gap whenever enough examples exist

  20. [20]

    Prediction for one target prefix

    Estimate a global fallback gap and a global majority mark. Prediction for one target prefix

  21. [21]

    Predict the next mark using a reliability hierarchy: (a) if the last two marks form a previously observed pair, use the second-order transition table; (b) otherwise, if the last mark has outgoing transitions in the context data, use the first-order transition table; (c) otherwise, fall back to the global majority mark

  22. [22]

    Predict the next gap using an edge-specific timing estimate for the predicted transition whenever available

  23. [23]

    If no edge-specific estimate is available, combine the typical outgoing gap of the last mark with the typical incoming gap of the predicted mark

  24. [24]

    Mix this context-level timing estimate with the recent local gaps from the target prefix

  25. [25]

    Predict the next time as the last observed time plus the resulting gap

  26. [26]

    Algorithm 2: EVIL (synthetic prior) algorithm for Point Processes

    Return the predicted next time and mark.

    """
    Zero-shot MTPP next-event prediction heuristic (evolvable block).

    The function predict_next_events is called by the evaluator with batched
    target histories and a context pool; it must return predicted next times and marks.
    """

    import ...

  27. [27]

    Count the first observed state of each trajectory

  28. [28]

    Add smoothing, and optionally add a smaller contribution from second observations to reduce sensitivity to noise

  29. [29]

    Characteristic time scale

    Normalize the counts to obtain the initial-state distribution. Characteristic time scale

  30. [30]

    Collect valid time differences between consecutive observations

  31. [31]

    Rate matrix estimation

    Use a robust typical interval, such as the median positive difference. Rate matrix estimation

  32. [32]

    For each observed interval spent in state i, add its duration to the exposure time of state i, while capping unusually long intervals

  33. [33]

    Count observed exits from each state, ignoring implausibly fast changes that are likely due to noise

  34. [34]

    For each valid transition i→j, accumulate a destination count, giving larger weight to transitions observed over shorter intervals

  35. [35]

    Estimate each state’s exit hazard as a smoothed ratio of exits to exposure time, then clip it to a reasonable range

  36. [36]

    Normalize destination counts from each state to obtain off-diagonal transition probabilities

  37. [37]

    Algorithm 3: EVIL (synthetic prior) algorithm for Markov Jump Processes

    Form the rate matrix by multiplying each state’s exit hazard with its destination distribution, then set the diagonal so that each row sums to zero.

    """
    Zero-shot MJP parameter estimation heuristic (evolvable block).
    """

    import numpy as np

    def estimate_mjp_parameters(
        observat...

  38. [38]

    Detect contiguous blocks of missing values

  39. [39]

    If the series is entirely missing, fill it with a simple constant fallback

  40. [40]

    (b) If the gap is long enough and sufficient observed context exists before it, extract the observed window immediately preceding the gap

    For each missing block: (a) If the gap is short, leave it for the interpolation fallback. (b) If the gap is long enough and sufficient observed context exists before it, extract the observed window immediately preceding the gap. (c) Search earlier in the same series for candidate windows whose preceding context is fully observed and has the same length. (...

  41. [41]

    Algorithm 4: EVIL algorithm for time series imputation

    After motif retrieval has been attempted for all long gaps, fill any remaining missing positions by time-aware linear interpolation.

    """
    Time series imputation heuristic (evolvable block).

    Data contains randomly masked point-wise missing patterns.
    The function receives `observation_values` (T, D) - ...

  42. [42]

    observation_values: (T, D) array, NaN at unobserved and holdout

    -> np.ndarray:
        """
        Impute values at positions where prediction_mask is True.
        observation_values: (T, D) array, NaN at unobserved and holdout.
        observation_times: (T,) timestamps for each time step.
        prediction_mask: (T, D) True at positions we must predict.

        Returns:
        Full array same shape (T, D), with all NaNs filled (imputed).
        """
        out =...

  43. [43]

    You do not see every state transition

    The data consists of *discrete snapshots* (recordings at specific times), NOT exact jump times. You do not see every state transition. Between `observation_grid[i]` and `observation_grid[i+1]`, zero, one, or multiple hidden jumps could have occurred.

  44. [44]

    Limitations

    The data is noisy. The `observation_values` might contain measurement errors. Return NO text outside the python code block. Do NOT use torch. Use only numpy. Imputation. For imputation, the evolved function filled held-out entries in reduced time series, and the score was the negative mean MAE across datasets, score = −mean MAE. The config used the datasets BE...

  45. [45]

    Guidelines: • The answer [N/A] means that the paper does not involve crowdsourcing nor research with human subjects

    Institutional review board (IRB) approvals or equivalent for research with human subjects Question: Does the paper describe potential risks incurred by study participants, whether such risks were disclosed to the subjects, and whether Institutional Review Board (IRB) approvals (or an equivalent approval/review based on the requirements of your country or ...