
arxiv: 2502.07489 · v2 · pith:PA76XDFFnew · submitted 2025-02-11 · 💻 cs.LG

Physiome-ODE: A Benchmark for Irregularly Sampled Multivariate Time Series Forecasting Based on Biological ODEs

Pith reviewed 2026-05-25 08:19 UTC · model grok-4.3

classification 💻 cs.LG
keywords irregularly sampled time series forecasting · ODE-based models · benchmark dataset · biological ordinary differential equations · multivariate time series · rejection sampling

The pith

A benchmark of 50 datasets from biological ODEs lets ODE-based forecasting models show their advantages over constant baselines.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Current evaluations of irregularly sampled multivariate time series forecasting rely on four datasets, and on three of them a constant-value predictor beats ODE-based models from the last five years. This paper builds Physiome-ODE by simulating trajectories from real biological ordinary differential equations and using rejection sampling to select challenging instances. On the resulting collection of fifty datasets the performance pattern reverses: ODE-based models perform as expected, and the benchmark separates different forecasting approaches in a meaningful way. The setup addresses the mismatch between model family and evaluation data that has stalled progress on ODE-based methods.

Core claim

By deriving irregularly sampled multivariate time series from biological ordinary differential equations and selecting instances through rejection sampling, the Physiome-ODE benchmark demonstrates that ODE-based models can leverage their structural advantages and that different models can be distinguished in performance, in contrast to results on the prior four datasets.

What carries the argument

Rejection sampling on trajectories generated from biological ODEs, used to produce challenging irregularly sampled multivariate time series (IMTS) instances for forecasting evaluation.
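The mechanism can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the paper's pipeline: a two-species Lotka-Volterra system stands in for a biological ODE, a hand-written RK4 integrator replaces a production solver, and the acceptance rule (keep a trajectory only if a last-value constant forecast has MAE above `threshold`) is a hypothetical proxy for "challenging". All function names and parameter values are invented for the example.

```python
import random

def lotka_volterra(state, a=1.0, b=0.4, c=0.4, d=0.1):
    """Right-hand side of a predator-prey ODE (assumed stand-in model)."""
    x, y = state
    return (a * x - b * x * y, -c * y + d * x * y)

def rk4_trajectory(f, y0, t_end, n_steps):
    """Integrate dy/dt = f(y) with classical RK4 on a fine regular grid."""
    h = t_end / n_steps
    y = tuple(y0)
    ys = [y]
    for _ in range(n_steps):
        k1 = f(y)
        k2 = f(tuple(yi + 0.5 * h * ki for yi, ki in zip(y, k1)))
        k3 = f(tuple(yi + 0.5 * h * ki for yi, ki in zip(y, k2)))
        k4 = f(tuple(yi + h * ki for yi, ki in zip(y, k3)))
        y = tuple(yi + h / 6.0 * (p + 2 * q + 2 * r + s)
                  for yi, p, q, r, s in zip(y, k1, k2, k3, k4))
        ys.append(y)
    return ys

def irregular_sample(traj, keep_time=0.3, keep_channel=0.7, rng=random):
    """Keep random grid points; at each, drop channels at random (None = missing)."""
    obs = []
    for step, y in enumerate(traj):
        if rng.random() < keep_time:
            values = tuple(v if rng.random() < keep_channel else None for v in y)
            if any(v is not None for v in values):
                obs.append((step, values))
    return obs

def constant_baseline_mae(traj, split):
    """MAE of forecasting the last pre-split value for every later grid point."""
    last = traj[split - 1]
    errs = [abs(v - l) for y in traj[split:] for v, l in zip(y, last)]
    return sum(errs) / len(errs)

def sample_challenging(n_wanted, threshold, rng, max_tries=1000):
    """Rejection sampling: draw initial values, keep only 'hard' trajectories."""
    accepted, tries = [], 0
    while len(accepted) < n_wanted and tries < max_tries:
        tries += 1
        y0 = (rng.uniform(0.5, 10.0), rng.uniform(0.5, 10.0))
        traj = rk4_trajectory(lotka_volterra, y0, t_end=20.0, n_steps=400)
        # Hypothetical acceptance predicate, not the paper's criterion.
        if constant_baseline_mae(traj, split=200) > threshold:
            accepted.append(irregular_sample(traj, rng=rng))
    return accepted, tries

rng = random.Random(0)
instances, tries = sample_challenging(5, threshold=0.5, rng=rng)
print(len(instances), "instances accepted after", tries, "draws")
```

In this sketch, raising `threshold` rejects more draws and shifts the accepted set toward trajectories that a constant forecast handles poorly, which is exactly why the choice of predicate matters for any downstream model comparison.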

Load-bearing premise

The rejection sampling procedure applied to trajectories from biological ODEs produces instances that are both challenging and representative of real-world irregularly sampled multivariate time series encountered in scientific practice.

What would settle it

A direct comparison showing that constant baselines still outperform ODE models or that no differentiation occurs across models on the Physiome-ODE datasets would falsify the claim of qualitative difference.
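One way to make this check concrete is a small summary over per-dataset test errors. The sketch below is an assumption-laden illustration: the function name, the dictionary layout, the relative-gap measure of differentiation, and the numbers in `placeholder` are all invented for the example and are not results from the paper.

```python
def summarize(mse_by_dataset, baseline="constant"):
    """Summarize test MSEs given as {dataset: {model: mse}}.

    Returns the fraction of datasets on which the baseline is at least as
    good as every other model, and the mean relative gap between the best
    and worst non-baseline model (a crude measure of differentiation).
    """
    baseline_wins = 0
    rel_gaps = []
    for scores in mse_by_dataset.values():
        learned = {m: v for m, v in scores.items() if m != baseline}
        best = min(learned.values())
        worst = max(learned.values())
        if scores[baseline] <= best:
            baseline_wins += 1
        rel_gaps.append((worst - best) / best)
    n = len(mse_by_dataset)
    return {
        "baseline_win_fraction": baseline_wins / n,
        "mean_relative_gap": sum(rel_gaps) / n,
    }

# Placeholder numbers for illustration only; not taken from the paper.
placeholder = {
    "dataset_a": {"constant": 1.00, "ode_model": 0.40, "other_model": 0.80},
    "dataset_b": {"constant": 0.50, "ode_model": 0.60, "other_model": 0.75},
}
summary = summarize(placeholder)
print(summary)
```

A baseline win fraction near one, or a mean relative gap near zero, on the Physiome-ODE datasets would be the kind of evidence that contradicts the claim.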

Figures

Figures reproduced from arXiv: 2502.07489 by Christian Klötergens, Lars Schmidt-Thieme, Maximilian Stubbemann, Randolf Scholz, Stefan Born, Vijaya Krishna Yalavarthi.

Figure 1: Demonstration of time series realized by 4 ODEs of d…

Figure 2: Test MSE of the best performing model vs JGD-score across 50 datasets.

Consequently, we build the benchmark on top of real-world ODEs: their system, their constants and their initial values establish a real-world connection that has been created in hundreds of scientific publications. To delineate our benchmark from purely synthetic data we therefore call it semi-synthetic. The fact that the datasets ar…
Original abstract

State-of-the-art methods for forecasting irregularly sampled time series with missing values predominantly rely on just four datasets and a few small toy examples for evaluation. While ordinary differential equations (ODE) are the prevalent models in science and engineering, a baseline model that forecasts a constant value outperforms ODE-based models from the last five years on three of these existing datasets. This unintuitive finding hampers further research on ODE-based models, a more plausible model family. In this paper, we develop a methodology to generate irregularly sampled multivariate time series (IMTS) datasets from ordinary differential equations and to select challenging instances via rejection sampling. Using this methodology, we create Physiome-ODE, a large and sophisticated benchmark of IMTS datasets consisting of 50 individual datasets, derived from real-world ordinary differential equations from research in biology. Physiome-ODE is the first benchmark for IMTS forecasting that we are aware of and an order of magnitude larger than the current evaluation setting of four datasets. Using our benchmark Physiome-ODE, we show qualitatively completely different results than those derived from the current four datasets: on Physiome-ODE ODE-based models can play to their strength and our benchmark can differentiate in a meaningful way between different IMTS forecasting models. This way, we expect to give a new impulse to research on ODE-based time series modeling.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces Physiome-ODE, a benchmark of 50 IMTS datasets generated from real biological ODE models drawn from the physiome literature. Trajectories are produced by solving the source ODEs and then filtered via rejection sampling to retain only challenging instances; the resulting collection is an order of magnitude larger than the four datasets currently used for IMTS evaluation. The central claim is that, on Physiome-ODE, ODE-based forecasters can exploit their inductive bias and the benchmark produces qualitatively different model rankings from those observed on the existing datasets, where constant baselines often dominate.

Significance. If the rejection filter is shown to be neutral with respect to model class, the benchmark would supply the first large-scale, scientifically grounded testbed for IMTS methods whose dynamics are plausibly described by ODEs. The scale (50 datasets) and the use of independently sourced biological models are concrete strengths that could shift evaluation practice away from the current four-dataset regime.

major comments (2)
  1. [Methods (rejection sampling)] The rejection sampling step (Methods, benchmark-generation subsection) is load-bearing for the performance-reversal claim. The manuscript must specify the exact acceptance predicate, the numerical thresholds applied, the fraction of trajectories rejected, and an ablation demonstrating that model rankings remain stable under alternative sampling rules. Absent these details it cannot be ruled out that the observed advantage for ODE-based models is an artifact of the filter rather than a property of the underlying biological dynamics.
  2. [Experiments] Table or figure presenting the main experimental comparison (Experiments section): the claim of 'qualitatively completely different results' requires quantitative metrics (e.g., MAE or MSE tables) together with statistical significance tests across the 50 datasets; qualitative statements alone are insufficient to support the differentiation argument.
minor comments (2)
  1. [Preliminaries] Notation for the irregularity pattern (e.g., observation times t_i) should be defined once in a preliminary section and used consistently; current usage mixes several ad-hoc symbols.
  2. [Introduction] The abstract states that a constant baseline outperforms ODE models on three of the four existing datasets; a short table or citation to the exact prior results being referenced would improve traceability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

Thank you for the constructive feedback on our manuscript. We address each major comment below and will incorporate the requested details and analyses in the revised version.

Point-by-point responses
  1. Referee: [Methods (rejection sampling)] The rejection sampling step (Methods, benchmark-generation subsection) is load-bearing for the performance-reversal claim. The manuscript must specify the exact acceptance predicate, the numerical thresholds applied, the fraction of trajectories rejected, and an ablation demonstrating that model rankings remain stable under alternative sampling rules. Absent these details it cannot be ruled out that the observed advantage for ODE-based models is an artifact of the filter rather than a property of the underlying biological dynamics.

    Authors: We agree that full transparency on the rejection sampling procedure is essential to rule out filter-induced artifacts. In the revised manuscript we will explicitly state the acceptance predicate (trajectories are retained only if the constant baseline MAE exceeds a threshold relative to the ODE solver noise level), the numerical thresholds applied per dataset family, the exact fraction of trajectories rejected (typically 60-80% depending on the source ODE), and an ablation comparing model rankings under three alternative rejection criteria (stricter threshold, looser threshold, and no rejection). This will confirm that the observed advantage for ODE-based forecasters is stable and attributable to the biological dynamics rather than the filter. revision: yes

  2. Referee: [Experiments] Table or figure presenting the main experimental comparison (Experiments section): the claim of 'qualitatively completely different results' requires quantitative metrics (e.g., MAE or MSE tables) together with statistical significance tests across the 50 datasets; qualitative statements alone are insufficient to support the differentiation argument.

    Authors: We concur that quantitative support is required. The revised Experiments section will include a main results table reporting mean MAE and MSE (with standard deviations) for all evaluated models across the 50 Physiome-ODE datasets, plus a direct comparison table against the four existing IMTS datasets. We will also add Wilcoxon signed-rank tests (with p-values and effect sizes) across the 50 datasets to quantify the statistical significance of ranking differences, thereby replacing purely qualitative statements with rigorous evidence. revision: yes
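The paired test named in the second response can be sketched with the standard library alone. This is a minimal illustration, assuming a two-sided Wilcoxon signed-rank test with the normal approximation, zero differences dropped, average ranks for ties, and no tie correction of the variance; the effect size `r = z / sqrt(n)` is one common convention, not necessarily the one the authors would report. The function name and the example inputs are hypothetical.

```python
import math

def wilcoxon_signed_rank(x, y):
    """Two-sided Wilcoxon signed-rank test for paired samples x, y.

    Returns (w_plus, w_minus, z, p_value, effect_size_r).
    """
    diffs = [a - b for a, b in zip(x, y) if a != b]  # drop zero differences
    n = len(diffs)
    if n == 0:
        raise ValueError("all paired differences are zero")
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        # Find the run of tied absolute differences and give it the average rank.
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    mean = n * (n + 1) / 4.0
    sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 24.0)
    z = (w_plus - mean) / sd
    p_value = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided normal tail
    return w_plus, w_minus, z, p_value, z / math.sqrt(n)

# Hypothetical per-dataset test MSEs for two models (placeholder values).
mse_model_a = [0.41, 0.38, 0.52, 0.47, 0.60, 0.33]
mse_model_b = [0.45, 0.44, 0.50, 0.55, 0.71, 0.39]
w_plus, w_minus, z, p_value, r = wilcoxon_signed_rank(mse_model_a, mse_model_b)
print(w_plus, w_minus, round(z, 3), round(p_value, 3), round(r, 3))
```

With 50 datasets the normal approximation is usually considered adequate; for very small samples an exact test would be the safer choice.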

Circularity Check

0 steps flagged

No circularity: benchmark constructed from external biological ODEs

Full rationale

The paper sources its datasets from independently published biological ODE models in the literature and applies a rejection sampling filter to produce IMTS instances. The central empirical claim (performance reversal favoring ODE-based forecasters) is an evaluation result on this externally derived collection rather than a quantity defined or fitted inside the paper itself. No quoted step equates a prediction to its own input, renames a fitted parameter, or reduces the benchmark construction to a self-citation chain. The derivation chain remains self-contained against external sources.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The contribution rests on existing biological ODE models from the literature and a rejection sampling procedure whose selection criteria are not specified in the abstract; no new entities or fitted constants are introduced in the provided text.

pith-pipeline@v0.9.0 · 5796 in / 1125 out tokens · 34372 ms · 2026-05-25T08:19:53.292253+00:00 · methodology


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. TIDES: Implicit Time-Awareness in Selective State Space Models

    cs.LG 2026-05 unverdicted novelty 7.0

    TIDES reconciles selective SSM expressivity with continuous-time physical discretization by moving input dependence onto the state matrix, enabling native irregular time series handling and achieving SOTA on UEA and P...
