Physiome-ODE: A Benchmark for Irregularly Sampled Multivariate Time Series Forecasting Based on Biological ODEs
Pith reviewed 2026-05-25 08:19 UTC · model grok-4.3
The pith
A benchmark of 50 datasets from biological ODEs lets ODE-based forecasting models show their advantages over constant baselines.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The Physiome-ODE benchmark derives irregularly sampled multivariate time series (IMTS) from biological ordinary differential equations and selects instances through rejection sampling. On this benchmark, ODE-based models can exploit their structural advantages and different models can be separated by performance, in contrast to results on the four previously used datasets.
What carries the argument
Rejection sampling on trajectories generated from biological ODEs to produce challenging IMTS instances for forecasting evaluation.
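As an illustration of the first half of that pipeline, here is a minimal sketch of turning an ODE solution into an IMTS. Everything specific in it is an assumption made for illustration: the Lotka-Volterra system is a toy stand-in rather than a model from the benchmark, and the names `simulate` and `to_imts` as well as the two keep-probabilities are hypothetical, not the paper's procedure.

```python
import random

def lotka_volterra(state, alpha=1.1, beta=0.4, delta=0.1, gamma=0.4):
    """Toy two-species ODE used as a stand-in; not taken from the benchmark."""
    x, y = state
    return [alpha * x - beta * x * y, delta * x * y - gamma * y]

def rk4_step(f, state, dt):
    """One classical fourth-order Runge-Kutta step."""
    k1 = f(state)
    k2 = f([s + 0.5 * dt * k for s, k in zip(state, k1)])
    k3 = f([s + 0.5 * dt * k for s, k in zip(state, k2)])
    k4 = f([s + dt * k for s, k in zip(state, k3)])
    return [s + dt / 6.0 * (a + 2 * b + 2 * c + d)
            for s, a, b, c, d in zip(state, k1, k2, k3, k4)]

def simulate(f, x0, t_end, dt):
    """Dense, regularly spaced solution on a fine grid."""
    n_steps = int(round(t_end / dt))
    times, states = [0.0], [list(x0)]
    for i in range(n_steps):
        states.append(rk4_step(f, states[-1], dt))
        times.append((i + 1) * dt)
    return times, states

def to_imts(times, states, keep_time=0.2, keep_channel=0.7, seed=0):
    """Subsample the dense solution into (time, channel, value) triples.

    A grid point survives with probability keep_time (irregular sampling);
    each channel at a surviving point is observed with probability
    keep_channel (missing values). Both probabilities are assumptions.
    """
    rng = random.Random(seed)
    observations = []
    for t, s in zip(times, states):
        if rng.random() >= keep_time:
            continue
        for channel, value in enumerate(s):
            if rng.random() < keep_channel:
                observations.append((t, channel, value))
    return observations

times, states = simulate(lotka_volterra, [10.0, 5.0], t_end=20.0, dt=0.01)
imts = to_imts(times, states)
```

Storing observations as (time, channel, value) triples avoids padding a dense array with placeholders for unobserved channels, which is one common way to represent IMTS data.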
Load-bearing premise
The rejection sampling procedure applied to trajectories from biological ODEs produces instances that are both challenging and representative of real-world irregularly sampled multivariate time series encountered in scientific practice.
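To make concrete what such a filter might look like, the sketch below keeps an instance only when a constant forecast does badly on it. The predicate loosely mirrors the wording of the simulated rebuttal further down and is an assumption, not the paper's verified criterion; the sine-wave generator, the threshold value and all function names are hypothetical.

```python
import math
import random

def constant_baseline_mae(context, target):
    """MAE of forecasting the context mean for every target value."""
    forecast = sum(context) / len(context)
    return sum(abs(v - forecast) for v in target) / len(target)

def sample_instance(rng, n=40):
    """Hypothetical generator: a sine wave with random amplitude and frequency.

    Stands in for 'solve an ODE with randomly drawn parameters'.
    """
    amp = rng.uniform(0.0, 2.0)
    freq = rng.uniform(0.1, 2.0)
    series = [amp * math.sin(freq * 0.25 * i) for i in range(n)]
    return series[: n // 2], series[n // 2:]

def rejection_sample(n_keep, threshold, seed=0, max_tries=100_000):
    """Draw instances until n_keep of them pass the acceptance predicate."""
    rng = random.Random(seed)
    kept, tries = [], 0
    while len(kept) < n_keep and tries < max_tries:
        tries += 1
        context, target = sample_instance(rng)
        # Assumed predicate: accept only if the constant forecast is poor.
        if constant_baseline_mae(context, target) > threshold:
            kept.append((context, target))
    return kept, tries

kept, tries = rejection_sample(n_keep=20, threshold=0.5)
rejected_fraction = 1.0 - len(kept) / tries
```

A filter of this shape guarantees by construction that the constant baseline looks weak on the retained instances, which is why the referee report asks for the exact predicate, the rejected fraction and an ablation without rejection.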
What would settle it
A direct comparison showing that constant baselines still outperform ODE models or that no differentiation occurs across models on the Physiome-ODE datasets would falsify the claim of qualitative difference.
Original abstract
State-of-the-art methods for forecasting irregularly sampled time series with missing values predominantly rely on just four datasets and a few small toy examples for evaluation. While ordinary differential equations (ODE) are the prevalent models in science and engineering, a baseline model that forecasts a constant value outperforms ODE-based models from the last five years on three of these existing datasets. This unintuitive finding hampers further research on ODE-based models, a more plausible model family. In this paper, we develop a methodology to generate irregularly sampled multivariate time series (IMTS) datasets from ordinary differential equations and to select challenging instances via rejection sampling. Using this methodology, we create Physiome-ODE, a large and sophisticated benchmark of IMTS datasets consisting of 50 individual datasets, derived from real-world ordinary differential equations from research in biology. Physiome-ODE is the first benchmark for IMTS forecasting that we are aware of and an order of magnitude larger than the current evaluation setting of four datasets. Using our benchmark Physiome-ODE, we show qualitatively completely different results than those derived from the current four datasets: on Physiome-ODE ODE-based models can play to their strength and our benchmark can differentiate in a meaningful way between different IMTS forecasting models. This way, we expect to give a new impulse to research on ODE-based time series modeling.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Physiome-ODE, a benchmark of 50 IMTS datasets generated from real biological ODE models drawn from the physiome literature. Trajectories are produced by solving the source ODEs and then filtered via rejection sampling to retain only challenging instances; the resulting collection is an order of magnitude larger than the four datasets currently used for IMTS evaluation. The central claim is that, on Physiome-ODE, ODE-based forecasters can exploit their inductive bias and the benchmark produces qualitatively different model rankings from those observed on the existing datasets, where constant baselines often dominate.
Significance. If the rejection filter is shown to be neutral with respect to model class, the benchmark would supply the first large-scale, scientifically grounded testbed for IMTS methods whose dynamics are plausibly described by ODEs. The scale (50 datasets) and the use of independently sourced biological models are concrete strengths that could shift evaluation practice away from the current four-dataset regime.
major comments (2)
- [Methods (rejection sampling)] The rejection sampling step (Methods, benchmark-generation subsection) is load-bearing for the performance-reversal claim. The manuscript must specify the exact acceptance predicate, the numerical thresholds applied, the fraction of trajectories rejected, and an ablation demonstrating that model rankings remain stable under alternative sampling rules. Absent these details it cannot be ruled out that the observed advantage for ODE-based models is an artifact of the filter rather than a property of the underlying biological dynamics.
- [Experiments] Table or figure presenting the main experimental comparison (Experiments section): the claim of 'qualitatively completely different results' requires quantitative metrics (e.g., MAE or MSE tables) together with statistical significance tests across the 50 datasets; qualitative statements alone are insufficient to support the differentiation argument.
minor comments (2)
- [Preliminaries] Notation for the irregularity pattern (e.g., observation times t_i) should be defined once in a preliminary section and used consistently; current usage mixes several ad-hoc symbols.
- [Introduction] The abstract states that a constant baseline outperforms ODE models on three of the four existing datasets; a short table or citation to the exact prior results being referenced would improve traceability.
Simulated Author's Rebuttal
Thank you for the constructive feedback on our manuscript. We address each major comment below and will incorporate the requested details and analyses in the revised version.
Referee: [Methods (rejection sampling)] The rejection sampling step (Methods, benchmark-generation subsection) is load-bearing for the performance-reversal claim. The manuscript must specify the exact acceptance predicate, the numerical thresholds applied, the fraction of trajectories rejected, and an ablation demonstrating that model rankings remain stable under alternative sampling rules. Absent these details it cannot be ruled out that the observed advantage for ODE-based models is an artifact of the filter rather than a property of the underlying biological dynamics.
Authors: We agree that full transparency on the rejection sampling procedure is essential to rule out filter-induced artifacts. In the revised manuscript we will explicitly state the acceptance predicate (trajectories are retained only if the constant baseline MAE exceeds a threshold relative to the ODE solver noise level), the numerical thresholds applied per dataset family, the exact fraction of trajectories rejected (typically 60-80% depending on the source ODE), and an ablation comparing model rankings under three alternative rejection criteria (stricter threshold, looser threshold, and no rejection). This will confirm that the observed advantage for ODE-based forecasters is stable and attributable to the biological dynamics rather than the filter. revision: yes
Referee: [Experiments] Table or figure presenting the main experimental comparison (Experiments section): the claim of 'qualitatively completely different results' requires quantitative metrics (e.g., MAE or MSE tables) together with statistical significance tests across the 50 datasets; qualitative statements alone are insufficient to support the differentiation argument.
Authors: We concur that quantitative support is required. The revised Experiments section will include a main results table reporting mean MAE and MSE (with standard deviations) for all evaluated models across the 50 Physiome-ODE datasets, plus a direct comparison table against the four existing IMTS datasets. We will also add Wilcoxon signed-rank tests (with p-values and effect sizes) across the 50 datasets to quantify the statistical significance of ranking differences, thereby replacing purely qualitative statements with rigorous evidence. revision: yes
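For readers unfamiliar with the test named in that response, the following is a minimal standard-library sketch of a two-sided Wilcoxon signed-rank test on paired per-dataset errors. It uses the large-sample normal approximation without tie or continuity correction, so the p-value is approximate; in practice a library routine such as `scipy.stats.wilcoxon` would normally be preferred. The MAE values are synthetic placeholders, not results from the paper.

```python
import math

def wilcoxon_signed_rank(a, b):
    """Two-sided Wilcoxon signed-rank test via the normal approximation.

    Zero differences are dropped; tied absolute differences share their
    average rank. Returns (w_plus, w_minus, z, p_value).
    """
    diffs = [x - y for x, y in zip(a, b) if x != y]
    n = len(diffs)
    if n == 0:
        raise ValueError("all paired differences are zero")
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        # Extend j over the run of tied absolute differences.
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg_rank = (i + j) / 2.0 + 1.0  # ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    mean = n * (n + 1) / 4.0
    sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 24.0)
    z = (w_plus - mean) / sd
    p_value = math.erfc(abs(z) / math.sqrt(2.0))  # equals 2 * (1 - Phi(|z|))
    return w_plus, w_minus, z, p_value

# Synthetic placeholder errors for two hypothetical models on 50 datasets.
mae_model_a = [0.10 + 0.01 * i for i in range(50)]
mae_model_b = [1.2 * m for m in mae_model_a]  # model B is always worse here
w_plus, w_minus, z, p_value = wilcoxon_signed_rank(mae_model_a, mae_model_b)
```

Because the test works on ranks of paired differences, it does not require the per-dataset errors to be on a common scale, which suits a benchmark whose 50 datasets come from different ODE systems.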
Circularity Check
No circularity: benchmark constructed from external biological ODEs
The paper sources its datasets from independently published biological ODE models in the literature and applies a rejection sampling filter to produce IMTS instances. The central empirical claim (performance reversal favoring ODE-based forecasters) is an evaluation result on this externally derived collection rather than a quantity defined or fitted inside the paper itself. No quoted step equates a prediction to its own input, renames a fitted parameter, or reduces the benchmark construction to a self-citation chain. The derivation chain remains self-contained against external sources.
Forward citations
Cited by 1 Pith paper
- TIDES: Implicit Time-Awareness in Selective State Space Models
TIDES reconciles selective SSM expressivity with continuous-time physical discretization by moving input dependence onto the state matrix, enabling native irregular time series handling and achieving SOTA on UEA and P...