Seahorse: A Unified Benchmarking Framework for Spatiotemporal Event Modeling
Pith reviewed 2026-07-02 15:58 UTC · model grok-4.3
The pith
SEAHORSE unifies neural spatiotemporal point process models under a shared encode-evolve-decode interface and single benchmark protocol.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
SEAHORSE formalizes neural STPPs through a common encode-evolve-decode interface and trains, tunes, and evaluates every model family under a single executable benchmark protocol with raw-coordinate likelihood reporting. This enables fair comparisons but, more importantly, controlled diagnostic studies. We pair SEAHORSE with HawkesNest, a synthetic stress-test suite, and show that increasing event-pattern complexity exposes each family's inductive bias, degrading some models sharply and leaving others stable.
What carries the argument
The argument rests on the encode-evolve-decode interface, which standardizes how intensity models, conditional density models, latent dynamics, flow decoders, and score-based generators are represented, so that every family can be trained and evaluated in the same way.
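As a rough illustration of what a shared three-stage contract could look like, the sketch below defines an abstract base class and one toy model. Every name here (`Event`, `EncodeEvolveDecode`, `ToyConditionalDensity`, the method signatures) is hypothetical and chosen for illustration only; it is not SEAHORSE's actual API, which this page does not show.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
import math


@dataclass
class Event:
    t: float  # event time
    x: float  # raw spatial coordinates
    y: float


class EncodeEvolveDecode(ABC):
    """Hypothetical three-stage contract (not SEAHORSE's actual API)."""

    @abstractmethod
    def encode(self, history):
        """Summarize past events into a state."""

    @abstractmethod
    def evolve(self, state, t_from, t_to):
        """Carry the state forward from t_from to t_to."""

    @abstractmethod
    def decode(self, state, event):
        """Return the log-density of `event` given the evolved state."""

    def log_density(self, history, event):
        # Shared evaluation path: every family is scored the same way.
        state = self.encode(history)
        t_last = history[-1].t if history else 0.0
        state = self.evolve(state, t_last, event.t)
        return self.decode(state, event)


class ToyConditionalDensity(EncodeEvolveDecode):
    """Toy model: exponential waiting time times an isotropic Gaussian in space."""

    def __init__(self, rate=2.0, sigma=1.0):
        self.rate, self.sigma = rate, sigma

    def encode(self, history):
        # State = location of the most recent event (origin if there is none).
        return (history[-1].x, history[-1].y) if history else (0.0, 0.0)

    def evolve(self, state, t_from, t_to):
        # Attach the elapsed time; the location summary is left unchanged.
        return (*state, t_to - t_from)

    def decode(self, state, event):
        cx, cy, dt = state
        log_time = math.log(self.rate) - self.rate * dt
        sq_dist = (event.x - cx) ** 2 + (event.y - cy) ** 2
        log_space = -math.log(2 * math.pi * self.sigma ** 2) - sq_dist / (2 * self.sigma ** 2)
        return log_time + log_space


history = [Event(0.0, 0.0, 0.0), Event(1.0, 2.0, 0.0)]
toy_ll = ToyConditionalDensity().log_density(history, Event(1.5, 2.0, 0.0))
```

The point of the design is that `log_density` lives in the base class, so a benchmark harness only ever calls one method regardless of which family sits behind it. Whether real model families fit such a contract without compromise is exactly the premise the review questions.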
If this is right
- All recent neural STPP families become directly comparable under identical training and evaluation conditions.
- Diagnostic experiments can isolate which model families remain stable as synthetic event patterns grow more complex.
- Raw-coordinate likelihood reporting removes hidden differences introduced by coordinate normalization choices.
- HawkesNest provides a reproducible way to measure how inductive biases manifest under controlled stress.
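To see why the reporting convention matters, consider a model that is trained on normalized coordinates. A minimal sketch, assuming a simple affine normalization `z = (x - shift) / scale` and a standard Gaussian density on `z` (both assumptions made here for illustration, not taken from the paper): the log-likelihood on raw coordinates differs from the normalized one by the log-Jacobian term `-log|scale|`, so two papers using different scalings would report different numbers for the same model.

```python
import math


def normal_logpdf(x, mean, std):
    """Log-density of a univariate Gaussian."""
    return -0.5 * math.log(2 * math.pi) - math.log(std) - 0.5 * ((x - mean) / std) ** 2


def raw_logpdf_from_normalized(x, shift, scale, logpdf_z):
    """Map a log-density defined on z = (x - shift) / scale back to raw x.

    Change of variables: log p_x(x) = log p_z(z) - log|scale|.
    """
    z = (x - shift) / scale
    return logpdf_z(z) - math.log(abs(scale))


# Hypothetical longitude-like coordinate and normalization constants.
shift, scale = -73.9, 0.05
x = -73.95


def model_on_z(z):
    # Stand-in for a trained model: standard Gaussian on normalized input.
    return normal_logpdf(z, 0.0, 1.0)


normalized_ll = model_on_z((x - shift) / scale)
raw_ll = raw_logpdf_from_normalized(x, shift, scale, model_on_z)
print(f"normalized: {normalized_ll:.4f}  raw: {raw_ll:.4f}")
```

With a scale of 0.05 the two values differ by `-log(0.05)`, roughly 3 nats per coordinate, which is large compared with typical gaps between models.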
Where Pith is reading between the lines
- Practitioners could use the framework outputs to select models for domain-specific tasks such as disease spread forecasting.
- The shared interface may make it easier to combine components from different model families into new hybrids.
- Wider adoption of the protocol could reduce the number of incomparable results appearing in follow-on papers.
Load-bearing premise
The encode-evolve-decode interface can represent every recent neural STPP family without forcing architectural compromises that alter their original behavior.
What would settle it
A demonstration that at least one published neural STPP family cannot be expressed through the encode-evolve-decode interface without changing its likelihood computation or performance relative to its original implementation.
Original abstract
Spatiotemporal point processes (STPPs) model event data in continuous time and space, with applications in mobility, epidemiology, and public safety. Recent neural STPPs span expressive intensity models, conditional density models, continuous-time latent dynamics, normalizing-flow spatial decoders, and score-based generative mechanisms. Yet comparison remains fragile because implementations differ in preprocessing, coordinate normalization, splits, likelihood conventions, and evaluation protocols. We present SEAHORSE, a unified framework for reproducible STPP experimentation. SEAHORSE formalizes neural STPPs through a common encode-evolve-decode interface and trains, tunes, and evaluates every model family under a single executable benchmark protocol with raw-coordinate likelihood reporting. This enables fair comparisons but, more importantly, controlled diagnostic studies. We pair SEAHORSE with HawkesNest, a synthetic stress-test suite, and show that increasing event-pattern complexity exposes each family's inductive bias, degrading some models sharply and leaving others stable. Code: https://github.com/YahyaAalaila/seahorse.
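For orientation, the quantity that the "likelihood conventions" refer to is usually the standard point-process log-likelihood. The block below states the textbook form for events observed on a time window [0, T] and spatial domain S; the symbols are chosen here for illustration and are not copied from the paper.

```latex
% Log-likelihood of events (t_1, s_1), ..., (t_n, s_n) on [0, T] x S,
% with conditional intensity \lambda^*(t, s) given the history:
\log L = \sum_{i=1}^{n} \log \lambda^*(t_i, s_i)
         - \int_{0}^{T} \int_{\mathcal{S}} \lambda^*(t, s) \, ds \, dt

% Common factorization into a temporal intensity and a conditional
% spatial density f^*(s \mid t):
\lambda^*(t, s) = \lambda^*(t) \, f^*(s \mid t)
```

Intensity-based families parameterize the intensity directly and must handle the integral term, while conditional-density families work through the factorized form; this difference is one reason reported likelihoods are hard to compare without a shared protocol.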
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces SEAHORSE, a unified benchmarking framework for neural spatiotemporal point processes that standardizes models from intensity-based, conditional-density, latent-dynamics, normalizing-flow, and score-based families through a shared encode-evolve-decode interface. All models are trained, tuned, and evaluated under one executable protocol that reports raw-coordinate likelihoods; the framework is paired with the HawkesNest synthetic stress-test suite, which increases event-pattern complexity to expose differential inductive biases across families.
Significance. If the encode-evolve-decode interface can embed the listed families while preserving their original inductive biases and likelihood semantics, SEAHORSE would provide a valuable contribution by enabling reproducible, apples-to-apples comparisons and controlled diagnostic experiments on STPP architectures. The provision of executable code and a single benchmark protocol is a concrete strength that directly addresses the reproducibility issues noted in the abstract.
Major comments (2)
- §3 (encode-evolve-decode interface definition): The central claim that the interface accommodates score-based generators and normalizing-flow decoders without forcing architectural compromises or altering original likelihood semantics is asserted but not supported by any equivalence verification (e.g., no side-by-side likelihood computation or reparameterization ablation is reported for these families). This is load-bearing for the diagnostic results on HawkesNest.
- §5 (HawkesNest experiments): The reported degradation patterns are presented as exposing each family's inductive bias, yet the manuscript does not demonstrate that the synthetic data generation and coordinate handling in HawkesNest are identical to the raw-coordinate likelihood protocol used for the real benchmarks; any mismatch would make the diagnostics reflect the wrapper rather than the original model classes.
Minor comments (2)
- Abstract: The sentence claiming the interface 'enables ... controlled diagnostic studies' would benefit from one concrete example of a diagnostic that becomes possible only under the unified protocol.
- Notation: The description of the evolve stage does not explicitly state whether the time and space components remain decoupled for all model families or whether certain families require joint evolution; a short clarifying sentence would improve readability.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive report. The two major comments identify important points where additional verification would strengthen the central claims regarding the encode-evolve-decode interface and the diagnostic validity of HawkesNest. We address each below and commit to revisions that directly incorporate the requested evidence.
- Referee: §3 (encode-evolve-decode interface definition): The central claim that the interface accommodates score-based generators and normalizing-flow decoders without forcing architectural compromises or altering original likelihood semantics is asserted but not supported by any equivalence verification (e.g., no side-by-side likelihood computation or reparameterization ablation is reported for these families). This is load-bearing for the diagnostic results on HawkesNest.
  Authors: We agree that explicit verification is needed to substantiate the claim. The interface delegates likelihood evaluation to each model's native mechanism (score matching for score-based models; change-of-variables in raw coordinates for normalizing-flow decoders) without reparameterization of the density itself. To address the referee's concern, we will add an appendix containing side-by-side likelihood computations on a controlled synthetic dataset, comparing the wrapped implementations against the original standalone code for both families. This will confirm numerical equivalence within floating-point tolerance and will be referenced from §3.
  Revision: yes
- Referee: §5 (HawkesNest experiments): The reported degradation patterns are presented as exposing each family's inductive bias, yet the manuscript does not demonstrate that the synthetic data generation and coordinate handling in HawkesNest are identical to the raw-coordinate likelihood protocol used for the real benchmarks; any mismatch would make the diagnostics reflect the wrapper rather than the original model classes.
  Authors: The HawkesNest generator produces events directly in the same raw spatiotemporal coordinate system used by the real-data loaders, and SEAHORSE applies an identical preprocessing and likelihood-evaluation pipeline to both synthetic and real datasets. Nevertheless, we acknowledge that this identity was stated rather than demonstrated. We will revise §5 to include an explicit subsection and accompanying code reference that verifies (i) identical coordinate ranges and units, (ii) the same raw-coordinate likelihood computation path, and (iii) the absence of any additional normalization steps unique to the synthetic suite. This will be supported by a small verification script released with the repository.
  Revision: yes
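The side-by-side equivalence check described in the first response could take roughly the following shape. This is a sketch under stated assumptions: the two likelihood functions are stand-ins (a naive and a recursive implementation of the same temporal exponential-kernel Hawkes log-likelihood), not the score-based or flow models the referee asks about, and `check_equivalence` is a hypothetical helper rather than anything from the SEAHORSE repository.

```python
import math


def hawkes_ll_naive(times, T, mu, alpha, beta):
    """Stand-in 'original' code: O(n^2) exponential-kernel Hawkes log-likelihood."""
    ll = 0.0
    for i, t in enumerate(times):
        excite = sum(beta * math.exp(-beta * (t - s)) for s in times[:i])
        ll += math.log(mu + alpha * excite)
    compensator = mu * T + alpha * sum(1.0 - math.exp(-beta * (T - t)) for t in times)
    return ll - compensator


def hawkes_ll_recursive(times, T, mu, alpha, beta):
    """Stand-in 'wrapped' code: same model, evaluated with an O(n) recursion."""
    ll, decay_sum, prev = 0.0, 0.0, None
    for t in times:
        if prev is not None:
            # Accumulates sum_j exp(-beta * (t - t_j)) over all earlier events.
            decay_sum = math.exp(-beta * (t - prev)) * (decay_sum + 1.0)
        ll += math.log(mu + alpha * beta * decay_sum)
        prev = t
    compensator = mu * T + alpha * sum(1.0 - math.exp(-beta * (T - t)) for t in times)
    return ll - compensator


def check_equivalence(reference_fn, wrapped_fn, sequences, rel_tol=1e-9, **params):
    """Return (all_close, max_abs_diff) over the given event sequences."""
    pairs = [(reference_fn(s, **params), wrapped_fn(s, **params)) for s in sequences]
    ok = all(math.isclose(a, b, rel_tol=rel_tol, abs_tol=1e-12) for a, b in pairs)
    return ok, max(abs(a - b) for a, b in pairs)


# Small hand-written sequences; a real check would use a controlled synthetic dataset.
sequences = [[0.5, 0.9, 1.4, 3.0], [0.1, 0.2, 0.25, 2.7, 2.8]]
ok, max_diff = check_equivalence(
    hawkes_ll_naive, hawkes_ll_recursive, sequences, T=4.0, mu=0.3, alpha=0.6, beta=1.5
)
print("equivalent:", ok, "max abs diff:", max_diff)
```

Reporting the maximum absolute difference alongside the pass/fail flag makes it easier to tell a genuine semantic mismatch from ordinary floating-point noise.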
Circularity Check
No circularity: software framework and benchmark protocol with no derivation chain
The paper presents SEAHORSE as a unified benchmarking framework that formalizes neural STPPs via an encode-evolve-decode interface and provides a shared training/evaluation protocol. No mathematical derivations, parameter fits, predictions, or uniqueness theorems are claimed. The interface is an engineering design choice for reproducibility rather than a result derived from prior equations or self-citations. No steps reduce by construction to inputs, and the contribution is self-contained as executable code and protocol rather than any fitted or renamed result.
Axiom & Free-Parameter Ledger
Axioms (1)
- Domain assumption: Neural STPPs from different families can be represented without loss of fidelity inside a shared encode-evolve-decode structure.
A Software Interface and Reproducibility
A.1 Architecture Overview
Figure 6 summarizes the internal software architecture of SEAHORSE. The framework separates co...