
arxiv: 2607.01022 · v1 · pith:PWBBKTQXnew · submitted 2026-07-01 · 💻 cs.LG

Seahorse: A Unified Benchmarking Framework for Spatiotemporal Event Modeling

Pith reviewed 2026-07-02 15:58 UTC · model grok-4.3

classification 💻 cs.LG
keywords spatiotemporal point processes · neural STPP · benchmarking framework · encode-evolve-decode · HawkesNest · inductive bias · raw-coordinate likelihood

The pith

SEAHORSE unifies neural spatiotemporal point process models under a shared encode-evolve-decode interface and single benchmark protocol.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents SEAHORSE as a framework that structures neural models for spatiotemporal point processes around a common encode-evolve-decode interface. This structure allows every model family to be trained, tuned, and evaluated with identical preprocessing, splits, and raw-coordinate likelihood reporting. The resulting protocol supports direct comparisons and controlled experiments that vary event pattern complexity. The authors supply a companion synthetic suite called HawkesNest to perform those experiments and demonstrate how complexity reveals differing inductive biases across model types. Consistent evaluation protocols matter for applications such as mobility tracking and epidemiology where model reliability depends on knowing which inductive assumptions hold.

Core claim

SEAHORSE formalizes neural STPPs through a common encode-evolve-decode interface and trains, tunes, and evaluates every model family under a single executable benchmark protocol with raw-coordinate likelihood reporting. This enables fair comparisons but, more importantly, controlled diagnostic studies. We pair SEAHORSE with HawkesNest, a synthetic stress-test suite, and show that increasing event-pattern complexity exposes each family's inductive bias, degrading some models sharply and leaving others stable.

What carries the argument

The encode-evolve-decode interface carries the argument: it gives intensity models, conditional density models, latent dynamics, flow decoders, and score-based generators one standardized representation, so that all of them can be trained and evaluated in the same way.
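
To make the three-stage idea concrete, here is a minimal sketch of what such a contract could look like. Every name in it (`EncodeEvolveDecode`, `ToyRenewalModel`, the method signatures) is a hypothetical illustration rather than the SEAHORSE API, and the toy model is a deliberately simple stand-in for a neural one.

```python
from abc import ABC, abstractmethod
import math


class EncodeEvolveDecode(ABC):
    """Hypothetical three-stage contract; not the SEAHORSE API."""

    @abstractmethod
    def encode(self, history):
        """Summarize past events [(t, x, y), ...] into a state."""

    @abstractmethod
    def evolve(self, state, t):
        """Advance the state from the last event to query time t."""

    @abstractmethod
    def decode(self, state, event):
        """Return the log-density of `event` given the evolved state."""

    def log_likelihood(self, events):
        """Sum conditional log-densities; every model shares this loop."""
        total = 0.0
        for i, event in enumerate(events):
            state = self.evolve(self.encode(events[:i]), event[0])
            total += self.decode(state, event)
        return total


class ToyRenewalModel(EncodeEvolveDecode):
    """Exponential waiting times and a Gaussian jump from the previous location."""

    def __init__(self, rate=1.0, sigma=1.0):
        self.rate, self.sigma = rate, sigma

    def encode(self, history):
        # The state is just the most recent event (origin if there is none).
        return history[-1] if history else (0.0, 0.0, 0.0)

    def evolve(self, state, t):
        last_t, x, y = state
        return (t - last_t, x, y)  # elapsed time plus last known location

    def decode(self, state, event):
        dt, x0, y0 = state
        _, x, y = event
        log_time = math.log(self.rate) - self.rate * dt
        sq_dist = (x - x0) ** 2 + (y - y0) ** 2
        log_space = -sq_dist / (2 * self.sigma**2) - math.log(2 * math.pi * self.sigma**2)
        return log_time + log_space


events = [(0.5, 0.1, -0.2), (1.2, 0.4, 0.0), (2.0, 0.3, 0.5)]
print(ToyRenewalModel(rate=1.5, sigma=0.8).log_likelihood(events))
```

The point of the design is that the evaluation loop lives in the base class, so two model families that differ only in their three stages are scored by identical code. The sketch omits the survival term after the last event, which a full likelihood over a fixed observation window would include.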

If this is right

  • All recent neural STPP families become directly comparable under identical training and evaluation conditions.
  • Diagnostic experiments can isolate which model families remain stable as synthetic event patterns grow more complex.
  • Raw-coordinate likelihood reporting removes hidden differences introduced by coordinate normalization choices.
  • HawkesNest provides a reproducible way to measure how inductive biases manifest under controlled stress.
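
The third point can be illustrated with a small worked example. The sketch below assumes a model fitted in standardized coordinates z = (x - mean) / scale and shows that its log-likelihood differs from the raw-coordinate value by the log-determinant of the normalization. The event location and the normalization statistics are made-up values, and the Gaussian is only a stand-in for a learned spatial density.

```python
import math


def std_normal_logpdf(z):
    """Log-density of a one-dimensional standard normal."""
    return -0.5 * (z * z + math.log(2.0 * math.pi))


def normalized_loglik(point, mean, scale):
    """Log-density measured in normalized coordinates z = (x - mean) / scale."""
    return sum(std_normal_logpdf((x - m) / s) for x, m, s in zip(point, mean, scale))


def raw_loglik(point, mean, scale):
    """Same model reported in raw coordinates via the change of variables."""
    log_det = sum(math.log(s) for s in scale)  # log |det dx/dz| of the diagonal affine map
    return normalized_loglik(point, mean, scale) - log_det


# Hypothetical event location (longitude, latitude) and normalization statistics.
point = (-73.98, 40.75)
mean = (-74.00, 40.70)
scale = (0.05, 0.04)

gap = normalized_loglik(point, mean, scale) - raw_loglik(point, mean, scale)
print(f"normalized: {normalized_loglik(point, mean, scale):.4f}")
print(f"raw:        {raw_loglik(point, mean, scale):.4f}")
print(f"gap:        {gap:.4f}  (sum of log scales)")
```

Because the gap depends only on the scale statistics, two papers that normalize the same dataset differently can report different likelihoods for an identical model. Reporting in raw coordinates removes that degree of freedom.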

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Practitioners could use the framework outputs to select models for domain-specific tasks such as disease spread forecasting.
  • The shared interface may make it easier to combine components from different model families into new hybrids.
  • Wider adoption of the protocol could reduce the number of incomparable results appearing in follow-on papers.

Load-bearing premise

The encode-evolve-decode interface can represent every recent neural STPP family without forcing architectural compromises that alter their original behavior.

What would settle it

A demonstration that at least one published neural STPP family cannot be expressed through the encode-evolve-decode interface without changing its likelihood computation or performance relative to its original implementation.

Figures

Figures reproduced from arXiv: 2607.01022 by Gerrit Großmann, Sebastian Vollmer, Yahya Aalaila.

Figure 1: Overview of SEAHORSE. The framework takes fixed event datasets, model presets, and benchmark configuration as inputs, runs heterogeneous STPP models under a common contract, and returns comparable performance metrics, selected configurations, and reproducible artifacts.
Figure 2: Unified model decomposition for neural STPPs. Every method in our benchmark instantiates …
Figure 3
Figure 4: Learning dynamics on the HawkesNest entanglement suite. Each panel reports test NLL as …
Figure 5: Post-NLL diagnostics on the HawkesNest entanglement suite. Panel (a) reports temporal CRPS across entanglement levels. Panel (b) reports ground-truth intensity correlation for models with well-defined surface estimates. Shaded bands denote seed variability.
Figure 6: SEAHORSE software architecture. The CLI resolves schema-validated configuration objects, dataset adapters expose raw event splits, preset registries construct UnifiedSTPP models, and runner/evaluation layers write structured artifacts. The architecture separates configuration, data resolution, model construction, execution, evaluation, and artifact recording.
Figure 7: Additional learning-budget curves on the HawkesNest entanglement suite. Each panel …
Figure 8: Autoregressive rollout coherence on the HawkesNest entanglement suite. Curves show …
Figure 9: Median successful-run wall-clock training time on real datasets and HawkesNest Suite 3.
Original abstract

Spatiotemporal point processes (STPPs) model event data in continuous time and space, with applications in mobility, epidemiology, and public safety. Recent neural STPPs span expressive intensity models, conditional density models, continuous-time latent dynamics, normalizing-flow spatial decoders, and score-based generative mechanisms. Yet comparison remains fragile because implementations differ in preprocessing, coordinate normalization, splits, likelihood conventions, and evaluation protocols. We present SEAHORSE, a unified framework for reproducible STPP experimentation. SEAHORSE formalizes neural STPPs through a common encode-evolve-decode interface and trains, tunes, and evaluates every model family under a single executable benchmark protocol with raw-coordinate likelihood reporting. This enables fair comparisons but, more importantly, controlled diagnostic studies. We pair SEAHORSE with HawkesNest, a synthetic stress-test suite, and show that increasing event-pattern complexity exposes each family's inductive bias, degrading some models sharply and leaving others stable. Code: https://github.com/YahyaAalaila/seahorse.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces SEAHORSE, a unified benchmarking framework for neural spatiotemporal point processes that standardizes models from intensity-based, conditional-density, latent-dynamics, normalizing-flow, and score-based families through a shared encode-evolve-decode interface. All models are trained, tuned, and evaluated under one executable protocol that reports raw-coordinate likelihoods; the framework is paired with the HawkesNest synthetic stress-test suite, which increases event-pattern complexity to expose differential inductive biases across families.

Significance. If the encode-evolve-decode interface can embed the listed families while preserving their original inductive biases and likelihood semantics, SEAHORSE would provide a valuable contribution by enabling reproducible, apples-to-apples comparisons and controlled diagnostic experiments on STPP architectures. The provision of executable code and a single benchmark protocol is a concrete strength that directly addresses the reproducibility issues noted in the abstract.

major comments (2)
  1. [§3] §3 (encode-evolve-decode interface definition): The central claim that the interface accommodates score-based generators and normalizing-flow decoders without forcing architectural compromises or altering original likelihood semantics is asserted but not supported by any equivalence verification (e.g., no side-by-side likelihood computation or reparameterization ablation is reported for these families). This is load-bearing for the diagnostic results on HawkesNest.
  2. [§5] §5 (HawkesNest experiments): The reported degradation patterns are presented as exposing each family's inductive bias, yet the manuscript does not demonstrate that the synthetic data generation and coordinate handling in HawkesNest are identical to the raw-coordinate likelihood protocol used for the real benchmarks; any mismatch would make the diagnostics reflect the wrapper rather than the original model classes.
minor comments (2)
  1. [Abstract] Abstract: The sentence claiming the interface 'enables ... controlled diagnostic studies' would benefit from one concrete example of a diagnostic that becomes possible only under the unified protocol.
  2. Notation: The description of the evolve stage does not explicitly state whether the time and space components remain decoupled for all model families or whether certain families require joint evolution; a short clarifying sentence would improve readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive report. The two major comments identify important points where additional verification would strengthen the central claims regarding the encode-evolve-decode interface and the diagnostic validity of HawkesNest. We address each below and commit to revisions that directly incorporate the requested evidence.

Point-by-point responses
  1. Referee: [§3] §3 (encode-evolve-decode interface definition): The central claim that the interface accommodates score-based generators and normalizing-flow decoders without forcing architectural compromises or altering original likelihood semantics is asserted but not supported by any equivalence verification (e.g., no side-by-side likelihood computation or reparameterization ablation is reported for these families). This is load-bearing for the diagnostic results on HawkesNest.

    Authors: We agree that explicit verification is needed to substantiate the claim. The interface delegates likelihood evaluation to each model's native mechanism (score matching for score-based models; change-of-variables in raw coordinates for normalizing-flow decoders) without reparameterization of the density itself. To address the referee's concern, we will add an appendix containing side-by-side likelihood computations on a controlled synthetic dataset, comparing the wrapped implementations against the original standalone code for both families. This will confirm numerical equivalence within floating-point tolerance and will be referenced from §3. revision: yes

  2. Referee: [§5] §5 (HawkesNest experiments): The reported degradation patterns are presented as exposing each family's inductive bias, yet the manuscript does not demonstrate that the synthetic data generation and coordinate handling in HawkesNest are identical to the raw-coordinate likelihood protocol used for the real benchmarks; any mismatch would make the diagnostics reflect the wrapper rather than the original model classes.

    Authors: The HawkesNest generator produces events directly in the same raw spatiotemporal coordinate system used by the real-data loaders, and SEAHORSE applies an identical preprocessing and likelihood-evaluation pipeline to both synthetic and real datasets. Nevertheless, we acknowledge that this identity was stated rather than demonstrated. We will revise §5 to include an explicit subsection and accompanying code reference that verifies (i) identical coordinate ranges and units, (ii) the same raw-coordinate likelihood computation path, and (iii) the absence of any additional normalization steps unique to the synthetic suite. This will be supported by a small verification script released with the repository. revision: yes
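
The side-by-side comparison promised in the first response could take roughly the following shape. This is a sketch under stated assumptions, not the authors' verification script: `standalone_loglik` and `wrapped_loglik` are hypothetical stand-ins for an original implementation and its wrapped counterpart, here two algebraically equal ways of computing a homogeneous Poisson log-likelihood, and the rate, horizon, and tolerance are arbitrary choices.

```python
import math
import random

RATE, HORIZON = 2.5, 10.0  # hypothetical intensity and observation window


def standalone_loglik(times):
    """Stand-in for an original implementation: closed-form Poisson log-likelihood."""
    return len(times) * math.log(RATE) - RATE * HORIZON


def wrapped_loglik(times):
    """Stand-in for a wrapped implementation: same quantity, accumulated per event."""
    total, previous = 0.0, 0.0
    for t in times:
        total += math.log(RATE) - RATE * (t - previous)
        previous = t
    return total - RATE * (HORIZON - previous)  # survival after the last event


def max_abs_gap(loglik_a, loglik_b, sequences):
    """Largest absolute disagreement between two implementations."""
    return max(abs(loglik_a(s) - loglik_b(s)) for s in sequences)


# Controlled synthetic dataset: 100 sequences of sorted event times, fixed seed.
rng = random.Random(0)
sequences = [
    sorted(rng.uniform(0.0, HORIZON) for _ in range(rng.randint(1, 50)))
    for _ in range(100)
]

gap = max_abs_gap(standalone_loglik, wrapped_loglik, sequences)
assert gap < 1e-9, f"implementations disagree by {gap}"
print(f"max absolute gap: {gap:.3e}")
```

The two functions agree mathematically but follow different floating-point paths, which is the situation a wrapper-versus-original check has to tolerate. A gap well above rounding error would instead point to a change in likelihood semantics, such as a missing Jacobian or survival term.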

Circularity Check

0 steps flagged

No circularity: software framework and benchmark protocol with no derivation chain

full rationale

The paper presents SEAHORSE as a unified benchmarking framework that formalizes neural STPPs via an encode-evolve-decode interface and provides a shared training/evaluation protocol. No mathematical derivations, parameter fits, predictions, or uniqueness theorems are claimed. The interface is an engineering design choice for reproducibility rather than a result derived from prior equations or self-citations. No steps reduce by construction to inputs, and the contribution is self-contained as executable code and protocol rather than any fitted or renamed result.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper contributes an engineering artifact rather than a derivation; it rests on the domain assumption that a single interface can cover existing neural STPP families and that synthetic stress tests can expose meaningful differences in inductive bias.

axioms (1)
  • domain assumption: Neural STPPs from different families can be represented without loss of fidelity inside a shared encode-evolve-decode structure.
    Invoked when the abstract states that SEAHORSE formalizes every model family through this common interface.

pith-pipeline@v0.9.1-grok · 5717 in / 1433 out tokens · 28804 ms · 2026-07-02T15:58:47.445188+00:00 · methodology

