Recognition: 2 theorem links
SurF: A Generative Model for Multivariate Irregular Time Series Forecasting
Pith reviewed 2026-05-15 05:16 UTC · model grok-4.3
The pith
SurF turns irregular multivariate event sequences into i.i.d. unit-rate exponential noise through a learnable bijection based on the Time Rescaling Theorem, allowing one generative model to train across heterogeneous datasets.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
SurF uses the Time Rescaling Theorem to define a learnable bijection between heterogeneous event sequences and i.i.d. unit-rate exponential noise. It supplies three scalable parameterizations of the cumulative intensity and a Transformer encoder for multi-dataset pretraining. The result is a single model that generates forecasts by sampling noise and inverting the map. It achieves the best reported time RMSE on Earthquake, Retweet, and Taobao while remaining within trial noise of the strongest baseline on the other three; under leave-one-out evaluation, the held-out checkpoint surpasses every classical and neural-autoregressive baseline on five of the six datasets.
What carries the argument
The learnable bijection from the Time Rescaling Theorem, which converts any irregular multivariate event sequence into i.i.d. unit-rate exponential noise and permits exact inversion for sampling.
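A minimal sketch of the rescaling itself, using a known constant intensity rather than the paper's learned model: for a rate-λ Poisson process the cumulative intensity is Λ(t) = λt, and the Time Rescaling Theorem says the rescaled gaps Λ(tᵢ) − Λ(tᵢ₋₁) are i.i.d. Exp(1).

```python
# Toy illustration of the Time Rescaling Theorem (assumed constant
# intensity; the paper's Lambda is learned, not closed-form like this).
import random

random.seed(0)

LAM = 3.0  # illustrative constant intensity

def cumulative_intensity(t: float) -> float:
    """Lambda(t) for a homogeneous Poisson process of rate LAM."""
    return LAM * t

# Simulate event times by accumulating Exp(LAM) inter-event gaps.
times, t = [], 0.0
for _ in range(50_000):
    t += random.expovariate(LAM)
    times.append(t)

# Rescale: tau_i = Lambda(t_i) - Lambda(t_{i-1}) should be Exp(1).
rescaled = [cumulative_intensity(b) - cumulative_intensity(a)
            for a, b in zip([0.0] + times[:-1], times)]

print(f"mean of rescaled gaps: {sum(rescaled) / len(rescaled):.3f}"
      " (Exp(1) has mean 1)")
```

With the learned Λθ in place of the closed form above, the same computation is what maps arbitrary event streams to a shared noise space.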
If this is right
- A single pretrained checkpoint can be applied directly to new event-stream datasets without full retraining.
- Forecast generation reduces to drawing from standard exponential noise and applying the learned inverse map.
- The approach sidesteps window-level numerical quadrature required by many neural temporal point process models.
- Pretraining on diverse streams such as earthquakes, retweets, and purchases becomes feasible with one architecture.
- Long sequences remain tractable because the intensity parameterizations avoid quadratic scaling with event count.
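The second bullet can be sketched end to end. The cumulative intensity Λ(t) = t² below is a hypothetical stand-in for the paper's learned parameterizations, chosen only because its inverse is explicit:

```python
# Forecast generation by inversion: draw unit-rate exponential noise,
# accumulate it in rescaled time, and map back through Lambda^{-1}.
# Lambda(t) = t**2 is an assumed toy intensity, not the paper's model.
import math
import random

random.seed(1)

def Lambda(t: float) -> float:       # toy cumulative intensity
    return t * t

def Lambda_inv(u: float) -> float:   # its exact inverse on [0, inf)
    return math.sqrt(u)

def sample_sequence(n_events: int, t0: float = 0.0) -> list[float]:
    """Generate event times by inverting the rescaling."""
    times, u = [], Lambda(t0)
    for _ in range(n_events):
        u += random.expovariate(1.0)  # unit-rate exponential gap
        times.append(Lambda_inv(u))   # back to the original time axis
    return times

seq = sample_sequence(5)
print(seq)  # strictly increasing event times
```

Because Λ is strictly increasing, the generated times are automatically ordered; no rejection or quadrature step is needed.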
Where Pith is reading between the lines
- If the bijection generalizes, the same model could support zero-shot forecasting on entirely unseen event types after pretraining.
- Richer parameterizations of the cumulative intensity might extend the method to event streams with strong higher-order interactions.
- The reduction of forecasting to learning an invertible rescaling suggests similar bijections could unify other irregular modalities such as point clouds or sparse sensor readings.
Load-bearing premise
The Time Rescaling Theorem can be realized as an effective learnable bijection between arbitrary event sequences and unit-rate exponential noise without large approximation errors that would force per-dataset retuning.
What would settle it
Training the model on a held-out dataset under the strict leave-one-out protocol and finding that it requires heavy dataset-specific adjustments or fails to beat simple autoregressive baselines on time RMSE would falsify the claim of a generalizable bijection.
Original abstract
Irregularly sampled multivariate event streams remain a stubbornly difficult modality for generative modeling: tokenization-based approaches break down when inter-event intervals vary by orders of magnitude, and neural temporal point processes are bottlenecked by window-level numerical quadrature. We (i) propose SurF, a generative model that uses the Time Rescaling Theorem (TRT) as a learnable bijection between event sequences and i.i.d. unit-rate exponential noise, enabling a single model to be trained across heterogeneous event-stream datasets; (ii) introduce three efficient parameterizations of the cumulative intensity that scale to long sequences; and (iii) present a Transformer-based encoder for multi-dataset pretraining. On six real-world benchmarks, SurF achieves the best reported time RMSE on Earthquake, Retweet, and Taobao, and is within trial-level noise of the strongest specialist on the remaining three. Under a strict leave-one-out protocol, the held-out checkpoint beats every classical and neural-autoregressive baseline on 5/6 datasets and beats every baseline on Amazon and Earthquake, an initial step toward foundation models over asynchronous event streams.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes SurF, a generative model for irregularly sampled multivariate event streams that employs the Time Rescaling Theorem as a learnable bijection mapping heterogeneous sequences to i.i.d. unit-rate exponential noise via three efficient parameterizations of the cumulative intensity function, together with a Transformer encoder enabling multi-dataset pretraining. It reports best-in-class time RMSE on Earthquake, Retweet, and Taobao among six benchmarks, with leave-one-out superiority on five datasets.
Significance. If the TRT bijection holds with low approximation error across datasets, the approach would provide a scalable mechanism for cross-dataset pretraining on asynchronous event data, overcoming limitations of tokenization for wide-ranging inter-event times and quadrature costs in neural TPPs, and constituting a meaningful step toward foundation models for irregular time series.
Major comments (2)
- [Abstract] Abstract: the headline claims of best-reported time RMSE on three datasets and leave-one-out superiority on five rest on external benchmark comparisons, yet no error bars, trial counts, exact baseline re-implementations, or data-split protocols are supplied, leaving the statistical reliability of the superiority assertions difficult to assess.
- [Method (TRT and cumulative-intensity parameterization)] Method (TRT and cumulative-intensity parameterization): the central mechanism asserts that the three efficient parameterizations realize a faithful, dataset-agnostic learnable bijection under the Time Rescaling Theorem; however, the manuscript supplies no quantitative bound on rescaling error, invertibility residual, or approximation accuracy for sequences whose inter-event intervals span orders of magnitude, which directly bears on whether observed gains derive from the claimed mechanism or from per-dataset fitting.
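One concrete form the requested invertibility diagnostic could take, sketched with a stand-in monotone cumulative intensity rather than the paper's model: invert Λ numerically and measure the worst-case round-trip error over a grid.

```python
# Invertibility-residual check: max_t |Lambda_inv(Lambda(t)) - t|.
# Lambda below is an assumed strictly increasing stand-in intensity.
import math

def Lambda(t: float) -> float:
    return 0.5 * t + 0.2 * (1.0 - math.exp(-3.0 * t))

def Lambda_inv(u: float, lo: float = 0.0, hi: float = 1e6) -> float:
    """Bisection inverse; valid because Lambda is strictly increasing."""
    for _ in range(80):  # halve the bracket until below float resolution
        mid = 0.5 * (lo + hi)
        if Lambda(mid) < u:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

grid = [i * 0.1 for i in range(1, 200)]  # t in (0, 20)
residual = max(abs(Lambda_inv(Lambda(t)) - t) for t in grid)
print(f"max invertibility residual: {residual:.2e}")
```

Reporting this residual per dataset, over grids whose inter-event intervals span several orders of magnitude, would directly address the comment.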
Minor comments (1)
- [Abstract] Abstract: the phrase 'within trial-level noise' is used without defining the trial protocol or noise metric, which should be clarified for reproducibility.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address each major comment below and commit to revisions that strengthen the statistical reporting and empirical validation of the core mechanism.
Point-by-point responses
-
Referee: [Abstract] Abstract: the headline claims of best-reported time RMSE on three datasets and leave-one-out superiority on five rest on external benchmark comparisons, yet no error bars, trial counts, exact baseline re-implementations, or data-split protocols are supplied, leaving the statistical reliability of the superiority assertions difficult to assess.
Authors: We agree that the absence of error bars, trial counts, and protocol details limits assessment of statistical reliability. In the revised manuscript we will report results from five independent runs with different random seeds, include standard deviations as error bars in all tables and the abstract, explicitly document the baseline re-implementations (including hyperparameter choices and code references), and detail the exact train/validation/test split ratios and preprocessing steps for each of the six datasets. revision: yes
-
Referee: [Method (TRT and cumulative-intensity parameterization)] Method (TRT and cumulative-intensity parameterization): the central mechanism asserts that the three efficient parameterizations realize a faithful, dataset-agnostic learnable bijection under the Time Rescaling Theorem; however, the manuscript supplies no quantitative bound on rescaling error, invertibility residual, or approximation accuracy for sequences whose inter-event intervals span orders of magnitude, which directly bears on whether observed gains derive from the claimed mechanism or from per-dataset fitting.
Authors: The Time Rescaling Theorem supplies the theoretical invertibility guarantee, and the three parameterizations are constructed to be exactly invertible by design. The current manuscript does not include explicit quantitative bounds on rescaling error. To address this concern we will add a new subsection in the experiments that reports, for each dataset, the empirical distribution of transformed inter-event times (via Kolmogorov-Smirnov statistic against unit exponential), the maximum residual in the learned cumulative intensity, and the approximation error on sequences whose inter-event intervals span at least three orders of magnitude. These diagnostics will be computed on held-out data to separate mechanism fidelity from per-dataset fitting. revision: yes
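The promised Kolmogorov-Smirnov diagnostic can be sketched in a few lines. The samples below are synthetic Exp(1) draws standing in for the model's transformed inter-event times; on real held-out data the same statistic would quantify how far the learned rescaling is from unit-exponential.

```python
# KS statistic of a sample against the unit-exponential CDF 1 - e^{-x}.
import math
import random

random.seed(2)

def ks_statistic_exp1(samples: list[float]) -> float:
    """sup_x |F_n(x) - (1 - e^{-x})|, evaluated at the sample points."""
    xs = sorted(samples)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        cdf = 1.0 - math.exp(-x)
        # empirical CDF jumps from i/n to (i+1)/n at x
        d = max(d, abs((i + 1) / n - cdf), abs(i / n - cdf))
    return d

samples = [random.expovariate(1.0) for _ in range(10_000)]
print(f"KS statistic vs Exp(1): {ks_statistic_exp1(samples):.4f}")
```

For 10,000 genuinely Exp(1) samples the statistic is small (roughly 1/√n); a large value on transformed real data would indicate rescaling error.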
Circularity Check
No circularity: model uses standard TRT with new parameterizations; performance claims rest on external benchmarks
Full rationale
The paper defines SurF via the Time Rescaling Theorem as a learnable bijection, introduces three new efficient cumulative intensity parameterizations, and employs a Transformer encoder for multi-dataset pretraining. All performance results (best RMSE on three datasets, leave-one-out superiority on five of six) are obtained by direct evaluation against external baselines on real-world data, with no equations reducing a claimed prediction to a fitted parameter by construction, no self-definitional loops, and no load-bearing self-citations. The derivation chain is self-contained against independent benchmarks.
Axiom & Free-Parameter Ledger
free parameters (1)
- cumulative intensity parameterization coefficients
axioms (1)
- domain assumption Time Rescaling Theorem provides an exact bijection between arbitrary event sequences and i.i.d. unit-rate exponential random variables
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.lean (washburn_uniqueness_aczel) · unclear
Unclear: relation between the paper passage and the cited Recognition theorem.
We (i) propose SurF, a generative model that uses the Time Rescaling Theorem (TRT) as a learnable bijection between event sequences and i.i.d. unit-rate exponential noise... three efficient parameterizations of the cumulative intensity
-
IndisputableMonolith/Foundation/ArithmeticFromLogic.lean (LogicNat embedding and orbit structure) · unclear
Unclear: relation between the paper passage and the cited Recognition theorem.
Theorem 2 (Reverse Rescaling and Bijectivity)... Λ∗ : R+ → R+ is a C1 bijection with C1 inverse
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
Abhimanyu Das, Weihao Kong, Andrew Leach, Shaan Mathur, Rajat Sen, and Rose Yu. A decoder-only foundation model for time-series forecasting. arXiv preprint arXiv:2310.10688.
-
[2]
Abdul Fatir Ansari, Lorenzo Stella, Caner Turkmen, Xiyuan Zhang, Pedro Mercado, Huibin Shen, Oleksandr Shchur, Syama Sundar Rangapuram, Sebastian Pineda Arango, Shubham Kapoor, et al. Chronos: Learning the language of time series. arXiv preprint arXiv:2403.07815.
-
[3]
Kashif Rasul, Arjun Ashok, Andrew Robert Williams, Arian Khorasani, George Adamopoulos, Rishika Bhagwatkar, Marin Biloš, Hena Ghonia, Nadhir Vincent Hassen, Anderson Schneider, et al. Lag-Llama: Towards foundation models for probabilistic time series forecasting. arXiv preprint arXiv:2310.08278.
-
[4]
Satya Narayan Shukla and Benjamin M Marlin. Multi-time attention networks for irregularly sampled time series. arXiv preprint arXiv:2101.10318.
-
[5]
Takahiro Omi, Naonori Ueda, and Kazuyuki Aihara. Fully neural network based model for general temporal point processes. In Advances in Neural Information Processing Systems (NeurIPS), 2019. URL https://arxiv.org/abs/1905.09690.
Oleksandr Shchur, Marin Biloš, and Stephan Günnemann. Intensity-free learning of temporal point processes. arXiv preprint arXiv:1909.12127.
-
[6]
Hongyuan Mei and Jason Eisner. The neural Hawkes process: A neurally self-modulating multivariate point process. URL https://arxiv.org/abs/1612.09328.
Simiao Zuo, Haoming Jiang, Zichong Li, Tuo Zhao, and Hongyuan Zha. Transformer Hawkes process. In International Conference on Machine Learning, pages 11692–11702. PMLR.
-
[7]
Qiang Zhang, Aldo Lipani, Omer Kirnap, and Emine Yilmaz. Self-attentive Hawkes process. In Proceedings of the International Conference on Machine Learning (ICML). URL https://arxiv.org/pdf/2002.09291.pdf.
-
[8]
URL https://arxiv.org/abs/2210.01753. Chris Whong. FOILing NYC's taxi trip data.
-
[9]
Ricky TQ Chen, Brandon Amos, and Maximilian Nickel. Neural spatio-temporal point processes. In International Conference on Learning Representations. URL https://arxiv.org/abs/2201.00044.
-
[10]
Ankitkumar Joshi and Milos Hauskrecht. Still competitive: Revisiting recurrent models for irregular time series prediction. arXiv preprint arXiv:2510.16161.
-
[11]
Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877–1901.
-
[12]
Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971.
-
[13]
Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. DINOv2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193.
-
[14]
Oleksandr Shchur, Ali Caner Türkmen, Tim Januschowski, and Stephan Günnemann. Neural temporal point processes: A review. arXiv preprint arXiv:2104.03528.
-
[15]
Siqiao Xue, Xiaoming Shi, Zhixuan Chu, Yan Wang, Hongyan Hao, Fan Zhou, Caigao Jiang, Chen Pan, James Y Zhang, Qingsong Wen, et al. EasyTPP: Towards open benchmarking temporal point processes. arXiv preprint arXiv:2307.08097.
-
[16]
Therefore J = diag(λθ(t₁ | Ht₁), …, λθ(t_N | Ht_N)) (18), and the determinant remains det J = ∏_{i=1}^{N} λθ(t_i | Ht_i). Remark 2. The diagonal structure is a convenience of our architectural choice, not a requirement for correctness. Any encoder producing h_{i−1} with continuous dependence on all past event positions would yield a lower-triangular (but not diagon...
-
[17]
“Pre” denotes the unified pretrained checkpoint evaluated zero-shot; “Fine...
Variant comparison. MOE attains the best NLL on four of six datasets (Taxi, StackOverflow, Earthquake, Taobao), consistent with its exponential-basis density being well matched to the fast-decaying, near-Markovian dynamics dominant in these benchmarks. GLQ wins on the two datasets with the heaviest right tails: Amazon (Pre, −1.61) and Retweet (Fine, +2.55)...
-
[18]
...error does not enter the gradient, while MOE (Pre) attains the best NLL, because RMSE measures only the conditional mean of τ whereas NLL measures the full conditional density. Reporting both is therefore informative, and we include NLL primarily as a density-quality complement to the predictive evaluation in the main text. Comparison with prior work. We do not include baseline NLLs in Ta...
-
[19]
...and the full results of a 100-trial random hyperparameter sweep over SurF-GLQ (Tables 7–10). Figure 6: Gradient analysis comparing SurF and RNN-based TPPs. SurF gradient evolution at firing rates {4, 8, 16, 20} events/sec shows consistent decreasing trends, with gradient norms converging faster; overall statistics show SurF's higher gradient magnitudes and...
-
[20]
...is well-conditioned. H Additional Theoretical Material. H.1 Joint Modeling of Times and Marks. For K event types, SurF factorizes the joint conditional intensity as λ∗(t, k | Ht) = λ∗(t | Ht) · p(k | t, Ht), with ∑_{k=1}^{K} p(k | t, Ht) = 1 (51), where λ∗(t | Ht) is the marginal ground intensity (modeled by Λθ) and p(k | t, Ht) is the conditional mark distribution (modeled b...