SOHET: Sequence Of Heterogeneous Events Transformer with Self-Supervised Pre-Training

Kees Jan de Vries; Mathijs de Jong; Mustafa Radha

arxiv: 2606.21356 · v1 · pith:PVSK4TW5new · submitted 2026-06-19 · 💻 cs.LG · cs.AI

Pith reviewed 2026-06-26 14:14 UTC · model grok-4.3

keywords heterogeneous event sequences, transformer architecture, self-supervised pre-training, causal sequence modeling, fraud detection, event type embeddings, bidirectional transformer, tabular encoders

The pith

SOHET processes sequences of mixed event types with type-specific encoders, a transformer, and causal self-supervised pre-training to improve prediction accuracy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents SOHET as a model for handling streams where events come from many different types, such as user actions or transactions. It builds a hierarchy that encodes each event type with its own tabular network, adds time and type embeddings, and passes the result through either a causal or bidirectional transformer. Three new self-supervised tasks are defined to pre-train the model in the causal case. On a large real-world fraud detection dataset the full model beats earlier approaches, and pre-training adds further gains while speeding convergence; the bidirectional version also reaches or exceeds prior best scores on most tasks in a public benchmark.

Core claim

SOHET is a hierarchical architecture combining event-type-specific tabular encoders with temporal and type embeddings, processed by a causal or bidirectional transformer. Three self-supervised pre-training objectives are introduced for the causal setting. On a proprietary large-scale real-world Booking.com fraud detection task with 17 event types, SOHET outperforms FlexTPP, NAPPT, and CIPPT by 5.8%. Pre-training yields an additional 2.6% gain and 2.4% faster convergence. On the EBES benchmark, bidirectional SOHET matches or exceeds the published best on 6 out of 8 tasks.

What carries the argument

The SOHET architecture, which uses event-type-specific tabular encoders, temporal and type embeddings, and a transformer backbone, together with three causal self-supervised pre-training objectives that produce improved representations for downstream tasks.
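A minimal sketch of the encoding hierarchy described above, in plain Python. Everything concrete here is an assumption made for illustration and is not taken from the paper: the event schema, the width `D_MODEL`, the use of a single linear map as a stand-in for each tabular encoder, the sinusoidal time embedding, and combining the three parts by summation.

```python
import math
import random

D_MODEL = 8  # hypothetical shared width; must be even for the time embedding


def make_linear(n_in, n_out, rng):
    """Random weight matrix with n_out rows and n_in columns."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]


def apply_linear(weights, x):
    """Matrix-vector product."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def time_embedding(t, dim):
    """Sinusoidal embedding of a timestamp (an assumed choice)."""
    emb = []
    for i in range(dim // 2):
        freq = 1.0 / (10000 ** (2 * i / dim))
        emb.extend([math.sin(t * freq), math.cos(t * freq)])
    return emb


class EventEncoder:
    """One encoder per event type, plus type and time embeddings."""

    def __init__(self, schema, seed=0):
        # schema maps event type -> number of numeric features (hypothetical).
        rng = random.Random(seed)
        self.type_encoders = {
            t: make_linear(n, D_MODEL, rng) for t, n in schema.items()
        }
        self.type_embeddings = {
            t: [rng.uniform(-0.1, 0.1) for _ in range(D_MODEL)] for t in schema
        }

    def encode(self, event):
        etype, timestamp, features = event
        tab = apply_linear(self.type_encoders[etype], features)
        typ = self.type_embeddings[etype]
        tim = time_embedding(timestamp, D_MODEL)
        # Summation is an assumption; the source only says the parts are combined.
        return [a + b + c for a, b, c in zip(tab, typ, tim)]


schema = {"login": 2, "payment": 4}
encoder = EventEncoder(schema)
sequence = [
    ("login", 0.0, [1.0, 0.0]),
    ("payment", 3.5, [120.0, 1.0, 0.0, 2.0]),
    ("login", 7.0, [0.0, 1.0]),
]
encoded = [encoder.encode(e) for e in sequence]
```

Events with different feature counts end up as vectors of the same width, which is what lets a single transformer consume the mixed sequence.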

If this is right

The same architecture can be applied to any domain that produces streams of distinct event types without changing the core design.
Pre-training reduces labeled data needs while also accelerating convergence on the target task.
Causal mode supports real-time prediction as events arrive; bidirectional mode supports offline analysis of complete sequences.
Performance on 17 event types suggests the type-specific encoders scale to other counts of heterogeneous event categories.
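The difference between the causal and bidirectional modes comes down to which positions each event may attend to. The sketch below shows the standard masking idea in plain Python; the function names are illustrative and nothing here is taken from the paper's implementation.

```python
import math


def attention_mask(n, causal):
    """mask[i][j] is True when position i may attend to position j."""
    return [[(j <= i) if causal else True for j in range(n)] for i in range(n)]


def masked_softmax(scores, allowed):
    """Softmax over the allowed positions only; blocked positions get weight 0."""
    exps = [math.exp(s) if ok else 0.0 for s, ok in zip(scores, allowed)]
    total = sum(exps)
    return [e / total for e in exps]


causal = attention_mask(4, causal=True)
bidirectional = attention_mask(4, causal=False)

# With equal scores, the second event in causal mode splits its attention
# between itself and the first event and ignores later events.
weights = masked_softmax([0.0, 0.0, 0.0, 0.0], causal[1])
```

In causal mode the representation of an event never depends on later events, which is the property that makes scoring possible as events arrive.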

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The approach could be tested on other high-volume event streams such as financial transactions or medical records to check whether the reported gains hold outside the original domain.
If the type-specific encoders prove critical, removing them and using a single shared encoder would be a direct ablation to quantify their contribution.
The pre-training objectives might transfer to related sequence tasks that also mix categorical and numerical features within each event.

Load-bearing premise

The three self-supervised pre-training objectives designed for the causal setting produce representations that meaningfully improve downstream task performance beyond what the base architecture achieves without pre-training.

What would settle it

Running the pre-trained SOHET and the non-pre-trained version on the Booking.com fraud detection task and finding that the pre-trained version shows no accuracy gain, or converges no faster, would falsify the claimed benefit of the pre-training objectives.
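One way such a comparison could be scored is by counting the training steps each run needs to reach a fixed validation score. This is only a sketch: the source does not say how convergence speed was measured, and the curves and threshold below are hypothetical values, not results from the paper.

```python
def steps_to_reach(curve, threshold):
    """First 1-based step at which the validation score meets the threshold."""
    for step, value in enumerate(curve, start=1):
        if value >= threshold:
            return step
    return None  # the run never reached the threshold


def relative_speedup(baseline_steps, pretrained_steps):
    """Fraction of steps saved by the pre-trained run."""
    return (baseline_steps - pretrained_steps) / baseline_steps


# Hypothetical validation curves, one score per evaluation step.
scratch_curve = [0.60, 0.68, 0.72, 0.75, 0.78, 0.80]
pretrained_curve = [0.66, 0.73, 0.77, 0.80, 0.81, 0.81]

scratch_steps = steps_to_reach(scratch_curve, 0.80)
pretrained_steps = steps_to_reach(pretrained_curve, 0.80)
speedup = relative_speedup(scratch_steps, pretrained_steps)
```

A speedup at or below zero on the real task, together with no gain in the final score, would be the falsifying outcome described above.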

Figures

Figures reproduced from arXiv: 2606.21356 by Kees Jan de Vries, Mathijs de Jong, Mustafa Radha.

**Figure 1.** SOHET architecture. (a) Event-specific tabular encoders transform heterogeneous events into event encodings. (b) Event encodings are processed by a position-aware sequence encoder. The resulting event representations can be used for various tasks. (c) Our novel self-supervised objectives include next event type prediction, next time delta prediction, and contrastive next-row embedding learning. (d) Per-eve…
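The three objectives named in the caption can be illustrated with standard loss functions. The sketch below is an assumption-laden stand-in, not the paper's formulation: it uses cross-entropy for the next event type, squared error on a log-transformed time delta, and an InfoNCE-style loss for the contrastive term. The function names, the log transform, and the temperature value are all hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def next_type_loss(logits, target_index):
    """Cross-entropy for predicting the type of the next event."""
    return -math.log(softmax(logits)[target_index])


def next_delta_loss(predicted_log_delta, true_delta):
    """Squared error on log(1 + delta); the log transform is an assumption."""
    return (predicted_log_delta - math.log1p(true_delta)) ** 2


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def contrastive_loss(prediction, candidates, positive_index, temperature=0.1):
    """InfoNCE-style loss: rank the true next-row embedding above the others."""
    logits = [dot(prediction, c) / temperature for c in candidates]
    return -math.log(softmax(logits)[positive_index])


# Hypothetical values for a single position in a sequence.
candidates = [[0.9, 0.1], [-0.5, 0.8], [0.0, -1.0]]
total_loss = (
    next_type_loss([2.0, 0.1, -1.0], 0)
    + next_delta_loss(1.2, 2.5)
    + contrastive_loss([1.0, 0.0], candidates, 0)
)
```

Summing the three terms with equal weight is itself an assumption; a real training setup would likely weight them.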

Original abstract

Many machine learning applications rely on heterogeneous event streams to make predictions, either causally as events arrive or bidirectionally over complete sequences. We propose SOHET (Sequence Of Heterogeneous Events Transformer), a hierarchical architecture combining event-type-specific tabular encoders with temporal and type embeddings, processed by a causal or bidirectional transformer. We introduce three self-supervised pre-training objectives for the causal setting. On a proprietary large-scale real-world Booking.com fraud detection task with 17 event types, SOHET outperforms FlexTPP, NAPPT, and CIPPT by 5.8%. Pre-training yields an additional 2.6% gain and 2.4% faster convergence. On the EBES benchmark, bidirectional SOHET matches or exceeds the published best on 6 out of 8 tasks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

SOHET adds type-specific tabular encoders and three causal pre-training objectives to a transformer backbone, with reported gains on fraud detection and competitive EBES results.

The paper's main point is a hierarchical transformer for streams of mixed event types. It uses separate encoders per event type, adds temporal and type embeddings, then runs a causal or bidirectional transformer. Three new self-supervised objectives are introduced for the causal pre-training case.

The design choice to handle heterogeneity at the encoder level rather than forcing a single input format is straightforward and practical. The pre-training tasks are tailored to the causal setting and the reported numbers show they add both accuracy and faster convergence on the Booking.com task. On EBES the bidirectional version matches or beats the prior best on six of eight tasks, which is the part that can be checked by others.

The strongest evidence is the EBES comparison because it uses a public benchmark. The 5.8% lift on the proprietary fraud task is larger but harder to verify independently. The stress-test note indicates the ablations and baseline comparisons hold up internally, so there is no obvious inconsistency in how the gains are measured.

One limitation is the heavy reliance on the private dataset for the headline result. Readers cannot rerun the exact experiment or test sensitivity to data specifics. The abstract omits error bars and hyperparameter details, though the full paper apparently supplies ablations that support the claims. These are not fatal but they limit how far the numbers can be taken without further checks.

The work is aimed at applied groups that already model heterogeneous event sequences, such as fraud or user behavior in e-commerce. Anyone needing a drop-in architecture for mixed-type temporal data will find the encoder split and the pre-training objectives worth trying. It is solid enough on its own terms to merit peer review rather than a desk reject.

Referee Report

1 major / 2 minor

Summary. The paper proposes SOHET, a hierarchical transformer architecture for heterogeneous event sequences that combines event-type-specific tabular encoders with temporal and type embeddings, processed by either a causal or bidirectional transformer. It introduces three self-supervised pre-training objectives tailored to the causal setting. Empirical claims include a 5.8% outperformance over FlexTPP, NAPPT, and CIPPT on a proprietary Booking.com fraud detection task with 17 event types, plus an additional 2.6% gain and 2.4% faster convergence from pre-training; bidirectional SOHET also matches or exceeds the published best on 6 of 8 EBES benchmark tasks.

Significance. If the empirical gains prove robust, the work offers a practical advance in modeling irregular heterogeneous event streams for applications such as fraud detection. The combination of architecture and causal self-supervised objectives, together with evaluation on both a large proprietary dataset and the public EBES benchmark, provides a concrete contribution to temporal point process and event-sequence modeling.

major comments (1)

[Results] Results section (and abstract): the reported improvements of 5.8% and 2.6% are given as single point estimates without error bars, standard deviations across runs, or statistical significance tests against the baselines; this directly affects the load-bearing claim that SOHET and its pre-training objectives outperform the cited methods.

minor comments (2)

[Methods] The description of the three self-supervised pre-training objectives would benefit from explicit equations or pseudocode to clarify how they are formulated for the causal setting and how they differ from standard masked or next-event prediction.
[Experiments] Hyperparameter details, baseline re-implementation notes, and model-size comparisons are absent from the reported experiments, limiting reproducibility of the EBES and Booking.com results.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback and the recommendation of minor revision. We address the single major comment below regarding the statistical robustness of the reported results.

Referee: [Results] Results section (and abstract): the reported improvements of 5.8% and 2.6% are given as single point estimates without error bars, standard deviations across runs, or statistical significance tests against the baselines; this directly affects the load-bearing claim that SOHET and its pre-training objectives outperform the cited methods.

Authors: We agree this is a valid concern that affects the strength of the empirical claims. In the revised version we will rerun the Booking.com experiments across multiple random seeds (minimum 5 runs) and report mean performance with standard deviations for the key metrics. We will also add statistical significance tests (e.g., paired t-tests) against the baselines FlexTPP, NAPPT, and CIPPT. The abstract and results section will be updated accordingly to reflect these more robust statistics while preserving the original point estimates for direct comparison.

Revision: yes
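The reporting promised in the rebuttal (mean and standard deviation over seeds, plus a paired t-test against a baseline) can be sketched with the Python standard library. The per-seed scores below are hypothetical placeholders, not results from the paper, and the helper name is illustrative.

```python
import math
import statistics


def paired_t_statistic(a, b):
    """t statistic for paired samples, with len(a) - 1 degrees of freedom."""
    diffs = [x - y for x, y in zip(a, b)]
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation
    return mean_diff / (sd_diff / math.sqrt(len(diffs)))


# Hypothetical scores for five seeds; each index is the same seed in both lists.
model_scores = [0.81, 0.83, 0.82, 0.84, 0.80]
baseline_scores = [0.78, 0.79, 0.80, 0.79, 0.77]

model_mean = statistics.mean(model_scores)
model_std = statistics.stdev(model_scores)
t_stat = paired_t_statistic(model_scores, baseline_scores)

# With five seeds there are 4 degrees of freedom; the two-sided 5% critical
# value of the t distribution is about 2.776.
significant = abs(t_stat) > 2.776
```

Pairing by seed only makes sense if both models share the same data splits and seeds per run; otherwise an unpaired test would be the appropriate choice.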

Circularity Check

0 steps flagged

No significant circularity

The paper presents an empirical architecture (SOHET) with three self-supervised pre-training objectives and reports performance gains on a proprietary fraud task and the EBES benchmark against external baselines. No derivation chain, equations, or first-principles claims are present that reduce any reported prediction or result to quantities defined by the paper's own fitted parameters, self-citations, or ansatzes by construction. All central claims rest on direct experimental comparisons and ablations that remain falsifiable against independent benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available; no concrete free parameters, axioms, or invented entities are identifiable from the provided text. The model description relies on standard transformer components whose details are not expanded.

