NEST: Nested Event Stream Transformer for Sequences of Multisets
Pith reviewed 2026-05-16 09:18 UTC · model grok-4.3
The pith
Preserving the original hierarchy of event streams as sequences of multisets improves both computational efficiency and representation quality in foundation models.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper claims that by retaining the nested structure of sequences of multisets in the transformer architecture, one obtains a useful inductive bias. This bias reduces the quadratic cost of attention by limiting cross-set interactions and yields higher-quality set representations without post-training aggregation. The resulting NEST model, trained with Masked Set Modeling, better captures the temporal dynamics of hierarchical event data.
What carries the argument
The nested transformer architecture in NEST, which applies attention separately within each multiset and across the sequence of set-level representations, combined with the Masked Set Modeling objective.
If this is right
- Computational efficiency increases because attention is computed only within sets and between set representations rather than over the entire flattened sequence (see the sketch after this list).
- Representation quality at the set level improves, leading to stronger performance on downstream tasks without relying on heuristic pooling methods.
- The model learns to respect the natural grouping of events, avoiding spurious correlations from artificial ordering within multisets.
- Pretraining becomes more efficient while still modeling the overall sequence dynamics.
- Real-world event stream dynamics are captured more faithfully in domains like healthcare.
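To make the efficiency argument concrete, here is a minimal PyTorch sketch of the two-level attention pattern described above. This is a reconstruction under stated assumptions, not the authors' implementation: the class name NestedBlock, the mean pooling, and the fixed-size padding convention are all hypothetical choices for illustration.

```python
# Minimal sketch of hierarchy-preserving attention: within-set attention,
# pooling to one embedding per multiset, then across-set attention.
import torch
import torch.nn as nn

class NestedBlock(nn.Module):
    def __init__(self, d_model=64, n_heads=4):
        super().__init__()
        # Within-set attention: multisets are unordered, so no positional
        # encoding is applied at this level.
        self.set_encoder = nn.TransformerEncoderLayer(
            d_model, n_heads, batch_first=True)
        # Across-set attention: operates on one pooled embedding per multiset.
        self.seq_encoder = nn.TransformerEncoderLayer(
            d_model, n_heads, batch_first=True)

    def forward(self, x):
        # x: (num_sets, set_size, d_model) -- one sequence of multisets,
        # padded so every set has the same size for simplicity.
        S, m, d = x.shape
        # 1) Attention only inside each multiset: cost O(S * m^2),
        #    versus O((S * m)^2) for a flattened sequence.
        within = self.set_encoder(x)                  # (S, m, d)
        # 2) Pool each set to a single embedding (mean is one simple choice).
        set_repr = within.mean(dim=1).unsqueeze(0)    # (1, S, d)
        # 3) Attention across set representations: cost O(S^2).
        return self.seq_encoder(set_repr).squeeze(0)  # (S, d)

block = NestedBlock()
out = block(torch.randn(10, 8, 64))  # 10 encounters, 8 events each
print(out.shape)                     # torch.Size([10, 64])
```

For 10 sets of 8 events, the factorized pattern scores roughly 10 × 8² + 10² = 740 attention pairs against 80² = 6,400 for the flattened sequence, which is the source of the claimed efficiency gain.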
Where Pith is reading between the lines
- The nested design may extend naturally to other data with similar hierarchies, such as sequences of documents or batches of transactions.
- Variable set sizes could be handled more gracefully without padding issues common in flattened approaches.
- If within-set timing information becomes available, the architecture could incorporate it without major redesign.
- This inductive bias might reduce the data requirements for effective pretraining on hierarchical streams.
Load-bearing premise
The assumption that the original multiset grouping reflects true co-occurrence without spurious internal order, and that flattening inevitably introduces misleading within-group dependencies.
What would settle it
Train both NEST and a flattened baseline on synthetic data where events within multisets are independent and randomly ordered, then compare their downstream task accuracy and training FLOPs to see if the hierarchy-preserving model still wins.
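A hedged sketch of one way to build that synthetic testbed follows: events inside each multiset are drawn independently, so any within-set order a flattened model appears to exploit is spurious by construction. This generator is our own construction, not from the paper.

```python
import random

def synth_multiset_sequence(vocab_size=100, num_sets=10, set_size=5, seed=0):
    rng = random.Random(seed)
    # Each multiset: i.i.d. event draws; no true within-set dependencies,
    # and the sampling order carries no information by construction.
    return [[rng.randrange(vocab_size) for _ in range(set_size)]
            for _ in range(num_sets)]

seq = synth_multiset_sequence()
flattened = [e for s in seq for e in s]  # input for the flattened baseline
```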
read the original abstract
Event stream data often exhibit hierarchical structure in which multiple events co-occur, resulting in a sequence of multisets (i.e., bags of events). In electronic health records (EHRs), for example, medical events are grouped into a sequence of clinical encounters with well-defined temporal structure, but the order and timing of events within each encounter may be unknown or unreliable. Most existing foundation models (FMs) for event stream data flatten this hierarchy into a one-dimensional sequence, leading to (i) computational inefficiency associated with dense attention and learning spurious within-set relationships, and (ii) lower-quality set-level representations from heuristic post-training pooling for downstream tasks. Here, we show that preserving the original hierarchy in the FM architecture provides a useful inductive bias that improves both computational efficiency and representation quality. We then introduce Nested Event Stream Transformer (NEST), a FM for event streams comprised of sequences of multisets. Building on this architecture, we formulate Masked Set Modeling (MSM), an efficient paradigm that promotes improved set-level representation learning. Experiments on real-world multiset sequence data show that NEST captures real-world dynamics while improving both pretraining efficiency and downstream performance.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that event stream data often consist of sequences of multisets with hierarchical structure (e.g., clinical encounters in EHRs), and that flattening this hierarchy into 1D sequences in existing foundation models causes computational inefficiency from dense attention and spurious within-set relationships, plus lower-quality set-level representations. It introduces the Nested Event Stream Transformer (NEST) architecture to preserve the hierarchy as an inductive bias, along with Masked Set Modeling (MSM) for improved set-level learning, and asserts that experiments on real-world data demonstrate gains in pretraining efficiency and downstream performance.
Significance. If the central claim holds—that hierarchy preservation supplies a load-bearing inductive bias independent of the MSM objective—this would provide a principled architectural alternative for foundation models on grouped event data, with potential efficiency advantages in attention computation and better downstream set representations in domains such as healthcare.
major comments (2)
- [Abstract] The claim that NEST 'improves both pretraining efficiency and downstream performance' on real-world data is backed by no quantitative metrics, error bars, ablation details, or experimental protocol, preventing verification of the claimed gains or isolation of the hierarchy-preserving component.
- [Method] The method section (NEST architecture and MSM formulation) reports no ablation that trains a flattened transformer baseline with the identical Masked Set Modeling objective while holding parameter count, masking strategy, and optimization fixed; without this control, it remains unclear whether the observed benefits derive from hierarchy preservation rather than from MSM itself.
minor comments (1)
- [Abstract] The phrase 'real-world multiset sequence data' appears without naming the specific datasets or benchmarks; these should be stated explicitly for reproducibility.
Simulated Author's Rebuttal
Thank you for the constructive feedback. We address each major comment below and will revise the manuscript accordingly to improve clarity and strengthen the experimental controls.
read point-by-point responses
- Referee: [Abstract] The claim that NEST 'improves both pretraining efficiency and downstream performance' on real-world data is backed by no quantitative metrics, error bars, ablation details, or experimental protocol, preventing verification of the claimed gains or isolation of the hierarchy-preserving component.
  Authors: We agree that the abstract would be strengthened by quantitative support. In the revised version we will add concise statements of the key efficiency gains (e.g., attention FLOPs or wall-clock time) and downstream performance deltas with error bars, together with a one-sentence description of the evaluation protocol. These numbers are already reported in the experiments section; we will simply surface them in the abstract. Revision: yes
- Referee: [Method] The method section (NEST architecture and MSM formulation) reports no ablation that trains a flattened transformer baseline with the identical Masked Set Modeling objective while holding parameter count, masking strategy, and optimization fixed; without this control, it remains unclear whether the observed benefits derive from hierarchy preservation rather than from MSM itself.
  Authors: This is a fair criticism. Our existing baselines compare NEST to standard flattened transformers, but we did not run the precise control that applies the identical MSM objective to a flattened architecture under matched parameter count, masking ratio, and optimizer settings. We will add this ablation in the revision; the new experiment will be described in the method section and reported alongside the existing results so that the contribution of the nested inductive bias can be isolated from the MSM objective (a sketch of such a shared objective follows this list). Revision: yes
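To illustrate what "identical objective across backbones" could mean here, the sketch below scores a masked multiset with a multinomial log-likelihood, consistent with the appendix-style derivation in which the loss equals -sum_v n_v log pi_theta(v) up to an additive constant. This is our reading of MSM for the ablation, not the authors' released code; masked_set_loss is a hypothetical name.

```python
import torch
import torch.nn.functional as F

def masked_set_loss(logits, target_counts):
    # logits: (batch, vocab) -- model's prediction for each masked set,
    # from either the nested or the flattened backbone.
    # target_counts: (batch, vocab) -- event multiplicities n_v in that set.
    log_probs = F.log_softmax(logits, dim=-1)
    # Negative multinomial log-likelihood, up to an additive constant.
    return -(target_counts * log_probs).sum(dim=-1).mean()

loss = masked_set_loss(torch.randn(4, 100),
                       torch.randint(0, 3, (4, 100)).float())
print(loss)
```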
Circularity Check
No circularity in claimed derivation chain
full rationale
The paper advances an architectural proposal (NEST) and an associated pretraining objective (MSM) whose benefits are asserted via empirical results on real-world data rather than any closed-form derivation. No equations appear that define a quantity in terms of itself, rename a fitted parameter as a prediction, or reduce the inductive-bias claim to a self-citation chain. The hierarchy-preservation argument is presented as an external modeling choice whose value is tested experimentally; it does not collapse to a tautology by construction. Consequently the derivation chain is self-contained and receives the lowest circularity score.
Reference graph
Works this paper leans on
- [1] Iz Beltagy, Matthew E. Peters, and Arman Cohan. Longformer: The long-document transformer. arXiv preprint arXiv:2004.05150, 2020.
- [2] Ilias Chalkidis, Ion Androutsopoulos, and Nikolaos Aletras. Neural legal judgment prediction in English. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2019.
- [3] Ilias Chalkidis, Xiang Dai, Manos Fergadiotis, Prodromos Malakasiotis, and Desmond Elliott. An exploration of hierarchical attention transformers for efficient long document classification. arXiv preprint arXiv:2210.05529, 2022.
- [4] Yung-Sung Chuang, Rumen Dangovski, Hongyin Luo, Yang Zhang, Shiyu Chang, Marin Soljačić, Shang-Wen Li, Scott Yih, Yoon Kim, and James Glass. DiffCSE: Difference-based contrastive learning for sentence embeddings. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2022.
- [5] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, 2019.
- [6] Tianyu Gao, Xingcheng Yao, and Danqi Chen. SimCSE: Simple contrastive learning of sentence embeddings. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, page 6894, 2021.
- [7] Haoyu He, Markus Flicke, Jan Buchmann, Iryna Gurevych, and Andreas Geiger. HDT: Hierarchical document transformer. arXiv preprint arXiv:2407.08330, 2024.
- [8] Haoji Hu and Xiangnan He. Sets2Sets: Learning from sequential sets with neural networks. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 1491–1499, 2019.
- [9] Haoji Hu, Xiangnan He, Jinyang Gao, and Zhi-Li Zhang. Modeling personalized item frequency information for next-basket recommendation. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 1071–1080, 2020.
- [10] Sihao Hu, Zhen Zhang, Bingqiao Luo, Shengliang Lu, Bingsheng He, and Ling Liu. BERT4ETH: A pre-trained transformer for Ethereum fraud detection. In Proceedings of the ACM Web Conference 2023, pages 2189–2197, 2023.
- [11] Tingyi Huang, Shreya Saini, Aditya Nagarajan, Young-Rock Chung, Shayok Bhattacharyya, and Tengfei Ma. HEART: Learning better representation of EHR data with a heterogeneous relation-aware transformer. Journal of Biomedical Informatics, 152:104623, 2024.
- [12] Hyewon Jeong, Nassim Oufattole, Matthew McDermott, Aparna Balagopalan, Bryan Jangeesingh, Marzyeh Ghassemi, and Collin Stultz. Event-based contrastive learning for medical time series. arXiv preprint arXiv:2312.10308, 2023.
- [13] Alistair E. W. Johnson, Lucas Bulgarelli, Lu Shen, Alvin Gayles, Ayad Shammout, Steven Horng, Tom J. Pollard, Sicheng Hao, Benjamin Moody, Brian Gow, et al. MIMIC-IV, a freely accessible electronic health record dataset. Scientific Data, 10(1):1, 2023.
- [14] Wang-Cheng Kang and Julian McAuley. Self-attentive sequential recommendation. In 2018 IEEE International Conference on Data Mining (ICDM), pages 197–206. IEEE, 2018.
- [15] Seyed Mehran Kazemi, Rishab Goel, Sepehr Eghbali, Janahan Ramanan, Jaspreet Sahota, Sanjay Thakur, Stella Wu, Cathal Smyth, Pascal Poupart, and Marcus Brubaker. Time2Vec: Learning a vector representation of time. arXiv preprint arXiv:1907.05321, 2019.
- [16] Guolin Ke, Qi Meng, Thomas Finley, Taifeng Wang, Wei Chen, Weidong Ma, Qiwei Ye, and Tie-Yan Liu. LightGBM: A highly efficient gradient boosting decision tree. Advances in Neural Information Processing Systems, 30, 2017.
- [17] Seunghan Lee, Taeyoung Park, and Kibok Lee. Soft contrastive learning for time series. In The Twelfth International Conference on Learning Representations (ICLR), 2024.
- [18] Ming Li, Mozhdeh Ariannezhad, Andrew Yates, and Maarten de Rijke. Masked and swapped sequence modeling for next novel basket recommendation in grocery shopping. In Proceedings of the 17th ACM Conference on Recommender Systems, pages 35–46, 2023.
- [19] Ming Li, Sami Jullien, Mozhdeh Ariannezhad, and Maarten de Rijke. A next basket recommendation reality check. ACM Transactions on Information Systems, 41(4):1–29, 2023.
- [20] Yikuan Li, Shishir Rao, José Roberto Ayala Solares, Abdelaali Hassaine, Rema Ramakrishnan, Dexter Canoy, Yajie Zhu, Kazem Rahimi, and Gholamreza Salimi-Khorshidi. BEHRT: Transformer for electronic health records. Scientific Reports, 10(1):7155, 2020.
- [21] Yikuan Li, Mohammad Mamouei, Gholamreza Salimi-Khorshidi, Shishir Rao, Abdelaali Hassaine, Dexter Canoy, Thomas Lukasiewicz, and Kazem Rahimi. Hi-BEHRT: Hierarchical transformer-based model for accurate prediction of clinical events using multimodal longitudinal electronic health records. IEEE Journal of Biomedical and Health Informatics, 27(2):1106–1117, 2022.
- [22] Matthew McDermott, Bret Nestor, Peniel Argaw, and Isaac S. Kohane. Event Stream GPT: A data pre-processing and modeling library for generative, pre-trained transformers over continuous-time sequences of complex events. Advances in Neural Information Processing Systems, 36:24322–24334, 2023.
- [23] Mikkel Odgaard, Kiril Vadimovic Klein, Sanne Møller Thysen, Espen Jimenez-Solem, Martin Sillesen, and Mads Nielsen. CORE-BEHRT: A carefully optimized and rigorously evaluated BEHRT. arXiv preprint arXiv:2404.15201, 2024.
- [24] Raphael Poulain and Rahmatollah Beheshti. Graph transformers on EHRs: Better representation improves downstream performance. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=pe0Vdv7rsL.
- [25] Ofir Press and Lior Wolf. Using the output embedding to improve language models. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, pages 157–163, 2017.
- [26] Markus N. Rabe and Charles Staats. Self-attention does not need O(n^2) memory. arXiv preprint arXiv:2112.05682, 2021.
- [27] Alec Radford, Karthik Narasimhan, Tim Salimans, Ilya Sutskever, et al. Improving language understanding by generative pre-training. 2018.
- [28] Laila Rasmy, Yang Xiang, Ziqian Xie, Cui Tao, and Degui Zhi. Med-BERT: Pretrained contextualized embeddings on large-scale structured electronic health records for disease prediction. NPJ Digital Medicine, 4(1):86, 2021.
- [29] Nils Reimers and Iryna Gurevych. Sentence-BERT: Sentence embeddings using Siamese BERT-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3982–3992, 2019.
- [30] Pawel Renc, Yugang Jia, Anthony E. Samir, Jaroslaw Was, Quanzheng Li, David W. Bates, and Arkadiusz Sitek. Zero shot health trajectory prediction using transformer. NPJ Digital Medicine, 7(1):256, 2024.
- [31] Noam Shazeer. GLU variants improve transformer. arXiv preprint arXiv:2002.05202, 2020.
- [32] Ethan Steinberg, Jason Fries, Yizhe Xu, and Nigam Shah. MOTOR: A time-to-event foundation model for structured medical records. arXiv preprint arXiv:2301.03150, 2023.
- [33] Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. RoFormer: Enhanced transformer with rotary position embedding. Neurocomputing, 568:127063, 2024.
- [34] Fei Sun, Jun Liu, Jian Wu, Changhua Pei, Xiao Lin, Wenwu Ou, and Peng Jiang. BERT4Rec: Sequential recommendation with bidirectional encoder representations from transformer. In Proceedings of the 28th ACM International Conference on Information and Knowledge Management, pages 1441–1450, 2019.
- [35] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in Neural Information Processing Systems, 30, 2017.
- [36] Benjamin Warner, Antoine Chaffin, Benjamin Clavié, Orion Weller, Oskar Hallström, Said Taghadouini, Alexis Gallagher, Raja Biswas, Faisal Ladhak, Tom Aarsen, et al. Smarter, better, faster, longer: A modern bidirectional encoder for fast, memory efficient, and long context finetuning and inference. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics, 2025.
- [37] Sean Welleck, Zixin Yao, Yu Gai, Jialin Mao, Zheng Zhang, and Kyunghyun Cho. Loss functions for multiset prediction. Advances in Neural Information Processing Systems, 31, 2018.
- [38] Michael Wornow, Rahul Thapa, Ethan Steinberg, Jason Fries, and Nigam Shah. EHRSHOT: An EHR benchmark for few-shot evaluation of foundation models. Advances in Neural Information Processing Systems, 36:67125–67137, 2023.
- [39] Ruibin Xiong, Yunchang Yang, Di He, Kai Zheng, Shuxin Zheng, Chen Xing, Huishuai Zhang, Yanyan Lan, Liwei Wang, and Tie-Yan Liu. On layer normalization in the transformer architecture. In International Conference on Machine Learning, pages 10524–10533. PMLR, 2020.
- [40] Zichao Yang, Diyi Yang, Chris Dyer, Xiaodong He, Alex Smola, and Eduard Hovy. Hierarchical attention networks for document classification. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1480–1489, 2016.
- [41] Manzil Zaheer, Guru Guruganesh, Kumar Avinava Dubey, Joshua Ainslie, Chris Alberti, Santiago Ontanon, Philip Pham, Anirudh Ravula, Qifan Wang, Li Yang, et al. Big Bird: Transformers for longer sequences. Advances in Neural Information Processing Systems, 33:17283–17297, 2020.
- [42] Yan Zhang, Jonathon Hare, and Adam Prügel-Bennett. Deep set prediction networks. Advances in Neural Information Processing Systems, 32, 2019.