NEST: Nested Event Stream Transformer for Sequences of Multisets
Pith reviewed 2026-05-16 09:18 UTC · model grok-4.3
The pith
Preserving the original hierarchy of event streams as sequences of multisets improves both computational efficiency and representation quality in foundation models.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper claims that by retaining the nested structure of sequences of multisets in the transformer architecture, one obtains a useful inductive bias. This bias reduces the quadratic cost of attention by limiting cross-set interactions and yields higher-quality set representations without post-training aggregation. The resulting NEST model, trained with Masked Set Modeling, better captures the temporal dynamics of hierarchical event data.
What carries the argument
The nested transformer architecture in NEST, which applies attention separately within each multiset and across the sequence of set-level representations, combined with the Masked Set Modeling objective.
If this is right
- Computational efficiency increases because attention is computed only within sets and between set representations rather than over the entire flattened sequence (see the sketch after this list).
- Representation quality at the set level improves, leading to stronger performance on downstream tasks without relying on heuristic pooling methods.
- The model learns to respect the natural grouping of events, avoiding spurious correlations from artificial ordering within multisets.
- Pretraining becomes more efficient while still modeling the overall sequence dynamics.
- Real-world event stream dynamics are captured more faithfully in domains like healthcare.
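To make the efficiency argument concrete, here is a minimal PyTorch sketch of the two-level attention pattern described above. This is a reconstruction under stated assumptions, not the authors' implementation: the class name NestedBlock, the mean pooling, and the fixed-size padding convention are all hypothetical choices for illustration.

```python
# Minimal sketch of hierarchy-preserving attention: within-set attention,
# pooling to one embedding per multiset, then across-set attention.
import torch
import torch.nn as nn

class NestedBlock(nn.Module):
    def __init__(self, d_model=64, n_heads=4):
        super().__init__()
        # Within-set attention: multisets are unordered, so no positional
        # encoding is applied at this level.
        self.set_encoder = nn.TransformerEncoderLayer(
            d_model, n_heads, batch_first=True)
        # Across-set attention: operates on one pooled embedding per multiset.
        self.seq_encoder = nn.TransformerEncoderLayer(
            d_model, n_heads, batch_first=True)

    def forward(self, x):
        # x: (num_sets, set_size, d_model) -- one sequence of multisets,
        # padded so every set has the same size for simplicity.
        S, m, d = x.shape
        # 1) Attention only inside each multiset: cost O(S * m^2),
        #    versus O((S * m)^2) for a flattened sequence.
        within = self.set_encoder(x)                  # (S, m, d)
        # 2) Pool each set to a single embedding (mean is one simple choice).
        set_repr = within.mean(dim=1).unsqueeze(0)    # (1, S, d)
        # 3) Attention across set representations: cost O(S^2).
        return self.seq_encoder(set_repr).squeeze(0)  # (S, d)

block = NestedBlock()
out = block(torch.randn(10, 8, 64))  # 10 encounters, 8 events each
print(out.shape)                     # torch.Size([10, 64])
```

For 10 sets of 8 events, the factorized pattern scores roughly 10 × 8² + 10² = 740 attention pairs against 80² = 6,400 for the flattened sequence, which is the source of the claimed efficiency gain.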
Where Pith is reading between the lines
- The nested design may extend naturally to other data with similar hierarchies, such as sequences of documents or batches of transactions.
- Variable set sizes could be handled more gracefully without padding issues common in flattened approaches.
- If within-set timing information becomes available, the architecture could incorporate it without major redesign.
- This inductive bias might reduce the data requirements for effective pretraining on hierarchical streams.
Load-bearing premise
The assumption that the original multiset grouping reflects true co-occurrence without spurious internal order, and that flattening inevitably introduces misleading within-group dependencies.
What would settle it
Train both NEST and a flattened baseline on synthetic data where events within multisets are independent and randomly ordered, then compare their downstream task accuracy and training FLOPs to see if the hierarchy-preserving model still wins.
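A hedged sketch of one way to build that synthetic testbed follows: events inside each multiset are drawn independently, so any within-set order a flattened model appears to exploit is spurious by construction. This generator is our own construction, not from the paper.

```python
import random

def synth_multiset_sequence(vocab_size=100, num_sets=10, set_size=5, seed=0):
    rng = random.Random(seed)
    # Each multiset: i.i.d. event draws; no true within-set dependencies,
    # and the sampling order carries no information by construction.
    return [[rng.randrange(vocab_size) for _ in range(set_size)]
            for _ in range(num_sets)]

seq = synth_multiset_sequence()
flattened = [e for s in seq for e in s]  # input for the flattened baseline
```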
read the original abstract
Event stream data often exhibit hierarchical structure in which multiple events co-occur, resulting in a sequence of multisets (i.e., bags of events). In electronic health records (EHRs), for example, medical events are grouped into a sequence of clinical encounters with well-defined temporal structure, but the order and timing of events within each encounter may be unknown or unreliable. Most existing foundation models (FMs) for event stream data flatten this hierarchy into a one-dimensional sequence, leading to (i) computational inefficiency associated with dense attention and learning spurious within-set relationships, and (ii) lower-quality set-level representations from heuristic post-training pooling for downstream tasks. Here, we show that preserving the original hierarchy in the FM architecture provides a useful inductive bias that improves both computational efficiency and representation quality. We then introduce Nested Event Stream Transformer (NEST), a FM for event streams comprised of sequences of multisets. Building on this architecture, we formulate Masked Set Modeling (MSM), an efficient paradigm that promotes improved set-level representation learning. Experiments on real-world multiset sequence data show that NEST captures real-world dynamics while improving both pretraining efficiency and downstream performance.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that event stream data often consist of sequences of multisets with hierarchical structure (e.g., clinical encounters in EHRs), and that flattening this hierarchy into 1D sequences in existing foundation models causes computational inefficiency from dense attention and spurious within-set relationships, plus lower-quality set-level representations. It introduces the Nested Event Stream Transformer (NEST) architecture to preserve the hierarchy as an inductive bias, along with Masked Set Modeling (MSM) for improved set-level learning, and asserts that experiments on real-world data demonstrate gains in pretraining efficiency and downstream performance.
Significance. If the central claim holds—that hierarchy preservation supplies a load-bearing inductive bias independent of the MSM objective—this would provide a principled architectural alternative for foundation models on grouped event data, with potential efficiency advantages in attention computation and better downstream set representations in domains such as healthcare.
major comments (2)
- [Abstract] The claim that NEST 'improves both pretraining efficiency and downstream performance' on real-world data is backed by no quantitative metrics, error bars, ablation details, or experimental protocol, preventing verification of the claimed gains or isolation of the hierarchy-preserving component.
- [Method] The method section (NEST architecture and MSM formulation) reports no ablation that trains a flattened transformer baseline with the identical Masked Set Modeling objective while holding parameter count, masking strategy, and optimization fixed; without this control, it remains unclear whether the observed benefits derive from hierarchy preservation rather than from MSM itself.
minor comments (1)
- [Abstract] The phrase 'real-world multiset sequence data' appears without naming the specific datasets or benchmarks; these should be stated explicitly for reproducibility.
Simulated Author's Rebuttal
Thank you for the constructive feedback. We address each major comment below and will revise the manuscript accordingly to improve clarity and strengthen the experimental controls.
read point-by-point responses
- Referee: [Abstract] The claim that NEST 'improves both pretraining efficiency and downstream performance' on real-world data is backed by no quantitative metrics, error bars, ablation details, or experimental protocol, preventing verification of the claimed gains or isolation of the hierarchy-preserving component.
  Authors: We agree that the abstract would be strengthened by quantitative support. In the revised version we will add concise statements of the key efficiency gains (e.g., attention FLOPs or wall-clock time) and downstream performance deltas with error bars, together with a one-sentence description of the evaluation protocol. These numbers are already reported in the experiments section; we will simply surface them in the abstract. Revision: yes
- Referee: [Method] The method section (NEST architecture and MSM formulation) reports no ablation that trains a flattened transformer baseline with the identical Masked Set Modeling objective while holding parameter count, masking strategy, and optimization fixed; without this control, it remains unclear whether the observed benefits derive from hierarchy preservation rather than from MSM itself.
  Authors: This is a fair criticism. Our existing baselines compare NEST to standard flattened transformers, but we did not run the precise control that applies the identical MSM objective to a flattened architecture under matched parameter count, masking ratio, and optimizer settings. We will add this ablation in the revision; the new experiment will be described in the method section and reported alongside the existing results so that the contribution of the nested inductive bias can be isolated from the MSM objective (a sketch of such a shared objective follows this list). Revision: yes
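To illustrate what "identical objective across backbones" could mean here, the sketch below scores a masked multiset with a multinomial log-likelihood, consistent with the appendix-style derivation in which the loss equals -sum_v n_v log pi_theta(v) up to an additive constant. This is our reading of MSM for the ablation, not the authors' released code; masked_set_loss is a hypothetical name.

```python
import torch
import torch.nn.functional as F

def masked_set_loss(logits, target_counts):
    # logits: (batch, vocab) -- model's prediction for each masked set,
    # from either the nested or the flattened backbone.
    # target_counts: (batch, vocab) -- event multiplicities n_v in that set.
    log_probs = F.log_softmax(logits, dim=-1)
    # Negative multinomial log-likelihood, up to an additive constant.
    return -(target_counts * log_probs).sum(dim=-1).mean()

loss = masked_set_loss(torch.randn(4, 100),
                       torch.randint(0, 3, (4, 100)).float())
print(loss)
```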
Circularity Check
No circularity in claimed derivation chain
full rationale
The paper advances an architectural proposal (NEST) and an associated pretraining objective (MSM) whose benefits are asserted via empirical results on real-world data rather than any closed-form derivation. No equations appear that define a quantity in terms of itself, rename a fitted parameter as a prediction, or reduce the inductive-bias claim to a self-citation chain. The hierarchy-preservation argument is presented as an external modeling choice whose value is tested experimentally; it does not collapse to a tautology by construction. Consequently the derivation chain is self-contained and receives the lowest circularity score.
Reference graph
Works this paper leans on
- [1] Iz Beltagy, Matthew E. Peters, and Arman Cohan. Longformer: The long-document transformer. arXiv preprint arXiv:2004.05150, 2020.
- [2] Ilias Chalkidis, Ion Androutsopoulos, and Nikolaos Aletras. Neural legal judgment prediction in English. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2019.
- [3] Ilias Chalkidis, Xiang Dai, Manos Fergadiotis, Prodromos Malakasiotis, and Desmond Elliott. An exploration of hierarchical attention transformers for efficient long document classification. arXiv preprint arXiv:2210.05529, 2022.
- [4] Yung-Sung Chuang, Rumen Dangovski, Hongyin Luo, Yang Zhang, Shiyu Chang, Marin Soljačić, Shang-Wen Li, Scott Yih, Yoon Kim, and James Glass. DiffCSE: Difference-based contrastive learning for sentence embeddings. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2022.
- [5] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, 2019.
- [6] Tianyu Gao, Xingcheng Yao, and Danqi Chen. SimCSE: Simple contrastive learning of sentence embeddings. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, page 6894, 2021.
- [7] Haoyu He, Markus Flicke, Jan Buchmann, Iryna Gurevych, and Andreas Geiger. HDT: Hierarchical document transformer. arXiv preprint arXiv:2407.08330, 2024.
- [8] Haoji Hu and Xiangnan He. Sets2Sets: Learning from sequential sets with neural networks. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 1491–1499, 2019.
- [9] Haoji Hu, Xiangnan He, Jinyang Gao, and Zhi-Li Zhang. Modeling personalized item frequency information for next-basket recommendation. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 1071–1080, 2020.
- [10] Sihao Hu, Zhen Zhang, Bingqiao Luo, Shengliang Lu, Bingsheng He, and Ling Liu. BERT4ETH: A pre-trained transformer for Ethereum fraud detection. In Proceedings of the ACM Web Conference 2023, pages 2189–2197, 2023.
- [11] Tingyi Huang, Shreya Saini, Aditya Nagarajan, Young-Rock Chung, Shayok Bhattacharyya, and Tengfei Ma. HEART: Learning better representation of EHR data with a heterogeneous relation-aware transformer. Journal of Biomedical Informatics, 152:104623, 2024.
- [12] Hyewon Jeong, Nassim Oufattole, Matthew McDermott, Aparna Balagopalan, Bryan Jangeesingh, Marzyeh Ghassemi, and Collin Stultz. Event-based contrastive learning for medical time series. arXiv preprint arXiv:2312.10308, 2023.
- [13] Alistair E. W. Johnson, Lucas Bulgarelli, Lu Shen, Alvin Gayles, Ayad Shammout, Steven Horng, Tom J. Pollard, Sicheng Hao, Benjamin Moody, Brian Gow, et al. MIMIC-IV, a freely accessible electronic health record dataset. Scientific Data, 10(1):1, 2023.
- [14] Wang-Cheng Kang and Julian McAuley. Self-attentive sequential recommendation. In 2018 IEEE International Conference on Data Mining (ICDM), pages 197–206. IEEE, 2018.
- [15] Seyed Mehran Kazemi, Rishab Goel, Sepehr Eghbali, Janahan Ramanan, Jaspreet Sahota, Sanjay Thakur, Stella Wu, Cathal Smyth, Pascal Poupart, and Marcus Brubaker. Time2Vec: Learning a vector representation of time. arXiv preprint arXiv:1907.05321, 2019.
- [16] Guolin Ke, Qi Meng, Thomas Finley, Taifeng Wang, Wei Chen, Weidong Ma, Qiwei Ye, and Tie-Yan Liu. LightGBM: A highly efficient gradient boosting decision tree. Advances in Neural Information Processing Systems, 30, 2017.
- [17] Seunghan Lee, Taeyoung Park, and Kibok Lee. Soft contrastive learning for time series. In The Twelfth International Conference on Learning Representations (ICLR), 2024.
- [18] Ming Li, Mozhdeh Ariannezhad, Andrew Yates, and Maarten de Rijke. Masked and swapped sequence modeling for next novel basket recommendation in grocery shopping. In Proceedings of the 17th ACM Conference on Recommender Systems, pages 35–46, 2023.
- [19] Ming Li, Sami Jullien, Mozhdeh Ariannezhad, and Maarten de Rijke. A next basket recommendation reality check. ACM Transactions on Information Systems, 41(4):1–29, 2023.
- [20] Yikuan Li, Shishir Rao, José Roberto Ayala Solares, Abdelaali Hassaine, Rema Ramakrishnan, Dexter Canoy, Yajie Zhu, Kazem Rahimi, and Gholamreza Salimi-Khorshidi. BEHRT: Transformer for electronic health records. Scientific Reports, 10(1):7155, 2020.
- [21] Yikuan Li, Mohammad Mamouei, Gholamreza Salimi-Khorshidi, Shishir Rao, Abdelaali Hassaine, Dexter Canoy, Thomas Lukasiewicz, and Kazem Rahimi. Hi-BEHRT: Hierarchical transformer-based model for accurate prediction of clinical events using multimodal longitudinal electronic health records. IEEE Journal of Biomedical and Health Informatics, 27(2):1106–1117, 2022.
- [22] Matthew McDermott, Bret Nestor, Peniel Argaw, and Isaac S. Kohane. Event Stream GPT: A data pre-processing and modeling library for generative, pre-trained transformers over continuous-time sequences of complex events. Advances in Neural Information Processing Systems, 36:24322–24334, 2023.
- [23] Mikkel Odgaard, Kiril Vadimovic Klein, Sanne Møller Thysen, Espen Jimenez-Solem, Martin Sillesen, and Mads Nielsen. CORE-BEHRT: A carefully optimized and rigorously evaluated BEHRT. arXiv preprint arXiv:2404.15201, 2024.
- [24] Raphael Poulain and Rahmatollah Beheshti. Graph transformers on EHRs: Better representation improves downstream performance. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=pe0Vdv7rsL.
- [25] Ofir Press and Lior Wolf. Using the output embedding to improve language models. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, pages 157–163, 2017.
- [26] Markus N. Rabe and Charles Staats. Self-attention does not need O(n^2) memory. arXiv preprint arXiv:2112.05682, 2021.
- [27] Alec Radford, Karthik Narasimhan, Tim Salimans, Ilya Sutskever, et al. Improving language understanding by generative pre-training. 2018.
- [28] Laila Rasmy, Yang Xiang, Ziqian Xie, Cui Tao, and Degui Zhi. Med-BERT: Pretrained contextualized embeddings on large-scale structured electronic health records for disease prediction. NPJ Digital Medicine, 4(1):86, 2021.
- [29] Nils Reimers and Iryna Gurevych. Sentence-BERT: Sentence embeddings using Siamese BERT-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3982–3992, 2019.
- [30] Pawel Renc, Yugang Jia, Anthony E. Samir, Jaroslaw Was, Quanzheng Li, David W. Bates, and Arkadiusz Sitek. Zero shot health trajectory prediction using transformer. NPJ Digital Medicine, 7(1):256, 2024.
- [31] Noam Shazeer. GLU variants improve transformer. arXiv preprint arXiv:2002.05202, 2020.
- [32] Ethan Steinberg, Jason Fries, Yizhe Xu, and Nigam Shah. MOTOR: A time-to-event foundation model for structured medical records. arXiv preprint arXiv:2301.03150, 2023.
- [33] Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. RoFormer: Enhanced transformer with rotary position embedding. Neurocomputing, 568:127063, 2024.
- [34] Fei Sun, Jun Liu, Jian Wu, Changhua Pei, Xiao Lin, Wenwu Ou, and Peng Jiang. BERT4Rec: Sequential recommendation with bidirectional encoder representations from transformer. In Proceedings of the 28th ACM International Conference on Information and Knowledge Management, pages 1441–1450, 2019.
- [35] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in Neural Information Processing Systems, 30, 2017.
- [36] Benjamin Warner, Antoine Chaffin, Benjamin Clavié, Orion Weller, Oskar Hallström, Said Taghadouini, Alexis Gallagher, Raja Biswas, Faisal Ladhak, Tom Aarsen, et al. Smarter, better, faster, longer: A modern bidirectional encoder for fast, memory efficient, and long context finetuning and inference. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics, 2025.
- [37] Sean Welleck, Zixin Yao, Yu Gai, Jialin Mao, Zheng Zhang, and Kyunghyun Cho. Loss functions for multiset prediction. Advances in Neural Information Processing Systems, 31, 2018.
- [38] Michael Wornow, Rahul Thapa, Ethan Steinberg, Jason Fries, and Nigam Shah. EHRSHOT: An EHR benchmark for few-shot evaluation of foundation models. Advances in Neural Information Processing Systems, 36:67125–67137, 2023.
- [39] Ruibin Xiong, Yunchang Yang, Di He, Kai Zheng, Shuxin Zheng, Chen Xing, Huishuai Zhang, Yanyan Lan, Liwei Wang, and Tie-Yan Liu. On layer normalization in the transformer architecture. In International Conference on Machine Learning, pages 10524–10533. PMLR, 2020.
- [40] Zichao Yang, Diyi Yang, Chris Dyer, Xiaodong He, Alex Smola, and Eduard Hovy. Hierarchical attention networks for document classification. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1480–1489, 2016.
- [41] Manzil Zaheer, Guru Guruganesh, Kumar Avinava Dubey, Joshua Ainslie, Chris Alberti, Santiago Ontanon, Philip Pham, Anirudh Ravula, Qifan Wang, Li Yang, et al. Big Bird: Transformers for longer sequences. Advances in Neural Information Processing Systems, 33:17283–17297, 2020.
- [42] Yan Zhang, Jonathon Hare, and Adam Prügel-Bennett. Deep set prediction networks. Advances in Neural Information Processing Systems, 32, 2019.