Rotary Masked Autoencoders are Versatile Learners

arxiv: 2505.20535 · v3 · submitted 2025-05-26 · 💻 cs.LG

Uros Zivanovic, Serafina Di Gioia, Andre Scaffidi, Martín de los Rios, Gabriella Contardo, Roberto Trotta

Pith reviewed 2026-05-19 12:28 UTC · model grok-4.3

classification 💻 cs.LG

keywords Rotary Positional Embeddings · Masked Autoencoders · Time Series · Representation Learning · Irregular Sampling · Multimodal Learning

The pith

RoMAE applies rotary positional embeddings to masked autoencoders for continuous positions across modalities without time-series customizations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents RoMAE as an extension of the standard Masked Autoencoder that replaces usual positional encodings with rotary embeddings adapted to continuous multidimensional inputs. This setup targets irregular time-series and similar data where positions are not on a fixed grid. A sympathetic reader would care because the method claims to avoid the extra architectural changes and computational costs that usually come with handling uneven sampling or multivariate continuous signals. The work shows RoMAE matching or exceeding specialized models on challenging benchmarks while preserving baseline MAE behavior on images and audio.

Core claim

RoMAE utilizes Rotary Positional Embedding for continuous positions in the MAE framework, enabling interpolation and representation learning with multidimensional continuous positional information while avoiding any time-series-specific architectural specializations. It surpasses specialized time-series architectures on difficult datasets such as the DESC ELAsTiCC Challenge while maintaining MAE's usual performance across other modalities. Including learned embeddings in the input sequence breaks RoPE's relative position property.

What carries the argument

Rotary Positional Embedding (RoPE) adapted for continuous positions, inserted directly into the Masked Autoencoder pipeline to encode relative positional information that supports interpolation without grid assumptions.
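To make the mechanism concrete, here is a minimal sketch of RoPE evaluated at real-valued positions. It is an illustration under stated assumptions, not the authors' implementation: the function name `rope_rotate`, the frequency base of 10000 (the common RoPE default), and the toy vectors and timestamps are all hypothetical, and the paper's own variant (p-RoPE) may use a different frequency schedule. The sketch checks the property the argument relies on: the query-key score depends only on the difference between positions, so shifting both timestamps by the same amount leaves it unchanged.

```python
import math

def rope_rotate(x, pos, base=10000.0):
    """Rotate consecutive feature pairs of x by angles pos * theta_i.

    pos may be any float, so irregular timestamps need no grid.
    The frequency schedule is the standard RoPE one (an assumption here).
    """
    d = len(x)
    assert d % 2 == 0, "RoPE needs an even feature dimension"
    out = []
    for i in range(d // 2):
        angle = pos * base ** (-2.0 * i / d)
        c, s = math.cos(angle), math.sin(angle)
        a, b = x[2 * i], x[2 * i + 1]
        out.extend([a * c - b * s, a * s + b * c])
    return out

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# Toy query/key vectors and non-integer timestamps (hypothetical values).
q = [0.3, -1.2, 0.7, 0.5]
k = [1.1, 0.4, -0.6, 0.9]
t_q, t_k = 3.75, 1.2
shift = 42.5

score = dot(rope_rotate(q, t_q), rope_rotate(k, t_k))
score_shifted = dot(rope_rotate(q, t_q + shift), rope_rotate(k, t_k + shift))

# The two scores agree up to floating-point error: only t_q - t_k matters.
print(abs(score - score_shifted) < 1e-9)
```

Nothing in the rotation requires integer positions, which is why the substitution does not depend on a fixed sampling grid.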

If this is right

RoMAE performs representation learning on irregular and multivariate time-series without added complexity.
The same model maintains standard MAE reconstruction quality on images and audio.
Learned embeddings in the sequence interfere with RoPE's relative position encoding.
RoMAE can interpolate across continuous positional dimensions in the input sequence.
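The first point above rests on MAE's masking step carrying over unchanged to irregular data. A rough sketch of what that could look like for a time-series with real-valued timestamps follows. The helper name `mae_split`, the 75% mask ratio (the default from the original image MAE, not a value stated on this page for RoMAE), and the toy data are assumptions for illustration only.

```python
import random

def mae_split(times, values, mask_ratio=0.75, seed=0):
    """Randomly split observations into visible tokens and masked targets."""
    rng = random.Random(seed)
    order = list(range(len(times)))
    rng.shuffle(order)
    n_masked = int(round(len(order) * mask_ratio))
    masked = sorted(order[:n_masked])
    visible = sorted(order[n_masked:])
    # The encoder sees only visible observations; each keeps its own timestamp.
    encoder_tokens = [(times[i], values[i]) for i in visible]
    # The decoder is asked to predict values at the masked timestamps.
    query_times = [times[i] for i in masked]
    targets = [values[i] for i in masked]
    return encoder_tokens, query_times, targets

# Irregularly spaced toy observations (hypothetical values).
times = [0.0, 0.4, 1.7, 1.9, 3.2, 6.5, 6.6, 9.1]
values = [0.1, 0.3, 0.9, 1.0, 0.2, -0.7, -0.8, 0.4]

encoder_tokens, query_times, targets = mae_split(times, values)
print(len(encoder_tokens), len(query_times))
```

In this framing, interpolation would amount to querying the decoder at timestamps that were never observed, since each token carries its position with it rather than relying on its index in the sequence.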

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The approach may extend naturally to other continuous-valued domains such as spatial sensor grids or scientific measurement series.
Direct substitution of rotary embeddings could reduce the need for modality-specific positional modules in future transformer variants.

Load-bearing premise

Rotary positional embeddings can be directly substituted into the masked autoencoder framework for continuous positions without introducing hidden limitations or needing further adjustments.

What would settle it

An experiment where RoMAE is tested on irregular time-series data with known gaps and shows clear degradation compared to specialized architectures, or fails to reconstruct positions while preserving relative properties.
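One way to build the kind of test data this describes is sketched below: an irregularly sampled sinusoid with a deliberately removed interval, so that reconstruction inside the gap can be scored against known ground truth. The function name `sample_with_gap`, the sinusoidal signal, and all parameter values are hypothetical choices, not the paper's benchmark.

```python
import math
import random

def sample_with_gap(n, t_max, gap, freq=0.5, seed=0):
    """Draw n irregular timestamps in [0, t_max] and hold out those inside gap."""
    rng = random.Random(seed)
    lo, hi = gap
    times = sorted(rng.uniform(0.0, t_max) for _ in range(n))

    def signal(t):
        return math.sin(2.0 * math.pi * freq * t)

    # Observations outside the gap are given to the model.
    observed = [(t, signal(t)) for t in times if not (lo <= t <= hi)]
    # Observations inside the gap are kept only as ground truth for scoring.
    held_out = [(t, signal(t)) for t in times if lo <= t <= hi]
    return observed, held_out

observed, held_out = sample_with_gap(n=50, t_max=10.0, gap=(4.0, 6.0))
print(len(observed), len(held_out))
```

Comparing errors on `held_out` between RoMAE and a specialized baseline, across several gap widths, would give the degradation curve the test calls for.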

Figures

Figures reproduced from arXiv: 2505.20535 by Andre Scaffidi, Gabriella Contardo, Martín de los Rios, Roberto Trotta, Serafina Di Gioia, Uros Zivanovic.

**Figure 1.** Overview of the RoMAE pipeline. Left: Visualisation of data embedding via multidimensional (ND) patchification for illustrative data realisations in 1, 2 and 3D. Centre: Full depiction of RoMAE architecture. The optional [CLS] token is omitted from the input sequence for simplicity. Right: The RoMAE encoder/decoder with RoPE operations denoted by rotational arrows.

**Figure 2.** RoMAE position reconstruction MSE across two positional ranges.

**Figure 3.** Average MSE obtained from the interpolation task using RoMAE-tiny for time-series with …

**Figure 4.** Same as above but now for time-series with two frequency modes present in the signal.

**Figure 5.** Illustrative realisation from the evaluation of RoMAE on the bi-frequency time series.

**Figure 6.** A training example from the ELAsTiCC dataset. The flux difference of each band has …

**Figure 7.** Two sample realisations of differing chirality from the test set of spirals. The green line is …

**Figure 8.** Samples from interpolation tests using n = 3, 10 and 20 observations. D.9 PhysioNet: We adopt the pre-processed release of the PHYSIONET/CinC 2012 Challenge [50], comprising multivariate clinical time-series collected during the 48h window following intensive-care-unit (ICU) admission. Static covariates (Age, Gender, Height, ICU type) occupy feature indices 0–3 and are always observed, whereas the remaining…

Original abstract

Applying Transformers to irregular time-series typically requires specializations to their baseline architecture, which can result in additional computational overhead and increased method complexity. We present the Rotary Masked Autoencoder (RoMAE), which utilizes the popular Rotary Positional Embedding (RoPE) method for continuous positions. RoMAE is an extension to the Masked Autoencoder (MAE) that enables interpolation and representation learning with multidimensional continuous positional information while avoiding any time-series-specific architectural specializations. We showcase RoMAE's performance on a variety of modalities including irregular and multivariate time-series, images, and audio, demonstrating that RoMAE surpasses specialized time-series architectures on difficult datasets such as the DESC ELAsTiCC Challenge while maintaining MAE's usual performance across other modalities. In addition, we investigate RoMAE's ability to reconstruct the embedded continuous positions, demonstrating that including learned embeddings in the input sequence breaks RoPE's relative position property.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

RoMAE applies RoPE to MAE for continuous positions without time-series tweaks and claims cross-modality wins, but the abstract gives no numbers or implementation details to check if the substitution is truly clean.

RoMAE tries to handle irregular time series and other continuous-position data inside a standard masked autoencoder by using rotary embeddings instead of learned ones. The abstract says this keeps the model general across modalities and beats specialized time-series methods on the ELAsTiCC challenge while matching normal MAE results on images and audio. They also check that learned embeddings break RoPE's relative-position property when reconstructing positions, which is a sensible sanity test.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces Rotary Masked Autoencoder (RoMAE), an extension of the standard Masked Autoencoder that incorporates Rotary Positional Embeddings (RoPE) for multidimensional continuous positions. It claims this enables interpolation and representation learning on irregular/multivariate time-series (and other modalities) without any time-series-specific architectural changes to masking, encoder, decoder, or loss, while surpassing specialized time-series models on the DESC ELAsTiCC Challenge and preserving MAE performance on images and audio; an additional investigation shows that learned embeddings break RoPE's relative-position property.

Significance. If the empirical results hold after full verification, the work would be significant for providing a parameter-light, specialization-free route to continuous positional modeling in transformers, potentially reducing complexity for irregular time-series and related modalities while retaining MAE's reconstruction-based pretraining benefits.

major comments (2)

[Abstract] The claim that RoMAE 'surpasses specialized time-series architectures on difficult datasets such as the DESC ELAsTiCC Challenge' is unsupported by any quantitative metrics, error bars, ablation tables, or dataset statistics, leaving the central performance claim unevaluable.
[Abstract] No equations, pseudocode, or implementation details are supplied for the direct substitution of RoPE into the MAE pipeline for continuous multidimensional positions, so it is impossible to confirm that no auxiliary mechanisms (e.g., custom normalization or adjusted masking) were introduced that would contradict the 'avoiding any time-series-specific architectural specializations' claim.

minor comments (1)

[Abstract] The final sentence on reconstructing embedded continuous positions would benefit from a brief statement of the observed outcome rather than only the negative finding about learned embeddings.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback on our manuscript. We address each major comment below and have revised the manuscript to strengthen the support for our claims while preserving the core contributions.

Referee: [Abstract] The claim that RoMAE 'surpasses specialized time-series architectures on difficult datasets such as the DESC ELAsTiCC Challenge' is unsupported by any quantitative metrics, error bars, ablation tables, or dataset statistics, leaving the central performance claim unevaluable.

Authors: We agree that the abstract would be strengthened by including key quantitative results to make the central claim immediately evaluable without requiring the reader to consult the full experiments section. The manuscript contains detailed results on the DESC ELAsTiCC Challenge, including performance metrics, comparisons against specialized time-series baselines, error bars from multiple random seeds, and dataset statistics. We have revised the abstract to incorporate concise quantitative highlights from those experiments.
Revision: yes

Referee: [Abstract] No equations, pseudocode, or implementation details are supplied for the direct substitution of RoPE into the MAE pipeline for continuous multidimensional positions, so it is impossible to confirm that no auxiliary mechanisms (e.g., custom normalization or adjusted masking) were introduced that would contradict the 'avoiding any time-series-specific architectural specializations' claim.

Authors: The body of the manuscript (Section 3) provides the equations for adapting RoPE to continuous multidimensional positions, along with a description confirming that the substitution uses only standard MAE masking, encoder, decoder, and loss components with no auxiliary mechanisms or modality-specific changes. To improve accessibility from the abstract, we have added a brief clarifying sentence and a reference to the relevant section in the revised abstract, and we have included a short pseudocode block in the main text for the integration step.
Revision: yes

Circularity Check

0 steps flagged

No derivation chain or equations present; claims are empirical

The provided document consists solely of the abstract, which introduces RoMAE as an extension of the standard Masked Autoencoder that applies Rotary Positional Embeddings to continuous multidimensional positions. No equations, derivations, fitted parameters, or self-citations are shown that could reduce any result to its inputs by construction. Performance claims on datasets such as DESC ELAsTiCC are presented as empirical outcomes rather than first-principles predictions, and the investigation of position reconstruction is described without technical steps that invite circularity analysis. Because no load-bearing derivation chain exists in the available text, the paper is self-contained against external benchmarks with no detectable circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review yields no explicit free parameters, axioms, or invented entities beyond the standard components of MAE and RoPE.

pith-pipeline@v0.9.0 · 5671 in / 998 out tokens · 26055 ms · 2026-05-19T12:28:08.608944+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AlexanderDuality.lean alexander_duality_circle_linking echoes

echoes: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

We utilize Axial RoPE... split into D subspaces... apply p-RoPE to each subspace... up to D=3
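The quoted passage describes Axial RoPE: the feature vector is split into D subspaces and each subspace is rotated by one coordinate of the position. A minimal sketch of that idea follows, assuming equal-sized subspaces and plain RoPE in place of the paper's p-RoPE, whose details are not given on this page. The names `rope_rotate` and `axial_rope`, the base of 10000, and the toy inputs are hypothetical.

```python
import math

def rope_rotate(x, pos, base=10000.0):
    """Standard RoPE rotation of consecutive feature pairs at a float position."""
    d = len(x)
    out = []
    for i in range(d // 2):
        angle = pos * base ** (-2.0 * i / d)
        c, s = math.cos(angle), math.sin(angle)
        a, b = x[2 * i], x[2 * i + 1]
        out.extend([a * c - b * s, a * s + b * c])
    return out

def axial_rope(x, position, base=10000.0):
    """Split x into len(position) equal subspaces; rotate subspace j by coordinate j."""
    n_axes = len(position)
    assert len(x) % (2 * n_axes) == 0, "each subspace needs an even dimension"
    size = len(x) // n_axes
    out = []
    for axis, coord in enumerate(position):
        out.extend(rope_rotate(x[axis * size:(axis + 1) * size], coord, base))
    return out

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# Toy 8-dimensional query/key with 2D continuous positions (hypothetical values).
q = [0.3, -1.2, 0.7, 0.5, 1.0, 0.2, -0.4, 0.8]
k = [1.1, 0.4, -0.6, 0.9, -0.3, 0.6, 0.5, -1.0]
p_q, p_k = (2.5, 7.25), (4.0, 1.5)
offset = (10.0, -3.5)

score = dot(axial_rope(q, p_q), axial_rope(k, p_k))
score_shifted = dot(
    axial_rope(q, [a + b for a, b in zip(p_q, offset)]),
    axial_rope(k, [a + b for a, b in zip(p_k, offset)]),
)

# Translating both positions by the same vector leaves the score unchanged.
print(abs(score - score_shifted) < 1e-9)
```

Because each axis acts on its own block of features, the relative-position property holds per axis, and the same function covers D = 1, 2 or 3 by changing only the length of `position`.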

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

66 extracted references · 66 canonical work pages · 5 internal anchors

[1] A. Arnab, M. Dehghani, G. Heigold, C. Sun, M. Lucic, and C. Schmid. Vivit: A video vision transformer. In 2021 IEEE/CVF International Conference on Computer Vision, ICCV 2021, Montreal, QC, Canada, October 10-17, 2021, pages 6816–6826. IEEE, 2021. doi: 10.1109/ICCV48922.2021.00676. URL https://doi.org/10.1109/ICCV48922.2021.00676

[2] L. J. Ba, J. R. Kiros, and G. E. Hinton. Layer normalization. CoRR, abs/1607.06450, 2016. URL http://arxiv.org/abs/1607.06450

[3] A. J. Bagnall, H. A. Dau, J. Lines, M. Flynn, J. Large, A. Bostrom, P. Southam, and E. J. Keogh. The UEA multivariate time series classification archive, 2018. CoRR, abs/1811.00075, 2018. URL http://arxiv.org/abs/1811.00075

[4] F. Barbero, A. Vitvitskyi, C. Perivolaropoulos, R. Pascanu, and P. Velickovic. Round and round we go! what makes rotary positional encodings useful? In The Thirteenth International Conference on Learning Representations, ICLR 2025, Singapore, April 24-28, 2025. OpenReview.net, 2025. URL https://openreview.net/forum?id=GtvuNrk58a

[6] P. Becker, H. Pandya, G. H. W. Gebhardt, C. Zhao, C. J. Taylor, and G. Neumann. Recurrent kalman networks: Factorized inference in high-dimensional deep feature spaces. In K. Chaudhuri and R. Salakhutdinov, editors, Proceedings of the 36th International Conference on Machine Learning, ICML 2019, 9-15 June 2019, Long Beach, California, USA, volume 97 of...

[7] T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. M. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, and D. Amo...

[8] Cabrera-Vives, G., Moreno-Cartagena, D., Astorga, N., Reyes-Jainaga, I., Förster, F., Huijse, P., Arredondo, J., Muñoz Arancibia, A. M., Bayo, A., Catelan, M., Estévez, P. A., Sánchez-Sáez, P., Álvarez, A., Castellanos, P., Gallardo, P., Moya, A., and Rodriguez-Mancini, D. Atat: Astronomical transformer for time series and tabular data. A&A, 689:A289, 2024...

[10] R. T. Chen, Y. Rubanova, J. Bettencourt, and D. Duvenaud. Neural Ordinary Differential Equations. In Advances in Neural Information Processing Systems (NeurIPS), pages 6571–6583, 2018

[11] T. Q. Chen, Y. Rubanova, J. Bettencourt, and D. Duvenaud. Neural ordinary differential equations. In S. Bengio, H. M. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, December 3-8, 2018, Montréal...

[12] Y. Chen, K. Ren, Y. Wang, Y. Fang, W. Sun, and D. Li. Contiformer: Continuous-time transformer for irregular time series modeling. In A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine, editors, Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orl...

[13] Y. Chen, Q. Wang, and Y. e. Fu. Continuous-time Transformer for Irregular Time-series Predictions. In Advances in Neural Information Processing Systems (NeurIPS), 2023

[14] E. D. Cubuk, B. Zoph, J. Shlens, and Q. Le. Randaugment: Practical automated data augmentation with a reduced search space. In H. Larochelle, M. Ranzato, R. Hadsell, M. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, v...

[15] J. Devlin, M. Chang, K. Lee, and K. Toutanova. BERT: pre-training of deep bidirectional transformers for language understanding. In J. Burstein, C. Doran, and T. Solorio, editors, Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, ...

[16] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, and N. Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021. OpenReview....

[17] S. Elfwing, E. Uchibe, and K. Doya. Sigmoid-weighted linear units for neural network function approximation in reinforcement learning. Neural Networks, 107:3–11, 2018. doi: 10.1016/J.NEUNET.2017.12.012. URL https://doi.org/10.1016/j.neunet.2017.12.012

[18] Y. Fang, Q. Sun, X. Wang, T. Huang, X. Wang, and Y. Cao. EVA-02: A visual representation for neon genesis. Image Vis. Comput., 149:105171, 2024. doi: 10.1016/J.IMAVIS.2024.105171. URL https://doi.org/10.1016/j.imavis.2024.105171

[19] C. Feichtenhofer, H. Fan, Y. Li, and K. He. Masked autoencoders as spatiotemporal learners. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, editors, Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, NeurIPS 2022, New Orleans, LA, USA, November 28 - December 9, ...

[20] S. Gao, T. Koker, O. Queen, T. Hartvigsen, T. Tsiligkaridis, and M. Zitnik. Units: A unified multi-task time series model. Advances in Neural Information Processing Systems, 37:140589–140631, 2024

[21] J. F. Gemmeke, D. P. Ellis, D. Freedman, A. Jansen, W. Lawrence, R. C. Moore, M. Plakal, and M. Ritter. Audio set: An ontology and human-labeled dataset for audio events. In 2017 IEEE international conference on acoustics, speech and signal processing (ICASSP), pages 776–780. IEEE, 2017

[22] Y. Gong, C. Lai, Y. Chung, and J. R. Glass. SSAST: self-supervised audio spectrogram transformer. In Thirty-Sixth AAAI Conference on Artificial Intelligence, AAAI 2022, Thirty-Fourth Conference on Innovative Applications of Artificial Intelligence, IAAI 2022, The Twelveth Symposium on Educational Advances in Artificial Intelligence, EAAI 2022 Virtual...

[24] K. He, X. Chen, S. Xie, Y. Li, P. Dollár, and R. B. Girshick. Masked autoencoders are scalable vision learners. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2022, New Orleans, LA, USA, June 18-24, 2022, pages 15979–15988. IEEE, 2022. doi: 10.1109/CVPR52688.2022.01553. URL https://doi.org/10.1109/CVPR52688.2022.01553

[25] D. Hendrycks and K. Gimpel. Bridging nonlinearities and stochastic regularizers with gaussian error linear units. CoRR, abs/1606.08415, 2016. URL http://arxiv.org/abs/1606.08415

[26] B. Heo, S. Park, D. Han, and S. Yun. Rotary position embedding for vision transformer. In A. Leonardis, E. Ricci, S. Roth, O. Russakovsky, T. Sattler, and G. Varol, editors, Computer Vision - ECCV 2024 - 18th European Conference, Milan, Italy, September 29-October 4, 2024, Proceedings, Part X, volume 15068 of Lecture Notes in Computer Science, pages 289–305...

[27] G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Improving neural networks by preventing co-adaptation of feature detectors. CoRR, abs/1207.0580, 2012. URL http://arxiv.org/abs/1207.0580

[28] G. Huang, Y. Sun, Z. Liu, D. Sedra, and K. Q. Weinberger. Deep networks with stochastic depth. In B. Leibe, J. Matas, N. Sebe, and M. Welling, editors, Computer Vision - ECCV 2016 - 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part IV, volume 9908 of Lecture Notes in Computer Science, pages 646–661. Springer, 2016...

[29] P. Huang, H. Xu, J. Li, A. Baevski, M. Auli, W. Galuba, F. Metze, and C. Feichtenhofer. Masked autoencoders that listen. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, editors, Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, NeurIPS 2022, New Orleans, LA, USA, Nov...

[30] P. Jeevan and A. Sethi. Resource-efficient hybrid x-formers for vision. In IEEE/CVF Winter Conference on Applications of Computer Vision, WACV 2022, Waikoloa, HI, USA, January 3-8, 2022, pages 3555–3563. IEEE, 2022. doi: 10.1109/WACV51458.2022.00361. URL https://doi.org/10.1109/WACV51458.2022.00361

[31] P. Kidger, J. Morrill, J. Foster, and T. J. Lyons. Neural controlled differential equations for irregular time series. In H. Larochelle, M. Ranzato, R. Hadsell, M. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual,...

[32] Y. Le and X. S. Yang. Tiny imagenet visual recognition challenge. 2015. URL http://vision.stanford.edu/teaching/cs231n/reports/2015/pdfs/yle_project.pdf

[33] Z. Li, S. Li, and X. Yan. Time series as images: Vision transformer for irregularly sampled time series. In A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine, editors, Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 - 16,...

[34] Y. Liu, H. Zhang, C. Li, X. Huang, J. Wang, and M. Long. Timer: Generative pre-trained transformers are large time series models. In Forty-first International Conference on Machine Learning, ICML 2024, Vienna, Austria, July 21-27, 2024. OpenReview.net, 2024. URL https://openreview.net/forum?id=bYRYb7DMNo

[35] Z. Lu, Z. Wang, D. Huang, C. Wu, X. Liu, W. Ouyang, and L. Bai. Fit: Flexible vision transformer for diffusion model. In Forty-first International Conference on Machine Learning, ICML 2024, Vienna, Austria, July 21-27, 2024. OpenReview.net, 2024. URL https://openreview.net/forum?id=jZVen2JguY

[36] M. Moor, B. Rieck, M. Horn, C. R. Jutzeler, and K. Borgwardt. Early Recognition of Sepsis with Heteroscedastic Temporal Variational Autoencoders. In International Conference on Machine Learning (ICML), pages 7781–7792, 2021

[37] V. Panayotov, G. Chen, D. Povey, and S. Khudanpur. Librispeech: An asr corpus based on public domain audio books. In 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5206–5210, 2015. doi: 10.1109/ICASSP.2015.7178964

[38] H. Patel, R. Qiu, A. Irwin, S. Sadiq, and S. Wang. EMIT - event-based masked auto encoding for irregular time series. In E. Baralis, K. Zhang, E. Damiani, M. Debbah, P. Kalnis, and X. Wu, editors, IEEE International Conference on Data Mining, ICDM 2024, Abu Dhabi, United Arab Emirates, December 9-12, 2024, pages 370–379. IEEE, 2024. doi: 10.1109/ICDM59182....

[39] B. Peng, J. Quesnelle, H. Fan, and E. Shippole. Yarn: Efficient context window extension of large language models. In The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview.net, 2024. URL https://openreview.net/forum?id=wHBfxhZu1u

[40] K. J. Piczak. Esc: Dataset for environmental sound classification. In Proceedings of the 23rd ACM International Conference on Multimedia, MM ’15, page 1015–1018, New York, NY, USA, 2015. Association for Computing Machinery. ISBN 9781450334594. doi: 10.1145/2733373.2806390. URL https://doi.org/10.1145/2733373.2806390

[42] A. Radford, K. Narasimhan, T. Salimans, I. Sutskever, et al. Improving language understanding by generative pre-training. OpenAI blog, 2018. URL https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf

[43] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever, et al. Language models are unsupervised multitask learners. OpenAI blog, 2019. URL https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf

[44] N. Rahaman, A. Baratin, D. Arpit, F. Draxler, M. Lin, F. Hamprecht, Y. Bengio, and A. Courville. On the spectral bias of neural networks. In International conference on machine learning, pages 5301–5310. PMLR, 2019

[46] Y. Rubanova, T. Q. Chen, and D. Duvenaud. Latent ordinary differential equations for irregularly-sampled time series. In H. M. Wallach, H. Larochelle, A. Beygelzimer, F. d’Alché-Buc, E. B. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019,...

[47] M. Schirmer, M. Eltayeb, S. Lessmann, and M. Rudolph. Modeling irregular time series with continuous recurrent units. In K. Chaudhuri, S. Jegelka, L. Song, C. Szepesvári, G. Niu, and S. Sabato, editors, International Conference on Machine Learning, ICML 2022, 17-23 July 2022, Baltimore, Maryland, USA, volume 162 of Proceedings of Machine Learning Research, ...

[48] S. N. Shukla and B. M. Marlin. Multi-time attention networks for irregularly sampled time series. In 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021. OpenReview.net, 2021. URL https://openreview.net/forum?id=4c0J6lwQ4_

[50] S. N. Shukla and B. M. Marlin. Heteroscedastic temporal variational autoencoder for irregularly sampled time series. In The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022. OpenReview.net, 2022. URL https://openreview.net/forum?id=Az7opqbQE-3

[51] I. Silva, B. Moody, D. Scott, L. Celi, R. Mark, and G. Clifford. The PhysioNet/Computing in Cardiology Challenge 2012: Predicting In-Hospital Mortality from ICU Data. In Computing in Cardiology, pages 245–248, 2012

[52] I. Silva, G. Moody, D. J. Scott, L. A. Celi, and R. G. Mark. Predicting in-hospital mortality of icu patients: The physionet/computing in cardiology challenge 2012. In 2012 computing in cardiology, pages 245–248. IEEE, 2012

[53] J. T. H. Smith, A. Warrington, and S. W. Linderman. Simplified state space layers for sequence modeling. In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023. OpenReview.net, 2023. URL https://openreview.net/forum?id=Ai8Hw3AXqks

[54] Z. Song, Q. Lu, H. Zhu, D. Buckeridge, and Y. Li. Trajgpt: Irregular time-series representation learning for health trajectory analysis. CoRR, abs/2410.02133, 2024. doi: 10.48550/ARXIV.2410.02133. URL https://doi.org/10.48550/arXiv.2410.02133

[55] N. Stroh. Trackgpt–a generative pre-trained transformer for cross-domain entity trajectory forecasting. arXiv preprint arXiv:2402.00066, 2024

[56] J. Su, M. H. M. Ahmed, Y. Lu, S. Pan, W. Bo, and Y. Liu. Roformer: Enhanced transformer with rotary position embedding. Neurocomputing, 568:127063, 2024. doi: 10.1016/J.NEUCOM.2023.127063. URL https://doi.org/10.1016/j.neucom.2023.127063

[57] K. Su, Q. Wu, P. Cai, X. Zhu, X. Lu, Z. Wang, and K. Hu. RI-MAE: rotation-invariant masked autoencoders for self-supervised point cloud representation learning. In T. Walsh, J. Shah, and Z. Kolter, editors, AAAI-25, Sponsored by the Association for the Advancement of Artificial Intelligence, February 25 - March 4, 2025, Philadelphia, PA, USA, pages 7015–70...

[58] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna. Rethinking the inception architecture for computer vision. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016, pages 2818–2826. IEEE Computer Society, 2016. doi: 10.1109/CVPR.2016.308. URL https://doi.org/10.1109/CVPR.2016.308

[60] Z. Tong, Y. Song, J. Wang, and L. Wang. Videomae: Masked autoencoders are data-efficient learners for self-supervised video pre-training. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, editors, Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, NeurIPS 2022, New...

[61] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin. Attention is all you need. In I. Guyon, U. von Luxburg, S. Bengio, H. M. Wallach, R. Fergus, S. V. N. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Syste...

[62] L. Wang, B. Huang, Z. Zhao, Z. Tong, Y. He, Y. Wang, Y. Wang, and Y. Qiao. Videomae V2: scaling video masked autoencoders with dual masking. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2023, Vancouver, BC, Canada, June 17-24, 2023, pages 14549–14560. IEEE, 2023. doi: 10.1109/CVPR52729.2023.01398. URL https://doi.org/10.1109...

[63] Q. Wen, T. Zhou, C. Zhang, W. Chen, Z. Ma, J. Yan, and L. Sun. Transformers in time series: A survey. In Proceedings of the Thirty-Second International Joint Conference on Artificial Intelligence, IJCAI 2023, 19th-25th August 2023, Macao, SAR, China, pages 6778–6786. ijcai.org, 2023. doi: 10.24963/IJCAI.2023/759. URL https://doi.org/10.24963/ijcai.2023/759

[64] T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf, M. Funtowicz, et al. Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 conference on empirical methods in natural language processing: system demonstrations, pages 38–45, 2020

[65] W. Xiong, J. Liu, I. Molybog, H. Zhang, P. Bhargava, R. Hou, L. Martin, R. Rungta, K. A. Sankararaman, B. Oguz, M. Khabsa, H. Fang, Y. Mehdad, S. Narang, K. Malik, A. Fan, S. Bhosale, S. Edunov, M. Lewis, S. Wang, and H. Ma. Effective long-context scaling of foundation models. In K. Duh, H. Gómez-Adorno, and S. Bethard, editors, Proceedings of the 2024 Co...

doi: 10.18653/V1/2024.NAACL-LONG.260. URL https://doi.org/10.18653/v1/ 2024.naacl-long.260

work page doi:10.18653/v1/2024.naacl-long.260 2024
[67]

M. Xu, X. Men, B. Wang, Q. Zhang, H. Lin, X. Han, and W. Chen. Base of rope bounds context length. In A. Globersons, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. M. Tomczak, and C. Zhang, editors,Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, NeurIPS 2024, Vancouver, BC, Canada, Dece...

work page 2024
[68]

Z.-Q. J. Xu, Y . Zhang, and T. Luo. Overview frequency principle/spectral bias in deep learning. Communications on Applied Mathematics and Computation, 7(3):827–864, 2025

work page 2025
[69]

Zerveas, S

G. Zerveas, S. Jayaraman, D. Patel, A. Bhamidipaty, and C. Eickhoff. A transformer-based framework for multivariate time series representation learning. In F. Zhu, B. C. Ooi, and C. Miao, editors,KDD ’21: The 27th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, Virtual Event, Singapore, August 14-18, 2021, pages 2114–2124. ACM, 2021. doi: 10...

work page doi:10.1145/3447548.3467401 2021
[70]

Zhang and R

B. Zhang and R. Sennrich. Root mean square layer normalization. In H. M. Wallach, H. Larochelle, A. Beygelzimer, F. d’Alché-Buc, E. B. Fox, and R. Garnett, editors,Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, December 8-14, 2019, Vancouver, BC, Canada, pages 12360–1237...

work page 2019
