
arxiv: 2505.20535 · v3 · submitted 2025-05-26 · 💻 cs.LG

Rotary Masked Autoencoders are Versatile Learners

Pith reviewed 2026-05-19 12:28 UTC · model grok-4.3

classification 💻 cs.LG
keywords Rotary Positional Embeddings · Masked Autoencoders · Time Series · Representation Learning · Irregular Sampling · Multimodal Learning

The pith

RoMAE applies rotary positional embeddings to masked autoencoders so that continuous positions can be handled across modalities without time-series-specific customizations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents RoMAE as an extension of the standard Masked Autoencoder that replaces usual positional encodings with rotary embeddings adapted to continuous multidimensional inputs. This setup targets irregular time-series and similar data where positions are not on a fixed grid. A sympathetic reader would care because the method claims to avoid the extra architectural changes and computational costs that usually come with handling uneven sampling or multivariate continuous signals. The work shows RoMAE matching or exceeding specialized models on challenging benchmarks while preserving baseline MAE behavior on images and audio.
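The masking step that the MAE framework relies on can be sketched for an irregularly sampled series as follows. This is a minimal illustration, not the authors' code: the helper `random_mask`, the 0.75 mask ratio (the value commonly used for image MAEs), and the toy sine signal are all assumptions made for the example.

```python
import numpy as np

def random_mask(num_tokens, mask_ratio, rng):
    """Return sorted index arrays (visible, masked) for MAE-style random masking."""
    num_masked = int(round(num_tokens * mask_ratio))
    perm = rng.permutation(num_tokens)
    masked = np.sort(perm[:num_masked])
    visible = np.sort(perm[num_masked:])
    return visible, masked

rng = np.random.default_rng(0)

# Hypothetical irregular series: 20 observations at uneven timestamps.
times = np.sort(rng.uniform(0.0, 10.0, size=20))
values = np.sin(times)

# Assumed mask ratio; the paper's actual setting is not given in this page.
visible, masked = random_mask(len(times), mask_ratio=0.75, rng=rng)

encoder_inputs = values[visible]      # only visible tokens reach the encoder
encoder_positions = times[visible]    # their real-valued positions travel with them
target_values = values[masked]        # the decoder is trained to reconstruct these
target_positions = times[masked]      # positions at which reconstruction is queried
```

Because each token carries its own real-valued timestamp, nothing in this step assumes a fixed sampling grid.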

Core claim

RoMAE utilizes Rotary Positional Embedding for continuous positions in the MAE framework, enabling interpolation and representation learning with multidimensional continuous positional information while avoiding any time-series-specific architectural specializations. It surpasses specialized time-series architectures on difficult datasets such as the DESC ELAsTiCC Challenge while maintaining MAE's usual performance across other modalities. The paper also reports that including learned embeddings in the input sequence breaks RoPE's relative position property.

What carries the argument

Rotary Positional Embedding (RoPE) adapted for continuous positions, inserted directly into the Masked Autoencoder pipeline to encode relative positional information that supports interpolation without grid assumptions.
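A small numerical sketch may help make the mechanism concrete. It applies the standard RoPE rotation to real-valued positions and checks that the attention logit depends only on the offset between two positions. The function name `rope_rotate`, the frequency base of 10000, and the example timestamps are assumptions for illustration; the paper's exact parameterisation may differ.

```python
import numpy as np

def rope_rotate(x, pos, base=10000.0):
    """Rotate feature pairs of x by angles proportional to continuous positions.

    x:   (seq_len, dim) array, dim must be even
    pos: (seq_len,) array of real-valued positions (e.g. timestamps)
    """
    seq_len, dim = x.shape
    assert dim % 2 == 0, "feature dimension must be even"
    # One frequency per feature pair, following the usual RoPE schedule.
    freqs = base ** (-np.arange(dim // 2) * 2.0 / dim)   # (dim/2,)
    angles = pos[:, None] * freqs[None, :]               # (seq_len, dim/2)
    cos, sin = np.cos(angles), np.sin(angles)
    x_even, x_odd = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x_even * cos - x_odd * sin
    out[:, 1::2] = x_even * sin + x_odd * cos
    return out

rng = np.random.default_rng(0)
q = rng.normal(size=(1, 8))
k = rng.normal(size=(1, 8))

# Two pairs of non-integer positions that share the same offset (1.75).
logit_a = (rope_rotate(q, np.array([0.3])) @ rope_rotate(k, np.array([2.05])).T).item()
logit_b = (rope_rotate(q, np.array([10.3])) @ rope_rotate(k, np.array([12.05])).T).item()

# The two logits should agree up to floating-point error.
print(np.isclose(logit_a, logit_b))
```

Nothing in the rotation requires integer positions, which is why the same operation can in principle be fed irregular timestamps directly.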

If this is right

  • RoMAE performs representation learning on irregular and multivariate time-series without added complexity.
  • The same model maintains standard MAE reconstruction quality on images and audio.
  • Learned embeddings in the sequence interfere with RoPE's relative position encoding.
  • RoMAE can interpolate across continuous positional dimensions in the input sequence.
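
The third bullet can be read against the standard RoPE identity. The derivation below is a sketch under assumed notation: q and k are query and key vectors, p_m and p_n are real-valued positions, R(p) is the block-diagonal RoPE rotation, and e is a learned token. Treating e as unrotated is one plausible reading of the mechanism, not a derivation taken from the paper.

```latex
% Assumed notation (not taken from the paper).
\begin{align*}
\langle R(p_m)\,q,\; R(p_n)\,k \rangle
  &= q^\top R(p_m)^\top R(p_n)\, k
   = q^\top R(p_n - p_m)\, k, \\
\langle R(p_m + c)\,q,\; R(p_n + c)\,k \rangle
  &= q^\top R(p_n - p_m)\, k
  \qquad \text{for any shift } c, \\
\langle e,\; R(p_n + c)\,k \rangle
  &= e^\top R(p_n + c)\, k
  \qquad \text{(unrotated learned token } e\text{: depends on } c\text{)}.
\end{align*}
```

The first two lines say that logits between rotated tokens are unchanged by a global shift of all positions. The third line shows how a token that does not take part in the rotation could act as an absolute reference point, which is consistent with the finding that learned embeddings break the relative property.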

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The approach may extend naturally to other continuous-valued domains such as spatial sensor grids or scientific measurement series.
  • Direct substitution of rotary embeddings could reduce the need for modality-specific positional modules in future transformer variants.

Load-bearing premise

Rotary positional embeddings can be directly substituted into the masked autoencoder framework for continuous positions without introducing hidden limitations or needing further adjustments.

What would settle it

An experiment where RoMAE is tested on irregular time-series data with known gaps and shows clear degradation compared to specialized architectures, or fails to reconstruct positions while preserving relative properties.
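Such a test could be set up along the following lines. This is a hypothetical harness, not the paper's protocol: the bi-frequency signal, the gap location, the frequencies 0.3 and 1.1, and the linear-interpolation baseline are all illustrative assumptions. Any model under test would be passed in place of `linear_baseline` with the same call signature.

```python
import numpy as np

def make_gapped_series(rng, n_obs=60, gap=(4.0, 6.0)):
    """Sample a two-frequency signal at irregular times and hold out a gap."""
    t = np.sort(rng.uniform(0.0, 10.0, size=n_obs))
    y = np.sin(2 * np.pi * 0.3 * t) + 0.5 * np.sin(2 * np.pi * 1.1 * t)
    in_gap = (t >= gap[0]) & (t <= gap[1])
    # Observed points outside the gap, held-out points inside it.
    return t[~in_gap], y[~in_gap], t[in_gap], y[in_gap]

def interpolation_mse(predict, t_obs, y_obs, t_query, y_true):
    """Mean squared error of a predictor on the held-out gap."""
    y_pred = predict(t_obs, y_obs, t_query)
    return float(np.mean((y_pred - y_true) ** 2))

def linear_baseline(t_obs, y_obs, t_query):
    """Piecewise-linear interpolation as a simple reference point."""
    return np.interp(t_query, t_obs, y_obs)

rng = np.random.default_rng(0)
t_obs, y_obs, t_gap, y_gap = make_gapped_series(rng)
mse = interpolation_mse(linear_baseline, t_obs, y_obs, t_gap, y_gap)
print(f"baseline MSE inside the gap: {mse:.4f}")
```

Comparing this number across RoMAE, a specialized irregular time-series model, and the baseline, over several seeds and gap widths, is the kind of evidence the paragraph above has in mind.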

Figures

Figures reproduced from arXiv: 2505.20535 by Andre Scaffidi, Gabriella Contardo, Martín de los Rios, Roberto Trotta, Serafina Di Gioia, Uros Zivanovic.

Figure 1: Overview of the RoMAE pipeline. Left: Visualisation of data embedding via multi-dimensional (ND) patchification for illustrative data realisations in 1, 2 and 3D. Centre: Full depiction of RoMAE architecture. The optional [CLS] token is omitted from the input sequence for simplicity. Right: The RoMAE encoder/decoder with RoPE operations denoted by rotational arrows.

Figure 2: RoMAE position reconstruction MSE across two positional ranges.

Figure 3: Average MSE obtained from the interpolation task using RoMAE-tiny for time-series with …

Figure 4: Same as above but now for time-series with two frequency modes present in the signal.

Figure 5: Illustrative realisation from the evaluation of RoMAE on the bi-frequency time series.

Figure 6: A training example from the ELAsTiCC dataset. The flux difference of each band has …

Figure 7: Two sample realisations of differing chirality from the test set of spirals. The green line is …

Figure 8: Samples from interpolation tests using n = 3, 10 and 20 observations.

D.9 PhysioNet. We adopt the pre-processed release of the PhysioNet/CinC 2012 Challenge [50], comprising multivariate clinical time-series collected during the 48h window following intensive-care-unit (ICU) admission. Static covariates (Age, Gender, Height, ICU type) occupy feature indices 0–3 and are always observed, whereas the remaining …

Original abstract

Applying Transformers to irregular time-series typically requires specializations to their baseline architecture, which can result in additional computational overhead and increased method complexity. We present the Rotary Masked Autoencoder (RoMAE), which utilizes the popular Rotary Positional Embedding (RoPE) method for continuous positions. RoMAE is an extension to the Masked Autoencoder (MAE) that enables interpolation and representation learning with multidimensional continuous positional information while avoiding any time-series-specific architectural specializations. We showcase RoMAE's performance on a variety of modalities including irregular and multivariate time-series, images, and audio, demonstrating that RoMAE surpasses specialized time-series architectures on difficult datasets such as the DESC ELAsTiCC Challenge while maintaining MAE's usual performance across other modalities. In addition, we investigate RoMAE's ability to reconstruct the embedded continuous positions, demonstrating that including learned embeddings in the input sequence breaks RoPE's relative position property.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces Rotary Masked Autoencoder (RoMAE), an extension of the standard Masked Autoencoder that incorporates Rotary Positional Embeddings (RoPE) for multidimensional continuous positions. It claims this enables interpolation and representation learning on irregular/multivariate time-series (and other modalities) without any time-series-specific architectural changes to masking, encoder, decoder, or loss, while surpassing specialized time-series models on the DESC ELAsTiCC Challenge and preserving MAE performance on images and audio; an additional investigation shows that learned embeddings break RoPE's relative-position property.

Significance. If the empirical results hold after full verification, the work would be significant for providing a parameter-light, specialization-free route to continuous positional modeling in transformers, potentially reducing complexity for irregular time-series and related modalities while retaining MAE's reconstruction-based pretraining benefits.

major comments (2)
  1. [Abstract] Abstract: the claim that RoMAE 'surpasses specialized time-series architectures on difficult datasets such as the DESC ELAsTiCC Challenge' is unsupported by any quantitative metrics, error bars, ablation tables, or dataset statistics, leaving the central performance claim unevaluable.
  2. [Abstract] Abstract: no equations, pseudocode, or implementation details are supplied for the direct substitution of RoPE into the MAE pipeline for continuous multidimensional positions, so it is impossible to confirm that no auxiliary mechanisms (e.g., custom normalization or adjusted masking) were introduced that would contradict the 'avoiding any time-series-specific architectural specializations' claim.
minor comments (1)
  1. [Abstract] Abstract: the final sentence on reconstructing embedded continuous positions would benefit from a brief statement of the observed outcome rather than only the negative finding about learned embeddings.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback on our manuscript. We address each major comment below and have revised the manuscript to strengthen the support for our claims while preserving the core contributions.

  1. Referee: [Abstract] Abstract: the claim that RoMAE 'surpasses specialized time-series architectures on difficult datasets such as the DESC ELAsTiCC Challenge' is unsupported by any quantitative metrics, error bars, ablation tables, or dataset statistics, leaving the central performance claim unevaluable.

    Authors: We agree that the abstract would be strengthened by including key quantitative results to make the central claim immediately evaluable without requiring the reader to consult the full experiments section. The manuscript contains detailed results on the DESC ELAsTiCC Challenge, including performance metrics, comparisons against specialized time-series baselines, error bars from multiple random seeds, and dataset statistics. We have revised the abstract to incorporate concise quantitative highlights from those experiments. revision: yes

  2. Referee: [Abstract] Abstract: no equations, pseudocode, or implementation details are supplied for the direct substitution of RoPE into the MAE pipeline for continuous multidimensional positions, so it is impossible to confirm that no auxiliary mechanisms (e.g., custom normalization or adjusted masking) were introduced that would contradict the 'avoiding any time-series-specific architectural specializations' claim.

    Authors: The body of the manuscript (Section 3) provides the equations for adapting RoPE to continuous multidimensional positions, along with a description confirming that the substitution uses only standard MAE masking, encoder, decoder, and loss components with no auxiliary mechanisms or modality-specific changes. To improve accessibility from the abstract, we have added a brief clarifying sentence and a reference to the relevant section in the revised abstract, and we have included a short pseudocode block in the main text for the integration step. revision: yes

Circularity Check

0 steps flagged

No derivation chain or equations present; claims are empirical


The provided document consists solely of the abstract, which introduces RoMAE as an extension of the standard Masked Autoencoder that applies Rotary Positional Embeddings to continuous multidimensional positions. No equations, derivations, fitted parameters, or self-citations are shown that could reduce any result to its inputs by construction. Performance claims on datasets such as DESC ELAsTiCC are presented as empirical outcomes rather than first-principles predictions, and the investigation of position reconstruction is described without technical steps that invite circularity analysis. Because no load-bearing derivation chain exists in the available text, the paper is self-contained against external benchmarks with no detectable circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review yields no explicit free parameters, axioms, or invented entities beyond the standard components of MAE and RoPE.

pith-pipeline@v0.9.0 · 5671 in / 998 out tokens · 26055 ms · 2026-05-19T12:28:08.608944+00:00 · methodology



Reference graph

Works this paper leans on

66 extracted references · 66 canonical work pages · 5 internal anchors

  1. [1]

    A. Arnab, M. Dehghani, G. Heigold, C. Sun, M. Lucic, and C. Schmid. Vivit: A video vision transformer. In 2021 IEEE/CVF International Conference on Computer Vision, ICCV 2021, Montreal, QC, Canada, October 10-17, 2021, pages 6816–6826. IEEE, 2021. doi: 10.1109/ICCV48922.2021.00676. URL https://doi.org/10.1109/ICCV48922.2021.00676

  2. [2]

    L. J. Ba, J. R. Kiros, and G. E. Hinton. Layer normalization. CoRR, abs/1607.06450, 2016. URL http://arxiv.org/abs/1607.06450

  3. [3]

    A. J. Bagnall, H. A. Dau, J. Lines, M. Flynn, J. Large, A. Bostrom, P. Southam, and E. J. Keogh. The UEA multivariate time series classification archive, 2018. CoRR, abs/1811.00075, 2018. URL http://arxiv.org/abs/1811.00075

  4. [4]

    F. Barbero, A. Vitvitskyi, C. Perivolaropoulos, R. Pascanu, and P. Velickovic. Round and round we go! what makes rotary positional encodings useful? In The Thirteenth International Conference on Learning Representations, ICLR 2025, Singapore, April 24-28, 2025. OpenReview.net,

  5. [5]

    URL https://openreview.net/forum?id=GtvuNrk58a

  6. [6]

    P. Becker, H. Pandya, G. H. W. Gebhardt, C. Zhao, C. J. Taylor, and G. Neumann. Recurrent kalman networks: Factorized inference in high-dimensional deep feature spaces. In K. Chaudhuri and R. Salakhutdinov, editors, Proceedings of the 36th International Conference on Machine Learning, ICML 2019, 9-15 June 2019, Long Beach, California, USA, volume 97 of...

  7. [7]

    T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. M. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, and D. Amo...

  8. [8]

    Cabrera-Vives, G., Moreno-Cartagena, D., Astorga, N., Reyes-Jainaga, I., Förster, F., Huijse, P., Arredondo, J., Muñoz Arancibia, A. M., Bayo, A., Catelan, M., Estévez, P. A., Sánchez-Sáez, P., Álvarez, A., Castellanos, P., Gallardo, P., Moya, A., and Rodriguez-Mancini, D. Atat: Astronomical transformer for time series and tabular data. A&A, 689:A289, 2024...

  9. [10]

    R. T. Chen, Y. Rubanova, J. Bettencourt, and D. Duvenaud. Neural Ordinary Differential Equations. In Advances in Neural Information Processing Systems (NeurIPS), pages 6571–6583, 2018

  10. [11]

    T. Q. Chen, Y. Rubanova, J. Bettencourt, and D. Duvenaud. Neural ordinary differential equations. In S. Bengio, H. M. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, December 3-8, 2018, Montréal...

  11. [12]

    Y. Chen, K. Ren, Y. Wang, Y. Fang, W. Sun, and D. Li. Contiformer: Continuous-time transformer for irregular time series modeling. In A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine, editors, Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orl...

  12. [13]

    Y. Chen, Q. Wang, and Y. e. Fu. Continuous-time Transformer for Irregular Time-series Predictions. In Advances in Neural Information Processing Systems (NeurIPS), 2023

  13. [14]

    E. D. Cubuk, B. Zoph, J. Shlens, and Q. Le. Randaugment: Practical automated data augmentation with a reduced search space. In H. Larochelle, M. Ranzato, R. Hadsell, M. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, v...

  14. [15]

    J. Devlin, M. Chang, K. Lee, and K. Toutanova. BERT: pre-training of deep bidirectional transformers for language understanding. In J. Burstein, C. Doran, and T. Solorio, editors, Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, ...

  15. [16]

    A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, and N. Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021. OpenReview....

  16. [17]

    S. Elfwing, E. Uchibe, and K. Doya. Sigmoid-weighted linear units for neural network function approximation in reinforcement learning. Neural Networks, 107:3–11, 2018. doi: 10.1016/J.NEUNET.2017.12.012. URL https://doi.org/10.1016/j.neunet.2017.12.012

  17. [18]

    Y. Fang, Q. Sun, X. Wang, T. Huang, X. Wang, and Y. Cao. EVA-02: A visual representation for neon genesis. Image Vis. Comput., 149:105171, 2024. doi: 10.1016/J.IMAVIS.2024.105171. URL https://doi.org/10.1016/j.imavis.2024.105171

  18. [19]

    C. Feichtenhofer, H. Fan, Y. Li, and K. He. Masked autoencoders as spatiotemporal learners. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, editors, Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, NeurIPS 2022, New Orleans, LA, USA, November 28 - December 9, ...

  19. [20]

    S. Gao, T. Koker, O. Queen, T. Hartvigsen, T. Tsiligkaridis, and M. Zitnik. Units: A unified multi-task time series model. Advances in Neural Information Processing Systems, 37:140589–140631, 2024

  20. [21]

    J. F. Gemmeke, D. P. Ellis, D. Freedman, A. Jansen, W. Lawrence, R. C. Moore, M. Plakal, and M. Ritter. Audio set: An ontology and human-labeled dataset for audio events. In 2017 IEEE international conference on acoustics, speech and signal processing (ICASSP), pages 776–780. IEEE, 2017

  21. [22]

    Y. Gong, C. Lai, Y. Chung, and J. R. Glass. SSAST: self-supervised audio spectrogram transformer. In Thirty-Sixth AAAI Conference on Artificial Intelligence, AAAI 2022, Thirty-Fourth Conference on Innovative Applications of Artificial Intelligence, IAAI 2022, The Twelveth Symposium on Educational Advances in Artificial Intelligence, EAAI 2022 Virtual...

  22. [24]

    K. He, X. Chen, S. Xie, Y. Li, P. Dollár, and R. B. Girshick. Masked autoencoders are scalable vision learners. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2022, New Orleans, LA, USA, June 18-24, 2022, pages 15979–15988. IEEE, 2022. doi: 10.1109/CVPR52688.2022.01553. URL https://doi.org/10.1109/CVPR52688.2022.01553

  23. [25]

    Gaussian Error Linear Units (GELUs)

    D. Hendrycks and K. Gimpel. Bridging nonlinearities and stochastic regularizers with gaussian error linear units. CoRR, abs/1606.08415, 2016. URL http://arxiv.org/abs/1606.08415

  24. [26]

    B. Heo, S. Park, D. Han, and S. Yun. Rotary position embedding for vision transformer. In A. Leonardis, E. Ricci, S. Roth, O. Russakovsky, T. Sattler, and G. Varol, editors, Computer Vision - ECCV 2024 - 18th European Conference, Milan, Italy, September 29-October 4, 2024, Proceedings, Part X, volume 15068 of Lecture Notes in Computer Science, pages 289–305...

  25. [27]

    G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Improving neural networks by preventing co-adaptation of feature detectors. CoRR, abs/1207.0580, 2012. URL http://arxiv.org/abs/1207.0580

  26. [28]

    G. Huang, Y. Sun, Z. Liu, D. Sedra, and K. Q. Weinberger. Deep networks with stochastic depth. In B. Leibe, J. Matas, N. Sebe, and M. Welling, editors, Computer Vision - ECCV 2016 - 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part IV, volume 9908 of Lecture Notes in Computer Science, pages 646–661. Springer, 2016...

  27. [29]

    P. Huang, H. Xu, J. Li, A. Baevski, M. Auli, W. Galuba, F. Metze, and C. Feichtenhofer. Masked autoencoders that listen. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, editors, Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, NeurIPS 2022, New Orleans, LA, USA, Nov...

  28. [30]

    P. Jeevan and A. Sethi. Resource-efficient hybrid x-formers for vision. In IEEE/CVF Winter Conference on Applications of Computer Vision, WACV 2022, Waikoloa, HI, USA, January 3-8, 2022, pages 3555–3563. IEEE, 2022. doi: 10.1109/WACV51458.2022.00361. URL https://doi.org/10.1109/WACV51458.2022.00361

  29. [31]

    P. Kidger, J. Morrill, J. Foster, and T. J. Lyons. Neural controlled differential equations for irregular time series. In H. Larochelle, M. Ranzato, R. Hadsell, M. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual,...

  30. [32]

    Y. Le and X. S. Yang. Tiny imagenet visual recognition challenge. 2015. URL http://vision.stanford.edu/teaching/cs231n/reports/2015/pdfs/yle_project.pdf

  31. [33]

    Z. Li, S. Li, and X. Yan. Time series as images: Vision transformer for irregularly sampled time series. In A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine, editors, Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 - 16,...

  32. [34]

    Y. Liu, H. Zhang, C. Li, X. Huang, J. Wang, and M. Long. Timer: Generative pre-trained transformers are large time series models. In Forty-first International Conference on Machine Learning, ICML 2024, Vienna, Austria, July 21-27, 2024. OpenReview.net, 2024. URL https://openreview.net/forum?id=bYRYb7DMNo

  33. [35]

    Z. Lu, Z. Wang, D. Huang, C. Wu, X. Liu, W. Ouyang, and L. Bai. Fit: Flexible vision transformer for diffusion model. In Forty-first International Conference on Machine Learning, ICML 2024, Vienna, Austria, July 21-27, 2024. OpenReview.net, 2024. URL https://openreview.net/forum?id=jZVen2JguY.

  34. [36]

    M. Moor, B. Rieck, M. Horn, C. R. Jutzeler, and K. Borgwardt. Early Recognition of Sepsis with Heteroscedastic Temporal Variational Autoencoders. In International Conference on Machine Learning (ICML), pages 7781–7792, 2021

  35. [37]

    V. Panayotov, G. Chen, D. Povey, and S. Khudanpur. Librispeech: An asr corpus based on public domain audio books. In 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5206–5210, 2015. doi: 10.1109/ICASSP.2015.7178964

  36. [38]

    H. Patel, R. Qiu, A. Irwin, S. Sadiq, and S. Wang. EMIT - event-based masked auto encoding for irregular time series. In E. Baralis, K. Zhang, E. Damiani, M. Debbah, P. Kalnis, and X. Wu, editors, IEEE International Conference on Data Mining, ICDM 2024, Abu Dhabi, United Arab Emirates, December 9-12, 2024, pages 370–379. IEEE, 2024. doi: 10.1109/ICDM59182....

  37. [39]

    B. Peng, J. Quesnelle, H. Fan, and E. Shippole. Yarn: Efficient context window extension of large language models. In The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview.net, 2024. URL https://openreview.net/forum?id=wHBfxhZu1u

  38. [40]

    K. J. Piczak. Esc: Dataset for environmental sound classification. In Proceedings of the 23rd ACM International Conference on Multimedia, MM ’15, page 1015–1018, New York, NY, USA,

  39. [41]

    Association for Computing Machinery. ISBN 9781450334594. doi: 10.1145/2733373.2806390. URL https://doi.org/10.1145/2733373.2806390

  40. [42]

    A. Radford, K. Narasimhan, T. Salimans, I. Sutskever, et al. Improving language understanding by generative pre-training. OpenAI blog, 2018. URL https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf

  41. [43]

    A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever, et al. Language models are unsupervised multitask learners. OpenAI blog, 2019. URL https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf

  42. [44]

    N. Rahaman, A. Baratin, D. Arpit, F. Draxler, M. Lin, F. Hamprecht, Y. Bengio, and A. Courville. On the spectral bias of neural networks. In International conference on machine learning, pages 5301–5310. PMLR, 2019

  43. [46]

    Y. Rubanova, T. Q. Chen, and D. Duvenaud. Latent ordinary differential equations for irregularly-sampled time series. In H. M. Wallach, H. Larochelle, A. Beygelzimer, F. d’Alché-Buc, E. B. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019,...

  44. [47]

    M. Schirmer, M. Eltayeb, S. Lessmann, and M. Rudolph. Modeling irregular time series with continuous recurrent units. In K. Chaudhuri, S. Jegelka, L. Song, C. Szepesvári, G. Niu, and S. Sabato, editors, International Conference on Machine Learning, ICML 2022, 17-23 July 2022, Baltimore, Maryland, USA, volume 162 of Proceedings of Machine Learning Research, ...

  45. [48]

    S. N. Shukla and B. M. Marlin. Multi-time attention networks for irregularly sampled time series. In 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021. OpenReview.net, 2021. URL https://openreview.net/forum?id=4c0J6lwQ4_.

  46. [50]

    S. N. Shukla and B. M. Marlin. Heteroscedastic temporal variational autoencoder for irregularly sampled time series. In The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022. OpenReview.net, 2022. URL https://openreview.net/forum?id=Az7opqbQE-3

  47. [51]

    I. Silva, B. Moody, D. Scott, L. Celi, R. Mark, and G. Clifford. The PhysioNet/Computing in Cardiology Challenge 2012: Predicting In-Hospital Mortality from ICU Data. In Computing in Cardiology, pages 245–248, 2012

  48. [52]

    I. Silva, G. Moody, D. J. Scott, L. A. Celi, and R. G. Mark. Predicting in-hospital mortality of icu patients: The physionet/computing in cardiology challenge 2012. In 2012 computing in cardiology, pages 245–248. IEEE, 2012

  49. [53]

    J. T. H. Smith, A. Warrington, and S. W. Linderman. Simplified state space layers for sequence modeling. In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023. OpenReview.net, 2023. URL https://openreview.net/forum?id=Ai8Hw3AXqks

  50. [54]

    Z. Song, Q. Lu, H. Zhu, D. Buckeridge, and Y. Li. Trajgpt: Irregular time-series representation learning for health trajectory analysis. CoRR, abs/2410.02133, 2024. doi: 10.48550/ARXIV.2410.02133. URL https://doi.org/10.48550/arXiv.2410.02133

  51. [55]

    N. Stroh. Trackgpt–a generative pre-trained transformer for cross-domain entity trajectory forecasting. arXiv preprint arXiv:2402.00066, 2024

  52. [56]

    J. Su, M. H. M. Ahmed, Y. Lu, S. Pan, W. Bo, and Y. Liu. Roformer: Enhanced transformer with rotary position embedding. Neurocomputing, 568:127063, 2024. doi: 10.1016/J.NEUCOM.2023.127063. URL https://doi.org/10.1016/j.neucom.2023.127063

  53. [57]

    K. Su, Q. Wu, P. Cai, X. Zhu, X. Lu, Z. Wang, and K. Hu. RI-MAE: rotation-invariant masked autoencoders for self-supervised point cloud representation learning. In T. Walsh, J. Shah, and Z. Kolter, editors, AAAI-25, Sponsored by the Association for the Advancement of Artificial Intelligence, February 25 - March 4, 2025, Philadelphia, PA, USA, pages 7015–70...

  54. [58]

    C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna. Rethinking the inception architecture for computer vision. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016, pages 2818–2826. IEEE Computer Society,

  55. [59]

    doi: 10.1109/CVPR.2016.308. URL https://doi.org/10.1109/CVPR.2016.308

  56. [60]

    Z. Tong, Y. Song, J. Wang, and L. Wang. Videomae: Masked autoencoders are data-efficient learners for self-supervised video pre-training. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, editors, Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, NeurIPS 2022, New...

  57. [61]

    A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin. Attention is all you need. In I. Guyon, U. von Luxburg, S. Bengio, H. M. Wallach, R. Fergus, S. V. N. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Syste...

  58. [62]

    L. Wang, B. Huang, Z. Zhao, Z. Tong, Y. He, Y. Wang, Y. Wang, and Y. Qiao. Videomae V2: scaling video masked autoencoders with dual masking. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2023, Vancouver, BC, Canada, June 17-24, 2023, pages 14549–14560. IEEE, 2023. doi: 10.1109/CVPR52729.2023.01398. URL https://doi.org/10.1109...

  59. [63]

    Q. Wen, T. Zhou, C. Zhang, W. Chen, Z. Ma, J. Yan, and L. Sun. Transformers in time series: A survey. In Proceedings of the Thirty-Second International Joint Conference on Artificial Intelligence, IJCAI 2023, 19th-25th August 2023, Macao, SAR, China, pages 6778–6786. ijcai.org, 2023. doi: 10.24963/IJCAI.2023/759. URL https://doi.org/10.24963/ijcai.2023/759

  60. [64]

    T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf, M. Funtowicz, et al. Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 conference on empirical methods in natural language processing: system demonstrations, pages 38–45, 2020

  61. [65]

    W. Xiong, J. Liu, I. Molybog, H. Zhang, P. Bhargava, R. Hou, L. Martin, R. Rungta, K. A. Sankararaman, B. Oguz, M. Khabsa, H. Fang, Y. Mehdad, S. Narang, K. Malik, A. Fan, S. Bhosale, S. Edunov, M. Lewis, S. Wang, and H. Ma. Effective long-context scaling of foundation models. In K. Duh, H. Gómez-Adorno, and S. Bethard, editors, Proceedings of the 2024 Co...

  62. [66]

    Effective Long-Context Scaling of Foundation Models

    doi: 10.18653/V1/2024.NAACL-LONG.260. URL https://doi.org/10.18653/v1/2024.naacl-long.260

  63. [67]

    M. Xu, X. Men, B. Wang, Q. Zhang, H. Lin, X. Han, and W. Chen. Base of rope bounds context length. In A. Globersons, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. M. Tomczak, and C. Zhang, editors, Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, NeurIPS 2024, Vancouver, BC, Canada, Dece...

  64. [68]

    Z.-Q. J. Xu, Y. Zhang, and T. Luo. Overview frequency principle/spectral bias in deep learning. Communications on Applied Mathematics and Computation, 7(3):827–864, 2025

  65. [69]

    G. Zerveas, S. Jayaraman, D. Patel, A. Bhamidipaty, and C. Eickhoff. A transformer-based framework for multivariate time series representation learning. In F. Zhu, B. C. Ooi, and C. Miao, editors, KDD ’21: The 27th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, Virtual Event, Singapore, August 14-18, 2021, pages 2114–2124. ACM, 2021. doi: 10...

  66. [70]

    B. Zhang and R. Sennrich. Root mean square layer normalization. In H. M. Wallach, H. Larochelle, A. Beygelzimer, F. d’Alché-Buc, E. B. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, December 8-14, 2019, Vancouver, BC, Canada, pages 12360–1237...