Rotary Masked Autoencoders are Versatile Learners
Pith reviewed 2026-05-19 12:28 UTC · model grok-4.3
The pith
RoMAE applies rotary positional embeddings to masked autoencoders for continuous positions across modalities without time-series customizations.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
RoMAE utilizes Rotary Positional Embedding for continuous positions in the MAE framework, enabling interpolation and representation learning with multidimensional continuous positional information while avoiding any time-series-specific architectural specializations. It surpasses specialized time-series architectures on difficult datasets such as the DESC ELAsTiCC Challenge while maintaining MAE's usual performance across other modalities. The paper also reports that including learned embeddings in the input sequence breaks RoPE's relative position property.
What carries the argument
Rotary Positional Embedding (RoPE) adapted for continuous positions, inserted directly into the Masked Autoencoder pipeline to encode relative positional information that supports interpolation without grid assumptions.
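To make the relative-position property concrete, here is a minimal NumPy sketch of RoPE applied to real-valued positions. It is an illustration under stated assumptions, not the paper's implementation: the function names `rope_rotate` and `attention_score`, the frequency base of 10000, and the interleaved pairing of feature dimensions are all choices made for this example.

```python
import numpy as np

def rope_rotate(x, pos, base=10000.0):
    """Rotate consecutive feature pairs of x by angles pos * theta_i.

    x:   (n, d) float array with even d
    pos: (n,) float array of continuous positions (e.g. irregular timestamps)
    base: frequency base; 10000 is an assumption borrowed from common RoPE setups
    """
    n, d = x.shape
    assert d % 2 == 0, "feature dimension must be even"
    theta = base ** (-np.arange(d // 2) * 2.0 / d)   # (d/2,) rotation frequencies
    angles = pos[:, None] * theta[None, :]            # (n, d/2), no integer grid needed
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

def attention_score(q, k, pos_q, pos_k):
    """Unnormalized attention logit between one rotated query and one rotated key."""
    q_rot = rope_rotate(q[None, :], np.array([pos_q]))[0]
    k_rot = rope_rotate(k[None, :], np.array([pos_k]))[0]
    return float(q_rot @ k_rot)

rng = np.random.default_rng(0)
q, k = rng.normal(size=8), rng.normal(size=8)
s1 = attention_score(q, k, 0.37, 2.91)
s2 = attention_score(q, k, 10.37, 12.91)   # both positions shifted by 10.0
print(np.isclose(s1, s2))                   # logit depends only on the position difference
```

Because each pair of features is rotated by an angle proportional to the position, shifting the query and key by the same amount leaves their dot product unchanged, and nothing in the computation requires the positions to lie on a regular grid.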
If this is right
- RoMAE performs representation learning on irregular and multivariate time-series without added complexity.
- The same model maintains standard MAE reconstruction quality on images and audio.
- Learned embeddings in the sequence interfere with RoPE's relative position encoding.
- RoMAE can interpolate across continuous positional dimensions in the input sequence.
Where Pith is reading between the lines
- The approach may extend naturally to other continuous-valued domains such as spatial sensor grids or scientific measurement series.
- Direct substitution of rotary embeddings could reduce the need for modality-specific positional modules in future transformer variants.
Load-bearing premise
Rotary positional embeddings can be directly substituted into the masked autoencoder framework for continuous positions without introducing hidden limitations or needing further adjustments.
What would settle it
An experiment in which RoMAE, evaluated on irregular time-series data with known gaps, degrades clearly relative to specialized architectures, or fails to reconstruct positions while preserving RoPE's relative position property.
Original abstract
Applying Transformers to irregular time-series typically requires specializations to their baseline architecture, which can result in additional computational overhead and increased method complexity. We present the Rotary Masked Autoencoder (RoMAE), which utilizes the popular Rotary Positional Embedding (RoPE) method for continuous positions. RoMAE is an extension to the Masked Autoencoder (MAE) that enables interpolation and representation learning with multidimensional continuous positional information while avoiding any time-series-specific architectural specializations. We showcase RoMAE's performance on a variety of modalities including irregular and multivariate time-series, images, and audio, demonstrating that RoMAE surpasses specialized time-series architectures on difficult datasets such as the DESC ELAsTiCC Challenge while maintaining MAE's usual performance across other modalities. In addition, we investigate RoMAE's ability to reconstruct the embedded continuous positions, demonstrating that including learned embeddings in the input sequence breaks RoPE's relative position property.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces Rotary Masked Autoencoder (RoMAE), an extension of the standard Masked Autoencoder that incorporates Rotary Positional Embeddings (RoPE) for multidimensional continuous positions. It claims this enables interpolation and representation learning on irregular/multivariate time-series (and other modalities) without any time-series-specific architectural changes to masking, encoder, decoder, or loss, while surpassing specialized time-series models on the DESC ELAsTiCC Challenge and preserving MAE performance on images and audio; an additional investigation shows that learned embeddings break RoPE's relative-position property.
Significance. If the empirical results hold after full verification, the work would be significant for providing a parameter-light, specialization-free route to continuous positional modeling in transformers, potentially reducing complexity for irregular time-series and related modalities while retaining MAE's reconstruction-based pretraining benefits.
major comments (2)
- [Abstract] Abstract: the claim that RoMAE 'surpasses specialized time-series architectures on difficult datasets such as the DESC ELAsTiCC Challenge' is unsupported by any quantitative metrics, error bars, ablation tables, or dataset statistics, leaving the central performance claim unevaluable.
- [Abstract] Abstract: no equations, pseudocode, or implementation details are supplied for the direct substitution of RoPE into the MAE pipeline for continuous multidimensional positions, so it is impossible to confirm that no auxiliary mechanisms (e.g., custom normalization or adjusted masking) were introduced that would contradict the 'avoiding any time-series-specific architectural specializations' claim.
minor comments (1)
- [Abstract] Abstract: the final sentence on reconstructing embedded continuous positions would benefit from a brief statement of the observed outcome rather than only the negative finding about learned embeddings.
Simulated Author's Rebuttal
We thank the referee for their constructive feedback on our manuscript. We address each major comment below and have revised the manuscript to strengthen the support for our claims while preserving the core contributions.
- Referee: [Abstract] Abstract: the claim that RoMAE 'surpasses specialized time-series architectures on difficult datasets such as the DESC ELAsTiCC Challenge' is unsupported by any quantitative metrics, error bars, ablation tables, or dataset statistics, leaving the central performance claim unevaluable.
  Authors: We agree that the abstract would be strengthened by including key quantitative results to make the central claim immediately evaluable without requiring the reader to consult the full experiments section. The manuscript contains detailed results on the DESC ELAsTiCC Challenge, including performance metrics, comparisons against specialized time-series baselines, error bars from multiple random seeds, and dataset statistics. We have revised the abstract to incorporate concise quantitative highlights from those experiments.
  Revision: yes
- Referee: [Abstract] Abstract: no equations, pseudocode, or implementation details are supplied for the direct substitution of RoPE into the MAE pipeline for continuous multidimensional positions, so it is impossible to confirm that no auxiliary mechanisms (e.g., custom normalization or adjusted masking) were introduced that would contradict the 'avoiding any time-series-specific architectural specializations' claim.
  Authors: The body of the manuscript (Section 3) provides the equations for adapting RoPE to continuous multidimensional positions, along with a description confirming that the substitution uses only standard MAE masking, encoder, decoder, and loss components with no auxiliary mechanisms or modality-specific changes. To improve accessibility from the abstract, we have added a brief clarifying sentence and a reference to the relevant section in the revised abstract, and we have included a short pseudocode block in the main text for the integration step.
  Revision: yes
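The pseudocode the rebuttal mentions is not reproduced on this page. As a rough illustration of what "standard MAE masking" could look like when tokens carry continuous positions, the sketch below randomly hides a fraction of tokens and keeps each token's position alongside it. Everything here is an assumption for illustration: the function name `random_mask`, the 0.75 mask ratio (a common MAE default, not confirmed for RoMAE), and the returned tuple layout.

```python
import numpy as np

def random_mask(tokens, positions, mask_ratio=0.75, rng=None):
    """Uniform random masking in the MAE style (hypothetical helper).

    tokens:    (n, d) array of token embeddings
    positions: (n,) or (n, D) array of continuous positions, one per token
    Returns visible tokens, their positions, masked positions, and masked indices.
    """
    rng = np.random.default_rng() if rng is None else rng
    n = tokens.shape[0]
    n_keep = max(1, int(round(n * (1.0 - mask_ratio))))
    perm = rng.permutation(n)
    keep_idx = np.sort(perm[:n_keep])    # encoder sees only these tokens
    mask_idx = np.sort(perm[n_keep:])    # decoder must reconstruct these
    return tokens[keep_idx], positions[keep_idx], positions[mask_idx], mask_idx

rng = np.random.default_rng(0)
tokens = rng.normal(size=(8, 4))
times = np.sort(rng.uniform(0.0, 30.0, size=8))   # irregular timestamps
visible, vis_pos, masked_pos, mask_idx = random_mask(tokens, times, 0.75, rng)
print(visible.shape, masked_pos.shape)
```

In this sketch the masking step never inspects the positions, so the same routine would apply unchanged to image patches or audio frames; only the position array passed alongside the tokens differs between modalities.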
Circularity Check
No derivation chain or equations present; claims are empirical
The provided document consists solely of the abstract, which introduces RoMAE as an extension of the standard Masked Autoencoder that applies Rotary Positional Embeddings to continuous multidimensional positions. No equations, derivations, fitted parameters, or self-citations are shown that could reduce any result to its inputs by construction. Performance claims on datasets such as DESC ELAsTiCC are presented as empirical outcomes rather than first-principles predictions, and the investigation of position reconstruction is described without technical steps that invite circularity analysis. Because the available text contains no load-bearing derivation chain, no circularity is detectable; the claims rest on comparisons against external benchmarks.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/AlexanderDuality.lean · alexander_duality_circle_linking · echoes
  ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.
  Quoted passage: "We utilize Axial RoPE... split into D subspaces... apply p-RoPE to each subspace... up to D=3"
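The quoted passage describes splitting the feature vector into D subspaces and rotating each one by a different positional coordinate. A minimal sketch of that axial layout follows. Note the substitution: plain RoPE is used inside each subspace in place of p-RoPE, whose details are not given on this page, and the names `rope_1d` and `axial_rope`, the equal-width split, and the base of 10000 are assumptions made for illustration.

```python
import numpy as np

def rope_1d(x, pos, base=10000.0):
    """Plain RoPE on one subspace: x is (n, d) with even d, pos is (n,)."""
    d = x.shape[-1]
    theta = base ** (-np.arange(d // 2) * 2.0 / d)
    ang = pos[:, None] * theta[None, :]
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

def axial_rope(x, pos):
    """Axial layout: subspace a of x is rotated by coordinate a of pos.

    x:   (n, d) features, with d divisible by 2 * D
    pos: (n, D) continuous coordinates (e.g. time, or time plus two spatial axes)
    """
    n, d = x.shape
    D = pos.shape[1]
    assert d % (2 * D) == 0, "each subspace needs an even number of features"
    chunks = np.split(x, D, axis=1)     # D equal-width subspaces
    rotated = [rope_1d(c, pos[:, a]) for a, c in enumerate(chunks)]
    return np.concatenate(rotated, axis=1)

rng = np.random.default_rng(1)
x = rng.normal(size=(2, 12))                              # two tokens, d = 12, D = 3
pos = np.array([[0.5, 1.25, 3.0], [2.0, 0.75, 4.5]])
shift = np.array([7.0, -2.0, 0.3])                        # same offset for both tokens
a, b = axial_rope(x, pos), axial_rope(x, pos + shift)
print(np.isclose(a[0] @ a[1], b[0] @ b[1]))               # dot product is shift-invariant
```

Since each subspace is rotated independently, translating all tokens by a common offset along any axis leaves the dot product between rotated vectors unchanged, which is the multidimensional analogue of the one-dimensional relative-position property.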
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.