pith. machine review for the scientific record.

arxiv: 2605.00292 · v2 · submitted 2026-04-30 · 💻 cs.LG · cs.AI

Recognition: unknown

Caracal: Causal Architecture via Spectral Mixing

Authors on Pith · no claims yet

Pith reviewed 2026-05-09 19:39 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords Caracal · spectral mixing · causal architecture · Fourier transform · long-sequence modeling · efficient sequence models

The pith

Caracal replaces attention with a Fourier module to model long sequences efficiently while supporting causal generation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes Caracal, which replaces the attention mechanism in transformers with a Multi-Head Fourier (MHF) module that mixes sequences using the Fast Fourier Transform. This approach achieves linearithmic complexity in sequence length and avoids the need for positional encodings by operating in the frequency domain. A key innovation is causal masking applied in the frequency domain via asymmetric padding and truncation, so the model can be used for autoregressive generation. The design relies on standard library operations for broad portability. Experiments show it performs competitively with standard transformers and state-space models on relevant tasks.

Core claim

Caracal introduces a causal architecture that performs sequence mixing via spectral methods with the Fast Fourier Transform, enforcing causality through frequency-domain masking to achieve efficient and portable long-context modeling.

What carries the argument

Multi-Head Fourier (MHF) module with frequency-domain causal masking via asymmetric padding and truncation.
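
The page excerpts only the abstract and Figure 1, so the exact MHF parameterization is not visible here. The sketch below is a minimal numpy illustration of the recipe those excerpts describe: zero-pad two input streams to double length (N = 2L), transform with the FFT, multiply element-wise in the expanded frequency space, invert, and truncate back to the first L positions. The single-head, unprojected value/gate streams and all shapes are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def causal_spectral_mix(x_v, x_g):
    """Illustrative single-head mixer: pad both length-L streams to N = 2L,
    multiply their spectra element-wise, inverse-transform, and keep only
    the first L outputs so no position can see a later token."""
    L = x_v.shape[-1]
    N = 2 * L                             # one-sided ("asymmetric") zero-padding to double length
    Vf = np.fft.rfft(x_v, n=N)            # F(x_v) on the expanded grid
    Gf = np.fft.rfft(x_g, n=N)            # F(x_g)
    y = np.fft.irfft(Vf * Gf, n=N)        # element-wise mixing, back to the time domain
    return y[..., :L]                     # truncation: discard the second half

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    L = 8
    x_v, x_g = rng.standard_normal(L), rng.standard_normal(L)
    print(causal_spectral_mix(x_v, x_g))  # O(L log L) mixing of the two streams
```

Because the padding is one-sided and only the first L outputs are kept, output position n is a sum of products x_v[i]·x_g[n−i] with i ≤ n, so no future token enters it, and the whole operation costs O(L log L).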

If this is right

  • Sequence modeling becomes feasible at linearithmic cost in sequence length.
  • Models can be implemented and deployed using ubiquitous FFT libraries without custom kernels.
  • Generative capabilities are preserved for tasks like language modeling.
  • Scalability to longer sequences is improved compared to quadratic attention methods.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Other sequence tasks could benefit from similar spectral causal designs.
  • The simplicity of standard operators may accelerate adoption in production systems.
  • Potential for hybrid models combining Fourier mixing with other efficient techniques.

Load-bearing premise

The frequency-domain causal masking technique via asymmetric padding and truncation successfully enforces autoregressive capabilities for generative modeling without loss of expressiveness or performance.

What would settle it

Demonstrating that the causal masking either leaks future information or leads to significantly worse performance than baselines on long sequences would falsify the approach.
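
One way to run that test concretely, under the same illustrative mixer sketched above (the paper's actual MHF module would be dropped in for `causal_spectral_mix`): perturb a single token at position t and confirm that all outputs strictly before t are unchanged. Any change would indicate leakage of future information and falsify the causality claim.

```python
import numpy as np

def causal_spectral_mix(x_v, x_g):
    # Same illustrative mixer as the earlier sketch: pad to 2L, multiply
    # spectra element-wise, inverse-transform, keep the first L outputs.
    N = 2 * x_v.shape[-1]
    y = np.fft.irfft(np.fft.rfft(x_v, n=N) * np.fft.rfft(x_g, n=N), n=N)
    return y[..., : x_v.shape[-1]]

def leaks_future(mix, L=64, trials=20, tol=1e-10, seed=0):
    """Return True if any output strictly before a perturbed position changes,
    i.e. if the mixing operator lets future tokens influence earlier outputs."""
    rng = np.random.default_rng(seed)
    for _ in range(trials):
        x_v, x_g = rng.standard_normal(L), rng.standard_normal(L)
        t = int(rng.integers(1, L))          # perturb a token at position t >= 1
        x_v2, x_g2 = x_v.copy(), x_g.copy()
        x_v2[t] += 1.0
        x_g2[t] -= 1.0
        base, pert = mix(x_v, x_g), mix(x_v2, x_g2)
        if np.max(np.abs(base[:t] - pert[:t])) > tol:
            return True                      # an earlier output saw the future token
    return False

print("future leakage detected:", leaks_future(causal_spectral_mix))  # expect False
```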

Figures

Figures reproduced from arXiv: 2605.00292 by Bingzheng Gan, Jing Huang, Tao Yu, Tianyi Zhang, Wei Shi, Yangkai Ding, Yusu Li.

Figure 1
Figure 1: Comparison of a standard Transformer model (left) and the proposed Caracal model (right). The core modification is the replacement of Masked Multi-Head Attention with the Multi-Head Fourier (MHF) module and the removal of positional encodings. [Excerpt continues into the body text: inputs are zero-padded to double length (N = 2L) and transformed via the FFT, V_fft = F(x_v), G_fft = F(x_g) (Eq. 4), then multiplied element-wise in this expanded frequency space.] … view at source ↗
Figure 2
Figure 2: Training time and throughput across varying context lengths L (detailed numbers in Appendix D). The attention-based Llama architecture exhibits the expected quadratic complexity O(L^2). [Axes: Context Length (tokens, log2 scale) vs. Training Time (s); series include LLAMA, MAMBA, and MAMBA2 time and throughput.] … view at source ↗
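
To make the scaling gap behind Figure 2 concrete, here is a back-of-envelope comparison of operation-count growth when the context quadruples from 2,048 to 8,192 tokens; these are illustrative asymptotic ratios, not the paper's measured runtimes.

```python
import math

# Illustrative operation-count ratios when the context grows from 2,048 to 8,192
# tokens (constants, memory traffic, and hardware effects ignored; not measured data).
L1, L2 = 2048, 8192
quadratic = (L2 ** 2) / (L1 ** 2)                            # attention-style O(L^2)
linearithmic = (L2 * math.log2(L2)) / (L1 * math.log2(L1))   # FFT-style O(L log L)
print(f"O(L^2) grows {quadratic:.1f}x, O(L log L) grows {linearithmic:.1f}x")
# -> O(L^2) grows 16.0x, O(L log L) grows 4.7x
```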
read the original abstract

The scalability of Large Language Models to long sequences is hindered by the quadratic cost of attention and the limitations of positional encodings. To address these, we introduce Caracal, a novel architecture that replaces attention with a parameter-efficient, O(L log(L)) Multi-Head Fourier (MHF) module. Our contributions are threefold: (1) We leverage the Fast Fourier Transform (FFT) for sequence mixing, inherently addressing both bottlenecks mentioned above. (2) We apply a frequency-domain causal masking technique that enforces autoregressive capabilities via asymmetric padding and truncation, overcoming a critical barrier for Fourier-based generative models. (3) Unlike efficient models relying on hardware-specific implementations (e.g., Mamba), we uses standard library operators. This ensures robust portability, eliminating common deployment barriers. Evaluations demonstrate that Caracal performs competitively with Transformer and SSM baselines, offering a scalable and simple pathway for efficient long-sequence modeling. Code is available in Appendix.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes Caracal, a novel neural architecture for long-sequence modeling that substitutes the attention mechanism with a Multi-Head Fourier (MHF) module based on the Fast Fourier Transform (FFT). This achieves O(L log L) complexity for sequence mixing. The key innovation is a causal masking technique applied in the frequency domain using asymmetric padding and truncation to support autoregressive generation. The architecture is designed to be portable by relying on standard library operators rather than custom hardware-specific implementations. The paper claims that Caracal achieves competitive performance compared to Transformer and State-Space Model (SSM) baselines.

Significance. If the causal masking technique is proven to maintain strict causality without spectral artifacts and the performance claims are substantiated with rigorous experiments, this work could represent a significant advancement in efficient sequence modeling by providing a simple, FFT-based alternative that avoids the quadratic cost of attention and the implementation complexities of SSMs. The emphasis on portability using standard operators is a notable strength for practical deployment.

major comments (2)
  1. Abstract (contribution 2): The frequency-domain causal masking technique via asymmetric padding and truncation is presented as enforcing autoregressive capabilities. However, because the FFT has global support, it is not immediately clear that this operation results in a strictly causal (lower-triangular) effective transformation in the time domain. A mathematical derivation showing that the resulting time-domain kernel has no dependence on future inputs, or an empirical test (such as checking for zero leakage on future tokens), is necessary to support the claim that this overcomes the barrier for Fourier-based generative models.
  2. Abstract: The claim that 'Evaluations demonstrate that Caracal performs competitively with Transformer and SSM baselines' lacks supporting details such as specific metrics, datasets used, baseline models, ablation studies, or statistical significance. This omission makes it challenging to evaluate the strength of the empirical results. The full manuscript should include these in the experiments section with clear tables or figures.
minor comments (2)
  1. Abstract: Grammatical error: 'we uses standard library operators' should be 'we use standard library operators'.
  2. Abstract: The abstract mentions 'Code is available in Appendix' but does not provide a link or repository information, which would aid reproducibility.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below and will revise the paper to incorporate the suggested clarifications and additions.

read point-by-point responses
  1. Referee: Abstract (contribution 2): The frequency-domain causal masking technique via asymmetric padding and truncation is presented as enforcing autoregressive capabilities. However, because the FFT has global support, it is not immediately clear that this operation results in a strictly causal (lower-triangular) effective transformation in the time domain. A mathematical derivation showing that the resulting time-domain kernel has no dependence on future inputs, or an empirical test (such as checking for zero leakage on future tokens), is necessary to support the claim that this overcomes the barrier for Fourier-based generative models.

    Authors: We agree that explicit verification of strict causality is important given the global nature of the FFT. In the revised manuscript, we will add a dedicated subsection with a mathematical derivation showing that the asymmetric padding followed by truncation in the frequency domain produces an effective time-domain kernel that is strictly lower-triangular (i.e., no dependence on future inputs). We will also include an empirical check by constructing the equivalent time-domain transformation matrix and verifying that entries corresponding to future tokens are numerically zero (within floating-point precision); a minimal sketch of such a check appears after these responses. revision: yes

  2. Referee: Abstract: The claim that 'Evaluations demonstrate that Caracal performs competitively with Transformer and SSM baselines' lacks supporting details such as specific metrics, datasets used, baseline models, ablation studies, or statistical significance. This omission makes it challenging to evaluate the strength of the empirical results. The full manuscript should include these in the experiments section with clear tables or figures.

    Authors: The Experiments section of the full manuscript already reports results on long-sequence benchmarks (including perplexity and accuracy metrics) against Transformer and SSM baselines such as Mamba, with ablations on the Multi-Head Fourier module and multiple datasets. To directly address the concern, we will expand the abstract to briefly reference the key quantitative findings and add a summary table in the main text that highlights statistical significance (e.g., via standard deviations over multiple runs). Additional ablation figures will also be included if space permits. revision: partial
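
The matrix check promised in response 1 can be sketched for the linear, fixed-kernel special case: apply the padded-FFT convolution to each standard basis vector, stack the results as columns of the equivalent time-domain transformation matrix, and verify that every entry above the diagonal is numerically zero. The kernel `h` and dimensions are illustrative; the gated multi-head version in the paper is nonlinear in the input, so it would need a perturbation-style test (as in the sketch under "What would settle it") rather than a single matrix.

```python
import numpy as np

def linear_causal_fft_conv(x, h):
    # Fixed-kernel variant of the padded-FFT recipe: zero-pad to 2L,
    # multiply spectra element-wise, inverse-transform, keep the first L outputs.
    N = 2 * x.shape[-1]
    y = np.fft.irfft(np.fft.rfft(x, n=N) * np.fft.rfft(h, n=N), n=N)
    return y[: x.shape[-1]]

L = 32
rng = np.random.default_rng(1)
h = rng.standard_normal(L)            # illustrative causal kernel (lags 0..L-1 only)

# Build the equivalent time-domain transformation matrix column by column:
# column i is the response of the operator to the i-th standard basis vector.
M = np.stack([linear_causal_fft_conv(np.eye(L)[i], h) for i in range(L)], axis=1)

# Strict causality <=> output t never depends on an input later than t
# <=> M is lower-triangular. Check the strictly-upper entries.
print("max |entry| above the diagonal:", np.abs(np.triu(M, k=1)).max())   # ~1e-16
```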

Circularity Check

0 steps flagged

No circularity; novel architecture validated empirically

full rationale

The paper introduces Caracal as a new O(L log L) architecture based on Multi-Head Fourier mixing with an asymmetric padding/truncation masking technique for causality. No derivation chain, fitted parameters, or equations are shown that reduce by construction to prior inputs or self-citations. Central claims rest on competitive empirical evaluations against Transformer and SSM baselines using standard library operators, with no load-bearing self-referential steps or uniqueness theorems imported from the authors' prior work. The method is presented as self-contained and portable without mathematical self-definition.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review yields no explicit free parameters, axioms, or invented entities; the MHF module and frequency masking are presented as novel but without derivation details or independent evidence.

pith-pipeline@v0.9.0 · 5465 in / 1053 out tokens · 24943 ms · 2026-05-09T19:39:05.186096+00:00 · methodology

