pith. machine review for the scientific record.

arxiv: 2604.22808 · v1 · submitted 2026-04-14 · 💻 cs.CV · cs.AI · eess.IV

Recognition: unknown

FreqFormer: Hierarchical Frequency-Domain Attention with Adaptive Spectral Routing for Long-Sequence Video Diffusion Transformers

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 15:51 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI · eess.IV
keywords frequency-domain attention · video diffusion transformers · efficient self-attention · spectral routing · long-sequence modeling · heterogeneous attention · diffusion models · token efficiency

The pith

FreqFormer splits video token features into frequency bands and assigns different attention operators to each band, reducing quadratic costs in long-sequence diffusion transformers.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Long-sequence video diffusion transformers incur quadratic self-attention costs that dominate runtime and memory once token counts reach tens or hundreds of thousands. The paper proposes splitting features by spectral content so that low frequencies receive dense global attention, mid frequencies receive structured block-sparse attention, and high frequencies receive sliding-window local attention. A lightweight routing network uses layer statistics and the diffusion timestep to decide how many heads operate in each band, shifting emphasis from global layout early in denoising to fine detail later. Cross-band summary tokens enable cheap information exchange between the branches. Simulations from 64K to 1M tokens show substantial drops in estimated attention FLOPs and KV memory traffic relative to dense attention while preserving a hardware-friendly execution pattern.
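
The scaling argument can be made concrete with a back-of-envelope cost model. The sketch below is an editorial reconstruction, not the paper's actual complexity model: the band fractions, compression ratio, block size, and window size are illustrative assumptions.

```python
# Back-of-envelope attention-FLOP model for the three-band split.
# All parameters (band fractions, compression ratio, block/window
# sizes) are illustrative assumptions, not values from the paper.

def dense_flops(n: int, d: int = 128) -> float:
    """Dense self-attention: QK^T and AV, each ~n^2 * d multiply-adds."""
    return 2.0 * n * n * d

def freqformer_flops(n: int, d: int = 128,
                     low_frac: float = 0.25, mid_frac: float = 0.35,
                     compress: float = 0.25,   # low-band token compression
                     block: int = 1024,        # mid-band block size
                     window: int = 512) -> float:
    n_low = int(n * low_frac * compress)        # dense on compressed tokens
    n_mid = int(n * mid_frac)                   # block-sparse: one block per query
    n_high = n - int(n * low_frac) - n_mid      # sliding-window local attention
    return (2.0 * n_low * n_low * d             # quadratic, but on few tokens
            + 2.0 * n_mid * block * d           # linear in n
            + 2.0 * n_high * window * d)        # linear in n

for n in (64_000, 256_000, 1_000_000):
    ratio = dense_flops(n) / freqformer_flops(n)
    print(f"n={n:>9,}: estimated dense/FreqFormer FLOP ratio ~ {ratio:.0f}x")
```

Only the compressed low band stays quadratic, so the estimated ratio grows with sequence length; the paper's own model will differ in constants but shares this shape.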

Core claim

Video features in diffusion processes are spectrally structured, with low frequencies carrying global layout and coarse motion while high frequencies carry texture and fine detail. FreqFormer exploits this structure through a heterogeneous attention framework that applies dense global attention on compressed low-frequency content, structured block-sparse attention on mid frequencies, and sliding-window local attention on high frequencies. A spectral routing network allocates heads across bands using layer statistics and the current denoising timestep. Cross-band summary tokens provide residual exchange. The resulting system is paired with a fused GPU execution plan and is shown in simulation to substantially reduce estimated attention FLOPs and KV memory traffic relative to dense attention across sequences from 64K to 1M tokens.

What carries the argument

Frequency-aware heterogeneous attention framework that partitions tokens into spectral bands and routes them to distinct operators (dense, block-sparse, local) under the control of a timestep-aware routing network plus cross-band summary tokens.
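
The band-partitioning step can be sketched in a few lines. This is a minimal editorial reconstruction under simplifying assumptions: a fixed orthonormal DCT along a single flattened token axis, with illustrative band boundaries, rather than the paper's spatial-temporal partitioning or learned orthonormal mixing.

```python
import numpy as np

def dct_matrix(n: int) -> np.ndarray:
    """Orthonormal DCT-II basis (rows = frequencies), so C @ C.T == I."""
    k = np.arange(n)[:, None]          # frequency index
    i = np.arange(n)[None, :]          # token index
    C = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    C[0] /= np.sqrt(2.0)
    return C

def split_bands(tokens: np.ndarray, low: float = 0.25, mid: float = 0.60):
    """tokens: (n_tokens, d). Transform along the token axis and
    partition the coefficients into low / mid / high frequency bands.
    The 0.25 / 0.60 cut points are illustrative, not from the paper."""
    n = tokens.shape[0]
    coeffs = dct_matrix(n) @ tokens
    lo, hi = int(n * low), int(n * mid)
    return coeffs[:lo], coeffs[lo:hi], coeffs[hi:]

x = np.random.default_rng(0).standard_normal((256, 16))
low_b, mid_b, high_b = split_bands(x)
# Because the transform is orthonormal, the split is lossless: the
# inverse transform of the stacked bands recovers the tokens exactly.
recon = dct_matrix(256).T @ np.concatenate([low_b, mid_b, high_b])
print(np.allclose(recon, x))   # True
```

Orthonormality is what makes the "orthonormal-decomposition view of the approximation" in the abstract meaningful: any error comes from the per-band attention operators, not the decomposition itself.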

If this is right

  • Longer video sequences become feasible because attention FLOPs and KV memory traffic scale more favorably than dense quadratic attention.
  • Compute can be reallocated automatically as denoising progresses, prioritizing global structure early and local detail later.
  • A single fused GPU schedule for the three branches reduces kernel launches and memory traffic relative to separate kernels.
  • The same spectral decomposition supplies both an orthonormal view of the approximation and a consistent complexity model.
  • The approach supports hardware-friendly patterns that remain practical on current GPUs up to at least one million tokens.
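
The timestep-dependent reallocation in the second bullet can be illustrated with a toy router. The feature vector, the random linear weights, and the hand-coded prior below are invented for illustration; the paper's routing network is learned, not hard-wired.

```python
import numpy as np

def route_heads(stats: np.ndarray, t: float, n_heads: int = 16,
                seed: int = 0) -> np.ndarray:
    """Toy spectral router: allocate a fixed head budget over the
    (low, mid, high) bands from pooled layer statistics and the
    normalized diffusion timestep t in [0, 1] (1 = early / noisy).
    The linear weights are random placeholders, not learned values."""
    rng = np.random.default_rng(seed)
    x = np.concatenate([stats, [t]])
    W = 0.1 * rng.standard_normal((3, x.size))
    # Hand-coded prior: early steps favor the low (global-layout) band,
    # late steps favor the high (fine-detail) band.
    bias = np.array([2.0 * t, 0.5, 2.0 * (1.0 - t)])
    logits = W @ x + bias
    p = np.exp(logits - logits.max())
    p /= p.sum()
    heads = np.floor(p * n_heads).astype(int)
    heads[np.argmax(p)] += n_heads - heads.sum()   # spend the full budget
    return heads  # counts for [low, mid, high]

print("early (t=0.9):", route_heads(np.zeros(8), 0.9))
print("late  (t=0.1):", route_heads(np.zeros(8), 0.1))
```

Even this crude version reproduces the qualitative behavior the paper describes: most heads sit on the low band early in denoising and migrate to the high band as detail synthesis takes over.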

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same band-splitting idea could be tested in other iterative generative settings such as audio waveform models or 3D scene synthesis where frequency structure is also present.
  • If the routing network proves stable, it might be combined with existing sparse-attention libraries to obtain real speedups beyond the reported simulations.
  • Energy use per generated frame should drop noticeably for very long videos, which would matter for large-scale training runs.
  • The method leaves open whether the routing decisions themselves could be learned jointly with the diffusion weights rather than using hand-designed statistics.

Load-bearing premise

Video features remain sufficiently organized by frequency across denoising timesteps so the heterogeneous operators and routing network preserve generative quality without post-hoc retraining.

What would settle it

Run FreqFormer on 1M-token video sequences and measure perceptual quality or FID against a dense-attention baseline; clear degradation at later denoising steps would show the spectral-structure assumption does not hold.

Figures

Figures reproduced from arXiv: 2604.22808 by Haopeng Jin.

Figure 1. FreqFormer overview: frequency decomposition, adaptive routing, and heterogeneous attention. High-level diagram of the FreqFormer layer. Input video tokens are transformed into a spectral basis, partitioned into low-, mid-, and high-frequency bands, and routed to dense compressed global attention, block-sparse attention, and local window attention respectively. A timestep-conditioned router allocates heads…

Figure 2. Learnable spectral decomposition and band partitioning. Illustration of the spectral decomposition layer with alternative transform instantiations: fixed DCT, fixed wavelets, and learned orthonormal mixing initialized from DCT. The figure shows how transformed coefficients are partitioned into low, mid, and high bands over spatial-temporal token axes, and how low-frequency coefficients are further compress…

Figure 3. Hierarchical frequency-adaptive attention and cross-band residual exchange. Detailed pipeline view of band-specific attention operators. The low band applies dense global attention on compressed tokens, the mid band uses strided block-sparse attention over a predefined pattern, and the high band uses local sliding-window attention. Summary-token cross-band residual exchange is depicted to show how global a…

Figure 4. Spectral routing over denoising timesteps. Schematic of the routing network taking pooled token statistics and timestep embeddings to produce band probabilities and head allocations. The figure includes an example trajectory where early denoising allocates more heads to low-frequency global modeling and later steps shift capacity toward high-frequency detail synthesis. A practical upper bound can be writte…

Figure 5. Fused multi-band attention kernel with warp specialization. Execution diagram of the fused GPU kernel. Warp groups are assigned to dense compressed attention, block-sparse attention, and local attention within a single launch, with shared-memory staging and a unified output epilogue. The figure contrasts fused execution with separate kernel launches to highlight reduced launch overhead and improved occupan…

Figure 6. Sim Convergence Curves. Simulation result from sim_efficiency_kernel_system.py.

Figure 7. Sim Scaling Law Curve. Simulation result from sim_efficiency_kernel_system.py.

Figure 8. Sim Throughput, NVIDIA H20 96 GB. Simulation result from sim_efficiency_kernel_system.py.

Figure 9. Sim Throughput, NVIDIA H100 SXM 80 GB. Simulation result from sim_efficiency_kernel_system.py.

Figure 10. Sim Roofline, NVIDIA H20 96 GB, FP8 where supported. Simulation result from sim_efficiency_kernel_system.py.

Figure 11. Sim Roofline, NVIDIA H100 SXM 80 GB, FP8 where supported. Simulation result from sim_efficiency_kernel_system.py.

Figure 12. Sim Roofline, NVIDIA H20 96 GB, BF16. Simulation result from sim_efficiency_kernel_system.py.

Figure 13. Sim Roofline, NVIDIA H100 SXM 80 GB, BF16. Simulation result from sim_efficiency_kernel_system.py.
read the original abstract

Long-sequence video diffusion transformers hit a quadratic self-attention cost that dominates runtime and memory for very long token sequences. Most efficient attention methods use one approximation everywhere, yet video features are spectrally structured: low frequencies carry global layout and coarse motion; high frequencies carry texture and fine detail. We present FreqFormer, a frequency-aware heterogeneous attention framework. Token features are split into spectral bands with different operators: dense global attention on compressed low-frequency content, structured block-sparse attention on mid frequencies, and sliding-window local attention on high frequencies. A lightweight spectral routing network allocates heads across bands using layer statistics and the diffusion timestep, shifting compute toward global structure early in denoising and detail later. Cross-band summary tokens provide cheap residual exchange. FreqFormer is paired with a fused GPU execution plan that co-schedules dense, sparse, and local branches to cut kernel launches and memory traffic. We give a consistent complexity model, an orthonormal-decomposition view of approximation, and simulation-based systems numbers (throughput, arithmetic intensity, memory traffic, duration scaling). In simulations from 64K to 1M tokens, FreqFormer substantially reduces estimated attention FLOPs and KV-related memory traffic versus dense attention while keeping a hardware-friendly pattern, supporting spectrally structured heterogeneous attention as a practical direction for long-video diffusion transformers.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper introduces FreqFormer, a frequency-aware heterogeneous attention framework for long-sequence video diffusion transformers. Token features are split into spectral bands processed by different operators (dense global attention on low-frequency content, block-sparse attention on mid frequencies, sliding-window local attention on high frequencies), with a lightweight spectral routing network that allocates heads using layer statistics and diffusion timestep. Cross-band summary tokens enable residual exchange, and the method is paired with a fused GPU execution plan. The authors provide a consistent complexity model, an orthonormal-decomposition view of the approximation, and simulation-based results showing reduced attention FLOPs and KV memory traffic versus dense attention for sequences from 64K to 1M tokens.

Significance. If the spectral structure of video features holds across denoising timesteps and the routing preserves generative quality, FreqFormer could meaningfully advance scalable long-video diffusion by replacing uniform approximations with band-specific operators that align with natural frequency content while maintaining hardware-friendly patterns. The provided complexity model and simulation numbers offer a clear, consistent basis for the efficiency claims.

major comments (2)
  1. [Abstract] Abstract and simulation results: No end-to-end training results, FID/FVD scores, perceptual metrics, or ablations comparing generated video quality against dense attention are reported. This is load-bearing for the central claim, as the efficiency gains are conditional on the untested assumption that the heterogeneous operators plus timestep-aware routing preserve quality without degradation.
  2. [Complexity model] Complexity model and simulations: The reported FLOP and memory-traffic reductions for 64K–1M tokens rest entirely on estimated complexity modeling and simulations rather than measured hardware performance or an implemented model; no table or figure provides per-band breakdown, routing overhead, or actual throughput numbers.

Simulated Author's Rebuttal

2 responses · 2 unresolved

We thank the referee for the constructive and detailed review. We address the major comments point by point below, clarifying the simulation-focused scope of the work while outlining revisions to strengthen the presentation.

read point-by-point responses
  1. Referee: [Abstract] Abstract and simulation results: No end-to-end training results, FID/FVD scores, perceptual metrics, or ablations comparing generated video quality against dense attention are reported. This is load-bearing for the central claim, as the efficiency gains are conditional on the untested assumption that the heterogeneous operators plus timestep-aware routing preserve quality without degradation.

    Authors: We acknowledge that the manuscript does not report end-to-end training results, FID/FVD scores, or perceptual metrics. The work is a simulation-based study of the frequency-aware heterogeneous attention mechanism, supported by an analytical complexity model and an orthonormal-decomposition analysis of the approximation error. The assumption of quality preservation follows from the alignment of band-specific operators with video spectral structure and the use of cross-band summary tokens. We will revise the abstract and add a limitations section to explicitly state this scope and note the requirement for future full-model validation. revision: partial

  2. Referee: [Complexity model] Complexity model and simulations: The reported FLOP and memory-traffic reductions for 64K–1M tokens rest entirely on estimated complexity modeling and simulations rather than measured hardware performance or an implemented model; no table or figure provides per-band breakdown, routing overhead, or actual throughput numbers.

    Authors: The efficiency numbers are derived from the consistent analytical complexity model and simulations described in the manuscript. We will add a table providing per-band FLOP and memory-traffic breakdowns, explicit calculations of the spectral routing overhead, and additional figures reporting arithmetic intensity and estimated throughput from the simulation framework. As the study does not include a full GPU kernel implementation, hardware measurements are not available. revision: yes

standing simulated objections not resolved
  • End-to-end generative quality evaluation including FID/FVD scores and direct ablations against dense attention, which would require training a complete video diffusion model beyond the current simulation study scope.
  • Measured hardware performance, throughput, and kernel execution times from a deployed implementation, as all results rely on analytical modeling and simulations.

Circularity Check

0 steps flagged

No circularity: complexity model and simulations derived directly from operator definitions

full rationale

The paper's central results consist of a complexity model and simulation estimates for FLOP/memory reductions that are computed from the explicit definitions of the heterogeneous operators (dense low-frequency attention, block-sparse mid-frequency, sliding-window high-frequency) plus the routing network and cross-band tokens. These quantities follow arithmetically from the proposed architecture without any fitted parameters being renamed as predictions, without self-citations serving as load-bearing justifications for uniqueness or ansatz choices, and without any derivation step that reduces to its own inputs by construction. The orthonormal-decomposition view is presented as an interpretive lens on the same operator set rather than an independent claim. The analysis is therefore self-contained against the paper's own stated components.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the premise that video diffusion features admit a stable spectral decomposition whose bands can be processed independently with different attention operators without destroying the generative signal; this premise is stated but not derived in the abstract.

axioms (1)
  • domain assumption Video token features are spectrally structured such that low frequencies carry global layout and high frequencies carry texture.
    Invoked in the first paragraph of the abstract to justify the band split.

pith-pipeline@v0.9.0 · 5534 in / 1351 out tokens · 24857 ms · 2026-05-10T15:51:20.090533+00:00 · methodology


Reference graph

Works this paper leans on

43 extracted references · 8 canonical work pages · 5 internal anchors

  1. [1] Arnab, A., Dehghani, M., Heigold, G., Sun, C., Lucic, M., and Schmid, C. (2021). ViViT: A Video Vision Transformer. Proceedings of the IEEE/CVF International Conference on Computer Vision.
  2. [2] Beltagy, I., Peters, M. E., and Cohan, A. (2020). Longformer: The Long-Document Transformer. arXiv preprint arXiv:2004.05150.
  3. [3] Bertasius, G., Wang, H., and Torresani, L. (2021). Is Space-Time Attention All You Need for Video Understanding? Proceedings of the International Conference on Machine Learning.
  4. [4] Blattmann, A., Dockhorn, T., Kulal, S., Mendelevitch, O., Kilian, M., Lorenz, D., Levi, Y., English, Z., Voleti, V., Letts, A., Jampani, V., and Rombach, R. (2023). Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets. arXiv preprint arXiv:2311.15127.
  5. [5] Brooks, T., Peebles, W., Holynski, A., and Efros, A. A. (2024). Video generation models as world simulators. arXiv preprint arXiv:2405.02363.
  6. [6] Bruna, J. and Mallat, S. (2013). Invariant scattering convolution networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8), 1872–1886.
  7. [7] Child, R., Gray, S., Radford, A., and Sutskever, I. (2019). Generating Long Sequences with Sparse Transformers. arXiv preprint arXiv:1904.10509.
  8. [8] Choromanski, K., Likhosherstov, V., Dohan, D., Song, X., Gane, A., Sarlos, T., Hawkins, P., Davis, J. Q., Mohiuddin, A., Kaiser, Ł., Belanger, D., Colwell, L., and Weller, A. (2021). Rethinking Attention with Performers. International Conference on Learning Representations.
  9. [9] Dao, T. (2024). FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-Precision. arXiv preprint arXiv:2407.08608.
  10. [10] Dao, T., Fu, D. Y., Ermon, S., Rudra, A., and Re, C. (2022). FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness. Advances in Neural Information Processing Systems.
  11. [11] Fan, H., Xiong, B., Mangalam, K., Li, Y., Yan, Z., Malik, J., and Feichtenhofer, C. (2021). Multiscale Vision Transformers. Proceedings of the IEEE/CVF International Conference on Computer Vision.
  12. [12] Feichtenhofer, C., Fan, H., Li, Y., and He, K. (2022). Masked Autoencoders As Spatiotemporal Learners. Advances in Neural Information Processing Systems.
  13. [13] Gu, A., Goel, K., and Re, C. (2022). Efficiently Modeling Long Sequences with Structured State Spaces. International Conference on Learning Representations.
  14. [14] Ham, T. J., Jung, M., Kim, J., Oh, Y. H., Park, Y., Song, Y., Park, J.-H., Lee, S., Park, K., and Kwon, S. (2024). FlatAttention: Fast and Accurate Attention via Flat Dataflow on GPUs. Proceedings of the ACM/IEEE International Symposium on Computer Architecture.
  15. [15] Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., and Hochreiter, S. (2017). GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium. Advances in Neural Information Processing Systems.
  16. [16] Ho, J., Jain, A., and Abbeel, P. (2020). Denoising Diffusion Probabilistic Models. Advances in Neural Information Processing Systems.
  17. [17] Ho, J., Chan, W., Saharia, C., Whang, J., Gao, R., Fleet, D., Norouzi, M., and Salimans, T. (2022). Imagen Video: High Definition Video Generation with Diffusion Models. arXiv preprint arXiv:2210.02303.
  18. [18] Katharopoulos, A., Vyas, A., Pappas, N., and Fleuret, F. (2020). Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention. Proceedings of the International Conference on Machine Learning.
  19. [19] Karras, T., Aittala, M., Aila, T., and Laine, S. (2022). Elucidating the Design Space of Diffusion-Based Generative Models. Advances in Neural Information Processing Systems.
  20. [20] Kwon, W., Kim, J., Choi, S., and Lee, J. (2023). Efficient Memory Management for Large Language Model Serving with PagedAttention. Proceedings of the ACM SIGOPS Symposium on Operating Systems Principles.
  21. [21] Lee-Thorp, J., Ainslie, J., Eckstein, I., and Ontanon, S. (2022). FNet: Mixing Tokens with Fourier Transforms. Proceedings of the North American Chapter of the Association for Computational Linguistics.
  22. [22] Li, Z., Kovachki, N., Azizzadenesheli, K., Liu, B., Bhattacharya, K., Stuart, A., and Anandkumar, A. (2021). Fourier Neural Operator for Parametric Partial Differential Equations. International Conference on Learning Representations.
  23. [23] Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., and Guo, B. (2021). Swin Transformer: Hierarchical Vision Transformer Using Shifted Windows. Proceedings of the IEEE/CVF International Conference on Computer Vision.
  24. [24] Mallat, S. (1989). A theory for multiresolution signal decomposition: The wavelet representation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 11(7), 674–693.
  25. [25] Narayanan, D., Shoeybi, M., Casper, J., LeGresley, P., Patwary, M., Korthikanti, V., Vainbrand, D., Kashinkunti, P., Bernauer, J., Catanzaro, B., Phanishayee, A., and Zaharia, M. (2021). Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM. Proceedings of Machine Learning and Systems.
  26. [26] Peng, H., Pappas, N., Yogatama, D., Schwarz, J., Smith, N. A., and Kong, L. (2021). Random Feature Attention. International Conference on Learning Representations.
  27. [27] Peebles, W. and Xie, S. (2023). Scalable Diffusion Models with Transformers. Proceedings of the IEEE/CVF International Conference on Computer Vision.
  28. [28] Poli, M., Massaroli, S., Nguyen, E., Fu, D. Y., Dao, T., Baccus, S., Bengio, Y., Ermon, S., and Re, C. (2023). Hyena Hierarchy: Towards Larger Convolutional Language Models. Proceedings of the International Conference on Machine Learning.
  29. [29] Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., Krueger, G., and Sutskever, I. (2021). Learning Transferable Visual Models From Natural Language Supervision. Proceedings of the International Conference on Machine Learning.
  30. [30] Simoncelli, E. P. and Olshausen, B. A. (2001). Natural Image Statistics and Neural Representation. Annual Review of Neuroscience, 24, 1193–1216.
  31. [31] Singer, U., Polyak, A., Hayes, T., Yin, X., An, J., Zhang, S., Hu, Q., Yang, H., Ashual, O., Gafni, O., Parikh, D., Gupta, S., and Taigman, Y. (2023). Make-A-Video: Text-to-Video Generation without Text-Video Data. International Conference on Learning Representations.
  32. [32] Tolstikhin, I., Houlsby, N., Kolesnikov, A., Beyer, L., Zhai, X., Unterthiner, T., Yung, J., Steiner, A., Keysers, D., Uszkoreit, J., Lucic, M., and Dosovitskiy, A. (2021). MLP-Mixer: An all-MLP Architecture for Vision. Advances in Neural Information Processing Systems.
  33. [33] Tong, Z., Song, Y., Wang, J., and Wang, L. (2022). VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training. Advances in Neural Information Processing Systems.
  34. [34] Torralba, A. and Oliva, A. (2003). Statistics of natural image categories. Network: Computation in Neural Systems, 14(3), 391–412.
  35. [35] Unterthiner, T., Nessler, B., Heigold, G., van den Oord, A., and Hochreiter, S. (2018). Towards Accurate Generative Models of Video: A New Metric and Challenges. arXiv preprint arXiv:1812.01717.
  36. [36] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. (2017). Attention Is All You Need. Advances in Neural Information Processing Systems.
  37. [37] Wang, H., Li, Y., and Feichtenhofer, C. (2023a). InternVideo: General Video Foundation Models via Generative and Discriminative Learning. arXiv preprint arXiv:2212.03191.
  38. [38] Wang, X., Girshick, R., Gupta, A., and He, K. (2018). Non-local Neural Networks. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
  39. [39] Xiong, Y., Zeng, Z., Chakraborty, R., Tan, M., Fung, G., Li, Y., and Singh, V. (2021). Nystromformer: A Nystrom-Based Algorithm for Approximating Self-Attention. Proceedings of the AAAI Conference on Artificial Intelligence.
  40. [40] Zaheer, M., Guruganesh, G., Dubey, K. A., Ainslie, J., Alberti, C., Ontanon, S., Pham, P., Ravula, A., Wang, Q., Yang, L., and Ahmed, A. (2020). Big Bird: Transformers for Longer Sequences. Advances in Neural Information Processing Systems.
  41. [41] Zheng, C., Zhang, H., and Xu, J. (2024). Survey of Efficient Attention for Long-Context Transformers. ACM Computing Surveys.
  42. [42] Zhou, D., Kang, B., Jin, X., Yang, L., Lian, X., Jiang, Z., Hou, Q., and Feng, J. (2022). DeepViT: Towards Deeper Vision Transformer. Proceedings of the AAAI Conference on Artificial Intelligence.
  43. [43] Zhu, Y., Kiros, R., Zemel, R., Salakhutdinov, R., Urtasun, R., Torralba, A., and Fidler, S. (2015). Aligning Books and Movies: Towards Story-Like Visual Explanations by Watching Movies and Reading Books. Proceedings of the IEEE International Conference on Computer Vision.