pith. machine review for the scientific record.

arxiv: 2603.23284 · v2 · submitted 2026-03-24 · 💻 cs.CV

Recognition: 1 theorem link · Lean Theorem

WaveSFNet: A Wavelet-Based Codec and Spatial--Frequency Dual-Domain Gating Network for Spatiotemporal Prediction

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 00:14 UTC · model grok-4.3

classification 💻 cs.CV
keywords spatiotemporal prediction · wavelet transform · dual-domain gating · video frame prediction · Moving MNIST · efficient forecasting

The pith

WaveSFNet uses a wavelet codec and dual-domain gating to predict future frames with competitive accuracy and low complexity.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Spatiotemporal predictive learning forecasts future frames from historical observations without supervision. The central difficulty is capturing long-range dynamics while retaining the high-frequency detail needed for sharp predictions over many steps. WaveSFNet addresses this by pairing a wavelet-based codec, which keeps high-frequency subbands through downsampling, with a translator that injects adjacent-frame differences to capture motion and fuses spatial-domain and frequency-domain features through gates. The design targets a better balance between local and global modeling than pure spatial operators or strided sampling achieve, with the payoff of efficient video prediction for applications such as traffic and weather forecasting.

Core claim

WaveSFNet unifies a wavelet-based codec with a spatial-frequency dual-domain gated spatiotemporal translator. The codec preserves high-frequency subband cues during downsampling and reconstruction. The translator injects adjacent-frame differences to enhance dynamic information and performs dual-domain gated fusion between large-kernel spatial local modeling and frequency-domain global modulation, together with gated channel interaction for cross-channel feature exchange.
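The codec's load-bearing property can be illustrated with a minimal single-level 2D Haar transform in NumPy. This is a sketch of the general mechanism, not the paper's actual codec: the LL band halves the resolution (like strided sampling would), while the LH/HL/HH subbands carry the high-frequency detail that strided sampling discards, and the inverse transform reconstructs the input exactly.

```python
import numpy as np

def haar_dwt2(x):
    """One level of 2D Haar analysis: returns (LL, LH, HL, HH) subbands,
    each at half the input resolution."""
    a = (x[0::2, 0::2] + x[0::2, 1::2] + x[1::2, 0::2] + x[1::2, 1::2]) / 2  # LL
    h = (x[0::2, 0::2] - x[0::2, 1::2] + x[1::2, 0::2] - x[1::2, 1::2]) / 2  # LH
    v = (x[0::2, 0::2] + x[0::2, 1::2] - x[1::2, 0::2] - x[1::2, 1::2]) / 2  # HL
    d = (x[0::2, 0::2] - x[0::2, 1::2] - x[1::2, 0::2] + x[1::2, 1::2]) / 2  # HH
    return a, h, v, d

def haar_idwt2(a, h, v, d):
    """Inverse Haar synthesis: reconstructs the input losslessly because
    the high-frequency subbands were kept, not thrown away."""
    H, W = a.shape
    x = np.empty((2 * H, 2 * W))
    x[0::2, 0::2] = (a + h + v + d) / 2
    x[0::2, 1::2] = (a - h + v - d) / 2
    x[1::2, 0::2] = (a + h - v - d) / 2
    x[1::2, 1::2] = (a - h - v + d) / 2
    return x

rng = np.random.default_rng(0)
frame = rng.random((64, 64))
ll, lh, hl, hh = haar_dwt2(frame)   # ll alone is a lossy 2x downsample
rec = haar_idwt2(ll, lh, hl, hh)    # all four subbands give a lossless round trip
```

A codec built on this transform downsamples by routing LL deeper while the detail subbands are carried along for reconstruction, which is the premise the paper's decoder depends on.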

What carries the argument

Wavelet-based codec combined with spatial-frequency dual-domain gated spatiotemporal translator that uses gated fusion for local-global balance and frame difference injection for dynamics.
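The gated fusion idea can be sketched in NumPy under simplifying assumptions: a box filter stands in for large-kernel spatial local modeling, an elementwise spectral weight stands in for frequency-domain global modulation, and a sigmoid gate mixes the two branches. The function names and weights here are illustrative, not taken from the paper's implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def box_filter(x, k=7):
    """Stand-in for large-kernel spatial local modeling: a k x k mean
    filter built from zero-padded sliding sums."""
    p = k // 2
    xp = np.pad(x, p)
    out = np.zeros_like(x)
    for i in range(k):
        for j in range(k):
            out += xp[i:i + x.shape[0], j:j + x.shape[1]]
    return out / (k * k)

def dual_domain_gated_fusion(x, freq_weight, gate_weight):
    """Sketch of dual-domain gated fusion: a data-dependent gate mixes a
    spatial-branch output with a frequency-branch output produced by
    global spectral modulation (multiplying the FFT by a weight map)."""
    spatial = box_filter(x)                         # local interactions
    spec = np.fft.fft2(x) * freq_weight             # global modulation
    frequency = np.real(np.fft.ifft2(spec))         # back to the spatial domain
    g = sigmoid(gate_weight * x)                    # gate in [0, 1]
    return g * spatial + (1.0 - g) * frequency

rng = np.random.default_rng(0)
x = rng.standard_normal((32, 32))
w = rng.standard_normal((32, 32))   # would be learned in practice
y = dual_domain_gated_fusion(x, w, gate_weight=1.0)
```

The frame-difference injection is simpler still: the translator's input is augmented with `frames[t] - frames[t - 1]` before fusion, so motion is represented explicitly rather than left implicit in the features.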

If this is right

  • Delivers competitive prediction accuracy on Moving MNIST, TaxiBJ, and WeatherBench.
  • Keeps low computational complexity.
  • Supports sharp multi-step predictions by retaining high-frequency details.
  • Provides an efficient recurrent-free framework for spatiotemporal forecasting.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This approach might reduce the need for recurrent models in real-time prediction systems.
  • Similar dual-domain gating could improve efficiency in other video processing tasks like denoising or super-resolution.
  • Testing on additional datasets with complex dynamics would validate broader applicability.

Load-bearing premise

The wavelet codec reliably preserves high-frequency subband cues during downsampling and reconstruction while the dual-domain gated fusion balances local interactions with global propagation without artifacts over multiple steps.

What would settle it

Compare multi-step prediction errors and visual sharpness on the TaxiBJ or WeatherBench dataset against baseline methods; if errors are higher or details are lost, the claim would be falsified.
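The proposed test amounts to autoregressive rollout with per-step error tracking. A toy sketch (one-pixel-shift dynamics and two hypothetical one-step models, none of it from the paper) shows the pattern the comparison would expose: a model that smears high-frequency detail accumulates error over steps, while an exact model does not.

```python
import numpy as np

def rollout_mse(step, frame, targets):
    """Autoregressive rollout: feed the model its own prediction and
    record the per-step MSE against ground truth."""
    errs, pred = [], frame
    for target in targets:
        pred = step(pred)
        errs.append(float(np.mean((pred - target) ** 2)))
    return errs

# Toy dynamics: the true frame shifts one pixel per step.
start = np.eye(16)
truth = [np.roll(start, s, axis=1) for s in range(1, 6)]

exact = lambda f: np.roll(f, 1, axis=1)                # matches the dynamics
smeared = lambda f: 0.5 * (np.roll(f, 1, axis=1) + f)  # loses high frequencies

e_exact = rollout_mse(exact, start, truth)      # stays at zero
e_smeared = rollout_mse(smeared, start, truth)  # grows with rollout length
```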

Figures

Figures reproduced from arXiv: 2603.23284 by Hu Chen, Runming Xie, Xinyong Cai, Yuankai Wu.

Figure 1. Performance comparison on the TaxiBJ dataset. Bubble size denotes …

Figure 2. Overall architecture and core modules of WaveSFNet. WaveSFNet follows an encoder–translator–decoder design. A wavelet-based multi-scale encoder extracts latent features from input frames. A TDI Block injects adjacent-frame differences, and Nt stacked ST Blocks apply spatial–frequency dual-domain gating after packing time into channels. A wavelet-symmetric decoder reconstructs predictions. Subsequently, Ns …

Figure 3. Qualitative visualizations of WaveSFNet on TaxiBJ.

Figure 4. Frequency spectrum analysis on the WeatherBench T2M dataset. The …
Original abstract

Spatiotemporal predictive learning aims to forecast future frames from historical observations in an unsupervised manner, and is critical to a wide range of applications. The key challenge is to model long-range dynamics while preserving high-frequency details for sharp multi-step predictions. Existing efficient recurrent-free frameworks typically rely on strided convolutions or pooling for sampling, which tends to discard textures and boundaries, while purely spatial operators often struggle to balance local interactions with global propagation. To address these issues, we propose WaveSFNet, an efficient framework that unifies a wavelet-based codec with a spatial--frequency dual-domain gated spatiotemporal translator. The wavelet-based codec preserves high-frequency subband cues during downsampling and reconstruction. Meanwhile, the translator first injects adjacent-frame differences to explicitly enhance dynamic information, and then performs dual-domain gated fusion between large-kernel spatial local modeling and frequency-domain global modulation, together with gated channel interaction for cross-channel feature exchange. Extensive experiments demonstrate that WaveSFNet achieves competitive prediction accuracy on Moving MNIST, TaxiBJ, and WeatherBench, while maintaining low computational complexity. Our code is available at https://github.com/fhjdqaq/WaveSFNet.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes WaveSFNet, an efficient spatiotemporal prediction framework that integrates a wavelet-based codec to preserve high-frequency details during downsampling and reconstruction with a spatial-frequency dual-domain gated translator that incorporates adjacent-frame differences and performs gated fusion between spatial local modeling and frequency-domain global modulation. It reports competitive prediction accuracy on Moving MNIST, TaxiBJ, and WeatherBench benchmarks while maintaining low computational complexity.

Significance. If the central claims hold, this work could advance efficient recurrent-free models for video prediction by addressing the loss of high-frequency cues in standard sampling methods and balancing local-global interactions via dual-domain gating. The availability of code on GitHub supports reproducibility, and the use of public datasets allows for direct comparison with prior art.

major comments (2)
  1. Abstract and Method section: The assertion that the wavelet-based codec reliably preserves high-frequency subband cues during downsampling and reconstruction lacks direct quantitative validation, such as reconstruction PSNR/SSIM metrics on high-pass subbands or an ablation replacing the wavelet codec with standard strided convolutions while holding the dual-domain translator fixed.
  2. §4 Experiments: The multi-step prediction results on TaxiBJ and WeatherBench lack reported error accumulation analysis over rollout steps and full ablation details isolating the wavelet component from adjacent-frame difference injection and gated fusion, leaving open whether accuracy gains are driven by the proposed codec.
minor comments (1)
  1. Benchmark tables: Error bars or standard deviations are not reported alongside accuracy metrics, which would strengthen assessment of the competitive performance claims.
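Addressing the minor comment is mechanical: rerun training over several random seeds and report each metric as mean ± sample standard deviation. The per-seed MSE values below are invented purely for illustration.

```python
import numpy as np

# Hypothetical per-seed test MSEs for two methods; in practice each value
# comes from retraining the model with a different random seed.
ours = np.array([21.1, 21.4, 20.9, 21.3, 21.2])
baseline = np.array([22.0, 21.8, 22.3, 21.9, 22.1])

def summarize(x):
    """Mean and sample standard deviation (ddof=1), as the referee requests."""
    return x.mean(), x.std(ddof=1)

m, s = summarize(ours)  # report as "21.18 +/- 0.19" in the benchmark table
```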

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major comment point by point below. Where the manuscript requires additional validation or analysis, we will revise accordingly in the next version.

Point-by-point responses
  1. Referee: Abstract and Method section: The assertion that the wavelet-based codec reliably preserves high-frequency subband cues during downsampling and reconstruction lacks direct quantitative validation, such as reconstruction PSNR/SSIM metrics on high-pass subbands or an ablation replacing the wavelet codec with standard strided convolutions while holding the dual-domain translator fixed.

    Authors: We agree that direct quantitative validation would strengthen the claims. In the revised manuscript we will add reconstruction PSNR and SSIM metrics computed specifically on the high-pass subbands across the evaluation datasets. We will also include a controlled ablation that replaces the wavelet codec with standard strided convolutions while keeping the dual-domain translator fixed, thereby isolating the codec's contribution. revision: yes

  2. Referee: §4 Experiments: The multi-step prediction results on TaxiBJ and WeatherBench lack reported error accumulation analysis over rollout steps and full ablation details isolating the wavelet component from adjacent-frame difference injection and gated fusion, leaving open whether accuracy gains are driven by the proposed codec.

    Authors: We acknowledge that error accumulation analysis and finer-grained ablations would clarify the source of gains. We will add tables or plots showing prediction error accumulation over successive rollout steps for both TaxiBJ and WeatherBench. We will also expand the ablation section to report separate variants that disable the wavelet codec, the adjacent-frame difference injection, and the gated fusion modules individually while holding other components constant. revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation chain

Full rationale

The paper introduces WaveSFNet as a new architectural proposal that combines a wavelet-based codec for downsampling/reconstruction with a spatial-frequency dual-domain gated translator. All central claims rest on empirical validation against public benchmarks (Moving MNIST, TaxiBJ, WeatherBench) rather than any derivation that reduces by construction to fitted parameters, self-definitions, or load-bearing self-citations. No equations or steps equate a prediction to its own inputs; the method is presented as an independent design choice whose performance is measured externally.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

The central claim rests on standard deep-learning training assumptions plus the domain-specific premise that wavelet subbands preserve usable high-frequency cues for prediction; no new physical entities are postulated.

free parameters (1)
  • network hyperparameters (kernel sizes, channel counts, gating thresholds)
    Typical learned or hand-chosen values in deep networks that are fitted during training on the target datasets.
axioms (1)
  • domain assumption: the wavelet transform preserves high-frequency subband cues during downsampling and reconstruction
    Invoked in the description of the wavelet-based codec as the mechanism that avoids texture loss.
invented entities (1)
  • spatial-frequency dual-domain gated spatiotemporal translator (no independent evidence)
    purpose: To perform gated fusion of large-kernel spatial local modeling and frequency-domain global modulation together with channel interaction
    New architectural component introduced to address the balance between local and global dynamics.

pith-pipeline@v0.9.0 · 5517 in / 1342 out tokens · 39325 ms · 2026-05-15T00:14:53.040272+00:00 · methodology

