WaveSFNet: A Wavelet-Based Codec and Spatial-Frequency Dual-Domain Gating Network for Spatiotemporal Prediction
Pith reviewed 2026-05-15 00:14 UTC · model grok-4.3
The pith
WaveSFNet uses a wavelet codec and dual-domain gating to predict future frames with competitive accuracy and low complexity.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
WaveSFNet unifies a wavelet-based codec with a spatial-frequency dual-domain gated spatiotemporal translator. The codec preserves high-frequency subband cues during downsampling and reconstruction. The translator injects adjacent-frame differences to enhance dynamic information and performs dual-domain gated fusion between large-kernel spatial local modeling and frequency-domain global modulation, together with gated channel interaction for cross-channel feature exchange.
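The translator's two ideas, injecting adjacent-frame differences and gating a local spatial branch against a global frequency branch, can be sketched schematically. The snippet below is a minimal numpy illustration, not the authors' implementation: `frame_difference_inject` and `dual_domain_gated_fusion` are invented names, a box filter stands in for learned large-kernel convolution, and a fixed radial low-pass mask stands in for learned frequency-domain modulation.

```python
import numpy as np

def frame_difference_inject(frames):
    """Append adjacent-frame differences as an extra channel (hypothetical
    stand-in for the paper's difference-injection step).
    frames: (T, H, W) array of grayscale frames -> (T, 2, H, W)."""
    diffs = np.diff(frames, axis=0, prepend=frames[:1])  # first diff is zero
    return np.stack([frames, diffs], axis=1)

def dual_domain_gated_fusion(x, k=7):
    """Gate a local spatial branch against a global frequency branch.
    x: (H, W) feature map; k: 'large kernel' size for the local branch."""
    # Local branch: k-by-k box filter (stand-in for large-kernel convolution).
    pad = k // 2
    xp = np.pad(x, pad, mode="edge")
    local = np.zeros_like(x)
    for i in range(k):
        for j in range(k):
            local += xp[i:i + x.shape[0], j:j + x.shape[1]]
    local /= k * k
    # Global branch: modulate the spectrum with a fixed radial low-pass mask
    # (stand-in for learned frequency-domain global modulation).
    F = np.fft.fft2(x)
    fy = np.fft.fftfreq(x.shape[0])[:, None]
    fx = np.fft.fftfreq(x.shape[1])[None, :]
    mask = np.exp(-(fy ** 2 + fx ** 2) / 0.02)
    global_ = np.real(np.fft.ifft2(F * mask))
    # Gate: element-wise sigmoid blend of the two branches.
    gate = 1.0 / (1.0 + np.exp(-(local - global_)))
    return gate * local + (1.0 - gate) * global_
```

In the actual model both branches and the gate would be learned; the sketch only shows the data flow the abstract describes.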
What carries the argument
Wavelet-based codec combined with spatial-frequency dual-domain gated spatiotemporal translator that uses gated fusion for local-global balance and frame difference injection for dynamics.
If this is right
- Delivers competitive prediction accuracy on Moving MNIST, TaxiBJ, and WeatherBench.
- Keeps low computational complexity.
- Supports sharp multi-step predictions by retaining high-frequency details.
- Provides an efficient recurrent-free framework for spatiotemporal forecasting.
Where Pith is reading between the lines
- This approach might reduce the need for recurrent models in real-time prediction systems.
- Similar dual-domain gating could improve efficiency in other video processing tasks like denoising or super-resolution.
- Testing on additional datasets with complex dynamics would validate broader applicability.
Load-bearing premise
The wavelet codec reliably preserves high-frequency subband cues during downsampling and reconstruction while the dual-domain gated fusion balances local interactions with global propagation without artifacts over multiple steps.
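For the codec half of this premise, an orthogonal wavelet such as Haar is exactly invertible, so the subbands themselves discard nothing; the empirical question is whether the network's processing of those subbands preserves the high-frequency cues. A minimal numpy sketch of one Haar analysis/synthesis level (illustrative, not the paper's codec):

```python
import numpy as np

def haar_dwt2(x):
    """One level of the 2D Haar transform: returns (LL, LH, HL, HH)
    subbands, each half the spatial size. x: (H, W) with even H, W."""
    a = x[0::2, 0::2]; b = x[0::2, 1::2]
    c = x[1::2, 0::2]; d = x[1::2, 1::2]
    ll = (a + b + c + d) / 2.0  # low-pass average
    lh = (a - b + c - d) / 2.0  # horizontal detail
    hl = (a + b - c - d) / 2.0  # vertical detail
    hh = (a - b - c + d) / 2.0  # diagonal detail
    return ll, lh, hl, hh

def haar_idwt2(ll, lh, hl, hh):
    """Exact inverse of haar_dwt2 (perfect reconstruction up to float error)."""
    H, W = ll.shape
    x = np.empty((2 * H, 2 * W))
    x[0::2, 0::2] = (ll + lh + hl + hh) / 2.0
    x[0::2, 1::2] = (ll - lh + hl - hh) / 2.0
    x[1::2, 0::2] = (ll + lh - hl - hh) / 2.0
    x[1::2, 1::2] = (ll - lh - hl + hh) / 2.0
    return x
```

The round trip `haar_idwt2(*haar_dwt2(x))` recovers `x` exactly, which is why the burden of the premise falls on the multi-step translator rather than on the transform itself.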
What would settle it
Compare multi-step prediction errors and visual sharpness on TaxiBJ or WeatherBench against baseline methods; if errors are higher or high-frequency details are lost, the claim would be falsified.
Original abstract
Spatiotemporal predictive learning aims to forecast future frames from historical observations in an unsupervised manner, and is critical to a wide range of applications. The key challenge is to model long-range dynamics while preserving high-frequency details for sharp multi-step predictions. Existing efficient recurrent-free frameworks typically rely on strided convolutions or pooling for sampling, which tends to discard textures and boundaries, while purely spatial operators often struggle to balance local interactions with global propagation. To address these issues, we propose WaveSFNet, an efficient framework that unifies a wavelet-based codec with a spatial--frequency dual-domain gated spatiotemporal translator. The wavelet-based codec preserves high-frequency subband cues during downsampling and reconstruction. Meanwhile, the translator first injects adjacent-frame differences to explicitly enhance dynamic information, and then performs dual-domain gated fusion between large-kernel spatial local modeling and frequency-domain global modulation, together with gated channel interaction for cross-channel feature exchange. Extensive experiments demonstrate that WaveSFNet achieves competitive prediction accuracy on Moving MNIST, TaxiBJ, and WeatherBench, while maintaining low computational complexity. Our code is available at https://github.com/fhjdqaq/WaveSFNet.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes WaveSFNet, an efficient spatiotemporal prediction framework that integrates a wavelet-based codec to preserve high-frequency details during downsampling and reconstruction with a spatial-frequency dual-domain gated translator that incorporates adjacent-frame differences and performs gated fusion between spatial local modeling and frequency-domain global modulation. It reports competitive prediction accuracy on Moving MNIST, TaxiBJ, and WeatherBench benchmarks while maintaining low computational complexity.
Significance. If the central claims hold, this work could advance efficient recurrent-free models for video prediction by addressing the loss of high-frequency cues in standard sampling methods and balancing local-global interactions via dual-domain gating. The availability of code on GitHub supports reproducibility, and the use of public datasets allows for direct comparison with prior art.
major comments (2)
- Abstract and Method section: The assertion that the wavelet-based codec reliably preserves high-frequency subband cues during downsampling and reconstruction lacks direct quantitative validation, such as reconstruction PSNR/SSIM metrics on high-pass subbands or an ablation replacing the wavelet codec with standard strided convolutions while holding the dual-domain translator fixed.
- §4 Experiments: The multi-step prediction results on TaxiBJ and WeatherBench lack reported error accumulation analysis over rollout steps and full ablation details isolating the wavelet component from adjacent-frame difference injection and gated fusion, leaving open whether accuracy gains are driven by the proposed codec.
minor comments (1)
- Benchmark tables: Error bars or standard deviations are not reported alongside accuracy metrics, which would strengthen assessment of the competitive performance claims.
Simulated Author's Rebuttal
We thank the referee for the constructive comments. We address each major comment point by point below. Where the manuscript requires additional validation or analysis, we will revise accordingly in the next version.
Point-by-point responses
-
Referee: Abstract and Method section: The assertion that the wavelet-based codec reliably preserves high-frequency subband cues during downsampling and reconstruction lacks direct quantitative validation, such as reconstruction PSNR/SSIM metrics on high-pass subbands or an ablation replacing the wavelet codec with standard strided convolutions while holding the dual-domain translator fixed.
Authors: We agree that direct quantitative validation would strengthen the claims. In the revised manuscript we will add reconstruction PSNR and SSIM metrics computed specifically on the high-pass subbands across the evaluation datasets. We will also include a controlled ablation that replaces the wavelet codec with standard strided convolutions while keeping the dual-domain translator fixed, thereby isolating the codec's contribution.
Revision: yes
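The subband metric promised here could be computed with a routine like the following, applied to each high-pass subband of the input and its reconstruction. This is a hedged sketch of standard PSNR, not the authors' evaluation code; the function name and `peak` parameter are illustrative.

```python
import numpy as np

def psnr(ref, rec, peak=1.0):
    """Peak signal-to-noise ratio in dB between a reference array and its
    reconstruction; higher is better, infinite for an exact match."""
    mse = np.mean((ref - rec) ** 2)
    if mse == 0:
        return float("inf")
    return 10.0 * np.log10(peak ** 2 / mse)
```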
-
Referee: §4 Experiments: The multi-step prediction results on TaxiBJ and WeatherBench lack reported error accumulation analysis over rollout steps and full ablation details isolating the wavelet component from adjacent-frame difference injection and gated fusion, leaving open whether accuracy gains are driven by the proposed codec.
Authors: We acknowledge that error accumulation analysis and finer-grained ablations would clarify the source of gains. We will add tables or plots showing prediction error accumulation over successive rollout steps for both TaxiBJ and WeatherBench. We will also expand the ablation section to report separate variants that disable the wavelet codec, the adjacent-frame difference injection, and the gated fusion modules individually while holding other components constant.
Revision: yes
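The error-accumulation analysis promised here reduces to computing a per-step loss over the rollout horizon. A minimal numpy sketch (illustrative naming; the actual evaluation would use the benchmark's metric and normalization):

```python
import numpy as np

def rollout_error_curve(preds, targets):
    """Per-step MSE over a multi-step rollout.
    preds, targets: (T, H, W) arrays of predicted and ground-truth frames;
    returns a length-T vector, making error growth across steps explicit."""
    return np.mean((preds - targets) ** 2, axis=(1, 2))
```

Plotting this curve for WaveSFNet against each baseline on TaxiBJ and WeatherBench would directly address the referee's concern about compounding errors.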
Circularity Check
No significant circularity in derivation chain
Full rationale
The paper introduces WaveSFNet as a new architectural proposal that combines a wavelet-based codec for downsampling/reconstruction with a spatial-frequency dual-domain gated translator. All central claims rest on empirical validation against public benchmarks (Moving MNIST, TaxiBJ, WeatherBench) rather than any derivation that reduces by construction to fitted parameters, self-definitions, or load-bearing self-citations. No equations or steps equate a prediction to its own inputs; the method is presented as an independent design choice whose performance is measured externally.
Axiom & Free-Parameter Ledger
free parameters (1)
- network hyperparameters (kernel sizes, channel counts, gating thresholds)
axioms (1)
- domain assumption Wavelet transform preserves high-frequency subband cues during downsampling and reconstruction
invented entities (1)
- spatial-frequency dual-domain gated spatiotemporal translator (no independent evidence)
Reference graph
Works this paper leans on
- [1] X. Shi, Z. Chen, H. Wang, D.-Y. Yeung, W.-K. Wong, and W.-c. Woo, "Convolutional LSTM network: A machine learning approach for precipitation nowcasting," Advances in Neural Information Processing Systems, vol. 28, 2015.
- [2] M. Reichstein, G. Camps-Valls, B. Stevens, M. Jung, J. Denzler, N. Carvalhais, and F. Prabhat, "Deep learning and process understanding for data-driven Earth system science," Nature, vol. 566, no. 7743, pp. 195–204, 2019.
- [3] J. Zhang, Y. Zheng, and D. Qi, "Deep spatio-temporal residual networks for citywide crowd flows prediction," in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 31, no. 1, 2017, pp. 1655–1661.
- [4] S. Fang, Q. Zhang, G. Meng, S. Xiang, and C. Pan, "GSTNet: Global spatial-temporal network for traffic flow prediction," in Proceedings of the 28th International Joint Conference on Artificial Intelligence, 2019, pp. 2286–2293.
- [5] J. Cheng, K. Li, Y. Liang, L. Sun, J. Yan, and Y. Wu, "Rethinking urban mobility prediction: A multivariate time series forecasting approach," IEEE Transactions on Intelligent Transportation Systems, 2024.
- [6] A. Bhattacharyya, M. Fritz, and B. Schiele, "Long-term on-board prediction of people in traffic scenes under uncertainty," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2018, pp. 4194–4202.
- [7] Y.-H. Kwon and M.-G. Park, "Predicting future frames using retrospective cycle GAN," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 1811–1820.
- [8] Y. Wang, M. Long, J. Wang, Z. Gao, and P. S. Yu, "PredRNN: Recurrent neural networks for predictive learning using spatiotemporal LSTMs," Advances in Neural Information Processing Systems, vol. 30, 2017.
- [9] Y. Wang, Z. Gao, M. Long, J. Wang, and P. S. Yu, "PredRNN++: Towards a resolution of the deep-in-time dilemma in spatiotemporal predictive learning," in International Conference on Machine Learning. PMLR, 2018, pp. 5123–5132.
- [10] Y. Wang, H. Wu, J. Zhang, Z. Gao, J. Wang, P. S. Yu, and M. Long, "PredRNN: A recurrent neural network for spatiotemporal predictive learning," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 45, no. 2, pp. 2208–2225, 2022.
- [11] Z. Gao, C. Tan, L. Wu, and S. Z. Li, "SimVP: Simpler yet better video prediction," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 3170–3180.
- [12] C. Tan, Z. Gao, L. Wu, Y. Xu, J. Xia, S. Li, and S. Z. Li, "Temporal attention unit: Towards efficient spatiotemporal predictive learning," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 18770–18782.
- [13] X. Wang, R. Girshick, A. Gupta, and K. He, "Non-local neural networks," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2018, pp. 7794–7803.
- [14] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, and N. Houlsby, "An image is worth 16x16 words: Transformers for image recognition at scale," in International Conference on Learning Representations, 2021.
- [15] Y. Rao, W. Zhao, Z. Zhu, J. Lu, and J. Zhou, "Global filter networks for image classification," Advances in Neural Information Processing Systems, vol. 34, pp. 980–993, 2021.
- [16] J. Lee-Thorp, J. Ainslie, I. Eckstein, and S. Ontanon, "FNet: Mixing tokens with Fourier transforms," in Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2022, pp. 4296–4313.
- [17] P. Liu, H. Zhang, K. Zhang, L. Lin, and W. Zuo, "Multi-level wavelet-CNN for image restoration," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2018, pp. 773–782.
- [18] Q. Li, L. Shen, S. Guo, and Z. Lai, "Wavelet integrated CNNs for noise-robust image classification," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 7245–7254.
- [19] Y. Wang, J. Zhang, H. Zhu, M. Long, J. Wang, and P. S. Yu, "Memory in memory: A predictive neural network for learning higher-order non-stationarity from spatiotemporal dynamics," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 9154–9162.
- [20] V. L. Guen and N. Thome, "Disentangling physical dynamics from unknown factors for unsupervised video prediction," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 11474–11484.
- [21] S. Tang, C. Li, P. Zhang, and R. Tang, "SwinLSTM: Improving spatiotemporal prediction accuracy using Swin Transformer and LSTM," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 13470–13479.
- [22] Y. Tang, P. Dong, Z. Tang, X. Chu, and J. Liang, "VMRNN: Integrating Vision Mamba and LSTM for efficient and accurate spatiotemporal forecasting," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2024, pp. 5663–5673.
- [23] W. Yu, M. Luo, P. Zhou, C. Si, Y. Zhou, X. Wang, J. Feng, and S. Yan, "MetaFormer is actually what you need for vision," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 10819–10829.
- [24] C. Tan, S. Li, Z. Gao, W. Guan, Z. Wang, Z. Liu, L. Wu, and S. Z. Li, "OpenSTL: A comprehensive benchmark of spatio-temporal predictive learning," Advances in Neural Information Processing Systems, vol. 36, pp. 69819–69831, 2023.
- [25] Z. Gao, X. Shi, H. Wang, Y. Zhu, Y. B. Wang, M. Li, and D.-Y. Yeung, "Earthformer: Exploring space-time transformers for Earth system forecasting," Advances in Neural Information Processing Systems, vol. 35, pp. 25390–25403, 2022.
- [26] Y. Zhong, L. Liang, I. Zharkov, and U. Neumann, "MMVP: Motion-matrix-based video prediction," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 4273–4283.
- [27] X. Nie, X. Chen, H. Jin, Z. Zhu, Y. Yan, and D. Qi, "Triplet attention transformer for spatiotemporal predictive learning," in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2024, pp. 7036–7045.
- [28] J. Pathak, S. Subramanian, P. Harrington, S. Raja, A. Chattopadhyay, M. Mardani, T. Kurth, D. Hall, Z. Li, K. Azizzadenesheli et al., "FourCastNet: A global data-driven high-resolution weather model using adaptive Fourier neural operators," arXiv preprint arXiv:2202.11214, 2022.
- [29] H. Wu, F. Xu, C. Chen, X.-S. Hua, X. Luo, and H. Wang, "PastNet: Introducing physical inductive biases for spatio-temporal video prediction," in Proceedings of the 32nd ACM International Conference on Multimedia, 2024, pp. 2917–2926.
- [30] B. N. Patro, V. P. Namboodiri, and V. S. Agneeswaran, "SpectFormer: Frequency and attention is what you need in a vision transformer," in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2025, pp. 9525–9536.
- [31] T. Yao, Y. Pan, Y. Li, C.-W. Ngo, and T. Mei, "Wave-ViT: Unifying wavelet and transformers for visual representation learning," in Proceedings of the European Conference on Computer Vision, 2022, pp. 328–345.
- [32] X. Nie, Y. Yan, S. Li, C. Tan, X. Chen, H. Jin, Z. Zhu, S. Z. Li, and D. Qi, "Wavelet-driven spatiotemporal predictive learning: Bridging frequency and time variations," in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 38, no. 5, 2024, pp. 4334–4342.
- [33] S. Woo, S. Debnath, R. Hu, X. Chen, Z. Liu, I. S. Kweon, and S. Xie, "ConvNeXt V2: Co-designing and scaling ConvNets with masked autoencoders," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 16133–16142.
- [34] H. Touvron, M. Cord, A. Sablayrolles, G. Synnaeve, and H. Jégou, "Going deeper with image transformers," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 32–42.
- [35] N. Srivastava, E. Mansimov, and R. Salakhudinov, "Unsupervised learning of video representations using LSTMs," in International Conference on Machine Learning. PMLR, 2015, pp. 843–852.
- [36] S. Rasp, P. D. Dueben, S. Scher, J. A. Weyn, S. Mouatadid, and N. Thuerey, "WeatherBench: A benchmark data set for data-driven weather forecasting," Journal of Advances in Modeling Earth Systems, vol. 12, no. 11, p. e2020MS002203, 2020.
- [37] M. Oliu, J. Selva, and S. Escalera, "Folded recurrent neural networks for future video prediction," in Proceedings of the European Conference on Computer Vision, 2018, pp. 716–731.
- [38] Z. Chang, X. Zhang, S. Wang, S. Ma, Y. Ye, X. Xinguang, and W. Gao, "MAU: A motion-aware unit for video prediction and beyond," Advances in Neural Information Processing Systems, vol. 34, pp. 26950–26962, 2021.
- [39] W. Yu, Y. Lu, S. Easterbrook, and S. Fidler, "Efficient and information-preserving future frame prediction and beyond," in International Conference on Learning Representations, 2020.
- [40] Y. Wang, L. Jiang, M.-H. Yang, L.-J. Li, M. Long, and L. Fei-Fei, "Eidetic 3D LSTM: A model for video prediction and beyond," in International Conference on Learning Representations, 2018.
- [41] C. Tan, Z. Gao, S. Li, and S. Z. Li, "SimVPv2: Towards simple yet powerful spatiotemporal predictive learning," IEEE Transactions on Multimedia, 2025.
- [42] S. Li, Z. Wang, Z. Liu, C. Tan, H. Lin, D. Wu, Z. Chen, J. Zheng, and S. Z. Li, "MogaNet: Multi-order gated aggregation network," in International Conference on Learning Representations, 2024.
- [43] Y. Rao, W. Zhao, Y. Tang, J. Zhou, S.-L. Lim, and J. Lu, "HorNet: Efficient high-order spatial interactions with recursive gated convolutions," Advances in Neural Information Processing Systems, 2022.
- [44] A. Trockman and J. Z. Kolter, "Patches are all you need?" Transactions on Machine Learning Research, 2023, Featured Certification.