pith. sign in

arxiv: 2606.03160 · v1 · pith:QIB2FDBBnew · submitted 2026-06-02 · 💻 cs.CV

SRENet: Spectral Re-Entry Network for Point Cloud Action Recognition

Pith reviewed 2026-06-28 11:05 UTC · model grok-4.3

classification 💻 cs.CV
keywords point cloud action recognitionspectral decompositionwavelet analysisfrequency modelingSRENetNTU-RGBDMSR-Action3Dtemporal dynamics
0
0 comments X

The pith

SRENet captures motion in point cloud sequences by decomposing features into frequency components with wavelet blocks and a re-entry mechanism.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that irregular point cloud sequences make it hard to learn both broad motion context and precise timing details for action recognition. SRENet addresses this by applying wavelet analysis along time and space to split features into low-frequency global patterns and high-frequency local changes, then using attention to handle each separately. A second re-entry decomposition step fixes distortions from earlier fusion steps and recovers missed dynamics. A training schedule that moves focus from coarse low frequencies to fine high frequencies, paired with contrastive loss, is said to sharpen discrimination. Experiments on three standard datasets show top results, which the authors take as evidence that frequency modeling helps point-cloud action understanding.

Core claim

SRENet shows that performing wavelet-based spectral decomposition along temporal and spatial axes, followed by a secondary re-entry decomposition, lets the model separate and recover both global motion context and fine-grained temporal dynamics that standard approaches miss in irregular point cloud data, producing state-of-the-art accuracy on MSR-Action3D, NTU-RGBD and NTU-RGBD120.

What carries the argument

Spectral Decomposition Block (SDeBlock) that splits features via wavelets into low- and high-frequency parts with frequency-specific attention, plus Spectral Re-entry Block (SReBlock) that applies secondary temporal decomposition to restore residual dynamics and re-align frequency structure.

If this is right

  • Frequency-specific attention improves separation of coarse global context from detailed local motion patterns.
  • The curriculum schedule that shifts emphasis from low- to high-frequency subspaces aligns training with the progression from broad to fine motion cues.
  • Spectral-aware contrastive loss increases discriminability inside each frequency subspace.
  • Secondary re-entry decomposition corrects temporal frequency misalignment introduced during semantic fusion.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same frequency split might help other tasks that process sparse 3D sequences, such as gesture spotting from LiDAR.
  • If the re-entry step proves essential, future designs could make the secondary decomposition lighter or conditional on data irregularity.
  • Testing whether the low-to-high curriculum still helps when point clouds come from moving sensors rather than fixed setups would check generalization.
  • The approach suggests that explicit frequency handling could reduce the need for very dense frame sampling in point-cloud pipelines.

Load-bearing premise

Wavelet decomposition along time and space can separate motion information in irregular point clouds without losing essential details or creating artifacts that hurt recognition.

What would settle it

An ablation that replaces both the SDeBlock and SReBlock with ordinary spatio-temporal convolutions and measures no drop (or a gain) in accuracy on the NTU-RGBD120 dataset would falsify the central claim.

Figures

Figures reproduced from arXiv: 2606.03160 by Jiarui Lan, Kun Hu, Qiuxia Wu, Wenxiong Kang, Zhiyong Wang.

Figure 1
Figure 1. Figure 1: (a) Previous methods extract spectral features based on a single [PITH_FULL_IMAGE:figures/full_fig_p001_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Time-frequency amplitude map of the action ‘j [PITH_FULL_IMAGE:figures/full_fig_p002_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Overview of SRENet, which integrates spectral decomposition and re-entry blocks for point cloud action recognition. SDeBlock extracts spatial and [PITH_FULL_IMAGE:figures/full_fig_p003_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Illustration of the unified Decomposition-Modeling-Fusion pipeline. [PITH_FULL_IMAGE:figures/full_fig_p004_4.png] view at source ↗
Figure 6
Figure 6. Figure 6: Recognition accuracy (%) comparison in hard set. [PITH_FULL_IMAGE:figures/full_fig_p009_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Comparison of class activation maps on NTU-RGBD with class-wise accuracies. [PITH_FULL_IMAGE:figures/full_fig_p010_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Class Activation Map (CAM). (a) CAMs of low- (left) and high [PITH_FULL_IMAGE:figures/full_fig_p010_8.png] view at source ↗
Figure 10
Figure 10. Figure 10: The t-SNE visualization on NTU-RGBD. Take off shoes Wear shoes Wear shoes Take off shoes Writing Typing Clapping Rub hands ✓ × ✓ ✓ ✓ × × × [PITH_FULL_IMAGE:figures/full_fig_p011_10.png] view at source ↗
Figure 11
Figure 11. Figure 11: Some failure cases with CAM visualizations. These examples are [PITH_FULL_IMAGE:figures/full_fig_p011_11.png] view at source ↗
read the original abstract

Recognizing human actions from point cloud sequences is critical for 3D perception driven applications such as autonomous driving and human-computer interaction. However, the irregular structure and temporal inconsistency of point clouds pose unique challenges for spatio-temporal representation learning, especially in capturing both global motion context and fine-grained temporal dynamics. We propose SRENet, a spectral-aware framework designed to explicitly learn both global context and fine-grained temporal dynamics of motion from a frequency perspective for action recognition. SRENet introduces a Spectral Decomposition Block (SDeBlock) that performs wavelet-based analysis along temporal and spatial axes, disentangling features into low- and high-frequency components with frequency-specific attention. To recover residual dynamics and re-align temporal frequency structures distorted during semantic fusion, a Spectral Re-entry Block (SReBlock) performs secondary temporal decomposition. Furthermore, a spectral-aware learning strategy is devised to enhance discriminability in both frequency subspaces via contrastive loss and a curriculum schedule that gradually shifts focus from low- to high-frequency spaces in line with coarse to detailed motion patterns. Extensive experiments on MSR-Action3D, NTU-RGBD and NTU-RGBD120 demonstrate that SRENet achieves state-of-the-art performance, validating the effectiveness of frequency modeling in point cloud-based action understanding.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes SRENet, a spectral-aware framework for point cloud action recognition. It introduces SDeBlock for wavelet-based decomposition along temporal and spatial axes to separate low- and high-frequency components with frequency-specific attention, SReBlock for secondary temporal decomposition to recover residual dynamics, and a spectral-aware contrastive loss with curriculum schedule. Experiments claim state-of-the-art results on MSR-Action3D, NTU-RGBD, and NTU-RGBD120, validating frequency modeling for motion dynamics in irregular point cloud sequences.

Significance. If the wavelet-based blocks can be shown to reliably operate on unordered point clouds without critical distortion, the work would offer a concrete frequency-domain approach to disentangling global context from fine-grained temporal dynamics, a direction with potential value for 3D perception tasks where standard spatio-temporal convolutions struggle with irregularity.

major comments (2)
  1. [Abstract / §3] Abstract (and §3, SDeBlock description): the claim that wavelet-based analysis 'disentangles features into low- and high-frequency components' on point cloud sequences is load-bearing for the SOTA validation, yet the manuscript provides no explicit mechanism (graph wavelets, voxelization, or interpolation) for applying DWT to irregular, non-grid data; without this, frequency separation cannot be guaranteed and the reported gains on MSR-Action3D, NTU-RGBD, and NTU-RGBD120 may not be attributable to spectral modeling.
  2. [Abstract / §4] Abstract (and §4, experimental setup): the SOTA claim rests on the effectiveness of SReBlock's 'secondary temporal decomposition' to re-align structures 'distorted during semantic fusion,' but no ablation isolates the contribution of the re-entry step versus standard temporal modeling, nor reports error bars or dataset statistics, making it impossible to assess whether the frequency re-entry is the driver of the reported improvements.
minor comments (1)
  1. [§3] Notation for frequency-specific attention inside SDeBlock is introduced without an equation reference, making the precise formulation of the attention weights unclear.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on SRENet. The two major comments highlight areas where additional clarity and experiments would strengthen the manuscript. We address each point below and commit to revisions that directly respond to the concerns without altering the core claims.

read point-by-point responses
  1. Referee: [Abstract / §3] Abstract (and §3, SDeBlock description): the claim that wavelet-based analysis 'disentangles features into low- and high-frequency components' on point cloud sequences is load-bearing for the SOTA validation, yet the manuscript provides no explicit mechanism (graph wavelets, voxelization, or interpolation) for applying DWT to irregular, non-grid data; without this, frequency separation cannot be guaranteed and the reported gains on MSR-Action3D, NTU-RGBD, and NTU-RGBD120 may not be attributable to spectral modeling.

    Authors: We agree that the manuscript must explicitly specify the mechanism used to apply wavelet decomposition to irregular point cloud sequences. The current description in the abstract and §3 states that SDeBlock performs wavelet-based analysis along temporal and spatial axes but does not detail the concrete implementation (e.g., voxelization, graph wavelets, or interpolation). We will revise §3 to include this mechanism and any supporting justification so that the frequency separation can be verified and the attribution of gains to spectral modeling is unambiguous. revision: yes

  2. Referee: [Abstract / §4] Abstract (and §4, experimental setup): the SOTA claim rests on the effectiveness of SReBlock's 'secondary temporal decomposition' to re-align structures 'distorted during semantic fusion,' but no ablation isolates the contribution of the re-entry step versus standard temporal modeling, nor reports error bars or dataset statistics, making it impossible to assess whether the frequency re-entry is the driver of the reported improvements.

    Authors: We acknowledge that the experimental section does not contain an ablation isolating the SReBlock re-entry step from standard temporal modeling, nor does it report error bars or per-dataset statistics. These omissions make it difficult to isolate the contribution of the re-entry mechanism. We will add the requested ablation study, error bars across multiple runs, and relevant dataset statistics in the revised §4 to allow readers to evaluate whether the frequency re-entry drives the reported improvements. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical architecture validated on external benchmarks

full rationale

The paper proposes SRENet with SDeBlock and SReBlock for wavelet-based frequency decomposition on point cloud sequences, plus a spectral-aware contrastive loss. No equations, derivations, or first-principles claims appear that reduce any result to fitted parameters or self-referential definitions. Performance is reported as SOTA on MSR-Action3D, NTU-RGBD, and NTU-RGBD120 via standard empirical evaluation. The central claim rests on experimental outcomes rather than any tautological reduction. This matches the default case of a self-contained empirical contribution with no load-bearing circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the unverified assumption that wavelet analysis and re-entry blocks improve discriminability on irregular point clouds; no free parameters, invented entities, or additional axioms are stated in the abstract.

axioms (1)
  • domain assumption Wavelet transforms can effectively separate low- and high-frequency motion components in point cloud feature sequences.
    Invoked by the description of SDeBlock and SReBlock.

pith-pipeline@v0.9.1-grok · 5763 in / 1131 out tokens · 46328 ms · 2026-06-28T11:05:53.892904+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

70 extracted references · 5 canonical work pages · 2 internal anchors

  1. [1]

    Point 4D transformer networks for spatio-temporal modeling in point cloud videos,

    H. Fan, Y . Yang, and M. Kankanhalli, “Point 4D transformer networks for spatio-temporal modeling in point cloud videos,” inProc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), 2021, pp. 14 204–14 213

  2. [2]

    Point spatio-temporal transformer networks for point cloud video modeling,

    H. Fan, Y . Yang, and M. Kankanhalli, “Point spatio-temporal transformer networks for point cloud video modeling,”IEEE Trans. Pattern Anal. Mach. Intell., vol. 45, no. 2, pp. 2181–2192, 2022

  3. [3]

    Point primitive transformer for long-term 4D point cloud video understanding,

    H. Wen, Y . Liu, J. Huang, B. Duan, and L. Yi, “Point primitive transformer for long-term 4D point cloud video understanding,” inProc. Eur. Conf. Comput. Vis.Springer, 2022, pp. 19–35. 12

  4. [4]

    Spatial-temporal transformer for 3D point cloud sequences,

    Y . Wei, H. Liu, T. Xie, Q. Ke, and Y . Guo, “Spatial-temporal transformer for 3D point cloud sequences,” inProc. IEEE/CVF Winter Conf. Appl. Comput. Vis. (WACV), 2022, pp. 1171–1180

  5. [5]

    How do vision transformers work?

    N. Park and S. Kim, “How do vision transformers work?” inProc. Int. Conf. Learn. Represent., 2022, pp. 4287–4312

  6. [6]

    Fast vision transformers with hilo attention,

    Z. Pan, J. Cai, and B. Zhuang, “Fast vision transformers with hilo attention,” inProc. Adv. Neural Inf. Process. Syst., vol. 35, 2022, pp. 14 541–14 554

  7. [7]

    Inception transformer,

    C. Si, W. Yu, P. Zhou, Y . Zhou, X. Wang, and S. Yan, “Inception transformer,” inProc. Adv. Neural Inf. Process. Syst., vol. 35, 2022, pp. 23 495–23 509

  8. [8]

    Frequency guidance matters: Skeletal action recognition by frequency-aware mixed transformer,

    W. Wu, C. Zheng, Z. Yang, C. Chen, S. Das, and A. Lu, “Frequency guidance matters: Skeletal action recognition by frequency-aware mixed transformer,” inProc. 32nd ACM Int. Conf. Multimedia, 2024, pp. 4660– 4669

  9. [9]

    Decompose more and aggregate better: Two closer looks at frequency representation learning for human motion prediction,

    X. Gao, S. Du, Y . Wu, and Y . Yang, “Decompose more and aggregate better: Two closer looks at frequency representation learning for human motion prediction,” inProc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), 2023, pp. 6451–6460

  10. [10]

    Action recognition based on a bag of 3D points,

    W. Li, Z. Zhang, and Z. Liu, “Action recognition based on a bag of 3D points,” inProc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit. Workshops. IEEE, 2010, pp. 9–14

  11. [11]

    Ntu RGB+D: A large scale dataset for 3D human activity analysis,

    A. Shahroudy, J. Liu, T.-T. Ng, and G. Wang, “Ntu RGB+D: A large scale dataset for 3D human activity analysis,” inProc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), 2016, pp. 1010–1019

  12. [12]

    Ntu RGB+D 120: A large-scale benchmark for 3D human activity understanding,

    J. Liu, A. Shahroudy, M. Perez, G. Wang, L.-Y . Duan, and A. C. Kot, “Ntu RGB+D 120: A large-scale benchmark for 3D human activity understanding,”IEEE Trans. Pattern Anal. Mach. Intell., vol. 42, no. 10, pp. 2684–2701, 2019

  13. [13]

    PointNet: Deep learning on point sets for 3D classification and segmentation,

    C. R. Qi, H. Su, K. Mo, and L. J. Guibas, “PointNet: Deep learning on point sets for 3D classification and segmentation,” inProc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), 2017, pp. 652–660

  14. [14]

    Pointnet++: Deep hierarchical feature learning on point sets in a metric space,

    C. R. Qi, L. Yi, H. Su, and L. J. Guibas, “Pointnet++: Deep hierarchical feature learning on point sets in a metric space,” inProc. Adv. Neural Inf. Process. Syst., vol. 30, 2017

  15. [15]

    Dynamic graph cnn for learning on point clouds,

    Y . Wang, Y . Sun, Z. Liu, S. E. Sarma, M. M. Bronstein, and J. M. Solomon, “Dynamic graph cnn for learning on point clouds,”ACM Trans. Graph., vol. 38, no. 5, pp. 1–12, 2019

  16. [16]

    PCT: Point cloud transformer,

    M.-H. Guo, J.-X. Cai, Z.-N. Liu, T.-J. Mu, R. R. Martin, and S.-M. Hu, “PCT: Point cloud transformer,”Comput. Vis. Media, vol. 7, pp. 187–199, 2021

  17. [17]

    Point transformer v3: Simpler faster stronger,

    X. Wu, L. Jiang, P.-S. Wang, Z. Liu, X. Liu, Y . Qiao, W. Ouyang, T. He, and H. Zhao, “Point transformer v3: Simpler faster stronger,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), 2024, pp. 4840–4851

  18. [18]

    Rethinking network design and local geometry in point cloud: A simple residual mlp framework,

    X. Ma, C. Qin, H. You, H. Ran, and Y . Fu, “Rethinking network design and local geometry in point cloud: A simple residual mlp framework,” inProc. Int. Conf. Learn. Represent., 2022, pp. 5019–5033

  19. [19]

    RI-MAE: Rotation-invariant masked autoencoders for self-supervised point cloud representation learning,

    K. Su, Q. Wu, P. Cai, X. Zhu, X. Lu, Z. Wang, and K. Hu, “RI-MAE: Rotation-invariant masked autoencoders for self-supervised point cloud representation learning,” inProc. AAAI Conf. on Artif. Intell., vol. 39, no. 7, 2025, pp. 7015–7023

  20. [20]

    Safdnet: A simple and effective network for fully sparse 3d object detection,

    G. Zhang, J. Chen, G. Gao, J. Li, S. Liu, and X. Hu, “Safdnet: A simple and effective network for fully sparse 3d object detection,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), 2024, pp. 14 477–14 486

  21. [21]

    V oxel mamba: Group-free state space models for point cloud based 3d object detection,

    G. Zhang, L. Fan, C. He, Z. Lei, Z.-X. ZHANG, and L. Zhang, “V oxel mamba: Group-free state space models for point cloud based 3d object detection,” inProc. Adv. Neural Inf. Process. Syst., vol. 37, 2024, pp. 81 489–81 509

  22. [22]

    No time to train: Empowering non-parametric networks for few-shot 3d scene segmentation,

    X. Zhu, R. Zhang, B. He, Z. Guo, J. Liu, H. Xiao, C. Fu, H. Dong, and P. Gao, “No time to train: Empowering non-parametric networks for few-shot 3d scene segmentation,” inProc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), 2024, pp. 3838–3847

  23. [23]

    Unimix: Towards domain adaptive and generalizable lidar semantic segmentation in ad- verse weather,

    H. Zhao, J. Zhang, Z. Chen, S. Zhao, and D. Tao, “Unimix: Towards domain adaptive and generalizable lidar semantic segmentation in ad- verse weather,” inProc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), 2024, pp. 14 781–14 791

  24. [24]

    DC-PCN: Point cloud completion network with dual-codebook guided quantization,

    Q. Wu, H. Huang, K. Su, Z. Wang, and K. Hu, “DC-PCN: Point cloud completion network with dual-codebook guided quantization,” inProc. AAAI Conf. on Artif. Intell., vol. 39, no. 8, 2025, pp. 8441–8449

  25. [25]

    Distinguishing and matching- aware unsupervised point cloud completion,

    H. Xiao, Y . Li, W. Kang, and Q. Wu, “Distinguishing and matching- aware unsupervised point cloud completion,”IEEE Trans. Circuits Syst. Video Technol., vol. 33, no. 9, pp. 5160–5173, 2023

  26. [26]

    3DV: 3D dynamic voxel for action recognition in depth video,

    Y . Wang, Y . Xiao, F. Xiong, W. Jiang, Z. Cao, J. T. Zhou, and J. Yuan, “3DV: 3D dynamic voxel for action recognition in depth video,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), 2020, pp. 511–520

  27. [27]

    Semantic complete scene forecasting from a 4d dynamic point cloud sequence,

    Z. Wang, Z. Ye, H. Wu, J. Chen, and L. Yi, “Semantic complete scene forecasting from a 4d dynamic point cloud sequence,” inProc. AAAI Conf. on Artif. Intell., vol. 38, no. 6, 2024, pp. 5867–5875

  28. [28]

    You will never walk alone: One-shot 3d action recognition with point cloud sequence,

    X. Tong, Y . Xiao, B. Tan, J. Yang, Z. Cao, J. T. Zhou, and J. Yuan, “You will never walk alone: One-shot 3d action recognition with point cloud sequence,”IEEE Trans. Circuits Syst. Video Technol., vol. 34, no. 11, pp. 11 464–11 477, 2024

  29. [29]

    PSTnet: Point spatio-temporal convolution on point cloud sequences,

    H. Fan, X. Yu, Y . Ding, Y . Yang, and M. Kankanhalli, “PSTnet: Point spatio-temporal convolution on point cloud sequences,” inProc. Int. Conf. Learn. Represent., 2021, pp. 18 296–18 318

  30. [30]

    Deep hierarchical representation of point cloud videos via spatio-temporal decomposition,

    H. Fan, X. Yu, Y . Yang, and M. Kankanhalli, “Deep hierarchical representation of point cloud videos via spatio-temporal decomposition,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 44, no. 12, pp. 9918–9930, 2021

  31. [31]

    Real-Time 3-D human action recognition based on hyperpoint sequence,

    X. Li, Q. Huang, Z. Wang, T. Yang, Z. Hou, and Z. Miao, “Real-Time 3-D human action recognition based on hyperpoint sequence,”IEEE Trans. on Ind. Informat., vol. 19, no. 8, pp. 8933–8942, 2022

  32. [32]

    GeometryMotion-Net: A strong two-stream baseline for 3d action recognition,

    J. Liu and D. Xu, “GeometryMotion-Net: A strong two-stream baseline for 3d action recognition,”IEEE Trans. Circuits Syst. Video Technol., vol. 31, no. 12, pp. 4711–4721, 2021

  33. [33]

    MAPLE: Masked pseudo-labeling autoencoder for semi-supervised point cloud action recognition,

    X. Chen, W. Liu, X. Liu, Y . Zhang, J. Han, and T. Mei, “MAPLE: Masked pseudo-labeling autoencoder for semi-supervised point cloud action recognition,” inProc. ACM Int. Conf. on Multimedia, 2022, pp. 708–718

  34. [34]

    Point con- trastive prediction with semantic clustering for self-supervised learning on point cloud videos,

    X. Sheng, Z. Shen, G. Xiao, L. Wang, Y . Guo, and H. Fan, “Point con- trastive prediction with semantic clustering for self-supervised learning on point cloud videos,” inProc. IEEE/CVF Int. Conf. Comput. Vis., 2023, pp. 16 515–16 524

  35. [35]

    Pointcmp: Contrastive mask prediction for self-supervised learning on point cloud videos,

    Z. Shen, X. Sheng, L. Wang, Y . Guo, Q. Liu, and X. Zhou, “Pointcmp: Contrastive mask prediction for self-supervised learning on point cloud videos,” inProc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), 2023, pp. 1212–1222

  36. [36]

    Attention is all you need,

    A. Vaswani, “Attention is all you need,” inProc. Adv. Neural Inf. Process. Syst., 2017, pp. 5998–6008

  37. [37]

    Mamba: Linear-Time Sequence Modeling with Selective State Spaces

    A. Gu and T. Dao, “Mamba: Linear-time sequence modeling with selective state spaces,”arXiv:2312.00752, 2023

  38. [38]

    Mamba4D: Efficient 4D point cloud video understanding with disentangled spatial-temporal state space models,

    J. Liu, J. Han, L. Liu, A. I. Aviles-Rivero, C. Jiang, Z. Liu, and H. Wang, “Mamba4D: Efficient 4D point cloud video understanding with disentangled spatial-temporal state space models,” inProc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), 2025, pp. 17 626–17 636

  39. [39]

    Fdnet: Frequency decomposition network for learned image compression,

    J. Wang and Q. Ling, “Fdnet: Frequency decomposition network for learned image compression,”IEEE Trans. Circuits Syst. Video Technol., vol. 34, no. 11, pp. 11 241–11 255, 2024

  40. [40]

    Temporal diversified self-contrastive learning for generalized face forgery detection,

    R. Zhang, P. He, H. Li, S. Wang, and Y . Cao, “Temporal diversified self-contrastive learning for generalized face forgery detection,”IEEE Trans. Circuits Syst. Video Technol., 2024

  41. [41]

    Interactive spectral- spatial transformer for hyperspectral image classification,

    L. Song, Z. Feng, S. Yang, X. Zhang, and L. Jiao, “Interactive spectral- spatial transformer for hyperspectral image classification,”IEEE Trans. Circuits Syst. Video Technol., vol. 34, no. 9, pp. 8589–8601, 2024

  42. [42]

    DNA: Uncovering universal latent forgery knowledge,

    J. Dou, C. Shi, Y . Wang, S. Guo, A. Yi, W. Wu, L. Zhang, F. Shen, and T.-S. Chua, “DNA: Uncovering universal latent forgery knowledge,” arXiv:2601.22515, 2026

  43. [43]

    Beyond Surface Artifacts: Capturing Shared Latent Forgery Knowledge Across Modalities

    J. Dou, C. Shi, J. Wang, F. Shen, Z. Wang, and T.-S. Chua, “Beyond surface artifacts: Capturing shared latent forgery knowledge across modalities,”arXiv:2604.07763, 2026

  44. [44]

    Delving into the frequency: Temporally consistent human motion transfer in the fourier space,

    G. Yang, W. Liu, X. Liu, X. Gu, J. Cao, and J. Li, “Delving into the frequency: Temporally consistent human motion transfer in the fourier space,” inProc. ACM Int. Conf. on Multimedia, 2022, pp. 1156–1166

  45. [45]

    Wavelet-decoupling contrastive enhancement network for fine-grained skeleton-based action recognition,

    H. Chang, J. Chen, Y . Li, J. Chen, and X. Zhang, “Wavelet-decoupling contrastive enhancement network for fine-grained skeleton-based action recognition,” inProc. IEEE Int. Conf. Acoust., Speech Signal Process. (ICASSP). IEEE, 2024, pp. 4060–4064

  46. [46]

    Strengthening skeletal action recognizers via leveraging temporal patterns,

    Z. Qin, P. Ji, D. Kim, Y . Liu, S. Anwar, and T. Gedeon, “Strengthening skeletal action recognizers via leveraging temporal patterns,” inProc. Eur. Conf. Comput. Vis., 2022, pp. 577–593

  47. [47]

    The emerging field of signal processing on graphs: Ex- tending high-dimensional data analysis to networks and other irregular domains,

    D. I. Shuman, S. K. Narang, P. Frossard, A. Ortega, and P. Van- dergheynst, “The emerging field of signal processing on graphs: Ex- tending high-dimensional data analysis to networks and other irregular domains,”IEEE Signal Process. Mag., vol. 30, no. 3, pp. 83–98, 2013

  48. [48]

    Wavelets on graphs via spectral graph theory,

    D. K. Hammond, P. Vandergheynst, and R. Gribonval, “Wavelets on graphs via spectral graph theory,”Appl. Comput. Harmon. Anal., vol. 30, no. 2, pp. 129–150, 2011

  49. [49]

    Local spectral graph convolution for point set feature learning,

    C. Wang, B. Samari, and K. Siddiqi, “Local spectral graph convolution for point set feature learning,” inProc. Eur. Conf. Comput. Vis., 2018, pp. 52–66. 13

  50. [50]

    Adaptive wavelet transformer network for 3D shape representation learning,

    H. Huang and Y . Fang, “Adaptive wavelet transformer network for 3D shape representation learning,” inProc. Int. Conf. Learn. Represent., 2021, pp. 21 471–21 486

  51. [51]

    Prototypical contrastive learning of unsupervised representations,

    J. Li, P. Zhou, C. Xiong, and S. C. Hoi, “Prototypical contrastive learning of unsupervised representations,”arXiv:2005.04966, 2020

  52. [52]

    Meteornet: Deep learning on dynamic 3D point cloud sequences,

    X. Liu, M. Yan, and J. Bohg, “Meteornet: Deep learning on dynamic 3D point cloud sequences,” inProc. IEEE/CVF Int. Conf. Comput. Vis., 2019, pp. 9246–9255

  53. [53]

    No pain, big gain: classify dynamic point cloud sequences with static models by fitting feature-level space-time surfaces,

    J.-X. Zhong, K. Zhou, Q. Hu, B. Wang, N. Trigoni, and A. Markham, “No pain, big gain: classify dynamic point cloud sequences with static models by fitting feature-level space-time surfaces,” inProc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), 2022, pp. 8510–8520

  54. [54]

    Leaf: learning frames for 4d point cloud sequence understanding,

    Y . Liu, J. Chen, Z. Zhang, J. Huang, and L. Yi, “Leaf: learning frames for 4d point cloud sequence understanding,” inProc. IEEE/CVF Int. Conf. Comput. Vis., 2023, pp. 604–613

  55. [55]

    X4d-sceneformer: Enhanced scene understanding on 4d point cloud videos through cross-modal knowledge transfer,

    L. Jing, Y . Xue, X. Yan, C. Zheng, D. Wang, R. Zhang, Z. Wang, H. Fang, B. Zhao, and Z. Li, “X4d-sceneformer: Enhanced scene understanding on 4d point cloud videos through cross-modal knowledge transfer,” inProc. AAAI Conf. on Artif. Intell., vol. 38, no. 3, 2024, pp. 2670–2678

  56. [56]

    3DinAction: Understanding human actions in 3D point clouds,

    Y . Ben-Shabat, O. Shrout, and S. Gould, “3DinAction: Understanding human actions in 3D point clouds,” inProc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), 2024, pp. 19 978–19 987

  57. [57]

    PREnet: A plane-fit redundancy encoding point cloud sequence network for real-time 3D action recognition,

    S. He, X. Qu, J. Wan, G. Li, C. Xie, and J. Wang, “PREnet: A plane-fit redundancy encoding point cloud sequence network for real-time 3D action recognition,”arXiv:2405.06929, 2024

  58. [58]

    Kan- HyperpointNet for point cloud sequence-based 3D human action recog- nition,

    Z. Chen, X. Li, Q. Huang, Q. Geng, T. Yang, and S. Han, “Kan- HyperpointNet for point cloud sequence-based 3D human action recog- nition,” inProc. IEEE Int. Conf. Acoust., Speech Signal Process. (ICASSP). IEEE, 2025, pp. 1–5

  59. [59]

    PRG-Net: Point relationship-guided network for 3d human action recognition,

    Y . Du, Z. Hou, E. Lin, X. Li, J. Liang, and X. Zhou, “PRG-Net: Point relationship-guided network for 3d human action recognition,” Neurocomputing, vol. 635, p. 130015, 2025

  60. [60]

    Zur theorie der orthogonalen funktionensysteme,

    A. Haar, “Zur theorie der orthogonalen funktionensysteme,”Mathema- tische Annalen, vol. 69, pp. 331–371, 1910

  61. [61]

    Daubechies,Ten Lectures on Wavelets, ser

    I. Daubechies,Ten Lectures on Wavelets, ser. CBMS-NSF Regional Conference Series in Applied Mathematics. Philadelphia, PA: Society for Industrial and Applied Mathematics, 1992, vol. 61

  62. [62]

    Orthonormal bases of compactly supported wavelets,

    Ingrid Daubechies, “Orthonormal bases of compactly supported wavelets,”Commun. Pure Appl. Math., vol. 41, no. 7, pp. 909–996, 1988

  63. [63]

    Spatio-temporal decoupled knowledge compensator for few-shot action recognition,

    H. Qu, X. Shu, R. Yan, H. Gao, W. Wang, and J. Tang, “Spatio-temporal decoupled knowledge compensator for few-shot action recognition,” IEEE Trans. Pattern Anal. Mach. Intell., 2026

  64. [64]

    MVP-Shot: Multi-velocity progressive-alignment framework for few-shot action recognition,

    H. Qu, R. Yan, X. Shu, H. Gao, P. Huang, and G.-S. Xie, “MVP-Shot: Multi-velocity progressive-alignment framework for few-shot action recognition,”IEEE Trans. Multimedia, 2025

  65. [65]

    Grad-CAM: Visual explanations from deep networks via gradient-based localization,

    R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, and D. Batra, “Grad-CAM: Visual explanations from deep networks via gradient-based localization,” inProc. IEEE Int. Conf. Comput. Vis. (ICCV), 2017, pp. 618–626

  66. [66]

    Gait recognition in the wild with dense 3d representations and a benchmark,

    J. Zheng, X. Liu, W. Liu, L. He, C. Yan, and T. Mei, “Gait recognition in the wild with dense 3d representations and a benchmark,” inProc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), 2022, pp. 20 228–20 237

  67. [67]

    HiGCIN: Hierarchical graph-based cross inference network for group activity recognition,

    R. Yan, L. Xie, J. Tang, X. Shu, and Q. Tian, “HiGCIN: Hierarchical graph-based cross inference network for group activity recognition,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 45, no. 6, pp. 6955–6968, 2020

  68. [68]

    Progressive instance- aware feature learning for compositional action recognition,

    R. Yan, L. Xie, X. Shu, L. Zhang, and J. Tang, “Progressive instance- aware feature learning for compositional action recognition,”IEEE Trans. Pattern Anal. Mach. Intell., vol. 45, no. 8, pp. 10 317–10 330, 2023

  69. [69]

    AV-SSAN: Audio- visual selective doa estimation through explicit multi-band semantic- spatial alignment,

    Y . Chen, H. Zhu, J. Wang, K. Chen, and X. Qian, “AV-SSAN: Audio- visual selective doa estimation through explicit multi-band semantic- spatial alignment,” inProc. AAAI Conf. on Artif. Intell., vol. 40, no. 25, 2026, pp. 20 409–20 417

  70. [70]

    Locality-aware cross- modal correspondence learning for dense audio-visual events detection,

    L. Xing, H. Qu, R. Yan, X. Shu, and J. Tang, “Locality-aware cross- modal correspondence learning for dense audio-visual events detection,” IEEE Trans. Circuits Syst. Video Technol., 2025