pith. machine review for the scientific record.

arxiv: 2604.06783 · v1 · submitted 2026-04-08 · 💻 cs.CV

Recognition: 2 theorem links


Insights from Visual Cognition: Understanding Human Action Dynamics with Overall Glance and Refined Gaze Transformer

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 18:03 UTC · model grok-4.3

classification 💻 cs.CV
keywords video action recognition · transformer · human visual cognition · glance and gaze · spatiotemporal attention · dual-path network · kinetics-400 · motion modeling

The pith

A dual-path Transformer using human-like glance for coarse video context and gaze for local details outperforms uniform attention in action recognition.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that video Transformers succeed by mimicking human visual cognition: one path takes an overall glance at coarse spatiotemporal information while the other refines with local details, rather than treating time and space uniformly. This matters because factorized and window-based attention split correlations between regions of interest, limiting motion and long-range dependency capture. If correct, the design yields higher accuracy on standard video benchmarks while maintaining efficiency. A sympathetic reader cares because it grounds architecture choices in observable human behavior instead of purely computational heuristics.

Core claim

The Overall Glance and Refined Gaze (OG-ReG) Transformer is a dual-path network in which the Glance path extracts coarse-grained overall spatiotemporal information and the Gaze path supplements it with local details. This structure follows from the observation that temporal and spatial importance varies across time scales and that attention is allocated sparsely via glance and gaze behavior, unlike factorized or window-based self-attention that breaks correlations. The resulting model reaches state-of-the-art accuracy on Kinetics-400, Something-Something v2, and Diving-48.
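
To make the structure concrete, here is a minimal sketch of what a dual-path block of this kind might look like. It is not the authors' implementation: the pooled global attention in the Glance path, the depthwise convolution in the Gaze path, the additive fusion, and all names (GlanceGazeBlock, pool, local) are assumptions for illustration; only the high-level split into a coarse overall path and a local-detail path comes from the paper.

```python
# Hedged sketch of a dual-path "glance + gaze" block.
# Assumptions (not from the paper): pooled global attention for the Glance path,
# a depthwise convolution for the Gaze path, and fusion by simple addition.
import torch
import torch.nn as nn

class GlanceGazeBlock(nn.Module):
    def __init__(self, dim: int, heads: int = 8, pool: int = 4):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        # Glance path: self-attention against a coarsely pooled token sequence.
        self.pool = nn.AvgPool1d(kernel_size=pool, stride=pool)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Gaze path: local refinement with a depthwise convolution over tokens.
        self.local = nn.Conv1d(dim, dim, kernel_size=3, padding=1, groups=dim)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, tokens, dim) flattened spatiotemporal tokens
        z = self.norm(x)
        # Glance: queries stay at full resolution, keys/values are pooled (coarse).
        coarse = self.pool(z.transpose(1, 2)).transpose(1, 2)   # (B, tokens/pool, dim)
        glance, _ = self.attn(z, coarse, coarse)
        # Gaze: cheap local detail on the full-resolution tokens.
        gaze = self.local(z.transpose(1, 2)).transpose(1, 2)
        # Additive fusion is an assumption; the paper may fuse differently.
        return x + self.proj(glance + gaze)

if __name__ == "__main__":
    block = GlanceGazeBlock(dim=96)
    tokens = torch.randn(2, 8 * 14 * 14, 96)   # e.g. 8 frames of 14x14 patches
    print(block(tokens).shape)                 # torch.Size([2, 1568, 96])
```

Shape check: with 8 frames of 14×14 patches the block maps (2, 1568, 96) tokens back to the same shape, while the Glance attention is computed against a 4× shorter pooled sequence.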

What carries the argument

The dual-path OG-ReG network that separates an overall glance path for coarse spatiotemporal features from a refined gaze path for local details, thereby preserving motion correlations.

Load-bearing premise

The importance of temporal and spatial information varies across time scales, and human visual attention is allocated sparsely through glance and gaze behavior; the argument takes this separation, rather than uniform attention over space and time, to be what drives better video performance.

What would settle it

A controlled experiment in which a standard factorized self-attention Transformer, trained on identical data and hyperparameters, reaches equal or higher top-1 accuracy on Something-Something v2 would falsify the claimed advantage of the glance-gaze split.
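
As a sketch of how such a control could be run, the snippet below fixes one shared hyperparameter set and checks that the two models sit at a matched parameter budget before comparison. The constructors and the specific hyperparameter values are hypothetical placeholders, not taken from the paper; only the requirement of identical data and training settings comes from the statement above.

```python
# Hedged sketch of a matched-budget controlled comparison. The model
# constructors (factorized baseline vs. glance-gaze model) are hypothetical
# placeholders standing in for real implementations.
import torch

SHARED_HPARAMS = dict(          # illustrative values only, not the paper's recipe
    dataset="something-something-v2",
    frames=16, resolution=224,
    epochs=60, batch_size=64,
    optimizer="adamw", lr=1e-3, weight_decay=0.05,
    seed=0,
)

def parameter_count(model: torch.nn.Module) -> int:
    """Number of trainable parameters."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

def matched_budget(model_a: torch.nn.Module, model_b: torch.nn.Module,
                   tolerance: float = 0.05) -> bool:
    """True if the two models are within `tolerance` of each other in parameters."""
    pa, pb = parameter_count(model_a), parameter_count(model_b)
    return abs(pa - pb) / max(pa, pb) <= tolerance

if __name__ == "__main__":
    # Toy demo of the budget check with two stand-in modules.
    a = torch.nn.Linear(128, 128)
    b = torch.nn.Sequential(torch.nn.Linear(128, 120), torch.nn.Linear(120, 128))
    print(matched_budget(a, b))   # False: budgets differ by far more than 5%
```

Only if the budget check holds would both models be trained with SHARED_HPARAMS and compared on top-1 accuracy on the Something-Something v2 validation split.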

Figures

Figures reproduced from arXiv: 2604.06783 by Bohao Xing, Deng Li, Heikki Kälviäinen, Rong Gao, Xin Liu.

Figure 1
Figure 1: The differences between window-based attention and glance-like attention. (a) Frame sequences (pushing notebook so that it falls off the table) from the SSv2 [25] dataset. (b) In window-based self-attention (e.g., Video-Swin [49]), a square represents a window within which self-attention performs its calculations. When an object (marked with red) zooms, moves, and rotates across many different windows in frame… view at source ↗
Figure 2
Figure 2: By visualizing the similarity matrices and their corresponding three video sequences ((a) a fast tempo, (b) a slow tempo, (c) a slower tempo) from the same action category, we demonstrate that matrix 𝑨 can effectively capture the tempo characteristics of the actions. view at source ↗
Figure 3
Figure 3: The framework of the OG-ReG Transformer. (a) An overview of the proposed Overall Glance and Refined Gaze (OG-ReG) Transformer. (b) The fundamental building block of OG-ReG, namely the OG-ReG block. view at source ↗
Figure 4
Figure 4: Details of the OG-ReG block (neglecting Layer-Norm). view at source ↗
Figure 5
Figure 5: Details of the MDConv. Here, X ∈ ℝ^(N×C) is the input and s is the downsample/upsample ratio, typically set to [8, 4, 2, 1] at different stages. ReS(X) reshapes the input sequence X ∈ ℝ^(N×C) to ℝ^(T×H×W×C) or back; Down(X, s) and Up(X, s) downsample the feature X ∈ ℝ^(T×H×W×C) to ℝ^(T×(H/s)×(W/s)×C), or upsample ba… view at source ↗
Figure 6
Figure 6: Fourier spectrum of features in OG-ReG-T and Video-Swin-B. view at source ↗
Figure 7
Figure 7: Grad-CAM and visual tempo visualization. view at source ↗
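
The Figure 4 caption originally carried a garbled copy of the block's self-attention step (Eq. 2 in the paper). Cleaned up, it is the standard scaled dot-product attention:

```latex
% Self-attention inside the OG-ReG block (Eq. 2), reconstructed from the
% garbled Figure 4 caption text.
\begin{aligned}
\mathbf{Q},\,\mathbf{K},\,\mathbf{V} &= \mathbf{Z}_{l-1}\mathbf{W}_Q,\ \mathbf{Z}_{l-1}\mathbf{W}_K,\ \mathbf{Z}_{l-1}\mathbf{W}_V,\\
\hat{\mathbf{Z}}_{l} &= \mathrm{SoftMax}\!\left(\mathbf{Q}\mathbf{K}^{\top}/\sqrt{d}\right)\mathbf{V},
\end{aligned}
```

where W_Q, W_K, W_V are the linear projection weights and Q, K, V are the query, key, and value.

Below is also a minimal sketch of the ReS / Down / Up operations described in the Figure 5 caption, assuming average pooling for Down and nearest-neighbour interpolation for Up; the actual operators inside MDConv are not specified in the extracted text, so only the tensor shapes and the stage-wise ratios [8, 4, 2, 1] come from the caption.

```python
# Hedged sketch of the ReS / Down / Up operations from the Figure 5 caption.
# Pooling and interpolation modes are assumptions for illustration.
import torch
import torch.nn.functional as F

def res(x: torch.Tensor, t: int, h: int, w: int) -> torch.Tensor:
    """ReS: reshape a token sequence (B, N, C) with N = T*H*W into (B, T, H, W, C)."""
    b, n, c = x.shape
    assert n == t * h * w
    return x.view(b, t, h, w, c)

def down(x: torch.Tensor, s: int) -> torch.Tensor:
    """Down: spatially downsample (B, T, H, W, C) to (B, T, H/s, W/s, C)."""
    b, t, h, w, c = x.shape
    y = x.permute(0, 1, 4, 2, 3).reshape(b * t, c, h, w)   # (B*T, C, H, W)
    y = F.avg_pool2d(y, kernel_size=s, stride=s)           # spatial dimensions only
    return y.reshape(b, t, c, h // s, w // s).permute(0, 1, 3, 4, 2)

def up(x: torch.Tensor, s: int) -> torch.Tensor:
    """Up: spatially upsample (B, T, H, W, C) back by a factor of s."""
    b, t, h, w, c = x.shape
    y = x.permute(0, 1, 4, 2, 3).reshape(b * t, c, h, w)
    y = F.interpolate(y, scale_factor=s, mode="nearest")
    return y.reshape(b, t, c, h * s, w * s).permute(0, 1, 3, 4, 2)

if __name__ == "__main__":
    tokens = torch.randn(2, 8 * 56 * 56, 96)   # hypothetical stage-1 token grid
    grid = res(tokens, t=8, h=56, w=56)        # (2, 8, 56, 56, 96)
    coarse = down(grid, s=8)                   # (2, 8, 7, 7, 96), ratio from [8, 4, 2, 1]
    print(up(coarse, s=8).shape)               # back to (2, 8, 56, 56, 96)
```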
read the original abstract

Recently, Transformer has made significant progress in various vision tasks. To balance computation and efficiency in video tasks, recent works heavily rely on factorized or window-based self-attention. However, these approaches split spatiotemporal correlations between regions of interest in videos, limiting the models' ability to capture motion and long-range dependencies. In this paper, we argue that, similar to the human visual system, the importance of temporal and spatial information varies across different time scales, and attention is allocated sparsely over time through glance and gaze behavior. Is equal consideration of time and space crucial for success in video tasks? Motivated by this understanding, we propose a dual-path network called the Overall Glance and Refined Gaze (OG-ReG) Transformer. The Glance path extracts coarse-grained overall spatiotemporal information, while the Gaze path supplements the Glance path by providing local details. Our model achieves state-of-the-art results on the Kinetics-400, Something-Something v2, and Diving-48, demonstrating its competitive performance. The code will be available at https://github.com/linuxsino/OG-ReG.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes the Overall Glance and Refined Gaze (OG-ReG) Transformer, a dual-path video architecture motivated by human visual cognition. The Glance path extracts coarse overall spatiotemporal information while the Gaze path supplies local refined details; the design is argued to better preserve motion and long-range dependencies than factorized or window-based self-attention. The paper reports state-of-the-art results on Kinetics-400, Something-Something v2, and Diving-48.

Significance. If the performance gains can be attributed to the dual-path design rather than capacity or training differences, the work supplies a cognitively motivated alternative to standard attention factorizations and could improve efficiency in modeling varying spatiotemporal importance across time scales. Public code release is a clear strength for reproducibility.

major comments (2)
  1. [Experiments section] The central empirical claim (SOTA on Kinetics-400, SSv2, Diving-48) is presented without controlled ablations that isolate the contribution of the dual Glance+Gaze paths. No comparison is shown against a single-path baseline, a standard divided space-time attention model, or factorized attention at matched FLOPs (see Experiments section and associated tables).
  2. [Section 3] The motivation that 'equal consideration of time and space is not crucial' and that sparse glance/gaze allocation is superior rests on the assumption that performance differences arise from this mechanism; however, no attention-map visualizations, per-layer importance analysis, or quantitative comparison of temporal vs. spatial weighting across time scales are provided to support this (see Section 3 and Figure 1).
minor comments (2)
  1. [Abstract] The abstract states SOTA results but does not report the exact top-1/top-5 margins or the competing methods being surpassed.
  2. [Section 3.2] Notation for the two paths (e.g., how features from Glance and Gaze are fused) is introduced without an explicit equation or pseudocode block.

Simulated Author's Rebuttal

2 responses · 0 unresolved

Thank you for the constructive feedback on our manuscript. We appreciate the referee's careful reading and the opportunity to address the concerns raised. We respond to each major comment below and commit to revisions that will strengthen the empirical support and mechanistic analysis without altering the core claims.

read point-by-point responses
  1. Referee: [Experiments section] The central empirical claim (SOTA on Kinetics-400, SSv2, Diving-48) is presented without controlled ablations that isolate the contribution of the dual Glance+Gaze paths. No comparison is shown against a single-path baseline, a standard divided space-time attention model, or factorized attention at matched FLOPs (see Experiments section and associated tables).

    Authors: We agree that isolating the contribution of the dual-path design requires additional controlled experiments. In the revised manuscript we will add: (1) a single-path baseline obtained by ablating the Gaze path while keeping the Glance path and overall architecture otherwise identical; (2) direct comparisons against a standard divided space-time attention model and a factorized attention baseline; and (3) explicit reporting of FLOPs and parameter counts for every model so that all comparisons occur at matched computational budgets. These additions will clarify whether the observed gains are attributable to the proposed Glance+Gaze mechanism rather than capacity or training differences. revision: yes

  2. Referee: [Section 3] The motivation that 'equal consideration of time and space is not crucial' and that sparse glance/gaze allocation is superior rests on the assumption that performance differences arise from this mechanism; however, no attention-map visualizations, per-layer importance analysis, or quantitative comparison of temporal vs. spatial weighting across time scales are provided to support this (see Section 3 and Figure 1).

    Authors: The motivation is drawn from established findings in visual cognition on glance and gaze behavior, where global context and local refinement are allocated sparsely and with varying spatiotemporal emphasis. While the performance results are consistent with this design choice, we acknowledge that direct model-level evidence would strengthen the argument. In the revision we will include: attention-map visualizations from both the Glance and Gaze paths, per-layer breakdowns of attention weights, and quantitative metrics that compare the relative emphasis on temporal versus spatial dimensions across different time scales and layers. These analyses will be added to Section 3 and the supplementary material. revision: yes
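
As a sketch of how the promised per-layer attention analysis could be collected, the snippet below hooks every nn.MultiheadAttention module and stores its head-averaged attention weights. This is an illustrative assumption about tooling, not the authors' code; the paper's own attention modules may expose weights differently.

```python
# Hedged sketch: collect per-layer attention maps with forward hooks.
# Assumes attention modules return (output, attn_weights), as
# torch.nn.MultiheadAttention does when called with need_weights=True.
import torch
import torch.nn as nn

def collect_attention_maps(model: nn.Module, clip: torch.Tensor) -> dict:
    maps, handles = {}, []

    def make_hook(name):
        def hook(module, inputs, output):
            # MultiheadAttention returns (attn_output, attn_weights)
            if isinstance(output, tuple) and output[1] is not None:
                maps[name] = output[1].detach().cpu()
        return hook

    for name, module in model.named_modules():
        if isinstance(module, nn.MultiheadAttention):
            handles.append(module.register_forward_hook(make_hook(name)))
    with torch.no_grad():
        model(clip)
    for h in handles:
        h.remove()
    return maps   # {layer_name: (batch, query_tokens, key_tokens)}, averaged over heads

if __name__ == "__main__":
    class Toy(nn.Module):
        def __init__(self):
            super().__init__()
            self.attn = nn.MultiheadAttention(64, 4, batch_first=True)
        def forward(self, x):
            out, _ = self.attn(x, x, x)   # need_weights defaults to True
            return out

    maps = collect_attention_maps(Toy(), torch.randn(1, 50, 64))
    print({k: tuple(v.shape) for k, v in maps.items()})   # {'attn': (1, 50, 50)}
```

Given maps keyed by layer, the temporal-versus-spatial weighting the referee asks about could then be summarized by marginalizing the key axis over space and over time separately.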

Circularity Check

0 steps flagged

No circularity: empirical architecture proposal evaluated on external benchmarks

full rationale

The paper proposes the OG-ReG Transformer architecture motivated by human visual cognition (glance/gaze paths for spatiotemporal attention), then reports end-to-end SOTA results on Kinetics-400, SSv2 and Diving-48. No equations, derivations, fitted parameters renamed as predictions, or self-citation chains appear in the provided text. The central claim rests on empirical performance rather than any reduction to inputs by construction. This is a standard self-contained empirical contribution with independent external validation.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The paper introduces no explicit free parameters, axioms, or invented entities beyond standard transformer components and the high-level dual-path split; the architecture is presented as a novel empirical combination.

pith-pipeline@v0.9.0 · 5505 in / 1045 out tokens · 59035 ms · 2026-05-10T18:03:36.685232+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

102 extracted references · 22 canonical work pages · 3 internal anchors

[1] Arnab, A., Dehghani, M., Heigold, G., Sun, C., Lučić, M., Schmid, C., 2021. ViViT: A video vision transformer, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6836–6846.
[2] Attneave, F., 1954. Some informational aspects of visual perception. Psychological Review 61, 183.
[3] Bar, M., 2003. A cortical mechanism for triggering top-down facilitation in visual object recognition. Journal of Cognitive Neuroscience 15, 600–609.
[4] Bertasius, G., Wang, H., Torresani, L., 2021. Is space-time attention all you need for video understanding?, in: International Conference on Machine Learning, p. 4.
[5–6] Buch, S., Eyzaguirre, C., Gaidon, A., Wu, J., Fei-Fei, L., Niebles, J.C., 2022. Revisiting the "video" in video-language understanding, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2917–2927.
[7] Bulat, A., Perez Rua, J.M., Sudhakaran, S., Martinez, B., Tzimiropoulos, G., 2021. Space-time mixing attention for video transformer. Advances in Neural Information Processing Systems 34, 19594–19607.
[8] Bullier, J., 2001. Integrated model of visual processing. Brain Research Reviews 36, 96–107.
[9] Caballero, J., Ledig, C., Aitken, A., Acosta, A., Totz, J., Wang, Z., Shi, W., 2017. Real-time video super-resolution with spatio-temporal networks and motion compensation, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4778–4787.
[10] Carion, N., Massa, F., Synnaeve, G., Usunier, N., Kirillov, A., Zagoruyko, S., 2020. End-to-end object detection with transformers, in: European Conference on Computer Vision, Springer. pp. 213–229.
[11] Carreira, J., Zisserman, A., 2017. Quo vadis, action recognition? A new model and the Kinetics dataset, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6299–6308.
[12] Chen, Q., Wu, Q., Wang, J., Hu, Q., Hu, T., Ding, E., Cheng, J., Wang, J., 2022. MixFormer: Mixing features across windows and dimensions, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5249–5259.
[13] Chen, Y., Dai, X., Liu, M., Chen, D., Yuan, L., Liu, Z., 2020. Dynamic convolution: Attention over convolution kernels, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11030–11039.
[14] Chen, Y., Fan, H., Xu, B., Yan, Z., Kalantidis, Y., Rohrbach, M., Yan, S., Feng, J., 2019. Drop an octave: Reducing spatial redundancy in convolutional neural networks with octave convolution, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 3435–3444.
[15] Chu, X., Tian, Z., Wang, Y., Zhang, B., Ren, H., Wei, X., Xia, H., Shen, C., 2021. Twins: Revisiting the design of spatial attention in vision transformers. Advances in Neural Information Processing Systems 34, 9355–9366.
[16] Deubel, H., Schneider, W.X., 1996. Saccade target selection and object recognition: Evidence for a common attentional mechanism. Vision Research 36, 1827–1837.
[17] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al., 2020. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929.
[18] Fan, H., Xiong, B., Mangalam, K., Li, Y., Yan, Z., Malik, J., Feichtenhofer, C., 2021a. Multiscale vision transformers, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6824–6835.
[19] Fan, Q., Chen, C.F.R., Kuehne, H., Pistoia, M., Cox, D., 2019. More is less: Learning efficient video representations by big-little network and depthwise temporal aggregation. Advances in Neural Information Processing Systems 32.
[20] Fan, Q., Panda, R., et al., 2021b. An image classifier can suffice for video understanding. arXiv preprint arXiv:2106.14104.
[21] Feichtenhofer, C., 2020. X3D: Expanding architectures for efficient video recognition, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 203–213.
[22] Feichtenhofer, C., Fan, H., Malik, J., He, K., 2019. SlowFast networks for video recognition, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6202–6211.
[23–24] Gao, R., Liu, X., Xing, B., Yu, Z., Schuller, B.W., Kälviäinen, H. Identity-free artificial emotional intelligence via micro-gesture understanding. IEEE Transactions on Affective Computing.
[25] Gao, X., Chang, Z., Ran, X., Lu, Y., 2024. CANet: Comprehensive attention network for video-based action recognition. Knowledge-Based Systems 296, 111852. doi:10.1016/j.knosys.2024.111852.
[26] Girdhar, R., Singh, M., Ravi, N., van der Maaten, L., Joulin, A., Misra, I., 2022. Omnivore: A single model for many visual modalities, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16102–16112.
[27] Goyal, R., Ebrahimi Kahou, S., Michalski, V., Materzynska, J., Westphal, S., Kim, H., Haenel, V., Fruend, I., Yianilos, P., Mueller-Freitag, M., et al., 2017. The "something something" video database for learning and evaluating visual common sense, in: Proceedings of the IEEE International Conference on Computer Vision, pp. 5842–5850.
[28] Guo, F., Qi, H., Zhang, X., Zhu, L., Sun, J., 2025. GSLTA-CDFSAR: Global sequences and local tuples alignment for cross-domain few-shot action recognition. Knowledge-Based Systems 311, 113041. doi:10.1016/j.knosys.2025.113041.
[30] He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R., 2022. Masked autoencoders are scalable vision learners, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16000–16009.
[31] Huang, Z., Ben, Y., Luo, G., Cheng, P., Yu, G., Fu, B., 2021. Shuffle transformer: Rethinking spatial shuffle for vision transformer. arXiv preprint arXiv:2106.03650.
[32] Kauffmann, L., Ramanoël, S., Peyrin, C., 2014. The neural bases of spatial frequency processing during scene perception. Frontiers in Integrative Neuroscience 8, 37.
[33] Korbar, B., Tran, D., Torresani, L., 2019. SCSampler: Sampling salient clips from video for efficient action recognition, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6232–6242.
[34] Kwon, H., Kim, M., Kwak, S., Cho, M., 2020. MotionSqueeze: Neural motion feature learning for video understanding, in: European Conference on Computer Vision, Springer. pp. 345–362.
[35] Large, E.W., Jones, M.R., 1999. The dynamics of attending: How people track time-varying events. Psychological Review 106, 119.
[36] Li, A., Zheng, C., Zhang, L., Li, X., 2022a. Glance and gaze: A collaborative learning framework for single-channel speech enhancement. Applied Acoustics 187, 108499.
[37] Li, C., Zhong, Q., Xie, D., Pu, S., 2019. Collaborative spatiotemporal feature learning for video action recognition, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7872–7881.
[38–39] Li, D., Shao, J., Xing, B., Gao, R., Wen, B., Kälviäinen, H., Liu, X. MSF-Mamba: Motion-aware state fusion mamba for efficient micro-gesture recognition. IEEE Transactions on Multimedia.
[40] Li, D., Xing, B., Liu, X., 2024. Enhancing micro gesture recognition for emotion understanding via context-aware visual-text contrastive learning. IEEE Signal Processing Letters 31, 1309–1313.
[41] Li, D., Xing, B., Liu, X., Xia, B., Wen, B., Kälviäinen, H., 2025. DeEmo: De-identity multimodal emotion recognition and reasoning, in: Proceedings of the 33rd ACM International Conference on Multimedia, pp. 5707–5716.
[42] Li, J., Xia, X., Li, W., Li, H., Wang, X., Xiao, X., Wang, R., Zheng, M., Pan, X., 2022b. Next-ViT: Next generation vision transformer for efficient deployment in realistic industrial scenarios. arXiv preprint arXiv:2207.05501.
[43] Li, K., Wang, Y., Gao, P., Song, G., Liu, Y., Li, H., Qiao, Y., 2022c. UniFormer: Unified transformer for efficient spatiotemporal representation learning. arXiv preprint arXiv:2201.04676.
[44] Li, K., Wang, Y., Zhang, J., Gao, P., Song, G., Liu, Y., Li, H., Qiao, Y., 2022d. UniFormer: Unifying convolution and self-attention for visual recognition. arXiv preprint arXiv:2201.09450.
[45] Li, S., Bak, S., Carr, P., Wang, X., 2018a. Diversity regularized spatiotemporal attention for video-based person re-identification, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 369–378.
[46] Li, Y., Ji, B., Shi, X., Zhang, J., Kang, B., Wang, L., 2020. TEA: Temporal excitation and aggregation for action recognition, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 909–918.
[47] Li, Y., Li, Y., Vasconcelos, N., 2018b. RESOUND: Towards action recognition without representation bias, in: Proceedings of the European Conference on Computer Vision (ECCV), pp. 513–528.
[48] Li, Y., Wu, C.Y., Fan, H., Mangalam, K., Xiong, B., Malik, J., Feichtenhofer, C., 2021. Improved multiscale vision transformers for classification and detection. arXiv preprint arXiv:2112.01526.
[49] Lin, J., Gan, C., Han, S., 2019. TSM: Temporal shift module for efficient video understanding, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 7083–7093.
[50–51] Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B., 2021. Swin Transformer: Hierarchical vision transformer using shifted windows, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022.
[52] Liu, Z., Luo, D., Wang, Y., Wang, L., Tai, Y., Wang, C., Li, J., Huang, F., Lu, T., 2020. TEINet: Towards an efficient architecture for video recognition, in: Proceedings of the AAAI Conference on Artificial Intelligence, pp. 11669–11676.
[53] Liu, Z., Ning, J., Cao, Y., Wei, Y., Zhang, Z., Lin, S., Hu, H., 2022. Video Swin Transformer, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3202–3211.
[54–55] Lu, X., Zhao, S., Cheng, L., Zheng, Y., Fan, X., Song, M., 2024. Mixed resolution network with hierarchical motion modeling for efficient action recognition. Knowledge-Based Systems 294, 111686. doi:10.1016/j.knosys.2024.111686.
[56] Min, J., Zhao, Y., Luo, C., Cho, M., 2022. Peripheral vision transformer. arXiv preprint arXiv:2206.06801.
[57] Munsif, M., Khan, S.U., Khan, N., Hussain, A., Kim, M.J., Baik, S.W., 2024. Contextual visual and motion salient fusion framework for action recognition in dark environments. Knowledge-Based Systems 304, 112480. doi:10.1016/j.knosys.2024.112480.
[58] Neimark, D., Bar, O., Zohar, M., Asselmann, D., 2021. Video transformer network, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 3163–3172.
[59] Ni, B., Peng, H., Chen, M., Zhang, S., Meng, G., Fu, J., Xiang, S., Ling, H., 2022. Expanding language-image pretrained models for general video recognition, in: European Conference on Computer Vision, Springer. pp. 1–18.
[60] Paller, K.A., Wagner, A.D., 2002. Observing the transformation of experience into memory. Trends in Cognitive Sciences 6, 93–102.
[61] Park, N., Kim, S., 2022. How do vision transformers work? arXiv preprint arXiv:2202.06709.
[62] Patrick, M., Campbell, D., Asano, Y., Misra, I., Metze, F., Feichtenhofer, C., Vedaldi, A., Henriques, J.F., 2021. Keeping your eye on the ball: Trajectory attention in video transformers. Advances in Neural Information Processing Systems 34, 12493–12506.
[63] Piergiovanni, A., Kuo, W., Angelova, A., 2023. Rethinking video ViTs: Sparse video tubes for joint image and video learning, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2214–2224.
[64] Qiu, Z., Yao, T., Mei, T., 2017. Learning spatio-temporal representation with pseudo-3D residual networks, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 5533–5541.
[65] Raghu, M., Unterthiner, T., Kornblith, S., Zhang, C., Dosovitskiy, A., 2021. Do vision transformers see like convolutional neural networks? Advances in Neural Information Processing Systems 34, 12116–12128.
[66] Rasheed, H., Khattak, M.U., Maaz, M., Khan, S., Khan, F.S., 2023. Fine-tuned CLIP models are efficient video learners, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6545–6554.
[67] Rav-Acha, A., Pritch, Y., Peleg, S., 2006. Making a long video short: Dynamic video synopsis, in: 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06), IEEE. pp. 435–441.
[68] Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., Berg, A.C., Fei-Fei, L., 2015. ImageNet large scale visual recognition challenge. International Journal of Computer Vision (IJCV) 115, 211–252. doi:10.1007/s11263-015-0816-y.
[69] Selvaraju, R.R., Cogswell, M., Das, A., Vedantam, R., Parikh, D., Batra, D., 2017. Grad-CAM: Visual explanations from deep networks via gradient-based localization, in: Proceedings of the IEEE/CVF International Conference on Computer Vision.
[70] Si, C., Yu, W., Zhou, P., Zhou, Y., Wang, X., Yan, S., 2022. Inception transformer. arXiv preprint arXiv:2205.12956.
[71] Simonyan, K., Zisserman, A., 2014. Two-stream convolutional networks for action recognition in videos. Advances in Neural Information Processing Systems 27.
[72] Stefan, K., Cohen, L.G., Duque, J., Mazzocchio, R., Celnik, P., Sawaki, L., Ungerleider, L., Classen, J., 2005. Formation of a motor memory by action observation. Journal of Neuroscience 25, 9339–9346.
[73] Strudel, R., Garcia, R., Laptev, I., Schmid, C., 2021. Segmenter: Transformer for semantic segmentation, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 7262–7272.
[74] Tang, C.W., 2007. Spatiotemporal visual considerations for video coding. IEEE Transactions on Multimedia 9, 231–238.
[75] Tong, Z., Song, Y., Wang, J., Wang, L., 2022. VideoMAE: Masked autoencoders are data-efficient learners for self-supervised video pre-training. arXiv preprint arXiv:2203.12602.
[76] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H., 2021. Training data-efficient image transformers & distillation through attention, in: International Conference on Machine Learning, PMLR. pp. 10347–10357.
[77] Tran, D., Bourdev, L., Fergus, R., Torresani, L., Paluri, M., 2015. Learning spatiotemporal features with 3D convolutional networks, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4489–4497.
[78] Tran, D., Wang, H., Torresani, L., Ray, J., LeCun, Y., Paluri, M., 2018. A closer look at spatiotemporal convolutions for action recognition, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6450–6459.
[79] Wang, J., Torresani, L., 2022. Deformable video transformer, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 14053–14062.
[80] Wang, L., Tong, Z., Ji, B., Wu, G., 2021a. TDN: Temporal difference networks for efficient action recognition, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1895–1904.
[81] Wang, L., Xiong, Y., Wang, Z., Qiao, Y., Lin, D., Tang, X., Gool, L.V., 2016. Temporal segment networks: Towards good practices for deep action recognition, in: European Conference on Computer Vision, Springer. pp. 20–36.

Showing first 80 references.