pith. machine review for the scientific record.

arxiv: 2604.06783 · v1 · submitted 2026-04-08 · 💻 cs.CV

Recognition: 2 theorem links


Insights from Visual Cognition: Understanding Human Action Dynamics with Overall Glance and Refined Gaze Transformer

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 18:03 UTC · model grok-4.3

classification 💻 cs.CV
keywords video action recognition · transformer · human visual cognition · glance and gaze · spatiotemporal attention · dual-path network · kinetics-400 · motion modeling

The pith

A dual-path Transformer using human-like glance for coarse video context and gaze for local details outperforms uniform attention in action recognition.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that video Transformers succeed by mimicking human visual cognition: one path takes an overall glance at coarse spatiotemporal information while the other refines with local details, rather than treating time and space uniformly. This matters because factorized and window-based attention split correlations between regions of interest, limiting motion and long-range dependency capture. If correct, the design yields higher accuracy on standard video benchmarks while maintaining efficiency. A sympathetic reader cares because it grounds architecture choices in observable human behavior instead of purely computational heuristics.

Core claim

The Overall Glance and Refined Gaze (OG-ReG) Transformer is a dual-path network in which the Glance path extracts coarse-grained overall spatiotemporal information and the Gaze path supplements it with local details. This structure follows from the observation that temporal and spatial importance varies across time scales and that attention is allocated sparsely via glance and gaze behavior, unlike factorized or window-based self-attention that breaks correlations. The resulting model reaches state-of-the-art accuracy on Kinetics-400, Something-Something v2, and Diving-48.
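
To make the structure concrete, here is a minimal sketch of what a dual-path block of this kind might look like. It is not the authors' implementation: the pooled global attention in the Glance path, the depthwise convolution in the Gaze path, the additive fusion, and all names (GlanceGazeBlock, pool, local) are assumptions for illustration; only the high-level split into a coarse overall path and a local-detail path comes from the paper.

```python
# Hedged sketch of a dual-path "glance + gaze" block.
# Assumptions (not from the paper): pooled global attention for the Glance path,
# a depthwise convolution for the Gaze path, and fusion by simple addition.
import torch
import torch.nn as nn

class GlanceGazeBlock(nn.Module):
    def __init__(self, dim: int, heads: int = 8, pool: int = 4):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        # Glance path: self-attention against a coarsely pooled token sequence.
        self.pool = nn.AvgPool1d(kernel_size=pool, stride=pool)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Gaze path: local refinement with a depthwise convolution over tokens.
        self.local = nn.Conv1d(dim, dim, kernel_size=3, padding=1, groups=dim)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, tokens, dim) flattened spatiotemporal tokens
        z = self.norm(x)
        # Glance: queries stay at full resolution, keys/values are pooled (coarse).
        coarse = self.pool(z.transpose(1, 2)).transpose(1, 2)   # (B, tokens/pool, dim)
        glance, _ = self.attn(z, coarse, coarse)
        # Gaze: cheap local detail on the full-resolution tokens.
        gaze = self.local(z.transpose(1, 2)).transpose(1, 2)
        # Additive fusion is an assumption; the paper may fuse differently.
        return x + self.proj(glance + gaze)

if __name__ == "__main__":
    block = GlanceGazeBlock(dim=96)
    tokens = torch.randn(2, 8 * 14 * 14, 96)   # e.g. 8 frames of 14x14 patches
    print(block(tokens).shape)                 # torch.Size([2, 1568, 96])
```

Shape check: with 8 frames of 14×14 patches the block maps (2, 1568, 96) tokens back to the same shape, while the Glance attention is computed against a 4× shorter pooled sequence.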

What carries the argument

The dual-path OG-ReG network that separates an overall glance path for coarse spatiotemporal features from a refined gaze path for local details, thereby preserving motion correlations.

Load-bearing premise

The importance of temporal and spatial information varies across time scales, and human visual attention is allocated sparsely through glance and gaze behavior; the argument takes this separation, rather than uniform attention over space and time, to be what drives better video performance.

What would settle it

A controlled experiment in which a standard factorized self-attention Transformer, trained on identical data and hyperparameters, reaches equal or higher top-1 accuracy on Something-Something v2 would falsify the claimed advantage of the glance-gaze split.
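
As a sketch of how such a control could be run, the snippet below fixes one shared hyperparameter set and checks that the two models sit at a matched parameter budget before comparison. The constructors and the specific hyperparameter values are hypothetical placeholders, not taken from the paper; only the requirement of identical data and training settings comes from the statement above.

```python
# Hedged sketch of a matched-budget controlled comparison. The model
# constructors (factorized baseline vs. glance-gaze model) are hypothetical
# placeholders standing in for real implementations.
import torch

SHARED_HPARAMS = dict(          # illustrative values only, not the paper's recipe
    dataset="something-something-v2",
    frames=16, resolution=224,
    epochs=60, batch_size=64,
    optimizer="adamw", lr=1e-3, weight_decay=0.05,
    seed=0,
)

def parameter_count(model: torch.nn.Module) -> int:
    """Number of trainable parameters."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

def matched_budget(model_a: torch.nn.Module, model_b: torch.nn.Module,
                   tolerance: float = 0.05) -> bool:
    """True if the two models are within `tolerance` of each other in parameters."""
    pa, pb = parameter_count(model_a), parameter_count(model_b)
    return abs(pa - pb) / max(pa, pb) <= tolerance

if __name__ == "__main__":
    # Toy demo of the budget check with two stand-in modules.
    a = torch.nn.Linear(128, 128)
    b = torch.nn.Sequential(torch.nn.Linear(128, 120), torch.nn.Linear(120, 128))
    print(matched_budget(a, b))   # False: budgets differ by far more than 5%
```

Only if the budget check holds would both models be trained with SHARED_HPARAMS and compared on top-1 accuracy on the Something-Something v2 validation split.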

Figures

Figures reproduced from arXiv: 2604.06783 by Bohao Xing, Deng Li, Heikki Kälviäinen, Rong Gao, Xin Liu.

Figure 1
Figure 1: The differences between window-based attention and glance-like attention. (a) Frame sequences (pushing notebook so that it falls off the table) from the SSv2 [25] dataset. (b) In window-based self-attention (e.g., Video-Swin [49]), a square represents a window within which self-attention performs its calculations. When an object (marked with red) zooms, moves, and rotates across many different windows in frame… view at source ↗
Figure 2
Figure 2: By visualizing the similarity matrices and their corresponding three video sequences ((a) a fast tempo, (b) a slow tempo, (c) a slower tempo) from the same action category, we demonstrate that matrix 𝑨 can effectively capture the tempo characteristics of the actions. view at source ↗
Figure 3
Figure 3: The framework of the OG-ReG Transformer. (a) An overview of the proposed Overall Glance and Refined Gaze (OG-ReG) Transformer. (b) The fundamental building block of OG-ReG, namely the OG-ReG block. view at source ↗
Figure 4
Figure 4: Details of the OG-ReG block (neglecting Layer-Norm). view at source ↗
Figure 5
Figure 5: Details of the MDConv. Here, X ∈ ℝ^(N×C) is the input and s is the downsample/upsample ratio, typically set to [8, 4, 2, 1] at different stages. ReS(X) reshapes the input sequence X ∈ ℝ^(N×C) to ℝ^(T×H×W×C) or back; Down(X, s) and Up(X, s) downsample the feature X ∈ ℝ^(T×H×W×C) to ℝ^(T×(H/s)×(W/s)×C), or upsample ba… view at source ↗
Figure 6
Figure 6: Fourier spectrum of features in OG-ReG-T and Video-Swin-B. view at source ↗
Figure 7
Figure 7: Grad-CAM and visual tempo visualization. view at source ↗
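
The Figure 4 caption originally carried a garbled copy of the block's self-attention step (Eq. 2 in the paper). Cleaned up, it is the standard scaled dot-product attention:

```latex
% Self-attention inside the OG-ReG block (Eq. 2), reconstructed from the
% garbled Figure 4 caption text.
\begin{aligned}
\mathbf{Q},\,\mathbf{K},\,\mathbf{V} &= \mathbf{Z}_{l-1}\mathbf{W}_Q,\ \mathbf{Z}_{l-1}\mathbf{W}_K,\ \mathbf{Z}_{l-1}\mathbf{W}_V,\\
\hat{\mathbf{Z}}_{l} &= \mathrm{SoftMax}\!\left(\mathbf{Q}\mathbf{K}^{\top}/\sqrt{d}\right)\mathbf{V},
\end{aligned}
```

where W_Q, W_K, W_V are the linear projection weights and Q, K, V are the query, key, and value.

Below is also a minimal sketch of the ReS / Down / Up operations described in the Figure 5 caption, assuming average pooling for Down and nearest-neighbour interpolation for Up; the actual operators inside MDConv are not specified in the extracted text, so only the tensor shapes and the stage-wise ratios [8, 4, 2, 1] come from the caption.

```python
# Hedged sketch of the ReS / Down / Up operations from the Figure 5 caption.
# Pooling and interpolation modes are assumptions for illustration.
import torch
import torch.nn.functional as F

def res(x: torch.Tensor, t: int, h: int, w: int) -> torch.Tensor:
    """ReS: reshape a token sequence (B, N, C) with N = T*H*W into (B, T, H, W, C)."""
    b, n, c = x.shape
    assert n == t * h * w
    return x.view(b, t, h, w, c)

def down(x: torch.Tensor, s: int) -> torch.Tensor:
    """Down: spatially downsample (B, T, H, W, C) to (B, T, H/s, W/s, C)."""
    b, t, h, w, c = x.shape
    y = x.permute(0, 1, 4, 2, 3).reshape(b * t, c, h, w)   # (B*T, C, H, W)
    y = F.avg_pool2d(y, kernel_size=s, stride=s)           # spatial dimensions only
    return y.reshape(b, t, c, h // s, w // s).permute(0, 1, 3, 4, 2)

def up(x: torch.Tensor, s: int) -> torch.Tensor:
    """Up: spatially upsample (B, T, H, W, C) back by a factor of s."""
    b, t, h, w, c = x.shape
    y = x.permute(0, 1, 4, 2, 3).reshape(b * t, c, h, w)
    y = F.interpolate(y, scale_factor=s, mode="nearest")
    return y.reshape(b, t, c, h * s, w * s).permute(0, 1, 3, 4, 2)

if __name__ == "__main__":
    tokens = torch.randn(2, 8 * 56 * 56, 96)   # hypothetical stage-1 token grid
    grid = res(tokens, t=8, h=56, w=56)        # (2, 8, 56, 56, 96)
    coarse = down(grid, s=8)                   # (2, 8, 7, 7, 96), ratio from [8, 4, 2, 1]
    print(up(coarse, s=8).shape)               # back to (2, 8, 56, 56, 96)
```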
read the original abstract

Recently, Transformer has made significant progress in various vision tasks. To balance computation and efficiency in video tasks, recent works heavily rely on factorized or window-based self-attention. However, these approaches split spatiotemporal correlations between regions of interest in videos, limiting the models' ability to capture motion and long-range dependencies. In this paper, we argue that, similar to the human visual system, the importance of temporal and spatial information varies across different time scales, and attention is allocated sparsely over time through glance and gaze behavior. Is equal consideration of time and space crucial for success in video tasks? Motivated by this understanding, we propose a dual-path network called the Overall Glance and Refined Gaze (OG-ReG) Transformer. The Glance path extracts coarse-grained overall spatiotemporal information, while the Gaze path supplements the Glance path by providing local details. Our model achieves state-of-the-art results on the Kinetics-400, Something-Something v2, and Diving-48, demonstrating its competitive performance. The code will be available at https://github.com/linuxsino/OG-ReG.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes the Overall Glance and Refined Gaze (OG-ReG) Transformer, a dual-path video architecture motivated by human visual cognition. The Glance path extracts coarse overall spatiotemporal information while the Gaze path supplies local refined details; the design is argued to better preserve motion and long-range dependencies than factorized or window-based self-attention. The paper reports state-of-the-art results on Kinetics-400, Something-Something v2, and Diving-48.

Significance. If the performance gains can be attributed to the dual-path design rather than capacity or training differences, the work supplies a cognitively motivated alternative to standard attention factorizations and could improve efficiency in modeling varying spatiotemporal importance across time scales. Public code release is a clear strength for reproducibility.

major comments (2)
  1. [Experiments section] The central empirical claim (SOTA on Kinetics-400, SSv2, Diving-48) is presented without controlled ablations that isolate the contribution of the dual Glance+Gaze paths. No comparison is shown against a single-path baseline, a standard divided space-time attention model, or factorized attention at matched FLOPs (see Experiments section and associated tables).
  2. [Section 3] The motivation that 'equal consideration of time and space is not crucial' and that sparse glance/gaze allocation is superior rests on the assumption that performance differences arise from this mechanism; however, no attention-map visualizations, per-layer importance analysis, or quantitative comparison of temporal vs. spatial weighting across time scales are provided to support this (see Section 3 and Figure 1).
minor comments (2)
  1. [Abstract] The abstract states SOTA results but does not report the exact top-1/top-5 margins or the competing methods being surpassed.
  2. [Section 3.2] Notation for the two paths (e.g., how features from Glance and Gaze are fused) is introduced without an explicit equation or pseudocode block.

Simulated Author's Rebuttal

2 responses · 0 unresolved

Thank you for the constructive feedback on our manuscript. We appreciate the referee's careful reading and the opportunity to address the concerns raised. We respond to each major comment below and commit to revisions that will strengthen the empirical support and mechanistic analysis without altering the core claims.

read point-by-point responses
  1. Referee: [Experiments section] The central empirical claim (SOTA on Kinetics-400, SSv2, Diving-48) is presented without controlled ablations that isolate the contribution of the dual Glance+Gaze paths. No comparison is shown against a single-path baseline, a standard divided space-time attention model, or factorized attention at matched FLOPs (see Experiments section and associated tables).

    Authors: We agree that isolating the contribution of the dual-path design requires additional controlled experiments. In the revised manuscript we will add: (1) a single-path baseline obtained by ablating the Gaze path while keeping the Glance path and overall architecture otherwise identical; (2) direct comparisons against a standard divided space-time attention model and a factorized attention baseline; and (3) explicit reporting of FLOPs and parameter counts for every model so that all comparisons occur at matched computational budgets. These additions will clarify whether the observed gains are attributable to the proposed Glance+Gaze mechanism rather than capacity or training differences. revision: yes

  2. Referee: [Section 3] The motivation that 'equal consideration of time and space is not crucial' and that sparse glance/gaze allocation is superior rests on the assumption that performance differences arise from this mechanism; however, no attention-map visualizations, per-layer importance analysis, or quantitative comparison of temporal vs. spatial weighting across time scales are provided to support this (see Section 3 and Figure 1).

    Authors: The motivation is drawn from established findings in visual cognition on glance and gaze behavior, where global context and local refinement are allocated sparsely and with varying spatiotemporal emphasis. While the performance results are consistent with this design choice, we acknowledge that direct model-level evidence would strengthen the argument. In the revision we will include: attention-map visualizations from both the Glance and Gaze paths, per-layer breakdowns of attention weights, and quantitative metrics that compare the relative emphasis on temporal versus spatial dimensions across different time scales and layers. These analyses will be added to Section 3 and the supplementary material. revision: yes
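
As a sketch of how the promised per-layer attention analysis could be collected, the snippet below hooks every nn.MultiheadAttention module and stores its head-averaged attention weights. This is an illustrative assumption about tooling, not the authors' code; the paper's own attention modules may expose weights differently.

```python
# Hedged sketch: collect per-layer attention maps with forward hooks.
# Assumes attention modules return (output, attn_weights), as
# torch.nn.MultiheadAttention does when called with need_weights=True.
import torch
import torch.nn as nn

def collect_attention_maps(model: nn.Module, clip: torch.Tensor) -> dict:
    maps, handles = {}, []

    def make_hook(name):
        def hook(module, inputs, output):
            # MultiheadAttention returns (attn_output, attn_weights)
            if isinstance(output, tuple) and output[1] is not None:
                maps[name] = output[1].detach().cpu()
        return hook

    for name, module in model.named_modules():
        if isinstance(module, nn.MultiheadAttention):
            handles.append(module.register_forward_hook(make_hook(name)))
    with torch.no_grad():
        model(clip)
    for h in handles:
        h.remove()
    return maps   # {layer_name: (batch, query_tokens, key_tokens)}, averaged over heads

if __name__ == "__main__":
    class Toy(nn.Module):
        def __init__(self):
            super().__init__()
            self.attn = nn.MultiheadAttention(64, 4, batch_first=True)
        def forward(self, x):
            out, _ = self.attn(x, x, x)   # need_weights defaults to True
            return out

    maps = collect_attention_maps(Toy(), torch.randn(1, 50, 64))
    print({k: tuple(v.shape) for k, v in maps.items()})   # {'attn': (1, 50, 50)}
```

Given maps keyed by layer, the temporal-versus-spatial weighting the referee asks about could then be summarized by marginalizing the key axis over space and over time separately.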

Circularity Check

0 steps flagged

No circularity: empirical architecture proposal evaluated on external benchmarks

full rationale

The paper proposes the OG-ReG Transformer architecture motivated by human visual cognition (glance/gaze paths for spatiotemporal attention), then reports end-to-end SOTA results on Kinetics-400, SSv2 and Diving-48. No equations, derivations, fitted parameters renamed as predictions, or self-citation chains appear in the provided text. The central claim rests on empirical performance rather than any reduction to inputs by construction. This is a standard self-contained empirical contribution with independent external validation.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The paper introduces no explicit free parameters, axioms, or invented entities beyond standard transformer components and the high-level dual-path split; the architecture is presented as a novel empirical combination.

pith-pipeline@v0.9.0 · 5505 in / 1045 out tokens · 59035 ms · 2026-05-10T18:03:36.685232+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

102 extracted references · 22 canonical work pages · 3 internal anchors

[1] Arnab, A., Dehghani, M., Heigold, G., Sun, C., Lučić, M., Schmid, C., 2021. ViViT: A video vision transformer, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6836–6846.
[2] Attneave, F., 1954. Some informational aspects of visual perception. Psychological Review 61, 183.
[3] Bar, M., 2003. A cortical mechanism for triggering top-down facilitation in visual object recognition. Journal of Cognitive Neuroscience 15, 600–609.
[4] Bertasius, G., Wang, H., Torresani, L., 2021. Is space-time attention all you need for video understanding?, in: International Conference on Machine Learning, p. 4.
[5–6] Buch, S., Eyzaguirre, C., Gaidon, A., Wu, J., Fei-Fei, L., Niebles, J.C., 2022. Revisiting the "video" in video-language understanding, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2917–2927.
[7] Bulat, A., Perez Rua, J.M., Sudhakaran, S., Martinez, B., Tzimiropoulos, G., 2021. Space-time mixing attention for video transformer. Advances in Neural Information Processing Systems 34, 19594–19607.
[8] Bullier, J., 2001. Integrated model of visual processing. Brain Research Reviews 36, 96–107.
[9] Caballero, J., Ledig, C., Aitken, A., Acosta, A., Totz, J., Wang, Z., Shi, W., 2017. Real-time video super-resolution with spatio-temporal networks and motion compensation, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4778–4787.
[10] Carion, N., Massa, F., Synnaeve, G., Usunier, N., Kirillov, A., Zagoruyko, S., 2020. End-to-end object detection with transformers, in: European Conference on Computer Vision, Springer. pp. 213–229.
[11] Carreira, J., Zisserman, A., 2017. Quo vadis, action recognition? A new model and the Kinetics dataset, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6299–6308.
[12] Chen, Q., Wu, Q., Wang, J., Hu, Q., Hu, T., Ding, E., Cheng, J., Wang, J., 2022. MixFormer: Mixing features across windows and dimensions, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5249–5259.
[13] Chen, Y., Dai, X., Liu, M., Chen, D., Yuan, L., Liu, Z., 2020. Dynamic convolution: Attention over convolution kernels, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11030–11039.
[14] Chen, Y., Fan, H., Xu, B., Yan, Z., Kalantidis, Y., Rohrbach, M., Yan, S., Feng, J., 2019. Drop an octave: Reducing spatial redundancy in convolutional neural networks with octave convolution, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 3435–3444.
[15] Chu, X., Tian, Z., Wang, Y., Zhang, B., Ren, H., Wei, X., Xia, H., Shen, C., 2021. Twins: Revisiting the design of spatial attention in vision transformers. Advances in Neural Information Processing Systems 34, 9355–9366.
[16] Deubel, H., Schneider, W.X., 1996. Saccade target selection and object recognition: Evidence for a common attentional mechanism. Vision Research 36, 1827–1837.
[17] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al., 2020. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929.
[18] Fan, H., Xiong, B., Mangalam, K., Li, Y., Yan, Z., Malik, J., Feichtenhofer, C., 2021a. Multiscale vision transformers, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6824–6835.
[19] Fan, Q., Chen, C.F.R., Kuehne, H., Pistoia, M., Cox, D., 2019. More is less: Learning efficient video representations by big-little network and depthwise temporal aggregation. Advances in Neural Information Processing Systems 32.
[20] Fan, Q., Panda, R., et al., 2021b. An image classifier can suffice for video understanding. arXiv preprint arXiv:2106.14104.
[21] Feichtenhofer, C., 2020. X3D: Expanding architectures for efficient video recognition, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 203–213.
[22] Feichtenhofer, C., Fan, H., Malik, J., He, K., 2019. SlowFast networks for video recognition, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6202–6211.
[23–24] Gao, R., Liu, X., Xing, B., Yu, Z., Schuller, B.W., Kälviäinen, H. Identity-free artificial emotional intelligence via micro-gesture understanding. IEEE Transactions on Affective Computing.
[25] Gao, X., Chang, Z., Ran, X., Lu, Y., 2024. CANet: Comprehensive attention network for video-based action recognition. Knowledge-Based Systems 296, 111852. doi:10.1016/j.knosys.2024.111852.
[26] Girdhar, R., Singh, M., Ravi, N., van der Maaten, L., Joulin, A., Misra, I., 2022. Omnivore: A single model for many visual modalities, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16102–16112.
[27] Goyal, R., Ebrahimi Kahou, S., Michalski, V., Materzynska, J., Westphal, S., Kim, H., Haenel, V., Fruend, I., Yianilos, P., Mueller-Freitag, M., et al., 2017. The "something something" video database for learning and evaluating visual common sense, in: Proceedings of the IEEE International Conference on Computer Vision, pp. 5842–5850.
[28] Guo, F., Qi, H., Zhang, X., Zhu, L., Sun, J., 2025. GSLTA-CDFSAR: Global sequences and local tuples alignment for cross-domain few-shot action recognition. Knowledge-Based Systems 311, 113041. doi:10.1016/j.knosys.2025.113041.
[30] He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R., 2022. Masked autoencoders are scalable vision learners, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16000–16009.
[31] Huang, Z., Ben, Y., Luo, G., Cheng, P., Yu, G., Fu, B., 2021. Shuffle transformer: Rethinking spatial shuffle for vision transformer. arXiv preprint arXiv:2106.03650.
[32] Kauffmann, L., Ramanoël, S., Peyrin, C., 2014. The neural bases of spatial frequency processing during scene perception. Frontiers in Integrative Neuroscience 8, 37.
[33] Korbar, B., Tran, D., Torresani, L., 2019. SCSampler: Sampling salient clips from video for efficient action recognition, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6232–6242.
[34] Kwon, H., Kim, M., Kwak, S., Cho, M., 2020. MotionSqueeze: Neural motion feature learning for video understanding, in: European Conference on Computer Vision, Springer. pp. 345–362.
[35] Large, E.W., Jones, M.R., 1999. The dynamics of attending: How people track time-varying events. Psychological Review 106, 119.
[36] Li, A., Zheng, C., Zhang, L., Li, X., 2022a. Glance and gaze: A collaborative learning framework for single-channel speech enhancement. Applied Acoustics 187, 108499.
[37] Li, C., Zhong, Q., Xie, D., Pu, S., 2019. Collaborative spatiotemporal feature learning for video action recognition, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7872–7881.
[38–39] Li, D., Shao, J., Xing, B., Gao, R., Wen, B., Kälviäinen, H., Liu, X. MSF-Mamba: Motion-aware state fusion mamba for efficient micro-gesture recognition. IEEE Transactions on Multimedia.
[40] Li, D., Xing, B., Liu, X., 2024. Enhancing micro gesture recognition for emotion understanding via context-aware visual-text contrastive learning. IEEE Signal Processing Letters 31, 1309–1313.
[41] Li, D., Xing, B., Liu, X., Xia, B., Wen, B., Kälviäinen, H., 2025. DeEmo: De-identity multimodal emotion recognition and reasoning, in: Proceedings of the 33rd ACM International Conference on Multimedia, pp. 5707–5716.
[42] Li, J., Xia, X., Li, W., Li, H., Wang, X., Xiao, X., Wang, R., Zheng, M., Pan, X., 2022b. Next-ViT: Next generation vision transformer for efficient deployment in realistic industrial scenarios. arXiv preprint arXiv:2207.05501.
[43] Li, K., Wang, Y., Gao, P., Song, G., Liu, Y., Li, H., Qiao, Y., 2022c. UniFormer: Unified transformer for efficient spatiotemporal representation learning. arXiv preprint arXiv:2201.04676.
[44] Li, K., Wang, Y., Zhang, J., Gao, P., Song, G., Liu, Y., Li, H., Qiao, Y., 2022d. UniFormer: Unifying convolution and self-attention for visual recognition. arXiv preprint arXiv:2201.09450.
[45] Li, S., Bak, S., Carr, P., Wang, X., 2018a. Diversity regularized spatiotemporal attention for video-based person re-identification, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 369–378.
[46] Li, Y., Ji, B., Shi, X., Zhang, J., Kang, B., Wang, L., 2020. TEA: Temporal excitation and aggregation for action recognition, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 909–918.
[47] Li, Y., Li, Y., Vasconcelos, N., 2018b. RESOUND: Towards action recognition without representation bias, in: Proceedings of the European Conference on Computer Vision (ECCV), pp. 513–528.
[48] Li, Y., Wu, C.Y., Fan, H., Mangalam, K., Xiong, B., Malik, J., Feichtenhofer, C., 2021. Improved multiscale vision transformers for classification and detection. arXiv preprint arXiv:2112.01526.
[49] Lin, J., Gan, C., Han, S., 2019. TSM: Temporal shift module for efficient video understanding, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 7083–7093.
[50–51] Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B., 2021. Swin Transformer: Hierarchical vision transformer using shifted windows, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022.
[52] Liu, Z., Luo, D., Wang, Y., Wang, L., Tai, Y., Wang, C., Li, J., Huang, F., Lu, T., 2020. TEINet: Towards an efficient architecture for video recognition, in: Proceedings of the AAAI Conference on Artificial Intelligence, pp. 11669–11676.
[53] Liu, Z., Ning, J., Cao, Y., Wei, Y., Zhang, Z., Lin, S., Hu, H., 2022. Video Swin Transformer, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3202–3211.
[54–55] Lu, X., Zhao, S., Cheng, L., Zheng, Y., Fan, X., Song, M., 2024. Mixed resolution network with hierarchical motion modeling for efficient action recognition. Knowledge-Based Systems 294, 111686. doi:10.1016/j.knosys.2024.111686.
[56] Min, J., Zhao, Y., Luo, C., Cho, M., 2022. Peripheral vision transformer. arXiv preprint arXiv:2206.06801.
[57] Munsif, M., Khan, S.U., Khan, N., Hussain, A., Kim, M.J., Baik, S.W., 2024. Contextual visual and motion salient fusion framework for action recognition in dark environments. Knowledge-Based Systems 304, 112480. doi:10.1016/j.knosys.2024.112480.
[58] Neimark, D., Bar, O., Zohar, M., Asselmann, D., 2021. Video transformer network, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 3163–3172.
[59] Ni, B., Peng, H., Chen, M., Zhang, S., Meng, G., Fu, J., Xiang, S., Ling, H., 2022. Expanding language-image pretrained models for general video recognition, in: European Conference on Computer Vision, Springer. pp. 1–18.
[60] Paller, K.A., Wagner, A.D., 2002. Observing the transformation of experience into memory. Trends in Cognitive Sciences 6, 93–102.
[61] Park, N., Kim, S., 2022. How do vision transformers work? arXiv preprint arXiv:2202.06709.
[62] Patrick, M., Campbell, D., Asano, Y., Misra, I., Metze, F., Feichtenhofer, C., Vedaldi, A., Henriques, J.F., 2021. Keeping your eye on the ball: Trajectory attention in video transformers. Advances in Neural Information Processing Systems 34, 12493–12506.
[63] Piergiovanni, A., Kuo, W., Angelova, A., 2023. Rethinking video ViTs: Sparse video tubes for joint image and video learning, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2214–2224.
[64] Qiu, Z., Yao, T., Mei, T., 2017. Learning spatio-temporal representation with pseudo-3D residual networks, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 5533–5541.
[65] Raghu, M., Unterthiner, T., Kornblith, S., Zhang, C., Dosovitskiy, A., 2021. Do vision transformers see like convolutional neural networks? Advances in Neural Information Processing Systems 34, 12116–12128.
[66] Rasheed, H., Khattak, M.U., Maaz, M., Khan, S., Khan, F.S., 2023. Fine-tuned CLIP models are efficient video learners, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6545–6554.
[67] Rav-Acha, A., Pritch, Y., Peleg, S., 2006. Making a long video short: Dynamic video synopsis, in: 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06), IEEE. pp. 435–441.
[68] Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., Berg, A.C., Fei-Fei, L., 2015. ImageNet large scale visual recognition challenge. International Journal of Computer Vision (IJCV) 115, 211–252. doi:10.1007/s11263-015-0816-y.
[69] Selvaraju, R.R., Cogswell, M., Das, A., Vedantam, R., Parikh, D., Batra, D., 2017. Grad-CAM: Visual explanations from deep networks via gradient-based localization, in: Proceedings of the IEEE/CVF International Conference on Computer Vision.
[70] Si, C., Yu, W., Zhou, P., Zhou, Y., Wang, X., Yan, S., 2022. Inception transformer. arXiv preprint arXiv:2205.12956.
[71] Simonyan, K., Zisserman, A., 2014. Two-stream convolutional networks for action recognition in videos. Advances in Neural Information Processing Systems 27.
[72] Stefan, K., Cohen, L.G., Duque, J., Mazzocchio, R., Celnik, P., Sawaki, L., Ungerleider, L., Classen, J., 2005. Formation of a motor memory by action observation. Journal of Neuroscience 25, 9339–9346.
[73] Strudel, R., Garcia, R., Laptev, I., Schmid, C., 2021. Segmenter: Transformer for semantic segmentation, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 7262–7272.
[74] Tang, C.W., 2007. Spatiotemporal visual considerations for video coding. IEEE Transactions on Multimedia 9, 231–238.
[75] Tong, Z., Song, Y., Wang, J., Wang, L., 2022. VideoMAE: Masked autoencoders are data-efficient learners for self-supervised video pre-training. arXiv preprint arXiv:2203.12602.
[76] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H., 2021. Training data-efficient image transformers & distillation through attention, in: International Conference on Machine Learning, PMLR. pp. 10347–10357.
[77] Tran, D., Bourdev, L., Fergus, R., Torresani, L., Paluri, M., 2015. Learning spatiotemporal features with 3D convolutional networks, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4489–4497.
[78] Tran, D., Wang, H., Torresani, L., Ray, J., LeCun, Y., Paluri, M., 2018. A closer look at spatiotemporal convolutions for action recognition, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6450–6459.
[79] Wang, J., Torresani, L., 2022. Deformable video transformer, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 14053–14062.
[80] Wang, L., Tong, Z., Ji, B., Wu, G., 2021a. TDN: Temporal difference networks for efficient action recognition, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1895–1904.
[81] Wang, L., Xiong, Y., Wang, Z., Qiao, Y., Lin, D., Tang, X., Gool, L.V., 2016. Temporal segment networks: Towards good practices for deep action recognition, in: European Conference on Computer Vision, Springer. pp. 20–36.

Showing first 80 references.