Recognition: 2 theorem links
Insights from Visual Cognition: Understanding Human Action Dynamics with Overall Glance and Refined Gaze Transformer
Pith reviewed 2026-05-10 18:03 UTC · model grok-4.3
The pith
A dual-path Transformer using human-like glance for coarse video context and gaze for local details outperforms uniform attention in action recognition.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The Overall Glance and Refined Gaze (OG-ReG) Transformer is a dual-path network in which the Glance path extracts coarse-grained overall spatiotemporal information and the Gaze path supplements it with local details. The design follows from the observation that temporal and spatial importance varies across time scales and that attention is allocated sparsely through glance and gaze behavior; factorized or window-based self-attention, by contrast, splits these spatiotemporal correlations. The resulting model reaches state-of-the-art accuracy on Kinetics-400, Something-Something v2, and Diving-48.
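The provided text describes the two paths only at this level of abstraction, so the following PyTorch sketch is a minimal illustration of the general pattern rather than the authors' architecture: a Glance-style branch attends globally over a pooled, coarse token sequence, a Gaze-style branch refines local windows at full resolution, and the two outputs are fused. The pooling factor, window size, and concatenate-and-project fusion are all assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GlanceGazeBlock(nn.Module):
    """Illustrative dual-path block (assumed design, not the OG-ReG reference code):
    global attention over pooled tokens (glance) plus windowed local attention (gaze)."""

    def __init__(self, dim: int, heads: int = 8, pool: int = 4, window: int = 16):
        super().__init__()
        self.pool, self.window = pool, window
        self.glance_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.gaze_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.fuse = nn.Linear(2 * dim, dim)  # assumed fusion: concatenate, then project

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, N, C) spatiotemporal tokens, N = T * H * W after patch embedding.
        B, N, C = x.shape

        # Glance path: coarse global context from a pooled token sequence.
        coarse = F.avg_pool1d(x.transpose(1, 2), self.pool, self.pool).transpose(1, 2)
        g, _ = self.glance_attn(coarse, coarse, coarse)
        g = F.interpolate(g.transpose(1, 2), size=N, mode="nearest").transpose(1, 2)

        # Gaze path: fine-grained attention inside non-overlapping local windows.
        pad = (-N) % self.window
        xw = F.pad(x, (0, 0, 0, pad)).reshape(-1, self.window, C)
        z, _ = self.gaze_attn(xw, xw, xw)
        z = z.reshape(B, N + pad, C)[:, :N]

        # Residual update from the fused coarse context and local detail.
        return x + self.fuse(torch.cat([g, z], dim=-1))
```

Stacking such blocks over patch-embedded clip tokens would give a crude dual-path backbone; the actual OG-ReG path depths, token layout, and fusion operator may differ.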
What carries the argument
The dual-path OG-ReG network that separates an overall glance path for coarse spatiotemporal features from a refined gaze path for local details, thereby preserving motion correlations.
Load-bearing premise
Human visual attention varies in importance across time scales and is allocated sparsely through glance and gaze behavior, so a glance/gaze separation should outperform uniform attention on video.
What would settle it
A controlled experiment in which a standard factorized self-attention Transformer, trained on identical data and hyperparameters, reaches equal or higher top-1 accuracy on Something-Something v2 would falsify the claimed advantage of the glance-gaze split.
Original abstract
Recently, Transformer has made significant progress in various vision tasks. To balance computation and efficiency in video tasks, recent works heavily rely on factorized or window-based self-attention. However, these approaches split spatiotemporal correlations between regions of interest in videos, limiting the models' ability to capture motion and long-range dependencies. In this paper, we argue that, similar to the human visual system, the importance of temporal and spatial information varies across different time scales, and attention is allocated sparsely over time through glance and gaze behavior. Is equal consideration of time and space crucial for success in video tasks? Motivated by this understanding, we propose a dual-path network called the Overall Glance and Refined Gaze (OG-ReG) Transformer. The Glance path extracts coarse-grained overall spatiotemporal information, while the Gaze path supplements the Glance path by providing local details. Our model achieves state-of-the-art results on the Kinetics-400, Something-Something v2, and Diving-48, demonstrating its competitive performance. The code will be available at https://github.com/linuxsino/OG-ReG.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes the Overall Glance and Refined Gaze (OG-ReG) Transformer, a dual-path video architecture motivated by human visual cognition. The Glance path extracts coarse overall spatiotemporal information while the Gaze path supplies local refined details; the design is argued to better preserve motion and long-range dependencies than factorized or window-based self-attention. The paper reports state-of-the-art results on Kinetics-400, Something-Something v2, and Diving-48.
Significance. If the performance gains can be attributed to the dual-path design rather than capacity or training differences, the work supplies a cognitively motivated alternative to standard attention factorizations and could improve efficiency in modeling varying spatiotemporal importance across time scales. Public code release is a clear strength for reproducibility.
Major comments (2)
- [Experiments section] The central empirical claim (SOTA on Kinetics-400, SSv2, Diving-48) is presented without controlled ablations that isolate the contribution of the dual Glance+Gaze paths. No comparison is shown against a single-path baseline, a standard divided space-time attention model, or factorized attention at matched FLOPs (see Experiments section and associated tables).
- [Section 3] The motivation that 'equal consideration of time and space is not crucial' and that sparse glance/gaze allocation is superior rests on the assumption that performance differences arise from this mechanism; however, no attention-map visualizations, per-layer importance analysis, or quantitative comparison of temporal vs. spatial weighting across time scales are provided to support this (see Section 3 and Figure 1).
Minor comments (2)
- [Abstract] The abstract states SOTA results but does not report the exact top-1/top-5 margins or the competing methods being surpassed.
- [Section 3.2] Notation for the two paths (e.g., how features from Glance and Gaze are fused) is introduced without an explicit equation or pseudocode block.
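To illustrate the kind of explicit statement requested, one hedged possibility is a fusion equation of the form below, in which F_gl and F_gz denote Glance and Gaze features; the concatenate-and-project form is an assumption, since the provided text does not specify the actual operator.

```latex
% Hypothetical fusion of Glance and Gaze features (form assumed, not taken from the paper)
F_{\mathrm{out}} = \mathrm{LN}\!\Big( F_{\mathrm{gl}} + \big[\, F_{\mathrm{gl}} \,\Vert\, F_{\mathrm{gz}} \,\big] W_f \Big),
\qquad W_f \in \mathbb{R}^{2d \times d}
```

Here the double bar denotes channel-wise concatenation, W_f a learned projection, and LN layer normalization; the revision could replace this placeholder with whatever operator the model actually uses.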
Simulated Author's Rebuttal
Thank you for the constructive feedback on our manuscript. We appreciate the referee's careful reading and the opportunity to address the concerns raised. We respond to each major comment below and commit to revisions that will strengthen the empirical support and mechanistic analysis without altering the core claims.
Point-by-point responses
-
Referee: [Experiments section] The central empirical claim (SOTA on Kinetics-400, SSv2, Diving-48) is presented without controlled ablations that isolate the contribution of the dual Glance+Gaze paths. No comparison is shown against a single-path baseline, a standard divided space-time attention model, or factorized attention at matched FLOPs (see Experiments section and associated tables).
Authors: We agree that isolating the contribution of the dual-path design requires additional controlled experiments. In the revised manuscript we will add: (1) a single-path baseline obtained by ablating the Gaze path while keeping the Glance path and overall architecture otherwise identical; (2) direct comparisons against a standard divided space-time attention model and a factorized attention baseline; and (3) explicit reporting of FLOPs and parameter counts for every model so that all comparisons occur at matched computational budgets. These additions will clarify whether the observed gains are attributable to the proposed Glance+Gaze mechanism rather than capacity or training differences. revision: yes
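As a hedged sketch of what a matched-budget comparison could look like in the revision, the snippet below reports trainable parameters and, when the optional fvcore package is available, approximate per-clip FLOPs for two interchangeable backbones. The placeholder models, clip shape, and 5% tolerance are illustrative assumptions, not values from the paper.

```python
import torch
import torch.nn as nn

def param_count(model: nn.Module) -> int:
    """Total trainable parameters."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

def report_budget(name: str, model: nn.Module, clip: torch.Tensor) -> None:
    line = f"{name}: {param_count(model) / 1e6:.2f}M params"
    try:
        # Optional dependency; FLOP counting is approximate and operator-dependent.
        from fvcore.nn import FlopCountAnalysis
        line += f", {FlopCountAnalysis(model, clip).total() / 1e9:.2f} GFLOPs/clip"
    except ImportError:
        line += " (install fvcore for FLOP counts)"
    print(line)

def tiny_backbone() -> nn.Module:
    # Stand-in for OG-ReG, its Gaze-ablated single-path variant, or a factorized
    # space-time attention baseline in an actual matched-budget comparison.
    return nn.Sequential(
        nn.Conv3d(3, 64, kernel_size=3, padding=1),
        nn.AdaptiveAvgPool3d(1), nn.Flatten(), nn.Linear(64, 400),
    )

clip = torch.randn(1, 3, 16, 224, 224)  # (batch, channels, frames, height, width)
dual_path, baseline = tiny_backbone(), tiny_backbone()
report_budget("dual-path", dual_path, clip)
report_budget("baseline", baseline, clip)

# A fair ablation would keep the two budgets within a few percent of each other.
assert abs(param_count(dual_path) - param_count(baseline)) <= 0.05 * param_count(baseline)
```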
-
Referee: [Section 3] The motivation that 'equal consideration of time and space is not crucial' and that sparse glance/gaze allocation is superior rests on the assumption that performance differences arise from this mechanism; however, no attention-map visualizations, per-layer importance analysis, or quantitative comparison of temporal vs. spatial weighting across time scales are provided to support this (see Section 3 and Figure 1).
Authors: The motivation is drawn from established findings in visual cognition on glance and gaze behavior, where global context and local refinement are allocated sparsely and with varying spatiotemporal emphasis. While the performance results are consistent with this design choice, we acknowledge that direct model-level evidence would strengthen the argument. In the revision we will include: attention-map visualizations from both the Glance and Gaze paths, per-layer breakdowns of attention weights, and quantitative metrics that compare the relative emphasis on temporal versus spatial dimensions across different time scales and layers. These analyses will be added to Section 3 and the supplementary material. revision: yes
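One hedged form such a quantitative metric could take: given full joint space-time attention maps over N = T x S tokens, split each query's attention mass into keys at the same spatial location but other frames (temporal) versus keys in the same frame but other locations (spatial). The time-major token layout and the exclusion of self-attention mass are assumptions about how the analysis could be run, not the authors' protocol.

```python
import torch

def temporal_vs_spatial_mass(attn: torch.Tensor, T: int, S: int):
    """attn: (heads, N, N) softmax-normalized attention, N = T * S tokens laid out
    time-major (token index = t * S + s). Returns the mean attention mass each query
    sends to same-location/other-frame keys vs. same-frame/other-location keys."""
    H, N, _ = attn.shape
    assert N == T * S
    idx = torch.arange(N)
    t, s = idx // S, idx % S                        # frame and spatial indices per token

    same_space = s.unsqueeze(0) == s.unsqueeze(1)   # (N, N): query/key share spatial location
    same_time = t.unsqueeze(0) == t.unsqueeze(1)    # (N, N): query/key share frame
    self_mask = torch.eye(N, dtype=torch.bool)

    temporal = (same_space & ~self_mask).float()    # same place, different frame
    spatial = (same_time & ~self_mask).float()      # same frame, different place

    temporal_mass = (attn * temporal).sum(-1).mean().item()
    spatial_mass = (attn * spatial).sum(-1).mean().item()
    return temporal_mass, spatial_mass

# Toy usage: uniform attention over 8 frames x 196 patches.
T, S = 8, 196
attn = torch.full((4, T * S, T * S), 1.0 / (T * S))
print(temporal_vs_spatial_mass(attn, T, S))   # roughly (0.0045, 0.124) under uniform attention
```

Tracking these two numbers per layer, and how they shift with clip length or frame rate, would give a direct test of the claim that temporal and spatial emphasis varies across time scales.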
Circularity Check
No circularity: empirical architecture proposal evaluated on external benchmarks
Full rationale
The paper proposes the OG-ReG Transformer architecture motivated by human visual cognition (glance/gaze paths for spatiotemporal attention), then reports end-to-end SOTA results on Kinetics-400, SSv2 and Diving-48. No equations, derivations, fitted parameters renamed as predictions, or self-citation chains appear in the provided text. The central claim rests on empirical performance rather than any reduction to inputs by construction. This is a standard self-contained empirical contribution with independent external validation.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · reality_from_one_distinction · unclear
The relation between the paper passage and the cited Recognition theorem is unclear.
Paper passage: "temporal information is crucial for agnostic actions... frame-level observation... focus more on the local spatial information of ROI"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.