Exploring High-Order Self-Similarity for Video Understanding

Heeseung Kwon; Karteek Alahari; Manjin Kim; Minsu Cho

arxiv: 2604.20760 · v1 · submitted 2026-04-22 · 💻 cs.CV

Manjin Kim, Heeseung Kwon, Karteek Alahari, Minsu Cho

Pith reviewed 2026-05-10 00:55 UTC · model grok-4.3

classification 💻 cs.CV

keywords space-time self-similarity · multi-order self-similarity · video action recognition · temporal dynamics · motion modeling · lightweight neural module · video visual question answering

The pith

Integrating multi-order space-time self-similarities via a lightweight module improves motion modeling across video tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper explores how space-time self-similarity at different orders captures distinct aspects of temporal dynamics in videos. It introduces the Multi-Order Self-Similarity module to learn and combine these features in a neural network. This design targets better motion representation while keeping added computation and memory low. A reader would care because improved temporal modeling could make video systems more accurate for recognition, question answering, and control without heavy resource demands. Experiments across action recognition, video VQA, and robotic tasks show consistent gains.

Core claim

Space-time self-similarity at higher orders reveals distinct aspects of temporal dynamics. The Multi-Order Self-Similarity module is a lightweight neural component that learns and integrates multi-order STSS features to enhance motion modeling capabilities with only marginal computational cost and memory usage. Applied to diverse video tasks, it produces substantial improvements on action recognition, motion-centric video VQA, and real-world robotic tasks.
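To make the notion of order concrete, the sketch below computes a first-order STSS tensor as the cosine similarity between each position and its space-time neighbours, then reapplies the same transform to the flattened result to obtain higher orders. It assumes the encoding function g is plain vectorization over the (L, U, V) offsets, as in the paper's toy visualization. The function names, neighbourhood radii, and zero padding are illustrative assumptions, not the authors' implementation.

```python
import numpy as np


def stss(feats, l_radius=1, s_radius=1):
    """First-order STSS of a (T, H, W, C) feature map (illustrative sketch).

    Returns a (T, H, W, L, U, V) array of cosine similarities between each
    position and its neighbours at temporal offsets -l_radius..l_radius and
    spatial offsets -s_radius..s_radius. Neighbours outside the clip are
    zero vectors, so their similarity is 0 (an assumption of this sketch).
    """
    feats = np.asarray(feats, dtype=np.float64)
    T, H, W, _ = feats.shape
    norm = np.linalg.norm(feats, axis=-1, keepdims=True)
    unit = feats / np.maximum(norm, 1e-8)  # L2-normalise the channel axis
    L = 2 * l_radius + 1
    U = V = 2 * s_radius + 1
    pad = ((l_radius, l_radius), (s_radius, s_radius), (s_radius, s_radius), (0, 0))
    padded = np.pad(unit, pad)
    out = np.zeros((T, H, W, L, U, V))
    for l in range(L):
        for u in range(U):
            for v in range(V):
                # Neighbour at offset (l - l_radius, u - s_radius, v - s_radius).
                shifted = padded[l:l + T, u:u + H, v:v + W]
                out[:, :, :, l, u, v] = (unit * shifted).sum(axis=-1)
    return out


def high_order_stss(feats, order, **kwargs):
    """Return [1st-order, ..., order-th-order] STSS tensors.

    Each order is the STSS of the previous order's tensor after flattening
    (L, U, V) into a channel axis, i.e. g = vectorization (assumed).
    """
    maps, x = [], feats
    for _ in range(order):
        s = stss(x, **kwargs)
        maps.append(s)
        x = s.reshape(*s.shape[:3], -1)
    return maps
```

With the default radii each position is compared against 3 × 3 × 3 = 27 neighbours, so every order yields a tensor of the same shape regardless of the backbone's channel width.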

What carries the argument

The Multi-Order Self-Similarity (MOSS) module, a neural module that learns and integrates multi-order space-time self-similarity features for temporal dynamics.
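The plug-in idea can be illustrated with a minimal sketch in which precomputed STSS tensors of each order are linearly projected to the backbone's channel width and added back residually. This fusion rule, the function names, and the weight shapes are assumptions made for illustration only; the page does not describe how MOSS actually combines the orders.

```python
import numpy as np


def fuse_multi_order(feats, stss_maps, weights):
    """Residually add projected multi-order STSS maps (hypothetical fusion).

    feats:     (T, H, W, C) backbone features.
    stss_maps: list of (T, H, W, L, U, V) similarity tensors, one per order.
    weights:   list of (L*U*V, C) projection matrices, one per order.
    """
    out = np.array(feats, dtype=np.float64)  # copy so the input is untouched
    for s, w in zip(stss_maps, weights):
        flat = s.reshape(*s.shape[:3], -1)  # (T, H, W, L*U*V)
        out += flat @ w                     # project to C channels and add
    return out


def extra_parameters(weights):
    """Number of parameters this sketch adds on top of the backbone."""
    return sum(w.size for w in weights)
```

In this sketch, zero-initialized projections leave the backbone features unchanged at insertion, and the added parameter count grows with the neighbourhood size and channel width rather than with clip length or resolution.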

Load-bearing premise

Higher-order space-time self-similarities supply distinct and complementary information on temporal dynamics that a lightweight integration module can combine effectively without meaningful overhead or loss of accuracy.

What would settle it

Inserting the MOSS module into standard video models and measuring no accuracy gains on action recognition or VQA benchmarks together with increased runtime or memory usage would show the approach does not deliver substantial improvements at marginal cost.

Figures

Figures reproduced from arXiv: 2604.20760 by Heeseung Kwon, Karteek Alahari, Manjin Kim, Minsu Cho.

**Figure 2.** High-order STSS transformation & Multi-Order Self-Similarity

**Figure 3.** STSS map visualizations on a toy video clip. From top to bottom, we visualize RGB frames and 1st-, 2nd-, and 3rd-order STSS maps of the brown query by setting STSS encoding function g as vectorization over (L, U, V) dimensions. The STSS maps progressively capture different temporal dynamics: motion flow, motion segments, and overall motion layouts. Unlike the 2nd-order STSS that identifies individual…

**Figure 5.** STSS visualization. RGB frames at the top where two queries and their spatiotemporal matching regions are marked in red and green respectively. The subsequent rows show STSS maps for the two queries and L2-norm of feature maps across 1st-, 2nd-, and 3rd-order. Best viewed in PDF. Comparison to Other STSS Learning Methods. In Tab. 3d, we delve into the effectiveness of high-order STSS by comparing differen…

**Figure 6.** VideoLLaMA3 with MOSS. MOSS is integrated with the vision encoder and provides early motion cues for advanced temporal reasoning in LLM. […] fine-grained motion-level reasoning in Video MLLMs, comprising 1,776 videos and 8,184 multiple-choice QA pairs across 6 motion-related tasks. We use 15K samples from the publicly released training set, FAVOR-Train, for fine-tuning. MotionBench [24] is another recent be…

**Figure 7.** Proposed real-world robotic tasks (MoveSense & PongPredict) and…

**Figure 8.** Visualization of STSS tensors. (a) Input RGB frames, where two different queries and their spatio-temporal matching regions. (b) 1st- to 3rd-order STSS maps of the brown query. (c) 1st- to 3rd-order STSS maps of the yellow query. A. Illustration of High-Order STSS. We present a toy example with a simplified video clip to clarify the characteristics of high-order STSS in modeling temporal dynamics, as describ…

**Figure 9.** Real-robot platform. We specify the robot specifications and other environment settings. B. Implementation Details. In Tabs. 6 and 7, we provide detailed model configurations and training hyperparameters across different model scales and datasets. All models are trained using 8 NVIDIA RTX 6000 Ada GPUs. C. Experimental Setup for Real-World Robotic Tasks. In…

**Figure 10.** Effects of 2nd-order STSS on Something-Something V1.

**Figure 11.** STSS visualization. RGB frames at the top show query locations and their spatio-temporal matching regions marked in red and green, respectively. The subsequent rows show STSS maps for the two queries and the L2-norm of feature maps from 1st- to 3rd-order STSSs. Best viewed in PDF.

**Figure 12.** STSS visualization. RGB frames at the top show query locations and their spatio-temporal matching regions marked in red and green, respectively. The subsequent rows show STSS maps for the two queries and the L2-norm of feature maps from 1st- to 3rd-order STSSs. Best viewed in PDF.

**Figure 13.** STSS visualization. RGB frames at the top show query locations and their spatio-temporal matching regions marked in red and green, respectively. The subsequent rows show STSS maps for the two queries and the L2-norm of feature maps from 1st- to 3rd-order STSSs. Best viewed in PDF.

**Figure 14.** STSS visualization. RGB frames at the top show query locations and their spatio-temporal matching regions marked in red and green, respectively. The subsequent rows show STSS maps for the two queries and the L2-norm of feature maps from 1st- to 3rd-order STSSs. Best viewed in PDF.

**Figure 15.** STSS visualization. RGB frames at the top show query locations and their spatio-temporal matching regions marked in red and green, respectively. The subsequent rows show STSS maps for the two queries and the L2-norm of feature maps from 1st- to 3rd-order STSSs. Best viewed in PDF.

**Figure 16.** Example rollouts of real-world robot tasks

Original abstract

Space-time self-similarity (STSS), which captures visual correspondences across frames, provides an effective way to represent temporal dynamics for video understanding. In this work, we explore higher-order STSS and demonstrate how STSSs at different orders reveal distinct aspects of these dynamics. We then introduce the Multi-Order Self-Similarity (MOSS) module, a lightweight neural module designed to learn and integrate multi-order STSS features. It can be applied to diverse video tasks to enhance motion modeling capabilities while consuming only marginal computational cost and memory usage. Extensive experiments on video action recognition, motion-centric video VQA, and real-world robotic tasks consistently demonstrate substantial improvements, validating the broad applicability of MOSS as a general temporal modeling module. The source code and checkpoints will be publicly available.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper adds a lightweight MOSS module that integrates higher-order space-time self-similarities for video tasks and reports consistent but likely modest gains across action recognition, VQA, and robotics.

The main takeaway is that this work extends space-time self-similarity to multiple orders and packages the idea into a plug-in neural module called MOSS. The module is meant to capture complementary temporal cues without much extra compute or memory, and the authors test it on action recognition, motion-centric video QA, and real-world robotic tasks with reported improvements and plans to release code and checkpoints.

Referee Report

0 major / 3 minor

Summary. The paper explores higher-order space-time self-similarity (STSS) for representing temporal dynamics in videos, arguing that STSS at different orders capture distinct aspects of motion. It introduces the Multi-Order Self-Similarity (MOSS) module, a lightweight neural module to learn and integrate these multi-order features. The module is presented as a general-purpose component that can be inserted into video architectures to enhance motion modeling at marginal computational and memory cost. Claims are supported by experiments on action recognition, motion-centric video VQA, and real-world robotic tasks showing consistent improvements, with code and checkpoints to be released.

Significance. If the empirical results hold, MOSS offers a practical, efficient temporal modeling primitive with broad applicability across video tasks. Its lightweight design and plug-and-play nature could see adoption in existing pipelines, particularly if gains are reproducible across datasets and architectures. The planned public release of code and checkpoints strengthens the contribution by enabling verification and extension.

minor comments (3)

[Abstract] The phrases 'higher-order STSS' and 'multi-order STSS features' are used interchangeably without an explicit definition of the orders considered (e.g., first-order vs. second-order correspondences); a short clarifying sentence would aid readers.
[§4 or §5] The manuscript states that MOSS consumes 'only marginal computational cost and memory usage'; a table or paragraph giving exact FLOPs and parameter overhead relative to the backbone would make this claim more precise and verifiable.
[Experiments] While tables are referenced, every reported improvement should be accompanied by the corresponding baseline value, metric (e.g., top-1 accuracy, mAP), and dataset split to allow direct assessment of effect sizes.
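The overhead reporting requested above amounts to simple arithmetic, sketched here as a small helper that prints one table row per cost metric. The helper names are hypothetical and any numbers passed in are placeholders to be replaced with measured values; none are taken from the paper.

```python
def relative_overhead(baseline, with_module):
    """Percentage increase of `with_module` over `baseline`."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return 100.0 * (with_module - baseline) / baseline


def overhead_row(name, baseline, with_module, unit):
    """Format one row: metric, backbone cost, cost with module, relative change."""
    pct = relative_overhead(baseline, with_module)
    return f"{name:<10} {baseline:>8.1f} {with_module:>8.1f} {unit:<6} {pct:+.1f}%"


if __name__ == "__main__":
    # Placeholder values for illustration only, not results from the paper.
    print(overhead_row("FLOPs", 100.0, 103.0, "G"))
    print(overhead_row("Params", 50.0, 50.5, "M"))
```

Reporting the relative change next to the absolute values lets a reader judge "marginal" against the backbone's own cost rather than in isolation.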

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive assessment of our work and the recommendation for minor revision. We appreciate the recognition that MOSS provides a practical, efficient temporal modeling primitive with broad applicability, and we value the note on the planned public release of code and checkpoints.

Circularity Check

0 steps flagged

No significant circularity; MOSS module is an independent architectural contribution

The paper presents MOSS as a new lightweight neural module for learning and integrating multi-order space-time self-similarity features, with claims supported directly by its definition, integration details, and empirical results across video tasks. No derivation chain, equations, or predictions are shown that reduce by construction to fitted inputs or prior self-citations. The abstract and context describe an empirical validation approach without self-definitional loops, uniqueness theorems, or ansatz smuggling. This is a standard case of a self-contained neural architecture paper whose central claims rest on experimental tables rather than internal reductions.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entity

Only abstract available; no explicit free parameters, axioms, or invented entities detailed beyond the MOSS module itself as a new component.

invented entities (1)

Multi-Order Self-Similarity (MOSS) module · no independent evidence
purpose: Learn and integrate multi-order STSS features for video temporal modeling
New neural module introduced to combine higher-order self-similarity features

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

RLDX-1 Technical Report
cs.RO 2026-05 unverdicted novelty 4.0

RLDX-1 achieves 86.8% success on complex ALLEX humanoid manipulation tasks where prior VLAs reach only around 40%.
RLDX-1 Technical Report
cs.RO 2026-05 unverdicted novelty 4.0

RLDX-1 outperforms frontier VLAs such as π0.5 and GR00T N1.6 on dexterous manipulation benchmarks, reaching 86.8% success on ALLEX humanoid tasks versus around 40% for the baselines.

Reference graph

Works this paper leans on

100 extracted references · 100 canonical work pages · cited by 1 Pith paper · 14 internal anchors

[1] Arnab, A., Dehghani, M., Heigold, G., Sun, C., Lučić, M., Schmid, C.: Vivit: A video vision transformer. arXiv preprint arXiv:2103.15691 (2021)
[2] Assran, M., Bardes, A., Fan, D., Garrido, Q., Howes, R., Muckley, M., Rizvi, A., Roberts, C., Sinha, K., Zholus, A., et al.: V-jepa 2: Self-supervised video models enable understanding, prediction and planning. arXiv preprint arXiv:2506.09985 (2025)
[3] Bae, K., Ahn, G., Kim, Y., Choi, J.: Devias: Learning disentangled video representations of action and scene for holistic video understanding. arXiv preprint arXiv:2312.00826 (2023)
[4] Bardes, A., Garrido, Q., Ponce, J., Chen, X., Rabbat, M., LeCun, Y., Assran, M., Ballas, N.: Revisiting feature prediction for learning visual representations from video. arXiv preprint arXiv:2404.08471 (2024)
[5] Bertasius, G., Wang, H., Torresani, L.: Is space-time attention all you need for video understanding? arXiv preprint arXiv:2102.05095 (2021)
[6] Bian, Z., Jabri, A., Efros, A.A., Owens, A.: Learning pixel trajectories with multi-scale contrastive random walks. In: CVPR. pp. 6508–6519 (2022)
[7] Bjorck, J., Castañeda, F., Cherniadev, N., Da, X., Ding, R., Fan, L., Fang, Y., Fox, D., Hu, F., Huang, S., et al.: Gr00t n1: An open foundation model for generalist humanoid robots. arXiv preprint arXiv:2503.14734 (2025)
[8] Black, K., Brown, N., Driess, D., Esmail, A., Equi, M., Finn, C., Fusai, N., Groom, L., Hausman, K., Ichter, B., et al.: pi0: A vision-language-action flow model for general robot control. arXiv preprint arXiv:2410.24164 (2024)
[9] Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF international conference on computer vision. pp. 9650–9660 (2021)
[10] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: CVPR (2017)
[11] Chen, Z., Wang, W., Cao, Y., Liu, Y., Gao, Z., Cui, E., Zhu, J., Ye, S., Tian, H., Liu, Z., et al.: Expanding performance boundaries of open-source multimodal models with model, data, and test-time scaling. arXiv preprint arXiv:2412.05271 (2024)
[12] Cheng, F., Bertasius, G.: Tallformer: Temporal action localization with a long-memory transformer. In: European Conference on Computer Vision. pp. 503–521. Springer (2022)
[13] Choi, J., Gao, C., Messou, J.C., Huang, J.B.: Why can’t i dance in the mall? learning to mitigate scene bias in action recognition. NeurIPS 32 (2019)
[14] Chung, J., Wu, Y., Russakovsky, O.: Enabling detailed action recognition evaluation through video dataset augmentation. NeurIPS 35, 39020–39033 (2022)
[15] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020)
[16] Dosovitskiy, A., Fischer, P., Ilg, E., Hausser, P., Hazirbas, C., Golkov, V., Van Der Smagt, P., Cremers, D., Brox, T.: Flownet: Learning optical flow with convolutional networks. In: ICCV (2015)
[17] Fan, H., Xiong, B., Mangalam, K., Li, Y., Yan, Z., Malik, J., Feichtenhofer, C.: Multiscale vision transformers. arXiv preprint arXiv:2104.11227 (2021)
[18] Feichtenhofer, C.: X3d: Expanding architectures for efficient video recognition. In: CVPR (2020)
[19] Feichtenhofer, C., Fan, H., Malik, J., He, K.: Slowfast networks for video recognition. In: ICCV (2019)
[20] Fu, C., Dai, Y., Luo, Y., Li, L., Ren, S., Zhang, R., Wang, Z., Zhou, C., Shen, Y., Zhang, M., et al.: Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 24108–24118 (2025)
[21] Goyal, R., Kahou, S.E., Michalski, V., Materzynska, J., Westphal, S., Kim, H., Haenel, V., Fruend, I., Yianilos, P., Mueller-Freitag, M., et al.: The "something something" video database for learning and evaluating visual common sense. In: ICCV (2017)
[22] He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 16000–16009 (2022)
[23] Herzig, R., Ben-Avraham, E., Mangalam, K., Bar, A., Chechik, G., Rohrbach, A., Darrell, T., Globerson, A.: Object-region video transformers. In: CVPR. pp. 3148–3159 (2022)
[24] Hong, W., Cheng, Y., Yang, Z., Wang, W., Wang, L., Gu, X., Huang, S., Dong, Y., Tang, J.: Motionbench: Benchmarking and improving fine-grained video motion understanding for vision language models. In: CVPR. pp. 8450–8460 (2025)
[25] Hui, B., Yang, J., Cui, Z., Yang, J., Liu, D., Zhang, L., Liu, T., Zhang, J., Yu, B., Lu, K., et al.: Qwen2.5-coder technical report. arXiv preprint arXiv:2409.12186 (2024)
[26] Idrees, H., Zamir, A.R., Jiang, Y.G., Gorban, A., Laptev, I., Sukthankar, R., Shah, M.: The thumos challenge on action recognition for videos “in the wild”. Computer Vision and Image Understanding 155, 1–23 (2017)
[27] Intelligence, P., Black, K., Brown, N., Darpinian, J., Dhabalia, K., Driess, D., Esmail, A., Equi, M., Finn, C., Fusai, N., et al.: pi0.5: a vision-language-action model with open-world generalization. arXiv preprint arXiv:2504.16054 (2025)
[28] Jang, H., Yu, S., Kwon, H., Jeon, H., Seo, Y., Shin, J.: Contextvla: Vision-language-action model with amortized multi-frame context. arXiv preprint arXiv:2510.04246 (2025)
[29] Junejo, I.N., Dexter, E., Laptev, I., Perez, P.: View-independent action recognition from temporal self-similarities. IEEE TPAMI (2010)
[30] Junejo, I.N., Dexter, E., Laptev, I., Pérez, P.: Cross-view action recognition from temporal self-similarities. In: ECCV (2008)
[31] Kay, W., Carreira, J., Simonyan, K., Zhang, B., Hillier, C., Vijayanarasimhan, S., Viola, F., Green, T., Back, T., Natsev, P., et al.: The kinetics human action video dataset. arXiv preprint arXiv:1705.06950 (2017)
[32] Kim, M., Kwon, H., Wang, C., Kwak, S., Cho, M.: Relational self-attention: What’s missing in attention for video understanding. NeurIPS 34, 8046–8059 (2021)
[33] Kim, M., Seo, P.H., Schmid, C., Cho, M.: Learning correlation structures for vision transformers. In: CVPR. pp. 18941–18951 (2024)
[34] Kwon, H., Kim, M., Kwak, S., Cho, M.: Motionsqueeze: Neural motion feature learning for video understanding. arXiv preprint arXiv:2007.09933 (2020)
[35] Kwon, H., Kim, M., Kwak, S., Cho, M.: Learning self-similarity in space and time as generalized motion for action recognition. arXiv preprint arXiv:2102.07092 (2021)
[36] Leong, M.C., Zhang, H., Tan, H.L., Li, L., Lim, J.H.: Combined cnn transformer encoder for enhanced fine-grained human action recognition. arXiv preprint arXiv:2208.01897 (2022)
[37] Li, C., Wang, X., Hong, D., Wang, Y., Zhang, L., Luo, T., Wen, L.: Structured context transformer for generic event boundary detection. arXiv preprint arXiv:2206.02985 (2022)
[38] Li, K., Wang, Y., He, Y., Li, Y., Wang, Y., Liu, Y., Wang, Z., Xu, J., Chen, G., Luo, P., et al.: Mvbench: A comprehensive multi-modal video understanding benchmark. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 22195–22206 (2024)
[39] Li, K., Wang, Y., He, Y., Li, Y., Wang, Y., Wang, L., Qiao, Y.: Uniformerv2: Unlocking the potential of image vits for video understanding. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 1632–1643 (2023)
[40] Li, K., Wang, Y., Zhang, J., Gao, P., Song, G., Liu, Y., Li, H., Qiao, Y.: Uniformer: Unifying convolution and self-attention for visual recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence 45(10), 12581–12600 (2023)
[41] Li, X., Hsu, K., Gu, J., Pertsch, K., Mees, O., Walke, H.R., Fu, C., Lunawat, I., Sieh, I., Kirmani, S., et al.: Evaluating real-world robot manipulation policies in simulation. arXiv preprint arXiv:2405.05941 (2024)
[42] Li, Y., Wu, C.Y., Fan, H., Mangalam, K., Xiong, B., Malik, J., Feichtenhofer, C.: Mvitv2: Improved multiscale vision transformers for classification and detection. In: CVPR. pp. 4804–4814 (2022)
[43] Li, Y., Li, Y., Vasconcelos, N.: Resound: Towards action recognition without representation bias. In: ECCV (2018)
[44] Lin, B., Ye, Y., Zhu, B., Cui, J., Ning, M., Jin, P., Yuan, L.: Video-llava: Learning united visual representation by alignment before projection. In: Proceedings of the 2024 conference on empirical methods in natural language processing. pp. 5971–5984 (2024)
[45] Lin, J., Gan, C., Han, S.: Tsm: Temporal shift module for efficient video understanding. In: ICCV (2019)
[46] Lin, Z., Geng, S., Zhang, R., Gao, P., De Melo, G., Wang, X., Dai, J., Qiao, Y., Li, H.: Frozen clip models are efficient video learners. In: ECCV. pp. 388–404. Springer (2022)
[47] Liu, B., Zhu, Y., Gao, C., Feng, Y., Liu, Q., Zhu, Y., Stone, P.: Libero: Benchmarking knowledge transfer for lifelong robot learning. Advances in Neural Information Processing Systems 36, 44776–44791 (2023)
[48] Liu, H., Li, X., Li, P., Liu, M., Wang, D., Liu, J., Kang, B., Ma, X., Kong, T., Zhang, H.: Towards generalist robot policies: What matters in building vision-language-action models (2025)
[49] Liu, M., Li, B., Yu, Y.: Omniclip: Adapting clip for video recognition with spatial-temporal omni-scale feature learning. arXiv preprint arXiv:2408.06158 (2024)
[50] Liu, R., Li, C., Ge, Y., Li, T.H., Shan, Y., Li, G.: Bt-adapter: Video conversation is feasible without video instruction tuning. In: CVPR. pp. 13658–13667 (2024)
[51] Liu, S., Zhang, C.L., Zhao, C., Ghanem, B.: End-to-end temporal action detection with 1b parameters across 1000 frames. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 18591–18601 (2024)
[52] Liu, Z., Ning, J., Cao, Y., Wei, Y., Zhang, Z., Lin, S., Hu, H.: Video swin transformer. In: CVPR. pp. 3202–3211 (2022)
[53] Ma, Y., Xu, G., Sun, X., Yan, M., Zhang, J., Ji, R.: X-clip: End-to-end multi-grained contrastive learning for video-text retrieval. In: Proceedings of the 30th ACM International Conference on Multimedia. pp. 638–647 (2022)
[54] Mahdisoltani, F., Berger, G., Gharbieh, W., Fleet, D., Memisevic, R.: On the effectiveness of task granularity for transfer learning. arXiv preprint arXiv:1804.09235 (2018)
[55] Nasiriany, S., Maddukuri, A., Zhang, L., Parikh, A., Lo, A., Joshi, A., Mandlekar, A., Zhu, Y.: Robocasa: Large-scale simulation of everyday tasks for generalist robots. arXiv preprint arXiv:2406.02523 (2024)
[56] Ng, J.Y.H., Choi, J., Neumann, J., Davis, L.S.: Actionflownet: Learning motion representation for action recognition. In: Proc. Winter Conference on Applications of Computer Vision (WACV) (2018)
[57] Nie, M., Ding, D., Wang, C., Guo, Y., Han, J., Xu, H., Zhang, L.: Slowfocus: Enhancing fine-grained temporal understanding in video llm. Advances in Neural Information Processing Systems 37, 81808–81835 (2024)
[58] Pan, J., Lin, Z., Zhu, X., Shao, J., Li, H.: St-adapter: Parameter-efficient image-to-video transfer learning. NeurIPS 35, 26462–26477 (2022)
[59] Park, J., Lee, J., Sohn, K.: Dual-path adaptation from image to video transformers. In: CVPR. pp. 2203–2213 (2023)
[60] Qian, R., Ding, S., Lin, D.: Rethinking image-to-video adaptation: An object-centric perspective. In: ECCV. pp. 329–348. Springer (2025)
[61] Qing, Z., Zhang, S., Huang, Z., Zhang, Y., Gao, C., Zhao, D., Sang, N.: Disentangling spatial and temporal learning for efficient image-to-video transfer learning. In: ICCV. pp. 13934–13944 (2023)
[62] Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: ICML. pp. 8748–8763. PMLR (2021)
[63] Rasekh, A., Soula, E.B., Daliran, O., Gottschalk, S., Fayyaz, M.: Enhancing temporal understanding in video-llms through stacked temporal attention in vision encoders. arXiv preprint arXiv:2510.26027 (2025)
[64] Shao, D., Zhao, Y., Dai, B., Lin, D.: Finegym: A hierarchical video dataset for fine-grained action understanding. In: CVPR (2020)
[65] Shao, D., Zhao, Y., Dai, B., Lin, D.: Intra-and inter-action understanding via temporal action parsing. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 730–739 (2020)
[66] Shechtman, E., Irani, M.: Space-time behavior based correlation. In: CVPR. vol. 1, pp. 405–412. IEEE (2005)
[67] Shechtman, E., Irani, M.: Matching local self-similarities across images and videos. In: CVPR (2007)
[68] Shi, D., Cao, Q., Zhong, Y., An, S., Cheng, J., Zhu, H., Tao, D.: Temporal action localization with enhanced instant discriminability. arXiv preprint arXiv:2309.05590 (2023)
[69] Shou, M.Z., Lei, S.W., Wang, W., Ghadiyaram, D., Feiszli, M.: Generic event boundary detection: A benchmark for event segmentation. In: Proceedings of the IEEE/CVF international conference on computer vision. pp. 8075–8084 (2021)
[70] Simonyan, K., Zisserman, A.: Two-stream convolutional networks for action recognition in videos. In: NeurIPS (2014)
[71]

In: CVPR

Son, J.: Contrastive learning for space-time correspondence via self-cycle consistency. In: CVPR. pp. 14679–14688 (2022)

work page 2022
[72]

In: CVPR (2018)

Sun, D., Yang, X., Liu, M.Y., Kautz, J.: Pwc-net: Cnns for optical flow using pyramid, warping, and cost volume. In: CVPR (2018)

[73] Sun, Q., Fang, Y., Wu, L., Wang, X., Cao, Y.: Eva-clip: Improved training techniques for clip at scale. arXiv preprint arXiv:2303.15389 (2023)

[74] Sun, Q., Wang, J., Yu, Q., Cui, Y., Zhang, F., Zhang, X., Wang, X.: Eva-clip-18b: Scaling clip to 18 billion parameters. arXiv preprint arXiv:2402.04252 (2024)

[75] Tan, J., Wang, Y., Wu, G., Wang, L.: Temporal perceiver: A general architecture for arbitrary boundary detection. IEEE Transactions on Pattern Analysis and Machine Intelligence 45(10), 12506–12520 (2023)

[76] Tang, J., Liu, Z., Qian, C., Wu, W., Wang, L.: Progressive attention on multi-level dense difference maps for generic event boundary detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 3355–3364 (2022)

[77] Team, O.M., Ghosh, D., Walke, H., Pertsch, K., Black, K., Mees, O., Dasari, S., Hejna, J., Kreiman, T., Xu, C., et al.: Octo: An open-source generalist robot policy. arXiv preprint arXiv:2405.12213 (2024)

[78] Teed, Z., Deng, J.: Raft: Recurrent all-pairs field transforms for optical flow. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part II 16. pp. 402–419. Springer (2020)

[79] Tran, D., Bourdev, L., Fergus, R., Torresani, L., Paluri, M.: Learning spatiotemporal features with 3d convolutional networks. In: ICCV (2015)

[80] Tran, D., Wang, H., Torresani, L., Feiszli, M.: Video classification with channel-separated convolutional networks. In: ICCV (2019)



[1] Arnab, A., Dehghani, M., Heigold, G., Sun, C., Lučić, M., Schmid, C.: Vivit: A video vision transformer. arXiv preprint arXiv:2103.15691 (2021)

[2] Assran, M., Bardes, A., Fan, D., Garrido, Q., Howes, R., Muckley, M., Rizvi, A., Roberts, C., Sinha, K., Zholus, A., et al.: V-jepa 2: Self-supervised video models enable understanding, prediction and planning. arXiv preprint arXiv:2506.09985 (2025)

[3] Bae, K., Ahn, G., Kim, Y., Choi, J.: Devias: Learning disentangled video representations of action and scene for holistic video understanding. arXiv preprint arXiv:2312.00826 (2023)

[4] Bardes, A., Garrido, Q., Ponce, J., Chen, X., Rabbat, M., LeCun, Y., Assran, M., Ballas, N.: Revisiting feature prediction for learning visual representations from video. arXiv preprint arXiv:2404.08471 (2024)

[5] Bertasius, G., Wang, H., Torresani, L.: Is space-time attention all you need for video understanding? arXiv preprint arXiv:2102.05095 (2021)

[6] Bian, Z., Jabri, A., Efros, A.A., Owens, A.: Learning pixel trajectories with multi-scale contrastive random walks. In: CVPR. pp. 6508–6519 (2022)

[7] Bjorck, J., Castañeda, F., Cherniadev, N., Da, X., Ding, R., Fan, L., Fang, Y., Fox, D., Hu, F., Huang, S., et al.: Gr00t n1: An open foundation model for generalist humanoid robots. arXiv preprint arXiv:2503.14734 (2025)

[8] Black, K., Brown, N., Driess, D., Esmail, A., Equi, M., Finn, C., Fusai, N., Groom, L., Hausman, K., Ichter, B., et al.: pi0: A vision-language-action flow model for general robot control. arXiv preprint arXiv:2410.24164 (2024)

[9] Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF international conference on computer vision. pp. 9650–9660 (2021)

[10] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: CVPR (2017)

[11] Chen, Z., Wang, W., Cao, Y., Liu, Y., Gao, Z., Cui, E., Zhu, J., Ye, S., Tian, H., Liu, Z., et al.: Expanding performance boundaries of open-source multimodal models with model, data, and test-time scaling. arXiv preprint arXiv:2412.05271 (2024)

[12] Cheng, F., Bertasius, G.: Tallformer: Temporal action localization with a long-memory transformer. In: European Conference on Computer Vision. pp. 503–521. Springer (2022)

[13] Choi, J., Gao, C., Messou, J.C., Huang, J.B.: Why can’t i dance in the mall? learning to mitigate scene bias in action recognition. NeurIPS 32 (2019)

[14] Chung, J., Wu, Y., Russakovsky, O.: Enabling detailed action recognition evaluation through video dataset augmentation. NeurIPS 35, 39020–39033 (2022)

[15] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020)

[16] Dosovitskiy, A., Fischer, P., Ilg, E., Hausser, P., Hazirbas, C., Golkov, V., Van Der Smagt, P., Cremers, D., Brox, T.: Flownet: Learning optical flow with convolutional networks. In: ICCV (2015)

[17] Fan, H., Xiong, B., Mangalam, K., Li, Y., Yan, Z., Malik, J., Feichtenhofer, C.: Multiscale vision transformers. arXiv preprint arXiv:2104.11227 (2021)

[18] Feichtenhofer, C.: X3d: Expanding architectures for efficient video recognition. In: CVPR (2020)

[19] Feichtenhofer, C., Fan, H., Malik, J., He, K.: Slowfast networks for video recognition. In: ICCV (2019)

[20] Fu, C., Dai, Y., Luo, Y., Li, L., Ren, S., Zhang, R., Wang, Z., Zhou, C., Shen, Y., Zhang, M., et al.: Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 24108–24118 (2025)

[21] Goyal, R., Kahou, S.E., Michalski, V., Materzynska, J., Westphal, S., Kim, H., Haenel, V., Fruend, I., Yianilos, P., Mueller-Freitag, M., et al.: The "something something" video database for learning and evaluating visual common sense. In: ICCV (2017)

[22] He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 16000–16009 (2022)

[23] Herzig, R., Ben-Avraham, E., Mangalam, K., Bar, A., Chechik, G., Rohrbach, A., Darrell, T., Globerson, A.: Object-region video transformers. In: CVPR. pp. 3148–3159 (2022)

[24] Hong, W., Cheng, Y., Yang, Z., Wang, W., Wang, L., Gu, X., Huang, S., Dong, Y., Tang, J.: Motionbench: Benchmarking and improving fine-grained video motion understanding for vision language models. In: CVPR. pp. 8450–8460 (2025)

[25] Hui, B., Yang, J., Cui, Z., Yang, J., Liu, D., Zhang, L., Liu, T., Zhang, J., Yu, B., Lu, K., et al.: Qwen2.5-coder technical report. arXiv preprint arXiv:2409.12186 (2024)

[26] Idrees, H., Zamir, A.R., Jiang, Y.G., Gorban, A., Laptev, I., Sukthankar, R., Shah, M.: The thumos challenge on action recognition for videos “in the wild”. Computer Vision and Image Understanding 155, 1–23 (2017)

[27] Intelligence, P., Black, K., Brown, N., Darpinian, J., Dhabalia, K., Driess, D., Esmail, A., Equi, M., Finn, C., Fusai, N., et al.: pi0.5: a vision-language-action model with open-world generalization. arXiv preprint arXiv:2504.16054 (2025)

[28] Jang, H., Yu, S., Kwon, H., Jeon, H., Seo, Y., Shin, J.: Contextvla: Vision-language-action model with amortized multi-frame context. arXiv preprint arXiv:2510.04246 (2025)

[29] Junejo, I.N., Dexter, E., Laptev, I., Perez, P.: View-independent action recognition from temporal self-similarities. IEEE TPAMI (2010)

[30] Junejo, I.N., Dexter, E., Laptev, I., Pérez, P.: Cross-view action recognition from temporal self-similarities. In: ECCV (2008)

[31] Kay, W., Carreira, J., Simonyan, K., Zhang, B., Hillier, C., Vijayanarasimhan, S., Viola, F., Green, T., Back, T., Natsev, P., et al.: The kinetics human action video dataset. arXiv preprint arXiv:1705.06950 (2017)

[32] Kim, M., Kwon, H., Wang, C., Kwak, S., Cho, M.: Relational self-attention: What’s missing in attention for video understanding. NeurIPS 34, 8046–8059 (2021)

[33] Kim, M., Seo, P.H., Schmid, C., Cho, M.: Learning correlation structures for vision transformers. In: CVPR. pp. 18941–18951 (2024)

[34] Kwon, H., Kim, M., Kwak, S., Cho, M.: Motionsqueeze: Neural motion feature learning for video understanding. arXiv preprint arXiv:2007.09933 (2020)

[35] Kwon, H., Kim, M., Kwak, S., Cho, M.: Learning self-similarity in space and time as generalized motion for action recognition. arXiv preprint arXiv:2102.07092 (2021)

[36] Leong, M.C., Zhang, H., Tan, H.L., Li, L., Lim, J.H.: Combined cnn transformer encoder for enhanced fine-grained human action recognition. arXiv preprint arXiv:2208.01897 (2022)

[37] Li, C., Wang, X., Hong, D., Wang, Y., Zhang, L., Luo, T., Wen, L.: Structured context transformer for generic event boundary detection. arXiv preprint arXiv:2206.02985 (2022)

[38] Li, K., Wang, Y., He, Y., Li, Y., Wang, Y., Liu, Y., Wang, Z., Xu, J., Chen, G., Luo, P., et al.: Mvbench: A comprehensive multi-modal video understanding benchmark. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 22195–22206 (2024)

[39] Li, K., Wang, Y., He, Y., Li, Y., Wang, Y., Wang, L., Qiao, Y.: Uniformerv2: Unlocking the potential of image vits for video understanding. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 1632–1643 (2023)

[40] Li, K., Wang, Y., Zhang, J., Gao, P., Song, G., Liu, Y., Li, H., Qiao, Y.: Uniformer: Unifying convolution and self-attention for visual recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence 45(10), 12581–12600 (2023)

[41] Li, X., Hsu, K., Gu, J., Pertsch, K., Mees, O., Walke, H.R., Fu, C., Lunawat, I., Sieh, I., Kirmani, S., et al.: Evaluating real-world robot manipulation policies in simulation. arXiv preprint arXiv:2405.05941 (2024)

[42] Li, Y., Wu, C.Y., Fan, H., Mangalam, K., Xiong, B., Malik, J., Feichtenhofer, C.: Mvitv2: Improved multiscale vision transformers for classification and detection. In: CVPR. pp. 4804–4814 (2022)

[43] Li, Y., Li, Y., Vasconcelos, N.: Resound: Towards action recognition without representation bias. In: ECCV (2018)

[44] Lin, B., Ye, Y., Zhu, B., Cui, J., Ning, M., Jin, P., Yuan, L.: Video-llava: Learning united visual representation by alignment before projection. In: Proceedings of the 2024 conference on empirical methods in natural language processing. pp. 5971–5984 (2024)

[45] Lin, J., Gan, C., Han, S.: Tsm: Temporal shift module for efficient video understanding. In: ICCV (2019)

[46] Lin, Z., Geng, S., Zhang, R., Gao, P., De Melo, G., Wang, X., Dai, J., Qiao, Y., Li, H.: Frozen clip models are efficient video learners. In: ECCV. pp. 388–404. Springer (2022)

[47] Liu, B., Zhu, Y., Gao, C., Feng, Y., Liu, Q., Zhu, Y., Stone, P.: Libero: Benchmarking knowledge transfer for lifelong robot learning. Advances in Neural Information Processing Systems 36, 44776–44791 (2023)

[48] Liu, H., Li, X., Li, P., Liu, M., Wang, D., Liu, J., Kang, B., Ma, X., Kong, T., Zhang, H.: Towards generalist robot policies: What matters in building vision-language-action models (2025)

[49] Liu, M., Li, B., Yu, Y.: Omniclip: Adapting clip for video recognition with spatial-temporal omni-scale feature learning. arXiv preprint arXiv:2408.06158 (2024)

[50] Liu, R., Li, C., Ge, Y., Li, T.H., Shan, Y., Li, G.: Bt-adapter: Video conversation is feasible without video instruction tuning. In: CVPR. pp. 13658–13667 (2024)

[51] Liu, S., Zhang, C.L., Zhao, C., Ghanem, B.: End-to-end temporal action detection with 1b parameters across 1000 frames. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 18591–18601 (2024)

[52] Liu, Z., Ning, J., Cao, Y., Wei, Y., Zhang, Z., Lin, S., Hu, H.: Video swin transformer. In: CVPR. pp. 3202–3211 (2022)

[53] Ma, Y., Xu, G., Sun, X., Yan, M., Zhang, J., Ji, R.: X-clip: End-to-end multi-grained contrastive learning for video-text retrieval. In: Proceedings of the 30th ACM International Conference on Multimedia. pp. 638–647 (2022)

[54] Mahdisoltani, F., Berger, G., Gharbieh, W., Fleet, D., Memisevic, R.: On the effectiveness of task granularity for transfer learning. arXiv preprint arXiv:1804.09235 (2018)

[55] Nasiriany, S., Maddukuri, A., Zhang, L., Parikh, A., Lo, A., Joshi, A., Mandlekar, A., Zhu, Y.: Robocasa: Large-scale simulation of everyday tasks for generalist robots. arXiv preprint arXiv:2406.02523 (2024)

[56] Ng, J.Y.H., Choi, J., Neumann, J., Davis, L.S.: Actionflownet: Learning motion representation for action recognition. In: Proc. Winter Conference on Applications of Computer Vision (WACV) (2018)

[57] Nie, M., Ding, D., Wang, C., Guo, Y., Han, J., Xu, H., Zhang, L.: Slowfocus: Enhancing fine-grained temporal understanding in video llm. Advances in Neural Information Processing Systems 37, 81808–81835 (2024)

[58] Pan, J., Lin, Z., Zhu, X., Shao, J., Li, H.: St-adapter: Parameter-efficient image-to-video transfer learning. NeurIPS 35, 26462–26477 (2022)

[59] Park, J., Lee, J., Sohn, K.: Dual-path adaptation from image to video transformers. In: CVPR. pp. 2203–2213 (2023)

[60] Qian, R., Ding, S., Lin, D.: Rethinking image-to-video adaptation: An object-centric perspective. In: ECCV. pp. 329–348. Springer (2025)

[61] Qing, Z., Zhang, S., Huang, Z., Zhang, Y., Gao, C., Zhao, D., Sang, N.: Disentangling spatial and temporal learning for efficient image-to-video transfer learning. In: ICCV. pp. 13934–13944 (2023)

[62] Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: ICML. pp. 8748–8763. PMLR (2021)
