P-JEPA: Procedural Video Representation Learning via Joint Embedding Predictive Architecture

Benjamin Killeen; Christian Benz; Felix Tristram; Ghazal Ghazaei; Marcel Walch; Nassir Navab; Stefano Gasperini

arxiv: 2606.23256 · v1 · pith:23YL5AMYnew · submitted 2026-06-22 · 💻 cs.CV · cs.AI

P-JEPA: Procedural Video Representation Learning via Joint Embedding Predictive Architecture

Felix Tristram , Stefano Gasperini , Benjamin Killeen , Marcel Walch , Christian Benz , Nassir Navab , Ghazal Ghazaei This is my paper

Pith reviewed 2026-06-26 08:48 UTC · model grok-4.3

classification 💻 cs.CV cs.AI

keywords procedural videorepresentation learningjoint embedding predictive architecturelong-form video understandingaction classificationtemporal action segmentationvideo foundation models

0 comments

The pith

P-JEPA learns long procedural video representations by predicting pooled masked latent vectors in a dense frame-aligned action space.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces P-JEPA as a way to handle procedural videos that last over 30 minutes, where standard models fail because of self-attention costs. It reduces the task to predicting masked latent vectors that stay aligned to individual frames, so the model can track long-range steps in multi-step tasks without quadratic scaling. This supports better activity understanding and segmentation in embodied AI settings. The approach works with existing video encoders and runs in real time while using far fewer parameters than language-model methods.

Core claim

P-JEPA is a backbone-agnostic approach that learns long-duration video representations by reducing the problem to a dense, frame-aligned action space and predicting pooled masked latent vectors. This allows ingestion of videos over 30 minutes long for effective long-form understanding of procedural steps.

What carries the argument

The Procedural Joint Embedding Predictive Architecture (P-JEPA), which predicts pooled masked latent vectors to capture long-range dependencies without self-attention.

If this is right

It improves linear separability, streaming inference, and temporal action segmentation performance across EgoExo4D, EgoProceL, and Assembly101.
It reaches state-of-the-art results on EgoExo4D fine-grained action classification.
It uses an order of magnitude fewer parameters than LLM-based methods while running in real time.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The frame-aligned prediction strategy may generalize to other long-sequence modalities where quadratic attention becomes prohibitive.
Because the method is backbone-agnostic, it could be paired with future video encoders to extend real-time procedural assistance.
Real-time operation on long videos supports online systems that give step-by-step guidance during complex tasks.

Load-bearing premise

That reducing the problem to a dense frame-aligned action space and predicting pooled masked latent vectors is enough to capture long-range dependencies between visually similar but procedurally distinct actions.

What would settle it

Evaluation on a dataset of procedural videos exceeding 30 minutes where P-JEPA does not improve fine-grained action classification accuracy over attention-based baselines.

Figures

Figures reproduced from arXiv: 2606.23256 by Benjamin Killeen, Christian Benz, Felix Tristram, Ghazal Ghazaei, Marcel Walch, Nassir Navab, Stefano Gasperini.

**Figure 1.** Figure 1: P-JEPA. A take-level JEPA is trained on continuous streams of feature tokens. The student encoder sees randomly selected context tokens, the predictor fills masked target positions, and an EMA teacher provides latent targets for the complete valid stream. broader procedure? In many tasks, the same local visual evidence can correspond to different actions depending on what happened before. For example, reac… view at source ↗

**Figure 2.** Figure 2: Clip-Causal Attention. All tokens within a segment are allowed to attend to each other (black) and to past segments (green). Attention to future segments is blocked (red). Clip-Causal Attention. A key design choice is how much temporal context each representation is allowed to use. If attention were fully bidirectional, a token could rely on future segments and would no longer represent an online or stream… view at source ↗

**Figure 3.** Figure 3: During inference, only the P-JEPA encoder is used. Input tokens are unmasked and are [PITH_FULL_IMAGE:figures/full_fig_p006_3.png] view at source ↗

**Figure 4.** Figure 4: Assembly101 Qualitative. LTContext used with TSM and P-JEPA enhanced TSM features. LTContext can correctly segment detach sound module and detach cabin when using P-JEPA. 4.1.1 Ablations We ablate our clip-causal design in [PITH_FULL_IMAGE:figures/full_fig_p008_4.png] view at source ↗

**Figure 5.** Figure 5: t-SNE of original I3D and P-JEPA features for two visually similar classes from EgoProceL validation set. P-JEPA features form two separate clusters. Full take linear probing results below confirm improved linear separability, with I3D features failing on the remove the motherboard class. In [PITH_FULL_IMAGE:figures/full_fig_p009_5.png] view at source ↗

**Figure 6.** Figure 6: Average feature-space progress over relative video time. For each video, segment features [PITH_FULL_IMAGE:figures/full_fig_p015_6.png] view at source ↗

**Figure 7.** Figure 7: t-SNE of original and P-JEPA enhanced I3D features from the EgoProceL validation split, [PITH_FULL_IMAGE:figures/full_fig_p016_7.png] view at source ↗

**Figure 8.** Figure 8: Overview of Pooling frameworks. A.4.1 Probe Details For EgoProceL, all probe results use the official FACT I3D features, label mapping, and split1 train/test split. P-JEPA is trained and evaluated on those same 2048-d feature streams [PITH_FULL_IMAGE:figures/full_fig_p017_8.png] view at source ↗

read the original abstract

The increasing maturity of embodied AI platforms has driven a growing interest in procedural video representation learning to support intelligent assistance systems for complex, multi-step tasks. Leveraging large-scale latent predictive training, video foundation models capture video dynamics, enabling downstream tasks such as activity understanding, spatiotemporal localization, and predictive control. However, procedural videos include actions with long-range dependencies that these models do not support, due to the quadratic complexity of self-attention. Distinct actions, for example, may be visually similar despite appearing at different points in the procedure, such as turning the stove on versus off. Here, we propose a backbone-agnostic approach that learns long-duration video representations by reducing the problem to a dense, frame-aligned action space and predicting pooled masked latent vectors. This approach allows our Procedural Joint Embedding Predictive Architecture (P-JEPA) to ingest videos over 30 minutes long, enabling effective long-form understanding of procedural steps. We evaluate P-JEPA using features extracted with VJEPA2.1, TSM, and I3D over the EgoExo4D, EgoProceL, and Assembly101 datasets, finding that it consistently improves linear separability, streaming inference, and temporal action segmentation performance, achieving state-of-the-art results on EgoExo4D fine-grained action classification while using an order of magnitude fewer parameters than LLM-based methods and running in real time.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

P-JEPA adapts JEPA to long procedural videos via frame-aligned prediction but the abstract supplies almost no experimental detail to back the claims.

read the letter

The main point is that P-JEPA reduces long procedural video modeling to predicting pooled masked latent vectors in a dense frame-aligned action space. This is meant to capture long-range dependencies like visually similar actions at different stages of a task without paying the quadratic cost of self-attention, letting the model ingest videos over 30 minutes.

What is actually new is the specific framing for procedural videos: treating the problem as dense action-space prediction on top of existing backbones. They test with VJEPA2.1, TSM, and I3D features on EgoExo4D, EgoProceL, and Assembly101, and report gains in linear separability, streaming inference, and temporal action segmentation, plus SOTA on EgoExo4D fine-grained classification with far fewer parameters than LLM baselines and real-time speed. The backbone-agnostic design is a practical plus.

The soft spots are clear from the abstract alone. No baselines are named, no numbers or error bars appear, and there are no ablation results. The central sufficiency claim—that frame-aligned pooled prediction is enough to handle the long dependencies—remains untested in the provided text. Without those details it is hard to know whether the reported improvements are real or just from the choice of backbones.

This paper is for people working on video models for embodied AI and task assistance, where long procedural sequences matter. A reader who needs efficient long-video features might get something useful if the full experiments check out.

It deserves a serious referee because the problem is relevant and the method is a direct, testable extension of JEPA. The experimental gaps will need fixing, but the core idea is worth checking.

Referee Report

1 major / 0 minor

Summary. The manuscript proposes P-JEPA, a backbone-agnostic Joint Embedding Predictive Architecture for procedural video representation learning. It reduces the problem to predicting pooled masked latent vectors in a dense frame-aligned action space to handle long videos (>30 min) and long-range dependencies without self-attention. Evaluations on EgoExo4D, EgoProceL, and Assembly101 using VJEPA2.1, TSM, and I3D features claim consistent improvements in linear separability, streaming inference, and temporal action segmentation, with SOTA on EgoExo4D fine-grained action classification using fewer parameters and real-time performance.

Significance. If the empirical claims hold, the work could offer a scalable and efficient method for long-form procedural video understanding, addressing limitations of attention-based models in embodied AI applications. The backbone-agnostic nature and focus on procedural steps are notable strengths.

major comments (1)

[Abstract] Abstract: The central claims of consistent improvements, SOTA results on EgoExo4D fine-grained action classification, order-of-magnitude parameter reduction versus LLM methods, and real-time performance are stated without any quantitative metrics, baselines, error bars, ablation studies, or dataset-specific numbers. This absence prevents verification of the key sufficiency claim that the pooled-masked-latent approach captures long-range procedural dependencies.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for their review and the opportunity to clarify the presentation of our results. We address the major comment below.

read point-by-point responses

Referee: [Abstract] Abstract: The central claims of consistent improvements, SOTA results on EgoExo4D fine-grained action classification, order-of-magnitude parameter reduction versus LLM methods, and real-time performance are stated without any quantitative metrics, baselines, error bars, ablation studies, or dataset-specific numbers. This absence prevents verification of the key sufficiency claim that the pooled-masked-latent approach captures long-range procedural dependencies.

Authors: We agree that the abstract would be strengthened by including key quantitative results to support the stated claims. The body of the manuscript reports these details, including specific accuracy gains on EgoExo4D, parameter comparisons (order-of-magnitude reduction relative to LLM baselines), real-time inference speeds, and ablations with error bars across EgoExo4D, EgoProceL, and Assembly101. To directly address the concern, we will revise the abstract to incorporate representative numerical results, mention of the baselines used, and a brief reference to the ablation studies. The experiments section further substantiates the long-range dependency claim by evaluating on videos exceeding 30 minutes with the pooled masked latent prediction mechanism. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation is self-contained

full rationale

The abstract and description present P-JEPA as an independent backbone-agnostic architecture that reduces procedural video modeling to predicting pooled masked latent vectors in a dense frame-aligned action space. No equations, self-definitions, fitted inputs renamed as predictions, or load-bearing self-citations are exhibited in the provided material. The central claim of handling long videos without quadratic self-attention is framed as a direct architectural choice rather than derived from prior fitted results or author-specific uniqueness theorems. This matches the reader's 0.0 assessment and qualifies as a normal non-finding of circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract introduces no free parameters, axioms, or invented entities.

pith-pipeline@v0.9.1-grok · 5799 in / 1141 out tokens · 30490 ms · 2026-06-26T08:48:14.988028+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

54 extracted references · 1 canonical work pages

[1]

Vivit: A video vision transformer

Anurag Arnab, Mostafa Dehghani, Georg Heigold, Chen Sun, Mario Luˇci´c, and Cordelia Schmid. Vivit: A video vision transformer. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 6836–6846, October 2021

2021
[2]

Hiervl: Learning hierarchical video-language embeddings

Kumar Ashutosh, Rohit Girdhar, Lorenzo Torresani, and Kristen Grauman. Hiervl: Learning hierarchical video-language embeddings. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 23066–23078, 2023

2023
[3]

V-jepa 2: Self-supervised video models enable understanding, prediction and planning.arXiv preprint arXiv:2506.09985, 2025

Mido Assran, Adrien Bardes, David Fan, Quentin Garrido, Russell Howes, Matthew Muckley, Ammar Rizvi, Claire Roberts, Koustuv Sinha, Artem Zholus, et al. V-jepa 2: Self-supervised video models enable understanding, prediction and planning.arXiv preprint arXiv:2506.09985, 2025

Pith/arXiv arXiv 2025
[4]

How much temporal long-term context is needed for action segmentation? InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 10351–10361, 2023

Emad Bahrami, Gianpiero Francesca, and Juergen Gall. How much temporal long-term context is needed for action segmentation? InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 10351–10361, 2023

2023
[5]

My view is the best view: Procedure learning from egocentric videos

Siddhant Bansal, Chetan Arora, and CV Jawahar. My view is the best view: Procedure learning from egocentric videos. InEuropean Conference on Computer Vision, pages 657–675. Springer, 2022

2022
[6]

United we stand, divided we fall: Unitygraph for unsupervised procedure learning from videos

Siddhant Bansal, Chetan Arora, and CV Jawahar. United we stand, divided we fall: Unitygraph for unsupervised procedure learning from videos. InProceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 6509–6519, 2024

2024
[7]

Revisiting feature prediction for learning visual representations from video

Adrien Bardes, Quentin Garrido, Jean Ponce, Xinlei Chen, Michael Rabbat, Yann LeCun, Mahmoud Assran, and Nicolas Ballas. Revisiting feature prediction for learning visual representations from video. arXiv preprint arXiv:2404.08471, 2024

Pith/arXiv arXiv 2024
[8]

Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation

Nadine Behrmann, S Alireza Golestaneh, Zico Kolter, Juergen Gall, and Mehdi Noroozi. Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. InEuropean conference on computer vision, pages 52–68. Springer, 2022

2022
[9]

Gedas Bertasius, Heng Wang, and Lorenzo Torresani. Is space-time attention all you need for video understanding? InProceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pages 813–824, 2021

2021
[10]

Quo vadis, action recognition? a new model and the kinetics dataset

Joao Carreira and Andrew Zisserman. Quo vadis, action recognition? a new model and the kinetics dataset. Inproceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6299–6308, 2017

2017
[11]

Streaming videollms for real-time procedural video understanding

Dibyadip Chatterjee, Edoardo Remelli, Yale Song, Bugra Tekin, Abhay Mittal, Bharat Bhatnagar, Necati Ci- han Camgoz, Shreyas Hampali, Eric Sauser, Shugao Ma, et al. Streaming videollms for real-time procedural video understanding. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 22586–22598, 2025. 10

2025
[12]

Vl-jepa: Joint embedding predictive architecture for vision-language.arXiv preprint arXiv:2512.10942, 2025

Delong Chen, Mustafa Shukor, Theo Moutakanni, Willy Chung, Jade Yu, Tejaswi Kasarla, Yejin Bang, Allen Bolourchi, Yann LeCun, and Pascale Fung. Vl-jepa: Joint embedding predictive architecture for vision-language.arXiv preprint arXiv:2512.10942, 2025

arXiv 2025
[13]

Videollm-online: Online video large language model for streaming video

Joya Chen, Zhaoyang Lv, Shiwei Wu, Kevin Qinghong Lin, Chenan Song, Difei Gao, Jia-Wei Liu, Ziteng Gao, Dongxing Mao, and Mike Zheng Shou. Videollm-online: Online video large language model for streaming video. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18407–18418, 2024

2024
[14]

Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks

Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, et al. Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 24185–24198, 2024

2024
[15]

Unsupervised procedure learning via joint dynamic summarization

Ehsan Elhamifar and Zwe Naing. Unsupervised procedure learning via joint dynamic summarization. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 6341–6350, 2019

2019
[16]

Multiscale vision transformers

Haoqi Fan, Bo Xiong, Karttikeya Mangalam, Yanghao Li, Zhicheng Yan, Jitendra Malik, and Christoph Feichtenhofer. Multiscale vision transformers. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 6824–6835, October 2021

2021
[17]

Ms-tcn: Multi-stage temporal convolutional network for action segmentation

Yazan Abu Farha and Jurgen Gall. Ms-tcn: Multi-stage temporal convolutional network for action segmentation. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 3575–3584, 2019

2019
[18]

Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis

Chaoyou Fu, Yuhan Dai, Yongdong Luo, Lei Li, Shuhuai Ren, Renrui Zhang, Zihan Wang, Chenyu Zhou, Yunhang Shen, Mengdan Zhang, et al. Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 24108–24118, 2025

2025
[19]

Ego-exo4d: Understanding skilled human activity from first-and third-person perspectives

Kristen Grauman, Andrew Westbury, Lorenzo Torresani, Kris Kitani, Jitendra Malik, Triantafyllos Afouras, Kumar Ashutosh, Vijay Baiyya, Siddhant Bansal, Bikram Boote, et al. Ego-exo4d: Understanding skilled human activity from first-and third-person perspectives. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 193...

2024
[20]

Ma-lmm: Memory-augmented large multimodal model for long-term video understanding

Bo He, Hengduo Li, Young Kyun Jang, Menglin Jia, Xuefei Cao, Ashish Shah, Abhinav Shrivastava, and Ser-Nam Lim. Ma-lmm: Memory-augmented large multimodal model for long-term video understanding. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13504– 13514, 2024

2024
[21]

What changed and what could have changed? state-change counterfactuals for procedure-aware video representa- tion learning.arXiv preprint arXiv:2503.21055, 2025

Chi-Hsi Kung, Frangil Ramirez, Juhyung Ha, Yi-Ting Chen, David Crandall, and Yi-Hsuan Tsai. What changed and what could have changed? state-change counterfactuals for procedure-aware video representa- tion learning.arXiv preprint arXiv:2503.21055, 2025

arXiv 2025
[22]

Video token merging for long video understanding.Advances in Neural Information Processing Systems, 37:13851–13871, 2024

Seon-Ho Lee, Jue Wang, Zhikang Zhang, David Fan, and Xinyu Li. Video token merging for long video understanding.Advances in Neural Information Processing Systems, 37:13851–13871, 2024

2024
[23]

Temporal reasoning transfer from text to video.arXiv preprint arXiv:2410.06166, 2024

Lei Li, Yuanxin Liu, Linli Yao, Peiyuan Zhang, Chenxin An, Lean Wang, Xu Sun, Lingpeng Kong, and Qi Liu. Temporal reasoning transfer from text to video.arXiv preprint arXiv:2410.06166, 2024

arXiv 2024
[24]

Ms-tcn++: Multi-stage temporal convolutional network for action segmentation.IEEE Transactions on Pattern Analysis and Machine Intelligence, pages 1–1, 2020

Shi-Jie Li, Yazan AbuFarha, Yun Liu, Ming-Ming Cheng, and Juergen Gall. Ms-tcn++: Multi-stage temporal convolutional network for action segmentation.IEEE Transactions on Pattern Analysis and Machine Intelligence, pages 1–1, 2020. doi: 10.1109/TPAMI.2020.3021756

work page doi:10.1109/tpami.2020.3021756 2020
[25]

Tsm: Temporal shift module for efficient video understanding

Ji Lin, Chuang Gan, and Song Han. Tsm: Temporal shift module for efficient video understanding. In Proceedings of the IEEE/CVF international conference on computer vision, pages 7083–7093, 2019

2019
[26]

Tempcompass: Do video llms really understand videos?arXiv preprint arXiv:2403.00476, 2024

Yuanxin Liu, Shicheng Li, Yi Liu, Yuxiang Wang, Shuhuai Ren, Lei Li, Sishuo Chen, Xu Sun, and Lu Hou. Tempcompass: Do video llms really understand videos?arXiv preprint arXiv:2403.00476, 2024

Pith/arXiv arXiv 2024
[27]

Video swin transformer

Ze Liu, Jia Ning, Yue Cao, Yixuan Wei, Zheng Zhang, Stephen Lin, and Han Hu. Video swin transformer. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 3202–3211, June 2022

2022
[28]

Fact: Frame-action cross-attention temporal modeling for efficient action segmentation

Zijia Lu and Ehsan Elhamifar. Fact: Frame-action cross-attention temporal modeling for efficient action segmentation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18175–18185, 2024. 11

2024
[29]

Streamer: Streaming representation learning and event segmentation in a hierarchical manner.Advances in Neural Information Processing Systems, 36: 45694–45715, 2023

Ramy Mounir, Sujal Vijayaraghavan, and Sudeep Sarkar. Streamer: Streaming representation learning and event segmentation in a hierarchical manner.Advances in Neural Information Processing Systems, 36: 45694–45715, 2023

2023
[30]

V-jepa 2.1: Unlocking dense features in video self-supervised learning.arXiv preprint arXiv:2603.14482, 2026

Lorenzo Mur-Labadia, Matthew Muckley, Amir Bar, Mido Assran, Koustuv Sinha, Mike Rabbat, Yann LeCun, Nicolas Ballas, and Adrien Bardes. V-jepa 2.1: Unlocking dense features in video self-supervised learning.arXiv preprint arXiv:2603.14482, 2026

Pith/arXiv arXiv 2026
[31]

Video transformer network

Daniel Neimark, Omri Bar, Maya Zohar, and Dotan Asselmann. Video transformer network. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) Workshops, pages 3163–3172, October 2021

2021
[32]

Representation learning with contrastive predictive coding.arXiv preprint arXiv:1807.03748, 2018

Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding.arXiv preprint arXiv:1807.03748, 2018

Pith/arXiv arXiv 2018
[33]

Hiero: understanding the hierarchy of human behavior enhances reasoning on egocentric videos

Simone Alberto Peirone, Francesca Pistilli, and Giuseppe Averta. Hiero: understanding the hierarchy of human behavior enhances reasoning on egocentric videos. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 19862–19871, 2025

2025
[34]

Omnia de egotempo: Benchmarking temporal understanding of multi-modal llms in egocentric videos

Chiara Plizzari, Alessio Tonioni, Yongqin Xian, Achin Kulshrestha, and Federico Tombari. Omnia de egotempo: Benchmarking temporal understanding of multi-modal llms in egocentric videos. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 24129–24138, 2025

2025
[35]

Egovlpv2: Egocentric video-language pre-training with fusion in the backbone

Shraman Pramanick, Yale Song, Sayan Nag, Kevin Qinghong Lin, Hardik Shah, Mike Zheng Shou, Rama Chellappa, and Pengchuan Zhang. Egovlpv2: Egocentric video-language pre-training with fusion in the backbone. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 5285–5297, 2023

2023
[36]

Learning from untrimmed videos: Self-supervised video representation learning with hierarchical consistency

Zhiwu Qing, Shiwei Zhang, Ziyuan Huang, Yi Xu, Xiang Wang, Mingqian Tang, Changxin Gao, Rong Jin, and Nong Sang. Learning from untrimmed videos: Self-supervised video representation learning with hierarchical consistency. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13821–13831, 2022

2022
[37]

Understanding long videos with multimodal language models.arXiv preprint arXiv:2403.16998, 2024

Kanchana Ranasinghe, Xiang Li, Kumara Kahatapitiya, and Michael S Ryoo. Understanding long videos with multimodal language models.arXiv preprint arXiv:2403.16998, 2024

arXiv 2024
[38]

Timechat: A time-sensitive multimodal large language model for long video understanding

Shuhuai Ren, Linli Yao, Shicheng Li, Xu Sun, and Lu Hou. Timechat: A time-sensitive multimodal large language model for long video understanding. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14313–14323, 2024

2024
[39]

Assembly101: A large-scale multi-view video dataset for understanding procedural activities

Fadime Sener, Dibyadip Chatterjee, Daniel Shelepov, Kun He, Dipika Singhania, Robert Wang, and Angela Yao. Assembly101: A large-scale multi-view video dataset for understanding procedural activities. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21096– 21106, 2022

2022
[40]

Video-xl: Extra-long vision language model for hour-scale video understanding

Yan Shu, Zheng Liu, Peitian Zhang, Minghao Qin, Junjie Zhou, Zhengyang Liang, Tiejun Huang, and Bo Zhao. Video-xl: Extra-long vision language model for hour-scale video understanding. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 26160–26169, 2025

2025
[41]

C2f-tcn: A framework for semi-and fully-supervised temporal action segmentation.IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(10): 11484–11501, 2023

Dipika Singhania, Rahul Rahaman, and Angela Yao. C2f-tcn: A framework for semi-and fully-supervised temporal action segmentation.IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(10): 11484–11501, 2023

2023
[42]

Moviechat: From dense token to sparse memory for long video understanding

Enxin Song, Wenhao Chai, Guanhong Wang, Yucheng Zhang, Haoyang Zhou, Feiyang Wu, Haozhe Chi, Xun Guo, Tian Ye, Yanting Zhang, et al. Moviechat: From dense token to sparse memory for long video understanding. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18221–18232, 2024

2024
[43]

Moviechat+: Question- aware sparse memory for long video question answering.IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025

Enxin Song, Wenhao Chai, Tian Ye, Jenq-Neng Hwang, Xi Li, and Gaoang Wang. Moviechat+: Question- aware sparse memory for long video question answering.IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025

2025
[44]

Roformer: Enhanced transformer with rotary position embedding.Neurocomputing, 568:127063, 2024

Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. Roformer: Enhanced transformer with rotary position embedding.Neurocomputing, 568:127063, 2024

2024
[45]

Koala: Key frame-conditioned long video-llm

Reuben Tan, Ximeng Sun, Ping Hu, Jui-hsien Wang, Hanieh Deilamsalehy, Bryan A Plummer, Bryan Russell, and Kate Saenko. Koala: Key frame-conditioned long video-llm. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13581–13591, 2024. 12

2024
[46]

Videomae: Masked autoencoders are data-efficient learners for self-supervised video pre-training.Advances in neural information processing systems, 35: 10078–10093, 2022

Zhan Tong, Yibing Song, Jue Wang, and Limin Wang. Videomae: Masked autoencoders are data-efficient learners for self-supervised video pre-training.Advances in neural information processing systems, 35: 10078–10093, 2022

2022
[47]

Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution.arXiv preprint arXiv:2409.12191, 2024

Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, et al. Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution.arXiv preprint arXiv:2409.12191, 2024

Pith/arXiv arXiv 2024
[48]

Longvideobench: A benchmark for long-context interleaved video-language understanding.Advances in Neural Information Processing Systems, 37: 28828–28857, 2024

Haoning Wu, Dongxu Li, Bei Chen, and Junnan Li. Longvideobench: A benchmark for long-context interleaved video-language understanding.Advances in Neural Information Processing Systems, 37: 28828–28857, 2024

2024
[49]

Videollm-mod: Efficient video-language streaming with mixture- of-depths vision computation.Advances in Neural Information Processing Systems, 37:109922–109947, 2024

Shiwei Wu, Joya Chen, Kevin Qinghong Lin, Qimeng Wang, Yan Gao, Qianli Xu, Tong Xu, Yao Hu, Enhong Chen, and Mike Zheng Shou. Videollm-mod: Efficient video-language streaming with mixture- of-depths vision computation.Advances in Neural Information Processing Systems, 37:109922–109947, 2024

2024
[50]

Hierarchical self-supervised representation learning for movie understanding

Fanyi Xiao, Kaustav Kundu, Joseph Tighe, and Davide Modolo. Hierarchical self-supervised representation learning for movie understanding. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 9727–9736, 2022

2022
[51]

Asformer: Transformer for action segmentation.arXiv preprint arXiv:2110.08568, 2021

Fangqiu Yi, Hongyu Wen, and Tingting Jiang. Asformer: Transformer for action segmentation.arXiv preprint arXiv:2110.08568, 2021

arXiv 2021
[52]

Learning procedure-aware video representation from instructional videos and their narrations

Yiwu Zhong, Licheng Yu, Yang Bai, Shangwen Li, Xueting Yan, and Yin Li. Learning procedure-aware video representation from instructional videos and their narrations. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14825–14835, 2023

2023
[53]

Procedure-aware pretraining for instructional video understanding

Honglu Zhou, Roberto Martín-Martín, Mubbasir Kapadia, Silvio Savarese, and Juan Carlos Niebles. Procedure-aware pretraining for instructional video understanding. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10727–10738, 2023. 13 A Supplementary Material In this supplementary material we first discuss the broa...

2023
[54]

Then we perform a similar analysis in the reduced dimensionality space of the t-SNE as shown in Figure 7, where the path metrics were computed in this reduced space (Table 7). A.1 Broader Impact Embodied AI and personalized assistants have a large upside, promising to take over menial tasks and increase productivity, making life easier for the majority of...

arXiv 2048

[1] [1]

Vivit: A video vision transformer

Anurag Arnab, Mostafa Dehghani, Georg Heigold, Chen Sun, Mario Luˇci´c, and Cordelia Schmid. Vivit: A video vision transformer. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 6836–6846, October 2021

2021

[2] [2]

Hiervl: Learning hierarchical video-language embeddings

Kumar Ashutosh, Rohit Girdhar, Lorenzo Torresani, and Kristen Grauman. Hiervl: Learning hierarchical video-language embeddings. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 23066–23078, 2023

2023

[3] [3]

V-jepa 2: Self-supervised video models enable understanding, prediction and planning.arXiv preprint arXiv:2506.09985, 2025

Mido Assran, Adrien Bardes, David Fan, Quentin Garrido, Russell Howes, Matthew Muckley, Ammar Rizvi, Claire Roberts, Koustuv Sinha, Artem Zholus, et al. V-jepa 2: Self-supervised video models enable understanding, prediction and planning.arXiv preprint arXiv:2506.09985, 2025

Pith/arXiv arXiv 2025

[4] [4]

How much temporal long-term context is needed for action segmentation? InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 10351–10361, 2023

Emad Bahrami, Gianpiero Francesca, and Juergen Gall. How much temporal long-term context is needed for action segmentation? InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 10351–10361, 2023

2023

[5] [5]

My view is the best view: Procedure learning from egocentric videos

Siddhant Bansal, Chetan Arora, and CV Jawahar. My view is the best view: Procedure learning from egocentric videos. InEuropean Conference on Computer Vision, pages 657–675. Springer, 2022

2022

[6] [6]

United we stand, divided we fall: Unitygraph for unsupervised procedure learning from videos

Siddhant Bansal, Chetan Arora, and CV Jawahar. United we stand, divided we fall: Unitygraph for unsupervised procedure learning from videos. InProceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 6509–6519, 2024

2024

[7] [7]

Revisiting feature prediction for learning visual representations from video

Adrien Bardes, Quentin Garrido, Jean Ponce, Xinlei Chen, Michael Rabbat, Yann LeCun, Mahmoud Assran, and Nicolas Ballas. Revisiting feature prediction for learning visual representations from video. arXiv preprint arXiv:2404.08471, 2024

Pith/arXiv arXiv 2024

[8] [8]

Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation

Nadine Behrmann, S Alireza Golestaneh, Zico Kolter, Juergen Gall, and Mehdi Noroozi. Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation. InEuropean conference on computer vision, pages 52–68. Springer, 2022

2022

[9] [9]

Gedas Bertasius, Heng Wang, and Lorenzo Torresani. Is space-time attention all you need for video understanding? InProceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pages 813–824, 2021

2021

[10] [10]

Quo vadis, action recognition? a new model and the kinetics dataset

Joao Carreira and Andrew Zisserman. Quo vadis, action recognition? a new model and the kinetics dataset. Inproceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6299–6308, 2017

2017

[11] [11]

Streaming videollms for real-time procedural video understanding

Dibyadip Chatterjee, Edoardo Remelli, Yale Song, Bugra Tekin, Abhay Mittal, Bharat Bhatnagar, Necati Ci- han Camgoz, Shreyas Hampali, Eric Sauser, Shugao Ma, et al. Streaming videollms for real-time procedural video understanding. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 22586–22598, 2025. 10

2025

[12] [12]

Vl-jepa: Joint embedding predictive architecture for vision-language.arXiv preprint arXiv:2512.10942, 2025

Delong Chen, Mustafa Shukor, Theo Moutakanni, Willy Chung, Jade Yu, Tejaswi Kasarla, Yejin Bang, Allen Bolourchi, Yann LeCun, and Pascale Fung. Vl-jepa: Joint embedding predictive architecture for vision-language.arXiv preprint arXiv:2512.10942, 2025

arXiv 2025

[13] [13]

Videollm-online: Online video large language model for streaming video

Joya Chen, Zhaoyang Lv, Shiwei Wu, Kevin Qinghong Lin, Chenan Song, Difei Gao, Jia-Wei Liu, Ziteng Gao, Dongxing Mao, and Mike Zheng Shou. Videollm-online: Online video large language model for streaming video. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18407–18418, 2024

2024

[14] [14]

Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks

Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, et al. Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 24185–24198, 2024

2024

[15] [15]

Unsupervised procedure learning via joint dynamic summarization

Ehsan Elhamifar and Zwe Naing. Unsupervised procedure learning via joint dynamic summarization. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 6341–6350, 2019

2019

[16] [16]

Multiscale vision transformers

Haoqi Fan, Bo Xiong, Karttikeya Mangalam, Yanghao Li, Zhicheng Yan, Jitendra Malik, and Christoph Feichtenhofer. Multiscale vision transformers. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 6824–6835, October 2021

2021

[17] [17]

Ms-tcn: Multi-stage temporal convolutional network for action segmentation

Yazan Abu Farha and Jurgen Gall. Ms-tcn: Multi-stage temporal convolutional network for action segmentation. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 3575–3584, 2019

2019

[18] [18]

Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis

Chaoyou Fu, Yuhan Dai, Yongdong Luo, Lei Li, Shuhuai Ren, Renrui Zhang, Zihan Wang, Chenyu Zhou, Yunhang Shen, Mengdan Zhang, et al. Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 24108–24118, 2025

2025

[19] [19]

Ego-exo4d: Understanding skilled human activity from first-and third-person perspectives

Kristen Grauman, Andrew Westbury, Lorenzo Torresani, Kris Kitani, Jitendra Malik, Triantafyllos Afouras, Kumar Ashutosh, Vijay Baiyya, Siddhant Bansal, Bikram Boote, et al. Ego-exo4d: Understanding skilled human activity from first-and third-person perspectives. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 193...

2024

[20] [20]

Ma-lmm: Memory-augmented large multimodal model for long-term video understanding

Bo He, Hengduo Li, Young Kyun Jang, Menglin Jia, Xuefei Cao, Ashish Shah, Abhinav Shrivastava, and Ser-Nam Lim. Ma-lmm: Memory-augmented large multimodal model for long-term video understanding. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13504– 13514, 2024

2024

[21] [21]

What changed and what could have changed? state-change counterfactuals for procedure-aware video representa- tion learning.arXiv preprint arXiv:2503.21055, 2025

Chi-Hsi Kung, Frangil Ramirez, Juhyung Ha, Yi-Ting Chen, David Crandall, and Yi-Hsuan Tsai. What changed and what could have changed? state-change counterfactuals for procedure-aware video representa- tion learning.arXiv preprint arXiv:2503.21055, 2025

arXiv 2025

[22] [22]

Video token merging for long video understanding.Advances in Neural Information Processing Systems, 37:13851–13871, 2024

Seon-Ho Lee, Jue Wang, Zhikang Zhang, David Fan, and Xinyu Li. Video token merging for long video understanding.Advances in Neural Information Processing Systems, 37:13851–13871, 2024

2024

[23] [23]

Temporal reasoning transfer from text to video.arXiv preprint arXiv:2410.06166, 2024

Lei Li, Yuanxin Liu, Linli Yao, Peiyuan Zhang, Chenxin An, Lean Wang, Xu Sun, Lingpeng Kong, and Qi Liu. Temporal reasoning transfer from text to video.arXiv preprint arXiv:2410.06166, 2024

arXiv 2024

[24] [24]

Ms-tcn++: Multi-stage temporal convolutional network for action segmentation.IEEE Transactions on Pattern Analysis and Machine Intelligence, pages 1–1, 2020

Shi-Jie Li, Yazan AbuFarha, Yun Liu, Ming-Ming Cheng, and Juergen Gall. Ms-tcn++: Multi-stage temporal convolutional network for action segmentation.IEEE Transactions on Pattern Analysis and Machine Intelligence, pages 1–1, 2020. doi: 10.1109/TPAMI.2020.3021756

work page doi:10.1109/tpami.2020.3021756 2020

[25] [25]

Tsm: Temporal shift module for efficient video understanding

Ji Lin, Chuang Gan, and Song Han. Tsm: Temporal shift module for efficient video understanding. In Proceedings of the IEEE/CVF international conference on computer vision, pages 7083–7093, 2019

2019

[26] [26]

Tempcompass: Do video llms really understand videos?arXiv preprint arXiv:2403.00476, 2024

Yuanxin Liu, Shicheng Li, Yi Liu, Yuxiang Wang, Shuhuai Ren, Lei Li, Sishuo Chen, Xu Sun, and Lu Hou. Tempcompass: Do video llms really understand videos?arXiv preprint arXiv:2403.00476, 2024

Pith/arXiv arXiv 2024

[27] [27]

Video swin transformer

Ze Liu, Jia Ning, Yue Cao, Yixuan Wei, Zheng Zhang, Stephen Lin, and Han Hu. Video swin transformer. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 3202–3211, June 2022

2022

[28] [28]

Fact: Frame-action cross-attention temporal modeling for efficient action segmentation

Zijia Lu and Ehsan Elhamifar. Fact: Frame-action cross-attention temporal modeling for efficient action segmentation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18175–18185, 2024. 11

2024

[29] [29]

Streamer: Streaming representation learning and event segmentation in a hierarchical manner.Advances in Neural Information Processing Systems, 36: 45694–45715, 2023

Ramy Mounir, Sujal Vijayaraghavan, and Sudeep Sarkar. Streamer: Streaming representation learning and event segmentation in a hierarchical manner.Advances in Neural Information Processing Systems, 36: 45694–45715, 2023

2023

[30] [30]

V-jepa 2.1: Unlocking dense features in video self-supervised learning.arXiv preprint arXiv:2603.14482, 2026

Lorenzo Mur-Labadia, Matthew Muckley, Amir Bar, Mido Assran, Koustuv Sinha, Mike Rabbat, Yann LeCun, Nicolas Ballas, and Adrien Bardes. V-jepa 2.1: Unlocking dense features in video self-supervised learning.arXiv preprint arXiv:2603.14482, 2026

Pith/arXiv arXiv 2026

[31] [31]

Video transformer network

Daniel Neimark, Omri Bar, Maya Zohar, and Dotan Asselmann. Video transformer network. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) Workshops, pages 3163–3172, October 2021

2021

[32] [32]

Representation learning with contrastive predictive coding.arXiv preprint arXiv:1807.03748, 2018

Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding.arXiv preprint arXiv:1807.03748, 2018

Pith/arXiv arXiv 2018

[33] [33]

Hiero: understanding the hierarchy of human behavior enhances reasoning on egocentric videos

Simone Alberto Peirone, Francesca Pistilli, and Giuseppe Averta. Hiero: understanding the hierarchy of human behavior enhances reasoning on egocentric videos. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 19862–19871, 2025

2025

[34] [34]

Omnia de egotempo: Benchmarking temporal understanding of multi-modal llms in egocentric videos

Chiara Plizzari, Alessio Tonioni, Yongqin Xian, Achin Kulshrestha, and Federico Tombari. Omnia de egotempo: Benchmarking temporal understanding of multi-modal llms in egocentric videos. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 24129–24138, 2025

2025

[35] [35]

Egovlpv2: Egocentric video-language pre-training with fusion in the backbone

Shraman Pramanick, Yale Song, Sayan Nag, Kevin Qinghong Lin, Hardik Shah, Mike Zheng Shou, Rama Chellappa, and Pengchuan Zhang. Egovlpv2: Egocentric video-language pre-training with fusion in the backbone. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 5285–5297, 2023

2023

[36] [36]

Learning from untrimmed videos: Self-supervised video representation learning with hierarchical consistency

Zhiwu Qing, Shiwei Zhang, Ziyuan Huang, Yi Xu, Xiang Wang, Mingqian Tang, Changxin Gao, Rong Jin, and Nong Sang. Learning from untrimmed videos: Self-supervised video representation learning with hierarchical consistency. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13821–13831, 2022

2022

[37] [37]

Understanding long videos with multimodal language models.arXiv preprint arXiv:2403.16998, 2024

Kanchana Ranasinghe, Xiang Li, Kumara Kahatapitiya, and Michael S Ryoo. Understanding long videos with multimodal language models.arXiv preprint arXiv:2403.16998, 2024

arXiv 2024

[38] [38]

Timechat: A time-sensitive multimodal large language model for long video understanding

Shuhuai Ren, Linli Yao, Shicheng Li, Xu Sun, and Lu Hou. Timechat: A time-sensitive multimodal large language model for long video understanding. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14313–14323, 2024

2024

[39] [39]

Assembly101: A large-scale multi-view video dataset for understanding procedural activities

Fadime Sener, Dibyadip Chatterjee, Daniel Shelepov, Kun He, Dipika Singhania, Robert Wang, and Angela Yao. Assembly101: A large-scale multi-view video dataset for understanding procedural activities. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21096– 21106, 2022

2022

[40] [40]

Video-xl: Extra-long vision language model for hour-scale video understanding

Yan Shu, Zheng Liu, Peitian Zhang, Minghao Qin, Junjie Zhou, Zhengyang Liang, Tiejun Huang, and Bo Zhao. Video-xl: Extra-long vision language model for hour-scale video understanding. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 26160–26169, 2025

2025

[41] [41]

C2f-tcn: A framework for semi-and fully-supervised temporal action segmentation.IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(10): 11484–11501, 2023

Dipika Singhania, Rahul Rahaman, and Angela Yao. C2f-tcn: A framework for semi-and fully-supervised temporal action segmentation.IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(10): 11484–11501, 2023

2023

[42] [42]

Moviechat: From dense token to sparse memory for long video understanding

Enxin Song, Wenhao Chai, Guanhong Wang, Yucheng Zhang, Haoyang Zhou, Feiyang Wu, Haozhe Chi, Xun Guo, Tian Ye, Yanting Zhang, et al. Moviechat: From dense token to sparse memory for long video understanding. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18221–18232, 2024

2024

[43] [43]

Moviechat+: Question- aware sparse memory for long video question answering.IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025

Enxin Song, Wenhao Chai, Tian Ye, Jenq-Neng Hwang, Xi Li, and Gaoang Wang. Moviechat+: Question- aware sparse memory for long video question answering.IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025

2025

[44] [44]

Roformer: Enhanced transformer with rotary position embedding.Neurocomputing, 568:127063, 2024

Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. Roformer: Enhanced transformer with rotary position embedding.Neurocomputing, 568:127063, 2024

2024

[45] [45]

Koala: Key frame-conditioned long video-llm

Reuben Tan, Ximeng Sun, Ping Hu, Jui-hsien Wang, Hanieh Deilamsalehy, Bryan A Plummer, Bryan Russell, and Kate Saenko. Koala: Key frame-conditioned long video-llm. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13581–13591, 2024. 12

2024

[46] [46]

Videomae: Masked autoencoders are data-efficient learners for self-supervised video pre-training.Advances in neural information processing systems, 35: 10078–10093, 2022

Zhan Tong, Yibing Song, Jue Wang, and Limin Wang. Videomae: Masked autoencoders are data-efficient learners for self-supervised video pre-training.Advances in neural information processing systems, 35: 10078–10093, 2022

2022

[47] [47]

Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution.arXiv preprint arXiv:2409.12191, 2024

Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, et al. Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution.arXiv preprint arXiv:2409.12191, 2024

Pith/arXiv arXiv 2024

[48] [48]

Longvideobench: A benchmark for long-context interleaved video-language understanding.Advances in Neural Information Processing Systems, 37: 28828–28857, 2024

Haoning Wu, Dongxu Li, Bei Chen, and Junnan Li. Longvideobench: A benchmark for long-context interleaved video-language understanding.Advances in Neural Information Processing Systems, 37: 28828–28857, 2024

2024

[49] [49]

Videollm-mod: Efficient video-language streaming with mixture- of-depths vision computation.Advances in Neural Information Processing Systems, 37:109922–109947, 2024

Shiwei Wu, Joya Chen, Kevin Qinghong Lin, Qimeng Wang, Yan Gao, Qianli Xu, Tong Xu, Yao Hu, Enhong Chen, and Mike Zheng Shou. Videollm-mod: Efficient video-language streaming with mixture- of-depths vision computation.Advances in Neural Information Processing Systems, 37:109922–109947, 2024

2024

[50] [50]

Hierarchical self-supervised representation learning for movie understanding

Fanyi Xiao, Kaustav Kundu, Joseph Tighe, and Davide Modolo. Hierarchical self-supervised representation learning for movie understanding. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 9727–9736, 2022

2022

[51] [51]

Asformer: Transformer for action segmentation.arXiv preprint arXiv:2110.08568, 2021

Fangqiu Yi, Hongyu Wen, and Tingting Jiang. Asformer: Transformer for action segmentation.arXiv preprint arXiv:2110.08568, 2021

arXiv 2021

[52] [52]

Learning procedure-aware video representation from instructional videos and their narrations

Yiwu Zhong, Licheng Yu, Yang Bai, Shangwen Li, Xueting Yan, and Yin Li. Learning procedure-aware video representation from instructional videos and their narrations. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14825–14835, 2023

2023

[53] [53]

Procedure-aware pretraining for instructional video understanding

Honglu Zhou, Roberto Martín-Martín, Mubbasir Kapadia, Silvio Savarese, and Juan Carlos Niebles. Procedure-aware pretraining for instructional video understanding. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10727–10738, 2023. 13 A Supplementary Material In this supplementary material we first discuss the broa...

2023

[54] [54]

Then we perform a similar analysis in the reduced dimensionality space of the t-SNE as shown in Figure 7, where the path metrics were computed in this reduced space (Table 7). A.1 Broader Impact Embodied AI and personalized assistants have a large upside, promising to take over menial tasks and increase productivity, making life easier for the majority of...

arXiv 2048