
arxiv: 2505.23617 · v3 · submitted 2025-05-29 · 💻 cs.CV · cs.AI · cs.GR · cs.LG

One Trajectory, One Token: Grounded Video Tokenization via Panoptic Sub-object Trajectory

Pith reviewed 2026-05-19 12:50 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI · cs.GR · cs.LG
keywords grounded video tokenization · panoptic sub-object trajectories · TrajViT · video transformers · token reduction · video understanding · VideoLLM

The pith

Videos tokenized by panoptic sub-object trajectories cut token count tenfold while improving retrieval and VideoQA accuracy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that organizing video tokens around panoptic sub-object trajectories rather than fixed space-time patches creates a more efficient representation aligned with scene structure. This grounded tokenization reflects scene complexity instead of video duration, reducing redundancy while preserving semantic identity and temporal coherence. A sympathetic reader would care because current patch-based methods produce excessive tokens that hinder scaling transformers to long videos. The authors demonstrate the approach with TrajViT, which extracts trajectories and maps each to one token, yielding measurable gains on retrieval and question-answering tasks.

Core claim

TrajViT extracts panoptic sub-object trajectories from video frames and converts each trajectory into a single semantically meaningful token. This strategy replaces uniform space-time patch tokenization with one that follows object motion and identity. Trained via contrastive learning, the model outperforms space-time ViT3D by 6 percent top-5 recall on average in video-text retrieval while using 10 times fewer tokens. When employed as the video encoder in modern VideoLLMs, it delivers an average 5.2 percent performance lift across six VideoQA benchmarks together with 4 times faster training and 18 times lower inference FLOPs.
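
Top-5 recall, the retrieval metric quoted above, counts a query as a hit when its ground-truth match is among the five highest-scoring candidates. The following is a minimal sketch of that metric, assuming a similarity matrix in which the correct candidate for query i is candidate i; the function name and the toy scores are illustrative and do not come from the paper.

```python
def recall_at_k(similarity, k=5):
    """Fraction of queries whose ground-truth item ranks in the top k.

    Assumes similarity[i][j] scores query i against candidate j and that
    the correct candidate for query i is candidate i (an assumption of
    this sketch, matching the usual paired video-text setup).
    """
    hits = 0
    for i, row in enumerate(similarity):
        # Rank candidate indices by descending score (stable for ties).
        ranked = sorted(range(len(row)), key=lambda j: -row[j])
        if i in ranked[:k]:
            hits += 1
    return hits / len(similarity)


# Toy 3x3 scores (made up): queries 0 and 1 rank their match first,
# query 2 ranks its match last.
toy_scores = [
    [0.9, 0.1, 0.2],
    [0.3, 0.8, 0.1],
    [0.7, 0.6, 0.2],
]
recall_at_1 = recall_at_k(toy_scores, k=1)
recall_at_3 = recall_at_k(toy_scores, k=3)
```

With only three candidates, recall at k=3 is trivially perfect; the metric becomes informative when the candidate pool is much larger than k.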

What carries the argument

Panoptic sub-object trajectory, the continuous path of a semantic sub-object across frames that is encoded as exactly one token to capture its identity and motion.

If this is right

  • Token count drops by a factor of ten on video-text retrieval while top-5 recall rises by 6 percent on average.
  • VideoLLM training runs four times faster and inference uses eighteen times fewer FLOPs.
  • Average performance improves by 5.2 percent across six VideoQA benchmarks when TrajViT serves as the video encoder.
  • Tokenization now scales with scene complexity rather than raw video length.
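
The last bullet can be illustrated with a rough token-count comparison. This is a sketch under stated assumptions: the tubelet size (2 frames by 16x16 pixels), the 224x224 resolution, and the figure of 150 trajectories are hypothetical choices for illustration, not values taken from the paper.

```python
def patch_token_count(num_frames, height, width, tubelet=(2, 16, 16)):
    """Tokens produced by non-overlapping space-time patches.

    tubelet = (frames, pixel rows, pixel columns) covered by one patch.
    The default is a common ViT-style choice assumed for this sketch.
    """
    t, p_h, p_w = tubelet
    return (num_frames // t) * (height // p_h) * (width // p_w)


def trajectory_token_count(num_trajectories):
    """One token per trajectory, independent of how many frames it spans."""
    return num_trajectories


# Hypothetical scene with 150 sub-object trajectories, sampled at two lengths.
short_patch = patch_token_count(16, 224, 224)   # 8 * 14 * 14
long_patch = patch_token_count(64, 224, 224)    # 32 * 14 * 14
short_traj = trajectory_token_count(150)
long_traj = trajectory_token_count(150)
```

Under these assumptions the patch count quadruples when the clip gets four times longer, while the trajectory count stays fixed as long as no new sub-objects appear.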

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same trajectory principle could extend to other time-series data such as audio or motion capture where distinct entities move continuously.
  • Robustness to moving cameras might increase because tokens track objects instead of remaining anchored to a static grid.
  • Experiments on hour-long videos would test whether the efficiency advantage widens as patch-based methods scale linearly with duration.

Load-bearing premise

Reliable panoptic sub-object trajectories can be extracted from videos without introducing errors or biases that would degrade semantic and temporal information needed for downstream tasks.

What would settle it

If trajectory extraction errors on videos with frequent occlusions or rapid camera motion cause TrajViT to fall below ViT3D accuracy on retrieval or VideoQA benchmarks, the superiority claim would not hold.

Figures

Figures reproduced from arXiv: 2505.23617 by Chenhao Zheng, Jieyu Zhang, Mohammadreza Salehi, Norimasa Kobori, Quan Kong, Ranjay Krishna, Vishnu Iyengar, Ziqi Gao.

Figure 1. (a) Traditional video tokenization divides a video into …
Figure 2. Overview of TrajViT. Given a video, we first panoptically extract the trajectories for all objects. Our trajectory encoder converts these dynamic object trajectories into fixed-size embeddings, which serve as the input to the transformer encoder.
Figure 3. Our parallel trajectory generation pipeline. We use key frame detection to break a video into subclips. We segment and track objects in each clip in parallel and finally merge objects between clips. This paradigm captures objects that emerge over time while reducing overall tracking latency.
Figure 4. Architecture of trajectory encoder. We employ a two-branch design that encodes a trajectory's appearance and temporal position separately. At each frame, we represent the appearance of a segment by mask pooling its feature, and represent its position by bounding box coordinates. Both features are then aggregated across frames via perceiver resampler and added together to form the trajectory feature.
Figure 5. Visualizations of generated trajectories.
Figure 6. Comparison of inference frame number scaling in ActivityNet video-to-text retrieval task. Scaling with our tokenization paradigm obtains a better trade-off than baselines in terms of efficiency and accuracy.
[Plot residue: x-axis "Training data size (millions)"; y-axis "Mean Retrieval Vid2Txt R@5 (%)"; series ViT3D and Ours; plotted values 20.48, 28.46, 32.89, 24.49, 30.33, 34.58.]
Figure 8. Incorporating image data. TrajViT benefits more from adding image data into pretraining as it requires no architectural modifications.
Figure 9. Accuracy and inference FLOPs at MovieChat long video benchmarks with input frame scaling. The VideoLLM with TrajViT as video encoder scales significantly better than the one with ViT3D in both accuracy and efficiency.
Figure 10. Architecture for TokenMerge baseline.

Model            K400   SSV2   UCF-101
ViT3D            42.0   12.3   40.4
TokenLearner     40.9   11.0   37.8
ViViT            39.9   11.5   34.8
AutoMerge        38.4   10.3   35.4
RLT              41.0   10.3   33.7
ToMe             38.2    9.9   37.1
TrajViT (ours)   42.4   11.8   42.1

Figure 11. Visualizations of our generated trajectories (part 1).
Figure 12. Visualizations of our generated trajectories (part 2).
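
The two-branch trajectory encoder described in the Figure 4 caption can be sketched in plain Python. This is a simplified illustration, not the authors' implementation: the perceiver resampler is replaced by a plain mean over frames, the learned projections and the final sum of the two branches are omitted, and all function names and toy data are hypothetical.

```python
def mask_pool(features, mask):
    """Average the feature vectors of all pixels where mask is 1.

    features: H x W x C nested lists; mask: H x W nested lists of 0/1.
    """
    channels = len(features[0][0])
    total = [0.0] * channels
    count = 0
    for y, row in enumerate(mask):
        for x, inside in enumerate(row):
            if inside:
                for c in range(channels):
                    total[c] += features[y][x][c]
                count += 1
    return [v / count for v in total] if count else total


def bounding_box(mask):
    """Normalized (x_min, y_min, x_max, y_max) of the mask, or None if empty."""
    ys = [y for y, row in enumerate(mask) if any(row)]
    xs = [x for row in mask for x, inside in enumerate(row) if inside]
    if not ys:
        return None
    h, w = len(mask), len(mask[0])
    return (min(xs) / w, min(ys) / h, (max(xs) + 1) / w, (max(ys) + 1) / h)


def mean_over_frames(vectors):
    """Stand-in for the perceiver resampler: element-wise mean across frames."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]


def encode_trajectory(frames):
    """frames: list of (features, mask) pairs for one trajectory.

    Assumes the segment is visible in at least one frame.
    Returns (appearance_vector, position_vector), kept separate here.
    """
    appearance, position = [], []
    for features, mask in frames:
        box = bounding_box(mask)
        if box is None:  # segment not visible in this frame
            continue
        appearance.append(mask_pool(features, mask))
        position.append(list(box))
    return mean_over_frames(appearance), mean_over_frames(position)


# Toy 2x2 frame with one feature channel; the segment moves left to right.
feats = [[[1.0], [2.0]], [[3.0], [4.0]]]
left_mask = [[1, 0], [1, 0]]
right_mask = [[0, 1], [0, 1]]
app, pos = encode_trajectory([(feats, left_mask), (feats, right_mask)])
```

Keeping the two branches separate in the return value makes it easy to inspect what each one captures before any learned fusion is added.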
Original abstract

Effective video tokenization is critical for scaling transformer models for long videos. Current approaches tokenize videos using space-time patches, leading to excessive tokens and computational inefficiencies. The best token reduction strategies degrade performance and barely reduce the number of tokens when the camera moves. We introduce grounded video tokenization, a paradigm that organizes tokens based on panoptic sub-object trajectories rather than fixed patches. Our method aligns with fundamental perceptual principles, ensuring that tokenization reflects scene complexity rather than video duration. We propose TrajViT, a video encoder that extracts object trajectories and converts them into semantically meaningful tokens, significantly reducing redundancy while maintaining temporal coherence. Trained with contrastive learning, TrajViT significantly outperforms space-time ViT (ViT3D) across multiple video understanding benchmarks, e.g., TrajViT outperforms ViT3D by a large margin of 6% top-5 recall in average at video-text retrieval task with 10x token deduction. We also show TrajViT as a stronger model than ViT3D for being the video encoder for modern VideoLLM, obtaining an average of 5.2% performance improvement across 6 VideoQA benchmarks while having 4x faster training time and 18x less inference FLOPs. TrajViT is the first efficient encoder to consistently outperform ViT3D across diverse video analysis tasks, making it a robust and scalable solution.
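
The abstract says TrajViT is trained with contrastive learning but this page does not spell out the loss. As a hedged illustration, the sketch below implements a CLIP-style symmetric InfoNCE objective over paired video and text embeddings; the temperature of 0.07 and the function names are assumptions, not details confirmed by the paper.

```python
import math


def _cross_entropy_diag(logits):
    """Mean cross-entropy where the target for row i is column i."""
    loss = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum = m + math.log(sum(math.exp(v - m) for v in row))
        loss += log_sum - row[i]
    return loss / len(logits)


def symmetric_contrastive_loss(video_emb, text_emb, temperature=0.07):
    """Average of video-to-text and text-to-video InfoNCE losses.

    video_emb[i] and text_emb[i] are assumed to be a matching pair.
    """
    def normalize(v):
        norm = math.sqrt(sum(x * x for x in v))
        return [x / norm for x in v]

    v = [normalize(e) for e in video_emb]
    t = [normalize(e) for e in text_emb]
    # Cosine similarities scaled by the temperature.
    logits = [[sum(a * b for a, b in zip(vi, tj)) / temperature for tj in t]
              for vi in v]
    logits_t = [list(col) for col in zip(*logits)]
    return 0.5 * (_cross_entropy_diag(logits) + _cross_entropy_diag(logits_t))


# Toy check: aligned pairs should score a lower loss than swapped pairs.
videos = [[1.0, 0.0], [0.0, 1.0]]
aligned_loss = symmetric_contrastive_loss(videos, [[1.0, 0.0], [0.0, 1.0]])
swapped_loss = symmetric_contrastive_loss(videos, [[0.0, 1.0], [1.0, 0.0]])
```

The symmetric form treats each video as a query over texts and each text as a query over videos, which is why the same encoder can later be evaluated on retrieval in both directions.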

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The manuscript introduces grounded video tokenization, which organizes video tokens around panoptic sub-object trajectories rather than fixed space-time patches. The proposed TrajViT encoder, trained with contrastive learning, is claimed to outperform a space-time ViT baseline (ViT3D) by 6% top-5 recall on video-text retrieval at 10x token reduction, deliver 5.2% average gains across six VideoQA benchmarks when used as a VideoLLM encoder, and provide 4x faster training with 18x lower inference FLOPs.

Significance. If the empirical margins prove robust, the approach could meaningfully advance efficient long-video modeling by aligning tokenization with scene complexity and perceptual principles, offering a practical route to lower redundancy in video transformers and VideoLLMs.

major comments (1)
  1. [Abstract / Experimental evaluation] The reported gains (6% retrieval, 5.2% VideoQA) are load-bearing on the claim that panoptic trajectory extraction remains reliable and information-preserving under camera motion, occlusion, and fast motion. The abstract notes that prior reduction methods fail precisely in these regimes, yet no ablation or quantitative assessment of extraction error rates or downstream sensitivity to missed/broken tracks is referenced in the provided summary.
minor comments (1)
  1. [Abstract] Clarify whether '10x token deduction' refers to a fixed reduction factor or an average; consistent terminology would aid comparison with prior token-reduction baselines.
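
The distinction raised in the minor comment matters because a per-video average and a corpus-level ratio can differ noticeably when trajectory counts vary between videos. A small sketch with made-up token counts (none of these numbers come from the paper):

```python
def reduction_factors(patch_tokens, trajectory_tokens):
    """Return per-video reduction factors plus two ways of averaging them."""
    per_video = [p / t for p, t in zip(patch_tokens, trajectory_tokens)]
    mean_of_ratios = sum(per_video) / len(per_video)
    ratio_of_totals = sum(patch_tokens) / sum(trajectory_tokens)
    return per_video, mean_of_ratios, ratio_of_totals


# Hypothetical corpus: same patch count per video, varying scene complexity.
patch = [1568, 1568, 1568]
traj = [98, 196, 392]
per_video, mean_of_ratios, ratio_of_totals = reduction_factors(patch, traj)
```

Here the per-video factors are 16, 8 and 4; their mean is about 9.3, while the ratio of total tokens is about 6.9, so a single headline factor needs its averaging convention stated.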

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on the robustness of our trajectory extraction. We address the major comment below.

Point-by-point responses
  1. Referee: [Abstract / Experimental evaluation] The reported gains (6% retrieval, 5.2% VideoQA) are load-bearing on the claim that panoptic trajectory extraction remains reliable and information-preserving under camera motion, occlusion, and fast motion. The abstract notes that prior reduction methods fail precisely in these regimes, yet no ablation or quantitative assessment of extraction error rates or downstream sensitivity to missed/broken tracks is referenced in the provided summary.

    Authors: We agree that a direct quantitative assessment of extraction reliability under camera motion, occlusion, and fast motion would strengthen the validation of our claims. The full manuscript evaluates TrajViT on diverse benchmarks (e.g., MSR-VTT, ActivityNet, and VideoQA datasets) that contain substantial camera motion, occlusions, and rapid movements; the consistent 6% retrieval and 5.2% VideoQA gains over ViT3D indicate that trajectory-based tokenization preserves information more effectively than fixed patches in these regimes. However, we acknowledge the absence of a dedicated ablation on extraction error rates and sensitivity to missed or broken tracks. We will add this analysis in the revision, including metrics on track continuity and an ablation simulating track breaks to measure downstream impact. revision: yes
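
One way the promised track-break ablation could be set up is to corrupt clean trajectories synthetically before tokenization. The sketch below is a hypothetical design, not the authors' protocol: each trajectory is represented only by its list of frame indices, and each is split in two at a random point with probability `break_prob`.

```python
import random


def break_tracks(trajectories, break_prob, seed=0):
    """Randomly split trajectories to mimic tracker failures.

    trajectories: list of lists of frame indices, one list per trajectory.
    Each trajectory with at least two frames is cut into two pieces with
    probability break_prob, so broken tracks yield extra tokens.
    """
    rng = random.Random(seed)  # fixed seed keeps the ablation reproducible
    broken = []
    for frames in trajectories:
        if len(frames) >= 2 and rng.random() < break_prob:
            cut = rng.randint(1, len(frames) - 1)
            broken.extend([frames[:cut], frames[cut:]])
        else:
            broken.append(frames)
    return broken


tracks = [[0, 1, 2, 3], [2, 3], [5]]
untouched = break_tracks(tracks, break_prob=0.0)
all_broken = break_tracks(tracks, break_prob=1.0)
```

Sweeping `break_prob` from 0 to 1 and re-running the downstream benchmark at each setting would give a sensitivity curve of accuracy against track fragmentation.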

Circularity Check

0 steps flagged

No circularity; empirical method with independent benchmark comparisons

full rationale

The paper introduces a grounded video tokenization approach using panoptic sub-object trajectories and evaluates TrajViT empirically against ViT3D on retrieval and VideoQA tasks. Performance margins (e.g., 6% top-5 recall, 5.2% VideoQA gains) are presented as experimental outcomes from contrastive training, not as derivations that reduce to fitted inputs or self-citations by construction. No load-bearing equations, uniqueness theorems, or ansatzes are invoked that collapse to the method's own definitions. The central claims rest on observable benchmark differences rather than tautological re-labeling of inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entity

Review is limited to the abstract, which does not detail mathematical derivations, specific hyperparameters, or background assumptions beyond standard contrastive learning and panoptic segmentation techniques from prior computer vision literature.

invented entities (1)
  • TrajViT · no independent evidence
    purpose: Video encoder that extracts and tokenizes object trajectories
    Proposed model name and architecture introduced in the paper.

pith-pipeline@v0.9.0 · 5820 in / 1198 out tokens · 44389 ms · 2026-05-19T12:50:01.701342+00:00 · methodology



Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. TrajTok: Learning Trajectory Tokens enables better Video Understanding

    cs.CV · 2026-02 · unverdicted · novelty 7.0

    TrajTok learns adaptive trajectory tokens for videos through a unified end-to-end segmenter, improving understanding performance and efficiency over patch-based or external-pipeline tokenizers.

Reference graph

Works this paper leans on

83 extracted references · 83 canonical work pages · cited by 1 Pith paper · 13 internal anchors

  1. [1]

    Flamingo: a visual language model for few-shot learning

    Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning. Advances in neural information processing systems, 35:23716–23736, 2022. 3, 4

  2. [2]

    Vivit: A video vision transformer

    Anurag Arnab, Mostafa Dehghani, Georg Heigold, Chen Sun, Mario Lučić, and Cordelia Schmid. Vivit: A video vision transformer. In Proceedings of the IEEE/CVF international conference on computer vision, pages 6836–6846, 2021. 1, 2, 5, 6, 7

  3. [3]

    Token Merging: Your ViT But Faster

    Daniel Bolya, Cheng-Yang Fu, Xiaoliang Dai, Peizhao Zhang, Christoph Feichtenhofer, and Judy Hoffman. Token merging: Your vit but faster. arXiv preprint arXiv:2210.09461, 2022. 1, 2, 5

  4. [4]

    Revisiting the "video" in video-language understanding

    Shyamal Buch, Cristóbal Eyzaguirre, Adrien Gaidon, Jiajun Wu, Li Fei-Fei, and Juan Carlos Niebles. Revisiting the "video" in video-language understanding. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 2917–2927, 2022. 2

  5. [5]

    PuMer: Pruning and merging tokens for efficient vision language models

    Qingqing Cao, Bhargavi Paranjape, and Hannaneh Hajishirzi. PuMer: Pruning and merging tokens for efficient vision language models. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 12890–12903, Toronto, Canada, 2023. Association for Computational Linguistics. 2

  6. [6]

    Subobject-level image tokenization

    Delong Chen, Samuel Cahyawijaya, Jianfeng Liu, Baoyuan Wang, and Pascale Fung. Subobject-level image tokenization. arXiv preprint arXiv:2402.14327,

  7. [7]

    An image is worth 1/2 tokens after layer 2: Plug-and-play inference acceleration for large vision-language models, 2024

    Liang Chen, Haozhe Zhao, Tianyu Liu, Shuai Bai, Junyang Lin, Chang Zhou, and Baobao Chang. An image is worth 1/2 tokens after layer 2: Plug-and-play inference acceleration for large vision-language models, 2024. 3

  8. [8]

    Panda-70m: Captioning 70m videos with multiple cross-modality teachers

    Tsai-Shien Chen, Aliaksandr Siarohin, Willi Menapace, Ekaterina Deyneka, Hsiang-wei Chao, Byung Eun Jeon, Yuwei Fang, Hsin-Ying Lee, Jian Ren, Ming-Hsuan Yang, et al. Panda-70m: Captioning 70m videos with multiple cross-modality teachers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13320–13331, 2024. 2, 5

  9. [9]

    Putting the object back into video object segmentation

    Ho Kei Cheng, Seoung Wug Oh, Brian Price, Joon-Young Lee, and Alexander Schwing. Putting the object back into video object segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3151–3161, 2024. 3

  10. [10]

    Joonmyung Choi, Sanghyeok Lee, Jaewon Chu, Minhyuk Choi, and Hyunwoo J. Kim. vid-tldr: Training free token merging for light-weight video transformer,

  11. [11]

    Don’t look twice: Faster video transformers with run-length tokenization

    Rohan Choudhury, Guanglei Zhu, Sihan Liu, Koichiro Niinuma, Kris Kitani, and László Jeni. Don’t look twice: Faster video transformers with run-length tokenization. Advances in Neural Information Processing Systems, 37:28127–28149, 2025. 1, 2, 5, 7

  12. [12]

    A thousand frames in just a few words: Lingual description of videos through latent topics and sparse object stitching

    Pradipto Das, Chenliang Xu, Richard F Doell, and Jason J Corso. A thousand frames in just a few words: Lingual description of videos through latent topics and sparse object stitching. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2634–2641, 2013. 6

  13. [13]

    An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

    Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020. 5

  14. [14]

    Sam2act: Integrating visual foundation model with a memory architecture for robotic manipulation

    Haoquan Fang, Markus Grotz, Wilbert Pumacay, Yi Ru Wang, Dieter Fox, Ranjay Krishna, and Jiafei Duan. Sam2act: Integrating visual foundation model with a memory architecture for robotic manipulation. arXiv preprint arXiv:2501.18564, 2025. 2

  15. [15]

    Adaptive token sampling for efficient vision transformers, 2022

    Mohsen Fayyaz, Soroush Abbasi Koohpayegani, Farnoush Rezaei Jafari, Sunando Sengupta, Hamid Reza Vaezi Joze, Eric Sommerlade, Hamed Pirsiavash, and Juergen Gall. Adaptive token sampling for efficient vision transformers, 2022. 2

  16. [16]

    Masked autoencoders as spatiotemporal learners, 2022

    Christoph Feichtenhofer, Haoqi Fan, Yanghao Li, and Kaiming He. Masked autoencoders as spatiotemporal learners, 2022. 2

  17. [17]

    Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis

    Chaoyou Fu, Yuhan Dai, Yongdong Luo, Lei Li, Shuhuai Ren, Renrui Zhang, Zihan Wang, Chenyu Zhou, Yunhang Shen, Mengdan Zhang, et al. Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis. arXiv preprint arXiv:2405.21075, 2024. 7

  18. [18]

    Datacomp: In search of the next generation of multimodal datasets

    Samir Yitzhak Gadre, Gabriel Ilharco, Alex Fang, Jonathan Hayase, Georgios Smyrnis, Thao Nguyen, Ryan Marten, Mitchell Wortsman, Dhruba Ghosh, Jieyu Zhang, et al. Datacomp: In search of the next generation of multimodal datasets. Advances in Neural Information Processing Systems, 36:27092–27112,

  19. [19]

    The "something something" video database for learning and evaluating visual common sense

    Raghav Goyal, Samira Ebrahimi Kahou, Vincent Michalski, Joanna Materzynska, Susanne Westphal, Heuna Kim, Valentin Haenel, Ingo Fruend, Peter Yianilos, Moritz Mueller-Freitag, et al. The "something something" video database for learning and evaluating visual common sense. In Proceedings of the IEEE international conference on computer vision, pages 58...

  20. [20]

    Ava: A video dataset of spatio-temporally localized atomic visual actions

    Chunhui Gu, Chen Sun, David A Ross, Carl Vondrick, Caroline Pantofaru, Yeqing Li, Sudheendra Vijayanarasimhan, George Toderici, Susanna Ricco, Rahul Sukthankar, et al. Ava: A video dataset of spatio-temporally localized atomic visual actions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 6047–6056, 2018. 6

  21. [21]

    Deep residual learning for image recognition

    Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016. 4

  22. [22]

    Space-time correspondence as a contrastive random walk

    Allan Jabri, Andrew Owens, and Alexei Efros. Space-time correspondence as a contrastive random walk. Advances in neural information processing systems, 33:19545–19560, 2020. 8

  23. [23]

    Perceiver: General perception with iterative attention

    Andrew Jaegle, Felix Gimeno, Andy Brock, Oriol Vinyals, Andrew Zisserman, and Joao Carreira. Perceiver: General perception with iterative attention. In International conference on machine learning, pages 4651–4664. PMLR, 2021. 3

  24. [24]

    Video-lavit: Unified video-language pre-training with decoupled visual-motional tokenization, 2024

    Yang Jin, Zhicheng Sun, Kun Xu, Kun Xu, Liwei Chen, Hao Jiang, Quzhe Huang, Chengru Song, Yuliang Liu, Di Zhang, Yang Song, Kun Gai, and Yadong Mu. Video-lavit: Unified video-language pre-training with decoupled visual-motional tokenization, 2024. 2

  25. [25]

    The Kinetics Human Action Video Dataset

    Will Kay, Joao Carreira, Karen Simonyan, Brian Zhang, Chloe Hillier, Sudheendra Vijayanarasimhan, Fabio Viola, Tim Green, Trevor Back, Paul Natsev, et al. The kinetics human action video dataset. arXiv preprint arXiv:1705.06950, 2017. 6

  26. [26]

    Dense-captioning events in videos

    Ranjay Krishna, Kenji Hata, Frederic Ren, Li Fei-Fei, and Juan Carlos Niebles. Dense-captioning events in videos. In Proceedings of the IEEE international conference on computer vision, pages 706–715, 2017. 5, 6

  27. [27]

    Less is more: Clipbert for video-and-language learning via sparse sampling

    Jie Lei, Linjie Li, Luowei Zhou, Zhe Gan, Tamara L Berg, Mohit Bansal, and Jingjing Liu. Less is more: Clipbert for video-and-language learning via sparse sampling. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 7331–7341, 2021. 2

  28. [28]

    Revealing single frame bias for video-and-language learning

    Jie Lei, Tamara L Berg, and Mohit Bansal. Revealing single frame bias for video-and-language learning. arXiv preprint arXiv:2206.03428, 2022. 2, 7

  29. [29]

    Lmms-eval: Accelerating the development of large multimoal models, 2024

    Bo Li, Peiyuan Zhang, Kaichen Zhang, Fanyi Pu, Xinrun Du, Yuhao Dong, Haotian Liu, Yuanhan Zhang, Ge Zhang, Chunyuan Li, and Ziwei Liu. Lmms-eval: Accelerating the development of large multimoal models, 2024. 7

  30. [30]

    Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models, 2023

    Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models, 2023. 3

  31. [31]

    Videochat: Chat-centric video understanding,

    KunChang Li, Yinan He, Yi Wang, Yizhuo Li, Wenhai Wang, Ping Luo, Yali Wang, Limin Wang, and Yu Qiao. Videochat: Chat-centric video understanding,

  32. [32]

    Videomamba: State space model for efficient video understanding, 2024

    Kunchang Li, Xinhao Li, Yi Wang, Yinan He, Yali Wang, Limin Wang, and Yu Qiao. Videomamba: State space model for efficient video understanding, 2024. 2

  33. [33]

    Svitt: Temporal learning of sparse video-text transformers

    Yi Li, Kyle Min, Subarna Tripathi, and Nuno Vasconcelos. Svitt: Temporal learning of sparse video-text transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18919–18929, 2023. 2

  34. [34]

    Llama-vid: An image is worth 2 tokens in large language models,

    Yanwei Li, Chengyao Wang, and Jiaya Jia. Llama-vid: An image is worth 2 tokens in large language models,

  35. [35]

    Not all patches are what you need: Expediting vision transformers via token reorganizations

    Youwei Liang, Chongjian Ge, Zhan Tong, Yibing Song, Jue Wang, and Pengtao Xie. Not all patches are what you need: Expediting vision transformers via token reorganizations. arXiv preprint arXiv:2202.07800, 2022. 2

  36. [36]

    Swinbert: End-to-end transformers with sparse attention for video captioning

    Kevin Lin, Linjie Li, Chung-Ching Lin, Faisal Ahmed, Zhe Gan, Zicheng Liu, Yumao Lu, and Lijuan Wang. Swinbert: End-to-end transformers with sparse attention for video captioning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 17949–17958, 2022. 1

  37. [37]

    PPLLaVA: Varied Video Sequence Understanding With Prompt Guidance

    Ruyang Liu, Haoran Tang, Haibo Liu, Yixiao Ge, Ying Shan, Chen Li, and Jiankun Yang. Ppllava: Varied video sequence understanding with prompt guidance. arXiv preprint arXiv:2411.02327, 2024. 7

  38. [38]

    TempCompass: Do Video LLMs Really Understand Videos?

    Yuanxin Liu, Shicheng Li, Yi Liu, Yuxiang Wang, Shuhuai Ren, Lei Li, Sishuo Chen, Xu Sun, and Lu Hou. Tempcompass: Do video llms really understand videos? arXiv preprint arXiv:2403.00476, 2024. 7

  39. [39]

    Decoupled Weight Decay Regularization

    Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017. 1

  40. [40]

    Clip4clip: An empirical study of clip for end to end video clip retrieval,

    Huaishao Luo, Lei Ji, Ming Zhong, Yang Chen, Wen Lei, Nan Duan, and Tianrui Li. Clip4clip: An empirical study of clip for end to end video clip retrieval,

  41. [41]

    Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models

    Muhammad Maaz, Hanoona Rasheed, Salman Khan, and Fahad Shahbaz Khan. Video-chatgpt: Towards detailed video understanding via large vision and language models. arXiv preprint arXiv:2306.05424,

  42. [42]

    Nerf: Representing scenes as neural radiance fields for view synthesis

    Ben Mildenhall, Pratul P Srinivasan, Matthew Tancik, Jonathan T Barron, Ravi Ramamoorthi, and Ren Ng. Nerf: Representing scenes as neural radiance fields for view synthesis. Communications of the ACM, 65(1):99–106, 2021. 4

  43. [43]

    Attention bottlenecks for multimodal fusion

    Arsha Nagrani, Shan Yang, Anurag Arnab, Aren Jansen, Cordelia Schmid, and Chen Sun. Attention bottlenecks for multimodal fusion. In Advances in Neural Information Processing Systems, pages 14200–14213. Curran Associates, Inc., 2021. 2

  44. [44]

    Video transformer network

    Daniel Neimark, Omri Bar, Maya Zohar, and Dotan Asselmann. Video transformer network. In Proceedings of the IEEE/CVF international conference on computer vision, pages 3163–3172, 2021. 1

  45. [45]

    Ia-red2: Interpretability-aware redundancy reduction for vision transformers, 2021

    Bowen Pan, Rameswar Panda, Yifan Jiang, Zhangyang Wang, Rogerio Feris, and Aude Oliva. Ia-red2: Interpretability-aware redundancy reduction for vision transformers, 2021. 2

  46. [46]

    Tracking multiple independent targets: Evidence for a parallel tracking mechanism

    Zenon W Pylyshyn and Ron W Storm. Tracking multiple independent targets: Evidence for a parallel tracking mechanism. Spatial vision, 3(3):179–197, 1988. 2

  47. [47]

    Learning transferable visual models from natural language supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021. 2, 4

  48. [48]

    Towards universal soccer video understanding, 2024

    Jiayuan Rao, Haoning Wu, Hao Jiang, Ya Zhang, Yanfeng Wang, and Weidi Xie. Towards universal soccer video understanding, 2024. 3

  49. [49]

    SAM 2: Segment Anything in Images and Videos

    Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman Rädle, Chloe Rolland, Laura Gustafson, et al. Sam 2: Segment anything in images and videos. arXiv preprint arXiv:2408.00714, 2024. 2, 3

  50. [50]

    Tokenlearner: What can 8 learned tokens do for images and videos?

    Michael S Ryoo, AJ Piergiovanni, Anurag Arnab, Mostafa Dehghani, and Anelia Angelova. Tokenlearner: What can 8 learned tokens do for images and videos? arXiv preprint arXiv:2106.11297, 2021. 1, 2, 5

  51. [51]

    LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding

    Xiaoqian Shen, Yunyang Xiong, Changsheng Zhao, Lemeng Wu, Jun Chen, Chenchen Zhu, Zechun Liu, Fanyi Xiao, Balakrishnan Varadarajan, Florian Bordes, Zhuang Liu, Hu Xu, Hyunwoo J. Kim, Bilge Soran, Raghuraman Krishnamoorthi, Mohamed Elhoseiny, and Vikas Chandra. Longvu: Spatiotemporal adaptive compression for long video-language understanding. arXiv p...

  52. [52]

    Video-xl: Extra-long vision language model for hour-scale video understanding, 2024

    Yan Shu, Zheng Liu, Peitian Zhang, Minghao Qin, Junjie Zhou, Zhengyang Liang, Tiejun Huang, and Bo Zhao. Video-xl: Extra-long vision language model for hour-scale video understanding, 2024. 2

  53. [53]

    Hollywood in homes: Crowdsourcing data collection for activity understanding

    Gunnar A Sigurdsson, Gül Varol, Xiaolong Wang, Ali Farhadi, Ivan Laptev, and Abhinav Gupta. Hollywood in homes: Crowdsourcing data collection for activity understanding. In Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11–14, 2016, Proceedings, Part I 14, pages 510–526. Springer, 2016. 5

  54. [54]

    Moviechat: From dense token to sparse memory for long video understanding

    Enxin Song, Wenhao Chai, Guanhong Wang, Yucheng Zhang, Haoyang Zhou, Feiyang Wu, Haozhe Chi, Xun Guo, Tian Ye, Yanting Zhang, et al. Moviechat: From dense token to sparse memory for long video understanding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18221–18232, 2024. 2, 7

  55. [55]

    UCF101: A Dataset of 101 Human Actions Classes From Videos in The Wild

    Khurram Soomro, Amir Roshan Zamir, and Mubarak Shah. Ucf101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402, 2012. 6

  56. [56]

    Principles of object perception

    Elizabeth S Spelke. Principles of object perception. Cognitive science, 14(1):29–56, 1990. 2

  57. [57]

    Roformer: Enhanced transformer with rotary position embedding

    Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. Roformer: Enhanced transformer with rotary position embedding. Neurocomputing, 568:127063, 2024. 4

  58. [58]

    Global device growth and traffic profiles

    Cisco Systems. Global device growth and traffic profiles. Technical report, Cisco, 2018. Accessed: 2024-11-29. 1

  59. [59]

    Videomae: Masked autoencoders are data-efficient learners for self-supervised video pre-training

    Zhan Tong, Yibing Song, Jue Wang, and Limin Wang. Videomae: Masked autoencoders are data-efficient learners for self-supervised video pre-training. Advances in neural information processing systems, 35:10078–10093, 2022. 2

  60. [60]

    Attention is all you need

    A Vaswani. Attention is all you need. Advances in Neural Information Processing Systems, 2017. 1

  61. [61]

    A century of gestalt psychology in visual perception: I

    Johan Wagemans, James H Elder, Michael Kubovy, Stephen E Palmer, Mary A Peterson, Manish Singh, and Rüdiger von der Heydt. A century of gestalt psychology in visual perception: I. perceptual grouping and figure–ground organization. Psychological bulletin, 138(6):1172, 2012. 2

  62. [62]

    Efficient video transformers with spatial-temporal token selection

    Junke Wang, Xitong Yang, Hengduo Li, Liu Li, Zuxuan Wu, and Yu-Gang Jiang. Efficient video transformers with spatial-temporal token selection. In ECCV, 2022. 2

  63. [63]

    Vatex: A large-scale, high-quality multilingual dataset for video-and-language research

    Xin Wang, Jiawei Wu, Junkun Chen, Lei Li, Yuan-Fang Wang, and William Yang Wang. Vatex: A large-scale, high-quality multilingual dataset for video-and-language research. In Proceedings of the IEEE/CVF international conference on computer vision, pages 4581–4591, 2019. 5

  64. [64]

    Internvideo2: Scaling foundation models for multimodal video understanding

    Yi Wang, Kunchang Li, Xinhao Li, Jiashuo Yu, Yinan He, Guo Chen, Baoqi Pei, Rongkun Zheng, Zun Wang, Yansong Shi, et al. Internvideo2: Scaling foundation models for multimodal video understanding. In European Conference on Computer Vision, pages 396–416. Springer, 2024. 6, 7

  65. [65]

    Next-qa: Next phase of question-answering to explaining temporal actions

    Junbin Xiao, Xindi Shang, Angela Yao, and Tat-Seng Chua. Next-qa: Next phase of question-answering to explaining temporal actions. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 9777–9786, 2021. 7

  66. [66]

    Msr-vtt: A large video description dataset for bridging video and language

    Jun Xu, Tao Mei, Ting Yao, and Yong Rui. Msr-vtt: A large video description dataset for bridging video and language. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 5288–5296, 2016. 5

  67. [67]

    Track anything: Segment anything meets videos

    Jinyu Yang, Mingqi Gao, Zhe Li, Shang Gao, Fangjing Wang, and Feng Zheng. Track anything: Segment anything meets videos. arXiv preprint arXiv:2304.11968, 2023. 3

  68. [68]

    Visionzip: Longer is better but not necessary in vision language models

    Senqiao Yang, Yukang Chen, Zhuotao Tian, Chengyao Wang, Jingyao Li, Bei Yu, and Jiaya Jia. Visionzip: Longer is better but not necessary in vision language models, 2024. 3

  69. [69]

    Activitynet-qa: A dataset for understanding complex web videos via question answering

    Zhou Yu, Dejing Xu, Jun Yu, Ting Yu, Zhou Zhao, Yueting Zhuang, and Dacheng Tao. Activitynet-qa: A dataset for understanding complex web videos via question answering. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 9127–9134,

  70. [70]

    Videoglue: Video general understanding evaluation of foundation models

    Liangzhe Yuan, Nitesh Bharadwaj Gundavarapu, Long Zhao, Hao Zhou, Yin Cui, Lu Jiang, Xuan Yang, Menglin Jia, Tobias Weyand, Luke Friedman, et al. Videoglue: Video general understanding evaluation of foundation models. arXiv preprint arXiv:2307.03166,

  71. [71]

    VideoLLaMA 3: Frontier multimodal foundation models for image and video understanding

    Boqiang Zhang, Kehan Li, Zesen Cheng, Zhiqiang Hu, Yuqian Yuan, Guanzheng Chen, Sicong Leng, Yuming Jiang, Hang Zhang, Xin Li, Peng Jin, Wenqi Zhang, Fan Wang, Lidong Bing, and Deli Zhao. VideoLLaMA 3: Frontier multimodal foundation models for image and video understanding, 2025. 2

  72. [72]

    Llava-mini: Efficient image and video large multimodal models with one vision token

    Shaolei Zhang, Qingkai Fang, Zhe Yang, and Yang Feng. Llava-mini: Efficient image and video large multimodal models with one vision token, 2025. 2

  73. [73]

    LLaVA-Video: Video Instruction Tuning With Synthetic Data

    Yuanhan Zhang, Jinming Wu, Wei Li, Bo Li, Zejun Ma, Ziwei Liu, and Chunyuan Li. Video instruction tuning with synthetic data. arXiv preprint arXiv:2410.02713, 2024. 7

  74. [74]

    Videoprism: A foundational visual encoder for video understanding

    Long Zhao, Nitesh B Gundavarapu, Liangzhe Yuan, Hao Zhou, Shen Yan, Jennifer J Sun, Luke Friedman, Rui Qian, Tobias Weyand, Yue Zhao, et al. Videoprism: A foundational visual encoder for video understanding. arXiv preprint arXiv:2402.13217, 2024. 6, 7

  75. [75]

    MLVU: Benchmarking Multi-task Long Video Understanding

    Junjie Zhou, Yan Shu, Bo Zhao, Boya Wu, Shitao Xiao, Xi Yang, Yongping Xiong, Bo Zhang, Tiejun Huang, and Zheng Liu. Mlvu: A comprehensive benchmark for multi-task long video understanding. arXiv preprint arXiv:2406.04264, 2024. 7

One Trajectory, One Token: Grounded Video Tokenization via Panoptic Sub-object Trajectory (Supplementary Material)

  76. [76]

    We optimize the models using the AdamW optimizer [39] with a learning rate of $10^{-4}$, a weight decay of $10^{-2}$, and mixed precision training

    More Implementation Details. We provide the complete training details in Table 7. We optimize the models using the AdamW optimizer [39] with a learning rate of $10^{-4}$, a weight decay of $10^{-2}$, and mixed precision training. We adopt a cosine annealing learning rate schedule. The contrastive view (batch size) for video training is set to 256, and all models are tra...

  77. [77]

    Trajectory Encoder

    More Architecture Details. To complement the main paper, we provide additional details on our model architecture and the TokenMerge baseline’s architecture. Trajectory Encoder. We provide the complete architectural details of our trajectory tokenizer in table ??. As shown, the parameter size of our tokenizer is an order of magnitude smaller compared with m...

  78. [78]

    A frame is classified as a key frame if it is proposed by at least two out of the three detectors

    Key Frame Detection Algorithm. We illustrate the details of our key frame detection algorithm, which ensembles three sub-detectors to ensure robust scene boundary identification. A frame is classified as a key frame if it is proposed by at least two out of the three detectors. All detectors are implemented using the Content-Aware Detector from the PyS...

  79. [79]

    In this task, given an object’s bounding box in a specific video frame, the model must predict the action associated with that object at that time instant

    Detailed setup in the AVA v2 Spatial Temporal Detection task. We follow the setup in [59] to evaluate our model on the AVA v2 spatial-temporal action detection task. In this task, given an object’s bounding box in a specific video frame, the model must predict the action associated with that object at that time instant. This requires extracting video feature...

  80. [80]

    Table 10 presents the performance variations of the model with the change of the scale of the training data

    Full tables for scaling performance experiments. We provide the complete tables for the scaling-up experiments, for which the main paper shows only plots of the average trend. Table 10 presents how the model's performance varies with the scale of the training data. Table 11 presents the model’s performance with images adding to tra...
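
The two-out-of-three rule described in the key frame detection entry above can be sketched as a simple vote over the frame indices each detector proposes. This is a minimal illustration, not the authors' implementation: the function name `vote_key_frames`, the example proposals, and the choice to count only exact frame-index matches (rather than merging proposals that fall a few frames apart) are all assumptions made here.

```python
from collections import Counter


def vote_key_frames(detector_outputs, min_votes=2):
    """Return sorted frame indices proposed by at least `min_votes` detectors.

    detector_outputs: one iterable of frame indices per detector.
    Assumption: proposals agree only when the frame index matches exactly.
    """
    votes = Counter()
    for proposed in detector_outputs:
        # Deduplicate so a single detector cannot vote twice for one frame.
        votes.update(set(proposed))
    return sorted(frame for frame, n in votes.items() if n >= min_votes)


# Hypothetical proposals from three sub-detectors (illustrative values only).
detector_a = [0, 48, 120]
detector_b = [0, 50, 120, 200]
detector_c = [0, 120, 201]

key_frames = vote_key_frames([detector_a, detector_b, detector_c])
print(key_frames)  # [0, 120]
```

Frames 48, 50, 200 and 201 each receive a single vote and are dropped. A practical ensemble might instead allow a small tolerance window between detectors, which the source text does not specify.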

Showing first 80 references.
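
The cosine annealing schedule mentioned in the implementation details can be written in closed form. The sketch below uses the stated base learning rate of 1e-4; the total step count, the minimum learning rate of 0, and the absence of warmup are assumptions for illustration, since the source does not give them in the visible text.

```python
import math


def cosine_lr(step, total_steps, base_lr=1e-4, min_lr=0.0):
    """Cosine-annealed learning rate at `step` out of `total_steps`.

    base_lr follows the 1e-4 value quoted above; min_lr and total_steps
    are hypothetical choices, not taken from the paper.
    """
    # Clamp progress to [0, 1] so steps past the end stay at min_lr.
    progress = min(max(step / total_steps, 0.0), 1.0)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# Learning rate at the start, midpoint, and end of a hypothetical 1000-step run.
schedule = [cosine_lr(s, 1000) for s in (0, 500, 1000)]
print(schedule)
```

The rate starts at the base value, passes through half of it at the midpoint, and decays to the minimum at the final step.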