arxiv: 2605.28604 · v1 · pith:XYJRGK7Mnew · submitted 2026-05-27 · 💻 cs.CV · cs.AI

Mining Multi-Modality Spatio-Temporal Cues for Video Important Person Identification

Pith reviewed 2026-06-29 13:23 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords video important person identification · Temporal-VIP dataset · VIP-Net · spatio-temporal cues · multi-modality · Temporal Importance Shift · rationale annotation · person ranking

The pith

VIP-Net identifies the most important person in videos by fusing multi-modal spatio-temporal cues and correcting for importance shifts over time, reaching 67.3% accuracy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces the Video Important Person identification task to find the most influential individual in a video clip while also supplying a textual rationale. It identifies Temporal Importance Shift as the core flaw in prior work, where a person important early in a clip loses that status once the full sequence is considered. The authors release the Temporal-VIP dataset of 9,249 annotated segments across 11 categories and present the VIP-Net model that extracts social cues from multiple modalities, rectifies temporal importance, and ranks individuals. If correct, the approach would let automated systems select focus subjects using complete video context rather than isolated frames. This directly supports downstream uses such as video editing and surveillance that require consistent judgments across time.

Core claim

The central claim is that the VIP-Net framework mitigates Temporal Importance Shift and identifies the most influential person in a video. The framework consists of a Social Cue Encoder that extracts multi-modal spatio-temporal cues, a Temporal Importance Rectifier that performs hierarchical cue fusion and cross-modal alignment, and a VIP Inference module that ranks individuals. On the new Temporal-VIP dataset, the textual rationales it produces match ground-truth annotations at a mean similarity of 0.63.
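The 0.63 figure is a mean over per-sample similarity scores. The page does not say which similarity measure is used, so the sketch below assumes cosine similarity between precomputed text embeddings. The function names and the toy vectors are hypothetical and serve only to show how a mean and a spread could be aggregated.

```python
import math
from statistics import mean, pvariance


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rationale_similarity_stats(pred_embs, gt_embs):
    """Mean and population variance of per-sample similarities.

    ASSUMPTION: rationales are compared via cosine similarity of embeddings;
    the source does not specify the measure.
    """
    sims = [cosine_similarity(p, g) for p, g in zip(pred_embs, gt_embs)]
    return mean(sims), pvariance(sims)


# Toy embeddings, invented for illustration (not data from the paper).
pred = [[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
gt = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]
mean_sim, var_sim = rationale_similarity_stats(pred, gt)
```

In practice the embeddings would come from whatever sentence encoder the evaluation protocol fixes; swapping the measure only requires replacing `cosine_similarity`.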

What carries the argument

The VIP-Net framework, which combines a Social Cue Encoder and Temporal Importance Rectifier to mine and align multi-modality spatio-temporal cues before ranking individuals.

If this is right

  • VIP-Net reaches 67.3% accuracy on the VIP task, exceeding prior state-of-the-art ranges of 37.5-53.9%.
  • The model produces rationales whose mean similarity to ground truth reaches 0.63 after feature-guided LLM refinement.
  • The method operates on full temporal context across 11 video categories instead of static frames.
  • The released Temporal-VIP dataset supplies 9,249 segments with aligned importance rationales for training and testing.
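For readers unfamiliar with how an accuracy such as 67.3% is typically scored on a ranking task, here is a minimal sketch of rank-1 accuracy: a clip counts as correct when the highest-scored person is the annotated VIP. It assumes the reported accuracy is of this rank-1 kind; the function name, the score layout, and the toy numbers are illustrative assumptions.

```python
def rank1_accuracy(score_lists, gt_indices):
    """Fraction of clips whose top-scored person is the annotated VIP.

    score_lists[i] holds one importance score per person in clip i;
    gt_indices[i] is the index of the annotated VIP in that clip.
    """
    if not score_lists or len(score_lists) != len(gt_indices):
        raise ValueError("need one ground-truth index per non-empty score list")
    hits = 0
    for scores, gt in zip(score_lists, gt_indices):
        # Index of the highest score (first one wins on ties).
        predicted = max(range(len(scores)), key=scores.__getitem__)
        hits += int(predicted == gt)
    return hits / len(score_lists)


# Toy scores for four clips, invented for illustration.
clip_scores = [
    [0.1, 0.7, 0.2],
    [0.5, 0.3, 0.2],
    [0.2, 0.2, 0.6],
    [0.4, 0.5, 0.1],
]
annotated_vip = [1, 0, 2, 0]
acc = rank1_accuracy(clip_scores, annotated_vip)  # 3 of 4 clips correct
```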

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar temporal rectification steps could be tested on other video tasks that require consistent selection across an entire clip, such as keyframe extraction.
  • The combination of visual cue encoding and LLM-based rationale generation may support richer human-AI collaboration in video analysis pipelines.
  • Deployment in surveillance could shift from per-frame detection to sequence-level importance ranking without retraining the core encoder.
  • The dataset annotations provide a ready benchmark for measuring how well other multimodal models align with human notions of influence.

Load-bearing premise

That Temporal Importance Shift is a widespread limitation of earlier methods and that the rationale annotations in Temporal-VIP reliably capture human judgments of importance.

What would settle it

Evaluating the model on a fresh collection of videos where a person's importance does not shift across frames, or on annotations produced by a different group of human raters, and checking whether accuracy remains substantially above the 37.5-53.9% range of prior models.
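One way to make "substantially above" concrete is to put a confidence interval around the accuracy measured on the fresh set and check whether its lower bound clears the best prior figure. The sketch below uses a percentile bootstrap; the helper name, the resample count, and the made-up outcome vector are assumptions for illustration, not part of the paper's protocol.

```python
import random


def bootstrap_accuracy_ci(correct, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for accuracy.

    `correct` is a list of 0/1 outcomes, one per evaluated clip.
    """
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    n = len(correct)
    accs = []
    for _ in range(n_resamples):
        # Resample clips with replacement and recompute accuracy.
        sample = [correct[rng.randrange(n)] for _ in range(n)]
        accs.append(sum(sample) / n)
    accs.sort()
    lo_idx = int(n_resamples * alpha / 2)
    hi_idx = n_resamples - lo_idx - 1
    return accs[lo_idx], accs[hi_idx]


# Hypothetical outcomes: 134 correct out of 200 fresh clips (67%).
outcomes = [1] * 134 + [0] * 66
low, high = bootstrap_accuracy_ci(outcomes)

PRIOR_BEST = 0.539  # upper end of the prior-model range quoted above
clearly_above_prior = low > PRIOR_BEST
```

A clip-level bootstrap ignores run-to-run training variance, so it complements rather than replaces results averaged over multiple seeds.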

Figures

Figures reproduced from arXiv: 2605.28604 by Bin Yang, Mang Ye, Minglei Yang, Wenke Huang, Xiao Wang, Xin Xu, Zheng Wang.

Figure 1: Illustration of the Temporal Importance Shift (TIS) phenomenon.
Figure 2: Overall architecture of the proposed VIP-Net framework.
Figure 3: Temporal-VIP dataset samples across 11 scene categories.
Figure 4: Temporal-VIP dataset statistics. (a) Distribution across 11 scene categories. (b) Distribution of number of people per video clip, showing that 3-person clips are the most frequent, followed by 4- and 5-person clips.
Figure 5: Hyperparameter sensitivity analysis. Rank-1 accuracy (%) across different weight values for (a) text similarity weight λtext, (b) contrastive learning weight λcont, and (c) regularization weight λreg. Peak performance is marked for each parameter.
Figure 6: Qualitative examples. This topic not only identifies key individuals but also elucidates the rationale behind their significance. Such elaboration enhances the understanding of the scene by providing contextual information that supports the interpretation of the visual content.

Original abstract

Identifying key individuals in video scenes is essential for applications such as automated video editing and intelligent surveillance. Current methods primarily focus on static images and immediate visual cues, overlooking the rich spatio-temporal information in videos. This leads to the phenomenon of Temporal Importance Shift (TIS), wherein individuals deemed significant in early frames may be demoted as the entire temporal context is considered. To address this, we introduce the Video Important Person (VIP) identification task, aimed at automatically identifying the most influential individuals in videos while providing textual rationales. We present Temporal-VIP, a large-scale rationale-annotated dataset consisting of 9,249 video segments across 11 categories with aligned importance rationales. To mitigate TIS, we develop the VIP-Net framework, which includes a Social Cue Encoder (SCE) for extracting multi-modal spatio-temporal cues, a Temporal Importance Rectifier (TIR) for hierarchical cue fusion and cross-modal alignment, and VIP Inference for ranking individuals. Experimental results show that VIP-Net achieves 67.3% accuracy, significantly outperforming state-of-the-art models (37.5%-53.9%) and yielding a mean rationale similarity of 0.63 to ground truth through feature-guided LLM refinement. The dataset and code are available at https://huggingface.co/datasets/yml2002/Temporal-VIP.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

4 major / 1 minor

Summary. The paper introduces the Video Important Person (VIP) identification task to identify the most influential individuals in videos while providing textual rationales, motivated by the Temporal Importance Shift (TIS) phenomenon in prior methods. It releases the Temporal-VIP dataset (9,249 video segments across 11 categories with aligned rationales) and proposes the VIP-Net framework consisting of a Social Cue Encoder (SCE) for multi-modal spatio-temporal cues, a Temporal Importance Rectifier (TIR) for hierarchical fusion and cross-modal alignment, and a VIP Inference module for ranking. Experiments report VIP-Net achieving 67.3% accuracy (outperforming SOTA baselines at 37.5%-53.9%) and a mean rationale similarity of 0.63 to ground truth via feature-guided LLM refinement, with dataset and code released publicly.

Significance. If the empirical claims hold after proper validation, the work could advance video person identification beyond static-image methods by incorporating spatio-temporal cues and interpretability via rationales, with applications in editing and surveillance. The public release of the Temporal-VIP dataset and code is a clear strength that supports reproducibility and future research.

major comments (4)
  1. [Abstract / Experimental Results] Abstract and Experimental Results section: the central performance claims (67.3% accuracy, outperformance over 37.5%-53.9%, 0.63 rationale similarity) are stated without any description of baseline implementations, number of runs, error bars, statistical tests, or validation procedures (e.g., train/test splits on Temporal-VIP), making it impossible to determine whether the data support the accuracy claims.
  2. [Introduction] Introduction / Motivation section: the claim that TIS is a prevalent and addressable limitation in prior video person identification methods is not supported by any explicit measurement or quantitative demonstration that the cited SOTA models (37.5%-53.9%) actually exhibit TIS on the new task; this assumption is load-bearing for the motivation of SCE and TIR.
  3. [Dataset] Dataset section: no inter-annotator agreement statistics are reported for the rationale annotations across the 11 categories in Temporal-VIP, which directly undermines the reliability of the ground-truth rationales used to compute the 0.63 similarity score and to train/evaluate the framework.
  4. [Method] Method / Ablation section: there is no ablation study isolating the contribution of the TIR module (or SCE) to mitigating TIS versus other factors such as task redefinition or additional modalities; without this, the source of the reported gains cannot be attributed to the proposed components.
minor comments (1)
  1. [Abstract] Abstract: the phrase 'feature-guided LLM refinement' is used without specifying which LLM, how features guide it, or any implementation details.
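Major comment 2 asks for an explicit measurement of TIS. Purely as an illustration, one such measurement is the fraction of clips whose top-ranked person changes between scores computed from early frames only and scores computed from the full clip. The function names, the dictionary layout, and the toy scores below are assumptions for the sketch, not the authors' protocol.

```python
def top1(scores):
    """Return the person ID with the highest importance score."""
    return max(scores, key=scores.get)


def importance_shift_rate(early_scores, full_scores):
    """Fraction of clips whose top-ranked person differs between
    early-frame scores and full-clip scores."""
    if not early_scores or len(early_scores) != len(full_scores):
        raise ValueError("need matching, non-empty score lists")
    shifted = sum(top1(e) != top1(f) for e, f in zip(early_scores, full_scores))
    return shifted / len(early_scores)


# Toy per-person scores for three clips, invented for illustration.
early = [
    {"p1": 0.8, "p2": 0.2},
    {"p1": 0.6, "p2": 0.4},
    {"p1": 0.3, "p2": 0.7},
]
full = [
    {"p1": 0.3, "p2": 0.7},  # top person changes: p1 -> p2
    {"p1": 0.9, "p2": 0.1},
    {"p1": 0.2, "p2": 0.8},
]
rate = importance_shift_rate(early, full)  # one of three clips shifts
```

Running the same function on a baseline's scores and on VIP-Net's scores would give comparable shift rates, which is the kind of evidence the comment requests.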

Simulated Author's Rebuttal

4 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment point-by-point below and indicate the revisions that will be incorporated.

Point-by-point responses
  1. Referee: [Abstract / Experimental Results] Abstract and Experimental Results section: the central performance claims (67.3% accuracy, outperformance over 37.5%-53.9%, 0.63 rationale similarity) are stated without any description of baseline implementations, number of runs, error bars, statistical tests, or validation procedures (e.g., train/test splits on Temporal-VIP), making it impossible to determine whether the data support the accuracy claims.

    Authors: We agree that the current presentation lacks sufficient experimental details to fully substantiate the claims. In the revised manuscript, the Experimental Results section will be expanded to describe baseline re-implementations, the train/test split protocol on Temporal-VIP, results averaged over multiple runs with standard deviations (error bars), and statistical significance tests supporting the reported improvements.
    Revision: yes

  2. Referee: [Introduction] Introduction / Motivation section: the claim that TIS is a prevalent and addressable limitation in prior video person identification methods is not supported by any explicit measurement or quantitative demonstration that the cited SOTA models (37.5%-53.9%) actually exhibit TIS on the new task; this assumption is load-bearing for the motivation of SCE and TIR.

    Authors: The TIS motivation is derived from the conceptual gap between static-image methods and full-video context. To provide the requested quantitative support, we will add an analysis (new subsection or figure) that measures importance ranking shifts in the cited baselines when evaluated on early frames versus complete temporal sequences on Temporal-VIP.
    Revision: yes

  3. Referee: [Dataset] Dataset section: no inter-annotator agreement statistics are reported for the rationale annotations across the 11 categories in Temporal-VIP, which directly undermines the reliability of the ground-truth rationales used to compute the 0.63 similarity score and to train/evaluate the framework.

    Authors: We recognize that inter-annotator agreement metrics are essential for validating rationale quality. The dataset annotations involved multiple annotators; in the revision we will compute and report agreement statistics (e.g., Fleiss' kappa) broken down by the 11 categories.
    Revision: yes

  4. Referee: [Method] Method / Ablation section: there is no ablation study isolating the contribution of the TIR module (or SCE) to mitigating TIS versus other factors such as task redefinition or additional modalities; without this, the source of the reported gains cannot be attributed to the proposed components.

    Authors: Current ablations demonstrate the effect of removing SCE and TIR on final accuracy and rationale similarity. To isolate the TIS-mitigation contribution, we will add a targeted ablation that measures temporal ranking consistency (e.g., importance shift metrics) with and without TIR, directly comparing against baselines to attribute gains to the proposed components.
    Revision: yes
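The agreement statistic named in response 3, Fleiss' kappa, can be computed from a table of per-item rating counts. The sketch below is a plain-Python version of the standard formula. Treating each clip as an item and each candidate label as a category is an assumption about how the annotations would be encoded, and the toy counts are invented for illustration.

```python
def fleiss_kappa(counts):
    """Fleiss' kappa for a table where counts[i][j] is the number of raters
    who assigned item i to category j. Every item needs the same number of
    raters (at least two)."""
    n_items = len(counts)
    n_raters = sum(counts[0])
    if n_raters < 2 or any(sum(row) != n_raters for row in counts):
        raise ValueError("each item must have the same number (>= 2) of ratings")
    n_categories = len(counts[0])

    # Share of all ratings that fall into each category.
    p_j = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_categories)
    ]
    # Observed agreement per item, then averaged over items.
    p_i = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_bar = sum(p_i) / n_items
    # Agreement expected by chance.
    p_e = sum(p * p for p in p_j)
    if p_e == 1.0:
        return float("nan")  # undefined when every rating uses one category
    return (p_bar - p_e) / (1.0 - p_e)


# Toy tables (3 raters, 2 categories), invented for illustration.
perfect = [[3, 0], [0, 3]]
mixed = [[2, 1], [1, 2]]
kappa_perfect = fleiss_kappa(perfect)
kappa_mixed = fleiss_kappa(mixed)
```

Reporting the statistic per scene category, as the response proposes, would amount to calling the function once on each category's subset of clips.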

Circularity Check

0 steps flagged

No significant circularity; empirical results on new task and dataset.

The paper introduces the VIP identification task, Temporal-VIP dataset, and VIP-Net framework (with SCE, TIR modules) as novel contributions, then reports direct experimental outcomes (67.3% accuracy, 0.63 rationale similarity) on that dataset. No derivation chain, equations, or fitted-parameter reductions are described that collapse to inputs by construction. Performance claims rest on standard ML evaluation against external benchmarks rather than self-referential definitions or self-citation load-bearing premises. This is the common case of a self-contained empirical paper.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Review is based on abstract only; no explicit free parameters, axioms, or invented entities can be extracted beyond the high-level naming of the new task and model components.

pith-pipeline@v0.9.1-grok · 5783 in / 1093 out tokens · 46354 ms · 2026-06-29T13:23:33.409665+00:00 · methodology

Reference graph

Works this paper leans on

101 extracted references · 12 canonical work pages · 9 internal anchors

  1. [1]

    Temporal sentence ground- ing in videos: A survey and future directions,

    H. Zhang, A. Sun, W. Jing, and J. T. Zhou, “Temporal sentence ground- ing in videos: A survey and future directions,”IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 45, no. 8, pp. 10 443– 10 465, 2023

  2. [2]

    Bridgenet: Comprehensive and effective feature interactions via bridge feature for multi-task dense predictions,

    J. Zhang, J. Fan, P. Ye, B. Zhang, H. Ye, B. Li, Y . Cai, and T. Chen, “Bridgenet: Comprehensive and effective feature interactions via bridge feature for multi-task dense predictions,”IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 47, no. 5, pp. 3657–3672, 2025

  3. [3]

    Learning efficient and effective trajectories for differential equation-based image restoration,

    Z. Zhu, J. Hou, H. Liu, H. Zeng, and J. Hou, “Learning efficient and effective trajectories for differential equation-based image restoration,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025

  4. [4]

    Evolved hierarchical masking for self- supervised learning,

    Z. Feng and S. Zhang, “Evolved hierarchical masking for self- supervised learning,”IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 47, no. 2, pp. 1013–1027, 2024

  5. [5]

    Higcin: Hierarchical graph-based cross inference network for group activity recognition,

    R. Yan, L. Xie, J. Tang, X. Shu, and Q. Tian, “Higcin: Hierarchical graph-based cross inference network for group activity recognition,” IEEE transactions on pattern analysis and machine intelligence, vol. 45, no. 6, pp. 6955–6968, 2020

  6. [6]

    Ve-bench: subjective- aligned benchmark suite for text-driven video editing quality assess- ment,

    S. Sun, X. Liang, S. Fan, W. Gao, and W. Gao, “Ve-bench: subjective- aligned benchmark suite for text-driven video editing quality assess- ment,” inProceedings of the AAAI Conference on Artificial Intelligence, vol. 39, no. 7, 2025, pp. 7105–7113

  7. [7]

    Multitec: a data- driven multimodal short video detection framework for healthcare misinformation on tiktok,

    L. Shang, Y . Zhang, Y . Deng, and D. Wang, “Multitec: a data- driven multimodal short video detection framework for healthcare misinformation on tiktok,”IEEE Transactions on Big Data, vol. 11, no. 5, pp. 2471–2488, 2025

  8. [8]

    Value co-creation in sports live streaming platforms: A microfoundations perspective,

    H. Liu, K. H. Tan, A. Kumar, S. K. Singh, and L. Chung, “Value co-creation in sports live streaming platforms: A microfoundations perspective,”IEEE Transactions on Engineering Management, vol. 71, pp. 12 674–12 685, 2022

  9. [9]

    Efficient high- order spatial interactions for visual perception,

    Z. Liu, Y . Rao, W. Zhao, J. Zhou, and J. Lu, “Efficient high- order spatial interactions for visual perception,”IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025

  10. [10]

    Deep learning for person re-identification: A survey and outlook,

    M. Ye, J. Shen, G. Lin, T. Xiang, L. Shao, and S. C. Hoi, “Deep learning for person re-identification: A survey and outlook,”IEEE transactions on pattern analysis and machine intelligence, vol. 44, no. 6, pp. 2872–2893, 2021

  11. [11]

    Offsetnet: Towards efficient multiple object tracking, detection, and segmentation,

    W. Zhang, J. Li, M. Xia, X. Gao, X. Tan, Y . Shi, Z. Huang, and G. Li, “Offsetnet: Towards efficient multiple object tracking, detection, and segmentation,”IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 47, no. 2, pp. 949–960, 2024

  12. [12]

    Infogcn++: Learning representation by predicting the future for online skeleton-based action recognition,

    S. Chi, H.-G. Chi, Q. Huang, and K. Ramani, “Infogcn++: Learning representation by predicting the future for online skeleton-based action recognition,”IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 47, no. 1, pp. 514–528, 2024

  13. [13]

    Personrank: Detecting important people in images,

    W.-H. Li, B. Li, and W.-S. Zheng, “Personrank: Detecting important people in images,” in2018 13th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2018). IEEE, 2018, pp. 234–241

  14. [14]

    Very important person localization in unconstrained conditions: A new benchmark,

    X. Wang, Z. Wang, T. Yamasaki, and W. Zeng, “Very important person localization in unconstrained conditions: A new benchmark,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, no. 4, 2021, pp. 2809–2816

  15. [15]

    Mip-gaf: A mllm-annotated benchmark for most important person localization and group context understanding,

    S. Madan, S. Ghosh, L. R. Sookha, M. Ganaie, R. Subramanian, A. Dhall, and T. Gedeon, “Mip-gaf: A mllm-annotated benchmark for most important person localization and group context understanding,” in2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV). IEEE, 2025, pp. 1467–1476

  16. [16]

    The Kinetics Human Action Video Dataset

    W. Kay, J. Carreira, K. Simonyan, B. Zhang, C. Hillier, S. Vijaya- narasimhan, F. Viola, T. Green, T. Back, P. Natsevet al., “The kinetics human action video dataset,”arXiv preprint arXiv:1705.06950, 2017

  17. [17]

    Ava: A video dataset of spatio-temporally localized atomic visual actions,

    C. Gu, C. Sun, D. A. Ross, C. V ondrick, C. Pantofaru, Y . Li, S. Vijayanarasimhan, G. Toderici, S. Ricco, R. Sukthankaret al., “Ava: A video dataset of spatio-temporally localized atomic visual actions,” inProceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 6047–6056

  18. [18]

    Sportsmot: A large multi-object tracking dataset in multiple sports scenes,

    Y . Cui, C. Zeng, X. Zhao, Y . Yang, G. Wu, and L. Wang, “Sportsmot: A large multi-object tracking dataset in multiple sports scenes,” in Proceedings of the IEEE/CVF international conference on computer vision, 2023, pp. 9921–9931

  19. [19]

    Towards local visual modeling for image captioning,

    Y . Ma, J. Ji, X. Sun, Y . Zhou, and R. Ji, “Towards local visual modeling for image captioning,”Pattern Recognition, vol. 138, p. 109420, 2023

  20. [20]

    Background activation suppression for weakly supervised object localization and semantic segmentation,

    W. Zhai, P. Wu, K. Zhu, Y . Cao, F. Wu, and Z.-J. Zha, “Background activation suppression for weakly supervised object localization and semantic segmentation,”International Journal of Computer Vision, vol. 132, no. 3, pp. 750–775, 2024

  21. [21]

    Bytetrack: Multi-object tracking by associating every detection box,

    Y . Zhang, P. Sun, Y . Jiang, D. Yu, F. Weng, Z. Yuan, P. Luo, W. Liu, and X. Wang, “Bytetrack: Multi-object tracking by associating every detection box,” inEuropean conference on computer vision. Springer, 2022, pp. 1–21

  22. [22]

    Track to detect and segment: An online multi-object tracker,

    J. Wu, J. Cao, L. Song, Y . Wang, M. Yang, and J. Yuan, “Track to detect and segment: An online multi-object tracker,” inProceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2021, pp. 12 352–12 361

  23. [23]

    Samba: A unified mamba-based framework for general salient object detection,

    J. He, K. Fu, X. Liu, and Q. Zhao, “Samba: A unified mamba-based framework for general salient object detection,” inProceedings of the Computer Vision and Pattern Recognition Conference, 2025, pp. 25 314–25 324

  24. [24]

    Mgfn: Magnitude-contrastive glance-and-focus network for weakly- supervised video anomaly detection,

    Y . Chen, Z. Liu, B. Zhang, W. Fok, X. Qi, and Y .-C. Wu, “Mgfn: Magnitude-contrastive glance-and-focus network for weakly- supervised video anomaly detection,” inProceedings of the AAAI conference on artificial intelligence, vol. 37, no. 1, 2023, pp. 387–395

  25. [25]

    Mutuality attribute makes better video anomaly detection,

    X. Han, X. Wang, K. Jiang, W. Liu, R. Hu, X. Pan, and X. Xu, “Mutuality attribute makes better video anomaly detection,” inICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2024, pp. 2670–2674

  26. [26]

    Blip-2: Bootstrapping language- image pre-training with frozen image encoders and large language models,

    J. Li, D. Li, S. Savarese, and S. Hoi, “Blip-2: Bootstrapping language- image pre-training with frozen image encoders and large language models,” inInternational conference on machine learning. PMLR, 2023, pp. 19 730–19 742

  27. [27]

    Tinyllava: A framework of small-scale large multimodal models,

    B. Zhou, Y . Hu, X. Weng, J. Jia, J. Luo, X. Liu, J. Wu, and L. Huang, “Tinyllava: A framework of small-scale large multimodal models,” arXiv preprint arXiv:2402.14289, 2024

  28. [28]

    Jrdb- act: A large-scale dataset for spatio-temporal action, social group and activity detection,

    M. Ehsanpour, F. Saleh, S. Savarese, I. Reid, and H. Rezatofighi, “Jrdb- act: A large-scale dataset for spatio-temporal action, social group and activity detection,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 20 983–20 992

  29. [29]

    Msr-vtt: A large video description dataset for bridging video and language,

    J. Xu, T. Mei, T. Yao, and Y . Rui, “Msr-vtt: A large video description dataset for bridging video and language,” inProceedings of the IEEE IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE (UNDER REVIEW) 15 conference on computer vision and pattern recognition, 2016, pp. 5288–5296

  30. [30]

    Role of group level affect to find the most influential person in images,

    S. Ghosh and A. Dhall, “Role of group level affect to find the most influential person in images,” inProceedings of the European Conference on Computer Vision (ECCV) Workshops, 2018, pp. 518– 533

  31. [31]

    Learning to learn relation for important people detection in still images,

    W.-H. Li, F.-T. Hong, and W. Zheng, “Learning to learn relation for important people detection in still images,”2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4998–5006, 2019

  32. [32]

    Learning to detect important people in unlabelled images for semi-supervised important people detection,

    F.-T. Hong, W.-H. Li, and W.-S. Zheng, “Learning to detect important people in unlabelled images for semi-supervised important people detection,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 4146–4154

  33. [33]

    Towards causality inference for very important person localization,

    X. Wang, Z. Wang, W. Liu, X. Xu, Q. Zhao, and S. Satoh, “Towards causality inference for very important person localization,” inProceed- ings of the 30th ACM International Conference on Multimedia, 2022, pp. 6618–6626

  34. [34]

    Impact: Interpretable most important person analysis and classification using transformer- based models,

    A. Rampuria, K. P. Nayak, K. V . Thakare, T. Joshi, A. D. Singh, H. Park, H. Choi, D. P. Dogra, and I.-J. Kim, “Impact: Interpretable most important person analysis and classification using transformer- based models,” inProceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2026, pp. 8179–8187

  35. [35]

    A survey on evaluation of large language models,

    Y . Chang, X. Wang, J. Wang, Y . Wu, L. Yang, K. Zhu, H. Chen, X. Yi, C. Wang, Y . Wanget al., “A survey on evaluation of large language models,”ACM transactions on intelligent systems and technology, vol. 15, no. 3, pp. 1–45, 2024

  36. [36]

    Mask r-cnn,

    K. He, G. Gkioxari, P. Doll ´ar, and R. Girshick, “Mask r-cnn,” in Proceedings of the IEEE international conference on computer vision, 2017, pp. 2961–2969

  37. [37]

    Space: Finding key-speaker in complex multi-person scenes,

    H. Zhao, W. Min, J. Xu, Q. Han, W. Li, Q. Wang, Z. Yang, and L. Zhou, “Space: Finding key-speaker in complex multi-person scenes,”IEEE Transactions on Emerging Topics in Computing, vol. 10, no. 3, pp. 1645–1656, 2021

  38. [38]

    Simple online and realtime tracking with a deep association metric,

    N. Wojke, A. Bewley, and D. Paulus, “Simple online and realtime tracking with a deep association metric,” in2017 IEEE international conference on image processing (ICIP). IEEE, 2017, pp. 3645–3649

  39. [39]

    Track- former: Multi-object tracking with transformers,

    T. Meinhardt, A. Kirillov, L. Leal-Taixe, and C. Feichtenhofer, “Track- former: Multi-object tracking with transformers,” inProceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022, pp. 8844–8854

  40. [40]

    Observation- centric sort: Rethinking sort for robust multi-object tracking,

    J. Cao, J. Pang, X. Weng, R. Khirodkar, and K. Kitani, “Observation- centric sort: Rethinking sort for robust multi-object tracking,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2023, pp. 9686–9696

  41. [41]

    Motchallenge: A benchmark for single- camera multiple target tracking: P. dendorfer et al

    P. Dendorfer, A. Osep, A. Milan, K. Schindler, D. Cremers, I. Reid, S. Roth, and L. Leal-Taix ´e, “Motchallenge: A benchmark for single- camera multiple target tracking: P. dendorfer et al.”International Journal of Computer Vision, vol. 129, no. 4, pp. 845–881, 2021

  42. [42]

    Slowfast networks for video recognition,

    C. Feichtenhofer, H. Fan, J. Malik, and K. He, “Slowfast networks for video recognition,” inProceedings of the IEEE/CVF international conference on computer vision, 2019, pp. 6202–6211

  43. [43]

    Quo vadis, action recognition? a new model and the kinetics dataset,

    J. Carreira and A. Zisserman, “Quo vadis, action recognition? a new model and the kinetics dataset,” inproceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 6299–6308

  44. [44]

    Vivit: A video vision transformer,

    A. Arnab, M. Dehghani, G. Heigold, C. Sun, M. Lu ˇci´c, and C. Schmid, “Vivit: A video vision transformer,” inProceedings of the IEEE/CVF international conference on computer vision, 2021, pp. 6836–6846

  45. [45]

    Is space-time attention all you need for video understanding?

    G. Bertasius, H. Wang, and L. Torresani, “Is space-time attention all you need for video understanding?” inIcml, vol. 2, no. 3, 2021, p. 4

  46. [46]

    Videomae: Masked autoencoders are data-efficient learners for self-supervised video pre- training,

    Z. Tong, Y . Song, J. Wang, and L. Wang, “Videomae: Masked autoencoders are data-efficient learners for self-supervised video pre- training,”Advances in neural information processing systems, vol. 35, pp. 10 078–10 093, 2022

  47. [47]

    UCF101: A Dataset of 101 Human Actions Classes From Videos in The Wild

    K. Soomro, A. R. Zamir, and M. Shah, “Ucf101: A dataset of 101 human actions classes from videos in the wild,”arXiv preprint arXiv:1212.0402, 2012

  48. [48]

    The” something something

    R. Goyal, S. Ebrahimi Kahou, V . Michalski, J. Materzynska, S. West- phal, H. Kim, V . Haenel, I. Fruend, P. Yianilos, M. Mueller-Freitag et al., “The” something something” video database for learning and evaluating visual common sense,” inProceedings of the IEEE interna- tional conference on computer vision, 2017, pp. 5842–5850

  49. [49]

    Motionllm: Understanding human behaviors from human motions and videos,

    L.-H. Chen, S. Lu, A. Zeng, H. Zhang, B. Wang, R. Zhang, and L. Zhang, “Motionllm: Understanding human behaviors from human motions and videos,”IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025

  50. [50]

    Actionart: Advancing multimodal large models for fine-grained human-centric video understanding,

    Y .-X. Peng, Q. Yang, Y .-M. Tang, S. Fu, K.-Y . Lin, X. Wei, and W.-S. Zheng, “Actionart: Advancing multimodal large models for fine-grained human-centric video understanding,”arXiv preprint arXiv:2504.18152, 2025

  51. [51]

    DINOv2: Learning Robust Visual Features without Supervision

    M. Oquab, T. Darcet, T. Moutakanni, H. V o, M. Szafraniec, V . Khali- dov, P. Fernandez, D. Haziza, F. Massa, A. El-Noubyet al., “Dinov2: Learning robust visual features without supervision,”arXiv preprint arXiv:2304.07193, 2023

  52. [52]

    Segment anything,

    A. Kirillov, E. Mintun, N. Ravi, H. Mao, C. Rolland, L. Gustafson, T. Xiao, S. Whitehead, A. C. Berg, W.-Y . Loet al., “Segment anything,” inProceedings of the IEEE/CVF international conference on computer vision, 2023, pp. 4015–4026

  53. [53]

    Referring image segmentation using text supervision,

    F. Liu, Y . Liu, Y . Kong, K. Xu, L. Zhang, B. Yin, G. Hancke, and R. Lau, “Referring image segmentation using text supervision,” in Proceedings of the IEEE/CVF international conference on computer vision, 2023, pp. 22 124–22 134

  54. [54]

    Video action transformer network,

    R. Girdhar, J. Carreira, C. Doersch, and A. Zisserman, “Video action transformer network,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2019, pp. 244–253

  55. [55]

    Social scene understanding: End-to-end multi-person action localization and collective activity recognition,

    T. Bagautdinov, A. Alahi, F. Fleuret, P. Fua, and S. Savarese, “Social scene understanding: End-to-end multi-person action localization and collective activity recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 4315–4324

  56. [56]

    Learning actor relation graphs for group activity recognition,

    J. Wu, L. Wang, L. Wang, J. Guo, and G. Wu, “Learning actor relation graphs for group activity recognition,” in Proceedings of the IEEE/CVF Conference on computer vision and pattern recognition, 2019, pp. 9964–9974

  57. [57]

    Congnn: Context-consistent cross-graph neural network for group emotion recognition in the wild,

    Y. Wang, S. Zhou, Y. Liu, K. Wang, F. Fang, and H. Qian, “Congnn: Context-consistent cross-graph neural network for group emotion recognition in the wild,” Information Sciences, vol. 610, pp. 707–724, 2022

  58. [58]

    Social relation recognition from videos via multi-scale spatial-temporal reasoning,

    X. Liu, W. Liu, M. Zhang, J. Chen, L. Gao, C. Yan, and T. Mei, “Social relation recognition from videos via multi-scale spatial-temporal reasoning,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2019, pp. 3566–3574

  59. [59]

    A domain based approach to social relation recognition,

    Q. Sun, B. Schiele, and M. Fritz, “A domain based approach to social relation recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 3481–3490

  60. [60]

    Learning transferable visual models from natural language supervision,

    A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark et al., “Learning transferable visual models from natural language supervision,” in International conference on machine learning. PMLR, 2021, pp. 8748–8763

  61. [61]

    Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation,

    J. Li, D. Li, C. Xiong, and S. Hoi, “Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation,” in International conference on machine learning. PMLR, 2022, pp. 12 888–12 900

  62. [62]

    Instructblip: Towards general-purpose vision-language models with instruction tuning,

    W. Dai, J. Li, D. Li, A. Tiong, J. Zhao, W. Wang, B. Li, P. N. Fung, and S. Hoi, “Instructblip: Towards general-purpose vision-language models with instruction tuning,” Advances in neural information processing systems, vol. 36, pp. 49 250–49 267, 2023

  63. [63]

    Improved baselines with visual instruction tuning,

    H. Liu, C. Li, Y. Li, and Y. J. Lee, “Improved baselines with visual instruction tuning,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2024, pp. 26 296–26 306

  64. [64]

    Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond

    J. Bai, S. Bai, S. Yang, S. Wang, S. Tan, P. Wang, J. Lin, C. Zhou, and J. Zhou, “Qwen-VL: A versatile vision-language model for understanding, localization, text reading, and beyond,” arXiv preprint arXiv:2308.12966, vol. 6, p. 3, 2023

  65. [65]

    Chat-univi: Unified visual representation empowers large language models with image and video understanding,

    P. Jin, R. Takanobu, W. Zhang, X. Cao, and L. Yuan, “Chat-univi: Unified visual representation empowers large language models with image and video understanding,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 13 700–13 710

  66. [66]

    Llava-cot: Let vision language models reason step-by-step,

    G. Xu, P. Jin, Z. Wu, H. Li, Y. Song, L. Sun, and L. Yuan, “Llava-cot: Let vision language models reason step-by-step,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2025, pp. 2087–2098

  67. [67]

    Howto100m: Learning a text-video embedding by watching hundred million narrated video clips,

    A. Miech, D. Zhukov, J.-B. Alayrac, M. Tapaswi, I. Laptev, and J. Sivic, “Howto100m: Learning a text-video embedding by watching hundred million narrated video clips,” in Proceedings of the IEEE/CVF international conference on computer vision, 2019, pp. 2630–2640

  68. [68]

    Actbert: Learning global-local video-text representations,

    L. Zhu and Y. Yang, “Actbert: Learning global-local video-text representations,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020, pp. 8746–8755

  69. [69]

    Frozen in time: A joint video and image encoder for end-to-end retrieval,

    M. Bain, A. Nagrani, G. Varol, and A. Zisserman, “Frozen in time: A joint video and image encoder for end-to-end retrieval,” in Proceedings of the IEEE/CVF international conference on computer vision, 2021, pp. 1728–1738

  70. [70]

    Videoclip: Contrastive pre-training for zero-shot video-text understanding,

    H. Xu, G. Ghosh, P.-Y. Huang, D. Okhonko, A. Aghajanyan, F. Metze, L. Zettlemoyer, and C. Feichtenhofer, “Videoclip: Contrastive pre-training for zero-shot video-text understanding,” in Proceedings of the 2021 conference on empirical methods in natural language processing, 2021, pp. 6787–6800

  71. [71]

    VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding

    B. Zhang, K. Li, Z. Cheng, Z. Hu, Y. Yuan, G. Chen, S. Leng, Y. Jiang, H. Zhang, X. Li et al., “Videollama 3: Frontier multimodal foundation models for image and video understanding,” arXiv preprint arXiv:2501.13106, 2025

  72. [72]

    VideoChat-Flash: Hierarchical Compression for Long-Context Video Modeling

    X. Li, Y. Wang, J. Yu, X. Zeng, Y. Zhu, H. Huang, J. Gao, K. Li, Y. He, C. Wang et al., “Videochat-flash: Hierarchical compression for long-context video modeling,” arXiv preprint arXiv:2501.00574, 2024

  73. [73]

    Timechat: A time-sensitive multimodal large language model for long video understanding,

    S. Ren, L. Yao, S. Li, X. Sun, and L. Hou, “Timechat: A time-sensitive multimodal large language model for long video understanding,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 14 313–14 323

  74. [74]

    Video-llava: Learning united visual representation by alignment before projection,

    B. Lin, Y. Ye, B. Zhu, J. Cui, M. Ning, P. Jin, and L. Yuan, “Video-llava: Learning united visual representation by alignment before projection,” in Proceedings of the 2024 conference on empirical methods in natural language processing, 2024, pp. 5971–5984

  75. [75]

    Llava-st: A multimodal large language model for fine-grained spatial-temporal understanding,

    H. Li, J. Chen, Z. Wei, S. Huang, T. Hui, J. Gao, X. Wei, and S. Liu, “Llava-st: A multimodal large language model for fine-grained spatial-temporal understanding,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025, pp. 8592–8603

  76. [76]

    Video-chatgpt: Towards detailed video understanding via large vision and language models,

    M. Maaz, H. Rasheed, S. Khan, and F. Khan, “Video-chatgpt: Towards detailed video understanding via large vision and language models,” in Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2024, pp. 12 585–12 602

  77. [77]

    Internvideo2: Scaling foundation models for multimodal video understanding,

    Y. Wang, K. Li, X. Li, J. Yu, Y. He, G. Chen, B. Pei, R. Zheng, Z. Wang, Y. Shi et al., “Internvideo2: Scaling foundation models for multimodal video understanding,” in European conference on computer vision. Springer, 2024, pp. 396–416

  78. [78]

    Sharegpt4video: Improving video understanding and generation with better captions,

    L. Chen, X. Wei, J. Li, X. Dong, P. Zhang, Y. Zang, Z. Chen, H. Duan, B. Lin, Z. Tang et al., “Sharegpt4video: Improving video understanding and generation with better captions,” Advances in Neural Information Processing Systems, vol. 37, pp. 19 472–19 495, 2024

  79. [79]

    Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis,

    C. Fu, Y. Dai, Y. Luo, L. Li, S. Ren, R. Zhang, Z. Wang, C. Zhou, Y. Shen, M. Zhang et al., “Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2025, pp. 24 108–24 118

  80. [80]

    Vlmevalkit: An open-source toolkit for evaluating large multi-modality models,

    H. Duan, J. Yang, Y. Qiao, X. Fang, L. Chen, Y. Liu, X. Dong, Y. Zang, P. Zhang, J. Wang et al., “Vlmevalkit: An open-source toolkit for evaluating large multi-modality models,” in Proceedings of the 32nd ACM international conference on multimedia, 2024, pp. 11 198–11 201

Showing first 80 references.