StoryVideoQA: Scaling Deep Video Understanding with a Large-Scale, Multi-Genre and Auto-Generated Dataset
Pith reviewed 2026-06-28 02:03 UTC · model grok-4.3
The pith
VideoQA models cannot maintain long-range character associations or coherent storyline understanding on a new 363K-question benchmark of TV series and movies.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Existing VideoQA approaches excel on factoid questions but cannot sustain long-range character associations or construct coherent understanding of complex storylines when tested at scale; StoryVideoQA supplies the first benchmark large and diverse enough to expose this gap across both TV series and full-length movies, while PlotTree demonstrates that reorganizing video into hierarchical plots enables efficient reasoning over those storylines.
What carries the argument
PlotTree, a video understanding agent that re-organizes long-range video content into a hierarchical plot structure to support storyline reasoning.
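To make the idea of a hierarchical plot structure concrete, here is a minimal sketch. It is not the authors' implementation: the abstract only says that long-range content is re-organized into a hierarchy, so the `PlotNode` class, the story/act/scene levels, and the keyword-overlap relevance test are all assumptions chosen for illustration. A real agent would presumably replace the keyword test with a model-based relevance judgment.

```python
from dataclasses import dataclass, field
from typing import List, Set, Tuple


@dataclass
class PlotNode:
    """One node of a hypothetical plot hierarchy (e.g. story -> act -> scene)."""
    summary: str
    start_s: float
    end_s: float
    children: List["PlotNode"] = field(default_factory=list)


def find_relevant(node: PlotNode, keywords: Set[str]) -> List[Tuple[float, float]]:
    """Depth-first descent that expands only nodes whose summary mentions a keyword.

    Returns the (start_s, end_s) spans of the deepest matching nodes.
    Keyword overlap is a toy stand-in for a learned relevance check (assumption).
    """
    words = set(node.summary.lower().split())
    if not (words & keywords):
        return []  # prune the whole subtree
    hits: List[Tuple[float, float]] = []
    for child in node.children:
        hits.extend(find_relevant(child, keywords))
    # If no child matched, this node is the deepest match.
    return hits or [(node.start_s, node.end_s)]


# Hypothetical two-level tree for a 7,878-second movie.
movie = PlotNode("anna leaves home and confronts her brother", 0, 7878, [
    PlotNode("anna leaves home", 0, 3000, [
        PlotNode("anna packs a bag", 0, 600),
        PlotNode("anna boards the train", 600, 3000),
    ]),
    PlotNode("anna confronts her brother", 3000, 7878),
])

spans = find_relevant(movie, {"brother"})
```

The point of the hierarchy in this sketch is pruning: a question about the brother never touches the first act's scenes. The same property is also a weakness, since a detail missing from a parent summary (here, "train") makes the whole subtree unreachable.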
If this is right
- Current state-of-the-art VideoQA methods cannot fully maintain long-range character associations across extended videos.
- They also cannot construct coherent understanding of complex storylines on the scale of full movies.
- Re-organizing video content into a hierarchical plot structure enables more efficient storyline reasoning.
- The scale and diversity of StoryVideoQA expose limitations not visible in smaller, manually constructed DVU datasets.
Where Pith is reading between the lines
- PlotTree's hierarchical approach may generalize to other long-form sequential reasoning tasks such as multi-document summarization or long-horizon planning.
- If the generation pipeline scales without degradation, similar automatic construction methods could be applied to create benchmarks in adjacent domains like audio stories or game logs.
- The observed failures suggest that future VideoQA architectures may need explicit memory or graph-based structures for character tracking rather than relying solely on transformer attention.
Load-bearing premise
The supervisor-guided multi-agent generation and multi-reviewer voting process produces high-quality, balanced question-answer pairs that accurately reflect complex storylines without systematic bias or hallucination.
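The premise is easier to probe with a concrete picture of what multi-reviewer voting could look like. The sketch below is an assumption throughout: the abstract does not specify the voting rule, so the strict-majority threshold, the `Reviewer` callable type, and the three toy reviewers are hypothetical stand-ins for the model-based reviewers a real pipeline would use.

```python
from typing import Callable, Dict, List

QA = Dict[str, str]
Reviewer = Callable[[QA], bool]  # True means "accept this QA"


def vote(qa: QA, reviewers: List[Reviewer], min_accept: float = 0.5) -> bool:
    """Keep a QA only if strictly more than `min_accept` of reviewers accept it."""
    if not reviewers:
        raise ValueError("at least one reviewer is required")
    accepts = sum(1 for review in reviewers if review(qa))
    return accepts / len(reviewers) > min_accept


def filter_qas(qas: List[QA], reviewers: List[Reviewer], min_accept: float = 0.5) -> List[QA]:
    """Return the QAs that survive the vote, preserving input order."""
    return [qa for qa in qas if vote(qa, reviewers, min_accept)]


# Toy rule-based reviewers (hypothetical).
def has_answer(qa: QA) -> bool:
    return bool(qa["answer"].strip())


def is_question(qa: QA) -> bool:
    return qa["question"].rstrip().endswith("?")


def not_leaky(qa: QA) -> bool:
    # Reject QAs whose answer already appears verbatim in the question.
    return qa["answer"].lower() not in qa["question"].lower()


reviewers = [has_answer, is_question, not_leaky]
good = {"question": "Who does Anna confront at the station?", "answer": "Her brother"}
bad = {"question": "Anna confronts her brother", "answer": "her brother"}
kept = filter_qas([good, bad], reviewers)
```

A vote of this kind filters disagreement among reviewers, not shared error: if every reviewer has the same blind spot, a hallucinated QA passes unanimously, which is why the premise still needs external validation.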
What would settle it
A controlled human audit of several thousand generated question-answer pairs for factual accuracy, storyline fidelity, and absence of bias, followed by re-running the 20 baseline models on the audited subset.
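Drawing the audit set is the first practical step, and a stratified draw keeps the long movies from being swamped by the more numerous TV episodes. The following is a minimal sketch under stated assumptions: the `source` field, its `tv` and `movie` values, and the per-stratum sample size are hypothetical and do not come from the dataset's released schema.

```python
import random
from collections import defaultdict
from typing import Dict, List


def stratified_sample(qas: List[dict], key: str, per_stratum: int, seed: int = 0) -> Dict[str, List[dict]]:
    """Draw up to `per_stratum` items uniformly at random from each group of `qas`.

    Groups are defined by the value stored under `key` (e.g. "tv" vs "movie").
    A fixed seed makes the audit set reproducible.
    """
    rng = random.Random(seed)
    groups: Dict[str, List[dict]] = defaultdict(list)
    for qa in qas:
        groups[qa[key]].append(qa)
    sample = {}
    for name, items in groups.items():
        k = min(per_stratum, len(items))  # small strata are taken whole
        sample[name] = rng.sample(items, k)
    return sample


# Hypothetical pool: 20 TV items and 10 movie items.
pool = [{"id": i, "source": "movie" if i % 3 == 0 else "tv"} for i in range(30)]
audit = stratified_sample(pool, key="source", per_stratum=5)
```

Re-running the baseline models on exactly the sampled IDs then lets the audited accuracy and the model scores be compared on the same items.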
Original abstract
Video question answering (VideoQA) aims to answer questions about given videos. While existing approaches excel on factoid VideoQA, they struggle with deep video understanding (DVU), which requires the comprehension of complex storylines. This challenge arises from the inherent long-range video content, multi-faceted question types, and instance-level story elements, all of which constrain the scale and diversity of manually constructed DVU datasets. These difficulties constrain the scale and diversity of manually-constructed DVU dataset. To address these, we previously introduced StoryMind to automatically construct DVU datasets with balanced fine-grained topics. Though it can generate high-quality question-answer pairs (QAs) for TV series, it suffers significant performance degradation when handling longer and more complex movies. In this paper, we further design StoryMindv2, an enhanced multi-agent collaboration framework to generate high-quality DVU datasets for both TV series and movies. By integrating a novel supervisor-guided generation mechanism and a refined multi-reviewer voting strategy, the framework is utilized to construct StoryVideoQA, the largest DVU dataset to date, featuring over 363K QAs on 393.2 hours diverse story videos including TV series (avg. 1,635 seconds) and movies (avg. 7,878 seconds). Comprehensive evaluations of 20 state-of-the-art VideoQA methods on this large-scale benchmark reveal that they cannot fully maintain long-range character associations or construct a coherent understanding of complex storylines. To bridge this gap, we propose PlotTree, a novel video understanding agent, re-organizing long-range video content into a hierarchical plot structure, enabling efficient storyline reasoning on StoryVideoQA. Project page: https://github.com/nercms-mmap/StoryVideoQA/
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces StoryVideoQA, which it claims is the largest DVU dataset to date, with over 363K QAs spanning 393.2 hours of TV series (avg. 1,635 s) and movies (avg. 7,878 s). The dataset is generated by StoryMindv2, an enhanced multi-agent framework with supervisor-guided generation and multi-reviewer voting. The paper reports that 20 SOTA VideoQA models evaluated on this benchmark fail to maintain long-range character associations or coherent storyline understanding, and it proposes PlotTree, a hierarchical plot-structure agent, to enable better reasoning.
Significance. If the dataset quality holds and the evaluations are robust, the work would deliver a valuable large-scale, multi-genre benchmark exposing clear limitations in current VideoQA methods for complex narratives, while the PlotTree agent offers a concrete architectural direction for long-range reasoning; the scale and genre diversity would be a notable contribution to the field.
major comments (2)
- [Abstract and StoryMindv2 construction section] Abstract and the section on StoryMindv2 / StoryVideoQA construction: the headline claim that 20 models 'cannot fully maintain long-range character associations or construct a coherent understanding of complex storylines' rests entirely on the fidelity of the 363K auto-generated QAs, yet no human validation, inter-annotator agreement, or accuracy comparison against manually curated subsets is reported (especially for the 7878-second movies where v1 degradation was acknowledged). This is load-bearing for both the evaluation results and the motivation for PlotTree.
- [StoryMindv2 / multi-reviewer voting description] The section describing the multi-reviewer voting strategy: the assertion that the refined voting produces 'high-quality' and 'balanced' QAs for movies is presented without any quantitative metrics (e.g., agreement rates, hallucination rates, or external human ratings), leaving the central evaluation results unanchored.
minor comments (2)
- [Abstract] Abstract contains a repeated sentence ('These difficulties constrain the scale and diversity of manually constructed DVU datasets.') and a minor grammatical issue ('manually-constructed DVU dataset' should be plural).
- [Introduction / Related Work] The relationship to the prior StoryMind paper is referenced but the specific quantitative improvements of v2 over v1 on movie-length content are not tabulated or highlighted.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on the validation of StoryVideoQA. We address the two major comments below and agree that additional quantitative evidence is needed to support the dataset quality claims.
- Referee: [Abstract and StoryMindv2 construction section] Abstract and the section on StoryMindv2 / StoryVideoQA construction: the headline claim that 20 models 'cannot fully maintain long-range character associations or construct a coherent understanding of complex storylines' rests entirely on the fidelity of the 363K auto-generated QAs, yet no human validation, inter-annotator agreement, or accuracy comparison against manually curated subsets is reported (especially for the 7878-second movies where v1 degradation was acknowledged). This is load-bearing for both the evaluation results and the motivation for PlotTree.
Authors: We agree that the fidelity of the auto-generated QAs is central to the headline claims and the motivation for PlotTree. StoryMindv2 extends the prior StoryMind framework (which included some validation for TV series) with supervisor-guided generation and multi-reviewer voting specifically to address degradation on longer movies, but the current manuscript does not report new human validation, inter-annotator agreement, or direct accuracy comparisons against manual subsets. We will add a human evaluation study on sampled QAs (stratified across TV and movies), including accuracy rates, inter-annotator agreement, and comparison to manually curated references. This will be incorporated into the revised manuscript.
Revision: yes
- Referee: [StoryMindv2 / multi-reviewer voting description] The section describing the multi-reviewer voting strategy: the assertion that the refined voting produces 'high-quality' and 'balanced' QAs for movies is presented without any quantitative metrics (e.g., agreement rates, hallucination rates, or external human ratings), leaving the central evaluation results unanchored.
Authors: We acknowledge that the manuscript asserts high-quality and balanced QAs from the refined multi-reviewer voting without accompanying quantitative metrics such as agreement rates or hallucination rates. The voting mechanism is intended to enforce consensus and filter issues, but no explicit numbers are provided. In the revision we will report reviewer agreement statistics, any hallucination filtering steps, and tie these to the human evaluation results noted above to better anchor the quality claims.
Revision: yes
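Both responses promise agreement statistics without naming one. As an illustration only, the sketch below computes Cohen's kappa for two annotators, defined as kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e is the agreement expected by chance from each annotator's label frequencies. Choosing kappa is an assumption here; the authors may report a different statistic, and the label lists are made-up data (1 = QA judged correct, 0 = judged incorrect).

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(labels_a: Sequence[Hashable], labels_b: Sequence[Hashable]) -> float:
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or len(labels_a) == 0:
        raise ValueError("both annotators must label the same non-empty item set")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of the two annotators' marginal label rates.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both annotators used one identical label for every item
    return (p_o - p_e) / (1 - p_e)


# Made-up judgments on eight sampled QAs.
annotator_a = [1, 1, 1, 0, 0, 1, 0, 1]
annotator_b = [1, 1, 0, 0, 0, 1, 1, 1]
kappa = cohens_kappa(annotator_a, annotator_b)
```

In this toy case the annotators agree on 6 of 8 items (p_o = 0.75) but chance alone would give p_e = 34/64, so kappa is 7/15, roughly 0.47. Raw agreement rates overstate reliability when one label dominates, which is likely if most generated QAs are correct.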
Circularity Check
No circularity: dataset generation, benchmark evaluations, and the PlotTree proposal form an independent chain.
The paper's derivation proceeds from describing StoryMindv2 (an enhanced multi-agent framework extending prior StoryMind), applying it to produce the 363K-QA StoryVideoQA dataset across TV series and movies, running direct empirical evaluations of 20 existing VideoQA models on that benchmark to measure failures on long-range associations, and introducing PlotTree as a hierarchical agent to address observed gaps. None of these steps reduce by construction to their inputs via self-definition, fitted parameters renamed as predictions, or load-bearing self-citations that are themselves unverified. The self-reference to StoryMind is limited to motivating the v2 improvements and is not invoked as an external uniqueness theorem or to force the evaluation outcomes. The benchmark results and PlotTree design remain falsifiable against the generated data and external models without circular reduction.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Multi-agent collaboration with supervisor guidance and voting produces high-quality, unbiased question-answer pairs that capture complex storylines in long videos.
invented entities (1)
- PlotTree: no independent evidence
Reference graph
Works this paper leans on
-
[1]
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp
Ren, S., Yao, L., Li, S., Sun, X., Hou, L.: Timechat: A time-sensitive multimodal large language model for long video under- standing. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 14313–14323 (2024)
2024
-
[2]
In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sat- tler, T., Varol, G
Huang, D.-A., Liao, S., Radhakrishnan, S., Yin, H., Molchanov, P., Yu, Z., Kautz, J.: Lita: Language instructed temporal- localization assistant. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sat- tler, T., Varol, G. (eds.) Computer Vision – ECCV 2024, pp. 202–218. Springer, Cham (2025)
2024
-
[3]
In: Proceedings of the 41st International Con- ference on Machine Learning, pp
Qian, L., Li, J., Wu, Y., Ye, Y., Fei, H., Chua, T.-S., Zhuang, Y., Tang, S.: Momen- tor: advancing video large language model with fine-grained temporal reasoning. In: Proceedings of the 41st International Con- ference on Machine Learning, pp. 41340– 41356 (2024)
2024
-
[4]
In: Proceedings of the Computer Vision and Pattern Recognition Conference, pp
Lee, M.J., Gong, D., Cho, M.: Video sum- marization with large language models. In: Proceedings of the Computer Vision and Pattern Recognition Conference, pp. 18981– 18991 (2025)
2025
-
[5]
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp
Argaw, D.M., Yoon, S., Heilbron, F.C., Deil- amsalehy, H., Bui, T., Wang, Z., Dernon- court, F., Chung, J.S.: Scaling up video sum- marization pretraining with large language models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8332–8341 (2024)
2024
-
[6]
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp
He, B., Wang, J., Qiu, J., Bui, T., Shrivas- tava, A., Wang, Z.: Align and attend: Multi- modal summarization with dual contrastive losses. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 14867–14878 (2023)
2023
-
[7]
In: Proceedings of the 2025 International Conference on Multimedia Retrieval
Wu, Z., Wang, X., Chang, H., Chen, H., Sun, L., Zhu, W.: Aligning large multimodal 23 model with sequential recommendation via content-behavior guidance. In: Proceedings of the 2025 International Conference on Multimedia Retrieval. ICMR ’25, pp. 1507–
2025
-
[8]
Association for Computing Machinery, New York, NY, USA (2025)
2025
-
[9]
In: MultiMedia Modeling, pp
Gu, G., Wu, Z., He, J., Song, L., Wang, Z., Liang, C.: Talksee: Interactive video retrieval engine using large language model. In: MultiMedia Modeling, pp. 387–393. Springer, Cham (2024)
2024
-
[10]
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, pp
Galanopoulos, D., Goulas, A., Leven- takis, A., Patras, I., Mezaris, V.: An llm framework for long-form video retrieval and audio-visual question answering using qwen2/2.5. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, pp. 3739–3748 (2025)
2025
-
[11]
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recogni- tion, pp
Jin, P., Takanobu, R., Zhang, W., Cao, X., Yuan, L.: Chat-univi: Unified visual repre- sentation empowers large language models with image and video understanding. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recogni- tion, pp. 13700–13710 (2024)
2024
-
[12]
In: Ku, L.-W., Martins, A., Srikumar, V
Maaz, M., Rasheed, H., Khan, S., Khan, F.: Video-ChatGPT: Towards detailed video understanding via large vision and lan- guage models. In: Ku, L.-W., Martins, A., Srikumar, V. (eds.) Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 12585–12602. Association for Computational Linguistics, B...
2024
-
[13]
Li, K., He, Y., Wang, Y., Li, Y., Wang, W., Luo, P., Wang, Y., Wang, L., Qiao, Y.: VideoChat: Chat-Centric Video Under- standing (2024)
2024
-
[14]
Proceedings of the AAAI Conference on Artificial Intel- ligence33, 9127–9134 (2019)
Yu, Z., Xu, D., Yu, J., Yu, T., Zhao, Z., Zhuang, Y., Tao, D.: Activitynet-qa: A dataset for understanding complex web videos via question answering. Proceedings of the AAAI Conference on Artificial Intel- ligence33, 9127–9134 (2019)
2019
-
[15]
In: Pro- ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp
Xiao, J., Shang, X., Yao, A., Chua, T.-S.: Next-qa: Next phase of question-answering to explaining temporal actions. In: Pro- ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 9777–9786 (2021)
2021
-
[16]
In: 2023 IEEE Interna- tional Conference on Multimedia and Expo (ICME), pp
Guo, J., Liang, C., Wang, Z.: Who, what and where: Composite-semantic instance search for story videos. In: 2023 IEEE Interna- tional Conference on Multimedia and Expo (ICME), pp. 858–863 (2023). IEEE
2023
-
[17]
IEEE Transactions on Image Processing34, 1412–1426 (2025)
Guo, J., Lu, A., Wu, Z., Wang, Z., Liang, C.: Who, what, and where: Composite- semantics instance search for story videos. IEEE Transactions on Image Processing34, 1412–1426 (2025)
2025
-
[18]
Proceedings of the AAAI Conference on Artificial Intelligence39(8), 8523–8531 (2025)
Wu, Z., Li, R., Xu, Z., Wang, Z., Xiao, C., Liang, C.: Friendsqa: A new large- scale deep video understanding dataset with fine-grained topic categorization for story videos. Proceedings of the AAAI Conference on Artificial Intelligence39(8), 8523–8531 (2025)
2025
-
[19]
In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp
Tapaswi, M., Zhu, Y., Stiefelhagen, R., Tor- ralba, A., Urtasun, R., Fidler, S.: Movieqa: Understanding stories in movies through question-answering. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4631–4640 (2016)
2016
-
[20]
In: Riloff, E., Chiang, D., Hock- enmaier, J., Tsujii, J
Lei, J., Yu, L., Bansal, M., Berg, T.: TVQA: Localized, compositional video question answering. In: Riloff, E., Chiang, D., Hock- enmaier, J., Tsujii, J. (eds.) Proceedings of the 2018 Conference on Empirical Meth- ods in Natural Language Processing, pp. 1369–1379. Association for Computational Linguistics, Brussels, Belgium (2018)
2018
-
[21]
In: Jurafsky, D., Chai, J., Schluter, N., Tetreault, J
Lei, J., Yu, L., Berg, T., Bansal, M.: TVQA+: Spatio-temporal grounding for video question answering. In: Jurafsky, D., Chai, J., Schluter, N., Tetreault, J. (eds.) Proceedings of the 58th Annual Meeting of the Association for Computational Linguis- tics, pp. 8211–8225. Association for Compu- tational Linguistics, Online (2020) 24
2020
-
[22]
In: Proceedings of the 2020 International Conference on Multimedia Retrieval
Curtis, K., Awad, G., Rajput, S., Sobo- roff, I.: Hlvu: A new challenge to test deep understanding of movies the way humans do. In: Proceedings of the 2020 International Conference on Multimedia Retrieval. ICMR ’20, pp. 355–361. Association for Computing Machinery, New York, NY, USA (2020)
2020
-
[23]
In: Proceedings of the AAAI Conference on Artificial Intelli- gence, vol
Choi, S., On, K.-W., Heo, Y.-J., Seo, A., Jang, Y., Lee, M., Zhang, B.-T.: Dramaqa: Character-centered video story understand- ing with hierarchical qa. In: Proceedings of the AAAI Conference on Artificial Intelli- gence, vol. 35, pp. 1166–1174 (2021)
2021
-
[24]
In: Proceedings of the 17th Conference of the European Chap- ter of the Association for Computational Linguistics, pp
Fung, Y., Wang, H., Wang, T., Kebarighotbi, A., Bansal, M., Ji, H., Natarajan, P.: Deepmaven: Deep question answering on long-distance movie/tv show videos with multimedia knowledge extrac- tion and synthesis. In: Proceedings of the 17th Conference of the European Chap- ter of the Association for Computational Linguistics, pp. 3041–3051 (2023)
2023
-
[25]
arXiv preprint arXiv:2405.08813 (2024)
Rawal, R., Saifullah, K., Basri, R., Jacobs, D., Somepalli, G., Goldstein, T.: Cinepile: A long video question answering dataset and benchmark. arXiv preprint arXiv:2405.08813 (2024)
-
[26]
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp
Song, E., Chai, W., Wang, G., Zhang, Y., Zhou, H., Wu, F., Chi, H., Guo, X., Ye, T., Zhang, Y.,et al.: Moviechat: From dense token to sparse memory for long video understanding. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 18221–18232 (2024)
2024
-
[27]
International Journal of Com- puter Vision133(11), 7726–7747 (2025)
Zhang, H., Dong, L., Liu, Y., Huang, Y., Wang, Y., Wang, L., Qiao, Y.: Lvbench: A benchmark for long-form video understand- ing with versatile multi-modal question answering. International Journal of Com- puter Vision133(11), 7726–7747 (2025)
2025
-
[28]
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recogni- tion (CVPR), pp
Fu, C., Dai, Y., Luo, Y., Li, L., Ren, S., Zhang, R., Wang, Z., Zhou, C., Shen, Y., Zhang, M., Chen, P., Li, Y., Lin, S., Zhao, S., Li, K., Xu, T., Zheng, X., Chen, E., Shan, C., He, R., Sun, X.: Video-mme: The first- ever comprehensive evaluation benchmark of multi-modal llms in video analysis. In: Proceedings of the IEEE/CVF Conference on Computer Visio...
2025
-
[29]
In: Globerson, A., Mackey, L., Belgrave, D., Fan, A., Paquet, U., Tomczak, J., Zhang, C
Wu, H., Li, D., Chen, B., Li, J.: Longvideobench: A benchmark for long-context interleaved video-language understanding. In: Globerson, A., Mackey, L., Belgrave, D., Fan, A., Paquet, U., Tomczak, J., Zhang, C. (eds.) Advances in Neural Information Processing Systems, vol. 37, pp. 28828–28857 (2024)
2024
-
[30]
Video-MMMU: Evaluating Knowledge Acquisition from Multi-Discipline Professional Videos
Hu, K., Wu, P., Pu, F., Xiao, W., Zhang, Y., Yue, X., Li, B., Liu, Z.: Video-mmmu: Eval- uating knowledge acquisition from multi- discipline professional videos. arXiv preprint arXiv:2501.13826 (2025)
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[31]
In: The Thirteenth International Conference on Learning Representations
Chen, G., Liu, Y., Huang, Y., Pei, B., Xu, J., He, Y., Lu, T., Wang, Y., Wang, L.: Cg- bench: Clue-grounded question answering benchmark for long video understanding. In: The Thirteenth International Conference on Learning Representations
-
[32]
Vrbench: A benchmark for multi-step reasoning in long narrative videos,
Yu, J., Wu, Y., Chu, M., Ren, Z., Huang, Z., Chu, P., Zhang, R., He, Y., Li, Q., Li, S., et al.: Vrbench: A benchmark for multi-step reasoning in long narrative videos. arXiv preprint arXiv:2506.10857 (2025)
-
[33]
In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp
Wang, W., He, Z., Hong, W., Cheng, Y., Zhang, X., Qi, J., Ding, M., Gu, X., Huang, S., Xu, B., Dong, Y., Tang, J.: LVBench: An Extreme Long Video Under- standing Benchmark. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp. 22958–22967 (2025)
2025
-
[34]
Image (2020)
He, X., Zhu, W.: Visual question answering from theory to application. Image (2020)
2020
-
[35]
In: Goldberg, Y., Kozareva, Z., Zhang, 25 Y
Zhong, Y., Ji, W., Xiao, J., Li, Y., Deng, W., Chua, T.-S.: Video question answer- ing: Datasets, algorithms and challenges. In: Goldberg, Y., Kozareva, Z., Zhang, 25 Y. (eds.) Proceedings of the 2022 Con- ference on Empirical Methods in Natural Language Processing, pp. 6439–6455. Asso- ciation for Computational Linguistics, Abu Dhabi, United Arab Emirates (2022)
2022
-
[36]
In: Ku, L.-W., Martins, A., Srikumar, V
Nguyen, T., Bin, Y., Xiao, J., Qu, L., Li, Y., Wu, J.Z., Nguyen, C.-D., Ng, S.-K., Luu, A.T.: Video-language understanding: A sur- vey from model architecture, model train- ing, and data perspectives. In: Ku, L.-W., Martins, A., Srikumar, V. (eds.) Findings of the Association for Computational Linguis- tics: ACL 2024, pp. 3636–3657. Association for Comput...
2024
-
[37]
International Journal of Computer Vision, 1–24 (2025)
Xiao, J., Huang, N., Qin, H., Li, D., Li, Y., Zhu, F., Tao, Z., Yu, J., Lin, L., Chua, T.-S., et al.: Videoqa in the era of llms: An empirical study. International Journal of Computer Vision, 1–24 (2025)
2025
-
[38]
In: Proceed- ings of the 25th ACM International Con- ference on Multimedia
Xu, D., Zhao, Z., Xiao, J., Wu, F., Zhang, H., He, X., Zhuang, Y.: Video question answering via gradually refined attention over appearance and motion. In: Proceed- ings of the 25th ACM International Con- ference on Multimedia. MM ’17, pp. 1645–
-
[39]
Association for Computing Machinery, New York, NY, USA (2017)
2017
-
[40]
In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y
Yu, Y., Kim, J., Kim, G.: A joint sequence fusion model for video question answering and retrieval. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) Computer Vision – ECCV 2018, pp. 487–503. Springer, Cham (2018)
2018
-
[41]
In: Webber, B., Cohn, T., He, Y., Liu, Y
Li, L., Chen, Y.-C., Cheng, Y., Gan, Z., Yu, L., Liu, J.: HERO: Hierarchical encoder for Video+Language omni-representation pre- training. In: Webber, B., Cohn, T., He, Y., Liu, Y. (eds.) Proceedings of the 2020 Conference on Empirical Methods in Nat- ural Language Processing (EMNLP), pp. 2046–2065. Association for Computational Linguistics, Online (2020)
2020
-
[42]
8731–8772 (2024)
Liu, Y., Li, S., Liu, Y., Wang, Y., Ren, S., Li, L., Chen, S., Sun, X., Hou, L.: Temp- compass: Do video llms really understand videos? In: Findings of the Association for Computational Linguistics ACL 2024, pp. 8731–8772 (2024)
2024
-
[43]
arXiv preprint arXiv:2406.11303 (2024)
Li, Y., Chen, X., Hu, B., Wang, L., Shi, H., Zhang, M.: Videovista: A versatile bench- mark for video understanding and reason- ing. arXiv preprint arXiv:2406.11303 (2024)
-
[44]
Star: A benchmark for situated reasoning in real-world videos.arXiv preprint arXiv:2405.09711, 2024
Wu, B., Yu, S., Chen, Z., Tenenbaum, J.B., Gan, C.: Star: A benchmark for situated rea- soning in real-world videos. arXiv preprint arXiv:2405.09711 (2024)
-
[45]
Advances in Neural Information Pro- cessing Systems36, 46212–46244 (2023)
Mangalam, K., Akshulakov, R., Malik, J.: Egoschema: A diagnostic benchmark for very long-form video language understand- ing. Advances in Neural Information Pro- cessing Systems36, 46212–46244 (2023)
2023
-
[46]
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp
Li, K., Wang, Y., He, Y., Li, Y., Wang, Y., Liu, Y., Wang, Z., Xu, J., Chen, G., Luo, P., Wang, L., Qiao, Y.: Mvbench: A comprehensive multi-modal video under- standing benchmark. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 22195–22206 (2024)
2024
-
[47]
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp
Grauman, K., Westbury, A., Byrne, E., Chavis, Z., Furnari, A., Girdhar, R., Ham- burger, J., Jiang, H., Liu, M., Liu, X.,et al.: Ego4d: Around the world in 3,000 hours of egocentric video. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 18995–19012 (2022)
2022
-
[48]
In: Proceedings of the 33rd ACM International Conference on Multi- media
Xu, Z., Guo, J., Zhang, C., Wang, Z., Xiao, C., Liang, C.: Quantum interference-inspired who-what-where composite-semantics instance search for story videos. In: Proceedings of the 33rd ACM International Conference on Multi- media. MM ’25, pp. 4166–4174, New York, NY, USA (2025)
2025
-
[49]
IEEE Transactions on Multimedia15(2), 401–414 (2013) 26
Liang, C., Xu, C., Cheng, J., Min, W., Lu, H.: Script-to-movie: A computational frame- work for story movie composition. IEEE Transactions on Multimedia15(2), 401–414 (2013) 26
2013
-
[50]
In: CVPR 2011, pp
Liang, C., Xu, C., Cheng, J., Lu, H.: Tvparser: An automatic tv video parsing method. In: CVPR 2011, pp. 3377–3384 (2011)
2011
-
[51]
In: Proceed- ings of the 30th ACM International Con- ference on Multimedia
Curtis, K., Awad, G., Rajput, S., Soboroff, I.: The acm multimedia 2022 deep video understanding grand challenge. In: Proceed- ings of the 30th ACM International Con- ference on Multimedia. MM ’22, pp. 7075–
2022
-
[52]
Association for Computing Machinery, New York, NY, USA (2022)
2022
-
[53]
In: Proceedings of the 31st ACM International Conference on Multimedia
Curtis, K., Awad, G., Godil, A., Soboroff, I.: The acm multimedia 2023 deep video under- standing grand challenge. In: Proceedings of the 31st ACM International Conference on Multimedia. MM ’23, pp. 9606–9609. Association for Computing Machinery, New York, NY, USA (2023)
2023
-
[54]
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp
Kukleva, A., Tapaswi, M., Laptev, I.: Learn- ing interactions and relationships between movie characters. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 9849– 9858 (2020)
2020
-
[55]
In: Proceedings of the 31st ACM International Conference on Multimedia, pp
Li, R., Guo, J., Li, M., Wu, Z., Liang, C.: A hierarchical deep video understand- ing method with shot-based instance search and large language model. In: Proceedings of the 31st ACM International Conference on Multimedia, pp. 9425–9429 (2023)
2023
-
[56]
In: Proceed- ings of the IEEE/CVF Conference on Com- puter Vision and Pattern Recognition, pp
Li, D., Li, J., Li, H., Niebles, J.C., Hoi, S.C.: Align and prompt: Video-and-language pre- training with entity prompts. In: Proceed- ings of the IEEE/CVF Conference on Com- puter Vision and Pattern Recognition, pp. 4953–4963 (2022)
2022
-
[57]
In: Rogers, A., Boyd-Graber, J., Okazaki, N
Lei, J., Berg, T., Bansal, M.: Revealing single frame bias for video-and-language learning. In: Rogers, A., Boyd-Graber, J., Okazaki, N. (eds.) Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 487–507. Association for Com- putational Linguistics, Toronto, Canada (2023)
2023
-
[58]
In: International Conference on Machine Learning, pp
Xu, H., Ye, Q., Yan, M., Shi, Y., Ye, J., Xu, Y., Li, C., Bi, B., Qian, Q., Wang, W.,et al.: mplug-2: A modularized multi- modal foundation model across text, image and video. In: International Conference on Machine Learning, pp. 38728–38748 (2023). PMLR
2023
-
[59]
In: Pro- ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp
Choi, J., Lee, S., Chu, J., Choi, M., Kim, H.J.: vid-tldr: Training free token merging for light-weight video transformer. In: Pro- ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 18771–18781 (2024)
2024
-
[60]
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recogni- tion (CVPR), pp
Fu, T.-J., Li, L., Gan, Z., Lin, K., Wang, W.Y., Wang, L., Liu, Z.: An empirical study of end-to-end video-language trans- formers with masked visual modeling. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recogni- tion (CVPR), pp. 22898–22909 (2023)
2023
-
[61]
Technical report (2023)
OpenAI: Chatgpt: A language model for conversational ai openai. Technical report (2023)
2023
-
[62]
Gemini: A Family of Highly Capable Multimodal Models
Team, G., Anil, R., Borgeaud, S., Wu, Y., Alayrac, J.-B., Yu, J., Soricut, R., Schalk- wyk, J., Dai, A.M., Hauth, A., et al.: Gem- ini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805 (2023)
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[63]
Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context
Reid, M., Savinov, N., Teplyashin, D., Lep- ikhin, D., Lillicrap, T., Alayrac, J.-b., Sori- cut, R., Lazaridou, A., Firat, O., Schrit- twieser, J., et al.: Gemini 1.5: Unlock- ing multimodal understanding across mil- lions of tokens of context. arXiv preprint arXiv:2403.05530 (2024)
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[64]
The Dawn of LMMs: Preliminary Explorations with GPT-4V(ision)
Yang, Z., Li, L., Lin, K., Wang, J., Lin, C.-C., Liu, Z., Wang, L.: The dawn of lmms: Preliminary explorations with gpt- 4v (ision). arXiv preprint arXiv:2309.17421 9(1) (2023)
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[65]
LLaMA: Open and Efficient Foundation Language Models
Touvron, H., Lavril, T., Izacard, G., Mar- tinet, X., Lachaux, M.-A., Lacroix, T., Rozi` ere, B., Goyal, N., Hambro, E., Azhar, 27 F., et al.: Llama: Open and efficient foun- dation language models. arXiv preprint arXiv:2302.13971 (2023)
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[66]
See https://vicuna
Chiang, W.-L., Li, Z., Lin, Z., Sheng, Y., Wu, Z., Zhang, H., Zheng, L., Zhuang, S., Zhuang, Y., Gonzalez, J.E., et al.: Vicuna: An open-source chatbot impress- ing gpt-4 with 90%* chatgpt quality. See https://vicuna. lmsys. org (accessed 14 April 2023) (2023)
2023
-
[67]
Llama 2: Open Foundation and Fine-Tuned Chat Models
Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y., Bashlykov, N., Batra, S., Bhargava, P., Bhosale, S., et al.: Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288 (2023)
2023
-
[68]
Journal of Machine Learning Research 25(70), 1–53 (2024)
Chung, H.W., Hou, L., Longpre, S., Zoph, B., Tay, Y., Fedus, W., Li, Y., Wang, X., Dehghani, M., Brahma, S., Webson, A., Gu, S.S., Dai, Z., Suzgun, M., Chen, X., Chowdhery, A., Castro-Ros, A., Pellat, M., Robinson, K., Valter, D., Narang, S., Mishra, G., Yu, A., Zhao, V., Huang, Y., Dai, A., Yu, H., Petrov, S., Chi, E.H., Dean, J., Devlin, J., Roberts...
2024
-
[69]
A Survey on Multimodal Large Language Models
Yin, S., Fu, C., Zhao, S., Li, K., Sun, X., Xu, T., Chen, E.: A survey on multimodal large language models. arXiv preprint arXiv:2306.13549 (2023)
2023
-
[70]
BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models
Li, J., Li, D., Savarese, S., Hoi, S.: BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In: Krause, A., Brunskill, E., Cho, K., Engelhardt, B., Sabato, S., Scarlett, J. (eds.) Proceedings of the 40th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 202...
2023
-
[71]
Visual Instruction Tuning
Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. In: Oh, A., Naumann, T., Globerson, A., Saenko, K., Hardt, M., Levine, S. (eds.) Advances in Neural Information Processing Systems, vol. 36, pp. 34892–34916 (2023)
2023
-
[72]
Flamingo: a Visual Language Model for Few-Shot Learning
Alayrac, J.-B., Donahue, J., Luc, P., Miech, A., Barr, I., Hasson, Y., Lenc, K., Mensch, A., Millican, K., Reynolds, M., et al.: Flamingo: a visual language model for few-shot learning. Advances in Neural Information Processing Systems 35, 23716–23736 (2022)
2022
-
[73]
Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding
Zhang, H., Li, X., Bing, L.: Video-LLaMA: An instruction-tuned audio-visual language model for video understanding. arXiv preprint arXiv:2306.02858 (2023)
2023
-
[74]
Video-LLaVA: Learning United Visual Representation by Alignment Before Projection
Lin, B., Ye, Y., Zhu, B., Cui, J., Ning, M., Jin, P., Yuan, L.: Video-LLaVA: Learning united visual representation by alignment before projection. In: Al-Onaizan, Y., Bansal, M., Chen, Y.-N. (eds.) Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pp. 5971–5984. Association for Computational Linguistics, Miami, ...
2024
-
[75]
VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs
Cheng, Z., Leng, S., Zhang, H., Xin, Y., Li, X., Chen, G., Zhu, Y., Zhang, W., Luo, Z., Zhao, D., Bing, L.: VideoLLaMA 2: Advancing spatial-temporal modeling and audio understanding in Video-LLMs. arXiv preprint arXiv:2406.07476 (2024)
2024
-
[76]
VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding
Zhang, B., Li, K., Cheng, Z., Hu, Z., Yuan, Y., Chen, G., Leng, S., Jiang, Y., Zhang, H., Li, X., et al.: VideoLLaMA 3: Frontier multimodal foundation models for image and video understanding. arXiv preprint arXiv:2501.13106 (2025)
2025
-
[77]
Long Context Transfer from Language to Vision
Zhang, P., Zhang, K., Li, B., Zeng, G., Yang, J., Zhang, Y., Wang, Z., Tan, H., Li, C., Liu, Z.: Long context transfer from language to vision. arXiv preprint arXiv:2406.16852 (2024)
2024
-
[78]
LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding
Shen, X., Xiong, Y., Zhao, C., Wu, L., Chen, J., Zhu, C., Liu, Z., Xiao, F., Varadarajan, B., Bordes, F., Liu, Z., Xu, H., Kim, H.J., Soran, B., Krishnamoorthi, R., Elhoseiny, M., Chandra, V.: LongVU: Spatiotemporal adaptive compression for long video-language understanding. arXiv preprint arXiv:2410.17434 (2024)
2024
-
[79]
MA-LMM: Memory-Augmented Large Multimodal Model for Long-Term Video Understanding
He, B., Li, H., Jang, Y.K., Jia, M., Cao, X., Shah, A., Shrivastava, A., Lim, S.-N.: MA-LMM: Memory-augmented large multimodal model for long-term video understanding. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 13504–13514 (2024)
2024
-
[80]
AdaCM²: On Understanding Extremely Long-Term Video with Adaptive Cross-Modality Memory Reduction
Man, Y., Huang, Y., Zhang, C., Li, B., Niu, W., Yin, M.: AdaCM²: On understanding extremely long-term video with adaptive cross-modality memory reduction. In: Proceedings of the Computer Vision and Pattern Recognition Conference, pp. 8534–8544 (2025)
2025