pith. machine review for the scientific record.

arxiv: 2404.16994 · v2 · submitted 2024-04-25 · 💻 cs.CV

Recognition: 1 theorem link · Lean Theorem

PLLaVA : Parameter-free LLaVA Extension from Images to Videos for Video Dense Captioning

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 20:18 UTC · model grok-4.3

classification 💻 cs.CV
keywords: PLLaVA · temporal pooling · high-norm feature bias · video dense captioning · video question answering · parameter-free adaptation · image-to-video extension

The pith

A parameter-free temporal pooling strategy lets image-language models extend directly to video dense captioning and question answering without added parameters or heavy retraining.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper demonstrates that directly fine-tuning a pre-trained image-language model on video data with multiple frames as input causes performance to plateau or decline, largely because extreme high-norm visual features dominate the output. A straightforward temporal pooling operation smooths the feature distribution across time, reducing the skew from those outliers while leaving the rest of the model unchanged. This yields PLLaVA, which records 3.48 out of 5 on the VideoChatGPT benchmark and 58.1 percent accuracy on MVBench, exceeding the prior GPT-4V (IG-VLM) results by 9% and 14.5% respectively. The approach matters because full video-language pre-training requires enormous data and compute; the pooling method reuses existing image models at almost no extra cost. Experiments confirm the gains hold for both video question answering and dense captioning tasks.

Core claim

Direct fine-tuning of pre-trained image-language models on video datasets with multiple frames as input leads to performance saturation or even degradation, which the authors trace to the bias introduced by learned high-norm visual features. By applying a simple, parameter-free pooling strategy along the temporal dimension, the feature distribution is smoothed and the influence of extreme features is curtailed, producing the model termed PLLaVA that attains new state-of-the-art results on standard video benchmarks.

What carries the argument

The parameter-free temporal pooling strategy that smooths visual feature distributions along the time axis to counteract high-norm bias.
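To make the mechanism concrete, here is a minimal sketch, assuming per-frame patch-token features from a frozen image encoder and a purely illustrative pooled length; it shows the idea of parameter-free averaging along the time axis, not the authors' released pooling configuration.

```python
# Sketch of parameter-free temporal pooling (illustrative shapes and pooled length,
# not the released PLLaVA code).
import torch
import torch.nn.functional as F

def temporal_pool(frame_features: torch.Tensor, pooled_t: int = 4) -> torch.Tensor:
    """frame_features: (batch, T, num_tokens, dim) visual features from an image encoder.
    Returns (batch, pooled_t, num_tokens, dim); averaging over time smooths the feature
    distribution and damps high-norm outlier tokens without adding any parameters."""
    b, t, n, d = frame_features.shape
    x = frame_features.permute(0, 2, 3, 1).reshape(b * n, d, t)  # (b*n, dim, T) for 1-D pooling
    x = F.adaptive_avg_pool1d(x, pooled_t)
    return x.reshape(b, n, d, pooled_t).permute(0, 3, 1, 2)

# Example: 16 sampled frames, 576 patch tokens, 1024-dim features.
pooled = temporal_pool(torch.randn(1, 16, 576, 1024))  # -> (1, 4, 576, 1024)
```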

If this is right

  • PLLaVA reaches 3.48 out of 5 on VideoChatGPT across five dimensions, 9 percent above the prior GPT-4V IG-VLM result.
  • PLLaVA attains 58.1 percent average accuracy on the 20-task MVBench, 14.5 percent above GPT-4V IG-VLM.
  • The same pooling adaptation improves both video question answering and dense captioning without introducing new parameters.
  • The method requires only the original image-pretraining weights plus a lightweight pooling step, avoiding large-scale video pre-training.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Existing image models may already contain most of the visual knowledge needed for video tasks once temporal feature skew is corrected.
  • The pooling technique could be tested on other static-to-dynamic extensions such as audio or 3-D scene understanding with minimal code changes.
  • Adding a small number of temporal layers on top of the pooled features might further improve long-range video reasoning while preserving the low-cost adaptation path.

Load-bearing premise

The performance drop observed when feeding multiple frames directly arises primarily from high-norm visual feature bias rather than from insufficient temporal modeling capacity or training-data mismatch.

What would settle it

Train the base model on multiple frames after explicitly normalizing or clipping feature norms to remove high-norm bias and measure whether accuracy on VideoChatGPT or MVBench recovers to match or exceed the pooled PLLaVA scores.
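A minimal sketch of that control, assuming per-token L2-norm clipping at a percentile threshold; the threshold, shapes, and clipping rule are assumptions, since the paper specifies no such protocol.

```python
# Control-experiment sketch: clip per-token feature norms while keeping all frames
# and all visual tokens. Percentile threshold and tensor shapes are assumptions.
import torch

def clip_token_norms(frame_features: torch.Tensor, percentile: float = 0.95) -> torch.Tensor:
    """frame_features: (batch, T, num_tokens, dim). Rescales tokens whose L2 norm
    exceeds the chosen percentile, preserving their direction and the full sequence."""
    norms = frame_features.norm(dim=-1, keepdim=True)      # (batch, T, num_tokens, 1)
    threshold = torch.quantile(norms, percentile)           # scalar cutoff over the clip
    scale = torch.clamp(threshold / norms, max=1.0)         # shrink only high-norm tokens
    return frame_features * scale

# Feed the clipped, un-pooled features to the unchanged model with all frames,
# then compare VideoChatGPT / MVBench scores against the pooled PLLaVA variant.
```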

read the original abstract

Vision-language pre-training has significantly elevated performance across a wide range of image-language applications. Yet, the pre-training process for video-related tasks demands exceptionally large computational and data resources, which hinders the progress of video-language models. This paper investigates a straight-forward, highly efficient, and resource-light approach to adapting an existing image-language pre-trained model for dense video understanding. Our preliminary experiments reveal that directly fine-tuning pre-trained image-language models with multiple frames as inputs on video datasets leads to performance saturation or even a drop. Our further investigation reveals that it is largely attributed to the bias of learned high-norm visual features. Motivated by this finding, we propose a simple but effective pooling strategy to smooth the feature distribution along the temporal dimension and thus reduce the dominant impacts from the extreme features. The new model is termed Pooling LLaVA, or PLLaVA in short. PLLaVA achieves new state-of-the-art performance on modern benchmark datasets for both video question-answer and captioning tasks. Notably, on the recent popular VideoChatGPT benchmark, PLLaVA achieves a score of 3.48 out of 5 on average of five evaluated dimensions, exceeding the previous SOTA results from GPT4V (IG-VLM) by 9%. On the latest multi-choice benchmark MVBench, PLLaVA achieves 58.1% accuracy on average across 20 sub-tasks, 14.5% higher than GPT4V (IG-VLM). Code is available at https://pllava.github.io/

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes PLLaVA, a parameter-free temporal pooling extension of the LLaVA image-language model for video dense captioning and question answering. Preliminary experiments show that directly feeding multiple frames to fine-tuned image VLMs causes saturation or drops; the authors attribute this primarily to high-norm visual feature bias and introduce a simple pooling operation along the temporal dimension to smooth feature distributions. The resulting model reports new SOTA numbers on VideoChatGPT (3.48/5 average, +9% over GPT-4V IG-VLM) and MVBench (58.1% average across 20 tasks, +14.5% over GPT-4V IG-VLM).

Significance. If the mechanistic attribution is correct, the work supplies a lightweight, training-free adaptation route that avoids the heavy compute and data costs of native video pre-training while delivering concrete benchmark gains. The parameter-free nature and public code release are clear strengths; however, the absence of controls that isolate norm bias from token-count reduction or averaging effects weakens the causal claim that underpins the method's motivation.

major comments (2)
  1. [Preliminary experiments and motivation section] The attribution of multi-frame performance drops to high-norm visual feature bias is not isolated from confounds. No ablation is reported that applies per-frame norm clipping or normalization while preserving the original frame count and full token sequence length; thus the headline gains cannot be unambiguously credited to bias mitigation rather than reduced token count or distributional averaging.
  2. [Method description (pooling strategy)] The temporal pooling simultaneously reduces effective visual token count and performs averaging; without a controlled comparison that decouples these two effects, it remains unclear which component drives the reported improvements on VideoChatGPT and MVBench.
minor comments (2)
  1. [Abstract and results tables] The abstract and results tables would benefit from explicit reporting of the number of frames and visual tokens used in the PLLaVA vs. direct multi-frame baselines to allow direct comparison of token budgets.
  2. [Motivation section] Minor notation inconsistency: the term 'high-norm visual features' is used without a precise definition or measurement protocol (e.g., L2-norm threshold or per-layer statistics) in the main text.
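For concreteness, one possible measurement protocol is sketched below; the statistics reported and the 3x-mean outlier criterion are assumptions introduced here, not definitions from the paper.

```python
# Illustrative norm-statistics protocol for pinning down "high-norm visual features";
# the chosen statistics and outlier criterion are assumptions, not the paper's.
import torch

def token_norm_stats(visual_tokens: torch.Tensor) -> dict:
    """visual_tokens: (num_tokens, dim) features taken from one layer of the vision tower."""
    norms = visual_tokens.norm(dim=-1)
    return {
        "mean_norm": norms.mean().item(),
        "p99_norm": torch.quantile(norms, 0.99).item(),
        "max_norm": norms.max().item(),
        "outlier_fraction": (norms > 3 * norms.mean()).float().mean().item(),
    }
```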

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below with clarifications on our experimental observations and commit to targeted revisions that strengthen the causal claims without altering the core contributions.

read point-by-point responses
  1. Referee: [Preliminary experiments and motivation section] The attribution of multi-frame performance drops to high-norm visual feature bias is not isolated from confounds. No ablation is reported that applies per-frame norm clipping or normalization while preserving the original frame count and full token sequence length; thus the headline gains cannot be unambiguously credited to bias mitigation rather than reduced token count or distributional averaging.

    Authors: We appreciate this observation. Our Section 3.1 shows clear saturation when scaling frame count, and Section 3.2 links this to high-norm bias through explicit norm statistics and feature visualizations. We agree a controlled ablation applying per-frame norm clipping or normalization while retaining the full token sequence would isolate the bias effect more cleanly from token-count reduction. We will add this experiment (and corresponding discussion) to the revised manuscript. revision: yes

  2. Referee: [Method description (pooling strategy)] The temporal pooling simultaneously reduces effective visual token count and performs averaging; without a controlled comparison that decouples these two effects, it remains unclear which component drives the reported improvements on VideoChatGPT and MVBench.

    Authors: The referee correctly notes the dual action of pooling. Our motivation centers on distributional smoothing to suppress extreme high-norm features rather than token reduction per se. Existing ablations already compare against frame-sampling baselines (which reduce count without averaging) and show inferior results; however, we will add explicit controls that isolate pure averaging (e.g., mean-pool all tokens without downsampling) versus token reduction alone to quantify the contribution of each effect on VideoChatGPT and MVBench. revision: yes
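A rough sketch of the two promised controls (frame sampling with no averaging, and averaging with no token reduction) is given below; the smoothing window, shapes, and function names are illustrative assumptions, not the authors' protocol.

```python
# Ablation controls sketched under assumed shapes (batch, T, num_tokens, dim);
# neither function is from the paper's codebase.
import torch
import torch.nn.functional as F

def frame_sample(x: torch.Tensor, keep_t: int) -> torch.Tensor:
    """Token-count reduction only: keep a uniform subset of frames, no averaging."""
    idx = torch.linspace(0, x.shape[1] - 1, keep_t).round().long()
    return x[:, idx]

def smooth_no_downsample(x: torch.Tensor, window: int = 4) -> torch.Tensor:
    """Averaging only: moving average along time, same number of frames and tokens."""
    b, t, n, d = x.shape
    v = x.permute(0, 2, 3, 1).reshape(b * n, d, t)
    v = F.avg_pool1d(v, kernel_size=window, stride=1, padding=window // 2)[..., :t]
    return v.reshape(b, n, d, t).permute(0, 3, 1, 2)
```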

Circularity Check

0 steps flagged

No significant circularity; empirical method is self-contained

full rationale

The paper's derivation consists of empirical observations from preliminary experiments (performance saturation/drop on multi-frame inputs) followed by an explicit, parameter-free temporal pooling operation whose effects are measured directly on held-out external benchmarks (VideoChatGPT, MVBench). No equations, fitted parameters, or self-citations reduce the reported gains to quantities defined inside the same experiment; the pooling strategy is introduced as a straightforward mitigation without invoking uniqueness theorems or ansatzes from prior self-work. The central claim therefore remains independent of its own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The approach rests on the empirical observation that high-norm features cause saturation; no new mathematical axioms or invented entities are introduced.

axioms (1)
  • domain assumption: High-norm visual features from image-pretrained encoders dominate when multiple frames are concatenated.
    Stated as the cause of performance drop in preliminary experiments.

pith-pipeline@v0.9.0 · 5595 in / 1171 out tokens · 17636 ms · 2026-05-15T20:18:37.088390+00:00 · methodology

discussion (0)


Forward citations

Cited by 20 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. SYNCR: A Cross-Video Reasoning Benchmark with Synthetic Grounding

    cs.CV 2026-05 unverdicted novelty 7.0

    SYNCR benchmark shows leading MLLMs reach only 52.5% average accuracy on cross-video reasoning tasks against an 89.5% human baseline, with major weaknesses in physical and spatial reasoning.

  2. VideoRouter: Query-Adaptive Dual Routing for Efficient Long-Video Understanding

    cs.CV 2026-05 unverdicted novelty 7.0

    VideoRouter uses query-adaptive semantic and image routers plus new training datasets to reduce visual tokens by up to 67.9% while improving performance over the InternVL baseline on long-video benchmarks.

  3. Sink-Token-Aware Pruning for Fine-Grained Video Understanding in Efficient Video LLMs

    cs.LG 2026-04 unverdicted novelty 7.0

    Sink-Token-aware Pruning (SToP) suppresses semantically uninformative sink tokens during visual token pruning in Video LLMs, boosting fine-grained performance even at 90% pruning rates across hallucination, reasoning,...

  4. Learning to See What You Need: Gaze Attention for Multimodal Large Language Models

    cs.CV 2026-05 unverdicted novelty 6.0

    Gaze Attention groups visual embeddings into selectable regions and dynamically restricts attention to task-relevant ones, matching dense baselines with up to 90% fewer visual KV entries via added context tokens.

  5. VideoRouter: Query-Adaptive Dual Routing for Efficient Long-Video Understanding

    cs.CV 2026-05 unverdicted novelty 6.0

    VideoRouter uses dual semantic and image routers for query-adaptive token compression in long-video models, delivering up to 67.9% reduction while outperforming the InternVL baseline on VideoMME, MLVU, and LongVideoBench.

  6. WindowQuant: Mixed-Precision KV Cache Quantization based on Window-Level Similarity for VLMs Inference Optimization

    cs.CV 2026-05 unverdicted novelty 6.0

    WindowQuant performs window-adaptive mixed-precision KV cache quantization guided by similarity to the text prompt, with reordering to enable efficient inference in VLMs.

  7. One Token per Highly Selective Frame: Towards Extreme Compression for Long Video Understanding

    cs.CV 2026-04 unverdicted novelty 6.0

    XComp reaches extreme video compression (one token per selective frame) via learnable progressive token compression and question-conditioned frame selection, lifting LVBench accuracy from 42.9 percent to 46.2 percent ...

  8. HTDC: Hesitation-Triggered Differential Calibration for Mitigating Hallucination in Large Vision-Language Models

    cs.CV 2026-04 unverdicted novelty 6.0

    HTDC mitigates hallucinations in LVLMs by triggering calibration only at hesitation-prone decoding steps via contrasts with visual-nullification and semantic-nullification probes.

  9. Small Vision-Language Models are Smart Compressors for Long Video Understanding

    cs.CV 2026-04 unverdicted novelty 6.0

    Tempo uses a 6B SVLM as a local temporal compressor with training-free adaptive token allocation to achieve SOTA long-video understanding at 0.5-16 tokens per frame, scoring 52.3 on 4101s LVBench under 8K budget.

  10. STEAR: Layer-Aware Spatiotemporal Evidence Intervention for Hallucination Mitigation in Video Large Language Models

    cs.CV 2026-04 unverdicted novelty 6.0

    STEAR reduces spatial and temporal hallucinations in Video-LLMs via layer-aware evidence intervention from middle decoder layers in a single-encode pass.

  11. Progressive Video Condensation with MLLM Agent for Long-form Video Understanding

    cs.CV 2026-04 unverdicted novelty 6.0

    ProVCA progressively condenses long videos via segment localization, snippet selection, and keyframe refinement to achieve SOTA zero-shot accuracies on EgoSchema, NExT-QA, and IntentQA with fewer frames.

  12. GraphThinker: Reinforcing Temporally Grounded Video Reasoning with Event Graph Thinking

    cs.CV 2026-02 unverdicted novelty 6.0

    GraphThinker reduces temporal hallucinations in video reasoning by constructing event-based scene graphs and applying visual attention rewards in reinforcement finetuning.

  13. LLaVA-Video: Video Instruction Tuning With Synthetic Data

    cs.CV 2024-10 unverdicted novelty 6.0

    LLaVA-Video-178K is a new synthetic video instruction dataset that, when combined with existing data to train LLaVA-Video, produces strong results on video understanding benchmarks.

  14. Token Reduction via Local and Global Contexts Optimization for Efficient Video Large Language Models

    cs.CV 2026-03 unverdicted novelty 5.0

    AOT reduces visual tokens in VLLMs via intra-frame and inter-frame anchors with local-global optimal transport, delivering competitive benchmark performance and efficiency gains in a training-free way.

  15. Open-Sora: Democratizing Efficient Video Production for All

    cs.CV 2024-12 unverdicted novelty 5.0

    Open-Sora releases an open-source video generation model based on a Spatial-Temporal Diffusion Transformer that decouples spatial and temporal attention, supporting text-to-video, image-to-video, and text-to-image tas...

  16. LLaVA-OneVision: Easy Visual Task Transfer

    cs.CV 2024-08 unverdicted novelty 5.0

    LLaVA-OneVision is the first single open LMM to simultaneously achieve strong performance in single-image, multi-image, and video scenarios with cross-scenario transfer capabilities.

  17. Empowering Video Translation using Multimodal Large Language Models

    cs.CV 2026-04 unverdicted novelty 4.0

    The paper offers the first focused review of MLLM-based video translation organized by a three-role taxonomy of Semantic Reasoner, Expressive Performer, and Visual Synthesizer, plus open challenges.

  18. Show-o2: Improved Native Unified Multimodal Models

    cs.CV 2025-06 unverdicted novelty 4.0

    Show-o2 unifies text, image, and video understanding and generation in a single autoregressive-plus-flow-matching model built on 3D causal VAE representations.

  19. VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding

    cs.CV 2025-01 unverdicted novelty 4.0

    VideoLLaMA3 uses a vision-centric training paradigm and token-reduction design to reach competitive results on image and video benchmarks.

  20. VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs

    cs.CV 2024-06 unverdicted novelty 4.0

    VideoLLaMA 2 improves video LLMs via a new STC connector for spatial-temporal dynamics and joint audio training, reaching competitive results on video QA and captioning benchmarks.

Reference graph

Works this paper leans on

53 extracted references · 53 canonical work pages · cited by 19 Pith papers · 8 internal anchors

  1. [1]

    GPT-4 Technical Report

    Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023

  2. [2]

    Frozen in time: A joint video and image encoder for end-to-end retrieval

    Max Bain, Arsha Nagrani, Gül Varol, and Andrew Zisserman. Frozen in time: A joint video and image encoder for end-to-end retrieval. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1728–1738, 2021

  3. [3]

    Videollm: Modeling video sequence with large language models

    Guo Chen, Yin-Dong Zheng, Jiahao Wang, Jilan Xu, Yifei Huang, Junting Pan, Yi Wang, Yali Wang, Yu Qiao, Tong Lu, et al. Videollm: Modeling video sequence with large language models. arXiv preprint arXiv:2305.13292, 2023

  4. [4]

    Evaluating Large Language Models Trained on Code

    Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374, 2021

  5. [5]

    Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality

    Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E Gonzalez, et al. Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality. See https://vicuna. lmsys. org (accessed 14 April 2023), 2(3):6, 2023

  6. [6]

    Zero-shot video question answering with procedural programs

    Rohan Choudhury, Koichiro Niinuma, Kris M. Kitani, and László A. Jeni. Zero-shot video question answering with procedural programs. ArXiv abs/2312.00937, 2023

  7. [7]

    Mixture-of-loras: An efficient multitask tuning for large language models

    Wenfeng Feng, Chuzhan Hao, Yuewei Zhang, Yu Han, and Hao Wang. Mixture-of-loras: An efficient multitask tuning for large language models. arXiv preprint arXiv:2403.03432, 2024

  8. [8]

    The "Something Something" video database for learning and evaluating visual common sense

    Raghav Goyal, Samira Ebrahimi Kahou, Vincent Michalski, Joanna Materzynska, Susanne Westphal, Heuna Kim, Valentin Haenel, Ingo Fruend, Peter Yianilos, Moritz Mueller-Freitag, et al. The "Something Something" video database for learning and evaluating visual common sense. In Proceedings of the IEEE international conference on computer vision, pages 5842– 5...

  9. [9]

    Ego4d: Around the world in 3,000 hours of egocentric video

    Kristen Grauman, Andrew Westbury, Eugene Byrne, Zachary Chavis, Antonino Furnari, Rohit Girdhar, Jackson Hamburger, Hao Jiang, Miao Liu, and Xingyu Liu et al. Ego4d: Around the world in 3,000 hours of egocentric video. IEEE Conf. Comput. Vis. Pattern Recog., pages 18995–19012, 2022

  10. [10]

    Cogagent: A visual language model for gui agents

    Wenyi Hong, Weihan Wang, Qingsong Lv, Jiazheng Xu, Wenmeng Yu, Junhui Ji, Yan Wang, Zihan Wang, Yuxuan Zhang, Juanzi Li, Bin Xu, Yuxiao Dong, Ming Ding, and Jie Tang. Cogagent: A visual language model for gui agents. ArXiv, abs/2312.08914, 2023

  11. [11]

    Lora: Low-rank adaptation of large language models

    Edward J Hu, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, et al. Lora: Low-rank adaptation of large language models. In International Conference on Learning Representations, 2021

  12. [12]

    Vtimellm: Empower llm to grasp video moments, 2023

    Bin Huang, Xin Wang, Hong Chen, Zihan Song, and Wenwu Zhu. Vtimellm: Empower llm to grasp video moments, 2023

  13. [13]

    Lita: Language instructed temporal-localization assistant

    De-An Huang, Shijia Liao, Subhashree Radhakrishnan, Hongxu Yin, Pavlo Molchanov, Zhiding Yu, and Jan Kautz. Lita: Language instructed temporal-localization assistant. arXiv preprint arXiv:2403.19046, 2024

  14. [14]

    Chat-univi: Unified visual representation empowers large language models with image and video understanding

    Peng Jin, Ryuichi Takanobu, Caiwan Zhang, Xiaochun Cao, and Li Yuan. Chat-univi: Unified visual representation empowers large language models with image and video understanding. ArXiv abs/2311.08046, 2024

  15. [15]

    The Kinetics Human Action Video Dataset

    Will Kay, Joao Carreira, Karen Simonyan, Brian Zhang, Chloe Hillier, Sudheendra Vijayanarasimhan, Fabio Viola, Tim Green, Trevor Back, Paul Natsev, et al. The kinetics human action video dataset. arXiv preprint arXiv:1705.06950, 2017

  16. [16]

    An image grid can be worth a video: Zero-shot video question answering using a vlm

    Wonkyun Kim, Changin Choi, Wonseok Lee, and Wonjong Rhee. An image grid can be worth a video: Zero-shot video question answering using a vlm. arXiv preprint arXiv:2403.18406, 2024

  17. [17]

    Handwritten digit recognition with a back-propagation network

    Yann LeCun, Bernhard Boser, John Denker, Donnie Henderson, Richard Howard, Wayne Hubbard, and Lawrence Jackel. Handwritten digit recognition with a back-propagation network. Advances in neural information processing systems, 2, 1989

  18. [18]

    Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models

    Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In International conference on machine learning, pages 19730–19742. PMLR, 2023

  19. [19]

    VideoChat: Chat-Centric Video Understanding

    KunChang Li, Yinan He, Yi Wang, Yizhuo Li, Wenhai Wang, Ping Luo, Yali Wang, Limin Wang, and Yu Qiao. Videochat: Chat-centric video understanding. arXiv preprint arXiv:2305.06355, 2023

  20. [20]

    Mvbench: A comprehensive multi-modal video understanding benchmark

    Kunchang Li, Yali Wang, Yinan He, Yizhuo Li, Yi Wang, Yi Liu, Zun Wang, Jilan Xu, Guo Chen, Ping Luo, et al. Mvbench: A comprehensive multi-modal video understanding benchmark. ArXiv abs/2311.17005, 2023

  21. [21]

    Llama-vid: An image is worth 2 tokens in large language models

    Yanwei Li, Chengyao Wang, and Jiaya Jia. Llama-vid: An image is worth 2 tokens in large language models. ArXiv abs/2311.17043, 2023

  22. [22]

    Tgif: A new dataset and benchmark on animated gif description

    Yuncheng Li, Yale Song, Liangliang Cao, Joel Tetreault, Larry Goldberg, Alejandro Jaimes, and Jiebo Luo. Tgif: A new dataset and benchmark on animated gif description. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4641–4650, 2016

  23. [23]

    Scaling & shifting your features: A new baseline for efficient model tuning

    Dongze Lian, Daquan Zhou, Jiashi Feng, and Xinchao Wang. Scaling & shifting your features: A new baseline for efficient model tuning. Advances in Neural Information Processing Systems, 35:109–123, 2022

  24. [24]

    Llm-grounded diffusion: Enhancing prompt understanding of text-to-image diffusion models with large language models

    Long Lian, Boyi Li, Adam Yala, and Trevor Darrell. Llm-grounded diffusion: Enhancing prompt understanding of text-to-image diffusion models with large language models. arXiv preprint arXiv:2305.13655, 2023

  25. [25]

    Video-LLaVA: Learning United Visual Representation by Alignment Before Projection

    Bin Lin, Bin Zhu, Yang Ye, Munan Ning, Peng Jin, and Li Yuan. Video-llava: Learning united visual representation by alignment before projection. ArXiv abs/2311.10122, 2023

  26. [26]

    Vila: On pre-training for visual language models

    Ji Lin, Hongxu Yin, Wei Ping, Yao Lu, Pavlo Molchanov, Andrew Tao, Huizi Mao, Jan Kautz, Mohammad Shoeybi, and Song Han. Vila: On pre-training for visual language models. arXiv preprint arXiv:2312.07533, 2023

  27. [27]

    Improved baselines with visual instruction tuning

    Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. In NeurIPS 2023 Workshop on Instruction Tuning and Instruction Following, 2023

  28. [28]

    Llava-next: Improved reasoning, ocr, and world knowledge, January 2024

    Haotian Liu, Chunyuan Li, Yuheng Li, Bo Li, Yuanhan Zhang, Sheng Shen, and Yong Jae Lee. Llava-next: Improved reasoning, ocr, and world knowledge, January 2024

  29. [29]

    Visual instruction tuning

    Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. Advances in neural information processing systems, 36, 2024

  30. [30]

    One for all: Video conversation is feasible without video instruction tuning

    Ruyang Liu, Chen Li, Yixiao Ge, Ying Shan, Thomas H Li, and Ge Li. One for all: Video conversation is feasible without video instruction tuning. arXiv preprint arXiv:2309.15785, 2023

  31. [31]

    St-llm: Large language models are effective temporal learners

    Ruyang Liu, Chen Li, Haoran Tang, Yixiao Ge, Ying Shan, and Ge Li. St-llm: Large language models are effective temporal learners. arXiv preprint arXiv:2404.00308, 2024

  32. [32]

    Vista-llama: Reliable video narrator via equal distance to visual tokens

    Fan Ma, Xiaojie Jin, Heng Wang, Yuchen Xian, Jiashi Feng, and Yi Yang. Vista-llama: Reliable video narrator via equal distance to visual tokens. ArXiv abs/2312.08870, 2023

  33. [33]

    Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models

    Muhammad Maaz, Hanoona Rasheed, Salman Khan, and Fahad Shahbaz Khan. Video-chatgpt: Towards detailed video understanding via large vision and language models. arXiv preprint arXiv:2306.05424, 2023

  34. [34]

    ChatGPT

    OpenAI. ChatGPT. https://openai.com/blog/chatgpt, 2023

  35. [35]

    Learning transferable visual models from natural language supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021

  36. [36]

    Llama 2: Early adopters’ utilization of meta’s new open-source pretrained model

    Konstantinos I Roumeliotis, Nikolaos D Tselikas, and Dimitrios K Nasiopoulos. Llama 2: Early adopters’ utilization of meta’s new open-source pretrained model. 2023

  37. [37]

    Moviechat: From dense token to sparse memory for long video understanding

    Enxin Song, Wenhao Chai, Guanhong Wang, Yucheng Zhang, Haoyang Zhou, Feiyang Wu, Xun Guo, Tianbo Ye, Yang Lu, Jenq-Neng Hwang, and Gaoang Wang. Moviechat: From dense token to sparse memory for long video understanding. ArXiv abs/2307.16449, 2023

  38. [38]

    Adapool: Exponential adaptive pooling for information- retaining downsampling

    Alexandros Stergiou and Ronald Poppe. Adapool: Exponential adaptive pooling for information- retaining downsampling. IEEE Transactions on Image Processing, 32:251–266, 2022

  39. [39]

    Vipergpt: Visual inference via python execution for reasoning

    Dídac Surís, Sachit Menon, and Carl Vondrick. Vipergpt: Visual inference via python execution for reasoning. Proceedings of IEEE International Conference on Computer Vision (ICCV), 2023

  40. [40]

    LLaMA: Open and Efficient Foundation Language Models

    Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023

  41. [41]

    Attention is all you need

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017

  42. [42]

    A large cross-modal video retrieval dataset with reading comprehension

    Weijia Wu, Yuzhong Zhao, Zhuang Li, Jiahong Li, Hong Zhou, Mike Zheng Shou, and Xiang Bai. A large cross-modal video retrieval dataset with reading comprehension. arXiv preprint arXiv:2305.03347, 2023

  43. [43]

    Next-qa: Next phase of question-answering to explaining temporal actions

    Junbin Xiao, Xindi Shang, Angela Yao, and Tat-Seng Chua. Next-qa: Next phase of question-answering to explaining temporal actions. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 9777–9786, 2021

  44. [44]

    Video question answering via gradually refined attention over appearance and motion

    Dejing Xu, Zhou Zhao, Jun Xiao, Fei Wu, Hanwang Zhang, Xiangnan He, and Yueting Zhuang. Video question answering via gradually refined attention over appearance and motion. In Proceedings of the 25th ACM international conference on Multimedia, pages 1645–1653, 2017

  45. [45]

    Zero-shot video question answering via frozen bidirectional language models

    Antoine Yang, Antoine Miech, Josef Sivic, Ivan Laptev, and Cordelia Schmid. Zero-shot video question answering via frozen bidirectional language models. Adv. Neural Inform. Process. Syst., 35:124–141, 2022

  46. [46]

    Cat: Enhancing multimodal large language model to answer questions in dynamic audio-visual scenarios

    Qilang Ye, Zitong Yu, Rui Shao, Xinyu Xie, Philip Torr, and Xiaochun Cao. Cat: Enhancing multimodal large language model to answer questions in dynamic audio-visual scenarios. arXiv preprint arXiv:2403.04640, 2024

  47. [47]

    Clevrer: Collision events for video representation and reasoning

    Kexin Yi, Chuang Gan, Yunzhu Li, Pushmeet Kohli, Jiajun Wu, Antonio Torralba, and Joshua B Tenenbaum. Clevrer: Collision events for video representation and reasoning. In International Conference on Learning Representations, 2020

  48. [48]

    Activitynet-qa: A dataset for understanding complex web videos via question answering

    Zhou Yu, Dejing Xu, Jun Yu, Ting Yu, Zhou Zhao, Yueting Zhuang, and Dacheng Tao. Activitynet-qa: A dataset for understanding complex web videos via question answering. In AAAI, pages 9127–9134, 2019

  49. [49]

    A simple llm framework for long-range video question-answering

    Ce Zhang, Taixi Lu, Md Mohaiminul Islam, Ziyang Wang, Shoubin Yu, Mohit Bansal, and Gedas Bertasius. A simple llm framework for long-range video question-answering. ArXiv abs/2312.17235, 2023

  50. [50]

    Video-LLaMA: An instruction-tuned audio-visual language model for video understanding

    Hang Zhang, Xin Li, and Lidong Bing. Video-LLaMA: An instruction-tuned audio-visual language model for video understanding. In Conf. Empirical Methods in Natural Language Processing, pages 543–553, 2023

  51. [51]

    LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attention

    Renrui Zhang, Jiaming Han, Aojun Zhou, Xiangfei Hu, Shilin Yan, Pan Lu, Hongsheng Li, Peng Gao, and Yu Qiao. Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:2303.16199, 2023

  52. [52]

    Towards automatic learning of procedures from web instructional videos

    Luowei Zhou, Chenliang Xu, and Jason Corso. Towards automatic learning of procedures from web instructional videos. In Proceedings of the AAAI Conference on Artificial Intelligence, 2018

  53. [53]

    Minigpt-4: Enhancing vision-language understanding with advanced large language models

    Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, and Mohamed Elhoseiny. Minigpt-4: Enhancing vision-language understanding with advanced large language models. In The Twelfth International Conference on Learning Representations, 2023