pith. machine review for the scientific record.

arxiv: 2410.17434 · v1 · submitted 2024-10-22 · 💻 cs.CV

Recognition: 2 theorem links · Lean Theorem

LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding

Authors on Pith: no claims yet

Pith reviewed 2026-05-16 13:49 UTC · model grok-4.3

classification 💻 cs.CV
keywords long video understanding · spatiotemporal compression · multimodal large language models · token reduction · video-language models · adaptive compression · frame redundancy removal

The pith

LongVU adaptively compresses long videos by removing redundant frames and tokens to fit hour-long clips into limited LLM context.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents LongVU as a way to overcome the context-length limits of multimodal large language models when analyzing extended videos. It reduces video tokens through a three-step process that first drops visually similar frames, then selectively compresses frame features guided by text, and finally reduces spatial tokens using temporal relations between frames. The goal is to keep essential visual details intact while processing many more frames than standard approaches allow. A sympathetic reader would care because this makes practical understanding of long-form video content feasible without requiring ever-larger context windows or sacrificing accuracy. The work also demonstrates that the same compression works when paired with smaller, lighter language models.
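Read generically, the third step (spatial token reduction using temporal relations between frames) amounts to dropping tokens that are unchanged relative to the previous frame. A minimal sketch under that assumption, with an illustrative threshold and toy feature shapes rather than the paper's actual implementation:

```python
import numpy as np

def spatial_token_reduction(frames, thresh=0.95):
    """Keep, for each frame after the first, only the spatial tokens that
    changed relative to the temporally previous frame (cosine similarity
    below `thresh`). A generic reading of the paper's spatial reduction
    step, not its exact algorithm; the threshold is illustrative.
    frames: (T, N, D) array of per-frame spatial token features.
    Returns a list of kept token indices per frame."""
    kept = [list(range(frames.shape[1]))]  # frame 0 keeps all tokens
    for t in range(1, frames.shape[0]):
        prev, cur = frames[t - 1], frames[t]
        sims = (prev * cur).sum(-1) / (
            np.linalg.norm(prev, axis=-1) * np.linalg.norm(cur, axis=-1) + 1e-8)
        kept.append([i for i, s in enumerate(sims) if s < thresh])
    return kept
```

Static spatial positions are represented only once across a run of frames, which is how a fixed context length stretches to many more frames.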

Core claim

LongVU is a spatiotemporal adaptive compression mechanism that reduces the number of video tokens while preserving the visual detail of long videos. It leverages DINOv2 features to remove redundant, highly similar frames, applies a text-guided cross-modal query for selective frame-feature reduction, and performs spatial token reduction across frames based on their temporal dependencies.

What carries the argument

Spatiotemporal adaptive compression that uses cross-modal queries and inter-frame dependencies to cut temporal and spatial redundancy.

If this is right

  • Outperforms prior methods on multiple video understanding benchmarks, with largest gains on hour-long tasks such as VideoMME and MLVU.
  • Maintains strong performance when paired with lightweight LLMs, enabling smaller overall model sizes.
  • Processes large numbers of frames inside a fixed context length with only minor visual information loss.
  • Directly addresses the LLM context-size bottleneck for long video-language tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the compression reliably preserves task-relevant details, similar adaptive reduction could be applied to live video streams for real-time monitoring systems.
  • The approach suggests that combining visual similarity metrics with text guidance could be tested on other sequential data such as audio tracks or time-series sensor inputs.
  • Success with fixed compression steps raises the question of whether jointly training the compression module with the language model would yield further gains on diverse video domains.

Load-bearing premise

DINOv2 similarity and text-guided queries can safely discard frames and tokens without removing information the language model needs for accurate understanding.
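A minimal sketch of the similarity pruning this premise rests on, assuming per-frame DINOv2-style embedding vectors; the greedy rule and the 0.9 threshold are illustrative choices, not values from the paper:

```python
import numpy as np

def prune_redundant_frames(features, threshold=0.9):
    """Greedily keep a frame only if its cosine similarity to the last
    kept frame falls below `threshold`. The embeddings, threshold, and
    greedy rule are illustrative stand-ins for the paper's DINOv2 stage.
    features: (T, D) array of per-frame embeddings.
    Returns the indices of retained frames."""
    keep = [0]
    for i in range(1, len(features)):
        a, b = features[keep[-1]], features[i]
        sim = float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8)
        if sim < threshold:
            keep.append(i)
    return keep
```

Any frame above the threshold relative to the last kept frame is discarded outright, which is precisely where the premise could fail: the discard happens before the language model or the text query ever sees the frame.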

What would settle it

Run LongVU on videos containing subtle but task-critical differences between visually similar frames and measure whether downstream accuracy on a long-video benchmark falls below the uncompressed baseline.
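That check can be phrased as a simple regression guard; the tolerance is an assumed value, since no acceptable accuracy drop is specified here:

```python
def compression_regression(acc_baseline, acc_compressed, tolerance=0.01):
    """True if compressed accuracy falls more than `tolerance` below the
    uncompressed baseline on a long-video benchmark. The 1-point
    tolerance is an illustrative choice, not a value from the paper."""
    return (acc_baseline - acc_compressed) > tolerance
```

Running this per benchmark, on the adversarial subset of visually-similar-but-distinct videos, would settle whether the pruning stage discards answer-critical frames.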

Original abstract

Multimodal Large Language Models (MLLMs) have shown promising progress in understanding and analyzing video content. However, processing long videos remains a significant challenge constrained by LLM's context size. To address this limitation, we propose LongVU, a spatiotemporal adaptive compression mechanism thats reduces the number of video tokens while preserving visual details of long videos. Our idea is based on leveraging cross-modal query and inter-frame dependencies to adaptively reduce temporal and spatial redundancy in videos. Specifically, we leverage DINOv2 features to remove redundant frames that exhibit high similarity. Then we utilize text-guided cross-modal query for selective frame feature reduction. Further, we perform spatial token reduction across frames based on their temporal dependencies. Our adaptive compression strategy effectively processes a large number of frames with little visual information loss within given context length. Our LongVU consistently surpass existing methods across a variety of video understanding benchmarks, especially on hour-long video understanding tasks such as VideoMME and MLVU. Given a light-weight LLM, our LongVU also scales effectively into a smaller size with state-of-the-art video understanding performance.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript proposes LongVU, a spatiotemporal adaptive compression mechanism for long-video understanding in MLLMs. It first uses DINOv2 features to prune redundant frames via similarity, then applies text-guided cross-modal queries for selective frame-feature reduction, and finally performs spatial token reduction across frames based on temporal dependencies. The central claim is that this three-stage process reduces video tokens with negligible visual information loss, enabling effective processing of hour-long videos within LLM context limits and yielding consistent outperformance over prior methods on benchmarks such as VideoMME and MLVU, including when paired with lightweight LLMs.

Significance. If the compression indeed preserves answer-critical content while achieving the reported token reduction, the approach would provide a practical advance for long-video MLLMs by mitigating context-length constraints without requiring architectural changes to the underlying LLM.

major comments (3)
  1. [Abstract] Abstract: the claim of 'little visual information loss' and 'consistent outperformance' is asserted without any quantitative metrics, ablation tables, error bars, or dataset statistics, so the central empirical claim rests on unverified assertions.
  2. [Method] Method description (DINOv2 pruning stage): frame pruning occurs unconditionally on DINOv2 cosine similarity before any text conditioning; this creates an irreversible information bottleneck because visually similar frames (static backgrounds, recurring angles) can contain distinct actions or objects needed to answer a downstream query, yet no quantitative bound or recovery mechanism is supplied.
  3. [Method] Method description (cross-modal and spatial stages): because the first stage has already discarded tokens, the subsequent text-guided query reduction and temporal-dependency pooling have no access to the dropped content; this ordering undermines the claim that the overall pipeline is 'adaptive' with respect to the task.
minor comments (2)
  1. [Abstract] Abstract contains a grammatical error: 'thats reduces' should read 'that reduces'.
  2. [Abstract] Abstract contains a subject-verb agreement error: 'Our LongVU consistently surpass' should read 'surpasses'.
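Major comment 2 can be made concrete with a toy example: two frames whose pooled features are nearly identical even though one carries a small, task-critical difference. The numbers are invented, not DINOv2 output:

```python
import numpy as np

# A uniform background, and the same background with one changed patch.
background = np.ones(16)
with_detail = np.ones(16)
with_detail[0] = 2.0  # the small, task-critical difference

sim = float(background @ with_detail) / (
    np.linalg.norm(background) * np.linalg.norm(with_detail))
# sim is about 0.975, so a similarity threshold of 0.9 would discard
# the frame containing the detail before any text conditioning runs.
```

Because the pruning is unconditional on the query, no later stage can recover the dropped patch, which is the irreversible bottleneck the referee describes.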

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments and recommendation for major revision. We address each major comment point by point below, providing clarifications from the manuscript and indicating where revisions will be made.

Point-by-point responses
  1. Referee: [Abstract] Abstract: the claim of 'little visual information loss' and 'consistent outperformance' is asserted without any quantitative metrics, ablation tables, error bars, or dataset statistics, so the central empirical claim rests on unverified assertions.

    Authors: We acknowledge that the abstract would be improved by incorporating specific quantitative highlights. The full manuscript contains ablation tables, performance comparisons on VideoMME and MLVU with consistent gains over baselines, token reduction statistics, and results across multiple datasets. We will revise the abstract to include key metrics such as average compression ratios and benchmark improvements to better substantiate the claims. revision: yes

  2. Referee: [Method] Method description (DINOv2 pruning stage): frame pruning occurs unconditionally on DINOv2 cosine similarity before any text conditioning; this creates an irreversible information bottleneck because visually similar frames (static backgrounds, recurring angles) can contain distinct actions or objects needed to answer a downstream query, yet no quantitative bound or recovery mechanism is supplied.

    Authors: This is a valid concern about potential information loss in the initial stage. DINOv2 is used to identify broad visual redundancy for scalability in long videos, and the manuscript's ablations show robust downstream performance. However, we agree to strengthen the presentation by adding quantitative analysis of pruning effects, including edge cases with similar frames, and bounds on information retention in the revised method section. revision: partial

  3. Referee: [Method] Method description (cross-modal and spatial stages): because the first stage has already discarded tokens, the subsequent text-guided query reduction and temporal-dependency pooling have no access to the dropped content; this ordering undermines the claim that the overall pipeline is 'adaptive' with respect to the task.

    Authors: The hierarchical ordering enables efficient processing of hour-long videos by first removing global redundancy, after which the text-guided and temporal stages adaptively compress the retained content based on the query. This design choice is explained in the manuscript as necessary for context-length constraints. We will revise the method section to clarify this rationale and add supporting comparisons to alternative orderings. revision: yes

Circularity Check

0 steps flagged

No circularity: method composes external pre-trained features with standard attention

Full rationale

The paper's core mechanism (DINOv2-based frame pruning followed by text-guided cross-modal query reduction and temporal-dependency spatial pooling) is described as a composition of pre-existing components: DINOv2 is an external model, cross-modal queries follow standard attention patterns, and spatial pooling uses inter-frame dependencies without any fitted parameters defined from the target benchmarks. No equations are presented that reduce claimed performance gains on VideoMME/MLVU to quantities defined by the result itself. Evaluation uses external benchmarks and a light-weight LLM without self-referential fitting or load-bearing self-citations. The derivation chain therefore rests on external components and benchmarks rather than referring back to itself.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The approach depends on the effectiveness of pre-trained visual features and query alignment but introduces no new free parameters or invented entities beyond standard architectural choices.

axioms (2)
  • domain assumption DINOv2 features provide a reliable measure of visual similarity for identifying redundant frames
    Invoked in the first compression stage without further justification in the abstract.
  • domain assumption Text-guided cross-modal queries can selectively retain task-relevant visual information
    Central to the second stage; treated as effective without supporting derivation.
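Under a standard attention reading, the second axiom corresponds to scoring visual tokens against a pooled text query and keeping the top-k. The function name, the pooled single-vector query, and the top-k rule are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def text_guided_selection(text_query, visual_tokens, k):
    """Score each visual token by scaled dot-product attention against a
    pooled text query and keep the k highest-scoring tokens, returning
    their indices in original order. A generic cross-modal-query sketch.
    text_query: (D,) pooled text embedding; visual_tokens: (N, D)."""
    d = visual_tokens.shape[-1]
    scores = visual_tokens @ text_query / np.sqrt(d)
    top = np.argsort(scores)[-k:]
    return np.sort(top)
```

The axiom is exactly the claim that these scores correlate with task relevance; nothing in the mechanism itself guarantees it.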

pith-pipeline@v0.9.0 · 5554 in / 1158 out tokens · 37373 ms · 2026-05-16T13:49:27.580289+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • Foundation.DimensionForcing dimension_forced — tag: unclear

    Relation between the paper passage and the cited Recognition theorem.

    "Our LongVU consistently surpass existing methods across a variety of video understanding benchmarks, especially on hour-long video understanding tasks such as VideoMME and MLVU."

What do these tags mean?
  • matches — The paper's claim is directly supported by a theorem in the formal canon.
  • supports — The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends — The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses — The paper appears to rely on the theorem as machinery.
  • contradicts — The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear — Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 21 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Molmo2: Open Weights and Data for Vision-Language Models with Video Understanding and Grounding

    cs.CV 2026-01 unverdicted novelty 8.0

    Molmo2 delivers state-of-the-art open-weight video VLMs with new grounding datasets and training methods that outperform prior open models and match or exceed some proprietary ones on pointing and tracking tasks.

  2. CATS: Curvature Aware Temporal Selection for efficient long video understanding

    cs.CV 2026-05 unverdicted novelty 7.0

    CATS uses temporal curvature of query-frame relevance to select informative frames, achieving 93-95% of heavy multi-stage accuracy at 3-4% of the preprocessing cost on long-video benchmarks.

  3. LookWhen? Fast Video Recognition by Learning When, Where, and What to Compute

    cs.CV 2026-05 conditional novelty 7.0

    LookWhen factorizes video recognition into learning when, where, and what to compute via uniqueness-based token selection and dual-teacher distillation, achieving better accuracy-FLOPs trade-offs than baselines on mul...

  4. VideoRouter: Query-Adaptive Dual Routing for Efficient Long-Video Understanding

    cs.CV 2026-05 unverdicted novelty 7.0

    VideoRouter uses query-adaptive semantic and image routers plus new training datasets to reduce visual tokens by up to 67.9% while improving performance over the InternVL baseline on long-video benchmarks.

  5. Sink-Token-Aware Pruning for Fine-Grained Video Understanding in Efficient Video LLMs

    cs.LG 2026-04 unverdicted novelty 7.0

    Sink-Token-aware Pruning (SToP) suppresses semantically uninformative sink tokens during visual token pruning in Video LLMs, boosting fine-grained performance even at 90% pruning rates across hallucination, reasoning,...

  6. OmniScript: Towards Audio-Visual Script Generation for Long-Form Cinematic Video

    cs.CV 2026-04 unverdicted novelty 7.0

    OmniScript is a new 8B omni-modal model that turns long cinematic videos into scene-by-scene scripts and matches top proprietary models on temporal localization and semantic accuracy.

  7. Mosaic: Cross-Modal Clustering for Efficient Video Understanding

    cs.PF 2026-04 unverdicted novelty 7.0

    Mosaic uses cross-modal clusters as the unit for KVCache organization in VLMs to achieve up to 1.38x speedup in streaming long-video understanding.

  8. SVAgent: Storyline-Guided Long Video Understanding via Cross-Modal Multi-Agent Collaboration

    cs.CV 2026-04 unverdicted novelty 7.0

    SVAgent improves long video question answering by constructing storylines via multi-agent collaboration and aligning cross-modal predictions for more robust, human-like reasoning.

  9. LongVideo-R1: Smart Navigation for Low-cost Long Video Understanding

    cs.CV 2026-02 unverdicted novelty 7.0

    LongVideo-R1 trains a reasoning agent on 33K trajectories to intelligently select informative video clips via iterative refinement and RL, achieving better accuracy-efficiency tradeoffs on long video QA benchmarks.

  10. Training Long-Context Vision-Language Models Effectively with Generalization Beyond 128K Context

    cs.CV 2026-05 unverdicted novelty 6.0

    Continued pre-training with balanced long-document VQA data extends a 7B LVLM to 128K context, improving long-document VQA by 7.1% and generalizing to 512K without further training.

  11. Response-G1: Explicit Scene Graph Modeling for Proactive Streaming Video Understanding

    cs.CV 2026-05 unverdicted novelty 6.0

    Response-G1 uses query-guided scene graphs, memory retrieval, and augmented prompting to improve when Video-LLMs decide to respond during streaming videos.

  12. VideoRouter: Query-Adaptive Dual Routing for Efficient Long-Video Understanding

    cs.CV 2026-05 unverdicted novelty 6.0

    VideoRouter uses dual semantic and image routers for query-adaptive token compression in long-video models, delivering up to 67.9% reduction while outperforming the InternVL baseline on VideoMME, MLVU, and LongVideoBench.

  13. One Token per Highly Selective Frame: Towards Extreme Compression for Long Video Understanding

    cs.CV 2026-04 unverdicted novelty 6.0

    XComp reaches extreme video compression (one token per selective frame) via learnable progressive token compression and question-conditioned frame selection, lifting LVBench accuracy from 42.9 percent to 46.2 percent ...

  14. Small Vision-Language Models are Smart Compressors for Long Video Understanding

    cs.CV 2026-04 unverdicted novelty 6.0

    Tempo uses a 6B SVLM as a local temporal compressor with training-free adaptive token allocation to achieve SOTA long-video understanding at 0.5-16 tokens per frame, scoring 52.3 on 4101s LVBench under 8K budget.

  15. HERMES: KV Cache as Hierarchical Memory for Efficient Streaming Video Understanding

    cs.CV 2026-01 unverdicted novelty 6.0

    HERMES organizes the KV cache into a hierarchical memory to enable real-time streaming video understanding in MLLMs, achieving 10x faster TTFT and up to 11.4% accuracy gains on streaming benchmarks with 68% fewer tokens.

  16. Streaming Video Instruction Tuning

    cs.CV 2025-12 unverdicted novelty 6.0

    Streamo is a streaming video LLM trained end-to-end on the new Streamo-Instruct-465K dataset that unifies multiple real-time video tasks with claimed strong temporal reasoning and generalization.

  17. OmniRefine: Alignment-Aware Cooperative Compression for Efficient Omnimodal Large Language Models

    cs.AI 2026-05 unverdicted novelty 5.0

    OmniRefine introduces alignment-aware chunk refinement via similarity and dynamic programming followed by modality-cooperative token compression, achieving near-baseline accuracy at 44% token retention on WorldSense.

  18. Response-G1: Explicit Scene Graph Modeling for Proactive Streaming Video Understanding

    cs.CV 2026-05 unverdicted novelty 5.0

    Response-G1 uses query-guided scene graph generation, memory retrieval, and retrieval-augmented prompting to improve proactive response timing in streaming video understanding.

  19. EgoSelf: From Memory to Personalized Egocentric Assistant

    cs.CV 2026-04 unverdicted novelty 5.0

    EgoSelf uses graph-based memory of user interactions to derive personalized profiles and predict future behaviors for egocentric assistants.

  20. Token Reduction via Local and Global Contexts Optimization for Efficient Video Large Language Models

    cs.CV 2026-03 unverdicted novelty 5.0

    AOT reduces visual tokens in VLLMs via intra-frame and inter-frame anchors with local-global optimal transport, delivering competitive benchmark performance and efficiency gains in a training-free way.

  21. VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding

    cs.CV 2025-01 unverdicted novelty 4.0

    VideoLLaMA3 uses a vision-centric training paradigm and token-reduction design to reach competitive results on image and video benchmarks.

Reference graph

Works this paper leans on

35 extracted references · 35 canonical work pages · cited by 19 Pith papers · 23 internal anchors

  1. [1]

    Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone

    Marah Abdin, Sam Ade Jacobs, Ammar Ahmad Awan, Jyoti Aneja, Ahmed Awadallah, Hany Awadalla, Nguyen Bach, Amit Bahree, Arash Bakhtiari, Harkirat Behl, et al. Phi-3 technical report: A highly capable language model locally on your phone.arXiv preprint arXiv:2404.14219,

  2. [2]

    GPT-4 Technical Report

    Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report.arXiv preprint arXiv:2303.08774,

  3. [3]

    Minigpt4-video: Advancing multimodal llms for video understanding with interleaved visual-textual tokens.arXiv preprint arXiv:2404.03413, 2024a

    Kirolos Ataallah, Xiaoqian Shen, Eslam Abdelrahman, Essam Sleiman, Deyao Zhu, Jian Ding, and Mohamed Elhoseiny. Minigpt4-video: Advancing multimodal llms for video understanding with interleaved visual-textual tokens.arXiv preprint arXiv:2404.03413, 2024a. Kirolos Ataallah, Xiaoqian Shen, Eslam Abdelrahman, Essam Sleiman, Mingchen Zhuge, Jian Ding, Deyao ...

  4. [4]

    Token Merging: Your ViT But Faster

    Daniel Bolya, Cheng-Yang Fu, Xiaoliang Dai, Peizhao Zhang, Christoph Feichtenhofer, and Judy Hoffman. Token merging: Your vit but faster.arXiv preprint arXiv:2210.09461,

  5. [5]

    Language Models are Few-Shot Learners

    Tom B Brown. Language models are few-shot learners.arXiv preprint arXiv:2005.14165,

  6. [6]

    MiniGPT-v2: large language model as a unified interface for vision-language multi-task learning

    Jun Chen, Deyao Zhu, Xiaoqian Shen, Xiang Li, Zechun Liu, Pengchuan Zhang, Raghuraman Krishnamoorthi, Vikas Chandra, Yunyang Xiong, and Mohamed Elhoseiny. Minigpt-v2: large language model as a unified interface for vision-language multi-task learning.arXiv preprint arXiv:2310.09478, 2023a. Keqin Chen, Zhao Zhang, Weili Zeng, Richong Zhang, Feng Zhu, and R...

  7. [7]

    InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks

    Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Zhong Muyan, Qinglong Zhang, Xizhou Zhu, Lewei Lu, et al. Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks.arXiv preprint arXiv:2312.14238, 2023c. Zesen Cheng, Sicong Leng, Hang Zhang, Yifei Xin, Xin Li, Guanzheng Chen, Yongxin Zhu, Wenqi Zhang,...

  8. [8]

    Vicuna: An Open-Source Chatbot Impressing GPT-4 with 90%* ChatGPT Quality

    Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E. Gonzalez, Ion Stoica, and Eric P. Xing. Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality, 2023.https://lmsys.org/blog/2023-03-30-vicuna/. 11 Joonmyung Choi, Sanghyeok Lee, Jaewon Chu, Minhyuk Choi, and Hy...

  9. [9]

    Internlm-xcomposer2: Mastering free-form text-image composition and comprehension in vision-language large model.arXiv preprint arXiv:2401.16420,

    Xiaoyi Dong, Pan Zhang, Yuhang Zang, Yuhang Cao, Bin Wang, Linke Ouyang, Xilin Wei, Songyang Zhang, Haodong Duan, Maosong Cao, et al. Internlm-xcomposer2: Mastering free-form text-image composition and comprehension in vision-language large model.arXiv preprint arXiv:2401.16420,

  10. [10]

    Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis

    Chaoyou Fu, Yuhan Dai, Yondong Luo, Lei Li, Shuhuai Ren, Renrui Zhang, Zihan Wang, Chenyu Zhou, Yunhang Shen, Mengdan Zhang, et al. Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis.arXiv preprint arXiv:2405.21075,

  11. [11]

    RULER: What's the Real Context Size of Your Long-Context Language Models?

    Cheng-Ping Hsieh, Simeng Sun, Samuel Kriman, Shantanu Acharya, Dima Rekesh, Fei Jia, and Boris Ginsburg. Ruler: What’s the real context size of your long-context language models?arXiv preprint arXiv:2404.06654,

  12. [12]

    Mixtral of Experts

    Albert Q Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Emma Bou Hanna, Florian Bressand, et al. Mixtral of experts.arXiv:2401.04088,

  13. [13]

    Chat-univi: Unified visual representation empowers large language models with image and video understanding.arXiv preprint arXiv:2311.08046,

    Peng Jin, Ryuichi Takanobu, Caiwan Zhang, Xiaochun Cao, and Li Yuan. Chat-univi: Unified visual representation empowers large language models with image and video understanding.arXiv preprint arXiv:2311.08046,

  14. [14]

    The Kinetics Human Action Video Dataset

    Will Kay, Joao Carreira, Karen Simonyan, Brian Zhang, Chloe Hillier, Sudheendra Vijayanarasimhan, Fabio Viola, Tim Green, Trevor Back, Paul Natsev, et al. The kinetics human action video dataset.arXiv preprint arXiv:1705.06950,

  15. [15]

    LLaVA-OneVision: Easy Visual Task Transfer

    Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Yanwei Li, Ziwei Liu, and Chunyuan Li. Llava-onevision: Easy visual task transfer.arXiv preprint arXiv:2408.03326, 2024a. Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language mod...

  16. [16]

    Video-LLaVA: Learning United Visual Representation by Alignment Before Projection

    Bin Lin, Bin Zhu, Yang Ye, Munan Ning, Peng Jin, and Li Yuan. Video-llava: Learning united visual representation by alignment before projection.arXiv preprint arXiv:2311.10122,

  17. [17]

    World Model on Million-Length Video And Language With Blockwise RingAttention

    12 Hao Liu, Wilson Yan, Matei Zaharia, and Pieter Abbeel. World model on million-length video and language with ringattention. arXiv preprint arXiv:2402.08268, 2024a. Haotian Liu, Chunyuan Li, Yuheng Li, Bo Li, Yuanhan Zhang, Sheng Shen, and Yong Jae Lee. Llava-next: Improved reasoning, ocr, and world knowledge, January 2024b.https://llava-vl.github.io/bl...

  18. [18]

    Valley: Video assistant with large language model enhanced ability.arXiv preprint arXiv:2306.07207,

    Ruipu Luo, Ziwang Zhao, Min Yang, Junwei Dong, Minghui Qiu, Pengcheng Lu, Tao Wang, and Zhongyu Wei. Valley: Video assistant with large language model enhanced ability.arXiv preprint arXiv:2306.07207,

  19. [19]

    Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models

    Muhammad Maaz, Hanoona Rasheed, Salman Khan, and Fahad Shahbaz Khan. Video-chatgpt: Towards detailed video understanding via large vision and language models.arXiv preprint arXiv:2306.05424, 2023a. Muhammad Maaz, Hanoona Rasheed, Salman Khan, and Fahad Shahbaz Khan. Video-chatgpt: Towards detailed video understanding via large vision and language models.a...

  20. [20]

    DINOv2: Learning Robust Visual Features without Supervision

    Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, et al. Dinov2: Learning robust visual features without supervision.arXiv preprint arXiv:2304.07193,

  21. [21]

    Kosmos-2: Grounding Multimodal Large Language Models to the World

    Zhiliang Peng, Wenhui Wang, Li Dong, Yaru Hao, Shaohan Huang, Shuming Ma, and Furu Wei. Kosmos-2: Grounding multimodal large language models to the world.arXiv preprint arXiv:2306.14824,

  22. [22]

    TESTA: Temporal-spatial token aggregation for long-form video-language understanding.arXiv preprint arXiv:2310.19060, 2023a

    Shuhuai Ren, Sishuo Chen, Shicheng Li, Xu Sun, and Lu Hou. TESTA: Temporal-spatial token aggregation for long-form video-language understanding.arXiv preprint arXiv:2310.19060, 2023a. Shuhuai Ren, Linli Yao, Shicheng Li, Xu Sun, and Lu Hou. Timechat: A time-sensitive multimodal large language model for long video understanding.arXiv preprint arXiv:2312.02...

  23. [23]

    Cambrian- 1: A fully open, vision-centric exploration of multimodal llms

    Shengbang Tong, Ellis Brown, Penghao Wu, Sanghyun Woo, Manoj Middepogu, Sai Charitha Akula, Jihan Yang, Shusheng Yang, Adithya Iyer, Xichen Pan, et al. Cambrian-1: A fully open, vision-centric exploration of multimodal llms. arXiv preprint arXiv:2406.16860,

  24. [24]

    Llama 2: Open Foundation and Fine-Tuned Chat Models

    13 Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv:2307.09288,

  25. [25]

    CogVLM: Visual Expert for Pretrained Language Models

    Weihan Wang, Qingsong Lv, Wenmeng Yu, Wenyi Hong, Ji Qi, Yan Wang, Junhui Ji, Zhuoyi Yang, Lei Zhao, Xixuan Song, et al. Cogvlm: Visual expert for pretrained language models.arXiv preprint arXiv:2311.03079,

  26. [26]

    Internvideo2: Scaling video foundation models for multimodal video understanding.arXiv preprint arXiv:2403.15377,

    Yi Wang, Kunchang Li, Xinhao Li, Jiashuo Yu, Yinan He, Guo Chen, Baoqi Pei, Rongkun Zheng, Jilan Xu, Zun Wang, et al. Internvideo2: Scaling video foundation models for multimodal video understanding.arXiv preprint arXiv:2403.15377,

  27. [27]

    Slowfast-llava: A strong training-free baseline for video large language models.arXiv preprint arXiv:2407.15841,

    Mingze Xu, Mingfei Gao, Zhe Gan, Hong-You Chen, Zhengfeng Lai, Haiming Gang, Kai Kang, and Afshin Dehghan. Slowfast-llava: A strong training-free baseline for video large language models.arXiv preprint arXiv:2407.15841,

  28. [28]

    mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality

    Qinghao Ye, Haiyang Xu, Guohai Xu, Jiabo Ye, Ming Yan, Yi Zhou, Junyan Wang, Anwen Hu, Pengcheng Shi, Yaya Shi, Chenliang Li, Yuanhong Xu, Hehong Chen, Junfeng Tian, Qiang Qi, Ji Chao Zhang, and Feiyan Huang. mplug-owl: Modularization empowers large language models with multimodality.arXiv preprint arXiv:2304.14178,

  29. [29]

    CLEVRER: CoLlision Events for Video REpresentation and Reasoning

    Kexin Yi, Chuang Gan, Yunzhu Li, Pushmeet Kohli, Jiajun Wu, Antonio Torralba, and Joshua B Tenenbaum. Clevrer: Collision events for video representation and reasoning.arXiv preprint arXiv:1910.01442,

  30. [30]

    Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding

    Hang Zhang, Xin Li, and Lidong Bing. Video-llama: An instruction-tuned audio-visual language model for video understanding. arXiv preprint arXiv:2306.02858,

  31. [31]

    Long Context Transfer from Language to Vision

    Peiyuan Zhang, Kaichen Zhang, Bo Li, Guangtao Zeng, Jingkang Yang, Yuanhan Zhang, Ziyue Wang, Haoran Tan, Chunyuan Li, and Ziwei Liu. Long context transfer from language to vision.arXiv preprint arXiv:2406.16852, 2024a. Yuanhan Zhang, Bo Li, haotian Liu, Yong jae Lee, Liangke Gui, Di Fu, Jiashi Feng, Ziwei Liu, and Chunyuan Li. Llava-next: A strong zero-s...

  32. [32]

    MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models

    Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, and Mohamed Elhoseiny. Minigpt-4: Enhancing vision-language understanding with advanced large language models.arXiv preprint arXiv:2304.10592,
