LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding
Recognition: 2 Lean theorem links
Pith reviewed 2026-05-16 13:49 UTC · model grok-4.3
The pith
LongVU adaptively compresses long videos by removing redundant frames and tokens to fit hour-long clips into limited LLM context.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
LongVU is a spatiotemporal adaptive compression mechanism that reduces the number of video tokens while preserving the visual detail of long videos. It works in three stages: leveraging DINOv2 features to remove redundant frames with high similarity, applying a text-guided cross-modal query for selective frame-feature reduction, and performing spatial token reduction across frames based on their temporal dependencies.
What carries the argument
Spatiotemporal adaptive compression that uses cross-modal queries and inter-frame dependencies to cut temporal and spatial redundancy.
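As a concrete illustration of the first stage, here is a minimal sketch of similarity-based frame pruning, assuming precomputed per-frame DINOv2 features and a greedy keep/drop pass. The function, the greedy rule, and the 0.9 threshold are illustrative choices, not LongVU's published implementation.

```python
# A minimal sketch, not LongVU's published code: greedy frame pruning on
# precomputed DINOv2 features. The 0.9 threshold is an illustrative choice.
import torch
import torch.nn.functional as F

def prune_redundant_frames(features: torch.Tensor, threshold: float = 0.9) -> list:
    """features: (T, D), one DINOv2 feature vector per sampled frame.
    Returns indices of kept frames: a frame survives only if it is
    sufficiently dissimilar to the most recently kept frame."""
    kept = [0]  # always keep the first frame as an anchor
    for t in range(1, features.size(0)):
        sim = F.cosine_similarity(features[t], features[kept[-1]], dim=0)
        if sim < threshold:  # new visual content appeared: keep this frame
            kept.append(t)
    return kept

# Toy usage: 100 frames of 768-dim features.
print(prune_redundant_frames(torch.randn(100, 768)))
```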
If this is right
- Outperforms prior methods on multiple video understanding benchmarks, with largest gains on hour-long tasks such as VideoMME and MLVU.
- Maintains strong performance when paired with lightweight LLMs, enabling smaller overall model sizes.
- Processes large numbers of frames inside a fixed context length with only minor visual information loss (a back-of-envelope arithmetic sketch follows this list).
- Directly addresses the LLM context-size bottleneck for long video-language tasks.
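To see why some form of compression is unavoidable, consider the token budget under stated assumptions: 1 fps sampling, 144 visual tokens per frame, and an 8k-token LLM context. All three figures are illustrative, not numbers asserted by this review.

```python
# Back-of-envelope context arithmetic under illustrative assumptions:
# 1 fps sampling, 144 visual tokens per frame, an 8k-token LLM budget.
fps, seconds = 1, 3600                    # one hour of video at 1 fps
tokens_per_frame = 144                    # uncompressed per-frame cost
context_budget = 8192                     # tokens available to the LLM

frames = fps * seconds                    # 3,600 frames
uncompressed = frames * tokens_per_frame  # 518,400 tokens
print(f"needed {uncompressed:,} tokens vs budget {context_budget:,}")
print(f"required compression: about {uncompressed / context_budget:.0f}x")
```

Under these assumptions an hour of video overshoots the budget by roughly 63x, which is the gap the three compression stages must close.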
Where Pith is reading between the lines
- If the compression reliably preserves task-relevant details, similar adaptive reduction could be applied to live video streams for real-time monitoring systems.
- The approach suggests that combining visual similarity metrics with text guidance could be tested on other sequential data such as audio tracks or time-series sensor inputs.
- Success with fixed compression steps raises the question of whether jointly training the compression module with the language model would yield further gains on diverse video domains.
Load-bearing premise
DINOv2 similarity and text-guided queries can safely discard frames and tokens without removing information the language model needs for accurate understanding.
What would settle it
Run LongVU on videos containing subtle but task-critical differences between visually similar frames and measure whether downstream accuracy on a long-video benchmark falls below the uncompressed baseline.
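A hypothetical harness for that experiment might look like the following. `set_compression`, `answer`, and the gold-answer fields are invented stand-ins for whatever evaluation code is actually used; nothing here comes from the paper or a released codebase.

```python
# Hypothetical test harness for the proposed experiment: measure accuracy
# with and without adaptive compression on the same question set.
def compare_pipelines(model, clips, questions):
    """Return accuracy per mode on identical (clip, question) pairs."""
    results = {}
    for mode in ("compressed", "uncompressed"):
        model.set_compression(enabled=(mode == "compressed"))
        correct = sum(
            model.answer(clip, q) == q.gold
            for clip, q in zip(clips, questions)
        )
        results[mode] = correct / len(questions)
    return results

# A large gap, results["uncompressed"] - results["compressed"], on clips
# whose answers hinge on subtle inter-frame differences would falsify
# the load-bearing premise.
```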
Original abstract
Multimodal Large Language Models (MLLMs) have shown promising progress in understanding and analyzing video content. However, processing long videos remains a significant challenge constrained by LLM's context size. To address this limitation, we propose LongVU, a spatiotemporal adaptive compression mechanism thats reduces the number of video tokens while preserving visual details of long videos. Our idea is based on leveraging cross-modal query and inter-frame dependencies to adaptively reduce temporal and spatial redundancy in videos. Specifically, we leverage DINOv2 features to remove redundant frames that exhibit high similarity. Then we utilize text-guided cross-modal query for selective frame feature reduction. Further, we perform spatial token reduction across frames based on their temporal dependencies. Our adaptive compression strategy effectively processes a large number of frames with little visual information loss within given context length. Our LongVU consistently surpass existing methods across a variety of video understanding benchmarks, especially on hour-long video understanding tasks such as VideoMME and MLVU. Given a light-weight LLM, our LongVU also scales effectively into a smaller size with state-of-the-art video understanding performance.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes LongVU, a spatiotemporal adaptive compression mechanism for long-video understanding in MLLMs. It first uses DINOv2 features to prune redundant frames via similarity, then applies text-guided cross-modal queries for selective frame-feature reduction, and finally performs spatial token reduction across frames based on temporal dependencies. The central claim is that this three-stage process reduces video tokens with negligible visual information loss, enabling effective processing of hour-long videos within LLM context limits and yielding consistent outperformance over prior methods on benchmarks such as VideoMME and MLVU, including when paired with lightweight LLMs.
Significance. If the compression indeed preserves answer-critical content while achieving the reported token reduction, the approach would provide a practical advance for long-video MLLMs by mitigating context-length constraints without requiring architectural changes to the underlying LLM.
major comments (3)
- [Abstract] The claims of 'little visual information loss' and 'consistent outperformance' are asserted without any quantitative metrics, ablation tables, error bars, or dataset statistics, so the central empirical claim rests on unverified assertions.
- [Method] Method description (DINOv2 pruning stage): frame pruning occurs unconditionally on DINOv2 cosine similarity before any text conditioning; this creates an irreversible information bottleneck because visually similar frames (static backgrounds, recurring angles) can contain distinct actions or objects needed to answer a downstream query, yet no quantitative bound or recovery mechanism is supplied.
- [Method] Method description (cross-modal and spatial stages): because the first stage has already discarded tokens, the subsequent text-guided query reduction and temporal-dependency pooling have no access to the dropped content; this ordering undermines the claim that the overall pipeline is 'adaptive' with respect to the task.
minor comments (2)
- [Abstract] Abstract contains a grammatical error: 'thats reduces' should read 'that reduces'.
- [Abstract] Abstract contains a subject-verb agreement error: 'Our LongVU consistently surpass' should read 'surpasses'.
Simulated Author's Rebuttal
We thank the referee for the constructive comments and recommendation for major revision. We address each major comment point by point below, providing clarifications from the manuscript and indicating where revisions will be made.
Point-by-point responses
-
Referee: [Abstract] The claims of 'little visual information loss' and 'consistent outperformance' are asserted without any quantitative metrics, ablation tables, error bars, or dataset statistics, so the central empirical claim rests on unverified assertions.
Authors: We acknowledge that the abstract would be improved by incorporating specific quantitative highlights. The full manuscript contains ablation tables, performance comparisons on VideoMME and MLVU with consistent gains over baselines, token reduction statistics, and results across multiple datasets. We will revise the abstract to include key metrics such as average compression ratios and benchmark improvements to better substantiate the claims. revision: yes
-
Referee: [Method] Method description (DINOv2 pruning stage): frame pruning occurs unconditionally on DINOv2 cosine similarity before any text conditioning; this creates an irreversible information bottleneck because visually similar frames (static backgrounds, recurring angles) can contain distinct actions or objects needed to answer a downstream query, yet no quantitative bound or recovery mechanism is supplied.
Authors: This is a valid concern about potential information loss in the initial stage. DINOv2 is used to identify broad visual redundancy for scalability in long videos, and the manuscript's ablations show robust downstream performance. However, we agree to strengthen the presentation by adding quantitative analysis of pruning effects, including edge cases with similar frames, and bounds on information retention in the revised method section. revision: partial
-
Referee: [Method] Method description (cross-modal and spatial stages): because the first stage has already discarded tokens, the subsequent text-guided query reduction and temporal-dependency pooling have no access to the dropped content; this ordering undermines the claim that the overall pipeline is 'adaptive' with respect to the task.
Authors: The hierarchical ordering enables efficient processing of hour-long videos by first removing global redundancy, after which the text-guided and temporal stages adaptively compress the retained content based on the query. This design choice is explained in the manuscript as necessary for context-length constraints. We will revise the method section to clarify this rationale and add supporting comparisons to alternative orderings. revision: yes
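To make the ordering at issue concrete, the following schematic uses deliberately trivial stand-in stages (strings instead of features), not LongVU's actual operators. The point is only the data flow: stage 1 never sees the query, so content it drops cannot be recovered by the query-aware stages that follow.

```python
# Schematic of the contested three-stage ordering; all stage bodies are
# toy stubs chosen only to make the data flow runnable.
def dino_prune(frames):
    # stage 1: query-agnostic similarity pruning (stub: keep every other frame)
    return frames[::2]

def cross_modal_select(frames, query):
    # stage 2: text-guided reduction (stub: keep frames mentioning the query)
    return [f for f in frames if query in f]

def spatial_reduce(frames):
    # stage 3: temporal-dependency spatial pooling (stub: truncate each frame)
    return [f[:8] for f in frames]

def longvu_pipeline(frames, query):
    kept = dino_prune(frames)                   # irreversible before text is seen
    relevant = cross_modal_select(kept, query)  # only sees stage-1 survivors
    return spatial_reduce(relevant)

# Toy run with strings standing in for per-frame features.
print(longvu_pipeline(["cat sits", "cat sits", "cat jumps", "dog runs"], "cat"))
```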
Circularity Check
No circularity: method composes external pre-trained features with standard attention
Full rationale
The paper's core mechanism (DINOv2-based frame pruning followed by text-guided cross-modal query reduction and temporal-dependency spatial pooling) is described as a composition of pre-existing components: DINOv2 is an external model, cross-modal queries follow standard attention patterns, and spatial pooling uses inter-frame dependencies without any fitted parameters defined from the target benchmarks. No equations are presented that reduce claimed performance gains on VideoMME/MLVU to quantities defined by the result itself. Evaluation uses external benchmarks and a light-weight LLM without self-referential fitting or load-bearing self-citations. The derivation chain therefore rests on external components and benchmarks rather than on the paper's own results.
Axiom & Free-Parameter Ledger
axioms (2)
- Domain assumption: DINOv2 features provide a reliable measure of visual similarity for identifying redundant frames.
- Domain assumption: Text-guided cross-modal queries can selectively retain task-relevant visual information (a minimal selection sketch follows this list).
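For the second assumption, here is a minimal sketch of what text-guided selection could look like: score visual tokens against a pooled query embedding and keep the top-k. The dimensions, dot-product scoring rule, and token budget are assumptions for illustration, not the paper's exact design.

```python
# A minimal sketch of query-guided token selection, assuming a pooled
# text-query embedding and a flat bag of visual tokens.
import torch

def query_guided_topk(frame_tokens, query_emb, k):
    """frame_tokens: (N, D) visual tokens; query_emb: (D,) pooled query.
    Keep the k tokens most aligned with the query."""
    scores = frame_tokens @ query_emb   # (N,) relevance scores
    topk = scores.topk(k).indices
    return frame_tokens[topk]

tokens = torch.randn(576, 768)  # e.g. tokens pooled from retained frames
query = torch.randn(768)
print(query_guided_topk(tokens, query, k=64).shape)  # torch.Size([64, 768])
```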
Lean theorems connected to this paper
-
Foundation.DimensionForcing.dimension_forced (tag: unclear)
unclear: the relation between the paper passage and the cited Recognition theorem is ambiguous.
Linked passage: "Our LongVU consistently surpass existing methods across a variety of video understanding benchmarks, especially on hour-long video understanding tasks such as VideoMME and MLVU."
What do these tags mean?
- matches: the paper's claim is directly supported by a theorem in the formal canon.
- supports: the theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: the paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: the paper appears to rely on the theorem as machinery.
- contradicts: the paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 21 Pith papers
-
Molmo2: Open Weights and Data for Vision-Language Models with Video Understanding and Grounding
Molmo2 delivers state-of-the-art open-weight video VLMs with new grounding datasets and training methods that outperform prior open models and match or exceed some proprietary ones on pointing and tracking tasks.
-
CATS: Curvature Aware Temporal Selection for efficient long video understanding
CATS uses temporal curvature of query-frame relevance to select informative frames, achieving 93-95% of heavy multi-stage accuracy at 3-4% of the preprocessing cost on long-video benchmarks.
-
LookWhen? Fast Video Recognition by Learning When, Where, and What to Compute
LookWhen factorizes video recognition into learning when, where, and what to compute via uniqueness-based token selection and dual-teacher distillation, achieving better accuracy-FLOPs trade-offs than baselines on mul...
-
VideoRouter: Query-Adaptive Dual Routing for Efficient Long-Video Understanding
VideoRouter uses query-adaptive semantic and image routers plus new training datasets to reduce visual tokens by up to 67.9% while improving performance over the InternVL baseline on long-video benchmarks.
-
Sink-Token-Aware Pruning for Fine-Grained Video Understanding in Efficient Video LLMs
Sink-Token-aware Pruning (SToP) suppresses semantically uninformative sink tokens during visual token pruning in Video LLMs, boosting fine-grained performance even at 90% pruning rates across hallucination, reasoning,...
-
OmniScript: Towards Audio-Visual Script Generation for Long-Form Cinematic Video
OmniScript is a new 8B omni-modal model that turns long cinematic videos into scene-by-scene scripts and matches top proprietary models on temporal localization and semantic accuracy.
-
Mosaic: Cross-Modal Clustering for Efficient Video Understanding
Mosaic uses cross-modal clusters as the unit for KVCache organization in VLMs to achieve up to 1.38x speedup in streaming long-video understanding.
-
SVAgent: Storyline-Guided Long Video Understanding via Cross-Modal Multi-Agent Collaboration
SVAgent improves long video question answering by constructing storylines via multi-agent collaboration and aligning cross-modal predictions for more robust, human-like reasoning.
-
LongVideo-R1: Smart Navigation for Low-cost Long Video Understanding
LongVideo-R1 trains a reasoning agent on 33K trajectories to intelligently select informative video clips via iterative refinement and RL, achieving better accuracy-efficiency tradeoffs on long video QA benchmarks.
-
Training Long-Context Vision-Language Models Effectively with Generalization Beyond 128K Context
Continued pre-training with balanced long-document VQA data extends a 7B LVLM to 128K context, improving long-document VQA by 7.1% and generalizing to 512K without further training.
-
Response-G1: Explicit Scene Graph Modeling for Proactive Streaming Video Understanding
Response-G1 uses query-guided scene graphs, memory retrieval, and augmented prompting to improve when Video-LLMs decide to respond during streaming videos.
-
VideoRouter: Query-Adaptive Dual Routing for Efficient Long-Video Understanding
VideoRouter uses dual semantic and image routers for query-adaptive token compression in long-video models, delivering up to 67.9% reduction while outperforming the InternVL baseline on VideoMME, MLVU, and LongVideoBench.
-
One Token per Highly Selective Frame: Towards Extreme Compression for Long Video Understanding
XComp reaches extreme video compression (one token per selective frame) via learnable progressive token compression and question-conditioned frame selection, lifting LVBench accuracy from 42.9 percent to 46.2 percent ...
-
Small Vision-Language Models are Smart Compressors for Long Video Understanding
Tempo uses a 6B SVLM as a local temporal compressor with training-free adaptive token allocation to achieve SOTA long-video understanding at 0.5-16 tokens per frame, scoring 52.3 on 4101s LVBench under 8K budget.
-
HERMES: KV Cache as Hierarchical Memory for Efficient Streaming Video Understanding
HERMES organizes the KV cache into a hierarchical memory to enable real-time streaming video understanding in MLLMs, achieving 10x faster TTFT and up to 11.4% accuracy gains on streaming benchmarks with 68% fewer tokens.
-
Streaming Video Instruction Tuning
Streamo is a streaming video LLM trained end-to-end on the new Streamo-Instruct-465K dataset that unifies multiple real-time video tasks with claimed strong temporal reasoning and generalization.
-
OmniRefine: Alignment-Aware Cooperative Compression for Efficient Omnimodal Large Language Models
OmniRefine introduces alignment-aware chunk refinement via similarity and dynamic programming followed by modality-cooperative token compression, achieving near-baseline accuracy at 44% token retention on WorldSense.
-
Response-G1: Explicit Scene Graph Modeling for Proactive Streaming Video Understanding
Response-G1 uses query-guided scene graph generation, memory retrieval, and retrieval-augmented prompting to improve proactive response timing in streaming video understanding.
-
EgoSelf: From Memory to Personalized Egocentric Assistant
EgoSelf uses graph-based memory of user interactions to derive personalized profiles and predict future behaviors for egocentric assistants.
-
Token Reduction via Local and Global Contexts Optimization for Efficient Video Large Language Models
AOT reduces visual tokens in VLLMs via intra-frame and inter-frame anchors with local-global optimal transport, delivering competitive benchmark performance and efficiency gains in a training-free way.
-
VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding
VideoLLaMA3 uses a vision-centric training paradigm and token-reduction design to reach competitive results on image and video benchmarks.
Reference graph
Works this paper leans on
-
[1]
Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone
Marah Abdin, Sam Ade Jacobs, Ammar Ahmad Awan, Jyoti Aneja, Ahmed Awadallah, Hany Awadalla, Nguyen Bach, Amit Bahree, Arash Bakhtiari, Harkirat Behl, et al. Phi-3 technical report: A highly capable language model locally on your phone. arXiv preprint arXiv:2404.14219, 2024.
-
[2]
GPT-4 Technical Report
Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. GPT-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
-
[3]
MiniGPT4-Video: Advancing Multimodal LLMs for Video Understanding with Interleaved Visual-Textual Tokens
Kirolos Ataallah, Xiaoqian Shen, Eslam Abdelrahman, Essam Sleiman, Deyao Zhu, Jian Ding, and Mohamed Elhoseiny. MiniGPT4-Video: Advancing multimodal LLMs for video understanding with interleaved visual-textual tokens. arXiv preprint arXiv:2404.03413, 2024.
-
[4]
Token Merging: Your ViT But Faster
Daniel Bolya, Cheng-Yang Fu, Xiaoliang Dai, Peizhao Zhang, Christoph Feichtenhofer, and Judy Hoffman. Token merging: Your ViT but faster. arXiv preprint arXiv:2210.09461, 2022.
-
[5]
Language Models are Few-Shot Learners
Tom B. Brown, et al. Language models are few-shot learners. arXiv preprint arXiv:2005.14165, 2020.
-
[6]
MiniGPT-v2: large language model as a unified interface for vision-language multi-task learning
Jun Chen, Deyao Zhu, Xiaoqian Shen, Xiang Li, Zechun Liu, Pengchuan Zhang, Raghuraman Krishnamoorthi, Vikas Chandra, Yunyang Xiong, and Mohamed Elhoseiny. MiniGPT-v2: Large language model as a unified interface for vision-language multi-task learning. arXiv preprint arXiv:2310.09478, 2023.
-
[7]
InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks
Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Zhong Muyan, Qinglong Zhang, Xizhou Zhu, Lewei Lu, et al. InternVL: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. arXiv preprint arXiv:2312.14238, 2023.
-
[8]
Vicuna: An Open-Source Chatbot Impressing GPT-4 with 90%* ChatGPT Quality
Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E. Gonzalez, Ion Stoica, and Eric P. Xing. Vicuna: An open-source chatbot impressing GPT-4 with 90%* ChatGPT quality, 2023. https://lmsys.org/blog/2023-03-30-vicuna/.
-
[9]
InternLM-XComposer2: Mastering Free-Form Text-Image Composition and Comprehension in Vision-Language Large Model
Xiaoyi Dong, Pan Zhang, Yuhang Zang, Yuhang Cao, Bin Wang, Linke Ouyang, Xilin Wei, Songyang Zhang, Haodong Duan, Maosong Cao, et al. InternLM-XComposer2: Mastering free-form text-image composition and comprehension in vision-language large model. arXiv preprint arXiv:2401.16420, 2024.
-
[10]
Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis
Chaoyou Fu, Yuhan Dai, Yongdong Luo, Lei Li, Shuhuai Ren, Renrui Zhang, Zihan Wang, Chenyu Zhou, Yunhang Shen, Mengdan Zhang, et al. Video-MME: The first-ever comprehensive evaluation benchmark of multi-modal LLMs in video analysis. arXiv preprint arXiv:2405.21075, 2024.
-
[11]
RULER: What's the Real Context Size of Your Long-Context Language Models?
Cheng-Ping Hsieh, Simeng Sun, Samuel Kriman, Shantanu Acharya, Dima Rekesh, Fei Jia, and Boris Ginsburg. RULER: What's the real context size of your long-context language models? arXiv preprint arXiv:2404.06654, 2024.
-
[12]
Mixtral of Experts
Albert Q. Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Emma Bou Hanna, Florian Bressand, et al. Mixtral of experts. arXiv preprint arXiv:2401.04088, 2024.
-
[13]
Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding
Peng Jin, Ryuichi Takanobu, Caiwan Zhang, Xiaochun Cao, and Li Yuan. Chat-UniVi: Unified visual representation empowers large language models with image and video understanding. arXiv preprint arXiv:2311.08046, 2023.
-
[14]
The Kinetics Human Action Video Dataset
Will Kay, Joao Carreira, Karen Simonyan, Brian Zhang, Chloe Hillier, Sudheendra Vijayanarasimhan, Fabio Viola, Tim Green, Trevor Back, Paul Natsev, et al. The Kinetics human action video dataset. arXiv preprint arXiv:1705.06950, 2017.
-
[15]
LLaVA-OneVision: Easy Visual Task Transfer
Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Yanwei Li, Ziwei Liu, and Chunyuan Li. LLaVA-OneVision: Easy visual task transfer. arXiv preprint arXiv:2408.03326, 2024.
-
[16]
Video-LLaVA: Learning United Visual Representation by Alignment Before Projection
Bin Lin, Bin Zhu, Yang Ye, Munan Ning, Peng Jin, and Li Yuan. Video-LLaVA: Learning united visual representation by alignment before projection. arXiv preprint arXiv:2311.10122, 2023.
-
[17]
World Model on Million-Length Video And Language With Blockwise RingAttention
Hao Liu, Wilson Yan, Matei Zaharia, and Pieter Abbeel. World model on million-length video and language with Blockwise RingAttention. arXiv preprint arXiv:2402.08268, 2024.
-
[18]
Valley: Video Assistant with Large Language Model Enhanced Ability
Ruipu Luo, Ziwang Zhao, Min Yang, Junwei Dong, Minghui Qiu, Pengcheng Lu, Tao Wang, and Zhongyu Wei. Valley: Video assistant with large language model enhanced ability. arXiv preprint arXiv:2306.07207, 2023.
-
[19]
Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models
Muhammad Maaz, Hanoona Rasheed, Salman Khan, and Fahad Shahbaz Khan. Video-ChatGPT: Towards detailed video understanding via large vision and language models. arXiv preprint arXiv:2306.05424, 2023.
-
[20]
DINOv2: Learning Robust Visual Features without Supervision
Maxime Oquab, Timothée Darcet, Théo Moutakanni, et al. DINOv2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193, 2023.
-
[21]
Kosmos-2: Grounding Multimodal Large Language Models to the World
Zhiliang Peng, Wenhui Wang, Li Dong, Yaru Hao, Shaohan Huang, Shuming Ma, and Furu Wei. Kosmos-2: Grounding multimodal large language models to the world. arXiv preprint arXiv:2306.14824, 2023.
-
[22]
TESTA: Temporal-Spatial Token Aggregation for Long-Form Video-Language Understanding
Shuhuai Ren, Sishuo Chen, Shicheng Li, Xu Sun, and Lu Hou. TESTA: Temporal-spatial token aggregation for long-form video-language understanding. arXiv preprint arXiv:2310.19060, 2023.
-
[23]
Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs
Shengbang Tong, Ellis Brown, Penghao Wu, Sanghyun Woo, Manoj Middepogu, Sai Charitha Akula, Jihan Yang, Shusheng Yang, Adithya Iyer, Xichen Pan, et al. Cambrian-1: A fully open, vision-centric exploration of multimodal LLMs. arXiv preprint arXiv:2406.16860, 2024.
-
[24]
Llama 2: Open Foundation and Fine-Tuned Chat Models
Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.
-
[25]
CogVLM: Visual Expert for Pretrained Language Models
Weihan Wang, Qingsong Lv, Wenmeng Yu, Wenyi Hong, Ji Qi, Yan Wang, Junhui Ji, Zhuoyi Yang, Lei Zhao, Xixuan Song, et al. CogVLM: Visual expert for pretrained language models. arXiv preprint arXiv:2311.03079, 2023.
-
[26]
InternVideo2: Scaling Video Foundation Models for Multimodal Video Understanding
Yi Wang, Kunchang Li, Xinhao Li, Jiashuo Yu, Yinan He, Guo Chen, Baoqi Pei, Rongkun Zheng, Jilan Xu, Zun Wang, et al. InternVideo2: Scaling video foundation models for multimodal video understanding. arXiv preprint arXiv:2403.15377, 2024.
-
[27]
SlowFast-LLaVA: A Strong Training-Free Baseline for Video Large Language Models
Mingze Xu, Mingfei Gao, Zhe Gan, Hong-You Chen, Zhengfeng Lai, Haiming Gang, Kai Kang, and Afshin Dehghan. SlowFast-LLaVA: A strong training-free baseline for video large language models. arXiv preprint arXiv:2407.15841, 2024.
-
[28]
mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality
Qinghao Ye, Haiyang Xu, Guohai Xu, Jiabo Ye, Ming Yan, Yi Zhou, Junyan Wang, Anwen Hu, Pengcheng Shi, Yaya Shi, Chenliang Li, Yuanhong Xu, Hehong Chen, Junfeng Tian, Qiang Qi, Ji Chao Zhang, and Feiyan Huang. mPLUG-Owl: Modularization empowers large language models with multimodality. arXiv preprint arXiv:2304.14178, 2023.
-
[29]
CLEVRER: CoLlision Events for Video REpresentation and Reasoning
Kexin Yi, Chuang Gan, Yunzhu Li, Pushmeet Kohli, Jiajun Wu, Antonio Torralba, and Joshua B. Tenenbaum. CLEVRER: Collision events for video representation and reasoning. arXiv preprint arXiv:1910.01442, 2019.
-
[30]
Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding
Hang Zhang, Xin Li, and Lidong Bing. Video-LLaMA: An instruction-tuned audio-visual language model for video understanding. arXiv preprint arXiv:2306.02858, 2023.
-
[31]
Long Context Transfer from Language to Vision
Peiyuan Zhang, Kaichen Zhang, Bo Li, Guangtao Zeng, Jingkang Yang, Yuanhan Zhang, Ziyue Wang, Haoran Tan, Chunyuan Li, and Ziwei Liu. Long context transfer from language to vision. arXiv preprint arXiv:2406.16852, 2024.
-
[32]
MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models
Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, and Mohamed Elhoseiny. MiniGPT-4: Enhancing vision-language understanding with advanced large language models. arXiv preprint arXiv:2304.10592, 2023.