InternVideo2.5: Empowering Video MLLMs with Long and Rich Context Modeling
Pith reviewed 2026-05-17 02:47 UTC · model grok-4.3
The pith
Long and rich context modeling lets video MLLMs process at least six times longer inputs while gaining object tracking and segmentation skills.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper claims that its design for long and rich context modeling, which incorporates dense vision task annotations into MLLMs using direct preference optimization and creates compact spatiotemporal representations through adaptive hierarchical token compression, substantially improves video MLLM performance. This produces stronger results on mainstream short and long video understanding benchmarks, allows the models to memorize and use video inputs at least six times longer than before, and equips them with specialized vision capabilities such as object tracking and segmentation.
What carries the argument
Long and rich context (LRC) modeling, which integrates dense vision task annotations via direct preference optimization and adaptive hierarchical token compression to form compact spatiotemporal representations that support finer detail perception and longer temporal capture.
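For intuition only: the adaptive hierarchical token compression described above could be pictured as a ToMe-style greedy merge of the most redundant adjacent spatiotemporal tokens. The sketch below is an illustrative stand-in, not the paper's actual algorithm; the function name, merge rule, and `keep_ratio` value are all assumptions.

```python
import numpy as np

def hierarchical_token_compression(tokens: np.ndarray, keep_ratio: float = 0.25) -> np.ndarray:
    """Greedily merge the most similar adjacent token pair until only
    keep_ratio of the original tokens remain (a ToMe-style heuristic,
    assumed here for illustration)."""
    tokens = tokens.astype(np.float64)
    target = max(1, int(len(tokens) * keep_ratio))
    while len(tokens) > target:
        # Cosine similarity between each token and its right neighbour.
        a, b = tokens[:-1], tokens[1:]
        sims = np.sum(a * b, axis=1) / (
            np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1) + 1e-8
        )
        i = int(np.argmax(sims))  # most redundant adjacent pair
        merged = (tokens[i] + tokens[i + 1]) / 2.0
        tokens = np.vstack([tokens[:i], merged[None], tokens[i + 2:]])
    return tokens

# Example: 16 frame tokens of dimension 8 compressed 4x.
compressed = hierarchical_token_compression(np.random.randn(16, 8), keep_ratio=0.25)
print(compressed.shape)  # (4, 8)
```

The point of the sketch is the trade the paper is making: a fixed token budget buys longer temporal coverage at the cost of per-frame detail, which is why the referee report below presses on whether fine-grained information survives compression.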
If this is right
- Video MLLMs show improved accuracy on both short-form and long-form video understanding benchmarks.
- Models gain the ability to retain and reason over video inputs at least six times longer than the original design.
- MLLMs acquire new specialized vision skills including object tracking and segmentation.
- Context richness in length and fineness directly strengthens the models' focus and memory functions.
Where Pith is reading between the lines
- Similar preference optimization on dense annotations could extend context handling in other sequential multimodal tasks such as audio streams or time-series data.
- The token compression method might reduce memory costs enough to allow deployment of long-context video models on resource-limited devices.
- Future tests on uncurated real-world video sources could show whether the benchmark gains translate to practical applications like surveillance or video editing.
Load-bearing premise
The gains in context length, benchmark scores, and specialized vision tasks arise from the long and rich context modeling components rather than from differences in training data volume, model scale, or benchmark choices.
What would settle it
An ablation experiment that removes the dense vision annotations and hierarchical token compression steps while holding training data volume and model size fixed, then measures whether context length and benchmark performance still improve, would settle whether the central claim holds.
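The settling experiment amounts to a 2x2 ablation grid over the two LRC components with everything else held fixed. A minimal sketch, with hypothetical configuration keys (these are not the paper's actual flags):

```python
from itertools import product

# Hypothetical ablation grid: toggle each LRC component while holding
# data volume, model size, and training schedule fixed.
components = {
    "dense_annotation_dpo": (False, True),
    "hierarchical_compression": (False, True),
}

# 4 runs: baseline, each module alone, both together.
runs = [dict(zip(components, combo)) for combo in product(*components.values())]
for cfg in runs:
    print(cfg)
```

If context length and benchmark scores only improve in the runs where the modules are enabled, the attribution to LRC holds; if the baseline run improves comparably, the gains come from somewhere else.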
Original abstract
This paper aims to improve the performance of video multimodal large language models (MLLM) via long and rich context (LRC) modeling. As a result, we develop a new version of InternVideo2.5 with a focus on enhancing the original MLLMs' ability to perceive fine-grained details and capture long-form temporal structure in videos. Specifically, our approach incorporates dense vision task annotations into MLLMs using direct preference optimization and develops compact spatiotemporal representations through adaptive hierarchical token compression. Experimental results demonstrate this unique design of LRC greatly improves the results of video MLLM in mainstream video understanding benchmarks (short & long), enabling the MLLM to memorize significantly longer video inputs (at least 6x longer than the original), and master specialized vision capabilities like object tracking and segmentation. Our work highlights the importance of multimodal context richness (length and fineness) in empowering MLLM's innate abilities (focus and memory), providing new insights for future research on video MLLM. Code and models are available at https://github.com/OpenGVLab/InternVideo/tree/main/InternVideo2.5
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces InternVideo2.5, an updated video multimodal large language model (MLLM) that incorporates long and rich context (LRC) modeling. The core contributions are (1) injecting dense vision task annotations into the MLLM via direct preference optimization (DPO) and (2) producing compact spatiotemporal representations through adaptive hierarchical token compression. The authors report that these changes yield gains on short- and long-form video understanding benchmarks, enable the model to handle at least 6× longer video inputs than the prior InternVideo2, and confer new capabilities such as object tracking and segmentation.
Significance. If the reported gains are shown to stem from the LRC mechanisms rather than uncontrolled differences in training data volume, model scale, or optimizer schedule, the work would meaningfully advance video MLLM research by demonstrating the value of explicit context richness for fine-grained perception and long-term memory. The public release of code and models at the cited GitHub repository is a clear strength that aids reproducibility.
Major comments (2)
- [Abstract and §4 (Experiments)] The headline claims of benchmark improvements, 6× context extension, and new tracking/segmentation capabilities are presented only as end-to-end results against InternVideo2. No ablation is reported that freezes data volume, training schedule, and base architecture while toggling only the DPO stage and the hierarchical compression module; without such controls the attribution to LRC remains untested and the central causal claim is under-supported.
- [§3.2 (Adaptive Hierarchical Token Compression)] The description does not specify the exact compression ratios, the criterion used for adaptive selection, or quantitative evidence that fine-grained spatial details required for tracking and segmentation are preserved after compression; this detail is load-bearing for the claim that the module enables both longer context and specialized vision capabilities.
Minor comments (1)
- [Abstract] The abstract refers to “mainstream video understanding benchmarks (short & long)” without naming the specific datasets or splits; adding this information would improve clarity.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We appreciate the emphasis on strengthening the causal attribution of our results to the proposed LRC mechanisms and on providing more technical details in the method section. We address each major comment below and outline the revisions we will make.
Point-by-point responses
- Referee: [Abstract and §4 (Experiments)] The headline claims of benchmark improvements, 6× context extension, and new tracking/segmentation capabilities are presented only as end-to-end results against InternVideo2. No ablation is reported that freezes data volume, training schedule, and base architecture while toggling only the DPO stage and the hierarchical compression module; without such controls the attribution to LRC remains untested and the central causal claim is under-supported.
Authors: We agree that an ablation isolating the DPO and adaptive hierarchical token compression components—while strictly controlling data volume, training schedule, optimizer, and base architecture—would provide stronger evidence for the specific contribution of LRC. Our current comparisons are against the publicly released InternVideo2 checkpoint under the same overall training recipe, and the observed gains are consistent across short- and long-form benchmarks as well as the new tracking/segmentation tasks. Nevertheless, to directly address the referee’s concern we will add a controlled ablation study in the revised §4 that trains variants with and without each LRC module under matched conditions. This will clarify the incremental benefit attributable to dense vision annotations via DPO and to the compression module. revision: yes
- Referee: [§3.2 (Adaptive Hierarchical Token Compression)] The description does not specify the exact compression ratios, the criterion used for adaptive selection, or quantitative evidence that fine-grained spatial details required for tracking and segmentation are preserved after compression; this detail is load-bearing for the claim that the module enables both longer context and specialized vision capabilities.
Authors: We thank the referee for highlighting this gap. In the revised manuscript we will expand §3.2 with the exact per-layer compression ratios (spatial and temporal), the adaptive selection criterion (based on token importance scores derived from cross-attention with the text query), and quantitative evidence of detail preservation. Specifically, we will report (i) cosine similarity of visual features before and after compression on a held-out set of tracking/segmentation videos and (ii) downstream performance drop on object-tracking and segmentation probes when the compression module is ablated. These additions will substantiate that fine-grained spatial information is retained while achieving the reported 6× context extension. revision: yes
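The detail-preservation probe the authors propose, reporting feature similarity before and after compression, could be as simple as a mean per-token cosine similarity. A minimal sketch under the assumption that pre- and post-compression tokens have already been paired (e.g. by nearest-neighbour matching, since compression changes the token count):

```python
import numpy as np

def mean_cosine_similarity(before: np.ndarray, after: np.ndarray) -> float:
    """Mean per-token cosine similarity between paired pre- and
    post-compression visual features (shapes must match)."""
    num = np.sum(before * after, axis=1)
    den = np.linalg.norm(before, axis=1) * np.linalg.norm(after, axis=1) + 1e-8
    return float(np.mean(num / den))

# Sanity check with stand-in features: identical inputs score ≈ 1.0.
feats = np.random.randn(32, 16)
print(mean_cosine_similarity(feats, feats))
```

A high score would only show that surviving tokens are faithful; the downstream tracking/segmentation probes the authors also promise are what would show that the *right* tokens survive.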
Circularity Check
No significant circularity; empirical claims rest on benchmark results rather than self-referential derivations
Full rationale
The paper introduces architectural and training modifications—dense vision annotations via direct preference optimization and adaptive hierarchical token compression—to extend context length and improve video understanding in MLLMs. These are presented as novel design choices whose value is demonstrated through end-to-end experimental comparisons on standard benchmarks, not through any closed-form derivation, fitted-parameter prediction, or uniqueness theorem that reduces to prior self-citations by construction. References to the InternVideo lineage are contextual background rather than load-bearing justifications for the reported gains. The work is therefore self-contained against external benchmarks and does not exhibit any of the enumerated circularity patterns.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
- IndisputableMonolith.Foundation.DimensionForcing.dimension_forced (tag: unclear)
  The relation between the paper passage and the cited Recognition theorem is unclear.
Experimental results demonstrate this unique design of LRC greatly improves the results of video MLLM in mainstream video understanding benchmarks
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 22 Pith papers
- EgoMemReason: A Memory-Driven Reasoning Benchmark for Long-Horizon Egocentric Video Understanding
  EgoMemReason is a new benchmark showing that even the best multimodal models achieve only 39.6% accuracy on reasoning tasks that require integrating sparse evidence across days in egocentric video.
- CoRDS: Coreset-based Representative and Diverse Selection for Streaming Video Understanding
  CoRDS selects a compact KV-cache subset via joint-space coreset coverage and log-det diversity to outperform token-wise heuristics on long-video VLM benchmarks.
- TOC-Bench: A Temporal Object Consistency Benchmark for Video Large Language Models
  TOC-Bench is a new diagnostic benchmark that reveals major weaknesses in temporal object consistency for Video-LLMs, including event counting, ordering, identity reasoning, and hallucination avoidance.
- TOC-Bench: A Temporal Object Consistency Benchmark for Video Large Language Models
  TOC-Bench is an object-track-grounded benchmark that filters for temporally dependent questions and shows Video-LLMs have major weaknesses in event counting, ordering, identity reasoning, and hallucination detection.
- Semantic-Aware Adaptive Visual Memory for Streaming Video Understanding
  SAVEMem improves streaming video understanding scores by adding semantic awareness to memory compression and query-adaptive retrieval without any model training.
- VISD: Enhancing Video Reasoning via Structured Self-Distillation
  VISD improves VideoLLM reasoning performance and training efficiency by combining structured multi-dimensional self-distillation feedback with RL via direction-magnitude decoupling, curriculum scheduling, and EMA stab...
- Grounding Video Reasoning in Physical Signals
  A new benchmark converts video clips into shared grounded event records and tests models across physics, semantic, and control prompts under original, shuffled, ablated, and masked conditions, finding selective robust...
- InstAP: Instance-Aware Vision-Language Pre-Train for Spatial-Temporal Understanding
  InstAP introduces instance-aware pre-training with a new dual-granularity dataset InstVL that improves both fine-grained instance retrieval and global video understanding over standard VLP baselines.
- Adapting MLLMs for Nuanced Video Retrieval
  Text-only contrastive fine-tuning of an MLLM with hard negatives produces embeddings that handle temporal, negation, and multimodal nuances in video retrieval and achieves SOTA performance.
- From Priors to Perception: Grounding Video-LLMs in Physical Reality
  Video-LLMs fail physical reasoning due to semantic prior dominance rather than perception deficits; a new programmatic adversarial curriculum and visual-anchored reasoning chain enable substantial gains via standard L...
- All in One: A Unified Synthetic Data Pipeline for Multimodal Video Understanding
  A unified synthetic data generation pipeline produces unlimited annotated multimodal video data across multiple tasks, enabling models trained mostly on synthetic data to generalize effectively to real-world video und...
- Decoupled Similarity for Task-Aware Token Pruning in Large Vision-Language Models
  DeSAP uses decoupled cross-modal similarity plus visual saliency to prune visual tokens in LVLMs, retaining 11.1% tokens for 10x FLOPs reduction and 98.1% performance on LLaVA-1.5-7B.
- Graph-to-Frame RAG: Visual-Space Knowledge Fusion for Training-Free and Auditable Video Reasoning
  G2F-RAG converts retrieved knowledge subgraphs into a single visual reasoning frame appended to videos, enabling training-free and interpretable improvements for LMM-based video reasoning on knowledge-intensive tasks.
- Streaming Video Instruction Tuning
  Streamo is a streaming video LLM trained end-to-end on the new Streamo-Instruct-465K dataset that unifies multiple real-time video tasks with claimed strong temporal reasoning and generalization.
- SmolVLA: A Vision-Language-Action Model for Affordable and Efficient Robotics
  SmolVLA is a small efficient VLA model that achieves performance comparable to 10x larger models while training on one GPU and deploying on consumer hardware via community data and chunked asynchronous action prediction.
- LLaDA-V: Large Language Diffusion Models with Visual Instruction Tuning
  LLaDA-V is a diffusion-based multimodal large language model that reaches competitive or state-of-the-art results on visual instruction tasks while using a non-autoregressive architecture.
- VISD: Enhancing Video Reasoning via Structured Self-Distillation
  VISD adds structured privileged feedback from a judge model and a direction-magnitude decoupling trick to let VideoLLMs learn token-level credit assignment while keeping RL stable, yielding higher accuracy and roughly...
- VISD: Enhancing Video Reasoning via Structured Self-Distillation
  VISD improves VideoLLM reasoning by adding multi-dimensional diagnostic self-distillation and RL decoupling, yielding higher accuracy, better grounding, and nearly 2x faster training convergence.
- High-Speed Vision Improves Zero-Shot Semantic Understanding of Human Actions
  Higher temporal resolution in video significantly improves zero-shot semantic understanding of high-speed human actions like kendo.
- How Should Video LLMs Output Time? An Analysis of Efficient Temporal Grounding Paradigms
  A controlled study on compact video LLMs finds that continuous temporal decoding delivers the strongest accuracy-efficiency trade-off for video temporal grounding across three benchmarks.
- OneThinker: All-in-one Reasoning Model for Image and Video
  OneThinker unifies image and video reasoning in one model across 10 tasks via a 600k corpus, CoT-annotated SFT, and EMA-GRPO reinforcement learning, reporting strong results on 31 benchmarks plus some cross-task transfer.
- VideoChat-R1: Enhancing Spatio-Temporal Perception via Reinforcement Fine-Tuning
  Reinforcement fine-tuning with temporal rewards produces VideoChat-R1, a video MLLM showing large gains on spatio-temporal perception benchmarks such as +31.8 temporal grounding and +31.2 object tracking.
Reference graph
Works this paper leans on
-
[1]
Cosmos World Foundation Model Platform for Physical AI
Niket Agarwal, Arslan Ali, Maciej Bala, Yogesh Balaji, Erik Barker, Tiffany Cai, Prithvijit Chattopadhyay, Yongxin Chen, Yin Cui, Yifan Ding, et al. Cosmos world foundation model platform for physical ai. arXiv preprint arXiv:2501.03575,
-
[2]
Qwen Technical Report
Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, et al. Qwen technical report. arXiv preprint arXiv:2309.16609,
-
[3]
One token to seg them all: Language instructed reasoning segmentation in videos
Zechen Bai, Tong He, Haiyang Mei, Pichao Wang, Ziteng Gao, Joya Chen, Lei Liu, Zheng Zhang, and Mike Zheng Shou. One token to seg them all: Language instructed reasoning segmentation in videos. arXiv preprint arXiv:2409.19603,
-
[4]
Token Merging: Your ViT But Faster
Daniel Bolya, Cheng-Yang Fu, Xiaoliang Dai, Peizhao Zhang, Christoph Feichtenhofer, and Judy Hoffman. Token merging: Your vit but faster. arXiv preprint arXiv:2210.09461,
-
[5]
InternLM2 Technical Report
Zheng Cai, Maosong Cao, Haojiong Chen, Kai Chen, Keyu Chen, Xin Chen, Xun Chen, Zehui Chen, Zhi Chen, Pei Chu, et al. Internlm2 technical report. arXiv preprint arXiv:2403.17297,
-
[6]
Hourvideo: 1-hour video-language understanding
Keshigeyan Chandrasegaran, Agrim Gupta, Lea M Hadzic, Taran Kota, Jimming He, Cristóbal Eyzaguirre, Zane Durante, Manling Li, Jiajun Wu, and Li Fei-Fei. Hourvideo: 1-hour video-language understanding. arXiv preprint arXiv:2411.04998,
-
[7]
ALLaVA: Harnessing GPT4V-synthesized Data for a Lite Vision-Language Model
Guiming Hardy Chen, Shunian Chen, Ruifei Zhang, Junying Chen, Xiangbo Wu, Zhiyi Zhang, Zhihong Chen, Jianquan Li, Xiang Wan, and Benyou Wang. Allava: Harnessing gpt4v-synthesized data for a lite vision-language model. arXiv preprint arXiv:2402.11684, 2024a. Kang Chen, Tao Han, Junchao Gong, Lei Bai, Fenghua Ling, Jing-Jia Luo, Xi Chen, Leiming Ma, Tiannin...
-
[8]
Usp: A unified sequence parallelism approach for long context generative ai
Jiarui Fang and Shangchun Zhao. A unified sequence parallelism approach for long context generative ai.arXiv preprint arXiv:2405.07719,
-
[9]
MMBench-Video: A long-form multi-shot benchmark for holistic video understanding
Xinyu Fang, Kangrui Mao, Haodong Duan, Xiangyu Zhao, Yining Li, Dahua Lin, and Kai Chen. Mmbench-video: A long-form multi-shot benchmark for holistic video understanding. arXiv preprint arXiv:2406.14515,
-
[10]
Video-of-thought: Step-by-step video reasoning from perception to cognition
Hao Fei, Shengqiong Wu, Wei Ji, Hanwang Zhang, Meishan Zhang, Mong-Li Lee, and Wynne Hsu. Video-of-thought: Step-by-step video reasoning from perception to cognition. In Forty-first International Conference on Machine Learning, 2024a. Jiajun Fei, Dian Li, Zhidong Deng, Zekun Wang, Gang Liu, and Hui Wang. Video-ccam: Enhancing video-language understanding ...
-
[11]
ChatGLM: A Family of Large Language Models from GLM-130B to GLM-4 All Tools
Team GLM, Aohan Zeng, Bin Xu, Bowen Wang, Chenhui Zhang, Da Yin, Dan Zhang, Diego Rojas, Guanyu Feng, Hanlin Zhao, et al. Chatglm: A family of large language models from glm-130b to glm-4 all tools. arXiv preprint arXiv:2406.12793,
-
[12]
Vtimellm: Empower llm to grasp video moments
Bin Huang, Xin Wang, Hong Chen, Zihan Song, and Wenwu Zhu. Vtimellm: Empower llm to grasp video moments. In CVPR, 2024a. Ziqi Huang, Yinan He, Jiashuo Yu, Fan Zhang, Chenyang Si, Yuming Jiang, Yuanhan Zhang, Tianxing Wu, Qingyang Jin, Nattapol Chanpaisit, et al. Vbench: Comprehensive benchmark suite for video generative models. In Proceedings of the IEEE/...
-
[13]
The Kinetics Human Action Video Dataset
Will Kay, Joao Carreira, Karen Simonyan, Brian Zhang, Chloe Hillier, Sudheendra Vijayanarasimhan, Fabio Viola, Tim Green, Trevor Back, Paul Natsev, et al. The kinetics human action video dataset. arXiv preprint arXiv:1705.06950,
-
[14]
LLaVA-OneVision: Easy Visual Task Transfer
Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Yanwei Li, Ziwei Liu, and Chunyuan Li. Llava-onevision: Easy visual task transfer. CoRR, abs/2408.03326, 2024a. Feng Li, Renrui Zhang, Hao Zhang, Yuanhan Zhang, Bo Li, Wei Li, Zejun Ma, and Chunyuan Li. Llava-next-interleave: Tackling multi-image, video, and 3d in large multi...
-
[15]
Mvbench: A comprehensive multi-modal video understanding benchmark
Kunchang Li, Yali Wang, Yinan He, Yizhuo Li, Yi Wang, Yi Liu, Zun Wang, Jilan Xu, Guo Chen, Ping Luo, et al. Mvbench: A comprehensive multi-modal video understanding benchmark. In CVPR, pages 22195–22206, 2024c. Qingyun Li, Zhe Chen, Weiyun Wang, Wenhai Wang, Shenglong Ye, Zhenj...
-
[16]
Ring Attention with Blockwise Transformers for Near-Infinite Context
Hao Liu, Matei Zaharia, and Pieter Abbeel. Ring attention with blockwise transformers for near-infinite context. arXiv preprint arXiv:2310.01889,
-
[17]
Videogpt+: Integrating image and video encoders for enhanced video understanding
Muhammad Maaz, Hanoona Rasheed, Salman Khan, and Fahad Khan. Videogpt+: Integrating image and video encoders for enhanced video understanding. arXiv preprint arXiv:2406.09418,
-
[18]
GPT-4 Technical Report
OpenAI. GPT-4 technical report. CoRR, abs/2303.08774,
-
[20]
SAM 2: Segment Anything in Images and Videos
URL https://arxiv.org/abs/2408.00714. Ruchit Rawal, Khalid Saifullah, Miquel Farré, Ronen Basri, David Jacobs, Gowthami Somepalli, and Tom Goldstein. Cinepile: A long video question answering dataset and benchmark. arXiv preprint arXiv:2405.08813,
-
[21]
Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context
Machel Reid, Nikolay Savinov, Denis Teplyashin, Dmitry Lepikhin, Timothy P. Lillicrap, Jean-Baptiste Alayrac, Radu Soricut, Angeliki Lazaridou, Orhan Firat, Julian Schrittwieser, Ioannis Antonoglou, Rohan Anil, Sebastian Borgeaud, Andrew M. Dai, Katie Millican, Ethan Dyer, Mia G...
-
[22]
Timechat: A time-sensitive multimodal large language model for long video understanding
Shuhuai Ren, Linli Yao, Shicheng Li, Xu Sun, and Lu Hou. Timechat: A time-sensitive multimodal large language model for long video understanding. CVPR, abs/2312.02051,
-
[23]
LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding
Xiaoqian Shen, Yunyang Xiong, Changsheng Zhao, Lemeng Wu, Jun Chen, Chenchen Zhu, Zechun Liu, Fanyi Xiao, Balakrishnan Varadarajan, Florian Bordes, et al. Longvu: Spatiotemporal adaptive compression for long video-language understanding. arXiv preprint arXiv:2410.17434,
-
[24]
Video-xl: Extra-long vision language model for hour-scale video understanding
Yan Shu, Peitian Zhang, Zheng Liu, Minghao Qin, Junjie Zhou, Tiejun Huang, and Bo Zhao. Video-xl: Extra-long vision language model for hour-scale video understanding. arXiv preprint arXiv:2409.14485,
-
[25]
Grounded-videollm: Sharpening fine-grained temporal grounding in video large language models
Haibo Wang, Zhiyang Xu, Yu Cheng, Shizhe Diao, Yufan Zhou, Yixin Cao, Qifan Wang, Weifeng Ge, and Lifu Huang. Grounded-videollm: Sharpening fine-grained temporal grounding in video large language models. arXiv preprint arXiv:2410.03290, 2024a. Weihan Wang, Zehai He, Wenyi Hong, Yean Cheng, Xiaohan Zhang, Ji Qi, Xiaotao Gu, Shiyu Huang, Bin Xu, Yuxiao Dong...
-
[26]
Longllava: Scaling multi-modal llms to 1000 images efficiently via hybrid architecture
Xidong Wang, Dingjie Song, Shunian Chen, Chen Zhang, and Benyou Wang. Longllava: Scaling multi-modal llms to 1000 images efficiently via hybrid architecture. CoRR, abs/2409.02889, 2024e. Yi Wang, Kunchang Li, Yizhuo Li, Yinan He, Bingkun Huang, Zhiyu Zhao, Hongjie Zhang, Jilan Xu, Yi Liu, Zun Wang, et al. Internvideo: General video foundation models via g...
-
[27]
InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding and Generation
Yi Wang, Yinan He, Yizhuo Li, Kunchang Li, Jiashuo Yu, Xin Ma, Xinhao Li, Guo Chen, Xinyuan Chen, Yaohui Wang, et al. Internvid: A large-scale video-text dataset for multimodal understanding and generation. arXiv preprint arXiv:2307.06942,
-
[28]
Internvideo2: Scaling video foundation models for multimodal video understanding
Yi Wang, Kunchang Li, Xinhao Li, Jiashuo Yu, Yinan He, Guo Chen, Baoqi Pei, Rongkun Zheng, Jilan Xu, Zun Wang, et al. Internvideo2: Scaling video foundation models for multimodal video understanding. In ECCV, 2024f. Yueqian Wang, Xiaojun Meng, Jianxin Liang, Yuxuan Wang, Qun Liu, and Dongyan Zhao. Hawkeye: Training video-text llms for grounding text in vi...
-
[29]
Longvideobench: A benchmark for long-context interleaved video-language understanding
Haoning Wu, Dongxu Li, Bei Chen, and Junnan Li. Longvideobench: A benchmark for long-context interleaved video-language understanding. arXiv preprint arXiv:2407.15754, 2024a. Jiannan Wu, Yi Jiang, Peize Sun, Zehuan Yuan, and Ping Luo. Language as queries for referring video object segmentation. In CVPR, pages 4974–4984,
-
[30]
Visionllm v2: An end-to-end generalist multimodal large language model for hundreds of vision-language tasks
Jiannan Wu, Muyan Zhong, Sen Xing, Zeqiang Lai, Zhaoyang Liu, Wenhai Wang, Zhe Chen, Xizhou Zhu, Lewei Lu, Tong Lu, et al. Visionllm v2: An end-to-end generalist multimodal large language model for hundreds of vision-language tasks. arXiv preprint arXiv:2406.08394, 2024b. Yinda Xu et al. Siamfc++: Towards robust and accurate visual tracking with target es...
-
[31]
LongVILA: Scaling Long-Context Visual Language Models for Long Videos
Fuzhao Xue, Yukang Chen, Dacheng Li, Qinghao Hu, Ligeng Zhu, Xiuyu Li, Yunhao Fang, Haotian Tang, Shang Yang, Zhijian Liu, et al. Longvila: Scaling long-context visual language models for long videos. arXiv preprint arXiv:2408.10188,
-
[32]
Task preference optimization: Improving multimodal large language models with vision task alignment
Ziang Yan, Zhilin Li, Yinan He, Chenting Wang, Kunchang Li, Xinhao Li, Xiangyu Zeng, Zilei Wang, Yali Wang, Yu Qiao, Limin Wang, and Yi Wang. Task preference optimization: Improving multimodal large language models with vision task alignment. arXiv preprint arXiv:2412.19326,
-
[33]
Vript: A video is worth thousands of words
Dongjie Yang, Suyuan Huang, Chengqiang Lu, Xiaodong Han, Haoxin Zhang, Yan Gao, Yao Hu, and Hai Zhao. Vript: A video is worth thousands of words. arXiv preprint arXiv:2406.06040,
-
[34]
mplug-owl3: Towards long image-sequence understanding in multi-modal large language models
Jiabo Ye, Haiyang Xu, Haowei Liu, Anwen Hu, Ming Yan, Qi Qian, Ji Zhang, Fei Huang, and Jingren Zhou. mplug-owl3: Towards long image-sequence understanding in multi-modal large language models.arXiv preprint arXiv:2408.04840,
-
[35]
Next-chat: An lmm for chat, detection and segmentation
Ao Zhang, Liming Zhao, Chen-Wei Xie, Yun Zheng, Wei Ji, and Tat-Seng Chua. Next-chat: An lmm for chat, detection and segmentation. arXiv preprint arXiv:2311.04498, 2023a. Beichen Zhang, Pan Zhang, Xiaoyi Dong, Yuhang Zang, and Jiaqi Wang. Long-clip: Unlocking the long-text capability of clip. In European Conference on Computer Vision, pages 310–325. Springer,
-
[36]
Movqa: A benchmark of versatile question-answering for long-form movie understanding
Hongjie Zhang, Yi Liu, Lu Dong, Yifei Huang, Zhen-Hua Ling, Yali Wang, Limin Wang, and Yu Qiao. Movqa: A benchmark of versatile question-answering for long-form movie understanding. arXiv preprint arXiv:2312.04817, 2023b. Peiyuan Zhang, Kaichen Zhang, Bo Li, Guangtao Zeng, Jingkang Yang, Yuanhan Zhang, Ziyue Wang, Haoran Tan, Chunyuan Li, and Ziwei Liu. L...
-
[37]
MLVU: Benchmarking Multi-task Long Video Understanding
Junjie Zhou, Yan Shu, Bo Zhao, Boya Wu, Shitao Xiao, Xi Yang, Yongping Xiong, Bo Zhang, Tiejun Huang, and Zheng Liu. Mlvu: A comprehensive benchmark for multi-task long video understanding. arXiv preprint arXiv:2406.04264,
-
[38]
Apollo: An exploration of video understanding in large multimodal models
Orr Zohar, Xiaohan Wang, Yann Dubois, Nikhil Mehta, Tong Xiao, Philippe Hansen-Estruch, Licheng Yu, Xiaofang Wang, Felix Juefei-Xu, Ning Zhang, et al. Apollo: An exploration of video understanding in large multimodal models. arXiv preprint arXiv:2412.10360,