pith. machine review for the scientific record.

arxiv: 2604.09547 · v2 · submitted 2026-04-10 · 💻 cs.CV

Recognition: unknown

Tango: Taming Visual Signals for Efficient Video Large Language Models

Authors on Pith no claims yet

Pith reviewed 2026-05-10 16:56 UTC · model grok-4.3

classification 💻 cs.CV
keywords video large language models · token pruning · efficient inference · attention mechanisms · rotary position embeddings · multimodal models · video understanding

The pith

Tango prunes video tokens to 10% while preserving 98.9% of original Video LLM performance.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Token pruning is a key way to make Video Large Language Models efficient, but existing approaches have two main flaws. Attention-based methods using simple top-k selection miss the full picture because attention is often spread across multiple areas and has a long tail of importance. Similarity-based clustering often breaks videos into small disconnected groups, which warps the meaning when features are pooled. Tango solves these by adding a diversity-driven selection process and a new position embedding called ST-RoPE that respects space and time relations. If this works, it means video models can process content much faster and cheaper while understanding almost as well as the full version.

Core claim

This work reveals two critical limitations in existing methods: conventional top-k selection strategies fail to fully account for the attention distribution, which is often spatially multi-modal and long-tailed in magnitude; and direct similarity-based clustering frequently generates fragmented clusters, resulting in distorted representations after pooling. To address these bottlenecks, Tango integrates a diversity-driven strategy to enhance attention-based token selection, and introduces Spatio-temporal Rotary Position Embedding (ST-RoPE) to preserve geometric structure via locality priors. When retaining only 10% of the video tokens, this yields 98.9% of the original performance on LLaVA-OV while delivering a 1.88× inference speedup.

What carries the argument

Diversity-driven strategy for attention-based token selection combined with Spatio-temporal Rotary Position Embedding (ST-RoPE) that applies locality priors to maintain geometric structure and avoid pooling distortions.
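
The paper's exact selection rule is not reproduced on this page, so the following is only a minimal sketch of the general idea: a greedy criterion that trades attention score against redundancy with already-kept tokens (an MMR-style balance), so the kept set spreads across the multiple attention modes that plain top-k collapses onto. The function name, the `alpha` weight, and the cosine-redundancy term are illustrative assumptions, not the paper's definitions.

```python
import torch
import torch.nn.functional as F

def select_diverse_tokens(feats: torch.Tensor, attn: torch.Tensor,
                          k: int, alpha: float = 0.5) -> torch.Tensor:
    """Hypothetical diversity-driven salient token selection.

    feats: (N, D) vision-token features; attn: (N,) attention-based saliency.
    Greedily keeps k tokens, penalizing similarity to tokens already kept.
    """
    f = F.normalize(feats, dim=-1)
    selected = [int(attn.argmax())]                  # seed with the most attended token
    max_sim = (f @ f[selected[0]]).clamp(min=0)      # best similarity to the kept set so far
    for _ in range(k - 1):
        # high attention is rewarded, high similarity to kept tokens is penalized
        gain = alpha * attn - (1.0 - alpha) * max_sim
        gain[selected] = float("-inf")               # never re-pick a kept token
        idx = int(gain.argmax())
        selected.append(idx)
        max_sim = torch.maximum(max_sim, (f @ f[idx]).clamp(min=0))
    return torch.tensor(selected)
```

With `alpha` near 1 this degenerates to plain top-k; lowering it pushes the kept tokens apart in feature space, which is the behavior the "w/ div." setting in Figure 7 is reported to benefit from.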

If this is right

  • Video LLMs achieve a 1.88× inference speedup at 10% token retention.
  • 98.9% of baseline performance is retained on LLaVA-OV and similar benchmarks.
  • The approach applies across multiple Video LLM architectures and video understanding tasks.
  • Token pruning becomes more effective by explicitly handling multi-modal attention and cluster fragmentation.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same selection and embedding fixes might reduce token counts in image-only multimodal models with less spatial complexity.
  • Longer or higher-frame-rate videos could see larger relative speedups if the locality priors in ST-RoPE continue to hold.
  • Applying the method to other pruning paradigms such as score-based or gradient-based selection would test whether the two limitations are truly general.

Load-bearing premise

The two identified limitations in attention selection and clustering are the primary bottlenecks, and the diversity strategy plus ST-RoPE mitigate them across diverse video content without introducing new distortions.

What would settle it

Running Tango on a set of videos engineered to have single-mode attention distributions and comparing its accuracy retention directly against standard top-k pruning would show whether the diversity component is required.

Figures

Figures reproduced from arXiv: 2604.09547 by Baozhi Jia, Chaoyou Fu, Enhong Chen, Hanchao Wang, Shukang Yin, Sirui Zhao, Xianquan Wang.

Figure 1
Figure 1: Limitations of two typical pre-LLM token pruning approaches. (Top Right) Top-k selection fails to fully capture the attention distribution, which is spatially multimodal and long-tailed in magnitude. (Bottom Right) Direct similarity-based clustering can result in noisy representations. view at source ↗
Figure 2
Figure 2: Illustration of attention distribution and similarity-based clustering. Attention Heatmaps: The spatial distribution of attention scores exhibits multiple modes (i.e., local maxima of attention scores) corresponding to distinct semantic regions (e.g., subtitle and head in the first frame). Clustering Results: Direct similarity-based cluster assignment (Baseline) leads to spatially fragmented results, where… view at source ↗
Figure 3
Figure 3: Distribution of sorted attention scores. The scores exhibit a distinct long-tailed distribution with a narrow variance band, demonstrating highly stable attention patterns across videos. Results are averaged over 100 video samples from ActivityNet [15]. view at source ↗
Figure 4
Figure 4: Overview of our method. It comprises three modules: Temporal Video Segmentation (TVS), Salient Token Selection (STS), and Spatio-Temporal Merging (STM). Our technical contributions lie in (1) an enhanced salient token selection strategy in the STS module, and (2) position-aware clustering with ST-RoPE, used in STS and STM modules. view at source ↗
Figure 5
Figure 5: Illustration of ST-RoPE. (Bottom left) It encodes spatio-temporal position information by applying a rotation matrix to vision tokens. (Bottom right) The pairwise distance on the hypersphere, d(·, ·), is not only determined by cosine similarity, but also modulated by Δp, where spatio-temporally distant tokens are assigned a larger distance penalty. Consequently, semantically similar and spatio-temporally… view at source ↗
Figure 6
Figure 6: Results with different input frames. Our method maintains stable scaling with frame number, clearly outperforming other SOTA methods. "w/o M" denotes pre-LLM-only pruning [9]. The performance is averaged across three video datasets: Video-MME, LongVideoBench, and MLVU. view at source ↗
Figure 7
Figure 7: Results with different salient token selection strategies. Top-k (attn) underperforms, and Top-k (p-attn) is only comparable to a simple uniform sampling baseline. Plugging in our proposed diversity-driven selection method ("w/ div.") brings remarkable gains. attn and p-attn denote attention scores extracted by pooling attention weights on the query dimension [3, 9] and calculating with a global query [5], … A minimal sketch of both scoring schemes follows the figure list. view at source ↗
Figure 9
Figure 9: Latency-performance trade-off of our method. Our approach strikes an exceptional balance between efficiency and accuracy: at 10% token retention, it preserves 98.9% of full-token performance with a 1.88× speedup. Increasing retention to 20% yields nearly lossless performance (99.7%) and a 1.63× speedup. view at source ↗
Figure 10
Figure 10: Illustration of SigLIP ViT backbone. The image is transformed by a patch embedding layer, added with position embedding, then sent into a stack of Transformer blocks, comprising layer normalization, self-attention/MLP, and residual connection. view at source ↗
Figure 11
Figure 11: Visualization of spatial distributions of attention scores extracted with SigLIP ViT. We observe that (1) sink tokens are distributed around the corners ("attn" and "p-attn" show similar patterns), usually corresponding to backgrounds; and (2) "p-attn" shows more robustness to attention sinks. The indices of the top-10 highest-scoring tokens are annotated. Results are averaged over 100 video samples from … view at source ↗
Figure 12
Figure 12: Qualitative case of intermediate variables within the attention block. We observe that (1) the sink tokens are initially induced by adding with the position embedding, where a few outlier tokens exhibit exceptionally high norms; (2) the skewed distribution induced by outliers is passed across layers through residual connection, and temporarily reverted by layer normalization (LN); (3) post-attention norms… view at source ↗
Figure 13
Figure 13: Qualitative cases of attention patterns. We observe that (1) "attn" shows more severe attention sinks, failing to attend to salient objects; (2) "p-attn" shows a strong tendency to attend to text-related regions. For instance, the subtitle in the first case, and the leaderboard in the second and third cases. view at source ↗
Figure 14
Figure 14: Qualitative cases of clustering image features. Our method better preserves the geometric structure of objects and better separates different semantic entities. We note that compressing complex scenes remains a significant challenge (e.g., crowded scenes in the last frame of case 2 or an e-sports event in case 4, which involve intricate or abstract semantics). view at source ↗
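
The captions for Figures 7, 11, and 13 distinguish two ways of reading a saliency score out of the SigLIP ViT: "attn" (post-softmax attention weights pooled over the query dimension, as in [3, 9]) and "p-attn" (scores computed against a single global query, as in [5]). Below is a minimal sketch of both, assuming a standard multi-head attention tensor; the specific pooling choices (mean over heads, mean-pooled global query) are assumptions for illustration, not the cited papers' exact recipes.

```python
import torch

def attn_scores(attn_weights: torch.Tensor) -> torch.Tensor:
    """'attn' variant: average post-softmax attention over heads and over the
    query dimension, leaving one saliency score per key token.
    attn_weights: (H, N, N) attention map from one ViT self-attention layer."""
    return attn_weights.mean(dim=0).mean(dim=0)               # -> (N,)

def p_attn_scores(q: torch.Tensor, k: torch.Tensor) -> torch.Tensor:
    """'p-attn' variant: score every key against one global query (here a mean
    pool of the query vectors, standing in for a [CLS]-like probe).
    q, k: (N, D) per-token query / key projections from one head."""
    g = q.mean(dim=0)                                         # global query, (D,)
    return torch.softmax(k @ g / q.shape[-1] ** 0.5, dim=0)   # -> (N,)
```

Either function returns an (N,) saliency vector that a selection rule such as the diversity-driven sketch above could consume; per Figure 11, "p-attn" is reported to be the more robust of the two to attention sinks.
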
read the original abstract

Token pruning has emerged as a mainstream approach for developing efficient Video Large Language Models (Video LLMs). This work revisits and advances the two predominant token-pruning paradigms: attention-based selection and similarity-based clustering. Our study reveals two critical limitations in existing methods: (1) conventional top-k selection strategies fail to fully account for the attention distribution, which is often spatially multi-modal and long-tailed in magnitude; and (2) direct similarity-based clustering frequently generates fragmented clusters, resulting in distorted representations after pooling. To address these bottlenecks, we propose Tango, a novel framework designed to optimize the utilization of visual signals. Tango integrates a diversity-driven strategy to enhance attention-based token selection, and introduces Spatio-temporal Rotary Position Embedding (ST-RoPE) to preserve geometric structure via locality priors. Comprehensive experiments across various Video LLMs and video understanding benchmarks demonstrate the effectiveness and generalizability of our approach. Notably, when retaining only 10% of the video tokens, Tango preserves 98.9% of the original performance on LLaVA-OV while delivering a 1.88$\times$ inference speedup.
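
The abstract names ST-RoPE but its explicit formulation is not reproduced on this page (the referee report below flags the missing equation), so the sketch that follows only illustrates the mechanism the Figure 5 caption describes: rotate each vision token's features by its frame index and patch coordinates, so that the cosine distance later used for clustering and merging carries a spatio-temporal locality penalty. The even three-way channel split across (t, x, y) and the RoPE frequency schedule are assumptions for illustration, not the paper's definition.

```python
import torch

def rope_rotate(x: torch.Tensor, pos: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Standard RoPE: rotate channel pairs (2i, 2i+1) of x by angle pos * base**(-2i/D).
    x: (N, D) features with even D; pos: (N,) scalar positions."""
    d = x.shape[-1]
    freqs = base ** (-torch.arange(0, d, 2, dtype=x.dtype) / d)   # (D/2,)
    ang = pos[:, None] * freqs                                    # (N, D/2)
    cos, sin = ang.cos(), ang.sin()
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = torch.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

def st_rope(feats: torch.Tensor, t: torch.Tensor, x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    """Illustrative spatio-temporal RoPE: split channels into three groups and
    rotate them by frame index t and patch coordinates (x, y) respectively.
    feats: (N, D) with D divisible by 6 so each group has an even width."""
    d = feats.shape[-1] // 3
    groups = [rope_rotate(feats[:, i * d:(i + 1) * d], p) for i, p in enumerate((t, x, y))]
    return torch.cat(groups, dim=-1)
```

Because each pairwise rotation preserves norms, the cosine similarity between two rotated tokens is modulated by their spatio-temporal offset Δp in addition to their semantic dissimilarity, which is the "larger distance penalty for spatio-temporally distant tokens" that the Figure 5 caption attributes to ST-RoPE.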

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper claims to advance token pruning for Video LLMs by identifying limitations in top-k attention selection (ignoring multi-modal long-tailed distributions) and similarity clustering (fragmented clusters). It proposes a diversity-driven selection strategy and ST-RoPE to preserve structure, reporting 98.9% performance retention at 10% tokens on LLaVA-OV with 1.88× speedup across benchmarks.

Significance. If the central results hold under scrutiny, this could substantially improve the efficiency of video understanding in large models, enabling faster inference and longer context handling with minimal accuracy trade-offs. The emphasis on addressing specific attention and clustering issues provides a targeted contribution to the field of efficient multimodal models.

major comments (3)
  1. [Abstract] The claim that Tango preserves 98.9% performance while retaining only 10% tokens is central, yet the description lacks direct evidence (such as attention histogram comparisons) that the diversity strategy specifically corrects for long-tailed multi-modal attention ignored by top-k.
  2. [Method] ST-RoPE is introduced to preserve geometric structure via locality priors, but without an explicit equation showing how it modifies standard RoPE for spatio-temporal video tokens, it is hard to assess if it avoids warping temporal relations in non-rigid scenes as per the skeptic concern.
  3. [Experiments] Table or results section reporting the 1.88× speedup and 98.9% retention does not include ablations isolating the contribution of diversity strategy versus ST-RoPE, nor tests on long-tailed video datasets, which is load-bearing for the mitigation claim.
minor comments (2)
  1. [Abstract] Consider expanding 'ST-RoPE' to 'Spatio-temporal Rotary Position Embedding' on first mention for clarity.
  2. Some sentences in the method description could be rephrased for better flow regarding the two limitations.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their insightful comments, which have helped us identify areas to strengthen the manuscript. We address each major comment below.

read point-by-point responses
  1. Referee: [Abstract] The claim that Tango preserves 98.9% performance while retaining only 10% tokens is central, yet the description lacks direct evidence (such as attention histogram comparisons) that the diversity strategy specifically corrects for long-tailed multi-modal attention ignored by top-k.

    Authors: We agree that direct evidence would enhance the abstract's claim. In the revised version, we will incorporate attention histogram comparisons and additional analysis in the introduction or method section to demonstrate how the diversity-driven strategy addresses the long-tailed multi-modal attention distributions that top-k selection overlooks. revision: yes

  2. Referee: [Method] ST-RoPE is introduced to preserve geometric structure via locality priors, but without an explicit equation showing how it modifies standard RoPE for spatio-temporal video tokens, it is hard to assess if it avoids warping temporal relations in non-rigid scenes as per the skeptic concern.

    Authors: We will add the explicit equation for ST-RoPE in the revised Method section. This will clearly show the modifications to standard RoPE for spatio-temporal tokens and explain how the locality priors help preserve geometric structure, including in non-rigid scenes. revision: yes

  3. Referee: [Experiments] Table or results section reporting the 1.88× speedup and 98.9% retention does not include ablations isolating the contribution of diversity strategy versus ST-RoPE, nor tests on long-tailed video datasets, which is load-bearing for the mitigation claim.

    Authors: We recognize the importance of isolating component contributions and testing on relevant datasets. We will expand the experiments section with ablations that separate the effects of the diversity strategy and ST-RoPE. Additionally, we will include evaluations on long-tailed video datasets to support the claims regarding mitigation of the identified limitations. revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical validation stands independent of inputs

full rationale

The paper's core contribution consists of identifying two limitations in existing token-pruning methods (multi-modal long-tailed attention and fragmented clusters) via direct observation, then introducing a diversity-driven selection strategy plus ST-RoPE to address them. These interventions are evaluated on external benchmarks (LLaVA-OV and others) with reported metrics such as 98.9% performance retention at 10% tokens. No equations, fitted parameters, or self-citations are shown to reduce the claimed gains to quantities defined by the same data or prior author work; the argument chain remains self-contained and falsifiable against held-out video understanding tasks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entity

The approach rests on standard assumptions from attention mechanisms and token pruning literature; no explicit free parameters, axioms, or new entities with independent evidence are detailed in the abstract beyond the introduced ST-RoPE.

invented entities (1)
  • Spatio-temporal Rotary Position Embedding (ST-RoPE) no independent evidence
    purpose: preserve geometric structure via locality priors during similarity-based clustering
    Introduced to prevent fragmented clusters and distorted representations after pooling.

pith-pipeline@v0.9.0 · 5512 in / 1239 out tokens · 37142 ms · 2026-05-10T16:56:52.603078+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

32 extracted references · 32 canonical work pages · 3 internal anchors

  1. [1]

    An Image is Worth 1/2 Tokens After Layer 2: Plug-and-Play Inference Acceleration for Large Vision-Language Models

    Liang Chen, Haozhe Zhao, Tianyu Liu, Shuai Bai, Junyang Lin, Chang Zhou, and Baobao Chang. An Image is Worth 1/2 Tokens After Layer 2: Plug-and-Play Inference Acceleration for Large Vision-Language Models. In ECCV, 2024.

  2. [2]

    Stop Looking for “Important Tokens” in Multimodal Language Models: Duplication Matters More

    Zichen Wen, Yifeng Gao, Shaobo Wang, Junyuan Zhang, Qintong Zhang, Weijia Li, Conghui He, and Linfeng Zhang. Stop Looking for “Important Tokens” in Multimodal Language Models: Duplication Matters More. In EMNLP, 2025.

  3. [3]

    VisionZip: Longer is Better but Not Necessary in Vision Language Models

    Senqiao Yang, Yukang Chen, Zhuotao Tian, Chengyao Wang, Jingyao Li, Bei Yu, and Jiaya Jia. VisionZip: Longer is Better but Not Necessary in Vision Language Models. In CVPR.

  4. [4]

    Video Compression Commander: Plug-and-Play Inference Acceleration for Video Large Language Models

    Xuyang Liu, Yiyu Wang, Junpeng Ma, and Linfeng Zhang. Video Compression Commander: Plug-and-Play Inference Acceleration for Video Large Language Models. In EMNLP.

  5. [5]

    FastVID: Dynamic Density Pruning for Fast Video Large Language Models

    Leqi Shen, Guoqiang Gong, Tao He, Yifeng Zhang, Pengzhang Liu, Sicheng Zhao, and Guiguang Ding. FastVID: Dynamic Density Pruning for Fast Video Large Language Models. In NeurIPS, 2025.

  6. [6]

    PyramidDrop: Accelerating Your Large Vision-Language Models via Pyramid Visual Redundancy Reduction

    Long Xing, Qidong Huang, Xiaoyi Dong, Jiajie Lu, Pan Zhang, Yuhang Zang, Yuhang Cao, Conghui He, Jiaqi Wang, Feng Wu, et al. PyramidDrop: Accelerating Your Large Vision-Language Models via Pyramid Visual Redundancy Reduction. In CVPR, 2025.

  7. [7]

    DivPrune: Diversity-based Visual Token Pruning for Large Multimodal Models

    Saeed Ranjbar Alvar, Gursimran Singh, Mohammad Akbari, and Yong Zhang. DivPrune: Diversity-based Visual Token Pruning for Large Multimodal Models. In CVPR, 2025.

  8. [8]

    PruneVid: Visual Token Pruning for Efficient Video Large Language Models

    Xiaohu Huang, Hao Zhou, and Kai Han. PruneVid: Visual Token Pruning for Efficient Video Large Language Models. In ACL (Findings), 2025.

  9. [9]

    HoliTom: Holistic Token Merging for Fast Video Large Language Models

    Kele Shao, Keda Tao, Can Qin, Haoxuan You, Yang Sui, and Huan Wang. HoliTom: Holistic Token Merging for Fast Video Large Language Models. In NeurIPS, 2025.

  10. [10]

    FrameFusion: Combining Similarity and Importance for Video Token Reduction on Large Vision Language Models

    Tianyu Fu, Tengxuan Liu, Qinghao Han, Guohao Dai, Shengen Yan, Huazhong Yang, Xuefei Ning, and Yu Wang. FrameFusion: Combining Similarity and Importance for Video Token Reduction on Large Vision Language Models. In ICCV.

  11. [11]

    Sigmoid Loss for Language Image Pre-Training

    Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. Sigmoid Loss for Language Image Pre-Training. In ICCV, 2023.

  12. [12]

    A survey on multimodal large language models. National Science Review, 2024

    Shukang Yin, Chaoyou Fu, Sirui Zhao, Ke Li, Xing Sun, Tong Xu, and Enhong Chen. A survey on multimodal large language models. National Science Review, 2024.

  13. [13]

    Qwen2 Technical Report

    An Yang, Baosong Yang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Zhou, Chengpeng Li, Chengyuan Li, Dayiheng Liu, Fei Huang, et al. Qwen2 Technical Report. arXiv:2407.10671, 2024.

  14. [14]

    SparseVLM: Visual Token Sparsification for Efficient Vision-Language Model Inference

    Yuan Zhang, Chun-Kai Fan, Junpeng Ma, Wenzhao Zheng, Tao Huang, Kuan Cheng, Denis Gudovskiy, Tomoyuki Okuno, Yohei Nakata, Kurt Keutzer, et al. SparseVLM: Visual Token Sparsification for Efficient Vision-Language Model Inference. In ICML, 2025.

  15. [15]

    ActivityNet: A Large-Scale Video Benchmark for Human Activity Understanding

    Fabian Caba Heilbron, Victor Escorcia, Bernard Ghanem, and Juan Carlos Niebles. ActivityNet: A Large-Scale Video Benchmark for Human Activity Understanding. In CVPR, 2015.

  16. [16]

    Vision Transformers Need Registers

    Timothée Darcet, Maxime Oquab, Julien Mairal, and Piotr Bojanowski. Vision Transformers Need Registers. In ICLR, 2024.

  17. [17]

    Vision Transformers Don’t Need Trained Registers

    Nick Jiang, Amil Dravid, Alexei Efros, and Yossi Gandelsman. Vision Transformers Don’t Need Trained Registers. In NeurIPS, 2025.

  18. [18]

    Study on density peaks clustering based on k-nearest neighbors and principal component analysis. Knowledge-Based Systems, 2016

    Mingjing Du, Shifei Ding, and Hongjie Jia. Study on density peaks clustering based on k-nearest neighbors and principal component analysis. Knowledge-Based Systems, 2016.

  19. [19]

    Clustering by fast search and find of density peaks. Science, 2014

    Alex Rodriguez and Alessandro Laio. Clustering by fast search and find of density peaks. Science, 2014.

  20. [20]

    RoFormer: Enhanced transformer with Rotary Position Embedding. Neurocomputing, 2024

    Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. RoFormer: Enhanced transformer with Rotary Position Embedding. Neurocomputing, 2024.

  21. [21]

    Base of RoPE Bounds Context Length

    Mingyu Xu, Xin Men, Bingning Wang, Qingyu Zhang, Hongyu Lin, Xianpei Han, et al. Base of RoPE Bounds Context Length. In NeurIPS, 2024.

  22. [22]

    Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis

    Chaoyou Fu, Yuhan Dai, Yongdong Luo, Lei Li, Shuhuai Ren, Renrui Zhang, Zihan Wang, Chenyu Zhou, Yunhang Shen, Mengdan Zhang, et al. Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis. In CVPR, 2025.

  23. [23]

    MVBench: A Comprehensive Multi-modal Video Understanding Benchmark

    Kunchang Li, Yali Wang, Yinan He, Yizhuo Li, Yi Wang, Yi Liu, Zun Wang, Jilan Xu, Guo Chen, Ping Luo, et al. MVBench: A Comprehensive Multi-modal Video Understanding Benchmark. In CVPR, 2024.

  24. [24]

    LongVideoBench: A Benchmark for Long-context Interleaved Video-Language Understanding

    Haoning Wu, Dongxu Li, Bei Chen, and Junnan Li. LongVideoBench: A Benchmark for Long-context Interleaved Video-Language Understanding. In NeurIPS, 2024.

  25. [25]

    MLVU: Benchmarking Multi-task Long Video Understanding

    Junjie Zhou, Yan Shu, Bo Zhao, Boya Wu, Zhengyang Liang, Shitao Xiao, Minghao Qin, Xi Yang, Yongping Xiong, Bo Zhang, et al. MLVU: Benchmarking Multi-task Long Video Understanding. In CVPR, 2025.

  26. [26]

    LLaVA-OneVision: Easy Visual Task Transfer

    Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Peiyuan Zhang, Yanwei Li, Ziwei Liu, et al. LLaVA-OneVision: Easy Visual Task Transfer. Transactions on Machine Learning Research, 2025.

  27. [27]

    LLaVA-Video: Video Instruction Tuning With Synthetic Data. Transactions on Machine Learning Research, 2025

    Yuanhan Zhang, Jinming Wu, Wei Li, Bo Li, Zejun Ma, Ziwei Liu, and Chunyuan Li. LLaVA-Video: Video Instruction Tuning With Synthetic Data. Transactions on Machine Learning Research, 2025.

  28. [28]

    Qwen2.5-VL Technical Report

    Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2.5-VL Technical Report. arXiv:2502.13923, 2025.

  29. [29]

    VLMEvalKit: An Open-Source ToolKit for Evaluating Large Multi-Modality Models

    Haodong Duan, Junming Yang, Yuxuan Qiao, Xinyu Fang, Lin Chen, Yuan Liu, Xiaoyi Dong, Yuhang Zang, Pan Zhang, Jiaqi Wang, et al. VLMEvalKit: An Open-Source ToolKit for Evaluating Large Multi-Modality Models. In ACM MM.

  30. [30]

    Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

    Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, et al. Qwen2-VL: Enhancing Vision-Language Model’s Perception of the World at Any Resolution. arXiv:2409.12191, 2024.

  31. [31]

    Massive activations in large language models

    Mingjie Sun, Xinlei Chen, J. Zico Kolter, and Zhuang Liu. Massive activations in large language models. In COLM.
