pith. machine review for the scientific record.

arxiv: 2604.09547 · v2 · submitted 2026-04-10 · 💻 cs.CV

Recognition: unknown

Tango: Taming Visual Signals for Efficient Video Large Language Models

Authors on Pith no claims yet

Pith reviewed 2026-05-10 16:56 UTC · model grok-4.3

classification 💻 cs.CV
keywords video large language models · token pruning · efficient inference · attention mechanisms · rotary position embeddings · multimodal models · video understanding

The pith

Tango prunes video tokens to 10% while preserving 98.9% of original Video LLM performance.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Token pruning is a key way to make Video Large Language Models efficient, but existing approaches have two main flaws. Attention-based methods using simple top-k selection miss the full picture because attention is often spread across multiple areas and has a long tail of importance. Similarity-based clustering often breaks videos into small disconnected groups, which warps the meaning when features are pooled. Tango solves these by adding a diversity-driven selection process and a new position embedding called ST-RoPE that respects space and time relations. If this works, it means video models can process content much faster and cheaper while understanding almost as well as the full version.

Core claim

This work reveals two critical limitations in existing methods: conventional top-k selection strategies fail to fully account for the attention distribution, which is often spatially multi-modal and long-tailed in magnitude; and direct similarity-based clustering frequently generates fragmented clusters, resulting in distorted representations after pooling. To address these bottlenecks, Tango integrates a diversity-driven strategy to enhance attention-based token selection, and introduces Spatio-temporal Rotary Position Embedding (ST-RoPE) to preserve geometric structure via locality priors. When retaining only 10% of the video tokens, this yields 98.9% of the original performance on LLaVA-OV while delivering a 1.88× inference speedup.

What carries the argument

Diversity-driven strategy for attention-based token selection combined with Spatio-temporal Rotary Position Embedding (ST-RoPE) that applies locality priors to maintain geometric structure and avoid pooling distortions.
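
The paper's exact selection rule is not reproduced on this page, so the following is only a minimal sketch of the general idea: a greedy criterion that trades attention score against redundancy with already-kept tokens (an MMR-style balance), so the kept set spreads across the multiple attention modes that plain top-k collapses onto. The function name, the `alpha` weight, and the cosine-redundancy term are illustrative assumptions, not the paper's definitions.

```python
import torch
import torch.nn.functional as F

def select_diverse_tokens(feats: torch.Tensor, attn: torch.Tensor,
                          k: int, alpha: float = 0.5) -> torch.Tensor:
    """Hypothetical diversity-driven salient token selection.

    feats: (N, D) vision-token features; attn: (N,) attention-based saliency.
    Greedily keeps k tokens, penalizing similarity to tokens already kept.
    """
    f = F.normalize(feats, dim=-1)
    selected = [int(attn.argmax())]                  # seed with the most attended token
    max_sim = (f @ f[selected[0]]).clamp(min=0)      # best similarity to the kept set so far
    for _ in range(k - 1):
        # high attention is rewarded, high similarity to kept tokens is penalized
        gain = alpha * attn - (1.0 - alpha) * max_sim
        gain[selected] = float("-inf")               # never re-pick a kept token
        idx = int(gain.argmax())
        selected.append(idx)
        max_sim = torch.maximum(max_sim, (f @ f[idx]).clamp(min=0))
    return torch.tensor(selected)
```

With `alpha` near 1 this degenerates to plain top-k; lowering it pushes the kept tokens apart in feature space, which is the behavior the "w/ div." setting in Figure 7 is reported to benefit from.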

If this is right

  • Video LLMs achieve a 1.88× inference speedup at 10% token retention.
  • 98.9% of baseline performance is retained on LLaVA-OV and similar benchmarks.
  • The approach applies across multiple Video LLM architectures and video understanding tasks.
  • Token pruning becomes more effective by explicitly handling multi-modal attention and cluster fragmentation.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same selection and embedding fixes might reduce token counts in image-only multimodal models with less spatial complexity.
  • Longer or higher-frame-rate videos could see larger relative speedups if the locality priors in ST-RoPE continue to hold.
  • Applying the method to other pruning paradigms such as score-based or gradient-based selection would test whether the two limitations are truly general.

Load-bearing premise

The two identified limitations in attention selection and clustering are the primary bottlenecks, and the diversity strategy plus ST-RoPE mitigate them across diverse video content without introducing new distortions.

What would settle it

Running Tango on a set of videos engineered to have single-mode attention distributions and comparing its accuracy retention directly against standard top-k pruning would show whether the diversity component is required.

Figures

Figures reproduced from arXiv: 2604.09547 by Baozhi Jia, Chaoyou Fu, Enhong Chen, Hanchao Wang, Shukang Yin, Sirui Zhao, Xianquan Wang.

Figure 1
Figure 1: Limitations of two typical pre-LLM token pruning approaches. (Top Right) Top-k selection fails to fully capture the attention distribution, which is spatially multimodal and long-tailed in magnitude. (Bottom Right) Direct similarity-based clustering can result in noisy representations. view at source ↗
Figure 2
Figure 2: Illustration of attention distribution and similarity-based clustering. Attention Heatmaps: The spatial distribution of attention scores exhibits multiple modes (i.e., local maxima of attention scores) corresponding to distinct semantic regions (e.g., subtitle and head in the first frame). Clustering Results: Direct similarity-based cluster assignment (Baseline) leads to spatially fragmented results, where… view at source ↗
Figure 3
Figure 3: Distribution of sorted attention scores. The scores exhibit a distinct long-tailed distribution with a narrow variance band, demonstrating highly stable attention patterns across videos. Results are averaged over 100 video samples from ActivityNet [15]. view at source ↗
Figure 4
Figure 4: Overview of our method. It comprises three modules: Temporal Video Segmentation (TVS), Salient Token Selection (STS), and Spatio-Temporal Merging (STM). Our technical contributions lie in (1) an enhanced salient token selection strategy in the STS module, and (2) position-aware clustering with ST-RoPE, used in STS and STM modules. view at source ↗
Figure 5
Figure 5: Illustration of ST-RoPE. (Bottom left) It encodes spatio-temporal position information by applying a rotation matrix to vision tokens. (Bottom right) The pairwise distance on the hypersphere, d(·, ·), is not only determined by cosine similarity, but also modulated by Δp, where spatio-temporally distant tokens are assigned a larger distance penalty. Consequently, semantically similar and spatio-temporally… view at source ↗
Figure 6
Figure 6: Results with different input frames. Our method maintains stable scaling with frame number, clearly outperforming other SOTA methods. "w/o M" denotes pre-LLM-only pruning [9]. The performance is averaged across three video datasets: Video-MME, LongVideoBench, and MLVU. view at source ↗
Figure 7
Figure 7: Results with different salient token selection strategies. Top-k (attn) underperforms, and Top-k (p-attn) is only comparable to a simple uniform sampling baseline. Plugging in our proposed diversity-driven selection method ("w/ div.") brings remarkable gains. attn and p-attn denote attention scores extracted by pooling attention weights on the query dimension [3, 9] and calculating with a global query [5], … A minimal sketch of both scoring schemes follows the figure list. view at source ↗
Figure 9
Figure 9: Latency-performance trade-off of our method. Our approach strikes an exceptional balance between efficiency and accuracy: at 10% token retention, it preserves 98.9% of full-token performance with a 1.88× speedup. Increasing retention to 20% yields nearly lossless performance (99.7%) and a 1.63× speedup. view at source ↗
Figure 10
Figure 10: Illustration of SigLIP ViT backbone. The image is transformed by a patch embedding layer, added with position embedding, then sent into a stack of Transformer blocks, comprising layer normalization, self-attention/MLP, and residual connection. view at source ↗
Figure 11
Figure 11: Visualization of spatial distributions of attention scores extracted with SigLIP ViT. We observe that (1) sink tokens are distributed around the corners ("attn" and "p-attn" show similar patterns), usually corresponding to backgrounds; and (2) "p-attn" shows more robustness to attention sinks. The indices of the top-10 highest-scoring tokens are annotated. Results are averaged over 100 video samples from … view at source ↗
Figure 12
Figure 12: Qualitative case of intermediate variables within the attention block. We observe that (1) the sink tokens are initially induced by adding with the position embedding, where a few outlier tokens exhibit exceptionally high norms; (2) the skewed distribution induced by outliers is passed across layers through residual connection, and temporarily reverted by layer normalization (LN); (3) post-attention norms… view at source ↗
Figure 13
Figure 13: Qualitative cases of attention patterns. We observe that (1) "attn" shows more severe attention sinks, failing to attend to salient objects; (2) "p-attn" shows a strong tendency to attend to text-related regions. For instance, the subtitle in the first case, and the leaderboard in the second and third cases. view at source ↗
Figure 14
Figure 14: Qualitative cases of clustering image features. Our method better preserves the geometric structure of objects and better separates different semantic entities. We note that compressing complex scenes remains a significant challenge (e.g., crowded scenes in the last frame of case 2 or an e-sports event in case 4, which involve intricate or abstract semantics). view at source ↗
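
The captions for Figures 7, 11, and 13 distinguish two ways of reading a saliency score out of the SigLIP ViT: "attn" (post-softmax attention weights pooled over the query dimension, as in [3, 9]) and "p-attn" (scores computed against a single global query, as in [5]). Below is a minimal sketch of both, assuming a standard multi-head attention tensor; the specific pooling choices (mean over heads, mean-pooled global query) are assumptions for illustration, not the cited papers' exact recipes.

```python
import torch

def attn_scores(attn_weights: torch.Tensor) -> torch.Tensor:
    """'attn' variant: average post-softmax attention over heads and over the
    query dimension, leaving one saliency score per key token.
    attn_weights: (H, N, N) attention map from one ViT self-attention layer."""
    return attn_weights.mean(dim=0).mean(dim=0)               # -> (N,)

def p_attn_scores(q: torch.Tensor, k: torch.Tensor) -> torch.Tensor:
    """'p-attn' variant: score every key against one global query (here a mean
    pool of the query vectors, standing in for a [CLS]-like probe).
    q, k: (N, D) per-token query / key projections from one head."""
    g = q.mean(dim=0)                                         # global query, (D,)
    return torch.softmax(k @ g / q.shape[-1] ** 0.5, dim=0)   # -> (N,)
```

Either function returns an (N,) saliency vector that a selection rule such as the diversity-driven sketch above could consume; per Figure 11, "p-attn" is reported to be the more robust of the two to attention sinks.
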
read the original abstract

Token pruning has emerged as a mainstream approach for developing efficient Video Large Language Models (Video LLMs). This work revisits and advances the two predominant token-pruning paradigms: attention-based selection and similarity-based clustering. Our study reveals two critical limitations in existing methods: (1) conventional top-k selection strategies fail to fully account for the attention distribution, which is often spatially multi-modal and long-tailed in magnitude; and (2) direct similarity-based clustering frequently generates fragmented clusters, resulting in distorted representations after pooling. To address these bottlenecks, we propose Tango, a novel framework designed to optimize the utilization of visual signals. Tango integrates a diversity-driven strategy to enhance attention-based token selection, and introduces Spatio-temporal Rotary Position Embedding (ST-RoPE) to preserve geometric structure via locality priors. Comprehensive experiments across various Video LLMs and video understanding benchmarks demonstrate the effectiveness and generalizability of our approach. Notably, when retaining only 10% of the video tokens, Tango preserves 98.9% of the original performance on LLaVA-OV while delivering a 1.88$\times$ inference speedup.
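
The abstract names ST-RoPE but its explicit formulation is not reproduced on this page (the referee report below flags the missing equation), so the sketch that follows only illustrates the mechanism the Figure 5 caption describes: rotate each vision token's features by its frame index and patch coordinates, so that the cosine distance later used for clustering and merging carries a spatio-temporal locality penalty. The even three-way channel split across (t, x, y) and the RoPE frequency schedule are assumptions for illustration, not the paper's definition.

```python
import torch

def rope_rotate(x: torch.Tensor, pos: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Standard RoPE: rotate channel pairs (2i, 2i+1) of x by angle pos * base**(-2i/D).
    x: (N, D) features with even D; pos: (N,) scalar positions."""
    d = x.shape[-1]
    freqs = base ** (-torch.arange(0, d, 2, dtype=x.dtype) / d)   # (D/2,)
    ang = pos[:, None] * freqs                                    # (N, D/2)
    cos, sin = ang.cos(), ang.sin()
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = torch.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

def st_rope(feats: torch.Tensor, t: torch.Tensor, x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    """Illustrative spatio-temporal RoPE: split channels into three groups and
    rotate them by frame index t and patch coordinates (x, y) respectively.
    feats: (N, D) with D divisible by 6 so each group has an even width."""
    d = feats.shape[-1] // 3
    groups = [rope_rotate(feats[:, i * d:(i + 1) * d], p) for i, p in enumerate((t, x, y))]
    return torch.cat(groups, dim=-1)
```

Because each pairwise rotation preserves norms, the cosine similarity between two rotated tokens is modulated by their spatio-temporal offset Δp in addition to their semantic dissimilarity, which is the "larger distance penalty for spatio-temporally distant tokens" that the Figure 5 caption attributes to ST-RoPE.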

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper claims to advance token pruning for Video LLMs by identifying limitations in top-k attention selection (ignoring multi-modal long-tailed distributions) and similarity clustering (fragmented clusters). It proposes a diversity-driven selection strategy and ST-RoPE to preserve structure, reporting 98.9% performance retention at 10% tokens on LLaVA-OV with 1.88× speedup across benchmarks.

Significance. If the central results hold under scrutiny, this could substantially improve the efficiency of video understanding in large models, enabling faster inference and longer context handling with minimal accuracy trade-offs. The emphasis on addressing specific attention and clustering issues provides a targeted contribution to the field of efficient multimodal models.

major comments (3)
  1. [Abstract] The claim that Tango preserves 98.9% performance while retaining only 10% tokens is central, yet the description lacks direct evidence (such as attention histogram comparisons) that the diversity strategy specifically corrects for long-tailed multi-modal attention ignored by top-k.
  2. [Method] ST-RoPE is introduced to preserve geometric structure via locality priors, but without an explicit equation showing how it modifies standard RoPE for spatio-temporal video tokens, it is hard to assess if it avoids warping temporal relations in non-rigid scenes as per the skeptic concern.
  3. [Experiments] Table or results section reporting the 1.88× speedup and 98.9% retention does not include ablations isolating the contribution of diversity strategy versus ST-RoPE, nor tests on long-tailed video datasets, which is load-bearing for the mitigation claim.
minor comments (2)
  1. [Abstract] Consider expanding 'ST-RoPE' to 'Spatio-temporal Rotary Position Embedding' on first mention for clarity.
  2. Some sentences in the method description could be rephrased for better flow regarding the two limitations.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their insightful comments, which have helped us identify areas to strengthen the manuscript. We address each major comment below.

read point-by-point responses
  1. Referee: [Abstract] The claim that Tango preserves 98.9% performance while retaining only 10% tokens is central, yet the description lacks direct evidence (such as attention histogram comparisons) that the diversity strategy specifically corrects for long-tailed multi-modal attention ignored by top-k.

    Authors: We agree that direct evidence would enhance the abstract's claim. In the revised version, we will incorporate attention histogram comparisons and additional analysis in the introduction or method section to demonstrate how the diversity-driven strategy addresses the long-tailed multi-modal attention distributions that top-k selection overlooks. revision: yes

  2. Referee: [Method] ST-RoPE is introduced to preserve geometric structure via locality priors, but without an explicit equation showing how it modifies standard RoPE for spatio-temporal video tokens, it is hard to assess if it avoids warping temporal relations in non-rigid scenes as per the skeptic concern.

    Authors: We will add the explicit equation for ST-RoPE in the revised Method section. This will clearly show the modifications to standard RoPE for spatio-temporal tokens and explain how the locality priors help preserve geometric structure, including in non-rigid scenes. revision: yes

  3. Referee: [Experiments] Table or results section reporting the 1.88× speedup and 98.9% retention does not include ablations isolating the contribution of diversity strategy versus ST-RoPE, nor tests on long-tailed video datasets, which is load-bearing for the mitigation claim.

    Authors: We recognize the importance of isolating component contributions and testing on relevant datasets. We will expand the experiments section with ablations that separate the effects of the diversity strategy and ST-RoPE. Additionally, we will include evaluations on long-tailed video datasets to support the claims regarding mitigation of the identified limitations. revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical validation stands independent of inputs

full rationale

The paper's core contribution consists of identifying two limitations in existing token-pruning methods (multi-modal long-tailed attention and fragmented clusters) via direct observation, then introducing a diversity-driven selection strategy plus ST-RoPE to address them. These interventions are evaluated on external benchmarks (LLaVA-OV and others) with reported metrics such as 98.9% performance retention at 10% tokens. No equations, fitted parameters, or self-citations are shown to reduce the claimed gains to quantities defined by the same data or prior author work; the argument chain remains self-contained and falsifiable against held-out video understanding tasks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entity

The approach rests on standard assumptions from attention mechanisms and token pruning literature; no explicit free parameters, axioms, or new entities with independent evidence are detailed in the abstract beyond the introduced ST-RoPE.

invented entities (1)
  • Spatio-temporal Rotary Position Embedding (ST-RoPE) no independent evidence
    purpose: preserve geometric structure via locality priors during similarity-based clustering
    Introduced to prevent fragmented clusters and distorted representations after pooling.

pith-pipeline@v0.9.0 · 5512 in / 1239 out tokens · 37142 ms · 2026-05-10T16:56:52.603078+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

32 extracted references · 32 canonical work pages · 3 internal anchors

  1. [1]

    An Image is Worth 1/2 Tokens After Layer 2: Plug-and-Play Inference Acceleration for Large Vision-Language Models

    Liang Chen, Haozhe Zhao, Tianyu Liu, Shuai Bai, Junyang Lin, Chang Zhou, and Baobao Chang. An Image is Worth 1/2 Tokens After Layer 2: Plug-and-Play Inference Acceleration for Large Vision-Language Models. In ECCV, 2024.

  2. [2]

    Stop Looking for “Important Tokens” in Multimodal Language Models: Duplication Matters More

    Zichen Wen, Yifeng Gao, Shaobo Wang, Junyuan Zhang, Qintong Zhang, Weijia Li, Conghui He, and Linfeng Zhang. Stop Looking for “Important Tokens” in Multimodal Language Models: Duplication Matters More. In EMNLP, 2025.

  3. [3]

    VisionZip: Longer is Better but Not Necessary in Vision Language Models

    Senqiao Yang, Yukang Chen, Zhuotao Tian, Chengyao Wang, Jingyao Li, Bei Yu, and Jiaya Jia. VisionZip: Longer is Better but Not Necessary in Vision Language Models. In CVPR.

  4. [4]

    Video Compression Commander: Plug-and-Play Inference Acceleration for Video Large Language Models

    Xuyang Liu, Yiyu Wang, Junpeng Ma, and Linfeng Zhang. Video Compression Commander: Plug-and-Play Inference Acceleration for Video Large Language Models. In EMNLP.

  5. [5]

    FastVID: Dynamic Density Pruning for Fast Video Large Language Models

    Leqi Shen, Guoqiang Gong, Tao He, Yifeng Zhang, Pengzhang Liu, Sicheng Zhao, and Guiguang Ding. FastVID: Dynamic Density Pruning for Fast Video Large Language Models. In NeurIPS, 2025.

  6. [6]

    PyramidDrop: Accelerating Your Large Vision-Language Models via Pyramid Visual Redundancy Reduction

    Long Xing, Qidong Huang, Xiaoyi Dong, Jiajie Lu, Pan Zhang, Yuhang Zang, Yuhang Cao, Conghui He, Jiaqi Wang, Feng Wu, et al. PyramidDrop: Accelerating Your Large Vision-Language Models via Pyramid Visual Redundancy Reduction. In CVPR, 2025.

  7. [7]

    DivPrune: Diversity-based Visual Token Pruning for Large Multimodal Models

    Saeed Ranjbar Alvar, Gursimran Singh, Mohammad Akbari, and Yong Zhang. DivPrune: Diversity-based Visual Token Pruning for Large Multimodal Models. In CVPR, 2025.

  8. [8]

    PruneVid: Visual Token Pruning for Efficient Video Large Language Models

    Xiaohu Huang, Hao Zhou, and Kai Han. PruneVid: Visual Token Pruning for Efficient Video Large Language Models. In ACL (Findings), 2025.

  9. [9]

    HoliTom: Holistic Token Merging for Fast Video Large Language Models

    Kele Shao, Keda Tao, Can Qin, Haoxuan You, Yang Sui, and Huan Wang. HoliTom: Holistic Token Merging for Fast Video Large Language Models. In NeurIPS, 2025.

  10. [10]

    FrameFusion: Combining Similarity and Importance for Video Token Reduction on Large Vision Language Models

    Tianyu Fu, Tengxuan Liu, Qinghao Han, Guohao Dai, Shengen Yan, Huazhong Yang, Xuefei Ning, and Yu Wang. FrameFusion: Combining Similarity and Importance for Video Token Reduction on Large Vision Language Models. In ICCV.

  11. [11]

    Sigmoid Loss for Language Image Pre-Training

    Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. Sigmoid Loss for Language Image Pre-Training. In ICCV, 2023.

  12. [12]

    A survey on multimodal large language models. National Science Review, 2024

    Shukang Yin, Chaoyou Fu, Sirui Zhao, Ke Li, Xing Sun, Tong Xu, and Enhong Chen. A survey on multimodal large language models. National Science Review, 2024.

  13. [13]

    Qwen2 Technical Report

    An Yang, Baosong Yang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Zhou, Chengpeng Li, Chengyuan Li, Dayiheng Liu, Fei Huang, et al. Qwen2 Technical Report. arXiv:2407.10671, 2024.

  14. [14]

    SparseVLM: Visual Token Sparsification for Efficient Vision-Language Model Inference

    Yuan Zhang, Chun-Kai Fan, Junpeng Ma, Wenzhao Zheng, Tao Huang, Kuan Cheng, Denis Gudovskiy, Tomoyuki Okuno, Yohei Nakata, Kurt Keutzer, et al. SparseVLM: Visual Token Sparsification for Efficient Vision-Language Model Inference. In ICML, 2025.

  15. [15]

    ActivityNet: A Large-Scale Video Benchmark for Human Activity Understanding

    Fabian Caba Heilbron, Victor Escorcia, Bernard Ghanem, and Juan Carlos Niebles. ActivityNet: A Large-Scale Video Benchmark for Human Activity Understanding. In CVPR, 2015.

  16. [16]

    Vision Transformers Need Registers

    Timothée Darcet, Maxime Oquab, Julien Mairal, and Piotr Bojanowski. Vision Transformers Need Registers. In ICLR, 2024.

  17. [17]

    Vision Transformers Don’t Need Trained Registers

    Nick Jiang, Amil Dravid, Alexei Efros, and Yossi Gandelsman. Vision Transformers Don’t Need Trained Registers. In NeurIPS, 2025.

  18. [18]

    Study on density peaks clustering based on k-nearest neighbors and principal component analysis. Knowledge-Based Systems, 2016

    Mingjing Du, Shifei Ding, and Hongjie Jia. Study on density peaks clustering based on k-nearest neighbors and principal component analysis. Knowledge-Based Systems, 2016.

  19. [19]

    Clustering by fast search and find of density peaks. Science, 2014

    Alex Rodriguez and Alessandro Laio. Clustering by fast search and find of density peaks. Science, 2014.

  20. [20]

    RoFormer: Enhanced transformer with Rotary Position Embedding. Neurocomputing, 2024

    Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. RoFormer: Enhanced transformer with Rotary Position Embedding. Neurocomputing, 2024.

  21. [21]

    Base of RoPE Bounds Context Length

    Mingyu Xu, Xin Men, Bingning Wang, Qingyu Zhang, Hongyu Lin, Xianpei Han, et al. Base of RoPE Bounds Context Length. In NeurIPS, 2024.

  22. [22]

    Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis

    Chaoyou Fu, Yuhan Dai, Yongdong Luo, Lei Li, Shuhuai Ren, Renrui Zhang, Zihan Wang, Chenyu Zhou, Yunhang Shen, Mengdan Zhang, et al. Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis. In CVPR, 2025.

  23. [23]

    MVBench: A Comprehensive Multi-modal Video Understanding Benchmark

    Kunchang Li, Yali Wang, Yinan He, Yizhuo Li, Yi Wang, Yi Liu, Zun Wang, Jilan Xu, Guo Chen, Ping Luo, et al. MVBench: A Comprehensive Multi-modal Video Understanding Benchmark. In CVPR, 2024.

  24. [24]

    LongVideoBench: A Benchmark for Long-context Interleaved Video-Language Understanding

    Haoning Wu, Dongxu Li, Bei Chen, and Junnan Li. LongVideoBench: A Benchmark for Long-context Interleaved Video-Language Understanding. In NeurIPS, 2024.

  25. [25]

    MLVU: Benchmarking Multi-task Long Video Understanding

    Junjie Zhou, Yan Shu, Bo Zhao, Boya Wu, Zhengyang Liang, Shitao Xiao, Minghao Qin, Xi Yang, Yongping Xiong, Bo Zhang, et al. MLVU: Benchmarking Multi-task Long Video Understanding. In CVPR, 2025.

  26. [26]

    LLaVA-OneVision: Easy Visual Task Transfer

    Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Peiyuan Zhang, Yanwei Li, Ziwei Liu, et al. LLaVA-OneVision: Easy Visual Task Transfer. Transactions on Machine Learning Research, 2025.

  27. [27]

    LLaVA-Video: Video Instruction Tuning With Synthetic Data. Transactions on Machine Learning Research, 2025

    Yuanhan Zhang, Jinming Wu, Wei Li, Bo Li, Zejun Ma, Ziwei Liu, and Chunyuan Li. LLaVA-Video: Video Instruction Tuning With Synthetic Data. Transactions on Machine Learning Research, 2025.

  28. [28]

    Qwen2.5-VL Technical Report

    Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2.5-VL Technical Report. arXiv:2502.13923, 2025.

  29. [29]

    VLMEvalKit: An Open-Source ToolKit for Evaluating Large Multi-Modality Models

    Haodong Duan, Junming Yang, Yuxuan Qiao, Xinyu Fang, Lin Chen, Yuan Liu, Xiaoyi Dong, Yuhang Zang, Pan Zhang, Jiaqi Wang, et al. VLMEvalKit: An Open-Source ToolKit for Evaluating Large Multi-Modality Models. In ACM MM.

  30. [30]

    Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

    Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, et al. Qwen2-VL: Enhancing Vision-Language Model’s Perception of the World at Any Resolution. arXiv:2409.12191, 2024.

  31. [31]

    Massive activations in large language models

    Mingjie Sun, Xinlei Chen, J. Zico Kolter, and Zhuang Liu. Massive activations in large language models. In COLM.
