
arxiv: 2511.14582 · v2 · submitted 2025-11-18 · 💻 cs.CV

OmniZip: Audio-Guided Dynamic Token Compression for Fast Omnimodal Large Language Models

Pith reviewed 2026-05-17 20:41 UTC · model grok-4.3

classification 💻 cs.CV
keywords token compression · omnimodal llm · audio-guided pruning · multimodal inference acceleration · training-free compression · audio video token reduction · dynamic spatio-temporal pruning · cross-modal similarity

The pith

OmniZip lets audio retention scores decide which video tokens to drop in joint audio-video sequences, yielding a reported 3.42X inference speedup and 1.4X memory reduction with no retraining.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents a training-free method to compress the combined token stream from audio and video inputs to omnimodal large language models. It works by locating the most informative audio tokens first, then scoring each short time segment for how densely the audio carries information. Those scores tell the system how aggressively to prune video tokens in the same segment while still keeping cross-modal cues through similarity checks. The remaining video tokens receive an interleaved spatial and temporal reduction. If the approach succeeds, longer audio-video inputs become practical for real-time understanding tasks without sacrificing accuracy or requiring model changes.

Core claim

OmniZip identifies salient audio tokens, computes an audio retention score for each time group to capture information density, uses this score to dynamically guide video token pruning while preserving audio-anchor cues through cross-modal similarity, and then applies an interleaved spatio-temporal compression scheme to the surviving video tokens.

What carries the argument

Audio retention score per time group, derived from salient audio tokens, that measures local information density and directs selective pruning of video tokens.
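
A minimal sketch of one way such a score could be computed, assuming each audio token already carries an importance value (for example, the attention mass it receives; this page does not say how saliency is measured). The function name `audio_retention_scores` and the parameters `group_size` and `keep_ratio` are hypothetical, and the score here is simply the fraction of a group's tokens that rank among the globally salient ones, which may differ from the paper's definition.

```python
def audio_retention_scores(importance, group_size, keep_ratio):
    """Per time group, the fraction of audio tokens that are globally salient.

    importance: one importance value per audio token, in temporal order (assumed input).
    group_size: number of audio tokens per time group (hypothetical parameter).
    keep_ratio: global fraction of audio tokens treated as salient (hypothetical parameter).
    """
    n = len(importance)
    k = max(1, round(n * keep_ratio))
    # Indices of the k highest-importance audio tokens across the whole clip.
    ranked = sorted(range(n), key=lambda i: importance[i], reverse=True)
    salient = set(ranked[:k])
    scores = []
    for start in range(0, n, group_size):
        group = range(start, min(start + group_size, n))
        scores.append(sum(i in salient for i in group) / len(group))
    return scores


# Two groups of four tokens; the first group holds three of the four salient tokens.
print(audio_retention_scores([0.9, 0.8, 0.1, 0.2, 0.05, 0.7, 0.1, 0.1], 4, 0.5))
# [0.75, 0.25]
```

A group with a high score is one where the audio is dense with salient tokens, so under the paper's premise its video tokens would be pruned less aggressively.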

If this is right

  • OmniLLMs process longer joint audio-video sequences at 3.42 times the speed of prior methods.
  • Memory footprint for multimodal inference drops by a factor of 1.4 while accuracy on understanding benchmarks stays the same.
  • Token compression now applies to paired audio-video streams rather than to one modality at a time.
  • No additional training is needed to obtain the speedup and memory savings.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same audio-density signal could steer compression in other paired modalities such as text and image sequences.
  • Lower token counts may cut energy use enough to run these models on mobile or embedded hardware.
  • Measuring retention scores on inputs of varying total length could show how the speedup scales with sequence duration.

Load-bearing premise

The audio retention score computed from salient audio tokens reliably identifies time groups where video tokens can be pruned without losing critical cross-modal information.

What would settle it

A measurable drop in accuracy on tasks that require precise audio-video alignment, such as event localization or speech-driven action recognition, when the compression ratios suggested by the retention scores are applied.

Figures

Figures reproduced from arXiv: 2511.14582 by Bohan Yu, Huan Wang, Jian Liu, Keda Tao, Kele Shao, Weiqiang Wang.

Figure 1
(a): We introduce OmniZip, an audio-video token compression method tailored for efficient OmniLLMs. The key innovation is a “listen-to-prune” paradigm – utilizing audio to dynamically guide video token pruning, complemented by a proposed compression module. (b): OmniZip achieves superior performance on various audio-video tasks on WorldSense [17], outperforming other methods. (c): Efficiency and performanc…
Figure 2
Audio tokens dominate attention heatmaps. Regular vertical bands aligned with audio-token positions indicate consistently higher attention to audio tokens, while many video tokens receive little attention, suggesting greater redundancy. Attention aggregates within time windows and decays across windows, indicating that audio and video tokens preferentially attend to short-range context within the same wi…
Figure 3
Detailed overview of our OmniZip method. First, OmniZip computes an audio retention rate derived from dominant audio tokens to determine a dynamic pruning rate for the corresponding video tokens. Next, to preserve multimodal information, we uniformly sample audio anchors and merge with non-anchor tokens selected via cross-modal similarity. Finally, video tokens undergo interleaved spatio-temporal compressi…
Figure 4
Ablation study on ρa and ρv. All experiments illustrated in the figure were carried out on the Qwen2.5-Omni-7B model and the WorldSense benchmark. Left and Middle: We separately analyze the influence of varying ρa and ρv on model performance. In general, excessive pruning of either modality negatively impacts model performance. However, an appropriate balance of audio and video token pruning achieves the b…
Figure 5
Visualization of dynamic pruning ratios. The figure illustrates how audio token retention guides the allocation of video token pruning. Specifically, for time windows with low audio retention, we allocate a higher video pruning ratio, while maintaining a constant total pruning rate.
Method | GPU Mem. ↓ | Prefilling Time ↓ | Acc. ↑ | Latency per Example ↓
Qwen2.5-Omni-7B, Full Tokens | 35G | 291ms (1.00×) | 46.8 | 4.52s (1.…
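
The allocation rule in the Figure 5 caption (lower audio retention means a higher video pruning ratio, at a constant total pruning rate) can be illustrated with a small sketch. It assumes equal-sized time windows and scales each window's kept fraction in proportion to its retention score; the function name `video_prune_ratios` and the proportional rule are assumptions made for illustration, not the paper's formula.

```python
def video_prune_ratios(retention, mean_prune):
    """Spread a fixed average pruning ratio across equal-sized time windows.

    retention: audio retention score per window, each >= 0 (assumed input).
    mean_prune: target average video pruning ratio over all windows.
    """
    mean_r = sum(retention) / len(retention)
    if mean_r == 0:
        # No audio signal anywhere: fall back to uniform pruning.
        return [mean_prune] * len(retention)
    # Kept fraction scales with retention, so its average is (1 - mean_prune).
    keep = [(1.0 - mean_prune) * r / mean_r for r in retention]
    # A window cannot keep more than all of its tokens. If this clamp triggers,
    # the realised average pruning ratio ends up above mean_prune.
    return [1.0 - min(1.0, k) for k in keep]


ratios = video_prune_ratios([0.75, 0.25], mean_prune=0.6)
# The audio-dense window is pruned at about 0.4, the sparse one at about 0.8,
# and the two ratios still average to about 0.6.
```

A production rule would need to redistribute the budget when the clamp triggers; that step is left out here to keep the sketch short.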
Figure 7
Ablation study on G. The accuracy of our method at a 45% retained ratio is analyzed as a function of G, which is defined as the number of tokens merged by each audio token anchor. All experiments illustrated in the figure were carried out on the Qwen2.5-Omni-7B model.
… the GS strategy is suboptimal for the omnimodal setting. The GS strategy extracts focused video and audio tokens independently, ignoring se…
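
To make the role of G concrete, here is a hedged sketch of anchor-based merging: anchors are sampled uniformly, and each anchor is averaged with its G most similar non-anchor tokens. The captions do not specify the similarity criterion or the merge operator, so this sketch swaps in plain cosine similarity between token vectors and mean pooling; `merge_into_anchors` and `num_anchors` are hypothetical names, and `num_anchors` is assumed to be no larger than the number of tokens.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def merge_into_anchors(tokens, num_anchors, G):
    """Average each uniformly sampled anchor with its G most similar free tokens."""
    n = len(tokens)
    step = n / num_anchors
    anchor_idx = [int(i * step) for i in range(num_anchors)]
    anchors = set(anchor_idx)
    free = [i for i in range(n) if i not in anchors]
    merged = []
    for a in anchor_idx:
        # Rank the still-unclaimed non-anchor tokens by similarity to this anchor.
        free.sort(key=lambda i: cosine(tokens[a], tokens[i]), reverse=True)
        picked, free = free[:G], free[G:]
        group = [tokens[a]] + [tokens[i] for i in picked]
        merged.append([sum(col) / len(group) for col in zip(*group)])
    return merged


tokens = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
merged = merge_into_anchors(tokens, num_anchors=2, G=1)
# Four tokens collapse to two: each anchor absorbs its nearest neighbour.
```

Non-anchor tokens that no anchor claims are simply dropped in this sketch, so a larger G merges more context into each anchor at the cost of blurrier representations.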
Figure 8
More visualization of dynamic pruning ratios. The figure illustrates how audio token retention guides the allocation of video token pruning. Specifically, for time windows with low audio retention, we allocate a higher video pruning ratio, while maintaining a constant total pruning rate.
In addition, for the dynamic pruning ratio allocation, we provide more visualization results as shown in [PITH_FULL_IMA…
Original abstract

Omnimodal large language models (OmniLLMs) have attracted increasing research attention of late towards unified audio-video understanding. However, the high computational cost of processing longer joint audio-video token sequences has become a key bottleneck. Existing token compression methods have not addressed the emerging need to jointly compress multimodal tokens. To bridge this gap, we present OmniZip, a training-free, audio-guided audio-visual token-compression framework that optimizes multimodal token representation and accelerates model inference. Specifically, OmniZip first identifies salient audio tokens, then computes an audio retention score for each time group to capture information density, thereby dynamically guiding video token pruning and preserving cues from audio anchors enhanced by cross-modal similarity. For each time window, OmniZip compresses the video tokens using an interleaved spatio-temporal scheme. Extensive results demonstrate the merits of OmniZip: it achieves a 3.42X inference speedup and a 1.4X memory reduction over other top-performing counterparts, while maintaining the performance of OmniLLMs without training.
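
The abstract does not spell out the interleaved spatio-temporal scheme, so the following is only an illustrative guess at what "interleaved" could mean: within a time window, even-indexed frames drop tokens that are redundant with a neighbour in the same frame (spatial step), and odd-indexed frames drop tokens that are redundant with the same position in the previous frame (temporal step). The function `interleaved_compress`, the threshold `tau`, and both redundancy rules are assumptions, not the authors' procedure.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def interleaved_compress(frames, tau):
    """Return, for each frame, the indices of the tokens that survive.

    frames: list of frames; each frame is a list of token vectors, and all
            frames share the same spatial layout (assumed input).
    tau:    similarity above which a token counts as redundant (hypothetical).
    """
    kept = []
    for t, frame in enumerate(frames):
        if t % 2 == 1:
            # Temporal step: compare each token with the same position one frame earlier.
            prev = frames[t - 1]
            keep = [j for j, tok in enumerate(frame) if cosine(tok, prev[j]) <= tau]
        else:
            # Spatial step: compare each token with the last token kept in this frame.
            keep = []
            for j, tok in enumerate(frame):
                if not keep or cosine(tok, frame[keep[-1]]) <= tau:
                    keep.append(j)
        kept.append(keep)
    return kept


frames = [
    [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]],  # frame 0: second token repeats the first
    [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]],  # frame 1: only the middle position changed
]
print(interleaved_compress(frames, tau=0.9))
# [[0, 2], [1]]
```

Alternating the two steps means no frame pays for both comparisons, which is one plausible reason to interleave rather than apply both reductions everywhere.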

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces OmniZip, a training-free, audio-guided framework for joint audio-visual token compression in omnimodal LLMs. It first identifies salient audio tokens, computes an audio retention score per time group to capture information density, uses this to dynamically prune video tokens while enhancing audio anchors via cross-modal similarity, and applies an interleaved spatio-temporal compression scheme within each time window. The central empirical claim is a 3.42X inference speedup and 1.4X memory reduction relative to top-performing counterparts, with no performance degradation on OmniLLM tasks and no training required.

Significance. If the core assumption holds and the reported speedups are reproducible, the work would offer a practical advance for scaling omnimodal models to longer sequences without retraining. The training-free heuristic and explicit use of audio as a guide for video pruning distinguish it from prior unimodal compression techniques; reproducible code or parameter-free derivations would further strengthen its utility for deployment.

major comments (2)
  1. [Abstract] Abstract: the headline claim of maintained performance (no loss while achieving 3.42X speedup and 1.4X memory reduction) is load-bearing yet rests on the untested premise that the audio retention score reliably identifies safe video-pruning groups; the abstract supplies no ablation, correlation plot, or failure-case analysis quantifying how well audio information density proxies joint audio-visual density when alignment is weak or asymmetric.
  2. [Method] Method (description of audio retention score and cross-modal similarity step): the procedure for deriving the retention score from salient audio tokens and for recovering lost cues via cross-modal similarity is presented at a high level without an explicit equation, threshold, or pseudocode; this makes it impossible to verify whether the pruning decision is under-constrained in regimes where salient audio tokens are sparse.
minor comments (2)
  1. [Experiments] Experiments: the abstract refers to 'extensive results' but does not mention error bars, specific evaluation protocols, or the exact datasets and baselines used; adding these details would make the performance-maintenance claim easier to assess.
  2. [Abstract] Notation: the terms 'time group' and 'time window' are used interchangeably in the abstract; a short clarifying sentence or diagram would improve readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and outline revisions that will strengthen the clarity and empirical support of the work.

Point-by-point responses
  1. Referee: [Abstract] Abstract: the headline claim of maintained performance (no loss while achieving 3.42X speedup and 1.4X memory reduction) is load-bearing yet rests on the untested premise that the audio retention score reliably identifies safe video-pruning groups; the abstract supplies no ablation, correlation plot, or failure-case analysis quantifying how well audio information density proxies joint audio-visual density when alignment is weak or asymmetric.

    Authors: We agree that the abstract claim would be strengthened by direct evidence on the audio retention score's reliability under weak or asymmetric alignment. Our main experiments already demonstrate that OmniZip preserves task performance on multiple OmniLLM benchmarks relative to the uncompressed baseline, indicating that the pruning decisions are safe in the evaluated regimes. To address the specific concern, the revised manuscript will add a dedicated ablation subsection with correlation plots between audio retention scores and joint audio-visual information density, plus failure-case analysis on deliberately misaligned or sparse-audio inputs. revision: yes

  2. Referee: [Method] Method (description of audio retention score and cross-modal similarity step): the procedure for deriving the retention score from salient audio tokens and for recovering lost cues via cross-modal similarity is presented at a high level without an explicit equation, threshold, or pseudocode; this makes it impossible to verify whether the pruning decision is under-constrained in regimes where salient audio tokens are sparse.

    Authors: We acknowledge that the current method description is high-level and would benefit from greater formality. The retention score aggregates normalized importance weights of salient audio tokens per time group, after which cross-modal similarity (computed via cosine similarity in the shared embedding space) is used to up-weight audio anchors that guide video pruning. In the revision we will insert the explicit equations for both the retention score and the cross-modal enhancement step, together with the exact threshold values and a concise pseudocode block for the full per-window pruning procedure. This will make the behavior under sparse salient-token conditions directly verifiable. revision: yes
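
The correlation check promised in the first response could start from something as simple as a Pearson coefficient between per-window retention scores and whatever measure of joint audio-visual density the authors adopt. The sketch below assumes both series are already available as equal-length lists; the density measure itself is not defined on this page, and the numbers in the example are made up purely to exercise the function.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (sx * sy)


# Hypothetical per-window values, for illustration only.
retention = [0.75, 0.25, 0.50, 0.10]
joint_density = [0.70, 0.30, 0.55, 0.20]
r = pearson(retention, joint_density)
```

A coefficient near 1 on well-aligned clips and a much lower one on deliberately misaligned or sparse-audio clips would be the kind of evidence the referee asks for.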

Circularity Check

0 steps flagged

No circularity: empirical claims rest on experimental measurements, not self-referential definitions or fitted inputs

Full rationale

The paper describes a training-free heuristic that identifies salient audio tokens, computes an audio retention score per time group, and uses cross-modal similarity to guide video token pruning. Performance claims (3.42X speedup, 1.4X memory reduction, maintained accuracy) are presented as results from extensive experiments rather than quantities derived by construction from the method's own parameters or prior self-citations. No equations, fitted parameters, or uniqueness theorems are invoked in a way that reduces the central result to its inputs. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The method relies on the assumption that audio information density can serve as a reliable proxy for video token importance. No explicit free parameters or invented entities are named in the abstract.

axioms (1)
  • domain assumption: Audio tokens contain sufficient cross-modal cues to guide safe pruning of video tokens.
    Invoked when the method uses audio retention scores to decide video token retention.



Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

    Relation between the paper passage and the cited Recognition theorem.

    OmniZip first identifies salient audio tokens, then computes an audio retention score for each time group to capture information density, thereby dynamically guiding video token pruning... For each time window, OmniZip compresses the video tokens using an interleaved spatio-temporal scheme.

What do these tags mean?

  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Chain of Modality: From Static Fusion to Dynamic Orchestration in Omni-MLLMs

    cs.CV 2026-04 unverdicted novelty 7.0

    Chain of Modality dynamically orchestrates multimodal input topologies and bifurcates cognitive execution to overcome static fusion biases in Omni-MLLMs.

  2. OmniRefine: Alignment-Aware Cooperative Compression for Efficient Omnimodal Large Language Models

    cs.AI 2026-05 unverdicted novelty 5.0

    OmniRefine introduces alignment-aware chunk refinement via similarity and dynamic programming followed by modality-cooperative token compression, achieving near-baseline accuracy at 44% token retention on WorldSense.

Reference graph

Works this paper leans on

66 extracted references · 66 canonical work pages · cited by 2 Pith papers · 15 internal anchors

  1. [1]

    Ming-omni: A unified multimodal model for perception and generation, 2025

    Inclusion AI, Biao Gong, Cheng Zou, Chuanyang Zheng, Chunluan Zhou, Canxiang Yan, Chunxiang Jin, Chunjie Shen, Dandan Zheng, Fudong Wang, et al. Ming-omni: A unified multimodal model for perception and generation. arXiv preprint arXiv:2506.09344, 2025.

  2. [2]

    Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2.5-vl technical report. arXiv preprint arXiv:2502.13923, 2025.

  3. [3]

    Token Merging: Your ViT But Faster

    Daniel Bolya, Cheng-Yang Fu, Xiaoliang Dai, Peizhao Zhang, Christoph Feichtenhofer, and Judy Hoffman. Token merging: Your vit but faster. arXiv preprint arXiv:2210.09461, 2022.

  4. [4]

    An image is worth 1/2 tokens after layer 2: Plug-and-play inference acceleration for large vision-language models

    Liang Chen, Haozhe Zhao, Tianyu Liu, Shuai Bai, Junyang Lin, Chang Zhou, and Baobao Chang. An image is worth 1/2 tokens after layer 2: Plug-and-play inference acceleration for large vision-language models. In ECCV, 2024.

  5. [5]

    Streamingtom: Streaming token compression for efficient video understanding

    Xueyi Chen, Keda Tao, Kele Shao, and Huan Wang. Streamingtom: Streaming token compression for efficient video understanding. arXiv preprint arXiv:2510.18269, 2025.

  6. [6]

    Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks

    Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, et al. Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. In CVPR,

  7. [7]

    FlashAttention-2: Faster attention with better parallelism and work partitioning

    Tri Dao. FlashAttention-2: Faster attention with better parallelism and work partitioning. In International Conference on Learning Representations (ICLR), 2024.

  8. [8]

    FlashAttention: Fast and memory-efficient exact attention with IO-awareness

    Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. FlashAttention: Fast and memory-efficient exact attention with IO-awareness. In Advances in Neural Information Processing Systems (NeurIPS), 2022.

  9. [9]

    Streaming video question-answering with in-context video kv-cache retrieval

    Shangzhe Di, Zhelun Yu, Guanghao Zhang, Haoyuan Li, Tao Zhong, Hao Cheng, Bolin Li, Wanggui He, Fangxun Shu, and Hao Jiang. Streaming video question-answering with in-context video kv-cache retrieval. arXiv preprint arXiv:2503.00540, 2025.

  10. [10]

    Study on density peaks clustering based on k-nearest neighbors and principal component analysis

    Mingjing Du, Shifei Ding, and Hongjie Jia. Study on density peaks clustering based on k-nearest neighbors and principal component analysis. Knowledge-Based Systems, 99:135–145,

  11. [11]

    Sparsegpt: Massive language models can be accurately pruned in one-shot

    Elias Frantar and Dan Alistarh. Sparsegpt: Massive language models can be accurately pruned in one-shot. In ICML, 2023.

  12. [12]

    Vita: Towards open-source interactive omni multimodal llm

    Chaoyou Fu, Haojia Lin, Zuwei Long, Yunhang Shen, Yuhang Dai, Meng Zhao, Yi-Fan Zhang, Shaoqi Dong, Yangze Li, Xiong Wang, et al. Vita: Towards open-source interactive omni multimodal llm. arXiv preprint arXiv:2408.05211, 2024.

  13. [13]

    Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis

    Chaoyou Fu, Yuhan Dai, Yongdong Luo, Lei Li, Shuhuai Ren, Renrui Zhang, Zihan Wang, Chenyu Zhou, Yunhang Shen, Mengdan Zhang, et al. Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis. In CVPR, 2025.

  14. [14]

    VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction

    Chaoyou Fu, Haojia Lin, Xiong Wang, Yi-Fan Zhang, Yunhang Shen, Xiaoyu Liu, Haoyu Cao, Zuwei Long, Heting Gao, Ke Li, et al. Vita-1.5: Towards gpt-4o level real-time vision and speech interaction. arXiv preprint arXiv:2501.01957,

  15. [15]

    Arc-hunyuan-video-7b: Structured video comprehension of real-world shorts

    Yuying Ge, Yixiao Ge, Chen Li, Teng Wang, Junfu Pu, Yizhuo Li, Lu Qiu, Jin Ma, Lisheng Duan, Xinyu Zuo, et al. Arc-hunyuan-video-7b: Structured video comprehension of real-world shorts. arXiv preprint arXiv:2507.20939, 2025.

  16. [16]

    Zipvl: Efficient large vision-language models with dynamic token sparsification

    Yefei He, Feng Chen, Jing Liu, Wenqi Shao, Hong Zhou, Kaipeng Zhang, and Bohan Zhuang. Zipvl: Efficient large vision-language models with dynamic token sparsification. arXiv preprint arXiv:2410.08584, 2024.

  17. [17]

    WorldSense: Evaluating Real-world Omnimodal Understanding for Multimodal LLMs

    Jack Hong, Shilin Yan, Jiayin Cai, Xiaolong Jiang, Yao Hu, and Weidi Xie. Worldsense: Evaluating real-world omnimodal understanding for multimodal llms. arXiv preprint arXiv:2502.04326, 2025.

  18. [18]

    Prunevid: Visual token pruning for efficient video large language models

    Xiaohu Huang, Hao Zhou, and Kai Han. Prunevid: Visual token pruning for efficient video large language models. In ACL, 2025.

  19. [19]

    Token pruning in audio transformers: Optimizing performance and decoding patch importance

    Taehan Lee and Hyukjun Lee. Token pruning in audio transformers: Optimizing performance and decoding patch importance. arXiv preprint arXiv:2504.01690, 2025.

  20. [20]

    Lmms-eval: Accelerating the development of large multimodal models, 2024

    Bo Li, Peiyuan Zhang, Kaichen Zhang, Fanyi Pu, Xinrun Du, Yuhao Dong, Haotian Liu, Yuanhan Zhang, Ge Zhang, Chunyuan Li, and Ziwei Liu. Lmms-eval: Accelerating the development of large multimodal models, 2024.

  21. [21]

    Llava-onevision: Easy visual task transfer

    Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Peiyuan Zhang, Yanwei Li, Ziwei Liu, and Chunyuan Li. Llava-onevision: Easy visual task transfer. TMLR, 2025.

  22. [22]

    VideoChat: Chat-Centric Video Understanding

    KunChang Li, Yinan He, Yi Wang, Yizhuo Li, Wenhai Wang, Ping Luo, Yali Wang, Limin Wang, and Yu Qiao. Videochat: Chat-centric video understanding. arXiv preprint arXiv:2305.06355, 2023.

  23. [23]

    Accelerating transducers through adjacent token merging

    Yuang Li, Yu Wu, Jinyu Li, and Shujie Liu. Accelerating transducers through adjacent token merging. In Interspeech,

  24. [24]

    Baichuan-omni technical report

    Yadong Li, Haoze Sun, Mingan Lin, Tianpeng Li, Guosheng Dong, Tao Zhang, Bowen Ding, Wei Song, Zhenglin Cheng, Yuqi Huo, Song Chen, Xu Li, Da Pan, Shusen Zhang, Xin Wu, Zheng Liang, Jun Liu, Tao Zhang, Keer Lu, Yaqi Zhao, Yanjun Shen, Fan Yang, Kaicheng Yu, Tao Lin, Jianhua Xu, Zenan Zhou, and Weipeng Chen. Baichuan-omni technical report. arXiv preprint ar...

  25. [25]

    Video-llava: Learning united visual representation by alignment before projection

    Bin Lin, Yang Ye, Bin Zhu, Jiaxi Cui, Munan Ning, Peng Jin, and Li Yuan. Video-llava: Learning united visual representation by alignment before projection. In EMNLP, 2024.

  26. [26]

    Awq: Activation-aware weight quantization for on-device llm compression and acceleration

    Ji Lin, Jiaming Tang, Haotian Tang, Shang Yang, Wei-Ming Chen, Wei-Chen Wang, Guangxuan Xiao, Xingyu Dang, Chuang Gan, and Song Han. Awq: Activation-aware weight quantization for on-device llm compression and acceleration

  27. [27]

    Speechprune: Context-aware token pruning for speech information retrieval

    Yueqian Lin, Yuzhe Fu, Jingyang Zhang, Yudong Liu, Jianyi Zhang, Jingwei Sun, Hai Li, Yiran Chen, et al. Speechprune: Context-aware token pruning for speech information retrieval. In ICME, 2025.

  28. [28]

    Visual instruction tuning

    Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. In NeurIPS, 2023.

  29. [29]

    Revisiting mllm token technology through the lens of classical visual coding

    Jinming Liu, Junyan Lin, Yuntao Wei, Kele Shao, Keda Tao, Jianguo Huang, Xudong Yang, Zhibo Chen, Huan Wang, and Xin Jin. Revisiting mllm token technology through the lens of classical visual coding. arXiv preprint arXiv:2508.13460,

  30. [30]

    SpinQuant: LLM quantization with learned rotations

    Zechun Liu, Changsheng Zhao, Igor Fedorov, Bilge Soran, Dhruv Choudhary, Raghuraman Krishnamoorthi, Vikas Chandra, Yuandong Tian, and Tijmen Blankevoort. Spinquant: Llm quantization with learned rotations. arXiv preprint arXiv:2405.16406, 2024.

  31. [31]

    Streaming long video understanding with large language models

    Rui Qian, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Shuangrui Ding, Dahua Lin, and Jiaqi Wang. Streaming long video understanding with large language models. 2024.

  32. [32]

    Llava-prumerge: Adaptive token reduction for efficient large multimodal models

    Yuzhang Shang, Mu Cai, Bingxin Xu, Yong Jae Lee, and Yan Yan. Llava-prumerge: Adaptive token reduction for efficient large multimodal models. In ICCV, 2025.

  33. [33]

    Holitom: Holistic token merging for fast video large language models

    Kele Shao, Keda Tao, Can Qin, Haoxuan You, Yang Sui, and Huan Wang. Holitom: Holistic token merging for fast video large language models. arXiv preprint arXiv:2505.21334,

  34. [34]

    When tokens talk too much: A survey of multimodal long-context token compression across images, videos, and audios

    Kele Shao, Keda Tao, Kejia Zhang, Sicheng Feng, Mu Cai, Yuzhang Shang, Haoxuan You, Can Qin, Yang Sui, and Huan Wang. When tokens talk too much: A survey of multimodal long-context token compression across images, videos, and audios. arXiv preprint arXiv:2507.20198, 2025.

  35. [35]

    Fastvid: Dynamic density pruning for fast video large language models

    Leqi Shen, Guoqiang Gong, Tao He, Yifeng Zhang, Pengzhang Liu, Sicheng Zhao, and Guiguang Ding. Fastvid: Dynamic density pruning for fast video large language models. arXiv preprint arXiv:2503.11187, 2025.

  36. [36]

    Longvu: Spatiotemporal adaptive compression for long video-language understanding

    Xiaoqian Shen, Yunyang Xiong, Changsheng Zhao, Lemeng Wu, Jun Chen, Chenchen Zhu, Zechun Liu, Fanyi Xiao, Balakrishnan Varadarajan, Florian Bordes, et al. Longvu: Spatiotemporal adaptive compression for long video-language understanding. In ICML, 2025.

  37. [37]

    Audio-visual llm for video understanding

    Fangxun Shu, Lei Zhang, Hao Jiang, and Cihang Xie. Audio-visual llm for video understanding. In CVPR, 2025.

  38. [38]

    video-salmonn: Speech-enhanced audio-visual large language models

    Guangzhi Sun, Wenyi Yu, Changli Tang, Xianzhao Chen, Tian Tan, Wei Li, Lu Lu, Zejun Ma, Yuxuan Wang, and Chao Zhang. video-salmonn: Speech-enhanced audio-visual large language models. arXiv preprint arXiv:2406.15704, 2024.

  39. [39]

    A Simple and Effective Pruning Approach for Large Language Models

    Mingjie Sun, Zhuang Liu, Anna Bair, and J Zico Kolter. A simple and effective pruning approach for large language models. arXiv preprint arXiv:2306.11695, 2023.

  40. [40]

    Tokencarve: Information-preserving visual token compression in multimodal large language models

    Xudong Tan, Peng Ye, Chongjun Tu, Jianjian Cao, Yaoxin Yang, Lin Zhang, Dongzhan Zhou, and Tao Chen. Tokencarve: Information-preserving visual token compression in multimodal large language models. arXiv preprint arXiv:2503.10501, 2025.

  41. [41]

    video-SALMONN 2: Caption-enhanced audio-visual large language models

    Changli Tang, Yixuan Li, Yudong Yang, Jimin Zhuang, Guangzhi Sun, Wei Li, Zejun Ma, and Chao Zhang. video-salmonn 2: Captioning-enhanced audio-visual large language models. arXiv preprint arXiv:2506.15220, 2025.

  42. [42]

    Dycoke: Dynamic compression of tokens for fast video large language models

    Keda Tao, Can Qin, Haoxuan You, Yang Sui, and Huan Wang. Dycoke: Dynamic compression of tokens for fast video large language models. In CVPR, 2025.

  43. [43]

    Gemma 3 Technical Report

    Gemma Team, Aishwarya Kamath, Johan Ferret, Shreya Pathak, Nino Vieillard, Ramona Merhej, Sarah Perrin, Tatiana Matejovicova, Alexandre Ramé, Morgane Rivière, et al. Gemma 3 technical report. arXiv preprint arXiv:2503.19786,

  44. [44]

    Kimi-VL Technical Report

    Kimi Team, Angang Du, Bohong Yin, Bowei Xing, Bowen Qu, Bowen Wang, Cheng Chen, Chenlin Zhang, Chenzhuang Du, Chu Wei, et al. Kimi-vl technical report. arXiv preprint arXiv:2504.07491, 2025.

  45. [45]

    Qwen3 technical report, 2025

    Qwen Team. Qwen3 technical report, 2025.

  46. [46]

    Interactiveomni: A unified omni-modal model for audio-visual multi-turn dialogue

    Wenwen Tong, Hewei Guo, Dongchuan Ran, Jiangnan Chen, Jiefan Lu, Kaibin Wang, Keqiang Li, Xiaoxu Zhu, Jiakui Li, Kehan Li, et al. Interactiveomni: A unified omni-modal model for audio-visual multi-turn dialogue. arXiv preprint arXiv:2510.13747, 2025.

  47. [47]

    Gptvq: The blessing of dimensionality for llm quantization

    Mart Van Baalen, Andrey Kuzmin, Ivan Koryakovskiy, Markus Nagel, Peter Couperus, Cedric Bastoul, Eric Mahurin, Tijmen Blankevoort, and Paul Whatmough. Gptvq: The blessing of dimensionality for llm quantization. arXiv preprint arXiv:2402.15319, 2024.

  48. [48]

    Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

    Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, et al. Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution. arXiv preprint arXiv:2409.12191, 2024.

  49. [49]

    InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency

    Weiyun Wang, Zhangwei Gao, Lixin Gu, Hengjun Pu, Long Cui, Xingguang Wei, Zhaoyang Liu, Linglin Jing, Shenglong Ye, Jie Shao, et al. Internvl3.5: Advancing open-source multimodal models in versatility, reasoning, and efficiency. arXiv preprint arXiv:2508.18265, 2025.

  50. [50]

    Sheared llama: Accelerating language model pre-training via structured pruning

    Mengzhou Xia, Tianyu Gao, Zhiyuan Zeng, and Danqi Chen. Sheared llama: Accelerating language model pre-training via structured pruning. arXiv preprint arXiv:2310.06694, 2023.

  51. [51]

    Smoothquant: Accurate and efficient post-training quantization for large language models

    Guangxuan Xiao, Ji Lin, Mickael Seznec, Hao Wu, Julien Demouth, and Song Han. Smoothquant: Accurate and efficient post-training quantization for large language models. In ICML, 2023.

  52. [52]

    Mini-omni2: Towards open-source gpt-4o with vision, speech and duplex capabilities

    Zhifei Xie and Changqiao Wu. Mini-omni2: Towards open-source gpt-4o with vision, speech and duplex capabilities. arXiv preprint arXiv:2410.11190, 2024. 2

  53. [53]

    Pyramiddrop: Accelerating your large vision-language models via pyramid visual redundancy reduction

    Long Xing, Qidong Huang, Xiaoyi Dong, Jiajie Lu, Pan Zhang, Yuhang Zang, Yuhang Cao, Conghui He, Jiaqi Wang, Feng Wu, et al. Pyramiddrop: Accelerating your large vision-language models via pyramid visual redundancy reduction. In CVPR, 2025. 2, 3

  54. [54]

    Jin Xu, Zhifang Guo, Jinzheng He, Hangrui Hu, Ting He, Shuai Bai, Keqin Chen, Jialin Wang, Yang Fan, Kai Dang, et al. Qwen2.5-omni technical report. arXiv preprint arXiv:2503.20215, 2025. 1, 2, 6

  55. [55]

    Qwen3-Omni Technical Report

    Jin Xu, Zhifang Guo, Hangrui Hu, Yunfei Chu, Xiong Wang, Jinzheng He, Yuxuan Wang, Xian Shi, Ting He, Xinfa Zhu, et al. Qwen3-omni technical report. arXiv preprint arXiv:2509.17765, 2025. 1, 2

  56. [56]

    StreamingVLM: Real-Time Understanding for Infinite Video Streams

    Ruyi Xu, Guangxuan Xiao, Yukang Chen, Liuning He, Kelly Peng, Yao Lu, and Song Han. Streamingvlm: Real-time understanding for infinite video streams. arXiv preprint arXiv:2510.09608, 2025. 3

  57. [57]

    Topv: Compatible token pruning with inference time optimization for fast and low-memory multimodal vision language model

    Cheng Yang, Yang Sui, Jinqi Xiao, Lingyi Huang, Yu Gong, Chendi Li, Jinghua Yan, Yu Bai, Ponnuswamy Sadayappan, Xia Hu, et al. Topv: Compatible token pruning with inference time optimization for fast and low-memory multimodal vision language model. In CVPR, 2025. 2, 3

  58. [58]

    Humanomniv2: From understanding to omni-modal reasoning with context

    Qize Yang, Shimin Yao, Weixuan Chen, Shenghao Fu, Detao Bai, Jiaxing Zhao, Boyuan Sun, Bowen Yin, Xihan Wei, and Jingren Zhou. Humanomniv2: From understanding to omni-modal reasoning with context. arXiv preprint arXiv:2506.21277, 2025. 1, 2

  59. [59]

    Visionzip: Longer is better but not necessary in vision language models

    Senqiao Yang, Yukang Chen, Zhuotao Tian, Chengyao Wang, Jingyao Li, Bei Yu, and Jiaya Jia. Visionzip: Longer is better but not necessary in vision language models. In CVPR, 2025. 2, 3, 5, 8

  60. [60]

    Audio-centric video understanding benchmark without text shortcut

    Yudong Yang, Jimin Zhuang, Guangzhi Sun, Changli Tang, Yixuan Li, Peihan Li, Yifan Jiang, Wei Li, Zejun Ma, and Chao Zhang. Audio-centric video understanding benchmark without text shortcut. In EMNLP, 2025. 5, 8

  61. [61]

    Omnivinci: Enhancing architecture and data for omni-modal understanding llm

    Hanrong Ye, Chao-Han Huck Yang, Arushi Goel, Wei Huang, Ligeng Zhu, Yuanhang Su, Sean Lin, An-Chieh Cheng, Zhen Wan, Jinchuan Tian, et al. Omnivinci: Enhancing architecture and data for omni-modal understanding llm. arXiv preprint arXiv:2510.15870, 2025. 2, 1

  62. [62]

    Fit and prune: Fast and training-free visual token pruning for multi-modal large language models

    Weihao Ye, Qiong Wu, Wenhao Lin, and Yiyi Zhou. Fit and prune: Fast and training-free visual token pruning for multi-modal large language models. In AAAI, 2025. 2, 3

  63. [63]

    VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding

    Boqiang Zhang, Kehan Li, Zesen Cheng, Zhiqiang Hu, Yuqian Yuan, Guanzheng Chen, Sicong Leng, Yuming Jiang, Hang Zhang, Xin Li, et al. Videollama 3: Frontier multimodal foundation models for image and video understanding. arXiv preprint arXiv:2501.13106, 2025. 1, 2

  64. [64]

    Video-llama: An instruction-tuned audio-visual language model for video understanding

    Hang Zhang, Xin Li, and Lidong Bing. Video-llama: An instruction-tuned audio-visual language model for video understanding. In EMNLP, 2023. 1, 2

  65. [65]

    Lmms-eval: Reality check on the evaluation of large multimodal models

    Kaichen Zhang, Bo Li, Peiyuan Zhang, Fanyi Pu, Joshua Adrian Cahyono, Kairui Hu, Shuai Liu, Yuanhan Zhang, Jingkang Yang, Chunyuan Li, and Ziwei Liu. Lmms-eval: Reality check on the evaluation of large multimodal models, 2024. 6

  66. [66]

    Video instruction tuning with synthetic data

    Yuanhan Zhang, Jinming Wu, Wei Li, Bo Li, Zejun Ma, Ziwei Liu, and Chunyuan Li. Video instruction tuning with synthetic data, 2024. 1, 2

OmniZip: Audio-Guided Dynamic Token Compression for Fast Omnimodal Large Language Models

Supplementary Material

A. Dynamic Pruning Rate Allocation Algorithm

This section expands upon the audio-guided video token compr...