OmniZip: Audio-Guided Dynamic Token Compression for Fast Omnimodal Large Language Models
Pith reviewed 2026-05-17 20:41 UTC · model grok-4.3
The pith
OmniZip lets audio retention scores decide which video tokens to drop in joint audio-video sequences, reporting a 3.42X inference speedup and a 1.4X memory reduction over comparable methods with no retraining.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
OmniZip identifies salient audio tokens, computes an audio retention score for each time group to capture information density, uses this score to dynamically guide video token pruning while preserving audio-anchor cues through cross-modal similarity, and then applies an interleaved spatio-temporal compression scheme to the surviving video tokens.
What carries the argument
Audio retention score per time group, derived from salient audio tokens, that measures local information density and directs selective pruning of video tokens.
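As a rough illustration of how a per-group retention score could steer pruning, the sketch below turns salient-audio flags into per-group video token budgets. The summary here does not give the paper's formula, so the score definition (fraction of a group's audio tokens that are salient), the mean-relative scaling rule, and the names `audio_retention_scores`, `video_keep_counts`, `base_keep`, and `min_keep` are all assumptions made for illustration, not the authors' method.

```python
from typing import List


def audio_retention_scores(salient_mask: List[List[bool]]) -> List[float]:
    """Fraction of salient audio tokens in each time group (assumed definition)."""
    scores = []
    for group in salient_mask:
        # An empty group carries no audio evidence, so it scores 0.0.
        scores.append(sum(group) / len(group) if group else 0.0)
    return scores


def video_keep_counts(scores: List[float], tokens_per_group: int,
                      base_keep: float = 0.3, min_keep: int = 1) -> List[int]:
    """Map retention scores to per-group video token budgets.

    Hypothetical rule: a group keeps base_keep of its video tokens when its
    score equals the mean score, more when above the mean, fewer when below.
    """
    mean = sum(scores) / len(scores) if scores else 0.0
    counts = []
    for s in scores:
        # Fall back to the flat base ratio when no group has any salient audio.
        ratio = base_keep * (s / mean) if mean > 0 else base_keep
        ratio = min(1.0, ratio)  # never keep more tokens than exist
        counts.append(max(min_keep, round(ratio * tokens_per_group)))
    return counts


mask = [[True, True, False, False],
        [False, False, False, False],
        [True, True, True, True]]
scores = audio_retention_scores(mask)
counts = video_keep_counts(scores, tokens_per_group=100)
print(scores, counts)
```

The `min_keep` floor is one possible guard for silent groups: without it, a group with no salient audio would lose every video token, which is exactly the regime where the load-bearing premise is least certain.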
If this is right
- OmniLLMs process longer joint audio-video sequences at 3.42 times the speed of prior methods.
- Memory footprint for multimodal inference drops by a factor of 1.4 while accuracy on understanding benchmarks stays the same.
- Token compression now applies to paired audio-video streams rather than to one modality at a time.
- No additional training is needed to obtain the speedup and memory savings.
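To make the headline factors concrete, the short calculation below converts them into the fraction of baseline cost that remains. It assumes that "3.42X speedup" means latency divided by 3.42 and that "1.4X memory reduction" means memory divided by 1.4, both relative to the compared methods; the helper name `remaining_fraction` is hypothetical.

```python
def remaining_fraction(factor: float) -> float:
    """Fraction of the baseline cost left after an N-times improvement."""
    if factor <= 0:
        raise ValueError("factor must be positive")
    return 1.0 / factor


latency_left = remaining_fraction(3.42)  # speedup figure from the abstract
memory_left = remaining_fraction(1.4)    # memory figure from the abstract

print(f"latency: {latency_left:.1%} of baseline")
print(f"memory:  {memory_left:.1%} of baseline")
```

Under that reading, latency falls to about 29.2% of the baseline and memory to about 71.4%, so the memory saving is considerably more modest than the latency saving.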
Where Pith is reading between the lines
- The same audio-density signal could steer compression in other paired modalities such as text and image sequences.
- Lower token counts may cut energy use enough to run these models on mobile or embedded hardware.
- Measuring retention scores on inputs of varying total length could show how the speedup scales with sequence duration.
Load-bearing premise
The audio retention score computed from salient audio tokens reliably identifies time groups where video tokens can be pruned without losing critical cross-modal information.
What would settle it
A measurable drop in accuracy on tasks that require precise audio-video alignment, such as event localization or speech-driven action recognition, when the compression ratios suggested by the retention scores are applied.
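One way such a drop could be quantified is a paired bootstrap over per-example correctness, sketched below. This is a generic statistical check rather than anything described in the paper; the function name `accuracy_drop_ci`, the 95% interval, and the input format (one boolean per benchmark example for the uncompressed and compressed runs) are assumptions.

```python
import random
from typing import List, Tuple


def accuracy_drop_ci(base_correct: List[bool], comp_correct: List[bool],
                     n_boot: int = 2000, seed: int = 0) -> Tuple[float, float, float]:
    """Paired bootstrap for accuracy(baseline) - accuracy(compressed).

    Returns (point_estimate, lower_95, upper_95).
    """
    if len(base_correct) != len(comp_correct) or not base_correct:
        raise ValueError("need two non-empty, equal-length result lists")
    n = len(base_correct)
    # Per-example difference: +1 if only the baseline is right, -1 if only
    # the compressed model is right, 0 if they agree.
    diffs = [int(b) - int(c) for b, c in zip(base_correct, comp_correct)]
    point = sum(diffs) / n
    rng = random.Random(seed)
    samples = []
    for _ in range(n_boot):
        # Resample examples with replacement, keeping each pair intact.
        resample = [diffs[rng.randrange(n)] for _ in range(n)]
        samples.append(sum(resample) / n)
    samples.sort()
    lower = samples[(n_boot * 25) // 1000]
    upper = samples[max(0, (n_boot * 975) // 1000 - 1)]
    return point, lower, upper
```

If the lower bound of the interval sits above zero on an alignment-sensitive task, that would count as the measurable drop described above; an interval straddling zero would leave the premise standing on that task.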
Original abstract
Omnimodal large language models (OmniLLMs) have attracted increasing research attention of late towards unified audio-video understanding. However, the high computational cost of processing longer joint audio-video token sequences has become a key bottleneck. Existing token compression methods have not addressed the emerging need to jointly compress multimodal tokens. To bridge this gap, we present OmniZip, a training-free, audio-guided audio-visual token-compression framework that optimizes multimodal token representation and accelerates model inference. Specifically, OmniZip first identifies salient audio tokens, then computes an audio retention score for each time group to capture information density, thereby dynamically guiding video token pruning and preserving cues from audio anchors enhanced by cross-modal similarity. For each time window, OmniZip compresses the video tokens using an interleaved spatio-temporal scheme. Extensive results demonstrate the merits of OmniZip: it achieves a 3.42X inference speedup and a 1.4X memory reduction over other top-performing counterparts, while maintaining the performance of OmniLLMs without training.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces OmniZip, a training-free, audio-guided framework for joint audio-visual token compression in omnimodal LLMs. It first identifies salient audio tokens, computes an audio retention score per time group to capture information density, uses this to dynamically prune video tokens while enhancing audio anchors via cross-modal similarity, and applies an interleaved spatio-temporal compression scheme within each time window. The central empirical claim is a 3.42X inference speedup and 1.4X memory reduction relative to top-performing counterparts, with no performance degradation on OmniLLM tasks and no training required.
Significance. If the core assumption holds and the reported speedups are reproducible, the work would offer a practical advance for scaling omnimodal models to longer sequences without retraining. The training-free heuristic and explicit use of audio as a guide for video pruning distinguish it from prior unimodal compression techniques; reproducible code or parameter-free derivations would further strengthen its utility for deployment.
major comments (2)
- [Abstract] Abstract: the headline claim of maintained performance (no loss while achieving 3.42X speedup and 1.4X memory reduction) is load-bearing yet rests on the untested premise that the audio retention score reliably identifies safe video-pruning groups; the abstract supplies no ablation, correlation plot, or failure-case analysis quantifying how well audio information density proxies joint audio-visual density when alignment is weak or asymmetric.
- [Method] Method (description of audio retention score and cross-modal similarity step): the procedure for deriving the retention score from salient audio tokens and for recovering lost cues via cross-modal similarity is presented at a high level without an explicit equation, threshold, or pseudocode; this makes it impossible to verify whether the pruning decision is under-constrained in regimes where salient audio tokens are sparse.
minor comments (2)
- [Experiments] Experiments: the abstract refers to 'extensive results' but does not mention error bars, specific evaluation protocols, or the exact datasets and baselines used; adding these details would make the performance-maintenance claim easier to assess.
- [Abstract] Notation: the terms 'time group' and 'time window' appear to be used interchangeably in the abstract; a short clarifying sentence or diagram would improve readability.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address each major comment below and outline revisions that will strengthen the clarity and empirical support of the work.
-
Referee: [Abstract] Abstract: the headline claim of maintained performance (no loss while achieving 3.42X speedup and 1.4X memory reduction) is load-bearing yet rests on the untested premise that the audio retention score reliably identifies safe video-pruning groups; the abstract supplies no ablation, correlation plot, or failure-case analysis quantifying how well audio information density proxies joint audio-visual density when alignment is weak or asymmetric.
Authors: We agree that the abstract claim would be strengthened by direct evidence on the audio retention score's reliability under weak or asymmetric alignment. Our main experiments already demonstrate that OmniZip preserves task performance on multiple OmniLLM benchmarks relative to the uncompressed baseline, indicating that the pruning decisions are safe in the evaluated regimes. To address the specific concern, the revised manuscript will add a dedicated ablation subsection with correlation plots between audio retention scores and joint audio-visual information density, plus failure-case analysis on deliberately misaligned or sparse-audio inputs. revision: yes
-
Referee: [Method] Method (description of audio retention score and cross-modal similarity step): the procedure for deriving the retention score from salient audio tokens and for recovering lost cues via cross-modal similarity is presented at a high level without an explicit equation, threshold, or pseudocode; this makes it impossible to verify whether the pruning decision is under-constrained in regimes where salient audio tokens are sparse.
Authors: We acknowledge that the current method description is high-level and would benefit from greater formality. The retention score aggregates normalized importance weights of salient audio tokens per time group, after which cross-modal similarity (computed via cosine similarity in the shared embedding space) is used to up-weight audio anchors that guide video pruning. In the revision we will insert the explicit equations for both the retention score and the cross-modal enhancement step, together with the exact threshold values and a concise pseudocode block for the full per-window pruning procedure. This will make the behavior under sparse salient-token conditions directly verifiable. revision: yes
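Pending the promised equations, the rebuttal's verbal description can be sketched as follows. Everything beyond "normalized importance weights per time group" and "cosine similarity" is an assumption: the global top-k rule for picking salient tokens, the multiplicative boost `1 + alpha * similarity`, and the names `retention_scores`, `upweight_anchor`, `top_k`, and `alpha` are hypothetical and may differ from what the authors publish.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a > 0 and norm_b > 0 else 0.0


def retention_scores(importance: List[List[float]], top_k: int) -> List[float]:
    """Per-group sum of normalized importance over globally salient audio tokens.

    Assumed rule: the top_k highest-weight tokens across all groups are
    salient (ties at the cutoff are all counted).
    """
    flat = sorted((w for group in importance for w in group), reverse=True)
    total = sum(flat)
    if total <= 0 or top_k <= 0:
        return [0.0] * len(importance)
    cutoff = flat[min(top_k, len(flat)) - 1]  # smallest weight still salient
    return [sum(w for w in group if w >= cutoff) / total for group in importance]


def upweight_anchor(anchor_weight: float, anchor_emb: Sequence[float],
                    video_embs: List[Sequence[float]], alpha: float = 1.0) -> float:
    """Boost an audio anchor by its best cosine match among a group's video tokens."""
    best = max((cosine(anchor_emb, v) for v in video_embs), default=0.0)
    # Negative similarity gives no boost rather than a penalty (assumed choice).
    return anchor_weight * (1.0 + alpha * max(0.0, best))


importance = [[4.0, 1.0], [0.5, 0.5], [3.0, 1.0]]
print(retention_scores(importance, top_k=2))
```

In this toy input the middle group contains no salient token and scores exactly zero, which illustrates the referee's concern: when salient audio is sparse, the score alone gives no guidance for that group, and the outcome depends entirely on whatever floor or threshold the final method specifies.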
Circularity Check
No circularity: empirical claims rest on experimental measurements, not self-referential definitions or fitted inputs
The paper describes a training-free heuristic that identifies salient audio tokens, computes an audio retention score per time group, and uses cross-modal similarity to guide video token pruning. Performance claims (3.42X speedup, 1.4X memory reduction, maintained accuracy) are presented as results from extensive experiments rather than quantities derived by construction from the method's own parameters or prior self-citations. No equations, fitted parameters, or uniqueness theorems are invoked in a way that reduces the central result to its inputs. The derivation chain is therefore self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Audio tokens contain sufficient cross-modal cues to guide safe pruning of video tokens.
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
unclear: relation between the paper passage and the cited Recognition theorem.
OmniZip first identifies salient audio tokens, then computes an audio retention score for each time group to capture information density, thereby dynamically guiding video token pruning... For each time window, OmniZip compresses the video tokens using an interleaved spatio-temporal scheme.
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 2 Pith papers
-
Chain of Modality: From Static Fusion to Dynamic Orchestration in Omni-MLLMs
Chain of Modality dynamically orchestrates multimodal input topologies and bifurcates cognitive execution to overcome static fusion biases in Omni-MLLMs.
-
OmniRefine: Alignment-Aware Cooperative Compression for Efficient Omnimodal Large Language Models
OmniRefine introduces alignment-aware chunk refinement via similarity and dynamic programming followed by modality-cooperative token compression, achieving near-baseline accuracy at 44% token retention on WorldSense.