pith. machine review for the scientific record.

arxiv: 2605.11605 · v1 · submitted 2026-05-12 · 💻 cs.CV · cs.AI

Recognition: 2 theorem links · Lean Theorem

Keep What Audio Cannot Say: Context-Preserving Token Pruning for Omni-LLMs

Chaeyoung Jung, Joon Son Chung, Kyeongha Rho

Authors on Pith · no claims yet

Pith reviewed 2026-05-13 01:56 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI
keywords token pruning · omni-LLMs · audio-visual models · context-preserving pruning · multimodal token reduction · inference-time optimization · video token merging

The pith

ContextGuard lets omni-LLMs drop more than half their video tokens at inference without losing accuracy by keeping only what audio cannot convey.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The authors reframe token reduction in omni-LLMs as preserving broad audio-visual context instead of selecting query-relevant tokens. They propose ContextGuard, which uses audio to predict coarse visual semantics and prunes video tokens whose coarse semantics are recoverable from audio, while retaining extra tokens for localized details that audio cannot capture. The approach also merges temporally similar video tokens for further savings. It requires no fine-tuning of the main model and relies on a lightweight, independently trained predictor. On a 7B model, this matches full-token performance on five of six benchmarks while pruning 55 percent of input tokens.

Core claim

ContextGuard predicts coarse visual semantics from audio and prunes video tokens whose coarse semantics are likely recoverable from audio, while retaining additional video tokens to preserve localized visual details that audio alone cannot specify and merging temporally similar video tokens. The framework requires no downstream LLM fine-tuning and uses only an independently trained lightweight predictor.

What carries the argument

ContextGuard, an inference-time pruning method that identifies prunable video tokens by checking if their coarse semantics can be recovered from the audio input.
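
To make that decision rule concrete, here is a minimal sketch of audio-guided semantic pruning plus a spatial-detail reserve for a single video chunk. This is an editorial illustration, not the paper's implementation: the predictor interface (`a2v_predictor`), the cosine-similarity recoverability score, the variance-based detail heuristic, and the default values of `rho_sem` and `detail_budget` are all assumptions.

```python
import torch
import torch.nn.functional as F

def prune_video_chunk(video_tokens, audio_tokens, a2v_predictor,
                      rho_sem=0.5, detail_budget=0.1):
    """Illustrative audio-guided pruning for one video chunk.

    video_tokens : (Nv, D) video token embeddings for this chunk
    audio_tokens : (Na, D) audio token embeddings for this chunk
    a2v_predictor: lightweight module mapping audio tokens to a predicted
                   coarse visual embedding (assumed interface)
    rho_sem      : fraction of least audio-recoverable tokens to retain
    detail_budget: extra fraction retained for localized visual detail
    """
    # Predict coarse visual semantics from audio with the lightweight predictor.
    predicted_visual = a2v_predictor(audio_tokens)            # assumed shape (D,)

    # Score how well each video token is explained by the audio prediction:
    # high similarity -> coarse semantics likely recoverable -> prunable.
    recoverability = F.cosine_similarity(
        video_tokens, predicted_visual.unsqueeze(0), dim=-1)  # (Nv,)

    # Keep the rho_sem fraction of tokens *least* explained by the audio.
    n_sem = max(1, int(rho_sem * video_tokens.size(0)))
    keep_idx = torch.topk(-recoverability, n_sem).indices

    # Retain a small extra budget of high-variance tokens as a crude proxy
    # for localized details (text, small objects) audio alone cannot specify.
    n_detail = int(detail_budget * video_tokens.size(0))
    if n_detail > 0:
        detail_idx = torch.topk(video_tokens.var(dim=-1), n_detail).indices
        keep_idx = torch.unique(torch.cat([keep_idx, detail_idx]))

    return video_tokens[keep_idx], keep_idx
```

The retained fraction corresponds to the ρsem analyzed in Figure 6, where the paper reports choosing ρsem = 0.5; the detail heuristic above is only a stand-in for whatever spatial-detail branch the paper actually uses.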

If this is right

  • Substantial reduction in computational cost for processing multimodal inputs.
  • No need for model retraining or fine-tuning to apply the pruning.
  • Better performance than previous inference-time pruning approaches at higher compression rates.
  • Applicable across different omni-LLM scales without task-specific adjustments.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The method highlights how audio can serve as a reliable signal for much of the visual context in videos.
  • It could inspire similar pruning techniques for other multimodal combinations like text-image or audio-text.
  • Testing on more diverse question types might reveal limits where visual details are crucial beyond coarse semantics.

Load-bearing premise

That the independently trained lightweight predictor can correctly identify video tokens whose information is redundant with the audio for any potential question the model might be asked later.

What would settle it

A benchmark question that requires distinguishing fine visual details in the video that are not predictable from the accompanying audio; if the pruned version performs worse than the full-token version on such questions, the approach would be falsified.
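
A hedged sketch of how such a falsification test could be run, assuming a hypothetical `model.answer(...)` interface and a hand-collected set of questions that hinge on visual details the audio does not predict (jersey numbers, scene text, facial expressions, as in the paper's reported failure cases):

```python
def stress_test(model, examples, prune_fn):
    """Compare full-token vs. pruned-token accuracy on fine-detail questions.

    model    : hypothetical wrapper exposing answer(video_tokens, audio_tokens, question)
    examples : list of dicts with 'video_tokens', 'audio_tokens', 'question',
               and 'answer', where each question targets a visual detail that
               the audio track does not predict
    prune_fn : callable implementing the pruning scheme under test
    """
    full_correct, pruned_correct = 0, 0
    for ex in examples:
        full_pred = model.answer(ex["video_tokens"], ex["audio_tokens"], ex["question"])
        kept_tokens, _ = prune_fn(ex["video_tokens"], ex["audio_tokens"])
        pruned_pred = model.answer(kept_tokens, ex["audio_tokens"], ex["question"])
        full_correct += int(full_pred == ex["answer"])
        pruned_correct += int(pruned_pred == ex["answer"])

    n = len(examples)
    # A consistently positive gap on these questions would falsify the claim
    # that pruning preserves context for arbitrary downstream queries.
    return {
        "full_acc": full_correct / n,
        "pruned_acc": pruned_correct / n,
        "gap": (full_correct - pruned_correct) / n,
    }
```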

Figures

Figures reproduced from arXiv: 2605.11605 by Chaeyoung Jung, Joon Son Chung, Kyeongha Rho.

Figure 1
Figure 1: Main results on Qwen2.5-Omni 7B. ContextGuard outperforms previous token compression methods. Experiments on the 3B and 7B variants of Qwen2.5-Omni [51] and Video-SALMONN2+ show that our method outperforms OmniZip [47], a prior inference-time AV pruning method, in 21 of 24 settings while using fewer input tokens. On the 7B variant of Qwen2.5-Omni, ContextGuard achieves full-token-level performance on fi… view at source ↗
Figure 2
Figure 2: Overview of ContextGuard. ContextGuard reduces video tokens before the LLM decoder by removing audio-explainable visual redundancy while preserving broad AV context. For each video chunk t in the interleaved audio-video sequence, an audio-to-video semantic predictor (A2V predictor) estimates coarse visual semantics from the corresponding audio tokens. ContextGuard performs audio-guided semantic pruning by … view at source ↗
Figure 3
Figure 3: Main qualitative results. FastV and OmniZip fail to preserve visual evidence that is not directly aligned with the audio narration or with the most salient objects in the video, resulting in incomplete context. In contrast, ContextGuard preserves such non-audio-aligned visual information, maintains broad AV context under aggressive token compression, and recovers the correct answer. view at source ↗
Figure 4
Figure 4: Qualitative audio-to-video retrieval results using Qwen2.5-Omni 7B. Panels compare top-3 retrievals from the original embeddings and our embeddings against the ground truth for queries such as "people sobbing" and "an ambulance siren". view at source ↗
Figure 5
Figure 5: Qualitative audio-to-video retrieval results using Video-SALMONN2+ 7B. view at source ↗
Figure 6
Figure 6: Analysis of the semantic retention ratio ρsem. Larger ρsem values reduce KL divergence to the full-token output distribution by retaining more tokens, but also weaken compression. We choose ρsem = 0.5 for all models as it already achieves low KL divergence while preserving substantial token reduction. view at source ↗
Figure 7
Figure 7: Hyperparameter analysis. view at source ↗
Figure 8
Figure 8: Additional qualitative results on downstream QA using Qwen2.5-Omni 7B. ContextGuard preserves broad AV context and recovers the correct answer. view at source ↗
Figure 9
Figure 9: Additional qualitative results on downstream QA using Video-SALMONN2+ 7B. ContextGuard preserves non-audio-aligned visual events, maintains broad AV context, and recovers the correct answer. view at source ↗
Figure 10
Figure 10: Additional qualitative results on downstream QA using Video-SALMONN2+ 7B. ContextGuard preserves the full speech cue and broad AV context, and recovers the correct answer. view at source ↗
Figure 11
Figure 11: Failure case on downstream QA using Qwen2.5-Omni 7B. ContextGuard misses a subtle fine-grained detail, the player's jersey number, leading to an incorrect answer. view at source ↗
Figure 12
Figure 12: Failure case on downstream QA using Video-SALMONN2+ 7B. ContextGuard preserves evidence not recoverable from audio, such as OCR text, but fails to consistently retain fine-grained temporal visual cues, such as facial expressions, needed for the correct answer. view at source ↗
Figure 13
Figure 13: Qualitative analysis of non-audio-aligned semantic selection using Qwen2.5-Omni 7B. ContextGuard preserves non-audio-aligned semantic regions such as scene text, while the spatial-detail branch further helps retain localized visual details. view at source ↗
Figure 14
Figure 14: Qualitative analysis of non-audio-aligned semantic selection using Video-SALMONN2+ 7B. Similar to Qwen2.5-Omni, ContextGuard preserves non-audio-aligned semantic regions while avoiding over-retention of strongly audio-aligned content. view at source ↗
read the original abstract

Omnimodal Large Language Models (Omni-LLMs) incur substantial computational overhead due to the large number of multimodal input tokens they process, making token reduction essential for real-world deployment. Existing Omni-LLM pruning methods typically reduce this cost by selecting tokens that are important for the current query or strongly aligned with cross-modal cues. However, such strategies can discard evidence that falls outside these criteria, even when needed for different questions or for understanding context beyond aligned audio-visual cues. To address this limitation, we reframe Omni-LLM token reduction as preserving broad audio-visual context while removing cross-modal redundancy. We propose ContextGuard, an inference-time token pruning framework built on this principle. ContextGuard predicts coarse visual semantics from audio and prunes video tokens whose coarse semantics are likely recoverable from audio, while retaining additional video tokens to preserve localized visual details that audio alone cannot specify. For further compression, our method merges temporally similar video tokens. The framework requires no downstream LLM fine-tuning and uses only an independently trained lightweight predictor. On Qwen2.5-Omni and Video-SALMONN2+ at 3B and 7B scales across six audio-visual benchmarks, ContextGuard outperforms prior inference-time pruning methods while pruning more tokens. Notably, on Qwen2.5-Omni 7B, ContextGuard achieves full-token-level performance on five of six benchmarks while pruning 55% of input tokens.
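
The temporal-merging step mentioned in the abstract can be pictured as collapsing near-duplicate tokens across adjacent frames. The sketch below is an editorial simplification: it simply drops tokens that barely change from the previous frame rather than averaging them into a merged token, and the 0.9 similarity threshold is an arbitrary placeholder; the paper's actual merge rule may differ.

```python
import torch
import torch.nn.functional as F

def merge_temporal_tokens(frames, sim_threshold=0.9):
    """Illustrative temporal reduction of video tokens across frames.

    frames: (T, N, D) tensor with N tokens per frame over T frames, where
    tokens at the same index are assumed to cover the same spatial region.
    Tokens nearly identical to the corresponding token in the previous frame
    are treated as redundant and removed; only tokens that change enough are
    kept, alongside the full first frame.
    """
    kept = [frames[0]]                                  # always keep frame 0
    for t in range(1, frames.shape[0]):
        sim = F.cosine_similarity(frames[t], frames[t - 1], dim=-1)  # (N,)
        changed = sim <= sim_threshold
        if changed.any():
            kept.append(frames[t][changed])
    return torch.cat(kept, dim=0)                       # reduced token sequence
```

In this sketch the reduced sequence would be produced per chunk, e.g. `merged = merge_temporal_tokens(chunk_tokens)`, before the pruned tokens are handed to the LLM decoder.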

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper proposes ContextGuard, an inference-time token pruning framework for Omni-LLMs. It reframes pruning as preserving broad audio-visual context by using a lightweight predictor to identify and remove video tokens whose coarse semantics are recoverable from audio, while retaining additional tokens for localized visual details and merging temporally similar tokens. The method requires no LLM fine-tuning. Experiments on Qwen2.5-Omni and Video-SALMONN2+ (3B/7B scales) across six audio-visual benchmarks claim that ContextGuard outperforms prior inference-time pruning methods, with the 7B Qwen2.5-Omni model achieving full-token performance on five of six benchmarks at 55% pruning.

Significance. If the empirical claims hold under rigorous controls, the work could meaningfully advance efficient inference for omnimodal models by exploiting cross-modal redundancy without query-specific selection or retraining. The emphasis on context preservation for arbitrary downstream questions addresses a clear limitation of existing pruning strategies and could support longer-context deployments.

major comments (3)
  1. [Method and Experiments] The central performance claims (full-token equivalence at 55% pruning on five of six benchmarks) rest on the unverified assumption that the independently trained predictor plus retained localized tokens suffice for arbitrary queries. No quantitative evaluation of predictor error rates on query-specific visual details (e.g., text, object attributes, or spatial relations misaligned with audio) is reported, leaving the weakest assumption untested.
  2. [Experiments] The manuscript provides no experimental details on baselines, number of runs, statistical significance, or controls for the pruning ratio and retention heuristic. This prevents verification of the reported outperformance and near-full performance claims.
  3. [Experiments] The six benchmarks may not contain sufficient cases exposing the failure mode where fine-grained visual information required by future questions is pruned; the paper should include targeted stress tests or additional datasets with non-audio-aligned details.
minor comments (1)
  1. [Method] Notation for the predictor output and retention threshold should be defined more explicitly with equations to clarify the inference-time operations.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the thoughtful and constructive comments. We address each major comment below and describe the revisions we will incorporate to improve the manuscript.

read point-by-point responses
  1. Referee: [Method and Experiments] The central performance claims (full-token equivalence at 55% pruning on five of six benchmarks) rest on the unverified assumption that the independently trained predictor plus retained localized tokens suffice for arbitrary queries. No quantitative evaluation of predictor error rates on query-specific visual details (e.g., text, object attributes, or spatial relations misaligned with audio) is reported, leaving the weakest assumption untested.

    Authors: We thank the referee for highlighting this point. The method retains extra tokens precisely to safeguard localized visual details that audio cannot recover, and the reported benchmark results provide indirect support for handling arbitrary queries. To strengthen the evidence, we will add a quantitative analysis of predictor error rates on query-specific details (such as text, object attributes, and spatial relations) in the revised manuscript. revision: yes

  2. Referee: [Experiments] The manuscript provides no experimental details on baselines, number of runs, statistical significance, or controls for the pruning ratio and retention heuristic. This prevents verification of the reported outperformance and near-full performance claims.

    Authors: We agree that these details are required for full reproducibility and verification of the claims. The revised manuscript will include complete experimental information: the specific baselines, number of runs, statistical significance (means and standard deviations), and controls for pruning ratios and the retention heuristic. revision: yes

  3. Referee: [Experiments] The six benchmarks may not contain sufficient cases exposing the failure mode where fine-grained visual information required by future questions is pruned; the paper should include targeted stress tests or additional datasets with non-audio-aligned details.

    Authors: We acknowledge that the existing benchmarks may not fully expose failure cases involving fine-grained, non-audio-aligned visual information. We will add targeted stress tests and/or supplementary datasets focused on such details (e.g., text reading or precise object attributes) in the revised version. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper describes ContextGuard as an inference-time pruning method that relies on an independently trained lightweight predictor to decide which video tokens have coarse semantics recoverable from audio, while retaining additional tokens for localized details and merging temporally similar ones. All reported results are empirical performance measurements on external benchmarks (Qwen2.5-Omni, Video-SALMONN2+, six audio-visual tasks) with no equations, fitted parameters, or self-citations presented as load-bearing derivations. The method is explicitly stated to require no downstream LLM fine-tuning, making the performance claims independent of any internal redefinition or tautological prediction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract supplies no explicit free parameters, axioms, or invented entities; the approach rests on the existence of a trainable lightweight predictor whose accuracy is assumed but not detailed.

pith-pipeline@v0.9.0 · 5565 in / 1321 out tokens · 50476 ms · 2026-05-13T01:56:44.922457+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

65 extracted references · 65 canonical work pages · 8 internal anchors

  1. [1]

    Flamingo: a Visual Language Model for Few-Shot Learning

    Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. Flamingo: a Visual Language Model for Few-Shot Learning. InProc. NeurIPS, 2022

  2. [2]

    Qwen2.5-VL Technical Report

    Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2.5-VL Technical Report. arXiv, 2025

  3. [3]

    VGGSound: A Large-scale Audio-Visual Dataset

    Honglie Chen, Weidi Xie, Andrea Vedaldi, and Andrew Zisserman. VGGSound: A Large-scale Audio-Visual Dataset. InProc. ICASSP, 2020

  4. [4]

    Shikra: Unleashing Multimodal LLM's Referential Dialogue Magic

    Keqin Chen, Zhao Zhang, Weili Zeng, Richong Zhang, Feng Zhu, and Rui Zhao. Shikra: Unleashing Multimodal LLM’s Referential Dialogue Magic.arXiv:2306.15195, 2023

  5. [5]

    An Image is Worth 1/2 Tokens After Layer 2: Plug-and-Play Inference Acceleration for Large Vision-Language Models

    Liang Chen, Haozhe Zhao, Tianyu Liu, Shuai Bai, Junyang Lin, Chang Zhou, and Baobao Chang. An Image is Worth 1/2 Tokens After Layer 2: Plug-and-Play Inference Acceleration for Large Vision-Language Models. InProc. ECCV, 2024

  6. [6]

    BEATs: Audio Pre-Training with Acoustic Tokenizers

    Sanyuan Chen, Yu Wu, Chengyi Wang, Shujie Liu, Daniel Tompkins, Zhuo Chen, Wanxiang Che, Xiangzhan Yu, and Furu Wei. BEATs: Audio Pre-Training with Acoustic Tokenizers. In Proc. ICML, 2023

  7. [7]

    VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset

    Sihan Chen, Handong Li, Qunbo Wang, Zijia Zhao, Mingzhen Sun, Xinxin Zhu, and Jing Liu. VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset. In Proc. NeurIPS, 2023

  8. [8]

    StreamingTOM: Streaming Token Compression for Efficient Video Understanding

    Xueyi Chen, Keda Tao, Kele Shao, and Huan Wang. StreamingTOM: Streaming Token Compression for Efficient Video Understanding. InProc. CVPR, 2026

  9. [9]

    InternVL: Scaling Up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks

    Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, et al. InternVL: Scaling Up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks. InProc. CVPR, 2024

  10. [10]

    VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs

    Zesen Cheng, Sicong Leng, Hang Zhang, Yifei Xin, Xin Li, Guanzheng Chen, Yongxin Zhu, Wenqi Zhang, Ziyang Luo, Deli Zhao, et al. VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs.arXiv:2406.07476, 2024

  11. [11]

    Meerkat: Audio-Visual Large Language Model for Grounding in Space and Time

    Sanjoy Chowdhury, Sayan Nag, Subhrajyoti Dasgupta, Jun Chen, Mohamed Elhoseiny, Ruohan Gao, and Dinesh Manocha. Meerkat: Audio-Visual Large Language Model for Grounding in Space and Time. InProc. ECCV, 2024

  12. [12]

    Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

    Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, et al. Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities.arXiv:2507.06261, 2025

  13. [13]

    OmniSIFT: Modality-Asymmetric Token Compression for Efficient Omni-modal Large Language Models

    Yue Ding, Yiyan Ji, Jungang Li, Xuyang Liu, Xinlong Chen, Junfei Wu, Bozhou Li, Bohan Zeng, Yang Shi, Yushuo Guan, et al. OmniSIFT: Modality-Asymmetric Token Compression for Efficient Omni-modal Large Language Models. arXiv:2602.04804, 2026

  14. [14]

    An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

    Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. InProc. ICLR, 2021

  15. [15]

    Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis

    Chaoyou Fu, Yuhan Dai, Yongdong Luo, Lei Li, Shuhuai Ren, Renrui Zhang, Zihan Wang, Chenyu Zhou, Yunhang Shen, Mengdan Zhang, et al. Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis. In Proc. CVPR, 2025

  16. [16]

    Audio Set: An ontology and human-labeled dataset for audio events

    Jort F. Gemmeke, Daniel P. W. Ellis, Dylan Freedman, Aren Jansen, Wade Lawrence, R. Channing Moore, Manoj Plakal, and Marvin Ritter. Audio Set: An ontology and human-labeled dataset for audio events. In Proc. ICASSP, 2017

  17. [17]

    EchoingPixels: Cross-Modal Adaptive Token Reduction for Efficient Audio-Visual LLMs

    Chao Gong, Depeng Wang, Zhipeng Wei, Ya Guo, Huijia Zhu, and Jingjing Chen. EchoingPixels: Cross-Modal Adaptive Token Reduction for Efficient Audio-Visual LLMs. arXiv:2512.10324, 2025

  18. [18]

    AST: Audio Spectrogram Transformer

    Yuan Gong, Yu-An Chung, and James Glass. AST: Audio Spectrogram Transformer. InProc. Interspeech, 2021

  19. [19]

    OneLLM: One Framework to Align All Modalities with Language

    Jiaming Han, Kaixiong Gong, Yiyuan Zhang, Jiaqi Wang, Kaipeng Zhang, Dahua Lin, Yu Qiao, Peng Gao, and Xiangyu Yue. OneLLM: One Framework to Align All Modalities with Language. InProc. CVPR, 2024

  20. [20]

    WorldSense: Evaluating Real-world Omnimodal Understanding for Multimodal LLMs

    Jack Hong, Shilin Yan, Jiayin Cai, Xiaolong Jiang, Yao Hu, and Weidi Xie. WorldSense: Evaluating Real-world Omnimodal Understanding for Multimodal LLMs. InProc. ICLR, 2026

  21. [21]

    Language is Not All You Need: Aligning Perception with Language Models

    Shaohan Huang, Li Dong, Wenhui Wang, Yaru Hao, Saksham Singhal, Shuming Ma, Tengchao Lv, Lei Cui, Owais Khan Mohammed, Barun Patra, et al. Language is Not All You Need: Aligning Perception with Language Models. InProc. NeurIPS, 2023

  22. [22]

    GPT-4o System Card

    Aaron Hurst, Adam Lerer, Adam P. Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. GPT-4o System Card. arXiv:2410.21276, 2024

  23. [23]

    Multi-Granular Spatio-Temporal Token Merging for Training-Free Acceleration of Video LLMs

    Jeongseok Hyun, Sukjun Hwang, Su Ho Han, Taeoh Kim, Inwoong Lee, Dongyoon Wee, Joon-Young Lee, Seon Joo Kim, and Minho Shim. Multi-Granular Spatio-Temporal Token Merging for Training-Free Acceleration of Video LLMs. InProc. ICCV, 2025

  24. [24]

    STORM: Token-Efficient Long Video Understanding for Multimodal LLMs

    Jindong Jiang, Xiuyu Li, Zhijian Liu, Muyang Li, Guo Chen, Zhiqi Li, De-An Huang, Guilin Liu, Zhiding Yu, Kurt Keutzer, Sungjin Ahn, Jan Kautz, Hongxu Yin, Yao Lu, Song Han, and Wonmin Byeon. STORM: Token-Efficient Long Video Understanding for Multimodal LLMs. InProc. ICCV Workshop, 2025

  25. [25]

    LLaVA-OneVision: Easy Visual Task Transfer

    Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Peiyuan Zhang, Yanwei Li, Ziwei Liu, and Chunyuan Li. LLaVA-OneVision: Easy Visual Task Transfer. Transactions on Machine Learning Research, 2024

  26. [26]

    OmniVideoBench: Towards Audio-Visual Understanding Evaluation for Omni MLLMs

    Caorui Li, Yu Chen, Yiyan Ji, Jin Xu, Zhenyu Cui, Shihao Li, Yuanxing Zhang, Wentao Wang, Zhenghao Song, Dingling Zhang, et al. OmniVideoBench: Towards Audio-Visual Understanding Evaluation for Omni MLLMs. arXiv:2510.10689, 2025

  27. [27]

    BLIP-2: Bootstrapping Language- Image Pre-training with Frozen Image Encoders and Large Language Models

    Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. BLIP-2: Bootstrapping Language- Image Pre-training with Frozen Image Encoders and Large Language Models. InProc. ICML, 2023

  28. [28]

    VideoChat-Flash: Hierarchical Compression for Long-Context Video Modeling

    Xinhao Li, Yi Wang, Jiashuo Yu, Xiangyu Zeng, Yuhan Zhu, Haian Huang, Jianfei Gao, Kunchang Li, Yinan He, Chenting Wang, Yu Qiao, Yali Wang, and Limin Wang. VideoChat-Flash: Hierarchical Compression for Long-Context Video Modeling. In Proc. ICLR, 2026

  29. [29]

    Video-LLaVA: Learning United Visual Representation by Alignment Before Projection

    Bin Lin, Yang Ye, Bin Zhu, Jiaxi Cui, Munan Ning, Peng Jin, and Li Yuan. Video-LLaVA: Learning United Visual Representation by Alignment Before Projection. In Proc. EMNLP, 2024

  30. [30]

    Visual Instruction Tuning

    Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual Instruction Tuning. In Proc. NeurIPS, 2023

  31. [31]

    Mixing Importance with Diversity: Joint Optimization for KV Cache Compression in Large Vision-Language Models

    Xuyang Liu, Xiyan Gui, Yuchao Zhang, and Linfeng Zhang. Mixing Importance with Diversity: Joint Optimization for KV Cache Compression in Large Vision-Language Models. InProc. ICLR, 2026

  32. [32]

    Macaw-LLM: Multi-Modal Language Modeling with Image, Audio, Video, and Text Integration

    Chenyang Lyu, Minghao Wu, Longyue Wang, Xinting Huang, Bingshuai Liu, Zefeng Du, Shuming Shi, and Zhaopeng Tu. Macaw-LLM: Multi-Modal Language Modeling with Image, Audio, Video, and Text Integration. arXiv:2306.09093, 2023

  33. [33]

    Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models

    Muhammad Maaz, Hanoona Rasheed, Salman Khan, and Fahad Shahbaz Khan. Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models. In Proc. ACL, 2024

  34. [34]

    X-InstructBLIP: A Framework for Aligning X-Modal Instruction-Aware Representations to LLMs and Emergent Cross-modal Reasoning

    Artemis Panagopoulou, Le Xue, Ning Yu, Junnan Li, Dongxu Li, Shafiq Joty, Ran Xu, Silvio Savarese, Caiming Xiong, and Juan Carlos Niebles. X-InstructBLIP: A Framework for Aligning X-Modal Instruction-Aware Representations to LLMs and Emergent Cross-modal Reasoning. arXiv:2311.18799, 2023

  35. [35]

    AdaptToken: Entropy-based Adaptive Token Selection for MLLM Long Video Understanding

    Haozhe Qi, Kevin Qu, Mahdi Rad, Rui Wang, Alexander Mathis, and Marc Pollefeys. AdaptToken: Entropy-based Adaptive Token Selection for MLLM Long Video Understanding. arXiv:2603.28696, 2026

  36. [36]

    Quicksviewer: An LMM for Efficient Video Understanding via Reinforced Compression of Video Cubes

    Ji Qi, Yuan Yao, Yushi Bai, Bin Xu, Juanzi Li, Zhiyuan Liu, and Tat-Seng Chua. Quicksviewer: An LMM for Efficient Video Understanding via Reinforced Compression of Video Cubes. arXiv:2504.15270, 2025

  37. [37]

    Learning Transferable Visual Models From Natural Language Supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning Transferable Visual Models From Natural Language Supervision. InProc. ICML, 2021

  38. [38]

    Robust Speech Recognition via Large-Scale Weak Supervision

    Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever. Robust Speech Recognition via Large-Scale Weak Supervision. InProc. ICML, 2023

  39. [39]

    LLaVA-PruMerge: Adaptive Token Reduction for Efficient Large Multimodal Models

    Yuzhang Shang, Mu Cai, Bingxin Xu, Yong Jae Lee, and Yan Yan. LLaVA-PruMerge: Adaptive Token Reduction for Efficient Large Multimodal Models. In Proc. ICCV, 2025

  40. [40]

    A mathematical theory of communication

    C. E. Shannon. A mathematical theory of communication. The Bell System Technical Journal, 27(3):379–423, 1948

  41. [41]

    HoliTom: Holistic Token Merging for Fast Video Large Language Models

    Kele Shao, Keda Tao, Can Qin, Haoxuan You, Yang Sui, and Huan Wang. HoliTom: Holistic Token Merging for Fast Video Large Language Models. InProc. NeurIPS, 2025

  42. [42]

    Video-XL: Extra-Long Vision Language Model for Hour-Scale Video Understanding

    Yan Shu, Zheng Liu, Peitian Zhang, Minghao Qin, Junjie Zhou, Zhengyang Liang, Tiejun Huang, and Bo Zhao. Video-XL: Extra-Long Vision Language Model for Hour-Scale Video Understanding. InProc. CVPR, 2025

  43. [43]

    video-SALMONN: Speech-Enhanced Audio-Visual Large Language Models

    Guangzhi Sun, Wenyi Yu, Changli Tang, Xianzhao Chen, Tian Tan, Wei Li, Lu Lu, Zejun Ma, Yuxuan Wang, and Chao Zhang. video-SALMONN: Speech-Enhanced Audio-Visual Large Language Models. InProc. ICML, 2024

  44. [44]

    TokenCarve: Information-Preserving Visual Token Compression in Multimodal Large Language Models

    Xudong Tan, Peng Ye, Chongjun Tu, Jianjian Cao, Yaoxin Yang, Lin Zhang, Dongzhan Zhou, and Tao Chen. TokenCarve: Information-Preserving Visual Token Compression in Multimodal Large Language Models. arXiv:2503.10501, 2025

  45. [45]

    video-SALMONN 2: Caption-Enhanced Audio-Visual Large Language Models

    Changli Tang, Yixuan Li, Yudong Yang, Jimin Zhuang, Guangzhi Sun, Wei Li, Zejun Ma, and Chao Zhang. video-SALMONN 2: Caption-Enhanced Audio-Visual Large Language Models. arXiv:2506.15220, 2025

  46. [46]

    DyCoke: Dynamic Compression of Tokens for Fast Video Large Language Models

    Keda Tao, Can Qin, Haoxuan You, Yang Sui, and Huan Wang. DyCoke: Dynamic Compression of Tokens for Fast Video Large Language Models. InProc. CVPR, 2025

  47. [47]

    OmniZip: Audio-Guided Dynamic Token Compression for Fast Omnimodal Large Language Models

    Keda Tao, Kele Shao, Bohan Yu, Weiqiang Wang, Jian Liu, and Huan Wang. OmniZip: Audio-Guided Dynamic Token Compression for Fast Omnimodal Large Language Models. In Proc. CVPR, 2026

  48. [48]

    Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context

    Gemini Team, Petko Georgiev, Ving Ian Lei, Ryan Burnell, Libin Bai, Anmol Gulati, Garrett Tanzer, Damien Vincent, Zhufeng Pan, Shibo Wang, Soroosh Mariooryad, Yifan Ding, Xinyang Geng, Fred Alcober, Roy Frostig, Mark Omernick, Lexi Walker, et al. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context.arXiv:2403.05530, 2024

  49. [49]

    LLaMA: Open and Efficient Foundation Language Models

    Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. LLaMA: Open and Efficient Foundation Language Models. arXiv:2302.13971, 2023

  50. [50]

    PyramidDrop: Accelerating Your Large Vision-Language Models via Pyramid Visual Redundancy Reduction

    Long Xing, Qidong Huang, Xiaoyi Dong, Jiajie Lu, Pan Zhang, Yuhang Zang, Yuhang Cao, Conghui He, Jiaqi Wang, Feng Wu, and Dahua Lin. PyramidDrop: Accelerating Your Large Vision-Language Models via Pyramid Visual Redundancy Reduction. In Proc. CVPR, 2025

  51. [51]

    Qwen2.5-Omni Technical Report

    Jin Xu, Zhifang Guo, Jinzheng He, Hangrui Hu, Ting He, Shuai Bai, Keqin Chen, Jialin Wang, Yang Fan, Kai Dang, et al. Qwen2.5-Omni Technical Report. arXiv:2503.20215, 2025

  52. [52]

    PVC: Progressive Visual Token Compression for Unified Image and Video Processing in Large Vision-Language Models

    Chenyu Yang, Xuan Dong, Xizhou Zhu, Weijie Su, Jiahao Wang, Hao Tian, Zhe Chen, Wenhai Wang, Lewei Lu, and Jifeng Dai. PVC: Progressive Visual Token Compression for Unified Image and Video Processing in Large Vision-Language Models. InProc. CVPR, 2025

  53. [53]

    AVQA: A Dataset for Audio-Visual Question Answering on Videos

    Pinci Yang, Xin Wang, Xuguang Duan, Hong Chen, Runze Hou, Cong Jin, and Wenwu Zhu. AVQA: A Dataset for Audio-Visual Question Answering on Videos. In Proc. ACM MM, 2022

  54. [54]

    VisionZip: Longer is Better but Not Necessary in Vision Language Models

    Senqiao Yang, Yukang Chen, Zhuotao Tian, Chengyao Wang, Jingyao Li, Bei Yu, and Jiaya Jia. VisionZip: Longer is Better but Not Necessary in Vision Language Models. InProc. CVPR, 2025

  55. [55]

    TimeChat-Online: 80% Visual Tokens are Naturally Redundant in Streaming Videos

    Linli Yao, Yicheng Li, Yuancheng Wei, Lei Li, Shuhuai Ren, Yuanxin Liu, Kun Ouyang, Lean Wang, Shicheng Li, Sida Li, et al. TimeChat-Online: 80% Visual Tokens are Naturally Redundant in Streaming Videos. InProc. ACM MM, 2025

  56. [56]

    CAT: Enhancing Multimodal Large Language Model to Answer Questions in Dynamic Audio-Visual Scenarios

    Qilang Ye, Zitong Yu, Rui Shao, Xinyu Xie, Philip Torr, and Xiaochun Cao. CAT: Enhancing Multimodal Large Language Model to Answer Questions in Dynamic Audio-Visual Scenarios. InProc. ECCV, 2024

  57. [57]

    Fit and Prune: Fast and Training-free Visual Token Pruning for Multi-modal Large Language Models

    Weihao Ye, Qiong Wu, Wenhao Lin, and Yiyi Zhou. Fit and Prune: Fast and Training-free Visual Token Pruning for Multi-modal Large Language Models. InProc. AAAI, 2025

  58. [58]

    RLHF-V: Towards Trustworthy MLLMs via Behavior Alignment from Fine-grained Correctional Human Feedback

    Tianyu Yu, Yuan Yao, Haoye Zhang, Taiwen He, Yifeng Han, Ganqu Cui, Jinyi Hu, Zhiyuan Liu, Hai-Tao Zheng, Maosong Sun, et al. RLHF-V: Towards Trustworthy MLLMs via Behavior Alignment from Fine-grained Correctional Human Feedback. InProc. CVPR, 2024

  59. [59]

    AnyGPT: Unified Multimodal LLM with Discrete Sequence Modeling

    Jun Zhan, Junqi Dai, Jiasheng Ye, Yunhua Zhou, Dong Zhang, Zhigeng Liu, Xin Zhang, Ruibin Yuan, Ge Zhang, Linyang Li, et al. AnyGPT: Unified Multimodal LLM with Discrete Sequence Modeling. arXiv:2402.12226, 2024

  60. [60]

    Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding

    Hang Zhang, Xin Li, and Lidong Bing. Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding. InProc. EMNLP, 2023

  61. [61]

    p-MoD: Building Mixture-of-Depths MLLMs via Progressive Ratio Decay

    Jun Zhang, Desen Meng, Ji Qi, Zhenpeng Huang, Tao Wu, and Limin Wang. p-MoD: Building Mixture-of-Depths MLLMs via Progressive Ratio Decay. InProc. ICCV, 2025

  62. [62]

    LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attention

    Renrui Zhang, Jiaming Han, Chris Liu, Peng Gao, Aojun Zhou, Xiangfei Hu, Shilin Yan, Pan Lu, Hongsheng Li, and Yu Qiao. LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attention. InProc. ICLR, 2024

  63. [63]

    ChatBridge: Bridging Modalities with Large Language Model as a Language Catalyst

    Zijia Zhao, Longteng Guo, Tongtian Yue, Sihan Chen, Shuai Shao, Xinxin Zhu, Zehuan Yuan, and Jing Liu. ChatBridge: Bridging Modalities with Large Language Model as a Language Catalyst. arXiv:2305.16103, 2023

  64. [64]

    Daily-Omni: Towards Audio-Visual Reasoning with Temporal Alignment across Modalities

    Ziwei Zhou, Rui Wang, Zuxuan Wu, and Yu-Gang Jiang. Daily-Omni: Towards Audio-Visual Reasoning with Temporal Alignment across Modalities. arXiv:2505.17862, 2025

  65. [65]

    MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models

    Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, and Mohamed Elhoseiny. MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models. In Proc. ICLR, 2024