pith. machine review for the scientific record.

arxiv: 2603.01400 · v2 · submitted 2026-03-02 · 💻 cs.CV

Recognition: unknown

Token Reduction via Local and Global Contexts Optimization for Efficient Video Large Language Models

Pith reviewed 2026-05-15 18:22 UTC · model grok-4.3

classification 💻 cs.CV
keywords video large language models · token reduction · optimal transport · visual token pruning · training-free compression · spatiotemporal efficiency · efficient inference

The pith

Token anchors via local-global optimal transport reduce visual tokens in video LLMs while maintaining competitive accuracy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces AOT as a training-free approach that defines anchors for visual tokens inside each frame and across consecutive frames. These anchors collect information from tokens that would otherwise be discarded by using optimal transport to move context both spatially within frames and temporally between frames. Existing pruning methods either focus only on spatial redundancy inside single frames or discard context outright, leading to efficiency gains that come at the cost of lost detail. A sympathetic reader would care because video LLMs currently process thousands of tokens per second of video, driving high compute and memory demands that limit real-world use. If the aggregation step succeeds, models can handle longer videos or run on smaller hardware without retraining.

Core claim

We establish local- and global-aware token anchors within each frame under attention guidance; optimal transport then aggregates the informative context from pruned tokens into these anchors, constructing the intra-frame anchors. Building on temporal frame clips, the first frame of each clip supplies the keyframe anchors, which absorb similar information from the following frames through optimal transport while distinct tokens are kept to represent temporal dynamics. This yields token reduction in a training-free manner, with competitive performance across short- and long-video benchmarks on leading video LLMs while preserving temporal and visual fidelity.

What carries the argument

Local- and global-aware token anchors that aggregate pruned-token context via optimal transport (AOT)

If this is right

  • Competitive accuracy on short- and long-video benchmarks for leading video LLMs
  • Substantial reduction in computational cost while retaining temporal and visual fidelity
  • Training-free operation that works directly on existing video LLMs
  • Better handling of both intra-frame spatial redundancy and inter-frame temporal redundancy than prior pruning techniques
  • Ability to keep distinct tokens for motion while aggregating repeated information across clips
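The last point, merging repeated content into keyframe anchors while keeping distinct tokens, can be sketched as follows. This is an illustration under stated assumptions rather than the paper's procedure: it swaps optimal transport for a hard nearest-anchor assignment, and the threshold `sim_threshold`, the weight `lam`, and the function name `compress_clip` are all hypothetical.

```python
import math

def cosine(x, y):
    """Cosine similarity between two token vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return dot / (norm + 1e-12)

def compress_clip(frames, sim_threshold=0.9, lam=0.5):
    """frames: list of frames, each a list of token vectors.
    The first frame supplies the keyframe anchors. Tokens from later frames
    that closely match an anchor are merged into it; the rest are kept as
    distinct tokens. Returns (anchors, kept) with kept = [(frame_idx, token)]."""
    keyframe = frames[0]
    anchors = [list(t) for t in keyframe]
    merged = [[] for _ in keyframe]
    kept = []
    for f, frame in enumerate(frames[1:], start=1):
        for tok in frame:
            sims = [cosine(tok, a) for a in keyframe]
            j = max(range(len(sims)), key=sims.__getitem__)
            if sims[j] >= sim_threshold:
                merged[j].append(tok)   # redundant with the keyframe
            else:
                kept.append((f, tok))   # carries new temporal content
    for j, group in enumerate(merged):
        if group:
            mean = [sum(t[d] for t in group) / len(group)
                    for d in range(len(anchors[j]))]
            anchors[j] = [(1 - lam) * a + lam * m
                          for a, m in zip(anchors[j], mean)]
    return anchors, kept
```

The final token count for a clip is `len(anchors) + len(kept)`, so a static clip collapses toward one frame's worth of tokens while a clip with motion retains more.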

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same anchor-and-transport pattern could be tested on image-only LLMs to reduce spatial tokens without retraining
  • Longer untrimmed videos might become feasible on fixed hardware budgets if the temporal aggregation scales linearly with clip length
  • Replacing attention-guided anchors with learned parameters could be measured to see whether further token savings are possible
  • The method invites direct comparison of per-token information retention against simple averaging or clustering baselines on the same datasets

Load-bearing premise

Optimal transport aggregation from pruned tokens into anchors preserves subtle yet informative context without meaningful loss for downstream video understanding tasks.

What would settle it

Measure accuracy on a fine-grained action recognition benchmark after AOT token reduction; if performance falls more than a few percent relative to the unpruned baseline at the same total token budget, the preservation claim fails.
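The test described above reduces to a relative-drop check. A minimal sketch, assuming accuracies are given on the same scale and taking 3% as a hypothetical stand-in for "a few percent" (the page does not fix a threshold):

```python
def relative_drop(baseline_acc, reduced_acc):
    """Accuracy lost by the reduced model, as a fraction of the baseline."""
    if baseline_acc <= 0:
        raise ValueError("baseline accuracy must be positive")
    return (baseline_acc - reduced_acc) / baseline_acc

def preservation_holds(baseline_acc, reduced_acc, tolerance=0.03):
    """True when the relative drop stays within the tolerance.
    The default tolerance is an assumption, not a value from the paper."""
    return relative_drop(baseline_acc, reduced_acc) <= tolerance
```

For example, a baseline of 80.0 and a reduced score of 78.0 gives a relative drop of 2.5%, which passes under the assumed tolerance; a reduced score of 70.0 does not.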

Figures

Figures reproduced from arXiv: 2603.01400 by Haonan Zhang, Jinlong Li, Liyuan Jiang, Nicu Sebe.

Figure 1: The top is the essential differences compared with com…

Figure 2: Overall pipeline of our AOT. Our method compresses tokens of video LLMs across spatiotemporal through optimal transport, first establishing token anchors within each frame to cover semantically important and spatially diverse token candidates, then utilizing optimal transport to aggregate the necessary informative cues within Intra-Frame at phase I, and finally shifting the optimization strategy into tempo…

Figure 3: Left: scaling with more frames leads to more efficient and effective visual information abstraction. Right: sensitivity analysis of weighting coefficient controlling contextual contribution with consistent configuration, λintra and λinter.
4.4. Ablation Studies. In this section, we conduct ablation studies on LLaVA-OneVision 7B by setting the token retention budget at 10% to gradually demonstrate the impro…

Figure 4: Qualitative visualizations of our Local-Global token anchors evolution across consecutive frames while optimal transport is adopted to aggregate necessary information from unselected tokens to help LLM process better.
5. Conclusion. In this paper, we first investigate how to aggregate necessary yet optimal semantics and contexts from merging or removing tokens into remaining tokens, instead of simply mer…

Figure 5: Qualitative visualizations of our Local-Global token anchors evolution across consecutive frames on MVBench sample while optimal transport is adopted to aggregate necessary information from unselected tokens to help LLM process better. The top is the original sampled frames while the bottom is the corresponding tokens visualization.

Figure 6: Qualitative visualizations of our Local-Global token anchors evolution across consecutive frames on VideoMME sample while optimal transport is adopted to aggregate necessary information from unselected tokens to help LLM process better. The top is the original sampled frames while the bottom is the corresponding tokens visualization.
read the original abstract

Video Large Language Models (VLLMs) demonstrate strong video understanding but suffer from inefficiency due to redundant visual tokens. Existing pruning primarily targets intra-frame spatial redundancy or prunes inside the LLM with shallow-layer overhead, yielding suboptimal spatiotemporal reduction and underutilizing long-context compressibility. These methods often discard subtle yet informative context from merged or pruned tokens. In this paper, we propose a new perspective that elaborates token Anchors within intra-frame and inter-frame to comprehensively aggregate the informative contexts via local-global Optimal Transport (AOT). Specifically, we first establish local- and global-aware token anchors within each frame under attention guidance; optimal transport then aggregates the informative contexts from pruned tokens, constructing intra-frame token anchors. Then, building on the temporal frame clips, the first frame within each clip is treated as the keyframe anchors to ensemble similar information from consecutive frames through optimal transport, while keeping distinct tokens to represent temporal dynamics, leading to efficient token reduction in a training-free manner. Extensive evaluations show that our proposed AOT obtains competitive performances across various short- and long-video benchmarks on leading video LLMs, obtaining substantial computational efficiency while preserving temporal and visual fidelity. Project webpage: https://tyroneli.github.io/AOT.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces AOT, a training-free token-reduction method for Video LLMs that constructs intra-frame anchors via attention-guided local-global optimal transport and inter-frame anchors by treating the first frame of each clip as a keyframe and transporting similar information from subsequent frames while retaining distinct tokens for dynamics. It claims competitive accuracy on short- and long-video benchmarks together with substantial efficiency gains while preserving temporal and visual fidelity.

Significance. If the central claim holds, the work would supply a practical, training-free route to compress visual tokens in VLLMs without retraining, directly addressing the quadratic cost of long video contexts and enabling wider deployment of existing models.
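The "quadratic cost" remark can be made concrete with a little arithmetic. The sketch below counts only pairwise self-attention interactions and ignores the feed-forward cost, which grows linearly with sequence length; the token counts in the usage note are hypothetical, and the 10% retention figure is the budget quoted in the Figure 3 caption.

```python
def attention_pair_ratio(num_visual, num_text, retention):
    """Ratio of pairwise attention interactions after vs. before
    reducing the visual tokens to the given retention fraction."""
    before = (num_visual + num_text) ** 2
    after = (round(num_visual * retention) + num_text) ** 2
    return after / before
```

With 6400 visual tokens, 100 text tokens, and 10% retention, the sequence shrinks from 6500 to 740 tokens, so the attention term falls to between 1% and 2% of its original size. The text tokens are what keep the ratio above the 1% that visual tokens alone would give.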

major comments (2)
  1. [Method (inter-frame OT) and Experiments] The load-bearing premise that local-global OT aggregation retains subtle context without meaningful loss is stated in the abstract and method description but is supported only by downstream benchmark accuracy; no independent quantification of information retention (embedding reconstruction error, attention-map fidelity, or per-token entropy before/after reduction) is supplied, especially for the inter-frame keyframe step on long videos.
  2. [Abstract and Experiments] The abstract asserts 'competitive performances' and 'substantial computational efficiency' yet the provided text contains no numerical results, ablation tables, or direct comparisons against prior pruning baselines, rendering the efficiency-fidelity trade-off impossible to assess from the manuscript.
minor comments (1)
  1. [Method] Notation for the transport plans and anchor definitions is introduced without an explicit equation or algorithm box, making the precise formulation of the local-global OT steps difficult to follow.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We agree that stronger direct evidence for information retention and explicit numerical results would improve the manuscript. We address each major comment below and will revise accordingly.

read point-by-point responses
  1. Referee: [Method (inter-frame OT) and Experiments] The load-bearing premise that local-global OT aggregation retains subtle context without meaningful loss is stated in the abstract and method description but is supported only by downstream benchmark accuracy; no independent quantification of information retention (embedding reconstruction error, attention-map fidelity, or per-token entropy before/after reduction) is supplied, especially for the inter-frame keyframe step on long videos.

    Authors: We acknowledge that downstream accuracy alone provides only indirect support for the claim of retained subtle context. In the revised manuscript we will add direct quantification: embedding reconstruction error (L2 distance between original and aggregated token embeddings), cosine similarity of attention maps before/after reduction, and per-token entropy comparisons. These metrics will be reported specifically for the inter-frame keyframe OT step on long-video sequences from the ActivityNet and Ego4D benchmarks to address the concern. revision: yes
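The three metrics promised in this response could be computed along the following lines. This is a sketch under assumptions: each original token is taken to be paired with the reduced token that absorbed it (a hypothetical pairing the page does not specify), attention maps are flattened to vectors, and entropy is computed per attention row in nats.

```python
import math

def l2_error(original, reconstructed):
    """Mean L2 distance between paired token embeddings."""
    dists = [math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))
             for x, y in zip(original, reconstructed)]
    return sum(dists) / len(dists)

def cosine_similarity(p, q):
    """Cosine similarity between two flattened attention maps."""
    dot = sum(a * b for a, b in zip(p, q))
    norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in q))
    return dot / (norm + 1e-12)

def entropy(weights):
    """Shannon entropy of one attention row after normalising it to sum to 1."""
    total = sum(weights)
    return -sum((w / total) * math.log(w / total) for w in weights if w > 0)
```

Comparing `entropy` on the same query row before and after reduction indicates whether attention has become more concentrated, while `l2_error` and `cosine_similarity` measure how far the representations themselves have moved.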

  2. Referee: [Abstract and Experiments] The abstract asserts 'competitive performances' and 'substantial computational efficiency' yet the provided text contains no numerical results, ablation tables, or direct comparisons against prior pruning baselines, rendering the efficiency-fidelity trade-off impossible to assess from the manuscript.

    Authors: The full manuscript contains tables reporting accuracy on short- and long-video benchmarks together with FLOPs and latency reductions versus prior pruning methods. To make these results immediately visible, we will revise the abstract to include key numerical highlights (e.g., accuracy deltas and efficiency gains) and ensure all ablation tables and baseline comparisons appear in the main body with clear captions. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation applies standard OT to new anchors without self-referential reduction

full rationale

The paper presents a training-free token reduction method that first selects attention-guided anchors within frames and then applies optimal transport to aggregate pruned tokens locally, followed by inter-frame keyframe OT on clips. All steps invoke the standard OT formulation (transport plans between anchor and pruned token distributions) without fitting any parameters to the target benchmark data and then relabeling those fits as predictions. No self-citations are used to justify uniqueness or to smuggle in an ansatz; the construction is self-contained and externally falsifiable via the reported benchmark scores. The central efficiency-plus-fidelity claim therefore does not collapse to a tautology or to a fitted-input-called-prediction pattern.
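For reference, the "standard OT formulation" mentioned here is usually written in its entropy-regularised form and solved with Sinkhorn iterations. The block below states that textbook problem; whether the paper uses exactly this cost, these masses, or this regulariser is not confirmed by the text on this page, and the symbols are assumptions introduced for illustration.

```latex
% Assumed symbols: C_{ij} is the cost between pruned token i and anchor j,
% a and b are the mass vectors over pruned tokens and anchors,
% \varepsilon > 0 is the entropic regularisation strength.
\begin{aligned}
T^\star &= \arg\min_{T \in U(a,b)} \; \langle T, C \rangle - \varepsilon H(T), \\
U(a,b) &= \{\, T \in \mathbb{R}_{+}^{n \times m} : T\mathbf{1}_m = a,\; T^\top \mathbf{1}_n = b \,\}, \\
H(T) &= -\sum_{i,j} T_{ij}\,(\log T_{ij} - 1), \\
T^\star &= \operatorname{diag}(u)\, K \,\operatorname{diag}(v), \qquad K_{ij} = e^{-C_{ij}/\varepsilon}, \\
u &\leftarrow a \oslash (K v), \qquad v \leftarrow b \oslash (K^\top u).
\end{aligned}
```

Nothing in this problem is fitted to benchmark labels: the plan depends only on the cost matrix, the masses, and the regulariser, which is consistent with the no-circularity finding above.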

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract supplies no equations, parameters, or assumptions; the approach implicitly relies on standard optimal transport properties and attention mechanisms from prior VLLM literature.

pith-pipeline@v0.9.0 · 5542 in / 975 out tokens · 45819 ms · 2026-05-15T18:22:03.090415+00:00 · methodology

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. PoInit-of-View: Poisoning Initialization of Views Transfers Across Multiple 3D Reconstruction Systems

    cs.CV 2026-04 unverdicted novelty 7.0

    PoInit-of-View poisons SfM initialization by optimizing cross-view gradient inconsistencies to disrupt keypoint detection and feature matching, yielding transferable degradation in rendered 3D reconstruction quality a...

  2. OTT-Vid: Optimal Transport Temporal Token Compression for Video Large Language Models

    cs.CV 2026-05 unverdicted novelty 6.0

    OTT-Vid uses optimal transport with non-uniform token mass and locality-aware costs to dynamically allocate compression budgets across video frames, retaining 95.8% VQA and 73.9% VTG performance at 10% token retention.

Reference graph

Works this paper leans on

84 extracted references · 84 canonical work pages · cited by 2 Pith papers · 18 internal anchors

  1. [1]

    GPT-4 Technical Report

    Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ah- mad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report.arXiv preprint arXiv:2303.08774,

  2. [2]

    Flamingo: a visual language model for few-shot learning

    Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning. InNeurIPS, pages 23716–23736, 2022. 2

  3. [3]

    LLaVA-OneVision-1.5: Fully Open Framework for Democratized Multimodal Training

    Xiang An, Yin Xie, Kaicheng Yang, Wenkang Zhang, Xiuwei Zhao, Zheng Cheng, Yirui Wang, Songcen Xu, Changrui Chen, Chunsheng Wu, et al. Llava-onevision-1.5: Fully open framework for democratized multimodal training. arXiv preprint arXiv:2509.23661, 2025. 2

  4. [4]

    Hired: Attention-guided token dropping for efficient inference of high-resolution vision-language models

    Kazi Hasan Ibn Arif, JinYi Yoon, Dimitrios S Nikolopou- los, Hans Vandierendonck, Deepu John, and Bo Ji. Hired: Attention-guided token dropping for efficient inference of high-resolution vision-language models. InAAAI, pages 1773–1781, 2025. 3

  5. [5]

    Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2. 5-vl technical report.arXiv preprint arXiv:2502.13923, 2025. 1, 2, 4

  6. [6]

    Token Merging: Your ViT But Faster

    Daniel Bolya, Cheng-Yang Fu, Xiaoliang Dai, Peizhao Zhang, Christoph Feichtenhofer, and Judy Hoffman. To- ken merging: Your vit but faster.arXiv preprint arXiv:2210.09461, 2022. 1, 2, 3

  7. [7]

    Auroracap: Efficient, performant video detailed captioning and a new benchmark

    Wenhao Chai, Enxin Song, Yilun Du, Chenlin Meng, Vashisht Madhavan, Omer Bar-Tal, Jenq-Neng Hwang, Sain- ing Xie, and Christopher D Manning. Auroracap: Efficient, performant video detailed captioning and a new benchmark. arXiv preprint arXiv:2410.03051, 2024. 1

  8. [8]

    Sharegpt4video: Improving video understand- ing and generation with better captions

    Lin Chen, Xilin Wei, Jinsong Li, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Zehui Chen, Haodong Duan, Zhenyu Tang, Li Yuan, et al. Sharegpt4video: Improving video understand- ing and generation with better captions. InNeurIPS, pages 19472–19495, 2024. 1, 2

  9. [9]

    An image is worth 1/2 tokens after layer 2: Plug-and-play inference acceleration for large vision-language models

    Liang Chen, Haozhe Zhao, Tianyu Liu, Shuai Bai, Junyang Lin, Chang Zhou, and Baobao Chang. An image is worth 1/2 tokens after layer 2: Plug-and-play inference acceleration for large vision-language models. InECCV, pages 19–35. Springer, 2024. 3, 6, 7, 12

  10. [10]

    How far are we to gpt-4v? closing the gap to commercial multimodal models with open-source suites.Science China Information Sciences, 67(12):220101,

    Zhe Chen, Weiyun Wang, Hao Tian, Shenglong Ye, Zhang- wei Gao, Erfei Cui, Wenwen Tong, Kongzhi Hu, Jiapeng Luo, Zheng Ma, et al. How far are we to gpt-4v? closing the gap to commercial multimodal models with open-source suites.Science China Information Sciences, 67(12):220101,

  11. [11]

    VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs

    Zesen Cheng, Sicong Leng, Hang Zhang, Yifei Xin, Xin Li, Guanzheng Chen, Yongxin Zhu, Wenqi Zhang, Ziyang Luo, Deli Zhao, et al. Videollama 2: Advancing spatial- temporal modeling and audio understanding in video-llms. arXiv preprint arXiv:2406.07476, 2024. 1, 2

  12. [12]

    Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality.See https://vicuna

    Wei-Lin Chiang, Zhuohan Li, Ziqing Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E Gonzalez, et al. Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality.See https://vicuna. lmsys. org (accessed 14 April 2023), 2(3):6, 2023. 2

  13. [13]

    Sinkhorn distances: Lightspeed computation of optimal transport

    Marco Cuturi. Sinkhorn distances: Lightspeed computation of optimal transport. InNeurIPS, 2013. 2, 3, 5, 9, 10

  14. [14]

    An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

    Alexey Dosovitskiy. An image is worth 16x16 words: Transformers for image recognition at scale.arXiv preprint arXiv:2010.11929, 2020. 3

  15. [15]

    Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis

    Chaoyou Fu, Yuhan Dai, Yongdong Luo, Lei Li, Shuhuai Ren, Renrui Zhang, Zihan Wang, Chenyu Zhou, Yunhang Shen, Mengdan Zhang, et al. Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis. InCVPR, pages 24108–24118, 2025. 2, 5

  16. [16]

    Framefusion: Combining similarity and importance for video token reduction on large visual language models.arXiv preprint arXiv:2501.01986, 2024

    Tianyu Fu, Tengxuan Liu, Qinghao Han, Guohao Dai, Shen- gen Yan, Huazhong Yang, Xuefei Ning, and Yu Wang. Framefusion: Combining similarity and importance for video token reduction on large visual language models.arXiv preprint arXiv:2501.01986, 2024. 2, 3

  17. [17]

    Prunevid: Visual to- ken pruning for efficient video large language models.arXiv preprint arXiv:2412.16117, 2024

    Xiaohu Huang, Hao Zhou, and Kai Han. Prunevid: Visual to- ken pruning for efficient video large language models.arXiv preprint arXiv:2412.16117, 2024. 2, 3, 6, 7, 12

  18. [18]

    Chat-univi: Unified visual representation em- powers large language models with image and video under- standing

    Peng Jin, Ryuichi Takanobu, Wancai Zhang, Xiaochun Cao, and Li Yuan. Chat-univi: Unified visual representation em- powers large language models with image and video under- standing. InCVPR, pages 13700–13710, 2024. 1, 3

  19. [19]

    Sparsevila: Decoupling visual sparsity for efficient vlm inference

    Samir Khaki, Junxian Guo, Jiaming Tang, Shang Yang, Yukang Chen, Konstantinos N Plataniotis, Yao Lu, Song Han, and Zhijian Liu. Sparsevila: Decoupling visual sparsity for efficient vlm inference. InICCV, pages 23784–23794,

  20. [20]

    Token reduction should go beyond effi- ciency in generative models–from vision, language to mul- timodality.arXiv preprint arXiv:2505.18227, 2025

    Zhenglun Kong, Yize Li, Fanhu Zeng, Lei Xin, Shvat Mes- sica, Xue Lin, Pu Zhao, Manolis Kellis, Hao Tang, and Marinka Zitnik. Token reduction should go beyond effi- ciency in generative models–from vision, language to mul- timodality.arXiv preprint arXiv:2505.18227, 2025. 3

  21. [21]

    Lmms-eval: Accelerating the develop- ment of large multimoal models, 2024

    Bo Li, Peiyuan Zhang, Kaichen Zhang, Fanyi Pu, Xinrun Du, Yuhao Dong, Haotian Liu, Yuanhan Zhang, Ge Zhang, Chunyuan Li, et al. Lmms-eval: Accelerating the develop- ment of large multimoal models, 2024. 6

  22. [22]

    LLaVA-OneVision: Easy Visual Task Transfer

    Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Peiyuan Zhang, Yanwei Li, Zi- wei Liu, et al. Llava-onevision: Easy visual task transfer. arXiv preprint arXiv:2408.03326, 2024. 2, 4, 6, 11, 12

  23. [23]

    Expansion and shrinkage of localization for weakly- supervised semantic segmentation.NeurIPS, 35:16037– 16051, 2022

    Jinlong Li, Zequn Jie, Xu Wang, Xiaolin Wei, and Lin Ma. Expansion and shrinkage of localization for weakly- supervised semantic segmentation.NeurIPS, 35:16037– 16051, 2022. 13

  24. [24]

    Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models

    Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. InICML, pages 19730–19742. PMLR, 2023. 2

  25. [25]

    Cross-modal and uncertainty-aware agglomeration for open- vocabulary 3d scene understanding

    Jinlong Li, Cristiano Saltori, Fabio Poiesi, and Nicu Sebe. Cross-modal and uncertainty-aware agglomeration for open- vocabulary 3d scene understanding. InCVPR, pages 19390– 19400, 2025. 13

  26. [26]

    Orthogonal projection subspace to aggregate online prior-knowledge for continual test-time adaptation

    Jinlong Li, Dong Zhao, Qi Zang, Zequn Jie, Lin Ma, and Nicu Sebe. Orthogonal projection subspace to aggregate online prior-knowledge for continual test-time adaptation. arXiv preprint arXiv:2506.19022, 2025. 13

  27. [27]

    VideoChat: Chat-Centric Video Understanding

    KunChang Li, Yinan He, Yi Wang, Yizhuo Li, Wenhai Wang, Ping Luo, Yali Wang, Limin Wang, and Yu Qiao. Videochat: Chat-centric video understanding.arXiv preprint arXiv:2305.06355, 2023. 1, 2

  28. [28]

    Mvbench: A comprehensive multi-modal video understand- ing benchmark

    Kunchang Li, Yali Wang, Yinan He, Yizhuo Li, Yi Wang, Yi Liu, Zun Wang, Jilan Xu, Guo Chen, Ping Luo, et al. Mvbench: A comprehensive multi-modal video understand- ing benchmark. InCVPR, pages 22195–22206, 2024. 1, 2, 5

  29. [29]

    Llama-vid: An image is worth 2 tokens in large language models

    Yanwei Li, Chengyao Wang, and Jiaya Jia. Llama-vid: An image is worth 2 tokens in large language models. InECCV, pages 323–340. Springer, 2024. 1, 3

  30. [30]

    Video-LLaVA: Learning United Visual Representation by Alignment Before Projection

    Bin Lin, Yang Ye, Bin Zhu, Jiaxi Cui, Munan Ning, Peng Jin, and Li Yuan. Video-llava: Learning united visual rep- resentation by alignment before projection.arXiv preprint arXiv:2311.10122, 2023. 1, 2

  31. [31]

    Vila: On pre-training for visual language models

    Ji Lin, Hongxu Yin, Wei Ping, Pavlo Molchanov, Moham- mad Shoeybi, and Song Han. Vila: On pre-training for visual language models. InCVPR, pages 26689–26699, 2024. 3

  32. [32]

    Improved baselines with visual instruction tuning

    Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. InCVPR, pages 26296–26306, 2024. 2

  33. [33]

    Llavanext: Improved reasoning, ocr, and world knowledge, 2024

    Haotian Liu, Chunyuan Li, Yuheng Li, Bo Li, Yuanhan Zhang, Sheng Shen, and Yong Jae Lee. Llavanext: Improved reasoning, ocr, and world knowledge, 2024. 2

  34. [34]

    Multi-stage vision token dropping: Towards efficient multimodal large language model.arXiv preprint arXiv:2411.10803, 2024

    Ting Liu, Liangtao Shi, Richang Hong, Yue Hu, Quanjun Yin, and Linfeng Zhang. Multi-stage vision token dropping: Towards efficient multimodal large language model.arXiv preprint arXiv:2411.10803, 2024. 3

  35. [35]

    Less: Label-efficient and single-stage referring 3d instance segmentation

    Xuexun Liu, Xu Xiaoxu, Jinlong Li, Qiudan Zhang, Xu Wang, Nicu Sebe, Ma Lin, et al. Less: Label-efficient and single-stage referring 3d instance segmentation. InNeurIPS. NeurIPS, 2024. 13

  36. [36]

    Hybrid-level instruction injection for video token com- pression in multi-modal large language models

    Zhihang Liu, Chen-Wei Xie, Pandeng Li, Liming Zhao, Longxiang Tang, Yun Zheng, Chuanbin Liu, and Hongtao Xie. Hybrid-level instruction injection for video token com- pression in multi-modal large language models. InCVPR, pages 8568–8578, 2025. 2

  37. [37]

    Nvila: Efficient frontier visual language models

    Zhijian Liu, Ligeng Zhu, Baifeng Shi, Zhuoyang Zhang, Yuming Lou, Shang Yang, Haocheng Xi, Shiyi Cao, Yux- ian Gu, Dacheng Li, et al. Nvila: Efficient frontier visual language models. InCVPR, pages 4122–4134, 2025. 3

  38. [38]

    Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models

    Muhammad Maaz, Hanoona Rasheed, Salman Khan, and Fa- had Shahbaz Khan. Video-chatgpt: Towards detailed video understanding via large vision and language models.arXiv preprint arXiv:2306.05424, 2023. 1, 3

  39. [39]

    Egoschema: A diagnostic benchmark for very long- form video language understanding

    Karttikeya Mangalam, Raiymbek Akshulakov, and Jitendra Malik. Egoschema: A diagnostic benchmark for very long- form video language understanding. InNeurIPS, pages 46212–46244, 2023. 2, 5

  40. [40]

    Perla: Perceptive 3d language assistant

    Guofeng Mei, Wei Lin, Luigi Riz, Yujiao Wu, Fabio Poiesi, and Yiming Wang. Perla: Perceptive 3d language assistant. InCVPR, pages 14369–14379, 2025. 13

  41. [41]

    M ´emoire sur la th ´eorie des d ´eblais et des remblais.Mem

    Gaspard Monge. M ´emoire sur la th ´eorie des d ´eblais et des remblais.Mem. Math. Phys. Acad. Royale Sci., pages 666– 704, 1781. 9

  42. [42]

    T2td: Text-3d generation model based on prior knowledge guidance.IEEE Transactions on Pattern Analysis and Machine Intelligence, 47(1):172–189, 2024

    Weizhi Nie, Ruidong Chen, Weijie Wang, Bruno Lepri, and Nicu Sebe. T2td: Text-3d generation model based on prior knowledge guidance.IEEE Transactions on Pattern Analysis and Machine Intelligence, 47(1):172–189, 2024. 13

  43. [43]

    Learn- ing transferable visual models from natural language super- vision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learn- ing transferable visual models from natural language super- vision. InICML, pages 8748–8763. PmLR, 2021. 1, 3, 11

  44. [44]

    Llava-prumerge: Adaptive token reduction for efficient large multimodal models

    Yuzhang Shang, Mu Cai, Bingxin Xu, Yong Jae Lee, and Yan Yan. Llava-prumerge: Adaptive token reduction for efficient large multimodal models. InICCV, pages 22857–22867,

  45. [45]

    Holitom: Holistic token merging for fast video large language models.arXiv preprint arXiv:2505.21334,

    Kele Shao, Keda Tao, Can Qin, Haoxuan You, Yang Sui, and Huan Wang. Holitom: Holistic token merging for fast video large language models.arXiv preprint arXiv:2505.21334,

  46. [46]

    Tempme: Video temporal token merging for efficient text- video retrieval.arXiv preprint arXiv:2409.01156, 2024

    Leqi Shen, Tianxiang Hao, Tao He, Sicheng Zhao, Yifeng Zhang, Pengzhang Liu, Yongjun Bao, and Guiguang Ding. Tempme: Video temporal token merging for efficient text- video retrieval.arXiv preprint arXiv:2409.01156, 2024. 3

  47. [47]

    Fastvid: Dynamic density pruning for fast video large language mod- els.arXiv preprint arXiv:2503.11187, 2025

    Leqi Shen, Guoqiang Gong, Tao He, Yifeng Zhang, Pengzhang Liu, Sicheng Zhao, and Guiguang Ding. Fastvid: Dynamic density pruning for fast video large language mod- els.arXiv preprint arXiv:2503.11187, 2025. 3, 6, 11, 12

  48. [48]

    Longvu: Spa- tiotemporal adaptive compression for long video-language understanding.arXiv preprint arXiv:2410.17434, 2024

    Xiaoqian Shen, Yunyang Xiong, Changsheng Zhao, Lemeng Wu, Jun Chen, Chenchen Zhu, Zechun Liu, Fanyi Xiao, Bal- akrishnan Varadarajan, Florian Bordes, et al. Longvu: Spa- tiotemporal adaptive compression for long video-language understanding.arXiv preprint arXiv:2410.17434, 2024. 1, 2, 3

  49. [49]

    Moviechat: From dense token to sparse memory for long video understanding

    Enxin Song, Wenhao Chai, Guanhong Wang, Yucheng Zhang, Haoyang Zhou, Feiyang Wu, Haozhe Chi, Xun Guo, Tian Ye, Yanting Zhang, et al. Moviechat: From dense token to sparse memory for long video understanding. InCVPR, pages 18221–18232, 2024. 1

  50. [50]

    To- kencarve: Information-preserving visual token compres- sion in multimodal large language models.arXiv preprint arXiv:2503.10501, 2025

    Xudong Tan, Peng Ye, Chongjun Tu, Jianjian Cao, Yaoxin Yang, Lin Zhang, Dongzhan Zhou, and Tao Chen. To- kencarve: Information-preserving visual token compres- sion in multimodal large language models.arXiv preprint arXiv:2503.10501, 2025. 3

  51. [51]

    Dycoke: Dynamic compression of tokens for fast video large language models

    Keda Tao, Can Qin, Haoxuan You, Yang Sui, and Huan Wang. Dycoke: Dynamic compression of tokens for fast video large language models. InCVPR, pages 18992–19001,

  52. [52]

    Stanford alpaca: An instruction-following llama model, 2023

    Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B Hashimoto. Stanford alpaca: An instruction-following llama model, 2023. 2

  53. [53]

    Gemini: A Family of Highly Capable Multimodal Models

    Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean- Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, Katie Millican, et al. Gemini: a family of highly capable multimodal models.arXiv preprint arXiv:2312.11805, 2023. 2

  54. [54]

    Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context

    Gemini Team, Petko Georgiev, Ving Ian Lei, Ryan Burnell, Libin Bai, Anmol Gulati, Garrett Tanzer, Damien Vincent, Zhufeng Pan, Shibo Wang, et al. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of con- text.arXiv preprint arXiv:2403.05530, 2024. 1, 3

  55. [55]

    Introduction to optimal transport

    Matthew Thorpe. Introduction to optimal transport. Notes of Course at University of Cambridge, 3, 2018. 9

  56. [56]

    Optimal transport: old and new

    Cédric Villani et al. Optimal transport: old and new. Springer, 2008. 2

  57. [57]

    Ross3d: Reconstructive visual instruction tuning with 3d-awareness

    Haochen Wang, Yucheng Zhao, Tiancai Wang, Haoqiang Fan, Xiangyu Zhang, and Zhaoxiang Zhang. Ross3d: Reconstructive visual instruction tuning with 3d-awareness. In CVPR, pages 9275–9286, 2025. 13

  58. [58]

    Chatvideo: A tracklet-centric multimodal and versatile video understanding system

    Junke Wang, Dongdong Chen, Chong Luo, Xiyang Dai, Lu Yuan, Zuxuan Wu, and Yu-Gang Jiang. Chatvideo: A tracklet-centric multimodal and versatile video understanding system. arXiv preprint arXiv:2304.14407, 2023. 1, 2

  59. [59]

    Uvmap-id: A controllable and personalized uv map generative model

    Weijie Wang, Jichao Zhang, Chang Liu, Xia Li, Xingqian Xu, Humphrey Shi, Nicu Sebe, and Bruno Lepri. Uvmap-id: A controllable and personalized uv map generative model. In ACM MM, pages 10725–10734, 2024. 13

  60. [60]

    Stop looking for important tokens in multimodal language models: Duplication matters more

    Zichen Wen, Yifeng Gao, Shaobo Wang, Junyuan Zhang, Qintong Zhang, Weijia Li, Conghui He, and Linfeng Zhang. Stop looking for important tokens in multimodal language models: Duplication matters more. arXiv preprint arXiv:2502.11494, 2025. 3

  61. [61]

    Longvlm: Efficient long video understanding via large language models

    Yuetian Weng, Mingfei Han, Haoyu He, Xiaojun Chang, and Bohan Zhuang. Longvlm: Efficient long video understanding via large language models. In ECCV, pages 453–470. Springer, 2024. 1, 2

  62. [62]

    Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models

    Chenfei Wu, Shengming Yin, Weizhen Qi, Xiaodong Wang, Zecheng Tang, and Nan Duan. Visual chatgpt: Talking, drawing and editing with visual foundation models. arXiv preprint arXiv:2303.04671, 2023. 1, 2

  63. [63]

    Longvideobench: A benchmark for long-context interleaved video-language understanding

    Haoning Wu, Dongxu Li, Bei Chen, and Junnan Li. Longvideobench: A benchmark for long-context interleaved video-language understanding. In NeurIPS, pages 28828–28857, 2024. 2, 5

  64. [64]

    PyramidDrop: Accelerating Your Large Vision-Language Models via Pyramid Visual Redundancy Reduction

    Long Xing, Qidong Huang, Xiaoyi Dong, Jiajie Lu, Pan Zhang, Yuhang Zang, Yuhang Cao, Conghui He, Jiaqi Wang, Feng Wu, et al. Pyramiddrop: Accelerating your large vision-language models via pyramid visual redundancy reduction. arXiv preprint arXiv:2410.17247, 2024. 1, 3, 6, 7, 12

  65. [65]

    Conical visual concentration for efficient large vision-language models

    Long Xing, Qidong Huang, Xiaoyi Dong, Jiajie Lu, Pan Zhang, Yuhang Zang, Yuhang Cao, Conghui He, Jiaqi Wang, Feng Wu, et al. Conical visual concentration for efficient large vision-language models. In CVPR, pages 14593–14603, 2025. 3

  66. [66]

    Pllava: Parameter-free llava extension from images to videos for video dense captioning

    Lin Xu, Yilin Zhao, Daquan Zhou, Zhijie Lin, See Kiong Ng, and Jiashi Feng. Pllava: Parameter-free llava extension from images to videos for video dense captioning. arXiv preprint arXiv:2404.16994, 2024. 1, 3

  67. [67]

    Topv: Compatible token pruning with inference time optimization for fast and low-memory multimodal vision language model

    Cheng Yang, Yang Sui, Jinqi Xiao, Lingyi Huang, Yu Gong, Chendi Li, Jinghua Yan, Yu Bai, Ponnuswamy Sadayappan, Xia Hu, et al. Topv: Compatible token pruning with inference time optimization for fast and low-memory multimodal vision language model. In CVPR, pages 19803–19813, 2025. 3

  68. [68]

    Visionzip: Longer is better but not necessary in vision language models

    Senqiao Yang, Yukang Chen, Zhuotao Tian, Chengyao Wang, Jingyao Li, Bei Yu, and Jiaya Jia. Visionzip: Longer is better but not necessary in vision language models. In CVPR, pages 19792–19802, 2025. 1, 2, 3, 4, 6, 7, 12

  69. [69]

    Atp-llava: Adaptive token pruning for large vision language models

    Xubing Ye, Yukang Gan, Yixiao Ge, Xiao-Ping Zhang, and Yansong Tang. Atp-llava: Adaptive token pruning for large vision language models. In CVPR, pages 24972–24982,

  70. [70]

    Video question answering with prior knowledge and object-sensitive learning

    Pengpeng Zeng, Haonan Zhang, Lianli Gao, Jingkuan Song, and Heng Tao Shen. Video question answering with prior knowledge and object-sensitive learning. IEEE Transactions on Image Processing, 31:5936–5948, 2022. 2

  71. [71]

    Sigmoid loss for language image pre-training

    Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. Sigmoid loss for language image pre-training. In ICCV, pages 11975–11986, 2023. 1, 3, 4, 11

  72. [72]

    Vscan: Rethinking visual token reduction for efficient large vision-language models

    Ce Zhang, Kaixin Ma, Tianqing Fang, Wenhao Yu, Hongming Zhang, Zhisong Zhang, Yaqi Xie, Katia Sycara, Haitao Mi, and Dong Yu. Vscan: Rethinking visual token reduction for efficient large vision-language models. arXiv preprint arXiv:2505.22654, 2025. 3, 4

  73. [73]

    Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding

    Hang Zhang, Xin Li, and Lidong Bing. Video-llama: An instruction-tuned audio-visual language model for video understanding. arXiv preprint arXiv:2306.02858, 2023. 1, 2

  74. [74]

    Omnicharacter: Towards immersive role-playing agents with seamless speech-language personality interaction

    Haonan Zhang, Run Luo, Xiong Liu, Yuchuan Wu, Ting-En Lin, Pengpeng Zeng, Qiang Qu, Feiteng Fang, Min Yang, Lianli Gao, et al. Omnicharacter: Towards immersive role-playing agents with seamless speech-language personality interaction. In ACL (Volume 1: Long Papers), pages 26318–26331, 2025. 2

  75. [75]

    Text-video retrieval with global-local semantic consistent learning

    Haonan Zhang, Pengpeng Zeng, Lianli Gao, Jingkuan Song, Yihang Duan, Xinyu Lyu, and Heng Tao Shen. Text-video retrieval with global-local semantic consistent learning. IEEE Transactions on Image Processing, 2025. 2

  76. [76]

    Lmms-eval: Reality check on the evaluation of large multimodal models

    Kaichen Zhang, Bo Li, Peiyuan Zhang, Fanyi Pu, Joshua Adrian Cahyono, Kairui Hu, Shuai Liu, Yuanhan Zhang, Jingkang Yang, Chunyuan Li, et al. Lmms-eval: Reality check on the evaluation of large multimodal models. In NAACL 2025, pages 881–916, 2025. 6

  77. [77]

    [cls] attention is all you need for training-free visual token pruning: Make vlm inference faster

    Qizhe Zhang, Aosong Cheng, Ming Lu, Zhiyong Zhuo, Minqi Wang, Jiajun Cao, Shaobo Guo, Qi She, and Shanghang Zhang. [cls] attention is all you need for training-free visual token pruning: Make vlm inference faster. arXiv e-prints, pages arXiv–2412, 2024. 3, 4

  78. [78]

    SparseVLM: Visual Token Sparsification for Efficient Vision-Language Model Inference

    Yuan Zhang, Chun-Kai Fan, Junpeng Ma, Wenzhao Zheng, Tao Huang, Kuan Cheng, Denis Gudovskiy, Tomoyuki Okuno, Yohei Nakata, Kurt Keutzer, et al. Sparsevlm: Visual token sparsification for efficient vision-language model inference. arXiv preprint arXiv:2410.04417, 2024. 1, 3

  79. [79]

    Llava-next: A strong zero-shot video understanding model

    Yuanhan Zhang, Bo Li, Haotian Liu, Yong Jae Lee, Liangke Gui, Di Fu, Jiashi Feng, Ziwei Liu, and Chunyuan Li. Llava-next: A strong zero-shot video understanding model, 2024. 2, 4

  80. [80]

    LLaVA-Video: Video Instruction Tuning With Synthetic Data

    Yuanhan Zhang, Jinming Wu, Wei Li, Bo Li, Zejun Ma, Ziwei Liu, and Chunyuan Li. Video instruction tuning with synthetic data. arXiv preprint arXiv:2410.02713, 2024. 2, 6, 11, 12

Showing first 80 references.