arxiv: 2603.01400 · v2 · submitted 2026-03-02 · 💻 cs.CV

Recognition: unknown

Token Reduction via Local and Global Contexts Optimization for Efficient Video Large Language Models

Jinlong Li , Liyuan Jiang , Haonan Zhang , Nicu Sebe

Pith reviewed 2026-05-15 18:22 UTC · model grok-4.3

classification 💻 cs.CV

keywords: video large language models, token reduction, optimal transport, visual token pruning, training-free compression, spatiotemporal efficiency, efficient inference

The pith

Token anchors via local-global optimal transport reduce visual tokens in video LLMs while maintaining competitive accuracy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces AOT as a training-free approach that defines anchors for visual tokens inside each frame and across consecutive frames. These anchors collect information from tokens that would otherwise be discarded by using optimal transport to move context both spatially within frames and temporally between frames. Existing pruning methods either focus only on spatial redundancy inside single frames or discard context outright, leading to efficiency gains that come at the cost of lost detail. A sympathetic reader would care because video LLMs currently process thousands of tokens per second of video, driving high compute and memory demands that limit real-world use. If the aggregation step succeeds, models can handle longer videos or run on smaller hardware without retraining.
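
The intra-frame aggregation described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: anchors are simply the top-`keep` tokens by a given attention score (the paper's local- and global-aware selection is not reproduced), the cost is cosine distance, both sides carry uniform mass, and the names `intra_frame_anchors`, `mix`, `eps`, and `n_iters` are hypothetical.

```python
import numpy as np


def sinkhorn(cost, a, b, eps=0.1, n_iters=200):
    """Entropic OT plan between histograms a (n,) and b (m,) for cost (n, m)."""
    K = np.exp(-cost / eps)              # Gibbs kernel
    u = np.ones_like(a)
    v = np.ones_like(b)
    for _ in range(n_iters):
        u = a / (K @ v)                  # rescale rows toward marginal a
        v = b / (K.T @ u)                # rescale columns toward marginal b
    return u[:, None] * K * v[None, :]


def intra_frame_anchors(tokens, attn, keep, mix=0.5):
    """Reduce one frame's tokens (n, d) to `keep` anchors, with 0 < keep < n.

    attn: (n,) non-negative importance scores, assumed to be given.
    mix:  hypothetical weight on the context transported from pruned tokens.
    """
    order = np.argsort(-attn)
    anchor_idx, pruned_idx = order[:keep], order[keep:]
    anchors, pruned = tokens[anchor_idx], tokens[pruned_idx]

    # Cosine-distance cost between pruned tokens (rows) and anchors (columns).
    pn = pruned / np.linalg.norm(pruned, axis=1, keepdims=True)
    an = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    cost = 1.0 - pn @ an.T

    # Uniform mass on both sides (an assumption of this sketch).
    a = np.full(len(pruned), 1.0 / len(pruned))
    b = np.full(keep, 1.0 / keep)
    plan = sinkhorn(cost, a, b)

    # Barycentric projection: each anchor gets a plan-weighted mean of pruned tokens.
    context = (plan.T @ pruned) / plan.sum(axis=0, keepdims=True).T
    return (1.0 - mix) * anchors + mix * context, anchor_idx
```

With `mix=0.0` the function returns the selected tokens unchanged, which is plain top-k pruning; raising `mix` blends in more of the transported context.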

Core claim

We establish local- and global-aware token anchors within each frame under attention guidance; optimal transport then aggregates the informative context from pruned tokens into these intra-frame anchors. Building on temporal frame clips, the first frame of each clip supplies keyframe anchors that absorb similar information from consecutive frames through optimal transport, while distinct tokens are kept to represent temporal dynamics. This produces efficient token reduction in a training-free manner and yields competitive performance across short- and long-video benchmarks on leading video LLMs while preserving temporal and visual fidelity.
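
The clip-level step can be sketched in the same spirit. The following is an illustrative assumption rather than the paper's procedure: a later-frame token counts as redundant when its best cosine similarity to any keyframe token reaches a hypothetical threshold `tau`, redundant tokens are folded into the keyframe tokens with a balanced entropic OT plan, and all other tokens are kept as distinct. The names `compress_clip`, `tau`, and `mix` are invented for this example.

```python
import numpy as np


def sinkhorn(cost, a, b, eps=0.1, n_iters=200):
    """Entropic OT plan between histograms a (n,) and b (m,) for cost (n, m)."""
    K = np.exp(-cost / eps)
    u, v = np.ones_like(a), np.ones_like(b)
    for _ in range(n_iters):
        u = a / (K @ v)
        v = b / (K.T @ u)
    return u[:, None] * K * v[None, :]


def unit(x):
    """Row-normalize a (n, d) array to unit L2 norm."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)


def compress_clip(clip, tau=0.8, mix=0.5):
    """clip: (T, n, d) tokens for T consecutive frames; frame 0 is the keyframe.

    Returns (keyframe_anchors, distinct_tokens).
    """
    key = clip[0]
    later = clip[1:].reshape(-1, clip.shape[-1])   # tokens of frames 1..T-1
    sim = unit(later) @ unit(key).T                 # (m, n) cosine similarities
    redundant = sim.max(axis=1) >= tau              # close to some keyframe token
    distinct = later[~redundant]                    # kept to represent change over time
    merged = later[redundant]
    if len(merged) == 0:
        return key.copy(), distinct

    cost = 1.0 - unit(merged) @ unit(key).T
    a = np.full(len(merged), 1.0 / len(merged))     # uniform mass (assumption)
    b = np.full(len(key), 1.0 / len(key))
    plan = sinkhorn(cost, a, b)
    context = (plan.T @ merged) / plan.sum(axis=0)[:, None]
    return (1.0 - mix) * key + mix * context, distinct
```

The token count after this step is the number of keyframe tokens plus however many tokens fall below `tau`, so a static clip compresses heavily while a clip with motion keeps more tokens.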

What carries the argument

Local- and global-aware token anchors that aggregate pruned-token context via optimal transport (AOT)

If this is right

Competitive accuracy on short- and long-video benchmarks for leading video LLMs
Substantial reduction in computational cost while retaining temporal and visual fidelity
Training-free operation that works directly on existing video LLMs
Better handling of both intra-frame spatial redundancy and inter-frame temporal redundancy than prior pruning techniques
Ability to keep distinct tokens for motion while aggregating repeated information across clips

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same anchor-and-transport pattern could be tested on image-only LLMs to reduce spatial tokens without retraining
Longer untrimmed videos might become feasible on fixed hardware budgets if the temporal aggregation scales linearly with clip length
Replacing attention-guided anchors with learned parameters could be measured to see whether further token savings are possible
The method invites direct comparison of per-token information retention against simple averaging or clustering baselines on the same datasets

Load-bearing premise

Optimal transport aggregation from pruned tokens into anchors preserves subtle yet informative context without meaningful loss for downstream video understanding tasks.

What would settle it

Measure accuracy on a fine-grained action recognition benchmark after AOT token reduction; if performance falls more than a few percent relative to the unpruned baseline at the same total token budget, the preservation claim fails.

Figures

Figures reproduced from arXiv: 2603.01400 by Haonan Zhang, Jinlong Li, Liyuan Jiang, Nicu Sebe.

**Figure 2.** Overall pipeline of our AOT. Our method compresses the tokens of video LLMs across the spatial and temporal dimensions through optimal transport, first establishing token anchors within each frame to cover semantically important and spatially diverse token candidates, then utilizing optimal transport to aggregate the necessary informative cues within each frame in phase I, and finally shifting the optimization strategy into tempo…

**Figure 3.** Left: scaling with more frames leads to more efficient and effective visual information abstraction. Right: sensitivity analysis of the weighting coefficients λ_intra and λ_inter, which control the contextual contribution, under a consistent configuration.

**Figure 4.** Qualitative visualizations of how our local-global token anchors evolve across consecutive frames, with optimal transport aggregating the necessary information from unselected tokens to help the LLM process the video better.

**Figure 5.** Qualitative visualizations of how our local-global token anchors evolve across consecutive frames on an MVBench sample, with optimal transport aggregating the necessary information from unselected tokens to help the LLM process the video better. Top: the original sampled frames. Bottom: the corresponding token visualization.

**Figure 6.** Qualitative visualizations of how our local-global token anchors evolve across consecutive frames on a VideoMME sample, with optimal transport aggregating the necessary information from unselected tokens to help the LLM process the video better. Top: the original sampled frames. Bottom: the corresponding token visualization.

Original abstract

Video Large Language Models (VLLMs) demonstrate strong video understanding but suffer from inefficiency due to redundant visual tokens. Existing pruning methods primarily target intra-frame spatial redundancy or prune inside the LLM with shallow-layer overhead, yielding suboptimal spatiotemporal reduction and underutilizing long-context compressibility. They also often discard subtle yet informative context from merged or pruned tokens. In this paper, we propose a new perspective that builds token Anchors within and across frames to comprehensively aggregate informative context via local-global Optimal Transport (AOT). Specifically, we first establish local- and global-aware token anchors within each frame under attention guidance; optimal transport then aggregates the informative context from pruned tokens into these intra-frame token anchors. Then, building on temporal frame clips, the first frame within each clip is treated as the keyframe anchors, which ensemble similar information from consecutive frames through optimal transport while keeping distinct tokens to represent temporal dynamics, leading to efficient token reduction in a training-free manner. Extensive evaluations show that AOT achieves competitive performance across various short- and long-video benchmarks on leading video LLMs, with substantial computational efficiency gains while preserving temporal and visual fidelity. Project webpage: https://tyroneli.github.io/AOT.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

AOT gives a clean training-free token reduction for video LLMs by building local-global anchors and using optimal transport to fold in pruned tokens, but the fidelity claim rests only on downstream scores.

The main thing to know is that this paper introduces AOT, which sets up attention-guided anchors inside each frame, uses optimal transport to pull context from the tokens that get dropped, then repeats a similar OT step across frames inside short clips with the first frame as keyframe. The goal is spatiotemporal reduction without any retraining, and the abstract claims this keeps competitive accuracy on both short and long video benchmarks while cutting compute. That framing is useful because most prior pruning either stays inside one frame or adds overhead inside the LLM layers. The training-free part and the explicit split between local anchors and clip-level temporal handling are the clearest additions.

The paper does a reasonable job laying out why existing methods leave subtle context on the table and why OT might be a better aggregator than simple averaging or dropping. The math itself looks like standard optimal transport applied in two stages, with no obvious circularity or free parameters that need fitting.

On the soft spots, the available text supplies no numbers, no ablation tables, and no separate check on what information actually survives the transport step. Downstream benchmark wins alone do not confirm that fine-grained temporal or visual distinctions are retained, especially on longer videos where clip-level keyframe assumptions could quietly drop dynamics. That matches the stress-test concern about untested information retention.

The citation pattern is standard and points to the right prior pruning papers. This work is aimed at people who already run video LLMs and need a drop-in efficiency trick rather than a new architecture. A reader who cares about practical token budgets would get a usable idea from it. It deserves a serious referee so the full experiments and any direct fidelity checks can be examined.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces AOT, a training-free token-reduction method for Video LLMs that constructs intra-frame anchors via attention-guided local-global optimal transport and inter-frame anchors by treating the first frame of each clip as a keyframe and transporting similar information from subsequent frames while retaining distinct tokens for dynamics. It claims competitive accuracy on short- and long-video benchmarks together with substantial efficiency gains while preserving temporal and visual fidelity.

Significance. If the central claim holds, the work would supply a practical, training-free route to compress visual tokens in VLLMs without retraining, directly addressing the quadratic cost of long video contexts and enabling wider deployment of existing models.

major comments (2)

[Method (inter-frame OT) and Experiments] The load-bearing premise that local-global OT aggregation retains subtle context without meaningful loss is stated in the abstract and method description but is supported only by downstream benchmark accuracy; no independent quantification of information retention (embedding reconstruction error, attention-map fidelity, or per-token entropy before/after reduction) is supplied, especially for the inter-frame keyframe step on long videos.
[Abstract and Experiments] The abstract asserts 'competitive performances' and 'substantial computational efficiency' yet the provided text contains no numerical results, ablation tables, or direct comparisons against prior pruning baselines, rendering the efficiency-fidelity trade-off impossible to assess from the manuscript.

minor comments (1)

[Method] Notation for the transport plans and anchor definitions is introduced without an explicit equation or algorithm box, making the precise formulation of the local-global OT steps difficult to follow.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We agree that stronger direct evidence for information retention and explicit numerical results would improve the manuscript. We address each major comment below and will revise accordingly.

Referee: [Method (inter-frame OT) and Experiments] The load-bearing premise that local-global OT aggregation retains subtle context without meaningful loss is stated in the abstract and method description but is supported only by downstream benchmark accuracy; no independent quantification of information retention (embedding reconstruction error, attention-map fidelity, or per-token entropy before/after reduction) is supplied, especially for the inter-frame keyframe step on long videos.

Authors: We acknowledge that downstream accuracy alone provides only indirect support for the claim of retained subtle context. In the revised manuscript we will add direct quantification: embedding reconstruction error (L2 distance between original and aggregated token embeddings), cosine similarity of attention maps before/after reduction, and per-token entropy comparisons. These metrics will be reported specifically for the inter-frame keyframe OT step on long-video sequences from the ActivityNet and Ego4D benchmarks to address the concern. revision: yes
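
The three retention metrics named in this response could be computed roughly as follows. This is a hedged sketch, not code from the manuscript: the function names are hypothetical, the reconstruction assumes a non-negative token-to-anchor weight matrix such as an OT plan is available, the attention maps are assumed to have been brought to a common shape before comparison, and "per-token entropy" is read here as the Shannon entropy of each token's attention distribution.

```python
import numpy as np


def reconstruction_error(original, anchors, plan):
    """Mean L2 distance between each original token and its reconstruction.

    plan: (n, k) non-negative weights linking n original tokens to k anchors;
    rows are normalized to sum to 1 before reconstructing.
    """
    weights = plan / plan.sum(axis=1, keepdims=True)
    recon = weights @ anchors
    return float(np.linalg.norm(original - recon, axis=1).mean())


def attention_cosine(attn_before, attn_after):
    """Cosine similarity between two flattened attention maps of equal shape."""
    x, y = attn_before.ravel(), attn_after.ravel()
    return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))


def token_entropy(probs, eps=1e-12):
    """Shannon entropy in nats of each row of an (n, m) row-stochastic matrix."""
    p = np.clip(probs, eps, 1.0)     # avoid log(0) for exactly-zero entries
    return -(p * np.log(p)).sum(axis=1)
```

Reporting these before and after reduction, per clip, would give the direct evidence the referee asks for independently of downstream accuracy.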
Referee: [Abstract and Experiments] The abstract asserts 'competitive performances' and 'substantial computational efficiency' yet the provided text contains no numerical results, ablation tables, or direct comparisons against prior pruning baselines, rendering the efficiency-fidelity trade-off impossible to assess from the manuscript.

Authors: The full manuscript contains tables reporting accuracy on short- and long-video benchmarks together with FLOPs and latency reductions versus prior pruning methods. To make these results immediately visible, we will revise the abstract to include key numerical highlights (e.g., accuracy deltas and efficiency gains) and ensure all ablation tables and baseline comparisons appear in the main body with clear captions. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation applies standard OT to new anchors without self-referential reduction

The paper presents a training-free token reduction method that first selects attention-guided anchors within frames and then applies optimal transport to aggregate pruned tokens locally, followed by inter-frame keyframe OT on clips. All steps invoke the standard OT formulation (transport plans between anchor and pruned token distributions) without fitting any parameters to the target benchmark data and then relabeling those fits as predictions. No self-citations are used to justify uniqueness or to smuggle in an ansatz; the construction is self-contained and externally falsifiable via the reported benchmark scores. The central efficiency-plus-fidelity claim therefore does not collapse to a tautology or to a fitted-input-called-prediction pattern.
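
For reference, the standard entropic optimal transport problem and its Sinkhorn solution can be written as below. The symbols are generic assumptions (cost matrix $C$, source and target masses $a$ and $b$, regularization strength $\varepsilon$, token embeddings $x_i$); the text available here does not give the paper's specific cost, marginals, or aggregation rule, so the last line is only one plausible way to read "aggregating pruned tokens into anchors".

```latex
\begin{aligned}
T^\star &= \arg\min_{T \in U(a,b)} \; \langle T, C \rangle - \varepsilon H(T),
\qquad
U(a,b) = \bigl\{ T \in \mathbb{R}_{\ge 0}^{n \times m} : T\mathbf{1}_m = a,\; T^\top \mathbf{1}_n = b \bigr\}, \\
H(T) &= -\sum_{i,j} T_{ij}\,(\log T_{ij} - 1), \\
K &= \exp(-C/\varepsilon), \qquad
u \leftarrow a \oslash (K v), \qquad
v \leftarrow b \oslash (K^\top u), \qquad
T^\star = \operatorname{diag}(u)\, K \,\operatorname{diag}(v), \\
\tilde{x}_j &= \frac{\sum_{i} T^\star_{ij}\, x_i}{\sum_{i} T^\star_{ij}}
\quad \text{(barycentric context gathered by anchor } j \text{ from pruned tokens } i\text{)}.
\end{aligned}
```

Here $\oslash$ denotes elementwise division, and the two scaling updates are alternated until the marginals of $\operatorname{diag}(u) K \operatorname{diag}(v)$ match $a$ and $b$ to the desired tolerance.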

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract supplies no equations, parameters, or assumptions; the approach implicitly relies on standard optimal transport properties and attention mechanisms from prior VLLM literature.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

PoInit-of-View: Poisoning Initialization of Views Transfers Across Multiple 3D Reconstruction Systems
cs.CV 2026-04 unverdicted novelty 7.0

PoInit-of-View poisons SfM initialization by optimizing cross-view gradient inconsistencies to disrupt keypoint detection and feature matching, yielding transferable degradation in rendered 3D reconstruction quality a...

OTT-Vid: Optimal Transport Temporal Token Compression for Video Large Language Models
cs.CV 2026-05 unverdicted novelty 6.0

OTT-Vid uses optimal transport with non-uniform token mass and locality-aware costs to dynamically allocate compression budgets across video frames, retaining 95.8% VQA and 73.9% VTG performance at 10% token retention.

Reference graph

Works this paper leans on

84 extracted references · 84 canonical work pages · cited by 2 Pith papers · 18 internal anchors

[1]

GPT-4 Technical Report

Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ah- mad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report.arXiv preprint arXiv:2303.08774,

work page internal anchor Pith review Pith/arXiv arXiv
[2]

Flamingo: a visual language model for few-shot learning

Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning. InNeurIPS, pages 23716–23736, 2022. 2

work page 2022
[3]

LLaVA-OneVision-1.5: Fully Open Framework for Democratized Multimodal Training

Xiang An, Yin Xie, Kaicheng Yang, Wenkang Zhang, Xiuwei Zhao, Zheng Cheng, Yirui Wang, Songcen Xu, Changrui Chen, Chunsheng Wu, et al. Llava-onevision-1.5: Fully open framework for democratized multimodal training. arXiv preprint arXiv:2509.23661, 2025. 2

work page internal anchor Pith review Pith/arXiv arXiv 2025
[4]

Hired: Attention-guided token dropping for efficient inference of high-resolution vision-language models

Kazi Hasan Ibn Arif, JinYi Yoon, Dimitrios S Nikolopou- los, Hans Vandierendonck, Deepu John, and Bo Ji. Hired: Attention-guided token dropping for efficient inference of high-resolution vision-language models. InAAAI, pages 1773–1781, 2025. 3

work page 2025
[5]

Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2. 5-vl technical report.arXiv preprint arXiv:2502.13923, 2025. 1, 2, 4

work page internal anchor Pith review Pith/arXiv arXiv 2025
[6]

Token Merging: Your ViT But Faster

Daniel Bolya, Cheng-Yang Fu, Xiaoliang Dai, Peizhao Zhang, Christoph Feichtenhofer, and Judy Hoffman. To- ken merging: Your vit but faster.arXiv preprint arXiv:2210.09461, 2022. 1, 2, 3

work page internal anchor Pith review Pith/arXiv arXiv 2022
[7]

Auroracap: Efficient, performant video detailed captioning and a new benchmark

Wenhao Chai, Enxin Song, Yilun Du, Chenlin Meng, Vashisht Madhavan, Omer Bar-Tal, Jenq-Neng Hwang, Sain- ing Xie, and Christopher D Manning. Auroracap: Efficient, performant video detailed captioning and a new benchmark. arXiv preprint arXiv:2410.03051, 2024. 1

work page arXiv 2024
[8]

Sharegpt4video: Improving video understand- ing and generation with better captions

Lin Chen, Xilin Wei, Jinsong Li, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Zehui Chen, Haodong Duan, Zhenyu Tang, Li Yuan, et al. Sharegpt4video: Improving video understand- ing and generation with better captions. InNeurIPS, pages 19472–19495, 2024. 1, 2

work page 2024
[9]

An image is worth 1/2 tokens after layer 2: Plug-and-play inference acceleration for large vision-language models

Liang Chen, Haozhe Zhao, Tianyu Liu, Shuai Bai, Junyang Lin, Chang Zhou, and Baobao Chang. An image is worth 1/2 tokens after layer 2: Plug-and-play inference acceleration for large vision-language models. InECCV, pages 19–35. Springer, 2024. 3, 6, 7, 12

work page 2024
[10]

How far are we to gpt-4v? closing the gap to commercial multimodal models with open-source suites.Science China Information Sciences, 67(12):220101,

Zhe Chen, Weiyun Wang, Hao Tian, Shenglong Ye, Zhang- wei Gao, Erfei Cui, Wenwen Tong, Kongzhi Hu, Jiapeng Luo, Zheng Ma, et al. How far are we to gpt-4v? closing the gap to commercial multimodal models with open-source suites.Science China Information Sciences, 67(12):220101,

work page
[11]

VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs

Zesen Cheng, Sicong Leng, Hang Zhang, Yifei Xin, Xin Li, Guanzheng Chen, Yongxin Zhu, Wenqi Zhang, Ziyang Luo, Deli Zhao, et al. Videollama 2: Advancing spatial- temporal modeling and audio understanding in video-llms. arXiv preprint arXiv:2406.07476, 2024. 1, 2

work page internal anchor Pith review Pith/arXiv arXiv 2024
[12]

Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality.See https://vicuna

Wei-Lin Chiang, Zhuohan Li, Ziqing Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E Gonzalez, et al. Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality.See https://vicuna. lmsys. org (accessed 14 April 2023), 2(3):6, 2023. 2

work page 2023
[13]

Sinkhorn distances: Lightspeed computation of optimal transport

Marco Cuturi. Sinkhorn distances: Lightspeed computation of optimal transport. InNeurIPS, 2013. 2, 3, 5, 9, 10

work page 2013
[14]

An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

Alexey Dosovitskiy. An image is worth 16x16 words: Transformers for image recognition at scale.arXiv preprint arXiv:2010.11929, 2020. 3

work page internal anchor Pith review Pith/arXiv arXiv 2010
[15]

Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis

Chaoyou Fu, Yuhan Dai, Yongdong Luo, Lei Li, Shuhuai Ren, Renrui Zhang, Zihan Wang, Chenyu Zhou, Yunhang Shen, Mengdan Zhang, et al. Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis. InCVPR, pages 24108–24118, 2025. 2, 5

work page 2025
[16]

Framefusion: Combining similarity and importance for video token reduction on large visual language models.arXiv preprint arXiv:2501.01986, 2024

Tianyu Fu, Tengxuan Liu, Qinghao Han, Guohao Dai, Shen- gen Yan, Huazhong Yang, Xuefei Ning, and Yu Wang. Framefusion: Combining similarity and importance for video token reduction on large visual language models.arXiv preprint arXiv:2501.01986, 2024. 2, 3

work page arXiv 2024
[17]

Prunevid: Visual to- ken pruning for efficient video large language models.arXiv preprint arXiv:2412.16117, 2024

Xiaohu Huang, Hao Zhou, and Kai Han. Prunevid: Visual to- ken pruning for efficient video large language models.arXiv preprint arXiv:2412.16117, 2024. 2, 3, 6, 7, 12

work page arXiv 2024
[18]

Chat-univi: Unified visual representation em- powers large language models with image and video under- standing

Peng Jin, Ryuichi Takanobu, Wancai Zhang, Xiaochun Cao, and Li Yuan. Chat-univi: Unified visual representation em- powers large language models with image and video under- standing. InCVPR, pages 13700–13710, 2024. 1, 3

work page 2024
[19]

Sparsevila: Decoupling visual sparsity for efficient vlm inference

Samir Khaki, Junxian Guo, Jiaming Tang, Shang Yang, Yukang Chen, Konstantinos N Plataniotis, Yao Lu, Song Han, and Zhijian Liu. Sparsevila: Decoupling visual sparsity for efficient vlm inference. InICCV, pages 23784–23794,

work page
[20]

Token reduction should go beyond effi- ciency in generative models–from vision, language to mul- timodality.arXiv preprint arXiv:2505.18227, 2025

Zhenglun Kong, Yize Li, Fanhu Zeng, Lei Xin, Shvat Mes- sica, Xue Lin, Pu Zhao, Manolis Kellis, Hao Tang, and Marinka Zitnik. Token reduction should go beyond effi- ciency in generative models–from vision, language to mul- timodality.arXiv preprint arXiv:2505.18227, 2025. 3

work page arXiv 2025
[21]

Lmms-eval: Accelerating the develop- ment of large multimoal models, 2024

Bo Li, Peiyuan Zhang, Kaichen Zhang, Fanyi Pu, Xinrun Du, Yuhao Dong, Haotian Liu, Yuanhan Zhang, Ge Zhang, Chunyuan Li, et al. Lmms-eval: Accelerating the develop- ment of large multimoal models, 2024. 6

work page 2024
[22]

LLaVA-OneVision: Easy Visual Task Transfer

Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Peiyuan Zhang, Yanwei Li, Zi- wei Liu, et al. Llava-onevision: Easy visual task transfer. arXiv preprint arXiv:2408.03326, 2024. 2, 4, 6, 11, 12

work page internal anchor Pith review Pith/arXiv arXiv 2024
[23]

Expansion and shrinkage of localization for weakly- supervised semantic segmentation.NeurIPS, 35:16037– 16051, 2022

Jinlong Li, Zequn Jie, Xu Wang, Xiaolin Wei, and Lin Ma. Expansion and shrinkage of localization for weakly- supervised semantic segmentation.NeurIPS, 35:16037– 16051, 2022. 13

work page 2022
[24]

Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models

Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. InICML, pages 19730–19742. PMLR, 2023. 2

work page 2023
[25]

Cross-modal and uncertainty-aware agglomeration for open- vocabulary 3d scene understanding

Jinlong Li, Cristiano Saltori, Fabio Poiesi, and Nicu Sebe. Cross-modal and uncertainty-aware agglomeration for open- vocabulary 3d scene understanding. InCVPR, pages 19390– 19400, 2025. 13

work page 2025
[26]

Orthogonal projection subspace to aggregate online prior-knowledge for continual test-time adaptation

Jinlong Li, Dong Zhao, Qi Zang, Zequn Jie, Lin Ma, and Nicu Sebe. Orthogonal projection subspace to aggregate online prior-knowledge for continual test-time adaptation. arXiv preprint arXiv:2506.19022, 2025. 13

work page arXiv 2025
[27]

VideoChat: Chat-Centric Video Understanding

KunChang Li, Yinan He, Yi Wang, Yizhuo Li, Wenhai Wang, Ping Luo, Yali Wang, Limin Wang, and Yu Qiao. Videochat: Chat-centric video understanding.arXiv preprint arXiv:2305.06355, 2023. 1, 2

work page internal anchor Pith review Pith/arXiv arXiv 2023
[28]

Mvbench: A comprehensive multi-modal video understand- ing benchmark

Kunchang Li, Yali Wang, Yinan He, Yizhuo Li, Yi Wang, Yi Liu, Zun Wang, Jilan Xu, Guo Chen, Ping Luo, et al. Mvbench: A comprehensive multi-modal video understand- ing benchmark. InCVPR, pages 22195–22206, 2024. 1, 2, 5

work page 2024
[29]

Llama-vid: An image is worth 2 tokens in large language models

Yanwei Li, Chengyao Wang, and Jiaya Jia. Llama-vid: An image is worth 2 tokens in large language models. InECCV, pages 323–340. Springer, 2024. 1, 3

work page 2024
[30]

Video-LLaVA: Learning United Visual Representation by Alignment Before Projection

Bin Lin, Yang Ye, Bin Zhu, Jiaxi Cui, Munan Ning, Peng Jin, and Li Yuan. Video-llava: Learning united visual rep- resentation by alignment before projection.arXiv preprint arXiv:2311.10122, 2023. 1, 2

work page internal anchor Pith review Pith/arXiv arXiv 2023
[31]

Vila: On pre-training for visual language models

Ji Lin, Hongxu Yin, Wei Ping, Pavlo Molchanov, Moham- mad Shoeybi, and Song Han. Vila: On pre-training for visual language models. InCVPR, pages 26689–26699, 2024. 3

work page 2024
[32]

Improved baselines with visual instruction tuning

Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. InCVPR, pages 26296–26306, 2024. 2

work page 2024
[33]

Llavanext: Improved reasoning, ocr, and world knowledge, 2024

Haotian Liu, Chunyuan Li, Yuheng Li, Bo Li, Yuanhan Zhang, Sheng Shen, and Yong Jae Lee. Llavanext: Improved reasoning, ocr, and world knowledge, 2024. 2

work page 2024
[34]

Multi-stage vision token dropping: Towards efficient multimodal large language model.arXiv preprint arXiv:2411.10803, 2024

Ting Liu, Liangtao Shi, Richang Hong, Yue Hu, Quanjun Yin, and Linfeng Zhang. Multi-stage vision token dropping: Towards efficient multimodal large language model.arXiv preprint arXiv:2411.10803, 2024. 3

work page arXiv 2024
[35]

Less: Label-efficient and single-stage referring 3d instance segmentation

Xuexun Liu, Xu Xiaoxu, Jinlong Li, Qiudan Zhang, Xu Wang, Nicu Sebe, Ma Lin, et al. Less: Label-efficient and single-stage referring 3d instance segmentation. InNeurIPS. NeurIPS, 2024. 13

work page 2024
[36]

Hybrid-level instruction injection for video token com- pression in multi-modal large language models

Zhihang Liu, Chen-Wei Xie, Pandeng Li, Liming Zhao, Longxiang Tang, Yun Zheng, Chuanbin Liu, and Hongtao Xie. Hybrid-level instruction injection for video token com- pression in multi-modal large language models. InCVPR, pages 8568–8578, 2025. 2

work page 2025
[37]

Nvila: Efficient frontier visual language models

Zhijian Liu, Ligeng Zhu, Baifeng Shi, Zhuoyang Zhang, Yuming Lou, Shang Yang, Haocheng Xi, Shiyi Cao, Yux- ian Gu, Dacheng Li, et al. Nvila: Efficient frontier visual language models. InCVPR, pages 4122–4134, 2025. 3

work page 2025
[38]

Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models

Muhammad Maaz, Hanoona Rasheed, Salman Khan, and Fa- had Shahbaz Khan. Video-chatgpt: Towards detailed video understanding via large vision and language models.arXiv preprint arXiv:2306.05424, 2023. 1, 3

work page internal anchor Pith review Pith/arXiv arXiv 2023
[39]

Egoschema: A diagnostic benchmark for very long- form video language understanding

Karttikeya Mangalam, Raiymbek Akshulakov, and Jitendra Malik. Egoschema: A diagnostic benchmark for very long- form video language understanding. InNeurIPS, pages 46212–46244, 2023. 2, 5

work page 2023
[40]

Perla: Perceptive 3d language assistant

Guofeng Mei, Wei Lin, Luigi Riz, Yujiao Wu, Fabio Poiesi, and Yiming Wang. Perla: Perceptive 3d language assistant. InCVPR, pages 14369–14379, 2025. 13

work page 2025
[41]

M ´emoire sur la th ´eorie des d ´eblais et des remblais.Mem

Gaspard Monge. M ´emoire sur la th ´eorie des d ´eblais et des remblais.Mem. Math. Phys. Acad. Royale Sci., pages 666– 704, 1781. 9

work page
[42]

T2td: Text-3d generation model based on prior knowledge guidance.IEEE Transactions on Pattern Analysis and Machine Intelligence, 47(1):172–189, 2024

Weizhi Nie, Ruidong Chen, Weijie Wang, Bruno Lepri, and Nicu Sebe. T2td: Text-3d generation model based on prior knowledge guidance.IEEE Transactions on Pattern Analysis and Machine Intelligence, 47(1):172–189, 2024. 13

work page 2024
[43]

Learn- ing transferable visual models from natural language super- vision

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learn- ing transferable visual models from natural language super- vision. InICML, pages 8748–8763. PmLR, 2021. 1, 3, 11

work page 2021
[44]

Llava-prumerge: Adaptive token reduction for efficient large multimodal models

Yuzhang Shang, Mu Cai, Bingxin Xu, Yong Jae Lee, and Yan Yan. Llava-prumerge: Adaptive token reduction for efficient large multimodal models. InICCV, pages 22857–22867,

work page
[45]

Holitom: Holistic token merging for fast video large language models.arXiv preprint arXiv:2505.21334,

Kele Shao, Keda Tao, Can Qin, Haoxuan You, Yang Sui, and Huan Wang. Holitom: Holistic token merging for fast video large language models.arXiv preprint arXiv:2505.21334,

work page arXiv
[46]

Tempme: Video temporal token merging for efficient text- video retrieval.arXiv preprint arXiv:2409.01156, 2024

Leqi Shen, Tianxiang Hao, Tao He, Sicheng Zhao, Yifeng Zhang, Pengzhang Liu, Yongjun Bao, and Guiguang Ding. Tempme: Video temporal token merging for efficient text- video retrieval.arXiv preprint arXiv:2409.01156, 2024. 3

work page arXiv 2024
[47]

Fastvid: Dynamic density pruning for fast video large language mod- els.arXiv preprint arXiv:2503.11187, 2025

Leqi Shen, Guoqiang Gong, Tao He, Yifeng Zhang, Pengzhang Liu, Sicheng Zhao, and Guiguang Ding. Fastvid: Dynamic density pruning for fast video large language mod- els.arXiv preprint arXiv:2503.11187, 2025. 3, 6, 11, 12

work page arXiv 2025
[48]

Longvu: Spatiotemporal adaptive compression for long video-language understanding. arXiv preprint arXiv:2410.17434, 2024

Xiaoqian Shen, Yunyang Xiong, Changsheng Zhao, Lemeng Wu, Jun Chen, Chenchen Zhu, Zechun Liu, Fanyi Xiao, Balakrishnan Varadarajan, Florian Bordes, et al. Longvu: Spatiotemporal adaptive compression for long video-language understanding. arXiv preprint arXiv:2410.17434, 2024. 1, 2, 3

[49]

Moviechat: From dense token to sparse memory for long video understanding

Enxin Song, Wenhao Chai, Guanhong Wang, Yucheng Zhang, Haoyang Zhou, Feiyang Wu, Haozhe Chi, Xun Guo, Tian Ye, Yanting Zhang, et al. Moviechat: From dense token to sparse memory for long video understanding. In CVPR, pages 18221–18232, 2024. 1

[50]

Tokencarve: Information-preserving visual token compression in multimodal large language models. arXiv preprint arXiv:2503.10501, 2025

Xudong Tan, Peng Ye, Chongjun Tu, Jianjian Cao, Yaoxin Yang, Lin Zhang, Dongzhan Zhou, and Tao Chen. Tokencarve: Information-preserving visual token compression in multimodal large language models. arXiv preprint arXiv:2503.10501, 2025. 3

[51]

Dycoke: Dynamic compression of tokens for fast video large language models

Keda Tao, Can Qin, Haoxuan You, Yang Sui, and Huan Wang. Dycoke: Dynamic compression of tokens for fast video large language models. In CVPR, pages 18992–19001,

[52]

Stanford alpaca: An instruction-following llama model, 2023

Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B Hashimoto. Stanford alpaca: An instruction-following llama model, 2023. 2

[53]

Gemini: A Family of Highly Capable Multimodal Models

Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, Katie Millican, et al. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023. 2

[54]

Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context

Gemini Team, Petko Georgiev, Ving Ian Lei, Ryan Burnell, Libin Bai, Anmol Gulati, Garrett Tanzer, Damien Vincent, Zhufeng Pan, Shibo Wang, et al. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. arXiv preprint arXiv:2403.05530, 2024. 1, 3

[55]

Introduction to optimal transport. Notes of Course at University of Cambridge, 3, 2018

Matthew Thorpe. Introduction to optimal transport. Notes of Course at University of Cambridge, 3, 2018. 9

[56]

Optimal transport: old and new. Springer, 2008

Cédric Villani et al. Optimal transport: old and new. Springer, 2008. 2

[57]

Ross3d: Reconstructive visual instruction tuning with 3d-awareness

Haochen Wang, Yucheng Zhao, Tiancai Wang, Haoqiang Fan, Xiangyu Zhang, and Zhaoxiang Zhang. Ross3d: Reconstructive visual instruction tuning with 3d-awareness. In CVPR, pages 9275–9286, 2025. 13

[58]

Chatvideo: A tracklet-centric multimodal and versatile video understanding system. arXiv preprint arXiv:2304.14407, 2023

Junke Wang, Dongdong Chen, Chong Luo, Xiyang Dai, Lu Yuan, Zuxuan Wu, and Yu-Gang Jiang. Chatvideo: A tracklet-centric multimodal and versatile video understanding system. arXiv preprint arXiv:2304.14407, 2023. 1, 2

[59]

Uvmap-id: A controllable and personalized uv map generative model

Weijie Wang, Jichao Zhang, Chang Liu, Xia Li, Xingqian Xu, Humphrey Shi, Nicu Sebe, and Bruno Lepri. Uvmap-id: A controllable and personalized uv map generative model. In ACM MM, pages 10725–10734, 2024. 13

[60]

Stop looking for important tokens in multimodal language models: Duplication matters more. arXiv preprint arXiv:2502.11494, 2025

Zichen Wen, Yifeng Gao, Shaobo Wang, Junyuan Zhang, Qintong Zhang, Weijia Li, Conghui He, and Linfeng Zhang. Stop looking for important tokens in multimodal language models: Duplication matters more. arXiv preprint arXiv:2502.11494, 2025. 3

[61]

Longvlm: Efficient long video understanding via large language models

Yuetian Weng, Mingfei Han, Haoyu He, Xiaojun Chang, and Bohan Zhuang. Longvlm: Efficient long video understanding via large language models. In ECCV, pages 453–470. Springer, 2024. 1, 2

[62]

Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models

Chenfei Wu, Shengming Yin, Weizhen Qi, Xiaodong Wang, Zecheng Tang, and Nan Duan. Visual chatgpt: Talking, drawing and editing with visual foundation models. arXiv preprint arXiv:2303.04671, 2023. 1, 2

[63]

Longvideobench: A benchmark for long-context interleaved video-language understanding

Haoning Wu, Dongxu Li, Bei Chen, and Junnan Li. Longvideobench: A benchmark for long-context interleaved video-language understanding. In NeurIPS, pages 28828–28857, 2024. 2, 5

[64]

PyramidDrop: Accelerating Your Large Vision-Language Models via Pyramid Visual Redundancy Reduction

Long Xing, Qidong Huang, Xiaoyi Dong, Jiajie Lu, Pan Zhang, Yuhang Zang, Yuhang Cao, Conghui He, Jiaqi Wang, Feng Wu, et al. Pyramiddrop: Accelerating your large vision-language models via pyramid visual redundancy reduction. arXiv preprint arXiv:2410.17247, 2024. 1, 3, 6, 7, 12

[65]

Conical visual concentration for efficient large vision-language models

Long Xing, Qidong Huang, Xiaoyi Dong, Jiajie Lu, Pan Zhang, Yuhang Zang, Yuhang Cao, Conghui He, Jiaqi Wang, Feng Wu, et al. Conical visual concentration for efficient large vision-language models. In CVPR, pages 14593–14603, 2025. 3

[66]

Pllava: Parameter-free llava extension from images to videos for video dense captioning. arXiv preprint arXiv:2404.16994, 2024

Lin Xu, Yilin Zhao, Daquan Zhou, Zhijie Lin, See Kiong Ng, and Jiashi Feng. Pllava: Parameter-free llava extension from images to videos for video dense captioning. arXiv preprint arXiv:2404.16994, 2024. 1, 3

[67]

Topv: Compatible token pruning with inference time optimization for fast and low-memory multimodal vision language model

Cheng Yang, Yang Sui, Jinqi Xiao, Lingyi Huang, Yu Gong, Chendi Li, Jinghua Yan, Yu Bai, Ponnuswamy Sadayappan, Xia Hu, et al. Topv: Compatible token pruning with inference time optimization for fast and low-memory multimodal vision language model. In CVPR, pages 19803–19813, 2025. 3

[68]

Visionzip: Longer is better but not necessary in vision language models

Senqiao Yang, Yukang Chen, Zhuotao Tian, Chengyao Wang, Jingyao Li, Bei Yu, and Jiaya Jia. Visionzip: Longer is better but not necessary in vision language models. In CVPR, pages 19792–19802, 2025. 1, 2, 3, 4, 6, 7, 12

[69]

Atp-llava: Adaptive token pruning for large vision language models

Xubing Ye, Yukang Gan, Yixiao Ge, Xiao-Ping Zhang, and Yansong Tang. Atp-llava: Adaptive token pruning for large vision language models. In CVPR, pages 24972–24982,

[70]

Video question answering with prior knowledge and object-sensitive learning. IEEE Transactions on Image Processing, 31:5936–5948, 2022

Pengpeng Zeng, Haonan Zhang, Lianli Gao, Jingkuan Song, and Heng Tao Shen. Video question answering with prior knowledge and object-sensitive learning. IEEE Transactions on Image Processing, 31:5936–5948, 2022. 2

[71]

Sigmoid loss for language image pre-training

Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. Sigmoid loss for language image pre-training. In ICCV, pages 11975–11986, 2023. 1, 3, 4, 11

[72]

Vscan: Rethinking visual token reduction for efficient large vision-language models. arXiv preprint arXiv:2505.22654, 2025

Ce Zhang, Kaixin Ma, Tianqing Fang, Wenhao Yu, Hongming Zhang, Zhisong Zhang, Yaqi Xie, Katia Sycara, Haitao Mi, and Dong Yu. Vscan: Rethinking visual token reduction for efficient large vision-language models. arXiv preprint arXiv:2505.22654, 2025. 3, 4

[73]

Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding

Hang Zhang, Xin Li, and Lidong Bing. Video-llama: An instruction-tuned audio-visual language model for video understanding. arXiv preprint arXiv:2306.02858, 2023. 1, 2

[74]

Omnicharacter: Towards immersive role-playing agents with seamless speech-language personality interaction

Haonan Zhang, Run Luo, Xiong Liu, Yuchuan Wu, Ting-En Lin, Pengpeng Zeng, Qiang Qu, Feiteng Fang, Min Yang, Lianli Gao, et al. Omnicharacter: Towards immersive role-playing agents with seamless speech-language personality interaction. In ACL (Volume 1: Long Papers), pages 26318–26331, 2025. 2

[75]

Text-video retrieval with global-local semantic consistent learning. IEEE Transactions on Image Processing, 2025

Haonan Zhang, Pengpeng Zeng, Lianli Gao, Jingkuan Song, Yihang Duan, Xinyu Lyu, and Heng Tao Shen. Text-video retrieval with global-local semantic consistent learning. IEEE Transactions on Image Processing, 2025. 2

[76]

Lmms-eval: Reality check on the evaluation of large multimodal models

Kaichen Zhang, Bo Li, Peiyuan Zhang, Fanyi Pu, Joshua Adrian Cahyono, Kairui Hu, Shuai Liu, Yuanhan Zhang, Jingkang Yang, Chunyuan Li, et al. Lmms-eval: Reality check on the evaluation of large multimodal models. In NAACL 2025, pages 881–916, 2025. 6

[77]

[cls] attention is all you need for training-free visual token pruning: Make vlm inference faster. arXiv e-prints, pages arXiv–2412, 2024

Qizhe Zhang, Aosong Cheng, Ming Lu, Zhiyong Zhuo, Minqi Wang, Jiajun Cao, Shaobo Guo, Qi She, and Shanghang Zhang. [cls] attention is all you need for training-free visual token pruning: Make vlm inference faster. arXiv e-prints, pages arXiv–2412, 2024. 3, 4

[78]

SparseVLM: Visual Token Sparsification for Efficient Vision-Language Model Inference

Yuan Zhang, Chun-Kai Fan, Junpeng Ma, Wenzhao Zheng, Tao Huang, Kuan Cheng, Denis Gudovskiy, Tomoyuki Okuno, Yohei Nakata, Kurt Keutzer, et al. Sparsevlm: Visual token sparsification for efficient vision-language model inference. arXiv preprint arXiv:2410.04417, 2024. 1, 3

[79]

Llava-next: A strong zero-shot video understanding model, 2024

Yuanhan Zhang, Bo Li, Haotian Liu, Yong Jae Lee, Liangke Gui, Di Fu, Jiashi Feng, Ziwei Liu, and Chunyuan Li. Llava-next: A strong zero-shot video understanding model, 2024. 2, 4

[80]

LLaVA-Video: Video Instruction Tuning With Synthetic Data

Yuanhan Zhang, Jinming Wu, Wei Li, Bo Li, Zejun Ma, Ziwei Liu, and Chunyuan Li. Video instruction tuning with synthetic data. arXiv preprint arXiv:2410.02713, 2024. 2, 6, 11, 12
