Token Reduction via Local and Global Contexts Optimization for Efficient Video Large Language Models
Pith reviewed 2026-05-15 18:22 UTC · model grok-4.3
The pith
Token anchors via local-global optimal transport reduce visual tokens in video LLMs while maintaining competitive accuracy.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We establish local- and global-aware token anchors within each frame under attention guidance; optimal transport then aggregates the informative context of the pruned tokens into these anchors to construct intra-frame anchors. Within each temporal clip, the first frame serves as the keyframe anchor, absorbing similar information from the consecutive frames through optimal transport, while distinct tokens are kept to represent temporal dynamics. The result is training-free token reduction with competitive performance across short- and long-video benchmarks on leading video LLMs, while preserving temporal and visual fidelity.
What carries the argument
Local- and global-aware token anchors that aggregate pruned-token context via optimal transport (AOT)
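The intra-frame step can be pictured with a minimal sketch. It is not the authors' implementation: the function names, the top-k anchor selection, the cosine-distance cost, the uniform masses, the regularisation strength `eps`, and the 50/50 blend of anchor and transported context are all assumptions chosen for illustration. It uses plain Sinkhorn iterations as the optimal-transport solver.

```python
import numpy as np


def sinkhorn_plan(cost, a, b, eps=0.1, n_iters=200):
    """Entropic-OT transport plan between source masses a and target masses b."""
    K = np.exp(-cost / eps)              # Gibbs kernel
    u = np.ones_like(a)
    v = np.ones_like(b)
    for _ in range(n_iters):
        u = a / (K @ v)                  # rescale rows towards marginal a
        v = b / (K.T @ u)                # rescale columns to match marginal b
    return u[:, None] * K * v[None, :]


def aggregate_into_anchors(tokens, attn_scores, num_anchors, eps=0.1):
    """Keep top-attention tokens as anchors and fold pruned tokens into them."""
    order = np.argsort(-attn_scores)
    anchor_idx, pruned_idx = order[:num_anchors], order[num_anchors:]
    anchors, pruned = tokens[anchor_idx], tokens[pruned_idx]

    # Cosine-distance cost between every pruned token and every anchor.
    unit = lambda x: x / np.linalg.norm(x, axis=1, keepdims=True)
    cost = 1.0 - unit(pruned) @ unit(anchors).T

    # Uniform masses on both sides (an assumption of this sketch).
    a = np.full(len(pruned), 1.0 / len(pruned))
    b = np.full(len(anchors), 1.0 / len(anchors))
    plan = sinkhorn_plan(cost, a, b, eps=eps)

    # Column-normalise so each anchor receives a convex mix of pruned tokens.
    weights = plan / plan.sum(axis=0, keepdims=True)
    transported = weights.T @ pruned

    # Hypothetical blend: equal weight on the anchor and its transported context.
    return 0.5 * (anchors + transported), plan
```

Calling `aggregate_into_anchors(tokens, scores, num_anchors=4)` on a `(16, 8)` token matrix returns four merged anchors and the `(12, 4)` transport plan; how the paper actually weights anchors against transported context is not stated in the text above.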
If this is right
- Competitive accuracy on short- and long-video benchmarks for leading video LLMs
- Substantial reduction in computational cost while retaining temporal and visual fidelity
- Training-free operation that works directly on existing video LLMs
- Better handling of both intra-frame spatial redundancy and inter-frame temporal redundancy than prior pruning techniques
- Ability to keep distinct tokens for motion while aggregating repeated information across clips
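The last point, keeping distinct tokens while aggregating repeated ones, can be illustrated with a deliberately simplified sketch. It swaps the paper's optimal transport for a hard nearest-neighbour assignment, and the function name and `sim_threshold` value are hypothetical; it only shows the split between tokens that are merged into the keyframe and tokens that survive to carry motion.

```python
import numpy as np


def split_frame_against_keyframe(key_tokens, frame_tokens, sim_threshold=0.8):
    """Merge keyframe-redundant tokens into the keyframe; keep the distinct ones."""
    unit = lambda x: x / np.linalg.norm(x, axis=1, keepdims=True)
    sim = unit(frame_tokens) @ unit(key_tokens).T    # (n_frame, n_key) cosine sims
    best_sim = sim.max(axis=1)
    best_key = sim.argmax(axis=1)

    redundant = best_sim >= sim_threshold
    kept_tokens = frame_tokens[~redundant]           # distinct tokens carry the dynamics

    # Fold each redundant token into its matched keyframe token (running mean).
    merged = key_tokens.astype(float)                # astype returns a copy
    counts = np.ones(len(key_tokens))
    for tok, k in zip(frame_tokens[redundant], best_key[redundant]):
        merged[k] = (merged[k] * counts[k] + tok) / (counts[k] + 1)
        counts[k] += 1
    return merged, kept_tokens
```

The token budget for a clip is then the keyframe anchors plus whatever `kept_tokens` each later frame contributes, so static scenes shrink sharply and fast-changing scenes shrink less.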
Where Pith is reading between the lines
- The same anchor-and-transport pattern could be tested on image-only LLMs to reduce spatial tokens without retraining
- Longer untrimmed videos might become feasible on fixed hardware budgets if the temporal aggregation scales linearly with clip length
- Replacing attention-guided anchors with learned parameters could be measured to see whether further token savings are possible
- The method invites direct comparison of per-token information retention against simple averaging or clustering baselines on the same datasets
Load-bearing premise
Optimal transport aggregation from pruned tokens into anchors preserves subtle yet informative context without meaningful loss for downstream video understanding tasks.
What would settle it
Measure accuracy on a fine-grained action recognition benchmark after AOT token reduction; if performance falls more than a few percent relative to the unpruned baseline at the same total token budget, the preservation claim fails.
Original abstract
Video Large Language Models (VLLMs) demonstrate strong video understanding but suffer from inefficiency due to redundant visual tokens. Existing pruning primarily targets intra-frame spatial redundancy or prunes inside the LLM with shallow-layer overhead, yielding suboptimal spatiotemporal reduction and underutilizing long-context compressibility. These methods often discard subtle yet informative context from merged or pruned tokens. In this paper, we propose a new perspective that establishes token Anchors within and across frames to comprehensively aggregate the informative contexts via local-global Optimal Transport (AOT). Specifically, we first establish local- and global-aware token anchors within each frame under attention guidance, after which optimal transport aggregates the informative contexts from pruned tokens, constructing intra-frame token anchors. Then, building on the temporal frame clips, the first frame within each clip is treated as the keyframe anchor that ensembles similar information from consecutive frames through optimal transport, while keeping distinct tokens to represent temporal dynamics, leading to efficient token reduction in a training-free manner. Extensive evaluations show that our proposed AOT achieves competitive performance across various short- and long-video benchmarks on leading video LLMs, delivering substantial computational efficiency while preserving temporal and visual fidelity. Project webpage: https://tyroneli.github.io/AOT.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces AOT, a training-free token-reduction method for Video LLMs that constructs intra-frame anchors via attention-guided local-global optimal transport and inter-frame anchors by treating the first frame of each clip as a keyframe and transporting similar information from subsequent frames while retaining distinct tokens for dynamics. It claims competitive accuracy on short- and long-video benchmarks together with substantial efficiency gains while preserving temporal and visual fidelity.
Significance. If the central claim holds, the work would supply a practical, training-free route to compress visual tokens in VLLMs without retraining, directly addressing the quadratic cost of long video contexts and enabling wider deployment of existing models.
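As a rough worked illustration of why token count matters so much, consider only the self-attention score computation. The symbols n (visual tokens), d (hidden width), and r (retained fraction), and the value r = 1/4, are assumptions for this sketch and are not figures from the paper.

```latex
% n: visual tokens, d: hidden width, r: retained fraction (illustrative symbols)
\begin{align*}
C_{\text{attn}}(n)  &\propto n^{2} d, \\
C_{\text{attn}}(rn) &\propto (rn)^{2} d = r^{2}\, n^{2} d, \\
r = \tfrac{1}{4} \;&\Rightarrow\; C_{\text{attn}}(rn) = \tfrac{1}{16}\, C_{\text{attn}}(n).
\end{align*}
```

Token-wise linear layers shrink only by the factor r, so the end-to-end saving lies somewhere between r and r squared depending on sequence length and model shape.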
major comments (2)
- [Method (inter-frame OT) and Experiments] The load-bearing premise that local-global OT aggregation retains subtle context without meaningful loss is stated in the abstract and method description but is supported only by downstream benchmark accuracy; no independent quantification of information retention (embedding reconstruction error, attention-map fidelity, or per-token entropy before/after reduction) is supplied, especially for the inter-frame keyframe step on long videos.
- [Abstract and Experiments] The abstract asserts 'competitive performances' and 'substantial computational efficiency' yet the provided text contains no numerical results, ablation tables, or direct comparisons against prior pruning baselines, rendering the efficiency-fidelity trade-off impossible to assess from the manuscript.
minor comments (1)
- [Method] Notation for the transport plans and anchor definitions is introduced without an explicit equation or algorithm box, making the precise formulation of the local-global OT steps difficult to follow.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We agree that stronger direct evidence for information retention and explicit numerical results would improve the manuscript. We address each major comment below and will revise accordingly.
- Referee: [Method (inter-frame OT) and Experiments] The load-bearing premise that local-global OT aggregation retains subtle context without meaningful loss is stated in the abstract and method description but is supported only by downstream benchmark accuracy; no independent quantification of information retention (embedding reconstruction error, attention-map fidelity, or per-token entropy before/after reduction) is supplied, especially for the inter-frame keyframe step on long videos.
  Authors: We acknowledge that downstream accuracy alone provides only indirect support for the claim of retained subtle context. In the revised manuscript we will add direct quantification: embedding reconstruction error (L2 distance between original and aggregated token embeddings), cosine similarity of attention maps before/after reduction, and per-token entropy comparisons. These metrics will be reported specifically for the inter-frame keyframe OT step on long-video sequences from the ActivityNet and Ego4D benchmarks to address the concern.
  revision: yes
- Referee: [Abstract and Experiments] The abstract asserts 'competitive performances' and 'substantial computational efficiency' yet the provided text contains no numerical results, ablation tables, or direct comparisons against prior pruning baselines, rendering the efficiency-fidelity trade-off impossible to assess from the manuscript.
  Authors: The full manuscript contains tables reporting accuracy on short- and long-video benchmarks together with FLOPs and latency reductions versus prior pruning methods. To make these results immediately visible, we will revise the abstract to include key numerical highlights (e.g., accuracy deltas and efficiency gains) and ensure all ablation tables and baseline comparisons appear in the main body with clear captions.
  revision: yes
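The three retention metrics named in the first response can be sketched as follows. This is one plausible reading, not the authors' evaluation code: the function names are hypothetical, the reconstruction error assumes the reduced tokens have already been mapped back onto the original token positions (for example, each pruned token replaced by its anchor), and the attention maps are assumed to share a shape and to be row-stochastic.

```python
import numpy as np


def l2_reconstruction_error(original, reconstructed):
    """Mean per-token L2 distance between original and reconstructed embeddings."""
    return float(np.linalg.norm(original - reconstructed, axis=1).mean())


def attention_cosine_similarity(attn_before, attn_after):
    """Cosine similarity between two flattened attention maps of equal shape."""
    x, y = attn_before.ravel(), attn_after.ravel()
    return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))


def mean_token_entropy(attn):
    """Mean Shannon entropy (in nats) of row-stochastic attention distributions."""
    p = np.clip(attn, 1e-12, 1.0)        # avoid log(0)
    return float(-(p * np.log(p)).sum(axis=1).mean())
```

Reporting these before and after reduction, per clip length, would separate information loss in the aggregation step from whatever robustness the downstream LLM contributes.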
Circularity Check
No significant circularity; derivation applies standard OT to new anchors without self-referential reduction
The paper presents a training-free token reduction method that first selects attention-guided anchors within frames and then applies optimal transport to aggregate pruned tokens locally, followed by inter-frame keyframe OT on clips. All steps invoke the standard OT formulation (transport plans between anchor and pruned token distributions) without fitting any parameters to the target benchmark data and then relabeling those fits as predictions. No self-citations are used to justify uniqueness or to smuggle in an ansatz; the construction is self-contained and externally falsifiable via the reported benchmark scores. The central efficiency-plus-fidelity claim therefore does not collapse to a tautology or to a fitted-input-called-prediction pattern.
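For reference, the standard entropy-regularised OT problem that such a construction would typically invoke can be written as below. The notation is assumed here and is not taken from the paper: a and b are the masses on pruned tokens and anchors, C is the pairwise cost matrix, and epsilon is the regularisation strength.

```latex
% a: masses on m pruned tokens, b: masses on n anchors, C: pairwise cost,
% \varepsilon: regularisation strength (notation assumed, not the paper's)
\begin{align*}
T^{\star} &= \arg\min_{T \in U(a,b)} \; \langle T, C \rangle - \varepsilon H(T), \\
U(a,b) &= \{\, T \in \mathbb{R}_{\ge 0}^{m \times n} : T\mathbf{1}_n = a,\; T^{\top}\mathbf{1}_m = b \,\}, \\
H(T) &= -\sum_{i,j} T_{ij}\,(\log T_{ij} - 1), \\
T^{\star} &= \operatorname{diag}(u)\, K \,\operatorname{diag}(v), \qquad K = \exp(-C/\varepsilon).
\end{align*}
```

The scaling vectors u and v are found by alternating Sinkhorn updates; none of these quantities is fitted to benchmark data, which is consistent with the no-circularity reading above.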
Forward citations
Cited by 2 Pith papers
- PoInit-of-View: Poisoning Initialization of Views Transfers Across Multiple 3D Reconstruction Systems
  PoInit-of-View poisons SfM initialization by optimizing cross-view gradient inconsistencies to disrupt keypoint detection and feature matching, yielding transferable degradation in rendered 3D reconstruction quality a...
- OTT-Vid: Optimal Transport Temporal Token Compression for Video Large Language Models
  OTT-Vid uses optimal transport with non-uniform token mass and locality-aware costs to dynamically allocate compression budgets across video frames, retaining 95.8% VQA and 73.9% VTG performance at 10% token retention.