DynaTok: Temporally Adaptive and Positional Bias-Aware Token Compression for Video-LLMs

Minyoung Park; Sangjun Ahn; Taehun Kong

arxiv: 2605.19322 · v1 · pith:UDJZIHYVnew · submitted 2026-05-19 · 💻 cs.CV

Pith reviewed 2026-05-20 06:51 UTC · model grok-4.3

classification 💻 cs.CV

keywords token compression · Video-LLMs · temporal adaptation · positional bias · training-free · VideoQA · token reduction

The pith

DynaTok reduces visual tokens in Video-LLMs by 90 percent while retaining more than 95 percent of baseline accuracy through adaptive temporal and spatial allocation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Video Large Language Models face high computational costs from the many visual tokens extracted from long sequences. DynaTok offers a training-free compression approach that dynamically assigns token budgets across time and space. A lightweight exponential moving average tracks changes between frames so that redundant segments receive fewer tokens. Within each frame the method picks spatially diverse and semantically strong features while using memory to limit repetition from earlier selections. The result is efficient video reasoning that integrates directly into existing models and holds performance on question-answering benchmarks even after aggressive reduction.
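The temporal side of this idea can be sketched in a few lines. The following is a minimal illustration, not the authors' implementation: it assumes each frame is summarized by a single feature vector, measures novelty as cosine distance to an EMA memory, and splits a fixed token budget in proportion to that novelty. The function names, the `decay` value, the per-frame floor `min_tokens`, and the choice of cosine distance are all assumptions introduced here.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two equal-length vectors, clamped at 0."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0
    return max(0.0, 1.0 - dot / (norm_a * norm_b))


def allocate_temporal_budget(frame_feats, total_budget, decay=0.9, min_tokens=1):
    """Split total_budget across frames in proportion to novelty vs. an EMA memory.

    decay and min_tokens are hypothetical defaults, not values from the paper.
    """
    n = len(frame_feats)
    spare = total_budget - n * min_tokens
    if n == 0 or spare < 0:
        raise ValueError("budget too small for the number of frames")

    memory = list(frame_feats[0])
    novelty = [1.0]  # assumption: the first frame counts as fully novel
    for feat in frame_feats[1:]:
        novelty.append(cosine_distance(memory, feat))
        # EMA update: the memory drifts slowly toward the current frame.
        memory = [decay * m + (1.0 - decay) * f for m, f in zip(memory, feat)]

    total_novelty = sum(novelty)
    if total_novelty == 0.0:
        shares = [spare / n] * n
    else:
        shares = [spare * v / total_novelty for v in novelty]

    # Largest-remainder rounding so the integer budgets sum to total_budget.
    budgets = [min_tokens + int(s) for s in shares]
    leftover = total_budget - sum(budgets)
    by_fraction = sorted(range(n), key=lambda i: shares[i] - int(shares[i]), reverse=True)
    for i in by_fraction[:leftover]:
        budgets[i] += 1
    return budgets


# Toy input: three static frames, then a scene change that persists.
frames = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
budgets = allocate_temporal_budget(frames, total_budget=20)
```

In this toy input the frame at the scene change receives a larger share than the static frames before it, and the frame after the change still scores as novel because the slow-moving memory has not yet caught up.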

Core claim

DynaTok allocates token budgets along two axes. Temporally, a lightweight EMA memory gives more tokens to novel frames. Spatially, activation-based attention maps and a spatial memory select important, non-redundant features. The method integrates with models such as LLaVA-OneVision and LLaVA-Video without retraining and is reported to maintain over 95 percent of baseline accuracy at 90 percent token reduction on MVBench, LongVideoBench, MLVU, and VideoMME.

What carries the argument

Two modules carry the argument: a Temporal Budget Allocation (TBA) module that uses an EMA memory to track long-term variation, and a Spatial Budget Allocation (SBA) module that uses activation-based attention maps and a spatial memory to reduce positional bias.
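To make the role of a spatial memory concrete, here is a small sketch under stated assumptions. It covers only the memory penalty, not the diversity criterion, and it is not taken from the paper: per-patch importance scores are discounted by how often each position was selected in earlier frames, so a position that always scores high does not monopolize the budget. The names `select_spatial_tokens`, `penalty`, and `decay` are hypothetical.

```python
def select_spatial_tokens(scores, memory, k, penalty=1.0, decay=0.8):
    """Pick k patch indices by importance, discounted by past selections.

    scores:  per-patch importance for the current frame (flattened grid)
    memory:  per-position running count of earlier selections
    penalty, decay: hypothetical knobs, not values from the paper
    Returns (sorted chosen indices, updated memory).
    """
    if len(scores) != len(memory):
        raise ValueError("scores and memory must have the same length")
    adjusted = [s - penalty * m for s, m in zip(scores, memory)]
    ranked = sorted(range(len(scores)), key=lambda i: adjusted[i], reverse=True)
    chosen = set(ranked[:k])
    # Decay old evidence, then record the positions picked for this frame.
    new_memory = [decay * m + (1.0 if i in chosen else 0.0) for i, m in enumerate(memory)]
    return sorted(chosen), new_memory


# Two frames with the same positionally biased scores (first patches dominate).
biased_scores = [0.9, 0.8, 0.1, 0.2]
memory = [0.0, 0.0, 0.0, 0.0]
first_pick, memory = select_spatial_tokens(biased_scores, memory, k=2)
second_pick, memory = select_spatial_tokens(biased_scores, memory, k=2)
```

With identical scores in both frames, the second call moves away from the positions chosen in the first, which is the kind of behavior the spatial memory is described as providing.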

Load-bearing premise

The lightweight EMA memory and spatial memory mechanisms can effectively capture long-term temporal variations and mitigate positional bias in token selection without requiring any model-specific training or fine-tuning.

What would settle it

Applying DynaTok to videos with sudden scene shifts or to an unseen Video-LLM architecture and measuring whether accuracy falls well below 95 percent of baseline at 90 percent token reduction.
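The test described above reduces to two ratios, sketched here for clarity. The helper names and the numbers in the example are illustrative assumptions, not results from the paper; only the two thresholds (95 percent retention, 90 percent reduction) come from the stated claim.

```python
def token_reduction(original_tokens, kept_tokens):
    """Fraction of visual tokens removed by compression."""
    return 1.0 - kept_tokens / original_tokens


def accuracy_retention(baseline_acc, compressed_acc):
    """Compressed accuracy as a fraction of the uncompressed baseline."""
    return compressed_acc / baseline_acc


def claim_holds(baseline_acc, compressed_acc, original_tokens, kept_tokens,
                min_retention=0.95, min_reduction=0.90):
    """True if the run meets both thresholds of the stated claim."""
    return (accuracy_retention(baseline_acc, compressed_acc) >= min_retention
            and token_reduction(original_tokens, kept_tokens) >= min_reduction)


# Made-up numbers for illustration only.
passes = claim_holds(baseline_acc=60.0, compressed_acc=57.6,
                     original_tokens=6400, kept_tokens=600)
fails = claim_holds(baseline_acc=60.0, compressed_acc=55.0,
                    original_tokens=6400, kept_tokens=600)
```

A run on scene-shift videos or on an unseen architecture would count against the claim if it landed in the second case.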

Figures

Figures reproduced from arXiv: 2605.19322 by Minyoung Park, Sangjun Ahn, Taehun Kong.

**Figure 2.** Overview of the proposed DynaTok framework for training-free and efficient token compression in Video-LLMs. The framework ...

**Figure 3.** Visualization of accumulated activation-based attention ...

**Figure 4.** Visualization of token selection with and without the ...

**Figure 5.** Visualization of the effect of Temporal Budget Allocation (TBA). Without TBA (frame-wise uniform token compression), an ...

Original abstract

Recent advances in Video Large Language Models (Video-LLMs) have greatly expanded multimodal reasoning capabilities. However, the massive number of visual tokens extracted from long video sequences incurs prohibitive computational costs, limiting their deployment in real-world scenarios. Existing training-free token compression methods select tokens based on attention magnitude as a proxy for semantic importance, but often overlook positional bias and rely only on short-term temporal locality, leading to redundant spatio-temporal coverage and inefficient token usage. We present DynaTok, a training-free, temporally adaptive and bias-aware token compression framework that allocates token budgets across both temporal and spatial dimensions. Through a lightweight exponential moving average (EMA) memory, the Temporal Budget Allocation (TBA) module dynamically assigns fewer tokens to redundant frames and more to novel frames, capturing long-term temporal variation. The Spatial Budget Allocation (SBA) module complements this by selecting spatially diverse and semantically important features using activation-based attention maps, while leveraging a spatial memory to reduce redundancy from previously selected regions and mitigate positional bias. DynaTok integrates seamlessly with existing Video-LLMs such as LLaVA-OneVision and LLaVA-Video without retraining, and effectively preserves semantic coverage under aggressive compression. Experiments on four representative VideoQA benchmarks (MVBench, LongVideoBench, MLVU, and VideoMME) show that DynaTok retains over 95% of baseline accuracy even with a 90% token reduction, surpassing recent training-free approaches. These results demonstrate that DynaTok provides a principled foundation for efficient and robust video reasoning, paving the way toward real-time streaming video understanding with future Video-LLMs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript presents DynaTok, a training-free token compression framework for Video-LLMs. It introduces a Temporal Budget Allocation (TBA) module that uses a lightweight exponential moving average (EMA) memory to dynamically allocate fewer tokens to redundant frames and more to novel ones, capturing long-term temporal variation. This is complemented by a Spatial Budget Allocation (SBA) module that selects spatially diverse and semantically important features via activation-based attention maps while using spatial memory to reduce redundancy and mitigate positional bias. The method integrates with existing models such as LLaVA-OneVision and LLaVA-Video without retraining. Experiments on MVBench, LongVideoBench, MLVU, and VideoMME report that DynaTok retains over 95% of baseline accuracy at 90% token reduction, outperforming recent training-free approaches.

Significance. If the reported empirical results hold under detailed scrutiny, DynaTok offers a practical advance for efficient long-video reasoning in Video-LLMs by providing a training-free, modular approach to spatio-temporal token allocation. The emphasis on long-term temporal dynamics via EMA and positional-bias mitigation via spatial memory addresses documented limitations in prior attention-magnitude-based methods. Seamless plug-in compatibility with existing models and strong retention of accuracy at aggressive compression rates would make this relevant for real-time and resource-constrained video understanding applications.

major comments (2)

§4 (Experiments): The central performance claim of >95% baseline accuracy retention at 90% token reduction is load-bearing for the paper's contribution, yet the manuscript provides no error bars, standard deviations across multiple runs, or statistical significance tests against the listed baselines; this weakens the ability to assess whether the reported gains over recent training-free methods are robust.
§3.2 (TBA module): The description of the EMA memory update and budget allocation formula lacks explicit pseudocode or parameter values (e.g., decay rate), making it difficult to verify that the mechanism indeed captures long-term variations independently of short-term locality assumptions in prior work.
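For the first major comment, the kind of reporting the referee asks for can be sketched as follows. This is one possible approach under stated assumptions, not a procedure from the manuscript: mean and sample standard deviation over repeated runs, plus a paired percentile bootstrap over per-question correctness for two methods evaluated on the same questions. The function names, resample count, and seed are hypothetical choices.

```python
import random
import statistics


def mean_and_std(run_accuracies):
    """Mean and sample standard deviation over repeated runs."""
    return statistics.mean(run_accuracies), statistics.stdev(run_accuracies)


def paired_bootstrap_ci(correct_a, correct_b, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile CI for accuracy(A) - accuracy(B) on the same questions.

    correct_a, correct_b: lists of 0/1 flags, aligned question by question.
    """
    if len(correct_a) != len(correct_b) or not correct_a:
        raise ValueError("inputs must be non-empty and aligned")
    rng = random.Random(seed)
    n = len(correct_a)
    diffs = []
    for _ in range(n_resamples):
        # Resample question indices with replacement, keeping pairs together.
        idx = [rng.randrange(n) for _ in range(n)]
        diffs.append(sum(correct_a[i] - correct_b[i] for i in idx) / n)
    diffs.sort()
    lower = diffs[int((alpha / 2) * n_resamples)]
    upper = diffs[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Synthetic flags for illustration only: A answers 80 of 100, B answers 50 of 100.
method_a = [1] * 80 + [0] * 20
method_b = [1] * 50 + [0] * 50
ci_low, ci_high = paired_bootstrap_ci(method_a, method_b)
run_mean, run_std = mean_and_std([60.0, 61.0, 62.0])
```

An interval that excludes zero would support a claim that one method outperforms the other on that benchmark; an interval that straddles zero would not.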

minor comments (2)

Abstract and §1: The list of benchmarks is given as 'MVBench, LongVideoBench, MLVU, and VideoMME' but referred to collectively as 'four representative VideoQA benchmarks'; ensure consistent terminology and add a brief note on dataset characteristics (e.g., average video length) for context.
Figure 2 or equivalent architecture diagram: The interaction between TBA and SBA modules and the final token selection step would benefit from an explicit flowchart or equation showing how the allocated budgets are combined.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each of the major comments in detail below, proposing specific revisions to improve clarity and robustness.

Referee: §4 (Experiments): The central performance claim of >95% baseline accuracy retention at 90% token reduction is load-bearing for the paper's contribution, yet the manuscript provides no error bars, standard deviations across multiple runs, or statistical significance tests against the listed baselines; this weakens the ability to assess whether the reported gains over recent training-free methods are robust.

Authors: We agree that statistical analysis would strengthen the presentation of our results. Although DynaTok is training-free, minor variations can occur due to data loading or inference settings. In the revised manuscript, we will add standard deviations from multiple runs (using different random seeds for frame sampling where applicable) and report statistical significance tests comparing against the baselines.

revision: yes

Referee: §3.2 (TBA module): The description of the EMA memory update and budget allocation formula lacks explicit pseudocode or parameter values (e.g., decay rate), making it difficult to verify that the mechanism indeed captures long-term variations independently of short-term locality assumptions in prior work.

Authors: We appreciate the suggestion to improve reproducibility. Section 3.2 currently provides the mathematical formulation of the EMA update and budget allocation, but we agree that pseudocode and explicit parameter values would be helpful. In the revision, we will add an algorithm box with pseudocode for the TBA module and state the specific decay rate and other hyperparameters used in our experiments.

revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

The paper introduces DynaTok as a training-free algorithmic framework consisting of independent TBA and SBA modules that rely on lightweight EMA memory and spatial memory heuristics for token allocation. These components are defined procedurally and evaluated through direct empirical measurements on external benchmarks (MVBench, LongVideoBench, MLVU, VideoMME), with performance quantified as retention of baseline accuracy under token reduction. No derivation chain, equations, or self-citations are presented that reduce the central claims to quantities fitted from the paper's own inputs or prior results by construction; the reported outcomes are end-to-end experimental results rather than tautological restatements of fitted parameters or renamed heuristics.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The framework builds on standard attention-based importance assumptions in the field but introduces new memory mechanisms for adaptation. No new entities or heavily fitted parameters are mentioned in the abstract.

axioms (1)

Domain assumption: Attention magnitude and activation maps serve as reliable proxies for semantic importance and spatial diversity.
This underpins the SBA module's selection process.


[1] [1]

Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond

Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-vl: A frontier large vision-language model with versatile abilities.arXiv preprint arXiv:2308.12966, 1(2):3,

work page internal anchor Pith review Pith/arXiv arXiv

[2] [2]

Videollm-online: Online video large language model for streaming video

Joya Chen, Zhaoyang Lv, Shiwei Wu, Kevin Qinghong Lin, Chenan Song, Difei Gao, Jia-Wei Liu, Ziteng Gao, Dongxing Mao, and Mike Zheng Shou. Videollm-online: Online video large language model for streaming video. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18407–18418, 2024. 3

work page 2024

[3] [3]

An image is worth 1/2 tokens after layer 2: Plug-and-play inference acceleration for large vision-language models

Liang Chen, Haozhe Zhao, Tianyu Liu, Shuai Bai, Junyang Lin, Chang Zhou, and Baobao Chang. An image is worth 1/2 tokens after layer 2: Plug-and-play inference acceleration for large vision-language models. InEuropean Conference on Computer Vision, pages 19–35. Springer, 2024. 1, 2, 3, 5, 6, 7

work page 2024

[4] [4]

FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning

Tri Dao. Flashattention-2: Faster attention with bet- ter parallelism and work partitioning.arXiv preprint arXiv:2307.08691, 2023. 1, 3

work page internal anchor Pith review Pith/arXiv arXiv 2023

[5] [5]

Flashattention: Fast and memory-efficient exact attention with io-awareness.Advances in neural information processing systems, 35:16344–16359, 2022

Tri Dao, Dan Fu, Stefano Ermon, Atri Rudra, and Christo- pher R ´e. Flashattention: Fast and memory-efficient exact attention with io-awareness.Advances in neural information processing systems, 35:16344–16359, 2022. 1, 3

work page 2022

[6] [6]

Streammind: Un- locking full frame rate streaming video dialogue through event-gated cognition

Xin Ding, Hao Wu, Yifan Yang, Shiqi Jiang, Qianxi Zhang, Donglin Bai, Zhibo Chen, and Ting Cao. Streammind: Un- locking full frame rate streaming video dialogue through event-gated cognition. InProceedings of the IEEE/CVF In- ternational Conference on Computer Vision, pages 13448– 13459, 2025. 3

work page 2025

[7] [7]

Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis

Chaoyou Fu, Yuhan Dai, Yongdong Luo, Lei Li, Shuhuai Ren, Renrui Zhang, Zihan Wang, Chenyu Zhou, Yunhang Shen, Mengdan Zhang, et al. Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 24108–24118, 2025. 1, 5

work page 2025

[8] [8]

Multi-granular spatio-temporal token merging for training-free acceleration of video llms.arXiv preprint arXiv:2507.07990, 2025

Jeongseok Hyun, Sukjun Hwang, Su Ho Han, Taeoh Kim, Inwoong Lee, Dongyoon Wee, Joon-Young Lee, Seon Joo Kim, and Minho Shim. Multi-granular spatio-temporal token merging for training-free acceleration of video llms.arXiv preprint arXiv:2507.07990, 2025. 2, 4

work page arXiv 2025

[9] [9]

Similarity-aware token pruning: Your vlm but faster.arXiv preprint arXiv:2503.11549, 2025

Ahmadreza Jeddi, Negin Baghbanzadeh, Elham Dolatabadi, and Babak Taati. Similarity-aware token pruning: Your vlm but faster.arXiv preprint arXiv:2503.11549, 2025. 2

work page arXiv 2025

[10] [10]

LLaVA-OneVision: Easy Visual Task Transfer

Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Peiyuan Zhang, Yanwei Li, Zi- wei Liu, et al. Llava-onevision: Easy visual task transfer. arXiv preprint arXiv:2408.03326, 2024. 1, 5, 6

work page internal anchor Pith review Pith/arXiv arXiv 2024

[11] [11]

VideoChat: Chat-Centric Video Understanding

KunChang Li, Yinan He, Yi Wang, Yizhuo Li, Wenhai Wang, Ping Luo, Yali Wang, Limin Wang, and Yu Qiao. Videochat: Chat-centric video understanding.arXiv preprint arXiv:2305.06355, 2023. 3

work page internal anchor Pith review Pith/arXiv arXiv 2023

[12] [12]

Mvbench: A comprehensive multi-modal video understand- ing benchmark

Kunchang Li, Yali Wang, Yinan He, Yizhuo Li, Yi Wang, Yi Liu, Zun Wang, Jilan Xu, Guo Chen, Ping Luo, et al. Mvbench: A comprehensive multi-modal video understand- ing benchmark. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22195– 22206, 2024. 1, 5

work page 2024

[13] [13]

Lion-fs: Fast & slow video-language thinker as online video assistant

Wei Li, Bing Hu, Rui Shao, Leyang Shen, and Liqiang Nie. Lion-fs: Fast & slow video-language thinker as online video assistant. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 3240–3251, 2025. 3

work page 2025

[14] [14]

Video-llava: Learning united visual repre- sentation by alignment before projection

Bin Lin, Yang Ye, Bin Zhu, Jiaxi Cui, Munan Ning, Peng Jin, and Li Yuan. Video-llava: Learning united visual repre- sentation by alignment before projection. InProceedings of the 2024 Conference on Empirical Methods in Natural Lan- guage Processing, pages 5971–5984, 2024. 1

work page 2024

[15] [15]

Vrope: Rotary position embedding for video large language models.arXiv preprint arXiv:2502.11664, 2025

Zikang Liu, Longteng Guo, Yepeng Tang, Tongtian Yue, Junxian Cai, Kai Ma, Qingbin Liu, Xi Chen, and Jing Liu. Vrope: Rotary position embedding for video large language models.arXiv preprint arXiv:2502.11664, 2025. 2, 4

work page arXiv 2025

[16] [16]

Video-chatgpt: Towards detailed video un- derstanding via large vision and language models

Muhammad Maaz, Hanoona Rasheed, Salman Khan, and Fahad Khan. Video-chatgpt: Towards detailed video un- derstanding via large vision and language models. InPro- ceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 12585–12602, 2024. 1, 3

work page 2024

[17] Shuhuai Ren, Linli Yao, Shicheng Li, Xu Sun, and Lu Hou. Timechat: A time-sensitive multimodal large language model for long video understanding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14313–14323, 2024. 1

[18] Leqi Shen, Guoqiang Gong, Tao He, Yifeng Zhang, Pengzhang Liu, Sicheng Zhao, and Guiguang Ding. Fastvid: Dynamic density pruning for fast video large language models. arXiv preprint arXiv:2503.11187, 2025. 2, 4

[19] Leqi Shen, Tao He, Guoqiang Gong, Fan Yang, Yifeng Zhang, Pengzhang Liu, Sicheng Zhao, and Guiguang Ding. Llava-mlb: Mitigating and leveraging attention bias for training-free video llms. arXiv preprint arXiv:2503.11205,

[20] Enxin Song, Wenhao Chai, Guanhong Wang, Yucheng Zhang, Haoyang Zhou, Feiyang Wu, Haozhe Chi, Xun Guo, Tian Ye, Yanting Zhang, et al. Moviechat: From dense token to sparse memory for long video understanding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18221–18232, 2024. 1

[21] Keda Tao, Can Qin, Haoxuan You, Yang Sui, and Huan Wang. Dycoke: Dynamic compression of tokens for fast video large language models. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 18992–19001, 2025. 2, 3, 4, 6

[22] Gemini Team, Petko Georgiev, Ving Ian Lei, Ryan Burnell, Libin Bai, Anmol Gulati, Garrett Tanzer, Damien Vincent, Zhufeng Pan, Shibo Wang, et al. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. arXiv preprint arXiv:2403.05530, 2024. 1, 3

[23] Xinyu Tian, Shu Zou, Zhaoyuan Yang, and Jing Zhang. Identifying and mitigating position bias of multi-image vision-language models. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 10599–10609, 2025. 2, 4

[24] Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, et al. Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution. arXiv preprint arXiv:2409.12191, 2024. 1, 3

[25] Yuetian Weng, Mingfei Han, Haoyu He, Xiaojun Chang, and Bohan Zhuang. Longvlm: Efficient long video understanding via large language models. In European Conference on Computer Vision, pages 453–470. Springer, 2024. 1

[26] Haoning Wu, Dongxu Li, Bei Chen, and Junnan Li. Longvideobench: A benchmark for long-context interleaved video-language understanding. Advances in Neural Information Processing Systems, 37:28828–28857, 2024. 1, 5

[27] Hou Xia, Zheren Fu, Fangcan Ling, Jiajun Li, Yi Tu, Zhendong Mao, and Yongdong Zhang. Video-levelgauge: Investigating contextual positional bias in large video language models. arXiv preprint arXiv:2508.19650, 2025. 4

[28] Senqiao Yang, Yukang Chen, Zhuotao Tian, Chengyao Wang, Jingyao Li, Bei Yu, and Jiaya Jia. Visionzip: Longer is better but not necessary in vision language models. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 19792–19802, 2025. 1, 2, 3, 6, 7

[29] Linli Yao, Yicheng Li, Yuancheng Wei, Lei Li, Shuhuai Ren, Yuanxin Liu, Kun Ouyang, Lean Wang, Shicheng Li, Sida Li, et al. Timechat-online: 80% visual tokens are naturally redundant in streaming videos. In Proceedings of the 33rd ACM International Conference on Multimedia, pages 10807–10816, 2025. 2, 3

[30] Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. Sigmoid loss for language image pre-training. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 11975–11986, 2023. 5

[31] Boqiang Zhang, Kehan Li, Zesen Cheng, Zhiqiang Hu, Yuqian Yuan, Guanzheng Chen, Sicong Leng, Yuming Jiang, Hang Zhang, Xin Li, et al. Videollama 3: Frontier multimodal foundation models for image and video understanding. arXiv preprint arXiv:2501.13106, 2025. 1, 2, 3, 5

[32] Hang Zhang, Xin Li, and Lidong Bing. Video-llama: An instruction-tuned audio-visual language model for video understanding. arXiv preprint arXiv:2306.02858, 2023. 1, 3

[33] Kaichen Zhang, Bo Li, Peiyuan Zhang, Fanyi Pu, Joshua Adrian Cahyono, Kairui Hu, Shuai Liu, Yuanhan Zhang, Jingkang Yang, Chunyuan Li, et al. Lmms-eval: Reality check on the evaluation of large multimodal models. In Findings of the Association for Computational Linguistics: NAACL 2025, pages 881–916, 2025. 6

[34] Yuan Zhang, Chun-Kai Fan, Junpeng Ma, Wenzhao Zheng, Tao Huang, Kuan Cheng, Denis Gudovskiy, Tomoyuki Okuno, Yohei Nakata, Kurt Keutzer, et al. Sparsevlm: Visual token sparsification for efficient vision-language model inference. arXiv preprint arXiv:2410.04417, 2024. 3

[35] Yuanhan Zhang, Jinming Wu, Wei Li, Bo Li, Zejun Ma, Ziwei Liu, and Chunyuan Li. Video instruction tuning with synthetic data. arXiv preprint arXiv:2410.02713, 2024. 6, 7

[36] Junjie Zhou, Yan Shu, Bo Zhao, Boya Wu, Shitao Xiao, Xi Yang, Yongping Xiong, Bo Zhang, Tiejun Huang, and Zheng Liu. Mlvu: A comprehensive benchmark for multi-task long video understanding. arXiv e-prints, pages arXiv–2406, 2024. 1, 5

[37] Xin Zou, Di Lu, Yizhou Wang, Yibo Yan, Yuanhuiyi Lyu, Xu Zheng, Linfeng Zhang, and Xuming Hu. Don’t just chase “highlighted tokens” in mllms: Revisiting visual holistic context retention. arXiv preprint arXiv:2510.02912, 2025. 2, 4, 5

DynaTok: Temporally Adaptive and Positional Bias-Aware Token Compression for Video-LLMs Supplementary Material

This suppl...