pith. machine review for the scientific record.

arxiv: 2604.11627 · v1 · submitted 2026-04-13 · 💻 cs.CV

Recognition: unknown

POINTS-Long: Adaptive Dual-Mode Visual Reasoning in MLLMs

Authors on Pith · no claims yet

Pith reviewed 2026-05-10 15:16 UTC · model grok-4.3

classification 💻 cs.CV
keywords multimodal large language models · visual token scaling · dual-mode perception · long-form visual understanding · streaming visual understanding · KV-cache design · adaptive efficiency · focus and standby modes

The pith

POINTS-Long introduces a dual-mode MLLM that switches between full-detail focus and low-token standby modes for efficient long visual reasoning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces POINTS-Long, a multimodal large language model built with two native perception modes that scale visual tokens dynamically. Focus mode keeps full resolution for detailed tasks while standby mode reduces tokens sharply for long videos and streams. The design draws from human vision to let users trade accuracy for speed on the fly during inference. A sympathetic reader cares because visual token counts grow quickly in long or streaming content, limiting real-world use. The model also includes a detachable KV-cache that supports ongoing visual memory without full recomputation.

Core claim

POINTS-Long is a native dual-mode MLLM featuring dynamic visual token scaling. It supports a focus mode that preserves optimal performance on fine-grained tasks and a standby mode that retains 97.7-99.7 percent of the original accuracy on long-form general visual understanding while using only 1/40 to 1/10 of the visual tokens. It further enables streaming visual understanding through a dynamically detachable KV-cache design for efficient ultra-long visual memory.
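
To make the headline numbers concrete, here is a minimal sketch of the token arithmetic the claim implies; the frame count, per-frame token count, and benchmark score below are illustrative assumptions, not values from the paper.

# Illustrative token arithmetic for the claimed 1/40-1/10 reduction.
# All concrete numbers here are assumptions chosen for the example.

frames = 512                  # hypothetical long-video frame budget
tokens_per_frame_focus = 256  # hypothetical full-detail tokens per frame

focus_tokens = frames * tokens_per_frame_focus
for reduction in (40, 10):    # the extremes of the reported 1/40-1/10 range
    standby_tokens = focus_tokens // reduction
    print(f"focus: {focus_tokens} visual tokens -> "
          f"standby (1/{reduction}): {standby_tokens} visual tokens")

# The accuracy claim is relative retention: e.g. 97.7 percent of a
# hypothetical focus-mode score of 62.0 would be 0.977 * 62.0 = 60.6
# on the same benchmark.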

What carries the argument

Dual perception modes (focus and standby) with dynamic visual token scaling, plus a dynamically detachable KV-cache for streaming.
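
The standby side of that machinery, as echoed in Figure 2, compresses the full patch sequence into a small set of learnable tokens. The sketch below shows one plausible shape for such a compressor: cross-attention from n learned query tokens onto the patch sequence, followed by a projector. The module name, sizes, and attention form are assumptions for illustration, not the paper's implementation.

# Minimal sketch of a standby-style compressor: n learnable query tokens
# summarize a long visual patch sequence via cross-attention.
# This is an assumed stand-in for the paper's module, not its implementation.
import torch
import torch.nn as nn

class StandbyCompressor(nn.Module):  # hypothetical name
    def __init__(self, dim: int = 1024, n_standby: int = 64, n_heads: int = 8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(n_standby, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.proj = nn.Linear(dim, dim)  # stands in for the duplicated projector

    def forward(self, patch_tokens: torch.Tensor) -> torch.Tensor:
        # patch_tokens: (batch, seq_len, dim) full-resolution visual tokens
        b = patch_tokens.shape[0]
        q = self.queries.unsqueeze(0).expand(b, -1, -1)
        compressed, _ = self.attn(q, patch_tokens, patch_tokens)
        return self.proj(compressed)  # (batch, n_standby, dim)

# Usage: 1,024 patch tokens collapse to 64 standby tokens (a 1/16 reduction here).
tokens = torch.randn(1, 1024, 1024)
print(StandbyCompressor()(tokens).shape)  # torch.Size([1, 64, 1024])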

If this is right

  • Users can dynamically choose between efficiency and accuracy at inference time.
  • Standby mode enables processing of long videos and streams with far fewer tokens while preserving high accuracy on general tasks.
  • The detachable KV-cache allows efficient maintenance of memory across extended visual sequences without full recomputation (see the streaming sketch after this list).
  • The approach lays groundwork for adaptive designs in future models handling long-form visual content.
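
A minimal sketch of that detach-and-migrate flow, mirroring the streaming behavior described in Figure 3: recent frames keep a full-detail cache, and once the local window fills, the oldest full cache is dropped and only a compact standby summary moves to a long-term memory bank. The window size, token counts, and summarize step are placeholders, not the paper's design.

# Sketch of the detach-and-migrate flow for streaming inference.
# Window sizes and the summarize() step are illustrative assumptions.
from collections import deque

LOCAL_WINDOW = 4        # hypothetical number of frames kept in full detail
TOKENS_FOCUS = 256      # hypothetical focus-mode tokens per frame
TOKENS_STANDBY = 16     # hypothetical standby-mode tokens per frame

def summarize(frame_tokens):
    """Stand-in for the standby compression of one frame's tokens."""
    return frame_tokens[:TOKENS_STANDBY]  # placeholder: any fixed-size summary

local_window = deque()   # full-detail (focus) caches for recent frames
memory_bank = []         # compact standby caches for older frames

def ingest(frame_tokens):
    """Add one new frame; spill the oldest full-detail cache to the memory bank."""
    local_window.append(frame_tokens)
    if len(local_window) > LOCAL_WINDOW:
        oldest = local_window.popleft()        # detach the full-detail cache
        memory_bank.append(summarize(oldest))  # migrate only the compact summary

for t in range(10):
    ingest([f"frame{t}_tok{i}" for i in range(TOKENS_FOCUS)])

print(len(local_window), "frames in focus cache,",
      sum(len(s) for s in memory_bank), "tokens in memory bank")
# -> 4 frames in focus cache, 96 tokens in memory bank (6 spilled frames x 16)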

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar mode-switching could extend to other input types like audio or text sequences where token volume varies.
  • Deployment on edge devices might become more feasible if standby mode scales down compute predictably.
  • The human-vision analogy suggests testing against biological benchmarks for token selection patterns.

Load-bearing premise

Dynamic token scaling and mode switching can be implemented natively without adding overhead, and without accuracy losses beyond those observed on the tested long-form tasks.

What would settle it

Test standby mode on a long video where critical details appear only briefly and sparsely; if the reduced token budget misses those details, accuracy would fall below the claimed 97.7 percent floor.
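
A harness for that probe might look like the sketch below: score the same clips in both modes and check whether the retention ratio stays above the lowest value the abstract reports. The evaluate hook and the pass/fail threshold are assumptions layered on the abstract, not a protocol from the paper.

# Sketch of a sparse-detail stress test for standby mode.
# evaluate() is a placeholder for running the model under test on a clip.

CLAIMED_FLOOR = 0.977  # lowest retention reported in the abstract

def evaluate(clip, mode: str) -> float:
    """Placeholder: return task accuracy for a clip under 'focus' or 'standby'."""
    raise NotImplementedError("hook up the model under test here")

def retention(clips) -> float:
    focus = sum(evaluate(c, "focus") for c in clips) / len(clips)
    standby = sum(evaluate(c, "standby") for c in clips) / len(clips)
    return standby / focus

def verdict(clips) -> str:
    r = retention(clips)
    return (f"retention {r:.3f} -- "
            + ("holds up on sparse details" if r >= CLAIMED_FLOOR
               else "drops below the claimed range on sparse details"))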

Figures

Figures reproduced from arXiv: 2604.11627 by Haicheng Wang, Jie Zhou, Le Tian, Weidi Xie, Xiao Zhou, Yanfeng Wang, Yangxiu You, Yikun Liu, Yuan Liu, Zhemeng Yu, Zhongyin Zhao, Zilin Yu.

Figure 1. POINTS-Long: Bridging the Gap between Human Visual Perception and MLLM Scalability. Inspired by humans' adaptive visual processing, POINTS-Long introduces a dual-mode system that switches between a high-fidelity Focus Mode and an efficient Standby Mode, enabling both detailed analysis and long-term streaming understanding with significantly reduced cost.
Figure 2. POINTS-Long Architecture. The original visual patch sequence (blue) is processed by the original ViT modules. We introduce n learnable tokens (orange), processed through duplicated learnable MLPs and a projector, to act as the compressed representation of the full sequence. An additional temporal modeling step allows better compression for video inputs. With a symmetric attention mask, the original path is totally u…
Figure 3. Streaming Inference in LLM. (↑) When handling streaming inputs, general MLLMs discard previously cached context when reaching the maximum budget. (↓) POINTS-Long encodes new frames in Focus Mode. When the local window is full, the original sequence's cache is detached, and the compact standby-sequence cache is migrated to a long-term "Memory Bank".
Figure 4. POINTS1.5-8B-Instruct Architecture. POINTS1.5-8B consists of a native-resolution image encoder (initialized from Qwen2-VL-ViT), a pixel-shuffle projector reducing the token count by a factor of 4, and an LLM initialized from Qwen3-8B-Base. The architecture employs 1D RoPE for the LLM and 2D RoPE for the ViT.
Figure 5. Visualization of Position Encoding. We initialize learnable standby tokens by uniformly sampling RoPE embeddings from the original sequence. We visualize their attention maps in the last ViT layer, marking assigned positions with a yellow square. For clarity, we display only the top 10% of attention weights, where darker red indicates higher intensity. The results reveal a strong localization effect: stand…
Figure 6. Failure case analysis. Standby mode fails on spatial or fine-grained perception while the baseline fails more on temporal and general understanding.
read the original abstract

Multimodal Large Language Models (MLLMs) have recently demonstrated remarkable capabilities in cross-modal understanding and generation. However, the rapid growth of visual token sequences--especially in long-video and streaming scenarios--poses a major challenge to their scalability and real-world deployment. Thus, we introduce POINTS-Long, a native dual-mode MLLM featuring dynamic visual token scaling inspired by the human visual system. The model supports two complementary perception modes: focus mode and standby mode, enabling users to dynamically trade off efficiency and accuracy during inference. On fine-grained visual tasks, the focus mode retains the optimal performance, while on long-form general visual understanding, the standby mode retains 97.7-99.7% of the original accuracy using only 1/40-1/10th of the visual tokens. Moreover, POINTS-Long natively supports streaming visual understanding via a dynamically detachable KV-cache design, allowing efficient maintenance of ultra-long visual memory. Our work provides new insights into the design of future MLLMs and lays the foundation for adaptive and efficient long-form visual understanding.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 1 minor

Summary. The paper introduces POINTS-Long, a native dual-mode MLLM with focus and standby perception modes that dynamically scales visual tokens, inspired by human vision. It claims that standby mode retains 97.7-99.7% of baseline accuracy on long-form general visual understanding tasks while using only 1/40 to 1/10 of the visual tokens, and that a detachable KV-cache enables efficient streaming visual understanding.

Significance. If the empirical results hold under proper controls, the work is significant for addressing visual token scalability in long-video and streaming MLLM applications. The dual-mode architecture and native streaming support provide a concrete mechanism for accuracy-efficiency trade-offs, with potential to influence efficient multimodal model design.

minor comments (1)
  1. The abstract reports specific accuracy retention ranges (97.7-99.7%) and token reduction factors (1/40-1/10) but does not name the long-form tasks, datasets, or baseline models used; adding these references would strengthen verifiability without altering the central claim.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive assessment of POINTS-Long and for recommending minor revision. We appreciate the recognition that the dual-mode architecture and detachable KV-cache provide a concrete mechanism for accuracy-efficiency trade-offs in long-form and streaming MLLM settings.

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper presents an empirical architecture for adaptive dual-mode visual reasoning in MLLMs, with claims resting on reported accuracy retention metrics (97.7-99.7% at reduced token counts) from experimental evaluations on long-form tasks. No mathematical derivations, equations, fitted parameters presented as predictions, or self-citation chains appear in the abstract or described content. The design is motivated by a human-vision analogy but does not reduce any result to its own inputs by construction; performance numbers are direct outcomes of the implemented model rather than self-referential definitions.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no information on training hyperparameters, loss terms, or architectural assumptions; the dual modes and KV-cache design are presented as novel contributions without further breakdown.

pith-pipeline@v0.9.0 · 5525 in / 1025 out tokens · 56976 ms · 2026-05-10T15:16:55.318007+00:00 · methodology

discussion (0)

