pith. machine review for the scientific record.

arxiv: 2604.11627 · v1 · submitted 2026-04-13 · 💻 cs.CV

Recognition: unknown

POINTS-Long: Adaptive Dual-Mode Visual Reasoning in MLLMs

Authors on Pith · no claims yet

Pith reviewed 2026-05-10 15:16 UTC · model grok-4.3

classification 💻 cs.CV
keywords multimodal large language models · visual token scaling · dual-mode perception · long-form visual understanding · streaming visual understanding · KV-cache design · adaptive efficiency · focus and standby modes

The pith

POINTS-Long introduces a dual-mode MLLM that switches between full-detail focus and low-token standby modes for efficient long visual reasoning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces POINTS-Long, a multimodal large language model built with two native perception modes that scale visual tokens dynamically. Focus mode keeps full resolution for detailed tasks while standby mode reduces tokens sharply for long videos and streams. The design draws from human vision to let users trade accuracy for speed on the fly during inference. A sympathetic reader cares because visual token counts grow quickly in long or streaming content, limiting real-world use. The model also includes a detachable KV-cache that supports ongoing visual memory without full recomputation.

Core claim

POINTS-Long is a native dual-mode MLLM featuring dynamic visual token scaling. It supports a focus mode that preserves optimal performance on fine-grained tasks and a standby mode that retains 97.7-99.7 percent of the original accuracy on long-form general visual understanding while using only 1/40 to 1/10 of the visual tokens. It further enables streaming visual understanding through a dynamically detachable KV-cache design for efficient ultra-long visual memory.
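
To make the headline numbers concrete, here is a minimal sketch of the token arithmetic the claim implies; the frame count, per-frame token count, and benchmark score below are illustrative assumptions, not values from the paper.

# Illustrative token arithmetic for the claimed 1/40-1/10 reduction.
# All concrete numbers here are assumptions chosen for the example.

frames = 512                  # hypothetical long-video frame budget
tokens_per_frame_focus = 256  # hypothetical full-detail tokens per frame

focus_tokens = frames * tokens_per_frame_focus
for reduction in (40, 10):    # the extremes of the reported 1/40-1/10 range
    standby_tokens = focus_tokens // reduction
    print(f"focus: {focus_tokens} visual tokens -> "
          f"standby (1/{reduction}): {standby_tokens} visual tokens")

# The accuracy claim is relative retention: e.g. 97.7 percent of a
# hypothetical focus-mode score of 62.0 would be 0.977 * 62.0 = 60.6
# on the same benchmark.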

What carries the argument

Dual perception modes (focus and standby) with dynamic visual token scaling, plus a dynamically detachable KV-cache for streaming.
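
The standby side of that machinery, as echoed in Figure 2, compresses the full patch sequence into a small set of learnable tokens. The sketch below shows one plausible shape for such a compressor: cross-attention from n learned query tokens onto the patch sequence, followed by a projector. The module name, sizes, and attention form are assumptions for illustration, not the paper's implementation.

# Minimal sketch of a standby-style compressor: n learnable query tokens
# summarize a long visual patch sequence via cross-attention.
# This is an assumed stand-in for the paper's module, not its implementation.
import torch
import torch.nn as nn

class StandbyCompressor(nn.Module):  # hypothetical name
    def __init__(self, dim: int = 1024, n_standby: int = 64, n_heads: int = 8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(n_standby, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.proj = nn.Linear(dim, dim)  # stands in for the duplicated projector

    def forward(self, patch_tokens: torch.Tensor) -> torch.Tensor:
        # patch_tokens: (batch, seq_len, dim) full-resolution visual tokens
        b = patch_tokens.shape[0]
        q = self.queries.unsqueeze(0).expand(b, -1, -1)
        compressed, _ = self.attn(q, patch_tokens, patch_tokens)
        return self.proj(compressed)  # (batch, n_standby, dim)

# Usage: 1,024 patch tokens collapse to 64 standby tokens (a 1/16 reduction here).
tokens = torch.randn(1, 1024, 1024)
print(StandbyCompressor()(tokens).shape)  # torch.Size([1, 64, 1024])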

If this is right

  • Users can dynamically choose between efficiency and accuracy at inference time.
  • Standby mode enables processing of long videos and streams with far fewer tokens while preserving high accuracy on general tasks.
  • The detachable KV-cache allows efficient maintenance of memory across extended visual sequences without full recomputation (see the streaming sketch after this list).
  • The approach lays groundwork for adaptive designs in future models handling long-form visual content.
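
A minimal sketch of that detach-and-migrate flow, mirroring the streaming behavior described in Figure 3: recent frames keep a full-detail cache, and once the local window fills, the oldest full cache is dropped and only a compact standby summary moves to a long-term memory bank. The window size, token counts, and summarize step are placeholders, not the paper's design.

# Sketch of the detach-and-migrate flow for streaming inference.
# Window sizes and the summarize() step are illustrative assumptions.
from collections import deque

LOCAL_WINDOW = 4        # hypothetical number of frames kept in full detail
TOKENS_FOCUS = 256      # hypothetical focus-mode tokens per frame
TOKENS_STANDBY = 16     # hypothetical standby-mode tokens per frame

def summarize(frame_tokens):
    """Stand-in for the standby compression of one frame's tokens."""
    return frame_tokens[:TOKENS_STANDBY]  # placeholder: any fixed-size summary

local_window = deque()   # full-detail (focus) caches for recent frames
memory_bank = []         # compact standby caches for older frames

def ingest(frame_tokens):
    """Add one new frame; spill the oldest full-detail cache to the memory bank."""
    local_window.append(frame_tokens)
    if len(local_window) > LOCAL_WINDOW:
        oldest = local_window.popleft()        # detach the full-detail cache
        memory_bank.append(summarize(oldest))  # migrate only the compact summary

for t in range(10):
    ingest([f"frame{t}_tok{i}" for i in range(TOKENS_FOCUS)])

print(len(local_window), "frames in focus cache,",
      sum(len(s) for s in memory_bank), "tokens in memory bank")
# -> 4 frames in focus cache, 96 tokens in memory bank (6 spilled frames x 16)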

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar mode-switching could extend to other input types like audio or text sequences where token volume varies.
  • Deployment on edge devices might become more feasible if standby mode scales down compute predictably.
  • The human-vision analogy suggests testing against biological benchmarks for token selection patterns.

Load-bearing premise

Dynamic token scaling and mode switching can be implemented natively without adding overhead, and without accuracy losses beyond those observed on the tested long-form tasks.

What would settle it

Test standby mode on a long video where critical details appear only briefly and sparsely; if the reduced token budget misses those details, accuracy would fall below the claimed 97.7 percent floor.
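
A harness for that probe might look like the sketch below: score the same clips in both modes and check whether the retention ratio stays above the lowest value the abstract reports. The evaluate hook and the pass/fail threshold are assumptions layered on the abstract, not a protocol from the paper.

# Sketch of a sparse-detail stress test for standby mode.
# evaluate() is a placeholder for running the model under test on a clip.

CLAIMED_FLOOR = 0.977  # lowest retention reported in the abstract

def evaluate(clip, mode: str) -> float:
    """Placeholder: return task accuracy for a clip under 'focus' or 'standby'."""
    raise NotImplementedError("hook up the model under test here")

def retention(clips) -> float:
    focus = sum(evaluate(c, "focus") for c in clips) / len(clips)
    standby = sum(evaluate(c, "standby") for c in clips) / len(clips)
    return standby / focus

def verdict(clips) -> str:
    r = retention(clips)
    return (f"retention {r:.3f} -- "
            + ("holds up on sparse details" if r >= CLAIMED_FLOOR
               else "drops below the claimed range on sparse details"))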

Figures

Figures reproduced from arXiv: 2604.11627 by Haicheng Wang, Jie Zhou, Le Tian, Weidi Xie, Xiao Zhou, Yanfeng Wang, Yangxiu You, Yikun Liu, Yuan Liu, Zhemeng Yu, Zhongyin Zhao, Zilin Yu.

Figure 1. POINTS-Long: Bridging the Gap between Human Visual Perception and MLLM Scalability. Inspired by humans' adaptive visual processing, POINTS-Long introduces a dual-mode system that switches between a high-fidelity Focus Mode and an efficient Standby Mode, enabling both detailed analysis and long-term streaming understanding with significantly reduced cost.
Figure 2. POINTS-Long Architecture. The original visual patch sequence (blue) is processed by the original ViT modules. We introduce n learnable tokens (orange), processed through duplicated learnable MLPs and a projector, to act as the compressed representation of the full sequence. An additional temporal modeling step allows better compression for video inputs. With a symmetric attention mask, the original path is totally u…
Figure 3. Streaming Inference in LLM. (↑) When handling streaming inputs, general MLLMs discard previously cached context when reaching the maximum budget. (↓) POINTS-Long encodes new frames in Focus Mode. When the local window is full, the original sequence's cache is detached, and the compact standby-sequence cache is migrated to a long-term "Memory Bank".
Figure 4. POINTS1.5-8B-Instruct Architecture. POINTS1.5-8B consists of a native-resolution image encoder (initialized from Qwen2-VL-ViT), a pixel-shuffle projector reducing the token count by a factor of 4, and an LLM initialized from Qwen3-8B-Base. The architecture employs 1D RoPE for the LLM and 2D RoPE for the ViT.
Figure 5. Visualization of Position Encoding. We initialize learnable standby tokens by uniformly sampling RoPE embeddings from the original sequence. We visualize their attention maps in the last ViT layer, marking assigned positions with a yellow square. For clarity, we display only the top 10% of attention weights, where darker red indicates higher intensity. The results reveal a strong localization effect: stand…
Figure 6. Failure case analysis. Standby mode fails on spatial or fine-grained perception while the baseline fails more on temporal and general understanding.
read the original abstract

Multimodal Large Language Models (MLLMs) have recently demonstrated remarkable capabilities in cross-modal understanding and generation. However, the rapid growth of visual token sequences--especially in long-video and streaming scenarios--poses a major challenge to their scalability and real-world deployment. Thus, we introduce POINTS-Long, a native dual-mode MLLM featuring dynamic visual token scaling inspired by the human visual system. The model supports two complementary perception modes: focus mode and standby mode, enabling users to dynamically trade off efficiency and accuracy during inference. On fine-grained visual tasks, the focus mode retains the optimal performance, while on long-form general visual understanding, the standby mode retains 97.7-99.7% of the original accuracy using only 1/40-1/10th of the visual tokens. Moreover, POINTS-Long natively supports streaming visual understanding via a dynamically detachable KV-cache design, allowing efficient maintenance of ultra-long visual memory. Our work provides new insights into the design of future MLLMs and lays the foundation for adaptive and efficient long-form visual understanding.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 1 minor

Summary. The paper introduces POINTS-Long, a native dual-mode MLLM with focus and standby perception modes that dynamically scales visual tokens, inspired by human vision. It claims that standby mode retains 97.7-99.7% of baseline accuracy on long-form general visual understanding tasks while using only 1/40 to 1/10 of the visual tokens, and that a detachable KV-cache enables efficient streaming visual understanding.

Significance. If the empirical results hold under proper controls, the work is significant for addressing visual token scalability in long-video and streaming MLLM applications. The dual-mode architecture and native streaming support provide a concrete mechanism for accuracy-efficiency trade-offs, with potential to influence efficient multimodal model design.

minor comments (1)
  1. The abstract reports specific accuracy retention ranges (97.7-99.7%) and token reduction factors (1/40-1/10) but does not name the long-form tasks, datasets, or baseline models used; adding these references would strengthen verifiability without altering the central claim.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive assessment of POINTS-Long and for recommending minor revision. We appreciate the recognition that the dual-mode architecture and detachable KV-cache provide a concrete mechanism for accuracy-efficiency trade-offs in long-form and streaming MLLM settings.

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper presents an empirical architecture for adaptive dual-mode visual reasoning in MLLMs, with claims resting on reported accuracy retention metrics (97.7-99.7% at reduced token counts) from experimental evaluations on long-form tasks. No mathematical derivations, equations, fitted parameters presented as predictions, or self-citation chains appear in the abstract or described content. The design is motivated by a human-vision analogy but does not reduce any result to its own inputs by construction; performance numbers are direct outcomes of the implemented model rather than self-referential definitions.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no information on training hyperparameters, loss terms, or architectural assumptions; the dual modes and KV-cache design are presented as novel contributions without further breakdown.

pith-pipeline@v0.9.0 · 5525 in / 1025 out tokens · 56976 ms · 2026-05-10T15:16:55.318007+00:00 · methodology

discussion (0)

