POINTS-Long: Adaptive Dual-Mode Visual Reasoning in MLLMs
Pith reviewed 2026-05-10 15:16 UTC · model grok-4.3
The pith
POINTS-Long introduces a dual-mode MLLM that switches between a full-detail focus mode and a low-token standby mode for efficient long-form visual reasoning.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
POINTS-Long is a native dual-mode MLLM featuring dynamic visual token scaling. It supports focus mode for optimal performance on fine-grained tasks and standby mode that retains 97.7-99.7 percent of original accuracy on long-form general visual understanding while using only 1/40 to 1/10 of the visual tokens. It further enables streaming visual understanding through a dynamically detachable KV-cache design for efficient ultra-long visual memory.
What carries the argument
Dual perception modes (focus and standby) with dynamic visual token scaling, plus a dynamically detachable KV-cache for streaming.
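The mode switch can be pictured with a minimal, stdlib-only sketch. This is an illustrative assumption, not the paper's implementation: the function name, the average-pooling scheme, and the `standby_factor` parameter are all hypothetical stand-ins for whatever token-scaling mechanism POINTS-Long actually uses.

```python
# Hypothetical sketch of dual-mode visual token scaling.
# "focus" keeps the full patch-token grid; "standby" average-pools
# groups of consecutive tokens to hit a reduced token budget
# (e.g. standby_factor=10 approximates the paper's 1/10 setting).

def scale_visual_tokens(tokens, mode, standby_factor=10):
    """Return the visual tokens for the requested perception mode.

    tokens: list of patch embeddings (each a list of floats).
    mode: "focus" returns tokens unchanged; "standby" pools every
          `standby_factor` consecutive tokens into their mean.
    """
    if mode == "focus":
        return tokens
    if mode != "standby":
        raise ValueError(f"unknown mode: {mode}")
    pooled = []
    for start in range(0, len(tokens), standby_factor):
        group = tokens[start:start + standby_factor]
        dim = len(group[0])
        pooled.append(
            [sum(t[d] for t in group) / len(group) for d in range(dim)]
        )
    return pooled
```

With 40 one-dimensional tokens, standby mode at factor 10 emits 4 pooled tokens, i.e. a 1/10 budget; the real model presumably learns a far richer compression than naive averaging.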
If this is right
- Users can dynamically choose between efficiency and accuracy at inference time.
- Standby mode enables processing of long videos and streams with far fewer tokens while preserving high accuracy on general tasks.
- The detachable KV-cache allows efficient maintenance of memory across extended visual sequences without full recomputation.
- The approach lays groundwork for adaptive designs in future models handling long-form visual content.
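The detachable-KV-cache point above can be made concrete with a toy sketch. The class name, API, and per-segment bookkeeping here are illustrative assumptions (the paper does not expose its interface); the only property being demonstrated is that one visual segment's cached entries can be dropped without touching or recomputing the rest.

```python
# Toy sketch of a detachable KV-cache (hypothetical API, not the
# paper's): keys/values are grouped by visual segment id, so a stale
# segment can be detached while every other segment's cache survives.

class DetachableKVCache:
    def __init__(self):
        self._segments = {}   # segment_id -> list of (key, value) pairs
        self._order = []      # insertion order of segment ids

    def append(self, segment_id, kv_pairs):
        """Add cached (key, value) pairs for one visual segment."""
        self._segments.setdefault(segment_id, []).extend(kv_pairs)
        if segment_id not in self._order:
            self._order.append(segment_id)

    def detach(self, segment_id):
        """Drop one segment's entries; other segments are untouched."""
        self._segments.pop(segment_id, None)
        if segment_id in self._order:
            self._order.remove(segment_id)

    def flat(self):
        """All remaining (key, value) pairs in insertion order."""
        return [kv for sid in self._order for kv in self._segments[sid]]
```

In a streaming setting this is what avoids full recomputation: detaching an old segment is a dictionary delete, not a forward pass over the surviving history.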
Where Pith is reading between the lines
- Similar mode-switching could extend to other input types like audio or text sequences where token volume varies.
- Deployment on edge devices might become more feasible if standby mode scales down compute predictably.
- The human-vision analogy suggests testing against biological benchmarks for token selection patterns.
Load-bearing premise
Dynamic token scaling and mode switching can be implemented natively without adding overhead or accuracy losses not seen in the tested long-form tasks.
What would settle it
Test standby mode on a long video in which critical details appear briefly and sparsely; if the reduced token budget misses those details, accuracy should fall below the reported 97.7 percent retention floor.
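The retention figure being tested is just standby accuracy as a fraction of focus accuracy; a one-line helper makes the falsification criterion explicit (the example accuracies below are made-up inputs, not results from the paper):

```python
def accuracy_retention(standby_acc, focus_acc):
    """Standby accuracy as a percentage of full-token (focus)
    accuracy -- the metric behind the paper's 97.7-99.7% claim."""
    return 100.0 * standby_acc / focus_acc

# Hypothetical example: focus scores 70.0, standby scores 68.4
# on the same benchmark -> retention of roughly 97.7%.
```

The proposed test would pass if retention on the sparse-detail video stays in the reported 97.7-99.7 band, and would count against the claim if it falls clearly below it.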
Original abstract
Multimodal Large Language Models (MLLMs) have recently demonstrated remarkable capabilities in cross-modal understanding and generation. However, the rapid growth of visual token sequences--especially in long-video and streaming scenarios--poses a major challenge to their scalability and real-world deployment. Thus, we introduce POINTS-Long, a native dual-mode MLLM featuring dynamic visual token scaling inspired by the human visual system. The model supports two complementary perception modes: focus mode and standby mode, enabling users to dynamically trade off efficiency and accuracy during inference. On fine-grained visual tasks, the focus mode retains the optimal performance, while on long-form general visual understanding, the standby mode retains 97.7-99.7% of the original accuracy using only 1/40-1/10th of the visual tokens. Moreover, POINTS-Long natively supports streaming visual understanding via a dynamically detachable KV-cache design, allowing efficient maintenance of ultra-long visual memory. Our work provides new insights into the design of future MLLMs and lays the foundation for adaptive and efficient long-form visual understanding.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces POINTS-Long, a native dual-mode MLLM with focus and standby perception modes that dynamically scales visual tokens, inspired by human vision. It claims that standby mode retains 97.7-99.7% of baseline accuracy on long-form general visual understanding tasks while using only 1/40 to 1/10 of the visual tokens, and that a detachable KV-cache enables efficient streaming visual understanding.
Significance. If the empirical results hold under proper controls, the work is significant for addressing visual token scalability in long-video and streaming MLLM applications. The dual-mode architecture and native streaming support provide a concrete mechanism for accuracy-efficiency trade-offs, with potential to influence efficient multimodal model design.
Minor comments (1)
- The abstract reports specific accuracy retention ranges (97.7-99.7%) and token reduction factors (1/40-1/10) but does not name the long-form tasks, datasets, or baseline models used; adding these references would strengthen verifiability without altering the central claim.
Simulated Author's Rebuttal
We thank the referee for the positive assessment of POINTS-Long and for recommending minor revision. We appreciate the recognition that the dual-mode architecture and detachable KV-cache provide a concrete mechanism for accuracy-efficiency trade-offs in long-form and streaming MLLM settings.
Circularity Check
No significant circularity
Full rationale
The paper presents an empirical architecture for adaptive dual-mode visual reasoning in MLLMs, with claims resting on reported accuracy retention metrics (97.7-99.7% at reduced token counts) from experimental evaluations on long-form tasks. No mathematical derivations, equations, fitted parameters presented as predictions, or self-citation chains appear in the abstract or described content. The design is motivated by a human-vision analogy but does not reduce any result to its own inputs by construction; performance numbers are direct outcomes of the implemented model rather than self-referential definitions.