Visionllm v2: An end-to-end general- ist multimodal large language model for hundreds of vision- language tasks

Jiannan Wu, Muyan Zhong, Sen Xing, Zeqiang Lai, Zhaoyang Liu, Zhe Chen, Wenhai Wang, Xizhou Zhu, Lewei Lu, Tong Lu, et al · 2024

1 Pith paper cite this work. Polarity classification is still indexing.

1 Pith paper citing it

browse 1 citing papers

representative citing papers

cs.CV · 2025-12-11 · unverdicted · novelty 5.0

GETok partitions images with grid tokens and refines locations via offset tokens to enable better native 2D spatial reasoning in MLLMs.

Showing 1 of 1 citing paper.

Grounding Everything in Tokens for Multimodal Large Language Models cs.CV · 2025-12-11 · unverdicted · none · ref 70
GETok partitions images with grid tokens and refines locations via offset tokens to enable better native 2D spatial reasoning in MLLMs.