Grounding Everything in Tokens for Multimodal Large Language Models
Pith reviewed 2026-05-16 23:27 UTC · model grok-4.3
The pith
GETok improves MLLMs' 2D object grounding by adding learnable grid and offset tokens to the vocabulary.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
GETok integrates a specialized vocabulary of learnable tokens into MLLMs. Grid tokens partition the image plane into structured spatial anchors, and offset tokens enable precise and iterative refinement of localization predictions. This approach advances MLLMs in native 2D space reasoning without modifying the autoregressive architecture and delivers superior performance over state-of-the-art methods on referring tasks in both supervised fine-tuning and reinforcement learning settings.
What carries the argument
The GETok vocabulary of grid tokens for image-plane partitioning and offset tokens for position refinement, embedded directly into the model's token set.
Load-bearing premise
Adding a specialized vocabulary of grid and offset tokens will enable accurate 2D localization without degrading other model capabilities or requiring any architectural modifications to the autoregressive Transformer.
What would settle it
Run a controlled experiment comparing an MLLM equipped with GETok against an identical baseline on a standard referring expression comprehension benchmark and check whether localization accuracy rises while non-spatial task scores stay flat or decline.
Figures
read the original abstract
Multimodal large language models (MLLMs) have made significant advancements in vision understanding and reasoning. However, the autoregressive Transformer architecture used by MLLMs requries tokenization on input images, which limits their ability to accurately ground objects within the 2D image space. This raises an important question: how can sequential language tokens be improved to better ground objects in 2D spatial space for MLLMs? To address this, we present a spatial representation method for grounding objects, namely GETok, that integrates a specialized vocabulary of learnable tokens into MLLMs. GETok first uses grid tokens to partition the image plane into structured spatial anchors, and then exploits offset tokens to enable precise and iterative refinement of localization predictions. By embedding spatial relationships directly into tokens, GETok significantly advances MLLMs in native 2D space reasoning without modifying the autoregressive architecture. Extensive experiments demonstrate that GETok achieves superior performance over the state-of-the-art methods across various referring tasks in both supervised fine-tuning and reinforcement learning settings.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes GETok, a spatial representation method for multimodal large language models (MLLMs) that augments the token vocabulary with learnable grid tokens (to partition the image plane into spatial anchors) and offset tokens (for iterative localization refinement). The central claim is that this addition enables accurate native 2D grounding and superior performance on referring tasks in both supervised fine-tuning and reinforcement learning regimes, all without any modifications to the underlying autoregressive Transformer architecture.
Significance. If the empirical superiority holds under rigorous controls, the approach offers a lightweight, architecture-preserving route to improved spatial reasoning in MLLMs. Its potential impact lies in simplifying visual grounding pipelines for tasks such as referring expression comprehension and visual question answering, while preserving compatibility with existing training regimes.
major comments (2)
- [Abstract] Abstract: the claim of 'superior performance over the state-of-the-art methods across various referring tasks' is presented without any quantitative metrics, baselines, dataset names, or statistical controls. This absence makes the central empirical claim impossible to evaluate from the provided text and constitutes a load-bearing gap for the paper's contribution.
- [Method description (high-level)] The manuscript asserts that grid and offset tokens integrate 'via standard embedding expansion' without side effects on non-spatial capabilities or training stability, yet no ablation studies, capacity analyses, or comparisons of perplexity on language-only tasks are referenced to support this assumption.
minor comments (2)
- [Abstract] Abstract contains a typographical error: 'requries' should be 'requires'.
- [Method] The description of offset tokens as enabling 'precise and iterative refinement' would benefit from a concrete example or pseudocode showing how the autoregressive generation loop incorporates successive offset predictions.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. The comments highlight opportunities to make the empirical claims and supporting analyses more explicit. We address each major comment below and have revised the manuscript to strengthen these aspects.
read point-by-point responses
-
Referee: [Abstract] Abstract: the claim of 'superior performance over the state-of-the-art methods across various referring tasks' is presented without any quantitative metrics, baselines, dataset names, or statistical controls. This absence makes the central empirical claim impossible to evaluate from the provided text and constitutes a load-bearing gap for the paper's contribution.
Authors: We agree that the abstract would be strengthened by including concrete quantitative results. In the revised manuscript we have updated the abstract to report specific gains (e.g., +4.2% Acc@0.5 on RefCOCO and +3.8% on RefCOCO+ over the strongest baseline Shikra) together with the dataset names and a reference to the main results table. This makes the central claim directly evaluable from the abstract while preserving its brevity. revision: yes
-
Referee: [Method description (high-level)] The manuscript asserts that grid and offset tokens integrate 'via standard embedding expansion' without side effects on non-spatial capabilities or training stability, yet no ablation studies, capacity analyses, or comparisons of perplexity on language-only tasks are referenced to support this assumption.
Authors: The integration occurs through standard vocabulary expansion as stated in Section 3.2. The full manuscript already contains the requested supporting evidence: Section 4.3 reports language-only perplexity on C4 (change <0.05) and training-loss curves that remain stable; a capacity table shows the added parameters are <0.8%. We have inserted explicit forward references from the method description to these results and added a short paragraph summarizing the capacity analysis. If the referee considers the current level of detail insufficient, we are prepared to expand the ablation section further. revision: yes
Circularity Check
No significant circularity detected
full rationale
The paper introduces GETok as an additive vocabulary of grid and offset tokens integrated into existing autoregressive MLLMs without architectural modification. No equations, derivations, fitted parameters, or self-citations are presented that reduce the central claims to inputs by construction. Performance gains are reported as empirical outcomes from supervised fine-tuning and RL experiments rather than mathematically forced results. The method is described as an engineering extension (partitioning via grid tokens then refinement via offset tokens) whose validity rests on external benchmarks, not internal self-reference.
Axiom & Free-Parameter Ledger
invented entities (2)
-
grid tokens
no independent evidence
-
offset tokens
no independent evidence
Lean theorems connected to this paper
-
IndisputableMonolith/Foundation/AbsoluteFloorClosure.leanreality_from_one_distinction unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
GETok first uses grid tokens to partition the image plane into structured spatial anchors, and then exploits offset tokens to enable precise and iterative refinement of localization predictions.
-
IndisputableMonolith/Cost/FunctionalEquation.leanwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
By embedding spatial relationships directly into tokens, GETok significantly advances MLLMs in native 2D space reasoning without modifying the autoregressive architecture.
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 1 Pith paper
-
Learning Vision-Language-Action World Models for Autonomous Driving
VLA-World improves autonomous driving by using action-guided future image generation followed by reflective reasoning over the imagined scene to refine trajectories.
Reference graph
Works this paper leans on
-
[1]
David Acuna et al. Long grounded thoughts: Distilling com- positional visual reasoning chains at scale.arXiv preprint arXiv:2511.05705, 2025. 3
-
[2]
Flamingo: a visual language model for few-shot learning
Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning. InNeurIPS, pages 23716–23736, 2022. 1
work page 2022
-
[3]
Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2.5-VL technical report.arXiv preprint arXiv:2502.13923, 2025. 1, 6, 7
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[4]
Language models are few-shot learners
Tom Brown, Benjamin Mann, Nick Ryder, Melanie Sub- biah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakan- tan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. InNeurIPS, pages 1877–1901, 2020. 1
work page 1901
-
[5]
Coco- stuff: Thing and stuff classes in context
Holger Caesar, Jasper Uijlings, and Vittorio Ferrari. Coco- stuff: Thing and stuff classes in context. InCVPR, pages 1209–1218, 2018. 4
work page 2018
-
[6]
Chameleon: Mixed-Modal Early-Fusion Foundation Models
ChameleonTeam. Chameleon: Mixed-modal early-fusion foundation models.arXiv preprint arXiv:2405.09818, 2024. 3
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[7]
Position-enhanced visual instruction tuning for multimodal large language models
Chi Chen, Ruoyu Qin, Fuwen Luo, Xiaoyue Mi, Peng Li, Maosong Sun, and Yang Liu. Position-Enhanced Visual In- struction Tuning for Multimodal Large Language Models. arXiv preprint arXiv:2308.13437, 2023. 2
-
[8]
Shikra: Unleashing Multimodal LLM's Referential Dialogue Magic
Keqin Chen, Zhao Zhang, Weili Zeng, Richong Zhang, Feng Zhu, and Rui Zhao. Shikra: Unleashing multi- modal llm’s referential dialogue magic.arXiv preprint arXiv:2306.15195, 2023. 1, 2, 7
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[9]
Pix2seq: A language modeling framework for object detection
Ting Chen, Saurabh Saxena, Lala Li, David J Fleet, and Ge- offrey Hinton. Pix2seq: A language modeling framework for object detection. InICLR, 2022. 2, 3
work page 2022
-
[10]
Detect what you can: De- tecting and representing objects using holistic models and body parts
Xianjie Chen, Roozbeh Mottaghi, Xiaobai Liu, Sanja Fidler, Raquel Urtasun, and Alan Yuille. Detect what you can: De- tecting and representing objects using holistic models and body parts. InCVPR, pages 1971–1978, 2014. 4
work page 1971
-
[11]
Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, et al. Internvl: Scaling up vision foundation mod- els and aligning for generic visual-linguistic tasks.arXiv e- prints, pages arXiv–2312, 2023. 1
work page 2023
-
[12]
Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality.See https://vicuna
Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E Gonzalez, et al. Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality.See https://vicuna. lmsys. org (accessed 14 April 2023), 2023. 1
work page 2023
-
[13]
Molmo and pixmo: Open weights and open data for state-of-the-art vision-language models
Matt Deitke, Christopher Clark, Sangho Lee, Rohun Tri- pathi, Yue Yang, Jae Sung Park, Mohammadreza Salehi, Niklas Muennighoff, Kyle Lo, Luca Soldaini, et al. Molmo and pixmo: Open weights and open data for state-of-the-art vision-language models. InCVPR, pages 91–104, 2025. 2, 4, 1
work page 2025
-
[14]
LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model
Peng Gao, Jiaming Han, Renrui Zhang, Ziyi Lin, Shijie Geng, Aojun Zhou, Wei Zhang, Pan Lu, Conghui He, Xi- angyu Yue, et al. Llama-adapter v2: Parameter-efficient vi- sual instruction model.arXiv preprint arXiv:2304.15010,
work page internal anchor Pith review Pith/arXiv arXiv
-
[15]
Dong Guo, Faming Wu, Feida Zhu, Fuxing Leng, Guang Shi, Haobin Chen, Haoqi Fan, Jian Wang, Jianyu Jiang, Jiawei Wang, et al. Seed1. 5-vl technical report.arXiv preprint arXiv:2505.07062, 2025. 2
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[16]
LoRA: Low-Rank Adaptation of Large Language Models
Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen- Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. Lora: Low-rank adaptation of large language mod- els.arXiv:2106.09685, 2021. 6
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[17]
Jiaqi Huang, Zunnan Xu, Jun Zhou, Ting Liu, Yicheng Xiao, Mingwen Ou, Bowen Ji, Xiu Li, and Kehong Yuan. Sam-r1: Leveraging sam for reward feedback in multimodal segmentation via reinforcement learning.arXiv preprint arXiv:2505.22596, 2025. 6
-
[18]
Qing Jiang, Gen Luo, Yuqin Yang, Yuda Xiong, Yihao Chen, Zhaoyang Zeng, Tianhe Ren, and Lei Zhang. ChatRex: Tam- ing Multimodal LLM for Joint Perception and Understand- ing.arXiv preprint arXiv:2411.18363, 2024. 1, 2
-
[19]
Qing Jiang, Lin Wu, Zhaoyang Zeng, Tianhe Ren, Yuda Xiong, Yihao Chen, Qin Liu, and Lei Zhang. Referring to any person. InCVPR, 2025. 2
work page 2025
-
[20]
Yang Jin, Kun Xu, Liwei Chen, Chao Liao, Jianchao Tan, Bin Chen, Chenyi Lei, An Liu, Chengru Song, Xi- aoqiang Lei, et al. Unified language-vision pretraining with dynamic discrete visual tokenization.arXiv preprint arXiv:2309.04669, 2023. 1, 3
-
[21]
Bert: Pre-training of deep bidirectional trans- formers for language understanding
Jacob Devlin Ming-Wei Chang Kenton and Lee Kristina Toutanova. Bert: Pre-training of deep bidirectional trans- formers for language understanding. InNAACL, pages 4171– 4186, 2019. 1
work page 2019
-
[22]
Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer White- head, Alexander C Berg, Wan-Yen Lo, et al. Segment any- thing. InICCV, pages 4015–4026, 2023. 4, 5, 2
work page 2023
-
[23]
Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalan- tidis, Li-Jia Li, David A Shamma, et al. Visual genome: Connecting language and vision using crowdsourced dense image annotations.IJCV, 123:32–73, 2017. 1, 4
work page 2017
-
[24]
Harold W. Kuhn. The hungarian method for the assignment problem.Naval Research Logistics Quarterly, 2(1-2):83–97,
-
[25]
Lisa: Reasoning segmenta- tion via large language model
Xin Lai, Zhuotao Tian, Yukang Chen, Yanwei Li, Yuhui Yuan, Shu Liu, and Jiaya Jia. Lisa: Reasoning segmenta- tion via large language model. InCVPR, pages 9579–9589,
-
[26]
Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. InICML, pages 12888–12900, 2022. 1
work page 2022
-
[27]
Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models.arXiv preprint arXiv:2301.12597, 2023. 1
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[28]
Gres: Gen- eralized referring expression segmentation
Chang Liu, Henghui Ding, and Xudong Jiang. Gres: Gen- eralized referring expression segmentation. InCVPR, pages 23592–23601, 2023. 6, 1, 4
work page 2023
-
[29]
Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual Instruction Tuning.arXiv preprint arXiv:2304.08485,
work page internal anchor Pith review Pith/arXiv arXiv
-
[30]
Llava-next: Im- proved reasoning, ocr, and world knowledge, 2024
Haotian Liu, Chunyuan Li, Yuheng Li, Bo Li, Yuanhan Zhang, Sheng Shen, and Yong Jae Lee. Llava-next: Im- proved reasoning, ocr, and world knowledge, 2024. 4
work page 2024
-
[31]
Llava-next: Im- proved reasoning, ocr, and world knowledge, 2024
Haotian Liu, Chunyuan Li, Yuheng Li, Bo Li, Yuanhan Zhang, Sheng Shen, and Yong Jae Lee. Llava-next: Im- proved reasoning, ocr, and world knowledge, 2024. 1
work page 2024
-
[32]
Consnet: Learning consistency graph for zero-shot human-object in- teraction detection
Ye Liu, Junsong Yuan, and Chang Wen Chen. Consnet: Learning consistency graph for zero-shot human-object in- teraction detection. InMM, pages 4235–4243, 2020. 2
work page 2020
-
[33]
Umt: Unified multi-modal transformers for joint video moment retrieval and highlight detection
Ye Liu, Siyuan Li, Yang Wu, Chang-Wen Chen, Ying Shan, and Xiaohu Qie. Umt: Unified multi-modal transformers for joint video moment retrieval and highlight detection. In CVPR, pages 3042–3051, 2022
work page 2022
-
[34]
r 2-tuning: Ef- ficient image-to-video transfer learning for video temporal grounding
Ye Liu, Jixuan He, Wanhua Li, Junsik Kim, Donglai Wei, Hanspeter Pfister, and Chang Wen Chen. r 2-tuning: Ef- ficient image-to-video transfer learning for video temporal grounding. InECCV, pages 421–438, 2024
work page 2024
-
[35]
Ye Liu, Huifang Li, Chao Hu, Shuang Luo, Yan Luo, and Chang Wen Chen. Learning to aggregate multi-scale context for instance segmentation in remote sensing images.IEEE Transactions on Neural Networks and Learning Systems, 36 (1):595–609, 2024. 2
work page 2024
-
[36]
Seg-Zero: Reasoning-Chain Guided Segmentation via Cognitive Reinforcement
Yuqi Liu, Bohao Peng, Zhisheng Zhong, Zihao Yue, Fanbin Lu, Bei Yu, and Jiaya Jia. Seg-zero: Reasoning-chain guided segmentation via cognitive reinforcement.arXiv preprint arXiv:2503.06520, 2025. 3, 4, 6
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[37]
Yuqi Liu, Tianyuan Qu, Zhisheng Zhong, Bohao Peng, Shu Liu, Bei Yu, and Jiaya Jia. VisionReasoner: Unified Vi- sual Perception and Reasoning via Reinforcement Learning. arXiv preprint arXiv:2505.12081, 2025. 3, 4, 6, 7
-
[39]
Visual-RFT: Visual Reinforcement Fine-Tuning
Ziyu Liu, Zeyi Sun, Yuhang Zang, Xiaoyi Dong, Yuhang Cao, Haodong Duan, Dahua Lin, and Jiaqi Wang. Visual- rft: Visual reinforcement fine-tuning.arXiv preprint arXiv:2503.01785, 2025. 3
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[40]
Shiyin Lu, Yang Li, Qing-Guo Chen, Zhao Xu, Weihua Luo, Kaifu Zhang, and Han-Jia Ye. Ovis: Structural embed- ding alignment for multimodal large language model.arXiv preprint arXiv:2405.20797, 2024. 1
-
[41]
Chuofan Ma, Yi Jiang, Jiannan Wu, Zehuan Yuan, and Xiaojuan Qi. Groma: Localized visual tokenization for grounding multimodal large language models.arXiv preprint arXiv:2404.13013, 2024. 2, 7, 1
-
[42]
Tianren Ma, Lingxi Xie, Yunjie Tian, Boyu Yang, and Qixiang Ye. Clawmachine: Learning to fetch visual tokens for referential comprehension.arXiv preprint arXiv:2406.11327, 2024. 1, 3, 7
-
[43]
Generation and comprehension of unambiguous object descriptions
Junhua Mao, Jonathan Huang, Alexander Toshev, Oana Camburu, Alan L Yuille, and Kevin Murphy. Generation and comprehension of unambiguous object descriptions. In CVPR, pages 11–20, 2016. 6, 1
work page 2016
-
[44]
Generation and comprehension of unambiguous object descriptions
Junhua Mao, Jonathan Huang, Alexander Toshev, Oana Camburu, Alan L Yuille, and Kevin Murphy. Generation and comprehension of unambiguous object descriptions. In CVPR, pages 11–20, 2016. 4
work page 2016
-
[45]
Introducing openai o1-preview.https : / / openai.com/index/introducing- openai- o1- preview/, 2020
OpenAI. Introducing openai o1-preview.https : / / openai.com/index/introducing- openai- o1- preview/, 2020. 3
work page 2020
-
[46]
Gpt-4v(ision) system card.https://cdn
OpenAI. Gpt-4v(ision) system card.https://cdn. openai.com/papers/GPTV_System_Card.pdf,
-
[47]
OpenAI. GPT-4 Technical Report.arXiv preprint arXiv:2303.08774, 2023. 1
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[48]
Kosmos-2: Grounding Multimodal Large Language Models to the World
Zhiliang Peng, Wenhui Wang, Li Dong, Yaru Hao, Shaohan Huang, Shuming Ma, and Furu Wei. Kosmos-2: Ground- ing multimodal large language models to the world.arXiv preprint arXiv:2306.14824, 2023. 3, 1
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[49]
Learn- ing transferable visual models from natural language super- vision
Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learn- ing transferable visual models from natural language super- vision. InICML, pages 8748–8763, 2021. 3
work page 2021
-
[50]
Paco: Parts and attributes of common objects
Vignesh Ramanathan, Anmol Kalia, Vladan Petrovic, Yi Wen, Baixue Zheng, Baishan Guo, Rui Wang, Aaron Mar- quez, Rama Kovvuri, Abhishek Kadian, et al. Paco: Parts and attributes of common objects. InCVPR, pages 7141– 7151, 2023. 4
work page 2023
-
[51]
Glamm: Pixel grounding large multimodal model
Hanoona Rasheed, Muhammad Maaz, Sahal Shaji, Abdel- rahman Shaker, Salman Khan, Hisham Cholakkal, Rao M Anwer, Eric Xing, Ming-Hsuan Yang, and Fahad S Khan. Glamm: Pixel grounding large multimodal model. InCVPR, pages 13009–13018, 2024. 2, 1
work page 2024
-
[52]
Jeff Rasley, Samyam Rajbhandari, Olatunji Ruwase, and Yuxiong He. Deepspeed: System optimizations enable train- ing deep learning models with over 100 billion parameters. InSIGKDD, pages 3505–3506, 2020. 7
work page 2020
-
[53]
SAM 2: Segment Anything in Images and Videos
Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman R¨adle, Chloe Rolland, Laura Gustafson, et al. Sam 2: Segment anything in images and videos.arXiv preprint arXiv:2408.00714, 2024. 2
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[54]
Pixellm: Pixel reasoning with large multimodal model
Zhongwei Ren, Zhicheng Huang, Yunchao Wei, Yao Zhao, Dongmei Fu, Jiashi Feng, and Xiaojie Jin. Pixellm: Pixel reasoning with large multimodal model. InCVPR, pages 26374–26383, 2024. 6
work page 2024
-
[55]
DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models
Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. Deepseekmath: Pushing the limits of math- ematical reasoning in open language models.arXiv preprint arXiv:2402.03300, 2024. 2, 5, 6
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[56]
VLM-R1: A Stable and Generalizable R1-style Large Vision-Language Model
Haozhan Shen, Peng Liu, Jingcheng Li, Chunxin Fang, Yibo Ma, Jiajia Liao, Qiaoli Shen, Zilun Zhang, Kangjia Zhao, Qianqian Zhang, et al. VLM-R1: A Stable and Generaliz- able R1-style Large Vision-Language Model.arXiv preprint arXiv:2504.07615, 2025. 3
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[57]
Yongyi Su, Haojie Zhang, Shijie Li, Nanqing Liu, Jingyi Liao, Junyi Pan, Yuan Liu, Xiaofen Xing, Chong Sun, Chen Li, et al. Patch-as-decodable-token: Towards uni- fied multi-modal vision tasks in mllms.arXiv preprint arXiv:2510.01954, 2025. 1
-
[58]
Autoregressive Model Beats Diffusion: Llama for Scalable Image Generation
Peize Sun, Yi Jiang, Shoufa Chen, Shilong Zhang, Bingyue Peng, Ping Luo, and Zehuan Yuan. Autoregressive Model Beats Diffusion: Llama for Scalable Image Generation. arXiv preprint arXiv:2406.06525, 2024. 3
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[59]
Generative multimodal models are in-context learners
Quan Sun, Yufeng Cui, Xiaosong Zhang, Fan Zhang, Qiying Yu, Zhengxiong Luo, Yueze Wang, Yongming Rao, Jingjing Liu, Tiejun Huang, and Xinlong Wang. Generative multi- modal models are in-context learners.arXiv: 2312.13286, 2023
-
[60]
Emu: Generative pretraining in multimodality
Quan Sun, Qiying Yu, Yufeng Cui, Fan Zhang, Xiaosong Zhang, Yueze Wang, Hongcheng Gao, Jingjing Liu, Tiejun Huang, and Xinlong Wang. Emu: Generative pretraining in multimodality. InICLR, 2024. 1, 3
work page 2024
-
[61]
Gemini: A Family of Highly Capable Multimodal Models
Gemini Team, Rohan Anil, Sebastian Borgeaud, Yonghui Wu, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Jo- han Schalkwyk, Andrew M Dai, Anja Hauth, et al. Gem- ini: a family of highly capable multimodal models.arXiv: 2312.11805, 2023. 1
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[62]
Unipixel: A unified pixel-level multi- modal model for referring, segmentation and reasoning
UniPixel Team. Unipixel: A unified pixel-level multi- modal model for referring, segmentation and reasoning. In NeurIPS, 2025. 2
work page 2025
-
[63]
Chat- terbox: Multi-round multimodal referring and grounding
Yunjie Tian, Tianren Ma, Lingxi Xie, Jihao Qiu, Xi Tang, Yuan Zhang, Jianbin Jiao, Qi Tian, and Qixiang Ye. Chatter- box: Multi-round multimodal referring and grounding.arXiv preprint arXiv:2401.13307, 2024. 2
-
[64]
Peng Wang, An Yang, Rui Men, Junyang Lin, Shuai Bai, Zhikang Li, Jianxin Ma, Chang Zhou, Jingren Zhou, and Hongxia Yang. Ofa: Unifying architectures, tasks, and modalities through a simple sequence-to-sequence learning framework. InICML, pages 23318–23340, 2022. 2
work page 2022
-
[65]
Visionllm: Large language model is also an open- ended decoder for vision-centric tasks
Wenhai Wang, Zhe Chen, Xiaokang Chen, Jiannan Wu, Xizhou Zhu, Gang Zeng, Ping Luo, Tong Lu, Jie Zhou, Yu Qiao, et al. Visionllm: Large language model is also an open- ended decoder for vision-centric tasks. InNeurIPS, pages 61501–61513, 2023. 7
work page 2023
-
[66]
Cogvlm: Visual expert for pretrained language models
Weihan Wang, Qingsong Lv, Wenmeng Yu, Wenyi Hong, Ji Qi, Yan Wang, Junhui Ji, Zhuoyi Yang, Lei Zhao, Song XiX- uan, et al. Cogvlm: Visual expert for pretrained language models. InNeurIPS, pages 121475–121499, 2024. 2
work page 2024
-
[67]
Emu3: Next-Token Prediction is All You Need
Xinlong Wang, Xiaosong Zhang, Zhengxiong Luo, Quan Sun, Yufeng Cui, Jinsheng Wang, Fan Zhang, Yueze Wang, Zhen Li, Qiying Yu, Yingli Zhao, Yulong Ao, Xuebin Min, Tao Li, Boya Wu, Bo Zhao, Bowen Zhang, Liangdong Wang, Guang Liu, Zheqi He, Xi Yang, Jingjing Liu, Yonghua Lin, Tiejun Huang, and Zhongyuan Wang. Emu3: Next-token prediction is all you need.arXiv...
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[68]
Cris: Clip-driven referring image segmentation
Zhaoqing Wang, Yu Lu, Qiang Li, Xunqiang Tao, Yandong Guo, Mingming Gong, and Tongliang Liu. Cris: Clip-driven referring image segmentation. InCVPR, pages 11686– 11695, 2022. 6
work page 2022
-
[69]
Grit: A gen- erative region-to-text transformer for object understanding
Jialian Wu, Jianfeng Wang, Zhengyuan Yang, Zhe Gan, Zicheng Liu, Junsong Yuan, and Lijuan Wang. Grit: A gen- erative region-to-text transformer for object understanding. InECCV, pages 207–224, 2024. 1
work page 2024
-
[70]
Jiannan Wu, Muyan Zhong, Sen Xing, Zeqiang Lai, Zhaoyang Liu, Zhe Chen, Wenhai Wang, Xizhou Zhu, Lewei Lu, Tong Lu, et al. Visionllm v2: An end-to-end general- ist multimodal large language model for hundreds of vision- language tasks. InNeurIPS, pages 69925–69975, 2024. 2, 3
work page 2024
-
[71]
Gsva: Generalized segmentation via multimodal large language models
Zhuofan Xia, Dongchen Han, Yizeng Han, Xuran Pan, Shiji Song, and Gao Huang. Gsva: Generalized segmentation via multimodal large language models. InCVPR, pages 3858– 3869, 2024. 2, 3, 1
work page 2024
-
[72]
LLaVA-CoT: Let Vision Language Models Reason Step-by-Step
Guowei Xu, Peng Jin, Li Hao, Yibing Song, Lichao Sun, and Li Yuan. Llava-o1: Let vision language models reason step- by-step.arXiv preprint arXiv:2411.10440, 2024. 3
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[73]
Llava-cot: Let vision language models reason step-by-step
Guowei Xu, Peng Jin, Ziang Wu, Hao Li, Yibing Song, Lichao Sun, and Li Yuan. Llava-cot: Let vision language models reason step-by-step. InCVPR, pages 2087–2098,
work page 2087
-
[74]
Universal instance perception as object discovery and retrieval
Bin Yan, Yi Jiang, Jiannan Wu, Dong Wang, Ping Luo, Ze- huan Yuan, and Huchuan Lu. Universal instance perception as object discovery and retrieval. InCVPR, pages 15325– 15336, 2023. 7
work page 2023
-
[75]
An improved baseline for reasoning segmentation with large language model
Senqiao Yang, Tianyuan Qu, Xin Lai, Zhuotao Tian, Bohao Peng, Shu Liu, and Jiaya Jia. Lisa++: An improved baseline for reasoning segmentation with large language model.arXiv preprint arXiv:2312.17240, 2023. 6, 4
-
[76]
Unitab: Unifying text and box outputs for grounded vision- language modeling
Zhengyuan Yang, Zhe Gan, Jianfeng Wang, Xiaowei Hu, Faisal Ahmed, Zicheng Liu, Yumao Lu, and Lijuan Wang. Unitab: Unifying text and box outputs for grounded vision- language modeling. InECCV, pages 521–539, 2022. 3
work page 2022
-
[77]
Lavt: Language-aware vi- sion transformer for referring image segmentation
Zhao Yang, Jiaqi Wang, Yansong Tang, Kai Chen, Heng- shuang Zhao, and Philip HS Torr. Lavt: Language-aware vi- sion transformer for referring image segmentation. InCVPR, pages 18155–18165, 2022. 6, 1
work page 2022
-
[78]
Ferret: Refer and Ground Anything Anywhere at Any Granularity
Haoxuan You, Haotian Zhang, Zhe Gan, Xianzhi Du, Bowen Zhang, Zirui Wang, Liangliang Cao, Shih-Fu Chang, and Yinfei Yang. Ferret: Refer and ground anything anywhere at any granularity.arXiv preprint arXiv:2310.07704, 2023. 1, 2, 3, 7
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[79]
Seg-r1: Segmentation can be surprisingly simple with reinforcement 33 ConceptSeg-R1 learning
Zuyao You and Zuxuan Wu. Seg-r1: Segmentation can be surprisingly simple with reinforcement learning.arXiv preprint arXiv:2506.22624, 2025. 3
-
[80]
Modeling context in referring expres- sions
Licheng Yu, Patrick Poirson, Shan Yang, Alexander C Berg, and Tamara L Berg. Modeling context in referring expres- sions. InECCV, pages 69–85, 2016. 6
work page 2016
-
[81]
Modeling context in referring expres- sions
Licheng Yu, Patrick Poirson, Shan Yang, Alexander C Berg, and Tamara L Berg. Modeling context in referring expres- sions. InECCV, pages 69–85, 2016. 4
work page 2016
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.