pith. machine review for the scientific record.

arxiv: 2402.03766 · v1 · pith:SCPEKNYMnew · submitted 2024-02-06 · 💻 cs.CV · cs.AI

MobileVLM V2: Faster and Stronger Baseline for Vision Language Model

Pith reviewed 2026-05-18 15:21 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords vision language models · mobile AI · efficient multimodal models · model scaling laws · benchmark evaluation · architectural improvements · dataset curation

The pith

MobileVLM V2 shows that 1.7B and 3B vision-language models can match or surpass much larger systems on standard benchmarks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces MobileVLM V2, an improved family of vision language models built for mobile settings. It claims that a careful mix of new architectural choices, training methods adjusted for limited hardware, and high-quality data selection produces clear performance lifts. This would matter because it points to a practical path for running capable multimodal AI on phones and other small devices instead of relying on large cloud servers. A sympathetic reader would want to know whether these gains let smaller models handle real tasks without the compute cost of bigger alternatives.

Core claim

MobileVLM V2 establishes that a delicate orchestration of novel architectural design, an improved training scheme tailored for mobile VLMs, and rich high-quality dataset curation can substantially benefit VLMs' performance. The 1.7B model achieves better or on-par results compared with much larger VLMs at the 3B scale, while the 3B model outperforms a large variety of VLMs at the 7B+ scale.

What carries the argument

The orchestration of novel architectural design, mobile-tailored training scheme, and high-quality dataset curation that together support strong results at reduced model sizes.

If this is right

  • A 1.7 billion parameter vision-language model can equal or exceed the benchmark results of many 3 billion parameter systems.
  • A 3 billion parameter model can surpass the results of many models at 7 billion parameters and above.
  • Vision language models can be made efficient enough for direct use on mobile hardware while retaining competitive accuracy.
  • Dataset curation and training adjustments matter as much as raw parameter count for multimodal performance.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same combination of changes could be tested on other multimodal tasks such as video understanding to check for similar size reductions.
  • Teams building practical AI applications might shift focus toward data selection and device-specific training rather than always increasing model scale.
  • Further model compression experiments could start from these designs to explore even smaller footprints for edge devices.

Load-bearing premise

The performance gains stem directly from the described architectural, training, and data choices rather than from hidden tuning or selection of favorable test sets.

What would settle it

An independent training run of the same architectures using only public datasets, followed by evaluation on a fresh set of vision-language tasks never seen during development.

Original abstract

We introduce MobileVLM V2, a family of significantly improved vision language models upon MobileVLM, which proves that a delicate orchestration of novel architectural design, an improved training scheme tailored for mobile VLMs, and rich high-quality dataset curation can substantially benefit VLMs' performance. Specifically, MobileVLM V2 1.7B achieves better or on-par performance on standard VLM benchmarks compared with much larger VLMs at the 3B scale. Notably, our 3B model outperforms a large variety of VLMs at the 7B+ scale. Our models will be released at https://github.com/Meituan-AutoML/MobileVLM .

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces MobileVLM V2, a family of vision-language models improving on MobileVLM. It claims that a delicate orchestration of novel architectural design, an improved mobile-tailored training scheme, and rich high-quality dataset curation substantially boosts performance. Specifically, the 1.7B model achieves better or on-par results versus 3B-scale VLMs on standard benchmarks, while the 3B model outperforms many 7B+ VLMs; models will be released publicly.

Significance. If the results hold under rigorous verification, the work would be significant for efficient VLMs by showing that smaller-scale models can compete with larger ones via targeted design and curation. This has clear implications for mobile and edge deployment. The planned model release is a strength that supports reproducibility.

major comments (2)
  1. [Experiments] Experiments section: Performance claims (e.g., 1.7B vs. 3B and 3B vs. 7B+) are presented without specifying exact baselines (re-implemented or literature-reported), evaluation prompts/settings, error bars, statistical significance, or data splits. This is load-bearing for the central comparison claims and leaves the reported deltas difficult to verify.
  2. [Section 3] Section 3 and dataset description: The paper stresses 'rich high-quality dataset curation' as a key ingredient alongside architecture and training, yet provides no ablations isolating data effects from the proposed architectural tweaks and mobile-tailored training. Without such controls or details on the exact training mixture relative to baselines, attribution of gains to the orchestration remains under-supported.
minor comments (1)
  1. [Abstract] Abstract: 'Standard VLM benchmarks' is mentioned but not enumerated; adding the primary evaluation datasets (e.g., VQAv2, GQA) would improve clarity.
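
Major comment 1 asks for error bars and statistical significance on the reported deltas. One inexpensive way to attach uncertainty to a benchmark difference, without any retraining, is a paired bootstrap over test examples. The sketch below is illustrative only and is not a procedure taken from the paper or the referee report; the function name `paired_bootstrap_ci` and the aligned 0/1 correctness lists are hypothetical. It captures only test-set sampling variability, not variance across training runs.

```python
import random


def paired_bootstrap_ci(correct_a, correct_b, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for accuracy(A) - accuracy(B).

    correct_a and correct_b are hypothetical per-example 0/1 correctness
    lists, aligned so that index i is the same test question for both models.
    Returns (observed_difference, lower_bound, upper_bound).
    """
    if len(correct_a) != len(correct_b) or not correct_a:
        raise ValueError("need two aligned, non-empty correctness lists")

    rng = random.Random(seed)
    n = len(correct_a)

    # Work on per-example differences so each resample keeps the pairing.
    diffs = [a - b for a, b in zip(correct_a, correct_b)]
    observed = sum(diffs) / n

    # Resample test examples with replacement and recompute the mean difference.
    means = []
    for _ in range(n_resamples):
        sample = rng.choices(diffs, k=n)
        means.append(sum(sample) / n)
    means.sort()

    # Percentile interval: cut alpha/2 of the resampled means from each tail.
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return observed, lower, upper
```

If the interval returned for a "1.7B vs. 3B" comparison excludes zero, the delta is unlikely to be an artifact of which test questions happened to be sampled. This requires per-example predictions for both models, which literature-reported baseline numbers usually do not provide.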

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive review of our manuscript. We address each major comment below and outline the revisions we will make to improve the verifiability and attribution of our results.

Point-by-point responses
  1. Referee: [Experiments] Experiments section: Performance claims (e.g., 1.7B vs. 3B and 3B vs. 7B+) are presented without specifying exact baselines (re-implemented or literature-reported), evaluation prompts/settings, error bars, statistical significance, or data splits. This is load-bearing for the central comparison claims and leaves the reported deltas difficult to verify.

    Authors: We agree that greater specificity is needed for reproducibility. In the revised manuscript we will explicitly note that all baseline numbers are taken from the original publications (with citations) rather than re-implementations, except where we state otherwise. We will add a dedicated subsection describing the exact prompts, decoding parameters, and evaluation protocols used for our models, which follow the standard settings established in prior VLM works such as LLaVA. We acknowledge the absence of error bars and statistical significance tests; these are omitted because repeated full training runs are computationally prohibitive at this scale, a practice common in the field. We will insert a brief discussion of this limitation and note that all results use the official test splits of each benchmark. revision: yes

  2. Referee: [Section 3] Section 3 and dataset description: The paper stresses 'rich high-quality dataset curation' as a key ingredient alongside architecture and training, yet provides no ablations isolating data effects from the proposed architectural tweaks and mobile-tailored training. Without such controls or details on the exact training mixture relative to baselines, attribution of gains to the orchestration remains under-supported.

    Authors: We accept that the current version does not isolate the contribution of dataset curation. In the revision we will add a controlled ablation that trains the same architecture and training schedule on the prior MobileVLM data mixture versus the new high-quality curation, thereby quantifying the data effect. We will also expand the dataset description to include the precise composition, sources, and relative proportions of the training mixture, together with a comparison to the data used by the cited baseline models. revision: yes
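
The ablation promised in the second response changes one ingredient (the data mixture) while holding architecture and training schedule fixed. A fuller attribution would enumerate every combination of the three ingredients and compare runs that differ in exactly one of them. The sketch below lays out such a grid under stated assumptions: the factor names and the "v1"/"v2" level labels are hypothetical placeholders for the prior MobileVLM choice and the MobileVLM V2 choice, and nothing here reflects the authors' actual experimental configuration.

```python
from itertools import product

# Hypothetical factors; "v1" stands for the prior MobileVLM choice,
# "v2" for the MobileVLM V2 choice.
FACTORS = {
    "architecture": ("v1", "v2"),
    "training_scheme": ("v1", "v2"),
    "data_mixture": ("v1", "v2"),
}


def ablation_grid(factors):
    """Return every combination of factor levels as a list of dicts."""
    names = list(factors)
    combos = product(*(factors[name] for name in names))
    return [dict(zip(names, levels)) for levels in combos]


def single_factor_pairs(grid, factor, old="v1", new="v2"):
    """Return (baseline, variant) run pairs that differ only in `factor`."""
    pairs = []
    for run in grid:
        if run[factor] != old:
            continue
        # Copy the baseline run and flip only the factor under study.
        variant = dict(run)
        variant[factor] = new
        pairs.append((run, variant))
    return pairs
```

With three two-level factors the grid has 8 runs, and each factor yields 4 matched pairs. The single comparison described in the authors' response corresponds to one of the four `data_mixture` pairs; training the full grid may be too costly at this scale, which is a trade-off the authors raise themselves.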

Circularity Check

0 steps flagged

No circularity: empirical benchmark results with independent external comparisons

full rationale

The paper reports training and evaluation of MobileVLM V2 models on standard VLM benchmarks, claiming performance gains from architectural tweaks, training scheme, and dataset curation. No mathematical derivations, equations, or first-principles predictions are present that could reduce to inputs by construction. Claims rest on direct empirical comparisons to other published models (external benchmarks), not on self-referential fits or self-citation chains that justify uniqueness. Any references to prior MobileVLM work serve as baseline context rather than load-bearing justification for the reported results. The analysis chain is self-contained experimental reporting against outside data.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

Abstract-only review limits visibility into exact hyperparameters and assumptions; the work implicitly relies on standard VLM evaluation practices and the transferability of curated data improvements.

free parameters (1)
  • model scale choices
    Selection of 1.7B and 3B parameter sizes for mobile focus; values chosen to balance performance and efficiency.
axioms (1)
  • domain assumption: Standard VLM benchmarks accurately reflect real-world mobile deployment performance
    Claims rest on comparison to these benchmarks without discussion of potential distribution shift.

pith-pipeline@v0.9.0 · 5674 in / 1205 out tokens · 38474 ms · 2026-05-18T15:21:40.446906+00:00 · methodology

Forward citations

Cited by 17 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. UIPress: Bringing Optical Token Compression to UI-to-Code Generation

    cs.CL 2026-04 unverdicted novelty 7.0

    UIPress is the first encoder-side learned optical compression method for UI-to-Code that compresses visual tokens to 256, outperforming the uncompressed baseline by 7.5% CLIP score and the best inference-time baseline...

  2. Discrete Guidance Matching: Exact Guidance for Discrete Flow Matching

    cs.LG 2025-09 conditional novelty 7.0

    Derives exact guidance transition rates for discrete flow matching models that require only one model evaluation per sampling step and unify prior approximation-based methods.

  3. Janus: Decoupling Visual Encoding for Unified Multimodal Understanding and Generation

    cs.CV 2024-10 unverdicted novelty 7.0

    Janus decouples visual encoding into task-specific pathways inside a single autoregressive transformer to unify multimodal understanding and generation while outperforming earlier unified models.

  4. Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs

    cs.CV 2024-06 unverdicted novelty 7.0

    Cambrian-1 is a vision-centric multimodal LLM family that evaluates over 20 vision encoders, introduces CV-Bench and the Spatial Vision Aggregator, and releases open models, code, and data achieving strong performance...

  5. LLaVA-CKD: Bottom-Up Cascaded Knowledge Distillation for Vision-Language Models

    cs.CV 2026-05 unverdicted novelty 6.0

    A cascaded knowledge distillation method with intermediate teachers improves efficiency of vision-language models like LLaVA while achieving state-of-the-art results on seven VQA benchmarks.

  6. Switch-KD: Visual-Switch Knowledge Distillation for Vision-Language Models

    cs.CV 2026-04 unverdicted novelty 6.0

    A 0.5B student VLM distills from a 3B teacher using visual-switch distillation and DBiLD loss to gain 3.6 points on average across 10 multimodal benchmarks without architecture changes.

  7. UHR-BAT: Budget-Aware Token Compression Vision-Language model for Ultra-High-Resolution Remote Sensing

    cs.CV 2026-04 unverdicted novelty 6.0

    UHR-BAT is a budget-aware framework that uses text-guided multi-scale importance estimation plus region-wise preserve and merge strategies to compress visual tokens in ultra-high-resolution remote sensing vision-langu...

  8. ABMAMBA: Multimodal Large Language Model with Aligned Hierarchical Bidirectional Scan for Efficient Video Captioning

    cs.CV 2026-04 unverdicted novelty 6.0

    ABMamba uses Mamba-based linear-complexity processing plus a novel Aligned Hierarchical Bidirectional Scan to deliver competitive video captioning on VATEX and MSR-VTT at roughly 3x higher throughput than typical Tran...

  9. Nano-EmoX: Unifying Multimodal Emotional Intelligence from Perception to Empathy

    cs.AI 2026-03 unverdicted novelty 6.0

    Nano-EmoX is a compact 2.2B multimodal model that unifies six core affective tasks across perception, understanding, and interaction levels via a curriculum framework, achieving competitive benchmark performance.

  10. Vision-aligned Latent Reasoning for Multi-modal Large Language Model

    cs.CV 2026-02 unverdicted novelty 6.0

    VaLR generates vision-aligned latent tokens before each reasoning step to preserve perceptual cues, improving VSI-Bench accuracy from 33.0% to 52.9%.

  11. Mogao: An Omni Foundation Model for Interleaved Multi-Modal Generation

    cs.CV 2025-05 unverdicted novelty 6.0

    Mogao presents a causal unified model with deep fusion, dual encoders, and interleaved position embeddings that achieves strong performance on multi-modal understanding, text-to-image generation, and coherent interlea...

  12. SEED-X: Multimodal Models with Unified Multi-granularity Comprehension and Generation

    cs.CV 2024-04 unverdicted novelty 6.0

    SEED-X is a unified multimodal foundation model that handles multi-granularity visual semantics for both comprehension and generation across arbitrary image sizes and ratios.

  13. Firebolt-VL: Efficient Vision-Language Understanding with Cross-Modality Modulation

    cs.CV 2026-04 unverdicted novelty 5.0

    Firebolt-VL introduces an LFM-based decoder and token-grid correlation to achieve linear-time vision-language inference with improved fine-grained grounding.

  14. Efficient3D: A Unified Framework for Adaptive and Debiased Token Reduction in 3D MLLMs

    cs.CV 2026-04 unverdicted novelty 5.0

    Efficient3D prunes visual tokens in 3D MLLMs via DVTIE and ATR modules, reporting better performance than unpruned baselines on Scan2Cap and other benchmarks.

  15. TinyVLA: Towards Fast, Data-Efficient Vision-Language-Action Models for Robotic Manipulation

    cs.RO 2024-09 unverdicted novelty 4.0

    TinyVLA achieves faster inference and higher data efficiency than OpenVLA on robotic manipulation tasks by initializing from high-speed multimodal models and adding a diffusion policy decoder, without any pre-training phase.

  16. Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling

    cs.AI 2025-01 conditional novelty 3.0

    Scaling data, model size, and training optimization on the Janus architecture yields better multimodal understanding and more stable, instruction-following text-to-image generation.

  17. A Survey on Multimodal Large Language Models

    cs.CV 2023-06 accept novelty 3.0

    This survey organizes the architectures, training strategies, data, evaluation methods, extensions, and challenges of Multimodal Large Language Models.

Reference graph

Works this paper leans on

76 extracted references · 76 canonical work pages · cited by 17 Pith papers · 23 internal anchors

  1. [1]

    An in- depth look at gemini’s language abilities

    Syeda Nahida Akter, Zichun Yu, Aashiq Muhamed, Tianyue Ou, Alex B ¨auerle, ´Angel Alexander Cabrera, Krish Dho- lakia, Chenyan Xiong, and Graham Neubig. An in- depth look at gemini’s language abilities. arXiv preprint arXiv:2312.11444, 2023. 1

  2. [2]

    Openflamingo, Mar

    Anas Awadalla, Irena Gao, Joshua Gardner, Jack Hessel, Yusuf Hanafy, Wanrong Zhu, Kalyani Marathe, Yonatan Bit- ton, Samir Gadre, Jenia Jitsev, Simon Kornblith, Pang Wei Koh, Gabriel Ilharco, Mitchell Wortsman, and Ludwig Schmidt. Openflamingo, Mar. 2023. 6

  3. [3]

    Qwen Technical Report

    Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, et al. Qwen technical report. arXiv preprint arXiv:2309.16609, 2023. 2

  4. [4]

    Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond

    Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-vl: A versatile vision-language model for un- derstanding, localization, text reading, and beyond. arXiv preprint arXiv:2308.12966, 2023. 1, 2, 6

  5. [5]

    Pythia: A suite for analyz- 8 ing large language models across training and scaling

    Stella Biderman, Hailey Schoelkopf, Quentin Gregory An- thony, Herbie Bradley, Kyle O’Brien, Eric Hallahan, Mo- hammad Aflah Khan, Shivanshu Purohit, USVSN Sai Prashanth, Edward Raff, et al. Pythia: A suite for analyz- 8 ing large language models across training and scaling. In In- ternational Conference on Machine Learning , pages 2397–

  6. [6]

    Lan- guage models are few-shot learners

    Tom Brown, Benjamin Mann, Nick Ryder, Melanie Sub- biah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakan- tan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Lan- guage models are few-shot learners. Advances in neural in- formation processing systems, 33:1877–1901, 2020. 2

  7. [7]

    Honeybee: Locality-enhanced projector for multimodal llm

    Junbum Cha, Wooyoung Kang, Jonghwan Mun, and Byungseok Roh. Honeybee: Locality-enhanced projector for multimodal llm. arXiv preprint arXiv:2312.06742, 2023. 4

  8. [8]

    MiniGPT-v2: large language model as a unified interface for vision-language multi-task learning

    Jun Chen, Deyao Zhu, Xiaoqian Shen, Xiang Li, Zechun Liu, Pengchuan Zhang, Raghuraman Krishnamoorthi, Vikas Chandra, Yunyang Xiong, and Mohamed Elhoseiny. Minigpt-v2: large language model as a unified interface for vision-language multi-task learning. arXiv preprint arXiv:2310.09478, 2023. 6

  9. [9]

    Shikra: Unleashing Multimodal LLM's Referential Dialogue Magic

    Keqin Chen, Zhao Zhang, Weili Zeng, Richong Zhang, Feng Zhu, and Rui Zhao. Shikra: Unleashing multi- modal llm’s referential dialogue magic. arXiv preprint arXiv:2306.15195, 2023. 6

  10. [10]

    ShareGPT4V: Improving Large Multi-Modal Models with Better Captions

    Lin Chen, Jisong Li, Xiaoyi Dong, Pan Zhang, Conghui He, Jiaqi Wang, Feng Zhao, and Dahua Lin. Sharegpt4v: Improving large multi-modal models with better captions. arXiv preprint arXiv:2311.12793, 2023. 1, 2, 4, 6, 7

  11. [11]

    Lawrence Zit- nick

    Xinlei Chen, Hao Fang, Tsung-Yi Lin, Ramakrishna Vedan- tam, Saurabh Gupta, Piotr Dollar, and C. Lawrence Zit- nick. Microsoft coco captions: Data collection and evalu- ation server, 2015. 4

  12. [12]

    Unifying vision-and-language tasks via text generation

    Jaemin Cho, Jie Lei, Hao Tan, and Mohit Bansal. Unifying vision-and-language tasks via text generation. In Interna- tional Conference on Machine Learning , pages 1931–1942. PMLR, 2021. 2

  13. [13]

    PaLM: Scaling Language Modeling with Pathways

    Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. Palm: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311, 2022. 2

  14. [14]

    Make repvgg greater again: A quantization-aware approach

    Xiangxiang Chu, Liang Li, and Bo Zhang. Make repvgg greater again: A quantization-aware approach. In AAAI,

  15. [15]

    MobileVLM : A Fast, Strong and Open Vision Language Assistant for Mobile Devices

    Xiangxiang Chu, Limeng Qiao, Xinyang Lin, Shuang Xu, Yang Yang, Yiming Hu, Fei Wei, Xinyu Zhang, Bo Zhang, Xiaolin Wei, et al. Mobilevlm: A fast, reproducible and strong vision language assistant for mobile devices. arXiv preprint arXiv:2312.16886, 2023. 1, 2, 3, 4, 5, 6, 7, 8

  16. [16]

    Conditional positional encodings for vision transformers

    Xiangxiang Chu, Zhi Tian, Bo Zhang, Xinlong Wang, and Chunhua Shen. Conditional positional encodings for vision transformers. In The Eleventh International Conference on Learning Representations, 2023. 3, 8

  17. [17]

    Redpajama: An open source recipe to reproduce llama training dataset, 2023

    Together Computer. Redpajama: An open source recipe to reproduce llama training dataset, 2023. 5

  18. [18]

    InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning

    Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, and Steven Hoi. Instructblip: Towards general- purpose vision-language models with instruction tuning. arXiv preprint arXiv:2305.06500, 2023. 6

  19. [19]

    Visual dialog

    Abhishek Das, Satwik Kottur, Khushi Gupta, Avi Singh, Deshraj Yadav, Jos´e MF Moura, Devi Parikh, and Dhruv Ba- tra. Visual dialog. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 326–335,

  20. [20]

    InternLM-XComposer2: Mastering Free-form Text-Image Composition and Comprehension in Vision-Language Large Model

    Xiaoyi Dong, Pan Zhang, Yuhang Zang, Yuhang Cao, Bin Wang, Linke Ouyang, Xilin Wei, Songyang Zhang, Haodong Duan, Maosong Cao, Wenwei Zhang, Yining Li, Hang Yan, Yang Gao, Xinyue Zhang, Wei Li, Jingwen Li, Kai Chen, Conghui He, Xingcheng Zhang, Yu Qiao, Dahua Lin, and Ji- aqi Wang. Internlm-xcomposer2: Mastering free-form text- image composition and compr...

  21. [21]

    Glm: General language model pretraining with autoregressive blank infilling

    Zhengxiao Du, Yujie Qian, Xiao Liu, Ming Ding, Jiezhong Qiu, Zhilin Yang, and Jie Tang. Glm: General language model pretraining with autoregressive blank infilling. InPro- ceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages 320–335, 2022. 2

  22. [22]

    Learning factored representations in a deep mixture of ex- perts

    David Eigen, Marc’Aurelio Ranzato, and Ilya Sutskever. Learning factored representations in a deep mixture of ex- perts. 2013. 1

  23. [23]

    Sparsegpt: Massive language models can be accurately pruned in one-shot, 2023

    Elias Frantar and Dan Alistarh. Sparsegpt: Massive language models can be accurately pruned in one-shot, 2023. 2

  24. [24]

    GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers

    Elias Frantar, Saleh Ashkboos, Torsten Hoefler, and Dan Alistarh. Gptq: Accurate post-training quantization for generative pre-trained transformers. arXiv preprint arXiv:2210.17323, 2022. 2, 5

  25. [25]

    MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models

    Chaoyou Fu, Peixian Chen, Yunhang Shen, Yulei Qin, Mengdan Zhang, Xu Lin, Jinrui Yang, Xiawu Zheng, Ke Li, Xing Sun, et al. Mme: A comprehensive evaluation bench- mark for multimodal large language models. arXiv preprint arXiv:2306.13394, 2023. 5, 6

  26. [26]

    A challenger to gpt-4v? early explorations of gemini in visual expertise

    Chaoyou Fu, Renrui Zhang, Zihan Wang, Yubo Huang, Zhengye Zhang, Longtian Qiu, Gaoxiang Ye, Yunhang Shen, Zhang Mengdan, Peixian Chen, Sirui Zhao, Shaohui Lin, Deqiang Jiang, Di Yin, Peng Gao, Ke Li, Hongsheng Li, and Xing Sun. A challenger to gpt-4v? early explorations of gemini in visual expertise. arXiv preprint arXiv:2312.12436,

  27. [27]

    llama.cpp

    Georgi Gerganov. llama.cpp. https://github.com/g gerganov/llama.cpp. [Accessed: 2023-11-07]. 7

  28. [28]

    Gaussian Error Linear Units (GELUs)

    Dan Hendrycks and Kevin Gimpel. Gaussian error linear units (gelus). arXiv preprint arXiv:1606.08415, 2016. 3

  29. [29]

    Gqa: A new dataset for real-world visual reasoning and compositional question answering

    Drew A Hudson and Christopher D Manning. Gqa: A new dataset for real-world visual reasoning and compositional question answering. In Proceedings of the IEEE/CVF con- ference on computer vision and pattern recognition , pages 6700–6709, 2019. 5, 6

  30. [30]

    Adaptive mixtures of local experts

    Robert A Jacobs, Michael I Jordan, Stuart J Nowlan, and Ge- offrey E Hinton. Adaptive mixtures of local experts. Neural Computation, 3(1):79–87, 1991. 1

  31. [31]

    Segment Anything

    Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer White- head, Alexander C Berg, Wan-Yen Lo, et al. Segment any- thing. arXiv preprint arXiv:2304.02643, 2023. Accessed: 2023-03-01. 12 9

  32. [32]

    Grounding language models to images for multimodal gen- eration

    Jing Yu Koh, Ruslan Salakhutdinov, and Daniel Fried. Grounding language models to images for multimodal gen- eration. arXiv preprint arXiv:2301.13823, 2023. 2

  33. [33]

    Lisa: Reasoning segmentation via large language model, 2024

    Xin Lai, Zhuotao Tian, Yukang Chen, Yanwei Li, Yuhui Yuan, Shu Liu, and Jiaya Jia. Lisa: Reasoning segmentation via large language model. arXiv preprint arXiv:2308.00692,

  34. [34]

    Obelisc: An open web-scale filtered dataset of interleaved image-text documents

    Hugo Laurenc ¸on, Lucile Saulnier, L´eo Tronchon, Stas Bek- man, Amanpreet Singh, Anton Lozhkov, Thomas Wang, Sid- dharth Karamcheti, Alexander M Rush, Douwe Kiela, et al. Obelisc: An open web-scale filtered dataset of interleaved image-text documents. arXiv preprint arXiv:2306.16527 ,

  35. [35]

    BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models

    Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. arXiv preprint arXiv:2301.12597, 2023. 6

  36. [36]

    Align before fuse: Vision and language representation learn- ing with momentum distillation

    Junnan Li, Ramprasaath Selvaraju, Akhilesh Gotmare, Shafiq Joty, Caiming Xiong, and Steven Chu Hong Hoi. Align before fuse: Vision and language representation learn- ing with momentum distillation. Advances in neural infor- mation processing systems, 34:9694–9705, 2021. 2

  37. [37]

    Norm tweaking: High-performance low-bit quantization of large language models

    Liang Li, Qingyuan Li, Bo Zhang, and Xiangxiang Chu. Norm tweaking: High-performance low-bit quantization of large language models. In AAAI, 2024. 2, 5

  38. [38]

    A speed odyssey for deployable quantization of llms

    Qingyuan Li, Ran Meng, Yiduo Li, Bo Zhang, Liang Li, Yi- fan Lu, Xiangxiang Chu, Yerui Sun, and Yuchen Xie. A speed odyssey for deployable quantization of llms. arXiv preprint arXiv:2311.09550, 2023. 2

  39. [39]

    Textbooks are all you need ii: phi-1.5 technical report, 2023

    Yuanzhi Li, S ´ebastien Bubeck, Ronen Eldan, Allie Del Giorno, Suriya Gunasekar, and Yin Tat Lee. Textbooks are all you need ii: phi-1.5 technical report, 2023. 2

  40. [40]

    Evaluating Object Hallucination in Large Vision-Language Models

    Yifan Li, Yifan Du, Kun Zhou, Jinpeng Wang, Wayne Xin Zhao, and Ji-Rong Wen. Evaluating object hallucina- tion in large vision-language models. arXiv preprint arXiv:2305.10355, 2023. 5, 6

  41. [41]

    Moe-llava: Mixture of experts for large vision-language models, 2024

    Bin Lin, Zhenyu Tang, Yang Ye, Jiaxi Cui, Bin Zhu, Peng Jin, Junwu Zhang, Munan Ning, and Li Yuan. Moe-llava: Mixture of experts for large vision-language models, 2024. 1, 2, 5, 6

  42. [42]

    Microsoft COCO: Common objects in context

    Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Doll´ar, and C Lawrence Zitnick. Microsoft COCO: Common objects in context. In Eur. Conf. Comput. Vis., pages 740–755. Springer, 2014. 12

  43. [43]

    Visual spatial reasoning

    Fangyu Liu, Guy Edward Toh Emerson, and Nigel Collier. Visual spatial reasoning. Transactions of the Association for Computational Linguistics, 2023. 4

  44. [45]

    Improved Baselines with Visual Instruction Tuning

    Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. arXiv:2310.03744, 2023. 5

  45. [46]

    Visual Instruction Tuning

    Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. arXiv preprint arXiv:2304.08485,

  46. [47]

    MMBench: Is Your Multi-modal Model an All-around Player?

    Yuan Liu, Haodong Duan, Yuanhan Zhang, Bo Li, Songyang Zhang, Wangbo Zhao, Yike Yuan, Jiaqi Wang, Conghui He, Ziwei Liu, et al. Mmbench: Is your multi-modal model an all-around player? arXiv preprint arXiv:2307.06281, 2023. 5, 6

  47. [48]

    Decoupled Weight Decay Regularization

    Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017. 4

  48. [49]

    Learn to explain: Multimodal reasoning via thought chains for science question answering

    Pan Lu, Swaroop Mishra, Tanglin Xia, Liang Qiu, Kai-Wei Chang, Song-Chun Zhu, Oyvind Tafjord, Peter Clark, and Ashwin Kalyan. Learn to explain: Multimodal reasoning via thought chains for science question answering. In Advances in Neural Information Processing Systems , pages 27730– 27744, 2022. 1, 4, 5, 6

  49. [50]

    Iconqa: A new benchmark for abstract diagram under- standing and visual language reasoning

    Pan Lu, Liang Qiu, Jiaqi Chen, Tony Xia, Yizhou Zhao, Wei Zhang, Zhou Yu, Xiaodan Liang, and Song-Chun Zhu. Iconqa: A new benchmark for abstract diagram under- standing and visual language reasoning. arXiv preprint arXiv:2110.13214, 2021. 4

  50. [51]

    OpenAI. ChatGPT. https://openai.com/blog/ChatGPT/,

  51. [52]

    Online; accessed 2023-01-01. 2

  52. [53]

    Gpt-4 technical report

    OpenAI. Gpt-4 technical report. 2023. Technical Report. 2

  53. [54]

    Gpt-4v(ision) system card

    OpenAI. Gpt-4v(ision) system card. 2023. 1

  54. [55]

    Vicente Ordonez, Girish Kulkarni, and Tamara L. Berg. Im2text: Describing images using 1 million captioned pho- tographs. In Neural Information Processing Systems (NIPS),

  55. [56]

    Training language models to follow instructions with human feed- back

    Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Car- roll Wainwright, Pamela Mishkin, Chong Zhang, Sand- hini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feed- back. Advances in Neural Information Processing Systems , 35:27730–27744, 2022. 2

  56. [57]

    Tinyllama, Sep 2023

    Tianduo Wang Peiyuan Zhang, Guangtao Zeng and Wei Lu. Tinyllama, Sep 2023. 2

  57. [58]

    Detgpt: Detect what you need via reasoning

    Renjie Pi, Jiahui Gao, Shizhe Diao, Rui Pan, Hanze Dong, Jipeng Zhang, Lewei Yao, Jianhua Han, Hang Xu, and Ling- peng Kong Tong Zhang. Detgpt: Detect what you need via reasoning. arXiv preprint arXiv:2305.14167, 2023. 2

  58. [59]

    Learning transferable visual models from natural language supervi- sion

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervi- sion. In International conference on machine learning, pages 8748–8763. PMLR, 2021. 1, 2, 3, 4

  59. [60]

    BLOOM: A 176B-Parameter Open-Access Multilingual Language Model

    Teven Le Scao, Angela Fan, Christopher Akiki, Ellie Pavlick, Suzana Ilić, Daniel Hesslow, Roman Castagné, Alexandra Sasha Luccioni, François Yvon, Matthias Gallé, et al. Bloom: A 176b-parameter open-access multilingual language model. arXiv preprint arXiv:2211.05100, 2022. 2

  60. [61]

    Towards vqa models that can read

    Amanpreet Singh, Vivek Natarajan, Meet Shah, Yu Jiang, Xinlei Chen, Dhruv Batra, Devi Parikh, and Marcus Rohrbach. Towards vqa models that can read. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 8317–8326, 2019. 1, 4, 5, 6

  61. [62]

    Lxmert: Learning cross-modality encoder representations from transformers

    Hao Tan and Mohit Bansal. Lxmert: Learning cross-modality encoder representations from transformers. arXiv preprint arXiv:1908.07490, 2019. 2

  62. [63]

    Galactica: A large language model for science

    Ross Taylor, Marcin Kardas, Guillem Cucurull, Thomas Scialom, Anthony Hartshorn, Elvis Saravia, Andrew Poulton, Viktor Kerkez, and Robert Stojnic. Galactica: A large language model for science. 2022. 2

  63. [64]

    Internlm: A multilingual language model with progressively enhanced capabilities

    InternLM Team. Internlm: A multilingual language model with progressively enhanced capabilities. https://github.com/InternLM/InternLM, 2023. 2

  64. [65]

    Vigc: Visual instruction generation and correction

    Bin Wang, Fan Wu, Xiao Han, Jiahui Peng, Huaping Zhong, Pan Zhang, Xiaoyi Dong, Weijia Li, Wei Li, Jiaqi Wang, et al. Vigc: Visual instruction generation and correction. arXiv preprint arXiv:2308.12714, 2023. 4

  65. [66]

    To see is to believe: Prompting gpt-4v for better visual instruction tuning

    Junke Wang, Lingchen Meng, Zejia Weng, Bo He, Zuxuan Wu, and Yu-Gang Jiang. To see is to believe: Prompting gpt-4v for better visual instruction tuning. arXiv preprint arXiv:2311.07574, 2023. 2

  66. [67]

    Image as a foreign language: Beit pretraining for all vision and vision-language tasks

    Wenhui Wang, Hangbo Bao, Li Dong, Johan Bjorck, Zhiliang Peng, Qiang Liu, Kriti Aggarwal, Owais Khan Mohammed, Saksham Singhal, Subhojit Som, et al. Image as a foreign language: Beit pretraining for all vision and vision-language tasks. arXiv preprint arXiv:2208.10442, 2022. 2

  67. [68]

    Vary: Scaling up the vision vocabulary for large vision-language models

    Haoran Wei, Lingyu Kong, Jinyue Chen, Liang Zhao, Zheng Ge, Jinrong Yang, Jianjian Sun, Chunrui Han, and Xiangyu Zhang. Vary: Scaling up the vision vocabulary for large vision-language models. arXiv preprint arXiv:2312.06109,

  68. [69]

    Smoothquant: Accurate and efficient post-training quantization for large language models

    Guangxuan Xiao, Ji Lin, Mickael Seznec, Hao Wu, Julien Demouth, and Song Han. Smoothquant: Accurate and efficient post-training quantization for large language models. In International Conference on Machine Learning, pages 38087–38099. PMLR, 2023. 2

  69. [70]

    Baichuan 2: Open Large-scale Language Models

    Aiyuan Yang, Bin Xiao, Bingning Wang, Borong Zhang, Ce Bian, Chao Yin, Chenxu Lv, Da Pan, Dian Wang, Dong Yan, et al. Baichuan 2: Open large-scale language models. arXiv preprint arXiv:2309.10305, 2023. 2

  70. [71]

    mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality

    Qinghao Ye, Haiyang Xu, Guohai Xu, Jiabo Ye, Ming Yan, Yiyang Zhou, Junyang Wang, Anwen Hu, Pengcheng Shi, Yaya Shi, et al. mplug-owl: Modularization empowers large language models with multimodality. arXiv preprint arXiv:2304.14178, 2023. 6

  71. [72]

    Mm-llms: Recent advances in multimodal large language models

    Duzhen Zhang, Yahan Yu, Chenxing Li, Jiahua Dong, Dan Su, Chenhui Chu, and Dong Yu. Mm-llms: Recent advances in multimodal large language models. arXiv preprint arXiv:2401.13601, 2024. 1

  72. [73]

    OPT: Open pre-trained transformer language models

    Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Christopher Dewan, Mona Diab, Xian Li, Xi Victoria Lin, et al. OPT: Open pre-trained transformer language models. 2022. 2

  73. [74]

    Svit: Scaling up visual instruction tuning

    Bo Zhao, Boya Wu, and Tiejun Huang. Svit: Scaling up visual instruction tuning. arXiv preprint arXiv:2307.04087,

  74. [75]

    Lidar-ptq: Post-training quantization for point cloud 3d object detection

    Sifan Zhou, Liang Li, Xinyu Zhang, Bo Zhang, Shipeng Bai, Miao Sun, Ziyu Zhao, Xiaobo Lu, and Xiangxiang Chu. Lidar-ptq: Post-training quantization for point cloud 3d object detection. International Conference on Learning Representations (ICLR 2024), 2024. 5

  75. [76]

    MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models

    Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, and Mohamed Elhoseiny. Minigpt-4: Enhancing vision-language understanding with advanced large language models. arXiv preprint arXiv:2304.10592, 2023. 1, 2, 6

  76. [77]

    Llava-ϕ: Efficient multi-modal assistant with small language model

    Yichen Zhu, Minjie Zhu, Ning Liu, Zhicai Ou, Xiaofeng Mou, and Jian Tang. Llava-ϕ: Efficient multi-modal assistant with small language model. arXiv preprint arXiv:2401.02330, 2024. 2

A. Dialogue formats of various datasets.

During the pre-training phase, we utilized the 1.2 million image-text pairs from the pre-training phase of ShareGPT4V, which pri...