MobileVLM V2: Faster and Stronger Baseline for Vision Language Model
Pith reviewed 2026-05-18 15:21 UTC · model grok-4.3
The pith
MobileVLM V2 shows that 1.7B and 3B vision-language models can match or surpass much larger systems on standard benchmarks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
MobileVLM V2 establishes that a delicate orchestration of novel architectural design, an improved training scheme tailored for mobile VLMs, and rich high-quality dataset curation can substantially benefit VLMs' performance. The 1.7B model achieves better or on-par results compared with much larger VLMs at the 3B scale, while the 3B model outperforms a large variety of VLMs at the 7B+ scale.
What carries the argument
The orchestration of a novel architectural design, a mobile-tailored training scheme, and high-quality dataset curation, which together support strong results at reduced model sizes.
If this is right
- A 1.7 billion parameter vision-language model can equal or exceed the benchmark results of many 3 billion parameter systems.
- A 3 billion parameter model can surpass the results of many models at 7 billion parameters and above.
- Vision language models can be made efficient enough for direct use on mobile hardware while retaining competitive accuracy.
- Dataset curation and training adjustments matter as much as raw parameter count for multimodal performance.
Where Pith is reading between the lines
- The same combination of changes could be tested on other multimodal tasks such as video understanding to check for similar size reductions.
- Teams building practical AI applications might shift focus toward data selection and device-specific training rather than always increasing model scale.
- Further model compression experiments could start from these designs to explore even smaller footprints for edge devices.
Load-bearing premise
The performance gains stem directly from the described architectural, training, and data choices rather than from hidden tuning or selection of favorable test sets.
What would settle it
An independent training run of the same architectures using only public datasets, followed by evaluation on a fresh set of vision-language tasks never seen during development.
Original abstract
We introduce MobileVLM V2, a family of significantly improved vision language models upon MobileVLM, which proves that a delicate orchestration of novel architectural design, an improved training scheme tailored for mobile VLMs, and rich high-quality dataset curation can substantially benefit VLMs' performance. Specifically, MobileVLM V2 1.7B achieves better or on-par performance on standard VLM benchmarks compared with much larger VLMs at the 3B scale. Notably, our 3B model outperforms a large variety of VLMs at the 7B+ scale. Our models will be released at https://github.com/Meituan-AutoML/MobileVLM .
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces MobileVLM V2, a family of vision-language models improving on MobileVLM. It claims that a delicate orchestration of novel architectural design, an improved mobile-tailored training scheme, and rich high-quality dataset curation substantially boosts performance. Specifically, the 1.7B model achieves better or on-par results versus 3B-scale VLMs on standard benchmarks, while the 3B model outperforms many 7B+ VLMs; models will be released publicly.
Significance. If the results hold under rigorous verification, the work would be significant for efficient VLMs by showing that smaller-scale models can compete with larger ones via targeted design and curation. This has clear implications for mobile and edge deployment. The planned model release is a strength that supports reproducibility.
major comments (2)
- [Experiments] Experiments section: Performance claims (e.g., 1.7B vs. 3B and 3B vs. 7B+) are presented without specifying exact baselines (re-implemented or literature-reported), evaluation prompts/settings, error bars, statistical significance, or data splits. This is load-bearing for the central comparison claims and leaves the reported deltas difficult to verify.
- [Section 3] Section 3 and dataset description: The paper stresses 'rich high-quality dataset curation' as a key ingredient alongside architecture and training, yet provides no ablations isolating data effects from the proposed architectural tweaks and mobile-tailored training. Without such controls or details on the exact training mixture relative to baselines, attribution of gains to the orchestration remains under-supported.
minor comments (1)
- [Abstract] Abstract: 'Standard VLM benchmarks' is mentioned but not enumerated; adding the primary evaluation datasets (e.g., VQAv2, GQA) would improve clarity.
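The first major comment asks for error bars on the reported benchmark numbers. One inexpensive way to attach an interval to a single evaluation run is a percentile bootstrap over per-example outcomes. The sketch below is illustrative only and is not taken from the paper: the function name `bootstrap_accuracy_ci` and its parameters are hypothetical, and the input is assumed to be a list of 0/1 correctness flags for one model on one benchmark.

```python
import random


def bootstrap_accuracy_ci(correct, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for accuracy.

    `correct` is a list of 0/1 flags, one per evaluated example
    (hypothetical input format, not the paper's evaluation code).
    Returns (accuracy, lower_bound, upper_bound).
    """
    if not correct:
        raise ValueError("need at least one evaluated example")
    rng = random.Random(seed)
    n = len(correct)
    means = []
    for _ in range(n_resamples):
        # Resample the test set with replacement and recompute accuracy.
        sample = rng.choices(correct, k=n)
        means.append(sum(sample) / n)
    means.sort()
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return sum(correct) / n, lower, upper
```

An interval of this kind reflects only the sampling variability of the test set. It says nothing about variance across training seeds, which would still require repeated training runs.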
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive review of our manuscript. We address each major comment below and outline the revisions we will make to improve the verifiability and attribution of our results.
Point-by-point responses
- Referee: [Experiments] Experiments section: Performance claims (e.g., 1.7B vs. 3B and 3B vs. 7B+) are presented without specifying exact baselines (re-implemented or literature-reported), evaluation prompts/settings, error bars, statistical significance, or data splits. This is load-bearing for the central comparison claims and leaves the reported deltas difficult to verify.
  Authors: We agree that greater specificity is needed for reproducibility. In the revised manuscript we will explicitly note that all baseline numbers are taken from the original publications (with citations) rather than re-implementations, except where we state otherwise. We will add a dedicated subsection describing the exact prompts, decoding parameters, and evaluation protocols used for our models, which follow the standard settings established in prior VLM works such as LLaVA. We acknowledge the absence of error bars and statistical significance tests; these are omitted because repeated full training runs are computationally prohibitive at this scale, a practice common in the field. We will insert a brief discussion of this limitation and note that all results use the official test splits of each benchmark.
  Revision: yes
- Referee: [Section 3] Section 3 and dataset description: The paper stresses 'rich high-quality dataset curation' as a key ingredient alongside architecture and training, yet provides no ablations isolating data effects from the proposed architectural tweaks and mobile-tailored training. Without such controls or details on the exact training mixture relative to baselines, attribution of gains to the orchestration remains under-supported.
  Authors: We accept that the current version does not isolate the contribution of dataset curation. In the revision we will add a controlled ablation that trains the same architecture and training schedule on the prior MobileVLM data mixture versus the new high-quality curation, thereby quantifying the data effect. We will also expand the dataset description to include the precise composition, sources, and relative proportions of the training mixture, together with a comparison to the data used by the cited baseline models.
  Revision: yes
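The ablation promised in the rebuttal varies only the data mixture. A fuller way to separate the three claimed ingredients is a factorial grid in which each ingredient is toggled between its old and new setting. The sketch below is an assumption-laden illustration, not the authors' protocol: the factor labels, the `ablation_grid` and `main_effect` helpers, and the score dictionary format are all hypothetical.

```python
from itertools import product

# Hypothetical factor labels for the three ingredients named in the abstract.
FACTORS = ("architecture", "training", "data")


def ablation_grid(factors=FACTORS):
    """Enumerate every old/new combination: 2 ** len(factors) training runs."""
    return [dict(zip(factors, levels))
            for levels in product(("old", "new"), repeat=len(factors))]


def main_effect(scores, factor, factors=FACTORS):
    """Mean score change from switching `factor` from old to new.

    `scores` maps a tuple of levels (ordered as in `factors`) to a
    benchmark score. The change is averaged over all settings of the
    remaining factors.
    """
    idx = factors.index(factor)
    diffs = []
    for key, score in scores.items():
        if key[idx] == "new":
            old_key = key[:idx] + ("old",) + key[idx + 1:]
            diffs.append(score - scores[old_key])
    return sum(diffs) / len(diffs)
```

With three factors this requires eight training runs, so it is considerably more expensive than the single data-only comparison the authors propose. Any scores passed to `main_effect` must come from real runs; none are supplied here.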
Circularity Check
No circularity: empirical benchmark results with independent external comparisons
Full rationale
The paper reports training and evaluation of MobileVLM V2 models on standard VLM benchmarks, claiming performance gains from architectural tweaks, training scheme, and dataset curation. No mathematical derivations, equations, or first-principles predictions are present that could reduce to inputs by construction. Claims rest on direct empirical comparisons to other published models (external benchmarks), not on self-referential fits or self-citation chains that justify uniqueness. Any references to prior MobileVLM work serve as baseline context rather than load-bearing justification for the reported results. The analysis chain is self-contained experimental reporting against outside data.
Axiom & Free-Parameter Ledger
free parameters (1)
- model scale choices
axioms (1)
- domain assumption: Standard VLM benchmarks accurately reflect real-world mobile deployment performance
Forward citations
Cited by 17 Pith papers
- UIPress: Bringing Optical Token Compression to UI-to-Code Generation
  UIPress is the first encoder-side learned optical compression method for UI-to-Code that compresses visual tokens to 256, outperforming the uncompressed baseline by 7.5% CLIP score and the best inference-time baseline...
- Discrete Guidance Matching: Exact Guidance for Discrete Flow Matching
  Derives exact guidance transition rates for discrete flow matching models that require only one model evaluation per sampling step and unify prior approximation-based methods.
- Janus: Decoupling Visual Encoding for Unified Multimodal Understanding and Generation
  Janus decouples visual encoding into task-specific pathways inside a single autoregressive transformer to unify multimodal understanding and generation while outperforming earlier unified models.
- Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs
  Cambrian-1 is a vision-centric multimodal LLM family that evaluates over 20 vision encoders, introduces CV-Bench and the Spatial Vision Aggregator, and releases open models, code, and data achieving strong performance...
- LLaVA-CKD: Bottom-Up Cascaded Knowledge Distillation for Vision-Language Models
  A cascaded knowledge distillation method with intermediate teachers improves efficiency of vision-language models like LLaVA while achieving state-of-the-art results on seven VQA benchmarks.
- Switch-KD: Visual-Switch Knowledge Distillation for Vision-Language Models
  A 0.5B student VLM distills from a 3B teacher using visual-switch distillation and DBiLD loss to gain 3.6 points on average across 10 multimodal benchmarks without architecture changes.
- UHR-BAT: Budget-Aware Token Compression Vision-Language model for Ultra-High-Resolution Remote Sensing
  UHR-BAT is a budget-aware framework that uses text-guided multi-scale importance estimation plus region-wise preserve and merge strategies to compress visual tokens in ultra-high-resolution remote sensing vision-langu...
- ABMAMBA: Multimodal Large Language Model with Aligned Hierarchical Bidirectional Scan for Efficient Video Captioning
  ABMamba uses Mamba-based linear-complexity processing plus a novel Aligned Hierarchical Bidirectional Scan to deliver competitive video captioning on VATEX and MSR-VTT at roughly 3x higher throughput than typical Tran...
- Nano-EmoX: Unifying Multimodal Emotional Intelligence from Perception to Empathy
  Nano-EmoX is a compact 2.2B multimodal model that unifies six core affective tasks across perception, understanding, and interaction levels via a curriculum framework, achieving competitive benchmark performance.
- Vision-aligned Latent Reasoning for Multi-modal Large Language Model
  VaLR generates vision-aligned latent tokens before each reasoning step to preserve perceptual cues, improving VSI-Bench accuracy from 33.0% to 52.9%.
- Mogao: An Omni Foundation Model for Interleaved Multi-Modal Generation
  Mogao presents a causal unified model with deep fusion, dual encoders, and interleaved position embeddings that achieves strong performance on multi-modal understanding, text-to-image generation, and coherent interlea...
- SEED-X: Multimodal Models with Unified Multi-granularity Comprehension and Generation
  SEED-X is a unified multimodal foundation model that handles multi-granularity visual semantics for both comprehension and generation across arbitrary image sizes and ratios.
- Firebolt-VL: Efficient Vision-Language Understanding with Cross-Modality Modulation
  Firebolt-VL introduces an LFM-based decoder and token-grid correlation to achieve linear-time vision-language inference with improved fine-grained grounding.
- Efficient3D: A Unified Framework for Adaptive and Debiased Token Reduction in 3D MLLMs
  Efficient3D prunes visual tokens in 3D MLLMs via DVTIE and ATR modules, reporting better performance than unpruned baselines on Scan2Cap and other benchmarks.
- TinyVLA: Towards Fast, Data-Efficient Vision-Language-Action Models for Robotic Manipulation
  TinyVLA achieves faster inference and higher data efficiency than OpenVLA on robotic manipulation tasks by initializing from high-speed multimodal models and adding a diffusion policy decoder, without any pre-training phase.
- Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling
  Scaling data, model size, and training optimization on the Janus architecture yields better multimodal understanding and more stable, instruction-following text-to-image generation.
- A Survey on Multimodal Large Language Models
  This survey organizes the architectures, training strategies, data, evaluation methods, extensions, and challenges of Multimodal Large Language Models.