MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training
Pith reviewed 2026-05-16 04:01 UTC · model grok-4.3
The pith
A careful mix of image-caption, interleaved image-text, and text-only data during pre-training is crucial for state-of-the-art few-shot results in multimodal large language models.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
For large-scale multimodal pre-training, a careful mix of image-caption, interleaved image-text, and text-only data is crucial for achieving state-of-the-art few-shot results across multiple benchmarks. The image encoder, together with image resolution and image token count, has substantial impact on performance, while the vision-language connector design is of comparatively negligible importance. Scaling this recipe produces the MM1 family of models up to 30B parameters, both dense and mixture-of-experts variants, that reach leading pre-training metrics and competitive supervised fine-tuning results on established multimodal benchmarks while exhibiting enhanced in-context learning and multi-image reasoning, enabling few-shot chain-of-thought prompting.
What carries the argument
The data mixture strategy that combines image-caption pairs, interleaved image-text sequences, and text-only data, together with choices of image encoder, resolution, and token count.
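To make the mixture concrete, here is a minimal sketch of a weighted sampler over the three pre-training sources. The weights below are illustrative assumptions for the sketch, not the ratios the paper actually settled on.

```python
import random

# Illustrative mixture weights (assumed for this sketch; the paper's
# ablated ratios differ and are reported in its data-mixture section).
MIXTURE = {
    "image_caption": 0.45,  # short image-caption pairs
    "interleaved": 0.45,    # web documents with images interleaved in text
    "text_only": 0.10,      # plain text, to preserve language-only ability
}

def sample_source(rng: random.Random) -> str:
    """Pick the source of the next training example, proportional to its weight."""
    sources, weights = zip(*MIXTURE.items())
    return rng.choices(sources, weights=weights, k=1)[0]

rng = random.Random(0)
print([sample_source(rng) for _ in range(8)])
```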
If this is right
- Pre-training with the identified data mix will produce higher few-shot accuracy on multimodal benchmarks than pre-training with narrower data sources.
- Changes to the image encoder, resolution, or token count will produce measurable shifts in overall model quality.
- Variations in vision-language connector architecture will leave few-shot performance largely unchanged.
- Scaled models will display stronger in-context learning and multi-image reasoning without additional fine-tuning.
Where Pith is reading between the lines
- Future scaling efforts could prioritize expanding the volume and diversity of the three data types rather than further connector redesigns.
- The same mixture principle may apply when extending these models to video or audio if analogous interleaved and caption-style sources are available.
- Practitioners could reduce experimentation time by fixing connector designs early and focusing compute on data curation and encoder selection.
- Additional benchmarks that test long-context multi-image reasoning would help confirm whether the observed in-context gains hold beyond current evaluations.
Load-bearing premise
The ablations performed isolate the true importance of data composition and image encoder choices without confounding effects from untested interactions or hyperparameter choices.
What would settle it
Training a model of comparable scale with the same image encoder and resolution but using only one data type or a different untested mixture, then measuring whether it matches or exceeds MM1 few-shot scores on the same benchmarks.
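One way to pin that proposal down is as an explicit condition grid in which everything except the data composition is frozen. The encoder name, resolution value, and condition list below are placeholders, not settings taken from the paper.

```python
# Hypothetical control grid: hold encoder, resolution, and token budget fixed,
# vary only the pre-training data composition, then compare few-shot scores.
FIXED = {"image_encoder": "vit-placeholder", "image_resolution": 336,
         "token_budget_tokens": 1.2e12}  # budget matched across all conditions

CONDITIONS = {
    "mm1_style_mix":    {"caption": 0.45, "interleaved": 0.45, "text": 0.10},
    "caption_only":     {"caption": 1.0},
    "interleaved_only": {"interleaved": 1.0},
    "text_only":        {"text": 1.0},
}

for name, mix in CONDITIONS.items():
    print(name, "->", {**FIXED, "mix": mix})
```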
Original abstract
In this work, we discuss building performant Multimodal Large Language Models (MLLMs). In particular, we study the importance of various architecture components and data choices. Through careful and comprehensive ablations of the image encoder, the vision language connector, and various pre-training data choices, we identified several crucial design lessons. For example, we demonstrate that for large-scale multimodal pre-training using a careful mix of image-caption, interleaved image-text, and text-only data is crucial for achieving state-of-the-art (SOTA) few-shot results across multiple benchmarks, compared to other published pre-training results. Further, we show that the image encoder together with image resolution and the image token count has substantial impact, while the vision-language connector design is of comparatively negligible importance. By scaling up the presented recipe, we build MM1, a family of multimodal models up to 30B parameters, including both dense models and mixture-of-experts (MoE) variants, that are SOTA in pre-training metrics and achieve competitive performance after supervised fine-tuning on a range of established multimodal benchmarks. Thanks to large-scale pre-training, MM1 enjoys appealing properties such as enhanced in-context learning, and multi-image reasoning, enabling few-shot chain-of-thought prompting.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents MM1, a family of multimodal LLMs (dense and MoE variants up to 30B parameters) built via large-scale pre-training. It claims that careful ablations reveal the image encoder, resolution, and token count to have substantial impact while the vision-language connector is comparatively unimportant, and that a specific pre-training data mix of image-caption, interleaved image-text, and text-only data is crucial for achieving SOTA few-shot results across benchmarks compared to prior work. Scaling the resulting recipe yields competitive post-SFT performance and strong properties such as in-context learning and multi-image reasoning.
Significance. If the ablation controls are sound, the work supplies actionable empirical guidance on data composition and encoder choices for multimodal pre-training at scale, extending scaling observations to the multimodal regime and demonstrating practical benefits of mixed data sources for few-shot generalization.
Major comments (2)
- [Pre-training data ablations] Pre-training data section (and associated ablation tables): the central claim that the image-caption + interleaved + text-only mix is crucial for SOTA few-shot performance requires explicit confirmation that total pre-training tokens, steps, or compute budget were held fixed across all compared data compositions. If sample counts or epochs were scaled differently without token-budget normalization, the reported gains cannot be unambiguously attributed to composition rather than effective data volume.
- [Architecture ablations] Image encoder and resolution ablations: the reported substantial impact of encoder choice, image resolution, and token count must be shown to be independent of interactions with the data mix; if these ablations were run only at a single fixed mix or without re-optimizing hyperparameters for each encoder variant, the isolation of effects is incomplete.
Minor comments (2)
- [Abstract] Abstract: the phrase 'SOTA in pre-training metrics' should be accompanied by the specific metrics and direct numerical comparisons to the strongest published baselines.
- [Figures/Tables] Figure and table captions throughout: ensure all ablation plots and tables explicitly state the total token count or training steps used for each condition.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our ablation controls. We address each major comment below and will revise the manuscript to improve clarity on experimental controls without altering the core claims.
Point-by-point responses
Referee: [Pre-training data ablations] Pre-training data section (and associated ablation tables): the central claim that the image-caption + interleaved + text-only mix is crucial for SOTA few-shot performance requires explicit confirmation that total pre-training tokens, steps, or compute budget were held fixed across all compared data compositions. If sample counts or epochs were scaled differently without token-budget normalization, the reported gains cannot be unambiguously attributed to composition rather than effective data volume.
Authors: All data composition ablations were performed with a fixed total pre-training token budget of approximately 1.2T tokens. Sample counts from each source (image-caption, interleaved, text-only) were adjusted proportionally to maintain this fixed budget while varying the mixture ratios. We will add an explicit statement and a footnote in Section 4.2 clarifying the token normalization procedure to eliminate any ambiguity.
Revision: yes
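Below is a minimal sketch of what the described token-budget normalization could look like: the 1.2T total and the proportional-adjustment rule come from the response above, while the function itself is an assumed rendering of that procedure, not the authors' code.

```python
def tokens_per_source(total_budget: float, ratios: dict) -> dict:
    """Split a fixed pre-training token budget across sources by mixture ratio,
    so that changing the ratios never changes the total number of tokens seen."""
    assert abs(sum(ratios.values()) - 1.0) < 1e-9, "mixture ratios must sum to 1"
    return {src: total_budget * r for src, r in ratios.items()}

TOTAL = 1.2e12  # ~1.2T tokens, per the authors' response above

# Two compared compositions consume identical effective data volume:
print(tokens_per_source(TOTAL, {"caption": 0.45, "interleaved": 0.45, "text": 0.10}))
print(tokens_per_source(TOTAL, {"caption": 0.90, "interleaved": 0.05, "text": 0.05}))
```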
Referee: [Architecture ablations] Image encoder and resolution ablations: the reported substantial impact of encoder choice, image resolution, and token count must be shown to be independent of interactions with the data mix; if these ablations were run only at a single fixed mix or without re-optimizing hyperparameters for each encoder variant, the isolation of effects is incomplete.
Authors: The encoder, resolution, and token-count ablations were conducted using the final recommended data mixture identified in the data ablations. While we did not re-optimize every hyperparameter for every encoder variant, the relative performance trends remained consistent across preliminary checks with alternative mixes. We will revise Section 3 to explicitly state the fixed data mix used for these ablations and add a brief discussion of potential interactions as a limitation.
Revision: partial
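The isolation concern here is, in effect, a question about interaction effects. A hedged sketch of the check it implies: run a small encoder-by-mixture factorial grid and verify that the encoder ranking is stable across mixes. Names and scores below are dummy placeholders, not results from the paper.

```python
from itertools import product

ENCODERS = ["encoder_a", "encoder_b"]       # placeholder encoder variants
MIXES = ["final_mix", "caption_heavy_mix"]  # placeholder data mixtures

# Dummy few-shot scores per (encoder, mix) cell, purely to exercise the check;
# in a real run these would come from training and evaluating each cell.
scores = {(e, m): 0.0 for e, m in product(ENCODERS, MIXES)}

def ranking(mix: str) -> list:
    """Order encoders by score within one data mixture."""
    return sorted(ENCODERS, key=lambda e: scores[(e, mix)], reverse=True)

# If encoder effects are independent of the data mix, this should hold:
print("ranking stable across mixes:", all(ranking(m) == ranking(MIXES[0]) for m in MIXES))
```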
Circularity Check
No circularity: purely empirical ablations with external benchmarks
Full rationale
The paper conducts large-scale empirical ablations on image encoders, vision-language connectors, and pre-training data mixes (image-caption, interleaved, text-only) to identify design lessons for MLLMs. No mathematical derivations, equations, or fitted parameters are presented that could reduce to self-definitions or internal predictions. All claims of SOTA few-shot performance are grounded in comparisons to external published results and standard benchmarks, with no load-bearing self-citations or uniqueness theorems invoked. Every claim is checked against external validation rather than internal self-reference, so no circular steps arise.
Axiom & Free-Parameter Ledger
Axioms (1)
- Domain assumption: standard large-model scaling assumptions hold for multimodal pre-training.
Forward citations
Cited by 20 Pith papers
- Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Vision-Language Models. Molmo VLMs trained on newly collected PixMo open datasets achieve state-of-the-art performance among open-weight models and surpass multiple proprietary VLMs including Claude 3.5 Sonnet and Gemini 1.5 Pro.
- MME-RealWorld: Could Your Multimodal LLM Challenge High-Resolution Real-World Scenarios that are Difficult for Humans? MME-RealWorld is the largest manually annotated high-resolution benchmark for MLLMs, where even the best models achieve less than 60% accuracy on challenging real-world tasks.
- MixAtlas: Uncertainty-aware Data Mixture Optimization for Multimodal LLM Midtraining. MixAtlas uses CLIP-based decomposition and Gaussian process optimization on small proxies to discover data mixtures that improve multimodal benchmark performance by up to 17.6% and transfer to larger models with faster...
- LLaVA-NeXT-Interleave: Tackling Multi-image, Video, and 3D in Large Multimodal Models. LLaVA-NeXT-Interleave unifies multi-image, video, and 3D capabilities in large multimodal models via a new 1.18M-sample interleaved dataset and benchmark, achieving leading results across those tasks while preserving...
- 20/20 Vision Language Models: A Prescription for Better VLMs through Data Curation Alone. Data curation alone raises VLM accuracy by more than 11 points on average across many benchmarks while cutting required training compute by up to 87 times.
- 20/20 Vision Language Models: A Prescription for Better VLMs through Data Curation Alone. Data curation alone raises VLM accuracy by 11+ points on average, improves reliability and OOD generalization, and achieves near-frontier results at far lower training and inference cost.
- Compared to What? Baselines and Metrics for Counterfactual Prompting. Counterfactual prompting effects on LLMs are often indistinguishable from those caused by meaning-preserving paraphrases, causing most previously reported demographic sensitivities to disappear under proper statistical...
- MMaDA: Multimodal Large Diffusion Language Models. MMaDA is a unified multimodal diffusion model using mixed chain-of-thought fine-tuning and a new UniGRPO reinforcement learning algorithm that outperforms specialized models in reasoning, understanding, and text-to-image...
- Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling. InternVL 2.5 is the first open-source MLLM to surpass 70% on the MMMU benchmark via model, data, and test-time scaling, with a 3.7-point gain from chain-of-thought reasoning.
- OpenVLA: An Open-Source Vision-Language-Action Model. OpenVLA achieves 16.5% higher task success than the 55B RT-2-X model across 29 tasks with 7x fewer parameters while enabling effective fine-tuning and quantization without performance loss.
- Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone. Phi-3-mini (3.8B params, 3.3T tokens) reaches 69% MMLU and 8.38 MT-bench, matching larger models, with scaled-up 7B/14B variants and phi-3.5 extensions for multilingual, MoE, and vision capabilities.
- Back to the Barn with LLAMAs: Evolving Pretrained LLM Backbones in Finetuning Vision Language Models. Newer LLM backbones in VLMs do not always improve performance; gains are task-dependent, with VQA models solving different questions due to better confidence calibration and stable representations.
- Show-o: One Single Transformer to Unify Multimodal Understanding and Generation. Show-o unifies autoregressive and discrete diffusion modeling inside one transformer to support multimodal understanding and generation tasks with competitive benchmark performance.
- LLaVA-OneVision: Easy Visual Task Transfer. LLaVA-OneVision is the first single open LMM to simultaneously achieve strong performance in single-image, multi-image, and video scenarios with cross-scenario transfer capabilities.
- MiniCPM-V: A GPT-4V Level MLLM on Your Phone. MiniCPM-Llama3-V 2.5 delivers GPT-4V-level multimodal performance on phones through architecture, pretraining, and alignment optimizations.
- SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features. SigLIP 2 models trained with a unified recipe of captioning, self-supervised losses, and curated diverse data outperform prior SigLIP versions on classification, retrieval, localization, dense prediction, and multilingual...
- PaliGemma 2: A Family of Versatile VLMs for Transfer. PaliGemma 2 is a family of vision-language models that achieves state-of-the-art results on transfer tasks like table structure recognition and radiography report generation by combining SigLIP with Gemma 2 models at ...
- PaliGemma: A versatile 3B VLM for transfer. PaliGemma is an open 3B VLM based on SigLIP and Gemma that achieves strong performance on nearly 40 diverse open-world tasks including benchmarks, remote-sensing, and segmentation.
- How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites. InternVL 1.5 narrows the performance gap to proprietary multimodal models via a stronger transferable vision encoder, dynamic high-resolution tiling, and curated English-Chinese training data.
- A Survey on Multimodal Large Language Models. This survey organizes the architectures, training strategies, data, evaluation methods, extensions, and challenges of Multimodal Large Language Models.