pith. machine review for the scientific record.

arxiv: 2403.18814 · v1 · submitted 2024-03-27 · 💻 cs.CV · cs.AI · cs.CL

Recognition: 2 Lean theorem links

Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models

Authors on Pith: no claims yet

Pith reviewed 2026-05-17 07:39 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI · cs.CL
keywords Mini-Gemini · Vision Language Models · Multi-modality · High-resolution visual tokens · VLM-guided generation · Zero-shot benchmarks · Image understanding and generation

The pith

Mini-Gemini narrows the gap with top vision-language models by adding high-resolution image handling, better data, and self-guided generation without raising token counts.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Mini-Gemini as a framework that strengthens existing vision-language models through three targeted changes. An extra visual encoder supplies finer image details while the total number of visual tokens stays fixed. A new dataset emphasizes accurate image reading and reasoning that leads to generated outputs. The model then uses its own outputs to steer further generation steps. These steps together let the same models handle image understanding, reasoning, and creation in one flow and work with language models ranging from 2 billion to 34 billion parameters. Readers care because current open models still trail closed systems such as GPT-4 on many tasks, and the work shows concrete ways to close that distance using existing model sizes.

Core claim

Mini-Gemini mines the potential of multi-modality vision-language models through three additions: high-resolution visual tokens from an extra encoder that does not increase token count, a high-quality dataset supporting precise comprehension and reasoning-based generation, and VLM-guided generation. Together these enable simultaneous image understanding, reasoning, and generation, support dense and MoE large language models from 2B to 34B parameters, and reach leading results on multiple zero-shot benchmarks.

What carries the argument

The three-aspect enhancement framework that pairs an extra high-resolution visual encoder with a curated dataset and VLM-guided generation to expand capabilities while holding visual token count steady.
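The paper keeps the visual token count fixed while adding detail; one plausible mechanism (the text summarized here does not pin it down) is cross-attention in which the fixed set of low-resolution tokens queries the larger high-resolution feature map. A minimal numpy sketch of that idea, with all names and shapes hypothetical:

```python
import numpy as np

def refine_tokens(low_res, high_res):
    """Cross-attention sketch: low-res tokens query high-res features.

    low_res:  (N, d) visual tokens fed to the LLM
    high_res: (M, d) features from the auxiliary encoder, M >> N
    Returns (N, d): same token count, enriched with high-res detail.
    """
    d = low_res.shape[1]
    scores = low_res @ high_res.T / np.sqrt(d)           # (N, M)
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)        # softmax over M
    return low_res + weights @ high_res                  # residual fusion

rng = np.random.default_rng(0)
tokens = refine_tokens(rng.standard_normal((64, 32)),    # 64 LLM-side tokens
                       rng.standard_normal((1024, 32)))  # 1024 high-res features
print(tokens.shape)  # (64, 32) -- token count unchanged
```

Because the output has the same shape as the low-resolution input, the LLM's context cost is unchanged however many high-resolution features the auxiliary encoder produces.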

If this is right

  • VLMs gain the ability to perform any-to-any workflows that include both understanding and generation in one session.
  • The same base models now reach or exceed some private models on several zero-shot benchmarks.
  • The method applies uniformly to both dense and mixture-of-experts language models across the 2B-to-34B size range.
  • Image reasoning and generation become native operations inside the same forward pass rather than separate stages.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Resolution limits in current visual tokenizers may be a larger bottleneck than previously assumed, since extra detail can be added without token inflation.
  • High-quality paired data focused on reasoning chains could transfer to other multimodal tasks beyond the ones tested here.
  • Self-guided generation opens a route for iterative refinement loops that stay inside the model rather than relying on external tools.
  • The approach may generalize to video or other sequential visual inputs by extending the same high-resolution refinement idea.
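The self-guided generation idea can be pictured as a loop in which the VLM's own text output becomes the prompt for the next generation call, so refinement stays inside one session. A schematic sketch with stand-in functions (`vlm_reason` and `generate_image` are hypothetical placeholders, not the paper's API):

```python
def vlm_reason(image, instruction):
    # Stand-in for the VLM: turns an instruction plus image context
    # into a text prompt for the image generator.
    return f"detailed prompt for: {instruction}"

def generate_image(prompt):
    # Stand-in for a text-to-image decoder (e.g. an SDXL-style model).
    return {"prompt": prompt}

def self_guided_loop(image, instruction, steps=2):
    """VLM-guided generation: each step is steered by the model's
    own text output rather than by an external tool."""
    current = image
    for _ in range(steps):
        prompt = vlm_reason(current, instruction)
        current = generate_image(prompt)
    return current

result = self_guided_loop(None, "add a red boat to the harbor")
print(result["prompt"])
```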

Load-bearing premise

The reported gains come mainly from the added high-resolution encoder, the new dataset, and the guided generation step rather than from undisclosed training choices or benchmark selection.

What would settle it

A controlled experiment that removes the high-resolution encoder or the new dataset and still matches the original benchmark scores on the same zero-shot tasks would falsify the claim that these three components are the primary drivers.
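Such a controlled study is a standard factorial ablation: toggle each of the three components independently while holding every other training variable fixed. A sketch of the configuration grid, with hypothetical component names:

```python
from itertools import product

components = ["hires_encoder", "curated_data", "guided_generation"]

def ablation_grid():
    """Enumerate all on/off settings of the three components so each
    can be toggled independently with everything else held fixed."""
    for flags in product([False, True], repeat=len(components)):
        yield dict(zip(components, flags))

configs = list(ablation_grid())
print(len(configs))  # 8 = 2^3 settings; run each under identical training
```

Each of the eight configurations would be trained with the same optimizer settings, data mixture, and epoch count, isolating the contribution of every component and of their interactions.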

read the original abstract

In this work, we introduce Mini-Gemini, a simple and effective framework enhancing multi-modality Vision Language Models (VLMs). Despite the advancements in VLMs facilitating basic visual dialog and reasoning, a performance gap persists compared to advanced models like GPT-4 and Gemini. We try to narrow the gap by mining the potential of VLMs for better performance and any-to-any workflow from three aspects, i.e., high-resolution visual tokens, high-quality data, and VLM-guided generation. To enhance visual tokens, we propose to utilize an additional visual encoder for high-resolution refinement without increasing the visual token count. We further construct a high-quality dataset that promotes precise image comprehension and reasoning-based generation, expanding the operational scope of current VLMs. In general, Mini-Gemini further mines the potential of VLMs and empowers current frameworks with image understanding, reasoning, and generation simultaneously. Mini-Gemini supports a series of dense and MoE Large Language Models (LLMs) from 2B to 34B. It is demonstrated to achieve leading performance in several zero-shot benchmarks and even surpasses the developed private models. Code and models are available at https://github.com/dvlab-research/MiniGemini.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces Mini-Gemini, a framework for enhancing multi-modal vision-language models via three components: an additional high-resolution visual encoder that refines tokens without increasing their count, a constructed high-quality dataset promoting precise comprehension and reasoning-based generation, and VLM-guided generation. The method supports dense and MoE LLMs ranging from 2B to 34B parameters and reports leading results on multiple zero-shot benchmarks, sometimes surpassing private models.

Significance. If the reported gains can be shown to arise primarily from the three proposed components rather than from uncontrolled differences in training regime or evaluation protocol, the work would offer a practical route to improve VLM efficiency and capability in understanding, reasoning, and generation tasks while keeping visual token budgets fixed. Open-sourcing of code and models supports reproducibility.

major comments (2)
  1. [§4] Experiments: the central claim that performance improvements stem from the high-resolution encoder, high-quality dataset, and VLM-guided generation is not supported by controlled ablations. No factorial experiments hold training details, optimizer settings, data mixture ratios, and epoch counts fixed while toggling each component independently; comparisons to baselines therefore leave open the possibility that gains arise from undisclosed hyperparameter tuning or benchmark selection.
  2. [Tables 1–3] Main results tables: reported scores lack error bars, standard deviations, or the number of independent runs, so it is impossible to assess whether the claimed leadership on zero-shot benchmarks is statistically robust.
minor comments (2)
  1. [§3] The integration of the high-resolution encoder with the base visual encoder is described at a high level; a precise statement of how token count is preserved (e.g., via pooling or projection) would aid clarity.
  2. [Abstract] The abstract states that Mini-Gemini 'surpasses the developed private models' without naming those models or providing the corresponding scores in the main text.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below with clarifications on our experimental design and indicate where revisions will be made to strengthen the presentation.

read point-by-point responses
  1. Referee: [§4] Experiments: the central claim that performance improvements stem from the high-resolution encoder, high-quality dataset, and VLM-guided generation is not supported by controlled ablations. No factorial experiments hold training details, optimizer settings, data mixture ratios, and epoch counts fixed while toggling each component independently; comparisons to baselines therefore leave open the possibility that gains arise from undisclosed hyperparameter tuning or benchmark selection.

    Authors: We appreciate the referee's point regarding the need for more rigorously controlled ablations. The manuscript reports incremental results when adding each component (high-resolution refinement, curated data, and guided generation) while keeping the underlying LLM and training framework consistent across scales from 2B to 34B. However, we acknowledge that a complete factorial design holding every hyperparameter, data ratio, and epoch count fixed would provide stronger isolation of effects. Full factorial experiments at the 34B scale are computationally prohibitive; therefore, in the revision we will add targeted controlled ablations on the 2B and 7B models, fixing all other variables and toggling one component at a time, to better substantiate the contribution of each element. revision: yes

  2. Referee: [Tables 1–3] Main results tables: reported scores lack error bars, standard deviations, or the number of independent runs, so it is impossible to assess whether the claimed leadership on zero-shot benchmarks is statistically robust.

    Authors: We agree that reporting variability would allow better evaluation of statistical robustness. Given the substantial compute required to train and evaluate models up to 34B parameters, multiple independent runs per configuration were not feasible, which is a common practical constraint in large-scale VLM literature. In the revised manuscript we will add an explicit discussion of this limitation in Section 4, note that all compared models followed identical training protocols, and emphasize that performance gains remain consistent across diverse benchmarks and model sizes as supporting evidence of reliability. revision: partial
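The variability the referee asks for needs only a handful of repeated evaluations per configuration. A sketch of the minimal reporting, using hypothetical scores rather than numbers from the paper:

```python
import statistics

def summarize(scores):
    """Mean and sample standard deviation across independent runs,
    the minimum needed to judge whether a benchmark lead is robust."""
    mean = statistics.fmean(scores)
    sd = statistics.stdev(scores) if len(scores) > 1 else float("nan")
    return mean, sd

runs_ours = [71.2, 70.8, 71.5]   # hypothetical repeated evaluations
runs_base = [69.9, 70.1, 70.0]
m1, s1 = summarize(runs_ours)
m2, s2 = summarize(runs_base)
print(f"ours {m1:.2f}±{s1:.2f} vs base {m2:.2f}±{s2:.2f}")
```

If the gap between means is small relative to the standard deviations, the claimed leadership is not statistically distinguishable from run-to-run noise.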

Circularity Check

0 steps flagged

Empirical framework evaluated on external benchmarks exhibits no circularity

full rationale

The manuscript introduces Mini-Gemini as an engineering framework that augments existing VLMs via a high-resolution encoder, a constructed high-quality dataset, and VLM-guided generation. Performance is reported on independent zero-shot benchmarks external to the paper. No equations, fitted parameters, or predictions are defined in terms of themselves; no self-citation chain is invoked to justify uniqueness or load-bearing premises; and no renaming of known results occurs. The derivation chain consists of component proposals followed by empirical measurement against outside references, satisfying the criteria for a self-contained, non-circular contribution.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The paper is an empirical engineering contribution rather than a derivation from first principles. It relies on standard assumptions about VLM training dynamics and benchmark validity.

free parameters (1)
  • high-resolution encoder hyperparameters
    Tuned parameters controlling the additional visual encoder's refinement process.
axioms (1)
  • domain assumption An auxiliary high-resolution visual encoder can improve feature quality while keeping visual token count unchanged.
    Invoked in the first contribution axis for visual token enhancement.

pith-pipeline@v0.9.0 · 5544 in / 1172 out tokens · 62809 ms · 2026-05-17T07:39:59.742394+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 18 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. MME-RealWorld: Could Your Multimodal LLM Challenge High-Resolution Real-World Scenarios that are Difficult for Humans?

    cs.CV 2024-08 conditional novelty 8.0

    MME-RealWorld is the largest manually annotated high-resolution benchmark for MLLMs, where even the best models achieve less than 60% accuracy on challenging real-world tasks.

  2. Seg-Zero: Reasoning-Chain Guided Segmentation via Cognitive Reinforcement

    cs.CV 2025-03 unverdicted novelty 7.0

    Seg-Zero uses cognitive reinforcement learning on a decoupled reasoning-plus-segmentation architecture to produce explicit reasoning chains and reach 57.5 zero-shot accuracy on ReasonSeg, beating prior supervised LISA...

  3. Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs

    cs.CV 2024-06 unverdicted novelty 7.0

    Cambrian-1 is a vision-centric multimodal LLM family that evaluates over 20 vision encoders, introduces CV-Bench and the Spatial Vision Aggregator, and releases open models, code, and data achieving strong performance...

  4. GRIP-VLM: Group-Relative Importance Pruning for Efficient Vision-Language Models

    cs.CV 2026-05 unverdicted novelty 6.0

    GRIP-VLM applies group-relative policy optimization via reinforcement learning to prune visual tokens in VLMs, yielding up to 15% inference speedup at matched accuracy over prior methods.

  5. Switch-KD: Visual-Switch Knowledge Distillation for Vision-Language Models

    cs.CV 2026-04 unverdicted novelty 6.0

    A 0.5B student VLM distills from a 3B teacher using visual-switch distillation and DBiLD loss to gain 3.6 points on average across 10 multimodal benchmarks without architecture changes.

  6. POINTS-Long: Adaptive Dual-Mode Visual Reasoning in MLLMs

    cs.CV 2026-04 unverdicted novelty 6.0

    POINTS-Long is a dual-mode multimodal large language model that uses dynamic visual token scaling to retain 97.7-99.7% accuracy on long-form tasks with 1/40 to 1/10th the tokens and supports streaming via detachable KV-cache.

  7. Q-Zoom: Query-Aware Adaptive Perception for Efficient Multimodal Large Language Models

    cs.CV 2026-04 unverdicted novelty 6.0

    Q-Zoom achieves up to 4.39x inference speedup in high-resolution MLLM scenarios via query-aware gating and region localization, matching or exceeding baseline accuracy on document and high-res benchmarks.

  8. SpatialStack: Layered Geometry-Language Fusion for 3D VLM Spatial Reasoning

    cs.CV 2026-03 unverdicted novelty 6.0

    SpatialStack improves 3D spatial reasoning in vision-language models by stacking and synchronizing multi-level geometric features with the language backbone.

  9. MetaMorph: Multimodal Understanding and Generation via Instruction Tuning

    cs.CV 2024-12 unverdicted novelty 6.0

    VPiT enables pretrained LLMs to perform both visual understanding and generation by predicting discrete text tokens and continuous visual tokens, with understanding data proving more effective than generation-specific data.

  10. Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling

    cs.CV 2024-12 unverdicted novelty 6.0

    InternVL 2.5 is the first open-source MLLM to surpass 70% on the MMMU benchmark via model, data, and test-time scaling, with a 3.7-point gain from chain-of-thought reasoning.

  11. SparseVLM: Visual Token Sparsification for Efficient Vision-Language Model Inference

    cs.CV 2024-10 accept novelty 6.0

    SparseVLM uses text-guided attention to prune and recycle visual tokens in VLMs, delivering 54% FLOPs reduction and 37% lower latency with 97% accuracy retention on LLaVA.

  12. VILA-U: a Unified Foundation Model Integrating Visual Understanding and Generation

    cs.CV 2024-09 unverdicted novelty 6.0

    VILA-U unifies visual understanding and generation inside one autoregressive next-token prediction model, removing separate diffusion components while claiming near state-of-the-art results.

  13. SEED-X: Multimodal Models with Unified Multi-granularity Comprehension and Generation

    cs.CV 2024-04 unverdicted novelty 6.0

    SEED-X is a unified multimodal foundation model that handles multi-granularity visual semantics for both comprehension and generation across arbitrary image sizes and ratios.

  14. Analogical Reasoning as a Doctor: A Foundation Model for Gastrointestinal Endoscopy Diagnosis

    cs.CV 2026-04 unverdicted novelty 5.0

    RATNet applies analogical reasoning via a cyclic pre-training strategy to outperform prior foundation models in GI endoscopy diagnosis across diagnosis, few-shot, zero-shot, robustness, adaptation, and federated scenarios.

  15. CogVLM2: Visual Language Models for Image and Video Understanding

    cs.CV 2024-08 conditional novelty 5.0

    CogVLM2 family achieves state-of-the-art results on image and video understanding benchmarks through improved visual expert architecture, higher resolution inputs, and automated temporal grounding for videos.

  16. MiniCPM-V: A GPT-4V Level MLLM on Your Phone

    cs.CV 2024-08 conditional novelty 5.0

    MiniCPM-Llama3-V 2.5 delivers GPT-4V-level multimodal performance on phones through architecture, pretraining, and alignment optimizations.

  17. PaliGemma: A versatile 3B VLM for transfer

    cs.CV 2024-07 unverdicted novelty 4.0

    PaliGemma is an open 3B VLM based on SigLIP and Gemma that achieves strong performance on nearly 40 diverse open-world tasks including benchmarks, remote-sensing, and segmentation.

  18. How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites

    cs.CV 2024-04 unverdicted novelty 4.0

    InternVL 1.5 narrows the performance gap to proprietary multimodal models via a stronger transferable vision encoder, dynamic high-resolution tiling, and curated English-Chinese training data.

Reference graph

Works this paper leans on

71 extracted references · 71 canonical work pages · cited by 18 Pith papers · 26 internal anchors

  1. [1]

    OpenAI. Chatgpt. https://openai.com/blog/chatgpt/, 2023. 2

  2. [2]

    OPT: Open Pre-trained Transformer Language Models

    Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher Dewan, Mona Diab, Xian Li, Xi Victoria Lin, et al. Opt: Open pre-trained transformer language models. arXiv:2205.01068, 2022. 2

  3. [3]

    LLaMA: Open and Efficient Foundation Language Models

    Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. Llama: Open and efficient foundation language models. arXiv:2302.13971,

  4. [4]

    GPT-4 Technical Report

    OpenAI. Gpt-4 technical report. arXiv:2303.08774, 2023. 2, 6

  5. [5]

    Gemini: A Family of Highly Capable Multimodal Models

    Gemini Team, Rohan Anil, Sebastian Borgeaud, Yonghui Wu, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, et al. Gemini: a family of highly capable multimodal models. arXiv:2312.11805, 2023. 2, 6, 7

  6. [6]

    BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models

    Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. arXiv:2301.12597, 2023. 2, 3

  7. [7]

    Visual instruction tuning

    Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. In NeruIPS, 2023. 2, 3, 4

  8. [8]

    MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models

    Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, and Mohamed Elhoseiny. Minigpt-4: Enhancing vision-language understanding with advanced large language models. arXiv:2304.10592, 2023. 2, 3

  9. [9]

    Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding

    Hang Zhang, Xin Li, and Lidong Bing. Video-llama: An instruction-tuned audio-visual language model for video understanding. arXiv:2306.02858, 2023. 2

  10. [10]

    Llama-vid: An image is worth 2 tokens in large language models

    Yanwei Li, Chengyao Wang, and Jiaya Jia. Llama-vid: An image is worth 2 tokens in large language models. arXiv:2311.17043, 2023. 2, 6

  11. [11]

    Llava-next: Improved reasoning, ocr, and world knowledge, 2024

    Haotian Liu, Chunyuan Li, Yuheng Li, Bo Li, Yuanhan Zhang, Sheng Shen, and Yong Jae Lee. Llava-next: Improved reasoning, ocr, and world knowledge, 2024. URL https://llava-vl.github.io/blog/ 2024-01-30-llava-next/ . 2, 5, 6, 8

  12. [12]

    Otterhd: A high- resolution multi-modality model

    Bo Li, Peiyuan Zhang, Jingkang Yang, Yuanhan Zhang, Fanyi Pu, and Ziwei Liu. Otterhd: A high- resolution multi-modality model. arXiv:2311.04219, 2023. 2, 6

  13. [13]

    Introducing our multimodal models, 2023

    Rohan Bavishi, Erich Elsen, Curtis Hawthorne, Maxwell Nye, Augustus Odena, Arushi Somani, and Sa˘gnak Ta¸ sırlar. Introducing our multimodal models, 2023. URLhttps://www.adept.ai/blog/fuyu-8b. 2

  14. [14]

    ShareGPT4V: Improving Large Multi-Modal Models with Better Captions

    Lin Chen, Jisong Li, Xiaoyi Dong, Pan Zhang, Conghui He, Jiaqi Wang, Feng Zhao, and Dahua Lin. Sharegpt4v: Improving large multi-modal models with better captions. arXiv:2311.12793, 2023. 2, 5, 8

  15. [15]

    arXiv preprint arXiv:2402.11684 , year=

    Guiming Hardy Chen, Shunian Chen, Ruifei Zhang, Junying Chen, Xiangbo Wu, Zhiyi Zhang, Zhihong Chen, Jianquan Li, Xiang Wan, and Benyou Wang. Allava: Harnessing gpt4v-synthesized data for a lite vision-language model. arXiv:2402.11684, 2024. 2, 5, 8

  16. [16]

    Making the v in vqa matter: Elevating the role of image understanding in visual question answering

    Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. Making the v in vqa matter: Elevating the role of image understanding in visual question answering. In CVPR, 2017. 2, 3

  17. [17]

    Document collection visual question answering

    Rubèn Tito, Dimosthenis Karatzas, and Ernest Valveny. Document collection visual question answering. In ICDAR 2021, 2021. 5, 11

  18. [18]

    ChartQA: A Benchmark for Question Answering about Charts with Visual and Logical Reasoning

    Ahmed Masry, Do Xuan Long, Jia Qing Tan, Shafiq Joty, and Enamul Hoque. Chartqa: A benchmark for question answering about charts with visual and logical reasoning. arXiv:2203.10244, 2022. 5, 11

  19. [19]

    A diagram is worth a dozen images

    Aniruddha Kembhavi, Mike Salvato, Eric Kolve, Minjoon Seo, Hannaneh Hajishirzi, and Ali Farhadi. A diagram is worth a dozen images. In ECCV, 2016. 2, 5, 11

  20. [20]

    Lima: Less is more for alignment

    Chunting Zhou, Pengfei Liu, Puxin Xu, Srinivasan Iyer, Jiao Sun, Yuning Mao, Xuezhe Ma, Avia Efrat, Ping Yu, Lili Yu, et al. Lima: Less is more for alignment. Advances in Neural Information Processing Systems, 36, 2024. 2, 5

  21. [21]

    Openassistant conversations- democratizing large language model alignment

    Andreas Köpf, Yannic Kilcher, Dimitri von Rütte, Sotiris Anagnostidis, Zhi Rui Tam, Keith Stevens, Abdullah Barhoum, Duc Nguyen, Oliver Stanley, Richárd Nagyfi, et al. Openassistant conversations- democratizing large language model alignment. Advances in Neural Information Processing Systems , 36,

  22. [22]

    SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis

    Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach. Sdxl: Improving latent diffusion models for high-resolution image synthesis. arXiv preprint arXiv:2307.01952, 2023. 2, 3, 5

  23. [23]

    Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond

    Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-vl: A frontier large vision-language model with versatile abilities. arXiv:2308.12966,

  24. [24]

    MMBench: Is Your Multi-modal Model an All-around Player?

    Yuan Liu, Haodong Duan, Yuanhan Zhang, Bo Li, Songyang Zhang, Wangbo Zhao, Yike Yuan, Jiaqi Wang, Conghui He, Ziwei Liu, et al. Mmbench: Is your multi-modal model an all-around player? arXiv:2307.06281, 2023. 2, 6, 7

  25. [25]

    Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi

    Xiang Yue, Yuansheng Ni, Kai Zhang, Tianyu Zheng, Ruoqi Liu, Ge Zhang, Samuel Stevens, Dongfu Jiang, Weiming Ren, Yuxuan Sun, Cong Wei, Botao Yu, Ruibin Yuan, Renliang Sun, Ming Yin, Boyuan Zheng, Zhenzhu Yang, Yibo Liu, Wenhao Huang, Huan Sun, Yu Su, and Wenhu Chen. Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for exp...

  26. [26]

    Attention is all you need

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NeurIPS, 2017. 2

  27. [27]

    BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

    Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirec- tional transformers for language understanding. arXiv:1810.04805, 2018. 2

  28. [28]

    Language models are few-shot learners

    Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. In NeurIPS, 2020. 2

  29. [29]

    Mixtral of Experts

    Albert Q Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Emma Bou Hanna, Florian Bressand, et al. Mixtral of experts. arXiv:2401.04088, 2024. 2, 3

  30. [30]

    Finetuned Language Models Are Zero-Shot Learners

    Jason Wei, Maarten Bosma, Vincent Y Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M Dai, and Quoc V Le. Finetuned language models are zero-shot learners. arXiv:2109.01652, 2021. 3

  31. [31]

    Training language models to follow instructions with human feedback

    Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. In NeurIPS, 2022. 3

  32. [32]

    Hashimoto

    Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. Stanford alpaca: An instruction-following llama model. https://github. com/tatsu-lab/stanford_alpaca, 2023. 3

  33. [33]

    Gonzalez, Ion Stoica, and Eric P

    Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E. Gonzalez, Ion Stoica, and Eric P. Xing. Vicuna: An open-source chat- bot impressing gpt-4 with 90%* chatgpt quality. https://lmsys.org/blog/2023-03-30-vicuna/ ,

  34. [34]

    Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models

    Chenfei Wu, Shengming Yin, Weizhen Qi, Xiaodong Wang, Zecheng Tang, and Nan Duan. Visual chatgpt: Talking, drawing and editing with visual foundation models. arXiv:2303.04671, 2023. 3

  35. [35]

    Gpt4tools: Teaching large language model to use tools via self-instruction

    Rui Yang, Lin Song, Yanwei Li, Sijie Zhao, Yixiao Ge, Xiu Li, and Ying Shan. Gpt4tools: Teaching large language model to use tools via self-instruction. arXiv:2305.18752, 2023. 3

  36. [36]

    Gemma: Introducing new state-of-the-art open models

    Google. Gemma: Introducing new state-of-the-art open models. hhttps://blog.google/technology/ developers/gemma-open-models/, 2024. 3, 6

  37. [37]

    Microsoft COCO Captions: Data Collection and Evaluation Server

    Xinlei Chen, Hao Fang, Tsung-Yi Lin, Ramakrishna Vedantam, Saurabh Gupta, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco captions: Data collection and evaluation server. arXiv:1504.00325,

  38. [38]

    Learn to explain: Multimodal reasoning via thought chains for science question answering

    Pan Lu, Swaroop Mishra, Tanglin Xia, Liang Qiu, Kai-Wei Chang, Song-Chun Zhu, Oyvind Tafjord, Peter Clark, and Ashwin Kalyan. Learn to explain: Multimodal reasoning via thought chains for science question answering. In NeurIPS, 2022. 3

  39. [39]

    Lisa: Reasoning segmentation via large language model

    Xin Lai, Zhuotao Tian, Yukang Chen, Yanwei Li, Yuhui Yuan, Shu Liu, and Jiaya Jia. Lisa: Reasoning segmentation via large language model. arXiv:2308.00692, 2023. 3

  40. [40]

    Learning transferable visual models from natural language supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In ICML, 2021. 3, 4, 6 17

  41. [41]

    Flamingo: a visual language model for few-shot learning

    Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning. In NeurIPS, 2022. 3

  42. [42]

    InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning

    Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, and Steven Hoi. Instructblip: Towards general-purpose vision-language models with instruction tuning. arXiv:2305.06500, 2023. 3, 4, 5, 6, 7

  43. [43]

    Improved Baselines with Visual Instruction Tuning

    Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. arXiv:2310.03744, 2023. 3, 5, 6, 7

  44. [44]

    Internlm-xcomposer: A vision-language large model for advanced text-image comprehension and composition

    Pan Zhang, Xiaoyi Dong, Bin Wang, Yuhang Cao, Chao Xu, Linke Ouyang, Zhiyuan Zhao, Shuangrui Ding, Songyang Zhang, Haodong Duan, Wenwei Zhang, Hang Yan, Xinyue Zhang, Wei Li, Jingwen Li, Kai Chen, Conghui He, Xingcheng Zhang, Yu Qiao, Dahua Lin, and Jiaqi Wang. Internlm-xcomposer: A vision-language large model for advanced text-image comprehension and com...

  45. [45]

    InternLM-XComposer2: Mastering Free-form Text-Image Composition and Comprehension in Vision-Language Large Model

    Xiaoyi Dong, Pan Zhang, Yuhang Zang, Yuhang Cao, Bin Wang, Linke Ouyang, Xilin Wei, Songyang Zhang, Haodong Duan, Maosong Cao, Wenwei Zhang, Yining Li, Hang Yan, Yang Gao, Xinyue Zhang, Wei Li, Jingwen Li, Kai Chen, Conghui He, Xingcheng Zhang, Yu Qiao, Dahua Lin, and Jiaqi Wang. Internlm-xcomposer2: Mastering free-form text-image composition and comprehe...

  46. [46]

    Emu: Generative Pretraining in Multimodality

    Quan Sun, Qiying Yu, Yufeng Cui, Fan Zhang, Xiaosong Zhang, Yueze Wang, Hongcheng Gao, Jingjing Liu, Tiejun Huang, and Xinlong Wang. Generative pretraining in multimodality. arXiv preprint arXiv:2307.05222, 2023. 3

  47. [47]

    Generative multimodal models are in-context learners

    Quan Sun, Yufeng Cui, Xiaosong Zhang, Fan Zhang, Qiying Yu, Zhengxiong Luo, Yueze Wang, Yongming Rao, Jingjing Liu, Tiejun Huang, et al. Generative multimodal models are in-context learners. arXiv preprint arXiv:2312.13286, 2023. 3, 5

  48. [48]

    Planting a seed of vision in large language model

    Yuying Ge, Yixiao Ge, Ziyun Zeng, Xintao Wang, and Ying Shan. Planting a seed of vision in large language model. arXiv preprint arXiv:2307.08041, 2023. 3

  49. [49]

    Making llama see and draw with seed tokenizer

    Yuying Ge, Sijie Zhao, Ziyun Zeng, Yixiao Ge, Chen Li, Xintao Wang, and Ying Shan. Making llama see and draw with seed tokenizer. arXiv preprint arXiv:2310.01218, 2023. 3

  50. [50]

    Llmga: Multimodal large language model based generation assistant

    Bin Xia, Shiyin Wang, Yingfan Tao, Yitong Wang, and Jiaya Jia. Llmga: Multimodal large language model based generation assistant. arXiv preprint arXiv:2311.16500, 2023. 3, 5, 15

  51. [51]

    Chatillusion: Efficient-aligning interleaved generation ability with visual instruction model

    Xiaowei Chi, Yijiang Liu, Zhengkai Jiang, Rongyu Zhang, Ziyi Lin, Renrui Zhang, Peng Gao, Chaoyou Fu, Shanghang Zhang, Qifeng Liu, et al. Chatillusion: Efficient-aligning interleaved generation ability with visual instruction model. arXiv preprint arXiv:2311.17963, 2023. 8, 10, 15

  52. [52]

    Anygpt: Unified multimodal llm with discrete sequence modeling

    Jun Zhan, Junqi Dai, Jiasheng Ye, Yunhua Zhou, Dong Zhang, Zhigeng Liu, Xin Zhang, Ruibin Yuan, Ge Zhang, Linyang Li, et al. Anygpt: Unified multimodal llm with discrete sequence modeling. arXiv preprint arXiv:2402.12226, 2024. 3, 5, 8, 10

  53. [53]

    Improving image generation with better captions

    James Betker, Gabriel Goh, Li Jing, Tim Brooks, Jianfeng Wang, Linjie Li, Long Ouyang, Juntang Zhuang, Joyce Lee, Yufei Guo, et al. Improving image generation with better captions. Computer Science. https://cdn.openai.com/papers/dall-e-3.pdf, 2(3):8, 2023. 3, 5

  54. [54]

    LAION-5b: An open large-scale dataset for training next generation image-text models

    Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade W Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, Patrick Schramowski, Srivatsa R Kundurthy, Katherine Crowson, Ludwig Schmidt, Robert Kaczmarczyk, and Jenia Jitsev. LAION-5b: An open large-scale dataset for training next generation image-text models. In NeurIPS, 2022.

  55. [55]

    A convnet for the 2020s

    Zhuang Liu, Hanzi Mao, Chao-Yuan Wu, Christoph Feichtenhofer, Trevor Darrell, and Saining Xie. A convnet for the 2020s. In CVPR, 2022. 4

  56. [56]

    Wuerstchen: An efficient architecture for large-scale text-to-image diffusion models

    Pablo Pernias, Dominic Rampas, Mats L. Richter, Christopher J. Pal, and Marc Aubreville. Wuerstchen: An efficient architecture for large-scale text-to-image diffusion models, 2023. 5

  57. [57]

    Video generation models as world simulators

    OpenAI. Video generation models as world simulators. URL https://openai.com/research/video-generation-models-as-world-simulators. 5

  58. [58]

    Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning

    Piyush Sharma, Nan Ding, Sebastian Goodman, and Radu Soricut. Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In ACL, 2018. 5

  59. [59]

    Textcaps: a dataset for image captioning with reading comprehension

    Oleksii Sidorov, Ronghang Hu, Marcus Rohrbach, and Amanpreet Singh. Textcaps: a dataset for image captioning with reading comprehension. In ECCV, 2020. 5, 8

  60. [60]

    LAION/gpt4v-dataset · Datasets at Hugging Face

    LAION e.V. LAION/gpt4v-dataset · Datasets at Hugging Face. URL https://huggingface.co/datasets/laion/gpt4v-dataset. 5, 8

  61. [61]

    Dvqa: Understanding data visualizations via question answering

    Kushal Kafle, Brian Price, Scott Cohen, and Christopher Kanan. Dvqa: Understanding data visualizations via question answering. In CVPR, 2018. 5

  62. [62]

    Stable-diffusion-prompts

    Stable-diffusion-prompts. URL https://www.gigasheet.com/sample-data/stable-diffusion-prompts. 5

  63. [63]

    MobileVLM : A Fast, Strong and Open Vision Language Assistant for Mobile Devices

    Xiangxiang Chu, Limeng Qiao, Xinyang Lin, Shuang Xu, Yang Yang, Yiming Hu, Fei Wei, Xinyu Zhang, Bo Zhang, Xiaolin Wei, et al. Mobilevlm: A fast, strong and open vision language assistant for mobile devices. arXiv:2312.16886, 2023. 6, 7

  64. [64]

    Shikra: Unleashing Multimodal LLM's Referential Dialogue Magic

    Keqin Chen, Zhao Zhang, Weili Zeng, Richong Zhang, Feng Zhu, and Rui Zhao. Shikra: Unleashing multimodal llm’s referential dialogue magic. arXiv:2306.15195, 2023. 6

  65. [65]

    Introducing idefics: An open reproduction of state-of-the-art visual language model

    IDEFICS. Introducing idefics: An open reproduction of state-of-the-art visual language model. https://huggingface.co/blog/idefics, 2023. 6

  66. [66]

    CogVLM: Visual Expert for Pretrained Language Models

    Weihan Wang, Qingsong Lv, Wenmeng Yu, Wenyi Hong, Ji Qi, Yan Wang, Junhui Ji, Zhuoyi Yang, Lei Zhao, Xixuan Song, et al. Cogvlm: Visual expert for pretrained language models. arXiv:2311.03079, 2023. 6

  67. [67]

    Towards vqa models that can read

    Amanpreet Singh, Vivek Natarajan, Meet Shah, Yu Jiang, Xinlei Chen, Dhruv Batra, Devi Parikh, and Marcus Rohrbach. Towards vqa models that can read. In CVPR, 2019. 6, 7, 8, 11

  68. [68]

    MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models

    Chaoyou Fu, Peixian Chen, Yunhang Shen, Yulei Qin, Mengdan Zhang, Xu Lin, Zhenyu Qiu, Wei Lin, Jinrui Yang, Xiawu Zheng, et al. Mme: A comprehensive evaluation benchmark for multimodal large language models. arXiv:2306.13394, 2023. 6, 7

  69. [69]

    MM-Vet: Evaluating Large Multimodal Models for Integrated Capabilities

    Weihao Yu, Zhengyuan Yang, Linjie Li, Jianfeng Wang, Kevin Lin, Zicheng Liu, Xinchao Wang, and Lijuan Wang. Mm-vet: Evaluating large multimodal models for integrated capabilities. arXiv:2308.02490, 2023.

  70. [70]

    Mathvista: Evaluating mathematical reasoning of foundation models in visual contexts

    Pan Lu, Hritik Bansal, Tony Xia, Jiacheng Liu, Chunyuan Li, Hannaneh Hajishirzi, Hao Cheng, Kai-Wei Chang, Michel Galley, and Jianfeng Gao. Mathvista: Evaluating mathematical reasoning of foundation models in visual contexts. In ICLR, 2024. 6, 7

  71. [71]

    Awesome multilingual ocr toolkits based on paddlepaddle

    PaddleOCR. Awesome multilingual ocr toolkits based on paddlepaddle. URL https://github.com/PaddlePaddle/PaddleOCR. 11