pith. machine review for the scientific record.

arxiv: 2403.18814 · v1 · submitted 2024-03-27 · 💻 cs.CV · cs.AI · cs.CL

Recognition: 2 Lean theorem links

Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models

Authors on Pith: no claims yet

Pith reviewed 2026-05-17 07:39 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI · cs.CL
keywords Mini-Gemini · Vision Language Models · Multi-modality · High-resolution visual tokens · VLM-guided generation · Zero-shot benchmarks · Image understanding and generation

The pith

Mini-Gemini narrows the gap with top vision-language models by adding high-resolution image handling, better data, and self-guided generation without raising token counts.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Mini-Gemini as a framework that strengthens existing vision-language models through three targeted changes. An extra visual encoder supplies finer image details while the total number of visual tokens stays fixed. A new dataset emphasizes accurate image reading and reasoning that leads to generated outputs. The model then uses its own outputs to steer further generation steps. These steps together let the same models handle image understanding, reasoning, and creation in one flow and work with language models ranging from 2 billion to 34 billion parameters. Readers care because current open models still trail closed systems such as GPT-4 on many tasks, and the work shows concrete ways to close that distance using existing model sizes.

Core claim

Mini-Gemini mines the potential of multi-modality vision-language models through three additions: high-resolution visual tokens from an extra encoder that does not increase token count, a high-quality dataset supporting precise comprehension and reasoning-based generation, and VLM-guided generation. Together these enable simultaneous image understanding, reasoning, and generation, support dense and MoE large language models from 2B to 34B parameters, and reach leading results on multiple zero-shot benchmarks.

What carries the argument

The three-aspect enhancement framework that pairs an extra high-resolution visual encoder with a curated dataset and VLM-guided generation to expand capabilities while holding visual token count steady.
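The paper keeps the visual token count fixed while adding detail; one plausible mechanism (the text summarized here does not pin it down) is cross-attention in which the fixed set of low-resolution tokens queries the larger high-resolution feature map. A minimal numpy sketch of that idea, with all names and shapes hypothetical:

```python
import numpy as np

def refine_tokens(low_res, high_res):
    """Cross-attention sketch: low-res tokens query high-res features.

    low_res:  (N, d) visual tokens fed to the LLM
    high_res: (M, d) features from the auxiliary encoder, M >> N
    Returns (N, d): same token count, enriched with high-res detail.
    """
    d = low_res.shape[1]
    scores = low_res @ high_res.T / np.sqrt(d)           # (N, M)
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)        # softmax over M
    return low_res + weights @ high_res                  # residual fusion

rng = np.random.default_rng(0)
tokens = refine_tokens(rng.standard_normal((64, 32)),    # 64 LLM-side tokens
                       rng.standard_normal((1024, 32)))  # 1024 high-res features
print(tokens.shape)  # (64, 32) -- token count unchanged
```

Because the output has the same shape as the low-resolution input, the LLM's context cost is unchanged however many high-resolution features the auxiliary encoder produces.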

If this is right

  • VLMs gain the ability to perform any-to-any workflows that include both understanding and generation in one session.
  • The same base models now reach or exceed some private models on several zero-shot benchmarks.
  • The method applies uniformly to both dense and mixture-of-experts language models across the 2B-to-34B size range.
  • Image reasoning and generation become native operations inside the same forward pass rather than separate stages.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Resolution limits in current visual tokenizers may be a larger bottleneck than previously assumed, since extra detail can be added without token inflation.
  • High-quality paired data focused on reasoning chains could transfer to other multimodal tasks beyond the ones tested here.
  • Self-guided generation opens a route for iterative refinement loops that stay inside the model rather than relying on external tools.
  • The approach may generalize to video or other sequential visual inputs by extending the same high-resolution refinement idea.
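The self-guided generation idea can be pictured as a loop in which the VLM's own text output becomes the prompt for the next generation call, so refinement stays inside one session. A schematic sketch with stand-in functions (`vlm_reason` and `generate_image` are hypothetical placeholders, not the paper's API):

```python
def vlm_reason(image, instruction):
    # Stand-in for the VLM: turns an instruction plus image context
    # into a text prompt for the image generator.
    return f"detailed prompt for: {instruction}"

def generate_image(prompt):
    # Stand-in for a text-to-image decoder (e.g. an SDXL-style model).
    return {"prompt": prompt}

def self_guided_loop(image, instruction, steps=2):
    """VLM-guided generation: each step is steered by the model's
    own text output rather than by an external tool."""
    current = image
    for _ in range(steps):
        prompt = vlm_reason(current, instruction)
        current = generate_image(prompt)
    return current

result = self_guided_loop(None, "add a red boat to the harbor")
print(result["prompt"])
```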

Load-bearing premise

The reported gains come mainly from the added high-resolution encoder, the new dataset, and the guided generation step rather than from undisclosed training choices or benchmark selection.

What would settle it

A controlled experiment that removes the high-resolution encoder or the new dataset and still matches the original benchmark scores on the same zero-shot tasks would falsify the claim that these three components are the primary drivers.
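Such a controlled study is a standard factorial ablation: toggle each of the three components independently while holding every other training variable fixed. A sketch of the configuration grid, with hypothetical component names:

```python
from itertools import product

components = ["hires_encoder", "curated_data", "guided_generation"]

def ablation_grid():
    """Enumerate all on/off settings of the three components so each
    can be toggled independently with everything else held fixed."""
    for flags in product([False, True], repeat=len(components)):
        yield dict(zip(components, flags))

configs = list(ablation_grid())
print(len(configs))  # 8 = 2^3 settings; run each under identical training
```

Each of the eight configurations would be trained with the same optimizer settings, data mixture, and epoch count, isolating the contribution of every component and of their interactions.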

read the original abstract

In this work, we introduce Mini-Gemini, a simple and effective framework enhancing multi-modality Vision Language Models (VLMs). Despite the advancements in VLMs facilitating basic visual dialog and reasoning, a performance gap persists compared to advanced models like GPT-4 and Gemini. We try to narrow the gap by mining the potential of VLMs for better performance and any-to-any workflow from three aspects, i.e., high-resolution visual tokens, high-quality data, and VLM-guided generation. To enhance visual tokens, we propose to utilize an additional visual encoder for high-resolution refinement without increasing the visual token count. We further construct a high-quality dataset that promotes precise image comprehension and reasoning-based generation, expanding the operational scope of current VLMs. In general, Mini-Gemini further mines the potential of VLMs and empowers current frameworks with image understanding, reasoning, and generation simultaneously. Mini-Gemini supports a series of dense and MoE Large Language Models (LLMs) from 2B to 34B. It is demonstrated to achieve leading performance in several zero-shot benchmarks and even surpasses the developed private models. Code and models are available at https://github.com/dvlab-research/MiniGemini.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces Mini-Gemini, a framework for enhancing multi-modal vision-language models via three components: an additional high-resolution visual encoder that refines tokens without increasing their count, a constructed high-quality dataset promoting precise comprehension and reasoning-based generation, and VLM-guided generation. The method supports dense and MoE LLMs ranging from 2B to 34B parameters and reports leading results on multiple zero-shot benchmarks, sometimes surpassing private models.

Significance. If the reported gains can be shown to arise primarily from the three proposed components rather than from uncontrolled differences in training regime or evaluation protocol, the work would offer a practical route to improve VLM efficiency and capability in understanding, reasoning, and generation tasks while keeping visual token budgets fixed. Open-sourcing of code and models supports reproducibility.

major comments (2)
  1. [§4] Experiments: the central claim that performance improvements stem from the high-resolution encoder, high-quality dataset, and VLM-guided generation is not supported by controlled ablations. No factorial experiments hold training details, optimizer settings, data mixture ratios, and epoch counts fixed while toggling each component independently; comparisons to baselines therefore leave open the possibility that gains arise from undisclosed hyperparameter tuning or benchmark selection.
  2. [Tables 1–3] Main results tables: reported scores lack error bars, standard deviations, or the number of independent runs, so it is impossible to assess whether the claimed leadership on zero-shot benchmarks is statistically robust.
minor comments (2)
  1. [§3] The integration of the high-resolution encoder with the base visual encoder is described at a high level; a precise statement of how token count is preserved (e.g., via pooling or projection) would aid clarity.
  2. [Abstract] The abstract states that Mini-Gemini 'surpasses the developed private models' without naming those models or providing the corresponding scores in the main text.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below with clarifications on our experimental design and indicate where revisions will be made to strengthen the presentation.

read point-by-point responses
  1. Referee: [§4] Experiments: the central claim that performance improvements stem from the high-resolution encoder, high-quality dataset, and VLM-guided generation is not supported by controlled ablations. No factorial experiments hold training details, optimizer settings, data mixture ratios, and epoch counts fixed while toggling each component independently; comparisons to baselines therefore leave open the possibility that gains arise from undisclosed hyperparameter tuning or benchmark selection.

    Authors: We appreciate the referee's point regarding the need for more rigorously controlled ablations. The manuscript reports incremental results when adding each component (high-resolution refinement, curated data, and guided generation) while keeping the underlying LLM and training framework consistent across scales from 2B to 34B. However, we acknowledge that a complete factorial design holding every hyperparameter, data ratio, and epoch count fixed would provide stronger isolation of effects. Full factorial experiments at the 34B scale are computationally prohibitive; therefore, in the revision we will add targeted controlled ablations on the 2B and 7B models, fixing all other variables and toggling one component at a time, to better substantiate the contribution of each element. revision: yes

  2. Referee: [Tables 1–3] Main results tables: reported scores lack error bars, standard deviations, or the number of independent runs, so it is impossible to assess whether the claimed leadership on zero-shot benchmarks is statistically robust.

    Authors: We agree that reporting variability would allow better evaluation of statistical robustness. Given the substantial compute required to train and evaluate models up to 34B parameters, multiple independent runs per configuration were not feasible, which is a common practical constraint in large-scale VLM literature. In the revised manuscript we will add an explicit discussion of this limitation in Section 4, note that all compared models followed identical training protocols, and emphasize that performance gains remain consistent across diverse benchmarks and model sizes as supporting evidence of reliability. revision: partial
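The variability the referee asks for needs only a handful of repeated evaluations per configuration. A sketch of the minimal reporting, using hypothetical scores rather than numbers from the paper:

```python
import statistics

def summarize(scores):
    """Mean and sample standard deviation across independent runs,
    the minimum needed to judge whether a benchmark lead is robust."""
    mean = statistics.fmean(scores)
    sd = statistics.stdev(scores) if len(scores) > 1 else float("nan")
    return mean, sd

runs_ours = [71.2, 70.8, 71.5]   # hypothetical repeated evaluations
runs_base = [69.9, 70.1, 70.0]
m1, s1 = summarize(runs_ours)
m2, s2 = summarize(runs_base)
print(f"ours {m1:.2f}±{s1:.2f} vs base {m2:.2f}±{s2:.2f}")
```

If the gap between means is small relative to the standard deviations, the claimed leadership is not statistically distinguishable from run-to-run noise.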

Circularity Check

0 steps flagged

Empirical framework evaluated on external benchmarks exhibits no circularity

full rationale

The manuscript introduces Mini-Gemini as an engineering framework that augments existing VLMs via a high-resolution encoder, a constructed high-quality dataset, and VLM-guided generation. Performance is reported on independent zero-shot benchmarks external to the paper. No equations, fitted parameters, or predictions are defined in terms of themselves; no self-citation chain is invoked to justify uniqueness or load-bearing premises; and no renaming of known results occurs. The derivation chain consists of component proposals followed by empirical measurement against outside references, satisfying the criteria for a self-contained, non-circular contribution.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The paper is an empirical engineering contribution rather than a derivation from first principles. It relies on standard assumptions about VLM training dynamics and benchmark validity.

free parameters (1)
  • high-resolution encoder hyperparameters
    Tuned parameters controlling the additional visual encoder's refinement process.
axioms (1)
  • domain assumption An auxiliary high-resolution visual encoder can improve feature quality while keeping visual token count unchanged.
    Invoked in the first contribution axis for visual token enhancement.

pith-pipeline@v0.9.0 · 5544 in / 1172 out tokens · 62809 ms · 2026-05-17T07:39:59.742394+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 18 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. MME-RealWorld: Could Your Multimodal LLM Challenge High-Resolution Real-World Scenarios that are Difficult for Humans?

    cs.CV 2024-08 conditional novelty 8.0

    MME-RealWorld is the largest manually annotated high-resolution benchmark for MLLMs, where even the best models achieve less than 60% accuracy on challenging real-world tasks.

  2. Seg-Zero: Reasoning-Chain Guided Segmentation via Cognitive Reinforcement

    cs.CV 2025-03 unverdicted novelty 7.0

    Seg-Zero uses cognitive reinforcement learning on a decoupled reasoning-plus-segmentation architecture to produce explicit reasoning chains and reach 57.5 zero-shot accuracy on ReasonSeg, beating prior supervised LISA...

  3. Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs

    cs.CV 2024-06 unverdicted novelty 7.0

    Cambrian-1 is a vision-centric multimodal LLM family that evaluates over 20 vision encoders, introduces CV-Bench and the Spatial Vision Aggregator, and releases open models, code, and data achieving strong performance...

  4. GRIP-VLM: Group-Relative Importance Pruning for Efficient Vision-Language Models

    cs.CV 2026-05 unverdicted novelty 6.0

    GRIP-VLM applies group-relative policy optimization via reinforcement learning to prune visual tokens in VLMs, yielding up to 15% inference speedup at matched accuracy over prior methods.

  5. Switch-KD: Visual-Switch Knowledge Distillation for Vision-Language Models

    cs.CV 2026-04 unverdicted novelty 6.0

    A 0.5B student VLM distills from a 3B teacher using visual-switch distillation and DBiLD loss to gain 3.6 points on average across 10 multimodal benchmarks without architecture changes.

  6. POINTS-Long: Adaptive Dual-Mode Visual Reasoning in MLLMs

    cs.CV 2026-04 unverdicted novelty 6.0

    POINTS-Long is a dual-mode multimodal large language model that uses dynamic visual token scaling to retain 97.7-99.7% accuracy on long-form tasks with 1/40 to 1/10th the tokens and supports streaming via detachable KV-cache.

  7. Q-Zoom: Query-Aware Adaptive Perception for Efficient Multimodal Large Language Models

    cs.CV 2026-04 unverdicted novelty 6.0

    Q-Zoom achieves up to 4.39x inference speedup in high-resolution MLLM scenarios via query-aware gating and region localization, matching or exceeding baseline accuracy on document and high-res benchmarks.

  8. SpatialStack: Layered Geometry-Language Fusion for 3D VLM Spatial Reasoning

    cs.CV 2026-03 unverdicted novelty 6.0

    SpatialStack improves 3D spatial reasoning in vision-language models by stacking and synchronizing multi-level geometric features with the language backbone.

  9. MetaMorph: Multimodal Understanding and Generation via Instruction Tuning

    cs.CV 2024-12 unverdicted novelty 6.0

    VPiT enables pretrained LLMs to perform both visual understanding and generation by predicting discrete text tokens and continuous visual tokens, with understanding data proving more effective than generation-specific data.

  10. Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling

    cs.CV 2024-12 unverdicted novelty 6.0

    InternVL 2.5 is the first open-source MLLM to surpass 70% on the MMMU benchmark via model, data, and test-time scaling, with a 3.7-point gain from chain-of-thought reasoning.

  11. SparseVLM: Visual Token Sparsification for Efficient Vision-Language Model Inference

    cs.CV 2024-10 accept novelty 6.0

    SparseVLM uses text-guided attention to prune and recycle visual tokens in VLMs, delivering 54% FLOPs reduction and 37% lower latency with 97% accuracy retention on LLaVA.

  12. VILA-U: a Unified Foundation Model Integrating Visual Understanding and Generation

    cs.CV 2024-09 unverdicted novelty 6.0

    VILA-U unifies visual understanding and generation inside one autoregressive next-token prediction model, removing separate diffusion components while claiming near state-of-the-art results.

  13. SEED-X: Multimodal Models with Unified Multi-granularity Comprehension and Generation

    cs.CV 2024-04 unverdicted novelty 6.0

    SEED-X is a unified multimodal foundation model that handles multi-granularity visual semantics for both comprehension and generation across arbitrary image sizes and ratios.

  14. Analogical Reasoning as a Doctor: A Foundation Model for Gastrointestinal Endoscopy Diagnosis

    cs.CV 2026-04 unverdicted novelty 5.0

    RATNet applies analogical reasoning via a cyclic pre-training strategy to outperform prior foundation models in GI endoscopy diagnosis across diagnosis, few-shot, zero-shot, robustness, adaptation, and federated scenarios.

  15. CogVLM2: Visual Language Models for Image and Video Understanding

    cs.CV 2024-08 conditional novelty 5.0

    CogVLM2 family achieves state-of-the-art results on image and video understanding benchmarks through improved visual expert architecture, higher resolution inputs, and automated temporal grounding for videos.

  16. MiniCPM-V: A GPT-4V Level MLLM on Your Phone

    cs.CV 2024-08 conditional novelty 5.0

    MiniCPM-Llama3-V 2.5 delivers GPT-4V-level multimodal performance on phones through architecture, pretraining, and alignment optimizations.

  17. PaliGemma: A versatile 3B VLM for transfer

    cs.CV 2024-07 unverdicted novelty 4.0

    PaliGemma is an open 3B VLM based on SigLIP and Gemma that achieves strong performance on nearly 40 diverse open-world tasks including benchmarks, remote-sensing, and segmentation.

  18. How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites

    cs.CV 2024-04 unverdicted novelty 4.0

    InternVL 1.5 narrows the performance gap to proprietary multimodal models via a stronger transferable vision encoder, dynamic high-resolution tiling, and curated English-Chinese training data.

Reference graph

Works this paper leans on

71 extracted references · 71 canonical work pages · cited by 18 Pith papers · 26 internal anchors

  1. [1]

    OpenAI. Chatgpt. https://openai.com/blog/chatgpt/, 2023. 2

  2. [2]

    OPT: Open Pre-trained Transformer Language Models

    Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher Dewan, Mona Diab, Xian Li, Xi Victoria Lin, et al. Opt: Open pre-trained transformer language models. arXiv:2205.01068, 2022. 2

  3. [3]

    LLaMA: Open and Efficient Foundation Language Models

    Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. Llama: Open and efficient foundation language models. arXiv:2302.13971,

  4. [4]

    GPT-4 Technical Report

    OpenAI. Gpt-4 technical report. arXiv:2303.08774, 2023. 2, 6

  5. [5]

    Gemini: A Family of Highly Capable Multimodal Models

    Gemini Team, Rohan Anil, Sebastian Borgeaud, Yonghui Wu, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, et al. Gemini: a family of highly capable multimodal models. arXiv:2312.11805, 2023. 2, 6, 7

  6. [6]

    BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models

    Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. arXiv:2301.12597, 2023. 2, 3

  7. [7]

    Visual instruction tuning

    Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. In NeruIPS, 2023. 2, 3, 4

  8. [8]

    MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models

    Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, and Mohamed Elhoseiny. Minigpt-4: Enhancing vision-language understanding with advanced large language models. arXiv:2304.10592, 2023. 2, 3

  9. [9]

    Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding

    Hang Zhang, Xin Li, and Lidong Bing. Video-llama: An instruction-tuned audio-visual language model for video understanding. arXiv:2306.02858, 2023. 2

  10. [10]

    Llama-vid: An image is worth 2 tokens in large language models

    Yanwei Li, Chengyao Wang, and Jiaya Jia. Llama-vid: An image is worth 2 tokens in large language models. arXiv:2311.17043, 2023. 2, 6

  11. [11]

    Llava-next: Improved reasoning, ocr, and world knowledge, 2024

    Haotian Liu, Chunyuan Li, Yuheng Li, Bo Li, Yuanhan Zhang, Sheng Shen, and Yong Jae Lee. Llava-next: Improved reasoning, ocr, and world knowledge, 2024. URL https://llava-vl.github.io/blog/ 2024-01-30-llava-next/ . 2, 5, 6, 8

  12. [12]

    Otterhd: A high- resolution multi-modality model

    Bo Li, Peiyuan Zhang, Jingkang Yang, Yuanhan Zhang, Fanyi Pu, and Ziwei Liu. Otterhd: A high- resolution multi-modality model. arXiv:2311.04219, 2023. 2, 6

  13. [13]

    Introducing our multimodal models, 2023

    Rohan Bavishi, Erich Elsen, Curtis Hawthorne, Maxwell Nye, Augustus Odena, Arushi Somani, and Sa˘gnak Ta¸ sırlar. Introducing our multimodal models, 2023. URLhttps://www.adept.ai/blog/fuyu-8b. 2

  14. [14]

    ShareGPT4V: Improving Large Multi-Modal Models with Better Captions

    Lin Chen, Jisong Li, Xiaoyi Dong, Pan Zhang, Conghui He, Jiaqi Wang, Feng Zhao, and Dahua Lin. Sharegpt4v: Improving large multi-modal models with better captions. arXiv:2311.12793, 2023. 2, 5, 8

  15. [15]

    arXiv preprint arXiv:2402.11684 , year=

    Guiming Hardy Chen, Shunian Chen, Ruifei Zhang, Junying Chen, Xiangbo Wu, Zhiyi Zhang, Zhihong Chen, Jianquan Li, Xiang Wan, and Benyou Wang. Allava: Harnessing gpt4v-synthesized data for a lite vision-language model. arXiv:2402.11684, 2024. 2, 5, 8

  16. [16]

    Making the v in vqa matter: Elevating the role of image understanding in visual question answering

    Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. Making the v in vqa matter: Elevating the role of image understanding in visual question answering. In CVPR, 2017. 2, 3

  17. [17]

    Document collection visual question answering

    Rubèn Tito, Dimosthenis Karatzas, and Ernest Valveny. Document collection visual question answering. In ICDAR 2021, 2021. 5, 11

  18. [18]

    ChartQA: A Benchmark for Question Answering about Charts with Visual and Logical Reasoning

    Ahmed Masry, Do Xuan Long, Jia Qing Tan, Shafiq Joty, and Enamul Hoque. Chartqa: A benchmark for question answering about charts with visual and logical reasoning. arXiv:2203.10244, 2022. 5, 11

  19. [19]

    A diagram is worth a dozen images

    Aniruddha Kembhavi, Mike Salvato, Eric Kolve, Minjoon Seo, Hannaneh Hajishirzi, and Ali Farhadi. A diagram is worth a dozen images. In ECCV, 2016. 2, 5, 11

  20. [20]

    Lima: Less is more for alignment

    Chunting Zhou, Pengfei Liu, Puxin Xu, Srinivasan Iyer, Jiao Sun, Yuning Mao, Xuezhe Ma, Avia Efrat, Ping Yu, Lili Yu, et al. Lima: Less is more for alignment. Advances in Neural Information Processing Systems, 36, 2024. 2, 5

  21. [21]

    Openassistant conversations- democratizing large language model alignment

    Andreas Köpf, Yannic Kilcher, Dimitri von Rütte, Sotiris Anagnostidis, Zhi Rui Tam, Keith Stevens, Abdullah Barhoum, Duc Nguyen, Oliver Stanley, Richárd Nagyfi, et al. Openassistant conversations- democratizing large language model alignment. Advances in Neural Information Processing Systems , 36,

  22. [22]

    SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis

    Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach. Sdxl: Improving latent diffusion models for high-resolution image synthesis. arXiv preprint arXiv:2307.01952, 2023. 2, 3, 5

  23. [23]

    Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond

    Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-vl: A frontier large vision-language model with versatile abilities. arXiv:2308.12966,

  24. [24]

    MMBench: Is Your Multi-modal Model an All-around Player?

    Yuan Liu, Haodong Duan, Yuanhan Zhang, Bo Li, Songyang Zhang, Wangbo Zhao, Yike Yuan, Jiaqi Wang, Conghui He, Ziwei Liu, et al. Mmbench: Is your multi-modal model an all-around player? arXiv:2307.06281, 2023. 2, 6, 7

  25. [25]

    Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi

    Xiang Yue, Yuansheng Ni, Kai Zhang, Tianyu Zheng, Ruoqi Liu, Ge Zhang, Samuel Stevens, Dongfu Jiang, Weiming Ren, Yuxuan Sun, Cong Wei, Botao Yu, Ruibin Yuan, Renliang Sun, Ming Yin, Boyuan Zheng, Zhenzhu Yang, Yibo Liu, Wenhao Huang, Huan Sun, Yu Su, and Wenhu Chen. Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for exp...

  26. [26]

    Attention is all you need

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NeurIPS, 2017. 2

  27. [27]

    BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

    Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirec- tional transformers for language understanding. arXiv:1810.04805, 2018. 2

  28. [28]

    Language models are few-shot learners

    Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. In NeurIPS, 2020. 2

  29. [29]

    Mixtral of Experts

    Albert Q Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Emma Bou Hanna, Florian Bressand, et al. Mixtral of experts. arXiv:2401.04088, 2024. 2, 3

  30. [30]

    Finetuned Language Models Are Zero-Shot Learners

    Jason Wei, Maarten Bosma, Vincent Y Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M Dai, and Quoc V Le. Finetuned language models are zero-shot learners. arXiv:2109.01652, 2021. 3

  31. [31]

    Training language models to follow instructions with human feedback

    Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. In NeurIPS, 2022. 3

  32. [32]

    Hashimoto

    Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. Stanford alpaca: An instruction-following llama model. https://github. com/tatsu-lab/stanford_alpaca, 2023. 3

  33. [33]

    Gonzalez, Ion Stoica, and Eric P

    Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E. Gonzalez, Ion Stoica, and Eric P. Xing. Vicuna: An open-source chat- bot impressing gpt-4 with 90%* chatgpt quality. https://lmsys.org/blog/2023-03-30-vicuna/ ,

  34. [34]

    Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models

    Chenfei Wu, Shengming Yin, Weizhen Qi, Xiaodong Wang, Zecheng Tang, and Nan Duan. Visual chatgpt: Talking, drawing and editing with visual foundation models. arXiv:2303.04671, 2023. 3

  35. [35]

    Gpt4tools: Teaching large language model to use tools via self-instruction

    Rui Yang, Lin Song, Yanwei Li, Sijie Zhao, Yixiao Ge, Xiu Li, and Ying Shan. Gpt4tools: Teaching large language model to use tools via self-instruction. arXiv:2305.18752, 2023. 3

  36. [36]

    Gemma: Introducing new state-of-the-art open models

    Google. Gemma: Introducing new state-of-the-art open models. hhttps://blog.google/technology/ developers/gemma-open-models/, 2024. 3, 6

  37. [37]

    Microsoft COCO Captions: Data Collection and Evaluation Server

    Xinlei Chen, Hao Fang, Tsung-Yi Lin, Ramakrishna Vedantam, Saurabh Gupta, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco captions: Data collection and evaluation server. arXiv:1504.00325,

  38. [38]

    Learn to explain: Multimodal reasoning via thought chains for science question answering

    Pan Lu, Swaroop Mishra, Tanglin Xia, Liang Qiu, Kai-Wei Chang, Song-Chun Zhu, Oyvind Tafjord, Peter Clark, and Ashwin Kalyan. Learn to explain: Multimodal reasoning via thought chains for science question answering. In NeurIPS, 2022. 3

  39. [39]

    Lisa: Reasoning segmentation via large language model

    Xin Lai, Zhuotao Tian, Yukang Chen, Yanwei Li, Yuhui Yuan, Shu Liu, and Jiaya Jia. Lisa: Reasoning segmentation via large language model. arXiv:2308.00692, 2023. 3

  40. [40]

    Learning transferable visual models from natural language supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In ICML, 2021. 3, 4, 6 17

  41. [41]

    Flamingo: a visual language model for few-shot learning

    Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning. In NeurIPS, 2022. 3

  42. [42]

    InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning

    Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, and Steven Hoi. Instructblip: Towards general-purpose vision-language models with instruction tuning. arXiv:2305.06500, 2023. 3, 4, 5, 6, 7

  43. [43]

    Improved Baselines with Visual Instruction Tuning

    Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. arXiv:2310.03744, 2023. 3, 5, 6, 7

  44. [44]

    Internlm-xcomposer: A vision-language large model for advanced text-image comprehension and composition

    Pan Zhang, Xiaoyi Dong, Bin Wang, Yuhang Cao, Chao Xu, Linke Ouyang, Zhiyuan Zhao, Shuangrui Ding, Songyang Zhang, Haodong Duan, Wenwei Zhang, Hang Yan, Xinyue Zhang, Wei Li, Jingwen Li, Kai Chen, Conghui He, Xingcheng Zhang, Yu Qiao, Dahua Lin, and Jiaqi Wang. Internlm-xcomposer: A vision-language large model for advanced text-image comprehension and com...

  45. [45]

    InternLM-XComposer2: Mastering Free-form Text-Image Composition and Comprehension in Vision-Language Large Model

    Xiaoyi Dong, Pan Zhang, Yuhang Zang, Yuhang Cao, Bin Wang, Linke Ouyang, Xilin Wei, Songyang Zhang, Haodong Duan, Maosong Cao, Wenwei Zhang, Yining Li, Hang Yan, Yang Gao, Xinyue Zhang, Wei Li, Jingwen Li, Kai Chen, Conghui He, Xingcheng Zhang, Yu Qiao, Dahua Lin, and Jiaqi Wang. Internlm-xcomposer2: Mastering free-form text-image composition and comprehe...

  46. [46]

    Emu: Generative Pretraining in Multimodality

    Quan Sun, Qiying Yu, Yufeng Cui, Fan Zhang, Xiaosong Zhang, Yueze Wang, Hongcheng Gao, Jingjing Liu, Tiejun Huang, and Xinlong Wang. Generative pretraining in multimodality. arXiv preprint arXiv:2307.05222, 2023. 3

  47. [47]

    Generative multimodal models are in-context learners

    Quan Sun, Yufeng Cui, Xiaosong Zhang, Fan Zhang, Qiying Yu, Zhengxiong Luo, Yueze Wang, Yongming Rao, Jingjing Liu, Tiejun Huang, et al. Generative multimodal models are in-context learners. arXiv preprint arXiv:2312.13286, 2023. 3, 5

  48. [48]

    Planting a seed of vision in large language model

    Yuying Ge, Yixiao Ge, Ziyun Zeng, Xintao Wang, and Ying Shan. Planting a seed of vision in large language model. arXiv preprint arXiv:2307.08041, 2023. 3

  49. [49]

    Making llama see and draw with seed tokenizer

    Yuying Ge, Sijie Zhao, Ziyun Zeng, Yixiao Ge, Chen Li, Xintao Wang, and Ying Shan. Making llama see and draw with seed tokenizer. arXiv preprint arXiv:2310.01218, 2023. 3

  50. [50]

    Llmga: Multimodal large language model based generation assistant

    Bin Xia, Shiyin Wang, Yingfan Tao, Yitong Wang, and Jiaya Jia. Llmga: Multimodal large language model based generation assistant. arXiv preprint arXiv:2311.16500, 2023. 3, 5, 15

  51. [51]

    Chatillusion: Efficient-aligning interleaved generation ability with visual instruction model

    Xiaowei Chi, Yijiang Liu, Zhengkai Jiang, Rongyu Zhang, Ziyi Lin, Renrui Zhang, Peng Gao, Chaoyou Fu, Shanghang Zhang, Qifeng Liu, et al. Chatillusion: Efficient-aligning interleaved generation ability with visual instruction model. arXiv preprint arXiv:2311.17963, 2023. 8, 10, 15

  52. [52]

    Anygpt: Unified multimodal llm with discrete sequence modeling

    Jun Zhan, Junqi Dai, Jiasheng Ye, Yunhua Zhou, Dong Zhang, Zhigeng Liu, Xin Zhang, Ruibin Yuan, Ge Zhang, Linyang Li, et al. Anygpt: Unified multimodal llm with discrete sequence modeling. arXiv preprint arXiv:2402.12226, 2024. 3, 5, 8, 10

  53. [53]

    Improving image generation with better captions

    James Betker, Gabriel Goh, Li Jing, Tim Brooks, Jianfeng Wang, Linjie Li, Long Ouyang, Juntang Zhuang, Joyce Lee, Yufei Guo, et al. Improving image generation with better captions. Computer Science. https://cdn.openai.com/papers/dall-e-3.pdf, 2(3):8, 2023. 3, 5

  54. [54]

    LAION-5b: An open large-scale dataset for training next generation image-text models

    Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade W Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, Patrick Schramowski, Srivatsa R Kundurthy, Katherine Crowson, Ludwig Schmidt, Robert Kaczmarczyk, and Jenia Jitsev. LAION-5b: An open large-scale dataset for training next generation image-text models. In NeurIPS, 2022.

  55. [55]

    A convnet for the 2020s

    Zhuang Liu, Hanzi Mao, Chao-Yuan Wu, Christoph Feichtenhofer, Trevor Darrell, and Saining Xie. A convnet for the 2020s. In CVPR, 2022. 4

  56. [56]

    Wuerstchen: An efficient architecture for large-scale text-to-image diffusion models

    Pablo Pernias, Dominic Rampas, Mats L. Richter, Christopher J. Pal, and Marc Aubreville. Wuerstchen: An efficient architecture for large-scale text-to-image diffusion models, 2023. 5

  57. [57]

    Video generation models as world simulators

    OpenAI. Video generation models as world simulators. URL https://openai.com/research/video-generation-models-as-world-simulators. 5

  58. [58]

    Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning

    Piyush Sharma, Nan Ding, Sebastian Goodman, and Radu Soricut. Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In ACL, 2018. 5

  59. [59]

    Textcaps: a dataset for image captioning with reading comprehension

    Oleksii Sidorov, Ronghang Hu, Marcus Rohrbach, and Amanpreet Singh. Textcaps: a dataset for image captioning with reading comprehension. In ECCV, 2020. 5, 8

  60. [60]

    LAION/gpt4v-dataset · Datasets at Hugging Face

    LAION e.V. LAION/gpt4v-dataset · Datasets at Hugging Face. URL https://huggingface.co/datasets/laion/gpt4v-dataset. 5, 8

  61. [61]

    Dvqa: Understanding data visualizations via question answering

    Kushal Kafle, Brian Price, Scott Cohen, and Christopher Kanan. Dvqa: Understanding data visualizations via question answering. In CVPR, 2018. 5

  62. [62]

    Stable-diffusion-prompts

    Stable-diffusion-prompts. URL https://www.gigasheet.com/sample-data/stable-diffusion-prompts. 5

  63. [63]

    MobileVLM : A Fast, Strong and Open Vision Language Assistant for Mobile Devices

    Xiangxiang Chu, Limeng Qiao, Xinyang Lin, Shuang Xu, Yang Yang, Yiming Hu, Fei Wei, Xinyu Zhang, Bo Zhang, Xiaolin Wei, et al. Mobilevlm: A fast, strong and open vision language assistant for mobile devices. arXiv:2312.16886, 2023. 6, 7

  64. [64]

    Shikra: Unleashing Multimodal LLM's Referential Dialogue Magic

    Keqin Chen, Zhao Zhang, Weili Zeng, Richong Zhang, Feng Zhu, and Rui Zhao. Shikra: Unleashing multimodal llm’s referential dialogue magic. arXiv:2306.15195, 2023. 6

  65. [65]

    Introducing idefics: An open reproduction of state-of-the-art visual language model

    IDEFICS. Introducing idefics: An open reproduction of state-of-the-art visual language model. https://huggingface.co/blog/idefics, 2023. 6

  66. [66]

    CogVLM: Visual Expert for Pretrained Language Models

    Weihan Wang, Qingsong Lv, Wenmeng Yu, Wenyi Hong, Ji Qi, Yan Wang, Junhui Ji, Zhuoyi Yang, Lei Zhao, Xixuan Song, et al. Cogvlm: Visual expert for pretrained language models. arXiv:2311.03079, 2023. 6

  67. [67]

    Towards vqa models that can read

    Amanpreet Singh, Vivek Natarajan, Meet Shah, Yu Jiang, Xinlei Chen, Dhruv Batra, Devi Parikh, and Marcus Rohrbach. Towards vqa models that can read. In CVPR, 2019. 6, 7, 8, 11

  68. [68]

    MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models

    Chaoyou Fu, Peixian Chen, Yunhang Shen, Yulei Qin, Mengdan Zhang, Xu Lin, Zhenyu Qiu, Wei Lin, Jinrui Yang, Xiawu Zheng, et al. Mme: A comprehensive evaluation benchmark for multimodal large language models. arXiv:2306.13394, 2023. 6, 7

  69. [69]

    MM-Vet: Evaluating Large Multimodal Models for Integrated Capabilities

    Weihao Yu, Zhengyuan Yang, Linjie Li, Jianfeng Wang, Kevin Lin, Zicheng Liu, Xinchao Wang, and Lijuan Wang. Mm-vet: Evaluating large multimodal models for integrated capabilities. arXiv:2308.02490, 2023.

  70. [70]

    Mathvista: Evaluating mathematical reasoning of foundation models in visual contexts

    Pan Lu, Hritik Bansal, Tony Xia, Jiacheng Liu, Chunyuan Li, Hannaneh Hajishirzi, Hao Cheng, Kai-Wei Chang, Michel Galley, and Jianfeng Gao. Mathvista: Evaluating mathematical reasoning of foundation models in visual contexts. In ICLR, 2024. 6, 7

  71. [71]

    Awesome multilingual ocr toolkits based on paddlepaddle

    PaddleOCR. Awesome multilingual ocr toolkits based on paddlepaddle. URL https://github.com/PaddlePaddle/PaddleOCR. 11