ALLaVA: Harnessing GPT4V-Synthesized Data for Lite Vision-Language Models

Benyou Wang; Guiming Hardy Chen; Jianquan Li; Junying Chen; Ruifei Zhang; Shunian Chen; Xiangbo Wu; Xiang Wan; Zhihong Chen; Zhiyi Zhang

arxiv: 2402.11684 · v2 · pith:Q65S5ERWnew · submitted 2024-02-18 · 💻 cs.CL · cs.AI

Guiming Hardy Chen, Shunian Chen, Ruifei Zhang, Junying Chen, Xiangbo Wu, Zhiyi Zhang, Zhihong Chen, Jianquan Li, Xiang Wan, Benyou Wang

Pith reviewed 2026-05-23 22:17 UTC · model grok-4.3

classification 💻 cs.CL cs.AI

keywords ALLaVA · synthetic data · vision-language models · GPT-4V · lite models · visual question answering · model efficiency · instruction tuning

The pith

A pipeline using GPT-4V to create 1.3 million synthetic samples lets lite vision-language models reach competitive results on 17 benchmarks and match larger models on several tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to close the performance gap between large vision-language models and smaller, cheaper ones by focusing on data quality instead of model scale. It builds a pipeline that prompts GPT-4V to produce detailed image annotations for alignment and complex visual question-answer pairs for instruction tuning, yielding the ALLaVA dataset of 1.3 million samples. Lite models trained on this data then show strong results across many standard tests. A reader would care because current high-performing models demand heavy computing resources that limit who can use or improve them.

Core claim

The central claim is that high-quality synthetic data generated by a strong proprietary model can substitute for larger model scale: a 1.3M-sample dataset of fine-grained image annotations and complex reasoning VQA pairs produced by GPT-4V enables lite VLMs to achieve competitive performance on 17 benchmarks among 4B-scale models and to perform on par with 7B/13B-scale models on various benchmarks.

What carries the argument

The two-stage synthetic data pipeline that first generates fine-grained image annotations for vision-language alignment and then produces complex reasoning visual question-answering pairs for visual instruction fine-tuning.
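
As a rough illustration of the two record types this pipeline produces, the sketch below models one annotation per image for alignment and one question-answer pair per image for instruction tuning. Everything named here (`CaptionSample`, `VQASample`, `build_dataset`, and the `fake_*` stand-ins) is hypothetical; the paper's actual prompts, filtering steps, and API calls are not described on this page and are not reproduced.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Tuple


@dataclass
class CaptionSample:
    """Stage 1 record: a fine-grained annotation used for alignment."""
    image_id: str
    caption: str


@dataclass
class VQASample:
    """Stage 2 record: a reasoning question-answer pair for instruction tuning."""
    image_id: str
    question: str
    answer: str


def build_dataset(
    image_ids: Iterable[str],
    annotate: Callable[[str], str],
    ask: Callable[[str], Tuple[str, str]],
) -> Tuple[List[CaptionSample], List[VQASample]]:
    """Run both generation stages over a set of images.

    `annotate` and `ask` are placeholders for calls to a strong
    vision-language model; they are assumptions, not the paper's API.
    """
    captions: List[CaptionSample] = []
    vqa: List[VQASample] = []
    for image_id in image_ids:
        captions.append(CaptionSample(image_id, annotate(image_id)))
        question, answer = ask(image_id)
        vqa.append(VQASample(image_id, question, answer))
    return captions, vqa


# Toy stand-ins so the sketch runs without any model access.
def fake_annotate(image_id: str) -> str:
    return f"A detailed description of {image_id}."


def fake_ask(image_id: str) -> Tuple[str, str]:
    return (f"What is unusual in {image_id}?", "Nothing stands out.")


alignment_set, instruction_set = build_dataset(
    ["img_0", "img_1"], fake_annotate, fake_ask
)
```

Passing the generators in as callables keeps the data layout separate from whichever model produces the text, so the same structure would hold if a different annotator were swapped in.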

If this is right

Lite VLMs trained on the synthetic dataset reach competitive results among 4B-scale models across 17 benchmarks.
The same models perform on par with 7B- and 13B-scale models on various benchmarks.
High-quality synthetic data reduces the need for massive human-labeled corpora when building resource-efficient LVLMs.
The approach demonstrates that data synthesis can make vision-language capabilities available with lower training and deployment costs.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Open-sourcing the ALLaVA dataset lets independent groups replicate or extend the efficiency gains without repeating the GPT-4V generation step.
The same synthesis method could be applied to other multimodal tasks such as image captioning or visual reasoning to test whether data quality continues to substitute for scale.
Combining the synthetic pre-training with later fine-tuning on real user data might further close remaining gaps on tasks where synthetic artifacts remain visible.

Load-bearing premise

The GPT-4V generated annotations and VQA pairs are of high enough quality and free of systematic artifacts that training on them produces genuine capability gains rather than benchmark-specific overfitting.

What would settle it

A large performance drop on a fresh collection of vision-language benchmarks that were never used during data generation or model selection would indicate the gains come from overfitting to the synthetic data distribution.

Original abstract

Large vision-language models (LVLMs) have shown promise in a broad range of vision-language tasks with their strong reasoning and generalization capabilities. However, they require considerable computational resources for training and deployment. This study aims to bridge the performance gap between traditional-scale LVLMs and resource-friendly lite versions by adopting high-quality training data. To this end, we propose a comprehensive pipeline for generating a synthetic dataset. The key idea is to leverage strong proprietary models to generate (i) fine-grained image annotations for vision-language alignment and (ii) complex reasoning visual question-answering pairs for visual instruction fine-tuning, yielding 1.3M samples in total. We train a series of lite VLMs on the synthetic dataset and experimental results demonstrate the effectiveness of the proposed scheme, where they achieve competitive performance on 17 benchmarks among 4B LVLMs, and even perform on par with 7B/13B-scale models on various benchmarks. This work highlights the feasibility of adopting high-quality data in crafting more efficient LVLMs. We name our dataset ALLaVA, and open-source it to the research community for developing better resource-efficient LVLMs for wider usage.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

ALLaVA generates 1.3M synthetic samples via GPT-4V and shows 4B VLMs reaching competitive results on 17 benchmarks, but the gains rest on unverified assumptions about data quality and the absence of benchmark leakage.

The core takeaway is that this paper builds a pipeline to create fine-grained image annotations and complex reasoning VQA pairs from GPT-4V, releases the resulting 1.3M-sample ALLaVA dataset, and reports that 4B-scale models trained on it match or approach 7B/13B models on various benchmarks. That dataset and the two-stage construction method are the concrete new pieces; prior work has used strong models to distill data, but applying this exact split to lite VLMs and open-sourcing the output is useful for the efficiency side of the field. The approach is straightforward and the release lowers the barrier for others to test similar ideas.

The main soft spot is the risk of benchmark leakage. GPT-4V's training data may overlap with common VLM test sets, so the generated pairs could embed test questions or distributions, and the reported experiments include no explicit decontamination or overlap analysis. The abstract gives no error bars, training details, or baseline comparisons, which makes it hard to judge whether the parity with larger models reflects real capability or artifacts. If the full paper includes those checks and shows they hold, the results strengthen; otherwise the central claim stays provisional.

This work is aimed at groups building or evaluating resource-efficient VLMs and anyone interested in data-centric scaling for multimodal models. It deserves a serious referee because the dataset itself is a reusable artifact and the empirical question it raises is timely, even if revisions will likely be needed on the evaluation rigor and contamination controls.

Referee Report

2 major / 2 minor

Summary. The paper proposes ALLaVA, a 1.3M-sample synthetic dataset generated via GPT-4V to produce fine-grained image annotations for vision-language alignment and complex reasoning VQA pairs for instruction tuning. Lite VLMs (around 4B scale) are trained on this data and reported to achieve competitive results on 17 benchmarks relative to other 4B LVLMs, with parity to some 7B/13B models on selected tasks, demonstrating that high-quality synthetic data can close the performance gap for resource-efficient models.

Significance. If the gains are attributable to genuine capability rather than artifacts, the work would show that carefully synthesized data from strong proprietary models can enable smaller VLMs to approach the performance of much larger ones, with direct implications for lowering training and inference costs in multimodal systems.

major comments (2)

[§3 (data generation pipeline) and §4 (experiments)] The central empirical claim (competitive performance on 17 benchmarks) rests on the assumption that the GPT-4V-generated VQA pairs and annotations contain no systematic overlap with the evaluation sets. No decontamination analysis, overlap statistics, or membership inference checks are described for the 1.3M samples relative to the 17 benchmarks; given GPT-4V's training corpus, this leaves open the possibility that reported gains partly reflect implicit test-set supervision rather than improved generalization.
[Abstract and §4] The abstract and experimental summary state performance outcomes without reporting error bars, number of runs, or statistical significance tests for the benchmark comparisons; this makes it impossible to assess whether the parity with 7B/13B models is robust or within the variance of the evaluation protocol.
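
The overlap statistics requested in the first major comment could take roughly the following form: a word n-gram check that flags synthetic questions sharing a large fraction of their n-grams with benchmark questions. This is a minimal sketch under stated assumptions, not the authors' procedure; the function names, the trigram size, the 0.5 threshold, and the example strings are all illustrative.

```python
import re
from typing import Iterable, List, Set, Tuple

NGram = Tuple[str, ...]


def ngrams(text: str, n: int = 3) -> Set[NGram]:
    """Lowercased word n-grams, ignoring punctuation (empty if text is too short)."""
    tokens = re.findall(r"\w+", text.lower())
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_ratio(sample: str, benchmark_grams: Set[NGram], n: int = 3) -> float:
    """Fraction of the sample's n-grams that also appear in the benchmark text."""
    grams = ngrams(sample, n)
    if not grams:
        return 0.0
    return len(grams & benchmark_grams) / len(grams)


def flag_contaminated(
    samples: Iterable[str],
    benchmark_questions: Iterable[str],
    n: int = 3,
    threshold: float = 0.5,  # assumed cutoff, to be tuned on real data
) -> List[str]:
    """Return the samples whose n-gram overlap with the benchmark meets the threshold."""
    benchmark_grams: Set[NGram] = set()
    for question in benchmark_questions:
        benchmark_grams |= ngrams(question, n)
    return [s for s in samples if overlap_ratio(s, benchmark_grams, n) >= threshold]


# Illustrative strings only; these are not drawn from ALLaVA or any benchmark.
benchmark = ["What color is the bus parked near the station?"]
synthetic = [
    "What color is the bus parked near the station today?",
    "Explain why the umbrella suggests recent rain.",
]
flagged = flag_contaminated(synthetic, benchmark)
```

Exact n-gram matching catches only near-verbatim reuse. Paraphrased leakage would need an additional check based on semantic similarity, which this sketch does not attempt.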

minor comments (2)

[§4] Notation for model scales (e.g., '4B LVLMs') should be defined consistently with parameter counts of the trained models and baselines.
[§3] The pipeline description would benefit from an explicit diagram or table enumerating the exact prompts and filtering steps used to produce the 1.3M samples.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. The two major comments identify important gaps in validating the empirical claims. We address each below and commit to revisions that strengthen the manuscript without overstating what was originally done.

Referee: [§3 (data generation pipeline) and §4 (experiments)] The central empirical claim (competitive performance on 17 benchmarks) rests on the assumption that the GPT-4V-generated VQA pairs and annotations contain no systematic overlap with the evaluation sets. No decontamination analysis, overlap statistics, or membership inference checks are described for the 1.3M samples relative to the 17 benchmarks; given GPT-4V's training corpus, this leaves open the possibility that reported gains partly reflect implicit test-set supervision rather than improved generalization.

Authors: We agree this is a substantive concern. The synthesis pipeline generates new annotations and VQA pairs via GPT-4V, but the original manuscript contains no decontamination analysis or overlap statistics against the 17 evaluation sets. We will add this analysis in revision, including n-gram overlap counts, semantic similarity filtering, and checks for near-duplicate VQA pairs where computationally feasible. This will be reported in a new subsection under data generation. revision: yes
Referee: [Abstract and §4] The abstract and experimental summary state performance outcomes without reporting error bars, number of runs, or statistical significance tests for the benchmark comparisons; this makes it impossible to assess whether the parity with 7B/13B models is robust or within the variance of the evaluation protocol.

Authors: We acknowledge the lack of statistical reporting. All main results were obtained from single training runs owing to the scale of 4B-model training. In the revision we will state the number of runs explicitly, add error bars for the primary benchmarks by re-running a subset of models with different random seeds, and include variance estimates from the existing ablation studies to contextualize the reported parity with larger models. revision: yes
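
The seed-variance reporting promised in this response can be illustrated with a small helper that summarizes per-seed scores and asks whether the gap between two models is larger than the run-to-run noise. The function names and the "k times the larger standard deviation" rule are assumptions chosen for simplicity, and the scores are placeholder values, not results from the paper.

```python
from statistics import mean, stdev
from typing import Dict, List


def summarize(scores: List[float]) -> Dict[str, float]:
    """Mean and sample standard deviation over per-seed benchmark scores."""
    return {"mean": mean(scores), "std": stdev(scores)}


def gap_exceeds_noise(small: List[float], large: List[float], k: float = 2.0) -> bool:
    """True if the mean gap is larger than k times the larger of the two stds.

    A crude screening rule, not a formal significance test.
    """
    a, b = summarize(small), summarize(large)
    return abs(a["mean"] - b["mean"]) > k * max(a["std"], b["std"])


# Placeholder scores for three seeds each; purely illustrative numbers.
lite_scores = [64.1, 63.5, 64.6]
large_scores = [64.9, 64.2, 65.3]

lite_summary = summarize(lite_scores)
large_summary = summarize(large_scores)
distinguishable = gap_exceeds_noise(lite_scores, large_scores)
```

With only a handful of seeds the standard deviation is itself a noisy estimate, so a check like this is better read as a sanity screen than as evidence of parity.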

Circularity Check

0 steps flagged

No circularity in empirical pipeline or benchmark results

The paper presents an empirical pipeline: GPT-4V is used to synthesize 1.3M image annotations and VQA pairs, lite VLMs are trained on them, and performance is measured directly on 17 external benchmarks. No equations, derivations, fitted parameters renamed as predictions, or load-bearing self-citations are present in the abstract or described claims. Results are falsifiable experimental outcomes rather than reductions to inputs by construction, so the work is self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the unverified quality of GPT-4V outputs as training data; no free parameters or invented entities are described in the abstract.

axioms (1)

domain assumption: GPT-4V can reliably produce fine-grained image annotations and complex reasoning VQA pairs suitable for VLM training
This assumption underpins the entire data-generation pipeline and the subsequent performance claims.

pith-pipeline@v0.9.0 · 5774 in / 1224 out tokens · 24450 ms · 2026-05-23T22:17:52.956750+00:00 · methodology

Forward citations

Cited by 23 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Vision-Language Models
cs.CV 2024-09 accept novelty 8.0

Molmo VLMs trained on newly collected PixMo open datasets achieve state-of-the-art performance among open-weight models and surpass multiple proprietary VLMs including Claude 3.5 Sonnet and Gemini 1.5 Pro.
Mitigating Mask Prior Drift and Positional Attention Collapse in Large Diffusion Vision-Language Models
cs.CV 2026-05 unverdicted novelty 7.0

Mask prior drift and positional attention collapse cause failures in LDVLMs for long generations, fixed by training-free Mask Prior Suppression and Monotonic RoPE Scaling.
From Standalone LLMs to Integrated Intelligence: A Survey of Compound AI Systems
cs.MA 2025-06 accept novelty 7.0

A survey that defines Compound AI Systems, proposes a multi-dimensional taxonomy based on component roles and orchestration strategies, reviews four foundational paradigms, and identifies key challenges for future research.
FLARE: Fully Integration of Vision-Language Representations for Deep Cross-Modal Understanding
cs.CV 2025-04 unverdicted novelty 7.0

FLARE is a vision-language model family using text-guided vision encoding, context-aware alignment decoding, dual-semantic mapping loss, and text-driven VQA synthesis to achieve deep cross-modal integration, outperfor...
VisRAG: Vision-based Retrieval-augmented Generation on Multi-modality Documents
cs.IR 2024-10 conditional novelty 7.0

VisRAG achieves 20-40% better end-to-end performance than text-based RAG by directly embedding and retrieving document images with VLMs.
Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs
cs.CV 2024-06 unverdicted novelty 7.0

Cambrian-1 is a vision-centric multimodal LLM family that evaluates over 20 vision encoders, introduces CV-Bench and the Spatial Vision Aggregator, and releases open models, code, and data achieving strong performance...
Mitigating Mask Prior Drift and Positional Attention Collapse in Large Diffusion Vision-Language Models
cs.CV 2026-05 unverdicted novelty 6.0

Diagnoses mask prior drift and positional attention collapse in LDVLMs and introduces two plug-and-play decoding interventions that raise long-form generation quality without retraining.
LLaVA-CKD: Bottom-Up Cascaded Knowledge Distillation for Vision-Language Models
cs.CV 2026-05 unverdicted novelty 6.0

A cascaded knowledge distillation method with intermediate teachers improves efficiency of vision-language models like LLaVA while achieving state-of-the-art results on seven VQA benchmarks.
VideoChat-Flash: Hierarchical Compression for Long-Context Video Modeling
cs.CV 2024-12 unverdicted novelty 6.0

VideoChat-Flash applies hierarchical video token compression to achieve ~50x reduction in context length for long videos while maintaining near-original performance on long-context benchmarks.
MetaMorph: Multimodal Understanding and Generation via Instruction Tuning
cs.CV 2024-12 unverdicted novelty 6.0

VPiT enables pretrained LLMs to perform both visual understanding and generation by predicting discrete text tokens and continuous visual tokens, with understanding data proving more effective than generation-specific data.
Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling
cs.CV 2024-12 unverdicted novelty 6.0

InternVL 2.5 is the first open-source MLLM to surpass 70% on the MMMU benchmark via model, data, and test-time scaling, with a 3.7-point gain from chain-of-thought reasoning.
A Nash Equilibrium Framework For Training-Free Multimodal Step Verification
cs.CV 2026-05 unverdicted novelty 5.0

A Nash equilibrium framework for training-free multimodal step verification that uses cross-modal agreement and disagreement signals for filtering and ranking reasoning steps.
Qwen2.5-VL Technical Report
cs.CV 2025-02 unverdicted novelty 5.0

Qwen2.5-VL reports a vision-language model family using native dynamic-resolution ViT and absolute time encoding that matches GPT-4o on document and diagram tasks while supporting hour-long videos with second-level lo...
InternVideo2.5: Empowering Video MLLMs with Long and Rich Context Modeling
cs.CV 2025-01 unverdicted novelty 5.0

InternVideo2.5 improves video MLLMs by incorporating dense vision task annotations via direct preference optimization and compact spatiotemporal representations via adaptive hierarchical token compression, yielding be...
DeepSeek-VL2: Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding
cs.CV 2024-12 accept novelty 5.0

DeepSeek-VL2 is a series of MoE vision-language models using dynamic tiling and latent attention that reach competitive or state-of-the-art results on VQA, OCR, document understanding and grounding with 1.0B to 4.5B a...
NVILA: Efficient Frontier Visual Language Models
cs.CV 2024-12 unverdicted novelty 5.0

NVILA improves on VILA with a scale-then-compress visual token strategy and full-lifecycle efficiency optimizations, matching or exceeding leading VLMs on image and video benchmarks while reducing training cost 1.9-5....
mPLUG-Owl3: Towards Long Image-Sequence Understanding in Multi-Modal Large Language Models
cs.CV 2024-08 unverdicted novelty 5.0

mPLUG-Owl3 introduces hyper attention blocks to integrate vision and language for long image-sequence understanding and reports SOTA results on single-image, multi-image, and video benchmarks.
LLaVA-OneVision: Easy Visual Task Transfer
cs.CV 2024-08 unverdicted novelty 5.0

LLaVA-OneVision is the first single open LMM to simultaneously achieve strong performance in single-image, multi-image, and video scenarios with cross-scenario transfer capabilities.
MiniCPM-V: A GPT-4V Level MLLM on Your Phone
cs.CV 2024-08 conditional novelty 5.0

MiniCPM-Llama3-V 2.5 delivers GPT-4V-level multimodal performance on phones through architecture, pretraining, and alignment optimizations.
InternLM-XComposer-2.5: A Versatile Large Vision Language Model Supporting Long-Contextual Input and Output
cs.CV 2024-07 conditional novelty 5.0

InternLM-XComposer-2.5 is a 7B vision-language model supporting up to 96K context that reaches GPT-4V-level performance on image, video, and multi-turn tasks and adds LoRA-driven text-image composition capabilities.
Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models
cs.CV 2024-03 unverdicted novelty 5.0

Mini-Gemini enhances VLMs via high-resolution visual refinement, curated reasoning data, and self-guided generation to reach leading zero-shot benchmark results across 2B-34B LLMs.
How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites
cs.CV 2024-04 unverdicted novelty 4.0

InternVL 1.5 narrows the performance gap to proprietary multimodal models via a stronger transferable vision encoder, dynamic high-resolution tiling, and curated English-Chinese training data.
A Survey on Multimodal Large Language Models
cs.CV 2023-06 accept novelty 3.0

This survey organizes the architectures, training strategies, data, evaluation methods, extensions, and challenges of Multimodal Large Language Models.

Reference graph

Works this paper leans on

129 extracted references · 129 canonical work pages · cited by 22 Pith papers · 5 internal anchors

[1]

2023 , eprint=

Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond , author=. 2023 , eprint=

work page 2023
[2]

2023 , eprint=

Improved Baselines with Visual Instruction Tuning , author=. 2023 , eprint=

work page 2023
[3]

2023 , eprint=

To See is to Believe: Prompting GPT-4V for Better Visual Instruction Tuning , author=. 2023 , eprint=

work page 2023
[5]

2023 , eprint=

MobileVLM : A Fast, Strong and Open Vision Language Assistant for Mobile Devices , author=. 2023 , eprint=

work page 2023
[8]

2023 , eprint=

CogVLM: Visual Expert for Pretrained Language Models , author=. 2023 , eprint=

work page 2023
[9]

2023 , eprint=

SPHINX: The Joint Mixing of Weights, Tasks, and Visual Embeddings for Multi-modal Large Language Models , author=. 2023 , eprint=

work page 2023
[10]

Introducing our Multimodal Models , url =

Bavishi, Rohan and Elsen, Erich and Hawthorne, Curtis and Nye, Maxwell and Odena, Augustus and Somani, Arushi and Ta. Introducing our Multimodal Models , url =

work page
[11]

2023 , howpublished =

IDEFICS , title =. 2023 , howpublished =

work page 2023
[12]

GPT-4V(ision) System Card , author=

work page
[13]

Vision-Flan:Scaling Visual Instruction Tuning , url =

Zhiyang Xu and Trevor Ashby and Chao Feng and Rulin Shao and Ying Shen and Di Jin and Qifan Wang and Lifu Huang , month =. Vision-Flan:Scaling Visual Instruction Tuning , url =

work page
[14]

LIMA: Less Is More for Alignment

Lima: Less is more for alignment , author=. arXiv preprint arXiv:2305.11206 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[15]

2022 , eprint=

LAION-5B: An open large-scale dataset for training next generation image-text models , author=. 2022 , eprint=

work page 2022
[18]

2022 , eprint=

BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation , author=. 2022 , eprint=

work page 2022
[19]

2023 , eprint=

BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models , author=. 2023 , eprint=

work page 2023
[20]

2023 , eprint=

InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning , author=. 2023 , eprint=

work page 2023
[21]

2015 , eprint=

Microsoft COCO: Common Objects in Context , author=. 2015 , eprint=

work page 2015
[23]

2016 , eprint=

Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations , author=. 2016 , eprint=

work page 2016
[24]

2019 international conference on document analysis and recognition (ICDAR) , pages=

Ocr-vqa: Visual question answering by reading text in images , author=. 2019 international conference on document analysis and recognition (ICDAR) , pages=. 2019 , organization=

work page 2019
[25]

Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition , pages=

Towards VQA Models That Can Read , author=. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition , pages=

work page
[26]

2019 , eprint=

GQA: A New Dataset for Real-World Visual Reasoning and Compositional Question Answering , author=. 2019 , eprint=

work page 2019
[27]

2024 , eprint=

MathVista: Evaluating Mathematical Reasoning of Foundation Models in Visual Contexts , author=. 2024 , eprint=

work page 2024
[28]

2021 , eprint=

Conceptual 12M: Pushing Web-Scale Image-Text Pre-Training To Recognize Long-Tail Visual Concepts , author=. 2021 , eprint=

work page 2021
[29]

Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=

Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning , author=. Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=

work page
[30]

2021 , eprint=

LAION-400M: Open Dataset of CLIP-Filtered 400 Million Image-Text Pairs , author=. 2021 , eprint=

work page 2021
[31]

Advances in neural information processing systems , volume=

Im2text: Describing images using 1 million captioned photographs , author=. Advances in neural information processing systems , volume=

work page
[32]

2023 , eprint=

Segment Anything , author=. 2023 , eprint=

work page 2023
[33]

2024 , eprint=

Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone , author=. 2024 , eprint=

work page 2024
[34]

2023 , eprint=

Llama 2: Open Foundation and Fine-Tuned Chat Models , author=. 2023 , eprint=

work page 2023
[35]

2024 , eprint=

MobileLLM: Optimizing Sub-billion Parameter Language Models for On-Device Use Cases , author=. 2024 , eprint=

work page 2024
[36]

2023 , eprint=

Qwen Technical Report , author=. 2023 , eprint=

work page 2023
[37]

Stable LM 2 1.6B , author=

work page
[38]

2023 , eprint=

HuatuoGPT-II, One-stage Training for Medical Adaption of LLMs , author=. 2023 , eprint=

work page 2023
[39]

2023 , eprint=

GPT-4 Technical Report , author=. 2023 , eprint=

work page 2023
[40]

2023 , eprint=

TinyGPT-V: Efficient Multimodal Large Language Model via Small Backbones , author=. 2023 , eprint=

work page 2023
[41]

2023 , eprint=

Phoenix: Democratizing ChatGPT across Languages , author=. 2023 , eprint=

work page 2023
[42]

2023 , eprint=

An Empirical Study of Catastrophic Forgetting in Large Language Models During Continual Fine-tuning , author=. 2023 , eprint=

work page 2023
[43]

2023 , eprint=

Investigating the Catastrophic Forgetting in Multimodal Large Language Models , author=. 2023 , eprint=

work page 2023
[44]

2021 , eprint=

Measuring Massive Multitask Language Understanding , author=. 2021 , eprint=

work page 2021
[45]

2021 , eprint=

Evaluating Large Language Models Trained on Code , author=. 2021 , eprint=

work page 2021
[46]

2021 , eprint=

Training Verifiers to Solve Math Word Problems , author=. 2021 , eprint=

work page 2021
[47]

and Stoica, Ion and Xing, Eric P

Chiang, Wei-Lin and Li, Zhuohan and Lin, Zi and Sheng, Ying and Wu, Zhanghao and Zhang, Hao and Zheng, Lianmin and Zhuang, Siyuan and Zhuang, Yonghao and Gonzalez, Joseph E. and Stoica, Ion and Xing, Eric P. , month =. Vicuna: An Open-Source Chatbot Impressing GPT-4 with 90\ url =

work page
[48]

2022 , eprint=

Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering , author=. 2022 , eprint=

work page 2022
[49]

2023 , eprint=

MMBench: Is Your Multi-modal Model an All-around Player? , author=. 2023 , eprint=

work page 2023
[50]

2023 , eprint=

MLLM-Bench, Evaluating Multi-modal LLMs using GPT-4V , author=. 2023 , eprint=

work page 2023
[51]

2023 , eprint=

TouchStone: Evaluating Vision-Language Models by Language Models , author=. 2023 , eprint=

work page 2023
[52]

2023 , eprint=

SEED-Bench: Benchmarking Multimodal LLMs with Generative Comprehension , author=. 2023 , eprint=

work page 2023
[53]

2023 , eprint=

MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models , author=. 2023 , eprint=

work page 2023
[54]

2023 , eprint=

Visual Instruction Tuning , author=. 2023 , eprint=

work page 2023
[55]

2023 , eprint=

MM-Vet: Evaluating Large Multimodal Models for Integrated Capabilities , author=. 2023 , eprint=

work page 2023
[57]

2023 , eprint=

HallusionBench: An Advanced Diagnostic Suite for Entangled Language Hallucination & Visual Illusion in Large Vision-Language Models , author=. 2023 , eprint=

work page 2023
[58]

2024 , eprint=

Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs , author=. 2024 , eprint=

work page 2024
[59]

2009 , publisher=

Learning multiple layers of features from tiny images , author=. 2009 , publisher=

work page 2009
[60]

Proceedings of the IEEE/CVF international conference on computer vision , pages=

Nocaps: Novel object captioning at scale , author=. Proceedings of the IEEE/CVF international conference on computer vision , pages=

work page
[61]

Transactions of the Association for Computational Linguistics , volume=

From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions , author=. Transactions of the Association for Computational Linguistics , volume=. 2014 , publisher=

work page 2014
[63]

2023 , eprint=

Aligning Large Multimodal Models with Factually Augmented RLHF , author=. 2023 , eprint=

work page 2023
[64]

2023 , eprint=

ConvNeXt V2: Co-designing and Scaling ConvNets with Masked Autoencoders , author=. 2023 , eprint=

work page 2023
[65]

2022 , eprint=

DINO: DETR with Improved DeNoising Anchor Boxes for End-to-End Object Detection , author=. 2022 , eprint=

work page 2022
[66]

2023 , eprint=

DINOv2: Learning Robust Visual Features without Supervision , author=. 2023 , eprint=

work page 2023
[67]

2022 , eprint=

EVA: Exploring the Limits of Masked Visual Representation Learning at Scale , author=. 2022 , eprint=

work page 2022
[68]

2021 , eprint=

Learning Transferable Visual Models From Natural Language Supervision , author=. 2021 , eprint=

work page 2021
[69]

2023 , eprint=

Sigmoid Loss for Language Image Pre-Training , author=. 2023 , eprint=

work page 2023
[70]

2022 , eprint=

Scaling Vision Transformers , author=. 2022 , eprint=

work page 2022
[71]

Ilharco, M

Ilharco, Gabriel and Wortsman, Mitchell and Wightman, Ross and Gordon, Cade and Carlini, Nicholas and Taori, Rohan and Dave, Achal and Shankar, Vaishaal and Namkoong, Hongseok and Miller, John and Hajishirzi, Hannaneh and Farhadi, Ali and Schmidt, Ludwig , title =. doi:10.5281/zenodo.5143773 , url =

work page doi:10.5281/zenodo.5143773
[72]

InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks

InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks , author=. arXiv preprint arXiv:2312.14238 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[73]

Journal of machine Learning research , volume=

Latent dirichlet allocation , author=. Journal of machine Learning research , volume=

work page
[74]

MALLET: A Machine Learning for Language Toolkit

Andrew Kachites McCallum. MALLET: A Machine Learning for Language Toolkit

work page
[75]

2023 , eprint=

On the Difference of BERT-style and CLIP-style Text Encoders , author=. 2023 , eprint=

work page 2023
[76]

, author=

Visualizing data using t-SNE. , author=. Journal of machine learning research , volume=

work page
[77]

Chandra and Dexter C

Ashok K. Chandra and Dexter C. Kozen and Larry J. Stockmeyer , year = "1981", title =. doi:10.1145/322234.322243

work page doi:10.1145/322234.322243 1981
[78]

Scalable training of

Andrew, Galen and Gao, Jianfeng , booktitle=. Scalable training of. 2007 , url=

work page 2007
[79]

Dan Gusfield , title =. 1997

work page 1997
[80]

Tetreault , title =

Mohammad Sadegh Rasooli and Joel R. Tetreault , title =. Computing Research Repository , volume =. 2015 , url =

work page 2015
[81]

A Framework for Learning Predictive Structures from Multiple Tasks and Unlabeled Data , Volume =

Ando, Rie Kubota and Zhang, Tong , Issn =. A Framework for Learning Predictive Structures from Multiple Tasks and Unlabeled Data , Volume =. Journal of Machine Learning Research , Month = dec, Numpages =. 2005 , url=

work page 2005
[82]

and Tukey, John W

Cooley, James W. and Tukey, John W. , journal=. An algorithm for the machine calculation of complex. 1965 , url=

work page 1965
[83]

Aho and Jeffrey D

Alfred V. Aho and Jeffrey D. Ullman , title =. 1972

work page 1972
[84]

Publications Manual , year = "1983", publisher =

work page 1983
[85]

Marah Abdin, Sam Ade Jacobs, Ammar Ahmad Awan, Jyoti Aneja, Ahmed Awadallah, Hany Awadalla, Nguyen Bach, Amit Bahree, Arash Bakhtiari, Jianmin Bao, Harkirat Behl, Alon Benhaim, Misha Bilenko, Johan Bjorck, Sébastien Bubeck, Qin Cai, Martin Cai, Caio César Teodoro Mendes, Weizhu Chen, Vishrav Chaudhary, Dong Chen, Dongdong Chen, Yen-Chun Chen, Yi-Ling Chen... 2024.
[86]

Harsh Agrawal, Karan Desai, Yufei Wang, Xinlei Chen, Rishabh Jain, Mark Johnson, Dhruv Batra, Devi Parikh, Stefan Lee, and Peter Anderson. Nocaps: Novel object captioning at scale. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 8948–8957, 2019.
[87]

Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-vl: A versatile vision-language model for understanding, localization, text reading, and beyond, 2023a.
[88]

Shuai Bai, Shusheng Yang, Jinze Bai, Peng Wang, Xingxuan Zhang, Junyang Lin, Xinggang Wang, Chang Zhou, and Jingren Zhou. Touchstone: Evaluating vision-language models by language models, 2023b.
