
arxiv: 2402.11684 · v2 · pith:Q65S5ERWnew · submitted 2024-02-18 · 💻 cs.CL · cs.AI

ALLaVA: Harnessing GPT4V-Synthesized Data for Lite Vision-Language Models

Pith reviewed 2026-05-23 22:17 UTC · model grok-4.3

keywords ALLaVA · synthetic data · vision-language models · GPT-4V · lite models · visual question answering · model efficiency · instruction tuning

The pith

A pipeline using GPT-4V to create 1.3 million synthetic samples lets lite vision-language models reach competitive results on 17 benchmarks and match larger models on several tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to close the performance gap between large vision-language models and smaller, cheaper ones by focusing on data quality instead of model scale. It builds a pipeline that prompts GPT-4V to produce detailed image annotations for alignment and complex visual question-answer pairs for instruction tuning, yielding the ALLaVA dataset of 1.3 million samples. Lite models trained on this data then show strong results across many standard tests. A reader would care because current high-performing models demand heavy computing resources that limit who can use or improve them.

Core claim

The central claim is that high-quality synthetic data generated by a strong proprietary model can substitute for larger model scale: a 1.3M-sample dataset of fine-grained image annotations and complex reasoning VQA pairs produced by GPT-4V enables lite VLMs to achieve competitive performance on 17 benchmarks among 4B-scale models and to perform on par with 7B/13B-scale models on various benchmarks.

What carries the argument

The two-stage synthetic data pipeline that first generates fine-grained image annotations for vision-language alignment and then produces complex reasoning visual question-answering pairs for visual instruction fine-tuning.
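
The shape of that two-stage output can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the record types, the `Annotator` interface, the prompt strings, and `fake_annotator` are all hypothetical stand-ins, and a real pipeline would call a vision-capable model where the stub is used here.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class AlignmentSample:
    """Stage 1 record: a fine-grained annotation for vision-language alignment."""
    image_id: str
    caption: str


@dataclass
class InstructionSample:
    """Stage 2 record: a reasoning VQA pair for visual instruction tuning."""
    image_id: str
    question: str
    answer: str


# Hypothetical annotator interface: (image_id, prompt) -> generated text.
Annotator = Callable[[str, str], str]


def build_dataset(
    image_ids: List[str], annotate: Annotator
) -> Tuple[List[AlignmentSample], List[InstructionSample]]:
    """Produce one alignment record and one instruction record per image."""
    alignment, instruction = [], []
    for image_id in image_ids:
        # Prompt strings below are placeholders, not the paper's prompts.
        caption = annotate(image_id, "Describe the image in detail.")
        question = annotate(
            image_id, "Ask a question that requires reasoning about the image."
        )
        answer = annotate(image_id, f"Answer the question: {question}")
        alignment.append(AlignmentSample(image_id, caption))
        instruction.append(InstructionSample(image_id, question, answer))
    return alignment, instruction


def fake_annotator(image_id: str, prompt: str) -> str:
    """Stub that echoes its inputs so the sketch runs without any model."""
    return f"[{image_id}] {prompt}"
```

Keeping the two record types separate mirrors the split between alignment data and instruction-tuning data, so each training stage can consume only the records meant for it.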

If this is right

  • Lite VLMs trained on the synthetic dataset reach competitive results among 4B-scale models across 17 benchmarks.
  • The same models perform on par with 7B- and 13B-scale models on various benchmarks.
  • High-quality synthetic data reduces the need for massive human-labeled corpora when building resource-efficient LVLMs.
  • The approach demonstrates that data synthesis can make vision-language capabilities available with lower training and deployment costs.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Open-sourcing the ALLaVA dataset lets independent groups replicate or extend the efficiency gains without repeating the GPT-4V generation step.
  • The same synthesis method could be applied to other multimodal tasks such as image captioning or visual reasoning to test whether data quality continues to substitute for scale.
  • Combining the synthetic pre-training with later fine-tuning on real user data might further close remaining gaps on tasks where synthetic artifacts remain visible.

Load-bearing premise

The GPT-4V generated annotations and VQA pairs are of high enough quality and free of systematic artifacts that training on them produces genuine capability gains rather than benchmark-specific overfitting.

What would settle it

A large performance drop on a fresh collection of vision-language benchmarks that were never used during data generation or model selection would indicate the gains come from overfitting to the synthetic data distribution.

Original abstract

Large vision-language models (LVLMs) have shown promise in a broad range of vision-language tasks with their strong reasoning and generalization capabilities. However, they require considerable computational resources for training and deployment. This study aims to bridge the performance gap between traditional-scale LVLMs and resource-friendly lite versions by adopting high-quality training data. To this end, we propose a comprehensive pipeline for generating a synthetic dataset. The key idea is to leverage strong proprietary models to generate (i) fine-grained image annotations for vision-language alignment and (ii) complex reasoning visual question-answering pairs for visual instruction fine-tuning, yielding 1.3M samples in total. We train a series of lite VLMs on the synthetic dataset, and experimental results demonstrate the effectiveness of the proposed scheme: the models achieve competitive performance on 17 benchmarks among 4B LVLMs, and even perform on par with 7B/13B-scale models on various benchmarks. This work highlights the feasibility of adopting high-quality data in crafting more efficient LVLMs. We name our dataset ALLaVA, and open-source it to the research community for developing better resource-efficient LVLMs for wider usage.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes ALLaVA, a 1.3M-sample synthetic dataset generated via GPT-4V to produce fine-grained image annotations for vision-language alignment and complex reasoning VQA pairs for instruction tuning. Lite VLMs (around 4B scale) are trained on this data and reported to achieve competitive results on 17 benchmarks relative to other 4B LVLMs, with parity to some 7B/13B models on selected tasks, demonstrating that high-quality synthetic data can close the performance gap for resource-efficient models.

Significance. If the gains are attributable to genuine capability rather than artifacts, the work would show that carefully synthesized data from strong proprietary models can enable smaller VLMs to approach the performance of much larger ones, with direct implications for lowering training and inference costs in multimodal systems.

major comments (2)
  1. [§3 (data generation pipeline) and §4 (experiments)] The central empirical claim (competitive performance on 17 benchmarks) rests on the assumption that the GPT-4V-generated VQA pairs and annotations contain no systematic overlap with the evaluation sets. No decontamination analysis, overlap statistics, or membership inference checks are described for the 1.3M samples relative to the 17 benchmarks; given GPT-4V's training corpus, this leaves open the possibility that reported gains partly reflect implicit test-set supervision rather than improved generalization.
  2. [Abstract and §4] The abstract and experimental summary state performance outcomes without reporting error bars, number of runs, or statistical significance tests for the benchmark comparisons; this makes it impossible to assess whether the parity with 7B/13B models is robust or within the variance of the evaluation protocol.
minor comments (2)
  1. [§4] Notation for model scales (e.g., '4B LVLMs') should be defined consistently with parameter counts of the trained models and baselines.
  2. [§3] The pipeline description would benefit from an explicit diagram or table enumerating the exact prompts and filtering steps used to produce the 1.3M samples.
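
The overlap statistics requested in major comment 1 could take many forms. As one minimal sketch, assuming whitespace tokenization and exact n-gram matching (neither is specified by the paper or the report), the function below reports the fraction of benchmark items that share at least one n-gram with the training text. The names `ngrams` and `overlap_rate` and the default `n = 8` are illustrative choices.

```python
from typing import Iterable, Set, Tuple


def ngrams(text: str, n: int = 8) -> Set[Tuple[str, ...]]:
    """Lowercased, whitespace-tokenized n-grams; empty if text is shorter than n."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_rate(
    train_texts: Iterable[str], benchmark_texts: Iterable[str], n: int = 8
) -> float:
    """Fraction of benchmark items sharing at least one n-gram with the training set."""
    train_grams: Set[Tuple[str, ...]] = set()
    for text in train_texts:
        train_grams |= ngrams(text, n)
    bench = list(benchmark_texts)
    if not bench:
        return 0.0
    hits = sum(1 for item in bench if ngrams(item, n) & train_grams)
    return hits / len(bench)
```

Exact n-gram matching only catches near-verbatim reuse. Paraphrased questions and shared images would need the semantic-similarity or image-level checks that this sketch does not attempt.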

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. The two major comments identify important gaps in validating the empirical claims. We address each below and commit to revisions that strengthen the manuscript without overstating what was originally done.

Point-by-point responses
  1. Referee: [§3 (data generation pipeline) and §4 (experiments)] The central empirical claim (competitive performance on 17 benchmarks) rests on the assumption that the GPT-4V-generated VQA pairs and annotations contain no systematic overlap with the evaluation sets. No decontamination analysis, overlap statistics, or membership inference checks are described for the 1.3M samples relative to the 17 benchmarks; given GPT-4V's training corpus, this leaves open the possibility that reported gains partly reflect implicit test-set supervision rather than improved generalization.

    Authors: We agree this is a substantive concern. The synthesis pipeline generates new annotations and VQA pairs via GPT-4V, but the original manuscript contains no decontamination analysis or overlap statistics against the 17 evaluation sets. We will add this analysis in revision, including n-gram overlap counts, semantic similarity filtering, and checks for near-duplicate VQA pairs where computationally feasible. This will be reported in a new subsection under data generation. revision: yes

  2. Referee: [Abstract and §4] The abstract and experimental summary state performance outcomes without reporting error bars, number of runs, or statistical significance tests for the benchmark comparisons; this makes it impossible to assess whether the parity with 7B/13B models is robust or within the variance of the evaluation protocol.

    Authors: We acknowledge the lack of statistical reporting. All main results were obtained from single training runs owing to the scale of 4B-model training. In the revision we will state the number of runs explicitly, add error bars for the primary benchmarks by re-running a subset of models with different random seeds, and include variance estimates from the existing ablation studies to contextualize the reported parity with larger models. revision: yes
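
The multi-seed reporting promised in response 2 can be illustrated with a small helper. This is a sketch under stated assumptions: `summarize` and `within_noise` are hypothetical names, and the comparison rule (gap between means no larger than `k` times the larger standard deviation) is a crude heuristic rather than a significance test.

```python
from statistics import mean, stdev
from typing import Dict, List, Tuple


def summarize(scores_by_model: Dict[str, List[float]]) -> Dict[str, Tuple[float, float]]:
    """Return (mean, sample std) per model across seeds.

    With a single run the std is reported as 0.0, which means "unknown",
    not "no variance"; single-run results cannot support a parity claim.
    """
    summary = {}
    for model, scores in scores_by_model.items():
        spread = stdev(scores) if len(scores) > 1 else 0.0
        summary[model] = (mean(scores), spread)
    return summary


def within_noise(a: Tuple[float, float], b: Tuple[float, float], k: float = 1.0) -> bool:
    """True if the gap between two means is at most k times the larger std."""
    return abs(a[0] - b[0]) <= k * max(a[1], b[1])
```

For example, passing `{"lite": [60.0, 62.0, 64.0], "large": [63.0, 63.0, 63.0]}` (made-up scores, not results from the paper) gives the lite model a mean of 62.0 with a standard deviation of 2.0, so a one-point gap to the larger model falls inside the seed-to-seed spread.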

Circularity Check

0 steps flagged

No circularity in empirical pipeline or benchmark results

Full rationale

The paper presents an empirical pipeline: GPT-4V is used to synthesize 1.3M image annotations and VQA pairs, lite VLMs are trained on them, and performance is measured directly on 17 external benchmarks. No equations, derivations, fitted parameters renamed as predictions, or load-bearing self-citations are present in the abstract or described claims. Results are falsifiable experimental outcomes rather than reductions to inputs by construction, so the work is self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the unverified quality of GPT-4V outputs as training data; no free parameters or invented entities are described in the abstract.

axioms (1)
  • domain assumption GPT-4V can reliably produce fine-grained image annotations and complex reasoning VQA pairs suitable for VLM training
    This assumption underpins the entire data-generation pipeline and the subsequent performance claims.


Forward citations

Cited by 23 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Vision-Language Models

    cs.CV 2024-09 accept novelty 8.0

    Molmo VLMs trained on newly collected PixMo open datasets achieve state-of-the-art performance among open-weight models and surpass multiple proprietary VLMs including Claude 3.5 Sonnet and Gemini 1.5 Pro.

  2. Mitigating Mask Prior Drift and Positional Attention Collapse in Large Diffusion Vision-Language Models

    cs.CV 2026-05 unverdicted novelty 7.0

    Mask prior drift and positional attention collapse cause failures in LDVLMs for long generations, fixed by training-free Mask Prior Suppression and Monotonic RoPE Scaling.

  3. From Standalone LLMs to Integrated Intelligence: A Survey of Compound AI Systems

    cs.MA 2025-06 accept novelty 7.0

    A survey that defines Compound AI Systems, proposes a multi-dimensional taxonomy based on component roles and orchestration strategies, reviews four foundational paradigms, and identifies key challenges for future research.

  4. FLARE: Fully Integration of Vision-Language Representations for Deep Cross-Modal Understanding

    cs.CV 2025-04 unverdicted novelty 7.0

    FLARE is a vision-language model family using text-guided vision encoding, context-aware alignment decoding, dual-semantic mapping loss, and text-driven VQA synthesis to achieve deep cross-modal integration, outperfor...

  5. VisRAG: Vision-based Retrieval-augmented Generation on Multi-modality Documents

    cs.IR 2024-10 conditional novelty 7.0

    VisRAG achieves 20-40% better end-to-end performance than text-based RAG by directly embedding and retrieving document images with VLMs.

  6. Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs

    cs.CV 2024-06 unverdicted novelty 7.0

    Cambrian-1 is a vision-centric multimodal LLM family that evaluates over 20 vision encoders, introduces CV-Bench and the Spatial Vision Aggregator, and releases open models, code, and data achieving strong performance...

  7. Mitigating Mask Prior Drift and Positional Attention Collapse in Large Diffusion Vision-Language Models

    cs.CV 2026-05 unverdicted novelty 6.0

    Diagnoses mask prior drift and positional attention collapse in LDVLMs and introduces two plug-and-play decoding interventions that raise long-form generation quality without retraining.

  8. LLaVA-CKD: Bottom-Up Cascaded Knowledge Distillation for Vision-Language Models

    cs.CV 2026-05 unverdicted novelty 6.0

    A cascaded knowledge distillation method with intermediate teachers improves efficiency of vision-language models like LLaVA while achieving state-of-the-art results on seven VQA benchmarks.

  9. VideoChat-Flash: Hierarchical Compression for Long-Context Video Modeling

    cs.CV 2024-12 unverdicted novelty 6.0

    VideoChat-Flash applies hierarchical video token compression to achieve ~50x reduction in context length for long videos while maintaining near-original performance on long-context benchmarks.

  10. MetaMorph: Multimodal Understanding and Generation via Instruction Tuning

    cs.CV 2024-12 unverdicted novelty 6.0

    VPiT enables pretrained LLMs to perform both visual understanding and generation by predicting discrete text tokens and continuous visual tokens, with understanding data proving more effective than generation-specific data.

  11. Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling

    cs.CV 2024-12 unverdicted novelty 6.0

    InternVL 2.5 is the first open-source MLLM to surpass 70% on the MMMU benchmark via model, data, and test-time scaling, with a 3.7-point gain from chain-of-thought reasoning.

  12. A Nash Equilibrium Framework For Training-Free Multimodal Step Verification

    cs.CV 2026-05 unverdicted novelty 5.0

    A Nash equilibrium framework for training-free multimodal step verification that uses cross-modal agreement and disagreement signals for filtering and ranking reasoning steps.

  13. Qwen2.5-VL Technical Report

    cs.CV 2025-02 unverdicted novelty 5.0

    Qwen2.5-VL reports a vision-language model family using native dynamic-resolution ViT and absolute time encoding that matches GPT-4o on document and diagram tasks while supporting hour-long videos with second-level lo...

  14. InternVideo2.5: Empowering Video MLLMs with Long and Rich Context Modeling

    cs.CV 2025-01 unverdicted novelty 5.0

    InternVideo2.5 improves video MLLMs by incorporating dense vision task annotations via direct preference optimization and compact spatiotemporal representations via adaptive hierarchical token compression, yielding be...

  15. DeepSeek-VL2: Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding

    cs.CV 2024-12 accept novelty 5.0

    DeepSeek-VL2 is a series of MoE vision-language models using dynamic tiling and latent attention that reach competitive or state-of-the-art results on VQA, OCR, document understanding and grounding with 1.0B to 4.5B a...

  16. NVILA: Efficient Frontier Visual Language Models

    cs.CV 2024-12 unverdicted novelty 5.0

    NVILA improves on VILA with a scale-then-compress visual token strategy and full-lifecycle efficiency optimizations, matching or exceeding leading VLMs on image and video benchmarks while reducing training cost 1.9-5....

  17. mPLUG-Owl3: Towards Long Image-Sequence Understanding in Multi-Modal Large Language Models

    cs.CV 2024-08 unverdicted novelty 5.0

    mPLUG-Owl3 introduces hyper attention blocks to integrate vision and language for long image-sequence understanding and reports SOTA results on single-image, multi-image, and video benchmarks.

  18. LLaVA-OneVision: Easy Visual Task Transfer

    cs.CV 2024-08 unverdicted novelty 5.0

    LLaVA-OneVision is the first single open LMM to simultaneously achieve strong performance in single-image, multi-image, and video scenarios with cross-scenario transfer capabilities.

  19. MiniCPM-V: A GPT-4V Level MLLM on Your Phone

    cs.CV 2024-08 conditional novelty 5.0

    MiniCPM-Llama3-V 2.5 delivers GPT-4V-level multimodal performance on phones through architecture, pretraining, and alignment optimizations.

  20. InternLM-XComposer-2.5: A Versatile Large Vision Language Model Supporting Long-Contextual Input and Output

    cs.CV 2024-07 conditional novelty 5.0

    InternLM-XComposer-2.5 is a 7B vision-language model supporting up to 96K context that reaches GPT-4V-level performance on image, video, and multi-turn tasks and adds LoRA-driven text-image composition capabilities.

  21. Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models

    cs.CV 2024-03 unverdicted novelty 5.0

    Mini-Gemini enhances VLMs via high-resolution visual refinement, curated reasoning data, and self-guided generation to reach leading zero-shot benchmark results across 2B-34B LLMs.

  22. How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites

    cs.CV 2024-04 unverdicted novelty 4.0

    InternVL 1.5 narrows the performance gap to proprietary multimodal models via a stronger transferable vision encoder, dynamic high-resolution tiling, and curated English-Chinese training data.

  23. A Survey on Multimodal Large Language Models

    cs.CV 2023-06 accept novelty 3.0

    This survey organizes the architectures, training strategies, data, evaluation methods, extensions, and challenges of Multimodal Large Language Models.
