arxiv: 2406.08478 · v2 · pith:4FS3OBEFnew · submitted 2024-06-12 · 💻 cs.CV · cs.CL

What If We Recaption Billions of Web Images with LLaMA-3?

classification 💻 cs.CV cs.CL
keywords images, models, dataset, enhanced, like, llama-3, pairs, recap-datacomp-1b

Web-crawled image-text pairs are inherently noisy. Prior studies demonstrate that semantically aligning and enriching textual descriptions of these pairs can significantly enhance model training across various vision-language tasks, particularly text-to-image generation. However, large-scale investigations in this area remain predominantly closed-source. Our paper aims to bridge this community effort, leveraging the powerful and open-sourced LLaMA-3, a GPT-4 level LLM. Our recaptioning pipeline is simple: first, we fine-tune a LLaMA-3-8B powered LLaVA-1.5 and then employ it to recaption 1.3 billion images from the DataComp-1B dataset. Our empirical results confirm that this enhanced dataset, Recap-DataComp-1B, offers substantial benefits in training advanced vision-language models. For discriminative models like CLIP, we observe enhanced zero-shot performance in cross-modal retrieval tasks. For generative models like text-to-image Diffusion Transformers, the generated images exhibit a significant improvement in alignment with users' text instructions, especially in following complex queries. Our project page is https://www.haqtu.me/Recap-Datacomp-1B/

This paper has not been read by Pith yet.

Forward citations

Cited by 5 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. DataComp-VLM: Improved Open Datasets for Vision-Language Models

    cs.CV 2026-06 conditional novelty 8.0

    DataComp-VLM benchmark shows instruction-heavy data mixing outperforms filtering for VLM training, with DCVLM-Baseline achieving 63.6% on 33 tasks for 8B models (+5.4pp over FineVision).

  2. DataComp-VLM: Improved Open Datasets for Vision-Language Models

    cs.CV 2026-06 unverdicted novelty 6.0

    DataComp-VLM benchmark shows instruction-heavy data mixtures outperform caption-heavy ones for VLM training, with DCVLM-Baseline reaching 63.6% on 33 tasks using 200B tokens, +5.4pp over FineVision.

  3. EmoCtrl: Controllable Emotional Image Content Generation

    cs.CV 2025-12 unverdicted novelty 6.0

    EmoCtrl generates images faithful to content prompts while expressing target emotions via textual/visual enhancement modules and emotion-driven preference optimization.

  4. OmniGen2: Towards Instruction-Aligned Multimodal Generation

    cs.CV 2025-06 unverdicted novelty 5.0

    OmniGen2 introduces a unified generative model with two distinct decoding pathways and a decoupled image tokenizer that achieves competitive results on text-to-image and editing benchmarks plus state-of-the-art consis...

  5. ReasonCLIP-58M: Visually Grounded Commonsense Reasoning Supervision for CLIP

    cs.CV 2026-06 unverdicted novelty 4.0

    ReasonCLIP-58M applies continual pretraining with visually grounded reasoning captions on 58M examples to improve CLIP-style models on commonsense and compositional reasoning tasks.