What If We Recaption Billions of Web Images with LLaMA-3?

Bingchen Zhao; Cihang Xie; Haoqin Tu; Huangjie Zheng; Jieru Mei; Junfei Xiao; Mude Hui; Qing Liu; Sucheng Ren; Xianhang Li

arxiv: 2406.08478 · v2 · pith:4FS3OBEFnew · submitted 2024-06-12 · 💻 cs.CV · cs.CL

Xianhang Li, Haoqin Tu, Mude Hui, Zeyu Wang, Bingchen Zhao, Junfei Xiao, Sucheng Ren, Jieru Mei, Qing Liu, Huangjie Zheng, Yuyin Zhou, Cihang Xie

classification 💻 cs.CV cs.CL

keywords images, models, dataset, enhanced, like, llama-3, pairs, recap-datacomp-1b

Web-crawled image-text pairs are inherently noisy. Prior studies demonstrate that semantically aligning and enriching textual descriptions of these pairs can significantly enhance model training across various vision-language tasks, particularly text-to-image generation. However, large-scale investigations in this area remain predominantly closed-source. Our paper aims to bridge this community effort, leveraging the powerful and open-sourced LLaMA-3, a GPT-4 level LLM. Our recaptioning pipeline is simple: first, we fine-tune a LLaMA-3-8B powered LLaVA-1.5 and then employ it to recaption 1.3 billion images from the DataComp-1B dataset. Our empirical results confirm that this enhanced dataset, Recap-DataComp-1B, offers substantial benefits in training advanced vision-language models. For discriminative models like CLIP, we observe enhanced zero-shot performance in cross-modal retrieval tasks. For generative models like text-to-image Diffusion Transformers, the generated images exhibit a significant improvement in alignment with users' text instructions, especially in following complex queries. Our project page is https://www.haqtu.me/Recap-Datacomp-1B/

Forward citations

Cited by 5 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

DataComp-VLM: Improved Open Datasets for Vision-Language Models
cs.CV 2026-06 conditional novelty 8.0

DataComp-VLM benchmark shows instruction-heavy data mixing outperforms filtering for VLM training, with DCVLM-Baseline achieving 63.6% on 33 tasks for 8B models (+5.4pp over FineVision).

DataComp-VLM: Improved Open Datasets for Vision-Language Models
cs.CV 2026-06 unverdicted novelty 6.0

DataComp-VLM benchmark shows instruction-heavy data mixtures outperform caption-heavy ones for VLM training, with DCVLM-Baseline reaching 63.6% on 33 tasks using 200B tokens, +5.4pp over FineVision.

EmoCtrl: Controllable Emotional Image Content Generation
cs.CV 2025-12 unverdicted novelty 6.0

EmoCtrl generates images faithful to content prompts while expressing target emotions via textual/visual enhancement modules and emotion-driven preference optimization.

OmniGen2: Towards Instruction-Aligned Multimodal Generation
cs.CV 2025-06 unverdicted novelty 5.0

OmniGen2 introduces a unified generative model with two distinct decoding pathways and a decoupled image tokenizer that achieves competitive results on text-to-image and editing benchmarks plus state-of-the-art consis...

ReasonCLIP-58M: Visually Grounded Commonsense Reasoning Supervision for CLIP
cs.CV 2026-06 unverdicted novelty 4.0

ReasonCLIP-58M applies continual pretraining with visually grounded reasoning captions on 58M examples to improve CLIP-style models on commonsense and compositional reasoning tasks.