Web-crawled image-text pairs are inherently noisy. Prior studies demonstrate that semantically aligning and enriching the textual descriptions of these pairs can significantly enhance model training across various vision-language tasks, particularly text-to-image generation. However, large-scale investigations in this area remain predominantly closed-source. Our paper aims to close this gap for the community, leveraging the powerful and open-sourced LLaMA-3, a GPT-4-level LLM. Our recaptioning pipeline is simple: first, we fine-tune a LLaMA-3-8B-powered LLaVA-1.5, and then we employ it to recaption ~1.3 billion images from the DataComp-1B dataset. Our empirical results confirm that this enhanced dataset, Recap-DataComp-1B, offers substantial benefits in training advanced vision-language models. For discriminative models like CLIP, we observe enhanced zero-shot performance in cross-modal retrieval tasks. For generative models like text-to-image Diffusion Transformers, the generated images exhibit significantly improved alignment with users' text instructions, especially when following complex queries. We release this dataset publicly at https://huggingface.co/datasets/UCSC-VLAA/Recap-DataComp-1B.
Pipeline Overview. An illustration of our recaptioning pipeline on DataComp-1B. We use a LLaMA-3-powered LLaVA to recaption images, which enables us to train stronger CLIP models and text-to-image diffusion models.
Recaption Model Capabilities. We follow the setup of LLaVA-1.5 to build our captioner model, except that we use LLaMA-3-8B as the language decoder because of its superior performance.
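Below is a minimal sketch of how the released captioner could be queried for a single image through Hugging Face `transformers`. It assumes the checkpoint is compatible with the library's generic LLaVA classes and that a LLaVA-style prompt with an `<image>` placeholder is expected; the exact prompt template and loading recipe should be checked against the model card.

```python
# Minimal sketch: recaption one image with the released LLaVA-1.5-LLaMA3-8B captioner.
# Assumes the checkpoint loads through transformers' generic LLaVA classes; the prompt
# template below is an assumption -- consult the model card for the exact format.
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "tennant/llava-llama-3-8b-hqedit"  # released captioner (see the model table below)
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

image = Image.open("example.jpg")  # placeholder image path
prompt = "USER: <image>\nPlease describe this image in detail. ASSISTANT:"

inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=128, do_sample=False)
print(processor.decode(output_ids[0], skip_special_tokens=True))
```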
Recaption Data Statistics. Caption length distributions of the original captions and our recaptioned data in DataComp-1B, computed from ~133 million sampled examples.
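For readers who want to reproduce rough length statistics themselves, here is a small sketch that streams a sample of Recap-DataComp-1B with the `datasets` library and counts caption word lengths. The column names `org_caption` and `re_caption`, and the use of the default configuration, are assumptions to verify against the dataset card.

```python
# Sketch: estimate caption length statistics from a streamed sample of Recap-DataComp-1B.
# Column names "org_caption" and "re_caption" are assumptions -- check the dataset card.
from datasets import load_dataset

ds = load_dataset("UCSC-VLAA/Recap-DataComp-1B", split="train", streaming=True)

org_lengths, re_lengths = [], []
for i, row in enumerate(ds):
    if i >= 50_000:  # a modest sample is enough for a rough distribution
        break
    org_lengths.append(len(str(row["org_caption"]).split()))
    re_lengths.append(len(str(row["re_caption"]).split()))

print(f"mean words, original captions: {sum(org_lengths) / len(org_lengths):.1f}")
print(f"mean words, recaptions:        {sum(re_lengths) / len(re_lengths):.1f}")
```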
Recap-CLIP Results. We set the training mixing ratio p between the recaptioned data and the original data to 0.8 for recaption-based models. We report zero-shot top-1 accuracy on ImageNet-1K and top-1 recall on COCO and Flickr30K.
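As a usage pointer, the sketch below runs zero-shot image-text matching with the released Recap-CLIP checkpoint through `open_clip`. It assumes an `open_clip` version that can load Hub checkpoints via the `hf-hub:` prefix; the image path and candidate texts are placeholders.

```python
# Sketch: zero-shot image-text matching with the released Recap-CLIP checkpoint.
# Assumes an open_clip version that supports loading from the Hugging Face Hub
# with the "hf-hub:" prefix; see the model card for the recommended usage.
import torch
import open_clip
from PIL import Image

repo = "hf-hub:UCSC-VLAA/ViT-L-16-HTxt-Recap-CLIP"
model, preprocess = open_clip.create_model_from_pretrained(repo)
tokenizer = open_clip.get_tokenizer(repo)

image = preprocess(Image.open("example.jpg")).unsqueeze(0)  # placeholder image
texts = tokenizer(["a diagram", "a dog", "a cat"])           # placeholder candidates

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(texts)
    image_features = image_features / image_features.norm(dim=-1, keepdim=True)
    text_features = text_features / text_features.norm(dim=-1, keepdim=True)
    probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print(probs)  # per-text match probabilities for the image
```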
Recap-Diffusion Results. Text-to-image evaluation results on COCO-30K for DiT-B/4, trained with different mixing ratios on Recap-DataComp-1B. Note that for the GPT-4V score, we evaluate on a 3K subset.
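Both result tables refer to a mixing ratio p between recaptioned and original captions. To make that concrete, here is a toy, illustrative sampler (not the authors' training code): each training example uses the recaption with probability p and otherwise falls back to the original web caption.

```python
# Toy illustration of the mixing ratio p (not the authors' training code):
# each example is trained with its recaption with probability p, and with
# its original web caption otherwise.
import random
from typing import Optional

def pick_caption(org_caption: str, re_caption: str, p: float = 0.8,
                 rng: Optional[random.Random] = None) -> str:
    """Return the recaption with probability p, else the original caption."""
    rng = rng or random
    return re_caption if rng.random() < p else org_caption

# With p = 0.8, roughly 80% of sampled captions should be recaptions.
rng = random.Random(0)
picks = [pick_caption("original web caption", "detailed recaption", p=0.8, rng=rng)
         for _ in range(10_000)]
print(sum(pick == "detailed recaption" for pick in picks) / len(picks))  # ~0.8
```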
Results Examples. Visual comparison of results from DiT-L/2 and DiT-B/4 trained with the mixing ratio p=0.0.
We are pleased to announce the release of our recaptioned datasets, including Recap-DataComp-1B and Recap-COCO-30K, as well as our caption model, LLaVA-1.5-LLaMA3-8B. Stay tuned for the upcoming release of our CLIP and T2I models!
Dataset | Num. of Samples | URL |
---|---|---|
Recap-DataComp-1B | 1.24B | https://huggingface.co/datasets/UCSC-VLAA/Recap-DataComp-1B |
Recap-COCO-30K | 30.5K | https://huggingface.co/datasets/UCSC-VLAA/Recap-COCO-30K |
Model | Type | URL |
---|---|---|
LLaVA-1.5-LLaMA3-8B | Our recaption model | https://huggingface.co/tennant/llava-llama-3-8b-hqedit |
Recap-CLIP | CLIP | https://huggingface.co/UCSC-VLAA/ViT-L-16-HTxt-Recap-CLIP |
Recap-DiT | Text2Image | Coming Soon... |
@article{li2024recaption,
title = {What If We Recaption Billions of Web Images with LLaMA-3?},
author = {Li, Xianhang and Tu, Haoqin and Hui, Mude and Wang, Zeyu and Zhao, Bingchen and Xiao, Junfei and Mei, Jieru and Liu, Qing and Zheng, Huangjie and Zhou, Yuyin and Xie, Cihang},
journal = {arXiv preprint arXiv:2406.08478},
year = {2024}
}