{"work":{"id":"fbf16034-512e-4dd9-a0bd-7ddd23f532a6","openalex_id":null,"doi":null,"arxiv_id":"2111.02114","raw_key":null,"title":"LAION-400M: Open Dataset of CLIP-Filtered 400 Million Image-Text Pairs","authors":null,"authors_text":"Christoph Schuhmann, Richard Vencu, Romain Beaumont, Robert Kaczmarczyk, Clayton Mullis, Aarush Katta","year":2021,"venue":"cs.CV","abstract":"Multi-modal language-vision models trained on hundreds of millions of image-text pairs (e.g. CLIP, DALL-E) gained a recent surge, showing remarkable capability to perform zero- or few-shot learning and transfer even in absence of per-sample labels on target image data. Despite this trend, to date there has been no publicly available datasets of sufficient scale for training such models from scratch. To address this issue, in a community effort we build and release for public LAION-400M, a dataset with CLIP-filtered 400 million image-text pairs, their CLIP embeddings and kNN indices that allow efficient similarity search.","external_url":"https://arxiv.org/abs/2111.02114","cited_by_count":null,"metadata_source":"pith","metadata_fetched_at":"2026-05-25T06:10:23.739252+00:00","pith_arxiv_id":"2111.02114","created_at":"2026-05-10T03:29:22.054348+00:00","updated_at":"2026-06-05T21:23:00.469572+00:00","title_quality_ok":true,"display_title":"LAION-400M: Open Dataset of CLIP-Filtered 400 Million Image-Text Pairs","render_title":"LAION-400M: Open Dataset of CLIP-Filtered 400 Million Image-Text Pairs"},"hub":{"state":{"work_id":"fbf16034-512e-4dd9-a0bd-7ddd23f532a6","tier":"hub","tier_reason":"10+ Pith inbound or 1,000+ external 
citations","pith_inbound_count":77,"external_cited_by_count":null,"distinct_field_count":10,"first_pith_cited_at":"2022-04-29T16:29:01+00:00","last_pith_cited_at":"2026-05-21T06:36:59+00:00","author_build_status":"not_needed","summary_status":"needed","contexts_status":"needed","graph_status":"needed","ask_index_status":"not_needed","reader_status":"not_needed","recognition_status":"not_needed","updated_at":"2026-06-12T07:29:14.310730+00:00","tier_text":"hub"},"tier":"hub","role_counts":[{"context_role":"dataset","n":17},{"context_role":"background","n":6},{"context_role":"method","n":1},{"context_role":"other","n":1}],"polarity_counts":[{"context_polarity":"use_dataset","n":16},{"context_polarity":"background","n":6},{"context_polarity":"unclear","n":2},{"context_polarity":"use_method","n":1}],"runs":{"context_extract":{"job_type":"context_extract","status":"succeeded","result":{"enqueued_papers":25},"error":null,"updated_at":"2026-05-14T18:29:55.160650+00:00"},"graph_features":{"job_type":"graph_features","status":"succeeded","result":{"co_cited":[{"title":"Hierarchical Text-Conditional Image Generation with CLIP Latents","work_id":"0c6a768b-70b8-4242-bb0e-459f1008c9fc","shared_citers":8},{"title":"GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models","work_id":"34430d19-7919-48ce-88a5-17b3bfe2192e","shared_citers":7},{"title":"An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale","work_id":"e96730e3-129b-4db6-b981-15ab7932e297","shared_citers":6},{"title":"BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding","work_id":"ed240a10-5b19-406c-baa5-30803f465785","shared_citers":6},{"title":"Coca: Contrastive captioners are image-text foundation models","work_id":"5dd5bf10-d548-40ff-9b6c-6735129b27ee","shared_citers":6},{"title":"Decoupled Weight Decay Regularization","work_id":"07ef7360-d385-4033-83f7-8384a6325204","shared_citers":6},{"title":"Llama 2: Open Foundation and 
Fine-Tuned Chat Models","work_id":"68a5177f-d644-44c1-bd4f-4e5278c22f5d","shared_citers":6},{"title":"Denoising Diffusion Implicit Models","work_id":"8fa2128b-d18c-405c-ac92-0e669cf89ac0","shared_citers":5},{"title":"DINOv2: Learning Robust Visual Features without Supervision","work_id":"26b304e5-b54a-4f26-be7e-83299eca52e4","shared_citers":5},{"title":"LoRA: Low-Rank Adaptation of Large Language Models","work_id":"0426219a-789e-4964-adc8-a04538510818","shared_citers":5},{"title":"Qwen Technical Report","work_id":"bb1fd52f-6b2f-437c-9516-37bdf6eb9be8","shared_citers":5},{"title":"Scaling Laws for Neural Language Models","work_id":"b7dd8749-9c45-4977-ab9b-64478dce1ae8","shared_citers":5},{"title":"LAION-5B: An open large-scale dataset for training next generation image-text models","work_id":"1d19deb4-3043-409d-b901-f047c51a323b","shared_citers":4},{"title":"Learning Transferable Visual Models From Natural Language Supervision","work_id":"6de86bb5-27bd-4d5c-8b89-967ebfc52659","shared_citers":4},{"title":"LLaMA: Open and Efficient Foundation Language Models","work_id":"c018fc23-6f3f-4035-9d02-28a2173b2b9d","shared_citers":4},{"title":"Pali: A jointly-scaled multilingual language-image model","work_id":"29921cff-29c1-4aad-a27d-adac346027ec","shared_citers":4},{"title":"Qwen3-VL Technical Report","work_id":"1fe243aa-e3c0-4da6-b391-4cbcfc88d5c0","shared_citers":4},{"title":"SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis","work_id":"8034c587-fba6-4941-87ba-c98f2ac962cb","shared_citers":4},{"title":"Training Compute-Optimal Large Language Models","work_id":"b2faf28d-86b7-429c-bc42-469458efc246","shared_citers":4},{"title":"13 Published as a conference paper at ICLR 2026 Sheng Jin, Xueying Jiang, Jiaxing Huang, Lewei Lu, and Shijian Lu","work_id":"d28390f3-8b21-4b2a-a523-473a16c2e43a","shared_citers":3},{"title":"An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual 
Inversion","work_id":"ca618c21-3ba6-448e-bd86-bcecff3cdeb5","shared_citers":3},{"title":"DINOv3","work_id":"c8b07deb-8fe7-4e18-9620-f3569d3529ce","shared_citers":3},{"title":"Gemini: A Family of Highly Capable Multimodal Models","work_id":"83f7c85b-3f11-450f-ac0c-64d9745220b2","shared_citers":3},{"title":"Improved Baselines with Visual Instruction Tuning","work_id":"5baeaa33-5986-44a3-85a4-fcabd6fc1e8d","shared_citers":3}],"time_series":[{"n":5,"year":2022},{"n":4,"year":2023},{"n":4,"year":2024},{"n":1,"year":2025},{"n":18,"year":2026}],"dependency_candidates":[]},"error":null,"updated_at":"2026-05-14T18:30:03.729046+00:00"},"identity_refresh":{"job_type":"identity_refresh","status":"succeeded","result":{"items":[{"title":"Qwen3 Technical Report","outcome":"unchanged","work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e","resolver":"local_arxiv","confidence":0.98,"old_work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e"}],"counts":{"fixed":0,"merged":0,"unchanged":1,"quarantined":0,"needs_external_resolution":0},"errors":[],"attempted":1},"error":null,"updated_at":"2026-05-14T18:29:59.751392+00:00"},"summary_claims":{"job_type":"summary_claims","status":"succeeded","result":{"title":"LAION-400M: Open Dataset of CLIP-Filtered 400 Million Image-Text Pairs","claims":[{"claim_text":"Multi-modal language-vision models trained on hundreds of millions of image-text pairs (e.g. CLIP, DALL-E) gained a recent surge, showing remarkable capability to perform zero- or few-shot learning and transfer even in absence of per-sample labels on target image data. Despite this trend, to date there has been no publicly available datasets of sufficient scale for training such models from scratch. 
To address this issue, in a community effort we build and release for public LAION-400M, a dataset with CLIP-filtered 400 million image-text pairs, their CLIP embeddings and kNN indices that allow ","claim_type":"abstract","evidence_strength":"source_metadata"}],"why_cited":"Pith tracks LAION-400M: Open Dataset of CLIP-Filtered 400 Million Image-Text Pairs because it crossed a citation-hub threshold.","role_counts":[]},"error":null,"updated_at":"2026-05-14T18:30:16.243897+00:00"}},"summary":{"title":"LAION-400M: Open Dataset of CLIP-Filtered 400 Million Image-Text Pairs","claims":[{"claim_text":"Multi-modal language-vision models trained on hundreds of millions of image-text pairs (e.g. CLIP, DALL-E) gained a recent surge, showing remarkable capability to perform zero- or few-shot learning and transfer even in absence of per-sample labels on target image data. Despite this trend, to date there has been no publicly available datasets of sufficient scale for training such models from scratch. 
To address this issue, in a community effort we build and release for public LAION-400M, a dataset with CLIP-filtered 400 million image-text pairs, their CLIP embeddings and kNN indices that allow ","claim_type":"abstract","evidence_strength":"source_metadata"}],"why_cited":"Pith tracks LAION-400M: Open Dataset of CLIP-Filtered 400 Million Image-Text Pairs because it crossed a citation-hub threshold.","role_counts":[]},"graph":{"co_cited":[{"title":"Hierarchical Text-Conditional Image Generation with CLIP Latents","work_id":"0c6a768b-70b8-4242-bb0e-459f1008c9fc","shared_citers":8},{"title":"GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models","work_id":"34430d19-7919-48ce-88a5-17b3bfe2192e","shared_citers":7},{"title":"An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale","work_id":"e96730e3-129b-4db6-b981-15ab7932e297","shared_citers":6},{"title":"BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding","work_id":"ed240a10-5b19-406c-baa5-30803f465785","shared_citers":6},{"title":"Coca: Contrastive captioners are image-text foundation models","work_id":"5dd5bf10-d548-40ff-9b6c-6735129b27ee","shared_citers":6},{"title":"Decoupled Weight Decay Regularization","work_id":"07ef7360-d385-4033-83f7-8384a6325204","shared_citers":6},{"title":"Llama 2: Open Foundation and Fine-Tuned Chat Models","work_id":"68a5177f-d644-44c1-bd4f-4e5278c22f5d","shared_citers":6},{"title":"Denoising Diffusion Implicit Models","work_id":"8fa2128b-d18c-405c-ac92-0e669cf89ac0","shared_citers":5},{"title":"DINOv2: Learning Robust Visual Features without Supervision","work_id":"26b304e5-b54a-4f26-be7e-83299eca52e4","shared_citers":5},{"title":"LoRA: Low-Rank Adaptation of Large Language Models","work_id":"0426219a-789e-4964-adc8-a04538510818","shared_citers":5},{"title":"Qwen Technical Report","work_id":"bb1fd52f-6b2f-437c-9516-37bdf6eb9be8","shared_citers":5},{"title":"Scaling Laws for Neural Language 
Models","work_id":"b7dd8749-9c45-4977-ab9b-64478dce1ae8","shared_citers":5},{"title":"LAION-5B: An open large-scale dataset for training next generation image-text models","work_id":"1d19deb4-3043-409d-b901-f047c51a323b","shared_citers":4},{"title":"Learning Transferable Visual Models From Natural Language Supervision","work_id":"6de86bb5-27bd-4d5c-8b89-967ebfc52659","shared_citers":4},{"title":"LLaMA: Open and Efficient Foundation Language Models","work_id":"c018fc23-6f3f-4035-9d02-28a2173b2b9d","shared_citers":4},{"title":"Pali: A jointly-scaled multilingual language-image model","work_id":"29921cff-29c1-4aad-a27d-adac346027ec","shared_citers":4},{"title":"Qwen3-VL Technical Report","work_id":"1fe243aa-e3c0-4da6-b391-4cbcfc88d5c0","shared_citers":4},{"title":"SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis","work_id":"8034c587-fba6-4941-87ba-c98f2ac962cb","shared_citers":4},{"title":"Training Compute-Optimal Large Language Models","work_id":"b2faf28d-86b7-429c-bc42-469458efc246","shared_citers":4},{"title":"13 Published as a conference paper at ICLR 2026 Sheng Jin, Xueying Jiang, Jiaxing Huang, Lewei Lu, and Shijian Lu","work_id":"d28390f3-8b21-4b2a-a523-473a16c2e43a","shared_citers":3},{"title":"An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion","work_id":"ca618c21-3ba6-448e-bd86-bcecff3cdeb5","shared_citers":3},{"title":"DINOv3","work_id":"c8b07deb-8fe7-4e18-9620-f3569d3529ce","shared_citers":3},{"title":"Gemini: A Family of Highly Capable Multimodal Models","work_id":"83f7c85b-3f11-450f-ac0c-64d9745220b2","shared_citers":3},{"title":"Improved Baselines with Visual Instruction Tuning","work_id":"5baeaa33-5986-44a3-85a4-fcabd6fc1e8d","shared_citers":3}],"time_series":[{"n":5,"year":2022},{"n":4,"year":2023},{"n":4,"year":2024},{"n":1,"year":2025},{"n":18,"year":2026}],"dependency_candidates":[]},"authors":[]}}