LAION-400M: Open Dataset of CLIP-Filtered 400 Million Image-Text Pairs

Aarush Katta; Aran Komatsuzaki; Christoph Schuhmann; Clayton Mullis; Jenia Jitsev; Richard Vencu; Robert Kaczmarczyk; Romain Beaumont; Theo Coombes

arXiv: 2111.02114 · v1 · submitted 2021-11-03 · cs.CV · cs.CL · cs.LG

Pith reviewed 2026-05-12 10:14 UTC · model grok-4.3

classification cs.CV · cs.CL · cs.LG

keywords LAION-400M · image-text pairs · CLIP filtering · open dataset · multimodal models · vision-language · embeddings · kNN indices

The pith

A community effort releases LAION-400M, an open collection of 400 million CLIP-filtered image-text pairs with embeddings and search indices.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper builds and releases LAION-400M to solve the absence of large public datasets needed for training multimodal vision-language models from scratch. Prior work such as CLIP and DALL-E succeeded at scale but relied on private data, blocking wider replication and extension. The new resource supplies the pairs themselves, their CLIP embeddings, and kNN indices that support fast similarity search. If the filtering step preserves useful signal, researchers gain a concrete starting point for training competitive zero-shot and few-shot models without proprietary resources.

Core claim

The authors assembled and opened LAION-400M, a dataset of 400 million web-scraped image-text pairs that CLIP has filtered for relevance, together with the corresponding CLIP embeddings and kNN indices that enable efficient similarity search across the collection.

What carries the argument

The CLIP-filtered image-text pair collection, which supplies the raw training material plus precomputed embeddings and kNN indices that turn the 400 million pairs into a searchable, usable resource for model training.

If this is right

Any lab can now attempt to reproduce or extend CLIP-style training at hundreds-of-millions scale using only public data.
The supplied embeddings and kNN indices allow immediate construction of retrieval-augmented systems or nearest-neighbor baselines without recomputing features.
Downstream experiments in zero-shot classification, image generation, and captioning can start from the same large open corpus rather than from scratch.
Community members can iterate on filtering rules or add metadata while keeping the core 400 million pairs fixed as a shared reference.
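
As a rough illustration of what the precomputed embeddings make possible, the sketch below runs a brute-force cosine-similarity search in plain Python. The toy vectors, the `knn_search` helper, and the exhaustive scan are all assumptions made for illustration; the format and implementation of the released kNN indices are not described on this page and are not reproduced here.

```python
import heapq
import math


def normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def knn_search(query, embeddings, k=2):
    """Return the k (index, cosine similarity) pairs most similar to the query.

    Exhaustive scan, O(N) per query. At 400M vectors a real index would need
    an approximate structure; this only illustrates the lookup itself.
    """
    q = normalize(query)
    scored = (
        (sum(a * b for a, b in zip(q, normalize(emb))), idx)
        for idx, emb in enumerate(embeddings)
    )
    top = heapq.nlargest(k, scored)
    return [(idx, sim) for sim, idx in top]


# Toy 3-dimensional vectors standing in for precomputed CLIP embeddings.
toy_embeddings = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
neighbors = knn_search([1.0, 0.05, 0.0], toy_embeddings, k=2)
print(neighbors)
```

Because the features are already computed, a nearest-neighbor baseline of this kind needs no model inference at query time beyond embedding the query itself.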

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Releasing the raw pairs alongside embeddings lowers the barrier for groups that lack large-scale compute for feature extraction.
The dataset could serve as a fixed benchmark corpus for comparing future filtering or cleaning methods against one another.
If models trained on it generalize well, it would support arguments that scale and public web data together suffice for many multimodal capabilities.

Load-bearing premise

CLIP-based filtering of web-scraped pairs alone yields data of sufficient quality and coverage to train competitive multimodal models without extra validation or human review.

What would settle it

Train a model from scratch on LAION-400M and measure its zero-shot accuracy on standard benchmarks; performance substantially below that of models trained on comparable private datasets would indicate the filtered pairs lack adequate signal.
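
One way to picture the measurement: embed each benchmark image and one text prompt per class with the trained model, predict the class whose prompt embedding is closest by cosine similarity, and report the fraction of correct predictions. The sketch below assumes the embeddings already exist; the function names, toy vectors, and labels are hypothetical and are not results from any model.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def zero_shot_accuracy(image_embeddings, labels, class_text_embeddings):
    """Fraction of images whose most similar class prompt is the true label."""
    correct = 0
    for emb, label in zip(image_embeddings, labels):
        scores = [cosine(emb, text_emb) for text_emb in class_text_embeddings]
        predicted = max(range(len(scores)), key=scores.__getitem__)
        correct += int(predicted == label)
    return correct / len(labels)


# Hypothetical 2-dimensional embeddings: one text prompt per class,
# four images with ground-truth class indices.
class_text_embeddings = [[1.0, 0.0], [0.0, 1.0]]
image_embeddings = [[0.9, 0.1], [0.2, 0.8], [0.6, 0.4], [0.3, 0.7]]
labels = [0, 1, 1, 1]

accuracy = zero_shot_accuracy(image_embeddings, labels, class_text_embeddings)
print(f"zero-shot accuracy: {accuracy:.2f}")
```

Running the same routine on embeddings from a LAION-400M-trained model and from a reference model would give the two numbers the comparison calls for.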

Original abstract

Multi-modal language-vision models trained on hundreds of millions of image-text pairs (e.g. CLIP, DALL-E) gained a recent surge, showing remarkable capability to perform zero- or few-shot learning and transfer even in absence of per-sample labels on target image data. Despite this trend, to date there has been no publicly available datasets of sufficient scale for training such models from scratch. To address this issue, in a community effort we build and release for public LAION-400M, a dataset with CLIP-filtered 400 million image-text pairs, their CLIP embeddings and kNN indices that allow efficient similarity search.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

This is a straightforward data release paper for a 400M CLIP-filtered image-text dataset with embeddings and indices, which fills a real access gap even if the writeup is mostly descriptive.

The main thing to know is that the authors built and released LAION-400M, a public collection of 400 million image-text pairs filtered for CLIP similarity, plus the embeddings and kNN indices that let people search it efficiently. Before this, anyone wanting to train a CLIP-style model from scratch had to either scrape their own data or rely on closed resources held by a few labs. This changes that by making the scale available to more groups.

They source from Common Crawl, apply the filter, deduplicate, and ship the extras, which is the practical contribution. Releasing the precomputed embeddings and indices is especially useful because it cuts down on the compute users need to get started. The pipeline description is clear enough that someone could replicate the steps if they wanted to.

The soft spot is the limited evidence on whether the filtered data is actually good. The paper lays out the process but does not show human checks on sample quality, error rates from the CLIP filter, topic coverage, or even a small training run to see how models perform on it compared with other data. That leaves the quality claim resting on the assumption that CLIP similarity thresholding works well enough, which may or may not hold for every use case.

This is for people working on large vision-language models who need open data at this scale. A reader focused on multimodal training or dataset construction will find the release details directly useful. I would send it to peer review. The dataset itself is substantial enough that the construction choices and release artifacts deserve a referee's look for reproducibility and any legal notes.

Referee Report

1 major / 2 minor

Summary. The manuscript announces the construction and public release of LAION-400M, a dataset of 400 million image-text pairs filtered via CLIP similarity from Common Crawl data, together with the associated CLIP embeddings and kNN indices for efficient similarity search. It positions this as a community effort to provide a large-scale public resource for training multimodal models from scratch.

Significance. If the pipeline was executed as described, this release is significant because it supplies the first openly available dataset at this scale with precomputed embeddings and search indices, directly addressing the prior lack of public resources for training models such as CLIP. The provision of the full dataset, embeddings, and kNN indices is a concrete strength that enables immediate community use and reproducibility.

major comments (1)

Abstract: the claim that LAION-400M addresses the lack of public datasets 'of sufficient scale for training such models from scratch' is not supported by any quality metrics, retention rates after filtering, error analysis, or downstream validation; without these, it is difficult to assess whether the CLIP-filtered pairs meet the implied standard of usability.

minor comments (2)

The manuscript would benefit from an explicit statement of the exact CLIP similarity threshold and any deduplication parameters used, even if only in a methods paragraph, to allow readers to understand the precise construction choices.
Consider adding a short related-work paragraph referencing prior open image-text datasets (e.g., Conceptual Captions, WIT) to better situate the scale and filtering approach.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the positive evaluation of the work's significance and the recommendation for minor revision. We address the major comment below and have incorporated revisions to provide additional supporting details on the dataset.

Referee: Abstract: the claim that LAION-400M addresses the lack of public datasets 'of sufficient scale for training such models from scratch' is not supported by any quality metrics, retention rates after filtering, error analysis, or downstream validation; without these, it is difficult to assess whether the CLIP-filtered pairs meet the implied standard of usability.

Authors: We agree that the original abstract claim would benefit from additional context to allow readers to assess usability. The manuscript's core contribution is the public release of the 400M-pair dataset, embeddings, and indices together with the reproducible pipeline; at the time of writing, no comparable public resource existed at this scale. To directly address the concern, the revised manuscript adds a new section on dataset statistics. This includes the retention rate after CLIP filtering (pairs retained at cosine similarity > 0.3 from a larger Common Crawl crawl), the distribution of similarity scores, and a brief error analysis via manual review of random samples. We also cite early downstream uses in which models trained from scratch on LAION-400M have achieved competitive zero-shot performance, providing external validation of practical utility. The abstract has been lightly revised to emphasize the release and reproducibility aspects while retaining the scale claim now supported by these additions.

Revision: yes
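
The two statistics named in the response, a retention rate and a distribution of similarity scores, are simple to compute once per-pair scores are available. The sketch below is illustrative only: the scores are made-up toy values, the helper names are hypothetical, and keeping scores at or above 0.3 (rather than strictly above) is an assumption about how the boundary is handled.

```python
import math


def retention_rate(similarities, threshold=0.3):
    """Fraction of candidate pairs whose score is at or above the threshold."""
    kept = sum(1 for s in similarities if s >= threshold)
    return kept / len(similarities)


def similarity_histogram(similarities, bin_width=0.1):
    """Count scores per bin; key i covers [i * bin_width, (i + 1) * bin_width)."""
    counts = {}
    for s in similarities:
        # round() guards against float noise such as 0.3 / 0.1 == 2.999...
        bin_index = math.floor(round(s / bin_width, 9))
        counts[bin_index] = counts.get(bin_index, 0) + 1
    return dict(sorted(counts.items()))


# Made-up cosine similarities for eight candidate pairs.
toy_scores = [0.12, 0.25, 0.31, 0.35, 0.42, 0.28, 0.05, 0.33]

rate = retention_rate(toy_scores)
histogram = similarity_histogram(toy_scores)

print(f"retained: {rate:.0%}")
for bin_index, count in histogram.items():
    print(f"[{bin_index * 0.1:.1f}, {(bin_index + 1) * 0.1:.1f}): {count}")
```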

Circularity Check

0 steps flagged

No significant circularity

The manuscript is a data construction and release announcement. It describes sourcing image-text pairs from Common Crawl, applying CLIP similarity filtering, deduplication, and distributing the resulting 400M-pair dataset together with embeddings and kNN indices. No equations, fitted parameters, predictions, or derivations appear anywhere in the text. The central claim is the factual existence and public availability of the artifacts produced by the described pipeline; this claim does not reduce to any self-referential input or self-citation chain. All steps are externally verifiable by inspecting the released data and code.
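
The deduplication step in that pipeline is not specified on this page. As a minimal sketch under a stated assumption, the example below treats two records as duplicates when they share the same image URL and the same caption after trimming whitespace and lowercasing; the key choice, the `deduplicate` helper, and the example URLs are hypothetical and may differ from what the authors did.

```python
def deduplicate(pairs):
    """Keep the first occurrence of each (url, caption) pair, preserving order.

    Assumption: a duplicate is an exact URL match plus a caption match after
    whitespace trimming and lowercasing. Near-duplicate images are not handled.
    """
    seen = set()
    unique = []
    for url, caption in pairs:
        key = (url.strip(), caption.strip().lower())
        if key in seen:
            continue
        seen.add(key)
        unique.append((url, caption))
    return unique


# Toy records; the third repeats the first up to case and trailing space.
toy_pairs = [
    ("https://example.com/cat.jpg", "A cat on a sofa"),
    ("https://example.com/dog.jpg", "A dog in a park"),
    ("https://example.com/cat.jpg", "a cat on a sofa "),
    ("https://example.com/cat.jpg", "Tabby kitten"),
]
unique_pairs = deduplicate(toy_pairs)
print(len(toy_pairs), "->", len(unique_pairs))
```

An in-memory set is fine for a toy list; at web scale a memory-bounded structure would be needed, which this sketch does not attempt.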

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Dataset release paper with no mathematical derivations; relies on the pre-existing CLIP model for filtering and on web-scraped data whose collection details are not specified in the abstract.

pith-pipeline@v0.9.0 · 5447 in / 1116 out tokens · 61972 ms · 2026-05-12T10:14:08.251672+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Cost.FunctionalEquation washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "in a community effort we build and release for public LAION-400M, a dataset with CLIP-filtered 400 million image-text pairs, their CLIP embeddings and kNN indices that allow efficient similarity search"

Foundation.PhiForcing phi_equation
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "We use CLIP to compute embeddings of the image and alt-text. Then we compute the cosine similarity of both embeddings and drop all samples with cosine similarity below 0.3"
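
The filtering rule quoted in the passage above can be sketched in a few lines. The example assumes the image and alt-text embeddings have already been computed by CLIP; the record layout, the `filter_pairs` helper, and the toy vectors are hypothetical, and only the 0.3 threshold comes from the quoted text.

```python
import math

SIMILARITY_THRESHOLD = 0.3  # value quoted in the paper passage


def cosine_similarity(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def filter_pairs(samples, threshold=SIMILARITY_THRESHOLD):
    """Drop samples whose image/text cosine similarity is below the threshold."""
    return [
        s for s in samples
        if cosine_similarity(s["image_emb"], s["text_emb"]) >= threshold
    ]


# Toy 2-dimensional vectors standing in for CLIP image and text embeddings.
toy_samples = [
    {"caption": "a", "image_emb": [1.0, 0.0], "text_emb": [0.8, 0.6]},    # cos = 0.8
    {"caption": "b", "image_emb": [1.0, 0.0], "text_emb": [0.1, 0.995]},  # cos ~ 0.1
    {"caption": "c", "image_emb": [0.0, 1.0], "text_emb": [0.5, 0.5]},    # cos ~ 0.71
]
kept = filter_pairs(toy_samples)
print([s["caption"] for s in kept])
```

The threshold is kept as a named parameter so that a stricter or looser cutoff can be tried against the same candidate pool.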

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 60 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

CiteVQA: Benchmarking Evidence Attribution for Trustworthy Document Intelligence
cs.CL 2026-05 accept novelty 8.0

CiteVQA requires models to cite specific document regions with bounding boxes alongside answers and finds that even the strongest MLLMs frequently cite the wrong region, with top SAA scores of only 76.0 for closed mod...
An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion
cs.CV 2022-08 unverdicted novelty 8.0

Textual Inversion learns a single embedding vector from a few images to represent personal concepts inside the text embedding space of a frozen text-to-image model, enabling their composition in natural language prompts.
PluRule: A Benchmark for Moderating Pluralistic Communities on Social Media
cs.CL 2026-05 unverdicted novelty 7.0

PluRule is a new multimodal multilingual benchmark showing that state-of-the-art vision-language models perform only marginally better than a trivial baseline at detecting specific rule violations in pluralistic onlin...
Watch Your Step: Information Injection in Diffusion Models via Shadow Timestep Embedding
cs.LG 2026-05 unverdicted novelty 7.0

Timestep embeddings in diffusion models function as a separable side channel that can carry dedicated information for adversarial injection or detection.
VeraRetouch: A Lightweight Fully Differentiable Framework for Multi-Task Reasoning Photo Retouching
cs.CV 2026-04 unverdicted novelty 7.0

VeraRetouch is a 0.5B VLM-based framework with a differentiable Retouch Renderer and a new million-scale AetherRetouch-1M+ dataset that claims state-of-the-art results in reasoning photo retouching while enabling mobi...
EmbodiedMidtrain: Bridging the Gap between Vision-Language Models and Vision-Language-Action Models via Mid-training
cs.CV 2026-04 unverdicted novelty 7.0

EmbodiedMidtrain mid-trains VLMs on curated VLA-aligned data subsets to improve downstream performance on robot manipulation benchmarks.
DifFoundMAD: Foundation Models meet Differential Morphing Attack Detection
cs.CV 2026-04 unverdicted novelty 7.0

DifFoundMAD improves differential morphing attack detection by replacing traditional embeddings with those from vision foundation models and applying class-balanced lightweight fine-tuning, cutting high-security error...
InstAP: Instance-Aware Vision-Language Pre-Train for Spatial-Temporal Understanding
cs.CV 2026-04 unverdicted novelty 7.0

InstAP introduces instance-aware pre-training with a new dual-granularity dataset InstVL that improves both fine-grained instance retrieval and global video understanding over standard VLP baselines.
PromptEvolver: Prompt Inversion through Evolutionary Optimization in Natural-Language Space
cs.LG 2026-04 unverdicted novelty 7.0

PromptEvolver recovers high-fidelity natural language prompts for given images by evolving them via genetic algorithm guided by a vision-language model, outperforming prior methods on benchmarks.
Distance Comparison Operations Are Not Silver Bullets in Vector Similarity Search: A Benchmark Study on Their Merits and Limits
cs.DB 2026-04 accept novelty 7.0

Benchmark study shows DCO methods for vector similarity search are not reliable silver bullets due to high sensitivity to data properties and hardware, making them unsuitable for production deployment.
Mind the Generative Details: Direct Localized Detail Preference Optimization for Video Diffusion Models
cs.CV 2026-01 unverdicted novelty 7.0

LocalDPO creates localized preference pairs from real videos by applying random spatio-temporal masks and restoring masked regions with the frozen base model, then applies region-restricted DPO loss to improve fidelit...
Mind the Generative Details: Direct Localized Detail Preference Optimization for Video Diffusion Models
cs.CV 2026-01 unverdicted novelty 7.0

LocalDPO aligns text-to-video diffusion models with human preferences at the spatio-temporal region level by automatically generating localized preference pairs from corrupted real videos and applying a region-aware DPO loss.
Towards Multimodal Active Learning: Efficient Learning with Limited Paired Data
cs.LG 2025-09 unverdicted novelty 7.0

Introduces the first active learning framework for unaligned multimodal data that selects alignments using uncertainty and diversity to cut annotation costs by up to 40% on benchmarks while preserving accuracy.
LLaVA-NeXT-Interleave: Tackling Multi-image, Video, and 3D in Large Multimodal Models
cs.CV 2024-07 unverdicted novelty 7.0

LLaVA-NeXT-Interleave unifies multi-image, video, and 3D capabilities in large multimodal models via a new 1.18M-sample interleaved dataset and benchmark, achieving leading results across those tasks while preserving ...
ELLA: Equip Diffusion Models with LLM for Enhanced Semantic Alignment
cs.CV 2024-03 unverdicted novelty 7.0

ELLA introduces a timestep-aware semantic connector to link LLMs with diffusion models for improved dense prompt following, validated on a new 1K-prompt benchmark.
Learning Interactive Real-World Simulators
cs.AI 2023-10 conditional novelty 7.0

UniSim learns a universal real-world simulator from orchestrated diverse datasets, enabling zero-shot deployment of policies trained purely in simulation.
LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attention
cs.CV 2023-03 conditional novelty 7.0

LLaMA-Adapter turns frozen LLaMA 7B into a capable instruction follower using only 1.2M new parameters and zero-init attention, matching Alpaca while extending to image-conditioned reasoning on ScienceQA and COCO.
Language Is Not All You Need: Aligning Perception with Language Models
cs.CL 2023-02 conditional novelty 7.0

Kosmos-1 shows strong zero-shot and few-shot results on language tasks, image captioning, visual QA, OCR-free document understanding, and image recognition guided by text instructions.
BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models
cs.CV 2023-01 unverdicted novelty 7.0

BLIP-2 bootstraps vision-language pre-training from frozen image encoders and LLMs via a lightweight two-stage Querying Transformer, delivering SOTA results with 54x fewer trainable parameters than Flamingo80B on zero...
LAION-5B: An open large-scale dataset for training next generation image-text models
cs.CV 2022-10 accept novelty 7.0

LAION-5B is an openly released dataset of 5.85 billion CLIP-filtered image-text pairs that enables replication of foundational vision-language models.
Phenaki: Variable Length Video Generation From Open Domain Textual Description
cs.CV 2022-10 unverdicted novelty 7.0

Phenaki generates arbitrary-length videos from sequences of text prompts by tokenizing videos with causal temporal attention and generating tokens with a text-conditioned masked transformer, trained jointly on images ...
Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding
cs.CV 2022-05 accept novelty 7.0

Imagen achieves state-of-the-art photorealistic text-to-image generation by scaling a text-only pretrained T5 language model within a diffusion framework, reaching FID 7.27 on COCO without training on it.
Flamingo: a Visual Language Model for Few-Shot Learning
cs.CV 2022-04 unverdicted novelty 7.0

Flamingo models reach new state-of-the-art few-shot results on image and video tasks by bridging frozen vision and language models with cross-attention layers trained on interleaved web-scale data.
Broken Memories: Detecting and Mitigating Memorization in Diffusion Models with Degraded Generations
cs.CV 2026-05 unverdicted novelty 6.0

Memorization in diffusion models is detected via latent update norm instability and mitigated on-the-fly, yielding AUC over 0.999 and zero memorization rate on Stable Diffusion 1.4.
CPC-VAR:Continual Personalized and Compositional Generation in Visual Autoregressive Models
cs.CV 2026-05 unverdicted novelty 6.0

CPC-VAR adds Gradient-based Concept Neuron Selection for continual single-concept learning and a context-aware multi-branch composition strategy to reduce forgetting and entanglement in VAR-based personalized image ge...
PipeANN-Filter: An Efficient Filtered Vector Search System on SSD
cs.OS 2026-05 unverdicted novelty 6.0

PipeANN-Filter improves filtered vector search latency and throughput on SSD by exploring a superset of valid vectors identified via probabilistic filters and verifying attributes only after selecting top-k candidates.
What Matters for Diffusion-Friendly Latent Manifold? Prior-Aligned Autoencoders for Latent Diffusion
cs.CV 2026-05 unverdicted novelty 6.0

Prior-Aligned AutoEncoders shape latent manifolds with spatial coherence, local continuity, and global semantics to improve latent diffusion, achieving SOTA gFID 1.03 on ImageNet 256x256 with up to 13x faster convergence.
Probing CLIP's Comprehension of 360-Degree Textual and Visual Semantics
cs.CV 2026-04 conditional novelty 6.0

CLIP models understand 360-degree textual semantics via explicit identifiers but show limited comprehension of visual semantics under horizontal circular shifts, which a LoRA fine-tuning approach improves with a noted...
Volume Transformer: Revisiting Vanilla Transformers for 3D Scene Understanding
cs.CV 2026-04 unverdicted novelty 6.0

A minimally modified vanilla Transformer called Volt achieves state-of-the-art 3D semantic and instance segmentation by using volumetric tokens, 3D rotary embeddings, and a data-efficient training recipe that scales b...
Back into Plato's Cave: Examining Cross-modal Representational Convergence at Scale
cs.CV 2026-04 unverdicted novelty 6.0

Evidence for cross-modal representational convergence weakens substantially at scale and in realistic many-to-many settings, indicating models learn rich but distinct representations.
CLIP-RD: Relative Distillation for Efficient CLIP Knowledge Distillation
cs.CV 2026-03 unverdicted novelty 6.0

CLIP-RD adds VRD for cross-modality distillation consistency and XRD for bidirectional cross-modal symmetry to align student embedding geometry more closely with the teacher, yielding a 0.8 percentage point gain over ...
Vision Transformers Need More Than Registers
cs.CV 2026-02 unverdicted novelty 6.0

ViTs exhibit lazy aggregation by relying on irrelevant background patches for global semantics, and selectively integrating patch features into the CLS token reduces this effect and improves results across label-, tex...
Evaluating Concept Filtering Defenses against Child Sexual Abuse Material Generation by Text-to-Image Models
cs.CR 2025-12 unverdicted novelty 6.0

Concept filtering of child images from training data offers only limited protection against CSAM generation in text-to-image models, as prompting strategies and fine-tuning can bypass filters even when most child imag...
SkyReels-Text: Fine-Grained Font-Controllable Text Editing for Poster Design
cs.CV 2025-11 unverdicted novelty 6.0

SkyReels-Text enables simultaneous fine-grained editing of multiple text regions in posters using arbitrary glyph patches for font control without labels or test-time fine-tuning.
The Persistence of Cultural Memory: Investigating Multimodal Iconicity in Diffusion Models
cs.CV 2025-11 unverdicted novelty 6.0

Diffusion models show distinct patterns of recognizing versus replicating culturally iconic references, with recognition linked to data frequency, textual uniqueness, popularity, and creation date rather than simple copying.
DeepSeek-OCR: Contexts Optical Compression
cs.CV 2025-10 unverdicted novelty 6.0

DeepSeek-OCR compresses text contexts up to 20x via 2D optical mapping while achieving 97% OCR accuracy below 10x and 60% at 20x, outperforming prior OCR tools with fewer vision tokens.
CPAM: Context-Preserving Adaptive Manipulation for Zero-Shot Real Image Editing
cs.CV 2025-06 unverdicted novelty 6.0

CPAM proposes a context-preserving adaptive manipulation method for zero-shot real image editing in diffusion models via preservation adaptation and localized extraction modules, outperforming prior techniques on a ne...
OpenVLA: An Open-Source Vision-Language-Action Model
cs.RO 2024-06 unverdicted novelty 6.0

OpenVLA achieves 16.5% higher task success than the 55B RT-2-X model across 29 tasks with 7x fewer parameters while enabling effective fine-tuning and quantization without performance loss.
BLINK: Multimodal Large Language Models Can See but Not Perceive
cs.CV 2024-04 accept novelty 6.0

BLINK benchmark shows multimodal LLMs reach only 45-51 percent accuracy on core visual perception tasks where humans achieve 95 percent, indicating these abilities have not emerged.
ShareGPT4V: Improving Large Multi-Modal Models with Better Captions
cs.CV 2023-11 conditional novelty 6.0

A new 1.2M-caption dataset generated via GPT-4V improves LMMs on MME and MMBench by 222.8/22.0/22.3 and 2.7/1.3/1.5 points respectively when used for supervised fine-tuning.
SalUn: Empowering Machine Unlearning via Gradient-based Weight Saliency in Both Image Classification and Generation
cs.LG 2023-10 conditional novelty 6.0

SalUn uses gradient-based weight saliency to achieve effective machine unlearning of data, classes, or concepts in image classification and generation, narrowing the gap to exact retraining.
Demystifying CLIP Data
cs.CV 2023-09 accept novelty 6.0

MetaCLIP curates balanced 400M-pair subsets from CommonCrawl that outperform CLIP data, reaching 70.8% zero-shot ImageNet accuracy on ViT-B versus CLIP's 68.3%.
Otter: A Multi-Modal Model with In-Context Instruction Tuning
cs.CV 2023-05 unverdicted novelty 6.0

Otter is a multi-modal model instruction-tuned on the MIMIC-IT dataset of over 3 million in-context instruction-response pairs to improve convergence and generalization on tasks with multiple images and videos.
mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality
cs.CL 2023-04 unverdicted novelty 6.0

mPLUG-Owl introduces a two-stage modular training paradigm that aligns images with text in LLMs via frozen visual modules followed by LoRA fine-tuning, achieving strong multimodal instruction following.
EVA-CLIP: Improved Training Techniques for CLIP at Scale
cs.CV 2023-03 conditional novelty 6.0

EVA-CLIP delivers improved CLIP training recipes that yield 82.0% zero-shot ImageNet-1K accuracy for a 5B-parameter model after only 9 billion samples.
SemDeDup: Data-efficient learning at web-scale through semantic deduplication
cs.LG 2023-03 unverdicted novelty 6.0

SemDeDup removes semantic duplicates from datasets like LAION using pre-trained embeddings, cutting data by 50% with minimal performance loss and efficiency gains on C4.
Aligning Text-to-Image Models using Human Feedback
cs.LG 2023-02 unverdicted novelty 6.0

A three-stage fine-tuning process uses human ratings to train a reward model and then improves text-to-image alignment by maximizing reward-weighted likelihood.
Scaling Autoregressive Models for Content-Rich Text-to-Image Generation
cs.CV 2022-06 unverdicted novelty 6.0

Scaling an autoregressive Transformer to 20B parameters for text-to-image generation using image token sequences achieves new SOTA zero-shot FID of 7.23 and fine-tuned FID of 3.22 on MS-COCO.
Broken Memories: Detecting and Mitigating Memorization in Diffusion Models with Degraded Generations
cs.CV 2026-05 unverdicted novelty 5.0

Proposes stability regions based on latent update norms to detect and mitigate memorization in diffusion models, reporting AUC over 0.999 and zero memorization rate after mitigation on Stable Diffusion 1.4.
MONET: A Massive, Open, Non-redundant and Enriched Text-to-image dataset
cs.CV 2026-05 unverdicted novelty 5.0

MONET is an open 104.9M image-text pair dataset created via safety filtering, deduplication, and multi-VLM recaptioning from 2.9B raw pairs, validated by training a competitive 4B-parameter latent diffusion model.
PixVerve: Advancing Native UHR Image Generation to 100MP with a Large-Scale High-Quality Dataset
cs.CV 2026-05 unverdicted novelty 5.0

PixVerve introduces a 95K ultra-high-resolution image-text dataset and training strategies that enable native 100-megapixel text-to-image generation together with a new evaluation benchmark.
VeraRetouch: A Lightweight Fully Differentiable Framework for Multi-Task Reasoning Photo Retouching
cs.CV 2026-04 unverdicted novelty 5.0

VeraRetouch is a lightweight fully differentiable framework using a 0.5B VLM for retouching plans and a custom renderer for end-to-end training, backed by a new million-scale dataset and RL post-training, to achieve S...
DiffMagicFace: Identity Consistent Facial Editing of Real Videos
cs.CV 2026-04 unverdicted novelty 5.0

DiffMagicFace uses concurrent fine-tuned text and image diffusion models plus a rendered multi-view dataset to achieve identity-consistent text-conditioned editing of real facial videos.
Dynamic Eraser for Guided Concept Erasure in Diffusion Models
cs.CV 2026-04 unverdicted novelty 5.0

DSS is a lightweight inference-time framework that erases concepts in diffusion models at 91% average rate while preserving image fidelity, outperforming prior methods.
Measurement of Generative AI Workload Power Profiles for Whole-Facility Data Center Infrastructure Planning
eess.SY 2026-04 unverdicted novelty 5.0

High-resolution power profiles for AI workloads on H100 GPUs are measured and scaled to whole-facility energy demand using a bottom-up model, with the dataset made public.
Long Story Short: Disentangling Compositionality and Long-Caption Understanding in Contrastive VLMs
cs.CV 2025-09 unverdicted novelty 5.0

Empirical study shows bidirectional but sensitive relationship between compositionality and long-caption understanding in VLMs, promoted by high-quality grounded data and affected by architectural choices like frozen ...
A Careful Examination of Large Behavior Models for Multitask Dexterous Manipulation
cs.RO 2025-07 accept novelty 5.0

Multi-task pretraining of diffusion policies on diverse robot data produces more successful, robust, and data-efficient policies for dexterous manipulation than single-task baselines, with performance scaling with pre...
Preserve and Personalize: Personalized Text-to-Image Diffusion Models without Distributional Drift
cs.CV 2025-05 unverdicted novelty 5.0

Proposes Lipschitz regularization during fine-tuning to prevent distributional drift in personalized diffusion models, improving subject fidelity and prompt adherence.
Not All Rollouts are Useful: Down-Sampling Rollouts in LLM Reinforcement Learning
cs.LG 2025-04 unverdicted novelty 5.0

PODS applies max-variance down-sampling to GRPO rollouts in LLM RLVR, delivering at least 1.7x faster training to peak test accuracy on reasoning benchmarks.
Wan: Open and Advanced Large-Scale Video Generative Models
cs.CV 2025-03 unverdicted novelty 5.0

Wan releases open 1.3B and 14B video diffusion models claiming superior performance over open-source and commercial baselines across multiple tasks with consumer-grade efficiency.

Reference graph

Works this paper leans on

12 extracted references · 12 canonical work pages · cited by 77 Pith papers · 6 internal anchors

[1]

Learning Transferable Visual Models From Natural Language Supervision

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning Transferable Visual Models From Natural Language Supervision. arXiv e-prints, page arXiv:2103.00020, February 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021
[2]

Zero-Shot Text-to-Image Generation

Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, and Ilya Sutskever. Zero-Shot Text-to-Image Generation. arXiv e-prints, page arXiv:2102.12092, February 2021

work page internal anchor Pith review arXiv 2021
[3]

Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision

Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc V. Le, Yunhsuan Sung, Zhen Li, and Tom Duerig. Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision. arXiv e-prints, page arXiv:2102.05918, February 2021

work page arXiv 2021
[4]

One Epoch Is All You Need

Aran Komatsuzaki. One Epoch Is All You Need. arXiv e-prints, page arXiv:1906.06669, Jun 2019

work page Pith review arXiv 1906
[5]

Scaling Laws for Neural Language Models

Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling Laws for Neural Language Models. arXiv e-prints, page arXiv:2001.08361, Jan 2020

work page internal anchor Pith review Pith/arXiv arXiv 2001
[6]

Scaling Laws for Autoregressive Generative Modeling

Tom Henighan, Jared Kaplan, Mor Katz, Mark Chen, Christopher Hesse, Jacob Jackson, Heewoo Jun, Tom B. Brown, Prafulla Dhariwal, Scott Gray, Chris Hallacy, Benjamin Mann, Alec Radford, Aditya Ramesh, Nick Ryder, Daniel M. Ziegler, John Schulman, Dario Amodei, and Sam McCandlish. Scaling Laws for Autoregressive Generative Modeling. arXiv e-prints, page arXi...

work page internal anchor Pith review arXiv 2010
[7]

Big transfer (bit): General visual representation learning

Alexander Kolesnikov, Lucas Beyer, Xiaohua Zhai, Joan Puigcerver, Jessica Yung, Sylvain Gelly, and Neil Houlsby. Big transfer (bit): General visual representation learning. In Andrea Vedaldi, Horst Bischof, Thomas Brox, and Jan-Michael Frahm, editors, Computer Vision – ECCV 2020, pages 491–507, Cham, 2020. Springer International Publishing

work page 2020
[8]

Scaling vision transformers, 2021

Xiaohua Zhai, Alexander Kolesnikov, Neil Houlsby, and Lucas Beyer. Scaling vision transformers. arXiv preprint arXiv:2106.04560, 2021

work page arXiv 2021
[9]

Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwi...

work page internal anchor Pith review Pith/arXiv arXiv 2005
[10]

The Pile: An 800GB Dataset of Diverse Text for Language Modeling

Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, Charles Foster, Jason Phang, Horace He, Anish Thite, Noa Nabeshima, Shawn Presser, and Connor Leahy. The Pile: An 800GB Dataset of Diverse Text for Language Modeling. arXiv e-prints, page arXiv:2101.00027, December 2020

work page internal anchor Pith review Pith/arXiv arXiv 2020
[11]

Dall-e in pytorch: A text to image transformer, 2021

Phil Wang. Dall-e in pytorch: A text to image transformer, 2021

work page 2021
[12]

Taming transformers for high-resolution image synthesis, 2020

Patrick Esser, Robin Rombach, and Björn Ommer. Taming transformers for high-resolution image synthesis, 2020

work page 2020
