The Journey, Not the Destination: How Data Guides Diffusion Models

Aleksander Madry; Hadi Salman; Joshua Vendrow; Kristian Georgiev; Sung Min Park

arxiv: 2312.06205 · v1 · pith:E4Z4YTHRnew · submitted 2023-12-11 · 💻 cs.CV · cs.LG


Kristian Georgiev, Joshua Vendrow, Hadi Salman, Sung Min Park, Aleksander Madry


keywords diffusion, models, attributions, trained, data, images, method, training


Diffusion models trained on large datasets can synthesize photo-realistic images of remarkable quality and diversity. However, attributing these images back to the training data (that is, identifying specific training examples which caused an image to be generated) remains a challenge. In this paper, we propose a framework that (i) provides a formal notion of data attribution in the context of diffusion models, and (ii) allows us to counterfactually validate such attributions. Then, we provide a method for computing these attributions efficiently. Finally, we apply our method to find (and evaluate) such attributions for denoising diffusion probabilistic models trained on CIFAR-10 and latent diffusion models trained on MS COCO. We provide code at https://github.com/MadryLab/journey-TRAK.



Forward citations

Cited by 4 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

What's a Credit Worth? A Market Framework for Attribution-Aware Compensation in Generative Music
cs.CY 2026-07 conditional novelty 7.0

Proposes an attribution-aware compensation framework for generative music that derives closed-form payments from catalog-level attribution informativeness and quantifies welfare effects under competition.
Variance Reduction for Expectations with Diffusion Teachers
cs.LG 2026-05 unverdicted novelty 6.0

CARV amortizes upstream diffusion teacher costs over noise resamples with timestep importance sampling and stratified-inverse-CDF sampling, delivering 2-3x effective compute gains in text-to-3D experiments and order-o...
DMin: Scalable Training Data Influence Estimation for Diffusion Models
cs.CV 2024-12 unverdicted novelty 6.0

DMin uses gradient compression to scalably estimate training data influence in billion-parameter diffusion models.
Variance Reduction for Expectations with Diffusion Teachers
cs.LG 2026-05 unverdicted novelty 5.0

CARV introduces a hierarchical Monte Carlo estimator with amortized reuse, importance sampling, and stratification that yields 2-3x effective compute gains on diffusion-teacher pipelines while cutting gradient varianc...