pith. machine review for the scientific record.

arxiv: 2605.07078 · v1 · submitted 2026-05-08 · 💻 cs.LG

Recognition: 2 theorem links · Lean Theorem

Test-Time Compositional Generalization in Diffusion Models via Concept Discovery

Anant Gupta, Christopher J. MacLellan, Tianyi Zhu, Zekun Wang

Authors on Pith: no claims yet

Pith reviewed 2026-05-11 01:38 UTC · model grok-4.3

classification: 💻 cs.LG
keywords: diffusion models · compositional generalization · test-time adaptation · concept discovery · score function · product of experts · density modes

The pith

Pretrained diffusion models can discover reusable density modes from their time-indexed score functions and compose them at test time to generate novel concept combinations from a single out-of-distribution query.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that a diffusion model's learned score geometry for noisy data distributions at different timesteps encodes local density modes that act as reusable concepts. For any new query, the method recovers these modes through gradient ascent on the score, projects them into clean data space as Gaussians, selects a relevant subset via submodular optimization, and combines them into a product-of-experts model whose analytic score supports direct sampling. This matters for a sympathetic reader because it removes the need for a hand-curated concept library or retraining on all possible combinations, letting the model generalize compositionally on benchmarks such as ColorMNIST and CelebA, where it beats query-only and nearest-class baselines. The approach works either by using the composed score directly through classifier-free guidance or by distilling it into a low-rank adapter plus a new class embedding.

Core claim

The central claim is that the time-indexed score function s_θ(x_t, t) of a pretrained diffusion model contains local density modes corresponding to meaningful concepts. Gradient ascent on this score at multiple noise levels recovers those modes; they are then mapped to clean-space Gaussians, greedily selected with a submodular likelihood objective, and fused into a product-of-experts teacher whose closed-form score can be sampled directly or used to fine-tune a lightweight adapter, enabling compositional generation on held-out queries without any predefined concept library.
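
As a worked rendering of the discovery step, the fixed points of the iteration below are exactly the local modes the claim refers to; the step size η and the stopping rule are assumptions, since neither is specified in this summary:

    x_t^{(k+1)} = x_t^{(k)} + \eta \, s_\theta\big(x_t^{(k)}, t\big),
    \qquad
    s_\theta(x_t, t) \approx \nabla_{x_t} \log p_t(x_t).

At a fixed point s_θ ≈ 0, so the iterate sits at a stationary point of log p_t; running the ascent from several initializations near the noised query is what lets multiple modes be recovered per timestep.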

What carries the argument

Gradient ascent on the time-indexed score function s_θ(x_t, t) ≈ ∇_{x_t} log p_t(x_t) to recover local density modes at multiple timesteps, followed by Gaussian mapping, submodular prototype selection, and product-of-experts composition with an analytic score.
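
A minimal PyTorch sketch of that machinery, assuming a pretrained score network callable as score_model(x, t); the timesteps, number of ascent chains, iteration count, and step size are illustrative choices, not the paper's settings:

    import torch

    def ascend_modes(score_model, x_query, timesteps=(50, 200, 500),
                     n_starts=16, n_steps=200, eta=0.05):
        # Recover candidate local modes of the noisy marginals p_t by gradient
        # ascent on s_theta(x_t, t) ~ grad_x log p_t(x_t). x_query: [1, C, H, W].
        modes = []
        for t in timesteps:
            # Several ascent chains, initialized with jitter around the query.
            x = x_query.repeat(n_starts, 1, 1, 1)
            x = x + 0.1 * torch.randn_like(x)
            t_batch = torch.full((n_starts,), t, dtype=torch.long)
            with torch.no_grad():
                for _ in range(n_steps):
                    x = x + eta * score_model(x, t_batch)  # ascend the density
            modes.append((t, x))
        return modes

The Gaussian mapping, submodular selection, and product-of-experts composition then operate on these converged iterates; sketches for those stages appear below.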

If this is right

  • The analytic product-of-experts score can be sampled directly via classifier-free guidance without further training (a minimal sketch follows this list).
  • The discovered modes can be distilled into a new class embedding plus low-rank adapter that improves performance on the target query.
  • Performance exceeds both a query-only baseline and the nearest trained class on held-out ColorMNIST and CelebA composition tasks.
  • No external concept library or predefined conditioning signals are required for the test-time process.
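
For the first bullet, the "analytic score" of a product of Gaussian experts is a closed-form sum, so any score-based sampler can consume it. The Langevin loop below is a stand-in for the paper's classifier-free-guidance sampler, and the isotropic covariances and unit weights are assumptions:

    import torch

    def poe_score(x, mus, sigmas, weights=None):
        # Score of p(x) proportional to prod_i N(x; mu_i, sigma_i^2 I)^{w_i}:
        # sum_i w_i * (mu_i - x) / sigma_i^2.
        weights = weights or [1.0] * len(mus)
        s = torch.zeros_like(x)
        for mu, sigma, w in zip(mus, sigmas, weights):
            s = s + w * (mu - x) / sigma ** 2
        return s

    def langevin_sample(x_init, mus, sigmas, n_steps=500, step=1e-2):
        # Unadjusted Langevin dynamics on the analytic PoE score; the paper
        # instead samples via classifier-free guidance.
        x = x_init.clone()
        for _ in range(n_steps):
            x = x + step * poe_score(x, mus, sigmas)
            x = x + (2 * step) ** 0.5 * torch.randn_like(x)
        return x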

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If score geometry reliably encodes concepts, the same ascent-and-compose procedure could be tested on other score-based or flow-based generative models.
  • The method might reduce the data needed for fine-tuning when new attribute combinations appear, by reusing modes already latent in the pretrained model.
  • Submodular selection could be replaced or augmented with other relevance criteria to handle cases where modes overlap or conflict.
  • Success on simple benchmarks raises the question of whether the same density-mode recovery scales to higher-resolution natural images without additional regularization.

Load-bearing premise

The local density modes found by ascending the score at different noise levels are meaningful, query-relevant concepts that can be mapped to Gaussians and combined without introducing artifacts or omitting key elements.
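
The "combined without introducing artifacts" half of the premise leans on a standard identity: a product of Gaussian experts is itself Gaussian with precision-weighted parameters, so the composition is exact rather than approximate (stated here assuming invertible covariances):

    \prod_{i=1}^{K} \mathcal{N}(x;\, \mu_i, \Sigma_i) \;\propto\; \mathcal{N}(x;\, \mu_*, \Sigma_*),
    \qquad
    \Sigma_*^{-1} = \sum_{i=1}^{K} \Sigma_i^{-1},
    \qquad
    \mu_* = \Sigma_* \sum_{i=1}^{K} \Sigma_i^{-1} \mu_i.

The composed score is then \sum_i \Sigma_i^{-1}(\mu_i - x). What the identity cannot guarantee is the premise itself: that the fitted Gaussians are faithful to the modes, and that their high-density intersection contains the intended combination rather than an artifact.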

What would settle it

A controlled test on a new composition benchmark where the product-of-experts samples or the adapted model consistently fail to produce the intended novel attribute combinations while still matching the query elements, or where the recovered modes do not correspond to human-interpretable factors.

Figures

Figures reproduced from arXiv: 2605.07078 by Anant Gupta, Christopher J. MacLellan, Tianyi Zhu, Zekun Wang.

Figure 1: Overview of our DDPM-based concept discovery framework for compositional generalization.
Figure 2: Found local modes in pretrained DDPMs on LSUN Churches and CelebA-HQ. Empirically, we observe a related hierarchy in the modes of the learned noisy marginals. As the noise level increases, fine instance-level modes progressively merge into coarser concept prototypes, from which clear object-level classes emerge. See Appendix A. Taken together, these results motivate treating the modes of p_t at intermediate…
Figure 3: Examples of found prototypes, given an OOD query of unseen compositions, and generated…
Figure 4: Modes of the noisy marginals p_t(x_t) learned by a DDPM on Fashion-MNIST [54]. The t = 0 panel shows clean reference images, while larger t panels show prototypes recovered by mode ascent at progressively noisier marginals. As t increases, fine instance details are smoothed away and modes consolidate into coarser object-level prototypes, suggesting that diffusion marginals encode an implicit hierarchy of dis…
Figure 5: Additional qualitative results for ColorMNIST and CelebA.
Figure 6: Concept discovery on novel background primitives, pink digit fixed from ColorMNIST.
Original abstract

Compositional generalization requires models to produce novel configurations from familiar parts. In diffusion models, prior compositional generation methods typically assume that the relevant concepts or conditioning signals are already available. We instead ask whether a pretrained diffusion model can discover query-specific concepts from the time-indexed scores it learns for the noisy marginals $p_t(x_t)$ and compose them at test time. Given a single out-of-distribution query, our method performs gradient ascent on $s_\theta(x_t,t) \approx \nabla_{x_t}\log p_t(x_t)$ at multiple noising timesteps to recover local density modes, maps these modes into clean-space Gaussians, greedily selects relevant prototypes with a submodular likelihood objective, and combines them into a product-of-experts (PoE) teacher model with an analytic score. This teacher model can be sampled directly through classifier-free guidance or used to generate a sample pool for training a new class embedding and low-rank adapter. On held-out composition benchmarks built from ColorMNIST and CelebA, both the analytic PoE sampler and the low-rank adapted model outperform query-only and nearest trained-class baselines. These results suggest that the time-indexed score geometry of the diffusion model contains reusable density-mode concepts that support test-time compositional generation without a predefined concept library.
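
The abstract leaves the clean-space mapping implicit, but the paper's empirical-Bayes citations (Robbins [37], Miyasawa [38], Efron's Tweedie [39]) suggest the standard denoising identity for the DDPM forward process $x_t = \sqrt{\bar\alpha_t}\, x_0 + \sqrt{1-\bar\alpha_t}\,\epsilon$, which would send each recovered mode to a clean-space mean; reading this as the mechanism is an editorial assumption:

    \mathbb{E}[x_0 \mid x_t]
      = \frac{x_t + (1 - \bar\alpha_t)\, \nabla_{x_t} \log p_t(x_t)}{\sqrt{\bar\alpha_t}}
      \;\approx\; \frac{x_t + (1 - \bar\alpha_t)\, s_\theta(x_t, t)}{\sqrt{\bar\alpha_t}}.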

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes a test-time method for compositional generalization in pretrained diffusion models without a predefined concept library. Given a single OOD query, it performs gradient ascent on the score function s_θ(x_t, t) at multiple noising timesteps to recover local density modes, maps these to clean-space Gaussians, greedily selects relevant prototypes via a submodular likelihood objective, and composes them into an analytic product-of-experts (PoE) teacher model. This teacher can be sampled directly via classifier-free guidance or used to generate data for training a new class embedding and LoRA adapter. The approach is evaluated on held-out composition benchmarks derived from ColorMNIST and CelebA, where both the PoE sampler and adapted model outperform query-only and nearest-class baselines.

Significance. If the recovered modes prove to be stable, distinct, and semantically aligned with query elements, the work would establish that the time-indexed score geometry of diffusion models encodes reusable density-mode concepts usable for test-time composition. This could reduce reliance on curated concept libraries and enable more flexible handling of novel combinations in generative models, with the analytic PoE and adaptation steps providing concrete implementation paths.

major comments (2)
  1. [§3] §3 (Method): The central claim that gradient ascent on s_θ(x_t, t) at multiple timesteps recovers query-relevant, reusable concepts is load-bearing, yet the manuscript provides no verification (e.g., stability across runs, semantic alignment with query attributes, or distinction from diffusion artifacts) that these modes survive the Gaussian mapping and submodular selection without introducing spurious elements or missing key factors.
  2. [§4] §4 (Experiments): The reported outperformance on ColorMNIST and CelebA benchmarks lacks quantitative metrics, error bars, ablation results on the number/choice of timesteps or submodular objective parameters, and full protocol details, preventing assessment of whether the gains are robust or attributable to the discovered concepts rather than procedural degrees of freedom.
minor comments (1)
  1. [Abstract and §3] The abstract and method description would benefit from explicit notation for the submodular objective function and the precise form of the analytic PoE score to improve reproducibility.
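
To make the minor comment concrete, one hedged candidate notation the revision could adopt is sketched below; the facility-location form is an editorial guess at the "submodular likelihood objective", not the paper's definition. With candidate clean-space Gaussians {(μ_i, Σ_i)} and query-derived targets {y_j}, greedy selection maximizes

    F(S) = \sum_{j} \max_{i \in S} \log \mathcal{N}\!\left(y_j;\, \mu_i, \Sigma_i\right),
    \qquad
    S_{k+1} = S_k \cup \Big\{\arg\max_{e \notin S_k} \big(F(S_k \cup \{e\}) - F(S_k)\big)\Big\}.

Facility-location objectives of this form are monotone submodular, so greedy selection inherits the usual (1 − 1/e) approximation guarantee [41], and the PoE teacher's analytic score is then the sum of the selected experts' scores.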

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and detailed report. We address each major comment below and describe the revisions we will make to strengthen the manuscript.

Point-by-point responses
  1. Referee: [§3] §3 (Method): The central claim that gradient ascent on s_θ(x_t, t) at multiple timesteps recovers query-relevant, reusable concepts is load-bearing, yet the manuscript provides no verification (e.g., stability across runs, semantic alignment with query attributes, or distinction from diffusion artifacts) that these modes survive the Gaussian mapping and submodular selection without introducing spurious elements or missing key factors.

    Authors: We agree that explicit verification of the recovered modes would better substantiate the central claim. The current manuscript emphasizes end-to-end compositional performance rather than intermediate diagnostics. In the revision we will add a dedicated analysis subsection to §3 that reports: (i) stability of the selected prototypes across five independent gradient-ascent runs (measured by set overlap and mean pairwise distance of the mapped Gaussians), (ii) semantic alignment via attribute classifiers trained on the source datasets and applied to the discovered modes, and (iii) a controlled comparison against modes obtained from random starting points to separate query-relevant concepts from generic diffusion artifacts. These additions will be presented without changing the core algorithm; a sketch of the diagnostics in (i) follows these responses. revision: yes

  2. Referee: [§4] §4 (Experiments): The reported outperformance on ColorMNIST and CelebA benchmarks lacks quantitative metrics, error bars, ablation results on the number/choice of timesteps or submodular objective parameters, and full protocol details, preventing assessment of whether the gains are robust or attributable to the discovered concepts rather than procedural degrees of freedom.

    Authors: We concur that additional experimental rigor is required for a convincing evaluation. While the original submission already includes mean performance numbers on the held-out benchmarks, we will expand §4 and the appendix to provide: standard deviations across at least five random seeds as error bars, systematic ablations on the number and selection of timesteps (e.g., 5/10/20) and on the submodular objective hyperparameters, and a complete experimental protocol listing all hyperparameters, data splits, baseline implementations, and compute details. These changes will allow readers to assess both robustness and the contribution of the discovered concepts. revision: yes
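
A sketch of how diagnostic (i) from the first response could be computed, assuming each independent ascent run yields a [K, D] tensor of mapped prototype means; the nearest-neighbor matching and the overlap tolerance are assumptions about how "set overlap" would be operationalized:

    import itertools
    import torch

    def pairwise_stability(runs, tol=1.0):
        # runs: list of [K, D] prototype-mean tensors, one per ascent run.
        # tol is an assumed distance threshold for counting a match.
        dists, overlaps = [], []
        for a, b in itertools.combinations(runs, 2):
            d = torch.cdist(a, b)          # [K, K] pairwise distances
            nn = d.min(dim=1).values       # nearest match in the other run
            dists.append(nn.mean().item())                       # mean pairwise distance
            overlaps.append((nn < tol).float().mean().item())    # fractional set overlap
        n = max(len(dists), 1)
        return sum(dists) / n, sum(overlaps) / n

Distances near zero and overlap near one across the five promised runs would support the stability claim; unstable prototypes would show up directly in both numbers.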

Circularity Check

0 steps flagged

No significant circularity; the method is an independent algorithmic procedure

Full rationale

The paper describes a test-time procedure that applies standard gradient ascent to the pretrained score function s_θ(x_t, t) at multiple timesteps, maps recovered modes to Gaussians, performs submodular selection, and forms a product-of-experts score. None of these steps reduce to the target compositional result by construction, nor do they rely on fitted parameters tuned to the held-out benchmarks or on self-citation chains that presuppose the claimed discovery. The derivation is therefore self-contained: the method is a composition of off-the-shelf optimization primitives whose outputs are then evaluated empirically, without the result being presupposed in the inputs.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 1 invented entity

The central claim depends on several optimization and modeling choices that are not derived from first principles and on the assumption that score-function modes are semantically meaningful and composable.

free parameters (2)
  • number and choice of noising timesteps
    Selected to recover local density modes; not derived from the model
  • submodular objective parameters
    Control greedy prototype selection; tuned for the likelihood objective (a greedy-selection sketch follows this ledger)
axioms (2)
  • standard math: gradient ascent on the score function recovers local modes of the noisy marginals
    Invoked to discover concepts from s_θ(x_t, t)
  • domain assumption: local modes can be accurately mapped to clean-space Gaussians
    Required for the product-of-experts construction
invented entities (1)
  • query-specific density-mode concepts (no independent evidence)
    purpose: reusable building blocks extracted at test time
    Postulated as extractable from the pretrained score geometry without independent verification outside the method
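
A minimal sketch of the greedy selection those free parameters control, under the same assumed facility-location likelihood objective sketched in the referee section above; the unit-variance Gaussians and k = 5 are illustrative stand-ins for the unspecified submodular objective parameters:

    import torch

    def greedy_select(candidate_means, query_points, k=5):
        # Greedily maximize F(S) = sum_j max_{i in S} log N(y_j; mu_i, I).
        # candidate_means: [M, D] mapped Gaussian means; query_points: [N, D].
        ll = -0.5 * torch.cdist(query_points, candidate_means) ** 2  # [N, M] log-lik up to a constant
        covered = torch.full((query_points.shape[0],), -1e30)        # best log-lik so far per point
        selected = []
        for _ in range(k):
            gains = torch.clamp(ll - covered.unsqueeze(1), min=0).sum(dim=0)  # marginal gains, [M]
            for i in selected:
                gains[i] = -1.0              # exclude already-chosen prototypes
            best = int(gains.argmax())
            selected.append(best)
            covered = torch.maximum(covered, ll[:, best])
        return selected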

pith-pipeline@v0.9.0 · 5539 in / 1556 out tokens · 44738 ms · 2026-05-11T01:38:41.756936+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.


Reference graph

Works this paper leans on

55 extracted references · 55 canonical work pages · 2 internal anchors

[1] Jerry A. Fodor and Zenon W. Pylyshyn. Connectionism and cognitive architecture: A critical analysis. Cognition, 28(1–2):3–71, 1988.

[2] Brenden M. Lake, Tomer D. Ullman, Joshua B. Tenenbaum, and Samuel J. Gershman. Building machines that learn and think like people. Behavioral and Brain Sciences, 40:e253, 2017.

[3] Brenden M. Lake and Marco Baroni. Generalization without systematicity: On the compositional skills of sequence-to-sequence recurrent networks. In Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 2873–2882. PMLR, 2018.

[4] Laura Ruis, Jacob Andreas, Marco Baroni, Diane Bouchacourt, and Brenden M. Lake. A benchmark for systematic generalization in grounded language understanding. In Advances in Neural Information Processing Systems, volume 33, pages 19861–19872, 2020.

[5] Jacob Andreas, Marcus Rohrbach, Trevor Darrell, and Dan Klein. Neural module networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 39–48, 2016.

[6] Jacob Andreas, Marcus Rohrbach, Trevor Darrell, and Dan Klein. Learning to compose neural networks for question answering. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1545–1554. Association for Computational Linguistics, 2016.

[7] John J. Hopfield. Neural networks and physical systems with emergent collective computational abilities. Proceedings of the National Academy of Sciences, 79(8):2554–2558, 1982.

[8] Daniel J. Amit, Hanoch Gutfreund, and H. Sompolinsky. Spin-glass models of neural networks. Physical Review A, 32(2):1007–1018, 1985.

[9] Geoffrey E. Hinton. Training products of experts by minimizing contrastive divergence. Neural Computation, 14(8):1771–1800, 2002.

[10] Yilun Du, Shuang Li, and Igor Mordatch. Compositional visual generation and inference with energy based models. In Advances in Neural Information Processing Systems, volume 33, pages 6637–6647, 2020.

[11] Nan Liu, Shuang Li, Yilun Du, Antonio Torralba, and Joshua B. Tenenbaum. Compositional visual generation with composable diffusion models. In European Conference on Computer Vision (ECCV), pages 423–439, 2022.

[12] Yilun Du, Conor Durkan, Robin Strudel, Joshua B. Tenenbaum, Sander Dieleman, Rob Fergus, Jascha Sohl-Dickstein, Arnaud Doucet, and Will Grathwohl. Reduce, reuse, recycle: Compositional generation with energy-based diffusion models and MCMC. In Proceedings of the 40th International Conference on Machine Learning, volume 202 of Proceedings of Machine Learning Research. PMLR, 2023.

[13] David Bau, Bolei Zhou, Aditya Khosla, Aude Oliva, and Antonio Torralba. Network dissection: Quantifying interpretability of deep visual representations. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3319–3327, 2017.

[14] Been Kim, Martin Wattenberg, Justin Gilmer, Carrie Cai, James Wexler, Fernanda Viegas, and Rory Sayres. Interpretability beyond feature attribution: Quantitative testing with concept activation vectors (TCAV). In International Conference on Machine Learning, pages 2668–2677. PMLR, 2018.

[15] Amirata Ghorbani, James Wexler, James Y. Zou, and Been Kim. Towards automatic concept-based explanations. In Advances in Neural Information Processing Systems, volume 32, 2019.

[16] Chaofan Chen, Oscar Li, Chaofan Tao, Alina Jade Barnett, Jonathan Su, and Cynthia Rudin. This looks like that: Deep learning for interpretable image recognition. In Advances in Neural Information Processing Systems (NeurIPS), 2019.

[17] Nan Liu, Yilun Du, Shuang Li, Joshua B. Tenenbaum, and Antonio Torralba. Unsupervised compositional concepts discovery with text-to-image generative models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 2085–2095, 2023.

[18] Dorin Comaniciu and Peter Meer. Mean shift: A robust approach toward feature space analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(5):603–619, 2002.

[19] Hiroaki Sasaki, Takafumi Kanamori, Aapo Hyvärinen, Gang Niu, and Masashi Sugiyama. Mode-seeking clustering and density ridge estimation via direct estimation of density-derivative-ratios. Journal of Machine Learning Research, 18(180):1–47, 2018.

[20] Kamalika Chaudhuri and Sanjoy Dasgupta. Rates of convergence for the cluster tree. Advances in Neural Information Processing Systems, 23, 2010.

[21] Zekun Wang, Ethan Haarer, Tianyi Zhu, Zhiyi Dai, and Christopher J. MacLellan. Deep taxonomic networks for unsupervised hierarchical prototype discovery. In Advances in Neural Information Processing Systems, 2025.

[22] Jascha Sohl-Dickstein, Eric A. Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. In International Conference on Machine Learning (ICML), 2015.

[23] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. In Advances in Neural Information Processing Systems, volume 33, pages 6840–6851, 2020.

[24] Diederik P. Kingma, Tim Salimans, Ben Poole, and Jonathan Ho. Variational diffusion models. In Advances in Neural Information Processing Systems (NeurIPS), 2021.

[25] Casper Kaae Sønderby, Tapani Raiko, Lars Maaløe, Søren Kaae Sønderby, and Ole Winther. Ladder variational autoencoders. In Advances in Neural Information Processing Systems (NeurIPS), 2016.

[26] Arash Vahdat and Jan Kautz. NVAE: A deep hierarchical variational autoencoder. In Advances in Neural Information Processing Systems (NeurIPS), 2020.

[27] Antonio Sclocchi, Alessandro Favero, and Matthieu Wyart. A phase transition in diffusion models reveals the hierarchical nature of data. Proceedings of the National Academy of Sciences, 122(1):e2408799121, 2025.

[28] Yukun Huang, Jianan Wang, Yukai Shi, Boshi Tang, Xianbiao Qi, and Lei Zhang. Dreamtime: An improved optimization strategy for diffusion-guided 3D generation. In International Conference on Learning Representations (ICLR), 2024.

[29] Jacob Andreas. Good-enough compositional data augmentation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7556–7566. Association for Computational Linguistics, 2020.

[30] Pengcheng Yin, Hao Fang, Graham Neubig, Adam Pauls, Emmanouil Antonios Platanios, Yu Su, Sam Thomson, and Jacob Andreas. Compositional generalization for neural semantic parsing via span-level supervised attention. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2021.

[31] Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance. arXiv preprint arXiv:2207.12598, 2022.

[32] Pascal Vincent. A connection between score matching and denoising autoencoders. Neural Computation, 23(7):1661–1674, 2011.

[33] Yang Song and Stefano Ermon. Generative modeling by estimating gradients of the data distribution. In Advances in Neural Information Processing Systems (NeurIPS), 2019.

[34] Yang Song, Jascha Sohl-Dickstein, Diederik P. Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. In International Conference on Learning Representations (ICLR), 2021.

[35] Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models. In International Conference on Learning Representations, 2022.

[36] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations (ICLR), 2015.

[37] Herbert Robbins. An empirical Bayes approach to statistics. In Proceedings of the Third Berkeley Symposium on Mathematical Statistics and Probability, volume 1, pages 157–163, 1956.

[38] K. Miyasawa. An empirical Bayes estimator of the mean of a normal population. Bulletin of the International Statistical Institute, 38(4):181–188, 1961.

[39] Bradley Efron. Tweedie's formula and selection bias. Journal of the American Statistical Association, 106(496):1602–1614, 2011.

[40] Yilun Du and Igor Mordatch. Implicit generation and modeling with energy-based models. In Advances in Neural Information Processing Systems (NeurIPS), 2019.

[41] George L. Nemhauser, Laurence A. Wolsey, and Marshall L. Fisher. An analysis of approximations for maximizing submodular set functions—I. Mathematical Programming, 14(1):265–294, 1978.

[42] Andreas Krause and Daniel Golovin. Submodular function maximization. In Lucas Bordeaux, Youssef Hamadi, and Pushmeet Kohli, editors, Tractability: Practical Approaches to Hard Problems, pages 71–104. Cambridge University Press, 2014.

[43] Rinon Gal, Yuval Alaluf, Yuval Atzmon, Or Patashnik, Amit H. Bermano, Gal Chechik, and Daniel Cohen-Or. An image is worth one word: Personalizing text-to-image generation using textual inversion. In International Conference on Learning Representations (ICLR), 2023.

[44] Nataniel Ruiz, Yuanzhen Li, Varun Jampani, Yael Pritch, Michael Rubinstein, and Kfir Aberman. DreamBooth: Fine tuning text-to-image diffusion models for subject-driven generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22500–22510, 2023.

[45] Nupur Kumari, Bingliang Zhang, Richard Zhang, Eli Shechtman, and Jun-Yan Zhu. Multi-concept customization of text-to-image diffusion. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023.

[46] Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep learning face attributes in the wild. In Proceedings of the IEEE International Conference on Computer Vision, pages 3730–3738, 2015.

[47] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In Advances in Neural Information Processing Systems, volume 30, 2017.

[48] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, 2021.

[49] Tuomas Kynkäänniemi, Tero Karras, Samuli Laine, Jaakko Lehtinen, and Timo Aila. Improved precision and recall metric for assessing generative models. In Advances in Neural Information Processing Systems, 2019.

[50] Xiang Lisa Li, John Thickstun, Ishaan Gulrajani, Percy Liang, and Tatsunori B. Hashimoto. Diffusion-LM improves controllable text generation. In Advances in Neural Information Processing Systems, 2022.

[51] Shen Nie, Fengqi Zhu, Zebin You, Xiaolu Zhang, Jingyang Ou, Jun Hu, Jun Zhou, Yankai Lin, Ji-Rong Wen, and Chongxuan Li. Large language diffusion models. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025.

[52] Cheng Chi, Zhenjia Xu, Siyuan Feng, Eric Cousineau, Yilun Du, Benjamin Burchfiel, Russ Tedrake, and Shuran Song. Diffusion policy: Visuomotor policy learning via action diffusion. In Robotics: Science and Systems, 2023.

[53] Brenden M. Lake, Ruslan Salakhutdinov, and Joshua B. Tenenbaum. Human-level concept learning through probabilistic program induction. Science, 350(6266):1332–1338, 2015.

[54] Han Xiao, Kashif Rasul, and Roland Vollgraf. Fashion-MNIST: A novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747, 2017.

[55] Michael F. Hutchinson. A stochastic estimator of the trace of the influence matrix for Laplacian smoothing splines. Communications in Statistics – Simulation and Computation, 19(2):433–450, 1990.