When Do Diffusion Models learn to Generate Multiple Objects?

Anna Rohrbach; Arnas Uselis; Iro Laina; Seong Joon Oh; Yujin Jeong

arxiv: 2605.00273 · v1 · submitted 2026-04-30 · 💻 cs.CV · cs.AI


Yujin Jeong, Arnas Uselis, Iro Laina, Seong Joon Oh, Anna Rohrbach

Pith reviewed 2026-05-07 04:43 UTC · model grok-4.3


keywords concept · data · diffusion · models · compositional · generalization · generation · multi-object


The pith

Multi-object generation in diffusion models is limited primarily by scene complexity and by held-out concept combinations rather than by concept imbalance: counting is especially hard to learn in low-data regimes, and compositional generalization collapses as more combinations are excluded from training.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The authors built MOSAIC, a synthetic dataset generator that produces images with controlled object counts, positions, and attributes. They trained diffusion models under two regimes: concept generalization, in which every individual concept appears in training but possibly under an imbalanced distribution, and compositional generalization, in which specific combinations of concepts are held out. Performance dropped sharply as scenes grew more complex, that is, as they contained more objects. Counting specific numbers of objects was hardest to learn when training data was limited. As more concept combinations were withheld from training, the models increasingly failed to generate the unseen combinations.
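The two data conditions summarized above (uneven exposure to individual concepts, and combinations withheld from training) can be illustrated with a minimal sketch. This is not MOSAIC's actual interface: the function names `split_combinations` and `sample_concepts`, the (color, shape) concept pairs, and the Zipf-like weighting are all assumptions made here for illustration only.

```python
import itertools
import random


def split_combinations(colors, shapes, holdout_fraction, seed=0):
    """Hypothetical helper: hold out a fraction of (color, shape) pairs
    while keeping every individual color and shape in the training set."""
    rng = random.Random(seed)
    combos = list(itertools.product(colors, shapes))
    rng.shuffle(combos)
    n_holdout = int(len(combos) * holdout_fraction)

    train = list(combos)
    held_out = []
    for combo in combos:
        if len(held_out) >= n_holdout:
            break
        remaining = [c for c in train if c != combo]
        # Withhold the pair only if both of its concepts still occur in training.
        color_seen = any(c[0] == combo[0] for c in remaining)
        shape_seen = any(c[1] == combo[1] for c in remaining)
        if color_seen and shape_seen:
            train = remaining
            held_out.append(combo)
    return train, held_out


def sample_concepts(concepts, n, skew=1.0, seed=0):
    """Hypothetical helper: draw n concepts with Zipf-like imbalance.
    skew=0 gives a uniform distribution; larger skew favors early concepts."""
    rng = random.Random(seed)
    weights = [1.0 / (rank + 1) ** skew for rank in range(len(concepts))]
    return rng.choices(concepts, weights=weights, k=n)


colors = ["red", "green", "blue"]
shapes = ["cube", "sphere", "cone"]
train_pairs, held_out_pairs = split_combinations(colors, shapes, holdout_fraction=0.5)
imbalanced_draws = sample_concepts(shapes, n=1000, skew=2.0)
```

In this sketch, raising `holdout_fraction` corresponds to excluding more combinations from training, and raising `skew` corresponds to a more imbalanced concept distribution. The coverage check keeps the split in the compositional setting, where each concept is seen individually but some pairings are not.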

Core claim

By training diffusion models on MOSAIC, we find that scene complexity plays a dominant role rather than concept imbalance, and that counting is uniquely difficult to learn in low-data regimes. Moreover, compositional generalization collapses as more concept combinations are held out during training.

Load-bearing premise

That the synthetic MOSAIC datasets and the defined regimes of concept versus compositional generalization capture the essential factors driving failures in real-world text-to-image diffusion models trained on natural image distributions.

Original abstract

Text-to-image diffusion models achieve impressive visual fidelity, yet they remain unreliable in multi-object generation. Despite extensive empirical evidence of these failures, the underlying causes remain unclear. We begin by asking how much of this limitation arises from the data itself. To disentangle data effects, we consider two regimes across different dataset sizes: (1) concept generalization, where each individual concept is observed during training under potentially imbalanced data distributions, and (2) compositional generalization, where specific combinations of concepts are systematically held out. To study these regimes, we introduce MOSAIC (Multi-Object Spatial relations, AttrIbution, Counting), a controlled framework for dataset generation. By training diffusion models on MOSAIC, we find that scene complexity plays a dominant role rather than concept imbalance, and that counting is uniquely difficult to learn in low-data regimes. Moreover, compositional generalization collapses as more concept combinations are held out during training. These findings highlight fundamental limitations of diffusion models and motivate stronger inductive biases and data design for robust multi-object compositional generation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claims rest on the validity of the synthetic MOSAIC data as a proxy for real diffusion training dynamics and on standard assumptions about how diffusion models learn distributions from finite datasets.

axioms (1)

domain assumption: Synthetic datasets generated with controlled concept distributions and held-out combinations can isolate the effects of data composition on diffusion model behavior.
Invoked when the authors use MOSAIC to disentangle data effects from model limitations.

invented entities (1)

MOSAIC framework (no independent evidence)
purpose: To generate controlled multi-object datasets for studying generalization regimes in diffusion models.
New tool introduced to create the experimental conditions; no independent evidence outside the paper.

