{"paper":{"title":"When Do Diffusion Models learn to Generate Multiple Objects?","license":"http://creativecommons.org/licenses/by/4.0/","headline":"Diffusion models' multi-object generation is limited primarily by scene complexity and held-out combinations rather than imbalance, with counting difficult in low data and compositional generalization collapsing as more combinations are excluded.","cross_cats":["cs.AI"],"primary_cat":"cs.CV","authors_text":"Anna Rohrbach, Arnas Uselis, Iro Laina, Seong Joon Oh, Yujin Jeong","submitted_at":"2026-04-30T22:18:33Z","abstract_excerpt":"Text-to-image diffusion models achieve impressive visual fidelity, yet they remain unreliable in multi-object generation. Despite extensive empirical evidence of these failures, the underlying causes remain unclear. We begin by asking how much of this limitation arises from the data itself. To disentangle data effects, we consider two regimes across different dataset sizes: (1) concept generalization, where each individual concept is observed during training under potentially imbalanced data distributions, and (2) compositional generalization, where specific combinations of concepts are system"},"claims":{"count":3,"items":[{"kind":"strongest_claim","text":"By training diffusion models on mosaic, we find that scene complexity plays a dominant role rather than concept imbalance, and that counting is uniquely difficult to learn in low-data regimes. \nMoreover, compositional generalization collapses as more concept combinations are held out during training.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"That the synthetic MOSAIC datasets and the defined regimes of concept versus compositional generalization capture the essential factors driving failures in real-world text-to-image diffusion models trained on natural image distributions.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"Diffusion models' multi-object generation is limited primarily by scene complexity and held-out combinations rather than imbalance, with counting difficult in low data and compositional generalization collapsing as more combinations are excluded.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"}],"snapshot_sha256":"25ee0cc8fd9827ac6ffbe2796e2692facf9343fd438b35eecb8962609a949746"},"source":{"id":"2605.00273","kind":"arxiv","version":2},"verdict":{"id":"118b219b-4541-4dbd-9d24-6361e8c40c1e","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-07T04:43:19.926396Z","strongest_claim":"By training diffusion models on mosaic, we find that scene complexity plays a dominant role rather than concept imbalance, and that counting is uniquely difficult to learn in low-data regimes. \nMoreover, compositional generalization collapses as more concept combinations are held out during training.","one_line_summary":"Diffusion models' multi-object generation is limited primarily by scene complexity and held-out combinations rather than imbalance, with counting difficult in low data and compositional generalization collapsing as more combinations are excluded.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"That the synthetic MOSAIC datasets and the defined regimes of concept versus compositional generalization capture the essential factors driving failures in real-world text-to-image diffusion models trained on natural image distributions.","pith_extraction_headline":""},"integrity":{"clean":true,"summary":{"advisory":0,"critical":0,"by_detector":{},"informational":0},"endpoint":"/pith/2605.00273/integrity.json","findings":[],"available":true,"detectors_run":[{"name":"ai_meta_artifact","ran_at":"2026-05-20T20:35:28.792103Z","status":"completed","version":"1.0.0","findings_count":0}],"snapshot_sha256":"c64f3565077591964cb65b2a8902d18cfe62c1d8acc5df925b197950e8f09a3c"},"references":{"count":0,"sample":[],"resolved_work":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57","internal_anchors":0},"formal_canon":{"evidence_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"}