MONet: Unsupervised Scene Decomposition and Representation

Christopher P. Burgess, Loic Matthey, Nicholas Watters, Rishabh Kabra, Irina Higgins, Matt Botvinick, Alexander Lerchner

classification cs.CV cs.LG stat.ML

keywords scenes, attention, blocks, building, capable, decompose, decompositions, meaningful

The ability to decompose scenes in terms of abstract building blocks is crucial for general intelligence. Where those basic building blocks share meaningful properties, interactions and other regularities across scenes, such decompositions can simplify reasoning and facilitate imagination of novel scenarios. In particular, representing perceptual observations in terms of entities should improve data efficiency and transfer performance on a wide range of tasks. Thus we need models capable of discovering useful decompositions of scenes by identifying units with such regularities and representing them in a common format. To address this problem, we have developed the Multi-Object Network (MONet). In this model, a VAE is trained end-to-end together with a recurrent attention network -- in a purely unsupervised manner -- to provide attention masks around, and reconstructions of, regions of images. We show that this model is capable of learning to decompose and represent challenging 3D scenes into semantically meaningful components, such as objects and background elements.
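
The abstract says only that a recurrent attention network provides masks around image regions and a VAE reconstructs them; it does not say how the masks are computed. As a minimal sketch of one way such masks can be kept consistent, the code below assumes a stick-breaking "scope" scheme: each attention step claims a fraction of the pixels not yet explained, and the last slot takes whatever remains, so the masks sum to 1 at every pixel. The function names `decompose_masks` and `mix_reconstructions` are hypothetical, and random arrays stand in for the outputs of the attention network and the VAE decoder.

```python
import numpy as np


def decompose_masks(alphas):
    """Turn K-1 per-pixel attention maps in [0, 1] into K masks that sum to 1.

    Assumption: a stick-breaking scheme in which each step takes a share
    of the still-unexplained "scope" and the final slot takes the rest.
    """
    alphas = np.asarray(alphas, dtype=float)
    scope = np.ones_like(alphas[0])        # everything is unexplained at first
    masks = []
    for alpha in alphas:
        masks.append(scope * alpha)        # region claimed at this step
        scope = scope * (1.0 - alpha)      # what is left for later steps
    masks.append(scope)                    # last slot absorbs the remainder
    return np.stack(masks)


def mix_reconstructions(masks, components):
    """Combine per-slot reconstructions into one image, weighted by the masks."""
    return (masks[..., None] * components).sum(axis=0)


# Stand-in data (hypothetical): 3 attention steps on a 4x4 image, 4 RGB slots.
rng = np.random.default_rng(0)
alphas = rng.uniform(size=(3, 4, 4))
components = rng.uniform(size=(4, 4, 4, 3))

masks = decompose_masks(alphas)                  # shape (4, 4, 4)
image = mix_reconstructions(masks, components)   # shape (4, 4, 3)
```

Because the masks are non-negative and sum to 1 per pixel, the mixed image is a convex combination of the slot reconstructions, so each pixel is attributed across slots without double counting.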

Forward citations

Cited by 9 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

What-Where Transformer: A Slot-Centric Visual Backbone for Concurrent Representation and Localization
cs.CV 2026-05 unverdicted novelty 7.0
The What-Where Transformer achieves explicit what-where separation in a ViT-style backbone via concurrent token and attention-map streams, yielding emergent object discovery from attention maps and better weakly-super...

OA-WAM: Object-Addressable World Action Model for Robust Robot Manipulation
cs.RO 2026-05 unverdicted novelty 7.0
OA-WAM uses persistent address vectors and dynamic content vectors in object slots to enable addressable world-action prediction, improving robustness on manipulation benchmarks under scene changes.

Mask World Model: Predicting What Matters for Robust Robot Policy Learning
cs.RO 2026-04 unverdicted novelty 7.0
Mask World Model predicts semantic mask dynamics with video diffusion and integrates it with a diffusion policy head, outperforming RGB world models on LIBERO and RLBench while showing better real-world generalization...

Anatomy-Slot: Unsupervised Anatomical Factorization for Homologous Bilateral Reasoning in Retinal Diagnosis
cs.CV 2026-05 unverdicted novelty 6.0
Anatomy-Slot improves AUC by 4.2% on ODIR-5K over a ViT-L baseline by unsupervised slot factorization and cross-eye alignment.

Learning to Theorize the World from Observation
cs.LG 2026-05 unverdicted novelty 6.0
NEO induces compositional latent programs as world theories from observations and executes them to enable explanation-driven generalization.

When Attention Collapses: Residual Evidence Modeling for Compositional Inference
cs.LG 2026-05 unverdicted novelty 6.0
Standard attention collapses on additively mixed signals because it is memoryless with respect to explained evidence, but adding multiplicative depletion with an attention bias prevents collapse and enables multi-sour...

Deep sprite-based image models: An analysis
cs.CV 2026-04 unverdicted novelty 6.0
A deep sprite-based image decomposition method matches SOTA unsupervised class-aware segmentation on CLEVR, scales linearly with objects, explicitly identifies categories, and fully models images interpretably.

Kuramoto Oscillatory Phase Encoding: Neuro-inspired Synchronization for Improved Learning Efficiency
cs.LG 2026-04 unverdicted novelty 5.0
KoPE adds Kuramoto-based oscillatory phase states and synchronization to Vision Transformers, improving training, parameter, and data efficiency on structured vision tasks.

CausalVAE as a Plug-in for World Models: Towards Reliable Counterfactual Dynamics
cs.LG 2026-04 unverdicted novelty 5.0
CausalVAE plug-in for world models preserves factual prediction and boosts counterfactual retrieval, with large gains on physics benchmarks and recovered physical interaction trends.