Recognition: unknown
MONet: Unsupervised Scene Decomposition and Representation
read the original abstract
The ability to decompose scenes in terms of abstract building blocks is crucial for general intelligence. Where those basic building blocks share meaningful properties, interactions and other regularities across scenes, such decompositions can simplify reasoning and facilitate imagination of novel scenarios. In particular, representing perceptual observations in terms of entities should improve data efficiency and transfer performance on a wide range of tasks. Thus we need models capable of discovering useful decompositions of scenes by identifying units with such regularities and representing them in a common format. To address this problem, we have developed the Multi-Object Network (MONet). In this model, a VAE is trained end-to-end together with a recurrent attention network -- in a purely unsupervised manner -- to provide attention masks around, and reconstructions of, regions of images. We show that this model is capable of learning to decompose and represent challenging 3D scenes into semantically meaningful components, such as objects and background elements.
This paper has not been read by Pith yet.
Forward citations
Cited by 9 Pith papers
-
What-Where Transformer: A Slot-Centric Visual Backbone for Concurrent Representation and Localization
The What-Where Transformer achieves explicit what-where separation in a ViT-style backbone via concurrent token and attention-map streams, yielding emergent object discovery from attention maps and better weakly-super...
-
OA-WAM: Object-Addressable World Action Model for Robust Robot Manipulation
OA-WAM uses persistent address vectors and dynamic content vectors in object slots to enable addressable world-action prediction, improving robustness on manipulation benchmarks under scene changes.
-
Mask World Model: Predicting What Matters for Robust Robot Policy Learning
Mask World Model predicts semantic mask dynamics with video diffusion and integrates it with a diffusion policy head, outperforming RGB world models on LIBERO and RLBench while showing better real-world generalization...
-
Anatomy-Slot: Unsupervised Anatomical Factorization for Homologous Bilateral Reasoning in Retinal Diagnosis
Anatomy-Slot improves AUC by 4.2% on ODIR-5K over a ViT-L baseline by unsupervised slot factorization and cross-eye alignment.
-
Learning to Theorize the World from Observation
NEO induces compositional latent programs as world theories from observations and executes them to enable explanation-driven generalization.
-
When Attention Collapses: Residual Evidence Modeling for Compositional Inference
Standard attention collapses on additively mixed signals because it is memoryless with respect to explained evidence, but adding multiplicative depletion with an attention bias prevents collapse and enables multi-sour...
-
Deep sprite-based image models: An analysis
A deep sprite-based image decomposition method matches SOTA unsupervised class-aware segmentation on CLEVR, scales linearly with objects, explicitly identifies categories, and fully models images interpretably.
-
Kuramoto Oscillatory Phase Encoding: Neuro-inspired Synchronization for Improved Learning Efficiency
KoPE adds Kuramoto-based oscillatory phase states and synchronization to Vision Transformers, improving training, parameter, and data efficiency on structured vision tasks.
-
CausalVAE as a Plug-in for World Models: Towards Reliable Counterfactual Dynamics
CausalVAE plug-in for world models preserves factual prediction and boosts counterfactual retrieval, with large gains on physics benchmarks and recovered physical interaction trends.
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.