Scene-Agnostic Object-Centric Representation Learning for 3D Gaussian Splatting
1 Pith paper cites this work. Polarity classification is still indexing.
abstract
Recent works on 3D scene understanding leverage 2D masks from visual foundation models (VFMs) to supervise radiance fields, enabling instance-level 3D segmentation. However, the supervision signals from foundation models are not fundamentally object-centric and often require additional mask pre-/post-processing or specialized training and loss design to resolve mask identity conflicts across views. The learned identity of the 3D scene is scene-dependent, limiting generalizability across scenes. Therefore, we propose a dataset-level, object-centric supervision scheme to learn object representations in 3D Gaussian Splatting (3DGS). Building on a pre-trained slot-attention-based Global Object Centric Learning (GOCL) module, we learn a scene-agnostic object codebook that provides consistent, identity-anchored representations across views and scenes. By coupling the codebook with the module's unsupervised object masks, we can directly supervise the identity features of 3D Gaussians without additional mask pre-/post-processing or explicit multi-view alignment. The learned scene-agnostic codebook enables object supervision and identification without per-scene fine-tuning or retraining. Our method thus introduces unsupervised object-centric learning (OCL) into 3DGS, yielding more structured representations and better generalization for downstream tasks such as robotic interaction, scene understanding, and cross-scene generalization.
fields
cs.CV 1
years
2026 1
verdicts
UNVERDICTED 1
representative citing papers
Scenes as Objects, Not Primitives: Instance-Structured 3D Tokenization from Unposed Views
A feed-forward framework learns instance-structured 3D token groups from unposed multi-view images via differentiable rendering, enabling native object-level segmentation, editing, and retrieval without 3D supervision.