arxiv: 1907.10008 · v1 · pith:RCPGB5Z7new · submitted 2019-07-23 · 💻 cs.CV · cs.RO

Incremental Class Discovery for Semantic Segmentation with RGBD Sensing

Pith reviewed 2026-05-24 17:21 UTC · model grok-4.3

classification 💻 cs.CV cs.RO
keywords semantic segmentation · RGBD sensing · incremental class discovery · 3D map aggregation · open world segmentation · coherent regions · semi-real-time mapping

The pith

Aggregating RGBD frames into a dense 3D map discovers new semantic classes by clustering coherent unlabeled regions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes an incremental method for open-world semantic segmentation that processes RGBD video to learn new object classes over time. Each frame is segmented using color and geometric cues, then the results are fused into one dense 3D map. Unlabeled coherent regions within that map become the units for discovering and clustering previously unseen classes. The 3D map primitive replaces surfels or voxels to cut memory and computation, enabling updates at 10.7 Hz. A reader would care because the approach relaxes the closed-world assumption that limits most current segmentation systems to a fixed set of trained classes.

Core claim

By first segmenting each RGBD frame with both color and geometry and then aggregating the results into a single segmented dense 3D map, the system identifies coherent regions that lack semantic labels and treats them as new object classes; these regions serve as the basic element for clustering both known and unseen objects while keeping memory and runtime low enough for incremental, semi-real-time operation.
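
The discovery step in the core claim can be illustrated with a minimal sketch. It is not the authors' implementation: the inputs (`segment_probs`, `segment_feats`), the thresholds (`known_thresh`, `new_dist`), and the greedy nearest-centroid rule are all assumptions chosen for illustration, standing in for whatever classifier and clustering procedure the paper actually uses. Segments of the 3D map that receive a confident known-class score keep that label; the rest are treated as unlabeled coherent regions and grouped into candidate new classes.

```python
import numpy as np

def assign_or_discover(segment_probs, segment_feats, known_thresh=0.5, new_dist=1.0):
    """Label each 3D segment with a known class or a discovered cluster id.

    segment_probs: (S, K) known-class scores per segment (hypothetical input).
    segment_feats: (S, D) feature vector per segment (hypothetical input).
    Returns a list of ("known", class_index) or ("new", cluster_index).
    """
    labels = []
    centroids = []  # running mean feature of each discovered cluster
    counts = []     # number of segments merged into each discovered cluster
    for probs, feat in zip(segment_probs, segment_feats):
        feat = np.asarray(feat, dtype=float)
        k = int(np.argmax(probs))
        if probs[k] >= known_thresh:
            labels.append(("known", k))
            continue
        # Unlabeled segment: join the nearest discovered cluster if close enough.
        if centroids:
            dists = [float(np.linalg.norm(feat - c)) for c in centroids]
            j = int(np.argmin(dists))
            if dists[j] <= new_dist:
                counts[j] += 1
                centroids[j] = centroids[j] + (feat - centroids[j]) / counts[j]
                labels.append(("new", j))
                continue
        # Otherwise this segment starts a new candidate class.
        centroids.append(feat.copy())
        counts.append(1)
        labels.append(("new", len(centroids) - 1))
    return labels

probs = np.array([[0.9, 0.1], [0.4, 0.35], [0.3, 0.3], [0.2, 0.3]])
feats = np.array([[0.0, 0.0], [5.0, 5.0], [5.2, 5.1], [-4.0, 3.0]])
result = assign_or_discover(probs, feats)
```

Because the unit of work is a segment rather than a surfel or voxel, the loop runs once per coherent region, which is the property the summary above credits for the low memory and runtime cost.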

What carries the argument

Coherent regions in the aggregated dense 3D map, which act as the primitive element for identifying and clustering new semantic classes instead of surfels or voxels.

If this is right

  • The 3D map representation reduces both computational complexity and memory use relative to surfel- or voxel-based alternatives.
  • The system runs at 10.7 Hz while incrementally updating the dense 3D map at every frame.
  • Experiments on NYUDv2 demonstrate correct clustering of objects from both known and unseen classes.
  • Quantitative comparisons with state-of-the-art supervised methods are reported alongside timing and component analyses.
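
The first two consequences lend themselves to a quick back-of-the-envelope check. The sketch below converts the reported 10.7 Hz rate into a per-frame time budget and compares the storage needed for a per-class score vector kept on every map element versus on every segment. The element counts, class count, and 4-byte value size are hypothetical placeholders, not figures from the paper.

```python
def frame_budget_ms(rate_hz):
    """Time available per frame, in milliseconds, at a given update rate."""
    return 1000.0 / rate_hz

def label_storage_bytes(num_elements, num_classes, bytes_per_value=4):
    """Bytes needed to keep one score per class on every map element."""
    return num_elements * num_classes * bytes_per_value

budget = frame_budget_ms(10.7)  # roughly 93.5 ms per frame

# Hypothetical scene: 1,000,000 surfels grouped into 500 segments, 40 classes.
per_surfel = label_storage_bytes(1_000_000, 40)
per_segment = label_storage_bytes(500, 40)
ratio = per_surfel // per_segment
```

Under these assumed counts the per-segment store is smaller by the ratio of surfels to segments; the actual saving depends on how many coherent regions a scene produces.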

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same coherent-region idea could be tested on other 3D sensors or SLAM pipelines if the initial segmentation step can be adapted.
  • Treating the 3D map as the discovery primitive may reduce sensitivity to per-frame viewpoint changes compared with 2D-only methods.
  • Integrating an online model update step after discovery could close the loop between segmentation and class learning.

Load-bearing premise

The initial per-frame segmentation of known classes is accurate enough that any unlabeled coherent region in the 3D map reliably represents a distinct semantic object rather than noise or a partial view.

What would settle it

Running the method on NYUDv2 and finding that a large fraction of the discovered coherent regions merge multiple objects, split single objects, or fail to separate known from unseen classes would falsify the claim.
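
One way to make this test concrete is sketched below, assuming per-point ground-truth instance ids are available for the reconstructed map. The function name, the `min_share` cutoff, and the input layout are illustrative assumptions, not part of the paper's protocol. A region counts as a merge when more than one object holds a significant share of its points; an object counts as split when more than one region holds a significant share of its points.

```python
from collections import Counter, defaultdict

def merge_split_rates(region_ids, object_ids, min_share=0.2):
    """Return (fraction of regions that merge objects, fraction of objects split).

    region_ids, object_ids: equal-length sequences with one entry per 3D point.
    min_share: minimum fraction of points for an overlap to count (assumed cutoff).
    """
    per_region = defaultdict(Counter)
    per_object = defaultdict(Counter)
    for r, o in zip(region_ids, object_ids):
        per_region[r][o] += 1
        per_object[o][r] += 1

    def frac_mixed(table):
        if not table:
            return 0.0
        mixed = 0
        for counts in table.values():
            total = sum(counts.values())
            significant = [c for c in counts.values() if c / total >= min_share]
            if len(significant) > 1:
                mixed += 1
        return mixed / len(table)

    return frac_mixed(per_region), frac_mixed(per_object)

regions = [0, 0, 0, 0, 1, 1, 2, 2]
objects = ["a", "a", "b", "b", "c", "c", "c", "c"]
merge_rate, split_rate = merge_split_rates(regions, objects)
```

In the toy input, region 0 covers objects a and b (a merge) and object c is spread over regions 1 and 2 (a split), so both rates come out to one third.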

Figures

Figures reproduced from arXiv: 1907.10008 by Byeongkeun Kang, Hideo Saito, Kris Kitani, Yoshikatsu Nakajima.

Figure 1: Proposed method incrementally discovers new classes
Figure 3: Building 3D Segmentation Map. The output of this processing is object-level segments in 3D. We build the 3D map by propagating 2D segmentation to the existing 3D segmentation map (Section 3.1).
… we generate surfels and fuse them into the existing reconstructed 3D map. Hence, building the 3D segmentation map includes building a reconstructed 3D map using SLAM and grouping surfels in the reconstructed 3D ma…
Figure 4: Incremental 3D Segment Clustering. This clustering associates objects of the same class or discovers new classes using object-level segments in the 3D segmentation map (Section 3.2).
… improve the robustness of the features in the 3D segmentation map. Moreover, storing/updating the features for each segment is a very effective strategy for both saving memory usage and reducing computations for 3D …
Figure 5: Qualitative results of dense 3D incremental semantic…
Figure 6: Qualitative results of the 3D segmentation map. The pro…
Figure 8: Processing time for each frame of the sequence
Figure 9: Comparison of memory usage for storing seman…
Original abstract

This work addresses the task of open world semantic segmentation using RGBD sensing to discover new semantic classes over time. Although there are many types of objects in the real-word, current semantic segmentation methods make a closed world assumption and are trained only to segment a limited number of object classes. Towards a more open world approach, we propose a novel method that incrementally learns new classes for image segmentation. The proposed system first segments each RGBD frame using both color and geometric information, and then aggregates that information to build a single segmented dense 3D map of the environment. The segmented 3D map representation is a key component of our approach as it is used to discover new object classes by identifying coherent regions in the 3D map that have no semantic label. The use of coherent region in the 3D map as a primitive element, rather than traditional elements such as surfels or voxels, also significantly reduces the computational complexity and memory use of our method. It thus leads to semi-real-time performance at {10.7}Hz when incrementally updating the dense 3D map at every frame. Through experiments on the NYUDv2 dataset, we demonstrate that the proposed method is able to correctly cluster objects of both known and unseen classes. We also show the quantitative comparison with the state-of-the-art supervised methods, the processing time of each step, and the influences of each component.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 3 minor

Summary. The manuscript proposes an incremental open-world semantic segmentation pipeline for RGBD data. Per-frame segmentation combines color and geometric cues; information is fused into a dense 3D map whose unlabeled coherent regions are treated as candidate instances of unseen classes. The 3D coherent-region primitive is claimed to reduce complexity, enabling 10.7 Hz incremental updates. Experiments on NYUDv2 are said to demonstrate correct clustering of both known and unseen classes together with runtime breakdowns and comparisons against supervised baselines.

Significance. If the discovery mechanism proves reliable, the work would contribute a practical route toward open-world mapping in robotics without requiring retraining for every new class. The efficiency argument for coherent 3D regions over surfels or voxels is a concrete engineering contribution, and the reported semi-real-time rate is a verifiable strength. The absence of quantitative metrics for the unseen-class clustering step, however, prevents a full assessment of impact.

major comments (2)
  1. [Abstract / Experiments] Abstract and Experiments: the central claim that the method 'is able to correctly cluster objects of both known and unseen classes' is presented without any reported quantitative metric (purity, ARI, region-to-instance IoU, or similar) on NYUDv2 held-out classes; only qualitative demonstration, runtime, and supervised comparisons are mentioned.
  2. [Method] Method (coherent-region identification): the assumption that geometrically coherent unlabeled regions after known-class fusion correspond one-to-one with distinct semantic objects is load-bearing, yet no definition of coherence (connectivity rule, geometric threshold, handling of partial views) or ablation on fragmentation/aggregation artifacts is supplied.
minor comments (3)
  1. [Abstract] Abstract: 'real-word' should read 'real-world'.
  2. [Abstract] Abstract: the notation '{10.7}Hz' contains extraneous braces; write 10.7 Hz.
  3. [Abstract / Experiments] Abstract: the phrase 'influences of each component' is vague; the corresponding experimental subsection should list the exact ablations performed.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. The two major comments identify important gaps in quantitative evaluation and methodological detail. We address each below and will revise the manuscript to incorporate the requested additions.

  1. Referee: [Abstract / Experiments] Abstract and Experiments: the central claim that the method 'is able to correctly cluster objects of both known and unseen classes' is presented without any reported quantitative metric (purity, ARI, region-to-instance IoU, or similar) on NYUDv2 held-out classes; only qualitative demonstration, runtime, and supervised comparisons are mentioned.

    Authors: We agree this is a substantive omission. The manuscript currently supports the clustering claim only with qualitative examples on NYUDv2. In revision we will add quantitative metrics (ARI, purity, and region-to-instance IoU) computed on held-out unseen classes, together with the corresponding experimental protocol. revision: yes

  2. Referee: [Method] Method (coherent-region identification): the assumption that geometrically coherent unlabeled regions after known-class fusion correspond one-to-one with distinct semantic objects is load-bearing, yet no definition of coherence (connectivity rule, geometric threshold, handling of partial views) or ablation on fragmentation/aggregation artifacts is supplied.

    Authors: The current text describes coherent regions at a high level but does not supply the requested formal definition or ablation. We will expand the method section with the exact connectivity rule, geometric thresholds, partial-view handling, and an ablation study quantifying fragmentation and aggregation effects. revision: yes
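
Two of the metrics named in the first exchange, purity and the adjusted Rand index (ARI), can be computed from a pair of label sequences as sketched below. This is a plain-Python illustration of the standard definitions, not the evaluation code the authors would use; the function names are chosen here, and region-to-instance IoU is not covered.

```python
from collections import Counter
from math import comb

def purity(pred, truth):
    """Fraction of items whose cluster's majority ground-truth class matches their own."""
    by_cluster = {}
    for p, t in zip(pred, truth):
        by_cluster.setdefault(p, Counter())[t] += 1
    return sum(max(c.values()) for c in by_cluster.values()) / len(pred)

def adjusted_rand_index(pred, truth):
    """Chance-corrected pair-counting agreement between two labelings."""
    n = len(pred)
    if n < 2:
        return 1.0
    cells = Counter(zip(pred, truth))  # contingency table entries
    sum_cells = sum(comb(v, 2) for v in cells.values())
    sum_pred = sum(comb(v, 2) for v in Counter(pred).values())
    sum_truth = sum(comb(v, 2) for v in Counter(truth).values())
    expected = sum_pred * sum_truth / comb(n, 2)
    max_index = (sum_pred + sum_truth) / 2
    if max_index == expected:
        return 1.0  # degenerate case: both labelings are trivially identical
    return (sum_cells - expected) / (max_index - expected)

same = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])      # relabeled but identical
crossed = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])   # worse than chance
half_pure = purity([0, 0, 1, 1], [0, 1, 0, 1])
```

ARI is invariant to how cluster ids are numbered, which matters here because discovered classes carry arbitrary ids; purity alone can be inflated by over-segmentation, so the two are usually reported together.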

Circularity Check

0 steps flagged

No circularity: procedural method with no derivations or fitted predictions

The paper describes a systems-level pipeline that segments RGBD frames, aggregates them into a dense 3D map, and identifies unlabeled coherent regions for new class discovery. No equations, parameter fits, or predictions appear in the provided text. The method relies on standard segmentation and geometric coherence steps without self-definitional reductions, fitted inputs renamed as predictions, or load-bearing self-citations. Experiments on NYUDv2 provide external validation rather than internal equivalence to inputs. The derivation chain is therefore self-contained and non-circular.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract states no explicit free parameters, axioms, or invented entities; this evaluation is limited to that surface description.

pith-pipeline@v0.9.0 · 5787 in / 1005 out tokens · 17029 ms · 2026-05-24T17:21:39.724595+00:00 · methodology
