
arxiv: 2512.00995 · v4 · submitted 2025-11-30 · 💻 cs.CV

S2AM3D: Scale-controllable Part Segmentation of 3D Point Clouds

Pith reviewed 2026-05-17 02:37 UTC · model grok-4.3

classification 💻 cs.CV
keywords point cloud segmentation · part segmentation · 3D vision · scale control · contrastive learning · multi-view features · 2D priors · 3D dataset

The pith

S2AM3D merges 2D priors with 3D contrastive learning for scale-controllable part segmentation in point clouds.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper targets two limitations in part segmentation of 3D point clouds: native 3D models generalize poorly because data is scarce, and transferring 2D pre-trained knowledge produces segmentations that disagree across views. S2AM3D addresses both with a point-consistent part encoder that uses native 3D contrastive learning to aggregate multi-view 2D features into consistent point representations, and a scale-aware prompt decoder that takes a continuous scale signal to adjust segmentation granularity in real time. To support training, the authors build a dataset of more than 100,000 high-quality part-level point cloud samples. The abstract reports leading performance across multiple evaluation settings and robust handling of parts with large size variations.

Core claim

S2AM3D incorporates 2D segmentation priors with 3D consistent supervision. The point-consistent part encoder aggregates multi-view 2D features through native 3D contrastive learning to produce globally consistent point features. The scale-aware prompt decoder enables real-time adjustment of segmentation granularity via continuous scale signals. A new dataset with over 100k samples provides the supervision, and the authors report leading performance, robustness, and controllability on complex structures.
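To make the encoder's objective concrete, here is a minimal sketch of a point-wise contrastive loss in which points of the same part attract and points of different parts repel. This page does not state the paper's exact loss, so the supervised InfoNCE form, the name `part_contrastive_loss`, and the temperature value are all assumptions made for illustration.

```python
import math

def part_contrastive_loss(features, part_ids, temperature=0.1):
    """Supervised InfoNCE over points (assumed form, not the paper's).

    features: list of N feature vectors (non-zero), part_ids: list of N labels.
    """
    # L2-normalise so dot products become cosine similarities.
    unit = []
    for f in features:
        norm = math.sqrt(sum(x * x for x in f))
        unit.append([x / norm for x in f])

    losses = []
    n = len(unit)
    for i in range(n):
        others = [j for j in range(n) if j != i]
        logits = {
            j: sum(a * b for a, b in zip(unit[i], unit[j])) / temperature
            for j in others
        }
        # Stable log-sum-exp over every other point (positives and negatives).
        top = max(logits.values())
        log_denom = top + math.log(sum(math.exp(v - top) for v in logits.values()))
        positives = [j for j in others if part_ids[j] == part_ids[i]]
        if positives:  # anchors with no same-part partner are skipped
            log_probs = [logits[j] - log_denom for j in positives]
            losses.append(-sum(log_probs) / len(positives))
    return sum(losses) / len(losses)

# Toy usage: two tight clusters of 2D features, labelled by part.
feats = [[1.0, 0.01], [1.0, -0.01], [0.01, 1.0], [-0.01, 1.0]]
print(part_contrastive_loss(feats, [0, 0, 1, 1]))
```

The loss is low when features cluster by part and high when same-part points sit far apart, which is the behaviour the "globally consistent point features" claim depends on. The sketch needs at least one anchor with a same-part partner.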

What carries the argument

The point-consistent part encoder that aggregates multi-view 2D features through native 3D contrastive learning, combined with the scale-aware prompt decoder using continuous scale signals.
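The Figure 2 caption on this page describes the decoder's scale path: the scale s is mapped by a sinusoidal embedding e(s) to FiLM parameters [γ, β], which modulate the point features channel-wise. A minimal sketch of that path follows. It assumes a single linear map from e(s) to each of γ and β, an identity-centred γ, and a power-of-two frequency ladder; none of these details, nor the function names, come from the source.

```python
import math

def sinusoidal_embedding(s, dim=8):
    """Map a scalar scale s to `dim` sines and cosines (dim must be even)."""
    freqs = [2.0 ** k for k in range(dim // 2)]  # hypothetical frequency ladder
    sines = [math.sin(math.pi * f * s) for f in freqs]
    cosines = [math.cos(math.pi * f * s) for f in freqs]
    return sines + cosines

def linear(vec, weight):
    """Multiply vec (length E) by weight (E rows, D columns) -> length D."""
    n_out = len(weight[0])
    return [sum(v * row[d] for v, row in zip(vec, weight)) for d in range(n_out)]

def film_modulate(features, s, w_gamma, w_beta):
    """Channel-wise FiLM: F_tilde[n][d] = gamma[d] * F[n][d] + beta[d]."""
    e = sinusoidal_embedding(s, dim=len(w_gamma))
    gamma = [1.0 + g for g in linear(e, w_gamma)]  # assumed: centred on identity
    beta = linear(e, w_beta)
    return [
        [g * x + b for x, g, b in zip(point, gamma, beta)]
        for point in features
    ]

# Toy usage: 2 points, 3 channels, 8-dimensional scale embedding.
F = [[1.0, 2.0, 3.0], [0.5, -1.0, 0.0]]
W = [[0.1 * (r + c + 1) for c in range(3)] for r in range(8)]
print(film_modulate(F, 0.25, W, W))
```

Because γ and β depend only on s, every point receives the same per-channel scaling and shift, so changing s re-weights feature channels without touching the encoder output.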

If this is right

  • Produces globally consistent point features across views.
  • Enables real-time control over segmentation scale and granularity.
  • Improves robustness for parts with significant size variations.
  • Benefits from the new large-scale dataset for better training supervision.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar consistency mechanisms could be applied to other 3D tasks such as semantic segmentation or object classification to reduce view dependency.
  • The scale controllability opens possibilities for interactive 3D modeling tools where users adjust detail levels dynamically.
  • Releasing the dataset publicly could serve as a benchmark for future methods in controllable 3D segmentation.

Load-bearing premise

That aggregating multi-view 2D features through native 3D contrastive learning will produce globally consistent point features without discarding view-specific details needed for accurate part boundaries.

What would settle it

Demonstrating that segmentation results remain inconsistent when viewed from different angles on the same point cloud, or that scale signals fail to change the part granularity appropriately, would falsify the main claims.
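Both checks can be scripted. The sketch below compares two labelings of the same points with the Rand index, and counts how many parts a model returns at each scale. The Rand index is a stand-in measure chosen here, not one the paper is known to use, and `segment` is a hypothetical callable standing for the model.

```python
from itertools import combinations

def rand_index(labels_a, labels_b):
    """Share of point pairs that both labelings treat alike
    (together in one part, or apart in different parts)."""
    pairs = list(combinations(range(len(labels_a)), 2))
    agree = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agree / len(pairs)

def part_counts_by_scale(segment, points, scales):
    """Number of distinct parts returned at each scale value.

    `segment(points, s)` is a hypothetical model call that returns
    one part label per point.
    """
    return [len(set(segment(points, s))) for s in scales]

# Toy usage: part IDs are arbitrary, so a pure relabeling scores 1.0.
view_a = [0, 0, 1, 1]
view_b = [7, 7, 3, 3]
print(rand_index(view_a, view_b))
```

A Rand index well below 1.0 between predictions on the same cloud under different viewpoints would count against the consistency claim, and a flat list from `part_counts_by_scale` would count against the controllability claim. The pairwise loop is quadratic in the number of points, so a large cloud would need subsampling.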

Figures

Figures reproduced from arXiv: 2512.00995 by Han Su, Tianyu Huang, Wangmeng Zuo, Xiaohe Wu, Zichen Wan.

Figure 1: Paradigm comparison (left): Native 3D methods present …
Figure 2: S2AM3D pipeline. Left: under 3D supervision with contrastive learning, the input point cloud P ∈ ℝ^{N×3} is encoded into per-point features F ∈ ℝ^{N×D}. Right: given a prompt (p, s), s is mapped by a sinusoidal embedding e(s) to FiLM parameters [γ, β], which perform channel-wise modulation to obtain a scale-enhanced representation F̃; the prompt vector F̃_p is then indexed and interacts with the global feature…
Figure 3: Dataset overview: covering diverse categories and providing high-quality part-level annotations; the histogram shows the long …
Figure 4: Qualitative comparison on our curated dataset (see Sec. …
Figure 5: Qualitative comparison of full segmentation (PartObjaverse-Tiny [ …
Figure 6: Visualization of the ablation study on encoder feature …
Figure 7: Visualization of continuous scale controllability. With …
Figure 1: Additional qualitative results of S2AM3D on full segmentation and interactive segmentation on our curated dataset.

Method          Params (M)   Time (ms)
Point-SAM [7]   311          ∼5
P3-SAM [3]      112          ∼3
Ours            120          ∼3
Original abstract

Part-level point cloud segmentation has recently attracted significant attention in 3D computer vision. Nevertheless, existing research is constrained by two major challenges: native 3D models lack generalization due to data scarcity, while introducing 2D pre-trained knowledge often leads to inconsistent segmentation results across different views. To address these challenges, we propose S2AM3D, which incorporates 2D segmentation priors with 3D consistent supervision. We design a point-consistent part encoder that aggregates multi-view 2D features through native 3D contrastive learning, producing globally consistent point features. A scale-aware prompt decoder is then proposed to enable real-time adjustment of segmentation granularity via continuous scale signals. Simultaneously, we introduce a large-scale, high-quality part-level point cloud dataset with more than 100k samples, providing ample supervision signals for model training. Extensive experiments demonstrate that S2AM3D achieves leading performance across multiple evaluation settings, exhibiting exceptional robustness and controllability when handling complex structures and parts with significant size variations.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes S2AM3D for scale-controllable part segmentation of 3D point clouds. It addresses data scarcity in native 3D models and view inconsistency from 2D priors by introducing a point-consistent part encoder that aggregates multi-view 2D features via native 3D contrastive learning to produce globally consistent point features, followed by a scale-aware prompt decoder that uses continuous scale signals to adjust segmentation granularity in real time. The authors also contribute a new large-scale part-level point cloud dataset containing more than 100k samples. The central claim is that S2AM3D achieves leading performance across multiple evaluation settings while exhibiting exceptional robustness and controllability on complex structures and parts with large size variations.

Significance. If the empirical validation holds, the combination of 2D priors with explicit 3D consistency enforcement and the scale-controllable decoder represents a practical advance for part segmentation tasks where part sizes vary substantially. The new dataset with over 100k samples would constitute a substantial community resource. The approach avoids parameter-heavy fitting by relying on contrastive aggregation and prompt-based decoding rather than self-referential tuning.

major comments (2)
  1. [Abstract] Abstract: The headline claim that S2AM3D 'achieves leading performance across multiple evaluation settings' and exhibits 'exceptional robustness and controllability' is asserted without any quantitative metrics, ablation studies, baseline comparisons, or error analysis. This absence prevents verification of the data-to-claim link for the central performance assertions.
  2. [Method] Point-consistent part encoder (Method section): The native 3D contrastive learning step that aggregates multi-view 2D features must demonstrably preserve view-specific boundary cues while enforcing global consistency. If the contrastive objective inadvertently averages or suppresses fine-grained localization signals, the subsequent scale-aware decoder cannot recover the precision required for accurate delineation of parts with significant size variations, directly undermining both the accuracy and controllability claims.
minor comments (2)
  1. [Method] The description of the scale-aware prompt decoder would benefit from an explicit equation or diagram showing how the continuous scale signal is injected into the decoder layers.
  2. [Dataset] Dataset statistics (size distribution, part category balance, annotation protocol) should be reported in a dedicated table or subsection to allow readers to assess the supervision quality.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. The comments highlight opportunities to strengthen the presentation of results and clarify methodological details. We respond to each major comment below and indicate planned revisions.

Point-by-point responses
  1. Referee: [Abstract] Abstract: The headline claim that S2AM3D 'achieves leading performance across multiple evaluation settings' and exhibits 'exceptional robustness and controllability' is asserted without any quantitative metrics, ablation studies, baseline comparisons, or error analysis. This absence prevents verification of the data-to-claim link for the central performance assertions.

    Authors: We agree that the abstract would be strengthened by including key quantitative results to directly support the performance claims. The full manuscript reports extensive experiments with mIoU comparisons against multiple baselines, ablation studies on each component, and analyses of robustness across scale variations and complex structures. In the revision we will add specific metrics (e.g., leading mIoU on the 100k-sample dataset and relative gains over prior methods) to the abstract while retaining its concise style. revision: yes

  2. Referee: [Method] Point-consistent part encoder (Method section): The native 3D contrastive learning step that aggregates multi-view 2D features must demonstrably preserve view-specific boundary cues while enforcing global consistency. If the contrastive objective inadvertently averages or suppresses fine-grained localization signals, the subsequent scale-aware decoder cannot recover the precision required for accurate delineation of parts with significant size variations, directly undermining both the accuracy and controllability claims.

    Authors: We share the concern that contrastive aggregation must not erode localization. Our formulation applies the contrastive loss on point-wise features derived from multi-view 2D priors while retaining the original 3D geometric structure and local neighborhood information; the loss encourages cross-view agreement without explicit averaging of boundary signals. Ablation results in the manuscript show that disabling the contrastive term reduces boundary precision and overall mIoU, indicating that fine-grained cues are preserved. To make this explicit, we will add feature visualization comparisons and boundary-specific error metrics in the revised method and experiments sections. revision: partial
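Both responses rest on mIoU, so it helps to pin down what that number measures. This page does not give the paper's evaluation protocol. The sketch below assumes a common class-agnostic variant in which each ground-truth part is scored by its best-overlapping predicted part, and the function name `mean_iou` is illustrative.

```python
def mean_iou(pred, gt):
    """Class-agnostic mIoU: for each ground-truth part, take the best IoU
    over all predicted parts, then average over ground-truth parts."""
    pred_parts, gt_parts = {}, {}
    for idx, (p, g) in enumerate(zip(pred, gt)):
        pred_parts.setdefault(p, set()).add(idx)
        gt_parts.setdefault(g, set()).add(idx)

    best = []
    for g_pts in gt_parts.values():
        best.append(max(
            len(g_pts & p_pts) / len(g_pts | p_pts)
            for p_pts in pred_parts.values()
        ))
    return sum(best) / len(best)

# Toy usage: the prediction merges two ground-truth parts into one.
print(mean_iou(pred=[0, 0, 0, 0], gt=[0, 0, 1, 1]))
```

Under this variant, over-merging and over-splitting both lower the score, which is why a scale signal that sets granularity correctly would show up directly in mIoU.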

Circularity Check

0 steps flagged

No significant circularity in S2AM3D method or claims

Full rationale

The paper introduces an architectural pipeline consisting of a point-consistent part encoder that aggregates multi-view 2D features via native 3D contrastive learning and a scale-aware prompt decoder controlled by continuous scale signals, together with a newly collected dataset exceeding 100k samples. Performance claims of leading results and robustness are presented as outcomes of extensive experiments on multiple evaluation settings rather than any derivation that reduces by construction to fitted parameters, self-definitions, or load-bearing self-citations. No equations or steps in the abstract or described components exhibit self-referential fitting or renaming of known results; the central claims rest on independent design choices and empirical validation.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Review is limited to the abstract; no explicit free parameters, axioms, or invented physical entities are described. The scale signal functions as a controllable input rather than a fitted constant, and contrastive learning is treated as a standard technique.

pith-pipeline@v0.9.0 · 5486 in / 1296 out tokens · 63093 ms · 2026-05-17T02:37:29.101308+00:00 · methodology



Reference graph

Works this paper leans on

7 extracted references · 7 canonical work pages

  [1] Martin Ester, Hans-Peter Kriegel, Jörg Sander, Xiaowei Xu, et al. A density-based algorithm for discovering clusters in large spatial databases with noise. In KDD, pages 226–231, 1996.

  [2] Minghua Liu, Yinhao Zhu, Hong Cai, Shizhong Han, Zhan Ling, Fatih Porikli, and Hao Su. PartSLIP: Low-shot part segmentation for 3D point clouds via pretrained image-language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21736–21746, 2023.

  [3] Changfeng Ma, Yang Li, Xinhao Yan, Jiachen Xu, Yunhan Yang, Chunshi Wang, Zibo Zhao, Yanwen Guo, Zhuo Chen, and Chunchao Guo. P3-SAM: Native 3D part segmentation. arXiv preprint arXiv:2509.06784, 2025.

  [4] Ethan Perez, Florian Strub, Harm De Vries, Vincent Dumoulin, and Aaron Courville. FiLM: Visual reasoning with a general conditioning layer. In Proceedings of the AAAI Conference on Artificial Intelligence, 2018.

  [5] Charles Ruizhongtai Qi, Li Yi, Hao Su, and Leonidas J Guibas. PointNet++: Deep hierarchical feature learning on point sets in a metric space. Advances in Neural Information Processing Systems, 30, 2017.

  [6] Yunhan Yang, Yukun Huang, Yuan-Chen Guo, Liangjun Lu, Xiaoyang Wu, Edmund Y Lam, Yan-Pei Cao, and Xihui Liu. SAMPart3D: Segment any part in 3D objects. arXiv preprint arXiv:2411.07184, 2024.

  [7] Yuchen Zhou, Jiayuan Gu, Tung Yen Chiang, Fanbo Xiang, and Hao Su. Point-SAM: Promptable 3D segmentation model for point clouds. arXiv preprint arXiv:2406.17741, 2024.