S2AM3D: Scale-controllable Part Segmentation of 3D Point Clouds
Pith reviewed 2026-05-17 02:37 UTC · model grok-4.3
The pith
S2AM3D merges 2D priors with 3D contrastive learning for scale-controllable part segmentation in point clouds.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
S2AM3D incorporates 2D segmentation priors with 3D consistent supervision. The point-consistent part encoder aggregates multi-view 2D features through native 3D contrastive learning to produce globally consistent point features. The scale-aware prompt decoder enables real-time adjustment of segmentation granularity via continuous scale signals. A new dataset with over 100k samples provides the training supervision, and the authors report leading performance together with strong robustness and controllability on complex structures and parts with significant size variations.
What carries the argument
The point-consistent part encoder that aggregates multi-view 2D features through native 3D contrastive learning, combined with the scale-aware prompt decoder using continuous scale signals.
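To make the contrastive step concrete, here is a minimal sketch of a point-wise supervised contrastive loss. It is a generic InfoNCE-style formulation, not the paper's exact objective, which this page does not spell out. The function name `point_contrastive_loss`, the `temperature` value, and the use of part labels to define positives are all assumptions made for illustration.

```python
import math


def point_contrastive_loss(features, part_ids, temperature=0.1):
    """Generic supervised InfoNCE-style loss over per-point features.

    features: list of equal-length float vectors, one per point
              (assumed to be the aggregated multi-view features).
    part_ids: list of part labels, one per point (assumed supervision).
    """
    # L2-normalise so that dot products are cosine similarities.
    normed = []
    for f in features:
        norm = math.sqrt(sum(x * x for x in f)) or 1.0
        normed.append([x / norm for x in f])

    total, anchors = 0.0, 0
    for i, fi in enumerate(normed):
        logits, positives = [], []
        for j, fj in enumerate(normed):
            if i == j:
                continue
            logits.append(sum(a * b for a, b in zip(fi, fj)) / temperature)
            positives.append(part_ids[i] == part_ids[j])
        if not any(positives):
            continue  # no same-part partner for this anchor
        # Numerically stable log-sum-exp over all other points.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        pos_terms = [l - log_denom for l, p in zip(logits, positives) if p]
        total += -sum(pos_terms) / len(pos_terms)
        anchors += 1
    return total / anchors if anchors else 0.0
```

The loss is low when points of the same part have similar features and high when features contradict the part labels, which is the sense in which such an objective pushes toward part-consistent point features.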
If this is right
- Produces globally consistent point features across views.
- Enables real-time control over segmentation scale and granularity.
- Improves robustness for parts with significant size variations.
- Benefits from the new large-scale dataset for better training supervision.
Where Pith is reading between the lines
- Similar consistency mechanisms could be applied to other 3D tasks such as semantic segmentation or object classification to reduce view dependency.
- The scale controllability opens possibilities for interactive 3D modeling tools where users adjust detail levels dynamically.
- Releasing the dataset publicly could serve as a benchmark for future methods in controllable 3D segmentation.
Load-bearing premise
That aggregating multi-view 2D features through native 3D contrastive learning will produce globally consistent point features without discarding view-specific details needed for accurate part boundaries.
What would settle it
Demonstrating that segmentation results remain inconsistent when viewed from different angles on the same point cloud, or that scale signals fail to change the part granularity appropriately, would falsify the main claims.
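Both falsification tests can be phrased as simple checks. The sketch below is one possible way to do so, under assumptions: `pairwise_agreement` compares two labelings of the same points (for example, predictions obtained from two different view sets) using pairwise co-membership, which does not depend on how part IDs are numbered; `part_counts_by_scale` takes a hypothetical callable `segment(scale)` that returns one label per point. Neither function comes from the paper.

```python
from itertools import combinations


def pairwise_agreement(labels_a, labels_b):
    """Fraction of point pairs on which two labelings agree about
    whether the two points belong to the same part (Rand index)."""
    if len(labels_a) != len(labels_b):
        raise ValueError("labelings must cover the same points")
    pairs = list(combinations(range(len(labels_a)), 2))
    if not pairs:
        return 1.0
    agree = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agree / len(pairs)


def part_counts_by_scale(segment, scales):
    """Number of distinct parts produced at each scale value.

    segment: hypothetical callable mapping a scale to per-point labels.
    """
    return [len(set(segment(s))) for s in scales]
```

An agreement score well below 1.0 across view sets would indicate view inconsistency, and a part count that stays flat across the scale range would indicate that the scale signal is not controlling granularity. The pairwise loop is quadratic in the number of points, so a real evaluation would subsample pairs.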
Original abstract
Part-level point cloud segmentation has recently attracted significant attention in 3D computer vision. Nevertheless, existing research is constrained by two major challenges: native 3D models lack generalization due to data scarcity, while introducing 2D pre-trained knowledge often leads to inconsistent segmentation results across different views. To address these challenges, we propose S2AM3D, which incorporates 2D segmentation priors with 3D consistent supervision. We design a point-consistent part encoder that aggregates multi-view 2D features through native 3D contrastive learning, producing globally consistent point features. A scale-aware prompt decoder is then proposed to enable real-time adjustment of segmentation granularity via continuous scale signals. Simultaneously, we introduce a large-scale, high-quality part-level point cloud dataset with more than 100k samples, providing ample supervision signals for model training. Extensive experiments demonstrate that S2AM3D achieves leading performance across multiple evaluation settings, exhibiting exceptional robustness and controllability when handling complex structures and parts with significant size variations.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes S2AM3D for scale-controllable part segmentation of 3D point clouds. It addresses data scarcity in native 3D models and view inconsistency from 2D priors by introducing a point-consistent part encoder that aggregates multi-view 2D features via native 3D contrastive learning to produce globally consistent point features, followed by a scale-aware prompt decoder that uses continuous scale signals to adjust segmentation granularity in real time. The authors also contribute a new large-scale part-level point cloud dataset containing more than 100k samples. The central claim is that S2AM3D achieves leading performance across multiple evaluation settings while exhibiting exceptional robustness and controllability on complex structures and parts with large size variations.
Significance. If the empirical validation holds, the combination of 2D priors with explicit 3D consistency enforcement and the scale-controllable decoder represents a practical advance for part segmentation tasks where part sizes vary substantially. The new dataset with over 100k samples would constitute a substantial community resource. The approach avoids parameter-heavy fitting by relying on contrastive aggregation and prompt-based decoding rather than self-referential tuning.
major comments (2)
- [Abstract] The headline claim that S2AM3D 'achieves leading performance across multiple evaluation settings' and exhibits 'exceptional robustness and controllability' is asserted without any quantitative metrics, ablation studies, baseline comparisons, or error analysis. This absence prevents verification of the data-to-claim link for the central performance assertions.
- [Method] Point-consistent part encoder (Method section): The native 3D contrastive learning step that aggregates multi-view 2D features must demonstrably preserve view-specific boundary cues while enforcing global consistency. If the contrastive objective inadvertently averages or suppresses fine-grained localization signals, the subsequent scale-aware decoder cannot recover the precision required for accurate delineation of parts with significant size variations, directly undermining both the accuracy and controllability claims.
minor comments (2)
- [Method] The description of the scale-aware prompt decoder would benefit from an explicit equation or diagram showing how the continuous scale signal is injected into the decoder layers.
- [Dataset] Dataset statistics (size distribution, part category balance, annotation protocol) should be reported in a dedicated table or subsection to allow readers to assess the supervision quality.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. The comments highlight opportunities to strengthen the presentation of results and clarify methodological details. We respond to each major comment below and indicate planned revisions.
- Referee: [Abstract] The headline claim that S2AM3D 'achieves leading performance across multiple evaluation settings' and exhibits 'exceptional robustness and controllability' is asserted without any quantitative metrics, ablation studies, baseline comparisons, or error analysis. This absence prevents verification of the data-to-claim link for the central performance assertions.
  Authors: We agree that the abstract would be strengthened by including key quantitative results to directly support the performance claims. The full manuscript reports extensive experiments with mIoU comparisons against multiple baselines, ablation studies on each component, and analyses of robustness across scale variations and complex structures. In the revision we will add specific metrics (e.g., leading mIoU on the 100k-sample dataset and relative gains over prior methods) to the abstract while retaining its concise style.
  Revision: yes
- Referee: [Method] Point-consistent part encoder (Method section): The native 3D contrastive learning step that aggregates multi-view 2D features must demonstrably preserve view-specific boundary cues while enforcing global consistency. If the contrastive objective inadvertently averages or suppresses fine-grained localization signals, the subsequent scale-aware decoder cannot recover the precision required for accurate delineation of parts with significant size variations, directly undermining both the accuracy and controllability claims.
  Authors: We share the concern that contrastive aggregation must not erode localization. Our formulation applies the contrastive loss on point-wise features derived from multi-view 2D priors while retaining the original 3D geometric structure and local neighborhood information; the loss encourages cross-view agreement without explicit averaging of boundary signals. Ablation results in the manuscript show that disabling the contrastive term reduces boundary precision and overall mIoU, indicating that fine-grained cues are preserved. To make this explicit, we will add feature visualization comparisons and boundary-specific error metrics in the revised method and experiments sections.
  Revision: partial
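Since both responses lean on mIoU, a small sketch of how a class-agnostic part mIoU can be computed may help readers judge the promised numbers. The matching rule used here (each ground-truth part is scored by its best-overlapping predicted part) is an assumption; the page does not state the paper's evaluation protocol, and `part_miou` is a hypothetical name.

```python
def part_miou(pred_labels, gt_labels):
    """Mean over ground-truth parts of the best IoU with any predicted part.

    pred_labels, gt_labels: one label per point, same length.
    """
    if len(pred_labels) != len(gt_labels) or not gt_labels:
        raise ValueError("need two non-empty labelings of the same points")

    # Group point indices by label for both labelings.
    pred_sets, gt_sets = {}, {}
    for i, (p, g) in enumerate(zip(pred_labels, gt_labels)):
        pred_sets.setdefault(p, set()).add(i)
        gt_sets.setdefault(g, set()).add(i)

    ious = []
    for gt_idx in gt_sets.values():
        best = 0.0
        for pred_idx in pred_sets.values():
            inter = len(gt_idx & pred_idx)
            union = len(gt_idx | pred_idx)
            best = max(best, inter / union)
        ious.append(best)
    return sum(ious) / len(ious)
```

Because predicted part IDs are arbitrary in class-agnostic segmentation, some matching step like this is needed before IoU is meaningful; other protocols (for example, one-to-one Hungarian matching) would give different numbers.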
Circularity Check
No significant circularity in S2AM3D method or claims
The paper introduces an architectural pipeline consisting of a point-consistent part encoder that aggregates multi-view 2D features via native 3D contrastive learning and a scale-aware prompt decoder controlled by continuous scale signals, together with a newly collected dataset exceeding 100k samples. Performance claims of leading results and robustness are presented as outcomes of extensive experiments on multiple evaluation settings rather than any derivation that reduces by construction to fitted parameters, self-definitions, or load-bearing self-citations. No equations or steps in the abstract or described components exhibit self-referential fitting or renaming of known results; the central claims rest on independent design choices and empirical validation.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Passage: point-consistent part encoder that aggregates multi-view 2D features through native 3D contrastive learning... scale-aware prompt decoder... sinusoidal embedding e(s) = [sin(ω_k s + ϕ_k), cos(ω_k s + ϕ_k)]... FiLM(X; s) = X ⊙ (1 + α γ) + α β
- IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean, alpha_pin_under_high_calibration (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Passage: scale modulator... continuous scale signals... bi-directional cross-attention
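The scale-conditioning formulas quoted in the first passage, e(s) = [sin(ω_k s + ϕ_k), cos(ω_k s + ϕ_k)] and FiLM(X; s) = X ⊙ (1 + α γ) + α β, can be sketched in a few lines. This is an illustration under assumptions, not the paper's implementation: the frequencies and phases are placeholders, the interleaved sin/cos ordering is a guess, and the linear map from e(s) to γ and β (`linear`, with hypothetical weights) is assumed because the page does not say how γ and β are produced.

```python
import math


def scale_embedding(s, omegas, phis):
    """e(s) = [sin(w_k s + phi_k), cos(w_k s + phi_k)] for each k."""
    emb = []
    for w, p in zip(omegas, phis):
        emb.append(math.sin(w * s + p))
        emb.append(math.cos(w * s + p))
    return emb


def linear(weights, bias, x):
    """Plain affine map; stands in for an assumed learned projection."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def film(X, gamma, beta, alpha):
    """FiLM(X; s) = X * (1 + alpha * gamma) + alpha * beta, per channel.

    X: list of feature rows; gamma, beta: one value per channel.
    """
    return [[x * (1 + alpha * g) + alpha * b
             for x, g, b in zip(row, gamma, beta)]
            for row in X]


# Usage with placeholder (hypothetical) parameters.
emb = scale_embedding(0.5, omegas=[1.0, 2.0], phis=[0.0, 0.0])
gamma = linear([[0.1, 0.0, 0.0, 0.0], [0.0, 0.1, 0.0, 0.0]], [0.0, 0.0], emb)
beta = linear([[0.0, 0.0, 0.1, 0.0], [0.0, 0.0, 0.0, 0.1]], [0.0, 0.0], emb)
modulated = film([[1.0, 2.0], [3.0, 4.0]], gamma, beta, alpha=1.0)
```

With α = 0 the modulation reduces to the identity, so α acts as a gate on how strongly the scale signal alters the features.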
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1] Martin Ester, Hans-Peter Kriegel, Jörg Sander, Xiaowei Xu, et al. A density-based algorithm for discovering clusters in large spatial databases with noise. In KDD, pages 226–231, 1996.
- [2] Minghua Liu, Yinhao Zhu, Hong Cai, Shizhong Han, Zhan Ling, Fatih Porikli, and Hao Su. PartSLIP: Low-shot part segmentation for 3D point clouds via pretrained image-language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21736–21746, 2023.
- [3] Changfeng Ma, Yang Li, Xinhao Yan, Jiachen Xu, Yunhan Yang, Chunshi Wang, Zibo Zhao, Yanwen Guo, Zhuo Chen, and Chunchao Guo. P3-SAM: Native 3D part segmentation. arXiv preprint arXiv:2509.06784, 2025.
- [4] Ethan Perez, Florian Strub, Harm De Vries, Vincent Dumoulin, and Aaron Courville. FiLM: Visual reasoning with a general conditioning layer. In Proceedings of the AAAI Conference on Artificial Intelligence, 2018.
- [5] Charles Ruizhongtai Qi, Li Yi, Hao Su, and Leonidas J Guibas. PointNet++: Deep hierarchical feature learning on point sets in a metric space. Advances in Neural Information Processing Systems, 30, 2017.
- [6] Yunhan Yang, Yukun Huang, Yuan-Chen Guo, Liangjun Lu, Xiaoyang Wu, Edmund Y Lam, Yan-Pei Cao, and Xihui Liu. SAMPart3D: Segment any part in 3D objects. arXiv preprint arXiv:2411.07184, 2024.
- [7] Yuchen Zhou, Jiayuan Gu, Tung Yen Chiang, Fanbo Xiang, and Hao Su. Point-SAM: Promptable 3D segmentation model for point clouds. arXiv preprint arXiv:2406.17741, 2024.