UniD-Shift: Towards Unified Semantic Segmentation via Interpretable Share-Private Multimodal Decomposition
Pith reviewed 2026-05-11 01:43 UTC · model grok-4.3
The pith
Decomposing 2D image and 3D point cloud features into shared semantic and private modality-specific subspaces unifies cross-modal fusion and improves accuracy.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We present UniD-Shift, a unified multimodal framework for joint 2D-3D semantic segmentation. Features from a SAM-based vision encoder and an SPTNet-based geometric encoder are decomposed into shared subspaces that summarize common semantic factors and private subspaces that preserve modality-specific properties. A lightweight attention-based fusion module aggregates the shared features, while a regularized training objective enforces semantic alignment and subspace independence, producing improved segmentation on SemanticKITTI and nuScenes benchmarks along with strong cross-domain generalization on nuScenes USA-Singapore.
What carries the argument
The interpretable share-private multimodal decomposition, which separates common semantic factors from modality-unique properties to support cross-modal alignment and fusion.
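The mechanism can be sketched in a few lines. Everything below is illustrative: the dimensions, the random stand-in projections, the MSE alignment term, the cross-covariance independence penalty, and the 0.1 loss weight are assumptions for exposition, not details taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions (hypothetical; not from the paper).
N, D, K = 128, 64, 16              # tokens, feature dim, subspace dim
f2d = rng.standard_normal((N, D))  # stand-in for SAM-based image features
f3d = rng.standard_normal((N, D))  # stand-in for SPTNet-based point features

# Linear projections onto shared and private subspaces
# (random stand-ins for what training would actually learn).
W_sh_2d, W_pr_2d = rng.standard_normal((D, K)), rng.standard_normal((D, K))
W_sh_3d, W_pr_3d = rng.standard_normal((D, K)), rng.standard_normal((D, K))

s2d, p2d = f2d @ W_sh_2d, f2d @ W_pr_2d
s3d, p3d = f3d @ W_sh_3d, f3d @ W_pr_3d

# Semantic alignment: shared components of both modalities should agree.
align_loss = np.mean((s2d - s3d) ** 2)

def cross_cov_penalty(a, b):
    """Squared Frobenius norm of the cross-covariance; zero exactly when
    the centered components are uncorrelated."""
    a = a - a.mean(axis=0)
    b = b - b.mean(axis=0)
    return np.sum((a.T @ b / len(a)) ** 2)

# Subspace independence: shared and private parts should be decorrelated.
indep_loss = cross_cov_penalty(s2d, p2d) + cross_cov_penalty(s3d, p3d)

total = align_loss + 0.1 * indep_loss  # 0.1 is an assumed weight
```

The two regularizers mirror the paper's stated objective (alignment plus independence); the actual loss functions and weights may differ.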
If this is right
- Segmentation accuracy rises consistently over representative multimodal baselines on SemanticKITTI and nuScenes.
- Performance remains stable under distribution shifts in cross-domain evaluation on nuScenes USA-Singapore.
- Computational efficiency stays competitive because the fusion module is kept lightweight.
- The explicit decomposition yields interpretable separation between shared semantics and modality-specific details.
Where Pith is reading between the lines
- The same share-private split could be tested on additional sensor pairs such as radar and camera to check whether common semantics transfer across more modalities.
- Visual inspection of the learned shared subspace might identify which object categories align most reliably between 2D and 3D, informing sensor placement choices.
- If private components turn out to be indispensable, the work implies that fully shared multimodal representations have inherent limits for segmentation tasks.
- The regularization for subspace independence could be reused in other multimodal settings where alignment and uniqueness must both be preserved.
Load-bearing premise
That features learned from 2D images and 3D point clouds contain common semantic content that can be cleanly separated into independent shared and private subspaces without losing essential information.
What would settle it
An ablation experiment that removes the share-private decomposition, replaces it with direct feature concatenation, and shows no accuracy gain or a performance drop on the SemanticKITTI or nuScenes validation sets.
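The contrast this ablation targets can be sketched as two fusion variants that share identical encoders. The shapes and the averaged-shared fusion below are hypothetical placeholders for the paper's attention module:

```python
import numpy as np

rng = np.random.default_rng(3)
f2d, f3d = rng.standard_normal((8, 32)), rng.standard_normal((8, 32))

# Baseline variant: direct concatenation of the full features.
fused_concat = np.concatenate([f2d, f3d], axis=1)        # shape (8, 64)

# Decomposition variant: fuse only the shared components
# (random projections stand in for the learned ones; a simple
# average stands in for the attention-based aggregator).
W2, W3 = rng.standard_normal((32, 16)), rng.standard_normal((32, 16))
fused_shared = (f2d @ W2 + f3d @ W3) / 2                 # shape (8, 16)
```

Training both variants under the same schedule and comparing validation mIoU would isolate the decomposition's contribution.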
Original abstract
Semantic segmentation of large-scale 3D point clouds is crucial for applications such as autonomous driving and urban digital twins. However, the sparse sampling pattern of LiDAR and the view-dependent geometric distortion in image observations complicate cross-modal alignment and hinder stable fusion. Inspired by the fact that 2D images captured by cameras are representations of the 3D world, we recognize that the features learned from 2D and 3D segmentation share some common semantics, while other aspects remain modality-specific. This insight motivates a unified multimodal framework for joint 2D-3D semantic segmentation. We combine a SAM-based vision encoder with a SPTNet-based geometric encoder to extract complementary semantic and geometric representations. The resulting features from both modalities are explicitly decomposed into shared and private subspaces, where the shared components summarize semantic factors common to both domains, and the private components preserve properties that are unique to each modality. A lightweight attention-based fusion module aggregates the shared features into a consistent cross-modal representation, and a regularized training objective ensures both semantic alignment and subspace independence. Experiments on the SemanticKITTI and nuScenes benchmarks demonstrate consistent improvements in segmentation accuracy over representative multimodal baselines, accompanied by competitive computational efficiency. Cross-domain evaluation on nuScenes USA-Singapore shows stable performance under distribution shifts, demonstrating strong generalization. The implementation code is publicly available at: https://github.com/shuaizhang69/UniD-Shift.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces UniD-Shift, a unified 2D-3D semantic segmentation framework that extracts features via a SAM-based vision encoder and SPTNet-based geometric encoder, explicitly decomposes them into shared (common semantics) and private (modality-specific) subspaces, applies a lightweight attention-based fusion module on the shared components, and optimizes with a regularized objective enforcing alignment and subspace independence. It reports consistent accuracy gains over multimodal baselines on SemanticKITTI and nuScenes, competitive efficiency, and stable cross-domain performance on nuScenes USA-Singapore splits, with public code released.
Significance. If the share-private decomposition can be shown to drive the reported gains and to produce genuinely independent subspaces, the work would provide a principled, interpretable mechanism for cross-modal fusion that addresses view-dependent distortions and sparse sampling in LiDAR-image pairs. The public implementation and cross-domain stability results would be concrete strengths for reproducibility and generalization claims in autonomous-driving segmentation.
major comments (3)
- [Experiments] Experiments section: the manuscript reports accuracy improvements over representative multimodal baselines but provides no ablation that removes the share-private decomposition (or the independence regularizer) while retaining the same SAM/SPTNet encoders and attention aggregator. Without this isolation, gains cannot be attributed to the proposed mechanism rather than backbone choice or fusion architecture, which is load-bearing for the central claim.
- [§3.2] §3.2 (decomposition and objective): no quantitative diagnostics are reported to verify that the learned subspaces actually separate as intended (e.g., subspace correlation, mutual information between shared and private components, or reconstruction fidelity after decomposition). The claim of “interpretable” and “independent” subspaces therefore rests on the regularizer alone without empirical confirmation.
- [Cross-domain evaluation] Cross-domain evaluation (nuScenes USA-Singapore): the stability result is presented as evidence of strong generalization, yet the paper does not analyze whether the shared subspace remains consistent across domains or whether private components absorb the shift; this leaves the mechanism’s contribution to robustness unverified.
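The separation diagnostics requested in the second comment can be sketched on toy features; the shapes, the injected 0.05 entanglement, and the least-squares decoder are all hypothetical (a mutual-information estimate would need an additional estimator and is omitted here):

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy decomposed features (stand-ins for UniD-Shift outputs).
N, K = 256, 16
shared = rng.standard_normal((N, K))
private = 0.05 * shared + rng.standard_normal((N, K))  # mildly entangled

# (i) Mean absolute correlation between shared and private dimensions.
sc = (shared - shared.mean(0)) / shared.std(0)
pc = (private - private.mean(0)) / private.std(0)
corr = np.abs(sc.T @ pc / N).mean()

# (iii) Reconstruction fidelity: how well the concatenated subspaces
# recover a reference feature via a least-squares decoder.
original = rng.standard_normal((N, 32))  # stand-in source features
Z = np.concatenate([shared, private], axis=1)
W, *_ = np.linalg.lstsq(Z, original, rcond=None)
rel_err = np.linalg.norm(Z @ W - original) / np.linalg.norm(original)
```

A low `corr` together with a low `rel_err` on real features would support the independence and no-information-loss claims simultaneously.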
minor comments (2)
- [§1] The abstract and §1 state that 2D and 3D features “share some common semantics” but do not cite prior empirical evidence or provide a motivating figure; a short related-work paragraph or illustrative example would strengthen the motivation.
- [§3] Notation for the shared/private projections and the attention aggregator is introduced without an explicit table of symbols; adding one would improve readability for readers unfamiliar with the decomposition.
Simulated Author's Rebuttal
Thank you for the constructive feedback on our manuscript. We appreciate the referee's identification of areas where additional evidence would strengthen the central claims regarding the share-private decomposition. We address each major comment below and will revise the manuscript accordingly to provide the requested isolations and diagnostics.
Point-by-point responses
- Referee: [Experiments] Experiments section: the manuscript reports accuracy improvements over representative multimodal baselines but provides no ablation that removes the share-private decomposition (or the independence regularizer) while retaining the same SAM/SPTNet encoders and attention aggregator. Without this isolation, gains cannot be attributed to the proposed mechanism rather than backbone choice or fusion architecture, which is load-bearing for the central claim.
Authors: We agree that an explicit ablation isolating the contribution of the share-private decomposition is necessary to attribute performance gains specifically to this mechanism. In the revised manuscript, we will add experiments that retain the identical SAM and SPTNet encoders along with the attention-based aggregator but replace the decomposition step with direct fusion of the full 2D and 3D features. This will allow direct comparison to the full UniD-Shift pipeline and quantify the incremental benefit of the decomposition and regularizer. revision: yes
- Referee: [§3.2] §3.2 (decomposition and objective): no quantitative diagnostics are reported to verify that the learned subspaces actually separate as intended (e.g., subspace correlation, mutual information between shared and private components, or reconstruction fidelity after decomposition). The claim of “interpretable” and “independent” subspaces therefore rests on the regularizer alone without empirical confirmation.
Authors: We acknowledge that empirical verification of subspace separation would provide stronger support for the interpretability and independence claims. In the revision, we will report quantitative diagnostics including (i) average correlation coefficients between shared and private subspaces, (ii) estimated mutual information between the components, and (iii) reconstruction fidelity metrics when reconstructing original features from the decomposed subspaces. These will be added to §3.2 and the experimental section. revision: yes
- Referee: [Cross-domain evaluation] Cross-domain evaluation (nuScenes USA-Singapore): the stability result is presented as evidence of strong generalization, yet the paper does not analyze whether the shared subspace remains consistent across domains or whether private components absorb the shift; this leaves the mechanism’s contribution to robustness unverified.
Authors: We recognize that the current cross-domain results demonstrate stability but do not directly verify the role of the shared subspace in achieving robustness. In the revised manuscript, we will add analysis of the cross-domain behavior, including quantitative similarity measures (e.g., cosine similarity or correlation) of the shared features across the USA and Singapore splits, as well as qualitative comparisons showing how private components capture domain-specific variations while the shared subspace remains consistent. revision: yes
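The cross-domain consistency measure the authors commit to can be sketched as per-class cosine similarity of shared features between the two splits. The class count, feature dimension, and simulated 0.1 domain shift are hypothetical; real values would come from the trained model:

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical per-class mean shared features from two domains
# (e.g. the USA and Singapore splits of nuScenes).
C, K = 10, 16
usa = rng.standard_normal((C, K))
sgp = usa + 0.1 * rng.standard_normal((C, K))  # small simulated shift

def cosine(a, b):
    """Row-wise cosine similarity between two (C, K) matrices."""
    return np.sum(a * b, axis=1) / (
        np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1))

per_class = cosine(usa, sgp)  # one similarity per semantic class
```

Similarities near 1 across classes would indicate the shared subspace stays consistent under the shift, while a matching analysis on private components should show the opposite pattern if they absorb domain-specific variation.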
Circularity Check
No circularity: novel decomposition framework with independent empirical validation on benchmarks
full rationale
The paper introduces a multimodal decomposition into shared and private subspaces using SAM and SPTNet encoders, followed by attention fusion and a regularized objective. No equations or claims reduce by construction to fitted parameters, self-citations, or prior ansatzes from the same authors. The derivation chain consists of standard feature extraction, explicit subspace separation motivated by domain insight, and fusion, all validated externally on SemanticKITTI, nuScenes, and cross-domain splits without tautological reductions. Self-citations are absent from load-bearing steps, and the method does not rename known results or import uniqueness theorems.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Features learned from 2D and 3D segmentation share some common semantics, while other aspects remain modality-specific.
invented entities (1)
- Shared and private subspaces (no independent evidence)
Reference graph
Works this paper leans on
-
[1]
Zhaochong An, Guolei Sun, Yun Liu, Runjia Li, Min Wu, Ming-Ming Cheng, Ender Konukoglu, and Serge Belongie. Multimodality helps few-shot 3d point cloud semantic seg- mentation.arXiv preprint arXiv:2410.22489, 2024. 2
-
[2]
Generalized few-shot 3d point cloud segmentation with vision-language model
Zhaochong An, Guolei Sun, Yun Liu, Runjia Li, Junlin Han, Ender Konukoglu, and Serge Belongie. Generalized few-shot 3d point cloud segmentation with vision-language model. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16997– 17007, 2025. 2
work page 2025
-
[3]
Rangevit: Towards vision transformers for 3d semantic segmentation in au- tonomous driving
Angelika Ando, Spyros Gidaris, Andrei Bursuc, Gilles Puy, Alexandre Boulch, and Renaud Marlet. Rangevit: Towards vision transformers for 3d semantic segmentation in au- tonomous driving. InProceedings of the IEEE/CVF con- ference on computer vision and pattern recognition, pages 5240–5250, 2023. 5
work page 2023
-
[4]
Se- mantickitti: A dataset for semantic scene understanding of lidar sequences
Jens Behley, Martin Garbade, Andres Milioto, Jan Quen- zel, Sven Behnke, Cyrill Stachniss, and Jurgen Gall. Se- mantickitti: A dataset for semantic scene understanding of lidar sequences. InProceedings of the IEEE/CVF inter- national conference on computer vision, pages 9297–9307,
-
[5]
nuscenes: A multi- modal dataset for autonomous driving
Holger Caesar, Varun Bankiti, Alex H Lang, Sourabh V ora, Venice Erin Liong, Qiang Xu, Anush Krishnan, Yu Pan, Gi- ancarlo Baldan, and Oscar Beijbom. nuscenes: A multi- modal dataset for autonomous driving. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 11621–11631, 2020. 5, 6
work page 2020
-
[6]
Mopa: Multi-modal prior aided domain adaptation for 3d semantic segmentation
Haozhi Cao, Yuecong Xu, Jianfei Yang, Pengyu Yin, Sheng- hai Yuan, and Lihua Xie. Mopa: Multi-modal prior aided domain adaptation for 3d semantic segmentation. In2024 IEEE International Conference on Robotics and Automation (ICRA), pages 9463–9470. IEEE, 2024. 3
work page 2024
-
[7]
Adriano Cardace, Pierluigi Zama Ramirez, Samuele Salti, and Luigi Di Stefano. Exploiting the complementarity of 2d and 3d networks to address domain-shift in 3d semantic segmentation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 98–109,
-
[8]
Jun Cen, Shiwei Zhang, Yixuan Pei, Kun Li, Hang Zheng, Maochun Luo, Yingya Zhang, and Qifeng Chen. Cmdfusion: Bidirectional fusion network with cross-modality knowledge distillation for lidar semantic segmentation.IEEE Robotics and Automation Letters, 9(1):771–778, 2023. 5, 3
work page 2023
-
[9]
Svqnet: Sparse voxel-adjacent query network for 4d spatio-temporal lidar semantic segmen- tation
Xuechao Chen, Shuangjie Xu, Xiaoyi Zou, Tongyi Cao, Dit- Yan Yeung, and Lu Fang. Svqnet: Sparse voxel-adjacent query network for 4d spatio-temporal lidar semantic segmen- tation. InProceedings of the IEEE/CVF International Con- ference on Computer Vision, pages 8569–8578, 2023. 3
work page 2023
-
[10]
4d spatio-temporal convnets: Minkowski convolutional neural networks
Christopher Choy, JunYoung Gwak, and Silvio Savarese. 4d spatio-temporal convnets: Minkowski convolutional neural networks. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 3075–3084,
-
[11]
Salsanext: Fast, uncertainty-aware semantic segmentation of lidar point clouds
Tiago Cortinhal, George Tzelepis, and Eren Erdal Aksoy. Salsanext: Fast, uncertainty-aware semantic segmentation of lidar point clouds. InInternational Symposium on Visual Computing, pages 207–222. Springer, 2020. 5, 7
work page 2020
-
[12]
Jiajun Deng, Wengang Zhou, Yanyong Zhang, and Houqiang Li. From multi-view to hollow-3d: Hallucinated hollow-3d r-cnn for 3d object detection.IEEE Transactions on Circuits and Systems for Video Technology, 31(12):4722–4734, 2021. 2
work page 2021
-
[13]
Learning 3d semantic segmentation with only 2d image supervision
Kyle Genova, Xiaoqi Yin, Abhijit Kundu, Caroline Panto- faru, Forrester Cole, Avneesh Sud, Brian Brewington, Brian Shucker, and Thomas Funkhouser. Learning 3d semantic segmentation with only 2d image supervision. In2021 In- ternational Conference on 3D Vision (3DV), pages 361–372. IEEE, 2021. 5, 3
work page 2021
-
[14]
3d semantic segmentation with submani- fold sparse convolutional networks
Benjamin Graham, Martin Engelcke, and Laurens Van Der Maaten. 3d semantic segmentation with submani- fold sparse convolutional networks. InProceedings of the IEEE conference on computer vision and pattern recogni- tion, pages 9224–9232, 2018. 2
work page 2018
-
[15]
Yulan Guo, Hanyun Wang, Qingyong Hu, Hao Liu, Li Liu, and Mohammed Bennamoun. Deep learning for 3d point clouds: A survey.IEEE transactions on pattern analysis and machine intelligence, 43(12):4338–4364, 2020. 1
work page 2020
-
[16]
Mvtn: Multi-view transformation network for 3d shape recognition supplementary material
Abdullah Hamdi, Silvio Giancola, and Bernard Ghanem. Mvtn: Multi-view transformation network for 3d shape recognition supplementary material. 2
-
[17]
3d-sis: 3d se- mantic instance segmentation of rgb-d scans
Ji Hou, Angela Dai, and Matthias Nießner. 3d-sis: 3d se- mantic instance segmentation of rgb-d scans. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 4421–4430, 2019. 2
work page 2019
-
[18]
Randla-net: Efficient semantic segmentation of large-scale point clouds
Qingyong Hu, Bo Yang, Linhai Xie, Stefano Rosa, Yulan Guo, Zhihua Wang, Niki Trigoni, and Andrew Markham. Randla-net: Efficient semantic segmentation of large-scale point clouds. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 11108– 11117, 2020. 7
work page 2020
-
[19]
xmuda: Cross-modal unsuper- vised domain adaptation for 3d semantic segmentation
Maximilian Jaritz, Tuan-Hung Vu, Raoul de Charette, Emi- lie Wirbel, and Patrick P´erez. xmuda: Cross-modal unsuper- vised domain adaptation for 3d semantic segmentation. In Proceedings of the IEEE/CVF conference on computer vi- sion and pattern recognition, pages 12605–12614, 2020. 3
work page 2020
-
[20]
Maximilian Jaritz, Tuan-Hung Vu, Raoul De Charette, ´Emilie Wirbel, and Patrick P ´erez. Cross-modal learning for domain adaptation in 3d semantic segmentation.IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(2):1533–1544, 2022. 2, 3, 8
work page 2022
-
[21]
Pointgroup: Dual-set point grouping for 3d instance segmentation
Li Jiang, Hengshuang Zhao, Shaoshuai Shi, Shu Liu, Chi- Wing Fu, and Jiaya Jia. Pointgroup: Dual-set point grouping for 3d instance segmentation. InProceedings of the IEEE/CVF conference on computer vision and Pattern recognition, pages 4867–4876, 2020. 2
work page 2020
-
[22]
Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer White- head, Alexander C Berg, Wan-Yen Lo, et al. Segment any- thing. InProceedings of the IEEE/CVF international confer- ence on computer vision, pages 4015–4026, 2023. 3
work page 2023
-
[23]
Oneformer3d: One transformer for unified point cloud segmentation
Maxim Kolodiazhnyi, Anna V orontsova, Anton Konushin, and Danila Rukhovich. Oneformer3d: One transformer for unified point cloud segmentation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 20943–20953, 2024. 1
work page 2024
-
[24]
Stratified trans- former for 3d point cloud segmentation
Xin Lai, Jianhui Liu, Li Jiang, Liwei Wang, Hengshuang Zhao, Shu Liu, Xiaojuan Qi, and Jiaya Jia. Stratified trans- former for 3d point cloud segmentation. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 8500–8509, 2022. 2
work page 2022
-
[25]
Spherical transformer for lidar-based 3d recognition
Xin Lai, Yukang Chen, Fanbin Lu, Jianhui Liu, and Jiaya Jia. Spherical transformer for lidar-based 3d recognition. In Proceedings of the IEEE/CVF conference on computer vi- sion and pattern recognition, pages 17545–17555, 2023. 5, 3
work page 2023
-
[26]
Pointpillars: Fast encoders for object detection from point clouds
Alex H Lang, Sourabh V ora, Holger Caesar, Lubing Zhou, Jiong Yang, and Oscar Beijbom. Pointpillars: Fast encoders for object detection from point clouds. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 12697–12705, 2019. 2
work page 2019
-
[27]
Effective sam combination for open-vocabulary semantic segmenta- tion
Minhyeok Lee, Suhwan Cho, Jungho Lee, Sunghun Yang, Heeseung Choi, Ig-Jae Kim, and Sangyoun Lee. Effective sam combination for open-vocabulary semantic segmenta- tion. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 26081–26090, 2025. 3
work page 2025
-
[28]
Memoryseg: Online lidar semantic segmentation with a latent memory
Enxu Li, Sergio Casas, and Raquel Urtasun. Memoryseg: Online lidar semantic segmentation with a latent memory. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 745–754, 2023. 5
work page 2023
-
[29]
Mseg3d: Multi-modal 3d semantic segmentation for autonomous driv- ing
Jiale Li, Hang Dai, Hao Han, and Yong Ding. Mseg3d: Multi-modal 3d semantic segmentation for autonomous driv- ing. InProceedings of the IEEE/CVF conference on com- puter vision and pattern recognition, pages 21694–21704,
-
[30]
Miaoyu Li, Yachao Zhang, Yuan Xie, Zuodong Gao, Cui- hua Li, Zhizhong Zhang, and Yanyun Qu. Cross-domain and cross-modal knowledge distillation in domain adaptation for 3d semantic segmentation. InProceedings of the 30th ACM International Conference on Multimedia, pages 3829–3837,
-
[31]
Bidirectional learning for domain adaptation of semantic segmentation
Yunsheng Li, Lu Yuan, and Nuno Vasconcelos. Bidirectional learning for domain adaptation of semantic segmentation. In Proceedings of the IEEE/CVF conference on computer vi- sion and pattern recognition, pages 6936–6945, 2019. 2, 8
work page 2019
-
[32]
Deep continuous fusion for multi-sensor 3d object detection
Ming Liang, Bin Yang, Shenlong Wang, and Raquel Urtasun. Deep continuous fusion for multi-sensor 3d object detection. InProceedings of the European conference on computer vi- sion (ECCV), pages 641–656, 2018. 2
work page 2018
-
[33]
Zhengyin Liang, Hui Yin, Min Liang, Qianqian Du, Ying Yang, and Hua Huang. Unidxmd: Towards unified represen- tation for cross-modal unsupervised domain adaptation in 3d semantic segmentation. InProceedings of the IEEE/CVF In- ternational Conference on Computer Vision, pages 20346– 20356, 2025. 3, 8
work page 2025
-
[34]
Wei Liu, Zhiming Luo, Yuanzheng Cai, Ying Yu, Yang Ke, Jos´e Marcato Junior, Wesley Nunes Gonc ¸alves, and Jonathan Li. Adversarial unsupervised domain adaptation for 3d semantic segmentation with multi-modal learning.ISPRS Journal of Photogrammetry and Remote Sensing, 176:211– 221, 2021. 3, 8
work page 2021
-
[35]
Uniseg: A unified multi-modal li- dar segmentation network and the openpcseg codebase
Youquan Liu, Runnan Chen, Xin Li, Lingdong Kong, Yuchen Yang, Zhaoyang Xia, Yeqi Bai, Xinge Zhu, Yuexin Ma, Yikang Li, et al. Uniseg: A unified multi-modal li- dar segmentation network and the openpcseg codebase. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 21662–21673, 2023. 2, 7, 8
work page 2023
-
[36]
Yang Luo, Ting Han, Yujun Liu, Jinhe Su, Yiping Chen, Jinyuan Li, Yundong Wu, and Guorong Cai. Csfnet: Cross- modal semantic focus network for sematic segmentation of large-scale point clouds.IEEE Transactions on Geoscience and Remote Sensing, 2025. 2, 5, 6, 7, 8
work page 2025
-
[37]
Yang Luo, Ting Han, Xiaorong Zhang, Yujun Liu, Duxin Zhu, Jinyuan Li, Yiping Chen, Yundong Wu, Guorong Cai, Yingchao Piao, et al. Paseg: positional-guided segmenter with multimodal semantic alignment for enhancing urban scene 3d semantic segmentation.International Journal of Digital Earth, 18(1):2528811, 2025. 2
work page 2025
-
[38]
V oxnet: A 3d con- volutional neural network for real-time object recognition
Daniel Maturana and Sebastian Scherer. V oxnet: A 3d con- volutional neural network for real-time object recognition. In2015 IEEE/RSJ international conference on intelligent robots and systems (IROS), pages 922–928. IEEE, 2015. 2
work page 2015
-
[39]
Rangenet++: Fast and accurate lidar semantic segmentation
Andres Milioto, Ignacio Vizzo, Jens Behley, and Cyrill Stachniss. Rangenet++: Fast and accurate lidar semantic segmentation. In2019 IEEE/RSJ international conference on intelligent robots and systems (IROS), pages 4213–4220. IEEE, 2019. 5, 7
work page 2019
-
[40]
Pietro Morerio, Jacopo Cavazza, and Vittorio Murino. Minimal-entropy correlation alignment for unsupervised deep domain adaptation.arXiv preprint arXiv:1711.10288,
-
[41]
Duo Peng, Yinjie Lei, Wen Li, Pingping Zhang, and Yulan Guo. Sparse-to-dense feature matching: Intra and inter do- main cross-modal learning in domain adaptation for 3d se- mantic segmentation. InProceedings of the IEEE/CVF Inter- national Conference on Computer Vision, pages 7108–7117,
-
[42]
Pointnet: Deep learning on point sets for 3d classification and segmentation
Charles R Qi, Hao Su, Kaichun Mo, and Leonidas J Guibas. Pointnet: Deep learning on point sets for 3d classification and segmentation. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 652–660,
-
[43]
Charles Ruizhongtai Qi, Li Yi, Hao Su, and Leonidas J Guibas. Pointnet++: Deep hierarchical feature learning on point sets in a metric space.Advances in neural information processing systems, 30, 2017. 2
work page 2017
-
[44]
Wentao Qu, Jing Wang, YongShun Gong, Xiaoshui Huang, and Liang Xiao. An end-to-end robust point cloud semantic segmentation network with single-step conditional diffusion models. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 27325–27335, 2025. 1
work page 2025
-
[45]
Deep sliding shapes for amodal 3d object detection in rgb-d images
Shuran Song and Jianxiong Xiao. Deep sliding shapes for amodal 3d object detection in rgb-d images. InProceed- ings of the IEEE conference on computer vision and pattern recognition, pages 808–816, 2016. 2
work page 2016
-
[46]
Multi-view convolutional neural networks for 3d shape recognition
Hang Su, Subhransu Maji, Evangelos Kalogerakis, and Erik Learned-Miller. Multi-view convolutional neural networks for 3d shape recognition. InProceedings of the IEEE in- ternational conference on computer vision, pages 945–953,
-
[47]
Tianfang Sun, Zhizhong Zhang, Xin Tan, Yong Peng, Yanyun Qu, and Yuan Xie. Uni-to-multi modal knowledge distillation for bidirectional lidar-camera semantic segmen- tation.IEEE Transactions on Pattern Analysis and Machine Intelligence, 46(12):11059–11072, 2024. 5, 7, 3
work page 2024
-
[48]
Searching efficient 3d architec- tures with sparse point-voxel convolution
Haotian Tang, Zhijian Liu, Shengyu Zhao, Yujun Lin, Ji Lin, Hanrui Wang, and Song Han. Searching efficient 3d architec- tures with sparse point-voxel convolution. InEuropean con- ference on computer vision, pages 685–702. Springer, 2020. 6, 7, 8, 3
work page 2020
-
[49]
Kpconv: Flexible and deformable convolution for point clouds
Hugues Thomas, Charles R Qi, Jean-Emmanuel Deschaud, Beatriz Marcotegui, Franc ¸ois Goulette, and Leonidas J Guibas. Kpconv: Flexible and deformable convolution for point clouds. InProceedings of the IEEE/CVF international conference on computer vision, pages 6411–6420, 2019. 2
work page 2019
-
[50]
Advent: Adversarial entropy min- imization for domain adaptation in semantic segmentation
Tuan-Hung Vu, Himalaya Jain, Maxime Bucher, Matthieu Cord, and Patrick P ´erez. Advent: Adversarial entropy min- imization for domain adaptation in semantic segmentation. InProceedings of the IEEE/CVF conference on computer vi- sion and pattern recognition, pages 2517–2526, 2019. 2, 8
work page 2019
-
[51]
Bichen Wu, Xuanyu Zhou, Sicheng Zhao, Xiangyu Yue, and Kurt Keutzer. Squeezesegv2: Improved model structure and unsupervised domain adaptation for road-object segmenta- tion from a lidar point cloud. In2019 international confer- ence on robotics and automation (ICRA), pages 4376–4382. IEEE, 2019. 7
work page 2019
-
[52]
Every sam drop counts: Embracing semantic priors for multi-modality image fusion and beyond
Guanyao Wu, Haoyu Liu, Hongming Fu, Yichuan Peng, Jinyuan Liu, Xin Fan, and Risheng Liu. Every sam drop counts: Embracing semantic priors for multi-modality image fusion and beyond. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 17882–17891,
-
[53]
Xiaoyang Wu, Yixing Lao, Li Jiang, Xihui Liu, and Heng- shuang Zhao. Point transformer v2: Grouped vector atten- tion and partition-based pooling.Advances in Neural Infor- mation Processing Systems, 35:33330–33342, 2022. 2, 6
work page 2022
-
[54]
Taseg: Temporal aggregation network for li- dar semantic segmentation
Xiaopei Wu, Yuenan Hou, Xiaoshui Huang, Binbin Lin, Tong He, Xinge Zhu, Yuexin Ma, Boxi Wu, Haifeng Liu, Deng Cai, et al. Taseg: Temporal aggregation network for li- dar semantic segmentation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15311–15320, 2024. 5, 3
work page 2024
-
[55]
Point transformer v3: Simpler faster stronger
Xiaoyang Wu, Li Jiang, Peng-Shuai Wang, Zhijian Liu, Xi- hui Liu, Yu Qiao, Wanli Ouyang, Tong He, and Hengshuang Zhao. Point transformer v3: Simpler faster stronger. InPro- ceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 4840–4851, 2024. 2, 5, 7
work page 2024
-
[56]
Yao Wu, Mingwei Xing, Yachao Zhang, Yuan Xie, Jianping Fan, Zhongchao Shi, and Yanyun Qu. Cross-modal unsuper- vised domain adaptation for 3d semantic segmentation via bidirectional fusion-then-distillation. InProceedings of the 31st ACM International Conference on Multimedia, pages 490–498, 2023. 3, 8
work page 2023
-
[57]
Unidseg: Unified cross-domain 3d semantic segmentation via visual foundation models prior
Yao Wu, Mingwei Xing, Yachao Zhang, Xiaotong Luo, Yuan Xie, and Yanyun Qu. Unidseg: Unified cross-domain 3d semantic segmentation via visual foundation models prior. Advances in Neural Information Processing Systems, 37: 101223–101249, 2024. 2, 3, 8
work page 2024
-
[58]
Yao Wu, Mingwei Xing, Yachao Zhang, Yuan Xie, Kaibei Peng, and Yanyun Qu. Fusion-then-distillation: Toward cross-modal positive distillation for domain adaptive 3d se- mantic segmentation.IEEE Transactions on Circuits and Systems for Video Technology, 2025. 3, 8
work page 2025
-
[59]
Squeeze- segv3: Spatially-adaptive convolution for efficient point- cloud segmentation
Chenfeng Xu, Bichen Wu, Zining Wang, Wei Zhan, Peter Vajda, Kurt Keutzer, and Masayoshi Tomizuka. Squeeze- segv3: Spatially-adaptive convolution for efficient point- cloud segmentation. InEuropean Conference on Computer Vision, pages 1–19. Springer, 2020. 7
work page 2020
-
[60]
Rpvnet: A deep and efficient range-point- voxel fusion network for lidar point cloud segmentation
Jianyun Xu, Ruixiang Zhang, Jian Dou, Yushi Zhu, Jie Sun, and Shiliang Pu. Rpvnet: A deep and efficient range-point- voxel fusion network for lidar point cloud segmentation. In Proceedings of the IEEE/CVF international conference on computer vision, pages 16024–16033, 2021. 2
work page 2021
-
[61]
2dpass: 2d priors assisted semantic segmentation on lidar point clouds
Xu Yan, Jiantao Gao, Chaoda Zheng, Chao Zheng, Ruimao Zhang, Shuguang Cui, and Zhen Li. 2dpass: 2d priors assisted semantic segmentation on lidar point clouds. In European conference on computer vision, pages 677–695. Springer, 2022. 2, 5, 6, 7, 3
work page 2022
-
[62] Yan Yan, Yuxing Mao, and Bo Li. Second: Sparsely embedded convolutional detection. Sensors, 18(10):3337, 2018.
[63] Bin Yang, Wenjie Luo, and Raquel Urtasun. Pixor: Real-time 3d object detection from point clouds. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7652–7660, 2018.
[64] Yu-Qi Yang, Yu-Xiao Guo, Jian-Yu Xiong, Yang Liu, Hao Pan, Peng-Shuai Wang, Xin Tong, and Baining Guo. Swin3d: A pretrained transformer backbone for 3d indoor scene understanding. arXiv preprint arXiv:2304.06906, 2023.
[65] Yu-Qi Yang, Yu-Xiao Guo, and Yang Liu. Swin3d++: Effective multi-source pretraining for 3d indoor scene understanding. arXiv preprint arXiv:2402.14215, 2024.
[66] Dongqiangzi Ye, Zixiang Zhou, Weijia Chen, Yufei Xie, Yu Wang, Panqu Wang, and Hassan Foroosh. Lidarmultinet: Towards a unified multi-task network for lidar perception. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 3231–3240, 2023.
[67] Lixin Zhan, Wei Li, and Weidong Min. Fa-resnet: Feature affine residual network for large-scale point cloud segmentation. International Journal of Applied Earth Observation and Geoinformation, 118:103259, 2023.
[68] Boxiang Zhang, Zunran Wang, Yonggen Ling, Yuanyuan Guan, Shenghao Zhang, and Wenhui Li. Mx2m: Masked cross-modality modeling in domain adaptation for 3d semantic segmentation. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 3401–3409, 2023.
[69] Shuai Zhang, Yiping Chen, Biao Wang, Dong Pan, Wuming Zhang, and Aiguang Li. Sptnet: Sparse convolution and transformer network for woody and foliage components separation from point clouds. IEEE Transactions on Geoscience and Remote Sensing, 62:1–18, 2024.
[70] Shuai Zhang, Biao Wang, Yiping Chen, Shuhang Zhang, and Wuming Zhang. Point and voxel cross perception with lightweight cosformer for large-scale point cloud semantic segmentation. International Journal of Applied Earth Observation and Geoinformation, 131:103951, 2024.
[71] Yang Zhang, Zixiang Zhou, Philip David, Xiangyu Yue, Zerong Xi, Boqing Gong, and Hassan Foroosh. Polarnet: An improved grid representation for online lidar point clouds semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9601–9610, 2020.
[72] Yachao Zhang, Miaoyu Li, Yuan Xie, Cuihua Li, Cong Wang, Zhizhong Zhang, and Yanyun Qu. Self-supervised exclusive learning for 3d segmentation with cross-modal unsupervised domain adaptation. In Proceedings of the 30th ACM International Conference on Multimedia, pages 3338–3346, 2022.
[73] Hengshuang Zhao, Li Jiang, Jiaya Jia, Philip HS Torr, and Vladlen Koltun. Point transformer. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 16259–16268, 2021.
[74] Weiguang Zhao, Rui Zhang, Qiufeng Wang, Guangliang Cheng, and Kaizhu Huang. Bfanet: Revisiting 3d semantic segmentation with boundary feature analysis. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 29395–29405, 2025.
[75] Yu Zheng, Guangming Wang, Jiuming Liu, Marc Pollefeys, and Hesheng Wang. Spherical frustum sparse convolution network for lidar point cloud semantic segmentation. Advances in Neural Information Processing Systems, 37:121827–121858, 2024.
[76] Hui Zhou, Xinge Zhu, Xiao Song, Yuexin Ma, Zhe Wang, Hongsheng Li, and Dahua Lin. Cylinder3d: An effective 3d framework for driving-scene lidar semantic segmentation. arXiv preprint arXiv:2008.01550, 2020.
[77] Zixiang Zhou, Yang Zhang, and Hassan Foroosh. Panoptic-polarnet: Proposal-free lidar point cloud panoptic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13194–13203, 2021.
[78] Zixiang Zhou, Dongqiangzi Ye, Weijia Chen, Yufei Xie, Yu Wang, Panqu Wang, and Hassan Foroosh. Lidarformer: A unified transformer-based multi-task network for lidar perception. In 2024 IEEE International Conference on Robotics and Automation (ICRA), pages 14740–14747. IEEE, 2024.
[79] Runsong Zhu, Shi Qiu, Zhengzhe Liu, Ka-Hei Hui, Qianyi Wu, Pheng-Ann Heng, and Chi-Wing Fu. Rethinking end-to-end 2d to 3d scene segmentation in gaussian splatting. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3656–3665, 2025.
[80] Xinge Zhu, Hui Zhou, Tai Wang, Fangzhou Hong, Yuexin Ma, Wei Li, Hongsheng Li, and Dahua Lin. Cylindrical and asymmetrical 3d convolution networks for lidar segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9939–9948, 2021.