DoReMi: Bridging 3D Domains via Topology-Aware Domain-Representation Mixture of Experts
Pith reviewed 2026-05-17 22:09 UTC · model grok-4.3
The pith
DoReMi combines self-supervised structural pre-training with dynamic topology-guided expert routing to build a single model that handles 3D data from different sensors.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Through the synergy of these dual branches, DoReMi achieves a deep integration of universal feature extraction and highly adaptive expert allocation. It achieves 80.1% mIoU on the ScanNet validation set and 77.2% mIoU on S3DIS, comprehensively outperforming existing state-of-the-art methods.
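For readers unfamiliar with the headline metric, the following is a minimal sketch of mean intersection-over-union (mIoU) on flat label lists. The function name `mean_iou` and the toy labels are illustrative assumptions. Benchmark protocols such as ScanNet's usually accumulate counts over the whole split and may ignore an unlabeled class, which this sketch does not model.

```python
def mean_iou(pred, target, num_classes):
    """Mean IoU over classes that occur in the prediction or the ground truth."""
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:  # classes absent from both lists are skipped
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0


# Toy labels for six points and three classes (not data from the paper).
pred = [0, 0, 1, 1, 2, 2]
target = [0, 1, 1, 1, 2, 0]
score = mean_iou(pred, target, num_classes=3)  # per-class IoU: 1/3, 2/3, 1/2
```

Because the metric averages per-class ratios, a rare class counts as much as a frequent one, which is why small absolute differences in mIoU can hide large per-class swings.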
What carries the argument
Dual-branch architecture: a self-supervised multi-attribute pre-training branch that anchors topological and texture priors, paired with a domain-aware expert branch using Domain Spatial-Guided Routing to sense local topology and Entropy-controlled Dynamic Allocation to adjust active experts by routing uncertainty.
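The idea of routing on local topology rather than semantics alone can be illustrated with a small sketch. This is not the paper's DSR: the spatial cue used here (log of the mean distance to the k nearest neighbours), the linear router, and every name and weight are assumptions chosen only to show how a spatial term can change expert choice when semantic features are identical.

```python
import math


def knn_mean_distance(points, k):
    """For each 3D point, mean Euclidean distance to its k nearest neighbours."""
    out = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        out.append(sum(dists[:k]) / k)
    return out


def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    total = sum(es)
    return [e / total for e in es]


def spatial_router(points, feats, w_feat, w_spatial, k=2, eps=1e-9):
    """Per-point routing probabilities from features plus a log-spacing cue.

    w_feat holds one weight vector per expert, w_spatial one scalar per expert.
    """
    probs = []
    for f, s in zip(feats, knn_mean_distance(points, k)):
        cue = math.log(s + eps)
        logits = [sum(wi * fi for wi, fi in zip(w, f)) + ws * cue
                  for w, ws in zip(w_feat, w_spatial)]
        probs.append(softmax(logits))
    return probs


# A dense cluster and a sparse cluster that share the same "semantic" feature.
points = [(0, 0, 0), (0.1, 0, 0), (0, 0.1, 0), (10, 0, 0), (15, 0, 0), (10, 5, 0)]
feats = [[1.0, 0.0]] * 6
w_feat = [[0.0, 0.0], [0.0, 0.0]]  # semantic term switched off for clarity
w_spatial = [-1.0, 1.0]            # expert 0 prefers dense, expert 1 sparse
probs = spatial_router(points, feats, w_feat, w_spatial)
choices = [max(range(2), key=lambda e: p[e]) for p in probs]
```

With the semantic weights zeroed, the two clusters land on different experts purely because their point spacing differs, which is the behaviour a semantics-only router cannot produce.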
If this is right
- A single set of weights can now process both indoor and outdoor 3D scenes without separate domain-specific fine-tuning.
- Expert routing decisions become sensitive to local spatial structure rather than semantics alone, reducing misallocation on heterogeneous inputs.
- Dynamic control of expert activation via routing entropy keeps training stable while still allowing adaptation to varying topology.
- The same dual-branch pattern yields measurable lifts on semantic segmentation across multiple public 3D benchmarks.
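The entropy-controlled allocation mentioned above can be sketched as follows. The page does not give the paper's functional form, so the linear map from normalised entropy to the number of active experts, the bounds `k_min` and `k_max`, and all function names are assumptions for illustration only.

```python
import math


def routing_entropy(probs):
    """Shannon entropy (in nats) of one routing distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def num_active_experts(probs, k_min=1, k_max=4):
    """Assumed rule: map normalised entropy in [0, 1] linearly onto [k_min, k_max].

    Requires at least two experts and k_max <= len(probs).
    """
    h_norm = routing_entropy(probs) / math.log(len(probs))
    k = k_min + round(h_norm * (k_max - k_min))
    return max(k_min, min(k_max, k))


def select_experts(probs, k_min=1, k_max=4):
    """Keep the top-k experts and renormalise their weights to sum to one."""
    k = num_active_experts(probs, k_min, k_max)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    total = sum(probs[i] for i in top)
    return {i: probs[i] / total for i in top}


confident = [0.97, 0.01, 0.01, 0.01]  # low entropy: one expert is enough
uncertain = [0.25, 0.25, 0.25, 0.25]  # maximal entropy: all experts fire
few = select_experts(confident)
many = select_experts(uncertain)
```

Under this assumed rule a confident router spends one expert per token while an uncertain one spends up to `k_max`, so the bounds act as the compute budget.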
Where Pith is reading between the lines
- The same structural-prior anchoring could be tested on 3D tasks beyond segmentation, such as detection or reconstruction, where topology also varies by sensor.
- MoE designs in other fields that face heterogeneous inputs might reduce routing bias by adding an early self-supervised branch focused on domain-specific structure.
- If the entropy-controlled allocation proves robust, it offers a practical knob for trading compute against accuracy on edge devices that ingest mixed 3D streams.
Load-bearing premise
That self-supervised training on topological and texture variations can anchor cross-domain structural priors strongly enough to override semantics-driven routing bias in existing 3D MoE networks.
What would settle it
Train the model on a fresh collection of 3D scenes that keep semantic labels consistent but introduce new topological shifts from unseen sensor setups, then measure whether ablating the pre-training branch or the spatial-routing and entropy mechanisms eliminates the reported gains over standard MoE baselines.
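One way to organise the ablation described above is a full on/off grid over the three components, followed by an average of the score change attributable to each one. The switch names, the `score` callable, and the toy scoring function below are hypothetical. In practice `score` would train a model with the given configuration and evaluate it on the topologically shifted test set.

```python
from itertools import product

COMPONENTS = ("pretrain", "dsr", "eda")  # hypothetical switch names


def ablation_grid():
    """All on/off combinations of the components, plain baseline first."""
    return [dict(zip(COMPONENTS, flags))
            for flags in product((False, True), repeat=len(COMPONENTS))]


def marginal_effect(score, component):
    """Mean change in score when `component` is switched on,
    averaged over every setting of the other components."""
    deltas = []
    for cfg in ablation_grid():
        if not cfg[component]:
            enabled = {**cfg, component: True}
            deltas.append(score(enabled) - score(cfg))
    return sum(deltas) / len(deltas)


def toy_score(cfg):
    """Additive stand-in for a real train-and-evaluate run (not paper data)."""
    return 2.0 * cfg["pretrain"] + 1.0 * cfg["dsr"]


grid = ablation_grid()
effect_pretrain = marginal_effect(toy_score, "pretrain")
effect_eda = marginal_effect(toy_score, "eda")
```

If the reported gains depend on the proposed mechanisms, the marginal effects of the three switches should stay clearly positive on the shifted data, whereas near-zero effects would point to generic MoE capacity as the source.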
Original abstract
Constructing a unified 3D scene understanding model has long been hindered by the significant topological discrepancies across different sensor modalities. While applying the Mixture-of-Experts (MoE) architecture is an effective approach to achieving universal understanding, we observe that existing 3D MoE networks often suffer from semantics-driven routing bias. This makes it challenging to address cross-domain data characterized by "semantic consistency yet topological heterogeneity." To overcome this challenge, we propose DoReMi (Topology-Aware Domain-Representation Mixture of Experts). Specifically, we introduce a self-supervised pre-training branch based on multi attributes, such as topological and texture variations, to anchor cross-domain structural priors. Building upon this, we design a domain-aware expert branch comprising two core mechanisms: Domain Spatial-Guided Routing (DSR), which achieves an acute perception of local topological variations by extracting spatial contexts, and Entropy-controlled Dynamic Allocation (EDA), which dynamically adjusts the number of activated experts by quantifying routing uncertainty to ensure training stability. Through the synergy of these dual branches, DoReMi achieves a deep integration of universal feature extraction and highly adaptive expert allocation. Extensive experiments across various tasks, encompassing both indoor and outdoor scenes, validate the superiority of DoReMi. It achieves 80.1% mIoU on the ScanNet validation set and 77.2% mIoU on S3DIS, comprehensively outperforming existing state-of-the-art methods. The code will be released soon.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces DoReMi, a topology-aware Mixture-of-Experts architecture for unified 3D scene understanding across sensor modalities whose data are semantically consistent but topologically heterogeneous. It proposes a dual-branch design: a self-supervised multi-attribute pre-training branch that anchors cross-domain structural priors, and a domain-aware expert branch that combines Domain Spatial-Guided Routing (DSR), which perceives local topology through spatial contexts, with Entropy-controlled Dynamic Allocation (EDA), which activates experts according to routing uncertainty. The authors report state-of-the-art results of 80.1% mIoU on the ScanNet validation set and 77.2% mIoU on S3DIS, with code release planned.
Significance. If the reported gains hold under rigorous validation, the work offers a practical advance toward universal 3D models by explicitly targeting semantics-driven routing bias in MoE networks. The integration of self-supervised structural priors with adaptive routing mechanisms is a coherent response to cross-domain topological discrepancies, and the explicit plan to release code is a clear strength for reproducibility and follow-on research.
major comments (2)
- [§4.1 and Table 2] The central claim of comprehensive outperformance relies on the dual-branch synergy, yet the ablation study does not isolate the contribution of the self-supervised pre-training branch versus the DSR/EDA components with statistical significance testing or multiple random seeds; this weakens the attribution of the 80.1% and 77.2% mIoU figures to the proposed mechanisms rather than standard MoE scaling.
- [§3.2, description of EDA] The entropy-controlled dynamic allocation is presented as ensuring training stability by quantifying routing uncertainty, but the manuscript does not specify the exact threshold or functional form used to adjust the number of activated experts (e.g., whether it is a fixed entropy cutoff or a learned schedule), leaving the mechanism under-specified for reproduction.
minor comments (2)
- [Figure 3] The visualization of expert activation patterns would be clearer with an additional panel showing routing entropy values across domains to directly illustrate the EDA mechanism.
- [§2] The motivation section would benefit from a brief quantitative comparison (e.g., routing bias statistics) on a small cross-domain subset to ground the claim that existing 3D MoE networks suffer primarily from semantics-driven bias.
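The "routing bias statistics" requested in the §2 comment are not defined on this page. One candidate, offered here purely as an assumption, is the empirical mutual information between each point's expert assignment and either its semantic label or its source domain. The labels below are toy values, not measurements.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Empirical mutual information (in nats) between two label sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum((c / n) * math.log(c * n / (px[x] * py[y]))
               for (x, y), c in pxy.items())


# Toy assignments for four points (illustrative only).
experts = [0, 0, 1, 1]
semantic = ["wall", "floor", "wall", "floor"]
domain = ["indoor", "indoor", "outdoor", "outdoor"]

mi_semantic = mutual_information(experts, semantic)
mi_domain = mutual_information(experts, domain)
```

In this toy case the router tracks the domain and ignores semantics. A semantics-driven bias of the kind the paper describes would show the opposite pattern: high mutual information with the semantic labels and little with the domain.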
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and the recommendation for minor revision. We address each major comment point by point below.
- Referee: [§4.1 and Table 2] The central claim of comprehensive outperformance relies on the dual-branch synergy, yet the ablation study does not isolate the contribution of the self-supervised pre-training branch versus the DSR/EDA components with statistical significance testing or multiple random seeds; this weakens the attribution of the 80.1% and 77.2% mIoU figures to the proposed mechanisms rather than standard MoE scaling.
  Authors: We thank the referee for this observation. Our ablation studies in §4.1 and Table 2 progressively incorporate the self-supervised pre-training branch followed by the DSR and EDA components to demonstrate their individual and combined contributions. However, we agree that the absence of multiple random seeds and statistical significance testing limits the strength of attribution. In the revised manuscript we will report mean and standard deviation over at least three independent runs with different seeds and include paired statistical tests (e.g., t-tests) against the baseline MoE variants to better isolate the effect of the proposed mechanisms.
  Revision: yes
- Referee: [§3.2, description of EDA] The entropy-controlled dynamic allocation is presented as ensuring training stability by quantifying routing uncertainty, but the manuscript does not specify the exact threshold or functional form used to adjust the number of activated experts (e.g., whether it is a fixed entropy cutoff or a learned schedule), leaving the mechanism under-specified for reproduction.
  Authors: We agree that the current description of EDA in §3.2 lacks the precise implementation details required for reproduction. We will revise this section to explicitly state the functional form, including the entropy threshold, the rule for increasing the number of activated experts, and whether the schedule is fixed or adaptive, together with the corresponding pseudocode or equations.
  Revision: yes
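The multi-seed reporting promised in the first response could be sketched as below: mean and standard deviation per variant, plus a paired t statistic on seed-matched runs. The function names are assumptions and the input numbers are arbitrary placeholders, not mIoU values from the paper.

```python
import math
from statistics import mean, stdev


def summarize(scores):
    """Mean and sample standard deviation of per-seed scores."""
    return mean(scores), stdev(scores)


def paired_t(variant, baseline):
    """Paired t statistic and degrees of freedom for seed-matched runs.

    Assumes at least two seeds and non-identical differences
    (zero variance would divide by zero).
    """
    diffs = [v - b for v, b in zip(variant, baseline)]
    n = len(diffs)
    t = mean(diffs) / (stdev(diffs) / math.sqrt(n))
    return t, n - 1


# Arbitrary placeholder scores for three seeds (not results from the paper).
baseline = [1.0, 2.0, 3.0]
variant = [2.0, 3.5, 4.0]

variant_mean, variant_sd = summarize(variant)
t_stat, dof = paired_t(variant, baseline)
```

With only three seeds the test has two degrees of freedom, so it can detect only large and consistent differences; more seeds would be needed to resolve small gaps.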
Circularity Check
No significant circularity; novel components on standard MoE base
The derivation introduces a self-supervised multi-attribute pre-training branch to anchor structural priors and a domain-aware expert branch with DSR (spatial-guided routing) and EDA (entropy-controlled allocation). These are presented as new mechanisms addressing the semantics-driven routing bias observed in existing 3D MoE networks. The performance numbers (80.1% mIoU on ScanNet, 77.2% mIoU on S3DIS) are empirical outcomes, not reductions by construction. No equations or self-citations in the provided text reduce the central claims to fitted inputs, self-definitions, or prior uniqueness theorems by the same authors. There is at most a minor self-citation risk, and the central argument remains independent.
Axiom & Free-Parameter Ledger
invented entities (2)
- Domain Spatial-Guided Routing (DSR): no independent evidence
- Entropy-controlled Dynamic Allocation (EDA): no independent evidence
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/AlexanderDuality.lean · alexander_duality_circle_linking · tag: echoes
  ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.
  Passage: "we introduce a self-supervised pre-training branch based on multi attributes, such as topological and texture variations, to anchor cross-domain structural priors... Domain Spatial-Guided Routing (DSR), which achieves an acute perception of local topological variations by extracting spatial contexts"
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · tag: unclear
  Relation between the paper passage and the cited Recognition theorem.
  Passage: "Entropy-controlled Dynamic Allocation (EDA), which dynamically adjusts the number of activated experts by quantifying routing uncertainty"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 1 Pith paper
- Data-Efficient Semantic Segmentation of 3D Point Clouds via Open-Vocabulary Image Segmentation-based Pseudo-Labeling
  PLOVIS generates pseudo-labels for 3D point clouds by rendering them into 2D images and applying open-vocabulary segmentation, then filters the labels and uses a class-balanced memory bank to train effective models wi...